* Considerations on sched APIs under RT patch
@ 2010-04-19 20:48 Primiano Tucci
  2010-04-20  9:20 ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-19 20:48 UTC (permalink / raw)
To: linux-kernel

Hi all,

I am an Italian researcher and I am working on a (cross-platform) Real Time scheduling infrastructure. I am currently using Linux kernel 2.6.29.6-rt24-smp (PREEMPT-RT patch) running on an Intel Q9550 CPU.

Yesterday I found a strange behavior of the scheduler APIs under the RT patch, in particular pthread_setaffinity_np (which is built on sched_setaffinity).
(link: http://article.gmane.org/gmane.linux.kernel.api/1550)

I think the main problem is that sched_setaffinity makes use of an rwlock, but rwlocks are preemptible with the RT patch. So it could happen that a high-priority process/thread that makes use of the sched_setaffinity facility is unwillingly preempted when controlling other (even low-priority) processes/threads.
I think sched_setaffinity should make use of raw_spinlocks, or should anyway be guaranteed not to be preempted (maybe a preempt_disable?); otherwise it could lead to unwanted situations for a Real Time OS, such as the one described below.

The issue can easily be reproduced starting from this scenario.
I have 4 Real Time threads (SCHED_FIFO) distributed as follows:

T0 : CPU 0, Priority 2 (HIGH)
T1 : CPU 1, Priority 2 (HIGH)
T3 : CPU 0, Priority 1 (LOW)
T4 : CPU 1, Priority 1 (LOW)

So T0 and T1 are actually the "big bosses" on CPUs #0 and #1; T3 and T4, instead, never execute (let's assume that each thread is a simple busy wait that never sleeps/yields). Now, at a certain point, from T0's code, I want to migrate T4 from CPU #1 to #0, keeping its low priority.
Therefore I perform a pthread_setaffinity_np from T0, changing T4's mask from CPU #1 to #0.
In this scenario it happens that T3 (which should never execute, since T0 with higher priority is currently running on the same CPU #0) "emerges" and executes for a bit.
It seems that the pthread_setaffinity_np syscall is somehow "suspensive" for the time needed to migrate T4, and lets the scheduler execute T3 for that span of time.

What do you think about this situation? Should the sched APIs be revised?

Thanks in advance,
Primiano

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
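For reference, the manipulation the scenario above performs boils down to two pthread calls. A minimal user-space sketch (the helper names are mine, not from the report; pinning a thread works unprivileged, while SCHED_FIFO priorities require root or CAP_SYS_NICE):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin a thread to a single CPU, as T0 does when it migrates T4
 * from CPU #1 to CPU #0; returns 0 on success. */
static int pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(t, sizeof(set), &set);
}

/* Give a thread a SCHED_FIFO priority (2 = HIGH, 1 = LOW in the
 * scenario above); requires root or CAP_SYS_NICE. */
static int set_fifo_prio(pthread_t t, int prio)
{
    struct sched_param sp = { .sched_priority = prio };
    return pthread_setschedparam(t, SCHED_FIFO, &sp);
}
```

With four busy-looping threads set up this way, the observation in the report is that the pin_to_cpu() call issued from T0 briefly lets the lower-priority T3 run on CPU #0.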
* Re: Considerations on sched APIs under RT patch
  2010-04-19 20:48 Considerations on sched APIs under RT patch Primiano Tucci
@ 2010-04-20  9:20 ` Peter Zijlstra
  2010-04-20 21:56   ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-20 9:20 UTC (permalink / raw)
To: Primiano Tucci; +Cc: linux-kernel, tglx, rostedt

On Mon, 2010-04-19 at 22:48 +0200, Primiano Tucci wrote:

> Yesterday I found a strange behavior of the scheduler APIs under the RT patch, in particular pthread_setaffinity_np (which is built on sched_setaffinity).

> I think the main problem is that sched_setaffinity makes use of an rwlock, but rwlocks are preemptible with the RT patch.

It does? where?

sys_sched_setaffinity()
  sched_setaffinity()
    set_cpus_allowed_ptr()

set_cpus_allowed_ptr() is the one that does the real work, and that takes the rq->lock and plays games with the migration thread, none of which should be able to cause any form of priority inversion.

> So it could happen that a high-priority process/thread that makes use of the sched_setaffinity facility is unwillingly preempted when controlling other (even low-priority) processes/threads.

Well, suppose there was a rwlock_t, then for PREEMPT_RT=y that would be mapped to an rt_mutex, which is PI aware.

> I think sched_setaffinity should make use of raw_spinlocks, or should anyway be guaranteed not to be preempted (maybe a preempt_disable?); otherwise it could lead to unwanted situations for a Real Time OS, such as the one described below.

It does, rq->lock is a non-preemptible lock, and the migration thread runs at a priority higher than FIFO-99.
> The issue can easily be reproduced starting from this scenario.
> I have 4 Real Time threads (SCHED_FIFO) distributed as follows:
>
> T0 : CPU 0, Priority 2 (HIGH)
> T1 : CPU 1, Priority 2 (HIGH)
> T3 : CPU 0, Priority 1 (LOW)
> T4 : CPU 1, Priority 1 (LOW)
>
> So T0 and T1 are actually the "big bosses" on CPUs #0 and #1; T3 and T4, instead, never execute (let's assume that each thread is a simple busy wait that never sleeps/yields). Now, at a certain point, from T0's code, I want to migrate T4 from CPU #1 to #0, keeping its low priority.
> Therefore I perform a pthread_setaffinity_np from T0, changing T4's mask from CPU #1 to #0.
>
> In this scenario it happens that T3 (which should never execute, since T0 with higher priority is currently running on the same CPU #0) "emerges" and executes for a bit.
> It seems that the pthread_setaffinity_np syscall is somehow "suspensive" for the time needed to migrate T4, and lets the scheduler execute T3 for that span of time.
>
> What do you think about this situation? Should the sched APIs be revised?

Not sure why you think the APIs should be changed. If this does indeed happen then there is a bug somewhere in the implementation, the trick will be finding it.

So you run these four RT tasks on CPUs 0,1 and then control them from another cpu, say 2?

Can you get a function trace that illustrates T3 getting scheduled, preferably while running the latest -rt kernel?

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-20  9:20 ` Peter Zijlstra
@ 2010-04-20 21:56   ` Primiano Tucci
  2010-04-20 23:00     ` Steven Rostedt
  ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread

From: Primiano Tucci @ 2010-04-20 21:56 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: linux-kernel, tglx, rostedt

Hi Peter, thank you for your reply.

On Tue, Apr 20, 2010 at 11:20 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, 2010-04-19 at 22:48 +0200, Primiano Tucci wrote:
>
>> Yesterday I found a strange behavior of the scheduler APIs under the RT patch, in particular pthread_setaffinity_np (which is built on sched_setaffinity).
>
>> I think the main problem is that sched_setaffinity makes use of an rwlock, but rwlocks are preemptible with the RT patch.
>
> It does? where?
>
> sys_sched_setaffinity()
>   sched_setaffinity()
>     set_cpus_allowed_ptr()

I see

long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
{
	cpumask_var_t cpus_allowed, new_mask;
	struct task_struct *p;
	int retval;

	get_online_cpus();
-->	read_lock(&tasklist_lock);

My question is: suppose that tasklist_lock is held by a writer. What happens to the calling thread? It can't take the lock, therefore it yields to the next ready task (which in my scenario has a lower priority).
In my view, this is not a Priority Inversion problem. The problem is that sched_setaffinity is unexpectedly "suspensive" and yields to the lower-priority thread.

Thank you for your support,
Primiano

> set_cpus_allowed_ptr() is the one that does the real work, and that takes the rq->lock and plays games with the migration thread, none of which should be able to cause any form of priority inversion.
>
>> So it could happen that a high-priority process/thread that makes use of the sched_setaffinity facility is unwillingly preempted when controlling other (even low-priority) processes/threads.
> Well, suppose there was a rwlock_t, then for PREEMPT_RT=y that would be mapped to an rt_mutex, which is PI aware.
>
>> I think sched_setaffinity should make use of raw_spinlocks, or should anyway be guaranteed not to be preempted (maybe a preempt_disable?); otherwise it could lead to unwanted situations for a Real Time OS, such as the one described below.
>
> It does, rq->lock is a non-preemptible lock, and the migration thread runs at a priority higher than FIFO-99.
>
>> The issue can easily be reproduced starting from this scenario.
>> I have 4 Real Time threads (SCHED_FIFO) distributed as follows:
>>
>> T0 : CPU 0, Priority 2 (HIGH)
>> T1 : CPU 1, Priority 2 (HIGH)
>> T3 : CPU 0, Priority 1 (LOW)
>> T4 : CPU 1, Priority 1 (LOW)
>>
>> So T0 and T1 are actually the "big bosses" on CPUs #0 and #1; T3 and T4, instead, never execute (let's assume that each thread is a simple busy wait that never sleeps/yields). Now, at a certain point, from T0's code, I want to migrate T4 from CPU #1 to #0, keeping its low priority.
>> Therefore I perform a pthread_setaffinity_np from T0, changing T4's mask from CPU #1 to #0.
>>
>> In this scenario it happens that T3 (which should never execute, since T0 with higher priority is currently running on the same CPU #0) "emerges" and executes for a bit.
>> It seems that the pthread_setaffinity_np syscall is somehow "suspensive" for the time needed to migrate T4, and lets the scheduler execute T3 for that span of time.
>>
>> What do you think about this situation? Should the sched APIs be revised?
>
> Not sure why you think the APIs should be changed. If this does indeed happen then there is a bug somewhere in the implementation, the trick will be finding it.
>
> So you run these four RT tasks on CPUs 0,1 and then control them from another cpu, say 2?
>
> Can you get a function trace that illustrates T3 getting scheduled, preferably while running the latest -rt kernel?
--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-20 21:56 ` Primiano Tucci
@ 2010-04-20 23:00   ` Steven Rostedt
  2010-04-21  5:16     ` Primiano Tucci
  2010-04-21 12:56   ` Peter Zijlstra
  2010-04-27 13:18   ` Thomas Gleixner
  2 siblings, 1 reply; 20+ messages in thread

From: Steven Rostedt @ 2010-04-20 23:00 UTC (permalink / raw)
To: Primiano Tucci; +Cc: Peter Zijlstra, linux-kernel, tglx

On Tue, 2010-04-20 at 23:56 +0200, Primiano Tucci wrote:
> Hi Peter,

> long sched_setaffinity(pid_t pid, const struct cpumask *in_mask)
> {
> 	cpumask_var_t cpus_allowed, new_mask;
> 	struct task_struct *p;
> 	int retval;
>
> 	get_online_cpus();
> -->	read_lock(&tasklist_lock);
>
> My question is: suppose that tasklist_lock is held by a writer. What happens to the calling thread? It can't take the lock, therefore it yields to the next ready task (which in my scenario has a lower priority).
> In my view, this is not a Priority Inversion problem. The problem is that sched_setaffinity is unexpectedly "suspensive" and yields to the lower-priority thread.

read_locks are converted into "special" rt_mutexes. The only thing special about them is that the owner may grab the same read lock more than once (recursively).

If a lower-priority process currently holds the tasklist_lock for write, when a high-priority process tries to take it for read (or write, for that matter) it will block on the lower-priority process. But that lower-priority process will acquire the priority of the higher-priority process (priority inheritance) and will run at that priority until it releases the lock. Then it will go back to its low priority and the higher-priority process will then preempt it and acquire the lock for read.

The above is what is expected.

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread
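The priority-inheritance behavior Steven describes is also exposed to user space: a pthread mutex can be created with the PI protocol, which is the same mechanism PREEMPT_RT's rt_mutex gives kernel locks. A minimal sketch (the helper name is mine):

```c
#define _GNU_SOURCE
#include <pthread.h>

/* Create a mutex whose holder temporarily inherits the priority of
 * the highest-priority waiter, mirroring what rt_mutex does for
 * kernel locks under PREEMPT_RT. Returns 0 on success. */
int init_pi_mutex(pthread_mutex_t *m)
{
    pthread_mutexattr_t attr;
    int err;

    if ((err = pthread_mutexattr_init(&attr)) != 0)
        return err;
    err = pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
    if (err == 0)
        err = pthread_mutex_init(m, &attr);
    pthread_mutexattr_destroy(&attr);
    return err;
}
```

With such a mutex, a low-priority holder blocking a SCHED_FIFO waiter is boosted until it unlocks, bounding the time the high-priority task waits.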
* Re: Considerations on sched APIs under RT patch
  2010-04-20 23:00 ` Steven Rostedt
@ 2010-04-21  5:16   ` Primiano Tucci
  2010-04-21  8:49     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-21 5:16 UTC (permalink / raw)
To: rostedt; +Cc: Peter Zijlstra, linux-kernel, tglx

Hi Steve,

> read_locks are converted into "special" rt_mutexes. The only thing special about them is that the owner may grab the same read lock more than once (recursively).
>
> If a lower-priority process currently holds the tasklist_lock for write, when a high-priority process tries to take it for read (or write, for that matter) it will block on the lower-priority process. But that lower-priority process will acquire the priority of the higher-priority process (priority inheritance) and will run at that priority until it releases the lock. Then it will go back to its low priority and the higher-priority process will then preempt it and acquire the lock for read.

In your example you implied that the low-priority process holding the lock for write runs on the same CPU as the higher-priority process that wants to lock it for read. This is clear to me.
My problem is, in an SMP environment, what happens if a process (let's say T1 on CPU #1) holds the lock for write (its priority does not matter, it is not a PI problem) and now a process T0 on CPU #0 wants to lock it for read?
The process T0 will be blocked! But who will run on CPU #0 while the rwlock is held by T1? Probably the next ready process on CPU #0. Is that right?

Thanks,
Primiano

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21  5:16 ` Primiano Tucci
@ 2010-04-21  8:49   ` Peter Zijlstra
  2010-04-21 12:46     ` Steven Rostedt
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-21 8:49 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx

On Wed, 2010-04-21 at 07:16 +0200, Primiano Tucci wrote:
> Hi Steve,
>
>> read_locks are converted into "special" rt_mutexes. The only thing special about them is that the owner may grab the same read lock more than once (recursively).
>>
>> If a lower-priority process currently holds the tasklist_lock for write, when a high-priority process tries to take it for read (or write, for that matter) it will block on the lower-priority process. But that lower-priority process will acquire the priority of the higher-priority process (priority inheritance) and will run at that priority until it releases the lock. Then it will go back to its low priority and the higher-priority process will then preempt it and acquire the lock for read.
>
> In your example you implied that the low-priority process holding the lock for write runs on the same CPU as the higher-priority process that wants to lock it for read. This is clear to me.
> My problem is, in an SMP environment, what happens if a process (let's say T1 on CPU #1) holds the lock for write (its priority does not matter, it is not a PI problem) and now a process T0 on CPU #0 wants to lock it for read?
> The process T0 will be blocked! But who will run on CPU #0 while the rwlock is held by T1? Probably the next ready process on CPU #0. Is that right?

Yes. This is the reality of SMP systems, nothing much you can do about that. System resources are shared between all cpus, irrespective of task affinities.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21  8:49 ` Peter Zijlstra
@ 2010-04-21 12:46   ` Steven Rostedt
  2010-04-21 19:24     ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Steven Rostedt @ 2010-04-21 12:46 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Primiano Tucci, linux-kernel, tglx

On Wed, 2010-04-21 at 10:49 +0200, Peter Zijlstra wrote:
> On Wed, 2010-04-21 at 07:16 +0200, Primiano Tucci wrote:
>> Hi Steve,
>>
>>> read_locks are converted into "special" rt_mutexes. The only thing special about them is that the owner may grab the same read lock more than once (recursively).
>>>
>>> If a lower-priority process currently holds the tasklist_lock for write, when a high-priority process tries to take it for read (or write, for that matter) it will block on the lower-priority process. But that lower-priority process will acquire the priority of the higher-priority process (priority inheritance) and will run at that priority until it releases the lock. Then it will go back to its low priority and the higher-priority process will then preempt it and acquire the lock for read.
>>
>> In your example you implied that the low-priority process holding the lock for write runs on the same CPU as the higher-priority process that wants to lock it for read. This is clear to me.
>> My problem is, in an SMP environment, what happens if a process (let's say T1 on CPU #1) holds the lock for write (its priority does not matter, it is not a PI problem) and now a process T0 on CPU #0 wants to lock it for read?
>> The process T0 will be blocked! But who will run on CPU #0 while the rwlock is held by T1? Probably the next ready process on CPU #0. Is that right?
>
> Yes. This is the reality of SMP systems, nothing much you can do about that. System resources are shared between all cpus, irrespective of task affinities.

Actually, we do better than that.
With adaptive locks, if the process on the other CPU is still running, the high-priority task will spin until the other process releases the lock or goes to sleep. If it goes to sleep, then the high-prio task will also sleep, otherwise it just spins and takes the lock when it is released.

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread
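The adaptive locks Steven describes can be approximated in user space as a spin-then-block mutex. A simplified sketch (the real -rt implementation spins only while the lock owner is currently running on another CPU, a condition user space cannot observe directly, so a bounded spin count stands in for it here):

```c
#include <pthread.h>

#define SPIN_TRIES 1000  /* stand-in for "owner still running" check */

/* Spin briefly before sleeping, in the spirit of -rt adaptive locks:
 * if the holder releases the lock quickly we never pay for a context
 * switch; otherwise we fall back to a normal blocking lock. */
int adaptive_lock(pthread_mutex_t *m)
{
    for (int i = 0; i < SPIN_TRIES; i++)
        if (pthread_mutex_trylock(m) == 0)
            return 0;               /* got it while spinning */
    return pthread_mutex_lock(m);   /* give up spinning and block */
}
```

The design point is the one Steven makes: for short critical sections the waiter never sleeps, so the CPU is not handed to a lower-priority task in the meantime.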
* Re: Considerations on sched APIs under RT patch
  2010-04-21 12:46 ` Steven Rostedt
@ 2010-04-21 19:24   ` Primiano Tucci
  2010-04-21 19:57     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-21 19:24 UTC (permalink / raw)
To: rostedt; +Cc: Peter Zijlstra, linux-kernel, tglx

> Actually, we do better than that. With adaptive locks, if the process on the other CPU is still running, the high-priority task will spin until the other process releases the lock or goes to sleep. If it goes to sleep, then the high-prio task will also sleep, otherwise it just spins and takes the lock when it is released.
>
> -- Steve

It sounds more reasonable now. I know that in a preemptible kernel even syscalls can be preempted. That is absolutely fair, except for those syscalls (such as setaffinity, setpriority) that control the scheduler.
Going back to my original problem, the real question is: is it guaranteed that calling a scheduler API won't induce a rescheduling of the caller process (e.g. as in the case of a lock held by another processor)? It would be very unpleasant if the scheduling APIs could induce rescheduling, making the realization of a Real Time scheduling infrastructure completely non-deterministic.
If I have understood your replies correctly, it seems that my problem is due to an *old* kernel version that still uses an rwlock in setaffinity! Is that right?

Thank you for your extremely valuable support.
Primiano

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21 19:24 ` Primiano Tucci
@ 2010-04-21 19:57   ` Peter Zijlstra
  2010-04-21 20:38     ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-21 19:57 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx

On Wed, 2010-04-21 at 21:24 +0200, Primiano Tucci wrote:
> Is it guaranteed that calling a scheduler API won't induce a rescheduling of the caller process (e.g. as in the case of a lock held by another processor)? It would be very unpleasant if the scheduling APIs could induce rescheduling, making the realization of a Real Time scheduling infrastructure completely non-deterministic.

No, any syscall can end up blocking/scheduling; there are no exceptions. But blocking doesn't mean it's non-deterministic, esp. when coupled with things like PI.

But you do have to treat system resources as such, that is they can (and will) create cross-cpu dependencies; if you do not take that into account you will of course be surprised.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21 19:57 ` Peter Zijlstra
@ 2010-04-21 20:38   ` Primiano Tucci
  2010-04-21 20:58     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-21 20:38 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: rostedt, linux-kernel, tglx

> No, any syscall can end up blocking/scheduling; there are no exceptions. But blocking doesn't mean it's non-deterministic, esp. when coupled with things like PI.
>
> But you do have to treat system resources as such, that is they can (and will) create cross-cpu dependencies; if you do not take that into account you will of course be surprised.

I actually don't understand why you bring up PI so frequently; it seems to be your sole point of interest. Actually I do take care not to share cross-cpu resources, but I cannot take care of what the kernel does.
In my view it is unacceptable that the scheduler APIs can lead to a rescheduling. It voids any form of process control. If I lose control while controlling other processes, Quis custodiet ipsos custodes?

P.S. It actually does not happen in other RTOSes, e.g., VxWorks SMP.

Primiano,

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21 20:38 ` Primiano Tucci
@ 2010-04-21 20:58   ` Peter Zijlstra
  2010-04-22 13:20     ` Steven Rostedt
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-21 20:58 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx

On Wed, 2010-04-21 at 22:38 +0200, Primiano Tucci wrote:
> I actually don't understand why you bring up PI so frequently; it seems to be your sole point of interest.

PI keeps preemptible locks working in an RT environment. Non-preemptible or preemptible+PI are both valid RT constructs that can be analyzed.

> Actually I do take care not to share cross-cpu resources, but I cannot take care of what the kernel does.

An SMP kernel must be treated as a cross-cpu resource. There's just no way around that. For instance, Unix allows two processes on different cpus to invoke sched_setscheduler/sched_setaffinity or any number of system calls on the same target process. Filesystems are shared, etc.

> In my view it is unacceptable that the scheduler APIs can lead to a rescheduling.

They can even lead to pagefaults and disk IO if you're not careful. I'm not sure if there are blocking locks left thereabout, but spinlocks and rt_mutexes both create cross-cpu dependencies that need to be analyzed; !preempt isn't magic in any way.

> It voids any form of process control. If I lose control while controlling other processes, Quis custodiet ipsos custodes?
>
> P.S.
> It actually does not happen in other RTOSes, e.g., VxWorks SMP.

I don't know any of those, but it's impossible to migrate tasks from one cpu to another without creating cross-cpu dependencies.

Whether locks are preemptible or not doesn't make them any less analyzable; if you use system calls in your RT program, their implementation needs to be considered.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-21 20:58 ` Peter Zijlstra
@ 2010-04-22 13:20   ` Steven Rostedt
  2010-04-22 13:50     ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Steven Rostedt @ 2010-04-22 13:20 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: Primiano Tucci, linux-kernel, tglx

On Wed, 2010-04-21 at 22:58 +0200, Peter Zijlstra wrote:
>> P.S. It actually does not happen in other RTOSes, e.g., VxWorks SMP.
>
> I don't know any of those, but it's impossible to migrate tasks from one cpu to another without creating cross-cpu dependencies.
>
> Whether locks are preemptible or not doesn't make them any less analyzable; if you use system calls in your RT program, their implementation needs to be considered.

It's been a while since I've used SMP VxWorks, but back then what it did was to copy an image for every CPU separately. It was not really SMP but instead a separate OS for each CPU. Things may have changed since then (it was around 2002 when I saw this).

There are projects to do the same for Linux, and I feel it may give you the most control of the system. But the hardware is still shared, so the contention does not go away, it just gets moved to the hardware resources.

-- Steve

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-22 13:20 ` Steven Rostedt
@ 2010-04-22 13:50   ` Primiano Tucci
  2010-04-22 13:57     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-22 13:50 UTC (permalink / raw)
To: rostedt; +Cc: Peter Zijlstra, linux-kernel, tglx

> It's been a while since I've used SMP VxWorks, but back then what it did was to copy an image for every CPU separately. It was not really SMP but instead a separate OS for each CPU. Things may have changed since then (it was around 2002 when I saw this).
>
> There are projects to do the same for Linux, and I feel it may give you the most control of the system. But the hardware is still shared, so the contention does not go away, it just gets moved to the hardware resources.

I knew this kind of solution based on OS partitioning, but my group and I are currently working on a Global-EDF scheduler: a single scheduler (and therefore a single OS/kernel) that is able to migrate tasks between CPUs in order to maximize global CPU usage.
In order to do this we have a single "super"-process (a Meta-Scheduler) that needs to be able to control the priority and affinity of the managed tasks, without losing control while doing so.
We are able to make it work under VxWorks 6.7 SMP (which seems to be really SMP); now we are trying to port the same infrastructure to a Linux-RT kernel to compare performance, but we found the issue with the setaffinity syscall as described in my first mail. I will try to update to the latest kernel version and see if it works correctly.

Thanks,
Primiano

--
Primiano Tucci
http://www.primianotucci.com

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-22 13:50 ` Primiano Tucci
@ 2010-04-22 13:57   ` Peter Zijlstra
  2010-04-22 15:40     ` Primiano Tucci
  0 siblings, 1 reply; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-22 13:57 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx

On Thu, 2010-04-22 at 15:50 +0200, Primiano Tucci wrote:
> I knew this kind of solution based on OS partitioning, but my group and I are currently working on a Global-EDF scheduler: a single scheduler (and therefore a single OS/kernel) that is able to migrate tasks between CPUs in order to maximize global CPU usage.

I would hardly call a global-edf scheduler unique. It's a well studied algorithm and even available in commercial SMP operating systems (hopefully soon Linux too, see SCHED_DEADLINE, which will approximate global-edf, much like the current SCHED_FIFO approximates global-fifo).

> In order to do this we have a single "super"-process (a Meta-Scheduler) that needs to be able to control the priority and affinity of the managed tasks, without losing control while doing so.

Implementing this as userspace/middleware seems daft. But if your controlling process has a global affinity mask and runs as the highest available userspace process priority it's still all valid.

^ permalink raw reply	[flat|nested] 20+ messages in thread
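Peter's prescription for the controlling process, in code: give it a full affinity mask and the highest user-space RT priority. A sketch under those assumptions (the function name is mine; sched_setscheduler itself requires root or CAP_SYS_NICE):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

/* Configure the calling (meta-scheduler) process as suggested:
 * allow it on every online CPU and give it the highest user-space
 * SCHED_FIFO priority, so no managed task can preempt it.
 * Returns 0 on success, -1 on failure. */
int setup_meta_scheduler(void)
{
    cpu_set_t all;
    struct sched_param sp;
    int cpu, ncpus = (int)sysconf(_SC_NPROCESSORS_ONLN);

    CPU_ZERO(&all);
    for (cpu = 0; cpu < ncpus; cpu++)
        CPU_SET(cpu, &all);          /* global affinity mask */
    if (sched_setaffinity(0, sizeof(all), &all) != 0)
        return -1;

    sp.sched_priority = sched_get_priority_max(SCHED_FIFO);
    return sched_setscheduler(0, SCHED_FIFO, &sp) == 0 ? 0 : -1;
}
```

Even so, per the rest of the thread, this guards against preemption by managed tasks but not against blocking inside a syscall on a kernel-internal lock.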
* Re: Considerations on sched APIs under RT patch
  2010-04-22 13:57 ` Peter Zijlstra
@ 2010-04-22 15:40   ` Primiano Tucci
  2010-04-22 16:28     ` Peter Zijlstra
  0 siblings, 1 reply; 20+ messages in thread

From: Primiano Tucci @ 2010-04-22 15:40 UTC (permalink / raw)
To: Peter Zijlstra; +Cc: rostedt, linux-kernel, tglx

> I would hardly call a global-edf scheduler unique.

I used the word "unique" to underline that it is not a partitioned scheduling scheme, but ONE global scheduler that manages tasks.

> It's a well studied algorithm and even available in commercial SMP operating systems

Can you cite a commercial SMP system that supports multicore/multiprocessor G-EDF? Alternatively, can you cite a uniprocessor system that supports EDF? To my knowledge, most commercial systems do not even have the concept of periodic tasks, release periods and deadlines (e.g., VxWorks, QNX Neutrino)... how can they support G-EDF?
The other commercial RT systems that, to my knowledge, support periodic tasks, such as the B&R Automation Runtime components and the Beckhoff TwinCAT system, only offer a Rate Monotonic scheduling scheme.

> (hopefully soon Linux too, see SCHED_DEADLINE, which will approximate global-edf, much like the current SCHED_FIFO approximates global-fifo).

> Implementing this as userspace/middleware seems daft. But if your controlling process has a global affinity mask and runs as the highest available userspace process priority it's still all valid.

It is how it is implemented now, and how it works under VxWorks!

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-22 15:40 ` Primiano Tucci
@ 2010-04-22 16:28   ` Peter Zijlstra
  2010-04-22 17:48     ` Bjoern Brandenburg
  2010-04-22 19:33     ` Primiano Tucci
  1 sibling, 2 replies; 20+ messages in thread

From: Peter Zijlstra @ 2010-04-22 16:28 UTC (permalink / raw)
To: Primiano Tucci; +Cc: rostedt, linux-kernel, tglx, Bjoern Brandenburg

On Thu, 2010-04-22 at 17:40 +0200, Primiano Tucci wrote:
>> It's a well studied algorithm and even available in commercial SMP operating systems
>
> Can you cite a commercial SMP system that supports multicore/multiprocessor G-EDF?

From: http://www.cs.unc.edu/~anderson/papers/rtlws09.pdf

"Regarding the frequently voiced objections to G-EDF's viability in a "real" system, it should be noted that xnu, the kernel underlying Apple's multimedia-friendly OS X, has been relying on G-EDF to support real-time applications on multiprocessors for several years [5]."

"[5] Apple Inc. xnu source code. http://opensource.apple.com/source/xnu/."

>> Implementing this as userspace/middleware seems daft. But if your controlling process has a global affinity mask and runs as the highest available userspace process priority it's still all valid.
>
> It is how it is implemented now, and how it works under VxWorks!

But that does not, and can not, provide proper isolation and bandwidth guarantees since there could be runnable tasks outside the scope of the middleware.

^ permalink raw reply	[flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch
  2010-04-22 16:28 ` Peter Zijlstra
@ 2010-04-22 17:48   ` Bjoern Brandenburg
  0 siblings, 0 replies; 20+ messages in thread

From: Bjoern Brandenburg @ 2010-04-22 17:48 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Primiano Tucci, Steven Rostedt, linux-kernel, tglx, James H. Anderson, Andrea Bastoni

On Apr 22, 2010, at 12:28 PM, Peter Zijlstra wrote:
> On Thu, 2010-04-22 at 17:40 +0200, Primiano Tucci wrote:
>
>>> It's a well studied algorithm and even available in commercial SMP operating systems
>>
>> Can you cite a commercial SMP system that supports multicore/multiprocessor G-EDF?
>
> From: http://www.cs.unc.edu/~anderson/papers/rtlws09.pdf
>
> "Regarding the frequently voiced objections to G-EDF's viability in a "real" system, it should be noted that xnu, the kernel underlying Apple's multimedia-friendly OS X, has been relying on G-EDF to support real-time applications on multiprocessors for several years [5]."
>
> "[5] Apple Inc. xnu source code. http://opensource.apple.com/source/xnu/."

Since that reference is a bit coarse-grained, let me clarify by pointing out the actual implementation. In particular, if you look at /usr/include/mach/thread_policy.h (on OS X, of course), you'll find:

> /*
>  * THREAD_TIME_CONSTRAINT_POLICY:
>  *
>  * This scheduling mode is for threads which have real time
>  * constraints on their execution.
>  *
>  * Parameters:
>  *
>  * period: This is the nominal amount of time between separate
>  * processing arrivals, specified in absolute time units. A
>  * value of 0 indicates that there is no inherent periodicity in
>  * the computation.
>  *
>  * computation: This is the nominal amount of computation
>  * time needed during a separate processing arrival, specified
>  * in absolute time units.
> * > * constraint: This is the maximum amount of real time that > * may elapse from the start of a separate processing arrival > * to the end of computation for logically correct functioning, > * specified in absolute time units. Must be (>= computation). > * Note that latency = (constraint - computation). > * > * preemptible: This indicates that the computation may be > * interrupted, subject to the constraint specified above. > */ > I.e., THREAD_TIME_CONSTRAINT_POLICY implements the sporadic task model. Global EDF is implemented in osfmk/kern/sched_prim.c (line numbers pertain to XNU 1504.3.12): In line 473: > /* > * Calculate deadline for real-time threads. > */ > if (thread->sched_mode & TH_MODE_REALTIME) { > thread->realtime.deadline = mach_absolute_time(); > thread->realtime.deadline += thread->realtime.constraint; > } Further, in choose_processor() starting at line 26: > if (thread->sched_pri >= BASEPRI_RTQUEUES) { > /* > * For an RT thread, iterate through active processors, first fit. > */ > processor = (processor_t)queue_first(&cset->active_queue); > while (!queue_end(&cset->active_queue, (queue_entry_t)processor)) { > if (thread->sched_pri > processor->current_pri || > thread->realtime.deadline < processor->deadline) > return (processor); And in thread_setrun() at line 2482: > if (thread->last_processor != PROCESSOR_NULL) { > /* > * Simple (last processor) affinity case. > */ > processor = thread->last_processor; > pset = processor->processor_set; > pset_lock(pset); > > /* > * Choose a different processor in certain cases. > */ > if (thread->sched_pri >= BASEPRI_RTQUEUES) { > /* > * If the processor is executing an RT thread with > * an earlier deadline, choose another. > */ > if (thread->sched_pri <= processor->current_pri || > thread->realtime.deadline >= processor->deadline) > processor = choose_processor(pset, PROCESSOR_NULL, thread); > } > This Global EDF implementation might have been inherited from RT Mach, but I'm not sure about that. 
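The first-fit selection quoted above boils down to: hand the incoming RT thread to the first active processor that is either running something lower-priority or serving a later deadline. A minimal userspace sketch of that comparison (hypothetical, simplified types — not the actual xnu structures):

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-processor state xnu consults. */
struct processor_state {
    int      current_pri;  /* priority of the thread it is running */
    uint64_t deadline;     /* that thread's deadline; UINT64_MAX if none */
};

/*
 * First-fit G-EDF placement in the style of xnu's choose_processor():
 * return the index of the first processor whose current thread has a
 * lower priority or a later deadline than the incoming thread, or -1
 * if every processor is running something more urgent.
 */
int choose_processor_first_fit(const struct processor_state *set, size_t n,
                               int pri, uint64_t deadline)
{
    for (size_t i = 0; i < n; i++) {
        if (pri > set[i].current_pri || deadline < set[i].deadline)
            return (int)i;
    }
    return -1;
}
```

Note the scan is O(n) in the number of processors, which is exactly what the heap-based variant Björn describes next avoids.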
In LITMUS^RT [1], we implement Global EDF a bit differently. Instead of iterating over all processors, we keep the processors in a max heap (ordered by deadline, with no RT task == infinity). The xnu variant may be beneficial if you only expect a few RT tasks at any time, whereas ours is based on the assumption that most processors will be scheduling RT tasks most of the time. - Björn [1] http://www.cs.unc.edu/~anderson/litmus-rt (The posted patch is horribly out of date; we'll have something more recent based on 2.6.32 up soon.) --- Björn B. Brandenburg Ph.D. Candidate Dept. of Computer Science University of North Carolina at Chapel Hill http://www.cs.unc.edu/~bbb ^ permalink raw reply [flat|nested] 20+ messages in thread
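The heap-based approach described above can be illustrated as follows: keep processors in a max-heap keyed by the deadline of the job each one is serving, with "no RT task" represented as deadline infinity; the root is then always the preemption candidate, so the preemption decision is O(1). A toy sketch (hypothetical structures, not the LITMUS^RT code):

```c
#include <stdint.h>
#include <stddef.h>

#define NO_RT_TASK UINT64_MAX  /* processor with no RT task == deadline infinity */

/* Toy max-heap of per-processor deadlines (processor ids elided for brevity). */
struct cpu_heap {
    uint64_t dl[64];
    size_t   n;
};

void heap_push(struct cpu_heap *h, uint64_t d)
{
    size_t i = h->n++;
    h->dl[i] = d;
    while (i > 0 && h->dl[(i - 1) / 2] < h->dl[i]) {  /* sift up */
        uint64_t tmp = h->dl[i];
        h->dl[i] = h->dl[(i - 1) / 2];
        h->dl[(i - 1) / 2] = tmp;
        i = (i - 1) / 2;
    }
}

/* The root holds the latest deadline currently being served. */
uint64_t heap_peek(const struct cpu_heap *h)
{
    return h->dl[0];
}

/* Under G-EDF a newly released job preempts iff its deadline is earlier
 * than the latest deadline in service — one comparison against the root,
 * versus the O(n) scan of the first-fit variant. */
int should_preempt(const struct cpu_heap *h, uint64_t new_deadline)
{
    return new_deadline < heap_peek(h);
}
```

An idle processor (pushed as `NO_RT_TASK`) naturally floats to the root, so new RT jobs prefer idle CPUs before preempting busy ones.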
* Re: Considerations on sched APIs under RT patch 2010-04-22 16:28 ` Peter Zijlstra 2010-04-22 17:48 ` Bjoern Brandenburg @ 2010-04-22 19:33 ` Primiano Tucci 1 sibling, 0 replies; 20+ messages in thread From: Primiano Tucci @ 2010-04-22 19:33 UTC (permalink / raw) To: Peter Zijlstra; +Cc: rostedt, linux-kernel, tglx, Bjoern Brandenburg Hi Peter, It is not my intent to start a flame, but, as you pointed out yourself, EDF is not so common, particularly in commercial RTOSes (and xnu, indeed, does not fall into that category). It is not surprising that the citation you found is from Mr. Brandenburg and Mr. Anderson of the University of North Carolina; incidentally, I had a copy of their paper (On the Implementation of Global RT Schedulers) at hand when reading your message. I think they are among the few who have concretely and deeply investigated G-EDF. As you can read in Brandenburg and Anderson, there is no standard/reference implementation of Global EDF; rather, many "variants" are possible (e.g., event-driven vs. tick-driven promotion, dedicated vs. global interrupt handling, choice of data structures and locking methods, ...). My intent is not to hand out a "top ten kernels" award; all we are trying to do is investigate the various facilities offered by RT operating systems in order to determine how much we can rely on them. >> It is how it is implemented now, and how it works under VXWorks! > > But that does not, and can not, provide proper isolation and bandwidth > guarantees since there could be runnable tasks outside the scope of the > middleware. Actually, our VxWorks implementation is based on kernel-space tasks that have the maximum priority in the system, even higher than the VxWorks system tasks (which were shifted down after long and fruitful support discussions with Wind River), so no task (apart from interrupt service routines, which are carefully accounted for under VxWorks) can interfere with our middleware. 
Actually, it is controlling real-world (though not yet production-stage) manufacturing systems, so it is a bit more than a toy. We are trying to study and understand whether, and how, analogous things can be done on other real-time operating systems. We chose POSIX threads as a starting point on the Linux kernel to verify the feasibility of the same approach without intervening too deeply in the kernel. Thanks again for your replies! -- Primiano Tucci http://www.primianotucci.com ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch 2010-04-20 21:56 ` Primiano Tucci 2010-04-20 23:00 ` Steven Rostedt @ 2010-04-21 12:56 ` Peter Zijlstra 2010-04-27 13:18 ` Thomas Gleixner 2 siblings, 0 replies; 20+ messages in thread From: Peter Zijlstra @ 2010-04-21 12:56 UTC (permalink / raw) To: Primiano Tucci; +Cc: linux-kernel, tglx, rostedt On Tue, 2010-04-20 at 23:56 +0200, Primiano Tucci wrote: > > long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) { > cpumask_var_t cpus_allowed, new_mask; > struct task_struct *p; > int retval; > > get_online_cpus(); > --> read_lock(&tasklist_lock); FWIW recent kernels don't use tasklist_lock there anymore. ^ permalink raw reply [flat|nested] 20+ messages in thread
* Re: Considerations on sched APIs under RT patch 2010-04-20 21:56 ` Primiano Tucci 2010-04-20 23:00 ` Steven Rostedt 2010-04-21 12:56 ` Peter Zijlstra @ 2010-04-27 13:18 ` Thomas Gleixner 2 siblings, 0 replies; 20+ messages in thread From: Thomas Gleixner @ 2010-04-27 13:18 UTC (permalink / raw) To: Primiano Tucci; +Cc: Peter Zijlstra, linux-kernel, rostedt [-- Attachment #1: Type: TEXT/PLAIN, Size: 1030 bytes --] On Tue, 20 Apr 2010, Primiano Tucci wrote: > Hi Peter, > thank you for your reply. > > On Tue, Apr 20, 2010 at 11:20 AM, Peter Zijlstra <peterz@infradead.org> wrote: > > On Mon, 2010-04-19 at 22:48 +0200, Primiano Tucci wrote: > > > >> Yesterday days I found a strange behavior of the scheduler API's using > >> the RT patch, in particular the pthread_setaffinity_np (that stands on > >> sched_setaffinity). > > > >> I think the main problem is that sched_setaffinity makes use of a > >> rwlock, but rwlocks are pre-emptible with the RT patch. > > > > It does? where? > > > > sys_sched_setaffinity() > > sched_setaffinity() > > set_cpus_allowed_ptr() > > > I see > > long sched_setaffinity(pid_t pid, const struct cpumask *in_mask) { > cpumask_var_t cpus_allowed, new_mask; > struct task_struct *p; > int retval; > > get_online_cpus(); > --> read_lock(&tasklist_lock); You must be looking at some older version of -rt. The current 2.6.33-rt series does not take tasklist_lock in sched_setaffinity(). Thanks, tglx ^ permalink raw reply [flat|nested] 20+ messages in thread
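The syscall path quoted in the two replies above is entered from userspace through pthread_setaffinity_np. A minimal self-affinity sketch of that call (no SCHED_FIFO privileges needed; the original report's four-thread test additionally needs sched_setscheduler() and root, omitted here — build with -pthread on Linux):

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single CPU via pthread_setaffinity_np
 * (which lands in sched_setaffinity in the kernel), then read the
 * mask back to confirm. Returns the pinned CPU, or -1 on error. */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t want, got;

    CPU_ZERO(&want);
    CPU_SET(cpu, &want);
    if (pthread_setaffinity_np(pthread_self(), sizeof(want), &want) != 0)
        return -1;
    if (pthread_getaffinity_np(pthread_self(), sizeof(got), &got) != 0)
        return -1;
    return CPU_ISSET(cpu, &got) ? cpu : -1;
}
```

In the scenario from the original report, the same call would be issued from T0 with T4's pthread_t instead of pthread_self(); the question in this thread is what locks that path takes — and whether they can sleep — between those two points.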
end of thread, other threads:[~2010-04-27 13:19 UTC | newest] Thread overview: 20+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2010-04-19 20:48 Considerations on sched APIs under RT patch Primiano Tucci 2010-04-20 9:20 ` Peter Zijlstra 2010-04-20 21:56 ` Primiano Tucci 2010-04-20 23:00 ` Steven Rostedt 2010-04-21 5:16 ` Primiano Tucci 2010-04-21 8:49 ` Peter Zijlstra 2010-04-21 12:46 ` Steven Rostedt 2010-04-21 19:24 ` Primiano Tucci 2010-04-21 19:57 ` Peter Zijlstra 2010-04-21 20:38 ` Primiano Tucci 2010-04-21 20:58 ` Peter Zijlstra 2010-04-22 13:20 ` Steven Rostedt 2010-04-22 13:50 ` Primiano Tucci 2010-04-22 13:57 ` Peter Zijlstra 2010-04-22 15:40 ` Primiano Tucci 2010-04-22 16:28 ` Peter Zijlstra 2010-04-22 17:48 ` Bjoern Brandenburg 2010-04-22 19:33 ` Primiano Tucci 2010-04-21 12:56 ` Peter Zijlstra 2010-04-27 13:18 ` Thomas Gleixner