* [patch] sched: prevent bound kthreads from changing cpus_allowed
@ 2008-06-05 19:57 David Rientjes
From: David Rientjes @ 2008-06-05 19:57 UTC
To: Ingo Molnar; +Cc: Peter Zijlstra, Paul Menage, Paul Jackson, linux-kernel
Kthreads that have called kthread_bind() are bound to a specific cpu, so
other tasks should not be able to change their cpus_allowed from under
them.  Otherwise, it is possible to move kthreads such as the migration
or software watchdog threads off the cpu they must work on.
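
To illustrate (this snippet is not part of the patch; my_thread_fn and
the cpu numbers are made up):

	struct task_struct *k;
	cpumask_t new_mask = cpumask_of_cpu(1);
	int ret;

	k = kthread_create(my_thread_fn, NULL, "mykthread/%d", 0);
	kthread_bind(k, 0);		/* now also sets PF_THREAD_BOUND */
	wake_up_process(k);

	/* later, from a different task: */
	ret = set_cpus_allowed_ptr(k, &new_mask);	/* fails with -EINVAL */
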
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/sched.h | 1 +
kernel/cpuset.c | 14 +++++++++++++-
kernel/kthread.c | 1 +
kernel/sched.c | 6 ++++++
4 files changed, 21 insertions(+), 1 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ static inline void put_task_struct(struct task_struct *t)
 #define PF_SWAPWRITE    0x00800000      /* Allowed to write to swap */
 #define PF_SPREAD_PAGE  0x01000000      /* Spread page cache over cpuset */
 #define PF_SPREAD_SLAB  0x02000000      /* Spread some slab caches over cpuset */
+#define PF_THREAD_BOUND 0x04000000      /* Thread bound to specific cpu */
 #define PF_MEMPOLICY    0x10000000      /* Non-default NUMA mempolicy */
 #define PF_MUTEX_TESTER 0x20000000      /* Thread belongs to the rt mutex tester */
 #define PF_FREEZER_SKIP 0x40000000      /* Freezer should not count it as freezeable */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1190,6 +1190,15 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,
         if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
                 return -ENOSPC;
 
+        if (tsk->flags & PF_THREAD_BOUND) {
+                cpumask_t mask;
+
+                mutex_lock(&callback_mutex);
+                mask = cs->cpus_allowed;
+                mutex_unlock(&callback_mutex);
+                if (!cpus_equal(tsk->cpus_allowed, mask))
+                        return -EINVAL;
+        }
         return security_task_setscheduler(tsk, 0, NULL);
 }
@@ -1203,11 +1212,14 @@ static void cpuset_attach(struct cgroup_subsys *ss,
         struct mm_struct *mm;
         struct cpuset *cs = cgroup_cs(cont);
         struct cpuset *oldcs = cgroup_cs(oldcont);
+        int err;
 
         mutex_lock(&callback_mutex);
         guarantee_online_cpus(cs, &cpus);
-        set_cpus_allowed_ptr(tsk, &cpus);
+        err = set_cpus_allowed_ptr(tsk, &cpus);
         mutex_unlock(&callback_mutex);
+        if (err)
+                return;
 
         from = oldcs->mems_allowed;
         to = cs->mems_allowed;
diff --git a/kernel/kthread.c b/kernel/kthread.c
--- a/kernel/kthread.c
+++ b/kernel/kthread.c
@@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
         set_task_cpu(k, cpu);
         k->cpus_allowed = cpumask_of_cpu(cpu);
         k->rt.nr_cpus_allowed = 1;
+        k->flags |= PF_THREAD_BOUND;
 }
 EXPORT_SYMBOL(kthread_bind);
diff --git a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5565,6 +5565,12 @@ int set_cpus_allowed_ptr(struct task_struct *p, const cpumask_t *new_mask)
                 goto out;
         }
 
+        if (unlikely((p->flags & PF_THREAD_BOUND) && p != current &&
+                     !cpus_equal(p->cpus_allowed, *new_mask))) {
+                ret = -EINVAL;
+                goto out;
+        }
+
         if (p->sched_class->set_cpus_allowed)
                 p->sched_class->set_cpus_allowed(p, new_mask);
         else {

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: Paul Jackson @ 2008-06-05 20:29 UTC
To: David Rientjes; +Cc: mingo, peterz, menage, linux-kernel

A couple of questions on this:

1) Sometimes threads are bound to a set of CPUs, such as the CPUs
   on a particular node:

	drivers/pci/pci-driver.c:  set_cpus_allowed_ptr(current, nodecpumask);
	net/sunrpc/svc.c:          set_cpus_allowed_ptr(current, nodecpumask);

   Such cases can't invoke kthread_bind(), as that only binds to a single
   CPU.  I only see one place in your patch that sets PF_THREAD_BOUND;
   would it make sense for such multi-CPU binds as above to be
   PF_THREAD_BOUND as well?

2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:

	drivers/infiniband/hw/ehca/ehca_irq.c:  kthread_bind(cct->task, any_online_cpu(cpu_online_map));

   In such cases, the PF_THREAD_BOUND seems inappropriate.  The caller of
   kthread_bind() really doesn't seem to care where that thread is bound;
   they just want it on a CPU that is still online.

-- 
		I won't rest till it's the best ...
		Programmer, Linux Scalability
		Paul Jackson <pj@sgi.com> 1.940.382.4214

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: David Rientjes @ 2008-06-05 21:12 UTC
To: Paul Jackson; +Cc: mingo, peterz, menage, linux-kernel

On Thu, 5 Jun 2008, Paul Jackson wrote:

> A couple of questions on this:
>
> 1) Sometimes threads are bound to a set of CPUs, such as the CPUs
>    on a particular node:
>
> 	drivers/pci/pci-driver.c:  set_cpus_allowed_ptr(current, nodecpumask);
> 	net/sunrpc/svc.c:          set_cpus_allowed_ptr(current, nodecpumask);
>
>    Such cases can't invoke kthread_bind(), as that only binds to a single
>    CPU.  I only see one place in your patch that sets PF_THREAD_BOUND;
>    would it make sense for such multi-CPU binds as above to be
>    PF_THREAD_BOUND as well?
>

Not in the drivers/pci/pci-driver.c case because the setting of
cpus_allowed to nodecpumask is only temporary (as is the disabling of the
mempolicy).  It's just done for the probe callback and then reset to the
old cpumask.  So any migration here would be lost.

I can't speculate about the net/sunrpc/svc.c case because I don't know if
user migration is harmful or discouraged.  The behavior with my patch is
the same for any calls to set_cpus_allowed_ptr() for tasks that haven't
called kthread_bind(), so I'll leave that decision up to those who know
best for this networking code.

> 2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:
>
> 	drivers/infiniband/hw/ehca/ehca_irq.c:  kthread_bind(cct->task, any_online_cpu(cpu_online_map));
>
>    In such cases, the PF_THREAD_BOUND seems inappropriate.  The caller of
>    kthread_bind() really doesn't seem to care where that thread is bound;
>    they just want it on a CPU that is still online.
>

This particular case is simply moving the thread to any online cpu so that
it survives long enough for the subsequent kthread_stop() in
destroy_comp_task().  So I don't see a problem with this instance.

A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
return, but I haven't found any cases in the tree where that is currently
necessary.  And doing that would defeat the semantics of kthread_bind()
where these threads are supposed to be bound to a specific cpu and not
allowed to run on others.
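
As an aside, the drivers/pci pattern I'm referring to is roughly this
(paraphrased from memory; the exact code differs in details):

	/* pci_call_probe() runs the driver probe callback on the
	 * device's home node, then restores the old mask */
	cpumask_t oldmask = current->cpus_allowed;
	cpumask_t nodecpumask = node_to_cpumask(dev_to_node(&dev->dev));

	set_cpus_allowed_ptr(current, &nodecpumask);
	error = drv->probe(dev, id);
	set_cpus_allowed_ptr(current, &oldmask);  /* migration in between is lost */

David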

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: Max Krasnyanskiy @ 2008-06-09 20:59 UTC
To: David Rientjes
Cc: Paul Jackson, mingo, peterz, menage, linux-kernel, Oleg Nesterov

David Rientjes wrote:
>> 2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:
>>
>> 	drivers/infiniband/hw/ehca/ehca_irq.c:  kthread_bind(cct->task, any_online_cpu(cpu_online_map));
>>
>> In such cases, the PF_THREAD_BOUND seems inappropriate.  The caller of
>> kthread_bind() really doesn't seem to care where that thread is bound;
>> they just want it on a CPU that is still online.
>>
>
> This particular case is simply moving the thread to any online cpu so that
> it survives long enough for the subsequent kthread_stop() in
> destroy_comp_task().  So I don't see a problem with this instance.
>
> A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
> return, but I haven't found any cases in the tree where that is currently
> necessary.  And doing that would defeat the semantics of kthread_bind()
> where these threads are supposed to be bound to a specific cpu and not
> allowed to run on others.

Actually I have another use case here. Above example in particular may be ok
but it does demonstrate the issue nicely. Which is that in some cases
kthreads are bound to a CPU but do not have a strict "must run here"
requirement and could be moved if needed.
For example I need an ability to move workqueue threads. Workqueue threads
do kthread_bind().

So how about we add something like kthread_bind_strict() which would set
PF_THREAD_BOUND ?
We could also simply add a flags argument to kthread_bind(), which would be
better imo but requires more changes. ie It'd look like

	kthread_bind(..., cpu, KTHREAD_BIND_STRICT);

Things like migration threads, stop machine, etc would use the strict
version and everything else would use the non-strict bind.

---

On the related note (this seems like the right crowd :). What do people
think about kthreads and cpusets in general. We currently have a bit of a
disconnect in the logic.
1. kthreads can be put into a cpuset at which point their cpus_allowed
   mask is updated properly
2. kthread's cpus_allowed mask is updated properly when cpuset setup
   changes (cpus added, removed, etc).
3. kthreads inherit a cpuset from a parent (kthreadd for example) _but_
   they either do kthread_bind() or set_cpus_allowed() and both of those
   simply ignore inherited cpusets.

Notice how scenario #3 does not fit into the overall picture. The
behaviour is inconsistent.

How about this:

- Split sched_setaffinity into

	sched_setaffinity()
	{
		task *p = get_task_by_pid();
		return task_setaffinity(p);
	}

	task_setaffinity(task, cpumask, flags)
	{
		if (flags & FORCE) {
			// Used for kthreads that require strict binding.
			// Detach the task from the current cpuset
			// and put it into the root cpuset.
			// Set PF_THREAD_BOUND.
		}

		// Rest of the original sched_setaffinity logic
	}

- Have kthreads call task_setaffinity() instead of set_cpus_allowed()
  directly.

That way the behaviour will be consistent across the board.

Comments ?
Max

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: David Rientjes @ 2008-06-09 22:07 UTC
To: Max Krasnyanskiy
Cc: Paul Jackson, mingo, peterz, menage, linux-kernel, Oleg Nesterov

On Mon, 9 Jun 2008, Max Krasnyanskiy wrote:

> Actually I have another use case here. Above example in particular may be ok
> but it does demonstrate the issue nicely. Which is that in some cases
> kthreads are bound to a CPU but do not have a strict "must run here"
> requirement and could be moved if needed.

That isn't a completely accurate description of the patch; the kthread
itself is always allowed to change its cpu affinity.  This exception, for
example, allows stopmachine to still work correctly since it uses
set_cpus_allowed_ptr() for each thread.

This patch simply prohibits other tasks from changing the cpu affinity of
a kthread that has called kthread_bind().

> For example I need an ability to move workqueue threads.

Let's elaborate a little on that: you're moving workqueue threads from
their originating cpu to another cpu (hopefully on the same NUMA node)
using cpusets or sched_setaffinity()?

> Workqueue threads do kthread_bind().
> So how about we add something like kthread_bind_strict() which would set
> PF_THREAD_BOUND ?
> We could also simply add a flags argument to kthread_bind(), which would be
> better imo but requires more changes. ie It'd look like
>
> 	kthread_bind(..., cpu, KTHREAD_BIND_STRICT);
>
> Things like migration threads, stop machine, etc would use the strict
> version and everything else would use the non-strict bind.

It depends on the number of exceptions that we want to allow.  If there's
one or two, it's sufficient to just use

	p->flags &= ~PF_THREAD_BOUND;

upon return from kthread_bind(), or simply convert the existing code to
use set_cpus_allowed_ptr() instead:

	kthread_bind(p, cpu);

becomes

	cpumask_t mask = cpumask_of_cpu(cpu);
	set_cpus_allowed_ptr(p, &mask);

Or, if we have more exceptions to the rule than actual strict binding for
kthreads using kthread_bind(), we can just add

	p->flags |= PF_THREAD_BOUND;

upon return on watchdog and migration threads.

But, to me, the semantics of kthread_bind() are pretty clear.  I think
it's dangerous to move kthreads such as watchdog or migration threads out
from under them and that is easily shown if you try to do it.  There are
perhaps exceptions to the rule where existing kthread_bind() calls could
be converted to set_cpus_allowed_ptr(), but I think we should enumerate
those cases and discuss the hazards that could be associated with
changing their cpu affinity.

I'd also like to hear why you choose to move your workqueue threads off
their originating cpu.

> On the related note (this seems like the right crowd :). What do people
> think about kthreads and cpusets in general. We currently have a bit of a
> disconnect in the logic.
> 1. kthreads can be put into a cpuset at which point their cpus_allowed
>    mask is updated properly
> 2. kthread's cpus_allowed mask is updated properly when cpuset setup
>    changes (cpus added, removed, etc).
> 3. kthreads inherit a cpuset from a parent (kthreadd for example) _but_
>    they either do kthread_bind() or set_cpus_allowed() and both of those
>    simply ignore inherited cpusets.

With my patch, this is slightly different.  Kthreads that have called
kthread_bind() can have a different cpu affinity than the cpus_allowed of
their cpuset.  This happens when such a kthread is attached to a cpuset
and its 'cpus' file subsequently changes.  Cpusets is written correctly to
use set_cpus_allowed_ptr() so that this change in affinity is now rejected
for PF_THREAD_BOUND tasks, yet the kthread is still a member of the
cpuset.

It's debatable whether the kthread should still remain as a member of the
cpuset, but there are significant advantages: cpusets offers more than
just a simple way to do sched_setaffinity() over an aggregate of tasks.

David

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: Max Krasnyansky @ 2008-06-10 4:23 UTC
To: David Rientjes
Cc: Paul Jackson, mingo, peterz, menage, linux-kernel, Oleg Nesterov

David Rientjes wrote:
>> For example I need an ability to move workqueue threads.
>
> Let's elaborate a little on that: you're moving workqueue threads from
> their originating cpu to another cpu (hopefully on the same NUMA node)
> using cpusets or sched_setaffinity()?

Yes.

> But, to me, the semantics of kthread_bind() are pretty clear.  I think
> it's dangerous to move kthreads such as watchdog or migration threads out
> from under them and that is easily shown if you try to do it.  There are
> perhaps exceptions to the rule where existing kthread_bind() calls could
> be converted to set_cpus_allowed_ptr(), but I think we should enumerate
> those cases and discuss the hazards that could be associated with
> changing their cpu affinity.

I think what can and cannot be moved is a separate question. As far as the
cpu affinity API for kthreads goes it makes sense to support both
scenarios. See below.

> I'd also like to hear why you choose to move your workqueue threads off
> their originating cpu.

CPU isolation. ie To avoid kernel activity on certain CPUs.

>> On the related note (this seems like the right crowd :). What do people
>> think about kthreads and cpusets in general. We currently have a bit of a
>> disconnect in the logic.
>> 1. kthreads can be put into a cpuset at which point their cpus_allowed
>>    mask is updated properly
>> 2. kthread's cpus_allowed mask is updated properly when cpuset setup
>>    changes (cpus added, removed, etc).
>> 3. kthreads inherit a cpuset from a parent (kthreadd for example) _but_
>>    they either do kthread_bind() or set_cpus_allowed() and both of those
>>    simply ignore inherited cpusets.
>
> With my patch, this is slightly different.  Kthreads that have called
> kthread_bind() can have a different cpu affinity than the cpus_allowed of
> their cpuset.  This happens when such a kthread is attached to a cpuset
> and its 'cpus' file subsequently changes.  Cpusets is written correctly to
> use set_cpus_allowed_ptr() so that this change in affinity is now rejected
> for PF_THREAD_BOUND tasks, yet the kthread is still a member of the
> cpuset.

That would be inconsistent and confusing, I think. If a task is in the
cpuset X all constraints of the cpuset X must apply. If some constraints
cannot be satisfied then that task should not be in the cpuset X.

> It's debatable whether the kthread should still remain as a member of the
> cpuset, but there are significant advantages: cpusets offers more than
> just a simple way to do sched_setaffinity() over an aggregate of tasks.

Yes, cpusets are not only about cpu affinity. But again the behaviour
should be consistent across the board. cpuset.cpus must apply to all the
tasks in the set, not just some of the tasks.

Also I think you missed my other point/suggestion. Yes, your patch fixes
one problem, which is that kthreads that must not move will not move.
Although like I said the behaviour with regards to the cpuset is not
consistent and is confusing. The other thing that I pointed out is that
kthreads that _can_ move do not honor cpuset constraints. Let me give
another example. Forget the workqueues for a second. Kthreads like
scsi_eh, kswapd, kacpid, etc, etc are all started with
cpus_allowed=ALL_CPUS even though they inherit a cpuset from kthreadd.
Yes, I can move them manually but the behaviour is not consistent, and for
no good reason. kthreads can be stopped/started at any time (module load
for example) which means that I'd have to keep moving them.

To sum it up here is what I'm suggesting:

	kthread_bind(task, cpu)
	{
		// Set PF_THREAD_BOUND
		// Move into root cpuset
		// Bind to the cpu
	}

	kthread_setaffinity(task, cpumask)
	{
		// Enforce cpuset.cpus_allowed
		// Update affinity mask and migrate kthread (if needed)
	}

Instead of calling set_cpus_allowed_ptr(), kthreads that do not need
strict cpu binding will be calling kthread_setaffinity(). That way the
behaviour is consistent across the board.
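
A rough, untested sketch of the non-strict half (kthread_setaffinity() is
made up; cpuset_cpus_allowed() and the cpumask ops exist today):

	int kthread_setaffinity(struct task_struct *k, cpumask_t mask)
	{
		cpumask_t allowed;

		/* enforce the cpuset the kthread inherited */
		allowed = cpuset_cpus_allowed(k);
		cpus_and(mask, mask, allowed);
		if (cpus_empty(mask))
			return -EINVAL;
		return set_cpus_allowed_ptr(k, &mask);
	}

Makes sense ?

Max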

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: David Rientjes @ 2008-06-10 17:04 UTC
To: Max Krasnyansky
Cc: Paul Jackson, mingo, peterz, menage, linux-kernel, Oleg Nesterov

On Mon, 9 Jun 2008, Max Krasnyansky wrote:

> > I'd also like to hear why you choose to move your workqueue threads off
> > their originating cpu.
>
> CPU isolation. ie To avoid kernel activity on certain CPUs.

This probably isn't something that you should be doing, at least with the
workqueue threads.  The slab cache reaper, for example, depends on being
able to drain caches for each cpu and will be neglected if they are moved.

I'm curious why you haven't encountered problems with this while isolating
per-cpu workqueue threads in cpusets that don't have access to their own
cpu.  Regardless, we'd need a patch to the slab layer and ack'd by the
appropriate people at this point to allow the exception.

> Yes, cpusets are not only about cpu affinity. But again the behaviour
> should be consistent across the board. cpuset.cpus must apply to all the
> tasks in the set, not just some of the tasks.

It has always been possible to assign a task to a cpuset and then further
constrict its set of allowable cpus with sched_setaffinity().  So, while
the cpus_allowed in this case are always a subset of the cpuset's cpus,
you could still describe this as inconsistent.

> To sum it up here is what I'm suggesting:
>
> 	kthread_bind(task, cpu)
> 	{
> 		// Set PF_THREAD_BOUND
> 		// Move into root cpuset
> 		// Bind to the cpu
> 	}

kthread_bind() usually happens immediately following kthread_create(), so
it should already be in the root cpuset.  If it has been forked in a
different cpuset, however, implicitly moving it may be more harmful than
any inconsistency that exists in cpus_allowed.

David

* cpusets and kthreads, inconsistent behaviour
From: Max Krasnyansky @ 2008-06-10 16:30 UTC
To: Paul Jackson, mingo
Cc: David Rientjes, Peter Zijlstra, menage, linux-kernel

I pointed this out in the email thread about the PF_THREAD_BOUND patch and
wanted to restart the thread to make sure that people pay attention :).
I was going to cook up a patch for this and wanted to get some early
feedback to avoid wasted time.

Basically the issue is that the current behaviour of cpusets is
inconsistent with regards to kthreads. Kthreads inherit a cpuset from
their parent properly but they simply ignore cpuset.cpus when their cpu
affinity is set/updated.

I think the behaviour must be consistent across the board. cpuset.cpus
must apply to _all_ the tasks in the set, not just some of the tasks. If a
kthread must run on cpus other than current_cpuset.cpus then it should
detach from the cpuset.

To give you an example, kthreads like scsi_eh, kswapd, kacpid, pdflush,
kseriod, etc are all started with cpus_allowed=ALL_CPUS even though they
inherit a cpuset from kthreadd. Yes, they can be moved manually (with
sched_setaffinity) but the behaviour is not consistent, and for no good
reason. kthreads can be stopped/started at any time (module load for
example) which means that the user will have to keep moving them.

To sum it up here is what I'm suggesting:

	kthread_bind(task, cpu)
	{
		// Set PF_THREAD_BOUND
		// Move into root cpuset
		// Bind to the cpu
	}

	kthread_setaffinity(task, cpumask)
	{
		// Enforce cpuset.cpus_allowed
		// Update affinity mask and migrate kthread (if needed)
	}

Kthreads that do not require strict cpu binding will call
kthread_setaffinity() instead of set_cpus_allowed_ptr() and such.
Kthreads that require strict cpu binding will call kthread_bind() and
detach from the cpuset they inherit from their parent. That way the
behaviour is consistent across the board.

Comments ?
Max

* Re: cpusets and kthreads, inconsistent behaviour
From: David Rientjes @ 2008-06-10 18:47 UTC
To: Max Krasnyansky; +Cc: Paul Jackson, mingo, Peter Zijlstra, menage, linux-kernel

On Tue, 10 Jun 2008, Max Krasnyansky wrote:

> Basically the issue is that the current behaviour of cpusets is
> inconsistent with regards to kthreads. Kthreads inherit a cpuset from
> their parent properly but they simply ignore cpuset.cpus when their cpu
> affinity is set/updated.
> I think the behaviour must be consistent across the board. cpuset.cpus
> must apply to _all_ the tasks in the set, not just some of the tasks. If a
> kthread must run on cpus other than current_cpuset.cpus then it should
> detach from the cpuset.

I disagree that a cpuset's set of allowable cpus should apply to all tasks
attached to that cpuset.  It's certainly beneficial to be able to further
constrict the set of allowed cpus for a task using sched_setaffinity().

It makes more sense to argue that for each task p, p->cpus_allowed is a
subset of task_cs(p)->cpus_allowed.

> To give you an example, kthreads like scsi_eh, kswapd, kacpid, pdflush,
> kseriod, etc are all started with cpus_allowed=ALL_CPUS even though they
> inherit a cpuset from kthreadd. Yes, they can be moved manually (with
> sched_setaffinity) but the behaviour is not consistent, and for no good
> reason. kthreads can be stopped/started at any time (module load for
> example) which means that the user will have to keep moving them.

This doesn't seem to be purely a kthread issue.  Tasks can be moved to a
disjoint set of cpus by any caller to set_cpus_allowed_ptr() in the
kernel.

David

* Re: cpusets and kthreads, inconsistent behaviour
From: Max Krasnyansky @ 2008-06-10 20:44 UTC
To: David Rientjes; +Cc: Paul Jackson, mingo, Peter Zijlstra, menage, linux-kernel

David Rientjes wrote:
> On Tue, 10 Jun 2008, Max Krasnyansky wrote:
>
>> Basically the issue is that the current behaviour of cpusets is
>> inconsistent with regards to kthreads. Kthreads inherit a cpuset from
>> their parent properly but they simply ignore cpuset.cpus when their cpu
>> affinity is set/updated.
>> I think the behaviour must be consistent across the board. cpuset.cpus
>> must apply to _all_ the tasks in the set, not just some of the tasks. If a
>> kthread must run on cpus other than current_cpuset.cpus then it should
>> detach from the cpuset.
>>
>
> I disagree that a cpuset's set of allowable cpus should apply to all tasks
> attached to that cpuset.  It's certainly beneficial to be able to further
> constrict the set of allowed cpus for a task using sched_setaffinity().
>
> It makes more sense to argue that for each task p, p->cpus_allowed is a
> subset of task_cs(p)->cpus_allowed.

Yes, that's exactly what I meant :). Sorry for not being clear. I did not
mean that each task in a cpuset must have the same affinity mask. So we're
on the same page here.

> This doesn't seem to be purely a kthread issue.  Tasks can be moved to a
> disjoint set of cpus by any caller to set_cpus_allowed_ptr() in the
> kernel.

Hmm, technically you are correct of course. But we do not have any other
kernel tasks besides kthreads. All the kernel tasks I have on my machines
have kthreadd as their parent.
And I'm not aware of any kernel code that changes the affinity mask of a
user-space task without paying attention to the cpuset the task belongs
to. If you know of any we should fix it because it'd clearly be a bug.

Thanx
Max

* Re: cpusets and kthreads, inconsistent behaviour
From: David Rientjes @ 2008-06-10 20:54 UTC
To: Max Krasnyansky; +Cc: Paul Jackson, mingo, Peter Zijlstra, menage, linux-kernel

On Tue, 10 Jun 2008, Max Krasnyansky wrote:

> Hmm, technically you are correct of course. But we do not have any other
> kernel tasks besides kthreads. All the kernel tasks I have on my machines
> have kthreadd as their parent.
> And I'm not aware of any kernel code that changes the affinity mask of a
> user-space task without paying attention to the cpuset the task belongs
> to. If you know of any we should fix it because it'd clearly be a bug.

This is why it shouldn't belong in the sched or kthread code; the
discrepancy that you point out between p->cpus_allowed and
task_cs(p)->cpus_allowed is a cpuset created one.

So to avoid having tasks with a cpus_allowed mask that is not a subset of
its cpuset's set of allowable cpus, the solution would probably be to add
a flavor of cpuset_update_task_memory_state() for a cpus generation value.

* Re: cpusets and kthreads, inconsistent behaviour
From: Max Krasnyansky @ 2008-06-10 21:15 UTC
To: David Rientjes; +Cc: Paul Jackson, mingo, Peter Zijlstra, menage, linux-kernel

David Rientjes wrote:
> On Tue, 10 Jun 2008, Max Krasnyansky wrote:
>
>> Hmm, technically you are correct of course. But we do not have any other
>> kernel tasks besides kthreads. All the kernel tasks I have on my machines
>> have kthreadd as their parent.
>> And I'm not aware of any kernel code that changes the affinity mask of a
>> user-space task without paying attention to the cpuset the task belongs
>> to. If you know of any we should fix it because it'd clearly be a bug.
>
> This is why it shouldn't belong in the sched or kthread code; the
> discrepancy that you point out between p->cpus_allowed and
> task_cs(p)->cpus_allowed is a cpuset created one.

I guess I do not see how the kernel task case is different from
sched_setaffinity(). ie sched_setaffinity() is in the scheduler but it
deals with cpusets. Also in this case the discrepancy is created _not_ by
the cpuset, it's created due to the lack of an appropriate API. In other
words if we had something like kthread_setaffinity() from day one this
would have been a non-issue :).

> So to avoid having tasks with a cpus_allowed mask that is not a subset of
> its cpuset's set of allowable cpus, the solution would probably be to add
> a flavor of cpuset_update_task_memory_state() for a cpus generation value.

Policing after the fact would not work well in some scenarios due to the
lag time. ie By the time cpusets realized that some kthread is running on
the wrong cpus it may be too late. We should just enforce the cpuset
constraint when kthreads change their affinity mask, just like
sched_setaffinity() already does.

Max

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: Peter Zijlstra @ 2008-06-10 6:44 UTC
To: Max Krasnyanskiy
Cc: David Rientjes, Paul Jackson, mingo, menage, linux-kernel, Oleg Nesterov

On Mon, 2008-06-09 at 13:59 -0700, Max Krasnyanskiy wrote:
> David Rientjes wrote:
> >> 2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:
> >>
> >> 	drivers/infiniband/hw/ehca/ehca_irq.c:  kthread_bind(cct->task, any_online_cpu(cpu_online_map));
> >>
> >> In such cases, the PF_THREAD_BOUND seems inappropriate.  The caller of
> >> kthread_bind() really doesn't seem to care where that thread is bound;
> >> they just want it on a CPU that is still online.
> >>
> >
> > This particular case is simply moving the thread to any online cpu so that
> > it survives long enough for the subsequent kthread_stop() in
> > destroy_comp_task().  So I don't see a problem with this instance.
> >
> > A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
> > return, but I haven't found any cases in the tree where that is currently
> > necessary.  And doing that would defeat the semantics of kthread_bind()
> > where these threads are supposed to be bound to a specific cpu and not
> > allowed to run on others.
>
> Actually I have another use case here. Above example in particular may be ok
> but it does demonstrate the issue nicely. Which is that in some cases
> kthreads are bound to a CPU but do not have a strict "must run here"
> requirement and could be moved if needed.
> For example I need an ability to move workqueue threads. Workqueue threads
> do kthread_bind().

Per cpu workqueues should stay on their cpu.

What you're really looking for is a more fine grained alternative to
flush_workqueue().

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: Max Krasnyansky @ 2008-06-10 15:38 UTC
To: Peter Zijlstra
Cc: David Rientjes, Paul Jackson, mingo, menage, linux-kernel, Oleg Nesterov

Peter Zijlstra wrote:
> On Mon, 2008-06-09 at 13:59 -0700, Max Krasnyanskiy wrote:
>> David Rientjes wrote:
>>>> 2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:
>>>>
>>>> 	drivers/infiniband/hw/ehca/ehca_irq.c:  kthread_bind(cct->task, any_online_cpu(cpu_online_map));
>>>>
>>>> In such cases, the PF_THREAD_BOUND seems inappropriate.  The caller of
>>>> kthread_bind() really doesn't seem to care where that thread is bound;
>>>> they just want it on a CPU that is still online.
>>>>
>>> This particular case is simply moving the thread to any online cpu so that
>>> it survives long enough for the subsequent kthread_stop() in
>>> destroy_comp_task().  So I don't see a problem with this instance.
>>>
>>> A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
>>> return, but I haven't found any cases in the tree where that is currently
>>> necessary.  And doing that would defeat the semantics of kthread_bind()
>>> where these threads are supposed to be bound to a specific cpu and not
>>> allowed to run on others.
>> Actually I have another use case here. Above example in particular may be ok
>> but it does demonstrate the issue nicely. Which is that in some cases
>> kthreads are bound to a CPU but do not have a strict "must run here"
>> requirement and could be moved if needed.
>> For example I need an ability to move workqueue threads. Workqueue threads
>> do kthread_bind().
>
> Per cpu workqueues should stay on their cpu.
>
> What you're really looking for is a more fine grained alternative to
> flush_workqueue().

Actually I had a discussion on that with Oleg Nesterov. If you remember, my
original solution (ie a centralized cpu_isolate_map) was to completely
redirect work onto other cpus. Then you pointed out that it's the flush_()
that really makes the box stuck. So I started thinking about redoing the
flush. While looking at the code I realized that if I only change the
flush_() then queued work can get stale, so to speak. ie The machine does
not get stuck but some work submitted on the isolated cpus will sit there
for a long time. Oleg pointed out exactly the same thing. So the simplest
solution that does not require any surgery to the workqueues is to just
move the threads to other cpus.

I did not want to get into too much detail on the workqueue stuff here.
I'll start a separate thread on this. As I pointed out, there are a bunch
of other kthreads like kswapd, kacpid, pdflush, khubd, etc, etc, that
clearly do not need any pinning but still violate the cpuset constraints
they inherit from kthreadd.
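
For concreteness, the "moving" I do today is nothing more than this from
user space (error handling and the pid lookup are omitted):

	#define _GNU_SOURCE
	#include <sched.h>

	/* move a kthread (eg. events/3) onto cpu 0; note that with the
	 * PF_THREAD_BOUND patch this starts failing with EINVAL for
	 * bound kthreads such as the workqueue threads */
	static int move_kthread(pid_t pid, int cpu)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		return sched_setaffinity(pid, sizeof(set), &set);
	}

Max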

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: Oleg Nesterov @ 2008-06-10 17:00 UTC
To: Max Krasnyansky
Cc: Peter Zijlstra, David Rientjes, Paul Jackson, mingo, menage, linux-kernel

On 06/10, Max Krasnyansky wrote:
>
> Peter Zijlstra wrote:
> >
> > Per cpu workqueues should stay on their cpu.
> >
> > What you're really looking for is a more fine grained alternative to
> > flush_workqueue().
>
> Actually I had a discussion on that with Oleg Nesterov. If you remember, my
> original solution (ie a centralized cpu_isolate_map) was to completely
> redirect work onto other cpus. Then you pointed out that it's the flush_()
> that really makes the box stuck. So I started thinking about redoing the
> flush. While looking at the code I realized that if I only change the
> flush_() then queued work can get stale, so to speak. ie The machine does
> not get stuck but some work submitted on the isolated cpus will sit there
> for a long time. Oleg pointed out exactly the same thing. So the simplest
> solution that does not require any surgery to the workqueues is to just
> move the threads to other cpus.

Cough... I'd like to mention that I _personally_ agree with Peter,
cwq->thread's should stay on their cpu.

I just meant that from the workqueue.c pov it is (afaics) OK to move
cwq->thread to other CPUs, in a sense that this shouldn't add races or
hotplug problems, etc.  But still this doesn't look right to me.

Oleg.

* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
From: Peter Zijlstra @ 2008-06-10 17:19 UTC
To: Oleg Nesterov
Cc: Max Krasnyansky, David Rientjes, Paul Jackson, mingo, menage, linux-kernel

On Tue, 2008-06-10 at 21:00 +0400, Oleg Nesterov wrote:
> On 06/10, Max Krasnyansky wrote:
> >
> > Peter Zijlstra wrote:
> > >
> > > Per cpu workqueues should stay on their cpu.
> > >
> > > What you're really looking for is a more fine grained alternative to
> > > flush_workqueue().
> > Actually I had a discussion on that with Oleg Nesterov. If you remember, my
> > original solution (ie a centralized cpu_isolate_map) was to completely
> > redirect work onto other cpus. Then you pointed out that it's the flush_()
> > that really makes the box stuck. So I started thinking about redoing the
> > flush. While looking at the code I realized that if I only change the
> > flush_() then queued work can get stale, so to speak. ie The machine does
> > not get stuck but some work submitted on the isolated cpus will sit there
> > for a long time. Oleg pointed out exactly the same thing. So the simplest
> > solution that does not require any surgery to the workqueues is to just
> > move the threads to other cpus.
>
> Cough... I'd like to mention that I _personally_ agree with Peter,
> cwq->thread's should stay on their cpu.
>
> I just meant that from the workqueue.c pov it is (afaics) OK to move
> cwq->thread to other CPUs, in a sense that this shouldn't add races or
> hotplug problems, etc.  But still this doesn't look right to me.

The advantage of creating a more flexible or fine-grained flush is that
large machines also profit from it.

A simple scheme would be creating a workqueue context that is passed
along on enqueue, and then passed to flush.

This context could:

 - either track the individual worklets and employ a completion scheme
   to wait for them;

 - or track on which cpus the worklets are enqueued and flush only those
   few cpus.

Doing this would solve your case since nobody (except those having
business) will enqueue something on the isolated cpus.

And it will improve the large machine case for the same reasons - it
won't have to iterate all cpus.

Of course, things that use schedule_on_each_cpu() will still end up
doing things on your isolated cpus, but getting around those would
probably get you into some correctness trouble.
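
Something like the below for the second option - completely untested,
all names invented:

	struct wq_flush_ctx {
		cpumask_t cpus;			/* cpus that saw an enqueue */
	};

	static void ctx_queue_work(struct workqueue_struct *wq,
				   struct wq_flush_ctx *ctx,
				   struct work_struct *work)
	{
		int cpu = get_cpu();

		cpu_set(cpu, ctx->cpus);	/* remember where it went */
		queue_work(wq, work);		/* queues on the local cpu */
		put_cpu();
	}

	static void ctx_flush_workqueue(struct workqueue_struct *wq,
					struct wq_flush_ctx *ctx)
	{
		int cpu;

		/* flush only the cpus that actually got work; this needs
		 * a per-cpu flush primitive, which workqueue.c has
		 * internally (flush_cpu_workqueue) but doesn't export */
		for_each_cpu_mask(cpu, ctx->cpus)
			flush_percpu_workqueue(wq, cpu);	/* hypothetical */
		cpus_clear(ctx->cpus);
	}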

* workqueue cpu affinity
From: Max Krasnyansky @ 2008-06-10 20:24 UTC
To: Peter Zijlstra, Oleg Nesterov, mingo, Andrew Morton
Cc: David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell

Ok, looks like we got deeper into the workqueue discussion in the wrong
mail thread :). So let me restart it.

Here is some background on this. Full cpu isolation requires some tweaks
to the workqueue handling. Either the workqueue threads need to be moved
(which is my current approach), or work needs to be redirected when it's
submitted. flush_*_work() needs to be improved too. See Peter's reply
below.

The first reaction that a lot of people get is "oh no no, this is bad,
this will not work". Which is understandable but _wrong_ ;-). See below
for more details and analysis.

One thing that helps in accepting this isolation idea is to think about
the use cases. There are two use cases for it:
1. Normal threaded RT apps with threads that use system calls, block on
   events, etc.
2. Specialized RT apps with thread(s) that require close to 100% of the
   CPU resources. Their threads avoid using system calls and avoid
   blocking. This is done to achieve very low latency and low overhead.

Scenario #1 is straightforward. You'd want to isolate the processor the
RT app is running on to avoid typical sources of latency. Workqueues
running on the same processor are not an issue (because RT threads
block), but you do not get the same latency guaranties.

Workqueues are an issue for scenario #2. Workqueue kthreads do not get a
chance to run because the user's RT threads are higher priority. However
those RT threads should not use regular kernel services because that by
definition means that they are not getting ~100% of the CPU they want. In
other words they cannot have it both ways :).

Therefore it's expected that the kernel won't be used heavily on those
cpus, and nothing really schedules workqueues and stuff. It's also
expected that certain kernel services may not be available on those CPUs.
Again we cannot have it both ways. ie Have all the kernel services and
yet the kernel is not supposed to use much CPU time :).

If at this point people still get this "Oh no, that's wrong" feeling,
please read this excellent statement by Paul J

"A key reason that Linux has succeeded is that it actively seeks to work
for a variety of people, purposes and products. One operating system is
now a strong player in the embedded market, the real time market, and the
High Performance Computing market, as well as being an important player
in a variety of other markets. That's a rather stunning success."
	— Paul Jackson, in a June 4th, 2008 message on the Linux Kernel
	  mailing list.

btw Paul, it got picked up by kerneltrap.org
	http://kerneltrap.org/Quote/A_Rather_Stunning_Success

Sorry for the lengthy intro. Back to the technical discussions.

Peter Zijlstra wrote:
> The advantage of creating a more flexible or fine-grained flush is that
> large machines also profit from it.

I agree, our current workqueue flush scheme is expensive because it has
to schedule on each online cpu. So yes, improving flush makes sense in
general.

> A simple scheme would be creating a workqueue context that is passed
> along on enqueue, and then passed to flush.
>
> This context could:
>
> - either track the individual worklets and employ a completion scheme
>   to wait for them;
>
> - or track on which cpus the worklets are enqueued and flush only those
>   few cpus.
>
> Doing this would solve your case since nobody (except those having
> business) will enqueue something on the isolated cpus.
>
> And it will improve the large machine case for the same reasons - it
> won't have to iterate all cpus.

This will require a bit of surgery across the entire tree. There is a lot
of code that calls flush_scheduled_work(). All that would have to be
changed, which is ok, but I think as the first step we could simply allow
moving workqueue threads out of the cpus where that load is undesirable
and make people aware of what happens in that case.
When I get a chance I'll look into the flush scheme you proposed above.

> Of course, things that use schedule_on_each_cpu() will still end up
> doing things on your isolated cpus, but getting around those would
> probably get you into some correctness trouble.

There is literally a _single_ user of that API. Actually, let's look at
all the current users of the schedule_on(cpu) kind of API:

	git grep 'queue_delayed_work_on\|schedule_delayed_work_on\|schedule_on_each_cpu' | \
		grep -v 'workqueue\.[ch]\|\.txt'

> drivers/cpufreq/cpufreq_ondemand.c: queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work, delay);
> drivers/cpufreq/cpufreq_ondemand.c: queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work,

No big deal. Worst case the cpufreq state for that cpu will be stale.
RT apps would not want the cpufreq governor messing with the cpu
frequencies anyway. So if you look back at the scenarios #1 and #2 I
described above this is a non-issue.

> drivers/macintosh/rack-meter.c: schedule_delayed_work_on(cpu, &rcpu->sniffer,
> drivers/macintosh/rack-meter.c: schedule_delayed_work_on(cpu, &rm->cpu[cpu].sniffer,

Not a big deal either. In the worst case stats for the isolated cpus will
not be updated. Can probably be converted to timers.

> drivers/oprofile/cpu_buffer.c: schedule_delayed_work_on(i, &b->work, DEFAULT_TIMER_EXPIRE + i);
> drivers/oprofile/cpu_buffer.c: * By using schedule_delayed_work_on and then schedule_delayed_work

Yep, I mentioned before that messing with the workqueues breaks oprofile.
So yes, this one is an issue. However again it's not a catastrophic
failure of the system. Oprofile will not be able to collect samples from
the CPU the RT app is running on and it actually warns the user about it
(it prints an error that the work is running on the wrong cpu). I'm
working on a patch that collects samples via IPI or per cpu timer. It
will be configurable of course. So this one is not a big deal either.

> mm/slab.c: schedule_delayed_work_on(cpu, reap_work,

Garbage collection. Again see the scenarios I described above. If the
kernel is not being heavily used on the isolated cpu there is not a whole
lot of SLAB activity, so not running the garbage collector is not a big
deal.
Also SLUB does not have a per cpu garbage collector; people running RT
apps should simply switch to SLUB. So this one is a non-issue.

> mm/swap.c: return schedule_on_each_cpu(lru_add_drain_per_cpu);

This one is swap LRU handling. This is the only user of
schedule_on_each_cpu() btw. This case is similar to the above cases. Most
people doing RT either have no swap at all, or avoid any kind of swapping
activity on the CPUs used for RT. If they aren't already, they should :).

> mm/vmstat.c: schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);

Not sure if it's an issue or not. It has not been for me. And again, if
it is an issue it's not a catastrophic-failure kind of thing. There is
not a whole lot of VM activity on the cpus running RT apps, otherwise
they won't run for very long ;-).

So as you can see, for all the current users that require strict
workqueue cpu affinity, breaking that affinity is at most an
inconvenience (like not being able to profile cpuX, or stale stats).
Nothing fundamental fails. We've been running all kinds of machines with
both scenarios #1 and #2 for weeks (rebooted for upgrades only). They do
not show any more problems than machines with the regular setup.

There may be some other users that implicitly rely on the workqueue
affinity, but I could not easily find them by looking at the code, nor
did they show up during testing. If you know of any please let me know;
we should convert them from schedule_work() to schedule_work_on(cpuX) to
make the requirements clear.

Thanks
Max

* Re: workqueue cpu affinity
From: Peter Zijlstra @ 2008-06-11 6:49 UTC
To: Max Krasnyansky
Cc: Oleg Nesterov, mingo, Andrew Morton, David Rientjes, Paul Jackson,
    menage, linux-kernel, Mark Hounschell, Thomas Gleixner

On Tue, 2008-06-10 at 13:24 -0700, Max Krasnyansky wrote:
> Ok, looks like we got deeper into the workqueue discussion in the wrong
> mail thread :). So let me restart it.
>
> Here is some background on this. Full cpu isolation requires some tweaks
> to the workqueue handling. Either the workqueue threads need to be moved
> (which is my current approach), or work needs to be redirected when it's
> submitted. flush_*_work() needs to be improved too. See Peter's reply
> below.
>
> The first reaction that a lot of people get is "oh no no, this is bad,
> this will not work". Which is understandable but _wrong_ ;-). See below
> for more details and analysis.
>
> One thing that helps in accepting this isolation idea is to think about
> the use cases. There are two use cases for it:
> 1. Normal threaded RT apps with threads that use system calls, block on
>    events, etc.
> 2. Specialized RT apps with thread(s) that require close to 100% of the
>    CPU resources. Their threads avoid using system calls and avoid
>    blocking. This is done to achieve very low latency and low overhead.
>
> Scenario #1 is straightforward. You'd want to isolate the processor the
> RT app is running on to avoid typical sources of latency. Workqueues
> running on the same processor are not an issue (because RT threads
> block), but you do not get the same latency guaranties.
>
> Workqueues are an issue for scenario #2. Workqueue kthreads do not get a
> chance to run because the user's RT threads are higher priority. However
> those RT threads should not use regular kernel services because that by
> definition means that they are not getting ~100% of the CPU they want.
> In other words they cannot have it both ways :).
>
> Therefore it's expected that the kernel won't be used heavily on those
> cpus, and nothing really schedules workqueues and stuff. It's also
> expected that certain kernel services may not be available on those
> CPUs. Again we cannot have it both ways. ie Have all the kernel services
> and yet the kernel is not supposed to use much CPU time :).
>
> If at this point people still get this "Oh no, that's wrong" feeling,
> please read this excellent statement by Paul J
>
> "A key reason that Linux has succeeded is that it actively seeks to work
> for a variety of people, purposes and products. One operating system is
> now a strong player in the embedded market, the real time market, and
> the High Performance Computing market, as well as being an important
> player in a variety of other markets. That's a rather stunning success."
> 	— Paul Jackson, in a June 4th, 2008 message on the Linux Kernel
> 	  mailing list.

While true, that doesn't mean we'll just merge anything :-)

> btw Paul, it got picked up by kerneltrap.org
> 	http://kerneltrap.org/Quote/A_Rather_Stunning_Success
>
> Sorry for the lengthy intro. Back to the technical discussions.
>
> Peter Zijlstra wrote:
> > The advantage of creating a more flexible or fine-grained flush is that
> > large machines also profit from it.
> I agree, our current workqueue flush scheme is expensive because it has
> to schedule on each online cpu. So yes, improving flush makes sense in
> general.
>
> > A simple scheme would be creating a workqueue context that is passed
> > along on enqueue, and then passed to flush.
> >
> > This context could:
> >
> > - either track the individual worklets and employ a completion scheme
> >   to wait for them;
> >
> > - or track on which cpus the worklets are enqueued and flush only those
> >   few cpus.
> >
> > Doing this would solve your case since nobody (except those having
> > business) will enqueue something on the isolated cpus.
> >
> > And it will improve the large machine case for the same reasons - it
> > won't have to iterate all cpus.
> This will require a bit of surgery across the entire tree. There is a lot
> of code that calls flush_scheduled_work(). All that would have to be
> changed, which is ok, but I think as the first step we could simply allow
> moving workqueue threads out of the cpus where that load is undesirable
> and make people aware of what happens in that case.
> When I get a chance I'll look into the flush scheme you proposed above.
>
> > Of course, things that use schedule_on_each_cpu() will still end up
> > doing things on your isolated cpus, but getting around those would
> > probably get you into some correctness trouble.
> There is literally a _single_ user of that API.

There are quite a bit more on -rt, where a lot of on_each_cpu() callers,
that now use IPIs and run in hardirq context, are converted to schedule.

> Actually, let's look at all the current users of the schedule_on(cpu)
> kind of API:
>
> 	git grep 'queue_delayed_work_on\|schedule_delayed_work_on\|schedule_on_each_cpu' | \
> 		grep -v 'workqueue\.[ch]\|\.txt'
>
> > drivers/cpufreq/cpufreq_ondemand.c: queue_delayed_work_on(cpu, kondemand_wq, &dbs_info->work, delay);
> > drivers/cpufreq/cpufreq_ondemand.c: queue_delayed_work_on(dbs_info->cpu, kondemand_wq, &dbs_info->work,
> No big deal. Worst case the cpufreq state for that cpu will be stale.
> RT apps would not want the cpufreq governor messing with the cpu
> frequencies anyway. So if you look back at the scenarios #1 and #2 I
> described above this is a non-issue.

Sure, ondemand cpu_freq doesn't make sense while running (hard) rt apps.

> > drivers/macintosh/rack-meter.c: schedule_delayed_work_on(cpu, &rcpu->sniffer,
> > drivers/macintosh/rack-meter.c: schedule_delayed_work_on(cpu, &rm->cpu[cpu].sniffer,
> Not a big deal either. In the worst case stats for the isolated cpus will
> not be updated. Can probably be converted to timers.

sure... see below [1]

> > drivers/oprofile/cpu_buffer.c: schedule_delayed_work_on(i, &b->work, DEFAULT_TIMER_EXPIRE + i);
> > drivers/oprofile/cpu_buffer.c: * By using schedule_delayed_work_on and then schedule_delayed_work
> Yep, I mentioned before that messing with the workqueues breaks oprofile.
> So yes, this one is an issue. However again it's not a catastrophic
> failure of the system. Oprofile will not be able to collect samples from
> the CPU the RT app is running on and it actually warns the user about it
> (it prints an error that the work is running on the wrong cpu). I'm
> working on a patch that collects samples via IPI or per cpu timer. It
> will be configurable of course. So this one is not a big deal either.

NMI/timers sound like a good way to run oprofile - I thought it could
already use them.. ? Anyway, see below.. [2]

> > mm/slab.c: schedule_delayed_work_on(cpu, reap_work,
> Garbage collection. Again see the scenarios I described above. If the
> kernel is not being heavily used on the isolated cpu there is not a whole
> lot of SLAB activity, so not running the garbage collector is not a big
> deal.
> Also SLUB does not have a per cpu garbage collector; people running RT
> apps should simply switch to SLUB. So this one is a non-issue.

Dude, SLUB uses on_each_cpu(), that's even worse for your #2 case. Hmm
so does SLAB.. and a lot of other code.

> > mm/swap.c: return schedule_on_each_cpu(lru_add_drain_per_cpu);
> This one is swap LRU handling. This is the only user of
> schedule_on_each_cpu() btw. This case is similar to the above cases. Most
> people doing RT either have no swap at all, or avoid any kind of swapping
> activity on the CPUs used for RT. If they aren't already, they should :).

It isn't actually swap only - its all paging, including pagecache etc..
Still, you're probably right in that the per cpu lrus are empty, but why
not improve the current scheme by keeping a cpumask of cpus with
non-empty pagevecs, that way everybody wins.

> > mm/vmstat.c: schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
> Not sure if it's an issue or not. It has not been for me. And again, if
> it is an issue it's not a catastrophic-failure kind of thing. There is
> not a whole lot of VM activity on the cpus running RT apps, otherwise
> they won't run for very long ;-).

Looking at this code I'm not seeing the harm in letting it run - even
for your #2 case, it certainly is not worse than some of the
on_each_cpu() code, and starving it doesn't seem like a big issue.

---

I'm worried by your approach to RT - both your solutions [1,2] and the
oversight of the on_each_cpu() stuff seem to indicate you don't care
about some jitter on your isolated cpu.

Timers and on_each_cpu() code run with hardirqs disabled and can do all
kinds of funny stuff like spin on shared locks. This will certainly
affect your #2 case.

Again, the problem with most of your ideas is that they are very narrow -
they fail to consider the bigger picture/other use-cases. To quote Paul
again: "A key reason that Linux has succeeded is that it actively seeks
to work for a variety of people, purposes and products"

You often seem to forget 'variety' and target only your one use-case. I'm
not saying it doesn't work for you - I'm just saying that by putting in a
little more effort (ok, -rt is a lot more effort) we can make it work for
a lot more people by taking out a lot of the restrictions you've put upon
yourself.

Please don't take this too personal - I'm glad you're working on this.
I'm just trying to see what we can generalize.
* Re: workqueue cpu affinity
2008-06-11  6:49 ` Peter Zijlstra
@ 2008-06-11 19:02 ` Max Krasnyansky
2008-06-12 18:44 ` Peter Zijlstra
0 siblings, 1 reply; 31+ messages in thread
From: Max Krasnyansky @ 2008-06-11 19:02 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, mingo, Andrew Morton, David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell, Thomas Gleixner

Peter Zijlstra wrote:
> On Tue, 2008-06-10 at 13:24 -0700, Max Krasnyansky wrote:
>> If at this point people still get this "Oh no, that's wrong" feeling, please
>> read this excellent statement by Paul J
>>
>> "A key reason that Linux has succeeded is that it actively seeks to work for a
>> variety of people, purposes and products. One operating system is now a strong
>> player in the embedded market, the real time market, and the High Performance
>> Computing market, as well as being an important player in a variety of other
>> markets. That's a rather stunning success."
>> — Paul Jackson, in a June 4th, 2008 message on the Linux Kernel mailing list.
>
> While true, that doesn't mean we'll just merge anything :-)
No, of course not.

>>> Of course, things that use schedule_on_each_cpu() will still end up
>>> doing things on your isolated cpus, but getting around those would
>>> probably get you into some correctness trouble.
>> There is literally a _single_ user of that API.
>
> There are quite a bit more on -rt, where a lot of on_each_cpu() callers
> that now use IPIs and run in hardirq context are converted to schedule.
Yes I noticed that. I did not look deep enough though. I played with -rt
kernels but did not look over all the changes.

>>> drivers/oprofile/cpu_buffer.c: schedule_delayed_work_on(i, &b->work, DEFAULT_TIMER_EXPIRE + i);
>>> drivers/oprofile/cpu_buffer.c: * By using schedule_delayed_work_on and then schedule_delayed_work
>> Yep I mentioned before that messing with the workqueues breaks oprofile. So
>> yes this one is an issue. However again it's not a catastrophic failure of the
>> system. Oprofile will not be able to collect samples from the CPU the RT app is
>> running on and it actually warns the user about it (it prints an error that
>> the work is running on the wrong cpu). I'm working on a patch that collects
>> samples via IPI or per cpu timer. It will be configurable of course. So this
>> one is not a big deal either.
>
> NMI/timers sound like a good way to run oprofile - I thought it could
> already use them...? Anyway, see below... [2]
Yeah, I'm not sure why oprofile uses workqueues. Seems like a much heavier
way to collect samples than a lightweight per cpu timer.

>>> mm/slab.c: schedule_delayed_work_on(cpu, reap_work,
>> Garbage collection. Again see scenarios I described above. If the kernel is
>> not being heavily used on the isolated cpu there is not a whole lot of SLAB
>> activity, so not running the garbage collector is not a big deal.
>> Also SLUB does not have a per cpu garbage collector; people running RT apps
>> should simply switch to SLUB. So this one is a non-issue.
>
> Dude, SLUB uses on_each_cpu(), that's even worse for your #2 case. Hmm,
> so does SLAB... and a lot of other code.
Sure. I was talking about _workqueues_ specifically. There is also
add_timer_on(). Also on_each_cpu() is an IPI. It's not necessarily worse
for the #2 case, depending on what the IPI does of course. In the SLUB
case it's essentially a noop on the CPU that is not a heavy SLUB user.
I've been meaning to profile IPIs and timers but they have not generated
enough noise.
btw Converting everything to threads like -rt does is not necessarily the
best solution for all cases. Look at the "empty SLUB" case. IPI will get
in and out with minimal intrusion. The threaded approach will have to run
the scheduler and do a context switch. Granted in-kernel context switches
are super fast but still not as fast as the IPI. Just to clarify I'm not
saying -rt is doing the wrong thing, I'm just saying it's not black and
white that threads are better in _all_ cases.

>>> mm/swap.c: return schedule_on_each_cpu(lru_add_drain_per_cpu);
>> This one is swap LRU handling. This is the only user of
>> schedule_on_each_cpu() btw. This case is similar to the above cases. Most
>> people doing RT either have no swap at all, or avoid any kind of swapping
>> activity on the CPUs used for RT. If they aren't already they should :).
>
> It isn't actually swap only - it's all paging, including pagecache etc..
> Still, you're probably right in that the per cpu lrus are empty,
Oh, I missed the pagecache part. btw It's also used only for the
CONFIG_NUMA case.

> but why not improve the current scheme by keeping a cpumask of cpus with
> non-empty pagevecs, that way everybody wins.
I did not know enough about it to come up with such an idea :). So yes,
now that you mention it, it sounds like a good thing to do.

>>> mm/vmstat.c: schedule_delayed_work_on(cpu, vmstat_work, HZ + cpu);
>> Not sure if it's an issue or not. It has not been for me.
>> And again if it is an issue it's not a catastrophic failure kind of thing.
>> There is not a whole lot of VM activity on the cpus running RT apps, otherwise
>> they won't run for very long ;-).
>
> Looking at this code I'm not seeing the harm in letting it run - even
> for your #2 case, it certainly is not worse than some of the
> on_each_cpu() code, and starving it doesn't seem like a big issue.
Yes, that's what I meant, ie that nothing needs to be done here. We just
let it run and if it starves it starves, no big deal.

> ---
> I'm worried by your approach to RT - both your solutions [1,2] and
> oversight of the on_each_cpu() stuff seem to indicate you don't care
> about some jitter on your isolated cpu.
Hang on, why do you say an oversight of on_each_cpu() ? I started this
discussion about workqueues specifically. on_each_cpu() is an IPI.
Actually a while ago I looked at most uses of the xxx_on_cpu() api
including for_each_xxx_cpu(). Sure I may have missed something, but there
is no oversight in general. Which is why I've solicited feedback for a
while now.
As I mentioned a while back the reason I'm doing this stuff is precisely
because I care about the jitter. And what I'm getting is 1-2usec on the
isolated CPUs when the non-isolated one is loaded to the extreme. This is
on the NUMA dual Opteron box. None of the existing RT solutions I tried
(RTAI, Xenomai and -rt) could do this.
btw I'm not sure what your concern is for the #1 use case (ie regular -rt
threads that block). It's not exactly my solution ;-). I just mentioned it
as a use case that benefits from CPU isolation. In that case, again due to
the fact that the heaviest latency sources have been moved to the other
CPUs, rt apps enjoy fairly low latencies.

> Timers and on_each_cpu() code run with hardirqs disabled and can do all
> kinds of funny stuff like spin on shared locks. This will certainly
> affect your #2 case.
It does. See above. Once most things are moved, the overall impact of
those IPIs is fairly small.

> Again, the problem with most of your ideas is that they are very narrow
> - they fail to consider the bigger picture/other use-cases.
Uh, that's a very broad statement ;-). I'm actually surprised that you
draw such a conclusion. I'm not even sure what ideas specifically you're
talking about. If you look at my latest cpuisol tree it contains exactly
4 patches. Three that you've already ACKed and are either bug fixes or
simple extensions. So it's essentially a vanilla kernel running on
off-the-shelf hardware and providing awesome latencies. I'm not sure how
that is a bad thing or how it fails to consider the big picture.

> To quote Paul again:
> "A key reason that Linux has succeeded is that it actively seeks
> to work for a variety of people, purposes and products"
>
> You often seem to forget 'variety' and target only your one use-case.
Please be more specific. Which part targets only my use case? I'm assuming
that by _my_ use case you mean #2 - "rt thread that never blocks", because
the other use case is totally generic. I don't see how things that I'm
doing address only my use case. There is clearly some usage of the
scheduler isolation capabilities already. Look for example at this wiki
page:
http://rt.wiki.kernel.org/index.php/CPU_shielding_using_/proc_and_/dev/cpuset
My changes clearly improve that use case.

> I'm not saying it doesn't work for you - I'm just saying that by putting
> in a little more effort (ok, -rt is a lot more effort) we can make it
> work for a lot more people by taking out a lot of the restrictions
> you've put upon yourself.
Actually it may seem like a lot of restrictions but in reality the only
restriction is on how RT threads are designed (and there will always be
some). Otherwise you get an off-the-shelf box with an essentially vanilla
kernel that can be used in any way you like and can run very tight RT
apps at the same time.

> Please don't take this too personally - I'm glad you're working on this.
> I'm just trying to see what we can generalize.
Oh, no worries, I'm not taking this personally, except maybe the "most of
your ideas suck" part which got me a little bit ;-). I'm definitely all
for making it suitable for more general usage.
This is actually the first constructive criticism (except for the "most
of your ideas suck" part :). Oleg had some specific suggestions about
workqueues.
But otherwise it's been mostly an "oh, it feels wrong" kind of thing with
no specific suggestions and/or pointers.
Thanks for your suggestions and pointers. I definitely appreciate them.
I'll send a separate email with the summary of the discussion. This one
is too long :).
Once again thanks a lot for the detailed reply.

Max

^ permalink raw reply	[flat|nested] 31+ messages in thread
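For reference, the user-space thread moving mentioned above needs nothing more than sched_setaffinity(2) on the kthread's pid. A minimal sketch follows (error checking trimmed; the program name is made up). Note that with the PF_THREAD_BOUND patch from this thread applied, such a call would fail with EINVAL for kthreads bound via kthread_bind().

	/* pin_task.c: move a (kernel) thread to a given cpu from user
	 * space.  Usage: pin_task <pid> <cpu>, e.g. the pid of events/3 */
	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <stdlib.h>

	int main(int argc, char **argv)
	{
		cpu_set_t set;
		pid_t pid = atoi(argv[1]);
		int cpu = atoi(argv[2]);

		CPU_ZERO(&set);
		CPU_SET(cpu, &set);
		if (sched_setaffinity(pid, sizeof(set), &set) < 0) {
			perror("sched_setaffinity");
			return 1;
		}
		return 0;
	}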
* Re: workqueue cpu affinity
2008-06-11 19:02 ` Max Krasnyansky
@ 2008-06-12 18:44 ` Peter Zijlstra
2008-06-12 19:10 ` Max Krasnyanskiy
0 siblings, 1 reply; 31+ messages in thread
From: Peter Zijlstra @ 2008-06-12 18:44 UTC (permalink / raw)
To: Max Krasnyansky
Cc: Oleg Nesterov, mingo, Andrew Morton, David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell, Thomas Gleixner

Sorry for being late... and I'm afraid most will have to wait a bit
longer :-(

On Wed, 2008-06-11 at 12:02 -0700, Max Krasnyansky wrote:
> Peter Zijlstra wrote:
> > Please don't take this too personally - I'm glad you're working on this.
> > I'm just trying to see what we can generalize.
> Oh, no worries, I'm not taking this personally, except maybe the "most of
> your ideas suck" part which got me a little bit ;-). I'm definitely all
> for making it suitable for more general usage.
> This is actually the first constructive criticism (except for the "most
> of your ideas suck" part :).

No, no, you understood me wrong (or I expressed myself wrong). Your
ideas are good, it's just the implementation / execution I have issues
with.

Like with extending the isolation map, which didn't leave any room for
hard-rt smp schedulers or multiple rt domains. Whereas the cpuset stuff
does.

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: workqueue cpu affinity
2008-06-12 18:44 ` Peter Zijlstra
@ 2008-06-12 19:10 ` Max Krasnyanskiy
0 siblings, 0 replies; 31+ messages in thread
From: Max Krasnyanskiy @ 2008-06-12 19:10 UTC (permalink / raw)
To: Peter Zijlstra
Cc: Oleg Nesterov, mingo, Andrew Morton, David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell, Thomas Gleixner

Peter Zijlstra wrote:
> Sorry for being late... and I'm afraid most will have to wait a bit
> longer :-(
>
> On Wed, 2008-06-11 at 12:02 -0700, Max Krasnyansky wrote:
>> Peter Zijlstra wrote:
>
>>> Please don't take this too personally - I'm glad you're working on this.
>>> I'm just trying to see what we can generalize.
>
>> Oh, no worries, I'm not taking this personally, except maybe the "most of
>> your ideas suck" part which got me a little bit ;-). I'm definitely all
>> for making it suitable for more general usage.
>> This is actually the first constructive criticism (except for the "most
>> of your ideas suck" part :).
>
> No, no, you understood me wrong (or I expressed myself wrong). Your
> ideas are good, it's just the implementation / execution I have issues
> with.
>
> Like with extending the isolation map, which didn't leave any room for
> hard-rt smp schedulers or multiple rt domains. Whereas the cpuset stuff
> does.

Yep. And I redid it completely and switched gears to fix/improve cpusets,
genirq, etc. Anyway, I think we're on the same page. Please look over the
summary that I sent out and see if I missed anything.

Max

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: workqueue cpu affinity
2008-06-10 20:24 ` workqueue cpu affinity Max Krasnyansky
2008-06-11  6:49 ` Peter Zijlstra
@ 2008-06-11 16:08 ` Oleg Nesterov
2008-06-11 19:21 ` Max Krasnyansky
` (2 more replies)
1 sibling, 3 replies; 31+ messages in thread
From: Oleg Nesterov @ 2008-06-11 16:08 UTC (permalink / raw)
To: Max Krasnyansky
Cc: Peter Zijlstra, mingo, Andrew Morton, David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell

On 06/10, Max Krasnyansky wrote:
>
> Here is some background on this. Full cpu isolation requires some tweaks to the
> workqueue handling. Either the workqueue threads need to be moved (which is my
> current approach), or work needs to be redirected when it's submitted.

_IF_ we have to do this, I think it is much better to move cwq->thread.

> Peter Zijlstra wrote:
> > The advantage of creating a more flexible or fine-grained flush is that
> > large machines also profit from it.
> I agree, our current workqueue flush scheme is expensive because it has to
> schedule on each online cpu. So yes improving flush makes sense in general.

Yes, it is easy to implement flush_work(struct work_struct *work) which
only waits for that work, so it can't hang unless it was enqueued on the
isolated cpu.

But in most cases it is enough to just do

	if (cancel_work_sync(work))
		work->func(work);

Or we can add flush_workqueue_cpus(struct workqueue_struct *wq, cpumask_t *cpu_map).

But I don't think we should change the behaviour of flush_workqueue().

> This will require a bit of surgery across the entire tree. There is a lot of
> code that calls flush_scheduled_work()

Almost all of them should be changed to use cancel_work_sync().

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread
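A sketch of what such a flush_work() could look like, using a barrier work item and a completion. This is an illustration only: get_wq_data() matches the 2.6.25-era helper in kernel/workqueue.c, but insert_work_after() is hypothetical, and the real queueing/locking against the cwq internals is more subtle than shown.

	/* Sketch: wait for one specific work item by queueing a barrier
	 * work right behind it on the same cpu workqueue and waiting for
	 * the barrier to execute. */
	struct wq_barrier {
		struct work_struct	work;
		struct completion	done;
	};

	static void wq_barrier_func(struct work_struct *work)
	{
		struct wq_barrier *barr = container_of(work, struct wq_barrier, work);

		complete(&barr->done);
	}

	int flush_work(struct work_struct *work)
	{
		struct cpu_workqueue_struct *cwq = get_wq_data(work);
		struct wq_barrier barr;

		if (!cwq)
			return 0;	/* never queued */

		INIT_WORK(&barr.work, wq_barrier_func);
		init_completion(&barr.done);

		insert_work_after(cwq, &barr.work, work);	/* hypothetical */
		wait_for_completion(&barr.done);
		return 1;
	}

Because the barrier is queued on exactly one cpu, the caller can only block on work that was actually enqueued there, which is the property that matters for the isolated-cpu case.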
* Re: workqueue cpu affinity
2008-06-11 16:08 ` Oleg Nesterov
@ 2008-06-11 19:21 ` Max Krasnyansky
2008-06-11 19:21 ` Max Krasnyansky
2008-06-11 20:44 ` Max Krasnyansky
2 siblings, 0 replies; 31+ messages in thread
From: Max Krasnyansky @ 2008-06-11 19:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, mingo, Andrew Morton, David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell

Oleg Nesterov wrote:
> On 06/10, Max Krasnyansky wrote:
>> Here is some background on this. Full cpu isolation requires some tweaks to the
>> workqueue handling. Either the workqueue threads need to be moved (which is my
>> current approach), or work needs to be redirected when it's submitted.
>
> _IF_ we have to do this, I think it is much better to move cwq->thread.
Ok. btw That's what I'm doing now from user-space.

>> Peter Zijlstra wrote:
>>> The advantage of creating a more flexible or fine-grained flush is that
>>> large machines also profit from it.
>> I agree, our current workqueue flush scheme is expensive because it has to
>> schedule on each online cpu. So yes improving flush makes sense in general.
>
> Yes, it is easy to implement flush_work(struct work_struct *work) which
> only waits for that work, so it can't hang unless it was enqueued on the
> isolated cpu.
>
> But in most cases it is enough to just do
>
>	if (cancel_work_sync(work))
>		work->func(work);
Cool. That would work.
btw Somehow I thought that you already implemented flush_work(). I do not
see it in 2.6.25 but I could've sworn that I saw a patch flying by. Must
have been something else. Do you mind adding that ?

> Or we can add flush_workqueue_cpus(struct workqueue_struct *wq, cpumask_t *cpu_map).
That'd be special casing. I mean something will have to know what cpus
cannot be flushed. I liked your proposal above much better.

> But I don't think we should change the behaviour of flush_workqueue().
>
>> This will require a bit of surgery across the entire tree. There is a lot of
>> code that calls flush_scheduled_work()
>
> Almost all of them should be changed to use cancel_work_sync().
That'd be a lot of changes.

	git grep flush_scheduled_work | wc
	154     376    8674

Hmm, I guess maybe not that bad. I might actually do that :-).

Max

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: workqueue cpu affinity
2008-06-11 16:08 ` Oleg Nesterov
2008-06-11 19:21 ` Max Krasnyansky
@ 2008-06-11 19:21 ` Max Krasnyansky
2008-06-12 16:35 ` Oleg Nesterov
2008-06-11 20:44 ` Max Krasnyansky
2 siblings, 1 reply; 31+ messages in thread
From: Max Krasnyansky @ 2008-06-11 19:21 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, mingo, Andrew Morton, David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell

Oleg Nesterov wrote:
> On 06/10, Max Krasnyansky wrote:
>> Here is some background on this. Full cpu isolation requires some tweaks to the
>> workqueue handling. Either the workqueue threads need to be moved (which is my
>> current approach), or work needs to be redirected when it's submitted.
>
> _IF_ we have to do this, I think it is much better to move cwq->thread.
Ok. btw That's what I'm doing now from user-space.

>> Peter Zijlstra wrote:
>>> The advantage of creating a more flexible or fine-grained flush is that
>>> large machines also profit from it.
>> I agree, our current workqueue flush scheme is expensive because it has to
>> schedule on each online cpu. So yes improving flush makes sense in general.
>
> Yes, it is easy to implement flush_work(struct work_struct *work) which
> only waits for that work, so it can't hang unless it was enqueued on the
> isolated cpu.
>
> But in most cases it is enough to just do
>
>	if (cancel_work_sync(work))
>		work->func(work);
Cool. That would work.
btw Somehow I thought that you already implemented flush_work(). I do not
see it in 2.6.25 but I could've sworn that I saw a patch flying by. Must
have been something else. Do you mind adding that ?

> Or we can add flush_workqueue_cpus(struct workqueue_struct *wq, cpumask_t *cpu_map).
That'd be special casing. I mean something will have to know what cpus
cannot be flushed. I liked your proposal above much better.

> But I don't think we should change the behaviour of flush_workqueue().
>
>> This will require a bit of surgery across the entire tree. There is a lot of
>> code that calls flush_scheduled_work()
>
> Almost all of them should be changed to use cancel_work_sync().
That'd be a lot of changes.

	git grep flush_scheduled_work | wc
	154     376    8674

Hmm, I guess maybe not that bad. I might actually do that :-).

Max

^ permalink raw reply	[flat|nested] 31+ messages in thread
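A typical conversion of the kind being discussed might look like this. The driver code below is made up for illustration; the point is that a driver only needs to wait for its own work item, not for everything ever scheduled on every cpu.

	/* Before: waits for every work item queued on every cpu, and can
	 * block forever if a workqueue thread is starved on some cpu. */
	static void foo_shutdown_old(struct foo_dev *dev)
	{
		disable_irq(dev->irq);
		flush_scheduled_work();
	}

	/* After: cancels (or waits for) only dev's own work item. */
	static void foo_shutdown_new(struct foo_dev *dev)
	{
		disable_irq(dev->irq);
		cancel_work_sync(&dev->reset_work);
	}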
* Re: workqueue cpu affinity
2008-06-11 19:21 ` Max Krasnyansky
@ 2008-06-12 16:35 ` Oleg Nesterov
0 siblings, 0 replies; 31+ messages in thread
From: Oleg Nesterov @ 2008-06-12 16:35 UTC (permalink / raw)
To: Max Krasnyansky
Cc: Peter Zijlstra, mingo, Andrew Morton, David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell

On 06/11, Max Krasnyansky wrote:
>
> Oleg Nesterov wrote:
> >
> > Yes, it is easy to implement flush_work(struct work_struct *work) which
> > only waits for that work, so it can't hang unless it was enqueued on the
> > isolated cpu.
> >
> > But in most cases it is enough to just do
> >
> >	if (cancel_work_sync(work))
> >		work->func(work);
> Cool. That would work.
> btw Somehow I thought that you already implemented flush_work(). I do not
> see it in 2.6.25 but I could've sworn that I saw a patch flying by. Must
> have been something else. Do you mind adding that ?

Well... I don't think Andrew will take this patch right now... OK, I'll
send the preparation patch with comments. Do you see an immediate user
for this helper?

> > Or we can add flush_workqueue_cpus(struct workqueue_struct *wq, cpumask_t *cpu_map).
> That'd be special casing. I mean something will have to know what cpus
> cannot be flushed.

OK, we can make it flush_workqueue_except_isolated_cpus(struct workqueue_struct *wq).

> I liked your proposal above much better.

it was Peter who suggested this ;)

> > But I don't think we should change the behaviour of flush_workqueue().
> >
> >> This will require a bit of surgery across the entire tree. There is a lot of
> >> code that calls flush_scheduled_work()
> >
> > Almost all of them should be changed to use cancel_work_sync().
>
> That'd be a lot of changes.
>
> 	git grep flush_scheduled_work | wc
> 	154     376    8674
>
> Hmm, I guess maybe not that bad. I might actually do that :-).

Cool! I _bet_ you will find a lot of bugs ;)

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread
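For illustration, the cpumask-restricted flush mentioned above could be as simple as the sketch below, reusing the per-cpu flush that flush_workqueue() already does internally. flush_cpu_workqueue() and wq->cpu_wq are as in 2.6.25's kernel/workqueue.c; the function itself is hypothetical.

	/* Sketch: flush only the per-cpu workqueues named in *cpu_map,
	 * e.g. everything except the isolated cpus. */
	void flush_workqueue_cpus(struct workqueue_struct *wq,
				  const cpumask_t *cpu_map)
	{
		int cpu;

		might_sleep();
		for_each_cpu_mask(cpu, *cpu_map)
			flush_cpu_workqueue(per_cpu_ptr(wq->cpu_wq, cpu));
	}

The special-casing objection still applies: some caller has to know which cpus to leave out, which is why the per-work flush_work()/cancel_work_sync() approach was preferred.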
* Re: workqueue cpu affinity
2008-06-11 16:08 ` Oleg Nesterov
2008-06-11 19:21 ` Max Krasnyansky
2008-06-11 19:21 ` Max Krasnyansky
@ 2008-06-11 20:44 ` Max Krasnyansky
2 siblings, 0 replies; 31+ messages in thread
From: Max Krasnyansky @ 2008-06-11 20:44 UTC (permalink / raw)
To: Oleg Nesterov, Peter Zijlstra
Cc: mingo, Andrew Morton, David Rientjes, Paul Jackson, menage, linux-kernel, Mark Hounschell, Nick Piggin

Previous emails were very long :). So here is an executive summary of the
discussions so far:

----
Workqueue kthread starvation by non-blocking user RT threads.

Starving workqueue threads on the isolated cpus does not seem like a big
deal. All current mainline users of the schedule_on_cpu() kind of api can
live with it. Starvation of the workqueue threads is an issue for the -rt
kernels. See http://marc.info/?l=linux-kernel&m=121316707117552&w=2 for
more info.
If absolutely necessary moving workqueue threads from the isolated cpus
is also not a big deal, even for cpu hotplug. It's certainly _not_
encouraged in general but at the same time is not strictly prohibited
either, because nothing fundamental breaks (that's what my current
isolation solution does).

----
Optimize workqueue flush.

The current flush_workqueue() implementation is an issue for the
starvation case mentioned above and in general it's not very efficient
because it has to schedule on each online cpu. Peter suggested rewriting
flush logic to avoid scheduling on each online cpu. Oleg suggested
converting existing users of flush_scheduled_work() to
cancel_work_sync(work) which will provide fine grained flushing and will
not schedule on each cpu. Both of the suggestions would improve overall
performance and address the case when the machine gets stuck due to
workqueue thread starvation.

----
Timer or IPI based Oprofile.

Currently oprofile collects samples by using schedule_delayed_work_on().
Which means that if workqueue threads are starved on, or moved from, cpuX
oprofile fails to collect samples on that cpuX. It seems that it can be
easily converted to use a per-CPU timer or IPI. This might be useful in
general (ie less expensive) and will take care of the issue described
above.

----
Optimize pagevec drain.

Current pagevec drain logic on the NUMA boxes schedules a workqueue on
each online cpu. It's not an issue for the CPU isolation per se but can
be improved in general. Peter suggested keeping a cpumask of cpus with
non-empty pagevecs which will not require scheduling work on each cpu.
I wonder if there is something on that front in Nick's latest patches.
CC'ing Nick.

----
Did I miss anything ?

Max

^ permalink raw reply	[flat|nested] 31+ messages in thread
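As a sketch of the timer-based oprofile idea above: collect_cpu_samples() and SAMPLE_INTERVAL below are hypothetical stand-ins for the real buffer-sync work, while the timer plumbing uses the stock setup_timer()/add_timer_on()/mod_timer() API of this era.

	/* Sketch: replace schedule_delayed_work_on() with a self-rearming
	 * per-cpu timer, so sample collection keeps running even when the
	 * cpu's workqueue thread is starved or has been moved away. */
	static DEFINE_PER_CPU(struct timer_list, sample_timer);

	static void sample_timer_fn(unsigned long unused)
	{
		/* timer handlers run on the cpu the timer was armed on */
		collect_cpu_samples(smp_processor_id());	/* hypothetical */
		mod_timer(&__get_cpu_var(sample_timer),
			  jiffies + SAMPLE_INTERVAL);
	}

	static void start_sample_timers(void)
	{
		int cpu;

		for_each_online_cpu(cpu) {
			struct timer_list *t = &per_cpu(sample_timer, cpu);

			setup_timer(t, sample_timer_fn, 0);
			t->expires = jiffies + SAMPLE_INTERVAL;
			add_timer_on(t, cpu);
		}
	}

Note this trades the workqueue-starvation problem for hardirq-context work on the isolated cpu, which is why making it configurable was part of the plan.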
* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
2008-06-10 17:00 ` Oleg Nesterov
2008-06-10 17:19 ` Peter Zijlstra
@ 2008-06-10 18:00 ` Max Krasnyansky
1 sibling, 0 replies; 31+ messages in thread
From: Max Krasnyansky @ 2008-06-10 18:00 UTC (permalink / raw)
To: Oleg Nesterov
Cc: Peter Zijlstra, David Rientjes, Paul Jackson, mingo, menage, linux-kernel

Oleg Nesterov wrote:
> On 06/10, Max Krasnyansky wrote:
>> Peter Zijlstra wrote:
>>> Per cpu workqueues should stay on their cpu.
>>>
>>> What you're really looking for is a more fine grained alternative to
>>> flush_workqueue().
>> Actually I had a discussion on that with Oleg Nesterov. If you remember my
>> original solution (ie centralized cpu_isolate_map) was to completely redirect
>> work onto other cpus. Then you pointed out that it's the flush_() that really
>> makes the box stuck. So I started thinking about redoing the flush. While
>> looking at the code I realized that if I only change the flush_() then queued
>> work can get stale so to speak. ie Machine does not get stuck but some work
>> submitted on the isolated cpus will sit there for a long time. Oleg pointed
>> out the exact same thing. So the simplest solution that does not require any
>> surgery to the workqueue is to just move the threads to other cpus.
>
> Cough... I'd like to mention that I _personally_ agree with Peter, cwq->thread's
> should stay on their cpu.
I never argued against the _should stay_ ;-). What I'm arguing against is
the _must stay_ which is a big difference. I'll start a separate thread
on this.

> I just meant that from the workqueue.c pov it is (afaics) OK to move cwq->thread
> to other CPUs, in a sense that this shouldn't add races or hotplug problems, etc.
Yep. That's what I was referring to in the explanation that I sent to
Peter.

> But still this doesn't look right to me.
Yeah, it's all about perceptions. We'll fix that ;-).

Max

^ permalink raw reply	[flat|nested] 31+ messages in thread
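For context, the abandoned redirect-on-submit approach referred to above would amount to something like the following sketch. cpu_isolated_map here is the centralized map from the earlier out-of-tree patches, not a mainline symbol.

	/* Sketch: at queue time, punt work aimed at an isolated cpu to
	 * some housekeeping cpu instead. This keeps the isolated cpu
	 * quiet, but per-cpu work then runs on the "wrong" cpu, which is
	 * exactly the correctness worry raised in this thread. */
	static int redirect_cpu(int cpu)
	{
		if (cpu_isset(cpu, cpu_isolated_map))
			return first_cpu(cpu_online_map);	/* crude fallback */
		return cpu;
	}

A caller would then do queue_delayed_work_on(redirect_cpu(cpu), ...) instead of queue_delayed_work_on(cpu, ...), which illustrates why moving the threads is the less invasive of the two options.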
* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
2008-06-05 19:57 [patch] sched: prevent bound kthreads from changing cpus_allowed David Rientjes
2008-06-05 20:29 ` Paul Jackson
@ 2008-06-05 20:52 ` Daniel Walker
2008-06-05 21:47 ` Paul Jackson
` (2 subsequent siblings)
4 siblings, 0 replies; 31+ messages in thread
From: Daniel Walker @ 2008-06-05 20:52 UTC (permalink / raw)
To: David Rientjes
Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, Paul Jackson, linux-kernel

On Thu, 2008-06-05 at 12:57 -0700, David Rientjes wrote:
> Kthreads that have called kthread_bind() are bound to specific cpus, so
> other tasks should not be able to change their cpus_allowed from under
> them. Otherwise, it is possible to move kthreads, such as the migration
> or software watchdog threads, so they are not allowed access to the cpu
> they work on.

I'm all for it .. I've crashed test systems just by migrating tasks that
are supposed to stay bound while using cpusets ..

Daniel

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed 2008-06-05 19:57 [patch] sched: prevent bound kthreads from changing cpus_allowed David Rientjes 2008-06-05 20:29 ` Paul Jackson 2008-06-05 20:52 ` Daniel Walker @ 2008-06-05 21:47 ` Paul Jackson 2008-06-10 10:28 ` Ingo Molnar 2008-06-10 17:47 ` Oleg Nesterov 4 siblings, 0 replies; 31+ messages in thread From: Paul Jackson @ 2008-06-05 21:47 UTC (permalink / raw) To: David Rientjes; +Cc: mingo, peterz, menage, linux-kernel Ok - looks good to me. (I too have shot my foot off moving tasks that shouldn't be moved - thanks.) Reviewed-by: Paul Jackson <pj@sgi.com> -- I won't rest till it's the best ... Programmer, Linux Scalability Paul Jackson <pj@sgi.com> 1.940.382.4214 ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed 2008-06-05 19:57 [patch] sched: prevent bound kthreads from changing cpus_allowed David Rientjes ` (2 preceding siblings ...) 2008-06-05 21:47 ` Paul Jackson @ 2008-06-10 10:28 ` Ingo Molnar 2008-06-10 17:47 ` Oleg Nesterov 4 siblings, 0 replies; 31+ messages in thread From: Ingo Molnar @ 2008-06-10 10:28 UTC (permalink / raw) To: David Rientjes; +Cc: Peter Zijlstra, Paul Menage, Paul Jackson, linux-kernel * David Rientjes <rientjes@google.com> wrote: > Kthreads that have called kthread_bind() are bound to specific cpus, > so other tasks should not be able to change their cpus_allowed from > under them. Otherwise, it is possible to move kthreads, such as the > migration or software watchdog threads, so they are not allowed access > to the cpu they work on. applied to tip/sched-devel, thanks David. Ingo ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [patch] sched: prevent bound kthreads from changing cpus_allowed
2008-06-05 19:57 [patch] sched: prevent bound kthreads from changing cpus_allowed David Rientjes
` (3 preceding siblings ...)
2008-06-10 10:28 ` Ingo Molnar
@ 2008-06-10 17:47 ` Oleg Nesterov
4 siblings, 0 replies; 31+ messages in thread
From: Oleg Nesterov @ 2008-06-10 17:47 UTC (permalink / raw)
To: David Rientjes
Cc: Ingo Molnar, Peter Zijlstra, Paul Menage, Paul Jackson, linux-kernel

On 06/05, David Rientjes wrote:
>
> Kthreads that have called kthread_bind() are bound to specific cpus, so
> other tasks should not be able to change their cpus_allowed from under
> them. Otherwise, it is possible to move kthreads, such as the migration
> or software watchdog threads, so they are not allowed access to the cpu
> they work on.

Imho, this is a very good change, but...

> @@ -180,6 +180,7 @@ void kthread_bind(struct task_struct *k, unsigned int cpu)
>  	set_task_cpu(k, cpu);
>  	k->cpus_allowed = cpumask_of_cpu(cpu);
>  	k->rt.nr_cpus_allowed = 1;
> +	k->flags |= PF_THREAD_BOUND;

What if user-space moves this task right before "|= PF_THREAD_BOUND" ?

Oleg.

^ permalink raw reply	[flat|nested] 31+ messages in thread
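One conceivable way to close the window Oleg points out would be to publish the mask update and the flag atomically with respect to set_cpus_allowed_ptr(), for example under the runqueue lock. The sketch below assumes the rq-lock helpers (which are static in kernel/sched.c) were made visible to kthread.c; the alternative is to guarantee kthread_bind() is only ever called before the kthread can be seen and moved by user space.

	void kthread_bind(struct task_struct *k, unsigned int cpu)
	{
		struct rq *rq;
		unsigned long flags;

		/* make the mask update and PF_THREAD_BOUND visible
		 * atomically wrt set_cpus_allowed_ptr(), which takes
		 * the same lock */
		rq = task_rq_lock(k, &flags);
		set_task_cpu(k, cpu);
		k->cpus_allowed = cpumask_of_cpu(cpu);
		k->rt.nr_cpus_allowed = 1;
		k->flags |= PF_THREAD_BOUND;
		task_rq_unlock(rq, &flags);
	}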
end of thread, other threads: [~2008-06-12 19:10 UTC | newest]

Thread overview: 31+ messages
2008-06-05 19:57 [patch] sched: prevent bound kthreads from changing cpus_allowed David Rientjes
2008-06-05 20:29 ` Paul Jackson
2008-06-05 21:12 ` David Rientjes
2008-06-09 20:59 ` Max Krasnyanskiy
2008-06-09 22:07 ` David Rientjes
2008-06-10  4:23 ` Max Krasnyansky
2008-06-10 17:04 ` David Rientjes
2008-06-10 16:30 ` cpusets and kthreads, inconsistent behaviour Max Krasnyansky
2008-06-10 18:47 ` David Rientjes
2008-06-10 20:44 ` Max Krasnyansky
2008-06-10 20:54 ` David Rientjes
2008-06-10 21:15 ` Max Krasnyansky
2008-06-10  6:44 ` [patch] sched: prevent bound kthreads from changing cpus_allowed Peter Zijlstra
2008-06-10 15:38 ` Max Krasnyansky
2008-06-10 17:00 ` Oleg Nesterov
2008-06-10 17:19 ` Peter Zijlstra
2008-06-10 20:24 ` workqueue cpu affinity Max Krasnyansky
2008-06-11  6:49 ` Peter Zijlstra
2008-06-11 19:02 ` Max Krasnyansky
2008-06-12 18:44 ` Peter Zijlstra
2008-06-12 19:10 ` Max Krasnyanskiy
2008-06-11 16:08 ` Oleg Nesterov
2008-06-11 19:21 ` Max Krasnyansky
2008-06-11 19:21 ` Max Krasnyansky
2008-06-12 16:35 ` Oleg Nesterov
2008-06-11 20:44 ` Max Krasnyansky
2008-06-10 18:00 ` [patch] sched: prevent bound kthreads from changing cpus_allowed Max Krasnyansky
2008-06-05 20:52 ` Daniel Walker
2008-06-05 21:47 ` Paul Jackson
2008-06-10 10:28 ` Ingo Molnar
2008-06-10 17:47 ` Oleg Nesterov