* [RFC][PATCH 0/3] delayed wakeup list
@ 2011-09-14 13:30 Peter Zijlstra
2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-14 13:30 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner
Cc: linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul,
David Miller, Eric Dumazet, Mike Galbraith, Peter Zijlstra
This patch-set provides the infrastructure to delay/batch task wakeups.
This is useful for locking primitives that can effect multiple wakeups
per operation and want to avoid contention on their internal locks by
delaying the wakeups until those internal locks have been released.
Patch 2 converts futexes
Patch 3 converts sysv sems, and is broken
[ I've been staring at patch 3 way too long, so I thought I'd post it just
to get a few more eyes on it.. ]
Alternatively, it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing: queue all target tasks
and wake them up once you've processed all packets. That way you avoid
waking the target task multiple times when there were multiple packets
for the same task.
There's no actual such usage yet, but ISTR talking to some net folks a long
while back about this; is there still interest, Dave, Eric?
^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC][PATCH 1/3] sched: Provide delayed wakeup list
  2011-09-14 13:30 [RFC][PATCH 0/3] delayed wakeup list Peter Zijlstra
@ 2011-09-14 13:30 ` Peter Zijlstra
  2011-09-14 13:50   ` Peter Zijlstra
  ` (5 more replies)
  2011-09-14 13:30 ` [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention Peter Zijlstra
  ` (2 subsequent siblings)
  3 siblings, 6 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-14 13:30 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner
Cc: linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul,
	David Miller, Eric Dumazet, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-wakeup-list.patch --]
[-- Type: text/plain, Size: 3834 bytes --]

Provide means to queue wakeup targets for later mass wakeup.

This is useful for locking primitives that can effect multiple wakeups
per operation and want to avoid lock internal lock contention by
delaying the wakeups until we've released the lock internal locks.

Alternatively it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing. Queue all target tasks
and wakeup once you've processed all packets. That way you avoid
waking the target task multiple times if there were multiple packets
for the same task.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |   44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c        |   21 +++++++++++++++++++++
 2 files changed, 65 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1065,6 +1065,19 @@ struct uts_namespace;
 struct rq;
 struct sched_domain;
 
+struct wake_list_head {
+	struct wake_list_node *first;
+};
+
+struct wake_list_node {
+	struct wake_list_node *next;
+};
+
+#define WAKE_LIST_TAIL ((struct wake_list_node *)0x01)
+
+#define WAKE_LIST(name)					\
+	struct wake_list_head name = { WAKE_LIST_TAIL }
+
 /*
  * wake flags
  */
@@ -1255,6 +1268,8 @@ struct task_struct {
 	unsigned int btrace_seq;
 #endif
 
+	struct wake_list_node wake_list;
+
 	unsigned int policy;
 	cpumask_t cpus_allowed;
 
@@ -2143,6 +2158,35 @@ extern void wake_up_new_task(struct task
 extern void sched_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
+static inline void
+wake_list_add(struct wake_list_head *head, struct task_struct *p)
+{
+	struct wake_list_node *n = &p->wake_list;
+
+	get_task_struct(p);
+	/*
+	 * Atomically grab the task, if ->wake_list is !0 already it means
+	 * its already queued (either by us or someone else) and will get the
+	 * wakeup due to that.
+	 *
+	 * This cmpxchg() implies a full barrier, which pairs with the write
+	 * barrier implied by the wakeup in wake_up_list().
+	 */
+	if (cmpxchg(&n->next, 0, n) != 0) {
+		/* It was already queued, drop the extra ref and we're done. */
+		put_task_struct(p);
+		return;
+	}
+
+	/*
+	 * The head is context local, there can be no concurrency.
+	 */
+	n->next = head->first;
+	head->first = n;
+}
+
+extern void wake_up_list(struct wake_list_head *head, unsigned int state);
+
 extern void proc_caches_init(void);
 extern void flush_signals(struct task_struct *);
 extern void __flush_signals(struct task_struct *);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2916,6 +2916,25 @@ int wake_up_state(struct task_struct *p,
 	return try_to_wake_up(p, state, 0);
 }
 
+void wake_up_list(struct wake_list_head *head, unsigned int state)
+{
+	struct wake_list_node *n = head->first;
+	struct task_struct *p;
+
+	while (n != WAKE_LIST_TAIL) {
+		p = container_of(n, struct task_struct, wake_list);
+		n = n->next;
+
+		p->wake_list.next = NULL;
+		/*
+		 * wake_up_state() implies a wmb() to pair with the queueing
+		 * in wake_list_add() so as not to miss wakeups.
+		 */
+		wake_up_state(p, state);
+		put_task_struct(p);
+	}
+}
+
 /*
  * Perform scheduler related setup for a newly forked process p.
  * p is forked by current.
@@ -2943,6 +2962,8 @@ static void __sched_fork(struct task_str
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+	p->wake_list.next = NULL;
 }
 
 /*

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list
  2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra
@ 2011-09-14 13:50   ` Peter Zijlstra
  2011-09-14 14:08   ` Eric Dumazet
  ` (4 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-14 13:50 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart,
	Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith

On Wed, 2011-09-14 at 15:30 +0200, Peter Zijlstra wrote:
> +	if (cmpxchg(&n->next, 0, n) != 0) {
> +		/* It was already queued, drop the extra ref and we're done. */
> +		put_task_struct(p);
> +		return;
> +	}
> +
> +	/*
> +	 * The head is context local, there can be no concurrency.
> +	 */
> +	n->next = head->first;

That could of course be folded into one op..

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra 2011-09-14 13:50 ` Peter Zijlstra @ 2011-09-14 14:08 ` Eric Dumazet 2011-09-14 14:12 ` Peter Zijlstra 2011-09-14 15:35 ` Darren Hart ` (3 subsequent siblings) 5 siblings, 1 reply; 33+ messages in thread From: Eric Dumazet @ 2011-09-14 14:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Mike Galbraith Le mercredi 14 septembre 2011 à 15:30 +0200, Peter Zijlstra a écrit : > pièce jointe document texte brut (sched-wakeup-list.patch) > Provide means to queue wakeup targets for later mass wakeup. > > This is useful for locking primitives that can effect multiple wakeups > per operation and want to avoid lock internal lock contention by > delaying the wakeups until we've released the lock internal locks. > > Alternatively it can be used to avoid issuing multiple wakeups, and > thus save a few cycles, in packet processing. Queue all target tasks > and wakeup once you've processed all packets. That way you avoid > waking the target task multiple times if there were multiple packets > for the same task. 
> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Darren Hart <dvhart@linux.intel.com> > Cc: Manfred Spraul <manfred@colorfullife.com> > Cc: David Miller <davem@davemloft.net> > Cc: Eric Dumazet <eric.dumazet@gmail.com> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > include/linux/sched.h | 44 ++++++++++++++++++++++++++++++++++++++++++++ > kernel/sched.c | 21 +++++++++++++++++++++ > 2 files changed, 65 insertions(+) > Index: linux-2.6/include/linux/sched.h > =================================================================== > --- linux-2.6.orig/include/linux/sched.h > +++ linux-2.6/include/linux/sched.h > @@ -1065,6 +1065,19 @@ struct uts_namespace; > struct rq; > struct sched_domain; > > +struct wake_list_head { > + struct wake_list_node *first; > +}; > + > +struct wake_list_node { > + struct wake_list_node *next; > +}; > + > +#define WAKE_LIST_TAIL ((struct wake_list_node *)0x01) > + > +#define WAKE_LIST(name) \ > + struct wake_list_head name = { WAKE_LIST_TAIL } > + > /* > * wake flags > */ > @@ -1255,6 +1268,8 @@ struct task_struct { > unsigned int btrace_seq; > #endif > > + struct wake_list_node wake_list; > + > unsigned int policy; > cpumask_t cpus_allowed; > > @@ -2143,6 +2158,35 @@ extern void wake_up_new_task(struct task > extern void sched_fork(struct task_struct *p); > extern void sched_dead(struct task_struct *p); > > +static inline void > +wake_list_add(struct wake_list_head *head, struct task_struct *p) > +{ > + struct wake_list_node *n = &p->wake_list; > + > + get_task_struct(p); > + /* > + * Atomically grab the task, if ->wake_list is !0 already it means > + * its already queued (either by us or someone else) and will get the > + * wakeup due to that. > + * > + * This cmpxchg() implies a full barrier, which pairs with the write > + * barrier implied by the wakeup in wake_up_list(). > + */ > + if (cmpxchg(&n->next, 0, n) != 0) { > + /* It was already queued, drop the extra ref and we're done. 
*/ > + put_task_struct(p); > + return; > + } > + > + /* > + * The head is context local, there can be no concurrency. > + */ > + n->next = head->first; > + head->first = n; > +} > + I dont understand why you make a get_task_struct() before the cmpxchg() and rollback. We certainly must hold a lock/mutex before calling wake_list_add() It could be : { if (cmpxchg(&n->next, NULL, head->first)) return; get_task_struct(p); head->first = n; } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 14:08 ` Eric Dumazet @ 2011-09-14 14:12 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 14:12 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Mike Galbraith On Wed, 2011-09-14 at 16:08 +0200, Eric Dumazet wrote: > I dont understand why you make a get_task_struct() before the cmpxchg() > and rollback. We certainly must hold a lock/mutex before calling > wake_list_add() > > It could be : > > { > if (cmpxchg(&n->next, NULL, head->first)) > return; > > get_task_struct(p); > head->first = n; > } You're quite right. No idea why I wrote it like that.. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra 2011-09-14 13:50 ` Peter Zijlstra 2011-09-14 14:08 ` Eric Dumazet @ 2011-09-14 15:35 ` Darren Hart 2011-09-14 15:39 ` Peter Zijlstra 2011-09-16 7:59 ` Paul Turner ` (2 subsequent siblings) 5 siblings, 1 reply; 33+ messages in thread From: Darren Hart @ 2011-09-14 15:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith, Michel Lespinasse On 09/14/2011 06:30 AM, Peter Zijlstra wrote: > Provide means to queue wakeup targets for later mass wakeup. > > This is useful for locking primitives that can effect multiple wakeups > per operation and want to avoid lock internal lock contention by > delaying the wakeups until we've released the lock internal locks. I believe Michel (on CC) was interested in a related sort of thing for read/write mechanisms with futexes. We discussed a way to accomplish what he was interested in without changes to futexes, but this wakeup list may also be of interest to him. > > Alternatively it can be used to avoid issuing multiple wakeups, and > thus save a few cycles, in packet processing. Queue all target tasks > and wakeup once you've processed all packets. That way you avoid > waking the target task multiple times if there were multiple packets > for the same task. 
> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Darren Hart <dvhart@linux.intel.com> > Cc: Manfred Spraul <manfred@colorfullife.com> > Cc: David Miller <davem@davemloft.net> > Cc: Eric Dumazet <eric.dumazet@gmail.com> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > include/linux/sched.h | 44 ++++++++++++++++++++++++++++++++++++++++++++ > kernel/sched.c | 21 +++++++++++++++++++++ > 2 files changed, 65 insertions(+) > Index: linux-2.6/include/linux/sched.h > =================================================================== > --- linux-2.6.orig/include/linux/sched.h > +++ linux-2.6/include/linux/sched.h > @@ -1065,6 +1065,19 @@ struct uts_namespace; > struct rq; > struct sched_domain; > > +struct wake_list_head { > + struct wake_list_node *first; > +}; > + > +struct wake_list_node { > + struct wake_list_node *next; > +}; > + > +#define WAKE_LIST_TAIL ((struct wake_list_node *)0x01) > + > +#define WAKE_LIST(name) \ > + struct wake_list_head name = { WAKE_LIST_TAIL } > + > /* > * wake flags > */ > @@ -1255,6 +1268,8 @@ struct task_struct { > unsigned int btrace_seq; > #endif > > + struct wake_list_node wake_list; > + > unsigned int policy; > cpumask_t cpus_allowed; > > @@ -2143,6 +2158,35 @@ extern void wake_up_new_task(struct task > extern void sched_fork(struct task_struct *p); > extern void sched_dead(struct task_struct *p); > > +static inline void > +wake_list_add(struct wake_list_head *head, struct task_struct *p) > +{ > + struct wake_list_node *n = &p->wake_list; > + > + get_task_struct(p); > + /* > + * Atomically grab the task, if ->wake_list is !0 already it means > + * its already queued (either by us or someone else) and will get the > + * wakeup due to that. > + * > + * This cmpxchg() implies a full barrier, which pairs with the write > + * barrier implied by the wakeup in wake_up_list(). > + */ > + if (cmpxchg(&n->next, 0, n) != 0) { > + /* It was already queued, drop the extra ref and we're done. 
*/ > + put_task_struct(p); > + return; > + } > + > + /* > + * The head is context local, there can be no concurrency. > + */ > + n->next = head->first; > + head->first = n; > +} > + > +extern void wake_up_list(struct wake_list_head *head, unsigned int state); > + > extern void proc_caches_init(void); > extern void flush_signals(struct task_struct *); > extern void __flush_signals(struct task_struct *); > Index: linux-2.6/kernel/sched.c > =================================================================== > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -2916,6 +2916,25 @@ int wake_up_state(struct task_struct *p, > return try_to_wake_up(p, state, 0); > } > Maybe just my obsession with documentation and having my head in futex.c for so long, but I'd love to see some proper kerneldoc function commentary here... Admittedly, the inline comments are very helpful for the most part. Missing here is what the value of state should be when calling wake_up_list. > +void wake_up_list(struct wake_list_head *head, unsigned int state) > +{ > + struct wake_list_node *n = head->first; > + struct task_struct *p; > + > + while (n != WAKE_LIST_TAIL) { > + p = container_of(n, struct task_struct, wake_list); > + n = n->next; > + > + p->wake_list.next = NULL; > + /* > + * wake_up_state() implies a wmb() to pair with the queueing > + * in wake_list_add() so as not to miss wakeups. > + */ > + wake_up_state(p, state); > + put_task_struct(p); > + } > +} > + > /* > * Perform scheduler related setup for a newly forked process p. > * p is forked by current. > @@ -2943,6 +2962,8 @@ static void __sched_fork(struct task_str > #ifdef CONFIG_PREEMPT_NOTIFIERS > INIT_HLIST_HEAD(&p->preempt_notifiers); > #endif > + > + p->wake_list.next = NULL; > } > > /* > > -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 15:35 ` Darren Hart @ 2011-09-14 15:39 ` Peter Zijlstra 2011-09-14 15:49 ` Darren Hart 0 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 15:39 UTC (permalink / raw) To: Darren Hart Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith, Michel Lespinasse On Wed, 2011-09-14 at 08:35 -0700, Darren Hart wrote: > Missing here is what the value of state should be when > calling wake_up_list. Depends on what all you want to wake up :-) Would a reference to the comment of try_to_wake_up() be good enough? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 15:39 ` Peter Zijlstra @ 2011-09-14 15:49 ` Darren Hart 0 siblings, 0 replies; 33+ messages in thread From: Darren Hart @ 2011-09-14 15:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith, Michel Lespinasse On 09/14/2011 08:39 AM, Peter Zijlstra wrote: > On Wed, 2011-09-14 at 08:35 -0700, Darren Hart wrote: >> Missing here is what the value of state should be when >> calling wake_up_list. > > Depends on what all you want to wake up :-) Would a reference to the > comment of try_to_wake_up() be good enough? I found it by following the calls as well, and any serious developer could as well. I just find the kerneldoc headers to be very valuable as reference and I like to encourage their use when I can. This isn't a huge deal, just a nudge to see if it sticks ;) -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra ` (2 preceding siblings ...) 2011-09-14 15:35 ` Darren Hart @ 2011-09-16 7:59 ` Paul Turner 2011-09-16 7:59 ` Paul Turner 2011-10-02 14:01 ` Manfred Spraul 5 siblings, 0 replies; 33+ messages in thread From: Paul Turner @ 2011-09-16 7:59 UTC (permalink / raw) To: linux-kernel It's probably worth folding the idle_cpu fix for checking whether there's an enqueued task into this. - Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra ` (3 preceding siblings ...) 2011-09-16 7:59 ` Paul Turner @ 2011-09-16 7:59 ` Paul Turner 2011-09-16 8:48 ` Peter Zijlstra 2011-10-02 14:01 ` Manfred Spraul 5 siblings, 1 reply; 33+ messages in thread From: Paul Turner @ 2011-09-16 7:59 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith It's probably worth folding the idle_cpu fix for checking whether there's an enqueued task into this. - Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list
  2011-09-16  7:59 ` Paul Turner
@ 2011-09-16  8:48   ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-16 8:48 UTC (permalink / raw)
To: Paul Turner
Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt,
	Darren Hart, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith

On Fri, 2011-09-16 at 00:59 -0700, Paul Turner wrote:
> It's probably worth folding the idle_cpu fix for checking whether
> there's an enqueued task into this.

Yeah, I've actually got that in too..

---
Subject: sched: Fix idle_cpu()
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu Sep 15 15:32:06 CEST 2011

On -rt we observed hackbench waking all 400 tasks to a single cpu.
This is because of select_idle_sibling()'s interaction with the new
ipi based wakeup scheme.

The existing idle_cpu() test only checks to see if the current task on
that cpu is the idle task, it does not take already queued tasks into
account, nor does it take queued to be woken tasks into account.

If the remote wakeup IPIs come hard enough, there won't be time to
schedule away from the idle task, and we would thus keep thinking the
cpu was in fact idle, regardless of the fact that there were already
several hundred tasks runnable.

We couldn't reproduce on mainline, but there's no reason it couldn't
happen.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3o30p18b2paswpc9ohy2gltp@git.kernel.org
---
 kernel/sched.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5152,7 +5152,20 @@ EXPORT_SYMBOL(task_nice);
  */
 int idle_cpu(int cpu)
 {
-	return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+	struct rq *rq = cpu_rq(cpu);
+
+	if (rq->curr != rq->idle)
+		return 0;
+
+	if (rq->nr_running)
+		return 0;
+
+#ifdef CONFIG_SMP
+	if (!llist_empty(&rq->wake_list))
+		return 0;
+#endif
+
+	return 1;
 }
 
 /**

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra ` (4 preceding siblings ...) 2011-09-16 7:59 ` Paul Turner @ 2011-10-02 14:01 ` Manfred Spraul 2011-10-03 10:23 ` Peter Zijlstra 5 siblings, 1 reply; 33+ messages in thread From: Manfred Spraul @ 2011-10-02 14:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith Hi Peter, Do you still work on the wake_up_list() patch? On 09/14/2011 03:30 PM, Peter Zijlstra wrote: > /* > * wake flags > */ > @@ -1255,6 +1268,8 @@ struct task_struct { > unsigned int btrace_seq; > #endif > > + struct wake_list_node wake_list; > + > unsigned int policy; > cpumask_t cpus_allowed; A global wake_list > > @@ -2143,6 +2158,35 @@ extern void wake_up_new_task(struct task > extern void sched_fork(struct task_struct *p); > extern void sched_dead(struct task_struct *p); > > +static inline void > +wake_list_add(struct wake_list_head *head, struct task_struct *p) > +{ > + struct wake_list_node *n =&p->wake_list; > + > + get_task_struct(p); > + /* > + * Atomically grab the task, if ->wake_list is !0 already it means > + * its already queued (either by us or someone else) and will get the > + * wakeup due to that. > + * > + * This cmpxchg() implies a full barrier, which pairs with the write > + * barrier implied by the wakeup in wake_up_list(). > + */ > + if (cmpxchg(&n->next, 0, n) != 0) { > + /* It was already queued, drop the extra ref and we're done. */ > + put_task_struct(p); > + return; > + } > + A task can be only once on the wake_list. > + /* > + * The head is context local, there can be no concurrency. 
> + */ > + n->next = head->first; > + head->first = n; > +} > + > +extern void wake_up_list(struct wake_list_head *head, unsigned int state); > + > extern void proc_caches_init(void); > extern void flush_signals(struct task_struct *); > extern void __flush_signals(struct task_struct *); > Index: linux-2.6/kernel/sched.c > =================================================================== > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -2916,6 +2916,25 @@ int wake_up_state(struct task_struct *p, > return try_to_wake_up(p, state, 0); > } > > +void wake_up_list(struct wake_list_head *head, unsigned int state) > +{ > + struct wake_list_node *n = head->first; > + struct task_struct *p; > + > + while (n != WAKE_LIST_TAIL) { > + p = container_of(n, struct task_struct, wake_list); > + n = n->next; > + > + p->wake_list.next = NULL; > + /* > + * wake_up_state() implies a wmb() to pair with the queueing > + * in wake_list_add() so as not to miss wakeups. > + */ > + wake_up_state(p, state); > + put_task_struct(p); > + } > +} And wake_up_list() uses state. That can't work: What if one waker wants to wake TASK_INTERRUPTIBLE and the other waker wants to wake TASK_UNINTERRUPTIBLE|TASK_INTERRUPTIBLE? -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-10-02 14:01 ` Manfred Spraul @ 2011-10-03 10:23 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-10-03 10:23 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Sun, 2011-10-02 at 16:01 +0200, Manfred Spraul wrote: > Do you still work on the wake_up_list() patch? I plan to, but I got a little side tracked. We need to ensure all schedule() callers can deal with spurious wakeups. To that end I talked to the smatch author who kindly provided a test for this, but I still need to actually run it and see how bad the fallout is and fix stuff up. All core primitives 'should' be good, but there's a lot of shady stuff out there that isn't. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention
  2011-09-14 13:30 [RFC][PATCH 0/3] delayed wakeup list Peter Zijlstra
  2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra
@ 2011-09-14 13:30 ` Peter Zijlstra
  2011-09-14 15:46   ` Darren Hart
  2011-09-16 12:34   ` Peter Zijlstra
  2011-09-14 13:30 ` [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme Peter Zijlstra
  2011-09-14 13:51 ` [RFC][PATCH 0/3] delayed wakeup list Eric Dumazet
  3 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-14 13:30 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner
Cc: linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul,
	David Miller, Eric Dumazet, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: futex-wake-list.patch --]
[-- Type: text/plain, Size: 3694 bytes --]

Use the brand spanking new wake_list to delay the futex wakeups until
after we've released the hash bucket locks. This avoids the newly
woken tasks from immediately getting stuck on the hb lock.

This is esp. painful on -rt, where the hb lock is preemptible.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/futex.c |   23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -823,7 +823,7 @@ static void __unqueue_futex(struct futex
  * The hash bucket lock must be held when this is called.
  * Afterwards, the futex_q must not be accessed.
  */
-static void wake_futex(struct futex_q *q)
+static void wake_futex(struct wake_list_head *wake_list, struct futex_q *q)
 {
 	struct task_struct *p = q->task;
 
@@ -834,7 +834,7 @@ static void wake_futex(struct futex_q *q
 	 * struct. Prevent this by holding a reference on p across the
 	 * wake up.
 	 */
-	get_task_struct(p);
+	wake_list_add(wake_list, p);
 
 	__unqueue_futex(q);
 	/*
@@ -845,9 +845,6 @@ static void wake_futex(struct futex_q *q
 	 */
 	smp_wmb();
 	q->lock_ptr = NULL;
-
-	wake_up_state(p, TASK_NORMAL);
-	put_task_struct(p);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
@@ -964,6 +961,7 @@ futex_wake(u32 __user *uaddr, unsigned i
 	struct futex_q *this, *next;
 	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
+	WAKE_LIST(wake_list);
 	int ret;
 
 	if (!bitset)
@@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i
 		if (!(this->bitset & bitset))
 			continue;
 
-		wake_futex(this);
+		wake_futex(&wake_list, this);
 		if (++ret >= nr_wake)
 			break;
 	}
@@ -996,6 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned i
 
 	spin_unlock(&hb->lock);
 	put_futex_key(&key);
+
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	return ret;
 }
@@ -1012,6 +1012,7 @@ futex_wake_op(u32 __user *uaddr1, unsign
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head;
 	struct futex_q *this, *next;
+	WAKE_LIST(wake_list);
 	int ret, op_ret;
 
 retry:
@@ -1062,7 +1063,7 @@ futex_wake_op(u32 __user *uaddr1, unsign
 
 	plist_for_each_entry_safe(this, next, head, list) {
 		if (match_futex (&this->key, &key1)) {
-			wake_futex(this);
+			wake_futex(&wake_list, this);
 			if (++ret >= nr_wake)
 				break;
 		}
@@ -1074,7 +1075,7 @@ futex_wake_op(u32 __user *uaddr1, unsign
 		op_ret = 0;
 		plist_for_each_entry_safe(this, next, head, list) {
 			if (match_futex (&this->key, &key2)) {
-				wake_futex(this);
+				wake_futex(&wake_list, this);
 				if (++op_ret >= nr_wake2)
 					break;
 			}
@@ -1087,6 +1088,8 @@ futex_wake_op(u32 __user *uaddr1, unsign
 	put_futex_key(&key2);
 out_put_key1:
 	put_futex_key(&key1);
+
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	return ret;
 }
@@ -1239,6 +1242,7 @@ static int futex_requeue(u32 __user *uad
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head1;
 	struct futex_q *this, *next;
+	WAKE_LIST(wake_list);
 	u32 curval2;
 
 	if (requeue_pi) {
@@ -1384,7 +1388,7 @@ static int futex_requeue(u32 __user *uad
 		 * woken by futex_unlock_pi().
 		 */
 		if (++task_count <= nr_wake && !requeue_pi) {
-			wake_futex(this);
+			wake_futex(&wake_list, this);
 			continue;
 		}
 
@@ -1437,6 +1441,7 @@ static int futex_requeue(u32 __user *uad
 	put_futex_key(&key2);
 out_put_key1:
 	put_futex_key(&key1);
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	if (pi_state != NULL)
 		free_pi_state(pi_state);

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 13:30 ` [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention Peter Zijlstra @ 2011-09-14 15:46 ` Darren Hart 2011-09-14 15:51 ` Peter Zijlstra 2011-09-16 12:34 ` Peter Zijlstra 1 sibling, 1 reply; 33+ messages in thread From: Darren Hart @ 2011-09-14 15:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On 09/14/2011 06:30 AM, Peter Zijlstra wrote: > Use the brand spanking new wake_list to delay the futex wakeups until > after we've released the hash bucket locks. This avoids the newly > woken tasks from immediately getting stuck on the hb lock. > > This is esp. painful on -rt, where the hb lock is preemptible. Nice! Have you run this through the functional and performance tests from futextest? Looks like I should also add a multiwake test to really showcase this. If you don't have it local I can setup a github repository for futextest until korg is back.... or do the testing myself... right. > > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Darren Hart <dvhart@linux.intel.com> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > kernel/futex.c | 23 ++++++++++++++--------- > 1 file changed, 14 insertions(+), 9 deletions(-) > > Index: linux-2.6/kernel/futex.c > =================================================================== > --- linux-2.6.orig/kernel/futex.c > +++ linux-2.6/kernel/futex.c > @@ -823,7 +823,7 @@ static void __unqueue_futex(struct futex > * The hash bucket lock must be held when this is called. > * Afterwards, the futex_q must not be accessed. > */ > -static void wake_futex(struct futex_q *q) > +static void wake_futex(struct wake_list_head *wake_list, struct futex_q *q) A good opportunity to add the proper kerneldoc to this function as well. 
> { > struct task_struct *p = q->task; > > @@ -834,7 +834,7 @@ static void wake_futex(struct futex_q *q > * struct. Prevent this by holding a reference on p across the > * wake up. > */ > - get_task_struct(p); > + wake_list_add(wake_list, p); > > __unqueue_futex(q); > /* > @@ -845,9 +845,6 @@ static void wake_futex(struct futex_q *q > */ > smp_wmb(); > q->lock_ptr = NULL; > - > - wake_up_state(p, TASK_NORMAL); > - put_task_struct(p); > } > > static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this) > @@ -964,6 +961,7 @@ futex_wake(u32 __user *uaddr, unsigned i > struct futex_q *this, *next; > struct plist_head *head; > union futex_key key = FUTEX_KEY_INIT; > + WAKE_LIST(wake_list); > int ret; > > if (!bitset) > @@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i > if (!(this->bitset & bitset)) > continue; > > - wake_futex(this); > + wake_futex(&wake_list, this); I guess this is OK. wake_futex_pi will always be one task I believe, so the list syntax might confuse newcomers... Would it make sense to have a wake_futex_list() call? Thinking outloud... 
> if (++ret >= nr_wake) > break; > } > @@ -996,6 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned i > > spin_unlock(&hb->lock); > put_futex_key(&key); > + > + wake_up_list(&wake_list, TASK_NORMAL); > out: > return ret; > } > @@ -1012,6 +1012,7 @@ futex_wake_op(u32 __user *uaddr1, unsign > struct futex_hash_bucket *hb1, *hb2; > struct plist_head *head; > struct futex_q *this, *next; > + WAKE_LIST(wake_list); > int ret, op_ret; > > retry: > @@ -1062,7 +1063,7 @@ futex_wake_op(u32 __user *uaddr1, unsign > > plist_for_each_entry_safe(this, next, head, list) { > if (match_futex (&this->key, &key1)) { > - wake_futex(this); > + wake_futex(&wake_list, this); > if (++ret >= nr_wake) > break; > } > @@ -1074,7 +1075,7 @@ futex_wake_op(u32 __user *uaddr1, unsign > op_ret = 0; > plist_for_each_entry_safe(this, next, head, list) { > if (match_futex (&this->key, &key2)) { > - wake_futex(this); > + wake_futex(&wake_list, this); > if (++op_ret >= nr_wake2) > break; > } > @@ -1087,6 +1088,8 @@ futex_wake_op(u32 __user *uaddr1, unsign > put_futex_key(&key2); > out_put_key1: > put_futex_key(&key1); > + > + wake_up_list(&wake_list, TASK_NORMAL); > out: > return ret; > } > @@ -1239,6 +1242,7 @@ static int futex_requeue(u32 __user *uad > struct futex_hash_bucket *hb1, *hb2; > struct plist_head *head1; > struct futex_q *this, *next; > + WAKE_LIST(wake_list); > u32 curval2; > > if (requeue_pi) { > @@ -1384,7 +1388,7 @@ static int futex_requeue(u32 __user *uad > * woken by futex_unlock_pi(). > */ > if (++task_count <= nr_wake && !requeue_pi) { > - wake_futex(this); > + wake_futex(&wake_list, this); > continue; > } > > @@ -1437,6 +1441,7 @@ static int futex_requeue(u32 __user *uad > put_futex_key(&key2); > out_put_key1: > put_futex_key(&key1); > + wake_up_list(&wake_list, TASK_NORMAL); > out: > if (pi_state != NULL) > free_pi_state(pi_state); > > I _think_ requeue_pi is in the clear here as it uses requeue_pi_wake_futex, which calls wake_up_state directly. 
Still, some testing with futextest functional/futex_requeue_pi is in order. -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 15:46 ` Darren Hart @ 2011-09-14 15:51 ` Peter Zijlstra 2011-09-14 16:00 ` Darren Hart 2011-09-14 20:49 ` Thomas Gleixner 0 siblings, 2 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 15:51 UTC (permalink / raw) To: Darren Hart Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On Wed, 2011-09-14 at 08:46 -0700, Darren Hart wrote: > > On 09/14/2011 06:30 AM, Peter Zijlstra wrote: > > Use the brand spanking new wake_list to delay the futex wakeups until > > after we've released the hash bucket locks. This avoids the newly > > woken tasks from immediately getting stuck on the hb lock. > > > > This is esp. painful on -rt, where the hb lock is preemptible. > > Nice! > > Have you run this through the functional and performance tests from > futextest? Looks like I should also add a multiwake test to really > showcase this. Not more functional than booting, but a very similar patch used to live in 33-rt.. I lost the use-case we had that led to that patch, for -rt it made a huge difference because we endlessly scheduled back and forth between the waker and the wakee bouncing on the hb lock. > If you don't have it local I can setup a github repository for futextest > until korg is back.... or do the testing myself... right. Right, I don't think I have futextest, or I might, I'd have to dig around a bit. > > @@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i > > if (!(this->bitset & bitset)) > > continue; > > > > - wake_futex(this); > > + wake_futex(&wake_list, this); > > > I guess this is OK. wake_futex_pi will always be one task I believe, so > the list syntax might confuse newcomers... Would it make sense to have a > wake_futex_list() call? Thinking outloud... To what purpose? Even delaying a single wakeup until after we release the hb lock is useful. 
Oh, it matters even on !-rt since the woken task can wake on another cpu and then spin on hb-lock. > > @@ -1437,6 +1441,7 @@ static int futex_requeue(u32 __user *uad > > put_futex_key(&key2); > > out_put_key1: > > put_futex_key(&key1); > > + wake_up_list(&wake_list, TASK_NORMAL); > > out: > > if (pi_state != NULL) > > free_pi_state(pi_state); > > > > > > I _think_ requeue_pi is in the clear here as it uses > requeue_pi_wake_futex, which calls wake_up_state directly. Still, some > testing with futextest functional/futex_requeue_pi is in order. Ah, right, that might want frobbing too.. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 15:51 ` Peter Zijlstra @ 2011-09-14 16:00 ` Darren Hart 2011-09-14 20:49 ` Thomas Gleixner 1 sibling, 0 replies; 33+ messages in thread From: Darren Hart @ 2011-09-14 16:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On 09/14/2011 08:51 AM, Peter Zijlstra wrote: > On Wed, 2011-09-14 at 08:46 -0700, Darren Hart wrote: >> >> On 09/14/2011 06:30 AM, Peter Zijlstra wrote: >>> Use the brand spanking new wake_list to delay the futex wakeups until >>> after we've released the hash bucket locks. This avoids the newly >>> woken tasks from immediately getting stuck on the hb lock. >>> >>> This is esp. painful on -rt, where the hb lock is preemptible. >> >> Nice! >> >> Have you run this through the functional and performance tests from >> futextest? Looks like I should also add a multiwake test to really >> showcase this. > > Not more functional than booting, but a very similar patch used to live > in 33-rt.. I lost the use-case we had that led to that patch, for -rt it > made a huge difference because we endlessly scheduled back and forth > between the waker and the wakee bouncing on the hb lock. > >> If you don't have it local I can setup a github repository for futextest >> until korg is back.... or do the testing myself... right. > > Right, I don't think I have futextest, or I might, I'd have to dig > around a bit. In case you want to grab a quick copy, I decided I didn't want to have a github repo lying around confusing people :) http://www.dvhart.com/darren/linux/futextest.tar.bz2 > >>> @@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i >>> if (!(this->bitset & bitset)) >>> continue; >>> >>> - wake_futex(this); >>> + wake_futex(&wake_list, this); >> >> >> I guess this is OK. 
wake_futex_pi will always be one task I believe, so >> the list syntax might confuse newcomers... Would it make sense to have a >> wake_futex_list() call? Thinking outloud... > > To what purpose? Even delaying a single wakeup until after we release > the hb lock is useful. Oh, it matters even on !-rt since the woken task > can wake on another cpu and then spin on hb-lock. Duh. You're correct of course. > >>> @@ -1437,6 +1441,7 @@ static int futex_requeue(u32 __user *uad >>> put_futex_key(&key2); >>> out_put_key1: >>> put_futex_key(&key1); >>> + wake_up_list(&wake_list, TASK_NORMAL); >>> out: >>> if (pi_state != NULL) >>> free_pi_state(pi_state); >>> >>> >> >> I _think_ requeue_pi is in the clear here as it uses >> requeue_pi_wake_futex, which calls wake_up_state directly. Still, some >> testing with futextest functional/futex_requeue_pi is in order. > > Ah, right, that might want frobbing too.. -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 15:51 ` Peter Zijlstra 2011-09-14 16:00 ` Darren Hart @ 2011-09-14 20:49 ` Thomas Gleixner 1 sibling, 0 replies; 33+ messages in thread From: Thomas Gleixner @ 2011-09-14 20:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Darren Hart, Ingo Molnar, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On Wed, 14 Sep 2011, Peter Zijlstra wrote: > On Wed, 2011-09-14 at 08:46 -0700, Darren Hart wrote: > > > > On 09/14/2011 06:30 AM, Peter Zijlstra wrote: > > > Use the brand spanking new wake_list to delay the futex wakeups until > > > after we've released the hash bucket locks. This avoids the newly > > > woken tasks from immediately getting stuck on the hb lock. > > > > > > This is esp. painful on -rt, where the hb lock is preemptible. > > > > Nice! > > > > Have you run this through the functional and performance tests from > > futextest? Looks like I should also add a multiwake test to really > > showcase this. > > Not more functional than booting, but a very similar patch used to live > in 33-rt.. I lost the use-case we had that led to that patch, for -rt it > made a huge difference because we endlessly scheduled back and forth > between the waker and the wakee bouncing on the hb lock. The use case was that utter trainwreck AMQP, which is bouncing futexes back and forth just to burn the maximum cpu cycles for no value. Thanks, tglx ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 13:30 ` [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention Peter Zijlstra 2011-09-14 15:46 ` Darren Hart @ 2011-09-16 12:34 ` Peter Zijlstra 2011-09-17 12:57 ` Manfred Spraul 1 sibling, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-16 12:34 UTC (permalink / raw) To: Ingo Molnar Cc: Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On Wed, 2011-09-14 at 15:30 +0200, Peter Zijlstra wrote: > @@ -964,6 +961,7 @@ futex_wake(u32 __user *uaddr, unsigned i > struct futex_q *this, *next; > struct plist_head *head; > union futex_key key = FUTEX_KEY_INIT; > + WAKE_LIST(wake_list); > int ret; > > if (!bitset) > @@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i > if (!(this->bitset & bitset)) > continue; > > - wake_futex(this); > + wake_futex(&wake_list, this); > if (++ret >= nr_wake) > break; > } > @@ -996,6 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned i > > spin_unlock(&hb->lock); > put_futex_key(&key); > + > + wake_up_list(&wake_list, TASK_NORMAL); > out: > return ret; > }

So while initially I thought the sem patch was busted, it turns out this one is.

Thomas managed to spot the race:

	Task-0				Task-1

	futex_wait()
	  queue_me()
					futex_wake()
					  wake_list_add();
					  __unqueue_futex();
					    plist_del();
	  if (!plist_node_empty())
	    __set_current_state(TASK_RUNNING);
					wake_up_list();
					/* waking an already running task-0 */

I guess the biggest question is, do we care? Ideally everything should be able to deal with spurious wakeups, although we generally try to avoid them.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-16 12:34 ` Peter Zijlstra @ 2011-09-17 12:57 ` Manfred Spraul 2011-09-19 7:37 ` Peter Zijlstra 0 siblings, 1 reply; 33+ messages in thread From: Manfred Spraul @ 2011-09-17 12:57 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On 09/16/2011 02:34 PM, Peter Zijlstra wrote: > So while initially I thought the sem patch was busted, it turns out this > one is. > > Thomas managed to spot the race: > > Task-0 Task-1 > > futex_wait() > queue_me() > > futex_wake() > wake_list_add(); > __unqueue_futex(); > plist_del(); > if (!plist_node_empty()) > __set_current_state(TASK_RUNNNIG); > > wake_up_list(); > /* waking an already running task-0 */ > > > I guess the biggest question is, do we care? Ideally everything should > be able to deal with spurious wakeups, although we generally try to > avoid them. > >

The sem patch also causes such wakeups:

	Task-0				Task-1

	semtimedop()
	  schedule_timeout()
					semtimedop()
					  wake_list_add();
					  q->status = 0;
	  <Timeout>
	  schedule_timeout() returns
	  if (q->status==0)
	      return;
	semtimedop() returns
	random user space/kernel space code
					  spin_unlock();
					wake_up_list();

It's a rare event, but that does happen. Which means: How do we verify that everything is able to deal with spurious wakeups? -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-17 12:57 ` Manfred Spraul @ 2011-09-19 7:37 ` Peter Zijlstra 2011-09-19 8:51 ` Peter Zijlstra 0 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-19 7:37 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Sat, 2011-09-17 at 14:57 +0200, Manfred Spraul wrote: > How do we verify that everything is able to deal with spurious > wakeups? > Well, I could go audit all 1400+ *schedule*() callsites in the kernel. Or we can rely on the fact that current code can already cause spurious wakeups due to signals. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-19 7:37 ` Peter Zijlstra @ 2011-09-19 8:51 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-19 8:51 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Mon, 2011-09-19 at 09:37 +0200, Peter Zijlstra wrote: > On Sat, 2011-09-17 at 14:57 +0200, Manfred Spraul wrote: > > How do we verify that everything is able to deal with spurious > > wakeups? > > > Well, I could go audit all 1400+ *schedule*() callsites in the kernel. > Or we can rely on the fact that current code can already cause spurious > wakeups due to signals. Hrmm,. the sem code would have serialized on the IN_WAKER stuff, the mutex code would serialize on the ->wait_lock, and the futex code would have serialized on the hb lock. So while it can issue multiple wakeups, those should not leak out of the locking primitive.. crap. Still wondering why we've got that many schedule() calls in the kernel though, clearly we don't have enough blocking primitives or so.. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-14 13:30 [RFC][PATCH 0/3] delayed wakeup list Peter Zijlstra 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra 2011-09-14 13:30 ` [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention Peter Zijlstra @ 2011-09-14 13:30 ` Peter Zijlstra 2011-09-15 17:29 ` Manfred Spraul 2011-09-14 13:51 ` [RFC][PATCH 0/3] delayed wakeup list Eric Dumazet 3 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 13:30 UTC (permalink / raw) To: Ingo Molnar, Thomas Gleixner Cc: linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith, Peter Zijlstra [-- Attachment #1: sem-wakeup-list.patch --] [-- Type: text/plain, Size: 13604 bytes --] The current, convoluted, wakeup scheme exists to avoid two races: - if queue.status is set after wake_up_process, then the woken up idle thread could race forward and try (and fail) to acquire sma->lock before update_queue had a chance to set queue.status - if queue.status is written before wake_up_process and if the blocked process is woken up by a signal between writing queue.status and the wake_up_process, then the woken up process could return from semtimedop and die by calling sys_exit before wake_up_process is called. Then wake_up_process will oops, because the task structure is already invalid. (yes, this happened on s390 with sysv msg). However, we can avoid the latter by keeping a ref on the task struct, and can thus use the new wake_list infrastructure to issue the wakeups and set q->status before the wakeup. This removes the home-brew busy-wait and the requirement to keep preemption disabled. 
Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- ipc/sem.c | 155 ++++++++++++++++++-------------------------------------------- 1 file changed, 45 insertions(+), 110 deletions(-) Index: linux-2.6/ipc/sem.c =================================================================== --- linux-2.6.orig/ipc/sem.c +++ linux-2.6/ipc/sem.c @@ -60,9 +60,6 @@ * anything - not even acquiring a lock or dropping a refcount. * - A woken up task may not even touch the semaphore array anymore, it may * have been destroyed already by a semctl(RMID). - * - The synchronizations between wake-ups due to a timeout/signal and a - * wake-up due to a completed semaphore operation is achieved by using an - * intermediate state (IN_WAKEUP). * - UNDO values are stored in an array (one per process and per * semaphore array, lazily allocated). For backwards compatibility, multiple * modes for the UNDO variables are supported (per process, per thread) @@ -199,33 +196,18 @@ static inline void sem_rmid(struct ipc_n * - queue.status is initialized to -EINTR before blocking. * - wakeup is performed by * * unlinking the queue entry from sma->sem_pending - * * setting queue.status to IN_WAKEUP + * * adding the q->sleeper to the wake_list + * * setting queue.status to error * This is the notification for the blocked thread that a * result value is imminent. - * * call wake_up_process - * * set queue.status to the final value. + * * - the previously blocked thread checks queue.status: - * * if it's IN_WAKEUP, then it must wait until the value changes * * if it's not -EINTR, then the operation was completed by * update_queue. semtimedop can return queue.status without * performing any operation on the sem array. * * otherwise it must acquire the spinlock and check what's up. 
* - * The two-stage algorithm is necessary to protect against the following - * races: - * - if queue.status is set after wake_up_process, then the woken up idle - * thread could race forward and try (and fail) to acquire sma->lock - * before update_queue had a chance to set queue.status - * - if queue.status is written before wake_up_process and if the - * blocked process is woken up by a signal between writing - * queue.status and the wake_up_process, then the woken up - * process could return from semtimedop and die by calling - * sys_exit before wake_up_process is called. Then wake_up_process - * will oops, because the task structure is already invalid. - * (yes, this happened on s390 with sysv msg). - * */ -#define IN_WAKEUP 1 /** * newary - Create a new semaphore set @@ -406,51 +388,39 @@ static int try_atomic_semop (struct sem_ return result; } -/** wake_up_sem_queue_prepare(q, error): Prepare wake-up +/** wake_up_sem_queue_prepare(wake_list, q, error): Prepare wake-up + * @wake_list: list to queue the to be woken task on * @q: queue entry that must be signaled * @error: Error value for the signal * * Prepare the wake-up of the queue entry q. */ -static void wake_up_sem_queue_prepare(struct list_head *pt, +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, struct sem_queue *q, int error) { - if (list_empty(pt)) { - /* - * Hold preempt off so that we don't get preempted and have the - * wakee busy-wait until we're scheduled back on. 
- */ - preempt_disable(); - } - q->status = IN_WAKEUP; - q->pid = error; + struct task_struct *p = ACCESS_ONCE(q->sleeper); - list_add_tail(&q->simple_list, pt); + get_task_struct(p); + q->status = error; + /* + * implies a full barrier + */ + wake_list_add(wake_list, p); + put_task_struct(p); } /** - * wake_up_sem_queue_do(pt) - do the actual wake-up - * @pt: list of tasks to be woken up + * wake_up_sem_queue_do(wake_list) - do the actual wake-up + * @wake_list: list of tasks to be woken up * * Do the actual wake-up. * The function is called without any locks held, thus the semaphore array * could be destroyed already and the tasks can disappear as soon as the * status is set to the actual return code. */ -static void wake_up_sem_queue_do(struct list_head *pt) +static void wake_up_sem_queue_do(struct wake_list_head *wake_list) { - struct sem_queue *q, *t; - int did_something; - - did_something = !list_empty(pt); - list_for_each_entry_safe(q, t, pt, simple_list) { - wake_up_process(q->sleeper); - /* q can disappear immediately after writing q->status. */ - smp_wmb(); - q->status = q->pid; - } - if (did_something) - preempt_enable(); + wake_up_list(wake_list, TASK_ALL); } static void unlink_queue(struct sem_array *sma, struct sem_queue *q) @@ -527,19 +497,20 @@ static int check_restart(struct sem_arra /** - * update_queue(sma, semnum): Look for tasks that can be completed. + * update_queue(sma, semnum, wake_list): Look for tasks that can be completed. * @sma: semaphore array. * @semnum: semaphore that was modified. - * @pt: list head for the tasks that must be woken up. + * @wake_list: list for the tasks that must be woken up. * * update_queue must be called after a semaphore in a semaphore array * was modified. If multiple semaphore were modified, then @semnum * must be set to -1. - * The tasks that must be woken up are added to @pt. The return code + * The tasks that must be woken up are added to @wake_list. The return code * is stored in q->pid. 
* The function return 1 if at least one semop was completed successfully. */ -static int update_queue(struct sem_array *sma, int semnum, struct list_head *pt) +static int +update_queue(struct sem_array *sma, int semnum, struct wake_list_head *wake_list) { struct sem_queue *q; struct list_head *walk; @@ -597,7 +568,7 @@ static int update_queue(struct sem_array restart = check_restart(sma, q); } - wake_up_sem_queue_prepare(pt, q, error); + wake_up_sem_queue_prepare(wake_list, q, error); if (restart) goto again; } @@ -605,26 +576,26 @@ static int update_queue(struct sem_array } /** - * do_smart_update(sma, sops, nsops, otime, pt) - optimized update_queue + * do_smart_update(sma, sops, nsops, otime, wake_list) - optimized update_queue * @sma: semaphore array * @sops: operations that were performed * @nsops: number of operations * @otime: force setting otime - * @pt: list head of the tasks that must be woken up. + * @wake_list: list of the tasks that must be woken up. * * do_smart_update() does the required called to update_queue, based on the * actual changes that were performed on the semaphore array. * Note that the function does not do the actual wake-up: the caller is - * responsible for calling wake_up_sem_queue_do(@pt). + * responsible for calling wake_up_sem_queue_do(@wake_list). * It is safe to perform this call after dropping all locks. 
*/ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsops, - int otime, struct list_head *pt) + int otime, struct wake_list_head *wake_list) { int i; if (sma->complex_count || sops == NULL) { - if (update_queue(sma, -1, pt)) + if (update_queue(sma, -1, wake_list)) otime = 1; goto done; } @@ -633,7 +604,7 @@ static void do_smart_update(struct sem_a if (sops[i].sem_op > 0 || (sops[i].sem_op < 0 && sma->sem_base[sops[i].sem_num].semval == 0)) - if (update_queue(sma, sops[i].sem_num, pt)) + if (update_queue(sma, sops[i].sem_num, wake_list)) otime = 1; } done: @@ -698,7 +669,7 @@ static void freeary(struct ipc_namespace struct sem_undo *un, *tu; struct sem_queue *q, *tq; struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm); - struct list_head tasks; + WAKE_LIST(wake_list); /* Free the existing undo structures for this semaphore set. */ assert_spin_locked(&sma->sem_perm.lock); @@ -712,17 +683,16 @@ static void freeary(struct ipc_namespace } /* Wake up all pending processes and let them fail with EIDRM. 
*/ - INIT_LIST_HEAD(&tasks); list_for_each_entry_safe(q, tq, &sma->sem_pending, list) { unlink_queue(sma, q); - wake_up_sem_queue_prepare(&tasks, q, -EIDRM); + wake_up_sem_queue_prepare(&wake_list, q, -EIDRM); } /* Remove the semaphore set from the IDR */ sem_rmid(ns, sma); sem_unlock(sma); - wake_up_sem_queue_do(&tasks); + wake_up_sem_queue_do(&wake_list); ns->used_sems -= sma->sem_nsems; security_sem_free(sma); ipc_rcu_putref(sma); @@ -846,13 +816,12 @@ static int semctl_main(struct ipc_namesp ushort fast_sem_io[SEMMSL_FAST]; ushort* sem_io = fast_sem_io; int nsems; - struct list_head tasks; + WAKE_LIST(wake_list); sma = sem_lock_check(ns, semid); if (IS_ERR(sma)) return PTR_ERR(sma); - INIT_LIST_HEAD(&tasks); nsems = sma->sem_nsems; err = -EACCES; @@ -941,7 +910,7 @@ static int semctl_main(struct ipc_namesp } sma->sem_ctime = get_seconds(); /* maybe some queued-up processes were waiting for this */ - do_smart_update(sma, NULL, 0, 0, &tasks); + do_smart_update(sma, NULL, 0, 0, &wake_list); err = 0; goto out_unlock; } @@ -983,14 +952,14 @@ static int semctl_main(struct ipc_namesp curr->sempid = task_tgid_vnr(current); sma->sem_ctime = get_seconds(); /* maybe some queued-up processes were waiting for this */ - do_smart_update(sma, NULL, 0, 0, &tasks); + do_smart_update(sma, NULL, 0, 0, &wake_list); err = 0; goto out_unlock; } } out_unlock: sem_unlock(sma); - wake_up_sem_queue_do(&tasks); + wake_up_sem_queue_do(&wake_list); out_free: if(sem_io != fast_sem_io) @@ -1255,32 +1224,6 @@ static struct sem_undo *find_alloc_undo( } -/** - * get_queue_result - Retrieve the result code from sem_queue - * @q: Pointer to queue structure - * - * Retrieve the return code from the pending queue. If IN_WAKEUP is found in - * q->status, then we must loop until the value is replaced with the final - * value: This may happen if a task is woken up by an unrelated event (e.g. - * signal) and in parallel the task is woken up by another task because it got - * the requested semaphores. 
- * - * The function can be called with or without holding the semaphore spinlock. - */ -static int get_queue_result(struct sem_queue *q) -{ - int error; - - error = q->status; - while (unlikely(error == IN_WAKEUP)) { - cpu_relax(); - error = q->status; - } - - return error; -} - - SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops, unsigned, nsops, const struct timespec __user *, timeout) { @@ -1293,7 +1236,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sem_queue queue; unsigned long jiffies_left = 0; struct ipc_namespace *ns; - struct list_head tasks; + WAKE_LIST(wake_list); ns = current->nsproxy->ipc_ns; @@ -1342,8 +1285,6 @@ SYSCALL_DEFINE4(semtimedop, int, semid, } else un = NULL; - INIT_LIST_HEAD(&tasks); - sma = sem_lock_check(ns, semid); if (IS_ERR(sma)) { if (un) @@ -1392,7 +1333,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current)); if (error <= 0) { if (alter && error == 0) - do_smart_update(sma, sops, nsops, 1, &tasks); + do_smart_update(sma, sops, nsops, 1, &wake_list); goto out_unlock_free; } @@ -1434,8 +1375,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, else schedule(); - error = get_queue_result(&queue); - + error = queue.status; if (error != -EINTR) { /* fast path: update_queue already obtained all requested * resources. @@ -1450,11 +1390,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, } sma = sem_lock(ns, semid); - - /* - * Wait until it's guaranteed that no wakeup_sem_queue_do() is ongoing. - */ - error = get_queue_result(&queue); + error = queue.status; /* * Array removed? If yes, leave without sem_unlock(). 
@@ -1484,7 +1420,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, out_unlock_free: sem_unlock(sma); - wake_up_sem_queue_do(&tasks); + wake_up_sem_queue_do(&wake_list); out_free: if(sops != fast_sops) kfree(sops); @@ -1545,7 +1481,7 @@ void exit_sem(struct task_struct *tsk) for (;;) { struct sem_array *sma; struct sem_undo *un; - struct list_head tasks; + WAKE_LIST(wake_list); int semid; int i; @@ -1610,10 +1546,9 @@ void exit_sem(struct task_struct *tsk) } } /* maybe some queued-up processes were waiting for this */ - INIT_LIST_HEAD(&tasks); - do_smart_update(sma, NULL, 0, 1, &tasks); + do_smart_update(sma, NULL, 0, 1, &wake_list); sem_unlock(sma); - wake_up_sem_queue_do(&tasks); + wake_up_sem_queue_do(&wake_list); kfree_rcu(un, rcu); } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-14 13:30 ` [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme Peter Zijlstra @ 2011-09-15 17:29 ` Manfred Spraul 2011-09-15 19:32 ` Peter Zijlstra ` (4 more replies) 0 siblings, 5 replies; 33+ messages in thread From: Manfred Spraul @ 2011-09-15 17:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith Hi Peter, On 09/14/2011 03:30 PM, Peter Zijlstra wrote: > This removes the home-brew busy-wait and the requirement to keep > preemption disabled. In the initial mail of the patch series, you write: > Patch 3 converts sysv sems, and is broken What is broken? > > /** > * newary - Create a new semaphore set > @@ -406,51 +388,39 @@ static int try_atomic_semop (struct sem_ > return result; > } > > -/** wake_up_sem_queue_prepare(q, error): Prepare wake-up > +/** wake_up_sem_queue_prepare(wake_list, q, error): Prepare wake-up > + * @wake_list: list to queue the to be woken task on > * @q: queue entry that must be signaled > * @error: Error value for the signal > * > * Prepare the wake-up of the queue entry q. > */ > -static void wake_up_sem_queue_prepare(struct list_head *pt, > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > struct sem_queue *q, int error) > { > - if (list_empty(pt)) { > - /* > - * Hold preempt off so that we don't get preempted and have the > - * wakee busy-wait until we're scheduled back on. 
> - */ > - preempt_disable(); > - } > - q->status = IN_WAKEUP; > - q->pid = error; > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > - list_add_tail(&q->simple_list, pt); > + get_task_struct(p); > + q->status = error; > + /* > + * implies a full barrier > + */ > + wake_list_add(wake_list, p); > + put_task_struct(p); > } I think the get_task_struct()/put_task_struct is not necessary: Just do the wake_list_add() before writing q->status: wake_list_add() is identical to list_add_tail(&q->simple_list, pt). [except that it contains additional locking, which doesn't matter here] > > /** > - * wake_up_sem_queue_do(pt) - do the actual wake-up > - * @pt: list of tasks to be woken up > + * wake_up_sem_queue_do(wake_list) - do the actual wake-up > + * @wake_list: list of tasks to be woken up > * > * Do the actual wake-up. > * The function is called without any locks held, thus the semaphore array > * could be destroyed already and the tasks can disappear as soon as the > * status is set to the actual return code. > */ > -static void wake_up_sem_queue_do(struct list_head *pt) > +static void wake_up_sem_queue_do(struct wake_list_head *wake_list) > { > - struct sem_queue *q, *t; > - int did_something; > - > - did_something = !list_empty(pt); > - list_for_each_entry_safe(q, t, pt, simple_list) { > - wake_up_process(q->sleeper); > - /* q can disappear immediately after writing q->status. */ > - smp_wmb(); > - q->status = q->pid; > - } > - if (did_something) > - preempt_enable(); > + wake_up_list(wake_list, TASK_ALL); > } > wake_up_list() calls wake_up_state() that calls try_to_wake_up(). try_to_wake_up() seems to return immediately when the state is TASK_DEAD. That leaves: Is it safe to call wake_up_list() in parallel with do_exit()? The current implementation avoids that. -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul @ 2011-09-15 19:32 ` Peter Zijlstra 2011-09-15 19:35 ` Peter Zijlstra ` (3 subsequent siblings) 4 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-15 19:32 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Thu, 2011-09-15 at 19:29 +0200, Manfred Spraul wrote: > Hi Peter, > What is broken? I'm not quite sure yet, but the results are that sembench doesn't complete properly; http://oss.oracle.com/~mason/sembench.c What seems to be happening is that we get spurious wakeups in the ipc/sem code resulting in semtimedop returning -EINTR, even though there's no pending signal. (there really should be a if (!signal_pending(current)) goto again thing in that semtimedop wait loop) Adding a loop in userspace like: again: ret = semtimedop(semid_lookup[l->id], &sb, 1, tvp); if (ret) { if (errno == EINTR) { l->spurious++; kill_tracer(); goto again; } perror("semtimedop"); } makes it complete again (although performance seems to suffer a lot compared to a kernel without this patch). It seems related to patch 2/3 converting the futex code, without that patch I can't seem to reproduce. All this is strange though, because if there were multiple wakeups on the same task wake_lists ought to result in fewer wakeups in total, not more. I've been trying to trace the thing but so far no luck.. when I enable too much tracing it goes away.. silly heisenbugger. 
> > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > > struct sem_queue *q, int error) > > { > > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > > > + get_task_struct(p); > > + q->status = error; > > + /* > > + * implies a full barrier > > + */ > > + wake_list_add(wake_list, p); > > + put_task_struct(p); > > } > I think the get_task_struct()/put_task_struct is not necessary: > Just do the wake_list_add() before writing q->status: > wake_list_add() is identical to list_add_tail(&q->simple_list, pt). > [except that it contains additional locking, which doesn't matter here] But the moment we write q->status, q can disappear right? Suppose the task gets a wakeup (say from a signal) right after we write q->status. Then p can disappear (do_exit) and we'd try to enqueue dead memory -> BOOM! > > +static void wake_up_sem_queue_do(struct wake_list_head *wake_list) > > { > > + wake_up_list(wake_list, TASK_ALL); > > } > > > wake_up_list() calls wake_up_state() that calls try_to_wake_up(). > try_to_wake_up() seems to return immediately when the state is TASK_DEAD. > > That leaves: Is it safe to call wake_up_list() in parallel with do_exit()? > The current implementation avoids that. Ah, wake_list_add() does get_task_struct() and wake_up_list() will first issue the wakeup and then drop the reference. Hrmm,. it looks like its all these atomic ops {get,put}_task_struct() that are causing the performance drop.. I just removed the ones in wake_up_sem_queue_prepare() just for kicks and I got about half my performance gap back. ^ permalink raw reply [flat|nested] 33+ messages in thread
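[ Editorial note: the use-after-free window Peter describes — write q->status, sleeper wakes and exits, then wake_list_add() touches dead memory — can be modelled in userspace. The sketch below uses hypothetical names and a toy refcount; it is an illustration of why the reference is taken before the status write, not kernel code. ]

```c
#include <stdatomic.h>
#include <assert.h>

/* Toy model of get_task_struct()/put_task_struct() around the status
 * write in wake_up_sem_queue_prepare(): the reference taken *before*
 * q->status becomes visible keeps the object alive even if the sleeper
 * wakes up and exits immediately afterwards. */
struct obj {
	atomic_int refs;
	int        freed;
};

void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refs, 1);
}

void obj_put(struct obj *o)
{
	/* Last reference dropped: the object goes away. */
	if (atomic_fetch_sub(&o->refs, 1) == 1)
		o->freed = 1;	/* stands in for the actual free */
}
```

The extra get/put pair is exactly the atomic traffic Peter measured as half the performance gap.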
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul 2011-09-15 19:32 ` Peter Zijlstra @ 2011-09-15 19:35 ` Peter Zijlstra 2011-09-15 19:45 ` Peter Zijlstra ` (2 subsequent siblings) 4 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-15 19:35 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Thu, 2011-09-15 at 21:32 +0200, Peter Zijlstra wrote: > > > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > > > struct sem_queue *q, int error) > > > { > > > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > > > > > + get_task_struct(p); > > > + q->status = error; > > > + /* > > > + * implies a full barrier > > > + */ > > > + wake_list_add(wake_list, p); > > > + put_task_struct(p); > > > } > > > I think the get_task_struct()/put_task_struct is not necessary: > > Just do the wake_list_add() before writing q->status: > > wake_list_add() is identical to list_add_tail(&q->simple_list, pt). > > [except that it contains additional locking, which doesn't matter here] OK, I can't read properly.. so the problem with doing the wake_list_add() before the write is that the wakeup can actually happen before the write in case p already had a wakeup queued. So far there isn't anybody else (except futexes) using this that could intersect with the wakeups.. but still it leaves me uneasy. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul 2011-09-15 19:32 ` Peter Zijlstra 2011-09-15 19:35 ` Peter Zijlstra @ 2011-09-15 19:45 ` Peter Zijlstra 2011-09-17 12:36 ` Manfred Spraul 2011-09-16 12:18 ` Peter Zijlstra 2011-09-16 12:39 ` Peter Zijlstra 4 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-15 19:45 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Thu, 2011-09-15 at 21:35 +0200, Peter Zijlstra wrote: > On Thu, 2011-09-15 at 21:32 +0200, Peter Zijlstra wrote: > > > > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > > > > struct sem_queue *q, int error) > > > > { > > > > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > > > > > > > + get_task_struct(p); > > > > + q->status = error; > > > > + /* > > > > + * implies a full barrier > > > > + */ > > > > + wake_list_add(wake_list, p); > > > > + put_task_struct(p); > > > > } > > > > > I think the get_task_struct()/put_task_struct is not necessary: > > > Just do the wake_list_add() before writing q->status: > > > wake_list_add() is identical to list_add_tail(&q->simple_list, pt). > > > [except that it contains additional locking, which doesn't matter here] > > OK, I can't read properly.. so the problem with doing the > wake_list_add() before the write is that the wakeup can actually happen > before the write in case p already had a wakeup queued. Ah, but if the wakeup happens early, we return from schedule with -EINTR and re-acquire the sem_lock and re-test. Since we do this update from under sem_lock it should serialize and come out all-right,.. right? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 19:45 ` Peter Zijlstra @ 2011-09-17 12:36 ` Manfred Spraul 0 siblings, 0 replies; 33+ messages in thread From: Manfred Spraul @ 2011-09-17 12:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On 09/15/2011 09:45 PM, Peter Zijlstra wrote: > On Thu, 2011-09-15 at 21:35 +0200, Peter Zijlstra wrote: >> On Thu, 2011-09-15 at 21:32 +0200, Peter Zijlstra wrote: >>>>> +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, >>>>> struct sem_queue *q, int error) >>>>> { >>>>> + struct task_struct *p = ACCESS_ONCE(q->sleeper); >>>>> >>>>> + get_task_struct(p); >>>>> + q->status = error; >>>>> + /* >>>>> + * implies a full barrier >>>>> + */ >>>>> + wake_list_add(wake_list, p); >>>>> + put_task_struct(p); >>>>> } >>>> I think the get_task_struct()/put_task_struct is not necessary: >>>> Just do the wake_list_add() before writing q->status: >>>> wake_list_add() is identical to list_add_tail(&q->simple_list, pt). >>>> [except that it contains additional locking, which doesn't matter here] >> OK, I can't read properly.. so the problem with doing the >> wake_list_add() before the write is that the wakeup can actually happen >> before the write in case p already had a wakeup queued. > Ah, but if the wakeup happens early, we return from schedule with -EINTR > and re-acquire the sem_lock and re-test. Since we do this update from > under sem_lock it should serialize and come out all-right,.. right? Correct. Just think about a timeout of semtimedop(). The code handles early wakeup properly, it will either return -EAGAIN or 0. (except for -EINTR). -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul ` (2 preceding siblings ...) 2011-09-15 19:45 ` Peter Zijlstra @ 2011-09-16 12:18 ` Peter Zijlstra 2011-09-17 12:32 ` Manfred Spraul 2011-09-16 12:39 ` Peter Zijlstra 4 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-16 12:18 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Thu, 2011-09-15 at 19:29 +0200, Manfred Spraul wrote: > What is broken? So basically sembench was broken and the futex patch is causing spurious wakeups. I've got the below patch to fix up the sem code. One more question, do the sem wakeups need to be issued in FIFO order? There's a comment in there: * User space visible behavior: * - FIFO ordering for semop() operations (just FIFO, not starvation * protection) that seems to suggest the sem ops processing is in FIFO order, but does the user visible effect propagate to the wakeup order? Currently the wake-list is a FILO, although making it FIFO isn't really hard (in fact, I've got the patch). --- Subject: ipc/sem: Deal with spurious wakeups From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Fri Sep 16 13:58:44 CEST 2011 The current code doesn't deal well with spurious wakeups and returns a -EINTR to user space even though there were no signals anywhere near the task. Deal with this to check for pending signals before actually dropping out of the kernel and try again, avoids user<->kernel round-trip overhead. 
Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-1uiuenzz5hwf04opwqmni7cn@git.kernel.org --- ipc/sem.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) Index: linux-2.6/ipc/sem.c =================================================================== --- linux-2.6.orig/ipc/sem.c +++ linux-2.6/ipc/sem.c @@ -1366,6 +1366,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, queue.status = -EINTR; queue.sleeper = current; +retry: current->state = TASK_INTERRUPTIBLE; sem_unlock(sma); @@ -1399,21 +1400,23 @@ SYSCALL_DEFINE4(semtimedop, int, semid, goto out_free; } - /* * If queue.status != -EINTR we are woken up by another process. * Leave without unlink_queue(), but with sem_unlock(). */ - if (error != -EINTR) { + if (error != -EINTR) goto out_unlock_free; - } /* * If an interrupt occurred we have to clean up the queue */ if (timeout && jiffies_left == 0) error = -EAGAIN; + + if (error == -EINTR && !signal_pending(current)) + goto retry; + unlink_queue(sma, &queue); out_unlock_free: ^ permalink raw reply [flat|nested] 33+ messages in thread
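[ Editorial note: the retry loop in the patch above is the kernel-side form of a familiar userspace rule — a sleeping waiter must re-check its wakeup condition after every return from the sleep primitive, because the return may be spurious. A minimal pthread analogue with hypothetical helper names: ]

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace analogue of the semtimedop() retry: q->status becomes a
 * 'done' flag, and the while loop is the equivalent of
 * "if (error == -EINTR && !signal_pending(current)) goto retry;". */
struct waiter {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            done;
};

void waiter_init(struct waiter *w)
{
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->done = false;
}

void waiter_wake(struct waiter *w)
{
	pthread_mutex_lock(&w->lock);
	w->done = true;			/* publish status under the lock */
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

void waiter_wait(struct waiter *w)
{
	pthread_mutex_lock(&w->lock);
	/* pthread_cond_wait() is explicitly allowed to return
	 * spuriously, so always loop on the predicate. */
	while (!w->done)
		pthread_cond_wait(&w->cond, &w->lock);
	pthread_mutex_unlock(&w->lock);
}
```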
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-16 12:18 ` Peter Zijlstra @ 2011-09-17 12:32 ` Manfred Spraul 0 siblings, 0 replies; 33+ messages in thread From: Manfred Spraul @ 2011-09-17 12:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On 09/16/2011 02:18 PM, Peter Zijlstra wrote: > On Thu, 2011-09-15 at 19:29 +0200, Manfred Spraul wrote: >> What is broken? > So basically sembench was broken and the futex patch is causing spurious > wakeups. > > I've got the below patch to fix up the sem code. > > One more question, do the sem wakeups need to be issued in FIFO order? > There's a comment in there: > > * User space visible behavior: > * - FIFO ordering for semop() operations (just FIFO, not starvation > * protection) > > that seems to suggest the sem ops processing is in FIFO order, but does > the user visible effect propagate to the wakeup order? I'd ask the question the other way around: Is the wakeup order user visible? IMHO: No, the scheduler might reorder the tasks anyway. > > /* > * If an interrupt occurred we have to clean up the queue > */ > if (timeout && jiffies_left == 0) > error = -EAGAIN; > + > + if (error == -EINTR && !signal_pending(current)) > + goto retry; > + > unlink_queue(sma, &queue); > > out_unlock_free: Good solution. -ERESTARTNOHAND would be even better, but this is definitely better than the current code. -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul ` (3 preceding siblings ...) 2011-09-16 12:18 ` Peter Zijlstra @ 2011-09-16 12:39 ` Peter Zijlstra 4 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-16 12:39 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Fri, 2011-09-16 at 14:18 +0200, Peter Zijlstra wrote: > Currently the wake-list is a FILO, although making it FIFO isn't really > hard (in fact, I've got the patch). might as well post it too --- Subject: From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Thu Sep 15 15:32:06 CEST 2011 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-mus5f94zr5rckytoysdn4pqd@git.kernel.org --- include/linux/sched.h | 11 ++++++++--- ipc/sem.c | 11 +++++++++-- 2 files changed, 17 insertions(+), 5 deletions(-) Index: linux-2.6/include/linux/sched.h =================================================================== --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -1067,6 +1067,7 @@ struct sched_domain; struct wake_list_head { struct wake_list_node *first; + struct wake_list_node *last; }; struct wake_list_node { @@ -1076,7 +1077,7 @@ struct wake_list_node { #define WAKE_LIST_TAIL ((struct wake_list_node *)0x01) #define WAKE_LIST(name) \ - struct wake_list_head name = { WAKE_LIST_TAIL } + struct wake_list_head name = { WAKE_LIST_TAIL, WAKE_LIST_TAIL } /* * wake flags @@ -2174,11 +2175,15 @@ wake_list_add(struct wake_list_head *hea * This cmpxchg() implies a full barrier, which pairs with the write * barrier implied by the wakeup in wake_up_list(). 
*/ - if (cmpxchg(&n->next, 0, head->first)) + if (cmpxchg(&n->next, 0, WAKE_LIST_TAIL)) return; get_task_struct(p); - head->first = n; + if (head->first == WAKE_LIST_TAIL) + head->first = n; + else + head->last->next = n; + head->last = n; } extern void wake_up_list(struct wake_list_head *head, unsigned int state); ^ permalink raw reply [flat|nested] 33+ messages in thread
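[ Editorial note: the first/last bookkeeping in the patch above can be modelled in userspace C11. The sketch mirrors the kernel names but is a single-producer illustration of the cmpxchg-guarded FIFO only — the real wake_list_add() also pins the task with get_task_struct(), and wake_up_list() issues the actual wakeups. ]

```c
#include <stdatomic.h>
#include <stddef.h>

/* node->next doubles as the "already queued" flag: a cmpxchg from NULL
 * claims the node, so a task already on some wake list is never added
 * twice.  WAKE_LIST_TAIL is a sentinel that is never a valid node. */
#define WAKE_LIST_TAIL ((struct wake_list_node *)0x1)

struct wake_list_node {
	_Atomic(struct wake_list_node *) next;
};

struct wake_list_head {
	struct wake_list_node *first;
	struct wake_list_node *last;
};

void wake_list_init(struct wake_list_head *head)
{
	head->first = head->last = WAKE_LIST_TAIL;
}

/* Returns 1 if queued, 0 if the node was already on a list. */
int wake_list_add(struct wake_list_head *head, struct wake_list_node *n)
{
	struct wake_list_node *expect = NULL;

	if (!atomic_compare_exchange_strong(&n->next, &expect,
					    WAKE_LIST_TAIL))
		return 0;	/* someone beat us to it */

	if (head->first == WAKE_LIST_TAIL)
		head->first = n;		/* list was empty */
	else
		atomic_store(&head->last->next, n); /* append: FIFO */
	head->last = n;
	return 1;
}

/* Pop in FIFO order; the kernel would call wake_up_state() on each. */
struct wake_list_node *wake_list_pop(struct wake_list_head *head)
{
	struct wake_list_node *n = head->first;

	if (n == WAKE_LIST_TAIL)
		return NULL;
	head->first = atomic_load(&n->next);
	if (head->first == WAKE_LIST_TAIL)
		head->last = WAKE_LIST_TAIL;
	atomic_store(&n->next, NULL);	/* node may be queued again */
	return n;
}
```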
* Re: [RFC][PATCH 0/3] delayed wakeup list 2011-09-14 13:30 [RFC][PATCH 0/3] delayed wakeup list Peter Zijlstra ` (2 preceding siblings ...) 2011-09-14 13:30 ` [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme Peter Zijlstra @ 2011-09-14 13:51 ` Eric Dumazet 2011-09-14 13:56 ` Peter Zijlstra 3 siblings, 1 reply; 33+ messages in thread From: Eric Dumazet @ 2011-09-14 13:51 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Mike Galbraith Le mercredi 14 septembre 2011 à 15:30 +0200, Peter Zijlstra a écrit : > This patch-set provides the infrastructure to delay/batch task wakeups. > > This is useful for locking primitives that can effect multiple wakeups > per operation and want to avoid lock internal lock contention by > delaying the wakeups until we've released the lock internal locks. > > Patch 2 converts futexes > Patch 3 converts sysv sems, and is broken > > [ I've been staring at patch 3 way too long, so I thought I'd post it just > to get a few more eyes on it.. ] > > Alternatively it can be used to avoid issuing multiple wakeups, and > thus save a few cycles, in packet processing. Queue all target tasks > and wakeup once you've processed all packets. That way you avoid > waking the target task multiple times if there were multiple packets > for the same task. > > No actual such usage yet, but ISTR talking to some net folks a long while back > about this, is there still interest, Dave, Eric? > Yes, I remember playing with such idea some years ago, to speedup multicast processing. Say you have 10 receivers on a multicast group, each incoming message actually wakeups 10 threads. If we receive a burst of 10 messages, we spend a lot of time in scheduler. So adding one queue to batch all scheduler works (and factorize some work if the same thread is queued), and perform the scheduler calls at the end of software IRQ for example was a win. 
^ permalink raw reply [flat|nested] 33+ messages in thread
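[ Editorial note: Eric's multicast scenario can be sketched as a toy model — hypothetical, not networking code. Each distinct receiver is remembered at most once per burst, and the wakeups are issued together at the end of the "software IRQ", so 10 messages to 10 receivers cost 10 wakeups instead of 100. ]

```c
#include <string.h>
#include <stdbool.h>

#define MAX_RCV 64

struct batch {
	int  pending[MAX_RCV];	/* receivers queued this burst */
	int  n;
	bool queued[MAX_RCV];	/* dedup: receiver already queued? */
	int  wakeups;		/* wake_up_process() calls issued */
};

void batch_init(struct batch *b)
{
	memset(b, 0, sizeof(*b));
}

/* Called once per (packet, receiver) during packet processing. */
void batch_queue(struct batch *b, int rcv)
{
	if (b->queued[rcv])
		return;		/* same thread again: work factorized */
	b->queued[rcv] = true;
	b->pending[b->n++] = rcv;
}

/* Called at the end of the software IRQ: one wakeup per receiver. */
void batch_flush(struct batch *b)
{
	for (int i = 0; i < b->n; i++) {
		b->wakeups++;	/* stands in for the real wakeup call */
		b->queued[b->pending[i]] = false;
	}
	b->n = 0;
}
```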
* Re: [RFC][PATCH 0/3] delayed wakeup list 2011-09-14 13:51 ` [RFC][PATCH 0/3] delayed wakeup list Eric Dumazet @ 2011-09-14 13:56 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 13:56 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Mike Galbraith On Wed, 2011-09-14 at 15:51 +0200, Eric Dumazet wrote: > Le mercredi 14 septembre 2011 à 15:30 +0200, Peter Zijlstra a écrit : > > This patch-set provides the infrastructure to delay/batch task wakeups. > > Alternatively it can be used to avoid issuing multiple wakeups, and > > thus save a few cycles, in packet processing. Queue all target tasks > > and wakeup once you've processed all packets. That way you avoid > > waking the target task multiple times if there were multiple packets > > for the same task. > > > > No actual such usage yet, but ISTR talking to some net folks a long while back > > about this, is there still interest, Dave, Eric? > > > > Yes, I remember playing with such idea some years ago, to speedup > multicast processing. > > Say you have 10 receivers on a multicast group, each incoming message > actually wakeups 10 threads. > > If we receive a burst of 10 messages, we spend a lot of time in > scheduler. > > So adding one queue to batch all scheduler works (and factorize some > work if the same thread is queued), and perform the scheduler calls at > the end of software IRQ for example was a win. Awesome, so my memory didn't trick me ;-) Patches 1 and 2 should be stable, its just 3 that's a bit troublesome. So if you have the bandwidth you could try this. ^ permalink raw reply [flat|nested] 33+ messages in thread