* [RFC][PATCH 0/3] delayed wakeup list
@ 2011-09-14 13:30 Peter Zijlstra
2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra
` (3 more replies)
0 siblings, 4 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-14 13:30 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner
Cc: linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul,
David Miller, Eric Dumazet, Mike Galbraith, Peter Zijlstra
This patch-set provides the infrastructure to delay/batch task wakeups.
This is useful for locking primitives that can effect multiple wakeups
per operation and want to avoid contention on their internal locks by
delaying the wakeups until those internal locks have been released.
Patch 2 converts futexes
Patch 3 converts sysv sems, and is broken
[ I've been staring at patch 3 way too long, so I thought I'd post it just
to get a few more eyes on it.. ]
Alternatively, it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing: queue all target tasks
and wake them up once you've processed all packets. That way you avoid
waking the target task multiple times when there were multiple packets
for the same task.
There's no actual such usage yet, but ISTR talking to some net folks a long
while back about this; is there still interest, Dave, Eric?
^ permalink raw reply	[flat|nested] 33+ messages in thread

* [RFC][PATCH 1/3] sched: Provide delayed wakeup list
  2011-09-14 13:30 [RFC][PATCH 0/3] delayed wakeup list Peter Zijlstra
@ 2011-09-14 13:30 ` Peter Zijlstra
  2011-09-14 13:50   ` Peter Zijlstra
  ` (5 more replies)
  2011-09-14 13:30 ` [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention Peter Zijlstra
  ` (2 subsequent siblings)
  3 siblings, 6 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-14 13:30 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner
Cc: linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul,
	David Miller, Eric Dumazet, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: sched-wakeup-list.patch --]
[-- Type: text/plain, Size: 3834 bytes --]

Provide means to queue wakeup targets for later mass wakeup.

This is useful for locking primitives that can effect multiple wakeups
per operation and want to avoid lock internal lock contention by
delaying the wakeups until we've released the lock internal locks.

Alternatively it can be used to avoid issuing multiple wakeups, and
thus save a few cycles, in packet processing. Queue all target tasks
and wakeup once you've processed all packets. That way you avoid
waking the target task multiple times if there were multiple packets
for the same task.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhart@linux.intel.com>
Cc: Manfred Spraul <manfred@colorfullife.com>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 include/linux/sched.h |   44 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched.c        |   21 +++++++++++++++++++++
 2 files changed, 65 insertions(+)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -1065,6 +1065,19 @@ struct uts_namespace;
 struct rq;
 struct sched_domain;
 
+struct wake_list_head {
+	struct wake_list_node *first;
+};
+
+struct wake_list_node {
+	struct wake_list_node *next;
+};
+
+#define WAKE_LIST_TAIL ((struct wake_list_node *)0x01)
+
+#define WAKE_LIST(name)					\
+	struct wake_list_head name = { WAKE_LIST_TAIL }
+
 /*
  * wake flags
  */
@@ -1255,6 +1268,8 @@ struct task_struct {
 	unsigned int btrace_seq;
 #endif
 
+	struct wake_list_node wake_list;
+
 	unsigned int policy;
 	cpumask_t cpus_allowed;
 
@@ -2143,6 +2158,35 @@ extern void wake_up_new_task(struct task
 extern void sched_fork(struct task_struct *p);
 extern void sched_dead(struct task_struct *p);
 
+static inline void
+wake_list_add(struct wake_list_head *head, struct task_struct *p)
+{
+	struct wake_list_node *n = &p->wake_list;
+
+	get_task_struct(p);
+	/*
+	 * Atomically grab the task, if ->wake_list is !0 already it means
+	 * its already queued (either by us or someone else) and will get the
+	 * wakeup due to that.
+	 *
+	 * This cmpxchg() implies a full barrier, which pairs with the write
+	 * barrier implied by the wakeup in wake_up_list().
+	 */
+	if (cmpxchg(&n->next, 0, n) != 0) {
+		/* It was already queued, drop the extra ref and we're done. */
+		put_task_struct(p);
+		return;
+	}
+
+	/*
+	 * The head is context local, there can be no concurrency.
+	 */
+	n->next = head->first;
+	head->first = n;
+}
+
+extern void wake_up_list(struct wake_list_head *head, unsigned int state);
+
 extern void proc_caches_init(void);
 extern void flush_signals(struct task_struct *);
 extern void __flush_signals(struct task_struct *);
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2916,6 +2916,25 @@ int wake_up_state(struct task_struct *p,
 	return try_to_wake_up(p, state, 0);
 }
 
+void wake_up_list(struct wake_list_head *head, unsigned int state)
+{
+	struct wake_list_node *n = head->first;
+	struct task_struct *p;
+
+	while (n != WAKE_LIST_TAIL) {
+		p = container_of(n, struct task_struct, wake_list);
+		n = n->next;
+
+		p->wake_list.next = NULL;
+		/*
+		 * wake_up_state() implies a wmb() to pair with the queueing
+		 * in wake_list_add() so as not to miss wakeups.
+		 */
+		wake_up_state(p, state);
+		put_task_struct(p);
+	}
+}
+
 /*
  * Perform scheduler related setup for a newly forked process p.
  * p is forked by current.
@@ -2943,6 +2962,8 @@ static void __sched_fork(struct task_str
 #ifdef CONFIG_PREEMPT_NOTIFIERS
 	INIT_HLIST_HEAD(&p->preempt_notifiers);
 #endif
+
+	p->wake_list.next = NULL;
 }
 
 /*

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list
  2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra
@ 2011-09-14 13:50   ` Peter Zijlstra
  2011-09-14 14:08   ` Eric Dumazet
  ` (4 subsequent siblings)
  5 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-14 13:50 UTC (permalink / raw)
To: Ingo Molnar
Cc: Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart,
	Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith

On Wed, 2011-09-14 at 15:30 +0200, Peter Zijlstra wrote:
> +	if (cmpxchg(&n->next, 0, n) != 0) {
> +		/* It was already queued, drop the extra ref and we're done. */
> +		put_task_struct(p);
> +		return;
> +	}
> +
> +	/*
> +	 * The head is context local, there can be no concurrency.
> +	 */
> +	n->next = head->first;

That could of course be folded into one op..

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra 2011-09-14 13:50 ` Peter Zijlstra @ 2011-09-14 14:08 ` Eric Dumazet 2011-09-14 14:12 ` Peter Zijlstra 2011-09-14 15:35 ` Darren Hart ` (3 subsequent siblings) 5 siblings, 1 reply; 33+ messages in thread From: Eric Dumazet @ 2011-09-14 14:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Mike Galbraith Le mercredi 14 septembre 2011 à 15:30 +0200, Peter Zijlstra a écrit : > pièce jointe document texte brut (sched-wakeup-list.patch) > Provide means to queue wakeup targets for later mass wakeup. > > This is useful for locking primitives that can effect multiple wakeups > per operation and want to avoid lock internal lock contention by > delaying the wakeups until we've released the lock internal locks. > > Alternatively it can be used to avoid issuing multiple wakeups, and > thus save a few cycles, in packet processing. Queue all target tasks > and wakeup once you've processed all packets. That way you avoid > waking the target task multiple times if there were multiple packets > for the same task. 
> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Darren Hart <dvhart@linux.intel.com> > Cc: Manfred Spraul <manfred@colorfullife.com> > Cc: David Miller <davem@davemloft.net> > Cc: Eric Dumazet <eric.dumazet@gmail.com> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > include/linux/sched.h | 44 ++++++++++++++++++++++++++++++++++++++++++++ > kernel/sched.c | 21 +++++++++++++++++++++ > 2 files changed, 65 insertions(+) > Index: linux-2.6/include/linux/sched.h > =================================================================== > --- linux-2.6.orig/include/linux/sched.h > +++ linux-2.6/include/linux/sched.h > @@ -1065,6 +1065,19 @@ struct uts_namespace; > struct rq; > struct sched_domain; > > +struct wake_list_head { > + struct wake_list_node *first; > +}; > + > +struct wake_list_node { > + struct wake_list_node *next; > +}; > + > +#define WAKE_LIST_TAIL ((struct wake_list_node *)0x01) > + > +#define WAKE_LIST(name) \ > + struct wake_list_head name = { WAKE_LIST_TAIL } > + > /* > * wake flags > */ > @@ -1255,6 +1268,8 @@ struct task_struct { > unsigned int btrace_seq; > #endif > > + struct wake_list_node wake_list; > + > unsigned int policy; > cpumask_t cpus_allowed; > > @@ -2143,6 +2158,35 @@ extern void wake_up_new_task(struct task > extern void sched_fork(struct task_struct *p); > extern void sched_dead(struct task_struct *p); > > +static inline void > +wake_list_add(struct wake_list_head *head, struct task_struct *p) > +{ > + struct wake_list_node *n = &p->wake_list; > + > + get_task_struct(p); > + /* > + * Atomically grab the task, if ->wake_list is !0 already it means > + * its already queued (either by us or someone else) and will get the > + * wakeup due to that. > + * > + * This cmpxchg() implies a full barrier, which pairs with the write > + * barrier implied by the wakeup in wake_up_list(). > + */ > + if (cmpxchg(&n->next, 0, n) != 0) { > + /* It was already queued, drop the extra ref and we're done. 
*/ > + put_task_struct(p); > + return; > + } > + > + /* > + * The head is context local, there can be no concurrency. > + */ > + n->next = head->first; > + head->first = n; > +} > + I dont understand why you make a get_task_struct() before the cmpxchg() and rollback. We certainly must hold a lock/mutex before calling wake_list_add() It could be : { if (cmpxchg(&n->next, NULL, head->first)) return; get_task_struct(p); head->first = n; } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 14:08 ` Eric Dumazet @ 2011-09-14 14:12 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 14:12 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Mike Galbraith On Wed, 2011-09-14 at 16:08 +0200, Eric Dumazet wrote: > I dont understand why you make a get_task_struct() before the cmpxchg() > and rollback. We certainly must hold a lock/mutex before calling > wake_list_add() > > It could be : > > { > if (cmpxchg(&n->next, NULL, head->first)) > return; > > get_task_struct(p); > head->first = n; > } You're quite right. No idea why I wrote it like that.. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra 2011-09-14 13:50 ` Peter Zijlstra 2011-09-14 14:08 ` Eric Dumazet @ 2011-09-14 15:35 ` Darren Hart 2011-09-14 15:39 ` Peter Zijlstra 2011-09-16 7:59 ` Paul Turner ` (2 subsequent siblings) 5 siblings, 1 reply; 33+ messages in thread From: Darren Hart @ 2011-09-14 15:35 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith, Michel Lespinasse On 09/14/2011 06:30 AM, Peter Zijlstra wrote: > Provide means to queue wakeup targets for later mass wakeup. > > This is useful for locking primitives that can effect multiple wakeups > per operation and want to avoid lock internal lock contention by > delaying the wakeups until we've released the lock internal locks. I believe Michel (on CC) was interested in a related sort of thing for read/write mechanisms with futexes. We discussed a way to accomplish what he was interested in without changes to futexes, but this wakeup list may also be of interest to him. > > Alternatively it can be used to avoid issuing multiple wakeups, and > thus save a few cycles, in packet processing. Queue all target tasks > and wakeup once you've processed all packets. That way you avoid > waking the target task multiple times if there were multiple packets > for the same task. 
> > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Darren Hart <dvhart@linux.intel.com> > Cc: Manfred Spraul <manfred@colorfullife.com> > Cc: David Miller <davem@davemloft.net> > Cc: Eric Dumazet <eric.dumazet@gmail.com> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > include/linux/sched.h | 44 ++++++++++++++++++++++++++++++++++++++++++++ > kernel/sched.c | 21 +++++++++++++++++++++ > 2 files changed, 65 insertions(+) > Index: linux-2.6/include/linux/sched.h > =================================================================== > --- linux-2.6.orig/include/linux/sched.h > +++ linux-2.6/include/linux/sched.h > @@ -1065,6 +1065,19 @@ struct uts_namespace; > struct rq; > struct sched_domain; > > +struct wake_list_head { > + struct wake_list_node *first; > +}; > + > +struct wake_list_node { > + struct wake_list_node *next; > +}; > + > +#define WAKE_LIST_TAIL ((struct wake_list_node *)0x01) > + > +#define WAKE_LIST(name) \ > + struct wake_list_head name = { WAKE_LIST_TAIL } > + > /* > * wake flags > */ > @@ -1255,6 +1268,8 @@ struct task_struct { > unsigned int btrace_seq; > #endif > > + struct wake_list_node wake_list; > + > unsigned int policy; > cpumask_t cpus_allowed; > > @@ -2143,6 +2158,35 @@ extern void wake_up_new_task(struct task > extern void sched_fork(struct task_struct *p); > extern void sched_dead(struct task_struct *p); > > +static inline void > +wake_list_add(struct wake_list_head *head, struct task_struct *p) > +{ > + struct wake_list_node *n = &p->wake_list; > + > + get_task_struct(p); > + /* > + * Atomically grab the task, if ->wake_list is !0 already it means > + * its already queued (either by us or someone else) and will get the > + * wakeup due to that. > + * > + * This cmpxchg() implies a full barrier, which pairs with the write > + * barrier implied by the wakeup in wake_up_list(). > + */ > + if (cmpxchg(&n->next, 0, n) != 0) { > + /* It was already queued, drop the extra ref and we're done. 
*/ > + put_task_struct(p); > + return; > + } > + > + /* > + * The head is context local, there can be no concurrency. > + */ > + n->next = head->first; > + head->first = n; > +} > + > +extern void wake_up_list(struct wake_list_head *head, unsigned int state); > + > extern void proc_caches_init(void); > extern void flush_signals(struct task_struct *); > extern void __flush_signals(struct task_struct *); > Index: linux-2.6/kernel/sched.c > =================================================================== > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -2916,6 +2916,25 @@ int wake_up_state(struct task_struct *p, > return try_to_wake_up(p, state, 0); > } > Maybe just my obsession with documentation and having my head in futex.c for so long, but I'd love to see some proper kerneldoc function commentary here... Admittedly, the inline comments are very helpful for the most part. Missing here is what the value of state should be when calling wake_up_list. > +void wake_up_list(struct wake_list_head *head, unsigned int state) > +{ > + struct wake_list_node *n = head->first; > + struct task_struct *p; > + > + while (n != WAKE_LIST_TAIL) { > + p = container_of(n, struct task_struct, wake_list); > + n = n->next; > + > + p->wake_list.next = NULL; > + /* > + * wake_up_state() implies a wmb() to pair with the queueing > + * in wake_list_add() so as not to miss wakeups. > + */ > + wake_up_state(p, state); > + put_task_struct(p); > + } > +} > + > /* > * Perform scheduler related setup for a newly forked process p. > * p is forked by current. > @@ -2943,6 +2962,8 @@ static void __sched_fork(struct task_str > #ifdef CONFIG_PREEMPT_NOTIFIERS > INIT_HLIST_HEAD(&p->preempt_notifiers); > #endif > + > + p->wake_list.next = NULL; > } > > /* > > -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 15:35 ` Darren Hart @ 2011-09-14 15:39 ` Peter Zijlstra 2011-09-14 15:49 ` Darren Hart 0 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 15:39 UTC (permalink / raw) To: Darren Hart Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith, Michel Lespinasse On Wed, 2011-09-14 at 08:35 -0700, Darren Hart wrote: > Missing here is what the value of state should be when > calling wake_up_list. Depends on what all you want to wake up :-) Would a reference to the comment of try_to_wake_up() be good enough? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 15:39 ` Peter Zijlstra @ 2011-09-14 15:49 ` Darren Hart 0 siblings, 0 replies; 33+ messages in thread From: Darren Hart @ 2011-09-14 15:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith, Michel Lespinasse On 09/14/2011 08:39 AM, Peter Zijlstra wrote: > On Wed, 2011-09-14 at 08:35 -0700, Darren Hart wrote: >> Missing here is what the value of state should be when >> calling wake_up_list. > > Depends on what all you want to wake up :-) Would a reference to the > comment of try_to_wake_up() be good enough? I found it by following the calls as well, and any serious developer could as well. I just find the kerneldoc headers to be very valuable as reference and I like to encourage their use when I can. This isn't a huge deal, just a nudge to see if it sticks ;) -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra ` (2 preceding siblings ...) 2011-09-14 15:35 ` Darren Hart @ 2011-09-16 7:59 ` Paul Turner 2011-09-16 7:59 ` Paul Turner 2011-10-02 14:01 ` Manfred Spraul 5 siblings, 0 replies; 33+ messages in thread From: Paul Turner @ 2011-09-16 7:59 UTC (permalink / raw) To: linux-kernel It's probably worth folding the idle_cpu fix for checking whether there's an enqueued task into this. - Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra ` (3 preceding siblings ...) 2011-09-16 7:59 ` Paul Turner @ 2011-09-16 7:59 ` Paul Turner 2011-09-16 8:48 ` Peter Zijlstra 2011-10-02 14:01 ` Manfred Spraul 5 siblings, 1 reply; 33+ messages in thread From: Paul Turner @ 2011-09-16 7:59 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith It's probably worth folding the idle_cpu fix for checking whether there's an enqueued task into this. - Paul ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list
  2011-09-16  7:59 ` Paul Turner
@ 2011-09-16  8:48   ` Peter Zijlstra
  0 siblings, 0 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-16 8:48 UTC (permalink / raw)
To: Paul Turner
Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt,
	Darren Hart, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith

On Fri, 2011-09-16 at 00:59 -0700, Paul Turner wrote:
> It's probably worth folding the idle_cpu fix for checking whether
> there's an enqueued task into this.

Yeah, I've actually got that in too..

---
Subject: sched: Fix idle_cpu()
From: Thomas Gleixner <tglx@linutronix.de>
Date: Thu Sep 15 15:32:06 CEST 2011

On -rt we observed hackbench waking all 400 tasks to a single cpu.
This is because of select_idle_sibling()'s interaction with the new
ipi based wakeup scheme.

The existing idle_cpu() test only checks to see if the current task on
that cpu is the idle task, it does not take already queued tasks into
account, nor does it take queued to be woken tasks into account.

If the remote wakeup IPIs come hard enough, there won't be time to
schedule away from the idle task, and we would thus keep thinking the
cpu was in fact idle, regardless of the fact that there were already
several hundred tasks runnable.

We couldn't reproduce on mainline, but there's no reason it couldn't
happen.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Link: http://lkml.kernel.org/n/tip-3o30p18b2paswpc9ohy2gltp@git.kernel.org
---
 kernel/sched.c |   15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -5152,7 +5152,20 @@ EXPORT_SYMBOL(task_nice);
  */
 int idle_cpu(int cpu)
 {
-	return cpu_curr(cpu) == cpu_rq(cpu)->idle;
+	struct rq *rq = cpu_rq(cpu);
+
+	if (rq->curr != rq->idle)
+		return 0;
+
+	if (rq->nr_running)
+		return 0;
+
+#ifdef CONFIG_SMP
+	if (!llist_empty(&rq->wake_list))
+		return 0;
+#endif
+
+	return 1;
 }
 
 /**

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra ` (4 preceding siblings ...) 2011-09-16 7:59 ` Paul Turner @ 2011-10-02 14:01 ` Manfred Spraul 2011-10-03 10:23 ` Peter Zijlstra 5 siblings, 1 reply; 33+ messages in thread From: Manfred Spraul @ 2011-10-02 14:01 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith Hi Peter, Do you still work on the wake_up_list() patch? On 09/14/2011 03:30 PM, Peter Zijlstra wrote: > /* > * wake flags > */ > @@ -1255,6 +1268,8 @@ struct task_struct { > unsigned int btrace_seq; > #endif > > + struct wake_list_node wake_list; > + > unsigned int policy; > cpumask_t cpus_allowed; A global wake_list > > @@ -2143,6 +2158,35 @@ extern void wake_up_new_task(struct task > extern void sched_fork(struct task_struct *p); > extern void sched_dead(struct task_struct *p); > > +static inline void > +wake_list_add(struct wake_list_head *head, struct task_struct *p) > +{ > + struct wake_list_node *n =&p->wake_list; > + > + get_task_struct(p); > + /* > + * Atomically grab the task, if ->wake_list is !0 already it means > + * its already queued (either by us or someone else) and will get the > + * wakeup due to that. > + * > + * This cmpxchg() implies a full barrier, which pairs with the write > + * barrier implied by the wakeup in wake_up_list(). > + */ > + if (cmpxchg(&n->next, 0, n) != 0) { > + /* It was already queued, drop the extra ref and we're done. */ > + put_task_struct(p); > + return; > + } > + A task can be only once on the wake_list. > + /* > + * The head is context local, there can be no concurrency. 
> + */ > + n->next = head->first; > + head->first = n; > +} > + > +extern void wake_up_list(struct wake_list_head *head, unsigned int state); > + > extern void proc_caches_init(void); > extern void flush_signals(struct task_struct *); > extern void __flush_signals(struct task_struct *); > Index: linux-2.6/kernel/sched.c > =================================================================== > --- linux-2.6.orig/kernel/sched.c > +++ linux-2.6/kernel/sched.c > @@ -2916,6 +2916,25 @@ int wake_up_state(struct task_struct *p, > return try_to_wake_up(p, state, 0); > } > > +void wake_up_list(struct wake_list_head *head, unsigned int state) > +{ > + struct wake_list_node *n = head->first; > + struct task_struct *p; > + > + while (n != WAKE_LIST_TAIL) { > + p = container_of(n, struct task_struct, wake_list); > + n = n->next; > + > + p->wake_list.next = NULL; > + /* > + * wake_up_state() implies a wmb() to pair with the queueing > + * in wake_list_add() so as not to miss wakeups. > + */ > + wake_up_state(p, state); > + put_task_struct(p); > + } > +} And wake_up_list() uses state. That can't work: What if one waker wants to wake TASK_INTERRUPTIBLE and the other waker wants to wake TASK_UNINTERRUPTIBLE|TASK_INTERRUPTIBLE? -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 1/3] sched: Provide delayed wakeup list 2011-10-02 14:01 ` Manfred Spraul @ 2011-10-03 10:23 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-10-03 10:23 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Sun, 2011-10-02 at 16:01 +0200, Manfred Spraul wrote: > Do you still work on the wake_up_list() patch? I plan to, but I got a little side tracked. We need to ensure all schedule() callers can deal with spurious wakeups. To that end I talked to the smatch author who kindly provided a test for this, but I still need to actually run it and see how bad the fallout is and fix stuff up. All core primitives 'should' be good, but there's a lot of shady stuff out there that isn't. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention
  2011-09-14 13:30 [RFC][PATCH 0/3] delayed wakeup list Peter Zijlstra
  2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra
@ 2011-09-14 13:30 ` Peter Zijlstra
  2011-09-14 15:46   ` Darren Hart
  2011-09-16 12:34   ` Peter Zijlstra
  2011-09-14 13:30 ` [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme Peter Zijlstra
  2011-09-14 13:51 ` [RFC][PATCH 0/3] delayed wakeup list Eric Dumazet
  3 siblings, 2 replies; 33+ messages in thread
From: Peter Zijlstra @ 2011-09-14 13:30 UTC (permalink / raw)
To: Ingo Molnar, Thomas Gleixner
Cc: linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul,
	David Miller, Eric Dumazet, Mike Galbraith, Peter Zijlstra

[-- Attachment #1: futex-wake-list.patch --]
[-- Type: text/plain, Size: 3694 bytes --]

Use the brand spanking new wake_list to delay the futex wakeups until
after we've released the hash bucket locks. This avoids the newly
woken tasks from immediately getting stuck on the hb lock.

This is esp. painful on -rt, where the hb lock is preemptible.

Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Darren Hart <dvhart@linux.intel.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 kernel/futex.c |   23 ++++++++++++++---------
 1 file changed, 14 insertions(+), 9 deletions(-)

Index: linux-2.6/kernel/futex.c
===================================================================
--- linux-2.6.orig/kernel/futex.c
+++ linux-2.6/kernel/futex.c
@@ -823,7 +823,7 @@ static void __unqueue_futex(struct futex
  * The hash bucket lock must be held when this is called.
  * Afterwards, the futex_q must not be accessed.
  */
-static void wake_futex(struct futex_q *q)
+static void wake_futex(struct wake_list_head *wake_list, struct futex_q *q)
 {
 	struct task_struct *p = q->task;
 
@@ -834,7 +834,7 @@ static void wake_futex(struct futex_q *q
 	 * struct. Prevent this by holding a reference on p across the
 	 * wake up.
 	 */
-	get_task_struct(p);
+	wake_list_add(wake_list, p);
 
 	__unqueue_futex(q);
 	/*
@@ -845,9 +845,6 @@ static void wake_futex(struct futex_q *q
 	 */
 	smp_wmb();
 	q->lock_ptr = NULL;
-
-	wake_up_state(p, TASK_NORMAL);
-	put_task_struct(p);
 }
 
 static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this)
@@ -964,6 +961,7 @@ futex_wake(u32 __user *uaddr, unsigned i
 	struct futex_q *this, *next;
 	struct plist_head *head;
 	union futex_key key = FUTEX_KEY_INIT;
+	WAKE_LIST(wake_list);
 	int ret;
 
 	if (!bitset)
@@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i
 		if (!(this->bitset & bitset))
 			continue;
 
-		wake_futex(this);
+		wake_futex(&wake_list, this);
 		if (++ret >= nr_wake)
 			break;
 	}
@@ -996,6 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned i
 
 	spin_unlock(&hb->lock);
 	put_futex_key(&key);
+
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	return ret;
 }
@@ -1012,6 +1012,7 @@ futex_wake_op(u32 __user *uaddr1, unsign
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head;
 	struct futex_q *this, *next;
+	WAKE_LIST(wake_list);
 	int ret, op_ret;
 
 retry:
@@ -1062,7 +1063,7 @@ futex_wake_op(u32 __user *uaddr1, unsign
 
 	plist_for_each_entry_safe(this, next, head, list) {
 		if (match_futex (&this->key, &key1)) {
-			wake_futex(this);
+			wake_futex(&wake_list, this);
 			if (++ret >= nr_wake)
 				break;
 		}
@@ -1074,7 +1075,7 @@ futex_wake_op(u32 __user *uaddr1, unsign
 		op_ret = 0;
 		plist_for_each_entry_safe(this, next, head, list) {
 			if (match_futex (&this->key, &key2)) {
-				wake_futex(this);
+				wake_futex(&wake_list, this);
 				if (++op_ret >= nr_wake2)
 					break;
 			}
@@ -1087,6 +1088,8 @@ futex_wake_op(u32 __user *uaddr1, unsign
 	put_futex_key(&key2);
 out_put_key1:
 	put_futex_key(&key1);
+
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	return ret;
 }
@@ -1239,6 +1242,7 @@ static int futex_requeue(u32 __user *uad
 	struct futex_hash_bucket *hb1, *hb2;
 	struct plist_head *head1;
 	struct futex_q *this, *next;
+	WAKE_LIST(wake_list);
 	u32 curval2;
 
 	if (requeue_pi) {
@@ -1384,7 +1388,7 @@ static int futex_requeue(u32 __user *uad
 		 * woken by futex_unlock_pi().
 		 */
 		if (++task_count <= nr_wake && !requeue_pi) {
-			wake_futex(this);
+			wake_futex(&wake_list, this);
 			continue;
 		}
 
@@ -1437,6 +1441,7 @@ static int futex_requeue(u32 __user *uad
 	put_futex_key(&key2);
 out_put_key1:
 	put_futex_key(&key1);
+	wake_up_list(&wake_list, TASK_NORMAL);
 out:
 	if (pi_state != NULL)
 		free_pi_state(pi_state);

^ permalink raw reply	[flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 13:30 ` [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention Peter Zijlstra @ 2011-09-14 15:46 ` Darren Hart 2011-09-14 15:51 ` Peter Zijlstra 2011-09-16 12:34 ` Peter Zijlstra 1 sibling, 1 reply; 33+ messages in thread From: Darren Hart @ 2011-09-14 15:46 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On 09/14/2011 06:30 AM, Peter Zijlstra wrote: > Use the brand spanking new wake_list to delay the futex wakeups until > after we've released the hash bucket locks. This avoids the newly > woken tasks from immediately getting stuck on the hb lock. > > This is esp. painful on -rt, where the hb lock is preemptible. Nice! Have you run this through the functional and performance tests from futextest? Looks like I should also add a multiwake test to really showcase this. If you don't have it local I can setup a github repository for futextest until korg is back.... or do the testing myself... right. > > Cc: Thomas Gleixner <tglx@linutronix.de> > Cc: Darren Hart <dvhart@linux.intel.com> > Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> > --- > kernel/futex.c | 23 ++++++++++++++--------- > 1 file changed, 14 insertions(+), 9 deletions(-) > > Index: linux-2.6/kernel/futex.c > =================================================================== > --- linux-2.6.orig/kernel/futex.c > +++ linux-2.6/kernel/futex.c > @@ -823,7 +823,7 @@ static void __unqueue_futex(struct futex > * The hash bucket lock must be held when this is called. > * Afterwards, the futex_q must not be accessed. > */ > -static void wake_futex(struct futex_q *q) > +static void wake_futex(struct wake_list_head *wake_list, struct futex_q *q) A good opportunity to add the proper kerneldoc to this function as well. 
> { > struct task_struct *p = q->task; > > @@ -834,7 +834,7 @@ static void wake_futex(struct futex_q *q > * struct. Prevent this by holding a reference on p across the > * wake up. > */ > - get_task_struct(p); > + wake_list_add(wake_list, p); > > __unqueue_futex(q); > /* > @@ -845,9 +845,6 @@ static void wake_futex(struct futex_q *q > */ > smp_wmb(); > q->lock_ptr = NULL; > - > - wake_up_state(p, TASK_NORMAL); > - put_task_struct(p); > } > > static int wake_futex_pi(u32 __user *uaddr, u32 uval, struct futex_q *this) > @@ -964,6 +961,7 @@ futex_wake(u32 __user *uaddr, unsigned i > struct futex_q *this, *next; > struct plist_head *head; > union futex_key key = FUTEX_KEY_INIT; > + WAKE_LIST(wake_list); > int ret; > > if (!bitset) > @@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i > if (!(this->bitset & bitset)) > continue; > > - wake_futex(this); > + wake_futex(&wake_list, this); I guess this is OK. wake_futex_pi will always be one task I believe, so the list syntax might confuse newcomers... Would it make sense to have a wake_futex_list() call? Thinking outloud... 
> if (++ret >= nr_wake) > break; > } > @@ -996,6 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned i > > spin_unlock(&hb->lock); > put_futex_key(&key); > + > + wake_up_list(&wake_list, TASK_NORMAL); > out: > return ret; > } > @@ -1012,6 +1012,7 @@ futex_wake_op(u32 __user *uaddr1, unsign > struct futex_hash_bucket *hb1, *hb2; > struct plist_head *head; > struct futex_q *this, *next; > + WAKE_LIST(wake_list); > int ret, op_ret; > > retry: > @@ -1062,7 +1063,7 @@ futex_wake_op(u32 __user *uaddr1, unsign > > plist_for_each_entry_safe(this, next, head, list) { > if (match_futex (&this->key, &key1)) { > - wake_futex(this); > + wake_futex(&wake_list, this); > if (++ret >= nr_wake) > break; > } > @@ -1074,7 +1075,7 @@ futex_wake_op(u32 __user *uaddr1, unsign > op_ret = 0; > plist_for_each_entry_safe(this, next, head, list) { > if (match_futex (&this->key, &key2)) { > - wake_futex(this); > + wake_futex(&wake_list, this); > if (++op_ret >= nr_wake2) > break; > } > @@ -1087,6 +1088,8 @@ futex_wake_op(u32 __user *uaddr1, unsign > put_futex_key(&key2); > out_put_key1: > put_futex_key(&key1); > + > + wake_up_list(&wake_list, TASK_NORMAL); > out: > return ret; > } > @@ -1239,6 +1242,7 @@ static int futex_requeue(u32 __user *uad > struct futex_hash_bucket *hb1, *hb2; > struct plist_head *head1; > struct futex_q *this, *next; > + WAKE_LIST(wake_list); > u32 curval2; > > if (requeue_pi) { > @@ -1384,7 +1388,7 @@ static int futex_requeue(u32 __user *uad > * woken by futex_unlock_pi(). > */ > if (++task_count <= nr_wake && !requeue_pi) { > - wake_futex(this); > + wake_futex(&wake_list, this); > continue; > } > > @@ -1437,6 +1441,7 @@ static int futex_requeue(u32 __user *uad > put_futex_key(&key2); > out_put_key1: > put_futex_key(&key1); > + wake_up_list(&wake_list, TASK_NORMAL); > out: > if (pi_state != NULL) > free_pi_state(pi_state); > > I _think_ requeue_pi is in the clear here as it uses requeue_pi_wake_futex, which calls wake_up_state directly. 
Still, some testing with futextest functional/futex_requeue_pi is in order. -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 15:46 ` Darren Hart @ 2011-09-14 15:51 ` Peter Zijlstra 2011-09-14 16:00 ` Darren Hart 2011-09-14 20:49 ` Thomas Gleixner 0 siblings, 2 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 15:51 UTC (permalink / raw) To: Darren Hart Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On Wed, 2011-09-14 at 08:46 -0700, Darren Hart wrote: > > On 09/14/2011 06:30 AM, Peter Zijlstra wrote: > > Use the brand spanking new wake_list to delay the futex wakeups until > > after we've released the hash bucket locks. This avoids the newly > > woken tasks from immediately getting stuck on the hb lock. > > > > This is esp. painful on -rt, where the hb lock is preemptible. > > Nice! > > Have you run this through the functional and performance tests from > futextest? Looks like I should also add a multiwake test to really > showcase this. Not more functional than booting, but a very similar patch used to live in 33-rt.. I lost the use-case we had that led to that patch, for -rt it made a huge difference because we endlessly scheduled back and forth between the waker and the wakee bouncing on the hb lock. > If you don't have it local I can setup a github repository for futextest > until korg is back.... or do the testing myself... right. Right, I don't think I have futextest, or I might, I'd have to dig around a bit. > > @@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i > > if (!(this->bitset & bitset)) > > continue; > > > > - wake_futex(this); > > + wake_futex(&wake_list, this); > > > I guess this is OK. wake_futex_pi will always be one task I believe, so > the list syntax might confuse newcomers... Would it make sense to have a > wake_futex_list() call? Thinking outloud... To what purpose? Even delaying a single wakeup until after we release the hb lock is useful. 
Oh, it matters even on !-rt since the woken task can wake on another cpu and then spin on hb-lock. > > @@ -1437,6 +1441,7 @@ static int futex_requeue(u32 __user *uad > > put_futex_key(&key2); > > out_put_key1: > > put_futex_key(&key1); > > + wake_up_list(&wake_list, TASK_NORMAL); > > out: > > if (pi_state != NULL) > > free_pi_state(pi_state); > > > > > > I _think_ requeue_pi is in the clear here as it uses > requeue_pi_wake_futex, which calls wake_up_state directly. Still, some > testing with futextest functional/futex_requeue_pi is in order. Ah, right, that might want frobbing too.. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 15:51 ` Peter Zijlstra @ 2011-09-14 16:00 ` Darren Hart 2011-09-14 20:49 ` Thomas Gleixner 1 sibling, 0 replies; 33+ messages in thread From: Darren Hart @ 2011-09-14 16:00 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On 09/14/2011 08:51 AM, Peter Zijlstra wrote: > On Wed, 2011-09-14 at 08:46 -0700, Darren Hart wrote: >> >> On 09/14/2011 06:30 AM, Peter Zijlstra wrote: >>> Use the brand spanking new wake_list to delay the futex wakeups until >>> after we've released the hash bucket locks. This avoids the newly >>> woken tasks from immediately getting stuck on the hb lock. >>> >>> This is esp. painful on -rt, where the hb lock is preemptible. >> >> Nice! >> >> Have you run this through the functional and performance tests from >> futextest? Looks like I should also add a multiwake test to really >> showcase this. > > Not more functional than booting, but a very similar patch used to live > in 33-rt.. I lost the use-case we had that led to that patch, for -rt it > made a huge difference because we endlessly scheduled back and forth > between the waker and the wakee bouncing on the hb lock. > >> If you don't have it local I can setup a github repository for futextest >> until korg is back.... or do the testing myself... right. > > Right, I don't think I have futextest, or I might, I'd have to dig > around a bit. In case you want to grab a quick copy, I decided I didn't want to have a github repo lying around confusing people :) http://www.dvhart.com/darren/linux/futextest.tar.bz2 > >>> @@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i >>> if (!(this->bitset & bitset)) >>> continue; >>> >>> - wake_futex(this); >>> + wake_futex(&wake_list, this); >> >> >> I guess this is OK. 
wake_futex_pi will always be one task I believe, so >> the list syntax might confuse newcomers... Would it make sense to have a >> wake_futex_list() call? Thinking outloud... > > To what purpose? Even delaying a single wakeup until after we release > the hb lock is useful. Oh, it matters even on !-rt since the woken task > can wake on another cpu and then spin on hb-lock. Duh. You're correct of course. > >>> @@ -1437,6 +1441,7 @@ static int futex_requeue(u32 __user *uad >>> put_futex_key(&key2); >>> out_put_key1: >>> put_futex_key(&key1); >>> + wake_up_list(&wake_list, TASK_NORMAL); >>> out: >>> if (pi_state != NULL) >>> free_pi_state(pi_state); >>> >>> >> >> I _think_ requeue_pi is in the clear here as it uses >> requeue_pi_wake_futex, which calls wake_up_state directly. Still, some >> testing with futextest functional/futex_requeue_pi is in order. > > Ah, right, that might want frobbing too.. -- Darren Hart Intel Open Source Technology Center Yocto Project - Linux Kernel ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 15:51 ` Peter Zijlstra 2011-09-14 16:00 ` Darren Hart @ 2011-09-14 20:49 ` Thomas Gleixner 1 sibling, 0 replies; 33+ messages in thread From: Thomas Gleixner @ 2011-09-14 20:49 UTC (permalink / raw) To: Peter Zijlstra Cc: Darren Hart, Ingo Molnar, linux-kernel, Steven Rostedt, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On Wed, 14 Sep 2011, Peter Zijlstra wrote: > On Wed, 2011-09-14 at 08:46 -0700, Darren Hart wrote: > > > > On 09/14/2011 06:30 AM, Peter Zijlstra wrote: > > > Use the brand spanking new wake_list to delay the futex wakeups until > > > after we've released the hash bucket locks. This avoids the newly > > > woken tasks from immediately getting stuck on the hb lock. > > > > > > This is esp. painful on -rt, where the hb lock is preemptible. > > > > Nice! > > > > Have you run this through the functional and performance tests from > > futextest? Looks like I should also add a multiwake test to really > > showcase this. > > Not more functional than booting, but a very similar patch used to live > in 33-rt.. I lost the use-case we had that led to that patch, for -rt it > made a huge difference because we endlessly scheduled back and forth > between the waker and the wakee bouncing on the hb lock. The use case was that utter trainwreck AMQP, which is bouncing futexes back and forth just to burn the maximum cpu cycles for no value. Thanks, tglx ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-14 13:30 ` [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention Peter Zijlstra 2011-09-14 15:46 ` Darren Hart @ 2011-09-16 12:34 ` Peter Zijlstra 2011-09-17 12:57 ` Manfred Spraul 1 sibling, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-16 12:34 UTC (permalink / raw) To: Ingo Molnar Cc: Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith On Wed, 2011-09-14 at 15:30 +0200, Peter Zijlstra wrote: > @@ -964,6 +961,7 @@ futex_wake(u32 __user *uaddr, unsigned i > struct futex_q *this, *next; > struct plist_head *head; > union futex_key key = FUTEX_KEY_INIT; > + WAKE_LIST(wake_list); > int ret; > > if (!bitset) > @@ -988,7 +986,7 @@ futex_wake(u32 __user *uaddr, unsigned i > if (!(this->bitset & bitset)) > continue; > > - wake_futex(this); > + wake_futex(&wake_list, this); > if (++ret >= nr_wake) > break; > } > @@ -996,6 +994,8 @@ futex_wake(u32 __user *uaddr, unsigned i > > spin_unlock(&hb->lock); > put_futex_key(&key); > + > + wake_up_list(&wake_list, TASK_NORMAL); > out: > return ret; > }

So while initially I thought the sem patch was busted, it turns out this one is.

Thomas managed to spot the race:

	Task-0				Task-1

	futex_wait()
	  queue_me()
					futex_wake()
					  wake_list_add();
					  __unqueue_futex();
					    plist_del();
	  if (!plist_node_empty())
	    __set_current_state(TASK_RUNNING);
					wake_up_list();
					/* waking an already running task-0 */

I guess the biggest question is, do we care? Ideally everything should be able to deal with spurious wakeups, although we generally try to avoid them.

^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-16 12:34 ` Peter Zijlstra @ 2011-09-17 12:57 ` Manfred Spraul 2011-09-19 7:37 ` Peter Zijlstra 0 siblings, 1 reply; 33+ messages in thread From: Manfred Spraul @ 2011-09-17 12:57 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On 09/16/2011 02:34 PM, Peter Zijlstra wrote: > So while initially I thought the sem patch was busted, it turns out this > one is. > > Thomas managed to spot the race: > > Task-0 Task-1 > > futex_wait() > queue_me() > > futex_wake() > wake_list_add(); > __unqueue_futex(); > plist_del(); > if (!plist_node_empty()) > __set_current_state(TASK_RUNNNIG); > > wake_up_list(); > /* waking an already running task-0 */ > > > I guess the biggest question is, do we care? Ideally everything should > be able to deal with spurious wakeups, although we generally try to > avoid them. > >

The sem patch also causes such wakeups:

	Task-0				Task-1

	semtimedop()
	  schedule_timeout()
					semtimedop()
					  wake_list_add();
					  q->status = 0;
	  <Timeout>
	  schedule_timeout() returns
	  if (q->status==0)
	      return;
	semtimedop() returns
	random user space/kernel space code
					  spin_unlock();
					wake_up_list();

It's a rare event, but that does happen. Which means: How do we verify that everything is able to deal with spurious wakeups? -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-17 12:57 ` Manfred Spraul @ 2011-09-19 7:37 ` Peter Zijlstra 2011-09-19 8:51 ` Peter Zijlstra 0 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-19 7:37 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Sat, 2011-09-17 at 14:57 +0200, Manfred Spraul wrote: > How do we verify that everything is able to deal with spurious > wakeups? > Well, I could go audit all 1400+ *schedule*() callsites in the kernel. Or we can rely on the fact that current code can already cause spurious wakeups due to signals. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention 2011-09-19 7:37 ` Peter Zijlstra @ 2011-09-19 8:51 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-19 8:51 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Mon, 2011-09-19 at 09:37 +0200, Peter Zijlstra wrote: > On Sat, 2011-09-17 at 14:57 +0200, Manfred Spraul wrote: > > How do we verify that everything is able to deal with spurious > > wakeups? > > > Well, I could go audit all 1400+ *schedule*() callsites in the kernel. > Or we can rely on the fact that current code can already cause spurious > wakeups due to signals. Hrmm,. the sem code would have serialized on the IN_WAKER stuff, the mutex code would serialize on the ->wait_lock, and the futex code would have serialized on the hb lock. So while it can issue multiple wakeups, those should not leak out of the locking primitive.. crap. Still wondering why we've got that many schedule() calls in the kernel though, clearly we don't have enough blocking primitives or so.. ^ permalink raw reply [flat|nested] 33+ messages in thread
* [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-14 13:30 [RFC][PATCH 0/3] delayed wakeup list Peter Zijlstra 2011-09-14 13:30 ` [RFC][PATCH 1/3] sched: Provide " Peter Zijlstra 2011-09-14 13:30 ` [RFC][PATCH 2/3] futex: Reduce hash bucket lock contention Peter Zijlstra @ 2011-09-14 13:30 ` Peter Zijlstra 2011-09-15 17:29 ` Manfred Spraul 2011-09-14 13:51 ` [RFC][PATCH 0/3] delayed wakeup list Eric Dumazet 3 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 13:30 UTC (permalink / raw) To: Ingo Molnar, Thomas Gleixner Cc: linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Eric Dumazet, Mike Galbraith, Peter Zijlstra [-- Attachment #1: sem-wakeup-list.patch --] [-- Type: text/plain, Size: 13604 bytes --] The current, convoluted, wakeup scheme exists to avoid two races: - if queue.status is set after wake_up_process, then the woken up idle thread could race forward and try (and fail) to acquire sma->lock before update_queue had a chance to set queue.status - if queue.status is written before wake_up_process and if the blocked process is woken up by a signal between writing queue.status and the wake_up_process, then the woken up process could return from semtimedop and die by calling sys_exit before wake_up_process is called. Then wake_up_process will oops, because the task structure is already invalid. (yes, this happened on s390 with sysv msg). However, we can avoid the latter by keeping a ref on the task struct, and can thus use the new wake_list infrastructure to issue the wakeups and set q->status before the wakeup. This removes the home-brew busy-wait and the requirement to keep preemption disabled. 
Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- ipc/sem.c | 155 ++++++++++++++++++-------------------------------------------- 1 file changed, 45 insertions(+), 110 deletions(-) Index: linux-2.6/ipc/sem.c =================================================================== --- linux-2.6.orig/ipc/sem.c +++ linux-2.6/ipc/sem.c @@ -60,9 +60,6 @@ * anything - not even acquiring a lock or dropping a refcount. * - A woken up task may not even touch the semaphore array anymore, it may * have been destroyed already by a semctl(RMID). - * - The synchronizations between wake-ups due to a timeout/signal and a - * wake-up due to a completed semaphore operation is achieved by using an - * intermediate state (IN_WAKEUP). * - UNDO values are stored in an array (one per process and per * semaphore array, lazily allocated). For backwards compatibility, multiple * modes for the UNDO variables are supported (per process, per thread) @@ -199,33 +196,18 @@ static inline void sem_rmid(struct ipc_n * - queue.status is initialized to -EINTR before blocking. * - wakeup is performed by * * unlinking the queue entry from sma->sem_pending - * * setting queue.status to IN_WAKEUP + * * adding the q->sleeper to the wake_list + * * setting queue.status to error * This is the notification for the blocked thread that a * result value is imminent. - * * call wake_up_process - * * set queue.status to the final value. + * * - the previously blocked thread checks queue.status: - * * if it's IN_WAKEUP, then it must wait until the value changes * * if it's not -EINTR, then the operation was completed by * update_queue. semtimedop can return queue.status without * performing any operation on the sem array. * * otherwise it must acquire the spinlock and check what's up. 
* - * The two-stage algorithm is necessary to protect against the following - * races: - * - if queue.status is set after wake_up_process, then the woken up idle - * thread could race forward and try (and fail) to acquire sma->lock - * before update_queue had a chance to set queue.status - * - if queue.status is written before wake_up_process and if the - * blocked process is woken up by a signal between writing - * queue.status and the wake_up_process, then the woken up - * process could return from semtimedop and die by calling - * sys_exit before wake_up_process is called. Then wake_up_process - * will oops, because the task structure is already invalid. - * (yes, this happened on s390 with sysv msg). - * */ -#define IN_WAKEUP 1 /** * newary - Create a new semaphore set @@ -406,51 +388,39 @@ static int try_atomic_semop (struct sem_ return result; } -/** wake_up_sem_queue_prepare(q, error): Prepare wake-up +/** wake_up_sem_queue_prepare(wake_list, q, error): Prepare wake-up + * @wake_list: list to queue the to be woken task on * @q: queue entry that must be signaled * @error: Error value for the signal * * Prepare the wake-up of the queue entry q. */ -static void wake_up_sem_queue_prepare(struct list_head *pt, +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, struct sem_queue *q, int error) { - if (list_empty(pt)) { - /* - * Hold preempt off so that we don't get preempted and have the - * wakee busy-wait until we're scheduled back on. 
- */ - preempt_disable(); - } - q->status = IN_WAKEUP; - q->pid = error; + struct task_struct *p = ACCESS_ONCE(q->sleeper); - list_add_tail(&q->simple_list, pt); + get_task_struct(p); + q->status = error; + /* + * implies a full barrier + */ + wake_list_add(wake_list, p); + put_task_struct(p); } /** - * wake_up_sem_queue_do(pt) - do the actual wake-up - * @pt: list of tasks to be woken up + * wake_up_sem_queue_do(wake_list) - do the actual wake-up + * @wake_list: list of tasks to be woken up * * Do the actual wake-up. * The function is called without any locks held, thus the semaphore array * could be destroyed already and the tasks can disappear as soon as the * status is set to the actual return code. */ -static void wake_up_sem_queue_do(struct list_head *pt) +static void wake_up_sem_queue_do(struct wake_list_head *wake_list) { - struct sem_queue *q, *t; - int did_something; - - did_something = !list_empty(pt); - list_for_each_entry_safe(q, t, pt, simple_list) { - wake_up_process(q->sleeper); - /* q can disappear immediately after writing q->status. */ - smp_wmb(); - q->status = q->pid; - } - if (did_something) - preempt_enable(); + wake_up_list(wake_list, TASK_ALL); } static void unlink_queue(struct sem_array *sma, struct sem_queue *q) @@ -527,19 +497,20 @@ static int check_restart(struct sem_arra /** - * update_queue(sma, semnum): Look for tasks that can be completed. + * update_queue(sma, semnum, wake_list): Look for tasks that can be completed. * @sma: semaphore array. * @semnum: semaphore that was modified. - * @pt: list head for the tasks that must be woken up. + * @wake_list: list for the tasks that must be woken up. * * update_queue must be called after a semaphore in a semaphore array * was modified. If multiple semaphore were modified, then @semnum * must be set to -1. - * The tasks that must be woken up are added to @pt. The return code + * The tasks that must be woken up are added to @wake_list. The return code * is stored in q->pid. 
* The function return 1 if at least one semop was completed successfully. */ -static int update_queue(struct sem_array *sma, int semnum, struct list_head *pt) +static int +update_queue(struct sem_array *sma, int semnum, struct wake_list_head *wake_list) { struct sem_queue *q; struct list_head *walk; @@ -597,7 +568,7 @@ static int update_queue(struct sem_array restart = check_restart(sma, q); } - wake_up_sem_queue_prepare(pt, q, error); + wake_up_sem_queue_prepare(wake_list, q, error); if (restart) goto again; } @@ -605,26 +576,26 @@ static int update_queue(struct sem_array } /** - * do_smart_update(sma, sops, nsops, otime, pt) - optimized update_queue + * do_smart_update(sma, sops, nsops, otime, wake_list) - optimized update_queue * @sma: semaphore array * @sops: operations that were performed * @nsops: number of operations * @otime: force setting otime - * @pt: list head of the tasks that must be woken up. + * @wake_list: list of the tasks that must be woken up. * * do_smart_update() does the required called to update_queue, based on the * actual changes that were performed on the semaphore array. * Note that the function does not do the actual wake-up: the caller is - * responsible for calling wake_up_sem_queue_do(@pt). + * responsible for calling wake_up_sem_queue_do(@wake_list). * It is safe to perform this call after dropping all locks. 
*/ static void do_smart_update(struct sem_array *sma, struct sembuf *sops, int nsops, - int otime, struct list_head *pt) + int otime, struct wake_list_head *wake_list) { int i; if (sma->complex_count || sops == NULL) { - if (update_queue(sma, -1, pt)) + if (update_queue(sma, -1, wake_list)) otime = 1; goto done; } @@ -633,7 +604,7 @@ static void do_smart_update(struct sem_a if (sops[i].sem_op > 0 || (sops[i].sem_op < 0 && sma->sem_base[sops[i].sem_num].semval == 0)) - if (update_queue(sma, sops[i].sem_num, pt)) + if (update_queue(sma, sops[i].sem_num, wake_list)) otime = 1; } done: @@ -698,7 +669,7 @@ static void freeary(struct ipc_namespace struct sem_undo *un, *tu; struct sem_queue *q, *tq; struct sem_array *sma = container_of(ipcp, struct sem_array, sem_perm); - struct list_head tasks; + WAKE_LIST(wake_list); /* Free the existing undo structures for this semaphore set. */ assert_spin_locked(&sma->sem_perm.lock); @@ -712,17 +683,16 @@ static void freeary(struct ipc_namespace } /* Wake up all pending processes and let them fail with EIDRM. 
*/ - INIT_LIST_HEAD(&tasks); list_for_each_entry_safe(q, tq, &sma->sem_pending, list) { unlink_queue(sma, q); - wake_up_sem_queue_prepare(&tasks, q, -EIDRM); + wake_up_sem_queue_prepare(&wake_list, q, -EIDRM); } /* Remove the semaphore set from the IDR */ sem_rmid(ns, sma); sem_unlock(sma); - wake_up_sem_queue_do(&tasks); + wake_up_sem_queue_do(&wake_list); ns->used_sems -= sma->sem_nsems; security_sem_free(sma); ipc_rcu_putref(sma); @@ -846,13 +816,12 @@ static int semctl_main(struct ipc_namesp ushort fast_sem_io[SEMMSL_FAST]; ushort* sem_io = fast_sem_io; int nsems; - struct list_head tasks; + WAKE_LIST(wake_list); sma = sem_lock_check(ns, semid); if (IS_ERR(sma)) return PTR_ERR(sma); - INIT_LIST_HEAD(&tasks); nsems = sma->sem_nsems; err = -EACCES; @@ -941,7 +910,7 @@ static int semctl_main(struct ipc_namesp } sma->sem_ctime = get_seconds(); /* maybe some queued-up processes were waiting for this */ - do_smart_update(sma, NULL, 0, 0, &tasks); + do_smart_update(sma, NULL, 0, 0, &wake_list); err = 0; goto out_unlock; } @@ -983,14 +952,14 @@ static int semctl_main(struct ipc_namesp curr->sempid = task_tgid_vnr(current); sma->sem_ctime = get_seconds(); /* maybe some queued-up processes were waiting for this */ - do_smart_update(sma, NULL, 0, 0, &tasks); + do_smart_update(sma, NULL, 0, 0, &wake_list); err = 0; goto out_unlock; } } out_unlock: sem_unlock(sma); - wake_up_sem_queue_do(&tasks); + wake_up_sem_queue_do(&wake_list); out_free: if(sem_io != fast_sem_io) @@ -1255,32 +1224,6 @@ static struct sem_undo *find_alloc_undo( } -/** - * get_queue_result - Retrieve the result code from sem_queue - * @q: Pointer to queue structure - * - * Retrieve the return code from the pending queue. If IN_WAKEUP is found in - * q->status, then we must loop until the value is replaced with the final - * value: This may happen if a task is woken up by an unrelated event (e.g. - * signal) and in parallel the task is woken up by another task because it got - * the requested semaphores. 
- * - * The function can be called with or without holding the semaphore spinlock. - */ -static int get_queue_result(struct sem_queue *q) -{ - int error; - - error = q->status; - while (unlikely(error == IN_WAKEUP)) { - cpu_relax(); - error = q->status; - } - - return error; -} - - SYSCALL_DEFINE4(semtimedop, int, semid, struct sembuf __user *, tsops, unsigned, nsops, const struct timespec __user *, timeout) { @@ -1293,7 +1236,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, struct sem_queue queue; unsigned long jiffies_left = 0; struct ipc_namespace *ns; - struct list_head tasks; + WAKE_LIST(wake_list); ns = current->nsproxy->ipc_ns; @@ -1342,8 +1285,6 @@ SYSCALL_DEFINE4(semtimedop, int, semid, } else un = NULL; - INIT_LIST_HEAD(&tasks); - sma = sem_lock_check(ns, semid); if (IS_ERR(sma)) { if (un) @@ -1392,7 +1333,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, error = try_atomic_semop (sma, sops, nsops, un, task_tgid_vnr(current)); if (error <= 0) { if (alter && error == 0) - do_smart_update(sma, sops, nsops, 1, &tasks); + do_smart_update(sma, sops, nsops, 1, &wake_list); goto out_unlock_free; } @@ -1434,8 +1375,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, else schedule(); - error = get_queue_result(&queue); - + error = queue.status; if (error != -EINTR) { /* fast path: update_queue already obtained all requested * resources. @@ -1450,11 +1390,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, } sma = sem_lock(ns, semid); - - /* - * Wait until it's guaranteed that no wakeup_sem_queue_do() is ongoing. - */ - error = get_queue_result(&queue); + error = queue.status; /* * Array removed? If yes, leave without sem_unlock(). 
@@ -1484,7 +1420,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, out_unlock_free: sem_unlock(sma); - wake_up_sem_queue_do(&tasks); + wake_up_sem_queue_do(&wake_list); out_free: if(sops != fast_sops) kfree(sops); @@ -1545,7 +1481,7 @@ void exit_sem(struct task_struct *tsk) for (;;) { struct sem_array *sma; struct sem_undo *un; - struct list_head tasks; + WAKE_LIST(wake_list); int semid; int i; @@ -1610,10 +1546,9 @@ void exit_sem(struct task_struct *tsk) } } /* maybe some queued-up processes were waiting for this */ - INIT_LIST_HEAD(&tasks); - do_smart_update(sma, NULL, 0, 1, &tasks); + do_smart_update(sma, NULL, 0, 1, &wake_list); sem_unlock(sma); - wake_up_sem_queue_do(&tasks); + wake_up_sem_queue_do(&wake_list); kfree_rcu(un, rcu); } ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-14 13:30 ` [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme Peter Zijlstra @ 2011-09-15 17:29 ` Manfred Spraul 2011-09-15 19:32 ` Peter Zijlstra ` (4 more replies) 0 siblings, 5 replies; 33+ messages in thread From: Manfred Spraul @ 2011-09-15 17:29 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith Hi Peter, On 09/14/2011 03:30 PM, Peter Zijlstra wrote: > This removes the home-brew busy-wait and the requirement to keep > preemption disabled. In the initial mail of the patch series, you write: > Patch 3 converts sysv sems, and is broken What is broken? > > /** > * newary - Create a new semaphore set > @@ -406,51 +388,39 @@ static int try_atomic_semop (struct sem_ > return result; > } > > -/** wake_up_sem_queue_prepare(q, error): Prepare wake-up > +/** wake_up_sem_queue_prepare(wake_list, q, error): Prepare wake-up > + * @wake_list: list to queue the to be woken task on > * @q: queue entry that must be signaled > * @error: Error value for the signal > * > * Prepare the wake-up of the queue entry q. > */ > -static void wake_up_sem_queue_prepare(struct list_head *pt, > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > struct sem_queue *q, int error) > { > - if (list_empty(pt)) { > - /* > - * Hold preempt off so that we don't get preempted and have the > - * wakee busy-wait until we're scheduled back on. 
> - */ > - preempt_disable(); > - } > - q->status = IN_WAKEUP; > - q->pid = error; > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > - list_add_tail(&q->simple_list, pt); > + get_task_struct(p); > + q->status = error; > + /* > + * implies a full barrier > + */ > + wake_list_add(wake_list, p); > + put_task_struct(p); > } I think the get_task_struct()/put_task_struct is not necessary: Just do the wake_list_add() before writing q->status: wake_list_add() is identical to list_add_tail(&q->simple_list, pt). [except that it contains additional locking, which doesn't matter here] > > /** > - * wake_up_sem_queue_do(pt) - do the actual wake-up > - * @pt: list of tasks to be woken up > + * wake_up_sem_queue_do(wake_list) - do the actual wake-up > + * @wake_list: list of tasks to be woken up > * > * Do the actual wake-up. > * The function is called without any locks held, thus the semaphore array > * could be destroyed already and the tasks can disappear as soon as the > * status is set to the actual return code. > */ > -static void wake_up_sem_queue_do(struct list_head *pt) > +static void wake_up_sem_queue_do(struct wake_list_head *wake_list) > { > - struct sem_queue *q, *t; > - int did_something; > - > - did_something = !list_empty(pt); > - list_for_each_entry_safe(q, t, pt, simple_list) { > - wake_up_process(q->sleeper); > - /* q can disappear immediately after writing q->status. */ > - smp_wmb(); > - q->status = q->pid; > - } > - if (did_something) > - preempt_enable(); > + wake_up_list(wake_list, TASK_ALL); > } > wake_up_list() calls wake_up_state() that calls try_to_wake_up(). try_to_wake_up() seems to return immediately when the state is TASK_DEAD. That leaves: Is it safe to call wake_up_list() in parallel with do_exit()? The current implementation avoids that. -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul @ 2011-09-15 19:32 ` Peter Zijlstra 2011-09-15 19:35 ` Peter Zijlstra ` (3 subsequent siblings) 4 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-15 19:32 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Thu, 2011-09-15 at 19:29 +0200, Manfred Spraul wrote: > Hi Peter, > What is broken? I'm not quite sure yet, but the results are that sembench doesn't complete properly; http://oss.oracle.com/~mason/sembench.c What seems to be happening is that we get spurious wakeups in the ipc/sem code resulting in semtimedop returning -EINTR, even though there's no pending signal. (there really should be a if (!signal_pending(current)) goto again thing in that semtimedop wait loop) Adding a loop in userspace like: again: ret = semtimedop(semid_lookup[l->id], &sb, 1, tvp); if (ret) { if (errno == EINTR) { l->spurious++; kill_tracer(); goto again; } perror("semtimedop"); } makes it complete again (although performance seems to suffer a lot compared to a kernel without this patch). It seems related to patch 2/3 converting the futex code, without that patch I can't seem to reproduce. All this is strange though, because if there were multiple wakeups on the same task wake_lists ought to result in fewer wakeups in total, not more. I've been trying to trace the thing but so far no luck.. when I enable too much tracing it goes away.. silly heisenbugger. 
> > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > > struct sem_queue *q, int error) > > { > > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > > > + get_task_struct(p); > > + q->status = error; > > + /* > > + * implies a full barrier > > + */ > > + wake_list_add(wake_list, p); > > + put_task_struct(p); > > } > I think the get_task_struct()/put_task_struct is not necessary: > Just do the wake_list_add() before writing q->status: > wake_list_add() is identical to list_add_tail(&q->simple_list, pt). > [except that it contains additional locking, which doesn't matter here] But the moment we write q->status, q can disappear right? Suppose the task gets a wakeup (say from a signal) right after we write q->status. Then p can disappear (do_exit) and we'd try to enqueue dead memory -> BOOM! > > +static void wake_up_sem_queue_do(struct wake_list_head *wake_list) > > { > > + wake_up_list(wake_list, TASK_ALL); > > } > > > wake_up_list() calls wake_up_state() that calls try_to_wake_up(). > try_to_wake_up() seems to return immediately when the state is TASK_DEAD. > > That leaves: Is it safe to call wake_up_list() in parallel with do_exit()? > The current implementation avoids that. Ah, wake_list_add() does get_task_struct() and wake_up_list() will first issue the wakeup and then drop the reference. Hrmm,. it looks like its all these atomic ops {get,put}_task_struct() that are causing the performance drop.. I just removed the ones in wake_up_sem_queue_prepare() just for kicks and I got about half my performance gap back. ^ permalink raw reply [flat|nested] 33+ messages in thread
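[ Editorial note: the use-after-free window Peter describes — write q->status, sleeper wakes and exits, then wake_list_add() touches dead memory — can be modelled in userspace. The sketch below uses hypothetical names and a toy refcount; it is an illustration of why the reference is taken before the status write, not kernel code. ]

```c
#include <stdatomic.h>
#include <assert.h>

/* Toy model of get_task_struct()/put_task_struct() around the status
 * write in wake_up_sem_queue_prepare(): the reference taken *before*
 * q->status becomes visible keeps the object alive even if the sleeper
 * wakes up and exits immediately afterwards. */
struct obj {
	atomic_int refs;
	int        freed;
};

void obj_get(struct obj *o)
{
	atomic_fetch_add(&o->refs, 1);
}

void obj_put(struct obj *o)
{
	/* Last reference dropped: the object goes away. */
	if (atomic_fetch_sub(&o->refs, 1) == 1)
		o->freed = 1;	/* stands in for the actual free */
}
```

The extra get/put pair is exactly the atomic traffic Peter measured as half the performance gap.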
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul 2011-09-15 19:32 ` Peter Zijlstra @ 2011-09-15 19:35 ` Peter Zijlstra 2011-09-15 19:45 ` Peter Zijlstra ` (2 subsequent siblings) 4 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-15 19:35 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Thu, 2011-09-15 at 21:32 +0200, Peter Zijlstra wrote: > > > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > > > struct sem_queue *q, int error) > > > { > > > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > > > > > + get_task_struct(p); > > > + q->status = error; > > > + /* > > > + * implies a full barrier > > > + */ > > > + wake_list_add(wake_list, p); > > > + put_task_struct(p); > > > } > > > I think the get_task_struct()/put_task_struct is not necessary: > > Just do the wake_list_add() before writing q->status: > > wake_list_add() is identical to list_add_tail(&q->simple_list, pt). > > [except that it contains additional locking, which doesn't matter here] OK, I can't read properly.. so the problem with doing the wake_list_add() before the write is that the wakeup can actually happen before the write in case p already had a wakeup queued. So far there isn't anybody else (except futexes) using this that could intersect with the wakeups.. but still it leaves me uneasy. ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul 2011-09-15 19:32 ` Peter Zijlstra 2011-09-15 19:35 ` Peter Zijlstra @ 2011-09-15 19:45 ` Peter Zijlstra 2011-09-17 12:36 ` Manfred Spraul 2011-09-16 12:18 ` Peter Zijlstra 2011-09-16 12:39 ` Peter Zijlstra 4 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-15 19:45 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Thu, 2011-09-15 at 21:35 +0200, Peter Zijlstra wrote: > On Thu, 2011-09-15 at 21:32 +0200, Peter Zijlstra wrote: > > > > +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, > > > > struct sem_queue *q, int error) > > > > { > > > > + struct task_struct *p = ACCESS_ONCE(q->sleeper); > > > > > > > > + get_task_struct(p); > > > > + q->status = error; > > > > + /* > > > > + * implies a full barrier > > > > + */ > > > > + wake_list_add(wake_list, p); > > > > + put_task_struct(p); > > > > } > > > > > I think the get_task_struct()/put_task_struct is not necessary: > > > Just do the wake_list_add() before writing q->status: > > > wake_list_add() is identical to list_add_tail(&q->simple_list, pt). > > > [except that it contains additional locking, which doesn't matter here] > > OK, I can't read properly.. so the problem with doing the > wake_list_add() before the write is that the wakeup can actually happen > before the write in case p already had a wakeup queued. Ah, but if the wakeup happens early, we return from schedule with -EINTR and re-acquire the sem_lock and re-test. Since we do this update from under sem_lock it should serialize and come out all-right,.. right? ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 19:45 ` Peter Zijlstra @ 2011-09-17 12:36 ` Manfred Spraul 0 siblings, 0 replies; 33+ messages in thread From: Manfred Spraul @ 2011-09-17 12:36 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On 09/15/2011 09:45 PM, Peter Zijlstra wrote: > On Thu, 2011-09-15 at 21:35 +0200, Peter Zijlstra wrote: >> On Thu, 2011-09-15 at 21:32 +0200, Peter Zijlstra wrote: >>>>> +static void wake_up_sem_queue_prepare(struct wake_list_head *wake_list, >>>>> struct sem_queue *q, int error) >>>>> { >>>>> + struct task_struct *p = ACCESS_ONCE(q->sleeper); >>>>> >>>>> + get_task_struct(p); >>>>> + q->status = error; >>>>> + /* >>>>> + * implies a full barrier >>>>> + */ >>>>> + wake_list_add(wake_list, p); >>>>> + put_task_struct(p); >>>>> } >>>> I think the get_task_struct()/put_task_struct is not necessary: >>>> Just do the wake_list_add() before writing q->status: >>>> wake_list_add() is identical to list_add_tail(&q->simple_list, pt). >>>> [except that it contains additional locking, which doesn't matter here] >> OK, I can't read properly.. so the problem with doing the >> wake_list_add() before the write is that the wakeup can actually happen >> before the write in case p already had a wakeup queued. > Ah, but if the wakeup happens early, we return from schedule with -EINTR > and re-acquire the sem_lock and re-test. Since we do this update from > under sem_lock it should serialize and come out all-right,.. right? Correct. Just think about a timeout of semtimedop(). The code handles early wakeup properly, it will either return -EAGAIN or 0. (except for -EINTR). -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul ` (2 preceding siblings ...) 2011-09-15 19:45 ` Peter Zijlstra @ 2011-09-16 12:18 ` Peter Zijlstra 2011-09-17 12:32 ` Manfred Spraul 2011-09-16 12:39 ` Peter Zijlstra 4 siblings, 1 reply; 33+ messages in thread From: Peter Zijlstra @ 2011-09-16 12:18 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Thu, 2011-09-15 at 19:29 +0200, Manfred Spraul wrote: > What is broken? So basically sembench was broken and the futex patch is causing spurious wakeups. I've got the below patch to fix up the sem code. One more question, do the sem wakeups need to be issued in FIFO order? There's a comment in there: * User space visible behavior: * - FIFO ordering for semop() operations (just FIFO, not starvation * protection) that seems to suggest the sem ops processing is in FIFO order, but does the user visible effect propagate to the wakeup order? Currently the wake-list is a FILO, although making it FIFO isn't really hard (in fact, I've got the patch). --- Subject: ipc/sem: Deal with spurious wakeups From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Fri Sep 16 13:58:44 CEST 2011 The current code doesn't deal well with spurious wakeups and returns a -EINTR to user space even though there were no signals anywhere near the task. Deal with this to check for pending signals before actually dropping out of the kernel and try again, avoids user<->kernel round-trip overhead. 
Cc: Manfred Spraul <manfred@colorfullife.com> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-1uiuenzz5hwf04opwqmni7cn@git.kernel.org --- ipc/sem.c | 9 ++++++--- 1 file changed, 6 insertions(+), 3 deletions(-) Index: linux-2.6/ipc/sem.c =================================================================== --- linux-2.6.orig/ipc/sem.c +++ linux-2.6/ipc/sem.c @@ -1366,6 +1366,7 @@ SYSCALL_DEFINE4(semtimedop, int, semid, queue.status = -EINTR; queue.sleeper = current; +retry: current->state = TASK_INTERRUPTIBLE; sem_unlock(sma); @@ -1399,21 +1400,23 @@ SYSCALL_DEFINE4(semtimedop, int, semid, goto out_free; } - /* * If queue.status != -EINTR we are woken up by another process. * Leave without unlink_queue(), but with sem_unlock(). */ - if (error != -EINTR) { + if (error != -EINTR) goto out_unlock_free; - } /* * If an interrupt occurred we have to clean up the queue */ if (timeout && jiffies_left == 0) error = -EAGAIN; + + if (error == -EINTR && !signal_pending(current)) + goto retry; + unlink_queue(sma, &queue); out_unlock_free: ^ permalink raw reply [flat|nested] 33+ messages in thread
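[ Editorial note: the retry loop in the patch above is the kernel-side form of a familiar userspace rule — a sleeping waiter must re-check its wakeup condition after every return from the sleep primitive, because the return may be spurious. A minimal pthread analogue with hypothetical helper names: ]

```c
#include <pthread.h>
#include <stdbool.h>

/* Userspace analogue of the semtimedop() retry: q->status becomes a
 * 'done' flag, and the while loop is the equivalent of
 * "if (error == -EINTR && !signal_pending(current)) goto retry;". */
struct waiter {
	pthread_mutex_t lock;
	pthread_cond_t  cond;
	bool            done;
};

void waiter_init(struct waiter *w)
{
	pthread_mutex_init(&w->lock, NULL);
	pthread_cond_init(&w->cond, NULL);
	w->done = false;
}

void waiter_wake(struct waiter *w)
{
	pthread_mutex_lock(&w->lock);
	w->done = true;			/* publish status under the lock */
	pthread_cond_signal(&w->cond);
	pthread_mutex_unlock(&w->lock);
}

void waiter_wait(struct waiter *w)
{
	pthread_mutex_lock(&w->lock);
	/* pthread_cond_wait() is explicitly allowed to return
	 * spuriously, so always loop on the predicate. */
	while (!w->done)
		pthread_cond_wait(&w->cond, &w->lock);
	pthread_mutex_unlock(&w->lock);
}
```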
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-16 12:18 ` Peter Zijlstra @ 2011-09-17 12:32 ` Manfred Spraul 0 siblings, 0 replies; 33+ messages in thread From: Manfred Spraul @ 2011-09-17 12:32 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On 09/16/2011 02:18 PM, Peter Zijlstra wrote: > On Thu, 2011-09-15 at 19:29 +0200, Manfred Spraul wrote: >> What is broken? > So basically sembench was broken and the futex patch is causing spurious > wakeups. > > I've got the below patch to fix up the sem code. > > One more question, do the sem wakeups need to be issued in FIFO order? > There's a comment in there: > > * User space visible behavior: > * - FIFO ordering for semop() operations (just FIFO, not starvation > * protection) > > that seems to suggest the sem ops processing is in FIFO order, but does > the user visible effect propagate to the wakeup order? I'd ask the question the other way around: Is the wakeup order user visible? IMHO: No, the scheduler might reorder the tasks anyway. > > /* > * If an interrupt occurred we have to clean up the queue > */ > if (timeout && jiffies_left == 0) > error = -EAGAIN; > + > + if (error == -EINTR && !signal_pending(current)) > + goto retry; > + > unlink_queue(sma, &queue); > > out_unlock_free: Good solution. -ERESTARTNOHAND would be even better, but this is definitely better than the current code. -- Manfred ^ permalink raw reply [flat|nested] 33+ messages in thread
* Re: [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme 2011-09-15 17:29 ` Manfred Spraul ` (3 preceding siblings ...) 2011-09-16 12:18 ` Peter Zijlstra @ 2011-09-16 12:39 ` Peter Zijlstra 4 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-16 12:39 UTC (permalink / raw) To: Manfred Spraul Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, David Miller, Eric Dumazet, Mike Galbraith On Fri, 2011-09-16 at 14:18 +0200, Peter Zijlstra wrote: > Currently the wake-list is a FILO, although making it FIFO isn't really > hard (in fact, I've got the patch). might as well post it too --- Subject: From: Peter Zijlstra <a.p.zijlstra@chello.nl> Date: Thu Sep 15 15:32:06 CEST 2011 Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/n/tip-mus5f94zr5rckytoysdn4pqd@git.kernel.org --- include/linux/sched.h | 11 ++++++++--- ipc/sem.c | 11 +++++++++-- 2 files changed, 17 insertions(+), 5 deletions(-) Index: linux-2.6/include/linux/sched.h =================================================================== --- linux-2.6.orig/include/linux/sched.h +++ linux-2.6/include/linux/sched.h @@ -1067,6 +1067,7 @@ struct sched_domain; struct wake_list_head { struct wake_list_node *first; + struct wake_list_node *last; }; struct wake_list_node { @@ -1076,7 +1077,7 @@ struct wake_list_node { #define WAKE_LIST_TAIL ((struct wake_list_node *)0x01) #define WAKE_LIST(name) \ - struct wake_list_head name = { WAKE_LIST_TAIL } + struct wake_list_head name = { WAKE_LIST_TAIL, WAKE_LIST_TAIL } /* * wake flags @@ -2174,11 +2175,15 @@ wake_list_add(struct wake_list_head *hea * This cmpxchg() implies a full barrier, which pairs with the write * barrier implied by the wakeup in wake_up_list(). 
*/ - if (cmpxchg(&n->next, 0, head->first)) + if (cmpxchg(&n->next, 0, WAKE_LIST_TAIL)) return; get_task_struct(p); - head->first = n; + if (head->first == WAKE_LIST_TAIL) + head->first = n; + else + head->last->next = n; + head->last = n; } extern void wake_up_list(struct wake_list_head *head, unsigned int state); ^ permalink raw reply [flat|nested] 33+ messages in thread
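[ Editorial note: the first/last bookkeeping in the patch above can be modelled in userspace C11. The sketch mirrors the kernel names but is a single-producer illustration of the cmpxchg-guarded FIFO only — the real wake_list_add() also pins the task with get_task_struct(), and wake_up_list() issues the actual wakeups. ]

```c
#include <stdatomic.h>
#include <stddef.h>

/* node->next doubles as the "already queued" flag: a cmpxchg from NULL
 * claims the node, so a task already on some wake list is never added
 * twice.  WAKE_LIST_TAIL is a sentinel that is never a valid node. */
#define WAKE_LIST_TAIL ((struct wake_list_node *)0x1)

struct wake_list_node {
	_Atomic(struct wake_list_node *) next;
};

struct wake_list_head {
	struct wake_list_node *first;
	struct wake_list_node *last;
};

void wake_list_init(struct wake_list_head *head)
{
	head->first = head->last = WAKE_LIST_TAIL;
}

/* Returns 1 if queued, 0 if the node was already on a list. */
int wake_list_add(struct wake_list_head *head, struct wake_list_node *n)
{
	struct wake_list_node *expect = NULL;

	if (!atomic_compare_exchange_strong(&n->next, &expect,
					    WAKE_LIST_TAIL))
		return 0;	/* someone beat us to it */

	if (head->first == WAKE_LIST_TAIL)
		head->first = n;		/* list was empty */
	else
		atomic_store(&head->last->next, n); /* append: FIFO */
	head->last = n;
	return 1;
}

/* Pop in FIFO order; the kernel would call wake_up_state() on each. */
struct wake_list_node *wake_list_pop(struct wake_list_head *head)
{
	struct wake_list_node *n = head->first;

	if (n == WAKE_LIST_TAIL)
		return NULL;
	head->first = atomic_load(&n->next);
	if (head->first == WAKE_LIST_TAIL)
		head->last = WAKE_LIST_TAIL;
	atomic_store(&n->next, NULL);	/* node may be queued again */
	return n;
}
```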
* Re: [RFC][PATCH 0/3] delayed wakeup list 2011-09-14 13:30 [RFC][PATCH 0/3] delayed wakeup list Peter Zijlstra ` (2 preceding siblings ...) 2011-09-14 13:30 ` [RFC][PATCH 3/3] ipc/sem: Rework wakeup scheme Peter Zijlstra @ 2011-09-14 13:51 ` Eric Dumazet 2011-09-14 13:56 ` Peter Zijlstra 3 siblings, 1 reply; 33+ messages in thread From: Eric Dumazet @ 2011-09-14 13:51 UTC (permalink / raw) To: Peter Zijlstra Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Mike Galbraith Le mercredi 14 septembre 2011 à 15:30 +0200, Peter Zijlstra a écrit : > This patch-set provides the infrastructure to delay/batch task wakeups. > > This is useful for locking primitives that can effect multiple wakeups > per operation and want to avoid lock internal lock contention by > delaying the wakeups until we've released the lock internal locks. > > Patch 2 converts futexes > Patch 3 converts sysv sems, and is broken > > [ I've been staring at patch 3 way too long, so I thought I'd post it just > to get a few more eyes on it.. ] > > Alternatively it can be used to avoid issuing multiple wakeups, and > thus save a few cycles, in packet processing. Queue all target tasks > and wakeup once you've processed all packets. That way you avoid > waking the target task multiple times if there were multiple packets > for the same task. > > No actual such usage yet, but ISTR talking to some net folks a long while back > about this, is there still interest, Dave, Eric? > Yes, I remember playing with such idea some years ago, to speedup multicast processing. Say you have 10 receivers on a multicast group, each incoming message actually wakeups 10 threads. If we receive a burst of 10 messages, we spend a lot of time in scheduler. So adding one queue to batch all scheduler works (and factorize some work if the same thread is queued), and perform the scheduler calls at the end of software IRQ for example was a win. 
^ permalink raw reply [flat|nested] 33+ messages in thread
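[ Editorial note: Eric's multicast scenario can be sketched as a toy model — hypothetical, not networking code. Each distinct receiver is remembered at most once per burst, and the wakeups are issued together at the end of the "software IRQ", so 10 messages to 10 receivers cost 10 wakeups instead of 100. ]

```c
#include <string.h>
#include <stdbool.h>

#define MAX_RCV 64

struct batch {
	int  pending[MAX_RCV];	/* receivers queued this burst */
	int  n;
	bool queued[MAX_RCV];	/* dedup: receiver already queued? */
	int  wakeups;		/* wake_up_process() calls issued */
};

void batch_init(struct batch *b)
{
	memset(b, 0, sizeof(*b));
}

/* Called once per (packet, receiver) during packet processing. */
void batch_queue(struct batch *b, int rcv)
{
	if (b->queued[rcv])
		return;		/* same thread again: work factorized */
	b->queued[rcv] = true;
	b->pending[b->n++] = rcv;
}

/* Called at the end of the software IRQ: one wakeup per receiver. */
void batch_flush(struct batch *b)
{
	for (int i = 0; i < b->n; i++) {
		b->wakeups++;	/* stands in for the real wakeup call */
		b->queued[b->pending[i]] = false;
	}
	b->n = 0;
}
```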
* Re: [RFC][PATCH 0/3] delayed wakeup list 2011-09-14 13:51 ` [RFC][PATCH 0/3] delayed wakeup list Eric Dumazet @ 2011-09-14 13:56 ` Peter Zijlstra 0 siblings, 0 replies; 33+ messages in thread From: Peter Zijlstra @ 2011-09-14 13:56 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, Thomas Gleixner, linux-kernel, Steven Rostedt, Darren Hart, Manfred Spraul, David Miller, Mike Galbraith On Wed, 2011-09-14 at 15:51 +0200, Eric Dumazet wrote: > Le mercredi 14 septembre 2011 à 15:30 +0200, Peter Zijlstra a écrit : > > This patch-set provides the infrastructure to delay/batch task wakeups. > > Alternatively it can be used to avoid issuing multiple wakeups, and > > thus save a few cycles, in packet processing. Queue all target tasks > > and wakeup once you've processed all packets. That way you avoid > > waking the target task multiple times if there were multiple packets > > for the same task. > > > > No actual such usage yet, but ISTR talking to some net folks a long while back > > about this, is there still interest, Dave, Eric? > > > > Yes, I remember playing with such idea some years ago, to speedup > multicast processing. > > Say you have 10 receivers on a multicast group, each incoming message > actually wakeups 10 threads. > > If we receive a burst of 10 messages, we spend a lot of time in > scheduler. > > So adding one queue to batch all scheduler works (and factorize some > work if the same thread is queued), and perform the scheduler calls at > the end of software IRQ for example was a win. Awesome, so my memory didn't trick me ;-) Patches 1 and 2 should be stable, its just 3 that's a bit troublesome. So if you have the bandwidth you could try this. ^ permalink raw reply [flat|nested] 33+ messages in thread