* udp ping pong with various process bindings (and correct cpu mappings)
  From: Christoph Lameter (2009-04-24 20:10 UTC)
  To: Eric Dumazet
  Cc: jesse.brandeburg, netdev, bhutchiings, mchan, David Miller

Here are the results of a 40 byte udpping (http://gentwo.org/ll) run on
kernels from 2.6.22 to 2.6.30-rc3 on a Dell 1950 dual quad core 3.3GHz.
One system stays on a fixed 2.6.22 kernel; the kernel version on the
other is varied.

Nice graph at http://gentwo.org/results/udpping-results.pdf

Summary:
- Loss of ~1.5 usec on the fastest path (same cpu) since 2.6.22
- Different cpu, same core loses 2-3 usecs vs. same cpu
- Different cpu, different core loses ~8 usecs vs. same cpu
- The maximum is usually reached when the threads are on different
  sockets, but sometimes same socket / different core is worse
  (2.6.26/2.6.27).
- Up to 9 usecs variance in a basic network operation just because
  of process placement.

Same CPU
Kernel       Test 1  Test 2  Test 3  Test 4  Average
2.6.22       83.03   82.9    82.89   82.92   82.94
2.6.23       83.35   82.81   82.83   82.86   82.96
2.6.24       82.66   82.56   82.64   82.73   82.65
2.6.25       84.28   84.29   84.37   84.3    84.31
2.6.26       84.72   84.38   84.41   84.68   84.55
2.6.27       84.56   84.44   84.41   84.58   84.5
2.6.28       84.7    84.43   84.47   84.48   84.52
2.6.29       84.91   84.67   84.69   84.75   84.76
2.6.30-rc2   84.94   84.72   84.69   84.93   84.82
2.6.30-rc3   84.88   84.7    84.73   84.89   84.8

Same core, different processor (l2 is shared)
Kernel       Test 1  Test 2  Test 3  Test 4  Average
2.6.22       84.6    84.71   84.52   84.53   84.59
2.6.23       84.59   84.5    84.33   84.34   84.44
2.6.24       84.28   84.3    84.38   84.28   84.31
2.6.25       86.12   85.8    86.2    86.04   86.04
2.6.26       86.61   86.46   86.49   86.7    86.57
2.6.27       87      87.01   87      86.95   86.99
2.6.28       86.53   86.44   86.26   86.24   86.37
2.6.29       85.88   85.94   86.1    85.69   85.9
2.6.30-rc2   86.03   85.93   85.99   86.06   86
2.6.30-rc3   85.73   85.88   85.67   85.94   85.81

Same socket, different core (l2 not shared)
Kernel       Test 1  Test 2  Test 3  Test 4  Average
2.6.22       90.08   89.72   90      89.9    89.93
2.6.23       89.72   90.1    89.99   89.86   89.92
2.6.24       89.18   89.28   89.25   89.22   89.23
2.6.25       90.83   90.78   90.87   90.61   90.77
2.6.26       90.51   91.25   91.8    91.69   91.31
2.6.27       91.98   91.93   91.97   91.91   91.95
2.6.28       91.72   91.7    91.84   91.75   91.75
2.6.29       89.85   89.85   90.14   89.9    89.94
2.6.30-rc2   90.78   90.8    90.87   90.73   90.8
2.6.30-rc3   90.84   90.94   91.05   90.84   90.92

Different socket
Kernel       Test 1  Test 2  Test 3  Test 4  Average
2.6.22       91.64   91.65   91.61   91.68   91.645
2.6.23       91.9    91.84   91.92   91.83   91.873
2.6.24       91.33   91.24   91.42   91.38   91.343
2.6.25       92.39   92.04   92.3    92.23   92.240
2.6.26       90.64   90.57   90.6    90.08   90.473
2.6.27       91.14   91.26   90.9    91.09   91.098
2.6.28       92.3    91.92   92.3    92.23   92.188
2.6.29       90.57   89.83   89.9    90.41   90.178
2.6.30-rc2   90.59   90.97   90.27   91.69   90.880
2.6.30-rc3   92.08   91.32   91.21   92.06   91.668
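The udpping source itself is only linked above. As a rough illustration of
what such a pinned 40-byte ping-pong measurement boils down to (a sketch
only, not the gentwo.org tool; peer address, port and loop count are
placeholders), the client side could look like this:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void)
{
	cpu_set_t set;
	struct sockaddr_in peer = {
		.sin_family = AF_INET,
		.sin_port   = htons(4444),		/* placeholder port */
	};
	char buf[40];					/* 40 byte payload */
	struct timeval t0, t1;
	int i, sock, loops = 100000;

	/* Pin this process to CPU 0; vary this to test other placements */
	CPU_ZERO(&set);
	CPU_SET(0, &set);
	sched_setaffinity(0, sizeof(set), &set);

	inet_pton(AF_INET, "192.168.0.2", &peer.sin_addr); /* placeholder IP */
	sock = socket(AF_INET, SOCK_DGRAM, 0);

	memset(buf, 0, sizeof(buf));
	gettimeofday(&t0, NULL);
	for (i = 0; i < loops; i++) {
		/* send the datagram, then block until the peer echoes it */
		sendto(sock, buf, sizeof(buf), 0,
		       (struct sockaddr *)&peer, sizeof(peer));
		recvfrom(sock, buf, sizeof(buf), 0, NULL, NULL);
	}
	gettimeofday(&t1, NULL);

	printf("%.2f usec per round trip\n",
	       ((t1.tv_sec - t0.tv_sec) * 1e6 +
		(t1.tv_usec - t0.tv_usec)) / loops);
	return 0;
}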
* Re: udp ping pong with various process bindings (and correct cpu mappings)
  From: Eric Dumazet (2009-04-24 21:18 UTC)
  To: Christoph Lameter
  Cc: jesse.brandeburg, netdev, bhutchiings, mchan, David Miller

Christoph Lameter wrote:
> Here are the results of a 40 byte udpping (http://gentwo.org/ll) run on
> kernels from 2.6.22 to 2.6.30-rc3 on a Dell 1950 dual quad core 3.3GHz.
> One system stays on a fixed 2.6.22 kernel; the kernel version on the
> other is varied.
>
> Nice graph at http://gentwo.org/results/udpping-results.pdf
>
> Summary:
> - Loss of ~1.5 usec on the fastest path (same cpu) since 2.6.22
> - Different cpu, same core loses 2-3 usecs vs. same cpu
> - Different cpu, different core loses ~8 usecs vs. same cpu
> - The maximum is usually reached when the threads are on different
>   sockets, but sometimes same socket / different core is worse
>   (2.6.26/2.6.27).
> - Up to 9 usecs variance in a basic network operation just because
>   of process placement.
>
> Same CPU
> Kernel       Test 1  Test 2  Test 3  Test 4  Average
> 2.6.22       83.03   82.9    82.89   82.92   82.94
> 2.6.23       83.35   82.81   82.83   82.86   82.96
> 2.6.24       82.66   82.56   82.64   82.73   82.65
> 2.6.25       84.28   84.29   84.37   84.3    84.31
> 2.6.26       84.72   84.38   84.41   84.68   84.55
> 2.6.27       84.56   84.44   84.41   84.58   84.5
> 2.6.28       84.7    84.43   84.47   84.48   84.52
> 2.6.29       84.91   84.67   84.69   84.75   84.76
> 2.6.30-rc2   84.94   84.72   84.69   84.93   84.82
> 2.6.30-rc3   84.88   84.7    84.73   84.89   84.8
>
> Same core, different processor (l2 is shared)
> Kernel       Test 1  Test 2  Test 3  Test 4  Average
> 2.6.22       84.6    84.71   84.52   84.53   84.59
> 2.6.23       84.59   84.5    84.33   84.34   84.44
> 2.6.24       84.28   84.3    84.38   84.28   84.31
> 2.6.25       86.12   85.8    86.2    86.04   86.04
> 2.6.26       86.61   86.46   86.49   86.7    86.57
> 2.6.27       87      87.01   87      86.95   86.99
> 2.6.28       86.53   86.44   86.26   86.24   86.37
> 2.6.29       85.88   85.94   86.1    85.69   85.9
> 2.6.30-rc2   86.03   85.93   85.99   86.06   86
> 2.6.30-rc3   85.73   85.88   85.67   85.94   85.81
>
> Same socket, different core (l2 not shared)
> Kernel       Test 1  Test 2  Test 3  Test 4  Average
> 2.6.22       90.08   89.72   90      89.9    89.93
> 2.6.23       89.72   90.1    89.99   89.86   89.92
> 2.6.24       89.18   89.28   89.25   89.22   89.23
> 2.6.25       90.83   90.78   90.87   90.61   90.77
> 2.6.26       90.51   91.25   91.8    91.69   91.31
> 2.6.27       91.98   91.93   91.97   91.91   91.95
> 2.6.28       91.72   91.7    91.84   91.75   91.75
> 2.6.29       89.85   89.85   90.14   89.9    89.94
> 2.6.30-rc2   90.78   90.8    90.87   90.73   90.8
> 2.6.30-rc3   90.84   90.94   91.05   90.84   90.92
>
> Different socket
> Kernel       Test 1  Test 2  Test 3  Test 4  Average
> 2.6.22       91.64   91.65   91.61   91.68   91.645
> 2.6.23       91.9    91.84   91.92   91.83   91.873
> 2.6.24       91.33   91.24   91.42   91.38   91.343
> 2.6.25       92.39   92.04   92.3    92.23   92.240
> 2.6.26       90.64   90.57   90.6    90.08   90.473
> 2.6.27       91.14   91.26   90.9    91.09   91.098
> 2.6.28       92.3    91.92   92.3    92.23   92.188
> 2.6.29       90.57   89.83   89.9    90.41   90.178
> 2.6.30-rc2   90.59   90.97   90.27   91.69   90.880
> 2.6.30-rc3   92.08   91.32   91.21   92.06   91.668

Thanks Christoph for doing this.

I believe we can restore the pre-2.6.25 performance level with little
changes.

[The problem is that in 2.6.25, UDP memory accounting forced us to add a
callback to sock_def_write_space() at skb TX completion time. This
function then wakes up all thread(s) blocked in the recvfrom() syscall.
Once awakened, the thread(s) block again because no frame was received.]

Davide Libenzi added a 'key' opaque argument to wakeups so that eventpoll
can avoid unnecessary wakeups. This infrastructure could be used on other
paths. (The most important one being receivers, because writers are
rarely blocked on a filled send buffer.)

commit 37e5540b3c9d838eb20f2ca8ea2eb8072271e403
Author: Davide Libenzi <davidel@xmailserver.org>
Date:   Tue Mar 31 15:24:21 2009 -0700

    epoll keyed wakeups: make sockets use keyed wakeups

    Add support for event-aware wakeups to the sockets code.  Events are
    delivered to the wakeup target, so that epoll can avoid spurious
    wakeups for non-interesting events.

commit 2dfa4eeab0fc7e8633974f2770945311b31eedf6

    epoll keyed wakeups: teach epoll about hints coming with the wakeup key

    Use the events hint now sent by some devices, to avoid unnecessary
    wakeups for events that are of no interest for the caller.  This code
    handles both devices that are sending keyed events and the ones that
    are not (and even the ones that sometimes send events, and sometimes
    don't).

We can add support for these keys in the regular socket code, so that a
process waiting on receive won't be scheduled because a TX completion
occurred.

The standard way is using autoremove_wake_function():

int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	int ret = default_wake_function(wait, mode, sync, key);

	if (ret)
		list_del_init(&wait->task_list);
	return ret;
}

/* this function ignores the "key" argument */
int default_wake_function(wait_queue_t *curr, unsigned mode, int sync, void *key)
{
	return try_to_wake_up(curr->private, mode, sync);
}

While the new 'keyed' wakeups can do better:

static int ep_poll_callback(wait_queue_t *wait, unsigned mode, int sync, void *key)
{
	int pwake = 0;
	unsigned long flags;
	struct epitem *epi = ep_item_from_wait(wait);
	struct eventpoll *ep = epi->ep;

	spin_lock_irqsave(&ep->lock, flags);
...
	/*
	 * Check the events coming with the callback. At this stage, not
	 * every device reports the events in the "key" parameter of the
	 * callback. We need to be able to handle both cases here, hence the
	 * test for "key" != NULL before the event match test.
	 */
	if (key && !((unsigned long) key & epi->event.events))
		goto out_unlock;
}

I'll try to cook a patch in the following days, unless someone beats me :)

Thanks
* [PATCH] net: Avoid extra wakeups of threads blocked in wait_for_packet()
  From: Eric Dumazet (2009-04-25 15:47 UTC)
  To: David Miller
  Cc: Christoph Lameter, jesse.brandeburg, netdev, haoki, mchan, Davide Libenzi

In 2.6.25 we added UDP mem accounting.

This unfortunately added a penalty when a frame is transmitted, since at
TX completion time we have to call sock_wfree() to perform the necessary
memory accounting. This calls sock_def_write_space() and ultimately the
scheduler if any thread is waiting on the socket.

Thread(s) waiting for an incoming frame were scheduled, then had to
sleep again as the event was meaningless.

(All threads waiting on a socket use the same sk_sleep anchor)

This adds a lot of extra wakeups and increases latencies, as noted by
Christoph Lameter, and slows down the softirq handler.

Reference : http://marc.info/?l=linux-netdev&m=124060437012283&w=2

Fortunately, Davide Libenzi recently added the concept of keyed wakeups
into the kernel, and particularly for sockets (see commit
37e5540b3c9d838eb20f2ca8ea2eb8072271e403
"epoll keyed wakeups: make sockets use keyed wakeups").

Davide's goal was to optimize epoll, but this new wakeup infrastructure
can help non-epoll users as well, if they care to set up an appropriate
handler.

This patch introduces a new DEFINE_WAIT_FUNC() helper and uses it in
wait_for_packet(), so that only a relevant event can wake up a thread
blocked in this function.

The trace of function calls from bnx2 TX completion bnx2_poll_work() is:
__kfree_skb()
 skb_release_head_state()
  sock_wfree()
   sock_def_write_space()
    __wake_up_sync_key()
     __wake_up_common()
      receiver_wake_function() : stops here since the thread is waiting for an INPUT

Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/linux/wait.h |    6 ++++--
 net/core/datagram.c  |   14 +++++++++++++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 5d631c1..bc02463 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -440,13 +440,15 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 
-#define DEFINE_WAIT(name)						\
+#define DEFINE_WAIT_FUNC(name, function)				\
 	wait_queue_t name = {						\
 		.private	= current,				\
-		.func		= autoremove_wake_function,		\
+		.func		= function,				\
 		.task_list	= LIST_HEAD_INIT((name).task_list),	\
 	}
 
+#define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)
+
 #define DEFINE_WAIT_BIT(name, word, bit)				\
 	struct wait_bit_queue name = {					\
 		.key = __WAIT_BIT_KEY_INITIALIZER(word, bit),		\
diff --git a/net/core/datagram.c b/net/core/datagram.c
index d0de644..b7960a3 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -64,13 +64,25 @@ static inline int connection_based(struct sock *sk)
 	return sk->sk_type == SOCK_SEQPACKET || sk->sk_type == SOCK_STREAM;
 }
 
+static int receiver_wake_function(wait_queue_t *wait, unsigned mode, int sync,
+				  void *key)
+{
+	unsigned long bits = (unsigned long)key;
+
+	/*
+	 * Avoid a wakeup if event not interesting for us
+	 */
+	if (bits && !(bits & (POLLIN | POLLERR)))
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, key);
+}
 /*
  * Wait for a packet..
  */
 static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
 {
 	int error;
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT_FUNC(wait, receiver_wake_function);
 
 	prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
* Re: [PATCH] net: Avoid extra wakeups of threads blocked in wait_for_packet()
  From: David Miller (2009-04-26 9:04 UTC)
  To: dada1
  Cc: cl, jesse.brandeburg, netdev, haoki, mchan, davidel

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 25 Apr 2009 17:47:23 +0200

> (All threads waiting on a socket use the same sk_sleep anchor)

Great stuff Eric.

We've discussed splitting the wait queue up before, but shorter-term
your idea is pretty cool too :-)
* [PATCH] poll: Avoid extra wakeups
  From: Eric Dumazet (2009-04-26 10:46 UTC)
  To: David Miller
  Cc: cl, jesse.brandeburg, netdev, haoki, mchan, davidel

David Miller wrote:
>
> Great stuff Eric.
>
> We've discussed splitting the wait queue up before, but shorter-term
> your idea is pretty cool too :-)

Well, I only got this idea because Davide did the previous work, he is
the one who did the hard stuff :)

About poll()/select() improvements, I believe the following patch should
be fine too.

Note some lines in this patch are longer than 80 columns, I am aware of
this but could not find an elegant/efficient way to avoid it.

Thank you

[PATCH] poll: Avoid extra wakeups in select/poll

After the introduction of keyed wakeups Davide Libenzi did on epoll, we
are able to avoid spurious wakeups in poll()/select() code too.

For example, a typical use of poll()/select() is to wait for incoming
network frames on many sockets. But TX completion for UDP/TCP frames
calls sock_wfree() which in turn schedules the thread.

When scheduled, the thread does a full scan of all polled fds and can
sleep again, because nothing is really available. If the number of fds
is large, this causes significant load.

This patch makes select()/poll() aware of keyed wakeups, and useless
wakeups are avoided. This reduces the number of context switches by
about 50% on some setups, and the work performed by softirq handlers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 fs/select.c          |   28 +++++++++++++++++++++++++---
 include/linux/poll.h |    3 +++
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 0fe0e14..2708187 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
 	return table->entry++;
 }
 
-static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
 	struct poll_wqueues *pwq = wait->private;
 	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
@@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
 	return default_wake_function(&dummy_wait, mode, sync, key);
 }
 
+static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct poll_table_entry *entry;
+
+	entry = container_of(wait, struct poll_table_entry, wait);
+	if (key && !((unsigned long)key & entry->key))
+		return 0;
+	return __pollwake(wait, mode, sync, key);
+}
+
 /* Add a new entry */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 				poll_table *p)
@@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 	get_file(filp);
 	entry->filp = filp;
 	entry->wait_address = wait_address;
+	entry->key = p->key;
 	init_waitqueue_func_entry(&entry->wait, pollwake);
 	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
@@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 			if (file) {
 				f_op = file->f_op;
 				mask = DEFAULT_POLLMASK;
-				if (f_op && f_op->poll)
+				if (f_op && f_op->poll) {
+					if (wait) {
+						wait->key = POLLEX_SET;
+						if (in & bit)
+							wait->key |= POLLIN_SET;
+						if (out & bit)
+							wait->key |= POLLOUT_SET;
+					}
 					mask = (*f_op->poll)(file, retval ? NULL : wait);
+				}
 				fput_light(file, fput_needed);
 				if ((mask & POLLIN_SET) && (in & bit)) {
 					res_in |= bit;
@@ -685,8 +704,11 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
 	mask = POLLNVAL;
 	if (file != NULL) {
 		mask = DEFAULT_POLLMASK;
-		if (file->f_op && file->f_op->poll)
+		if (file->f_op && file->f_op->poll) {
+			if (pwait)
+				pwait->key = pollfd->events | POLLERR | POLLHUP;
 			mask = file->f_op->poll(file, pwait);
+		}
 		/* Mask out unneeded events. */
 		mask &= pollfd->events | POLLERR | POLLHUP;
 		fput_light(file, fput_needed);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 8c24ef8..3327389 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_
 
 typedef struct poll_table_struct {
 	poll_queue_proc qproc;
+	unsigned long key;
 } poll_table;
 
 static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
@@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres
 static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
 {
 	pt->qproc = qproc;
+	pt->key   = ~0UL; /* all events enabled */
 }
 
 struct poll_table_entry {
 	struct file *filp;
+	unsigned long key;
 	wait_queue_t wait;
 	wait_queue_head_t *wait_address;
 };
* Re: [PATCH] poll: Avoid extra wakeups
  From: Jarek Poplawski (2009-04-26 13:33 UTC)
  To: Eric Dumazet
  Cc: David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel

Eric Dumazet wrote, On 04/26/2009 12:46 PM:
...
> [PATCH] poll: Avoid extra wakeups in select/poll
...
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> ---
>  fs/select.c          |   28 +++++++++++++++++++++++++---
>  include/linux/poll.h |    3 +++
>  2 files changed, 28 insertions(+), 3 deletions(-)

Eric, I wonder why you've forgotten about linux-kernel@ folks...

Jarek P.
* Re: [PATCH] poll: Avoid extra wakeups
  From: Eric Dumazet (2009-04-26 14:27 UTC)
  To: Jarek Poplawski
  Cc: David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel

Jarek Poplawski wrote:
> Eric Dumazet wrote, On 04/26/2009 12:46 PM:
> ...
>> [PATCH] poll: Avoid extra wakeups in select/poll
> ...
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
>> ---
>>  fs/select.c          |   28 +++++++++++++++++++++++++---
>>  include/linux/poll.h |    3 +++
>>  2 files changed, 28 insertions(+), 3 deletions(-)
>
> Eric, I wonder why you've forgotten about linux-kernel@ folks...

Ah yes, I forgot, I only did a 'reply all' on David's mail.

I'll resubmit it anyway, since it was only a followup.
* Re: [PATCH] poll: Avoid extra wakeups
  From: David Miller (2009-04-28 9:15 UTC)
  To: dada1
  Cc: cl, jesse.brandeburg, netdev, haoki, mchan, davidel

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sun, 26 Apr 2009 12:46:39 +0200

> [PATCH] poll: Avoid extra wakeups in select/poll

Looks great to me:

Acked-by: David S. Miller <davem@davemloft.net>

But this has to go through something other than the networking tree,
of course :-)
* Re: [PATCH] poll: Avoid extra wakeups
  From: Eric Dumazet (2009-04-28 9:24 UTC)
  To: David Miller
  Cc: cl, jesse.brandeburg, netdev, haoki, mchan, davidel

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sun, 26 Apr 2009 12:46:39 +0200
>
>> [PATCH] poll: Avoid extra wakeups in select/poll
>
> Looks great to me:
>
> Acked-by: David S. Miller <davem@davemloft.net>
>
> But this has to go through something other than the
> networking tree, of course :-)

Sure, I'll do that promptly :)

Thanks
* Re: [PATCH] poll: Avoid extra wakeups
  From: Andi Kleen (2009-04-28 14:21 UTC)
  To: Eric Dumazet
  Cc: David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel

Eric Dumazet <dada1@cosmosbay.com> writes:
>
> When scheduled, the thread does a full scan of all polled fds and can
> sleep again, because nothing is really available. If the number of fds
> is large, this causes significant load.

I wonder if the key could be used for more state. For example if two
processes are in recvmsg() on a socket and there's only a single packet
incoming, we only need to wake up the first waiter. Could that be done
with keys too?

> This patch makes select()/poll() aware of keyed wakeups, and useless
> wakeups are avoided. This reduces the number of context switches by
> about 50% on some setups, and the work performed by softirq handlers.

I'm late, but: very cool patch too.

Acked-by: Andi Kleen <ak@linux.intel.com>

-Andi

--
ak@linux.intel.com -- Speaking for myself only.
* Re: [PATCH] poll: Avoid extra wakeups
  From: Eric Dumazet (2009-04-28 14:58 UTC)
  To: Andi Kleen
  Cc: David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel

Andi Kleen wrote:
> Eric Dumazet <dada1@cosmosbay.com> writes:
>> When scheduled, the thread does a full scan of all polled fds and can
>> sleep again, because nothing is really available. If the number of fds
>> is large, this causes significant load.
>
> I wonder if the key could be used for more state. For example if two
> processes are in recvmsg() on a socket and there's only a single packet
> incoming, we only need to wake up the first waiter. Could that be done
> with keys too?

I am not sure it's possible. I'll take a look.

>> This patch makes select()/poll() aware of keyed wakeups, and useless
>> wakeups are avoided. This reduces the number of context switches by
>> about 50% on some setups, and the work performed by softirq handlers.
>
> I'm late, but: very cool patch too.
>
> Acked-by: Andi Kleen <ak@linux.intel.com>

Thanks, I am going to send it again, on lkml this time :)
* [PATCH] poll: Avoid extra wakeups in select/poll
  From: Eric Dumazet (2009-04-28 15:06 UTC)
  To: linux kernel
  Cc: Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

[PATCH] poll: Avoid extra wakeups in select/poll

After the introduction of keyed wakeups Davide Libenzi did on epoll, we
are able to avoid spurious wakeups in poll()/select() code too.

For example, a typical use of poll()/select() is to wait for incoming
network frames on many sockets. But TX completion for UDP/TCP frames
calls sock_wfree() which in turn schedules the thread.

When scheduled, the thread does a full scan of all polled fds and can
sleep again, because nothing is really available. If the number of fds
is large, this causes significant load.

This patch makes select()/poll() aware of keyed wakeups, and useless
wakeups are avoided. This reduces the number of context switches by
about 50% on some setups, and the work performed by softirq handlers.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Andi Kleen <ak@linux.intel.com>
---
 fs/select.c          |   28 +++++++++++++++++++++++++---
 include/linux/poll.h |    3 +++
 2 files changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/select.c b/fs/select.c
index 0fe0e14..2708187 100644
--- a/fs/select.c
+++ b/fs/select.c
@@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p)
 	return table->entry++;
 }
 
-static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
 {
 	struct poll_wqueues *pwq = wait->private;
 	DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task);
@@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
 	return default_wake_function(&dummy_wait, mode, sync, key);
 }
 
+static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key)
+{
+	struct poll_table_entry *entry;
+
+	entry = container_of(wait, struct poll_table_entry, wait);
+	if (key && !((unsigned long)key & entry->key))
+		return 0;
+	return __pollwake(wait, mode, sync, key);
+}
+
 /* Add a new entry */
 static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 				poll_table *p)
@@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address,
 	get_file(filp);
 	entry->filp = filp;
 	entry->wait_address = wait_address;
+	entry->key = p->key;
 	init_waitqueue_func_entry(&entry->wait, pollwake);
 	entry->wait.private = pwq;
 	add_wait_queue(wait_address, &entry->wait);
@@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time)
 			if (file) {
 				f_op = file->f_op;
 				mask = DEFAULT_POLLMASK;
-				if (f_op && f_op->poll)
+				if (f_op && f_op->poll) {
+					if (wait) {
+						wait->key = POLLEX_SET;
+						if (in & bit)
+							wait->key |= POLLIN_SET;
+						if (out & bit)
+							wait->key |= POLLOUT_SET;
+					}
 					mask = (*f_op->poll)(file, retval ? NULL : wait);
+				}
 				fput_light(file, fput_needed);
 				if ((mask & POLLIN_SET) && (in & bit)) {
 					res_in |= bit;
@@ -685,8 +704,11 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait)
 	mask = POLLNVAL;
 	if (file != NULL) {
 		mask = DEFAULT_POLLMASK;
-		if (file->f_op && file->f_op->poll)
+		if (file->f_op && file->f_op->poll) {
+			if (pwait)
+				pwait->key = pollfd->events | POLLERR | POLLHUP;
 			mask = file->f_op->poll(file, pwait);
+		}
 		/* Mask out unneeded events. */
 		mask &= pollfd->events | POLLERR | POLLHUP;
 		fput_light(file, fput_needed);
diff --git a/include/linux/poll.h b/include/linux/poll.h
index 8c24ef8..3327389 100644
--- a/include/linux/poll.h
+++ b/include/linux/poll.h
@@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_
 
 typedef struct poll_table_struct {
 	poll_queue_proc qproc;
+	unsigned long key;
 } poll_table;
 
 static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p)
@@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres
 static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc)
 {
 	pt->qproc = qproc;
+	pt->key   = ~0UL; /* all events enabled */
 }
 
 struct poll_table_entry {
 	struct file *filp;
+	unsigned long key;
 	wait_queue_t wait;
 	wait_queue_head_t *wait_address;
 };
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Christoph Lameter (2009-04-28 19:05 UTC)
  To: Eric Dumazet
  Cc: linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

For the udpping test load these patches have barely any effect:

git2p1 is the first version of the patch
git2p2 is the second version (this one)

Same CPU
Kernel             Test 1  Test 2  Test 3  Test 4  Average
2.6.22             83.03   82.9    82.89   82.92   82.935
2.6.23             83.35   82.81   82.83   82.86   82.9625
2.6.24             82.66   82.56   82.64   82.73   82.6475
2.6.25             84.28   84.29   84.37   84.3    84.31
2.6.26             84.72   84.38   84.41   84.68   84.5475
2.6.27             84.56   84.44   84.41   84.58   84.4975
2.6.28             84.7    84.43   84.47   84.48   84.52
2.6.29             84.91   84.67   84.69   84.75   84.755
2.6.30-rc2         84.94   84.72   84.69   84.93   84.82
2.6.30-rc3         84.88   84.7    84.73   84.89   84.8
2.6.30-rc3-git2p1  84.89   84.77   84.79   84.85   84.825
2.6.30-rc3-git2p2  84.91   84.79   84.78   84.8    84.82

Same Core
Kernel             Test 1  Test 2  Test 3  Test 4  Average
2.6.22             84.6    84.71   84.52   84.53   84.59
2.6.23             84.59   84.5    84.33   84.34   84.44
2.6.24             84.28   84.3    84.38   84.28   84.31
2.6.25             86.12   85.8    86.2    86.04   86.04
2.6.26             86.61   86.46   86.49   86.7    86.565
2.6.27             87      87.01   87      86.95   86.99
2.6.28             86.53   86.44   86.26   86.24   86.3675
2.6.29             85.88   85.94   86.1    85.69   85.9025
2.6.30-rc2         86.03   85.93   85.99   86.06   86.0025
2.6.30-rc3         85.73   85.88   85.67   85.94   85.805
2.6.30-rc3-git2p1  86.11   85.8    86.03   85.92   85.965
2.6.30-rc3-git2p2  86.04   85.96   85.89   86.04   85.9825

Same Socket
Kernel             Test 1  Test 2  Test 3  Test 4  Average
2.6.22             90.08   89.72   90      89.9    89.925
2.6.23             89.72   90.1    89.99   89.86   89.9175
2.6.24             89.18   89.28   89.25   89.22   89.2325
2.6.25             90.83   90.78   90.87   90.61   90.7725
2.6.26             90.51   91.25   91.8    91.69   91.3125
2.6.27             91.98   91.93   91.97   91.91   91.9475
2.6.28             91.72   91.7    91.84   91.75   91.7525
2.6.29             89.85   89.85   90.14   89.9    89.935
2.6.30-rc2         90.78   90.8    90.87   90.73   90.795
2.6.30-rc3         90.84   90.94   91.05   90.84   90.9175
2.6.30-rc3-git2p1  90.87   90.95   90.86   90.92   90.9
2.6.30-rc3-git2p2  91.09   91.01   90.97   91.06   91.0325

Different Socket
Kernel             Test 1  Test 2  Test 3  Test 4  Average
2.6.22             91.64   91.65   91.61   91.68   91.645
2.6.23             91.9    91.84   91.92   91.83   91.873
2.6.24             91.33   91.24   91.42   91.38   91.343
2.6.25             92.39   92.04   92.3    92.23   92.240
2.6.26             90.64   90.57   90.6    90.08   90.473
2.6.27             91.14   91.26   90.9    91.09   91.098
2.6.28             92.3    91.92   92.3    92.23   92.188
2.6.29             90.57   89.83   89.9    90.41   90.178
2.6.30-rc2         90.59   90.97   90.27   91.69   90.880
2.6.30-rc3         92.08   91.32   91.21   92.06   91.668
2.6.30-rc3-git2p1  91.46   91.38   91.92   91.03   91.448
2.6.30-rc3-git2p2  91.39   90.47   90.03   90.62   90.628
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Eric Dumazet (2009-04-28 20:05 UTC)
  To: Christoph Lameter
  Cc: linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

Christoph Lameter wrote:
> For the udpping test load these patches have barely any effect:
>
> git2p1 is the first version of the patch
> git2p2 is the second version (this one)

But... udpping does *not* use poll() nor select(), unless I am mistaken?

If you really want to test this patch with udpping, you might add a
poll() call before recvfrom():

	while (1) {
+		struct pollfd pfd = { .fd = sock, .events = POLLIN };
+
+		poll(&pfd, 1, -1);
		nbytes = recvfrom(sock, msg, min(inblocksize, sizeof(msg)), 0,
				  &inad, &inadlen);
		if (nbytes < 0) {
			perror("recvfrom");
			break;
		}
		if (sendto(sock, msg, nbytes, 0, &inad, inadlen) < 0) {
			perror("sendto");
			break;
		}
	}

The part about recvfrom() wakeup avoidance is in David's net-2.6 tree,
and saves 2 us on udpping here.

Is it what you call git2p1 ?
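A self-contained version of the loop Eric sketches might look like the
following (an illustration only: socket setup is assumed to be done by the
caller, and a plain 1500-byte buffer stands in for udpping's
min(inblocksize, sizeof(msg)) bound):

#include <poll.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>

static void echo_loop(int sock)
{
	char msg[1500];

	for (;;) {
		struct pollfd pfd = { .fd = sock, .events = POLLIN };
		struct sockaddr_storage inad;
		socklen_t inadlen = sizeof(inad);
		ssize_t nbytes;

		/* Sleep until a datagram is available; with keyed wakeups
		 * a TX completion no longer wakes us up here. */
		if (poll(&pfd, 1, -1) < 0) {
			perror("poll");
			break;
		}
		nbytes = recvfrom(sock, msg, sizeof(msg), 0,
				  (struct sockaddr *)&inad, &inadlen);
		if (nbytes < 0) {
			perror("recvfrom");
			break;
		}
		/* Echo the datagram back to its sender */
		if (sendto(sock, msg, nbytes, 0,
			   (struct sockaddr *)&inad, inadlen) < 0) {
			perror("sendto");
			break;
		}
	}
}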
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Christoph Lameter (2009-04-28 20:14 UTC)
  To: Eric Dumazet
  Cc: linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

On Tue, 28 Apr 2009, Eric Dumazet wrote:

> The part about recvfrom() wakeup avoidance is in David's net-2.6 tree,
> and saves 2 us on udpping here.
>
> Is it what you call git2p1 ?

No, that is just the prior version of the poll/select improvements.

Which patch are you referring to?
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Eric Dumazet (2009-04-28 20:33 UTC)
  To: Christoph Lameter
  Cc: linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

Christoph Lameter wrote:
> On Tue, 28 Apr 2009, Eric Dumazet wrote:
>
>> The part about recvfrom() wakeup avoidance is in David's net-2.6 tree,
>> and saves 2 us on udpping here.
>>
>> Is it what you call git2p1 ?
>
> No, that is just the prior version of the poll/select improvements.
>
> Which patch are you referring to?

The one that improved your udpping 'bench' :)

http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4

[PATCH] net: Avoid extra wakeups of threads blocked in wait_for_packet()

In 2.6.25 we added UDP mem accounting.

This unfortunately added a penalty when a frame is transmitted, since at
TX completion time we have to call sock_wfree() to perform the necessary
memory accounting. This calls sock_def_write_space() and ultimately the
scheduler if any thread is waiting on the socket.

Thread(s) waiting for an incoming frame were scheduled, then had to
sleep again as the event was meaningless.

(All threads waiting on a socket use the same sk_sleep anchor)

This adds a lot of extra wakeups and increases latencies, as noted by
Christoph Lameter, and slows down the softirq handler.

Reference : http://marc.info/?l=linux-netdev&m=124060437012283&w=2

Fortunately, Davide Libenzi recently added the concept of keyed wakeups
into the kernel, and particularly for sockets (see commit
37e5540b3c9d838eb20f2ca8ea2eb8072271e403
"epoll keyed wakeups: make sockets use keyed wakeups").

Davide's goal was to optimize epoll, but this new wakeup infrastructure
can help non-epoll users as well, if they care to set up an appropriate
handler.

This patch introduces a new DEFINE_WAIT_FUNC() helper and uses it in
wait_for_packet(), so that only a relevant event can wake up a thread
blocked in this function.

The trace of function calls from bnx2 TX completion bnx2_poll_work() is:
__kfree_skb()
 skb_release_head_state()
  sock_wfree()
   sock_def_write_space()
    __wake_up_sync_key()
     __wake_up_common()
      receiver_wake_function() : stops here since the thread is waiting for an INPUT

Reported-by: Christoph Lameter <cl@linux.com>
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---
 include/linux/wait.h |    6 ++++--
 net/core/datagram.c  |   14 +++++++++++++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/linux/wait.h b/include/linux/wait.h
index 5d631c1..bc02463 100644
--- a/include/linux/wait.h
+++ b/include/linux/wait.h
@@ -440,13 +440,15 @@ void abort_exclusive_wait(wait_queue_head_t *q, wait_queue_t *wait,
 int autoremove_wake_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 int wake_bit_function(wait_queue_t *wait, unsigned mode, int sync, void *key);
 
-#define DEFINE_WAIT(name)						\
+#define DEFINE_WAIT_FUNC(name, function)				\
 	wait_queue_t name = {						\
 		.private	= current,				\
-		.func		= autoremove_wake_function,		\
+		.func		= function,				\
 		.task_list	= LIST_HEAD_INIT((name).task_list),	\
 	}
 
+#define DEFINE_WAIT(name) DEFINE_WAIT_FUNC(name, autoremove_wake_function)
+
 #define DEFINE_WAIT_BIT(name, word, bit)				\
 	struct wait_bit_queue name = {					\
 		.key = __WAIT_BIT_KEY_INITIALIZER(word, bit),		\
diff --git a/net/core/datagram.c b/net/core/datagram.c
index d0de644..b7960a3 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -64,13 +64,25 @@ static inline int connection_based(struct sock *sk)
 	return sk->sk_type == SOCK_SEQPACKET || sk->sk_type == SOCK_STREAM;
 }
 
+static int receiver_wake_function(wait_queue_t *wait, unsigned mode, int sync,
+				  void *key)
+{
+	unsigned long bits = (unsigned long)key;
+
+	/*
+	 * Avoid a wakeup if event not interesting for us
+	 */
+	if (bits && !(bits & (POLLIN | POLLERR)))
+		return 0;
+	return autoremove_wake_function(wait, mode, sync, key);
+}
 /*
  * Wait for a packet..
  */
 static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
 {
 	int error;
-	DEFINE_WAIT(wait);
+	DEFINE_WAIT_FUNC(wait, receiver_wake_function);
 
 	prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Christoph Lameter (2009-04-28 20:49 UTC)
  To: Eric Dumazet
  Cc: linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

On Tue, 28 Apr 2009, Eric Dumazet wrote:

> The one that improved your udpping 'bench' :)
>
> http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4

Well yes, that is git2p1. The measurements that we took showed not much
of an effect, as you see.
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Eric Dumazet (2009-04-28 21:04 UTC)
  To: Christoph Lameter
  Cc: linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

Christoph Lameter wrote:
> On Tue, 28 Apr 2009, Eric Dumazet wrote:
>
>> The one that improved your udpping 'bench' :)
>>
>> http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4
>
> Well yes, that is git2p1. The measurements that we took showed not much
> of an effect, as you see.

It depends on the coalescing parameters of the NIC.

BNX2 interrupts first handle TX completions, then RX events.
So If by the
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Christoph Lameter (2009-04-28 21:00 UTC)
  To: Eric Dumazet
  Cc: linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

On Tue, 28 Apr 2009, Eric Dumazet wrote:

> BNX2 interrupts first handle TX completions, then RX events.
> So If by the

Guess there is more to come?
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Eric Dumazet (2009-04-28 21:05 UTC)
  To: Eric Dumazet
  Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

Eric Dumazet wrote:
> Christoph Lameter wrote:
>> On Tue, 28 Apr 2009, Eric Dumazet wrote:
>>
>>> The one that improved your udpping 'bench' :)
>>>
>>> http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4
>> Well yes, that is git2p1. The measurements that we took showed not much
>> of an effect, as you see.
>
> It depends on the coalescing parameters of the NIC.
>
> BNX2 interrupts first handle TX completions, then RX events.
> So If by the

Sorry, hit wrong key.

So if by t
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Christoph Lameter (2009-04-28 21:04 UTC)
  To: Eric Dumazet
  Cc: linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

On Tue, 28 Apr 2009, Eric Dumazet wrote:

> Sorry, hit wrong key.
>
> So if by t

You lost 2 more characters..... Keyboard wonky?
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Eric Dumazet (2009-04-28 21:11 UTC)
  To: Eric Dumazet
  Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar

Eric Dumazet wrote:
> Christoph Lameter wrote:
>> On Tue, 28 Apr 2009, Eric Dumazet wrote:
>>
>>> The one that improved your udpping 'bench' :)
>>>
>>> http://git2.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=bf368e4e70cd4e0f880923c44e95a4273d725ab4
>> Well yes, that is git2p1. The measurements that we took showed not much
>> of an effect, as you see.
>
> It depends on the coalescing parameters of the NIC.
>
> BNX2 interrupts first handle TX completions, then RX events.
> So If by the

Sorry for the previous message...

If, by the time the interrupt reaches the host, TX was handled right
before the RX event, the extra wakeup is not a problem, because the
incoming frame will be delivered into the socket queue right before the
awakened thread tries to pull it.

On real workloads (many incoming/outgoing frames), avoiding the extra
wakeups is a win, regardless of coalescing parameters and cpu
affinities...

On udpping, I had about 49000 wakeups per second prior to the patch, and
about 26000 wakeups per second after the patch (which matches the number
of incoming udp messages per second).
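The thread does not say how these wakeup counts were obtained; one crude
way to watch the same effect yourself (an assumption, not Eric's method)
is to sample the kernel's system-wide context switch counter from
/proc/stat across a fixed interval while the benchmark runs:

#include <stdio.h>
#include <unistd.h>

/* Read the "ctxt" (total context switches since boot) line from /proc/stat */
static long long read_ctxt(void)
{
	char line[256];
	long long v = 0;
	FILE *f = fopen("/proc/stat", "r");

	while (f && fgets(line, sizeof(line), f))
		if (sscanf(line, "ctxt %lld", &v) == 1)
			break;
	if (f)
		fclose(f);
	return v;
}

int main(void)
{
	long long before = read_ctxt();

	sleep(10);	/* measurement window */
	printf("context switches/sec: %lld\n", (read_ctxt() - before) / 10);
	return 0;
}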
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Ingo Molnar (2009-04-29 9:11 UTC)
  To: Eric Dumazet
  Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel

* Eric Dumazet <dada1@cosmosbay.com> wrote:

> On udpping, I had about 49000 wakeups per second prior to the patch,
> and about 26000 wakeups per second after the patch (which matches the
> number of incoming udp messages per second).

very nice. It might not show up as a real performance difference if
the CPUs are not fully saturated during the test - but it could show
up as a decrease in CPU utilization.

Also, if you run the test via 'perf stat -a ./test.sh' you should
see a reduction in instructions executed:

aldebaran:~/linux/linux> perf stat -a sleep 1

 Performance counter stats for 'sleep':

   16128.045994  task clock ticks     (msecs)
          12876  context switches     (events)
            219  CPU migrations       (events)
         186144  pagefaults           (events)
    20911802763  CPU cycles           (events)
    19309416815  instructions         (events)
      199608554  cache references     (events)
       19990754  cache misses         (events)

 Wall-clock time elapsed:  1008.882282 msecs

With -a it's measured system-wide, from start of test to end of test
- the results will be a lot more stable (and relevant) statistically
than wall-clock time or CPU usage measurements. (both of which are
rather imprecise in general)

	Ingo
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Eric Dumazet (2009-04-30 10:49 UTC)
  To: Ingo Molnar
  Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel

Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
>
>> On udpping, I had about 49000 wakeups per second prior to the patch,
>> and about 26000 wakeups per second after the patch (which matches the
>> number of incoming udp messages per second).
>
> very nice. It might not show up as a real performance difference if
> the CPUs are not fully saturated during the test - but it could show
> up as a decrease in CPU utilization.
>
> Also, if you run the test via 'perf stat -a ./test.sh' you should
> see a reduction in instructions executed:
>
> aldebaran:~/linux/linux> perf stat -a sleep 1
>
>  Performance counter stats for 'sleep':
>
>    16128.045994  task clock ticks     (msecs)
>           12876  context switches     (events)
>             219  CPU migrations       (events)
>          186144  pagefaults           (events)
>     20911802763  CPU cycles           (events)
>     19309416815  instructions         (events)
>       199608554  cache references     (events)
>        19990754  cache misses         (events)
>
>  Wall-clock time elapsed:  1008.882282 msecs
>
> With -a it's measured system-wide, from start of test to end of test
> - the results will be a lot more stable (and relevant) statistically
> than wall-clock time or CPU usage measurements. (both of which are
> rather imprecise in general)

I tried this perf stuff and got strange results on a cpu-burning bench,
saturating my 8 cpus with a "while (1) ;" loop.

# perf stat -a sleep 10

 Performance counter stats for 'sleep':

   80334.709038  task clock ticks     (msecs)
          80638  context switches     (events)
              4  CPU migrations       (events)
            468  pagefaults           (events)
   160694681969  CPU cycles           (events)
   160127154810  instructions         (events)
         686393  cache references     (events)
         230117  cache misses         (events)

 Wall-clock time elapsed: 10041.531644 msecs

So that is about 16069468196 cycles per second for 8 cpus.
Divide by 8 to get 2008683524 cycles per second per cpu,
which is not 3000000000 (E5450 @ 3.00GHz).

It seems strange that a "jmp myself" loop uses one unhalted cycle per
instruction and 0.5 halted cycles...

Also, after using "perf stat", tbench results are 1778 MB/s instead of
2610 MB/s. Even if no perf stat is running.
* Re: [PATCH] poll: Avoid extra wakeups in select/poll
  From: Ingo Molnar (2009-04-30 11:57 UTC)
  To: Eric Dumazet
  Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel

* Eric Dumazet <dada1@cosmosbay.com> wrote:

> I tried this perf stuff and got strange results on a cpu-burning bench,
> saturating my 8 cpus with a "while (1) ;" loop.
>
> # perf stat -a sleep 10
>
>  Performance counter stats for 'sleep':
>
>    80334.709038  task clock ticks     (msecs)
>           80638  context switches     (events)
>               4  CPU migrations       (events)
>             468  pagefaults           (events)
>    160694681969  CPU cycles           (events)
>    160127154810  instructions         (events)
>          686393  cache references     (events)
>          230117  cache misses         (events)
>
>  Wall-clock time elapsed: 10041.531644 msecs
>
> So that is about 16069468196 cycles per second for 8 cpus.
> Divide by 8 to get 2008683524 cycles per second per cpu,
> which is not 3000000000 (E5450 @ 3.00GHz).

What does "perf stat -l -a sleep 10" show? I suspect your counters
are scaled by about 67%, due to counter over-commit. -l will show
the scaling factor (and will scale up the results).

If so then i think this behavior is confusing, and i'll make -l
default-enabled. (in fact i just committed this change to latest
-tip and pushed it out)

To get only instructions and cycles, do:

	perf stat -e instructions -e cycles

> It seems strange that a "jmp myself" loop uses one unhalted cycle per
> instruction and 0.5 halted cycles...
>
> Also, after using "perf stat", tbench results are 1778 MB/s instead of
> 2610 MB/s. Even if no perf stat is running.

Hm, that would be a bug. Could you send the dmesg output of:

	echo p > /proc/sysrq-trigger
	echo p > /proc/sysrq-trigger

with counters running it will show something like:

[  868.105712] SysRq : Show Regs
[  868.106544]
[  868.106544] CPU#1: ctrl:       ffffffffffffffff
[  868.106544] CPU#1: status:     0000000000000000
[  868.106544] CPU#1: overflow:   0000000000000000
[  868.106544] CPU#1: fixed:      0000000000000000
[  868.106544] CPU#1: used:       0000000000000000
[  868.106544] CPU#1: gen-PMC0 ctrl:  00000000001300c0
[  868.106544] CPU#1: gen-PMC0 count: 000000ffee889194
[  868.106544] CPU#1: gen-PMC0 left:  0000000011e1791a
[  868.106544] CPU#1: gen-PMC1 ctrl:  000000000013003c
[  868.106544] CPU#1: gen-PMC1 count: 000000ffd2542438
[  868.106544] CPU#1: gen-PMC1 left:  000000002dd17a8e

the counts should stay put (i.e. all counters should be disabled).
If they move around - despite there being no 'perf stat -a' session
running, that would be a bug.

Also, the overhead might be profile-able, via:

	perf record -m 1024 sleep 10

(this records the profile into output.perf.)

followed by:

	./perf-report | tail -20

to display a histogram, with kernel-space and user-space symbols
mixed into a single profile.

(Pick up latest -tip to get perf-report built by default.)

	Ingo
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-30 11:57 ` Ingo Molnar @ 2009-04-30 14:08 ` Eric Dumazet 2009-04-30 16:07 ` [BUG] perf_counter: change cpu frequencies Eric Dumazet 2009-04-30 21:24 ` [PATCH] poll: Avoid extra wakeups in select/poll Paul E. McKenney 0 siblings, 2 replies; 44+ messages in thread From: Eric Dumazet @ 2009-04-30 14:08 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel Ingo Molnar a écrit : > * Eric Dumazet <dada1@cosmosbay.com> wrote: > >> Ingo Molnar a écrit : >>> * Eric Dumazet <dada1@cosmosbay.com> wrote: >>> >>>> On uddpping, I had prior to the patch about 49000 wakeups per >>>> second, and after patch about 26000 wakeups per second (matches >>>> number of incoming udp messages per second) >>> very nice. It might not show up as a real performance difference if >>> the CPUs are not fully saturated during the test - but it could show >>> up as a decrease in CPU utilization. >>> >>> Also, if you run the test via 'perf stat -a ./test.sh' you should >>> see a reduction in instructions executed: >>> >>> aldebaran:~/linux/linux> perf stat -a sleep 1 >>> >>> Performance counter stats for 'sleep': >>> >>> 16128.045994 task clock ticks (msecs) >>> 12876 context switches (events) >>> 219 CPU migrations (events) >>> 186144 pagefaults (events) >>> 20911802763 CPU cycles (events) >>> 19309416815 instructions (events) >>> 199608554 cache references (events) >>> 19990754 cache misses (events) >>> >>> Wall-clock time elapsed: 1008.882282 msecs >>> >>> With -a it's measured system-wide, from start of test to end of test >>> - the results will be a lot more stable (and relevant) statistically >>> than wall-clock time or CPU usage measurements. (both of which are >>> rather imprecise in general) >> I tried this perf stuff and got strange results on a cpu burning >> bench, saturating my 8 cpus with a "while (1) ;" loop >> >> >> # perf stat -a sleep 10 >> >> Performance counter stats for 'sleep': >> >> 80334.709038 task clock ticks (msecs) >> 80638 context switches (events) >> 4 CPU migrations (events) >> 468 pagefaults (events) >> 160694681969 CPU cycles (events) >> 160127154810 instructions (events) >> 686393 cache references (events) >> 230117 cache misses (events) >> >> Wall-clock time elapsed: 10041.531644 msecs >> >> So its about 16069468196 cycles per second for 8 cpus >> Divide by 8 to get 2008683524 cycles per second per cpu, >> which is not 3000000000 (E5450 @ 3.00GHz) > > What does "perf stat -l -a sleep 10" show? I suspect your counters > are scaled by about 67%, due to counter over-commit. -l will show > the scaling factor (and will scale up the results). Only difference I see with '-l' is cache misses not counted. 
(tbench 8 running, so not one instruction per cycle) # perf stat -l -a sleep 10 Performance counter stats for 'sleep': 80007.128844 task clock ticks (msecs) 6754642 context switches (events) 2 CPU migrations (events) 474 pagefaults (events) 160925719143 CPU cycles (events) 108482003620 instructions (events) 7584035056 cache references (events) <not counted> cache misses Wall-clock time elapsed: 10000.595448 msecs # perf stat -a sleep 10 Performance counter stats for 'sleep': 80702.908287 task clock ticks (msecs) 6792588 context switches (events) 24 CPU migrations (events) 4867 pagefaults (events) 161957342744 CPU cycles (events) 109147553984 instructions (events) 7633190481 cache references (events) 22996234 cache misses (events) Wall-clock time elapsed: 10087.502391 msecs > > If so then i think this behavior is confusing, and i'll make -l > default-enabled. (in fact i just committed this change to latest > -tip and pushed it out) > > To get only instructions and cycles, do: > > perf stat -e instructions -e cycles > # perf stat -e instructions -e cycles -a sleep 10 Performance counter stats for 'sleep': 109469842392 instructions (events) 162012922122 CPU cycles (events) Wall-clock time elapsed: 10124.943544 msecs I am wondering if cpus are not running at 2 GHz ;) >> It seems strange a "jmp myself" uses one unhalted cycle per >> instruction and 0.5 halted cycle ... >> >> Also, after using "perf stat", tbench results are 1778 MB/S >> instead of 2610 MB/s. Even if no perf stat running. > > Hm, that would be a bug. Could you send the dmesg output of: > > echo p > /proc/sysrq-trigger > echo p > /proc/sysrq-trigger > > with counters running it will show something like: > > [ 868.105712] SysRq : Show Regs > [ 868.106544] > [ 868.106544] CPU#1: ctrl: ffffffffffffffff > [ 868.106544] CPU#1: status: 0000000000000000 > [ 868.106544] CPU#1: overflow: 0000000000000000 > [ 868.106544] CPU#1: fixed: 0000000000000000 > [ 868.106544] CPU#1: used: 0000000000000000 > [ 868.106544] CPU#1: gen-PMC0 ctrl: 00000000001300c0 > [ 868.106544] CPU#1: gen-PMC0 count: 000000ffee889194 > [ 868.106544] CPU#1: gen-PMC0 left: 0000000011e1791a > [ 868.106544] CPU#1: gen-PMC1 ctrl: 000000000013003c > [ 868.106544] CPU#1: gen-PMC1 count: 000000ffd2542438 > [ 868.106544] CPU#1: gen-PMC1 left: 000000002dd17a8e They stay fix (but only CPU#0 is displayed) Is perf able to display per cpu counters, and not aggregated values ? 
[ 7894.426787] CPU#0: ctrl: ffffffffffffffff [ 7894.426788] CPU#0: status: 0000000000000000 [ 7894.426790] CPU#0: overflow: 0000000000000000 [ 7894.426792] CPU#0: fixed: 0000000000000000 [ 7894.426793] CPU#0: used: 0000000000000000 [ 7894.426796] CPU#0: gen-PMC0 ctrl: 0000000000134f2e [ 7894.426798] CPU#0: gen-PMC0 count: 000000ffb91e31e1 [ 7894.426799] CPU#0: gen-PMC0 left: 000000007fffffff [ 7894.426802] CPU#0: gen-PMC1 ctrl: 000000000013412e [ 7894.426804] CPU#0: gen-PMC1 count: 000000ff80312b23 [ 7894.426805] CPU#0: gen-PMC1 left: 000000007fffffff [ 7894.426807] CPU#0: fixed-PMC0 count: 000000ffacf54a68 [ 7894.426809] CPU#0: fixed-PMC1 count: 000000ffb71cfe02 [ 7894.426811] CPU#0: fixed-PMC2 count: 0000000000000000 [ 7905.522262] SysRq : Show Regs [ 7905.522277] [ 7905.522279] CPU#0: ctrl: ffffffffffffffff [ 7905.522281] CPU#0: status: 0000000000000000 [ 7905.522283] CPU#0: overflow: 0000000000000000 [ 7905.522284] CPU#0: fixed: 0000000000000000 [ 7905.522286] CPU#0: used: 0000000000000000 [ 7905.522288] CPU#0: gen-PMC0 ctrl: 0000000000134f2e [ 7905.522290] CPU#0: gen-PMC0 count: 000000ffb91e31e1 [ 7905.522292] CPU#0: gen-PMC0 left: 000000007fffffff [ 7905.522294] CPU#0: gen-PMC1 ctrl: 000000000013412e [ 7905.522296] CPU#0: gen-PMC1 count: 000000ff80312b23 [ 7905.522298] CPU#0: gen-PMC1 left: 000000007fffffff [ 7905.522299] CPU#0: fixed-PMC0 count: 000000ffacf54a68 [ 7905.522301] CPU#0: fixed-PMC1 count: 000000ffb71cfe02 [ 7905.522303] CPU#0: fixed-PMC2 count: 0000000000000000 > > the counts should stay put (i.e. all counters should be disabled). > If they move around - despite there being no 'perf stat -a' session > running, that would be a bug. I rebooted my machine then got good results # perf stat -e instructions -e cycles -a sleep 10 Performance counter stats for 'sleep': 240021659058 instructions (events) 240997984836 CPU cycles (events) << OK >> Wall-clock time elapsed: 10041.499326 msecs But if I use plain "perf stat -a sleep 10" it seems I get wrong values again (16 G cycles/sec) for all next perf sessions # perf stat -a sleep 10 Performance counter stats for 'sleep': 80332.718661 task clock ticks (msecs) 80602 context switches (events) 4 CPU migrations (events) 473 pagefaults (events) 160665397757 CPU cycles (events) << bad >> 160079277974 instructions (events) 793665 cache references (events) 226829 cache misses (events) Wall-clock time elapsed: 10041.302969 msecs # perf stat -e cycles -a sleep 10 Performance counter stats for 'sleep': 160665176575 CPU cycles (events) << bad >> Wall-clock time elapsed: 10041.491503 msecs > > Also, the overhead might be profile-able, via: > > perf record -m 1024 sleep 10 > > (this records the profile into output.perf.) > > followed by: > > ./perf-report | tail -20 > > to display a histogram, with kernel-space and user-space symbols > mixed into a single profile. > > (Pick up latest -tip to get perf-report built by default.) Thanks, this is what I use currently ^ permalink raw reply [flat|nested] 44+ messages in thread
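[A minimal sketch, not part of the thread, of the kind of saturation test described above: one busy loop pinned per online CPU, left running while "perf stat -a sleep 10" is issued from another shell. With 8 cores at a nominal 3 GHz the aggregate count should come to roughly 240 G cycles over 10 seconds; the ~160 G figures quoted above correspond to the cores actually running near 2 GHz. The affinity calls and the CPU-count query are glibc/Linux-specific assumptions.]

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	long ncpus = sysconf(_SC_NPROCESSORS_ONLN);
	long cpu;

	for (cpu = 0; cpu < ncpus; cpu++) {
		if (fork() == 0) {
			cpu_set_t set;

			CPU_ZERO(&set);
			CPU_SET(cpu, &set);
			sched_setaffinity(0, sizeof(set), &set); /* pin to one core */
			for (;;)
				;	/* burn cycles until killed */
		}
	}
	printf("%ld spinners running, do 'perf stat -a sleep 10' now\n", ncpus);
	pause();	/* ^C stops the process group, spinners included */
	return 0;
}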
* [BUG] perf_counter: change cpu frequencies 2009-04-30 14:08 ` Eric Dumazet @ 2009-04-30 16:07 ` Eric Dumazet 2009-05-03 6:06 ` Eric Dumazet 2009-04-30 21:24 ` [PATCH] poll: Avoid extra wakeups in select/poll Paul E. McKenney 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-04-30 16:07 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel Eric Dumazet a écrit : > But if I use plain "perf stat -a sleep 10" > it seems I get wrong values again (16 G cycles/sec) for all next perf sessions > Well, I confirm all my cpus switched from 3GHz to 2GHz, after "perf stat -a sleep 10" (but "perf stat -e instructions -e cycles -a sleep 10" doesnt trigger this problem) Nothing logged, and /proc/cpuinfo stills reports 3 GHz frequencies # cat unit.c main() { int i; for (i = 0 ; i < 10000000; i++) getppid(); } # time ./unit real 0m0.818s user 0m0.289s sys 0m0.529s # perf stat -a sleep 10 2>/dev/null # time ./unit real 0m1.122s user 0m0.482s sys 0m0.640s # tail -n 27 /proc/cpuinfo processor : 7 vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz stepping : 6 cpu MHz : 3000.102 cache size : 6144 KB physical id : 1 siblings : 1 core id : 3 cpu cores : 4 apicid : 7 initial apicid : 7 fdiv_bug : no hlt_bug : no f00f_bug : no coma_bug : no fpu : yes fpu_exception : yes cpuid level : 10 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority bogomips : 6000.01 clflush size : 64 power management: # grep CPU_FREQ .config # CONFIG_CPU_FREQ is not set perf_counter seems promising, but still... needs some bug hunting :) Thank you ^ permalink raw reply [flat|nested] 44+ messages in thread
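[A compilable variant of the unit.c probe quoted above, offered as a sketch rather than Eric's exact file: it adds the missing includes and times the loop itself, so the before/after comparison does not depend on the shell's time builtin. The iteration count is unchanged.]

#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERATIONS 10000000

int main(void)
{
	struct timeval start, stop;
	double ns;
	int i;

	gettimeofday(&start, NULL);
	for (i = 0; i < ITERATIONS; i++)
		getppid();		/* cheap syscall: mostly kernel entry/exit cost */
	gettimeofday(&stop, NULL);

	ns = (stop.tv_sec - start.tv_sec) * 1e9 +
	     (stop.tv_usec - start.tv_usec) * 1e3;
	printf("%.1f ns per getppid() call\n", ns / ITERATIONS);
	return 0;
}

[Run once before and once after a "perf stat -a sleep 10" session; on the machine above the loop slows from 0.818 s to 1.122 s when the cores drop to the lower frequency.]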
* Re: [BUG] perf_counter: change cpu frequencies 2009-04-30 16:07 ` [BUG] perf_counter: change cpu frequencies Eric Dumazet @ 2009-05-03 6:06 ` Eric Dumazet 2009-05-03 7:25 ` Ingo Molnar 0 siblings, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-05-03 6:06 UTC (permalink / raw) To: Ingo Molnar Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Mike Galbraith, Peter Zijlstra Eric Dumazet a écrit : > Eric Dumazet a écrit : > >> But if I use plain "perf stat -a sleep 10" >> it seems I get wrong values again (16 G cycles/sec) for all next perf sessions >> > > Well, I confirm all my cpus switched from 3GHz to 2GHz, after > > "perf stat -a sleep 10" > > (but "perf stat -e instructions -e cycles -a sleep 10" doesnt trigger this problem) > > Nothing logged, and /proc/cpuinfo stills reports 3 GHz frequencies > > # cat unit.c > main() { > int i; > for (i = 0 ; i < 10000000; i++) > getppid(); > } > # time ./unit > > real 0m0.818s > user 0m0.289s > sys 0m0.529s > # perf stat -a sleep 10 2>/dev/null > # time ./unit > > real 0m1.122s > user 0m0.482s > sys 0m0.640s > > # tail -n 27 /proc/cpuinfo > processor : 7 > vendor_id : GenuineIntel > cpu family : 6 > model : 23 > model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz > stepping : 6 > cpu MHz : 3000.102 > cache size : 6144 KB > physical id : 1 > siblings : 1 > core id : 3 > cpu cores : 4 > apicid : 7 > initial apicid : 7 > fdiv_bug : no > hlt_bug : no > f00f_bug : no > coma_bug : no > fpu : yes > fpu_exception : yes > cpuid level : 10 > wp : yes > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority > bogomips : 6000.01 > clflush size : 64 > power management: > > # grep CPU_FREQ .config > # CONFIG_CPU_FREQ is not set > > > perf_counter seems promising, but still... needs some bug hunting :) > Update : Mike Galbraith suggested me to try various things, and finally, I discovered this frequency change was probably a BIOS problem on my HP BL460c G1 System Options -> Power regulator for Proliant [*] HP Dynamic Power Savings Mode [ ] HP Static Low Power Mode [ ] HP Static High Performance Mode [ ] OS Control Mode I switched it to 'OS Control Mode' Then acpi-cpufreq could load, and no more frequencies changes on a "perf -a sleep 10" session, using or not cpufreq. (Supported cpufreq speeds on these cpus : 1999 & 2999 MHz) So it was a BIOS issue # perf stat -a sleep 10 Performance counter stats for 'sleep': 80005.418223 task clock ticks (msecs) 80266 context switches (events) 3 CPU migrations (events) 486 pagefaults (events) 240013851624 CPU cycles (events) << good >> 239076501419 instructions (events) 679464 cache references (events) <not counted> cache misses Wall-clock time elapsed: 10000.468808 msecs Thank you ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [BUG] perf_counter: change cpu frequencies 2009-05-03 6:06 ` Eric Dumazet @ 2009-05-03 7:25 ` Ingo Molnar 2009-05-04 10:39 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Ingo Molnar @ 2009-05-03 7:25 UTC (permalink / raw) To: Eric Dumazet, H. Peter Anvin, Paul E. McKenney, Paul Mackerras Cc: Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Mike Galbraith, Peter Zijlstra * Eric Dumazet <dada1@cosmosbay.com> wrote: > Eric Dumazet a écrit : > > Eric Dumazet a écrit : > > > >> But if I use plain "perf stat -a sleep 10" > >> it seems I get wrong values again (16 G cycles/sec) for all next perf sessions > >> > > > > Well, I confirm all my cpus switched from 3GHz to 2GHz, after > > > > "perf stat -a sleep 10" > > > > (but "perf stat -e instructions -e cycles -a sleep 10" doesnt trigger this problem) > > > > Nothing logged, and /proc/cpuinfo stills reports 3 GHz frequencies > > > > # cat unit.c > > main() { > > int i; > > for (i = 0 ; i < 10000000; i++) > > getppid(); > > } > > # time ./unit > > > > real 0m0.818s > > user 0m0.289s > > sys 0m0.529s > > # perf stat -a sleep 10 2>/dev/null > > # time ./unit > > > > real 0m1.122s > > user 0m0.482s > > sys 0m0.640s > > > > # tail -n 27 /proc/cpuinfo > > processor : 7 > > vendor_id : GenuineIntel > > cpu family : 6 > > model : 23 > > model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz > > stepping : 6 > > cpu MHz : 3000.102 > > cache size : 6144 KB > > physical id : 1 > > siblings : 1 > > core id : 3 > > cpu cores : 4 > > apicid : 7 > > initial apicid : 7 > > fdiv_bug : no > > hlt_bug : no > > f00f_bug : no > > coma_bug : no > > fpu : yes > > fpu_exception : yes > > cpuid level : 10 > > wp : yes > > flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority > > bogomips : 6000.01 > > clflush size : 64 > > power management: > > > > # grep CPU_FREQ .config > > # CONFIG_CPU_FREQ is not set > > > > > > perf_counter seems promising, but still... needs some bug hunting :) > > > > Update : > > Mike Galbraith suggested me to try various things, and finally, I discovered > this frequency change was probably a BIOS problem on my HP BL460c G1 > > System Options -> Power regulator for Proliant > > [*] HP Dynamic Power Savings Mode > [ ] HP Static Low Power Mode > [ ] HP Static High Performance Mode > [ ] OS Control Mode > > > I switched it to 'OS Control Mode' > > Then acpi-cpufreq could load, and no more frequencies changes on a "perf -a sleep 10" > session, using or not cpufreq. > (Supported cpufreq speeds on these cpus : 1999 & 2999 MHz) > > So it was a BIOS issue ah! That makes quite a bit of sense. The BIOS interfering with an OS feature ... Was that the default setting in the BIOS? > # perf stat -a sleep 10 > > Performance counter stats for 'sleep': > > 80005.418223 task clock ticks (msecs) > 80266 context switches (events) > 3 CPU migrations (events) > 486 pagefaults (events) > 240013851624 CPU cycles (events) << good >> > 239076501419 instructions (events) > 679464 cache references (events) > <not counted> cache misses > > Wall-clock time elapsed: 10000.468808 msecs That looks perfect now. It would also be really nice to have a sysrq-p dump of your PMU state before you've done any profiling. 
Is there any trace of the BIOS meddling with them, that we could detect (and warn about) during bootup? Thanks, Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [BUG] perf_counter: change cpu frequencies 2009-05-03 7:25 ` Ingo Molnar @ 2009-05-04 10:39 ` Eric Dumazet 0 siblings, 0 replies; 44+ messages in thread From: Eric Dumazet @ 2009-05-04 10:39 UTC (permalink / raw) To: Ingo Molnar Cc: H. Peter Anvin, Paul E. McKenney, Paul Mackerras, Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel, Mike Galbraith, Peter Zijlstra Ingo Molnar a écrit : > * Eric Dumazet <dada1@cosmosbay.com> wrote: > >> Eric Dumazet a écrit : >>> Eric Dumazet a écrit : >>> >>>> But if I use plain "perf stat -a sleep 10" >>>> it seems I get wrong values again (16 G cycles/sec) for all next perf sessions >>>> >>> Well, I confirm all my cpus switched from 3GHz to 2GHz, after >>> >>> "perf stat -a sleep 10" >>> >>> (but "perf stat -e instructions -e cycles -a sleep 10" doesnt trigger this problem) >>> >>> Nothing logged, and /proc/cpuinfo stills reports 3 GHz frequencies >>> >>> # cat unit.c >>> main() { >>> int i; >>> for (i = 0 ; i < 10000000; i++) >>> getppid(); >>> } >>> # time ./unit >>> >>> real 0m0.818s >>> user 0m0.289s >>> sys 0m0.529s >>> # perf stat -a sleep 10 2>/dev/null >>> # time ./unit >>> >>> real 0m1.122s >>> user 0m0.482s >>> sys 0m0.640s >>> >>> # tail -n 27 /proc/cpuinfo >>> processor : 7 >>> vendor_id : GenuineIntel >>> cpu family : 6 >>> model : 23 >>> model name : Intel(R) Xeon(R) CPU E5450 @ 3.00GHz >>> stepping : 6 >>> cpu MHz : 3000.102 >>> cache size : 6144 KB >>> physical id : 1 >>> siblings : 1 >>> core id : 3 >>> cpu cores : 4 >>> apicid : 7 >>> initial apicid : 7 >>> fdiv_bug : no >>> hlt_bug : no >>> f00f_bug : no >>> coma_bug : no >>> fpu : yes >>> fpu_exception : yes >>> cpuid level : 10 >>> wp : yes >>> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe lm constant_tsc arch_perfmon pebs bts pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 lahf_lm tpr_shadow vnmi flexpriority >>> bogomips : 6000.01 >>> clflush size : 64 >>> power management: >>> >>> # grep CPU_FREQ .config >>> # CONFIG_CPU_FREQ is not set >>> >>> >>> perf_counter seems promising, but still... needs some bug hunting :) >>> >> Update : >> >> Mike Galbraith suggested me to try various things, and finally, I discovered >> this frequency change was probably a BIOS problem on my HP BL460c G1 >> >> System Options -> Power regulator for Proliant >> >> [*] HP Dynamic Power Savings Mode >> [ ] HP Static Low Power Mode >> [ ] HP Static High Performance Mode >> [ ] OS Control Mode >> >> >> I switched it to 'OS Control Mode' >> >> Then acpi-cpufreq could load, and no more frequencies changes on a "perf -a sleep 10" >> session, using or not cpufreq. >> (Supported cpufreq speeds on these cpus : 1999 & 2999 MHz) >> >> So it was a BIOS issue > > ah! That makes quite a bit of sense. The BIOS interfering with an OS > feature ... Was that the default setting in the BIOS? This was default setting in BIOS, yes. > >> # perf stat -a sleep 10 >> >> Performance counter stats for 'sleep': >> >> 80005.418223 task clock ticks (msecs) >> 80266 context switches (events) >> 3 CPU migrations (events) >> 486 pagefaults (events) >> 240013851624 CPU cycles (events) << good >> >> 239076501419 instructions (events) >> 679464 cache references (events) >> <not counted> cache misses >> >> Wall-clock time elapsed: 10000.468808 msecs > > That looks perfect now. 
> > It would also be really nice to have a sysrq-p dump of your PMU > state before you've done any profiling. Is there any trace of the > BIOS meddling with them, that we could detect (and warn about) > during bootup? Difference is that on BIOS set to 'OS Control Mode' I see one more entry in ACPI list : [ 0.000000] ACPI: SSDT cfe5b000 004C9 (v01 HP SSDTP 00000001 INTL 20030228) ... And these 8 additional lines after (one per cpu) [ 0.706697] ACPI: SSDT cfe5c000 002DA (v01 HP SSDT0 00000001 INTL 20030228) [ 0.707250] ACPI: SSDT cfe5c300 002DA (v01 HP SSDT1 00000001 INTL 20030228) [ 0.707768] ACPI: SSDT cfe5c600 002DA (v01 HP SSDT2 00000001 INTL 20030228) [ 0.708376] ACPI: SSDT cfe5c900 002DF (v01 HP SSDT3 00000001 INTL 20030228) [ 0.708964] ACPI: SSDT cfe5cc00 002DA (v01 HP SSDT4 00000001 INTL 20030228) [ 0.709567] ACPI: SSDT cfe5cf00 002DA (v01 HP SSDT5 00000001 INTL 20030228) [ 0.710122] ACPI: SSDT cfe5d200 002DA (v01 HP SSDT6 00000001 INTL 20030228) [ 0.710713] ACPI: SSDT cfe5d500 002DA (v01 HP SSDT7 00000001 INTL 20030228) Also, if this option is set to default (HP Dynamic Power Savings Mode) I get : # modprobe acpi-cpufreq FATAL: Error inserting acpi_cpufreq (/lib/modules/2.6.30-rc4-tip-01560-gdd5fa92/kernel/arch/x86/kernel/cpu/cpufreq/acpi-cpufreq.ko): No such device but no kernel message logged. Might be possible to add some kind of warning yes, I can test a patch if you want. ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-30 14:08 ` Eric Dumazet 2009-04-30 16:07 ` [BUG] perf_counter: change cpu frequencies Eric Dumazet @ 2009-04-30 21:24 ` Paul E. McKenney 1 sibling, 0 replies; 44+ messages in thread From: Paul E. McKenney @ 2009-04-30 21:24 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, Christoph Lameter, linux kernel, Andi Kleen, David Miller, jesse.brandeburg, netdev, haoki, mchan, davidel On Thu, Apr 30, 2009 at 04:08:48PM +0200, Eric Dumazet wrote: > Ingo Molnar a écrit : > > * Eric Dumazet <dada1@cosmosbay.com> wrote: > > > >> Ingo Molnar a écrit : > >>> * Eric Dumazet <dada1@cosmosbay.com> wrote: > >>> > >>>> On uddpping, I had prior to the patch about 49000 wakeups per > >>>> second, and after patch about 26000 wakeups per second (matches > >>>> number of incoming udp messages per second) > >>> very nice. It might not show up as a real performance difference if > >>> the CPUs are not fully saturated during the test - but it could show > >>> up as a decrease in CPU utilization. > >>> > >>> Also, if you run the test via 'perf stat -a ./test.sh' you should > >>> see a reduction in instructions executed: > >>> > >>> aldebaran:~/linux/linux> perf stat -a sleep 1 > >>> > >>> Performance counter stats for 'sleep': > >>> > >>> 16128.045994 task clock ticks (msecs) > >>> 12876 context switches (events) > >>> 219 CPU migrations (events) > >>> 186144 pagefaults (events) > >>> 20911802763 CPU cycles (events) > >>> 19309416815 instructions (events) > >>> 199608554 cache references (events) > >>> 19990754 cache misses (events) > >>> > >>> Wall-clock time elapsed: 1008.882282 msecs > >>> > >>> With -a it's measured system-wide, from start of test to end of test > >>> - the results will be a lot more stable (and relevant) statistically > >>> than wall-clock time or CPU usage measurements. (both of which are > >>> rather imprecise in general) > >> I tried this perf stuff and got strange results on a cpu burning > >> bench, saturating my 8 cpus with a "while (1) ;" loop > >> > >> > >> # perf stat -a sleep 10 > >> > >> Performance counter stats for 'sleep': > >> > >> 80334.709038 task clock ticks (msecs) > >> 80638 context switches (events) > >> 4 CPU migrations (events) > >> 468 pagefaults (events) > >> 160694681969 CPU cycles (events) > >> 160127154810 instructions (events) > >> 686393 cache references (events) > >> 230117 cache misses (events) > >> > >> Wall-clock time elapsed: 10041.531644 msecs > >> > >> So its about 16069468196 cycles per second for 8 cpus > >> Divide by 8 to get 2008683524 cycles per second per cpu, > >> which is not 3000000000 (E5450 @ 3.00GHz) > > > > What does "perf stat -l -a sleep 10" show? I suspect your counters > > are scaled by about 67%, due to counter over-commit. -l will show > > the scaling factor (and will scale up the results). > > Only difference I see with '-l' is cache misses not counted. 
> > (tbench 8 running, so not one instruction per cycle) > > # perf stat -l -a sleep 10 > > Performance counter stats for 'sleep': > > 80007.128844 task clock ticks (msecs) > 6754642 context switches (events) > 2 CPU migrations (events) > 474 pagefaults (events) > 160925719143 CPU cycles (events) > 108482003620 instructions (events) > 7584035056 cache references (events) > <not counted> cache misses > > Wall-clock time elapsed: 10000.595448 msecs > > # perf stat -a sleep 10 > > Performance counter stats for 'sleep': > > 80702.908287 task clock ticks (msecs) > 6792588 context switches (events) > 24 CPU migrations (events) > 4867 pagefaults (events) > 161957342744 CPU cycles (events) > 109147553984 instructions (events) > 7633190481 cache references (events) > 22996234 cache misses (events) > > Wall-clock time elapsed: 10087.502391 msecs > > > > > > > If so then i think this behavior is confusing, and i'll make -l > > default-enabled. (in fact i just committed this change to latest > > -tip and pushed it out) > > > > To get only instructions and cycles, do: > > > > perf stat -e instructions -e cycles > > > > # perf stat -e instructions -e cycles -a sleep 10 > > Performance counter stats for 'sleep': > > 109469842392 instructions (events) > 162012922122 CPU cycles (events) > > Wall-clock time elapsed: 10124.943544 msecs > > I am wondering if cpus are not running at 2 GHz ;) > > > >> It seems strange a "jmp myself" uses one unhalted cycle per > >> instruction and 0.5 halted cycle ... > >> > >> Also, after using "perf stat", tbench results are 1778 MB/S > >> instead of 2610 MB/s. Even if no perf stat running. > > > > Hm, that would be a bug. Could you send the dmesg output of: > > > > echo p > /proc/sysrq-trigger > > echo p > /proc/sysrq-trigger > > > > with counters running it will show something like: > > > > [ 868.105712] SysRq : Show Regs > > [ 868.106544] > > [ 868.106544] CPU#1: ctrl: ffffffffffffffff > > [ 868.106544] CPU#1: status: 0000000000000000 > > [ 868.106544] CPU#1: overflow: 0000000000000000 > > [ 868.106544] CPU#1: fixed: 0000000000000000 > > [ 868.106544] CPU#1: used: 0000000000000000 > > [ 868.106544] CPU#1: gen-PMC0 ctrl: 00000000001300c0 > > [ 868.106544] CPU#1: gen-PMC0 count: 000000ffee889194 > > [ 868.106544] CPU#1: gen-PMC0 left: 0000000011e1791a > > [ 868.106544] CPU#1: gen-PMC1 ctrl: 000000000013003c > > [ 868.106544] CPU#1: gen-PMC1 count: 000000ffd2542438 > > [ 868.106544] CPU#1: gen-PMC1 left: 000000002dd17a8e > > They stay fix (but only CPU#0 is displayed) > > Is perf able to display per cpu counters, and not aggregated values ? 
> > [ 7894.426787] CPU#0: ctrl: ffffffffffffffff > [ 7894.426788] CPU#0: status: 0000000000000000 > [ 7894.426790] CPU#0: overflow: 0000000000000000 > [ 7894.426792] CPU#0: fixed: 0000000000000000 > [ 7894.426793] CPU#0: used: 0000000000000000 > [ 7894.426796] CPU#0: gen-PMC0 ctrl: 0000000000134f2e > [ 7894.426798] CPU#0: gen-PMC0 count: 000000ffb91e31e1 > [ 7894.426799] CPU#0: gen-PMC0 left: 000000007fffffff > [ 7894.426802] CPU#0: gen-PMC1 ctrl: 000000000013412e > [ 7894.426804] CPU#0: gen-PMC1 count: 000000ff80312b23 > [ 7894.426805] CPU#0: gen-PMC1 left: 000000007fffffff > [ 7894.426807] CPU#0: fixed-PMC0 count: 000000ffacf54a68 > [ 7894.426809] CPU#0: fixed-PMC1 count: 000000ffb71cfe02 > [ 7894.426811] CPU#0: fixed-PMC2 count: 0000000000000000 > [ 7905.522262] SysRq : Show Regs > [ 7905.522277] > [ 7905.522279] CPU#0: ctrl: ffffffffffffffff > [ 7905.522281] CPU#0: status: 0000000000000000 > [ 7905.522283] CPU#0: overflow: 0000000000000000 > [ 7905.522284] CPU#0: fixed: 0000000000000000 > [ 7905.522286] CPU#0: used: 0000000000000000 > [ 7905.522288] CPU#0: gen-PMC0 ctrl: 0000000000134f2e > [ 7905.522290] CPU#0: gen-PMC0 count: 000000ffb91e31e1 > [ 7905.522292] CPU#0: gen-PMC0 left: 000000007fffffff > [ 7905.522294] CPU#0: gen-PMC1 ctrl: 000000000013412e > [ 7905.522296] CPU#0: gen-PMC1 count: 000000ff80312b23 > [ 7905.522298] CPU#0: gen-PMC1 left: 000000007fffffff > [ 7905.522299] CPU#0: fixed-PMC0 count: 000000ffacf54a68 > [ 7905.522301] CPU#0: fixed-PMC1 count: 000000ffb71cfe02 > [ 7905.522303] CPU#0: fixed-PMC2 count: 0000000000000000 > > > > > > the counts should stay put (i.e. all counters should be disabled). > > If they move around - despite there being no 'perf stat -a' session > > running, that would be a bug. > > I rebooted my machine then got good results > > # perf stat -e instructions -e cycles -a sleep 10 > > Performance counter stats for 'sleep': > > 240021659058 instructions (events) > 240997984836 CPU cycles (events) << OK >> > > Wall-clock time elapsed: 10041.499326 msecs > > But if I use plain "perf stat -a sleep 10" > it seems I get wrong values again (16 G cycles/sec) for all next perf sessions I have to ask... Is it possible that the machine runs at 3GHz initially, but slows down to 2GHz for cooling reasons? One thing to try would be to run powertop, which displays the frequencies. I get the following if mostly idle: PowerTOP version 1.8 (C) 2007 Intel Corporation Cn Avg residency P-states (frequencies) C0 (cpu running) (14.1%) 2.17 Ghz 4.3% C1 0.0ms ( 0.0%) 1.67 Ghz 0.0% C2 0.5ms (16.2%) 1333 Mhz 0.0% C3 0.5ms (69.8%) 1000 Mhz 95.7% And the following with an infinite loop running: PowerTOP version 1.8 (C) 2007 Intel Corporation Cn Avg residency P-states (frequencies) C0 (cpu running) (54.3%) 2.17 Ghz 100.0% C1 0.0ms ( 0.0%) 1.67 Ghz 0.0% C2 1.2ms ( 1.7%) 1333 Mhz 0.0% C3 1.3ms (44.0%) 1000 Mhz 0.0% But I am probably missing the point here... 
Thanx, Paul > # perf stat -a sleep 10 > > Performance counter stats for 'sleep': > > 80332.718661 task clock ticks (msecs) > 80602 context switches (events) > 4 CPU migrations (events) > 473 pagefaults (events) > 160665397757 CPU cycles (events) << bad >> > 160079277974 instructions (events) > 793665 cache references (events) > 226829 cache misses (events) > > Wall-clock time elapsed: 10041.302969 msecs > > # perf stat -e cycles -a sleep 10 > > Performance counter stats for 'sleep': > > 160665176575 CPU cycles (events) << bad >> > > Wall-clock time elapsed: 10041.491503 msecs > > > > > > Also, the overhead might be profile-able, via: > > > > perf record -m 1024 sleep 10 > > > > (this records the profile into output.perf.) > > > > followed by: > > > > ./perf-report | tail -20 > > > > to display a histogram, with kernel-space and user-space symbols > > mixed into a single profile. > > > > (Pick up latest -tip to get perf-report built by default.) > > Thanks, this is what I use currently > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-28 15:06 ` [PATCH] poll: Avoid extra wakeups in select/poll Eric Dumazet 2009-04-28 19:05 ` Christoph Lameter @ 2009-04-29 7:20 ` Andrew Morton 2009-04-29 7:35 ` Andi Kleen 2009-04-29 7:39 ` Eric Dumazet 2009-04-29 9:16 ` Ingo Molnar 2 siblings, 2 replies; 44+ messages in thread From: Andrew Morton @ 2009-04-29 7:20 UTC (permalink / raw) To: Eric Dumazet Cc: linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar On Tue, 28 Apr 2009 17:06:11 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote: > [PATCH] poll: Avoid extra wakeups in select/poll > > After introduction of keyed wakeups Davide Libenzi did on epoll, we > are able to avoid spurious wakeups in poll()/select() code too. > > For example, typical use of poll()/select() is to wait for incoming > network frames on many sockets. But TX completion for UDP/TCP > frames call sock_wfree() which in turn schedules thread. > > When scheduled, thread does a full scan of all polled fds and > can sleep again, because nothing is really available. If number > of fds is large, this cause significant load. > > This patch makes select()/poll() aware of keyed wakeups and > useless wakeups are avoided. This reduces number of context > switches by about 50% on some setups, and work performed > by sofirq handlers. > Seems that this is a virtuous patch even though Christoph is struggling a bit to test it? > fs/select.c | 28 +++++++++++++++++++++++++--- > include/linux/poll.h | 3 +++ > 2 files changed, 28 insertions(+), 3 deletions(-) > > diff --git a/fs/select.c b/fs/select.c > index 0fe0e14..2708187 100644 > --- a/fs/select.c > +++ b/fs/select.c > @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) > return table->entry++; > } > > -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) > +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) > { > struct poll_wqueues *pwq = wait->private; > DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); > @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) > return default_wake_function(&dummy_wait, mode, sync, key); > } > > +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) > +{ > + struct poll_table_entry *entry; > + > + entry = container_of(wait, struct poll_table_entry, wait); > + if (key && !((unsigned long)key & entry->key)) > + return 0; > + return __pollwake(wait, mode, sync, key); > +} > + > /* Add a new entry */ > static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, > poll_table *p) > @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, > get_file(filp); > entry->filp = filp; > entry->wait_address = wait_address; > + entry->key = p->key; > init_waitqueue_func_entry(&entry->wait, pollwake); > entry->wait.private = pwq; > add_wait_queue(wait_address, &entry->wait); > @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) > if (file) { > f_op = file->f_op; > mask = DEFAULT_POLLMASK; > - if (f_op && f_op->poll) > + if (f_op && f_op->poll) { > + if (wait) { > + wait->key = POLLEX_SET; > + if (in & bit) > + wait->key |= POLLIN_SET; > + if (out & bit) > + wait->key |= POLLOUT_SET; > + } > mask = (*f_op->poll)(file, retval ? NULL : wait); > + } <resizes xterm rather a lot> Can we (and should we) avoid all that manipulation of wait->key if `retval' is zero? 
> --- a/include/linux/poll.h > +++ b/include/linux/poll.h > @@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_ > > typedef struct poll_table_struct { > poll_queue_proc qproc; > + unsigned long key; > } poll_table; > > static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p) > @@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres > static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc) > { > pt->qproc = qproc; > + pt->key = ~0UL; /* all events enabled */ I kind of prefer to use plain old -1 for the all-ones pattern. Because it always just works, and doesn't send the reviewer off to check if the type was really u64 or something. It's a bit ugly though. > } > > struct poll_table_entry { > struct file *filp; > + unsigned long key; > wait_queue_t wait; > wait_queue_head_t *wait_address; > }; ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 7:20 ` Andrew Morton @ 2009-04-29 7:35 ` Andi Kleen 2009-04-29 7:37 ` Eric Dumazet 2009-04-29 9:22 ` Ingo Molnar 2009-04-29 7:39 ` Eric Dumazet 1 sibling, 2 replies; 44+ messages in thread From: Andi Kleen @ 2009-04-29 7:35 UTC (permalink / raw) To: Andrew Morton Cc: Eric Dumazet, linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar > Seems that this is a virtuous patch even though Christoph is struggling > a bit to test it? The main drawback is that the select/poll data structures will get larger. That could cause regression in theory. But I suspect the win in some situations is still worth it. Of course it would be nice if it handled more situations (like multiple reader etc.) -Andi ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 7:35 ` Andi Kleen @ 2009-04-29 7:37 ` Eric Dumazet 2009-04-29 9:22 ` Ingo Molnar 1 sibling, 0 replies; 44+ messages in thread From: Eric Dumazet @ 2009-04-29 7:37 UTC (permalink / raw) To: Andi Kleen Cc: Andrew Morton, linux kernel, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar Andi Kleen a écrit : >> Seems that this is a virtuous patch even though Christoph is struggling >> a bit to test it? > > The main drawback is that the select/poll data structures will get > larger. That could cause regression in theory. But I suspect > the win in some situations is still worth it. Of course > it would be nice if it handled more situations (like > multiple reader etc.) At the poll()/select() interface we must wake up every poller, because we don't know which of them will really consume the event: thread 1: poll(); <insert an exit() or something bad here> read(); thread 2: poll(); /* no return because event was 'granted' to thread 1 */ read(); We could try to optimize read()/recvfrom(), because there we can tell whether the event is consumed, as it's a whole syscall. ^ permalink raw reply [flat|nested] 44+ messages in thread
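[A hedged user-space illustration of that point, not from the thread: several threads poll() the same UDP socket, one datagram arrives, every poller may legitimately be reported ready, but only one recv() consumes the data. The kernel cannot know in advance which thread that will be, which is why poll-side wakeups cannot be suppressed the way a single blocking recvfrom() could be. The loopback send-to-self is just a convenient stand-in for real traffic; build with -pthread.]

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <poll.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

static int sock;

static void *poller(void *arg)
{
	struct pollfd pfd = { .fd = sock, .events = POLLIN };
	char buf[64];
	long id = (long)arg;

	if (poll(&pfd, 1, 5000) > 0) {
		ssize_t n = recv(sock, buf, sizeof(buf), MSG_DONTWAIT);

		/* one thread wins the datagram, the others see n == -1 (EAGAIN) */
		printf("thread %ld: poll() ready, recv() = %zd\n", id, n);
	}
	return NULL;
}

int main(void)
{
	struct sockaddr_in addr = { .sin_family = AF_INET };
	socklen_t alen = sizeof(addr);
	pthread_t tid[4];
	long i;

	sock = socket(AF_INET, SOCK_DGRAM, 0);
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	bind(sock, (struct sockaddr *)&addr, sizeof(addr));	/* kernel picks a port */
	getsockname(sock, (struct sockaddr *)&addr, &alen);

	for (i = 0; i < 4; i++)
		pthread_create(&tid[i], NULL, poller, (void *)i);
	sleep(1);						/* let everyone block in poll() */
	sendto(sock, "ping", 4, 0, (struct sockaddr *)&addr, sizeof(addr));

	for (i = 0; i < 4; i++)
		pthread_join(tid[i], NULL);
	return 0;
}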
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 7:35 ` Andi Kleen 2009-04-29 7:37 ` Eric Dumazet @ 2009-04-29 9:22 ` Ingo Molnar 1 sibling, 0 replies; 44+ messages in thread From: Ingo Molnar @ 2009-04-29 9:22 UTC (permalink / raw) To: Andi Kleen Cc: Andrew Morton, Eric Dumazet, linux kernel, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel * Andi Kleen <andi@firstfloor.org> wrote: > > Seems that this is a virtuous patch even though Christoph is struggling > > a bit to test it? > > The main drawback is that the select/poll data structures will get > larger. That could cause regression in theory. [...] Current size of struct poll_table_entry is 0x38 on 64-bit kernels. Adding the key will make it 0x40 - which is not only a power of two but also matches cache line size on most modern CPUs. So the size of this structure is ideal now and arithmetics on the poll table have become simpler as well. So the patch has my ack: Acked-by: Ingo Molnar <mingo@elte.hu> Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 7:20 ` Andrew Morton 2009-04-29 7:35 ` Andi Kleen @ 2009-04-29 7:39 ` Eric Dumazet 2009-04-29 8:26 ` Eric Dumazet 1 sibling, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-04-29 7:39 UTC (permalink / raw) To: Andrew Morton Cc: linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar Andrew Morton a écrit : > On Tue, 28 Apr 2009 17:06:11 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote: > >> [PATCH] poll: Avoid extra wakeups in select/poll >> >> After introduction of keyed wakeups Davide Libenzi did on epoll, we >> are able to avoid spurious wakeups in poll()/select() code too. >> >> For example, typical use of poll()/select() is to wait for incoming >> network frames on many sockets. But TX completion for UDP/TCP >> frames call sock_wfree() which in turn schedules thread. >> >> When scheduled, thread does a full scan of all polled fds and >> can sleep again, because nothing is really available. If number >> of fds is large, this cause significant load. >> >> This patch makes select()/poll() aware of keyed wakeups and >> useless wakeups are avoided. This reduces number of context >> switches by about 50% on some setups, and work performed >> by sofirq handlers. >> > > Seems that this is a virtuous patch even though Christoph is struggling > a bit to test it? > >> fs/select.c | 28 +++++++++++++++++++++++++--- >> include/linux/poll.h | 3 +++ >> 2 files changed, 28 insertions(+), 3 deletions(-) >> >> diff --git a/fs/select.c b/fs/select.c >> index 0fe0e14..2708187 100644 >> --- a/fs/select.c >> +++ b/fs/select.c >> @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) >> return table->entry++; >> } >> >> -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >> +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >> { >> struct poll_wqueues *pwq = wait->private; >> DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); >> @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >> return default_wake_function(&dummy_wait, mode, sync, key); >> } >> >> +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >> +{ >> + struct poll_table_entry *entry; >> + >> + entry = container_of(wait, struct poll_table_entry, wait); >> + if (key && !((unsigned long)key & entry->key)) >> + return 0; >> + return __pollwake(wait, mode, sync, key); >> +} >> + >> /* Add a new entry */ >> static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, >> poll_table *p) >> @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, >> get_file(filp); >> entry->filp = filp; >> entry->wait_address = wait_address; >> + entry->key = p->key; >> init_waitqueue_func_entry(&entry->wait, pollwake); >> entry->wait.private = pwq; >> add_wait_queue(wait_address, &entry->wait); >> @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) >> if (file) { >> f_op = file->f_op; >> mask = DEFAULT_POLLMASK; >> - if (f_op && f_op->poll) >> + if (f_op && f_op->poll) { >> + if (wait) { >> + wait->key = POLLEX_SET; >> + if (in & bit) >> + wait->key |= POLLIN_SET; >> + if (out & bit) >> + wait->key |= POLLOUT_SET; >> + } >> mask = (*f_op->poll)(file, retval ? NULL : wait); >> + } > > <resizes xterm rather a lot> > > Can we (and should we) avoid all that manipulation of wait->key if > `retval' is zero? 
yes, we could set wait to NULL as soon as retval is incremented. and also do : mask = (*f_op->poll)(file, wait); > >> --- a/include/linux/poll.h >> +++ b/include/linux/poll.h >> @@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_ >> >> typedef struct poll_table_struct { >> poll_queue_proc qproc; >> + unsigned long key; >> } poll_table; >> >> static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p) >> @@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres >> static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc) >> { >> pt->qproc = qproc; >> + pt->key = ~0UL; /* all events enabled */ > > I kind of prefer to use plain old -1 for the all-ones pattern. Because > it always just works, and doesn't send the reviewer off to check if the > type was really u64 or something. > > It's a bit ugly though. > >> } >> >> struct poll_table_entry { >> struct file *filp; >> + unsigned long key; >> wait_queue_t wait; >> wait_queue_head_t *wait_address; >> }; > > -- > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > Please read the FAQ at http://www.tux.org/lkml/ > > ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 7:39 ` Eric Dumazet @ 2009-04-29 8:26 ` Eric Dumazet 0 siblings, 0 replies; 44+ messages in thread From: Eric Dumazet @ 2009-04-29 8:26 UTC (permalink / raw) To: Andrew Morton Cc: linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel, Ingo Molnar Eric Dumazet a écrit : > Andrew Morton a écrit : >> On Tue, 28 Apr 2009 17:06:11 +0200 Eric Dumazet <dada1@cosmosbay.com> wrote: >> >>> [PATCH] poll: Avoid extra wakeups in select/poll >>> >>> After introduction of keyed wakeups Davide Libenzi did on epoll, we >>> are able to avoid spurious wakeups in poll()/select() code too. >>> >>> For example, typical use of poll()/select() is to wait for incoming >>> network frames on many sockets. But TX completion for UDP/TCP >>> frames call sock_wfree() which in turn schedules thread. >>> >>> When scheduled, thread does a full scan of all polled fds and >>> can sleep again, because nothing is really available. If number >>> of fds is large, this cause significant load. >>> >>> This patch makes select()/poll() aware of keyed wakeups and >>> useless wakeups are avoided. This reduces number of context >>> switches by about 50% on some setups, and work performed >>> by sofirq handlers. >>> >> Seems that this is a virtuous patch even though Christoph is struggling >> a bit to test it? >> >>> fs/select.c | 28 +++++++++++++++++++++++++--- >>> include/linux/poll.h | 3 +++ >>> 2 files changed, 28 insertions(+), 3 deletions(-) >>> >>> diff --git a/fs/select.c b/fs/select.c >>> index 0fe0e14..2708187 100644 >>> --- a/fs/select.c >>> +++ b/fs/select.c >>> @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) >>> return table->entry++; >>> } >>> >>> -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >>> +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >>> { >>> struct poll_wqueues *pwq = wait->private; >>> DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); >>> @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >>> return default_wake_function(&dummy_wait, mode, sync, key); >>> } >>> >>> +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >>> +{ >>> + struct poll_table_entry *entry; >>> + >>> + entry = container_of(wait, struct poll_table_entry, wait); >>> + if (key && !((unsigned long)key & entry->key)) >>> + return 0; >>> + return __pollwake(wait, mode, sync, key); >>> +} >>> + >>> /* Add a new entry */ >>> static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, >>> poll_table *p) >>> @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, >>> get_file(filp); >>> entry->filp = filp; >>> entry->wait_address = wait_address; >>> + entry->key = p->key; >>> init_waitqueue_func_entry(&entry->wait, pollwake); >>> entry->wait.private = pwq; >>> add_wait_queue(wait_address, &entry->wait); >>> @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) >>> if (file) { >>> f_op = file->f_op; >>> mask = DEFAULT_POLLMASK; >>> - if (f_op && f_op->poll) >>> + if (f_op && f_op->poll) { >>> + if (wait) { >>> + wait->key = POLLEX_SET; >>> + if (in & bit) >>> + wait->key |= POLLIN_SET; >>> + if (out & bit) >>> + wait->key |= POLLOUT_SET; >>> + } >>> mask = (*f_op->poll)(file, retval ? 
NULL : wait); >>> + } >> <resizes xterm rather a lot> >> >> Can we (and should we) avoid all that manipulation of wait->key if >> `retval' is zero? > > yes, we could set wait to NULL as soon as retval is incremented. > and also do : > > mask = (*f_op->poll)(file, wait); > [PATCH] poll: Avoid extra wakeups in select/poll After introduction of keyed wakeups Davide Libenzi did on epoll, we are able to avoid spurious wakeups in poll()/select() code too. For example, typical use of poll()/select() is to wait for incoming network frames on many sockets. But TX completion for UDP/TCP frames call sock_wfree() which in turn schedules thread. When scheduled, thread does a full scan of all polled fds and can sleep again, because nothing is really available. If number of fds is large, this cause significant load. This patch makes select()/poll() aware of keyed wakeups and useless wakeups are avoided. This reduces number of context switches by about 50% on some setups, and work performed by sofirq handlers. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Andi Kleen <ak@linux.intel.com> --- fs/select.c | 33 +++++++++++++++++++++++++++++---- include/linux/poll.h | 3 +++ 2 files changed, 32 insertions(+), 4 deletions(-) diff --git a/fs/select.c b/fs/select.c index 0fe0e14..71377fd 100644 --- a/fs/select.c +++ b/fs/select.c @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) return table->entry++; } -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) { struct poll_wqueues *pwq = wait->private; DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) return default_wake_function(&dummy_wait, mode, sync, key); } +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) +{ + struct poll_table_entry *entry; + + entry = container_of(wait, struct poll_table_entry, wait); + if (key && !((unsigned long)key & entry->key)) + return 0; + return __pollwake(wait, mode, sync, key); +} + /* Add a new entry */ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p) @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, get_file(filp); entry->filp = filp; entry->wait_address = wait_address; + entry->key = p->key; init_waitqueue_func_entry(&entry->wait, pollwake); entry->wait.private = pwq; add_wait_queue(wait_address, &entry->wait); @@ -418,20 +429,31 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) if (file) { f_op = file->f_op; mask = DEFAULT_POLLMASK; - if (f_op && f_op->poll) - mask = (*f_op->poll)(file, retval ? 
NULL : wait); + if (f_op && f_op->poll) { + if (wait) { + wait->key = POLLEX_SET; + if (in & bit) + wait->key |= POLLIN_SET; + if (out & bit) + wait->key |= POLLOUT_SET; + } + mask = (*f_op->poll)(file, wait); + } fput_light(file, fput_needed); if ((mask & POLLIN_SET) && (in & bit)) { res_in |= bit; retval++; + wait = NULL; } if ((mask & POLLOUT_SET) && (out & bit)) { res_out |= bit; retval++; + wait = NULL; } if ((mask & POLLEX_SET) && (ex & bit)) { res_ex |= bit; retval++; + wait = NULL; } } } @@ -685,8 +707,11 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait) mask = POLLNVAL; if (file != NULL) { mask = DEFAULT_POLLMASK; - if (file->f_op && file->f_op->poll) + if (file->f_op && file->f_op->poll) { + if (pwait) + pwait->key = pollfd->events | POLLERR | POLLHUP; mask = file->f_op->poll(file, pwait); + } /* Mask out unneeded events. */ mask &= pollfd->events | POLLERR | POLLHUP; fput_light(file, fput_needed); diff --git a/include/linux/poll.h b/include/linux/poll.h index 8c24ef8..3327389 100644 --- a/include/linux/poll.h +++ b/include/linux/poll.h @@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_ typedef struct poll_table_struct { poll_queue_proc qproc; + unsigned long key; } poll_table; static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p) @@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc) { pt->qproc = qproc; + pt->key = ~0UL; /* all events enabled */ } struct poll_table_entry { struct file *filp; + unsigned long key; wait_queue_t wait; wait_queue_head_t *wait_address; }; ^ permalink raw reply related [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-28 15:06 ` [PATCH] poll: Avoid extra wakeups in select/poll Eric Dumazet 2009-04-28 19:05 ` Christoph Lameter 2009-04-29 7:20 ` Andrew Morton @ 2009-04-29 9:16 ` Ingo Molnar 2009-04-29 9:36 ` Eric Dumazet 2 siblings, 1 reply; 44+ messages in thread From: Ingo Molnar @ 2009-04-29 9:16 UTC (permalink / raw) To: Eric Dumazet Cc: linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel * Eric Dumazet <dada1@cosmosbay.com> wrote: > @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) > if (file) { > f_op = file->f_op; > mask = DEFAULT_POLLMASK; > - if (f_op && f_op->poll) > + if (f_op && f_op->poll) { > + if (wait) { > + wait->key = POLLEX_SET; > + if (in & bit) > + wait->key |= POLLIN_SET; > + if (out & bit) > + wait->key |= POLLOUT_SET; > + } > mask = (*f_op->poll)(file, retval ? NULL : wait); > + } > fput_light(file, fput_needed); > if ((mask & POLLIN_SET) && (in & bit)) { > res_in |= bit; Please factor this whole 'if (file)' branch out into a helper. Typical indentation levels go from 1 to 3 tabs - 4 should be avoided if possible and 5 is pretty excessive already. This goes to eight. Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 9:16 ` Ingo Molnar @ 2009-04-29 9:36 ` Eric Dumazet 2009-04-29 10:27 ` Ingo Molnar 0 siblings, 1 reply; 44+ messages in thread From: Eric Dumazet @ 2009-04-29 9:36 UTC (permalink / raw) To: Ingo Molnar Cc: linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel Ingo Molnar a écrit : > * Eric Dumazet <dada1@cosmosbay.com> wrote: > >> @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) >> if (file) { >> f_op = file->f_op; >> mask = DEFAULT_POLLMASK; >> - if (f_op && f_op->poll) >> + if (f_op && f_op->poll) { >> + if (wait) { >> + wait->key = POLLEX_SET; >> + if (in & bit) >> + wait->key |= POLLIN_SET; >> + if (out & bit) >> + wait->key |= POLLOUT_SET; >> + } >> mask = (*f_op->poll)(file, retval ? NULL : wait); >> + } >> fput_light(file, fput_needed); >> if ((mask & POLLIN_SET) && (in & bit)) { >> res_in |= bit; > > Please factor this whole 'if (file)' branch out into a helper. > Typical indentation levels go from 1 to 3 tabs - 4 should be avoided > if possible and 5 is pretty excessive already. This goes to eight. > Thanks Ingo, Here is v3 of patch, with your Acked-by included :) This is IMHO clearer since helper immediatly follows POLLIN_SET / POLLOUT_SET / POLLEX_SET defines. [PATCH] poll: Avoid extra wakeups in select/poll After introduction of keyed wakeups Davide Libenzi did on epoll, we are able to avoid spurious wakeups in poll()/select() code too. For example, typical use of poll()/select() is to wait for incoming network frames on many sockets. But TX completion for UDP/TCP frames call sock_wfree() which in turn schedules thread. When scheduled, thread does a full scan of all polled fds and can sleep again, because nothing is really available. If number of fds is large, this cause significant load. This patch makes select()/poll() aware of keyed wakeups and useless wakeups are avoided. This reduces number of context switches by about 50% on some setups, and work performed by sofirq handlers. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: David S. 
Miller <davem@davemloft.net> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> --- fs/select.c | 40 ++++++++++++++++++++++++++++++++++++---- include/linux/poll.h | 3 +++ 2 files changed, 39 insertions(+), 4 deletions(-) diff --git a/fs/select.c b/fs/select.c index 0fe0e14..ba068ad 100644 --- a/fs/select.c +++ b/fs/select.c @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) return table->entry++; } -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) { struct poll_wqueues *pwq = wait->private; DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) return default_wake_function(&dummy_wait, mode, sync, key); } +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) +{ + struct poll_table_entry *entry; + + entry = container_of(wait, struct poll_table_entry, wait); + if (key && !((unsigned long)key & entry->key)) + return 0; + return __pollwake(wait, mode, sync, key); +} + /* Add a new entry */ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p) @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, get_file(filp); entry->filp = filp; entry->wait_address = wait_address; + entry->key = p->key; init_waitqueue_func_entry(&entry->wait, pollwake); entry->wait.private = pwq; add_wait_queue(wait_address, &entry->wait); @@ -362,6 +373,18 @@ get_max: #define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR) #define POLLEX_SET (POLLPRI) +static void wait_key_set(poll_table *wait, unsigned long in, + unsigned long out, unsigned long bit) +{ + if (wait) { + wait->key = POLLEX_SET; + if (in & bit) + wait->key |= POLLIN_SET; + if (out & bit) + wait->key |= POLLOUT_SET; + } +} + int do_select(int n, fd_set_bits *fds, struct timespec *end_time) { ktime_t expire, *to = NULL; @@ -418,20 +441,25 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) if (file) { f_op = file->f_op; mask = DEFAULT_POLLMASK; - if (f_op && f_op->poll) - mask = (*f_op->poll)(file, retval ? NULL : wait); + if (f_op && f_op->poll) { + wait_key_set(wait, in, out, bit); + mask = (*f_op->poll)(file, wait); + } fput_light(file, fput_needed); if ((mask & POLLIN_SET) && (in & bit)) { res_in |= bit; retval++; + wait = NULL; } if ((mask & POLLOUT_SET) && (out & bit)) { res_out |= bit; retval++; + wait = NULL; } if ((mask & POLLEX_SET) && (ex & bit)) { res_ex |= bit; retval++; + wait = NULL; } } } @@ -685,8 +713,12 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait) mask = POLLNVAL; if (file != NULL) { mask = DEFAULT_POLLMASK; - if (file->f_op && file->f_op->poll) + if (file->f_op && file->f_op->poll) { + if (pwait) + pwait->key = pollfd->events | + POLLERR | POLLHUP; mask = file->f_op->poll(file, pwait); + } /* Mask out unneeded events. 
*/ mask &= pollfd->events | POLLERR | POLLHUP; fput_light(file, fput_needed); diff --git a/include/linux/poll.h b/include/linux/poll.h index 8c24ef8..3327389 100644 --- a/include/linux/poll.h +++ b/include/linux/poll.h @@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_ typedef struct poll_table_struct { poll_queue_proc qproc; + unsigned long key; } poll_table; static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p) @@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc) { pt->qproc = qproc; + pt->key = ~0UL; /* all events enabled */ } struct poll_table_entry { struct file *filp; + unsigned long key; wait_queue_t wait; wait_queue_head_t *wait_address; }; ^ permalink raw reply related [flat|nested] 44+ messages in thread
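[For completeness, a hedged sketch of one way to look for the patch's effect from user space; this program is not part of the thread, and the loopback destination and payload size are placeholders. A transmit thread keeps TX completions coming on a socket while the main thread select()s on the same socket for readability only. Before the patch each sock_wfree() wakes the selecting thread, which shows up as voluntary context switches in getrusage(); with keyed wakeups the waiter's key carries only the POLLIN-class bits, so those wakeups should largely disappear. Whether the difference is visible depends on the driver and path (loopback is only a stand-in). Build with -pthread.]

#define _GNU_SOURCE
#include <arpa/inet.h>
#include <netinet/in.h>
#include <pthread.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

static int sock;

static void *tx_loop(void *unused)
{
	struct sockaddr_in dst = { .sin_family = AF_INET, .sin_port = htons(9) };
	char payload[40] = { 0 };

	dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK);	/* placeholder destination */
	for (;;)					/* keep TX completions coming */
		sendto(sock, payload, sizeof(payload), 0,
		       (struct sockaddr *)&dst, sizeof(dst));
	return unused;
}

int main(void)
{
	struct timeval tv = { .tv_sec = 10 };
	struct rusage ru;
	pthread_t tid;
	fd_set rfds;

	sock = socket(AF_INET, SOCK_DGRAM, 0);
	pthread_create(&tid, NULL, tx_loop, NULL);

	FD_ZERO(&rfds);
	FD_SET(sock, &rfds);
	/* wait for incoming data that never arrives; only TX happens on this socket */
	select(sock + 1, &rfds, NULL, NULL, &tv);

	/* RUSAGE_SELF counts both threads; a Linux-specific RUSAGE_THREAD would
	 * isolate the selecting thread where available */
	getrusage(RUSAGE_SELF, &ru);
	printf("voluntary context switches over 10s: %ld\n", ru.ru_nvcsw);
	return 0;
}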
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 9:36 ` Eric Dumazet @ 2009-04-29 10:27 ` Ingo Molnar 2009-04-29 12:29 ` Eric Dumazet 0 siblings, 1 reply; 44+ messages in thread From: Ingo Molnar @ 2009-04-29 10:27 UTC (permalink / raw) To: Eric Dumazet Cc: linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel * Eric Dumazet <dada1@cosmosbay.com> wrote: > Ingo Molnar a écrit : > > * Eric Dumazet <dada1@cosmosbay.com> wrote: > > > >> @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) > >> if (file) { > >> f_op = file->f_op; > >> mask = DEFAULT_POLLMASK; > >> - if (f_op && f_op->poll) > >> + if (f_op && f_op->poll) { > >> + if (wait) { > >> + wait->key = POLLEX_SET; > >> + if (in & bit) > >> + wait->key |= POLLIN_SET; > >> + if (out & bit) > >> + wait->key |= POLLOUT_SET; > >> + } > >> mask = (*f_op->poll)(file, retval ? NULL : wait); > >> + } > >> fput_light(file, fput_needed); > >> if ((mask & POLLIN_SET) && (in & bit)) { > >> res_in |= bit; > > > > Please factor this whole 'if (file)' branch out into a helper. > > Typical indentation levels go from 1 to 3 tabs - 4 should be avoided > > if possible and 5 is pretty excessive already. This goes to eight. > > > > Thanks Ingo, > > Here is v3 of patch, with your Acked-by included :) > > This is IMHO clearer since helper immediatly follows POLLIN_SET / POLLOUT_SET / > POLLEX_SET defines. > > [PATCH] poll: Avoid extra wakeups in select/poll > > After introduction of keyed wakeups Davide Libenzi did on epoll, we > are able to avoid spurious wakeups in poll()/select() code too. > > For example, typical use of poll()/select() is to wait for incoming > network frames on many sockets. But TX completion for UDP/TCP > frames call sock_wfree() which in turn schedules thread. > > When scheduled, thread does a full scan of all polled fds and > can sleep again, because nothing is really available. If number > of fds is large, this cause significant load. > > This patch makes select()/poll() aware of keyed wakeups and > useless wakeups are avoided. This reduces number of context > switches by about 50% on some setups, and work performed > by sofirq handlers. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > Acked-by: David S. 
Miller <davem@davemloft.net> > Acked-by: Andi Kleen <ak@linux.intel.com> > Acked-by: Ingo Molnar <mingo@elte.hu> > --- > fs/select.c | 40 ++++++++++++++++++++++++++++++++++++---- > include/linux/poll.h | 3 +++ > 2 files changed, 39 insertions(+), 4 deletions(-) > > diff --git a/fs/select.c b/fs/select.c > index 0fe0e14..ba068ad 100644 > --- a/fs/select.c > +++ b/fs/select.c > @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) > return table->entry++; > } > > -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) > +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) > { > struct poll_wqueues *pwq = wait->private; > DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); > @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) > return default_wake_function(&dummy_wait, mode, sync, key); > } > > +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) > +{ > + struct poll_table_entry *entry; > + > + entry = container_of(wait, struct poll_table_entry, wait); > + if (key && !((unsigned long)key & entry->key)) > + return 0; > + return __pollwake(wait, mode, sync, key); > +} > + > /* Add a new entry */ > static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, > poll_table *p) > @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, > get_file(filp); > entry->filp = filp; > entry->wait_address = wait_address; > + entry->key = p->key; > init_waitqueue_func_entry(&entry->wait, pollwake); > entry->wait.private = pwq; > add_wait_queue(wait_address, &entry->wait); > @@ -362,6 +373,18 @@ get_max: > #define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR) > #define POLLEX_SET (POLLPRI) > > +static void wait_key_set(poll_table *wait, unsigned long in, > + unsigned long out, unsigned long bit) > +{ > + if (wait) { > + wait->key = POLLEX_SET; > + if (in & bit) > + wait->key |= POLLIN_SET; > + if (out & bit) > + wait->key |= POLLOUT_SET; > + } > +} should be inline perhaps? > + > int do_select(int n, fd_set_bits *fds, struct timespec *end_time) > { > ktime_t expire, *to = NULL; > @@ -418,20 +441,25 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) > if (file) { > f_op = file->f_op; > mask = DEFAULT_POLLMASK; > - if (f_op && f_op->poll) > - mask = (*f_op->poll)(file, retval ? NULL : wait); > + if (f_op && f_op->poll) { > + wait_key_set(wait, in, out, bit); > + mask = (*f_op->poll)(file, wait); > + } > fput_light(file, fput_needed); > if ((mask & POLLIN_SET) && (in & bit)) { > res_in |= bit; > retval++; > + wait = NULL; > } > if ((mask & POLLOUT_SET) && (out & bit)) { > res_out |= bit; > retval++; > + wait = NULL; > } > if ((mask & POLLEX_SET) && (ex & bit)) { > res_ex |= bit; > retval++; > + wait = NULL; > } > } > } Looks much nicer now! [ I'd still suggest to factor out the guts of do_select() as its nesting is excessive that hurts its reviewability quite a bit - but now your patch does not make the situation any worse. ] Even-More-Acked-by: Ingo Molnar <mingo@elte.hu> Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 10:27 ` Ingo Molnar @ 2009-04-29 12:29 ` Eric Dumazet 2009-04-29 13:07 ` Ingo Molnar 2009-04-29 15:53 ` Davide Libenzi 0 siblings, 2 replies; 44+ messages in thread From: Eric Dumazet @ 2009-04-29 12:29 UTC (permalink / raw) To: Ingo Molnar Cc: linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel Ingo Molnar a écrit : > * Eric Dumazet <dada1@cosmosbay.com> wrote: > >> Ingo Molnar a écrit : >>> * Eric Dumazet <dada1@cosmosbay.com> wrote: >>> >>>> @@ -418,8 +429,16 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) >>>> if (file) { >>>> f_op = file->f_op; >>>> mask = DEFAULT_POLLMASK; >>>> - if (f_op && f_op->poll) >>>> + if (f_op && f_op->poll) { >>>> + if (wait) { >>>> + wait->key = POLLEX_SET; >>>> + if (in & bit) >>>> + wait->key |= POLLIN_SET; >>>> + if (out & bit) >>>> + wait->key |= POLLOUT_SET; >>>> + } >>>> mask = (*f_op->poll)(file, retval ? NULL : wait); >>>> + } >>>> fput_light(file, fput_needed); >>>> if ((mask & POLLIN_SET) && (in & bit)) { >>>> res_in |= bit; >>> Please factor this whole 'if (file)' branch out into a helper. >>> Typical indentation levels go from 1 to 3 tabs - 4 should be avoided >>> if possible and 5 is pretty excessive already. This goes to eight. >>> >> Thanks Ingo, >> >> Here is v3 of patch, with your Acked-by included :) >> >> This is IMHO clearer since helper immediatly follows POLLIN_SET / POLLOUT_SET / >> POLLEX_SET defines. >> >> [PATCH] poll: Avoid extra wakeups in select/poll >> >> After introduction of keyed wakeups Davide Libenzi did on epoll, we >> are able to avoid spurious wakeups in poll()/select() code too. >> >> For example, typical use of poll()/select() is to wait for incoming >> network frames on many sockets. But TX completion for UDP/TCP >> frames call sock_wfree() which in turn schedules thread. >> >> When scheduled, thread does a full scan of all polled fds and >> can sleep again, because nothing is really available. If number >> of fds is large, this cause significant load. >> >> This patch makes select()/poll() aware of keyed wakeups and >> useless wakeups are avoided. This reduces number of context >> switches by about 50% on some setups, and work performed >> by sofirq handlers. >> >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >> Acked-by: David S. 
Miller <davem@davemloft.net> >> Acked-by: Andi Kleen <ak@linux.intel.com> >> Acked-by: Ingo Molnar <mingo@elte.hu> > >> --- >> fs/select.c | 40 ++++++++++++++++++++++++++++++++++++---- >> include/linux/poll.h | 3 +++ >> 2 files changed, 39 insertions(+), 4 deletions(-) >> >> diff --git a/fs/select.c b/fs/select.c >> index 0fe0e14..ba068ad 100644 >> --- a/fs/select.c >> +++ b/fs/select.c >> @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) >> return table->entry++; >> } >> >> -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >> +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >> { >> struct poll_wqueues *pwq = wait->private; >> DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); >> @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >> return default_wake_function(&dummy_wait, mode, sync, key); >> } >> >> +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) >> +{ >> + struct poll_table_entry *entry; >> + >> + entry = container_of(wait, struct poll_table_entry, wait); >> + if (key && !((unsigned long)key & entry->key)) >> + return 0; >> + return __pollwake(wait, mode, sync, key); >> +} >> + >> /* Add a new entry */ >> static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, >> poll_table *p) >> @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, >> get_file(filp); >> entry->filp = filp; >> entry->wait_address = wait_address; >> + entry->key = p->key; >> init_waitqueue_func_entry(&entry->wait, pollwake); >> entry->wait.private = pwq; >> add_wait_queue(wait_address, &entry->wait); >> @@ -362,6 +373,18 @@ get_max: >> #define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR) >> #define POLLEX_SET (POLLPRI) >> >> +static void wait_key_set(poll_table *wait, unsigned long in, >> + unsigned long out, unsigned long bit) >> +{ >> + if (wait) { >> + wait->key = POLLEX_SET; >> + if (in & bit) >> + wait->key |= POLLIN_SET; >> + if (out & bit) >> + wait->key |= POLLOUT_SET; >> + } >> +} > > should be inline perhaps? Well, I thought current practice was not using inline for such trivial functions, as gcc already inlines them anyway. Quoting Documentation/CodingStyle : Often people argue that adding inline to functions that are static and used only once is always a win since there is no space tradeoff. While this is technically correct, gcc is capable of inlining these automatically without help, and the maintenance issue of removing the inline when a second user appears outweighs the potential value of the hint that tells gcc to do something it would have done anyway. Anyway : [PATCH] poll: Avoid extra wakeups in select/poll After introduction of keyed wakeups Davide Libenzi did on epoll, we are able to avoid spurious wakeups in poll()/select() code too. For example, typical use of poll()/select() is to wait for incoming network frames on many sockets. But TX completion for UDP/TCP frames call sock_wfree() which in turn schedules thread. When scheduled, thread does a full scan of all polled fds and can sleep again, because nothing is really available. If number of fds is large, this cause significant load. This patch makes select()/poll() aware of keyed wakeups and useless wakeups are avoided. This reduces number of context switches by about 50% on some setups, and work performed by sofirq handlers. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: David S. 
Miller <davem@davemloft.net> Acked-by: Andi Kleen <ak@linux.intel.com> Acked-by: Ingo Molnar <mingo@elte.hu> --- fs/select.c | 40 ++++++++++++++++++++++++++++++++++++---- include/linux/poll.h | 3 +++ 2 files changed, 39 insertions(+), 4 deletions(-) diff --git a/fs/select.c b/fs/select.c index 0fe0e14..ba068ad 100644 --- a/fs/select.c +++ b/fs/select.c @@ -168,7 +168,7 @@ static struct poll_table_entry *poll_get_entry(struct poll_wqueues *p) return table->entry++; } -static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) +static int __pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) { struct poll_wqueues *pwq = wait->private; DECLARE_WAITQUEUE(dummy_wait, pwq->polling_task); @@ -194,6 +194,16 @@ static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) return default_wake_function(&dummy_wait, mode, sync, key); } +static int pollwake(wait_queue_t *wait, unsigned mode, int sync, void *key) +{ + struct poll_table_entry *entry; + + entry = container_of(wait, struct poll_table_entry, wait); + if (key && !((unsigned long)key & entry->key)) + return 0; + return __pollwake(wait, mode, sync, key); +} + /* Add a new entry */ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, poll_table *p) @@ -205,6 +215,7 @@ static void __pollwait(struct file *filp, wait_queue_head_t *wait_address, get_file(filp); entry->filp = filp; entry->wait_address = wait_address; + entry->key = p->key; init_waitqueue_func_entry(&entry->wait, pollwake); entry->wait.private = pwq; add_wait_queue(wait_address, &entry->wait); @@ -362,6 +373,18 @@ get_max: #define POLLOUT_SET (POLLWRBAND | POLLWRNORM | POLLOUT | POLLERR) #define POLLEX_SET (POLLPRI) +static inline void wait_key_set(poll_table *wait, unsigned long in, + unsigned long out, unsigned long bit) +{ + if (wait) { + wait->key = POLLEX_SET; + if (in & bit) + wait->key |= POLLIN_SET; + if (out & bit) + wait->key |= POLLOUT_SET; + } +} + int do_select(int n, fd_set_bits *fds, struct timespec *end_time) { ktime_t expire, *to = NULL; @@ -418,20 +441,25 @@ int do_select(int n, fd_set_bits *fds, struct timespec *end_time) if (file) { f_op = file->f_op; mask = DEFAULT_POLLMASK; - if (f_op && f_op->poll) - mask = (*f_op->poll)(file, retval ? NULL : wait); + if (f_op && f_op->poll) { + wait_key_set(wait, in, out, bit); + mask = (*f_op->poll)(file, wait); + } fput_light(file, fput_needed); if ((mask & POLLIN_SET) && (in & bit)) { res_in |= bit; retval++; + wait = NULL; } if ((mask & POLLOUT_SET) && (out & bit)) { res_out |= bit; retval++; + wait = NULL; } if ((mask & POLLEX_SET) && (ex & bit)) { res_ex |= bit; retval++; + wait = NULL; } } } @@ -685,8 +713,12 @@ static inline unsigned int do_pollfd(struct pollfd *pollfd, poll_table *pwait) mask = POLLNVAL; if (file != NULL) { mask = DEFAULT_POLLMASK; - if (file->f_op && file->f_op->poll) + if (file->f_op && file->f_op->poll) { + if (pwait) + pwait->key = pollfd->events | + POLLERR | POLLHUP; mask = file->f_op->poll(file, pwait); + } /* Mask out unneeded events. 
*/ mask &= pollfd->events | POLLERR | POLLHUP; fput_light(file, fput_needed); diff --git a/include/linux/poll.h b/include/linux/poll.h index 8c24ef8..3327389 100644 --- a/include/linux/poll.h +++ b/include/linux/poll.h @@ -32,6 +32,7 @@ typedef void (*poll_queue_proc)(struct file *, wait_queue_head_t *, struct poll_ typedef struct poll_table_struct { poll_queue_proc qproc; + unsigned long key; } poll_table; static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_address, poll_table *p) @@ -43,10 +44,12 @@ static inline void poll_wait(struct file * filp, wait_queue_head_t * wait_addres static inline void init_poll_funcptr(poll_table *pt, poll_queue_proc qproc) { pt->qproc = qproc; + pt->key = ~0UL; /* all events enabled */ } struct poll_table_entry { struct file *filp; + unsigned long key; wait_queue_t wait; wait_queue_head_t *wait_address; }; ^ permalink raw reply related [flat|nested] 44+ messages in thread
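Another aside, not from the thread: a rough way to watch from user space for the spurious wakeups this patch targets. The sketch below is not the benchmark Eric used; the file name, the loopback ports and the 5-second window are arbitrary choices, and the absolute count depends on kernel version and timing. The idea is only that a process blocked in select() waiting for input on a UDP socket should not rack up voluntary context switches merely because the same socket is transmitting.

/* select_wakeups.c - count voluntary context switches of a process that is
 * blocked in select() on a UDP socket while a child keeps transmitting
 * from that same socket (so TX completions hit the socket's wait queue).
 * Build: gcc -Wall -o select_wakeups select_wakeups.c
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <signal.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <netinet/in.h>

int main(void)
{
	struct sockaddr_in me = { 0 }, sink;
	struct rusage ru0, ru1;
	struct timeval tv = { 5, 0 };
	char payload[40] = "ping";
	fd_set rfds;
	pid_t child;
	int fd;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	me.sin_family = AF_INET;
	me.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	me.sin_port = htons(20001);
	if (fd < 0 || bind(fd, (struct sockaddr *)&me, sizeof(me)) < 0) {
		perror("socket/bind");
		return 1;
	}

	/* destination nobody listens on: every frame is dropped quickly,
	 * but each send still ends with sock_wfree() on our socket */
	sink = me;
	sink.sin_port = htons(20002);

	child = fork();
	if (child == 0) {		/* sender: keep TX completions coming */
		for (;;)
			sendto(fd, payload, sizeof(payload), 0,
			       (struct sockaddr *)&sink, sizeof(sink));
	}

	getrusage(RUSAGE_SELF, &ru0);
	FD_ZERO(&rfds);
	FD_SET(fd, &rfds);
	/* nothing ever arrives on fd, so this should simply sleep for 5s;
	 * extra voluntary context switches here are spurious wakeups */
	select(fd + 1, &rfds, NULL, NULL, &tv);
	getrusage(RUSAGE_SELF, &ru1);

	printf("voluntary context switches during select(): %ld\n",
	       ru1.ru_nvcsw - ru0.ru_nvcsw);

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}

If the analysis in this thread applies to the kernel under test, an unpatched kernel lets the child's transmit completions wake the sleeping parent over and over, while a patched one leaves it asleep until the timeout; either way the printed number is only indicative.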
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 12:29 ` Eric Dumazet @ 2009-04-29 13:07 ` Ingo Molnar 2009-04-29 15:53 ` Davide Libenzi 1 sibling, 0 replies; 44+ messages in thread From: Ingo Molnar @ 2009-04-29 13:07 UTC (permalink / raw) To: Eric Dumazet Cc: linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan, davidel * Eric Dumazet <dada1@cosmosbay.com> wrote: > > should be inline perhaps? > > Well, I thought current practice was not using inline for such > trivial functions, as gcc already inlines them anyway. ok. how about: > > [ I'd still suggest to factor out the guts of do_select() as > > its nesting is excessive that hurts its reviewability quite a > > bit - but now your patch does not make the situation any > > worse. ] We tend to shop for drive-by cleanups in visibly ugly code whenever someone wants to touch that code. Could go into a separate patch. Ingo ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] poll: Avoid extra wakeups in select/poll 2009-04-29 12:29 ` Eric Dumazet 2009-04-29 13:07 ` Ingo Molnar @ 2009-04-29 15:53 ` Davide Libenzi 1 sibling, 0 replies; 44+ messages in thread From: Davide Libenzi @ 2009-04-29 15:53 UTC (permalink / raw) To: Eric Dumazet Cc: Ingo Molnar, linux kernel, Andi Kleen, David Miller, cl, jesse.brandeburg, netdev, haoki, mchan On Wed, 29 Apr 2009, Eric Dumazet wrote: > [PATCH] poll: Avoid extra wakeups in select/poll > > After introduction of keyed wakeups Davide Libenzi did on epoll, we > are able to avoid spurious wakeups in poll()/select() code too. > > For example, typical use of poll()/select() is to wait for incoming > network frames on many sockets. But TX completion for UDP/TCP > frames call sock_wfree() which in turn schedules thread. > > When scheduled, thread does a full scan of all polled fds and > can sleep again, because nothing is really available. If number > of fds is large, this cause significant load. > > This patch makes select()/poll() aware of keyed wakeups and > useless wakeups are avoided. This reduces number of context > switches by about 50% on some setups, and work performed > by sofirq handlers. Looks fine to me Eric ... Acked-by: Davide Libenzi <davidel@xmailserver.org> - Davide ^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [PATCH] net: Avoid extra wakeups of threads blocked in wait_for_packet() 2009-04-25 15:47 ` [PATCH] net: Avoid extra wakeups of threads blocked in wait_for_packet() Eric Dumazet 2009-04-26 9:04 ` David Miller @ 2009-04-28 9:26 ` David Miller 1 sibling, 0 replies; 44+ messages in thread From: David Miller @ 2009-04-28 9:26 UTC (permalink / raw) To: dada1; +Cc: cl, jesse.brandeburg, netdev, haoki, mchan, davidel From: Eric Dumazet <dada1@cosmosbay.com> Date: Sat, 25 Apr 2009 17:47:23 +0200 > In 2.6.25 we added UDP mem accounting. > > This unfortunately added a penalty when a frame is transmitted, since > we have at TX completion time to call sock_wfree() to perform necessary > memory accounting. This calls sock_def_write_space() and ultimately > the scheduler if any thread is waiting on the socket. > Thread(s) waiting for an incoming frame were scheduled, then had to sleep > again as the event was meaningless. ... > This patch introduces a new DEFINE_WAIT_FUNC() helper and uses it > in wait_for_packet(), so that only a relevant event can wake up a thread > blocked in this function. Ok, I was going to give some time towards considering the alternative implementation of using 2 wait queues and what it would look like. It didn't take long for me to figure out that this is so much simpler that it's not even worth trying the dual wait queue approach. So I've applied this to net-2.6, thanks! Now, if we want to fix this up in -stable we'll need to scratch our heads if we can't get the keyed wakeup patch in too. :-/ ^ permalink raw reply [flat|nested] 44+ messages in thread
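One more aside, not part of the thread: for readers who want to reproduce the kind of round-trip measurement this work started from, here is a minimal UDP ping-pong sketch. It is not Christoph's udpping tool (http://gentwo.org/ll); it runs the echo peer as a forked child over loopback, so it exercises the recvfrom() path that wait_for_packet() implements but says nothing about NIC or wire latency. The ports, the 40-byte message size and the round count are arbitrary.

/* udp_pingpong.c - minimal UDP round-trip timing over loopback.
 * Build: gcc -Wall -O2 -o udp_pingpong udp_pingpong.c  (add -lrt on old glibc)
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <time.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <netinet/in.h>

#define PORT_SRV 20010
#define PORT_CLI 20011
#define ROUNDS   100000
#define MSGLEN   40

static int bound_socket(unsigned short port)
{
	struct sockaddr_in a = { 0 };
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	a.sin_family = AF_INET;
	a.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	a.sin_port = htons(port);
	if (fd < 0 || bind(fd, (struct sockaddr *)&a, sizeof(a)) < 0) {
		perror("socket/bind");
		exit(1);
	}
	return fd;
}

int main(void)
{
	char buf[MSGLEN] = { 0 };
	struct sockaddr_in peer = { 0 };
	struct timespec t0, t1;
	double usecs;
	pid_t child;
	int i, fd;

	child = fork();
	if (child == 0) {			/* echo peer */
		fd = bound_socket(PORT_SRV);
		for (;;) {
			struct sockaddr_in from;
			socklen_t flen = sizeof(from);
			ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
					     (struct sockaddr *)&from, &flen);
			if (n > 0)
				sendto(fd, buf, n, 0,
				       (struct sockaddr *)&from, flen);
		}
	}

	fd = bound_socket(PORT_CLI);
	peer.sin_family = AF_INET;
	peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	peer.sin_port = htons(PORT_SRV);

	usleep(100000);				/* crude: let the child bind first */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < ROUNDS; i++) {
		sendto(fd, buf, MSGLEN, 0,
		       (struct sockaddr *)&peer, sizeof(peer));
		/* blocks in wait_for_packet() until the echo comes back */
		recvfrom(fd, buf, sizeof(buf), 0, NULL, NULL);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);

	usecs = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_nsec - t0.tv_nsec) / 1e3;
	printf("%d rounds, %.2f usec per round trip\n", ROUNDS, usecs / ROUNDS);

	kill(child, SIGKILL);
	waitpid(child, NULL, 0);
	return 0;
}

Pinning the two processes to chosen CPUs (sched_setaffinity() or taskset) before the timed loop is what turns this into the same-CPU / same-socket / different-socket comparisons the thread is built around.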
end of thread, other threads:[~2009-05-04 10:40 UTC | newest]

Thread overview: 44+ messages
2009-04-24 20:10 udp ping pong with various process bindings (and correct cpu mappings) Christoph Lameter
2009-04-24 21:18 ` Eric Dumazet
2009-04-25 15:47 ` [PATCH] net: Avoid extra wakeups of threads blocked in wait_for_packet() Eric Dumazet
2009-04-26 9:04 ` David Miller
2009-04-26 10:46 ` [PATCH] poll: Avoid extra wakeups Eric Dumazet
2009-04-26 13:33 ` Jarek Poplawski
2009-04-26 14:27 ` Eric Dumazet
2009-04-28 9:15 ` David Miller
2009-04-28 9:24 ` Eric Dumazet
2009-04-28 14:21 ` Andi Kleen
2009-04-28 14:58 ` Eric Dumazet
2009-04-28 15:06 ` [PATCH] poll: Avoid extra wakeups in select/poll Eric Dumazet
2009-04-28 19:05 ` Christoph Lameter
2009-04-28 20:05 ` Eric Dumazet
2009-04-28 20:14 ` Christoph Lameter
2009-04-28 20:33 ` Eric Dumazet
2009-04-28 20:49 ` Christoph Lameter
2009-04-28 21:04 ` Eric Dumazet
2009-04-28 21:00 ` Christoph Lameter
2009-04-28 21:05 ` Eric Dumazet
2009-04-28 21:04 ` Christoph Lameter
2009-04-28 21:11 ` Eric Dumazet
2009-04-29 9:11 ` Ingo Molnar
2009-04-30 10:49 ` Eric Dumazet
2009-04-30 11:57 ` Ingo Molnar
2009-04-30 14:08 ` Eric Dumazet
2009-04-30 16:07 ` [BUG] perf_counter: change cpu frequencies Eric Dumazet
2009-05-03 6:06 ` Eric Dumazet
2009-05-03 7:25 ` Ingo Molnar
2009-05-04 10:39 ` Eric Dumazet
2009-04-30 21:24 ` [PATCH] poll: Avoid extra wakeups in select/poll Paul E. McKenney
2009-04-29 7:20 ` Andrew Morton
2009-04-29 7:35 ` Andi Kleen
2009-04-29 7:37 ` Eric Dumazet
2009-04-29 9:22 ` Ingo Molnar
2009-04-29 7:39 ` Eric Dumazet
2009-04-29 8:26 ` Eric Dumazet
2009-04-29 9:16 ` Ingo Molnar
2009-04-29 9:36 ` Eric Dumazet
2009-04-29 10:27 ` Ingo Molnar
2009-04-29 12:29 ` Eric Dumazet
2009-04-29 13:07 ` Ingo Molnar
2009-04-29 15:53 ` Davide Libenzi
2009-04-28 9:26 ` [PATCH] net: Avoid extra wakeups of threads blocked in wait_for_packet() David Miller