* Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
  2008-06-11  6:46 ` Herbert Xu
@ 2008-06-11  9:16   ` David Miller
  2008-06-12 14:05     ` Herbert Xu
  0 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2008-06-11  9:16 UTC (permalink / raw)
  To: herbert; +Cc: GHaskins, PMullaney, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Wed, 11 Jun 2008 16:46:12 +1000

> Anyway, you've lost me with this scenario.  The patch is stopping
> wake-ups on the write path yet you're talking about the read path.
> So my question is why are you getting the extra wake-up on that
> write path when you receive a packet?

During a short discussion on IRC with Herbert I explained
that indeed it's the write wakeup path that's causing the
issue here and we both agreed that the cost is likely
simply from the sk->sk_callback_lock and not the wakeups
themselves.  Once the first wakeup happens, the wait
queues should be inactive and the task awake already.

The sk->sk_callback_lock is a pile of mud that has existed since
around 1999 when we initially SMP threaded the networking stack.
It's long overdue to kill this thing, because for the most part it
is superfluous.

So I just sat for a bit and compiled some notes about the situation
and how we can move forward on this.  If I get some energy I will
attack this task some time soon:

--------------------

Deletion of sk->sk_callback_lock.

This lock is overloaded tremendously to protect several aspects of
socket state, most of which are superfluous and can be eliminated.

One aspect protected by this lock is the parent socket to child sock
relationship.  It makes sure that both the waitqueue and socket
pointer are updated atomically w.r.t. threads executing the sock
callbacks such as sk->sk_write_space().

A separate lock was needed for this because socket callbacks can be
invoked asynchronously in softirq context via kfree_skb(), and
therefore the socket lock would not be suitable for this purpose.

Basically, for this usage, all the callback cares about is that the
sk->sk_socket and sk->sk_sleep point to stable items.  Fortunately we
can ensure this by simply never letting these two things point to
NULL.  Instead we can point them at dummy waitqueue and socket objects
when there is no real parent currently associated with the sock.
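
Roughly, the sketch below shows the idea.  It is illustrative only: the
sock_dummy_* names are made up here, memory-ordering details are glossed
over, and nothing like this has been posted as a patch.

#include <linux/net.h>
#include <linux/wait.h>
#include <net/sock.h>

/* Hypothetical parking objects: never NULL, never used for real work. */
static DECLARE_WAIT_QUEUE_HEAD(sock_dummy_wq);
static struct socket sock_dummy_socket;

static void sock_detach_parent_sketch(struct sock *sk)
{
	/* Instead of NULLing the pointers, park them on the dummies so
	 * that callbacks such as sk->sk_write_space() can dereference
	 * sk->sk_socket and sk->sk_sleep without holding
	 * sk->sk_callback_lock. */
	sk->sk_socket = &sock_dummy_socket;
	sk->sk_sleep  = &sock_dummy_wq;
}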

This would allow large amounts of callback code to run lockless.  For
example, sock_def_write_space() would mostly not need the
sk_callback_lock held, the exception being sk_wake_async()
processing.
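
As a rough illustration, modelled loosely on the current
sock_def_write_space() but assuming the dummy objects sketched above, so
there are no NULL checks and no reader lock.  Sketch only, not a patch:

#include <net/sock.h>

static void sock_def_write_space_sketch(struct sock *sk)
{
	/* Wake writers only once "significant" space is available:
	 * at least half of the send buffer free. */
	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
		if (waitqueue_active(sk->sk_sleep))
			wake_up_interruptible_sync(sk->sk_sleep);

		/* Only the SIGIO/fasync signalling below still wants
		 * some form of protection; see sk_wake_async(). */
		if (sock_writeable(sk))
			sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
	}
}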

sk_wake_async() uses the callback lock to guard the state of the
socket's fasync list.  It is taken as a reader for scanners, and taken
as a writer for thread paths which add entries to the fasync list.

These locking schemes date back to 1999 as indicated by comments above
sock_fasync().

To be honest, fasync usage is rare.  It's a way to register processes
which will receive SIGIO signals during IO events.  The generic fasync
code fs/fcntl.c:kill_fasync() uses a global lock for list protection
and nobody has noticed.  But we should be careful with this
observation, because the other users of fasync handling are devices
like input (keyboards, mice) and TTY devices.

If the "fasync on sockets is rare" observation is accurate, we could
simply use fasync_helper() and kill_fasync() to implement the
necessary locking.
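
Something along these lines, for instance.  The *_sketch() names are
hypothetical; the point is only that fasync_helper() and kill_fasync()
already provide their own locking:

#include <linux/fs.h>
#include <linux/net.h>

static int sock_fasync_sketch(int fd, struct file *filp, int on)
{
	struct socket *sock = filp->private_data;

	/* fasync_helper() serializes list add/remove internally. */
	return fasync_helper(fd, filp, on, &sock->fasync_list);
}

static void sock_wake_async_sketch(struct socket *sock, int band)
{
	/* kill_fasync() takes the global fasync lock for the scan, so no
	 * per-socket sk_callback_lock is needed to deliver SIGIO/SIGURG. */
	if (sock)
		kill_fasync(&sock->fasync_list, SIGIO, band);
}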

Also, fasync is how SIGURG is implemented, again another rarely used
feature of sockets.

The next layer of (mis-)use of the sk_callback_lock is the mid-level
protocol layers that sit on top of real protocol stacks, for example
sunrpc and iscsi.  These subsystems want to override the socket
callbacks so that they can do data-ready etc. processing directly or
with modified behavior.

Typically these code bases stash away the callbacks they override so
that when they are done with the socket they can restore those
pointers.  During this "callback pointer swap" they hold the
sk_callback_lock, which on the surface seems pretty pointless since it
does not protect other threads of execution from invoking these
callbacks.

SUNRPC also seems to use the sk->sk_callback_lock to stabilize the
state of the sk->sk_user_data pointer in the XPRT socket layer.  It
also has an xprt->shutdown state which is tested in addition to the
NULL'ness of the sk->sk_user_data XPRT pointer.  A separate
xprt->transport_lock synchronizes the processing of incoming packets
for a given XID session, so the sk_callback_lock isn't needed for that
purpose.

sock_orphan() and sock_graft(), in include/net/sock.h, are the main
transition points of the parent socket relationship state.
Unfortunately, not all protocols use these helper routines
consistently.  In particular, the radio protocols (ax25, etc.), IPX,
ECONET, and some others do this work by hand, notably without holding
the sk->sk_callback_lock.

This should be consolidated so that the protocols do this stuff
consistently.  It also points us to the crummy SOCK_DEAD state bit we
have.  Really, all of the things guarded by SOCK_DEAD are totally
harmless if we point sk->sk_socket and sk->sk_sleep to dummy objects
instead of NULL.  So, there is strong evidence that SOCK_DEAD could be
eliminated.

Also, this lack of consistent usage of sock_orphan() and sock_graft()
means that some protocols are not invoking security_sock_graft()
properly.

So likely the first task in all of this is to get the protocols
to use sock_orphan() and sock_graft() across the board without
any exceptions.

At that point we can build dummy socket and wait-queue objects that
sock allocation and sock_orphan() can point to.  I believe the
sk_callback_lock acquisition can then be removed from
sock_orphan() and sock_graft().
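
In other words, roughly the following, reusing the sock_dummy_* objects
sketched earlier.  Again the names are illustrative and ordering/barrier
details are deliberately glossed over:

#include <linux/security.h>
#include <net/sock.h>

static void sock_graft_sketch(struct sock *sk, struct socket *parent)
{
	sk->sk_sleep  = &parent->wait;
	sk->sk_socket = parent;
	parent->sk    = sk;
	security_sock_graft(sk, parent);	/* done by every protocol */
}

static void sock_orphan_sketch(struct sock *sk)
{
	sock_set_flag(sk, SOCK_DEAD);		/* until SOCK_DEAD itself dies */
	sk->sk_socket = &sock_dummy_socket;
	sk->sk_sleep  = &sock_dummy_wq;
}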

Next, we convert the fasync bits to use the global fasync_lock
via the helper functions fasync_helper() and kill_fasync() in
fs/fcntl.c instead of the per-socket locking, by-hand list insertion,
and __kill_fasync() invocations the socket layer uses currently.

Now, it should be possible to entirely remove the sk_callback_lock
grabs in the sock callbacks.

The rest of the work is composed of walking over the socket
wrapping layers such as SUNRPC, OCFS, and ISCSI and auditing
their sk_callback_lock usage, trying to eliminate the lock as
much as possible.  The unconvertible parts are probably replaceable
with subsystem-local object locks.

It should be possible to delete sk->sk_callback_lock at that point.
Killing SOCK_DEAD could be a follow-on task.


* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
  2008-06-11  9:16   ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) David Miller
@ 2008-06-12 14:05     ` Herbert Xu
  0 siblings, 0 replies; 20+ messages in thread
From: Herbert Xu @ 2008-06-12 14:05 UTC (permalink / raw)
  To: David Miller; +Cc: GHaskins, PMullaney, netdev

On Wed, Jun 11, 2008 at 02:16:42AM -0700, David Miller wrote:
> 
> Deletion of sk->sk_callback_lock.

Looks pretty reasonable.  Thanks!
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
       [not found] <484F43A3020000760003F543@lucius.provo.novell.com>
@ 2008-06-17  1:38 ` Patrick Mullaney
  2008-06-17  1:53   ` Killing sk->sk_callback_lock David Miller
                     ` (2 more replies)
  0 siblings, 3 replies; 20+ messages in thread
From: Patrick Mullaney @ 2008-06-17  1:38 UTC (permalink / raw)
  To: davem; +Cc: herbert, Gregory Haskins, chuck.lever, netdev

On Wed, 2008-06-11 at 02:16 -0700, David Miller wrote: 
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 11 Jun 2008 16:46:12 +1000
> 
> > Anyway, you've lost me with this scenario.  The patch is stopping
> > wake-ups on the write path yet you're talking about the read path.
> > So my question is why are you getting the extra wake-up on that
> > write path when you receive a packet?
> 
> During a short discussion on IRC with Herbert I explained
> that indeed it's the write wakeup path that's causing the
> issue here and we both agreed that the cost is likely
> simply from the sk->sk_callback_lock and not the wakeups
> themselves.  Once the first wakeup happens, the wait
> queues should be inactive and the task awake already.

I don't follow but I wasn't part of the IRC discussion. :-) Please
send me a note out of band if you would like to discuss again on IRC.
I agree that the sk_callback_lock may be a significant overhead
(although lockstat is not showing it to be highly contended).
Our testing consists of parallel netperf tasks, one per CPU, with
some test runs utilizing RT priority and others that do not.

The overhead I was trying to address was scheduler overhead.
In short, I think I have a new patch that solves the race issues
and addresses the unneeded wakeup.  See below.

Something that is worth pointing out about the approach
taken is that SOCK_NOSPACE now gets cleared after a task wakes from
a sleep due to wmem being exceeded.  I was actually surprised that
this was not already done (once a UDP socket sets this flag it is
not cleared for the life of the socket, as far as I have seen).  I
am not sure whether there are any unintended side effects, but none
have been observed in the code or while running so far.

I think the elimination of the sk_callback_lock would be great; I
can see it having a significant impact on this testing.  However, a
patch to address the extra wakeup and the flag behavior will still
provide some improvement.


From: Patrick Mullaney <pmullaney@novell.com>

    Remove the extra wakeup on receive caused by write space
    cleanup (below the wmem high-water threshold).

Signed-off-by: Patrick Mullaney <pmullaney@novell.com>
---

diff --git a/net/core/sock.c b/net/core/sock.c
index 1a394d0..a26d223 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1159,7 +1159,6 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
 			break;
 		if (signal_pending(current))
 			break;
-		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
 		if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf)
 			break;
@@ -1167,9 +1166,11 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
 			break;
 		if (sk->sk_err)
 			break;
+		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		timeo = schedule_timeout(timeo);
 	}
 	finish_wait(sk->sk_sleep, &wait);
+	clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 	return timeo;
 }
 
@@ -1467,6 +1468,10 @@ static void sock_def_readable(struct sock *sk, int len)
 
 static void sock_def_write_space(struct sock *sk)
 {
+	/* don't wake unless writer is waiting */
+	if(!test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
+		return;
+
 	read_lock(&sk->sk_callback_lock);
 
 	/* Do not wake up a writer until he can make "significant"


> 



* Re: Killing sk->sk_callback_lock
  2008-06-17  1:38 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Patrick Mullaney
@ 2008-06-17  1:53   ` David Miller
  2008-06-17  4:01     ` Gregory Haskins
  2008-06-17 13:38     ` Patrick Mullaney
  2008-06-17  2:56   ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Herbert Xu
  2008-06-17 23:33   ` Herbert Xu
  2 siblings, 2 replies; 20+ messages in thread
From: David Miller @ 2008-06-17  1:53 UTC (permalink / raw)
  To: pmullaney; +Cc: herbert, GHaskins, chuck.lever, netdev

From: "Patrick Mullaney" <pmullaney@novell.com>
Date: Mon, 16 Jun 2008 19:38:23 -0600

> The overhead I was trying to address was scheduler overhead.

Neither Herbert nor I are convinced of this yet, and you have
to show us why you think this is the problem and not (in
our opinion) the more likely sk_callback_lock overhead.

I am tiring of stating this over and over so please address
this in any future correspondence.

Once the task is woken up the first time, future calls to
these callback functions should do nothing other than take
the sk_callback_lock and test some state.

Since the task is awake already, wakeups should be bypassed
or at worst be a nop.


* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
  2008-06-17  1:38 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Patrick Mullaney
  2008-06-17  1:53   ` Killing sk->sk_callback_lock David Miller
@ 2008-06-17  2:56   ` Herbert Xu
  2008-06-17  3:09     ` Eric Dumazet
  2008-06-17 23:33   ` Herbert Xu
  2 siblings, 1 reply; 20+ messages in thread
From: Herbert Xu @ 2008-06-17  2:56 UTC (permalink / raw)
  To: Patrick Mullaney; +Cc: davem, Gregory Haskins, chuck.lever, netdev

On Mon, Jun 16, 2008 at 07:38:23PM -0600, Patrick Mullaney wrote:
> 
> I don't follow but I wasn't part of the IRC discussion. :-) Please
> send me a note out of band if you would like to discuss again on IRC.
> I agree that the sk_callback_lock may be significant overhead
> (although lockstat is not showing it to be highly contended).

Lock contention doesn't matter! The worst problem with read-write
locks is cache-line bouncing.  That is, read locks will grab cache
lines from other read locks thus causing performance to go down
the drain if read locks are common (which is the only scenario
where you use read-write locks anyway).  Moral of the story is
that read-write locks are bad.
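
To illustrate (a contrived sketch, not a proposal for this code): even an
uncontended read_lock() does an atomic read-modify-write on the shared
lock word, so concurrent readers on different CPUs keep stealing that
cache line from each other, whereas an RCU-style read side writes no
shared memory at all.

#include <linux/spinlock.h>
#include <linux/rcupdate.h>

static DEFINE_RWLOCK(demo_lock);
static int *demo_ptr;

static int read_with_rwlock(void)
{
	int val;

	read_lock(&demo_lock);		/* atomic RMW on the lock word */
	val = demo_ptr ? *demo_ptr : 0;
	read_unlock(&demo_lock);	/* second RMW: two bounces per read */
	return val;
}

static int read_with_rcu(void)
{
	int *p, val;

	rcu_read_lock();		/* dirties no shared cache line */
	p = rcu_dereference(demo_ptr);
	val = p ? *p : 0;
	rcu_read_unlock();
	return val;
}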

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
  2008-06-17  2:56   ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Herbert Xu
@ 2008-06-17  3:09     ` Eric Dumazet
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2008-06-17  3:09 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Patrick Mullaney, davem, Gregory Haskins, chuck.lever, netdev

Herbert Xu a écrit :
> On Mon, Jun 16, 2008 at 07:38:23PM -0600, Patrick Mullaney wrote:
>> I don't follow but I wasn't part of the IRC discussion. :-) Please
>> send me a note out of band if you would like to discuss again on IRC.
>> I agree that the sk_callback_lock may be significant overhead
>> (although lockstat is not showing it to be highly contended).
> 
> Lock contention doesn't matter! The worst problem with read-write
> locks is cache-line bouncing.  That is, read locks will grab cache
> lines from other read locks thus causing performance to go down
> the drain if read locks are common (which is the only scenario
> where you use read-write locks anyway).  Moral of the story is
> that read-write locks are bad.
> 
> Cheers,

Yes, read-write locks are bad (not the concept itself, only their
implementation), but alternatives are complex to set up and maintain.

ipt_do_table() for example hits a read-write lock for every packet
handled on a chain. This is why firewalling sucks even with few rules.







* Re: Killing sk->sk_callback_lock
  2008-06-17  1:53   ` Killing sk->sk_callback_lock David Miller
@ 2008-06-17  4:01     ` Gregory Haskins
  2008-06-17  4:09       ` David Miller
                         ` (2 more replies)
  2008-06-17 13:38     ` Patrick Mullaney
  1 sibling, 3 replies; 20+ messages in thread
From: Gregory Haskins @ 2008-06-17  4:01 UTC (permalink / raw)
  To: David Miller, Patrick Mullaney; +Cc: herbert, chuck.lever, netdev

>>> On Mon, Jun 16, 2008 at  9:53 PM, in message
<20080616.185328.85842051.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote: 
> From: "Patrick Mullaney" <pmullaney@novell.com>
> Date: Mon, 16 Jun 2008 19:38:23 -0600
> 
>> The overhead I was trying to address was scheduler overhead.
> 
> Neither Herbert nor I are convinced of this yet, and you have
> to show us why you think this is the problem and not (in
> our opinion) the more likely sk_callback_lock overhead.

Please bear with us.  It is not our intent to be annoying, but we are perhaps doing a poor job of describing the actual nature of the issue we are seeing.

To be clear on how we got to this point: We are tasked with improving the performance of our particular kernel configuration.  We observed that our configuration was comparatively poor at UDP performance, so we started investigating why.  We instrumented the kernel using various tools (lockdep, oprofile, logdev, etc), and observed two wakeups for every packet that was received while running a multi-threaded netperf UDP throughput benchmark on some 8-core boxes.

A common pattern would emerge during analysis of the instrumentation data: An arbitrary client thread would block on the wait-queue waiting for a UDP packet.  It would of course initially block because there were no packets available.  It would then wake up, check the queue, see that there are still no packets available, and go back to sleep.  It would then wake up again a short time later, find a packet, and return to userspace.

This seemed odd to us, so we investigated further to see if an improvement was lurking or whether this was expected.  We traced back the source of each wakeup to be coming from 1) the wmem/nospace code, and 2) from the rx-wakeup code from the softirq.  First the softirq would process the tx-completions which would wake_up() the wait-queue for NOSPACE signaling.  Since the client was waiting for a packet on the same wait-queue, this was where the first wakeup came from.  Then later the softirq finally pushed an actual packet to the queue, and the client was once again re-awoken via the same overloaded wait-queue.  This time it would successfully find a packet and return to userspace.

Since the client does not care about wmem/nospace in the UDP rx path, yet the two events share a single wait-queue, the first wakeup was completely wasted.  It just causes extra scheduling activity that does not help in any way (and is quite expensive in the grand-scheme of things).  Based on this lead, Pat devised a solution which eliminates the extra wake-up() when there are no clients waiting for that particular NOSPACE event.  With his patch applied, we observed two things:

1) We now had 1 wake-up per packet, instead of 2 (decreasing context switching rates by ~50%)
2) overall UDP throughput performance increased by ~25%

This was true even without the presence of our instrumentation, so I don't think we can chalk up the "double-wake" analysis as an anomaly caused by the presence of the instrumentation itself.  Based on that, it would at least appear that the odd behavior w.r.t. the phantom wakeup does indeed hinder performance.  This is not to say that the locking issues you highlight are not also an issue.  But note that we have no evidence to suggest this particular phenomenon is related in any way to the locking (in fact, sk_callback_lock was not showing up at all on the lockdep radar for this particular configuration, indicating a low contention rate).

So by all means, if there are improvements to the locking that can be made, that's great!  But fixing the locking will not likely address this scheduler overhead that Pat refers to, IMO.  They would appear to be orthogonal issues.  I will keep an open mind, but the root cause seems to be either the tendency of the stack code to overload the wait-queue, or the fact that UDP sockets do not dynamically manage the NOSPACE state flags.  From my perspective, I am not married to any one particular solution as long as the fundamental "phantom wakeup" problem is addressed.

HTH

Regards,
-Greg




* Re: Killing sk->sk_callback_lock
  2008-06-17  4:01     ` Gregory Haskins
@ 2008-06-17  4:09       ` David Miller
  2008-06-17  4:20         ` Gregory Haskins
  2008-06-17  4:22       ` Fwd: " Gregory Haskins
  2008-06-17  4:56       ` David Miller
  2 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2008-06-17  4:09 UTC (permalink / raw)
  To: ghaskins; +Cc: PMullaney, herbert, chuck.lever, netdev

From: "Gregory Haskins" <ghaskins@novell.com>
Date: Mon, 16 Jun 2008 22:01:13 -0600

> Please bear with us.  It is not our intent to be annoying, but we are perhaps doing a poor job of describing the actual nature of the issue we are seeing.

Even though you say so, I don't believe your email system locks you
into these extremely long, poorly formatted lines.  I get properly
formatted plain text emails from other Novell employees, and they
don't have long lines.

Your emails are extremely painful for me, and other developers, to
read.  And you can expect an extremely degraded level of response, or
none at all, when you format your emails in this way.

Just FYI...


* Re: Killing sk->sk_callback_lock
  2008-06-17  4:09       ` David Miller
@ 2008-06-17  4:20         ` Gregory Haskins
  2008-06-17  4:30           ` Gregory Haskins
  0 siblings, 1 reply; 20+ messages in thread
From: Gregory Haskins @ 2008-06-17  4:20 UTC (permalink / raw)
  To: David Miller
  Cc: gregory.haskins, herbert, Patrick Mullaney, chuck.lever, netdev

>>> On Tue, Jun 17, 2008 at 12:09 AM, in message
<20080616.210923.91199695.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote: 
> From: "Gregory Haskins" <ghaskins@novell.com>
> Date: Mon, 16 Jun 2008 22:01:13 -0600
> 
>> Please bear with us.  It is not our intent to be annoying, but we are 
> perhaps doing a poor job of describing the actual nature of the issue we are 
> seeing.
> 
> Even though you say so, I don't believe your email system locks you
> into these extremely long poorly formatted lines.  I get properly
> formatted plain text emails from other Novell employees, and they
> don't have long lines.
> 
> Your emails are extremely painful for me, and other developers, to
> read.  And you can expect an extremely degraded level of response, or
> none at all, when you format your emails in this way.

Understood, and I apologize.  I use the Linux GroupWise client, which apparently does not get the same level of features as the other platforms we have (e.g. Windows).  I've complained to the helpdesk and the answer I get is that unwrapped text is the official standard and wrapping is supposed to be handled by the recipient's mail viewer.  Go figure.  I can set wrapping on all my other clients, but I can't find it anywhere on this one.  I'm sorry for the difficulties this causes.

I've CC'd my personal address and will respond in the future with that client (which I know supports it properly).

> 
> Just FYI...





* Fwd: Re: Killing sk->sk_callback_lock
  2008-06-17  4:01     ` Gregory Haskins
  2008-06-17  4:09       ` David Miller
@ 2008-06-17  4:22       ` Gregory Haskins
  2008-06-17  4:56       ` David Miller
  2 siblings, 0 replies; 20+ messages in thread
From: Gregory Haskins @ 2008-06-17  4:22 UTC (permalink / raw)
  To: davem, Patrick Mullaney; +Cc: Herbert Xu, chuck.lever, netdev

Oddly enough, forwarding an email does seem to line-wrap, so perhaps this one is easier to read:

Regards,
-Greg

>>> On Tue, Jun 17, 2008 at 12:01 AM, in message <4856FEC9.BA47.005A.0@novell.com>,
Gregory Haskins wrote: 
> >>> On Mon, Jun 16, 2008 at  9:53 PM, in message
> <20080616.185328.85842051.davem@davemloft.net>, David Miller
> <davem@davemloft.net> wrote: 
>> From: "Patrick Mullaney" <pmullaney@novell.com>
>> Date: Mon, 16 Jun 2008 19:38:23 -0600
>> 
>>> The overhead I was trying to address was scheduler overhead.
>> 
>> Neither Herbert nor I are convinced of this yet, and you have
>> to show us why you think this is the problem and not (in
>> our opinion) the more likely sk_callback_lock overhead.
> 
> Please bear with us.  It is not our intent to be annoying, but we are 
> perhaps doing a poor job of describing the actual nature of the issue we are 
> seeing.
> 
> To be clear on how we got to this point: We are tasked with improving the 
> performance of our particular kernel configuration.  We observed that our 
> configuration was comparatively poor at UDP performance, so we started 
> investigating why.  We instrumented the kernel using various tools (lockdep, 
> oprofile, logdev, etc), and observed two wakeups for every packet that was 
> received while running a multi-threaded netperf UDP throughput benchmark on 
> some 8-core boxes.
> 
> A common pattern would emerge during analysis of the instrumentation data: 
> An arbitrary client thread would block on the wait-queue waiting for a UDP 
> packet.  It would of course initially block because there were no packets 
> available.  It would then wake up, check the queue, see that there are still 
> no packets available, and go back to sleep.  It would then wake up again a 
> short time later, find a packet, and return to userspace.
> 
> This seemed odd to us, so we investigated further to see if an improvement 
> was lurking or whether this was expected.  We traced back the source of each 
> wakeup to be coming from 1) the wmem/nospace code, and 2) from the rx-wakeup 
> code from the softirq.  First the softirq would process the tx-completions 
> which would wake_up() the wait-queue for NOSPACE signaling.  Since the client 
> was waiting for a packet on the same wait-queue, this was where the first 
> wakeup came from.  Then later the softirq finally pushed an actual packet to 
> the queue, and the client was once again re-awoken via the same overloaded 
> wait-queue.  This time it would successfully find a packet and return to 
> userspace.
> 
> Since the client does not care about wmem/nospace in the UDP rx path, yet 
> the two events share a single wait-queue, the first wakeup was completely 
> wasted.  It just causes extra scheduling activity that does not help in any 
> way (and is quite expensive in the grand-scheme of things).  Based on this 
> lead, Pat devised a solution which eliminates the extra wake-up() when there 
> are no clients waiting for that particular NOSPACE event.  With his patch 
> applied, we observed two things:
> 
> 1) We now had 1 wake-up per packet, instead of 2 (decreasing context 
> switching rates by ~50%)
> 2) overall UDP throughput performance increased by ~25%
> 
> This was true even without the presence of our instrumentation, so I don't 
> think we can chalk up the "double-wake" analysis as an anomaly caused by the 
> presence of the instrumentation itself.  Based on that, it would at least 
> appear that the odd behavior w.r.t. the phantom wakeup does indeed hinder 
> performance.  This is not to say that the locking issues you highlight are 
> not also an issue.  But note that we have no evidence to suggest this 
> particular phenomenon is related in any way to the locking (in fact, 
> sk_callback_lock was not showing up at all on the lockdep radar for this 
> particular configuration, indicating a low contention rate).
> 
> So by all means, if there are improvements to the locking that can be made, 
> that's great!  But fixing the locking will not likely address this scheduler 
> overhead that Pat refers to IMO.  They would appear to be orthogonal issues.  
> I will keep an open mind, but the root-cause seems to be either the tendency 
> of the stack code to overload the wait-queue, or the fact that UDP sockets do 
> not dynamically manage the NOSPACE state flags.  From my perspective, I am 
> not married to any one particular solution as long as the fundamental 
> "phantom wakeup" problem is addressed.
> 
> HTH
> 
> Regards,
> -Greg
> 
> 




* Re: Killing sk->sk_callback_lock
  2008-06-17  4:20         ` Gregory Haskins
@ 2008-06-17  4:30           ` Gregory Haskins
  0 siblings, 0 replies; 20+ messages in thread
From: Gregory Haskins @ 2008-06-17  4:30 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: David Miller, herbert, Patrick Mullaney, chuck.lever, netdev


Gregory Haskins wrote:
>>>> On Tue, Jun 17, 2008 at 12:09 AM, in message
>>>>         
> <20080616.210923.91199695.davem@davemloft.net>, David Miller
> <davem@davemloft.net> wrote: 
>   
>> From: "Gregory Haskins" <ghaskins@novell.com>
>> Date: Mon, 16 Jun 2008 22:01:13 -0600
>>
>>     
>>> Please bear with us.  It is not our intent to be annoying, but we are 
>>>       
>> perhaps doing a poor job of describing the actual nature of the issue we are 
>> seeing.
>>
>> Even though you say so, I don't believe your email system locks you
>> into these extremely long poorly formatted lines.  I get properly
>> formatted plain text emails from other Novell employees, and they
>> don't have long lines.
>>
>> Your emails are extremely painful for me, and other developers, to
>> read.  And you can expect an extremely degraded level of response, or
>> none at all, when you format your emails in this way.
>>     
>
> Understood, and I apologize.  I use the Linux GroupWise client, which apparently does not get the same level of features as the other platforms we have (e.g. Windows).  I've complained to the helpdesk and the answer I get is that unwrapped text is the official standard and wrapping is supposed to be handled by the recipient's mail viewer.  Go figure.  I can set wrapping on all my other clients, but I can't find it anywhere on this one.  I'm sorry for the difficulties this causes. 
>
> I've CC'd my personal address and will respond in the future with that client (which I know supports it properly).
>
>   
>> Just FYI...
>>     
>
>
>
>
>   

For what it's worth, my Thunderbird client here at home seems to render 
my non-wrapped email just fine.  What client are you using, out of 
curiosity?

-Greg




* Re: Killing sk->sk_callback_lock
  2008-06-17  4:01     ` Gregory Haskins
  2008-06-17  4:09       ` David Miller
  2008-06-17  4:22       ` Fwd: " Gregory Haskins
@ 2008-06-17  4:56       ` David Miller
  2008-06-17 11:42         ` Gregory Haskins
  2 siblings, 1 reply; 20+ messages in thread
From: David Miller @ 2008-06-17  4:56 UTC (permalink / raw)
  To: ghaskins; +Cc: PMullaney, herbert, chuck.lever, netdev

From: "Gregory Haskins" <ghaskins@novell.com>
Date: Mon, 16 Jun 2008 22:01:13 -0600

> This seemed odd to us, so we investigated further to see if an
> improvement was lurking or whether this was expected.  We traced
> back the source of each wakeup to be coming from 1) the wmem/nospace
> code, and 2) from the rx-wakeup code from the softirq.  First the
> softirq would process the tx-completions which would wake_up() the
> wait-queue for NOSPACE signaling.  Since the client was waiting for
> a packet on the same wait-queue, this was where the first wakeup
> came from.  Then later the softirq finally pushed an actual packet
> to the queue, and the client was once again re-awoken via the same
> overloaded wait-queue.  This time it would successfully find a
> packet and return to userspace.
>
> Since the client does not care about wmem/nospace in the UDP rx
> path, yet the two events share a single wait-queue, the first wakeup
> was completely wasted.  It just causes extra scheduling activity
> that does not help in any way (and is quite expensive in the
> grand-scheme of things).  Based on this lead, Pat devised a solution
> which eliminates the extra wake-up() when there are no clients
> waiting for that particular NOSPACE event.  With his patch applied,
> we observed two things:

Why is the application checking for receive packets even on the
write-space wakeup?

poll/select/epoll should be giving the correct event indication,
therefore the application would know to not check for receive
packets when a write-wakeup event occurs.

Yes the wakeup is spurious and we should avoid it.  But this
application is also buggy.
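
For instance (a hedged userspace sketch of what such an application would
do, not netperf's actual code): by asking poll() only for POLLIN, the
process is told about readability and never issues a receive attempt in
response to a write-space event.

#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

static ssize_t recv_when_readable(int fd, void *buf, size_t len)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	for (;;) {
		if (poll(&pfd, 1, -1) < 0)
			return -1;
		if (pfd.revents & (POLLIN | POLLERR | POLLHUP | POLLNVAL))
			return recv(fd, buf, len, MSG_DONTWAIT);
		/* A pure write-space event never reaches here: POLLOUT
		 * was not requested, so poll() does not report it. */
	}
}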


* Re: Killing sk->sk_callback_lock
  2008-06-17  4:56       ` David Miller
@ 2008-06-17 11:42         ` Gregory Haskins
  0 siblings, 0 replies; 20+ messages in thread
From: Gregory Haskins @ 2008-06-17 11:42 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, pmullaney, chuck.lever, netdev, Gregory Haskins


>>> On Tue, Jun 17, 2008 at 12:56 AM, in message
<20080616.215632.119969915.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote:
> From: "Gregory Haskins" <ghaskins@novell.com>
> Date: Mon, 16 Jun 2008 22:01:13 -0600
> 
>> This seemed odd to us, so we investigated further to see if an
>> improvement was lurking or whether this was expected.  We traced
>> back the source of each wakeup to be coming from 1) the wmem/nospace
>> code, and 2) from the rx-wakeup code from the softirq.  First the
>> softirq would process the tx-completions which would wake_up() the
>> wait-queue for NOSPACE signaling.  Since the client was waiting for
>> a packet on the same wait-queue, this was where the first wakeup
>> came from.  Then later the softirq finally pushed an actual packet
>> to the queue, and the client was once again re-awoken via the same
>> overloaded wait-queue.  This time it would successfully find a
>> packet and return to userspace.
>>
>> Since the client does not care about wmem/nospace in the UDP rx
>> path, yet the two events share a single wait-queue, the first wakeup
>> was completely wasted.  It just causes extra scheduling activity
>> that does not help in any way (and is quite expensive in the
>> grand-scheme of things).  Based on this lead, Pat devised a solution
>> which eliminates the extra wake-up() when there are no clients
>> waiting for that particular NOSPACE event.  With his patch applied,
>> we observed two things:
> 
> Why is the application checking for receive packets even on the
> write-space wakeup?
> 
> poll/select/epoll should be giving the correct event indication,
> therefore the application would know to not check for receive
> packets when a write-wakeup event occurs.
> 
> Yes the wakeup is spurious and we should avoid it.  But this
> application is also buggy.

The application is blocked inside a system call (I forget which one 
right now... probably recv()).  So the wakeup is not against a 
poll/select.  Rather, the kernel is in 
net/core/datagram.c::wait_for_packet() (blocked on sk->sk_sleep).

Since both the wmem code and the rx code use sk->sk_sleep to wake up 
waiters, the wmem processing inadvertently kicks the client to go 
through __skb_recv_datagram() one more time.  And since there aren't yet 
any packets in sk->sk_receive_queue, the client loops and once again 
calls wait_for_packet().
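
Schematically, the loop in question has this shape (a heavily simplified
sketch of net/core/datagram.c, not the actual code):

#include <net/sock.h>

static int wait_for_packet_sketch(struct sock *sk, long timeo)
{
	DEFINE_WAIT(wait);
	int err = 0;

	for (;;) {
		prepare_to_wait_exclusive(sk->sk_sleep, &wait,
					  TASK_INTERRUPTIBLE);
		if (!skb_queue_empty(&sk->sk_receive_queue))
			break;			/* real data arrived */

		/* A write-space wake_up() on the same sk->sk_sleep lands
		 * here as well, finds the receive queue still empty, and
		 * the task simply goes back around and sleeps again. */
		timeo = schedule_timeout(timeo);
		if (signal_pending(current) || !timeo) {
			err = -EAGAIN;
			break;
		}
	}
	finish_wait(sk->sk_sleep, &wait);
	return err;
}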

So long story short:  This is entirely a kernel-space issue (unless you 
believe the usage of that system-call itself is a bug?)

HTH

Regards,
-Greg








* Re: Killing sk->sk_callback_lock
  2008-06-17  1:53   ` Killing sk->sk_callback_lock David Miller
  2008-06-17  4:01     ` Gregory Haskins
@ 2008-06-17 13:38     ` Patrick Mullaney
  2008-06-17 21:40       ` David Miller
  1 sibling, 1 reply; 20+ messages in thread
From: Patrick Mullaney @ 2008-06-17 13:38 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, Gregory Haskins, chuck.lever, netdev



>>> On Mon, Jun 16, 2008 at  9:53 PM, in message
<20080616.185328.85842051.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote: 
> From: "Patrick Mullaney" <pmullaney@novell.com>
> Date: Mon, 16 Jun 2008 19:38:23 -0600
> 
>> The overhead I was trying to address was scheduler overhead.
> 
> Neither Herbert nor I are convinced of this yet, and you have
> to show us why you think this is the problem and not (in
> our opinion) the more likely sk_callback_lock overhead.

I think Greg has done an excellent job in describing what is
happening and how we found the issue so I won't go into
that unless you have further questions. We are both available
for an IRC chat on the problem if you would like. The only
thing I want to make clear is that lockstat is showing
sk_callback_lock as uncontended. I ran this once again just
to be sure last night (zero contentions).

> 
> I am tiring of stating this over and over so please address
> this in any future correspondence.

I am confused by this statement - "over and over" implies more
than once.  It seems to me that you stated only once, in one email,
that you and Herbert came to this conclusion on IRC.  What data
supports this conclusion?

> 
> Once the task is woken up the first time, future calls to
> these callback functions should do nothing other than take
> the sk_callback_lock and test some state.
> 
> Since the task is awake already, wakeups should be bypassed
> or at worst be a nop.

The task can go directly back into a wait.  This effectively yields
two wakeups per UDP request-response.




* Re: Killing sk->sk_callback_lock
  2008-06-17 13:38     ` Patrick Mullaney
@ 2008-06-17 21:40       ` David Miller
  2008-06-17 22:15         ` Patrick Mullaney
  2008-06-18  7:36         ` Remi Denis-Courmont
  0 siblings, 2 replies; 20+ messages in thread
From: David Miller @ 2008-06-17 21:40 UTC (permalink / raw)
  To: pmullaney; +Cc: herbert, GHaskins.WAL-1.WALTHAM, chuck.lever, netdev

From: "Patrick Mullaney" <pmullaney@novell.com>
Date: Tue, 17 Jun 2008 07:38:29 -0600

> >>> On Mon, Jun 16, 2008 at  9:53 PM, in message
> <20080616.185328.85842051.davem@davemloft.net>, David Miller
> <davem@davemloft.net> wrote: 
> > Once the task is woken up the first time, future calls to
> > these callback functions should do nothing other than take
> > the sk_callback_lock and test some state.
> > 
> > Since the task is awake already, wakeups should be bypassed
> > or at worst be a nop.
> 
> The task can go directly back into a wait. This will effectively yield 2
> wake ups per udp request-response.

I made the mistake of assuming that a high-performance threaded
networking application would use non-blocking operations and
select/poll/epoll, which is clearly not the case here.

It's blocking in a recv(), and this is woken up by an extraneous
write-space wakeup.

It does need to be fixed and I'll look at the most recent patch
submission and also try to imagine some other ideas.  Herbert
mentioned creating a separate wait queue for write-space wakeups.


* Re: Killing sk->sk_callback_lock
  2008-06-17 21:40       ` David Miller
@ 2008-06-17 22:15         ` Patrick Mullaney
  2008-06-17 23:24           ` Herbert Xu
  2008-06-18  7:36         ` Remi Denis-Courmont
  1 sibling, 1 reply; 20+ messages in thread
From: Patrick Mullaney @ 2008-06-17 22:15 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, Gregory Haskins, chuck.lever, netdev



>>> On Tue, Jun 17, 2008 at  5:40 PM, in message
<20080617.144041.38758483.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote: 
> From: "Patrick Mullaney" <pmullaney@novell.com>
> Date: Tue, 17 Jun 2008 07:38:29 -0600
> 
>> >>> On Mon, Jun 16, 2008 at  9:53 PM, in message
>> <20080616.185328.85842051.davem@davemloft.net>, David Miller
>> <davem@davemloft.net> wrote: 
>> > Once the task is woken up the first time, future calls to
>> > these callback functions should do nothing other than take
>> > the sk_callback_lock and test some state.
>> > 
>> > Since the task is awake already, wakeups should be bypassed
>> > or at worst be a nop.
>> 
>> The task can go directly back into a wait. This will effectively yield 2
>> wake ups per udp request-response.
> 
> I made the mistake of assuming that a high performance threaded
> networking application would use non-blocking operations and
> select/poll/epoll, which is clearly not the case here.
> 
This is the standard netperf UDP request-response benchmark - it measures
back-to-back send/recv and is not necessarily high-performance (async).

> It's blocking in a recv() and this is woken up by a write space
> extraneous wakeup.
> 
> It does need to be fixed and I'll look at the most recent patch
> submission and also try to imagine some other ideas.  Herbert
> mentioned creating a separate wait queue for write-space wakeups.

Yeah, I think I mentioned that approach in my first email about this.  It
seemed like it would require adding to the socket struct, so I decided to
try to do it without touching that.  I am not positive, but changing the odd
behavior of the SOCK_NOSPACE flag (mentioned in a previous email) seems like
it may be in order regardless of the approach to the extra wakeup.



* Re: Killing sk->sk_callback_lock
  2008-06-17 22:15         ` Patrick Mullaney
@ 2008-06-17 23:24           ` Herbert Xu
  0 siblings, 0 replies; 20+ messages in thread
From: Herbert Xu @ 2008-06-17 23:24 UTC (permalink / raw)
  To: Patrick Mullaney; +Cc: David Miller, Gregory Haskins, chuck.lever, netdev

On Tue, Jun 17, 2008 at 04:15:46PM -0600, Patrick Mullaney wrote:
>
> Yeah, I think I mentioned that approach in my first email about this. It
> seemed like it would require adding to the socket struct so I decided to
> try to do it without touching that. I am not positive but changing the odd
> behavior of the SOCK_NOSPACE flag(mentioned in previous email) seems like
> it may be in order regardless of the approach to the extra wake up.

Fiddling with flags is just a hack.  The same socket can be used
by multiple threads, so you could have one waiting on read while
another is waiting on write.  The only real solution is to have
separate queues.
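
A sketch of that direction, with hypothetical field names (and it is
exactly the kind of addition to the socket structures that Pat mentioned
wanting to avoid):

#include <net/sock.h>

/* Hypothetical: split the single sk->sk_sleep into two queues. */
struct sock_two_queues {
	struct sock		sk;
	wait_queue_head_t	read_wq;	/* recv()/datagram waiters */
	wait_queue_head_t	write_wq;	/* wmem-space waiters */
};

static void write_space_two_queues(struct sock_two_queues *s)
{
	/* Readers parked on read_wq are never disturbed by this. */
	if (waitqueue_active(&s->write_wq))
		wake_up_interruptible(&s->write_wq);
}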

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
  2008-06-17  1:38 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Patrick Mullaney
  2008-06-17  1:53   ` Killing sk->sk_callback_lock David Miller
  2008-06-17  2:56   ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Herbert Xu
@ 2008-06-17 23:33   ` Herbert Xu
  2 siblings, 0 replies; 20+ messages in thread
From: Herbert Xu @ 2008-06-17 23:33 UTC (permalink / raw)
  To: Patrick Mullaney; +Cc: davem, Gregory Haskins, chuck.lever, netdev

On Mon, Jun 16, 2008 at 07:38:23PM -0600, Patrick Mullaney wrote:
> 
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 1a394d0..a26d223 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1159,7 +1159,6 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
>  			break;
>  		if (signal_pending(current))
>  			break;
> -		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
>  		if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf)
>  			break;
> @@ -1167,9 +1166,11 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
>  			break;
>  		if (sk->sk_err)
>  			break;
> +		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  		timeo = schedule_timeout(timeo);
>  	}
>  	finish_wait(sk->sk_sleep, &wait);
> +	clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  	return timeo;
>  }

This is buggy.  You're relying on the caller to not call this
on the same socket concurrently.  While this may be true for
UDP right now (which is a less than ideal situation in itself),
it isn't true in general.  In particular, raw sockets still
allow parallel writes to the same socket.
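
Spelled out as an interleaving (a reconstruction of the race being
described here, assuming two threads writing to the same raw socket with
the patched sock_wait_for_wmem()):

/*
 *   writer A                            writer B
 *   --------                            --------
 *   sock_wait_for_wmem():
 *     set_bit(SOCK_NOSPACE)
 *     schedule_timeout()    <sleeps>
 *                                       sock_wait_for_wmem() returns,
 *                                       clear_bit(SOCK_NOSPACE)
 *   <still asleep, flag now clear>
 *                                       softirq frees wmem:
 *                                       sock_def_write_space() sees
 *                                       !SOCK_NOSPACE and bails out
 *   writer A sleeps until its timeout expires: a missed wakeup.
 */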

So forget about the flags and just do the two queues as you proposed
yourself :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: Killing sk->sk_callback_lock
  2008-06-17 21:40       ` David Miller
  2008-06-17 22:15         ` Patrick Mullaney
@ 2008-06-18  7:36         ` Remi Denis-Courmont
  1 sibling, 0 replies; 20+ messages in thread
From: Remi Denis-Courmont @ 2008-06-18  7:36 UTC (permalink / raw)
  To: David Miller
  Cc: pmullaney, herbert, GHaskins.WAL-1.WALTHAM, chuck.lever, netdev


    Hello,

On Tue, 17 Jun 2008 14:40:41 -0700 (PDT), David Miller
<davem@davemloft.net> wrote:
>> The task can go directly back into a wait. This will effectively yield 2
>> wake ups per udp request-response.
> 
> I made the mistake of assuming that a high performance threaded
> networking application would use non-blocking operations and
> select/poll/epoll, which is clearly not the case here.
> 
> It's blocking in a recv() and this is woken up by a write space
> extraneous wakeup.

With UDP, I have consistently gotten significantly (IIRC around 30%) better
results using plain recv()/send() on blocking sockets than
poll()/recv()/send() on non-blocking sockets, on Linux, on the fast path.
Of course, this assumes there is only one socket per thread.  This was with
my Teredo IPv6 userland implementation.

You are surely in a better position than I can ever be to explain why
this is so, and how bad and wrong my approach may be.  Nevertheless, my
guess has been that the system-call overhead is simply higher when adding
poll() to recv() and send().
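
The two fast paths being compared reduce to roughly the following
(userspace sketch; attributing the ~30% gap to the extra system call per
datagram is, as said above, only a guess):

#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Blocking fast path: one system call per datagram. */
static ssize_t rx_blocking(int fd, void *buf, size_t len)
{
	return recv(fd, buf, len, 0);
}

/* Non-blocking fast path: poll() plus recv(), two system calls. */
static ssize_t rx_nonblocking(int fd, void *buf, size_t len)
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	if (poll(&pfd, 1, -1) <= 0)
		return -1;
	return recv(fd, buf, len, MSG_DONTWAIT);
}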

-- 
Rémi Denis-Courmont
http://www.remlab.net



* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
       [not found] <4857F579020000B30004963C@lucius.provo.novell.com>
@ 2008-06-18 16:49 ` Patrick Mullaney
  0 siblings, 0 replies; 20+ messages in thread
From: Patrick Mullaney @ 2008-06-18 16:49 UTC (permalink / raw)
  To: herbert; +Cc: davem, Gregory Haskins, chuck.lever, netdev

On Wed, 2008-06-18 at 09:33 +1000, Herbert Xu wrote:

> This is buggy.  You're relying on the caller to not call this
> on the same socket concurrently.  While this may be true for
> UDP right now (which is a less than ideal situation in itself),
> it isn't true in general.  In particular, raw sockets still
> allow parallel writes to the same socket.
> 
Point taken, I wasn't aware of the raw sockets usage. In that case,
there would be a race introduced that can't be solved with bit flags.
As a side note, it also seems that the setting of the NOSPACE flag
is useless here.
> So forget about the flags and just do the two queues as you proposed
> yourself :)
> 
I will give it a shot - I still have reservations about changing the
socket structures - at first glance, it seems like sock and socket will
have to change.
> Cheers,

Thanks.
Pat



