* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
[not found] <484F43A3020000760003F543@lucius.provo.novell.com>
@ 2008-06-17 1:38 ` Patrick Mullaney
2008-06-17 1:53 ` Killing sk->sk_callback_lock David Miller
` (2 more replies)
0 siblings, 3 replies; 18+ messages in thread
From: Patrick Mullaney @ 2008-06-17 1:38 UTC (permalink / raw)
To: davem; +Cc: herbert, Gregory Haskins, chuck.lever, netdev
On Wed, 2008-06-11 at 02:16 -0700, David Miller wrote:
> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Wed, 11 Jun 2008 16:46:12 +1000
>
> > Anyway, you've lost me with this scenario. The patch is stopping
> > wake-ups on the write path yet you're talking about the read path.
> > So my question is why are you getting the extra wake-up on that
> > write path when you receive a packet?
>
> During a short discussion on IRC with Herbert I explained
> that indeed it's the write wakeup path that's causing the
> issue here and we both agreed that the cost is likely
> simply from the sk->sk_callback_lock and not the wakeups
> themselves. Once the first wakeup happens, the wait
> queues should be inactive and the task awake already.
I don't follow but I wasn't part of the IRC discussion. :-) Please
send me a note out of band if you would like to discuss again on IRC.
I agree that the sk_callback_lock may be significant overhead
(although lockstat is not showing it to be highly contended).
Our testing consists of parallel netperf tasks, one per CPU, with
some test runs using rt priority and others not.
The overhead I was trying to address was scheduler overhead.
In short, I think I have a new patch that solves the race issues
and addresses the unneeded wakeup. See below.
Something worth pointing out about the approach
taken is that SOCK_NOSPACE now gets cleared after a task wakes from
a sleep due to wmem being exceeded. I was actually surprised that this
was not already done (once a UDP socket sets this flag, it is
not cleared for the life of the socket as far as I have seen). I
am not sure whether there are any unintended side effects, but none
have been observed in the code or while running so far.
I think the elimination of the sk_callback_lock would be great. I
can see it having significant impact on this testing. However, a
patch to address the extra wake and flags behavior will provide
some improvement.
From: Patrick Mullaney <pmullaney@novell.com>
Remove extra wake up on receive caused by write space
cleanup(below wmem highwater threshold).
Signed-off-by: Patrick Mullaney <pmullaney@novell.com>
---
diff --git a/net/core/sock.c b/net/core/sock.c
index 1a394d0..a26d223 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1159,7 +1159,6 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
 			break;
 		if (signal_pending(current))
 			break;
-		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
 		if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf)
 			break;
@@ -1167,9 +1166,11 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
 			break;
 		if (sk->sk_err)
 			break;
+		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 		timeo = schedule_timeout(timeo);
 	}
 	finish_wait(sk->sk_sleep, &wait);
+	clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
 	return timeo;
 }
@@ -1467,6 +1468,10 @@ static void sock_def_readable(struct sock *sk, int len)
 
 static void sock_def_write_space(struct sock *sk)
 {
+	/* don't wake unless writer is waiting */
+	if(!test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
+		return;
+
 	read_lock(&sk->sk_callback_lock);
 
 	/* Do not wake up a writer until he can make "significant"
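[The archive truncates the second hunk here. Applied to 2.6.26-era
net/core/sock.c, the resulting function would read roughly as below;
everything past the truncation point is a reconstruction from that
era's sources, not part of the submitted patch.]

static void sock_def_write_space(struct sock *sk)
{
	/* don't wake unless writer is waiting */
	if(!test_bit(SOCK_NOSPACE, &sk->sk_socket->flags))
		return;

	read_lock(&sk->sk_callback_lock);

	/* Do not wake up a writer until he can make "significant"
	 * progress.  --DaveM
	 */
	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
			wake_up_interruptible(sk->sk_sleep);

		/* Should agree with poll, otherwise some programs break */
		if (sock_writeable(sk))
			sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
	}

	read_unlock(&sk->sk_callback_lock);
}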
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 1:38 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Patrick Mullaney
@ 2008-06-17 1:53 ` David Miller
2008-06-17 4:01 ` Gregory Haskins
2008-06-17 13:38 ` Patrick Mullaney
2008-06-17 2:56 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Herbert Xu
2008-06-17 23:33 ` Herbert Xu
2 siblings, 2 replies; 18+ messages in thread
From: David Miller @ 2008-06-17 1:53 UTC (permalink / raw)
To: pmullaney; +Cc: herbert, GHaskins, chuck.lever, netdev
From: "Patrick Mullaney" <pmullaney@novell.com>
Date: Mon, 16 Jun 2008 19:38:23 -0600
> The overhead I was trying to address was scheduler overhead.
Neither Herbert nor I are convinced of this yet, and you have
to show us why you think this is the problem and not (in
our opinion) the more likely sk_callback_lock overhead.
I am tiring of stating this over and over, so please address
this in any future correspondence.
Once the task is woken up the first time, future calls to
these callback functions should do nothing other than take
the sk_callback_lock and test some state.
Since the task is awake already, wakeups should be bypassed
or at worst be a nop.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
2008-06-17 1:38 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Patrick Mullaney
2008-06-17 1:53 ` Killing sk->sk_callback_lock David Miller
@ 2008-06-17 2:56 ` Herbert Xu
2008-06-17 3:09 ` Eric Dumazet
2008-06-17 23:33 ` Herbert Xu
2 siblings, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2008-06-17 2:56 UTC (permalink / raw)
To: Patrick Mullaney; +Cc: davem, Gregory Haskins, chuck.lever, netdev
On Mon, Jun 16, 2008 at 07:38:23PM -0600, Patrick Mullaney wrote:
>
> I don't follow but I wasn't part of the IRC discussion. :-) Please
> send me a note out of band if you would like to discuss again on IRC.
> I agree that the sk_callback_lock may be significant overhead
> (although lockstat is not showing it to be highly contended).
Lock contention doesn't matter! The worst problem with read-write
locks is cache-line bouncing. That is, read locks will grab cache
lines from other read locks thus causing performance to go down
the drain if read locks are common (which is the only scenario
where you use read-write locks anyway). Moral of the story is
that read-write locks are bad.
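A toy model of the effect (illustration only; the kernel's real rwlock
is arch-specific assembly and spins differently): even the read path is
an atomic read-modify-write on the shared lock word, so every reader on
every CPU dirties the same cache line.

typedef struct { atomic_t cnt; } toy_rwlock_t;	/* init: large positive bias;
						   a writer subtracts the bias */

static inline void toy_read_lock(toy_rwlock_t *l)
{
	/* the atomic RMW writes the line even when uncontended,
	 * pulling it exclusive from whichever CPU held it last */
	while (atomic_dec_return(&l->cnt) < 0) {
		atomic_inc(&l->cnt);		/* undo: a writer is active */
		while (atomic_read(&l->cnt) <= 0)
			cpu_relax();		/* spin read-only until free */
	}
}

With N CPUs taking read locks per packet, the lock line ping-pongs
between all N caches even though the lock is never "contended" in the
lockstat sense.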
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
2008-06-17 2:56 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Herbert Xu
@ 2008-06-17 3:09 ` Eric Dumazet
0 siblings, 0 replies; 18+ messages in thread
From: Eric Dumazet @ 2008-06-17 3:09 UTC (permalink / raw)
To: Herbert Xu; +Cc: Patrick Mullaney, davem, Gregory Haskins, chuck.lever, netdev
Herbert Xu wrote:
> On Mon, Jun 16, 2008 at 07:38:23PM -0600, Patrick Mullaney wrote:
>> I don't follow but I wasn't part of the IRC discussion. :-) Please
>> send me a note out of band if you would like to discuss again on IRC.
>> I agree that the sk_callback_lock may be significant overhead
>> (although lockstat is not showing it to be highly contended).
>
> Lock contention doesn't matter! The worst problem with read-write
> locks is cache-line bouncing. That is, read locks will grab cache
> lines from other read locks thus causing performance to go down
> the drain if read locks are common (which is the only scenario
> where you use read-write locks anyway). Moral of the story is
> that read-write locks are bad.
>
> Cheers,
Yes, read-write locks are bad (not the concept itself, only their
implementation), but alternatives are complex to set up and maintain.
ipt_do_table() for example hits a read-write lock for every packet
handled on a chain. This is why firewalling sucks even with few rules.
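For concreteness, the per-packet lock Eric refers to, abridged from
2.6.26-era net/ipv4/netfilter/ip_tables.c (rule traversal elided):

unsigned int
ipt_do_table(struct sk_buff *skb, unsigned int hook,
	     const struct net_device *in, const struct net_device *out,
	     struct xt_table *table)
{
	unsigned int verdict = NF_DROP;

	read_lock_bh(&table->lock);	/* an atomic RMW for every packet */
	/* ... walk the rule chain and set verdict ... */
	read_unlock_bh(&table->lock);

	return verdict;
}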
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 1:53 ` Killing sk->sk_callback_lock David Miller
@ 2008-06-17 4:01 ` Gregory Haskins
2008-06-17 4:09 ` David Miller
` (2 more replies)
2008-06-17 13:38 ` Patrick Mullaney
1 sibling, 3 replies; 18+ messages in thread
From: Gregory Haskins @ 2008-06-17 4:01 UTC (permalink / raw)
To: David Miller, Patrick Mullaney; +Cc: herbert, chuck.lever, netdev
>>> On Mon, Jun 16, 2008 at 9:53 PM, in message
<20080616.185328.85842051.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote:
> From: "Patrick Mullaney" <pmullaney@novell.com>
> Date: Mon, 16 Jun 2008 19:38:23 -0600
>
>> The overhead I was trying to address was scheduler overhead.
>
> Neither Herbert nor I are convinced of this yet, and you have
> to show us why you think this is the problem and not (in
> our opinion) the more likely sk_callback_lock overhead.
Please bear with us. It is not our intent to be annoying, but we are perhaps doing a poor job of describing the actual nature of the issue we are seeing.
To be clear on how we got to this point: We are tasked with improving the performance of our particular kernel configuration. We observed that our configuration was comparatively poor at UDP performance, so we started investigating why. We instrumented the kernel using various tools (lockdep, oprofile, logdev, etc), and observed two wakeups for every packet that was received while running a multi-threaded netperf UDP throughput benchmark on some 8-core boxes.
A common pattern would emerge during analysis of the instrumentation data: An arbitrary client thread would block on the wait-queue waiting for a UDP packet. It would of course initially block because there were no packets available. It would then wake up, check the queue, see that there are still no packets available, and go back to sleep. It would then wake up again a short time later, find a packet, and return to userspace.
This seemed odd to us, so we investigated further to see if an improvement was lurking or whether this was expected. We traced back the source of each wakeup to be coming from 1) the wmem/nospace code, and 2) from the rx-wakeup code from the softirq. First the softirq would process the tx-completions which would wake_up() the wait-queue for NOSPACE signaling. Since the client was waiting for a packet on the same wait-queue, this was where the first wakeup came from. Then later the softirq finally pushed an actual packet to the queue, and the client was once again re-awoken via the same overloaded wait-queue. This time it would successfully find a packet and return to userspace.
Since the client does not care about wmem/nospace in the UDP rx path, yet the two events share a single wait-queue, the first wakeup was completely wasted. It just causes extra scheduling activity that does not help in any way (and is quite expensive in the grand-scheme of things). Based on this lead, Pat devised a solution which eliminates the extra wake-up() when there are no clients waiting for that particular NOSPACE event. With his patch applied, we observed two things:
1) We now had 1 wake-up per packet, instead of 2 (decreasing context switching rates by ~50%)
2) overall UDP throughput performance increased by ~25%
This was true even without the presence of our instrumentation, so I don't think we can chalk up the "double-wake" analysis as an anomaly caused by the presence of the instrumentation itself. Based on that, it would at least appear that the odd behavior w.r.t. the phantom wakeup does indeed hinder performance. This is not to say that the locking issues you highlight are not also an issue. But note that we have no evidence to suggest this particular phenomenon is related in any way to the locking (in fact, sk_callback_lock was not showing up at all on the lockdep radar for this particular configuration, indicating a low contention rate).
So by all means, if there are improvements to the locking that can be made, that's great! But fixing the locking will not likely address this scheduler overhead that Pat refers to IMO. They would appear to be orthogonal issues. I will keep an open mind, but the root-cause seems to be either the tendency of the stack code to overload the wait-queue, or the fact that UDP sockets do not dynamically manage the NOSPACE state flags. From my perspective, I am not married to any one particular solution as long as the fundamental "phantom wakeup" problem is addressed.
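To illustrate the overloading: the default rx callback wakes the very
same sk->sk_sleep queue that sock_def_write_space() (quoted in Pat's
patch) wakes. Abridged from 2.6.26-era net/core/sock.c, details
approximate:

static void sock_def_readable(struct sock *sk, int len)
{
	read_lock(&sk->sk_callback_lock);
	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
		/* same queue the write-space callback wakes, so a blocked
		 * reader is woken by tx-completion events as well */
		wake_up_interruptible(sk->sk_sleep);
	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
	read_unlock(&sk->sk_callback_lock);
}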
HTH
Regards,
-Greg
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 4:01 ` Gregory Haskins
@ 2008-06-17 4:09 ` David Miller
2008-06-17 4:20 ` Gregory Haskins
2008-06-17 4:22 ` Fwd: " Gregory Haskins
2008-06-17 4:56 ` David Miller
2 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2008-06-17 4:09 UTC (permalink / raw)
To: ghaskins; +Cc: PMullaney, herbert, chuck.lever, netdev
From: "Gregory Haskins" <ghaskins@novell.com>
Date: Mon, 16 Jun 2008 22:01:13 -0600
> Please bear with us. It is not our intent to be annoying, but we are perhaps doing a poor job of describing the actual nature of the issue we are seeing.
Even though you say so, I don't believe your email system locks you
into these extremely long poorly formatted lines. I get properly
formatted plain text emails from other Novell employees, and they
don't have long lines.
Your emails are extremely painful for me, and other developers, to
read. And you can expect an extremely degraded level of response, or
none at all, when you format your emails in this way.
Just FYI...
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 4:09 ` David Miller
@ 2008-06-17 4:20 ` Gregory Haskins
2008-06-17 4:30 ` Gregory Haskins
0 siblings, 1 reply; 18+ messages in thread
From: Gregory Haskins @ 2008-06-17 4:20 UTC (permalink / raw)
To: David Miller
Cc: gregory.haskins, herbert, Patrick Mullaney, chuck.lever, netdev
>>> On Tue, Jun 17, 2008 at 12:09 AM, in message
<20080616.210923.91199695.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote:
> From: "Gregory Haskins" <ghaskins@novell.com>
> Date: Mon, 16 Jun 2008 22:01:13 -0600
>
>> Please bear with us. It is not our intent to be annoying, but we are
> perhaps doing a poor job of describing the actual nature of the issue we are
> seeing.
>
> Even though you say so, I don't believe your email system locks you
> into these extremely long poorly formatted lines. I get properly
> formatted plain text emails from other Novell employees, and they
> don't have long lines.
>
> Your emails are extremely painful for me, and other developers, to
> read. And you can expect an extremely degraded level of response, or
> none at all, when you format your emails in this way.
Understood, and I apologize. I use the Linux GroupWise client, which apparently does not get the same level of features as the other platforms we have (e.g. Windows). I've complained to the helpdesk and the answer I get is that unwrapped is the official standard and wrapping is supposed to be handled by the recipient's mail-viewer. Go figure. I can set wrapping on all my other clients, but I can't find it anywhere on this one. I'm sorry for the difficulties this causes.
I've CC'd my personal address and will respond in the future with that client (which I know supports it properly).
>
> Just FYI...
^ permalink raw reply [flat|nested] 18+ messages in thread
* Fwd: Re: Killing sk->sk_callback_lock
2008-06-17 4:01 ` Gregory Haskins
2008-06-17 4:09 ` David Miller
@ 2008-06-17 4:22 ` Gregory Haskins
2008-06-17 4:56 ` David Miller
2 siblings, 0 replies; 18+ messages in thread
From: Gregory Haskins @ 2008-06-17 4:22 UTC (permalink / raw)
To: davem, Patrick Mullaney; +Cc: Herbert Xu, chuck.lever, netdev
Oddly enough, forwarding an email does seem to line-wrap, so perhaps this one is easier to read:
Regards,
-Greg
>>> On Tue, Jun 17, 2008 at 12:01 AM, in message <4856FEC9.BA47.005A.0@novell.com>,
Gregory Haskins wrote:
> >>> On Mon, Jun 16, 2008 at 9:53 PM, in message
> <20080616.185328.85842051.davem@davemloft.net>, David Miller
> <davem@davemloft.net> wrote:
>> From: "Patrick Mullaney" <pmullaney@novell.com>
>> Date: Mon, 16 Jun 2008 19:38:23 -0600
>>
>>> The overhead I was trying to address was scheduler overhead.
>>
>> Neither Herbert nor I are convinced of this yet, and you have
>> to show us why you think this is the problem and not (in
>> our opinion) the more likely sk_callback_lock overhead.
>
> Please bear with us. It is not our intent to be annoying, but we are
> perhaps doing a poor job of describing the actual nature of the issue we are
> seeing.
>
> To be clear on how we got to this point: We are tasked with improving the
> performance of our particular kernel configuration. We observed that our
> configuration was comparatively poor at UDP performance, so we started
> investigating why. We instrumented the kernel using various tools (lockdep,
> oprofile, logdev, etc), and observed two wakeups for every packet that was
> received while running a multi-threaded netperf UDP throughput benchmark on
> some 8-core boxes.
>
> A common pattern would emerge during analysis of the instrumentation data:
> An arbitrary client thread would block on the wait-queue waiting for a UDP
> packet. It would of course initially block because there were no packets
> available. It would then wake up, check the queue, see that there are still
> no packets available, and go back to sleep. It would then wake up again a
> short time later, find a packet, and return to userspace.
>
> This seemed odd to us, so we investigated further to see if an improvement
> was lurking or whether this was expected. We traced back the source of each
> wakeup to be coming from 1) the wmem/nospace code, and 2) from the rx-wakeup
> code from the softirq. First the softirq would process the tx-completions
> which would wake_up() the wait-queue for NOSPACE signaling. Since the client
> was waiting for a packet on the same wait-queue, this was where the first
> wakeup came from. Then later the softirq finally pushed an actual packet to
> the queue, and the client was once again re-awoken via the same overloaded
> wait-queue. This time it would successfully find a packet and return to
> userspace.
>
> Since the client does not care about wmem/nospace in the UDP rx path, yet
> the two events share a single wait-queue, the first wakeup was completely
> wasted. It just causes extra scheduling activity that does not help in any
> way (and is quite expensive in the grand-scheme of things). Based on this
> lead, Pat devised a solution which eliminates the extra wake-up() when there
> are no clients waiting for that particular NOSPACE event. With his patch
> applied, we observed two things:
>
> 1) We now had 1 wake-up per packet, instead of 2 (decreasing context
> switching rates by ~50%)
> 2) overall UDP throughput performance increased by ~25%
>
> This was true even without the presence of our instrumentation, so I don't
> think we can chalk up the "double-wake" analysis as an anomaly caused by the
> presence of the instrumentation itself. Based on that, it would at least
> appear that the odd behavior w.r.t. the phantom wakeup does indeed hinder
> performance. This is not to say that the locking issues you highlight are
> not also an issue. But note that we have no evidence to suggest this
> particular phenomenon is related in any way to the locking (in fact,
> sk_callback_lock was not showing up at all on the lockdep radar for this
> particular configuration, indicating a low contention rate).
>
> So by all means, if there are improvements to the locking that can be made,
> that's great! But fixing the locking will not likely address this scheduler
> overhead that Pat refers to IMO. They would appear to be orthogonal issues.
> I will keep an open mind, but the root-cause seems to be either the tendency
> of the stack code to overload the wait-queue, or the fact that UDP sockets do
> not dynamically manage the NOSPACE state flags. From my perspective, I am
> not married to any one particular solution as long as the fundamental
> "phantom wakeup" problem is addressed.
>
> HTH
>
> Regards,
> -Greg
>
>
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 4:20 ` Gregory Haskins
@ 2008-06-17 4:30 ` Gregory Haskins
0 siblings, 0 replies; 18+ messages in thread
From: Gregory Haskins @ 2008-06-17 4:30 UTC (permalink / raw)
To: Gregory Haskins
Cc: David Miller, herbert, Patrick Mullaney, chuck.lever, netdev
[-- Attachment #1: Type: text/plain, Size: 1759 bytes --]
Gregory Haskins wrote:
>>>> On Tue, Jun 17, 2008 at 12:09 AM, in message
>>>>
> <20080616.210923.91199695.davem@davemloft.net>, David Miller
> <davem@davemloft.net> wrote:
>
>> From: "Gregory Haskins" <ghaskins@novell.com>
>> Date: Mon, 16 Jun 2008 22:01:13 -0600
>>
>>
>>> Please bear with us. It is not our intent to be annoying, but we are
>>>
>> perhaps doing a poor job of describing the actual nature of the issue we are
>> seeing.
>>
>> Even though you say so, I don't believe your email system locks you
>> into these extremely long poorly formatted lines. I get properly
>> formatted plain text emails from other Novell employees, and they
>> don't have long lines.
>>
>> Your emails are extremely painful for me, and other developers, to
>> read. And you can expect an extremely degraded level of response, or
>> none at all, when you format your emails in this way.
>>
>
> Understood, and I apologize. I use the Linux GroupWise client, which apparently does not get the same level of features as the other platforms we have (e.g. Windows). I've complained to the helpdesk and the answer I get is that unwrapped is the official standard and wrapping is supposed to be handled by the recipient's mail-viewer. Go figure. I can set wrapping on all my other clients, but I can't find it anywhere on this one. I'm sorry for the difficulties this causes.
>
> I've CC'd my personal address and will respond in the future with that client (which I know supports it properly).
>
>
>> Just FYI...
>>
>
>
>
>
>
For what it's worth, my Thunderbird client here at home seems to render
my non-wrapped email just fine. What client are you using, out of
curiosity?
-Greg
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 4:01 ` Gregory Haskins
2008-06-17 4:09 ` David Miller
2008-06-17 4:22 ` Fwd: " Gregory Haskins
@ 2008-06-17 4:56 ` David Miller
2008-06-17 11:42 ` Gregory Haskins
2 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2008-06-17 4:56 UTC (permalink / raw)
To: ghaskins; +Cc: PMullaney, herbert, chuck.lever, netdev
From: "Gregory Haskins" <ghaskins@novell.com>
Date: Mon, 16 Jun 2008 22:01:13 -0600
> This seemed odd to us, so we investigated further to see if an
> improvement was lurking or whether this was expected. We traced
> back the source of each wakeup to be coming from 1) the wmem/nospace
> code, and 2) from the rx-wakeup code from the softirq. First the
> softirq would process the tx-completions which would wake_up() the
> wait-queue for NOSPACE signaling. Since the client was waiting for
> a packet on the same wait-queue, this was where the first wakeup
> came from. Then later the softirq finally pushed an actual packet
> to the queue, and the client was once again re-awoken via the same
> overloaded wait-queue. This time it would successfully find a
> packet and return to userspace.
>
> Since the client does not care about wmem/nospace in the UDP rx
> path, yet the two events share a single wait-queue, the first wakeup
> was completely wasted. It just causes extra scheduling activity
> that does not help in any way (and is quite expensive in the
> grand-scheme of things). Based on this lead, Pat devised a solution
> which eliminates the extra wake-up() when there are no clients
> waiting for that particular NOSPACE event. With his patch applied,
> we observed two things:
Why is the application checking for receive packets even on the
write-space wakeup?
poll/select/epoll should be giving the correct event indication,
therefore the application would know to not check for receive
packets when a write-wakeup event occurs.
Yes the wakeup is spurious and we should avoid it. But this
application is also buggy.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 4:56 ` David Miller
@ 2008-06-17 11:42 ` Gregory Haskins
0 siblings, 0 replies; 18+ messages in thread
From: Gregory Haskins @ 2008-06-17 11:42 UTC (permalink / raw)
To: David Miller; +Cc: herbert, pmullaney, chuck.lever, netdev, Gregory Haskins
[-- Attachment #1: Type: text/plain, Size: 2594 bytes --]
>>> On Tue, Jun 17, 2008 at 12:56 AM, in message
<20080616.215632.119969915.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote:
> From: "Gregory Haskins" <ghaskins@novell.com>
> Date: Mon, 16 Jun 2008 22:01:13 -0600
>
>> This seemed odd to us, so we investigated further to see if an
>> improvement was lurking or whether this was expected. We traced
>> back the source of each wakeup to be coming from 1) the wmem/nospace
>> code, and 2) from the rx-wakeup code from the softirq. First the
>> softirq would process the tx-completions which would wake_up() the
>> wait-queue for NOSPACE signaling. Since the client was waiting for
>> a packet on the same wait-queue, this was where the first wakeup
>> came from. Then later the softirq finally pushed an actual packet
>> to the queue, and the client was once again re-awoken via the same
>> overloaded wait-queue. This time it would successfully find a
>> packet and return to userspace.
>>
>> Since the client does not care about wmem/nospace in the UDP rx
>> path, yet the two events share a single wait-queue, the first wakeup
>> was completely wasted. It just causes extra scheduling activity
>> that does not help in any way (and is quite expensive in the
>> grand-scheme of things). Based on this lead, Pat devised a solution
>> which eliminates the extra wake-up() when there are no clients
>> waiting for that particular NOSPACE event. With his patch applied,
>> we observed two things:
>
> Why is the application checking for receive packets even on the
> write-space wakeup?
>
> poll/select/epoll should be giving the correct event indication,
> therefore the application would know to not check for receive
> packets when a write-wakeup event occurs.
>
> Yes the wakeup is spurious and we should avoid it. But this
> application is also buggy.
The application is blocked inside a system call (I forget which one
right now... probably recv()). So the wakeup is not against a
poll/select. Rather, the kernel is in
net/core/datagram.c::wait_for_packet() (blocked on sk->sk_sleep).
Since both the wmem code and the rx code use sk->sk_sleep to wake up
waiters, the wmem processing inadvertently kicks the client to go
through __skb_recv_datagram() one more time. And since there aren't yet
any packets in sk->sk_receive_queue, the client loops and once again
calls wait_for_packet().
So long story short: This is entirely a kernel-space issue (unless you
believe the usage of that system-call itself is a bug?)
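The loop in question, abridged from 2.6.26-era net/core/datagram.c
(error and peek handling elided):

static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
{
	DEFINE_WAIT(wait);

	prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
	if (skb_queue_empty(&sk->sk_receive_queue))
		/* any wakeup on sk_sleep, including a write-space
		 * one, returns from here */
		*timeo_p = schedule_timeout(*timeo_p);
	finish_wait(sk->sk_sleep, &wait);
	return 0;
}

and its caller in __skb_recv_datagram(), which simply loops back in
when the queue is still empty:

	do {
		skb = skb_dequeue(&sk->sk_receive_queue);
		if (skb)
			return skb;
	} while (!wait_for_packet(sk, err, &timeo));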
HTH
Regards,
-Greg
[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 250 bytes --]
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 1:53 ` Killing sk->sk_callback_lock David Miller
2008-06-17 4:01 ` Gregory Haskins
@ 2008-06-17 13:38 ` Patrick Mullaney
2008-06-17 21:40 ` David Miller
1 sibling, 1 reply; 18+ messages in thread
From: Patrick Mullaney @ 2008-06-17 13:38 UTC (permalink / raw)
To: David Miller; +Cc: herbert, Gregory Haskins, chuck.lever, netdev
>>> On Mon, Jun 16, 2008 at 9:53 PM, in message
<20080616.185328.85842051.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote:
> From: "Patrick Mullaney" <pmullaney@novell.com>
> Date: Mon, 16 Jun 2008 19:38:23 -0600
>
>> The overhead I was trying to address was scheduler overhead.
>
> Neither Herbert nor I are convinced of this yet, and you have
> to show us why you think this is the problem and not (in
> our opinion) the more likely sk_callback_lock overhead.
I think Greg has done an excellent job in describing what is
happening and how we found the issue so I won't go into
that unless you have further questions. We are both available
for an IRC chat on the problem if you would like. The only
thing I want to make clear is that lockstat is showing
sk_callback_lock as uncontended. I ran this once again just
to be sure last night (zero contentions).
>
> I am tiring of stating this over and over so please address
> this in any future correspondance.
I am confused by this statement: "over and over" implies more
than once. As far as I can tell, you stated it in a single email,
the one saying that you and Herbert came to this conclusion on IRC.
What data supports this conclusion?
>
> Once the task is woken up the first time, future calls to
> these callback functions should do nothing other than take
> the sk_callback_lock and test some state.
>
> Since the task is awake already, wakeups should be bypassed
> or at worst be a nop.
The task can go directly back into a wait. This will effectively yield 2
wakeups per UDP request-response.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 13:38 ` Patrick Mullaney
@ 2008-06-17 21:40 ` David Miller
2008-06-17 22:15 ` Patrick Mullaney
2008-06-18 7:36 ` Remi Denis-Courmont
0 siblings, 2 replies; 18+ messages in thread
From: David Miller @ 2008-06-17 21:40 UTC (permalink / raw)
To: pmullaney; +Cc: herbert, GHaskins.WAL-1.WALTHAM, chuck.lever, netdev
From: "Patrick Mullaney" <pmullaney@novell.com>
Date: Tue, 17 Jun 2008 07:38:29 -0600
> >>> On Mon, Jun 16, 2008 at 9:53 PM, in message
> <20080616.185328.85842051.davem@davemloft.net>, David Miller
> <davem@davemloft.net> wrote:
> > Once the task is woken up the first time, future calls to
> > these callback functions should do nothing other than take
> > the sk_callback_lock and test some state.
> >
> > Since the task is awake already, wakeups should be bypassed
> > or at worst be a nop.
>
> The task can go directly back into a wait. This will effectively yield 2
> wakeups per UDP request-response.
I made the mistake of assuming that a high performance threaded
networking application would use non-blocking operations and
select/poll/epoll, which is clearly not the case here.
It's blocking in a recv() and this is woken up by an extraneous
write-space wakeup.
It does need to be fixed and I'll look at the most recent patch
submission and also try to imagine some other ideas. Herbert
mentioned creating a separate wait queue for write space wakeups.
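A sketch of what that could look like (nothing along these lines was
submitted in this thread; sk_write_sleep is a hypothetical new field
on struct sock, dedicated to wmem waiters):

static void sock_def_write_space(struct sock *sk)
{
	read_lock(&sk->sk_callback_lock);
	/* rx waiters sleep on sk->sk_sleep and are no longer disturbed;
	 * only tasks that blocked for write space sleep here */
	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf &&
	    sk->sk_write_sleep && waitqueue_active(sk->sk_write_sleep))
		wake_up_interruptible(sk->sk_write_sleep);
	read_unlock(&sk->sk_callback_lock);
}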
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 21:40 ` David Miller
@ 2008-06-17 22:15 ` Patrick Mullaney
2008-06-17 23:24 ` Herbert Xu
2008-06-18 7:36 ` Remi Denis-Courmont
1 sibling, 1 reply; 18+ messages in thread
From: Patrick Mullaney @ 2008-06-17 22:15 UTC (permalink / raw)
To: David Miller; +Cc: herbert, Gregory Haskins, chuck.lever, netdev
>>> On Tue, Jun 17, 2008 at 5:40 PM, in message
<20080617.144041.38758483.davem@davemloft.net>, David Miller
<davem@davemloft.net> wrote:
> From: "Patrick Mullaney" <pmullaney@novell.com>
> Date: Tue, 17 Jun 2008 07:38:29 -0600
>
>> >>> On Mon, Jun 16, 2008 at 9:53 PM, in message
>> <20080616.185328.85842051.davem@davemloft.net>, David Miller
>> <davem@davemloft.net> wrote:
>> > Once the task is woken up the first time, future calls to
>> > these callback functions should do nothing other than take
>> > the sk_callback_lock and test some state.
>> >
>> > Since the task is awake already, wakeups should be bypassed
>> > or at worst be a nop.
>>
>> The task can go directly back into a wait. This will effectively yield 2
>> wakeups per UDP request-response.
>
> I made the mistake of assuming that a high performance threaded
> networking application would use non-blocking operations and
> select/poll/epoll, which is clearly not the case here.
>
This is the standard netperf UDP request/response benchmark; it measures
back-to-back send/recv and is not necessarily a high-performance (async)
application.
> It's blocking in a recv() and this is woken up by an extraneous
> write-space wakeup.
>
> It does need to be fixed and I'll look at the most recent patch
> submission and also try to imagine some other ideas. Herbert
> mentioned creating a separate wait queue for write space wakeups.
Yeah, I think I mentioned that approach in my first email about this. It
seemed like it would require adding to the socket struct so I decided to
try to do it without touching that. I am not positive, but changing the odd
behavior of the SOCK_NOSPACE flag (mentioned in a previous email) seems like
it may be in order regardless of the approach to the extra wakeup.
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 22:15 ` Patrick Mullaney
@ 2008-06-17 23:24 ` Herbert Xu
0 siblings, 0 replies; 18+ messages in thread
From: Herbert Xu @ 2008-06-17 23:24 UTC (permalink / raw)
To: Patrick Mullaney; +Cc: David Miller, Gregory Haskins, chuck.lever, netdev
On Tue, Jun 17, 2008 at 04:15:46PM -0600, Patrick Mullaney wrote:
>
> Yeah, I think I mentioned that approach in my first email about this. It
> seemed like it would require adding to the socket struct so I decided to
> try to do it without touching that. I am not positive, but changing the odd
> behavior of the SOCK_NOSPACE flag (mentioned in a previous email) seems like
> it may be in order regardless of the approach to the extra wakeup.
Fiddling with flags is just a hack. The same socket can be used
by multiple threads, so you could have one waiting on read while
another is waiting on write. The only real solution is to have
separate queues.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup)
2008-06-17 1:38 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Patrick Mullaney
2008-06-17 1:53 ` Killing sk->sk_callback_lock David Miller
2008-06-17 2:56 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Herbert Xu
@ 2008-06-17 23:33 ` Herbert Xu
2 siblings, 0 replies; 18+ messages in thread
From: Herbert Xu @ 2008-06-17 23:33 UTC (permalink / raw)
To: Patrick Mullaney; +Cc: davem, Gregory Haskins, chuck.lever, netdev
On Mon, Jun 16, 2008 at 07:38:23PM -0600, Patrick Mullaney wrote:
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 1a394d0..a26d223 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1159,7 +1159,6 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
>  			break;
>  		if (signal_pending(current))
>  			break;
> -		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
>  		if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf)
>  			break;
> @@ -1167,9 +1166,11 @@ static long sock_wait_for_wmem(struct sock * sk, long timeo)
>  			break;
>  		if (sk->sk_err)
>  			break;
> +		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  		timeo = schedule_timeout(timeo);
>  	}
>  	finish_wait(sk->sk_sleep, &wait);
> +	clear_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
>  	return timeo;
>  }
This is buggy. You're relying on the caller to not call this
on the same socket concurrently. While this may be true for
UDP right now (which is a less than ideal situation in itself),
it isn't true in general. In particular, raw sockets still
allow parallel writes to the same socket.
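Spelled out with two writers, A and B, blocked in sock_wait_for_wmem()
on the same socket:

	A, B:    asleep on sk_sleep, SOCK_NOSPACE set
	softirq: frees wmem, sock_def_write_space() sees the bit set,
	         wakes both
	A:       finds space, breaks out of the loop, runs clear_bit()
	B:       finds no space (A consumed it) and re-runs set_bit() --
	         but if A's clear_bit() lands after B's set_bit(), the
	         flag ends up clear while B sleeps
	softirq: the next write-space event sees !SOCK_NOSPACE and
	         returns early, so B is stuck until its timeout expires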
So forget about the flags and just do the two queues as you proposed
yourself :)
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-17 21:40 ` David Miller
2008-06-17 22:15 ` Patrick Mullaney
@ 2008-06-18 7:36 ` Remi Denis-Courmont
1 sibling, 0 replies; 18+ messages in thread
From: Remi Denis-Courmont @ 2008-06-18 7:36 UTC (permalink / raw)
To: David Miller
Cc: pmullaney, herbert, GHaskins.WAL-1.WALTHAM, chuck.lever, netdev
Hello,
On Tue, 17 Jun 2008 14:40:41 -0700 (PDT), David Miller
<davem@davemloft.net> wrote:
>> The task can go directly back into a wait. This will effectively yield 2
>> wakeups per UDP request-response.
>
> I made the mistake of assuming that a high performance threaded
> networking application would use non-blocking operations and
> select/poll/epoll, which is clearly not the case here.
>
> It's blocking in a recv() and this is woken up by an extraneous
> write-space wakeup.
With UDP, I have consistently gotten significantly (IIRC around 30%) better
results using plain recv()/send() on blocking sockets than
poll()/recv()/send() on non-blocking sockets, on Linux, on the fast path.
Of course, this assumes there is only one socket per thread. Specifically,
this was with my Teredo IPv6 userland implementation.
You are surely in a better position than I will ever be to explain why
this is so, and how bad and wrong my approach may be. Nevertheless, my
guess has been that the system call overhead is simply higher when poll()
is added to recv() and send().
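For reference, the two fast-path patterns being compared, as a
userspace sketch (one socket per thread, error handling omitted):

#include <stddef.h>
#include <poll.h>
#include <sys/socket.h>

/* blocking: one syscall per datagram */
static void rx_blocking(int fd, char *buf, size_t len)
{
	for (;;)
		(void)recv(fd, buf, len, 0);		/* sleeps in-kernel */
}

/* non-blocking: two syscalls per datagram on the fast path */
static void rx_polled(int fd, char *buf, size_t len)	/* fd is O_NONBLOCK */
{
	struct pollfd pfd = { .fd = fd, .events = POLLIN };

	for (;;) {
		(void)poll(&pfd, 1, -1);		/* syscall #1 */
		(void)recv(fd, buf, len, MSG_DONTWAIT);	/* syscall #2 */
	}
}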
--
Rémi Denis-Courmont
http://www.remlab.net
^ permalink raw reply [flat|nested] 18+ messages in thread
* Re: Killing sk->sk_callback_lock
2008-06-18 16:49 ` Patrick Mullaney
@ 2008-06-18 21:56 ` David Miller
0 siblings, 0 replies; 18+ messages in thread
From: David Miller @ 2008-06-18 21:56 UTC (permalink / raw)
To: pmullaney; +Cc: herbert, GHaskins, chuck.lever, netdev
From: "Patrick Mullaney" <pmullaney@novell.com>
Date: Wed, 18 Jun 2008 10:49:28 -0600
> I will give it a shot - I still have reservations about changing the
> socket structures - at first glance, it seems like both struct sock and
> struct socket will have to change.
If you work against net-next-2.6, everything setting sk->sk_sleep is
compartmentalized into a small number of code locations, so such work
ought to be trivial.
^ permalink raw reply [flat|nested] 18+ messages in thread
end of thread, other threads:[~2008-06-18 21:56 UTC | newest]
Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <484F43A3020000760003F543@lucius.provo.novell.com>
2008-06-17 1:38 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Patrick Mullaney
2008-06-17 1:53 ` Killing sk->sk_callback_lock David Miller
2008-06-17 4:01 ` Gregory Haskins
2008-06-17 4:09 ` David Miller
2008-06-17 4:20 ` Gregory Haskins
2008-06-17 4:30 ` Gregory Haskins
2008-06-17 4:22 ` Fwd: " Gregory Haskins
2008-06-17 4:56 ` David Miller
2008-06-17 11:42 ` Gregory Haskins
2008-06-17 13:38 ` Patrick Mullaney
2008-06-17 21:40 ` David Miller
2008-06-17 22:15 ` Patrick Mullaney
2008-06-17 23:24 ` Herbert Xu
2008-06-18 7:36 ` Remi Denis-Courmont
2008-06-17 2:56 ` Killing sk->sk_callback_lock (was Re: [PATCH] net/core/sock.c remove extra wakeup) Herbert Xu
2008-06-17 3:09 ` Eric Dumazet
2008-06-17 23:33 ` Herbert Xu
[not found] <4857F579020000B30004963C@lucius.provo.novell.com>
2008-06-18 16:49 ` Patrick Mullaney
2008-06-18 21:56 ` Killing sk->sk_callback_lock David Miller
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).