Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: David Miller @ 2010-04-20 23:48 UTC (permalink / raw)
  To: mpatocka; +Cc: lists.linux-foundation.org, netdev, kaber
In-Reply-To: <20100420.164505.34362726.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 20 Apr 2010 16:45:05 -0700 (PDT)

> From: Mikulas Patocka <mpatocka@redhat.com>
> Date: Tue, 20 Apr 2010 19:40:56 -0400 (EDT)
> 
>> Reviewing the code further, I found one very weird commit
>> 572a9d7b6fc7f20f573664063324c086be310c42 committed to 2.6.33. What
>> it does, it changes the semantics of ndo_hard_start_xmit(). Prior to
>> the patch, the meaning was --- return zero (NETDEV_TX_OK) --- the
>> skb is consumed by the driver. Returns non-zero --- the skb is left
>> owned by the caller.  The patch changes it to return other flags in
>> bits 4-7 and changes the consumed/returned logic.
>> 
>> The problem is that there is still plenty of code that compares it
>> against NETDEV_TX_OK to find out if the skb was consumed.
> 
> Drivers are not supposed to return those new flag bits, the new flag
> bits as return values exist only in the packet scheduler path.

And BTW, NETDEV_TX_OK is only ever returned by itself, the
flag bits only get set when a non-NETDEV_TX_OK value is returned.

So we really haven't changed semantics at all, NETDEV_TX_OK (which is
zero) and non-zero are the two valid return value cases.

^ permalink raw reply

* Re: crash with bridge and inconsistent handling of NETDEV_TX_OK
From: David Miller @ 2010-04-20 23:45 UTC (permalink / raw)
  To: mpatocka; +Cc: lists.linux-foundation.org, netdev, kaber
In-Reply-To: <Pine.LNX.4.64.1004201834190.29017@hs20-bc2-1.build.redhat.com>

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Tue, 20 Apr 2010 19:40:56 -0400 (EDT)

> Reviewing the code further, I found one very weird commit
> 572a9d7b6fc7f20f573664063324c086be310c42 committed to 2.6.33. What
> it does, it changes the semantics of ndo_hard_start_xmit(). Prior to
> the patch, the meaning was --- return zero (NETDEV_TX_OK) --- the
> skb is consumed by the driver. Returns non-zero --- the skb is left
> owned by the caller.  The patch changes it to return other flags in
> bits 4-7 and changes the consumed/returned logic.
> 
> The problem is that there is still plenty of code that compares it
> against NETDEV_TX_OK to find out if the skb was consumed.

Drivers are not supposed to return those new flag bits, the new flag
bits as return values exist only in the packet scheduler path.

We're not reverting the commit you mention, we're going to fix
whatever bug exists instead.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-20 23:42 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: netdev
In-Reply-To: <v2n46c8cb3e1004201618s37e9f0c6j7527e59673a34322@mail.gmail.com>

Le mercredi 21 avril 2010 à 04:18 +0500, Gaspar Chilingarov a écrit :
> >
> > Its doable, only if you bind() your sockets before connect()
> >
> > For each socket, you'll need to chose an (local IP, local port) not
> > already in use.
> >
> > kernel wont magically select one source IP from the pool you have.
> >
> 
> I expect that I should select source IP and kernel should find some
> unused local_port for me.
> 
> I'm binding before the connect, of course ;)
> 
> 

Yes, you bind the IP, but let kernel choose the port.

In this case, kernel is a bit dumb (or too smart ?)

If you want to check source, its in file net/ipv4/inet_connection_sock.c

function inet_csk_get_port()

you can remove the 

if (atomic_read(&hashinfo->bsockets) > (high - low) + 1) { 
	spin_unlock(&head->lock); 
	snum = smallest_rover; 
	goto have_snum;
}

It will solve your problem (but bind() will probably be slow)




^ permalink raw reply

* crash with bridge and inconsistent handling of NETDEV_TX_OK
From: Mikulas Patocka @ 2010-04-20 23:40 UTC (permalink / raw)
  To: lists.linux-foundation.org, netdev; +Cc: kaber, davem

Hi

I got this crash on 2.6.34-rc4 when using it as a bridge. The crash is not 
reproducible. There are two interfaces, eth0 (sun-hme) and eth1 (tg3) and 
there is a bridge between them.

The crash was in inlined function skb_drop_list, inlined from 
skb_drop_fraglist from skb_release_data. The "list" pointer was 0x18.

This is backtrace:
skb_release_data+7c/e0
__kfree_skb+c/c0
dev_hard_start_xmit+288/3c0
sch_direct_xmit+12c/1c0
dev_queue_xmit+3fc/520
br_dev_queue_push_xmit+60/80
br_handle_frame_finish+100/1e00
br_handle_frame+168/240
netif_receive_skb+274/480
process_backlog+64/c0
net_rx_action+cc/180
__do_softirq+94/120

Settings for for eth0 and eth1 are:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

For br0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: off
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: off
large-receive-offload: off

(there is no netfilter compiled on that machine)

I'm curious --- this code path in dev_hard_start_xmit is taken only if GSO 
is used. From the trace, it can be seen that a packet was received by one 
nic and forwarded by the bridge to the other nic. How could GSO be used in 
this scenario?

---

Reviewing the code further, I found one very weird commit 
572a9d7b6fc7f20f573664063324c086be310c42 committed to 2.6.33. What it 
does, it changes the semantics of ndo_hard_start_xmit(). Prior to the 
patch, the meaning was --- return zero (NETDEV_TX_OK) --- the skb is 
consumed by the driver. Returns non-zero --- the skb is left owned by the 
caller. The patch changes it to return other flags in bits 4-7 and changes 
the consumed/returned logic.

The problem is that there is still plenty of code that compares it against 
NETDEV_TX_OK to find out if the skb was consumed.

I don't know if this caused my crash (it is not reproducible), but the 
patch is buggy and dangerous.

Handling of the return codes is inconsistent:
* most callers use dev_xmit_complete(int rc) { return rc < 
NET_XMIT_MASK(0x0f); } to test if the skb was consumed. This seems to be 
the method to be used intended by the developers. (why not <= 0x0f ?)

* dev_hard_start_xmit uses rc == NETDEV_TX_OK || rc & ~NETDEV_TX_MASK 
(~0xf0) to test if the skb is consumed. This is differs from 
dev_xmit_complete for some values.

net/core/netpoll.c : at two places, it compares the return value of 
ndo_start_xmit with NETDEV_TX_OK
net/sched/sch_teql.c : compares with NETDEV_TX_OK
drivers/net/wan/dlci.c : ignores the return value of ndo_start_xmit

--- the above inconsistencies lead to situations where both the caller and 
the callee think that they own the skb, and the result is memory 
corruption.

If we grep for places that assume that the packet was consumed if the 
return code is NETDEV_TX_OK, there are even more:
drivers/net/qla3xxx.c
drivers/net/chelsio/sge.c
drivers/net/wireless/hostap/hostap_80211_tx.c
drivers/net/wireless/ipw2x00/libipw_tx.c
drivers/net/sfc/selftest.c
net/mac80211/tx.c
--- Not all these cases are bugs, because sometimes the return value is 
self-generated by the driver. But it is just confusing and it may trigger 
bugs in the future.

---

I'd recommend to revert 572a9d7b6fc7f20f573664063324c086be310c42 because 
it made several bugs (the code that compared the return value against 
NETDEV_TX_OK was correct before and is buggy after the patch). 
Furthermore, there may be hidden bugs, if someone compares the return 
value with 0 and frees skb based on this comparison, it is impossible to 
find it with grep, yet changing the meaning of return values would make a 
bug here.

Mikulas

^ permalink raw reply

* Re: [PATCH net-next-2.6] net: sk_sleep() helper
From: David Miller @ 2010-04-20 23:39 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev
In-Reply-To: <1271804631.7895.519.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 21 Apr 2010 01:03:51 +0200

> [PATCH net-next-2.6] net: sk_sleep() helper

Looks good, applied, thanks Eric.

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: David Miller @ 2010-04-20 23:38 UTC (permalink / raw)
  To: xiaosuo; +Cc: eric.dumazet, franco, therbert, netdev
In-Reply-To: <v2s412e6f7f1004201635m5258e777oac49fea8f9625bcf@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 21 Apr 2010 07:35:48 +0800

> On Wed, Apr 21, 2010 at 5:41 AM, David Miller <davem@davemloft.net> wrote:
>> Eric, do you remember that "TCP friends" rough patch I sent you last
>> year that essentailly made TCP sockets over loopback behave like
>> AF_UNIX ones and just queue the SKBs directly to the destination
>> socket without doing any protocol work?
> 
> I think it will break some benchmark tools.

Other systems already do this optimization, so if things break, this
breakage is already pervasive.

We should be able to tell people that they can use TCP solely in their
applications and it will perform optimally regardless of transport.

People already code their applications this way, and ignoring
this issue would just makes us stupid.

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: Changli Gao @ 2010-04-20 23:35 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, franco, therbert, netdev
In-Reply-To: <20100420.144106.118596093.davem@davemloft.net>

On Wed, Apr 21, 2010 at 5:41 AM, David Miller <davem@davemloft.net> wrote:
> Eric, do you remember that "TCP friends" rough patch I sent you last
> year that essentailly made TCP sockets over loopback behave like
> AF_UNIX ones and just queue the SKBs directly to the destination
> socket without doing any protocol work?

I think it will break some benchmark tools. The loopback device is for
testing networking protocol stacks, so we shouldn't bypass the
protocol processing. And anyone who has a performance problem of
loopback device should turn to UNIX domain socket.

For routers, how about letting users choose whether RPS mixes layer 4 info in?

-- 
Regards，
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* Re: [PATCHv2 1/7] X25: Add if_x25.h and x25 to device identifiers
From: David Miller @ 2010-04-20 23:35 UTC (permalink / raw)
  To: andrew.hendry; +Cc: netdev
In-Reply-To: <1271719717.2802.77.camel@ibex>

From: Andrew Hendry <andrew.hendry@gmail.com>
Date: Tue, 20 Apr 2010 09:28:37 +1000

> diff --git a/include/linux/if_x25.h b/include/linux/if_x25.h
> new file mode 100644
> index 0000000..897765f
> --- /dev/null
> +++ b/include/linux/if_x25.h
> @@ -0,0 +1,26 @@
> +/*
> + *  Linux X.25 packet to device interface

Headers meant to be used by userspace must be added
to the include/linux/Kbuild file.

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Gaspar Chilingarov @ 2010-04-20 23:35 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <4BCE392F.60104@candelatech.com>

> I have 40,000 out-going connections to myself..basically I'm
> acting as both client and server.  So, I'd expect you to be

What is yours

sysctl -a | grep local_port_range

?


>
> Be sure to use lots of processes so you don't end up with
> a poll loop with 80,000 fd entries.  I found that 10k sockets
> per process was a good upper limit..but it's not a hard
> upper limit.

sure -- it's a next step :)

>
> And be sure to do the bind logic correctly as Eric pointed out in an
> earlier email.
>
> Thanks,
> Ben
>
> --
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com
>



-- 
Gaspar Chilingarov

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Ben Greear @ 2010-04-20 23:30 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: netdev
In-Reply-To: <u2t46c8cb3e1004201620tf9fce087ibae421759f9e1e84@mail.gmail.com>

On 04/20/2010 04:20 PM, Gaspar Chilingarov wrote:
> Opening that amount of incoming connections in not a problem at all.
>
> I'm speaking of outgoing connections.
>
> What kind of hack do you have on your kernel?

send-to-self logic and similar..but nothing that should be
better at doing lots of sockets than the default kernel.

I have 40,000 out-going connections to myself..basically I'm
acting as both client and server.  So, I'd expect you to be
able to do around 80,000 connections to some external system on
similar hardware and efficient client software.  At least with
my software, I'd need more RAM to scale much beyond that, but I'm
trying to generate traffic on these sockets (which requires user-space
buffers per socket in my implementation), and I have a decent bit
of overhead gathering stats and such.  A simpler application with
less overhead should do more sockets.

Be sure to use lots of processes so you don't end up with
a poll loop with 80,000 fd entries.  I found that 10k sockets
per process was a good upper limit..but it's not a hard
upper limit.

And be sure to do the bind logic correctly as Eric pointed out in an
earlier email.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Gaspar Chilingarov @ 2010-04-20 23:20 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <4BCE33B9.8050101@candelatech.com>

Opening that amount of incoming connections in not a problem at all.

I'm speaking of outgoing connections.

What kind of hack do you have on your kernel?

Thanks in advance,
Gaspar

2010/4/21 Ben Greear <greearb@candelatech.com>:
> On 04/20/2010 03:17 PM, Gaspar Chilingarov wrote:
>
>> I would be grateful for hints where to look in the source -- may be I
>> can produce some working patches for it.
>
> I've opened 40,000 connections to/from my machine (over external interfaces)
> on a slightly hacked .31 kernel.  This is 80,000 sockets total.  I bind
> to local IP and port range, as well as SO_BINDTODEVICE.
>
> I'm using a 64-bit system with 12GB of RAM, quad-core i7 3.3Ghz, etc.
>
> It takes a lot of RAM to do this but you can probably use less RAM in
> user-space
> than I am.  Be sure to set your socket buffer sizes small.
>
> Thanks,
> Ben
>
>
> --
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com
>

-- 
Gaspar Chilingarov

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Gaspar Chilingarov @ 2010-04-20 23:20 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <4BCE33B9.8050101@candelatech.com>

Opening that amount of incoming connections in not a problem at all.

I'm speaking of outgoing connections.

What kind of hack do you have on your kernel?

Thanks in advance,
Gaspar

2010/4/21 Ben Greear <greearb@candelatech.com>:
> On 04/20/2010 03:17 PM, Gaspar Chilingarov wrote:
>
>> I would be grateful for hints where to look in the source -- may be I
>> can produce some working patches for it.
>
> I've opened 40,000 connections to/from my machine (over external interfaces)
> on a slightly hacked .31 kernel.  This is 80,000 sockets total.  I bind
> to local IP and port range, as well as SO_BINDTODEVICE.
>
> I'm using a 64-bit system with 12GB of RAM, quad-core i7 3.3Ghz, etc.
>
> It takes a lot of RAM to do this but you can probably use less RAM in
> user-space
> than I am.  Be sure to set your socket buffer sizes small.
>
> Thanks,
> Ben
>
>
> --
> Ben Greear <greearb@candelatech.com>
> Candela Technologies Inc  http://www.candelatech.com
>

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Gaspar Chilingarov @ 2010-04-20 23:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1271802676.7895.457.camel@edumazet-laptop>

>
> Its doable, only if you bind() your sockets before connect()
>
> For each socket, you'll need to chose an (local IP, local port) not
> already in use.
>
> kernel wont magically select one source IP from the pool you have.
>

I expect that I should select source IP and kernel should find some
unused local_port for me.

I'm binding before the connect, of course ;)

-- 
Gaspar Chilingarov

^ permalink raw reply

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of  outgoing connections (unable to bind ... )
From: Ben Greear @ 2010-04-20 23:07 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: netdev
In-Reply-To: <g2z46c8cb3e1004201517i5641a75cze2ec5bd33e81fb0f@mail.gmail.com>

On 04/20/2010 03:17 PM, Gaspar Chilingarov wrote:

> I would be grateful for hints where to look in the source -- may be I
> can produce some working patches for it.

I've opened 40,000 connections to/from my machine (over external interfaces)
on a slightly hacked .31 kernel.  This is 80,000 sockets total.  I bind
to local IP and port range, as well as SO_BINDTODEVICE.

I'm using a 64-bit system with 12GB of RAM, quad-core i7 3.3Ghz, etc.

It takes a lot of RAM to do this but you can probably use less RAM in user-space
than I am.  Be sure to set your socket buffer sizes small.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* [PATCH net-next-2.6] net: sk_sleep() helper
From: Eric Dumazet @ 2010-04-20 23:03 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <1271199845.16881.586.camel@edumazet-laptop>

Le mercredi 14 avril 2010 à 01:04 +0200, Eric Dumazet a écrit :

> I am now working on sk_callback_lock case, to speedup
> sock_def_readable(), sock_def_write_space() in typical cases
> (SOCK_FASYNC not set)
> 
> Instead of using rcu on whole "struct socket", my plan is to use a small
> structure :
> 
> struct wait_queue_head_rcu {
> 	wait_queue_head_t wait;
> 	struct rcu_head	  rcu;
> } ____cacheline_aligned_in_smp;
> 
> and make sk->sk_sleep points to this 'wait' field.
> 

I am preparing this now, and submit this preliminary patch.

Thanks !

[PATCH net-next-2.6] net: sk_sleep() helper

Define a new function to return the waitqueue of a "struct sock".

static inline wait_queue_head_t *sk_sleep(struct sock *sk)
{
	return sk->sk_sleep;
}

Change all read occurrences of sk_sleep by a call to this function.

Needed for a future RCU conversion. sk_sleep wont be a field directly
available.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 drivers/atm/atmtcp.c            |    6 +-
 drivers/net/macvtap.c           |    4 -
 drivers/net/tun.c               |    4 -
 drivers/scsi/iscsi_tcp.c        |    4 -
 include/net/sock.h              |   10 +++-
 include/net/tcp.h               |    2 
 net/atm/common.c                |   12 ++---
 net/atm/signaling.c             |    2 
 net/atm/svc.c                   |   62 +++++++++++++++---------------
 net/ax25/af_ax25.c              |    8 +--
 net/bluetooth/af_bluetooth.c    |    6 +-
 net/bluetooth/bnep/core.c       |    8 +--
 net/bluetooth/bnep/netdev.c     |    6 +-
 net/bluetooth/cmtp/cmtp.h       |    2 
 net/bluetooth/cmtp/core.c       |    4 -
 net/bluetooth/hidp/core.c       |   10 ++--
 net/bluetooth/hidp/hidp.h       |    4 -
 net/bluetooth/l2cap.c           |    4 -
 net/bluetooth/rfcomm/sock.c     |    8 +--
 net/bluetooth/sco.c             |    4 -
 net/caif/caif_socket.c          |    2 
 net/core/datagram.c             |    6 +-
 net/core/sock.c                 |   16 +++----
 net/core/stream.c               |   16 +++----
 net/dccp/output.c               |    6 +-
 net/dccp/proto.c                |    2 
 net/decnet/af_decnet.c          |   26 ++++++------
 net/ipv4/af_inet.c              |    6 +-
 net/ipv4/inet_connection_sock.c |    4 -
 net/ipv4/tcp.c                  |    2 
 net/irda/af_irda.c              |   14 +++---
 net/iucv/af_iucv.c              |   12 ++---
 net/llc/af_llc.c                |   12 ++---
 net/netfilter/ipvs/ip_vs_sync.c |    2 
 net/netrom/af_netrom.c          |    8 +--
 net/rds/af_rds.c                |    2 
 net/rds/rds.h                   |    2 
 net/rds/recv.c                  |    2 
 net/rds/send.c                  |    2 
 net/rose/af_rose.c              |    8 +--
 net/rxrpc/af_rxrpc.c            |    4 -
 net/sctp/socket.c               |   20 ++++-----
 net/sunrpc/svcsock.c            |   24 +++++------
 net/tipc/socket.c               |   26 ++++++------
 net/unix/af_unix.c              |   10 ++--
 net/x25/af_x25.c                |    8 +--
 46 files changed, 208 insertions(+), 204 deletions(-)

diff --git a/drivers/atm/atmtcp.c b/drivers/atm/atmtcp.c
index b867121..b910181 100644
--- a/drivers/atm/atmtcp.c
+++ b/drivers/atm/atmtcp.c
@@ -68,7 +68,7 @@ static int atmtcp_send_control(struct atm_vcc *vcc,int type,
 	*(struct atm_vcc **) &new_msg->vcc = vcc;
 	old_test = test_bit(flag,&vcc->flags);
 	out_vcc->push(out_vcc,skb);
-	add_wait_queue(sk_atm(vcc)->sk_sleep, &wait);
+	add_wait_queue(sk_sleep(sk_atm(vcc)), &wait);
 	while (test_bit(flag,&vcc->flags) == old_test) {
 		mb();
 		out_vcc = PRIV(vcc->dev) ? PRIV(vcc->dev)->vcc : NULL;
@@ -80,7 +80,7 @@ static int atmtcp_send_control(struct atm_vcc *vcc,int type,
 		schedule();
 	}
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk_atm(vcc)->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk_atm(vcc)), &wait);
 	return error;
 }
 
@@ -105,7 +105,7 @@ static int atmtcp_recv_control(const struct atmtcp_control *msg)
 		    msg->type);
 		return -EINVAL;
 	}
-	wake_up(sk_atm(vcc)->sk_sleep);
+	wake_up(sk_sleep(sk_atm(vcc)));
 	return 0;
 }
 
diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
index abba3cc..85d6420 100644
--- a/drivers/net/macvtap.c
+++ b/drivers/net/macvtap.c
@@ -246,8 +246,8 @@ static void macvtap_sock_write_space(struct sock *sk)
 	    !test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
 		return;
 
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_poll(sk->sk_sleep, POLLOUT | POLLWRNORM | POLLWRBAND);
+	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible_poll(sk_sleep(sk), POLLOUT | POLLWRNORM | POLLWRBAND);
 }
 
 static int macvtap_open(struct inode *inode, struct file *file)
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 4326520..20a1793 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -868,8 +868,8 @@ static void tun_sock_write_space(struct sock *sk)
 	if (!test_and_clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags))
 		return;
 
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT |
+	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible_sync_poll(sk_sleep(sk), POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 
 	tun = tun_sk(sk)->tun;
diff --git a/drivers/scsi/iscsi_tcp.c b/drivers/scsi/iscsi_tcp.c
index 0ee725c..9eae04a 100644
--- a/drivers/scsi/iscsi_tcp.c
+++ b/drivers/scsi/iscsi_tcp.c
@@ -599,9 +599,9 @@ static void iscsi_sw_tcp_conn_stop(struct iscsi_cls_conn *cls_conn, int flag)
 	set_bit(ISCSI_SUSPEND_BIT, &conn->suspend_rx);
 	write_unlock_bh(&tcp_sw_conn->sock->sk->sk_callback_lock);
 
-	if (sock->sk->sk_sleep && waitqueue_active(sock->sk->sk_sleep)) {
+	if (sk_sleep(sock->sk) && waitqueue_active(sk_sleep(sock->sk))) {
 		sock->sk->sk_err = EIO;
-		wake_up_interruptible(sock->sk->sk_sleep);
+		wake_up_interruptible(sk_sleep(sock->sk));
 	}
 
 	iscsi_conn_stop(cls_conn, flag);
diff --git a/include/net/sock.h b/include/net/sock.h
index 56df440..8ab0514 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1160,6 +1160,10 @@ static inline void sk_set_socket(struct sock *sk, struct socket *sock)
 	sk->sk_socket = sock;
 }
 
+static inline wait_queue_head_t *sk_sleep(struct sock *sk)
+{
+	return sk->sk_sleep;
+}
 /* Detach socket from process context.
  * Announce socket dead, detach it from wait queue and inode.
  * Note that parent inode held reference count on this struct sock,
@@ -1346,8 +1350,8 @@ static inline int sk_has_allocations(const struct sock *sk)
  *   tp->rcv_nxt check   sock_def_readable
  *   ...                 {
  *   schedule               ...
- *                          if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
- *                              wake_up_interruptible(sk->sk_sleep)
+ *                          if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+ *                              wake_up_interruptible(sk_sleep(sk))
  *                          ...
  *                       }
  *
@@ -1368,7 +1372,7 @@ static inline int sk_has_sleeper(struct sock *sk)
 	 * This memory barrier is paired in the sock_poll_wait.
 	 */
 	smp_mb__after_lock();
-	return sk->sk_sleep && waitqueue_active(sk->sk_sleep);
+	return sk_sleep(sk) && waitqueue_active(sk_sleep(sk));
 }
 
 /**
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 70c5159..b7d83d2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -939,7 +939,7 @@ static inline int tcp_prequeue(struct sock *sk, struct sk_buff *skb)
 
 		tp->ucopy.memory = 0;
 	} else if (skb_queue_len(&tp->ucopy.prequeue) == 1) {
-		wake_up_interruptible_sync_poll(sk->sk_sleep,
+		wake_up_interruptible_sync_poll(sk_sleep(sk),
 					   POLLIN | POLLRDNORM | POLLRDBAND);
 		if (!inet_csk_ack_scheduled(sk))
 			inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
diff --git a/net/atm/common.c b/net/atm/common.c
index 97ed94a..e3e10e6 100644
--- a/net/atm/common.c
+++ b/net/atm/common.c
@@ -92,7 +92,7 @@ static void vcc_def_wakeup(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
 	if (sk_has_sleeper(sk))
-		wake_up(sk->sk_sleep);
+		wake_up(sk_sleep(sk));
 	read_unlock(&sk->sk_callback_lock);
 }
 
@@ -110,7 +110,7 @@ static void vcc_write_space(struct sock *sk)
 
 	if (vcc_writable(sk)) {
 		if (sk_has_sleeper(sk))
-			wake_up_interruptible(sk->sk_sleep);
+			wake_up_interruptible(sk_sleep(sk));
 
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
@@ -549,7 +549,7 @@ int vcc_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *m,
 	}
 
 	eff = (size+3) & ~3; /* align to word boundary */
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	error = 0;
 	while (!(skb = alloc_tx(vcc, eff))) {
 		if (m->msg_flags & MSG_DONTWAIT) {
@@ -568,9 +568,9 @@ int vcc_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *m,
 			send_sig(SIGPIPE, current, 0);
 			break;
 		}
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	if (error)
 		goto out;
 	skb->dev = NULL; /* for paths shared with net_device interfaces */
@@ -595,7 +595,7 @@ unsigned int vcc_poll(struct file *file, struct socket *sock, poll_table *wait)
 	struct atm_vcc *vcc;
 	unsigned int mask;
 
-	sock_poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
 
 	vcc = ATM_SD(sock);
diff --git a/net/atm/signaling.c b/net/atm/signaling.c
index 6ba6e46..509c8ac 100644
--- a/net/atm/signaling.c
+++ b/net/atm/signaling.c
@@ -131,7 +131,7 @@ static int sigd_send(struct atm_vcc *vcc, struct sk_buff *skb)
 		}
 		sk->sk_ack_backlog++;
 		skb_queue_tail(&sk->sk_receive_queue, skb);
-		pr_debug("waking sk->sk_sleep 0x%p\n", sk->sk_sleep);
+		pr_debug("waking sk_sleep(sk) 0x%p\n", sk_sleep(sk));
 		sk->sk_state_change(sk);
 as_indicate_complete:
 		release_sock(sk);
diff --git a/net/atm/svc.c b/net/atm/svc.c
index 3ba9a45..754ee47 100644
--- a/net/atm/svc.c
+++ b/net/atm/svc.c
@@ -49,14 +49,14 @@ static void svc_disconnect(struct atm_vcc *vcc)
 
 	pr_debug("%p\n", vcc);
 	if (test_bit(ATM_VF_REGIS, &vcc->flags)) {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_UNINTERRUPTIBLE);
 		sigd_enq(vcc, as_close, NULL, NULL, NULL);
 		while (!test_bit(ATM_VF_RELEASED, &vcc->flags) && sigd) {
 			schedule();
-			prepare_to_wait(sk->sk_sleep, &wait,
+			prepare_to_wait(sk_sleep(sk), &wait,
 					TASK_UNINTERRUPTIBLE);
 		}
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 	}
 	/* beware - socket is still in use by atmsigd until the last
 	   as_indicate has been answered */
@@ -125,13 +125,13 @@ static int svc_bind(struct socket *sock, struct sockaddr *sockaddr,
 	}
 	vcc->local = *addr;
 	set_bit(ATM_VF_WAITING, &vcc->flags);
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_UNINTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_UNINTERRUPTIBLE);
 	sigd_enq(vcc, as_bind, NULL, NULL, &vcc->local);
 	while (test_bit(ATM_VF_WAITING, &vcc->flags) && sigd) {
 		schedule();
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_UNINTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	clear_bit(ATM_VF_REGIS, &vcc->flags); /* doesn't count */
 	if (!sigd) {
 		error = -EUNATCH;
@@ -201,10 +201,10 @@ static int svc_connect(struct socket *sock, struct sockaddr *sockaddr,
 		}
 		vcc->remote = *addr;
 		set_bit(ATM_VF_WAITING, &vcc->flags);
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		sigd_enq(vcc, as_connect, NULL, NULL, &vcc->remote);
 		if (flags & O_NONBLOCK) {
-			finish_wait(sk->sk_sleep, &wait);
+			finish_wait(sk_sleep(sk), &wait);
 			sock->state = SS_CONNECTING;
 			error = -EINPROGRESS;
 			goto out;
@@ -213,7 +213,7 @@ static int svc_connect(struct socket *sock, struct sockaddr *sockaddr,
 		while (test_bit(ATM_VF_WAITING, &vcc->flags) && sigd) {
 			schedule();
 			if (!signal_pending(current)) {
-				prepare_to_wait(sk->sk_sleep, &wait,
+				prepare_to_wait(sk_sleep(sk), &wait,
 						TASK_INTERRUPTIBLE);
 				continue;
 			}
@@ -232,14 +232,14 @@ static int svc_connect(struct socket *sock, struct sockaddr *sockaddr,
 			 */
 			sigd_enq(vcc, as_close, NULL, NULL, NULL);
 			while (test_bit(ATM_VF_WAITING, &vcc->flags) && sigd) {
-				prepare_to_wait(sk->sk_sleep, &wait,
+				prepare_to_wait(sk_sleep(sk), &wait,
 						TASK_INTERRUPTIBLE);
 				schedule();
 			}
 			if (!sk->sk_err)
 				while (!test_bit(ATM_VF_RELEASED, &vcc->flags) &&
 				       sigd) {
-					prepare_to_wait(sk->sk_sleep, &wait,
+					prepare_to_wait(sk_sleep(sk), &wait,
 							TASK_INTERRUPTIBLE);
 					schedule();
 				}
@@ -250,7 +250,7 @@ static int svc_connect(struct socket *sock, struct sockaddr *sockaddr,
 			error = -EINTR;
 			break;
 		}
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 		if (error)
 			goto out;
 		if (!sigd) {
@@ -302,13 +302,13 @@ static int svc_listen(struct socket *sock, int backlog)
 		goto out;
 	}
 	set_bit(ATM_VF_WAITING, &vcc->flags);
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_UNINTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_UNINTERRUPTIBLE);
 	sigd_enq(vcc, as_listen, NULL, NULL, &vcc->local);
 	while (test_bit(ATM_VF_WAITING, &vcc->flags) && sigd) {
 		schedule();
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_UNINTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	if (!sigd) {
 		error = -EUNATCH;
 		goto out;
@@ -343,7 +343,7 @@ static int svc_accept(struct socket *sock, struct socket *newsock, int flags)
 	while (1) {
 		DEFINE_WAIT(wait);
 
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		while (!(skb = skb_dequeue(&sk->sk_receive_queue)) &&
 		       sigd) {
 			if (test_bit(ATM_VF_RELEASED, &old_vcc->flags))
@@ -363,10 +363,10 @@ static int svc_accept(struct socket *sock, struct socket *newsock, int flags)
 				error = -ERESTARTSYS;
 				break;
 			}
-			prepare_to_wait(sk->sk_sleep, &wait,
+			prepare_to_wait(sk_sleep(sk), &wait,
 					TASK_INTERRUPTIBLE);
 		}
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 		if (error)
 			goto out;
 		if (!skb) {
@@ -392,17 +392,17 @@ static int svc_accept(struct socket *sock, struct socket *newsock, int flags)
 		}
 		/* wait should be short, so we ignore the non-blocking flag */
 		set_bit(ATM_VF_WAITING, &new_vcc->flags);
-		prepare_to_wait(sk_atm(new_vcc)->sk_sleep, &wait,
+		prepare_to_wait(sk_sleep(sk_atm(new_vcc)), &wait,
 				TASK_UNINTERRUPTIBLE);
 		sigd_enq(new_vcc, as_accept, old_vcc, NULL, NULL);
 		while (test_bit(ATM_VF_WAITING, &new_vcc->flags) && sigd) {
 			release_sock(sk);
 			schedule();
 			lock_sock(sk);
-			prepare_to_wait(sk_atm(new_vcc)->sk_sleep, &wait,
+			prepare_to_wait(sk_sleep(sk_atm(new_vcc)), &wait,
 					TASK_UNINTERRUPTIBLE);
 		}
-		finish_wait(sk_atm(new_vcc)->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk_atm(new_vcc)), &wait);
 		if (!sigd) {
 			error = -EUNATCH;
 			goto out;
@@ -438,14 +438,14 @@ int svc_change_qos(struct atm_vcc *vcc, struct atm_qos *qos)
 	DEFINE_WAIT(wait);
 
 	set_bit(ATM_VF_WAITING, &vcc->flags);
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_UNINTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_UNINTERRUPTIBLE);
 	sigd_enq2(vcc, as_modify, NULL, NULL, &vcc->local, qos, 0);
 	while (test_bit(ATM_VF_WAITING, &vcc->flags) &&
 	       !test_bit(ATM_VF_RELEASED, &vcc->flags) && sigd) {
 		schedule();
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_UNINTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_UNINTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	if (!sigd)
 		return -EUNATCH;
 	return -sk->sk_err;
@@ -534,20 +534,20 @@ static int svc_addparty(struct socket *sock, struct sockaddr *sockaddr,
 
 	lock_sock(sk);
 	set_bit(ATM_VF_WAITING, &vcc->flags);
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	sigd_enq(vcc, as_addparty, NULL, NULL,
 		 (struct sockaddr_atmsvc *) sockaddr);
 	if (flags & O_NONBLOCK) {
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 		error = -EINPROGRESS;
 		goto out;
 	}
 	pr_debug("added wait queue\n");
 	while (test_bit(ATM_VF_WAITING, &vcc->flags) && sigd) {
 		schedule();
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	error = xchg(&sk->sk_err_soft, 0);
 out:
 	release_sock(sk);
@@ -563,13 +563,13 @@ static int svc_dropparty(struct socket *sock, int ep_ref)
 
 	lock_sock(sk);
 	set_bit(ATM_VF_WAITING, &vcc->flags);
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	sigd_enq2(vcc, as_dropparty, NULL, NULL, NULL, NULL, ep_ref);
 	while (test_bit(ATM_VF_WAITING, &vcc->flags) && sigd) {
 		schedule();
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	if (!sigd) {
 		error = -EUNATCH;
 		goto out;
diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index 65c5801..cfdfd7e 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -1281,7 +1281,7 @@ static int __must_check ax25_connect(struct socket *sock,
 		DEFINE_WAIT(wait);
 
 		for (;;) {
-			prepare_to_wait(sk->sk_sleep, &wait,
+			prepare_to_wait(sk_sleep(sk), &wait,
 					TASK_INTERRUPTIBLE);
 			if (sk->sk_state != TCP_SYN_SENT)
 				break;
@@ -1294,7 +1294,7 @@ static int __must_check ax25_connect(struct socket *sock,
 			err = -ERESTARTSYS;
 			break;
 		}
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 
 		if (err)
 			goto out_release;
@@ -1346,7 +1346,7 @@ static int ax25_accept(struct socket *sock, struct socket *newsock, int flags)
 	 *	hooked into the SABM we saved
 	 */
 	for (;;) {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		skb = skb_dequeue(&sk->sk_receive_queue);
 		if (skb)
 			break;
@@ -1364,7 +1364,7 @@ static int ax25_accept(struct socket *sock, struct socket *newsock, int flags)
 		err = -ERESTARTSYS;
 		break;
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 
 	if (err)
 		goto out;
diff --git a/net/bluetooth/af_bluetooth.c b/net/bluetooth/af_bluetooth.c
index 404a850..421c45b 100644
--- a/net/bluetooth/af_bluetooth.c
+++ b/net/bluetooth/af_bluetooth.c
@@ -288,7 +288,7 @@ unsigned int bt_sock_poll(struct file * file, struct socket *sock, poll_table *w
 
 	BT_DBG("sock %p, sk %p", sock, sk);
 
-	poll_wait(file, sk->sk_sleep, wait);
+	poll_wait(file, sk_sleep(sk), wait);
 
 	if (sk->sk_state == BT_LISTEN)
 		return bt_accept_poll(sk);
@@ -378,7 +378,7 @@ int bt_sock_wait_state(struct sock *sk, int state, unsigned long timeo)
 
 	BT_DBG("sk %p", sk);
 
-	add_wait_queue(sk->sk_sleep, &wait);
+	add_wait_queue(sk_sleep(sk), &wait);
 	while (sk->sk_state != state) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
@@ -401,7 +401,7 @@ int bt_sock_wait_state(struct sock *sk, int state, unsigned long timeo)
 			break;
 	}
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 	return err;
 }
 EXPORT_SYMBOL(bt_sock_wait_state);
diff --git a/net/bluetooth/bnep/core.c b/net/bluetooth/bnep/core.c
index 8062dad..f10b41f 100644
--- a/net/bluetooth/bnep/core.c
+++ b/net/bluetooth/bnep/core.c
@@ -474,7 +474,7 @@ static int bnep_session(void *arg)
 	set_user_nice(current, -15);
 
 	init_waitqueue_entry(&wait, current);
-	add_wait_queue(sk->sk_sleep, &wait);
+	add_wait_queue(sk_sleep(sk), &wait);
 	while (!atomic_read(&s->killed)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
@@ -496,7 +496,7 @@ static int bnep_session(void *arg)
 		schedule();
 	}
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 
 	/* Cleanup session */
 	down_write(&bnep_session_sem);
@@ -507,7 +507,7 @@ static int bnep_session(void *arg)
 	/* Wakeup user-space polling for socket errors */
 	s->sock->sk->sk_err = EUNATCH;
 
-	wake_up_interruptible(s->sock->sk->sk_sleep);
+	wake_up_interruptible(sk_sleep(s->sock->sk));
 
 	/* Release the socket */
 	fput(s->sock->file);
@@ -638,7 +638,7 @@ int bnep_del_connection(struct bnep_conndel_req *req)
 
 		/* Kill session thread */
 		atomic_inc(&s->killed);
-		wake_up_interruptible(s->sock->sk->sk_sleep);
+		wake_up_interruptible(sk_sleep(s->sock->sk));
 	} else
 		err = -ENOENT;
 
diff --git a/net/bluetooth/bnep/netdev.c b/net/bluetooth/bnep/netdev.c
index d48b33f..0faad5c 100644
--- a/net/bluetooth/bnep/netdev.c
+++ b/net/bluetooth/bnep/netdev.c
@@ -109,7 +109,7 @@ static void bnep_net_set_mc_list(struct net_device *dev)
 	}
 
 	skb_queue_tail(&sk->sk_write_queue, skb);
-	wake_up_interruptible(sk->sk_sleep);
+	wake_up_interruptible(sk_sleep(sk));
 #endif
 }
 
@@ -193,11 +193,11 @@ static netdev_tx_t bnep_net_xmit(struct sk_buff *skb,
 	/*
 	 * We cannot send L2CAP packets from here as we are potentially in a bh.
 	 * So we have to queue them and wake up session thread which is sleeping
-	 * on the sk->sk_sleep.
+	 * on the sk_sleep(sk).
 	 */
 	dev->trans_start = jiffies;
 	skb_queue_tail(&sk->sk_write_queue, skb);
-	wake_up_interruptible(sk->sk_sleep);
+	wake_up_interruptible(sk_sleep(sk));
 
 	if (skb_queue_len(&sk->sk_write_queue) >= BNEP_TX_QUEUE_LEN) {
 		BT_DBG("tx queue is full");
diff --git a/net/bluetooth/cmtp/cmtp.h b/net/bluetooth/cmtp/cmtp.h
index e4663aa..785e79e 100644
--- a/net/bluetooth/cmtp/cmtp.h
+++ b/net/bluetooth/cmtp/cmtp.h
@@ -125,7 +125,7 @@ static inline void cmtp_schedule(struct cmtp_session *session)
 {
 	struct sock *sk = session->sock->sk;
 
-	wake_up_interruptible(sk->sk_sleep);
+	wake_up_interruptible(sk_sleep(sk));
 }
 
 /* CMTP init defines */
diff --git a/net/bluetooth/cmtp/core.c b/net/bluetooth/cmtp/core.c
index 0073ec8..d4c6af0 100644
--- a/net/bluetooth/cmtp/core.c
+++ b/net/bluetooth/cmtp/core.c
@@ -284,7 +284,7 @@ static int cmtp_session(void *arg)
 	set_user_nice(current, -15);
 
 	init_waitqueue_entry(&wait, current);
-	add_wait_queue(sk->sk_sleep, &wait);
+	add_wait_queue(sk_sleep(sk), &wait);
 	while (!atomic_read(&session->terminate)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
@@ -301,7 +301,7 @@ static int cmtp_session(void *arg)
 		schedule();
 	}
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 
 	down_write(&cmtp_session_sem);
 
diff --git a/net/bluetooth/hidp/core.c b/net/bluetooth/hidp/core.c
index 280529a..bfe641b 100644
--- a/net/bluetooth/hidp/core.c
+++ b/net/bluetooth/hidp/core.c
@@ -561,8 +561,8 @@ static int hidp_session(void *arg)
 
 	init_waitqueue_entry(&ctrl_wait, current);
 	init_waitqueue_entry(&intr_wait, current);
-	add_wait_queue(ctrl_sk->sk_sleep, &ctrl_wait);
-	add_wait_queue(intr_sk->sk_sleep, &intr_wait);
+	add_wait_queue(sk_sleep(ctrl_sk), &ctrl_wait);
+	add_wait_queue(sk_sleep(intr_sk), &intr_wait);
 	while (!atomic_read(&session->terminate)) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
@@ -584,8 +584,8 @@ static int hidp_session(void *arg)
 		schedule();
 	}
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(intr_sk->sk_sleep, &intr_wait);
-	remove_wait_queue(ctrl_sk->sk_sleep, &ctrl_wait);
+	remove_wait_queue(sk_sleep(intr_sk), &intr_wait);
+	remove_wait_queue(sk_sleep(ctrl_sk), &ctrl_wait);
 
 	down_write(&hidp_session_sem);
 
@@ -609,7 +609,7 @@ static int hidp_session(void *arg)
 
 	fput(session->intr_sock->file);
 
-	wait_event_timeout(*(ctrl_sk->sk_sleep),
+	wait_event_timeout(*(sk_sleep(ctrl_sk)),
 		(ctrl_sk->sk_state == BT_CLOSED), msecs_to_jiffies(500));
 
 	fput(session->ctrl_sock->file);
diff --git a/net/bluetooth/hidp/hidp.h b/net/bluetooth/hidp/hidp.h
index a4e215d..8d934a1 100644
--- a/net/bluetooth/hidp/hidp.h
+++ b/net/bluetooth/hidp/hidp.h
@@ -164,8 +164,8 @@ static inline void hidp_schedule(struct hidp_session *session)
 	struct sock *ctrl_sk = session->ctrl_sock->sk;
 	struct sock *intr_sk = session->intr_sock->sk;
 
-	wake_up_interruptible(ctrl_sk->sk_sleep);
-	wake_up_interruptible(intr_sk->sk_sleep);
+	wake_up_interruptible(sk_sleep(ctrl_sk));
+	wake_up_interruptible(sk_sleep(intr_sk));
 }
 
 /* HIDP init defines */
diff --git a/net/bluetooth/l2cap.c b/net/bluetooth/l2cap.c
index 99d68c3..c1e60ee 100644
--- a/net/bluetooth/l2cap.c
+++ b/net/bluetooth/l2cap.c
@@ -1147,7 +1147,7 @@ static int l2cap_sock_accept(struct socket *sock, struct socket *newsock, int fl
 	BT_DBG("sk %p timeo %ld", sk, timeo);
 
 	/* Wait for an incoming connection. (wake-one). */
-	add_wait_queue_exclusive(sk->sk_sleep, &wait);
+	add_wait_queue_exclusive(sk_sleep(sk), &wait);
 	while (!(nsk = bt_accept_dequeue(sk, newsock))) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (!timeo) {
@@ -1170,7 +1170,7 @@ static int l2cap_sock_accept(struct socket *sock, struct socket *newsock, int fl
 		}
 	}
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 
 	if (err)
 		goto done;
diff --git a/net/bluetooth/rfcomm/sock.c b/net/bluetooth/rfcomm/sock.c
index 8ed3c37..43fbf6b 100644
--- a/net/bluetooth/rfcomm/sock.c
+++ b/net/bluetooth/rfcomm/sock.c
@@ -503,7 +503,7 @@ static int rfcomm_sock_accept(struct socket *sock, struct socket *newsock, int f
 	BT_DBG("sk %p timeo %ld", sk, timeo);
 
 	/* Wait for an incoming connection. (wake-one). */
-	add_wait_queue_exclusive(sk->sk_sleep, &wait);
+	add_wait_queue_exclusive(sk_sleep(sk), &wait);
 	while (!(nsk = bt_accept_dequeue(sk, newsock))) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (!timeo) {
@@ -526,7 +526,7 @@ static int rfcomm_sock_accept(struct socket *sock, struct socket *newsock, int f
 		}
 	}
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 
 	if (err)
 		goto done;
@@ -621,7 +621,7 @@ static long rfcomm_sock_data_wait(struct sock *sk, long timeo)
 {
 	DECLARE_WAITQUEUE(wait, current);
 
-	add_wait_queue(sk->sk_sleep, &wait);
+	add_wait_queue(sk_sleep(sk), &wait);
 	for (;;) {
 		set_current_state(TASK_INTERRUPTIBLE);
 
@@ -640,7 +640,7 @@ static long rfcomm_sock_data_wait(struct sock *sk, long timeo)
 	}
 
 	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 	return timeo;
 }
 
diff --git a/net/bluetooth/sco.c b/net/bluetooth/sco.c
index ca6b2ad..b406d3e 100644
--- a/net/bluetooth/sco.c
+++ b/net/bluetooth/sco.c
@@ -567,7 +567,7 @@ static int sco_sock_accept(struct socket *sock, struct socket *newsock, int flag
 	BT_DBG("sk %p timeo %ld", sk, timeo);
 
 	/* Wait for an incoming connection. (wake-one). */
-	add_wait_queue_exclusive(sk->sk_sleep, &wait);
+	add_wait_queue_exclusive(sk_sleep(sk), &wait);
 	while (!(ch = bt_accept_dequeue(sk, newsock))) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (!timeo) {
@@ -590,7 +590,7 @@ static int sco_sock_accept(struct socket *sock, struct socket *newsock, int flag
 		}
 	}
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 
 	if (err)
 		goto done;
diff --git a/net/caif/caif_socket.c b/net/caif/caif_socket.c
index cdf62b9..90317e7 100644
--- a/net/caif/caif_socket.c
+++ b/net/caif/caif_socket.c
@@ -689,7 +689,7 @@ static unsigned int caif_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	struct caifsock *cf_sk = container_of(sk, struct caifsock, sk);
 	u32 mask = 0;
-	poll_wait(file, sk->sk_sleep, wait);
+	poll_wait(file, sk_sleep(sk), wait);
 	lock_sock(&(cf_sk->sk));
 	if (!STATE_IS_OPEN(cf_sk)) {
 		if (!STATE_IS_PENDING(cf_sk))
diff --git a/net/core/datagram.c b/net/core/datagram.c
index 2dccd4e..5574a5d 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -86,7 +86,7 @@ static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
 	int error;
 	DEFINE_WAIT_FUNC(wait, receiver_wake_function);
 
-	prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait_exclusive(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 	/* Socket errors? */
 	error = sock_error(sk);
@@ -115,7 +115,7 @@ static int wait_for_packet(struct sock *sk, int *err, long *timeo_p)
 	error = 0;
 	*timeo_p = schedule_timeout(*timeo_p);
 out:
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return error;
 interrupted:
 	error = sock_intr_errno(*timeo_p);
@@ -726,7 +726,7 @@ unsigned int datagram_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	unsigned int mask;
 
-	sock_poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
 
 	/* exceptional events? */
diff --git a/net/core/sock.c b/net/core/sock.c
index 7effa1e..58ebd14 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1395,7 +1395,7 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo)
 		if (signal_pending(current))
 			break;
 		set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		if (atomic_read(&sk->sk_wmem_alloc) < sk->sk_sndbuf)
 			break;
 		if (sk->sk_shutdown & SEND_SHUTDOWN)
@@ -1404,7 +1404,7 @@ static long sock_wait_for_wmem(struct sock *sk, long timeo)
 			break;
 		timeo = schedule_timeout(timeo);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return timeo;
 }
 
@@ -1570,11 +1570,11 @@ int sk_wait_data(struct sock *sk, long *timeo)
 	int rc;
 	DEFINE_WAIT(wait);
 
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
 	rc = sk_wait_event(sk, timeo, !skb_queue_empty(&sk->sk_receive_queue));
 	clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return rc;
 }
 EXPORT_SYMBOL(sk_wait_data);
@@ -1798,7 +1798,7 @@ static void sock_def_wakeup(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
 	if (sk_has_sleeper(sk))
-		wake_up_interruptible_all(sk->sk_sleep);
+		wake_up_interruptible_all(sk_sleep(sk));
 	read_unlock(&sk->sk_callback_lock);
 }
 
@@ -1806,7 +1806,7 @@ static void sock_def_error_report(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
 	if (sk_has_sleeper(sk))
-		wake_up_interruptible_poll(sk->sk_sleep, POLLERR);
+		wake_up_interruptible_poll(sk_sleep(sk), POLLERR);
 	sk_wake_async(sk, SOCK_WAKE_IO, POLL_ERR);
 	read_unlock(&sk->sk_callback_lock);
 }
@@ -1815,7 +1815,7 @@ static void sock_def_readable(struct sock *sk, int len)
 {
 	read_lock(&sk->sk_callback_lock);
 	if (sk_has_sleeper(sk))
-		wake_up_interruptible_sync_poll(sk->sk_sleep, POLLIN |
+		wake_up_interruptible_sync_poll(sk_sleep(sk), POLLIN |
 						POLLRDNORM | POLLRDBAND);
 	sk_wake_async(sk, SOCK_WAKE_WAITD, POLL_IN);
 	read_unlock(&sk->sk_callback_lock);
@@ -1830,7 +1830,7 @@ static void sock_def_write_space(struct sock *sk)
 	 */
 	if ((atomic_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf) {
 		if (sk_has_sleeper(sk))
-			wake_up_interruptible_sync_poll(sk->sk_sleep, POLLOUT |
+			wake_up_interruptible_sync_poll(sk_sleep(sk), POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 
 		/* Should agree with poll, otherwise some programs break */
diff --git a/net/core/stream.c b/net/core/stream.c
index a37debf..7b3c3f3 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -32,8 +32,8 @@ void sk_stream_write_space(struct sock *sk)
 	if (sk_stream_wspace(sk) >= sk_stream_min_wspace(sk) && sock) {
 		clear_bit(SOCK_NOSPACE, &sock->flags);
 
-		if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-			wake_up_interruptible_poll(sk->sk_sleep, POLLOUT |
+		if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+			wake_up_interruptible_poll(sk_sleep(sk), POLLOUT |
 						POLLWRNORM | POLLWRBAND);
 		if (sock->fasync_list && !(sk->sk_shutdown & SEND_SHUTDOWN))
 			sock_wake_async(sock, SOCK_WAKE_SPACE, POLL_OUT);
@@ -66,13 +66,13 @@ int sk_stream_wait_connect(struct sock *sk, long *timeo_p)
 		if (signal_pending(tsk))
 			return sock_intr_errno(*timeo_p);
 
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		sk->sk_write_pending++;
 		done = sk_wait_event(sk, timeo_p,
 				     !sk->sk_err &&
 				     !((1 << sk->sk_state) &
 				       ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)));
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 		sk->sk_write_pending--;
 	} while (!done);
 	return 0;
@@ -96,13 +96,13 @@ void sk_stream_wait_close(struct sock *sk, long timeout)
 		DEFINE_WAIT(wait);
 
 		do {
-			prepare_to_wait(sk->sk_sleep, &wait,
+			prepare_to_wait(sk_sleep(sk), &wait,
 					TASK_INTERRUPTIBLE);
 			if (sk_wait_event(sk, &timeout, !sk_stream_closing(sk)))
 				break;
 		} while (!signal_pending(current) && timeout);
 
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 	}
 }
 
@@ -126,7 +126,7 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
 	while (1) {
 		set_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);
 
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 		if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
 			goto do_error;
@@ -157,7 +157,7 @@ int sk_stream_wait_memory(struct sock *sk, long *timeo_p)
 		*timeo_p = current_timeo;
 	}
 out:
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return err;
 
 do_error:
diff --git a/net/dccp/output.c b/net/dccp/output.c
index e98b65e..2d3dcb3 100644
--- a/net/dccp/output.c
+++ b/net/dccp/output.c
@@ -198,7 +198,7 @@ void dccp_write_space(struct sock *sk)
 	read_lock(&sk->sk_callback_lock);
 
 	if (sk_has_sleeper(sk))
-		wake_up_interruptible(sk->sk_sleep);
+		wake_up_interruptible(sk_sleep(sk));
 	/* Should agree with poll, otherwise some programs break */
 	if (sock_writeable(sk))
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
@@ -225,7 +225,7 @@ static int dccp_wait_for_ccid(struct sock *sk, struct sk_buff *skb, int delay)
 		dccp_pr_debug("delayed send by %d msec\n", delay);
 		jiffdelay = msecs_to_jiffies(delay);
 
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 		sk->sk_write_pending++;
 		release_sock(sk);
@@ -241,7 +241,7 @@ static int dccp_wait_for_ccid(struct sock *sk, struct sk_buff *skb, int delay)
 		rc = ccid_hc_tx_send_packet(dp->dccps_hc_tx_ccid, sk, skb);
 	} while ((delay = rc) > 0);
 out:
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return rc;
 
 do_error:
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index a0e38d8..b03ecf6 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -312,7 +312,7 @@ unsigned int dccp_poll(struct file *file, struct socket *sock,
 	unsigned int mask;
 	struct sock *sk = sock->sk;
 
-	sock_poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk_sleep(sk), wait);
 	if (sk->sk_state == DCCP_LISTEN)
 		return inet_csk_listen_poll(sk);
 
diff --git a/net/decnet/af_decnet.c b/net/decnet/af_decnet.c
index 55e3b6b..d6b93d1 100644
--- a/net/decnet/af_decnet.c
+++ b/net/decnet/af_decnet.c
@@ -832,7 +832,7 @@ static int dn_confirm_accept(struct sock *sk, long *timeo, gfp_t allocation)
 	scp->segsize_loc = dst_metric(__sk_dst_get(sk), RTAX_ADVMSS);
 	dn_send_conn_conf(sk, allocation);
 
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	for(;;) {
 		release_sock(sk);
 		if (scp->state == DN_CC)
@@ -850,9 +850,9 @@ static int dn_confirm_accept(struct sock *sk, long *timeo, gfp_t allocation)
 		err = -EAGAIN;
 		if (!*timeo)
 			break;
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	if (err == 0) {
 		sk->sk_socket->state = SS_CONNECTED;
 	} else if (scp->state != DN_CC) {
@@ -873,7 +873,7 @@ static int dn_wait_run(struct sock *sk, long *timeo)
 	if (!*timeo)
 		return -EALREADY;
 
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	for(;;) {
 		release_sock(sk);
 		if (scp->state == DN_CI || scp->state == DN_CC)
@@ -891,9 +891,9 @@ static int dn_wait_run(struct sock *sk, long *timeo)
 		err = -ETIMEDOUT;
 		if (!*timeo)
 			break;
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 out:
 	if (err == 0) {
 		sk->sk_socket->state = SS_CONNECTED;
@@ -1040,7 +1040,7 @@ static struct sk_buff *dn_wait_for_connect(struct sock *sk, long *timeo)
 	struct sk_buff *skb = NULL;
 	int err = 0;
 
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	for(;;) {
 		release_sock(sk);
 		skb = skb_dequeue(&sk->sk_receive_queue);
@@ -1060,9 +1060,9 @@ static struct sk_buff *dn_wait_for_connect(struct sock *sk, long *timeo)
 		err = -EAGAIN;
 		if (!*timeo)
 			break;
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 
 	return skb == NULL ? ERR_PTR(err) : skb;
 }
@@ -1746,11 +1746,11 @@ static int dn_recvmsg(struct kiocb *iocb, struct socket *sock,
 			goto out;
 		}
 
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
 		sk_wait_event(sk, &timeo, dn_data_ready(sk, queue, flags, target));
 		clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 	}
 
 	skb_queue_walk_safe(queue, skb, n) {
@@ -2003,12 +2003,12 @@ static int dn_sendmsg(struct kiocb *iocb, struct socket *sock,
 				goto out;
 			}
 
-			prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+			prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 			set_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
 			sk_wait_event(sk, &timeo,
 				      !dn_queue_too_long(scp, queue, flags));
 			clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
-			finish_wait(sk->sk_sleep, &wait);
+			finish_wait(sk_sleep(sk), &wait);
 			continue;
 		}
 
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index c5376c7..5ca7290 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -548,7 +548,7 @@ static long inet_wait_for_connect(struct sock *sk, long timeo)
 {
 	DEFINE_WAIT(wait);
 
-	prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 	/* Basic assumption: if someone sets sk->sk_err, he _must_
 	 * change state of the socket from TCP_SYN_*.
@@ -561,9 +561,9 @@ static long inet_wait_for_connect(struct sock *sk, long timeo)
 		lock_sock(sk);
 		if (signal_pending(current) || !timeo)
 			break;
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return timeo;
 }
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8da6429..e0a3e35 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -234,7 +234,7 @@ static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
 	 * having to remove and re-insert us on the wait queue.
 	 */
 	for (;;) {
-		prepare_to_wait_exclusive(sk->sk_sleep, &wait,
+		prepare_to_wait_exclusive(sk_sleep(sk), &wait,
 					  TASK_INTERRUPTIBLE);
 		release_sock(sk);
 		if (reqsk_queue_empty(&icsk->icsk_accept_queue))
@@ -253,7 +253,7 @@ static int inet_csk_wait_for_connect(struct sock *sk, long timeo)
 		if (!timeo)
 			break;
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return err;
 }
 
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 0f8caf6..7720833 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -378,7 +378,7 @@ unsigned int tcp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	struct sock *sk = sock->sk;
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	sock_poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk_sleep(sk), wait);
 	if (sk->sk_state == TCP_LISTEN)
 		return inet_csk_listen_poll(sk);
 
diff --git a/net/irda/af_irda.c b/net/irda/af_irda.c
index 2a4efce..79986a6 100644
--- a/net/irda/af_irda.c
+++ b/net/irda/af_irda.c
@@ -347,7 +347,7 @@ static void irda_flow_indication(void *instance, void *sap, LOCAL_FLOW flow)
 		self->tx_flow = flow;
 		IRDA_DEBUG(1, "%s(), IrTTP wants us to start again\n",
 			   __func__);
-		wake_up_interruptible(sk->sk_sleep);
+		wake_up_interruptible(sk_sleep(sk));
 		break;
 	default:
 		IRDA_DEBUG(0, "%s(), Unknown flow command!\n", __func__);
@@ -900,7 +900,7 @@ static int irda_accept(struct socket *sock, struct socket *newsock, int flags)
 		if (flags & O_NONBLOCK)
 			goto out;
 
-		err = wait_event_interruptible(*(sk->sk_sleep),
+		err = wait_event_interruptible(*(sk_sleep(sk)),
 					skb_peek(&sk->sk_receive_queue));
 		if (err)
 			goto out;
@@ -1066,7 +1066,7 @@ static int irda_connect(struct socket *sock, struct sockaddr *uaddr,
 		goto out;
 
 	err = -ERESTARTSYS;
-	if (wait_event_interruptible(*(sk->sk_sleep),
+	if (wait_event_interruptible(*(sk_sleep(sk)),
 				     (sk->sk_state != TCP_SYN_SENT)))
 		goto out;
 
@@ -1318,7 +1318,7 @@ static int irda_sendmsg(struct kiocb *iocb, struct socket *sock,
 
 	/* Check if IrTTP is wants us to slow down */
 
-	if (wait_event_interruptible(*(sk->sk_sleep),
+	if (wait_event_interruptible(*(sk_sleep(sk)),
 	    (self->tx_flow != FLOW_STOP  ||  sk->sk_state != TCP_ESTABLISHED))) {
 		err = -ERESTARTSYS;
 		goto out;
@@ -1477,7 +1477,7 @@ static int irda_recvmsg_stream(struct kiocb *iocb, struct socket *sock,
 			if (copied >= target)
 				break;
 
-			prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+			prepare_to_wait_exclusive(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 			/*
 			 *	POSIX 1003.1g mandates this order.
@@ -1497,7 +1497,7 @@ static int irda_recvmsg_stream(struct kiocb *iocb, struct socket *sock,
 				/* Wait process until data arrives */
 				schedule();
 
-			finish_wait(sk->sk_sleep, &wait);
+			finish_wait(sk_sleep(sk), &wait);
 
 			if (err)
 				goto out;
@@ -1787,7 +1787,7 @@ static unsigned int irda_poll(struct file * file, struct socket *sock,
 	IRDA_DEBUG(4, "%s()\n", __func__);
 
 	lock_kernel();
-	poll_wait(file, sk->sk_sleep, wait);
+	poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
 
 	/* Exceptional events? */
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index c18286a..9636b7d 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -59,7 +59,7 @@ do {									\
 	DEFINE_WAIT(__wait);						\
 	long __timeo = timeo;						\
 	ret = 0;							\
-	prepare_to_wait(sk->sk_sleep, &__wait, TASK_INTERRUPTIBLE);	\
+	prepare_to_wait(sk_sleep(sk), &__wait, TASK_INTERRUPTIBLE);	\
 	while (!(condition)) {						\
 		if (!__timeo) {						\
 			ret = -EAGAIN;					\
@@ -76,7 +76,7 @@ do {									\
 		if (ret)						\
 			break;						\
 	}								\
-	finish_wait(sk->sk_sleep, &__wait);				\
+	finish_wait(sk_sleep(sk), &__wait);				\
 } while (0)
 
 #define iucv_sock_wait(sk, condition, timeo)				\
@@ -307,7 +307,7 @@ static void iucv_sock_wake_msglim(struct sock *sk)
 {
 	read_lock(&sk->sk_callback_lock);
 	if (sk_has_sleeper(sk))
-		wake_up_interruptible_all(sk->sk_sleep);
+		wake_up_interruptible_all(sk_sleep(sk));
 	sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	read_unlock(&sk->sk_callback_lock);
 }
@@ -795,7 +795,7 @@ static int iucv_sock_accept(struct socket *sock, struct socket *newsock,
 	timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);
 
 	/* Wait for an incoming connection */
-	add_wait_queue_exclusive(sk->sk_sleep, &wait);
+	add_wait_queue_exclusive(sk_sleep(sk), &wait);
 	while (!(nsk = iucv_accept_dequeue(sk, newsock))) {
 		set_current_state(TASK_INTERRUPTIBLE);
 		if (!timeo) {
@@ -819,7 +819,7 @@ static int iucv_sock_accept(struct socket *sock, struct socket *newsock,
 	}
 
 	set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 
 	if (err)
 		goto done;
@@ -1269,7 +1269,7 @@ unsigned int iucv_sock_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	unsigned int mask = 0;
 
-	sock_poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk_sleep(sk), wait);
 
 	if (sk->sk_state == IUCV_LISTEN)
 		return iucv_accept_poll(sk);
diff --git a/net/llc/af_llc.c b/net/llc/af_llc.c
index 2db6a9f..023ba82 100644
--- a/net/llc/af_llc.c
+++ b/net/llc/af_llc.c
@@ -536,7 +536,7 @@ static int llc_ui_wait_for_disc(struct sock *sk, long timeout)
 	int rc = 0;
 
 	while (1) {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		if (sk_wait_event(sk, &timeout, sk->sk_state == TCP_CLOSE))
 			break;
 		rc = -ERESTARTSYS;
@@ -547,7 +547,7 @@ static int llc_ui_wait_for_disc(struct sock *sk, long timeout)
 			break;
 		rc = 0;
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return rc;
 }
 
@@ -556,13 +556,13 @@ static int llc_ui_wait_for_conn(struct sock *sk, long timeout)
 	DEFINE_WAIT(wait);
 
 	while (1) {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		if (sk_wait_event(sk, &timeout, sk->sk_state != TCP_SYN_SENT))
 			break;
 		if (signal_pending(current) || !timeout)
 			break;
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return timeout;
 }
 
@@ -573,7 +573,7 @@ static int llc_ui_wait_for_busy_core(struct sock *sk, long timeout)
 	int rc;
 
 	while (1) {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		rc = 0;
 		if (sk_wait_event(sk, &timeout,
 				  (sk->sk_shutdown & RCV_SHUTDOWN) ||
@@ -588,7 +588,7 @@ static int llc_ui_wait_for_busy_core(struct sock *sk, long timeout)
 		if (!timeout)
 			break;
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return rc;
 }
 
diff --git a/net/netfilter/ipvs/ip_vs_sync.c b/net/netfilter/ipvs/ip_vs_sync.c
index 8fb0ae6..7ba0693 100644
--- a/net/netfilter/ipvs/ip_vs_sync.c
+++ b/net/netfilter/ipvs/ip_vs_sync.c
@@ -802,7 +802,7 @@ static int sync_thread_backup(void *data)
 		ip_vs_backup_mcast_ifn, ip_vs_backup_syncid);
 
 	while (!kthread_should_stop()) {
-		wait_event_interruptible(*tinfo->sock->sk->sk_sleep,
+		wait_event_interruptible(*sk_sleep(tinfo->sock->sk),
 			 !skb_queue_empty(&tinfo->sock->sk->sk_receive_queue)
 			 || kthread_should_stop());
 
diff --git a/net/netrom/af_netrom.c b/net/netrom/af_netrom.c
index fa07f04..06cb027 100644
--- a/net/netrom/af_netrom.c
+++ b/net/netrom/af_netrom.c
@@ -739,7 +739,7 @@ static int nr_connect(struct socket *sock, struct sockaddr *uaddr,
 		DEFINE_WAIT(wait);
 
 		for (;;) {
-			prepare_to_wait(sk->sk_sleep, &wait,
+			prepare_to_wait(sk_sleep(sk), &wait,
 					TASK_INTERRUPTIBLE);
 			if (sk->sk_state != TCP_SYN_SENT)
 				break;
@@ -752,7 +752,7 @@ static int nr_connect(struct socket *sock, struct sockaddr *uaddr,
 			err = -ERESTARTSYS;
 			break;
 		}
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 		if (err)
 			goto out_release;
 	}
@@ -798,7 +798,7 @@ static int nr_accept(struct socket *sock, struct socket *newsock, int flags)
 	 *	hooked into the SABM we saved
 	 */
 	for (;;) {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		skb = skb_dequeue(&sk->sk_receive_queue);
 		if (skb)
 			break;
@@ -816,7 +816,7 @@ static int nr_accept(struct socket *sock, struct socket *newsock, int flags)
 		err = -ERESTARTSYS;
 		break;
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	if (err)
 		goto out_release;
 
diff --git a/net/rds/af_rds.c b/net/rds/af_rds.c
index 7919a9e..aebfecb 100644
--- a/net/rds/af_rds.c
+++ b/net/rds/af_rds.c
@@ -158,7 +158,7 @@ static unsigned int rds_poll(struct file *file, struct socket *sock,
 	unsigned int mask = 0;
 	unsigned long flags;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	poll_wait(file, sk_sleep(sk), wait);
 
 	if (rs->rs_seen_congestion)
 		poll_wait(file, &rds_poll_waitq, wait);
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 4bec6e2..c224b5b 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -492,7 +492,7 @@ void rds_sock_put(struct rds_sock *rs);
 void rds_wake_sk_sleep(struct rds_sock *rs);
 static inline void __rds_wake_sk_sleep(struct sock *sk)
 {
-	wait_queue_head_t *waitq = sk->sk_sleep;
+	wait_queue_head_t *waitq = sk_sleep(sk);
 
 	if (!sock_flag(sk, SOCK_DEAD) && waitq)
 		wake_up(waitq);
diff --git a/net/rds/recv.c b/net/rds/recv.c
index e2a2b93..795a00b 100644
--- a/net/rds/recv.c
+++ b/net/rds/recv.c
@@ -432,7 +432,7 @@ int rds_recvmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 				break;
 			}
 
-			timeo = wait_event_interruptible_timeout(*sk->sk_sleep,
+			timeo = wait_event_interruptible_timeout(*sk_sleep(sk),
 					(!list_empty(&rs->rs_notify_queue) ||
 					 rs->rs_cong_notify ||
 					 rds_next_incoming(rs, &inc)), timeo);
diff --git a/net/rds/send.c b/net/rds/send.c
index 53d6795..9c1c6bc 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -915,7 +915,7 @@ int rds_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *msg,
 			goto out;
 		}
 
-		timeo = wait_event_interruptible_timeout(*sk->sk_sleep,
+		timeo = wait_event_interruptible_timeout(*sk_sleep(sk),
 					rds_send_queue_rm(rs, conn, rm,
 							  rs->rs_bound_port,
 							  dport,
diff --git a/net/rose/af_rose.c b/net/rose/af_rose.c
index 4fb711a..8e45e76 100644
--- a/net/rose/af_rose.c
+++ b/net/rose/af_rose.c
@@ -845,7 +845,7 @@ rose_try_next_neigh:
 		DEFINE_WAIT(wait);
 
 		for (;;) {
-			prepare_to_wait(sk->sk_sleep, &wait,
+			prepare_to_wait(sk_sleep(sk), &wait,
 					TASK_INTERRUPTIBLE);
 			if (sk->sk_state != TCP_SYN_SENT)
 				break;
@@ -858,7 +858,7 @@ rose_try_next_neigh:
 			err = -ERESTARTSYS;
 			break;
 		}
-		finish_wait(sk->sk_sleep, &wait);
+		finish_wait(sk_sleep(sk), &wait);
 
 		if (err)
 			goto out_release;
@@ -911,7 +911,7 @@ static int rose_accept(struct socket *sock, struct socket *newsock, int flags)
 	 *	hooked into the SABM we saved
 	 */
 	for (;;) {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 		skb = skb_dequeue(&sk->sk_receive_queue);
 		if (skb)
@@ -930,7 +930,7 @@ static int rose_accept(struct socket *sock, struct socket *newsock, int flags)
 		err = -ERESTARTSYS;
 		break;
 	}
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	if (err)
 		goto out_release;
 
diff --git a/net/rxrpc/af_rxrpc.c b/net/rxrpc/af_rxrpc.c
index c060095..c432d76 100644
--- a/net/rxrpc/af_rxrpc.c
+++ b/net/rxrpc/af_rxrpc.c
@@ -65,7 +65,7 @@ static void rxrpc_write_space(struct sock *sk)
 	read_lock(&sk->sk_callback_lock);
 	if (rxrpc_writable(sk)) {
 		if (sk_has_sleeper(sk))
-			wake_up_interruptible(sk->sk_sleep);
+			wake_up_interruptible(sk_sleep(sk));
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 	read_unlock(&sk->sk_callback_lock);
@@ -589,7 +589,7 @@ static unsigned int rxrpc_poll(struct file *file, struct socket *sock,
 	unsigned int mask;
 	struct sock *sk = sock->sk;
 
-	sock_poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
 
 	/* the socket is readable if there are any messages waiting on the Rx
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index c194127..f34adcc 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -5702,7 +5702,7 @@ unsigned int sctp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	struct sctp_sock *sp = sctp_sk(sk);
 	unsigned int mask;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	poll_wait(file, sk_sleep(sk), wait);
 
 	/* A TCP-style listening socket becomes readable when the accept queue
 	 * is not empty.
@@ -5943,7 +5943,7 @@ static int sctp_wait_for_packet(struct sock * sk, int *err, long *timeo_p)
 	int error;
 	DEFINE_WAIT(wait);
 
-	prepare_to_wait_exclusive(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+	prepare_to_wait_exclusive(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 	/* Socket errors? */
 	error = sock_error(sk);
@@ -5980,14 +5980,14 @@ static int sctp_wait_for_packet(struct sock * sk, int *err, long *timeo_p)
 	sctp_lock_sock(sk);
 
 ready:
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	return 0;
 
 interrupted:
 	error = sock_intr_errno(*timeo_p);
 
 out:
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	*err = error;
 	return error;
 }
@@ -6061,8 +6061,8 @@ static void __sctp_write_space(struct sctp_association *asoc)
 			wake_up_interruptible(&asoc->wait);
 
 		if (sctp_writeable(sk)) {
-			if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-				wake_up_interruptible(sk->sk_sleep);
+			if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+				wake_up_interruptible(sk_sleep(sk));
 
 			/* Note that we try to include the Async I/O support
 			 * here by modeling from the current TCP/UDP code.
@@ -6296,7 +6296,7 @@ static int sctp_wait_for_accept(struct sock *sk, long timeo)
 
 
 	for (;;) {
-		prepare_to_wait_exclusive(sk->sk_sleep, &wait,
+		prepare_to_wait_exclusive(sk_sleep(sk), &wait,
 					  TASK_INTERRUPTIBLE);
 
 		if (list_empty(&ep->asocs)) {
@@ -6322,7 +6322,7 @@ static int sctp_wait_for_accept(struct sock *sk, long timeo)
 			break;
 	}
 
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 
 	return err;
 }
@@ -6332,7 +6332,7 @@ static void sctp_wait_for_close(struct sock *sk, long timeout)
 	DEFINE_WAIT(wait);
 
 	do {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 		if (list_empty(&sctp_sk(sk)->ep->asocs))
 			break;
 		sctp_release_sock(sk);
@@ -6340,7 +6340,7 @@ static void sctp_wait_for_close(struct sock *sk, long timeout)
 		sctp_lock_sock(sk);
 	} while (!signal_pending(current) && timeout);
 
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 }
 
 static void sctp_skb_set_owner_r_frag(struct sk_buff *skb, struct sock *sk)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index a29f259..ce0d5b3 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -419,8 +419,8 @@ static void svc_udp_data_ready(struct sock *sk, int count)
 		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 		svc_xprt_enqueue(&svsk->sk_xprt);
 	}
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible(sk->sk_sleep);
+	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible(sk_sleep(sk));
 }
 
 /*
@@ -436,10 +436,10 @@ static void svc_write_space(struct sock *sk)
 		svc_xprt_enqueue(&svsk->sk_xprt);
 	}
 
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep)) {
+	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk))) {
 		dprintk("RPC svc_write_space: someone sleeping on %p\n",
 		       svsk);
-		wake_up_interruptible(sk->sk_sleep);
+		wake_up_interruptible(sk_sleep(sk));
 	}
 }
 
@@ -757,8 +757,8 @@ static void svc_tcp_listen_data_ready(struct sock *sk, int count_unused)
 			printk("svc: socket %p: no user data\n", sk);
 	}
 
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_all(sk->sk_sleep);
+	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible_all(sk_sleep(sk));
 }
 
 /*
@@ -777,8 +777,8 @@ static void svc_tcp_state_change(struct sock *sk)
 		set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
 		svc_xprt_enqueue(&svsk->sk_xprt);
 	}
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible_all(sk->sk_sleep);
+	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible_all(sk_sleep(sk));
 }
 
 static void svc_tcp_data_ready(struct sock *sk, int count)
@@ -791,8 +791,8 @@ static void svc_tcp_data_ready(struct sock *sk, int count)
 		set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
 		svc_xprt_enqueue(&svsk->sk_xprt);
 	}
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible(sk->sk_sleep);
+	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible(sk_sleep(sk));
 }
 
 /*
@@ -1494,8 +1494,8 @@ static void svc_sock_detach(struct svc_xprt *xprt)
 	sk->sk_data_ready = svsk->sk_odata;
 	sk->sk_write_space = svsk->sk_owspace;
 
-	if (sk->sk_sleep && waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible(sk->sk_sleep);
+	if (sk_sleep(sk) && waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible(sk_sleep(sk));
 }
 
 /*
diff --git a/net/tipc/socket.c b/net/tipc/socket.c
index cfb20b8..66e889b 100644
--- a/net/tipc/socket.c
+++ b/net/tipc/socket.c
@@ -446,7 +446,7 @@ static unsigned int poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk;
 	u32 mask;
 
-	poll_wait(file, sk->sk_sleep, wait);
+	poll_wait(file, sk_sleep(sk), wait);
 
 	if (!skb_queue_empty(&sk->sk_receive_queue) ||
 	    (sock->state == SS_UNCONNECTED) ||
@@ -591,7 +591,7 @@ static int send_msg(struct kiocb *iocb, struct socket *sock,
 			break;
 		}
 		release_sock(sk);
-		res = wait_event_interruptible(*sk->sk_sleep,
+		res = wait_event_interruptible(*sk_sleep(sk),
 					       !tport->congested);
 		lock_sock(sk);
 		if (res)
@@ -650,7 +650,7 @@ static int send_packet(struct kiocb *iocb, struct socket *sock,
 			break;
 		}
 		release_sock(sk);
-		res = wait_event_interruptible(*sk->sk_sleep,
+		res = wait_event_interruptible(*sk_sleep(sk),
 			(!tport->congested || !tport->connected));
 		lock_sock(sk);
 		if (res)
@@ -931,7 +931,7 @@ restart:
 			goto exit;
 		}
 		release_sock(sk);
-		res = wait_event_interruptible(*sk->sk_sleep,
+		res = wait_event_interruptible(*sk_sleep(sk),
 			(!skb_queue_empty(&sk->sk_receive_queue) ||
 			 (sock->state == SS_DISCONNECTING)));
 		lock_sock(sk);
@@ -1064,7 +1064,7 @@ restart:
 			goto exit;
 		}
 		release_sock(sk);
-		res = wait_event_interruptible(*sk->sk_sleep,
+		res = wait_event_interruptible(*sk_sleep(sk),
 			(!skb_queue_empty(&sk->sk_receive_queue) ||
 			 (sock->state == SS_DISCONNECTING)));
 		lock_sock(sk);
@@ -1271,8 +1271,8 @@ static u32 filter_rcv(struct sock *sk, struct sk_buff *buf)
 		tipc_disconnect_port(tipc_sk_port(sk));
 	}
 
-	if (waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible(sk->sk_sleep);
+	if (waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible(sk_sleep(sk));
 	return TIPC_OK;
 }
 
@@ -1343,8 +1343,8 @@ static void wakeupdispatch(struct tipc_port *tport)
 {
 	struct sock *sk = (struct sock *)tport->usr_handle;
 
-	if (waitqueue_active(sk->sk_sleep))
-		wake_up_interruptible(sk->sk_sleep);
+	if (waitqueue_active(sk_sleep(sk)))
+		wake_up_interruptible(sk_sleep(sk));
 }
 
 /**
@@ -1426,7 +1426,7 @@ static int connect(struct socket *sock, struct sockaddr *dest, int destlen,
 	/* Wait until an 'ACK' or 'RST' arrives, or a timeout occurs */
 
 	release_sock(sk);
-	res = wait_event_interruptible_timeout(*sk->sk_sleep,
+	res = wait_event_interruptible_timeout(*sk_sleep(sk),
 			(!skb_queue_empty(&sk->sk_receive_queue) ||
 			(sock->state != SS_CONNECTING)),
 			sk->sk_rcvtimeo);
@@ -1521,7 +1521,7 @@ static int accept(struct socket *sock, struct socket *new_sock, int flags)
 			goto exit;
 		}
 		release_sock(sk);
-		res = wait_event_interruptible(*sk->sk_sleep,
+		res = wait_event_interruptible(*sk_sleep(sk),
 				(!skb_queue_empty(&sk->sk_receive_queue)));
 		lock_sock(sk);
 		if (res)
@@ -1632,8 +1632,8 @@ restart:
 		/* Discard any unreceived messages; wake up sleeping tasks */
 
 		discard_rx_queue(sk);
-		if (waitqueue_active(sk->sk_sleep))
-			wake_up_interruptible(sk->sk_sleep);
+		if (waitqueue_active(sk_sleep(sk)))
+			wake_up_interruptible(sk_sleep(sk));
 		res = 0;
 		break;
 
diff --git a/net/unix/af_unix.c b/net/unix/af_unix.c
index 3d9122e..87c0360 100644
--- a/net/unix/af_unix.c
+++ b/net/unix/af_unix.c
@@ -316,7 +316,7 @@ static void unix_write_space(struct sock *sk)
 	read_lock(&sk->sk_callback_lock);
 	if (unix_writable(sk)) {
 		if (sk_has_sleeper(sk))
-			wake_up_interruptible_sync(sk->sk_sleep);
+			wake_up_interruptible_sync(sk_sleep(sk));
 		sk_wake_async(sk, SOCK_WAKE_SPACE, POLL_OUT);
 	}
 	read_unlock(&sk->sk_callback_lock);
@@ -1736,7 +1736,7 @@ static long unix_stream_data_wait(struct sock *sk, long timeo)
 	unix_state_lock(sk);
 
 	for (;;) {
-		prepare_to_wait(sk->sk_sleep, &wait, TASK_INTERRUPTIBLE);
+		prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
 		if (!skb_queue_empty(&sk->sk_receive_queue) ||
 		    sk->sk_err ||
@@ -1752,7 +1752,7 @@ static long unix_stream_data_wait(struct sock *sk, long timeo)
 		clear_bit(SOCK_ASYNC_WAITDATA, &sk->sk_socket->flags);
 	}
 
-	finish_wait(sk->sk_sleep, &wait);
+	finish_wait(sk_sleep(sk), &wait);
 	unix_state_unlock(sk);
 	return timeo;
 }
@@ -1991,7 +1991,7 @@ static unsigned int unix_poll(struct file *file, struct socket *sock, poll_table
 	struct sock *sk = sock->sk;
 	unsigned int mask;
 
-	sock_poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
 
 	/* exceptional events? */
@@ -2028,7 +2028,7 @@ static unsigned int unix_dgram_poll(struct file *file, struct socket *sock,
 	struct sock *sk = sock->sk, *other;
 	unsigned int mask, writable;
 
-	sock_poll_wait(file, sk->sk_sleep, wait);
+	sock_poll_wait(file, sk_sleep(sk), wait);
 	mask = 0;
 
 	/* exceptional events? */
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index cbddd0c..6cffbc4 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -718,7 +718,7 @@ static int x25_wait_for_connection_establishment(struct sock *sk)
 	DECLARE_WAITQUEUE(wait, current);
 	int rc;
 
-	add_wait_queue_exclusive(sk->sk_sleep, &wait);
+	add_wait_queue_exclusive(sk_sleep(sk), &wait);
 	for (;;) {
 		__set_current_state(TASK_INTERRUPTIBLE);
 		rc = -ERESTARTSYS;
@@ -738,7 +738,7 @@ static int x25_wait_for_connection_establishment(struct sock *sk)
 			break;
 	}
 	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 	return rc;
 }
 
@@ -838,7 +838,7 @@ static int x25_wait_for_data(struct sock *sk, long timeout)
 	DECLARE_WAITQUEUE(wait, current);
 	int rc = 0;
 
-	add_wait_queue_exclusive(sk->sk_sleep, &wait);
+	add_wait_queue_exclusive(sk_sleep(sk), &wait);
 	for (;;) {
 		__set_current_state(TASK_INTERRUPTIBLE);
 		if (sk->sk_shutdown & RCV_SHUTDOWN)
@@ -858,7 +858,7 @@ static int x25_wait_for_data(struct sock *sk, long timeout)
 			break;
 	}
 	__set_current_state(TASK_RUNNING);
-	remove_wait_queue(sk->sk_sleep, &wait);
+	remove_wait_queue(sk_sleep(sk), &wait);
 	return rc;
 }
 



^ permalink raw reply related

* Re: PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Eric Dumazet @ 2010-04-20 22:31 UTC (permalink / raw)
  To: Gaspar Chilingarov; +Cc: netdev
In-Reply-To: <g2z46c8cb3e1004201517i5641a75cze2ec5bd33e81fb0f@mail.gmail.com>

Le mercredi 21 avril 2010 à 03:17 +0500, Gaspar Chilingarov a écrit :
> [1.] Large amount of outgoing tcp connections fail to bind properly to
> their ip/ports
> 
> [2.] Full description of the problem/report:
> 
> I'm trying to establish huge amount of outgoing tcp connections (over
> several 100000-s) on a single machine. I need to test load a server,
> which could process that amount of connections :)
> 
> The number of connections which are possible to establish from single
> ip is regulated by
> net.ipv4.ip_local_port_range = 32768    61000, which gives 28232 connections.
> 
> Good. I expect that each socket is identified on a local side as
> unique pair of local_ip:local_port .
> Thus I've added some more IP addresses (say 10) to the machine
> (aliases to the same network interface).
> I expect to be able to establish 10 times more connections than before
> (I know about file descriptor limits, system limit of total number of
> file descriptors and  so on - which are tuned to high values already).
> 
> And the fun part begins -
> I have 28232 on a first source IP (all in established state, say
> 10.0.0.10) and now I'm trying to establish one more connection with
> nc, specifying 10.0.0.11 as a source IP -- and getting "unable to bind
> error"
> 
> Notes about example;
> 10.0.0.1:8192 is a server which just accepts a connections and listens
> forever on them. It's in erlang and it can handle great loads -- so
> there is not problems on that side.
> Using the same script I was able to establish more than 20.000
> connections without any problems (having a standard local port range
> set)
> 
> 
> To make experiment easily reproducible I've done the following things:
> 
> Decrease number of local ports available to 1001 -
> net.ipv4.ip_local_port_range = 60000    61000
> 
> I have script like this (writing from memory)
> 
> #!/bin/sh
> 
> I=0
> 
> IP=10.0.0.10
> 
> # connection stats before run
> netstat -n | grep ESTABLISHED | fgrep "$IP" | wc -l
> 
> while [ $I -le 1000 ]; do
> 
> # run nc in background, supress any output
> nc -s $IP 10.0.0.1 8192 > /dev/null 2>&1 &
> 
> I=$(($I + 1))
> 
> done
> 
> # connection stats after run
> netstat -n | grep ESTABLISHED | fgrep "$IP" | wc -l
> 
> 
> EVEN on the first run I get only 990 successful connections! something
> fails, strange ....
> 
> nc 10.0.0.1 8192 fails with error "unable to bind" and establishes
> connection only from 5-10 try.
> 
> Ook, well, run this script again, get all possible 1001 connections
> and than change source IP to 10.0.0.11
> 
> If you run in several times you will get the following numbers of
> established connections about each run (for given source IP)
> ~650, ~870, 950,980,990,995,995, 1000 and several runs to get 1001.
> 
> Then if you change IP to the next available and run it again - you
> will get practically the same numbers and this continues for 3-th,
> 4th, 5-th and other IP's.
> 
> 
> As a programmer, I feel that there is some hash table for
> local_ip:local_port pairs in the kernel (may be also incorporating
> PID), which has a collisions and
> in case of collision it just fails to reserve/bind this pair for the
> socket.  I hope I'm right, but I've failed to find where the
> allocation is done :)
> In case if PID does not change (i've tried to run tests from primitive
> client in erlang as well -- you get much more worse picture and
> getting new socket becomes just impossible).
> 
> I think that even in case if there is one port available for that IP
> -- it should be possible to bind (even if the kernel should do the
> full scan on local port range to find that unused port).
> 
> 
> I would be grateful for hints where to look in the source -- may be I
> can produce some working patches for it.
> 
> 
> [3.] Keywords (i.e., modules, networking, kernel):
> does not matter, i think.
> 
> [4.] Kernel version (from /proc/version):
> Ubuntu Karmic Koala on amd64 with latest shipped kernel.
> Linux version 2.6.31-21-generic (buildd@yellow) (gcc version 4.4.1
> (Ubuntu 4.4.1-4ubuntu9) ) #59-Ubuntu SMP Wed Mar 24 07:28:27 UTC 2010
> 
> 
> [5.] Output of Oops.. message (if applicable) with symbolic information
>     resolved (see Documentation/oops-tracing.txt)
> n/a
> 
> [6.] A small shell script or example program which triggers the
>     problem (if possible)
> [7.] Environment
> 
> 
> Thanks in advance,
> Gaspar
> 

Its doable, only if you bind() your sockets before connect()

For each socket, you'll need to chose an (local IP, local port) not
already in use.

kernel wont magically select one source IP from the pool you have.




^ permalink raw reply

* PROBLEM: Linux kernel 2.6.31 IPv4 TCP fails to open huge amount of outgoing connections (unable to bind ... )
From: Gaspar Chilingarov @ 2010-04-20 22:17 UTC (permalink / raw)
  To: netdev

[1.] Large amount of outgoing tcp connections fail to bind properly to
their ip/ports

[2.] Full description of the problem/report:

I'm trying to establish huge amount of outgoing tcp connections (over
several 100000-s) on a single machine. I need to test load a server,
which could process that amount of connections :)

The number of connections which are possible to establish from single
ip is regulated by
net.ipv4.ip_local_port_range = 32768    61000, which gives 28232 connections.

Good. I expect that each socket is identified on a local side as
unique pair of local_ip:local_port .
Thus I've added some more IP addresses (say 10) to the machine
(aliases to the same network interface).
I expect to be able to establish 10 times more connections than before
(I know about file descriptor limits, system limit of total number of
file descriptors and  so on - which are tuned to high values already).

And the fun part begins -
I have 28232 on a first source IP (all in established state, say
10.0.0.10) and now I'm trying to establish one more connection with
nc, specifying 10.0.0.11 as a source IP -- and getting "unable to bind
error"

Notes about example;
10.0.0.1:8192 is a server which just accepts a connections and listens
forever on them. It's in erlang and it can handle great loads -- so
there is not problems on that side.
Using the same script I was able to establish more than 20.000
connections without any problems (having a standard local port range
set)

To make experiment easily reproducible I've done the following things:

Decrease number of local ports available to 1001 -
net.ipv4.ip_local_port_range = 60000    61000

I have script like this (writing from memory)

#!/bin/sh

I=0

IP=10.0.0.10

# connection stats before run
netstat -n | grep ESTABLISHED | fgrep "$IP" | wc -l

while [ $I -le 1000 ]; do

# run nc in background, supress any output
nc -s $IP 10.0.0.1 8192 > /dev/null 2>&1 &

I=$(($I + 1))

done

# connection stats after run
netstat -n | grep ESTABLISHED | fgrep "$IP" | wc -l

EVEN on the first run I get only 990 successful connections! something
fails, strange ....

nc 10.0.0.1 8192 fails with error "unable to bind" and establishes
connection only from 5-10 try.

Ook, well, run this script again, get all possible 1001 connections
and than change source IP to 10.0.0.11

If you run in several times you will get the following numbers of
established connections about each run (for given source IP)
~650, ~870, 950,980,990,995,995, 1000 and several runs to get 1001.

Then if you change IP to the next available and run it again - you
will get practically the same numbers and this continues for 3-th,
4th, 5-th and other IP's.

As a programmer, I feel that there is some hash table for
local_ip:local_port pairs in the kernel (may be also incorporating
PID), which has a collisions and
in case of collision it just fails to reserve/bind this pair for the
socket.  I hope I'm right, but I've failed to find where the
allocation is done :)
In case if PID does not change (i've tried to run tests from primitive
client in erlang as well -- you get much more worse picture and
getting new socket becomes just impossible).

I think that even in case if there is one port available for that IP
-- it should be possible to bind (even if the kernel should do the
full scan on local port range to find that unused port).

I would be grateful for hints where to look in the source -- may be I
can produce some working patches for it.

[3.] Keywords (i.e., modules, networking, kernel):
does not matter, i think.

[4.] Kernel version (from /proc/version):
Ubuntu Karmic Koala on amd64 with latest shipped kernel.
Linux version 2.6.31-21-generic (buildd@yellow) (gcc version 4.4.1
(Ubuntu 4.4.1-4ubuntu9) ) #59-Ubuntu SMP Wed Mar 24 07:28:27 UTC 2010

[5.] Output of Oops.. message (if applicable) with symbolic information
    resolved (see Documentation/oops-tracing.txt)
n/a

[6.] A small shell script or example program which triggers the
    problem (if possible)
[7.] Environment

Thanks in advance,
Gaspar

--
Gaspar Chilingarov

tel +37493 419763 (mobile - leave voice mail message)
icq 63174784
skype://gasparch
e mailto:nm@web.am mailto:gasparch@gmail.com
w http://gasparchilingarov.com/

^ permalink raw reply

* [RFC PATCH] net: Set sock's TX queue to match the RX queue
From: Ben Hutchings @ 2010-04-20 22:10 UTC (permalink / raw)
  To: netdev; +Cc: linux-net-drivers

We generally expect TX work for a connected socket to be done on the
same CPU (or rather hardware thread) as RX work, so the RX and TX
queue selection should match.  Therefore, when a sock with a cached
destination receives an skb from a multi-RX-queue net device, set the
sock's TX queue index to the RX queue index of the skb, modulo the
number of TX queues on the cached output device.

This changes the representation of TX queue mapping to record how the
current queue selection (if any) was made.  We can make the initial
queue selection based on the current simple hash, then replace it once
we know which RX queue is selected.  For TCP and other connected
protocols, this should all be settled during the initial handshake
so the change of TX queue will not cause packet reordering.

I haven't given this a whole lot of testing yet.  I would be interested
to hear how this affects performance on different systems and whether
it's actually worth doing.

Ben.
---
 include/net/sock.h |   37 +++++++++++++++++++++++++++++--------
 net/core/dev.c     |    3 ++-
 net/core/sock.c    |   18 +++++++++++++++++-
 3 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 56df440..3cd2b17 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -108,7 +108,8 @@ struct net;
  *	@skc_node: main hash linkage for various protocol lookup tables
  *	@skc_nulls_node: main hash linkage for TCP/UDP/UDP-Lite protocol
  *	@skc_refcnt: reference count
- *	@skc_tx_queue_mapping: tx queue number for this connection
+ *	@skc_tx_queue_source: source for @skc_tx_queue_index
+ *	@skc_tx_queue_index: tx queue index for this connection
  *	@skc_hash: hash value used with various protocol lookup tables
  *	@skc_u16hashes: two u16 hash values used by UDP lookup tables
  *	@skc_family: network address family
@@ -132,7 +133,8 @@ struct sock_common {
 		struct hlist_nulls_node skc_nulls_node;
 	};
 	atomic_t		skc_refcnt;
-	int			skc_tx_queue_mapping;
+	u16			skc_tx_queue_source;
+	u16			skc_tx_queue_index;
 
 	union  {
 		unsigned int	skc_hash;
@@ -152,6 +154,12 @@ struct sock_common {
 #endif
 };
 
+enum sock_tx_queue_source {
+	SK_TX_QUEUE_SOURCE_NONE,
+	SK_TX_QUEUE_SOURCE_HASH,
+	SK_TX_QUEUE_SOURCE_RX_QUEUE
+};
+
 /**
   *	struct sock - network layer representation of sockets
   *	@__sk_common: shared layout with inet_timewait_sock
@@ -226,7 +234,8 @@ struct sock {
 #define sk_node			__sk_common.skc_node
 #define sk_nulls_node		__sk_common.skc_nulls_node
 #define sk_refcnt		__sk_common.skc_refcnt
-#define sk_tx_queue_mapping	__sk_common.skc_tx_queue_mapping
+#define sk_tx_queue_source	__sk_common.skc_tx_queue_source
+#define sk_tx_queue_index	__sk_common.skc_tx_queue_index
 
 #define sk_copy_start		__sk_common.skc_hash
 #define sk_hash			__sk_common.skc_hash
@@ -1134,24 +1143,29 @@ static inline void sock_put(struct sock *sk)
 extern int sk_receive_skb(struct sock *sk, struct sk_buff *skb,
 			  const int nested);
 
-static inline void sk_tx_queue_set(struct sock *sk, int tx_queue)
+static inline void
+sk_tx_queue_set(struct sock *sk, u16 index, enum sock_tx_queue_source source)
 {
-	sk->sk_tx_queue_mapping = tx_queue;
+	sk->sk_tx_queue_source = source;
+	sk->sk_tx_queue_index = index;
 }
 
+extern void
+__sk_tx_queue_set_from_rx_queue(struct sock *sk, const struct sk_buff *skb);
+
 static inline void sk_tx_queue_clear(struct sock *sk)
 {
-	sk->sk_tx_queue_mapping = -1;
+	sk->sk_tx_queue_source = SK_TX_QUEUE_SOURCE_NONE;
 }
 
 static inline int sk_tx_queue_get(const struct sock *sk)
 {
-	return sk->sk_tx_queue_mapping;
+	return sk->sk_tx_queue_index;
 }
 
 static inline bool sk_tx_queue_recorded(const struct sock *sk)
 {
-	return (sk && sk->sk_tx_queue_mapping >= 0);
+	return (sk && sk->sk_tx_queue_source != SK_TX_QUEUE_SOURCE_NONE);
 }
 
 static inline void sk_set_socket(struct sock *sk, struct socket *sock)
@@ -1423,6 +1437,13 @@ static inline void skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
 	skb->destructor = sock_rfree;
 	atomic_add(skb->truesize, &sk->sk_rmem_alloc);
 	sk_mem_charge(sk, skb->truesize);
+	/*
+	 * If we're receiving through a multiqueue device, set the
+	 * TX queue to match the RX queue.
+	 */
+	if (sk->sk_dst_cache && skb_rx_queue_recorded(skb) &&
+	    unlikely(sk->sk_tx_queue_source != SK_TX_QUEUE_SOURCE_RX_QUEUE))
+		__sk_tx_queue_set_from_rx_queue(sk, skb);
 }
 
 extern void sk_reset_timer(struct sock *sk, struct timer_list* timer,
diff --git a/net/core/dev.c b/net/core/dev.c
index b31d5d6..576bb28 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2016,7 +2016,8 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 				queue_index = skb_tx_hash(dev, skb);
 
 			if (sk && rcu_dereference_check(sk->sk_dst_cache, 1))
-				sk_tx_queue_set(sk, queue_index);
+				sk_tx_queue_set(sk, queue_index,
+						SK_TX_QUEUE_SOURCE_HASH);
 		}
 	}
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 7effa1e..66719a7 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -362,6 +362,21 @@ void sk_reset_txq(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_reset_txq);
 
+void __sk_tx_queue_set_from_rx_queue(struct sock *sk, const struct sk_buff *skb)
+{
+	struct dst_entry *dst = __sk_dst_get(sk);
+	u16 queue_index = skb_get_rx_queue(skb);
+	struct net_device *output_dev;
+
+	if (!dst)
+		return;
+	output_dev = dst->dev;
+	if (unlikely(queue_index >= output_dev->real_num_tx_queues))
+		queue_index %= output_dev->real_num_tx_queues;
+	sk_tx_queue_set(sk, queue_index, SK_TX_QUEUE_SOURCE_RX_QUEUE);
+}
+EXPORT_SYMBOL_GPL(__sk_tx_queue_set_from_rx_queue);
+
 struct dst_entry *__sk_dst_check(struct sock *sk, u32 cookie)
 {
 	struct dst_entry *dst = __sk_dst_get(sk);
@@ -966,7 +981,8 @@ static void sock_copy(struct sock *nsk, const struct sock *osk)
 #endif
 	BUILD_BUG_ON(offsetof(struct sock, sk_copy_start) !=
 		     sizeof(osk->sk_node) + sizeof(osk->sk_refcnt) +
-		     sizeof(osk->sk_tx_queue_mapping));
+		     sizeof(osk->sk_tx_queue_source) +
+		     sizeof(osk->sk_tx_queue_index));
 	memcpy(&nsk->sk_copy_start, &osk->sk_copy_start,
 	       osk->sk_prot->obj_size - offsetof(struct sock, sk_copy_start));
 #ifdef CONFIG_SECURITY_NETWORK
-- 
1.6.2.5

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply related

* Socket filter access to hatype
From: Paul LeoNerd Evans @ 2010-04-20 22:00 UTC (permalink / raw)
  To: netdev

[-- Attachment #1.1: Type: text/plain, Size: 879 bytes --]

When capturing packets on a PF_PACKET/SOCK_RAW socket bound to all
interfaces, there doesn't appear to be a way for the filter program to
actually find out the underlying hardware type the packet was captured
on, such as is reported by the sll_hatype field of the struct sockaddr_ll
when the packet is sent up to userland.

Unless I've managed to miss a trick somewhere, this would seem to put a
fairly fundamental blocker on actually being able to filter in such
packets. Granted there's the SKF_OFF_NET area to inspect at the e.g. IPv4
level, but this makes it impossible to do anything on e.g. the Ethernet
level.

See attached for a patch to add an SKF_AD_HATYPE field, up among the
other special access fields around SKF_AD_OFF.

-- 
Paul "LeoNerd" Evans

leonerd@leonerd.org.uk
ICQ# 4135350       |  Registered Linux# 179460
http://www.leonerd.org.uk/

[-- Attachment #1.2: linux-SKF_AD_HATYPE.patch --]
[-- Type: text/x-patch, Size: 949 bytes --]

diff -ur linux-2.6.33.2.orig/include/linux/filter.h linux-2.6.33.2/include/linux/filter.h
--- linux-2.6.33.2.orig/include/linux/filter.h	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/include/linux/filter.h	2010-04-20 22:40:25.000000000 +0100
@@ -123,7 +123,8 @@
 #define SKF_AD_NLATTR_NEST	16
 #define SKF_AD_MARK 	20
 #define SKF_AD_QUEUE	24
-#define SKF_AD_MAX	28
+#define SKF_AD_HATYPE	28
+#define SKF_AD_MAX	32
 #define SKF_NET_OFF   (-0x100000)
 #define SKF_LL_OFF    (-0x200000)

diff -ur linux-2.6.33.2.orig/net/core/filter.c linux-2.6.33.2/net/core/filter.c
--- linux-2.6.33.2.orig/net/core/filter.c	2010-04-02 00:02:33.000000000 +0100
+++ linux-2.6.33.2/net/core/filter.c	2010-04-20 22:41:01.000000000 +0100
@@ -309,6 +309,9 @@
 		case SKF_AD_QUEUE:
 			A = skb->queue_mapping;
 			continue;
+		case SKF_AD_HATYPE:
+			A = skb->dev->type;
+			continue;
 		case SKF_AD_NLATTR: {
 			struct nlattr *nla;

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 190 bytes --]

^ permalink raw reply

* Re: [PATCH net-next-2.6] rps: consistent rxhash
From: David Miller @ 2010-04-20 21:41 UTC (permalink / raw)
  To: eric.dumazet; +Cc: franco, xiaosuo, therbert, netdev
In-Reply-To: <1271775421.7895.19.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 16:57:01 +0200

> I know many applications using TCP on loopback, they are real :)

This is all true and I support your hashing patch and all of that.

But if we really want TCP over loopback to go fast, there are much
better ways to do this.

Eric, do you remember that "TCP friends" rough patch I sent you last
year that essentailly made TCP sockets over loopback behave like
AF_UNIX ones and just queue the SKBs directly to the destination
socket without doing any protocol work?

If we ever got that working, tbench performance would become
impressive :)

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Jiri Bohac @ 2010-04-20 21:35 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Eric Dumazet, Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <20100420141655.3a66b8e8@nehalam>

On Tue, Apr 20, 2010 at 02:16:55PM -0700, Stephen Hemminger wrote:
> Is this problem new to net-next where we hold onto addresses, or was
> the issue there before?

Before. I am seeing it on 2.6.32.

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ

^ permalink raw reply

* Re: A possible bug in reqsk_queue_hash_req()
From: Eric Dumazet @ 2010-04-20 21:29 UTC (permalink / raw)
  To: David Miller; +Cc: raise.sail, netdev, linux-kernel
In-Reply-To: <20100420.142405.228805796.davem@davemloft.net>

Le mardi 20 avril 2010 à 14:24 -0700, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 20 Apr 2010 13:06:51 +0200
> 
> > I believe its not really necessary, because we are the only possible
> > writer at this stage.
> > 
> > The write_lock() ... write_unlock() is there only to enforce a
> > synchronisation with readers.
> > 
> > All callers of this reqsk_queue_hash_req() must have the socket locked
> 
> Right.
> 
> In fact there are quite a few snippets around the networking
> where we use this trick of only locking around the single
> pointer assignment that puts the object into the list.
> 
> And they were all written by Alexey Kuznetsov, so they must
> be correct :-)

We could convert to RCU, but given this read_lock is only taken by
inet_diag, its not really necessary.

Something like the following, but I have no plan to complete this work.


diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index b6d3b55..5f3373a 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -278,7 +278,9 @@ static inline void inet_csk_reqsk_queue_added(struct sock *sk,
 
 static inline int inet_csk_reqsk_queue_len(const struct sock *sk)
 {
-	return reqsk_queue_len(&inet_csk(sk)->icsk_accept_queue);
+	struct listen_sock * lopt = reqsk_lopt_get(&inet_csk(sk)->icsk_accept_queue);
+
+	return reqsk_queue_len(lopt);
 }
 
 static inline int inet_csk_reqsk_queue_young(const struct sock *sk)
diff --git a/include/net/request_sock.h b/include/net/request_sock.h
index 99e6e19..b665103 100644
--- a/include/net/request_sock.h
+++ b/include/net/request_sock.h
@@ -65,6 +65,7 @@ struct request_sock {
 	struct sock			*sk;
 	u32				secid;
 	u32				peer_secid;
+	struct rcu_head			rcu;
 };
 
 static inline struct request_sock *reqsk_alloc(const struct request_sock_ops *ops)
@@ -77,10 +78,7 @@ static inline struct request_sock *reqsk_alloc(const struct request_sock_ops *op
 	return req;
 }
 
-static inline void __reqsk_free(struct request_sock *req)
-{
-	kmem_cache_free(req->rsk_ops->slab, req);
-}
+extern void __reqsk_free(struct request_sock *req);
 
 static inline void reqsk_free(struct request_sock *req)
 {
@@ -110,23 +108,13 @@ struct listen_sock {
  * @rskq_accept_head - FIFO head of established children
  * @rskq_accept_tail - FIFO tail of established children
  * @rskq_defer_accept - User waits for some data after accept()
- * @syn_wait_lock - serializer
- *
- * %syn_wait_lock is necessary only to avoid proc interface having to grab the main
- * lock sock while browsing the listening hash (otherwise it's deadlock prone).
  *
- * This lock is acquired in read mode only from listening_get_next() seq_file
- * op and it's acquired in write mode _only_ from code that is actively
- * changing rskq_accept_head. All readers that are holding the master sock lock
- * don't need to grab this lock in read mode too as rskq_accept_head. writes
- * are always protected from the main sock lock.
  */
 struct request_sock_queue {
 	struct request_sock	*rskq_accept_head;
 	struct request_sock	*rskq_accept_tail;
-	rwlock_t		syn_wait_lock;
 	u8			rskq_defer_accept;
-	/* 3 bytes hole, try to pack */
+	/* 3-7 bytes hole, try to pack */
 	struct listen_sock	*listen_opt;
 };
 
@@ -154,9 +142,7 @@ static inline void reqsk_queue_unlink(struct request_sock_queue *queue,
 				      struct request_sock *req,
 				      struct request_sock **prev_req)
 {
-	write_lock(&queue->syn_wait_lock);
-	*prev_req = req->dl_next;
-	write_unlock(&queue->syn_wait_lock);
+	rcu_assign_pointer(*prev_req, req->dl_next);
 }
 
 static inline void reqsk_queue_add(struct request_sock_queue *queue,
@@ -167,13 +153,13 @@ static inline void reqsk_queue_add(struct request_sock_queue *queue,
 	req->sk = child;
 	sk_acceptq_added(parent);
 
+	req->dl_next = NULL;
 	if (queue->rskq_accept_head == NULL)
 		queue->rskq_accept_head = req;
 	else
 		queue->rskq_accept_tail->dl_next = req;
 
 	queue->rskq_accept_tail = req;
-	req->dl_next = NULL;
 }
 
 static inline struct request_sock *reqsk_queue_remove(struct request_sock_queue *queue)
@@ -223,9 +209,16 @@ static inline int reqsk_queue_added(struct request_sock_queue *queue)
 	return prev_qlen;
 }
 
-static inline int reqsk_queue_len(const struct request_sock_queue *queue)
+static inline struct listen_sock *reqsk_lopt_get(const struct request_sock_queue *queue)
 {
-	return queue->listen_opt != NULL ? queue->listen_opt->qlen : 0;
+	return rcu_dereference_check(queue->listen_opt,
+		rcu_read_lock_bh_held() ||
+		sock_owned_by_user(sk));
+}
+
+static inline int reqsk_queue_len(struct listen_sock *lopt)
+{
+	return lopt != NULL ? lopt->qlen : 0;
 }
 
 static inline int reqsk_queue_len_young(const struct request_sock_queue *queue)
@@ -247,11 +240,9 @@ static inline void reqsk_queue_hash_req(struct request_sock_queue *queue,
 	req->expires = jiffies + timeout;
 	req->retrans = 0;
 	req->sk = NULL;
-	req->dl_next = lopt->syn_table[hash];
 
-	write_lock(&queue->syn_wait_lock);
-	lopt->syn_table[hash] = req;
-	write_unlock(&queue->syn_wait_lock);
+	req->dl_next = lopt->syn_table[hash];
+	rcu_assign_pointer(lopt->syn_table[hash], req);
 }
 
 #endif /* _REQUEST_SOCK_H */
diff --git a/net/core/request_sock.c b/net/core/request_sock.c
index 7552495..e4c333e 100644
--- a/net/core/request_sock.c
+++ b/net/core/request_sock.c
@@ -58,13 +58,10 @@ int reqsk_queue_alloc(struct request_sock_queue *queue,
 	     lopt->max_qlen_log++);
 
 	get_random_bytes(&lopt->hash_rnd, sizeof(lopt->hash_rnd));
-	rwlock_init(&queue->syn_wait_lock);
 	queue->rskq_accept_head = NULL;
 	lopt->nr_table_entries = nr_table_entries;
 
-	write_lock_bh(&queue->syn_wait_lock);
-	queue->listen_opt = lopt;
-	write_unlock_bh(&queue->syn_wait_lock);
+	rcu_assign_pointer(queue->listen_opt, lopt);
 
 	return 0;
 }
@@ -94,10 +91,8 @@ static inline struct listen_sock *reqsk_queue_yank_listen_sk(
 {
 	struct listen_sock *lopt;
 
-	write_lock_bh(&queue->syn_wait_lock);
 	lopt = queue->listen_opt;
 	queue->listen_opt = NULL;
-	write_unlock_bh(&queue->syn_wait_lock);
 
 	return lopt;
 }
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 8da6429..2b2cb80 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -49,6 +49,19 @@ void inet_get_local_port_range(int *low, int *high)
 }
 EXPORT_SYMBOL(inet_get_local_port_range);
 
+static void reqsk_rcu_free(struct rcu_head *head)
+{
+	struct request_sock *req = container_of(head, struct request_sock, rcu);
+
+	kmem_cache_free(req->rsk_ops->slab, req);
+}
+
+void __reqsk_free(struct request_sock *req)
+{
+	call_rcu_bh(&req->rcu, reqsk_rcu_free);
+}
+EXPORT_SYMBOL(__reqsk_free);
+
 int inet_csk_bind_conflict(const struct sock *sk,
 			   const struct inet_bind_bucket *tb)
 {
@@ -420,7 +433,7 @@ struct request_sock *inet_csk_search_req(const struct sock *sk,
 		    ireq->loc_addr == laddr &&
 		    AF_INET_FAMILY(req->rsk_ops->family)) {
 			WARN_ON(req->sk);
-			*prevp = prev;
+			rcu_assign_pointer(*prevp, prev);
 			break;
 		}
 	}
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index e5fa2dd..4da8365 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -632,9 +632,9 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
 
 	entry.family = sk->sk_family;
 
-	read_lock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+	rcu_read_lock_bh();
 
-	lopt = icsk->icsk_accept_queue.listen_opt;
+	lopt = rcu_dereference_bh(icsk->icsk_accept_queue.listen_opt);
 	if (!lopt || !lopt->qlen)
 		goto out;
 
@@ -648,7 +648,7 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
 		struct request_sock *req, *head = lopt->syn_table[j];
 
 		reqnum = 0;
-		for (req = head; req; reqnum++, req = req->dl_next) {
+		for (req = head; req; reqnum++, req = rcu_dereference_bh(req->dl_next)) {
 			struct inet_request_sock *ireq = inet_rsk(req);
 
 			if (reqnum < s_reqnum)
@@ -691,7 +691,7 @@ static int inet_diag_dump_reqs(struct sk_buff *skb, struct sock *sk,
 	}
 
 out:
-	read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+	rcu_read_unlock_bh();
 
 	return err;
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index ad08392..cd6a042 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1985,6 +1985,7 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 	struct inet_listen_hashbucket *ilb;
 	struct tcp_iter_state *st = seq->private;
 	struct net *net = seq_file_net(seq);
+	struct listen_sock *lopt;
 
 	if (!sk) {
 		st->bucket = 0;
@@ -2009,20 +2010,21 @@ static void *listening_get_next(struct seq_file *seq, void *cur)
 				}
 				req = req->dl_next;
 			}
-			if (++st->sbucket >= icsk->icsk_accept_queue.listen_opt->nr_table_entries)
+			if (++st->sbucket >= lopt->nr_table_entries)
 				break;
 get_req:
-			req = icsk->icsk_accept_queue.listen_opt->syn_table[st->sbucket];
+			req = lopt->syn_table[st->sbucket];
 		}
 		sk	  = sk_next(st->syn_wait_sk);
 		st->state = TCP_SEQ_STATE_LISTENING;
-		read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+		rcu_read_unlock_bh();
 	} else {
 		icsk = inet_csk(sk);
-		read_lock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
-		if (reqsk_queue_len(&icsk->icsk_accept_queue))
+		rcu_read_lock_bh();
+		lopt = rcu_dereference_bh(icsk->icsk_accept_queue.listen_opt);
+		if (reqsk_queue_len(lopt))
 			goto start_req;
-		read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+		rcu_read_unlock_bh();
 		sk = sk_next(sk);
 	}
 get_sk:
@@ -2032,8 +2034,9 @@ get_sk:
 			goto out;
 		}
 		icsk = inet_csk(sk);
-		read_lock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
-		if (reqsk_queue_len(&icsk->icsk_accept_queue)) {
+		rcu_read_lock_bh();
+		lopt = rcu_dereference_bh(icsk->icsk_accept_queue.listen_opt);
+		if (reqsk_queue_len(lopt)) {
 start_req:
 			st->uid		= sock_i_uid(sk);
 			st->syn_wait_sk = sk;
@@ -2041,7 +2044,7 @@ start_req:
 			st->sbucket	= 0;
 			goto get_req;
 		}
-		read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
+		rcu_read_unlock_bh();
 	}
 	spin_unlock_bh(&ilb->lock);
 	if (++st->bucket < INET_LHTABLE_SIZE) {
@@ -2235,10 +2238,8 @@ static void tcp_seq_stop(struct seq_file *seq, void *v)
 
 	switch (st->state) {
 	case TCP_SEQ_STATE_OPENREQ:
-		if (v) {
-			struct inet_connection_sock *icsk = inet_csk(st->syn_wait_sk);
-			read_unlock_bh(&icsk->icsk_accept_queue.syn_wait_lock);
-		}
+		if (v)
+			rcu_read_unlock_bh();
 	case TCP_SEQ_STATE_LISTENING:
 		if (v != SEQ_START_TOKEN)
 			spin_unlock_bh(&tcp_hashinfo.listening_hash[st->bucket].lock);



^ permalink raw reply related

* Re: A possible bug in reqsk_queue_hash_req()
From: David Miller @ 2010-04-20 21:24 UTC (permalink / raw)
  To: eric.dumazet; +Cc: raise.sail, netdev, linux-kernel
In-Reply-To: <1271761611.3845.223.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 20 Apr 2010 13:06:51 +0200

> I believe its not really necessary, because we are the only possible
> writer at this stage.
> 
> The write_lock() ... write_unlock() is there only to enforce a
> synchronisation with readers.
> 
> All callers of this reqsk_queue_hash_req() must have the socket locked

Right.

In fact there are quite a few snippets around the networking
where we use this trick of only locking around the single
pointer assignment that puts the object into the list.

And they were all written by Alexey Kuznetsov, so they must
be correct :-)

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Stephen Hemminger @ 2010-04-20 21:16 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jiri Bohac, netdev, Hideaki YOSHIFUJI, David Miller
In-Reply-To: <1271797043.7895.320.camel@edumazet-laptop>

On Tue, 20 Apr 2010 22:57:23 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mardi 20 avril 2010 à 22:49 +0200, Jiri Bohac a écrit :
> > On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> > > Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > > > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > > > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > > > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> > > >  			addrconf_leave_anycast(ifp);
> > > >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> > > >  		dst_hold(&ifp->rt->u.dst);
> > > > -		if (ip6_del_rt(ifp->rt))
> > > > -			dst_free(&ifp->rt->u.dst);
> > > > +		ip6_del_rt(ifp->rt);
> > > >  		break;
> > > >  	}
> > > >  }
> > > > 
> > > 
> > > 
> > > I dont understand the problem Jiri.
> > > 
> > > We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> > > dst_free(), or we leak a refcount.
> > 
> > Well, no ... dst_free() does not decrement the refcount.
> > The "opposite" of dst_hold() is dst_release(). And it does not
> > automatically call dst_free() when the refcount drops to 0.
> > dst_free() needs to be called explicitly and it apparently
> > expects the caller to ensure that two dst_free()s won't be called
> > simultaneously. But my bonding example shows this is not the
> > case.
> > 
> > 
> 
> Ah yes you're right
> 
> Maybe ask Stephen ?
> 
> commit 93fa159abe50d3c55c7f83622d3f5c09b6e06f4b
> Author: stephen hemminger <shemminger@vyatta.com>
> Date:   Mon Apr 12 05:41:31 2010 +0000
> 
>     IPv6: keep route for tentative address
>     
>     Recent changes preserve IPv6 address when link goes down (good).
>     But would cause address to point to dead dst entry (bad).
>     The simplest fix is to just not delete route if address is
>     being held for later use.
>     
>     Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
> 
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 1b00bfe..a9913d2 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -4047,7 +4047,8 @@ static void __ipv6_ifa_notify(int event, struct
> inet6_ifaddr *ifp)
>                         addrconf_leave_anycast(ifp);
>                 addrconf_leave_solict(ifp->idev, &ifp->addr);
>                 dst_hold(&ifp->rt->u.dst);
> -               if (ip6_del_rt(ifp->rt))
> +
> +               if (ifp->dead && ip6_del_rt(ifp->rt))
>                         dst_free(&ifp->rt->u.dst);
>                 break;


Is this problem new to net-next where we hold onto addresses, or was
the issue there before?

^ permalink raw reply

* Re: IPv6: race condition in __ipv6_ifa_notify() and dst_free() ?
From: Eric Dumazet @ 2010-04-20 20:57 UTC (permalink / raw)
  To: Jiri Bohac; +Cc: netdev, Hideaki YOSHIFUJI, David Miller, Stephen Hemminger
In-Reply-To: <20100420204939.GA15354@smudla-wifi.bakulak.kosire.czf>

Le mardi 20 avril 2010 à 22:49 +0200, Jiri Bohac a écrit :
> On Tue, Apr 20, 2010 at 07:57:27PM +0200, Eric Dumazet wrote:
> > Le mardi 20 avril 2010 à 19:44 +0200, Jiri Bohac a écrit :
> > > --- a/net/ipv6/addrconf.c	2010-04-17 00:12:32.000000000 +0200
> > > +++ b/net/ipv6/addrconf.c	2010-04-20 19:07:35.000000000 +0200
> > > @@ -3974,8 +3974,7 @@ static void __ipv6_ifa_notify(int event,
> > >  			addrconf_leave_anycast(ifp);
> > >  		addrconf_leave_solict(ifp->idev, &ifp->addr);
> > >  		dst_hold(&ifp->rt->u.dst);
> > > -		if (ip6_del_rt(ifp->rt))
> > > -			dst_free(&ifp->rt->u.dst);
> > > +		ip6_del_rt(ifp->rt);
> > >  		break;
> > >  	}
> > >  }
> > > 
> > 
> > 
> > I dont understand the problem Jiri.
> > 
> > We just did dst_hold(&ifp->rt->u.dst), so if ip6_del_rt() fails we must
> > dst_free(), or we leak a refcount.
> 
> Well, no ... dst_free() does not decrement the refcount.
> The "opposite" of dst_hold() is dst_release(). And it does not
> automatically call dst_free() when the refcount drops to 0.
> dst_free() needs to be called explicitly and it apparently
> expects the caller to ensure that two dst_free()s won't be called
> simultaneously. But my bonding example shows this is not the
> case.
> 
> 

Ah yes you're right

Maybe ask Stephen ?

commit 93fa159abe50d3c55c7f83622d3f5c09b6e06f4b
Author: stephen hemminger <shemminger@vyatta.com>
Date:   Mon Apr 12 05:41:31 2010 +0000

    IPv6: keep route for tentative address
    
    Recent changes preserve IPv6 address when link goes down (good).
    But would cause address to point to dead dst entry (bad).
    The simplest fix is to just not delete route if address is
    being held for later use.
    
    Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 1b00bfe..a9913d2 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4047,7 +4047,8 @@ static void __ipv6_ifa_notify(int event, struct
inet6_ifaddr *ifp)
                        addrconf_leave_anycast(ifp);
                addrconf_leave_solict(ifp->idev, &ifp->addr);
                dst_hold(&ifp->rt->u.dst);
-               if (ip6_del_rt(ifp->rt))
+
+               if (ifp->dead && ip6_del_rt(ifp->rt))
                        dst_free(&ifp->rt->u.dst);
                break;
        }



^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox