netdev.vger.kernel.org archive mirror
* [PATCH] net: Fix struct sock bitfield annotation
@ 2009-10-08 15:16 Eric Dumazet
  2009-10-08 21:31 ` David Miller
  2009-10-08 21:54 ` Vegard Nossum
  0 siblings, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2009-10-08 15:16 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List, Vegard Nossum, Ingo Molnar

Since commit a98b65a3 (net: annotate struct sock bitfield), we lost 8 bytes
in struct sock on 64-bit arches because of the kmemcheck_bitfield_end(flags) misplacement.


struct good {
	int		begin_flags[0];
	unsigned char	a : 8;
	unsigned char	b;
	unsigned short	c;
	int		end_flags[0];
	int		sk_rcvbuf;
	void           *ptr;
};
struct bad {
	int		begin_flags[0];
	unsigned char	a : 8;
	int		end_flags[0];
	unsigned char	b;
	unsigned short	c;
	int		sk_rcvbuf;
	void           *ptr;
};
sizeof(struct good) = 16, sizeof(struct bad) = 24

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1621935..ecfb831 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -229,9 +229,9 @@ struct sock {
 	unsigned char		sk_shutdown : 2,
 				sk_no_check : 2,
 				sk_userlocks : 4;
-	kmemcheck_bitfield_end(flags);
 	unsigned char		sk_protocol;
 	unsigned short		sk_type;
+	kmemcheck_bitfield_end(flags);
 	int			sk_rcvbuf;
 	socket_lock_t		sk_lock;
 	/*

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-08 15:16 [PATCH] net: Fix struct sock bitfield annotation Eric Dumazet
@ 2009-10-08 21:31 ` David Miller
  2009-10-08 21:54 ` Vegard Nossum
  1 sibling, 0 replies; 12+ messages in thread
From: David Miller @ 2009-10-08 21:31 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, vegard.nossum, mingo

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 08 Oct 2009 17:16:13 +0200

> Since commit a98b65a3 (net: annotate struct sock bitfield), we lost 8 bytes
> in struct sock on 64-bit arches because of the kmemcheck_bitfield_end(flags) misplacement.
 ...
> sizeof(struct good) = 16, sizeof(struct bad) = 24
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Good catch, thanks Eric.

Applied to net-2.6

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-08 15:16 [PATCH] net: Fix struct sock bitfield annotation Eric Dumazet
  2009-10-08 21:31 ` David Miller
@ 2009-10-08 21:54 ` Vegard Nossum
  2009-10-08 22:08   ` David Miller
  2009-10-09  1:07   ` Eric Dumazet
  1 sibling, 2 replies; 12+ messages in thread
From: Vegard Nossum @ 2009-10-08 21:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List, Ingo Molnar

2009/10/8 Eric Dumazet <eric.dumazet@gmail.com>:
> Since commit a98b65a3 (net: annotate struct sock bitfield), we lost 8 bytes
> in struct sock on 64-bit arches because of the kmemcheck_bitfield_end(flags) misplacement.
>
>
> struct good {
>        int             begin_flags[0];
>        unsigned char   a : 8;
>        unsigned char   b;
>        unsigned short  c;
>        int             end_flags[0];
>        int             sk_rcvbuf;
>        void           *ptr;
> };
> struct bad {
>        int             begin_flags[0];
>        unsigned char   a : 8;
>        int             end_flags[0];
>        unsigned char   b;
>        unsigned short  c;
>        int             sk_rcvbuf;
>        void           *ptr;
> };
> sizeof(struct good) = 16, sizeof(struct bad) = 24
>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  include/net/sock.h |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 1621935..ecfb831 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -229,9 +229,9 @@ struct sock {
>        unsigned char           sk_shutdown : 2,
>                                sk_no_check : 2,
>                                sk_userlocks : 4;
> -       kmemcheck_bitfield_end(flags);
>        unsigned char           sk_protocol;
>        unsigned short          sk_type;
> +       kmemcheck_bitfield_end(flags);
>        int                     sk_rcvbuf;
>        socket_lock_t           sk_lock;
>        /*
>

Hm, no, this looks wrong to me, because sk_protocol and sk_type
_aren't_ in fact part of the bitfield.

We don't want to affect the kernel _at all_ when CONFIG_KMEMCHECK=n,
so I guess we should make the kmemcheck_bitfield_{begin|end}() macros
empty instead for that case? (And for kmemcheck kernels, we don't
really care about the lost 8 bytes anyway.)


Vegard

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-08 21:54 ` Vegard Nossum
@ 2009-10-08 22:08   ` David Miller
  2009-10-09  1:07   ` Eric Dumazet
  1 sibling, 0 replies; 12+ messages in thread
From: David Miller @ 2009-10-08 22:08 UTC (permalink / raw)
  To: vegard.nossum; +Cc: eric.dumazet, netdev, mingo

From: Vegard Nossum <vegard.nossum@gmail.com>
Date: Thu, 8 Oct 2009 23:54:02 +0200

> Hm, no, this looks wrong to me, because sk_protocol and sk_type
> _aren't_ in fact part of the bitfield.
> 
> We don't want to affect the kernel _at all_ when CONFIG_KMEMCHECK=n,
> so I guess we should make the kmemcheck_bitfield_{begin|end}() macros
> empty instead for that case? (And for kmemcheck kernels, we don't
> really care about the lost 8 bytes anyway.)

Hmmm, indeed.

Eric I'm holding this patch off for now.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-08 21:54 ` Vegard Nossum
  2009-10-08 22:08   ` David Miller
@ 2009-10-09  1:07   ` Eric Dumazet
  2009-10-09  1:46     ` Eric Dumazet
  2009-10-09  7:54     ` [PATCH] net: Fix struct sock bitfield annotation David Miller
  1 sibling, 2 replies; 12+ messages in thread
From: Eric Dumazet @ 2009-10-09  1:07 UTC (permalink / raw)
  To: Vegard Nossum; +Cc: David S. Miller, Linux Netdev List, Ingo Molnar

Vegard Nossum wrote:
> Hm, no, this looks wrong to me, because sk_protocol and sk_type
> _aren't_ in fact part of the bitfield.

What looks wrong to me is the original commit :)

> 
> We don't want to affect the kernel _at all_ when CONFIG_KMEMCHECK=n,
> so I guess we should make the kmemcheck_bitfield_{begin|end}() macros
> empty instead for that case? (And for kmemcheck kernels, we don't
> really care about the lost 8 bytes anyway.)

The point is that we should not lose 8 bytes whether kmemcheck is on or off.
I believe kmemcheck macros are fine as they are.

When we have a structure with

        unsigned char           sk_shutdown : 2,
                                sk_no_check : 2,
                                sk_userlocks : 4;
        unsigned char           sk_protocol;
        unsigned short          sk_type;

It's pretty clear it's *logically* a bitfield aggregation, or if you prefer:

        unsigned int            sk_shutdown : 2,
                                sk_no_check : 2,
                                sk_userlocks : 4,
                                sk_protocol : 8,
                                sk_type : 16;

The only difference is that in the second form, you cannot use
offsetof(struct sock, sk_type).

I am currently writing a tool to re-organize 'struct sock' fields
for net-next-2.6 using the offsetof() macro; this is how I spotted the problem.

Thanks

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-09  1:07   ` Eric Dumazet
@ 2009-10-09  1:46     ` Eric Dumazet
  2009-10-09 19:39       ` Christoph Lameter
  2009-10-09  7:54     ` [PATCH] net: Fix struct sock bitfield annotation David Miller
  1 sibling, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2009-10-09  1:46 UTC (permalink / raw)
  To: David S. Miller
  Cc: Vegard Nossum, Linux Netdev List, Ingo Molnar, Christoph Lameter

Eric Dumazet wrote:
> 
> I am currently writing a tool to re-organize 'struct sock' fields
> for net-next-2.6 using the offsetof() macro; this is how I spotted the problem.
> 

For networking guys, here is the actual mess with "struct sock" on x86_64,
related to UDP handling (critical latencies for some people). We basically touch
all cache lines in every path, with bad effects on SMP...

- rx softirq reception
- rx recvmsg() time
- tx sendto() time
- tx completion time

sizeof(struct sk_buff)  =232 (0x e8)
sizeof(struct sock)     =528 (0x210)
sizeof(struct inet_sock)=680 (0x2a8)

Following offsets in hexadecimal

offsetof(struct sock, sk_refcnt)       = 10   (rw by reception of udp frame, __udp4_lib_lookup())
 (unavoidable hot spot unfortunately, but no longer touched by sock_wfree())

offsetof(struct sock, sk_hash)         = 14   (read  rx softirq)
offsetof(struct sock, sk_family)       = 18   (read  rx softirq)
offsetof(struct inet_sock, daddr)      =210   (read  rx softirq)
offsetof(struct inet_sock, rcv_saddr)  =214   (read  rx softirq)
offsetof(struct inet_sock, dport)      =218   (read  rx softirq)
offsetof(struct inet_sock, mc_list)    =240   (read  rx softirq, multicast reception)
offsetof(struct sock, sk_bound_dev_if) = 1c   (read  rx softirq)

(The first patch I'll submit moves daddr/rcv_saddr/dport to struct sock_common,
 so that lookup() uses one cache line instead of two per socket in the hash chain.)


offsetof(struct sock, sk_prot)         = 30   (read by sk_has_account())
offsetof(struct sock, sk_rcvbuf)       = 3c   (read)
offsetof(struct sock, sk_protocol)     = 39
offsetof(struct sock, sk_allocation)   = e0   (read at send() time)
offsetof(struct sock, sk_flags)        = f8   (read)
offsetof(struct sock, sk_lock)         = 40   (rw by udp_sendmsg())
offsetof(struct sock, sk_dst_lock)     = 90   (rw by udp_sendmsg() on connected socks)
offsetof(struct sock, sk_dst_cache)    = 78   (read by udp_sendmsg() on connected socks)
offsetof(struct sock, sk_mark)         =1d8   (read at sendto() time)
offsetof(struct sock, sk_write_queue)  = c0   (rw by sendto())
offsetof(struct inet_sock, id)         =232   (rw by sendto() on connected socks)
offsetof(struct sock, sk_wmem_alloc)   = 98   (RW, both at sendto() and tx completion handler time)
offsetof(struct sock, sk_sndbuf)       = a0   (read at tx completion time and sendto())
offsetof(struct sock, sk_sndmsg_page)  =1b8   (rw by send())
offsetof(struct sock, sk_send_head)    =1c0   (rw by send(), tcp)

offsetof(struct sock, sk_rmem_alloc)   = 94   (RW, both when frame is received by softirq and dequeued at recvmsg() time)
offsetof(struct sock, sk_receive_queue)= a8   (RW, both when frame is received by softirq and dequeued at recvmsg() time)
offsetof(struct sock, sk_forward_alloc)= dc   (rw rx softirq and recvmsg() time)
offsetof(struct sock, sk_drops)        =134   (read when udp frame is received in softirq handler)
offsetof(struct sock, sk_stamp)        =1a0   (write at recvmsg() time)
offsetof(struct sock, sk_sleep)        = 70   (read by softirq handlers (rx/tx))
offsetof(struct sock, sk_filter)       =160   (read when udp frame is received in softirq handler)
offsetof(struct sock, sk_socket)       =1a8   (read)
offsetof(struct sock, sk_callback_lock)=128   (rw at softirq time)
offsetof(struct sock, sk_data_ready)   =1e8   (read)
offsetof(struct sock, sk_write_space)  =1f0   (read at TX completion time)

used by TCP
offsetof(struct sock, sk_timer)        =170
offsetof(struct sock, sk_ack_backlog)  =138   (listen socks)

Almost never used
offsetof(struct sock, sk_lingertime)   =100
offsetof(struct sock, sk_write_pending)=1cc
offsetof(struct sock, sk_prot_creator) =120

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-09  1:07   ` Eric Dumazet
  2009-10-09  1:46     ` Eric Dumazet
@ 2009-10-09  7:54     ` David Miller
  2009-10-09  8:50       ` Eric Dumazet
  1 sibling, 1 reply; 12+ messages in thread
From: David Miller @ 2009-10-09  7:54 UTC (permalink / raw)
  To: eric.dumazet; +Cc: vegard.nossum, netdev, mingo

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 09 Oct 2009 03:07:56 +0200

> The point is that we should not lose 8 bytes whether kmemcheck is on or off.
> I believe kmemcheck macros are fine as they are.
> 
> When we have a structure with
> 
>         unsigned char           sk_shutdown : 2,
>                                 sk_no_check : 2,
>                                 sk_userlocks : 4;
>         unsigned char           sk_protocol;
>         unsigned short          sk_type;
> 
> It's pretty clear it's *logically* a bitfield aggregation, or if you prefer:

I think from a practical standpoint, you are right.

But Vegard is right too, as we should be able to put the annotation
right next to the ":" statements.

So if you really want it, why don't you put sk_protocol and
sk_type into the ":" block, as you mentioned.

And then you can use Arnaldo's 'pahole' instead of the kludgy
offsetof() which doesn't work with bitfields :-)

I want the 8 bytes back just like you, but separating the annotation
from the real C bitfields definitely looks wrong to me.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-09  7:54     ` [PATCH] net: Fix struct sock bitfield annotation David Miller
@ 2009-10-09  8:50       ` Eric Dumazet
  2009-10-12  6:07         ` David Miller
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2009-10-09  8:50 UTC (permalink / raw)
  To: David Miller; +Cc: vegard.nossum, netdev, mingo

David Miller wrote:

> 
> I think from a practical standpoint, you are right.
> 
> But Vegard is right too, as we should be able to put the annotation
> right next to the ":" statements.
> 
> So if you really want why don't you put the sk_protocol and
> sk_type into the ":" block as you mentioned.
> 
> And then you can use Arnaldo's 'pahole' instead of the kludgy
> offsetof() which doesn't work with bitfields :-)
> 
> I want the 8 bytes back just like you, but separating the annotation
> from the real C bitfields definitely looks wrong to me.


Let's hope nobody wants to use &sk->sk_protocol, &sk->sk_type,
(or offsetof(..., sk_somefield) if that matters)

Only compile-tested with an 'allyesconfig' build on x86_64.

[PATCH] net: Fix struct sock bitfield annotation

Since commit a98b65a3 (net: annotate struct sock bitfield), we lost
8 bytes in struct sock on 64-bit arches because of
kmemcheck_bitfield_end(flags) misplacement.

Fix this by putting sk_shutdown, sk_no_check, sk_userlocks,
sk_protocol and sk_type together in the 32-bit 'flags' bitfield.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/sock.h |   10 +++++-----
 1 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 1621935..9f96394 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -226,12 +226,12 @@ struct sock {
 #define sk_prot			__sk_common.skc_prot
 #define sk_net			__sk_common.skc_net
 	kmemcheck_bitfield_begin(flags);
-	unsigned char		sk_shutdown : 2,
-				sk_no_check : 2,
-				sk_userlocks : 4;
+	unsigned int		sk_shutdown  : 2,
+				sk_no_check  : 2,
+				sk_userlocks : 4,
+				sk_protocol  : 8,
+				sk_type      : 16;
 	kmemcheck_bitfield_end(flags);
-	unsigned char		sk_protocol;
-	unsigned short		sk_type;
 	int			sk_rcvbuf;
 	socket_lock_t		sk_lock;
 	/*

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-09  1:46     ` Eric Dumazet
@ 2009-10-09 19:39       ` Christoph Lameter
  2009-10-09 20:41         ` Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Christoph Lameter @ 2009-10-09 19:39 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S. Miller, Vegard Nossum, Linux Netdev List, Ingo Molnar

On Fri, 9 Oct 2009, Eric Dumazet wrote:

> For networking guys, here is the actual mess with "struct sock" on x86_64,
> related to UDP handling (critical latencies for some people). We basically touch
> all cache lines in every path, with bad effects on SMP...

Please keep me posted on this. I am very interested in this work.

Some simple shuffling around may do some good here.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-09 19:39       ` Christoph Lameter
@ 2009-10-09 20:41         ` Eric Dumazet
  2009-10-13 21:59           ` [RFC net-next-2.6] udp: Dont use lock_sock()/release_sock() in rx path Eric Dumazet
  0 siblings, 1 reply; 12+ messages in thread
From: Eric Dumazet @ 2009-10-09 20:41 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: David S. Miller, Vegard Nossum, Linux Netdev List, Ingo Molnar

Christoph Lameter wrote:
> On Fri, 9 Oct 2009, Eric Dumazet wrote:
> 
>> For networking guys, here is the actual mess with "struct sock" on x86_64,
>> related to UDP handling (critical latencies for some people). We basically touch
>> all cache lines in every path, with bad effects on SMP...
> 
> Please keep me posted on this. I am very interested in this work.
> 
> Some simple shuffling around may do some good here.
> 

Sure, will do, but first I want to suppress the lock_sock()/release_sock() in
the rx path, which was added for the sk_forward_alloc accounting. This really
hurts, because of the backlog handling.

I have a preliminary patch that restores the UDP latencies we had in the past ;)

The trick is that for UDP, sk_forward_alloc is not updated by both tx and rx,
only by rx. So we can use sk_receive_queue.lock to forbid concurrent updates.

As this lock is already hot and only used by rx, we won't have to
dirty the sk_lock, which will only be used by the tx path.

Then we can carefully reorder struct sock to lower the number of cache lines
needed for each path.


Patch against linux-2.6 git tree

 net/core/sock.c |    9 ++++
 net/ipv4/udp.c  |   89 ++++++++++++++++++++++------------------------
 2 files changed, 51 insertions(+), 47 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 7626b6a..45212d4 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -276,6 +276,7 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	int err = 0;
 	int skb_len;
+	unsigned long flags;
 
 	/* Cast sk->rcvbuf to unsigned... It's pointless, but reduces
 	   number of warnings when compiling with -W --ANK
@@ -290,8 +291,12 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	if (err)
 		goto out;
 
+	skb_orphan(skb);
+
+	spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
 	if (!sk_rmem_schedule(sk, skb->truesize)) {
 		err = -ENOBUFS;
+		spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
 		goto out;
 	}
 
@@ -305,7 +310,9 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	skb_len = skb->len;
 
-	skb_queue_tail(&sk->sk_receive_queue, skb);
+	__skb_queue_tail(&sk->sk_receive_queue, skb);
+
+	spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);
 
 	if (!sock_flag(sk, SOCK_DEAD))
 		sk->sk_data_ready(sk, skb_len);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 6ec6a8a..e8a1be4 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -841,6 +841,36 @@ out:
 	return ret;
 }
 
+
+/**
+ *	first_packet_length	- return length of first packet in receive queue
+ *	@sk: socket
+ *
+ *	Drops all bad checksum frames, until a valid one is found.
+ *	Returns the length of found skb, or 0 if none is found.
+ */
+static unsigned int first_packet_length(struct sock *sk)
+{
+	struct sk_buff_head *rcvq = &sk->sk_receive_queue;
+	struct sk_buff *skb;
+	unsigned int res;
+
+	spin_lock_bh(&rcvq->lock);
+
+	while ((skb = skb_peek(rcvq)) != NULL &&
+		udp_lib_checksum_complete(skb)) {
+		UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS,
+				 IS_UDPLITE(sk));
+		__skb_unlink(skb, rcvq);
+		skb_kill_datagram(sk, skb, 0);
+	}
+	res = skb ? skb->len : 0;
+
+	spin_unlock_bh(&rcvq->lock);
+
+	return res;
+}
+
 /*
  *	IOCTL requests applicable to the UDP protocol
  */
@@ -857,21 +887,16 @@ int udp_ioctl(struct sock *sk, int cmd, unsigned long arg)
 
 	case SIOCINQ:
 	{
-		struct sk_buff *skb;
-		unsigned long amount;
+		unsigned int amount = first_packet_length(sk);
 
-		amount = 0;
-		spin_lock_bh(&sk->sk_receive_queue.lock);
-		skb = skb_peek(&sk->sk_receive_queue);
-		if (skb != NULL) {
+		if (amount)
 			/*
 			 * We will only return the amount
 			 * of this packet since that is all
 			 * that will be read.
 			 */
-			amount = skb->len - sizeof(struct udphdr);
-		}
-		spin_unlock_bh(&sk->sk_receive_queue.lock);
+			amount -= sizeof(struct udphdr);
+
 		return put_user(amount, (int __user *)arg);
 	}
 
@@ -968,17 +993,17 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
+	spin_lock_bh(&sk->sk_receive_queue.lock);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
 out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
+	spin_lock_bh(&sk->sk_receive_queue.lock);
 	if (!skb_kill_datagram(sk, skb, flags))
-		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	release_sock(sk);
+		UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
 
 	if (noblock)
 		return -EAGAIN;
@@ -1060,7 +1085,6 @@ drop:
 int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-	int rc;
 	int is_udplite = IS_UDPLITE(sk);
 
 	/*
@@ -1140,16 +1164,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	rc = 0;
-
-	bh_lock_sock(sk);
-	if (!sock_owned_by_user(sk))
-		rc = __udp_queue_rcv_skb(sk, skb);
-	else
-		sk_add_backlog(sk, skb);
-	bh_unlock_sock(sk);
-
-	return rc;
+	return __udp_queue_rcv_skb(sk, skb);
 
 drop:
 	UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
@@ -1540,29 +1555,11 @@ unsigned int udp_poll(struct file *file, struct socket *sock, poll_table *wait)
 {
 	unsigned int mask = datagram_poll(file, sock, wait);
 	struct sock *sk = sock->sk;
-	int 	is_lite = IS_UDPLITE(sk);
 
 	/* Check for false positives due to checksum errors */
-	if ((mask & POLLRDNORM) &&
-	    !(file->f_flags & O_NONBLOCK) &&
-	    !(sk->sk_shutdown & RCV_SHUTDOWN)) {
-		struct sk_buff_head *rcvq = &sk->sk_receive_queue;
-		struct sk_buff *skb;
-
-		spin_lock_bh(&rcvq->lock);
-		while ((skb = skb_peek(rcvq)) != NULL &&
-		       udp_lib_checksum_complete(skb)) {
-			UDP_INC_STATS_BH(sock_net(sk),
-					UDP_MIB_INERRORS, is_lite);
-			__skb_unlink(skb, rcvq);
-			kfree_skb(skb);
-		}
-		spin_unlock_bh(&rcvq->lock);
-
-		/* nothing to see, move along */
-		if (skb == NULL)
-			mask &= ~(POLLIN | POLLRDNORM);
-	}
+	if ((mask & POLLRDNORM) && !(file->f_flags & O_NONBLOCK) &&
+	    !(sk->sk_shutdown & RCV_SHUTDOWN) && !first_packet_length(sk))
+		mask &= ~(POLLIN | POLLRDNORM);
 
 	return mask;
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH] net: Fix struct sock bitfield annotation
  2009-10-09  8:50       ` Eric Dumazet
@ 2009-10-12  6:07         ` David Miller
  0 siblings, 0 replies; 12+ messages in thread
From: David Miller @ 2009-10-12  6:07 UTC (permalink / raw)
  To: eric.dumazet; +Cc: vegard.nossum, netdev, mingo

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 09 Oct 2009 10:50:25 +0200

> [PATCH] net: Fix struct sock bitfield annotation
> 
> Since commit a98b65a3 (net: annotate struct sock bitfield), we lost
> 8 bytes in struct sock on 64-bit arches because of
> kmemcheck_bitfield_end(flags) misplacement.
> 
> Fix this by putting sk_shutdown, sk_no_check, sk_userlocks,
> sk_protocol and sk_type together in the 32-bit 'flags' bitfield.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [RFC net-next-2.6] udp: Dont use lock_sock()/release_sock() in rx path
  2009-10-09 20:41         ` Eric Dumazet
@ 2009-10-13 21:59           ` Eric Dumazet
  0 siblings, 0 replies; 12+ messages in thread
From: Eric Dumazet @ 2009-10-13 21:59 UTC (permalink / raw)
  To: David S. Miller; +Cc: Christoph Lameter, Vegard Nossum, Linux Netdev List

Eric Dumazet wrote:
> Sure, will do, but first I want to suppress the lock_sock()/release_sock() in
> the rx path, which was added for the sk_forward_alloc accounting. This really
> hurts, because of the backlog handling.
> 
> I have a preliminary patch that restores the UDP latencies we had in the past ;)
> 
> The trick is that for UDP, sk_forward_alloc is not updated by both tx and rx,
> only by rx. So we can use sk_receive_queue.lock to forbid concurrent updates.
> 
> As this lock is already hot and only used by rx, we won't have to
> dirty the sk_lock, which will only be used by the tx path.
> 
> Then we can carefully reorder struct sock to lower the number of cache lines
> needed for each path.
> 

[RFC net-next-2.6] udp: Dont use lock_sock()/release_sock() in rx path

We added memory accounting for UDP two years ago, and some people complain
about increased latencies, especially on multicast.

Because sk_forward_alloc is not atomic, we duplicated the protection we used for TCP,
i.e. we use lock_sock()/release_sock() to guard sk_forward_alloc against concurrent updates.

When a frame is received by the NIC, the softirq handler has to lock the socket, and eventually
has to queue the frame into the backlog instead of the receive queue.

Then the user application also has to lock the socket to dequeue a frame from the receive queue
and eventually process the backlog queue, leading to high latencies.

This lock is also used in the tx path to guard the cork structure against concurrent updates.

This causes false sharing of the socket lock between several cpus, which could be avoided,
considering UDP touches sk_forward_alloc only on rx. (TCP uses it for both rx and tx.)

Instead of using the socket lock, we can use the sk_receive_queue.lock that we have to
take anyway to queue a frame at softirq time (or dequeue it in the user application).

This avoids two atomic ops per packet in the softirq handler, plus one in the user app doing recvmsg(),
and we don't touch the backlog anymore.

This way, the socket lock is only used in the tx path, as in the past, and we can improve
things by reordering struct sock fields into separate rx/tx groups in a followup patch.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 core/sock.c |    6 +++++-
 ipv4/udp.c  |   34 ++++++++--------------------------
 ipv6/udp.c  |   36 ++++++++++--------------------------
 3 files changed, 23 insertions(+), 53 deletions(-)

diff --git a/net/core/sock.c b/net/core/sock.c
index 43ca2c9..d06b1a0 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -292,8 +292,13 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	if (err)
 		goto out;
 
+	skb_orphan(skb);
+
+	spin_lock_irqsave(&list->lock, flags);
+
 	if (!sk_rmem_schedule(sk, skb->truesize)) {
 		err = -ENOBUFS;
+		spin_unlock_irqrestore(&list->lock, flags);
 		goto out;
 	}
 
@@ -307,7 +312,6 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 	 */
 	skb_len = skb->len;
 
-	spin_lock_irqsave(&list->lock, flags);
 	skb->dropcount = atomic_read(&sk->sk_drops);
 	__skb_queue_tail(list, skb);
 	spin_unlock_irqrestore(&list->lock, flags);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index ee61b3f..f62cec3 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -855,29 +855,21 @@ out:
  */
 static unsigned int first_packet_length(struct sock *sk)
 {
-	struct sk_buff_head list_kill, *rcvq = &sk->sk_receive_queue;
+	struct sk_buff_head *rcvq = &sk->sk_receive_queue;
 	struct sk_buff *skb;
 	unsigned int res;
 
-	__skb_queue_head_init(&list_kill);
-
 	spin_lock_bh(&rcvq->lock);
 	while ((skb = skb_peek(rcvq)) != NULL &&
 		udp_lib_checksum_complete(skb)) {
 		UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS,
 				 IS_UDPLITE(sk));
 		__skb_unlink(skb, rcvq);
-		__skb_queue_tail(&list_kill, skb);
+		skb_kill_datagram(sk, skb, 0);
 	}
 	res = skb ? skb->len : 0;
 	spin_unlock_bh(&rcvq->lock);
 
-	if (!skb_queue_empty(&list_kill)) {
-		lock_sock(sk);
-		__skb_queue_purge(&list_kill);
-		sk_mem_reclaim_partial(sk);
-		release_sock(sk);
-	}
 	return res;
 }
 
@@ -1003,17 +995,17 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
+	spin_lock_bh(&sk->sk_receive_queue.lock);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
 out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
+	spin_lock_bh(&sk->sk_receive_queue.lock);
 	if (!skb_kill_datagram(sk, skb, flags))
-		UDP_INC_STATS_USER(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
-	release_sock(sk);
+		UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
 
 	if (noblock)
 		return -EAGAIN;
@@ -1095,7 +1087,6 @@ drop:
 int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 {
 	struct udp_sock *up = udp_sk(sk);
-	int rc;
 	int is_udplite = IS_UDPLITE(sk);
 
 	/*
@@ -1175,16 +1166,7 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 			goto drop;
 	}
 
-	rc = 0;
-
-	bh_lock_sock(sk);
-	if (!sock_owned_by_user(sk))
-		rc = __udp_queue_rcv_skb(sk, skb);
-	else
-		sk_add_backlog(sk, skb);
-	bh_unlock_sock(sk);
-
-	return rc;
+	return __udp_queue_rcv_skb(sk, skb);
 
 drop:
 	UDP_INC_STATS_BH(sock_net(sk), UDP_MIB_INERRORS, is_udplite);
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 1f8e2af..07468aa 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -288,23 +288,23 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
+	spin_lock_bh(&sk->sk_receive_queue.lock);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
 out:
 	return err;
 
 csum_copy_err:
-	lock_sock(sk);
+	spin_lock_bh(&sk->sk_receive_queue.lock);
 	if (!skb_kill_datagram(sk, skb, flags)) {
 		if (is_udp4)
-			UDP_INC_STATS_USER(sock_net(sk),
+			UDP_INC_STATS_BH(sock_net(sk),
 					UDP_MIB_INERRORS, is_udplite);
 		else
-			UDP6_INC_STATS_USER(sock_net(sk),
+			UDP6_INC_STATS_BH(sock_net(sk),
 					UDP_MIB_INERRORS, is_udplite);
 	}
-	release_sock(sk);
+	spin_unlock_bh(&sk->sk_receive_queue.lock);
 
 	if (flags & MSG_DONTWAIT)
 		return -EAGAIN;
@@ -468,21 +468,10 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 	while ((sk2 = udp_v6_mcast_next(net, sk_nulls_next(sk2), uh->dest, daddr,
 					uh->source, saddr, dif))) {
 		struct sk_buff *buff = skb_clone(skb, GFP_ATOMIC);
-		if (buff) {
-			bh_lock_sock(sk2);
-			if (!sock_owned_by_user(sk2))
-				udpv6_queue_rcv_skb(sk2, buff);
-			else
-				sk_add_backlog(sk2, buff);
-			bh_unlock_sock(sk2);
-		}
+		if (buff)
+			udpv6_queue_rcv_skb(sk2, buff);
 	}
-	bh_lock_sock(sk);
-	if (!sock_owned_by_user(sk))
-		udpv6_queue_rcv_skb(sk, skb);
-	else
-		sk_add_backlog(sk, skb);
-	bh_unlock_sock(sk);
+	udpv6_queue_rcv_skb(sk, skb);
 out:
 	spin_unlock(&hslot->lock);
 	return 0;
@@ -597,12 +586,7 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
 
 	/* deliver */
 
-	bh_lock_sock(sk);
-	if (!sock_owned_by_user(sk))
-		udpv6_queue_rcv_skb(sk, skb);
-	else
-		sk_add_backlog(sk, skb);
-	bh_unlock_sock(sk);
+	udpv6_queue_rcv_skb(sk, skb);
 	sock_put(sk);
 	return 0;
 

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2009-10-13 22:00 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2009-10-08 15:16 [PATCH] net: Fix struct sock bitfield annotation Eric Dumazet
2009-10-08 21:31 ` David Miller
2009-10-08 21:54 ` Vegard Nossum
2009-10-08 22:08   ` David Miller
2009-10-09  1:07   ` Eric Dumazet
2009-10-09  1:46     ` Eric Dumazet
2009-10-09 19:39       ` Christoph Lameter
2009-10-09 20:41         ` Eric Dumazet
2009-10-13 21:59           ` [RFC net-next-2.6] udp: Dont use lock_sock()/release_sock() in rx path Eric Dumazet
2009-10-09  7:54     ` [PATCH] net: Fix struct sock bitfield annotation David Miller
2009-10-09  8:50       ` Eric Dumazet
2009-10-12  6:07         ` David Miller
