* [PATCH V2] net: add accounting for socket backlog
@ 2010-02-26 9:27 Zhu Yi
2010-02-26 12:05 ` David Miller
2010-02-28 5:31 ` Eric Dumazet
0 siblings, 2 replies; 7+ messages in thread
From: Zhu Yi @ 2010-02-26 9:27 UTC (permalink / raw)
Cc: netdev, Zhu Yi, David Miller, Eric Dumazet
We hit a system OOM while running UDP netperf tests on the loopback
device. In this case, multiple senders sent streams of UDP packets to a
single receiver via loopback on the local host. Naturally, the receiver
was not able to handle all the packets in time. But we were surprised to
find that these packets were not discarded due to the receiver's
sk->sk_rcvbuf limit. Instead, they kept queuing to sk->sk_backlog and
eventually ate up all the memory. We believe this is a security hole
that an unprivileged user can exploit to crash the system.
The root cause of this problem is that while the receiver is in
__release_sock() (i.e. after userspace recv: kernel udp_recvmsg ->
skb_free_datagram_locked -> release_sock), it moves skbs from the
backlog to sk_receive_queue with softirqs enabled. In the above case,
multiple busy senders turn this into an almost endless loop, and the
skbs in the backlog end up eating all the system memory.
This patch fixes the problem by adding accounting for the socket
backlog, so that the backlog size can be restricted at the protocol's
choice (e.g. UDP).
Reported-by: Alex Shi <alex.shi@intel.com>
Cc: David Miller <davem@davemloft.net>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Zhu Yi <yi.zhu@intel.com>
---
V2: remove atomic operation for sk_backlog.len
limit UDP backlog size to 2*sk->sk_rcvbuf
diff --git a/include/net/sock.h b/include/net/sock.h
index 3f1a480..9f6893b 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -253,6 +253,7 @@ struct sock {
struct {
struct sk_buff *head;
struct sk_buff *tail;
+ int len;
} sk_backlog;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
@@ -583,11 +584,22 @@ static inline void sk_add_backlog(struct sock *sk, struct sk_buff *skb)
sk->sk_backlog.tail->next = skb;
sk->sk_backlog.tail = skb;
}
+ sk->sk_backlog.len += skb->truesize;
skb->next = NULL;
}
+static inline int sk_add_backlog_limited(struct sock *sk, struct sk_buff *skb)
+{
+ if (sk->sk_backlog.len >= 2 * sk->sk_rcvbuf)
+ return -ENOBUFS;
+
+ sk_add_backlog(sk, skb);
+ return 0;
+}
+
static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
{
+ sk->sk_backlog.len -= skb->truesize;
return sk->sk_backlog_rcv(sk, skb);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index e1f6f22..82228ef 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1138,6 +1138,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
sock_lock_init(newsk);
bh_lock_sock(newsk);
newsk->sk_backlog.head = newsk->sk_backlog.tail = NULL;
+ newsk->sk_backlog.len = 0;
atomic_set(&newsk->sk_rmem_alloc, 0);
/*
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index f0126fd..7bb4568 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1372,8 +1372,10 @@ int udp_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
rc = __udp_queue_rcv_skb(sk, skb);
- else
- sk_add_backlog(sk, skb);
+ else if (sk_add_backlog_limited(sk, skb)) {
+ bh_unlock_sock(sk);
+ goto drop;
+ }
bh_unlock_sock(sk);
return rc;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 69ebdbe..e4a8645 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -584,16 +584,19 @@ static void flush_stack(struct sock **stack, unsigned int count,
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
udpv6_queue_rcv_skb(sk, skb1);
- else
- sk_add_backlog(sk, skb1);
+ else if (sk_add_backlog_limited(sk, skb1)) {
+ bh_unlock_sock(sk);
+ goto drop;
+ }
bh_unlock_sock(sk);
- } else {
- atomic_inc(&sk->sk_drops);
- UDP6_INC_STATS_BH(sock_net(sk),
- UDP_MIB_RCVBUFERRORS, IS_UDPLITE(sk));
- UDP6_INC_STATS_BH(sock_net(sk),
- UDP_MIB_INERRORS, IS_UDPLITE(sk));
+ continue;
}
+drop:
+ atomic_inc(&sk->sk_drops);
+ UDP6_INC_STATS_BH(sock_net(sk),
+ UDP_MIB_RCVBUFERRORS, IS_UDPLITE(sk));
+ UDP6_INC_STATS_BH(sock_net(sk),
+ UDP_MIB_INERRORS, IS_UDPLITE(sk));
}
}
/*
@@ -756,8 +759,11 @@ int __udp6_lib_rcv(struct sk_buff *skb, struct udp_table *udptable,
bh_lock_sock(sk);
if (!sock_owned_by_user(sk))
udpv6_queue_rcv_skb(sk, skb);
- else
- sk_add_backlog(sk, skb);
+ else if (sk_add_backlog_limited(sk, skb)) {
+ bh_unlock_sock(sk);
+ sock_put(sk);
+ goto discard;
+ }
bh_unlock_sock(sk);
sock_put(sk);
return 0;
* Re: [PATCH V2] net: add accounting for socket backlog
2010-02-26 9:27 [PATCH V2] net: add accounting for socket backlog Zhu Yi
@ 2010-02-26 12:05 ` David Miller
2010-03-01 2:08 ` Zhu Yi
2010-02-28 5:31 ` Eric Dumazet
1 sibling, 1 reply; 7+ messages in thread
From: David Miller @ 2010-02-26 12:05 UTC (permalink / raw)
To: yi.zhu; +Cc: netdev, eric.dumazet
From: Zhu Yi <yi.zhu@intel.com>
Date: Fri, 26 Feb 2010 17:27:44 +0800
> We hit a system OOM while running UDP netperf tests on the loopback
> device. In this case, multiple senders sent streams of UDP packets to a
> single receiver via loopback on the local host. Naturally, the receiver
> was not able to handle all the packets in time. But we were surprised to
> find that these packets were not discarded due to the receiver's
> sk->sk_rcvbuf limit. Instead, they kept queuing to sk->sk_backlog and
> eventually ate up all the memory. We believe this is a security hole
> that an unprivileged user can exploit to crash the system.
>
> The root cause of this problem is that while the receiver is in
> __release_sock() (i.e. after userspace recv: kernel udp_recvmsg ->
> skb_free_datagram_locked -> release_sock), it moves skbs from the
> backlog to sk_receive_queue with softirqs enabled. In the above case,
> multiple busy senders turn this into an almost endless loop, and the
> skbs in the backlog end up eating all the system memory.
>
> This patch fixes the problem by adding accounting for the socket
> backlog, so that the backlog size can be restricted at the protocol's
> choice (e.g. UDP).
>
> Reported-by: Alex Shi <alex.shi@intel.com>
> Cc: David Miller <davem@davemloft.net>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Zhu Yi <yi.zhu@intel.com>
So remind me why TCP, or any other non-UDP protocol, won't
intrinsically have this problem too?
It seems pretty trivial to do with any protocol, especially remotely,
with a packet generator. The code in TCP, for example, which queues
to the backlog, doesn't care about sequence numbers or anything like
that.
So you could spray a machine with the same TCP frame over and over
again, as fast as possible, as long as it matches the socket identity.
And in this way fill up the backlog endlessly and OOM the system.
* Re: [PATCH V2] net: add accounting for socket backlog
2010-02-26 12:05 ` David Miller
@ 2010-03-01 2:08 ` Zhu Yi
2010-03-01 2:10 ` David Miller
0 siblings, 1 reply; 7+ messages in thread
From: Zhu Yi @ 2010-03-01 2:08 UTC (permalink / raw)
To: David Miller; +Cc: netdev@vger.kernel.org, eric.dumazet@gmail.com
On Fri, 2010-02-26 at 20:05 +0800, David Miller wrote:
> So remind me why TCP, or any other non-UDP protocol, won't
> intrinsically have this problem too?
If TCP ACKs are not received, the (closed) remote window prevents the
TCP sender from sending more frames.
> It seems pretty trivial to do with any protocol, especially remotely,
> with a packet generator. The code in TCP, for example, which queues
> to the backlog, doesn't care about sequence numbers or anything like
> that.
>
> So you could spray a machine with the same TCP frame over and over
> again, as fast as possible, as long as it matches the socket identity.
>
> And in this way fill up the backlog endlessly and OOM the system.
Yeah, I only considered the normal case, that is, TCP frames that are
built and managed by the kernel. If a user generates frames himself,
yes, the same problem could potentially happen for all protocols that
use the backlog.
Thanks,
-yi
* Re: [PATCH V2] net: add accounting for socket backlog
2010-03-01 2:08 ` Zhu Yi
@ 2010-03-01 2:10 ` David Miller
0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2010-03-01 2:10 UTC (permalink / raw)
To: yi.zhu; +Cc: netdev, eric.dumazet
From: Zhu Yi <yi.zhu@intel.com>
Date: Mon, 01 Mar 2010 10:08:48 +0800
> Yeah, I only considered the normal case, that is, TCP frames that are
> built and managed by the kernel.
You're not even considering the kernel case completely. It's just as
easy to modify the kernel to maliciously send frames in this way.
> If a user generates frames himself, yes, the same problem could
> potentially happen for all protocols that use the backlog.
We need this protection for every protocol; please implement your
changes that way.
Thanks.
* Re: [PATCH V2] net: add accounting for socket backlog
2010-02-26 9:27 [PATCH V2] net: add accounting for socket backlog Zhu Yi
2010-02-26 12:05 ` David Miller
@ 2010-02-28 5:31 ` Eric Dumazet
2010-03-01 11:43 ` Eric Dumazet
1 sibling, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2010-02-28 5:31 UTC (permalink / raw)
To: Zhu Yi; +Cc: netdev, David Miller
On Friday, 26 February 2010 at 17:27 +0800, Zhu Yi wrote:
> We hit a system OOM while running UDP netperf tests on the loopback
> device. In this case, multiple senders sent streams of UDP packets to a
> single receiver via loopback on the local host. Naturally, the receiver
> was not able to handle all the packets in time. But we were surprised to
> find that these packets were not discarded due to the receiver's
> sk->sk_rcvbuf limit. Instead, they kept queuing to sk->sk_backlog and
> eventually ate up all the memory. We believe this is a security hole
> that an unprivileged user can exploit to crash the system.
>
> The root cause of this problem is that while the receiver is in
> __release_sock() (i.e. after userspace recv: kernel udp_recvmsg ->
> skb_free_datagram_locked -> release_sock), it moves skbs from the
> backlog to sk_receive_queue with softirqs enabled. In the above case,
> multiple busy senders turn this into an almost endless loop, and the
> skbs in the backlog end up eating all the system memory.
>
> This patch fixes the problem by adding accounting for the socket
> backlog, so that the backlog size can be restricted at the protocol's
> choice (e.g. UDP).
>
> Reported-by: Alex Shi <alex.shi@intel.com>
> Cc: David Miller <davem@davemloft.net>
> Cc: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Zhu Yi <yi.zhu@intel.com>
> ---
> V2: remove atomic operation for sk_backlog.len
> limit UDP backlog size to 2*sk->sk_rcvbuf
>
> +
> static inline int sk_backlog_rcv(struct sock *sk, struct sk_buff *skb)
> {
> + sk->sk_backlog.len -= skb->truesize;
> return sk->sk_backlog_rcv(sk, skb);
> }
>
I am afraid sk_backlog_rcv() is not always called with the lock held,
and not always called to process the backlog (see TCP ucopy.prequeue).
If you take a look at __release_sock(), for example, we make the backlog
private to the process before handling it (outside of lock_sock()).
Therefore, I suggest doing the subtraction outside of
sk_backlog_rcv().
diff --git a/net/core/sock.c b/net/core/sock.c
index e1f6f22..57271cb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1520,6 +1520,7 @@ static void __release_sock(struct sock *sk)
do {
sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
+ sk->sk_backlog.len = 0;
bh_unlock_sock(sk);
do {
Ah, I see __release_sock() already does a preemption check, so please
ignore my previous comment, where I said "__release_sock() could run
forever with no preemption, even with a limit on the backlog".
Thanks
* Re: [PATCH V2] net: add accounting for socket backlog
2010-02-28 5:31 ` Eric Dumazet
@ 2010-03-01 11:43 ` Eric Dumazet
2010-03-02 2:34 ` Zhu Yi
0 siblings, 1 reply; 7+ messages in thread
From: Eric Dumazet @ 2010-03-01 11:43 UTC (permalink / raw)
To: Zhu Yi; +Cc: netdev, David Miller
On Sunday, 28 February 2010 at 06:31 +0100, Eric Dumazet wrote:
> I am afraid sk_backlog_rcv() is not always called with the lock held,
> and not always called to process the backlog (see TCP ucopy.prequeue).
>
> If you take a look at __release_sock(), for example, we make the backlog
> private to the process before handling it (outside of lock_sock()).
>
> Therefore, I suggest doing the subtraction outside of
> sk_backlog_rcv().
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index e1f6f22..57271cb 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1520,6 +1520,7 @@ static void __release_sock(struct sock *sk)
>
> do {
> sk->sk_backlog.head = sk->sk_backlog.tail = NULL;
> + sk->sk_backlog.len = 0;
> bh_unlock_sock(sk);
>
> do {
>
>
Thinking about this again: doing the zero initialization at the very
end of __release_sock() solves the problem of a potential infinite loop
in __release_sock(), since the producer will hit the backlog limit.
Thanks
diff --git a/net/core/sock.c b/net/core/sock.c
index 305cba4..544cf4a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1542,6 +1542,11 @@ static void __release_sock(struct sock *sk)
bh_lock_sock(sk);
} while ((skb = sk->sk_backlog.head) != NULL);
+ /*
+ * Doing this zeroing at the end of this function guarantees we cannot
+ * loop forever while a wild producer attempts to flood us
+ */
+ sk->sk_backlog.len = 0;
}
/**
* Re: [PATCH V2] net: add accounting for socket backlog
2010-03-01 11:43 ` Eric Dumazet
@ 2010-03-02 2:34 ` Zhu Yi
0 siblings, 0 replies; 7+ messages in thread
From: Zhu Yi @ 2010-03-02 2:34 UTC (permalink / raw)
To: Eric Dumazet; +Cc: netdev@vger.kernel.org, David Miller
On Mon, 2010-03-01 at 19:43 +0800, Eric Dumazet wrote:
> Thinking about this again: doing the zero initialization at the very
> end of __release_sock() solves the problem of a potential infinite loop
> in __release_sock(), since the producer will hit the backlog limit.
>
> Thanks
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index 305cba4..544cf4a 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1542,6 +1542,11 @@ static void __release_sock(struct sock *sk)
>
> bh_lock_sock(sk);
> } while ((skb = sk->sk_backlog.head) != NULL);
> + /*
> +  * Doing this zeroing at the end of this function guarantees we cannot
> +  * loop forever while a wild producer attempts to flood us
> +  */
> + sk->sk_backlog.len = 0;
> }
Very good point. I'll add your sob in my next patch.
Thanks,
-yi