* Re: [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
From: Michal Kubecek @ 2018-08-16  6:16 UTC
  To: Mao Wenan
  Cc: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw,
	stable, Takashi Iwai
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

On Thu, Aug 16, 2018 at 10:50:01AM +0800, Mao Wenan wrote:
> There are five patches to fix CVE-2018-5390 in the latest mainline
> branch, but only two of them exist in stable 4.4 and 3.18:
> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> I tested the stable 4.4 kernel and found that CPU usage was still very high,
> so I think these two patches alone cannot fix CVE-2018-5390.
> Test results:
> with fix patches:     78.2%   ksoftirqd
> without fix patches:  90%     ksoftirqd
> 
> Then I tried to imitate 72cd43ba (tcp: free batches of packets in tcp_prune_ofo_queue())
> and drop at least 12.5% of sk_rcvbuf to fend off malicious attacks, keeping the simple queue
> instead of an RB tree. The result was not very good.
>  
> After analysing the stable 4.4 code and debugging the system, I found that
> searching the ofo_queue (TCP ofo uses a simple queue there) costs more cycles.
> 
> So I tried backporting "tcp: use an RB tree for ooo receive queue", which uses an RB tree
> instead of the simple queue, and then backported Eric Dumazet's five fix patches from mainline.
> The good news is that ksoftirqd drops to about 20%, which is the same as mainline now.
> 
> Stable 4.4 has already backported two patches:
> f4a3313d (tcp: avoid collapses in tcp_prune_queue() if possible)
> 3d4bf93a (tcp: detect malicious patterns in tcp_collapse_ofo_queue())
> If we want to switch from the simple queue to an RB tree to finally resolve this, we should first
> apply patch 9f5afeae (tcp: use an RB tree for ooo receive queue). However, 9f5afeae has many
> conflicts with 3d4bf93a and f4a3313d, which are part of Eric's mainline patch series
> fixing CVE-2018-5390, so I need to first revert part of the patches in stable 4.4,
> then apply 9f5afeae, and then reapply the five patches from Eric.

There seems to be an obvious mistake in one of the backports. Could you
check the results with Takashi's follow-up fix submitted at

  http://lkml.kernel.org/r/20180815095846.7734-1-tiwai@suse.de

(I would try myself but you don't mention what test you ran.)

Michal Kubecek


* [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()
From: Mao Wenan @ 2018-08-16  2:50 UTC
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.

In the future, we might add the possibility of terminating flows
that are proven to be malicious.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77130ae..c48924f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4845,6 +4845,7 @@ end:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb, *head;
 	struct rb_node *p;
 	u32 start, end;
@@ -4863,6 +4864,7 @@ new_range:
 	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
+	range_truesize = skb->truesize;
 
 	for (head = skb;;) {
 		skb = tcp_skb_next(skb, NULL);
@@ -4873,11 +4875,21 @@ new_range:
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
-				     head, skb, start, end);
+			/* Do not attempt collapsing tiny skbs */
+			if (range_truesize != head->truesize ||
+			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
+				tcp_collapse(sk, NULL, &tp->out_of_order_queue,
+					     head, skb, start, end);
+			} else {
+				sum_tiny += range_truesize;
+				if (sum_tiny > sk->sk_rcvbuf >> 3)
+					return;
+			}
+
 			goto new_range;
 		}
 
+		range_truesize += skb->truesize;
 		if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
 			start = TCP_SKB_CB(skb)->seq;
 		if (after(TCP_SKB_CB(skb)->end_seq, end))
-- 
1.8.3.1
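
For readers following the thread, the tiny-skb heuristic described in the commit message above can be sketched in userspace roughly as follows. This is only an illustration, not the kernel code: struct ofo_range, scan_ofo_ranges() and the tiny_limit parameter are hypothetical names invented here, and tiny_limit merely stands in for SKB_WITH_OVERHEAD(SK_MEM_QUANTUM), whose actual value depends on the architecture and configuration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one run of contiguous out-of-order skbs. */
struct ofo_range {
	uint32_t start;         /* first sequence number covered */
	uint32_t end;           /* one past the last sequence number covered */
	uint32_t truesize;      /* memory charged for every skb in the range */
	uint32_t head_truesize; /* memory charged for the first skb only */
};

/*
 * Walk the ranges in order and count how many would be handed to the
 * (expensive) collapse step.  A range made of a single skb whose payload
 * is below tiny_limit is skipped; once the skipped ranges account for
 * more than rcvbuf / 8 bytes of truesize, the scan gives up early.
 */
size_t scan_ofo_ranges(const struct ofo_range *r, size_t n,
		       uint32_t rcvbuf, uint32_t tiny_limit, int *gave_up)
{
	uint32_t sum_tiny = 0;
	size_t collapsed = 0;
	size_t i;

	assert(r != NULL && gave_up != NULL);
	*gave_up = 0;

	for (i = 0; i < n; i++) {
		if (r[i].truesize != r[i].head_truesize ||
		    r[i].end - r[i].start >= tiny_limit) {
			/* Several skbs, or one reasonably large skb: collapse. */
			collapsed++;
		} else {
			/* A lone tiny skb: copying it would not free memory. */
			sum_tiny += r[i].truesize;
			if (sum_tiny > rcvbuf >> 3) {
				*gave_up = 1;
				break;
			}
		}
	}
	return collapsed;
}
```

The point of the early exit is that a flow made only of lone tiny skbs makes the scan stop after roughly rcvbuf / 8 bytes of truesize instead of walking the whole tree; pruning is then left to tcp_prune_ofo_queue(), as the commit message says.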


* [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()
From: Mao Wenan @ 2018-08-16  2:50 UTC
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 8541b21e781a22dce52a74fef0b9bed00404a1cd ] 

In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c48924f..96a1e0d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4445,7 +4445,7 @@ coalesce_done:
 				/* All the bits are present. Drop. */
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPOFOMERGE);
-				__kfree_skb(skb);
+				tcp_drop(sk, skb);
 				skb = NULL;
 				tcp_dsack_set(sk, seq, end_seq);
 				goto add_sack;
@@ -4464,7 +4464,7 @@ coalesce_done:
 						 TCP_SKB_CB(skb1)->end_seq);
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPOFOMERGE);
-				__kfree_skb(skb1);
+				tcp_drop(sk, skb1);
 				goto add_sack;
 			}
 		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-- 
1.8.3.1
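
As an aside, the difference between a silent free and a counted drop can be illustrated with a small userspace sketch. It loosely mirrors the tcp_drop() / sk_drops_add() pair from patch 3/9 of this series, but struct fake_sock, struct fake_skb, fake_skb_alloc() and fake_drop() are hypothetical names made up for this illustration, and a plain counter replaces the atomic sk_drops field.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct sock and struct sk_buff. */
struct fake_sock {
	unsigned int drops;   /* plays the role of sk->sk_drops */
};

struct fake_skb {
	uint16_t gso_segs;    /* 0 for a non-aggregated packet */
};

/* Allocate a fake skb carrying the given segment count. */
struct fake_skb *fake_skb_alloc(uint16_t gso_segs)
{
	struct fake_skb *skb = malloc(sizeof(*skb));

	assert(skb != NULL);
	skb->gso_segs = gso_segs;
	return skb;
}

/*
 * Counted drop: account for at least one segment, or for every
 * aggregated segment, before releasing the buffer.
 */
void fake_drop(struct fake_sock *sk, struct fake_skb *skb)
{
	assert(sk != NULL && skb != NULL);
	sk->drops += skb->gso_segs > 1 ? skb->gso_segs : 1;
	free(skb);
}
```

Routing every discard through one helper like this is what lets a per-socket counter reflect drops in the out-of-order path as well, which is the diagnostic goal stated in the commit message.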


* [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue
From: Mao Wenan @ 2018-08-16  2:50 UTC
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

From: Yaogong Wang <wygivan@google.com>

[ Upstream commit 9f5afeae51526b3ad7b7cb21ee8b145ce6ea7a7a ]

Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering reaching the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/linux/skbuff.h   |   8 ++
 include/linux/tcp.h      |   7 +-
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 354 +++++++++++++++++++++++++++--------------------
 net/ipv4/tcp_ipv4.c      |   2 +-
 net/ipv4/tcp_minisocks.c |   1 -
 8 files changed, 241 insertions(+), 156 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c28bd8b..a490dd7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2273,6 +2273,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
+void skb_rbtree_purge(struct rb_root *root);
+
 void *netdev_alloc_frag(unsigned int fragsz);
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
@@ -2807,6 +2809,12 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 	return __pskb_trim(skb, len);
 }
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
 		for (skb = (queue)->next;					\
 		     skb != (struct sk_buff *)(queue);				\
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5b6df1a..747404d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -279,10 +279,9 @@ struct tcp_sock {
 	struct sk_buff* lost_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 
-	/* OOO segments go in this list. Note that socket lock must be held,
-	 * as we do not use sk_buff_head lock.
-	 */
-	struct sk_buff_head	out_of_order_queue;
+	/* OOO segments go in this rbtree. Socket lock must be held. */
+	struct rb_root	out_of_order_queue;
+	struct sk_buff	*ooo_last_skb; /* cache rb_last(out_of_order_queue) */
 
 	/* SACKs data, these 2 need to be together (see tcp_options_write) */
 	struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cac4a6ad..8bc259d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -649,7 +649,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (skb_queue_empty(&tp->out_of_order_queue) &&
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
 	    tp->rcv_wnd &&
 	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
 	    !tp->urg_data)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 55be076..9703924 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2378,6 +2378,25 @@ void skb_queue_purge(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_queue_purge);
 
 /**
+ *	skb_rbtree_purge - empty a skb rbtree
+ *	@root: root of the rbtree to empty
+ *
+ *	Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
+ *	the list and one reference dropped. This function does not take
+ *	any lock. Synchronization should be handled by the caller (e.g., TCP
+ *	out-of-order queue is protected by the socket lock).
+ */
+void skb_rbtree_purge(struct rb_root *root)
+{
+	struct sk_buff *skb, *next;
+
+	rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
+		kfree_skb(skb);
+
+	*root = RB_ROOT;
+}
+
+/**
  *	skb_queue_head - queue a buffer at the list head
  *	@list: list to use
  *	@newsk: buffer to queue
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a0f0a7d..8bd2874 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -382,7 +382,7 @@ void tcp_init_sock(struct sock *sk)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	__skb_queue_head_init(&tp->out_of_order_queue);
+	tp->out_of_order_queue = RB_ROOT;
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
 	INIT_LIST_HEAD(&tp->tsq_node);
@@ -2240,7 +2240,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tcp_clear_xmit_timers(sk);
 	__skb_queue_purge(&sk->sk_receive_queue);
 	tcp_write_queue_purge(sk);
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 
 	inet->inet_dport = 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5fb4e80..12edc4f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4073,7 +4073,7 @@ static void tcp_fin(struct sock *sk)
 	/* It _is_ possible, that we have something out-of-order _after_ FIN.
 	 * Probably, we should reset in this case. For now drop them.
 	 */
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 	if (tcp_is_sack(tp))
 		tcp_sack_reset(&tp->rx_opt);
 	sk_mem_reclaim(sk);
@@ -4233,7 +4233,7 @@ static void tcp_sack_remove(struct tcp_sock *tp)
 	int this_sack;
 
 	/* Empty ofo queue, hence, all the SACKs are eaten. Clear. */
-	if (skb_queue_empty(&tp->out_of_order_queue)) {
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 		tp->rx_opt.num_sacks = 0;
 		return;
 	}
@@ -4309,10 +4309,13 @@ static void tcp_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 dsack_high = tp->rcv_nxt;
+	bool fin, fragstolen, eaten;
 	struct sk_buff *skb, *tail;
-	bool fragstolen, eaten;
+	struct rb_node *p;
 
-	while ((skb = skb_peek(&tp->out_of_order_queue)) != NULL) {
+	p = rb_first(&tp->out_of_order_queue);
+	while (p) {
+		skb = rb_entry(p, struct sk_buff, rbnode);
 		if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
 			break;
 
@@ -4322,9 +4325,10 @@ static void tcp_ofo_queue(struct sock *sk)
 				dsack_high = TCP_SKB_CB(skb)->end_seq;
 			tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
 		}
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, &tp->out_of_order_queue);
 
-		__skb_unlink(skb, &tp->out_of_order_queue);
-		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
+		if (unlikely(!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))) {
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
 			tcp_drop(sk, skb);
 			continue;
@@ -4336,12 +4340,19 @@ static void tcp_ofo_queue(struct sock *sk)
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
 		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
+		fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
 		if (!eaten)
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
-		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
-			tcp_fin(sk);
-		if (eaten)
+		else
 			kfree_skb_partial(skb, fragstolen);
+
+		if (unlikely(fin)) {
+			tcp_fin(sk);
+			/* tcp_fin() purges tp->out_of_order_queue,
+			 * so we must end this loop right now.
+			 */
+			break;
+		}
 	}
 }
 
@@ -4371,8 +4382,10 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct rb_node **p, *q, *parent;
 	struct sk_buff *skb1;
 	u32 seq, end_seq;
+	bool fragstolen;
 
 	tcp_ecn_check_ce(sk, skb);
 
@@ -4387,88 +4400,86 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	inet_csk_schedule_ack(sk);
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
+	seq = TCP_SKB_CB(skb)->seq;
+	end_seq = TCP_SKB_CB(skb)->end_seq;
 	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
-		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+		   tp->rcv_nxt, seq, end_seq);
 
-	skb1 = skb_peek_tail(&tp->out_of_order_queue);
-	if (!skb1) {
+	p = &tp->out_of_order_queue.rb_node;
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 		/* Initial out of order segment, build 1 SACK. */
 		if (tcp_is_sack(tp)) {
 			tp->rx_opt.num_sacks = 1;
-			tp->selective_acks[0].start_seq = TCP_SKB_CB(skb)->seq;
-			tp->selective_acks[0].end_seq =
-						TCP_SKB_CB(skb)->end_seq;
+			tp->selective_acks[0].start_seq = seq;
+			tp->selective_acks[0].end_seq = end_seq;
 		}
-		__skb_queue_head(&tp->out_of_order_queue, skb);
+		rb_link_node(&skb->rbnode, NULL, p);
+		rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
+		tp->ooo_last_skb = skb;
 		goto end;
 	}
 
-	seq = TCP_SKB_CB(skb)->seq;
-	end_seq = TCP_SKB_CB(skb)->end_seq;
-
-	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		bool fragstolen;
-
-		if (!tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
-		} else {
-			tcp_grow_window(sk, skb);
-			kfree_skb_partial(skb, fragstolen);
-			skb = NULL;
+	/* In the typical case, we are adding an skb to the end of the list.
+	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
+	 */
+	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+coalesce_done:
+		tcp_grow_window(sk, skb);
+		kfree_skb_partial(skb, fragstolen);
+		skb = NULL;
+		goto add_sack;
+	}
+
+	/* Find place to insert this segment. Handle overlaps on the way. */
+	parent = NULL;
+	while (*p) {
+		parent = *p;
+		skb1 = rb_entry(parent, struct sk_buff, rbnode);
+		if (before(seq, TCP_SKB_CB(skb1)->seq)) {
+			p = &parent->rb_left;
+			continue;
 		}
 
-		if (!tp->rx_opt.num_sacks ||
-		    tp->selective_acks[0].end_seq != seq)
-			goto add_sack;
-
-		/* Common case: data arrive in order after hole. */
-		tp->selective_acks[0].end_seq = end_seq;
-		goto end;
-	}
-
-	/* Find place to insert this segment. */
-	while (1) {
-		if (!after(TCP_SKB_CB(skb1)->seq, seq))
-			break;
-		if (skb_queue_is_first(&tp->out_of_order_queue, skb1)) {
-			skb1 = NULL;
-			break;
+		if (before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+			if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
+				/* All the bits are present. Drop. */
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPOFOMERGE);
+				__kfree_skb(skb);
+				skb = NULL;
+				tcp_dsack_set(sk, seq, end_seq);
+				goto add_sack;
+			}
+			if (after(seq, TCP_SKB_CB(skb1)->seq)) {
+				/* Partial overlap. */
+				tcp_dsack_set(sk, seq, TCP_SKB_CB(skb1)->end_seq);
+			} else {
+				/* skb's seq == skb1's seq and skb covers skb1.
+				 * Replace skb1 with skb.
+				 */
+				rb_replace_node(&skb1->rbnode, &skb->rbnode,
+						&tp->out_of_order_queue);
+				tcp_dsack_extend(sk,
+						 TCP_SKB_CB(skb1)->seq,
+						 TCP_SKB_CB(skb1)->end_seq);
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPOFOMERGE);
+				__kfree_skb(skb1);
+				goto add_sack;
+			}
+		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+			goto coalesce_done;
 		}
-		skb1 = skb_queue_prev(&tp->out_of_order_queue, skb1);
+		p = &parent->rb_right;
 	}
 
-	/* Do skb overlap to previous one? */
-	if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-			/* All the bits are present. Drop. */
-			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			tcp_drop(sk, skb);
-			skb = NULL;
-			tcp_dsack_set(sk, seq, end_seq);
-			goto add_sack;
-		}
-		if (after(seq, TCP_SKB_CB(skb1)->seq)) {
-			/* Partial overlap. */
-			tcp_dsack_set(sk, seq,
-				      TCP_SKB_CB(skb1)->end_seq);
-		} else {
-			if (skb_queue_is_first(&tp->out_of_order_queue,
-					       skb1))
-				skb1 = NULL;
-			else
-				skb1 = skb_queue_prev(
-					&tp->out_of_order_queue,
-					skb1);
-		}
-	}
-	if (!skb1)
-		__skb_queue_head(&tp->out_of_order_queue, skb);
-	else
-		__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+	/* Insert segment into RB tree. */
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
 
-	/* And clean segments covered by new one as whole. */
-	while (!skb_queue_is_last(&tp->out_of_order_queue, skb)) {
-		skb1 = skb_queue_next(&tp->out_of_order_queue, skb);
+	/* Remove other segments covered by skb. */
+	while ((q = rb_next(&skb->rbnode)) != NULL) {
+		skb1 = rb_entry(q, struct sk_buff, rbnode);
 
 		if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
 			break;
@@ -4477,12 +4488,15 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 					 end_seq);
 			break;
 		}
-		__skb_unlink(skb1, &tp->out_of_order_queue);
+		rb_erase(&skb1->rbnode, &tp->out_of_order_queue);
 		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 				 TCP_SKB_CB(skb1)->end_seq);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
 		tcp_drop(sk, skb1);
 	}
+	/* If there is no skb after us, we are the last_skb ! */
+	if (!q)
+		tp->ooo_last_skb = skb;
 
 add_sack:
 	if (tcp_is_sack(tp))
@@ -4621,13 +4635,13 @@ queue_and_out:
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
 			tcp_fin(sk);
 
-		if (!skb_queue_empty(&tp->out_of_order_queue)) {
+		if (!RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 			tcp_ofo_queue(sk);
 
 			/* RFC2581. 4.2. SHOULD send immediate ACK, when
 			 * gap in queue is filled.
 			 */
-			if (skb_queue_empty(&tp->out_of_order_queue))
+			if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 				inet_csk(sk)->icsk_ack.pingpong = 0;
 		}
 
@@ -4679,48 +4693,76 @@ drop:
 	tcp_data_queue_ofo(sk, skb);
 }
 
+static struct sk_buff *tcp_skb_next(struct sk_buff *skb, struct sk_buff_head *list)
+{
+	if (list)
+		return !skb_queue_is_last(list, skb) ? skb->next : NULL;
+
+	return rb_entry_safe(rb_next(&skb->rbnode), struct sk_buff, rbnode);
+}
+
 static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
-					struct sk_buff_head *list)
+					struct sk_buff_head *list,
+					struct rb_root *root)
 {
-	struct sk_buff *next = NULL;
+	struct sk_buff *next = tcp_skb_next(skb, list);
 
-	if (!skb_queue_is_last(list, skb))
-		next = skb_queue_next(list, skb);
+	if (list)
+		__skb_unlink(skb, list);
+	else
+		rb_erase(&skb->rbnode, root);
 
-	__skb_unlink(skb, list);
 	__kfree_skb(skb);
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
 
 	return next;
 }
 
+/* Insert skb into rb tree, ordered by TCP_SKB_CB(skb)->seq */
+static void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct sk_buff *skb1;
+
+	while (*p) {
+		parent = *p;
+		skb1 = rb_entry(parent, struct sk_buff, rbnode);
+		if (before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb1)->seq))
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+	}
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, root);
+}
+
 /* Collapse contiguous sequence of skbs head..tail with
  * sequence numbers start..end.
  *
- * If tail is NULL, this means until the end of the list.
+ * If tail is NULL, this means until the end of the queue.
  *
  * Segments with FIN/SYN are not collapsed (only because this
  * simplifies code)
  */
 static void
-tcp_collapse(struct sock *sk, struct sk_buff_head *list,
-	     struct sk_buff *head, struct sk_buff *tail,
-	     u32 start, u32 end)
+tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
+	     struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end)
 {
-	struct sk_buff *skb, *n;
+	struct sk_buff *skb = head, *n;
+	struct sk_buff_head tmp;
 	bool end_of_skbs;
 
 	/* First, check that queue is collapsible and find
-	 * the point where collapsing can be useful. */
-	skb = head;
+	 * the point where collapsing can be useful.
+	 */
 restart:
-	end_of_skbs = true;
-	skb_queue_walk_from_safe(list, skb, n) {
-		if (skb == tail)
-			break;
+	for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) {
+		n = tcp_skb_next(skb, list);
+
 		/* No new bits? It is possible on ofo queue. */
 		if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
-			skb = tcp_collapse_one(sk, skb, list);
+			skb = tcp_collapse_one(sk, skb, list, root);
 			if (!skb)
 				break;
 			goto restart;
@@ -4738,13 +4780,10 @@ restart:
 			break;
 		}
 
-		if (!skb_queue_is_last(list, skb)) {
-			struct sk_buff *next = skb_queue_next(list, skb);
-			if (next != tail &&
-			    TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(next)->seq) {
-				end_of_skbs = false;
-				break;
-			}
+		if (n && n != tail &&
+		    TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) {
+			end_of_skbs = false;
+			break;
 		}
 
 		/* Decided to skip this, advance start seq. */
@@ -4754,17 +4793,22 @@ restart:
 	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 		return;
 
+	__skb_queue_head_init(&tmp);
+
 	while (before(start, end)) {
 		int copy = min_t(int, SKB_MAX_ORDER(0, 0), end - start);
 		struct sk_buff *nskb;
 
 		nskb = alloc_skb(copy, GFP_ATOMIC);
 		if (!nskb)
-			return;
+			break;
 
 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
-		__skb_queue_before(list, skb, nskb);
+		if (list)
+			__skb_queue_before(list, skb, nskb);
+		else
+			__skb_queue_tail(&tmp, nskb); /* defer rbtree insertion */
 		skb_set_owner_r(nskb, sk);
 
 		/* Copy data, releasing collapsed skbs. */
@@ -4782,14 +4826,17 @@ restart:
 				start += size;
 			}
 			if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
-				skb = tcp_collapse_one(sk, skb, list);
+				skb = tcp_collapse_one(sk, skb, list, root);
 				if (!skb ||
 				    skb == tail ||
 				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
-					return;
+					goto end;
 			}
 		}
 	}
+end:
+	skb_queue_walk_safe(&tmp, skb, n)
+		tcp_rbtree_insert(root, skb);
 }
 
 /* Collapse ofo queue. Algorithm: select contiguous sequence of skbs
@@ -4798,43 +4845,43 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
-	struct sk_buff *head;
+	struct sk_buff *skb, *head;
+	struct rb_node *p;
 	u32 start, end;
 
-	if (!skb)
+	p = rb_first(&tp->out_of_order_queue);
+	skb = rb_entry_safe(p, struct sk_buff, rbnode);
+new_range:
+	if (!skb) {
+		p = rb_last(&tp->out_of_order_queue);
+		/* Note: This is possible p is NULL here. We do not
+		 * use rb_entry_safe(), as ooo_last_skb is valid only
+		 * if rbtree is not empty.
+		 */
+		tp->ooo_last_skb = rb_entry(p, struct sk_buff, rbnode);
 		return;
-
+	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
-	head = skb;
-
-	for (;;) {
-		struct sk_buff *next = NULL;
 
-		if (!skb_queue_is_last(&tp->out_of_order_queue, skb))
-			next = skb_queue_next(&tp->out_of_order_queue, skb);
-		skb = next;
+	for (head = skb;;) {
+		skb = tcp_skb_next(skb, NULL);
 
-		/* Segment is terminated when we see gap or when
-		 * we are at the end of all the queue. */
+		/* Range is terminated when we see a gap or when
+		 * we are at the queue end.
+		 */
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, &tp->out_of_order_queue,
+			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
 				     head, skb, start, end);
-			head = skb;
-			if (!skb)
-				break;
-			/* Start new segment */
+			goto new_range;
+		}
+
+		if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
 			start = TCP_SKB_CB(skb)->seq;
+		if (after(TCP_SKB_CB(skb)->end_seq, end))
 			end = TCP_SKB_CB(skb)->end_seq;
-		} else {
-			if (before(TCP_SKB_CB(skb)->seq, start))
-				start = TCP_SKB_CB(skb)->seq;
-			if (after(TCP_SKB_CB(skb)->end_seq, end))
-				end = TCP_SKB_CB(skb)->end_seq;
-		}
 	}
 }
 
@@ -4845,23 +4892,36 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	bool res = false;
+	struct rb_node *node, *prev;
 
-	if (!skb_queue_empty(&tp->out_of_order_queue)) {
-		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-		__skb_queue_purge(&tp->out_of_order_queue);
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
+		return false;
 
-		/* Reset SACK state.  A conforming SACK implementation will
-		 * do the same at a timeout based retransmit.  When a connection
-		 * is in a sad state like this, we care only about integrity
-		 * of the connection not performance.
-		 */
-		if (tp->rx_opt.sack_ok)
-			tcp_sack_reset(&tp->rx_opt);
+	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
+
+	node = &tp->ooo_last_skb->rbnode;
+	do {
+		prev = rb_prev(node);
+		rb_erase(node, &tp->out_of_order_queue);
+		__kfree_skb(rb_to_skb(node));
 		sk_mem_reclaim(sk);
-		res = true;
-	}
-	return res;
+		if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+		    !tcp_under_memory_pressure(sk))
+			break;
+
+		node = prev;
+	} while (node);
+	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
+
+	/* Reset SACK state.  A conforming SACK implementation will
+	 * do the same at a timeout based retransmit.  When a connection
+	 * is in a sad state like this, we care only about integrity
+	 * of the connection not performance.
+	 */
+	if (tp->rx_opt.sack_ok)
+		tcp_sack_reset(&tp->rx_opt);
+
+	return true;
 }
 
 /* Reduce allocated memory if we can, trying to get
@@ -4886,7 +4946,7 @@ static int tcp_prune_queue(struct sock *sk)
 
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
-		tcp_collapse(sk, &sk->sk_receive_queue,
+		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
 			     skb_peek(&sk->sk_receive_queue),
 			     NULL,
 			     tp->copied_seq, tp->rcv_nxt);
@@ -4991,7 +5051,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    /* We ACK each frame or... */
 	    tcp_in_quickack_mode(sk) ||
 	    /* We have out of order data. */
-	    (ofo_possible && skb_peek(&tp->out_of_order_queue))) {
+	    (ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
 		/* Then ack it now */
 		tcp_send_ack(sk);
 	} else {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 01715fc..ee8399f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1830,7 +1830,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_write_queue_purge(sk);
 
 	/* Cleans up our, hopefully empty, out_of_order_queue. */
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Clean up the MD5 key list, if any */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 4c1c94f..81c633d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -495,7 +495,6 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 		newtp->snd_cwnd_cnt = 0;
 
 		tcp_init_xmit_timers(newsk);
-		__skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
 
 		newtp->rx_opt.saw_tstamp = 0;
-- 
1.8.3.1
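
To make the ordering rule used by tcp_rbtree_insert() in the patch above easier to follow, here is a minimal userspace sketch of inserting segments into a binary search tree keyed by sequence number, using a wraparound-safe comparison in the spirit of the kernel's before(). Note the assumptions: this is a plain, unbalanced binary search tree rather than a red-black tree, so it shows only the ordering and not the bounded-latency property, and struct seg, seq_before(), seg_insert() and seg_first() are hypothetical names introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an skb linked into the out-of-order tree. */
struct seg {
	uint32_t seq;            /* starting sequence number */
	struct seg *left, *right;
};

/* True if a precedes b in 32-bit sequence space, across wraparound. */
static inline int seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Link node into the tree rooted at *root, ordered by seq. */
void seg_insert(struct seg **root, struct seg *node)
{
	struct seg **p = root;

	assert(root != NULL && node != NULL);
	node->left = node->right = NULL;

	while (*p) {
		if (seq_before(node->seq, (*p)->seq))
			p = &(*p)->left;
		else
			p = &(*p)->right;
	}
	*p = node;
}

/* Leftmost node, i.e. the segment with the lowest sequence number. */
const struct seg *seg_first(const struct seg *root)
{
	if (!root)
		return NULL;
	while (root->left)
		root = root->left;
	return root;
}
```

With a comparison like this, a segment starting at 0xfffffff0 sorts ahead of one starting at 0x10 even though its raw value is larger, which is why the descent cannot use a plain `<` on the sequence numbers.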


* [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets
From: Mao Wenan @ 2018-08-16  2:50 UTC
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 532182cd610782db8c18230c2747626562032205 ]

Now that ss can report sk_drops, we can instruct TCP to increment
this per-socket counter when it drops an incoming frame, to refine
monitoring and debugging.

The following patch takes care of listener drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/net/sock.h   |  7 +++++++
 net/ipv4/tcp_input.c | 33 ++++++++++++++++++++-------------
 net/ipv4/tcp_ipv4.c  |  1 +
 net/ipv6/tcp_ipv6.c  |  1 +
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3d5ff74..5770757 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2139,6 +2139,13 @@ sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
 	SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
 }
 
+static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
+{
+	int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+	atomic_add(segs, &sk->sk_drops);
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 			   struct sk_buff *skb);
 void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df2f342..5fb4e80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,12 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static void tcp_drop(struct sock *sk, struct sk_buff *skb)
+{
+	sk_drops_add(sk, skb);
+	__kfree_skb(skb);
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4320,7 +4326,7 @@ static void tcp_ofo_queue(struct sock *sk)
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			continue;
 		}
 		SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
@@ -4372,7 +4378,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFODROP);
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
 
@@ -4436,7 +4442,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
 			/* All the bits are present. Drop. */
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			skb = NULL;
 			tcp_dsack_set(sk, seq, end_seq);
 			goto add_sack;
@@ -4475,7 +4481,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 				 TCP_SKB_CB(skb1)->end_seq);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-		__kfree_skb(skb1);
+		tcp_drop(sk, skb1);
 	}
 
 add_sack:
@@ -4558,12 +4564,13 @@ err:
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int eaten = -1;
 	bool fragstolen = false;
+	int eaten = -1;
 
-	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-		goto drop;
-
+	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
+		__kfree_skb(skb);
+		return;
+	}
 	skb_dst_drop(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
@@ -4645,7 +4652,7 @@ out_of_window:
 		tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 		inet_csk_schedule_ack(sk);
 drop:
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
 
@@ -5220,7 +5227,7 @@ syn_challenge:
 	return true;
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 	return false;
 }
 
@@ -5438,7 +5445,7 @@ csum_error:
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
@@ -5668,7 +5675,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 						  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			return 0;
 		} else {
 			tcp_send_ack(sk);
@@ -6025,7 +6032,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 
 	if (!queued) {
 discard:
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 	}
 	return 0;
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index eeda67c..01715fc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1716,6 +1716,7 @@ discard_it:
 	return 0;
 
 discard_and_relse:
+	sk_drops_add(sk, skb);
 	sock_put(sk);
 	goto discard_it;
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 90abe88..d6c1911 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1505,6 +1505,7 @@ discard_it:
 	return 0;
 
 discard_and_relse:
+	sk_drops_add(sk, skb);
 	sock_put(sk);
 	goto discard_it;
 
-- 
1.8.3.1

^ permalink raw reply related
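
The accounting done by sk_drops_add() in the patch above can be modelled in
userspace. This is only a minimal sketch, not kernel code: struct fake_skb,
struct fake_sock and the plain unsigned counter are hypothetical stand-ins
(the patch reads skb_shinfo(skb)->gso_segs and updates an atomic_t). It
illustrates the effect of the max_t(u16, 1, ...) clamp: an skb whose gso_segs
field is 0 still counts as one drop, while an aggregated skb counts once per
segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the part of struct sk_buff that matters here. */
struct fake_skb {
	uint16_t gso_segs;	/* 0 when the skb was not aggregated */
};

/* Hypothetical stand-in for struct sock; the kernel uses an atomic_t. */
struct fake_sock {
	unsigned int drops;
};

/* Count at least one drop per skb, or one per aggregated segment. */
static void fake_sk_drops_add(struct fake_sock *sk, const struct fake_skb *skb)
{
	unsigned int segs;

	assert(sk != NULL && skb != NULL);
	segs = skb->gso_segs > 1 ? skb->gso_segs : 1;
	sk->drops += segs;
}
```

Dropping a plain skb and then a five-segment skb through this model leaves
the counter at 6.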

* [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out-of-order
packets always hits the condition:

if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
	tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2]).

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers an O(N^2) attack surface to malicious peers.

It is better not to attempt anything before full queue capacity is
reached, forcing the attacker to spend lots of resources and allowing
us to detect the abuse more easily.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 32225dc..77130ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4948,6 +4948,9 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
+	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+		return 0;
+
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
-- 
1.8.3.1

^ permalink raw reply related
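
The effect of the three added lines in tcp_prune_queue() can be sketched as a
small userspace model. Everything named fake_* is a hypothetical stand-in, and
fake_clamp_window() is a deliberately simplified assumption that only grows
the buffer toward the allocated amount up to a cap (the role the commit
message assigns to tcp_rmem[2]). The real tcp_clamp_window() checks more
conditions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of the socket fields involved. */
struct fake_sock {
	int rmem_alloc;		/* memory currently charged to the socket */
	int rcvbuf;		/* current receive buffer limit */
	int rcvbuf_max;		/* cap, standing in for tcp_rmem[2] */
};

/* Simplified assumption: raise rcvbuf to rmem_alloc, never above the cap. */
static void fake_clamp_window(struct fake_sock *sk)
{
	int want = sk->rmem_alloc < sk->rcvbuf_max ? sk->rmem_alloc
						    : sk->rcvbuf_max;

	if (want > sk->rcvbuf)
		sk->rcvbuf = want;
}

/* Returns true only when the expensive collapse pass would still run. */
static bool fake_prune_needs_collapse(struct fake_sock *sk)
{
	assert(sk != NULL);
	if (sk->rmem_alloc >= sk->rcvbuf)
		fake_clamp_window(sk);

	/* The guard added by the patch: buffer not exceeded, nothing to do. */
	if (sk->rmem_alloc <= sk->rcvbuf)
		return false;

	return true;
}
```

In this model a socket with 3000 bytes charged, a 2000-byte buffer and a
6000-byte cap is simply grown to 3000 and skips the collapse; only a socket
that has exceeded the cap reaches the collapse path.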

* [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.

We need to change the simple queue to an RB tree to finally fix CVE-2018-5390.
Revert this patch first, because it causes many conflicts when applying the
earlier patch 9f5afeae (tcp: use an RB tree for ooo receive queue). After
that, we will reapply the patch series from Eric.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 995b2bc..df2f342 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4877,9 +4877,6 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
-		return 0;
-
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue,
-- 
1.8.3.1

^ permalink raw reply related

* [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.

We need to change the simple queue to an RB tree to finally fix CVE-2018-5390.
Revert this patch first, because it causes many conflicts when applying the
earlier patch 9f5afeae (tcp: use an RB tree for ooo receive queue). After
that, we will reapply the patch series from Eric.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a261e0..995b2bc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4791,7 +4791,6 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
 	struct sk_buff *head;
 	u32 start, end;
@@ -4801,7 +4800,6 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
-	range_truesize = skb->truesize;
 	head = skb;
 
 	for (;;) {
@@ -4816,24 +4814,14 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			/* Do not attempt collapsing tiny skbs */
-			if (range_truesize != head->truesize ||
-			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
-				tcp_collapse(sk, &tp->out_of_order_queue,
-					     head, skb, start, end);
-			} else {
-				sum_tiny += range_truesize;
-				if (sum_tiny > sk->sk_rcvbuf >> 3)
-					return;
-			}
-
+			tcp_collapse(sk, &tp->out_of_order_queue,
+				     head, skb, start, end);
 			head = skb;
 			if (!skb)
 				break;
 			/* Start new segment */
 			start = TCP_SKB_CB(skb)->seq;
 			end = TCP_SKB_CB(skb)->end_seq;
-			range_truesize = skb->truesize;
 		} else {
 			if (before(TCP_SKB_CB(skb)->seq, start))
 				start = TCP_SKB_CB(skb)->seq;
-- 
1.8.3.1

^ permalink raw reply related
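
For reference, the heuristic removed by this revert can be modelled outside
the kernel. The sketch below is an illustration under stated assumptions:
struct fake_range and fake_collapse_ofo() are hypothetical names, each array
element stands for one maximal run of contiguous out-of-order data, and
tiny_limit plays the role of SKB_WITH_OVERHEAD(SK_MEM_QUANTUM). A run is
collapsed only if it holds more than one skb or covers enough sequence space.
Otherwise its truesize is added to sum_tiny, and the walk stops once sum_tiny
exceeds one eighth of the receive buffer (rcvbuf >> 3).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of one contiguous run in the out-of-order queue. */
struct fake_range {
	unsigned int len;		/* end - start, in sequence space */
	unsigned int truesize;		/* memory charged for the whole run */
	unsigned int head_truesize;	/* memory charged for its first skb */
};

/*
 * Walk the runs and return how many would be handed to the collapse step
 * before the walk finishes or bails out on too many tiny single-skb runs.
 */
static size_t fake_collapse_ofo(const struct fake_range *r, size_t n,
				unsigned int rcvbuf, unsigned int tiny_limit)
{
	unsigned int sum_tiny = 0;
	size_t collapsed = 0;
	size_t i;

	assert(r != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (r[i].truesize != r[i].head_truesize ||
		    r[i].len >= tiny_limit) {
			collapsed++;	/* worth collapsing */
		} else {
			sum_tiny += r[i].truesize;
			if (sum_tiny > (rcvbuf >> 3))
				break;	/* looks like an attack: stop early */
		}
	}
	return collapsed;
}
```

With a small receive buffer, two tiny runs are enough to end the walk before
later runs are examined; with a large buffer the same input is walked to the
end.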

* Re: [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
From: maowenan @ 2018-08-16  1:55 UTC (permalink / raw)
  To: Greg KH; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <20180815131807.GA31330@kroah.com>



On 2018/8/15 21:18, Greg KH wrote:
> On Wed, Aug 15, 2018 at 09:21:00PM +0800, Mao Wenan wrote:
>> This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.
> 
> I need a reason why, and a signed-off-by line :(

Stable 4.4 only backported two patches to fix CVE-2018-5390. I have tested them and they
do not fully fix the issue, because older kernels use a simple queue for out-of-order
segments, so we need to change the simple queue to an RB tree to finally resolve it.
But 9f5afeae (tcp: use an RB tree for ooo receive queue) has many conflicts with
"tcp: detect malicious patterns in tcp_collapse_ofo_queue()" and
"tcp: avoid collapses in tcp_prune_queue() if possible", which are part of the patch
series from Eric in mainline that fixes CVE-2018-5390. So I need to revert those two
patches in stable 4.4 first, then apply 9f5afeae, and then reapply the five patches from Eric.

> 
> thanks,
> 
> greg k-h
> 
> 

^ permalink raw reply

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
From: Jason Wang @ 2018-08-16  4:24 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: David Ahern, Jesper Dangaard Brouer, netdev, linux-kernel, ast,
	daniel, mst
In-Reply-To: <20180816040517.7vjm4bwxosyzvapu@ast-mbp.dhcp.thefacebook.com>



On 2018年08月16日 12:05, Alexei Starovoitov wrote:
> On Thu, Aug 16, 2018 at 11:34:20AM +0800, Jason Wang wrote:
>>> Nothing about the topology is hard coded. The idea is to mimic a
>>> hardware pipeline and acknowledging that a port device can have an
>>> arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc
>> I may miss something but BPF forbids loop. Without a loop how can we make
>> sure all stacked devices is enumerated correctly without knowing the
>> topology in advance?
> not following. why do you need a loop to implement macvlan as an xdp prog?
> if loop is needed, such algorithm is not going to scale whether
> it's implemented as bpf program or as in-kernel c code.

David said the port can have arbitrary layers stacked on it. So if we 
try to enumerate them before making forwarding decisions purely in a BPF 
program, it looks to me like a loop is needed here.

Thanks

^ permalink raw reply

* Re: [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)
From: maowenan @ 2018-08-16  1:20 UTC (permalink / raw)
  To: Greg KH; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <20180815154131.GA12619@kroah.com>



On 2018/8/15 23:41, Greg KH wrote:
> On Wed, Aug 15, 2018 at 03:24:32PM +0200, Greg KH wrote:
>> On Wed, Aug 15, 2018 at 09:20:59PM +0800, Mao Wenan wrote:
>>> There are five patches to fix CVE-2018-5390 in latest mainline 
>>> branch, but only two patches exist in stable 4.4 and 3.18: 
>>> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
>>> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
>>> but I have tested with these patches, and found the cpu usage was very high.
>>> test results:
>>> with fix patch: 78.2%   ksoftirqd
>>> no fix patch:   90%     ksoftirqd
>>>
>>> After analysing the codes of stable 4.4, and debuging the 
>>> system, the search of ofo_queue(tcp ofo using a simple queue) cost more cycles.
>>> So I think only two patches can't fix the CVE-2018-5390.
>>> So I try to backport "tcp: use an RB tree for ooo receive queue" using RB tree 
>>> instead of simple queue, then backport Eric Dumazet 5 fixed patches in mainline,
>>> good news is that ksoftirqd is turn to about 20%, which is the same with mainline now.
>>
>> Thanks for doing this work, I had some questions on the individual
>> patches.  Can you address them and resend?
> 
> Also, always cc: the stable@vger list when sending stable patches so
> that others can review and comment on them.

OK, I will resend the patches later after refining them.

> 
> thanks,
> 
> greg k-h
> 
> .
> 

^ permalink raw reply

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
From: Alexei Starovoitov @ 2018-08-16  4:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: David Ahern, Jesper Dangaard Brouer, netdev, linux-kernel, ast,
	daniel, mst
In-Reply-To: <2792239a-ed3b-d66e-0c1c-e99455311eff@redhat.com>

On Thu, Aug 16, 2018 at 11:34:20AM +0800, Jason Wang wrote:
> > Nothing about the topology is hard coded. The idea is to mimic a
> > hardware pipeline and acknowledging that a port device can have an
> > arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc
> 
> I may miss something but BPF forbids loop. Without a loop how can we make
> sure all stacked devices is enumerated correctly without knowing the
> topology in advance?

not following. why do you need a loop to implement macvlan as an xdp prog?
if loop is needed, such algorithm is not going to scale whether
it's implemented as bpf program or as in-kernel c code.

^ permalink raw reply

* RE: [PATCH] net: stmmac: Add SMC support for EMAC System Manager register
From: Ooi, Joyce @ 2018-08-16  3:39 UTC (permalink / raw)
  To: David Miller
  Cc: peppe.cavallaro@st.com, alexandre.torgue@st.com,
	joabreu@synopsys.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org, Ong, Hean Loong, Vandervennet, Yves
In-Reply-To: <20180813.094806.2117966073798365808.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Tuesday, August 14, 2018 12:48 AM
> To: Ooi, Joyce <joyce.ooi@intel.com>
> Cc: peppe.cavallaro@st.com; alexandre.torgue@st.com;
> joabreu@synopsys.com; netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> Ong, Hean Loong <hean.loong.ong@intel.com>; Vandervennet, Yves
> <yves.vandervennet@intel.com>
> Subject: Re: [PATCH] net: stmmac: Add SMC support for EMAC System Manager
> register
> 
> From: "Ooi, Joyce" <joyce.ooi@intel.com>
> Date: Sun, 12 Aug 2018 23:41:34 -0700
> 
> > As there is restriction to access to EMAC System Manager registers in
> > the kernel for Intel Stratix10, the use of SMC calls are required and
> > added in dwmac-socfpga driver.
> >
> > Signed-off-by: Ooi, Joyce <joyce.ooi@intel.com>
> > ---
> > This patch is dependent on https://lkml.org/lkml/2018/7/26/624
> 
> I guess I cannot apply this to my networking tree then.
> 
> I would suggest that you make a helper in a header file which dos the special
> SMC EMAC accesses, or alternatively the regular regmap access, based upon the
> CPP ifdef.
Could you please explain what you mean by 'a helper in a header file'?

Thanks.
> 
> That way you won't have to put all of those CPP tests in the foo.c code.
> 
> Thanks.

^ permalink raw reply
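
One way to read the "helper in a header file" suggestion is sketched below.
This is an assumption about what was meant, not code from the driver: every
fake_* name, the FAKE_CONFIG_USE_SMC macro and the register variable are
hypothetical, and the two backends are stubs standing in for the SMC call and
the regular regmap access. The point is that the preprocessor test lives once
inside the helper, so the driver's .c code calls the helper without any
#ifdef of its own.

```c
#include <assert.h>
#include <stdint.h>

/* ---- would live in a header (hypothetical) ---- */

static uint32_t fake_sysmgr_reg;	/* stands in for the hardware register */
static int fake_smc_calls;		/* counts uses of the SMC backend */

/* Stub for the restricted path that must go through an SMC call. */
static inline int fake_smc_write(uint32_t offset, uint32_t val)
{
	(void)offset;
	fake_smc_calls++;
	fake_sysmgr_reg = val;
	return 0;
}

/* Stub for the regular regmap-style access. */
static inline int fake_regmap_write(uint32_t offset, uint32_t val)
{
	(void)offset;
	fake_sysmgr_reg = val;
	return 0;
}

/* The helper: the only place that tests the configuration macro. */
static inline int fake_emac_sysmgr_write(uint32_t offset, uint32_t val)
{
#ifdef FAKE_CONFIG_USE_SMC
	return fake_smc_write(offset, val);
#else
	return fake_regmap_write(offset, val);
#endif
}

/* ---- would live in the driver .c file: no #ifdef needed ---- */

static int fake_set_phy_mode(uint32_t mode)
{
	assert(mode <= 3);
	return fake_emac_sysmgr_write(0x0, mode);
}
```

Building with -DFAKE_CONFIG_USE_SMC switches every caller to the SMC backend
without touching fake_set_phy_mode().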

* Re: [PATCH net] veth: Free queues on link delete
From: kbuild test robot @ 2018-08-16  0:44 UTC (permalink / raw)
  To: dsahern; +Cc: kbuild-all, netdev, davem, makita.toshiaki, David Ahern
In-Reply-To: <20180814223657.32101-1-dsahern@kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1734 bytes --]

Hi David,

I love your patch! Yet something to improve:

[auto build test ERROR on net/master]

url:    https://github.com/0day-ci/linux/commits/dsahern-kernel-org/veth-Free-queues-on-link-delete/20180816-073955
config: i386-randconfig-x016-201832 (attached as .config)
compiler: gcc-7 (Debian 7.3.0-16) 7.3.0
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers//net/veth.c: In function 'veth_dellink':
>> drivers//net/veth.c:476:2: error: implicit declaration of function 'veth_free_queues'; did you mean 'veth_dev_free'? [-Werror=implicit-function-declaration]
     veth_free_queues(dev);
     ^~~~~~~~~~~~~~~~
     veth_dev_free
   cc1: some warnings being treated as errors

vim +476 drivers//net/veth.c

   470	
   471	static void veth_dellink(struct net_device *dev, struct list_head *head)
   472	{
   473		struct veth_priv *priv;
   474		struct net_device *peer;
   475	
 > 476		veth_free_queues(dev);
   477		priv = netdev_priv(dev);
   478		peer = rtnl_dereference(priv->peer);
   479	
   480		/* Note : dellink() is called from default_device_exit_batch(),
   481		 * before a rcu_synchronize() point. The devices are guaranteed
   482		 * not being freed before one RCU grace period.
   483		 */
   484		RCU_INIT_POINTER(priv->peer, NULL);
   485		unregister_netdevice_queue(dev, head);
   486	
   487		if (peer) {
   488			priv = netdev_priv(peer);
   489			RCU_INIT_POINTER(priv->peer, NULL);
   490			unregister_netdevice_queue(peer, head);
   491		}
   492	}
   493	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 34697 bytes --]

^ permalink raw reply

* Re: [RFC PATCH net-next V2 0/6] XDP rx handler
From: Jason Wang @ 2018-08-16  3:34 UTC (permalink / raw)
  To: David Ahern, Jesper Dangaard Brouer
  Cc: Alexei Starovoitov, netdev, linux-kernel, ast, daniel, mst
In-Reply-To: <f4c0346d-6665-5312-0310-ffbda505b7ec@gmail.com>



On 2018年08月16日 01:17, David Ahern wrote:
> On 8/14/18 6:29 PM, Jason Wang wrote:
>>
>> On 2018年08月14日 22:03, David Ahern wrote:
>>> On 8/14/18 7:20 AM, Jason Wang wrote:
>>>> On 2018年08月14日 18:17, Jesper Dangaard Brouer wrote:
>>>>> On Tue, 14 Aug 2018 15:59:01 +0800
>>>>> Jason Wang <jasowang@redhat.com> wrote:
>>>>>
>>>>>> On 2018年08月14日 08:32, Alexei Starovoitov wrote:
>>>>>>> On Mon, Aug 13, 2018 at 11:17:24AM +0800, Jason Wang wrote:
>>>>>>>> Hi:
>>>>>>>>
>>>>>>>> This series tries to implement XDP support for rx hanlder. This
>>>>>>>> would
>>>>>>>> be useful for doing native XDP on stacked device like macvlan,
>>>>>>>> bridge
>>>>>>>> or even bond.
>>>>>>>>
>>>>>>>> The idea is simple, let stacked device register a XDP rx handler.
>>>>>>>> And
>>>>>>>> when driver return XDP_PASS, it will call a new helper xdp_do_pass()
>>>>>>>> which will try to pass XDP buff to XDP rx handler directly. XDP rx
>>>>>>>> handler may then decide how to proceed, it could consume the
>>>>>>>> buff, ask
>>>>>>>> driver to drop the packet or ask the driver to fallback to normal
>>>>>>>> skb
>>>>>>>> path.
>>>>>>>>
>>>>>>>> A sample XDP rx handler was implemented for macvlan. And virtio-net
>>>>>>>> (mergeable buffer case) was converted to call xdp_do_pass() as an
>>>>>>>> example. For ease comparision, generic XDP support for rx handler
>>>>>>>> was
>>>>>>>> also implemented.
>>>>>>>>
>>>>>>>> Compared to skb mode XDP on macvlan, native XDP on macvlan
>>>>>>>> (XDP_DROP)
>>>>>>>> shows about 83% improvement.
>>>>>>> I'm missing the motiviation for this.
>>>>>>> It seems performance of such solution is ~1M packet per second.
>>>>>> Notice it was measured by virtio-net which is kind of slow.
>>>>>>
>>>>>>> What would be a real life use case for such feature ?
>>>>>> I had another run on top of 10G mlx4 and macvlan:
>>>>>>
>>>>>> XDP_DROP on mlx4: 14.0Mpps
>>>>>> XDP_DROP on macvlan: 10.05Mpps
>>>>>>
>>>>>> Perf shows macvlan_hash_lookup() and indirect call to
>>>>>> macvlan_handle_xdp() are the reasons for the number drop. I think the
>>>>>> numbers are acceptable. And we could try more optimizations on top.
>>>>>>
>>>>>> So here's real life use case is trying to have an fast XDP path for rx
>>>>>> handler based device:
>>>>>>
>>>>>> - For containers, we can run XDP for macvlan (~70% of wire speed).
>>>>>> This
>>>>>> allows a container specific policy.
>>>>>> - For VM, we can implement macvtap XDP rx handler on top. This
>>>>>> allow us
>>>>>> to forward packet to VM without building skb in the setup of macvtap.
>>>>>> - The idea could be used by other rx handler based device like bridge,
>>>>>> we may have a XDP fast forwarding path for bridge.
>>>>>>
>>>>>>> Another concern is that XDP users expect to get line rate performance
>>>>>>> and native XDP delivers it. 'generic XDP' is a fallback only
>>>>>>> mechanism to operate on NICs that don't have native XDP yet.
>>>>>> So I can replace generic XDP TX routine with a native one for macvlan.
>>>>> If you simply implement ndo_xdp_xmit() for macvlan, and instead use
>>>>> XDP_REDIRECT, then we are basically done.
>>>> As I replied in another thread this probably not true. Its
>>>> ndo_xdp_xmit() just need to call under layer device's ndo_xdp_xmit()
>>>> except for the case of bridge mode.
>>>>
>>>>>>> Toshiaki's veth XDP work fits XDP philosophy and allows
>>>>>>> high speed networking to be done inside containers after veth.
>>>>>>> It's trying to get to line rate inside container.
>>>>>> This is one of the goal of this series as well. I agree veth XDP work
>>>>>> looks pretty fine, but it only work for a specific setup I believe
>>>>>> since
>>>>>> it depends on XDP_REDIRECT which is supported by few drivers (and
>>>>>> there's no VF driver support).
>>>>> The XDP_REDIRECT (RX-side) is trivial to add to drivers.  It is a bad
>>>>> argument that only a few drivers implement this.  Especially since all
>>>>> drivers also need to be extended with your proposed xdp_do_pass() call.
>>>>>
>>>>> (rant) The thing that is delaying XDP_REDIRECT adaption in drivers, is
>>>>> that it is harder to implement the TX-side, as the ndo_xdp_xmit() call
>>>>> have to allocate HW TX-queue resources.  If we disconnect RX and TX
>>>>> side of redirect, then we can implement RX-side in an afternoon.
>>>> That's exactly the point, ndo_xdp_xmit() may requires per CPU TX queues
>>>> which breaks assumptions of some drivers. And since we don't disconnect
>>>> RX and TX, it looks to me the partial implementation is even worse?
>>>> Consider a user can redirect from mlx4 to ixgbe but not ixgbe to mlx4.
>>>>
>>>>>> And in order to make it work for a end
>>>>>> user, the XDP program still need logic like hash(map) lookup to
>>>>>> determine the destination veth.
>>>>> That _is_ the general idea behind XDP and eBPF, that we need to add
>>>>> logic
>>>>> that determine the destination.  The kernel provides the basic
>>>>> mechanisms for moving/redirecting packets fast, and someone else
>>>>> builds an orchestration tool like Cilium, that adds the needed logic.
>>>> Yes, so my reply is for the concern about performance. I meant anyway
>>>> the hash lookup will make it not hit the wire speed.
>>>>
>>>>> Did you notice that we (Ahern) added bpf_fib_lookup a FIB route lookup
>>>>> accessible from XDP.
>>>> Yes.
>>>>
>>>>> For macvlan, I imagine that we could add a BPF helper that allows you
>>>>> to lookup/call macvlan_hash_lookup().
>>>> That's true but we still need a method to feed macvlan with XDP buff.
>>>> I'm not sure if this could be treated as another kind of redirection,
>>>> but ndo_xdp_xmit() could not be used for this case for sure. Compared to
>>>> redirection, XDP rx handler has its own advantages:
>>>>
>>>> 1) Use the exist API and userspace to setup the network topology instead
>>>> of inventing new tools and its own specific API. This means user can
>>>> just setup macvlan (macvtap, bridge or other) as usual and simply attach
>>>> XDP programs to both macvlan and its under layer device.
>>>> 2) Ease the processing of complex logic, XDP can not do cloning or
>>>> reference counting. We can differ those cases and let normal networking
>>>> stack to deal with such packets seamlessly. I believe this is one of the
>>>> advantage of XDP. This makes us to focus on the fast path and greatly
>>>> simplify the codes.
>>>>
>>>> Like ndo_xdp_xmit(), XDP rx handler is used to feed RX handler with XDP
>>>> buff. It's just another basic mechanism. Policy is still done by XDP
>>>> program itself.
>>>>
>>> I have been looking into handling stacked devices via lookup helper
>>> functions. The idea is that a program only needs to be installed on the
>>> root netdev (ie., the one representing the physical port), and it can
>>> use helpers to create an efficient pipeline to decide what to do with
>>> the packet in the presence of stacked devices.
>>>
>>> For example, anyone doing pure L3 could do:
>>>
>>> {port, vlan} --> [ find l2dev ] --> [ find l3dev ] ...
>>>
>>>     --> [ l3 forward lookup ] --> [ header rewrite ] --> XDP_REDIRECT
>>>
>>> port is the netdev associated with the ingress_ifindex in the xdp_md
>>> context, vlan is the vlan in the packet or the assigned PVID if
>>> relevant. From there l2dev could be a bond or bridge device for example,
>>> and l3dev is the one with a network address (vlan netdev, bond netdev,
>>> etc).
>> Looks less flexible since the topology is hard coded in the XDP program
>> itself and this requires all logic to be implemented in the program on
>> the root netdev.
> Nothing about the topology is hard coded. The idea is to mimic a
> hardware pipeline and acknowledging that a port device can have an
> arbitrary layers stacked on it - multiple vlan devices, bonds, macvlans, etc

I may be missing something, but BPF forbids loops. Without a loop, how can 
we make sure all stacked devices are enumerated correctly without knowing 
the topology in advance?

>
>>> I have L3 forwarding working for vlan devices and bonds. I had not
>>> considered macvlans specifically yet, but it should be straightforward
>>> to add.
>>>
>> Yes, and all these could be done through XDP rx handler as well, and it
>> can do even more with rather simple logic:
>  From a forwarding perspective I suspect the rx handler approach is going
> to have much more overhead (ie., higher latency per packet and hence
> lower throughput) as the layers determine which one to use (e.g., is the
> FIB lookup done on the port device, vlan device, or macvlan device on
> the vlan device).

Well, if we want stacked devices to behave correctly, this is probably the 
only way. E.g. in the above figure, to make "find l2dev" work correctly, 
we still need device-specific logic, which would be very similar to what 
the XDP rx handler does.

Thanks

>
>> 1 macvlan has its own namespace, and want its own bpf logic.
>> 2 Ruse the exist topology information for dealing with more complex
>> setup like macvlan on top of bond and team. There's no need to bpf
>> program to care about topology. If you look at the code, there's even no
>> need to attach XDP on each stacked device. The calling of xdp_do_pass()
>> can try to pass XDP buff to upper device even if there's no XDP program
>> attached to current layer.
>> 3 Deliver XDP buff to userspace through macvtap.
>>
>> Thanks

^ permalink raw reply
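
The loop question raised in this message can be made concrete with a small
model. A common workaround for verifier loop restrictions is a walk with a
compile-time bound that the compiler can fully unroll. The sketch below is
plain userspace C, not a BPF program, and struct fake_dev, FAKE_MAX_DEPTH and
fake_top_ifindex() are hypothetical names. It shows the trade-off under
discussion: stacks up to the bound are resolved, deeper stacks are not.

```c
#include <assert.h>
#include <stddef.h>

#define FAKE_MAX_DEPTH 4	/* hypothetical compile-time bound */

/* Hypothetical model of a netdev with at most one device stacked on it. */
struct fake_dev {
	int ifindex;
	const struct fake_dev *upper;	/* NULL for the topmost device */
};

/*
 * Follow the upper links at most FAKE_MAX_DEPTH times. Returns the ifindex
 * of the topmost device, or -1 if the stack is deeper than the bound.
 */
static int fake_top_ifindex(const struct fake_dev *dev)
{
	int i;

	assert(dev != NULL);
	for (i = 0; i < FAKE_MAX_DEPTH && dev->upper; i++)
		dev = dev->upper;

	return dev->upper ? -1 : dev->ifindex;
}
```

A port with a vlan and a macvlan on top resolves to the macvlan, while a stack
of six devices exceeds the bound and is reported as unresolved.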

* Re: [PATCH v2 net-next 0/8] net: dsa: microchip: Modify KSZ9477 DSA driver in preparation to add other KSZ switch drivers
From: Florian Fainelli @ 2018-08-16  0:28 UTC (permalink / raw)
  To: Tristram.Ha, Andrew Lunn, Pavel Machek, Ruediger Schmitt
  Cc: Arkadi Sharshevsky, UNGLinuxDriver, netdev
In-Reply-To: <1512524798-16210-1-git-send-email-Tristram.Ha@microchip.com>

On 12/05/2017 05:46 PM, Tristram.Ha@microchip.com wrote:
> From: Tristram Ha <Tristram.Ha@microchip.com>
> 
> This series of patches is to modify the original KSZ9477 DSA driver so
> that other KSZ switch drivers can be added and use the common code.
> 
> There are several steps to accomplish this achievement.  First is to
> rename some function names with a prefix to indicate chip specific
> function.  Second is to move common code into header that can be shared.
> Last is to modify tag_ksz.c so that it can handle many tail tag formats
> used by different KSZ switch drivers.
> 
> ksz_common.c will contain the common code used by all KSZ switch drivers.
> ksz9477.c will contain KSZ9477 code from the original ksz_common.c.
> ksz9477_spi.c is renamed from ksz_spi.c.
> ksz9477_reg.h is renamed from ksz_9477_reg.h.
> ksz_common.h is added to provide common code access to KSZ switch
> drivers.
> ksz_spi.h is added to provide common SPI access functions to KSZ SPI
> drivers.

Is something gating this series from getting included? It's been nearly
8 months now and this has been neither included nor resubmitted. Are there any
plans to rebase the patch series and work towards inclusion in net-next when it
opens again?

Thank you!

> 
> v2
> - Initialize reg_mutex before use
> - The alu_mutex is only used inside chip specific functions
> 
> v1
> - Each patch in the set is self-contained
> - Use ksz9477 prefix to indicate KSZ9477 specific code
> 
> Tristram Ha (8):
>   Replace license with GPL.
>   Clean up code according to patch check suggestions.
>   Initialize mutex before use.
>   Rename some functions with ksz9477 prefix to separate chip specific
>     code from common code.
>   Rename ksz_spi.c to ksz9477_spi.c and update Kconfig in preparation to
>     add more KSZ switch drivers.
>   Break KSZ9477 DSA driver into two files in preparation to add more KSZ
>     switch drivers.  Add common functions in ksz_common.h so that other
>     KSZ switch drivers can access code in ksz_common.c.  Add ksz_spi.h
>     for common functions used by KSZ switch SPI drivers.
>   Prepare PHY for proper advertisement and get link status for the port.
>   Rename ksz_9477_reg.h to ksz9477_reg.h for consistency as the product
>     name is always KSZ####.
> 
>  drivers/net/dsa/microchip/Kconfig                  |   12 +-
>  drivers/net/dsa/microchip/Makefile                 |    4 +-
>  drivers/net/dsa/microchip/ksz9477.c                | 1331 ++++++++++++++++++++
>  .../microchip/{ksz_9477_reg.h => ksz9477_reg.h}    |   23 +-
>  drivers/net/dsa/microchip/ksz9477_spi.c            |  188 +++
>  drivers/net/dsa/microchip/ksz_common.c             | 1176 +++--------------
>  drivers/net/dsa/microchip/ksz_common.h             |  229 ++++
>  drivers/net/dsa/microchip/ksz_priv.h               |  256 ++--
>  drivers/net/dsa/microchip/ksz_spi.c                |  216 ----
>  drivers/net/dsa/microchip/ksz_spi.h                |   82 ++
>  10 files changed, 2122 insertions(+), 1395 deletions(-)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477.c
>  rename drivers/net/dsa/microchip/{ksz_9477_reg.h => ksz9477_reg.h} (98%)
>  create mode 100644 drivers/net/dsa/microchip/ksz9477_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_common.h
>  delete mode 100644 drivers/net/dsa/microchip/ksz_spi.c
>  create mode 100644 drivers/net/dsa/microchip/ksz_spi.h
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH] net: dsa: add support for ksz9897 ethernet switch
From: Florian Fainelli @ 2018-08-16  0:24 UTC (permalink / raw)
  To: Lad Prabhakar, Woojung Huh, Microchip Linux Driver Support,
	Andrew Lunn, Vivien Didelot
  Cc: netdev, devicetree
In-Reply-To: <1534348283-12790-1-git-send-email-prabhakar.csengg@gmail.com>

On 08/15/2018 08:51 AM, Lad Prabhakar wrote:
> From: "Lad, Prabhakar" <prabhakar.csengg@gmail.com>
> 
> ksz9477 is a superset of the ksz9xx series, so the driver just works
> out of the box for the ksz9897 chip with this patch.

net-next is currently closed, but other than that:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

> 
> Signed-off-by: Lad, Prabhakar <prabhakar.csengg@gmail.com>
> ---
>  Documentation/devicetree/bindings/net/dsa/ksz.txt | 4 +++-
>  drivers/net/dsa/microchip/ksz_common.c            | 9 +++++++++
>  drivers/net/dsa/microchip/ksz_spi.c               | 1 +
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/dsa/ksz.txt b/Documentation/devicetree/bindings/net/dsa/ksz.txt
> index a700943..ac145b8 100644
> --- a/Documentation/devicetree/bindings/net/dsa/ksz.txt
> +++ b/Documentation/devicetree/bindings/net/dsa/ksz.txt
> @@ -4,7 +4,9 @@ Microchip KSZ Series Ethernet switches
>  Required properties:
>  
>  - compatible: For external switch chips, compatible string must be exactly one
> -  of: "microchip,ksz9477"
> +  of the following:
> +  - "microchip,ksz9477"
> +  - "microchip,ksz9897"
>  
>  See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of additional
>  required and optional properties.
> diff --git a/drivers/net/dsa/microchip/ksz_common.c b/drivers/net/dsa/microchip/ksz_common.c
> index 7210c49..54e0ca6 100644
> --- a/drivers/net/dsa/microchip/ksz_common.c
> +++ b/drivers/net/dsa/microchip/ksz_common.c
> @@ -1102,6 +1102,15 @@ static const struct ksz_chip_data ksz_switch_chips[] = {
>  		.cpu_ports = 0x7F,	/* can be configured as cpu port */
>  		.port_cnt = 7,		/* total physical port count */
>  	},
> +	{
> +		.chip_id = 0x00989700,
> +		.dev_name = "KSZ9897",
> +		.num_vlans = 4096,
> +		.num_alus = 4096,
> +		.num_statics = 16,
> +		.cpu_ports = 0x7F,	/* can be configured as cpu port */
> +		.port_cnt = 7,		/* total physical port count */
> +	},
>  };
>  
>  static int ksz_switch_init(struct ksz_device *dev)
> diff --git a/drivers/net/dsa/microchip/ksz_spi.c b/drivers/net/dsa/microchip/ksz_spi.c
> index c519469..8c1778b 100644
> --- a/drivers/net/dsa/microchip/ksz_spi.c
> +++ b/drivers/net/dsa/microchip/ksz_spi.c
> @@ -195,6 +195,7 @@ static int ksz_spi_remove(struct spi_device *spi)
>  
>  static const struct of_device_id ksz_dt_ids[] = {
>  	{ .compatible = "microchip,ksz9477" },
> +	{ .compatible = "microchip,ksz9897" },
>  	{},
>  };
>  MODULE_DEVICE_TABLE(of, ksz_dt_ids);
> 


-- 
Florian

^ permalink raw reply
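
The patch above is small because the driver identifies the switch by matching a chip ID against a static table, so supporting a new part comes down to adding one table entry. A minimal userspace sketch of that lookup pattern follows. The names `struct chip_data` and `find_chip()` are hypothetical and are not the driver's code, and the KSZ9477 ID value is assumed by analogy with the KSZ9897 entry in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-in for the driver's per-chip data. */
struct chip_data {
	uint32_t chip_id;
	const char *dev_name;
	int port_cnt;		/* total physical port count */
};

static const struct chip_data switch_chips[] = {
	/* ID value assumed for illustration */
	{ .chip_id = 0x00947700, .dev_name = "KSZ9477", .port_cnt = 7 },
	/* values taken from the patch above */
	{ .chip_id = 0x00989700, .dev_name = "KSZ9897", .port_cnt = 7 },
};

/* Return the entry matching chip_id, or NULL if the chip is unknown. */
static const struct chip_data *find_chip(const struct chip_data *table,
					 size_t count, uint32_t chip_id)
{
	size_t i;

	assert(table != NULL);
	for (i = 0; i < count; i++) {
		if (table[i].chip_id == chip_id)
			return &table[i];
	}
	return NULL;
}
```

A caller would pass `switch_chips` and its element count, and treat a NULL result as "unsupported device".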

* Re: [PATCH bpf] bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
From: Daniel Borkmann @ 2018-08-16  0:17 UTC (permalink / raw)
  To: Roman Gushchin, Alexei Starovoitov
  Cc: Yonghong Song, ast, netdev, kernel-team
In-Reply-To: <20180815000841.GA25304@castle.DHCP.thefacebook.com>

On 08/15/2018 02:08 AM, Roman Gushchin wrote:
> On Tue, Aug 14, 2018 at 04:59:45PM -0700, Alexei Starovoitov wrote:
>> On Tue, Aug 14, 2018 at 11:01:12AM -0700, Yonghong Song wrote:
>>> Commit 394e40a29788 ("bpf: extend bpf_prog_array to store pointers
>>> to the cgroup storage") refactored the bpf_prog_array_copy_core()
>>> to accommodate new structure bpf_prog_array_item which contains
>>> bpf_prog array itself.
>>>
>>> In the old code, we had
>>>    perf_event_query_prog_array():
>>>      mutex_lock(...)
>>>      bpf_prog_array_copy_call():
>>>        prog = rcu_dereference_check(array, 1)->progs
>>>        bpf_prog_array_copy_core(prog, ...)
>>>      mutex_unlock(...)
>>>
>>> With the above commit, we had
>>>    perf_event_query_prog_array():
>>>      mutex_lock(...)
>>>      bpf_prog_array_copy_call():
>>>        bpf_prog_array_copy_core(array, ...):
>>>          item = rcu_dereference(array)->items;
>>>          ...
>>>      mutex_unlock(...)
>>>
>>> The new code will trigger a lockdep rcu checking warning.
>>> The fix is to change rcu_dereference() to rcu_dereference_check()
>>> to prevent such a warning.
>>>
>>> Reported-by: syzbot+6e72317008eef84a216b@syzkaller.appspotmail.com
>>> Fixes: 394e40a29788 ("bpf: extend bpf_prog_array to store pointers to the cgroup storage")
>>> Cc: Roman Gushchin <guro@fb.com>
>>> Signed-off-by: Yonghong Song <yhs@fb.com>

Applied to bpf, thanks Yonghong!

^ permalink raw reply

* Re: [bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
From: Song Liu @ 2018-08-16  0:13 UTC (permalink / raw)
  To: Y Song
  Cc: Jesper Dangaard Brouer, Daniel Borkmann, Alexei Starovoitov,
	netdev, jhsiao
In-Reply-To: <CAH3MdRWXNe2-Uuut+L-qUGW52Fv-dmtcmG=ayoqKPRxNOtFqXA@mail.gmail.com>

On Wed, Aug 15, 2018 at 8:47 AM, Y Song <ys114321@gmail.com> wrote:
> On Wed, Aug 15, 2018 at 7:57 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> It is common XDP practice to unload/detach the XDP bpf program
>> when the XDP sample program is interrupted with Ctrl-C (SIGINT) or
>> killed (SIGTERM).
>>
>> The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info
>> forgot to trap the SIGTERM signal (which is the default signal used
>> by the kill command).
>>
>> This was discovered by Red Hat QA, whose automated scripts depend
>> on killing the XDP sample program after a timeout period.
>>
>> Fixes: fad3917e361b ("samples/bpf: add cpumap sample program xdp_redirect_cpu")
>> Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to xdp_rxq_info")
>> Reported-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
>
> Acked-by: Yonghong Song <yhs@fb.com>

Reviewed-by: Song Liu <songliubraving@fb.com>

>
>> ---
>>  samples/bpf/xdp_redirect_cpu_user.c |    3 ++-
>>  samples/bpf/xdp_rxq_info_user.c     |    3 ++-
>>  2 files changed, 4 insertions(+), 2 deletions(-)

^ permalink raw reply
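
The thread only shows the diffstat, so here is a minimal sketch of the pattern the commit message describes: registering one cleanup handler for both SIGINT and SIGTERM. The names `int_exit()`, `install_exit_handlers()` and `cleanup_requested` are hypothetical. A real XDP sample would detach its program from the interface in the handler, whereas this sketch only sets a flag.

```c
#include <assert.h>
#include <signal.h>

/* Set by the handler; a real sample would detach the XDP program instead. */
static volatile sig_atomic_t cleanup_requested;

/* Hypothetical shared handler for both termination signals. */
static void int_exit(int sig)
{
	(void)sig;
	cleanup_requested = 1;
}

/* Trap Ctrl-C (SIGINT) and the kill command's default signal (SIGTERM). */
static void install_exit_handlers(void)
{
	void (*prev)(int);

	prev = signal(SIGINT, int_exit);
	assert(prev != SIG_ERR);
	prev = signal(SIGTERM, int_exit);
	assert(prev != SIG_ERR);
}
```

Trapping only SIGINT leaves the program attached when a script stops the sample with a plain `kill`, which is the failure the commit message reports.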

* Re: [bpf PATCH] samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
From: Daniel Borkmann @ 2018-08-16  0:12 UTC (permalink / raw)
  To: Y Song, Jesper Dangaard Brouer
  Cc: Daniel Borkmann, Alexei Starovoitov, netdev, jhsiao
In-Reply-To: <CAH3MdRWXNe2-Uuut+L-qUGW52Fv-dmtcmG=ayoqKPRxNOtFqXA@mail.gmail.com>

On 08/15/2018 05:47 PM, Y Song wrote:
> On Wed, Aug 15, 2018 at 7:57 AM, Jesper Dangaard Brouer
> <brouer@redhat.com> wrote:
>> It is common XDP practice to unload/detach the XDP bpf program
>> when the XDP sample program is interrupted with Ctrl-C (SIGINT) or
>> killed (SIGTERM).
>>
>> The samples/bpf programs xdp_redirect_cpu and xdp_rxq_info
>> forgot to trap the SIGTERM signal (which is the default signal used
>> by the kill command).
>>
>> This was discovered by Red Hat QA, whose automated scripts depend
>> on killing the XDP sample program after a timeout period.
>>
>> Fixes: fad3917e361b ("samples/bpf: add cpumap sample program xdp_redirect_cpu")
>> Fixes: 0fca931a6f21 ("samples/bpf: program demonstrating access to xdp_rxq_info")
>> Reported-by: Jean-Tsung Hsiao <jhsiao@redhat.com>
>> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> Acked-by: Yonghong Song <yhs@fb.com>

Applied to bpf, thanks Jesper!

^ permalink raw reply

* Re: [PATCH bpf-next V3] net/xdp: Fix suspicious RCU usage warning
From: Daniel Borkmann @ 2018-08-16  0:11 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Tariq Toukan
  Cc: Alexei Starovoitov, netdev, Eran Ben Elisha, Neil Brown,
	Paul E. McKenney
In-Reply-To: <55d5dbee-8065-ac05-6372-c220a97b486f@iogearbox.net>

On 08/13/2018 02:22 PM, Daniel Borkmann wrote:
[...]
> I'll get the patch in once it has been pulled.

Applied to bpf, thanks Tariq!

^ permalink raw reply

* [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 58152ecbbcc6a0ce7fddd5bf5f6ee535834ece0c ]

In case an skb in out_of_order_queue is the result of
multiple skbs coalescing, we would like to get proper gso_segs
counter tracking, so that a future tcp_drop() can report an accurate
number.

I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a1e0d..fdb5509 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,23 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static bool tcp_ooo_try_coalesce(struct sock *sk,
+			     struct sk_buff *to,
+			     struct sk_buff *from,
+			     bool *fragstolen)
+{
+	bool res = tcp_try_coalesce(sk, to, from, fragstolen);
+
+	/* In case tcp_drop() is called later, update to->gso_segs */
+	if (res) {
+		u32 gso_segs = max_t(u16, 1, skb_shinfo(to)->gso_segs) +
+			       max_t(u16, 1, skb_shinfo(from)->gso_segs);
+
+		skb_shinfo(to)->gso_segs = min_t(u32, gso_segs, 0xFFFF);
+	}
+	return res;
+}
+
 static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 {
 	sk_drops_add(sk, skb);
@@ -4422,7 +4439,8 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	/* In the typical case, we are adding an skb to the end of the list.
 	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 	 */
-	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+	if (tcp_ooo_try_coalesce(sk, tp->ooo_last_skb,
+				 skb, &fragstolen)) {
 coalesce_done:
 		tcp_grow_window(sk, skb);
 		kfree_skb_partial(skb, fragstolen);
@@ -4467,7 +4485,8 @@ coalesce_done:
 				tcp_drop(sk, skb1);
 				goto add_sack;
 			}
-		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+		} else if (tcp_ooo_try_coalesce(sk, skb1,
+						skb, &fragstolen)) {
 			goto coalesce_done;
 		}
 		p = &parent->rb_right;
-- 
1.8.3.1

^ permalink raw reply related
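
The counter update in tcp_ooo_try_coalesce() above is a saturating addition: each skb counts as at least one segment, and the sum is clamped to the 16-bit range of gso_segs. A minimal userspace sketch of just that arithmetic follows. The function name `merged_gso_segs()` is hypothetical and does not exist in the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: segment count of a coalesced skb, capped at 0xFFFF. */
static uint16_t merged_gso_segs(uint16_t to_segs, uint16_t from_segs)
{
	/* A stored count of 0 still represents one segment on the wire. */
	uint32_t sum = (uint32_t)(to_segs ? to_segs : 1) +
		       (uint32_t)(from_segs ? from_segs : 1);
	uint16_t res = sum > 0xFFFF ? 0xFFFF : (uint16_t)sum;

	/* Two skbs were merged, so the result is never below 2. */
	assert(res >= 2);
	return res;
}
```

Doing the sum in 32 bits before clamping is what keeps two large 16-bit counts from wrapping around to a small value.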

* [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable
In-Reply-To: <1534387810-121428-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 72cd43ba64fc172a443410ce01645895850844c8 ]

Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. out_of_order_queue rb-tree can contain
thousands of nodes, iterating over all of them is not nice.

Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.

Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, we cannot revert to the old behavior without great pain.

Strategy taken in this patch is to purge ~12.5 % of the queue capacity.

Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12edc4f..32225dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4893,22 +4893,26 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct rb_node *node, *prev;
+	int goal;
 
 	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 		return false;
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-
+	goal = sk->sk_rcvbuf >> 3;
 	node = &tp->ooo_last_skb->rbnode;
 	do {
 		prev = rb_prev(node);
 		rb_erase(node, &tp->out_of_order_queue);
+		goal -= rb_to_skb(node)->truesize;
 		__kfree_skb(rb_to_skb(node));
-		sk_mem_reclaim(sk);
-                if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-                    !tcp_under_memory_pressure(sk))
-                        break;
-
+		if (!prev || goal <= 0) {
+			sk_mem_reclaim(sk);
+			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+			    !tcp_under_memory_pressure(sk))
+				break;
+			goal = sk->sk_rcvbuf >> 3;
+		}
 		node = prev;
 	} while (node);
 	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
-- 
1.8.3.1

^ permalink raw reply related
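
The batching in the patch above can be modelled in userspace. This is a simplified sketch under stated assumptions: the queue is a plain array of truesize values in sequence order, `prune_ofo()` is a hypothetical name, and the tcp_under_memory_pressure() check is left out. Entries are dropped from the tail, and the memory condition is re-evaluated only after roughly rcvbuf/8 bytes have been freed (or the queue is empty), not after every packet.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical model of batched pruning.
 * truesize[0..n-1]: queued skbs in sequence order (tail is dropped first).
 * rmem_alloc:       memory currently charged to the socket, updated in place.
 * Returns the number of skbs left in the queue.
 */
static size_t prune_ofo(const int *truesize, size_t n,
			int *rmem_alloc, int rcvbuf)
{
	int goal = rcvbuf >> 3;	/* 12.5% of the receive buffer */

	assert(truesize != NULL && rmem_alloc != NULL);
	while (n > 0) {
		n--;
		goal -= truesize[n];
		*rmem_alloc -= truesize[n];
		/* Only check the exit condition once per batch. */
		if (n == 0 || goal <= 0) {
			if (*rmem_alloc <= rcvbuf)
				break;
			goal = rcvbuf >> 3;
		}
	}
	return n;
}
```

With many tiny packets queued, the per-batch check means the comparatively expensive reclaim step runs once per ~12.5% of the buffer instead of once per packet.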

* [PATCH stable 4.4 0/9] fix SegmentSmack in stable branch (CVE-2018-5390)
From: Mao Wenan @ 2018-08-16  2:50 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw; +Cc: stable

There are five patches fixing CVE-2018-5390 in the latest mainline
branch, but only two of them exist in stable 4.4 and 3.18:
dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
I tested the stable 4.4 kernel and found that CPU usage was still very high,
so I think these two patches alone cannot fix CVE-2018-5390.
Test results:
with fix patch:     78.2%   ksoftirqd
without fix patch:  90%     ksoftirqd

Then I tried to imitate 72cd43ba (tcp: free batches of packets in tcp_prune_ofo_queue())
and drop at least 12.5% of sk_rcvbuf to defend against malicious attacks, while keeping
the simple queue instead of the RB tree. The result was not good.
 
After analysing the stable 4.4 code and debugging the system, I found that
searching the ofo_queue (in 4.4 the TCP ofo queue is a simple queue) costs more cycles.

So I backported "tcp: use an RB tree for ooo receive queue", which uses an RB tree
instead of the simple queue, and then backported Eric Dumazet's five fix patches from mainline.
The good news is that ksoftirqd drops to about 20%, which is now the same as mainline.

Stable 4.4 has already backported two patches:
f4a3313d (tcp: avoid collapses in tcp_prune_queue() if possible)
3d4bf93a (tcp: detect malicious patterns in tcp_collapse_ofo_queue())
To change the simple queue to an RB tree and fully resolve the issue, we must first apply
the earlier patch 9f5afeae (tcp: use an RB tree for ooo receive queue), but 9f5afeae has many
conflicts with 3d4bf93a and f4a3313d, which are part of Eric's mainline patch series
fixing CVE-2018-5390. So I first revert part of the patches in stable 4.4,
then apply 9f5afeae, and then reapply the five patches from Eric.

Eric Dumazet (6):
  tcp: increment sk_drops for dropped rx packets
  tcp: free batches of packets in tcp_prune_ofo_queue()
  tcp: avoid collapses in tcp_prune_queue() if possible
  tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  tcp: call tcp_drop() from tcp_data_queue_ofo()
  tcp: add tcp_ooo_try_coalesce() helper

Mao Wenan (2):
  Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
  Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

Yaogong Wang (1):
  tcp: use an RB tree for ooo receive queue

 include/linux/skbuff.h   |   8 +
 include/linux/tcp.h      |   7 +-
 include/net/sock.h       |   7 +
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 412 +++++++++++++++++++++++++++++------------------
 net/ipv4/tcp_ipv4.c      |   3 +-
 net/ipv4/tcp_minisocks.c |   1 -
 net/ipv6/tcp_ipv6.c      |   1 +
 10 files changed, 294 insertions(+), 170 deletions(-)

-- 
1.8.3.1

^ permalink raw reply

