Netdev List
* Re: [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue
From: Greg KH @ 2018-08-15 13:25 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-5-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:03PM +0800, Mao Wenan wrote:
> From: Yaogong Wang <wygivan@google.com>
> 
> Over the years, TCP BDP has increased by several orders of magnitude,
> and some people are considering to reach the 2 Gbytes limit.
> 
> Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
> MSS.
> 
> In presence of packet losses (or reorders), TCP stores incoming packets
> into an out of order queue, and number of skbs sitting there waiting for
> the missing packets to be received can be in the 10^5 range.
> 
> Most packets are appended to the tail of this queue, and when
> packets can finally be transferred to receive queue, we scan the queue
> from its head.
> 
> However, in presence of heavy losses, we might have to find an arbitrary
> point in this queue, involving a linear scan for every incoming packet,
> throwing away cpu caches.
> 
> This patch converts it to a RB tree, to get bounded latencies.
> 
> Yaogong wrote a preliminary patch about 2 years ago.
> Eric did the rebase, added ofo_last_skb cache, polishing and tests.
> 
> Tested with network dropping between 1 and 10 % packets, with good
> success (about 30 % increase of throughput in stress tests)
> 
> Next step would be to also use an RB tree for the write queue at sender
> side ;)
> 
> Signed-off-by: Yaogong Wang <wygivan@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Yuchung Cheng <ycheng@google.com>
> Cc: Neal Cardwell <ncardwell@google.com>
> Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: root <root@localhost.localdomain>

root and commit id?


* Re: [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible
From: Greg KH @ 2018-08-15 13:25 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-7-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:05PM +0800, Mao Wenan wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> [ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]
> 
> Right after a TCP flow is created, receiving tiny out of order
> packets always hit the condition:
> 
> if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
> 	tcp_clamp_window(sk);
> 
> tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
> (guarded by tcp_rmem[2])
> 
> Calling tcp_collapse_ofo_queue() in this case is not useful,
> and offers an O(N^2) attack surface to malicious peers.
> 
> Better not attempt anything before full queue capacity is reached,
> forcing attacker to spend lots of resource and allow us to more
> easily detect the abuse.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: root <root@localhost.localdomain>

root?
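
For readers following the quoted commit message, here is a minimal sketch of the decision it describes: clamp first, and only go on to the expensive collapse if the socket is still over its limit afterwards. This is a simplified user-space model, not the kernel code. The function name and parameters are made up for illustration, and the model leaves out checks the real tcp_clamp_window() performs (memory pressure, user-locked buffer sizes).

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified model (assumption: ignores memory pressure and userlocks).
 * rcvbuf     stands in for sk->sk_rcvbuf
 * rmem_alloc stands in for sk->sk_rmem_alloc
 * rmem_max   stands in for tcp_rmem[2]
 *
 * Returns true if the socket is still over its limit after clamping,
 * i.e. the only case in which collapsing the ofo queue is worth trying.
 */
static bool still_over_limit(uint32_t *rcvbuf, uint32_t rmem_alloc,
			     uint32_t rmem_max)
{
	/* Clamp step: raise rcvbuf toward rmem_alloc, bounded by rmem_max. */
	if (rmem_alloc >= *rcvbuf && *rcvbuf < rmem_max)
		*rcvbuf = rmem_alloc < rmem_max ? rmem_alloc : rmem_max;

	return rmem_alloc > *rcvbuf;
}
```

With a young flow (small rcvbuf, plenty of headroom below rmem_max) the clamp alone resolves the condition and the function returns false, which is the case the commit message says should skip tcp_collapse_ofo_queue().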


* Re: [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()
From: Greg KH @ 2018-08-15 13:25 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-6-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:04PM +0800, Mao Wenan wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Juha-Matti Tilli reported that malicious peers could inject tiny
> packets in out_of_order_queue, forcing very expensive calls
> to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
> every incoming packet. out_of_order_queue rb-tree can contain
> thousands of nodes, iterating over all of them is not nice.
> 
> Before linux-4.9, we would have pruned all packets in ofo_queue
> in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
> truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.
> 
> Since we plan to increase tcp_rmem[2] in the future to cope with
> modern BDP, can not revert to the old behavior, without great pain.
> 
> Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
> 
> Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: root <root@localhost.localdomain>

root?

And commit id?

thanks,

greg k-h
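
To make the "~12.5 % of the queue capacity" strategy from the quoted commit message concrete, here is a rough sketch of one pruning batch. It is a user-space model under stated assumptions: the ofo queue is represented as a plain array of truesize values ordered by sequence number, packets are dropped from the tail, and the byte goal is rcvbuf / 8. The name prune_batch() is hypothetical; the real function also rechecks socket memory between batches, which is omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Drop entries from the tail of the queue until roughly rcvbuf / 8
 * bytes of truesize have been released. Returns how many entries
 * were dropped in this batch.
 */
static size_t prune_batch(const uint32_t *truesize, size_t n, uint32_t rcvbuf)
{
	int64_t goal = rcvbuf >> 3;	/* 12.5 % of the receive buffer */
	size_t dropped = 0;

	while (n > 0 && goal > 0) {
		goal -= truesize[--n];	/* tail first: highest sequence numbers */
		dropped++;
	}
	return dropped;
}
```

Freeing a batch at a time means the pruning cost is paid once per batch rather than once per incoming packet, while most of the queue is kept.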


* Re: [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()
From: Greg KH @ 2018-08-15 13:24 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-9-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:07PM +0800, Mao Wenan wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> In order to be able to give better diagnostics and detect
> malicious traffic, we need to have better sk->sk_drops tracking.
> 
> Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Mao Wenan <maowenan@huawei.com>
> ---
>  net/ipv4/tcp_input.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)

Upstream commit id?


* Re: [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper
From: Greg KH @ 2018-08-15 13:24 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-10-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:08PM +0800, Mao Wenan wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> In case skb in out_of_order_queue is the result of
> multiple skbs coalescing, we would like to get a proper gso_segs
> counter tracking, so that future tcp_drop() can report an accurate
> number.
> 
> I chose to not implement this tracking for skbs in receive queue,
> since they are not dropped, unless socket is disconnected.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Acked-by: Yuchung Cheng <ycheng@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Mao Wenan <maowenan@huawei.com>

Upstream commit id?


* Re: [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)
From: Greg KH @ 2018-08-15 13:24 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:20:59PM +0800, Mao Wenan wrote:
> There are five patches to fix CVE-2018-5390 in the latest mainline
> branch, but only two of them exist in stable 4.4 and 3.18:
> dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
> 5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
> I have tested with these two patches and found that CPU usage was still very high.
> Test results:
> with fix patch: 78.2%   ksoftirqd
> no fix patch:   90%     ksoftirqd
> 
> After analysing the stable 4.4 code and debugging the system, I found that
> searching the ofo_queue (in 4.4 the TCP ofo queue is a simple list) costs more cycles.
> So I think these two patches alone can't fix CVE-2018-5390.
> I therefore backported "tcp: use an RB tree for ooo receive queue", which uses an RB tree
> instead of a simple queue, and then backported Eric Dumazet's five fix patches from mainline.
> The good news is that ksoftirqd drops to about 20%, which is the same as mainline now.

Thanks for doing this work, I had some questions on the individual
patches.  Can you address them and resend?

thanks,

greg k-h


* Re: [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets
From: Greg KH @ 2018-08-15 13:21 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-4-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:02PM +0800, Mao Wenan wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Now ss can report sk_drops, we can instruct TCP to increment
> this per socket counter when it drops an incoming frame, to refine
> monitoring and debugging.
> 
> Following patch takes care of listeners drops.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Mao Wenan <maowenan@huawei.com>

What is the upstream git commit id for this?

thanks,

greg k-h


* Re: [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()
From: Greg KH @ 2018-08-15 13:19 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-8-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:06PM +0800, Mao Wenan wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> [ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]
> 
> In case an attacker feeds tiny packets completely out of order,
> tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
> expensive copies, but not changing socket memory usage at all.
> 
> 1) Do not attempt to collapse tiny skbs.
> 2) Add logic to exit early when too many tiny skbs are detected.
> 
> We prefer not doing aggressive collapsing (which copies packets)
> for pathological flows, and revert to tcp_prune_ofo_queue() which
> will be less expensive.
> 
> In the future, we might add the possibility of terminating flows
> that are proven to be malicious.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Signed-off-by: root <root@localhost.localdomain>

signed off by from "root"?  :)

And why are you adding the patch back after removing it?

thanks,

greg k-h
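
The two rules listed in the quoted commit message (skip tiny skbs, exit early when too many are seen) can be modelled roughly as below. This is only an illustration of the idea, not the patch itself, whose body is not shown in this thread. The struct, the function name and both thresholds (min_bytes, tiny_budget) are assumptions chosen for the sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One contiguous range found in the ofo queue (hypothetical model). */
struct ooo_range {
	uint32_t bytes;		/* payload bytes covered by the range */
	uint32_t truesize;	/* memory charged for the range */
	bool single_skb;	/* range consists of exactly one skb */
};

/* Walk the ranges in order. A range is worth collapsing if it spans
 * several skbs or carries at least min_bytes of payload. Otherwise it
 * counts as "tiny": its truesize is accumulated, and once the total
 * exceeds tiny_budget the pass gives up so the caller can fall back
 * to pruning. Returns the number of ranges collapsed.
 */
static size_t collapse_pass(const struct ooo_range *r, size_t n,
			    uint32_t min_bytes, uint32_t tiny_budget,
			    bool *gave_up)
{
	uint32_t sum_tiny = 0;
	size_t collapsed = 0;
	size_t i;

	*gave_up = false;
	for (i = 0; i < n; i++) {
		if (!r[i].single_skb || r[i].bytes >= min_bytes) {
			collapsed++;	/* rule 1 does not apply: collapse */
			continue;
		}
		sum_tiny += r[i].truesize;
		if (sum_tiny > tiny_budget) {
			*gave_up = true;	/* rule 2: too many tiny skbs */
			break;
		}
	}
	return collapsed;
}
```

The point of the early exit is that copying tiny, non-adjacent skbs frees no memory, so continuing the scan only burns CPU on behalf of the attacker.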


* Re: [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"
From: Greg KH @ 2018-08-15 13:18 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-3-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:01PM +0800, Mao Wenan wrote:
> This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.
> ---

Same here for description and signed off by.

thanks,

greg k-h


* Re: [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
From: Greg KH @ 2018-08-15 13:18 UTC (permalink / raw)
  To: Mao Wenan; +Cc: dwmw2, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-2-git-send-email-maowenan@huawei.com>

On Wed, Aug 15, 2018 at 09:21:00PM +0800, Mao Wenan wrote:
> This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.

I need a reason why, and a signed-off-by line :(

thanks,

greg k-h


* Re: I found a strange place while reading “net/ipv6/reassembly.c”
From: Sabrina Dubroca @ 2018-08-15 13:16 UTC (permalink / raw)
  To: Ttttabcd; +Cc: netdev@vger.kernel.org
In-Reply-To: <SFcjXy7yNtGx0prpq73gVredyTRO62MxkhifUjhdZT6AqXpetrRFT9AtzwqflOGcpsqVg9CiO-mNEr8uT24SyYenCwoGTz8x1sB6a-tNv7w=@protonmail.com>

2018-08-15, 04:38:29 +0000, Ttttabcd wrote:
> Hello everyone who develops the kernel.
> 
> At the beginning I was looking for the source author, but his email
> address has expired, so I can only come here to ask questions.
> 
> The problem is in the /net/ipv6/reassembly.c file, the author is
> Pedro Roque.
> 
> I found some strange places when I read the code for this file
> (Linux Kernel version 4.18).
> 
> In the "/net/ipv6/reassembly.c"
> 
> In the function "ip6_frag_queue"
> 
> 	offset = ntohs(fhdr->frag_off) & ~0x7;
> 	end = offset + (ntohs(ipv6_hdr(skb)->payload_len) -
> 			((u8 *)(fhdr + 1) - (u8 *)(ipv6_hdr(skb) + 1)));
> 
> 	if ((unsigned int)end > IPV6_MAXPLEN) {
> 		*prob_offset = (u8 *)&fhdr->frag_off - skb_network_header(skb);
> 		return -1;
> 	}
> 
> Here the length of the payload is judged.

This check is based on the fragment currently being processed, and
only considers the reassembled payload.

> And in the function "ip6_frag_reasm"
> 

Re-adding the comment that comes just above this:

  	/* Unfragmented part is taken from the first segment. */
> 	payload_len = ((head->data - skb_network_header(head)) -
> 		       sizeof(struct ipv6hdr) + fq->q.len -
> 		       sizeof(struct frag_hdr));
> 	if (payload_len > IPV6_MAXPLEN)
> 		goto out_oversize;

This considers the reassembled payload (ie, as above) *and* any
extension headers that might come before it.


> 	......
> 	out_oversize:
> 		net_dbg_ratelimited("ip6_frag_reasm: payload len = %d\n", payload_len);
> 		goto out_fail;
> 
> Here also judges the length of the payload.
> 
> Judged the payload length twice.
> 
> I tested that the code in the label "out_oversize:" does not execute
> at all, because it has been returned in "ip6_frag_queue".

If you try again with a routing header, I think you'll see it trigger.

> Unless I comment out the code that judge the payload length in the
> function "ip6_frag_queue", the code labeled "out_oversize:" can be
> executed.
> 
> So, is this repeated?

-- 
Sabrina
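
The difference between the two checks discussed in this message can be shown with a small arithmetic sketch. It is a user-space model, not the kernel code: the helper names are invented, and the 24-byte routing header is just an example value. The first helper mirrors the per-fragment test in ip6_frag_queue(), which only looks at the fragmentable part; the second mirrors the test in ip6_frag_reasm(), which also counts extension headers that precede the fragment header.

```c
#include <stdbool.h>
#include <stdint.h>

#define IPV6_MAXPLEN 65535u

/* Per-fragment check: offset of this fragment plus its data length. */
static bool frag_end_oversize(uint32_t offset, uint32_t frag_data_len)
{
	return offset + frag_data_len > IPV6_MAXPLEN;
}

/* Reassembly check: unfragmentable extension headers plus the total
 * reassembled length (fq->q.len in the quoted code).
 */
static bool reasm_oversize(uint32_t unfrag_ext_len, uint32_t reasm_len)
{
	return unfrag_ext_len + reasm_len > IPV6_MAXPLEN;
}
```

For example, a last fragment at offset 65520 carrying 8 bytes ends at 65528 and passes the first check, but if a 24-byte routing header sits before the fragment header, the reassembled payload is 65552 bytes and only the second check catches it.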


* [PATCH stable 4.4 4/9] tcp: use an RB tree for ooo receive queue
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

From: Yaogong Wang <wygivan@google.com>

Over the years, TCP BDP has increased by several orders of magnitude,
and some people are considering to reach the 2 Gbytes limit.

Even with current window scale limit of 14, ~1 Gbytes maps to ~740,000
MSS.
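
As a rough check of the figure above (a sketch, not part of the patch): the largest window that a scale of 14 allows is 65535 << 14 bytes, and dividing that by the MSS gives the segment count. The 1448-byte MSS used in the note below is an assumption (1500-byte MTU, IPv4, TCP timestamps); the commit message does not say which MSS it had in mind. The helper names are made up for illustration.

```c
#include <stdint.h>

#define TCP_MAX_WSCALE 14u

/* Largest receive window expressible with a 16-bit window field
 * and the maximum window scale.
 */
static uint64_t max_window_bytes(void)
{
	return (uint64_t)65535 << TCP_MAX_WSCALE;
}

/* Number of full-sized segments needed to fill that many bytes. */
static uint64_t segments_for(uint64_t bytes, uint32_t mss)
{
	return bytes / mss;
}
```

max_window_bytes() is 1,073,725,440, and with a 1448-byte MSS that is 741,523 segments, in line with the "~740,000 MSS" quoted above.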

In presence of packet losses (or reorders), TCP stores incoming packets
into an out of order queue, and number of skbs sitting there waiting for
the missing packets to be received can be in the 10^5 range.

Most packets are appended to the tail of this queue, and when
packets can finally be transferred to receive queue, we scan the queue
from its head.

However, in presence of heavy losses, we might have to find an arbitrary
point in this queue, involving a linear scan for every incoming packet,
throwing away cpu caches.

This patch converts it to a RB tree, to get bounded latencies.
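
The core idea can be sketched in a few lines of user-space C. This is an illustration only and makes two simplifying assumptions: it uses a plain unbalanced binary search tree rather than the kernel rbtree (so it lacks the rebalancing that gives the bounded latency), and it ignores overlap and coalescing. The names struct seg, seg_insert() and seg_first() are hypothetical. What it does show is the descent by sequence number, using a wraparound-safe comparison in the style of the kernel's before().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe "a comes before b" for 32-bit sequence numbers. */
static bool seq_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

struct seg {
	uint32_t seq, end_seq;
	struct seg *left, *right;
};

/* Descend by seq and link the new segment at the empty slot found.
 * No rebalancing and no overlap handling in this sketch.
 */
static void seg_insert(struct seg **root, struct seg *new_seg)
{
	struct seg **p = root;

	new_seg->left = new_seg->right = NULL;
	while (*p)
		p = seq_before(new_seg->seq, (*p)->seq) ? &(*p)->left
							: &(*p)->right;
	*p = new_seg;
}

/* Leftmost node: the segment with the lowest sequence number. */
static struct seg *seg_first(struct seg *root)
{
	while (root && root->left)
		root = root->left;
	return root;
}
```

With a balanced tree the descent visits O(log N) nodes per arriving packet instead of walking a list that may hold on the order of 10^5 skbs.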

Yaogong wrote a preliminary patch about 2 years ago.
Eric did the rebase, added ofo_last_skb cache, polishing and tests.

Tested with network dropping between 1 and 10 % packets, with good
success (about 30 % increase of throughput in stress tests)

Next step would be to also use an RB tree for the write queue at sender
side ;)

Signed-off-by: Yaogong Wang <wygivan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Acked-By: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: root <root@localhost.localdomain>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/linux/skbuff.h   |   8 ++
 include/linux/tcp.h      |   7 +-
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 354 +++++++++++++++++++++++++++--------------------
 net/ipv4/tcp_ipv4.c      |   2 +-
 net/ipv4/tcp_minisocks.c |   1 -
 8 files changed, 241 insertions(+), 156 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index c28bd8b..a490dd7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2273,6 +2273,8 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
+void skb_rbtree_purge(struct rb_root *root);
+
 void *netdev_alloc_frag(unsigned int fragsz);
 
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int length,
@@ -2807,6 +2809,12 @@ static inline int pskb_trim_rcsum(struct sk_buff *skb, unsigned int len)
 	return __pskb_trim(skb, len);
 }
 
+#define rb_to_skb(rb) rb_entry_safe(rb, struct sk_buff, rbnode)
+#define skb_rb_first(root) rb_to_skb(rb_first(root))
+#define skb_rb_last(root)  rb_to_skb(rb_last(root))
+#define skb_rb_next(skb)   rb_to_skb(rb_next(&(skb)->rbnode))
+#define skb_rb_prev(skb)   rb_to_skb(rb_prev(&(skb)->rbnode))
+
 #define skb_queue_walk(queue, skb) \
 		for (skb = (queue)->next;					\
 		     skb != (struct sk_buff *)(queue);				\
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 5b6df1a..747404d 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -279,10 +279,9 @@ struct tcp_sock {
 	struct sk_buff* lost_skb_hint;
 	struct sk_buff *retransmit_skb_hint;
 
-	/* OOO segments go in this list. Note that socket lock must be held,
-	 * as we do not use sk_buff_head lock.
-	 */
-	struct sk_buff_head	out_of_order_queue;
+	/* OOO segments go in this rbtree. Socket lock must be held. */
+	struct rb_root	out_of_order_queue;
+	struct sk_buff	*ooo_last_skb; /* cache rb_last(out_of_order_queue) */
 
 	/* SACKs data, these 2 need to be together (see tcp_options_write) */
 	struct tcp_sack_block duplicate_sack[1]; /* D-SACK block */
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cac4a6ad..8bc259d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -649,7 +649,7 @@ static inline void tcp_fast_path_check(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (skb_queue_empty(&tp->out_of_order_queue) &&
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue) &&
 	    tp->rcv_wnd &&
 	    atomic_read(&sk->sk_rmem_alloc) < sk->sk_rcvbuf &&
 	    !tp->urg_data)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 55be076..9703924 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2378,6 +2378,25 @@ void skb_queue_purge(struct sk_buff_head *list)
 EXPORT_SYMBOL(skb_queue_purge);
 
 /**
+ *	skb_rbtree_purge - empty a skb rbtree
+ *	@root: root of the rbtree to empty
+ *
+ *	Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
+ *	the list and one reference dropped. This function does not take
+ *	any lock. Synchronization should be handled by the caller (e.g., TCP
+ *	out-of-order queue is protected by the socket lock).
+ */
+void skb_rbtree_purge(struct rb_root *root)
+{
+	struct sk_buff *skb, *next;
+
+	rbtree_postorder_for_each_entry_safe(skb, next, root, rbnode)
+		kfree_skb(skb);
+
+	*root = RB_ROOT;
+}
+
+/**
  *	skb_queue_head - queue a buffer at the list head
  *	@list: list to use
  *	@newsk: buffer to queue
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a0f0a7d..8bd2874 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -382,7 +382,7 @@ void tcp_init_sock(struct sock *sk)
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	__skb_queue_head_init(&tp->out_of_order_queue);
+	tp->out_of_order_queue = RB_ROOT;
 	tcp_init_xmit_timers(sk);
 	tcp_prequeue_init(tp);
 	INIT_LIST_HEAD(&tp->tsq_node);
@@ -2240,7 +2240,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tcp_clear_xmit_timers(sk);
 	__skb_queue_purge(&sk->sk_receive_queue);
 	tcp_write_queue_purge(sk);
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 
 	inet->inet_dport = 0;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5fb4e80..12edc4f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4073,7 +4073,7 @@ static void tcp_fin(struct sock *sk)
 	/* It _is_ possible, that we have something out-of-order _after_ FIN.
 	 * Probably, we should reset in this case. For now drop them.
 	 */
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 	if (tcp_is_sack(tp))
 		tcp_sack_reset(&tp->rx_opt);
 	sk_mem_reclaim(sk);
@@ -4233,7 +4233,7 @@ static void tcp_sack_remove(struct tcp_sock *tp)
 	int this_sack;
 
 	/* Empty ofo queue, hence, all the SACKs are eaten. Clear. */
-	if (skb_queue_empty(&tp->out_of_order_queue)) {
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 		tp->rx_opt.num_sacks = 0;
 		return;
 	}
@@ -4309,10 +4309,13 @@ static void tcp_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	__u32 dsack_high = tp->rcv_nxt;
+	bool fin, fragstolen, eaten;
 	struct sk_buff *skb, *tail;
-	bool fragstolen, eaten;
+	struct rb_node *p;
 
-	while ((skb = skb_peek(&tp->out_of_order_queue)) != NULL) {
+	p = rb_first(&tp->out_of_order_queue);
+	while (p) {
+		skb = rb_entry(p, struct sk_buff, rbnode);
 		if (after(TCP_SKB_CB(skb)->seq, tp->rcv_nxt))
 			break;
 
@@ -4322,9 +4325,10 @@ static void tcp_ofo_queue(struct sock *sk)
 				dsack_high = TCP_SKB_CB(skb)->end_seq;
 			tcp_dsack_extend(sk, TCP_SKB_CB(skb)->seq, dsack);
 		}
+		p = rb_next(p);
+		rb_erase(&skb->rbnode, &tp->out_of_order_queue);
 
-		__skb_unlink(skb, &tp->out_of_order_queue);
-		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
+		if (unlikely(!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt))) {
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
 			tcp_drop(sk, skb);
 			continue;
@@ -4336,12 +4340,19 @@ static void tcp_ofo_queue(struct sock *sk)
 		tail = skb_peek_tail(&sk->sk_receive_queue);
 		eaten = tail && tcp_try_coalesce(sk, tail, skb, &fragstolen);
 		tcp_rcv_nxt_update(tp, TCP_SKB_CB(skb)->end_seq);
+		fin = TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN;
 		if (!eaten)
 			__skb_queue_tail(&sk->sk_receive_queue, skb);
-		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
-			tcp_fin(sk);
-		if (eaten)
+		else
 			kfree_skb_partial(skb, fragstolen);
+
+		if (unlikely(fin)) {
+			tcp_fin(sk);
+			/* tcp_fin() purges tp->out_of_order_queue,
+			 * so we must end this loop right now.
+			 */
+			break;
+		}
 	}
 }
 
@@ -4371,8 +4382,10 @@ static int tcp_try_rmem_schedule(struct sock *sk, struct sk_buff *skb,
 static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct rb_node **p, *q, *parent;
 	struct sk_buff *skb1;
 	u32 seq, end_seq;
+	bool fragstolen;
 
 	tcp_ecn_check_ce(sk, skb);
 
@@ -4387,88 +4400,86 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	inet_csk_schedule_ack(sk);
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOQUEUE);
+	seq = TCP_SKB_CB(skb)->seq;
+	end_seq = TCP_SKB_CB(skb)->end_seq;
 	SOCK_DEBUG(sk, "out of order segment: rcv_next %X seq %X - %X\n",
-		   tp->rcv_nxt, TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb)->end_seq);
+		   tp->rcv_nxt, seq, end_seq);
 
-	skb1 = skb_peek_tail(&tp->out_of_order_queue);
-	if (!skb1) {
+	p = &tp->out_of_order_queue.rb_node;
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 		/* Initial out of order segment, build 1 SACK. */
 		if (tcp_is_sack(tp)) {
 			tp->rx_opt.num_sacks = 1;
-			tp->selective_acks[0].start_seq = TCP_SKB_CB(skb)->seq;
-			tp->selective_acks[0].end_seq =
-						TCP_SKB_CB(skb)->end_seq;
+			tp->selective_acks[0].start_seq = seq;
+			tp->selective_acks[0].end_seq = end_seq;
 		}
-		__skb_queue_head(&tp->out_of_order_queue, skb);
+		rb_link_node(&skb->rbnode, NULL, p);
+		rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
+		tp->ooo_last_skb = skb;
 		goto end;
 	}
 
-	seq = TCP_SKB_CB(skb)->seq;
-	end_seq = TCP_SKB_CB(skb)->end_seq;
-
-	if (seq == TCP_SKB_CB(skb1)->end_seq) {
-		bool fragstolen;
-
-		if (!tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-			__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
-		} else {
-			tcp_grow_window(sk, skb);
-			kfree_skb_partial(skb, fragstolen);
-			skb = NULL;
+	/* In the typical case, we are adding an skb to the end of the list.
+	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
+	 */
+	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+coalesce_done:
+		tcp_grow_window(sk, skb);
+		kfree_skb_partial(skb, fragstolen);
+		skb = NULL;
+		goto add_sack;
+	}
+
+	/* Find place to insert this segment. Handle overlaps on the way. */
+	parent = NULL;
+	while (*p) {
+		parent = *p;
+		skb1 = rb_entry(parent, struct sk_buff, rbnode);
+		if (before(seq, TCP_SKB_CB(skb1)->seq)) {
+			p = &parent->rb_left;
+			continue;
 		}
 
-		if (!tp->rx_opt.num_sacks ||
-		    tp->selective_acks[0].end_seq != seq)
-			goto add_sack;
-
-		/* Common case: data arrive in order after hole. */
-		tp->selective_acks[0].end_seq = end_seq;
-		goto end;
-	}
-
-	/* Find place to insert this segment. */
-	while (1) {
-		if (!after(TCP_SKB_CB(skb1)->seq, seq))
-			break;
-		if (skb_queue_is_first(&tp->out_of_order_queue, skb1)) {
-			skb1 = NULL;
-			break;
+		if (before(seq, TCP_SKB_CB(skb1)->end_seq)) {
+			if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
+				/* All the bits are present. Drop. */
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPOFOMERGE);
+				__kfree_skb(skb);
+				skb = NULL;
+				tcp_dsack_set(sk, seq, end_seq);
+				goto add_sack;
+			}
+			if (after(seq, TCP_SKB_CB(skb1)->seq)) {
+				/* Partial overlap. */
+				tcp_dsack_set(sk, seq, TCP_SKB_CB(skb1)->end_seq);
+			} else {
+				/* skb's seq == skb1's seq and skb covers skb1.
+				 * Replace skb1 with skb.
+				 */
+				rb_replace_node(&skb1->rbnode, &skb->rbnode,
+						&tp->out_of_order_queue);
+				tcp_dsack_extend(sk,
+						 TCP_SKB_CB(skb1)->seq,
+						 TCP_SKB_CB(skb1)->end_seq);
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPOFOMERGE);
+				__kfree_skb(skb1);
+				goto add_sack;
+			}
+		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+			goto coalesce_done;
 		}
-		skb1 = skb_queue_prev(&tp->out_of_order_queue, skb1);
+		p = &parent->rb_right;
 	}
 
-	/* Do skb overlap to previous one? */
-	if (skb1 && before(seq, TCP_SKB_CB(skb1)->end_seq)) {
-		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
-			/* All the bits are present. Drop. */
-			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			tcp_drop(sk, skb);
-			skb = NULL;
-			tcp_dsack_set(sk, seq, end_seq);
-			goto add_sack;
-		}
-		if (after(seq, TCP_SKB_CB(skb1)->seq)) {
-			/* Partial overlap. */
-			tcp_dsack_set(sk, seq,
-				      TCP_SKB_CB(skb1)->end_seq);
-		} else {
-			if (skb_queue_is_first(&tp->out_of_order_queue,
-					       skb1))
-				skb1 = NULL;
-			else
-				skb1 = skb_queue_prev(
-					&tp->out_of_order_queue,
-					skb1);
-		}
-	}
-	if (!skb1)
-		__skb_queue_head(&tp->out_of_order_queue, skb);
-	else
-		__skb_queue_after(&tp->out_of_order_queue, skb1, skb);
+	/* Insert segment into RB tree. */
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, &tp->out_of_order_queue);
 
-	/* And clean segments covered by new one as whole. */
-	while (!skb_queue_is_last(&tp->out_of_order_queue, skb)) {
-		skb1 = skb_queue_next(&tp->out_of_order_queue, skb);
+	/* Remove other segments covered by skb. */
+	while ((q = rb_next(&skb->rbnode)) != NULL) {
+		skb1 = rb_entry(q, struct sk_buff, rbnode);
 
 		if (!after(end_seq, TCP_SKB_CB(skb1)->seq))
 			break;
@@ -4477,12 +4488,15 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 					 end_seq);
 			break;
 		}
-		__skb_unlink(skb1, &tp->out_of_order_queue);
+		rb_erase(&skb1->rbnode, &tp->out_of_order_queue);
 		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 				 TCP_SKB_CB(skb1)->end_seq);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
 		tcp_drop(sk, skb1);
 	}
+	/* If there is no skb after us, we are the last_skb ! */
+	if (!q)
+		tp->ooo_last_skb = skb;
 
 add_sack:
 	if (tcp_is_sack(tp))
@@ -4621,13 +4635,13 @@ queue_and_out:
 		if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_FIN)
 			tcp_fin(sk);
 
-		if (!skb_queue_empty(&tp->out_of_order_queue)) {
+		if (!RB_EMPTY_ROOT(&tp->out_of_order_queue)) {
 			tcp_ofo_queue(sk);
 
 			/* RFC2581. 4.2. SHOULD send immediate ACK, when
 			 * gap in queue is filled.
 			 */
-			if (skb_queue_empty(&tp->out_of_order_queue))
+			if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 				inet_csk(sk)->icsk_ack.pingpong = 0;
 		}
 
@@ -4679,48 +4693,76 @@ drop:
 	tcp_data_queue_ofo(sk, skb);
 }
 
+static struct sk_buff *tcp_skb_next(struct sk_buff *skb, struct sk_buff_head *list)
+{
+	if (list)
+		return !skb_queue_is_last(list, skb) ? skb->next : NULL;
+
+	return rb_entry_safe(rb_next(&skb->rbnode), struct sk_buff, rbnode);
+}
+
 static struct sk_buff *tcp_collapse_one(struct sock *sk, struct sk_buff *skb,
-					struct sk_buff_head *list)
+					struct sk_buff_head *list,
+					struct rb_root *root)
 {
-	struct sk_buff *next = NULL;
+	struct sk_buff *next = tcp_skb_next(skb, list);
 
-	if (!skb_queue_is_last(list, skb))
-		next = skb_queue_next(list, skb);
+	if (list)
+		__skb_unlink(skb, list);
+	else
+		rb_erase(&skb->rbnode, root);
 
-	__skb_unlink(skb, list);
 	__kfree_skb(skb);
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPRCVCOLLAPSED);
 
 	return next;
 }
 
+/* Insert skb into rb tree, ordered by TCP_SKB_CB(skb)->seq */
+static void tcp_rbtree_insert(struct rb_root *root, struct sk_buff *skb)
+{
+	struct rb_node **p = &root->rb_node;
+	struct rb_node *parent = NULL;
+	struct sk_buff *skb1;
+
+	while (*p) {
+		parent = *p;
+		skb1 = rb_entry(parent, struct sk_buff, rbnode);
+		if (before(TCP_SKB_CB(skb)->seq, TCP_SKB_CB(skb1)->seq))
+			p = &parent->rb_left;
+		else
+			p = &parent->rb_right;
+	}
+	rb_link_node(&skb->rbnode, parent, p);
+	rb_insert_color(&skb->rbnode, root);
+}
+
 /* Collapse contiguous sequence of skbs head..tail with
  * sequence numbers start..end.
  *
- * If tail is NULL, this means until the end of the list.
+ * If tail is NULL, this means until the end of the queue.
  *
  * Segments with FIN/SYN are not collapsed (only because this
  * simplifies code)
  */
 static void
-tcp_collapse(struct sock *sk, struct sk_buff_head *list,
-	     struct sk_buff *head, struct sk_buff *tail,
-	     u32 start, u32 end)
+tcp_collapse(struct sock *sk, struct sk_buff_head *list, struct rb_root *root,
+	     struct sk_buff *head, struct sk_buff *tail, u32 start, u32 end)
 {
-	struct sk_buff *skb, *n;
+	struct sk_buff *skb = head, *n;
+	struct sk_buff_head tmp;
 	bool end_of_skbs;
 
 	/* First, check that queue is collapsible and find
-	 * the point where collapsing can be useful. */
-	skb = head;
+	 * the point where collapsing can be useful.
+	 */
 restart:
-	end_of_skbs = true;
-	skb_queue_walk_from_safe(list, skb, n) {
-		if (skb == tail)
-			break;
+	for (end_of_skbs = true; skb != NULL && skb != tail; skb = n) {
+		n = tcp_skb_next(skb, list);
+
 		/* No new bits? It is possible on ofo queue. */
 		if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
-			skb = tcp_collapse_one(sk, skb, list);
+			skb = tcp_collapse_one(sk, skb, list, root);
 			if (!skb)
 				break;
 			goto restart;
@@ -4738,13 +4780,10 @@ restart:
 			break;
 		}
 
-		if (!skb_queue_is_last(list, skb)) {
-			struct sk_buff *next = skb_queue_next(list, skb);
-			if (next != tail &&
-			    TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(next)->seq) {
-				end_of_skbs = false;
-				break;
-			}
+		if (n && n != tail &&
+		    TCP_SKB_CB(skb)->end_seq != TCP_SKB_CB(n)->seq) {
+			end_of_skbs = false;
+			break;
 		}
 
 		/* Decided to skip this, advance start seq. */
@@ -4754,17 +4793,22 @@ restart:
 	    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
 		return;
 
+	__skb_queue_head_init(&tmp);
+
 	while (before(start, end)) {
 		int copy = min_t(int, SKB_MAX_ORDER(0, 0), end - start);
 		struct sk_buff *nskb;
 
 		nskb = alloc_skb(copy, GFP_ATOMIC);
 		if (!nskb)
-			return;
+			break;
 
 		memcpy(nskb->cb, skb->cb, sizeof(skb->cb));
 		TCP_SKB_CB(nskb)->seq = TCP_SKB_CB(nskb)->end_seq = start;
-		__skb_queue_before(list, skb, nskb);
+		if (list)
+			__skb_queue_before(list, skb, nskb);
+		else
+			__skb_queue_tail(&tmp, nskb); /* defer rbtree insertion */
 		skb_set_owner_r(nskb, sk);
 
 		/* Copy data, releasing collapsed skbs. */
@@ -4782,14 +4826,17 @@ restart:
 				start += size;
 			}
 			if (!before(start, TCP_SKB_CB(skb)->end_seq)) {
-				skb = tcp_collapse_one(sk, skb, list);
+				skb = tcp_collapse_one(sk, skb, list, root);
 				if (!skb ||
 				    skb == tail ||
 				    (TCP_SKB_CB(skb)->tcp_flags & (TCPHDR_SYN | TCPHDR_FIN)))
-					return;
+					goto end;
 			}
 		}
 	}
+end:
+	skb_queue_walk_safe(&tmp, skb, n)
+		tcp_rbtree_insert(root, skb);
 }
 
 /* Collapse ofo queue. Algorithm: select contiguous sequence of skbs
@@ -4798,43 +4845,43 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
-	struct sk_buff *head;
+	struct sk_buff *skb, *head;
+	struct rb_node *p;
 	u32 start, end;
 
-	if (!skb)
+	p = rb_first(&tp->out_of_order_queue);
+	skb = rb_entry_safe(p, struct sk_buff, rbnode);
+new_range:
+	if (!skb) {
+		p = rb_last(&tp->out_of_order_queue);
+		/* Note: This is possible p is NULL here. We do not
+		 * use rb_entry_safe(), as ooo_last_skb is valid only
+		 * if rbtree is not empty.
+		 */
+		tp->ooo_last_skb = rb_entry(p, struct sk_buff, rbnode);
 		return;
-
+	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
-	head = skb;
-
-	for (;;) {
-		struct sk_buff *next = NULL;
 
-		if (!skb_queue_is_last(&tp->out_of_order_queue, skb))
-			next = skb_queue_next(&tp->out_of_order_queue, skb);
-		skb = next;
+	for (head = skb;;) {
+		skb = tcp_skb_next(skb, NULL);
 
-		/* Segment is terminated when we see gap or when
-		 * we are at the end of all the queue. */
+		/* Range is terminated when we see a gap or when
+		 * we are at the queue end.
+		 */
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, &tp->out_of_order_queue,
+			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
 				     head, skb, start, end);
-			head = skb;
-			if (!skb)
-				break;
-			/* Start new segment */
+			goto new_range;
+		}
+
+		if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
 			start = TCP_SKB_CB(skb)->seq;
+		if (after(TCP_SKB_CB(skb)->end_seq, end))
 			end = TCP_SKB_CB(skb)->end_seq;
-		} else {
-			if (before(TCP_SKB_CB(skb)->seq, start))
-				start = TCP_SKB_CB(skb)->seq;
-			if (after(TCP_SKB_CB(skb)->end_seq, end))
-				end = TCP_SKB_CB(skb)->end_seq;
-		}
 	}
 }
 
@@ -4845,23 +4892,36 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	bool res = false;
+	struct rb_node *node, *prev;
 
-	if (!skb_queue_empty(&tp->out_of_order_queue)) {
-		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-		__skb_queue_purge(&tp->out_of_order_queue);
+	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
+		return false;
 
-		/* Reset SACK state.  A conforming SACK implementation will
-		 * do the same at a timeout based retransmit.  When a connection
-		 * is in a sad state like this, we care only about integrity
-		 * of the connection not performance.
-		 */
-		if (tp->rx_opt.sack_ok)
-			tcp_sack_reset(&tp->rx_opt);
+	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
+
+	node = &tp->ooo_last_skb->rbnode;
+	do {
+		prev = rb_prev(node);
+		rb_erase(node, &tp->out_of_order_queue);
+		__kfree_skb(rb_to_skb(node));
 		sk_mem_reclaim(sk);
-		res = true;
-	}
-	return res;
+                if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+                    !tcp_under_memory_pressure(sk))
+                        break;
+
+		node = prev;
+	} while (node);
+	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
+
+	/* Reset SACK state.  A conforming SACK implementation will
+	 * do the same at a timeout based retransmit.  When a connection
+	 * is in a sad state like this, we care only about integrity
+	 * of the connection not performance.
+	 */
+	if (tp->rx_opt.sack_ok)
+		tcp_sack_reset(&tp->rx_opt);
+
+	return true;
 }
 
 /* Reduce allocated memory if we can, trying to get
@@ -4886,7 +4946,7 @@ static int tcp_prune_queue(struct sock *sk)
 
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
-		tcp_collapse(sk, &sk->sk_receive_queue,
+		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
 			     skb_peek(&sk->sk_receive_queue),
 			     NULL,
 			     tp->copied_seq, tp->rcv_nxt);
@@ -4991,7 +5051,7 @@ static void __tcp_ack_snd_check(struct sock *sk, int ofo_possible)
 	    /* We ACK each frame or... */
 	    tcp_in_quickack_mode(sk) ||
 	    /* We have out of order data. */
-	    (ofo_possible && skb_peek(&tp->out_of_order_queue))) {
+	    (ofo_possible && !RB_EMPTY_ROOT(&tp->out_of_order_queue))) {
 		/* Then ack it now */
 		tcp_send_ack(sk);
 	} else {
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 01715fc..ee8399f 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1830,7 +1830,7 @@ void tcp_v4_destroy_sock(struct sock *sk)
 	tcp_write_queue_purge(sk);
 
 	/* Cleans up our, hopefully empty, out_of_order_queue. */
-	__skb_queue_purge(&tp->out_of_order_queue);
+	skb_rbtree_purge(&tp->out_of_order_queue);
 
 #ifdef CONFIG_TCP_MD5SIG
 	/* Clean up the MD5 key list, if any */
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 4c1c94f..81c633d 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -495,7 +495,6 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 		newtp->snd_cwnd_cnt = 0;
 
 		tcp_init_xmit_timers(newsk);
-		__skb_queue_head_init(&newtp->out_of_order_queue);
 		newtp->write_seq = newtp->pushed_seq = treq->snt_isn + 1;
 
 		newtp->rx_opt.saw_tstamp = 0;
-- 
1.8.3.1

* [PATCH stable 4.4 6/9] tcp: avoid collapses in tcp_prune_queue() if possible
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit f4a3313d8e2ca9fd8d8f45e40a2903ba782607e7 ]

Right after a TCP flow is created, receiving tiny out of order
packets always hits the condition:

if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
	tcp_clamp_window(sk);

tcp_clamp_window() increases sk_rcvbuf to match sk_rmem_alloc
(guarded by tcp_rmem[2])

Calling tcp_collapse_ofo_queue() in this case is not useful,
and offers an O(N^2) attack surface to malicious peers.

Better not attempt anything before full queue capacity is reached,
forcing an attacker to spend lots of resources and allowing us to
detect the abuse more easily.
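
The decision described above can be sketched in plain C outside the kernel.
This is only an illustrative model: struct rx_model, prune_needs_collapse()
and the rmem_max parameter are hypothetical names, and the clamp step is a
simplification of what tcp_clamp_window() does (the real function has more
preconditions).

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two socket fields the check reads. */
struct rx_model {
	int rmem_alloc;	/* bytes currently charged to the receive side */
	int rcvbuf;	/* current receive buffer limit, in bytes */
};

/* Returns true only if usage is still above the limit after the clamp
 * step, i.e. only then is the expensive collapse pass worth attempting.
 * rmem_max plays the role of tcp_rmem[2] (assumption of this sketch).
 */
static bool prune_needs_collapse(struct rx_model *rx, int rmem_max)
{
	if (rx->rmem_alloc >= rx->rcvbuf) {
		/* Simplified clamp: grow the limit toward usage, capped. */
		int grown = rx->rmem_alloc < rmem_max ? rx->rmem_alloc : rmem_max;

		if (grown > rx->rcvbuf)
			rx->rcvbuf = grown;
	}
	return rx->rmem_alloc > rx->rcvbuf;
}
```

In this model a young flow whose usage is still below the cap never reaches
the collapse path; only a queue that has hit full capacity does.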

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: root <root@localhost.localdomain>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 32225dc..77130ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4948,6 +4948,9 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
+	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
+		return 0;
+
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue, NULL,
-- 
1.8.3.1

* [PATCH stable 4.4 5/9] tcp: free batches of packets in tcp_prune_ofo_queue()
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

Juha-Matti Tilli reported that malicious peers could inject tiny
packets in out_of_order_queue, forcing very expensive calls
to tcp_collapse_ofo_queue() and tcp_prune_ofo_queue() for
every incoming packet. The out_of_order_queue rb-tree can contain
thousands of nodes; iterating over all of them is not nice.

Before linux-4.9, we would have pruned all packets in ofo_queue
in one go, every XXXX packets. XXXX depends on sk_rcvbuf and skbs
truesize, but is about 7000 packets with tcp_rmem[2] default of 6 MB.

Since we plan to increase tcp_rmem[2] in the future to cope with
modern BDP, we cannot revert to the old behavior without great pain.

Strategy taken in this patch is to purge ~12.5 % of the queue capacity.
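
As a rough userspace illustration of the batching strategy, the sketch below
models the queue as an array of truesize values ordered by sequence number
and frees from the tail. struct ofo_model and prune_in_batches() are
hypothetical names, and the memory pressure test of the real code is left
out.

```c
#include <stddef.h>

/* Hypothetical model of the out-of-order queue and socket accounting. */
struct ofo_model {
	int *truesize;	/* per-skb truesize, ordered by sequence number */
	size_t count;	/* number of skbs still queued */
	int rmem_alloc;	/* bytes charged to the socket */
	int rcvbuf;	/* receive buffer limit */
};

/* Frees skbs from the tail. The memory check runs once per batch of
 * roughly rcvbuf/8 bytes (12.5 %) instead of once per freed skb.
 * Returns how many memory checks were performed.
 */
static size_t prune_in_batches(struct ofo_model *q)
{
	int goal = q->rcvbuf >> 3;
	size_t checks = 0;

	while (q->count > 0) {
		int freed = q->truesize[--q->count];

		q->rmem_alloc -= freed;
		goal -= freed;
		if (q->count == 0 || goal <= 0) {
			checks++;
			if (q->rmem_alloc <= q->rcvbuf)
				break;
			goal = q->rcvbuf >> 3;	/* start the next batch */
		}
	}
	return checks;
}
```

With many tiny skbs, the number of checks grows with the bytes that must be
freed divided by the batch size, not with the number of skbs.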

Fixes: 36a6503fedda ("tcp: refine tcp_prune_ofo_queue() to not drop all packets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: root <root@localhost.localdomain>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 12edc4f..32225dc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4893,22 +4893,26 @@ static bool tcp_prune_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct rb_node *node, *prev;
+	int goal;
 
 	if (RB_EMPTY_ROOT(&tp->out_of_order_queue))
 		return false;
 
 	NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_OFOPRUNED);
-
+	goal = sk->sk_rcvbuf >> 3;
 	node = &tp->ooo_last_skb->rbnode;
 	do {
 		prev = rb_prev(node);
 		rb_erase(node, &tp->out_of_order_queue);
+		goal -= rb_to_skb(node)->truesize;
 		__kfree_skb(rb_to_skb(node));
-		sk_mem_reclaim(sk);
-                if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
-                    !tcp_under_memory_pressure(sk))
-                        break;
-
+		if (!prev || goal <= 0) {
+			sk_mem_reclaim(sk);
+			if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf &&
+			    !tcp_under_memory_pressure(sk))
+				break;
+			goal = sk->sk_rcvbuf >> 3;
+		}
 		node = prev;
 	} while (node);
 	tp->ooo_last_skb = rb_entry(prev, struct sk_buff, rbnode);
-- 
1.8.3.1

* [PATCH stable 4.4 7/9] tcp: detect malicious patterns in tcp_collapse_ofo_queue()
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

[ Upstream commit 3d4bf93ac12003f9b8e1e2de37fe27983deebdcf ]

In case an attacker feeds tiny packets completely out of order,
tcp_collapse_ofo_queue() might scan the whole rb-tree, performing
expensive copies, but not changing socket memory usage at all.

1) Do not attempt to collapse tiny skbs.
2) Add logic to exit early when too many tiny skbs are detected.

We prefer not doing aggressive collapsing (which copies packets)
for pathological flows, and revert to tcp_prune_ofo_queue() which
will be less expensive.
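
The two rules can be sketched as a small classifier over one contiguous
range. The names below (struct range_model, classify_range(),
TINY_PAYLOAD_LIMIT) are hypothetical, and the constant is only an assumed
stand-in for SKB_WITH_OVERHEAD(SK_MEM_QUANTUM), whose real value depends on
page size and structure layout.

```c
#include <stdint.h>

/* Assumed illustrative value, not taken from any kernel header. */
#define TINY_PAYLOAD_LIMIT 3776u

/* Hypothetical summary of one contiguous range of queued skbs. */
struct range_model {
	uint32_t start, end;		/* sequence space covered */
	uint32_t range_truesize;	/* sum of truesize over the range */
	uint32_t head_truesize;		/* truesize of the first skb only */
};

enum range_action { RANGE_COLLAPSE, RANGE_SKIP, RANGE_GIVE_UP };

/* A range made of a single skb with a small payload is not worth
 * copying. Once such ranges add up to more than rcvbuf/8 bytes, the
 * caller should stop scanning altogether.
 */
static enum range_action classify_range(const struct range_model *r,
					uint32_t *sum_tiny, uint32_t rcvbuf)
{
	if (r->range_truesize != r->head_truesize ||
	    r->end - r->start >= TINY_PAYLOAD_LIMIT)
		return RANGE_COLLAPSE;

	*sum_tiny += r->range_truesize;
	if (*sum_tiny > (rcvbuf >> 3))
		return RANGE_GIVE_UP;
	return RANGE_SKIP;
}
```

The caller would keep sum_tiny across ranges of one scan and fall back to
pruning when RANGE_GIVE_UP is returned.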

In the future, we might add the possibility of terminating flows
that are proven to be malicious.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: root <root@localhost.localdomain>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 77130ae..c48924f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4845,6 +4845,7 @@ end:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
+	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb, *head;
 	struct rb_node *p;
 	u32 start, end;
@@ -4863,6 +4864,7 @@ new_range:
 	}
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
+	range_truesize = skb->truesize;
 
 	for (head = skb;;) {
 		skb = tcp_skb_next(skb, NULL);
@@ -4873,11 +4875,21 @@ new_range:
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			tcp_collapse(sk, NULL, &tp->out_of_order_queue,
-				     head, skb, start, end);
+			/* Do not attempt collapsing tiny skbs */
+			if (range_truesize != head->truesize ||
+			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
+				tcp_collapse(sk, NULL, &tp->out_of_order_queue,
+					     head, skb, start, end);
+			} else {
+				sum_tiny += range_truesize;
+				if (sum_tiny > sk->sk_rcvbuf >> 3)
+					return;
+			}
+
 			goto new_range;
 		}
 
+		range_truesize += skb->truesize;
 		if (unlikely(before(TCP_SKB_CB(skb)->seq, start)))
 			start = TCP_SKB_CB(skb)->seq;
 		if (after(TCP_SKB_CB(skb)->end_seq, end))
-- 
1.8.3.1

* [PATCH stable 4.4 9/9] tcp: add tcp_ooo_try_coalesce() helper
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

In case an skb in out_of_order_queue is the result of
multiple skbs coalescing, we would like to get a proper gso_segs
counter tracking, so that future tcp_drop() can report an accurate
number.

I chose to not implement this tracking for skbs in receive queue,
since they are not dropped, unless socket is disconnected.
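
The counter arithmetic can be illustrated in isolation. The helper below is
a hypothetical userspace sketch (merged_gso_segs() is not a kernel
function): a count of zero is treated as one segment, and the sum saturates
at the 16-bit maximum instead of wrapping.

```c
#include <stdint.h>

/* Combine the segment counts of two coalesced buffers.
 * A stored count of 0 means "one segment"; the result is clamped to
 * 0xFFFF so it still fits in a 16-bit field.
 */
static uint16_t merged_gso_segs(uint16_t to_segs, uint16_t from_segs)
{
	uint32_t sum = (uint32_t)(to_segs ? to_segs : 1) +
		       (uint32_t)(from_segs ? from_segs : 1);

	return sum > 0xFFFF ? 0xFFFF : (uint16_t)sum;
}
```

Doing the addition in 32 bits before clamping is what keeps a long chain of
coalesced skbs from wrapping the counter back to a small value.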

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 96a1e0d..fdb5509 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,23 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static bool tcp_ooo_try_coalesce(struct sock *sk,
+			     struct sk_buff *to,
+			     struct sk_buff *from,
+			     bool *fragstolen)
+{
+	bool res = tcp_try_coalesce(sk, to, from, fragstolen);
+
+	/* In case tcp_drop() is called later, update to->gso_segs */
+	if (res) {
+		u32 gso_segs = max_t(u16, 1, skb_shinfo(to)->gso_segs) +
+			       max_t(u16, 1, skb_shinfo(from)->gso_segs);
+
+		skb_shinfo(to)->gso_segs = min_t(u32, gso_segs, 0xFFFF);
+	}
+	return res;
+}
+
 static void tcp_drop(struct sock *sk, struct sk_buff *skb)
 {
 	sk_drops_add(sk, skb);
@@ -4422,7 +4439,8 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 	/* In the typical case, we are adding an skb to the end of the list.
 	 * Use of ooo_last_skb avoids the O(Log(N)) rbtree lookup.
 	 */
-	if (tcp_try_coalesce(sk, tp->ooo_last_skb, skb, &fragstolen)) {
+	if (tcp_ooo_try_coalesce(sk, tp->ooo_last_skb,
+				 skb, &fragstolen)) {
 coalesce_done:
 		tcp_grow_window(sk, skb);
 		kfree_skb_partial(skb, fragstolen);
@@ -4467,7 +4485,8 @@ coalesce_done:
 				tcp_drop(sk, skb1);
 				goto add_sack;
 			}
-		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
+		} else if (tcp_ooo_try_coalesce(sk, skb1,
+						skb, &fragstolen)) {
 			goto coalesce_done;
 		}
 		p = &parent->rb_right;
-- 
1.8.3.1

* [PATCH stable 4.4 8/9] tcp: call tcp_drop() from tcp_data_queue_ofo()
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

In order to be able to give better diagnostics and detect
malicious traffic, we need to have better sk->sk_drops tracking.

Fixes: 9f5afeae5152 ("tcp: use an RB tree for ooo receive queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 net/ipv4/tcp_input.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index c48924f..96a1e0d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4445,7 +4445,7 @@ coalesce_done:
 				/* All the bits are present. Drop. */
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPOFOMERGE);
-				__kfree_skb(skb);
+				tcp_drop(sk, skb);
 				skb = NULL;
 				tcp_dsack_set(sk, seq, end_seq);
 				goto add_sack;
@@ -4464,7 +4464,7 @@ coalesce_done:
 						 TCP_SKB_CB(skb1)->end_seq);
 				NET_INC_STATS(sock_net(sk),
 					      LINUX_MIB_TCPOFOMERGE);
-				__kfree_skb(skb1);
+				tcp_drop(sk, skb1);
 				goto add_sack;
 			}
 		} else if (tcp_try_coalesce(sk, skb1, skb, &fragstolen)) {
-- 
1.8.3.1

* [PATCH stable 4.4 0/9] fix SegmentSmack (CVE-2018-5390)
From: Mao Wenan @ 2018-08-15 13:20 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw

There are five patches that fix CVE-2018-5390 in the latest mainline
branch, but only two of them exist in stable 4.4 and 3.18:
dc6ae4d tcp: detect malicious patterns in tcp_collapse_ofo_queue()
5fbec48 tcp: avoid collapses in tcp_prune_queue() if possible
I tested with these two patches applied and found that CPU usage was still very high.
Test results:
with fix patches:    78.2%   ksoftirqd
without fix patches: 90%     ksoftirqd

After analysing the stable 4.4 code and debugging the
system, I found that searching the ofo_queue (in 4.4 the TCP ofo queue is a simple queue) costs more cycles.
So I think these two patches alone cannot fix CVE-2018-5390.
I therefore backported "tcp: use an RB tree for ooo receive queue", which uses an RB tree
instead of a simple queue, and then backported Eric Dumazet's 5 fix patches from mainline.
The good news is that ksoftirqd drops to about 20%, which is now the same as mainline.
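
To give a feel for why the queue search dominates, here is a small
userspace sketch that counts comparisons when locating an insertion point
among N sorted sequence numbers. It is an illustration only: both function
names are hypothetical, binary search over an array stands in for descending
a balanced tree, and sequence number wraparound is ignored.

```c
#include <stddef.h>
#include <stdint.h>

/* Walk from the head until the first entry with seq >= key, as a
 * simple sorted queue would. Returns the number of comparisons.
 */
static size_t linear_probe_cost(const uint32_t *seqs, size_t n, uint32_t key)
{
	size_t cmp = 0;

	for (size_t i = 0; i < n; i++) {
		cmp++;
		if (seqs[i] >= key)
			break;
	}
	return cmp;
}

/* Halve the search interval at every step, which is roughly what a
 * balanced tree descent costs. Returns the number of comparisons.
 */
static size_t tree_probe_cost(const uint32_t *seqs, size_t n, uint32_t key)
{
	size_t lo = 0, hi = n, cmp = 0;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		cmp++;
		if (seqs[mid] < key)
			lo = mid + 1;
		else
			hi = mid;
	}
	return cmp;
}
```

For a queue of a thousand entries, the linear walk can touch every entry in
the worst case, while the halving search needs on the order of log2(N)
steps.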

Eric Dumazet (6):
  tcp: increment sk_drops for dropped rx packets
  tcp: free batches of packets in tcp_prune_ofo_queue()
  tcp: avoid collapses in tcp_prune_queue() if possible
  tcp: detect malicious patterns in tcp_collapse_ofo_queue()
  tcp: call tcp_drop() from tcp_data_queue_ofo()
  tcp: add tcp_ooo_try_coalesce() helper

Mao Wenan (2):
  Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
  Revert "tcp: avoid collapses in tcp_prune_queue() if possible"

Yaogong Wang (1):
  tcp: use an RB tree for ooo receive queue

 include/linux/skbuff.h   |   8 +
 include/linux/tcp.h      |   7 +-
 include/net/sock.h       |   7 +
 include/net/tcp.h        |   2 +-
 net/core/skbuff.c        |  19 +++
 net/ipv4/tcp.c           |   4 +-
 net/ipv4/tcp_input.c     | 412 +++++++++++++++++++++++++++++------------------
 net/ipv4/tcp_ipv4.c      |   3 +-
 net/ipv4/tcp_minisocks.c |   1 -
 net/ipv6/tcp_ipv6.c      |   1 +
 10 files changed, 294 insertions(+), 170 deletions(-)

-- 
1.8.3.1

* [PATCH stable 4.4 3/9] tcp: increment sk_drops for dropped rx packets
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

From: Eric Dumazet <edumazet@google.com>

Now that ss can report sk_drops, we can instruct TCP to increment
this per socket counter when it drops an incoming frame, to refine
monitoring and debugging.

A following patch takes care of listener drops.
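
The accounting idea can be sketched in userspace with C11 atomics. The
names struct sock_model and model_drops_add() are hypothetical; the sketch
only shows that a dropped buffer is counted as the number of segments it
carries, with a minimum of one.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the per-socket drop counter. */
struct sock_model {
	atomic_int drops;
};

/* Account one dropped buffer. A buffer that aggregates several
 * segments is counted once per segment; a count of 0 means one.
 */
static void model_drops_add(struct sock_model *sk, uint16_t gso_segs)
{
	int segs = gso_segs ? gso_segs : 1;

	atomic_fetch_add(&sk->drops, segs);
}
```

Counting segments rather than buffers keeps the reported number comparable
whether or not the receive path aggregated packets before the drop.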

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 include/net/sock.h   |  7 +++++++
 net/ipv4/tcp_input.c | 33 ++++++++++++++++++++-------------
 net/ipv4/tcp_ipv4.c  |  1 +
 net/ipv6/tcp_ipv6.c  |  1 +
 4 files changed, 29 insertions(+), 13 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 3d5ff74..5770757 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -2139,6 +2139,13 @@ sock_skb_set_dropcount(const struct sock *sk, struct sk_buff *skb)
 	SOCK_SKB_CB(skb)->dropcount = atomic_read(&sk->sk_drops);
 }
 
+static inline void sk_drops_add(struct sock *sk, const struct sk_buff *skb)
+{
+	int segs = max_t(u16, 1, skb_shinfo(skb)->gso_segs);
+
+	atomic_add(segs, &sk->sk_drops);
+}
+
 void __sock_recv_timestamp(struct msghdr *msg, struct sock *sk,
 			   struct sk_buff *skb);
 void __sock_recv_wifi_status(struct msghdr *msg, struct sock *sk,
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df2f342..5fb4e80 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4296,6 +4296,12 @@ static bool tcp_try_coalesce(struct sock *sk,
 	return true;
 }
 
+static void tcp_drop(struct sock *sk, struct sk_buff *skb)
+{
+	sk_drops_add(sk, skb);
+	__kfree_skb(skb);
+}
+
 /* This one checks to see if we can put data from the
  * out_of_order queue into the receive_queue.
  */
@@ -4320,7 +4326,7 @@ static void tcp_ofo_queue(struct sock *sk)
 		__skb_unlink(skb, &tp->out_of_order_queue);
 		if (!after(TCP_SKB_CB(skb)->end_seq, tp->rcv_nxt)) {
 			SOCK_DEBUG(sk, "ofo packet was already received\n");
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			continue;
 		}
 		SOCK_DEBUG(sk, "ofo requeuing : rcv_next %X seq %X - %X\n",
@@ -4372,7 +4378,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 
 	if (unlikely(tcp_try_rmem_schedule(sk, skb, skb->truesize))) {
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFODROP);
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
 
@@ -4436,7 +4442,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		if (!after(end_seq, TCP_SKB_CB(skb1)->end_seq)) {
 			/* All the bits are present. Drop. */
 			NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			skb = NULL;
 			tcp_dsack_set(sk, seq, end_seq);
 			goto add_sack;
@@ -4475,7 +4481,7 @@ static void tcp_data_queue_ofo(struct sock *sk, struct sk_buff *skb)
 		tcp_dsack_extend(sk, TCP_SKB_CB(skb1)->seq,
 				 TCP_SKB_CB(skb1)->end_seq);
 		NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPOFOMERGE);
-		__kfree_skb(skb1);
+		tcp_drop(sk, skb1);
 	}
 
 add_sack:
@@ -4558,12 +4564,13 @@ err:
 static void tcp_data_queue(struct sock *sk, struct sk_buff *skb)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int eaten = -1;
 	bool fragstolen = false;
+	int eaten = -1;
 
-	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq)
-		goto drop;
-
+	if (TCP_SKB_CB(skb)->seq == TCP_SKB_CB(skb)->end_seq) {
+		__kfree_skb(skb);
+		return;
+	}
 	skb_dst_drop(skb);
 	__skb_pull(skb, tcp_hdr(skb)->doff * 4);
 
@@ -4645,7 +4652,7 @@ out_of_window:
 		tcp_enter_quickack_mode(sk, TCP_MAX_QUICKACKS);
 		inet_csk_schedule_ack(sk);
 drop:
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 		return;
 	}
 
@@ -5220,7 +5227,7 @@ syn_challenge:
 	return true;
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 	return false;
 }
 
@@ -5438,7 +5445,7 @@ csum_error:
 	TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
 
 discard:
-	__kfree_skb(skb);
+	tcp_drop(sk, skb);
 }
 EXPORT_SYMBOL(tcp_rcv_established);
 
@@ -5668,7 +5675,7 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 						  TCP_DELACK_MAX, TCP_RTO_MAX);
 
 discard:
-			__kfree_skb(skb);
+			tcp_drop(sk, skb);
 			return 0;
 		} else {
 			tcp_send_ack(sk);
@@ -6025,7 +6032,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 
 	if (!queued) {
 discard:
-		__kfree_skb(skb);
+		tcp_drop(sk, skb);
 	}
 	return 0;
 }
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index eeda67c..01715fc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1716,6 +1716,7 @@ discard_it:
 	return 0;
 
 discard_and_relse:
+	sk_drops_add(sk, skb);
 	sock_put(sk);
 	goto discard_it;
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 90abe88..d6c1911 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1505,6 +1505,7 @@ discard_it:
 	return 0;
 
 discard_and_relse:
+	sk_drops_add(sk, skb);
 	sock_put(sk);
 	goto discard_it;
 
-- 
1.8.3.1

* [PATCH stable 4.4 1/9] Revert "tcp: detect malicious patterns in tcp_collapse_ofo_queue()"
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

This reverts commit dc6ae4dffd656811dee7151b19545e4cd839d378.
---
 net/ipv4/tcp_input.c | 16 ++--------------
 1 file changed, 2 insertions(+), 14 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 4a261e0..995b2bc 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4791,7 +4791,6 @@ restart:
 static void tcp_collapse_ofo_queue(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	u32 range_truesize, sum_tiny = 0;
 	struct sk_buff *skb = skb_peek(&tp->out_of_order_queue);
 	struct sk_buff *head;
 	u32 start, end;
@@ -4801,7 +4800,6 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 
 	start = TCP_SKB_CB(skb)->seq;
 	end = TCP_SKB_CB(skb)->end_seq;
-	range_truesize = skb->truesize;
 	head = skb;
 
 	for (;;) {
@@ -4816,24 +4814,14 @@ static void tcp_collapse_ofo_queue(struct sock *sk)
 		if (!skb ||
 		    after(TCP_SKB_CB(skb)->seq, end) ||
 		    before(TCP_SKB_CB(skb)->end_seq, start)) {
-			/* Do not attempt collapsing tiny skbs */
-			if (range_truesize != head->truesize ||
-			    end - start >= SKB_WITH_OVERHEAD(SK_MEM_QUANTUM)) {
-				tcp_collapse(sk, &tp->out_of_order_queue,
-					     head, skb, start, end);
-			} else {
-				sum_tiny += range_truesize;
-				if (sum_tiny > sk->sk_rcvbuf >> 3)
-					return;
-			}
-
+			tcp_collapse(sk, &tp->out_of_order_queue,
+				     head, skb, start, end);
 			head = skb;
 			if (!skb)
 				break;
 			/* Start new segment */
 			start = TCP_SKB_CB(skb)->seq;
 			end = TCP_SKB_CB(skb)->end_seq;
-			range_truesize = skb->truesize;
 		} else {
 			if (before(TCP_SKB_CB(skb)->seq, start))
 				start = TCP_SKB_CB(skb)->seq;
-- 
1.8.3.1

* [PATCH stable 4.4 2/9] Revert "tcp: avoid collapses in tcp_prune_queue() if possible"
From: Mao Wenan @ 2018-08-15 13:21 UTC (permalink / raw)
  To: dwmw2, gregkh, netdev, eric.dumazet, edumazet, davem, ycheng, jdw
In-Reply-To: <1534339268-111834-1-git-send-email-maowenan@huawei.com>

This reverts commit 5fbec4801264cb3279ef6ac9c70bcbe2aaef89d5.
---
 net/ipv4/tcp_input.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 995b2bc..df2f342 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -4877,9 +4877,6 @@ static int tcp_prune_queue(struct sock *sk)
 	else if (tcp_under_memory_pressure(sk))
 		tp->rcv_ssthresh = min(tp->rcv_ssthresh, 4U * tp->advmss);
 
-	if (atomic_read(&sk->sk_rmem_alloc) <= sk->sk_rcvbuf)
-		return 0;
-
 	tcp_collapse_ofo_queue(sk);
 	if (!skb_queue_empty(&sk->sk_receive_queue))
 		tcp_collapse(sk, &sk->sk_receive_queue,
-- 
1.8.3.1

* WARNING in wiphy_register (2)
From: syzbot @ 2018-08-15 16:00 UTC (permalink / raw)
  To: davem-fT/PcQaiUtIeIZ0/mPfg9Q, johannes-cdvu00un1VgdHxzADdlk8Q,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-wireless-u79uwXL29TY76Z2rM5mHXA,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	syzkaller-bugs-/JYPxA39Uh5TLH3MbocFFw

Hello,

syzbot found the following crash on:

HEAD commit:    ec0c96714e7d Merge git://git.kernel.org/pub/scm/linux/kern..
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=17b01274400000
kernel config:  https://syzkaller.appspot.com/x/.config?x=152cb8ccd35b1f70
dashboard link: https://syzkaller.appspot.com/bug?extid=2a12f11c306afe871c1f
compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
syzkaller repro:https://syzkaller.appspot.com/x/repro.syz?x=10b2a2f8400000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=14ff3622400000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+2a12f11c306afe871c1f-Pl5Pbv+GP7P466ipTTIvnc23WoclnBCfAL8bYrjMMd8@public.gmane.org

random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
random: sshd: uninitialized urandom read (32 bytes read)
netlink: 'syz-executor064': attribute type 9 has an invalid length.
WARNING: CPU: 0 PID: 4439 at net/wireless/core.c:550 wiphy_verify_combinations net/wireless/core.c:550 [inline]
WARNING: CPU: 0 PID: 4439 at net/wireless/core.c:550 wiphy_register+0x12c1/0x2510 net/wireless/core.c:741
Kernel panic - not syncing: panic_on_warn set ...

CPU: 0 PID: 4439 Comm: syz-executor064 Not tainted 4.18.0-rc8+ #184
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
  panic+0x238/0x4e7 kernel/panic.c:184
  __warn.cold.8+0x163/0x1ba kernel/panic.c:536
  report_bug+0x252/0x2d0 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:178 [inline]
  do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992
RIP: 0010:wiphy_verify_combinations net/wireless/core.c:550 [inline]
RIP: 0010:wiphy_register+0x12c1/0x2510 net/wireless/core.c:741
Code: 74 08 3c 01 0f 8e c5 10 00 00 66 45 89 67 30 e9 18 f9 ff ff e8 40 00 4c fb 0f 0b bb ea ff ff ff e9 f5 f3 ff ff e8 2f 00 4c fb <0f> 0b bb ea ff ff ff e9 e4 f3 ff ff e8 1e 00 4c fb 0f 0b bb ea ff
RSP: 0018:ffff8801ac51ee60 EFLAGS: 00010293
RAX: ffff8801af3fc700 RBX: 0000000000000000 RCX: ffffffff86301238
RDX: 0000000000000000 RSI: ffffffff86301821 RDI: 0000000000000005
RBP: ffff8801ac51f000 R08: ffff8801af3fc700 R09: ffffed00358d6114
R10: ffffed00358d6114 R11: ffff8801ac6b08a5 R12: ffff8801ac6b327c
R13: ffffffff87838ae0 R14: dffffc0000000000 R15: ffff8801ac6b08a0
  ieee80211_register_hw+0x13d5/0x35e0 net/mac80211/main.c:1050
  mac80211_hwsim_new_radio+0x1db8/0x33b0 drivers/net/wireless/mac80211_hwsim.c:2772
  hwsim_new_radio_nl+0x7c0/0xa80 drivers/net/wireless/mac80211_hwsim.c:3247
  genl_family_rcv_msg+0x8a3/0x1140 net/netlink/genetlink.c:601
  genl_rcv_msg+0xc6/0x168 net/netlink/genetlink.c:626
  netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2455
  genl_rcv+0x28/0x40 net/netlink/genetlink.c:637
  netlink_unicast_kernel net/netlink/af_netlink.c:1317 [inline]
  netlink_unicast+0x5a0/0x760 net/netlink/af_netlink.c:1343
  netlink_sendmsg+0xa18/0xfd0 net/netlink/af_netlink.c:1908
  sock_sendmsg_nosec net/socket.c:642 [inline]
  sock_sendmsg+0xd5/0x120 net/socket.c:652
  ___sys_sendmsg+0x7fd/0x930 net/socket.c:2126
  __sys_sendmsg+0x11d/0x290 net/socket.c:2164
  __do_sys_sendmsg net/socket.c:2173 [inline]
  __se_sys_sendmsg net/socket.c:2171 [inline]
  __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2171
  do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4402c9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffcf9426d68 EFLAGS: 00000213 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004402c9
RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000213 R12: 0000000000401b50
R13: 0000000000401be0 R14: 0000000000000000 R15: 0000000000000000
Dumping ftrace buffer:
    (ftrace buffer empty)
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches

* [BUG] Kernel Oops and crash using i40e VF devices
From: Maik Broemme @ 2018-08-15 14:23 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel

Hi,

I have a SuperMicro X11SPM-F mainboard with two Intel X722 devices which
support up to 32 VF devices per PF device. They are running with the i40e
driver. Whenever I try to use the VF devices in Xen VMs, the host kernel
oopses or crashes. In all cases the PF running on the host immediately
loses its network connection. I can always reproduce this by running the
following:

Enable VFs:

$> echo 24 > /sys/bus/pci/devices/0000:b5:00.2/sriov_numvfs
$> echo 2 > /sys/bus/pci/devices/0000:b5:00.3/sriov_numvfs

Assign MACs:

$> ip link set net0 vf 0 mac 00:16:3e:00:b9:1e
...

Enable trust:

$> ip link set net0 vf 0 trust on
...

Assign NICs:

$> xl pci-assignable-add b5:0a.0
...

If I start one VM, everything works fine. As soon as I start a second one,
the host becomes unavailable and the log shows the following:

Aug 15 12:33:44 server kernel: xen_pciback: vpci: 0000:b5:0b.3: assign to virtual slot 0
Aug 15 12:33:44 server kernel: pciback 0000:b5:0b.3: registering for 3
Aug 15 12:33:58 server kernel: xen-blkback: backend/vbd/3/51712: using 2 queues, protocol 1 (x86_64-abi) persistent grants
Aug 15 12:34:04 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:34:04 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:34:10 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:34:10 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:34:10 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:34:10 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:34:41 server kernel: i40e 0000:b5:00.2: VSI seid 409 Tx ring 175 disable timeout
Aug 15 12:34:52 server kernel: i40e 0000:b5:00.2: VSI seid 409 Tx ring 175 disable timeout
Aug 15 12:34:58 server kernel: i40e 0000:b5:00.2: VSI seid 409 Tx ring 175 disable timeout
Aug 15 12:35:09 server kernel: i40e 0000:b5:00.2: VSI seid 409 Tx ring 175 disable timeout
Aug 15 12:35:55 server kernel: i40e 0000:b5:00.2: VSI seid 409 Tx ring 175 disable timeout
Aug 15 12:36:26 server kernel: i40e 0000:b5:00.2: VSI seid 409 Tx ring 175 disable timeout
Aug 15 12:36:39 server kernel: i40e 0000:b5:00.2: VSI seid 409 Tx ring 175 disable timeout
Aug 15 12:36:41 server kernel: i40e 0000:b5:00.2: VSI seid 409 Tx ring 175 disable timeout
Aug 15 12:36:41 server kernel: BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
Aug 15 12:36:41 server kernel: PGD 0 P4D 0
Aug 15 12:36:41 server kernel: Oops: 0000 [#1] PREEMPT SMP NOPTI
Aug 15 12:36:41 server kernel: Modules linked in: dm_crypt algif_skcipher af_alg bonding intel_rapl skx_edac nfit intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc iTCO_wdt iTCO_vendor_support nls_iso8859_1 nls_cp437 vfat aesni_intel fat aes_x86_64 crypto_simd cryptd glue_helper ofpart ipmi_ssif cmdlinepart intel_rapl_perf pcspkr i40e ast i2c_algo_bit ttm drm_kms_helper drm intel_spi_pci intel_spi spi_nor mtd i2c_i801 agpgart syscopyarea joydev sysfillrect sysimgblt fb_sys_fops input_leds mousedev led_class mei_me shpchp lpc_ich mei ioatdma dca wmi ipmi_si ipmi_devintf rtc_cmos ipmi_msghandler acpi_power_meter evdev mac_hid xen_acpi_processor xen_pciback xen_netback xen_blkback xen_gntalloc xen_gntdev xen_evtchn xenfs xen_privcmd ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 fscrypto
Aug 15 12:36:41 server kernel:  hid_generic usbhid hid sd_mod ahci libahci crc32c_intel libata xhci_pci xhci_hcd usbcore usb_common scsi_mod dm_mod
Aug 15 12:36:41 server kernel: CPU: 1 PID: 1326 Comm: logger Not tainted 4.17.14-arch1-1-ARCH #1
Aug 15 12:36:41 server kernel: Hardware name: Supermicro Super Server/X11SPM-F, BIOS 2.1 06/15/2018
Aug 15 12:36:41 server kernel: RIP: e030:__rb_insert_augmented+0x32/0x230
Aug 15 12:36:41 server kernel: RSP: e02b:ffffc90043ed3d98 EFLAGS: 00010246 
Aug 15 12:36:41 server kernel: RAX: ffff880109ddec58 RBX: 0000000000000000 RCX: ffff88010bf2d7c8
Aug 15 12:36:41 server kernel: RDX: 0000000000000000 RSI: ffff88010bf2d7c0 RDI: ffff880109ddec58
Aug 15 12:36:41 server kernel: RBP: ffff88004bf9eb98 R08: ffffffff811e56e0 R09: ffff880109ddec58
Aug 15 12:36:41 server kernel: R10: 0000000000000285 R11: ffff88004bf9eb40 R12: ffff88010bf2d7d0
Aug 15 12:36:41 server kernel: R13: ffff88010bf2d7c0 R14: 00007fdd44c4e000 R15: 0000000000000000
Aug 15 12:36:41 server kernel: FS:  0000000000000000(0000) GS:ffff880115040000(0000) knlGS:0000000000000000
Aug 15 12:36:41 server kernel: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 12:36:41 server kernel: CR2: 0000000000000008 CR3: 000000004bd04000 CR4: 0000000000042660
Aug 15 12:36:41 server kernel: Call Trace:
Aug 15 12:36:41 server kernel:  __vma_adjust+0x2bb/0x7d0
Aug 15 12:36:41 server kernel:  ? kmem_cache_alloc+0x179/0x1d0
Aug 15 12:36:41 server kernel:  __split_vma+0x117/0x1c0
Aug 15 12:36:41 server kernel:  mprotect_fixup+0x1f6/0x240
Aug 15 12:36:41 server kernel:  do_mprotect_pkey+0x1b4/0x2f0
Aug 15 12:36:41 server kernel:  ? ksys_mmap_pgoff+0x19e/0x220
Aug 15 12:36:41 server kernel:  __x64_sys_mprotect+0x1b/0x20
Aug 15 12:36:41 server kernel:  do_syscall_64+0x5b/0x170
Aug 15 12:36:41 server kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xa9
Aug 15 12:36:41 server kernel: RIP: 0033:0x7fdd44c714cb
Aug 15 12:36:41 server kernel: RSP: 002b:00007ffe224454a8 EFLAGS: 00000206 ORIG_RAX: 000000000000000a
Aug 15 12:36:41 server kernel: RAX: ffffffffffffffda RBX: 00007fdd44c53000 RCX: 00007fdd44c714cb
Aug 15 12:36:41 server kernel: RDX: 0000000000000000 RSI: 00000000001ff000 RDI: 00007fdd44a4f000
Aug 15 12:36:41 server kernel: RBP: 00007ffe22445770 R08: 0000000000000005 R09: 0000000000000000
Aug 15 12:36:41 server kernel: R10: 00007ffe22445858 R11: 0000000000000206 R12: 0000000000000000
Aug 15 12:36:41 server kernel: R13: 000000000000fe01 R14: 00007ffe22445810 R15: 0000000000000002
Aug 15 12:36:41 server kernel: Code: 55 48 89 fd 53 48 83 ec 08 48 8b 07 48 89 c7 84 d2 74 03 48 89 29 48 85 c0 0f 84 c8 01 00 00 48 8b 18 f6 c3 01 0f 85 14 01 00 00 <48> 8b 43 08 48 89 da 48 39 c7 74 6c 48 85 c0 74 09 f6 00 01 0f
Aug 15 12:36:41 server kernel: RIP: __rb_insert_augmented+0x32/0x230 RSP: ffffc90043ed3d98
Aug 15 12:36:41 server kernel: CR2: 0000000000000008
Aug 15 12:36:41 server kernel: ---[ end trace ab257d75c031e186 ]---
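
The fault address in this first Oops can be cross-checked by hand. The
marked bytes in the Code: line, <48> 8b 43 08, decode to
mov 0x8(%rbx),%rax. RBX is 0 and CR2 is 0x8, so this is a read 8 bytes
into a NULL structure. A minimal sketch of that arithmetic follows,
assuming the NULL pointer is a struct rb_node being walked by
__rb_insert_augmented() (not confirmed without debug symbols). The struct
is a local mirror of the kernel's field order and rb_right_fault_addr()
is a hypothetical helper, not a kernel function:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Local mirror of the field order of the kernel's struct rb_node
 * (include/linux/rbtree.h). For illustration only; assumes a 64-bit
 * (LP64) build where unsigned long and pointers are 8 bytes.
 */
struct rb_node_mirror {
	unsigned long parent_color;      /* __rb_parent_color in the kernel */
	struct rb_node_mirror *rb_right;
	struct rb_node_mirror *rb_left;
};

/* Address that a load of base->rb_right would touch. */
static uint64_t rb_right_fault_addr(uint64_t base)
{
	return base + offsetof(struct rb_node_mirror, rb_right);
}
```

With base = 0 (the RBX value above) the helper yields 0x8 on a 64-bit
build, which matches CR2 in the trace.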

After that, the PF and VFs are no longer accessible. In another attempt
with the same kernel I get:

Aug 15 12:43:05 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:43:05 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:43:05 server kernel: i40e 0000:b5:00.2: Too many MDD events on VF 11, disabled
Aug 15 12:43:05 server kernel: i40e 0000:b5:00.2: Use PF Control I/F to re-enable the VF
Aug 15 12:43:05 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:43:05 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:43:05 server kernel: i40e 0000:b5:00.2: Too many MDD events on VF 11, disabled
Aug 15 12:43:05 server kernel: i40e 0000:b5:00.2: Use PF Control I/F to re-enable the VF
Aug 15 12:43:05 server kernel: bond0: link status definitely down for interface net0, disabling it
Aug 15 12:43:05 server kernel: bond0: now running without any active interface!
Aug 15 12:43:06 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:43:06 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:43:06 server kernel: i40e 0000:b5:00.2: Too many MDD events on VF 11, disabled
Aug 15 12:43:06 server kernel: i40e 0000:b5:00.2: Use PF Control I/F to re-enable the VF
Aug 15 12:43:06 server kernel: bond0: link status definitely up for interface net0, 1000 Mbps full duplex
Aug 15 12:43:06 server kernel: bond0: first active interface up!
Aug 15 12:43:06 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:43:06 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:43:06 server kernel: i40e 0000:b5:00.2: Too many MDD events on VF 11, disabled
Aug 15 12:43:06 server kernel: i40e 0000:b5:00.2: Use PF Control I/F to re-enable the VF
...
Aug 15 12:43:28 server kernel: WARNING: CPU: 0 PID: 2649 at arch/x86/xen/multicalls.c:130 xen_mc_flush+0x1cd/0x1e0
Aug 15 12:43:28 server kernel: Modules linked in: dm_crypt algif_skcipher af_alg bonding intel_rapl skx_edac nfit intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc joydev mousedev input_leds led_class iTCO_wdt iTCO_vendor_support hid_generic ipmi_ssif aesni_intel aes_x86_64 crypto_simd cryptd glue_helper nls_iso8859_1 nls_cp437 vfat fat ofpart cmdlinepart intel_rapl_perf pcspkr ast i2c_algo_bit ttm drm_kms_helper i40e drm agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops intel_spi_pci intel_spi spi_nor mtd i2c_i801 lpc_ich usbhid hid shpchp mei_me mei ioatdma dca wmi ipmi_si ipmi_devintf rtc_cmos ipmi_msghandler acpi_power_meter evdev mac_hid xen_acpi_processor xen_pciback xen_netback xen_blkback xenfs xen_privcmd xen_gntalloc xen_gntdev xen_evtchn ip_tables x_tables ext4 crc32c_generic crc16
Aug 15 12:43:28 server kernel:  mbcache jbd2 fscrypto sd_mod ahci libahci crc32c_intel xhci_pci xhci_hcd usbcore libata usb_common scsi_mod dm_mod
Aug 15 12:43:28 server kernel: CPU: 0 PID: 2649 Comm: cc1 Not tainted 4.17.14-arch1-1-ARCH #1
Aug 15 12:43:28 server kernel: Hardware name: Supermicro Super Server/X11SPM-F, BIOS 2.1 06/15/2018
Aug 15 12:43:28 server kernel: RIP: e030:xen_mc_flush+0x1cd/0x1e0
Aug 15 12:43:28 server kernel: RSP: e02b:ffffc90045dbfc90 EFLAGS: 00010002
Aug 15 12:43:28 server kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8801150141d8
Aug 15 12:43:28 server kernel: RDX: 0000000000000001 RSI: 0000000000000002 RDI: 0000000080000001
Aug 15 12:43:28 server kernel: RBP: 0000000000000001 R08: ffffea000123ee80 R09: 0000000000000950
Aug 15 12:43:28 server kernel: R10: ffff8800062daff8 R11: 0000000000000000 R12: 0000000080000001
Aug 15 12:43:28 server kernel: R13: ffff880115014140 R14: ffff880115014150 R15: 0000000000000002
Aug 15 12:43:28 server kernel: FS:  00007fc772128ac0(0000) GS:ffff880115000000(0000) knlGS:0000000000000000
Aug 15 12:43:28 server kernel: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 12:43:28 server kernel: CR2: 0000000000d549f0 CR3: 000000004d762000 CR4: 0000000000042660
Aug 15 12:43:28 server kernel: Call Trace:
Aug 15 12:43:28 server kernel:  xen_alloc_pte+0x3b3/0x3c0
Aug 15 12:43:28 server kernel:  alloc_set_pte+0x326/0x500
Aug 15 12:43:28 server kernel:  filemap_map_pages+0x37b/0x3b0
Aug 15 12:43:28 server kernel:  __handle_mm_fault+0xf7d/0x1480
Aug 15 12:43:28 server kernel:  handle_mm_fault+0x10a/0x250
Aug 15 12:43:28 server kernel:  __do_page_fault+0x214/0x570
Aug 15 12:43:28 server kernel:  do_page_fault+0x32/0x130
Aug 15 12:43:28 server kernel:  ? page_fault+0x8/0x30
Aug 15 12:43:28 server kernel:  page_fault+0x1e/0x30
Aug 15 12:43:28 server kernel: RIP: e033:0xd549f0
Aug 15 12:43:28 server kernel: RSP: e02b:00007ffdc8058bb8 EFLAGS: 00010246
Aug 15 12:43:28 server kernel: RAX: 0000000000000000 RBX: 000000000000001a RCX: 00000000000000e0
Aug 15 12:43:28 server kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001dec220
Aug 15 12:43:28 server kernel: RBP: 0000000000000024 R08: 000000000268bec0 R09: 0000000000000000
Aug 15 12:43:28 server kernel: R10: 000000000268b010 R11: 0000000000000000 R12: 0000000001ca43d8
Aug 15 12:43:28 server kernel: R13: 000000000000002b R14: 00007ffdc8058ce8 R15: 00007ffdc8058e48
Aug 15 12:43:28 server kernel: Code: 81 e8 c8 ee 9e 00 0f 1f 00 49 89 45 18 48 c1 e8 3f 48 89 c5 e9 ed fe ff ff ff 14 25 80 64 02 82 f6 c4 02 0f 84 6c fe ff ff 0f 0b <0f> 0b e9 26 ff ff ff 0f 0b e8 da f3 fe ff eb 83 0f 0b 90 0f 1f
Aug 15 12:43:28 server kernel: ---[ end trace ff1c4f9a6f1cb2a0 ]---
Aug 15 12:43:28 server kernel: WARNING: CPU: 0 PID: 2649 at arch/x86/xen/multicalls.c:130 xen_mc_flush+0x1cd/0x1e0
Aug 15 12:43:28 server kernel: Modules linked in: dm_crypt algif_skcipher af_alg bonding intel_rapl skx_edac nfit intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc joydev mousedev input_leds led_class iTCO_wdt iTCO_vendor_support hid_generic ipmi_ssif aesni_intel aes_x86_64 crypto_simd cryptd glue_helper nls_iso8859_1 nls_cp437 vfat fat ofpart cmdlinepart intel_rapl_perf pcspkr ast i2c_algo_bit ttm drm_kms_helper i40e drm agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops intel_spi_pci intel_spi spi_nor mtd i2c_i801 lpc_ich usbhid hid shpchp mei_me mei ioatdma dca wmi ipmi_si ipmi_devintf rtc_cmos ipmi_msghandler acpi_power_meter evdev mac_hid xen_acpi_processor xen_pciback xen_netback xen_blkback xenfs xen_privcmd xen_gntalloc xen_gntdev xen_evtchn ip_tables x_tables ext4 crc32c_generic crc16
Aug 15 12:43:28 server kernel:  mbcache jbd2 fscrypto sd_mod ahci libahci crc32c_intel xhci_pci xhci_hcd usbcore libata usb_common scsi_mod dm_mod
Aug 15 12:43:28 server kernel: CPU: 0 PID: 2649 Comm: cc1 Tainted: G        W         4.17.14-arch1-1-ARCH #1
Aug 15 12:43:28 server kernel: Hardware name: Supermicro Super Server/X11SPM-F, BIOS 2.1 06/15/2018
Aug 15 12:43:28 server kernel: RIP: e030:xen_mc_flush+0x1cd/0x1e0
Aug 15 12:43:28 server kernel: RSP: e02b:ffffc90045dbfc90 EFLAGS: 00010002
Aug 15 12:43:28 server kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
Aug 15 12:43:28 server kernel: RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000080000002
Aug 15 12:43:28 server kernel: RBP: 0000000000000001 R08: 0000000000000000 R09: 0000000000000950
Aug 15 12:43:28 server kernel: R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000080000002
Aug 15 12:43:28 server kernel: R13: ffff880115014140 R14: 0000000000000202 R15: 0000000000000001
Aug 15 12:43:28 server kernel: FS:  00007fc772128ac0(0000) GS:ffff880115000000(0000) knlGS:0000000000000000
Aug 15 12:43:28 server kernel: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 12:43:28 server kernel: CR2: 0000000000d549f0 CR3: 000000004d762000 CR4: 0000000000042660
Aug 15 12:43:28 server kernel: Call Trace:
Aug 15 12:43:28 server kernel:  xen_set_pmd_hyper+0x16c/0x190
Aug 15 12:43:28 server kernel:  alloc_set_pte+0x34d/0x500
Aug 15 12:43:28 server kernel:  filemap_map_pages+0x37b/0x3b0
Aug 15 12:43:28 server kernel:  __handle_mm_fault+0xf7d/0x1480
Aug 15 12:43:28 server kernel:  handle_mm_fault+0x10a/0x250
Aug 15 12:43:28 server kernel:  __do_page_fault+0x214/0x570
Aug 15 12:43:28 server kernel:  do_page_fault+0x32/0x130
Aug 15 12:43:28 server kernel:  ? page_fault+0x8/0x30
Aug 15 12:43:28 server kernel:  page_fault+0x1e/0x30
Aug 15 12:43:28 server kernel: RIP: e033:0xd549f0
Aug 15 12:43:28 server kernel: RSP: e02b:00007ffdc8058bb8 EFLAGS: 00010246
Aug 15 12:43:28 server kernel: RAX: 0000000000000000 RBX: 000000000000001a RCX: 00000000000000e0
Aug 15 12:43:28 server kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000001dec220
Aug 15 12:43:28 server kernel: RBP: 0000000000000024 R08: 000000000268bec0 R09: 0000000000000000
Aug 15 12:43:28 server kernel: R10: 000000000268b010 R11: 0000000000000000 R12: 0000000001ca43d8
Aug 15 12:43:28 server kernel: R13: 000000000000002b R14: 00007ffdc8058ce8 R15: 00007ffdc8058e48
Aug 15 12:43:28 server kernel: Code: 81 e8 c8 ee 9e 00 0f 1f 00 49 89 45 18 48 c1 e8 3f 48 89 c5 e9 ed fe ff ff ff 14 25 80 64 02 82 f6 c4 02 0f 84 6c fe ff ff 0f 0b <0f> 0b e9 26 ff ff ff 0f 0b e8 da f3 fe ff eb 83 0f 0b 90 0f 1f
Aug 15 12:43:28 server kernel: ---[ end trace ff1c4f9a6f1cb2a1 ]---
Aug 15 12:43:28 server kernel: BUG: non-zero pgtables_bytes on freeing mm: 4096
Aug 15 12:43:28 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:43:28 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:43:28 server kernel: i40e 0000:b5:00.2: Too many MDD events on VF 11, disabled
Aug 15 12:43:28 server kernel: i40e 0000:b5:00.2: Use PF Control I/F to re-enable the VF
Aug 15 12:43:29 server kernel: i40e 0000:b5:00.2: TX driver issue detected, PF reset issued
Aug 15 12:43:29 server kernel: i40e 0000:b5:00.2: TX driver issue detected on VF 11
Aug 15 12:43:29 server kernel: i40e 0000:b5:00.2: Too many MDD events on VF 11, disabled
Aug 15 12:43:29 server kernel: i40e 0000:b5:00.2: Use PF Control I/F to re-enable the VF
...
Aug 15 12:43:39 server kernel: BUG: unable to handle kernel paging request at 0000001fb3ed20dc
Aug 15 12:43:39 server kernel: PGD 0 P4D 0
Aug 15 12:43:39 server kernel: Oops: 0002 [#1] PREEMPT SMP NOPTI
Aug 15 12:43:39 server kernel: Modules linked in: dm_crypt algif_skcipher af_alg bonding intel_rapl skx_edac nfit intel_powerclamp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc joydev mousedev input_leds led_class iTCO_wdt iTCO_vendor_support hid_generic ipmi_ssif aesni_intel aes_x86_64 crypto_simd cryptd glue_helper nls_iso8859_1 nls_cp437 vfat fat ofpart cmdlinepart intel_rapl_perf pcspkr ast i2c_algo_bit ttm drm_kms_helper i40e drm agpgart syscopyarea sysfillrect sysimgblt fb_sys_fops intel_spi_pci intel_spi spi_nor mtd i2c_i801 lpc_ich usbhid hid shpchp mei_me mei ioatdma dca wmi ipmi_si ipmi_devintf rtc_cmos ipmi_msghandler acpi_power_meter evdev mac_hid xen_acpi_processor xen_pciback xen_netback xen_blkback xenfs xen_privcmd xen_gntalloc xen_gntdev xen_evtchn ip_tables x_tables ext4 crc32c_generic crc16
Aug 15 12:43:39 server kernel:  mbcache jbd2 fscrypto sd_mod ahci libahci crc32c_intel xhci_pci xhci_hcd usbcore libata usb_common scsi_mod dm_mod
Aug 15 12:43:39 server kernel: CPU: 0 PID: 4 Comm: kworker/0:0 Tainted: G        W         4.17.14-arch1-1-ARCH #1
Aug 15 12:43:39 server kernel: Hardware name: Supermicro Super Server/X11SPM-F, BIOS 2.1 06/15/2018
Aug 15 12:43:39 server kernel: Workqueue: i40e i40e_service_task [i40e]
Aug 15 12:43:39 server kernel: RIP: e030:__page_frag_cache_drain+0x5/0x30
Aug 15 12:43:39 server kernel: RSP: e02b:ffffc900400e7d10 EFLAGS: 00010292
Aug 15 12:43:39 server kernel: RAX: 0000000000000000 RBX: ffff88004cb49ff8 RCX: ffff880067f86000
Aug 15 12:43:39 server kernel: RDX: 000077ff80000000 RSI: 0000000000000000 RDI: 0000001fb3ed20c0
Aug 15 12:43:39 server kernel: RBP: ffff88010b3d2140 R08: 0000000000000022 R09: 0000000000000058
Aug 15 12:43:39 server kernel: R10: ffffea000010fc20 R11: 0000000000000000 R12: 0000000000000155
Aug 15 12:43:39 server kernel: R13: 0000000000001000 R14: ffff88010b339f40 R15: ffff88010b5c1000
Aug 15 12:43:39 server kernel: FS:  0000000000000000(0000) GS:ffff880115000000(0000) knlGS:0000000000000000
Aug 15 12:43:39 server kernel: CS:  e033 DS: 0000 ES: 0000 CR0: 0000000080050033
Aug 15 12:43:39 server kernel: CR2: 0000001fb3ed20dc CR3: 0000000104b32000 CR4: 0000000000042660
Aug 15 12:43:39 server kernel: Call Trace:
Aug 15 12:43:39 server kernel:  i40e_clean_rx_ring+0xc5/0x1b0 [i40e]
Aug 15 12:43:39 server kernel:  i40e_down+0x16b/0x1b0 [i40e]
Aug 15 12:43:39 server kernel:  i40e_vsi_close+0x78/0x80 [i40e]
Aug 15 12:43:39 server kernel:  i40e_close+0x11/0x20 [i40e]
Aug 15 12:43:39 server kernel:  i40e_pf_quiesce_all_vsi.isra.48+0x34/0x50 [i40e]
Aug 15 12:43:39 server kernel:  i40e_prep_for_reset+0x117/0x130 [i40e]
Aug 15 12:43:39 server kernel:  i40e_do_reset+0xb0/0x200 [i40e]
Aug 15 12:43:39 server kernel:  i40e_service_task+0x908/0x1150 [i40e]
Aug 15 12:43:39 server kernel:  ? finish_task_switch+0x83/0x2e0
Aug 15 12:43:39 server kernel:  process_one_work+0x1d1/0x3b0
Aug 15 12:43:39 server kernel:  worker_thread+0x2b/0x3d0
Aug 15 12:43:39 server kernel:  ? process_one_work+0x3b0/0x3b0
Aug 15 12:43:39 server kernel:  kthread+0x112/0x130
Aug 15 12:43:39 server kernel:  ? kthread_flush_work_fn+0x10/0x10
Aug 15 12:43:39 server kernel:  ret_from_fork+0x35/0x40
Aug 15 12:43:39 server kernel: Code: 39 ef 73 1e 48 89 fb 48 85 db 74 0a 31 f6 48 89 df e8 70 fe ff ff 48 81 c3 00 10 00 00 48 39 dd 77 e5 5b 5d c3 90 0f 1f 44 00 00 <f0> 29 77 1c 75 15 48 8b 07 f6 c4 80 74 08 0f b6 77 69 85 f6 75
Aug 15 12:43:39 server kernel: RIP: __page_frag_cache_drain+0x5/0x30 RSP: ffffc900400e7d10
Aug 15 12:43:39 server kernel: CR2: 0000001fb3ed20dc
Aug 15 12:43:39 server kernel: ---[ end trace ff1c4f9a6f1cb2a2 ]---
Aug 15 12:44:03 server systemd[1]: Started Session c4 of user root.
Aug 15 12:44:39 server systemd-timesyncd[675]: Timed out waiting for reply from 176.9.144.121:123 (3.arch.pool.ntp.org).
Aug 15 12:44:49 server systemd-timesyncd[675]: Timed out waiting for reply from 146.0.32.144:123 (3.arch.pool.ntp.org).
Aug 15 12:45:00 server systemd-timesyncd[675]: Timed out waiting for reply from 138.201.20.231:123 (3.arch.pool.ntp.org).
Aug 15 12:45:10 server systemd-timesyncd[675]: Timed out waiting for reply from 94.16.116.137:123 (3.arch.pool.ntp.org).
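
The "Oops:" codes in the traces above (0000 and 0002) are x86 page-fault
error codes, and the last trace can be checked the same way as the first.
Its marked bytes, <f0> 29 77 1c, decode to lock sub %esi,0x1c(%rdi); with
RDI = 0x1fb3ed20c0 that write lands on CR2 = 0x1fb3ed20dc, which suggests
(but does not prove) that __page_frag_cache_drain() was handed a bogus
page pointer. A minimal sketch of both checks follows; struct pf_error,
decode_pf_error() and effective_addr() are names made up for
illustration and only the five low architectural bits are decoded:

```c
#include <stdbool.h>
#include <stdint.h>

/* Low bits of the x86 page-fault error code as printed after "Oops:". */
struct pf_error {
	bool present;  /* bit 0: 0 = page not present, 1 = protection fault */
	bool write;    /* bit 1: 0 = read access, 1 = write access */
	bool user;     /* bit 2: 0 = kernel mode, 1 = user mode */
	bool rsvd;     /* bit 3: reserved bit set in a paging entry */
	bool instr;    /* bit 4: fault on instruction fetch */
};

static struct pf_error decode_pf_error(uint32_t code)
{
	struct pf_error e = {
		.present = (code & 0x01) != 0,
		.write   = (code & 0x02) != 0,
		.user    = (code & 0x04) != 0,
		.rsvd    = (code & 0x08) != 0,
		.instr   = (code & 0x10) != 0,
	};
	return e;
}

/* Effective address of a base + displacement memory operand. */
static uint64_t effective_addr(uint64_t base, int32_t disp)
{
	return base + (uint64_t)(int64_t)disp;
}
```

Decoded this way, 0000 is a kernel-mode read of a non-present page (the
NULL dereference) and 0002 is a kernel-mode write to a non-present page,
and effective_addr(0x1fb3ed20c0, 0x1c) reproduces the CR2 value of the
last Oops.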

This can be easily reproduced on my system in all cases when running 2
VMs simultaneously.

What I've done so far:

1. I've tried 4.18.0, and it is even worse. With this kernel the system
immediately reboots when assigning MACs to the VFs, sometimes after the
1st, sometimes after the 2nd, sometimes after the 20th. No errors are
shown; the system just resets.

2. I've tried the 4.14.62 LTS version. VFs do not work at all because of:
Unable to enable 24 VFs. Limited to 0 VFs due to device resource constraints.

3. I've tried i40e version 2.4.10 from https://sourceforge.net/projects/e1000/files/i40e%20stable/2.4.10/
I've tried it with 4.17.14 and 4.14.62 LTS; both lead to kernel freezes
and reboots without any output on the local display.

As an intermediate solution I've reverted the configuration to use
bridges and put physical NICs into the system for those VMs which require
VLANs and PPPoE support.

Also, the same configuration (same SSD) works perfectly with VFs using a
NIC under the ixgb driver.

Any help is very much appreciated; I can test kernel patches on this
machine.

--Maik

* Re: [PATCH mlx5-next] RDMA/mlx5: Don't use cached IRQ affinity mask
From: Leon Romanovsky @ 2018-08-15  6:37 UTC (permalink / raw)
  To: Steve Wise
  Cc: Max Gurtovoy, Sagi Grimberg, Jason Gunthorpe,
	'Doug Ledford', 'RDMA mailing list',
	'Saeed Mahameed', 'linux-netdev'
In-Reply-To: <47178d4d-f730-6e59-5c19-58331cc3864a@opengridcomputing.com>

On Mon, Aug 06, 2018 at 02:20:37PM -0500, Steve Wise wrote:
>
>
> On 8/1/2018 9:27 AM, Max Gurtovoy wrote:
> >
> >
> > On 8/1/2018 8:12 AM, Sagi Grimberg wrote:
> >> Hi Max,
> >
> > Hi,
> >
> >>
> >>> Yes, since nvmf is the only user of this function.
> >>> Still waiting for comments on the suggested patch :)
> >>>
> >>
> >> Sorry for the late response (but I'm on vacation so I have
> >> an excuse ;))
> >
> > NP :) currently the code works..
> >
> >>
> >> I'm thinking that we should avoid trying to find an assignment
> >> when stuff like irqbalance daemon is running and changing
> >> the affinitization.
> >
> > but this is exactly what Steve complained and Leon try to fix (and
> > break the connection establishment).
> > If this is the case and we all agree then we're good without Leon's
> > patch and without our suggestions.
> >
>
> I don't agree.  Currently setting certain affinity mappings breaks nvme
> connectivity.  I don't think that is desirable.  And mlx5 is broken in
> that it doesn't allow changing the affinity but silently ignores the
> change, which misleads the admin or irqbalance...

Exactly, I completely agree with Steve and don't understand the
rationale in the comments above. To summarize from my side:
NVMeOF is broken, but we are not going to fix it, and instead we
prohibit one specific driver from changing affinity on the fly.

Nice.

Thanks

* Re: [PATCH] net: macb: Fix regression breaking non-MDIO fixed-link PHYs
From: Lad, Prabhakar @ 2018-08-15 13:59 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Uwe Kleine-König, a.fatoum, David S. Miller, Nicolas Ferre,
	netdev, mdf, stable, Sascha Hauer, brad.mouring, Florian Fainelli
In-Reply-To: <20180815023233.GD11610@lunn.ch>

Hi,

On Wed, Aug 15, 2018 at 3:35 AM Andrew Lunn <andrew@lunn.ch> wrote:
>
> On Tue, Aug 14, 2018 at 05:58:12PM +0200, Uwe Kleine-König wrote:
> > Hello Ahmad,
> >
> >
> > On Tue, Aug 14, 2018 at 04:12:40PM +0200, Ahmad Fatoum wrote:
> > > The referenced commit broke initializing macb on the EVB-KSZ9477 eval board.
> > > There, of_mdiobus_register was called even for the fixed-link representing
> > > the SPI-connected switch PHY, with the result that the driver attempts to
> > > enumerate PHYs on a non-existent MDIO bus:
> > >
I ran into a similar problem on v4.14 with davinci_mdio and had to patch
it with [1].
The cpsw has two PHYs: one is connected to a KSZ9031 and the other to a
KSZ9897 Ethernet switch, which is treated as a fixed PHY with no MDIO
lines, so mdio_read/write failed.
This didn't happen in v4.9.x. Has something in the core changed?

[1]

diff --git a/drivers/net/ethernet/ti/davinci_mdio.c b/drivers/net/ethernet/ti/davinci_mdio.c
index 3e84107..197baa6 100644
--- a/drivers/net/ethernet/ti/davinci_mdio.c
+++ b/drivers/net/ethernet/ti/davinci_mdio.c
@@ -245,6 +245,13 @@ static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int phy_reg)
        u32 reg;
        int ret;

+       if (phy_id == 2)
+               return 0;
+
        if (phy_reg & ~PHY_REG_MASK || phy_id & ~PHY_ID_MASK)
                return -EINVAL;

@@ -289,6 +296,13 @@ static int davinci_mdio_write(struct mii_bus *bus, int phy_id,
        u32 reg;
        int ret;

+       if (phy_id == 2)
+               return 0;
+
        if (phy_reg & ~PHY_REG_MASK || phy_id & ~PHY_ID_MASK)
                return -EINVAL;


Cheers,
--Prabhakar Lad
