netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net-next 0/3] ip: Use rb trees for IP frag queue.
@ 2018-08-02 22:45 Peter Oskolkov
  2018-08-02 22:45 ` [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments Peter Oskolkov
                   ` (3 more replies)
  0 siblings, 4 replies; 8+ messages in thread
From: Peter Oskolkov @ 2018-08-02 22:45 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Florian Westphal, Peter Oskolkov

This patchset
 * changes IPv4 defrag behavior to match that of IPv6: overlapping
   fragments now cause the whole IP datagram to be discarded (suggested
   by David Miller): there are no legitimate use cases for overlapping
   fragments;
 * changes IPv4 defrag queue from a list to a rb tree (suggested
   by Eric Dumazet): this change removes a potential attach vector.

Upcoming patches will contain similar changes for IPv6 frag queue,
as well as a comprehensive IP defrag self-test (temporarily delayed).

Peter Oskolkov (3):
  ip: discard IPv4 datagrams with overlapping segments.
  net: modify skb_rbtree_purge to return the truesize of all purged
    skbs.
  ip: use rb trees for IP frag queue.

 include/linux/skbuff.h                  |  11 +-
 include/net/inet_frag.h                 |   3 +-
 include/uapi/linux/snmp.h               |   1 +
 net/core/skbuff.c                       |   6 +-
 net/ipv4/inet_fragment.c                |  16 +-
 net/ipv4/ip_fragment.c                  | 239 +++++++++++-------------
 net/ipv4/proc.c                         |   1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c                   |   1 +
 9 files changed, 139 insertions(+), 140 deletions(-)

-- 
2.18.0.597.ga71716f1ad-goog

^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.
  2018-08-02 22:45 [PATCH net-next 0/3] ip: Use rb trees for IP frag queue Peter Oskolkov
@ 2018-08-02 22:45 ` Peter Oskolkov
  2018-08-02 22:54   ` Stephen Hemminger
  2018-08-02 22:45 ` [PATCH net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs Peter Oskolkov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 8+ messages in thread
From: Peter Oskolkov @ 2018-08-02 22:45 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Florian Westphal, Peter Oskolkov

This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
---
 include/uapi/linux/snmp.h |  1 +
 net/ipv4/ip_fragment.c    | 75 ++++++++++-----------------------------
 net/ipv4/proc.c           |  1 +
 3 files changed, 21 insertions(+), 56 deletions(-)

diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
index e5ebc83827ab..da1a144f1a51 100644
--- a/include/uapi/linux/snmp.h
+++ b/include/uapi/linux/snmp.h
@@ -40,6 +40,7 @@ enum
 	IPSTATS_MIB_REASMREQDS,			/* ReasmReqds */
 	IPSTATS_MIB_REASMOKS,			/* ReasmOKs */
 	IPSTATS_MIB_REASMFAILS,			/* ReasmFails */
+	IPSTATS_MIB_REASM_OVERLAPS,		/* ReasmOverlaps */
 	IPSTATS_MIB_FRAGOKS,			/* FragOKs */
 	IPSTATS_MIB_FRAGFAILS,			/* FragFails */
 	IPSTATS_MIB_FRAGCREATES,		/* FragCreates */
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d14d741fb05e..960bf5eab59f 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -277,6 +277,7 @@ static int ip_frag_reinit(struct ipq *qp)
 /* Add new segment to existing queue. */
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
+	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct sk_buff *prev, *next;
 	struct net_device *dev;
 	unsigned int fragsize;
@@ -357,65 +358,23 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	}
 
 found:
-	/* We found where to put this one.  Check for overlap with
-	 * preceding fragment, and, if needed, align things so that
-	 * any overlaps are eliminated.
+	/* RFC5722, Section 4, amended by Errata ID : 3089
+	 *                          When reassembling an IPv6 datagram, if
+	 *   one or more its constituent fragments is determined to be an
+	 *   overlapping fragment, the entire datagram (and any constituent
+	 *   fragments) MUST be silently discarded.
+	 *
+	 * We do the same here for IPv4.
 	 */
-	if (prev) {
-		int i = (prev->ip_defrag_offset + prev->len) - offset;
 
-		if (i > 0) {
-			offset += i;
-			err = -EINVAL;
-			if (end <= offset)
-				goto err;
-			err = -ENOMEM;
-			if (!pskb_pull(skb, i))
-				goto err;
-			if (skb->ip_summed != CHECKSUM_UNNECESSARY)
-				skb->ip_summed = CHECKSUM_NONE;
-		}
-	}
+	/* Is there an overlap with the previous fragment? */
+	if (prev &&
+	    (prev->ip_defrag_offset + prev->len) > offset)
+		goto discard_qp;
 
-	err = -ENOMEM;
-
-	while (next && next->ip_defrag_offset < end) {
-		int i = end - next->ip_defrag_offset; /* overlap is 'i' bytes */
-
-		if (i < next->len) {
-			int delta = -next->truesize;
-
-			/* Eat head of the next overlapped fragment
-			 * and leave the loop. The next ones cannot overlap.
-			 */
-			if (!pskb_pull(next, i))
-				goto err;
-			delta += next->truesize;
-			if (delta)
-				add_frag_mem_limit(qp->q.net, delta);
-			next->ip_defrag_offset += i;
-			qp->q.meat -= i;
-			if (next->ip_summed != CHECKSUM_UNNECESSARY)
-				next->ip_summed = CHECKSUM_NONE;
-			break;
-		} else {
-			struct sk_buff *free_it = next;
-
-			/* Old fragment is completely overridden with
-			 * new one drop it.
-			 */
-			next = next->next;
-
-			if (prev)
-				prev->next = next;
-			else
-				qp->q.fragments = next;
-
-			qp->q.meat -= free_it->len;
-			sub_frag_mem_limit(qp->q.net, free_it->truesize);
-			kfree_skb(free_it);
-		}
-	}
+	/* Is there an overlap with the next fragment? */
+	if (next && next->ip_defrag_offset < end)
+		goto discard_qp;
 
 	/* Note : skb->ip_defrag_offset and skb->dev share the same location */
 	dev = skb->dev;
@@ -463,6 +422,10 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	skb_dst_drop(skb);
 	return -EINPROGRESS;
 
+discard_qp:
+	inet_frag_kill(&qp->q);
+	err = -EINVAL;
+	__IP_INC_STATS(net, IPSTATS_MIB_REASM_OVERLAPS);
 err:
 	kfree_skb(skb);
 	return err;
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index b46e4cf9a55a..70289682a670 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -119,6 +119,7 @@ static const struct snmp_mib snmp4_ipextstats_list[] = {
 	SNMP_MIB_ITEM("InECT1Pkts", IPSTATS_MIB_ECT1PKTS),
 	SNMP_MIB_ITEM("InECT0Pkts", IPSTATS_MIB_ECT0PKTS),
 	SNMP_MIB_ITEM("InCEPkts", IPSTATS_MIB_CEPKTS),
+	SNMP_MIB_ITEM("ReasmOverlaps", IPSTATS_MIB_REASM_OVERLAPS),
 	SNMP_MIB_SENTINEL
 };
 
-- 
2.18.0.597.ga71716f1ad-goog

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs.
  2018-08-02 22:45 [PATCH net-next 0/3] ip: Use rb trees for IP frag queue Peter Oskolkov
  2018-08-02 22:45 ` [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments Peter Oskolkov
@ 2018-08-02 22:45 ` Peter Oskolkov
  2018-08-02 22:46 ` [PATCH net-next 3/3] ip: use rb trees for IP frag queue Peter Oskolkov
  2018-08-06  0:17 ` [PATCH net-next 0/3] ip: Use " David Miller
  3 siblings, 0 replies; 8+ messages in thread
From: Peter Oskolkov @ 2018-08-02 22:45 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Florian Westphal, Peter Oskolkov

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h | 2 +-
 net/core/skbuff.c      | 6 +++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index fd3cb1b247df..47848367c816 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2585,7 +2585,7 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
-void skb_rbtree_purge(struct rb_root *root);
+unsigned int skb_rbtree_purge(struct rb_root *root);
 
 void *netdev_alloc_frag(unsigned int fragsz);
 
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 51b0a9126e12..8d574a88125d 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2858,23 +2858,27 @@ EXPORT_SYMBOL(skb_queue_purge);
 /**
  *	skb_rbtree_purge - empty a skb rbtree
  *	@root: root of the rbtree to empty
+ *	Return value: the sum of truesizes of all purged skbs.
  *
  *	Delete all buffers on an &sk_buff rbtree. Each buffer is removed from
  *	the list and one reference dropped. This function does not take
  *	any lock. Synchronization should be handled by the caller (e.g., TCP
  *	out-of-order queue is protected by the socket lock).
  */
-void skb_rbtree_purge(struct rb_root *root)
+unsigned int skb_rbtree_purge(struct rb_root *root)
 {
 	struct rb_node *p = rb_first(root);
+	unsigned int sum = 0;
 
 	while (p) {
 		struct sk_buff *skb = rb_entry(p, struct sk_buff, rbnode);
 
 		p = rb_next(p);
 		rb_erase(&skb->rbnode, root);
+		sum += skb->truesize;
 		kfree_skb(skb);
 	}
+	return sum;
 }
 
 /**
-- 
2.18.0.597.ga71716f1ad-goog

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* [PATCH net-next 3/3] ip: use rb trees for IP frag queue.
  2018-08-02 22:45 [PATCH net-next 0/3] ip: Use rb trees for IP frag queue Peter Oskolkov
  2018-08-02 22:45 ` [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments Peter Oskolkov
  2018-08-02 22:45 ` [PATCH net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs Peter Oskolkov
@ 2018-08-02 22:46 ` Peter Oskolkov
  2018-08-06  0:17 ` [PATCH net-next 0/3] ip: Use " David Miller
  3 siblings, 0 replies; 8+ messages in thread
From: Peter Oskolkov @ 2018-08-02 22:46 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Florian Westphal, Peter Oskolkov

Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
    netstat --statistics
    Ip:
        282078937 total packets received
        0 forwarded
        0 incoming packets discarded
        946760 incoming packets delivered
        18743456 requests sent out
        101 fragments dropped after timeout
        282077129 reassemblies required
        944952 packets reassembled ok
        262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
    reassemblies vs a kernel without this patchset. More
    comprehensive performance testing TBD).

Reported-by: Jann Horn <jannh@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
---
 include/linux/skbuff.h                  |   9 +-
 include/net/inet_frag.h                 |   3 +-
 net/ipv4/inet_fragment.c                |  16 ++-
 net/ipv4/ip_fragment.c                  | 182 +++++++++++++-----------
 net/ipv6/netfilter/nf_conntrack_reasm.c |   1 +
 net/ipv6/reassembly.c                   |   1 +
 6 files changed, 121 insertions(+), 91 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 47848367c816..7ebdf158a795 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -676,13 +676,16 @@ struct sk_buff {
 				 * UDP receive path is one user.
 				 */
 				unsigned long		dev_scratch;
-				int			ip_defrag_offset;
 			};
 		};
-		struct rb_node		rbnode; /* used in netem & tcp stack */
+		struct rb_node		rbnode; /* used in netem, ip4 defrag, and tcp stack */
 		struct list_head	list;
 	};
-	struct sock		*sk;
+
+	union {
+		struct sock		*sk;
+		int			ip_defrag_offset;
+	};
 
 	union {
 		ktime_t		tstamp;
diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index f4272a29dc44..b86d14528188 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -75,7 +75,8 @@ struct inet_frag_queue {
 	struct timer_list	timer;
 	spinlock_t		lock;
 	refcount_t		refcnt;
-	struct sk_buff		*fragments;
+	struct sk_buff		*fragments;  /* Used in IPv6. */
+	struct rb_root		rb_fragments; /* Used in IPv4. */
 	struct sk_buff		*fragments_tail;
 	ktime_t			stamp;
 	int			len;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index ccd140e4082d..6d258a5669e7 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -137,12 +137,16 @@ void inet_frag_destroy(struct inet_frag_queue *q)
 	fp = q->fragments;
 	nf = q->net;
 	f = nf->f;
-	while (fp) {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
+	if (fp) {
+		do {
+			struct sk_buff *xp = fp->next;
+
+			sum_truesize += fp->truesize;
+			kfree_skb(fp);
+			fp = xp;
+		} while (fp);
+	} else {
+		sum_truesize = skb_rbtree_purge(&q->rb_fragments);
 	}
 	sum = sum_truesize + f->qsize;
 
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 960bf5eab59f..ffbf9135fd71 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -136,7 +136,7 @@ static void ip_expire(struct timer_list *t)
 {
 	struct inet_frag_queue *frag = from_timer(frag, t, timer);
 	const struct iphdr *iph;
-	struct sk_buff *head;
+	struct sk_buff *head = NULL;
 	struct net *net;
 	struct ipq *qp;
 	int err;
@@ -152,14 +152,31 @@ static void ip_expire(struct timer_list *t)
 
 	ipq_kill(qp);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
-
-	head = qp->q.fragments;
-
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-	if (!(qp->q.flags & INET_FRAG_FIRST_IN) || !head)
+	if (!qp->q.flags & INET_FRAG_FIRST_IN)
 		goto out;
 
+	/* sk_buff::dev and sk_buff::rbnode are unionized. So we
+	 * pull the head out of the tree in order to be able to
+	 * deal with head->dev.
+	 */
+	if (qp->q.fragments) {
+		head = qp->q.fragments;
+		qp->q.fragments = head->next;
+	} else {
+		head = skb_rb_first(&qp->q.rb_fragments);
+		if (!head)
+			goto out;
+		rb_erase(&head->rbnode, &qp->q.rb_fragments);
+		memset(&head->rbnode, 0, sizeof(head->rbnode));
+		barrier();
+	}
+	if (head == qp->q.fragments_tail)
+		qp->q.fragments_tail = NULL;
+
+	sub_frag_mem_limit(qp->q.net, head->truesize);
+
 	head->dev = dev_get_by_index_rcu(net, qp->iif);
 	if (!head->dev)
 		goto out;
@@ -179,16 +196,16 @@ static void ip_expire(struct timer_list *t)
 	    (skb_rtable(head)->rt_type != RTN_LOCAL))
 		goto out;
 
-	skb_get(head);
 	spin_unlock(&qp->q.lock);
 	icmp_send(head, ICMP_TIME_EXCEEDED, ICMP_EXC_FRAGTIME, 0);
-	kfree_skb(head);
 	goto out_rcu_unlock;
 
 out:
 	spin_unlock(&qp->q.lock);
 out_rcu_unlock:
 	rcu_read_unlock();
+	if (head)
+		kfree_skb(head);
 	ipq_put(qp);
 }
 
@@ -231,7 +248,7 @@ static int ip_frag_too_far(struct ipq *qp)
 	end = atomic_inc_return(&peer->rid);
 	qp->rid = end;
 
-	rc = qp->q.fragments && (end - start) > max;
+	rc = qp->q.fragments_tail && (end - start) > max;
 
 	if (rc) {
 		struct net *net;
@@ -245,7 +262,6 @@ static int ip_frag_too_far(struct ipq *qp)
 
 static int ip_frag_reinit(struct ipq *qp)
 {
-	struct sk_buff *fp;
 	unsigned int sum_truesize = 0;
 
 	if (!mod_timer(&qp->q.timer, jiffies + qp->q.net->timeout)) {
@@ -253,20 +269,14 @@ static int ip_frag_reinit(struct ipq *qp)
 		return -ETIMEDOUT;
 	}
 
-	fp = qp->q.fragments;
-	do {
-		struct sk_buff *xp = fp->next;
-
-		sum_truesize += fp->truesize;
-		kfree_skb(fp);
-		fp = xp;
-	} while (fp);
+	sum_truesize = skb_rbtree_purge(&qp->q.rb_fragments);
 	sub_frag_mem_limit(qp->q.net, sum_truesize);
 
 	qp->q.flags = 0;
 	qp->q.len = 0;
 	qp->q.meat = 0;
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	qp->iif = 0;
 	qp->ecn = 0;
@@ -278,7 +288,8 @@ static int ip_frag_reinit(struct ipq *qp)
 static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
-	struct sk_buff *prev, *next;
+	struct rb_node **rbn, *parent;
+	struct sk_buff *skb1;
 	struct net_device *dev;
 	unsigned int fragsize;
 	int flags, offset;
@@ -341,58 +352,58 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	if (err)
 		goto err;
 
-	/* Find out which fragments are in front and at the back of us
-	 * in the chain of fragments so far.  We must know where to put
-	 * this fragment, right?
-	 */
-	prev = qp->q.fragments_tail;
-	if (!prev || prev->ip_defrag_offset < offset) {
-		next = NULL;
-		goto found;
-	}
-	prev = NULL;
-	for (next = qp->q.fragments; next != NULL; next = next->next) {
-		if (next->ip_defrag_offset >= offset)
-			break;	/* bingo! */
-		prev = next;
-	}
+	/* Note : skb->rbnode and skb->dev share the same location. */
+	dev = skb->dev;
+	/* Makes sure compiler wont do silly aliasing games */
+	barrier();
 
-found:
 	/* RFC5722, Section 4, amended by Errata ID : 3089
 	 *                          When reassembling an IPv6 datagram, if
 	 *   one or more its constituent fragments is determined to be an
 	 *   overlapping fragment, the entire datagram (and any constituent
 	 *   fragments) MUST be silently discarded.
 	 *
-	 * We do the same here for IPv4.
+	 * We do the same here for IPv4 (and increment an snmp counter).
 	 */
 
-	/* Is there an overlap with the previous fragment? */
-	if (prev &&
-	    (prev->ip_defrag_offset + prev->len) > offset)
-		goto discard_qp;
-
-	/* Is there an overlap with the next fragment? */
-	if (next && next->ip_defrag_offset < end)
-		goto discard_qp;
+	/* Find out where to put this fragment.  */
+	skb1 = qp->q.fragments_tail;
+	if (!skb1) {
+		/* This is the first fragment we've received. */
+		rb_link_node(&skb->rbnode, NULL, &qp->q.rb_fragments.rb_node);
+		qp->q.fragments_tail = skb;
+	} else if ((skb1->ip_defrag_offset + skb1->len) < end) {
+		/* This is the common/special case: skb goes to the end. */
+		/* Detect and discard overlaps. */
+		if (offset < (skb1->ip_defrag_offset + skb1->len))
+			goto discard_qp;
+		/* Insert after skb1. */
+		rb_link_node(&skb->rbnode, &skb1->rbnode, &skb1->rbnode.rb_right);
+		qp->q.fragments_tail = skb;
+	} else {
+		/* Binary search. Note that skb can become the first fragment, but
+		 * not the last (covered above). */
+		rbn = &qp->q.rb_fragments.rb_node;
+		do {
+			parent = *rbn;
+			skb1 = rb_to_skb(parent);
+			if (end <= skb1->ip_defrag_offset)
+				rbn = &parent->rb_left;
+			else if (offset >= skb1->ip_defrag_offset + skb1->len)
+				rbn = &parent->rb_right;
+			else /* Found an overlap with skb1. */
+				goto discard_qp;
+		} while (*rbn);
+		/* Here we have parent properly set, and rbn pointing to
+		 * one of its NULL left/right children. Insert skb. */
+		rb_link_node(&skb->rbnode, parent, rbn);
+	}
+	rb_insert_color(&skb->rbnode, &qp->q.rb_fragments);
 
-	/* Note : skb->ip_defrag_offset and skb->dev share the same location */
-	dev = skb->dev;
 	if (dev)
 		qp->iif = dev->ifindex;
-	/* Makes sure compiler wont do silly aliasing games */
-	barrier();
 	skb->ip_defrag_offset = offset;
 
-	/* Insert this fragment in the chain of fragments. */
-	skb->next = next;
-	if (!next)
-		qp->q.fragments_tail = skb;
-	if (prev)
-		prev->next = skb;
-	else
-		qp->q.fragments = skb;
-
 	qp->q.stamp = skb->tstamp;
 	qp->q.meat += skb->len;
 	qp->ecn |= ecn;
@@ -414,7 +425,7 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 		unsigned long orefdst = skb->_skb_refdst;
 
 		skb->_skb_refdst = 0UL;
-		err = ip_frag_reasm(qp, prev, dev);
+		err = ip_frag_reasm(qp, skb, dev);
 		skb->_skb_refdst = orefdst;
 		return err;
 	}
@@ -431,15 +442,15 @@ static int ip_frag_queue(struct ipq *qp, struct sk_buff *skb)
 	return err;
 }
 
-
 /* Build a new IP datagram from all its fragments. */
-
-static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
+static int ip_frag_reasm(struct ipq *qp, struct sk_buff *skb,
 			 struct net_device *dev)
 {
 	struct net *net = container_of(qp->q.net, struct net, ipv4.frags);
 	struct iphdr *iph;
-	struct sk_buff *fp, *head = qp->q.fragments;
+	struct sk_buff *fp, *head = skb_rb_first(&qp->q.rb_fragments);
+	struct sk_buff **nextp; /* To build frag_list. */
+	struct rb_node *rbn;
 	int len;
 	int ihlen;
 	int err;
@@ -453,25 +464,20 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		goto out_fail;
 	}
 	/* Make the one we just received the head. */
-	if (prev) {
-		head = prev->next;
-		fp = skb_clone(head, GFP_ATOMIC);
+	if (head != skb) {
+		fp = skb_clone(skb, GFP_ATOMIC);
 		if (!fp)
 			goto out_nomem;
-
-		fp->next = head->next;
-		if (!fp->next)
+		rb_replace_node(&skb->rbnode, &fp->rbnode, &qp->q.rb_fragments);
+		if (qp->q.fragments_tail == skb)
 			qp->q.fragments_tail = fp;
-		prev->next = fp;
-
-		skb_morph(head, qp->q.fragments);
-		head->next = qp->q.fragments->next;
-
-		consume_skb(qp->q.fragments);
-		qp->q.fragments = head;
+		skb_morph(skb, head);
+		rb_replace_node(&head->rbnode, &skb->rbnode,
+				&qp->q.rb_fragments);
+		consume_skb(head);
+		head = skb;
 	}
 
-	WARN_ON(!head);
 	WARN_ON(head->ip_defrag_offset != 0);
 
 	/* Allocate a new buffer for the datagram. */
@@ -496,24 +502,35 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 		clone = alloc_skb(0, GFP_ATOMIC);
 		if (!clone)
 			goto out_nomem;
-		clone->next = head->next;
-		head->next = clone;
 		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
 		skb_frag_list_init(head);
 		for (i = 0; i < skb_shinfo(head)->nr_frags; i++)
 			plen += skb_frag_size(&skb_shinfo(head)->frags[i]);
 		clone->len = clone->data_len = head->data_len - plen;
-		head->data_len -= clone->len;
-		head->len -= clone->len;
+		skb->truesize += clone->truesize;
 		clone->csum = 0;
 		clone->ip_summed = head->ip_summed;
 		add_frag_mem_limit(qp->q.net, clone->truesize);
+		skb_shinfo(head)->frag_list = clone;
+		nextp = &clone->next;
+	} else {
+		nextp = &skb_shinfo(head)->frag_list;
 	}
 
-	skb_shinfo(head)->frag_list = head->next;
 	skb_push(head, head->data - skb_network_header(head));
 
-	for (fp=head->next; fp; fp = fp->next) {
+	/* Traverse the tree in order, to build frag_list. */
+	rbn = rb_next(&head->rbnode);
+	rb_erase(&head->rbnode, &qp->q.rb_fragments);
+	while (rbn) {
+		struct rb_node *rbnext = rb_next(rbn);
+		fp = rb_to_skb(rbn);
+		rb_erase(rbn, &qp->q.rb_fragments);
+		rbn = rbnext;
+		*nextp = fp;
+		nextp = &fp->next;
+		fp->prev = NULL;
+		memset(&fp->rbnode, 0, sizeof(fp->rbnode));
 		head->data_len += fp->len;
 		head->len += fp->len;
 		if (head->ip_summed != fp->ip_summed)
@@ -524,7 +541,9 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 	}
 	sub_frag_mem_limit(qp->q.net, head->truesize);
 
+	*nextp = NULL;
 	head->next = NULL;
+	head->prev = NULL;
 	head->dev = dev;
 	head->tstamp = qp->q.stamp;
 	IPCB(head)->frag_max_size = max(qp->max_df_size, qp->q.max_size);
@@ -552,6 +571,7 @@ static int ip_frag_reasm(struct ipq *qp, struct sk_buff *prev,
 
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMOKS);
 	qp->q.fragments = NULL;
+	qp->q.rb_fragments = RB_ROOT;
 	qp->q.fragments_tail = NULL;
 	return 0;
 
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 0610bdab721c..38d69ef516d5 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -463,6 +463,7 @@ nf_ct_frag6_reasm(struct frag_queue *fq, struct sk_buff *prev,  struct net_devic
 					  head->csum);
 
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 
 	return true;
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 6edd2ac8ae4b..b4e558ab39fa 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -405,6 +405,7 @@ static int ip6_frag_reasm(struct frag_queue *fq, struct sk_buff *prev,
 	__IP6_INC_STATS(net, __in6_dev_get(dev), IPSTATS_MIB_REASMOKS);
 	rcu_read_unlock();
 	fq->q.fragments = NULL;
+	fq->q.rb_fragments = RB_ROOT;
 	fq->q.fragments_tail = NULL;
 	return 1;
 
-- 
2.18.0.597.ga71716f1ad-goog

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.
  2018-08-02 22:45 ` [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments Peter Oskolkov
@ 2018-08-02 22:54   ` Stephen Hemminger
  2018-08-02 23:33     ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Stephen Hemminger @ 2018-08-02 22:54 UTC (permalink / raw)
  To: Peter Oskolkov; +Cc: David Miller, netdev, Eric Dumazet, Florian Westphal

On Thu,  2 Aug 2018 22:45:58 +0000
Peter Oskolkov <posk@google.com> wrote:

> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> index e5ebc83827ab..da1a144f1a51 100644
> --- a/include/uapi/linux/snmp.h
> +++ b/include/uapi/linux/snmp.h
> @@ -40,6 +40,7 @@ enum
>  	IPSTATS_MIB_REASMREQDS,			/* ReasmReqds */
>  	IPSTATS_MIB_REASMOKS,			/* ReasmOKs */
>  	IPSTATS_MIB_REASMFAILS,			/* ReasmFails */
> +	IPSTATS_MIB_REASM_OVERLAPS,		/* ReasmOverlaps */
>  	IPSTATS_MIB_FRAGOKS,			/* FragOKs */
>  	IPSTATS_MIB_FRAGFAILS,			/* FragFails */
>  	IPSTATS_MIB_FRAGCREATES,		/* FragCreates */

Inserting new entries in the middle of an enum means the numeric
values will change. Isn't this going to break userspace ABI?

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.
  2018-08-02 22:54   ` Stephen Hemminger
@ 2018-08-02 23:33     ` Eric Dumazet
  2018-08-02 23:56       ` Stephen Hemminger
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2018-08-02 23:33 UTC (permalink / raw)
  To: Stephen Hemminger, Peter Oskolkov
  Cc: David Miller, netdev, Eric Dumazet, Florian Westphal



On 08/02/2018 03:54 PM, Stephen Hemminger wrote:
> On Thu,  2 Aug 2018 22:45:58 +0000
> Peter Oskolkov <posk@google.com> wrote:
> 
>> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
>> index e5ebc83827ab..da1a144f1a51 100644
>> --- a/include/uapi/linux/snmp.h
>> +++ b/include/uapi/linux/snmp.h
>> @@ -40,6 +40,7 @@ enum
>>  	IPSTATS_MIB_REASMREQDS,			/* ReasmReqds */
>>  	IPSTATS_MIB_REASMOKS,			/* ReasmOKs */
>>  	IPSTATS_MIB_REASMFAILS,			/* ReasmFails */
>> +	IPSTATS_MIB_REASM_OVERLAPS,		/* ReasmOverlaps */
>>  	IPSTATS_MIB_FRAGOKS,			/* FragOKs */
>>  	IPSTATS_MIB_FRAGFAILS,			/* FragFails */
>>  	IPSTATS_MIB_FRAGCREATES,		/* FragCreates */
> 
> Inserting new entries in the middle of an enum means the numeric
> values will change. Isn't this going to break userspace ABI?
> 

I would argue the only exported thing from the kernel are the Key:Val in the /proc files.

Not sure why these enum are uapi.

Commit 46c2fa39877ed70415ee2b1acfb9129e956f6de4 added LINUX_MIB_TCPFASTOPENBLACKHOLE in the middle,
and really nobody complained.


Commit 0604475119de5f80dc051a5db055c6a2a75bd542 added LINUX_MIBSTCPMEMORYPRESSURESCHRONO
in the middle as well.

I am pretty sure we should maintain locality of these counters to lower number
of dirtied cache lines, say under IP frag DDOS ;)

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments.
  2018-08-02 23:33     ` Eric Dumazet
@ 2018-08-02 23:56       ` Stephen Hemminger
  0 siblings, 0 replies; 8+ messages in thread
From: Stephen Hemminger @ 2018-08-02 23:56 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Peter Oskolkov, David Miller, netdev, Eric Dumazet,
	Florian Westphal

On Thu, 2 Aug 2018 16:33:55 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On 08/02/2018 03:54 PM, Stephen Hemminger wrote:
> > On Thu,  2 Aug 2018 22:45:58 +0000
> > Peter Oskolkov <posk@google.com> wrote:
> >   
> >> diff --git a/include/uapi/linux/snmp.h b/include/uapi/linux/snmp.h
> >> index e5ebc83827ab..da1a144f1a51 100644
> >> --- a/include/uapi/linux/snmp.h
> >> +++ b/include/uapi/linux/snmp.h
> >> @@ -40,6 +40,7 @@ enum
> >>  	IPSTATS_MIB_REASMREQDS,			/* ReasmReqds */
> >>  	IPSTATS_MIB_REASMOKS,			/* ReasmOKs */
> >>  	IPSTATS_MIB_REASMFAILS,			/* ReasmFails */
> >> +	IPSTATS_MIB_REASM_OVERLAPS,		/* ReasmOverlaps */
> >>  	IPSTATS_MIB_FRAGOKS,			/* FragOKs */
> >>  	IPSTATS_MIB_FRAGFAILS,			/* FragFails */
> >>  	IPSTATS_MIB_FRAGCREATES,		/* FragCreates */  
> > 
> > Inserting new entries in the middle of an enum means the numeric
> > values will change. Isn't this going to break userspace ABI?
> >   
> 
> I would argue the only exported thing from the kernel are the Key:Val in the /proc files.
> 
> Not sure why these enum are uapi.
> 
> Commit 46c2fa39877ed70415ee2b1acfb9129e956f6de4 added LINUX_MIB_TCPFASTOPENBLACKHOLE in the middle,
> and really nobody complained.
> 
> 
> Commit 0604475119de5f80dc051a5db055c6a2a75bd542 added LINUX_MIBSTCPMEMORYPRESSURESCHRONO
> in the middle as well.
> 
> I am pretty sure we should maintain locality of these counters to lower number
> of dirtied cache lines, say under IP frag DDOS ;)
> 

Agree with eric, go ahead and keep things local

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH net-next 0/3] ip: Use rb trees for IP frag queue.
  2018-08-02 22:45 [PATCH net-next 0/3] ip: Use rb trees for IP frag queue Peter Oskolkov
                   ` (2 preceding siblings ...)
  2018-08-02 22:46 ` [PATCH net-next 3/3] ip: use rb trees for IP frag queue Peter Oskolkov
@ 2018-08-06  0:17 ` David Miller
  3 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2018-08-06  0:17 UTC (permalink / raw)
  To: posk; +Cc: netdev, edumazet, fw

From: Peter Oskolkov <posk@google.com>
Date: Thu,  2 Aug 2018 22:45:57 +0000

> This patchset
>  * changes IPv4 defrag behavior to match that of IPv6: overlapping
>    fragments now cause the whole IP datagram to be discarded (suggested
>    by David Miller): there are no legitimate use cases for overlapping
>    fragments;
>  * changes IPv4 defrag queue from a list to a rb tree (suggested
>    by Eric Dumazet): this change removes a potential attach vector.
> 
> Upcoming patches will contain similar changes for IPv6 frag queue,
> as well as a comprehensive IP defrag self-test (temporarily delayed).

Looks good, series applied, thanks!

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2018-08-06  2:24 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2018-08-02 22:45 [PATCH net-next 0/3] ip: Use rb trees for IP frag queue Peter Oskolkov
2018-08-02 22:45 ` [PATCH net-next 1/3] ip: discard IPv4 datagrams with overlapping segments Peter Oskolkov
2018-08-02 22:54   ` Stephen Hemminger
2018-08-02 23:33     ` Eric Dumazet
2018-08-02 23:56       ` Stephen Hemminger
2018-08-02 22:45 ` [PATCH net-next 2/3] net: modify skb_rbtree_purge to return the truesize of all purged skbs Peter Oskolkov
2018-08-02 22:46 ` [PATCH net-next 3/3] ip: use rb trees for IP frag queue Peter Oskolkov
2018-08-06  0:17 ` [PATCH net-next 0/3] ip: Use " David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).