[RFC PATCH] inet: fix enforcing of fragment queue hash list depth

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH] inet: fix enforcing of fragment queue hash list depth
@ 2013-04-15 14:25 Jesper Dangaard Brouer
  2013-04-15 15:26 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-15 14:25 UTC (permalink / raw)
  To: Eric Dumazet, Hannes Frederic Sowa
  Cc: Jesper Dangaard Brouer, netdev, David S. Miller

I have found an issues with commit:

 commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
 Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
 Date:   Fri Mar 15 11:32:30 2013 +0000

    inet: limit length of fragment queue hash table bucket lists

There is a connection between the fixed 128 hash depth limit and the
frag mem limit/thresh settings, which limits how high the thresh can
be set.

The 128 elems hash depth limit, results in bad behaviour if mem limit
thresh holds are increased, via /proc/sys/net ::

 /proc/sys/net/ipv4/ipfrag_high_thresh
 /proc/sys/net/ipv4/ipfrag_low_thresh

If we increase the thresh, to something allowing 128 elements in each
bucket, which is not that high given the hash array size of 64
(64*128=8192), e.g.
  big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
  small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968

The problem with commit 5a3da1fe (inet: limit length of fragment queue
hash table bucket lists) is that, once we hit the limit, the we *keep*
the existing frag queues, not allowing new frag queues to be created.
Thus, an attacker can effectivly block handling of fragments for 60
sec (as each frag queue have a timeout of 60 sec).

Even without increasing the limit, as Hannes showed, an attacker on
IPv6 can "attack" a specific hash bucket, and via that change, can
block/drop new fragments also (trying to) utilize this bucket.

Summary:
 With the default mem limit/thresh settings, this is not general
problem, but adjusting the thresh limits result in some-what
unexpected behavior.

Proposed solution:
 IMHO instead of keeping existing frag queues, we should kill one of
the frag queues in the hash instead.

Implementation complications:
 Killing of frag queues while only holding the hash bucket lock, and
not the frag queue lock, complicates the implementation, as we race
and can end up (trying to) remove the hash element twice (resulting in
an oops).

RFC (Request For Comments):
 Can we do better than the hlist_unhashed() check in fq_unlink().

PATCH NOT FINISHED:
 TODO: remove the unused inet_frag_maybe_warn_overflow()
 TODO: Add depth as sysctl option, to make large thresh possible

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 net/ipv4/inet_fragment.c |   54 +++++++++++++++++++++++++++++++++++++---------
 1 files changed, 43 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index e97d66a..a954ff2 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -130,7 +130,8 @@ void inet_frags_exit_net(struct netns_frags *nf, struct inet_frags *f)
 }
 EXPORT_SYMBOL(inet_frags_exit_net);

-static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
+static inline void fq_unlink_hash(struct inet_frag_queue *fq,
+				  struct inet_frags *f)
 {
 	struct inet_frag_bucket *hb;
 	unsigned int hash;
@@ -140,24 +141,35 @@ static inline void fq_unlink(struct inet_frag_queue *fq, struct inet_frags *f)
 	hb = &f->hash[hash];

 	spin_lock(&hb->chain_lock);
-	hlist_del(&fq->list);
+	/* Handle race-condition between direct hash tables cleaning
+	 * in inet_frag_find() and users of inet_frag_kill()
+	 */
+	if (!hlist_unhashed(&fq->list))
+		hlist_del(&fq->list);
 	spin_unlock(&hb->chain_lock);

 	read_unlock(&f->lock);
-	inet_frag_lru_del(fq);
 }

-void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
+void __inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f,
+		      bool unlink_hash)
 {
 	if (del_timer(&fq->timer))
 		atomic_dec(&fq->refcnt);

 	if (!(fq->last_in & INET_FRAG_COMPLETE)) {
-		fq_unlink(fq, f);
+		if (unlink_hash)
+			fq_unlink_hash(fq, f);
+		inet_frag_lru_del(fq);
 		atomic_dec(&fq->refcnt);
 		fq->last_in |= INET_FRAG_COMPLETE;
 	}
 }
+
+void inet_frag_kill(struct inet_frag_queue *fq, struct inet_frags *f)
+{
+	__inet_frag_kill(fq, f, true);
+}
 EXPORT_SYMBOL(inet_frag_kill);

 static inline void frag_kfree_skb(struct netns_frags *nf, struct inet_frags *f,
@@ -326,13 +338,14 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 	__releases(&f->lock)
 {
 	struct inet_frag_bucket *hb;
-	struct inet_frag_queue *q;
+	struct inet_frag_queue *q, *q_evict = NULL;
+	struct hlist_node *n;
 	int depth = 0;

 	hb = &f->hash[hash];

 	spin_lock(&hb->chain_lock);
-	hlist_for_each_entry(q, &hb->chain, list) {
+	hlist_for_each_entry_safe(q, n, &hb->chain, list) {
 		if (q->net == nf && f->match(q, key)) {
 			atomic_inc(&q->refcnt);
 			spin_unlock(&hb->chain_lock);
@@ -340,14 +353,33 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 			return q;
 		}
 		depth++;
+		q_evict = q; /* candidate for eviction */
 	}
+	/* Not found situation */
+	if (depth > INETFRAGS_MAXDEPTH) {
+		atomic_inc(&q_evict->refcnt);
+		hlist_del_init(&q_evict->list);
+	} else
+		q_evict = NULL;
 	spin_unlock(&hb->chain_lock);
 	read_unlock(&f->lock);

-	if (depth <= INETFRAGS_MAXDEPTH)
-		return inet_frag_create(nf, f, key);
-	else
-		return ERR_PTR(-ENOBUFS);
+	if (q_evict) {
+		spin_lock(&q_evict->lock);
+		if (!(q_evict->last_in & INET_FRAG_COMPLETE))
+			__inet_frag_kill(q_evict, f, false);
+		spin_unlock(&q_evict->lock);
+
+		if (atomic_dec_and_test(&q_evict->refcnt))
+			inet_frag_destroy(q_evict, f, NULL);
+
+		LIMIT_NETDEBUG(KERN_WARNING "%s(): Fragment hash bucket"
+			       " list length grew over limit (len %d),"
+			       " Dropping another fragment.\n",
+			       __func__, depth);
+	}
+
+	return inet_frag_create(nf, f, key);
 }
 EXPORT_SYMBOL(inet_frag_find);

^ permalink raw reply related	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] inet: fix enforcing of fragment queue hash list depth
  2013-04-15 14:25 [RFC PATCH] inet: fix enforcing of fragment queue hash list depth Jesper Dangaard Brouer
@ 2013-04-15 15:26 ` Hannes Frederic Sowa
  2013-04-15 16:23   ` Eric Dumazet
  2013-04-15 17:16   ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 7+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-15 15:26 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Eric Dumazet, netdev, David S. Miller

Hi Jesper!

On Mon, Apr 15, 2013 at 04:25:10PM +0200, Jesper Dangaard Brouer wrote:
> I have found an issues with commit:
> 
>  commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
>  Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
>  Date:   Fri Mar 15 11:32:30 2013 +0000
> 
>     inet: limit length of fragment queue hash table bucket lists
> 
> There is a connection between the fixed 128 hash depth limit and the
> frag mem limit/thresh settings, which limits how high the thresh can
> be set.
> 
> The 128 elems hash depth limit, results in bad behaviour if mem limit
> thresh holds are increased, via /proc/sys/net ::
> 
>  /proc/sys/net/ipv4/ipfrag_high_thresh
>  /proc/sys/net/ipv4/ipfrag_low_thresh
> 
> If we increase the thresh, to something allowing 128 elements in each
> bucket, which is not that high given the hash array size of 64
> (64*128=8192), e.g.
>   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
>   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968

I thought it was pretty high already. While creating this patch I also
had a patch which did calculate the chain limit while updating the sysctl
high_thresh knob (perhaps this could be of use):

http://patchwork.ozlabs.org/patch/227136/

Perhaps we should reconsider the formula I choose to calculate this limit.
But because we would actually have 128 iterations in the hash bucket I
am more in favor of resizing (or even come up with a way to dynamically
resize) the hash table. On a smaller sized machine I can actually create
severe latency because of the list iteration even with the 128 list length
limit in place.

> The problem with commit 5a3da1fe (inet: limit length of fragment queue
> hash table bucket lists) is that, once we hit the limit, the we *keep*
> the existing frag queues, not allowing new frag queues to be created.
> Thus, an attacker can effectivly block handling of fragments for 60
> sec (as each frag queue have a timeout of 60 sec).
> 
> Even without increasing the limit, as Hannes showed, an attacker on
> IPv6 can "attack" a specific hash bucket, and via that change, can
> block/drop new fragments also (trying to) utilize this bucket.
> 
> Summary:
>  With the default mem limit/thresh settings, this is not general
> problem, but adjusting the thresh limits result in some-what
> unexpected behavior.
> 
> Proposed solution:
>  IMHO instead of keeping existing frag queues, we should kill one of
> the frag queues in the hash instead.

I thought that if we actually reach this limit the machine would be in
some kind DoS situation and there won't be a good solution to identify
harmless fragments from non-harmless fragments. So I settled with the
simple drop and did not try to come up with a load balancing or rehashing
method.

My first try to solve this problem actually was converting the hash chains
from hlists to lists (so doubling the size of the hash table by two) to be
able to drop the last element per bucket.

> Implementation complications:
>  Killing of frag queues while only holding the hash bucket lock, and
> not the frag queue lock, complicates the implementation, as we race
> and can end up (trying to) remove the hash element twice (resulting in
> an oops).

I will have to play with this patch a bit. Perhaps I have some more insights
after that. :)

Thanks for looking into it,

  Hannes

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] inet: fix enforcing of fragment queue hash list depth
  2013-04-15 15:26 ` Hannes Frederic Sowa
@ 2013-04-15 16:23   ` Eric Dumazet
  2013-04-15 17:09     ` Hannes Frederic Sowa
  2013-04-15 17:12     ` Jesper Dangaard Brouer
  2013-04-15 17:16   ` Jesper Dangaard Brouer
  1 sibling, 2 replies; 7+ messages in thread
From: Eric Dumazet @ 2013-04-15 16:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Jesper Dangaard Brouer, netdev, David S. Miller

On Mon, 2013-04-15 at 17:26 +0200, Hannes Frederic Sowa wrote:
> Hi Jesper!
> 
> On Mon, Apr 15, 2013 at 04:25:10PM +0200, Jesper Dangaard Brouer wrote:
> > I have found an issues with commit:
> > 
> >  commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
> >  Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
> >  Date:   Fri Mar 15 11:32:30 2013 +0000
> > 
> >     inet: limit length of fragment queue hash table bucket lists
> > 
> > There is a connection between the fixed 128 hash depth limit and the
> > frag mem limit/thresh settings, which limits how high the thresh can
> > be set.
> > 
> > The 128 elems hash depth limit, results in bad behaviour if mem limit
> > thresh holds are increased, via /proc/sys/net ::
> > 
> >  /proc/sys/net/ipv4/ipfrag_high_thresh
> >  /proc/sys/net/ipv4/ipfrag_low_thresh
> > 
> > If we increase the thresh, to something allowing 128 elements in each
> > bucket, which is not that high given the hash array size of 64
> > (64*128=8192), e.g.
> >   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
> >   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968
> 
> I thought it was pretty high already. While creating this patch I also
> had a patch which did calculate the chain limit while updating the sysctl
> high_thresh knob (perhaps this could be of use):
> 
> http://patchwork.ozlabs.org/patch/227136/
> 
> Perhaps we should reconsider the formula I choose to calculate this limit.
> But because we would actually have 128 iterations in the hash bucket I
> am more in favor of resizing (or even come up with a way to dynamically
> resize) the hash table. On a smaller sized machine I can actually create
> severe latency because of the list iteration even with the 128 list length
> limit in place.

Allowing thousand of fragments and keeping a 64 slot hash table is not
going to work.

depths of 128 are just insane.

Really Jesper, you'll need to make the hash table dynamic, if you really
care.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] inet: fix enforcing of fragment queue hash list depth
  2013-04-15 16:23   ` Eric Dumazet
@ 2013-04-15 17:09     ` Hannes Frederic Sowa
  2013-04-15 17:12     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 7+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-15 17:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jesper Dangaard Brouer, netdev, David S. Miller

On Mon, Apr 15, 2013 at 09:23:50AM -0700, Eric Dumazet wrote:
> On Mon, 2013-04-15 at 17:26 +0200, Hannes Frederic Sowa wrote:
> > Hi Jesper!
> > 
> > On Mon, Apr 15, 2013 at 04:25:10PM +0200, Jesper Dangaard Brouer wrote:
> > > I have found an issues with commit:
> > > 
> > >  commit 5a3da1fe9561828d0ca7eca664b16ec2b9bf0055
> > >  Author: Hannes Frederic Sowa <hannes@stressinduktion.org>
> > >  Date:   Fri Mar 15 11:32:30 2013 +0000
> > > 
> > >     inet: limit length of fragment queue hash table bucket lists
> > > 
> > > There is a connection between the fixed 128 hash depth limit and the
> > > frag mem limit/thresh settings, which limits how high the thresh can
> > > be set.
> > > 
> > > The 128 elems hash depth limit, results in bad behaviour if mem limit
> > > thresh holds are increased, via /proc/sys/net ::
> > > 
> > >  /proc/sys/net/ipv4/ipfrag_high_thresh
> > >  /proc/sys/net/ipv4/ipfrag_low_thresh
> > > 
> > > If we increase the thresh, to something allowing 128 elements in each
> > > bucket, which is not that high given the hash array size of 64
> > > (64*128=8192), e.g.
> > >   big MTU frags (2944(truesize)+208(ipq))*8192(max elems)=25755648
> > >   small frags   ( 896(truesize)+208(ipq))*8192(max elems)=9043968
> > 
> > I thought it was pretty high already. While creating this patch I also
> > had a patch which did calculate the chain limit while updating the sysctl
> > high_thresh knob (perhaps this could be of use):
> > 
> > http://patchwork.ozlabs.org/patch/227136/
> > 
> > Perhaps we should reconsider the formula I choose to calculate this limit.
> > But because we would actually have 128 iterations in the hash bucket I
> > am more in favor of resizing (or even come up with a way to dynamically
> > resize) the hash table. On a smaller sized machine I can actually create
> > severe latency because of the list iteration even with the 128 list length
> > limit in place.
> 
> Allowing thousand of fragments and keeping a 64 slot hash table is not
> going to work.
> 
> depths of 128 are just insane.
> 
> Really Jesper, you'll need to make the hash table dynamic, if you really
> care.

Where there already plans how this could be achieved? I am currently
looking at nested hash table lookups, the dcache and Relativistic
Hash Table[1] paper. Last time I played with the idea to move the
fragmentation cache to RCU I abandoned it because of the high update
rate it could experience.

[1] http://static.usenix.org/event/atc11/tech/final_files/Triplett.pdf

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] inet: fix enforcing of fragment queue hash list depth
  2013-04-15 16:23   ` Eric Dumazet
  2013-04-15 17:09     ` Hannes Frederic Sowa
@ 2013-04-15 17:12     ` Jesper Dangaard Brouer
  1 sibling, 0 replies; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-15 17:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Hannes Frederic Sowa, netdev, David S. Miller

On Mon, 2013-04-15 at 09:23 -0700, Eric Dumazet wrote:

> Allowing thousand of fragments and keeping a 64 slot hash table is not
> going to work.
> 
> depths of 128 are just insane.

I fully agree, my plan was actually to reduce this to 5 or 10 depth
limit.  I just noticed this problem with Hannes patch, while working on
your idea of direct hash cleaning, and then I just/only extracted the
parts that was relevant for fixing Hannes patch.

> Really Jesper, you'll need to make the hash table dynamic, if you really
> care.

My plan/idea is to make the hash tables size depend on the available
memory.  As on small memory devices, we are opening up for (an attack
vector where) remote hosts can pin-down a large portion of their memory,
which we want to avoid.  (And you don't even need a port in listen
state).

How dynamic do you want it?  Would initial sizing based on memory be
enough, or should I also add a proc/sysctl option for changing the hash
size from userspace?

--Jesper

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] inet: fix enforcing of fragment queue hash list depth
  2013-04-15 15:26 ` Hannes Frederic Sowa
  2013-04-15 16:23   ` Eric Dumazet
@ 2013-04-15 17:16   ` Jesper Dangaard Brouer
  2013-04-15 17:43     ` Hannes Frederic Sowa
  1 sibling, 1 reply; 7+ messages in thread
From: Jesper Dangaard Brouer @ 2013-04-15 17:16 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Eric Dumazet, netdev, David S. Miller

On Mon, 2013-04-15 at 17:26 +0200, Hannes Frederic Sowa wrote:

> I will have to play with this patch a bit. Perhaps I have some more insights
> after that. :)

I would actually like to do/finish up this patch my self, as I have
several patches in this code area.  But I would value your input :-)

--Jesper

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [RFC PATCH] inet: fix enforcing of fragment queue hash list depth
  2013-04-15 17:16   ` Jesper Dangaard Brouer
@ 2013-04-15 17:43     ` Hannes Frederic Sowa
  0 siblings, 0 replies; 7+ messages in thread
From: Hannes Frederic Sowa @ 2013-04-15 17:43 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Eric Dumazet, netdev, David S. Miller

On Mon, Apr 15, 2013 at 07:16:26PM +0200, Jesper Dangaard Brouer wrote:
> On Mon, 2013-04-15 at 17:26 +0200, Hannes Frederic Sowa wrote:
> 
> > I will have to play with this patch a bit. Perhaps I have some more insights
> > after that. :)
> 
> I would actually like to do/finish up this patch my self, as I have
> several patches in this code area.  But I would value your input :-)

No problem at all, I already thought so.

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2013-04-15 17:43 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-15 14:25 [RFC PATCH] inet: fix enforcing of fragment queue hash list depth Jesper Dangaard Brouer
2013-04-15 15:26 ` Hannes Frederic Sowa
2013-04-15 16:23   ` Eric Dumazet
2013-04-15 17:09     ` Hannes Frederic Sowa
2013-04-15 17:12     ` Jesper Dangaard Brouer
2013-04-15 17:16   ` Jesper Dangaard Brouer
2013-04-15 17:43     ` Hannes Frederic Sowa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).