[PATCH RFC] ipv6: use stronger hash for reassembly queue hash table

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
@ 2013-03-07 21:42 Hannes Frederic Sowa
  2013-03-08  5:57 ` Hannes Frederic Sowa
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-07 21:42 UTC (permalink / raw)
  To: netdev; +Cc: eric.dumazet, yoshfuji

ipv6_addr_hash can yield the exact same hash value for countless ip
addresses from one /64 subnet. Let's make it a bit harder to create a
long list of reassembly queues in the hash table by using a stronger hash.

I spotted this just by looking at the source and did not verify if it
is really the case, so this patch has RFC status. But the jhash change
should not really hurt anyway. This patch is only compile tested.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/reassembly.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 3c6a772..4d39cf3 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -79,7 +79,7 @@ unsigned int inet6_hash_frag(__be32 id, const struct in6_addr *saddr,
 {
 	u32 c;
 
-	c = jhash_3words(ipv6_addr_hash(saddr), ipv6_addr_hash(daddr),
+	c = jhash_3words(ipv6_addr_jhash(saddr), ipv6_addr_hash(daddr),
 			 (__force u32)id, rnd);
 
 	return c & (INETFRAGS_HASHSZ - 1);
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-07 21:42 [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table Hannes Frederic Sowa
@ 2013-03-08  5:57 ` Hannes Frederic Sowa
  2013-03-08 13:04   ` Hannes Frederic Sowa
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-08  5:57 UTC (permalink / raw)
  To: netdev, eric.dumazet, yoshfuji

On Thu, Mar 07, 2013 at 10:42:11PM +0100, Hannes Frederic Sowa wrote:
> ipv6_addr_hash can yield the exact same hash value for countless ip
> addresses from one /64 subnet. Let's make it a bit harder to create a
> long list of reassembly queues in the hash table by using a stronger hash.
> 
> I spotted this just by looking at the source and did not verify if it
> is really the case, so this patch has RFC status. But the jhash change
> should not really hurt anyway. This patch is only compile tested.

Hmpf, I haven't seen Eric's change(279e9f2: ipv6: optimize
inet6_hash_frag()) and tried to compare a v3.8 against a net-next
kernel. At a first sight, it seems in some kind of denial of service
scenario the relative amount of time spend in inet_frag_find increased,
but I will do more comparable benchmarks later today (could be also
because of other changes). I just wanted to let you know that I will do
more research on this one and that you shouldn't apply this patch for now.

I also noticed that this function is also used by netfilter in the
forwarding path. Perhaps a jenkins hash on the destination address would
be better, too. Perhaps a netfilter specific hash function could also
be used.

Greetings,

  Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08  5:57 ` Hannes Frederic Sowa
@ 2013-03-08 13:04   ` Hannes Frederic Sowa
  2013-03-08 14:53     ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-08 13:04 UTC (permalink / raw)
  To: netdev, eric.dumazet, yoshfuji

On Fri, Mar 08, 2013 at 06:57:18AM +0100, Hannes Frederic Sowa wrote:
> On Thu, Mar 07, 2013 at 10:42:11PM +0100, Hannes Frederic Sowa wrote:
> > ipv6_addr_hash can yield the exact same hash value for countless ip
> > addresses from one /64 subnet. Let's make it a bit harder to create a
> > long list of reassembly queues in the hash table by using a stronger hash.
> > 
> > I spotted this just by looking at the source and did not verify if it
> > is really the case, so this patch has RFC status. But the jhash change
> > should not really hurt anyway. This patch is only compile tested.
> 
> Hmpf, I haven't seen Eric's change(279e9f2: ipv6: optimize
> inet6_hash_frag()) and tried to compare a v3.8 against a net-next
> kernel. At a first sight, it seems in some kind of denial of service
> scenario the relative amount of time spend in inet_frag_find increased,
> but I will do more comparable benchmarks later today (could be also
> because of other changes). I just wanted to let you know that I will do
> more research on this one and that you shouldn't apply this patch for now.
> 
> I also noticed that this function is also used by netfilter in the
> forwarding path. Perhaps a jenkins hash on the destination address would
> be better, too. Perhaps a netfilter specific hash function could also
> be used.

Getting benchmarks on my box is actually pretty hairy without burning
it down. :)

Generating a flow of fragmented packets with frag_id set to zero and ipv6
addresses which generate the same ipv6_addr_hash I gathered this from a debug
printk in inet6_hash_frag (current version from net-next):

[   25.992202] FRAG6: hash -635286572
[   25.992218] FRAG6: hash -635286572
[   25.992237] FRAG6: hash -635286572
[   25.992253] FRAG6: hash -635286572
[   25.992273] FRAG6: hash -635286572
[   25.992289] FRAG6: hash -635286572
[   25.992309] FRAG6: hash -635286572
[   25.992327] FRAG6: hash -635286572
[   25.992347] FRAG6: hash -635286572
[   25.992363] FRAG6: hash -635286572
[   25.992382] FRAG6: hash -635286572
[   25.992398] FRAG6: hash -635286572

With the patches reverted:

[   39.770790] FRAG6: hash 1323153854
[   39.773691] FRAG6: hash 1323153854
[   39.776413] FRAG6: hash 977187067
[   39.778782] FRAG6: hash 977187067
[   39.781298] FRAG6: hash -1226453562
[   39.784043] FRAG6: hash -1226453562
[   39.786443] FRAG6: hash -923346920
[   39.789431] FRAG6: hash -923346920
[   39.792205] FRAG6: hash 282862966
[   39.794918] FRAG6: hash 282862966
[   39.797749] FRAG6: hash -1059175404
[   39.800054] FRAG6: hash -1059175404
[   39.802904] FRAG6: hash -1085381868

It is pretty simple to generate a source of different addresses that yield to
the same hash value.

So I am in favour to revert the optimization changes in 279e9f2. It seems that
even ipv6_addr_jhash is not strong enough when used in the forwarding path for
reassembly, because of the xor of the most significant 32 bits of the address.

Quick test program is here: https://gist.github.com/hannes/5116331

Ok if I send a patch later to revert this change? What do you think about
increasing the size of the fragmentation queue hash table INETFRAGS_HASHSZ,
too?

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 13:04   ` Hannes Frederic Sowa
@ 2013-03-08 14:53     ` Eric Dumazet
  2013-03-08 15:08       ` Hannes Frederic Sowa
  0 siblings, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2013-03-08 14:53 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev, yoshfuji

On Fri, 2013-03-08 at 14:04 +0100, Hannes Frederic Sowa wrote:

> Ok if I send a patch later to revert this change? What do you think about
> increasing the size of the fragmentation queue hash table INETFRAGS_HASHSZ,
> too?

Thats because reassembly hash was so small, and number of in flight
reassembly is so small anyway, that I felt not worth having so many
instructions to compute a hash that is truncated to 6 bits

In real life I always advocate _not_ using fragments, I dont know why so
many people try to used them.

No matter how you hash, a hacker can easily fill your defrag unit with
not complete datagrams, so what's the point ?

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 14:53     ` Eric Dumazet
@ 2013-03-08 15:08       ` Hannes Frederic Sowa
  2013-03-08 15:23         ` Eric Dumazet
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-08 15:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, yoshfuji

On Fri, Mar 08, 2013 at 06:53:06AM -0800, Eric Dumazet wrote:
> No matter how you hash, a hacker can easily fill your defrag unit with
> not complete datagrams, so what's the point ?

I want to harden reassembly logic against all fragments being put in
the same hash bucket because of malicious traffic and thus creating
long list traversals in the fragment queue hash table.

I totally agree that fragments should be avoided if possible.

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 15:08       ` Hannes Frederic Sowa
@ 2013-03-08 15:23         ` Eric Dumazet
  2013-03-08 15:54           ` Hannes Frederic Sowa
                             ` (2 more replies)
  0 siblings, 3 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-03-08 15:23 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev, yoshfuji

On Fri, 2013-03-08 at 16:08 +0100, Hannes Frederic Sowa wrote:
> On Fri, Mar 08, 2013 at 06:53:06AM -0800, Eric Dumazet wrote:
> > No matter how you hash, a hacker can easily fill your defrag unit with
> > not complete datagrams, so what's the point ?
> 
> I want to harden reassembly logic against all fragments being put in
> the same hash bucket because of malicious traffic and thus creating
> long list traversals in the fragment queue hash table.

Note that the long traversal was a real issue with TCP (thats why I
introduced ipv6_addr_jhash()), as a single ehash slot could contains
thousand of sockets.

But with fragments, we should just limit the depth of any particular
slot, and drop above a particular threshold.

reassembly is a best effort mechanism, better make sure it doesnt use
all our cpu cycles.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 15:23         ` Eric Dumazet
@ 2013-03-08 15:54           ` Hannes Frederic Sowa
  2013-03-08 16:15             ` Eric Dumazet
  2013-03-09 15:19             ` Hannes Frederic Sowa
  2013-03-08 20:53           ` Hannes Frederic Sowa
  2013-03-13  1:27           ` Hannes Frederic Sowa
  2 siblings, 2 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-08 15:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, yoshfuji

On Fri, Mar 08, 2013 at 07:23:39AM -0800, Eric Dumazet wrote:
> On Fri, 2013-03-08 at 16:08 +0100, Hannes Frederic Sowa wrote:
> > On Fri, Mar 08, 2013 at 06:53:06AM -0800, Eric Dumazet wrote:
> > > No matter how you hash, a hacker can easily fill your defrag unit with
> > > not complete datagrams, so what's the point ?
> > 
> > I want to harden reassembly logic against all fragments being put in
> > the same hash bucket because of malicious traffic and thus creating
> > long list traversals in the fragment queue hash table.
> 
> Note that the long traversal was a real issue with TCP (thats why I
> introduced ipv6_addr_jhash()), as a single ehash slot could contains
> thousand of sockets.
> 
> But with fragments, we should just limit the depth of any particular
> slot, and drop above a particular threshold.
> 
> reassembly is a best effort mechanism, better make sure it doesnt use
> all our cpu cycles.

Hm, I have to think about it, especially because it is used in the netfilter
code. There could be some fairness issues if packets get dropped in netfilter
if reassembly could not be performed. I'll check this.

Btw. did s.o. have a look at skb->rxhash? I just started looking after
its many users but perhaps someone did this job already.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 15:54           ` Hannes Frederic Sowa
@ 2013-03-08 16:15             ` Eric Dumazet
  2013-03-08 16:18               ` Hannes Frederic Sowa
  2013-03-09 15:19             ` Hannes Frederic Sowa
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2013-03-08 16:15 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev, yoshfuji

On Fri, 2013-03-08 at 16:54 +0100, Hannes Frederic Sowa wrote:

> Btw. did s.o. have a look at skb->rxhash? I just started looking after
> its many users but perhaps someone did this job already.

You mean for IPv6 ?

Most of us use the rxhash provided by the NIC, our software fallback is
a best effort as well for RPS or some qdisc.

If some hacker specifically cook packets hashing to same rxhash, thats
fine.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 16:15             ` Eric Dumazet
@ 2013-03-08 16:18               ` Hannes Frederic Sowa
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-08 16:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, yoshfuji

On Fri, Mar 08, 2013 at 08:15:10AM -0800, Eric Dumazet wrote:
> On Fri, 2013-03-08 at 16:54 +0100, Hannes Frederic Sowa wrote:
> 
> > Btw. did s.o. have a look at skb->rxhash? I just started looking after
> > its many users but perhaps someone did this job already.
> 
> You mean for IPv6 ?

Yes, because of the usage of ipv6_addr_hash in skb_flow_dissect.

> Most of us use the rxhash provided by the NIC, our software fallback is
> a best effort as well for RPS or some qdisc.
> 
> If some hacker specifically cook packets hashing to same rxhash, thats
> fine.

Ok, good to know. Thanks!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 15:54           ` Hannes Frederic Sowa
  2013-03-08 16:15             ` Eric Dumazet
@ 2013-03-09 15:19             ` Hannes Frederic Sowa
  1 sibling, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-09 15:19 UTC (permalink / raw)
  To: Eric Dumazet, netdev, yoshfuji

On Fri, Mar 08, 2013 at 04:54:04PM +0100, Hannes Frederic Sowa wrote:
> On Fri, Mar 08, 2013 at 07:23:39AM -0800, Eric Dumazet wrote:
> > On Fri, 2013-03-08 at 16:08 +0100, Hannes Frederic Sowa wrote:
> > > On Fri, Mar 08, 2013 at 06:53:06AM -0800, Eric Dumazet wrote:
> > > > No matter how you hash, a hacker can easily fill your defrag unit with
> > > > not complete datagrams, so what's the point ?
> > > 
> > > I want to harden reassembly logic against all fragments being put in
> > > the same hash bucket because of malicious traffic and thus creating
> > > long list traversals in the fragment queue hash table.
> > 
> > Note that the long traversal was a real issue with TCP (thats why I
> > introduced ipv6_addr_jhash()), as a single ehash slot could contains
> > thousand of sockets.
> > 
> > But with fragments, we should just limit the depth of any particular
> > slot, and drop above a particular threshold.
> > 
> > reassembly is a best effort mechanism, better make sure it doesnt use
> > all our cpu cycles.
> 
> Hm, I have to think about it, especially because it is used in the netfilter
> code. There could be some fairness issues if packets get dropped in netfilter
> if reassembly could not be performed. I'll check this.

There should be no fairness issues in the forwarding path because
legitimate traffic should send a reasonable random 32bit fragmentation
id which is a direct input to jhash.

I thought about the list length limit this morning and I believe we have
to dynamically compute it (maybe on sysctl change), because I bet people
want to have their fragmentation cache utilized if they higher the memory
thresholds (maybe dns resolvers with dnssec employed). The missing value
is the average skb->truesize so I'll have to assume one. Any pointers
on this? We could also export the limit to userspace and have the users
deal with it. But I am not in favour of this solution.

Should we do reclamation in the hash buckets (I would have to switch
from hlist to list) or just drop incoming fragments (this should be
fairly easy). Currently we discard fragments in lru fashion, so I think
reclamation would be the way to go, too.

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 15:23         ` Eric Dumazet
  2013-03-08 15:54           ` Hannes Frederic Sowa
@ 2013-03-08 20:53           ` Hannes Frederic Sowa
  2013-03-13  1:27           ` Hannes Frederic Sowa
  2 siblings, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-08 20:53 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, yoshfuji

On Fri, Mar 08, 2013 at 07:23:39AM -0800, Eric Dumazet wrote:
> On Fri, 2013-03-08 at 16:08 +0100, Hannes Frederic Sowa wrote:
> > On Fri, Mar 08, 2013 at 06:53:06AM -0800, Eric Dumazet wrote:
> > > No matter how you hash, a hacker can easily fill your defrag unit with
> > > not complete datagrams, so what's the point ?
> > 
> > I want to harden reassembly logic against all fragments being put in
> > the same hash bucket because of malicious traffic and thus creating
> > long list traversals in the fragment queue hash table.
> 
> Note that the long traversal was a real issue with TCP (thats why I
> introduced ipv6_addr_jhash()), as a single ehash slot could contains
> thousand of sockets.
> 
> But with fragments, we should just limit the depth of any particular
> slot, and drop above a particular threshold.
> 
> reassembly is a best effort mechanism, better make sure it doesnt use
> all our cpu cycles.

On my VM I counted 17500 iterations in one hash bucket and maxing out
one CPU until I got rcu stalls and NMIs. In comparison: If I use the
old hasing code I have max iterations of 370 and don't expirience any
rcu stalls or NMIs (seems to be around 17500/64+-epsilon).

I have not yet drawn my conclusion on this, yet, but I agree some list
length limiting code would be useful independent of the hash function.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-08 15:23         ` Eric Dumazet
  2013-03-08 15:54           ` Hannes Frederic Sowa
  2013-03-08 20:53           ` Hannes Frederic Sowa
@ 2013-03-13  1:27           ` Hannes Frederic Sowa
  2013-03-13  1:31             ` Hannes Frederic Sowa
  2013-03-13  5:29             ` Eric Dumazet
  2 siblings, 2 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-13  1:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, yoshfuji, brouer

[cc'ing Jesper, too]

On Fri, Mar 08, 2013 at 07:23:39AM -0800, Eric Dumazet wrote:
> On Fri, 2013-03-08 at 16:08 +0100, Hannes Frederic Sowa wrote:
> > On Fri, Mar 08, 2013 at 06:53:06AM -0800, Eric Dumazet wrote:
> > > No matter how you hash, a hacker can easily fill your defrag unit with
> > > not complete datagrams, so what's the point ?
> > 
> > I want to harden reassembly logic against all fragments being put in
> > the same hash bucket because of malicious traffic and thus creating
> > long list traversals in the fragment queue hash table.
> 
> Note that the long traversal was a real issue with TCP (thats why I
> introduced ipv6_addr_jhash()), as a single ehash slot could contains
> thousand of sockets.
> 
> But with fragments, we should just limit the depth of any particular
> slot, and drop above a particular threshold.

[PATCH net-next RFC] inet: add max_depth to limit list length in inet_frags hash

This does implement trivial drop for fragments where the hash queue
is above some limit.

I calculate the limit as follow:

I averaged the folowing formula

max_depth = max_threshold / INETFRAGS_HASHSZ / rounded up (SKB_TRUELEN(0)
            sizeof(struct ipq or struct frag_queue))

to

max_threshold >> 15

So we start with a maximum list length of 128. I think we could halve
this value to 64, but because I have no real performance data I left it
at this higher value for now.

This patch does only protect IPv6 (and not netfilter ipv6 defragmentation)
and will switch off limit checking if max_depth is zero. I'll rewrite
the check if we agree that this simple solution is the way to go (simple
drop) and will clamp the minimum value to 1 as soon as I also migrated
ipv4 and netfilter to the new sysctl handler.

When testing this patch:

Disable netfilter defragmenation for ipv6 on your machine if you test
this patch, otherwise you won't see the improvment. Machine now runs
smoothly under fragmentation dos.

Ok if I target this patch for net next time because the hashing changes
are in there already?

Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/inet_frag.h  | 13 +++++++++++++
 net/ipv4/inet_fragment.c | 25 ++++++++++++++++++++++++-
 net/ipv6/reassembly.c    |  6 +++++-
 3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 76c3fe5..9ba6ada 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -17,6 +17,7 @@ struct netns_frags {
 	int			timeout;
 	int			high_thresh;
 	int			low_thresh;
+	int			max_depth;
 };
 
 struct inet_frag_queue {
@@ -43,6 +44,11 @@ struct inet_frag_queue {
 
 #define INETFRAGS_HASHSZ		64
 
+/* max_depth = max_threshold / INETFRAGS_HASHSZ / rounded up (SKB_TRUELEN(0) +
+ *	       sizeof(struct ipq or struct frag_queue))
+ */
+#define INETFRAGS_MAXDEPTH_SHIFT	15
+
 struct inet_frags {
 	struct hlist_head	hash[INETFRAGS_HASHSZ];
 	/* This rwlock is a global lock (seperate per IPv4, IPv6 and
@@ -144,4 +150,11 @@ static inline void inet_frag_lru_add(struct netns_frags *nf,
 	list_add_tail(&q->lru_list, &nf->lru_list);
 	spin_unlock(&nf->lru_lock);
 }
+
+#ifdef CONFIG_SYSCTL
+int inet_frag_update_high_thresh(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos);
+#endif
+
 #endif
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 245ae07..92f1fdd 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -277,6 +277,7 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 	__releases(&f->lock)
 {
 	struct inet_frag_queue *q;
+	int depth = 0;
 
 	hlist_for_each_entry(q, &f->hash[hash], list) {
 		if (q->net == nf && f->match(q, key)) {
@@ -284,9 +285,31 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 			read_unlock(&f->lock);
 			return q;
 		}
+		depth++;
 	}
 	read_unlock(&f->lock);
 
-	return inet_frag_create(nf, f, key);
+	if (!nf->max_depth || depth <= nf->max_depth)
+		return inet_frag_create(nf, f, key);
+	else
+		return NULL;
 }
 EXPORT_SYMBOL(inet_frag_find);
+
+#ifdef CONFIG_SYSCTL
+int inet_frag_update_high_thresh(struct ctl_table *table, int write,
+				 void __user *buffer, size_t *lenp,
+				 loff_t *ppos)
+{
+	int ret;
+	ret = proc_dointvec(table, write, buffer, lenp, ppos);
+
+	if (!ret && write && table->extra1) {
+		int *data = table->data;
+		int *max_depth = table->extra1;
+		*max_depth = *data >> INETFRAGS_MAXDEPTH_SHIFT;
+	}
+	return ret;
+}
+EXPORT_SYMBOL(inet_frag_update_high_thresh);
+#endif
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 3c6a772..84b35f6 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -558,7 +558,8 @@ static struct ctl_table ip6_frags_ns_ctl_table[] = {
 		.data		= &init_net.ipv6.frags.high_thresh,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= inet_frag_update_high_thresh,
+		.extra1		= &init_net.ipv6.frags.max_depth
 	},
 	{
 		.procname	= "ip6frag_low_thresh",
@@ -600,6 +601,7 @@ static int __net_init ip6_frags_ns_sysctl_register(struct net *net)
 			goto err_alloc;
 
 		table[0].data = &net->ipv6.frags.high_thresh;
+		table[0].extra1 = &net->ipv6.frags.max_depth;
 		table[1].data = &net->ipv6.frags.low_thresh;
 		table[2].data = &net->ipv6.frags.timeout;
 
@@ -670,6 +672,8 @@ static int __net_init ipv6_frags_init_net(struct net *net)
 	net->ipv6.frags.high_thresh = IPV6_FRAG_HIGH_THRESH;
 	net->ipv6.frags.low_thresh = IPV6_FRAG_LOW_THRESH;
 	net->ipv6.frags.timeout = IPV6_FRAG_TIMEOUT;
+	net->ipv6.frags.max_depth =
+		IPV6_FRAG_HIGH_THRESH >> INETFRAGS_MAXDEPTH_SHIFT;
 
 	inet_frags_init_net(&net->ipv6.frags);
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-13  1:27           ` Hannes Frederic Sowa
@ 2013-03-13  1:31             ` Hannes Frederic Sowa
  2013-03-13  5:29             ` Eric Dumazet
  1 sibling, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-13  1:31 UTC (permalink / raw)
  To: Eric Dumazet, netdev, yoshfuji, brouer

On Wed, Mar 13, 2013 at 02:27:15AM +0100, Hannes Frederic Sowa wrote:
> Disable netfilter defragmenation for ipv6 on your machine if you test
> this patch, otherwise you won't see the improvment. Machine now runs
> smoothly under fragmentation dos.

Just to clarify:

When I wrote dos above, I actually meant a dos where the addresses have a
common ipv6_addr_hash. A simple test tool I wrote for testing is here:

  https://gist.github.com/hannes/5116331

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-13  1:27           ` Hannes Frederic Sowa
  2013-03-13  1:31             ` Hannes Frederic Sowa
@ 2013-03-13  5:29             ` Eric Dumazet
  2013-03-14  1:37               ` Hannes Frederic Sowa
  1 sibling, 1 reply; 25+ messages in thread
From: Eric Dumazet @ 2013-03-13  5:29 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: netdev, yoshfuji, brouer

On Wed, 2013-03-13 at 02:27 +0100, Hannes Frederic Sowa wrote:

> 
> [PATCH net-next RFC] inet: add max_depth to limit list length in inet_frags hash
> 
> This does implement trivial drop for fragments where the hash queue
> is above some limit.
> 
> I calculate the limit as follow:
> 
> I averaged the folowing formula
> 
> max_depth = max_threshold / INETFRAGS_HASHSZ / rounded up (SKB_TRUELEN(0)
>             sizeof(struct ipq or struct frag_queue))
> 
> to
> 
> max_threshold >> 15
> 
> So we start with a maximum list length of 128. I think we could halve
> this value to 64, but because I have no real performance data I left it
> at this higher value for now.
> 
> This patch does only protect IPv6 (and not netfilter ipv6 defragmentation)
> and will switch off limit checking if max_depth is zero. I'll rewrite
> the check if we agree that this simple solution is the way to go (simple
> drop) and will clamp the minimum value to 1 as soon as I also migrated
> ipv4 and netfilter to the new sysctl handler.
> 
> When testing this patch:
> 
> Disable netfilter defragmenation for ipv6 on your machine if you test
> this patch, otherwise you won't see the improvment. Machine now runs
> smoothly under fragmentation dos.
> 
> Ok if I target this patch for net next time because the hashing changes
> are in there already?
> 
> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
> ---
>  include/net/inet_frag.h  | 13 +++++++++++++
>  net/ipv4/inet_fragment.c | 25 ++++++++++++++++++++++++-
>  net/ipv6/reassembly.c    |  6 +++++-
>  3 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
> index 76c3fe5..9ba6ada 100644
> --- a/include/net/inet_frag.h
> +++ b/include/net/inet_frag.h
> @@ -17,6 +17,7 @@ struct netns_frags {
>  	int			timeout;
>  	int			high_thresh;
>  	int			low_thresh;
> +	int			max_depth;
>  };
>  
>  struct inet_frag_queue {
> @@ -43,6 +44,11 @@ struct inet_frag_queue {
>  
>  #define INETFRAGS_HASHSZ		64
>  
> +/* max_depth = max_threshold / INETFRAGS_HASHSZ / rounded up (SKB_TRUELEN(0) +
> + *	       sizeof(struct ipq or struct frag_queue))
> + */
> +#define INETFRAGS_MAXDEPTH_SHIFT	15
> +

This looks like a bit complex to me, as we cant change INETFRAGS_HASHSZ
from userland.

I would just a use a predefined max depth of 128, or even less.
If we have 128 items in a hash chain, then something is really wrong
with our choices.

> diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
> index 245ae07..92f1fdd 100644
> --- a/net/ipv4/inet_fragment.c
> +++ b/net/ipv4/inet_fragment.c
> @@ -277,6 +277,7 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
>  	__releases(&f->lock)
>  {
>  	struct inet_frag_queue *q;
> +	int depth = 0;
>  
>  	hlist_for_each_entry(q, &f->hash[hash], list) {
>  		if (q->net == nf && f->match(q, key)) {
> @@ -284,9 +285,31 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
>  			read_unlock(&f->lock);
>  			return q;
>  		}
> +		depth++;
>  	}
>  	read_unlock(&f->lock);
>  
> -	return inet_frag_create(nf, f, key);
> +	if (!nf->max_depth || depth <= nf->max_depth)
> +		return inet_frag_create(nf, f, key);
> +	else
> +		return NULL;
>  }

I would issue a one one time warning in syslog when depth exceeds the
limit.

Thanks !

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-13  5:29             ` Eric Dumazet
@ 2013-03-14  1:37               ` Hannes Frederic Sowa
  2013-03-14  4:36                 ` Stephen Hemminger
  2013-03-14  7:10                 ` Jesper Dangaard Brouer
  0 siblings, 2 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-14  1:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, yoshfuji, brouer

On Wed, Mar 13, 2013 at 06:29:28AM +0100, Eric Dumazet wrote:
> I would issue a one one time warning in syslog when depth exceeds the
> limit.

I addressed your suggestion to simplify this patch.

I decided against a once message but used net_ratelimit() (as it was
already used by the warning about no memory available). I don't have
a strong opinion on that, just thought it could be a recurring event
which would be worth reporting again because it should only happen on
strange/malicous traffic patterns where admins should act.

I based this patch on the net tree.

Thanks!

[PATCH net] inet: limit length of fragment queue hash table bucket lists

This patch introduces a constant limit of the fragment queue hash
table bucket list lengths. Currently the limit 128 is choosen somewhat
arbitrary and just ensures that we can fill up the fragment cache with
empty packets up to the default ip_frag_high_thresh limits. It should
just protect from list iteration eating considerable amounts of cpu.

If we reach the maximum length in one hash bucket a warning is printed.
This is implemented on the caller side of inet_frag_find to distinguish
between the different users of inet_fragment.c.

Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Jesper Dangaard Brouer <jbrouer@redhat.com>
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 include/net/inet_frag.h                 | 30 ++++++++++++++++++++++++++++++
 net/ipv4/inet_fragment.c                |  7 ++++++-
 net/ipv4/ip_fragment.c                  |  9 ++-------
 net/ipv6/netfilter/nf_conntrack_reasm.c | 10 ++++------
 net/ipv6/reassembly.c                   |  6 ++++--
 5 files changed, 46 insertions(+), 16 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 76c3fe5..0350468 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -43,6 +43,13 @@ struct inet_frag_queue {
 
 #define INETFRAGS_HASHSZ		64
 
+/* averaged:
+ * max_depth = default ipfrag_high_thresh / INETFRAGS_HASHSZ /
+ *	       rounded up (SKB_TRUELEN(0) + sizeof(struct ipq or
+ *	       struct frag_queue))
+ */
+#define INETFRAGS_MAXDEPTH		128
+
 struct inet_frags {
 	struct hlist_head	hash[INETFRAGS_HASHSZ];
 	/* This rwlock is a global lock (seperate per IPv4, IPv6 and
@@ -77,6 +84,29 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 		struct inet_frags *f, void *key, unsigned int hash)
 	__releases(&f->lock);
 
+#define INET_FRAG_FIND_CHECK(val)					\
+	({								\
+		static const char ___mem[] =				\
+			KERN_ERR pr_fmt(				\
+				"inet_frag_find: No memory left."	\
+				" Dropping fragment.\n");		\
+		static const char ___limit[] =				\
+			KERN_WARNING pr_fmt(				\
+				"inet_frag_find: Fragment hash bucket"	\
+				" list length grew above limit "	\
+				__stringify(INETFRAGS_MAXDEPTH)		\
+				". Dropping fragment.\n");		\
+		bool ___b = true;					\
+		if (IS_ERR_OR_NULL(val)) {				\
+			___b = false;					\
+			if (PTR_ERR(val) == -ENOBUFS)			\
+				LIMIT_NETDEBUG(___limit);		\
+			else						\
+				LIMIT_NETDEBUG(___mem);			\
+		}							\
+		___b;							\
+	})
+
 static inline void inet_frag_put(struct inet_frag_queue *q, struct inet_frags *f)
 {
 	if (atomic_dec_and_test(&q->refcnt))
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 245ae07..0022a3e 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -277,6 +277,7 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 	__releases(&f->lock)
 {
 	struct inet_frag_queue *q;
+	int depth = 0;
 
 	hlist_for_each_entry(q, &f->hash[hash], list) {
 		if (q->net == nf && f->match(q, key)) {
@@ -284,9 +285,13 @@ struct inet_frag_queue *inet_frag_find(struct netns_frags *nf,
 			read_unlock(&f->lock);
 			return q;
 		}
+		depth++;
 	}
 	read_unlock(&f->lock);
 
-	return inet_frag_create(nf, f, key);
+	if (depth <= INETFRAGS_MAXDEPTH)
+		return inet_frag_create(nf, f, key);
+	else
+		return ERR_PTR(-ENOBUFS);
 }
 EXPORT_SYMBOL(inet_frag_find);
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index b6d30ac..8533316 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -292,14 +292,9 @@ static inline struct ipq *ip_find(struct net *net, struct iphdr *iph, u32 user)
 	hash = ipqhashfn(iph->id, iph->saddr, iph->daddr, iph->protocol);
 
 	q = inet_frag_find(&net->ipv4.frags, &ip4_frags, &arg, hash);
-	if (q == NULL)
-		goto out_nomem;
-
+	if (!INET_FRAG_FIND_CHECK(q))
+		return NULL;
 	return container_of(q, struct ipq, q);
-
-out_nomem:
-	LIMIT_NETDEBUG(KERN_ERR pr_fmt("ip_frag_create: no memory left !\n"));
-	return NULL;
 }
 
 /* Is the fragment too far ahead to be part of ipq? */
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index 54087e9..f56468b 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -14,6 +14,8 @@
  * 2 of the License, or (at your option) any later version.
  */
 
+#define pr_fmt(fmt) "IPv6-nf: " fmt
+
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/string.h>
@@ -180,13 +182,9 @@ static inline struct frag_queue *fq_find(struct net *net, __be32 id,
 
 	q = inet_frag_find(&net->nf_frag.frags, &nf_frags, &arg, hash);
 	local_bh_enable();
-	if (q == NULL)
-		goto oom;
-
+	if (!INET_FRAG_FIND_CHECK(q))
+		return NULL;
 	return container_of(q, struct frag_queue, q);
-
-oom:
-	return NULL;
 }
 
 
diff --git a/net/ipv6/reassembly.c b/net/ipv6/reassembly.c
index 3c6a772..7dd0841 100644
--- a/net/ipv6/reassembly.c
+++ b/net/ipv6/reassembly.c
@@ -26,6 +26,9 @@
  *	YOSHIFUJI,H. @USAGI	Always remove fragment header to
  *				calculate ICV correctly.
  */
+
+#define pr_fmt(fmt) "IPv6: " fmt
+
 #include <linux/errno.h>
 #include <linux/types.h>
 #include <linux/string.h>
@@ -185,9 +188,8 @@ fq_find(struct net *net, __be32 id, const struct in6_addr *src, const struct in6
 	hash = inet6_hash_frag(id, src, dst, ip6_frags.rnd);
 
 	q = inet_frag_find(&net->ipv6.frags, &ip6_frags, &arg, hash);
-	if (q == NULL)
+	if (!INET_FRAG_FIND_CHECK(q))
 		return NULL;
-
 	return container_of(q, struct frag_queue, q);
 }
 
-- 
1.8.1.4

^ permalink raw reply related	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  1:37               ` Hannes Frederic Sowa
@ 2013-03-14  4:36                 ` Stephen Hemminger
  2013-03-14  7:14                   ` Hannes Frederic Sowa
  2013-03-14  7:10                 ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 25+ messages in thread
From: Stephen Hemminger @ 2013-03-14  4:36 UTC (permalink / raw)
  To: Eric Dumazet, netdev, yoshfuji, brouer

> +#define INET_FRAG_FIND_CHECK(val)                                      \
> +       ({                                                              \
> +               static const char ___mem[] =                            \
> +                       KERN_ERR pr_fmt(                                \
> +                               "inet_frag_find: No memory left."       \
> +                               " Dropping fragment.\n");               \
> +               static const char ___limit[] =                          \
> +                       KERN_WARNING pr_fmt(                            \
> +                               "inet_frag_find: Fragment hash bucket"  \
> +                               " list length grew above limit "        \
> +                               __stringify(INETFRAGS_MAXDEPTH)         \
> +                               ". Dropping fragment.\n");              \
> +               bool ___b = true;                                       \
> +               if (IS_ERR_OR_NULL(val)) {                              \
> +                       ___b = false;                                   \
> +                       if (PTR_ERR(val) == -ENOBUFS)                   \
> +                               LIMIT_NETDEBUG(___limit);               \
> +                       else                                            \
> +                               LIMIT_NETDEBUG(___mem);                 \
> +               }                                                       \
> +               ___b;                                                   \
> +       })
> +

Big macros suck, write it as an inline function or better yet a real function.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  4:36                 ` Stephen Hemminger
@ 2013-03-14  7:14                   ` Hannes Frederic Sowa
  2013-03-14  9:47                     ` David Laight
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-14  7:14 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Eric Dumazet, netdev, yoshfuji, brouer

On Wed, Mar 13, 2013 at 09:36:52PM -0700, Stephen Hemminger wrote:
> > +#define INET_FRAG_FIND_CHECK(val)                                      \
> > +       ({                                                              \
> > +               static const char ___mem[] =                            \
> > +                       KERN_ERR pr_fmt(                                \
> > +                               "inet_frag_find: No memory left."       \
> > +                               " Dropping fragment.\n");               \
> > +               static const char ___limit[] =                          \
> > +                       KERN_WARNING pr_fmt(                            \
> > +                               "inet_frag_find: Fragment hash bucket"  \
> > +                               " list length grew above limit "        \
> > +                               __stringify(INETFRAGS_MAXDEPTH)         \
> > +                               ". Dropping fragment.\n");              \
> > +               bool ___b = true;                                       \
> > +               if (IS_ERR_OR_NULL(val)) {                              \
> > +                       ___b = false;                                   \
> > +                       if (PTR_ERR(val) == -ENOBUFS)                   \
> > +                               LIMIT_NETDEBUG(___limit);               \
> > +                       else                                            \
> > +                               LIMIT_NETDEBUG(___mem);                 \
> > +               }                                                       \
> > +               ___b;                                                   \
> > +       })
> > +
> 
> Big macros suck, write it as an inline function or better yet a real function.

I switched to the macro to have string expansion with pr_fmt. So it is visible
from the dmesg if IPv4, IPv6 or IPv6-nf did generate the message. This could
be done with a function, too, but would require a bit more string handling.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  7:14                   ` Hannes Frederic Sowa
@ 2013-03-14  9:47                     ` David Laight
  2013-03-14 10:34                       ` Eric Dumazet
  2013-03-14 12:34                       ` Hannes Frederic Sowa
  0 siblings, 2 replies; 25+ messages in thread
From: David Laight @ 2013-03-14  9:47 UTC (permalink / raw)
  To: Hannes Frederic Sowa, Stephen Hemminger
  Cc: Eric Dumazet, netdev, yoshfuji, brouer

> On Wed, Mar 13, 2013 at 09:36:52PM -0700, Stephen Hemminger wrote:
> > > +#define INET_FRAG_FIND_CHECK(val)                                      \
> > > +       ({                                                              \
> > > +               static const char ___mem[] =                            \
> > > +                       KERN_ERR pr_fmt(                                \
> > > +                               "inet_frag_find: No memory left."       \
> > > +                               " Dropping fragment.\n");               \
> > > +               static const char ___limit[] =                          \
> > > +                       KERN_WARNING pr_fmt(                            \
> > > +                               "inet_frag_find: Fragment hash bucket"  \
> > > +                               " list length grew above limit "        \
> > > +                               __stringify(INETFRAGS_MAXDEPTH)         \
> > > +                               ". Dropping fragment.\n");              \
> > > +               bool ___b = true;                                       \
> > > +               if (IS_ERR_OR_NULL(val)) {                              \
> > > +                       ___b = false;                                   \
> > > +                       if (PTR_ERR(val) == -ENOBUFS)                   \
> > > +                               LIMIT_NETDEBUG(___limit);               \
> > > +                       else                                            \
> > > +                               LIMIT_NETDEBUG(___mem);                 \
> > > +               }                                                       \
> > > +               ___b;                                                   \
> > > +       })
> > > +
> >
> > Big macros suck, write it as an inline function or better yet a real function.
> 
> I switched to the macro to have string expansion with pr_fmt. So it is visible
> from the dmesg if IPv4, IPv6 or IPv6-nf did generate the message. This could
> be done with a function, too, but would require a bit more string handling.

I'd guess it would be best to have the IS_ERR_OR_NULL() inline calling a
real function on error.
I'd also have thought that INETFRAGS_MAXDEPTH should be run-time tunable.

	David


^ permalink raw reply	[flat|nested] 25+ messages in thread

* RE: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  9:47                     ` David Laight
@ 2013-03-14 10:34                       ` Eric Dumazet
  2013-03-14 12:34                       ` Hannes Frederic Sowa
  1 sibling, 0 replies; 25+ messages in thread
From: Eric Dumazet @ 2013-03-14 10:34 UTC (permalink / raw)
  To: David Laight
  Cc: Hannes Frederic Sowa, Stephen Hemminger, netdev, yoshfuji, brouer

On Thu, 2013-03-14 at 09:47 +0000, David Laight wrote:

> I'd also have thought that INETFRAGS_MAXDEPTH should be run-time tunable.

The useful thing would be to be able to resize the hash table, for some
heavy user cases.

Check net/ipv4/tcp_metrics.c : Not only we use a small depth, but is not
tunable.

#define TCP_METRICS_RECLAIM_DEPTH   5

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  9:47                     ` David Laight
  2013-03-14 10:34                       ` Eric Dumazet
@ 2013-03-14 12:34                       ` Hannes Frederic Sowa
  1 sibling, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-14 12:34 UTC (permalink / raw)
  To: David Laight; +Cc: Stephen Hemminger, Eric Dumazet, netdev, yoshfuji, brouer

On Thu, Mar 14, 2013 at 09:47:24AM -0000, David Laight wrote:
> > On Wed, Mar 13, 2013 at 09:36:52PM -0700, Stephen Hemminger wrote:
> > > > +#define INET_FRAG_FIND_CHECK(val)                                      \
> > > > +       ({                                                              \
> > > > +               static const char ___mem[] =                            \
> > > > +                       KERN_ERR pr_fmt(                                \
> > > > +                               "inet_frag_find: No memory left."       \
> > > > +                               " Dropping fragment.\n");               \
> > > > +               static const char ___limit[] =                          \
> > > > +                       KERN_WARNING pr_fmt(                            \
> > > > +                               "inet_frag_find: Fragment hash bucket"  \
> > > > +                               " list length grew above limit "        \
> > > > +                               __stringify(INETFRAGS_MAXDEPTH)         \
> > > > +                               ". Dropping fragment.\n");              \
> > > > +               bool ___b = true;                                       \
> > > > +               if (IS_ERR_OR_NULL(val)) {                              \
> > > > +                       ___b = false;                                   \
> > > > +                       if (PTR_ERR(val) == -ENOBUFS)                   \
> > > > +                               LIMIT_NETDEBUG(___limit);               \
> > > > +                       else                                            \
> > > > +                               LIMIT_NETDEBUG(___mem);                 \
> > > > +               }                                                       \
> > > > +               ___b;                                                   \
> > > > +       })
> > > > +
> > >
> > > Big macros suck, write it as an inline function or better yet a real function.
> > 
> > I switched to the macro to have string expansion with pr_fmt. So it is visible
> > from the dmesg if IPv4, IPv6 or IPv6-nf did generate the message. This could
> > be done with a function, too, but would require a bit more string handling.
> 
> I'd guess it would be best to have the IS_ERR_OR_NULL() inline calling a
> real function on error.

I'll explore later on how to achieve this. My rational was to
not do dynamic string handling to build the error messages. Perhaps I
can achieve both at the same time. :)

> I'd also have thought that INETFRAGS_MAXDEPTH should be run-time tunable.

128 is actually a pretty high limit. Have you seen the patch I posted before
this one? I tried to come up with a solution to dynamically calculate
max_depth based on the max_thresh of the fragmentation cache. What do you
think about that solution?

Thanks!

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  1:37               ` Hannes Frederic Sowa
  2013-03-14  4:36                 ` Stephen Hemminger
@ 2013-03-14  7:10                 ` Jesper Dangaard Brouer
  2013-03-14  7:23                   ` Hannes Frederic Sowa
  1 sibling, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2013-03-14  7:10 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Eric Dumazet, netdev, yoshfuji

On Thu, 2013-03-14 at 02:37 +0100, Hannes Frederic Sowa wrote:

> [PATCH net] inet: limit length of fragment queue hash table bucket lists
> 
> This patch introduces a constant limit of the fragment queue hash
> table bucket list lengths. Currently the limit 128 is choosen somewhat
> arbitrary and just ensures that we can fill up the fragment cache with
> empty packets up to the default ip_frag_high_thresh limits. It should
> just protect from list iteration eating considerable amounts of cpu.
> 
> If we reach the maximum length in one hash bucket a warning is printed.
> This is implemented on the caller side of inet_frag_find to distinguish
> between the different users of inet_fragment.c.

I like the idea of having a safe guard on the fragment queue hash table
bucket list lengths.  But I'm considering another cleanup/evictor
strategy, where we drop the LRU list, and do frag eviction on a hash
bucket level (which will be more cache optimal).  This strategy would
also involve a list length limit.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  7:10                 ` Jesper Dangaard Brouer
@ 2013-03-14  7:23                   ` Hannes Frederic Sowa
  2013-03-14  7:28                     ` Hannes Frederic Sowa
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-14  7:23 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Eric Dumazet, netdev, yoshfuji

On Thu, Mar 14, 2013 at 08:10:40AM +0100, Jesper Dangaard Brouer wrote:
> On Thu, 2013-03-14 at 02:37 +0100, Hannes Frederic Sowa wrote:
> 
> > [PATCH net] inet: limit length of fragment queue hash table bucket lists
> > 
> > This patch introduces a constant limit of the fragment queue hash
> > table bucket list lengths. Currently the limit 128 is choosen somewhat
> > arbitrary and just ensures that we can fill up the fragment cache with
> > empty packets up to the default ip_frag_high_thresh limits. It should
> > just protect from list iteration eating considerable amounts of cpu.
> > 
> > If we reach the maximum length in one hash bucket a warning is printed.
> > This is implemented on the caller side of inet_frag_find to distinguish
> > between the different users of inet_fragment.c.
> 
> I like the idea of having a safe guard on the fragment queue hash table
> bucket list lengths.  But I'm considering another cleanup/evictor
> strategy, where we drop the LRU list, and do frag eviction on a hash
> bucket level (which will be more cache optimal).  This strategy would
> also involve a list length limit.

I would try to get a simple guard into v3.9. In 3.9 the hashing of the key
of ipv6 fragments changed in such a way that an attacker could generate
fragments which would end up in just one hash chain, thus eating a lot
of cpu time because of list traversal. Later on, when you post your
patches we could simply revert/update this safeguard to your version.

Thanks,

  Hannes

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  7:23                   ` Hannes Frederic Sowa
@ 2013-03-14  7:28                     ` Hannes Frederic Sowa
  2013-03-14  9:18                       ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-14  7:28 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, Eric Dumazet, netdev, yoshfuji

On Thu, Mar 14, 2013 at 08:23:41AM +0100, Hannes Frederic Sowa wrote:
> On Thu, Mar 14, 2013 at 08:10:40AM +0100, Jesper Dangaard Brouer wrote:
> > On Thu, 2013-03-14 at 02:37 +0100, Hannes Frederic Sowa wrote:
> > 
> > > [PATCH net] inet: limit length of fragment queue hash table bucket lists
> > > 
> > > This patch introduces a constant limit of the fragment queue hash
> > > table bucket list lengths. Currently the limit 128 is choosen somewhat
> > > arbitrary and just ensures that we can fill up the fragment cache with
> > > empty packets up to the default ip_frag_high_thresh limits. It should
> > > just protect from list iteration eating considerable amounts of cpu.
> > > 
> > > If we reach the maximum length in one hash bucket a warning is printed.
> > > This is implemented on the caller side of inet_frag_find to distinguish
> > > between the different users of inet_fragment.c.
> > 
> > I like the idea of having a safe guard on the fragment queue hash table
> > bucket list lengths.  But I'm considering another cleanup/evictor
> > strategy, where we drop the LRU list, and do frag eviction on a hash
> > bucket level (which will be more cache optimal).  This strategy would
> > also involve a list length limit.
> 
> I would try to get a simple guard into v3.9. In 3.9 the hashing of the key
> of ipv6 fragments changed in such a way that an attacker could generate
> fragments which would end up in just one hash chain, thus eating a lot
> of cpu time because of list traversal. Later on, when you post your
> patches we could simply revert/update this safeguard to your version.

I just wanted to mention that if you plan to target v3.9 with some patches we
could simply drop this patch.

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  7:28                     ` Hannes Frederic Sowa
@ 2013-03-14  9:18                       ` Jesper Dangaard Brouer
  2013-03-14 12:45                         ` Hannes Frederic Sowa
  0 siblings, 1 reply; 25+ messages in thread
From: Jesper Dangaard Brouer @ 2013-03-14  9:18 UTC (permalink / raw)
  To: Hannes Frederic Sowa; +Cc: Eric Dumazet, netdev, yoshfuji

On Thu, 2013-03-14 at 08:28 +0100, Hannes Frederic Sowa wrote:
> On Thu, Mar 14, 2013 at 08:23:41AM +0100, Hannes Frederic Sowa wrote:
> > On Thu, Mar 14, 2013 at 08:10:40AM +0100, Jesper Dangaard Brouer wrote:
> > > On Thu, 2013-03-14 at 02:37 +0100, Hannes Frederic Sowa wrote:
> > > 
> > > > [PATCH net] inet: limit length of fragment queue hash table bucket lists
> > > > 
> > > > This patch introduces a constant limit of the fragment queue hash
> > > > table bucket list lengths. Currently the limit 128 is choosen somewhat
> > > > arbitrary and just ensures that we can fill up the fragment cache with
> > > > empty packets up to the default ip_frag_high_thresh limits. It should
> > > > just protect from list iteration eating considerable amounts of cpu.
> > > > 
> > > > If we reach the maximum length in one hash bucket a warning is printed.
> > > > This is implemented on the caller side of inet_frag_find to distinguish
> > > > between the different users of inet_fragment.c.
> > > 
> > > I like the idea of having a safe guard on the fragment queue hash table
> > > bucket list lengths.  But I'm considering another cleanup/evictor
> > > strategy, where we drop the LRU list, and do frag eviction on a hash
> > > bucket level (which will be more cache optimal).  This strategy would
> > > also involve a list length limit.
> > 
> > I would try to get a simple guard into v3.9. In 3.9 the hashing of the key
> > of ipv6 fragments changed in such a way that an attacker could generate
> > fragments which would end up in just one hash chain, thus eating a lot
> > of cpu time because of list traversal. Later on, when you post your
> > patches we could simply revert/update this safeguard to your version.
> 
> I just wanted to mention that if you plan to target v3.9 with some patches we
> could simply drop this patch.

I will start working on this as soon as Netfilter Workshop is over and I
have recovered from organizing this event.  DaveM told me I needed to
finish my frag patches first, before I could implement all the other
cool ideas we have come up with during the workshop ;-)

But I don't know if my frag changes can make v3.9, maybe v3.10 is more
realistic?  In which case we should use your patch to close this problem
on v3.9 IMHO.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 25+ messages in thread

* Re: [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table
  2013-03-14  9:18                       ` Jesper Dangaard Brouer
@ 2013-03-14 12:45                         ` Hannes Frederic Sowa
  0 siblings, 0 replies; 25+ messages in thread
From: Hannes Frederic Sowa @ 2013-03-14 12:45 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: Eric Dumazet, netdev, yoshfuji

On Thu, Mar 14, 2013 at 10:18:10AM +0100, Jesper Dangaard Brouer wrote:
> I will start working on this as soon as Netfilter Workshop is over and I
> have recovered from organizing this event.  DaveM told me I needed to
> finish my frag patches first, before I could implement all the other
> cool ideas we have come up with during the workshop ;-)

Yeah, I have had a look at your patches, looks promising. Definitely worth
that you get some pressure to finish them. :)

> But I don't know if my frag changes can make v3.9, maybe v3.10 is more
> realistic?  In which case we should use your patch to close this problem
> on v3.9 IMHO.

Sounds good.

^ permalink raw reply	[flat|nested] 25+ messages in thread

end of thread, other threads:[~2013-03-14 12:45 UTC | newest]

Thread overview: 25+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-03-07 21:42 [PATCH RFC] ipv6: use stronger hash for reassembly queue hash table Hannes Frederic Sowa
2013-03-08  5:57 ` Hannes Frederic Sowa
2013-03-08 13:04   ` Hannes Frederic Sowa
2013-03-08 14:53     ` Eric Dumazet
2013-03-08 15:08       ` Hannes Frederic Sowa
2013-03-08 15:23         ` Eric Dumazet
2013-03-08 15:54           ` Hannes Frederic Sowa
2013-03-08 16:15             ` Eric Dumazet
2013-03-08 16:18               ` Hannes Frederic Sowa
2013-03-09 15:19             ` Hannes Frederic Sowa
2013-03-08 20:53           ` Hannes Frederic Sowa
2013-03-13  1:27           ` Hannes Frederic Sowa
2013-03-13  1:31             ` Hannes Frederic Sowa
2013-03-13  5:29             ` Eric Dumazet
2013-03-14  1:37               ` Hannes Frederic Sowa
2013-03-14  4:36                 ` Stephen Hemminger
2013-03-14  7:14                   ` Hannes Frederic Sowa
2013-03-14  9:47                     ` David Laight
2013-03-14 10:34                       ` Eric Dumazet
2013-03-14 12:34                       ` Hannes Frederic Sowa
2013-03-14  7:10                 ` Jesper Dangaard Brouer
2013-03-14  7:23                   ` Hannes Frederic Sowa
2013-03-14  7:28                     ` Hannes Frederic Sowa
2013-03-14  9:18                       ` Jesper Dangaard Brouer
2013-03-14 12:45                         ` Hannes Frederic Sowa

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).