* [PATCH net-next] pkt_sched: fq: better control of DDOS traffic
@ 2015-01-30 14:06 Eric Dumazet
  2015-02-03  2:18 ` David Miller
  2015-02-03 17:32 ` [PATCH v2 " Eric Dumazet
  0 siblings, 2 replies; 8+ messages in thread
From: Eric Dumazet @ 2015-01-30 14:06 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

FQ has a fast path for skbs attached to a socket, as it does not
have to compute a flow hash for them. But for other packets, FQ
being non-stochastic means that hosts exposed to random Internet
traffic can allocate millions of flow structures (104 bytes each)
quite easily. Not only can the host OOM, but lookups in the RB
trees can consume too much cpu and memory.

This patch adds a new attribute, orphan_mask, which makes it
possible to use a stochastic hash for orphaned skbs.

Its default value is 1024 slots.

This patch also handles the specific case of SYNACK messages:

They are attached to the listener socket, and therefore all map
to a single hash bucket. If the listener has set SO_MAX_PACING_RATE,
hoping that newly accepted sockets inherit this rate, SYNACK
messages might be paced and even dropped.

This is very similar to an internal patch Google has used for more
than a year.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/uapi/linux/pkt_sched.h |    2 ++
 net/sched/sch_fq.c             |   19 +++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index d62316baae94..534b84710745 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -774,6 +774,8 @@ enum {
 
 	TCA_FQ_FLOW_REFILL_DELAY,	/* flow credit refill delay in usec */
 
+	TCA_FQ_ORPHAN_MASK,	/* mask applied to orphaned skb hashes */
+
 	__TCA_FQ_MAX
 };
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 9b05924cc386..0e7d7b98fc93 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -92,6 +92,7 @@ struct fq_sched_data {
 	u32		flow_refill_delay;
 	u32		flow_max_rate;	/* optional max rate per flow */
 	u32		flow_plimit;	/* max packets per flow */
+	u32		orphan_mask;	/* mask for orphaned skb */
 	struct rb_root	*fq_root;
 	u8		rate_enable;
 	u8		fq_trees_log;
@@ -222,11 +223,20 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q)
 	if (unlikely((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL))
 		return &q->internal;
 
-	if (unlikely(!sk)) {
+	/* SYNACK messages are attached to a listener socket.
+	 * 1) They are not part of a 'flow' yet
+	 * 2) We do not want to rate limit them (eg SYNFLOOD attack),
+	 *    especially if the listener set SO_MAX_PACING_RATE
+	 * 3) We pretend they are orphaned
+	 */
+	if (!sk || sk->sk_state == TCP_LISTEN) {
+		u32 hash = skb_get_hash(skb) & q->orphan_mask;
+
 		/* By forcing low order bit to 1, we make sure to not
 		 * collide with a local flow (socket pointers are word aligned)
 		 */
-		sk = (struct sock *)(skb_get_hash(skb) | 1L);
+		sk = (struct sock *)(hash | 1L);
+		skb_orphan(skb);
 	}
 
 	root = &q->fq_root[hash_32((u32)(long)sk, q->fq_trees_log)];
@@ -698,6 +708,9 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt)
 		q->flow_refill_delay = usecs_to_jiffies(usecs_delay);
 	}
 
+	if (tb[TCA_FQ_ORPHAN_MASK])
+		q->orphan_mask = nla_get_u32(tb[TCA_FQ_ORPHAN_MASK]);
+
 	if (!err) {
 		sch_tree_unlock(sch);
 		err = fq_resize(sch, fq_log);
@@ -743,6 +756,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt)
 	q->delayed		= RB_ROOT;
 	q->fq_root		= NULL;
 	q->fq_trees_log		= ilog2(1024);
+	q->orphan_mask		= (1024 - 1) << 1;
 	qdisc_watchdog_init(&q->watchdog, sch);
 
 	if (opt)
@@ -772,6 +786,7 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_FLOW_MAX_RATE, q->flow_max_rate) ||
 	    nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
 			jiffies_to_usecs(q->flow_refill_delay)) ||
+	    nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
 	    nla_put_u32(skb, TCA_FQ_BUCKETS_LOG, q->fq_trees_log))
 		goto nla_put_failure;
 

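As a quick sanity check of the arithmetic above (illustration only, not
part of the patch): the default mask (1024 - 1) << 1 keeps hash bits
1..10, and forcing bit 0 to 1 then yields exactly 1024 distinct odd
keys, none of which can equal a word-aligned socket pointer. A toy
user-space program showing the computation:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t mask = (1024 - 1) << 1;	/* 0x7fe: keeps hash bits 1..10 */
	uint32_t hash = 0xdeadbeef;		/* stand-in for skb_get_hash() */
	uintptr_t key = (hash & mask) | 1UL;	/* always odd */

	/* possible keys: 0x001, 0x003, ..., 0x7ff -> 1024 of them */
	printf("mask=%#x key=%#lx\n", mask, (unsigned long)key);
	return 0;
}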

* Re: [PATCH net-next] pkt_sched: fq: better control of DDOS traffic
  2015-01-30 14:06 [PATCH net-next] pkt_sched: fq: better control of DDOS traffic Eric Dumazet
@ 2015-02-03  2:18 ` David Miller
  2015-02-03  3:34   ` Eric Dumazet
  2015-02-03 17:32 ` [PATCH v2 " Eric Dumazet
  1 sibling, 1 reply; 8+ messages in thread
From: David Miller @ 2015-02-03  2:18 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 30 Jan 2015 06:06:12 -0800

> From: Eric Dumazet <edumazet@google.com>
> 
> FQ has a fast path for skbs attached to a socket, as it does not
> have to compute a flow hash for them. But for other packets, FQ
> being non-stochastic means that hosts exposed to random Internet
> traffic can allocate millions of flow structures (104 bytes each)
> quite easily. Not only can the host OOM, but lookups in the RB
> trees can consume too much cpu and memory.
> 
> This patch adds a new attribute, orphan_mask, which makes it
> possible to use a stochastic hash for orphaned skbs.
> 
> Its default value is 1024 slots.
> 
> This patch also handles the specific case of SYNACK messages:
> 
> They are attached to the listener socket, and therefore all map
> to a single hash bucket. If the listener has set SO_MAX_PACING_RATE,
> hoping that newly accepted sockets inherit this rate, SYNACK
> messages might be paced and even dropped.
> 
> This is very similar to an internal patch Google has used for more
> than a year.
> 
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Can you document the mask value a little bit more?

For example, I don't understand why "(1024 - 1) << 1" means 1024
slots just from looking at this change.

Thanks.


* Re: [PATCH net-next] pkt_sched: fq: better control of DDOS traffic
  2015-02-03  2:18 ` David Miller
@ 2015-02-03  3:34   ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2015-02-03  3:34 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Mon, 2015-02-02 at 18:18 -0800, David Miller wrote:

> Can you document the mask value a little bit more?
> 
> For example, I don't understand why "(1024 - 1) << 1" means 1024
> slots just from looking at this change.
> 

Sure, will do.

That's because we reserve the low order bit of the sk/hash value,
to make sure the stochastic hash is only applied to traffic that
is not locally generated.
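
To illustrate (a sketch of the key space, not code from the patch):

/*
 * Socket pointers are word aligned, so a local flow key always has
 * its low bit clear:
 *	local flow key		....xxxxxx00	(the socket pointer)
 * Orphaned skbs get a masked hash with the low bit forced to 1:
 *	stochastic flow key	....hhhhhhh1
 * The two populations are disjoint, so a stochastic flow can never
 * collide with a local flow in the rbtrees.
 */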

Thanks


* [PATCH v2 net-next] pkt_sched: fq: better control of DDOS traffic
  2015-01-30 14:06 [PATCH net-next] pkt_sched: fq: better control of DDOS traffic Eric Dumazet
  2015-02-03  2:18 ` David Miller
@ 2015-02-03 17:32 ` Eric Dumazet
  2015-02-05  4:22   ` David Miller
  2015-02-05  5:30   ` [PATCH v3 " Eric Dumazet
  1 sibling, 2 replies; 8+ messages in thread
From: Eric Dumazet @ 2015-02-03 17:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

FQ has a fast path for skbs attached to a socket, as it does not
have to compute a flow hash for them. But for other packets, FQ
being non-stochastic means that hosts exposed to random Internet
traffic can allocate millions of flow structures (104 bytes each)
quite easily. Not only can the host OOM, but lookups in the RB
trees can consume too much cpu and memory.

This patch adds a new attribute, orphan_mask, which makes it
possible to use a stochastic hash for orphaned skbs.

Its default value is 1024 slots, to mimic SFQ behavior.

Note: this does not apply to locally generated TCP traffic,
and no locally generated traffic will share a flow structure
with another perfect or stochastic flow.

This patch also handles the specific case of SYNACK messages:

They are attached to the listener socket, and therefore all map
to a single hash bucket. If the listener has set SO_MAX_PACING_RATE,
hoping that newly accepted sockets inherit this rate, SYNACK
messages might be paced and even dropped.

This is very similar to an internal patch Google has used for more
than a year.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
v2: moved the left shift into fq_classify().

 net/sched/sch_fq.c |   19 +++++++++++++++++--
 1 file changed, 17 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 2a50f5c62070a81ae37d871aac2626555128fd38..c7dee59763454777e8bb2c028d932340b2ced5da 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -92,6 +92,7 @@ struct fq_sched_data {
 	u32		flow_refill_delay;
 	u32		flow_max_rate;	/* optional max rate per flow */
 	u32		flow_plimit;	/* max packets per flow */
+	u32		orphan_mask;	/* mask for orphaned skb */
 	struct rb_root	*fq_root;
 	u8		rate_enable;
 	u8		fq_trees_log;
@@ -222,11 +223,20 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q)
 	if (unlikely((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL))
 		return &q->internal;
 
-	if (unlikely(!sk)) {
+	/* SYNACK messages are attached to a listener socket.
+	 * 1) They are not part of a 'flow' yet
+	 * 2) We do not want to rate limit them (eg SYNFLOOD attack),
+	 *    especially if the listener set SO_MAX_PACING_RATE
+	 * 3) We pretend they are orphaned
+	 */
+	if (!sk || sk->sk_state == TCP_LISTEN) {
+		unsigned long hash = skb_get_hash(skb) & q->orphan_mask;
+
 		/* By forcing low order bit to 1, we make sure to not
 		 * collide with a local flow (socket pointers are word aligned)
 		 */
-		sk = (struct sock *)(skb_get_hash(skb) | 1L);
+		sk = (struct sock *)((hash << 1) | 1UL);
+		skb_orphan(skb);
 	}
 
 	root = &q->fq_root[hash_32((u32)(long)sk, q->fq_trees_log)];
@@ -698,6 +708,9 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt)
 		q->flow_refill_delay = usecs_to_jiffies(usecs_delay);
 	}
 
+	if (tb[TCA_FQ_ORPHAN_MASK])
+		q->orphan_mask = nla_get_u32(tb[TCA_FQ_ORPHAN_MASK]);
+
 	if (!err) {
 		sch_tree_unlock(sch);
 		err = fq_resize(sch, fq_log);
@@ -743,6 +756,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt)
 	q->delayed		= RB_ROOT;
 	q->fq_root		= NULL;
 	q->fq_trees_log		= ilog2(1024);
+	q->orphan_mask		= 1024 - 1;
 	qdisc_watchdog_init(&q->watchdog, sch);
 
 	if (opt)
@@ -772,6 +786,7 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_FLOW_MAX_RATE, q->flow_max_rate) ||
 	    nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
 			jiffies_to_usecs(q->flow_refill_delay)) ||
+	    nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
 	    nla_put_u32(skb, TCA_FQ_BUCKETS_LOG, q->fq_trees_log))
 		goto nla_put_failure;
 

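For readers diffing v1 against v2: the user-visible meaning of the
attribute changed slightly. The shift now happens inside fq_classify(),
so orphan_mask is interpreted directly as "number of stochastic slots
minus one". A side-by-side reading of the two diffs above:

/*
 * v1: mask pre-shifted by the qdisc	v2: plain "slots - 1" mask
 *	orphan_mask = (1024 - 1) << 1;		orphan_mask = 1024 - 1;
 *	key = (hash & mask) | 1;		key = ((hash & mask) << 1) | 1;
 *
 * Either way the default yields 1024 distinct odd keys; v2 just makes
 * the configured value the intuitive one.
 */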

* Re: [PATCH v2 net-next] pkt_sched: fq: better control of DDOS traffic
  2015-02-03 17:32 ` [PATCH v2 " Eric Dumazet
@ 2015-02-05  4:22   ` David Miller
  2015-02-05  5:24     ` Eric Dumazet
  2015-02-05  5:30   ` [PATCH v3 " Eric Dumazet
  1 sibling, 1 reply; 8+ messages in thread
From: David Miller @ 2015-02-05  4:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 03 Feb 2015 09:32:19 -0800

> +	if (tb[TCA_FQ_ORPHAN_MASK])
> +		q->orphan_mask = nla_get_u32(tb[TCA_FQ_ORPHAN_MASK]);

This doesn't build; the header file changes that add TCA_FQ_ORPHAN_MASK
are missing from your patch.


* Re: [PATCH v2 net-next] pkt_sched: fq: better control of DDOS traffic
  2015-02-05  4:22   ` David Miller
@ 2015-02-05  5:24     ` Eric Dumazet
  0 siblings, 0 replies; 8+ messages in thread
From: Eric Dumazet @ 2015-02-05  5:24 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Wed, 2015-02-04 at 20:22 -0800, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 03 Feb 2015 09:32:19 -0800
> 
> > +	if (tb[TCA_FQ_ORPHAN_MASK])
> > +		q->orphan_mask = nla_get_u32(tb[TCA_FQ_ORPHAN_MASK]);
> 
> This doesn't build; the header file changes that add TCA_FQ_ORPHAN_MASK
> are missing from your patch.

Argh, sorry about that. Will send a v3.


* [PATCH v3 net-next] pkt_sched: fq: better control of DDOS traffic
  2015-02-03 17:32 ` [PATCH v2 " Eric Dumazet
  2015-02-05  4:22   ` David Miller
@ 2015-02-05  5:30   ` Eric Dumazet
  2015-02-05  6:17     ` David Miller
  1 sibling, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2015-02-05  5:30 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

FQ has a fast path for skbs attached to a socket, as it does not
have to compute a flow hash for them. But for other packets, FQ
being non-stochastic means that hosts exposed to random Internet
traffic can allocate millions of flow structures (104 bytes each)
quite easily. Not only can the host OOM, but lookups in the RB
trees can consume too much cpu and memory.

This patch adds a new attribute, orphan_mask, which makes it
possible to use a stochastic hash for orphaned skbs.

Its default value is 1024 slots, to mimic SFQ behavior.

Note: this does not apply to locally generated TCP traffic,
and no locally generated traffic will share a flow structure
with another perfect or stochastic flow.

This patch also handles the specific case of SYNACK messages:

They are attached to the listener socket, and therefore all map
to a single hash bucket. If the listener has set SO_MAX_PACING_RATE,
hoping that newly accepted sockets inherit this rate, SYNACK
messages might be paced and even dropped.

This is very similar to an internal patch Google has used for more
than a year.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
v3: added the missing include file change.
v2: moved the left shift into fq_classify().

 include/uapi/linux/pkt_sched.h |    2 ++
 net/sched/sch_fq.c             |   19 +++++++++++++++++--
 2 files changed, 19 insertions(+), 2 deletions(-)

diff --git a/include/uapi/linux/pkt_sched.h b/include/uapi/linux/pkt_sched.h
index d62316baae942c43b2558ed2768c88950516126c..534b847107453019d362e9f9f9c0969fc3100c8b 100644
--- a/include/uapi/linux/pkt_sched.h
+++ b/include/uapi/linux/pkt_sched.h
@@ -774,6 +774,8 @@ enum {
 
 	TCA_FQ_FLOW_REFILL_DELAY,	/* flow credit refill delay in usec */
 
+	TCA_FQ_ORPHAN_MASK,	/* mask applied to orphaned skb hashes */
+
 	__TCA_FQ_MAX
 };
 
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index 69a3dbf55c60271723e580b209282c8b3ae91ae8..a00c4304300101a093e834c73fdd3bdb2c2c38a3 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -93,6 +93,7 @@ struct fq_sched_data {
 	u32		flow_refill_delay;
 	u32		flow_max_rate;	/* optional max rate per flow */
 	u32		flow_plimit;	/* max packets per flow */
+	u32		orphan_mask;	/* mask for orphaned skb */
 	struct rb_root	*fq_root;
 	u8		rate_enable;
 	u8		fq_trees_log;
@@ -223,11 +224,20 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q)
 	if (unlikely((skb->priority & TC_PRIO_MAX) == TC_PRIO_CONTROL))
 		return &q->internal;
 
-	if (unlikely(!sk)) {
+	/* SYNACK messages are attached to a listener socket.
+	 * 1) They are not part of a 'flow' yet
+	 * 2) We do not want to rate limit them (eg SYNFLOOD attack),
+	 *    especially if the listener set SO_MAX_PACING_RATE
+	 * 3) We pretend they are orphaned
+	 */
+	if (!sk || sk->sk_state == TCP_LISTEN) {
+		unsigned long hash = skb_get_hash(skb) & q->orphan_mask;
+
 		/* By forcing low order bit to 1, we make sure to not
 		 * collide with a local flow (socket pointers are word aligned)
 		 */
-		sk = (struct sock *)(skb_get_hash(skb) | 1L);
+		sk = (struct sock *)((hash << 1) | 1UL);
+		skb_orphan(skb);
 	}
 
 	root = &q->fq_root[hash_32((u32)(long)sk, q->fq_trees_log)];
@@ -704,6 +714,9 @@ static int fq_change(struct Qdisc *sch, struct nlattr *opt)
 		q->flow_refill_delay = usecs_to_jiffies(usecs_delay);
 	}
 
+	if (tb[TCA_FQ_ORPHAN_MASK])
+		q->orphan_mask = nla_get_u32(tb[TCA_FQ_ORPHAN_MASK]);
+
 	if (!err) {
 		sch_tree_unlock(sch);
 		err = fq_resize(sch, fq_log);
@@ -749,6 +762,7 @@ static int fq_init(struct Qdisc *sch, struct nlattr *opt)
 	q->delayed		= RB_ROOT;
 	q->fq_root		= NULL;
 	q->fq_trees_log		= ilog2(1024);
+	q->orphan_mask		= 1024 - 1;
 	qdisc_watchdog_init(&q->watchdog, sch);
 
 	if (opt)
@@ -778,6 +792,7 @@ static int fq_dump(struct Qdisc *sch, struct sk_buff *skb)
 	    nla_put_u32(skb, TCA_FQ_FLOW_MAX_RATE, q->flow_max_rate) ||
 	    nla_put_u32(skb, TCA_FQ_FLOW_REFILL_DELAY,
 			jiffies_to_usecs(q->flow_refill_delay)) ||
+	    nla_put_u32(skb, TCA_FQ_ORPHAN_MASK, q->orphan_mask) ||
 	    nla_put_u32(skb, TCA_FQ_BUCKETS_LOG, q->fq_trees_log))
 		goto nla_put_failure;
 

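One practical note on configuring this: the qdisc accepts any u32 for
TCA_FQ_ORPHAN_MASK, but a mask with non-contiguous bits yields fewer
effective slots than (mask + 1); values of the form 2^n - 1 are the
natural choice. A user-space sanity check one might write before
setting the attribute (a sketch; the helper name is hypothetical, not
part of iproute2 or the kernel):

#include <stdbool.h>
#include <stdint.h>

/* True for 1, 3, 7, ..., 2^n - 1: set bits contiguous from bit 0,
 * giving exactly (mask + 1) stochastic slots.
 */
static bool fq_orphan_mask_is_uniform(uint32_t mask)
{
	return mask != 0 && (mask & (mask + 1)) == 0;
}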

* Re: [PATCH v3 net-next] pkt_sched: fq: better control of DDOS traffic
  2015-02-05  5:30   ` [PATCH v3 " Eric Dumazet
@ 2015-02-05  6:17     ` David Miller
  0 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2015-02-05  6:17 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 04 Feb 2015 21:30:40 -0800

> This patch adds a new attribute, orphan_mask, which makes it
> possible to use a stochastic hash for orphaned skbs.
 ...
> v3: added the missing include file change.
> v2: moved the left shift into fq_classify().

Applied, thanks Eric.

