Netdev List

Netdev List
 help / color / mirror / Atom feed

* [PATCH net-next v13 2/7] sch_cake: Add ingress mode
From: Toke Høiland-Jørgensen @ 2018-05-21 16:24 UTC (permalink / raw)
  To: netdev, cake
In-Reply-To: <152691970035.4083.3349902669266371765.stgit@alrua-kau>

The ingress mode is meant to be enabled when CAKE runs downlink of the
actual bottleneck (such as on an IFB device). The mode changes the shaper
to also account dropped packets to the shaped rate, as these have already
traversed the bottleneck.

Enabling ingress mode will also tune the AQM to always keep at least two
packets queued *for each flow*. This is done by scaling the minimum queue
occupancy level that will disable the AQM by the number of active bulk
flows. The rationale for this is that retransmits are more expensive in
ingress mode, since dropped packets have to traverse the bottleneck again
when they are retransmitted; thus, being more lenient and keeping a minimum
number of packets queued will improve throughput in cases where the number
of active flows are so large that they saturate the bottleneck even at
their minimum window size.

This commit also adds a separate switch to enable ingress mode rate
autoscaling. If enabled, the autoscaling code will observe the actual
traffic rate and adjust the shaper rate to match it. This can help avoid
latency increases in the case where the actual bottleneck rate decreases
below the shaped rate. The scaling filters out spikes by an EWMA filter.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/sched/sch_cake.c |   85 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 4 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 7ea4aa261cec..10e208e4255d 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -435,7 +435,8 @@ static bool cobalt_queue_empty(struct cobalt_vars *vars,
 static bool cobalt_should_drop(struct cobalt_vars *vars,
 			       struct cobalt_params *p,
 			       ktime_t now,
-			       struct sk_buff *skb)
+			       struct sk_buff *skb,
+			       u32 bulk_flows)
 {
 	bool next_due, over_target, drop = false;
 	ktime_t schedule;
@@ -459,6 +460,7 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
 	sojourn = ktime_to_ns(ktime_sub(now, cobalt_get_enqueue_time(skb)));
 	schedule = ktime_sub(now, vars->drop_next);
 	over_target = sojourn > p->target &&
+		      sojourn > p->mtu_time * bulk_flows * 2 &&
 		      sojourn > p->mtu_time * 4;
 	next_due = vars->count && ktime_to_ns(schedule) >= 0;
 
@@ -913,6 +915,9 @@ static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
 	b->tin_dropped++;
 	sch->qstats.drops++;
 
+	if (q->rate_flags & CAKE_FLAG_INGRESS)
+		cake_advance_shaper(q, b, skb, now, true);
+
 	__qdisc_drop(skb, to_free);
 	sch->q.qlen--;
 
@@ -990,8 +995,46 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 		cake_heapify_up(q, b->overflow_idx[idx]);
 
 	/* incoming bandwidth capacity estimate */
-	q->avg_window_bytes = 0;
-	q->last_packet_time = now;
+	if (q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS) {
+		u64 packet_interval = \
+			ktime_to_ns(ktime_sub(now, q->last_packet_time));
+
+		if (packet_interval > NSEC_PER_SEC)
+			packet_interval = NSEC_PER_SEC;
+
+		/* filter out short-term bursts, eg. wifi aggregation */
+		q->avg_packet_interval = \
+			cake_ewma(q->avg_packet_interval,
+				  packet_interval,
+				  (packet_interval > q->avg_packet_interval ?
+					  2 : 8));
+
+		q->last_packet_time = now;
+
+		if (packet_interval > q->avg_packet_interval) {
+			u64 window_interval = \
+				ktime_to_ns(ktime_sub(now,
+						      q->avg_window_begin));
+			u64 b = q->avg_window_bytes * (u64)NSEC_PER_SEC;
+
+			do_div(b, window_interval);
+			q->avg_peak_bandwidth =
+				cake_ewma(q->avg_peak_bandwidth, b,
+					  b > q->avg_peak_bandwidth ? 2 : 8);
+			q->avg_window_bytes = 0;
+			q->avg_window_begin = now;
+
+			if (ktime_after(now,
+					ktime_add_ms(q->last_reconfig_time,
+						     250))) {
+				q->rate_bps = (q->avg_peak_bandwidth * 15) >> 4;
+				cake_reconfigure(sch);
+			}
+		}
+	} else {
+		q->avg_window_bytes = 0;
+		q->last_packet_time = now;
+	}
 
 	/* flowchain */
 	if (!flow->set || flow->set == CAKE_SET_DECAYING) {
@@ -1251,15 +1294,27 @@ static struct sk_buff *cake_dequeue(struct Qdisc *sch)
 		}
 
 		/* Last packet in queue may be marked, shouldn't be dropped */
-		if (!cobalt_should_drop(&flow->cvars, &b->cparams, now, skb) ||
+		if (!cobalt_should_drop(&flow->cvars, &b->cparams, now, skb,
+					(b->bulk_flow_count *
+					 !!(q->rate_flags &
+					    CAKE_FLAG_INGRESS))) ||
 		    !flow->head)
 			break;
 
+		/* drop this packet, get another one */
+		if (q->rate_flags & CAKE_FLAG_INGRESS) {
+			len = cake_advance_shaper(q, b, skb,
+						  now, true);
+			flow->deficit -= len;
+			b->tin_deficit -= len;
+		}
 		flow->dropped++;
 		b->tin_dropped++;
 		qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
 		qdisc_qstats_drop(sch);
 		kfree_skb(skb);
+		if (q->rate_flags & CAKE_FLAG_INGRESS)
+			goto retry;
 	}
 
 	b->tin_ecn_mark += !!flow->cvars.ecn_marked;
@@ -1442,6 +1497,20 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
 			q->target = 1;
 	}
 
+	if (tb[TCA_CAKE_AUTORATE]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_AUTORATE]))
+			q->rate_flags |= CAKE_FLAG_AUTORATE_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_AUTORATE_INGRESS;
+	}
+
+	if (tb[TCA_CAKE_INGRESS]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_INGRESS]))
+			q->rate_flags |= CAKE_FLAG_INGRESS;
+		else
+			q->rate_flags &= ~CAKE_FLAG_INGRESS;
+	}
+
 	if (tb[TCA_CAKE_MEMORY])
 		q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
 
@@ -1565,6 +1634,14 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put_u32(skb, TCA_CAKE_MEMORY, q->buffer_config_limit))
 		goto nla_put_failure;
 
+	if (nla_put_u32(skb, TCA_CAKE_AUTORATE,
+			!!(q->rate_flags & CAKE_FLAG_AUTORATE_INGRESS)))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_INGRESS,
+			!!(q->rate_flags & CAKE_FLAG_INGRESS)))
+		goto nla_put_failure;
+
 	return nla_nest_end(skb, opts);
 
 nla_put_failure:

^ permalink raw reply related

* [PATCH net-next v13 5/7] sch_cake: Add DiffServ handling
From: Toke Høiland-Jørgensen @ 2018-05-21 16:24 UTC (permalink / raw)
  To: netdev, cake
In-Reply-To: <152691970035.4083.3349902669266371765.stgit@alrua-kau>

This adds support for DiffServ-based priority queueing to CAKE. If the
shaper is in use, each priority tier gets its own virtual clock, which
limits that tier's rate to a fraction of the overall shaped rate, to
discourage trying to game the priority mechanism.

CAKE defaults to a simple, three-tier mode that interprets most code points
as "best effort", but places CS1 traffic into a low-priority "bulk" tier
which is assigned 1/16 of the total rate, and a few code points indicating
latency-sensitive or control traffic (specifically TOS4, VA, EF, CS6, CS7)
into a "latency sensitive" high-priority tier, which is assigned 1/4 rate.
The other supported DiffServ modes are a 4-tier mode matching the 802.11e
precedence rules, as well as two 8-tier modes, one of which implements
strict precedence of the eight priority levels.

This commit also adds an optional DiffServ 'wash' mode, which will zero out
the DSCP fields of any packet passing through CAKE. While this can
technically be done with other mechanisms in the kernel, having the feature
available in CAKE significantly decreases configuration complexity; and the
implementation cost is low on top of the other DiffServ-handling code.

Filters and applications can set the skb->priority field to override the
DSCP-based classification into tiers. If TC_H_MAJ(skb->priority) matches
CAKE's qdisc handle, the minor number will be interpreted as a priority
tier if it is less than or equal to the number of configured priority
tiers.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/sched/sch_cake.c |  412 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 404 insertions(+), 8 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 116c935b2914..d025899ef11f 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -298,6 +298,68 @@ static void cobalt_set_enqueue_time(struct sk_buff *skb,
 
 static u16 quantum_div[CAKE_QUEUES + 1] = {0};
 
+/* Diffserv lookup tables */
+
+static const u8 precedence[] = {
+	0, 0, 0, 0, 0, 0, 0, 0,
+	1, 1, 1, 1, 1, 1, 1, 1,
+	2, 2, 2, 2, 2, 2, 2, 2,
+	3, 3, 3, 3, 3, 3, 3, 3,
+	4, 4, 4, 4, 4, 4, 4, 4,
+	5, 5, 5, 5, 5, 5, 5, 5,
+	6, 6, 6, 6, 6, 6, 6, 6,
+	7, 7, 7, 7, 7, 7, 7, 7,
+};
+
+static const u8 diffserv8[] = {
+	2, 5, 1, 2, 4, 2, 2, 2,
+	0, 2, 1, 2, 1, 2, 1, 2,
+	5, 2, 4, 2, 4, 2, 4, 2,
+	3, 2, 3, 2, 3, 2, 3, 2,
+	6, 2, 3, 2, 3, 2, 3, 2,
+	6, 2, 2, 2, 6, 2, 6, 2,
+	7, 2, 2, 2, 2, 2, 2, 2,
+	7, 2, 2, 2, 2, 2, 2, 2,
+};
+
+static const u8 diffserv4[] = {
+	0, 2, 0, 0, 2, 0, 0, 0,
+	1, 0, 0, 0, 0, 0, 0, 0,
+	2, 0, 2, 0, 2, 0, 2, 0,
+	2, 0, 2, 0, 2, 0, 2, 0,
+	3, 0, 2, 0, 2, 0, 2, 0,
+	3, 0, 0, 0, 3, 0, 3, 0,
+	3, 0, 0, 0, 0, 0, 0, 0,
+	3, 0, 0, 0, 0, 0, 0, 0,
+};
+
+static const u8 diffserv3[] = {
+	0, 0, 0, 0, 2, 0, 0, 0,
+	1, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 2, 0, 2, 0,
+	2, 0, 0, 0, 0, 0, 0, 0,
+	2, 0, 0, 0, 0, 0, 0, 0,
+};
+
+static const u8 besteffort[] = {
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+	0, 0, 0, 0, 0, 0, 0, 0,
+};
+
+/* tin priority order for stats dumping */
+
+static const u8 normal_order[] = {0, 1, 2, 3, 4, 5, 6, 7};
+static const u8 bulk_order[] = {1, 0, 2, 3};
+
 #define REC_INV_SQRT_CACHE (16)
 static u32 cobalt_rec_inv_sqrt_cache[REC_INV_SQRT_CACHE] = {0};
 
@@ -1387,6 +1449,46 @@ static unsigned int cake_drop(struct Qdisc *sch, struct sk_buff **to_free)
 	return idx + (tin << 16);
 }
 
+static void cake_wash_diffserv(struct sk_buff *skb)
+{
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	case htons(ETH_P_IPV6):
+		ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		break;
+	default:
+		break;
+	}
+}
+
+static u8 cake_handle_diffserv(struct sk_buff *skb, u16 wash)
+{
+	u8 dscp;
+
+	switch (skb->protocol) {
+	case htons(ETH_P_IP):
+		dscp = ipv4_get_dsfield(ip_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv4_change_dsfield(ip_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_IPV6):
+		dscp = ipv6_get_dsfield(ipv6_hdr(skb)) >> 2;
+		if (wash && dscp)
+			ipv6_change_dsfield(ipv6_hdr(skb), INET_ECN_MASK, 0);
+		return dscp;
+
+	case htons(ETH_P_ARP):
+		return 0x38;  /* CS7 - Net Control */
+
+	default:
+		/* If there is no Diffserv field, treat as best-effort */
+		return 0;
+	}
+}
+
 static void cake_reconfigure(struct Qdisc *sch);
 
 static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
@@ -1401,7 +1503,26 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	struct cake_flow *flow;
 	u32 idx, tin;
 
-	tin = 0;
+	if (TC_H_MAJ(skb->priority) == sch->handle &&
+	    TC_H_MIN(skb->priority) > 0 &&
+	    TC_H_MIN(skb->priority) <= q->tin_cnt) {
+		tin = TC_H_MIN(skb->priority) - 1;
+
+		if (q->rate_flags & CAKE_FLAG_WASH)
+			cake_wash_diffserv(skb);
+	} else if (q->tin_mode != CAKE_DIFFSERV_BESTEFFORT) {
+		/* extract the Diffserv Precedence field, if it exists */
+		/* and clear DSCP bits if washing */
+		tin = q->tin_index[cake_handle_diffserv(skb,
+				q->rate_flags & CAKE_FLAG_WASH)];
+		if (unlikely(tin >= q->tin_cnt))
+			tin = 0;
+	} else {
+		tin = 0;
+		if (q->rate_flags & CAKE_FLAG_WASH)
+			cake_wash_diffserv(skb);
+	}
+
 	b = &q->tins[tin];
 
 	/* choose flow to insert into */
@@ -1902,18 +2023,275 @@ static void cake_set_rate(struct cake_tin_data *b, u64 rate, u32 mtu,
 	b->cparams.p_dec = 1 << 20; /* 1/4096 */
 }
 
-static void cake_reconfigure(struct Qdisc *sch)
+static int cake_config_besteffort(struct Qdisc *sch)
 {
 	struct cake_sched_data *q = qdisc_priv(sch);
 	struct cake_tin_data *b = &q->tins[0];
-	int c, ft = 0;
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u64 rate = q->rate_bps;
 
 	q->tin_cnt = 1;
-	cake_set_rate(b, q->rate_bps, psched_mtu(qdisc_dev(sch)),
+
+	q->tin_index = besteffort;
+	q->tin_order = normal_order;
+
+	cake_set_rate(b, rate, mtu,
 		      us_to_ns(q->target), us_to_ns(q->interval));
 	b->tin_quantum_band = 65535;
 	b->tin_quantum_prio = 65535;
 
+	return 0;
+}
+
+static int cake_config_precedence(struct Qdisc *sch)
+{
+	/* convert high-level (user visible) parameters into internal format */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u64 rate = q->rate_bps;
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+	q->tin_index = precedence;
+	q->tin_order = normal_order;
+
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, us_to_ns(q->target),
+			      us_to_ns(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+/*	List of known Diffserv codepoints:
+ *
+ *	Least Effort (CS1)
+ *	Best Effort (CS0)
+ *	Max Reliability & LLT "Lo" (TOS1)
+ *	Max Throughput (TOS2)
+ *	Min Delay (TOS4)
+ *	LLT "La" (TOS5)
+ *	Assured Forwarding 1 (AF1x) - x3
+ *	Assured Forwarding 2 (AF2x) - x3
+ *	Assured Forwarding 3 (AF3x) - x3
+ *	Assured Forwarding 4 (AF4x) - x3
+ *	Precedence Class 2 (CS2)
+ *	Precedence Class 3 (CS3)
+ *	Precedence Class 4 (CS4)
+ *	Precedence Class 5 (CS5)
+ *	Precedence Class 6 (CS6)
+ *	Precedence Class 7 (CS7)
+ *	Voice Admit (VA)
+ *	Expedited Forwarding (EF)
+
+ *	Total 25 codepoints.
+ */
+
+/*	List of traffic classes in RFC 4594:
+ *		(roughly descending order of contended priority)
+ *		(roughly ascending order of uncontended throughput)
+ *
+ *	Network Control (CS6,CS7)      - routing traffic
+ *	Telephony (EF,VA)         - aka. VoIP streams
+ *	Signalling (CS5)               - VoIP setup
+ *	Multimedia Conferencing (AF4x) - aka. video calls
+ *	Realtime Interactive (CS4)     - eg. games
+ *	Multimedia Streaming (AF3x)    - eg. YouTube, NetFlix, Twitch
+ *	Broadcast Video (CS3)
+ *	Low Latency Data (AF2x,TOS4)      - eg. database
+ *	Ops, Admin, Management (CS2,TOS1) - eg. ssh
+ *	Standard Service (CS0 & unrecognised codepoints)
+ *	High Throughput Data (AF1x,TOS2)  - eg. web traffic
+ *	Low Priority Data (CS1)           - eg. BitTorrent
+
+ *	Total 12 traffic classes.
+ */
+
+static int cake_config_diffserv8(struct Qdisc *sch)
+{
+/*	Pruned list of traffic classes for typical applications:
+ *
+ *		Network Control          (CS6, CS7)
+ *		Minimum Latency          (EF, VA, CS5, CS4)
+ *		Interactive Shell        (CS2, TOS1)
+ *		Low Latency Transactions (AF2x, TOS4)
+ *		Video Streaming          (AF4x, AF3x, CS3)
+ *		Bog Standard             (CS0 etc.)
+ *		High Throughput          (AF1x, TOS2)
+ *		Background Traffic       (CS1)
+ *
+ *		Total 8 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u64 rate = q->rate_bps;
+	u32 quantum1 = 256;
+	u32 quantum2 = 256;
+	u32 i;
+
+	q->tin_cnt = 8;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv8;
+	q->tin_order = normal_order;
+
+	/* class characteristics */
+	for (i = 0; i < q->tin_cnt; i++) {
+		struct cake_tin_data *b = &q->tins[i];
+
+		cake_set_rate(b, rate, mtu, us_to_ns(q->target),
+			      us_to_ns(q->interval));
+
+		b->tin_quantum_prio = max_t(u16, 1U, quantum1);
+		b->tin_quantum_band = max_t(u16, 1U, quantum2);
+
+		/* calculate next class's parameters */
+		rate  *= 7;
+		rate >>= 3;
+
+		quantum1  *= 3;
+		quantum1 >>= 1;
+
+		quantum2  *= 7;
+		quantum2 >>= 3;
+	}
+
+	return 0;
+}
+
+static int cake_config_diffserv4(struct Qdisc *sch)
+{
+/*  Further pruned list of traffic classes for four-class system:
+ *
+ *	    Latency Sensitive  (CS7, CS6, EF, VA, CS5, CS4)
+ *	    Streaming Media    (AF4x, AF3x, CS3, AF2x, TOS4, CS2, TOS1)
+ *	    Best Effort        (CS0, AF1x, TOS2, and those not specified)
+ *	    Background Traffic (CS1)
+ *
+ *		Total 4 traffic classes.
+ */
+
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u64 rate = q->rate_bps;
+	u32 quantum = 1024;
+
+	q->tin_cnt = 4;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv4;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      us_to_ns(q->target), us_to_ns(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      us_to_ns(q->target), us_to_ns(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 1, mtu,
+		      us_to_ns(q->target), us_to_ns(q->interval));
+	cake_set_rate(&q->tins[3], rate >> 2, mtu,
+		      us_to_ns(q->target), us_to_ns(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 2;
+	q->tins[3].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 1;
+	q->tins[3].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static int cake_config_diffserv3(struct Qdisc *sch)
+{
+/*  Simplified Diffserv structure with 3 tins.
+ *		Low Priority		(CS1)
+ *		Best Effort
+ *		Latency Sensitive	(TOS4, VA, EF, CS6, CS7)
+ */
+	struct cake_sched_data *q = qdisc_priv(sch);
+	u32 mtu = psched_mtu(qdisc_dev(sch));
+	u64 rate = q->rate_bps;
+	u32 quantum = 1024;
+
+	q->tin_cnt = 3;
+
+	/* codepoint to class mapping */
+	q->tin_index = diffserv3;
+	q->tin_order = bulk_order;
+
+	/* class characteristics */
+	cake_set_rate(&q->tins[0], rate, mtu,
+		      us_to_ns(q->target), us_to_ns(q->interval));
+	cake_set_rate(&q->tins[1], rate >> 4, mtu,
+		      us_to_ns(q->target), us_to_ns(q->interval));
+	cake_set_rate(&q->tins[2], rate >> 2, mtu,
+		      us_to_ns(q->target), us_to_ns(q->interval));
+
+	/* priority weights */
+	q->tins[0].tin_quantum_prio = quantum;
+	q->tins[1].tin_quantum_prio = quantum >> 4;
+	q->tins[2].tin_quantum_prio = quantum << 4;
+
+	/* bandwidth-sharing weights */
+	q->tins[0].tin_quantum_band = quantum;
+	q->tins[1].tin_quantum_band = quantum >> 4;
+	q->tins[2].tin_quantum_band = quantum >> 2;
+
+	return 0;
+}
+
+static void cake_reconfigure(struct Qdisc *sch)
+{
+	struct cake_sched_data *q = qdisc_priv(sch);
+	int c, ft;
+
+	switch (q->tin_mode) {
+	case CAKE_DIFFSERV_BESTEFFORT:
+		ft = cake_config_besteffort(sch);
+		break;
+
+	case CAKE_DIFFSERV_PRECEDENCE:
+		ft = cake_config_precedence(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV8:
+		ft = cake_config_diffserv8(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV4:
+		ft = cake_config_diffserv4(sch);
+		break;
+
+	case CAKE_DIFFSERV_DIFFSERV3:
+	default:
+		ft = cake_config_diffserv3(sch);
+		break;
+	}
+
 	for (c = q->tin_cnt; c < CAKE_MAX_TINS; c++) {
 		cake_clear_tin(sch, c);
 		q->tins[c].cparams.mtu_time = q->tins[ft].cparams.mtu_time;
@@ -1969,6 +2347,16 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
 	if (tb[TCA_CAKE_BASE_RATE64])
 		q->rate_bps = nla_get_u64(tb[TCA_CAKE_BASE_RATE64]);
 
+	if (tb[TCA_CAKE_DIFFSERV_MODE])
+		q->tin_mode = nla_get_u32(tb[TCA_CAKE_DIFFSERV_MODE]);
+
+	if (tb[TCA_CAKE_WASH]) {
+		if (!!nla_get_u32(tb[TCA_CAKE_WASH]))
+			q->rate_flags |= CAKE_FLAG_WASH;
+		else
+			q->rate_flags &= ~CAKE_FLAG_WASH;
+	}
+
 	if (tb[TCA_CAKE_FLOW_MODE])
 		q->flow_mode = (nla_get_u32(tb[TCA_CAKE_FLOW_MODE]) &
 				CAKE_FLOW_MASK);
@@ -2032,7 +2420,7 @@ static int cake_init(struct Qdisc *sch, struct nlattr *opt,
 	int i, j, err;
 
 	sch->limit = 10240;
-	q->tin_mode = CAKE_DIFFSERV_BESTEFFORT;
+	q->tin_mode = CAKE_DIFFSERV_DIFFSERV3;
 	q->flow_mode  = CAKE_FLOW_TRIPLE;
 
 	q->rate_bps = 0; /* unlimited by default */
@@ -2142,6 +2530,13 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
 			!!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
 		goto nla_put_failure;
 
+	if (nla_put_u32(skb, TCA_CAKE_DIFFSERV_MODE, q->tin_mode))
+		goto nla_put_failure;
+
+	if (nla_put_u32(skb, TCA_CAKE_WASH,
+			!!(q->rate_flags & CAKE_FLAG_WASH)))
+		goto nla_put_failure;
+
 	return nla_nest_end(skb, opts);
 
 nla_put_failure:
@@ -2195,7 +2590,7 @@ static int cake_dump_stats(struct Qdisc *sch, struct gnet_dump *d)
 	} while (0)
 
 	for (i = 0; i < q->tin_cnt; i++) {
-		struct cake_tin_data *b = &q->tins[i];
+		struct cake_tin_data *b = &q->tins[q->tin_order[i]];
 
 		ts = nla_nest_start(d->skb, i + 1);
 		if (!ts)
@@ -2294,7 +2689,8 @@ static int cake_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 	u32 idx = cl - 1;
 
 	if (idx < CAKE_QUEUES * q->tin_cnt) {
-		const struct cake_tin_data *b = &q->tins[idx / CAKE_QUEUES];
+		const struct cake_tin_data *b = \
+			&q->tins[q->tin_order[idx / CAKE_QUEUES]];
 		const struct sk_buff *skb;
 
 		flow = &b->flows[idx % CAKE_QUEUES];
@@ -2366,7 +2762,7 @@ static void cake_walk(struct Qdisc *sch, struct qdisc_walker *arg)
 		return;
 
 	for (i = 0; i < q->tin_cnt; i++) {
-		struct cake_tin_data *b = &q->tins[i];
+		struct cake_tin_data *b = &q->tins[q->tin_order[i]];
 
 		for (j = 0; j < CAKE_QUEUES; j++) {
 			if (list_empty(&b->flows[j].flowchain) ||

^ permalink raw reply related

* [PATCH net-next v13 4/7] sch_cake: Add NAT awareness to packet classifier
From: Toke Høiland-Jørgensen @ 2018-05-21 16:24 UTC (permalink / raw)
  To: netdev, cake
In-Reply-To: <152691970035.4083.3349902669266371765.stgit@alrua-kau>

When CAKE is deployed on a gateway that also performs NAT (which is a
common deployment mode), the host fairness mechanism cannot distinguish
internal hosts from each other, and so fails to work correctly.

To fix this, we add an optional NAT awareness mode, which will query the
kernel conntrack mechanism to obtain the pre-NAT addresses for each packet
and use that in the flow and host hashing.

When the shaper is enabled and the host is already performing NAT, the cost
of this lookup is negligible. However, in unlimited mode with no NAT being
performed, there is a significant CPU cost at higher bandwidths. For this
reason, the feature is turned off by default.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/sched/sch_cake.c |   79 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index c20f33940a57..116c935b2914 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -71,6 +71,12 @@
 #include <net/tcp.h>
 #include <net/flow_dissector.h>
 
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_zones.h>
+#include <net/netfilter/nf_conntrack.h>
+#endif
+
 #define CAKE_SET_WAYS (8)
 #define CAKE_MAX_TINS (8)
 #define CAKE_QUEUES (1024)
@@ -516,6 +522,60 @@ static bool cobalt_should_drop(struct cobalt_vars *vars,
 	return drop;
 }
 
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+
+static void cake_update_flowkeys(struct flow_keys *keys,
+				 const struct sk_buff *skb)
+{
+	const struct nf_conntrack_tuple *tuple;
+	enum ip_conntrack_info ctinfo;
+	struct nf_conn *ct;
+	bool rev = false;
+
+	if (tc_skb_protocol(skb) != htons(ETH_P_IP))
+		return;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (ct) {
+		tuple = nf_ct_tuple(ct, CTINFO2DIR(ctinfo));
+	} else {
+		const struct nf_conntrack_tuple_hash *hash;
+		struct nf_conntrack_tuple srctuple;
+
+		if (!nf_ct_get_tuplepr(skb, skb_network_offset(skb),
+				       NFPROTO_IPV4, dev_net(skb->dev),
+				       &srctuple))
+			return;
+
+		hash = nf_conntrack_find_get(dev_net(skb->dev),
+					     &nf_ct_zone_dflt,
+					     &srctuple);
+		if (!hash)
+			return;
+
+		rev = true;
+		ct = nf_ct_tuplehash_to_ctrack(hash);
+		tuple = nf_ct_tuple(ct, !hash->tuple.dst.dir);
+	}
+
+	keys->addrs.v4addrs.src = rev ? tuple->dst.u3.ip : tuple->src.u3.ip;
+	keys->addrs.v4addrs.dst = rev ? tuple->src.u3.ip : tuple->dst.u3.ip;
+
+	if (keys->ports.ports) {
+		keys->ports.src = rev ? tuple->dst.u.all : tuple->src.u.all;
+		keys->ports.dst = rev ? tuple->src.u.all : tuple->dst.u.all;
+	}
+	if (rev)
+		nf_ct_put(ct);
+}
+#else
+static void cake_update_flowkeys(struct flow_keys *keys,
+				 const struct sk_buff *skb)
+{
+	/* There is nothing we can do here without CONNTRACK */
+}
+#endif
+
 /* Cake has several subtle multiple bit settings. In these cases you
  *  would be matching triple isolate mode as well.
  */
@@ -543,6 +603,9 @@ static u32 cake_hash(struct cake_tin_data *q, const struct sk_buff *skb,
 	skb_flow_dissect_flow_keys(skb, &keys,
 				   FLOW_DISSECTOR_F_STOP_AT_FLOW_LABEL);
 
+	if (flow_mode & CAKE_FLOW_NAT_FLAG)
+		cake_update_flowkeys(&keys, skb);
+
 	/* flow_hash_from_keys() sorts the addresses by value, so we have
 	 * to preserve their order in a separate data structure to treat
 	 * src and dst host addresses as independently selectable.
@@ -1891,6 +1954,18 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
 	if (err < 0)
 		return err;
 
+	if (tb[TCA_CAKE_NAT]) {
+#if IS_REACHABLE(CONFIG_NF_CONNTRACK)
+		q->flow_mode &= ~CAKE_FLOW_NAT_FLAG;
+		q->flow_mode |= CAKE_FLOW_NAT_FLAG *
+			!!nla_get_u32(tb[TCA_CAKE_NAT]);
+#else
+		NL_SET_ERR_MSG_ATTR(extack, "No conntrack support in kernel",
+				    tb[TCA_CAKE_NAT]);
+		return -EOPNOTSUPP;
+#endif
+	}
+
 	if (tb[TCA_CAKE_BASE_RATE64])
 		q->rate_bps = nla_get_u64(tb[TCA_CAKE_BASE_RATE64]);
 
@@ -2063,6 +2138,10 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
 	if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
 		goto nla_put_failure;
 
+	if (nla_put_u32(skb, TCA_CAKE_NAT,
+			!!(q->flow_mode & CAKE_FLOW_NAT_FLAG)))
+		goto nla_put_failure;
+
 	return nla_nest_end(skb, opts);
 
 nla_put_failure:

^ permalink raw reply related

* [PATCH net-next v13 3/7] sch_cake: Add optional ACK filter
From: Toke Høiland-Jørgensen @ 2018-05-21 16:24 UTC (permalink / raw)
  To: netdev, cake
In-Reply-To: <152691970035.4083.3349902669266371765.stgit@alrua-kau>

The ACK filter is an optional feature of CAKE which is designed to improve
performance on links with very asymmetrical rate limits. On such links
(which are unfortunately quite prevalent, especially for DSL and cable
subscribers), the downstream throughput can be limited by the number of
ACKs capable of being transmitted in the *upstream* direction.

Filtering ACKs can, in general, have adverse effects on TCP performance
because it interferes with ACK clocking (especially in slow start), and it
reduces the flow's resiliency to ACKs being dropped further along the path.
To alleviate these drawbacks, the ACK filter in CAKE tries its best to
always keep enough ACKs queued to ensure forward progress in the TCP flow
being filtered. It does this by only filtering redundant ACKs. In its
default 'conservative' mode, the filter will always keep at least two
redundant ACKs in the queue, while in 'aggressive' mode, it will filter
down to a single ACK.

The ACK filter works by inspecting the per-flow queue on every packet
enqueue. Starting at the head of the queue, the filter looks for another
eligible packet to drop (so the ACK being dropped is always closer to the
head of the queue than the packet being enqueued). An ACK is eligible only
if it ACKs *fewer* bytes than the new packet being enqueued, including any
SACK options. This prevents duplicate ACKs from being filtered, to avoid
interfering with retransmission logic. In addition, we check TCP header
options and only drop those that are known to not interfere with sender
state. In particular, packets with unknown option codes are never dropped.

In aggressive mode, an eligible packet is always dropped, while in
conservative mode, at least two ACKs are kept in the queue. Only pure ACKs
(with no data segments) are considered eligible for dropping, but when an
ACK with data segments is enqueued, this can cause another pure ACK to
become eligible for dropping.

The approach described above ensures that this ACK filter avoids most of
the drawbacks of a naive filtering mechanism that only keeps flow state but
does not inspect the queue. This is the rationale for including the ACK
filter in CAKE itself rather than as separate module (as the TC filter, for
instance).

Our performance evaluation has shown that on a 30/1 Mbps link with a
bidirectional traffic test (RRUL), turning on the ACK filter on the
upstream link improves downstream throughput by ~20% (both modes) and
upstream throughput by ~12% in conservative mode and ~40% in aggressive
mode, at the cost of ~5ms of inter-flow latency due to the increased
congestion.

In *really* pathological cases, the effect can be a lot more; for instance,
the ACK filter increases the achievable downstream throughput on a link
with 100 Kbps in the upstream direction by an order of magnitude (from ~2.5
Mbps to ~25 Mbps).

Finally, even though we consider the ACK filter to be safer than most, we
do not recommend turning it on everywhere: on more symmetrical link
bandwidths the effect is negligible at best.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
---
 net/sched/sch_cake.c |  425 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 423 insertions(+), 2 deletions(-)

diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 10e208e4255d..c20f33940a57 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -757,6 +757,404 @@ static void flow_queue_add(struct cake_flow *flow, struct sk_buff *skb)
 	skb->next = NULL;
 }
 
+static struct iphdr *cake_get_iphdr(const struct sk_buff *skb,
+				    struct ipv6hdr *buf)
+{
+	unsigned int offset = skb_network_offset(skb);
+	struct iphdr *iph;
+
+	iph = skb_header_pointer(skb, offset, sizeof(struct iphdr), buf);
+
+	if (!iph)
+		return NULL;
+
+	if (iph->version == 4 && iph->protocol == IPPROTO_IPV6)
+		return skb_header_pointer(skb, offset + iph->ihl * 4,
+					  sizeof(struct ipv6hdr), buf);
+
+	else if (iph->version == 4)
+		return iph;
+
+	else if (iph->version == 6)
+		return skb_header_pointer(skb, offset, sizeof(struct ipv6hdr),
+					  buf);
+
+	return NULL;
+}
+
+static struct tcphdr *cake_get_tcphdr(const struct sk_buff *skb,
+				      void *buf, unsigned int bufsize)
+{
+	unsigned int offset = skb_network_offset(skb);
+	const struct ipv6hdr *ipv6h;
+	const struct tcphdr *tcph;
+	const struct iphdr *iph;
+	struct ipv6hdr _ipv6h;
+	struct tcphdr _tcph;
+
+	ipv6h = skb_header_pointer(skb, offset, sizeof(_ipv6h), &_ipv6h);
+
+	if (!ipv6h)
+		return NULL;
+
+	if (ipv6h->version == 4) {
+		iph = (struct iphdr *)ipv6h;
+		offset += iph->ihl * 4;
+
+		/* special-case 6in4 tunnelling, as that is a common way to get
+		 * v6 connectivity in the home
+		 */
+		if (iph->protocol == IPPROTO_IPV6) {
+			ipv6h = skb_header_pointer(skb, offset,
+						   sizeof(_ipv6h), &_ipv6h);
+
+			if (!ipv6h || ipv6h->nexthdr != IPPROTO_TCP)
+				return NULL;
+
+			offset += sizeof(struct ipv6hdr);
+
+		} else if (iph->protocol != IPPROTO_TCP) {
+			return NULL;
+		}
+
+	} else if (ipv6h->version == 6) {
+		if (ipv6h->nexthdr != IPPROTO_TCP)
+			return NULL;
+
+		offset += sizeof(struct ipv6hdr);
+	} else {
+		return NULL;
+	}
+
+	tcph = skb_header_pointer(skb, offset, sizeof(_tcph), &_tcph);
+	if (!tcph)
+		return NULL;
+
+	return skb_header_pointer(skb, offset,
+				  min(__tcp_hdrlen(tcph), bufsize), buf);
+}
+
+static const u8 *cake_get_tcpopt(const struct tcphdr *tcph,
+				 int code, int *oplen)
+{
+	/* inspired by tcp_parse_options in tcp_input.c */
+	int length = __tcp_hdrlen(tcph) - sizeof(struct tcphdr);
+	const u8 *ptr = (const u8 *)(tcph + 1);
+
+	while (length > 0) {
+		int opcode = *ptr++;
+		int opsize;
+
+		if (opcode == TCPOPT_EOL)
+			break;
+		if (opcode == TCPOPT_NOP) {
+			length--;
+			continue;
+		}
+		opsize = *ptr++;
+		if (opsize < 2 || opsize > length)
+			break;
+
+		if (opcode == code) {
+			*oplen = opsize;
+			return ptr;
+		}
+
+		ptr += opsize - 2;
+		length -= opsize;
+	}
+
+	return NULL;
+}
+
+/* Compare two SACK sequences. A sequence is considered greater if it SACKs more
+ * bytes than the other. In the case where both sequences ACKs bytes that the
+ * other doesn't, A is considered greater.
+ *
+ * @return -1, 0 or 1 as normal compare functions
+ */
+static int cake_tcph_sack_compare(const struct tcphdr *tcph_a,
+				  const struct tcphdr *tcph_b)
+{
+	u64 bytes_a = 0, bytes_b = 0;
+	const u8 *sack_a, *sack_b;
+	int oplen_a, oplen_b;
+	bool first = true;
+
+	sack_a = cake_get_tcpopt(tcph_a, TCPOPT_SACK, &oplen_a);
+	sack_b = cake_get_tcpopt(tcph_b, TCPOPT_SACK, &oplen_b);
+
+	if (sack_a && !sack_b)
+		return -1;
+	else if (sack_b && !sack_a)
+		return 1;
+	else if (!sack_a && !sack_b)
+		return 0;
+
+	/* pointer has already advanced past opcode and length bytes */
+	oplen_a -= 2;
+
+	while (oplen_a >= 8) {
+		u32 right_a = get_unaligned_be32(sack_a + 4);
+		u32 left_a = get_unaligned_be32(sack_a);
+		const u8 *sack_tmp = sack_b;
+		int oplen_tmp = oplen_b - 2;
+		bool found = false;
+
+		/* invalid or empty SACK range; ignore */
+		if (left_a >= right_a)
+			continue;
+
+		bytes_a += right_a - left_a;
+
+		while (oplen_tmp >= 8) {
+			u32 right_b = get_unaligned_be32(sack_tmp + 4);
+			u32 left_b = get_unaligned_be32(sack_tmp);
+
+			if (left_b >= right_b)
+				continue;
+
+			if (first)
+				bytes_b += right_b - left_b;
+
+			if (left_b <= left_a && right_a <= right_b) {
+				found = true;
+				if (!first)
+					break;
+			}
+			oplen_tmp -= 8;
+			sack_tmp += 8;
+		}
+
+		first = false;
+
+		if (!found)
+			return -1;
+
+		oplen_a -= 8;
+		sack_a += 8;
+	}
+
+	/* If we made it this far, all ranges SACKed by A are covered by B, so
+	 * either the SACKs are equal, or B SACKs more bytes.
+	 */
+	return bytes_b > bytes_a ? 1 : 0;
+}
+
+static void cake_tcph_get_tstamp(const struct tcphdr *tcph,
+				 u32 *tsval, u32 *tsecr)
+{
+	const u8 *ptr;
+	int opsize;
+
+	ptr = cake_get_tcpopt(tcph, TCPOPT_TIMESTAMP, &opsize);
+
+	if (ptr && opsize == TCPOLEN_TIMESTAMP) {
+		*tsval = get_unaligned_be32(ptr);
+		*tsecr = get_unaligned_be32(ptr + 4);
+	}
+}
+
+static bool cake_tcph_may_drop(const struct tcphdr *tcph,
+			       u32 tstamp_new, u32 tsecr_new)
+{
+	/* inspired by tcp_parse_options in tcp_input.c */
+	int length = __tcp_hdrlen(tcph) - sizeof(struct tcphdr);
+	const u8 *ptr = (const u8 *)(tcph + 1);
+	u32 tstamp, tsecr;
+
+	/* 3 reserved flags must be unset to avoid future breakage
+	 * ECE/CWR/NS can be safely ignored
+	 * ACK must be set
+	 * All other flags URG/PSH/RST/SYN/FIN must be unset
+	 * 0x0FFF0000 = all TCP flags (confirm ACK=1, others zero)
+	 * 0x01C00000 = NS/CWR/ECE (safe to ignore)
+	 * 0x0E3F0000 = 0x0FFF0000 & ~0x01C00000
+	 */
+	if (((tcp_flag_word(tcph) &
+	      cpu_to_be32(0x0E3F0000)) != TCP_FLAG_ACK))
+		return false;
+
+	while (length > 0) {
+		int opcode = *ptr++;
+		int opsize;
+
+		if (opcode == TCPOPT_EOL)
+			break;
+		if (opcode == TCPOPT_NOP) {
+			length--;
+			continue;
+		}
+		opsize = *ptr++;
+		if (opsize < 2 || opsize > length)
+			break;
+
+		switch (opcode) {
+		case TCPOPT_MD5SIG: /* doesn't influence state */
+			break;
+
+		case TCPOPT_SACK: /* stricter checking performed later */
+			if (opsize % 8 != 2)
+				return false;
+			break;
+
+		case TCPOPT_TIMESTAMP:
+			/* only drop timestamps lower than new */
+			if (opsize != TCPOLEN_TIMESTAMP)
+				return false;
+			tstamp = get_unaligned_be32(ptr);
+			tsecr = get_unaligned_be32(ptr + 4);
+			if (tstamp > tstamp_new || tsecr > tsecr_new)
+				return false;
+			break;
+
+		case TCPOPT_MSS:  /* these should only be set on SYN */
+		case TCPOPT_WINDOW:
+		case TCPOPT_SACK_PERM:
+		case TCPOPT_FASTOPEN:
+		case TCPOPT_EXP:
+		default: /* don't drop if any unknown options are present */
+			return false;
+		}
+
+		ptr += opsize - 2;
+		length -= opsize;
+	}
+
+	return true;
+}
+
+static struct sk_buff *cake_ack_filter(struct cake_sched_data *q,
+				       struct cake_flow *flow)
+{
+	bool aggressive = q->ack_filter == CAKE_ACK_AGGRESSIVE;
+	struct sk_buff *elig_ack = NULL, *elig_ack_prev = NULL;
+	struct sk_buff *skb_check, *skb_prev = NULL;
+	const struct ipv6hdr *ipv6h, *ipv6h_check;
+	unsigned char _tcph[64], _tcph_check[64];
+	const struct tcphdr *tcph, *tcph_check;
+	const struct iphdr *iph, *iph_check;
+	struct ipv6hdr _iph, _iph_check;
+	const struct sk_buff *skb;
+	int seglen, num_found = 0;
+	u32 tstamp = 0, tsecr = 0;
+	int sack_comp;
+
+	/* no other possible ACKs to filter */
+	if (flow->head == flow->tail)
+		return NULL;
+
+	skb = flow->tail;
+	tcph = cake_get_tcphdr(skb, _tcph, sizeof(_tcph));
+	iph = cake_get_iphdr(skb, &_iph);
+	if (!tcph)
+		return NULL;
+
+	cake_tcph_get_tstamp(tcph, &tstamp, &tsecr);
+
+	/* the 'triggering' packet need only have the ACK flag set.
+	 * also check that SYN is not set, as there won't be any previous ACKs.
+	 */
+	if ((tcp_flag_word(tcph) &
+	     (TCP_FLAG_ACK | TCP_FLAG_SYN)) != TCP_FLAG_ACK)
+		return NULL;
+
+	/* the 'triggering' ACK is at the tail of the queue, we have already
+	 * returned if it is the only packet in the flow. loop through the rest
+	 * of the queue looking for pure ACKs with the same 5-tuple as the
+	 * triggering one.
+	 */
+	for (skb_check = flow->head;
+	     skb_check && skb_check != skb;
+	     skb_prev = skb_check, skb_check = skb_check->next) {
+		iph_check = cake_get_iphdr(skb_check, &_iph_check);
+		tcph_check = cake_get_tcphdr(skb_check, &_tcph_check,
+					     sizeof(_tcph_check));
+
+		/* only TCP packets with matching 5-tuple are eligible, and only
+		 * drop safe headers
+		 */
+		if (!tcph_check || iph->version != iph_check->version ||
+		    tcph_check->source != tcph->source ||
+		    tcph_check->dest != tcph->dest ||
+		    !cake_tcph_may_drop(tcph_check, tstamp, tsecr))
+			continue;
+
+		if (iph_check->version == 4) {
+			if (iph_check->saddr != iph->saddr ||
+			    iph_check->daddr != iph->daddr)
+				continue;
+
+			seglen = ntohs(iph_check->tot_len) -
+				       (4 * iph_check->ihl);
+		} else if (iph_check->version == 6) {
+			ipv6h = (struct ipv6hdr *)iph;
+			ipv6h_check = (struct ipv6hdr *)iph_check;
+
+			if (ipv6_addr_cmp(&ipv6h_check->saddr, &ipv6h->saddr) ||
+			    ipv6_addr_cmp(&ipv6h_check->daddr, &ipv6h->daddr))
+				continue;
+
+			seglen = ntohs(ipv6h_check->payload_len);
+		} else {
+			WARN_ON(1);  /* shouldn't happen */
+			continue;
+		}
+
+		/* Don't drop ACKs with segment data, and don't drop ACKs higher
+		 * cumulative ACK counter than triggering packet. Check ACK
+		 * seqno here to avoid parsing SACK options of packets we are
+		 * going to exclude anyway.
+		 */
+		if ((seglen - __tcp_hdrlen(tcph_check)) != 0 ||
+		    (int32_t)(ntohl(tcph_check->ack_seq) -
+			      ntohl(tcph->ack_seq)) > 0)
+			continue;
+
+		/* Check SACK options. The triggering packet must SACK more data
+		 * than the ACK under consideration, or SACK the same range but
+		 * have a larger cumulative ACK counter. The latter is a
+		 * pathological case, but is contained in the following check
+		 * anyway, just to be safe.
+		 */
+		sack_comp = cake_tcph_sack_compare(tcph_check, tcph);
+
+		if (sack_comp < 0 ||
+		    (ntohl(tcph_check->ack_seq) == ntohl(tcph->ack_seq) &&
+		     sack_comp == 0))
+			continue;
+
+		/* At this point we have found an eligible pure ACK to drop; if
+		 * we are in aggressive mode, we are done. Otherwise, keep
+		 * searching unless this is the second eligible ACK we
+		 * found.
+		 *
+		 * Since we want to drop ACK closest to the head of the queue,
+		 * save the first eligible ACK we find, even if we need to loop
+		 * again.
+		 */
+		if (!elig_ack) {
+			elig_ack = skb_check;
+			elig_ack_prev = skb_prev;
+		}
+
+		if (num_found++ > 0 || aggressive)
+			goto found;
+	}
+
+	return NULL;
+
+found:
+	if (elig_ack_prev)
+		elig_ack_prev->next = elig_ack->next;
+	else
+		flow->head = elig_ack->next;
+
+	elig_ack->next = NULL;
+
+	return elig_ack;
+}
+
 static u64 cake_ewma(u64 avg, u64 sample, u32 shift)
 {
 	avg -= avg >> shift;
@@ -934,6 +1332,7 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	struct cake_sched_data *q = qdisc_priv(sch);
 	int len = qdisc_pkt_len(skb);
 	int uninitialized_var(ret);
+	struct sk_buff *ack = NULL;
 	ktime_t now = ktime_get();
 	struct cake_tin_data *b;
 	struct cake_flow *flow;
@@ -980,8 +1379,24 @@ static s32 cake_enqueue(struct sk_buff *skb, struct Qdisc *sch,
 	cobalt_set_enqueue_time(skb, now);
 	flow_queue_add(flow, skb);
 
-	sch->q.qlen++;
-	q->buffer_used      += skb->truesize;
+	if (q->ack_filter)
+		ack = cake_ack_filter(q, flow);
+
+	if (ack) {
+		b->ack_drops++;
+		sch->qstats.drops++;
+		b->bytes += qdisc_pkt_len(ack);
+		len -= qdisc_pkt_len(ack);
+		q->buffer_used += skb->truesize - ack->truesize;
+		if (q->rate_flags & CAKE_FLAG_INGRESS)
+			cake_advance_shaper(q, b, ack, now, true);
+
+		qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(ack));
+		consume_skb(ack);
+	} else {
+		sch->q.qlen++;
+		q->buffer_used      += skb->truesize;
+	}
 
 	/* stats */
 	b->packets++;
@@ -1511,6 +1926,9 @@ static int cake_change(struct Qdisc *sch, struct nlattr *opt,
 			q->rate_flags &= ~CAKE_FLAG_INGRESS;
 	}
 
+	if (tb[TCA_CAKE_ACK_FILTER])
+		q->ack_filter = nla_get_u32(tb[TCA_CAKE_ACK_FILTER]);
+
 	if (tb[TCA_CAKE_MEMORY])
 		q->buffer_config_limit = nla_get_u32(tb[TCA_CAKE_MEMORY]);
 
@@ -1642,6 +2060,9 @@ static int cake_dump(struct Qdisc *sch, struct sk_buff *skb)
 			!!(q->rate_flags & CAKE_FLAG_INGRESS)))
 		goto nla_put_failure;
 
+	if (nla_put_u32(skb, TCA_CAKE_ACK_FILTER, q->ack_filter))
+		goto nla_put_failure;
+
 	return nla_nest_end(skb, opts);
 
 nla_put_failure:

^ permalink raw reply related

* [PATCH net-next v13 0/7] sched: Add Common Applications Kept Enhanced (cake) qdisc
From: Toke Høiland-Jørgensen @ 2018-05-21 16:24 UTC (permalink / raw)
  To: netdev, cake

This patch series adds the CAKE qdisc, and has been split up to ease
review.

I have attempted to split out each configurable feature into its own patch.
The first commit adds the base shaper and packet scheduler, while
subsequent commits add the optional features. The full userspace API and
most data structures are included in this commit, but options not
understood in the base version will be ignored.

The result of applying the entire series is identical to the out of tree
version that have seen extensive testing in previous deployments, most
notably as an out of tree patch to OpenWrt. However, note that I have only
compile tested the individual patches; so the whole series should be
considered as a unit.

---
Changelog

v13:
  - Avoid ktime_t to scalar compares
  - Add class dumping and basic stats
  - Fail with ENOTSUPP when requesting NAT mode and conntrack is not
    available.
  - Parse all TCP options in ACK filter and make sure to only drop safe
    ones. Also handle SACK ranges properly.

v12:
  - Get rid of custom time typedefs. Use ktime_t for time and u64 for
    duration instead.

v11:
  - Fix overhead compensation calculation for GSO packets
  - Change configured rate to be u64 (I ran out of bits before I ran out
    of CPU when testing the effects of the above)

v10:
  - Christmas tree gardening (fix variable declarations to be in reverse
    line length order)

v9:
  - Remove duplicated checks around kvfree() and just call it
    unconditionally.
  - Don't pass __GFP_NOWARN when allocating memory
  - Move options in cake_dump() that are related to optional features to
    later patches implementing the features.
  - Support attaching filters to the qdisc and use the classification
    result to select flow queue.
  - Support overriding diffserv priority tin from skb->priority

v8:
  - Remove inline keyword from function definitions
  - Simplify ACK filter; remove the complex state handling to make the
    logic easier to follow. This will potentially be a bit less efficient,
    but I have not been able to measure a difference.

v7:
  - Split up patch into a series to ease review.
  - Constify the ACK filter.

v6:
  - Fix 6in4 encapsulation checks in ACK filter code
  - Checkpatch fixes

v5:
  - Refactor ACK filter code and hopefully fix the safety issues
    properly this time.

v4:
  - Only split GSO packets if shaping at speeds <= 1Gbps
  - Fix overhead calculation code to also work for GSO packets
  - Don't re-implement kvzalloc()
  - Remove local header include from out-of-tree build (fixes kbuild-bot
    complaint).
  - Several fixes to the ACK filter:
    - Check pskb_may_pull() before deref of transport headers.
    - Don't run ACK filter logic on split GSO packets
    - Fix TCP sequence number compare to deal with wraparounds

v3:
  - Use IS_REACHABLE() macro to fix compilation when sch_cake is
    built-in and conntrack is a module.
  - Switch the stats output to use nested netlink attributes instead
    of a versioned struct.
  - Remove GPL boilerplate.
  - Fix array initialisation style.

v2:
  - Fix kbuild test bot complaint
  - Clean up the netlink ABI
  - Fix checkpatch complaints
  - A few tweaks to the behaviour of cake based on testing carried out
    while writing the paper.

---

Toke Høiland-Jørgensen (7):
      sched: Add Common Applications Kept Enhanced (cake) qdisc
      sch_cake: Add ingress mode
      sch_cake: Add optional ACK filter
      sch_cake: Add NAT awareness to packet classifier
      sch_cake: Add DiffServ handling
      sch_cake: Add overhead compensation support to the rate shaper
      sch_cake: Conditionally split GSO segments


 include/uapi/linux/pkt_sched.h |  113 ++
 net/sched/Kconfig              |   11 
 net/sched/Makefile             |    1 
 net/sched/sch_cake.c           | 2992 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 3117 insertions(+)
 create mode 100644 net/sched/sch_cake.c

^ permalink raw reply

* Re: [RFC PATCH net-next 01/12] vhost_net: introduce helper to initialize tx iov iter
From: Jesse Brandeburg @ 2018-05-21 16:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, netdev, linux-kernel, kvm, virtualization, jesse.brandeburg
In-Reply-To: <1526893473-20128-2-git-send-email-jasowang@redhat.com>

Hi Jason, a few nits. 

On Mon, 21 May 2018 17:04:22 +0800 Jason wrote:
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/net.c | 34 +++++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 11 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index c4b49fc..15d191a 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -459,6 +459,26 @@ static bool vhost_exceeds_maxpend(struct vhost_net *net)
>  	       min_t(unsigned int, VHOST_MAX_PEND, vq->num >> 2);
>  }
>  
> +static size_t init_iov_iter(struct vhost_virtqueue *vq, struct iov_iter *iter,
> +			    size_t hdr_size, int out)
> +{
> +	/* Skip header. TODO: support TSO. */
> +	size_t len = iov_length(vq->iov, out);
> +
> +	iov_iter_init(iter, WRITE, vq->iov, out, len);
> +	iov_iter_advance(iter, hdr_size);
> +	/* Sanity check */
> +	if (!iov_iter_count(iter)) {
> +		vq_err(vq, "Unexpected header len for TX: "
> +			"%zd expected %zd\n",
> +			len, hdr_size);

ok, it was like this before, but please unwrap the string in " ", there
should be no line breaks in string declarations and they are allowed to
go over 80 characters.

> +		return -EFAULT;
> +	}
> +	len = iov_iter_count(iter);
> +
> +	return len;
> +}
> +
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_tx(struct vhost_net *net)
> @@ -521,18 +541,10 @@ static void handle_tx(struct vhost_net *net)
>  			       "out %d, int %d\n", out, in);
>  			break;
>  		}
> -		/* Skip header. TODO: support TSO. */
> -		len = iov_length(vq->iov, out);
> -		iov_iter_init(&msg.msg_iter, WRITE, vq->iov, out, len);
> -		iov_iter_advance(&msg.msg_iter, hdr_size);
> -		/* Sanity check */
> -		if (!msg_data_left(&msg)) {
> -			vq_err(vq, "Unexpected header len for TX: "
> -			       "%zd expected %zd\n",
> -			       len, hdr_size);
> +
> +		len = init_iov_iter(vq, &msg.msg_iter, hdr_size, out);
> +		if (len < 0)

len is declared as size_t, which is unsigned, and can never be
negative.  I'm pretty sure this is a bug.


>  			break;
> -		}
> -		len = msg_data_left(&msg);
>  
>  		zcopy_used = zcopy && len >= VHOST_GOODCOPY_LEN
>  				   && !vhost_exceeds_maxpend(net)

^ permalink raw reply

* Re: [PATCH net-next] cxgb4/cxgb4vf: link management changes for new SFP
From: David Miller @ 2018-05-21 16:23 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil, venkatesh
In-Reply-To: <1526894195-5048-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Mon, 21 May 2018 14:46:35 +0530

> newer SFPs like SFP28 and QSFP28 Transceiver Modules present
> several new possibilities which we haven't faced before. Fix the
> assumptions in the code reflecting the more limited capabilities
> of previous Transceiver Module systems
> 
> Original work by Casey Leedom <leedom@chelsio.com>
> 
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

This patch doesn't apply to the net-next tree.

^ permalink raw reply

* Re: [PATCH net-next] cxgb4: do L1 config when module is inserted
From: David Miller @ 2018-05-21 16:23 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil, venkatesh, leedom
In-Reply-To: <20180521.122104.1660333635075713378.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Mon, 21 May 2018 12:21:04 -0400 (EDT)

> From: Ganesh Goudar <ganeshgr@chelsio.com>
> Date: Mon, 21 May 2018 14:45:43 +0530
> 
>> trigger an L1 configure operation when a transceiver module
>> is inserted in order to cause current "sticky" options like
>> Requested Forward Error Correction to be reapplied.
>> 
>> Signed-off-by: Casey Leedom <leedom@chelsio.com>
>> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
> 
> Applied, although:

Actually I had to revert, this doesn't even compile:

drivers/scsi/csiostor/csio_hw.c: In function ‘fwcaps16_to_caps32’:
drivers/scsi/csiostor/csio_hw.c:1490:17: error: ‘FW_PORT_CAP_MDIX’ undeclared (first use in this function); did you mean ‘FW_PORT_CAP_MDI_S’?
    if (caps16 & FW_PORT_CAP_##__cap) \
                 ^

^ permalink raw reply

* Re: [PATCH net-next] cxgb4: do L1 config when module is inserted
From: David Miller @ 2018-05-21 16:21 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil, venkatesh, leedom
In-Reply-To: <1526894143-4986-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Mon, 21 May 2018 14:45:43 +0530

> trigger an L1 configure operation when a transceiver module
> is inserted in order to cause current "sticky" options like
> Requested Forward Error Correction to be reapplied.
> 
> Signed-off-by: Casey Leedom <leedom@chelsio.com>
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied, although:

> @@ -491,6 +491,9 @@ struct link_config {
>  
>  	unsigned char  link_ok;          /* link up? */
>  	unsigned char  link_down_rc;     /* link down reason */
> +
> +	unsigned char   new_module;	 /* ->OS Transceiver Module inserted */
> +	unsigned char   redo_l1cfg;	 /* ->CC redo current "sticky" L1 CFG */
>  };

The various booleans in link_config should be converted to use type 'bool'
and true/false values.

^ permalink raw reply

* [PATCH net] dccp: don't free ccid2_hc_tx_sock struct in dccp_disconnect()
From: Alexey Kodanev @ 2018-05-21 16:28 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Alexey Kodanev

Syzbot reported the use-after-free in timer_is_static_object() [1].

This can happen because the structure for the rto timer (ccid2_hc_tx_sock)
is removed in dccp_disconnect(), and ccid2_hc_tx_rto_expire() can be
called after that.

The report [1] is similar to the one in commit 120e9dabaf55 ("dccp:
defer ccid_hc_tx_delete() at dismantle time"). And the fix is the same,
delay freeing ccid2_hc_tx_sock structure, so that it is freed in
dccp_sk_destruct().

[1]
==================================================================
BUG: KASAN: use-after-free in timer_is_static_object+0x80/0x90
kernel/time/timer.c:607
Read of size 8 at addr ffff8801bebb5118 by task syz-executor2/25299

CPU: 1 PID: 25299 Comm: syz-executor2 Not tainted 4.17.0-rc5+ #54
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  <IRQ>
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1b9/0x294 lib/dump_stack.c:113
  print_address_description+0x6c/0x20b mm/kasan/report.c:256
  kasan_report_error mm/kasan/report.c:354 [inline]
  kasan_report.cold.7+0x242/0x2fe mm/kasan/report.c:412
  __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
  timer_is_static_object+0x80/0x90 kernel/time/timer.c:607
  debug_object_activate+0x2d9/0x670 lib/debugobjects.c:508
  debug_timer_activate kernel/time/timer.c:709 [inline]
  debug_activate kernel/time/timer.c:764 [inline]
  __mod_timer kernel/time/timer.c:1041 [inline]
  mod_timer+0x4d3/0x13b0 kernel/time/timer.c:1102
  sk_reset_timer+0x22/0x60 net/core/sock.c:2742
  ccid2_hc_tx_rto_expire+0x587/0x680 net/dccp/ccids/ccid2.c:147
  call_timer_fn+0x230/0x940 kernel/time/timer.c:1326
  expire_timers kernel/time/timer.c:1363 [inline]
  __run_timers+0x79e/0xc50 kernel/time/timer.c:1666
  run_timer_softirq+0x4c/0x70 kernel/time/timer.c:1692
  __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285
  invoke_softirq kernel/softirq.c:365 [inline]
  irq_exit+0x1d1/0x200 kernel/softirq.c:405
  exiting_irq arch/x86/include/asm/apic.h:525 [inline]
  smp_apic_timer_interrupt+0x17e/0x710 arch/x86/kernel/apic/apic.c:1052
  apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:863
  </IRQ>
...
Allocated by task 25374:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  kasan_kmalloc+0xc4/0xe0 mm/kasan/kasan.c:553
  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:490
  kmem_cache_alloc+0x12e/0x760 mm/slab.c:3554
  ccid_new+0x25b/0x3e0 net/dccp/ccid.c:151
  dccp_hdlr_ccid+0x27/0x150 net/dccp/feat.c:44
  __dccp_feat_activate+0x184/0x270 net/dccp/feat.c:344
  dccp_feat_activate_values+0x3a7/0x819 net/dccp/feat.c:1538
  dccp_create_openreq_child+0x472/0x610 net/dccp/minisocks.c:128
  dccp_v4_request_recv_sock+0x12c/0xca0 net/dccp/ipv4.c:408
  dccp_v6_request_recv_sock+0x125d/0x1f10 net/dccp/ipv6.c:415
  dccp_check_req+0x455/0x6a0 net/dccp/minisocks.c:197
  dccp_v4_rcv+0x7b8/0x1f3f net/dccp/ipv4.c:841
  ip_local_deliver_finish+0x2e3/0xd80 net/ipv4/ip_input.c:215
  NF_HOOK include/linux/netfilter.h:288 [inline]
  ip_local_deliver+0x1e1/0x720 net/ipv4/ip_input.c:256
  dst_input include/net/dst.h:450 [inline]
  ip_rcv_finish+0x81b/0x2200 net/ipv4/ip_input.c:396
  NF_HOOK include/linux/netfilter.h:288 [inline]
  ip_rcv+0xb70/0x143d net/ipv4/ip_input.c:492
  __netif_receive_skb_core+0x26f5/0x3630 net/core/dev.c:4592
  __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:4657
  process_backlog+0x219/0x760 net/core/dev.c:5337
  napi_poll net/core/dev.c:5735 [inline]
  net_rx_action+0x7b7/0x1930 net/core/dev.c:5801
  __do_softirq+0x2e0/0xaf5 kernel/softirq.c:285

Freed by task 25374:
  save_stack+0x43/0xd0 mm/kasan/kasan.c:448
  set_track mm/kasan/kasan.c:460 [inline]
  __kasan_slab_free+0x11a/0x170 mm/kasan/kasan.c:521
  kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
  __cache_free mm/slab.c:3498 [inline]
  kmem_cache_free+0x86/0x2d0 mm/slab.c:3756
  ccid_hc_tx_delete+0xc3/0x100 net/dccp/ccid.c:190
  dccp_disconnect+0x130/0xc66 net/dccp/proto.c:286
  dccp_close+0x3bc/0xe60 net/dccp/proto.c:1045
  inet_release+0x104/0x1f0 net/ipv4/af_inet.c:427
  inet6_release+0x50/0x70 net/ipv6/af_inet6.c:460
  sock_release+0x96/0x1b0 net/socket.c:594
  sock_close+0x16/0x20 net/socket.c:1149
  __fput+0x34d/0x890 fs/file_table.c:209
  ____fput+0x15/0x20 fs/file_table.c:243
  task_work_run+0x1e4/0x290 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:191 [inline]
  exit_to_usermode_loop+0x2bd/0x310 arch/x86/entry/common.c:166
  prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:265 [inline]
  do_syscall_64+0x6ac/0x800 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8801bebb4cc0
  which belongs to the cache ccid2_hc_tx_sock of size 1240
The buggy address is located 1112 bytes inside of
  1240-byte region [ffff8801bebb4cc0, ffff8801bebb5198)
The buggy address belongs to the page:
page:ffffea0006faed00 count:1 mapcount:0 mapping:ffff8801bebb41c0
index:0xffff8801bebb5240 compound_mapcount: 0
flags: 0x2fffc0000008100(slab|head)
raw: 02fffc0000008100 ffff8801bebb41c0 ffff8801bebb5240 0000000100000003
raw: ffff8801cdba3138 ffffea0007634120 ffff8801cdbaab40 0000000000000000
page dumped because: kasan: bad access detected
...
==================================================================

Reported-by: syzbot+5d47e9ec91a6f15dbd6f@syzkaller.appspotmail.com
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
---
 net/dccp/proto.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 84cd4e3..0d56e36 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -283,9 +283,7 @@ int dccp_disconnect(struct sock *sk, int flags)

 	dccp_clear_xmit_timers(sk);
 	ccid_hc_rx_delete(dp->dccps_hc_rx_ccid, sk);
-	ccid_hc_tx_delete(dp->dccps_hc_tx_ccid, sk);
 	dp->dccps_hc_rx_ccid = NULL;
-	dp->dccps_hc_tx_ccid = NULL;

 	__skb_queue_purge(&sk->sk_receive_queue);
 	__skb_queue_purge(&sk->sk_write_queue);
-- 
1.8.3.1

^ permalink raw reply related

* Re: [PACTH net-next] cxgb4: copy the length of cpl_tx_pkt_core to fw_wr
From: David Miller @ 2018-05-21 16:16 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil, venkatesh
In-Reply-To: <1526885796-13618-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Mon, 21 May 2018 12:26:36 +0530

> immdlen field of FW_ETH_TX_PKT_WR is filled in a wrong way,
> we must copy the length of all the cpls encapsulated in fw
> work request. In the xmit path we missed adding the length
> of CPL_TX_PKT_CORE but we added the length of WR_HDR and it
> worked because WR_HDR and CPL_TX_PKT_CORE are of same length.
> Add the length of cpl_tx_pkt_core not WR_HDR's. This also
> fixes the lso cpl errors for udp tunnels
> 
> Fixes: d0a1299c6bf7 ("cxgb4: add support for vxlan segmentation offload")
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: ethernet: Sort Kconfig sourcing alphabetically
From: David Miller @ 2018-05-21 16:15 UTC (permalink / raw)
  To: f.fainelli
  Cc: netdev, arnd, andrew, aviad.krawczyk, jaswinder.singh,
	hayashi.kunihiko, mdf, linus.walleij, alexandre.belloni,
	linux-kernel
In-Reply-To: <20180521035830.7897-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Sun, 20 May 2018 20:58:28 -0700

> A number of entries were not alphabetically sorted, remedy that.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] net: phy: phylink: Don't release NULL GPIO
From: David Miller @ 2018-05-21 16:14 UTC (permalink / raw)
  To: f.fainelli; +Cc: netdev, rmk+kernel, andrew, linux-kernel
In-Reply-To: <20180521034947.469-1-f.fainelli@gmail.com>

From: Florian Fainelli <f.fainelli@gmail.com>
Date: Sun, 20 May 2018 20:49:47 -0700

> If CONFIG_GPIOLIB is disabled, gpiod_put() becomes a stub that produces a
> warning, this helped identify that we could be attempting to release a NULL
> pl->link_gpio GPIO descriptor, so guard against that.
> 
> Fixes: daab3349ad1a ("net: phy: phylink: Release link GPIO")
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Applied.

^ permalink raw reply

* [PATCH v2 bpf-next 2/3] net/ipv6: Add helper to return path MTU based on fib result
From: dsahern @ 2018-05-21 16:08 UTC (permalink / raw)
  To: netdev, borkmann, ast; +Cc: davem, David Ahern
In-Reply-To: <20180521160816.7060-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Determine path MTU from a FIB lookup result. Logic is based on
ip6_dst_mtu_forward plus lookup of nexthop exception.

Add ip6_dst_mtu_forward to ipv6_stubs to handle access by core
bpf code.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/addrconf.h   |  2 ++
 include/net/ip6_fib.h    |  6 ++++++
 include/net/ip6_route.h  |  3 +++
 net/ipv6/addrconf_core.c |  8 ++++++++
 net/ipv6/af_inet6.c      |  1 +
 net/ipv6/route.c         | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 68 insertions(+)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index ff766ab207e0..c07d4dd09361 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -236,6 +236,8 @@ struct ipv6_stub {
 						   struct flowi6 *fl6, int oif,
 						   const struct sk_buff *skb,
 						   int strict);
+	u32 (*ip6_mtu_from_fib6)(struct fib6_info *f6i, struct in6_addr *daddr,
+				 struct in6_addr *saddr);
 
 	void (*udpv6_encap_enable)(void);
 	void (*ndisc_send_na)(struct net_device *dev, const struct in6_addr *daddr,
diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index cc70f6da8462..7897efe80727 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -412,6 +412,12 @@ static inline struct net_device *fib6_info_nh_dev(const struct fib6_info *f6i)
 	return f6i->fib6_nh.nh_dev;
 }
 
+static inline
+struct lwtunnel_state *fib6_info_nh_lwt(const struct fib6_info *f6i)
+{
+	return f6i->fib6_nh.nh_lwtstate;
+}
+
 void inet6_rt_notify(int event, struct fib6_info *rt, struct nl_info *info,
 		     unsigned int flags);
 
diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 4cf1ef935ed9..7b9c82de11cc 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -300,6 +300,9 @@ static inline unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
 	return mtu;
 }
 
+u32 ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr,
+		      struct in6_addr *saddr);
+
 struct neighbour *ip6_neigh_lookup(const struct in6_addr *gw,
 				   struct net_device *dev, struct sk_buff *skb,
 				   const void *daddr);
diff --git a/net/ipv6/addrconf_core.c b/net/ipv6/addrconf_core.c
index 2fe754fd4f5e..5cd0029d930e 100644
--- a/net/ipv6/addrconf_core.c
+++ b/net/ipv6/addrconf_core.c
@@ -161,12 +161,20 @@ eafnosupport_fib6_multipath_select(const struct net *net, struct fib6_info *f6i,
 	return f6i;
 }
 
+static u32
+eafnosupport_ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr,
+			       struct in6_addr *saddr)
+{
+	return 0;
+}
+
 const struct ipv6_stub *ipv6_stub __read_mostly = &(struct ipv6_stub) {
 	.ipv6_dst_lookup   = eafnosupport_ipv6_dst_lookup,
 	.fib6_get_table    = eafnosupport_fib6_get_table,
 	.fib6_table_lookup = eafnosupport_fib6_table_lookup,
 	.fib6_lookup       = eafnosupport_fib6_lookup,
 	.fib6_multipath_select = eafnosupport_fib6_multipath_select,
+	.ip6_mtu_from_fib6 = eafnosupport_ip6_mtu_from_fib6,
 };
 EXPORT_SYMBOL_GPL(ipv6_stub);
 
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 50de8b0d4f70..9ed0eae91758 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -894,6 +894,7 @@ static const struct ipv6_stub ipv6_stub_impl = {
 	.fib6_table_lookup = fib6_table_lookup,
 	.fib6_lookup       = fib6_lookup,
 	.fib6_multipath_select = fib6_multipath_select,
+	.ip6_mtu_from_fib6 = ip6_mtu_from_fib6,
 	.udpv6_encap_enable = udpv6_encap_enable,
 	.ndisc_send_na = ndisc_send_na,
 	.nd_tbl	= &nd_tbl,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index cc24ed3bc334..dc5d5c84dbef 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -2603,6 +2603,54 @@ static unsigned int ip6_mtu(const struct dst_entry *dst)
 	return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
 
+/* MTU selection:
+ * 1. mtu on route is locked - use it
+ * 2. mtu from nexthop exception
+ * 3. mtu from egress device
+ *
+ * based on ip6_dst_mtu_forward and exception logic of
+ * rt6_find_cached_rt; called with rcu_read_lock
+ */
+u32 ip6_mtu_from_fib6(struct fib6_info *f6i, struct in6_addr *daddr,
+		      struct in6_addr *saddr)
+{
+	struct rt6_exception_bucket *bucket;
+	struct rt6_exception *rt6_ex;
+	struct in6_addr *src_key;
+	struct inet6_dev *idev;
+	u32 mtu = 0;
+
+	if (unlikely(fib6_metric_locked(f6i, RTAX_MTU))) {
+		mtu = f6i->fib6_pmtu;
+		if (mtu)
+			goto out;
+	}
+
+	src_key = NULL;
+#ifdef CONFIG_IPV6_SUBTREES
+	if (f6i->fib6_src.plen)
+		src_key = saddr;
+#endif
+
+	bucket = rcu_dereference(f6i->rt6i_exception_bucket);
+	rt6_ex = __rt6_find_exception_rcu(&bucket, daddr, src_key);
+	if (rt6_ex && !rt6_check_expired(rt6_ex->rt6i))
+		mtu = dst_metric_raw(&rt6_ex->rt6i->dst, RTAX_MTU);
+
+	if (likely(!mtu)) {
+		struct net_device *dev = fib6_info_nh_dev(f6i);
+
+		mtu = IPV6_MIN_MTU;
+		idev = __in6_dev_get(dev);
+		if (idev && idev->cnf.mtu6 > mtu)
+			mtu = idev->cnf.mtu6;
+	}
+
+	mtu = min_t(unsigned int, mtu, IP6_MAX_MTU);
+out:
+	return mtu - lwtunnel_headroom(fib6_info_nh_lwt(f6i), mtu);
+}
+
 struct dst_entry *icmp6_dst_alloc(struct net_device *dev,
 				  struct flowi6 *fl6)
 {
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 bpf-next 3/3] bpf: Add mtu checking to FIB forwarding helper
From: dsahern @ 2018-05-21 16:08 UTC (permalink / raw)
  To: netdev, borkmann, ast; +Cc: davem, David Ahern
In-Reply-To: <20180521160816.7060-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Add check that egress MTU can handle packet to be forwarded. If
the MTU is less than the packet length, return 0 meaning the
packet is expected to continue up the stack for help - eg.,
fragmenting the packet or sending an ICMP.

The XDP path needs to leverage the FIB entry for an MTU on the
route spec or an exception entry for a given destination. The
skb path lets is_skb_forwardable decide if the packet can be
sent.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/core/filter.c | 42 +++++++++++++++++++++++++++++++++++-------
 1 file changed, 35 insertions(+), 7 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index aec5ebafb262..ba3ff5aa575a 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -4089,7 +4089,7 @@ static int bpf_fib_set_fwd_params(struct bpf_fib_lookup *params,
 
 #if IS_ENABLED(CONFIG_INET)
 static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
-			       u32 flags)
+			       u32 flags, bool check_mtu)
 {
 	struct in_device *in_dev;
 	struct neighbour *neigh;
@@ -4098,6 +4098,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct fib_nh *nh;
 	struct flowi4 fl4;
 	int err;
+	u32 mtu;
 
 	dev = dev_get_by_index_rcu(net, params->ifindex);
 	if (unlikely(!dev))
@@ -4149,6 +4150,12 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	if (res.fi->fib_nhs > 1)
 		fib_select_path(net, &res, &fl4, NULL);
 
+	if (check_mtu) {
+		mtu = ip_mtu_from_fib_result(&res, params->ipv4_dst);
+		if (params->tot_len > mtu)
+			return 0;
+	}
+
 	nh = &res.fi->fib_nh[res.nh_sel];
 
 	/* do not handle lwt encaps right now */
@@ -4177,7 +4184,7 @@ static int bpf_ipv4_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
-			       u32 flags)
+			       u32 flags, bool check_mtu)
 {
 	struct in6_addr *src = (struct in6_addr *) params->ipv6_src;
 	struct in6_addr *dst = (struct in6_addr *) params->ipv6_dst;
@@ -4188,6 +4195,7 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 	struct flowi6 fl6;
 	int strict = 0;
 	int oif;
+	u32 mtu;
 
 	/* link local addresses are never forwarded */
 	if (rt6_need_strict(dst) || rt6_need_strict(src))
@@ -4250,6 +4258,12 @@ static int bpf_ipv6_fib_lookup(struct net *net, struct bpf_fib_lookup *params,
 						       fl6.flowi6_oif, NULL,
 						       strict);
 
+	if (check_mtu) {
+		mtu = ipv6_stub->ip6_mtu_from_fib6(f6i, dst, src);
+		if (params->tot_len > mtu)
+			return 0;
+	}
+
 	if (f6i->fib6_nh.nh_lwtstate)
 		return 0;
 
@@ -4282,12 +4296,12 @@ BPF_CALL_4(bpf_xdp_fib_lookup, struct xdp_buff *, ctx,
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
 		return bpf_ipv4_fib_lookup(dev_net(ctx->rxq->dev), params,
-					   flags);
+					   flags, true);
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
 		return bpf_ipv6_fib_lookup(dev_net(ctx->rxq->dev), params,
-					   flags);
+					   flags, true);
 #endif
 	}
 	return 0;
@@ -4306,20 +4320,34 @@ static const struct bpf_func_proto bpf_xdp_fib_lookup_proto = {
 BPF_CALL_4(bpf_skb_fib_lookup, struct sk_buff *, skb,
 	   struct bpf_fib_lookup *, params, int, plen, u32, flags)
 {
+	struct net *net = dev_net(skb->dev);
+	int index = 0;
+
 	if (plen < sizeof(*params))
 		return -EINVAL;
 
 	switch (params->family) {
 #if IS_ENABLED(CONFIG_INET)
 	case AF_INET:
-		return bpf_ipv4_fib_lookup(dev_net(skb->dev), params, flags);
+		index = bpf_ipv4_fib_lookup(net, params, flags, false);
+		break;
 #endif
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
-		return bpf_ipv6_fib_lookup(dev_net(skb->dev), params, flags);
+		index = bpf_ipv6_fib_lookup(net, params, flags, false);
+		break;
 #endif
 	}
-	return -ENOTSUPP;
+
+	if (index > 0) {
+		struct net_device *dev;
+
+		dev = dev_get_by_index_rcu(net, index);
+		if (!is_skb_forwardable(dev, skb))
+			index = 0;
+	}
+
+	return index;
 }
 
 static const struct bpf_func_proto bpf_skb_fib_lookup_proto = {
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 bpf-next 1/3] net/ipv4: Add helper to return path MTU based on fib result
From: dsahern @ 2018-05-21 16:08 UTC (permalink / raw)
  To: netdev, borkmann, ast; +Cc: davem, David Ahern
In-Reply-To: <20180521160816.7060-1-dsahern@kernel.org>

From: David Ahern <dsahern@gmail.com>

Determine path MTU from a FIB lookup result. Logic is a distillation of
ip_dst_mtu_maybe_forward.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip_fib.h |  2 ++
 net/ipv4/route.c     | 31 +++++++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 81d0f2107ff1..69c91d1934c1 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -449,4 +449,6 @@ static inline void fib_proc_exit(struct net *net)
 }
 #endif
 
+u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr);
+
 #endif  /* _NET_FIB_H */
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 29268efad247..ac3b22bc51b2 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1352,6 +1352,37 @@ static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr)
 	return NULL;
 }
 
+/* MTU selection:
+ * 1. mtu on route is locked - use it
+ * 2. mtu from nexthop exception
+ * 3. mtu from egress device
+ */
+
+u32 ip_mtu_from_fib_result(struct fib_result *res, __be32 daddr)
+{
+	struct fib_info *fi = res->fi;
+	struct fib_nh *nh = &fi->fib_nh[res->nh_sel];
+	struct net_device *dev = nh->nh_dev;
+	u32 mtu = 0;
+
+	if (dev_net(dev)->ipv4.sysctl_ip_fwd_use_pmtu ||
+	    fi->fib_metrics->metrics[RTAX_LOCK - 1] & (1 << RTAX_MTU))
+		mtu = fi->fib_mtu;
+
+	if (likely(!mtu)) {
+		struct fib_nh_exception *fnhe;
+
+		fnhe = find_exception(nh, daddr);
+		if (fnhe && !time_after_eq(jiffies, fnhe->fnhe_expires))
+			mtu = fnhe->fnhe_pmtu;
+	}
+
+	if (likely(!mtu))
+		mtu = min(READ_ONCE(dev->mtu), IP_MAX_MTU);
+
+	return mtu - lwtunnel_headroom(nh->nh_lwtstate, mtu);
+}
+
 static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 			      __be32 daddr, const bool do_cache)
 {
-- 
2.11.0

^ permalink raw reply related

* [PATCH v2 bpf-next 0/3] bpf: Add MTU check to fib lookup helper
From: dsahern @ 2018-05-21 16:08 UTC (permalink / raw)
  To: netdev, borkmann, ast; +Cc: davem, David Ahern

From: David Ahern <dsahern@gmail.com>

Packets that exceed the egress MTU can not be forwarded in the fast path.
Add IPv4 and IPv6 MTU helpers that take a FIB lookup result (versus the
typical dst path) and add the calls to bpf_ipv{4,6}_fib_lookup.

v2
- add ip6_mtu_from_fib6 to ipv6_stub
- only call the new MTU helpers for fib lookups in XDP path; skb
  path uses is_skb_forwardable to determine if the packet can be
  sent via the egress device from the FIB lookup

David Ahern (3):
  net/ipv4: Add helper to return path MTU based on fib result
  net/ipv6: Add helper to return path MTU based on fib result
  bpf: Add mtu checking to FIB forwarding helper

 include/net/addrconf.h   |  2 ++
 include/net/ip6_fib.h    |  6 ++++++
 include/net/ip6_route.h  |  3 +++
 include/net/ip_fib.h     |  2 ++
 net/core/filter.c        | 42 +++++++++++++++++++++++++++++++++++-------
 net/ipv4/route.c         | 31 +++++++++++++++++++++++++++++++
 net/ipv6/addrconf_core.c |  8 ++++++++
 net/ipv6/af_inet6.c      |  1 +
 net/ipv6/route.c         | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 136 insertions(+), 7 deletions(-)

-- 
2.11.0

^ permalink raw reply

* Re: [PATCH net] sctp: fix the issue that flags are ignored when using kernel_connect
From: Marcelo Ricardo Leitner @ 2018-05-21 15:52 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, linux-sctp, davem, Neil Horman, mkubecek
In-Reply-To: <4863916c3e574b0d860725466d7d4a2f445fbe5b.1526805550.git.lucien.xin@gmail.com>

On Sun, May 20, 2018 at 04:39:10PM +0800, Xin Long wrote:
> Now sctp uses inet_dgram_connect as its proto_ops .connect, and the flags
> param can't be passed into its proto .connect where this flags is really
> needed.
> 
> sctp works around it by getting flags from socket file in __sctp_connect.
> It works for connecting from userspace, as inherently the user sock has
> socket file and it passes f_flags as the flags param into the proto_ops
> .connect.
> 
> However, the sock created by sock_create_kern doesn't have a socket file,
> and it passes the flags (like O_NONBLOCK) by using the flags param in
> kernel_connect, which calls proto_ops .connect later.
> 
> So to fix it, this patch defines a new proto_ops .connect for sctp,
> sctp_inet_connect, which calls __sctp_connect() directly with this
> flags param. After this, the sctp's proto .connect can be removed.
> 
> Note that sctp_inet_connect doesn't need to do some checks that are not
> needed for sctp, which makes thing better than with inet_dgram_connect.
> 
> Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

> ---
>  include/net/sctp/sctp.h |  2 ++
>  net/sctp/ipv6.c         |  2 +-
>  net/sctp/protocol.c     |  2 +-
>  net/sctp/socket.c       | 51 +++++++++++++++++++++++++++++++++----------------
>  4 files changed, 39 insertions(+), 18 deletions(-)
> 
> diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
> index 28b996d..35498e6 100644
> --- a/include/net/sctp/sctp.h
> +++ b/include/net/sctp/sctp.h
> @@ -103,6 +103,8 @@ void sctp_addr_wq_mgmt(struct net *, struct sctp_sockaddr_entry *, int);
>  /*
>   * sctp/socket.c
>   */
> +int sctp_inet_connect(struct socket *sock, struct sockaddr *uaddr,
> +		      int addr_len, int flags);
>  int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb);
>  int sctp_inet_listen(struct socket *sock, int backlog);
>  void sctp_write_space(struct sock *sk);
> diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
> index 4224711..0cd2e76 100644
> --- a/net/sctp/ipv6.c
> +++ b/net/sctp/ipv6.c
> @@ -1006,7 +1006,7 @@ static const struct proto_ops inet6_seqpacket_ops = {
>  	.owner		   = THIS_MODULE,
>  	.release	   = inet6_release,
>  	.bind		   = inet6_bind,
> -	.connect	   = inet_dgram_connect,
> +	.connect	   = sctp_inet_connect,
>  	.socketpair	   = sock_no_socketpair,
>  	.accept		   = inet_accept,
>  	.getname	   = sctp_getname,
> diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
> index d685f84..6bf0a99 100644
> --- a/net/sctp/protocol.c
> +++ b/net/sctp/protocol.c
> @@ -1012,7 +1012,7 @@ static const struct proto_ops inet_seqpacket_ops = {
>  	.owner		   = THIS_MODULE,
>  	.release	   = inet_release,	/* Needs to be wrapped... */
>  	.bind		   = inet_bind,
> -	.connect	   = inet_dgram_connect,
> +	.connect	   = sctp_inet_connect,
>  	.socketpair	   = sock_no_socketpair,
>  	.accept		   = inet_accept,
>  	.getname	   = inet_getname,	/* Semantics are different.  */
> diff --git a/net/sctp/socket.c b/net/sctp/socket.c
> index 80835ac..ae7e7c6 100644
> --- a/net/sctp/socket.c
> +++ b/net/sctp/socket.c
> @@ -1086,7 +1086,7 @@ static int sctp_setsockopt_bindx(struct sock *sk,
>   */
>  static int __sctp_connect(struct sock *sk,
>  			  struct sockaddr *kaddrs,
> -			  int addrs_size,
> +			  int addrs_size, int flags,
>  			  sctp_assoc_t *assoc_id)
>  {
>  	struct net *net = sock_net(sk);
> @@ -1104,7 +1104,6 @@ static int __sctp_connect(struct sock *sk,
>  	union sctp_addr *sa_addr = NULL;
>  	void *addr_buf;
>  	unsigned short port;
> -	unsigned int f_flags = 0;
>  
>  	sp = sctp_sk(sk);
>  	ep = sp->ep;
> @@ -1254,13 +1253,7 @@ static int __sctp_connect(struct sock *sk,
>  	sp->pf->to_sk_daddr(sa_addr, sk);
>  	sk->sk_err = 0;
>  
> -	/* in-kernel sockets don't generally have a file allocated to them
> -	 * if all they do is call sock_create_kern().
> -	 */
> -	if (sk->sk_socket->file)
> -		f_flags = sk->sk_socket->file->f_flags;
> -
> -	timeo = sock_sndtimeo(sk, f_flags & O_NONBLOCK);
> +	timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);
>  
>  	if (assoc_id)
>  		*assoc_id = asoc->assoc_id;
> @@ -1348,7 +1341,7 @@ static int __sctp_setsockopt_connectx(struct sock *sk,
>  				      sctp_assoc_t *assoc_id)
>  {
>  	struct sockaddr *kaddrs;
> -	int err = 0;
> +	int err = 0, flags = 0;
>  
>  	pr_debug("%s: sk:%p addrs:%p addrs_size:%d\n",
>  		 __func__, sk, addrs, addrs_size);
> @@ -1367,7 +1360,13 @@ static int __sctp_setsockopt_connectx(struct sock *sk,
>  	if (err)
>  		goto out_free;
>  
> -	err = __sctp_connect(sk, kaddrs, addrs_size, assoc_id);
> +	/* in-kernel sockets don't generally have a file allocated to them
> +	 * if all they do is call sock_create_kern().
> +	 */
> +	if (sk->sk_socket->file)
> +		flags = sk->sk_socket->file->f_flags;
> +
> +	err = __sctp_connect(sk, kaddrs, addrs_size, flags, assoc_id);
>  
>  out_free:
>  	kvfree(kaddrs);
> @@ -4397,16 +4396,26 @@ static int sctp_setsockopt(struct sock *sk, int level, int optname,
>   * len: the size of the address.
>   */
>  static int sctp_connect(struct sock *sk, struct sockaddr *addr,
> -			int addr_len)
> +			int addr_len, int flags)
>  {
> -	int err = 0;
> +	struct inet_sock *inet = inet_sk(sk);
>  	struct sctp_af *af;
> +	int err = 0;
>  
>  	lock_sock(sk);
>  
>  	pr_debug("%s: sk:%p, sockaddr:%p, addr_len:%d\n", __func__, sk,
>  		 addr, addr_len);
>  
> +	/* We may need to bind the socket. */
> +	if (!inet->inet_num) {
> +		if (sk->sk_prot->get_port(sk, 0)) {
> +			release_sock(sk);
> +			return -EAGAIN;
> +		}
> +		inet->inet_sport = htons(inet->inet_num);
> +	}
> +
>  	/* Validate addr_len before calling common connect/connectx routine. */
>  	af = sctp_get_af_specific(addr->sa_family);
>  	if (!af || addr_len < af->sockaddr_len) {
> @@ -4415,13 +4424,25 @@ static int sctp_connect(struct sock *sk, struct sockaddr *addr,
>  		/* Pass correct addr len to common routine (so it knows there
>  		 * is only one address being passed.
>  		 */
> -		err = __sctp_connect(sk, addr, af->sockaddr_len, NULL);
> +		err = __sctp_connect(sk, addr, af->sockaddr_len, flags, NULL);
>  	}
>  
>  	release_sock(sk);
>  	return err;
>  }
>  
> +int sctp_inet_connect(struct socket *sock, struct sockaddr *uaddr,
> +		      int addr_len, int flags)
> +{
> +	if (addr_len < sizeof(uaddr->sa_family))
> +		return -EINVAL;
> +
> +	if (uaddr->sa_family == AF_UNSPEC)
> +		return -EOPNOTSUPP;
> +
> +	return sctp_connect(sock->sk, uaddr, addr_len, flags);
> +}
> +
>  /* FIXME: Write comments. */
>  static int sctp_disconnect(struct sock *sk, int flags)
>  {
> @@ -8724,7 +8745,6 @@ struct proto sctp_prot = {
>  	.name        =	"SCTP",
>  	.owner       =	THIS_MODULE,
>  	.close       =	sctp_close,
> -	.connect     =	sctp_connect,
>  	.disconnect  =	sctp_disconnect,
>  	.accept      =	sctp_accept,
>  	.ioctl       =	sctp_ioctl,
> @@ -8767,7 +8787,6 @@ struct proto sctpv6_prot = {
>  	.name		= "SCTPv6",
>  	.owner		= THIS_MODULE,
>  	.close		= sctp_close,
> -	.connect	= sctp_connect,
>  	.disconnect	= sctp_disconnect,
>  	.accept		= sctp_accept,
>  	.ioctl		= sctp_ioctl,
> -- 
> 2.1.0
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [PATCH net-next] sctp: add support for SCTP_REUSE_PORT sockopt
From: Marcelo Ricardo Leitner @ 2018-05-21 15:51 UTC (permalink / raw)
  To: Michael Tuexen; +Cc: Neil Horman, Xin Long, network dev, linux-sctp, davem
In-Reply-To: <43A7D2C9-DFCE-4ADA-9ABB-B7ACD78C210B@fh-muenster.de>

On Mon, May 21, 2018 at 04:09:31PM +0200, Michael Tuexen wrote:
> > On 21. May 2018, at 15:48, Neil Horman <nhorman@tuxdriver.com> wrote:
> > 
> > On Mon, May 21, 2018 at 02:16:56PM +0200, Michael Tuexen wrote:
> >>> On 21. May 2018, at 13:39, Neil Horman <nhorman@tuxdriver.com> wrote:
> >>> 
> >>> On Sun, May 20, 2018 at 10:54:04PM -0300, Marcelo Ricardo Leitner wrote:
> >>>> On Sun, May 20, 2018 at 08:50:59PM -0400, Neil Horman wrote:
> >>>>> On Sat, May 19, 2018 at 03:44:40PM +0800, Xin Long wrote:
> >>>>>> This feature is actually already supported by sk->sk_reuse which can be
> >>>>>> set by SO_REUSEADDR. But it's not working exactly as RFC6458 demands in
> >>>>>> section 8.1.27, like:
> >>>>>> 
> >>>>>> - This option only supports one-to-one style SCTP sockets
> >>>>>> - This socket option must not be used after calling bind()
> >>>>>>   or sctp_bindx().
> >>>>>> 
> >>>>>> Besides, SCTP_REUSE_PORT sockopt should be provided for user's programs.
> >>>>>> Otherwise, the programs with SCTP_REUSE_PORT from other systems will not
> >>>>>> work in linux.
> >>>>>> 
> >>>>>> This patch reuses sk->sk_reuse and works pretty much as SO_REUSEADDR,
> >>>>>> just with some extra setup limitations that are neeeded when it is being
> >>>>>> enabled.
> >>>>>> 
> >>>>>> "It should be noted that the behavior of the socket-level socket option
> >>>>>> to reuse ports and/or addresses for SCTP sockets is unspecified", so it
> >>>>>> leaves SO_REUSEADDR as is for the compatibility.
> >>>>>> 
> >>>>>> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> >>>>>> ---
> >>>>>> include/uapi/linux/sctp.h |  1 +
> >>>>>> net/sctp/socket.c         | 48 +++++++++++++++++++++++++++++++++++++++++++++++
> >>>>>> 2 files changed, 49 insertions(+)
> >>>>>> 
> >>>>> A few things:
> >>>>> 
> >>>>> 1) I agree with Tom, this feature is a complete duplication of the SK_REUSEPORT
> >>>>> socket option.  I understand that this is an implementation of the option in the
> >>>>> RFC, but its definately a duplication of a feature, which makes several things
> >>>>> really messy.
> >>>>> 
> >>>>> 2) The overloading of the sk_reuse opeion is a bad idea, for several reasons.
> >>>>> Chief among them is the behavioral interference between this patch and the
> >>>>> SO_REUSEADDR socket level option, that also sets this feature.  If you set
> >>>>> sk_reuse via SO_REUSEADDR, you will set the SCTP port reuse feature regardless
> >>>>> of the bind or 1:1/1:m state of the socket.  Vice versa, if you set this socket
> >>>>> option via the SCTP_PORT_REUSE option you will inadvertently turn on address
> >>>>> reuse for the socket.  We can't do that.
> >>>> 
> >>>> Given your comments, going a bit further here, one other big
> >>>> implication is that a port would never be able to be considered to
> >>>> fully meet SCTP standards regarding reuse because a rogue application
> >>>> may always abuse of the socket level opt to gain access to the port.
> >>>> 
> >>>> IOW, the patch allows the application to use such restrictions against
> >>>> itself and nothing else, which undermines the patch idea.
> >>>> 
> >>> Agreed.
> >>> 
> >>>> I lack the knowledge on why the SCTP option was proposed in the RFC. I
> >>>> guess they had a good reason to add the restriction on 1:1/1:m style.
> >>>> Does the usage of the current imply in any risk to SCTP sockets? If
> >>>> yes, that would give some grounds for going forward with the SCTP
> >>>> option.
> >>>> 
> >>> I'm also not privy to why the sctp option was proposed, though I expect that the
> >>> lack of standardization of SO_REUSEPORT probably had something to do with it.
> >>> As for the reasoning behind restriction to only 1:1 sockets, if I had to guess,
> >>> I would say it likely because it creates ordering difficulty at the application
> >>> level.
> >>> 
> >>> CC-ing Michael Tuxen, who I believe had some input on this RFC.  Hopefully he
> >>> can shed some light on this.
> >> Dear all,
> >> 
> >> the reason this was added is to have a specified way to allow a system to
> >> behave like a client and server making use of the INIT collision.
> >> 
> >> For 1-to-many style sockets you can do this by creating a socket, binding it,
> >> calling listen on it and trying to connect to the peer.
> >> 
> >> For 1-to-1 style sockets you need two sockets for it. One listener and one
> >> you use to connect (and close it in case of failure, open a new one...).
> >> 
> >> It was not clear if one can achieve this with SO_REUSEPORT and/or SO_REUSEADDR
> >> on all platforms. We left that unspecified.
> >> 
> >> I hope this makes the intention clearer.
> >> 
> > I think it makes the intention clearer yes, but it unfortunately does nothing in
> > my mind to clarify how the implementation should best handle the potential
> > overlap in functionality.  What I see here is that we have two functional paths
> > (the SO_REUSEPORT path and the SCTP_PORT_REUSE path), which may or may not
> > (depending on the OS implementation achieve the same functional goal (allowing
> > multiple sockets to share a port while allowing one socket to listen and the
> > other connect to a remote peer).  If both implementations do the same thing on a
> > given platform, we can either just alias one to another and be done, but if they
> > don't then we either have to implement both paths, and ensure that the
> > SO_REUSEPORT path is a no-op/error return for SCTP sockets, or that each path
> > implements a distinct feature set that is cleaarly documented.
> > 
> > That said, I think we may be in luck.  Looking at the connect and listen paths,
> > it appears to me that:
> > 
> > 1) Sockets ignore SO_REUSEPORT in the connect and listen paths (save for any
> > autobinding) so it would appear that the intent of the SCTP rfc can be honored
> > via SO_REUSEPORT on linux.  
> > 
> > 2) SO_REUSEPORT prevents changing state after a bind has occured, so we can honr
> > that part of the SCTP RFC.
> > 
> > The only missing part is the restriction that SCTP_REUSE_PORT has which is
> > unaccounted for is that 1:M sockets aren't allowed to enable port reuse.
> > However, I think the implication from Michaels description above is that port
> > reuse on a 1:M socket is implicit because a single socket can connect and listen
> > in that use case, rather than there being a danger to doing so.
> > 
> > As such, I would propose that we implement this socket option by simply setting
> > the sk->sk_reuseport field in the sock structure, and document the fact that
> > linux does not restrict port reuse from 1:M sockets.
> > 
> > Thoughts?
> Sounds acceptable to me...

+1

> 
> Best regards
> Michael
> > Neil
> > 
> 

^ permalink raw reply

* Re: [PATCH net-next 0/2] net: sfp: small improvements
From: David Miller @ 2018-05-21 15:51 UTC (permalink / raw)
  To: antoine.tenart
  Cc: linux, netdev, linux-kernel, thomas.petazzoni, maxime.chevallier,
	gregory.clement, miquel.raynal, nadavh, stefanc, ymarkman, mw
In-Reply-To: <20180517082907.14420-1-antoine.tenart@bootlin.com>

From: Antoine Tenart <antoine.tenart@bootlin.com>
Date: Thu, 17 May 2018 10:29:05 +0200

> This series was part of the mvpp2 phylink one but as we reworked it to
> use fixed-link on the DB boards, the SFP commits weren't needed
> anymore for our use case. Two of the three patches still are needed I
> believe (I ditched the one about non-wired SFP cages), so they are sent
> here in a separate series.

Based upon the discussion of patch #1, it seems there is a desire to make
the i2c-bus property mandatory since it isn't clear if access to the SFP
module without it really all that doable.

^ permalink raw reply

* Re: [patch net-next] nfp: flower: set sysfs link to device for representors
From: David Miller @ 2018-05-21 15:49 UTC (permalink / raw)
  To: jiri
  Cc: netdev, jakub.kicinski, simon.horman, dirk.vandermerwe,
	john.hurley, pieter.jansenvanvuuren, oss-drivers
In-Reply-To: <20180517100520.23971-1-jiri@resnulli.us>

From: Jiri Pirko <jiri@resnulli.us>
Date: Thu, 17 May 2018 12:05:20 +0200

> From: Jiri Pirko <jiri@mellanox.com>
> 
> Do this so the sysfs has "device" link correctly set.
> 
> Signed-off-by: Jiri Pirko <jiri@mellanox.com>

Please sort out the non-PF representor issue with Or and Jakub.

Thanks.

^ permalink raw reply

* Re: [PATCH v2 net] stmmac: strip vlan tag on reception only for 8021q tagged frames
From: David Miller @ 2018-05-21 15:48 UTC (permalink / raw)
  To: eladv6; +Cc: makita.toshiaki, netdev, peppe.cavallaro, alexandre.torgue
In-Reply-To: <20180517.124356.1373521143004050823.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Thu, 17 May 2018 12:43:56 -0400 (EDT)

> Giuseppe and Alexandre, please review this patch.

If nobody thinks this patch is important enough to actually
review, I'm tossing it.

Sorry.

^ permalink raw reply

* Re: [PATCH net] tuntap: raise EPOLLOUT on device up
From: David Miller @ 2018-05-21 15:47 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, linux-kernel, mst, hannes, edumazet
In-Reply-To: <1526648443-24128-1-git-send-email-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Fri, 18 May 2018 21:00:43 +0800

> We return -EIO on device down but can not raise EPOLLOUT after it was
> up. This may confuse user like vhost which expects tuntap to raise
> EPOLLOUT to re-enable its TX routine after tuntap is down. This could
> be easily reproduced by transmitting packets from VM while down and up
> the tap device. Fixing this by set SOCKWQ_ASYNC_NOSPACE on -EIO.
> 
> Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
> Cc: Eric Dumazet <edumazet@google.com>
> Fixes: 1bd4978a88ac2 ("tun: honor IFF_UP in tun_get_user()")
> Signed-off-by: Jason Wang <jasowang@redhat.com>

I'm no so sure what to do with this patch.

Like Michael says, this flag bit is only checks upon transmit which
may or may not happen after this point.  It doesn't seem to be
guaranteed.

^ permalink raw reply

* Re: [PATCH net-next 7/7] net: dsa: qca8k: Remove rudundant parentheses
From: Florian Fainelli @ 2018-05-21 15:21 UTC (permalink / raw)
  To: Michal Vokáč, netdev, michal.vokac
  Cc: linux-kernel, devicetree, vivien.didelot, andrew, mark.rutland,
	robh+dt, davem
In-Reply-To: <1526909293-56377-8-git-send-email-michal.vokac@ysoft.com>



On 05/21/2018 06:28 AM, Michal Vokáč wrote:
> Fix warning reported by checkpatch.

Nit in the subject: should be redundant, with that:

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next 6/7] net: dsa: qca8k: Replace GPL boilerplate by SPDX
From: Florian Fainelli @ 2018-05-21 15:20 UTC (permalink / raw)
  To: Michal Vokáč, netdev, michal.vokac
  Cc: linux-kernel, devicetree, vivien.didelot, andrew, mark.rutland,
	robh+dt, davem
In-Reply-To: <1526909293-56377-7-git-send-email-michal.vokac@ysoft.com>



On 05/21/2018 06:28 AM, Michal Vokáč wrote:
> Signed-off-by: Michal Vokáč <michal.vokac@ysoft.com>

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

I don't know if we need all people who contributed to that driver to
agree on that, this is not a license change, so it should be okay I presume?

-- 
Florian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox