Netdev List


* [PATCH net-next 11/14] tcp: allow congestion control to expand send buffer differently
From: Neal Cardwell @ 2016-09-16 18:49 UTC
  To: David Miller
  Cc: netdev, Yuchung Cheng, Van Jacobson, Neal Cardwell,
	Nandita Dukkipati, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1474051743-13311-1-git-send-email-ncardwell@google.com>

From: Yuchung Cheng <ycheng@google.com>

Currently the TCP send buffer expands to twice cwnd, in order to allow
limited transmits in the CA_Recovery state. This assumes that cwnd
does not increase while in the CA_Recovery state.

For some congestion control algorithms, like the upcoming BBR module,
if the losses in recovery do not indicate congestion then we may
continue to raise cwnd multiplicatively in recovery. In such cases the
current multiplier will falsely limit the sending rate, much as if it
were limited by the application.

This commit adds an optional congestion control callback to use a
different multiplier to expand the TCP send buffer. For congestion
control modules that do not specify this callback, TCP continues to
use the previous default of 2.
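
As a rough, standalone illustration of the fallback described above (not
the kernel code itself), the selection can be sketched in userspace C.
The names cc_ops_sketch, sndbuf_multiplier and example_sndbuf_expand are
hypothetical, and the value 3 returned by the example callback is only an
illustrative choice:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct tcp_congestion_ops; only the
 * optional callback discussed here is modelled.
 */
struct cc_ops_sketch {
	uint32_t (*sndbuf_expand)(void *sk);	/* may be NULL */
};

/* Use the module's multiplier if it provides one, else the default of 2. */
uint32_t sndbuf_multiplier(const struct cc_ops_sketch *ops, void *sk)
{
	return ops->sndbuf_expand ? ops->sndbuf_expand(sk) : 2;
}

/* Example callback for a module that wants a larger send buffer.
 * The value 3 is illustrative only.
 */
uint32_t example_sndbuf_expand(void *sk)
{
	(void)sk;
	return 3;
}
```

A module that leaves the callback NULL keeps the existing behaviour, so
no existing congestion control module needs to change.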

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h    | 2 ++
 net/ipv4/tcp_input.c | 4 +++-
 2 files changed, 5 insertions(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8805c65..c4d2e46 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -917,6 +917,8 @@ struct tcp_congestion_ops {
 	void (*pkts_acked)(struct sock *sk, const struct ack_sample *sample);
 	/* suggest number of segments for each skb to transmit (optional) */
 	u32 (*tso_segs_goal)(struct sock *sk);
+	/* returns the multiplier used in tcp_sndbuf_expand (optional) */
+	u32 (*sndbuf_expand)(struct sock *sk);
 	/* get info for inet_diag (optional) */
 	size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
 			   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index df26af0..a134e66 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -289,6 +289,7 @@ static bool tcp_ecn_rcv_ecn_echo(const struct tcp_sock *tp, const struct tcphdr
 static void tcp_sndbuf_expand(struct sock *sk)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
+	const struct tcp_congestion_ops *ca_ops = inet_csk(sk)->icsk_ca_ops;
 	int sndmem, per_mss;
 	u32 nr_segs;
 
@@ -309,7 +310,8 @@ static void tcp_sndbuf_expand(struct sock *sk)
 	 * Cubic needs 1.7 factor, rounded to 2 to include
 	 * extra cushion (application might react slowly to POLLOUT)
 	 */
-	sndmem = 2 * nr_segs * per_mss;
+	sndmem = ca_ops->sndbuf_expand ? ca_ops->sndbuf_expand(sk) : 2;
+	sndmem *= nr_segs * per_mss;
 
 	if (sk->sk_sndbuf < sndmem)
 		sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2]);
-- 
2.8.0.rc3.226.g39d4020


* [PATCH net-next 14/14] tcp_bbr: add BBR congestion control
From: Neal Cardwell @ 2016-09-16 18:49 UTC
  To: David Miller
  Cc: netdev, Neal Cardwell, Van Jacobson, Yuchung Cheng,
	Nandita Dukkipati, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1474051743-13311-1-git-send-email-ncardwell@google.com>

This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.

The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately outdated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.

In 1981 Kleinrock and Gail showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.

While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.

In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely, it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.
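
To make the windowed max-filter idea concrete, here is a minimal
userspace sketch. It is deliberately simpler than the kernel's
linux/win_minmax.h helper that the patch includes, and does not
reproduce that implementation: it keeps one maximum per round trip in a
small ring and reports the largest value still inside the window. The
names max_filter, max_filter_update and max_filter_get are hypothetical;
the 10-round window matches the header comment in tcp_bbr.c.

```c
#include <stdint.h>

#define BW_WIN_ROUNDS 10	/* window length in round trips */

struct max_filter {
	uint32_t round[BW_WIN_ROUNDS];	/* round in which each slot was written */
	uint32_t best[BW_WIN_ROUNDS];	/* largest sample seen in that round */
};

/* Record a delivery-rate sample taken during the given round. */
void max_filter_update(struct max_filter *f, uint32_t round, uint32_t sample)
{
	uint32_t i = round % BW_WIN_ROUNDS;

	if (f->round[i] != round) {	/* slot holds an older round: reuse it */
		f->round[i] = round;
		f->best[i] = sample;
	} else if (sample > f->best[i]) {
		f->best[i] = sample;
	}
}

/* Return the largest sample from the last BW_WIN_ROUNDS rounds. */
uint32_t max_filter_get(const struct max_filter *f, uint32_t now_round)
{
	uint32_t i, max = 0;

	for (i = 0; i < BW_WIN_ROUNDS; i++)
		if (now_round - f->round[i] < BW_WIN_ROUNDS && f->best[i] > max)
			max = f->best[i];
	return max;
}
```

The structure is assumed to start zero-initialised. A min-filter for the
RTT can be built the same way, with the comparison reversed and the
window measured in seconds rather than rounds.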

Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.
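
The two controls can be worked through numerically. The sketch below is
an assumption-laden simplification: it uses bytes per second and
microseconds instead of the patch's scaled pkts/uS units, and omits the
extra TSO headroom that bbr_target_cwnd() adds. The function names are
hypothetical; the 8-bit fixed-point gain and the 4-packet floor follow
the patch.

```c
#include <stdint.h>

#define GAIN_SCALE 8	/* gains are fixed-point: 1.0 == 256 */

/* Pacing rate in bytes/sec: gain * estimated bottleneck bandwidth. */
uint64_t pacing_rate_bytes_per_sec(uint64_t bw_bytes_per_sec, uint32_t gain)
{
	return (bw_bytes_per_sec * gain) >> GAIN_SCALE;
}

/* cwnd in packets: gain * BDP, rounded up to whole packets,
 * with a floor of 4 packets.
 */
uint32_t cwnd_pkts(uint64_t bw_bytes_per_sec, uint32_t min_rtt_us,
		   uint32_t mss_bytes, uint32_t gain)
{
	uint64_t bdp_bytes = bw_bytes_per_sec * min_rtt_us / 1000000;
	uint64_t target = (bdp_bytes * gain) >> GAIN_SCALE;
	uint64_t pkts = (target + mss_bytes - 1) / mss_bytes;

	return pkts < 4 ? 4 : (uint32_t)pkts;
}
```

For example, at 10 Mbit/s (1250000 bytes/sec) with a 40 ms min RTT the
BDP is 50000 bytes; with a cwnd gain of 2.0 (512) and 1250-byte packets
the target is 80 packets, while a pacing gain of 1.25 (320) yields
1562500 bytes/sec.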

When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.
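
The "bandwidth has stopped growing" test can be sketched as follows,
using the thresholds from the patch's tunables (growth of at least 1.25x
per round, three rounds without such growth). This is an illustrative
userspace sketch, not the patch's own function; startup_state and
startup_pipe_full are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

struct startup_state {
	uint32_t full_bw;	/* bandwidth baseline for the growth check */
	uint32_t full_bw_cnt;	/* consecutive rounds without enough growth */
};

/* Call once per round trip with the current bandwidth estimate.
 * Returns true once the pipe is estimated to be full.
 */
bool startup_pipe_full(struct startup_state *s, uint32_t bw)
{
	/* Still at least 1.25x the baseline? Then keep probing. */
	if ((uint64_t)bw * 4 >= (uint64_t)s->full_bw * 5) {
		s->full_bw = bw;
		s->full_bw_cnt = 0;
		return false;
	}
	if (s->full_bw_cnt < 3)
		s->full_bw_cnt++;
	return s->full_bw_cnt >= 3;
}
```

Requiring several flat rounds rather than one makes the exit from
STARTUP less sensitive to a single noisy bandwidth sample.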

Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that was created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).
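
The PROBE_BW gain cycle can be illustrated with the eight gain values
the patch defines (5/4, 3/4, then six phases at 1.0, in 8-bit fixed
point). The helper names below are hypothetical, and the sketch leaves
out the conditions under which a phase ends; it only shows the cycle
itself and that its gains average to exactly 1.0.

```c
#include <stdint.h>

#define GAIN_UNIT 256	/* fixed-point 1.0 */
#define CYCLE_LEN 8	/* number of phases in a pacing gain cycle */

static const uint32_t pacing_gain_cycle[CYCLE_LEN] = {
	GAIN_UNIT * 5 / 4,	/* probe for more bandwidth */
	GAIN_UNIT * 3 / 4,	/* drain any queue the probe created */
	GAIN_UNIT, GAIN_UNIT, GAIN_UNIT,	/* cruise at estimated bw */
	GAIN_UNIT, GAIN_UNIT, GAIN_UNIT,
};

/* Pacing gain for a given phase index. */
uint32_t cycle_gain(uint32_t idx)
{
	return pacing_gain_cycle[idx & (CYCLE_LEN - 1)];
}

/* Advance to the next phase, wrapping after the last one. */
uint32_t next_cycle_idx(uint32_t idx)
{
	return (idx + 1) & (CYCLE_LEN - 1);
}

/* Average gain over one full cycle, in GAIN_UNIT fixed point. */
uint32_t cycle_average_gain(void)
{
	uint32_t i, sum = 0;

	for (i = 0; i < CYCLE_LEN; i++)
		sum += pacing_gain_cycle[i];
	return sum / CYCLE_LEN;
}
```

Because the probe and drain gains offset each other, the average pacing
rate over a full cycle equals the estimated bandwidth.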

Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.

Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:

  https://groups.google.com/forum/#!forum/bbr-dev

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/uapi/linux/inet_diag.h |  13 +
 net/ipv4/Kconfig               |  18 +
 net/ipv4/Makefile              |   1 +
 net/ipv4/tcp_bbr.c             | 875 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 907 insertions(+)
 create mode 100644 net/ipv4/tcp_bbr.c

diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index b5c366f..509cd96 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -124,6 +124,7 @@ enum {
 	INET_DIAG_PEERS,
 	INET_DIAG_PAD,
 	INET_DIAG_MARK,
+	INET_DIAG_BBRINFO,
 	__INET_DIAG_MAX,
 };
 
@@ -157,8 +158,20 @@ struct tcp_dctcp_info {
 	__u32	dctcp_ab_tot;
 };
 
+/* INET_DIAG_BBRINFO */
+
+struct tcp_bbr_info {
+	/* u64 bw: max-filtered BW (app throughput) estimate in bytes per sec: */
+	__u32	bbr_bw_lo;		/* lower 32 bits of bw */
+	__u32	bbr_bw_hi;		/* upper 32 bits of bw */
+	__u32	bbr_min_rtt;		/* min-filtered RTT in uSec */
+	__u32	bbr_pacing_gain;	/* pacing gain shifted left 8 bits */
+	__u32	bbr_cwnd_gain;		/* cwnd gain shifted left 8 bits */
+};
+
 union tcp_cc_info {
 	struct tcpvegas_info	vegas;
 	struct tcp_dctcp_info	dctcp;
+	struct tcp_bbr_info	bbr;
 };
 #endif /* _UAPI_INET_DIAG_H_ */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 50d6a9b..300b068 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -640,6 +640,21 @@ config TCP_CONG_CDG
 	  D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
 	  delay gradients." In Networking 2011. Preprint: http://goo.gl/No3vdg
 
+config TCP_CONG_BBR
+	tristate "BBR TCP"
+	default n
+	---help---
+
+	BBR (Bottleneck Bandwidth and RTT) TCP congestion control aims to
+	maximize network utilization and minimize queues. It builds an explicit
+	model of the bottleneck delivery rate and path round-trip
+	propagation delay. It tolerates packet loss and delay unrelated to
+	congestion. It can operate over LAN, WAN, cellular, wifi, or cable
+	modem links. It can coexist with flows that use loss-based congestion
+	control, and can operate with shallow buffers, deep buffers,
+	bufferbloat, policers, or AQM schemes that do not provide a delay
+	signal. It requires the fq ("Fair Queue") pacing packet scheduler.
+
 choice
 	prompt "Default TCP congestion control"
 	default DEFAULT_CUBIC
@@ -674,6 +689,9 @@ choice
 	config DEFAULT_CDG
 		bool "CDG" if TCP_CONG_CDG=y
 
+	config DEFAULT_BBR
+		bool "BBR" if TCP_CONG_BBR=y
+
 	config DEFAULT_RENO
 		bool "Reno"
 endchoice
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 9cfff1a..bc6a6c8 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_INET_DIAG) += inet_diag.o
 obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
 obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
 obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
+obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o
 obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
 obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
 obj-$(CONFIG_TCP_CONG_CUBIC) += tcp_cubic.o
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
new file mode 100644
index 0000000..34cde81
--- /dev/null
+++ b/net/ipv4/tcp_bbr.c
@@ -0,0 +1,875 @@
+/* Bottleneck Bandwidth and RTT (BBR) congestion control
+ *
+ * BBR congestion control computes the sending rate based on the delivery
+ * rate (throughput) estimated from ACKs. In a nutshell:
+ *
+ *   On each ACK, update our model of the network path:
+ *      bottleneck_bandwidth = windowed_max(delivered / elapsed, 10 round trips)
+ *      min_rtt = windowed_min(rtt, 10 seconds)
+ *   pacing_rate = pacing_gain * bottleneck_bandwidth
+ *   cwnd = max(cwnd_gain * bottleneck_bandwidth * min_rtt, 4)
+ *
+ * The core algorithm does not react directly to packet losses or delays,
+ * although BBR may adjust the size of next send per ACK when loss is
+ * observed, or adjust the sending rate if it estimates there is a
+ * traffic policer, in order to keep the drop rate reasonable.
+ *
+ * BBR is described in detail in:
+ *   "BBR: Congestion-Based Congestion Control",
+ *   Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh,
+ *   Van Jacobson. ACM Queue, Vol. 14 No. 5, September-October 2016.
+ *
+ * There is a public e-mail list for discussing BBR development and testing:
+ *   https://groups.google.com/forum/#!forum/bbr-dev
+ *
+ * NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing enabled,
+ * since pacing is integral to the BBR design and implementation.
+ * BBR without pacing would not function properly, and may incur
+ * unnecessarily high packet loss rates.
+ */
+#include <linux/module.h>
+#include <net/tcp.h>
+#include <linux/inet_diag.h>
+#include <linux/inet.h>
+#include <linux/random.h>
+#include <linux/win_minmax.h>
+
+/* Scale factor for rate in pkt/uSec unit to avoid truncation in bandwidth
+ * estimation. The rate unit ~= (1500 bytes / 1 usec / 2^24) ~= 715 bps.
+ * This handles bandwidths from 0.06pps (715bps) to 256Mpps (3Tbps) in a u32.
+ * Since the minimum window is >=4 packets, the lower bound isn't
+ * an issue. The upper bound isn't an issue with existing technologies.
+ */
+#define BW_SCALE 24
+#define BW_UNIT (1 << BW_SCALE)
+
+#define BBR_SCALE 8	/* scaling factor for fractions in BBR (e.g. gains) */
+#define BBR_UNIT (1 << BBR_SCALE)
+
+/* BBR has the following modes for deciding how fast to send: */
+enum bbr_mode {
+	BBR_STARTUP,	/* ramp up sending rate rapidly to fill pipe */
+	BBR_DRAIN,	/* drain any queue created during startup */
+	BBR_PROBE_BW,	/* discover, share bw: pace around estimated bw */
+	BBR_PROBE_RTT,	/* cut cwnd to min to probe min_rtt */
+};
+
+/* BBR congestion control block */
+struct bbr {
+	u32	min_rtt_us;	        /* min RTT in min_rtt_win_sec window */
+	u32	min_rtt_stamp;	        /* timestamp of min_rtt_us */
+	u32	probe_rtt_done_stamp;   /* end time for BBR_PROBE_RTT mode */
+	struct minmax bw;	/* Max recent delivery rate in pkts/uS << 24 */
+	u32	rtt_cnt;	    /* count of packet-timed rounds elapsed */
+	u32     next_rtt_delivered; /* scb->tx.delivered at end of round */
+	struct skb_mstamp cycle_mstamp;  /* time of this cycle phase start */
+	u32     mode:3,		     /* current bbr_mode in state machine */
+		prev_ca_state:3,     /* CA state on previous ACK */
+		packet_conservation:1,  /* use packet conservation? */
+		restore_cwnd:1,	     /* decided to revert cwnd to old value */
+		round_start:1,	     /* start of packet-timed tx->ack round? */
+		tso_segs_goal:7,     /* segments we want in each skb we send */
+		idle_restart:1,	     /* restarting after idle? */
+		probe_rtt_round_done:1,  /* a BBR_PROBE_RTT round at 4 pkts? */
+		unused:5,
+		lt_is_sampling:1,    /* taking long-term ("LT") samples now? */
+		lt_rtt_cnt:7,	     /* round trips in long-term interval */
+		lt_use_bw:1;	     /* use lt_bw as our bw estimate? */
+	u32	lt_bw;		     /* LT est delivery rate in pkts/uS << 24 */
+	u32	lt_last_delivered;   /* LT intvl start: tp->delivered */
+	u32	lt_last_stamp;	     /* LT intvl start: tp->delivered_mstamp */
+	u32	lt_last_lost;	     /* LT intvl start: tp->lost */
+	u32	pacing_gain:10,	/* current gain for setting pacing rate */
+		cwnd_gain:10,	/* current gain for setting cwnd */
+		full_bw_cnt:3,	/* number of rounds without large bw gains */
+		cycle_idx:3,	/* current index in pacing_gain cycle array */
+		unused_b:6;
+	u32	prior_cwnd;	/* prior cwnd upon entering loss recovery */
+	u32	full_bw;	/* recent bw, to estimate if pipe is full */
+};
+
+#define CYCLE_LEN	8	/* number of phases in a pacing gain cycle */
+
+static int bbr_bw_rtts	= CYCLE_LEN + 2; /* win len of bw filter (in rounds) */
+static u32 bbr_min_rtt_win_sec = 10;	 /* min RTT filter window (in sec) */
+static u32 bbr_probe_rtt_mode_ms = 200;	 /* min ms at cwnd=4 in BBR_PROBE_RTT */
+static int bbr_min_tso_rate	= 1200000;  /* skip TSO below here (bits/sec) */
+
+/* We use a high_gain value chosen to allow a smoothly increasing pacing rate
+ * that will double each RTT and send the same number of packets per RTT that
+ * an un-paced, slow-starting Reno or CUBIC flow would.
+ */
+static int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;	/* 2/ln(2) */
+static int bbr_drain_gain = BBR_UNIT * 1000 / 2885;	/* 1/high_gain */
+static int bbr_cwnd_gain  = BBR_UNIT * 2;	/* gain for steady-state cwnd */
+/* The pacing_gain values for the PROBE_BW gain cycle: */
+static int bbr_pacing_gain[] = { BBR_UNIT * 5 / 4, BBR_UNIT * 3 / 4,
+				 BBR_UNIT, BBR_UNIT, BBR_UNIT,
+				 BBR_UNIT, BBR_UNIT, BBR_UNIT };
+static u32 bbr_cycle_rand = 7;  /* randomize gain cycling phase over N phases */
+
+/* Try to keep at least this many packets in flight, if things go smoothly. For
+ * smooth functioning, a sliding window protocol ACKing every other packet
+ * needs at least 4 packets in flight.
+ */
+static u32 bbr_cwnd_min_target	= 4;
+
+/* To estimate if BBR_STARTUP mode (i.e. high_gain) has filled pipe. */
+static u32 bbr_full_bw_thresh = BBR_UNIT * 5 / 4;  /* bw up 1.25x per round? */
+static u32 bbr_full_bw_cnt    = 3;    /* N rounds w/o bw growth -> pipe full */
+
+/* "long-term" ("LT") bandwidth estimator parameters: */
+static bool bbr_lt_bw_estimator = true;	/* use the long-term bw estimate? */
+static u32 bbr_lt_intvl_min_rtts = 4;	/* min rounds in sampling interval */
+static u32 bbr_lt_loss_thresh = 50;	/*  lost/delivered > 20% -> "lossy" */
+static u32 bbr_lt_conv_thresh = BBR_UNIT / 8;  /* bw diff <= 12.5% -> "close" */
+static u32 bbr_lt_bw_max_rtts	= 48;	/* max # of round trips using lt_bw */
+
+/* Do we estimate that STARTUP filled the pipe? */
+static bool bbr_full_bw_reached(const struct sock *sk)
+{
+	const struct bbr *bbr = inet_csk_ca(sk);
+
+	return bbr->full_bw_cnt >= bbr_full_bw_cnt;
+}
+
+/* Return the windowed max recent bandwidth sample, in pkts/uS << BW_SCALE. */
+static u32 bbr_max_bw(const struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	return minmax_get(&bbr->bw);
+}
+
+/* Return the estimated bandwidth of the path, in pkts/uS << BW_SCALE. */
+static u32 bbr_bw(const struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	return bbr->lt_use_bw ? bbr->lt_bw : bbr_max_bw(sk);
+}
+
+/* Return rate in bytes per second, optionally with a gain.
+ * The order here is chosen carefully to avoid overflow of u64. This should
+ * work for input rates of up to 2.9Tbit/sec and gain of 2.89x.
+ */
+static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
+{
+	rate *= tcp_mss_to_mtu(sk, tcp_sk(sk)->mss_cache);
+	rate *= gain;
+	rate >>= BBR_SCALE;
+	rate *= USEC_PER_SEC;
+	return rate >> BW_SCALE;
+}
+
+static u64 bbr_rate_kbps(struct sock *sk, u64 rate)
+{
+	return bbr_rate_bytes_per_sec(sk, rate, BBR_UNIT) * 8 / 1000;
+}
+
+/* Pace using current bw estimate and a gain factor. */
+static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 rate = bw;
+
+	rate = bbr_rate_bytes_per_sec(sk, rate, gain);
+	rate = min_t(u64, rate, sk->sk_max_pacing_rate);
+	if (bbr->mode != BBR_STARTUP || rate > sk->sk_pacing_rate)
+		sk->sk_pacing_rate = rate;
+}
+
+/* Return count of segments we want in the skbs we send, or 0 for default. */
+static u32 bbr_tso_segs_goal(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	return bbr->tso_segs_goal;
+}
+
+static void bbr_set_tso_segs_goal(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 min_segs;
+
+	min_segs = sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
+	bbr->tso_segs_goal = min(tcp_tso_autosize(sk, tp->mss_cache, min_segs),
+				 0x7FU);
+}
+
+/* Save "last known good" cwnd so we can restore it after losses or PROBE_RTT */
+static void bbr_save_cwnd(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (bbr->prev_ca_state < TCP_CA_Recovery && bbr->mode != BBR_PROBE_RTT)
+		bbr->prior_cwnd = tp->snd_cwnd;  /* this cwnd is good enough */
+	else  /* loss recovery or BBR_PROBE_RTT have temporarily cut cwnd */
+		bbr->prior_cwnd = max(bbr->prior_cwnd, tp->snd_cwnd);
+}
+
+static void bbr_cwnd_event(struct sock *sk, enum tcp_ca_event event)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (event == CA_EVENT_TX_START && tp->app_limited) {
+		bbr->idle_restart = 1;
+		/* Avoid pointless buffer overflows: pace at est. bw if we don't
+		 * need more speed (we're restarting from idle and app-limited).
+		 */
+		if (bbr->mode == BBR_PROBE_BW)
+			bbr_set_pacing_rate(sk, bbr_bw(sk), BBR_UNIT);
+	}
+}
+
+/* Find target cwnd. Right-size the cwnd based on min RTT and the
+ * estimated bottleneck bandwidth:
+ *
+ * cwnd = bw * min_rtt * gain = BDP * gain
+ *
+ * The key factor, gain, controls the amount of queue. While a small gain
+ * builds a smaller queue, it becomes more vulnerable to noise in RTT
+ * measurements (e.g., delayed ACKs or other ACK compression effects). This
+ * noise may cause BBR to under-estimate the rate.
+ *
+ * To achieve full performance in high-speed paths, we budget enough cwnd to
+ * fit full-sized skbs in-flight on both end hosts to fully utilize the path:
+ *   - one skb in sending host Qdisc,
+ *   - one skb in sending host TSO/GSO engine
+ *   - one skb being received by receiver host LRO/GRO/delayed-ACK engine
+ * Don't worry, at low rates (bbr_min_tso_rate) this won't bloat cwnd because
+ * in such cases tso_segs_goal is 1. The minimum cwnd is 4 packets,
+ * which allows 2 outstanding 2-packet sequences, to try to keep pipe
+ * full even with ACK-every-other-packet delayed ACKs.
+ */
+static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 cwnd;
+	u64 w;
+
+	/* If we've never had a valid RTT sample, cap cwnd at the initial
+	 * default. This should only happen when the connection is not using TCP
+	 * timestamps and has retransmitted all of the SYN/SYNACK/data packets
+	 * ACKed so far. In this case, an RTO can cut cwnd to 1, in which
+	 * case we need to slow-start up toward something safe: TCP_INIT_CWND.
+	 */
+	if (unlikely(bbr->min_rtt_us == ~0U))	 /* no valid RTT samples yet? */
+		return TCP_INIT_CWND;  /* be safe: cap at default initial cwnd*/
+
+	w = (u64)bw * bbr->min_rtt_us;
+
+	/* Apply a gain to the given value, then remove the BW_SCALE shift. */
+	cwnd = (((w * gain) >> BBR_SCALE) + BW_UNIT - 1) / BW_UNIT;
+
+	/* Allow enough full-sized skbs in flight to utilize end systems. */
+	cwnd += 3 * bbr->tso_segs_goal;
+
+	/* Reduce delayed ACKs by rounding up cwnd to the next even number. */
+	cwnd = (cwnd + 1) & ~1U;
+
+	return cwnd;
+}
+
+/* An optimization in BBR to reduce losses: On the first round of recovery, we
+ * follow the packet conservation principle: send P packets per P packets acked.
+ * After that, we slow-start and send at most 2*P packets per P packets acked.
+ * After recovery finishes, or upon undo, we restore the cwnd we had when
+ * recovery started (capped by the target cwnd based on estimated BDP).
+ *
+ * TODO(ycheng/ncardwell): implement a rate-based approach.
+ */
+static bool bbr_set_cwnd_to_recover_or_restore(
+	struct sock *sk, const struct rate_sample *rs, u32 acked, u32 *new_cwnd)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u8 prev_state = bbr->prev_ca_state, state = inet_csk(sk)->icsk_ca_state;
+	u32 cwnd = tp->snd_cwnd;
+
+	/* An ACK for P pkts should release at most 2*P packets. We do this
+	 * in two steps. First, here we deduct the number of lost packets.
+	 * Then, in bbr_set_cwnd() we slow start up toward the target cwnd.
+	 */
+	if (rs->losses > 0)
+		cwnd = max_t(s32, cwnd - rs->losses, 1);
+
+	if (state == TCP_CA_Recovery && prev_state != TCP_CA_Recovery) {
+		/* Starting 1st round of Recovery, so do packet conservation. */
+		bbr->packet_conservation = 1;
+		bbr->next_rtt_delivered = tp->delivered;  /* start round now */
+		/* Cut unused cwnd from app behavior, TSQ, or TSO deferral: */
+		cwnd = tcp_packets_in_flight(tp) + acked;
+	} else if (prev_state >= TCP_CA_Recovery && state < TCP_CA_Recovery) {
+		/* Exiting loss recovery; restore cwnd saved before recovery. */
+		bbr->restore_cwnd = 1;
+		bbr->packet_conservation = 0;
+	}
+	bbr->prev_ca_state = state;
+
+	if (bbr->restore_cwnd) {
+		/* Restore cwnd after exiting loss recovery or PROBE_RTT. */
+		cwnd = max(cwnd, bbr->prior_cwnd);
+		bbr->restore_cwnd = 0;
+	}
+
+	if (bbr->packet_conservation) {
+		*new_cwnd = max(cwnd, tcp_packets_in_flight(tp) + acked);
+		return true;	/* yes, using packet conservation */
+	}
+	*new_cwnd = cwnd;
+	return false;
+}
+
+/* Slow-start up toward target cwnd (if bw estimate is growing, or packet loss
+ * has drawn us down below target), or snap down to target if we're above it.
+ */
+static void bbr_set_cwnd(struct sock *sk, const struct rate_sample *rs,
+			 u32 acked, u32 bw, int gain)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 cwnd = 0, target_cwnd = 0;
+
+	if (!acked)
+		return;
+
+	if (bbr_set_cwnd_to_recover_or_restore(sk, rs, acked, &cwnd))
+		goto done;
+
+	/* If we're below target cwnd, slow start cwnd toward target cwnd. */
+	target_cwnd = bbr_target_cwnd(sk, bw, gain);
+	if (bbr_full_bw_reached(sk))  /* only cut cwnd if we filled the pipe */
+		cwnd = min(cwnd + acked, target_cwnd);
+	else if (cwnd < target_cwnd || tp->delivered < TCP_INIT_CWND)
+		cwnd = cwnd + acked;
+	cwnd = max(cwnd, bbr_cwnd_min_target);
+
+done:
+	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);	/* apply global cap */
+	if (bbr->mode == BBR_PROBE_RTT)  /* drain queue, refresh min_rtt */
+		tp->snd_cwnd = min(tp->snd_cwnd, bbr_cwnd_min_target);
+}
+
+/* End cycle phase if it's time and/or we hit the phase's in-flight target. */
+static bool bbr_is_next_cycle_phase(struct sock *sk,
+				    const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	bool is_full_length =
+		skb_mstamp_us_delta(&tp->delivered_mstamp, &bbr->cycle_mstamp) >
+		bbr->min_rtt_us;
+	u32 inflight, bw;
+
+	/* The pacing_gain of 1.0 paces at the estimated bw to try to fully
+	 * use the pipe without increasing the queue.
+	 */
+	if (bbr->pacing_gain == BBR_UNIT)
+		return is_full_length;		/* just use wall clock time */
+
+	inflight = rs->prior_in_flight;  /* what was in-flight before ACK? */
+	bw = bbr_max_bw(sk);
+
+	/* A pacing_gain > 1.0 probes for bw by trying to raise inflight to at
+	 * least pacing_gain*BDP; this may take more than min_rtt if min_rtt is
+	 * small (e.g. on a LAN). We do not persist if packets are lost, since
+	 * a path with small buffers may not hold that much.
+	 */
+	if (bbr->pacing_gain > BBR_UNIT)
+		return is_full_length &&
+			(rs->losses ||  /* perhaps pacing_gain*BDP won't fit */
+			 inflight >= bbr_target_cwnd(sk, bw, bbr->pacing_gain));
+
+	/* A pacing_gain < 1.0 tries to drain extra queue we added if bw
+	 * probing didn't find more bw. If inflight falls to match BDP then we
+	 * estimate queue is drained; persisting would underutilize the pipe.
+	 */
+	return is_full_length ||
+		inflight <= bbr_target_cwnd(sk, bw, BBR_UNIT);
+}
+
+static void bbr_advance_cycle_phase(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
+	bbr->cycle_mstamp = tp->delivered_mstamp;
+	bbr->pacing_gain = bbr_pacing_gain[bbr->cycle_idx];
+}
+
+/* Gain cycling: cycle pacing gain to converge to fair share of available bw. */
+static void bbr_update_cycle_phase(struct sock *sk,
+				   const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if ((bbr->mode == BBR_PROBE_BW) && !bbr->lt_use_bw &&
+	    bbr_is_next_cycle_phase(sk, rs))
+		bbr_advance_cycle_phase(sk);
+}
+
+static void bbr_reset_startup_mode(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->mode = BBR_STARTUP;
+	bbr->pacing_gain = bbr_high_gain;
+	bbr->cwnd_gain	 = bbr_high_gain;
+}
+
+static void bbr_reset_probe_bw_mode(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->mode = BBR_PROBE_BW;
+	bbr->pacing_gain = BBR_UNIT;
+	bbr->cwnd_gain = bbr_cwnd_gain;
+	bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand);
+	bbr_advance_cycle_phase(sk);	/* flip to next phase of gain cycle */
+}
+
+static void bbr_reset_mode(struct sock *sk)
+{
+	if (!bbr_full_bw_reached(sk))
+		bbr_reset_startup_mode(sk);
+	else
+		bbr_reset_probe_bw_mode(sk);
+}
+
+/* Start a new long-term sampling interval. */
+static void bbr_reset_lt_bw_sampling_interval(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->lt_last_stamp = tp->delivered_mstamp.stamp_jiffies;
+	bbr->lt_last_delivered = tp->delivered;
+	bbr->lt_last_lost = tp->lost;
+	bbr->lt_rtt_cnt = 0;
+}
+
+/* Completely reset long-term bandwidth sampling. */
+static void bbr_reset_lt_bw_sampling(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->lt_bw = 0;
+	bbr->lt_use_bw = 0;
+	bbr->lt_is_sampling = false;
+	bbr_reset_lt_bw_sampling_interval(sk);
+}
+
+/* Long-term bw sampling interval is done. Estimate whether we're policed. */
+static void bbr_lt_bw_interval_done(struct sock *sk, u32 bw)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 diff;
+
+	if (bbr->lt_bw &&  /* do we have bw from a previous interval? */
+	    bbr_lt_bw_estimator) {  /* using long-term bw estimator enabled? */
+		/* Is new bw close to the lt_bw from the previous interval? */
+		diff = abs(bw - bbr->lt_bw);
+		if ((diff * BBR_UNIT <= bbr_lt_conv_thresh * bbr->lt_bw) ||
+		    (bbr_rate_kbps(sk, diff) <= 4)) {  /* diff <= 4 Kbit/sec? */
+			/* All criteria are met; estimate we're policed. */
+			bbr->lt_bw = (bw + bbr->lt_bw) >> 1;  /* avg 2 intvls */
+			bbr->lt_use_bw = 1;
+			bbr->pacing_gain = BBR_UNIT;  /* try to avoid drops */
+			bbr->lt_rtt_cnt = 0;
+			return;
+		}
+	}
+	bbr->lt_bw = bw;
+	bbr_reset_lt_bw_sampling_interval(sk);
+}
+
+/* Token-bucket traffic policers are common (see "An Internet-Wide Analysis of
+ * Traffic Policing", SIGCOMM 2016). BBR detects token-bucket policers and
+ * explicitly models their policed rate, to reduce unnecessary losses. We
+ * estimate that we're policed if we see 2 consecutive sampling intervals with
+ * consistent throughput and high packet loss. If we think we're being policed,
+ * set lt_bw to the "long-term" average delivery rate from those 2 intervals.
+ */
+static void bbr_lt_bw_sampling(struct sock *sk, const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 lost, delivered;
+	u64 bw;
+	s32 t;
+
+	if (bbr->lt_use_bw) {	/* already using long-term rate, lt_bw? */
+		if (bbr->mode == BBR_PROBE_BW && bbr->round_start &&
+		    ++bbr->lt_rtt_cnt >= bbr_lt_bw_max_rtts) {
+			bbr_reset_lt_bw_sampling(sk);    /* stop using lt_bw */
+			bbr_reset_probe_bw_mode(sk);  /* restart gain cycling */
+		}
+		return;
+	}
+
+	/* Wait for the first loss before sampling, to let the policer exhaust
+	 * its tokens and estimate the steady-state rate allowed by the policer.
+	 * Starting samples earlier includes bursts that over-estimate the bw.
+	 */
+	if (!bbr->lt_is_sampling) {
+		if (!rs->losses)
+			return;
+		bbr_reset_lt_bw_sampling_interval(sk);
+		bbr->lt_is_sampling = true;
+	}
+
+	/* To avoid underestimates, reset sampling if we run out of data. */
+	if (rs->is_app_limited) {
+		bbr_reset_lt_bw_sampling(sk);
+		return;
+	}
+
+	if (bbr->round_start)
+		bbr->lt_rtt_cnt++;	/* count round trips in this interval */
+	if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
+		return;		/* sampling interval needs to be longer */
+	if (bbr->lt_rtt_cnt > 4 * bbr_lt_intvl_min_rtts) {
+		bbr_reset_lt_bw_sampling(sk);  /* interval is too long */
+		return;
+	}
+
+	/* End sampling interval when a packet is lost, so we estimate the
+	 * policer tokens were exhausted. Stopping the sampling before the
+	 * tokens are exhausted under-estimates the policed rate.
+	 */
+	if (!rs->losses)
+		return;
+
+	/* Calculate packets lost and delivered in sampling interval. */
+	lost = tp->lost - bbr->lt_last_lost;
+	delivered = tp->delivered - bbr->lt_last_delivered;
+	/* Is loss rate (lost/delivered) >= lt_loss_thresh? If not, wait. */
+	if (!delivered || (lost << BBR_SCALE) < bbr_lt_loss_thresh * delivered)
+		return;
+
+	/* Find average delivery rate in this sampling interval. */
+	t = (s32)(tp->delivered_mstamp.stamp_jiffies - bbr->lt_last_stamp);
+	if (t < 1)
+		return;		/* interval is less than one jiffy, so wait */
+	t = jiffies_to_usecs(t);
+	/* Interval long enough for jiffies_to_usecs() to return a bogus 0? */
+	if (t < 1) {
+		bbr_reset_lt_bw_sampling(sk);  /* interval too long; reset */
+		return;
+	}
+	bw = (u64)delivered * BW_UNIT;
+	do_div(bw, t);
+	bbr_lt_bw_interval_done(sk, bw);
+}
+
+/* Estimate the bandwidth based on how fast packets are delivered */
+static void bbr_update_bw(struct sock *sk, const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 bw;
+
+	bbr->round_start = 0;
+	if (rs->delivered < 0 || rs->interval_us <= 0)
+		return; /* Not a valid observation */
+
+	/* See if we've reached the next RTT */
+	if (!before(rs->prior_delivered, bbr->next_rtt_delivered)) {
+		bbr->next_rtt_delivered = tp->delivered;
+		bbr->rtt_cnt++;
+		bbr->round_start = 1;
+		bbr->packet_conservation = 0;
+	}
+
+	bbr_lt_bw_sampling(sk, rs);
+
+	/* Divide delivered by the interval to find a (lower bound) bottleneck
+	 * bandwidth sample. Delivered is in packets and interval_us in uS and
+	 * ratio will be <<1 for most connections. So delivered is first scaled.
+	 */
+	bw = (u64)rs->delivered * BW_UNIT;
+	do_div(bw, rs->interval_us);
+
+	/* If this sample is application-limited, it is likely to have a very
+	 * low delivered count that represents application behavior rather than
+	 * the available network rate. Such a sample could drag down estimated
+	 * bw, causing needless slow-down. Thus, to continue to send at the
+	 * last measured network rate, we filter out app-limited samples unless
+	 * they describe the path bw at least as well as our bw model.
+	 *
+	 * So the goal during app-limited phase is to proceed with the best
+	 * network rate no matter how long. We automatically leave this
+	 * phase when app writes faster than the network can deliver :)
+	 */
+	if (!rs->is_app_limited || bw >= bbr_max_bw(sk)) {
+		/* Incorporate new sample into our max bw filter. */
+		minmax_running_max(&bbr->bw, bbr_bw_rtts, bbr->rtt_cnt, bw);
+	}
+}
+
+/* Estimate when the pipe is full, using the change in delivery rate: BBR
+ * estimates that STARTUP filled the pipe if the estimated bw hasn't changed by
+ * at least bbr_full_bw_thresh (25%) after bbr_full_bw_cnt (3) non-app-limited
+ * rounds. Why 3 rounds: 1: rwin autotuning grows the rwin, 2: we fill the
+ * higher rwin, 3: we get higher delivery rate samples. Or transient
+ * cross-traffic or radio noise can go away. CUBIC Hystart shares a similar
+ * design goal, but uses delay and inter-ACK spacing instead of bandwidth.
+ */
+static void bbr_check_full_bw_reached(struct sock *sk,
+				      const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 bw_thresh;
+
+	if (bbr_full_bw_reached(sk) || !bbr->round_start || rs->is_app_limited)
+		return;
+
+	bw_thresh = (u64)bbr->full_bw * bbr_full_bw_thresh >> BBR_SCALE;
+	if (bbr_max_bw(sk) >= bw_thresh) {
+		bbr->full_bw = bbr_max_bw(sk);
+		bbr->full_bw_cnt = 0;
+		return;
+	}
+	++bbr->full_bw_cnt;
+}
+
+/* If pipe is probably full, drain the queue and then enter steady-state. */
+static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
+		bbr->mode = BBR_DRAIN;	/* drain queue we created */
+		bbr->pacing_gain = bbr_drain_gain;	/* pace slow to drain */
+		bbr->cwnd_gain = bbr_high_gain;	/* maintain cwnd */
+	}	/* fall through to check if in-flight is already small: */
+	if (bbr->mode == BBR_DRAIN &&
+	    tcp_packets_in_flight(tcp_sk(sk)) <=
+	    bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT))
+		bbr_reset_probe_bw_mode(sk);  /* we estimate queue is drained */
+}
+
+/* The goal of PROBE_RTT mode is to have BBR flows cooperatively and
+ * periodically drain the bottleneck queue, to converge to measure the true
+ * min_rtt (unloaded propagation delay). This allows the flows to keep queues
+ * small (reducing queuing delay and packet loss) and achieve fairness among
+ * BBR flows.
+ *
+ * The min_rtt filter window is 10 seconds. When the min_rtt estimate expires,
+ * we enter PROBE_RTT mode and cap the cwnd at bbr_cwnd_min_target=4 packets.
+ * After at least bbr_probe_rtt_mode_ms=200ms and at least one packet-timed
+ * round trip elapsed with that flight size <= 4, we leave PROBE_RTT mode and
+ * re-enter the previous mode. BBR uses 200ms to approximately bound the
+ * performance penalty of PROBE_RTT's cwnd capping to roughly 2% (200ms/10s).
+ *
+ * Note that flows need only pay 2% if they are busy sending over the last 10
+ * seconds. Interactive applications (e.g., Web, RPCs, video chunks) often have
+ * natural silences or low-rate periods within 10 seconds where the rate is low
+ * enough for long enough to drain its queue in the bottleneck. We pick up
+ * these min RTT measurements opportunistically with our min_rtt filter. :-)
+ */
+static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	bool filter_expired;
+
+	/* Track min RTT seen in the min_rtt_win_sec filter window: */
+	filter_expired = after(tcp_time_stamp,
+			       bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
+	if (rs->rtt_us >= 0 &&
+	    (rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
+		bbr->min_rtt_us = rs->rtt_us;
+		bbr->min_rtt_stamp = tcp_time_stamp;
+	}
+
+	if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
+	    !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
+		bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
+		bbr->pacing_gain = BBR_UNIT;
+		bbr->cwnd_gain = BBR_UNIT;
+		bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
+		bbr->probe_rtt_done_stamp = 0;
+	}
+
+	if (bbr->mode == BBR_PROBE_RTT) {
+		/* Ignore low rate samples during this mode. */
+		tp->app_limited =
+			(tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
+		/* Maintain min packets in flight for max(200 ms, 1 round). */
+		if (!bbr->probe_rtt_done_stamp &&
+		    tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {
+			bbr->probe_rtt_done_stamp = tcp_time_stamp +
+				msecs_to_jiffies(bbr_probe_rtt_mode_ms);
+			bbr->probe_rtt_round_done = 0;
+			bbr->next_rtt_delivered = tp->delivered;
+		} else if (bbr->probe_rtt_done_stamp) {
+			if (bbr->round_start)
+				bbr->probe_rtt_round_done = 1;
+			if (bbr->probe_rtt_round_done &&
+			    after(tcp_time_stamp, bbr->probe_rtt_done_stamp)) {
+				bbr->min_rtt_stamp = tcp_time_stamp;
+				bbr->restore_cwnd = 1;  /* snap to prior_cwnd */
+				bbr_reset_mode(sk);
+			}
+		}
+	}
+	bbr->idle_restart = 0;
+}
+
+static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
+{
+	bbr_update_bw(sk, rs);
+	bbr_update_cycle_phase(sk, rs);
+	bbr_check_full_bw_reached(sk, rs);
+	bbr_check_drain(sk, rs);
+	bbr_update_min_rtt(sk, rs);
+}
+
+static void bbr_main(struct sock *sk, const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 bw;
+
+	bbr_update_model(sk, rs);
+
+	bw = bbr_bw(sk);
+	bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
+	bbr_set_tso_segs_goal(sk);
+	bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
+}
+
+static void bbr_init(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 bw;
+
+	bbr->prior_cwnd = 0;
+	bbr->tso_segs_goal = 0;	 /* default segs per skb until first ACK */
+	bbr->rtt_cnt = 0;
+	bbr->next_rtt_delivered = 0;
+	bbr->prev_ca_state = TCP_CA_Open;
+	bbr->packet_conservation = 0;
+
+	bbr->probe_rtt_done_stamp = 0;
+	bbr->probe_rtt_round_done = 0;
+	bbr->min_rtt_us = tcp_min_rtt(tp);
+	bbr->min_rtt_stamp = tcp_time_stamp;
+
+	minmax_reset(&bbr->bw, bbr->rtt_cnt, 0);  /* init max bw to 0 */
+
+	/* Initialize pacing rate to: high_gain * init_cwnd / RTT. */
+	bw = (u64)tp->snd_cwnd * BW_UNIT;
+	do_div(bw, (tp->srtt_us >> 3) ? : USEC_PER_MSEC);
+	sk->sk_pacing_rate = 0;		/* force an update of sk_pacing_rate */
+	bbr_set_pacing_rate(sk, bw, bbr_high_gain);
+
+	bbr->restore_cwnd = 0;
+	bbr->round_start = 0;
+	bbr->idle_restart = 0;
+	bbr->full_bw = 0;
+	bbr->full_bw_cnt = 0;
+	bbr->cycle_mstamp.v64 = 0;
+	bbr->cycle_idx = 0;
+	bbr_reset_lt_bw_sampling(sk);
+	bbr_reset_startup_mode(sk);
+}
+
+static u32 bbr_sndbuf_expand(struct sock *sk)
+{
+	/* Provision 3 * cwnd since BBR may slow-start even during recovery. */
+	return 3;
+}
+
+/* In theory BBR does not need to undo the cwnd since it does not
+ * always reduce cwnd on losses (see bbr_main()). Keep it for now.
+ */
+static u32 bbr_undo_cwnd(struct sock *sk)
+{
+	return tcp_sk(sk)->snd_cwnd;
+}
+
+/* Entering loss recovery, so save cwnd for when we exit or undo recovery. */
+static u32 bbr_ssthresh(struct sock *sk)
+{
+	bbr_save_cwnd(sk);
+	return TCP_INFINITE_SSTHRESH;	 /* BBR does not use ssthresh */
+}
+
+static size_t bbr_get_info(struct sock *sk, u32 ext, int *attr,
+			   union tcp_cc_info *info)
+{
+	if (ext & (1 << (INET_DIAG_BBRINFO - 1)) ||
+	    ext & (1 << (INET_DIAG_VEGASINFO - 1))) {
+		struct tcp_sock *tp = tcp_sk(sk);
+		struct bbr *bbr = inet_csk_ca(sk);
+		u64 bw = bbr_bw(sk);
+
+		bw = bw * tp->mss_cache * USEC_PER_SEC >> BW_SCALE;
+		memset(&info->bbr, 0, sizeof(info->bbr));
+		info->bbr.bbr_bw_lo		= (u32)bw;
+		info->bbr.bbr_bw_hi		= (u32)(bw >> 32);
+		info->bbr.bbr_min_rtt		= bbr->min_rtt_us;
+		info->bbr.bbr_pacing_gain	= bbr->pacing_gain;
+		info->bbr.bbr_cwnd_gain		= bbr->cwnd_gain;
+		*attr = INET_DIAG_BBRINFO;
+		return sizeof(info->bbr);
+	}
+	return 0;
+}
+
+static void bbr_set_state(struct sock *sk, u8 new_state)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (new_state == TCP_CA_Loss) {
+		struct rate_sample rs = { .prior_mstamp.v64 = 0, .losses = 1 };
+
+		bbr->prev_ca_state = TCP_CA_Loss;
+		bbr->full_bw = 0;
+		bbr->round_start = 1;	/* treat RTO like end of a round */
+		bbr_lt_bw_sampling(sk, &rs);
+	}
+}
+
+static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
+	.flags		= TCP_CONG_NON_RESTRICTED,
+	.name		= "bbr",
+	.owner		= THIS_MODULE,
+	.init		= bbr_init,
+	.cong_control	= bbr_main,
+	.sndbuf_expand	= bbr_sndbuf_expand,
+	.undo_cwnd	= bbr_undo_cwnd,
+	.cwnd_event	= bbr_cwnd_event,
+	.ssthresh	= bbr_ssthresh,
+	.tso_segs_goal	= bbr_tso_segs_goal,
+	.get_info	= bbr_get_info,
+	.set_state	= bbr_set_state,
+};
+
+static int __init bbr_register(void)
+{
+	BUILD_BUG_ON(sizeof(struct bbr) > ICSK_CA_PRIV_SIZE);
+	return tcp_register_congestion_control(&tcp_bbr_cong_ops);
+}
+
+static void __exit bbr_unregister(void)
+{
+	tcp_unregister_congestion_control(&tcp_bbr_cong_ops);
+}
+
+module_init(bbr_register);
+module_exit(bbr_unregister);
+
+MODULE_AUTHOR("Van Jacobson <vanj@google.com>");
+MODULE_AUTHOR("Neal Cardwell <ncardwell@google.com>");
+MODULE_AUTHOR("Yuchung Cheng <ycheng@google.com>");
+MODULE_AUTHOR("Soheil Hassas Yeganeh <soheil@google.com>");
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("TCP BBR (Bottleneck Bandwidth and RTT)");
-- 
2.8.0.rc3.226.g39d4020
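
For readers following the patch above, here is a minimal user-space sketch of
two steps from bbr_update_bw() and bbr_check_full_bw_reached(): scaling a
delivery-rate sample into fixed point, and the "no 25% growth for 3 rounds"
test that ends STARTUP. The scale constants (BW_SCALE = 24, BBR_SCALE = 8) and
the threshold values are assumptions taken from definitions outside this
excerpt. struct pipe_state, bw_sample() and full_bw_check() are hypothetical
names used only for illustration; they are not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed fixed-point scales (defined earlier in tcp_bbr.c, not shown here):
 * bandwidth uses 24 fractional bits, gains use 8.
 */
#define BW_SCALE	24
#define BW_UNIT		(1ULL << BW_SCALE)
#define BBR_SCALE	8
#define BBR_UNIT	(1U << BBR_SCALE)

/* Assumed thresholds matching the "25%" and "3" quoted in the comment. */
#define FULL_BW_THRESH	(BBR_UNIT * 5 / 4)
#define FULL_BW_CNT	3

/* Hypothetical stand-in for the relevant fields of struct bbr. */
struct pipe_state {
	uint32_t full_bw;	/* baseline bw at the last significant growth */
	uint32_t full_bw_cnt;	/* rounds seen without significant growth */
};

/* Delivery rate in packets per usec, scaled by BW_UNIT. Scaling first keeps
 * precision, because the raw ratio is far below 1 on most paths.
 */
static uint32_t bw_sample(uint32_t delivered_pkts, uint32_t interval_us)
{
	assert(interval_us > 0);
	return (uint32_t)((uint64_t)delivered_pkts * BW_UNIT / interval_us);
}

/* Call once per non-app-limited round with the current max-filtered bw.
 * Returns 1 once the pipe is estimated to be full.
 */
static int full_bw_check(struct pipe_state *s, uint32_t max_bw)
{
	uint32_t thresh;

	if (s->full_bw_cnt >= FULL_BW_CNT)
		return 1;	/* already decided */

	thresh = (uint32_t)(((uint64_t)s->full_bw * FULL_BW_THRESH) >> BBR_SCALE);
	if (max_bw >= thresh) {
		s->full_bw = max_bw;	/* still growing: move the baseline */
		s->full_bw_cnt = 0;
		return 0;
	}
	return ++s->full_bw_cnt >= FULL_BW_CNT;
}
```

Starting from a zeroed struct pipe_state, a caller would feed full_bw_check()
one bandwidth value per round trip and leave STARTUP when it returns 1.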

* Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
From: Cyrill Gorcunov @ 2016-09-16 19:00 UTC (permalink / raw)
  To: David Ahern, Eric Dumazet
  Cc: netdev, linux-kernel, David Miller, kuznet, jmorris, yoshfuji,
	kaber, avagin, stephen
In-Reply-To: <20160916070623.GD1867@uranus.lan>

On Fri, Sep 16, 2016 at 10:06:23AM +0300, Cyrill Gorcunov wrote:
> On Thu, Sep 15, 2016 at 05:45:02PM -0600, David Ahern wrote:
> > > 
> > > Try to be selective in the -K , do not kill tcp sockets ?
> > 
> > I am running
> >    ss -aKw 'dev == red'
> > 
> > to kill raw sockets bound to device named 'red'.
> 
> Thanks David, Eric! I'll play with this option today and report the results.

I created a veth pair and bound a raw socket to it.

[root@pcs7 iproute2]# misc/ss -A raw
State      Recv-Q Send-Q                                Local Address:Port                                                 Peer Address:Port                
ESTAB      0      0                                         127.0.0.1:ipproto-255                                            127.0.0.10:ipproto-9090         
UNCONN     0      0                                        127.0.0.10:ipproto-255                                                     *:*                    
UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
ESTAB      0      0                                               ::1:ipproto-255                                                   ::1:ipproto-9091         
UNCONN     0      0                                           ::1%vm1:ipproto-255                                                    :::*                    
[root@pcs7 iproute2]# 

[root@pcs7 iproute2]# misc/ss -aKw 'dev == vm1'
State      Recv-Q Send-Q                                Local Address:Port                                                 Peer Address:Port                
UNCONN     0      0                                           ::1%vm1:ipproto-255                                                    :::*                    

[root@pcs7 iproute2]# misc/ss -A raw
State      Recv-Q Send-Q                                Local Address:Port                                                 Peer Address:Port                
ESTAB      0      0                                         127.0.0.1:ipproto-255                                            127.0.0.10:ipproto-9090         
UNCONN     0      0                                        127.0.0.10:ipproto-255                                                     *:*                    
UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
ESTAB      0      0                                               ::1:ipproto-255                                                   ::1:ipproto-9091         

so it gets zapped out. Is there some other way to test it?

* XDP_TX bug report on mlx4
From: Jesper Dangaard Brouer @ 2016-09-16 19:03 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: brouer, netdev@vger.kernel.org, Tariq Toukan, Alexei Starovoitov,
	Tom Herbert, Saeed Mahameed, Rana Shahout, Eran Ben Elisha

Hi Brenden,

I've discovered a bug with XDP_TX recycling of pages in the mlx4 driver.

If I increase the number of RX and TX queues/channels via ethtool cmd:
 ethtool -L mlx4p1 rx 10 tx 10

Then when running the xdp2 program, which does XDP_TX, the kernel will
crash with page errors, because the page refcnt goes to zero or even
minus.  I've noticed pages delivered to mlx4_en_rx_recycle() can have
a page refcnt of zero, which is wrong, they should always have 1 (for
XDP).

Debugging it further, I find that this can happen when mlx4_en_rx_recycle()
is called from mlx4_en_recycle_tx_desc().  This is the TX cleanup function,
associated with TX ring queues used for XDP_TX only. No others than the
XDP_TX action should be able to place packets into these TX rings
which call mlx4_en_recycle_tx_desc().

Do you have any idea of what could be going wrong in this case?

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

* Re: [PATCH] net: ipv6: fallback to full lookup if table lookup is unsuitable
From: Vincent Bernat @ 2016-09-16 19:15 UTC (permalink / raw)
  To: David Ahern
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <adff2e07-75e1-d44c-4799-0920ffcd2622@cumulusnetworks.com>

 ❦ 16 septembre 2016 20:36 CEST, David Ahern <dsa@cumulusnetworks.com> :

>> contained a non-connected route (like a default gateway) fails while it
>> was previously working:
>> 
>>     $ ip link add eth0 type dummy
>>     $ ip link set up dev eth0
>>     $ ip addr add 2001:db8::1/64 dev eth0
>>     $ ip route add ::/0 via 2001:db8::5 dev eth0 table 20
>>     $ ip route add 2001:db8:cafe::1/128 via 2001:db8::6 dev eth0 table 20
>>     RTNETLINK answers: No route to host
>>     $ ip -6 route show table 20
>>     default via 2001:db8::5 dev eth0  metric 1024  pref medium
>
> so your table 20 is not complete in that it lacks a connected route to
> resolve 2001:db8::6 as a nexthop, so you are relying on a fallback to
> other tables (main in this case).

Yes.

>> @@ -1991,33 +2015,15 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
>>  			if (!(gwa_type & IPV6_ADDR_UNICAST))
>>  				goto out;
>>  
>> +			err = -EHOSTUNREACH;
>>  			if (cfg->fc_table)
>>  				grt = ip6_nh_lookup_table(net, cfg, gw_addr);
>
> -----8<-----
>
>> -			if (!(grt->rt6i_flags & RTF_GATEWAY))
>> -				err = 0;
>
> This is the check that is failing for your use
> case. ip6_nh_lookup_table is returning the default route and nexthops
> can not rely on a gateway. Given that a simpler and more direct change
> is (whitespace mangled on paste):
>
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index ad4a7ff301fc..48bae2ee2e18 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1991,9 +1991,19 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
>                         if (!(gwa_type & IPV6_ADDR_UNICAST))
>                                 goto out;
>
> -                       if (cfg->fc_table)
> +                       if (cfg->fc_table) {
>                                 grt = ip6_nh_lookup_table(net, cfg, gw_addr);
>
> +                               /* a nexthop lookup can not go through a gw.
> +                                * if this happens on a table based lookup
> +                                * then fallback to a full lookup
> +                                */
> +                               if (grt && grt->rt6i_flags & RTF_GATEWAY) {
> +                                       ip6_rt_put(grt);
> +                                       grt = NULL;
> +                               }
> +                       }
> +
>                         if (!grt)
>                                 grt = rt6_lookup(net, gw_addr, NULL,
>                                                  cfg->fc_ifindex, 1);

OK. Should the dev check be dismissed or do we add "dev && dev !=
grt->dst.dev" just as a safety net (this would be a convoluted setup,
but the correct direct route could be in an ip rule with higher priority
while the one in this table is incorrect)?
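
To make the fallback under discussion concrete, here is a toy user-space model
of the lookup order in the quoted diff: try the configured table first, discard
the result if it is itself a gateway route, then fall back to a full lookup.
Everything here is an assumption for illustration: struct toy_route, the
lookup_fn callbacks and the two sample lookups are hypothetical stand-ins for
struct rt6_info, ip6_nh_lookup_table() and rt6_lookup(), and reference counting
(ip6_rt_put()) is left out.

```c
#include <stddef.h>

#define RTF_GATEWAY	0x0002	/* same bit as the kernel flag; illustrative */

/* Hypothetical, simplified route entry (not the kernel's struct rt6_info). */
struct toy_route {
	unsigned int flags;
	const char *dev;
};

/* Hypothetical lookup callback: returns a route for gw_addr or NULL. */
typedef const struct toy_route *(*lookup_fn)(const char *gw_addr);

/* Resolve the route used to reach a nexthop. A nexthop must not itself be
 * reached through a gateway, so a gateway result from the table lookup is
 * dropped and the full lookup is used instead.
 */
static const struct toy_route *resolve_nexthop(lookup_fn table_lookup,
					       lookup_fn full_lookup,
					       const char *gw_addr)
{
	const struct toy_route *grt = NULL;

	if (table_lookup) {
		grt = table_lookup(gw_addr);
		if (grt && (grt->flags & RTF_GATEWAY))
			grt = NULL;	/* unsuitable: fall back */
	}
	if (!grt)
		grt = full_lookup(gw_addr);
	return grt;
}

/* Sample data mirroring the report: table 20 only has a default route via a
 * gateway, while the main table has the connected route.
 */
static const struct toy_route table20_default = { RTF_GATEWAY, "eth0" };
static const struct toy_route main_connected = { 0, "eth0" };

static const struct toy_route *table20_lookup(const char *gw_addr)
{
	(void)gw_addr;
	return &table20_default;
}

static const struct toy_route *main_lookup(const char *gw_addr)
{
	(void)gw_addr;
	return &main_connected;
}
```

With these samples, resolve_nexthop(table20_lookup, main_lookup, "2001:db8::6")
yields the connected route from the main table rather than the default route.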
-- 
"... an experienced, industrious, ambitious, and often quite
picturesque liar."
		-- Mark Twain

* Re: XDP_TX bug report on mlx4
From: Brenden Blanco @ 2016-09-16 19:17 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev@vger.kernel.org, Tariq Toukan, Alexei Starovoitov,
	Tom Herbert, Saeed Mahameed, Rana Shahout, Eran Ben Elisha
In-Reply-To: <20160916210340.4a7cdef8@redhat.com>

On Fri, Sep 16, 2016 at 09:03:40PM +0200, Jesper Dangaard Brouer wrote:
> Hi Brenden,
> 
> I've discovered a bug with XDP_TX recycling of pages in the mlx4 driver.
> 
> If I increase the number of RX and TX queues/channels via ethtool cmd:
>  ethtool -L mlx4p1 rx 10 tx 10
> 
> Then when running the xdp2 program, which does XDP_TX, the kernel will
> crash with page errors, because the page refcnt goes to zero or even
> minus.  I've noticed pages delivered to mlx4_en_rx_recycle() can have
> a page refcnt of zero, which is wrong, they should always have 1 (for
> XDP).
> 
> Debugging it further, I find that this can happen when mlx4_en_rx_recycle()
> is called from mlx4_en_recycle_tx_desc().  This is the TX cleanup function,
> associated with TX ring queues used for XDP_TX only. No others than the
> XDP_TX action should be able to place packets into these TX rings
> which call mlx4_en_recycle_tx_desc().

Sounds pretty straightforward, let me look into it.
> 
> Do you have any idea of what could be going wrong in this case?
> 
> -- 
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   Author of http://www.iptv-analyzer.org
>   LinkedIn: http://www.linkedin.com/in/brouer
> 
> 

* Re: [PATCH net-next 02/14] tcp: use windowed min filter library for TCP min_rtt estimation
From: kbuild test robot @ 2016-09-16 19:21 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: kbuild-all, David Miller, netdev, Neal Cardwell, Van Jacobson,
	Yuchung Cheng, Nandita Dukkipati, Eric Dumazet,
	Soheil Hassas Yeganeh
In-Reply-To: <1474051743-13311-3-git-send-email-ncardwell@google.com>

[-- Attachment #1: Type: text/plain, Size: 3309 bytes --]

Hi Neal,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160917-025323
config: x86_64-randconfig-x006-201637 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=x86_64 

All errors (new ones prefixed by >>):

>> net/ipv4/tcp_cdg.c:59:8: error: redefinition of 'struct minmax'
    struct minmax {
           ^~~~~~
   In file included from include/linux/tcp.h:22:0,
                    from include/net/tcp.h:24,
                    from net/ipv4/tcp_cdg.c:30:
   include/linux/win_minmax.h:17:8: note: originally defined here
    struct minmax {
           ^~~~~~

vim +59 net/ipv4/tcp_cdg.c

2b0a8c9e Kenneth Klette Jonassen 2015-06-10  43  module_param(window, int, 0444);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  44  MODULE_PARM_DESC(window, "gradient window size (power of two <= 256)");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  45  module_param(backoff_beta, uint, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  46  MODULE_PARM_DESC(backoff_beta, "backoff beta (0-1024)");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  47  module_param(backoff_factor, uint, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  48  MODULE_PARM_DESC(backoff_factor, "backoff probability scale factor");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  49  module_param(hystart_detect, uint, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  50  MODULE_PARM_DESC(hystart_detect, "use Hybrid Slow start "
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  51  		 "(0: disabled, 1: ACK train, 2: delay threshold, 3: both)");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  52  module_param(use_ineff, uint, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  53  MODULE_PARM_DESC(use_ineff, "use ineffectual backoff detection (threshold)");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  54  module_param(use_shadow, bool, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  55  MODULE_PARM_DESC(use_shadow, "use shadow window heuristic");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  56  module_param(use_tolerance, bool, 0644);
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  57  MODULE_PARM_DESC(use_tolerance, "use loss tolerance heuristic");
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  58  
2b0a8c9e Kenneth Klette Jonassen 2015-06-10 @59  struct minmax {
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  60  	union {
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  61  		struct {
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  62  			s32 min;
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  63  			s32 max;
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  64  		};
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  65  		u64 v64;
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  66  	};
2b0a8c9e Kenneth Klette Jonassen 2015-06-10  67  };

:::::: The code at line 59 was first introduced by commit
:::::: 2b0a8c9eee81882fc0001ccf6d9af62cdc682f9e tcp: add CDG congestion control

:::::: TO: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
:::::: CC: David S. Miller <davem@davemloft.net>
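
The error above is a plain tag collision: tcp_cdg.c defines its own
struct minmax, and the new win_minmax.h header now defines one too. A minimal
sketch of one way to resolve it is to give the CDG-private type a prefixed
name. The name cdg_minmax is a hypothetical choice for illustration, and the
layout shown for the header's struct is an assumption, since the header is not
part of this report.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the windowed filter type from include/linux/win_minmax.h.
 * The field layout is assumed for illustration.
 */
struct minmax_sample {
	uint32_t t;	/* time of the measurement */
	uint32_t v;	/* measured value */
};

struct minmax {
	struct minmax_sample s[3];
};

/* CDG's private min/max pair under a prefixed (hypothetical) name, so it no
 * longer redefines struct minmax from the shared header.
 */
struct cdg_minmax {
	union {
		struct {
			int32_t min;
			int32_t max;
		};
		uint64_t v64;
	};
};

/* The pair is meant to be copied or cleared as one 64-bit value. */
static_assert(sizeof(struct cdg_minmax) == sizeof(uint64_t),
	      "min/max pair must fit in v64");
```

Both types can then coexist in one translation unit, which is the situation
the randconfig build hit.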

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 26892 bytes --]

* Re: XDP_TX bug report on mlx4
From: Jesper Dangaard Brouer @ 2016-09-16 19:24 UTC (permalink / raw)
  To: Brenden Blanco
  Cc: netdev@vger.kernel.org, Tariq Toukan, Tom Herbert, Saeed Mahameed,
	Rana Shahout, Eran Ben Elisha, brouer
In-Reply-To: <20160916191727.GA8410@gmail.com>

[-- Attachment #1: Type: text/plain, Size: 4746 bytes --]

On Fri, 16 Sep 2016 12:17:27 -0700
Brenden Blanco <bblanco@plumgrid.com> wrote:

> On Fri, Sep 16, 2016 at 09:03:40PM +0200, Jesper Dangaard Brouer wrote:
> > Hi Brenden,
> > 
> > I've discovered a bug with XDP_TX recycling of pages in the mlx4 driver.
> > 
> > If I increase the number of RX and TX queues/channels via ethtool cmd:
> >  ethtool -L mlx4p1 rx 10 tx 10
> > 
> > Then when running the xdp2 program, which does XDP_TX, the kernel will
> > crash with page errors, because the page refcnt goes to zero or even
> > minus.  I've noticed pages delivered to mlx4_en_rx_recycle() can have
> > a page refcnt of zero, which is wrong, they should always have 1 (for
> > XDP).
> > 
> > Debugging it further, I find that this can happen when mlx4_en_rx_recycle()
> > is called from mlx4_en_recycle_tx_desc().  This is the TX cleanup function,
> > associated with TX ring queues used for XDP_TX only. No others than the
> > XDP_TX action should be able to place packets into these TX rings
> > which call mlx4_en_recycle_tx_desc().  
> 
> Sounds pretty straightforward, let me look into it.

Here is some debug info I instrumented my kernel with, and I've
attached my minicom output with a warning and a panic.

Enable some driver debug printks via::
 ethtool -s mlx4p1 msglvl drv on

Debug normal situation::

 $ grep recycle_ring minicom_capturefile.log08
 [  520.746610] mlx4_en: mlx4p1: Set tx_ring[56]->recycle_ring = rx_ring[0]
 [  520.747042] mlx4_en: mlx4p1: Set tx_ring[57]->recycle_ring = rx_ring[1]
 [  520.747470] mlx4_en: mlx4p1: Set tx_ring[58]->recycle_ring = rx_ring[2]
 [  520.747918] mlx4_en: mlx4p1: Set tx_ring[59]->recycle_ring = rx_ring[3]
 [  520.748330] mlx4_en: mlx4p1: Set tx_ring[60]->recycle_ring = rx_ring[4]
 [  520.748749] mlx4_en: mlx4p1: Set tx_ring[61]->recycle_ring = rx_ring[5]
 [  520.749181] mlx4_en: mlx4p1: Set tx_ring[62]->recycle_ring = rx_ring[6]
 [  520.749620] mlx4_en: mlx4p1: Set tx_ring[63]->recycle_ring = rx_ring[7]

Change $ ethtool -L mlx4p1 rx 9 tx 9 ::

 [  911.594692] mlx4_en: mlx4p1: Set tx_ring[56]->recycle_ring = rx_ring[0]
 [  911.608345] mlx4_en: mlx4p1: Set tx_ring[57]->recycle_ring = rx_ring[1]
 [  911.622008] mlx4_en: mlx4p1: Set tx_ring[58]->recycle_ring = rx_ring[2]
 [  911.636364] mlx4_en: mlx4p1: Set tx_ring[59]->recycle_ring = rx_ring[3]
 [  911.650015] mlx4_en: mlx4p1: Set tx_ring[60]->recycle_ring = rx_ring[4]
 [  911.663690] mlx4_en: mlx4p1: Set tx_ring[61]->recycle_ring = rx_ring[5]
 [  911.677356] mlx4_en: mlx4p1: Set tx_ring[62]->recycle_ring = rx_ring[6]
 [  911.690924] mlx4_en: mlx4p1: Set tx_ring[63]->recycle_ring = rx_ring[7]
 [  911.704544] mlx4_en: mlx4p1: Set tx_ring[64]->recycle_ring = rx_ring[8]
 [  911.718171] mlx4_en: mlx4p1: Set tx_ring[65]->recycle_ring = rx_ring[9]
 [  911.731772] mlx4_en: mlx4p1: Set tx_ring[66]->recycle_ring = rx_ring[10]
 [  911.745438] mlx4_en: mlx4p1: Set tx_ring[67]->recycle_ring = rx_ring[11]
 [  911.759063] mlx4_en: mlx4p1: Set tx_ring[68]->recycle_ring = rx_ring[12]
 [  911.772741] mlx4_en: mlx4p1: Set tx_ring[69]->recycle_ring = rx_ring[13]
 [  911.786415] mlx4_en: mlx4p1: Set tx_ring[70]->recycle_ring = rx_ring[14]
 [  911.800070] mlx4_en: mlx4p1: Set tx_ring[71]->recycle_ring = rx_ring[15]

Change $ ethtool -L mlx4p1 rx 10 tx 10::

 netif_set_real_num_tx_queues() setting dev->real_num_tx_queues(now:80) = 64
 mlx4_en: mlx4p1:   frag:0 - size:1522 prefix:0 stride:4096
 mlx4_en_init_recycle_ring() Set tx_ring[64]->recycle_ring = rx_ring[0]
 mlx4_en_init_recycle_ring() Set tx_ring[65]->recycle_ring = rx_ring[1]
 mlx4_en_init_recycle_ring() Set tx_ring[66]->recycle_ring = rx_ring[2]
 mlx4_en_init_recycle_ring() Set tx_ring[67]->recycle_ring = rx_ring[3]
 mlx4_en_init_recycle_ring() Set tx_ring[68]->recycle_ring = rx_ring[4]
 mlx4_en_init_recycle_ring() Set tx_ring[69]->recycle_ring = rx_ring[5]
 mlx4_en_init_recycle_ring() Set tx_ring[70]->recycle_ring = rx_ring[6]
 mlx4_en_init_recycle_ring() Set tx_ring[71]->recycle_ring = rx_ring[7]
 mlx4_en_init_recycle_ring() Set tx_ring[72]->recycle_ring = rx_ring[8]
 mlx4_en_init_recycle_ring() Set tx_ring[73]->recycle_ring = rx_ring[9]
 mlx4_en_init_recycle_ring() Set tx_ring[74]->recycle_ring = rx_ring[10]
 mlx4_en_init_recycle_ring() Set tx_ring[75]->recycle_ring = rx_ring[11]
 mlx4_en_init_recycle_ring() Set tx_ring[76]->recycle_ring = rx_ring[12]
 mlx4_en_init_recycle_ring() Set tx_ring[77]->recycle_ring = rx_ring[13]
 mlx4_en_init_recycle_ring() Set tx_ring[78]->recycle_ring = rx_ring[14]
 mlx4_en_init_recycle_ring() Set tx_ring[79]->recycle_ring = rx_ring[15]
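
The index pattern in the three listings above can be written down as a small
helper. This is only a sketch of what the logs show, assuming the XDP TX rings
occupy the last entries of the TX ring array; xdp_tx_ring_index() and its
parameters are hypothetical names, not driver code.

```c
#include <assert.h>

/* Index of the XDP TX ring whose recycle_ring points at rx_ring[rx_index],
 * assuming the XDP rings are the last xdp_rings entries of tx_ring[].
 */
static int xdp_tx_ring_index(int total_tx_rings, int xdp_rings, int rx_index)
{
	assert(xdp_rings > 0 && xdp_rings <= total_tx_rings);
	assert(rx_index >= 0 && rx_index < xdp_rings);
	return total_tx_rings - xdp_rings + rx_index;
}
```

Under that assumption the first listing corresponds to 64 TX rings with 8 XDP
rings (tx_ring[56] pairs with rx_ring[0]), and the last one to 80 TX rings
with 16 XDP rings (tx_ring[64] pairs with rx_ring[0]).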

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

[-- Attachment #2: minicom_capturefile.log17.txt --]
[-- Type: text/plain, Size: 8293 bytes --]


[   95.777366] systemd[1]: Started Session c1 of user jbrouer.
[   95.783108] systemd[1]: Starting Session c1 of user jbrouer.
[  102.577674] XXX: netif_set_real_num_tx_queues() setting dev->real_num_tx_queues(now:64) = 80
[  102.586160] mlx4_en: mlx4p1: Using 80 TX rings
[  102.590640] mlx4_en: mlx4p1: Using 0 TX rings for XDP
[  102.595702] mlx4_en: mlx4p1: Using 10 RX rings
[  102.600183] mlx4_en: mlx4p1:   frag:0 - size:1522 prefix:0 stride:1536
[  102.612851] mlx4_en: mlx4p1: Setting RSS context tunnel type to RSS on inner headers
[  102.677064] mlx4_en: mlx4p1: Link Down
[  103.830065] mlx4_en: mlx4p1: Link Up
[  171.811607] XXX: netif_set_real_num_tx_queues() calling qdisc_reset_all_tx_gt(64)
[  171.819208] XXX: netif_set_real_num_tx_queues() setting dev->real_num_tx_queues(now:80) = 64
[  171.827796] mlx4_en: mlx4p1:   frag:0 - size:1522 prefix:0 stride:4096
[  171.840921] mlx4_en: mlx4p1: Setting RSS context tunnel type to RSS on inner headers
[  171.877866] XXX: mlx4_en_init_recycle_ring() Set tx_ring[64]->recycle_ring = rx_ring[0]
[  171.886379] XXX: mlx4_en_init_recycle_ring() Set tx_ring[65]->recycle_ring = rx_ring[1]
[  171.894938] XXX: mlx4_en_init_recycle_ring() Set tx_ring[66]->recycle_ring = rx_ring[2]
[  171.903442] XXX: mlx4_en_init_recycle_ring() Set tx_ring[67]->recycle_ring = rx_ring[3]
[  171.911979] XXX: mlx4_en_init_recycle_ring() Set tx_ring[68]->recycle_ring = rx_ring[4]
[  171.920469] XXX: mlx4_en_init_recycle_ring() Set tx_ring[69]->recycle_ring = rx_ring[5]
[  171.929016] XXX: mlx4_en_init_recycle_ring() Set tx_ring[70]->recycle_ring = rx_ring[6]
[  171.937564] XXX: mlx4_en_init_recycle_ring() Set tx_ring[71]->recycle_ring = rx_ring[7]
[  171.946102] XXX: mlx4_en_init_recycle_ring() Set tx_ring[72]->recycle_ring = rx_ring[8]
[  171.954628] XXX: mlx4_en_init_recycle_ring() Set tx_ring[73]->recycle_ring = rx_ring[9]
[  171.963190] XXX: mlx4_en_init_recycle_ring() Set tx_ring[74]->recycle_ring = rx_ring[10]
[  171.971851] XXX: mlx4_en_init_recycle_ring() Set tx_ring[75]->recycle_ring = rx_ring[11]
[  171.980436] XXX: mlx4_en_init_recycle_ring() Set tx_ring[76]->recycle_ring = rx_ring[12]
[  171.989039] XXX: mlx4_en_init_recycle_ring() Set tx_ring[77]->recycle_ring = rx_ring[13]
[  171.997641] XXX: mlx4_en_init_recycle_ring() Set tx_ring[78]->recycle_ring = rx_ring[14]
[  172.006253] XXX: mlx4_en_init_recycle_ring() Set tx_ring[79]->recycle_ring = rx_ring[15]
[  172.025565] mlx4_en: mlx4p1: Link Down
[  173.154950] mlx4_en: mlx4p1: Link Up
[  209.277225] systemd-logind[778]: New session c2 of user jbrouer.
[  209.284629] systemd[1]: Started Session c2 of user jbrouer.
[  209.290519] systemd[1]: Starting Session c2 of user jbrouer.
[  259.151792] XXX: mlx4_en_xmit_frame(cpu:2) tx_drop (tx_ind:66) *doorbell_pending:0
[  259.152002] XXX: mlx4_en_rx_recycle(cpu:6) page refcnt(0) bug cache->index:128/128
[  259.152003] ------------[ cut here ]------------
[  259.152007] WARNING: CPU: 6 PID: 0 at drivers/net/ethernet/mellanox/mlx4/en_rx.c:532 mlx4_en_rx_recycle+0x9c/0xb0 [mlx4_en]
[  259.152018] Modules linked in: coretemp kvm_intel kvm mxm_wmi irqbypass i2c_i801 intel_cstate i2c_smbus intel_rapl_perf sg i2c_core pcspkr shpchp video wmi acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables mlx4_en mlx5_core e1000e ptp sd_mod serio_raw mlx4_core pps_core devlink hid_generic
[  259.152020] CPU: 6 PID: 0 Comm: swapper/6 Not tainted 4.8.0-rc4-xdp02_seperate_xdp_struct+ #89
[  259.152021] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[  259.152022]  0000000000000000 ffff88041fb83d58 ffffffff813a8ccb 0000000000000000
[  259.152023]  0000000000000000 ffff88041fb83d98 ffffffff81060f3b 000002141fb83d70
[  259.152024]  ffff88041fb83dd8 ffff8804081e8000 0000000000000151 0000000000000009
[  259.152024] Call Trace:
[  259.152029]  <IRQ>  [<ffffffff813a8ccb>] dump_stack+0x4d/0x72
[  259.152032]  [<ffffffff81060f3b>] __warn+0xcb/0xf0
[  259.152034]  [<ffffffff8106102d>] warn_slowpath_null+0x1d/0x20
[  259.152035]  [<ffffffffa016a30c>] mlx4_en_rx_recycle+0x9c/0xb0 [mlx4_en]
[  259.152037]  [<ffffffffa0166a2e>] mlx4_en_recycle_tx_desc+0x4e/0xf0 [mlx4_en]
[  259.152038]  [<ffffffffa01678a7>] mlx4_en_poll_tx_cq+0x1e7/0x480 [mlx4_en]
[  259.152040]  [<ffffffff815f0f6c>] net_rx_action+0x1fc/0x350
[  259.152042]  [<ffffffff8170b83e>] __do_softirq+0xce/0x2cf
[  259.152043]  [<ffffffff8106635b>] irq_exit+0xab/0xb0
[  259.152044]  [<ffffffff8170b584>] do_IRQ+0x54/0xd0
[  259.152046]  [<ffffffff81709bbf>] common_interrupt+0x7f/0x7f
[  259.152048]  <EOI>  [<ffffffff815b213f>] ? cpuidle_enter_state+0x12f/0x300
[  259.152049]  [<ffffffff815b2347>] cpuidle_enter+0x17/0x20
[  259.152051]  [<ffffffff810a3c94>] cpu_startup_entry+0x2b4/0x300
[  259.152053]  [<ffffffff8103c611>] start_secondary+0x101/0x120
[  259.152054] ---[ end trace 2b35fea8b90ba9b5 ]---
[  259.152055] page:ffffea000a8e8300 count:0 mapcount:0 mapping:          (null) index:0x0
[  259.152056] flags: 0x2fffff80000000()
[  259.152056] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
[  259.152067] ------------[ cut here ]------------
[  259.152068] kernel BUG at ./include/linux/mm.h:445!
[  259.152068] invalid opcode: 0000 [#1] PREEMPT SMP
[  259.152074] Modules linked in: coretemp kvm_intel kvm mxm_wmi irqbypass i2c_i801 intel_cstate i2c_smbus intel_rapl_perf sg i2c_core pcspkr shpchp video wmi acpi_pad nfsd auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables mlx4_en mlx5_core e1000e ptp sd_mod serio_raw mlx4_core pps_core devlink hid_generic
[  259.152075] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G        W       4.8.0-rc4-xdp02_seperate_xdp_struct+ #89
[  259.152076] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[  259.152076] task: ffff88040d5bab80 task.stack: ffff88040d5d0000
[  259.152078] RIP: 0010:[<ffffffffa0166acd>]  [<ffffffffa0166acd>] mlx4_en_recycle_tx_desc+0xed/0xf0 [mlx4_en]
[  259.152079] RSP: 0018:ffff88041fb83dd8  EFLAGS: 00010292
[  259.152079] RAX: 000000000000003e RBX: ffff88040b76d440 RCX: 0000000000000001
[  259.152079] RDX: 0000000000000001 RSI: 0000000000000286 RDI: ffffffff81c40b90
[  259.152080] RBP: ffff88041fb83e00 R08: 0000000000000000 R09: 000000000000003e
[  259.152080] R10: 000000000000000a R11: 0000000000000000 R12: ffff8803f1b40840
[  259.152081] R13: 0000000000000151 R14: 0000000000000009 R15: 0000000000005440
[  259.152081] FS:  0000000000000000(0000) GS:ffff88041fb80000(0000) knlGS:0000000000000000
[  259.152082] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  259.152082] CR2: 00007fd47c1c4520 CR3: 00000003e970a000 CR4: 00000000001406e0
[  259.152082] Stack:
[  259.152084]  ffffea000a8e8300 00000002a3a0c000 0000100000000000 0000000000000150
[  259.152085]  ffff88040b748800 ffff88041fb83e90 ffffffffa01678a7 ffff88040b3ed500
[  259.152086]  0000000000000020 ffff88040b760000 ffff8803f1b40000 0367cd5100000200
[  259.152086] Call Trace:
[  259.152088]  <IRQ> 
[  259.152088]  [<ffffffffa01678a7>] mlx4_en_poll_tx_cq+0x1e7/0x480 [mlx4_en]
[  259.152089]  [<ffffffff815f0f6c>] net_rx_action+0x1fc/0x350
[  259.152090]  [<ffffffff8170b83e>] __do_softirq+0xce/0x2cf
[  259.152091]  [<ffffffff8106635b>] irq_exit+0xab/0xb0
[  259.152092]  [<ffffffff8170b584>] do_IRQ+0x54/0xd0
[  259.152093]  [<ffffffff81709bbf>] common_interrupt+0x7f/0x7f
[  259.152095]  <EOI> 
[  259.152095]  [<ffffffff815b213f>] ? cpuidle_enter_state+0x12f/0x300
[  259.152096]  [<ffffffff815b2347>] cpuidle_enter+0x17/0x20
[  259.152097]  [<ffffffff810a3c94>] cpu_startup_entry+0x2b4/0x300
[  259.152098]  [<ffffffff8103c611>] start_secondary+0x101/0x120
[  259.152108] Code: 5c 5d c3 e8 b6 a4 ff e0 8b 43 14 48 83 c4 18 5b 41 5c 5d c3 48 8b 05 13 80 ab e1 eb a0 0f 0b 48 c7 c6 60 9e 17 a0 e8 a3 8b 01 e1 <0f> 0b 90 0f 1f 44 00 00 55 48 63 c2 89 d1 49 89 f2 48 c1 e0 06 
[  259.152110] RIP  [<ffffffffa0166acd>] mlx4_en_recycle_tx_desc+0xed/0xf0 [mlx4_en]
[  259.152110]  RSP <ffff88041fb83dd8>
[  259.152117] ---[ end trace 2b35fea8b90ba9b6 ]---
[  259.152118] Kernel panic - not syncing: Fatal exception in interrupt
[  259.159443] Kernel Offset: disabled
[  259.640689] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

^ permalink raw reply

* Re: [PATCH net-next 02/14] tcp: use windowed min filter library for TCP min_rtt estimation
From: Neal Cardwell @ 2016-09-16 19:25 UTC (permalink / raw)
  To: David Miller
  Cc: Netdev, Van Jacobson, Yuchung Cheng, Nandita Dukkipati,
	Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <201609170308.dReNRIGl%fengguang.wu@intel.com>

On Fri, Sep 16, 2016 at 3:21 PM, kbuild test robot <lkp@intel.com> wrote:
> All errors (new ones prefixed by >>):
>
>>> net/ipv4/tcp_cdg.c:59:8: error: redefinition of 'struct minmax'
>     struct minmax {
>            ^~~~~~
>    In file included from include/linux/tcp.h:22:0,
>                     from include/net/tcp.h:24,
>                     from net/ipv4/tcp_cdg.c:30:
>    include/linux/win_minmax.h:17:8: note: originally defined here
>     struct minmax {
>            ^~~~~~
>
> vim +59 net/ipv4/tcp_cdg.c

Sorry about that. I will fix that and re-post.

neal
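
For readers following along: the error comes from two definitions of
'struct minmax' landing in the same translation unit. A minimal sketch of
one way such a clash can be resolved, assuming the file-local type is
simply renamed. The cdg_ prefix, the field layouts and the helper below
are illustrative assumptions, not necessarily what the re-posted patch does:

```c
#include <limits.h>

/* Stand-in for the shared type from include/linux/win_minmax.h.
 * The layout here is illustrative only. */
struct minmax {
	unsigned int t;
	unsigned int v;
};

/* File-local type that used to reuse the name 'struct minmax'.
 * Giving it a (hypothetical) module prefix removes the redefinition. */
struct cdg_minmax {
	int min;
	int max;
};

/* Hypothetical helper: reset the local tracker to an empty range. */
static void cdg_minmax_reset(struct cdg_minmax *m)
{
	m->min = INT_MAX;
	m->max = INT_MIN;
}

/* Hypothetical helper: fold one sample into the local tracker. */
static void cdg_minmax_add(struct cdg_minmax *m, int sample)
{
	if (sample < m->min)
		m->min = sample;
	if (sample > m->max)
		m->max = sample;
}
```

With the rename, both types can coexist in the same file without the
compiler complaining.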

^ permalink raw reply

* Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
From: David Ahern @ 2016-09-16 19:30 UTC (permalink / raw)
  To: Cyrill Gorcunov, Eric Dumazet
  Cc: netdev, linux-kernel, David Miller, kuznet, jmorris, yoshfuji,
	kaber, avagin, stephen
In-Reply-To: <20160916190000.GA18116@uranus.lan>

On 9/16/16 1:00 PM, Cyrill Gorcunov wrote:
> I created veth pair and bound raw socket into it.
> 
> [root@pcs7 iproute2]# misc/ss -A raw
> State      Recv-Q Send-Q                                Local Address:Port                                                 Peer Address:Port                
> ESTAB      0      0                                         127.0.0.1:ipproto-255                                            127.0.0.10:ipproto-9090         
> UNCONN     0      0                                        127.0.0.10:ipproto-255                                                     *:*                    
> UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
> UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
> ESTAB      0      0                                               ::1:ipproto-255                                                   ::1:ipproto-9091         
> UNCONN     0      0                                           ::1%vm1:ipproto-255                                                    :::*                    
> [root@pcs7 iproute2]# 
> 
> [root@pcs7 iproute2]# misc/ss -aKw 'dev == vm1'
> State      Recv-Q Send-Q                                Local Address:Port                                                 Peer Address:Port                
> UNCONN     0      0                                           ::1%vm1:ipproto-255                                                    :::*                    
> 
> [root@pcs7 iproute2]# misc/ss -A raw
> State      Recv-Q Send-Q                                Local Address:Port                                                 Peer Address:Port                
> ESTAB      0      0                                         127.0.0.1:ipproto-255                                            127.0.0.10:ipproto-9090         
> UNCONN     0      0                                        127.0.0.10:ipproto-255                                                     *:*                    
> UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
> UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
> ESTAB      0      0                                               ::1:ipproto-255                                                   ::1:ipproto-9091         
> 
> so it get zapped out. Is there some other way to test it?
> 

I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If you pass something else (IPPROTO_ICMP for example) it won't work.
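
To make the distinction concrete, here is a minimal userspace sketch of the
two cases. Both sockets have type SOCK_RAW; only the protocol argument
differs. The helper names are made up for illustration, and opening raw
sockets normally requires CAP_NET_RAW, so the calls may fail with EPERM:

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Open a raw IPv4 socket for the given protocol.
 * Returns the fd, or -1 on failure (errno is set by socket()). */
static int open_raw_socket(int protocol)
{
	return socket(AF_INET, SOCK_RAW, protocol);
}

/* The case from the test above: protocol 255 (IPPROTO_RAW). */
static int open_plain_raw(void)
{
	return open_raw_socket(IPPROTO_RAW);
}

/* The case that is not handled yet: still SOCK_RAW, but IPPROTO_ICMP. */
static int open_icmp_raw(void)
{
	return open_raw_socket(IPPROTO_ICMP);
}
```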

^ permalink raw reply

* Re: [PATCH] net: ipv6: fallback to full lookup if table lookup is unsuitable
From: David Ahern @ 2016-09-16 19:38 UTC (permalink / raw)
  To: Vincent Bernat
  Cc: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev
In-Reply-To: <m3twdfk665.fsf@neo.luffy.cx>

On 9/16/16 1:15 PM, Vincent Bernat wrote:
>> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> index ad4a7ff301fc..48bae2ee2e18 100644
>> --- a/net/ipv6/route.c
>> +++ b/net/ipv6/route.c
>> @@ -1991,9 +1991,19 @@ static struct rt6_info *ip6_route_info_create(struct fib6_config *cfg)
>>                         if (!(gwa_type & IPV6_ADDR_UNICAST))
>>                                 goto out;
>>
>> -                       if (cfg->fc_table)
>> +                       if (cfg->fc_table) {
>>                                 grt = ip6_nh_lookup_table(net, cfg, gw_addr);
>>
>> +                               /* a nexthop lookup can not go through a gw.
>> +                                * if this happens on a table based lookup
>> +                                * then fallback to a full lookup
>> +                                */
>> +                               if (grt && grt->rt6i_flags & RTF_GATEWAY) {
>> +                                       ip6_rt_put(grt);
>> +                                       grt = NULL;
>> +                               }
>> +                       }
>> +
>>                         if (!grt)
>>                                 grt = rt6_lookup(net, gw_addr, NULL,
>>                                                  cfg->fc_ifindex, 1);
> 
> OK. Should the dev check be dismissed or do we add "dev && dev !=
> grt->dst.dev" just as a safety net (this would be a convoluted setup,
> but the correct direct route could be in an ip rule with higher priority
> while the one in this table is incorrect)?
> 

yes. So the validity check becomes:

	grt = ip6_nh_lookup_table(net, cfg, gw_addr);
	if (grt) {
		if (grt->rt6i_flags & RTF_GATEWAY ||
		    dev && dev != grt->dst.dev) {
			ip6_rt_put(grt);
			grt = NULL;            <---- causes the full rt6_lookup
		}
	}
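
The same check as a standalone predicate, in case it helps to reason about
the cases. This is only a sketch: the structs are pared-down stand-ins for
the kernel ones, RTF_GATEWAY is redefined locally, and nh_route_usable()
is a made-up name:

```c
#include <stdbool.h>
#include <stddef.h>

/* Redefined locally for this sketch. */
#define RTF_GATEWAY 0x0002

/* Hypothetical, pared-down stand-ins for the kernel structures. */
struct net_device {
	int ifindex;
};

struct rt6_info {
	unsigned int rt6i_flags;
	struct net_device *dev;	/* stands in for grt->dst.dev */
};

/* Returns true when the table lookup result can serve as the nexthop
 * route. Returns false when the caller should release it, set grt to
 * NULL and fall back to the full rt6_lookup(). */
static bool nh_route_usable(const struct rt6_info *grt,
			    const struct net_device *dev)
{
	if (!grt)
		return false;
	/* a nexthop lookup can not go through a gateway */
	if (grt->rt6i_flags & RTF_GATEWAY)
		return false;
	/* safety net: the route must point at the requested device */
	if (dev && dev != grt->dev)
		return false;
	return true;
}
```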

^ permalink raw reply

* Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
From: Cyrill Gorcunov @ 2016-09-16 19:39 UTC (permalink / raw)
  To: David Ahern
  Cc: Eric Dumazet, netdev, linux-kernel, David Miller, kuznet, jmorris,
	yoshfuji, kaber, avagin, stephen
In-Reply-To: <59e12627-7043-fd20-0d68-899ab43b0e71@cumulusnetworks.com>

On Fri, Sep 16, 2016 at 01:30:28PM -0600, David Ahern wrote:
> > [root@pcs7 iproute2]# misc/ss -A raw
> > State      Recv-Q Send-Q                                Local Address:Port                                                 Peer Address:Port                
> > ESTAB      0      0                                         127.0.0.1:ipproto-255                                            127.0.0.10:ipproto-9090         
> > UNCONN     0      0                                        127.0.0.10:ipproto-255                                                     *:*                    
> > UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
> > UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
> > ESTAB      0      0                                               ::1:ipproto-255                                                   ::1:ipproto-9091         
> > 
> > so it get zapped out. Is there some other way to test it?
> > 
> 
> I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If you pass something
> else (IPPROTO_ICMP for example) it won't work.

True. Supporting IPPROTO_ICMP needs an enhancement. I thought I would start
with plain _RAW first and then extend it to support _ICMP.

	Cyrill

^ permalink raw reply

* [net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full
From: Jesper Dangaard Brouer @ 2016-09-16 19:47 UTC (permalink / raw)
  To: netdev, tariqt
  Cc: tom, bblanco, rana.shahot, David S. Miller,
	Jesper Dangaard Brouer

The XDP_TX action can fail to transmit the frame when the TX ring
is full or the port is down.  On TX failure it should drop the
frame, rather than calling 'break' as it does now, which is the same
as XDP_PASS.

Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>

---
Note, this fix has nothing to do with the page-refcnt bug I just reported.
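
A simplified userspace model of the control flow, only to illustrate the
behaviour change. The enum values, the 'fixed' switch and the function name
are invented for this sketch; the real logic lives in
mlx4_en_process_rx_cq():

```c
#include <stdbool.h>

/* Sketch-local copies of the XDP verdicts. */
enum xdp_action { XDP_ABORTED, XDP_DROP, XDP_PASS, XDP_TX };

/* What finally happens to the frame in this model. */
enum frame_fate { FATE_CONSUMED, FATE_DROPPED, FATE_PASSED };

/* 'tx_ok' models a successful transmit; 'fixed' selects the
 * behaviour with (true) or without (false) the added 'goto next'. */
static enum frame_fate handle_xdp_verdict(enum xdp_action act,
					  bool tx_ok, bool fixed)
{
	switch (act) {
	case XDP_PASS:
		break;
	case XDP_TX:
		if (tx_ok)
			return FATE_CONSUMED;	/* "goto consumed" */
		if (fixed)
			return FATE_DROPPED;	/* "goto next" */
		break;	/* old behaviour: leaves the switch like XDP_PASS */
	default:
	case XDP_ABORTED:
	case XDP_DROP:
		return FATE_DROPPED;
	}
	/* Falling out of the switch hands the frame to the normal stack. */
	return FATE_PASSED;
}
```

In this model, a failed XDP_TX ends up as FATE_PASSED without the fix and
as FATE_DROPPED with it.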

 drivers/net/ethernet/mellanox/mlx4/en_rx.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
index 2040dad8611d..d414c67dfd12 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
@@ -906,6 +906,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
 							length, tx_index,
 							&doorbell_pending))
 					goto consumed;
+				goto next;
 				break;
 			default:
 				bpf_warn_invalid_xdp_action(act);

^ permalink raw reply related

* Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
From: David Ahern @ 2016-09-16 19:47 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric Dumazet, netdev, linux-kernel, David Miller, kuznet, jmorris,
	yoshfuji, kaber, avagin, stephen
In-Reply-To: <20160916193927.GB18116@uranus.lan>

On 9/16/16 1:39 PM, Cyrill Gorcunov wrote:
> On Fri, Sep 16, 2016 at 01:30:28PM -0600, David Ahern wrote:
>>> [root@pcs7 iproute2]# misc/ss -A raw
>>> State      Recv-Q Send-Q                                Local Address:Port                                                 Peer Address:Port                
>>> ESTAB      0      0                                         127.0.0.1:ipproto-255                                            127.0.0.10:ipproto-9090         
>>> UNCONN     0      0                                        127.0.0.10:ipproto-255                                                     *:*                    
>>> UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
>>> UNCONN     0      0                                                :::ipv6-icmp                                                      :::*                    
>>> ESTAB      0      0                                               ::1:ipproto-255                                                   ::1:ipproto-9091         
>>>
>>> so it get zapped out. Is there some other way to test it?
>>>
>>
>> I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If you pass something
>> else (IPPROTO_ICMP for example) it won't work.
> 
> True. To support IPPROTO_ICMP it need enhancement. I thought start with
> plain _RAW first and then extend to support _ICMP.

I thought raw in this case was SOCK_RAW as in the socket type.

Since the display is showing sockets in addition to IPPROTO_RAW:

$ ss -A raw
State      Recv-Q Send-Q        Local Address:Port                         Peer Address:Port
UNCONN     0      0                    *%eth0:icmp                                    *:*

It is going to be confusing if only ipproto-255 sockets can be killed.

^ permalink raw reply

* Re: [PATCH net-next 07/14] tcp: export data delivery rate
From: kbuild test robot @ 2016-09-17  3:56 UTC (permalink / raw)
  To: Neal Cardwell
  Cc: kbuild-all, David Miller, netdev, Yuchung Cheng, Van Jacobson,
	Neal Cardwell, Nandita Dukkipati, Eric Dumazet,
	Soheil Hassas Yeganeh
In-Reply-To: <1474051743-13311-8-git-send-email-ncardwell@google.com>

[-- Attachment #1: Type: text/plain, Size: 4180 bytes --]

Hi Yuchung,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160917-025323
config: arm-nhk8815_defconfig (attached as .config)
compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
reproduce:
        wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm 

All error/warnings (new ones prefixed by >>):

   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:142,
                    from include/linux/crypto.h:21,
                    from include/crypto/hash.h:16,
                    from net/ipv4/tcp.c:250:
   net/ipv4/tcp.c: In function 'tcp_get_info':
   include/asm-generic/div64.h:207:28: warning: comparison of distinct pointer types lacks a cast
     (void)(((typeof((n)) *)0) == ((uint64_t *)0)); \
                               ^
>> net/ipv4/tcp.c:2794:3: note: in expansion of macro 'do_div'
      do_div(rate, intv);
      ^~~~~~
   In file included from arch/arm/include/asm/atomic.h:14:0,
                    from include/linux/atomic.h:4,
                    from include/linux/crypto.h:20,
                    from include/crypto/hash.h:16,
                    from net/ipv4/tcp.c:250:
   include/asm-generic/div64.h:220:25: warning: right shift count >= width of type [-Wshift-count-overflow]
     } else if (likely(((n) >> 32) == 0)) {  \
                            ^
   include/linux/compiler.h:167:40: note: in definition of macro 'likely'
    # define likely(x) __builtin_expect(!!(x), 1)
                                           ^
>> net/ipv4/tcp.c:2794:3: note: in expansion of macro 'do_div'
      do_div(rate, intv);
      ^~~~~~
   In file included from arch/arm/include/asm/div64.h:126:0,
                    from include/linux/kernel.h:142,
                    from include/linux/crypto.h:21,
                    from include/crypto/hash.h:16,
                    from net/ipv4/tcp.c:250:
>> include/asm-generic/div64.h:224:22: error: passing argument 1 of '__div64_32' from incompatible pointer type [-Werror=incompatible-pointer-types]
      __rem = __div64_32(&(n), __base); \
                         ^
>> net/ipv4/tcp.c:2794:3: note: in expansion of macro 'do_div'
      do_div(rate, intv);
      ^~~~~~
   In file included from include/linux/kernel.h:142:0,
                    from include/linux/crypto.h:21,
                    from include/crypto/hash.h:16,
                    from net/ipv4/tcp.c:250:
   arch/arm/include/asm/div64.h:32:24: note: expected 'uint64_t * {aka long long unsigned int *}' but argument is of type 'u32 * {aka unsigned int *}'
    static inline uint32_t __div64_32(uint64_t *n, uint32_t base)
                           ^~~~~~~~~~
   cc1: some warnings being treated as errors

vim +/do_div +2794 net/ipv4/tcp.c

  2778		} while (u64_stats_fetch_retry_irq(&tp->syncp, start));
  2779		info->tcpi_segs_out = tp->segs_out;
  2780		info->tcpi_segs_in = tp->segs_in;
  2781	
  2782		notsent_bytes = READ_ONCE(tp->write_seq) - READ_ONCE(tp->snd_nxt);
  2783		info->tcpi_notsent_bytes = max(0, notsent_bytes);
  2784	
  2785		info->tcpi_min_rtt = tcp_min_rtt(tp);
  2786		info->tcpi_data_segs_in = tp->data_segs_in;
  2787		info->tcpi_data_segs_out = tp->data_segs_out;
  2788	
  2789		info->tcpi_delivery_rate_app_limited = tp->rate_app_limited ? 1 : 0;
  2790		rate = READ_ONCE(tp->rate_delivered);
  2791		intv = READ_ONCE(tp->rate_interval_us);
  2792		if (rate && intv) {
  2793			rate = rate * tp->mss_cache * USEC_PER_SEC;
> 2794			do_div(rate, intv);
  2795			put_unaligned(rate, &info->tcpi_delivery_rate);
  2796		}
  2797	}
  2798	EXPORT_SYMBOL_GPL(tcp_get_info);
  2799	
  2800	static int do_tcp_getsockopt(struct sock *sk, int level,
  2801			int optname, char __user *optval, int __user *optlen)
  2802	{
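
The diagnostics above say that do_div() was handed a u32 where a 64-bit
lvalue is expected. A userspace sketch of the same arithmetic with a 64-bit
dividend follows. The div64_in_place() helper is only a stand-in for
do_div(), and the function and parameter names are assumptions made for
this example, not the actual fix:

```c
#include <stdint.h>

#define USEC_PER_SEC 1000000ULL

/* Userspace stand-in for do_div(): divides the 64-bit value in place
 * by a 32-bit base and returns the remainder. */
static uint32_t div64_in_place(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Delivery rate in bytes per second. The intermediate product easily
 * exceeds 32 bits, so the dividend is held in a uint64_t. */
static uint64_t delivery_rate_bps(uint32_t delivered_pkts, uint32_t mss,
				  uint32_t interval_us)
{
	uint64_t rate = (uint64_t)delivered_pkts * mss * USEC_PER_SEC;

	if (!rate || !interval_us)
		return 0;
	div64_in_place(&rate, interval_us);
	return rate;
}
```

For example, 100 packets of 1448 bytes delivered over 1000 us give an
intermediate product of 144800000000, which does not fit in 32 bits.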

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 14948 bytes --]

^ permalink raw reply

* Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
From: Cyrill Gorcunov @ 2016-09-16 19:52 UTC (permalink / raw)
  To: David Ahern
  Cc: Eric Dumazet, netdev, linux-kernel, David Miller, kuznet, jmorris,
	yoshfuji, kaber, avagin, stephen
In-Reply-To: <d943c6b7-0f06-3823-58d2-6f79f17c3d59@cumulusnetworks.com>

On Fri, Sep 16, 2016 at 01:47:57PM -0600, David Ahern wrote:
> >>
> >> I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If you pass something
> >> else (IPPROTO_ICMP for example) it won't work.
> > 
> > True. To support IPPROTO_ICMP it need enhancement. I thought start with
> > plain _RAW first and then extend to support _ICMP.
> 
> I thought raw in this case was SOCK_RAW as in the socket type.
> 
> Since the display is showing sockets in addition to IPPROTO_RAW:
> 
> $ ss -A raw
> State      Recv-Q Send-Q        Local Address:Port                         Peer Address:Port
> UNCONN     0      0                    *%eth0:icmp                                    *:*
> 
> It is going to be confusing if only ipproto-255 sockets can be killed.

OK, gimme some time to implement it. Hopefully over the weekend or on Monday.
Thanks a lot for the feedback!

^ permalink raw reply

* Re: Modification to skb->queue_mapping affecting performance
From: Eric Dumazet @ 2016-09-16 19:53 UTC (permalink / raw)
  To: Michael Ma; +Cc: netdev
In-Reply-To: <CAAmHdhx7uvg_q49oz_u8q6NYmv=okxmKg1Tc5ny6oRCvaxbnww@mail.gmail.com>

On Fri, 2016-09-16 at 10:57 -0700, Michael Ma wrote:

> This is actually the problem - if flows from different RX queues are
> switched to the same RX queue in IFB, they'll use different processor
> context with the same tasklet, and the processor context of different
> tasklets might be the same. So multiple tasklets in IFB competes for
> the same core when queue is switched.
> 
> The following simple fix proved this - with this change even switching
> the queue won't affect small packet bandwidth/latency anymore:
> 
> in ifb.c:
> 
> -       struct ifb_q_private *txp = dp->tx_private + skb_get_queue_mapping(skb);
> +       struct ifb_q_private *txp = dp->tx_private +
> (smp_processor_id() % dev->num_tx_queues);
> 
> This should be more efficient since we're not sending the task to a
> different processor, instead we try to queue the packet to an
> appropriate tasklet based on the processor ID. Will this cause any
> packet out-of-order problem? If packets from the same flow are queued
> to the same RX queue due to RSS, and processor affinity is set for RX
> queues, I assume packets from the same flow will end up in the same
> core when tasklet is scheduled. But I might have missed some uncommon
> cases here... Would appreciate if anyone can provide more insights.

Wait, don't you have proper smp affinity for the RX queues on your NIC ?

( Documentation/networking/scaling.txt RSS IRQ Configuration )

A driver ndo_start_xmit() MUST use skb_get_queue_mapping(skb), because
the driver queue is locked before ndo_start_xmit() (for non
NETIF_F_LLTX drivers at least)

In case of ifb, __skb_queue_tail(&txp->rq, skb); could corrupt the skb
list.

In any case, you could have an action to do this before reaching IFB.
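
A toy model of the locking argument, as a sketch only: the core stack locks
the queue selected by queue_mapping, so a driver that enqueues on some other
queue touches a list whose lock is not held. All names below are invented
for the illustration and nothing here is actual ifb code:

```c
#include <stdbool.h>

#define NUM_TXQ 4

/* Hypothetical model of one TX queue: a lock flag and a queue length. */
struct txq_model {
	bool locked;
	int qlen;
};

typedef int (*pick_queue_fn)(int queue_mapping, int cpu);

/* What a driver must do: use the queue the stack already locked. */
static int pick_by_mapping(int queue_mapping, int cpu)
{
	(void)cpu;
	return queue_mapping;
}

/* The proposed change: pick a queue from the current CPU id. */
static int pick_by_cpu(int queue_mapping, int cpu)
{
	(void)queue_mapping;
	return cpu % NUM_TXQ;
}

/* Models one transmit. Returns true if the enqueue touched a queue
 * whose lock was held, false if it raced with that queue's owner. */
static bool xmit_is_safe(struct txq_model *qs, int queue_mapping, int cpu,
			 pick_queue_fn pick)
{
	bool safe;
	int q;

	qs[queue_mapping].locked = true;	/* stack locks this txq */
	q = pick(queue_mapping, cpu);		/* driver picks its queue */
	safe = qs[q].locked;
	qs[q].qlen++;				/* the enqueue itself */
	qs[queue_mapping].locked = false;	/* stack unlocks */
	return safe;
}
```

The CPU-based choice is only safe when it happens to coincide with
queue_mapping.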

^ permalink raw reply

* Re: [PATCH v3] net: ip, diag -- Add diag interface for raw sockets
From: David Ahern @ 2016-09-16 19:55 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric Dumazet, netdev, linux-kernel, David Miller, kuznet, jmorris,
	yoshfuji, kaber, avagin, stephen
In-Reply-To: <20160916195252.GC18116@uranus.lan>

On 9/16/16 1:52 PM, Cyrill Gorcunov wrote:
> On Fri, Sep 16, 2016 at 01:47:57PM -0600, David Ahern wrote:
>>>>
>>>> I'm guessing you passed IPPROTO_RAW (255) as the protocol to socket(). If you pass something
>>>> else (IPPROTO_ICMP for example) it won't work.
>>>
>>> True. To support IPPROTO_ICMP it need enhancement. I thought start with
>>> plain _RAW first and then extend to support _ICMP.
>>
>> I thought raw in this case was SOCK_RAW as in the socket type.
>>
>> Since the display is showing sockets in addition to IPPROTO_RAW:
>>
>> $ ss -A raw
>> State      Recv-Q Send-Q        Local Address:Port                         Peer Address:Port
>> UNCONN     0      0                    *%eth0:icmp                                    *:*
>>
>> It is going to be confusing if only ipproto-255 sockets can be killed.
> 
> OK, gimme some time to implement it. Hopefully on the weekend or monday.
> Thanks a huge for feedback!
> 

It may well be an ss bug / problem. As I mentioned, I am always seeing 255 for the protocol, which is odd since ss does a dump, takes the matches and invokes the kill. Thanks for taking the time to do the kill piece.

^ permalink raw reply

* Re: [PATCH v5 0/6] Add eBPF hooks for cgroups
From: Sargun Dhillon @ 2016-09-16 19:57 UTC (permalink / raw)
  To: Daniel Mack
  Cc: Pablo Neira Ayuso, htejun, daniel, ast, davem, kafai, fw, harald,
	netdev, cgroups
In-Reply-To: <6de6809a-13f5-4000-5639-c760dde30223@zonque.org>

On Wed, Sep 14, 2016 at 01:13:16PM +0200, Daniel Mack wrote:
> Hi Pablo,
> 
> On 09/13/2016 07:24 PM, Pablo Neira Ayuso wrote:
> > On Tue, Sep 13, 2016 at 03:31:20PM +0200, Daniel Mack wrote:
> >> On 09/13/2016 01:56 PM, Pablo Neira Ayuso wrote:
> >>> On Mon, Sep 12, 2016 at 06:12:09PM +0200, Daniel Mack wrote:
> >>>> This is v5 of the patch set to allow eBPF programs for network
> >>>> filtering and accounting to be attached to cgroups, so that they apply
> >>>> to all sockets of all tasks placed in that cgroup. The logic can
> >>>> also be extended for other cgroup based eBPF logic.
> >>>
> >>> 1) This infrastructure can only be useful to systemd, or any similar
> >>>    orchestration daemon. Look, you can only apply filtering policies
> >>>    to processes that are launched by systemd, so this only works
> >>>    for server processes.
> >>
> >> Sorry, but both statements aren't true. The eBPF policies apply to every
> >> process that is placed in a cgroup, and my example program in 6/6 shows
> >> how that can be done from the command line.
> > 
> > Then you have to explain to me how anyone other than systemd can use
> > this infrastructure?
> 
> I have no idea what makes you think this is limited to systemd. As I
> said, I provided an example for userspace that works from the command
> line. The same limitations apply as for all other users of cgroups.
> 
So, at least in my work, we have Mesos, but on nearly every machine that Mesos
runs, people also have systemd. Recently there has been a bit of a battle over
ownership of things like cgroups on these machines. We can usually solve it
by nesting under systemd cgroups, and thus so far we've avoided making too many
systemd-specific concessions.

The reason this works (mostly), is because everything we touch has a sense of 
nesting, where we can apply policy at a place lower in the hierarchy, and yet 
systemd's monitoring and policy still stays in place. 

Now, with this patch, we don't have that, but I think we can reasonably add some
flag like "no override" when applying policies, or alternatively something like
"no new privileges", to prevent children from applying policies that override
top-level policy. I realize there is a speed concern as well, but I think for
people who want nested policy, we're willing to make the tradeoff. The cost
of traversing a few extra pointers is still lower than the overhead of network
namespaces, iptables, etc. for many of us.
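
To sketch what I mean by a "no override" flag, here is a toy userspace
model. Everything in it is hypothetical (struct, field and function names
alike) and none of it exists in the patch set. The nearest policy wins
unless an ancestor has pinned its own:

```c
#include <stdbool.h>
#include <stddef.h>

enum verdict { VERDICT_NONE, VERDICT_ALLOW, VERDICT_DENY };

/* Hypothetical model of a cgroup with an optional attached policy. */
struct cg_node {
	struct cg_node *parent;
	enum verdict policy;	/* VERDICT_NONE: nothing attached here */
	bool no_override;	/* descendants may not override this one */
};

/* Walk from the leaf up to the root. The first policy found (the
 * nearest one) is used, unless some ancestor set no_override, in which
 * case the topmost such ancestor wins. */
static enum verdict effective_verdict(const struct cg_node *cg)
{
	enum verdict result = VERDICT_NONE;
	const struct cg_node *n;

	for (n = cg; n; n = n->parent) {
		if (n->policy == VERDICT_NONE)
			continue;
		if (result == VERDICT_NONE || n->no_override)
			result = n->policy;
	}
	return result;
}
```

The walk costs one pointer dereference per level of nesting, which is the
tradeoff mentioned above.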

What do you think Daniel?

> > My main point is that those processes *need* to be launched by the
> > orchestrator, which is what I was referring to as 'server processes'.
> 
> Yes, that's right. But as I said, this rule applies to many other kernel
> concepts, so I don't see any real issue.
>
Also, cgroups have become such a big part of how applications are managed
that many of us have solved this problem.

> >> That's a limitation that applies to many more control mechanisms in the
> >> kernel, and it's something that can easily be solved with fork+exec.
> > 
> > As long as you have control to launch the processes yes, but this
> > will not work in other scenarios. Just like cgroup net_cls and friends
> > are broken for filtering for things that you have no control to
> > fork+exec.
> 
> Probably, but that's only solvable with rules that store the full cgroup
> path then, and do a string comparison (!) for each packet flying by.
>
> >> That's just as transparent as SO_ATTACH_FILTER. What kind of
> >> introspection mechanism do you have in mind?
> > 
> > SO_ATTACH_FILTER is called from the process itself, so this is a local
> > filtering policy that you apply to your own process.
> 
> Not necessarily. You can as well do it the inetd way, and pass the
> socket to a process that is launched on demand, but do SO_ATTACH_FILTER
> + SO_LOCK_FILTER  in the middle. What happens with payload on the socket
> is not transparent to the launched binary at all. The proposed cgroup
> eBPF solution implements a very similar behavior in that regard.
> 
It would be nice to be able to see whether or not a filter is attached to a 
cgroup, but given this is going through syscalls, at least introspection
is possible as opposed to something like netlink.

> >> It's about filtering outgoing network packets of applications, and
> >> providing them with L2 information for filtering purposes. I don't think
> >> that's a very specific use-case.
> >>
> >> When the feature is not used at all, the added costs on the output path
> >> are close to zero, due to the use of static branches.
> > 
> > *You're proposing a socket filtering facility that hooks layer 2
> > output path*!
> 
> As I said, I'm open to discussing that. In order to make it work for L3,
> the LL_OFF issues need to be solved, as Daniel explained. Daniel,
> Alexei, any idea how much work that would be?
> 
> > That is only a rough ~30 lines kernel patchset to support this in
> > netfilter and only one extra input hook, with potential access to
> > conntrack and better integration with other existing subsystems.
> 
> Care to share the patches for that? I'd really like to have a look.
> 
> And FWIW, I agree with Thomas - there is nothing wrong with having
> multiple options to use for such use-cases.

Right now, for containers, we have netfilter and network namespaces.
These come with a lot of performance overhead. Not only that, but
iptables doesn't really offer a simple interface for automated
infrastructure. We (firewalld, systemd, dockerd, mesos) end up
fighting with one another for ownership of firewall rules.

Although I have problems with this approach, I think it's a good
baseline where we can have the top level owned by systemd, docker
underneath that, and Mesos underneath that. We can add additional
hooks for things like Checmate and Landlock, and with a little more
work, we can do composition, solving all of our problems.

> 
> 
> Thanks,
> Daniel
> 


* [PATCHv4 next 0/3] IPvlan introduce l3s mode
From: Mahesh Bandewar @ 2016-09-16 19:59 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, David Miller, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

Same old problem with a new approach, based largely on suggestions from
the earlier patch series.

First, this is introduced as a new mode rather than as a modification of
the old (L3) mode. The behavior of the existing modes is preserved as
is, and the new L3s mode obeys iptables so that the intended
conn-tracking can work.

To do this, the code uses the newly added l3mdev_rcv() handler and an
iptables hook: l3mdev_rcv() performs an inbound route lookup with the
correct (IPvlan slave) interface, and the iptables hook at LOCAL_INPUT
then changes the input device from the master to the slave to complete
the formality.

The supporting stack changes are trivial: exporting a symbol so that
IPv6 has an equivalent of the already exported IPv4 code, and allowing
the netfilter hook registration code to be called with RTNL held.
Please look into the individual patches for details.

Mahesh Bandewar (3):
  ipv6: Export ip6_route_input_lookup symbol
  net: Add _nf_(un)register_hooks symbols
  ipvlan: Introduce l3s mode

 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig                 |  1 +
 drivers/net/ipvlan/ipvlan.h         |  6 +++
 drivers/net/ipvlan/ipvlan_core.c    | 94 +++++++++++++++++++++++++++++++++++++
 drivers/net/ipvlan/ipvlan_main.c    | 87 +++++++++++++++++++++++++++++++---
 include/linux/netfilter.h           |  2 +
 include/net/ip6_route.h             |  3 ++
 include/uapi/linux/if_link.h        |  1 +
 net/ipv6/route.c                    |  7 +--
 net/netfilter/core.c                | 51 ++++++++++++++++++--
 10 files changed, 243 insertions(+), 16 deletions(-)

v1: Initial post
v2: Text correction and config changed from "select" to "depends on"
v3: separated nf_hook registration logic and made it independent of port
    as nf_hook registration is independent of how many IPvlan ports are
    present in the system.
v4: Eliminated need to have "hooks_attached" per port and rely just on
    the mode. Also change BUG_ON to WARN_ON

-- 
2.8.0.rc3.226.g39d4020


* [PATCHv4 next 1/3] ipv6: Export ip6_route_input_lookup symbol
From: Mahesh Bandewar @ 2016-09-16 19:59 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, David Miller, Mahesh Bandewar

From: Mahesh Bandewar <maheshb@google.com>

Make ip6_route_input_lookup available outside of the ipv6 module,
similar to ip_route_input_noref in the IPv4 world.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
---
 include/net/ip6_route.h | 3 +++
 net/ipv6/route.c        | 7 ++++---
 2 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index d97305d0e71f..e0cd318d5103 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -64,6 +64,9 @@ static inline bool rt6_need_strict(const struct in6_addr *daddr)
 }
 
 void ip6_route_input(struct sk_buff *skb);
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+					 struct net_device *dev,
+					 struct flowi6 *fl6, int flags);
 
 struct dst_entry *ip6_route_output_flags(struct net *net, const struct sock *sk,
 					 struct flowi6 *fl6, int flags);
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index ad4a7ff301fc..4dab585f7642 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1147,15 +1147,16 @@ static struct rt6_info *ip6_pol_route_input(struct net *net, struct fib6_table *
 	return ip6_pol_route(net, table, fl6->flowi6_iif, fl6, flags);
 }
 
-static struct dst_entry *ip6_route_input_lookup(struct net *net,
-						struct net_device *dev,
-						struct flowi6 *fl6, int flags)
+struct dst_entry *ip6_route_input_lookup(struct net *net,
+					 struct net_device *dev,
+					 struct flowi6 *fl6, int flags)
 {
 	if (rt6_need_strict(&fl6->daddr) && dev->type != ARPHRD_PIMREG)
 		flags |= RT6_LOOKUP_F_IFACE;
 
 	return fib6_rule_lookup(net, fl6, flags, ip6_pol_route_input);
 }
+EXPORT_SYMBOL_GPL(ip6_route_input_lookup);
 
 void ip6_route_input(struct sk_buff *skb)
 {
-- 
2.8.0.rc3.226.g39d4020


* [PATCHv4 next 2/3] net: Add _nf_(un)register_hooks symbols
From: Mahesh Bandewar @ 2016-09-16 19:59 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, David Miller, Mahesh Bandewar, Pablo Neira Ayuso

From: Mahesh Bandewar <maheshb@google.com>

Add _nf_register_hooks() and _nf_unregister_hooks(), variants of the
existing calls for callers that already hold the RTNL mutex.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter.h |  2 ++
 net/netfilter/core.c      | 51 ++++++++++++++++++++++++++++++++++++++++++-----
 2 files changed, 48 insertions(+), 5 deletions(-)

diff --git a/include/linux/netfilter.h b/include/linux/netfilter.h
index 9230f9aee896..e82b76781bf6 100644
--- a/include/linux/netfilter.h
+++ b/include/linux/netfilter.h
@@ -133,6 +133,8 @@ int nf_register_hook(struct nf_hook_ops *reg);
 void nf_unregister_hook(struct nf_hook_ops *reg);
 int nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n);
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n);
 
 /* Functions to register get/setsockopt ranges (non-inclusive).  You
    need to check permissions yourself! */
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index f39276d1c2d7..2c5327e43a88 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -188,19 +188,17 @@ EXPORT_SYMBOL(nf_unregister_net_hooks);
 
 static LIST_HEAD(nf_hook_list);
 
-int nf_register_hook(struct nf_hook_ops *reg)
+static int _nf_register_hook(struct nf_hook_ops *reg)
 {
 	struct net *net, *last;
 	int ret;
 
-	rtnl_lock();
 	for_each_net(net) {
 		ret = nf_register_net_hook(net, reg);
 		if (ret && ret != -ENOENT)
 			goto rollback;
 	}
 	list_add_tail(&reg->list, &nf_hook_list);
-	rtnl_unlock();
 
 	return 0;
 rollback:
@@ -210,19 +208,34 @@ rollback:
 			break;
 		nf_unregister_net_hook(net, reg);
 	}
+	return ret;
+}
+
+int nf_register_hook(struct nf_hook_ops *reg)
+{
+	int ret;
+
+	rtnl_lock();
+	ret = _nf_register_hook(reg);
 	rtnl_unlock();
+
 	return ret;
 }
 EXPORT_SYMBOL(nf_register_hook);
 
-void nf_unregister_hook(struct nf_hook_ops *reg)
+static void _nf_unregister_hook(struct nf_hook_ops *reg)
 {
 	struct net *net;
 
-	rtnl_lock();
 	list_del(&reg->list);
 	for_each_net(net)
 		nf_unregister_net_hook(net, reg);
+}
+
+void nf_unregister_hook(struct nf_hook_ops *reg)
+{
+	rtnl_lock();
+	_nf_unregister_hook(reg);
 	rtnl_unlock();
 }
 EXPORT_SYMBOL(nf_unregister_hook);
@@ -246,6 +259,26 @@ err:
 }
 EXPORT_SYMBOL(nf_register_hooks);
 
+/* Caller MUST take rtnl_lock() */
+int _nf_register_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+	unsigned int i;
+	int err = 0;
+
+	for (i = 0; i < n; i++) {
+		err = _nf_register_hook(&reg[i]);
+		if (err)
+			goto err;
+	}
+	return err;
+
+err:
+	if (i > 0)
+		_nf_unregister_hooks(reg, i);
+	return err;
+}
+EXPORT_SYMBOL(_nf_register_hooks);
+
 void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
 {
 	while (n-- > 0)
@@ -253,6 +286,14 @@ void nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
 }
 EXPORT_SYMBOL(nf_unregister_hooks);
 
+/* Caller MUST take rtnl_lock */
+void _nf_unregister_hooks(struct nf_hook_ops *reg, unsigned int n)
+{
+	while (n-- > 0)
+		_nf_unregister_hook(&reg[n]);
+}
+EXPORT_SYMBOL(_nf_unregister_hooks);
+
 unsigned int nf_iterate(struct list_head *head,
 			struct sk_buff *skb,
 			struct nf_hook_state *state,
-- 
2.8.0.rc3.226.g39d4020

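The split in the patch above (an unlocked core helper plus a thin wrapper that takes the lock, with rollback on partial failure) can be sketched in userspace as follows. This is only an illustration under stated assumptions: `big_lock` is a pthread mutex standing in for RTNL, and `registered[]`, `MAX_HOOKS`, and all function names here are hypothetical, not kernel APIs.

```c
#include <pthread.h>

#define MAX_HOOKS 8

/* Stand-in for the RTNL mutex (hypothetical name). */
static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;
static int registered[MAX_HOOKS];

/* Unlocked core helpers: the caller MUST hold big_lock. */
static int _register_one(int id)
{
	if (id < 0 || id >= MAX_HOOKS)
		return -1;
	registered[id] = 1;
	return 0;
}

static void _unregister_one(int id)
{
	registered[id] = 0;
}

static void _unregister_many(const int *ids, unsigned int n)
{
	while (n-- > 0)
		_unregister_one(ids[n]);
}

/* On failure, roll back the i entries that were already registered. */
static int _register_many(const int *ids, unsigned int n)
{
	unsigned int i;
	int err = 0;

	for (i = 0; i < n; i++) {
		err = _register_one(ids[i]);
		if (err)
			goto err;
	}
	return 0;

err:
	if (i > 0)
		_unregister_many(ids, i);
	return err;
}

/* Locked wrappers for callers that do not already hold the lock. */
static int register_many(const int *ids, unsigned int n)
{
	int err;

	pthread_mutex_lock(&big_lock);
	err = _register_many(ids, n);
	pthread_mutex_unlock(&big_lock);
	return err;
}

static void unregister_many(const int *ids, unsigned int n)
{
	pthread_mutex_lock(&big_lock);
	_unregister_many(ids, n);
	pthread_mutex_unlock(&big_lock);
}

/* Helper for inspecting the sketch's state. */
static int count_registered(void)
{
	int i, total = 0;

	for (i = 0; i < MAX_HOOKS; i++)
		total += registered[i];
	return total;
}
```

A caller that already holds the lock would use the underscore variants directly; everyone else goes through the wrappers, so the locking rule lives in one place and the rollback logic is shared by both paths.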

* [PATCHv4 next 3/3] ipvlan: Introduce l3s mode
From: Mahesh Bandewar @ 2016-09-16 19:59 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet, David Miller, Mahesh Bandewar, David Ahern

From: Mahesh Bandewar <maheshb@google.com>

In a typical IPvlan L3 setup, the master is in the default-ns and
each slave is in a different (slave) ns. In this setup, egress
packet processing for traffic originating from a slave-ns hits
all NF_HOOKs in the slave-ns as well as in the default-ns. However,
the same is not true for ingress processing: these NF_HOOKs are
hit only in the slave-ns and are skipped in the default-ns.
IPvlan in L3 mode is restrictive, and if admins want to deploy
iptables rules in the default-ns, this asymmetric data path makes it
impossible to do so.

This patch makes use of l3_rcv() (added as part of the l3mdev
enhancements) to perform an input route lookup on RX packets without
changing skb->dev, and then uses an nf_hook at NF_INET_LOCAL_IN
to change skb->dev just before handing the skb over to L4.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
CC: David Ahern <dsa@cumulusnetworks.com>
---
 Documentation/networking/ipvlan.txt |  7 ++-
 drivers/net/Kconfig                 |  1 +
 drivers/net/ipvlan/ipvlan.h         |  6 +++
 drivers/net/ipvlan/ipvlan_core.c    | 94 +++++++++++++++++++++++++++++++++++++
 drivers/net/ipvlan/ipvlan_main.c    | 87 +++++++++++++++++++++++++++++++---
 include/uapi/linux/if_link.h        |  1 +
 6 files changed, 188 insertions(+), 8 deletions(-)

diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt
index 14422f8fcdc4..24196cef7c91 100644
--- a/Documentation/networking/ipvlan.txt
+++ b/Documentation/networking/ipvlan.txt
@@ -22,7 +22,7 @@ The driver can be built into the kernel (CONFIG_IPVLAN=y) or as a module
 	There are no module parameters for this driver and it can be configured
 using IProute2/ip utility.
 
-	ip link add link <master-dev> <slave-dev> type ipvlan mode { l2 | L3 }
+	ip link add link <master-dev> <slave-dev> type ipvlan mode { l2 | l3 | l3s }
 
 	e.g. ip link add link ipvl0 eth0 type ipvlan mode l2
 
@@ -48,6 +48,11 @@ master device for the L2 processing and routing from that instance will be
 used before packets are queued on the outbound device. In this mode the slaves
 will not receive nor can send multicast / broadcast traffic.
 
+4.3 L3S mode:
+	This is very similar to the L3 mode except that iptables (conn-tracking)
+works in this mode and hence it is L3-symmetric (L3s). This will have slightly less
+performance but that shouldn't matter since you are choosing this mode over plain-L3
+mode to make conn-tracking work.
 
 5. What to choose (macvlan vs. ipvlan)?
 	These two devices are very similar in many regards and the specific use
diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
index 0c5415b05ea9..8768a625350d 100644
--- a/drivers/net/Kconfig
+++ b/drivers/net/Kconfig
@@ -149,6 +149,7 @@ config IPVLAN
     tristate "IP-VLAN support"
     depends on INET
     depends on IPV6
+    depends on NET_L3_MASTER_DEV
     ---help---
       This allows one to create virtual devices off of a main interface
       and packets will be delivered based on the dest L3 (IPv6/IPv4 addr)
diff --git a/drivers/net/ipvlan/ipvlan.h b/drivers/net/ipvlan/ipvlan.h
index 695a5dc9ace3..7e0732f5ea07 100644
--- a/drivers/net/ipvlan/ipvlan.h
+++ b/drivers/net/ipvlan/ipvlan.h
@@ -23,11 +23,13 @@
 #include <linux/if_vlan.h>
 #include <linux/ip.h>
 #include <linux/inetdevice.h>
+#include <linux/netfilter.h>
 #include <net/ip.h>
 #include <net/ip6_route.h>
 #include <net/rtnetlink.h>
 #include <net/route.h>
 #include <net/addrconf.h>
+#include <net/l3mdev.h>
 
 #define IPVLAN_DRV	"ipvlan"
 #define IPV_DRV_VER	"0.1"
@@ -124,4 +126,8 @@ struct ipvl_addr *ipvlan_find_addr(const struct ipvl_dev *ipvlan,
 				   const void *iaddr, bool is_v6);
 bool ipvlan_addr_busy(struct ipvl_port *port, void *iaddr, bool is_v6);
 void ipvlan_ht_addr_del(struct ipvl_addr *addr);
+struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
+			      u16 proto);
+unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
+			     const struct nf_hook_state *state);
 #endif /* __IPVLAN_H */
diff --git a/drivers/net/ipvlan/ipvlan_core.c b/drivers/net/ipvlan/ipvlan_core.c
index b5f9511d819e..b4e990743e1d 100644
--- a/drivers/net/ipvlan/ipvlan_core.c
+++ b/drivers/net/ipvlan/ipvlan_core.c
@@ -560,6 +560,7 @@ int ipvlan_queue_xmit(struct sk_buff *skb, struct net_device *dev)
 	case IPVLAN_MODE_L2:
 		return ipvlan_xmit_mode_l2(skb, dev);
 	case IPVLAN_MODE_L3:
+	case IPVLAN_MODE_L3S:
 		return ipvlan_xmit_mode_l3(skb, dev);
 	}
 
@@ -664,6 +665,8 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb)
 		return ipvlan_handle_mode_l2(pskb, port);
 	case IPVLAN_MODE_L3:
 		return ipvlan_handle_mode_l3(pskb, port);
+	case IPVLAN_MODE_L3S:
+		return RX_HANDLER_PASS;
 	}
 
 	/* Should not reach here */
@@ -672,3 +675,94 @@ rx_handler_result_t ipvlan_handle_frame(struct sk_buff **pskb)
 	kfree_skb(skb);
 	return RX_HANDLER_CONSUMED;
 }
+
+static struct ipvl_addr *ipvlan_skb_to_addr(struct sk_buff *skb,
+					    struct net_device *dev)
+{
+	struct ipvl_addr *addr = NULL;
+	struct ipvl_port *port;
+	void *lyr3h;
+	int addr_type;
+
+	if (!dev || !netif_is_ipvlan_port(dev))
+		goto out;
+
+	port = ipvlan_port_get_rcu(dev);
+	if (!port || port->mode != IPVLAN_MODE_L3S)
+		goto out;
+
+	lyr3h = ipvlan_get_L3_hdr(skb, &addr_type);
+	if (!lyr3h)
+		goto out;
+
+	addr = ipvlan_addr_lookup(port, lyr3h, addr_type, true);
+out:
+	return addr;
+}
+
+struct sk_buff *ipvlan_l3_rcv(struct net_device *dev, struct sk_buff *skb,
+			      u16 proto)
+{
+	struct ipvl_addr *addr;
+	struct net_device *sdev;
+
+	addr = ipvlan_skb_to_addr(skb, dev);
+	if (!addr)
+		goto out;
+
+	sdev = addr->master->dev;
+	switch (proto) {
+	case AF_INET:
+	{
+		int err;
+		struct iphdr *ip4h = ip_hdr(skb);
+
+		err = ip_route_input_noref(skb, ip4h->daddr, ip4h->saddr,
+					   ip4h->tos, sdev);
+		if (unlikely(err))
+			goto out;
+		break;
+	}
+	case AF_INET6:
+	{
+		struct dst_entry *dst;
+		struct ipv6hdr *ip6h = ipv6_hdr(skb);
+		int flags = RT6_LOOKUP_F_HAS_SADDR;
+		struct flowi6 fl6 = {
+			.flowi6_iif   = sdev->ifindex,
+			.daddr        = ip6h->daddr,
+			.saddr        = ip6h->saddr,
+			.flowlabel    = ip6_flowinfo(ip6h),
+			.flowi6_mark  = skb->mark,
+			.flowi6_proto = ip6h->nexthdr,
+		};
+
+		skb_dst_drop(skb);
+		dst = ip6_route_input_lookup(dev_net(sdev), sdev, &fl6, flags);
+		skb_dst_set(skb, dst);
+		break;
+	}
+	default:
+		break;
+	}
+
+out:
+	return skb;
+}
+
+unsigned int ipvlan_nf_input(void *priv, struct sk_buff *skb,
+			     const struct nf_hook_state *state)
+{
+	struct ipvl_addr *addr;
+	unsigned int len;
+
+	addr = ipvlan_skb_to_addr(skb, skb->dev);
+	if (!addr)
+		goto out;
+
+	skb->dev = addr->master->dev;
+	len = skb->len + ETH_HLEN;
+	ipvlan_count_rx(addr->master, len, true, false);
+out:
+	return NF_ACCEPT;
+}
diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
index 18b4e8c7f68a..f442eb366863 100644
--- a/drivers/net/ipvlan/ipvlan_main.c
+++ b/drivers/net/ipvlan/ipvlan_main.c
@@ -9,24 +9,87 @@
 
 #include "ipvlan.h"
 
+static u32 ipvl_nf_hook_refcnt = 0;
+
+static struct nf_hook_ops ipvl_nfops[] __read_mostly = {
+	{
+		.hook     = ipvlan_nf_input,
+		.pf       = NFPROTO_IPV4,
+		.hooknum  = NF_INET_LOCAL_IN,
+		.priority = INT_MAX,
+	},
+	{
+		.hook     = ipvlan_nf_input,
+		.pf       = NFPROTO_IPV6,
+		.hooknum  = NF_INET_LOCAL_IN,
+		.priority = INT_MAX,
+	},
+};
+
+static struct l3mdev_ops ipvl_l3mdev_ops __read_mostly = {
+	.l3mdev_l3_rcv = ipvlan_l3_rcv,
+};
+
 static void ipvlan_adjust_mtu(struct ipvl_dev *ipvlan, struct net_device *dev)
 {
 	ipvlan->dev->mtu = dev->mtu - ipvlan->mtu_adj;
 }
 
-static void ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
+static int ipvlan_register_nf_hook(void)
+{
+	int err = 0;
+
+	if (!ipvl_nf_hook_refcnt) {
+		err = _nf_register_hooks(ipvl_nfops, ARRAY_SIZE(ipvl_nfops));
+		if (!err)
+			ipvl_nf_hook_refcnt = 1;
+	} else {
+		ipvl_nf_hook_refcnt++;
+	}
+
+	return err;
+}
+
+static void ipvlan_unregister_nf_hook(void)
+{
+	WARN_ON(!ipvl_nf_hook_refcnt);
+
+	ipvl_nf_hook_refcnt--;
+	if (!ipvl_nf_hook_refcnt)
+		_nf_unregister_hooks(ipvl_nfops, ARRAY_SIZE(ipvl_nfops));
+}
+
+static int ipvlan_set_port_mode(struct ipvl_port *port, u16 nval)
 {
 	struct ipvl_dev *ipvlan;
+	struct net_device *mdev = port->dev;
+	int err = 0;
 
+	ASSERT_RTNL();
 	if (port->mode != nval) {
+		if (nval == IPVLAN_MODE_L3S) {
+			/* New mode is L3S */
+			err = ipvlan_register_nf_hook();
+			if (!err) {
+				mdev->l3mdev_ops = &ipvl_l3mdev_ops;
+				mdev->priv_flags |= IFF_L3MDEV_MASTER;
+			} else
+				return err;
+		} else if (port->mode == IPVLAN_MODE_L3S) {
+			/* Old mode was L3S */
+			mdev->priv_flags &= ~IFF_L3MDEV_MASTER;
+			ipvlan_unregister_nf_hook();
+			mdev->l3mdev_ops = NULL;
+		}
 		list_for_each_entry(ipvlan, &port->ipvlans, pnode) {
-			if (nval == IPVLAN_MODE_L3)
+			if (nval == IPVLAN_MODE_L3 || nval == IPVLAN_MODE_L3S)
 				ipvlan->dev->flags |= IFF_NOARP;
 			else
 				ipvlan->dev->flags &= ~IFF_NOARP;
 		}
 		port->mode = nval;
 	}
+	return err;
 }
 
 static int ipvlan_port_create(struct net_device *dev)
@@ -74,6 +137,11 @@ static void ipvlan_port_destroy(struct net_device *dev)
 	struct ipvl_port *port = ipvlan_port_get_rtnl(dev);
 
 	dev->priv_flags &= ~IFF_IPVLAN_MASTER;
+	if (port->mode == IPVLAN_MODE_L3S) {
+		dev->priv_flags &= ~IFF_L3MDEV_MASTER;
+		ipvlan_unregister_nf_hook();
+		dev->l3mdev_ops = NULL;
+	}
 	netdev_rx_handler_unregister(dev);
 	cancel_work_sync(&port->wq);
 	__skb_queue_purge(&port->backlog);
@@ -132,7 +200,8 @@ static int ipvlan_open(struct net_device *dev)
 	struct net_device *phy_dev = ipvlan->phy_dev;
 	struct ipvl_addr *addr;
 
-	if (ipvlan->port->mode == IPVLAN_MODE_L3)
+	if (ipvlan->port->mode == IPVLAN_MODE_L3 ||
+	    ipvlan->port->mode == IPVLAN_MODE_L3S)
 		dev->flags |= IFF_NOARP;
 	else
 		dev->flags &= ~IFF_NOARP;
@@ -372,13 +441,14 @@ static int ipvlan_nl_changelink(struct net_device *dev,
 {
 	struct ipvl_dev *ipvlan = netdev_priv(dev);
 	struct ipvl_port *port = ipvlan_port_get_rtnl(ipvlan->phy_dev);
+	int err = 0;
 
 	if (data && data[IFLA_IPVLAN_MODE]) {
 		u16 nmode = nla_get_u16(data[IFLA_IPVLAN_MODE]);
 
-		ipvlan_set_port_mode(port, nmode);
+		err = ipvlan_set_port_mode(port, nmode);
 	}
-	return 0;
+	return err;
 }
 
 static size_t ipvlan_nl_getsize(const struct net_device *dev)
@@ -473,10 +543,13 @@ static int ipvlan_link_new(struct net *src_net, struct net_device *dev,
 		unregister_netdevice(dev);
 		return err;
 	}
+	err = ipvlan_set_port_mode(port, mode);
+	if (err) {
+		unregister_netdevice(dev);
+		return err;
+	}
 
 	list_add_tail_rcu(&ipvlan->pnode, &port->ipvlans);
-	ipvlan_set_port_mode(port, mode);
-
 	netif_stacked_transfer_operstate(phy_dev, dev);
 	return 0;
 }
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 9bf3aecfe05b..a615583bab09 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -464,6 +464,7 @@ enum {
 enum ipvlan_mode {
 	IPVLAN_MODE_L2 = 0,
 	IPVLAN_MODE_L3,
+	IPVLAN_MODE_L3S,
 	IPVLAN_MODE_MAX
 };
 
-- 
2.8.0.rc3.226.g39d4020

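The hook registration in the patch above is shared by all L3S ports: the first port to enter L3S mode registers the hooks, and the last one to leave unregisters them. A minimal userspace sketch of that counting scheme follows. All names are hypothetical stand-ins, `install_hooks()`/`remove_hooks()` merely flip a flag, and, unlike the patch (which calls WARN_ON and still decrements), this sketch refuses to decrement below zero.

```c
static unsigned int hook_refcnt;  /* number of users of the shared hooks */
static int hooks_installed;       /* stand-in for the real registration */
static int fail_next_install;     /* knob to simulate a registration error */

static int install_hooks(void)
{
	if (fail_next_install)
		return -1;
	hooks_installed = 1;
	return 0;
}

static void remove_hooks(void)
{
	hooks_installed = 0;
}

/* First user installs the hooks; later users only bump the count. */
static int hook_get(void)
{
	int err = 0;

	if (!hook_refcnt) {
		err = install_hooks();
		if (!err)
			hook_refcnt = 1;
	} else {
		hook_refcnt++;
	}
	return err;
}

/* Last user removes the hooks. An unbalanced put is ignored here. */
static void hook_put(void)
{
	if (!hook_refcnt)
		return;

	hook_refcnt--;
	if (!hook_refcnt)
		remove_hooks();
}
```

Like the patch, the sketch does no locking of its own; it assumes every caller is serialized by an outer lock (RTNL in the patch, as ASSERT_RTNL() in ipvlan_set_port_mode() indicates).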

* Re: [net PATCH] mlx4: fix XDP_TX is acting like XDP_PASS on TX ring full
From: Eric Dumazet @ 2016-09-16 20:00 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, tariqt, tom, bblanco, rana.shahot, David S. Miller
In-Reply-To: <20160916194645.13201.70408.stgit@firesoul>

On Fri, 2016-09-16 at 21:47 +0200, Jesper Dangaard Brouer wrote:
> The XDP_TX action can fail transmitting the frame in case the TX ring
> is full or port is down.  In case of TX failure it should drop the
> frame, and not as now call 'break' which is the same as XDP_PASS.
> 
> Fixes: 9ecc2d86171a ("net/mlx4_en: add xdp forwarding and data write support")
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> 
> ---
> Note, this fix has nothing to do with the page-refcnt bug I just reported.

Yeah, the e1000 driver proposal patch had the same issue.

> 
>  drivers/net/ethernet/mellanox/mlx4/en_rx.c |    1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 2040dad8611d..d414c67dfd12 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -906,6 +906,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>  							length, tx_index,
>  							&doorbell_pending))
>  					goto consumed;
> +				goto next;
>  				break;

Why keep this break; then? ;)

>  			default:
>  				bpf_warn_invalid_xdp_action(act);
> 

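The control-flow issue discussed above can be reduced to a small standalone sketch. The enums and `handle_verdict()` are hypothetical and only model the three outcomes; the real driver uses labels (`consumed`, `next`) inside its RX loop rather than return values.

```c
/* Hypothetical model of the verdicts and outcomes discussed above. */
enum sk_xdp_action { SK_XDP_DROP, SK_XDP_PASS, SK_XDP_TX };
enum rx_outcome { RX_TO_STACK, RX_DROPPED, RX_TX_QUEUED };

static enum rx_outcome handle_verdict(enum sk_xdp_action act, int tx_failed)
{
	switch (act) {
	case SK_XDP_PASS:
		break;                  /* leave the switch: deliver to the stack */
	case SK_XDP_TX:
		if (!tx_failed)
			return RX_TX_QUEUED;  /* 'goto consumed' in the driver */
		/*
		 * Before the fix this path hit 'break' and behaved like
		 * SK_XDP_PASS. After the fix ('goto next') the frame is
		 * dropped, so no 'break' is needed after it.
		 */
		return RX_DROPPED;
	case SK_XDP_DROP:
	default:
		return RX_DROPPED;
	}
	return RX_TO_STACK;
}
```

Because the failure path now leaves the case unconditionally, a trailing `break;` after it would be unreachable, which is the point of the question in the reply.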

* Re: [PATCH net-next 07/14] tcp: export data delivery rate
From: Eric Dumazet @ 2016-09-16 20:04 UTC (permalink / raw)
  To: kbuild test robot
  Cc: Neal Cardwell, kbuild-all, David Miller, netdev, Yuchung Cheng,
	Van Jacobson, Nandita Dukkipati, Eric Dumazet,
	Soheil Hassas Yeganeh
In-Reply-To: <201609171104.7NaqQo41%fengguang.wu@intel.com>

On Sat, 2016-09-17 at 11:56 +0800, kbuild test robot wrote:
> Hi Yuchung,
> 
> [auto build test ERROR on net-next/master]
> 
> url:    https://github.com/0day-ci/linux/commits/Neal-Cardwell/tcp-BBR-congestion-control-algorithm/20160917-025323
> config: arm-nhk8815_defconfig (attached as .config)
> compiler: arm-linux-gnueabi-gcc (Debian 6.1.1-9) 6.1.1 20160705
> reproduce:
>         wget https://git.kernel.org/cgit/linux/kernel/git/wfg/lkp-tests.git/plain/sbin/make.cross -O ~/bin/make.cross
>         chmod +x ~/bin/make.cross
>         # save the attached .config to linux build tree
>         make.cross ARCH=arm 

Right, we need to include <asm/div64.h> for some arches.

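For background, 64-bit division on 32-bit architectures in the kernel usually goes through helpers such as do_div(), which divides a 64-bit value in place by a 32-bit divisor and evaluates to the remainder. The reply does not say which helper the failing code uses, so treating it as do_div() is an assumption. The userspace sketch below only emulates those semantics; `do_div_sketch()` and `rate_bytes_per_sec()` are hypothetical names, not kernel functions.

```c
#include <stdint.h>

/*
 * Emulates do_div() semantics: *n is replaced by the quotient and the
 * remainder is returned. The caller must pass a non-zero base.
 */
static uint32_t do_div_sketch(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Hypothetical rate computation: bytes delivered over interval_us. */
static uint64_t rate_bytes_per_sec(uint64_t bytes, uint32_t interval_us)
{
	uint64_t rate = bytes * 1000000ULL;

	if (!interval_us)
		return 0;
	do_div_sketch(&rate, interval_us);
	return rate;
}
```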
