Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: Bug in skb_gro_receive - possible bad page state problems?
From: Eric Dumazet @ 2017-05-05  4:14 UTC (permalink / raw)
  To: Anand H. Krishnan; +Cc: netdev
In-Reply-To: <CACeb84vFNfPcC=FJ3HEjraZ02Zi2igbK0Q=M3=pUw7AKG=u3PQ@mail.gmail.com>

On Fri, 2017-05-05 at 08:57 +0530, Anand H. Krishnan wrote:
> Hello,
> 
> Is skb_gro_receive doing the right thing for cloned packets?
> 
> When we are merging fragments, we do not seem to be taking a reference
> to the underlying page. To me, it looks like it should work fine for non-cloned
> packets. However, for cloned packets, when the gro-ed packet is eventually
> freed (because the original skb was not cloned and hence reference was 1),
> the merged skb's frags also get freed (put_page-ed) without taking into account
> the other references that were held for the fragments (dataref).
> 
> We saw crashes because of this behavior. Our setup had a third party kernel
> forwarding module which uses GRO (napi_gro_receive). Doing iperf3 with small
> packets and doing tcpdump on the receiving tap interface results in the problem.
> With DEBUG_VM enabled, put page crashes. Without DEBUG_VM, bad page
> state results.

Yep, GRO must not be used with cloned skb.

This is why gro_cells_receive() has this check :

if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev))
        return netif_rx(skb);

(But not the main napi_gro_receive() that is supposed to be used by
driver before any tap)

^ permalink raw reply

* Bug in skb_gro_receive - possible bad page state problems?
From: Anand H. Krishnan @ 2017-05-05  3:27 UTC (permalink / raw)
  To: netdev

Hello,

Is skb_gro_receive doing the right thing for cloned packets?

When we are merging fragments, we do not seem to be taking a reference
to the underlying page. To me, it looks like it should work fine for non-cloned
packets. However, for cloned packets, when the gro-ed packet is eventually
freed (because the original skb was not cloned and hence reference was 1),
the merged skb's frags also get freed (put_page-ed) without taking into account
the other references that were held for the fragments (dataref).

We saw crashes because of this behavior. Our setup had a third party kernel
forwarding module which uses GRO (napi_gro_receive). Doing iperf3 with small
packets and doing tcpdump on the receiving tap interface results in the problem.
With DEBUG_VM enabled, put page crashes. Without DEBUG_VM, bad page
state results.

Your thoughts (please CC me, since I am not part of this list).

Thanks,
Anand

^ permalink raw reply

* Re: FEC on i.MX 7 transmit queue timeout
From: Andy Duan @ 2017-05-05  2:44 UTC (permalink / raw)
  To: Stefan Agner
  Cc: festevam@gmail.com, netdev@vger.kernel.org,
	netdev-owner@vger.kernel.org
In-Reply-To: <110a7a48649cfcbbee46340c230e9008@agner.ch>

On 2017年05月05日 10:09, Stefan Agner wrote:
> On 2017-05-04 19:03, Andy Duan wrote:
>> On 2017年05月05日 05:36, Stefan Agner wrote:
>>> On 2017-05-03 20:08, Andy Duan wrote:
>>>> From: Stefan Agner <stefan@agner.ch> Sent: Thursday, May 04, 2017 9:22 AM
>>>>> To: Andy Duan <fugang.duan@nxp.com>
>>>>> Cc: fugang.duan@freescale.com; festevam@gmail.com;
>>>>> netdev@vger.kernel.org; netdev-owner@vger.kernel.org
>>>>> Subject: Re: FEC on i.MX 7 transmit queue timeout
>>>>>
>>>>> Hi Andy,
>>>>>
>>>>> On 2017-04-20 19:48, Andy Duan wrote:
>>>>>> On 2017年04月20日 07:15, Stefan Agner wrote:
>>>>>>> I tested again with imx6sx-fec compatible string. I could reproduce
>>>>>>> it on a Colibri with i.MX 7Dual. But not always: It really depends
>>>>>>> whether queue 2 is counting up or not. Just after boot, I check
>>>>>>> /proc/interrupts twice, if queue 2 is counting it will happen!
>>>>>>>
>>>>>>> But if only queue 0 is mostly in use, then it seems to work just fine.
>>>>>> If your case is only running best effort like tcp/udp, you can re-set
>>>>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts file.
>>>>>> Other two queues are for AVB audio/video queues, they have high
>>>>>> priority than queue 0. If running iperf tcp test on the three queues,
>>>>>> then the tcp segment may be out-of-order that cause net watchdog
>>>>> timeout.
>>>>>>> I also tried i.MX 7Dual SabreSD here, and the same thing. I had to
>>>>>>> reboot 3 times, then queue 2 was counting:
>>>>>>>     57:          8     GIC-0 150 Level     30be0000.ethernet
>>>>>>>     58:      20137     GIC-0 151 Level     30be0000.ethernet
>>>>>>>     59:       9269     GIC-0 152 Level     30be0000.ethernet
>>>>>>>
>>>>>>> It took me about 40 minutes on Sabre until it happened, and I had to
>>>>>>> force it using iperf, but then I got the ring dumps:
>>>>>> My board had ran more than 47 hours with nfs rootfs in 4.11.0-rc6, but
>>>>>> not running iperf.
>>>>>> I am testing with iperf.
>>>>> Any update on this issue?
>>>>>
>>>>> When using iperf (server) on the board with Linux 4.11 the issue appears
>>>>> within a few iperf iterations on a Sabre (TO 1.2, Board Rev C, if that matters)...
>>>>>
>>>> I don’t know whether you received my last mail. (maybe failed due to I
>>>> received some rejection mails)
>>> I think I did not... The last email I received was Fri, 21 Apr 2017
>>> 02:48:23 UTC.
>>>
>>>
>>>> If your case is only running best effort like tcp/udp, you can re-set
>>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts
>>>> file.
>>> I did test that, and it seems to work fine with those properties set to
>>> 1.
>> So it can fix your problem after long time test?
> Yes, seems to work fine after more than 2 hours.
>
>>>> Other two queues are for AVB audio/video queues, they have high
>>>> priority than queue 0. If running iperf tcp test on the three queues,
>>>> then the tcp segment may be out-of-order that cause net watchdog
>>>> timeout.
>>> Okay. A single event would be understandable, but it seems to enter some
>>> kind of loop after that (continuously printing "fec 30be0000.ethernet
>>> eth0: TX ring dump ...").
>>>
>>> In a quick test I commented out the fec_dump call, with that it seems to
>>> print only once and continues working afterwards (although, speed starts
>>> to decrease, so something is not good at that point).
>> The test base on above change ? One queue still bring watchdog timeout ?
> No, sorry for the confusion: This was without the fix above. So use
> multiple queues, and disable fec_dump... I was just wondering, because
> disabling the multiple queues seems to me somewhat a workaround for
> now... :-)
>
No, it is not workaround. As i said, quque1 and queue2 are for AVB paths 
have higher priority in transmition.
It bring the trouble for your case. I will submit one patch to fix it 
that best effort go queue0, AVB streaming go
quque1 and queue2.

>
>>>> In fsl kernel tree, there have one patch that only select the queue0
>>>> for best effort like tcp/udp. Pls test again in your board, if no
>>>> problem I will upstream the patch.
>>> That sounds like a reasonable fix.
>>>
>>> IP, no matter whether TCP/UDP, is the most common use case, so IMHO this
>>> should "just work" by default.
>>>
>>> --
>>> Stefan

^ permalink raw reply

* RE: [net-next] net: remove duplicate add_device_randomness() call
From: 张胜举 @ 2017-05-05  2:40 UTC (permalink / raw)
  To: 'David Miller'; +Cc: netdev, edumazet
In-Reply-To: <20170504.111302.1520073128399942830.davem@davemloft.net>

> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Thursday, May 04, 2017 11:13 PM
> To: zhangshengju@cmss.chinamobile.com
> Cc: netdev@vger.kernel.org; edumazet@google.com
> Subject: Re: [net-next] net: remove duplicate add_device_randomness() call
> 
> From: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
> Date: Thu,  4 May 2017 11:40:42 +0800
> 
> > Since register_netdevice() already call add_device_randomness() and
> > dev_set_mac_address() will call it after mac address change.
> > It's not necessary to call at device UP.
> >
> > Signed-off-by: Zhang Shengju <zhangshengju@cmss.chinamobile.com>
> 
> The net-next tree is closed, please resubmit this when the net-next tree
> opens back up.

Okay, thanks David.

BRs,
ZSJ

^ permalink raw reply

* [PATCH v2 net] tcp: randomize timestamps on syncookies
From: Eric Dumazet @ 2017-05-05  2:22 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev, Yuchung Cheng
In-Reply-To: <1493949548.7796.32.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <edumazet@google.com>

Whole point of randomization was to hide server uptime, but an attacker
can simply start a syn flood and TCP generates 'old style' timestamps,
directly revealing server jiffies value.

Also, TSval sent by the server to a particular remote address vary
depending on syncookies being sent or not, potentially triggering PAWS
drops for innocent clients.

Lets implement proper randomization, including for SYNcookies.

Also we do not need to export sysctl_tcp_timestamps, since it is not
used from a module.

In v2, I added Florian feedback and contribution, adding tsoff to
tcp_get_cookie_sock()

Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/secure_seq.h |   10 ++++++----
 include/net/tcp.h        |    5 +++--
 net/core/secure_seq.c    |   31 +++++++++++++++++++------------
 net/ipv4/syncookies.c    |   12 ++++++++++--
 net/ipv4/tcp_input.c     |    8 +++-----
 net/ipv4/tcp_ipv4.c      |   31 +++++++++++++++++++------------
 net/ipv6/syncookies.c    |   10 +++++++++-
 net/ipv6/tcp_ipv6.c      |   32 +++++++++++++++++++-------------
 8 files changed, 88 insertions(+), 51 deletions(-)

diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h
index fe236b3429f0d8caeb1adc367b5b4a20591c848b..b94006f6fbdde0d78fe33b9c2d86159e291c30cf 100644
--- a/include/net/secure_seq.h
+++ b/include/net/secure_seq.h
@@ -6,10 +6,12 @@
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
 			       __be16 dport);
-u32 secure_tcp_seq_and_tsoff(__be32 saddr, __be32 daddr,
-			     __be16 sport, __be16 dport, u32 *tsoff);
-u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
-			       __be16 sport, __be16 dport, u32 *tsoff);
+u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
+		   __be16 sport, __be16 dport);
+u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr);
+u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr,
+		     __be16 sport, __be16 dport);
+u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr);
 u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr,
 				__be16 sport, __be16 dport);
 u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 270e5cc43c99e7030e95af218095cf9f283950bc..8c0e5a901d6424fbd01233cd3adfdce52076f7a9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -470,7 +470,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
 /* From syncookies.c */
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 				 struct request_sock *req,
-				 struct dst_entry *dst);
+				 struct dst_entry *dst, u32 tsoff);
 int __cookie_v4_check(const struct iphdr *iph, const struct tcphdr *th,
 		      u32 cookie);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb);
@@ -1822,7 +1822,8 @@ struct tcp_request_sock_ops {
 #endif
 	struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
 				       const struct request_sock *req);
-	__u32 (*init_seq_tsoff)(const struct sk_buff *skb, u32 *tsoff);
+	u32 (*init_seq)(const struct sk_buff *skb);
+	u32 (*init_ts_off)(const struct sk_buff *skb);
 	int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
 			   struct flowi *fl, struct request_sock *req,
 			   struct tcp_fastopen_cookie *foc,
diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 6bd2f8fb0476baabf507557fc0d06b6787511c70..ae35cce3a40d70387bee815798933aa43a0e6d84 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -24,9 +24,13 @@ static siphash_key_t ts_secret __read_mostly;
 
 static __always_inline void net_secret_init(void)
 {
-	net_get_random_once(&ts_secret, sizeof(ts_secret));
 	net_get_random_once(&net_secret, sizeof(net_secret));
 }
+
+static __always_inline void ts_secret_init(void)
+{
+	net_get_random_once(&ts_secret, sizeof(ts_secret));
+}
 #endif
 
 #ifdef CONFIG_INET
@@ -47,7 +51,7 @@ static u32 seq_scale(u32 seq)
 #endif
 
 #if IS_ENABLED(CONFIG_IPV6)
-static u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
+u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
 {
 	const struct {
 		struct in6_addr saddr;
@@ -60,12 +64,14 @@ static u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
 	if (sysctl_tcp_timestamps != 1)
 		return 0;
 
+	ts_secret_init();
 	return siphash(&combined, offsetofend(typeof(combined), daddr),
 		       &ts_secret);
 }
+EXPORT_SYMBOL(secure_tcpv6_ts_off);
 
-u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
-			       __be16 sport, __be16 dport, u32 *tsoff)
+u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr,
+		     __be16 sport, __be16 dport)
 {
 	const struct {
 		struct in6_addr saddr;
@@ -78,14 +84,14 @@ u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
 		.sport = sport,
 		.dport = dport
 	};
-	u64 hash;
+	u32 hash;
+
 	net_secret_init();
 	hash = siphash(&combined, offsetofend(typeof(combined), dport),
 		       &net_secret);
-	*tsoff = secure_tcpv6_ts_off(saddr, daddr);
 	return seq_scale(hash);
 }
-EXPORT_SYMBOL(secure_tcpv6_seq_and_tsoff);
+EXPORT_SYMBOL(secure_tcpv6_seq);
 
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
 			       __be16 dport)
@@ -107,11 +113,12 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 #endif
 
 #ifdef CONFIG_INET
-static u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr)
+u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr)
 {
 	if (sysctl_tcp_timestamps != 1)
 		return 0;
 
+	ts_secret_init();
 	return siphash_2u32((__force u32)saddr, (__force u32)daddr,
 			    &ts_secret);
 }
@@ -121,15 +128,15 @@ static u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr)
  * it would be easy enough to have the former function use siphash_4u32, passing
  * the arguments as separate u32.
  */
-u32 secure_tcp_seq_and_tsoff(__be32 saddr, __be32 daddr,
-			     __be16 sport, __be16 dport, u32 *tsoff)
+u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
+		   __be16 sport, __be16 dport)
 {
-	u64 hash;
+	u32 hash;
+
 	net_secret_init();
 	hash = siphash_3u32((__force u32)saddr, (__force u32)daddr,
 			    (__force u32)sport << 16 | (__force u32)dport,
 			    &net_secret);
-	*tsoff = secure_tcp_ts_off(saddr, daddr);
 	return seq_scale(hash);
 }
 
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 496b97e17aaf7ed2cf41cef303cb0696927f66ac..0257d965f11119acf8c55888d6e672d171ef5f08 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -16,6 +16,7 @@
 #include <linux/siphash.h>
 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <net/secure_seq.h>
 #include <net/tcp.h>
 #include <net/route.h>
 
@@ -203,7 +204,7 @@ EXPORT_SYMBOL_GPL(__cookie_v4_check);
 
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 				 struct request_sock *req,
-				 struct dst_entry *dst)
+				 struct dst_entry *dst, u32 tsoff)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sock *child;
@@ -213,6 +214,7 @@ struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 						 NULL, &own_req);
 	if (child) {
 		atomic_set(&req->rsk_refcnt, 1);
+		tcp_sk(child)->tsoffset = tsoff;
 		sock_rps_save_rxhash(child, skb);
 		inet_csk_reqsk_queue_add(sk, req, child);
 	} else {
@@ -292,6 +294,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	struct rtable *rt;
 	__u8 rcv_wscale;
 	struct flowi4 fl4;
+	u32 tsoff = 0;
 
 	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
 		goto out;
@@ -311,6 +314,11 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(skb, &tcp_opt, 0, NULL);
 
+	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
+		tsoff = secure_tcp_ts_off(ip_hdr(skb)->daddr, ip_hdr(skb)->saddr);
+		tcp_opt.rcv_tsecr -= tsoff;
+	}
+
 	if (!cookie_timestamp_decode(&tcp_opt))
 		goto out;
 
@@ -381,7 +389,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	ireq->rcv_wscale  = rcv_wscale;
 	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
 
-	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
+	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst, tsoff);
 	/* ip_queue_xmit() depends on our flow being setup
 	 * Normal sockets get it right from inet_csk_route_child_sock()
 	 */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9739962bfb3fd2d39cb13f643def223f4f17fcb6..5a3ad09e2786fb41ad12681d09938c645b69866d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -85,7 +85,6 @@ int sysctl_tcp_dsack __read_mostly = 1;
 int sysctl_tcp_app_win __read_mostly = 31;
 int sysctl_tcp_adv_win_scale __read_mostly = 1;
 EXPORT_SYMBOL(sysctl_tcp_adv_win_scale);
-EXPORT_SYMBOL(sysctl_tcp_timestamps);
 
 /* rfc5961 challenge ack rate limiting */
 int sysctl_tcp_challenge_ack_limit = 1000;
@@ -6347,8 +6346,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	if (security_inet_conn_request(sk, skb, req))
 		goto drop_and_free;
 
-	if (isn && tmp_opt.tstamp_ok)
-		af_ops->init_seq_tsoff(skb, &tcp_rsk(req)->ts_off);
+	if (tmp_opt.tstamp_ok)
+		tcp_rsk(req)->ts_off = af_ops->init_ts_off(skb);
 
 	if (!want_cookie && !isn) {
 		/* Kill the following clause, if you dislike this way. */
@@ -6368,7 +6367,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 			goto drop_and_release;
 		}
 
-		isn = af_ops->init_seq_tsoff(skb, &tcp_rsk(req)->ts_off);
+		isn = af_ops->init_seq(skb);
 	}
 	if (!dst) {
 		dst = af_ops->route_req(sk, &fl, req);
@@ -6380,7 +6379,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 
 	if (want_cookie) {
 		isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
-		tcp_rsk(req)->ts_off = 0;
 		req->cookie_ts = tmp_opt.tstamp_ok;
 		if (!tmp_opt.tstamp_ok)
 			inet_rsk(req)->ecn_ok = 0;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cbbafe546c0f5c5f43531eaf24f5b460264785c6..e7d4e47f83ae2074b6ca602bc50a39d5defb85c5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -94,12 +94,18 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
 struct inet_hashinfo tcp_hashinfo;
 EXPORT_SYMBOL(tcp_hashinfo);
 
-static u32 tcp_v4_init_seq_and_tsoff(const struct sk_buff *skb, u32 *tsoff)
+static u32 tcp_v4_init_seq(const struct sk_buff *skb)
 {
-	return secure_tcp_seq_and_tsoff(ip_hdr(skb)->daddr,
-					ip_hdr(skb)->saddr,
-					tcp_hdr(skb)->dest,
-					tcp_hdr(skb)->source, tsoff);
+	return secure_tcp_seq(ip_hdr(skb)->daddr,
+			      ip_hdr(skb)->saddr,
+			      tcp_hdr(skb)->dest,
+			      tcp_hdr(skb)->source);
+}
+
+static u32 tcp_v4_init_ts_off(const struct sk_buff *skb)
+{
+	return secure_tcp_ts_off(ip_hdr(skb)->daddr,
+				 ip_hdr(skb)->saddr);
 }
 
 int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
@@ -232,13 +238,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	rt = NULL;
 
 	if (likely(!tp->repair)) {
-		seq = secure_tcp_seq_and_tsoff(inet->inet_saddr,
-					       inet->inet_daddr,
-					       inet->inet_sport,
-					       usin->sin_port,
-					       &tp->tsoffset);
 		if (!tp->write_seq)
-			tp->write_seq = seq;
+			tp->write_seq = secure_tcp_seq(inet->inet_saddr,
+						       inet->inet_daddr,
+						       inet->inet_sport,
+						       usin->sin_port);
+		tp->tsoffset = secure_tcp_ts_off(inet->inet_saddr,
+						 inet->inet_daddr);
 	}
 
 	inet->inet_id = tp->write_seq ^ jiffies;
@@ -1239,7 +1245,8 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
 	.cookie_init_seq =	cookie_v4_init_sequence,
 #endif
 	.route_req	=	tcp_v4_route_req,
-	.init_seq_tsoff	=	tcp_v4_init_seq_and_tsoff,
+	.init_seq	=	tcp_v4_init_seq,
+	.init_ts_off	=	tcp_v4_init_ts_off,
 	.send_synack	=	tcp_v4_send_synack,
 };
 
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 895ff650db43017ef39344679771d94ad6eaaf00..5abc3692b9011b140816dc4ce6223e79e5defddb 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -18,6 +18,7 @@
 #include <linux/random.h>
 #include <linux/siphash.h>
 #include <linux/kernel.h>
+#include <net/secure_seq.h>
 #include <net/ipv6.h>
 #include <net/tcp.h>
 
@@ -143,6 +144,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	int mss;
 	struct dst_entry *dst;
 	__u8 rcv_wscale;
+	u32 tsoff = 0;
 
 	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
 		goto out;
@@ -162,6 +164,12 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(skb, &tcp_opt, 0, NULL);
 
+	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
+		tsoff = secure_tcpv6_ts_off(ipv6_hdr(skb)->daddr.s6_addr32,
+					    ipv6_hdr(skb)->saddr.s6_addr32);
+		tcp_opt.rcv_tsecr -= tsoff;
+	}
+
 	if (!cookie_timestamp_decode(&tcp_opt))
 		goto out;
 
@@ -242,7 +250,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	ireq->rcv_wscale = rcv_wscale;
 	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), dst);
 
-	ret = tcp_get_cookie_sock(sk, skb, req, dst);
+	ret = tcp_get_cookie_sock(sk, skb, req, dst, tsoff);
 out:
 	return ret;
 out_free:
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8e42e8f54b705ed8780890c7434feeff1055599a..aeb9497b5bb754f2277dec4a4dec02bf25bdbbe5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -101,12 +101,18 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
-static u32 tcp_v6_init_seq_and_tsoff(const struct sk_buff *skb, u32 *tsoff)
+static u32 tcp_v6_init_seq(const struct sk_buff *skb)
 {
-	return secure_tcpv6_seq_and_tsoff(ipv6_hdr(skb)->daddr.s6_addr32,
-					  ipv6_hdr(skb)->saddr.s6_addr32,
-					  tcp_hdr(skb)->dest,
-					  tcp_hdr(skb)->source, tsoff);
+	return secure_tcpv6_seq(ipv6_hdr(skb)->daddr.s6_addr32,
+				ipv6_hdr(skb)->saddr.s6_addr32,
+				tcp_hdr(skb)->dest,
+				tcp_hdr(skb)->source);
+}
+
+static u32 tcp_v6_init_ts_off(const struct sk_buff *skb)
+{
+	return secure_tcpv6_ts_off(ipv6_hdr(skb)->daddr.s6_addr32,
+				   ipv6_hdr(skb)->saddr.s6_addr32);
 }
 
 static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
@@ -122,7 +128,6 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	struct flowi6 fl6;
 	struct dst_entry *dst;
 	int addr_type;
-	u32 seq;
 	int err;
 	struct inet_timewait_death_row *tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
 
@@ -282,13 +287,13 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	sk_set_txhash(sk);
 
 	if (likely(!tp->repair)) {
-		seq = secure_tcpv6_seq_and_tsoff(np->saddr.s6_addr32,
-						 sk->sk_v6_daddr.s6_addr32,
-						 inet->inet_sport,
-						 inet->inet_dport,
-						 &tp->tsoffset);
 		if (!tp->write_seq)
-			tp->write_seq = seq;
+			tp->write_seq = secure_tcpv6_seq(np->saddr.s6_addr32,
+							 sk->sk_v6_daddr.s6_addr32,
+							 inet->inet_sport,
+							 inet->inet_dport);
+		tp->tsoffset = secure_tcpv6_ts_off(np->saddr.s6_addr32,
+						   sk->sk_v6_daddr.s6_addr32);
 	}
 
 	if (tcp_fastopen_defer_connect(sk, &err))
@@ -749,7 +754,8 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
 	.cookie_init_seq =	cookie_v6_init_sequence,
 #endif
 	.route_req	=	tcp_v6_route_req,
-	.init_seq_tsoff	=	tcp_v6_init_seq_and_tsoff,
+	.init_seq	=	tcp_v6_init_seq,
+	.init_ts_off	=	tcp_v6_init_ts_off,
 	.send_synack	=	tcp_v6_send_synack,
 };
 

^ permalink raw reply related

* Re: FEC on i.MX 7 transmit queue timeout
From: Stefan Agner @ 2017-05-05  2:09 UTC (permalink / raw)
  To: Andy Duan; +Cc: festevam, netdev, netdev-owner
In-Reply-To: <e739d361-c20f-0de1-cdd1-c9aa13a0d11a@nxp.com>

On 2017-05-04 19:03, Andy Duan wrote:
> On 2017年05月05日 05:36, Stefan Agner wrote:
>> On 2017-05-03 20:08, Andy Duan wrote:
>>> From: Stefan Agner <stefan@agner.ch> Sent: Thursday, May 04, 2017 9:22 AM
>>>> To: Andy Duan <fugang.duan@nxp.com>
>>>> Cc: fugang.duan@freescale.com; festevam@gmail.com;
>>>> netdev@vger.kernel.org; netdev-owner@vger.kernel.org
>>>> Subject: Re: FEC on i.MX 7 transmit queue timeout
>>>>
>>>> Hi Andy,
>>>>
>>>> On 2017-04-20 19:48, Andy Duan wrote:
>>>>> On 2017年04月20日 07:15, Stefan Agner wrote:
>>>>>> I tested again with imx6sx-fec compatible string. I could reproduce
>>>>>> it on a Colibri with i.MX 7Dual. But not always: It really depends
>>>>>> whether queue 2 is counting up or not. Just after boot, I check
>>>>>> /proc/interrupts twice, if queue 2 is counting it will happen!
>>>>>>
>>>>>> But if only queue 0 is mostly in use, then it seems to work just fine.
>>>>> If your case is only running best effort like tcp/udp, you can re-set
>>>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts file.
>>>>> Other two queues are for AVB audio/video queues, they have high
>>>>> priority than queue 0. If running iperf tcp test on the three queues,
>>>>> then the tcp segment may be out-of-order that cause net watchdog
>>>> timeout.
>>>>>> I also tried i.MX 7Dual SabreSD here, and the same thing. I had to
>>>>>> reboot 3 times, then queue 2 was counting:
>>>>>>    57:          8     GIC-0 150 Level     30be0000.ethernet
>>>>>>    58:      20137     GIC-0 151 Level     30be0000.ethernet
>>>>>>    59:       9269     GIC-0 152 Level     30be0000.ethernet
>>>>>>
>>>>>> It took me about 40 minutes on Sabre until it happened, and I had to
>>>>>> force it using iperf, but then I got the ring dumps:
>>>>> My board had ran more than 47 hours with nfs rootfs in 4.11.0-rc6, but
>>>>> not running iperf.
>>>>> I am testing with iperf.
>>>> Any update on this issue?
>>>>
>>>> When using iperf (server) on the board with Linux 4.11 the issue appears
>>>> within a few iperf iterations on a Sabre (TO 1.2, Board Rev C, if that matters)...
>>>>
>>> I don’t know whether you received my last mail. (maybe failed due to I
>>> received some rejection mails)
>> I think I did not... The last email I received was Fri, 21 Apr 2017
>> 02:48:23 UTC.
>>
>>
>>> If your case is only running best effort like tcp/udp, you can re-set
>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts
>>> file.
>> I did test that, and it seems to work fine with those properties set to
>> 1.
> So it can fix your problem after long time test?

Yes, seems to work fine after more than 2 hours.

>>> Other two queues are for AVB audio/video queues, they have high
>>> priority than queue 0. If running iperf tcp test on the three queues,
>>> then the tcp segment may be out-of-order that cause net watchdog
>>> timeout.
>> Okay. A single event would be understandable, but it seems to enter some
>> kind of loop after that (continuously printing "fec 30be0000.ethernet
>> eth0: TX ring dump ...").
>>
>> In a quick test I commented out the fec_dump call, with that it seems to
>> print only once and continues working afterwards (although, speed starts
>> to decrease, so something is not good at that point).
> The test base on above change ? One queue still bring watchdog timeout ?

No, sorry for the confusion: This was without the fix above. So use
multiple queues, and disable fec_dump... I was just wondering, because
disabling the multiple queues seems to me somewhat a workaround for
now... :-)

^ permalink raw reply

* Re: FEC on i.MX 7 transmit queue timeout
From: Andy Duan @ 2017-05-05  2:03 UTC (permalink / raw)
  To: Stefan Agner
  Cc: festevam@gmail.com, netdev@vger.kernel.org,
	netdev-owner@vger.kernel.org
In-Reply-To: <e044b850a90e7112028cc12977d15cf2@agner.ch>

On 2017年05月05日 05:36, Stefan Agner wrote:
> On 2017-05-03 20:08, Andy Duan wrote:
>> From: Stefan Agner <stefan@agner.ch> Sent: Thursday, May 04, 2017 9:22 AM
>>> To: Andy Duan <fugang.duan@nxp.com>
>>> Cc: fugang.duan@freescale.com; festevam@gmail.com;
>>> netdev@vger.kernel.org; netdev-owner@vger.kernel.org
>>> Subject: Re: FEC on i.MX 7 transmit queue timeout
>>>
>>> Hi Andy,
>>>
>>> On 2017-04-20 19:48, Andy Duan wrote:
>>>> On 2017年04月20日 07:15, Stefan Agner wrote:
>>>>> I tested again with imx6sx-fec compatible string. I could reproduce
>>>>> it on a Colibri with i.MX 7Dual. But not always: It really depends
>>>>> whether queue 2 is counting up or not. Just after boot, I check
>>>>> /proc/interrupts twice, if queue 2 is counting it will happen!
>>>>>
>>>>> But if only queue 0 is mostly in use, then it seems to work just fine.
>>>> If your case is only running best effort like tcp/udp, you can re-set
>>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts file.
>>>> Other two queues are for AVB audio/video queues, they have high
>>>> priority than queue 0. If running iperf tcp test on the three queues,
>>>> then the tcp segment may be out-of-order that cause net watchdog
>>> timeout.
>>>>> I also tried i.MX 7Dual SabreSD here, and the same thing. I had to
>>>>> reboot 3 times, then queue 2 was counting:
>>>>>    57:          8     GIC-0 150 Level     30be0000.ethernet
>>>>>    58:      20137     GIC-0 151 Level     30be0000.ethernet
>>>>>    59:       9269     GIC-0 152 Level     30be0000.ethernet
>>>>>
>>>>> It took me about 40 minutes on Sabre until it happened, and I had to
>>>>> force it using iperf, but then I got the ring dumps:
>>>> My board had ran more than 47 hours with nfs rootfs in 4.11.0-rc6, but
>>>> not running iperf.
>>>> I am testing with iperf.
>>> Any update on this issue?
>>>
>>> When using iperf (server) on the board with Linux 4.11 the issue appears
>>> within a few iperf iterations on a Sabre (TO 1.2, Board Rev C, if that matters)...
>>>
>> I don’t know whether you received my last mail. (maybe failed due to I
>> received some rejection mails)
> I think I did not... The last email I received was Fri, 21 Apr 2017
> 02:48:23 UTC.
>
>   
>> If your case is only running best effort like tcp/udp, you can re-set
>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts
>> file.
> I did test that, and it seems to work fine with those properties set to
> 1.
So it can fix your problem after long time test?
>> Other two queues are for AVB audio/video queues, they have high
>> priority than queue 0. If running iperf tcp test on the three queues,
>> then the tcp segment may be out-of-order that cause net watchdog
>> timeout.
> Okay. A single event would be understandable, but it seems to enter some
> kind of loop after that (continuously printing "fec 30be0000.ethernet
> eth0: TX ring dump ...").
>
> In a quick test I commented out the fec_dump call, with that it seems to
> print only once and continues working afterwards (although, speed starts
> to decrease, so something is not good at that point).
The test base on above change ? One queue still bring watchdog timeout ?
>> In fsl kernel tree, there have one patch that only select the queue0
>> for best effort like tcp/udp. Pls test again in your board, if no
>> problem I will upstream the patch.
> That sounds like a reasonable fix.
>
> IP, no matter whether TCP/UDP, is the most common use case, so IMHO this
> should "just work" by default.
>
> --
> Stefan

^ permalink raw reply

* Re: [PATCH net] tcp: randomize timestamps on syncookies
From: Eric Dumazet @ 2017-05-05  1:59 UTC (permalink / raw)
  To: Florian Westphal; +Cc: David Miller, netdev, Yuchung Cheng
In-Reply-To: <20170505003208.GI13320@breakpoint.cc>

On Fri, 2017-05-05 at 02:32 +0200, Florian Westphal wrote:
> Florian Westphal <fw@strlen.de> wrote:
> [..]
> > This breaks syncookies w. timestamps; cookie_timestamp_decode() lacks a tsoff
> > for readjustment.
> > 
> > We also need to pass the (recomputed) tsoff to tcp_get_cookie_sock().
> 
> This small delta makes things work for me:
> 

Hi Florian, thanks for looking at this.

One comment :

> diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
> index 895ff650db43..eb96825d6340 100644
> --- a/net/ipv6/syncookies.c
> +++ b/net/ipv6/syncookies.c
> @@ -18,6 +18,7 @@
>  #include <linux/random.h>
>  #include <linux/siphash.h>
>  #include <linux/kernel.h>
> +#include <net/secure_seq.h>
>  #include <net/ipv6.h>
>  #include <net/tcp.h>
>  
> @@ -143,6 +144,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
>  	int mss;
>  	struct dst_entry *dst;
>  	__u8 rcv_wscale;
> +	u32 tsoff;
>  
>  	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
>  		goto out;
> @@ -162,6 +164,12 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
>  	memset(&tcp_opt, 0, sizeof(tcp_opt));
>  	tcp_parse_options(skb, &tcp_opt, 0, NULL);
>  
> +	tsoff = 0;
> +	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
> +		tsoff = secure_tcpv6_ts_off(&ip_hdr(skb)->daddr, &ip_hdr(skb)->saddr);

I will use the ipv6_hdr(skb)->daddr.s6_addr32 and
ipv6_hdr(skb)->saddr.s6_addr32 if you agree ;)

> +		tcp_opt.rcv_tsecr -= tsoff;
> +	}
> +

Thanks !

^ permalink raw reply

* Re: [PATCH net] tcp: randomize timestamps on syncookies
From: Florian Westphal @ 2017-05-05  0:32 UTC (permalink / raw)
  To: Florian Westphal; +Cc: Eric Dumazet, David Miller, netdev, Yuchung Cheng
In-Reply-To: <20170505002456.GH13320@breakpoint.cc>

Florian Westphal <fw@strlen.de> wrote:
[..]
> This breaks syncookies w. timestamps; cookie_timestamp_decode() lacks a tsoff
> for readjustment.
> 
> We also need to pass the (recomputed) tsoff to tcp_get_cookie_sock().

This small delta makes things work for me:

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1be353fc5cb1..8c0e5a901d64 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -470,7 +470,7 @@ void inet_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb);
 /* From syncookies.c */
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 				 struct request_sock *req,
-				 struct dst_entry *dst);
+				 struct dst_entry *dst, u32 tsoff);
 int __cookie_v4_check(const struct iphdr *iph, const struct tcphdr *th,
 		      u32 cookie);
 struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb);
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 496b97e17aaf..39bfdc94bf44 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -16,6 +16,7 @@
 #include <linux/siphash.h>
 #include <linux/kernel.h>
 #include <linux/export.h>
+#include <net/secure_seq.h>
 #include <net/tcp.h>
 #include <net/route.h>
 
@@ -203,7 +204,7 @@ EXPORT_SYMBOL_GPL(__cookie_v4_check);
 
 struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 				 struct request_sock *req,
-				 struct dst_entry *dst)
+				 struct dst_entry *dst, u32 tsoff)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct sock *child;
@@ -213,6 +214,7 @@ struct sock *tcp_get_cookie_sock(struct sock *sk, struct sk_buff *skb,
 						 NULL, &own_req);
 	if (child) {
 		atomic_set(&req->rsk_refcnt, 1);
+		tcp_sk(child)->tsoffset = tsoff;
 		sock_rps_save_rxhash(child, skb);
 		inet_csk_reqsk_queue_add(sk, req, child);
 	} else {
@@ -292,6 +294,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	struct rtable *rt;
 	__u8 rcv_wscale;
 	struct flowi4 fl4;
+	u32 tsoff;
 
 	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
 		goto out;
@@ -311,6 +314,12 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(skb, &tcp_opt, 0, NULL);
 
+	tsoff = 0;
+	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
+		tsoff = secure_tcp_ts_off(ip_hdr(skb)->daddr, ip_hdr(skb)->saddr);
+		tcp_opt.rcv_tsecr -= tsoff;
+	}
+
 	if (!cookie_timestamp_decode(&tcp_opt))
 		goto out;
 
@@ -381,7 +390,7 @@ struct sock *cookie_v4_check(struct sock *sk, struct sk_buff *skb)
 	ireq->rcv_wscale  = rcv_wscale;
 	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), &rt->dst);
 
-	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst);
+	ret = tcp_get_cookie_sock(sk, skb, req, &rt->dst, tsoff);
 	/* ip_queue_xmit() depends on our flow being setup
 	 * Normal sockets get it right from inet_csk_route_child_sock()
 	 */
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 895ff650db43..eb96825d6340 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -18,6 +18,7 @@
 #include <linux/random.h>
 #include <linux/siphash.h>
 #include <linux/kernel.h>
+#include <net/secure_seq.h>
 #include <net/ipv6.h>
 #include <net/tcp.h>
 
@@ -143,6 +144,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	int mss;
 	struct dst_entry *dst;
 	__u8 rcv_wscale;
+	u32 tsoff;
 
 	if (!sock_net(sk)->ipv4.sysctl_tcp_syncookies || !th->ack || th->rst)
 		goto out;
@@ -162,6 +164,12 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	memset(&tcp_opt, 0, sizeof(tcp_opt));
 	tcp_parse_options(skb, &tcp_opt, 0, NULL);
 
+	tsoff = 0;
+	if (tcp_opt.saw_tstamp && tcp_opt.rcv_tsecr) {
+		tsoff = secure_tcpv6_ts_off(&ip_hdr(skb)->daddr, &ip_hdr(skb)->saddr);
+		tcp_opt.rcv_tsecr -= tsoff;
+	}
+
 	if (!cookie_timestamp_decode(&tcp_opt))
 		goto out;
 
@@ -242,7 +250,7 @@ struct sock *cookie_v6_check(struct sock *sk, struct sk_buff *skb)
 	ireq->rcv_wscale = rcv_wscale;
 	ireq->ecn_ok = cookie_ecn_ok(&tcp_opt, sock_net(sk), dst);
 
-	ret = tcp_get_cookie_sock(sk, skb, req, dst);
+	ret = tcp_get_cookie_sock(sk, skb, req, dst, tsoff);
 out:
 	return ret;
 out_free:

^ permalink raw reply related

* Re: [PATCH iproute2] vxlan: Add support for modifying vxlan device attributes
From: Girish Moodalbail @ 2017-05-05  0:26 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20170504170743.3b411f21@xeon-e3>

On 5/4/17 5:07 PM, Stephen Hemminger wrote:
> On Thu,  4 May 2017 14:46:34 -0700
> Girish Moodalbail <girish.moodalbail@oracle.com> wrote:
>
>> Ability to change vxlan device attributes was added to kernel through
>> commit 8bcdc4f3a20b ("vxlan: add changelink support"), however one
>> cannot do the same through ip(8) command.  Changing the allowed vxlan
>> device attributes using 'ip link set dev <vxlan_name> type vxlan
>> <allowed_attributes>' currently fails with 'operation not supported'
>> error.  This failure is due to the incorrect rtnetlink message
>> construction for the 'ip link set' operation.
>>
>> The vxlan_parse_opt() callback function is called for parsing options
>> for both 'ip link add' and 'ip link set'. For the 'add' case, we pass
>> down default values for those attributes that were not provided as CLI
>> options. However, for the 'set' case we should be only passing down the
>> explicitly provided attributes and not any other (default) attributes.
>>
>> Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
>> ---
>
> All these foo_set variables are ugly. This looks almost like machine
> generated code. It doesn't read well.

I thought about it, however I wasn't sure if refactoring that whole routine will 
be well received so I decided to follow the current model that already existed 
in iplink_vxlan.c. I will re-submit a patch cleaning up that whole routine.

thanks,
~Girish

^ permalink raw reply

* Re: [PATCH net] tcp: randomize timestamps on syncookies
From: Florian Westphal @ 2017-05-05  0:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Florian Westphal, Yuchung Cheng
In-Reply-To: <1493935361.7796.29.camel@edumazet-glaptop3.roam.corp.google.com>

Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
> 
> Whole point of randomization was to hide server uptime, but an attacker
> can simply start a syn flood and TCP generates 'old style' timestamps,
> directly revealing server jiffies value.
> 
> Also, TSval sent by the server to a particular remote address vary depending
> on syncookies being sent or not, potentially triggering PAWS drops for
> innocent clients.
> 
> Lets implement proper randomization, including for SYNcookies.
> 
> Also we do not need to export sysctl_tcp_timestamps, it is not used from
> a module.

I like the direction, but this is incomplete.

>  	if (want_cookie) {
>  		isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
> -		tcp_rsk(req)->ts_off = 0;

This breaks syncookies w. timestamps; cookie_timestamp_decode() lacks a tsoff
for readjustment.

We also need to pass the (recomputed) tsoff to tcp_get_cookie_sock().

Other than this, this patch looks good to me, thanks!

^ permalink raw reply

* Re: [PATCH iproute2] vxlan: Add support for modifying vxlan device attributes
From: Stephen Hemminger @ 2017-05-05  0:07 UTC (permalink / raw)
  To: Girish Moodalbail; +Cc: netdev
In-Reply-To: <1493934394-26317-1-git-send-email-girish.moodalbail@oracle.com>

On Thu,  4 May 2017 14:46:34 -0700
Girish Moodalbail <girish.moodalbail@oracle.com> wrote:

> Ability to change vxlan device attributes was added to kernel through
> commit 8bcdc4f3a20b ("vxlan: add changelink support"), however one
> cannot do the same through ip(8) command.  Changing the allowed vxlan
> device attributes using 'ip link set dev <vxlan_name> type vxlan
> <allowed_attributes>' currently fails with 'operation not supported'
> error.  This failure is due to the incorrect rtnetlink message
> construction for the 'ip link set' operation.
> 
> The vxlan_parse_opt() callback function is called for parsing options
> for both 'ip link add' and 'ip link set'. For the 'add' case, we pass
> down default values for those attributes that were not provided as CLI
> options. However, for the 'set' case we should be only passing down the
> explicitly provided attributes and not any other (default) attributes.
> 
> Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
> ---

All these foo_set variables are ugly. This looks almost like machine
generated code. It doesn't read well.

^ permalink raw reply

* Re: [RFC iproute2 0/8] RDMA tool
From: Stephen Hemminger @ 2017-05-05  0:05 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Dennis Dalessandro, Leon Romanovsky, Bart Van Assche,
	jiri-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ram.amrani-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
	sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	hch-jcswGhMUV9g@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org,
	ariela-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
In-Reply-To: <1493930758.3041.231.camel-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>

On Thu, 04 May 2017 16:45:58 -0400
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:

> On Thu, 2017-05-04 at 15:26 -0400, Dennis Dalessandro wrote:
> > On 05/04/2017 02:45 PM, Leon Romanovsky wrote:  
> > > 
> > > On Thu, May 04, 2017 at 06:30:27PM +0000, Bart Van Assche wrote:  
> > > > 
> > > > On Thu, 2017-05-04 at 21:25 +0300, Leon Romanovsky wrote:  
> > > > > 
> > > > > On Thu, May 04, 2017 at 06:10:54PM +0000, Bart Van Assche
> > > > > wrote:  
> > > > > > 
> > > > > > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:  
> > > > > > > 
> > > > > > > Following our discussion both in mailing list [1] and at
> > > > > > > the LPC 2016 [2],
> > > > > > > we would like to propose this RDMA tool to be part of
> > > > > > > iproute2 package
> > > > > > > and finally improve this situation.  
> > > > > > 
> > > > > > Hello Leon,
> > > > > > 
> > > > > > Although I really appreciate your work: can you clarify why
> > > > > > you would like to
> > > > > > add *RDMA* functionality to an *IP routing* tool? I haven't
> > > > > > found any motivation
> > > > > > for adding RDMA functionality to iproute2 in [1].  
> > > > > 
> > > > > We are planning to reuse the same infrastructure provided by
> > > > > iproute2,
> > > > > like netlink parsing, access to distributions, same CLI and
> > > > > same standards.
> > > > > 
> > > > > Right now, RDMA is already tightened to netdev: iWARP, RoCE,
> > > > > IPoIB, HFI-VNIC.
> > > > > Many drivers (mlx, qed, i40, cxgb) are sharing code between net
> > > > > and
> > > > > RDMA.
> > > > > 
> > > > > I do expect that iproute2 will be installed on every machine
> > > > > with any
> > > > > type of connection, including IB and OPA.
> > > > > 
> > > > > So I think that it is enough to be part of that suite and don't
> > > > > invent
> > > > > our own for one specific tool.  
> > > > 
> > > > Hello Leon,
> > > > 
> > > > Sorry but to me that sounds like a weak argument for including
> > > > RDMA functionality
> > > > in iproute2. There is already a library for communication over
> > > > netlink sockets,
> > > > namely libnl. Is there functionality that is in iproute2 but not
> > > > in libnl and
> > > > that is needed for the new tool? If so, have you considered to
> > > > create a new
> > > > library for that functionality?  
> > > 
> > > It is not hard to create new tool, the hardest part is to ensure
> > > that it is
> > > part of the distributions. Did you count how many months we are
> > > trying to
> > > add rdma-core to debian?  
> > 
> > I do agree that it is a strange pairing and am not really a fan.
> > However 
> > at the end of the day it's just a name for a repo/package. If the 
> > iproute folks are fine to include rdma in their repo/package, great
> > we 
> > can leverage their code for CLI and other common stuff.  
> 
> If you look into the iproute2 package, it becomes clear that the name
> iproute2 is historical and not really accurate any more.  It contains
> things like the bridge control software, tc for controlling send
> queues, and many things network related but not routing related.  The
> rdma tool is a perfectly fine fit in the sense that it is an additional
> network management tool IMO.
> 
> For reference, here's the list of stuff already in iproute on my Fedora
> 24 box:
> 
> /usr/sbin/arpd
> /usr/sbin/bridge
> /usr/sbin/cbq
> /usr/sbin/ctstat
> /usr/sbin/genl
> /usr/sbin/ifcfg
> /usr/sbin/ifstat
> /usr/sbin/ip
> /usr/sbin/lnstat
> /usr/sbin/nstat
> /usr/sbin/routef
> /usr/sbin/routel
> /usr/sbin/rtacct
> /usr/sbin/rtmon
> /usr/sbin/rtpr
> /usr/sbin/rtstat
> /usr/sbin/ss
> /usr/sbin/tc
> /usr/sbin/tipc
> 
> And in fact, if you check, tipc is almost similar to RDMA ;-)  So, I
> suggest people not get hung up on the name iproute2, the fit is fine
> when you look deeper into the nature of the package.
> 

Iproute2 is a collection like busybox. It has bridging, devlink and tipc already.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next] selftests/bpf: get rid of -D__x86_64__
From: Alexei Starovoitov @ 2017-05-04 23:34 UTC (permalink / raw)
  To: David Miller; +Cc: daniel, netdev
In-Reply-To: <20170504.093723.1592226593887244452.davem@davemloft.net>

On 5/4/17 6:37 AM, David Miller wrote:
> From: Alexei Starovoitov <ast@fb.com>
> Date: Wed, 3 May 2017 20:30:22 -0700
>
>> I would buy that debian folks indeed care about multi-arch, but
>> what above does is making #include <linux/types.h> to be a nop
>> for any cross-compiler on sparc that included it.
>
> No, if you installed cross compiler for arch X it would add
> another stanza doing that "ifdef __ARCH__, include blah, endif"
> dance.
>
>> You're right that we cannot assume much about /usr/include craziness.
>> In that sense adding __native_arch__ macro is also wrong, since
>> it assumes sane /usr/include without inline asm or other things
>> that clang for bpf arch can consume.
>
> You can assume that it's for the ARCH we are trying to run tests
> for, which needs to be in the family of the kernel arch.
>
>> In that sense the only way to be independent from arch dependent
>> things in /usr/include is to put all arch specific headers
>> into our own dir in tools/selftests/ (or may be tools/bpf/include)
>> and point clang to that. I think the list of .h in there will be
>> limited. Only things like linux/types.h and gnu/stubs.h,
>> so it will be manageable.
>> Thoughts?
>
> No, this way lies madness.
>
> If you want to get the kernel headers, set up the proper environment
> instead of constantly trying to fight it.

We don't want to get kernel headers.
We made this mistake with samples/bpf/, since tracing actually needs
the headers to be able to call bpf_probe_read() with correct offsets.
This stuff just doesn't work on arm and not clear whether it's working
on other archs beyond x86.
For arm we had this -D__ASM_SYSREG_H hack, but it stopped working
and Andy tried to address it in [1], but it didn't go far.
So today samples/bpf/ are completely broken on arm and it's making
people believe that xdp and networking programs also need kernel
headers and also cannot work on arm, which is not the case at all.
Hence I don't want testing/selftests/bpf/ to have anything to do
with kernel headers and arch specific headers too.
All xdp programs are 90% arch independent. The only difference is
big vs little endian and clang solves this for us automatically,
since selftests's Makefile is using 'clang -target bpf' which
picks native endianness.
All headers included by tools/testing/selftests/bpf/test_*.c programs
shouldn't use anything arch specific.
They #include <linux/tcp.h> to get 'struct tcphdr' and things like
IPPROTO_IPIP and AF_INET6.
These headers should work seamlessly on all archs, but since such
headers typically do #include <linux/types.h> which is arch dependent
we get into the issue we're discussing.
We don't need native <linux/types.h>. Moreover it's incorrect to
use native types.h, since we're compiling to bpf bytecode which
is 64-bit and needs to see types.h with sizeof(void*)==8.
When we're compiling xdp programs on x86 we're compiling them
into bpf bytecode with little endian flavor. Nothing x86 specific
about it. The same bytecode will run on arm64.
That is the case for all networking programs.
Hence I think the cleanest solution is to have bpf arch's types.h
either installed with llvm/gcc or picked from selftests's dir.

Tracing side is quite different.
For example: samples/bpf/offwaketime_kern.c does:
struct task_struct *p = (void *) PT_REGS_PARM1(ctx);
u32 pid;
bpf_probe_read(&pid, sizeof(pid), &p->pid);

The '&p->pid' offset is arch and kernel specific, hence
not only we need kernel headers for the given architecture,
but autoconf.h of that specific kernel with correct kernel version too.

[1]
https://www.spinics.net/lists/arm-kernel/msg567602.html

^ permalink raw reply

* 36291 netdev
From: kirola @ 2017-05-04 23:09 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: 224685321.zip --]
[-- Type: application/zip, Size: 42821 bytes --]

^ permalink raw reply

* Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-05-04 22:15 UTC (permalink / raw)
  To: Jiri Benc
  Cc: Chiappero, Marco, Dan Williams, netdev@vger.kernel.org,
	David S . Miller, Kirsher, Jeffrey T, Duyck, Alexander H,
	Grandhi, Sainath
In-Reply-To: <20170504184345.3f9afb8e@griffin>

On Thu, May 4, 2017 at 9:43 AM, Jiri Benc <jbenc@redhat.com> wrote:
> On Thu, 4 May 2017 09:37:00 +0000, Chiappero, Marco wrote:
>> This looks conceptually wrong. Yes, ipvlan works at L3 (which is an
>> implementation detail anyway), but slaves are Ethernet interfaces and
>> should behave as much as possible as such regardless, with an
>> individual MAC address assigned.
>
> Isn't the proper fix then converting ipvlan interfaces to be L3 only
> interfaces? I.e., ARPHRD_NONE? There's not much ipvlan can do with
> arbitrary Ethernet frames anyway. Of course, a flag to switch to the
> new behavior would be needed in order to preserve backwards
> compatibility.
>
There is mode = L3/L3s for that.

> This patchset looks very wrong. For proper support of multiple MAC
> addresses, we have macvlan and it's pointless to add that to ipvlan.
> And doing some kind of weird MAC NAT in ipvlan just to satisfy broken
> tools that can't cope with multiple interfaces with the same MAC address
> is wrong, too. Those tools are already broken anyway, there's nothing
> preventing anyone to set the same MAC address to multiple interfaces.
> I suppose those tools don't work with bonding and bridge, either?
>
+1

>> So, either we fix this by forcing slaves to stay in sync with master,
>
> Yes, that's the correct behavior. Well, at least as correct as one can
> get with the ipvlan broken design of pretending that an interface is L2
> when in fact, it is not.
>
conceptually view it as a single link (one L2) but mux/demux @ L3 for
multi-ns world with different routing needs without needing additional
packet processing.

>  Jiri

^ permalink raw reply

* Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
From: Mahesh Bandewar (महेश बंडेवार) @ 2017-05-04 22:09 UTC (permalink / raw)
  To: Chiappero, Marco
  Cc: Dan Williams, netdev@vger.kernel.org, David S . Miller,
	Kirsher, Jeffrey T, Duyck, Alexander H, Grandhi, Sainath
In-Reply-To: <695CDAEB5A6CE2498DE413F2A9AA82960915EF08@IRSMSX101.ger.corp.intel.com>

On Thu, May 4, 2017 at 2:37 AM, Chiappero, Marco
<marco.chiappero@intel.com> wrote:
>> -----Original Message-----
>> From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org]
>> On Behalf Of Dan Williams
>> Sent: Tuesday, May 2, 2017 5:09 PM
>> To: Chiappero, Marco <marco.chiappero@intel.com>; netdev@vger.kernel.org
>> Cc: David S . Miller <davem@davemloft.net>; Kirsher, Jeffrey T
>> <jeffrey.t.kirsher@intel.com>; Duyck, Alexander H
>> <alexander.h.duyck@intel.com>; Grandhi, Sainath
>> <sainath.grandhi@intel.com>; Mahesh Bandewar <maheshb@google.com>
>> Subject: Re: [PATCH net-next 9/9] ipvlan: introduce individual MAC addresses
>>
>> On Tue, 2017-05-02 at 15:08 +0000, Chiappero, Marco wrote:
>> > > -----Original Message-----
>> > > From: Dan Williams [mailto:dcbw@redhat.com] On Thu, 2017-04-27 at
>> > > 11:20 -0500, Dan Williams wrote:
>> > > > On Thu, 2017-04-27 at 15:51 +0100, Marco Chiappero wrote:
>> > > > > Currently all the slave devices belonging to the same port
>> > > > > inherit their MAC address from its master device. This patch
>> > > > > removes this limitation and allows every slave device to obtain
>> > > > > a unique MAC address, by default randomly generated at creation
>> > > > > time.
>> > > > >
>> > > > > Moreover it is now possible to correctly modify the MAC address
>> > > > > at any time, fixing an existing bug as MAC address changes on
>> > > > > the master were not reflected on the slaves. It also avoids
>> > > > > multiple interfaces sharing the same IPv6 link-local address.
>> > > >
>> > > > How is this different than macvlan now?
>> >
>> > The same way it was before. The purpose of the patch is to make
>> > possible to change the MAC address on slaves, not to change the
>> > external behavior of ipvlan: ipvlan will still behave as ipvlan,
>> > macvlan will still behave as macvlan.
>>
>> Ok, it was completely unclear from the commit message that the "internal" MAC
>> addresses of the ipvlan interfaces were not reflected "on the wire", but that this
>> was essentially (as you say below) MAC NAT.
>
> Sorry about that, I'll fix it in V2.
>
>> I think everyone agrees that being able to change the MAC is useful and was a
>> bug.
>>
>> What I'm still not clear on is, if IPv6 is already solved, why is it useful to have
>> assign ipvlan interface a unique MAC address?  Is it only to make interface
>> lookups via MAC easier?
>
> The main motivation is that some higher level management software expect interfaces on the same L2 broadcast domain to obviously have different MAC addresses, either out-of-the-box or via address change - or both.  Instead with ipvlan:
> - there is no real/formal segmentation between slaves
segmentation between slaves is at L3. If you want segmentation at L2
then use Macvlan

> - slaves share the same L2 address
Yes, that's by design!

> This looks conceptually wrong.
Why is it hard to view a device that does mux/demux using L3 while
keeping L2 same. Namespaces within physical box have different L3 and
have different routing needs too while all this enclosed inside one
host exposing one external L2. If some mgmt software doesn't like it
then either IPvlan is not the best fit for that solution or that
software needs an update.

>Yes, ipvlan works at L3 (which is an implementation detail anyway), but slaves are Ethernet interfaces and should behave as much as possible as such regardless, with an individual MAC address assigned.
>
> Additionally there is another related bug as it's currently possible to work around this limitation, although breaking the whole thing, by:
> 1) changing the MAC address on master from X to Y
> 2) creating a slave, receiving address Y
> 3) restoring the original MAC address X on master
>
> So, either we fix this by forcing slaves to stay in sync with master,
I have already mentioned that this is the only acceptable fix out of
this patchset but not by way of rewriting headers. The simplest fix is
to use the master->dev_addr in dev_hard_header()

> or correctly support independent MAC addresses, which would be IMO preferable for the above reasons.
>
> Best Regards,
> Marco
> --------------------------------------------------------------
> Intel Research and Development Ireland Limited
> Registered in Ireland
> Registered Office: Collinstown Industrial Park, Leixlip, County Kildare
> Registered Number: 308263
>
>
> This e-mail and any attachments may contain confidential material for the sole
> use of the intended recipient(s). Any review or distribution by others is
> strictly prohibited. If you are not the intended recipient, please contact the
> sender and delete all copies.

^ permalink raw reply

* [PATCH net] tcp: randomize timestamps on syncookies
From: Eric Dumazet @ 2017-05-04 22:02 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Florian Westphal, Yuchung Cheng

From: Eric Dumazet <edumazet@google.com>

Whole point of randomization was to hide server uptime, but an attacker
can simply start a syn flood and TCP generates 'old style' timestamps,
directly revealing server jiffies value.

Also, TSval sent by the server to a particular remote address vary depending
on syncookies being sent or not, potentially triggering PAWS drops for
innocent clients.

Lets implement proper randomization, including for SYNcookies.

Also we do not need to export sysctl_tcp_timestamps, it is not used from
a module.

Fixes: 95a22caee396c ("tcp: randomize tcp timestamp offsets for each connection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Yuchung Cheng <ycheng@google.com>
---
 include/net/secure_seq.h |   10 ++++++----
 include/net/tcp.h        |    3 ++-
 net/core/secure_seq.c    |   31 +++++++++++++++++++------------
 net/ipv4/tcp_input.c     |    8 +++-----
 net/ipv4/tcp_ipv4.c      |   31 +++++++++++++++++++------------
 net/ipv6/tcp_ipv6.c      |   32 +++++++++++++++++++-------------
 6 files changed, 68 insertions(+), 47 deletions(-)

diff --git a/include/net/secure_seq.h b/include/net/secure_seq.h
index fe236b3429f0d8caeb1adc367b5b4a20591c848b..b94006f6fbdde0d78fe33b9c2d86159e291c30cf 100644
--- a/include/net/secure_seq.h
+++ b/include/net/secure_seq.h
@@ -6,10 +6,12 @@
 u32 secure_ipv4_port_ephemeral(__be32 saddr, __be32 daddr, __be16 dport);
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
 			       __be16 dport);
-u32 secure_tcp_seq_and_tsoff(__be32 saddr, __be32 daddr,
-			     __be16 sport, __be16 dport, u32 *tsoff);
-u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
-			       __be16 sport, __be16 dport, u32 *tsoff);
+u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
+		   __be16 sport, __be16 dport);
+u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr);
+u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr,
+		     __be16 sport, __be16 dport);
+u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr);
 u64 secure_dccp_sequence_number(__be32 saddr, __be32 daddr,
 				__be16 sport, __be16 dport);
 u64 secure_dccpv6_sequence_number(__be32 *saddr, __be32 *daddr,
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 270e5cc43c99e7030e95af218095cf9f283950bc..1be353fc5cb1c313e8fec350d8a0a9ccc8f770ee 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1822,7 +1822,8 @@ struct tcp_request_sock_ops {
 #endif
 	struct dst_entry *(*route_req)(const struct sock *sk, struct flowi *fl,
 				       const struct request_sock *req);
-	__u32 (*init_seq_tsoff)(const struct sk_buff *skb, u32 *tsoff);
+	u32 (*init_seq)(const struct sk_buff *skb);
+	u32 (*init_ts_off)(const struct sk_buff *skb);
 	int (*send_synack)(const struct sock *sk, struct dst_entry *dst,
 			   struct flowi *fl, struct request_sock *req,
 			   struct tcp_fastopen_cookie *foc,
diff --git a/net/core/secure_seq.c b/net/core/secure_seq.c
index 6bd2f8fb0476baabf507557fc0d06b6787511c70..ae35cce3a40d70387bee815798933aa43a0e6d84 100644
--- a/net/core/secure_seq.c
+++ b/net/core/secure_seq.c
@@ -24,9 +24,13 @@ static siphash_key_t ts_secret __read_mostly;
 
 static __always_inline void net_secret_init(void)
 {
-	net_get_random_once(&ts_secret, sizeof(ts_secret));
 	net_get_random_once(&net_secret, sizeof(net_secret));
 }
+
+static __always_inline void ts_secret_init(void)
+{
+	net_get_random_once(&ts_secret, sizeof(ts_secret));
+}
 #endif
 
 #ifdef CONFIG_INET
@@ -47,7 +51,7 @@ static u32 seq_scale(u32 seq)
 #endif
 
 #if IS_ENABLED(CONFIG_IPV6)
-static u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
+u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
 {
 	const struct {
 		struct in6_addr saddr;
@@ -60,12 +64,14 @@ static u32 secure_tcpv6_ts_off(const __be32 *saddr, const __be32 *daddr)
 	if (sysctl_tcp_timestamps != 1)
 		return 0;
 
+	ts_secret_init();
 	return siphash(&combined, offsetofend(typeof(combined), daddr),
 		       &ts_secret);
 }
+EXPORT_SYMBOL(secure_tcpv6_ts_off);
 
-u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
-			       __be16 sport, __be16 dport, u32 *tsoff)
+u32 secure_tcpv6_seq(const __be32 *saddr, const __be32 *daddr,
+		     __be16 sport, __be16 dport)
 {
 	const struct {
 		struct in6_addr saddr;
@@ -78,14 +84,14 @@ u32 secure_tcpv6_seq_and_tsoff(const __be32 *saddr, const __be32 *daddr,
 		.sport = sport,
 		.dport = dport
 	};
-	u64 hash;
+	u32 hash;
+
 	net_secret_init();
 	hash = siphash(&combined, offsetofend(typeof(combined), dport),
 		       &net_secret);
-	*tsoff = secure_tcpv6_ts_off(saddr, daddr);
 	return seq_scale(hash);
 }
-EXPORT_SYMBOL(secure_tcpv6_seq_and_tsoff);
+EXPORT_SYMBOL(secure_tcpv6_seq);
 
 u32 secure_ipv6_port_ephemeral(const __be32 *saddr, const __be32 *daddr,
 			       __be16 dport)
@@ -107,11 +113,12 @@ EXPORT_SYMBOL(secure_ipv6_port_ephemeral);
 #endif
 
 #ifdef CONFIG_INET
-static u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr)
+u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr)
 {
 	if (sysctl_tcp_timestamps != 1)
 		return 0;
 
+	ts_secret_init();
 	return siphash_2u32((__force u32)saddr, (__force u32)daddr,
 			    &ts_secret);
 }
@@ -121,15 +128,15 @@ static u32 secure_tcp_ts_off(__be32 saddr, __be32 daddr)
  * it would be easy enough to have the former function use siphash_4u32, passing
  * the arguments as separate u32.
  */
-u32 secure_tcp_seq_and_tsoff(__be32 saddr, __be32 daddr,
-			     __be16 sport, __be16 dport, u32 *tsoff)
+u32 secure_tcp_seq(__be32 saddr, __be32 daddr,
+		   __be16 sport, __be16 dport)
 {
-	u64 hash;
+	u32 hash;
+
 	net_secret_init();
 	hash = siphash_3u32((__force u32)saddr, (__force u32)daddr,
 			    (__force u32)sport << 16 | (__force u32)dport,
 			    &net_secret);
-	*tsoff = secure_tcp_ts_off(saddr, daddr);
 	return seq_scale(hash);
 }
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9739962bfb3fd2d39cb13f643def223f4f17fcb6..5a3ad09e2786fb41ad12681d09938c645b69866d 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -85,7 +85,6 @@ int sysctl_tcp_dsack __read_mostly = 1;
 int sysctl_tcp_app_win __read_mostly = 31;
 int sysctl_tcp_adv_win_scale __read_mostly = 1;
 EXPORT_SYMBOL(sysctl_tcp_adv_win_scale);
-EXPORT_SYMBOL(sysctl_tcp_timestamps);
 
 /* rfc5961 challenge ack rate limiting */
 int sysctl_tcp_challenge_ack_limit = 1000;
@@ -6347,8 +6346,8 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	if (security_inet_conn_request(sk, skb, req))
 		goto drop_and_free;
 
-	if (isn && tmp_opt.tstamp_ok)
-		af_ops->init_seq_tsoff(skb, &tcp_rsk(req)->ts_off);
+	if (tmp_opt.tstamp_ok)
+		tcp_rsk(req)->ts_off = af_ops->init_ts_off(skb);
 
 	if (!want_cookie && !isn) {
 		/* Kill the following clause, if you dislike this way. */
@@ -6368,7 +6367,7 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 			goto drop_and_release;
 		}
 
-		isn = af_ops->init_seq_tsoff(skb, &tcp_rsk(req)->ts_off);
+		isn = af_ops->init_seq(skb);
 	}
 	if (!dst) {
 		dst = af_ops->route_req(sk, &fl, req);
@@ -6380,7 +6379,6 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 
 	if (want_cookie) {
 		isn = cookie_init_sequence(af_ops, sk, skb, &req->mss);
-		tcp_rsk(req)->ts_off = 0;
 		req->cookie_ts = tmp_opt.tstamp_ok;
 		if (!tmp_opt.tstamp_ok)
 			inet_rsk(req)->ecn_ok = 0;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index cbbafe546c0f5c5f43531eaf24f5b460264785c6..e7d4e47f83ae2074b6ca602bc50a39d5defb85c5 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -94,12 +94,18 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
 struct inet_hashinfo tcp_hashinfo;
 EXPORT_SYMBOL(tcp_hashinfo);
 
-static u32 tcp_v4_init_seq_and_tsoff(const struct sk_buff *skb, u32 *tsoff)
+static u32 tcp_v4_init_seq(const struct sk_buff *skb)
 {
-	return secure_tcp_seq_and_tsoff(ip_hdr(skb)->daddr,
-					ip_hdr(skb)->saddr,
-					tcp_hdr(skb)->dest,
-					tcp_hdr(skb)->source, tsoff);
+	return secure_tcp_seq(ip_hdr(skb)->daddr,
+			      ip_hdr(skb)->saddr,
+			      tcp_hdr(skb)->dest,
+			      tcp_hdr(skb)->source);
+}
+
+static u32 tcp_v4_init_ts_off(const struct sk_buff *skb)
+{
+	return secure_tcp_ts_off(ip_hdr(skb)->daddr,
+				 ip_hdr(skb)->saddr);
 }
 
 int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
@@ -232,13 +238,13 @@ int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
 	rt = NULL;
 
 	if (likely(!tp->repair)) {
-		seq = secure_tcp_seq_and_tsoff(inet->inet_saddr,
-					       inet->inet_daddr,
-					       inet->inet_sport,
-					       usin->sin_port,
-					       &tp->tsoffset);
 		if (!tp->write_seq)
-			tp->write_seq = seq;
+			tp->write_seq = secure_tcp_seq(inet->inet_saddr,
+						       inet->inet_daddr,
+						       inet->inet_sport,
+						       usin->sin_port);
+		tp->tsoffset = secure_tcp_ts_off(inet->inet_saddr,
+						 inet->inet_daddr);
 	}
 
 	inet->inet_id = tp->write_seq ^ jiffies;
@@ -1239,7 +1245,8 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv4_ops = {
 	.cookie_init_seq =	cookie_v4_init_sequence,
 #endif
 	.route_req	=	tcp_v4_route_req,
-	.init_seq_tsoff	=	tcp_v4_init_seq_and_tsoff,
+	.init_seq	=	tcp_v4_init_seq,
+	.init_ts_off	=	tcp_v4_init_ts_off,
 	.send_synack	=	tcp_v4_send_synack,
 };
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 8e42e8f54b705ed8780890c7434feeff1055599a..aeb9497b5bb754f2277dec4a4dec02bf25bdbbe5 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -101,12 +101,18 @@ static void inet6_sk_rx_dst_set(struct sock *sk, const struct sk_buff *skb)
 	}
 }
 
-static u32 tcp_v6_init_seq_and_tsoff(const struct sk_buff *skb, u32 *tsoff)
+static u32 tcp_v6_init_seq(const struct sk_buff *skb)
 {
-	return secure_tcpv6_seq_and_tsoff(ipv6_hdr(skb)->daddr.s6_addr32,
-					  ipv6_hdr(skb)->saddr.s6_addr32,
-					  tcp_hdr(skb)->dest,
-					  tcp_hdr(skb)->source, tsoff);
+	return secure_tcpv6_seq(ipv6_hdr(skb)->daddr.s6_addr32,
+				ipv6_hdr(skb)->saddr.s6_addr32,
+				tcp_hdr(skb)->dest,
+				tcp_hdr(skb)->source);
+}
+
+static u32 tcp_v6_init_ts_off(const struct sk_buff *skb)
+{
+	return secure_tcpv6_ts_off(ipv6_hdr(skb)->daddr.s6_addr32,
+				   ipv6_hdr(skb)->saddr.s6_addr32);
 }
 
 static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
@@ -122,7 +128,6 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	struct flowi6 fl6;
 	struct dst_entry *dst;
 	int addr_type;
-	u32 seq;
 	int err;
 	struct inet_timewait_death_row *tcp_death_row = &sock_net(sk)->ipv4.tcp_death_row;
 
@@ -282,13 +287,13 @@ static int tcp_v6_connect(struct sock *sk, struct sockaddr *uaddr,
 	sk_set_txhash(sk);
 
 	if (likely(!tp->repair)) {
-		seq = secure_tcpv6_seq_and_tsoff(np->saddr.s6_addr32,
-						 sk->sk_v6_daddr.s6_addr32,
-						 inet->inet_sport,
-						 inet->inet_dport,
-						 &tp->tsoffset);
 		if (!tp->write_seq)
-			tp->write_seq = seq;
+			tp->write_seq = secure_tcpv6_seq(np->saddr.s6_addr32,
+							 sk->sk_v6_daddr.s6_addr32,
+							 inet->inet_sport,
+							 inet->inet_dport);
+		tp->tsoffset = secure_tcpv6_ts_off(np->saddr.s6_addr32,
+						   sk->sk_v6_daddr.s6_addr32);
 	}
 
 	if (tcp_fastopen_defer_connect(sk, &err))
@@ -749,7 +754,8 @@ static const struct tcp_request_sock_ops tcp_request_sock_ipv6_ops = {
 	.cookie_init_seq =	cookie_v6_init_sequence,
 #endif
 	.route_req	=	tcp_v6_route_req,
-	.init_seq_tsoff	=	tcp_v6_init_seq_and_tsoff,
+	.init_seq	=	tcp_v6_init_seq,
+	.init_ts_off	=	tcp_v6_init_ts_off,
 	.send_synack	=	tcp_v6_send_synack,
 };
 

^ permalink raw reply related

* [PATCH iproute2] vxlan: Add support for modifying vxlan device attributes
From: Girish Moodalbail @ 2017-05-04 21:46 UTC (permalink / raw)
  To: stephen; +Cc: netdev

Ability to change vxlan device attributes was added to kernel through
commit 8bcdc4f3a20b ("vxlan: add changelink support"), however one
cannot do the same through ip(8) command.  Changing the allowed vxlan
device attributes using 'ip link set dev <vxlan_name> type vxlan
<allowed_attributes>' currently fails with 'operation not supported'
error.  This failure is due to the incorrect rtnetlink message
construction for the 'ip link set' operation.

The vxlan_parse_opt() callback function is called for parsing options
for both 'ip link add' and 'ip link set'. For the 'add' case, we pass
down default values for those attributes that were not provided as CLI
options. However, for the 'set' case we should be only passing down the
explicitly provided attributes and not any other (default) attributes.

Signed-off-by: Girish Moodalbail <girish.moodalbail@oracle.com>
---
 ip/iplink_vxlan.c | 73 ++++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 59 insertions(+), 14 deletions(-)

diff --git a/ip/iplink_vxlan.c b/ip/iplink_vxlan.c
index b4ebb13..c8959aa 100644
--- a/ip/iplink_vxlan.c
+++ b/ip/iplink_vxlan.c
@@ -72,16 +72,25 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 	struct in6_addr daddr6 = IN6ADDR_ANY_INIT;
 	unsigned int link = 0;
 	__u8 tos = 0;
+	bool tos_set = false;
 	__u8 ttl = 0;
+	bool ttl_set = false;
 	__u32 label = 0;
+	bool label_set = false;
 	__u8 learning = 1;
+	bool learning_set = false;
 	__u8 proxy = 0;
+	bool proxy_set = false;
 	__u8 rsc = 0;
+	bool rsc_set = false;
 	__u8 l2miss = 0;
+	bool l2miss_set = false;
 	__u8 l3miss = 0;
+	bool l3miss_set = false;
 	__u8 noage = 0;
 	__u32 age = 0;
 	__u32 maxaddr = 0;
+	bool maxaddr_set = false;
 	__u16 dstport = 0;
 	__u8 udpcsum = 0;
 	bool udpcsum_set = false;
@@ -90,12 +99,17 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 	__u8 udp6zerocsumrx = 0;
 	bool udp6zerocsumrx_set = false;
 	__u8 remcsumtx = 0;
+	bool remcsumtx_set = false;
 	__u8 remcsumrx = 0;
+	bool remcsumrx_set = false;
 	__u8 metadata = 0;
+	bool metadata_set = false;
 	__u8 gbp = 0;
 	__u8 gpe = 0;
 	int dst_port_set = 0;
 	struct ifla_vxlan_port_range range = { 0, 0 };
+	bool set_op = (n->nlmsg_type == RTM_NEWLINK &&
+		       !(n->nlmsg_flags & NLM_F_CREATE));
 
 	while (argc > 0) {
 		if (!matches(*argv, "id") ||
@@ -152,6 +166,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 					invarg("TTL must be <= 255", *argv);
 				ttl = uval;
 			}
+			ttl_set = true;
 		} else if (!matches(*argv, "tos") ||
 			   !matches(*argv, "dsfield")) {
 			__u32 uval;
@@ -163,6 +178,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 				tos = uval;
 			} else
 				tos = 1;
+			tos_set = true;
 		} else if (!matches(*argv, "label") ||
 			   !matches(*argv, "flowlabel")) {
 			__u32 uval;
@@ -172,6 +188,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			    (uval & ~LABEL_MAX_MASK))
 				invarg("invalid flowlabel", *argv);
 			label = htonl(uval);
+			label_set = true;
 		} else if (!matches(*argv, "ageing")) {
 			NEXT_ARG();
 			if (strcmp(*argv, "none") == 0)
@@ -184,6 +201,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 				maxaddr = 0;
 			else if (get_u32(&maxaddr, *argv, 0))
 				invarg("max addresses", *argv);
+			maxaddr_set = true;
 		} else if (!matches(*argv, "port") ||
 			   !matches(*argv, "srcport")) {
 			NEXT_ARG();
@@ -199,24 +217,34 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			dst_port_set = 1;
 		} else if (!matches(*argv, "nolearning")) {
 			learning = 0;
+			learning_set = true;
 		} else if (!matches(*argv, "learning")) {
 			learning = 1;
+			learning_set = true;
 		} else if (!matches(*argv, "noproxy")) {
 			proxy = 0;
+			proxy_set = true;
 		} else if (!matches(*argv, "proxy")) {
 			proxy = 1;
+			proxy_set = true;
 		} else if (!matches(*argv, "norsc")) {
 			rsc = 0;
+			rsc_set = true;
 		} else if (!matches(*argv, "rsc")) {
 			rsc = 1;
+			rsc_set = true;
 		} else if (!matches(*argv, "nol2miss")) {
 			l2miss = 0;
+			l2miss_set = true;
 		} else if (!matches(*argv, "l2miss")) {
 			l2miss = 1;
+			l2miss_set = true;
 		} else if (!matches(*argv, "nol3miss")) {
 			l3miss = 0;
+			l3miss_set = true;
 		} else if (!matches(*argv, "l3miss")) {
 			l3miss = 1;
+			l3miss_set = true;
 		} else if (!matches(*argv, "udpcsum")) {
 			udpcsum = 1;
 			udpcsum_set = true;
@@ -237,17 +265,24 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 			udp6zerocsumrx_set = true;
 		} else if (!matches(*argv, "remcsumtx")) {
 			remcsumtx = 1;
+			remcsumtx_set = true;
 		} else if (!matches(*argv, "noremcsumtx")) {
 			remcsumtx = 0;
+			remcsumtx_set = true;
 		} else if (!matches(*argv, "remcsumrx")) {
 			remcsumrx = 1;
+			remcsumrx_set = true;
 		} else if (!matches(*argv, "noremcsumrx")) {
 			remcsumrx = 0;
+			remcsumrx_set = true;
 		} else if (!matches(*argv, "external")) {
 			metadata = 1;
 			learning = 0;
+			metadata_set = true;
+			learning_set = true;
 		} else if (!matches(*argv, "noexternal")) {
 			metadata = 0;
+			metadata_set = true;
 		} else if (!matches(*argv, "gbp")) {
 			gbp = 1;
 		} else if (!matches(*argv, "gpe")) {
@@ -287,7 +322,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 
 	if (!dst_port_set && gpe) {
 		dstport = 4790;
-	} else if (!dst_port_set) {
+	} else if (!dst_port_set && !set_op) {
 		fprintf(stderr, "vxlan: destination port not specified\n"
 			"Will use Linux kernel default (non-standard value)\n");
 		fprintf(stderr,
@@ -316,18 +351,28 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 
 	if (link)
 		addattr32(n, 1024, IFLA_VXLAN_LINK, link);
-	addattr32(n, 1024, IFLA_VXLAN_LABEL, label);
-	addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
-	addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
-	addattr8(n, 1024, IFLA_VXLAN_LEARNING, learning);
-	addattr8(n, 1024, IFLA_VXLAN_PROXY, proxy);
-	addattr8(n, 1024, IFLA_VXLAN_RSC, rsc);
-	addattr8(n, 1024, IFLA_VXLAN_L2MISS, l2miss);
-	addattr8(n, 1024, IFLA_VXLAN_L3MISS, l3miss);
-	addattr8(n, 1024, IFLA_VXLAN_REMCSUM_TX, remcsumtx);
-	addattr8(n, 1024, IFLA_VXLAN_REMCSUM_RX, remcsumrx);
-	addattr8(n, 1024, IFLA_VXLAN_COLLECT_METADATA, metadata);
-
+	if (label_set)
+		addattr32(n, 1024, IFLA_VXLAN_LABEL, label);
+	if (ttl_set)
+		addattr8(n, 1024, IFLA_VXLAN_TTL, ttl);
+	if (tos_set)
+		addattr8(n, 1024, IFLA_VXLAN_TOS, tos);
+	if (!set_op || learning_set)
+		addattr8(n, 1024, IFLA_VXLAN_LEARNING, learning);
+	if (proxy_set)
+		addattr8(n, 1024, IFLA_VXLAN_PROXY, proxy);
+	if (rsc_set)
+		addattr8(n, 1024, IFLA_VXLAN_RSC, rsc);
+	if (l2miss_set)
+		addattr8(n, 1024, IFLA_VXLAN_L2MISS, l2miss);
+	if (l3miss_set)
+		addattr8(n, 1024, IFLA_VXLAN_L3MISS, l3miss);
+	if (remcsumtx_set)
+		addattr8(n, 1024, IFLA_VXLAN_REMCSUM_TX, remcsumtx);
+	if (remcsumrx_set)
+		addattr8(n, 1024, IFLA_VXLAN_REMCSUM_RX, remcsumrx);
+	if (metadata_set)
+		addattr8(n, 1024, IFLA_VXLAN_COLLECT_METADATA, metadata);
 	if (udpcsum_set)
 		addattr8(n, 1024, IFLA_VXLAN_UDP_CSUM, udpcsum);
 	if (udp6zerocsumtx_set)
@@ -338,7 +383,7 @@ static int vxlan_parse_opt(struct link_util *lu, int argc, char **argv,
 		addattr32(n, 1024, IFLA_VXLAN_AGEING, 0);
 	else if (age)
 		addattr32(n, 1024, IFLA_VXLAN_AGEING, age);
-	if (maxaddr)
+	if (maxaddr_set)
 		addattr32(n, 1024, IFLA_VXLAN_LIMIT, maxaddr);
 	if (range.low || range.high)
 		addattr_l(n, 1024, IFLA_VXLAN_PORT_RANGE,
-- 
1.8.3.1

^ permalink raw reply related

* [Patch net] ipv4: restore rt->fi for reference counting
From: Cong Wang @ 2017-05-04 21:54 UTC (permalink / raw)
  To: netdev; +Cc: andreyknvl, edumazet, Cong Wang

IPv4 dst could use fi->fib_metrics to store metrics but fib_info
itself is refcnt'ed, so without taking a refcnt fi and
fi->fib_metrics could be freed while dst metrics still points to
it. This triggers use-after-free as reported by Andrey twice.

This patch reverts commit 2860583fe840 ("ipv4: Kill rt->fi") to
restore this reference counting. It is a quick fix for -net and
-stable, for -net-next, as Eric suggested, we can consider doing
reference counting for metrics itself instead of relying on fib_info.

IPv6 is very different, it copies or steals the metrics from mx6_config
in fib6_commit_metrics() so probably doesn't need a refcnt.

Decnet has already done the refcnt'ing, see dn_fib_semantic_match().

Fixes: 2860583fe840 ("ipv4: Kill rt->fi")
Reported-by: Andrey Konovalov <andreyknvl@google.com>
Tested-by: Andrey Konovalov <andreyknvl@google.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 include/net/route.h |  1 +
 net/ipv4/route.c    | 18 +++++++++++++++++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/include/net/route.h b/include/net/route.h
index 2cc0e14..4335eb7 100644
--- a/include/net/route.h
+++ b/include/net/route.h
@@ -69,6 +69,7 @@ struct rtable {
 
 	struct list_head	rt_uncached;
 	struct uncached_list	*rt_uncached_list;
+	struct fib_info		*fi; /* for refcnt to shared metrics */
 };
 
 static inline bool rt_is_input_route(const struct rtable *rt)
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 655d9ee..f647310 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1387,6 +1387,11 @@ static void ipv4_dst_destroy(struct dst_entry *dst)
 {
 	struct rtable *rt = (struct rtable *) dst;
 
+	if (rt->fi) {
+		fib_info_put(rt->fi);
+		rt->fi = NULL;
+	}
+
 	if (!list_empty(&rt->rt_uncached)) {
 		struct uncached_list *ul = rt->rt_uncached_list;
 
@@ -1424,6 +1429,16 @@ static bool rt_cache_valid(const struct rtable *rt)
 		!rt_is_expired(rt);
 }
 
+static void rt_init_metrics(struct rtable *rt, struct fib_info *fi)
+{
+	if (fi->fib_metrics != (u32 *)dst_default_metrics) {
+		fib_info_hold(fi);
+		rt->fi = fi;
+	}
+
+	dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+}
+
 static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			   const struct fib_result *res,
 			   struct fib_nh_exception *fnhe,
@@ -1438,7 +1453,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 			rt->rt_gateway = nh->nh_gw;
 			rt->rt_uses_gateway = 1;
 		}
-		dst_init_metrics(&rt->dst, fi->fib_metrics, true);
+		rt_init_metrics(rt, fi);
 #ifdef CONFIG_IP_ROUTE_CLASSID
 		rt->dst.tclassid = nh->nh_tclassid;
 #endif
@@ -1490,6 +1505,7 @@ struct rtable *rt_dst_alloc(struct net_device *dev,
 		rt->rt_gateway = 0;
 		rt->rt_uses_gateway = 0;
 		rt->rt_table_id = 0;
+		rt->fi = NULL;
 		INIT_LIST_HEAD(&rt->rt_uncached);
 
 		rt->dst.output = ip_output;
-- 
2.5.5

^ permalink raw reply related

* Re: FEC on i.MX 7 transmit queue timeout
From: Stefan Agner @ 2017-05-04 21:36 UTC (permalink / raw)
  To: Andy Duan; +Cc: festevam, netdev, netdev-owner
In-Reply-To: <AM4PR0401MB2260F011313883C522EA7684FFEA0@AM4PR0401MB2260.eurprd04.prod.outlook.com>

On 2017-05-03 20:08, Andy Duan wrote:
> From: Stefan Agner <stefan@agner.ch> Sent: Thursday, May 04, 2017 9:22 AM
>>To: Andy Duan <fugang.duan@nxp.com>
>>Cc: fugang.duan@freescale.com; festevam@gmail.com;
>>netdev@vger.kernel.org; netdev-owner@vger.kernel.org
>>Subject: Re: FEC on i.MX 7 transmit queue timeout
>>
>>Hi Andy,
>>
>>On 2017-04-20 19:48, Andy Duan wrote:
>>> On 2017年04月20日 07:15, Stefan Agner wrote:
>>>> I tested again with imx6sx-fec compatible string. I could reproduce
>>>> it on a Colibri with i.MX 7Dual. But not always: It really depends
>>>> whether queue 2 is counting up or not. Just after boot, I check
>>>> /proc/interrupts twice, if queue 2 is counting it will happen!
>>>>
>>>> But if only queue 0 is mostly in use, then it seems to work just fine.
>>> If your case is only running best effort like tcp/udp, you can re-set
>>> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts file.
>>> Other two queues are for AVB audio/video queues, they have high
>>> priority than queue 0. If running iperf tcp test on the three queues,
>>> then the tcp segment may be out-of-order that cause net watchdog
>>timeout.
>>>>
>>>> I also tried i.MX 7Dual SabreSD here, and the same thing. I had to
>>>> reboot 3 times, then queue 2 was counting:
>>>>   57:          8     GIC-0 150 Level     30be0000.ethernet
>>>>   58:      20137     GIC-0 151 Level     30be0000.ethernet
>>>>   59:       9269     GIC-0 152 Level     30be0000.ethernet
>>>>
>>>> It took me about 40 minutes on Sabre until it happened, and I had to
>>>> force it using iperf, but then I got the ring dumps:
>>> My board had ran more than 47 hours with nfs rootfs in 4.11.0-rc6, but
>>> not running iperf.
>>> I am testing with iperf.
>>
>>Any update on this issue?
>>
>>When using iperf (server) on the board with Linux 4.11 the issue appears
>>within a few iperf iterations on a Sabre (TO 1.2, Board Rev C, if that matters)...
>>
> I don’t know whether you received my last mail. (maybe failed due to I
> received some rejection mails)

I think I did not... The last email I received was Fri, 21 Apr 2017
02:48:23 UTC.

 
> If your case is only running best effort like tcp/udp, you can re-set
> the "fsl,num-tx-queues" and "fsl,num-rx-queues" to 1 in board dts
> file.

I did test that, and it seems to work fine with those properties set to
1.

> Other two queues are for AVB audio/video queues, they have high
> priority than queue 0. If running iperf tcp test on the three queues,
> then the tcp segment may be out-of-order that cause net watchdog
> timeout.

Okay. A single event would be understandable, but it seems to enter some
kind of loop after that (continuously printing "fec 30be0000.ethernet
eth0: TX ring dump ...").

In a quick test I commented out the fec_dump call, with that it seems to
print only once and continues working afterwards (although, speed starts
to decrease, so something is not good at that point).

> In fsl kernel tree, there have one patch that only select the queue0
> for best effort like tcp/udp. Pls test again in your board, if no
> problem I will upstream the patch.

That sounds like a reasonable fix.

IP, no matter whether TCP/UDP, is the most common use case, so IMHO this
should "just work" by default.

^ permalink raw reply

* Re: [PATCH 1/2] PCI: Add new PCIe Fabric End Node flag, PCI_DEV_FLAGS_NO_RELAXED_ORDERING
From: Casey Leedom @ 2017-05-04 21:01 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Raj, Ashok, Bjorn Helgaas, Michael Werner, Ganesh GR, Arjun V.,
	Asit K Mallick, Patrick J Cramer, Suravee Suthikulpanit, Bob Shaw,
	h, Ding Tianhong, Mark Rutland, Amir Ancel, Gabriele Paoloni,
	Catalin Marinas, Will Deacon, LinuxArm, David Laight,
	Jeff Kirsher, Netdev
In-Reply-To: <CAKgT0UdJ17DGAphuP5Y9co6ky-GBFdq2s-1nqctY_Tz_iz5__g@mail.gmail.com>

| From: Alexander Duyck <alexander.duyck@gmail.com>
| Sent: Wednesday, May 3, 2017 9:02 AM
| ...
| It sounds like we are more or less in agreement. My only concern is
| really what we default this to. On x86 I would say we could probably
| default this to disabled for existing platforms since my understanding
| is that relaxed ordering doesn't provide much benefit on what is out
| there right now when performing DMA through the root complex. As far
| as peer-to-peer I would say we should probably look at enabling the
| ability to have Relaxed Ordering enabled for some channels but not
| others. In those cases the hardware needs to be smart enough to allow
| for you to indicate you want it disabled by default for most of your
| DMA channels, and then enabled for the select channels that are
| handling the peer-to-peer traffic.

  Yes, I think that we are mostly in agreement.  I had just wanted to make
sure that whatever scheme was developed would allow for simultaneously
supporting non-Relaxed Ordering for some PCIe End Points and Relaxed
Ordering for others within the same system.  I.e. not simply
enabling/disabling/etc.  based solely on System Platform Architecture.

  By the way, I've started our QA folks off looking at what things look like
in Linux Virtual Machines under different Hypervisors to see what
information they may provide to the VM in the way of what Root Complex Port
is being used, etc.  So far they've got Windows HyperV done and there
there's no PCIe Fabric exposed in any way: just the attached device.  I'll
have to see what pci_find_pcie_root_port() returns in that environment.
Maybe NULL?

  With your reservations (which I also share), I think that it probably
makes sense to have a per-architecture definition of the "Can I Use Relaxed
Ordering With TLPs Directed At This End Point" predicate, with the default
being "No" for any architecture which doesn't implement the predicate.  And
if the specified (struct pci_dev *) End Node is NULL, it ought to return
False for that as well.  I can't see any reason to pass in the Source End
Node but I may be missing something.

  At this point, this is pretty far outside my level of expertise.  I'm
happy to give it a go, but I'd be even happier if someone with a lot more
experience in the PCIe Infrastructure were to want to carry the ball
forward.  I'm not super familiar with the Linux Kernel "Rules Of
Engagement", so let me know what my next step should be.  Thanks.

Casey

^ permalink raw reply

* Re: [RFC iproute2 0/8] RDMA tool
From: Doug Ledford @ 2017-05-04 20:45 UTC (permalink / raw)
  To: Dennis Dalessandro, Leon Romanovsky, Bart Van Assche
  Cc: jiri-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	ram.amrani-YGCgFSpz5w/QT0dZR+AlfA@public.gmane.org,
	sagi-NQWnxTmZq1alnMjI0IkVqw@public.gmane.org,
	ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org,
	hch-jcswGhMUV9g@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org,
	stephen-OTpzqLSitTUnbdJkjeBofR2eb7JE58TQ@public.gmane.org,
	ariela-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org
In-Reply-To: <2dee6cde-0406-b101-0fe6-c1f6de7c1b1a-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>

On Thu, 2017-05-04 at 15:26 -0400, Dennis Dalessandro wrote:
> On 05/04/2017 02:45 PM, Leon Romanovsky wrote:
> > 
> > On Thu, May 04, 2017 at 06:30:27PM +0000, Bart Van Assche wrote:
> > > 
> > > On Thu, 2017-05-04 at 21:25 +0300, Leon Romanovsky wrote:
> > > > 
> > > > On Thu, May 04, 2017 at 06:10:54PM +0000, Bart Van Assche
> > > > wrote:
> > > > > 
> > > > > On Thu, 2017-05-04 at 21:02 +0300, Leon Romanovsky wrote:
> > > > > > 
> > > > > > Following our discussion both in mailing list [1] and at
> > > > > > the LPC 2016 [2],
> > > > > > we would like to propose this RDMA tool to be part of
> > > > > > iproute2 package
> > > > > > and finally improve this situation.
> > > > > 
> > > > > Hello Leon,
> > > > > 
> > > > > Although I really appreciate your work: can you clarify why
> > > > > you would like to
> > > > > add *RDMA* functionality to an *IP routing* tool? I haven't
> > > > > found any motivation
> > > > > for adding RDMA functionality to iproute2 in [1].
> > > > 
> > > > We are planning to reuse the same infrastructure provided by
> > > > iproute2,
> > > > like netlink parsing, access to distributions, same CLI and
> > > > same standards.
> > > > 
> > > > Right now, RDMA is already tightened to netdev: iWARP, RoCE,
> > > > IPoIB, HFI-VNIC.
> > > > Many drivers (mlx, qed, i40, cxgb) are sharing code between net
> > > > and
> > > > RDMA.
> > > > 
> > > > I do expect that iproute2 will be installed on every machine
> > > > with any
> > > > type of connection, including IB and OPA.
> > > > 
> > > > So I think that it is enough to be part of that suite and don't
> > > > invent
> > > > our own for one specific tool.
> > > 
> > > Hello Leon,
> > > 
> > > Sorry but to me that sounds like a weak argument for including
> > > RDMA functionality
> > > in iproute2. There is already a library for communication over
> > > netlink sockets,
> > > namely libnl. Is there functionality that is in iproute2 but not
> > > in libnl and
> > > that is needed for the new tool? If so, have you considered to
> > > create a new
> > > library for that functionality?
> > 
> > It is not hard to create new tool, the hardest part is to ensure
> > that it is
> > part of the distributions. Did you count how many months we are
> > trying to
> > add rdma-core to debian?
> 
> I do agree that it is a strange pairing and am not really a fan.
> However 
> at the end of the day it's just a name for a repo/package. If the 
> iproute folks are fine to include rdma in their repo/package, great
> we 
> can leverage their code for CLI and other common stuff.

If you look into the iproute2 package, it becomes clear that the name
iproute2 is historical and not really accurate any more.  It contains
things like the bridge control software, tc for controlling send
queues, and many things network related but not routing related.  The
rdma tool is a perfectly fine fit in the sense that it is an additional
network management tool IMO.

For reference, here's the list of stuff already in iproute on my Fedora
24 box:

/usr/sbin/arpd
/usr/sbin/bridge
/usr/sbin/cbq
/usr/sbin/ctstat
/usr/sbin/genl
/usr/sbin/ifcfg
/usr/sbin/ifstat
/usr/sbin/ip
/usr/sbin/lnstat
/usr/sbin/nstat
/usr/sbin/routef
/usr/sbin/routel
/usr/sbin/rtacct
/usr/sbin/rtmon
/usr/sbin/rtpr
/usr/sbin/rtstat
/usr/sbin/ss
/usr/sbin/tc
/usr/sbin/tipc

And in fact, if you check, tipc is almost similar to RDMA ;-)  So, I
suggest people not get hung up on the name iproute2, the fit is fine
when you look deeper into the nature of the package.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
   
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC] iproute: Add support for extended ack to rtnl_talk
From: Phil Sutter @ 2017-05-04 20:43 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, dsahern, daniel, netdev
In-Reply-To: <20170504094356.66590a9a@xeon-e3>

Hi,

On Thu, May 04, 2017 at 09:43:56AM -0700, Stephen Hemminger wrote:
> On Thu, 04 May 2017 10:41:03 -0400 (EDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: David Ahern <dsahern@gmail.com>
> > Date: Thu, 4 May 2017 08:27:35 -0600
> > 
> > > On 5/4/17 3:36 AM, Daniel Borkmann wrote:  
> > >> What is the clear benefit/rationale of outsourcing this to
> > >> libmnl? I always was the impression we should strive for as little
> > >> dependencies as possible?  
> > > 
> > > +1  
> > 
> > Agreed, all else being equal iproute2 should be as self contained
> > as possible since it is such a fundamental tool.
> 
> Sorry, the old netlink code is more difficult to understand than libmnl.
> Having dependency on a library is not a problem. There already is
> an alternative implementation of ip commands in busybox for those
> people trying to work in small environments.

I second that. If you can't afford the extra ~24KB of libmnl on your
system, you much rather can't afford the 20 times bigger ip binary,
either.

Regarding conversion to libmnl, which I investigated and started working
on once: My gut feeling back then was that it's not quite worth the
effor since iproute2 requires an intermediate layer of functions anyway.
Another detail which I didn't like that much was libmnl's idiom of
creating netlink messages on base of just a plain buffer and using
mnl_nlmsg_put_header() et al. to populate it with data. I'm probably a
bit biased since I did the conversion to c99-style initializers for the
various struct req data types, but I didn't like the added run-time
overhead to achieve just the same.

So in summary, given that very little change happens to iproute2's
internal libnetlink, I don't see much urge to make it use libmnl as
backend. In my opinion it just adds another potential source of errors.

Eventually this should be a maintainer level decision, though. :)

Cheers, Phil

^ permalink raw reply

* Re: [PATCH] aquantia: Fix "ethtool -S" crash when adapter down.
From: David Arcari @ 2017-05-04 20:39 UTC (permalink / raw)
  To: Pavel Belous, David S . Miller; +Cc: netdev, Lino Sanfilippo, Simon Edelhaus
In-Reply-To: <9ea5a2ba21b4f8b22a63acf39135a41e2f1da23c.1493928470.git.pavel.belous@aquantia.com>

On 05/04/2017 04:10 PM, Pavel Belous wrote:
> From: Pavel Belous <pavel.belous@aquantia.com>
> 
> This patch fixes the crash that happens when driver tries to collect statistics
> from already released "aq_vec" object.
> If adapter is in "down" state we still allow user to see statistics from HW.
> 
> V2: fixed braces around "aq_vec_free".
> 
> Fixes: 97bde5c4f909 ("net: ethernet: aquantia: Support for NIC-specific code")
> Signed-off-by: Pavel Belous <pavel.belous@aquantia.com>
> ---
>  drivers/net/ethernet/aquantia/atlantic/aq_nic.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> index cdb0299..9ee1c50 100644
> --- a/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> +++ b/drivers/net/ethernet/aquantia/atlantic/aq_nic.c
> @@ -755,7 +755,7 @@ void aq_nic_get_stats(struct aq_nic_s *self, u64 *data)
>  	count = 0U;
>  
>  	for (i = 0U, aq_vec = self->aq_vec[0];
> -		self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
> +		aq_vec && self->aq_vecs > i; ++i, aq_vec = self->aq_vec[i]) {
>  		data += count;
>  		aq_vec_get_sw_stats(aq_vec, data, &count);
>  	}
> @@ -959,8 +959,10 @@ void aq_nic_free_hot_resources(struct aq_nic_s *self)
>  		goto err_exit;
>  
>  	for (i = AQ_DIMOF(self->aq_vec); i--;) {
> -		if (self->aq_vec[i])
> +		if (self->aq_vec[i]) {
>  			aq_vec_free(self->aq_vec[i]);
> +			self->aq_vec[i] = NULL;
> +		}
>  	}
>  
>  err_exit:;
> 

Resolves the ethtool crash.

Tested-by: David Arcari <darcari@redhat.com>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox