Netdev List
* llvm-objdump...
From: David Miller @ 2017-04-25 17:13 UTC
  To: ast; +Cc: daniel, netdev


I think there are some endianness issues ;-)

davem@patience:~/src/GIT/net-next/tools/testing/selftests/bpf$ llvm-objdump -S x.o

x.o:    file format ELF64-BPF

Disassembly of section test1:
process:
       0:       b7 00 00 00 00 00 00 02         r0 = 33554432
       1:       61 21 00 50 00 00 00 00         r1 = *(u32 *)(r2 + 20480)

That first instruction should be "r0 = 2"
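
A small worked example of the arithmetic behind this report. The sketch
below is plain userspace C, and the helper names (read_le32() and
friends) are made up for illustration. It decodes the immediate and
offset bytes from the dump above in both byte orders, assuming the usual
eBPF layout of one opcode byte, one register byte, a 16-bit offset and a
32-bit immediate. It only shows that the printed values match a
little-endian reading of bytes that make sense as big-endian; it does
not claim to show where in llvm-objdump the mix-up happens.

```c
#include <assert.h>
#include <stdint.h>

/* Read 4 bytes as a 32-bit value, least significant byte first. */
static uint32_t read_le32(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Read 4 bytes as a 32-bit value, most significant byte first. */
static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Same idea for the 16-bit offset field. */
static uint16_t read_le16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint16_t read_be16(const uint8_t *p)
{
	return (uint16_t)((p[0] << 8) | p[1]);
}

/* The two instructions from the dump, byte for byte. */
static const uint8_t insn0[8] = { 0xb7, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x02 };
static const uint8_t insn1[8] = { 0x61, 0x21, 0x00, 0x50, 0x00, 0x00, 0x00, 0x00 };

/*
 * Immediate of insn0 (bytes 4..7):
 *   read_le32(insn0 + 4) == 33554432   (what the dump printed)
 *   read_be32(insn0 + 4) == 2          (what was expected)
 * Offset of insn1 (bytes 2..3):
 *   read_le16(insn1 + 2) == 20480      (what the dump printed)
 *   read_be16(insn1 + 2) == 80
 */
```

Read this way, the 20480 offset in the second instruction is
byte-swapped in the same way as the immediate in the first one.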


* [PATCH net-next 00/10] tcp: do not use tcp_time_stamp for rcv autotuning
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet

Some devices or Linux distributions use HZ=100 or HZ=250.

TCP receive buffer autotuning behaves poorly with these settings.
Since autotuning happens only after 4 ms or 10 ms (one tick at those
HZ values), short-distance flows get their receive buffer tuned to a
very high value, but only after an initial period during which it stays
frozen at a (too small) initial value.
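
To make the granularity problem concrete, here is a minimal userspace
sketch of what a tick-based clock does to a short interval. None of
these helpers exist in the kernel; usecs_per_tick(), tick_at() and
elapsed_ticks() are hypothetical names, and the model simply assumes
the clock advances once every 1/HZ seconds.

```c
#include <assert.h>
#include <stdint.h>

/* Length of one tick in microseconds for a given HZ. */
static uint32_t usecs_per_tick(uint32_t hz)
{
	assert(hz > 0);
	return 1000000u / hz;
}

/* Value a tick counter would show at absolute time t_us. */
static uint64_t tick_at(uint64_t t_us, uint32_t hz)
{
	return t_us / usecs_per_tick(hz);
}

/* Interval between two instants as seen through the tick counter. */
static uint64_t elapsed_ticks(uint64_t start_us, uint64_t end_us, uint32_t hz)
{
	return tick_at(end_us, hz) - tick_at(start_us, hz);
}

/*
 * With HZ=100 one tick is 10000 usec, with HZ=250 it is 4000 usec.
 * A 69 usec interval is therefore seen as either 0 ticks or 1 tick
 * (10 ms at HZ=100), depending only on where it falls relative to a
 * tick boundary:
 *   elapsed_ticks(1000, 1069, 100)  == 0
 *   elapsed_ticks(9990, 10059, 100) == 1
 */
```

Either answer is far from the real 69 usec, which is why a sub-tick RTT
cannot drive autotuning sensibly when the only clock is jiffies.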

With BBR (or another congestion control that allows a larger BDP), we
are willing to increase tcp_rmem[2], but this receive autotuning defect
is a blocker for hosts dealing with gazillions of TCP flows in data
centers, since many of those flows have an inflated RCVBUF. The risk of
OOM is too high.

Note that TSO autodefer, TCP cubic, and TCP TS options (RFC 7323)
also suffer from our dependency on jiffies (via tcp_time_stamp).

We have ongoing efforts to improve all that in the future.

Eric Dumazet (10):
  tcp: add tp->tcp_mstamp field
  tcp: do not pass timestamp to tcp_rack_detect_loss()
  tcp: do not pass timestamp to tcp_rack_mark_lost()
  tcp: do not pass timestamp to tcp_rack_identify_loss()
  tcp: do not pass timestamp to tcp_fastretrans_alert()
  tcp: do not pass timestamp to tcp_rate_gen()
  tcp: do not pass timestamp to tcp_rack_advance()
  tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue()
  tcp: remove ack_time from struct tcp_sacktag_state
  tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps

 include/linux/tcp.h     | 13 +++++-----
 include/net/tcp.h       |  7 +++--
 net/ipv4/tcp.c          |  2 +-
 net/ipv4/tcp_input.c    | 69 +++++++++++++++++++++++--------------------------
 net/ipv4/tcp_rate.c     |  7 ++---
 net/ipv4/tcp_recovery.c | 18 +++++--------
 6 files changed, 55 insertions(+), 61 deletions(-)

-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 01/10] tcp: add tp->tcp_mstamp field
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

We want to use precise timestamps in the TCP stack, but we do not
want to call possibly expensive kernel time services too often.

tp->tcp_mstamp is guaranteed to be updated once per incoming packet.

We will use it in the following patches, removing specific
skb_mstamp_get() calls, and removing ack_time from
struct tcp_sacktag_state.
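
The idea can be sketched outside the kernel as follows. This is not the
kernel code: struct conn, read_clock_us() and the fake clock are
hypothetical stand-ins, used only to show that one clock read per
packet can serve every later computation made while processing that
packet.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-connection state; stands in for struct tcp_sock. */
struct conn {
	uint64_t mstamp_us;	/* time of the most recent packet */
};

/* Fake clock so the example is deterministic. */
static uint64_t fake_clock_us;
static unsigned int clock_reads;

/* Stands in for a possibly expensive time service. */
static uint64_t read_clock_us(void)
{
	clock_reads++;
	return fake_clock_us;
}

/* Called once per incoming packet: refresh the cached timestamp. */
static void conn_rx_packet(struct conn *c)
{
	c->mstamp_us = read_clock_us();
}

/* Helpers use the cached value instead of reading the clock again. */
static uint64_t conn_elapsed_us(const struct conn *c, uint64_t sent_us)
{
	return c->mstamp_us - sent_us;
}

/* Process one packet, compute two deltas, report the clock reads. */
static unsigned int example_clock_reads_for_one_packet(void)
{
	struct conn c = { 0 };
	uint64_t total;

	clock_reads = 0;
	fake_clock_us = 1000;
	conn_rx_packet(&c);
	total = conn_elapsed_us(&c, 400) + conn_elapsed_us(&c, 900);
	assert(total == 700);	/* 600 + 100 */
	return clock_reads;
}
```

In this sketch example_clock_reads_for_one_packet() returns 1 no matter
how many helpers consume the timestamp, which is the property the
following patches rely on when they drop their own timestamp arguments.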

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/tcp.h  | 1 +
 net/ipv4/tcp_input.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index cbe5b602a2d349fdeb1e878305f37b4da1e6cc86..99a22f44c32e1587a6bf4835b65c7a4314807aa8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -240,6 +240,7 @@ struct tcp_sock {
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
 
 /* RTT measurement */
+	struct skb_mstamp tcp_mstamp; /* most recent packet received/sent */
 	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
 	u32	mdev_us;	/* medium deviation			*/
 	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5af2f04f885914491a7116c20056b3d2188d2d7d..bd18c65df4a9d9c2b66d8005f2cc4ff468140a73 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5362,6 +5362,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	skb_mstamp_get(&tp->tcp_mstamp);
 	if (unlikely(!sk->sk_rx_dst))
 		inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
 	/*
@@ -5922,6 +5923,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 
 	case TCP_SYN_SENT:
 		tp->rx_opt.saw_tstamp = 0;
+		skb_mstamp_get(&tp->tcp_mstamp);
 		queued = tcp_rcv_synsent_state_process(sk, skb, th);
 		if (queued >= 0)
 			return queued;
@@ -5933,6 +5935,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		return 0;
 	}
 
+	skb_mstamp_get(&tp->tcp_mstamp);
 	tp->rx_opt.saw_tstamp = 0;
 	req = tp->fastopen_rsk;
 	if (req) {
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 03/10] tcp: do not pass timestamp to tcp_rack_mark_lost()
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

The timestamp argument is no longer used, since tcp_rack_detect_loss()
takes the timestamp from tp->tcp_mstamp.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h       | 2 +-
 net/ipv4/tcp_input.c    | 2 +-
 net/ipv4/tcp_recovery.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index da28bef1d82b6773bbfcf7c7eafebb7a4932f25b..8b4433c4aaa221b83af90d2b44ba4b01a872a7af 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1853,7 +1853,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
-extern void tcp_rack_mark_lost(struct sock *sk, const struct skb_mstamp *now);
+extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 			     const struct skb_mstamp *xmit_time,
 			     const struct skb_mstamp *ack_time);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bd18c65df4a9d9c2b66d8005f2cc4ff468140a73..d4885f7a6a930ff1794b0ab931c0b73c274371b2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2769,7 +2769,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag,
 	if (sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION) {
 		u32 prior_retrans = tp->retrans_out;
 
-		tcp_rack_mark_lost(sk, ack_time);
+		tcp_rack_mark_lost(sk);
 		if (prior_retrans > tp->retrans_out)
 			*ack_flag |= FLAG_LOST_RETRANS;
 	}
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index fdac262e277b2f25492f155bbb295d6d87e31d02..6ca8b5d9d803d872ec7043b02c72fffaec5c7270 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -104,7 +104,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 	}
 }
 
-void tcp_rack_mark_lost(struct sock *sk, const struct skb_mstamp *now)
+void tcp_rack_mark_lost(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 timeout;
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 04/10] tcp: do not pass timestamp to tcp_rack_identify_loss()
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

The timestamp argument is no longer used now that tp->tcp_mstamp holds
the information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d4885f7a6a930ff1794b0ab931c0b73c274371b2..99b0d65de169a13679477f49f3733ca00c842090 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2760,8 +2760,7 @@ static bool tcp_try_undo_partial(struct sock *sk, const int acked)
 	return false;
 }
 
-static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag,
-				   const struct skb_mstamp *ack_time)
+static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -2857,11 +2856,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 			tcp_try_keep_open(sk);
 			return;
 		}
-		tcp_rack_identify_loss(sk, ack_flag, ack_time);
+		tcp_rack_identify_loss(sk, ack_flag);
 		break;
 	case TCP_CA_Loss:
 		tcp_process_loss(sk, flag, is_dupack, rexmit);
-		tcp_rack_identify_loss(sk, ack_flag, ack_time);
+		tcp_rack_identify_loss(sk, ack_flag);
 		if (!(icsk->icsk_ca_state == TCP_CA_Open ||
 		      (*ack_flag & FLAG_LOST_RETRANS)))
 			return;
@@ -2877,7 +2876,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
 			tcp_try_undo_dsack(sk);
 
-		tcp_rack_identify_loss(sk, ack_flag, ack_time);
+		tcp_rack_identify_loss(sk, ack_flag);
 		if (!tcp_time_to_recover(sk, flag)) {
 			tcp_try_to_open(sk, flag);
 			return;
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 05/10] tcp: do not pass timestamp to tcp_fastretrans_alert()
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

The timestamp argument is no longer used now that tp->tcp_mstamp holds
the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99b0d65de169a13679477f49f3733ca00c842090..68094aa8cfb2ee2dc6939ea1931277b745deae4a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2787,8 +2787,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
  * tcp_xmit_retransmit_queue().
  */
 static void tcp_fastretrans_alert(struct sock *sk, const int acked,
-				  bool is_dupack, int *ack_flag, int *rexmit,
-				  const struct skb_mstamp *ack_time)
+				  bool is_dupack, int *ack_flag, int *rexmit)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3646,8 +3645,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	if (tcp_ack_is_dubious(sk, flag)) {
 		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
-		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit,
-				      &sack_state.ack_time);
+		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
 	}
 	if (tp->tlp_high_seq)
 		tcp_process_tlp_ack(sk, ack, flag);
@@ -3668,8 +3666,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 no_queue:
 	/* If data was DSACKed, see if we can undo a cwnd reduction. */
 	if (flag & FLAG_DSACKING_ACK)
-		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit,
-				      &sack_state.ack_time);
+		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
 	/* If this ack opens up a zero window, clear backoff.  It was
 	 * being used to time the probes, and is probably far higher than
 	 * it needs to be for normal retransmission.
@@ -3693,8 +3690,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		skb_mstamp_get(&sack_state.ack_time);
 		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
 						&sack_state);
-		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit,
-				      &sack_state.ack_time);
+		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
 		tcp_xmit_recovery(sk, rexmit);
 	}
 
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 02/10] tcp: do not pass timestamp to tcp_rack_detect_loss()
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

We can use tp->tcp_mstamp as it contains a recent timestamp.

This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout().
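
For readers unfamiliar with the check that consumes this timestamp, here
is a hedged userspace sketch of the computation visible in the diff
below: elapsed time since the segment was sent, compared against the
recent RTT plus the reordering window. The function names and the flat
argument list are hypothetical; "lost when remaining is negative" is an
assumption read off the code comment ("beyond the recent RTT plus the
reordering window"), not a copy of the kernel logic.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Time left before a segment sent at sent_us would be considered lost,
 * given the cached "now" (tp->tcp_mstamp in the patch).
 */
static int32_t rack_remaining_us(uint64_t now_us, uint64_t sent_us,
				 uint32_t rtt_us, uint32_t reo_wnd_us)
{
	uint32_t elapsed = (uint32_t)(now_us - sent_us);

	assert(now_us >= sent_us);
	return (int32_t)(rtt_us + reo_wnd_us - elapsed);
}

/* Assumed rule: lost once elapsed exceeds rtt + reordering window. */
static bool rack_is_lost(uint64_t now_us, uint64_t sent_us,
			 uint32_t rtt_us, uint32_t reo_wnd_us)
{
	return rack_remaining_us(now_us, sent_us, rtt_us, reo_wnd_us) < 0;
}

/*
 * With rtt = 5000 usec and reo_wnd = 1000 usec, for a segment sent at
 * t = 4000:
 *   now =  8000 -> remaining =  2000, not lost
 *   now = 10000 -> remaining =     0, not lost
 *   now = 10001 -> remaining =    -1, lost
 */
```

Because "now" is the same cached value for every segment examined, all
segments are judged against one consistent instant.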

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_recovery.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index d8acbd9f477a2ac6b0f8eee1bf59f3ab43abff07..fdac262e277b2f25492f155bbb295d6d87e31d02 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -45,8 +45,7 @@ static bool tcp_rack_sent_after(const struct skb_mstamp *t1,
  * or tcp_time_to_recover()'s "Trick#1: the loss is proven" code path will
  * make us enter the CA_Recovery state.
  */
-static void tcp_rack_detect_loss(struct sock *sk, const struct skb_mstamp *now,
-				 u32 *reo_timeout)
+static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
@@ -79,7 +78,7 @@ static void tcp_rack_detect_loss(struct sock *sk, const struct skb_mstamp *now,
 			 * A packet is lost if its elapsed time is beyond
 			 * the recent RTT plus the reordering window.
 			 */
-			u32 elapsed = skb_mstamp_us_delta(now,
+			u32 elapsed = skb_mstamp_us_delta(&tp->tcp_mstamp,
 							  &skb->skb_mstamp);
 			s32 remaining = tp->rack.rtt_us + reo_wnd - elapsed;
 
@@ -115,7 +114,7 @@ void tcp_rack_mark_lost(struct sock *sk, const struct skb_mstamp *now)
 
 	/* Reset the advanced flag to avoid unnecessary queue scanning */
 	tp->rack.advanced = 0;
-	tcp_rack_detect_loss(sk, now, &timeout);
+	tcp_rack_detect_loss(sk, &timeout);
 	if (timeout) {
 		timeout = usecs_to_jiffies(timeout + TCP_REO_TIMEOUT_MIN);
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_REO_TIMEOUT,
@@ -165,12 +164,10 @@ void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 void tcp_rack_reo_timeout(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct skb_mstamp now;
 	u32 timeout, prior_inflight;
 
-	skb_mstamp_get(&now);
 	prior_inflight = tcp_packets_in_flight(tp);
-	tcp_rack_detect_loss(sk, &now, &timeout);
+	tcp_rack_detect_loss(sk, &timeout);
 	if (prior_inflight != tcp_packets_in_flight(tp)) {
 		if (inet_csk(sk)->icsk_ca_state != TCP_CA_Recovery) {
 			tcp_enter_recovery(sk, false);
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 07/10] tcp: do not pass timestamp to tcp_rack_advance()
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

No longer needed, since tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h       | 3 +--
 net/ipv4/tcp_input.c    | 6 ++----
 net/ipv4/tcp_recovery.c | 5 ++---
 3 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d7aae25efc7f9664a482ce50974a2d79f7fc8e0c..270e5cc43c99e7030e95af218095cf9f283950bc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1855,8 +1855,7 @@ void tcp_init(void);
 /* tcp_recovery.c */
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
-			     const struct skb_mstamp *xmit_time,
-			     const struct skb_mstamp *ack_time);
+			     const struct skb_mstamp *xmit_time);
 extern void tcp_rack_reo_timeout(struct sock *sk);
 
 /*
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2d84483de2e10ab10d4906f2cca01d76da83dc06..5485204853d3aefaea5027bcf6480ed20a1e8efa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1214,8 +1214,7 @@ static u8 tcp_sacktag_one(struct sock *sk,
 		return sacked;
 
 	if (!(sacked & TCPCB_SACKED_ACKED)) {
-		tcp_rack_advance(tp, sacked, end_seq,
-				 xmit_time, &state->ack_time);
+		tcp_rack_advance(tp, sacked, end_seq, xmit_time);
 
 		if (sacked & TCPCB_SACKED_RETRANS) {
 			/* If the segment is not tagged as lost,
@@ -3118,8 +3117,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			tp->delivered += acked_pcount;
 			if (!tcp_skb_spurious_retrans(tp, skb))
 				tcp_rack_advance(tp, sacked, scb->end_seq,
-						 &skb->skb_mstamp,
-						 &sack->ack_time);
+						 &skb->skb_mstamp);
 		}
 		if (sacked & TCPCB_LOST)
 			tp->lost_out -= acked_pcount;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 6ca8b5d9d803d872ec7043b02c72fffaec5c7270..cd72b3d3879e88181c8a4639f0334a24e4cda852 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -127,8 +127,7 @@ void tcp_rack_mark_lost(struct sock *sk)
  * draft-cheng-tcpm-rack-00.txt
  */
 void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
-		      const struct skb_mstamp *xmit_time,
-		      const struct skb_mstamp *ack_time)
+		      const struct skb_mstamp *xmit_time)
 {
 	u32 rtt_us;
 
@@ -137,7 +136,7 @@ void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 				 end_seq, tp->rack.end_seq))
 		return;
 
-	rtt_us = skb_mstamp_us_delta(ack_time, xmit_time);
+	rtt_us = skb_mstamp_us_delta(&tp->tcp_mstamp, xmit_time);
 	if (sacked & TCPCB_RETRANS) {
 		/* If the sacked packet was retransmitted, it's ambiguous
 		 * whether the retransmission or the original (or the prior
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 06/10] tcp: do not pass timestamp to tcp_rate_gen()
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

No longer needed, since tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.
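
As a reading aid for the tcp_rate.c hunk below, this is a minimal
userspace sketch of the interval computation it touches: the sample
interval is the longer of the send phase and the ack phase, with the ack
phase now measured from the cached timestamp. rate_interval_us() and its
parameters are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/*
 * snd_us:   length of the send phase, already known to the caller
 * now_us:   cached time of the current ACK (tp->tcp_mstamp in the patch)
 * prior_us: time of the earlier delivery the sample is measured against
 */
static uint32_t rate_interval_us(uint32_t snd_us, uint64_t now_us,
				 uint64_t prior_us)
{
	uint32_t ack_us = (uint32_t)(now_us - prior_us);	/* ack phase */

	assert(now_us >= prior_us);
	return snd_us > ack_us ? snd_us : ack_us;
}

/*
 * rate_interval_us(300, 1000, 600) == 400   (ack phase is longer)
 * rate_interval_us(500, 1000, 600) == 500   (send phase is longer)
 */
```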

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h    | 2 +-
 net/ipv4/tcp_input.c | 3 +--
 net/ipv4/tcp_rate.c  | 7 ++++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8b4433c4aaa221b83af90d2b44ba4b01a872a7af..d7aae25efc7f9664a482ce50974a2d79f7fc8e0c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1004,7 +1004,7 @@ void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb);
 void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
 			    struct rate_sample *rs);
 void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
-		  struct skb_mstamp *now, struct rate_sample *rs);
+		  struct rate_sample *rs);
 void tcp_rate_check_app_limited(struct sock *sk);
 
 /* These functions determine how the current flow behaves in respect of SACK
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 68094aa8cfb2ee2dc6939ea1931277b745deae4a..2d84483de2e10ab10d4906f2cca01d76da83dc06 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3657,8 +3657,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		tcp_schedule_loss_probe(sk);
 	delivered = tp->delivered - delivered;	/* freshly ACKed or SACKed */
 	lost = tp->lost - lost;			/* freshly marked lost */
-	tcp_rate_gen(sk, delivered, lost, &sack_state.ack_time,
-		     sack_state.rate);
+	tcp_rate_gen(sk, delivered, lost, sack_state.rate);
 	tcp_cong_control(sk, ack, delivered, flag, sack_state.rate);
 	tcp_xmit_recovery(sk, rexmit);
 	return 1;
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
index 9be1581a5a08c36f4544fbdabedd9741fb266a1e..c6a9fa8946462100947ab62d86464ff8f99565c2 100644
--- a/net/ipv4/tcp_rate.c
+++ b/net/ipv4/tcp_rate.c
@@ -106,7 +106,7 @@ void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
 
 /* Update the connection delivery information and generate a rate sample. */
 void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
-		  struct skb_mstamp *now, struct rate_sample *rs)
+		  struct rate_sample *rs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 snd_us, ack_us;
@@ -120,7 +120,7 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
 	 * to carry current time, flags, stats like "tcp_sacktag_state".
 	 */
 	if (delivered)
-		tp->delivered_mstamp = *now;
+		tp->delivered_mstamp = tp->tcp_mstamp;
 
 	rs->acked_sacked = delivered;	/* freshly ACKed or SACKed */
 	rs->losses = lost;		/* freshly marked lost */
@@ -138,7 +138,8 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
 	 * longer phase.
 	 */
 	snd_us = rs->interval_us;				/* send phase */
-	ack_us = skb_mstamp_us_delta(now, &rs->prior_mstamp);	/* ack phase */
+	ack_us = skb_mstamp_us_delta(&tp->tcp_mstamp,
+				     &rs->prior_mstamp); /* ack phase */
 	rs->interval_us = max(snd_us, ack_us);
 
 	/* Normally we expect interval_us >= min-rtt.
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 08/10] tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue()
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

The following patch will remove ack_time from struct tcp_sacktag_state.

The same information is now found in tp->tcp_mstamp.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5485204853d3aefaea5027bcf6480ed20a1e8efa..f4e1836c696c9a45a545a32de8ba62ce3bd0dc12 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3056,8 +3056,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct skb_mstamp first_ackt, last_ackt;
-	struct skb_mstamp *now = &sack->ack_time;
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct skb_mstamp *now = &tp->tcp_mstamp;
 	u32 prior_sacked = tp->sacked_out;
 	u32 reord = tp->packets_out;
 	bool fully_acked = true;
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 09/10] tcp: remove ack_time from struct tcp_sacktag_state
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

It is no longer needed; everything uses tp->tcp_mstamp instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f4e1836c696c9a45a545a32de8ba62ce3bd0dc12..f475f0b53bfe4cb67c19b7f30d9d68bd703ff23b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1131,7 +1131,6 @@ struct tcp_sacktag_state {
 	 */
 	struct skb_mstamp first_sackt;
 	struct skb_mstamp last_sackt;
-	struct skb_mstamp ack_time; /* Timestamp when the S/ACK was received */
 	struct rate_sample *rate;
 	int	flag;
 };
@@ -3572,8 +3571,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, tp->snd_nxt))
 		goto invalid_ack;
 
-	skb_mstamp_get(&sack_state.ack_time);
-
 	if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
 		tcp_rearm_rto(sk);
 
@@ -3684,7 +3681,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	 * If data was DSACKed, see if we can undo a cwnd reduction.
 	 */
 	if (TCP_SKB_CB(skb)->sacked) {
-		skb_mstamp_get(&sack_state.ack_time);
 		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
 						&sack_state);
 		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
-- 
2.13.0.rc0.306.g87b477812d-goog


* [PATCH net-next 10/10] tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps
From: Eric Dumazet @ 2017-04-25 17:15 UTC
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

Some devices or distributions use HZ=100 or HZ=250.

TCP receive buffer autotuning behaves poorly with these settings.
Since autotuning happens only after 4 ms or 10 ms (one tick at those
HZ values), short-distance flows get their receive buffer tuned to a
very high value, but only after an initial period during which it stays
frozen at a (too small) initial value.

With the introduction of tp->tcp_mstamp, we can switch to high resolution
timestamps almost for free (at the expense of 8 additional bytes per
TCP structure).

Note that some TCP stacks use usec TCP timestamps, where this
patch makes even more sense: many TCP flows have < 500 usec RTT.
Hopefully this finer TS option can be standardized soon.

Tested:
 HZ=100 kernel
 ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &

 Peer without patch :
 lpaa24:~# ss -tmi dst lpaa23
 ...
 skmem:(r0,rb8388608,...)
 rcv_rtt:10 rcv_space:3210000 minrtt:0.017

 Peer with the patch :
 lpaa23:~# ss -tmi dst lpaa24
 ...
 skmem:(r0,rb428800,...)
 rcv_rtt:0.069 rcv_space:30000 minrtt:0.017

We can see saner RCVBUF, and more precise rcv_rtt information.
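
The rcv_rtt figure above is a smoothed estimate. The diff below shows
that the stored value is seeded with "m << 3" and reported with ">> 3",
i.e. kept at 8 times the estimate. The sketch that follows is an
assumption about how such a scaled estimate is typically updated (a
plain moving average with gain 1/8); only the scaling is taken from the
patch, and rtt8_update() and rtt8_to_us() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * rtt8_us holds 8 * smoothed RTT in microseconds (0 means "no sample
 * yet"). Returns the new scaled value after folding in sample_us.
 */
static uint32_t rtt8_update(uint32_t rtt8_us, uint32_t sample_us)
{
	long m = (long)sample_us;

	if (m == 0)
		m = 1;			/* never record a zero RTT */
	if (rtt8_us == 0)
		return (uint32_t)(m << 3);	/* first sample seeds the estimate */

	m -= (long)(rtt8_us >> 3);	/* error against current estimate */
	return (uint32_t)((long)rtt8_us + m);	/* est += error / 8, scaled by 8 */
}

/* Value as it would be reported to userspace. */
static uint32_t rtt8_to_us(uint32_t rtt8_us)
{
	return rtt8_us >> 3;
}

/*
 * First sample of 69 usec:  rtt8 = 552, reported 69.
 * Next sample of 85 usec:   rtt8 = 568, reported 71
 *                           (7/8 * 69 + 1/8 * 85 = 71).
 */
```

With microsecond samples the estimate can settle on values like 69 usec;
with samples that are whole ticks it cannot go below one tick.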

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/linux/tcp.h  | 12 ++++++------
 net/ipv4/tcp.c       |  2 +-
 net/ipv4/tcp_input.c | 28 +++++++++++++++++-----------
 3 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 99a22f44c32e1587a6bf4835b65c7a4314807aa8..b6d5adcee8fcb611de202993623cc80274d262e4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -333,16 +333,16 @@ struct tcp_sock {
 
 /* Receiver side RTT estimation */
 	struct {
-		u32	rtt;
-		u32	seq;
-		u32	time;
+		u32		rtt_us;
+		u32		seq;
+		struct skb_mstamp time;
 	} rcv_rtt_est;
 
 /* Receiver queue space */
 	struct {
-		int	space;
-		u32	seq;
-		u32	time;
+		int		space;
+		u32		seq;
+		struct skb_mstamp time;
 	} rcvq_space;
 
 /* TCP-specific MTU probe information. */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index efc976ae66ae5b82d496323634c3030fb71c6c92..059dad7deefe883bd3a26c93f27637dc22ccefda 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2853,7 +2853,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_snd_ssthresh = tp->snd_ssthresh;
 	info->tcpi_advmss = tp->advmss;
 
-	info->tcpi_rcv_rtt = jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3;
+	info->tcpi_rcv_rtt = tp->rcv_rtt_est.rtt_us >> 3;
 	info->tcpi_rcv_space = tp->rcvq_space.space;
 
 	info->tcpi_total_retrans = tp->total_retrans;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f475f0b53bfe4cb67c19b7f30d9d68bd703ff23b..9739962bfb3fd2d39cb13f643def223f4f17fcb6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -442,7 +442,8 @@ void tcp_init_buffer_space(struct sock *sk)
 		tcp_sndbuf_expand(sk);
 
 	tp->rcvq_space.space = tp->rcv_wnd;
-	tp->rcvq_space.time = tcp_time_stamp;
+	skb_mstamp_get(&tp->tcp_mstamp);
+	tp->rcvq_space.time = tp->tcp_mstamp;
 	tp->rcvq_space.seq = tp->copied_seq;
 
 	maxwin = tcp_full_space(sk);
@@ -518,7 +519,7 @@ EXPORT_SYMBOL(tcp_initialize_rcv_mss);
  */
 static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
 {
-	u32 new_sample = tp->rcv_rtt_est.rtt;
+	u32 new_sample = tp->rcv_rtt_est.rtt_us;
 	long m = sample;
 
 	if (m == 0)
@@ -548,21 +549,23 @@ static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
 		new_sample = m << 3;
 	}
 
-	if (tp->rcv_rtt_est.rtt != new_sample)
-		tp->rcv_rtt_est.rtt = new_sample;
+	tp->rcv_rtt_est.rtt_us = new_sample;
 }
 
 static inline void tcp_rcv_rtt_measure(struct tcp_sock *tp)
 {
-	if (tp->rcv_rtt_est.time == 0)
+	u32 delta_us;
+
+	if (tp->rcv_rtt_est.time.v64 == 0)
 		goto new_measure;
 	if (before(tp->rcv_nxt, tp->rcv_rtt_est.seq))
 		return;
-	tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rcv_rtt_est.time, 1);
+	delta_us = skb_mstamp_us_delta(&tp->tcp_mstamp, &tp->rcv_rtt_est.time);
+	tcp_rcv_rtt_update(tp, delta_us, 1);
 
 new_measure:
 	tp->rcv_rtt_est.seq = tp->rcv_nxt + tp->rcv_wnd;
-	tp->rcv_rtt_est.time = tcp_time_stamp;
+	tp->rcv_rtt_est.time = tp->tcp_mstamp;
 }
 
 static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
@@ -572,7 +575,10 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
 	if (tp->rx_opt.rcv_tsecr &&
 	    (TCP_SKB_CB(skb)->end_seq -
 	     TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss))
-		tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rx_opt.rcv_tsecr, 0);
+		tcp_rcv_rtt_update(tp,
+				   jiffies_to_usecs(tcp_time_stamp -
+						    tp->rx_opt.rcv_tsecr),
+				   0);
 }
 
 /*
@@ -585,8 +591,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	int time;
 	int copied;
 
-	time = tcp_time_stamp - tp->rcvq_space.time;
-	if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+	time = skb_mstamp_us_delta(&tp->tcp_mstamp, &tp->rcvq_space.time);
+	if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
 		return;
 
 	/* Number of bytes copied to user in last RTT */
@@ -642,7 +648,7 @@ void tcp_rcv_space_adjust(struct sock *sk)
 
 new_measure:
 	tp->rcvq_space.seq = tp->copied_seq;
-	tp->rcvq_space.time = tcp_time_stamp;
+	tp->rcvq_space.time = tp->tcp_mstamp;
 }
 
 /* There is something which you must keep in mind when you analyze the
-- 
2.13.0.rc0.306.g87b477812d-goog


* Re: Re: [Intel-wired-lan] [PATCH] ixgbe: initialize u64_stats_sync structures early at ixgbe_probe
From: Alexander Duyck @ 2017-04-25 17:16 UTC
  To: Lino Sanfilippo
  Cc: Singh, Krishneil K, Song, Liwei (Wind River), Kirsher, Jeffrey T,
	netdev@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <trinity-a348af9b-9ef5-47a9-a14e-94265c880cf1-1493134775614@3capp-gmx-bs78>

On Tue, Apr 25, 2017 at 8:39 AM, Lino Sanfilippo <LinoSanfilippo@gmx.de> wrote:
> Hi,
>
>> This patch doesn't look right to me. I would suggest rejecting it.
>>
>> The call to initialize the stats should be done when the ring is
>> allocated, not in ixgbe_probe(). This should probably be done in
>> ixgbe_alloc_q_vector() instead.
>>
>
> AFAICS ixgbe_alloc_q_vector() is also called in probe() (by ixgbe_init_interrupt_scheme()).
> Furthermore it is also called in resume() which would lead to multiple initialization of
> the u64_stats_sync in case of resume.

ixgbe_alloc_q_vector() is what allocates the ring structures that are
being initialized here. Calling it anywhere other than here doesn't
make sense since what we want to do is initialize the memory after we
have allocated it, but before we hand the pointer to it over to a
netdev or in this case an adapter structure.

> IMHO the u64_stats_sync variables have to be initialized before register_netdev() is called
> since this is the point from which userspace can call ixgbe_get_stats64(). I would say the
> best place to do so is the probe() function as it is done in this patch.

I would disagree here. We should be initializing the stats variables
after we allocate them. Especially since we can end up freeing and
reallocating them any time the number of queues is changed.

> Just my 2 cents.
>
> Regards,
> Lino

My advice would be to look through the code and verify what it is you
need to initialize and where it should happen. In this case we are
getting a lockdep splat since we are just letting things get
initialized with kzalloc and aren't following up in the right place. I
don't disagree that the original code has the u64_stats_init in the
wrong place since we can open/close the interface and trigger a
reinitialization of the stats. I would say we need to initialize the
stats just after we allocate them in memory so that if we decide to
free and reallocate the rings we initialize the new rings before they
are added to the netdev and don't reintroduce this issue in just a
different form.
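
To illustrate the ordering argued for above, here is a minimal userspace sketch, not the ixgbe code. The names struct ring, struct stats_sync, ring_alloc() and ring_realloc() are hypothetical stand-ins, with calloc() playing the role of kzalloc() and stats_sync_init() playing the role of u64_stats_init(). The point is only that the init call lives next to the allocation, so a free/reallocate cycle cannot produce an uninitialized ring.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct u64_stats_sync. */
struct stats_sync {
	bool initialized;
};

/* Hypothetical stand-in for a ring; only the stats members matter here. */
struct ring {
	struct stats_sync syncp;
	unsigned long long packets;
};

/* Stand-in for u64_stats_init(): marks the sync object as set up. */
static void stats_sync_init(struct stats_sync *s)
{
	s->initialized = true;
}

/*
 * Allocate a ring and initialize its stats sync object before the
 * pointer is handed to anyone else. Returns NULL on allocation failure.
 */
static struct ring *ring_alloc(void)
{
	struct ring *r = calloc(1, sizeof(*r));	/* zeroed, like kzalloc() */

	if (!r)
		return NULL;
	stats_sync_init(&r->syncp);
	return r;
}

/* Free and reallocate, as happens when the number of queues changes. */
static struct ring *ring_realloc(struct ring *old)
{
	free(old);
	return ring_alloc();
}

/* A reader (think of a get_stats64 handler) relies on the init having run. */
static unsigned long long ring_read_packets(const struct ring *r)
{
	assert(r->syncp.initialized);
	return r->packets;
}
```

Because every path that creates a ring goes through ring_alloc(), the reader never sees a sync object that was only zeroed.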

- Alex

^ permalink raw reply

* Re: [PATCH v4 net] net: ipv6: regenerate host route if moved to gc list
From: Martin KaFai Lau @ 2017-04-25 17:18 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, dvyukov, andreyknvl, mmanning, eric.dumazet
In-Reply-To: <1493137049-16465-1-git-send-email-dsa@cumulusnetworks.com>

On Tue, Apr 25, 2017 at 09:17:29AM -0700, David Ahern wrote:
[...]
>
> All of those faults are fixed by regenerating the host route if the
> existing one has been moved to the gc list, something that can be
> determined by checking if the rt6i_ref counter is 0.
Acked-by: Martin KaFai Lau <kafai@fb.com>

>
> Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>

> ---
> v4
> - move 'prev = ifp->rt;' under spinlock as requested by Eric
>
> v3
> - removed 'if (prev)' and just call ip6_rt_put; added comment about spinlock
>
> v2
> - change ifp->rt under spinlock vs cmpxchg
> - add comment about rt6i_ref == 0
>
>  net/ipv6/addrconf.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 80ce478c4851..0ea96c4d334d 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -3271,14 +3271,24 @@ static void addrconf_gre_config(struct net_device *dev)
>  static int fixup_permanent_addr(struct inet6_dev *idev,
>  				struct inet6_ifaddr *ifp)
>  {
> -	if (!ifp->rt) {
> -		struct rt6_info *rt;
> +	/* rt6i_ref == 0 means the host route was removed from the
> +	 * FIB, for example, if 'lo' device is taken down. In that
> +	 * case regenerate the host route.
> +	 */
> +	if (!ifp->rt || !atomic_read(&ifp->rt->rt6i_ref)) {
> +		struct rt6_info *rt, *prev;
>
>  		rt = addrconf_dst_alloc(idev, &ifp->addr, false);
>  		if (unlikely(IS_ERR(rt)))
>  			return PTR_ERR(rt);
>
> +		/* ifp->rt can be accessed outside of rtnl */
> +		spin_lock(&ifp->lock);
> +		prev = ifp->rt;
>  		ifp->rt = rt;
> +		spin_unlock(&ifp->lock);
> +
> +		ip6_rt_put(prev);
>  	}
>
>  	if (!(ifp->flags & IFA_F_NOPREFIXROUTE)) {
> --
> 2.1.4
>

^ permalink raw reply

* Re: [RFC PATCH 3/7] net: add option to get information about timestamped packets
From: Willem de Bruijn @ 2017-04-25 17:23 UTC (permalink / raw)
  To: Miroslav Lichvar
  Cc: Willem de Bruijn, Network Development, Richard Cochran,
	Soheil Hassas Yeganeh, Keller, Jacob E, Denny Page, Jiri Benc
In-Reply-To: <20170425135642.GB27148@localhost>

On Tue, Apr 25, 2017 at 9:56 AM, Miroslav Lichvar <mlichvar@redhat.com> wrote:
> On Mon, Apr 24, 2017 at 11:18:13AM -0400, Willem de Bruijn wrote:
>> On Mon, Apr 24, 2017 at 5:00 AM, Miroslav Lichvar <mlichvar@redhat.com> wrote:
>> > Would "skb->data - skb->head -
>> > skb->mac_header + skb->len" always work as the L2 length for received
>> > packets at the time when the cmsg is prepared?
>>
>> (skb->data - skb->head) - skb->mac_header computes the length
>> of data before the mac, such as reserve?
>
> data - head includes the reserve, but mac_header does too, so I think
> it should be just the length of MAC header and everything up to the
> data.
>
>> Do you mean skb->data -
>> skb->mac_header (or - skb_mac_offset(skb))?
>
> That would give me a pointer? If I used skb_mac_offset(), the total
> length would be just skb->len - skb_mac_offset()?

It appears so. The only existing caller first checks
skb_mac_header_was_set(skb).
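
As a sanity check of the arithmetic discussed above, here is a small userspace sketch. struct mock_skb and the mock_* helpers are hypothetical stand-ins that only mirror the pointer relationships of struct sk_buff (head, data, len, and mac_header stored as an offset from head). It assumes the MAC header has been set and that data has already been advanced past it, as on the receive path.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-in for struct sk_buff. */
struct mock_skb {
	unsigned char *head;		/* start of the allocated buffer */
	unsigned char *data;		/* current start of packet data */
	unsigned int len;		/* bytes from data to end of packet */
	unsigned int mac_header;	/* offset of the MAC header from head */
};

/* Mirrors skb_mac_header(): pointer to the MAC header. */
static unsigned char *mock_mac_header(const struct mock_skb *skb)
{
	return skb->head + skb->mac_header;
}

/* Mirrors skb_mac_offset(): negative once data sits past the MAC header. */
static ptrdiff_t mock_mac_offset(const struct mock_skb *skb)
{
	return mock_mac_header(skb) - skb->data;
}

/* The expression from the original question. */
static ptrdiff_t l2_len_long(const struct mock_skb *skb)
{
	return (skb->data - skb->head) - (ptrdiff_t)skb->mac_header +
	       (ptrdiff_t)skb->len;
}

/* The shorter form: skb->len - skb_mac_offset(skb). */
static ptrdiff_t l2_len_short(const struct mock_skb *skb)
{
	return (ptrdiff_t)skb->len - mock_mac_offset(skb);
}

/* Both forms must agree; return the common value. */
static ptrdiff_t l2_len(const struct mock_skb *skb)
{
	ptrdiff_t len = l2_len_short(skb);

	assert(len == l2_len_long(skb));
	return len;
}
```

With a 14-byte Ethernet header already pulled, the offset is -14 and both expressions add those 14 bytes back onto skb->len.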

^ permalink raw reply

* Re: Blogpost evaluation this [PATCH v4 net-next RFC] net: Generic XDP
From: Andy Gospodarek @ 2017-04-25 17:25 UTC (permalink / raw)
  To: David Miller; +Cc: brouer, xdp-newbies, netdev
In-Reply-To: <20170424.182643.485613135674690555.davem@davemloft.net>

On Mon, Apr 24, 2017 at 06:26:43PM -0400, David Miller wrote:
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Mon, 24 Apr 2017 16:24:05 +0200
> 
> > I've done a very detailed evaluation of this patch, and I've created a
> > blogpost like report here:
> > 
> >  https://prototype-kernel.readthedocs.io/en/latest/blogposts/xdp25_eval_generic_xdp_tx.html
> 
> Thanks for doing this Jesper.

Yes, this is excellent.  I'm not all the way through it, but I looked at
the data and can corroborate the results you are seeing.

My results for both optimized and generic XDP for
xdp_bench01_mem_access_cost --action XDP_DROP --readmem are quite
similar to yours (11.7Mpps and 7.8Mpps, respectively, for me versus
11.7Mpps and 8.4Mpps for you).

I also noted (as you did) that there is no discernible difference
running xdp_bench01_mem_access_cost with or without the --readmem
option since the packet data is already being accessed that late in the
stack.

> 
> > I didn't evaluate the adjust_head part, so I hope Andy is still
> > planning to validate that part?
> 
> I was hoping he would post some results today as well.
> 
> Andy, how goes it? :)

Sorry for the delayed response.  I was AFK yesterday, but based on
testing from Friday and what I wrapped up today all looks good to me.

On my system (i7-6700 CPU @ 3.40GHz) the reported and actual TX
throughput for xdp_tx_iptunnel is 4.6Mpps for the optimized XDP.

For generic XDP the reported throughput of xdp_tx_iptunnel is 4.6Mpps
but only ~880kpps actually on the wire.  It seems to me that can be
fixed with a follow-up for offending drivers or the stack if deemed that
there is a real error there.

> Once the basic patch is ready and integrated in we can try to do
> xmit_more in generic XDP and see what that does for XDP_TX
> performance.

Agreed.

^ permalink raw reply

* Re: [PATCH net-next 0/2] flower: add MPLS matching support
From: Benjamin LaHaise @ 2017-04-25 17:26 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: Simon Horman, Jakub Kicinski, David Miller, netdev, bcrl,
	Jiri Pirko
In-Reply-To: <1a05824d-342a-6d06-5fa4-d747a8b8460f@mojatatu.com>

On Tue, Apr 25, 2017 at 08:47:00AM -0400, Jamal Hadi Salim wrote:
> On 17-04-25 07:55 AM, Simon Horman wrote:
> [..]
> > 
> > I agree something should be done wrt BOS. If the LABEL and TC are to
> > be left as-is then I think a similar treatment of BOS - that is masking it
> > - makes sense.
> > 
> > I also agree with statements made earlier in the thread that it is unlikely
> > that the unused bits of these attributes will be used - as opposed to a
> > bitmask of flag values which seems ripe for re-use for future flags.
> > 
> 
> For your use case, I think you are fine if you just do the mask in the
> kernel. A mask to a user value implies "I am ignoring the rest
> of these bits - I don't care if you set them"
> 
> > I would like to add to the discussion that I think in future it would
> > be good to expand the features provided by this patch to support supplying
> > a mask as part of the match - as flower supports for other fields such
> > as IP addresses. But I think the current scheme of masking out invalid bits
> > should also work in conjunction with user-supplied masks.
> > 
> 
> The challenge we have right now is "users do stoopid or malicious
> things". So are you going to accept the wrong bitmap + mask?

I think rejecting bits in a mask that clearly cannot be set (like the 
bits above the lower 20 bits in an MPLS label) makes perfect sense.  It 
doesn't impact usability for the tools since they shouldn't be set (and 
actually can't be in the iproute2 changes).
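
For illustration only, the two policies being discussed (reject out-of-range bits versus silently mask them) can be sketched in plain C as below. The field widths are those of an MPLS label stack entry (20-bit label, 3-bit TC, 1-bit bottom-of-stack); the macro and function names are hypothetical and are not taken from the flower patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Field widths of an MPLS label stack entry; macro names are hypothetical. */
#define MPLS_LABEL_BITS	20
#define MPLS_TC_BITS	3
#define MPLS_BOS_BITS	1

/* True if value fits entirely in the low `bits` bits. */
static bool fits_in_bits(uint32_t value, unsigned int bits)
{
	assert(bits > 0 && bits < 32);
	return (value >> bits) == 0;
}

/* Strict policy: reject any user-supplied field with out-of-range bits set. */
static bool mpls_match_valid(uint32_t label, uint32_t tc, uint32_t bos)
{
	return fits_in_bits(label, MPLS_LABEL_BITS) &&
	       fits_in_bits(tc, MPLS_TC_BITS) &&
	       fits_in_bits(bos, MPLS_BOS_BITS);
}

/* Lenient policy: silently drop the bits above the 20-bit label. */
static uint32_t mpls_label_masked(uint32_t label)
{
	return label & ((UINT32_C(1) << MPLS_LABEL_BITS) - 1);
}
```

The strict variant surfaces a bad request to the user, while the lenient variant accepts it and matches on a different value than the one supplied.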

		-ben

> cheers,
> jamal
> 

^ permalink raw reply

* Re: Blogpost evaluation this [PATCH v4 net-next RFC] net: Generic XDP
From: David Miller @ 2017-04-25 17:31 UTC (permalink / raw)
  To: andy; +Cc: brouer, xdp-newbies, netdev
In-Reply-To: <20170425172549.GQ4730@C02RW35GFVH8.dhcp.broadcom.net>

From: Andy Gospodarek <andy@greyhouse.net>
Date: Tue, 25 Apr 2017 13:25:49 -0400

> On Mon, Apr 24, 2017 at 06:26:43PM -0400, David Miller wrote:
>> Andy, how goes it? :)
> 
> Sorry for the delayed response.  I was AFK yesterday, but based on
> testing from Friday and what I wrapped up today all looks good to me.
> 
> On my system (i7-6700 CPU @ 3.40GHz) the reported and actual TX
> throughput for xdp_tx_iptunnel is 4.6Mpps for the optimized XDP.
> 
> For generic XDP the reported throughput of xdp_tx_iptunnel is 4.6Mpps
> but only ~880kpps actually on the wire.  It seems to me that can be
> fixed with a follow-up for offending drivers or the stack if deemed that
> there is a real error there.

Ok, I'll commit the latest version with tested-by tags for you, Jesper,
and David Ahern added.

Thanks everyone.

^ permalink raw reply

* Re: [PATCH v4 net] net: ipv6: regenerate host route if moved to gc list
From: Eric Dumazet @ 2017-04-25 17:34 UTC (permalink / raw)
  To: David Ahern; +Cc: netdev, dvyukov, andreyknvl, mmanning, kafai
In-Reply-To: <1493137049-16465-1-git-send-email-dsa@cumulusnetworks.com>

On Tue, 2017-04-25 at 09:17 -0700, David Ahern wrote:
> Taking down the loopback device wreaks havoc on IPv6 routing. By
> extension, taking down a VRF device wreaks havoc on its table.
> 
> Dmitry and Andrey both reported heap out-of-bounds reports in the IPv6
> FIB code while running syzkaller fuzzer. The root cause is a dead dst
> that is on the garbage list gets reinserted into the IPv6 FIB. While on
> the gc (or perhaps when it gets added to the gc list) the dst->next is
> set to an IPv4 dst. A subsequent walk of the ipv6 tables causes the
> out-of-bounds access.
> 
> Andrey's reproducer was the key to getting to the bottom of this.
> 
> With IPv6, host routes for an address have the dst->dev set to the
> loopback device. When the 'lo' device is taken down, rt6_ifdown initiates
> a walk of the fib evicting routes with the 'lo' device which means all
> host routes are removed. That process moves the dst which is attached to
> an inet6_ifaddr to the gc list and marks it as dead.
> 
> The recent change to keep global IPv6 addresses added a new function,
> fixup_permanent_addr, that is called on admin up. That function restarts
> dad for an inet6_ifaddr and when it completes the host route attached
> to it is inserted into the fib. Since the route was marked dead and
> moved to the gc list, re-inserting the route causes the reported
> out-of-bounds accesses. If the device with the address is taken down
> or the address is removed, the WARN_ON in fib6_del is triggered.
> 
> All of those faults are fixed by regenerating the host route if the
> existing one has been moved to the gc list, something that can be
> determined by checking if the rt6i_ref counter is 0.
> 
> Fixes: f1705ec197e7 ("net: ipv6: Make address flushing on ifdown optional")
> Reported-by: Dmitry Vyukov <dvyukov@google.com>
> Reported-by: Andrey Konovalov <andreyknvl@google.com>
> Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* Bug and configuration MPLS error?
From: Алексей Болдырев @ 2017-04-25 17:28 UTC (permalink / raw)
  To: netdev
In-Reply-To: <61671493137651@web14o.yandex.ru>

In short, here's the MPLS configuration on one of the distributions:
sysctl -w net.mpls.conf.lo.input=1
sysctl -w net.mpls.platform_labels=1048575
ip link add veth0 type veth peer name veth1
ip link add veth2 type veth peer name veth3
sysctl -w net.mpls.conf.veth0.input=1
sysctl -w net.mpls.conf.veth2.input=1
ifconfig veth0 10.3.3.1 netmask 255.255.255.0
ifconfig veth2 10.4.4.1 netmask 255.255.255.0
ip netns add host1
ip netns add host2
ip link set veth1 netns host1
ip link set veth3 netns host2
ip netns exec host1 ifconfig veth1 10.3.3.2 netmask 255.255.255.0 up
ip netns exec host2 ifconfig veth3 10.4.4.2 netmask 255.255.255.0 up
ip netns exec host1 ip route add 10.10.10.2/32 encap mpls 112 via inet 10.3.3.1
ip netns exec host2 ip route add 10.10.10.1/32 encap mpls 111 via inet 10.4.4.1
ip -f mpls route add 111 via inet 10.3.3.2
ip -f mpls route add 112 via inet 10.4.4.2

Test results:
TCP over MPLS:
~ # ip netns exec host2 iperf3 -c 10.10.10.1 -B 10.10.10.2
Connecting to host 10.10.10.1, port 5201
[ 4] local 10.10.10.2 port 34021 connected to 10.10.10.1 port 5201
[ ID] Interval Transfer Bandwidth Retr Cwnd
[ 4] 0.00-1.00 sec 912 KBytes 7.46 Mbits/sec 0 636 KBytes
[ 4] 1.00-2.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 2.00-3.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 3.00-4.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 4.00-5.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 5.00-6.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 6.00-7.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 7.00-8.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 8.00-9.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
[ 4] 9.00-10.00 sec 0.00 Bytes 0.00 bits/sec 0 636 KBytes
----------------------------------------

[ ID] Interval Transfer Bandwidth Retr
[ 4] 0.00-10.00 sec 912 KBytes 747 Kbits/sec 0 sender
[ 4] 0.00-10.00 sec 21.3 KBytes 17.5 Kbits/sec receiver

iperf Done.
~ #
UDP over MPLS:
~ # ip netns exec host2 iperf3 -c 10.10.10.1 -B 10.10.10.2 -u -b 10g
Connecting to host 10.10.10.1, port 5201
[ 4] local 10.10.10.2 port 56901 connected to 10.10.10.1 port 5201
[ ID] Interval Transfer Bandwidth Total Datagrams
[ 4] 0.00-1.00 sec 438 MBytes 3.67 Gbits/sec 56049
[ 4] 1.00-2.00 sec 491 MBytes 4.12 Gbits/sec 62829
[ 4] 2.00-3.00 sec 492 MBytes 4.12 Gbits/sec 62919
[ 4] 3.00-4.00 sec 490 MBytes 4.11 Gbits/sec 62762
[ 4] 4.00-5.00 sec 491 MBytes 4.12 Gbits/sec 62891
[ 4] 5.00-6.00 sec 492 MBytes 4.13 Gbits/sec 62994
[ 4] 6.00-7.00 sec 503 MBytes 4.22 Gbits/sec 64322
[ 4] 7.00-8.00 sec 503 MBytes 4.22 Gbits/sec 64321
[ 4] 8.00-9.00 sec 502 MBytes 4.21 Gbits/sec 64279
[ 4] 9.00-10.00 sec 511 MBytes 4.28 Gbits/sec 65352
----------------------------------------

[ ID] Interval Transfer Bandwidth Jitter Lost/Total Datagrams
[ 4] 0.00-10.00 sec 4.80 GBytes 4.12 Gbits/sec 0.001 ms 0/628718 (0%)
[ 4] Sent 628718 datagrams

iperf Done.
As you can see, UDP passes normally.
Here are the interface parameters:
P:
veth0 Link encap:Ethernet HWaddr 72:0D:9E:D7:BC:B3
inet addr:10.3.3.1 Bcast:10.3.3.255 Mask:255.255.255.0
inet6 addr: fe80::700d:9eff:fed7:bcb3/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
RX packets:126 errors:0 dropped:0 overruns:0 frame:0
TX packets:629026 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9592 (9.3 KiB) TX bytes:5178498619 (4.8 GiB)

veth2 Link encap:Ethernet HWaddr CE:24:F8:1F:99:C1
inet addr:10.4.4.1 Bcast:10.4.4.255 Mask:255.255.255.0
inet6 addr: fe80::cc24:f8ff:fe1f:99c1/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65535 Metric:1
RX packets:629015 errors:0 dropped:0 overruns:0 frame:0
TX packets:135 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5181014123 (4.8 GiB) TX bytes:9564 (9.3 KiB)
PE1:
~ # ip netns exec host2 ifconfig
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

veth3 Link encap:Ethernet HWaddr 36:00:C2:29:0D:F9
inet addr:10.4.4.2 Bcast:10.4.4.255 Mask:255.255.255.0
inet6 addr: fe80::3400:c2ff:fe29:df9/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65200 Metric:1
RX packets:136 errors:0 dropped:0 overruns:0 frame:0
TX packets:629015 errors:0 dropped:1 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:9596 (9.3 KiB) TX bytes:5181014123 (4.8 GiB)
PE2:
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)

veth1 Link encap:Ethernet HWaddr DA:B2:AD:31:68:77
inet addr:10.3.3.2 Bcast:10.3.3.255 Mask:255.255.255.0
inet6 addr: fe80::d8b2:adff:fe31:6877/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:65200 Metric:1
RX packets:629027 errors:0 dropped:0 overruns:0 frame:0
TX packets:126 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:5178498651 (4.8 GiB) TX bytes:9592 (9.3 KiB)
The same thing, only on a more recent kernel: http://forum.nag.ru/forum/index.php?showtopic=128927&st=0
Kernel:
/ # uname -r
4.8.6
Kernel config:
https://pastebin.com/raw/EE1k05cT

Is this a kernel bug, or a configuration error?

^ permalink raw reply

* [PATCH net-next v5] net: Generic XDP
From: David Miller @ 2017-04-25 17:43 UTC (permalink / raw)
  To: netdev; +Cc: andy, brouer, ast, daniel, netdev, xdp-newbies


This provides a generic SKB based non-optimized XDP path which is used
if either the driver lacks a specific XDP implementation, or the user
requests it via a new IFLA_XDP_FLAGS value named XDP_FLAGS_SKB_MODE.
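
The selection rule described above can be sketched in userspace C as follows. The flag values are copied from the if_link.h hunk of this patch; xdp_op_t, driver_xdp(), generic_xdp() and pick_xdp_op() are hypothetical names used only for illustration, and the assert() on unknown flag bits is an assumption of the sketch rather than something this patch adds.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in the if_link.h hunk of this patch. */
#define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0)
#define XDP_FLAGS_SKB_MODE		(2U << 0)	/* evaluates to 2 */
#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
					 XDP_FLAGS_SKB_MODE)

/* Hypothetical handler type standing in for the ndo_xdp signature. */
typedef int (*xdp_op_t)(void *dev);

/* Hypothetical driver-native handler. */
static int driver_xdp(void *dev)
{
	(void)dev;
	return 1;
}

/* Hypothetical generic (SKB based) handler. */
static int generic_xdp(void *dev)
{
	(void)dev;
	return 2;
}

/*
 * Use the generic path when the driver has no XDP hook or when the
 * user explicitly asked for SKB mode; otherwise use the driver hook.
 */
static xdp_op_t pick_xdp_op(xdp_op_t driver_op, unsigned int flags)
{
	/* Assumption of this sketch: callers pass only known flag bits. */
	assert((flags & ~XDP_FLAGS_MASK) == 0);

	if (!driver_op || (flags & XDP_FLAGS_SKB_MODE))
		return generic_xdp;
	return driver_op;
}
```

A device without a native hook always ends up on the generic handler, and a device with one does so only when XDP_FLAGS_SKB_MODE is set.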

It is arguable that perhaps I should have required something like
this as part of the initial XDP feature merge.

I believe this is critical for two reasons:

1) Accessibility.  More people can play with XDP with fewer
   dependencies.  Yes I know we have XDP support in virtio_net, but
   that just creates another dependency for learning how to use this
   facility.

   I wrote this to make life easier for the XDP newbies.

2) As a model for what the expected semantics are.  If there is a pure
   generic core implementation, it serves as a semantic example for
   driver folks adding XDP support.

One thing I have not tried to address here is the issue of
XDP_PACKET_HEADROOM, thanks to Daniel for spotting that.  It seems
incredibly expensive to do a skb_cow(skb, XDP_PACKET_HEADROOM) or
whatever even if the XDP program doesn't try to push headers at all.
I think we really need the verifier to somehow propagate whether
certain XDP helpers are used or not.

v5:
 - Handle both negative and positive offset after running prog
 - Fix mac length in XDP_TX case (Alexei)
 - Use rcu_dereference_protected() in free_netdev (kbuild test robot)

v4:
 - Fix MAC header adjustment before calling prog (David Ahern)
 - Disable LRO when generic XDP is installed (Michael Chan)
 - Bypass qdisc et al. on XDP_TX and record the event (Alexei)
 - Do not perform generic XDP on reinjected packets (DaveM)

v3:
 - Make sure XDP program sees packet at MAC header, push back MAC
   header if we do XDP_TX.  (Alexei)
 - Elide GRO when generic XDP is in use.  (Alexei)
 - Add XDP_FLAGS_SKB_MODE flag which the user can use to request generic
   XDP even if the driver has an XDP implementation.  (Alexei)
 - Report whether SKB mode is in use in rtnl_xdp_fill() via XDP_FLAGS
   attribute.  (Daniel)

v2:
 - Add some "fall through" comments in switch statements based
   upon feedback from Andrew Lunn
 - Use RCU for generic xdp_prog, thanks to Johannes Berg.
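
The v5 item about handling both negative and positive offsets can be illustrated with a minimal userspace sketch. struct xdp_skb, mock_pull(), mock_push() and apply_xdp_offset() are hypothetical stand-ins that model only the data pointer and length, not the real sk_buff helpers. The idea is that after the program runs, the difference between the new and the original xdp.data says how far the packet start moved, and the skb is pulled or pushed by that amount.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in modelling only data and len of an sk_buff. */
struct xdp_skb {
	unsigned char *data;
	unsigned int len;
};

/* Like __skb_pull(): drop n bytes from the front of the packet. */
static void mock_pull(struct xdp_skb *skb, unsigned int n)
{
	assert(n <= skb->len);
	skb->data += n;
	skb->len -= n;
}

/* Like __skb_push(): expose n more bytes of headroom at the front. */
static void mock_push(struct xdp_skb *skb, unsigned int n)
{
	skb->data -= n;
	skb->len += n;
}

/* Re-sync the skb with wherever the program left the packet start. */
static void apply_xdp_offset(struct xdp_skb *skb,
			     const unsigned char *orig_data,
			     const unsigned char *new_data)
{
	ptrdiff_t off = new_data - orig_data;

	if (off > 0)
		mock_pull(skb, (unsigned int)off);	/* header removed */
	else if (off < 0)
		mock_push(skb, (unsigned int)(-off));	/* header added */
}
```

A program that strips 8 bytes produces a positive offset and a pull, while one that prepends 12 bytes produces a negative offset and a push.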

Tested-by: Andy Gospodarek <andy@greyhouse.net>
Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
---

Committed to net-next... thanks again everyone.

 include/linux/netdevice.h    |   8 +++
 include/uapi/linux/if_link.h |   4 +-
 net/core/dev.c               | 155 +++++++++++++++++++++++++++++++++++++++++--
 net/core/gro_cells.c         |   2 +-
 net/core/rtnetlink.c         |  40 ++++++-----
 5 files changed, 187 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5d5267f..46d220c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1905,9 +1905,17 @@ struct net_device {
 	struct lock_class_key	*qdisc_tx_busylock;
 	struct lock_class_key	*qdisc_running_key;
 	bool			proto_down;
+	struct bpf_prog __rcu	*xdp_prog;
 };
 #define to_net_dev(d) container_of(d, struct net_device, dev)
 
+static inline bool netif_elide_gro(const struct net_device *dev)
+{
+	if (!(dev->features & NETIF_F_GRO) || dev->xdp_prog)
+		return true;
+	return false;
+}
+
 #define	NETDEV_ALIGN		32
 
 static inline
diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
index 8b405af..633aa02 100644
--- a/include/uapi/linux/if_link.h
+++ b/include/uapi/linux/if_link.h
@@ -887,7 +887,9 @@ enum {
 /* XDP section */
 
 #define XDP_FLAGS_UPDATE_IF_NOEXIST	(1U << 0)
-#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST)
+#define XDP_FLAGS_SKB_MODE		(2U << 0)
+#define XDP_FLAGS_MASK			(XDP_FLAGS_UPDATE_IF_NOEXIST | \
+					 XDP_FLAGS_SKB_MODE)
 
 enum {
 	IFLA_XDP_UNSPEC,
diff --git a/net/core/dev.c b/net/core/dev.c
index db6e315..1b3317c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -95,6 +95,7 @@
 #include <linux/notifier.h>
 #include <linux/skbuff.h>
 #include <linux/bpf.h>
+#include <linux/bpf_trace.h>
 #include <net/net_namespace.h>
 #include <net/sock.h>
 #include <net/busy_poll.h>
@@ -4251,6 +4252,125 @@ static int __netif_receive_skb(struct sk_buff *skb)
 	return ret;
 }
 
+static struct static_key generic_xdp_needed __read_mostly;
+
+static int generic_xdp_install(struct net_device *dev, struct netdev_xdp *xdp)
+{
+	struct bpf_prog *new = xdp->prog;
+	int ret = 0;
+
+	switch (xdp->command) {
+	case XDP_SETUP_PROG: {
+		struct bpf_prog *old = rtnl_dereference(dev->xdp_prog);
+
+		rcu_assign_pointer(dev->xdp_prog, new);
+		if (old)
+			bpf_prog_put(old);
+
+		if (old && !new) {
+			static_key_slow_dec(&generic_xdp_needed);
+		} else if (new && !old) {
+			static_key_slow_inc(&generic_xdp_needed);
+			dev_disable_lro(dev);
+		}
+		break;
+	}
+
+	case XDP_QUERY_PROG:
+		xdp->prog_attached = !!rcu_access_pointer(dev->xdp_prog);
+		break;
+
+	default:
+		ret = -EINVAL;
+		break;
+	}
+
+	return ret;
+}
+
+static u32 netif_receive_generic_xdp(struct sk_buff *skb,
+				     struct bpf_prog *xdp_prog)
+{
+	struct xdp_buff xdp;
+	u32 act = XDP_DROP;
+	void *orig_data;
+	int hlen, off;
+	u32 mac_len;
+
+	/* Reinjected packets coming from act_mirred or similar should
+	 * not get XDP generic processing.
+	 */
+	if (skb_cloned(skb))
+		return XDP_PASS;
+
+	if (skb_linearize(skb))
+		goto do_drop;
+
+	/* The XDP program wants to see the packet starting at the MAC
+	 * header.
+	 */
+	mac_len = skb->data - skb_mac_header(skb);
+	hlen = skb_headlen(skb) + mac_len;
+	xdp.data = skb->data - mac_len;
+	xdp.data_end = xdp.data + hlen;
+	xdp.data_hard_start = skb->data - skb_headroom(skb);
+	orig_data = xdp.data;
+
+	act = bpf_prog_run_xdp(xdp_prog, &xdp);
+
+	off = xdp.data - orig_data;
+	if (off > 0)
+		__skb_pull(skb, off);
+	else if (off < 0)
+		__skb_push(skb, -off);
+
+	switch (act) {
+	case XDP_TX:
+		__skb_push(skb, mac_len);
+		/* fall through */
+	case XDP_PASS:
+		break;
+
+	default:
+		bpf_warn_invalid_xdp_action(act);
+		/* fall through */
+	case XDP_ABORTED:
+		trace_xdp_exception(skb->dev, xdp_prog, act);
+		/* fall through */
+	case XDP_DROP:
+	do_drop:
+		kfree_skb(skb);
+		break;
+	}
+
+	return act;
+}
+
+/* When doing generic XDP we have to bypass the qdisc layer and the
+ * network taps in order to match in-driver-XDP behavior.
+ */
+static void generic_xdp_tx(struct sk_buff *skb, struct bpf_prog *xdp_prog)
+{
+	struct net_device *dev = skb->dev;
+	struct netdev_queue *txq;
+	bool free_skb = true;
+	int cpu, rc;
+
+	txq = netdev_pick_tx(dev, skb, NULL);
+	cpu = smp_processor_id();
+	HARD_TX_LOCK(dev, txq, cpu);
+	if (!netif_xmit_stopped(txq)) {
+		rc = netdev_start_xmit(skb, dev, txq, 0);
+		if (dev_xmit_complete(rc))
+			free_skb = false;
+	}
+	HARD_TX_UNLOCK(dev, txq);
+	if (free_skb) {
+		trace_xdp_exception(dev, xdp_prog, XDP_TX);
+		kfree_skb(skb);
+	}
+}
+
 static int netif_receive_skb_internal(struct sk_buff *skb)
 {
 	int ret;
@@ -4262,6 +4382,21 @@ static int netif_receive_skb_internal(struct sk_buff *skb)
 
 	rcu_read_lock();
 
+	if (static_key_false(&generic_xdp_needed)) {
+		struct bpf_prog *xdp_prog = rcu_dereference(skb->dev->xdp_prog);
+
+		if (xdp_prog) {
+			u32 act = netif_receive_generic_xdp(skb, xdp_prog);
+
+			if (act != XDP_PASS) {
+				rcu_read_unlock();
+				if (act == XDP_TX)
+					generic_xdp_tx(skb, xdp_prog);
+				return NET_RX_DROP;
+			}
+		}
+	}
+
 #ifdef CONFIG_RPS
 	if (static_key_false(&rps_needed)) {
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
@@ -4494,7 +4629,7 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
 	enum gro_result ret;
 	int grow;
 
-	if (!(skb->dev->features & NETIF_F_GRO))
+	if (netif_elide_gro(skb->dev))
 		goto normal;
 
 	if (skb->csum_bad)
@@ -6723,6 +6858,7 @@ EXPORT_SYMBOL(dev_change_proto_down);
  */
 int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
 {
+	int (*xdp_op)(struct net_device *dev, struct netdev_xdp *xdp);
 	const struct net_device_ops *ops = dev->netdev_ops;
 	struct bpf_prog *prog = NULL;
 	struct netdev_xdp xdp;
@@ -6730,14 +6866,16 @@ int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
 
 	ASSERT_RTNL();
 
-	if (!ops->ndo_xdp)
-		return -EOPNOTSUPP;
+	xdp_op = ops->ndo_xdp;
+	if (!xdp_op || (flags & XDP_FLAGS_SKB_MODE))
+		xdp_op = generic_xdp_install;
+
 	if (fd >= 0) {
 		if (flags & XDP_FLAGS_UPDATE_IF_NOEXIST) {
 			memset(&xdp, 0, sizeof(xdp));
 			xdp.command = XDP_QUERY_PROG;
 
-			err = ops->ndo_xdp(dev, &xdp);
+			err = xdp_op(dev, &xdp);
 			if (err < 0)
 				return err;
 			if (xdp.prog_attached)
@@ -6753,7 +6891,7 @@ int dev_change_xdp_fd(struct net_device *dev, int fd, u32 flags)
 	xdp.command = XDP_SETUP_PROG;
 	xdp.prog = prog;
 
-	err = ops->ndo_xdp(dev, &xdp);
+	err = xdp_op(dev, &xdp);
 	if (err < 0 && prog)
 		bpf_prog_put(prog);
 
@@ -7793,6 +7931,7 @@ EXPORT_SYMBOL(alloc_netdev_mqs);
 void free_netdev(struct net_device *dev)
 {
 	struct napi_struct *p, *n;
+	struct bpf_prog *prog;
 
 	might_sleep();
 	netif_free_tx_queues(dev);
@@ -7811,6 +7950,12 @@ void free_netdev(struct net_device *dev)
 	free_percpu(dev->pcpu_refcnt);
 	dev->pcpu_refcnt = NULL;
 
+	prog = rcu_dereference_protected(dev->xdp_prog, 1);
+	if (prog) {
+		bpf_prog_put(prog);
+		static_key_slow_dec(&generic_xdp_needed);
+	}
+
 	/*  Compatibility with error handling in drivers */
 	if (dev->reg_state == NETREG_UNINITIALIZED) {
 		netdev_freemem(dev);
diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c
index c98bbfb..814e58a 100644
--- a/net/core/gro_cells.c
+++ b/net/core/gro_cells.c
@@ -13,7 +13,7 @@ int gro_cells_receive(struct gro_cells *gcells, struct sk_buff *skb)
 	struct net_device *dev = skb->dev;
 	struct gro_cell *cell;
 
-	if (!gcells->cells || skb_cloned(skb) || !(dev->features & NETIF_F_GRO))
+	if (!gcells->cells || skb_cloned(skb) || netif_elide_gro(dev))
 		return netif_rx(skb);
 
 	cell = this_cpu_ptr(gcells->cells);
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 088f9c8..9031a6c 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -896,15 +896,13 @@ static size_t rtnl_port_size(const struct net_device *dev,
 		return port_self_size;
 }
 
-static size_t rtnl_xdp_size(const struct net_device *dev)
+static size_t rtnl_xdp_size(void)
 {
 	size_t xdp_size = nla_total_size(0) +	/* nest IFLA_XDP */
-			  nla_total_size(1);	/* XDP_ATTACHED */
+			  nla_total_size(1) +	/* XDP_ATTACHED */
+			  nla_total_size(4);	/* XDP_FLAGS */
 
-	if (!dev->netdev_ops->ndo_xdp)
-		return 0;
-	else
-		return xdp_size;
+	return xdp_size;
 }
 
 static noinline size_t if_nlmsg_size(const struct net_device *dev,
@@ -943,7 +941,7 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_PORT_ID */
 	       + nla_total_size(MAX_PHYS_ITEM_ID_LEN) /* IFLA_PHYS_SWITCH_ID */
 	       + nla_total_size(IFNAMSIZ) /* IFLA_PHYS_PORT_NAME */
-	       + rtnl_xdp_size(dev) /* IFLA_XDP */
+	       + rtnl_xdp_size() /* IFLA_XDP */
 	       + nla_total_size(1); /* IFLA_PROTO_DOWN */
 
 }
@@ -1251,23 +1249,35 @@ static int rtnl_fill_link_ifmap(struct sk_buff *skb, struct net_device *dev)
 
 static int rtnl_xdp_fill(struct sk_buff *skb, struct net_device *dev)
 {
-	struct netdev_xdp xdp_op = {};
 	struct nlattr *xdp;
+	u32 xdp_flags = 0;
+	u8 val = 0;
 	int err;
 
-	if (!dev->netdev_ops->ndo_xdp)
-		return 0;
 	xdp = nla_nest_start(skb, IFLA_XDP);
 	if (!xdp)
 		return -EMSGSIZE;
-	xdp_op.command = XDP_QUERY_PROG;
-	err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
-	if (err)
-		goto err_cancel;
-	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, xdp_op.prog_attached);
+	if (rcu_access_pointer(dev->xdp_prog)) {
+		xdp_flags = XDP_FLAGS_SKB_MODE;
+		val = 1;
+	} else if (dev->netdev_ops->ndo_xdp) {
+		struct netdev_xdp xdp_op = {};
+
+		xdp_op.command = XDP_QUERY_PROG;
+		err = dev->netdev_ops->ndo_xdp(dev, &xdp_op);
+		if (err)
+			goto err_cancel;
+		val = xdp_op.prog_attached;
+	}
+	err = nla_put_u8(skb, IFLA_XDP_ATTACHED, val);
 	if (err)
 		goto err_cancel;
 
+	if (xdp_flags) {
+		err = nla_put_u32(skb, IFLA_XDP_FLAGS, xdp_flags);
+		if (err)
+			goto err_cancel;
+	}
 	nla_nest_end(skb, xdp);
 	return 0;
 
-- 
2.4.11

^ permalink raw reply related

* Re: [PATCH net-next] drivers: net: xgene-v2: Fix error return code in xge_mdio_config()
From: David Miller @ 2017-04-25 17:48 UTC (permalink / raw)
  To: weiyj.lk; +Cc: isubramanian, kchudgar, weiyongjun1, netdev
In-Reply-To: <20170425113650.22730-1-weiyj.lk@gmail.com>

From: Wei Yongjun <weiyj.lk@gmail.com>
Date: Tue, 25 Apr 2017 11:36:50 +0000

> From: Wei Yongjun <weiyongjun1@huawei.com>
> 
> Fix to return error code -ENODEV from the no PHY found error
> handling case instead of 0, as done elsewhere in this function.
> 
> Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next 01/10] tcp: add tp->tcp_mstamp field
From: Soheil Hassas Yeganeh @ 2017-04-25 17:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S . Miller, netdev, Eric Dumazet
In-Reply-To: <20170425171541.3417-2-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> We want to use precise timestamps in TCP stack, but we do not
> want to call possibly expensive kernel time services too often.
>
> tp->tcp_mstamp is guaranteed to be updated once per incoming packet.
>
> We will use it in the following patches, removing specific
> skb_mstamp_get() calls, and removing ack_time from
> struct tcp_sacktag_state.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

> ---
>  include/linux/tcp.h  | 1 +
>  net/ipv4/tcp_input.c | 3 +++
>  2 files changed, 4 insertions(+)
>
> diff --git a/include/linux/tcp.h b/include/linux/tcp.h
> index cbe5b602a2d349fdeb1e878305f37b4da1e6cc86..99a22f44c32e1587a6bf4835b65c7a4314807aa8 100644
> --- a/include/linux/tcp.h
> +++ b/include/linux/tcp.h
> @@ -240,6 +240,7 @@ struct tcp_sock {
>         u32     tlp_high_seq;   /* snd_nxt at the time of TLP retransmit. */
>
>  /* RTT measurement */
> +       struct skb_mstamp tcp_mstamp; /* most recent packet received/sent */
>         u32     srtt_us;        /* smoothed round trip time << 3 in usecs */
>         u32     mdev_us;        /* medium deviation                     */
>         u32     mdev_max_us;    /* maximal mdev for the last rtt period */
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 5af2f04f885914491a7116c20056b3d2188d2d7d..bd18c65df4a9d9c2b66d8005f2cc4ff468140a73 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5362,6 +5362,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>
> +       skb_mstamp_get(&tp->tcp_mstamp);
>         if (unlikely(!sk->sk_rx_dst))
>                 inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
>         /*
> @@ -5922,6 +5923,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>
>         case TCP_SYN_SENT:
>                 tp->rx_opt.saw_tstamp = 0;
> +               skb_mstamp_get(&tp->tcp_mstamp);
>                 queued = tcp_rcv_synsent_state_process(sk, skb, th);
>                 if (queued >= 0)
>                         return queued;
> @@ -5933,6 +5935,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>                 return 0;
>         }
>
> +       skb_mstamp_get(&tp->tcp_mstamp);
>         tp->rx_opt.saw_tstamp = 0;
>         req = tp->fastopen_rsk;
>         if (req) {
> --
> 2.13.0.rc0.306.g87b477812d-goog
>

Nice patchset. Thanks, Eric!
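
A minimal userspace sketch of the idea in this patch, for illustration only:
read the clock once when a packet arrives, cache the value in the
per-connection state, and let the helpers on that receive path reuse the
cached value instead of reading the clock again. Everything here
(struct conn, clock_us(), conn_rcv_packet(), the two helpers and the
clock_reads counter) is a hypothetical stand-in and not kernel code; the
patch itself only adds tp->tcp_mstamp and the skb_mstamp_get() calls quoted
above.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical per-connection state; loosely plays the role of tcp_sock. */
struct conn {
	uint64_t mstamp_us;      /* cached timestamp, refreshed once per packet */
	uint64_t rtt_sample_us;  /* timestamp used by the (fake) RTT helper */
	uint64_t loss_check_us;  /* timestamp used by the (fake) loss helper */
	unsigned int clock_reads; /* how many times the clock was read */
};

/* The possibly expensive time service; counts how often it is called. */
static uint64_t clock_us(struct conn *c)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	c->clock_reads++;
	return (uint64_t)ts.tv_sec * 1000000u + (uint64_t)ts.tv_nsec / 1000u;
}

/* Helpers take no timestamp argument; they reuse the cached value. */
static void rtt_helper(struct conn *c)
{
	c->rtt_sample_us = c->mstamp_us;
}

static void loss_helper(struct conn *c)
{
	c->loss_check_us = c->mstamp_us;
}

/* Receive path: one clock read per incoming packet, shared by all helpers. */
void conn_rcv_packet(struct conn *c)
{
	assert(c != NULL);
	c->mstamp_us = clock_us(c);
	rtt_helper(c);
	loss_helper(c);
}
```

With this shape, each call to conn_rcv_packet() costs exactly one clock read
no matter how many helpers consume the timestamp, which is the property the
later patches in the series rely on when they stop passing timestamps around.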


* Re: [PATCH net-next 01/10] tcp: add tp->tcp_mstamp field
From: Neal Cardwell @ 2017-04-25 17:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-2-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> We want to use precise timestamps in TCP stack, but we do not
> want to call possibly expensive kernel time services too often.
>
> tp->tcp_mstamp is guaranteed to be updated once per incoming packet.
>
> We will use it in the following patches, removing specific
> skb_mstamp_get() calls, and removing ack_time from
> struct tcp_sacktag_state.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---

This series is great! Thanks, Eric!

Acked-by: Neal Cardwell <ncardwell@google.com>

neal


* Re: [PATCH net-next 02/10] tcp: do not pass timestamp to tcp_rack_detect_loss()
From: Neal Cardwell @ 2017-04-25 17:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, netdev, Soheil Hassas Yeganeh, Eric Dumazet
In-Reply-To: <20170425171541.3417-3-edumazet@google.com>

On Tue, Apr 25, 2017 at 1:15 PM, Eric Dumazet <edumazet@google.com> wrote:
> We can use tp->tcp_mstamp as it contains a recent timestamp.
>
> This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout()
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Acked-by: Soheil Hassas Yeganeh <soheil@google.com>

Acked-by: Neal Cardwell <ncardwell@google.com>

neal


