Netdev List
 help / color / mirror / Atom feed
* Re: Re: [Intel-wired-lan] [PATCH] ixgbe: initialize u64_stats_sync structures early at ixgbe_probe
From: Alexander Duyck @ 2017-04-25 17:16 UTC (permalink / raw)
  To: Lino Sanfilippo
  Cc: Singh, Krishneil K, Song, Liwei (Wind River), Kirsher, Jeffrey T,
	netdev@vger.kernel.org, intel-wired-lan@lists.osuosl.org,
	linux-kernel@vger.kernel.org
In-Reply-To: <trinity-a348af9b-9ef5-47a9-a14e-94265c880cf1-1493134775614@3capp-gmx-bs78>

On Tue, Apr 25, 2017 at 8:39 AM, Lino Sanfilippo <LinoSanfilippo@gmx.de> wrote:
> Hi,
>
>> This patch doesn't look right to me. I would suggest rejecting it.
>>
>> The call to initialize the stats should be done when the ring is
>> allocated, not in ixgbe_probe(). This should probably be done in
>> ixgbe_alloc_q_vector() instead.
>>
>
> AFAICS ixgbe_alloc_q_vector() is also called in probe() (by ixgbe_init_interrupt_scheme()).
> Furthermore it is also called in resume() which would lead to multiple initialization of
> the u64_stats_sync in case of resume.

ixgbe_alloc_q_vector() is what allocates the ring structures that are
being initialized here. Calling it anywhere other than here doesn't
make sense since what we want to do is initialize the memory after we
have allocated it, but before we hand the pointer to it over to a
netdev or in this case an adapter structure.

> IMHO the u64_stats_sync variables have to be initialized before register_netdev() is called
> since this is the point from which userspace can call ixgbe_get_stats64(). I would say the
> best place to do so is the probe() function as it is done in this patch.

I would disagree here. We should be initializing the stats variables
after we allocate them. Especially since we can end up freeing and
reallocating them any time the number of queues is changed.

> Just my 2 cents.
>
> Regards,
> Lino

My advice would be to look through the code and verify what it is you
need to initialize and where it should happen. In this case we are
getting a lockdep splat since we are just letting things get
initialized with kzalloc and aren't following up in the right place. I
don't disagree that the original code has the u64_stats_init in the
wrong place since we can open/close the interface and trigger a
reinitialization of the stats. I would say we need to initialize the
stats just after we allocate them in memory so that if we decide to
free and reallocate the rings we initialize the new rings before they
are added to the netdev and don't reintroduce this issue in just a
different form.

- Alex

^ permalink raw reply

* [PATCH net-next 10/10] tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

Some devices or distributions use HZ=100 or HZ=250

TCP receive buffer autotuning has poor behavior caused by this choice.
Since autotuning happens after 4 ms or 10 ms, short distance flows
get their receive buffer tuned to a very high value, but after an initial
period where it was frozen to (too small) initial value.

With tp->tcp_mstamp introduction, we can switch to high resolution
timestamps almost for free (at the expense of 8 additional bytes per
TCP structure)

Note that some TCP stacks use usec TCP timestamps where this
patch makes even more sense : Many TCP flows have < 500 usec RTT.
Hopefully this finer TS option can be standardized soon.

Tested:
 HZ=100 kernel
 ./netperf -H lpaa24 -t TCP_RR -l 1000 -- -r 10000,10000 &

 Peer without patch :
 lpaa24:~# ss -tmi dst lpaa23
 ...
 skmem:(r0,rb8388608,...)
 rcv_rtt:10 rcv_space:3210000 minrtt:0.017

 Peer with the patch :
 lpaa23:~# ss -tmi dst lpaa24
 ...
 skmem:(r0,rb428800,...)
 rcv_rtt:0.069 rcv_space:30000 minrtt:0.017

We can see saner RCVBUF, and more precise rcv_rtt information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/linux/tcp.h  | 12 ++++++------
 net/ipv4/tcp.c       |  2 +-
 net/ipv4/tcp_input.c | 28 +++++++++++++++++-----------
 3 files changed, 24 insertions(+), 18 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 99a22f44c32e1587a6bf4835b65c7a4314807aa8..b6d5adcee8fcb611de202993623cc80274d262e4 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -333,16 +333,16 @@ struct tcp_sock {
 
 /* Receiver side RTT estimation */
 	struct {
-		u32	rtt;
-		u32	seq;
-		u32	time;
+		u32		rtt_us;
+		u32		seq;
+		struct skb_mstamp time;
 	} rcv_rtt_est;
 
 /* Receiver queue space */
 	struct {
-		int	space;
-		u32	seq;
-		u32	time;
+		int		space;
+		u32		seq;
+		struct skb_mstamp time;
 	} rcvq_space;
 
 /* TCP-specific MTU probe information. */
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index efc976ae66ae5b82d496323634c3030fb71c6c92..059dad7deefe883bd3a26c93f27637dc22ccefda 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2853,7 +2853,7 @@ void tcp_get_info(struct sock *sk, struct tcp_info *info)
 	info->tcpi_snd_ssthresh = tp->snd_ssthresh;
 	info->tcpi_advmss = tp->advmss;
 
-	info->tcpi_rcv_rtt = jiffies_to_usecs(tp->rcv_rtt_est.rtt)>>3;
+	info->tcpi_rcv_rtt = tp->rcv_rtt_est.rtt_us >> 3;
 	info->tcpi_rcv_space = tp->rcvq_space.space;
 
 	info->tcpi_total_retrans = tp->total_retrans;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f475f0b53bfe4cb67c19b7f30d9d68bd703ff23b..9739962bfb3fd2d39cb13f643def223f4f17fcb6 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -442,7 +442,8 @@ void tcp_init_buffer_space(struct sock *sk)
 		tcp_sndbuf_expand(sk);
 
 	tp->rcvq_space.space = tp->rcv_wnd;
-	tp->rcvq_space.time = tcp_time_stamp;
+	skb_mstamp_get(&tp->tcp_mstamp);
+	tp->rcvq_space.time = tp->tcp_mstamp;
 	tp->rcvq_space.seq = tp->copied_seq;
 
 	maxwin = tcp_full_space(sk);
@@ -518,7 +519,7 @@ EXPORT_SYMBOL(tcp_initialize_rcv_mss);
  */
 static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
 {
-	u32 new_sample = tp->rcv_rtt_est.rtt;
+	u32 new_sample = tp->rcv_rtt_est.rtt_us;
 	long m = sample;
 
 	if (m == 0)
@@ -548,21 +549,23 @@ static void tcp_rcv_rtt_update(struct tcp_sock *tp, u32 sample, int win_dep)
 		new_sample = m << 3;
 	}
 
-	if (tp->rcv_rtt_est.rtt != new_sample)
-		tp->rcv_rtt_est.rtt = new_sample;
+	tp->rcv_rtt_est.rtt_us = new_sample;
 }
 
 static inline void tcp_rcv_rtt_measure(struct tcp_sock *tp)
 {
-	if (tp->rcv_rtt_est.time == 0)
+	u32 delta_us;
+
+	if (tp->rcv_rtt_est.time.v64 == 0)
 		goto new_measure;
 	if (before(tp->rcv_nxt, tp->rcv_rtt_est.seq))
 		return;
-	tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rcv_rtt_est.time, 1);
+	delta_us = skb_mstamp_us_delta(&tp->tcp_mstamp, &tp->rcv_rtt_est.time);
+	tcp_rcv_rtt_update(tp, delta_us, 1);
 
 new_measure:
 	tp->rcv_rtt_est.seq = tp->rcv_nxt + tp->rcv_wnd;
-	tp->rcv_rtt_est.time = tcp_time_stamp;
+	tp->rcv_rtt_est.time = tp->tcp_mstamp;
 }
 
 static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
@@ -572,7 +575,10 @@ static inline void tcp_rcv_rtt_measure_ts(struct sock *sk,
 	if (tp->rx_opt.rcv_tsecr &&
 	    (TCP_SKB_CB(skb)->end_seq -
 	     TCP_SKB_CB(skb)->seq >= inet_csk(sk)->icsk_ack.rcv_mss))
-		tcp_rcv_rtt_update(tp, tcp_time_stamp - tp->rx_opt.rcv_tsecr, 0);
+		tcp_rcv_rtt_update(tp,
+				   jiffies_to_usecs(tcp_time_stamp -
+						    tp->rx_opt.rcv_tsecr),
+				   0);
 }
 
 /*
@@ -585,8 +591,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	int time;
 	int copied;
 
-	time = tcp_time_stamp - tp->rcvq_space.time;
-	if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
+	time = skb_mstamp_us_delta(&tp->tcp_mstamp, &tp->rcvq_space.time);
+	if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
 		return;
 
 	/* Number of bytes copied to user in last RTT */
@@ -642,7 +648,7 @@ void tcp_rcv_space_adjust(struct sock *sk)
 
 new_measure:
 	tp->rcvq_space.seq = tp->copied_seq;
-	tp->rcvq_space.time = tcp_time_stamp;
+	tp->rcvq_space.time = tp->tcp_mstamp;
 }
 
 /* There is something which you must keep in mind when you analyze the
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 09/10] tcp: remove ack_time from struct tcp_sacktag_state
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

It is no longer needed, everything uses tp->tcp_mstamp instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f4e1836c696c9a45a545a32de8ba62ce3bd0dc12..f475f0b53bfe4cb67c19b7f30d9d68bd703ff23b 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1131,7 +1131,6 @@ struct tcp_sacktag_state {
 	 */
 	struct skb_mstamp first_sackt;
 	struct skb_mstamp last_sackt;
-	struct skb_mstamp ack_time; /* Timestamp when the S/ACK was received */
 	struct rate_sample *rate;
 	int	flag;
 };
@@ -3572,8 +3571,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	if (after(ack, tp->snd_nxt))
 		goto invalid_ack;
 
-	skb_mstamp_get(&sack_state.ack_time);
-
 	if (icsk->icsk_pending == ICSK_TIME_LOSS_PROBE)
 		tcp_rearm_rto(sk);
 
@@ -3684,7 +3681,6 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	 * If data was DSACKed, see if we can undo a cwnd reduction.
 	 */
 	if (TCP_SKB_CB(skb)->sacked) {
-		skb_mstamp_get(&sack_state.ack_time);
 		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
 						&sack_state);
 		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 08/10] tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue()
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

Following patch will remove ack_time from struct tcp_sacktag_state

Same info is now found in tp->tcp_mstamp

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5485204853d3aefaea5027bcf6480ed20a1e8efa..f4e1836c696c9a45a545a32de8ba62ce3bd0dc12 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3056,8 +3056,8 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 {
 	const struct inet_connection_sock *icsk = inet_csk(sk);
 	struct skb_mstamp first_ackt, last_ackt;
-	struct skb_mstamp *now = &sack->ack_time;
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct skb_mstamp *now = &tp->tcp_mstamp;
 	u32 prior_sacked = tp->sacked_out;
 	u32 reord = tp->packets_out;
 	bool fully_acked = true;
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 06/10] tcp: do not pass timestamp to tcp_rate_gen()
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

No longer needed, since tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h    | 2 +-
 net/ipv4/tcp_input.c | 3 +--
 net/ipv4/tcp_rate.c  | 7 ++++---
 3 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 8b4433c4aaa221b83af90d2b44ba4b01a872a7af..d7aae25efc7f9664a482ce50974a2d79f7fc8e0c 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1004,7 +1004,7 @@ void tcp_rate_skb_sent(struct sock *sk, struct sk_buff *skb);
 void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
 			    struct rate_sample *rs);
 void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
-		  struct skb_mstamp *now, struct rate_sample *rs);
+		  struct rate_sample *rs);
 void tcp_rate_check_app_limited(struct sock *sk);
 
 /* These functions determine how the current flow behaves in respect of SACK
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 68094aa8cfb2ee2dc6939ea1931277b745deae4a..2d84483de2e10ab10d4906f2cca01d76da83dc06 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -3657,8 +3657,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		tcp_schedule_loss_probe(sk);
 	delivered = tp->delivered - delivered;	/* freshly ACKed or SACKed */
 	lost = tp->lost - lost;			/* freshly marked lost */
-	tcp_rate_gen(sk, delivered, lost, &sack_state.ack_time,
-		     sack_state.rate);
+	tcp_rate_gen(sk, delivered, lost, sack_state.rate);
 	tcp_cong_control(sk, ack, delivered, flag, sack_state.rate);
 	tcp_xmit_recovery(sk, rexmit);
 	return 1;
diff --git a/net/ipv4/tcp_rate.c b/net/ipv4/tcp_rate.c
index 9be1581a5a08c36f4544fbdabedd9741fb266a1e..c6a9fa8946462100947ab62d86464ff8f99565c2 100644
--- a/net/ipv4/tcp_rate.c
+++ b/net/ipv4/tcp_rate.c
@@ -106,7 +106,7 @@ void tcp_rate_skb_delivered(struct sock *sk, struct sk_buff *skb,
 
 /* Update the connection delivery information and generate a rate sample. */
 void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
-		  struct skb_mstamp *now, struct rate_sample *rs)
+		  struct rate_sample *rs)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 snd_us, ack_us;
@@ -120,7 +120,7 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
 	 * to carry current time, flags, stats like "tcp_sacktag_state".
 	 */
 	if (delivered)
-		tp->delivered_mstamp = *now;
+		tp->delivered_mstamp = tp->tcp_mstamp;
 
 	rs->acked_sacked = delivered;	/* freshly ACKed or SACKed */
 	rs->losses = lost;		/* freshly marked lost */
@@ -138,7 +138,8 @@ void tcp_rate_gen(struct sock *sk, u32 delivered, u32 lost,
 	 * longer phase.
 	 */
 	snd_us = rs->interval_us;				/* send phase */
-	ack_us = skb_mstamp_us_delta(now, &rs->prior_mstamp);	/* ack phase */
+	ack_us = skb_mstamp_us_delta(&tp->tcp_mstamp,
+				     &rs->prior_mstamp); /* ack phase */
 	rs->interval_us = max(snd_us, ack_us);
 
 	/* Normally we expect interval_us >= min-rtt.
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 07/10] tcp: do not pass timestamp to tcp_rack_advance()
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

No longer needed, since tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h       | 3 +--
 net/ipv4/tcp_input.c    | 6 ++----
 net/ipv4/tcp_recovery.c | 5 ++---
 3 files changed, 5 insertions(+), 9 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index d7aae25efc7f9664a482ce50974a2d79f7fc8e0c..270e5cc43c99e7030e95af218095cf9f283950bc 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1855,8 +1855,7 @@ void tcp_init(void);
 /* tcp_recovery.c */
 extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
-			     const struct skb_mstamp *xmit_time,
-			     const struct skb_mstamp *ack_time);
+			     const struct skb_mstamp *xmit_time);
 extern void tcp_rack_reo_timeout(struct sock *sk);
 
 /*
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 2d84483de2e10ab10d4906f2cca01d76da83dc06..5485204853d3aefaea5027bcf6480ed20a1e8efa 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1214,8 +1214,7 @@ static u8 tcp_sacktag_one(struct sock *sk,
 		return sacked;
 
 	if (!(sacked & TCPCB_SACKED_ACKED)) {
-		tcp_rack_advance(tp, sacked, end_seq,
-				 xmit_time, &state->ack_time);
+		tcp_rack_advance(tp, sacked, end_seq, xmit_time);
 
 		if (sacked & TCPCB_SACKED_RETRANS) {
 			/* If the segment is not tagged as lost,
@@ -3118,8 +3117,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 			tp->delivered += acked_pcount;
 			if (!tcp_skb_spurious_retrans(tp, skb))
 				tcp_rack_advance(tp, sacked, scb->end_seq,
-						 &skb->skb_mstamp,
-						 &sack->ack_time);
+						 &skb->skb_mstamp);
 		}
 		if (sacked & TCPCB_LOST)
 			tp->lost_out -= acked_pcount;
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index 6ca8b5d9d803d872ec7043b02c72fffaec5c7270..cd72b3d3879e88181c8a4639f0334a24e4cda852 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -127,8 +127,7 @@ void tcp_rack_mark_lost(struct sock *sk)
  * draft-cheng-tcpm-rack-00.txt
  */
 void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
-		      const struct skb_mstamp *xmit_time,
-		      const struct skb_mstamp *ack_time)
+		      const struct skb_mstamp *xmit_time)
 {
 	u32 rtt_us;
 
@@ -137,7 +136,7 @@ void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 				 end_seq, tp->rack.end_seq))
 		return;
 
-	rtt_us = skb_mstamp_us_delta(ack_time, xmit_time);
+	rtt_us = skb_mstamp_us_delta(&tp->tcp_mstamp, xmit_time);
 	if (sacked & TCPCB_RETRANS) {
 		/* If the sacked packet was retransmitted, it's ambiguous
 		 * whether the retransmission or the original (or the prior
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 02/10] tcp: do not pass timestamp to tcp_rack_detect_loss()
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

We can use tp->tcp_mstamp as it contains a recent timestamp.

This removes a call to skb_mstamp_get() from tcp_rack_reo_timeout()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_recovery.c | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index d8acbd9f477a2ac6b0f8eee1bf59f3ab43abff07..fdac262e277b2f25492f155bbb295d6d87e31d02 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -45,8 +45,7 @@ static bool tcp_rack_sent_after(const struct skb_mstamp *t1,
  * or tcp_time_to_recover()'s "Trick#1: the loss is proven" code path will
  * make us enter the CA_Recovery state.
  */
-static void tcp_rack_detect_loss(struct sock *sk, const struct skb_mstamp *now,
-				 u32 *reo_timeout)
+static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	struct sk_buff *skb;
@@ -79,7 +78,7 @@ static void tcp_rack_detect_loss(struct sock *sk, const struct skb_mstamp *now,
 			 * A packet is lost if its elapsed time is beyond
 			 * the recent RTT plus the reordering window.
 			 */
-			u32 elapsed = skb_mstamp_us_delta(now,
+			u32 elapsed = skb_mstamp_us_delta(&tp->tcp_mstamp,
 							  &skb->skb_mstamp);
 			s32 remaining = tp->rack.rtt_us + reo_wnd - elapsed;
 
@@ -115,7 +114,7 @@ void tcp_rack_mark_lost(struct sock *sk, const struct skb_mstamp *now)
 
 	/* Reset the advanced flag to avoid unnecessary queue scanning */
 	tp->rack.advanced = 0;
-	tcp_rack_detect_loss(sk, now, &timeout);
+	tcp_rack_detect_loss(sk, &timeout);
 	if (timeout) {
 		timeout = usecs_to_jiffies(timeout + TCP_REO_TIMEOUT_MIN);
 		inet_csk_reset_xmit_timer(sk, ICSK_TIME_REO_TIMEOUT,
@@ -165,12 +164,10 @@ void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 void tcp_rack_reo_timeout(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	struct skb_mstamp now;
 	u32 timeout, prior_inflight;
 
-	skb_mstamp_get(&now);
 	prior_inflight = tcp_packets_in_flight(tp);
-	tcp_rack_detect_loss(sk, &now, &timeout);
+	tcp_rack_detect_loss(sk, &timeout);
 	if (prior_inflight != tcp_packets_in_flight(tp)) {
 		if (inet_csk(sk)->icsk_ca_state != TCP_CA_Recovery) {
 			tcp_enter_recovery(sk, false);
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 05/10] tcp: do not pass timestamp to tcp_fastretrans_alert()
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

Not used anymore now tp->tcp_mstamp holds the information.

This is needed to remove sack_state.ack_time in a following patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 12 ++++--------
 1 file changed, 4 insertions(+), 8 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 99b0d65de169a13679477f49f3733ca00c842090..68094aa8cfb2ee2dc6939ea1931277b745deae4a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2787,8 +2787,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
  * tcp_xmit_retransmit_queue().
  */
 static void tcp_fastretrans_alert(struct sock *sk, const int acked,
-				  bool is_dupack, int *ack_flag, int *rexmit,
-				  const struct skb_mstamp *ack_time)
+				  bool is_dupack, int *ack_flag, int *rexmit)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
@@ -3646,8 +3645,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 	if (tcp_ack_is_dubious(sk, flag)) {
 		is_dupack = !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP));
-		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit,
-				      &sack_state.ack_time);
+		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
 	}
 	if (tp->tlp_high_seq)
 		tcp_process_tlp_ack(sk, ack, flag);
@@ -3668,8 +3666,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 no_queue:
 	/* If data was DSACKed, see if we can undo a cwnd reduction. */
 	if (flag & FLAG_DSACKING_ACK)
-		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit,
-				      &sack_state.ack_time);
+		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
 	/* If this ack opens up a zero window, clear backoff.  It was
 	 * being used to time the probes, and is probably far higher than
 	 * it needs to be for normal retransmission.
@@ -3693,8 +3690,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 		skb_mstamp_get(&sack_state.ack_time);
 		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
 						&sack_state);
-		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit,
-				      &sack_state.ack_time);
+		tcp_fastretrans_alert(sk, acked, is_dupack, &flag, &rexmit);
 		tcp_xmit_recovery(sk, rexmit);
 	}
 
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 04/10] tcp: do not pass timestamp to tcp_rack_identify_loss()
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

Not used anymore now tp->tcp_mstamp holds the information.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 net/ipv4/tcp_input.c | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index d4885f7a6a930ff1794b0ab931c0b73c274371b2..99b0d65de169a13679477f49f3733ca00c842090 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2760,8 +2760,7 @@ static bool tcp_try_undo_partial(struct sock *sk, const int acked)
 	return false;
 }
 
-static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag,
-				   const struct skb_mstamp *ack_time)
+static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
@@ -2857,11 +2856,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 			tcp_try_keep_open(sk);
 			return;
 		}
-		tcp_rack_identify_loss(sk, ack_flag, ack_time);
+		tcp_rack_identify_loss(sk, ack_flag);
 		break;
 	case TCP_CA_Loss:
 		tcp_process_loss(sk, flag, is_dupack, rexmit);
-		tcp_rack_identify_loss(sk, ack_flag, ack_time);
+		tcp_rack_identify_loss(sk, ack_flag);
 		if (!(icsk->icsk_ca_state == TCP_CA_Open ||
 		      (*ack_flag & FLAG_LOST_RETRANS)))
 			return;
@@ -2877,7 +2876,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 		if (icsk->icsk_ca_state <= TCP_CA_Disorder)
 			tcp_try_undo_dsack(sk);
 
-		tcp_rack_identify_loss(sk, ack_flag, ack_time);
+		tcp_rack_identify_loss(sk, ack_flag);
 		if (!tcp_time_to_recover(sk, flag)) {
 			tcp_try_to_open(sk, flag);
 			return;
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 03/10] tcp: do not pass timestamp to tcp_rack_mark_lost()
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

This is no longer used, since tcp_rack_detect_loss() takes
the timestamp from tp->tcp_mstamp

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h       | 2 +-
 net/ipv4/tcp_input.c    | 2 +-
 net/ipv4/tcp_recovery.c | 2 +-
 3 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index da28bef1d82b6773bbfcf7c7eafebb7a4932f25b..8b4433c4aaa221b83af90d2b44ba4b01a872a7af 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1853,7 +1853,7 @@ void tcp_v4_init(void);
 void tcp_init(void);
 
 /* tcp_recovery.c */
-extern void tcp_rack_mark_lost(struct sock *sk, const struct skb_mstamp *now);
+extern void tcp_rack_mark_lost(struct sock *sk);
 extern void tcp_rack_advance(struct tcp_sock *tp, u8 sacked, u32 end_seq,
 			     const struct skb_mstamp *xmit_time,
 			     const struct skb_mstamp *ack_time);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bd18c65df4a9d9c2b66d8005f2cc4ff468140a73..d4885f7a6a930ff1794b0ab931c0b73c274371b2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2769,7 +2769,7 @@ static void tcp_rack_identify_loss(struct sock *sk, int *ack_flag,
 	if (sysctl_tcp_recovery & TCP_RACK_LOSS_DETECTION) {
 		u32 prior_retrans = tp->retrans_out;
 
-		tcp_rack_mark_lost(sk, ack_time);
+		tcp_rack_mark_lost(sk);
 		if (prior_retrans > tp->retrans_out)
 			*ack_flag |= FLAG_LOST_RETRANS;
 	}
diff --git a/net/ipv4/tcp_recovery.c b/net/ipv4/tcp_recovery.c
index fdac262e277b2f25492f155bbb295d6d87e31d02..6ca8b5d9d803d872ec7043b02c72fffaec5c7270 100644
--- a/net/ipv4/tcp_recovery.c
+++ b/net/ipv4/tcp_recovery.c
@@ -104,7 +104,7 @@ static void tcp_rack_detect_loss(struct sock *sk, u32 *reo_timeout)
 	}
 }
 
-void tcp_rack_mark_lost(struct sock *sk, const struct skb_mstamp *now)
+void tcp_rack_mark_lost(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	u32 timeout;
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 01/10] tcp: add tp->tcp_mstamp field
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet
In-Reply-To: <20170425171541.3417-1-edumazet@google.com>

We want to use precise timestamps in TCP stack, but we do not
want to call possibly expensive kernel time services too often.

tp->tcp_mstamp is guaranteed to be updated once per incoming packet.

We will use it in the following patches, removing specific
skb_mstamp_get() calls, and removing ack_time from
struct tcp_sacktag_state.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/tcp.h  | 1 +
 net/ipv4/tcp_input.c | 3 +++
 2 files changed, 4 insertions(+)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index cbe5b602a2d349fdeb1e878305f37b4da1e6cc86..99a22f44c32e1587a6bf4835b65c7a4314807aa8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -240,6 +240,7 @@ struct tcp_sock {
 	u32	tlp_high_seq;	/* snd_nxt at the time of TLP retransmit. */
 
 /* RTT measurement */
+	struct skb_mstamp tcp_mstamp; /* most recent packet received/sent */
 	u32	srtt_us;	/* smoothed round trip time << 3 in usecs */
 	u32	mdev_us;	/* medium deviation			*/
 	u32	mdev_max_us;	/* maximal mdev for the last rtt period	*/
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5af2f04f885914491a7116c20056b3d2188d2d7d..bd18c65df4a9d9c2b66d8005f2cc4ff468140a73 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5362,6 +5362,7 @@ void tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	skb_mstamp_get(&tp->tcp_mstamp);
 	if (unlikely(!sk->sk_rx_dst))
 		inet_csk(sk)->icsk_af_ops->sk_rx_dst_set(sk, skb);
 	/*
@@ -5922,6 +5923,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 
 	case TCP_SYN_SENT:
 		tp->rx_opt.saw_tstamp = 0;
+		skb_mstamp_get(&tp->tcp_mstamp);
 		queued = tcp_rcv_synsent_state_process(sk, skb, th);
 		if (queued >= 0)
 			return queued;
@@ -5933,6 +5935,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		return 0;
 	}
 
+	skb_mstamp_get(&tp->tcp_mstamp);
 	tp->rx_opt.saw_tstamp = 0;
 	req = tp->fastopen_rsk;
 	if (req) {
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* [PATCH net-next 00/10] tcp: do not use tcp_time_stamp for rcv autotuning
From: Eric Dumazet @ 2017-04-25 17:15 UTC (permalink / raw)
  To: David S . Miller
  Cc: netdev, Soheil Hassas Yeganeh, Eric Dumazet, Eric Dumazet

Some devices or linux distributions use HZ=100 or HZ=250

TCP receive buffer autotuning has poor behavior caused by this choice.
Since autotuning happens after 4 ms or 10 ms, short distance flows
get their receive buffer tuned to a very high value, but after an initial
period where it was frozen to (too small) initial value.

With BBR (or other CC allowing to increase BDP), we are willing to
increase tcp_rmem[2], but this receive autotuning defect is a blocker
for hosts dealing with gazillions of TCP flows in the data centers,
since many of them have inflated RCVBUF. Risk of OOM is too high.

Note that TSO autodefer, tcp cubic, and TCP TS options (RFC 7323)
also suffer from our dependency to jiffies (via tcp_time_stamp).

We have ongoing efforts to improve all that in the future.

Eric Dumazet (10):
  tcp: add tp->tcp_mstamp field
  tcp: do not pass timestamp to tcp_rack_detect_loss()
  tcp: do not pass timestamp to tcp_rack_mark_lost()
  tcp: do not pass timestamp to tcp_rack_identify_loss()
  tcp: do not pass timestamp to tcp_fastretrans_alert()
  tcp: do not pass timestamp to tcp_rate_gen()
  tcp: do not pass timestamp to tcp_rack_advance()
  tcp: use tp->tcp_mstamp in tcp_clean_rtx_queue()
  tcp: remove ack_time from struct tcp_sacktag_state
  tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps

 include/linux/tcp.h     | 13 +++++-----
 include/net/tcp.h       |  7 +++--
 net/ipv4/tcp.c          |  2 +-
 net/ipv4/tcp_input.c    | 69 +++++++++++++++++++++++--------------------------
 net/ipv4/tcp_rate.c     |  7 ++---
 net/ipv4/tcp_recovery.c | 18 +++++--------
 6 files changed, 55 insertions(+), 61 deletions(-)

-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply

* llvm-objdump...
From: David Miller @ 2017-04-25 17:13 UTC (permalink / raw)
  To: ast; +Cc: daniel, netdev


I think there are some endianness issues ;-)

davem@patience:~/src/GIT/net-next/tools/testing/selftests/bpf$ llvm-objdump -S x.o

x.o:    file format ELF64-BPF

Disassembly of section test1:
process:
       0:       b7 00 00 00 00 00 00 02         r0 = 33554432
       1:       61 21 00 50 00 00 00 00         r1 = *(u32 *)(r2 + 20480)

That first instruction should be "r0 = 2"

^ permalink raw reply

* [PATCH v2] macsec: dynamically allocate space for sglist
From: Jason A. Donenfeld @ 2017-04-25 17:08 UTC (permalink / raw)
  To: Netdev, LKML, David Miller, stable, security, Sabrina Dubroca
  Cc: Jason A. Donenfeld
In-Reply-To: <20170425163602.GA17973@bistromath.localdomain>

We call skb_cow_data, which is good anyway to ensure we can actually
modify the skb as such (another error from prior). Now that we have the
number of fragments required, we can safely allocate exactly that amount
of memory.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: security@kernel.org
Cc: stable@vger.kernel.org
---
Changes v1 -> v2:
  - sg_init_table now takes the correct argument.

 drivers/net/macsec.c | 29 ++++++++++++++++++++++-------
 1 file changed, 22 insertions(+), 7 deletions(-)

diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
index dbab05afcdbe..49ce4e9f4a0f 100644
--- a/drivers/net/macsec.c
+++ b/drivers/net/macsec.c
@@ -617,7 +617,8 @@ static void macsec_encrypt_done(struct crypto_async_request *base, int err)
 
 static struct aead_request *macsec_alloc_req(struct crypto_aead *tfm,
 					     unsigned char **iv,
-					     struct scatterlist **sg)
+					     struct scatterlist **sg,
+					     int num_frags)
 {
 	size_t size, iv_offset, sg_offset;
 	struct aead_request *req;
@@ -629,7 +630,7 @@ static struct aead_request *macsec_alloc_req(struct crypto_aead *tfm,
 
 	size = ALIGN(size, __alignof__(struct scatterlist));
 	sg_offset = size;
-	size += sizeof(struct scatterlist) * (MAX_SKB_FRAGS + 1);
+	size += sizeof(struct scatterlist) * num_frags;
 
 	tmp = kmalloc(size, GFP_ATOMIC);
 	if (!tmp)
@@ -649,6 +650,7 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 {
 	int ret;
 	struct scatterlist *sg;
+	struct sk_buff *trailer;
 	unsigned char *iv;
 	struct ethhdr *eth;
 	struct macsec_eth_header *hh;
@@ -723,7 +725,14 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 		return ERR_PTR(-EINVAL);
 	}
 
-	req = macsec_alloc_req(tx_sa->key.tfm, &iv, &sg);
+	ret = skb_cow_data(skb, 0, &trailer);
+	if (unlikely(ret < 0)) {
+		macsec_txsa_put(tx_sa);
+		kfree_skb(skb);
+		return ERR_PTR(ret);
+	}
+
+	req = macsec_alloc_req(tx_sa->key.tfm, &iv, &sg, ret);
 	if (!req) {
 		macsec_txsa_put(tx_sa);
 		kfree_skb(skb);
@@ -732,7 +741,7 @@ static struct sk_buff *macsec_encrypt(struct sk_buff *skb,
 
 	macsec_fill_iv(iv, secy->sci, pn);
 
-	sg_init_table(sg, MAX_SKB_FRAGS + 1);
+	sg_init_table(sg, ret);
 	skb_to_sgvec(skb, sg, 0, skb->len);
 
 	if (tx_sc->encrypt) {
@@ -917,6 +926,7 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
 {
 	int ret;
 	struct scatterlist *sg;
+	struct sk_buff *trailer;
 	unsigned char *iv;
 	struct aead_request *req;
 	struct macsec_eth_header *hdr;
@@ -927,7 +937,12 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
 	if (!skb)
 		return ERR_PTR(-ENOMEM);
 
-	req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg);
+	ret = skb_cow_data(skb, 0, &trailer);
+	if (unlikely(ret < 0)) {
+		kfree_skb(skb);
+		return ERR_PTR(ret);
+	}
+	req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg, ret);
 	if (!req) {
 		kfree_skb(skb);
 		return ERR_PTR(-ENOMEM);
@@ -936,7 +951,7 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
 	hdr = (struct macsec_eth_header *)skb->data;
 	macsec_fill_iv(iv, sci, ntohl(hdr->packet_number));
 
-	sg_init_table(sg, MAX_SKB_FRAGS + 1);
+	sg_init_table(sg, ret);
 	skb_to_sgvec(skb, sg, 0, skb->len);
 
 	if (hdr->tci_an & MACSEC_TCI_E) {
@@ -2716,7 +2731,7 @@ static netdev_tx_t macsec_start_xmit(struct sk_buff *skb,
 }
 
 #define MACSEC_FEATURES \
-	(NETIF_F_SG | NETIF_F_HIGHDMA)
+	(NETIF_F_SG | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST)
 static struct lock_class_key macsec_netdev_addr_lock_key;
 
 static int macsec_dev_init(struct net_device *dev)
-- 
2.12.2

^ permalink raw reply related

* Re: [PATCHv2 net] bridge: move bridge multicast cleanup to ndo_uninit
From: Nikolay Aleksandrov @ 2017-04-25 17:01 UTC (permalink / raw)
  To: Xin Long, network dev; +Cc: davem, stephen
In-Reply-To: <6e156e72cb1a5e279da8ac53bdb601eee5d654fe.1493132317.git.lucien.xin@gmail.com>

On 25/04/17 17:58, Xin Long wrote:
> During removing a bridge device, if the bridge is still up, a new mdb entry
> still can be added in br_multicast_add_group() after all mdb entries are
> removed in br_multicast_dev_del(). Like the path:
> 
>   mld_ifc_timer_expire ->
>     mld_sendpack -> ...
>       br_multicast_rcv ->
>         br_multicast_add_group
> 
> The new mp's timer will be set up. If the timer expires after the bridge
> is freed, it may cause use-after-free panic in br_multicast_group_expired.
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
> IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge]
> Call Trace:
>  <IRQ>
>  [<ffffffff81094536>] call_timer_fn+0x36/0x110
>  [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge]
>  [<ffffffff81096967>] run_timer_softirq+0x237/0x340
>  [<ffffffff8108dcbf>] __do_softirq+0xef/0x280
>  [<ffffffff8169889c>] call_softirq+0x1c/0x30
>  [<ffffffff8102c275>] do_softirq+0x65/0xa0
>  [<ffffffff8108e055>] irq_exit+0x115/0x120
>  [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60
>  [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80
> 
> Nikolay also found it would cause a memory leak - the mdb hash is
> reallocated and not freed due to the mdb rehash.
> 
> unreferenced object 0xffff8800540ba800 (size 2048):
>   backtrace:
>     [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
>     [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
>     [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
>     [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
>     [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
>     [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
>     [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
>     [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
>     [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
>     [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
>     [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
>     [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
>     [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
>     [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
>     [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
>     [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]
> 
> This could happen when ip link remove a bridge or destroy a netns with a
> bridge device inside.
> 
> With Nikolay's suggestion, this patch is to clean up bridge multicast in
> ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so
> that netif_running check in br_multicast_add_group can avoid this issue.
> 
> v1->v2:
>   - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead
>     of calling dev_close in br_dev_delete.
> 
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>
> ---
>  net/bridge/br_device.c | 1 +
>  net/bridge/br_if.c     | 1 -
>  2 files changed, 1 insertion(+), 1 deletion(-)

Thank you for modifying the fix to use ndo_uninit(). Important note -
this fix is dependent on Ido's earlier ndo_uninit() patch:
b6fe0440c637 ("bridge: implement missing ndo_uninit()")

Fixes: e10177abf842 ("bridge: multicast: fix handling of temp and perm
entries")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>

^ permalink raw reply

* Re: [PATCH 5/7] IB/hfi1: use pcie_flr instead of duplicating it
From: Doug Ledford @ 2017-04-25 17:00 UTC (permalink / raw)
  To: Christoph Hellwig, Byczkowski, Jakub
  Cc: Bjorn Helgaas, Cabiddu, Giovanni, Benedetto, Salvatore,
	Marciniszyn, Mike, Dalessandro, Dennis, Derek Chickles,
	Satanand Burla, Felix Manlunas, Raghu Vatsavayi,
	Kirsher, Jeffrey T,
	linux-pci-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, qat-linux,
	linux-crypto-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org, "li
In-Reply-To: <20170424143507.GA28812-jcswGhMUV9g@public.gmane.org>

On Mon, 2017-04-24 at 16:35 +0200, Christoph Hellwig wrote:
> On Mon, Apr 24, 2017 at 02:16:31PM +0000, Byczkowski, Jakub wrote:
> > 
> > Tested-by: Jakub Byczkowski <jakub.byczkowski-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
> 
> Are you (and Doug) ok with queueing this up in the PCI tree?

I'm fine with that.  Feel free to add my Acked-by to the hfi1 patch.

-- 
Doug Ledford <dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
    GPG KeyID: B826A3330E572FDD
   
Key fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [RFC 0/4] xdp: use netlink extended ACK reporting
From: Jesper Dangaard Brouer @ 2017-04-25 16:54 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: brouer, Jakub Kicinski, netdev, davem, johannes, dsa,
	alexei.starovoitov, bblanco, john.fastabend, kubakici,
	oss-drivers
In-Reply-To: <58FF1157.8030309@iogearbox.net>

On Tue, 25 Apr 2017 11:05:27 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> > Where the message is coming directly from the driver.  There could
> > still be a bit of a leap for a complete novice from the message
> > above to the right settings.  I wonder if it would be worthwhile  
> 
> But still 100x better than the current situation. ;) I really
> like the series, thanks for working on this!

+1 thanks for working on this! :-)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* more test_progs...
From: David Miller @ 2017-04-25 16:52 UTC (permalink / raw)
  To: ast; +Cc: daniel, netdev


I'm trying to get test_progs working in net-next on sparc, it can't
even load the first BPF program.  It's triggering this verifier
check:

static int check_pkt_ptr_alignment(const struct bpf_reg_state *reg,
                                   int off, int size)
{
        if (reg->id && size != 1) {
                verbose("Unknown alignment. Only byte-sized access allowed in packet access.\n");

Specifically (this is test_pkt_access.o of course):

libbpf: load bpf program failed: Permission denied
libbpf: -- BEGIN DUMP LOG ---
libbpf: 
0: (b7) r0 = 2
1: (61) r2 = *(u32 *)(r1 +80)

r2 = skb->data_end

2: (61) r1 = *(u32 *)(r1 +76)

r1 = skb->data

3: (bf) r4 = r1

r4 = skb->data

4: (07) r4 += 14

r4 += sizeof(struct ethhdr);

5: (2d) if r4 > r2 goto pc+37

if out of range, exit

 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=14) R2=pkt_end R4=pkt(id=0,off=14,r=14) R10=fp

R1/R2/R4 analysis seems correct

6: (71) r5 = *(u8 *)(r1 +13)
7: (71) r3 = *(u8 *)(r1 +12)
8: (67) r3 <<= 8
9: (4f) r3 |= r5

Load eth->h_proto

10: (15) if r3 == 0xdd86 goto pc+9
 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=14) R2=pkt_end R3=inv R4=pkt(id=0,off=14,r=14) R5=inv56 R10=fp

Hmmm, endianness looks wrong here.  "-target bpf" defaults to the
endianness of whatever cpu that llvm was built for, right?

Maybe I need to make even more adjustments to selftests/bpf/Makefile
handling of clang options...

Anyways this is checking for ipv6, if so goto insn 25

11: (55) if r3 != 0x8 goto pc+29
 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=14) R2=pkt_end R3=inv,min_value=8,max_value=8 R4=pkt(id=0,off=14,r=14) R5=inv56 R10=fp

IPV4:

12: (bf) r3 = r1
13: (07) r3 += 34
14: (2d) if r3 > r2 goto pc+28

Make sure pkt + 34 is in range

 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=pkt(id=0,off=34,r=34) R4=pkt(id=0,off=14,r=34) R5=inv56 R10=fp
15: (b7) r3 = 3
16: (71) r5 = *(u8 *)(r4 +0)
17: (67) r5 <<= 2
18: (57) r5 &= 60
19: (05) goto pc+5

IPV6:

25: (bf) r4 = r1
26: (0f) r4 += r5
27: (07) r4 += 14
28: (15) if r4 == 0x0 goto pc+12
 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=imm3,min_value=3,max_value=3 R4=pkt(id=1,off=14,r=0) R5=inv58,min_value=0,max_value=60 R10=fp
29: (bf) r5 = r4
30: (07) r5 += 20
31: (2d) if r5 > r2 goto pc+11
 R0=imm2,min_value=2,max_value=2 R1=pkt(id=0,off=0,r=34) R2=pkt_end R3=imm3,min_value=3,max_value=3 R4=pkt(id=1,off=14,r=34) R5=pkt(id=1,off=34,r=34) R10=fp
32: (0f) r1 += r3
33: (71) r1 = *(u8 *)(r1 +20)

Load ipv6 nexthdr

34: (57) r1 &= 255
35: (55) if r1 != 0x6 goto pc+7

if proto not TCP skip

 R0=imm2,min_value=2,max_value=2 R1=inv56,min_value=6,max_value=6 R2=pkt_end R3=imm3,min_value=3,max_value=3 R4=pkt(id=1,off=14,r=34) R5=pkt(id=1,off=34,r=34) R10=fp
36: (07) r4 += 18
37: (2d) if r4 > r2 goto pc+5
 R0=imm2,min_value=2,max_value=2 R1=inv56,min_value=6,max_value=6 R2=pkt_end R3=imm3,min_value=3,max_value=3 R4=pkt(id=1,off=32,r=34) R5=pkt(id=1,off=34,r=34) R10=fp
38: (b7) r0 = 0
39: (69) r1 = *(u16 *)(r4 +0)
Unknown alignment. Only byte-sized access allowed in packet access.

And this seems to load the urgent pointer as a u16 which is what the verifier rejects.

Oh I see, this is guarded by CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS.

How in the world is this supposed to work?

^ permalink raw reply

* [PATCH] net/packet: check length in getsockopt() called with PACKET_HDRLEN
From: Alexander Potapenko @ 2017-04-25 16:51 UTC (permalink / raw)
  To: dvyukov, kcc, edumazet, davem, kuznet; +Cc: linux-kernel, netdev

In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
|val| remains uninitialized and the syscall may behave differently
depending on its value, and even copy garbage to userspace on certain
architectures. To fix this we now return -EINVAL if optlen is too small.

This bug has been detected with KMSAN.

Signed-off-by: Alexander Potapenko <glider@google.com>

---
The previous versions of this patch were called "net/packet: initialize
val in packet_getsockopt()"

v3: - change patch summary, return -EINVAL for optlen < sizeof(int)
v2: - if len < sizeof(int), make it 0

---
 net/packet/af_packet.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 8489beff5c25..ea81ccf3c7d6 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3836,6 +3836,8 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_HDRLEN:
 		if (len > sizeof(int))
 			len = sizeof(int);
+		if (len < sizeof(int))
+			return -EINVAL;
 		if (copy_from_user(&val, optval, len))
 			return -EFAULT;
 		switch (val) {
-- 
2.13.0.rc0.306.g87b477812d-goog

^ permalink raw reply related

* Re: [PATCHv2 net] bridge: move bridge multicast cleanup to ndo_uninit
From: Stephen Hemminger @ 2017-04-25 16:41 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem, nikolay
In-Reply-To: <6e156e72cb1a5e279da8ac53bdb601eee5d654fe.1493132317.git.lucien.xin@gmail.com>

On Tue, 25 Apr 2017 22:58:37 +0800
Xin Long <lucien.xin@gmail.com> wrote:

> During removing a bridge device, if the bridge is still up, a new mdb entry
> still can be added in br_multicast_add_group() after all mdb entries are
> removed in br_multicast_dev_del(). Like the path:
> 
>   mld_ifc_timer_expire ->
>     mld_sendpack -> ...
>       br_multicast_rcv ->
>         br_multicast_add_group
> 
> The new mp's timer will be set up. If the timer expires after the bridge
> is freed, it may cause use-after-free panic in br_multicast_group_expired.
> 
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000048
> IP: [<ffffffffa07ed2c8>] br_multicast_group_expired+0x28/0xb0 [bridge]
> Call Trace:
>  <IRQ>
>  [<ffffffff81094536>] call_timer_fn+0x36/0x110
>  [<ffffffffa07ed2a0>] ? br_mdb_free+0x30/0x30 [bridge]
>  [<ffffffff81096967>] run_timer_softirq+0x237/0x340
>  [<ffffffff8108dcbf>] __do_softirq+0xef/0x280
>  [<ffffffff8169889c>] call_softirq+0x1c/0x30
>  [<ffffffff8102c275>] do_softirq+0x65/0xa0
>  [<ffffffff8108e055>] irq_exit+0x115/0x120
>  [<ffffffff81699515>] smp_apic_timer_interrupt+0x45/0x60
>  [<ffffffff81697a5d>] apic_timer_interrupt+0x6d/0x80
> 
> Nikolay also found it would cause a memory leak - the mdb hash is
> reallocated and not freed due to the mdb rehash.
> 
> unreferenced object 0xffff8800540ba800 (size 2048):
>   backtrace:
>     [<ffffffff816e2287>] kmemleak_alloc+0x67/0xc0
>     [<ffffffff81260bea>] __kmalloc+0x1ba/0x3e0
>     [<ffffffffa05c60ee>] br_mdb_rehash+0x5e/0x340 [bridge]
>     [<ffffffffa05c74af>] br_multicast_new_group+0x43f/0x6e0 [bridge]
>     [<ffffffffa05c7aa3>] br_multicast_add_group+0x203/0x260 [bridge]
>     [<ffffffffa05ca4b5>] br_multicast_rcv+0x945/0x11d0 [bridge]
>     [<ffffffffa05b6b10>] br_dev_xmit+0x180/0x470 [bridge]
>     [<ffffffff815c781b>] dev_hard_start_xmit+0xbb/0x3d0
>     [<ffffffff815c8743>] __dev_queue_xmit+0xb13/0xc10
>     [<ffffffff815c8850>] dev_queue_xmit+0x10/0x20
>     [<ffffffffa02f8d7a>] ip6_finish_output2+0x5ca/0xac0 [ipv6]
>     [<ffffffffa02fbfc6>] ip6_finish_output+0x126/0x2c0 [ipv6]
>     [<ffffffffa02fc245>] ip6_output+0xe5/0x390 [ipv6]
>     [<ffffffffa032b92c>] NF_HOOK.constprop.44+0x6c/0x240 [ipv6]
>     [<ffffffffa032bd16>] mld_sendpack+0x216/0x3e0 [ipv6]
>     [<ffffffffa032d5eb>] mld_ifc_timer_expire+0x18b/0x2b0 [ipv6]
> 
> This could happen when ip link remove a bridge or destroy a netns with a
> bridge device inside.
> 
> With Nikolay's suggestion, this patch is to clean up bridge multicast in
> ndo_uninit after bridge dev is shutdown, instead of br_dev_delete, so
> that netif_running check in br_multicast_add_group can avoid this issue.
> 
> v1->v2:
>   - fix this issue by moving br_multicast_dev_del to ndo_uninit, instead
>     of calling dev_close in br_dev_delete.
> 
> Reported-by: Jianwen Ji <jiji@redhat.com>
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Makes sense.

Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>

^ permalink raw reply

* Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone
From: David Ahern @ 2017-04-25 16:40 UTC (permalink / raw)
  To: Andrey Konovalov
  Cc: Dmitry Vyukov, Mahesh Bandewar, Eric Dumazet, David Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, LKML, Cong Wang, syzkaller
In-Reply-To: <CAAeHK+wnFMYUXBCQDBKR142uEVXuGoOvo2SO1b04Ya3i8XSr5g@mail.gmail.com>

On 4/25/17 10:38 AM, Andrey Konovalov wrote:
> I'll keep fuzzing in the meantime to make sure.
> Maybe I'll be able to collect more reports or even another reproducer.

start a new email thread for each stack trace. I'll write a debug patch
for the trace you hit today.

^ permalink raw reply

* Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone
From: Andrey Konovalov @ 2017-04-25 16:38 UTC (permalink / raw)
  To: David Ahern
  Cc: Dmitry Vyukov, Mahesh Bandewar, Eric Dumazet, David Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, LKML, Cong Wang, syzkaller
In-Reply-To: <CAAeHK+zVkJ1t1xMhMXO2wUyMfJrbXsARHyQ6A5hh2O9pqPRxiw@mail.gmail.com>

On Tue, Apr 25, 2017 at 6:36 PM, Andrey Konovalov <andreyknvl@google.com> wrote:
> On Tue, Apr 25, 2017 at 5:56 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
>> On 3/4/17 11:57 AM, Dmitry Vyukov wrote:
>>> ==================================================================
>>> BUG: KASAN: slab-out-of-bounds in rt6_dump_route+0x293/0x2f0
>>> net/ipv6/route.c:3551 at addr ffff88007e523694
>>> Read of size 4 by task syz-executor3/24426
>>> CPU: 2 PID: 24426 Comm: syz-executor3 Not tainted 4.10.0+ #293
>>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>>> Call Trace:
>>>  __dump_stack lib/dump_stack.c:16 [inline]
>>>  dump_stack+0x2fb/0x3fd lib/dump_stack.c:52
>>>  kasan_object_err+0x1c/0x90 mm/kasan/report.c:166
>>>  print_address_description mm/kasan/report.c:208 [inline]
>>>  kasan_report_error mm/kasan/report.c:292 [inline]
>>>  kasan_report.part.2+0x1b0/0x460 mm/kasan/report.c:314
>>>  kasan_report mm/kasan/report.c:334 [inline]
>>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:334
>>>  rt6_dump_route+0x293/0x2f0 net/ipv6/route.c:3551
>>>  fib6_dump_node+0x101/0x1a0 net/ipv6/ip6_fib.c:315
>>>  fib6_walk_continue+0x4b3/0x620 net/ipv6/ip6_fib.c:1576
>>>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1621
>>>  fib6_dump_table net/ipv6/ip6_fib.c:374 [inline]
>>>  inet6_dump_fib+0x832/0xea0 net/ipv6/ip6_fib.c:447
>>
>> My expectation is that this one is fixed with the ipv6 patch I have sent
>> (not yet committed). Are you seeing this backtrace with that patch?
>
> Before applying your patch I was hitting reports related to fib6 all the time.
> I've stopped seeing them for some time after I applied your patch.
> However today another one was triggered:

I'll keep fuzzing in the meantime to make sure.
Maybe I'll be able to collect more reports or even another reproducer.

>
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in find_rr_leaf net/ipv6/route.c:722
> [inline] at addr ffff880033dddaa8
> BUG: KASAN: slab-out-of-bounds in rt6_select net/ipv6/route.c:758
> [inline] at addr ffff880033dddaa8
> BUG: KASAN: slab-out-of-bounds in ip6_pol_route+0x1b1e/0x1ba0
> net/ipv6/route.c:1091 at addr ffff880033dddaa8
> Read of size 4 by task syz-executor7/9808
> CPU: 2 PID: 9808 Comm: syz-executor7 Not tainted 4.11.0-rc8+ #268
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> Call Trace:
>  __dump_stack lib/dump_stack.c:16 [inline]
>  dump_stack+0x192/0x22d lib/dump_stack.c:52
>  kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
>  print_address_description mm/kasan/report.c:202 [inline]
>  kasan_report_error mm/kasan/report.c:291 [inline]
>  kasan_report+0x252/0x510 mm/kasan/report.c:347
>  __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:367
>  find_rr_leaf net/ipv6/route.c:722 [inline]
>  rt6_select net/ipv6/route.c:758 [inline]
>  ip6_pol_route+0x1b1e/0x1ba0 net/ipv6/route.c:1091
>  ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
>  fib6_rule_action+0x261/0x8a0 net/ipv6/fib6_rules.c:100
>  fib_rules_lookup+0x34b/0xae0 net/core/fib_rules.c:265
>  fib6_rule_lookup+0x175/0x360 net/ipv6/fib6_rules.c:44
>  ip6_route_output_flags+0x260/0x2f0 net/ipv6/route.c:1240
>  ip6_route_output include/net/ip6_route.h:79 [inline]
>  ip6_dst_lookup_tail+0xe59/0x1640 net/ipv6/ip6_output.c:959
>  ip6_dst_lookup_flow+0xb1/0x260 net/ipv6/ip6_output.c:1082
>  ip6_sk_dst_lookup_flow+0x2c6/0x7f0 net/ipv6/ip6_output.c:1113
>  udpv6_sendmsg+0x2350/0x3310 net/ipv6/udp.c:1219
>  inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  SYSC_sendto+0x660/0x810 net/socket.c:1696
>  SyS_sendto+0x40/0x50 net/socket.c:1664
>  entry_SYSCALL_64_fastpath+0x1a/0xa9
> RIP: 0033:0x4458d9
> RSP: 002b:00007ff3a5343b58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
> RAX: ffffffffffffffda RBX: 0000000000708000 RCX: 00000000004458d9
> RDX: 0000000000000fa8 RSI: 0000000020000000 RDI: 0000000000000005
> RBP: 0000000000004300 R08: 0000000020006000 R09: 000000000000001c
> R10: 0000000000040000 R11: 0000000000000282 R12: 00000000006e33c0
> R13: 0000000000000005 R14: 0000000000000029 R15: 0000000000000023
> Object at ffff880033ddd940, in cache ip_dst_cache size: 224
> Allocated:
> PID = 9717
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
>  kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
>  slab_post_alloc_hook mm/slab.h:456 [inline]
>  slab_alloc_node mm/slub.c:2718 [inline]
>  slab_alloc mm/slub.c:2726 [inline]
>  kmem_cache_alloc+0xa8/0x160 mm/slub.c:2731
>  dst_alloc+0x11b/0x1a0 net/core/dst.c:209
>  rt_dst_alloc+0xf0/0x5d0 net/ipv4/route.c:1497
>  __mkroute_output net/ipv4/route.c:2181 [inline]
>  __ip_route_output_key_hash+0xdc4/0x2930 net/ipv4/route.c:2391
>  __ip_route_output_key include/net/route.h:123 [inline]
>  ip_route_output_flow+0x29/0xa0 net/ipv4/route.c:2477
>  ip_route_output_key include/net/route.h:133 [inline]
>  sctp_v4_get_dst+0x606/0x1420 net/sctp/protocol.c:458
>  sctp_transport_route+0xa8/0x420 net/sctp/transport.c:287
>  sctp_assoc_add_peer+0x5a5/0x1470 net/sctp/associola.c:656
>  sctp_sendmsg+0x18a8/0x3b50 net/sctp/socket.c:1871
>  inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
>  sock_sendmsg_nosec net/socket.c:633 [inline]
>  sock_sendmsg+0xca/0x110 net/socket.c:643
>  SYSC_sendto+0x660/0x810 net/socket.c:1696
>  SyS_sendto+0x40/0x50 net/socket.c:1664
>  entry_SYSCALL_64_fastpath+0x1a/0xa9
> Freed:
> PID = 868
>  save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
>  save_stack+0x43/0xd0 mm/kasan/kasan.c:513
>  set_track mm/kasan/kasan.c:525 [inline]
>  kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
>  slab_free_hook mm/slub.c:1357 [inline]
>  slab_free_freelist_hook mm/slub.c:1379 [inline]
>  slab_free mm/slub.c:2961 [inline]
>  kmem_cache_free+0x72/0x190 mm/slub.c:2983
>  dst_destroy+0x24c/0x390 net/core/dst.c:269
>  dst_free include/net/dst.h:428 [inline]
>  rt_fibinfo_free net/ipv4/fib_semantics.c:155 [inline]
>  free_fib_info_rcu+0x852/0xa10 net/ipv4/fib_semantics.c:214
>  __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
>  rcu_do_batch.isra.65+0x6de/0xbd0 kernel/rcu/tree.c:2879
>  invoke_rcu_callbacks kernel/rcu/tree.c:3142 [inline]
>  __rcu_process_callbacks kernel/rcu/tree.c:3109 [inline]
>  rcu_process_callbacks+0x23f/0x810 kernel/rcu/tree.c:3126
>  __do_softirq+0x253/0x78b kernel/softirq.c:284
> Memory state around the buggy address:
>  ffff880033ddd980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>  ffff880033ddda00: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>>ffff880033ddda80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>                                   ^
>  ffff880033dddb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  ffff880033dddb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
> ==================================================================.
>
>>
>> --
>> You received this message because you are subscribed to the Google Groups "syzkaller" group.
>> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH v2] net/packet: initialize val in packet_getsockopt()
From: Alexander Potapenko @ 2017-04-25 16:38 UTC (permalink / raw)
  To: David Miller
  Cc: Dmitriy Vyukov, Kostya Serebryany, Eric Dumazet, Alexey Kuznetsov,
	LKML, Networking
In-Reply-To: <20170425.123201.1910437101082999848.davem@davemloft.net>

On Tue, Apr 25, 2017 at 6:32 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Potapenko <glider@google.com>
> Date: Tue, 25 Apr 2017 18:27:04 +0200
>
>> On Tue, Apr 25, 2017 at 5:44 PM, David Miller <davem@davemloft.net> wrote:
>>> From: Alexander Potapenko <glider@google.com>
>>> Date: Mon, 24 Apr 2017 14:59:14 +0200
>>>
>>>> In the case getsockopt() is called with PACKET_HDRLEN and optlen < 4
>>>> |val| remains uninitialized and the syscall may behave differently
>>>> depending on its value. This doesn't have security consequences (as the
>>>> uninit bytes aren't copied back), but it's still cleaner to initialize
>>>> |val| and ensure optlen is not less than sizeof(int).
>>>>
>>>> This bug has been detected with KMSAN.
>>>>
>>>> Signed-off-by: Alexander Potapenko <glider@google.com>
>>>> ---
>>>> v2: - if len < sizeof(int), make it 0
>>>
>>> No, you should signal an error if the len is too small.
>> According to manpages, only setsockopt() may return EINVAL.
>> Is it ok to change the behavior of getsockopt() to return EINVAL in
>> this case? (I.e. won't we break existing users that don't expect it?)
>
> They are currently getting corrupt data depending upon the endianness,
> so -EINVAL is a serious improvement.
On a second glance getsockopt() already returns -EINVAL in some cases,
so man is already imprecise.


-- 
Alexander Potapenko
Software Engineer

Google Germany GmbH
Erika-Mann-Straße, 33
80636 München

Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg

^ permalink raw reply

* Re: net: heap out-of-bounds in fib6_clean_node/rt6_fill_node/fib6_age/fib6_prune_clone
From: Andrey Konovalov @ 2017-04-25 16:36 UTC (permalink / raw)
  To: David Ahern
  Cc: Dmitry Vyukov, Mahesh Bandewar, Eric Dumazet, David Miller,
	Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, LKML, Cong Wang, syzkaller
In-Reply-To: <381a3e2d-dbf8-5e26-e6b2-4dc291d49dbe@cumulusnetworks.com>

On Tue, Apr 25, 2017 at 5:56 PM, David Ahern <dsa@cumulusnetworks.com> wrote:
> On 3/4/17 11:57 AM, Dmitry Vyukov wrote:
>> ==================================================================
>> BUG: KASAN: slab-out-of-bounds in rt6_dump_route+0x293/0x2f0
>> net/ipv6/route.c:3551 at addr ffff88007e523694
>> Read of size 4 by task syz-executor3/24426
>> CPU: 2 PID: 24426 Comm: syz-executor3 Not tainted 4.10.0+ #293
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> Call Trace:
>>  __dump_stack lib/dump_stack.c:16 [inline]
>>  dump_stack+0x2fb/0x3fd lib/dump_stack.c:52
>>  kasan_object_err+0x1c/0x90 mm/kasan/report.c:166
>>  print_address_description mm/kasan/report.c:208 [inline]
>>  kasan_report_error mm/kasan/report.c:292 [inline]
>>  kasan_report.part.2+0x1b0/0x460 mm/kasan/report.c:314
>>  kasan_report mm/kasan/report.c:334 [inline]
>>  __asan_report_load4_noabort+0x29/0x30 mm/kasan/report.c:334
>>  rt6_dump_route+0x293/0x2f0 net/ipv6/route.c:3551
>>  fib6_dump_node+0x101/0x1a0 net/ipv6/ip6_fib.c:315
>>  fib6_walk_continue+0x4b3/0x620 net/ipv6/ip6_fib.c:1576
>>  fib6_walk+0x91/0xf0 net/ipv6/ip6_fib.c:1621
>>  fib6_dump_table net/ipv6/ip6_fib.c:374 [inline]
>>  inet6_dump_fib+0x832/0xea0 net/ipv6/ip6_fib.c:447
>
> My expectation is that this one is fixed with the ipv6 patch I have sent
> (not yet committed). Are you seeing this backtrace with that patch?

Before applying your patch I was hitting reports related to fib6 all the time.
I've stopped seeing them for some time after I applied your patch.
However today another one was triggered:

==================================================================
BUG: KASAN: slab-out-of-bounds in find_rr_leaf net/ipv6/route.c:722
[inline] at addr ffff880033dddaa8
BUG: KASAN: slab-out-of-bounds in rt6_select net/ipv6/route.c:758
[inline] at addr ffff880033dddaa8
BUG: KASAN: slab-out-of-bounds in ip6_pol_route+0x1b1e/0x1ba0
net/ipv6/route.c:1091 at addr ffff880033dddaa8
Read of size 4 by task syz-executor7/9808
CPU: 2 PID: 9808 Comm: syz-executor7 Not tainted 4.11.0-rc8+ #268
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:16 [inline]
 dump_stack+0x192/0x22d lib/dump_stack.c:52
 kasan_object_err+0x1c/0x70 mm/kasan/report.c:164
 print_address_description mm/kasan/report.c:202 [inline]
 kasan_report_error mm/kasan/report.c:291 [inline]
 kasan_report+0x252/0x510 mm/kasan/report.c:347
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:367
 find_rr_leaf net/ipv6/route.c:722 [inline]
 rt6_select net/ipv6/route.c:758 [inline]
 ip6_pol_route+0x1b1e/0x1ba0 net/ipv6/route.c:1091
 ip6_pol_route_output+0x4c/0x60 net/ipv6/route.c:1212
 fib6_rule_action+0x261/0x8a0 net/ipv6/fib6_rules.c:100
 fib_rules_lookup+0x34b/0xae0 net/core/fib_rules.c:265
 fib6_rule_lookup+0x175/0x360 net/ipv6/fib6_rules.c:44
 ip6_route_output_flags+0x260/0x2f0 net/ipv6/route.c:1240
 ip6_route_output include/net/ip6_route.h:79 [inline]
 ip6_dst_lookup_tail+0xe59/0x1640 net/ipv6/ip6_output.c:959
 ip6_dst_lookup_flow+0xb1/0x260 net/ipv6/ip6_output.c:1082
 ip6_sk_dst_lookup_flow+0x2c6/0x7f0 net/ipv6/ip6_output.c:1113
 udpv6_sendmsg+0x2350/0x3310 net/ipv6/udp.c:1219
 inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1a/0xa9
RIP: 0033:0x4458d9
RSP: 002b:00007ff3a5343b58 EFLAGS: 00000282 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000708000 RCX: 00000000004458d9
RDX: 0000000000000fa8 RSI: 0000000020000000 RDI: 0000000000000005
RBP: 0000000000004300 R08: 0000000020006000 R09: 000000000000001c
R10: 0000000000040000 R11: 0000000000000282 R12: 00000000006e33c0
R13: 0000000000000005 R14: 0000000000000029 R15: 0000000000000023
Object at ffff880033ddd940, in cache ip_dst_cache size: 224
Allocated:
PID = 9717
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_kmalloc+0xad/0xe0 mm/kasan/kasan.c:616
 kasan_slab_alloc+0x12/0x20 mm/kasan/kasan.c:555
 slab_post_alloc_hook mm/slab.h:456 [inline]
 slab_alloc_node mm/slub.c:2718 [inline]
 slab_alloc mm/slub.c:2726 [inline]
 kmem_cache_alloc+0xa8/0x160 mm/slub.c:2731
 dst_alloc+0x11b/0x1a0 net/core/dst.c:209
 rt_dst_alloc+0xf0/0x5d0 net/ipv4/route.c:1497
 __mkroute_output net/ipv4/route.c:2181 [inline]
 __ip_route_output_key_hash+0xdc4/0x2930 net/ipv4/route.c:2391
 __ip_route_output_key include/net/route.h:123 [inline]
 ip_route_output_flow+0x29/0xa0 net/ipv4/route.c:2477
 ip_route_output_key include/net/route.h:133 [inline]
 sctp_v4_get_dst+0x606/0x1420 net/sctp/protocol.c:458
 sctp_transport_route+0xa8/0x420 net/sctp/transport.c:287
 sctp_assoc_add_peer+0x5a5/0x1470 net/sctp/associola.c:656
 sctp_sendmsg+0x18a8/0x3b50 net/sctp/socket.c:1871
 inet_sendmsg+0x164/0x490 net/ipv4/af_inet.c:762
 sock_sendmsg_nosec net/socket.c:633 [inline]
 sock_sendmsg+0xca/0x110 net/socket.c:643
 SYSC_sendto+0x660/0x810 net/socket.c:1696
 SyS_sendto+0x40/0x50 net/socket.c:1664
 entry_SYSCALL_64_fastpath+0x1a/0xa9
Freed:
PID = 868
 save_stack_trace+0x16/0x20 arch/x86/kernel/stacktrace.c:59
 save_stack+0x43/0xd0 mm/kasan/kasan.c:513
 set_track mm/kasan/kasan.c:525 [inline]
 kasan_slab_free+0x73/0xc0 mm/kasan/kasan.c:589
 slab_free_hook mm/slub.c:1357 [inline]
 slab_free_freelist_hook mm/slub.c:1379 [inline]
 slab_free mm/slub.c:2961 [inline]
 kmem_cache_free+0x72/0x190 mm/slub.c:2983
 dst_destroy+0x24c/0x390 net/core/dst.c:269
 dst_free include/net/dst.h:428 [inline]
 rt_fibinfo_free net/ipv4/fib_semantics.c:155 [inline]
 free_fib_info_rcu+0x852/0xa10 net/ipv4/fib_semantics.c:214
 __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
 rcu_do_batch.isra.65+0x6de/0xbd0 kernel/rcu/tree.c:2879
 invoke_rcu_callbacks kernel/rcu/tree.c:3142 [inline]
 __rcu_process_callbacks kernel/rcu/tree.c:3109 [inline]
 rcu_process_callbacks+0x23f/0x810 kernel/rcu/tree.c:3126
 __do_softirq+0x253/0x78b kernel/softirq.c:284
Memory state around the buggy address:
 ffff880033ddd980: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff880033ddda00: 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc
>ffff880033ddda80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                                  ^
 ffff880033dddb00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff880033dddb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
==================================================================.

>
> --
> You received this message because you are subscribed to the Google Groups "syzkaller" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to syzkaller+unsubscribe@googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

^ permalink raw reply

* Re: [PATCH] macsec: dynamically allocate space for sglist
From: Sabrina Dubroca @ 2017-04-25 16:36 UTC (permalink / raw)
  To: Jason A. Donenfeld; +Cc: Netdev, LKML, David Miller, stable, security
In-Reply-To: <20170425152300.3830-1-Jason@zx2c4.com>

2017-04-25, 17:23:00 +0200, Jason A. Donenfeld wrote:
> We call skb_cow_data, which is good anyway to ensure we can actually
> modify the skb as such (another error from prior). Now that we have the
> number of fragments required, we can safely allocate exactly that amount
> of memory.
> 
> Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
> Cc: Sabrina Dubroca <sd@queasysnail.net>
> Cc: security@kernel.org
> Cc: stable@vger.kernel.org
> ---
>  drivers/net/macsec.c | 25 ++++++++++++++++++++-----
>  1 file changed, 20 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/macsec.c b/drivers/net/macsec.c
> index dbab05afcdbe..56dafdee4c9c 100644
> --- a/drivers/net/macsec.c
> +++ b/drivers/net/macsec.c
[...]
> @@ -917,6 +926,7 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
>  {
>  	int ret;
>  	struct scatterlist *sg;
> +	struct sk_buff *trailer;
>  	unsigned char *iv;
>  	struct aead_request *req;
>  	struct macsec_eth_header *hdr;
> @@ -927,7 +937,12 @@ static struct sk_buff *macsec_decrypt(struct sk_buff *skb,
>  	if (!skb)
>  		return ERR_PTR(-ENOMEM);
>  
> -	req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg);
> +	ret = skb_cow_data(skb, 0, &trailer);
> +	if (unlikely(ret < 0)) {
> +		kfree_skb(skb);
> +		return ERR_PTR(ret);
> +	}
> +	req = macsec_alloc_req(rx_sa->key.tfm, &iv, &sg, ret);
>  	if (!req) {
>  		kfree_skb(skb);
>  		return ERR_PTR(-ENOMEM);

There's a problem here (and in macsec_encrypt): you need to update the
call to sg_init_table, like I did in my patch.  Otherwise,
sg_init_table() is going to access sg[MAX_SKB_FRAGS], which may be
past what you allocated.

How did you test this? ;)

-- 
Sabrina

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox