Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next] bpf/verifier: enable ctx + const + 0.
From: William Tu @ 2018-05-02 17:54 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexei Starovoitov, Linux Kernel Network Developers,
	Yonghong Song, Yifeng Sun
In-Reply-To: <e4f93f62-d7f4-0851-d1c2-34a13c2af4c6@iogearbox.net>

On Wed, May 2, 2018 at 1:29 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
> On 05/02/2018 06:52 AM, Alexei Starovoitov wrote:
>> On Tue, May 01, 2018 at 09:35:29PM -0700, William Tu wrote:
>>>
>>>> How did you test this patch?
>>>>
>>> Without the patch, the test case will fail.
>>> With the patch, the test case passes.
>>
>> Please test it with real program and you'll see crashes and garbage returned.
>
> +1, *convert_ctx_access() use bpf_insn's off to determine what to rewrite,
> so this is definitely buggy, and wasn't properly tested as it should have
> been. The test case is also way too simple, just the LDX and then doing a
> return 0 will get you past verifier, but won't give you anything in terms
> of runtime testing that test_verifier is doing. A single test case for a
> non trivial verifier change like this is also _completely insufficient_,
> this really needs to test all sort of weird corner cases (involving out of
> bounds accesses, overflows, etc).

Thanks, now I understand.
It's much more complicated than I thought.

William

^ permalink raw reply

* Re: [PATCH net-next 1/4] ipv6: Calculate hash thresholds for IPv6 nexthops
From: Ido Schimmel @ 2018-05-02 17:52 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, Ido Schimmel, netdev, davem, roopa, nikolay, pch,
	jkbs, yoshfuji, mlxsw
In-Reply-To: <20180502172106.GA12986@splinter>

On Wed, May 02, 2018 at 08:21:06PM +0300, Ido Schimmel wrote:
> On Wed, May 02, 2018 at 09:43:50AM -0700, Eric Dumazet wrote:
> > 
> > 
> > On 01/09/2018 07:43 PM, David Ahern wrote:
> > > On 1/9/18 7:40 AM, Ido Schimmel wrote:
> > >> Before we convert IPv6 to use hash-threshold instead of modulo-N, we
> > >> first need each nexthop to store its region boundary in the hash
> > >> function's output space.
> > >>
> > >> The boundary is calculated by dividing the output space equally between
> > >> the different active nexthops. That is, nexthops that are not dead or
> > >> linkdown.
> > >>
> > >> The boundaries are rebalanced whenever a nexthop is added or removed to
> > >> a multipath route and whenever a nexthop becomes active or inactive.
> > >>
> > >> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> > >> ---
> > >>  include/net/ip6_fib.h   |  1 +
> > >>  include/net/ip6_route.h |  7 ++++
> > >>  net/ipv6/ip6_fib.c      |  8 ++---
> > >>  net/ipv6/route.c        | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
> > >>  4 files changed, 106 insertions(+), 6 deletions(-)
> > >>
> > > 
> > > LGTM.
> > > Acked-by: David Ahern <dsahern@gmail.com>
> > > 
> > 
> > For some reason I have a divide by zero error booting my hosts with latest net tree.
> > 
> > What guarantee do we have that total is not zero when rt6_upper_bound_set() is called ?
> 
> Thanks for the report, Eric. I believe I didn't cover all the cases and
> 'rt6i_nh_weight' might be 0 is some cases. I'll try to reproduce and
> work on a fix.

Hmmm, I think it's due to commit edd7ceb78296 ("ipv6: Allow non-gateway
ECMP for IPv6") which allows routes without a gateway (such as those
configured using slaac) to have siblings.

Can you please check if reverting the patch / applying the below fixes
the issue?

diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index f4d61736c41a..129dd4f4b264 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -3606,6 +3606,7 @@ struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
 	rt->dst.input = ip6_input;
 	rt->dst.output = ip6_output;
 	rt->rt6i_idev = idev;
+	rt->rt6i_nh_weight = 1;
 
 	rt->rt6i_protocol = RTPROT_KERNEL;
 	rt->rt6i_flags = RTF_UP | RTF_NONEXTHOP;

^ permalink raw reply related

* Re: [PATCH net-next v9 2/4] net: Introduce generic failover module
From: Samudrala, Sridhar @ 2018-05-02 17:51 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown
In-Reply-To: <20180502161542.GI19250@nanopsycho>



On 5/2/2018 9:15 AM, Jiri Pirko wrote:
> Sat, Apr 28, 2018 at 11:06:01AM CEST, jiri@resnulli.us wrote:
>> Fri, Apr 27, 2018 at 07:06:58PM CEST, sridhar.samudrala@intel.com wrote:
> [...]
>
>
>>> +
>>> +	err = netdev_rx_handler_register(slave_dev, net_failover_handle_frame,
>>> +					 failover_dev);
>>> +	if (err) {
>>> +		netdev_err(slave_dev, "can not register failover rx handler (err = %d)\n",
>>> +			   err);
>>> +		goto err_handler_register;
>>> +	}
>>> +
>>> +	err = netdev_upper_dev_link(slave_dev, failover_dev, NULL);
>> Please use netdev_master_upper_dev_link().
> Don't forget to fillup struct netdev_lag_upper_info - NETDEV_LAG_TX_TYPE_ACTIVEBACKUP
>
>
> Also, please call netdev_lower_state_changed() when the active slave
> device changes from primary->backup of backup->primary and whenever link
> state of a slave changes
>
Sure. will look into it.  Do you think this will help with the issue
you saw with having to change mac on standy twice to get the init scripts
working? We are now going to block changing the mac on both standby and
failover.

Also, i was wondering if we should set dev->flags to IFF_MASTER on failover
and IFF_SLAVE on primary and standby. netvsc does this.
Does this help with the init scripts and network manager to skip slave
devices for dhcp requests?

^ permalink raw reply

* [bpf PATCH 3/3] bpf: sockmap, fix error handling in redirect failures
From: John Fastabend @ 2018-05-02 17:47 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev
In-Reply-To: <20180502173954.17875.19624.stgit@john-Precision-Tower-5810>

When a redirect failure happens we release the buffers in-flight
without calling a sk_mem_uncharge(), the uncharge is called before
dropping the sock lock for the redirecte, however we missed updating
the ring start index. When no apply actions are in progress this
is OK because we uncharge the entire buffer before the redirect.
But, when we have apply logic running its possible that only a
portion of the buffer is being redirected. In this case we only
do memory accounting for the buffer slice being redirected and
expect to be able to loop over the BPF program again and/or if
a sock is closed uncharge the memory at sock destruct time.

With an invalid start index however the program logic looks at
the start pointer index, checks the length, and when seeing the
length is zero (from the initial release and failure to update
the pointer) aborts without uncharging/releasing the remaining
memory.

The fix for this is simply to update the start index. To avoid
fixing this error in two locations we do a small refactor and
remove one case where it is open-coded. Then fix it in the
single function.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   26 +++++++++++---------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 052c313..7e3c4cd 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -393,7 +393,8 @@ static void return_mem_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
 	} while (i != md->sg_end);
 }
 
-static void free_bytes_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
+static void free_bytes_sg(struct sock *sk, int bytes,
+			  struct sk_msg_buff *md, bool charge)
 {
 	struct scatterlist *sg = md->sg_data;
 	int i = md->sg_start, free;
@@ -403,11 +404,13 @@ static void free_bytes_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
 		if (bytes < free) {
 			sg[i].length -= bytes;
 			sg[i].offset += bytes;
-			sk_mem_uncharge(sk, bytes);
+			if (charge)
+				sk_mem_uncharge(sk, bytes);
 			break;
 		}
 
-		sk_mem_uncharge(sk, sg[i].length);
+		if (charge)
+			sk_mem_uncharge(sk, sg[i].length);
 		put_page(sg_page(&sg[i]));
 		bytes -= sg[i].length;
 		sg[i].length = 0;
@@ -418,6 +421,7 @@ static void free_bytes_sg(struct sock *sk, int bytes, struct sk_msg_buff *md)
 		if (i == MAX_SKB_FRAGS)
 			i = 0;
 	}
+	md->sg_start = i;
 }
 
 static int free_sg(struct sock *sk, int start, struct sk_msg_buff *md)
@@ -578,7 +582,7 @@ static int bpf_tcp_sendmsg_do_redirect(struct sock *sk, int send,
 {
 	struct smap_psock *psock;
 	struct scatterlist *sg;
-	int i, err, free = 0;
+	int i, err = 0;
 	bool ingress = !!(md->flags & BPF_F_INGRESS);
 
 	sg = md->sg_data;
@@ -607,16 +611,8 @@ static int bpf_tcp_sendmsg_do_redirect(struct sock *sk, int send,
 out_rcu:
 	rcu_read_unlock();
 out:
-	i = md->sg_start;
-	while (sg[i].length) {
-		free += sg[i].length;
-		put_page(sg_page(&sg[i]));
-		sg[i].length = 0;
-		i++;
-		if (i == MAX_SKB_FRAGS)
-			i = 0;
-	}
-	return free;
+	free_bytes_sg(NULL, send, md, false);
+	return err;
 }
 
 static inline void bpf_md_init(struct smap_psock *psock)
@@ -720,7 +716,7 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
 		break;
 	case __SK_DROP:
 	default:
-		free_bytes_sg(sk, send, m);
+		free_bytes_sg(sk, send, m, true);
 		apply_bytes_dec(psock, send);
 		*copied -= send;
 		psock->sg_size -= send;

^ permalink raw reply related

* [bpf PATCH 2/3] bpf: sockmap, zero sg_size on error when buffer is released
From: John Fastabend @ 2018-05-02 17:47 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev
In-Reply-To: <20180502173954.17875.19624.stgit@john-Precision-Tower-5810>

When an error occurs during a redirect we have two cases that need
to be handled (i) we have a cork'ed buffer (ii) we have a normal
sendmsg buffer.

In the cork'ed buffer case we don't currently support recovering from
errors in a redirect action. So the buffer is released and the error
should _not_ be pushed back to the caller of sendmsg/sendpage. The
rationale here is the user will get an error that relates to old
data that may have been sent by some arbitrary thread on that sock.
Instead we simple consume the data and tell the user that the data
has been consumed. We may add proper error recovery in the future.
However, this patch fixes a bug where the bytes outstanding counter
sg_size was not zeroed. This could result in a case where if the user
has both a cork'ed action and apply action in progress we may
incorrectly call into the BPF program when the user expected an
old verdict to be applied via the apply action. I don't have a use
case where using apply and cork at the same time is valid but we
never explicitly reject it because it should work fine. This patch
ensures the sg_size is zeroed so we don't have this case.

In the normal sendmsg buffer case (no cork data) we also do not
zero sg_size. Again this can confuse the apply logic when the logic
calls into the BPF program when the BPF programmer expected the old
verdict to remain. So ensure we set sg_size to zero here as well. And
additionally to keep the psock state in-sync with the sk_msg_buff
release all the memory as well. Previously we did this before
returning to the user but this left a gap where psock and sk_msg_buff
states were out of sync which seems fragile. No additional overhead
is taken here except for a call to check the length and realize its
already been freed. This is in the error path as well so in my
opinion lets have robust code over optimized error paths.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |   15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 943929a..052c313 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -701,15 +701,22 @@ static int bpf_exec_tx_verdict(struct smap_psock *psock,
 		err = bpf_tcp_sendmsg_do_redirect(redir, send, m, flags);
 		lock_sock(sk);
 
+		if (unlikely(err < 0)) {
+			free_start_sg(sk, m);
+			psock->sg_size = 0;
+			if (!cork)
+				*copied -= send;
+		} else {
+			psock->sg_size -= send;
+		}
+
 		if (cork) {
 			free_start_sg(sk, m);
+			psock->sg_size = 0;
 			kfree(m);
 			m = NULL;
+			err = 0;
 		}
-		if (unlikely(err))
-			*copied -= err;
-		else
-			psock->sg_size -= send;
 		break;
 	case __SK_DROP:
 	default:

^ permalink raw reply related

* [bpf PATCH 1/3] bpf: sockmap, fix scatterlist update on error path in send with apply
From: John Fastabend @ 2018-05-02 17:47 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev
In-Reply-To: <20180502173954.17875.19624.stgit@john-Precision-Tower-5810>

When the call to do_tcp_sendpage() fails to send the complete block
requested we either retry if only a partial send was completed or
abort if we receive a error less than or equal to zero. Before
returning though we must update the scatterlist length/offset to
account for any partial send completed.

Before this patch we did this at the end of the retry loop, but
this was buggy when used while applying a verdict to fewer bytes
than in the scatterlist. When the scatterlist length was being set
we forgot to account for the apply logic reducing the size variable.
So the result was we chopped off some bytes in the scatterlist without
doing proper cleanup on them. This results in a WARNING when the
sock is tore down because the bytes have previously been charged to
the socket but are never uncharged.

The simple fix is to simply do the accounting inside the retry loop
subtracting from the absolute scatterlist values rather than trying
to accumulate the totals and subtract at the end.

Reported-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
---
 kernel/bpf/sockmap.c |    5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/bpf/sockmap.c b/kernel/bpf/sockmap.c
index 634415c..943929a 100644
--- a/kernel/bpf/sockmap.c
+++ b/kernel/bpf/sockmap.c
@@ -326,6 +326,9 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
 			if (ret > 0) {
 				if (apply)
 					apply_bytes -= ret;
+
+				sg->offset += ret;
+				sg->length -= ret;
 				size -= ret;
 				offset += ret;
 				if (uncharge)
@@ -333,8 +336,6 @@ static int bpf_tcp_push(struct sock *sk, int apply_bytes,
 				goto retry;
 			}
 
-			sg->length = size;
-			sg->offset = offset;
 			return ret;
 		}
 

^ permalink raw reply related

* [bpf PATCH 0/3] sockmap error path fixes
From: John Fastabend @ 2018-05-02 17:47 UTC (permalink / raw)
  To: borkmann, ast; +Cc: netdev

When I added the test_sockmap to selftests I mistakenly changed the
test logic a bit. The result of this was on redirect cases we ended up
choosing the wrong sock from the BPF program and ended up sending to a
socket that had no receive handler. The result was the actual receive
handler, running on a different socket, is timing out and closing the
socket. This results in errors (-EPIPE to be specific) on the sending
side. Typically happening if the sender does not complete the send
before the receive side times out. So depending on timing and the size
of the send we may get errors. This exposed some bugs in the sockmap
error path handling.

This series fixes the errors. The primary issue is we did not do proper
memory accounting in these cases which resulted in missing a
sk_mem_uncharge(). This happened in the redirect path and in one case
on the normal send path. See the three patches for the details.

The other take-away from this is we need to fix the test_sockmap and
also add more negative test cases. That will happen in bpf-next.

Finally, I tested this using the existing test_sockmap program, the
older sockmap sample test script, and a few real use cases with
Cilium. All of these seem to be in working correctly.

---

John Fastabend (3):
      bpf: sockmap, fix scatterlist update on error path in send with apply
      bpf: sockmap, zero sg_size on error when buffer is released
      bpf: sockmap, fix error handling in redirect failures


 kernel/bpf/sockmap.c |   46 +++++++++++++++++++++++++---------------------
 1 file changed, 25 insertions(+), 21 deletions(-)

^ permalink raw reply

* Re: [RFC] inet: add bound ports statistic
From: David Miller @ 2018-05-02 17:35 UTC (permalink / raw)
  To: stephen; +Cc: netdev, sthemmin
In-Reply-To: <20180502172531.3458-1-sthemmin@microsoft.com>

From: Stephen Hemminger <stephen@networkplumber.org>
Date: Wed,  2 May 2018 10:25:31 -0700

> diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
> index 881ac6d046f2..8a70465c240c 100644
> --- a/net/ipv4/inet_connection_sock.c
> +++ b/net/ipv4/inet_connection_sock.c
> @@ -309,6 +309,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
>  				     net, head, port);
>  	if (!tb)
>  		goto fail_unlock;
> +	++head->count;
>  tb_found:
>  	if (!hlist_empty(&tb->owners)) {
>  		if (sk->sk_reuse == SK_FORCE_REUSE)

Are you really able to commit to the counter increment here?  We can still
fail after this point.

^ permalink raw reply

* Re: [v2 PATCH 1/1] tg3: fix meaningless hw_stats reading after tg3_halt memset 0 hw_stats
From: Michael Chan @ 2018-05-02 17:32 UTC (permalink / raw)
  To: Zumeng Chen
  Cc: Netdev, open list, Siva Reddy Kallam,
	prashant.sreedharan@broadcom.com, David Miller, Zumeng Chen
In-Reply-To: <5af186f8-9718-c295-4e34-e84dd78ea157@gmail.com>

On Wed, May 2, 2018 at 3:27 AM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
> On 2018年05月02日 13:12, Michael Chan wrote:
>>
>> On Tue, May 1, 2018 at 5:42 PM, Zumeng Chen <zumeng.chen@gmail.com> wrote:
>>
>>> diff --git a/drivers/net/ethernet/broadcom/tg3.h
>>> b/drivers/net/ethernet/broadcom/tg3.h
>>> index 3b5e98e..c61d83c 100644
>>> --- a/drivers/net/ethernet/broadcom/tg3.h
>>> +++ b/drivers/net/ethernet/broadcom/tg3.h
>>> @@ -3102,6 +3102,7 @@ enum TG3_FLAGS {
>>>          TG3_FLAG_ROBOSWITCH,
>>>          TG3_FLAG_ONE_DMA_AT_ONCE,
>>>          TG3_FLAG_RGMII_MODE,
>>> +       TG3_FLAG_HALT,
>>
>> I think you should be able to use the existing INIT_COMPLETE flag
>
>
> No,  it will bring the uncertain factors into the existed complicate logic
> of INIT_COMPLETE.
> And I think it's very simple logic here to fix the meaningless hw_stats
> reading and the problem
> of commit f5992b72. I even suspect if you have read INIT_COMPLETE related
> codes carefully.
>

We should use an existing flag whenever appropriate, instead of adding
yet another flag to do similar things. I've looked at the code briefly
and believe that INIT_COMPLETE will work.  If you think it won't work,
please be specific and point out why it won't work.  Thanks.

^ permalink raw reply

* Re: [PATCH net-next v2 0/4] net/smc: small features 2018/04/30
From: David Miller @ 2018-05-02 17:30 UTC (permalink / raw)
  To: ubraun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl
In-Reply-To: <20180502145647.36147-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Wed,  2 May 2018 16:56:43 +0200

> here are 4 smc patches for net-next covering small new features
> in different areas:
>    * link health check
>    * diagnostics for IPv6 smc sockets
>    * ioctl
>    * improvement for vlan determination
> 
> v2 changes:
>    * better title
>    * patch 2 - remove compile problem for disabled CONFIG_IPV6

Series applied, thank you.

^ permalink raw reply

* Re: [PATCH bpf-next 1/2] bpf: enable stackmap with build_id in nmi context
From: Peter Zijlstra @ 2018-05-02 17:30 UTC (permalink / raw)
  To: Song Liu; +Cc: Networking, Kernel Team, Alexei Starovoitov, Daniel Borkmann
In-Reply-To: <30F32ACF-5155-459B-BD47-5060CCA52788@fb.com>

On Wed, May 02, 2018 at 04:48:32PM +0000, Song Liu wrote:
> > It's broken though, I've bet you've never actually ran this with lockdep
> > enabled for example.
> 
> I am not following here. I just run the new selftest with CONFIG_LOCKDEP on, 
> and got no warning for this. 

Weird, I would be expecting complaints about releasing an unheld lock.

nmi_enter(),nmi_exit() have lockdep_off(),lockdep_on() resp. Which means
that the down_trylock() will not be recorded. The up, which is done from
IRQ context, will not be so supressed and should hit
print_unlock_imbalance_bug().

> > Also, you set work->sem before you do trylock, if the trylock fails you
> > return early and keep work->sem set, which will thereafter always result
> > in irq_work_busy.
> 
> work->sem was set after down_read_trylock(). I guess you misread the patch?

Argh, yes indeed. Sorry.

^ permalink raw reply

* Re: [PATCH net 1/1] net/smc: restrict non-blocking connect finish
From: David Miller @ 2018-05-02 17:28 UTC (permalink / raw)
  To: ubraun; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl
In-Reply-To: <20180502145356.32913-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.ibm.com>
Date: Wed,  2 May 2018 16:53:56 +0200

> The smc_poll code tries to finish connect() if the socket is in
> state SMC_INIT and polling of the internal CLC-socket returns with
> EPOLLOUT. This makes sense for a select/poll call following a connect
> call, but not without preceding connect().
> With this patch smc_poll starts connect logic only, if the CLC-socket
> is no longer in its initial state TCP_CLOSE.
> 
> In addition, a poll error on the internal CLC-socket is always
> propagated to the SMC socket.
> 
> With this patch the code path mentioned by syzbot
> https://syzkaller.appspot.com/bug?extid=03faa2dc16b8b64be396
> is no longer possible.
> 
> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
> Reported-by: syzbot+03faa2dc16b8b64be396@syzkaller.appspotmail.com

Applied and queued up for -stable, thanks.

^ permalink raw reply

* [RFC] inet: add bound ports statistic
From: Stephen Hemminger @ 2018-05-02 17:25 UTC (permalink / raw)
  To: netdev; +Cc: Stephen Hemminger, Stephen Hemminger

This adds a number of bound ports which fixes socket summary
command.  The ss -s has been broken since changes to slab info
and this is one way to recover the missing value by adding a
field onto /proc/net/sockstat.

Since this is an informational value only, there is no need
for locking.


Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
---
 include/net/inet_hashtables.h    |  3 +++
 include/net/inet_timewait_sock.h |  2 ++
 net/dccp/proto.c                 |  1 +
 net/ipv4/inet_connection_sock.c  |  1 +
 net/ipv4/inet_hashtables.c       | 22 +++++++++++++++++++---
 net/ipv4/inet_timewait_sock.c    |  8 +++++---
 net/ipv4/proc.c                  |  5 +++--
 net/ipv4/tcp.c                   |  1 +
 8 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index 9141e95529e7..dc74f7af4446 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -103,6 +103,7 @@ static inline struct net *ib_net(struct inet_bind_bucket *ib)
 
 struct inet_bind_hashbucket {
 	spinlock_t		lock;
+	int			count;
 	struct hlist_head	chain;
 };
 
@@ -193,7 +194,9 @@ inet_bind_bucket_create(struct kmem_cache *cachep, struct net *net,
 			struct inet_bind_hashbucket *head,
 			const unsigned short snum);
 void inet_bind_bucket_destroy(struct kmem_cache *cachep,
+			      struct inet_bind_hashbucket *head,
 			      struct inet_bind_bucket *tb);
+int inet_bind_bucket_count(struct proto *prot);
 
 static inline u32 inet_bhashfn(const struct net *net, const __u16 lport,
 			       const u32 bhash_size)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index c7be1ca8e562..4cdb8034ad80 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -87,7 +87,9 @@ static inline struct inet_timewait_sock *inet_twsk(const struct sock *sk)
 void inet_twsk_free(struct inet_timewait_sock *tw);
 void inet_twsk_put(struct inet_timewait_sock *tw);
 
+struct inet_bind_hashbucket;
 void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
+			   struct inet_bind_hashbucket *head,
 			   struct inet_hashinfo *hashinfo);
 
 struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 84cd4e3fd01b..25f03e62cfea 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -1208,6 +1208,7 @@ static int __init dccp_init(void)
 	for (i = 0; i < dccp_hashinfo.bhash_size; i++) {
 		spin_lock_init(&dccp_hashinfo.bhash[i].lock);
 		INIT_HLIST_HEAD(&dccp_hashinfo.bhash[i].chain);
+		dccp_hashinfo.bhash[i].count = 0;
 	}
 
 	rc = dccp_mib_init();
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 881ac6d046f2..8a70465c240c 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -309,6 +309,7 @@ int inet_csk_get_port(struct sock *sk, unsigned short snum)
 				     net, head, port);
 	if (!tb)
 		goto fail_unlock;
+	++head->count;
 tb_found:
 	if (!hlist_empty(&tb->owners)) {
 		if (sk->sk_reuse == SK_FORCE_REUSE)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index 31ff46daae97..f7c7a589bfb3 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -58,6 +58,18 @@ static u32 sk_ehashfn(const struct sock *sk)
 			    sk->sk_daddr, sk->sk_dport);
 }
 
+/* Count how many any entries are in the bind hash table */
+int inet_bind_bucket_count(struct proto *prot)
+{
+	struct inet_hashinfo *hinfo = prot->h.hashinfo;
+	int i, ports = 0;
+
+	for (i = 0; i < hinfo->bhash_size; i++)
+		ports += hinfo->bhash[i].count;
+
+	return ports;
+}
+
 /*
  * Allocate and initialize a new local port bind bucket.
  * The bindhash mutex for snum's hash chain must be held here.
@@ -76,6 +88,7 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 		tb->fastreuseport = 0;
 		INIT_HLIST_HEAD(&tb->owners);
 		hlist_add_head(&tb->node, &head->chain);
+		++head->count;
 	}
 	return tb;
 }
@@ -83,10 +96,13 @@ struct inet_bind_bucket *inet_bind_bucket_create(struct kmem_cache *cachep,
 /*
  * Caller must hold hashbucket lock for this tb with local BH disabled
  */
-void inet_bind_bucket_destroy(struct kmem_cache *cachep, struct inet_bind_bucket *tb)
+void inet_bind_bucket_destroy(struct kmem_cache *cachep,
+			      struct inet_bind_hashbucket *head,
+			      struct inet_bind_bucket *tb)
 {
 	if (hlist_empty(&tb->owners)) {
 		__hlist_del(&tb->node);
+		--head->count;
 		kmem_cache_free(cachep, tb);
 	}
 }
@@ -115,7 +131,7 @@ static void __inet_put_port(struct sock *sk)
 	__sk_del_bind_node(sk);
 	inet_csk(sk)->icsk_bind_hash = NULL;
 	inet_sk(sk)->inet_num = 0;
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, head, tb);
 	spin_unlock(&head->lock);
 }
 
@@ -756,7 +772,7 @@ int __inet_hash_connect(struct inet_timewait_death_row *death_row,
 		inet_ehash_nolisten(sk, (struct sock *)tw);
 	}
 	if (tw)
-		inet_twsk_bind_unhash(tw, hinfo);
+		inet_twsk_bind_unhash(tw, head, hinfo);
 	spin_unlock(&head->lock);
 	if (tw)
 		inet_twsk_deschedule_put(tw);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 88c5069b5d20..dd888c52f958 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -26,7 +26,8 @@
  *	Returns 1 if caller should call inet_twsk_put() after lock release.
  */
 void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
-			  struct inet_hashinfo *hashinfo)
+			   struct inet_bind_hashbucket *head,
+			   struct inet_hashinfo *hashinfo)
 {
 	struct inet_bind_bucket *tb = tw->tw_tb;
 
@@ -35,7 +36,8 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
 
 	__hlist_del(&tw->tw_bind_node);
 	tw->tw_tb = NULL;
-	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep, tb);
+	inet_bind_bucket_destroy(hashinfo->bind_bucket_cachep,
+				 head, tb);
 	__sock_put((struct sock *)tw);
 }
 
@@ -55,7 +57,7 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
 			hashinfo->bhash_size)];
 
 	spin_lock(&bhead->lock);
-	inet_twsk_bind_unhash(tw, hashinfo);
+	inet_twsk_bind_unhash(tw, bhead, hashinfo);
 	spin_unlock(&bhead->lock);
 
 	atomic_dec(&tw->tw_dr->tw_count);
diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 261b71d0ccc5..12621f8fb4d5 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -60,10 +60,11 @@ static int sockstat_seq_show(struct seq_file *seq, void *v)
 	sockets = proto_sockets_allocated_sum_positive(&tcp_prot);
 
 	socket_seq_show(seq);
-	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld\n",
+	seq_printf(seq, "TCP: inuse %d orphan %d tw %d alloc %d mem %ld ports %d\n",
 		   sock_prot_inuse_get(net, &tcp_prot), orphans,
 		   atomic_read(&net->ipv4.tcp_death_row.tw_count), sockets,
-		   proto_memory_allocated(&tcp_prot));
+		   proto_memory_allocated(&tcp_prot),
+		   inet_bind_bucket_count(&tcp_prot));
 	seq_printf(seq, "UDP: inuse %d mem %ld\n",
 		   sock_prot_inuse_get(net, &udp_prot),
 		   proto_memory_allocated(&udp_prot));
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 868ed74a76a8..f62e2fb02fdf 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -3836,6 +3836,7 @@ void __init tcp_init(void)
 	for (i = 0; i < tcp_hashinfo.bhash_size; i++) {
 		spin_lock_init(&tcp_hashinfo.bhash[i].lock);
 		INIT_HLIST_HEAD(&tcp_hashinfo.bhash[i].chain);
+		tcp_hashinfo.bhash[i].count = 0;
 	}
 
 
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH bpf-next 0/3] bpf: cleanups on managing subprog information
From: John Fastabend @ 2018-05-02 17:24 UTC (permalink / raw)
  To: Jiong Wang, Alexei Starovoitov; +Cc: borkmann, ecree, netdev, oss-drivers
In-Reply-To: <fa61c0ca-db3d-bc7b-f6ac-521335f539ed@netronome.com>

On 05/02/2018 09:59 AM, Jiong Wang wrote:
> On 01/05/2018 23:22, Alexei Starovoitov wrote:
> ...
>> [   27.784931]  ? bpf_int_jit_compile+0x7ac/0xab0
>> [   27.785475]  bpf_int_jit_compile+0x2b6/0xab0
>> [   27.786001]  ? do_jit+0x6020/0x6020
>> [   27.786428]  ? kasan_kmalloc+0xa0/0xd0
>> [   27.786885]  bpf_check+0x2c05/0x4c40
>> [   27.787346]  ? fixup_bpf_calls+0x1140/0x1140
>> [   27.787865]  ? kasan_unpoison_shadow+0x30/0x40
>> [   27.788406]  ? kasan_kmalloc+0xa0/0xd0
>> [   27.788865]  ? memset+0x1f/0x40
>> [   27.789255]  ? bpf_obj_name_cpy+0x2d/0x200
>> [   27.789750]  bpf_prog_load+0xb07/0xeb0
>>
>> simply running test_verifier with JIT and kasan on.
> 
> Ah, sorry, I should add "sysctl net/core/bpf_jit_enable=1" to my test
> script, error reproduced.
> 
> convert_ctx_accesses and fixup_bpf_calls might insert ebpf insns that
> prog->len would change.
> 
> The new fake "exit" subprog whose .start offset is prog->len should be
> updated as well.
> 
> The "for" condition in adjust_subprog_starts:
> 
>   for (i = 0; i < env->subprog_cnt; i++) {
> 
> need to be changed into:
> 
>   for (i = 0; i <= env->subprog_cnt; i++) {
> 
> Will respin the patch set.
> 
> Thanks.
> 
> Regards,
> Jiong
> 

Also a bit of a nit, but if you are doing a respin. How about
consider renaming BPF_MAX_SUBPROGS -> BPF_MAX_PROGS. It will
make the naming more accurate and also avoid some diffs below
where changing '>=' to '>' is required.

@@ -191,7 +191,7 @@ struct bpf_verifier_env {
 	bool seen_direct_write;
 	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
 	struct bpf_verifier_log log;
-	u32 subprog_starts[BPF_MAX_SUBPROGS];
+	u32 subprog_starts[BPF_MAX_SUBPROGS + 1];
 	/* computes the stack depth of each bpf function */
 	u16 subprog_stack_depth[BPF_MAX_SUBPROGS + 1];
 	u32 subprog_cnt;

^ permalink raw reply

* Re: [PATCH] 8139too: Use disable_irq_nosync() in rtl8139_poll_controller()
From: David Miller @ 2018-05-02 17:22 UTC (permalink / raw)
  To: bigeasy; +Cc: netdev, mingo, tglx
In-Reply-To: <20180502113057.18209-1-bigeasy@linutronix.de>

From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Date: Wed,  2 May 2018 13:30:57 +0200

> From: Ingo Molnar <mingo@elte.hu>
> 
> Use disable_irq_nosync() instead of disable_irq() as this might be
> called in atomic context with netpoll.
> 
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>

Applied and queued up for -stable, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: phy: broadcom: add support for BCM89610 PHY
From: David Miller @ 2018-05-02 17:21 UTC (permalink / raw)
  To: vbhadram; +Cc: andrew, f.fainelli, netdev
In-Reply-To: <1525256676-24335-1-git-send-email-vbhadram@nvidia.com>


Please remove the email footer from your postings here that talks about
confidential information and whatnot.

That is expressly inappropriate for this mailing list, and any such
postings shall be ignored in their entirety.

Thank you.

^ permalink raw reply

* Re: [PATCH net-next 1/4] ipv6: Calculate hash thresholds for IPv6 nexthops
From: Ido Schimmel @ 2018-05-02 17:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Ahern, Ido Schimmel, netdev, davem, roopa, nikolay, pch,
	jkbs, yoshfuji, mlxsw
In-Reply-To: <e4c89c14-99fc-5ba5-667b-0220520f47f3@gmail.com>

On Wed, May 02, 2018 at 09:43:50AM -0700, Eric Dumazet wrote:
> 
> 
> On 01/09/2018 07:43 PM, David Ahern wrote:
> > On 1/9/18 7:40 AM, Ido Schimmel wrote:
> >> Before we convert IPv6 to use hash-threshold instead of modulo-N, we
> >> first need each nexthop to store its region boundary in the hash
> >> function's output space.
> >>
> >> The boundary is calculated by dividing the output space equally between
> >> the different active nexthops. That is, nexthops that are not dead or
> >> linkdown.
> >>
> >> The boundaries are rebalanced whenever a nexthop is added or removed to
> >> a multipath route and whenever a nexthop becomes active or inactive.
> >>
> >> Signed-off-by: Ido Schimmel <idosch@mellanox.com>
> >> ---
> >>  include/net/ip6_fib.h   |  1 +
> >>  include/net/ip6_route.h |  7 ++++
> >>  net/ipv6/ip6_fib.c      |  8 ++---
> >>  net/ipv6/route.c        | 96 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>  4 files changed, 106 insertions(+), 6 deletions(-)
> >>
> > 
> > LGTM.
> > Acked-by: David Ahern <dsahern@gmail.com>
> > 
> 
> For some reason I have a divide by zero error booting my hosts with latest net tree.
> 
> What guarantee do we have that total is not zero when rt6_upper_bound_set() is called ?

Thanks for the report, Eric. I believe I didn't cover all the cases and
'rt6i_nh_weight' might be 0 is some cases. I'll try to reproduce and
work on a fix.

> 
> 
> 
> [    8.498639] divide error: 0000 [#1] SMP PTI
> [    8.503178] gsmi: Log Shutdown Reason 0x03
> [    8.507270] Modules linked in: bnx2x mdio
> [    8.511276] CPU: 17 PID: 116 Comm: kworker/17:0 Not tainted 4.17.0-smp-DEV #110
> [    8.518571] Hardware name: Intel RML,PCH/Ibis_QC_18, BIOS 2.40.0 06/22/2016
> [    8.525526] Workqueue: ipv6_addrconf addrconf_dad_work
> [    8.530662] RIP: 0010:rt6_multipath_rebalance.part.82+0x1cb/0x1f0
> [    8.536752] RSP: 0018:ffffba72867cbbf8 EFLAGS: 00010246
> [    8.541966] RAX: 0000000000000000 RBX: 0000000000000025 RCX: ffff9d555ab73180
> [    8.549090] RDX: 0000000000000000 RSI: ffff9d4d5a34b1c0 RDI: 0000000000000000
> [    8.556212] RBP: ffffba72867cbc00 R08: 0000000000000000 R09: 0000000000000000
> [    8.563336] R10: 0000000000000000 R11: 0000000000000000 R12: ffff9d5559f95680
> [    8.570457] R13: ffff9d4d5a34b1c0 R14: ffff9d555ab73180 R15: 0000000000000000
> [    8.577579] FS:  0000000000000000(0000) GS:ffff9d4d5fc40000(0000) knlGS:0000000000000000
> [    8.585654] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    8.591391] CR2: 00007fffe47ff000 CR3: 0000000c39c0a001 CR4: 00000000000606e0
> [    8.598515] Call Trace:
> [    8.600961]  ? rt6_multipath_rebalance+0x21/0x30
> [    8.605579]  fib6_add+0x75f/0xf70
> [    8.608899]  ? __wake_up+0x13/0x20
> [    8.612303]  ? netlink_broadcast_filtered+0x14c/0x3c0
> [    8.617355]  __ip6_ins_rt+0x4c/0x70
> [    8.620847]  ip6_ins_rt+0x6e/0xa0
> [    8.624157]  __ipv6_ifa_notify+0x226/0x2e0
> [    8.628249]  ipv6_ifa_notify+0x2a/0x40
> [    8.631999]  addrconf_dad_completed+0x59/0x360
> [    8.636438]  addrconf_dad_work+0x11c/0x400
> [    8.640536]  ? addrconf_dad_work+0x11c/0x400
> [    8.644810]  process_one_work+0x184/0x370
> [    8.648820]  ? process_one_work+0x184/0x370
> [    8.652996]  worker_thread+0x35/0x3a0
> [    8.656654]  kthread+0x121/0x140
> [    8.659887]  ? process_one_work+0x370/0x370
> [    8.664073]  ? kthread_create_worker_on_cpu+0x70/0x70
> [    8.669118]  ret_from_fork+0x35/0x40
> [    8.672693] Code: c3 8b b9 38 01 00 00 eb aa 48 63 81 38 01 00 00 89 fa 41 89 f8 c1 ea 1f 01 fa d1 fa 48 63 d2 49 89 c2 48 c1 e0 1f 48 01 d0 31 d2 <49> f7 f0 83 e8 01 48 39 ce 89 81 b4 00 00 00 0f 85 e2 fe ff ff 
> [    8.691533] RIP: rt6_multipath_rebalance.part.82+0x1cb/0x1f0 RSP: ffffba72867cbbf8
> [    8.699135] ---[ end trace 9ae26819121cdc3a ]---
> [    8.703760] Kernel panic - not syncing: Fatal exception in interrupt
> [    8.710169] Kernel Offset: 0x3d200000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
> [    8.721256] gsmi: Log Shutdown Reason 0x02
> [    8.725357] Rebooting in 10 seconds..
> 
> 
> 

^ permalink raw reply

* Re: [PATCH] change the comment of ip6gre_tnl_addr_conflict
From: David Miller @ 2018-05-02 17:19 UTC (permalink / raw)
  To: sunlw.fnst; +Cc: netdev
In-Reply-To: <20180502090608.13612-1-sunlw.fnst@cn.fujitsu.com>

From: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
Date: Wed, 2 May 2018 17:06:08 +0800

> The comment of ip6gre_tnl_addr_conflict() is wrong. which use
> ip6_tnl_addr_conflict instead of ip6gre_tnl_addr_conflict.
> 
> Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>

Please format your Subject line properly.

If should start with "[PATCH X] " where "X" is the name of the
networking GIT tree your patch is against.

Then there should be a subsystem prefix, followed by a colon ":" and a
space character.  In this case, "ipv6: " would be an appropriate
subject prefix.

Then, please make your Subject sentence more clear.

Be more precise with what you are doing.  You are correcting the
function name in a comment, so say something that expresses that.

Thank you.

^ permalink raw reply

* Re: [PATCH V2 net-next 4/6] tun: Add support for SCTP checksum offload
From: Vlad Yasevich @ 2018-05-02 17:17 UTC (permalink / raw)
  To: Marcelo Ricardo Leitner, Vladislav Yasevich
  Cc: netdev, linux-sctp, virtualization, virtio-dev, mst, jasowang,
	nhorman
In-Reply-To: <20180502145607.GH5105@localhost.localdomain>

On 05/02/2018 10:56 AM, Marcelo Ricardo Leitner wrote:
> On Wed, May 02, 2018 at 11:53:47AM -0300, Marcelo Ricardo Leitner wrote:
>> On Tue, May 01, 2018 at 10:07:37PM -0400, Vladislav Yasevich wrote:
>>> Adds a new tun offload flag to allow for SCTP checksum offload.
>>> The flag has to be set by the user and defaults to "no offload".
>>
>> I'm confused here:
>>
>>> +++ b/drivers/net/tun.c
>>> @@ -216,7 +216,7 @@ struct tun_struct {
>>>  	struct net_device	*dev;
>>>  	netdev_features_t	set_features;
>>>  #define TUN_USER_FEATURES (NETIF_F_HW_CSUM|NETIF_F_TSO_ECN|NETIF_F_TSO| \
>>> -			  NETIF_F_TSO6)
>>> +			  NETIF_F_TSO6|NETIF_F_SCTP_CRC)
>>
>> Doesn't adding it here mean it defaults to "offload", instead?
>>
>> later on, it does:
>>                 dev->hw_features = NETIF_F_SG | NETIF_F_FRAGLIST |
>>                                    TUN_USER_FEATURES | NETIF_F_HW_VLAN_CTAG_TX |
>>                                    NETIF_F_HW_VLAN_STAG_TX;
> 
> Missed to paste the next line too:
>                 dev->features = dev->hw_features | NETIF_F_LLTX;
> 

Yes, as a software device, we can default it to on.  However, qemu will 0-out
the features and then set them up based on the guest (just like regular checksum).

-vlad

^ permalink raw reply

* Re: [PATCH net-next v2 0/2] mlxsw: Reject unsupported FIB configurations
From: David Miller @ 2018-05-02 17:17 UTC (permalink / raw)
  To: idosch; +Cc: netdev, jiri, dsahern, mlxsw
In-Reply-To: <20180502071735.32352-1-idosch@mellanox.com>

From: Ido Schimmel <idosch@mellanox.com>
Date: Wed,  2 May 2018 10:17:33 +0300

> Recently it became possible for listeners of the FIB notification chain
> to veto operations such as addition of routes and rules.
> 
> Adjust the mlxsw driver to take advantage of it and return an error for
> unsupported FIB rules and for routes configured after the abort
> mechanism was triggered (due to exceeded resources for example).
> 
> v2:
> * Change error code in first patch to -EOPNOTSUPP (David Ahern).

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Martin KaFai Lau @ 2018-05-02 17:10 UTC (permalink / raw)
  To: Julian Anastasov
  Cc: netdev, David Ahern, Tom Herbert, Eric Dumazet, Nikita Shirokov,
	kernel-team, lvs-devel
In-Reply-To: <alpine.LFD.2.20.1805020932270.1964@ja.home.ssi.bg>

On Wed, May 02, 2018 at 09:38:43AM +0300, Julian Anastasov wrote:
> 
> 	Hello,
> 
> On Thu, 19 Apr 2018, Martin KaFai Lau wrote:
> 
> > This patch is not a proper fix and mainly serves for discussion purpose.
> > It is based on net-next which I have been using to debug the issue.
> > 
> > The change that works around the issue is in ensure_mtu_is_adequate().
> > Other changes are the rippling effect in function arg.
> > 
> > This bug was uncovered by one of our legacy service that
> > are still using ipvs for load balancing.  In that setup,
> > the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> > before tx it out to eth0.
> > 
> > The problem is the kernel stack could pass a skb (which was
> > originated from a sys_write(tcp_fd)) to the driver with skb->len
> > bigger than the device MTU.  In one NIC setup (with gso and tso off)
> > that we are using, it upset the NIC/driver and caused the tx queue
> > stalled for tens of seconds which is how it got uncovered.
> > (On the NIC side, the NIC firmware and driver have been fixed
> > to avoid this tx queue stall after seeing this skb).
> 
> 	In the last week I analyzed the situation and found
> that just changes in route.c are able to solve the problems,
> at 99% :) I'm posting a separate patch for this.
> 
> 	Here is what happens, I'm testing with FTP which
> starts with connection to port 21 and then with data connection,
> from local ftp client to local Virtual IP (forwarded to remote
> real server via tunnel).
> 
> - TCP creates local route which after commit 839da4d98960
> appears to be non-cached, before this commit it is cached.
> Route is saved and reused.
> 
> - initial traffic for port 21 does not use GSO. But after
> every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
> to report the reduced MTU. These updates are stored in fnhe_pmtu
> but they do not go to any route, even if we try to get fresh
> output route. Why? Because the local routes are not cached, so
> they can not use the fnhe. This is what my patch for route.c
> will fix. With this fix FTP-DATA gets route with reduced PMTU.
For IPv6, the 'if (rt6->rt6i_flags & RTF_LOCAL)' gate in
__ip6_rt_update_pmtu() may need to be lifted also.

> 
> - later, there are large GSO packets in the FTP-DATA connection.
> Currently, IPVS does not send ICMP error for exceeded PMTU
> by GSO packets (this will be fixed, see patch below). Even
> if they reach TCP and it refreshes its route, TCP can not get
> any actual PMTU values from the routing, so continues to
> use the large gso_size without the patch for route.c
> 
> 	For the changes in IPVS that I show below as RFC:
> 
> - I synchronized the PMTU checks in __mtu_check_toobig* with
> IPv4/IPv6 forwarding
> 
> - I modified your changes, see ipvs_gso_hlen() and how I use it
> at start of ensure_mtu_is_adequate(), after skb is made writable,
> before the PMTU checks.
> 
> 	In my tests, all variants work, just with skb_decrease_gso_size
> or even if I add SKB_GSO_DODGY+gso_segs=0. But I'm not sure
> if this is safe for the different GSO configurations.
> 
> 	I'm analyzing what can be changed in route.c, so that
> dst.ops->update_pmtu changes to take effect for the provided
> route. If that is possible, it will allow the PMTU change to
> take immediate effect for the local routes.
> 
> diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
> index 4527921..7a2a0a89 100644
> --- a/net/netfilter/ipvs/ip_vs_xmit.c
> +++ b/net/netfilter/ipvs/ip_vs_xmit.c
> @@ -104,9 +104,32 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
>  	return dest_dst;
>  }
>  
> -static inline bool
> -__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> +/* Check if packet size violates MTU */
> +static bool __mtu_check_toobig(const struct sk_buff *skb, u32 mtu)
>  {
> +	if (skb->len <= mtu)
> +		return false;
> +
> +	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
> +		return false;
> +
> +	if (IPCB(skb)->frag_max_size) {
> +		/* frag_max_size tell us that, this packet have been
> +		 * defragmented by netfilter IPv4 conntrack module.
> +		 */
> +		if (IPCB(skb)->frag_max_size > mtu)
> +			return true; /* largest fragment violate MTU */
> +	}
> +	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> +		return false;
> +	return true;
> +}
> +
> +/* Check if packet size violates MTU */
> +static bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
> +{
> +	if (skb->len <= mtu)
> +		return false;
>  	if (IP6CB(skb)->frag_max_size) {
>  		/* frag_max_size tell us that, this packet have been
>  		 * defragmented by netfilter IPv6 conntrack module.
> @@ -114,10 +137,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
>  		if (IP6CB(skb)->frag_max_size > mtu)
>  			return true; /* largest fragment violate MTU */
>  	}
> -	else if (skb->len > mtu && !skb_is_gso(skb)) {
> -		return true; /* Packet size violate MTU size */
> -	}
> -	return false;
> +	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
> +		return false;
> +	return true;
>  }
>  
>  /* Get route to daddr, update *saddr, optionally bind route to saddr */
> @@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
>  		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
>  }
>  
> +/* GSO: length of network and transport headers, 0=unsupported */
> +static unsigned short ipvs_gso_hlen(struct sk_buff *skb,
> +				    struct ip_vs_iphdr *ipvsh)
> +{
> +	unsigned short hlen = ipvsh->len - ipvsh->off;
> +
> +	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) {
> +		struct tcphdr _tcph, *th;
> +
> +		th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph);
> +		if (th)
> +			return hlen + (th->doff << 2);
> +	}
> +	return 0;
> +}
> + 
>  static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  					  int rt_mode,
>  					  struct ip_vs_iphdr *ipvsh,
>  					  struct sk_buff *skb, int mtu)
>  {
> +	/* Re-segment traffic from local clients */
> +	if (!skb->dev && skb_is_gso(skb)) {
> +		unsigned short hlen = ipvs_gso_hlen(skb, ipvsh);
> +
> +		if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size &&
> +		    mtu > hlen) {
> +			u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen);
> +
> +			IP_VS_DBG(10, "Reducing gso_size=%u with %u\n",
> +				skb_shinfo(skb)->gso_size, reduce);
> +			skb_decrease_gso_size(skb_shinfo(skb), reduce);
> +			skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
> +			skb_shinfo(skb)->gso_segs = 0;
> +		}
> +	}
>  #ifdef CONFIG_IP_VS_IPV6
>  	if (skb_af == AF_INET6) {
>  		struct net *net = ipvs->net;
> @@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
>  		if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs))
>  			return true;
>  
> -		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
> -			     skb->len > mtu && !skb_is_gso(skb) &&
> -			     !ip_vs_iph_icmp(ipvsh))) {
> +		if (unlikely(__mtu_check_toobig(skb, mtu))) {
>  			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
>  				  htonl(mtu));
>  			IP_VS_DBG(1, "frag needed for %pI4\n",
> 
> Regards
> 
> --
> Julian Anastasov <ja@ssi.bg>

^ permalink raw reply

* Re: [PATCH net] ipv4: fix fnhe usage by non-cached routes
From: David Miller @ 2018-05-02 17:04 UTC (permalink / raw)
  To: ja; +Cc: netdev, kafai, kernel-team, dsahern, lucien.xin
In-Reply-To: <20180502064119.4552-1-ja@ssi.bg>

From: Julian Anastasov <ja@ssi.bg>
Date: Wed,  2 May 2018 09:41:19 +0300

> Allow some non-cached routes to use non-expired fnhe:
 ...
> Fixes: 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic")
> Fixes: d6d5e999e5df ("route: do not cache fib route info on local routes with oif")
> Fixes: deed49df7390 ("route: check and remove route cache when we get route")
> Cc: David Ahern <dsahern@gmail.com>
> Cc: Xin Long <lucien.xin@gmail.com>
> Signed-off-by: Julian Anastasov <ja@ssi.bg>

David A., please review.

^ permalink raw reply

* Re: [PATCH net-next] cxgb4: add new T5 device id's
From: David Miller @ 2018-05-02 17:04 UTC (permalink / raw)
  To: ganeshgr; +Cc: netdev, nirranjan, indranil
In-Reply-To: <1525241835-14055-1-git-send-email-ganeshgr@chelsio.com>

From: Ganesh Goudar <ganeshgr@chelsio.com>
Date: Wed,  2 May 2018 11:47:15 +0530

> Add device id's 0x5019, 0x501a and 0x501b for T5
> cards.
> 
> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>

Applied, thanks.

^ permalink raw reply

* [PATCH net] net_sched: fq: take care of throttled flows before reuse
From: Eric Dumazet @ 2018-05-02 17:03 UTC (permalink / raw)
  To: David S . Miller; +Cc: netdev, Eric Dumazet, Eric Dumazet

Normally, a socket can not be freed/reused unless all its TX packets
left qdisc and were TX-completed. However connect(AF_UNSPEC) allows
this to happen.

With commit fc59d5bdf1e3 ("pkt_sched: fq: clear time_next_packet for
reused flows") we cleared f->time_next_packet but took no special
action if the flow was still in the throttled rb-tree.

Since f->time_next_packet is the key used in the rb-tree searches,
blindly clearing it might break rb-tree integrity. We need to make
sure the flow is no longer in the rb-tree to avoid this problem.

Fixes: fc59d5bdf1e3 ("pkt_sched: fq: clear time_next_packet for reused flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/sched/sch_fq.c | 37 +++++++++++++++++++++++++------------
 1 file changed, 25 insertions(+), 12 deletions(-)

diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index a366e4c9413ab4fe4dfb16f0255cb7632ade7f1c..4808713c73b988cc3e536cff866cf18de05375fa 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -128,6 +128,28 @@ static bool fq_flow_is_detached(const struct fq_flow *f)
 	return f->next == &detached;
 }
 
+static bool fq_flow_is_throttled(const struct fq_flow *f)
+{
+	return f->next == &throttled;
+}
+
+static void fq_flow_add_tail(struct fq_flow_head *head, struct fq_flow *flow)
+{
+	if (head->first)
+		head->last->next = flow;
+	else
+		head->first = flow;
+	head->last = flow;
+	flow->next = NULL;
+}
+
+static void fq_flow_unset_throttled(struct fq_sched_data *q, struct fq_flow *f)
+{
+	rb_erase(&f->rate_node, &q->delayed);
+	q->throttled_flows--;
+	fq_flow_add_tail(&q->old_flows, f);
+}
+
 static void fq_flow_set_throttled(struct fq_sched_data *q, struct fq_flow *f)
 {
 	struct rb_node **p = &q->delayed.rb_node, *parent = NULL;
@@ -155,15 +177,6 @@ static void fq_flow_set_throttled(struct fq_sched_data *q, struct fq_flow *f)
 
 static struct kmem_cache *fq_flow_cachep __read_mostly;
 
-static void fq_flow_add_tail(struct fq_flow_head *head, struct fq_flow *flow)
-{
-	if (head->first)
-		head->last->next = flow;
-	else
-		head->first = flow;
-	head->last = flow;
-	flow->next = NULL;
-}
 
 /* limit number of collected flows per round */
 #define FQ_GC_MAX 8
@@ -267,6 +280,8 @@ static struct fq_flow *fq_classify(struct sk_buff *skb, struct fq_sched_data *q)
 				     f->socket_hash != sk->sk_hash)) {
 				f->credit = q->initial_quantum;
 				f->socket_hash = sk->sk_hash;
+				if (fq_flow_is_throttled(f))
+					fq_flow_unset_throttled(q, f);
 				f->time_next_packet = 0ULL;
 			}
 			return f;
@@ -438,9 +453,7 @@ static void fq_check_throttled(struct fq_sched_data *q, u64 now)
 			q->time_next_delayed_flow = f->time_next_packet;
 			break;
 		}
-		rb_erase(p, &q->delayed);
-		q->throttled_flows--;
-		fq_flow_add_tail(&q->old_flows, f);
+		fq_flow_unset_throttled(q, f);
 	}
 }
 
-- 
2.17.0.441.gb46fe60e1d-goog

^ permalink raw reply related

* Re: [RFC v2 bpf-next 8/9] bpf: Provide helper to do lookups in kernel FIB table
From: David Miller @ 2018-05-02 17:00 UTC (permalink / raw)
  To: dsahern
  Cc: brouer, netdev, borkmann, ast, shm, roopa, toke, john.fastabend,
	bernat
In-Reply-To: <4a92f04a-6865-ae70-1c54-e4b92d886491@gmail.com>

From: David Ahern <dsahern@gmail.com>
Date: Wed, 2 May 2018 09:37:21 -0600

> To share numbers from recent testing I did using Vincent's modules,
> lookup times in nsec (using local_clock) with MULTIPLE_TABLES config
> disabled for IPv4 and IPv6
> 
>             IPv4    IPv6-dst    IPv6-fib6
> baseline     49        126         52
> 
> I have other cases with combinations of configs and rules, but this
> shows the best possible case.
> 
> IPv6 needs some more work to improve speeds with MULTIPLE_TABLES enabled
> (separate local and main tables unlike IPv4) and IPV6_SUBTREES enabled.

Yes, like for ipv4 sharing local and main tables will help a lot.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox