Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] hinic: skb_pad() frees on error
From: David Miller @ 2017-08-26  0:14 UTC (permalink / raw)
  To: dan.carpenter; +Cc: aviad.krawczyk, netdev, kernel-janitors
In-Reply-To: <20170825082428.hpnbs4i74bubm4cz@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Fri, 25 Aug 2017 11:24:28 +0300

> The skb_pad() function frees the skb on error, so this code has a double
> free.
> 
> Fixes: 00e57a6d4ad3 ("net-next/hinic: Add Tx operation")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next v2 0/5] net: updates for IPv6 Segment Routing
From: David Miller @ 2017-08-26  0:10 UTC (permalink / raw)
  To: david.lebrun; +Cc: netdev
In-Reply-To: <20170825075648.5061-1-david.lebrun@uclouvain.be>

From: David Lebrun <david.lebrun@uclouvain.be>
Date: Fri, 25 Aug 2017 09:56:43 +0200

> v2: seg6_lwt_headroom() is not relevant for lwtunnel_input_redirect()
>     use cases, and L2ENCAP only uses this redirection. Fix incoherence
>     between arbitrary MAC header size support and fixed headroom
>     computation by setting only LWTUNNEL_STATE_INPUT_REDIRECT for L2ENCAP
>     mode.
> 
> This patch series provides several updates for the SRv6 implementation. The
> first patch leverages the existing infrastructure to support encapsulation
> of IPv4 packets. The second patch implements the T.Encaps.L2 SR function,
> enabling to encapsulate an L2 Ethernet frame within an IPv6+SRH packet.
> The last three patches update the seg6local lightweight tunnel, and mainly
> implement four new actions: End.T, End.DX2, End.DX4 and End.DT6.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net] ipv6: Fix may be used uninitialized warning in rt6_check
From: David Miller @ 2017-08-26  0:08 UTC (permalink / raw)
  To: steffen.klassert; +Cc: weiwan, edumazet, kafai, netdev
In-Reply-To: <20170825070542.GV31224@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Fri, 25 Aug 2017 09:05:42 +0200

> rt_cookie might be used uninitialized, fix this by
> initializing it.
> 
> Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net] ipv6: fix sparse warning on rt6i_node
From: Martin KaFai Lau @ 2017-08-26  0:06 UTC (permalink / raw)
  To: Wei Wang; +Cc: David Miller, netdev, Eric Dumazet
In-Reply-To: <20170825220310.24863-1-tracywwnj@gmail.com>

On Fri, Aug 25, 2017 at 03:03:10PM -0700, Wei Wang wrote:
> From: Wei Wang <weiwan@google.com>
>
> Commit c5cff8561d2d adds rcu grace period before freeing fib6_node. This
> generates a new sparse warning on rt->rt6i_node related code:
>   net/ipv6/route.c:1394:30: error: incompatible types in comparison
>   expression (different address spaces)
>   ./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
>   expression (different address spaces)
>
> This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
> rcu API is used for it.
> After this fix, sparse no longer generates the above warning.
>
> Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
> Signed-off-by: Wei Wang <weiwan@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>

> ---
>  include/net/ip6_fib.h |  2 +-
>  net/ipv6/addrconf.c   |  2 +-
>  net/ipv6/ip6_fib.c    | 11 +++++++----
>  net/ipv6/route.c      |  3 ++-
>  4 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index e9c59db92942..af509f801084 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -105,7 +105,7 @@ struct rt6_info {
>  	 * the same cache line.
>  	 */
>  	struct fib6_table		*rt6i_table;
> -	struct fib6_node		*rt6i_node;
> +	struct fib6_node __rcu		*rt6i_node;
>
>  	struct in6_addr			rt6i_gateway;
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 3c46e9513a31..936e9ab4dda5 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5556,7 +5556,7 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
>  		 * our DAD process, so we don't need
>  		 * to do it again
>  		 */
> -		if (!(ifp->rt->rt6i_node))
> +		if (!rcu_access_pointer(ifp->rt->rt6i_node))
>  			ip6_ins_rt(ifp->rt);
>  		if (ifp->idev->cnf.forwarding)
>  			addrconf_join_anycast(ifp);
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index a5ebf86f6be8..10b4b1f8b838 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -889,7 +889,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>
>  		rt->dst.rt6_next = iter;
>  		*ins = rt;
> -		rt->rt6i_node = fn;
> +		rcu_assign_pointer(rt->rt6i_node, fn);
>  		atomic_inc(&rt->rt6i_ref);
>  		if (!info->skip_notify)
>  			inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
> @@ -915,7 +915,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>  			return err;
>
>  		*ins = rt;
> -		rt->rt6i_node = fn;
> +		rcu_assign_pointer(rt->rt6i_node, fn);
>  		rt->dst.rt6_next = iter->dst.rt6_next;
>  		atomic_inc(&rt->rt6i_ref);
>  		if (!info->skip_notify)
> @@ -1480,8 +1480,9 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
>
>  int fib6_del(struct rt6_info *rt, struct nl_info *info)
>  {
> +	struct fib6_node *fn = rcu_dereference_protected(rt->rt6i_node,
> +				    lockdep_is_held(&rt->rt6i_table->tb6_lock));
>  	struct net *net = info->nl_net;
> -	struct fib6_node *fn = rt->rt6i_node;
>  	struct rt6_info **rtp;
>
>  #if RT6_DEBUG >= 2
> @@ -1670,7 +1671,9 @@ static int fib6_clean_node(struct fib6_walker *w)
>  			if (res) {
>  #if RT6_DEBUG >= 2
>  				pr_debug("%s: del failed: rt=%p@%p err=%d\n",
> -					 __func__, rt, rt->rt6i_node, res);
> +					 __func__, rt,
> +					 rcu_access_pointer(rt->rt6i_node),
> +					 res);
>  #endif
>  				continue;
>  			}
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index a9d3564caf49..33629f2a0f9d 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1383,7 +1383,8 @@ static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
>  static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
>  {
>  	return !(rt->rt6i_flags & RTF_CACHE) &&
> -		(rt->rt6i_flags & RTF_PCPU || rt->rt6i_node);
> +		(rt->rt6i_flags & RTF_PCPU ||
> +		 rcu_access_pointer(rt->rt6i_node));
>  }
>
>  static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
> --
> 2.14.1.342.g6490525c54-goog
>

^ permalink raw reply

* Re: [PATCH net] ipv6: Fix may be used uninitialized warning in rt6_check
From: David Miller @ 2017-08-26  0:04 UTC (permalink / raw)
  To: sbrivio; +Cc: steffen.klassert, weiwan, edumazet, kafai, netdev
In-Reply-To: <20170825110206.36e4a7a7@elisabeth>

From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 25 Aug 2017 11:02:06 +0200

> On Fri, 25 Aug 2017 09:52:17 +0200
> Steffen Klassert <steffen.klassert@secunet.com> wrote:
> 
>> On Fri, Aug 25, 2017 at 09:05:42AM +0200, Steffen Klassert wrote:
>> > rt_cookie might be used uninitialized, fix this by
>> > initializing it.
>> > 
>> > Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
>> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
>> > ---
>> >  net/ipv6/route.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> > index a9d3564..48c8c92 100644
>> > --- a/net/ipv6/route.c
>> > +++ b/net/ipv6/route.c
>> > @@ -1289,7 +1289,7 @@ static void rt6_dst_from_metrics_check(struct rt6_info *rt)
>> >  
>> >  static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
>> >  {
>> > -	u32 rt_cookie;
>> > +	u32 rt_cookie = 0;
>> >  
>> >  	if (!rt6_get_cookie_safe(rt, &rt_cookie) || rt_cookie != cookie)
>> >  		return NULL;  
>> 
>> The compiler warning seems to be a false positive, as
>> rt_cookie != cookie is only checked if rt6_get_cookie_safe
>> returns true in which case rt_cookie is initialized.
>> 
>> Please disregard this patch.
> 
> ...or not? I was thinking of sending a similar patch with
> uninitialized_var(rt_cookie), but it seems we have similar cases
> where we just initialize to zero instead.
> 
> I wonder which approach is considered the most acceptable nowadays. I
> would be in favour of uninitialized_var() as it doesn't change the
> binary output, but https://lwn.net/Articles/529954/ also contains some
> valid criticism. Ideas?

Generally speaking I guess initializing to zero is Ok to do.

As far as which approach is better, I don't have any strong opinion.

So I will probably just apply Steffen's patch.

^ permalink raw reply

* Re: UDP sockets oddities
From: Eric Dumazet @ 2017-08-25 23:57 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <3c44d27e-27d9-e9c4-04b3-17c0366e60d9@gmail.com>

On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:

> Eric, are there areas of the stack where we are allowed to drop packets,
> not propagate that back to write(2) and also not increment any counter
> either, or maybe I am not looking where I should...

What happens if you increase these sysctls ?

grep .  `find /proc/sys|grep unres_qlen`


unres_qlen_bytes -> 2000000
unres_qlen -> 10000

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Michael S. Tsirkin @ 2017-08-25 23:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <CAF=yD-+1wheMmC+HFKe_B_ULO0Mmh6HMSEbYY5D-HgqxVJee6A@mail.gmail.com>

On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
> >> >> > We don't enable network watchdog on virtio but we could and maybe
> >> >> > should.
> >> >>
> >> >> Can you elaborate?
> >> >
> >> > The issue is that holding onto buffers for very long times makes guests
> >> > think they are stuck. This is funamentally because from guest point of
> >> > view this is a NIC, so it is supposed to transmit things out in
> >> > a timely manner. If host backs the virtual NIC by something that is not
> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
> >> > guest will be confused.
> >>
> >> That assumes that guests are fragile in this regard. A linux guest
> >> does not make such assumptions.
> >
> > Yes it does. Examples above:
> >         > > - a single slow flow can occupy the whole ring, you will not
> >         > >   be able to make any new buffers available for the fast flow
> 
> Oh, right. Though those are due to vring_desc pool exhaustion
> rather than an upper bound on latency of any single packet.
> 
> Limiting the number of zerocopy packets in flight to some fraction
> of the ring ensures that fast flows can always grab a slot.
> Running
> out of ubuf_info slots reverts to copy, so indirectly does this. But
> I read it correclty the zerocopy pool may be equal to or larger than
> the descriptor pool. Should we refine the zcopy_used test
> 
>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
> 
> to also return false if the number of outstanding ubuf_info is greater
> than, say, vq->num >> 1?


We'll need to think about where to put the threshold, but I think it's
a good idea.

Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
resources.

In a sense it still means once you run out of slots zcopt gets disabled possibly permanently.

Need to experiment with some numbers.

-- 
MST

^ permalink raw reply

* Re: [PATCH net-next v6 2/3] net: gso: Add GSO support for NSH
From: Jiri Benc @ 2017-08-25 23:22 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, e, blp, jan.scheurich
In-Reply-To: <20170825182514.6ff6c36b@griffin>

On Fri, 25 Aug 2017 18:25:14 +0200, Jiri Benc wrote:
> While looking at this, I realized that GSO for VXLAN-GPE is broken,
> too. Let me fix it by implementing what I described above which will
> make your patch much easier.

Okay, it's not really broken and we don't need that complexity. At
least not immediately. Hw offloading in the VXLAN-GPE case probably does
not work correctly and would benefit from that change but that's a
different beast to tackle at a different time. Software segmentation
works fine for VXLAN-GPE.

There should not be much problems with NSH segmentation, either, if we
carefully store and set mac_header, mac_len and skb->protocol around
calls to skb_mac_gso_segment. Note that with zero mac_len (and correct
skb->protocol), skb_mac_gso_segment behaves in the same way that you
tried to achieve with find_gso_segment_by_type, which is thus completely
unnecessary.

More on Monday.

 Jiri

^ permalink raw reply

* Re: UDP sockets oddities
From: Florian Fainelli @ 2017-08-25 23:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <4adb4b66-590c-c55a-44aa-27dd409ce14f@gmail.com>

On 08/23/2017 07:23 PM, Florian Fainelli wrote:
> On 08/23/2017 05:43 PM, Eric Dumazet wrote:
>> On Wed, 2017-08-23 at 17:03 -0700, Florian Fainelli wrote:
>>
>>> Attached is the perf report --stdio of:
>>>
>>> # perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60
>>> WARNING: option -b implies udp testing
>>> ------------------------------------------------------------
>>> Client connecting to 192.168.1.23, UDP port 5001
>>> Sending 1470 byte datagrams
>>> UDP buffer size:  160 KByte (default)
>>> ------------------------------------------------------------
>>> [  4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001
>>> [ ID] Interval       Transfer     Bandwidth
>>> [  4]  0.0-60.0 sec   685 MBytes  95.8 Mbits/sec
>>> [  4] Sent 488633 datagrams
>>> [  4] Server Report:
>>> [  4]  0.0-74.4 sec   685 MBytes  77.2 Mbits/sec   0.187 ms  300/488632
>>> (0.061%)
>>> [  4]  0.0-74.4 sec  58 datagrams received out-of-order
>>>
>>> It measured 488644 events which is greater than the number of packets
>>> transmitted by iperf but I only count 8 non-IP packets being sent
>>> (protocl=2054 -> ETH_P_ARP) so I am not sure what the other 4 were and
>>> why there are not accounted for.
>>>
>>> Almost all events came from net_tx_action() except 2 that came from
>>> net/core/neighbour.c::neigh_probe() and 65 that came from
>>> arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!?
>>
>> Oh you have too much noise and need this fix :
>>
>> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
>> index dc3052751bc1..fcfa8d991850 100644
>> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
>> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
>> @@ -597,7 +597,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
>>  
>>  static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb)
>>  {
>> -	dev_kfree_skb_any(cb->skb);
>> +	dev_consume_skb_any(cb->skb);
>>  	cb->skb = NULL;
>>  	dma_unmap_addr_set(cb, dma_addr, 0);
>>  }
>>
> 
> Yay, now I am getting some sensible information, thanks!
> 
> iperf reports 143 packets lost, and perf report gets me 146 entries for
> kfree_skb() looking like that:
> #
>      2.74%     2.74%  skbaddr=0xee5f30c0 protocol=2048 location=0xc0855cdc
>             |
>             ---0x2d
>                write
>                0xe9fc2368
>                kfree_skb
> 
> 
> What is annoying is that 0xc0855cdc is resolved to
> arch/arm/include/asm/irqflags.h::arch_local_irq_save() which does not
> really help telling me where exactly in the stack the drop is happening,
> although now I know it is somewhere down the path from write(2)... of
> course it is :)
> 
> Now what is disturbing too is that iperf does  have its write() system
> call get an error returned, write happily returned the number of bytes
> written even though the perf trace indicates there was packet drops down
> the road..

Because perf was not able to track down exactly where the location was
with builtin_return_address(0) and always pointed to
arch_local_irq_save(), I added a WARN() in kfree_skb(), and of course
this made it impossible to see packet loss under the same conditions
anymore. After increasing dramatically wmem_{default,max} I could start
seeing significant drop made largely worse by the printks, so not very
conclusive either.

I took another approach and recorded net_dev_queue() events, and I can
see only 976704 (gphy + eth0 devices) / 2 = 488352 dev_queue_events()
were recorded. The total sent by iperf was 488898 and we lost 558
packets according to the server 488352 + 558 > 488898 which makes sense
because neigh_probe() and others still run. This still confirms there is
a drop occurring between UDP socket sendmsg and the SKB enqueuing...

So my take away is that the CPU has a bursty write(2) behavior, and the
larger the socket write buffer size, the more datagrams you can burst
into and the more of these SKBs can be flat out being dropped when the
transmit queue is congested. Adding a WARN() is enough of a slow down
for the CPU that we are not able to reliably reproduce this burst.

Eric, are there areas of the stack where we are allowed to drop packets,
not propagate that back to write(2) and also not increment any counter
either, or maybe I am not looking where I should...

Thanks!
-- 
Florian

^ permalink raw reply

* Re: [PATCH 2/2] drivers: net: xgene: Clean up all outstanding tx descriptors
From: Andrew Lunn @ 2017-08-25 23:10 UTC (permalink / raw)
  To: Iyappan Subramanian
  Cc: davem, netdev, linux-arm-kernel, dnelson, qnguyen, patches
In-Reply-To: <1503699810-12803-3-git-send-email-isubramanian@apm.com>

On Fri, Aug 25, 2017 at 03:23:30PM -0700, Iyappan Subramanian wrote:
> When xgene_enet is rmmod'd and there are still outstanding tx descriptors
> that have been setup but have not completed, it is possible on the next
> modprobe of the driver to receive the oldest of such tx descriptors. This
> results in a kernel NULL pointer dereference.
> 
> This patch attempts to clean up (by tearing down) all outstanding tx
> descriptors when the xgene_enet driver is being rmmod'd.
> 
> Given that, on the next modprobe it should be safe to ignore any such tx
> descriptors received that map to a NULL skb pointer.

This does not sound correct. Before the module is allowed to be
removed, everything needs to be finished. You need to wait for all the
tx descriptors to be returned before unloading. How can you free the
memory for the descriptor if it is still in use? How can you free the
skbuf the descriptor points to, if it is still in use... 

      Andrew

^ permalink raw reply

* Re: [PATCH net] tcp: fix refcnt leak with ebpf congestion control
From: Lawrence Brakmo @ 2017-08-25 23:08 UTC (permalink / raw)
  To: Sabrina Dubroca, netdev@vger.kernel.org; +Cc: Daniel Borkmann
In-Reply-To: <d69d1981d2c74805affc4ed6378447ea4cb06c67.1503659184.git.sd@queasysnail.net>


On 8/25/17, 4:10 AM, "Sabrina Dubroca" <sd@queasysnail.net> wrote:

    There are a few bugs around refcnt handling in the new BPF congestion
    control setsockopt:
    
     - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
       cannot get a reference on it. This would lead to a use after free,
       since that ca is going away soon.
    
     - Changing the congestion control case doesn't release the refcnt on
       the previous ca.
    
     - In the reinit case, we first leak a reference on the old ca, then we
       call tcp_reinit_congestion_control on the ca that we have just
       assigned, leading to deinitializing the wrong ca (->release of the
       new ca on the old ca's data) and releasing the refcount on the ca
       that we actually want to use.
    
    This is visible by building (for example) BIC as a module and setting
    net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
    samples/bpf.
    
    This patch fixes the refcount issues, and moves reinit back into tcp
    core to avoid passing a ca pointer back to BPF.
    
    Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    ---
     include/net/tcp.h   |  4 +---
     net/core/filter.c   |  7 ++-----
     net/ipv4/tcp.c      |  2 +-
     net/ipv4/tcp_cong.c | 19 ++++++++++++++-----
     4 files changed, 18 insertions(+), 14 deletions(-)
    
    diff --git a/include/net/tcp.h b/include/net/tcp.h
    index ada65e767b28..f642a39f9eee 100644
    --- a/include/net/tcp.h
    +++ b/include/net/tcp.h
    @@ -1004,9 +1004,7 @@ void tcp_get_default_congestion_control(char *name);
     void tcp_get_available_congestion_control(char *buf, size_t len);
     void tcp_get_allowed_congestion_control(char *buf, size_t len);
     int tcp_set_allowed_congestion_control(char *allowed);
    -int tcp_set_congestion_control(struct sock *sk, const char *name, bool load);
    -void tcp_reinit_congestion_control(struct sock *sk,
    -				   const struct tcp_congestion_ops *ca);
    +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, bool reinit);
     u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
     void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
     
    diff --git a/net/core/filter.c b/net/core/filter.c
    index 8eb81e5fae08..169974998c76 100644
    --- a/net/core/filter.c
    +++ b/net/core/filter.c
    @@ -2836,15 +2836,12 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
     		   sk->sk_prot->setsockopt == tcp_setsockopt) {
     		if (optname == TCP_CONGESTION) {
     			char name[TCP_CA_NAME_MAX];
    +			bool reinit = bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN;
     
     			strncpy(name, optval, min_t(long, optlen,
     						    TCP_CA_NAME_MAX-1));
     			name[TCP_CA_NAME_MAX-1] = 0;
    -			ret = tcp_set_congestion_control(sk, name, false);
    -			if (!ret && bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN)
    -				/* replacing an existing ca */
    -				tcp_reinit_congestion_control(sk,
    -					inet_csk(sk)->icsk_ca_ops);
    +			ret = tcp_set_congestion_control(sk, name, false, reinit);
     		} else {
     			struct tcp_sock *tp = tcp_sk(sk);
     
    diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
    index 71ce33decd97..a3e91b552edc 100644
    --- a/net/ipv4/tcp.c
    +++ b/net/ipv4/tcp.c
    @@ -2481,7 +2481,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
     		name[val] = 0;
     
     		lock_sock(sk);
    -		err = tcp_set_congestion_control(sk, name, true);
    +		err = tcp_set_congestion_control(sk, name, true, true);
     		release_sock(sk);
     		return err;
     	}
    diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
    index fde983f6376b..421ea1b918da 100644
    --- a/net/ipv4/tcp_cong.c
    +++ b/net/ipv4/tcp_cong.c
    @@ -189,8 +189,8 @@ void tcp_init_congestion_control(struct sock *sk)
     		INET_ECN_dontxmit(sk);
     }
     
    -void tcp_reinit_congestion_control(struct sock *sk,
    -				   const struct tcp_congestion_ops *ca)
    +static void tcp_reinit_congestion_control(struct sock *sk,
    +					  const struct tcp_congestion_ops *ca)
     {
     	struct inet_connection_sock *icsk = inet_csk(sk);
     
    @@ -338,7 +338,7 @@ int tcp_set_allowed_congestion_control(char *val)
      * tcp_reinit_congestion_control (if the current congestion control was
      * already initialized.
      */
    -int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
    +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, bool reinit)
     {
     	struct inet_connection_sock *icsk = inet_csk(sk);
     	const struct tcp_congestion_ops *ca;
    @@ -360,9 +360,18 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
     	if (!ca) {
     		err = -ENOENT;
     	} else if (!load) {
    -		icsk->icsk_ca_ops = ca;
    -		if (!try_module_get(ca->owner))
    +		const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops;
    +
    +		if (try_module_get(ca->owner)) {
    +			if (reinit) {
    +				tcp_reinit_congestion_control(sk, ca);
    +			} else {
    +				icsk->icsk_ca_ops = ca;
    +				module_put(old_ca->owner);
    +			}
    +		} else {
     			err = -EBUSY;
    +		}
     	} else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
     		     ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))) {
     		err = -EPERM;
    -- 
    2.14.1
    
Thank you for finding and fixing these issues.

Acked-by: Lawrence Brakmo <brakmo@fb.com>
 


^ permalink raw reply

* Re: [ethtool 1/1] ethtool: Add DMA Coalescing support
From: Stephen Hemminger @ 2017-08-25 22:57 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: linville, Paul Greenwalt, netdev, nhorman, sassmann, jogreene
In-Reply-To: <20170825223910.54989-1-jeffrey.t.kirsher@intel.com>

On Fri, 25 Aug 2017 15:39:10 -0700
Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:

> diff --git a/ethtool-copy.h b/ethtool-copy.h
> index 06fc04c..4bb91eb 100644
> --- a/ethtool-copy.h
> +++ b/ethtool-copy.h
> @@ -400,6 +400,7 @@ struct ethtool_modinfo {
>   *	a TX interrupt, when the packet rate is above @pkt_rate_high.
>   * @rate_sample_interval: How often to do adaptive coalescing packet rate
>   *	sampling, measured in seconds.  Must not be zero.
> + * @dmac: How many usecs to store packets before moving to host memory.
>   *
>   * Each pair of (usecs, max_frames) fields specifies that interrupts
>   * should be coalesced until
> @@ -450,6 +451,7 @@ struct ethtool_coalesce {
>  	__u32	tx_coalesce_usecs_high;
>  	__u32	tx_max_coalesced_frames_high;
>  	__u32	rate_sample_interval;
> +	__u32	dmac;
>  };
>  

Because of backwards ABI compatibility, it is not safe to extend
an existing structure.

^ permalink raw reply

* Re: [PATCH net-next v2 09/14] net: mvpp2: dynamic reconfiguration of the PHY mode
From: Russell King - ARM Linux @ 2017-08-25 22:46 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, andrew, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux-kernel, mw,
	stefanc, miquel.raynal, netdev
In-Reply-To: <20170825144821.31129-10-antoine.tenart@free-electrons.com>

On Fri, Aug 25, 2017 at 04:48:16PM +0200, Antoine Tenart wrote:
> This patch adds logic to reconfigure the comphy/gop when the link status
> change at runtime. This is very useful on boards such as the mcbin which
> have SFP and Ethernet ports connected to the same MAC port: depending on
> what the user connects the driver will automatically reconfigure the
> link mode.

This commit commentry needs updating - as I've already pointed out in
the previous round, the need to reconfigure things has *nothing* to do
with there being SFP and "Ethernet" ports present.  Hence, your commit
message is entirely misleading.

> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index 49a6789a4142..04e0c8ab7b51 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -5740,6 +5740,7 @@ static void mvpp2_link_event(struct net_device *dev)
>  {
>  	struct mvpp2_port *port = netdev_priv(dev);
>  	struct phy_device *phydev = dev->phydev;
> +	bool link_reconfigured = false;
>  
>  	if (!netif_running(dev))
>  		return;
> @@ -5750,9 +5751,27 @@ static void mvpp2_link_event(struct net_device *dev)
>  			port->duplex = phydev->duplex;
>  			port->speed  = phydev->speed;
>  		}
> +
> +		if (port->phy_interface != phydev->interface && port->comphy) {
> +	                /* disable current port for reconfiguration */
> +	                mvpp2_interrupts_disable(port);
> +	                netif_carrier_off(port->dev);
> +	                mvpp2_port_disable(port);
> +			phy_power_off(port->comphy);
> +
> +	                /* comphy reconfiguration */
> +	                port->phy_interface = phydev->interface;
> +	                mvpp22_comphy_init(port);
> +
> +	                /* gop/mac reconfiguration */
> +	                mvpp22_gop_init(port);
> +	                mvpp2_port_mii_set(port);
> +
> +	                link_reconfigured = true;
> +		}
>  	}
>  
> -	if (phydev->link != port->link) {
> +	if (phydev->link != port->link || link_reconfigured) {
>  		port->link = phydev->link;
>  
>  		if (phydev->link) {
> -- 
> 2.13.5
> 

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-08-25 22:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <20170824234551-mutt-send-email-mst@kernel.org>

>> >> > We don't enable network watchdog on virtio but we could and maybe
>> >> > should.
>> >>
>> >> Can you elaborate?
>> >
>> > The issue is that holding onto buffers for very long times makes guests
>> > think they are stuck. This is funamentally because from guest point of
>> > view this is a NIC, so it is supposed to transmit things out in
>> > a timely manner. If host backs the virtual NIC by something that is not
>> > a NIC, with traffic shaping etc introducing unbounded latencies,
>> > guest will be confused.
>>
>> That assumes that guests are fragile in this regard. A linux guest
>> does not make such assumptions.
>
> Yes it does. Examples above:
>         > > - a single slow flow can occupy the whole ring, you will not
>         > >   be able to make any new buffers available for the fast flow

Oh, right. Though those are due to vring_desc pool exhaustion
rather than an upper bound on latency of any single packet.

Limiting the number of zerocopy packets in flight to some fraction
of the ring ensures that fast flows can always grab a slot. Running
out of ubuf_info slots reverts to copy, so indirectly does this. But
I read it correclty the zerocopy pool may be equal to or larger than
the descriptor pool. Should we refine the zcopy_used test

    (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx

to also return false if the number of outstanding ubuf_info is greater
than, say, vq->num >> 1?

^ permalink raw reply

* Re: [PATCH net-next v2 05/14] net: mvpp2: do not force the link mode
From: Russell King - ARM Linux @ 2017-08-25 22:43 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, andrew, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux-kernel, mw,
	stefanc, miquel.raynal, netdev
In-Reply-To: <20170825144821.31129-6-antoine.tenart@free-electrons.com>

On Fri, Aug 25, 2017 at 04:48:12PM +0200, Antoine Tenart wrote:
> The link mode (speed, duplex) was forced based on what the phylib
> returns. This should not be the case, and only forced by ethtool
> functions manually. This patch removes the link mode enforcement from
> the phylib link_event callback.

So how does RGMII work (which has no in-band signalling between the PHY
and MAC)?

phylib expects the network driver to configure it according to the PHY
state at link_event time - I think you need to explain more why you
think that this is not necessary.

> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 24 ------------------------
>  1 file changed, 24 deletions(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index fab231858a41..498a4969dc58 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -5741,30 +5741,10 @@ static void mvpp2_link_event(struct net_device *dev)
>  	struct mvpp2_port *port = netdev_priv(dev);
>  	struct phy_device *phydev = dev->phydev;
>  	int status_change = 0;
> -	u32 val;
>  
>  	if (phydev->link) {
>  		if ((port->speed != phydev->speed) ||
>  		    (port->duplex != phydev->duplex)) {
> -			u32 val;
> -
> -			val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> -			val &= ~(MVPP2_GMAC_CONFIG_MII_SPEED |
> -				 MVPP2_GMAC_CONFIG_GMII_SPEED |
> -				 MVPP2_GMAC_CONFIG_FULL_DUPLEX |
> -				 MVPP2_GMAC_AN_SPEED_EN |
> -				 MVPP2_GMAC_AN_DUPLEX_EN);
> -
> -			if (phydev->duplex)
> -				val |= MVPP2_GMAC_CONFIG_FULL_DUPLEX;
> -
> -			if (phydev->speed == SPEED_1000)
> -				val |= MVPP2_GMAC_CONFIG_GMII_SPEED;
> -			else if (phydev->speed == SPEED_100)
> -				val |= MVPP2_GMAC_CONFIG_MII_SPEED;
> -
> -			writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> -
>  			port->duplex = phydev->duplex;
>  			port->speed  = phydev->speed;
>  		}
> @@ -5782,10 +5762,6 @@ static void mvpp2_link_event(struct net_device *dev)
>  
>  	if (status_change) {
>  		if (phydev->link) {
> -			val = readl(port->base + MVPP2_GMAC_AUTONEG_CONFIG);
> -			val |= (MVPP2_GMAC_FORCE_LINK_PASS |
> -				MVPP2_GMAC_FORCE_LINK_DOWN);
> -			writel(val, port->base + MVPP2_GMAC_AUTONEG_CONFIG);
>  			mvpp2_egress_enable(port);
>  			mvpp2_ingress_enable(port);
>  		} else {
> -- 
> 2.13.5
> 

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* [ethtool 1/1] ethtool: Add DMA Coalescing support
From: Jeff Kirsher @ 2017-08-25 22:39 UTC (permalink / raw)
  To: linville
  Cc: Paul Greenwalt, netdev, nhorman, sassmann, jogreene, Jeff Kirsher

From: Paul Greenwalt <paul.greenwalt@intel.com>

Add support for DMA Coalescing (DMAC) hardware feature. The feature
allows synchronization of port DMA activity across ports in order to
optimize power consumption. DMAC is supported on igb and ixgbe
devices.

Support for enabling and configuring the DMAC watchdog timer is via
the ethtool coalesce [-c|-C] dmac option added with this patch.
Since DMAC is disabled when interrupt moderation is disabled, placing
the dmac option in the coalesce command provides related -c 'show'
information with a single command. The dmac option allows the users
to disable DMAC, or enable and set the DMAC watchdog timer. When in
coalescing mode, this timer starts counting down when the first
transaction is batched. The controller moves to the not coalescing
state when the watchdog timer reaches zero.

Set usage: ethtool -C DEVNAME dmac [0 | N]

Where 0 disables DMAC, and N is watchdog timer interval in usecs.
The device driver will check that N is within a valid range.

Example:
Enable and configure DMAC watchdog timer to 1000 usecs:
 # ethtool -C dmac 1000

Disable DMAC:
 # ethtool -C dmac 0

Signed-off-by: Paul Greenwalt <paul.greenwalt@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 ethtool-copy.h | 2 ++
 ethtool.8.in   | 1 +
 ethtool.c      | 8 +++++++-
 3 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index 06fc04c..4bb91eb 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -400,6 +400,7 @@ struct ethtool_modinfo {
  *	a TX interrupt, when the packet rate is above @pkt_rate_high.
  * @rate_sample_interval: How often to do adaptive coalescing packet rate
  *	sampling, measured in seconds.  Must not be zero.
+ * @dmac: How many usecs to store packets before moving to host memory.
  *
  * Each pair of (usecs, max_frames) fields specifies that interrupts
  * should be coalesced until
@@ -450,6 +451,7 @@ struct ethtool_coalesce {
 	__u32	tx_coalesce_usecs_high;
 	__u32	tx_max_coalesced_frames_high;
 	__u32	rate_sample_interval;
+	__u32	dmac;
 };
 
 /**
diff --git a/ethtool.8.in b/ethtool.8.in
index 7224b04..96b0a67 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -165,6 +165,7 @@ ethtool \- query or control network driver and hardware settings
 .BN tx\-usecs\-high
 .BN tx\-frames\-high
 .BN sample\-interval
+.BN dmac
 .HP
 .B ethtool \-g|\-\-show\-ring
 .I devname
diff --git a/ethtool.c b/ethtool.c
index ad18704..e1530f5 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -1337,6 +1337,7 @@ static int dump_coalesce(const struct ethtool_coalesce *ecoal)
 		"sample-interval: %u\n"
 		"pkt-rate-low: %u\n"
 		"pkt-rate-high: %u\n"
+		"dmac: %u\n"
 		"\n"
 		"rx-usecs: %u\n"
 		"rx-frames: %u\n"
@@ -1362,6 +1363,7 @@ static int dump_coalesce(const struct ethtool_coalesce *ecoal)
 		ecoal->rate_sample_interval,
 		ecoal->pkt_rate_low,
 		ecoal->pkt_rate_high,
+		ecoal->dmac,
 
 		ecoal->rx_coalesce_usecs,
 		ecoal->rx_max_coalesced_frames,
@@ -2068,6 +2070,7 @@ static int do_scoalesce(struct cmd_context *ctx)
 	int coal_adaptive_rx_wanted = -1;
 	int coal_adaptive_tx_wanted = -1;
 	s32 coal_sample_rate_wanted = -1;
+	s32 coal_dmac_wanted = -1;
 	s32 coal_pkt_rate_low_wanted = -1;
 	s32 coal_pkt_rate_high_wanted = -1;
 	s32 coal_rx_usec_wanted = -1;
@@ -2093,6 +2096,8 @@ static int do_scoalesce(struct cmd_context *ctx)
 		  &ecoal.use_adaptive_tx_coalesce },
 		{ "sample-interval", CMDL_S32, &coal_sample_rate_wanted,
 		  &ecoal.rate_sample_interval },
+		{ "dmac", CMDL_S32, &coal_dmac_wanted,
+		  &ecoal.dmac },
 		{ "stats-block-usecs", CMDL_S32, &coal_stats_wanted,
 		  &ecoal.stats_block_coalesce_usecs },
 		{ "pkt-rate-low", CMDL_S32, &coal_pkt_rate_low_wanted,
@@ -4784,7 +4789,8 @@ static const struct option {
 	  "		[rx-frames-high N]\n"
 	  "		[tx-usecs-high N]\n"
 	  "		[tx-frames-high N]\n"
-	  "		[sample-interval N]\n" },
+	  "		[sample-interval N]\n"
+	  "		[dmac N]\n" },
 	{ "-g|--show-ring", 1, do_gring, "Query RX/TX ring parameters" },
 	{ "-G|--set-ring", 1, do_sring, "Set RX/TX ring parameters",
 	  "		[ rx N ]\n"
-- 
2.14.1

^ permalink raw reply related

* Re: [RFC PATCH] net: limit maximum number of packets to mark with xmit_more
From: Alexander Duyck @ 2017-08-25 22:33 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Waskiewicz Jr, Peter, Keller, Jacob E, netdev@vger.kernel.org
In-Reply-To: <20170825085816.3425a70c@xeon-e3>

On Fri, Aug 25, 2017 at 8:58 AM, Stephen Hemminger
<stephen@networkplumber.org> wrote:
> On Fri, 25 Aug 2017 15:36:22 +0000
> "Waskiewicz Jr, Peter" <peter.waskiewicz.jr@intel.com> wrote:
>
>> On 8/25/17 11:25 AM, Jacob Keller wrote:
>> > Under some circumstances, such as with many stacked devices, it is
>> > possible that dev_hard_start_xmit will bundle many packets together, and
>> > mark them all with xmit_more.
>> >
>> > Most drivers respond to xmit_more by skipping tail bumps on packet
>> > rings, or similar behavior as long as xmit_more is set. This is
>> > a performance win since it means drivers can avoid notifying hardware of
>> > new packets repeat daily, and thus avoid wasting unnecessary PCIe or other
>> > bandwidth.
>> >
>> > This use of xmit_more comes with a trade off because bundling too many
>> > packets can increase latency of the Tx packets. To avoid this, we should
>> > limit the maximum number of packets with xmit_more.
>> >
>> > Driver authors could modify their drivers to check for some determined
>> > limit, but this requires all drivers to be modified in order to gain
>> > advantage.
>> >
>> > Instead, add a sysctl "xmit_more_max" which can be used to configure the
>> > maximum number of xmit_more skbs to send in a sequence. This ensures
>> > that all drivers benefit, and allows system administrators the option to
>> > tune the value to their environment.
>> >
>> > Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
>> > ---
>> >
>> > Stray thoughts and further questions....
>> >
>> > Is this the right approach? Did I miss any other places where we should
>> > limit? Does the limit make sense? Should it instead be a per-device
>> > tuning nob instead of a global? Is 32 a good default?
>>
>> I actually like the idea of a per-device knob.  A xmit_more_max that's
>> global in a system with 1GbE devices along with a 25/50GbE or more just
>> doesn't make much sense to me.  Or having heterogeneous vendor devices
>> in the same system that have different HW behaviors could mask issues
>> with latency.
>>
>> This seems like another incarnation of possible buffer-bloat if the max
>> is too high...
>>
>> >
>> >   Documentation/sysctl/net.txt |  6 ++++++
>> >   include/linux/netdevice.h    |  2 ++
>> >   net/core/dev.c               | 10 +++++++++-
>> >   net/core/sysctl_net_core.c   |  7 +++++++
>> >   4 files changed, 24 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/Documentation/sysctl/net.txt b/Documentation/sysctl/net.txt
>> > index b67044a2575f..3d995e8f4448 100644
>> > --- a/Documentation/sysctl/net.txt
>> > +++ b/Documentation/sysctl/net.txt
>> > @@ -230,6 +230,12 @@ netdev_max_backlog
>> >   Maximum number  of  packets,  queued  on  the  INPUT  side, when the interface
>> >   receives packets faster than kernel can process them.
>> >
>> > +xmit_more_max
>> > +-------------
>> > +
>> > +Maximum number of packets in a row to mark with skb->xmit_more. A value of zero
>> > +indicates no limit.
>>
>> What defines "packet?"  MTU-sized packets, or payloads coming down from
>> the stack (e.g. TSO's)?
>
> xmit_more is only a hint to the device. The device driver should ignore it unless
> there are hardware advantages. The device driver is the place with HW specific
> knowledge (like 4 Tx descriptors is equivalent to one PCI transaction on this device).
>
> Anything that pushes that optimization out to the user is only useful for benchmarks
> and embedded devices.

Actually I think I might have an idea what is going on here and I
agree that this is probably something that needs to be fixed in the
drivers. Especially since the problem isn't so much the skbs but
descriptors in the descriptor ring.

If I am not mistaken the issue is most drivers will honor the
xmit_more unless the ring cannot enqueue another packet. The problem
is if the clean-up is occurring on a different CPU than transmit we
can cause the clean-up CPU/device DMA to go idle by not providing any
notifications to the device that new packets are present. What we
should probably do is look at adding another condition which is to
force us to flush the packet if we have used over half of the
descriptors in a given ring without notifying the device. Then that
way we can be filling half while the device is processing the other
half which should result in us operating smoothly.

- Alex

^ permalink raw reply

* Re: [Intel-wired-lan] how to submit fixes for i40e/i40evf?
From: Stefano Brivio @ 2017-08-25 22:28 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: David S. Miller, Jeff Kirsher, Netdev, intel-wired-lan
In-Reply-To: <CAKgT0UcLAw0958zJ4o9V9Vyamvd1mBQX1+9GAaZw-hZubMqXnw@mail.gmail.com>

On Fri, 25 Aug 2017 15:10:08 -0700
Alexander Duyck <alexander.duyck@gmail.com> wrote:

> On Fri, Aug 25, 2017 at 1:52 PM, Stefano Brivio <sbrivio@redhat.com> wrote:
>
> [...]
>
> > Once patches reach Intel's patchwork, will they need to wait for some
> > kind of periodically scheduled pull request process?  
> 
> Once in the patchwork they go through testing and after they have
> passed testing Jeff will try to push them to Dave.

Ok, the whole part above is clear, thanks a lot for clarifying.

> > I don't know if a process is actually defined at this level of detail,
> > but still I feel it's wrong that an obvious fix for a potential crash is
> > waiting in some sort of limbo for 10 days now. Sure, worse things
> > happen in the world, but I can't understand what this patch is waiting
> > for.  
> 
> Well in the case of your patch it was rejected as it didn't apply to
> Jeff's tree

It actually did when I posted it.

> and conflicted with Jacob Keller's patch. He submitted a v2 on Tuesday
> which has only been applied for a few days. Once it receives a
> "Tested-by:"

Which, if I understood correctly, only comes after some internal testing
process, right?

> it will be ready for submission assuming it passes testing.

Now that patch is again in a v2 pull request for net-next, without the
changes I suggested for the commit message. And the same exact code
changes were around for two weeks. IMHO there's room for improvement,
so to speak.

> I hope that helps to clarify things.

It did to some extent, and thanks again for that.

^ permalink raw reply

* [PATCH 2/2] drivers: net: xgene: Clean up all outstanding tx descriptors
From: Iyappan Subramanian @ 2017-08-25 22:23 UTC (permalink / raw)
  To: davem, netdev
  Cc: linux-arm-kernel, dnelson, qnguyen, patches, Iyappan Subramanian
In-Reply-To: <1503699810-12803-1-git-send-email-isubramanian@apm.com>

When xgene_enet is rmmod'd and there are still outstanding tx descriptors
that have been setup but have not completed, it is possible on the next
modprobe of the driver to receive the oldest of such tx descriptors. This
results in a kernel NULL pointer dereference.

This patch attempts to clean up (by tearing down) all outstanding tx
descriptors when the xgene_enet driver is being rmmod'd.

Given that, on the next modprobe it should be safe to ignore any such tx
descriptors received that map to a NULL skb pointer.

Additionally this patch removes redundant call to dev_kfree_skb_any() from
xgene_enet_setup_tx_desc(). The only caller of xgene_enet_setup_tx_desc()
will call dev_kfree_skb_any() upon return of an error. Nothing is gained by
calling it twice in a row.

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
Signed-off-by: Dean Nelson <dnelson@redhat.com>
Tested-by: Quan Nguyen <qnguyen@apm.com>
---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 120 +++++++++++++++++------
 1 file changed, 89 insertions(+), 31 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index 6e253d9..76e2903 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -237,22 +237,24 @@ static irqreturn_t xgene_enet_rx_irq(const int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static int xgene_enet_tx_completion(struct xgene_enet_desc_ring *cp_ring,
-				    struct xgene_enet_raw_desc *raw_desc)
+static dma_addr_t *xgene_get_frag_dma_array(struct xgene_enet_desc_ring *ring,
+					    u16 skb_index)
 {
-	struct xgene_enet_pdata *pdata = netdev_priv(cp_ring->ndev);
-	struct sk_buff *skb;
+	return &ring->frag_dma_addr[skb_index * MAX_SKB_FRAGS];
+}
+
+static void xgene_enet_teardown_tx_desc(struct xgene_enet_desc_ring *cp_ring,
+					struct xgene_enet_raw_desc *raw_desc,
+					struct xgene_enet_raw_desc *exp_desc,
+					struct sk_buff *skb,
+					u16 skb_index)
+{
+	dma_addr_t dma_addr, *frag_dma_addr;
 	struct device *dev;
 	skb_frag_t *frag;
-	dma_addr_t *frag_dma_addr;
-	u16 skb_index;
-	u8 mss_index;
-	u8 status;
 	int i;
 
-	skb_index = GET_VAL(USERINFO, le64_to_cpu(raw_desc->m0));
-	skb = cp_ring->cp_skb[skb_index];
-	frag_dma_addr = &cp_ring->frag_dma_addr[skb_index * MAX_SKB_FRAGS];
+	frag_dma_addr = xgene_get_frag_dma_array(cp_ring, skb_index);
 
 	dev = ndev_to_dev(cp_ring->ndev);
 	dma_unmap_single(dev, GET_VAL(DATAADDR, le64_to_cpu(raw_desc->m1)),
@@ -265,6 +267,36 @@ static int xgene_enet_tx_completion(struct xgene_enet_desc_ring *cp_ring,
 			       DMA_TO_DEVICE);
 	}
 
+	if (exp_desc && GET_VAL(LL_BYTES_LSB, le64_to_cpu(raw_desc->m2))) {
+		dma_addr = GET_VAL(DATAADDR, le64_to_cpu(exp_desc->m2));
+		dma_unmap_single(dev, dma_addr, sizeof(u64) * MAX_EXP_BUFFS,
+				 DMA_TO_DEVICE);
+	}
+
+	dev_kfree_skb_any(skb);
+}
+
+static int xgene_enet_tx_completion(struct xgene_enet_desc_ring *cp_ring,
+				    struct xgene_enet_raw_desc *raw_desc,
+				    struct xgene_enet_raw_desc *exp_desc)
+{
+	struct xgene_enet_pdata *pdata = netdev_priv(cp_ring->ndev);
+	struct sk_buff *skb;
+	u16 skb_index;
+	u8 status;
+	u8 mss_index;
+
+	skb_index = GET_VAL(USERINFO, le64_to_cpu(raw_desc->m0));
+	skb = cp_ring->cp_skb[skb_index];
+	if (unlikely(!skb)) {
+		netdev_err(cp_ring->ndev, "completion skb is NULL\n");
+		return -EIO;
+	}
+	cp_ring->cp_skb[skb_index] = NULL;
+
+	xgene_enet_teardown_tx_desc(cp_ring, raw_desc, exp_desc, skb,
+				    skb_index);
+
 	if (GET_BIT(ET, le64_to_cpu(raw_desc->m3))) {
 		mss_index = GET_VAL(MSS, le64_to_cpu(raw_desc->m3));
 		spin_lock(&pdata->mss_lock);
@@ -279,12 +311,6 @@ static int xgene_enet_tx_completion(struct xgene_enet_desc_ring *cp_ring,
 		cp_ring->tx_errors++;
 	}
 
-	if (likely(skb)) {
-		dev_kfree_skb_any(skb);
-	} else {
-		netdev_err(cp_ring->ndev, "completion skb is NULL\n");
-	}
-
 	return 0;
 }
 
@@ -412,11 +438,6 @@ static __le64 *xgene_enet_get_exp_bufs(struct xgene_enet_desc_ring *ring)
 	return exp_bufs;
 }
 
-static dma_addr_t *xgene_get_frag_dma_array(struct xgene_enet_desc_ring *ring)
-{
-	return &ring->cp_ring->frag_dma_addr[ring->tail * MAX_SKB_FRAGS];
-}
-
 static int xgene_enet_setup_tx_desc(struct xgene_enet_desc_ring *tx_ring,
 				    struct sk_buff *skb)
 {
@@ -473,7 +494,8 @@ static int xgene_enet_setup_tx_desc(struct xgene_enet_desc_ring *tx_ring,
 	for (i = nr_frags; i < 4 ; i++)
 		exp_desc[i ^ 1] = cpu_to_le64(LAST_BUFFER);
 
-	frag_dma_addr = xgene_get_frag_dma_array(tx_ring);
+	frag_dma_addr = xgene_get_frag_dma_array(tx_ring->cp_ring,
+						 tx_ring->tail);
 
 	for (i = 0, fidx = 0; split || (fidx < nr_frags); i++) {
 		if (!split) {
@@ -484,7 +506,7 @@ static int xgene_enet_setup_tx_desc(struct xgene_enet_desc_ring *tx_ring,
 			pbuf_addr = skb_frag_dma_map(dev, frag, 0, size,
 						     DMA_TO_DEVICE);
 			if (dma_mapping_error(dev, pbuf_addr))
-				return -EINVAL;
+				goto err;
 
 			frag_dma_addr[fidx] = pbuf_addr;
 			fidx++;
@@ -539,10 +561,9 @@ static int xgene_enet_setup_tx_desc(struct xgene_enet_desc_ring *tx_ring,
 		dma_addr = dma_map_single(dev, exp_bufs,
 					  sizeof(u64) * MAX_EXP_BUFFS,
 					  DMA_TO_DEVICE);
-		if (dma_mapping_error(dev, dma_addr)) {
-			dev_kfree_skb_any(skb);
-			return -EINVAL;
-		}
+		if (dma_mapping_error(dev, dma_addr))
+			goto err;
+
 		i = ell_bytes >> LL_BYTES_LSB_LEN;
 		exp_desc[2] = cpu_to_le64(SET_VAL(DATAADDR, dma_addr) |
 					  SET_VAL(LL_BYTES_MSB, i) |
@@ -558,6 +579,19 @@ static int xgene_enet_setup_tx_desc(struct xgene_enet_desc_ring *tx_ring,
 	tx_ring->tail = tail;
 
 	return count;
+
+err:
+	dma_unmap_single(dev, GET_VAL(DATAADDR, le64_to_cpu(raw_desc->m1)),
+			 skb_headlen(skb),
+			 DMA_TO_DEVICE);
+
+	for (i = 0; i < fidx; i++) {
+		frag = &skb_shinfo(skb)->frags[i];
+		dma_unmap_page(dev, frag_dma_addr[i], skb_frag_size(frag),
+			       DMA_TO_DEVICE);
+	}
+
+	return -EINVAL;
 }
 
 static netdev_tx_t xgene_enet_start_xmit(struct sk_buff *skb,
@@ -828,7 +862,8 @@ static int xgene_enet_process_ring(struct xgene_enet_desc_ring *ring,
 		if (is_rx_desc(raw_desc)) {
 			ret = xgene_enet_rx_frame(ring, raw_desc, exp_desc);
 		} else {
-			ret = xgene_enet_tx_completion(ring, raw_desc);
+			ret = xgene_enet_tx_completion(ring, raw_desc,
+						       exp_desc);
 			is_completion = true;
 		}
 		xgene_enet_mark_desc_slot_empty(raw_desc);
@@ -1071,18 +1106,41 @@ static void xgene_enet_delete_desc_rings(struct xgene_enet_pdata *pdata)
 {
 	struct xgene_enet_desc_ring *buf_pool, *page_pool;
 	struct xgene_enet_desc_ring *ring;
-	int i;
+	struct xgene_enet_raw_desc *raw_desc, *exp_desc;
+	struct sk_buff *skb;
+	int i, j, k;
 
 	for (i = 0; i < pdata->txq_cnt; i++) {
 		ring = pdata->tx_ring[i];
 		if (ring) {
+			/*
+			 * Find any tx descriptors that were setup but never
+			 * completed, and teardown the setup.
+			 */
+			for (j = 0; j < ring->slots; j++) {
+				skb = ring->cp_ring->cp_skb[j];
+				if (likely(!skb))
+					continue;
+
+				raw_desc = &ring->raw_desc[j];
+				exp_desc = NULL;
+				if (GET_BIT(NV, le64_to_cpu(raw_desc->m0))) {
+					k = (j + 1) & (ring->slots - 1);
+					exp_desc = &ring->raw_desc[k];
+				}
+
+				xgene_enet_teardown_tx_desc(ring->cp_ring,
+							    raw_desc, exp_desc,
+							    skb, j);
+			}
+
 			xgene_enet_delete_ring(ring);
 			pdata->port_ops->clear(pdata, ring);
+
 			if (pdata->cq_cnt)
 				xgene_enet_delete_ring(ring->cp_ring);
 			pdata->tx_ring[i] = NULL;
 		}
-
 	}
 
 	for (i = 0; i < pdata->rxq_cnt; i++) {
-- 
2.7.4

^ permalink raw reply related

* [PATCH 1/2] drivers: net: xgene: Correct probe sequence handling
From: Iyappan Subramanian @ 2017-08-25 22:23 UTC (permalink / raw)
  To: davem, netdev
  Cc: linux-arm-kernel, dnelson, qnguyen, patches, Iyappan Subramanian
In-Reply-To: <1503699810-12803-1-git-send-email-isubramanian@apm.com>

From: Quan Nguyen <qnguyen@apm.com>

The phy is connected at early stage of probe but not properly
disconnected if error occurs.  This patch fixes the issue.

Also changing the return type of xgene_enet_check_phy_handle(),
since this function always returns success.

Signed-off-by: Quan Nguyen <qnguyen@apm.com>
Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
---
 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 27 ++++++++++++------------
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
index 1d307f2..6e253d9 100644
--- a/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
+++ b/drivers/net/ethernet/apm/xgene/xgene_enet_main.c
@@ -1661,21 +1661,21 @@ static int xgene_enet_get_irqs(struct xgene_enet_pdata *pdata)
 	return 0;
 }
 
-static int xgene_enet_check_phy_handle(struct xgene_enet_pdata *pdata)
+static void xgene_enet_check_phy_handle(struct xgene_enet_pdata *pdata)
 {
 	int ret;
 
 	if (pdata->phy_mode == PHY_INTERFACE_MODE_XGMII)
-		return 0;
+		return;
 
 	if (!IS_ENABLED(CONFIG_MDIO_XGENE))
-		return 0;
+		return;
 
 	ret = xgene_enet_phy_connect(pdata->ndev);
 	if (!ret)
 		pdata->mdio_driver = true;
 
-	return 0;
+	return;
 }
 
 static void xgene_enet_gpiod_get(struct xgene_enet_pdata *pdata)
@@ -1779,10 +1779,6 @@ static int xgene_enet_get_resources(struct xgene_enet_pdata *pdata)
 	if (ret)
 		return ret;
 
-	ret = xgene_enet_check_phy_handle(pdata);
-	if (ret)
-		return ret;
-
 	xgene_enet_gpiod_get(pdata);
 
 	pdata->clk = devm_clk_get(&pdev->dev, NULL);
@@ -2097,9 +2093,11 @@ static int xgene_enet_probe(struct platform_device *pdev)
 		goto err;
 	}
 
+	xgene_enet_check_phy_handle(pdata);
+
 	ret = xgene_enet_init_hw(pdata);
 	if (ret)
-		goto err;
+		goto err2;
 
 	link_state = pdata->mac_ops->link_state;
 	if (pdata->phy_mode == PHY_INTERFACE_MODE_XGMII) {
@@ -2117,29 +2115,30 @@ static int xgene_enet_probe(struct platform_device *pdev)
 	spin_lock_init(&pdata->stats_lock);
 	ret = xgene_extd_stats_init(pdata);
 	if (ret)
-		goto err2;
+		goto err1;
 
 	xgene_enet_napi_add(pdata);
 	ret = register_netdev(ndev);
 	if (ret) {
 		netdev_err(ndev, "Failed to register netdev\n");
-		goto err2;
+		goto err1;
 	}
 
 	return 0;
 
-err2:
+err1:
 	/*
 	 * If necessary, free_netdev() will call netif_napi_del() and undo
 	 * the effects of xgene_enet_napi_add()'s calls to netif_napi_add().
 	 */
 
+	xgene_enet_delete_desc_rings(pdata);
+
+err2:
 	if (pdata->mdio_driver)
 		xgene_enet_phy_disconnect(pdata);
 	else if (phy_interface_mode_is_rgmii(pdata->phy_mode))
 		xgene_enet_mdio_remove(pdata);
-err1:
-	xgene_enet_delete_desc_rings(pdata);
 err:
 	free_netdev(ndev);
 	return ret;
-- 
2.7.4

^ permalink raw reply related

* [PATCH 0/2] drivers: net: xgene: Misc bug fixes
From: Iyappan Subramanian @ 2017-08-25 22:23 UTC (permalink / raw)
  To: davem, netdev
  Cc: linux-arm-kernel, dnelson, qnguyen, patches, Iyappan Subramanian

This patch set,

     1. Adds call to PHY disconnect in the case of error
     2. Cleans up all outstanding TX descriptors when the driver is
	being rmmod'd

Signed-off-by: Iyappan Subramanian <isubramanian@apm.com>
---

Iyappan Subramanian (1):
  drivers: net: xgene: Clean up all outstanding tx descriptors

Quan Nguyen (1):
  drivers: net: xgene: Correct probe sequence handling

 drivers/net/ethernet/apm/xgene/xgene_enet_main.c | 147 ++++++++++++++++-------
 1 file changed, 102 insertions(+), 45 deletions(-)

-- 
2.7.4

^ permalink raw reply

* Re: [Intel-wired-lan] how to submit fixes for i40e/i40evf?
From: Alexander Duyck @ 2017-08-25 22:10 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: David S. Miller, Jeff Kirsher, Netdev, intel-wired-lan
In-Reply-To: <20170825225215.35fc6b59@elisabeth>

On Fri, Aug 25, 2017 at 1:52 PM, Stefano Brivio <sbrivio@redhat.com> wrote:
> Hi,
>
> As I'm currently preparing another fix for i40e, and the last one I
> submitted has been stuck for about two weeks now, I would like to ask
> some details about the process to submit fixes for i40e/i40evf drivers,
> before I do something wrong again.
>
> Do all the patches have to go through Intel's patchwork, no matter
> what's the perceived severity of the issue? Should I still submit them
> to netdev anyway?

It is preferred if they go through Intel's patchwork as they should go
through some additional testing and review that way.

> Which trees should I check before submitting a patch? Is it enough to
> check the master branch of jkirsher/net-queue.git and
> jkirsher/next-queue.git?

It depends on if you want to see the patches end up in Dave's net tree
or his net-next tree. If the fix is high enough priority to be
accepted into Dave's net tree then you can submit it against Jeff's
net-queue, otherwise it should be the next-queue. You might want to
check the next-queue for a fix if you are seeing an issue though, as
there is a chance of a fix for an issue you deem critical ending up
there if somebody else judged it differently.

> Once patches reach Intel's patchwork, will they need to wait for some
> kind of periodically scheduled pull request process?

Once in the patchwork they go through testing and after they have
passed testing Jeff will try to push them to Dave.

> I don't know if a process is actually defined at this level of detail,
> but still I feel it's wrong that an obvious fix for a potential crash is
> waiting in some sort of limbo for 10 days now. Sure, worse things
> happen in the world, but I can't understand what this patch is waiting
> for.

Well in the case of your patch it was rejected as it didn't apply to
Jeff's tree and conflicted with Jacob Keller's patch. He submitted a
v2 on Tuesday which has only been applied for a few days. Once it
receives a "Tested-by:" it will be ready for submission assuming it
passes testing.

> Any answer is appreciated. Thanks,
>
> --
> Stefano

I hope that helps to clarify things.

- Alex

^ permalink raw reply

* [PATCH net] ipv6: fix sparse warning on rt6i_node
From: Wei Wang @ 2017-08-25 22:03 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Eric Dumazet, Martin KaFai Lau, Wei Wang

From: Wei Wang <weiwan@google.com>

Commit c5cff8561d2d adds rcu grace period before freeing fib6_node. This
generates a new sparse warning on rt->rt6i_node related code:
  net/ipv6/route.c:1394:30: error: incompatible types in comparison
  expression (different address spaces)
  ./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
  expression (different address spaces)

This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
rcu API is used for it.
After this fix, sparse no longer generates the above warning.

Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ip6_fib.h |  2 +-
 net/ipv6/addrconf.c   |  2 +-
 net/ipv6/ip6_fib.c    | 11 +++++++----
 net/ipv6/route.c      |  3 ++-
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index e9c59db92942..af509f801084 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -105,7 +105,7 @@ struct rt6_info {
 	 * the same cache line.
 	 */
 	struct fib6_table		*rt6i_table;
-	struct fib6_node		*rt6i_node;
+	struct fib6_node __rcu		*rt6i_node;
 
 	struct in6_addr			rt6i_gateway;
 
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 3c46e9513a31..936e9ab4dda5 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -5556,7 +5556,7 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
 		 * our DAD process, so we don't need
 		 * to do it again
 		 */
-		if (!(ifp->rt->rt6i_node))
+		if (!rcu_access_pointer(ifp->rt->rt6i_node))
 			ip6_ins_rt(ifp->rt);
 		if (ifp->idev->cnf.forwarding)
 			addrconf_join_anycast(ifp);
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index a5ebf86f6be8..10b4b1f8b838 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -889,7 +889,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 
 		rt->dst.rt6_next = iter;
 		*ins = rt;
-		rt->rt6i_node = fn;
+		rcu_assign_pointer(rt->rt6i_node, fn);
 		atomic_inc(&rt->rt6i_ref);
 		if (!info->skip_notify)
 			inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
@@ -915,7 +915,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
 			return err;
 
 		*ins = rt;
-		rt->rt6i_node = fn;
+		rcu_assign_pointer(rt->rt6i_node, fn);
 		rt->dst.rt6_next = iter->dst.rt6_next;
 		atomic_inc(&rt->rt6i_ref);
 		if (!info->skip_notify)
@@ -1480,8 +1480,9 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
 
 int fib6_del(struct rt6_info *rt, struct nl_info *info)
 {
+	struct fib6_node *fn = rcu_dereference_protected(rt->rt6i_node,
+				    lockdep_is_held(&rt->rt6i_table->tb6_lock));
 	struct net *net = info->nl_net;
-	struct fib6_node *fn = rt->rt6i_node;
 	struct rt6_info **rtp;
 
 #if RT6_DEBUG >= 2
@@ -1670,7 +1671,9 @@ static int fib6_clean_node(struct fib6_walker *w)
 			if (res) {
 #if RT6_DEBUG >= 2
 				pr_debug("%s: del failed: rt=%p@%p err=%d\n",
-					 __func__, rt, rt->rt6i_node, res);
+					 __func__, rt,
+					 rcu_access_pointer(rt->rt6i_node),
+					 res);
 #endif
 				continue;
 			}
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index a9d3564caf49..33629f2a0f9d 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1383,7 +1383,8 @@ static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
 static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
 {
 	return !(rt->rt6i_flags & RTF_CACHE) &&
-		(rt->rt6i_flags & RTF_PCPU || rt->rt6i_node);
+		(rt->rt6i_flags & RTF_PCPU ||
+		 rcu_access_pointer(rt->rt6i_node));
 }
 
 static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
-- 
2.14.1.342.g6490525c54-goog

^ permalink raw reply related

* Re: [PATCH] net: sunrpc: svcsock: fix NULL-pointer exception
From: J. Bruce Fields @ 2017-08-25 22:01 UTC (permalink / raw)
  To: Vadim Lomovtsev
  Cc: trond.myklebust, anna.schumaker, jlayton, davem, linux-nfs,
	netdev, linux-kernel, pabeni
In-Reply-To: <1503050447-13362-1-git-send-email-vlomovts@redhat.com>

On Fri, Aug 18, 2017 at 06:00:47AM -0400, Vadim Lomovtsev wrote:
> While running nfs/connectathon tests kernel NULL-pointer exception
> has been observed due to races in svcsock.c.
> 
> Race is appear when kernel accepts connection by kernel_accept
> (which creates new socket) and start queuing ingress packets
> to new socket. This happanes in ksoftirq context which concurrently
> on a differnt core while new socket setup is not done yet.
> 
> The fix is to re-order socket user data init sequence, add NULL-ptr
> check before callback call along with barriers to prevent kernel crash.
> 
> Test results: nfs/connectathon reports '0' failed tests for about 200+ iterations.

By the way, is there anything special about your setup that allows you
to reproduce this?  There's nothing special about connectathon tests, so
I'm just wondering why we haven't had a lot of reports of this.

--b.

> 
> Crash log:
> ---<-snip->---
> [ 6708.638984] Unable to handle kernel NULL pointer dereference at virtual address 00000000
> [ 6708.647093] pgd = ffff0000094e0000
> [ 6708.650497] [00000000] *pgd=0000010ffff90003, *pud=0000010ffff90003, *pmd=0000010ffff80003, *pte=0000000000000000
> [ 6708.660761] Internal error: Oops: 86000005 [#1] SMP
> [ 6708.665630] Modules linked in: nfsv3 nfnetlink_queue nfnetlink_log nfnetlink rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache overlay xt_CONNSECMARK xt_SECMARK xt_conntrack iptable_security ip_tables ah4 xfrm4_mode_transport sctp tun binfmt_misc ext4 jbd2 mbcache loop tcp_diag udp_diag inet_diag rpcrdma ib_isert iscsi_target_mod ib_iser rdma_cm iw_cm libiscsi scsi_transport_iscsi ib_srpt target_core_mod ib_srp scsi_transport_srp ib_ipoib ib_ucm ib_uverbs ib_umad ib_cm ib_core nls_koi8_u nls_cp932 ts_kmp nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack vfat fat ghash_ce sha2_ce sha1_ce cavium_rng_vf i2c_thunderx sg thunderx_edac i2c_smbus edac_core cavium_rng nfsd auth_rpcgss nfs_acl lockd grace sunrpc xfs libcrc32c nicvf nicpf ast i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgb
 lt fb_sys_fops
> [ 6708.736446]  ttm drm i2c_core thunder_bgx thunder_xcv mdio_thunder mdio_cavium dm_mirror dm_region_hash dm_log dm_mod [last unloaded: stap_3c300909c5b3f46dcacd49aab3334af_87021]
> [ 6708.752275] CPU: 84 PID: 0 Comm: swapper/84 Tainted: G        W  OE   4.11.0-4.el7.aarch64 #1
> [ 6708.760787] Hardware name: www.cavium.com CRB-2S/CRB-2S, BIOS 0.3 Mar 13 2017
> [ 6708.767910] task: ffff810006842e80 task.stack: ffff81000689c000
> [ 6708.773822] PC is at 0x0
> [ 6708.776739] LR is at svc_data_ready+0x38/0x88 [sunrpc]
> [ 6708.781866] pc : [<0000000000000000>] lr : [<ffff0000029d7378>] pstate: 60000145
> [ 6708.789248] sp : ffff810ffbad3900
> [ 6708.792551] x29: ffff810ffbad3900 x28: ffff000008c73d58
> [ 6708.797853] x27: 0000000000000000 x26: ffff81000bbe1e00
> [ 6708.803156] x25: 0000000000000020 x24: ffff800f7410bf28
> [ 6708.808458] x23: ffff000008c63000 x22: ffff000008c63000
> [ 6708.813760] x21: ffff800f7410bf28 x20: ffff81000bbe1e00
> [ 6708.819063] x19: ffff810012412400 x18: 00000000d82a9df2
> [ 6708.824365] x17: 0000000000000000 x16: 0000000000000000
> [ 6708.829667] x15: 0000000000000000 x14: 0000000000000001
> [ 6708.834969] x13: 0000000000000000 x12: 722e736f622e676e
> [ 6708.840271] x11: 00000000f814dd99 x10: 0000000000000000
> [ 6708.845573] x9 : 7374687225000000 x8 : 0000000000000000
> [ 6708.850875] x7 : 0000000000000000 x6 : 0000000000000000
> [ 6708.856177] x5 : 0000000000000028 x4 : 0000000000000000
> [ 6708.861479] x3 : 0000000000000000 x2 : 00000000e5000000
> [ 6708.866781] x1 : 0000000000000000 x0 : ffff81000bbe1e00
> [ 6708.872084]
> [ 6708.873565] Process swapper/84 (pid: 0, stack limit = 0xffff81000689c000)
> [ 6708.880341] Stack: (0xffff810ffbad3900 to 0xffff8100068a0000)
> [ 6708.886075] Call trace:
> [ 6708.888513] Exception stack(0xffff810ffbad3710 to 0xffff810ffbad3840)
> [ 6708.894942] 3700:                                   ffff810012412400 0001000000000000
> [ 6708.902759] 3720: ffff810ffbad3900 0000000000000000 0000000060000145 ffff800f79300000
> [ 6708.910577] 3740: ffff000009274d00 00000000000003ea 0000000000000015 ffff000008c63000
> [ 6708.918395] 3760: ffff810ffbad3830 ffff800f79300000 000000000000004d 0000000000000000
> [ 6708.926212] 3780: ffff810ffbad3890 ffff0000080f88dc ffff800f79300000 000000000000004d
> [ 6708.934030] 37a0: ffff800f7930093c ffff000008c63000 0000000000000000 0000000000000140
> [ 6708.941848] 37c0: ffff000008c2c000 0000000000040b00 ffff81000bbe1e00 0000000000000000
> [ 6708.949665] 37e0: 00000000e5000000 0000000000000000 0000000000000000 0000000000000028
> [ 6708.957483] 3800: 0000000000000000 0000000000000000 0000000000000000 7374687225000000
> [ 6708.965300] 3820: 0000000000000000 00000000f814dd99 722e736f622e676e 0000000000000000
> [ 6708.973117] [<          (null)>]           (null)
> [ 6708.977824] [<ffff0000086f9fa4>] tcp_data_queue+0x754/0xc5c
> [ 6708.983386] [<ffff0000086fa64c>] tcp_rcv_established+0x1a0/0x67c
> [ 6708.989384] [<ffff000008704120>] tcp_v4_do_rcv+0x15c/0x22c
> [ 6708.994858] [<ffff000008707418>] tcp_v4_rcv+0xaf0/0xb58
> [ 6709.000077] [<ffff0000086df784>] ip_local_deliver_finish+0x10c/0x254
> [ 6709.006419] [<ffff0000086dfea4>] ip_local_deliver+0xf0/0xfc
> [ 6709.011980] [<ffff0000086dfad4>] ip_rcv_finish+0x208/0x3a4
> [ 6709.017454] [<ffff0000086e018c>] ip_rcv+0x2dc/0x3c8
> [ 6709.022328] [<ffff000008692fc8>] __netif_receive_skb_core+0x2f8/0xa0c
> [ 6709.028758] [<ffff000008696068>] __netif_receive_skb+0x38/0x84
> [ 6709.034580] [<ffff00000869611c>] netif_receive_skb_internal+0x68/0xdc
> [ 6709.041010] [<ffff000008696bc0>] napi_gro_receive+0xcc/0x1a8
> [ 6709.046690] [<ffff0000014b0fc4>] nicvf_cq_intr_handler+0x59c/0x730 [nicvf]
> [ 6709.053559] [<ffff0000014b1380>] nicvf_poll+0x38/0xb8 [nicvf]
> [ 6709.059295] [<ffff000008697a6c>] net_rx_action+0x2f8/0x464
> [ 6709.064771] [<ffff000008081824>] __do_softirq+0x11c/0x308
> [ 6709.070164] [<ffff0000080d14e4>] irq_exit+0x12c/0x174
> [ 6709.075206] [<ffff00000813101c>] __handle_domain_irq+0x78/0xc4
> [ 6709.081027] [<ffff000008081608>] gic_handle_irq+0x94/0x190
> [ 6709.086501] Exception stack(0xffff81000689fdf0 to 0xffff81000689ff20)
> [ 6709.092929] fde0:                                   0000810ff2ec0000 ffff000008c10000
> [ 6709.100747] fe00: ffff000008c70ef4 0000000000000001 0000000000000000 ffff810ffbad9b18
> [ 6709.108565] fe20: ffff810ffbad9c70 ffff8100169d3800 ffff810006843ab0 ffff81000689fe80
> [ 6709.116382] fe40: 0000000000000bd0 0000ffffdf979cd0 183f5913da192500 0000ffff8a254ce4
> [ 6709.124200] fe60: 0000ffff8a254b78 0000aaab10339808 0000000000000000 0000ffff8a0c2a50
> [ 6709.132018] fe80: 0000ffffdf979b10 ffff000008d6d450 ffff000008c10000 ffff000008d6d000
> [ 6709.139836] fea0: 0000000000000054 ffff000008cd3dbc 0000000000000000 0000000000000000
> [ 6709.147653] fec0: 0000000000000000 0000000000000000 0000000000000000 ffff81000689ff20
> [ 6709.155471] fee0: ffff000008085240 ffff81000689ff20 ffff000008085244 0000000060000145
> [ 6709.163289] ff00: ffff81000689ff10 ffff00000813f1e4 ffffffffffffffff ffff00000813f238
> [ 6709.171107] [<ffff000008082eb4>] el1_irq+0xb4/0x140
> [ 6709.175976] [<ffff000008085244>] arch_cpu_idle+0x44/0x11c
> [ 6709.181368] [<ffff0000087bf3b8>] default_idle_call+0x20/0x30
> [ 6709.187020] [<ffff000008116d50>] do_idle+0x158/0x1e4
> [ 6709.191973] [<ffff000008116ff4>] cpu_startup_entry+0x2c/0x30
> [ 6709.197624] [<ffff00000808e7cc>] secondary_start_kernel+0x13c/0x160
> [ 6709.203878] [<0000000001bc71c4>] 0x1bc71c4
> [ 6709.207967] Code: bad PC value
> [ 6709.211061] SMP: stopping secondary CPUs
> [ 6709.218830] Starting crashdump kernel...
> [ 6709.222749] Bye!
> ---<-snip>---
> 
> Signed-off-by: Vadim Lomovtsev <vlomovts@redhat.com>
> ---
>  net/sunrpc/svcsock.c | 24 ++++++++++++++++++------
>  1 file changed, 18 insertions(+), 6 deletions(-)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 2b720fa..b6496f3 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -421,7 +421,9 @@ static void svc_data_ready(struct sock *sk)
>  		dprintk("svc: socket %p(inet %p), busy=%d\n",
>  			svsk, sk,
>  			test_bit(XPT_BUSY, &svsk->sk_xprt.xpt_flags));
> -		svsk->sk_odata(sk);
> +		rmb();
> +		if (svsk->sk_odata)
> +			svsk->sk_odata(sk);
>  		if (!test_and_set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags))
>  			svc_xprt_enqueue(&svsk->sk_xprt);
>  	}
> @@ -437,7 +439,9 @@ static void svc_write_space(struct sock *sk)
>  	if (svsk) {
>  		dprintk("svc: socket %p(inet %p), write_space busy=%d\n",
>  			svsk, sk, test_bit(XPT_BUSY, &svsk->sk_xprt.xpt_flags));
> -		svsk->sk_owspace(sk);
> +		rmb();
> +		if (svsk->sk_owspace)
> +			svsk->sk_owspace(sk);
>  		svc_xprt_enqueue(&svsk->sk_xprt);
>  	}
>  }
> @@ -760,8 +764,12 @@ static void svc_tcp_listen_data_ready(struct sock *sk)
>  	dprintk("svc: socket %p TCP (listen) state change %d\n",
>  		sk, sk->sk_state);
>  
> -	if (svsk)
> -		svsk->sk_odata(sk);
> +	if (svsk) { 
> +		rmb();
> +		if (svsk->sk_odata)
> +			svsk->sk_odata(sk);
> +	}
> +
>  	/*
>  	 * This callback may called twice when a new connection
>  	 * is established as a child socket inherits everything
> @@ -794,7 +802,10 @@ static void svc_tcp_state_change(struct sock *sk)
>  	if (!svsk)
>  		printk("svc: socket %p: no user data\n", sk);
>  	else {
> -		svsk->sk_ostate(sk);
> +		rmb();
> +		if (svsk->sk_ostate)
> +			svsk->sk_ostate(sk);
> +
>  		if (sk->sk_state != TCP_ESTABLISHED) {
>  			set_bit(XPT_CLOSE, &svsk->sk_xprt.xpt_flags);
>  			svc_xprt_enqueue(&svsk->sk_xprt);
> @@ -1381,12 +1392,13 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
>  		return ERR_PTR(err);
>  	}
>  
> -	inet->sk_user_data = svsk;
>  	svsk->sk_sock = sock;
>  	svsk->sk_sk = inet;
>  	svsk->sk_ostate = inet->sk_state_change;
>  	svsk->sk_odata = inet->sk_data_ready;
>  	svsk->sk_owspace = inet->sk_write_space;
> +	wmb();
> +	inet->sk_user_data = svsk;
>  
>  	/* Initialize the socket */
>  	if (sock->type == SOCK_DGRAM)
> -- 
> 1.8.3.1

^ permalink raw reply

* [net-next v2 13/13] i40e: synchronize nvmupdate command and adminq subtask
From: Jeff Kirsher @ 2017-08-25 22:00 UTC (permalink / raw)
  To: davem
  Cc: Sudheer Mogilappagari, netdev, nhorman, sassmann, jogreene,
	Jeff Kirsher
In-Reply-To: <20170825220057.51804-1-jeffrey.t.kirsher@intel.com>

From: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>

During NVM update, state machine gets into unrecoverable state because
i40e_clean_adminq_subtask can get scheduled after the admin queue
command but before other state variables are updated. This causes
incorrect input to i40e_nvmupd_check_wait_event and state transitions
don't happen.

This issue existed before but surfaced after commit 373149fc99a0
("i40e: Decrease the scope of rtnl lock")

This fix adds locking around admin queue command and update of
state variables so that adminq_subtask will have accurate information
whenever it gets scheduled.

Signed-off-by: Sudheer Mogilappagari <sudheer.mogilappagari@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_nvm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_nvm.c b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
index 6fdecd70dcbc..2cf7db2dc7cd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_nvm.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_nvm.c
@@ -753,6 +753,11 @@ i40e_status i40e_nvmupd_command(struct i40e_hw *hw,
 		hw->nvmupd_state = I40E_NVMUPD_STATE_INIT;
 	}
 
+	/* Acquire lock to prevent race condition where adminq_task
+	 * can execute after i40e_nvmupd_nvm_read/write but before state
+	 * variables (nvm_wait_opcode, nvm_release_on_done) are updated
+	 */
+	mutex_lock(&hw->aq.arq_mutex);
 	switch (hw->nvmupd_state) {
 	case I40E_NVMUPD_STATE_INIT:
 		status = i40e_nvmupd_state_init(hw, cmd, bytes, perrno);
@@ -788,6 +793,7 @@ i40e_status i40e_nvmupd_command(struct i40e_hw *hw,
 		*perrno = -ESRCH;
 		break;
 	}
+	mutex_unlock(&hw->aq.arq_mutex);
 	return status;
 }
 
-- 
2.14.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox