Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-08-25 22:44 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <20170824234551-mutt-send-email-mst@kernel.org>

>> >> > We don't enable network watchdog on virtio but we could and maybe
>> >> > should.
>> >>
>> >> Can you elaborate?
>> >
>> > The issue is that holding onto buffers for very long times makes guests
>> > think they are stuck. This is funamentally because from guest point of
>> > view this is a NIC, so it is supposed to transmit things out in
>> > a timely manner. If host backs the virtual NIC by something that is not
>> > a NIC, with traffic shaping etc introducing unbounded latencies,
>> > guest will be confused.
>>
>> That assumes that guests are fragile in this regard. A linux guest
>> does not make such assumptions.
>
> Yes it does. Examples above:
>         > > - a single slow flow can occupy the whole ring, you will not
>         > >   be able to make any new buffers available for the fast flow

Oh, right. Though those are due to vring_desc pool exhaustion
rather than an upper bound on latency of any single packet.

Limiting the number of zerocopy packets in flight to some fraction
of the ring ensures that fast flows can always grab a slot. Running
out of ubuf_info slots reverts to copy, so indirectly does this. But
I read it correclty the zerocopy pool may be equal to or larger than
the descriptor pool. Should we refine the zcopy_used test

    (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx

to also return false if the number of outstanding ubuf_info is greater
than, say, vq->num >> 1?

^ permalink raw reply

* Re: [PATCH net-next v2 09/14] net: mvpp2: dynamic reconfiguration of the PHY mode
From: Russell King - ARM Linux @ 2017-08-25 22:46 UTC (permalink / raw)
  To: Antoine Tenart
  Cc: davem, kishon, andrew, jason, sebastian.hesselbarth,
	gregory.clement, thomas.petazzoni, nadavh, linux-kernel, mw,
	stefanc, miquel.raynal, netdev
In-Reply-To: <20170825144821.31129-10-antoine.tenart@free-electrons.com>

On Fri, Aug 25, 2017 at 04:48:16PM +0200, Antoine Tenart wrote:
> This patch adds logic to reconfigure the comphy/gop when the link status
> change at runtime. This is very useful on boards such as the mcbin which
> have SFP and Ethernet ports connected to the same MAC port: depending on
> what the user connects the driver will automatically reconfigure the
> link mode.

This commit commentry needs updating - as I've already pointed out in
the previous round, the need to reconfigure things has *nothing* to do
with there being SFP and "Ethernet" ports present.  Hence, your commit
message is entirely misleading.

> 
> Signed-off-by: Antoine Tenart <antoine.tenart@free-electrons.com>
> ---
>  drivers/net/ethernet/marvell/mvpp2.c | 21 ++++++++++++++++++++-
>  1 file changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/marvell/mvpp2.c b/drivers/net/ethernet/marvell/mvpp2.c
> index 49a6789a4142..04e0c8ab7b51 100644
> --- a/drivers/net/ethernet/marvell/mvpp2.c
> +++ b/drivers/net/ethernet/marvell/mvpp2.c
> @@ -5740,6 +5740,7 @@ static void mvpp2_link_event(struct net_device *dev)
>  {
>  	struct mvpp2_port *port = netdev_priv(dev);
>  	struct phy_device *phydev = dev->phydev;
> +	bool link_reconfigured = false;
>  
>  	if (!netif_running(dev))
>  		return;
> @@ -5750,9 +5751,27 @@ static void mvpp2_link_event(struct net_device *dev)
>  			port->duplex = phydev->duplex;
>  			port->speed  = phydev->speed;
>  		}
> +
> +		if (port->phy_interface != phydev->interface && port->comphy) {
> +	                /* disable current port for reconfiguration */
> +	                mvpp2_interrupts_disable(port);
> +	                netif_carrier_off(port->dev);
> +	                mvpp2_port_disable(port);
> +			phy_power_off(port->comphy);
> +
> +	                /* comphy reconfiguration */
> +	                port->phy_interface = phydev->interface;
> +	                mvpp22_comphy_init(port);
> +
> +	                /* gop/mac reconfiguration */
> +	                mvpp22_gop_init(port);
> +	                mvpp2_port_mii_set(port);
> +
> +	                link_reconfigured = true;
> +		}
>  	}
>  
> -	if (phydev->link != port->link) {
> +	if (phydev->link != port->link || link_reconfigured) {
>  		port->link = phydev->link;
>  
>  		if (phydev->link) {
> -- 
> 2.13.5
> 

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 8.8Mbps down 630kbps up
According to speedtest.net: 8.21Mbps down 510kbps up

^ permalink raw reply

* Re: [ethtool 1/1] ethtool: Add DMA Coalescing support
From: Stephen Hemminger @ 2017-08-25 22:57 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: linville, Paul Greenwalt, netdev, nhorman, sassmann, jogreene
In-Reply-To: <20170825223910.54989-1-jeffrey.t.kirsher@intel.com>

On Fri, 25 Aug 2017 15:39:10 -0700
Jeff Kirsher <jeffrey.t.kirsher@intel.com> wrote:

> diff --git a/ethtool-copy.h b/ethtool-copy.h
> index 06fc04c..4bb91eb 100644
> --- a/ethtool-copy.h
> +++ b/ethtool-copy.h
> @@ -400,6 +400,7 @@ struct ethtool_modinfo {
>   *	a TX interrupt, when the packet rate is above @pkt_rate_high.
>   * @rate_sample_interval: How often to do adaptive coalescing packet rate
>   *	sampling, measured in seconds.  Must not be zero.
> + * @dmac: How many usecs to store packets before moving to host memory.
>   *
>   * Each pair of (usecs, max_frames) fields specifies that interrupts
>   * should be coalesced until
> @@ -450,6 +451,7 @@ struct ethtool_coalesce {
>  	__u32	tx_coalesce_usecs_high;
>  	__u32	tx_max_coalesced_frames_high;
>  	__u32	rate_sample_interval;
> +	__u32	dmac;
>  };
>  

Because of backwards ABI compatibility, it is not safe to extend
an existing structure.

^ permalink raw reply

* Re: [PATCH net] tcp: fix refcnt leak with ebpf congestion control
From: Lawrence Brakmo @ 2017-08-25 23:08 UTC (permalink / raw)
  To: Sabrina Dubroca, netdev@vger.kernel.org; +Cc: Daniel Borkmann
In-Reply-To: <d69d1981d2c74805affc4ed6378447ea4cb06c67.1503659184.git.sd@queasysnail.net>


On 8/25/17, 4:10 AM, "Sabrina Dubroca" <sd@queasysnail.net> wrote:

    There are a few bugs around refcnt handling in the new BPF congestion
    control setsockopt:
    
     - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
       cannot get a reference on it. This would lead to a use after free,
       since that ca is going away soon.
    
     - Changing the congestion control case doesn't release the refcnt on
       the previous ca.
    
     - In the reinit case, we first leak a reference on the old ca, then we
       call tcp_reinit_congestion_control on the ca that we have just
       assigned, leading to deinitializing the wrong ca (->release of the
       new ca on the old ca's data) and releasing the refcount on the ca
       that we actually want to use.
    
    This is visible by building (for example) BIC as a module and setting
    net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
    samples/bpf.
    
    This patch fixes the refcount issues, and moves reinit back into tcp
    core to avoid passing a ca pointer back to BPF.
    
    Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
    Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
    ---
     include/net/tcp.h   |  4 +---
     net/core/filter.c   |  7 ++-----
     net/ipv4/tcp.c      |  2 +-
     net/ipv4/tcp_cong.c | 19 ++++++++++++++-----
     4 files changed, 18 insertions(+), 14 deletions(-)
    
    diff --git a/include/net/tcp.h b/include/net/tcp.h
    index ada65e767b28..f642a39f9eee 100644
    --- a/include/net/tcp.h
    +++ b/include/net/tcp.h
    @@ -1004,9 +1004,7 @@ void tcp_get_default_congestion_control(char *name);
     void tcp_get_available_congestion_control(char *buf, size_t len);
     void tcp_get_allowed_congestion_control(char *buf, size_t len);
     int tcp_set_allowed_congestion_control(char *allowed);
    -int tcp_set_congestion_control(struct sock *sk, const char *name, bool load);
    -void tcp_reinit_congestion_control(struct sock *sk,
    -				   const struct tcp_congestion_ops *ca);
    +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, bool reinit);
     u32 tcp_slow_start(struct tcp_sock *tp, u32 acked);
     void tcp_cong_avoid_ai(struct tcp_sock *tp, u32 w, u32 acked);
     
    diff --git a/net/core/filter.c b/net/core/filter.c
    index 8eb81e5fae08..169974998c76 100644
    --- a/net/core/filter.c
    +++ b/net/core/filter.c
    @@ -2836,15 +2836,12 @@ BPF_CALL_5(bpf_setsockopt, struct bpf_sock_ops_kern *, bpf_sock,
     		   sk->sk_prot->setsockopt == tcp_setsockopt) {
     		if (optname == TCP_CONGESTION) {
     			char name[TCP_CA_NAME_MAX];
    +			bool reinit = bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN;
     
     			strncpy(name, optval, min_t(long, optlen,
     						    TCP_CA_NAME_MAX-1));
     			name[TCP_CA_NAME_MAX-1] = 0;
    -			ret = tcp_set_congestion_control(sk, name, false);
    -			if (!ret && bpf_sock->op > BPF_SOCK_OPS_NEEDS_ECN)
    -				/* replacing an existing ca */
    -				tcp_reinit_congestion_control(sk,
    -					inet_csk(sk)->icsk_ca_ops);
    +			ret = tcp_set_congestion_control(sk, name, false, reinit);
     		} else {
     			struct tcp_sock *tp = tcp_sk(sk);
     
    diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
    index 71ce33decd97..a3e91b552edc 100644
    --- a/net/ipv4/tcp.c
    +++ b/net/ipv4/tcp.c
    @@ -2481,7 +2481,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
     		name[val] = 0;
     
     		lock_sock(sk);
    -		err = tcp_set_congestion_control(sk, name, true);
    +		err = tcp_set_congestion_control(sk, name, true, true);
     		release_sock(sk);
     		return err;
     	}
    diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
    index fde983f6376b..421ea1b918da 100644
    --- a/net/ipv4/tcp_cong.c
    +++ b/net/ipv4/tcp_cong.c
    @@ -189,8 +189,8 @@ void tcp_init_congestion_control(struct sock *sk)
     		INET_ECN_dontxmit(sk);
     }
     
    -void tcp_reinit_congestion_control(struct sock *sk,
    -				   const struct tcp_congestion_ops *ca)
    +static void tcp_reinit_congestion_control(struct sock *sk,
    +					  const struct tcp_congestion_ops *ca)
     {
     	struct inet_connection_sock *icsk = inet_csk(sk);
     
    @@ -338,7 +338,7 @@ int tcp_set_allowed_congestion_control(char *val)
      * tcp_reinit_congestion_control (if the current congestion control was
      * already initialized.
      */
    -int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
    +int tcp_set_congestion_control(struct sock *sk, const char *name, bool load, bool reinit)
     {
     	struct inet_connection_sock *icsk = inet_csk(sk);
     	const struct tcp_congestion_ops *ca;
    @@ -360,9 +360,18 @@ int tcp_set_congestion_control(struct sock *sk, const char *name, bool load)
     	if (!ca) {
     		err = -ENOENT;
     	} else if (!load) {
    -		icsk->icsk_ca_ops = ca;
    -		if (!try_module_get(ca->owner))
    +		const struct tcp_congestion_ops *old_ca = icsk->icsk_ca_ops;
    +
    +		if (try_module_get(ca->owner)) {
    +			if (reinit) {
    +				tcp_reinit_congestion_control(sk, ca);
    +			} else {
    +				icsk->icsk_ca_ops = ca;
    +				module_put(old_ca->owner);
    +			}
    +		} else {
     			err = -EBUSY;
    +		}
     	} else if (!((ca->flags & TCP_CONG_NON_RESTRICTED) ||
     		     ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN))) {
     		err = -EPERM;
    -- 
    2.14.1
    
Thank you for finding and fixing these issues.

Acked-by: Lawrence Brakmo <brakmo@fb.com>
 


^ permalink raw reply

* Re: [PATCH 2/2] drivers: net: xgene: Clean up all outstanding tx descriptors
From: Andrew Lunn @ 2017-08-25 23:10 UTC (permalink / raw)
  To: Iyappan Subramanian
  Cc: davem, netdev, linux-arm-kernel, dnelson, qnguyen, patches
In-Reply-To: <1503699810-12803-3-git-send-email-isubramanian@apm.com>

On Fri, Aug 25, 2017 at 03:23:30PM -0700, Iyappan Subramanian wrote:
> When xgene_enet is rmmod'd and there are still outstanding tx descriptors
> that have been setup but have not completed, it is possible on the next
> modprobe of the driver to receive the oldest of such tx descriptors. This
> results in a kernel NULL pointer dereference.
> 
> This patch attempts to clean up (by tearing down) all outstanding tx
> descriptors when the xgene_enet driver is being rmmod'd.
> 
> Given that, on the next modprobe it should be safe to ignore any such tx
> descriptors received that map to a NULL skb pointer.

This does not sound correct. Before the module is allowed to be
removed, everything needs to be finished. You need to wait for all the
tx descriptors to be returned before unloading. How can you free the
memory for the descriptor if it is still in use? How can you free the
skbuf the descriptor points to, if it is still in use... 

      Andrew

^ permalink raw reply

* Re: UDP sockets oddities
From: Florian Fainelli @ 2017-08-25 23:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <4adb4b66-590c-c55a-44aa-27dd409ce14f@gmail.com>

On 08/23/2017 07:23 PM, Florian Fainelli wrote:
> On 08/23/2017 05:43 PM, Eric Dumazet wrote:
>> On Wed, 2017-08-23 at 17:03 -0700, Florian Fainelli wrote:
>>
>>> Attached is the perf report --stdio of:
>>>
>>> # perf record -a -g -e skb:kfree_skb iperf -c 192.168.1.23 -b 800M -t 60
>>> WARNING: option -b implies udp testing
>>> ------------------------------------------------------------
>>> Client connecting to 192.168.1.23, UDP port 5001
>>> Sending 1470 byte datagrams
>>> UDP buffer size:  160 KByte (default)
>>> ------------------------------------------------------------
>>> [  4] local 192.168.1.66 port 36169 connected with 192.168.1.23 port 5001
>>> [ ID] Interval       Transfer     Bandwidth
>>> [  4]  0.0-60.0 sec   685 MBytes  95.8 Mbits/sec
>>> [  4] Sent 488633 datagrams
>>> [  4] Server Report:
>>> [  4]  0.0-74.4 sec   685 MBytes  77.2 Mbits/sec   0.187 ms  300/488632
>>> (0.061%)
>>> [  4]  0.0-74.4 sec  58 datagrams received out-of-order
>>>
>>> It measured 488644 events which is greater than the number of packets
>>> transmitted by iperf but I only count 8 non-IP packets being sent
>>> (protocl=2054 -> ETH_P_ARP) so I am not sure what the other 4 were and
>>> why there are not accounted for.
>>>
>>> Almost all events came from net_tx_action() except 2 that came from
>>> net/core/neighbour.c::neigh_probe() and 65 that came from
>>> arch/arm/include/asm/irqflags.h::arch_local_irq_save() ?!?
>>
>> Oh you have too much noise and need this fix :
>>
>> diff --git a/drivers/net/ethernet/broadcom/bcmsysport.c b/drivers/net/ethernet/broadcom/bcmsysport.c
>> index dc3052751bc1..fcfa8d991850 100644
>> --- a/drivers/net/ethernet/broadcom/bcmsysport.c
>> +++ b/drivers/net/ethernet/broadcom/bcmsysport.c
>> @@ -597,7 +597,7 @@ static int bcm_sysport_set_coalesce(struct net_device *dev,
>>  
>>  static void bcm_sysport_free_cb(struct bcm_sysport_cb *cb)
>>  {
>> -	dev_kfree_skb_any(cb->skb);
>> +	dev_consume_skb_any(cb->skb);
>>  	cb->skb = NULL;
>>  	dma_unmap_addr_set(cb, dma_addr, 0);
>>  }
>>
> 
> Yay, now I am getting some sensible information, thanks!
> 
> iperf reports 143 packets lost, and perf report gets me 146 entries for
> kfree_skb() looking like that:
> #
>      2.74%     2.74%  skbaddr=0xee5f30c0 protocol=2048 location=0xc0855cdc
>             |
>             ---0x2d
>                write
>                0xe9fc2368
>                kfree_skb
> 
> 
> What is annoying is that 0xc0855cdc is resolved to
> arch/arm/include/asm/irqflags.h::arch_local_irq_save() which does not
> really help telling me where exactly in the stack the drop is happening,
> although now I know it is somewhere down the path from write(2)... of
> course it is :)
> 
> Now what is disturbing too is that iperf does  have its write() system
> call get an error returned, write happily returned the number of bytes
> written even though the perf trace indicates there was packet drops down
> the road..

Because perf was not able to track down exactly where the location was
with builtin_return_address(0) and always pointed to
arch_local_irq_save(), I added a WARN() in kfree_skb(), and of course
this made it impossible to see packet loss under the same conditions
anymore. After increasing dramatically wmem_{default,max} I could start
seeing significant drop made largely worse by the printks, so not very
conclusive either.

I took another approach and recorded net_dev_queue() events, and I can
see only 976704 (gphy + eth0 devices) / 2 = 488352 dev_queue_events()
were recorded. The total sent by iperf was 488898 and we lost 558
packets according to the server 488352 + 558 > 488898 which makes sense
because neigh_probe() and others still run. This still confirms there is
a drop occurring between UDP socket sendmsg and the SKB enqueuing...

So my take away is that the CPU has a bursty write(2) behavior, and the
larger the socket write buffer size, the more datagrams you can burst
into and the more of these SKBs can be flat out being dropped when the
transmit queue is congested. Adding a WARN() is enough of a slow down
for the CPU that we are not able to reliably reproduce this burst.

Eric, are there areas of the stack where we are allowed to drop packets,
not propagate that back to write(2) and also not increment any counter
either, or maybe I am not looking where I should...

Thanks!
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next v6 2/3] net: gso: Add GSO support for NSH
From: Jiri Benc @ 2017-08-25 23:22 UTC (permalink / raw)
  To: Yi Yang; +Cc: netdev, dev, e, blp, jan.scheurich
In-Reply-To: <20170825182514.6ff6c36b@griffin>

On Fri, 25 Aug 2017 18:25:14 +0200, Jiri Benc wrote:
> While looking at this, I realized that GSO for VXLAN-GPE is broken,
> too. Let me fix it by implementing what I described above which will
> make your patch much easier.

Okay, it's not really broken and we don't need that complexity. At
least not immediately. Hw offloading in the VXLAN-GPE case probably does
not work correctly and would benefit from that change but that's a
different beast to tackle at a different time. Software segmentation
works fine for VXLAN-GPE.

There should not be much problems with NSH segmentation, either, if we
carefully store and set mac_header, mac_len and skb->protocol around
calls to skb_mac_gso_segment. Note that with zero mac_len (and correct
skb->protocol), skb_mac_gso_segment behaves in the same way that you
tried to achieve with find_gso_segment_by_type, which is thus completely
unnecessary.

More on Monday.

 Jiri

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Michael S. Tsirkin @ 2017-08-25 23:32 UTC (permalink / raw)
  To: Willem de Bruijn
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <CAF=yD-+1wheMmC+HFKe_B_ULO0Mmh6HMSEbYY5D-HgqxVJee6A@mail.gmail.com>

On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
> >> >> > We don't enable network watchdog on virtio but we could and maybe
> >> >> > should.
> >> >>
> >> >> Can you elaborate?
> >> >
> >> > The issue is that holding onto buffers for very long times makes guests
> >> > think they are stuck. This is funamentally because from guest point of
> >> > view this is a NIC, so it is supposed to transmit things out in
> >> > a timely manner. If host backs the virtual NIC by something that is not
> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
> >> > guest will be confused.
> >>
> >> That assumes that guests are fragile in this regard. A linux guest
> >> does not make such assumptions.
> >
> > Yes it does. Examples above:
> >         > > - a single slow flow can occupy the whole ring, you will not
> >         > >   be able to make any new buffers available for the fast flow
> 
> Oh, right. Though those are due to vring_desc pool exhaustion
> rather than an upper bound on latency of any single packet.
> 
> Limiting the number of zerocopy packets in flight to some fraction
> of the ring ensures that fast flows can always grab a slot.
> Running
> out of ubuf_info slots reverts to copy, so indirectly does this. But
> I read it correclty the zerocopy pool may be equal to or larger than
> the descriptor pool. Should we refine the zcopy_used test
> 
>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
> 
> to also return false if the number of outstanding ubuf_info is greater
> than, say, vq->num >> 1?


We'll need to think about where to put the threshold, but I think it's
a good idea.

Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
resources.

In a sense it still means once you run out of slots zcopt gets disabled possibly permanently.

Need to experiment with some numbers.

-- 
MST

^ permalink raw reply

* Re: UDP sockets oddities
From: Eric Dumazet @ 2017-08-25 23:57 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <3c44d27e-27d9-e9c4-04b3-17c0366e60d9@gmail.com>

On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:

> Eric, are there areas of the stack where we are allowed to drop packets,
> not propagate that back to write(2) and also not increment any counter
> either, or maybe I am not looking where I should...

What happens if you increase these sysctls ?

grep .  `find /proc/sys|grep unres_qlen`


unres_qlen_bytes -> 2000000
unres_qlen -> 10000

^ permalink raw reply

* Re: [PATCH net] ipv6: Fix may be used uninitialized warning in rt6_check
From: David Miller @ 2017-08-26  0:04 UTC (permalink / raw)
  To: sbrivio; +Cc: steffen.klassert, weiwan, edumazet, kafai, netdev
In-Reply-To: <20170825110206.36e4a7a7@elisabeth>

From: Stefano Brivio <sbrivio@redhat.com>
Date: Fri, 25 Aug 2017 11:02:06 +0200

> On Fri, 25 Aug 2017 09:52:17 +0200
> Steffen Klassert <steffen.klassert@secunet.com> wrote:
> 
>> On Fri, Aug 25, 2017 at 09:05:42AM +0200, Steffen Klassert wrote:
>> > rt_cookie might be used uninitialized, fix this by
>> > initializing it.
>> > 
>> > Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
>> > Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
>> > ---
>> >  net/ipv6/route.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> > 
>> > diff --git a/net/ipv6/route.c b/net/ipv6/route.c
>> > index a9d3564..48c8c92 100644
>> > --- a/net/ipv6/route.c
>> > +++ b/net/ipv6/route.c
>> > @@ -1289,7 +1289,7 @@ static void rt6_dst_from_metrics_check(struct rt6_info *rt)
>> >  
>> >  static struct dst_entry *rt6_check(struct rt6_info *rt, u32 cookie)
>> >  {
>> > -	u32 rt_cookie;
>> > +	u32 rt_cookie = 0;
>> >  
>> >  	if (!rt6_get_cookie_safe(rt, &rt_cookie) || rt_cookie != cookie)
>> >  		return NULL;  
>> 
>> The compiler warning seems to be a false positive, as
>> rt_cookie != cookie is only checked if rt6_get_cookie_safe
>> returns true in which case rt_cookie is initialized.
>> 
>> Please disregard this patch.
> 
> ...or not? I was thinking of sending a similar patch with
> uninitialized_var(rt_cookie), but it seems we have similar cases
> where we just initialize to zero instead.
> 
> I wonder which approach is considered the most acceptable nowadays. I
> would be in favour of uninitialized_var() as it doesn't change the
> binary output, but https://lwn.net/Articles/529954/ also contains some
> valid criticism. Ideas?

Generally speaking I guess initializing to zero is Ok to do.

As far as which approach is better, I don't have any strong opinion.

So I will probably just apply Steffen's patch.

^ permalink raw reply

* Re: [PATCH net] ipv6: fix sparse warning on rt6i_node
From: Martin KaFai Lau @ 2017-08-26  0:06 UTC (permalink / raw)
  To: Wei Wang; +Cc: David Miller, netdev, Eric Dumazet
In-Reply-To: <20170825220310.24863-1-tracywwnj@gmail.com>

On Fri, Aug 25, 2017 at 03:03:10PM -0700, Wei Wang wrote:
> From: Wei Wang <weiwan@google.com>
>
> Commit c5cff8561d2d adds rcu grace period before freeing fib6_node. This
> generates a new sparse warning on rt->rt6i_node related code:
>   net/ipv6/route.c:1394:30: error: incompatible types in comparison
>   expression (different address spaces)
>   ./include/net/ip6_fib.h:187:14: error: incompatible types in comparison
>   expression (different address spaces)
>
> This commit adds "__rcu" tag for rt6i_node and makes sure corresponding
> rcu API is used for it.
> After this fix, sparse no longer generates the above warning.
>
> Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
> Signed-off-by: Wei Wang <weiwan@google.com>
> Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>

> ---
>  include/net/ip6_fib.h |  2 +-
>  net/ipv6/addrconf.c   |  2 +-
>  net/ipv6/ip6_fib.c    | 11 +++++++----
>  net/ipv6/route.c      |  3 ++-
>  4 files changed, 11 insertions(+), 7 deletions(-)
>
> diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
> index e9c59db92942..af509f801084 100644
> --- a/include/net/ip6_fib.h
> +++ b/include/net/ip6_fib.h
> @@ -105,7 +105,7 @@ struct rt6_info {
>  	 * the same cache line.
>  	 */
>  	struct fib6_table		*rt6i_table;
> -	struct fib6_node		*rt6i_node;
> +	struct fib6_node __rcu		*rt6i_node;
>
>  	struct in6_addr			rt6i_gateway;
>
> diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
> index 3c46e9513a31..936e9ab4dda5 100644
> --- a/net/ipv6/addrconf.c
> +++ b/net/ipv6/addrconf.c
> @@ -5556,7 +5556,7 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
>  		 * our DAD process, so we don't need
>  		 * to do it again
>  		 */
> -		if (!(ifp->rt->rt6i_node))
> +		if (!rcu_access_pointer(ifp->rt->rt6i_node))
>  			ip6_ins_rt(ifp->rt);
>  		if (ifp->idev->cnf.forwarding)
>  			addrconf_join_anycast(ifp);
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index a5ebf86f6be8..10b4b1f8b838 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -889,7 +889,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>
>  		rt->dst.rt6_next = iter;
>  		*ins = rt;
> -		rt->rt6i_node = fn;
> +		rcu_assign_pointer(rt->rt6i_node, fn);
>  		atomic_inc(&rt->rt6i_ref);
>  		if (!info->skip_notify)
>  			inet6_rt_notify(RTM_NEWROUTE, rt, info, nlflags);
> @@ -915,7 +915,7 @@ static int fib6_add_rt2node(struct fib6_node *fn, struct rt6_info *rt,
>  			return err;
>
>  		*ins = rt;
> -		rt->rt6i_node = fn;
> +		rcu_assign_pointer(rt->rt6i_node, fn);
>  		rt->dst.rt6_next = iter->dst.rt6_next;
>  		atomic_inc(&rt->rt6i_ref);
>  		if (!info->skip_notify)
> @@ -1480,8 +1480,9 @@ static void fib6_del_route(struct fib6_node *fn, struct rt6_info **rtp,
>
>  int fib6_del(struct rt6_info *rt, struct nl_info *info)
>  {
> +	struct fib6_node *fn = rcu_dereference_protected(rt->rt6i_node,
> +				    lockdep_is_held(&rt->rt6i_table->tb6_lock));
>  	struct net *net = info->nl_net;
> -	struct fib6_node *fn = rt->rt6i_node;
>  	struct rt6_info **rtp;
>
>  #if RT6_DEBUG >= 2
> @@ -1670,7 +1671,9 @@ static int fib6_clean_node(struct fib6_walker *w)
>  			if (res) {
>  #if RT6_DEBUG >= 2
>  				pr_debug("%s: del failed: rt=%p@%p err=%d\n",
> -					 __func__, rt, rt->rt6i_node, res);
> +					 __func__, rt,
> +					 rcu_access_pointer(rt->rt6i_node),
> +					 res);
>  #endif
>  				continue;
>  			}
> diff --git a/net/ipv6/route.c b/net/ipv6/route.c
> index a9d3564caf49..33629f2a0f9d 100644
> --- a/net/ipv6/route.c
> +++ b/net/ipv6/route.c
> @@ -1383,7 +1383,8 @@ static void rt6_do_update_pmtu(struct rt6_info *rt, u32 mtu)
>  static bool rt6_cache_allowed_for_pmtu(const struct rt6_info *rt)
>  {
>  	return !(rt->rt6i_flags & RTF_CACHE) &&
> -		(rt->rt6i_flags & RTF_PCPU || rt->rt6i_node);
> +		(rt->rt6i_flags & RTF_PCPU ||
> +		 rcu_access_pointer(rt->rt6i_node));
>  }
>
>  static void __ip6_rt_update_pmtu(struct dst_entry *dst, const struct sock *sk,
> --
> 2.14.1.342.g6490525c54-goog
>

^ permalink raw reply

* Re: [PATCH net] ipv6: Fix may be used uninitialized warning in rt6_check
From: David Miller @ 2017-08-26  0:08 UTC (permalink / raw)
  To: steffen.klassert; +Cc: weiwan, edumazet, kafai, netdev
In-Reply-To: <20170825070542.GV31224@secunet.com>

From: Steffen Klassert <steffen.klassert@secunet.com>
Date: Fri, 25 Aug 2017 09:05:42 +0200

> rt_cookie might be used uninitialized, fix this by
> initializing it.
> 
> Fixes: c5cff8561d2d ("ipv6: add rcu grace period before freeing fib6_node")
> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next v2 0/5] net: updates for IPv6 Segment Routing
From: David Miller @ 2017-08-26  0:10 UTC (permalink / raw)
  To: david.lebrun; +Cc: netdev
In-Reply-To: <20170825075648.5061-1-david.lebrun@uclouvain.be>

From: David Lebrun <david.lebrun@uclouvain.be>
Date: Fri, 25 Aug 2017 09:56:43 +0200

> v2: seg6_lwt_headroom() is not relevant for lwtunnel_input_redirect()
>     use cases, and L2ENCAP only uses this redirection. Fix incoherence
>     between arbitrary MAC header size support and fixed headroom
>     computation by setting only LWTUNNEL_STATE_INPUT_REDIRECT for L2ENCAP
>     mode.
> 
> This patch series provides several updates for the SRv6 implementation. The
> first patch leverages the existing infrastructure to support encapsulation
> of IPv4 packets. The second patch implements the T.Encaps.L2 SR function,
> enabling to encapsulate an L2 Ethernet frame within an IPv6+SRH packet.
> The last three patches update the seg6local lightweight tunnel, and mainly
> implement four new actions: End.T, End.DX2, End.DX4 and End.DT6.

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH] hinic: skb_pad() frees on error
From: David Miller @ 2017-08-26  0:14 UTC (permalink / raw)
  To: dan.carpenter; +Cc: aviad.krawczyk, netdev, kernel-janitors
In-Reply-To: <20170825082428.hpnbs4i74bubm4cz@mwanda>

From: Dan Carpenter <dan.carpenter@oracle.com>
Date: Fri, 25 Aug 2017 11:24:28 +0300

> The skb_pad() function frees the skb on error, so this code has a double
> free.
> 
> Fixes: 00e57a6d4ad3 ("net-next/hinic: Add Tx operation")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH net] tcp: fix refcnt leak with ebpf congestion control
From: David Miller @ 2017-08-26  0:16 UTC (permalink / raw)
  To: sd; +Cc: netdev, brakmo, daniel
In-Reply-To: <d69d1981d2c74805affc4ed6378447ea4cb06c67.1503659184.git.sd@queasysnail.net>

From: Sabrina Dubroca <sd@queasysnail.net>
Date: Fri, 25 Aug 2017 13:10:12 +0200

> There are a few bugs around refcnt handling in the new BPF congestion
> control setsockopt:
> 
>  - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
>    cannot get a reference on it. This would lead to a use after free,
>    since that ca is going away soon.
> 
>  - Changing the congestion control case doesn't release the refcnt on
>    the previous ca.
> 
>  - In the reinit case, we first leak a reference on the old ca, then we
>    call tcp_reinit_congestion_control on the ca that we have just
>    assigned, leading to deinitializing the wrong ca (->release of the
>    new ca on the old ca's data) and releasing the refcount on the ca
>    that we actually want to use.
> 
> This is visible by building (for example) BIC as a module and setting
> net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
> samples/bpf.
> 
> This patch fixes the refcount issues, and moves reinit back into tcp
> core to avoid passing a ca pointer back to BPF.
> 
> Fixes: 91b5b21c7c16 ("bpf: Add support for changing congestion control")
> Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>

Applied, thank you.

^ permalink raw reply

* Re: [Patch net-next v2 0/4] net_sched: clean up tc classes and u32 filter
From: David Miller @ 2017-08-26  0:20 UTC (permalink / raw)
  To: xiyou.wangcong; +Cc: netdev, jhs
In-Reply-To: <20170824235130.28503-1-xiyou.wangcong@gmail.com>

From: Cong Wang <xiyou.wangcong@gmail.com>
Date: Thu, 24 Aug 2017 16:51:26 -0700

> Patch 1 and patch 2 prepare for patch 3. Major changes
> are in patch 3 and patch 4, details are there too.
> 
> Cong Wang (4):
>   net_sched: get rid of more forward declarations
>   net_sched: introduce tclass_del_notify()
>   net_sched: remove tc class reference counting
>   net_sched: kill u32_node pointer in Qdisc
> 
> ---
> v2: Add patch 1 and 2, group all into a patchset
>     Fix a coding style issue in patch 4

Series applied, thanks.

^ permalink raw reply

* Re: [PATCH net-next] tcp: fix hang in tcp_sendpage_locked()
From: David Miller @ 2017-08-26  0:22 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tom, dvyukov
In-Reply-To: <1503667625.18816.9.camel@edumazet-glaptop3.roam.corp.google.com>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 25 Aug 2017 06:27:05 -0700

> From: Eric Dumazet <edumazet@google.com>
> 
> syszkaller got a hang in tcp stack, related to a bug in
> tcp_sendpage_locked()
> 
> root@syzkaller:~# cat /proc/3059/stack
> [<ffffffff83de926c>] __lock_sock+0x1dc/0x2f0
> [<ffffffff83de9473>] lock_sock_nested+0xf3/0x110
> [<ffffffff8408ce01>] tcp_sendmsg+0x21/0x50
> [<ffffffff84163b6f>] inet_sendmsg+0x11f/0x5e0
> [<ffffffff83dd8eea>] sock_sendmsg+0xca/0x110
> [<ffffffff83dd9547>] kernel_sendmsg+0x47/0x60
> [<ffffffff83de35dc>] sock_no_sendpage+0x1cc/0x280
> [<ffffffff8408916b>] tcp_sendpage_locked+0x10b/0x160
> [<ffffffff84089203>] tcp_sendpage+0x43/0x60
> [<ffffffff841641da>] inet_sendpage+0x1aa/0x660
> [<ffffffff83dd4fcd>] kernel_sendpage+0x8d/0xe0
> [<ffffffff83dd50ac>] sock_sendpage+0x8c/0xc0
> [<ffffffff81b63300>] pipe_to_sendpage+0x290/0x3b0
> [<ffffffff81b67243>] __splice_from_pipe+0x343/0x750
> [<ffffffff81b6a459>] splice_from_pipe+0x1e9/0x330
> [<ffffffff81b6a5e0>] generic_splice_sendpage+0x40/0x50
> [<ffffffff81b6b1d7>] SyS_splice+0x7b7/0x1610
> [<ffffffff84d77a01>] entry_SYSCALL_64_fastpath+0x1f/0xbe
> 
> Fixes: 306b13eb3cf9 ("proto_ops: Add locked held versions of sendmsg and sendpage")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Reported-by: Dmitry Vyukov <dvyukov@google.com>

APplied, thanks Eric.

^ permalink raw reply

* Re: Permissions for eBPF objects
From: Alexei Starovoitov @ 2017-08-26  1:03 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Chenbo Feng, Jeffrey Vander Stoep, Stephen Smalley, netdev,
	SELinux, mic
In-Reply-To: <59A0837F.9090301@iogearbox.net>

On Fri, Aug 25, 2017 at 10:07:27PM +0200, Daniel Borkmann wrote:
> On 08/25/2017 09:52 PM, Chenbo Feng wrote:
> > On Fri, Aug 25, 2017 at 12:45 PM, Jeffrey Vander Stoep <jeffv@google.com> wrote:
> > > On Fri, Aug 25, 2017 at 12:26 PM, Stephen Smalley <sds@tycho.nsa.gov> wrote:
> > > > On Fri, 2017-08-25 at 11:01 -0700, Jeffrey Vander Stoep via Selinux
> > > > wrote:
> > > > > I’d like to get your thoughts on adding LSM permission checks on BPF
> > > > > objects.

before reinventing the wheel please take a look at landlock work.
Everything that was discussed in this thread is covered by it.
The patches have been in development for more than a year and most of the early
issues have been resolved.
It will be presented again during security summit in LA in September.

^ permalink raw reply

* Re: [PATCH net-next] virtio-net: invoke zerocopy callback on xmit path if no tx napi
From: Willem de Bruijn @ 2017-08-26  1:03 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Koichiro Den, Jason Wang, virtualization, Network Development
In-Reply-To: <20170826022744-mutt-send-email-mst@kernel.org>

On Fri, Aug 25, 2017 at 7:32 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Fri, Aug 25, 2017 at 06:44:36PM -0400, Willem de Bruijn wrote:
>> >> >> > We don't enable network watchdog on virtio but we could and maybe
>> >> >> > should.
>> >> >>
>> >> >> Can you elaborate?
>> >> >
>> >> > The issue is that holding onto buffers for very long times makes guests
>> >> > think they are stuck. This is funamentally because from guest point of
>> >> > view this is a NIC, so it is supposed to transmit things out in
>> >> > a timely manner. If host backs the virtual NIC by something that is not
>> >> > a NIC, with traffic shaping etc introducing unbounded latencies,
>> >> > guest will be confused.
>> >>
>> >> That assumes that guests are fragile in this regard. A linux guest
>> >> does not make such assumptions.
>> >
>> > Yes it does. Examples above:
>> >         > > - a single slow flow can occupy the whole ring, you will not
>> >         > >   be able to make any new buffers available for the fast flow
>>
>> Oh, right. Though those are due to vring_desc pool exhaustion
>> rather than an upper bound on latency of any single packet.
>>
>> Limiting the number of zerocopy packets in flight to some fraction
>> of the ring ensures that fast flows can always grab a slot.
>> Running
>> out of ubuf_info slots reverts to copy, so indirectly does this. But
>> I read it correclty the zerocopy pool may be equal to or larger than
>> the descriptor pool. Should we refine the zcopy_used test
>>
>>     (nvq->upend_idx + 1) % UIO_MAXIOV != nvq->done_idx
>>
>> to also return false if the number of outstanding ubuf_info is greater
>> than, say, vq->num >> 1?
>
>
> We'll need to think about where to put the threshold, but I think it's
> a good idea.
>
> Maybe even a fixed number, e.g. max(vq->num >> 1, X) to limit host
> resources.
>
> In a sense it still means once you run out of slots zcopt gets disabled possibly permanently.
>
> Need to experiment with some numbers.

I can take a stab with two flows, one delayed in a deep host qdisc
queue. See how this change affects the other flow and also how
sensitive that is to the chosen threshold value.

^ permalink raw reply

* Re: [PATCH net-next] bpf: fix oops on allocation failure
From: Alexei Starovoitov @ 2017-08-26  1:05 UTC (permalink / raw)
  To: Dan Carpenter
  Cc: Alexei Starovoitov, John Fastabend, Daniel Borkmann, netdev,
	kernel-janitors
In-Reply-To: <20170825202714.64ivixeindjph3z6@mwanda>

On Fri, Aug 25, 2017 at 11:27:14PM +0300, Dan Carpenter wrote:
> "err" is set to zero if bpf_map_area_alloc() fails so it means we return
> ERR_PTR(0) which is NULL.  The caller, find_and_alloc_map(), is not
> expecting NULL returns and will oops.
> 
> Fixes: 174a79ff9515 ("bpf: sockmap with sk redirect support")
> Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>

good catch. Thanks!
Acked-by: Alexei Starovoitov <ast@kernel.org>


^ permalink raw reply

* Re: [PATCH net-next v7 01/10] selftest: Enhance kselftest_harness.h with a step mechanism
From: Alexei Starovoitov @ 2017-08-26  1:07 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Shuah Khan, linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Tejun Heo, Thomas Graf <tgr
In-Reply-To: <0e15da13-fad0-ba01-053c-1b4853e2bd6f@digikod.net>

On Fri, Aug 25, 2017 at 09:58:33AM +0200, Mickaël Salaün wrote:
> 
> 
> On 24/08/2017 04:31, Alexei Starovoitov wrote:
> > On Mon, Aug 21, 2017 at 02:09:24AM +0200, Mickaël Salaün wrote:
> >> This step mechanism may be useful to return an information about the
> >> error without being able to write to TH_LOG_STREAM.
> >>
> >> Set _metadata->no_print to true to print this counter.
> >>
> >> Signed-off-by: Mickaël Salaün <mic@digikod.net>
> >> Cc: Andy Lutomirski <luto@amacapital.net>
> >> Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
> >> Cc: Kees Cook <keescook@chromium.org>
> >> Cc: Shuah Khan <shuah@kernel.org>
> >> Cc: Will Drewry <wad@chromium.org>
> >> Link: https://lkml.kernel.org/r/CAGXu5j+D-FP8Kt9unNOqKrQJP4DYTpmgkJxWykZyrYiVPz3Y3Q@mail.gmail.com
> >> ---
> >>
> >> This patch is intended to the kselftest tree:
> >> https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net
> >>
> >> Changes since v6:
> >> * add the step counter in assert/expect macros and use _metadata to
> >>   enable the counter (suggested by Kees Cook)
> >> ---
> >>  tools/testing/selftests/kselftest_harness.h   | 31 ++++++++++++++++++++++-----
> >>  tools/testing/selftests/seccomp/seccomp_bpf.c |  2 +-
> >>  2 files changed, 27 insertions(+), 6 deletions(-)
> > 
> > is there a dependency on this in patches 2+ ?
> > if not, I would send this patch to selftests right away.
> > 
> > 
> 
> The Landlock tests [patch 9/10] rely on it for now.
> 
> I sent it three weeks ago:
> https://lkml.kernel.org/r/20170806232337.4191-1-mic@digikod.net
> 
> Anyway, until this patch is merged in the kselftest tree and then
> available to net-next, I'll have to keep it here.

Shuah,
could you please pick up this patch into your tree?

^ permalink raw reply

* [PATCH net-next v3] e1000e: Be drop monitor friendly
From: Florian Fainelli @ 2017-08-26  1:14 UTC (permalink / raw)
  To: netdev
  Cc: edumazet, davem, Florian Fainelli, Jeff Kirsher,
	moderated list:INTEL ETHERNET DRIVERS, open list

e1000e_put_txbuf() can be called from normal reclamation path as well as
when a DMA mapping failure, so we need to differentiate these two cases
when freeing SKBs to be drop monitor friendly. e1000e_tx_hwtstamp_work()
and e1000_remove() are processing TX timestamped SKBs and those should
not be accounted as drops either.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
---
Changes in v3:

- differentiate normal reclamation from TX DMA fragment mapping errors
- removed a few invalid dev_kfree_skb() replacements (those are already
  drop monitor friendly)

Changes in v2:

- make it compile

 drivers/net/ethernet/intel/e1000e/netdev.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 327dfe5bedc0..cfd21858c095 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1071,7 +1071,8 @@ static bool e1000_clean_rx_irq(struct e1000_ring *rx_ring, int *work_done,
 }
 
 static void e1000_put_txbuf(struct e1000_ring *tx_ring,
-			    struct e1000_buffer *buffer_info)
+			    struct e1000_buffer *buffer_info,
+			    bool drop)
 {
 	struct e1000_adapter *adapter = tx_ring->adapter;
 
@@ -1085,7 +1086,10 @@ static void e1000_put_txbuf(struct e1000_ring *tx_ring,
 		buffer_info->dma = 0;
 	}
 	if (buffer_info->skb) {
-		dev_kfree_skb_any(buffer_info->skb);
+		if (drop)
+			dev_kfree_skb_any(buffer_info->skb);
+		else
+			dev_consume_skb_any(buffer_info->skb);
 		buffer_info->skb = NULL;
 	}
 	buffer_info->time_stamp = 0;
@@ -1199,7 +1203,7 @@ static void e1000e_tx_hwtstamp_work(struct work_struct *work)
 		wmb(); /* force write prior to skb_tstamp_tx */
 
 		skb_tstamp_tx(skb, &shhwtstamps);
-		dev_kfree_skb_any(skb);
+		dev_consume_skb_any(skb);
 	} else if (time_after(jiffies, adapter->tx_hwtstamp_start
 			      + adapter->tx_timeout_factor * HZ)) {
 		dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
@@ -1254,7 +1258,7 @@ static bool e1000_clean_tx_irq(struct e1000_ring *tx_ring)
 				}
 			}
 
-			e1000_put_txbuf(tx_ring, buffer_info);
+			e1000_put_txbuf(tx_ring, buffer_info, false);
 			tx_desc->upper.data = 0;
 
 			i++;
@@ -2421,7 +2425,7 @@ static void e1000_clean_tx_ring(struct e1000_ring *tx_ring)
 
 	for (i = 0; i < tx_ring->count; i++) {
 		buffer_info = &tx_ring->buffer_info[i];
-		e1000_put_txbuf(tx_ring, buffer_info);
+		e1000_put_txbuf(tx_ring, buffer_info, false);
 	}
 
 	netdev_reset_queue(adapter->netdev);
@@ -5614,7 +5618,7 @@ static int e1000_tx_map(struct e1000_ring *tx_ring, struct sk_buff *skb,
 			i += tx_ring->count;
 		i--;
 		buffer_info = &tx_ring->buffer_info[i];
-		e1000_put_txbuf(tx_ring, buffer_info);
+		e1000_put_txbuf(tx_ring, buffer_info, true);
 	}
 
 	return 0;
@@ -7411,7 +7415,7 @@ static void e1000_remove(struct pci_dev *pdev)
 	if (adapter->flags & FLAG_HAS_HW_TIMESTAMP) {
 		cancel_work_sync(&adapter->tx_hwtstamp_work);
 		if (adapter->tx_hwtstamp_skb) {
-			dev_kfree_skb_any(adapter->tx_hwtstamp_skb);
+			dev_consume_skb_any(adapter->tx_hwtstamp_skb);
 			adapter->tx_hwtstamp_skb = NULL;
 		}
 	}
-- 
2.9.3

^ permalink raw reply related

* Re: [PATCH net-next] selftests/bpf: check the instruction dumps are populated
From: Daniel Borkmann @ 2017-08-26  1:16 UTC (permalink / raw)
  To: Jakub Kicinski, netdev; +Cc: kafai, oss-drivers
In-Reply-To: <20170825213957.4768-1-jakub.kicinski@netronome.com>

On 08/25/2017 11:39 PM, Jakub Kicinski wrote:
> Add a basic test for checking whether kernel is populating
> the jited and xlated BPF images.  It was used to confirm
> the behaviour change from commit d777b2ddbecf ("bpf: don't
> zero out the info struct in bpf_obj_get_info_by_fd()"),
> which made bpf_obj_get_info_by_fd() usable for retrieving
> the image dumps.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
[...]
> @@ -328,15 +331,20 @@ static void test_bpf_obj_id(void)
>   			  prog_infos[i].type != BPF_PROG_TYPE_SOCKET_FILTER ||
>   			  info_len != sizeof(struct bpf_prog_info) ||
>   			  (jit_enabled && !prog_infos[i].jited_prog_len) ||
> -			  !prog_infos[i].xlated_prog_len,
> +			  (jit_enabled &&
> +			   !memcmp(jited_insns, zeros, sizeof(zeros))) ||
> +			  !prog_infos[i].xlated_prog_len ||
> +			  !memcmp(xlated_insns, zeros, sizeof(zeros)),

There could still be the case where a JIT could bail out
for some reason and punt to the interpreter instead, but
I'm fine assuming for the specific test cases we have it
has to succeed, and if not JIT misses features or has other
issues. ;) Thus:

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

Thanks!

^ permalink raw reply

* Re: [PATCH net-next v7 05/10] landlock: Add LSM hooks related to filesystem
From: Alexei Starovoitov @ 2017-08-26  1:16 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: linux-kernel, Alexei Starovoitov, Andy Lutomirski,
	Arnaldo Carvalho de Melo, Casey Schaufler, Daniel Borkmann,
	David Drysdale, David S . Miller, Eric W . Biederman,
	James Morris, Jann Horn, Jonathan Corbet, Matthew Garrett,
	Michael Kerrisk, Kees Cook, Paul Moore, Sargun Dhillon,
	Serge E . Hallyn, Shuah Khan, Tejun Heo, Thomas Graf <tgr
In-Reply-To: <22d09137-7212-5803-af64-0964fad875c7@digikod.net>

On Fri, Aug 25, 2017 at 10:16:39AM +0200, Mickaël Salaün wrote:
> > 
> >> +/* WRAP_ARG_SB */
> >> +#define WRAP_ARG_SB_TYPE	WRAP_TYPE_FS
> >> +#define WRAP_ARG_SB_DEC(arg)					\
> >> +	EXPAND_C(WRAP_TYPE_FS) wrap_##arg =			\
> >> +	{ .type = BPF_HANDLE_FS_TYPE_DENTRY, .dentry = arg->s_root };
> >> +#define WRAP_ARG_SB_VAL(arg)	((uintptr_t)&wrap_##arg)
> >> +#define WRAP_ARG_SB_OK(arg)	(arg && arg->s_root)
> > ...
> > 
> >> +HOOK_NEW_FS(sb_remount, 2,
> >> +	struct super_block *, sb,
> >> +	void *, data,
> >> +	WRAP_ARG_SB, sb,
> >> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
> >> +);
> > 
> > this looks wrong. casting super_block to dentry?
> 
> This is called when remounting a block device. The WRAP_ARG_SB take the
> sb->s_root as a dentry, it is not a cast. What do you expect from this hook?

got it. I missed -> part. Now it makes sense.

> > 
> >> +/* a directory inode contains only one dentry */
> >> +HOOK_NEW_FS(inode_create, 3,
> >> +	struct inode *, dir,
> >> +	struct dentry *, dentry,
> >> +	umode_t, mode,
> >> +	WRAP_ARG_INODE, dir,
> >> +	WRAP_ARG_RAW, LANDLOCK_ACTION_FS_WRITE
> >> +);
> > 
> > more general question: why you're not wrapping all useful
> > arguments? Like in the above dentry can be acted upon
> > by the landlock rule and it's readily available...
> 
> The context used for the FS event must have the exact same types for all
> calls. This event is meant to be generic but we can add more specific
> ones if needed, like I do with FS_IOCTL.

I see. So all FS events will have dentry as first argument
regardless of how it is in LSM hook ?
I guess that will simplify the rules indeed.
I suspect you're doing it to simplify the LSM->landlock shim layer as well, right?

> The idea is to enable people to write simple rules, while being able to
> write fine grain rules for special cases (e.g. IOCTL) if needed.
> 
> > 
> > The limitation of only 2 args looks odd.
> > Is it a hard limitation ? how hard to extend?
> 
> It's not a hard limit at all. Actually, the FS_FNCTL event should have
> three arguments (I'll add them in the next series): FS handle, FCNTL
> command and FCNTL argument. I made sure that it's really easy to add
> more arguments to the context of an event.

The reason I'm asking, because I'm not completely convinced that
adding another argument to existing event will be backwards compatible.
It looks like you're expecting only two args for all FS events, right?
How can you add 3rd argument? All FS events would have to get it,
but in some LSM hooks such argument will be meaningless, whereas
in other places it will carry useful info that rule can operate on.
Would that mean that we'll have FS_3 event type and only few LSM
hooks will be converted to it. That works, but then we'll lose
compatiblity with old rules written for FS event and that given hook.
Otherwise we'd need to have fancy logic to accept old FS event
into FS_3 LSM hook.



^ permalink raw reply

* Re: UDP sockets oddities
From: Florian Fainelli @ 2017-08-26  1:17 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, edumazet, pabeni, willemb, davem
In-Reply-To: <1503705440.11498.9.camel@edumazet-glaptop3.roam.corp.google.com>

On 08/25/2017 04:57 PM, Eric Dumazet wrote:
> On Fri, 2017-08-25 at 16:18 -0700, Florian Fainelli wrote:
> 
>> Eric, are there areas of the stack where we are allowed to drop packets,
>> not propagate that back to write(2) and also not increment any counter
>> either, or maybe I am not looking where I should...
> 
> What happens if you increase these sysctls ?

I don't see packet loss after I tweak these two sysctls according to
your suggestions.

Tweaking eth0's sysctls did not change anything, but tweaking gphy's
sysctl resolved the loss. This was a little surprising considering that
gphy is an IFF_NO_QUEUE interface and eth0 is the conduit interface that
does the real transmission.

Does that make sense with respect to what I reported earlier? Should I
try to dump the neigh stats?

Thanks!

> 
> grep .  `find /proc/sys|grep unres_qlen`
> 
> 
> unres_qlen_bytes -> 2000000
> unres_qlen -> 10000
> 
> 

-- 
Florian

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox