Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next] modules: allow modprobe load regular elf binaries
From: Jesper Dangaard Brouer @ 2018-05-02  9:12 UTC (permalink / raw)
  To: Chris Mason
  Cc: brouer, Linus Torvalds, Alexei Starovoitov, David Miller,
	Daniel Borkmann, Greg Kroah-Hartman, Luis R. Rodriguez,
	Network Development, Linux Kernel Mailing List, kernel-team,
	Linux API
In-Reply-To: <EED9C7CB-BC5D-4E2D-B0CC-0003F682C73B@fb.com>


On Tue, 6 Mar 2018 15:42:41 -0800 Chris Mason <clm@fb.com> wrote:
> On 6 Mar 2018, at 11:12, Linus Torvalds wrote:
> 
[...]
> >
> > I do *not* want this to be a magical way to hide things.  
> 
> Especially early on, this makes a lot of sense.  But I wanted to plug 
> bps and the hopefully growing set of bpf introspection tools:
> 
> https://github.com/iovisor/bcc/blob/master/introspection/bps_example.txt
> 
> Long term these are probably a good place to tell the admin what's going 
> on.
(related to bpf itself not modprobe subject)

Hi Chris,

I just want to point out that the tool 'bpftool', is currently the
dominating tool for eBPF introspection.  And the 'bps' tool you mention
seems to have gained less (open source) traction.

The bpftool is part of the kernel git-tree: tools/bpf/bpftool
 https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/

And it even have bash-completion and man-pages in RST format so they
even render nicely when viewed via github:

 https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool.rst
 https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool-prog.rst
 https://github.com/torvalds/linux/blob/master/tools/bpf/bpftool/Documentation/bpftool-map.rst

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH] net/xfrm: Fix lookups for states with spi == 0
From: Herbert Xu @ 2018-05-02  9:11 UTC (permalink / raw)
  To: Dmitry Safonov
  Cc: linux-kernel, 0x7f454c46, Steffen Klassert, David S. Miller,
	netdev
In-Reply-To: <20180502020220.2027-1-dima@arista.com>

On Wed, May 02, 2018 at 03:02:20AM +0100, Dmitry Safonov wrote:
> It seems to be a valid use case to add xfrm state without
> Security Parameter Indexes (SPI) value associated:
> ip xfrm state add src $src dst $dst proto $proto mode $mode sel src $src dst $dst $algo
> 
> The bad thing is that it's currently impossible to get/delete the state
> without SPI: __xfrm_state_insert() obviously doesn't add hash for zero
> SPI in xfrm.state_byspi, and xfrm_user_state_lookup() will fail as
> xfrm_state_lookup() does lookups by hash.
> 
> It also isn't possible to workaround from userspace as
> xfrm_id_proto_match() will be always true for ah/esp/comp protos.
> 
> So, don't try looking up by hash if SPI == 0.
> 
> Cc: Steffen Klassert <steffen.klassert@secunet.com>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: "David S. Miller" <davem@davemloft.net>
> Cc: netdev@vger.kernel.org
> Signed-off-by: Dmitry Safonov <dima@arista.com>

A zero SPI is illegal for many IPsec protocols because that value
is used for other purposes, e.g., IKE encapsulation.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* non-blocking connect for kernel SCTP sockets
From: Michal Kubecek @ 2018-05-02  9:06 UTC (permalink / raw)
  To: netdev
  Cc: linux-sctp, linux-kernel, Vlad Yasevich, Neil Horman, Gang He,
	GuoQing Jiang

Hello,

while investigating a bug, we noticed that DLM tries to connect an SCTP
socket in non-blocking mode using

	result = sock->ops->connect(sock, (struct sockaddr *)&daddr, addr_len,
				    O_NONBLOCK);

which does not work. The reason is that inet_dgram_connect() cannot pass
its flags argument to sctp_connect() so that __sctp_connect() which does
the actual waiting resorts to checking sk->sk_socket->file->f_flags
instead. As the socket used by DLM is a kernel socket with no associated
file, it ends up blocking.

TCP doesn't suffer from this problem as for TCP, the waiting is done in
inet_stream_connect() which has the flags argument. I also checked other
proto::connect handlers and sctp_connect() seems to be the only one with
this kind of problem.

This could be worked around in DLM and further experiments indicate
current DLM code wouldn't actually handle the non-blocking connect
properly. But I still feel ignoring the flags argument is rather a trap
that should be fixed.

I have prepared a series adding flags argument to proto::connect and
using it in sctp_connect() and __sctp_connect(). But I'm not sure if
it's not too big hammer to address issue only affecting one handler.
So my question is: would such generic approach be preferred or should we
rather make SCTP work the way TCP does, i.e. move the waiting from
proto::connect() to proto_ops::connect()? This would require introducing
inet_seqpacket_connect() as inet_dgram_connect() is primarily intended
for use with UDP.)

Michal Kubecek

^ permalink raw reply

* [PATCH] change the comment of ip6gre_tnl_addr_conflict
From: Sun Lianwen @ 2018-05-02  9:06 UTC (permalink / raw)
  To: davem; +Cc: netdev

The comment of ip6gre_tnl_addr_conflict() is wrong. which use
ip6_tnl_addr_conflict instead of ip6gre_tnl_addr_conflict.

Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
---
 net/ipv6/ip6_gre.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_gre.c b/net/ipv6/ip6_gre.c
index 69727bc168cb..f81362d1ac8a 100644
--- a/net/ipv6/ip6_gre.c
+++ b/net/ipv6/ip6_gre.c
@@ -807,7 +807,7 @@ static inline int ip6gre_xmit_ipv6(struct sk_buff *skb, struct net_device *dev)
 }
 
 /**
- * ip6_tnl_addr_conflict - compare packet addresses to tunnel's own
+ * ip6gre_tnl_addr_conflict - compare packet addresses to ip6gre tunnel's own
  *   @t: the outgoing tunnel device
  *   @hdr: IPv6 header from the incoming packet
  *
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH] net: stmmac: Avoid VLA usage
From: Jose Abreu @ 2018-05-02  8:54 UTC (permalink / raw)
  To: Kees Cook, Giuseppe Cavallaro, Alexandre Torgue
  Cc: linux-kernel, netdev, Jose Abreu
In-Reply-To: <20180501210130.GA47709@beast>

Hi Kees,

On 01-05-2018 22:01, Kees Cook wrote:
> In the quest to remove all stack VLAs from the kernel[1], this switches
> the "status" stack buffer to use the existing small (8) upper bound on
> how many queues can be checked for DMA, and adds a sanity-check just to
> make sure it doesn't operate under pathological conditions.
>
> [1] https://urldefense.proofpoint.com/v2/url?u=http-3A__lkml.kernel.org_r_CA-2B55aFzCG-2DzNmZwX4A2FQpadafLfEzK6CC-3DqPXydAacU1RqZWA-40mail.gmail.com&d=DwIBAg&c=DPL6_X_6JkXFx7AXWqB0tg&r=WHDsc6kcWAl4i96Vm5hJ_19IJiuxx_p_Rzo2g-uHDKw&m=TBD6a7UY2VbpPmV9LOW_eHAyg8uPq1ZPDhq93VROTVE&s=4fvOST1HhWmZ4lThQe-dHCJYEXNOwey00BCXOWm8tKo&e=
>
> Signed-off-by: Kees Cook <keescook@chromium.org>
>

I rather prefer the variables declaration in reverse-tree order,
but thats just a minor pick.

Reviewed-by: Jose Abreu <joabreu@synopsys.com>

Thanks and Best Regards,
Jose Miguel Abreu

PS: Is VLA warning switch in gcc already active? Because I didn't
see this warning in my builds.

^ permalink raw reply

* Re: [PATCH bpf-next] bpf/verifier: enable ctx + const + 0.
From: Daniel Borkmann @ 2018-05-02  8:29 UTC (permalink / raw)
  To: Alexei Starovoitov, William Tu
  Cc: Linux Kernel Network Developers, Yonghong Song, Yifeng Sun
In-Reply-To: <20180502045206.6aqm4koawyisgkxr@ast-mbp>

On 05/02/2018 06:52 AM, Alexei Starovoitov wrote:
> On Tue, May 01, 2018 at 09:35:29PM -0700, William Tu wrote:
>>
>>> How did you test this patch?
>>>
>> Without the patch, the test case will fail.
>> With the patch, the test case passes.
> 
> Please test it with real program and you'll see crashes and garbage returned.

+1, *convert_ctx_access() use bpf_insn's off to determine what to rewrite,
so this is definitely buggy, and wasn't properly tested as it should have
been. The test case is also way too simple, just the LDX and then doing a
return 0 will get you past verifier, but won't give you anything in terms
of runtime testing that test_verifier is doing. A single test case for a
non trivial verifier change like this is also _completely insufficient_,
this really needs to test all sort of weird corner cases (involving out of
bounds accesses, overflows, etc).

^ permalink raw reply

* RE: [PATCH  2/8] dt-bindings: stm32-dwmac: add support of MPU families
From: Christophe ROULLIER @ 2018-05-02  8:15 UTC (permalink / raw)
  To: Rob Herring
  Cc: mark.rutland@arm.com, mcoquelin.stm32@gmail.com, Alexandre TORGUE,
	Peppe CAVALLARO, devicetree@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org
In-Reply-To: <20180501135848.GA12597@rob-hp-laptop>

Hi,
My answers below under "CRO"

Thanks.
Christophe

-----Original Message-----
From: Rob Herring [mailto:robh@kernel.org] 
Sent: mardi 1 mai 2018 15:59
To: Christophe ROULLIER <christophe.roullier@st.com>
Cc: mark.rutland@arm.com; mcoquelin.stm32@gmail.com; Alexandre TORGUE <alexandre.torgue@st.com>; Peppe CAVALLARO <peppe.cavallaro@st.com>; devicetree@vger.kernel.org; linux-arm-kernel@lists.infradead.org; netdev@vger.kernel.org
Subject: Re: [PATCH 2/8] dt-bindings: stm32-dwmac: add support of MPU families

On Tue, Apr 24, 2018 at 05:01:54PM +0200, Christophe Roullier wrote:
> Add description for Ethernet MPU families fields
> 
> Signed-off-by: Christophe Roullier <christophe.roullier@st.com>
> ---
>  Documentation/devicetree/bindings/net/stm32-dwmac.txt | 16 
> ++++++++++++++--
>  1 file changed, 14 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/devicetree/bindings/net/stm32-dwmac.txt 
> b/Documentation/devicetree/bindings/net/stm32-dwmac.txt
> index 489dbcb..e9d1c4a 100644
> --- a/Documentation/devicetree/bindings/net/stm32-dwmac.txt
> +++ b/Documentation/devicetree/bindings/net/stm32-dwmac.txt
> @@ -6,14 +6,26 @@ Please see stmmac.txt for the other unchanged properties.
>  The device node has following properties.
>  
>  Required properties:
> -- compatible:  Should be "st,stm32-dwmac" to select glue, and
> +- compatible:  For MCU family should be "st,stm32-dwmac" to select 
> +glue, and
>  	       "snps,dwmac-3.50a" to select IP version.
> +	       For MPU family should be "st,stm32mp1-dwmac" to select
> +	       glue, and "snps,dwmac-4.20a" to select IP version.
>  - clocks: Must contain a phandle for each entry in clock-names.
>  - clock-names: Should be "stmmaceth" for the host clock.
>  	       Should be "mac-clk-tx" for the MAC TX clock.
>  	       Should be "mac-clk-rx" for the MAC RX clock.
> +	       For MPU family "ethstp" for power mode clock.
> +	       For MPU family need also "syscfg-clk" for SYSCFG clock.

These are in addition or instead of the first 3 clocks.
CRO: This is in addition of the first 3 clocks, I will modified my comment:
>>  - clock-names: Should be "stmmaceth" for the host clock.
>>  	       Should be "mac-clk-tx" for the MAC TX clock.
>>  	       Should be "mac-clk-rx" for the MAC RX clock.
>> +	       For MPU family need to add also "ethstp" for power mode clock and,
>> +	                                                                   "syscfg-clk" for SYSCFG clock.

> +- interrupt-names: Should contain a list of interrupt names corresponding to
> +           the interrupts in the interrupts property, if available.

You need to list the names. Seems unrelated to MPU support.
CRO: 		   
> +- interrupt-names: Should contain a list of interrupt names corresponding to
> +           the interrupts in the interrupts property, if available.
        + Should be "macirq" for the main MAC IT
        + Should be "eth_wake_irq" for the IT which wake up system

>  - st,syscon : Should be phandle/offset pair. The phandle to the syscon node which
> -	      encompases the glue register, and the offset of the control register.
> +	       encompases the glue register, and the offset of the control register.
> +
> +Optional properties:
> +- clock-names:     For MPU family "mac-clk-ck" for PHY without quartz

The clock is always connected whether you use it or not, right? So it shouldn't be optional based on use.
CRO: Yes the clock is always connected but it is not necessary to enable it if you have phy with quartz (it will not use)
So it is optional.

> +- st,int-phyclk :  valid only where PHY do not have quartz and need to be clock
> +	           by RCC

Boolean?
CRO: Yes
> +- st,int-phyclk (boolean) :  valid only where PHY do not have quartz and need to be clock

Is it ok ?

> +
>  Example:
>  
>  	ethernet@40028000 {
> --
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe devicetree" 
> in the body of a message to majordomo@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH V2 net-next 6/6] ipvlan: Add support for SCTP checksum offload
From: Davide Caratti @ 2018-05-02  8:12 UTC (permalink / raw)
  To: Vladislav Yasevich, netdev
  Cc: linux-sctp, virtualization, virtio-dev, mst, jasowang, nhorman,
	marcelo.leitner, Vladislav Yasevich
In-Reply-To: <20180502020739.19239-7-vyasevic@redhat.com>

On Tue, 2018-05-01 at 22:07 -0400, Vladislav Yasevich wrote:
> Since ipvlan is a software device, we can turn on SCTP checksum
> offload and let the checksum be computed at the lower layer.
> 
> Signed-off-by: Vladislav Yasevich <vyasevic@redhat.com>
> ---
>  drivers/net/ipvlan/ipvlan_main.c | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 

Acked-by: Davide Caratti <dcaratti@redhat.com>

> diff --git a/drivers/net/ipvlan/ipvlan_main.c b/drivers/net/ipvlan/ipvlan_main.c
> index 450eec2..ddb1590 100644
> --- a/drivers/net/ipvlan/ipvlan_main.c
> +++ b/drivers/net/ipvlan/ipvlan_main.c
> @@ -161,7 +161,8 @@ static void ipvlan_port_destroy(struct net_device *dev)
>  	(NETIF_F_SG | NETIF_F_HW_CSUM | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST | \
>  	 NETIF_F_GSO | NETIF_F_TSO | NETIF_F_GSO_ROBUST | \
>  	 NETIF_F_TSO_ECN | NETIF_F_TSO6 | NETIF_F_GRO | NETIF_F_RXCSUM | \
> -	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER)
> +	 NETIF_F_HW_VLAN_CTAG_FILTER | NETIF_F_HW_VLAN_STAG_FILTER | \
> +	 NETIF_F_SCTP_CRC)
>  
>  #define IPVLAN_STATE_MASK \
>  	((1<<__LINK_STATE_NOCARRIER) | (1<<__LINK_STATE_DORMANT))

^ permalink raw reply

* Re: [RFC V3 PATCH 1/8] vhost: move get_rx_bufs to vhost.c
From: Tiwei Bie @ 2018-05-02  8:05 UTC (permalink / raw)
  To: Jason Wang
  Cc: mst, kvm, virtualization, netdev, linux-kernel, jfreimann, wexu
In-Reply-To: <1524461700-5469-2-git-send-email-jasowang@redhat.com>

On Mon, Apr 23, 2018 at 01:34:53PM +0800, Jason Wang wrote:
> Move get_rx_bufs() to vhost.c and rename it to
> vhost_get_rx_bufs(). This helps to hide vring internal layout from

A small typo. Based on the code change in this patch, it
seems that this function is renamed to vhost_get_bufs().

Thanks


> specific device implementation. Packed ring implementation will
> benefit from this.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
>  drivers/vhost/net.c   | 83 ++-------------------------------------------------
>  drivers/vhost/vhost.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |  7 +++++
>  3 files changed, 88 insertions(+), 80 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 986058a..762aa81 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -664,83 +664,6 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk)
>  	return len;
>  }
>  
> -/* This is a multi-buffer version of vhost_get_desc, that works if
> - *	vq has read descriptors only.
> - * @vq		- the relevant virtqueue
> - * @datalen	- data length we'll be reading
> - * @iovcount	- returned count of io vectors we fill
> - * @log		- vhost log
> - * @log_num	- log offset
> - * @quota       - headcount quota, 1 for big buffer
> - *	returns number of buffer heads allocated, negative on error
> - */
> -static int get_rx_bufs(struct vhost_virtqueue *vq,
> -		       struct vring_used_elem *heads,
> -		       int datalen,
> -		       unsigned *iovcount,
> -		       struct vhost_log *log,
> -		       unsigned *log_num,
> -		       unsigned int quota)
> -{
> -	unsigned int out, in;
> -	int seg = 0;
> -	int headcount = 0;
> -	unsigned d;
> -	int r, nlogs = 0;
> -	/* len is always initialized before use since we are always called with
> -	 * datalen > 0.
> -	 */
> -	u32 uninitialized_var(len);
> -
> -	while (datalen > 0 && headcount < quota) {
> -		if (unlikely(seg >= UIO_MAXIOV)) {
> -			r = -ENOBUFS;
> -			goto err;
> -		}
> -		r = vhost_get_vq_desc(vq, vq->iov + seg,
> -				      ARRAY_SIZE(vq->iov) - seg, &out,
> -				      &in, log, log_num);
> -		if (unlikely(r < 0))
> -			goto err;
> -
> -		d = r;
> -		if (d == vq->num) {
> -			r = 0;
> -			goto err;
> -		}
> -		if (unlikely(out || in <= 0)) {
> -			vq_err(vq, "unexpected descriptor format for RX: "
> -				"out %d, in %d\n", out, in);
> -			r = -EINVAL;
> -			goto err;
> -		}
> -		if (unlikely(log)) {
> -			nlogs += *log_num;
> -			log += *log_num;
> -		}
> -		heads[headcount].id = cpu_to_vhost32(vq, d);
> -		len = iov_length(vq->iov + seg, in);
> -		heads[headcount].len = cpu_to_vhost32(vq, len);
> -		datalen -= len;
> -		++headcount;
> -		seg += in;
> -	}
> -	heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
> -	*iovcount = seg;
> -	if (unlikely(log))
> -		*log_num = nlogs;
> -
> -	/* Detect overrun */
> -	if (unlikely(datalen > 0)) {
> -		r = UIO_MAXIOV + 1;
> -		goto err;
> -	}
> -	return headcount;
> -err:
> -	vhost_discard_vq_desc(vq, headcount);
> -	return r;
> -}
> -
>  /* Expects to be always run from workqueue - which acts as
>   * read-size critical section for our kind of RCU. */
>  static void handle_rx(struct vhost_net *net)
> @@ -790,9 +713,9 @@ static void handle_rx(struct vhost_net *net)
>  	while ((sock_len = vhost_net_rx_peek_head_len(net, sock->sk))) {
>  		sock_len += sock_hlen;
>  		vhost_len = sock_len + vhost_hlen;
> -		headcount = get_rx_bufs(vq, vq->heads + nheads, vhost_len,
> -					&in, vq_log, &log,
> -					likely(mergeable) ? UIO_MAXIOV : 1);
> +		headcount = vhost_get_bufs(vq, vq->heads + nheads, vhost_len,
> +					   &in, vq_log, &log,
> +					   likely(mergeable) ? UIO_MAXIOV : 1);
>  		/* On error, stop handling until the next kick. */
>  		if (unlikely(headcount < 0))
>  			goto out;
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index f3bd8e9..6b455f6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2097,6 +2097,84 @@ int vhost_get_vq_desc(struct vhost_virtqueue *vq,
>  }
>  EXPORT_SYMBOL_GPL(vhost_get_vq_desc);
>  
> +/* This is a multi-buffer version of vhost_get_desc, that works if
> + *	vq has read descriptors only.
> + * @vq		- the relevant virtqueue
> + * @datalen	- data length we'll be reading
> + * @iovcount	- returned count of io vectors we fill
> + * @log		- vhost log
> + * @log_num	- log offset
> + * @quota       - headcount quota, 1 for big buffer
> + *	returns number of buffer heads allocated, negative on error
> + */
> +int vhost_get_bufs(struct vhost_virtqueue *vq,
> +		   struct vring_used_elem *heads,
> +		   int datalen,
> +		   unsigned *iovcount,
> +		   struct vhost_log *log,
> +		   unsigned *log_num,
> +		   unsigned int quota)
> +{
> +	unsigned int out, in;
> +	int seg = 0;
> +	int headcount = 0;
> +	unsigned d;
> +	int r, nlogs = 0;
> +	/* len is always initialized before use since we are always called with
> +	 * datalen > 0.
> +	 */
> +	u32 uninitialized_var(len);
> +
> +	while (datalen > 0 && headcount < quota) {
> +		if (unlikely(seg >= UIO_MAXIOV)) {
> +			r = -ENOBUFS;
> +			goto err;
> +		}
> +		r = vhost_get_vq_desc(vq, vq->iov + seg,
> +				      ARRAY_SIZE(vq->iov) - seg, &out,
> +				      &in, log, log_num);
> +		if (unlikely(r < 0))
> +			goto err;
> +
> +		d = r;
> +		if (d == vq->num) {
> +			r = 0;
> +			goto err;
> +		}
> +		if (unlikely(out || in <= 0)) {
> +			vq_err(vq, "unexpected descriptor format for RX: "
> +				"out %d, in %d\n", out, in);
> +			r = -EINVAL;
> +			goto err;
> +		}
> +		if (unlikely(log)) {
> +			nlogs += *log_num;
> +			log += *log_num;
> +		}
> +		heads[headcount].id = cpu_to_vhost32(vq, d);
> +		len = iov_length(vq->iov + seg, in);
> +		heads[headcount].len = cpu_to_vhost32(vq, len);
> +		datalen -= len;
> +		++headcount;
> +		seg += in;
> +	}
> +	heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
> +	*iovcount = seg;
> +	if (unlikely(log))
> +		*log_num = nlogs;
> +
> +	/* Detect overrun */
> +	if (unlikely(datalen > 0)) {
> +		r = UIO_MAXIOV + 1;
> +		goto err;
> +	}
> +	return headcount;
> +err:
> +	vhost_discard_vq_desc(vq, headcount);
> +	return r;
> +}
> +EXPORT_SYMBOL_GPL(vhost_get_bufs);
> +
>  /* Reverse the effect of vhost_get_vq_desc. Useful for error handling. */
>  void vhost_discard_vq_desc(struct vhost_virtqueue *vq, int n)
>  {
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 6c844b9..52edd242 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -185,6 +185,13 @@ int vhost_get_vq_desc(struct vhost_virtqueue *,
>  		      struct iovec iov[], unsigned int iov_count,
>  		      unsigned int *out_num, unsigned int *in_num,
>  		      struct vhost_log *log, unsigned int *log_num);
> +int vhost_get_bufs(struct vhost_virtqueue *vq,
> +		   struct vring_used_elem *heads,
> +		   int datalen,
> +		   unsigned *iovcount,
> +		   struct vhost_log *log,
> +		   unsigned *log_num,
> +		   unsigned int quota);
>  void vhost_discard_vq_desc(struct vhost_virtqueue *, int n);
>  
>  int vhost_vq_init_access(struct vhost_virtqueue *);
> -- 
> 2.7.4
> 

^ permalink raw reply

* Re: [PATCH net-next v9 3/4] virtio_net: Extend virtio to use VF datapath when available
From: Jiri Pirko @ 2018-05-02  7:50 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: mst, stephen, davem, netdev, virtualization, virtio-dev,
	jesse.brandeburg, alexander.h.duyck, kubakici, jasowang,
	loseweigh, aaron.f.brown
In-Reply-To: <9c6f055c-c665-0601-3973-bce74d72544b@intel.com>

Wed, May 02, 2018 at 02:20:26AM CEST, sridhar.samudrala@intel.com wrote:
>On 4/30/2018 12:20 AM, Jiri Pirko wrote:
>> 
>> > > Now I try to change mac of the failover master:
>> > > [root@test1 ~]# ip link set ens3 addr 52:54:00:b2:a7:f3
>> > > RTNETLINK answers: Operation not supported
>> > > 
>> > > That I did expect to work. I would expect this would change the mac of
>> > > the master and both standby and primary slaves.
>> > If a VF is untrusted, a VM will not able to change its MAC and moreover
>> Note that at this point, I have no VF. So I'm not sure why you mention
>> that.
>> 
>> 
>> > in this mode we are assuming that the hypervisor has assigned the MAC and
>> > guest is not expected to change the MAC.
>> Wait, for ordinary old-fashioned virtio_net, as a VM user, I can change
>> mac and all works fine. How is this different? Change mac on "failover
>> instance" should work and should propagate the mac down to its slaves.
>> 
>> 
>> > For the initial implementation, i would propose not allowing the guest to
>> > change the MAC of failover or standby dev.
>> I see no reason for such restriction.
>> 
>
>It is true that a VM user can change mac address of a normal virtio-net interface,
>however when it is in STANDBY mode i think we should not allow this change specifically
>because we are creating a failover instance based on a MAC that is assigned by the
>hypervisor.
>
>Moreover,  in a cloud environment i would think that PF/hypervisor assigns a MAC to
>the VF and it cannot be changed by the guest.

So that is easy. You allow the change of the mac and in the "failover"
mac change implementation you propagate the change down to slaves. If
one slave does not support the change, you bail out. And since VF does
not allow it as you say, once it will be enslaved, the mac change could
not be done. Seems like a correct behavior to me and is in-sync with how
bond/team behaves.


>
>So for the initial implementation, do you see any issues with having this restriction
>in STANDBY mode.
>
>

^ permalink raw reply

* Re: [RFC v3 4/5] virtio_ring: add event idx support in packed ring
From: Tiwei Bie @ 2018-05-02  7:28 UTC (permalink / raw)
  To: Jason Wang; +Cc: mst, virtualization, linux-kernel, netdev, wexu, jfreimann
In-Reply-To: <34781052-df9f-e505-cd3f-08e460b34dcc@redhat.com>

On Wed, May 02, 2018 at 10:51:06AM +0800, Jason Wang wrote:
> On 2018年04月25日 13:15, Tiwei Bie wrote:
> > This commit introduces the event idx support in packed
> > ring. This feature is temporarily disabled, because the
> > implementation in this patch may not work as expected,
> > and some further discussions on the implementation are
> > needed, e.g. do we have to check the wrap counter when
> > checking whether a kick is needed?
> > 
> > Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
> > ---
> >   drivers/virtio/virtio_ring.c | 53 ++++++++++++++++++++++++++++++++++++++++----
> >   1 file changed, 49 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/virtio/virtio_ring.c b/drivers/virtio/virtio_ring.c
> > index 0181e93897be..b1039c2985b9 100644
> > --- a/drivers/virtio/virtio_ring.c
> > +++ b/drivers/virtio/virtio_ring.c
> > @@ -986,7 +986,7 @@ static inline int virtqueue_add_packed(struct virtqueue *_vq,
> >   static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >   {
> >   	struct vring_virtqueue *vq = to_vvq(_vq);
> > -	u16 flags;
> > +	u16 new, old, off_wrap, flags;
> >   	bool needs_kick;
> >   	u32 snapshot;
> > @@ -995,7 +995,12 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >   	 * suppressions. */
> >   	virtio_mb(vq->weak_barriers);
> > +	old = vq->next_avail_idx - vq->num_added;
> > +	new = vq->next_avail_idx;
> > +	vq->num_added = 0;
> > +
> >   	snapshot = *(u32 *)vq->vring_packed.device;
> > +	off_wrap = virtio16_to_cpu(_vq->vdev, snapshot & 0xffff);
> >   	flags = cpu_to_virtio16(_vq->vdev, snapshot >> 16) & 0x3;
> >   #ifdef DEBUG
> > @@ -1006,7 +1011,10 @@ static bool virtqueue_kick_prepare_packed(struct virtqueue *_vq)
> >   	vq->last_add_time_valid = false;
> >   #endif
> > -	needs_kick = (flags != VRING_EVENT_F_DISABLE);
> > +	if (flags == VRING_EVENT_F_DESC)
> > +		needs_kick = vring_need_event(off_wrap & ~(1<<15), new, old);
> 
> I wonder whether or not the math is correct. Both new and event are in the
> unit of descriptor ring size, but old looks not.

What vring_need_event() cares is the distance between
`new` and `old`, i.e. vq->num_added. So I think there
is nothing wrong with `old`. But the calculation of the
distance between `new` and `event_idx` isn't right when
`new` wraps. How do you think about the below code:

	wrap_counter = off_wrap >> 15;
	event_idx = off_wrap & ~(1<<15);
	if (wrap_counter != vq->wrap_counter)
		event_idx -= vq->vring_packed.num;
	
	needs_kick = vring_need_event(event_idx, new, old);

Best regards,
Tiwei Bie


> 
> Thanks
> 
> > +	else
> > +		needs_kick = (flags != VRING_EVENT_F_DISABLE);
> >   	END_USE(vq);
> >   	return needs_kick;
> >   }
> > @@ -1116,6 +1124,15 @@ static void *virtqueue_get_buf_ctx_packed(struct virtqueue *_vq,
> >   	if (vq->last_used_idx >= vq->vring_packed.num)
> >   		vq->last_used_idx -= vq->vring_packed.num;
> > +	/* If we expect an interrupt for the next entry, tell host
> > +	 * by writing event index and flush out the write before
> > +	 * the read in the next get_buf call. */
> > +	if (vq->event_flags_shadow == VRING_EVENT_F_DESC)
> > +		virtio_store_mb(vq->weak_barriers,
> > +				&vq->vring_packed.driver->off_wrap,
> > +				cpu_to_virtio16(_vq->vdev, vq->last_used_idx |
> > +						(vq->wrap_counter << 15)));
> > +
> >   #ifdef DEBUG
> >   	vq->last_add_time_valid = false;
> >   #endif
> > @@ -1143,10 +1160,17 @@ static unsigned virtqueue_enable_cb_prepare_packed(struct virtqueue *_vq)
> >   	/* We optimistically turn back on interrupts, then check if there was
> >   	 * more to do. */
> > +	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +	 * either clear the flags bit or point the event index at the next
> > +	 * entry. Always update the event index to keep code simple. */
> > +
> > +	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +			vq->last_used_idx | (vq->wrap_counter << 15));
> >   	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> >   		virtio_wmb(vq->weak_barriers);
> > -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +						     VRING_EVENT_F_ENABLE;
> >   		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> >   							vq->event_flags_shadow);
> >   	}
> > @@ -1172,15 +1196,34 @@ static bool virtqueue_poll_packed(struct virtqueue *_vq, unsigned last_used_idx)
> >   static bool virtqueue_enable_cb_delayed_packed(struct virtqueue *_vq)
> >   {
> >   	struct vring_virtqueue *vq = to_vvq(_vq);
> > +	u16 bufs, used_idx, wrap_counter;
> >   	START_USE(vq);
> >   	/* We optimistically turn back on interrupts, then check if there was
> >   	 * more to do. */
> > +	/* Depending on the VIRTIO_RING_F_USED_EVENT_IDX feature, we need to
> > +	 * either clear the flags bit or point the event index at the next
> > +	 * entry. Always update the event index to keep code simple. */
> > +
> > +	/* TODO: tune this threshold */
> > +	bufs = (u16)(vq->next_avail_idx - vq->last_used_idx) * 3 / 4;
> > +
> > +	used_idx = vq->last_used_idx + bufs;
> > +	wrap_counter = vq->wrap_counter;
> > +
> > +	if (used_idx >= vq->vring_packed.num) {
> > +		used_idx -= vq->vring_packed.num;
> > +		wrap_counter ^= 1;
> > +	}
> > +
> > +	vq->vring_packed.driver->off_wrap = cpu_to_virtio16(_vq->vdev,
> > +			used_idx | (wrap_counter << 15));
> >   	if (vq->event_flags_shadow == VRING_EVENT_F_DISABLE) {
> >   		virtio_wmb(vq->weak_barriers);
> > -		vq->event_flags_shadow = VRING_EVENT_F_ENABLE;
> > +		vq->event_flags_shadow = vq->event ? VRING_EVENT_F_DESC :
> > +						     VRING_EVENT_F_ENABLE;
> >   		vq->vring_packed.driver->flags = cpu_to_virtio16(_vq->vdev,
> >   							vq->event_flags_shadow);
> >   	}
> > @@ -1822,8 +1865,10 @@ void vring_transport_features(struct virtio_device *vdev)
> >   		switch (i) {
> >   		case VIRTIO_RING_F_INDIRECT_DESC:
> >   			break;
> > +#if 0
> >   		case VIRTIO_RING_F_EVENT_IDX:
> >   			break;
> > +#endif
> >   		case VIRTIO_F_VERSION_1:
> >   			break;
> >   		case VIRTIO_F_IOMMU_PLATFORM:
> 

^ permalink raw reply

* [PATCH net-next v2 2/2] mlxsw: spectrum_router: Return an error for routes added after abort
From: Ido Schimmel @ 2018-05-02  7:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20180502071735.32352-1-idosch@mellanox.com>

We currently do not perform accounting in the driver and thus can't
reject routes before resources are exceeded.

However, in order to make users aware of the fact that routes are no
longer offloaded we can return an error for routes configured after the
abort mechanism was triggered.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index added380e344..8028d221aece 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -5928,6 +5928,13 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
 						     router->mlxsw_sp);
 		if (!err || info->extack)
 			return notifier_from_errno(err);
+		break;
+	case FIB_EVENT_ENTRY_ADD:
+		if (router->aborted) {
+			NL_SET_ERR_MSG_MOD(info->extack, "FIB offload was aborted. Not configuring route");
+			return notifier_from_errno(-EINVAL);
+		}
+		break;
 	}
 
 	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v2 1/2] mlxsw: spectrum_router: Return an error for non-default FIB rules
From: Ido Schimmel @ 2018-05-02  7:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20180502071735.32352-1-idosch@mellanox.com>

Since commit 9776d32537d2 ("net: Move call_fib_rule_notifiers up in
fib_nl_newrule") it is possible to forbid the installation of
unsupported FIB rules.

Have mlxsw return an error for non-default FIB rules in addition to the
existing extack message.

Example:
# ip rule add from 198.51.100.1 table 10
Error: mlxsw_spectrum: FIB rules not supported.

Note that offload is only aborted when non-default FIB rules are already
installed and merely replayed during module initialization.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 8e4edb634b11..added380e344 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -5882,24 +5882,24 @@ static int mlxsw_sp_router_fib_rule_event(unsigned long event,
 	switch (info->family) {
 	case AF_INET:
 		if (!fib4_rule_default(rule) && !rule->l3mdev)
-			err = -1;
+			err = -EOPNOTSUPP;
 		break;
 	case AF_INET6:
 		if (!fib6_rule_default(rule) && !rule->l3mdev)
-			err = -1;
+			err = -EOPNOTSUPP;
 		break;
 	case RTNL_FAMILY_IPMR:
 		if (!ipmr_rule_default(rule) && !rule->l3mdev)
-			err = -1;
+			err = -EOPNOTSUPP;
 		break;
 	case RTNL_FAMILY_IP6MR:
 		if (!ip6mr_rule_default(rule) && !rule->l3mdev)
-			err = -1;
+			err = -EOPNOTSUPP;
 		break;
 	}
 
 	if (err < 0)
-		NL_SET_ERR_MSG_MOD(extack, "FIB rules not supported. Aborting offload");
+		NL_SET_ERR_MSG_MOD(extack, "FIB rules not supported");
 
 	return err;
 }
@@ -5926,8 +5926,8 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
 	case FIB_EVENT_RULE_DEL:
 		err = mlxsw_sp_router_fib_rule_event(event, info,
 						     router->mlxsw_sp);
-		if (!err)
-			return NOTIFY_DONE;
+		if (!err || info->extack)
+			return notifier_from_errno(err);
 	}
 
 	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
-- 
2.14.3

^ permalink raw reply related

* [PATCH net-next v2 0/2] mlxsw: Reject unsupported FIB configurations
From: Ido Schimmel @ 2018-05-02  7:17 UTC (permalink / raw)
  To: netdev; +Cc: davem, jiri, dsahern, mlxsw, Ido Schimmel

Recently it became possible for listeners of the FIB notification chain
to veto operations such as addition of routes and rules.

Adjust the mlxsw driver to take advantage of it and return an error for
unsupported FIB rules and for routes configured after the abort
mechanism was triggered (due to exceeded resources for example).

v2:
* Change error code in first patch to -EOPNOTSUPP (David Ahern).

Ido Schimmel (2):
  mlxsw: spectrum_router: Return an error for non-default FIB rules
  mlxsw: spectrum_router: Return an error for routes added after abort

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c   | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

-- 
2.14.3

^ permalink raw reply

* Re: [PATCH net-next] udp: disable gso with no_check_tx
From: Michal Kubecek @ 2018-05-02  7:05 UTC (permalink / raw)
  To: netdev; +Cc: davem, Willem de Bruijn, Willem de Bruijn
In-Reply-To: <20180430195836.69378-1-willemdebruijn.kernel@gmail.com>

On Mon, Apr 30, 2018 at 03:58:36PM -0400, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
> 
> Syzbot managed to send a udp gso packet without checksum offload into
> the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
> triggered the skb_warn_bad_offload.
> 
>   RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
>    skb_gso_segment include/linux/netdevice.h:4038 [inline]
>    validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
>    __dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
>    dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
> 
> UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
> udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
> catch and fail in this case.

Sounds rather familiar... perhaps we might want to check other
exceptions added to UFO over the years, some might apply here as well.

Michal Kubecek

^ permalink raw reply

* [PATCH net] ipv4: fix fnhe usage by non-cached routes
From: Julian Anastasov @ 2018-05-02  6:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Martin KaFai Lau, kernel-team, David Ahern, Xin Long

Allow some non-cached routes to use non-expired fnhe:

1. ip_del_fnhe: moved above and now called by find_exception.
The 4.5+ commit deed49df7390 expires fnhe only when caching
routes. Change that to:

1.1. use fnhe for non-cached local output routes, with the help
from (2)

1.2. allow __mkroute_input to detect expired fnhe (outdated
fnhe_gw, for example) when do_cache is false, eg. when itag!=0
for unicast destinations.

2. __mkroute_output: keep fi to allow local routes with orig_oif != 0
to use fnhe info even when the new route will not be cached into fnhe.
After commit 839da4d98960 ("net: ipv4: set orig_oif based on fib
result for local traffic") it means all local routes will be affected
because they are not cached. This change is used to solve a PMTU
problem with IPVS (and probably Netfilter DNAT) setups that redirect
local clients from target local IP (local route to Virtual IP)
to new remote IP target, eg. IPVS TUN real server. Loopback has
64K MTU and we need to create fnhe on the local route that will
keep the reduced PMTU for the Virtual IP. Without this change
fnhe_pmtu is updated from ICMP but never exposed to non-cached
local routes. This includes routes with flowi4_oif!=0 for 4.6+ and
with flowi4_oif=any for 4.14+).

3. update_or_create_fnhe: make sure fnhe_expires is not 0 for
new entries

Fixes: 839da4d98960 ("net: ipv4: set orig_oif based on fib result for local traffic")
Fixes: d6d5e999e5df ("route: do not cache fib route info on local routes with oif")
Fixes: deed49df7390 ("route: check and remove route cache when we get route")
Cc: David Ahern <dsahern@gmail.com>
Cc: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
---
 net/ipv4/route.c | 118 +++++++++++++++++++++++++------------------------------
 1 file changed, 53 insertions(+), 65 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ccb25d8..1412a7b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -709,7 +709,7 @@ static void update_or_create_fnhe(struct fib_nh *nh, __be32 daddr, __be32 gw,
 		fnhe->fnhe_gw = gw;
 		fnhe->fnhe_pmtu = pmtu;
 		fnhe->fnhe_mtu_locked = lock;
-		fnhe->fnhe_expires = expires;
+		fnhe->fnhe_expires = max(1UL, expires);
 
 		/* Exception created; mark the cached routes for the nexthop
 		 * stale, so anyone caching it rechecks if this exception
@@ -1297,6 +1297,36 @@ static unsigned int ipv4_mtu(const struct dst_entry *dst)
 	return mtu - lwtunnel_headroom(dst->lwtstate, mtu);
 }
 
+static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
+{
+	struct fnhe_hash_bucket *hash;
+	struct fib_nh_exception *fnhe, __rcu **fnhe_p;
+	u32 hval = fnhe_hashfun(daddr);
+
+	spin_lock_bh(&fnhe_lock);
+
+	hash = rcu_dereference_protected(nh->nh_exceptions,
+					 lockdep_is_held(&fnhe_lock));
+	hash += hval;
+
+	fnhe_p = &hash->chain;
+	fnhe = rcu_dereference_protected(*fnhe_p, lockdep_is_held(&fnhe_lock));
+	while (fnhe) {
+		if (fnhe->fnhe_daddr == daddr) {
+			rcu_assign_pointer(*fnhe_p, rcu_dereference_protected(
+				fnhe->fnhe_next, lockdep_is_held(&fnhe_lock)));
+			fnhe_flush_routes(fnhe);
+			kfree_rcu(fnhe, rcu);
+			break;
+		}
+		fnhe_p = &fnhe->fnhe_next;
+		fnhe = rcu_dereference_protected(fnhe->fnhe_next,
+						 lockdep_is_held(&fnhe_lock));
+	}
+
+	spin_unlock_bh(&fnhe_lock);
+}
+
 static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr)
 {
 	struct fnhe_hash_bucket *hash = rcu_dereference(nh->nh_exceptions);
@@ -1310,8 +1340,14 @@ static struct fib_nh_exception *find_exception(struct fib_nh *nh, __be32 daddr)
 
 	for (fnhe = rcu_dereference(hash[hval].chain); fnhe;
 	     fnhe = rcu_dereference(fnhe->fnhe_next)) {
-		if (fnhe->fnhe_daddr == daddr)
+		if (fnhe->fnhe_daddr == daddr) {
+			if (fnhe->fnhe_expires &&
+			    time_after(jiffies, fnhe->fnhe_expires)) {
+				ip_del_fnhe(nh, daddr);
+				break;
+			}
 			return fnhe;
+		}
 	}
 	return NULL;
 }
@@ -1636,36 +1672,6 @@ static void ip_handle_martian_source(struct net_device *dev,
 #endif
 }
 
-static void ip_del_fnhe(struct fib_nh *nh, __be32 daddr)
-{
-	struct fnhe_hash_bucket *hash;
-	struct fib_nh_exception *fnhe, __rcu **fnhe_p;
-	u32 hval = fnhe_hashfun(daddr);
-
-	spin_lock_bh(&fnhe_lock);
-
-	hash = rcu_dereference_protected(nh->nh_exceptions,
-					 lockdep_is_held(&fnhe_lock));
-	hash += hval;
-
-	fnhe_p = &hash->chain;
-	fnhe = rcu_dereference_protected(*fnhe_p, lockdep_is_held(&fnhe_lock));
-	while (fnhe) {
-		if (fnhe->fnhe_daddr == daddr) {
-			rcu_assign_pointer(*fnhe_p, rcu_dereference_protected(
-				fnhe->fnhe_next, lockdep_is_held(&fnhe_lock)));
-			fnhe_flush_routes(fnhe);
-			kfree_rcu(fnhe, rcu);
-			break;
-		}
-		fnhe_p = &fnhe->fnhe_next;
-		fnhe = rcu_dereference_protected(fnhe->fnhe_next,
-						 lockdep_is_held(&fnhe_lock));
-	}
-
-	spin_unlock_bh(&fnhe_lock);
-}
-
 /* called in rcu_read_lock() section */
 static int __mkroute_input(struct sk_buff *skb,
 			   const struct fib_result *res,
@@ -1719,20 +1725,10 @@ static int __mkroute_input(struct sk_buff *skb,
 
 	fnhe = find_exception(&FIB_RES_NH(*res), daddr);
 	if (do_cache) {
-		if (fnhe) {
+		if (fnhe)
 			rth = rcu_dereference(fnhe->fnhe_rth_input);
-			if (rth && rth->dst.expires &&
-			    time_after(jiffies, rth->dst.expires)) {
-				ip_del_fnhe(&FIB_RES_NH(*res), daddr);
-				fnhe = NULL;
-			} else {
-				goto rt_cache;
-			}
-		}
-
-		rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
-
-rt_cache:
+		else
+			rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
 		if (rt_cache_valid(rth)) {
 			skb_dst_set_noref(skb, &rth->dst);
 			goto out;
@@ -2216,39 +2212,31 @@ static struct rtable *__mkroute_output(const struct fib_result *res,
 		 * the loopback interface and the IP_PKTINFO ipi_ifindex will
 		 * be set to the loopback interface as well.
 		 */
-		fi = NULL;
+		do_cache = false;
 	}
 
 	fnhe = NULL;
 	do_cache &= fi != NULL;
-	if (do_cache) {
+	if (fi) {
 		struct rtable __rcu **prth;
 		struct fib_nh *nh = &FIB_RES_NH(*res);
 
 		fnhe = find_exception(nh, fl4->daddr);
+		if (!do_cache)
+			goto add;
 		if (fnhe) {
 			prth = &fnhe->fnhe_rth_output;
-			rth = rcu_dereference(*prth);
-			if (rth && rth->dst.expires &&
-			    time_after(jiffies, rth->dst.expires)) {
-				ip_del_fnhe(nh, fl4->daddr);
-				fnhe = NULL;
-			} else {
-				goto rt_cache;
+		} else {
+			if (unlikely(fl4->flowi4_flags &
+				     FLOWI_FLAG_KNOWN_NH &&
+				     !(nh->nh_gw &&
+				       nh->nh_scope == RT_SCOPE_LINK))) {
+				do_cache = false;
+				goto add;
 			}
+			prth = raw_cpu_ptr(nh->nh_pcpu_rth_output);
 		}
-
-		if (unlikely(fl4->flowi4_flags &
-			     FLOWI_FLAG_KNOWN_NH &&
-			     !(nh->nh_gw &&
-			       nh->nh_scope == RT_SCOPE_LINK))) {
-			do_cache = false;
-			goto add;
-		}
-		prth = raw_cpu_ptr(nh->nh_pcpu_rth_output);
 		rth = rcu_dereference(*prth);
-
-rt_cache:
 		if (rt_cache_valid(rth) && dst_hold_safe(&rth->dst))
 			return rth;
 	}
-- 
2.9.5

^ permalink raw reply related

* Re: [PATCH RFC net-next] net: ipvs: Adjust gso_size for IPPROTO_TCP
From: Julian Anastasov @ 2018-05-02  6:38 UTC (permalink / raw)
  To: Martin KaFai Lau
  Cc: netdev, Tom Herbert, Eric Dumazet, Nikita Shirokov, kernel-team,
	lvs-devel
In-Reply-To: <20180419212324.1542504-1-kafai@fb.com>


	Hello,

On Thu, 19 Apr 2018, Martin KaFai Lau wrote:

> This patch is not a proper fix and mainly serves for discussion purpose.
> It is based on net-next which I have been using to debug the issue.
> 
> The change that works around the issue is in ensure_mtu_is_adequate().
> Other changes are the rippling effect in function arg.
> 
> This bug was uncovered by one of our legacy service that
> are still using ipvs for load balancing.  In that setup,
> the ipvs encap the ipv6-tcp packet in another ipv6 hdr
> before tx it out to eth0.
> 
> The problem is the kernel stack could pass a skb (which was
> originated from a sys_write(tcp_fd)) to the driver with skb->len
> bigger than the device MTU.  In one NIC setup (with gso and tso off)
> that we are using, it upset the NIC/driver and caused the tx queue
> stalled for tens of seconds which is how it got uncovered.
> (On the NIC side, the NIC firmware and driver have been fixed
> to avoid this tx queue stall after seeing this skb).

	In the last week I analyzed the situation and found
that just changes in route.c are able to solve the problems,
at 99% :) I'm posting a separate patch for this.

	Here is what happens, I'm testing with FTP which
starts with connection to port 21 and then with data connection,
from local ftp client to local Virtual IP (forwarded to remote
real server via tunnel).

- TCP creates local route which after commit 839da4d98960
appears to be non-cached, before this commit it is cached.
Route is saved and reused.

- initial traffic for port 21 does not use GSO. But after
every packet IPVS calls maybe_update_pmtu (rt->dst.ops->update_pmtu)
to report the reduced MTU. These updates are stored in fnhe_pmtu
but they do not go to any route, even if we try to get fresh
output route. Why? Because the local routes are not cached, so
they can not use the fnhe. This is what my patch for route.c
will fix. With this fix FTP-DATA gets route with reduced PMTU.

- later, there are large GSO packets in the FTP-DATA connection.
Currently, IPVS does not send ICMP error for exceeded PMTU
by GSO packets (this will be fixed, see patch below). Even
if they reach TCP and it refreshes its route, TCP can not get
any actual PMTU values from the routing, so continues to
use the large gso_size without the patch for route.c

	For the changes in IPVS that I show below as RFC:

- I synchronized the PMTU checks in __mtu_check_toobig* with
IPv4/IPv6 forwarding

- I modified your changes, see ipvs_gso_hlen() and how I use it
at start of ensure_mtu_is_adequate(), after skb is made writable,
before the PMTU checks.

	In my tests, all variants work, just with skb_decrease_gso_size
or even if I add SKB_GSO_DODGY+gso_segs=0. But I'm not sure
if this is safe for the different GSO configurations.

	I'm analyzing what can be changed in route.c, so that
dst.ops->update_pmtu changes to take effect for the provided
route. If that is possible, it will allow the PMTU change to
take immediate effect for the local routes.

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 4527921..7a2a0a89 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -104,9 +104,32 @@ __ip_vs_dst_check(struct ip_vs_dest *dest)
 	return dest_dst;
 }
 
-static inline bool
-__mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
+/* Check if packet size violates MTU */
+static bool __mtu_check_toobig(const struct sk_buff *skb, u32 mtu)
 {
+	if (skb->len <= mtu)
+		return false;
+
+	if (unlikely((ip_hdr(skb)->frag_off & htons(IP_DF)) == 0))
+		return false;
+
+	if (IPCB(skb)->frag_max_size) {
+		/* frag_max_size tell us that, this packet have been
+		 * defragmented by netfilter IPv4 conntrack module.
+		 */
+		if (IPCB(skb)->frag_max_size > mtu)
+			return true; /* largest fragment violate MTU */
+	}
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
+		return false;
+	return true;
+}
+
+/* Check if packet size violates MTU */
+static bool __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
+{
+	if (skb->len <= mtu)
+		return false;
 	if (IP6CB(skb)->frag_max_size) {
 		/* frag_max_size tell us that, this packet have been
 		 * defragmented by netfilter IPv6 conntrack module.
@@ -114,10 +137,9 @@ __mtu_check_toobig_v6(const struct sk_buff *skb, u32 mtu)
 		if (IP6CB(skb)->frag_max_size > mtu)
 			return true; /* largest fragment violate MTU */
 	}
-	else if (skb->len > mtu && !skb_is_gso(skb)) {
-		return true; /* Packet size violate MTU size */
-	}
-	return false;
+	if (skb_is_gso(skb) && skb_gso_validate_network_len(skb, mtu))
+		return false;
+	return true;
 }
 
 /* Get route to daddr, update *saddr, optionally bind route to saddr */
@@ -212,11 +234,42 @@ static inline void maybe_update_pmtu(int skb_af, struct sk_buff *skb, int mtu)
 		ort->dst.ops->update_pmtu(&ort->dst, sk, NULL, mtu);
 }
 
+/* GSO: length of network and transport headers, 0=unsupported */
+static unsigned short ipvs_gso_hlen(struct sk_buff *skb,
+				    struct ip_vs_iphdr *ipvsh)
+{
+	unsigned short hlen = ipvsh->len - ipvsh->off;
+
+	if (skb_shinfo(skb)->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)) {
+		struct tcphdr _tcph, *th;
+
+		th = skb_header_pointer(skb, ipvsh->len, sizeof(_tcph), &_tcph);
+		if (th)
+			return hlen + (th->doff << 2);
+	}
+	return 0;
+}
+ 
 static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 					  int rt_mode,
 					  struct ip_vs_iphdr *ipvsh,
 					  struct sk_buff *skb, int mtu)
 {
+	/* Re-segment traffic from local clients */
+	if (!skb->dev && skb_is_gso(skb)) {
+		unsigned short hlen = ipvs_gso_hlen(skb, ipvsh);
+
+		if (hlen && mtu - hlen < skb_shinfo(skb)->gso_size &&
+		    mtu > hlen) {
+			u16 reduce = skb_shinfo(skb)->gso_size - (mtu - hlen);
+
+			IP_VS_DBG(10, "Reducing gso_size=%u with %u\n",
+				skb_shinfo(skb)->gso_size, reduce);
+			skb_decrease_gso_size(skb_shinfo(skb), reduce);
+			skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
+			skb_shinfo(skb)->gso_segs = 0;
+		}
+	}
 #ifdef CONFIG_IP_VS_IPV6
 	if (skb_af == AF_INET6) {
 		struct net *net = ipvs->net;
@@ -240,9 +293,7 @@ static inline bool ensure_mtu_is_adequate(struct netns_ipvs *ipvs, int skb_af,
 		if ((rt_mode & IP_VS_RT_MODE_TUNNEL) && !sysctl_pmtu_disc(ipvs))
 			return true;
 
-		if (unlikely(ip_hdr(skb)->frag_off & htons(IP_DF) &&
-			     skb->len > mtu && !skb_is_gso(skb) &&
-			     !ip_vs_iph_icmp(ipvsh))) {
+		if (unlikely(__mtu_check_toobig(skb, mtu))) {
 			icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
 				  htonl(mtu));
 			IP_VS_DBG(1, "frag needed for %pI4\n",

Regards

^ permalink raw reply related

* Re: [PATCH] vhost: make msg padding explicit
From: Kevin Easton @ 2018-05-02  6:28 UTC (permalink / raw)
  To: David Miller; +Cc: mst, linux-kernel, jasowang, kvm, virtualization, netdev
In-Reply-To: <20180501.140551.1944561534446599966.davem@davemloft.net>

On Tue, May 01, 2018 at 02:05:51PM -0400, David Miller wrote:
> From: "Michael S. Tsirkin" <mst@redhat.com>
> Date: Tue, 1 May 2018 20:19:19 +0300
> 
> > On Tue, May 01, 2018 at 11:28:22AM -0400, David Miller wrote:
> >> From: "Michael S. Tsirkin" <mst@redhat.com>
> >> Date: Fri, 27 Apr 2018 19:02:05 +0300
> >> 
> >> > There's a 32 bit hole just after type. It's best to
> >> > give it a name, this way compiler is forced to initialize
> >> > it with rest of the structure.
> >> > 
> >> > Reported-by: Kevin Easton <kevin@guarana.org>
> >> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >> 
> >> Michael, will you be sending this directly to Linus or would you like
> >> me to apply it to net or net-next?
> >> 
> >> Thanks.
> > 
> > I'd prefer you to apply it for net and cc stable if possible.
> 
> Ok, applied, and added to my -stable submission queue.

Hold on, this patch changes the layout for i386 (where there is
no padding at all).  And it's part of UAPI.

    - Kevin

> 

^ permalink raw reply

* [PATCH net-next] cxgb4: add new T5 device id's
From: Ganesh Goudar @ 2018-05-02  6:17 UTC (permalink / raw)
  To: netdev, davem; +Cc: nirranjan, indranil, Ganesh Goudar

Add device id's 0x5019, 0x501a and 0x501b for T5
cards.

Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
index 51b1803..90b5274 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_pci_id_tbl.h
@@ -145,6 +145,9 @@ CH_PCI_DEVICE_ID_TABLE_DEFINE_BEGIN
 	CH_PCI_ID_TABLE_FENTRY(0x5016),	/* T580-OCP-SO */
 	CH_PCI_ID_TABLE_FENTRY(0x5017),	/* T520-OCP-SO */
 	CH_PCI_ID_TABLE_FENTRY(0x5018),	/* T540-BT */
+	CH_PCI_ID_TABLE_FENTRY(0x5019),	/* T540-LP-BT */
+	CH_PCI_ID_TABLE_FENTRY(0x501a),	/* T540-SO-BT */
+	CH_PCI_ID_TABLE_FENTRY(0x501b),	/* T540-SO-CR */
 	CH_PCI_ID_TABLE_FENTRY(0x5080),	/* Custom T540-cr */
 	CH_PCI_ID_TABLE_FENTRY(0x5081),	/* Custom T540-LL-cr */
 	CH_PCI_ID_TABLE_FENTRY(0x5082),	/* Custom T504-cr */
-- 
2.1.0

^ permalink raw reply related

* Re: linux-next: manual merge of the bpf-next tree with the bpf tree
From: Song Liu @ 2018-05-02  6:05 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Daniel Borkmann, Alexei Starovoitov, Networking,
	Linux-Next Mailing List, Linux Kernel Mailing List, Yonghong Song
In-Reply-To: <20180502155014.4f5acdcb@canb.auug.org.au>



> On May 1, 2018, at 10:50 PM, Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> Hi Song,
> 
> On Wed, 2 May 2018 04:40:20 +0000 Song Liu <songliubraving@fb.com> wrote:
>> 
>>> - 	CHECK(build_id_matches < 1, "build id match",
>>> - 	      "Didn't find expected build ID from the map\n");
>>> + 	if (CHECK(build_id_matches < 1, "build id match",
>>> -		  "Didn't find expected build ID from the map"))
>>> ++		  "Didn't find expected build ID from the map\n"))  
>> 
>> ^^^  Is there a "+" at the beginning of this line? 
> 
> No, this is a merge resolution diff, so the ++ means that the line did
> not appear in either parent commit.

Hi Stephen,

Thanks for the explanation! 

Song

^ permalink raw reply

* Re: RTL8723BE performance regression
From: Pkshih @ 2018-05-02  5:58 UTC (permalink / raw)
  To: jprvita@gmail.com, Larry.Finger@lwfinger.net
  Cc: linux-kernel@vger.kernel.org, jprvita@endlessm.com, Birming Chiu,
	drake@endlessm.com, Chaoming_Li, kvalo@codeaurora.org,
	莊彥宣, derosier@gmail.com, Steven Ting,
	netdev@vger.kernel.org, linux@endlessm.com, Shaofu,
	linux-wireless@vger.kernel.org
In-Reply-To: <5B2DA6FDDF928F4E855344EE0A5C39D13BEC14C0@RTITMBSV07.realtek.com.tw>

On Wed, 2018-05-02 at 05:44 +0000, Pkshih wrote:
> 
> > -----Original Message-----
> > From: João Paulo Rechi Vita [mailto:jprvita@gmail.com]
> > Sent: Wednesday, May 02, 2018 6:41 AM
> > To: Larry Finger
> > Cc: Steve deRosier; 莊彥宣; Pkshih; Birming Chiu; Shaofu; Steven Ting; Chaoming_Li; Kalle Valo;
> > linux-wireless; Network Development; LKML; Daniel Drake; João Paulo Rechi Vita; linux@endlessm.c
> om
> > Subject: Re: RTL8723BE performance regression
> > 
> > On Tue, Apr 3, 2018 at 7:51 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
> > > On 04/03/2018 09:37 PM, João Paulo Rechi Vita wrote:
> > >>
> > >> On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net>
> > >> wrote:
> > >>
> > >> (...)
> > >>
> > >>> As the antenna selection code changes affected your first bisection, do
> > >>> you
> > >>> have one of those HP laptops with only one antenna and the incorrect
> > >>> coding
> > >>> in the FUSE?
> > >>
> > >>
> > >> Yes, that is why I've been passing ant_sel=1 during my tests -- this
> > >> was needed to achieve a good performance in the past, before this
> > >> regression. I've also opened the laptop chassis and confirmed the
> > >> antenna cable is plugged to the connector labeled with "1" on the
> > >> card.
> > >>
> > >>> If so, please make sure that you still have the same signal
> > >>> strength for good and bad cases. I have tried to keep the driver and the
> > >>> btcoex code in sync, but there may be some combinations of antenna
> > >>> configuration and FUSE contents that cause the code to fail.
> > >>>
> > >>
> > >> What is the recommended way to monitor the signal strength?
> > >
> > >
> > > The btcoex code is developed for multiple platforms by a different group
> > > than the Linux driver. I think they made a change that caused ant_sel to
> > > switch from 1 to 2. At least numerous comments at
> > > github.com/lwfinger/rtlwifi_new claimed they needed to make that change.
> > >
> > > Mhy recommended method is to verify the wifi device name with "iw dev". Then
> > > using that device
> > >
> > > sudo iw dev <dev_name> scan | egrep "SSID|signal"
> > >
> > 
> > I have confirmed that the performance regression is indeed tied to
> > signal strength: on the good cases signal was between -16 and -8 dBm,
> > whereas in bad cases signal was always between -50 to - 40 dBm. I've
> > also switched to testing bandwidth in controlled LAN environment using
> > iperf3, as suggested by Steve deRosier, with the DUT being the only
> > machine connected to the 2.4 GHz radio and the machine running the
> > iperf3 server connected via ethernet.
> > 
> 
> We have new experimental results in commit af8a41cccf8f46 ("rtlwifi: cleanup 
> 8723be ant_sel definition"). You can use the above commit and do the same 
> experiments (with ant_sel=0, 1 and 2) in your side, and then share your results.
> Since performance is tied to signal strength, you can only share signal strength.
> 

Please pay attention to cold reboot once ant_sel is changed.


^ permalink raw reply

* Re: [PATCH RFC 6/9] veth: Add ndo_xdp_xmit
From: Jesper Dangaard Brouer @ 2018-05-02  5:56 UTC (permalink / raw)
  To: Toshiaki Makita
  Cc: Toshiaki Makita, netdev, Tariq Toukan, Daniel Borkmann,
	Alexei Starovoitov, Eran Ben Elisha, brouer
In-Reply-To: <874a1c42-bbdf-0ab4-b96e-3b07e6fb4aef@gmail.com>

On Wed, 2 May 2018 12:33:47 +0900
Toshiaki Makita <toshiaki.makita1@gmail.com> wrote:

> On 18/05/01 (火) 17:14, Jesper Dangaard Brouer wrote:
> > On Tue, 1 May 2018 10:02:01 +0900 Toshiaki Makita 
> > <makita.toshiaki@lab.ntt.co.jp> wrote:
> >   
> >> On 2018/05/01 2:27, Jesper Dangaard Brouer wrote:  
> >>> On Thu, 26 Apr 2018 19:52:40 +0900 Toshiaki Makita 
> >>> <makita.toshiaki@lab.ntt.co.jp> wrote:
> >>>   
> >>>> On 2018/04/26 5:24, Jesper Dangaard Brouer wrote:  
> >>>>> On Tue, 24 Apr 2018 23:39:20 +0900 Toshiaki Makita 
> >>>>> <toshiaki.makita1@gmail.com> wrote:
> >>>>>   
> >>>>>> +static int veth_xdp_xmit(struct net_device *dev, struct 
[...]
> >>>>> 
> >>>>> I'm not sure you can make this assumption, that xdp_frames 
> >>>>> coming from another device driver uses a refcnt based memory 
> >>>>> model. But maybe I'm confused, as this looks like an SKB 
> >>>>> receive path, but in the ndo_xdp_xmit().  
> >>>> 
> >>>> I find this path similar to cpumap, which creates skb from 
> >>>> redirected xdp frame. Once it is converted to skb, skb head is 
> >>>> freed by page_frag_free, so anyway I needed to get the
> >>>> refcount here regardless of memory model.  
> >>> 
> >>> Yes I know, I wrote cpumap ;-)
> >>> 
> >>> First of all, I don't want to see such xdp_frame to SKB 
> >>> conversion code in every driver.  Because that increase the 
> >>> chances of errors.  And when looking at the details, then it 
> >>> seems that you have made the mistake of making it possible to 
> >>> leak xdp_frame info to the SKB (which cpumap takes into 
> >>> account).  
> >> 
> >> Do you mean leaving xdp_frame in skb->head is leaking something? 
> >> how?  
> > 
> > Like commit 97e19cce05e5 ("bpf: reserve xdp_frame size in xdp 
> > headroom") and commit 6dfb970d3dbd ("xdp: avoid leaking info stored 
> > in frame data on page reuse").  
> 
> Thanks for sharing the info.
> 
> > But this time, the concern is a bpf_prog attached at TC/bpf_cls 
> > level, that can that can adjust head via bpf_skb_change_head (for
> > XDP it is bpf_xdp_adjust_head) into the area used by xdp_frame.  As 
> > described in https://git.kernel.org/davem/net-next/c/6dfb970d3dbd in 
> > is not super critical at the moment, as this _currently_ runs as 
> > CAP_SYS_ADMIN, but we would like to move towards CAP_NET_ADMIN.  
> 
> What I don't get is why special casing xdp_frame info. My assumption is
> that the area above skb->mac_header is uninit kernel memory and should
> not be readable by unprivileged users anyway. So I didn't clear the area
> at this point.

This is also my understanding. But Alexei explicitly asked me to handle
this xdp_frame info case.  I assume that more work is required for the
transition from CAP_SYS_ADMIN to CAP_NET_ADMIN, we just don't want to
add more/new code that makes this more difficult.


> >>> Second, I think the refcnt scheme here is wrong. The xdp_frame 
> >>> should be "owned" by XDP and have the proper refcnt to deliver
> >>> it directly to the network stack.
> >>> 
> >>> Third, if we choose that we want a fallback, in-case XDP is not 
> >>> enabled on egress dev (but it have an ndo_xdp_xmit), then it 
> >>> should be placed in the generic/core code.  E.g. 
> >>> __bpf_tx_xdp_map() could look at the return code from 
> >>> dev->netdev_ops->ndo_xdp() and create an SKB.  (Hint, this would 
> >>> make it easy to implement TX bulking towards the dev).  
> >> 
> >> Right, this is a much cleaner way.
> >> 
> >> Although I feel like we should add this fallback for veth because 
> >> it requires something which is different from other drivers 
> >> (enabling XDP on the peer device of the egress device),  
> > 
> > (This is why I Cc'ed Tariq...)
> > 
> > This is actually a general problem with the xdp "xmit" side (and not
> >  specific to veth driver). The problem exists for other drivers as 
> > well.
> > 
> > The problem is that a driver can implement ndo_xdp_xmit(), but the 
> > driver might not have allocated the necessary XDP TX-queue resources
> >  yet (or it might not be possible due to system resource limits).
> > 
> > The current "hack" is to load an XDP prog on the egress device, and 
> > then assume that this cause the driver to also allocate the XDP 
> > ndo_xdo_xmit side HW resources.  This is IMHO a wrong assumption.
>  >
> > We need a more generic way to test if a net_device is "ready/enabled"
> > for handling xdp_frames via ndo_xdp_xmit.  And Tariq had some ideas
> > on how to implement this...  
> 
> My assumption on REDIRECT requirement came from this.
> https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/commit/?id=306da4e685b

Yes, I wrote that.

> I guess you are saying thing are changing, and having an XDP program 
> attached on the egress device is no longer generally sufficient. Looking 
> forward to Tariq's solution.

Yes, (hopefully) things are changing. Loading a dummy XDP program to
enable ndo_xdp_xmit, turned out to be a bad model, for all the reasons
I listed.

I hope Tariq find some time to work on this ... ;-)


> Toshiaki Makita
> 
> > 
> > My opinion is that it is a waste of (HW/mem) resources to always 
> > allocate resources for ndo_xdp_xmit when loading an XDP program. 
> > Because what if my use-cases are XDP_DROP DDoS filter, or CPUMAP 
> > redirect load-balancer, then I don't want/need ndo_xdp_xmit.  E.g. 
> > today using ixgbe on machines with more than 96 CPUs, will fail due 
> > to limited TX queue resources. Thus, blocking the mentioned 
> > use-cases.
> > 

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: linux-next: manual merge of the bpf-next tree with the bpf tree
From: Stephen Rothwell @ 2018-05-02  5:50 UTC (permalink / raw)
  To: Song Liu
  Cc: Daniel Borkmann, Alexei Starovoitov, Networking,
	Linux-Next Mailing List, Linux Kernel Mailing List, Yonghong Song
In-Reply-To: <0942C278-336F-4795-BE63-FAD7FBAA231B@fb.com>

[-- Attachment #1: Type: text/plain, Size: 588 bytes --]

Hi Song,

On Wed, 2 May 2018 04:40:20 +0000 Song Liu <songliubraving@fb.com> wrote:
>
> > - 	CHECK(build_id_matches < 1, "build id match",
> > - 	      "Didn't find expected build ID from the map\n");
> > + 	if (CHECK(build_id_matches < 1, "build id match",
> > -		  "Didn't find expected build ID from the map"))
> > ++		  "Didn't find expected build ID from the map\n"))  
> 
> ^^^  Is there a "+" at the beginning of this line? 

No, this is a merge resolution diff, so the ++ means that the line did
not appear in either parent commit.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH net] sctp: fix the issue that the cookie-ack with auth can't get processed
From: Xin Long @ 2018-05-02  5:45 UTC (permalink / raw)
  To: network dev, linux-sctp; +Cc: davem, Marcelo Ricardo Leitner, Neil Horman

When auth is enabled for cookie-ack chunk, in sctp_inq_pop, sctp
processes auth chunk first, then continues to the next chunk in
this packet if chunk_end + chunk_hdr size < skb_tail_pointer().
Otherwise, it will go to the next packet or discard this chunk.

However, it missed the fact that cookie-ack chunk's size is equal
to chunk_hdr size, which couldn't match that check, and thus this
chunk would not get processed.

This patch fixes it by changing the check to chunk_end + chunk_hdr
size <= skb_tail_pointer().

Fixes: 26b87c788100 ("net: sctp: fix remote memory pressure from excessive queueing")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 net/sctp/inqueue.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/inqueue.c b/net/sctp/inqueue.c
index 23ebc53..eb93ffe 100644
--- a/net/sctp/inqueue.c
+++ b/net/sctp/inqueue.c
@@ -217,7 +217,7 @@ struct sctp_chunk *sctp_inq_pop(struct sctp_inq *queue)
 	skb_pull(chunk->skb, sizeof(*ch));
 	chunk->subh.v = NULL; /* Subheader is no longer valid.  */

-	if (chunk->chunk_end + sizeof(*ch) < skb_tail_pointer(chunk->skb)) {
+	if (chunk->chunk_end + sizeof(*ch) <= skb_tail_pointer(chunk->skb)) {
 		/* This is not a singleton */
 		chunk->singleton = 0;
 	} else if (chunk->chunk_end > skb_tail_pointer(chunk->skb)) {
-- 
2.1.0

^ permalink raw reply related

* RE: RTL8723BE performance regression
From: Pkshih @ 2018-05-02  5:44 UTC (permalink / raw)
  To: João Paulo Rechi Vita, Larry Finger
  Cc: Steve deRosier, 莊彥宣, Birming Chiu, Shaofu,
	Steven Ting, Chaoming_Li, Kalle Valo, linux-wireless,
	Network Development, LKML, Daniel Drake,
	João Paulo Rechi Vita, linux@endlessm.com
In-Reply-To: <CA+A7VXUUsZHD2gr9TBcV6jZBOPGZv7_vK-wWZ0MvGgiCnAiUgQ@mail.gmail.com>



> -----Original Message-----
> From: João Paulo Rechi Vita [mailto:jprvita@gmail.com]
> Sent: Wednesday, May 02, 2018 6:41 AM
> To: Larry Finger
> Cc: Steve deRosier; 莊彥宣; Pkshih; Birming Chiu; Shaofu; Steven Ting; Chaoming_Li; Kalle Valo;
> linux-wireless; Network Development; LKML; Daniel Drake; João Paulo Rechi Vita; linux@endlessm.com
> Subject: Re: RTL8723BE performance regression
> 
> On Tue, Apr 3, 2018 at 7:51 PM, Larry Finger <Larry.Finger@lwfinger.net> wrote:
> > On 04/03/2018 09:37 PM, João Paulo Rechi Vita wrote:
> >>
> >> On Tue, Apr 3, 2018 at 7:28 PM, Larry Finger <Larry.Finger@lwfinger.net>
> >> wrote:
> >>
> >> (...)
> >>
> >>> As the antenna selection code changes affected your first bisection, do
> >>> you
> >>> have one of those HP laptops with only one antenna and the incorrect
> >>> coding
> >>> in the FUSE?
> >>
> >>
> >> Yes, that is why I've been passing ant_sel=1 during my tests -- this
> >> was needed to achieve a good performance in the past, before this
> >> regression. I've also opened the laptop chassis and confirmed the
> >> antenna cable is plugged to the connector labeled with "1" on the
> >> card.
> >>
> >>> If so, please make sure that you still have the same signal
> >>> strength for good and bad cases. I have tried to keep the driver and the
> >>> btcoex code in sync, but there may be some combinations of antenna
> >>> configuration and FUSE contents that cause the code to fail.
> >>>
> >>
> >> What is the recommended way to monitor the signal strength?
> >
> >
> > The btcoex code is developed for multiple platforms by a different group
> > than the Linux driver. I think they made a change that caused ant_sel to
> > switch from 1 to 2. At least numerous comments at
> > github.com/lwfinger/rtlwifi_new claimed they needed to make that change.
> >
> > Mhy recommended method is to verify the wifi device name with "iw dev". Then
> > using that device
> >
> > sudo iw dev <dev_name> scan | egrep "SSID|signal"
> >
> 
> I have confirmed that the performance regression is indeed tied to
> signal strength: on the good cases signal was between -16 and -8 dBm,
> whereas in bad cases signal was always between -50 to - 40 dBm. I've
> also switched to testing bandwidth in controlled LAN environment using
> iperf3, as suggested by Steve deRosier, with the DUT being the only
> machine connected to the 2.4 GHz radio and the machine running the
> iperf3 server connected via ethernet.
> 

We have new experimental results in commit af8a41cccf8f46 ("rtlwifi: cleanup 
8723be ant_sel definition"). You can use the above commit and do the same 
experiments (with ant_sel=0, 1 and 2) in your side, and then share your results.
Since performance is tied to signal strength, you can only share signal strength.

Regards
PK


^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox