Netdev List
 help / color / mirror / Atom feed
* [PATCH bpf-next v2] xsk: include XDP meta data in AF_XDP frames
From: Björn Töpel @ 2018-08-30 13:12 UTC (permalink / raw)
  To: bjorn.topel, magnus.karlsson, magnus.karlsson, ast, daniel,
	netdev
  Cc: Björn Töpel, brouer
In-Reply-To: <20180830080900.16350-1-bjorn.topel@gmail.com>

From: Björn Töpel <bjorn.topel@intel.com>

Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
not include XDP meta data in the data buffers copied out to the user
application.

In this commit, we check if meta data is available, and if so, it is
prepended to the frame.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
---
 net/xdp/xsk.c | 24 ++++++++++++++++++------
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 4e937cd7c17d..569048e299df 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
 
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
-	void *buffer;
+	void *to_buf, *from_buf;
+	u32 metalen;
 	u64 addr;
 	int err;
 
 	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-	    len > xs->umem->chunk_size_nohr) {
+	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
 		xs->rx_dropped++;
 		return -ENOSPC;
 	}
 
 	addr += xs->umem->headroom;
 
-	buffer = xdp_umem_get_data(xs->umem, addr);
-	memcpy(buffer, xdp->data, len);
+	if (unlikely(xdp_data_meta_unsupported(xdp))) {
+		from_buf = xdp->data;
+		metalen = 0;
+	} else {
+		from_buf = xdp->data_meta;
+		metalen = xdp->data - xdp->data_meta;
+	}
+
+	to_buf = xdp_umem_get_data(xs->umem, addr);
+	memcpy(to_buf, from_buf, len + metalen);
+	addr += metalen;
 	err = xskq_produce_batch_desc(xs->rx, addr, len);
 	if (!err) {
 		xskq_discard_addr(xs->umem->fq);
@@ -111,6 +121,7 @@ void xsk_flush(struct xdp_sock *xs)
 
 int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
+	u32 metalen = xdp->data - xdp->data_meta;
 	u32 len = xdp->data_end - xdp->data;
 	void *buffer;
 	u64 addr;
@@ -120,7 +131,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 		return -EINVAL;
 
 	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
-	    len > xs->umem->chunk_size_nohr) {
+	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
 		xs->rx_dropped++;
 		return -ENOSPC;
 	}
@@ -128,7 +139,8 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 	addr += xs->umem->headroom;
 
 	buffer = xdp_umem_get_data(xs->umem, addr);
-	memcpy(buffer, xdp->data, len);
+	memcpy(buffer, xdp->data_meta, len + metalen);
+	addr += metalen;
 	err = xskq_produce_batch_desc(xs->rx, addr, len);
 	if (!err) {
 		xskq_discard_addr(xs->umem->fq);
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH V1] add huawei ibma driver modules The driver is used for communication between in-band management agent(iBMA) and out-of-band management controller(iBMC) via pcie bus in Huawei V3 server. The driver provides character device,VNIC and black box interface for application layer.
From: Randy Dunlap @ 2018-08-30 17:15 UTC (permalink / raw)
  To: xiongsujuan, davem, zhaochen6, aviad.krawczyk, romain.perier,
	bhelgaas, keescook, colin.king
  Cc: netdev, linux-kernel
In-Reply-To: <1535591472-44997-1-git-send-email-xiongsujuan.xiongsujuan@huawei.com>

Hi,

Please fix the Subject: and patch description.

(more below)

On 08/29/2018 06:11 PM, xiongsujuan wrote:
> diff --git a/drivers/net/ethernet/huawei/Kconfig b/drivers/net/ethernet/huawei/Kconfig
> index c1a95ae..68748e9 100644
> --- a/drivers/net/ethernet/huawei/Kconfig
> +++ b/drivers/net/ethernet/huawei/Kconfig
> @@ -4,7 +4,7 @@
>  
>  config NET_VENDOR_HUAWEI
>  	bool "Huawei devices"
> -	default y
> +	default y 

Don't add a space after the 'y'.

>  	---help---
>  	  If you have a network (Ethernet) card belonging to this class, say Y.
>  	  Note that the answer to this question doesn't directly affect the
> @@ -16,4 +16,7 @@ if NET_VENDOR_HUAWEI
>  
>  source "drivers/net/ethernet/huawei/hinic/Kconfig"
>  
> +source "drivers/net/ethernet/huawei/ibma/Kconfig"
> +
> +
>  endif # NET_VENDOR_HUAWEI

> diff --git a/drivers/net/ethernet/huawei/ibma/Kconfig b/drivers/net/ethernet/huawei/ibma/Kconfig
> new file mode 100644
> index 0000000..810cc60
> --- /dev/null
> +++ b/drivers/net/ethernet/huawei/ibma/Kconfig
> @@ -0,0 +1,17 @@
> +#
> +# Huawei driver configuration
> +#
> +
> +config IBMANIC
> +	tristate "Huawei IBMA PCIE Network Interface Card"
> +	depends on (PCI_MSI && X86)
> +	---help---
> +	  This driver supports IBMANIC PCIE Ethernet cards.
> +	  To compile this driver as part of the kernel, choose Y here.
> +	  If unsure, choose N.
> +	  The default is compiled as module.
> +
> +
> +
> +
> +

Drop those trailing blank lines.

What makes the default be compiled as module?

thanks,
-- 
~Randy

^ permalink raw reply

* [PATCH 2/2] vti6: do not check for ignore_df in order to update pmtu
From: Thadeu Lima de Souza Cascardo @ 2018-08-30 12:58 UTC (permalink / raw)
  To: netdev; +Cc: yoshfuji, kuznet, davem, herbert, steffen.klassert, eyal.birger
In-Reply-To: <20180830125817.4567-1-cascardo@canonical.com>

Before commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), skb was scrubbed before checking for ignore_df. The
scrubbing meant ignore_df was false, making the check irrelevant. Now that the
scrubbing happens after that, some packets might fail the checking and dst
will not have its pmtu updated.

Not only that, but too big skb will be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
---
 net/ipv6/ip6_vti.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_vti.c b/net/ipv6/ip6_vti.c
index c72ae3a4fe09..fbd3752ea587 100644
--- a/net/ipv6/ip6_vti.c
+++ b/net/ipv6/ip6_vti.c
@@ -481,7 +481,7 @@ vti6_xmit(struct sk_buff *skb, struct net_device *dev, struct flowi *fl)
 	}
 
 	mtu = dst_mtu(dst);
-	if (!skb->ignore_df && skb->len > mtu) {
+	if (skb->len > mtu) {
 		skb_dst_update_pmtu(skb, mtu);
 
 		if (skb->protocol == htons(ETH_P_IPV6)) {
-- 
2.17.1

^ permalink raw reply related

* [PATCH 1/2] xfrm6: call kfree_skb when skb is toobig
From: Thadeu Lima de Souza Cascardo @ 2018-08-30 12:58 UTC (permalink / raw)
  To: netdev; +Cc: yoshfuji, kuznet, davem, herbert, steffen.klassert, eyal.birger

After commit d6990976af7c5d8f55903bfb4289b6fb030bf754 ("vti6: fix PMTU caching
and reporting on xmit"), some too big skbs might be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1

The fix is to call kfree_skb in case of transmit failures.

Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
---
 net/ipv6/xfrm6_output.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index 5959ce9620eb..6a74080005cf 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -170,9 +170,11 @@ static int __xfrm6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
 
 	if (toobig && xfrm6_local_dontfrag(skb)) {
 		xfrm6_local_rxpmtu(skb, mtu);
+		kfree_skb(skb);
 		return -EMSGSIZE;
 	} else if (!skb->ignore_df && toobig && skb->sk) {
 		xfrm_local_error(skb, mtu);
+		kfree_skb(skb);
 		return -EMSGSIZE;
 	}
 
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames
From: Björn Töpel @ 2018-08-30 12:51 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Karlsson, Magnus, Magnus Karlsson, ast, Netdev,
	Björn Töpel, Jesper Dangaard Brouer
In-Reply-To: <0c3ab4c5-3b22-d37d-d1a0-98d343cc7543@iogearbox.net>

Den tors 30 aug. 2018 kl 14:37 skrev Daniel Borkmann <daniel@iogearbox.net>:
>
> On 08/30/2018 10:09 AM, Björn Töpel wrote:
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
> > not include XDP meta data in the data buffers copied out to the user
> > application.
> >
> > In this commit, we check if meta data is available, and if so, it is
> > prepended to the frame.
> >
> > Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> > ---
> >  net/xdp/xsk.c | 37 ++++++++++++++++++++++++++++---------
> >  1 file changed, 28 insertions(+), 9 deletions(-)
> >
> > diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> > index 4e937cd7c17d..817e4cee1540 100644
> > --- a/net/xdp/xsk.c
> > +++ b/net/xdp/xsk.c
> > @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
> >
> >  static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
> >  {
> > -     void *buffer;
> > +     void *to_buf, *from_buf;
> > +     u32 metalen;
> >       u64 addr;
> >       int err;
> >
> >       if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> > -         len > xs->umem->chunk_size_nohr) {
> > +         len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> >               xs->rx_dropped++;
> >               return -ENOSPC;
> >       }
> >
> >       addr += xs->umem->headroom;
> >
> > -     buffer = xdp_umem_get_data(xs->umem, addr);
> > -     memcpy(buffer, xdp->data, len);
> > +     if (xdp_data_meta_unsupported(xdp)) {
>
> Nit: Probably makes sense to wrap with unlikely() since we expect all
> drivers to implement this.
>

I'll send a v2 fixing this...

> > +             from_buf = xdp->data;
> > +             metalen = 0;
> > +     } else {
> > +             from_buf = xdp->data_meta;
> > +             metalen = xdp->data - xdp->data_meta;
> > +     }
> > +
> > +     to_buf = xdp_umem_get_data(xs->umem, addr);
> > +     memcpy(to_buf, from_buf, len + metalen);
> > +     addr += metalen;
> >       err = xskq_produce_batch_desc(xs->rx, addr, len);
> >       if (!err) {
> >               xskq_discard_addr(xs->umem->fq);
> > @@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs)
> >
> >  int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> >  {
> > -     u32 len = xdp->data_end - xdp->data;
> > -     void *buffer;
> > +     u32 metalen, len = xdp->data_end - xdp->data;
> > +     void *to_buf, *from_buf;
> >       u64 addr;
> >       int err;
> >
> > @@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
> >               return -EINVAL;
> >
> >       if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> > -         len > xs->umem->chunk_size_nohr) {
> > +         len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
> >               xs->rx_dropped++;
> >               return -ENOSPC;
> >       }
> >
> >       addr += xs->umem->headroom;
> >
> > -     buffer = xdp_umem_get_data(xs->umem, addr);
> > -     memcpy(buffer, xdp->data, len);
> > +     if (xdp_data_meta_unsupported(xdp)) {
> > +             from_buf = xdp->data;
> > +             metalen = 0;
>
> Note that this condition should be dead code. netif_receive_generic_xdp()
> sets xdp->data_meta to xdp->data, so all good here and above is never hit.
>

...and this! Thanks for pointing this out!


Björn

> > +     } else {
> > +             from_buf = xdp->data_meta;
> > +             metalen = xdp->data - xdp->data_meta;
> > +     }
> > +
> > +     to_buf = xdp_umem_get_data(xs->umem, addr);
> > +     memcpy(to_buf, from_buf, len + metalen);
> > +     addr += metalen;
> >       err = xskq_produce_batch_desc(xs->rx, addr, len);
> >       if (!err) {
> >               xskq_discard_addr(xs->umem->fq);
> >
>

^ permalink raw reply

* Re: [PATCH bpf-next] xsk: include XDP meta data in AF_XDP frames
From: Daniel Borkmann @ 2018-08-30 12:37 UTC (permalink / raw)
  To: Björn Töpel, magnus.karlsson, magnus.karlsson, ast,
	netdev
  Cc: Björn Töpel, brouer
In-Reply-To: <20180830080900.16350-1-bjorn.topel@gmail.com>

On 08/30/2018 10:09 AM, Björn Töpel wrote:
> From: Björn Töpel <bjorn.topel@intel.com>
> 
> Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
> not include XDP meta data in the data buffers copied out to the user
> application.
> 
> In this commit, we check if meta data is available, and if so, it is
> prepended to the frame.
> 
> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
> ---
>  net/xdp/xsk.c | 37 ++++++++++++++++++++++++++++---------
>  1 file changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
> index 4e937cd7c17d..817e4cee1540 100644
> --- a/net/xdp/xsk.c
> +++ b/net/xdp/xsk.c
> @@ -55,20 +55,30 @@ EXPORT_SYMBOL(xsk_umem_discard_addr);
>  
>  static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
>  {
> -	void *buffer;
> +	void *to_buf, *from_buf;
> +	u32 metalen;
>  	u64 addr;
>  	int err;
>  
>  	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> -	    len > xs->umem->chunk_size_nohr) {
> +	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>  		xs->rx_dropped++;
>  		return -ENOSPC;
>  	}
>  
>  	addr += xs->umem->headroom;
>  
> -	buffer = xdp_umem_get_data(xs->umem, addr);
> -	memcpy(buffer, xdp->data, len);
> +	if (xdp_data_meta_unsupported(xdp)) {

Nit: Probably makes sense to wrap with unlikely() since we expect all
drivers to implement this.

> +		from_buf = xdp->data;
> +		metalen = 0;
> +	} else {
> +		from_buf = xdp->data_meta;
> +		metalen = xdp->data - xdp->data_meta;
> +	}
> +
> +	to_buf = xdp_umem_get_data(xs->umem, addr);
> +	memcpy(to_buf, from_buf, len + metalen);
> +	addr += metalen;
>  	err = xskq_produce_batch_desc(xs->rx, addr, len);
>  	if (!err) {
>  		xskq_discard_addr(xs->umem->fq);
> @@ -111,8 +121,8 @@ void xsk_flush(struct xdp_sock *xs)
>  
>  int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>  {
> -	u32 len = xdp->data_end - xdp->data;
> -	void *buffer;
> +	u32 metalen, len = xdp->data_end - xdp->data;
> +	void *to_buf, *from_buf;
>  	u64 addr;
>  	int err;
>  
> @@ -120,15 +130,24 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
>  		return -EINVAL;
>  
>  	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
> -	    len > xs->umem->chunk_size_nohr) {
> +	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
>  		xs->rx_dropped++;
>  		return -ENOSPC;
>  	}
>  
>  	addr += xs->umem->headroom;
>  
> -	buffer = xdp_umem_get_data(xs->umem, addr);
> -	memcpy(buffer, xdp->data, len);
> +	if (xdp_data_meta_unsupported(xdp)) {
> +		from_buf = xdp->data;
> +		metalen = 0;

Note that this condition should be dead code. netif_receive_generic_xdp()
sets xdp->data_meta to xdp->data, so all good here and above is never hit.

> +	} else {
> +		from_buf = xdp->data_meta;
> +		metalen = xdp->data - xdp->data_meta;
> +	}
> +
> +	to_buf = xdp_umem_get_data(xs->umem, addr);
> +	memcpy(to_buf, from_buf, len + metalen);
> +	addr += metalen;
>  	err = xskq_produce_batch_desc(xs->rx, addr, len);
>  	if (!err) {
>  		xskq_discard_addr(xs->umem->fq);
> 

^ permalink raw reply

* [PATCH net] tcp: do not restart timewait timer on rst reception
From: Florian Westphal @ 2018-08-30 12:24 UTC (permalink / raw)
  To: netdev; +Cc: edumazet, Florian Westphal

RFC 1337 says:
 ''Ignore RST segments in TIME-WAIT state.
   If the 2 minute MSL is enforced, this fix avoids all three hazards.''

So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk
expire rather than removing it instantly when a reset is received.

However, Linux will also re-start the TIME-WAIT timer.

This causes connect to fail when tying to re-use ports or very long
delays (until syn retry interval exceeds MSL).

packetdrill test case:
// Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
`sysctl net.ipv4.tcp_rfc1337=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive first segment
0.310 < P. 1:1001(1000) ack 1 win 46

// Send one ACK
0.310 > . 1:1(0) ack 1001

// read 1000 byte
0.310 read(4, ..., 1000) = 1000

// Application writes 100 bytes
0.350 write(4, ..., 100) = 100
0.350 > P. 1:101(100) ack 1001

// ACK
0.500 < . 1001:1001(0) ack 101 win 257

// close the connection
0.600 close(4) = 0
0.600 > F. 101:101(0) ack 1001 win 244

// Our side is in FIN_WAIT_1 & waits for ack to fin
0.7 < . 1001:1001(0) ack 102 win 244

// Our side is in FIN_WAIT_2 with no outstanding data.
0.8 < F. 1001:1001(0) ack 102 win 244
0.8 > . 102:102(0) ack 1002 win 244

// Our side is now in TIME_WAIT state, send ack for fin.
0.9 < F. 1002:1002(0) ack 102 win 244
0.9 > . 102:102(0) ack 1002 win 244

// Peer reopens with in-window SYN:
1.000 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>

// Therefore, reply with ACK.
1.000 > . 102:102(0) ack 1002 win 244

// Peer sends RST for this ACK.  Normally this RST results
// in tw socket removal, but rfc1337=1 setting prevents this.
1.100 < R 1002:1002(0) win 244

// second syn. Due to rfc1337=1 expect another pure ACK.
31.0 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>
31.0 > . 102:102(0) ack 1002 win 244

// .. and another RST from peer.
31.1 < R 1002:1002(0) win 244
31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

// third syn after one minute.  Time-Wait socket should have expired by now.
63.0 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>

// so we expect a syn-ack & 3whs to proceed from here on.
63.0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>

Without this patch, 'ss' shows restarts of tw timer and last packet is
thus just another pure ack, more than one minute later.

This restores the original code from commit 283fd6cf0be690a83
("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

For some reason the else branch was removed/lost in 1f28b683339f7
("Merge in TCP/UDP optimizations and [..]") and timer restart became
unconditional.

Reported-by: Michal Tesar <mtesar@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/ipv4/tcp_minisocks.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 1dda1341a223..b690132f5da2 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -184,8 +184,9 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
 				inet_twsk_deschedule_put(tw);
 				return TCP_TW_SUCCESS;
 			}
+		} else {
+			inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
 		}
-		inet_twsk_reschedule(tw, TCP_TIMEWAIT_LEN);
 
 		if (tmp_opt.saw_tstamp) {
 			tcptw->tw_ts_recent	  = tmp_opt.rcv_tsval;
-- 
2.16.4

^ permalink raw reply related

* Re: [PATCH bpf-next 0/3] bpf: implement percpu map pretty print for bpffs and bpftool
From: Daniel Borkmann @ 2018-08-30 12:25 UTC (permalink / raw)
  To: Yonghong Song, ast, kafai, netdev; +Cc: kernel-team
In-Reply-To: <20180829214315.4065753-1-yhs@fb.com>

On 08/29/2018 11:43 PM, Yonghong Song wrote:
> Commit a26ca7c982cb ("bpf: btf: Add pretty print support to the
> basic arraymap") and Commit 699c86d6ec21 ("bpf: btf: add pretty print
> for hash/lru_hash maps") added bpffs pretty print for array, hash and
> lru hash maps. The pretty print gives users a structurally formatted
> dump for keys/values which much easy to understand than raw bytes.
> 
> This patch set implemented bpffs pretty print support for
> percpu arraymap, percpu hashmap and percpu lru hashmap.
> For complex key/value types, the pretty print here is even more useful
> due to
>   . large volumne of data making it even harder to correlate bytes
>     to a particular field in a particular cpu.
>   . kernel rounds the value size for each cpu to multiple of 8.
>     User has to be aware of this otherwise wrong value may be
>     derived from cpu 1/2/...
> 
> For example, we may have a bpffs pretty print like below:
>    43602: {
>         cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
>         cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
>         cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
>         cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
>    }
> for a percpu map.
> 
> This patch also added percpu formatted print on bpftool. For example,
> bpftool may print like below:
>     {
>         "key": 0,
>         "values": [{
>                 "cpu": 0,
>                 "value": {
>                     "ui32": 0,
>                     "ui16": 0,
>                 }
>             },{
>                 "cpu": 1,
>                 "value": {
>                     "ui32": 1,
>                     "ui16": 0,
>                 }
>             },{
>                 "cpu": 2,
>                 "value": {
>                     "ui32": 2,
>                     "ui16": 0,
>                 }
>             },{
>                 "cpu": 3,
>                 "value": {
>                     "ui32": 3,
>                     "ui16": 0,
>                 }
>             }
>         ]
>     }
> 
> Patch #1 implemented bpffs pretty print for percpu arraymap/hash/lru_hash
> in kernel. Patch #2 added the test case in tools bpf selftest test_btf.
> Patch #3 added percpu map btf based dump.
> 
> Yonghong Song (3):
>   bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash
>   tools/bpf: add bpffs percpu map pretty print tests in test_btf
>   tools/bpf: bpftool: add btf percpu map formated dump
> 
>  kernel/bpf/arraymap.c                  |  24 +++++
>  kernel/bpf/hashtab.c                   |  31 ++++++
>  tools/bpf/bpftool/map.c                |  33 +++++-
>  tools/testing/selftests/bpf/test_btf.c | 179 ++++++++++++++++++++++++++-------
>  4 files changed, 230 insertions(+), 37 deletions(-)

Applied to bpf-next, thanks Yonghong!

^ permalink raw reply

* Re: [PATCH bpf-next 08/11] i40e: add AF_XDP zero-copy Rx support
From: Björn Töpel @ 2018-08-30 12:06 UTC (permalink / raw)
  To: Jakub Kicinski, Björn Töpel
  Cc: magnus.karlsson, magnus.karlsson, alexander.h.duyck,
	alexander.duyck, ast, brouer, daniel, netdev, jesse.brandeburg,
	anjali.singhai, peter.waskiewicz.jr, michael.lundkvist,
	willemdebruijn.kernel, john.fastabend, neerav.parikh,
	mykyta.iziumtsev, francois.ozog, ilias.apalodimas, brian.brooks,
	u9012063, pavel, qi.z.zhang
In-Reply-To: <20180829211428.6cce4a1b@cakuba.netronome.com>

On 2018-08-29 21:14, Jakub Kicinski wrote:
 > On Tue, 28 Aug 2018 14:44:32 +0200, Björn Töpel wrote:
 >> From: Björn Töpel <bjorn.topel@intel.com>
 >>
 >> This patch adds zero-copy Rx support for AF_XDP sockets. Instead of
 >> allocating buffers of type MEM_TYPE_PAGE_SHARED, the Rx frames are
 >> allocated as MEM_TYPE_ZERO_COPY when AF_XDP is enabled for a certain
 >> queue.
 >>
 >> All AF_XDP specific functions are added to a new file, i40e_xsk.c.
 >>
 >> Note that when AF_XDP zero-copy is enabled, the XDP action XDP_PASS
 >> will allocate a new buffer and copy the zero-copy frame prior passing
 >> it to the kernel stack.
 >>
 >> Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
 >
 > Mm.. I'm surprised you don't run into buffer reuse issues that I had
 > when playing with AF_XDP.  What happens in i40e if someone downs the
 > interface?  Will UMEMs get destroyed?  Will the RX buffers get freed?
 >

The UMEM will linger in the driver until the sockets are dead.

 > I'll shortly send an RFC with my quick and dirty RX buffer reuse queue,
 > FWIW.
 >

Some background for folks that don't know the details: A zero-copy
capable driver picks buffers off the fill ring and places them on the
hardware Rx ring to be completed at a later point when DMA is
complete. Analogous for the Tx side; The driver picks buffers off the
Tx ring and places them on the Tx hardware ring.

In the typical flow, the Rx buffer will be placed onto an Rx ring
(completed to the user), and the Tx buffer will be placed on the
completion ring to notify the user that the transfer is done.

However, if the driver needs to tear down the hardware rings for some
reason (interface goes down, reconfiguration and such), what should be
done with the Rx and Tx buffers that has been given to the driver?

So, to frame the problem: What should a driver do when this happens,
so that buffers aren't leaked?

Note that when the UMEM is going down, there's no need to complete
anything, since the sockets are dying/dead already.

This is, as you state, a missing piece in the implementation and needs
to be fixed.

Now on to possible solutions:

1. Complete the buffers back to the user. For Tx, this is probably the
    best way -- just place the buffers onto the completion ring.

    For Rx, we can give buffers back to user space by setting the
    length in the Rx descriptor to zero And putting them on the Rx
    ring. However, one complication here is that we do not have any
    back-pressure mechanism for the Rx side like we have on Tx. If the
    Rx ring(s) is (are) full the kernel will have to leak them or
    implement a retry mechanism (ugly and should be avoided).

    Another option to solve this without needing any retry or leaking
    for Rx is to implement the same back-pressure mechanism that we
    have on the Tx path in the Rx path. In the Tx path, the driver will
    only get a Tx packet to send if there is space for it in the
    completion ring. On Rx, this would be that the driver would only
    get a buffer from the fill ring if there is space for it to put it
    on the Rx ring. The drawback of this is that it would likely impact
    performance negatively since the Rx ring would have to be touch one
    more time (in the Tx path, it increased performance since it made
    it possible to implement the Tx path without any buffering), but it
    would guarantee that all buffers can always be returned to user
    space making solution this a viable option.

2. Store the buffers internally in the driver, and make sure that they
    are inserted into the "normal flow" again. For Rx that would be
    putting the buffers back into the allocation scheme that the driver
    is using. For Tx, placing the buffers back onto the Tx HW ring
    (plus all the logic for making sure that all corner cases work).

3. Mark the socket(s) as in error state, en require the user to redo
    the setup. This is bit harsh...

For i40e I think #2 for Rx (buffers reside in kernel, return to
allocator) and #1 for Tx (complete to userland).

Your RFC is plumbing to implement #2 for Rx in a driver. I'm not a fan
of extending the umem with the "reuse queue". This decision is really
up the driver. Some driver might prefer another scheme, or simply
prefer storing the buffers in another manner.

Looking forward, as both you and Jesper has alluded to, we need a
proper allocator for zero-copy. Then it would be a matter of injecting
the Rx buffers back to the allocator.

I'll send out a patch, so that we don't leak the buffers, but I'd like
to hear your thoughts on what the behavior should be.

And what should the behavior be when the netdev is removed from the
kernel?

And thanks for looking into the code!


Björn

^ permalink raw reply

* Re: KASAN: stack-out-of-bounds Read in __schedule
From: Dmitry Vyukov @ 2018-08-30 15:40 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexander Potapenko, Alexei Starovoitov, netdev, Jan Kara,
	syzbot+45a34334c61a8ecf661d, Jan Kara, linux-ext4, LKML,
	syzkaller-bugs, Theodore Ts'o
In-Reply-To: <CACT4Y+b7-oJCecsigTsN=OERGpMMQx+8GFNvD7JhN1vbNt4e+A@mail.gmail.com>

On Thu, Aug 30, 2018 at 7:19 AM, Dmitry Vyukov <dvyukov@google.com> wrote:
> On Thu, Aug 30, 2018 at 2:52 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>>> Hello,
>>>>>>
>>>>>> syzbot found the following crash on:
>>>>>>
>>>>>> HEAD commit:    5b394b2ddf03 Linux 4.19-rc1
>>>>>> git tree:       upstream
>>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=14f4d8e1400000
>>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=49927b422dcf0b29
>>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=45a34334c61a8ecf661d
>>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=13127e5a400000
>>>>>>
>>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>>> Reported-by: syzbot+45a34334c61a8ecf661d@syzkaller.appspotmail.com
>>>>>>
>>>>>> IPv6: ADDRCONF(NETDEV_UP): veth1: link is not ready
>>>>>> IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
>>>>>> IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
>>>>>> 8021q: adding VLAN 0 to HW filter on device team0
>>>>>> ==================================================================
>>>>>> BUG: KASAN: stack-out-of-bounds in schedule_debug kernel/sched/core.c:3285
>>>>>> [inline]
>>>>>> BUG: KASAN: stack-out-of-bounds in __schedule+0x1977/0x1df0
>>>>>> kernel/sched/core.c:3395
>>>>>> Read of size 8 at addr ffff8801ad090000 by task syz-executor0/4718
>>>>>
>>>>> Weird, can you please help me decipher this? So here KASAN complains about
>>>>> wrong memory access in the scheduler.
>>>
>>> This looks like a result of a previous bad silent memory corruption.
>>>
>>> The KASAN report says there is a stack out-of-bounds in scheduler. And
>>> that if followed by slab corruption report in another task.
>>>
>>> fs/jbd2/transaction.c happens to be the first meaningful file in this
>>> crash, and so that's where it is attributed to.
>>>
>>> Rerunning the reproducer several times can maybe give some better
>>> glues, or maybe not, maybe they all will look equally puzzling.
>>>
>>> This part of the repro looks familiar:
>>>
>>> r1 = bpf$MAP_CREATE(0x0, &(0x7f0000002e40)={0x12, 0x0, 0x4, 0x6e, 0x0,
>>> 0x1}, 0x68)
>>> bpf$MAP_UPDATE_ELEM(0x2, &(0x7f0000000180)={r1, &(0x7f0000000000),
>>> &(0x7f0000000140)}, 0x20)
>>>
>>> We had exactly such consequences of a bug in bpf map very recently,
>>> but that was claimed to be fixed. Maybe not completely?
>>> +bpf maintainers
>>
>> Looks like syzbot found this in Linus tree with HEAD commit 5b394b2ddf03 ("Linux 4.19-rc1")
>> one day later net PR got merged via 050cdc6c9501 ("Merge git://git.kernel.org/pub/...").
>>
>> This PR contained a couple of fixes I did on sockmap code during audit such as:
>>
>>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b845c898b2f1ea458d5453f0fa1da6e2dfce3bb4
>>
>> Looking at the reproducer syzkaller found it contains:
>>
>>   r1 = bpf$MAP_CREATE(0x0, &(0x7f0000002e40)={0x12, 0x0, 0x4, 0x6e, 0x0, 0x1}, 0x68)
>>                                                     ^^^
>>
>> So it found the crash with map type of sock hash and key size of 0x0 (which is invalid),
>> where subsequent map update triggered the corruption. I just did a 'syz test' and it
>> wasn't able to trigger the crash anymore.
>>
>> #syz fix: bpf, sockmap: fix sock_hash_alloc and reject zero-sized keys


This crash looks related:
https://groups.google.com/d/msg/syzkaller-bugs/luviyHUQ9N4/dmgK2OmLBAAJ

^ permalink raw reply

* Re: WARNING in handle_irq (3)
From: Dmitry Vyukov @ 2018-08-30 15:39 UTC (permalink / raw)
  To: syzbot, netdev, Alexei Starovoitov, Daniel Borkmann
  Cc: Greg Kroah-Hartman, H. Peter Anvin, Kate Stewart, LKML,
	Andy Lutomirski, Ingo Molnar, nstange, syzkaller-bugs,
	Thomas Gleixner, the arch/x86 maintainers
In-Reply-To: <0000000000003b1c360574a8c31e@google.com>

On Thu, Aug 30, 2018 at 8:31 AM, syzbot
<syzbot+a58b558e3e62d0604e5c@syzkaller.appspotmail.com> wrote:
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    58c3f14f86c9 Merge tag 'riscv-for-linus-4.19-rc2' of git:/..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=10be176a400000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=531a917630d2a492
> dashboard link: https://syzkaller.appspot.com/bug?extid=a58b558e3e62d0604e5c
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.

+bpf maintainers

Looks suspiciously similar to:
https://groups.google.com/d/msg/syzkaller-bugs/4v7MtbIT1hY/A87hInzyAwAJ

Note this commit seems to already have "bpf, sockmap: fix
sock_hash_alloc and reject zero-sized keys ".

Tentative reproducer from the log is:

14:08:59 executing program 5:
socketpair(0x20000, 0x0, 0x0, &(0x7f0000000140))
r0 = socket$inet6_tcp(0xa, 0x1, 0x0)
r1 = socket$inet6_tcp(0xa, 0x1, 0x0)
bind$inet6(r1, &(0x7f00000000c0)={0xa, 0x4e22}, 0x1c)
listen(r1, 0x0)
sendto$inet6(r0, &(0x7f0000000140), 0x2d6, 0x20000004,
&(0x7f0000000080)={0xa, 0x100000004e22, 0x0, @loopback}, 0x1c)
setsockopt$inet6_tcp_TCP_ULP(r0, 0x6, 0x1f, &(0x7f0000000080)='tls\x00', 0x152)
r2 = bpf$MAP_CREATE(0x0, &(0x7f0000000280)={0xf, 0x4, 0x4, 0x70}, 0x2c)
bpf$MAP_UPDATE_ELEM(0x2, &(0x7f0000000180)={r2, &(0x7f0000000000),
&(0x7f0000000140)}, 0x20)

Which does not create a 0-key map.



> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+a58b558e3e62d0604e5c@syzkaller.appspotmail.com
>
> TCP: request_sock_TCPv6: Possible SYN flooding on port 20002. Sending
> cookies.  Check SNMP counters.
> ------------[ cut here ]------------
> do_IRQ(): syz-executor5 has overflown the kernel stack
> (cur:ffff88018aec0000,sp:ffff88018aeb0e18,irq stk
> top-bottom:ffff8801db000080-ffff8801db008000,exception stk
> top-bottom:fffffe0000007080-fffffe0000011000,ip:lock_is_held_type+0x18b/0x210)
> WARNING: CPU: 0 PID: 13805 at arch/x86/kernel/irq_64.c:64
> stack_overflow_check arch/x86/kernel/irq_64.c:61 [inline]
> WARNING: CPU: 0 PID: 13805 at arch/x86/kernel/irq_64.c:64
> handle_irq+0x1fb/0x2e7 arch/x86/kernel/irq_64.c:73
> Kernel panic - not syncing: panic_on_warn set ...
>
> CPU: 0 PID: 13805 Comm: syz-executor5 Not tainted 4.19.0-rc1+ #215
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>  <IRQ>
>  __dump_stack lib/dump_stack.c:77 [inline]
>  dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
>  panic+0x238/0x4e7 kernel/panic.c:184
>  __warn.cold.8+0x163/0x1ba kernel/panic.c:536
>  report_bug+0x252/0x2d0 lib/bug.c:186
>  fixup_bug arch/x86/kernel/traps.c:178 [inline]
>  do_error_trap+0x1fc/0x4d0 arch/x86/kernel/traps.c:296
>  do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:316
>  invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:993
> RIP: 0010:stack_overflow_check arch/x86/kernel/irq_64.c:61 [inline]
> RIP: 0010:handle_irq+0x1fb/0x2e7 arch/x86/kernel/irq_64.c:73
> Code: 00 00 ff b6 80 00 00 00 48 c7 c7 80 ca 24 87 41 54 41 55 65 48 8b 04
> 25 40 ee 01 00 48 05 68 06 00 00 48 89 c6 e8 95 c4 1c 00 <0f> 0b 48 83 c4 18
> e9 3f ff ff ff 48 89 75 e0 e8 c1 fe 90 00 48 8b
> RSP: 0018:ffff8801db007f58 EFLAGS: 00010082
> RAX: 0000000000000000 RBX: ffff8801cee0ad80 RCX: 0000000000000000
> RDX: 0000000000010000 RSI: ffffffff8163ac01 RDI: 0000000000000001
> RBP: ffff8801db007fb0 R08: ffff8801c9b4a700 R09: ffffed003b603eca
> R10: ffffed003b603eca R11: ffff8801db01f657 R12: fffffe0000011000
> R13: fffffe0000007080 R14: 000000000000002a R15: 0000000000000000
>  do_IRQ+0x80/0x1a0 arch/x86/kernel/irq.c:246
>  common_interrupt+0xf/0xf arch/x86/entry/entry_64.S:643
>  </IRQ>
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Kernel Offset: disabled
> Rebooting in 86400 seconds..
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.

^ permalink raw reply

* RE: [PATCH v5 1/4] gpiolib: Pass bitmaps, not integer arrays, to get/set array
From: David Laight @ 2018-08-30 15:35 UTC (permalink / raw)
  To: 'Miguel Ojeda', Janusz Krzysztofik
  Cc: Linus Walleij, Jonathan Corbet, Peter Korsgaard, Peter Rosin,
	Ulf Hansson, Andrew Lunn, Florian Fainelli, David S. Miller,
	Dominik Brodowski, Greg Kroah-Hartman, Kishon Vijay Abraham I,
	Lars-Peter Clausen, Michael Hennerich, Jonathan Cameron,
	Hartmut Knaack, Peter Meerwald-Stadler, Jiri Slaby, Willy Tarreau,
	Geert Uytterhoeven, Linux
In-Reply-To: <CANiq72kJT15ELkYr2U5r4-pvGunQFLmZ5dhdZLnjmM7NWBnkNQ@mail.gmail.com>

From: Miguel Ojeda
> Sent: 30 August 2018 12:11
...
> > +       unsigned long value_bitmap[1];  /* for DATA[0-7], RS, RW */
> 
> (I read your comments in the other email)
> 
> I still find this odd, but if everyone is going to have this change
> done like this, consistency is better.

Maybe there ought to be a define so you can do:
	DEFINE_BITMAP(value_bitmap, 32);

While it might just generate an unsigned long [] there is probably
scope for stronger typing.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

^ permalink raw reply

* [PATCH net-next 1/2] netlink: ipv4 IGMP join notifications
From: Patrick Ruddy @ 2018-08-30  9:35 UTC (permalink / raw)
  To: netdev; +Cc: roopa, jiri, stephen
In-Reply-To: <20180830093545.29465-1-pruddy@vyatta.att-mail.com>

Some userspace applications need to know about IGMP joins from the kernel
for 2 reasons
1. To allow the programming of multicast MAC filters in hardware
2. To form a multicast FORUS list for non link-local multicast
   groups to be sent to the kernel and from there to the interested
   party.
(1) can be fulfilled but simply sending the hardware multicast MAC
address to be programmed but (2) requires the L3 address to be sent
since this cannot be constructed from the MAC address whereas the
reverse translation is a standard library function.

This commit provides addition and deletion of multicast addresses
using the RTM_NEWADDR and RTM_DELADDR messages. It also provides
the RTM_GETADDR extension to allow multicast join state to be read
from the kernel.

Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
---
 include/linux/igmp.h |  2 +
 net/ipv4/devinet.c   | 39 +++++++++++++------
 net/ipv4/igmp.c      | 90 ++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 120 insertions(+), 11 deletions(-)

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 119f53941c12..1fb417865e7d 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -130,6 +130,8 @@ extern void ip_mc_unmap(struct in_device *);
 extern void ip_mc_remap(struct in_device *);
 extern void ip_mc_dec_group(struct in_device *in_dev, __be32 addr);
 extern void ip_mc_inc_group(struct in_device *in_dev, __be32 addr);
+extern int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb,
+			     struct net_device *dev);
 int ip_mc_check_igmp(struct sk_buff *skb, struct sk_buff **skb_trimmed);
 
 #endif
diff --git a/net/ipv4/devinet.c b/net/ipv4/devinet.c
index ea4bd8a52422..42f7dcc4fb5e 100644
--- a/net/ipv4/devinet.c
+++ b/net/ipv4/devinet.c
@@ -57,6 +57,7 @@
 #endif
 #include <linux/kmod.h>
 #include <linux/netconf.h>
+#include <linux/igmp.h>
 
 #include <net/arp.h>
 #include <net/ip.h>
@@ -1651,6 +1652,7 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	int h, s_h;
 	int idx, s_idx;
 	int ip_idx, s_ip_idx;
+	int multicast, mcast_idx;
 	struct net_device *dev;
 	struct in_device *in_dev;
 	struct in_ifaddr *ifa;
@@ -1659,6 +1661,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	s_h = cb->args[0];
 	s_idx = idx = cb->args[1];
 	s_ip_idx = ip_idx = cb->args[2];
+	multicast = cb->args[3];
+	mcast_idx = cb->args[4];
 
 	for (h = s_h; h < NETDEV_HASHENTRIES; h++, s_idx = 0) {
 		idx = 0;
@@ -1675,18 +1679,29 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 			if (!in_dev)
 				goto cont;
 
-			for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
-			     ifa = ifa->ifa_next, ip_idx++) {
-				if (ip_idx < s_ip_idx)
-					continue;
-				if (inet_fill_ifaddr(skb, ifa,
-					     NETLINK_CB(cb->skb).portid,
-					     cb->nlh->nlmsg_seq,
-					     RTM_NEWADDR, NLM_F_MULTI) < 0) {
-					rcu_read_unlock();
-					goto done;
+			if (!multicast) {
+				for (ifa = in_dev->ifa_list, ip_idx = 0; ifa;
+				     ifa = ifa->ifa_next, ip_idx++) {
+					if (ip_idx < s_ip_idx)
+						continue;
+					if (inet_fill_ifaddr(skb, ifa,
+							     NETLINK_CB(cb->skb).portid,
+							     cb->nlh->nlmsg_seq,
+							     RTM_NEWADDR,
+							     NLM_F_MULTI) < 0) {
+						rcu_read_unlock();
+						goto done;
+					}
+					nl_dump_check_consistent(cb,
+								 nlmsg_hdr(skb));
 				}
-				nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+				/* set for multicast loop */
+				multicast++;
+			}
+			/* loop over multicast addresses */
+			if (ip_mc_dump_ifaddr(skb, cb, dev) < 0) {
+				rcu_read_unlock();
+				goto done;
 			}
 cont:
 			idx++;
@@ -1698,6 +1713,8 @@ static int inet_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 	cb->args[0] = h;
 	cb->args[1] = idx;
 	cb->args[2] = ip_idx;
+	cb->args[3] = multicast;
+	cb->args[4] = mcast_idx;
 
 	return skb->len;
 }
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index cf75f8944b05..c9bbd1d27124 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -86,6 +86,7 @@
 #include <linux/inetdevice.h>
 #include <linux/igmp.h>
 #include <linux/if_arp.h>
+#include <net/netlink.h>
 #include <linux/rtnetlink.h>
 #include <linux/times.h>
 #include <linux/pkt_sched.h>
@@ -1384,6 +1385,91 @@ static void ip_mc_hash_remove(struct in_device *in_dev,
 }
 
 
+static int fill_addr(struct sk_buff *skb, struct net_device *dev, __be32 addr,
+		     int type, unsigned int flags)
+{
+	struct nlmsghdr *nlh;
+	struct ifaddrmsg *ifm;
+
+	nlh = nlmsg_put(skb, 0, 0, type, sizeof(*ifm), flags);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	ifm = nlmsg_data(nlh);
+	ifm->ifa_family = AF_INET;
+	ifm->ifa_prefixlen = 32;
+	ifm->ifa_flags = IFA_F_PERMANENT;
+	ifm->ifa_scope = RT_SCOPE_LINK;
+	ifm->ifa_index = dev->ifindex;
+
+	if (nla_put_in_addr(skb, IFA_ADDRESS, addr))
+		goto nla_put_failure;
+	nlmsg_end(skb, nlh);
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static inline size_t addr_nlmsg_size(void)
+{
+	return NLMSG_ALIGN(sizeof(struct ifaddrmsg))
+		+ nla_total_size(sizeof(__be32));
+}
+
+static void ip_mc_addr_notify(struct net_device *dev, __be32 addr, int type)
+{
+	struct net *net = dev_net(dev);
+	struct sk_buff *skb;
+	int err = -ENOBUFS;
+
+	skb = nlmsg_new(addr_nlmsg_size(), GFP_ATOMIC);
+	if (!skb)
+		goto errout;
+
+	err = fill_addr(skb, dev, addr, type, 0);
+	if (err < 0) {
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(skb);
+		goto errout;
+	}
+	rtnl_notify(skb, net, 0, RTNLGRP_IPV4_IFADDR, NULL, GFP_ATOMIC);
+	return;
+errout:
+	if (err < 0)
+		rtnl_set_sk_err(net, RTNLGRP_LINK, err);
+}
+
+int ip_mc_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb,
+		      struct net_device *dev)
+{
+	int s_idx;
+	int idx = 0;
+	struct ip_mc_list *im;
+	struct in_device *in_dev;
+
+	ASSERT_RTNL();
+
+	s_idx = cb->args[4];
+	in_dev = __in_dev_get_rtnl(dev);
+
+	for_each_pmc_rtnl(in_dev, im) {
+		if (idx < s_idx)
+			continue;
+		if (fill_addr(skb, dev, im->multiaddr, RTM_NEWADDR,
+			      NLM_F_MULTI) < 0)
+			goto done;
+		nl_dump_check_consistent(cb, nlmsg_hdr(skb));
+		idx++;
+	}
+
+ done:
+	cb->args[4] = idx;
+
+	return skb->len;
+}
+
 /*
  *	A socket has joined a multicast group on device dev.
  */
@@ -1433,6 +1519,8 @@ static void __ip_mc_inc_group(struct in_device *in_dev, __be32 addr,
 	igmpv3_del_delrec(in_dev, im);
 #endif
 	igmp_group_added(im);
+
+	ip_mc_addr_notify(in_dev->dev, addr, RTM_NEWADDR);
 	if (!in_dev->dead)
 		ip_rt_multicast_event(in_dev);
 out:
@@ -1664,6 +1752,8 @@ void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
 				in_dev->mc_count--;
 				igmp_group_dropped(i);
 				ip_mc_clear_src(i);
+				ip_mc_addr_notify(in_dev->dev, addr,
+						  RTM_DELADDR);
 
 				if (!in_dev->dead)
 					ip_rt_multicast_event(in_dev);
-- 
2.17.1

^ permalink raw reply related

* Re: [Patch net-nnext] net_sched: add missing tcf_lock for act_connmark
From: Vlad Buslov @ 2018-08-30 10:59 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev
In-Reply-To: <20180829171536.23162-2-xiyou.wangcong@gmail.com>

On Wed 29 Aug 2018 at 17:15, Cong Wang <xiyou.wangcong@gmail.com> wrote:
> According to the new locking rule, we have to take tcf_lock
> for both ->init() and ->dump(), as RTNL will be removed.
> However, it is missing for act_connmark.

Thank you for finding and fixing this!

>
> Cc: Vlad Buslov <vladbu@mellanox.com>
> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> ---
>  net/sched/act_connmark.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/net/sched/act_connmark.c b/net/sched/act_connmark.c
> index e869c0ee63c8..8475913f2070 100644
> --- a/net/sched/act_connmark.c
> +++ b/net/sched/act_connmark.c
> @@ -143,8 +143,10 @@ static int tcf_connmark_init(struct net *net, struct nlattr *nla,
>  			return -EEXIST;
>  		}
>  		/* replacing action and zone */
> +		spin_lock_bh(&ci->tcf_lock);
>  		ci->tcf_action = parm->action;
>  		ci->zone = parm->zone;
> +		spin_unlock_bh(&ci->tcf_lock);
>  		ret = 0;
>  	}
>  
> @@ -156,16 +158,16 @@ static inline int tcf_connmark_dump(struct sk_buff *skb, struct tc_action *a,
>  {
>  	unsigned char *b = skb_tail_pointer(skb);
>  	struct tcf_connmark_info *ci = to_connmark(a);
> -
>  	struct tc_connmark opt = {
>  		.index   = ci->tcf_index,
>  		.refcnt  = refcount_read(&ci->tcf_refcnt) - ref,
>  		.bindcnt = atomic_read(&ci->tcf_bindcnt) - bind,
> -		.action  = ci->tcf_action,
> -		.zone   = ci->zone,
>  	};
>  	struct tcf_t t;
>  
> +	spin_lock_bh(&ci->tcf_lock);
> +	opt.action = ci->tcf_action;
> +	opt.zone = ci->zone;
>  	if (nla_put(skb, TCA_CONNMARK_PARMS, sizeof(opt), &opt))
>  		goto nla_put_failure;
>  
> @@ -173,9 +175,12 @@ static inline int tcf_connmark_dump(struct sk_buff *skb, struct tc_action *a,
>  	if (nla_put_64bit(skb, TCA_CONNMARK_TM, sizeof(t), &t,
>  			  TCA_CONNMARK_PAD))
>  		goto nla_put_failure;
> +	spin_unlock_bh(&ci->tcf_lock);
>  
>  	return skb->len;
> +
>  nla_put_failure:
> +	spin_unlock_bh(&ci->tcf_lock);
>  	nlmsg_trim(skb, b);
>  	return -1;
>  }

^ permalink raw reply

* kernels >= v4.12 oops/crash with ipsec-traffic: partly bisected
From: Wolfgang Walter @ 2018-08-30 10:48 UTC (permalink / raw)
  To: netdev; +Cc: Wei Wang

Hello,

kernels >= 4.12 do not work on one of our main routers. They crash as soon as 
ipsec-tunnels are configured and ipsec-traffic actually flows.

Just configuring ipsec (that is starting strongswan) does not trigger the 
oops.

I finally found time to bisect that. Though I have not completed that yet, I 
already narrowed it down to the following commits

good: d24406c85d123df773bc4df88ad5da2233896919
	udp: call dst_hold_safe() in udp_sk_rx_set_dst()
bad: 5b7c9a8ff828287af5aebe93e707271bf1a82cc3
	net: remove dst gc related code

Commits in between are almost all changes to remove dst gc.

Now we have other machines which run just fine with the very same kernels 
doing ipsec. They differ insofar as they have much less cores, do not use the 
ixgbe driver, do not have 10G and terminate only a few tunnels instead of 
hundreds.

I already tested distribution kernels > 4.12 from debian, they also crash.

Regards,
-- 
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts

^ permalink raw reply

* [PATCH net-next 2/2] netlink: ipv6 MLD join notifications
From: Patrick Ruddy @ 2018-08-30  9:35 UTC (permalink / raw)
  To: netdev; +Cc: roopa, jiri, stephen
In-Reply-To: <20180830093545.29465-1-pruddy@vyatta.att-mail.com>

Some userspace applications need to know about MLD joins from the
kernel for 2 reasons:
1. To allow the programming of multicast MAC filters in hardware
2. To form a multicast FORUS list for non link-local multicast
   groups to be sent to the kernel and from there to the interested
   party.
(1) can be fulfilled but simply sending the hardware multicast MAC
address to be programmed but (2) requires the L3 address to be sent
since this cannot be constructed from the MAC address whereas the
reverse translation is a standard library function.

This commit provides addition and deletion of multicast addresses
using the RTM_NEWADDR and RTM_DELADDR messages. It also provides
the RTM_GETADDR extension to allow multicast join state to be read
from the kernel.

Signed-off-by: Patrick Ruddy <pruddy@vyatta.att-mail.com>
---
 net/ipv6/addrconf.c | 44 +++++++++++++++++++++---------
 net/ipv6/mcast.c    | 66 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 98 insertions(+), 12 deletions(-)

diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index d51a8c0b3372..e4e7362f9298 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -4855,11 +4855,13 @@ static int inet6_fill_ifaddr(struct sk_buff *skb, struct inet6_ifaddr *ifa,
 }
 
 static int inet6_fill_ifmcaddr(struct sk_buff *skb, struct ifmcaddr6 *ifmca,
-				u32 portid, u32 seq, int event, u16 flags)
+			u32 portid, u32 seq, int event, u16 flags)
 {
 	struct nlmsghdr  *nlh;
 	u8 scope = RT_SCOPE_UNIVERSE;
 	int ifindex = ifmca->idev->dev->ifindex;
+	int addr_type = (event == RTM_GETMULTICAST) ? IFA_MULTICAST :
+		IFA_ADDRESS;
 
 	if (ipv6_addr_scope(&ifmca->mca_addr) & IFA_SITE)
 		scope = RT_SCOPE_SITE;
@@ -4869,7 +4871,7 @@ static int inet6_fill_ifmcaddr(struct sk_buff *skb, struct ifmcaddr6 *ifmca,
 		return -EMSGSIZE;
 
 	put_ifaddrmsg(nlh, 128, IFA_F_PERMANENT, scope, ifindex);
-	if (nla_put_in6_addr(skb, IFA_MULTICAST, &ifmca->mca_addr) < 0 ||
+	if (nla_put_in6_addr(skb, addr_type, &ifmca->mca_addr) < 0 ||
 	    put_cacheinfo(skb, ifmca->mca_cstamp, ifmca->mca_tstamp,
 			  INFINITY_LIFE_TIME, INFINITY_LIFE_TIME) < 0) {
 		nlmsg_cancel(skb, nlh);
@@ -4916,7 +4918,7 @@ enum addr_type_t {
 /* called with rcu_read_lock() */
 static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 			  struct netlink_callback *cb, enum addr_type_t type,
-			  int s_ip_idx, int *p_ip_idx)
+			  int s_ip_idx, int *p_ip_idx, int msg_type)
 {
 	struct ifmcaddr6 *ifmca;
 	struct ifacaddr6 *ifaca;
@@ -4935,7 +4937,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 			err = inet6_fill_ifaddr(skb, ifa,
 						NETLINK_CB(cb->skb).portid,
 						cb->nlh->nlmsg_seq,
-						RTM_NEWADDR,
+						msg_type,
 						NLM_F_MULTI);
 			if (err < 0)
 				break;
@@ -4952,7 +4954,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 			err = inet6_fill_ifmcaddr(skb, ifmca,
 						  NETLINK_CB(cb->skb).portid,
 						  cb->nlh->nlmsg_seq,
-						  RTM_GETMULTICAST,
+						  msg_type,
 						  NLM_F_MULTI);
 			if (err < 0)
 				break;
@@ -4967,7 +4969,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 			err = inet6_fill_ifacaddr(skb, ifaca,
 						  NETLINK_CB(cb->skb).portid,
 						  cb->nlh->nlmsg_seq,
-						  RTM_GETANYCAST,
+						  msg_type,
 						  NLM_F_MULTI);
 			if (err < 0)
 				break;
@@ -4982,7 +4984,7 @@ static int in6_dump_addrs(struct inet6_dev *idev, struct sk_buff *skb,
 }
 
 static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
-			   enum addr_type_t type)
+			   enum addr_type_t type, int msg_type)
 {
 	struct net *net = sock_net(skb->sk);
 	int h, s_h;
@@ -5012,7 +5014,7 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 				goto cont;
 
 			if (in6_dump_addrs(idev, skb, cb, type,
-					   s_ip_idx, &ip_idx) < 0)
+					   s_ip_idx, &ip_idx, msg_type) < 0)
 				goto done;
 cont:
 			idx++;
@@ -5029,16 +5031,34 @@ static int inet6_dump_addr(struct sk_buff *skb, struct netlink_callback *cb,
 
 static int inet6_dump_ifaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	enum addr_type_t type = UNICAST_ADDR;
+	enum addr_type_t type;
+	int ret;
+
+	type = cb->args[3];
+	if (type == UNICAST_ADDR) {
+		ret =  = inet6_dump_addr(skb, cb, type, RTM_NEWADDR);
+		if (ret > 0)
+			goto done;
 
-	return inet6_dump_addr(skb, cb, type);
+		/* reset indices and move on to multicast*/
+		cb->args[0] = 0;
+		cb->args[1] = 0;
+		cb->args[2] = 0;
+		type = MULTICAST_ADDR;
+	}
+
+	/* do the RTM_NEWADDR notifications for multicast type */
+	ret = inet6_dump_addr(skb, cb, type, RTM_NEWADDR);
+ done:
+	cb->args[3] = type;
+	return ret;
 }
 
 static int inet6_dump_ifmcaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	enum addr_type_t type = MULTICAST_ADDR;
 
-	return inet6_dump_addr(skb, cb, type);
+	return inet6_dump_addr(skb, cb, type, RTM_GETMULTICAST);
 }
 
 
@@ -5046,7 +5066,7 @@ static int inet6_dump_ifacaddr(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	enum addr_type_t type = ANYCAST_ADDR;
 
-	return inet6_dump_addr(skb, cb, type);
+	return inet6_dump_addr(skb, cb, type, RTM_GETANYCAST);
 }
 
 static int inet6_rtm_getaddr(struct sk_buff *in_skb, struct nlmsghdr *nlh,
diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 4ae54aaca373..735fb6a8ad34 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -880,6 +880,67 @@ static struct ifmcaddr6 *mca_alloc(struct inet6_dev *idev,
 	return mc;
 }
 
+static int fill_addr(struct sk_buff *skb, struct net_device *dev,
+		     const struct in6_addr *addr, int type, unsigned int flags)
+{
+	struct nlmsghdr *nlh;
+	struct ifaddrmsg *ifm;
+	u8 scope = RT_SCOPE_UNIVERSE;
+
+	if (ipv6_addr_scope(addr) & IFA_SITE)
+		scope = RT_SCOPE_SITE;
+
+	nlh = nlmsg_put(skb, 0, 0, type, sizeof(*ifm), flags);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	ifm = nlmsg_data(nlh);
+	ifm->ifa_family = AF_INET6;
+	ifm->ifa_prefixlen = 128;
+	ifm->ifa_flags = IFA_F_PERMANENT;
+	ifm->ifa_scope = scope;
+	ifm->ifa_index = dev->ifindex;
+
+	if (nla_put_in6_addr(skb, IFA_ADDRESS, addr))
+		goto nla_put_failure;
+	nlmsg_end(skb, nlh);
+	return 0;
+
+nla_put_failure:
+	nlmsg_cancel(skb, nlh);
+	return -EMSGSIZE;
+}
+
+static inline size_t addr_nlmsg_size(void)
+{
+	return NLMSG_ALIGN(sizeof(struct ifaddrmsg))
+		+ nla_total_size(sizeof(struct in6_addr));
+}
+
+static void ipv6_mc_addr_notify(struct net_device *dev,
+				const struct in6_addr *addr, int type)
+{
+	struct net *net = dev_net(dev);
+	struct sk_buff *skb;
+	int err = -ENOBUFS;
+
+	skb = nlmsg_new(addr_nlmsg_size(), GFP_ATOMIC);
+	if (!skb)
+		goto errout;
+
+	err = fill_addr(skb, dev, addr, type, 0);
+	if (err < 0) {
+		WARN_ON(err == -EMSGSIZE);
+		kfree_skb(skb);
+		goto errout;
+	}
+	rtnl_notify(skb, net, 0, RTNLGRP_IPV6_IFADDR, NULL, GFP_ATOMIC);
+	return;
+errout:
+	if (err < 0)
+		rtnl_set_sk_err(net, RTNLGRP_IPV6_IFADDR, err);
+}
+
 /*
  *	device multicast group inc (add if not found)
  */
@@ -932,6 +993,9 @@ static int __ipv6_dev_mc_inc(struct net_device *dev,
 
 	mld_del_delrec(idev, mc);
 	igmp6_group_added(mc);
+
+	ipv6_mc_addr_notify(dev, addr, RTM_NEWADDR);
+
 	ma_put(mc);
 	return 0;
 }
@@ -960,6 +1024,8 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr)
 				igmp6_group_dropped(ma);
 				ip6_mc_clear_src(ma);
 
+				ipv6_mc_addr_notify(idev->dev, addr,
+						    RTM_DELADDR);
 				ma_put(ma);
 				return 0;
 			}
-- 
2.17.1

^ permalink raw reply related

* [RFT net-next] net: stmmac: Rework coalesce timer and fix multi-queue races
From: Jose Abreu @ 2018-08-30 10:37 UTC (permalink / raw)
  To: netdev
  Cc: Jose Abreu, Jerome Brunet, Martin Blumenstingl, David S. Miller,
	Joao Pinto, Giuseppe Cavallaro, Alexandre Torgue

[ As for now this is only for testing! ]

This follows David Miller advice and tries to fix coalesce timer in
multi-queue scenarios.

We are now using per-queue coalesce values and per-queue TX timer. This
assumes that tx_queues == rx_queues, which can not be necessarly true.
Official patch will need to have this fixed.

Coalesce timer default values was changed to 1ms and the coalesce frames
to 25.

Tested in B2B setup between XGMAC2 and GMAC5.

Signed-off-by: Jose Abreu <joabreu@synopsys.com>
Cc: Jerome Brunet <jbrunet@baylibre.com>
Cc: Martin Blumenstingl <martin.blumenstingl@googlemail.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Joao Pinto <jpinto@synopsys.com>
Cc: Giuseppe Cavallaro <peppe.cavallaro@st.com>
Cc: Alexandre Torgue <alexandre.torgue@st.com>
---
 drivers/net/ethernet/stmicro/stmmac/common.h      |   4 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac.h      |   6 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 117 +++++++++++++---------
 3 files changed, 75 insertions(+), 52 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index 1854f270ad66..b1b305f8f414 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -258,10 +258,10 @@ struct stmmac_safety_stats {
 #define MAX_DMA_RIWT		0xff
 #define MIN_DMA_RIWT		0x20
 /* Tx coalesce parameters */
-#define STMMAC_COAL_TX_TIMER	40000
+#define STMMAC_COAL_TX_TIMER	1000
 #define STMMAC_MAX_COAL_TX_TICK	100000
 #define STMMAC_TX_MAX_FRAMES	256
-#define STMMAC_TX_FRAMES	64
+#define STMMAC_TX_FRAMES	25
 
 /* Packets types */
 enum packets_types {
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index 76649adf8fb0..db42a5f03a5f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -48,6 +48,9 @@ struct stmmac_tx_info {
 
 /* Frequently used values are kept adjacent for cache effect */
 struct stmmac_tx_queue {
+	u32 tx_count_frames;
+	int tx_timer_active;
+	struct timer_list txtimer;
 	u32 queue_index;
 	struct stmmac_priv *priv_data;
 	struct dma_extended_desc *dma_etx ____cacheline_aligned_in_smp;
@@ -109,15 +112,12 @@ struct stmmac_pps_cfg {
 
 struct stmmac_priv {
 	/* Frequently used values are kept adjacent for cache effect */
-	u32 tx_count_frames;
 	u32 tx_coal_frames;
 	u32 tx_coal_timer;
-	bool tx_timer_armed;
 
 	int tx_coalesce;
 	int hwts_tx_en;
 	bool tx_path_in_lpi_mode;
-	struct timer_list txtimer;
 	bool tso;
 
 	unsigned int dma_buf_sz;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index ff1ffb46198a..ae26a6e8608e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2034,7 +2034,6 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 	u32 channels_to_check = tx_channel_count > rx_channel_count ?
 				tx_channel_count : rx_channel_count;
 	u32 chan;
-	bool poll_scheduled = false;
 	int status[max_t(u32, MTL_MAX_TX_QUEUES, MTL_MAX_RX_QUEUES)];
 
 	/* Make sure we never check beyond our status buffer. */
@@ -2058,7 +2057,6 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 			if (likely(napi_schedule_prep(&rx_q->napi))) {
 				stmmac_disable_dma_irq(priv, priv->ioaddr, chan);
 				__napi_schedule(&rx_q->napi);
-				poll_scheduled = true;
 			}
 		}
 	}
@@ -2067,21 +2065,13 @@ static void stmmac_dma_interrupt(struct stmmac_priv *priv)
 	 * If we didn't schedule poll, see if any DMA channel (used by tx) has a
 	 * completed transmission, if so, call stmmac_poll (once).
 	 */
-	if (!poll_scheduled) {
-		for (chan = 0; chan < tx_channel_count; chan++) {
-			if (status[chan] & handle_tx) {
-				/* It doesn't matter what rx queue we choose
-				 * here. We use 0 since it always exists.
-				 */
-				struct stmmac_rx_queue *rx_q =
-					&priv->rx_queue[0];
+	for (chan = 0; chan < tx_channel_count; chan++) {
+		if (status[chan] & handle_tx) {
+			struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan];
 
-				if (likely(napi_schedule_prep(&rx_q->napi))) {
-					stmmac_disable_dma_irq(priv,
-							priv->ioaddr, chan);
-					__napi_schedule(&rx_q->napi);
-				}
-				break;
+			if (likely(napi_schedule_prep(&rx_q->napi))) {
+				stmmac_disable_dma_irq(priv, priv->ioaddr, chan);
+				__napi_schedule(&rx_q->napi);
 			}
 		}
 	}
@@ -2241,13 +2231,18 @@ static int stmmac_init_dma_engine(struct stmmac_priv *priv)
  */
 static void stmmac_tx_timer(struct timer_list *t)
 {
-	struct stmmac_priv *priv = from_timer(priv, t, txtimer);
-	u32 tx_queues_count = priv->plat->tx_queues_to_use;
-	u32 queue;
+	struct stmmac_tx_queue *tx_q = from_timer(tx_q, t, txtimer);
+	struct stmmac_priv *priv = tx_q->priv_data;
+	struct napi_struct *napi;
+
+	napi = &priv->rx_queue[tx_q->queue_index].napi;
 
-	/* let's scan all the tx queues */
-	for (queue = 0; queue < tx_queues_count; queue++)
-		stmmac_tx_clean(priv, queue);
+	if (napi_schedule_prep(napi)) {
+		stmmac_disable_dma_irq(priv, priv->ioaddr, tx_q->queue_index);
+		__napi_schedule(napi);
+	}
+
+	tx_q->tx_timer_active = 0;
 }
 
 /**
@@ -2260,11 +2255,17 @@ static void stmmac_tx_timer(struct timer_list *t)
  */
 static void stmmac_init_tx_coalesce(struct stmmac_priv *priv)
 {
+	u32 tx_channel_count = priv->plat->tx_queues_to_use;
+	u32 chan;
+
 	priv->tx_coal_frames = STMMAC_TX_FRAMES;
 	priv->tx_coal_timer = STMMAC_COAL_TX_TIMER;
-	timer_setup(&priv->txtimer, stmmac_tx_timer, 0);
-	priv->txtimer.expires = STMMAC_COAL_TIMER(priv->tx_coal_timer);
-	add_timer(&priv->txtimer);
+
+	for (chan = 0; chan < tx_channel_count; chan++) {
+		struct stmmac_tx_queue *tx_q = &priv->tx_queue[chan];
+
+		timer_setup(&tx_q->txtimer, stmmac_tx_timer, 0);
+	}
 }
 
 static void stmmac_set_rings_length(struct stmmac_priv *priv)
@@ -2592,6 +2593,7 @@ static void stmmac_hw_teardown(struct net_device *dev)
 static int stmmac_open(struct net_device *dev)
 {
 	struct stmmac_priv *priv = netdev_priv(dev);
+	u32 chan;
 	int ret;
 
 	stmmac_check_ether_addr(priv);
@@ -2688,7 +2690,9 @@ static int stmmac_open(struct net_device *dev)
 	if (dev->phydev)
 		phy_stop(dev->phydev);
 
-	del_timer_sync(&priv->txtimer);
+	for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++)
+		del_timer_sync(&priv->tx_queue[chan].txtimer);
+
 	stmmac_hw_teardown(dev);
 init_error:
 	free_dma_desc_resources(priv);
@@ -2708,6 +2712,7 @@ static int stmmac_open(struct net_device *dev)
 static int stmmac_release(struct net_device *dev)
 {
 	struct stmmac_priv *priv = netdev_priv(dev);
+	u32 chan;
 
 	if (priv->eee_enabled)
 		del_timer_sync(&priv->eee_ctrl_timer);
@@ -2722,7 +2727,8 @@ static int stmmac_release(struct net_device *dev)
 
 	stmmac_disable_all_queues(priv);
 
-	del_timer_sync(&priv->txtimer);
+	for (chan = 0; chan < priv->plat->tx_queues_to_use; chan++)
+		del_timer_sync(&priv->tx_queue[chan].txtimer);
 
 	/* Free the IRQ lines */
 	free_irq(dev->irq, dev);
@@ -2828,6 +2834,7 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	int tmp_pay_len = 0;
 	u32 pay_len, mss;
 	u8 proto_hdr_len;
+	bool tx_ic;
 	int i;
 
 	tx_q = &priv->tx_queue[queue];
@@ -2936,12 +2943,17 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 	priv->xstats.tx_tso_nfrags += nfrags;
 
 	/* Manage tx mitigation */
-	priv->tx_count_frames += nfrags + 1;
-	if (likely(priv->tx_coal_frames > priv->tx_count_frames)) {
-		mod_timer(&priv->txtimer,
-			  STMMAC_COAL_TIMER(priv->tx_coal_timer));
-	} else {
-		priv->tx_count_frames = 0;
+	tx_q->tx_count_frames += nfrags + 1;
+	if (!priv->tx_coal_frames)
+		tx_ic = false;
+	else if ((nfrags + 1) > priv->tx_coal_frames)
+		tx_ic = true;
+	else if ((tx_q->tx_count_frames % priv->tx_coal_frames) < (nfrags + 1))
+		tx_ic = true;
+	else
+		tx_ic = false;
+
+	if (tx_ic) {
 		stmmac_set_tx_ic(priv, desc);
 		priv->xstats.tx_set_ic_bit++;
 	}
@@ -2994,6 +3006,12 @@ static netdev_tx_t stmmac_tso_xmit(struct sk_buff *skb, struct net_device *dev)
 
 	stmmac_set_tx_tail_ptr(priv, priv->ioaddr, tx_q->tx_tail_addr, queue);
 
+	if (priv->tx_coal_timer && !tx_q->tx_timer_active) {
+		tx_q->tx_timer_active = 1;
+		mod_timer(&tx_q->txtimer,
+				STMMAC_COAL_TIMER(priv->tx_coal_timer));
+	}
+
 	return NETDEV_TX_OK;
 
 dma_map_err:
@@ -3024,6 +3042,7 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct stmmac_tx_queue *tx_q;
 	unsigned int enh_desc;
 	unsigned int des;
+	bool tx_ic;
 
 	tx_q = &priv->tx_queue[queue];
 
@@ -3146,17 +3165,19 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 	 * This approach takes care about the fragments: desc is the first
 	 * element in case of no SG.
 	 */
-	priv->tx_count_frames += nfrags + 1;
-	if (likely(priv->tx_coal_frames > priv->tx_count_frames) &&
-	    !priv->tx_timer_armed) {
-		mod_timer(&priv->txtimer,
-			  STMMAC_COAL_TIMER(priv->tx_coal_timer));
-		priv->tx_timer_armed = true;
-	} else {
-		priv->tx_count_frames = 0;
+	tx_q->tx_count_frames += nfrags + 1;
+	if (!priv->tx_coal_frames)
+		tx_ic = false;
+	else if ((nfrags + 1) > priv->tx_coal_frames)
+		tx_ic = true;
+	else if ((tx_q->tx_count_frames % priv->tx_coal_frames) < (nfrags + 1))
+		tx_ic = true;
+	else
+		tx_ic = false;
+
+	if (tx_ic) {
 		stmmac_set_tx_ic(priv, desc);
 		priv->xstats.tx_set_ic_bit++;
-		priv->tx_timer_armed = false;
 	}
 
 	skb_tx_timestamp(skb);
@@ -3204,6 +3225,12 @@ static netdev_tx_t stmmac_xmit(struct sk_buff *skb, struct net_device *dev)
 	stmmac_enable_dma_transmission(priv, priv->ioaddr);
 	stmmac_set_tx_tail_ptr(priv, priv->ioaddr, tx_q->tx_tail_addr, queue);
 
+	if (priv->tx_coal_timer && !tx_q->tx_timer_active) {
+		tx_q->tx_timer_active = 1;
+		mod_timer(&tx_q->txtimer,
+				STMMAC_COAL_TIMER(priv->tx_coal_timer));
+	}
+
 	return NETDEV_TX_OK;
 
 dma_map_err:
@@ -3522,16 +3549,12 @@ static int stmmac_poll(struct napi_struct *napi, int budget)
 	struct stmmac_rx_queue *rx_q =
 		container_of(napi, struct stmmac_rx_queue, napi);
 	struct stmmac_priv *priv = rx_q->priv_data;
-	u32 tx_count = priv->plat->tx_queues_to_use;
 	u32 chan = rx_q->queue_index;
 	int work_done = 0;
-	u32 queue;
 
 	priv->xstats.napi_poll++;
 
-	/* check all the queues */
-	for (queue = 0; queue < tx_count; queue++)
-		stmmac_tx_clean(priv, queue);
+	stmmac_tx_clean(priv, chan);
 
 	work_done = stmmac_rx(priv, budget, rx_q->queue_index);
 	if (work_done < budget) {
-- 
2.7.4

^ permalink raw reply related

* Re: [PATCH][net-next] xdp: remove redundant variable 'headroom'
From: Björn Töpel @ 2018-08-30 14:37 UTC (permalink / raw)
  To: Colin King, David S . Miller, Jesper Dangaard Brouer, netdev,
	Alexei Starovoitov, daniel@iogearbox.net
  Cc: kernel-janitors, linux-kernel
In-Reply-To: <20180830142718.10850-1-colin.king@canonical.com>

On 2018-08-30 16:27, Colin King wrote:
> From: Colin Ian King <colin.king@canonical.com>
> 
> Variable 'headroom' is being assigned but is never used hence it is
> redundant and can be removed.
> 
> Cleans up clang warning:
> variable ‘headroom’ set but not used [-Wunused-but-set-variable]
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>
> ---
>   net/core/xdp.c | 3 +--
>   1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 654dbb19707e..4b2b194f4f1f 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -412,7 +412,7 @@ EXPORT_SYMBOL_GPL(xdp_attachment_setup);
>   
>   struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>   {
> -	unsigned int metasize, headroom, totsize;
> +	unsigned int metasize, totsize;
>   	void *addr, *data_to_copy;
>   	struct xdp_frame *xdpf;
>   	struct page *page;
> @@ -420,7 +420,6 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
>   	/* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
>   	metasize = xdp_data_meta_unsupported(xdp) ? 0 :
>   		   xdp->data - xdp->data_meta;
> -	headroom = xdp->data - xdp->data_hard_start;
>   	totsize = xdp->data_end - xdp->data + metasize;
>   
>   	if (sizeof(*xdpf) + totsize > PAGE_SIZE)
> 

This should go via bpf-next. Thanks for the fix!

Acked-by: Björn Töpel <bjorn.topel@intel.com>

^ permalink raw reply

* [PATCH v1] bridge: Switch to bitmap_zalloc()
From: Andy Shevchenko @ 2018-08-30 10:33 UTC (permalink / raw)
  To: Stephen Hemminger, bridge, David S. Miller, netdev; +Cc: Andy Shevchenko

Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
---
 net/bridge/br_if.c   | 5 ++---
 net/bridge/br_vlan.c | 5 ++---
 2 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/net/bridge/br_if.c b/net/bridge/br_if.c
index 0363f1bdc401..3bb66508f07d 100644
--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
@@ -394,8 +394,7 @@ static int find_portno(struct net_bridge *br)
 	struct net_bridge_port *p;
 	unsigned long *inuse;
 
-	inuse = kcalloc(BITS_TO_LONGS(BR_MAX_PORTS), sizeof(unsigned long),
-			GFP_KERNEL);
+	inuse = bitmap_zalloc(BR_MAX_PORTS, GFP_KERNEL);
 	if (!inuse)
 		return -ENOMEM;
 
@@ -404,7 +403,7 @@ static int find_portno(struct net_bridge *br)
 		set_bit(p->port_no, inuse);
 	}
 	index = find_first_zero_bit(inuse, BR_MAX_PORTS);
-	kfree(inuse);
+	bitmap_free(inuse);
 
 	return (index >= BR_MAX_PORTS) ? -EXFULL : index;
 }
diff --git a/net/bridge/br_vlan.c b/net/bridge/br_vlan.c
index 7df269092103..bb6ba794864f 100644
--- a/net/bridge/br_vlan.c
+++ b/net/bridge/br_vlan.c
@@ -877,8 +877,7 @@ int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid)
 		return 0;
 	}
 
-	changed = kcalloc(BITS_TO_LONGS(BR_MAX_PORTS), sizeof(unsigned long),
-			  GFP_KERNEL);
+	changed = bitmap_zalloc(BR_MAX_PORTS, GFP_KERNEL);
 	if (!changed)
 		return -ENOMEM;
 
@@ -925,7 +924,7 @@ int __br_vlan_set_default_pvid(struct net_bridge *br, u16 pvid)
 	br->default_pvid = pvid;
 
 out:
-	kfree(changed);
+	bitmap_free(changed);
 	return err;
 
 err_port:
-- 
2.18.0

^ permalink raw reply related

* [PATCH][net-next] xdp: remove redundant variable 'headroom'
From: Colin King @ 2018-08-30 14:27 UTC (permalink / raw)
  To: David S . Miller, Björn Töpel, Jesper Dangaard Brouer,
	netdev
  Cc: kernel-janitors, linux-kernel

From: Colin Ian King <colin.king@canonical.com>

Variable 'headroom' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
variable ‘headroom’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
---
 net/core/xdp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/xdp.c b/net/core/xdp.c
index 654dbb19707e..4b2b194f4f1f 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -412,7 +412,7 @@ EXPORT_SYMBOL_GPL(xdp_attachment_setup);
 
 struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 {
-	unsigned int metasize, headroom, totsize;
+	unsigned int metasize, totsize;
 	void *addr, *data_to_copy;
 	struct xdp_frame *xdpf;
 	struct page *page;
@@ -420,7 +420,6 @@ struct xdp_frame *xdp_convert_zc_to_xdp_frame(struct xdp_buff *xdp)
 	/* Clone into a MEM_TYPE_PAGE_ORDER0 xdp_frame. */
 	metasize = xdp_data_meta_unsupported(xdp) ? 0 :
 		   xdp->data - xdp->data_meta;
-	headroom = xdp->data - xdp->data_hard_start;
 	totsize = xdp->data_end - xdp->data + metasize;
 
 	if (sizeof(*xdpf) + totsize > PAGE_SIZE)
-- 
2.17.1

^ permalink raw reply related

* Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
From: Jiri Pirko @ 2018-08-30 10:20 UTC (permalink / raw)
  To: Marcel Holtmann; +Cc: netdev, David S. Miller
In-Reply-To: <0F27B37D-0D2E-4BCB-B325-FD0527A2E1F3@holtmann.org>

Thu, Aug 30, 2018 at 12:13:40PM CEST, marcel@holtmann.org wrote:
>Hi Jiri,
>
>>>>> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
>>>>> file as DEVTYPE= information. To avoid any kind of race conditions
>>>>> between netlink messages and reading from sysfs, it is useful to add the
>>>>> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
>>>>> messages.
>>>>> 
>>>>> For network managing daemons that have to classify ARPHRD_ETHER network
>>>>> devices into different types (like Wireless LAN, Bluetooth etc.), this
>>>>> avoids the extra round trip to sysfs and parsing of the uevent file.
>>>>> 
>>>>> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
>>>>> ---
>>>>> include/uapi/linux/if_link.h |  2 ++
>>>>> net/core/rtnetlink.c         | 12 ++++++++++++
>>>>> 2 files changed, 14 insertions(+)
>>>>> 
>>>>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>>>>> index 43391e2d1153..781294972bb4 100644
>>>>> --- a/include/uapi/linux/if_link.h
>>>>> +++ b/include/uapi/linux/if_link.h
>>>>> @@ -166,6 +166,8 @@ enum {
>>>>> 	IFLA_NEW_IFINDEX,
>>>>> 	IFLA_MIN_MTU,
>>>>> 	IFLA_MAX_MTU,
>>>>> +	IFLA_DEVTYPE,		/* Name value from SET_NETDEV_DEVTYPE */
>>>> 
>>>> This is not something netdev-related. dev->dev.type is struct device_type.
>>>> This is a generic "device" thing. Incorrect to expose over
>>>> netdev-specific API. Please use "device" API for this.
>>> 
>>> it is not just "device" related since this is a sub-classification of ARPHRD_ETHER type as a wrote above. Don't get hang up that this information is part of struct device. The dev->dev.type contains strings like "wlan", "bluetooth", "wimax", "gadget" etc. so that you can tell what kind of Ethernet card you have.
>> 
>> I understand. But using dev->dev.type for that purpose seems like an
>> abuse. If this is subtype of netdev, it should not be stored in parent
>> device. I believe that you need to re-introduce this as a part of struct
>> net_device - preferably an enum - and that you can expose over rtnl (and
>> net-sysfs).
>
>it is not stored in the parent device. It is stored in the device of the netdev.

Yeah. That is what I ment. "Device struct" associated with the
netdevice.

>
>As Stephen said, there is no device API. And why would I add the same info again and bloat netdev?
>
>drivers/net/bonding/bond_main.c:        SET_NETDEV_DEVTYPE(bond_dev, &bond_type);
>drivers/net/geneve.c:   SET_NETDEV_DEVTYPE(dev, &geneve_type);
>drivers/net/macsec.c:   SET_NETDEV_DEVTYPE(dev, &macsec_type);
>drivers/net/ppp/ppp_generic.c:  SET_NETDEV_DEVTYPE(dev, &ppp_type);
>drivers/net/usb/hso.c:  SET_NETDEV_DEVTYPE(net, &hso_type);
>drivers/net/usb/usbnet.c:               SET_NETDEV_DEVTYPE(net, &wlan_type);
>drivers/net/usb/usbnet.c:               SET_NETDEV_DEVTYPE(net, &wwan_type);
>drivers/net/vxlan.c:    SET_NETDEV_DEVTYPE(dev, &vxlan_type);
>drivers/net/wimax/i2400m/usb.c: SET_NETDEV_DEVTYPE(net_dev, &i2400mu_type);
>drivers/net/wireless/intersil/prism54/islpci_dev.c:     SET_NETDEV_DEVTYPE(ndev, &wlan_type);
>drivers/staging/gdm724x/gdm_lte.c:              SET_NETDEV_DEVTYPE(net, &wwan_type);
>drivers/usb/gadget/function/u_ether.c:  SET_NETDEV_DEVTYPE(net, &gadget_type);
>drivers/usb/gadget/function/u_ether.c:  SET_NETDEV_DEVTYPE(net, &gadget_type);
>net/8021q/vlan_dev.c:   SET_NETDEV_DEVTYPE(dev, &vlan_type);
>net/bluetooth/6lowpan.c:        SET_NETDEV_DEVTYPE(netdev, &bt_type);
>net/bluetooth/bnep/core.c:      SET_NETDEV_DEVTYPE(dev, &bnep_type);
>net/bridge/br_device.c: SET_NETDEV_DEVTYPE(dev, &br_type);
>net/dsa/slave.c:        SET_NETDEV_DEVTYPE(slave_dev, &dsa_type);
>net/hsr/hsr_device.c:   SET_NETDEV_DEVTYPE(dev, &hsr_type);
>net/l2tp/l2tp_eth.c:    SET_NETDEV_DEVTYPE(dev, &l2tpeth_type);
>net/wireless/core.c:            SET_NETDEV_DEVTYPE(dev, &wiphy_type);
>
>So for all these devices we have to add duplicate information now?

The fact the things are done now it a wrong way does not justify
extending UAPI in the same wrong way. So yes, change them all if you
need it. The "ethernet-subtype" should not an attribute of "struct
device" but rather "struct net_device".

>
>I actually don't like that idea. I can rename it to IFLA_FOO or any other proposal, but adding the same information twice is pointless.

So don't expose the "struct device" attribute over rtnetlink. Use
"struct device"-specific interface for that.

>
>>> We can revive the patches for adding this information as IFLA_INFO_KIND, but last time they where reverted since NetworkManager did all the wrong decisions based on that. That part is fixed now and we can put that back and declare the sub-type classification of Ethernet device in two places. Or we just take the information that we added a long time ago for exactly this sub-classification and provide them to userspace via RTNL.
>> 
>> IFLA_INFO_KIND serves for different purpose. It is for drivers that
>> register rtnl_link_ops. I don't think it would be wise to mix it with
>> this.
>
>We could register rtnl_link_ops for these Ethernet drivers, but then we are duplicating the information as well.

That is not the purpose of "rtnl_link_ops".


>
>Regards
>
>Marcel
>

^ permalink raw reply

* Re: [PATCH bpf-next 11/11] samples/bpf: add -c/--copy -z/--zero-copy flags to xdpsock
From: Björn Töpel @ 2018-08-30 10:21 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Karlsson, Magnus, Magnus Karlsson, Duyck, Alexander H,
	Alexander Duyck, ast, Daniel Borkmann, Netdev, Brandeburg, Jesse,
	Singhai, Anjali, peter.waskiewicz.jr, Björn Töpel,
	michael.lundkvist, Willem de Bruijn, John Fastabend,
	Jakub Kicinski, neerav.parikh, MykytaI Iziumtsev, Francois Ozog
In-Reply-To: <20180829144446.72509a96@redhat.com>

Den ons 29 aug. 2018 kl 14:44 skrev Jesper Dangaard Brouer <brouer@redhat.com>:
>
> On Tue, 28 Aug 2018 14:44:35 +0200
> Björn Töpel <bjorn.topel@gmail.com> wrote:
>
> > From: Björn Töpel <bjorn.topel@intel.com>
> >
> > The -c/--copy -z/--zero-copy flags enforces either copy or zero-copy
> > mode.
>
> Nice, thanks for adding this.  It allows me to quickly test the
> difference between normal-copy vs zero-copy modes.
> (Kernel bpf-next without RETPOLINE).
>
> AF_XDP RX-drop:
>  Normal-copy mode: rx 13,070,318 pps - 76.5 ns
>  Zero-copy   mode: rx 26,132,328 pps - 38.3 ns
>
> Compare to XDP_DROP:  34,251,464 pps - 29.2 ns
>    XDP_DROP + read :  30,756,664 pps - 32.5 ns
>
> The normal-copy mode is surprisingly fast (and it works for every
> driver implemeting the regular XDP_REDIRECT action).  It is still
> faster to do in-kernel XDP_DROP than AF_XDP zero-copy mode dropping,
> which was expected given frames travel to a remote CPU before returned
> (don't think remote CPU reads payload?).  The gap in nanosec is
> actually quite small, thus I'm impressed by the SPSC-queue
> implementation working across these CPUs.
>
>
> AF_XDP layer2-fwd:
>  Normal-copy mode: rx  3,200,885   tx  3,200,892
>  Zero-copy   mode: rx 17,026,300   tx 17,026,269
>
> Compare to XDP_TX: rx 14,529,079   tx 14,529,850  - 68.82 ns
>      XDP_REDIRECT: rx 13,235,785   tx 13,235,784  - 75.55 ns
>
> The copy-mode is slow because it allocates SKBs internally (I do
> wonder if we could speed it up by using ndo_xdp_xmit + disable-BH).
> More intersting is that the zero-copy is faster than XDP_TX and
> XDP_REDIRECT. I think the speedup comes from avoiding some DMA mapping
> calls with ZC.
>
> Side-note: XDP_TX vs. REDIRECT: 75.55 - 68.82 = 6.73 ns.  The cost of
> going through the xdp_do_redirect_map core is actually quite small :-)
> (I have some micro optimizations that should help ~2ns).
>
>
> AF_XDP TX-only:
>  Normal-copy mode: tx  2,853,461 pps
>  Zero-copy   mode: tx 22,255,311 pps
>
> (There is not XDP mode that does TX to compare against)
>

Kudos for doing the in-depth benchmarking!


Thanks!
Björn

> --
> Best regards,
>   Jesper Dangaard Brouer
>   MSc.CS, Principal Kernel Engineer at Red Hat
>   LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: KASAN: stack-out-of-bounds Read in __schedule
From: Dmitry Vyukov @ 2018-08-30 14:19 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Alexander Potapenko, Alexei Starovoitov, netdev, Jan Kara,
	syzbot+45a34334c61a8ecf661d, Jan Kara, linux-ext4, LKML,
	syzkaller-bugs, Theodore Ts'o
In-Reply-To: <279041ab-d449-1bfb-a05d-2d8b0d5c601b@iogearbox.net>

On Thu, Aug 30, 2018 at 2:52 AM, Daniel Borkmann <daniel@iogearbox.net> wrote:
>>>>> Hello,
>>>>>
>>>>> syzbot found the following crash on:
>>>>>
>>>>> HEAD commit:    5b394b2ddf03 Linux 4.19-rc1
>>>>> git tree:       upstream
>>>>> console output: https://syzkaller.appspot.com/x/log.txt?x=14f4d8e1400000
>>>>> kernel config:  https://syzkaller.appspot.com/x/.config?x=49927b422dcf0b29
>>>>> dashboard link: https://syzkaller.appspot.com/bug?extid=45a34334c61a8ecf661d
>>>>> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>>>>> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=13127e5a400000
>>>>>
>>>>> IMPORTANT: if you fix the bug, please add the following tag to the commit:
>>>>> Reported-by: syzbot+45a34334c61a8ecf661d@syzkaller.appspotmail.com
>>>>>
>>>>> IPv6: ADDRCONF(NETDEV_UP): veth1: link is not ready
>>>>> IPv6: ADDRCONF(NETDEV_CHANGE): veth1: link becomes ready
>>>>> IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
>>>>> 8021q: adding VLAN 0 to HW filter on device team0
>>>>> ==================================================================
>>>>> BUG: KASAN: stack-out-of-bounds in schedule_debug kernel/sched/core.c:3285
>>>>> [inline]
>>>>> BUG: KASAN: stack-out-of-bounds in __schedule+0x1977/0x1df0
>>>>> kernel/sched/core.c:3395
>>>>> Read of size 8 at addr ffff8801ad090000 by task syz-executor0/4718
>>>>
>>>> Weird, can you please help me decipher this? So here KASAN complains about
>>>> wrong memory access in the scheduler.
>>
>> This looks like a result of a previous bad silent memory corruption.
>>
>> The KASAN report says there is a stack out-of-bounds in scheduler. And
>> that if followed by slab corruption report in another task.
>>
>> fs/jbd2/transaction.c happens to be the first meaningful file in this
>> crash, and so that's where it is attributed to.
>>
>> Rerunning the reproducer several times can maybe give some better
>> glues, or maybe not, maybe they all will look equally puzzling.
>>
>> This part of the repro looks familiar:
>>
>> r1 = bpf$MAP_CREATE(0x0, &(0x7f0000002e40)={0x12, 0x0, 0x4, 0x6e, 0x0,
>> 0x1}, 0x68)
>> bpf$MAP_UPDATE_ELEM(0x2, &(0x7f0000000180)={r1, &(0x7f0000000000),
>> &(0x7f0000000140)}, 0x20)
>>
>> We had exactly such consequences of a bug in bpf map very recently,
>> but that was claimed to be fixed. Maybe not completely?
>> +bpf maintainers
>
> Looks like syzbot found this in Linus tree with HEAD commit 5b394b2ddf03 ("Linux 4.19-rc1")
> one day later net PR got merged via 050cdc6c9501 ("Merge git://git.kernel.org/pub/...").
>
> This PR contained a couple of fixes I did on sockmap code during audit such as:
>
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b845c898b2f1ea458d5453f0fa1da6e2dfce3bb4
>
> Looking at the reproducer syzkaller found it contains:
>
>   r1 = bpf$MAP_CREATE(0x0, &(0x7f0000002e40)={0x12, 0x0, 0x4, 0x6e, 0x0, 0x1}, 0x68)
>                                                     ^^^
>
> So it found the crash with map type of sock hash and key size of 0x0 (which is invalid),
> where subsequent map update triggered the corruption. I just did a 'syz test' and it
> wasn't able to trigger the crash anymore.
>
> #syz fix: bpf, sockmap: fix sock_hash_alloc and reject zero-sized keys


Thanks.

I am again trying to figure out how/why this causes such bad failure modes.
Looking at sock_hash_ctx_update_elem it seems that all of
htab_map_hash/lookup_elem_raw/alloc_sock_hash_elem should handle
key_size=0 fine hashing/comparing/updating 0 bytes. Do you have any
ideas as to what could have gone wrong?

^ permalink raw reply

* Re: [PATCH] rtnetlink: expose value from SET_NETDEV_DEVTYPE via IFLA_DEVTYPE attribute
From: Marcel Holtmann @ 2018-08-30 10:13 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, David S. Miller
In-Reply-To: <20180830051230.GE2181@nanopsycho>

Hi Jiri,

>>>> The name value from SET_NETDEV_DEVTYPE only ended up in the uevent sysfs
>>>> file as DEVTYPE= information. To avoid any kind of race conditions
>>>> between netlink messages and reading from sysfs, it is useful to add the
>>>> same string as new IFLA_DEVTYPE attribute included in the RTM_NEWLINK
>>>> messages.
>>>> 
>>>> For network managing daemons that have to classify ARPHRD_ETHER network
>>>> devices into different types (like Wireless LAN, Bluetooth etc.), this
>>>> avoids the extra round trip to sysfs and parsing of the uevent file.
>>>> 
>>>> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
>>>> ---
>>>> include/uapi/linux/if_link.h |  2 ++
>>>> net/core/rtnetlink.c         | 12 ++++++++++++
>>>> 2 files changed, 14 insertions(+)
>>>> 
>>>> diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>>>> index 43391e2d1153..781294972bb4 100644
>>>> --- a/include/uapi/linux/if_link.h
>>>> +++ b/include/uapi/linux/if_link.h
>>>> @@ -166,6 +166,8 @@ enum {
>>>> 	IFLA_NEW_IFINDEX,
>>>> 	IFLA_MIN_MTU,
>>>> 	IFLA_MAX_MTU,
>>>> +	IFLA_DEVTYPE,		/* Name value from SET_NETDEV_DEVTYPE */
>>> 
>>> This is not something netdev-related. dev->dev.type is struct device_type.
>>> This is a generic "device" thing. Incorrect to expose over
>>> netdev-specific API. Please use "device" API for this.
>> 
>> it is not just "device" related since this is a sub-classification of ARPHRD_ETHER type as a wrote above. Don't get hang up that this information is part of struct device. The dev->dev.type contains strings like "wlan", "bluetooth", "wimax", "gadget" etc. so that you can tell what kind of Ethernet card you have.
> 
> I understand. But using dev->dev.type for that purpose seems like an
> abuse. If this is subtype of netdev, it should not be stored in parent
> device. I believe that you need to re-introduce this as a part of struct
> net_device - preferably an enum - and that you can expose over rtnl (and
> net-sysfs).

it is not stored in the parent device. It is stored in the device of the netdev.

As Stephen said, there is no device API. And why would I add the same info again and bloat netdev?

drivers/net/bonding/bond_main.c:        SET_NETDEV_DEVTYPE(bond_dev, &bond_type);
drivers/net/geneve.c:   SET_NETDEV_DEVTYPE(dev, &geneve_type);
drivers/net/macsec.c:   SET_NETDEV_DEVTYPE(dev, &macsec_type);
drivers/net/ppp/ppp_generic.c:  SET_NETDEV_DEVTYPE(dev, &ppp_type);
drivers/net/usb/hso.c:  SET_NETDEV_DEVTYPE(net, &hso_type);
drivers/net/usb/usbnet.c:               SET_NETDEV_DEVTYPE(net, &wlan_type);
drivers/net/usb/usbnet.c:               SET_NETDEV_DEVTYPE(net, &wwan_type);
drivers/net/vxlan.c:    SET_NETDEV_DEVTYPE(dev, &vxlan_type);
drivers/net/wimax/i2400m/usb.c: SET_NETDEV_DEVTYPE(net_dev, &i2400mu_type);
drivers/net/wireless/intersil/prism54/islpci_dev.c:     SET_NETDEV_DEVTYPE(ndev, &wlan_type);
drivers/staging/gdm724x/gdm_lte.c:              SET_NETDEV_DEVTYPE(net, &wwan_type);
drivers/usb/gadget/function/u_ether.c:  SET_NETDEV_DEVTYPE(net, &gadget_type);
drivers/usb/gadget/function/u_ether.c:  SET_NETDEV_DEVTYPE(net, &gadget_type);
net/8021q/vlan_dev.c:   SET_NETDEV_DEVTYPE(dev, &vlan_type);
net/bluetooth/6lowpan.c:        SET_NETDEV_DEVTYPE(netdev, &bt_type);
net/bluetooth/bnep/core.c:      SET_NETDEV_DEVTYPE(dev, &bnep_type);
net/bridge/br_device.c: SET_NETDEV_DEVTYPE(dev, &br_type);
net/dsa/slave.c:        SET_NETDEV_DEVTYPE(slave_dev, &dsa_type);
net/hsr/hsr_device.c:   SET_NETDEV_DEVTYPE(dev, &hsr_type);
net/l2tp/l2tp_eth.c:    SET_NETDEV_DEVTYPE(dev, &l2tpeth_type);
net/wireless/core.c:            SET_NETDEV_DEVTYPE(dev, &wiphy_type);

So for all these devices we have to add duplicate information now?

I actually don't like that idea. I can rename it to IFLA_FOO or any other proposal, but adding the same information twice is pointless.

>> We can revive the patches for adding this information as IFLA_INFO_KIND, but last time they where reverted since NetworkManager did all the wrong decisions based on that. That part is fixed now and we can put that back and declare the sub-type classification of Ethernet device in two places. Or we just take the information that we added a long time ago for exactly this sub-classification and provide them to userspace via RTNL.
> 
> IFLA_INFO_KIND serves for different purpose. It is for drivers that
> register rtnl_link_ops. I don't think it would be wise to mix it with
> this.

We could register rtnl_link_ops for these Ethernet drivers, but then we are duplicating the information as well.

Regards

Marcel

^ permalink raw reply

* [PATCH net-next] packet: add sockopt to ignore outgoing packets
From: Vincent Whitchurch @ 2018-08-30 10:01 UTC (permalink / raw)
  To: davem; +Cc: netdev, willemb, Vincent Whitchurch

Currently, the only way to ignore outgoing packets on a packet socket is
via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
if the filter run from packet_rcv() would reject them.  So the presence
of a packet socket on the interface takes away the benefits of
MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
cloned, but the cost for that is much lower.)

Add a socket option to allow AF_PACKET sockets to ignore outgoing
packets to solve this.  Note that the *BSDs already have something
similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.

The first intended user is lldpd.

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
---
 include/linux/netdevice.h      |  1 +
 include/uapi/linux/if_packet.h |  1 +
 net/core/dev.c                 |  3 +++
 net/packet/af_packet.c         | 15 +++++++++++++++
 4 files changed, 20 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ca5ab98053c8..8ef14d9edc58 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2317,6 +2317,7 @@ static inline struct sk_buff *call_gro_receive_sk(gro_receive_sk_t cb,
 
 struct packet_type {
 	__be16			type;	/* This is really htons(ether_type). */
+	bool			ignore_outgoing;
 	struct net_device	*dev;	/* NULL is wildcarded here	     */
 	int			(*func) (struct sk_buff *,
 					 struct net_device *,
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index 67b61d91d89b..467b654bd4c7 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -57,6 +57,7 @@ struct sockaddr_ll {
 #define PACKET_QDISC_BYPASS		20
 #define PACKET_ROLLOVER_STATS		21
 #define PACKET_FANOUT_DATA		22
+#define PACKET_IGNORE_OUTGOING		23
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
diff --git a/net/core/dev.c b/net/core/dev.c
index 325fc5088370..0addb4f0abfe 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1947,6 +1947,9 @@ static inline bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
 	if (!ptype->af_packet_priv || !skb->sk)
 		return false;
 
+	if (ptype->ignore_outgoing)
+		return true;
+
 	if (ptype->id_match)
 		return ptype->id_match(ptype, skb->sk);
 	else if ((struct sock *)ptype->af_packet_priv == skb->sk)
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 5610061e7f2e..37bbafad724a 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -3805,6 +3805,18 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 
 		return fanout_set_data(po, optval, optlen);
 	}
+	case PACKET_IGNORE_OUTGOING:
+	{
+		int val;
+
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		po->prot_hook.ignore_outgoing = !!val;
+		return 0;
+	}
 	case PACKET_TX_HAS_OFF:
 	{
 		unsigned int val;
@@ -3928,6 +3940,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 			((u32)po->fanout->flags << 24)) :
 		       0);
 		break;
+	case PACKET_IGNORE_OUTGOING:
+		val = po->prot_hook.ignore_outgoing;
+		break;
 	case PACKET_ROLLOVER_STATS:
 		if (!po->rollover)
 			return -EINVAL;
-- 
2.11.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox