Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v4 1/2] dt-bindings: net: add MDIO bus multiplexer driven by a regmap device
From: Andrew Lunn @ 2019-02-06 13:52 UTC (permalink / raw)
  To: Pankaj Bansal; +Cc: Florian Fainelli, netdev@vger.kernel.org
In-Reply-To: <20190206114523.8954-2-pankaj.bansal@nxp.com>

On Wed, Feb 06, 2019 at 06:20:39AM +0000, Pankaj Bansal wrote:
> Add support for an MDIO bus multiplexer controlled by a regmap
> device, like an FPGA.
> 
> Tested on a NXP LX2160AQDS board which uses the "QIXIS" FPGA
> attached to the i2c bus.
> 
> Signed-off-by: Pankaj Bansal <pankaj.bansal@nxp.com>
> ---
> 
> Notes:
>     V4:
>     - No change
>     V3:
>     - No change
>     V2:
>     - New file describing the device tree bindings for regmap controlled devices'
>       mdio mux
> 
>  .../bindings/net/mdio-mux-regmap.txt         | 167 +++++++++++++++++
>  1 file changed, 167 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt b/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt
> new file mode 100644
> index 000000000000..8968f317965f
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/mdio-mux-regmap.txt
> @@ -0,0 +1,167 @@
> +Properties for an MDIO bus multiplexer controlled by a regmap
> +
> +This is a special case of a MDIO bus multiplexer.  A regmap device,
> +like an FPGA, is used to control which child bus is connected.  The mdio-mux
> +node must be a child of the device that is controlled by a regmap.
> +The driver currently only supports devices with upto 32-bit registers.
> +
> +Required properties in addition to the generic multiplexer properties:
> +
> +- reg : integer, contains the offset of the register that controls the bus
> +	multiplexer. it can be 32 bit number.
> +
> +- mux-mask : integer, contains an 32 bit mask that specifies which
> +	bits in the register control the actual bus multiplexer.  The
> +	'reg' property of each child mdio-mux node must be constrained by
> +	this mask.

Hi Pankaj

Maybe you can address the comment about not having a compatible flag
by commenting that this is a device tree fragment, which is expected
to appear inside the binding of some other device. It is not a
standalone device.

	   Andrew

^ permalink raw reply

* Re: [PATCH 0/8] nic: thunderx: fix communication races betwen VF & PF
From: Vadim Lomovtsev @ 2019-02-06 14:18 UTC (permalink / raw)
  To: sgoutham@cavium.com, rric@kernel.org, davem@davemloft.net,
	linux-arm-kernel@lists.infradead.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org
  Cc: dnelson@redhat.com
In-Reply-To: <20190206101351.16744-1-vlomovtsev@marvell.com>

self-NACK here, some emails get's corrupted for some reasons,
along with some typos found.

sorry for inconvenience.

Vadim

On Wed, Feb 06, 2019 at 10:13:54AM +0000, Vadim Lomovtsev wrote:
> The ThunderX CN88XX NIC Virtual Function driver uses mailbox interface 
> to communicate to physical function driver. Each of VF has it's own pair
> of mailbox registers to read from and write to. The mailbox registers
> has no protection from possible races, so it has to be implemented
> at software side.
> 
> After long term testing by loop of 'ip link set <ifname> up/down'
> command it was found that there are two possible scenarios when
> race condition appears:
>  1. VF receives link change message from PF and VF send RX mode
> configuration message to PF in the same time from separate thread.
>  2. PF receives RX mode configuration from VF and in the same time,
> in separate thread PF detects link status change and sends appropriate
> message to particular VF.
> 
> Both cases leads to mailbox data to be rewritten, NIC VF messaging control
> data to be updated incorrectly and communication sequence gets broken.
> 
> This patch series is to address race condition with VF & PF communication.
> 
> Vadim Lomovtsev (8):
>   net: thunderx: correct typo in macro name
>   net: thunderx: replace global nicvf_rx_mode_wq work queue for all VFs
>     to private for each of them.
>   net: thunderx: make CFG_DONE message to run through generic send-ack
>     sequence
>   net: thunderx: add nicvf_send_msg_to_pf result check for
>     set_rx_mode_task
>   net: thunderx: rework xcast message structure to make it fit into 64
>     bit
>   net: thunderx: add mutex to protect mailbox from concurrent calls for
>     same VF
>   net: thunderx: implement helpers to read mailbox IRQ status
>   net: thunderx: check status of mailbox IRQ before sending a message
> 
>  drivers/net/ethernet/cavium/thunder/nic.h     | 12 +--
>  .../net/ethernet/cavium/thunder/nic_main.c    | 58 +++++++------
>  .../net/ethernet/cavium/thunder/nicvf_main.c  | 82 +++++++++++++------
>  .../ethernet/cavium/thunder/nicvf_queues.c    | 14 ++++
>  .../ethernet/cavium/thunder/nicvf_queues.h    |  1 +
>  .../net/ethernet/cavium/thunder/thunder_bgx.c |  2 +-
>  .../net/ethernet/cavium/thunder/thunder_bgx.h |  2 +-
>  7 files changed, 112 insertions(+), 59 deletions(-)
> 
> -- 
> 2.17.2

^ permalink raw reply

* Re: [PATCH net-next v3 00/12] net: Introduce ndo_get_port_parent_id()
From: Ido Schimmel @ 2019-02-06 14:24 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev@vger.kernel.org, David S. Miller, open list,
	open list:MELLANOX MLX5 core VPI driver,
	open list:NETRONOME ETHERNET DRIVERS, open list:STAGING SUBSYSTEM,
	moderated list:ETHERNET BRIDGE
In-Reply-To: <20190206075136.GA2283@splinter>

On Wed, Feb 06, 2019 at 09:51:36AM +0200, Ido Schimmel wrote:
> On Tue, Feb 05, 2019 at 03:53:14PM -0800, Florian Fainelli wrote:
> > Hi all,
> > 
> > Based on discussion with Ido and feedback from Jakub there are clearly
> > two classes of users that implement SWITCHDEV_ATTR_ID_PORT_PARENT_ID:
> > 
> > - PF/VF drivers which typically only implement return the port's parent
> >   ID, yet have to implement switchdev_port_attr_get() just for that
> > 
> > - Ethernet switch drivers: mlxsw, ocelot, DSA, etc. which implement more
> >   attributes which we want to be able to eventually veto in the context
> >   of the caller, thus making them candidates for using a blocking notifier
> >   chain
> 
> Florian, patches look good to me. I'm going to build a kernel with these
> patches and run some tests. Will report later today.

Ran most of our tests and nothing exploded. Thanks!

^ permalink raw reply

* Re: [PATCH bpf-next 0/5] net: xdp: allow offload to coexist with generic
From: Daniel Borkmann @ 2019-02-06 14:44 UTC (permalink / raw)
  To: Jakub Kicinski, alexei.starovoitov; +Cc: netdev, oss-drivers
In-Reply-To: <20190206040324.22109-1-jakub.kicinski@netronome.com>

On 02/06/2019 05:03 AM, Jakub Kicinski wrote:
> Hi!
> 
> Offloaded and native/driver XDP programs can already coexist.
> Allow offload and generic hook to coexist as well, there seem
> to be no reason why not to do so.
> 
> Jakub Kicinski (5):
>   selftests/bpf: fix the expected messages
>   net: xdp: allow generic and driver XDP on one interface
>   selftests/bpf: print traceback when test fails
>   selftests/bpf: add test for mixing generic and offload XDP
>   selftests/bpf: test reading the offloaded program
> 
>  net/core/dev.c                              |  10 +-
>  tools/testing/selftests/bpf/test_offload.py | 135 ++++++++++++--------
>  2 files changed, 85 insertions(+), 60 deletions(-)
> 

Applied, thanks!

^ permalink raw reply

* Re: [PATCH bpf-next] tools: bpftool: doc, fix incorrect text
From: Daniel Borkmann @ 2019-02-06 14:45 UTC (permalink / raw)
  To: Prashant Bhole, Alexei Starovoitov; +Cc: netdev
In-Reply-To: <20190206014723.1424-1-bhole_prashant_q7@lab.ntt.co.jp>

On 02/06/2019 02:47 AM, Prashant Bhole wrote:
> Documentation about cgroup, feature, prog uses wrong header
> 'MAP COMMANDS' while listing commands. This patch corrects the header
> in respective doc files.
> 
> Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>

Applied, thanks!

^ permalink raw reply

* Re: [PATCH v2] bpf: test_maps: Avoid possible out of bound access
From: Daniel Borkmann @ 2019-02-06 14:52 UTC (permalink / raw)
  To: Breno Leitao, netdev; +Cc: ast, davem
In-Reply-To: <1549386754-8549-1-git-send-email-leitao@debian.org>

On 02/05/2019 06:12 PM, Breno Leitao wrote:
> When compiling test_maps selftest with GCC-8, it warns that an array might
> be indexed with a negative value, which could cause a negative out of bound
> access, depending on parameters of the function. This is the GCC-8 warning:
> 
> 	gcc -Wall -O2 -I../../../include/uapi -I../../../lib -I../../../lib/bpf -I../../../../include/generated -DHAVE_GENHDR -I../../../include    test_maps.c /home/breno/Devel/linux/tools/testing/selftests/bpf/libbpf.a -lcap -lelf -lrt -lpthread -o /home/breno/Devel/linux/tools/testing/selftests/bpf/test_maps
> 	In file included from test_maps.c:16:
> 	test_maps.c: In function ‘run_all_tests’:
> 	test_maps.c:1079:10: warning: array subscript -1 is below array bounds of ‘pid_t[<Ube20> + 1]’ [-Warray-bounds]
> 	   assert(waitpid(pid[i], &status, 0) == pid[i]);
> 		  ^~~~~~~~~~~~~~~~~~~~~~~~~~~
> 	test_maps.c:1059:6: warning: array subscript -1 is below array bounds of ‘pid_t[<Ube20> + 1]’ [-Warray-bounds]
> 	   pid[i] = fork();
> 	   ~~~^~~
> 
> This patch simply guarantees that the task(s) variables are unsigned, thus,
> they could never be a negative number, hence avoiding an out of bound access
> warning.
> 
> Signed-off-by: Breno Leitao <leitao@debian.org>

Given this is a false positive anyway and we only want to reduce gcc noise, I've
applied it to bpf-next, thanks Breno!

^ permalink raw reply

* Re: [PATCH net 1/2] geneve: should not call rt6_lookup() when ipv6 was disabled
From: Stefano Brivio @ 2019-02-06 14:58 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, David Miller
In-Reply-To: <20190206125111.5286-2-liuhangbin@gmail.com>

Hi Hangbin,

On Wed,  6 Feb 2019 20:51:10 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> When we add a new GENEVE device with IPv6 remote, checking only for
> IS_ENABLED(CONFIG_IPV6) is not enough as we may disable IPv6 in kernel
> cmd(ipv6.disable=1), which will cause a NULL pointer dereference.
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  drivers/net/geneve.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 58bbba8582b0..0658715581e3 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1512,6 +1512,10 @@ static void geneve_link_config(struct net_device *dev,
>  	}
>  #if IS_ENABLED(CONFIG_IPV6)
>  	case AF_INET6: {
> +		struct inet6_dev *idev = in6_dev_get(dev);
> +		if (!idev)
> +			break;
> +
>  		struct rt6_info *rt = rt6_lookup(geneve->net,
>  						 &info->key.u.ipv6.dst, NULL, 0,
>  						 NULL, 0);

You're mixing declarations and code here, ISO C90 forbids it. You
could rather declare:

		struct inet6_dev *idev = in6_dev_get(dev);
		struct rt6_info *rt;

then check idev, and then call rt6_lookup().

> @@ -1519,6 +1523,8 @@ static void geneve_link_config(struct net_device *dev,
>  		if (rt && rt->dst.dev)
>  			ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV6_HLEN;
>  		ip6_rt_put(rt);
> +
> +		in6_dev_put(idev);

I think it would be better to put this right after the check on idev,
mostly for readability, but also, marginally, to reduce the scope of
the reference count bump.

-- 
Stefano

^ permalink raw reply

* Re: [PATCH net 2/2] sit: check if IPv6 enabled before call ip6_err_gen_icmpv6_unreach()
From: Stefano Brivio @ 2019-02-06 15:00 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, David Miller
In-Reply-To: <20190206125111.5286-3-liuhangbin@gmail.com>

On Wed,  6 Feb 2019 20:51:11 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> If we disabled IPv6 from kernel boot up cmd(ipv6.disable=1), we should not
> call ip6_err_gen_icmpv6_unreach().
> 
> Reproducer:
> ip link add sit1 type sit local 10.10.0.1 remote 10.10.1.1 ttl 1
> ip link set sit1 up
> ip addr add 192.168.0.1/24 dev sit1
> ping 192.168.0.2
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  net/ipv6/sit.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
> index 1e03305c0549..e43fbac0fd1a 100644
> --- a/net/ipv6/sit.c
> +++ b/net/ipv6/sit.c
> @@ -493,6 +493,7 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
>  	const int type = icmp_hdr(skb)->type;
>  	const int code = icmp_hdr(skb)->code;
>  	unsigned int data_len = 0;
> +	struct inet6_dev *idev;
>  	struct ip_tunnel *t;
>  	int sifindex;
>  	int err;
> @@ -546,8 +547,13 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
>  	}
>  
>  	err = 0;
> -	if (!ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4, type, data_len))
> +
> +	idev = in6_dev_get(skb->dev);
> +	if (idev &&
> +	    !ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4, type, data_len)) {
> +		in6_dev_put(idev);
>  		goto out;
> +	}

You are leaking a reference if idev && ip6_err_gen_icmpv6_unreach(...).
You should call in6_dev_put(idev) whenever idev is not NULL instead.

-- 
Stefano

^ permalink raw reply

* Re: [PATCH net-next 5/6] net: marvell: neta: add comphy support
From: Maxime Chevallier @ 2019-02-06 15:09 UTC (permalink / raw)
  To: Russell King
  Cc: Andrew Lunn, Gregory Clement, Jason Cooper,
	Kishon Vijay Abraham I, Sebastian Hesselbarth, Thomas Petazzoni,
	devicetree, linux-arm-kernel, netdev, David S. Miller
In-Reply-To: <E1grLTv-00061v-Tw@rmk-PC.armlinux.org.uk>

Hello Russell,

On Wed, 06 Feb 2019 11:35:07 +0000
Russell King <rmk+kernel@armlinux.org.uk> wrote:


>+	if (pp->comphy) {
>+		enum phy_mode mode = PHY_MODE_INVALID;
>+
>+		switch (state->interface) {
>+		case PHY_INTERFACE_MODE_SGMII:
>+		case PHY_INTERFACE_MODE_1000BASEX:
>+			mode = PHY_MODE_SGMII;

PHY_MODE_SGMII no longer exists, the new PHY_MODE_ETHERNET must be used
with the correct submode now, so this doesn't build unfortunately.

>+			break;
>+		case PHY_INTERFACE_MODE_2500BASEX:
>+			mode = PHY_MODE_2500SGMII;

Same here.

Thanks,

Maxime

^ permalink raw reply

* Re:Re: [PATCH net] net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
From: albin_yang @ 2019-02-06 15:13 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: netdev, David S. Miller, yang.wei9
In-Reply-To: <alpine.LFD.2.21.1902051816130.9991@eddie.linux-mips.org>

At 2019-02-06 03:57:34, "Maciej W. Rozycki" <macro@linux-mips.org> wrote:
>On Wed, 6 Feb 2019, Yang Wei wrote:
>
>Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
>
> It looks to me the driver has to be reviewed WRT `dev_kfree_skb' vs 
>`kfree_skb' use too.  I'll have a look into it unless you are happy to do 
>that.
>
> Thanks for your contribution!
>
Hi, Maciej

I think kfree_skb() should be called when skb is dropped by network drivers. 
I found that many network drivers have such problems that not use 
kfree_skb/consume_skb properly. Maybe because 'drop profiles' appears later 
than many network drivers?

The problem 'dev_kfree_skb' vs 'kfree_skb' should be fixed. I think that is 
lower priority than the current patch. Network driver should not perturb 
drop profiles when skb successful xmit.

Thanks
Yang

^ permalink raw reply

* [PATCH net] net: dsa: Fix NULL checking in dsa_slave_set_eee()
From: Dan Carpenter @ 2019-02-06 15:35 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vivien Didelot, Florian Fainelli, David S. Miller, netdev,
	kernel-janitors

This function can't succeed if dp->pl is NULL.  It will Oops inside the
call to return phylink_ethtool_get_eee(dp->pl, e);

Fixes: 1be52e97ed3e ("dsa: slave: eee: Allow ports to use phylink")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 net/dsa/slave.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 91de3a663226..9384f95e04d5 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -639,7 +639,7 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 	int ret;
 
 	/* Port's PHY and MAC both need to be EEE capable */
-	if (!dev->phydev && !dp->pl)
+	if (!dev->phydev || !dp->pl)
 		return -ENODEV;
 
 	if (!ds->ops->set_mac_eee)
@@ -659,7 +659,7 @@ static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
 	int ret;
 
 	/* Port's PHY and MAC both need to be EEE capable */
-	if (!dev->phydev && !dp->pl)
+	if (!dev->phydev || !dp->pl)
 		return -ENODEV;
 
 	if (!ds->ops->get_mac_eee)
-- 
2.17.1


^ permalink raw reply related

* RE: [PATCH net-next 1/4] dpaa2-eth: Use a single page per Rx buffer
From: Ioana Ciocoi Radulescu @ 2019-02-06 15:36 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Ioana Ciornei,
	brouer@redhat.com
In-Reply-To: <20190205093513.GA31466@apalos>

> -----Original Message-----
> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Sent: Tuesday, February 5, 2019 11:35 AM
> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
> Cc: netdev@vger.kernel.org; davem@davemloft.net; Ioana Ciornei
> <ioana.ciornei@nxp.com>; brouer@redhat.com
> Subject: Re: [PATCH net-next 1/4] dpaa2-eth: Use a single page per Rx buffer
> 
> Hi Ioana,
> 
> Can you share any results on XDP (XDP_DROP is usually useful for the
> hardware capabilities).

XDP numbers are pretty much the same as before this patch:

On a LS2088A with A72 cores @2GHz (numbers in Mpps):
				1core		8cores
-------------------------------------------------------------------------
XDP_DROP (no touching data)	5.37		29.6 (linerate)
XDP_DROP (xdp1 sample)	3.14		24.22
XDP_TX(xdp2 sample)		1.65		13.08

> 
> > Instead of allocating page fragments via the network stack,
> > use the page allocator directly. For now, we consume one page
> > for each Rx buffer.
> >
> > With the new memory model we are free to consider adding more
> > XDP support.
> >
> > Performance decreases slightly in some IP forwarding cases.
> > No visible effect on termination traffic. The driver memory
> > footprint increases as a result of this change, but it is
> > still small enough to not really matter.
> >
> > Another side effect is that now Rx buffer alignment requirements
> > are naturally satisfied without any additional actions needed.
> > Remove alignment related code, except in the buffer layout
> > information conveyed to MC, as hardware still needs to know the
> > alignment value we guarantee.
> >
> > Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
> > Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
> > ---
> >  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 61 +++++++++++++-
> ----------
> >  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 21 +++-----
> >  2 files changed, 38 insertions(+), 44 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> > index 04925c7..6e58de6 100644
> > --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> > +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> > @@ -86,16 +86,16 @@ static void free_rx_fd(struct dpaa2_eth_priv *priv,
> >  	for (i = 1; i < DPAA2_ETH_MAX_SG_ENTRIES; i++) {
> >  		addr = dpaa2_sg_get_addr(&sgt[i]);
> >  		sg_vaddr = dpaa2_iova_to_virt(priv->iommu_domain, addr);
> > -		dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > +		dma_unmap_page(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> >
> > -		skb_free_frag(sg_vaddr);
> > +		free_pages((unsigned long)sg_vaddr, 0);
> >  		if (dpaa2_sg_is_final(&sgt[i]))
> >  			break;
> >  	}
> >
> >  free_buf:
> > -	skb_free_frag(vaddr);
> > +	free_pages((unsigned long)vaddr, 0);
> >  }
> >
> >  /* Build a linear skb based on a single-buffer frame descriptor */
> > @@ -109,7 +109,7 @@ static struct sk_buff *build_linear_skb(struct
> dpaa2_eth_channel *ch,
> >
> >  	ch->buf_count--;
> >
> > -	skb = build_skb(fd_vaddr, DPAA2_ETH_SKB_SIZE);
> > +	skb = build_skb(fd_vaddr, DPAA2_ETH_RX_BUF_RAW_SIZE);
> >  	if (unlikely(!skb))
> >  		return NULL;
> >
> > @@ -144,19 +144,19 @@ static struct sk_buff *build_frag_skb(struct
> dpaa2_eth_priv *priv,
> >  		/* Get the address and length from the S/G entry */
> >  		sg_addr = dpaa2_sg_get_addr(sge);
> >  		sg_vaddr = dpaa2_iova_to_virt(priv->iommu_domain,
> sg_addr);
> > -		dma_unmap_single(dev, sg_addr,
> DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > +		dma_unmap_page(dev, sg_addr, DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> >
> >  		sg_length = dpaa2_sg_get_len(sge);
> >
> >  		if (i == 0) {
> >  			/* We build the skb around the first data buffer */
> > -			skb = build_skb(sg_vaddr, DPAA2_ETH_SKB_SIZE);
> > +			skb = build_skb(sg_vaddr,
> DPAA2_ETH_RX_BUF_RAW_SIZE);
> >  			if (unlikely(!skb)) {
> >  				/* Free the first SG entry now, since we
> already
> >  				 * unmapped it and obtained the virtual
> address
> >  				 */
> > -				skb_free_frag(sg_vaddr);
> > +				free_pages((unsigned long)sg_vaddr, 0);
> >
> >  				/* We still need to subtract the buffers used
> >  				 * by this FD from our software counter
> > @@ -211,9 +211,9 @@ static void free_bufs(struct dpaa2_eth_priv *priv,
> u64 *buf_array, int count)
> >
> >  	for (i = 0; i < count; i++) {
> >  		vaddr = dpaa2_iova_to_virt(priv->iommu_domain,
> buf_array[i]);
> > -		dma_unmap_single(dev, buf_array[i],
> DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > -		skb_free_frag(vaddr);
> > +		dma_unmap_page(dev, buf_array[i],
> DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> > +		free_pages((unsigned long)vaddr, 0);
> 
> I got some hardware/XDP related questions since i have no idea how the
> hardware
> works. From what i understand on the code, you are trying to manage the
> buffer
> from the hw buffer manager and if the fails you unmap and free the pages.
> Since XDP relies on recycling for speed (hint check xdp_return_buff()), is it
> possible to recycle the buffer if the hw fails ?

We're still looking for a way to use the page pool mechanism in our driver.
Our hardware buffer pool is shared between all Rx queues, so it's not that
straightforward.

This error path in particular is very unlikely to ever get exercised, it
wouldn't have an impact on performance.

Thanks,
Ioana

> 
> >  	}
> >  }
> >
> > @@ -378,16 +378,16 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv
> *priv,
> >  			return;
> >  		}
> >
> > -		dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > +		dma_unmap_page(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> >  		skb = build_linear_skb(ch, fd, vaddr);
> >  	} else if (fd_format == dpaa2_fd_sg) {
> >  		WARN_ON(priv->xdp_prog);
> >
> > -		dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > +		dma_unmap_page(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> >  		skb = build_frag_skb(priv, ch, buf_data);
> > -		skb_free_frag(vaddr);
> > +		free_pages((unsigned long)vaddr, 0);
> >  		percpu_extras->rx_sg_frames++;
> >  		percpu_extras->rx_sg_bytes += dpaa2_fd_get_len(fd);
> >  	} else {
> > @@ -903,7 +903,7 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
> >  {
> >  	struct device *dev = priv->net_dev->dev.parent;
> >  	u64 buf_array[DPAA2_ETH_BUFS_PER_CMD];
> > -	void *buf;
> > +	struct page *page;
> >  	dma_addr_t addr;
> >  	int i, err;
> >
> > @@ -911,14 +911,16 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
> >  		/* Allocate buffer visible to WRIOP + skb shared info +
> >  		 * alignment padding
> >  		 */
> > -		buf = napi_alloc_frag(dpaa2_eth_buf_raw_size(priv));
> > -		if (unlikely(!buf))
> > +		/* allocate one page for each Rx buffer. WRIOP sees
> > +		 * the entire page except for a tailroom reserved for
> > +		 * skb shared info
> > +		 */
> > +		page = dev_alloc_pages(0);
> > +		if (!page)
> >  			goto err_alloc;
> >
> > -		buf = PTR_ALIGN(buf, priv->rx_buf_align);
> > -
> > -		addr = dma_map_single(dev, buf,
> DPAA2_ETH_RX_BUF_SIZE,
> > -				      DMA_BIDIRECTIONAL);
> > +		addr = dma_map_page(dev, page, 0,
> DPAA2_ETH_RX_BUF_SIZE,
> > +				    DMA_BIDIRECTIONAL);
> >  		if (unlikely(dma_mapping_error(dev, addr)))
> >  			goto err_map;
> >
> > @@ -926,7 +928,7 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
> >
> >  		/* tracing point */
> >  		trace_dpaa2_eth_buf_seed(priv->net_dev,
> > -					 buf, dpaa2_eth_buf_raw_size(priv),
> > +					 page,
> DPAA2_ETH_RX_BUF_RAW_SIZE,
> >  					 addr, DPAA2_ETH_RX_BUF_SIZE,
> >  					 bpid);
> >  	}
> > @@ -948,7 +950,7 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
> >  	return i;
> >
> >  err_map:
> > -	skb_free_frag(buf);
> > +	__free_pages(page, 0);
> >  err_alloc:
> >  	/* If we managed to allocate at least some buffers,
> >  	 * release them to hardware
> > @@ -2134,6 +2136,7 @@ static int set_buffer_layout(struct
> dpaa2_eth_priv *priv)
> >  {
> >  	struct device *dev = priv->net_dev->dev.parent;
> >  	struct dpni_buffer_layout buf_layout = {0};
> > +	u16 rx_buf_align;
> >  	int err;
> >
> >  	/* We need to check for WRIOP version 1.0.0, but depending on the
> MC
> > @@ -2142,9 +2145,9 @@ static int set_buffer_layout(struct
> dpaa2_eth_priv *priv)
> >  	 */
> >  	if (priv->dpni_attrs.wriop_version == DPAA2_WRIOP_VERSION(0, 0,
> 0) ||
> >  	    priv->dpni_attrs.wriop_version == DPAA2_WRIOP_VERSION(1, 0,
> 0))
> > -		priv->rx_buf_align = DPAA2_ETH_RX_BUF_ALIGN_REV1;
> > +		rx_buf_align = DPAA2_ETH_RX_BUF_ALIGN_REV1;
> >  	else
> > -		priv->rx_buf_align = DPAA2_ETH_RX_BUF_ALIGN;
> > +		rx_buf_align = DPAA2_ETH_RX_BUF_ALIGN;
> >
> >  	/* tx buffer */
> >  	buf_layout.private_data_size = DPAA2_ETH_SWA_SIZE;
> > @@ -2184,7 +2187,7 @@ static int set_buffer_layout(struct
> dpaa2_eth_priv *priv)
> >  	/* rx buffer */
> >  	buf_layout.pass_frame_status = true;
> >  	buf_layout.pass_parser_result = true;
> > -	buf_layout.data_align = priv->rx_buf_align;
> > +	buf_layout.data_align = rx_buf_align;
> >  	buf_layout.data_head_room = dpaa2_eth_rx_head_room(priv);
> >  	buf_layout.private_data_size = 0;
> >  	buf_layout.options = DPNI_BUF_LAYOUT_OPT_PARSER_RESULT |
> > diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
> b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
> > index 31fe486..da3d039 100644
> > --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
> > +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
> > @@ -63,9 +63,11 @@
> >  /* Hardware requires alignment for ingress/egress buffer addresses */
> >  #define DPAA2_ETH_TX_BUF_ALIGN		64
> >
> > -#define DPAA2_ETH_RX_BUF_SIZE		2048
> > -#define DPAA2_ETH_SKB_SIZE \
> > -	(DPAA2_ETH_RX_BUF_SIZE + SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info)))
> > +#define DPAA2_ETH_RX_BUF_RAW_SIZE	PAGE_SIZE
> > +#define DPAA2_ETH_RX_BUF_TAILROOM \
> > +	SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
> > +#define DPAA2_ETH_RX_BUF_SIZE \
> > +	(DPAA2_ETH_RX_BUF_RAW_SIZE -
> DPAA2_ETH_RX_BUF_TAILROOM)
> >
> >  /* Hardware annotation area in RX/TX buffers */
> >  #define DPAA2_ETH_RX_HWA_SIZE		64
> > @@ -343,7 +345,6 @@ struct dpaa2_eth_priv {
> >  	bool rx_tstamp; /* Rx timestamping enabled */
> >
> >  	u16 tx_qdid;
> > -	u16 rx_buf_align;
> >  	struct fsl_mc_io *mc_io;
> >  	/* Cores which have an affine DPIO/DPCON.
> >  	 * This is the cpu set on which Rx and Tx conf frames are processed
> > @@ -418,15 +419,6 @@ enum dpaa2_eth_rx_dist {
> >  	DPAA2_ETH_RX_DIST_CLS
> >  };
> >
> > -/* Hardware only sees DPAA2_ETH_RX_BUF_SIZE, but the skb built
> around
> > - * the buffer also needs space for its shared info struct, and we need
> > - * to allocate enough to accommodate hardware alignment restrictions
> > - */
> > -static inline unsigned int dpaa2_eth_buf_raw_size(struct dpaa2_eth_priv
> *priv)
> > -{
> > -	return DPAA2_ETH_SKB_SIZE + priv->rx_buf_align;
> > -}
> > -
> >  static inline
> >  unsigned int dpaa2_eth_needed_headroom(struct dpaa2_eth_priv *priv,
> >  				       struct sk_buff *skb)
> > @@ -451,8 +443,7 @@ unsigned int dpaa2_eth_needed_headroom(struct
> dpaa2_eth_priv *priv,
> >   */
> >  static inline unsigned int dpaa2_eth_rx_head_room(struct dpaa2_eth_priv
> *priv)
> >  {
> > -	return priv->tx_data_offset + DPAA2_ETH_TX_BUF_ALIGN -
> > -	       DPAA2_ETH_RX_HWA_SIZE;
> > +	return priv->tx_data_offset - DPAA2_ETH_RX_HWA_SIZE;
> >  }
> >
> >  int dpaa2_eth_set_hash(struct net_device *net_dev, u64 flags);
> > --
> > 2.7.4
> >

^ permalink raw reply

* Re: [RFC PATCH 0/4] Initial support for allocating BPF JITs in vmalloc for x86
From: Daniel Borkmann @ 2019-02-06 15:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Edgecombe, Rick P
  Cc: kristen@linux.intel.com, netdev@vger.kernel.org,
	ard.biesheuvel@linaro.org, Hansen, Dave
In-Reply-To: <730f7250-0b01-e2d5-6253-e15541ad4298@fb.com>

On 02/06/2019 02:40 AM, Alexei Starovoitov wrote:
> On 2/5/19 5:11 PM, Edgecombe, Rick P wrote:
>> On Wed, 2019-02-06 at 00:35 +0000, Alexei Starovoitov wrote:
>>> On 2/5/19 2:50 PM, Rick Edgecombe wrote:
>>>> This introduces a new capability for BPF program JIT's to be located in
>>>> vmalloc
>>>> space on x86_64. This can serve as a backup area for
>>>> CONFIG_BPF_JIT_ALWAYS_ON in
>>>> case an unprivileged app uses all of the module space allowed by
>>>> bpf_jit_limit.
>>>>
>>>> In order to allow for calls from the increased distance of vmalloc from
>>>> kernel/module space, relative calls are emitted as full indirect calls if
>>>> the
>>>> maximum relative call distance is exceeded. So the resulting performance of
>>>> call
>>>> BPF instructions in this case is similar to the BPF interpreter.
>>>
>>> If I read this correctly the patches introduce retpoline overhead
>>> to direct function call because JITed progs are more than 32-bit apart
>>> and they're far away only because of dubious security concern ?
>>> Nack.
>>>
>> There really isn't any overhead, because they are only far away if the module
>> space is full, or the bpf_jit_limit is exceeded for non-admin. So cases today
>> when insertions would succeed it emits the same code, but cases where the
>> insertion would fail due to lack of space, it now at least works with the
>> described performance.
> 
> I disagree with the problem statement.
> x86 classic BPF jit has been around forever and no one
> complained that _unprivileged_ bpf progs exhaust module space.
> With bpf_jit_limit we got an extra knob to close this
> remote possibility of an attack.
> It's more than enough.

Agree here, also for cap_sys_admin programs there is kind of unpredictability
when we cross the condition that module space is full and thus fall back having
to emit retpolines for every call, which *is* overhead and hard to debug behavior
wrt performance regression from user pov. So I'd rather have it fail and enlarge
the module space instead, for example.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] mlx4_ib: Increase the timeout for CM cache
From: Håkon Bugge @ 2019-02-06 15:40 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: Jason Gunthorpe, netdev, OFED mailing list, rds-devel,
	linux-kernel
In-Reply-To: <13750147-482A-4F90-976A-033C52DCF85E@oracle.com>



> On 6 Feb 2019, at 09:50, Håkon Bugge <haakon.bugge@oracle.com> wrote:
> 
> 
> 
>> On 5 Feb 2019, at 23:36, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> 
>> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
>>> Using CX-3 virtual functions, either from a bare-metal machine or
>>> pass-through from a VM, MAD packets are proxied through the PF driver.
>>> 
>>> Since the VMs have separate name spaces for MAD Transaction Ids
>>> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping
>>> in a cache.
>>> 
>>> Following the RDMA CM protocol, it is clear when an entry has to
>>> evicted form the cache. But life is not perfect, remote peers may die
>>> or be rebooted. Hence, it's a timeout to wipe out a cache entry, when
>>> the PF driver assumes the remote peer has gone.
>>> 
>>> We have experienced excessive amount of DREQ retries during fail-over
>>> testing, when running with eight VMs per database server.
>>> 
>>> The problem has been reproduced in a bare-metal system using one VM
>>> per physical node. In this environment, running 256 processes in each
>>> VM, each process uses RDMA CM to create an RC QP between himself and
>>> all (256) remote processes. All in all 16K QPs.
>>> 
>>> When tearing down these 16K QPs, excessive DREQ retries (and
>>> duplicates) are observed. With some cat/paste/awk wizardry on the
>>> infiniband_cm sysfs, we observe:
>>> 
>>>     dreq:       5007
>>> cm_rx_msgs:
>>>     drep:       3838
>>>     dreq:      13018
>>>      rep:       8128
>>>      req:       8256
>>>      rtu:       8256
>>> cm_tx_msgs:
>>>     drep:       8011
>>>     dreq:      68856
>>>      rep:       8256
>>>      req:       8128
>>>      rtu:       8128
>>> cm_tx_retries:
>>>     dreq:      60483
>>> 
>>> Note that the active/passive side is distributed.
>>> 
>>> Enabling pr_debug in cm.c gives tons of:
>>> 
>>> [171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
>>> 1,sl_cm_id: 0xd393089f} is NULL!
>>> 
>>> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
>>> tear-down phase of the application is reduced from 113 to 67
>>> seconds. Retries/duplicates are also significantly reduced:
>>> 
>>> cm_rx_duplicates:
>>>     dreq:       7726
>>> []
>>> cm_tx_retries:
>>>     drep:          1
>>>     dreq:       7779
>>> 
>>> Increasing the timeout further didn't help, as these duplicates and
>>> retries stem from a too short CMA timeout, which was 20 (~4 seconds)
>>> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
>>> numbers fell down to about one hundred for both of them.
>>> 
>>> Adjustment of the CMA timeout is _not_ part of this commit.
>>> 
>>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>>> ---
>>> drivers/infiniband/hw/mlx4/cm.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> Jack? What do you think?
> 
> I am tempted to send a v2 making this a sysctl tuneable. This because, full-rack testing using 8 servers, each with 8 VMs, only showed 33% reduction in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: 0xd393089f} is NULL" with this commit.
> 
> But sure, Jack's opinion matters.

Jack,

A major contributor to the long processing time in the PF driver proxying QP1 packets is:

create_pv_resources
   -> ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler,
                               NULL, ctx, cq_size, 0);

That is, comp_vector is zero.

Due to commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs"), the zero comp_vector has the intent of let the mlx4_core driver select the least used vector.

But, in mlx4_ib_create_cq(), we have:

        pr_info("eq_table: %p\n", dev->eq_table);
        if (dev->eq_table) {
		vector = dev->eq_table[mlx4_choose_vector(dev->dev, vector,
                                                          ibdev->num_comp_vectors)];
        }

        cq->vector = vector;

and dev->eq_table is NULL, so all the CQs for the proxy QPs get comp_vector zero.

I have to make some reservations, as this analysis is based on uek4, but I think the code here is equal upstream, but need to double check.


Thxs, Håkon








> 
> 
> Thxs, Håkon
> 
>> 
>>> diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
>>> index fedaf8260105..8c79a480f2b7 100644
>>> --- a/drivers/infiniband/hw/mlx4/cm.c
>>> +++ b/drivers/infiniband/hw/mlx4/cm.c
>>> @@ -39,7 +39,7 @@
>>> 
>>> #include "mlx4_ib.h"
>>> 
>>> -#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
>>> +#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
>>> 
>>> struct id_map_entry {
>>> 	struct rb_node node;
>>> -- 
>>> 2.20.1


^ permalink raw reply

* Re: [PATCH net] virtio_net: Account for tx bytes and packets on sending xdp_frames
From: Jakub Kicinski @ 2019-02-06 15:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, dsahern@gmail.com, thoiland@redhat.com,
	virtualization@lists.linux-foundation.org, borkmann@iogearbox.net,
	Tariq Toukan, john.fastabend@gmail.com, mst@redhat.com,
	netdev@vger.kernel.org, jasowang@redhat.com, davem@davemloft.net,
	makita.toshiaki@lab.ntt.co.jp
In-Reply-To: <20190206144814.46996933@carbon>

On Wed, 6 Feb 2019 14:48:14 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 6 Feb 2019 00:06:33 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
> 
> > 3) Unrelated, In non XDP case, if skb allocation fails or driver fails
> > to pass the skb up to the stack for somereason, should the driver
> > increase rx packets ? IMHO the answer should be yes if we want to have
> > similar behavior between XDP and non XDP cases.  
> 
> I don't think "skb allocation fails" should increase rx packets
> counter.  The difference is that these events are outside sysadm/users
> control, and is an error detected inside the driver.  The XDP program
> takes a policy choice to XDP_DROP a packet, which can be accounted
> inside the XDP prog (as the samples show) or as we also discuss via a
> more generic XDP-action counters.

FWIW that's my understanding as well.  My understanding of Linux stats
is that they are incrementing one counter per packet.  I.e. in RX
direction success packets are those given to the stack, and for TX
those given to the hardware.  Standards (IETF/IEEE) usually count stats
on the same layer boundary, but I think software generally counts when
it's done with the packet.

I haven't seen it documented anywhere, yet.  I have tried to document
it in the docs of the recent RFC:
https://patchwork.ozlabs.org/patch/1032332/

Incidentally XDP_DROP may have been better named XDP_DISCARD from stats
perspective ;)

^ permalink raw reply

* [ISSUE][4.20.6] mlx5 and checksum failures
From: Ian Kumlien @ 2019-02-06 16:16 UTC (permalink / raw)
  To: Linux Kernel Network Developers

Hi,

I'm hitting an issue that i think is fixed by the following patch,
i haven't verified it yet - but it looks like it should go on the
stable queue(?)

(And yes, I did look, and couldn't find it ;))

commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Mon Dec 3 22:14:04 2018 -0800

    net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

    When an ethernet frame is padded to meet the minimum ethernet frame
    size, the padding octets are not covered by the hardware checksum.
    Fortunately the padding octets are usually zero's, which don't affect
    checksum. However, we have a switch which pads non-zero octets, this
    causes kernel hardware checksum fault repeatedly.

    Prior to:
    commit '88078d98d1bb ("net: pskb_trim_rcsum() and
CHECKSUM_COMPLETE ...")'
    skb checksum was forced to be CHECKSUM_NONE when padding is detected.
    After it, we need to keep skb->csum updated, like what we do for RXFCS.
    However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP
    headers, it is not worthy the effort as the packets are so small that
    CHECKSUM_COMPLETE can't save anything.

    Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE
are friends"),
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Tariq Toukan <tariqt@mellanox.com>
    Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 1d0bb5ff8c26..f86e4804e83e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -732,6 +732,8 @@ static u8 get_ip_proto(struct sk_buff *skb, int
network_depth, __be16 proto)
                                            ((struct ipv6hdr
*)ip_p)->nexthdr;
 }

+#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
                                     struct mlx5_cqe64 *cqe,
                                     struct mlx5e_rq *rq,
@@ -754,6 +756,17 @@ static inline void mlx5e_handle_csum(struct
net_device *netdev,
        if (unlikely(test_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state)))
                goto csum_unnecessary;

+       /* CQE csum doesn't cover padding octets in short ethernet
+        * frames. And the pad field is appended prior to calculating
+        * and appending the FCS field.
+        *
+        * Detecting these padded frames requires to verify and parse
+        * IP headers, so we simply force all those small frames to be
+        * CHECKSUM_UNNECESSARY even if they are not padded.
+        */
+       if (short_frame(skb->len))
+               goto csum_unnecessary;
+
        if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
                if (unlikely(get_ip_proto(skb, network_depth, proto)
== IPPROTO_SCTP))
                        goto csum_unnecessary;
---

Kernel log:
[ 3226.017424] bond0: hw csum failure
[ 3226.018387] CPU: 13 PID: 0 Comm: swapper/13 Tainted: G          I
    4.20.6-1.el7.elrepo.x86_64 #1
[ 3226.020928] Hardware name: HP ProLiant DL380 G6, BIOS P62 01/22/2015
[ 3226.022649] Call Trace:
[ 3226.023409]  <IRQ>
[ 3226.024039]  dump_stack+0x63/0x88
[ 3226.025066]  netdev_rx_csum_fault+0x3a/0x40
[ 3226.026208]  __skb_checksum_complete+0xd5/0xe0
[ 3226.027418]  nf_ip_checksum+0xc9/0xf0
[ 3226.028474]  nf_checksum+0x2d/0x40
[ 3226.029504]  tcp_packet+0x2ce/0xa20 [nf_conntrack]
[ 3226.030913]  ? tcp_v4_do_rcv+0x77/0x1f0
[ 3226.032094]  ? sock_put+0x19/0x20
[ 3226.033070]  ? nf_ct_deliver_cached_events+0xd0/0x110 [nf_conntrack]
[ 3226.034754]  nf_conntrack_in+0x140/0x510 [nf_conntrack]
[ 3226.036228]  ipv4_conntrack_in+0x14/0x20 [nf_conntrack]
[ 3226.037646]  nf_hook_slow+0x42/0xc0
[ 3226.038626]  ip_rcv+0xb5/0xd0
[ 3226.039480]  ? ip_local_deliver_finish+0x1e0/0x1e0
[ 3226.040767]  __netif_receive_skb_one_core+0x57/0x80
[ 3226.042155]  __netif_receive_skb+0x18/0x60
[ 3226.043275]  netif_receive_skb_internal+0x45/0xf0
[ 3226.044530]  napi_gro_receive+0xd0/0xf0
[ 3226.045665]  mlx5e_handle_rx_cqe+0x1e6/0x540 [mlx5_core]
[ 3226.047167]  mlx5e_poll_rx_cq+0xd6/0x9c0 [mlx5_core]
[ 3226.048516]  mlx5e_napi_poll+0xc2/0xcd0 [mlx5_core]
[ 3226.049836]  ? mlx5_eq_int+0x4b4/0x6c0 [mlx5_core]
[ 3226.051118]  net_rx_action+0x289/0x3d0
[ 3226.052257]  __do_softirq+0xd5/0x2a2
[ 3226.053277]  irq_exit+0xe8/0x100
[ 3226.054183]  do_IRQ+0x59/0xe0
[ 3226.055014]  common_interrupt+0xf/0xf
[ 3226.056038]  </IRQ>
[ 3226.056722] RIP: 0010:cpuidle_enter_state+0xba/0x2f0
[ 3226.058087] Code: d0 95 7e e8 38 07 a1 ff 41 8b 5c 24 04 49 89 c6
66 66 66 66 90 31 ff e8 34 19 a1 ff 80 7d cf 00 0f 85 8c 01 00 00 fb
66 66 90 <66> 66 90 45 85 ed 0f 88 94 01 00 00 4c 2b 75 c0 48 ba cf f7
53 e3
[ 3226.062925] RSP: 0018:ffffc9000c547e50 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffd6
[ 3226.064974] RAX: ffff88a3df7a2dc0 RBX: 000000000000000d RCX: 000000000000001f
[ 3226.066866] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[ 3226.068747] RBP: ffffc9000c547e90 R08: 0000000000000002 R09: ffffffcdc506f2e7
[ 3226.070622] R10: 0000000000000018 R11: 071c71c71c71c71c R12: ffffe8ffffb96f00
[ 3226.072525] R13: 0000000000000004 R14: 000002ef1d9f1e10 R15: ffff88a3d8900000
[ 3226.074479]  cpuidle_enter+0x17/0x20
[ 3226.075463]  call_cpuidle+0x23/0x40
[ 3226.076412]  do_idle+0x1db/0x280
[ 3226.077323]  cpu_startup_entry+0x1d/0x30
[ 3226.078417]  start_secondary+0x1ae/0x200
[ 3226.079490]  secondary_startup_64+0xa4/0xb0

^ permalink raw reply related

* Re: [xfrm, backport request] Request backport of e2612cd496e7 - set-mark backwards compatibility
From: Greg Kroah-Hartman @ 2019-02-06 16:29 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Benedict Wong, Greg Kroah-Hartman, Linux NetDev, Nathan Harold,
	Lorenzo Colitti, Steffen Klassert, Eyal Birger, tobias
In-Reply-To: <CANP3RGcdptMfnJXQNjNdiD4kuoZxj4vETro_aWRQqzsxY=yAfQ@mail.gmail.com>

On Tue, Feb 05, 2019 at 02:54:29PM -0800, Maciej Żenczykowski wrote:
> > I propose backporting commit e2612cd496e7 ("xfrm: Make set-mark default
> > behavior backward compatible") to 4.19 and 4.20 kernels to fix a backwards
> > compatibility bug introduced in 9b42c1f179a6 (“xfrm: Extend the
> > output_mark to support input direction and masking”).
> >
> > The fix is small, relatively simple, and has unit tests. :)
> >
> > Without this change, systems using mark-based routing on 4.19 or 4.20
> > kernels will by fail to route IPsec tunnel mode packets correctly in the
> > default case. This specifically affects Android devices.
> 
> Looks like it already includes a 'fixes: sha1' tag.
> I'm not sure what causes these patches to get picked up for stable...
> I'm guessing it's some sort of Greg-fu-style-magic...?

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

Hint, putting a fixes tag there does NOT trigger my scripts...

^ permalink raw reply

* Re: [PATCH net 1/2] geneve: should not call rt6_lookup() when ipv6 was disabled
From: Eric Dumazet @ 2019-02-06 16:43 UTC (permalink / raw)
  To: Hangbin Liu, netdev; +Cc: Stefano Brivio, David Miller
In-Reply-To: <20190206125111.5286-2-liuhangbin@gmail.com>



On 02/06/2019 04:51 AM, Hangbin Liu wrote:
> When we add a new GENEVE device with IPv6 remote, checking only for
> IS_ENABLED(CONFIG_IPV6) is not enough as we may disable IPv6 in kernel
> cmd(ipv6.disable=1), which will cause a NULL pointer dereference.
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  drivers/net/geneve.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 58bbba8582b0..0658715581e3 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1512,6 +1512,10 @@ static void geneve_link_config(struct net_device *dev,
>  	}
>  #if IS_ENABLED(CONFIG_IPV6)
>  	case AF_INET6: {
> +		struct inet6_dev *idev = in6_dev_get(dev);
> +		if (!idev)
> +			break;

Do not mix declarations and code.

Instead, use :
		struct inet6_dev *idev = in6_dev_get(dev);
		struct rt6_info *rt;

		if (!idev)
			break;
		rt = rt6_lookup(geneve->net, ...);

> +
>  		struct rt6_info *rt = rt6_lookup(geneve->net,
>  						 &info->key.u.ipv6.dst, NULL, 0,
>  						 NULL, 0);
> @@ -1519,6 +1523,8 @@ static void geneve_link_config(struct net_device *dev,
>  		if (rt && rt->dst.dev)
>  			ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV6_HLEN;
>  		ip6_rt_put(rt);
> +
> +		in6_dev_put(idev);
>  		break;
>  	}
>  #endif
> 

^ permalink raw reply

* Re: [PATCH net 2/2] sit: check if IPv6 enabled before call ip6_err_gen_icmpv6_unreach()
From: Eric Dumazet @ 2019-02-06 16:45 UTC (permalink / raw)
  To: Hangbin Liu, netdev; +Cc: Stefano Brivio, David Miller
In-Reply-To: <20190206125111.5286-3-liuhangbin@gmail.com>



On 02/06/2019 04:51 AM, Hangbin Liu wrote:
> If we disabled IPv6 from kernel boot up cmd(ipv6.disable=1), we should not
> call ip6_err_gen_icmpv6_unreach().
> 
> Reproducer:
> ip link add sit1 type sit local 10.10.0.1 remote 10.10.1.1 ttl 1
> ip link set sit1 up
> ip addr add 192.168.0.1/24 dev sit1
> ping 192.168.0.2
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  net/ipv6/sit.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
> index 1e03305c0549..e43fbac0fd1a 100644
> --- a/net/ipv6/sit.c
> +++ b/net/ipv6/sit.c
> @@ -493,6 +493,7 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
>  	const int type = icmp_hdr(skb)->type;
>  	const int code = icmp_hdr(skb)->code;
>  	unsigned int data_len = 0;
> +	struct inet6_dev *idev;
>  	struct ip_tunnel *t;
>  	int sifindex;
>  	int err;
> @@ -546,8 +547,13 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
>  	}
>  
>  	err = 0;
> -	if (!ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4, type, data_len))
> +
> +	idev = in6_dev_get(skb->dev);
> +	if (idev &&
> +	    !ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4, type, data_len)) {
> +		in6_dev_put(idev);
>  		goto out;
> +	}
>  


It seems there is a missing in6_dev_put(idev) depending on ip6_err_gen_icmpv6_unreach()() return value ?

^ permalink raw reply

* INFO: task hung in tipc_bcast_stop
From: syzbot @ 2019-02-06 16:47 UTC (permalink / raw)
  To: davem, jon.maloy, linux-kernel, netdev, syzkaller-bugs,
	tipc-discussion, ying.xue

Hello,

syzbot found the following crash on:

HEAD commit:    c8101f7729da net: dsa: Fix lockdep false positive splat
git tree:       net
console output: https://syzkaller.appspot.com/x/log.txt?x=1553f728c00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2e0064f906afee10
dashboard link: https://syzkaller.appspot.com/bug?extid=8118fd903ae608d128e1
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+8118fd903ae608d128e1@syzkaller.appspotmail.com

bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:3e:c4:ef:c6:e1:9c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:a6:63:10:7b:ed:f3, vlan:0)
INFO: task kworker/u4:4:3054 blocked for more than 140 seconds.
       Not tainted 5.0.0-rc4+ #63
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
net_ratelimit: 66009 callbacks suppressed
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on bridge_slave_0 with own address as source  
address (addr:aa:aa:aa:aa:aa:1b, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on bridge_slave_0 with own address as source  
address (addr:aa:aa:aa:aa:aa:1b, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:3e:c4:ef:c6:e1:9c, vlan:0)
bridge0: received packet on bridge_slave_0 with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:a6:63:10:7b:ed:f3, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
kworker/u4:4    D24832  3054      2 0x80000000
Workqueue: netns cleanup_net
Call Trace:
  context_switch kernel/sched/core.c:2844 [inline]
  __schedule+0x817/0x1cc0 kernel/sched/core.c:3485
  schedule+0x92/0x180 kernel/sched/core.c:3529
  exp_funnel_lock kernel/rcu/tree_exp.h:319 [inline]
  _synchronize_rcu_expedited.constprop.0+0x4d4/0x530  
kernel/rcu/tree_exp.h:622
  synchronize_rcu_expedited+0x27/0xa0 kernel/rcu/tree_exp.h:762
  synchronize_net+0x3b/0x60 net/core/dev.c:9165
  tipc_bcast_stop+0x198/0x310 net/tipc/bcast.c:545
  tipc_exit_net+0x26/0x40 net/tipc/core.c:102
  ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153
  cleanup_net+0x3fb/0x960 net/core/net_namespace.c:551
  process_one_work+0x98e/0x1760 kernel/workqueue.c:2153
  worker_thread+0x98/0xe40 kernel/workqueue.c:2296
  kthread+0x357/0x430 kernel/kthread.c:246
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
INFO: task syz-executor1:17203 blocked for more than 140 seconds.
       Not tainted 5.0.0-rc4+ #63
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor1   D30144 17203   7626 0x00000004
Call Trace:
  context_switch kernel/sched/core.c:2844 [inline]
  __schedule+0x817/0x1cc0 kernel/sched/core.c:3485
  schedule+0x92/0x180 kernel/sched/core.c:3529
  __rwsem_down_write_failed_common kernel/locking/rwsem-xadd.c:584 [inline]
  rwsem_down_write_failed+0x774/0xc30 kernel/locking/rwsem-xadd.c:613
  call_rwsem_down_write_failed+0x17/0x30 arch/x86/lib/rwsem.S:117
  __down_write arch/x86/include/asm/rwsem.h:142 [inline]
  down_write+0x53/0x90 kernel/locking/rwsem.c:72
  register_netdevice_notifier+0x7e/0x630 net/core/dev.c:1634
  bcm_init+0x1a8/0x220 net/can/bcm.c:1502
  can_create+0x28a/0x4b0 net/can/af_can.c:183
  __sock_create+0x3e6/0x750 net/socket.c:1275
  sock_create net/socket.c:1315 [inline]
  __sys_socket+0x103/0x220 net/socket.c:1345
  __do_sys_socket net/socket.c:1354 [inline]
  __se_sys_socket net/socket.c:1352 [inline]
  __x64_sys_socket+0x73/0xb0 net/socket.c:1352
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e39
Code: Bad RIP value.
RSP: 002b:00007f2b02d39c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39
RDX: 0000000000000002 RSI: 0000000000000002 RDI: 000000000000001d
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2b02d3a6d4
R13: 00000000004c60f6 R14: 00000000004db1e0 R15: 00000000ffffffff
INFO: task syz-executor1:17207 blocked for more than 140 seconds.
       Not tainted 5.0.0-rc4+ #63
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor1   D29872 17207   7626 0x00000004
Call Trace:
  context_switch kernel/sched/core.c:2844 [inline]
  __schedule+0x817/0x1cc0 kernel/sched/core.c:3485
  schedule+0x92/0x180 kernel/sched/core.c:3529
  __rwsem_down_write_failed_common kernel/locking/rwsem-xadd.c:584 [inline]
  rwsem_down_write_failed+0x774/0xc30 kernel/locking/rwsem-xadd.c:613
  call_rwsem_down_write_failed+0x17/0x30 arch/x86/lib/rwsem.S:117
  __down_write arch/x86/include/asm/rwsem.h:142 [inline]
  down_write+0x53/0x90 kernel/locking/rwsem.c:72
  register_netdevice_notifier+0x7e/0x630 net/core/dev.c:1634
  bcm_init+0x1a8/0x220 net/can/bcm.c:1502
  can_create+0x28a/0x4b0 net/can/af_can.c:183
  __sock_create+0x3e6/0x750 net/socket.c:1275
  sock_create net/socket.c:1315 [inline]
  __sys_socket+0x103/0x220 net/socket.c:1345
  __do_sys_socket net/socket.c:1354 [inline]
  __se_sys_socket net/socket.c:1352 [inline]
  __x64_sys_socket+0x73/0xb0 net/socket.c:1352
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e39
Code: Bad RIP value.
RSP: 002b:00007f2b02d18c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39
RDX: 0000000000000002 RSI: 0000000000000002 RDI: 000000000000001d
RBP: 000000000073bfa0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2b02d196d4
R13: 00000000004c60f6 R14: 00000000004db1e0 R15: 00000000ffffffff

Showing all locks held in the system:
3 locks held by ksoftirqd/1/16:
1 lock held by khungtaskd/1040:
  #0: 00000000e3131bf8 (rcu_read_lock){....}, at:  
debug_show_all_locks+0x5f/0x27e kernel/locking/lockdep.c:4389
3 locks held by kworker/u4:4/3054:
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
__write_once_size include/linux/compiler.h:220 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:59 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
process_one_work+0x87e/0x1760 kernel/workqueue.c:2124
  #1: 00000000356e77de (net_cleanup_work){+.+.}, at:  
process_one_work+0x8b4/0x1760 kernel/workqueue.c:2128
  #2: 0000000070308617 (pernet_ops_rwsem){++++}, at: cleanup_net+0xae/0x960  
net/core/net_namespace.c:518
1 lock held by rsyslogd/7473:
2 locks held by getty/7586:
  #0: 000000007d8d2e99 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 000000009ac587d5 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7587:
  #0: 000000009302afb1 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000be8c1a8d (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7588:
  #0: 0000000082755b9e (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000a2bc657f (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7589:
  #0: 00000000d211856e (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000279c75af (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7590:
  #0: 000000000111820e (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000db1c3ad5 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7591:
  #0: 00000000756df6ea (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 0000000026a342a6 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7592:
  #0: 00000000f85959b5 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000a7cf75cc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
1 lock held by syz-executor1/17203:
  #0: 0000000070308617 (pernet_ops_rwsem){++++}, at:  
register_netdevice_notifier+0x7e/0x630 net/core/dev.c:1634
1 lock held by syz-executor1/17207:
  #0: 0000000070308617 (pernet_ops_rwsem){++++}, at:  
register_netdevice_notifier+0x7e/0x630 net/core/dev.c:1634

=============================================

NMI backtrace for cpu 1
CPU: 1 PID: 1040 Comm: khungtaskd Not tainted 5.0.0-rc4+ #63
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
  nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
  check_hung_uninterruptible_tasks kernel/hung_task.c:203 [inline]
  watchdog+0x9df/0xee0 kernel/hung_task.c:287
  kthread+0x357/0x430 kernel/kthread.c:246
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
Sending NMI from CPU 1 to CPUs 0:
INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.389  
msecs
NMI backtrace for cpu 0
CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.0.0-rc4+ #63
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:rol32 include/linux/bitops.h:83 [inline]
RIP: 0010:jhash2 include/linux/jhash.h:128 [inline]
RIP: 0010:hash_stack lib/stackdepot.c:161 [inline]
RIP: 0010:depot_save_stack+0x76/0x460 lib/stackdepot.c:230
Code: 0f 86 ec 01 00 00 44 89 cb 44 89 c8 48 89 ca 03 5a 08 41 83 e8 03 48  
83 c2 0c 44 03 4a f4 03 42 f8 45 89 cb 41 89 d9 41 29 db <41> c1 c1 04 01  
c3 45 31 d9 44 29 c8 41 89 c2 44 89 c8 41 01 d9 c1
RSP: 0018:ffff8880a9886ed8 EFLAGS: 00000206
RAX: 00000000ce5caf29 RBX: 00000000b5fb600c RCX: ffff8880a9886f38
RDX: ffff8880a9886f68 RSI: 0000000000490220 RDI: 0000000000000016
RBP: ffff8880a9886f10 R08: 0000000000000020 R09: 00000000b5fb600c
R10: 000000008a18bda7 R11: 000000001a539322 R12: 0000000000490220
R13: ffff8880a9886f20 R14: 0000000000000000 R15: ffff88812c3f0940
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000457e0f CR3: 00000000a4f2e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  save_stack+0xa9/0xd0 mm/kasan/common.c:79
  set_track mm/kasan/common.c:85 [inline]
  __kasan_kmalloc mm/kasan/common.c:496 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
  kasan_kmalloc mm/kasan/common.c:504 [inline]
  kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:411
  slab_post_alloc_hook mm/slab.h:444 [inline]
  slab_alloc_node mm/slab.c:3324 [inline]
  kmem_cache_alloc_node_trace+0x13c/0x720 mm/slab.c:3650
  __do_kmalloc_node mm/slab.c:3672 [inline]
  __kmalloc_node_track_caller+0x3d/0x70 mm/slab.c:3687
  __kmalloc_reserve.isra.0+0x40/0xf0 net/core/skbuff.c:140
  __alloc_skb+0x10b/0x5e0 net/core/skbuff.c:208
  alloc_skb include/linux/skbuff.h:1011 [inline]
  nlmsg_new include/net/netlink.h:654 [inline]
  fdb_notify+0x9f/0x190 net/bridge/br_fdb.c:706
  br_fdb_update net/bridge/br_fdb.c:601 [inline]
  br_fdb_update+0x2fd/0xa70 net/bridge/br_fdb.c:562
  br_handle_frame_finish+0x84f/0x14c0 net/bridge/br_input.c:97
  br_nf_hook_thresh+0x2ec/0x380 net/bridge/br_netfilter_hooks.c:1009
  br_nf_pre_routing_finish_ipv6+0x708/0xdc0  
net/bridge/br_netfilter_ipv6.c:210
  NF_HOOK include/linux/netfilter.h:289 [inline]
  br_nf_pre_routing_ipv6+0x3c4/0x740 net/bridge/br_netfilter_ipv6.c:238
  br_nf_pre_routing+0xe80/0x13a0 net/bridge/br_netfilter_hooks.c:482
  nf_hook_entry_hookfn include/linux/netfilter.h:119 [inline]
  nf_hook_slow+0xbf/0x1f0 net/netfilter/core.c:511
  nf_hook include/linux/netfilter.h:244 [inline]
  NF_HOOK include/linux/netfilter.h:287 [inline]
  br_handle_frame+0x95b/0x1450 net/bridge/br_input.c:305
  __netif_receive_skb_core+0xa96/0x3010 net/core/dev.c:4902
  __netif_receive_skb_one_core+0xa8/0x1a0 net/core/dev.c:4971
  __netif_receive_skb+0x2c/0x1c0 net/core/dev.c:5083
  process_backlog+0x206/0x750 net/core/dev.c:5923
  napi_poll net/core/dev.c:6346 [inline]
  net_rx_action+0x4fa/0x1070 net/core/dev.c:6412
  __do_softirq+0x266/0x95a kernel/softirq.c:292
  run_ksoftirqd kernel/softirq.c:654 [inline]
  run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
  smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
  kthread+0x357/0x430 kernel/kthread.c:246
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply

* Re: [Patch net-next] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Saeed Mahameed @ 2019-02-06 16:55 UTC (permalink / raw)
  To: netdev@vger.kernel.org, Tariq Toukan, xiyou.wangcong@gmail.com
In-Reply-To: <1620af96-f60c-9ec2-dd71-7c9081828bce@mellanox.com>

On Wed, 2019-02-06 at 12:02 +0000, Tariq Toukan wrote:
> 
> On 2/6/2019 2:35 AM, Cong Wang wrote:
> > mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
> > gets a lot of contentions when we test some heavy workload
> > with 60 RX queues and 80 CPU's, and it is clearly shown in the
> > flame graph.
> > 


Hi Cong,

The patch is ok to me, but i really doubt that you can hit a contention
on latest upstream driver, since we already have spinlock per EQ, which
means spinlock per core,  each EQ (core) msix handler can only access
one spinlock (its own), so I am surprised how you got the contention,
Maybe you are not running on latest upstream driver ?

what is the workload ? 

> > In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> > we don't have to take a spinlock on this hot path. It is pretty
> > much
> > similar to commit 291c566a2891
> > ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> > are still serialized with the spinlock, and with synchronize_irq()
> > it should be safe to just move the fast path to RCU read lock.
> > 
> > This patch itself reduces the latency by about 50% with our
> > workload.
> > 
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
> >   1 file changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > index ee04aab65a9f..7092457705a2 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > @@ -114,11 +114,11 @@ static struct mlx5_core_cq
> > *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
> >   	struct mlx5_cq_table *table = &eq->cq_table;
> >   	struct mlx5_core_cq *cq = NULL;
> >   
> > -	spin_lock(&table->lock);
> > +	rcu_read_lock();
> >   	cq = radix_tree_lookup(&table->tree, cqn);
> >   	if (likely(cq))
> >   		mlx5_cq_hold(cq);
> > -	spin_unlock(&table->lock);
> > +	rcu_read_unlock();
> 
> Thanks for you patch.
> 
> I think we can improve it further, by taking the if statement out of
> the 
> critical section.
> 

No, mlx5_cq_hold must stay under RCU read, otherwise cq might get freed
before the irq gets a change to increment ref count on it.

another way to do it is not to do any refcounting in the irq handler
and fence cq removal via synchronize_irq(eq->irqn) on mlx5_eq_del_cq.
But let's keep one approach (refcounting), synchronize_irq/rcu can be
heavy sometimes especially on RDMA workloads with many create/destroy
cq in loops.

> Other than that, patch LGTM.
> 
> Regards,
> Tariq
> 
> >   
> >   	return cq;
> >   }
> > @@ -371,9 +371,9 @@ int mlx5_eq_add_cq(struct mlx5_eq *eq, struct
> > mlx5_core_cq *cq)
> >   	struct mlx5_cq_table *table = &eq->cq_table;
> >   	int err;
> >   
> > -	spin_lock_irq(&table->lock);
> > +	spin_lock(&table->lock);
> >   	err = radix_tree_insert(&table->tree, cq->cqn, cq);
> > -	spin_unlock_irq(&table->lock);
> > +	spin_unlock(&table->lock);
> >   
> >   	return err;
> >   }
> > @@ -383,9 +383,9 @@ int mlx5_eq_del_cq(struct mlx5_eq *eq, struct
> > mlx5_core_cq *cq)
> >   	struct mlx5_cq_table *table = &eq->cq_table;
> >   	struct mlx5_core_cq *tmp;
> >   
> > -	spin_lock_irq(&table->lock);
> > +	spin_lock(&table->lock);
> >   	tmp = radix_tree_delete(&table->tree, cq->cqn);
> > -	spin_unlock_irq(&table->lock);
> > +	spin_unlock(&table->lock);
> >   
> >   	if (!tmp) {
> >   		mlx5_core_warn(eq->dev, "cq 0x%x not found in eq 0x%x
> > tree\n", eq->eqn, cq->cqn);
> > 

^ permalink raw reply

* Re: [Patch net-next] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Cong Wang @ 2019-02-06 17:15 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <e88ac12b80b510648f7ab1d4cee50c43908ba49d.camel@mellanox.com>

On Wed, Feb 6, 2019 at 8:55 AM Saeed Mahameed <saeedm@mellanox.com> wrote:
> Hi Cong,
>
> The patch is ok to me, but i really doubt that you can hit a contention
> on latest upstream driver, since we already have spinlock per EQ, which
> means spinlock per core,  each EQ (core) msix handler can only access
> one spinlock (its own), so I am surprised how you got the contention,
> Maybe you are not running on latest upstream driver ?

We are running 4.14 stable release. Which commit changes the game
here? We can consider to backport it unless it is complicated.

Also, if you don't like this patch, we are happy to carry it for our own,
sometimes it isn't worth the time to push into upstream.

>
> what is the workload ?
>

It's a memcached RPC performance test, that is all I can tell.
(Apparently I have almost zero knowledge about memcached.)


> > > In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> > > we don't have to take a spinlock on this hot path. It is pretty
> > > much
> > > similar to commit 291c566a2891
> > > ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> > > are still serialized with the spinlock, and with synchronize_irq()
> > > it should be safe to just move the fast path to RCU read lock.
> > >
> > > This patch itself reduces the latency by about 50% with our
> > > workload.
> > >
> > > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > > Cc: Tariq Toukan <tariqt@mellanox.com>
> > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > > ---
> > >   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
> > >   1 file changed, 6 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > index ee04aab65a9f..7092457705a2 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > @@ -114,11 +114,11 @@ static struct mlx5_core_cq
> > > *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
> > >     struct mlx5_cq_table *table = &eq->cq_table;
> > >     struct mlx5_core_cq *cq = NULL;
> > >
> > > -   spin_lock(&table->lock);
> > > +   rcu_read_lock();
> > >     cq = radix_tree_lookup(&table->tree, cqn);
> > >     if (likely(cq))
> > >             mlx5_cq_hold(cq);
> > > -   spin_unlock(&table->lock);
> > > +   rcu_read_unlock();
> >
> > Thanks for you patch.
> >
> > I think we can improve it further, by taking the if statement out of
> > the
> > critical section.
> >
>
> No, mlx5_cq_hold must stay under RCU read, otherwise cq might get freed
> before the irq gets a change to increment ref count on it.
>

Agreed.


Thanks.

^ permalink raw reply

* Re: [Patch net-next] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Eric Dumazet @ 2019-02-06 17:17 UTC (permalink / raw)
  To: Saeed Mahameed, netdev@vger.kernel.org, Tariq Toukan,
	xiyou.wangcong@gmail.com
In-Reply-To: <e88ac12b80b510648f7ab1d4cee50c43908ba49d.camel@mellanox.com>



On 02/06/2019 08:55 AM, Saeed Mahameed wrote:
> On Wed, 2019-02-06 at 12:02 +0000, Tariq Toukan wrote:
>>
>> On 2/6/2019 2:35 AM, Cong Wang wrote:
>>> mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
>>> gets a lot of contentions when we test some heavy workload
>>> with 60 RX queues and 80 CPU's, and it is clearly shown in the
>>> flame graph.
>>>
> 
> 
> Hi Cong,
> 
> The patch is ok to me, but i really doubt that you can hit a contention
> on latest upstream driver, since we already have spinlock per EQ, which
> means spinlock per core,  each EQ (core) msix handler can only access
> one spinlock (its own), so I am surprised how you got the contention,
> Maybe you are not running on latest upstream driver ?
> 
> what is the workload ? 

Surprisingly (or not), atomic operations, even on _not_ contended cache lines can
stall the cpu enough for perf tools to notice...

If the atomic operation can be trivially replaced by RCU, then do it by any mean.



^ permalink raw reply

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
From: Saeed Mahameed @ 2019-02-06 17:18 UTC (permalink / raw)
  To: ian.kumlien@gmail.com, netdev@vger.kernel.org,
	davem@davemloft.net
In-Reply-To: <CAA85sZtE7Gv8mKL5tUh8AJ4yG9xd_HZh9svWkHXm=j7VohD1Cw@mail.gmail.com>

On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
> Hi,
> 
> I'm hitting an issue that i think is fixed by the following patch,
> i haven't verified it yet - but it looks like it should go on the
> stable queue(?)
> 
> (And yes, I did look, and couldn't find it ;))
> 

Yes, i couldn't find it neither,

It should have been queued up for 4.18 by now.
Dave said he will take care of it, maybe he just forgot or something.
since the patch needed some extra care..

https://patchwork.ozlabs.org/patch/1027837/

Dave, Is there anything i can do here ?

Thanks,
Saeed.

> commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
> Author: Cong Wang <xiyou.wangcong@gmail.com>
> Date:   Mon Dec 3 22:14:04 2018 -0800
> 
>     net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames


^ permalink raw reply

* Re: [PATCH iproute2-next v3] devlink: report cell size
From: David Ahern @ 2019-02-06 17:19 UTC (permalink / raw)
  To: Jakub Kicinski, idosch, jiri; +Cc: stephen, oss-drivers, netdev
In-Reply-To: <20190204152859.6667-1-jakub.kicinski@netronome.com>

On 2/4/19 7:28 AM, Jakub Kicinski wrote:
> Print the value of DEVLINK_ATTR_SB_POOL_CELL_SIZE, if reported.
> 
> Example:
> pci/0000:82:00.0:
>   sb 1 pool 0 type egress size 40945664 thtype static cell_size 2048
>   sb 2 pool 0 type egress size 258867200 thtype static cell_size 10240
> ...
> 
> v3: - don't double space.
> v2: - fix spelling.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  devlink/devlink.c     |  3 +++
>  man/man8/devlink-sb.8 | 10 ++++++++++
>  2 files changed, 13 insertions(+)

applied to iproute2-next. Thanks

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox