Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net 2/2] sit: check if IPv6 enabled before call ip6_err_gen_icmpv6_unreach()
From: Stefano Brivio @ 2019-02-06 15:00 UTC (permalink / raw)
  To: Hangbin Liu; +Cc: netdev, David Miller
In-Reply-To: <20190206125111.5286-3-liuhangbin@gmail.com>

On Wed,  6 Feb 2019 20:51:11 +0800
Hangbin Liu <liuhangbin@gmail.com> wrote:

> If we disabled IPv6 from kernel boot up cmd(ipv6.disable=1), we should not
> call ip6_err_gen_icmpv6_unreach().
> 
> Reproducer:
> ip link add sit1 type sit local 10.10.0.1 remote 10.10.1.1 ttl 1
> ip link set sit1 up
> ip addr add 192.168.0.1/24 dev sit1
> ping 192.168.0.2
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  net/ipv6/sit.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
> index 1e03305c0549..e43fbac0fd1a 100644
> --- a/net/ipv6/sit.c
> +++ b/net/ipv6/sit.c
> @@ -493,6 +493,7 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
>  	const int type = icmp_hdr(skb)->type;
>  	const int code = icmp_hdr(skb)->code;
>  	unsigned int data_len = 0;
> +	struct inet6_dev *idev;
>  	struct ip_tunnel *t;
>  	int sifindex;
>  	int err;
> @@ -546,8 +547,13 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
>  	}
>  
>  	err = 0;
> -	if (!ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4, type, data_len))
> +
> +	idev = in6_dev_get(skb->dev);
> +	if (idev &&
> +	    !ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4, type, data_len)) {
> +		in6_dev_put(idev);
>  		goto out;
> +	}

You are leaking a reference if idev && ip6_err_gen_icmpv6_unreach(...).
You should call in6_dev_put(idev) whenever idev is not NULL instead.

-- 
Stefano

^ permalink raw reply

* Re: [PATCH net-next 5/6] net: marvell: neta: add comphy support
From: Maxime Chevallier @ 2019-02-06 15:09 UTC (permalink / raw)
  To: Russell King
  Cc: Andrew Lunn, Gregory Clement, Jason Cooper,
	Kishon Vijay Abraham I, Sebastian Hesselbarth, Thomas Petazzoni,
	devicetree, linux-arm-kernel, netdev, David S. Miller
In-Reply-To: <E1grLTv-00061v-Tw@rmk-PC.armlinux.org.uk>

Hello Russell,

On Wed, 06 Feb 2019 11:35:07 +0000
Russell King <rmk+kernel@armlinux.org.uk> wrote:


>+	if (pp->comphy) {
>+		enum phy_mode mode = PHY_MODE_INVALID;
>+
>+		switch (state->interface) {
>+		case PHY_INTERFACE_MODE_SGMII:
>+		case PHY_INTERFACE_MODE_1000BASEX:
>+			mode = PHY_MODE_SGMII;

PHY_MODE_SGMII no longer exists, the new PHY_MODE_ETHERNET must be used
with the correct submode now, so this doesn't build unfortunately.

>+			break;
>+		case PHY_INTERFACE_MODE_2500BASEX:
>+			mode = PHY_MODE_2500SGMII;

Same here.

Thanks,

Maxime

^ permalink raw reply

* Re:Re: [PATCH net] net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
From: albin_yang @ 2019-02-06 15:13 UTC (permalink / raw)
  To: Maciej W. Rozycki; +Cc: netdev, David S. Miller, yang.wei9
In-Reply-To: <alpine.LFD.2.21.1902051816130.9991@eddie.linux-mips.org>

At 2019-02-06 03:57:34, "Maciej W. Rozycki" <macro@linux-mips.org> wrote:
>On Wed, 6 Feb 2019, Yang Wei wrote:
>
>Reviewed-by: Maciej W. Rozycki <macro@linux-mips.org>
>
> It looks to me the driver has to be reviewed WRT `dev_kfree_skb' vs 
>`kfree_skb' use too.  I'll have a look into it unless you are happy to do 
>that.
>
> Thanks for your contribution!
>
Hi, Maciej

I think kfree_skb() should be called when skb is dropped by network drivers. 
I found that many network drivers have such problems that not use 
kfree_skb/consume_skb properly. Maybe because 'drop profiles' appears later 
than many network drivers?

The problem 'dev_kfree_skb' vs 'kfree_skb' should be fixed. I think that is 
lower priority than the current patch. Network driver should not perturb 
drop profiles when skb successful xmit.

Thanks
Yang

^ permalink raw reply

* [PATCH net] net: dsa: Fix NULL checking in dsa_slave_set_eee()
From: Dan Carpenter @ 2019-02-06 15:35 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Vivien Didelot, Florian Fainelli, David S. Miller, netdev,
	kernel-janitors

This function can't succeed if dp->pl is NULL.  It will Oops inside the
call to return phylink_ethtool_get_eee(dp->pl, e);

Fixes: 1be52e97ed3e ("dsa: slave: eee: Allow ports to use phylink")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
---
 net/dsa/slave.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 91de3a663226..9384f95e04d5 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -639,7 +639,7 @@ static int dsa_slave_set_eee(struct net_device *dev, struct ethtool_eee *e)
 	int ret;
 
 	/* Port's PHY and MAC both need to be EEE capable */
-	if (!dev->phydev && !dp->pl)
+	if (!dev->phydev || !dp->pl)
 		return -ENODEV;
 
 	if (!ds->ops->set_mac_eee)
@@ -659,7 +659,7 @@ static int dsa_slave_get_eee(struct net_device *dev, struct ethtool_eee *e)
 	int ret;
 
 	/* Port's PHY and MAC both need to be EEE capable */
-	if (!dev->phydev && !dp->pl)
+	if (!dev->phydev || !dp->pl)
 		return -ENODEV;
 
 	if (!ds->ops->get_mac_eee)
-- 
2.17.1


^ permalink raw reply related

* RE: [PATCH net-next 1/4] dpaa2-eth: Use a single page per Rx buffer
From: Ioana Ciocoi Radulescu @ 2019-02-06 15:36 UTC (permalink / raw)
  To: Ilias Apalodimas
  Cc: netdev@vger.kernel.org, davem@davemloft.net, Ioana Ciornei,
	brouer@redhat.com
In-Reply-To: <20190205093513.GA31466@apalos>

> -----Original Message-----
> From: Ilias Apalodimas <ilias.apalodimas@linaro.org>
> Sent: Tuesday, February 5, 2019 11:35 AM
> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
> Cc: netdev@vger.kernel.org; davem@davemloft.net; Ioana Ciornei
> <ioana.ciornei@nxp.com>; brouer@redhat.com
> Subject: Re: [PATCH net-next 1/4] dpaa2-eth: Use a single page per Rx buffer
> 
> Hi Ioana,
> 
> Can you share any results on XDP (XDP_DROP is usually useful for the
> hardware capabilities).

XDP numbers are pretty much the same as before this patch:

On a LS2088A with A72 cores @2GHz (numbers in Mpps):
				1core		8cores
-------------------------------------------------------------------------
XDP_DROP (no touching data)	5.37		29.6 (linerate)
XDP_DROP (xdp1 sample)	3.14		24.22
XDP_TX(xdp2 sample)		1.65		13.08

> 
> > Instead of allocating page fragments via the network stack,
> > use the page allocator directly. For now, we consume one page
> > for each Rx buffer.
> >
> > With the new memory model we are free to consider adding more
> > XDP support.
> >
> > Performance decreases slightly in some IP forwarding cases.
> > No visible effect on termination traffic. The driver memory
> > footprint increases as a result of this change, but it is
> > still small enough to not really matter.
> >
> > Another side effect is that now Rx buffer alignment requirements
> > are naturally satisfied without any additional actions needed.
> > Remove alignment related code, except in the buffer layout
> > information conveyed to MC, as hardware still needs to know the
> > alignment value we guarantee.
> >
> > Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
> > Signed-off-by: Ioana Radulescu <ruxandra.radulescu@nxp.com>
> > ---
> >  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c | 61 +++++++++++++-
> ----------
> >  drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h | 21 +++-----
> >  2 files changed, 38 insertions(+), 44 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> > index 04925c7..6e58de6 100644
> > --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> > +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.c
> > @@ -86,16 +86,16 @@ static void free_rx_fd(struct dpaa2_eth_priv *priv,
> >  	for (i = 1; i < DPAA2_ETH_MAX_SG_ENTRIES; i++) {
> >  		addr = dpaa2_sg_get_addr(&sgt[i]);
> >  		sg_vaddr = dpaa2_iova_to_virt(priv->iommu_domain, addr);
> > -		dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > +		dma_unmap_page(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> >
> > -		skb_free_frag(sg_vaddr);
> > +		free_pages((unsigned long)sg_vaddr, 0);
> >  		if (dpaa2_sg_is_final(&sgt[i]))
> >  			break;
> >  	}
> >
> >  free_buf:
> > -	skb_free_frag(vaddr);
> > +	free_pages((unsigned long)vaddr, 0);
> >  }
> >
> >  /* Build a linear skb based on a single-buffer frame descriptor */
> > @@ -109,7 +109,7 @@ static struct sk_buff *build_linear_skb(struct
> dpaa2_eth_channel *ch,
> >
> >  	ch->buf_count--;
> >
> > -	skb = build_skb(fd_vaddr, DPAA2_ETH_SKB_SIZE);
> > +	skb = build_skb(fd_vaddr, DPAA2_ETH_RX_BUF_RAW_SIZE);
> >  	if (unlikely(!skb))
> >  		return NULL;
> >
> > @@ -144,19 +144,19 @@ static struct sk_buff *build_frag_skb(struct
> dpaa2_eth_priv *priv,
> >  		/* Get the address and length from the S/G entry */
> >  		sg_addr = dpaa2_sg_get_addr(sge);
> >  		sg_vaddr = dpaa2_iova_to_virt(priv->iommu_domain,
> sg_addr);
> > -		dma_unmap_single(dev, sg_addr,
> DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > +		dma_unmap_page(dev, sg_addr, DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> >
> >  		sg_length = dpaa2_sg_get_len(sge);
> >
> >  		if (i == 0) {
> >  			/* We build the skb around the first data buffer */
> > -			skb = build_skb(sg_vaddr, DPAA2_ETH_SKB_SIZE);
> > +			skb = build_skb(sg_vaddr,
> DPAA2_ETH_RX_BUF_RAW_SIZE);
> >  			if (unlikely(!skb)) {
> >  				/* Free the first SG entry now, since we
> already
> >  				 * unmapped it and obtained the virtual
> address
> >  				 */
> > -				skb_free_frag(sg_vaddr);
> > +				free_pages((unsigned long)sg_vaddr, 0);
> >
> >  				/* We still need to subtract the buffers used
> >  				 * by this FD from our software counter
> > @@ -211,9 +211,9 @@ static void free_bufs(struct dpaa2_eth_priv *priv,
> u64 *buf_array, int count)
> >
> >  	for (i = 0; i < count; i++) {
> >  		vaddr = dpaa2_iova_to_virt(priv->iommu_domain,
> buf_array[i]);
> > -		dma_unmap_single(dev, buf_array[i],
> DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > -		skb_free_frag(vaddr);
> > +		dma_unmap_page(dev, buf_array[i],
> DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> > +		free_pages((unsigned long)vaddr, 0);
> 
> I got some hardware/XDP related questions since i have no idea how the
> hardware
> works. From what i understand on the code, you are trying to manage the
> buffer
> from the hw buffer manager and if the fails you unmap and free the pages.
> Since XDP relies on recycling for speed (hint check xdp_return_buff()), is it
> possible to recycle the buffer if the hw fails ?

We're still looking for a way to use the page pool mechanism in our driver.
Our hardware buffer pool is shared between all Rx queues, so it's not that
straightforward.

This error path in particular is very unlikely to ever get exercised, it
wouldn't have an impact on performance.

Thanks,
Ioana

> 
> >  	}
> >  }
> >
> > @@ -378,16 +378,16 @@ static void dpaa2_eth_rx(struct dpaa2_eth_priv
> *priv,
> >  			return;
> >  		}
> >
> > -		dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > +		dma_unmap_page(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> >  		skb = build_linear_skb(ch, fd, vaddr);
> >  	} else if (fd_format == dpaa2_fd_sg) {
> >  		WARN_ON(priv->xdp_prog);
> >
> > -		dma_unmap_single(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > -				 DMA_BIDIRECTIONAL);
> > +		dma_unmap_page(dev, addr, DPAA2_ETH_RX_BUF_SIZE,
> > +			       DMA_BIDIRECTIONAL);
> >  		skb = build_frag_skb(priv, ch, buf_data);
> > -		skb_free_frag(vaddr);
> > +		free_pages((unsigned long)vaddr, 0);
> >  		percpu_extras->rx_sg_frames++;
> >  		percpu_extras->rx_sg_bytes += dpaa2_fd_get_len(fd);
> >  	} else {
> > @@ -903,7 +903,7 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
> >  {
> >  	struct device *dev = priv->net_dev->dev.parent;
> >  	u64 buf_array[DPAA2_ETH_BUFS_PER_CMD];
> > -	void *buf;
> > +	struct page *page;
> >  	dma_addr_t addr;
> >  	int i, err;
> >
> > @@ -911,14 +911,16 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
> >  		/* Allocate buffer visible to WRIOP + skb shared info +
> >  		 * alignment padding
> >  		 */
> > -		buf = napi_alloc_frag(dpaa2_eth_buf_raw_size(priv));
> > -		if (unlikely(!buf))
> > +		/* allocate one page for each Rx buffer. WRIOP sees
> > +		 * the entire page except for a tailroom reserved for
> > +		 * skb shared info
> > +		 */
> > +		page = dev_alloc_pages(0);
> > +		if (!page)
> >  			goto err_alloc;
> >
> > -		buf = PTR_ALIGN(buf, priv->rx_buf_align);
> > -
> > -		addr = dma_map_single(dev, buf,
> DPAA2_ETH_RX_BUF_SIZE,
> > -				      DMA_BIDIRECTIONAL);
> > +		addr = dma_map_page(dev, page, 0,
> DPAA2_ETH_RX_BUF_SIZE,
> > +				    DMA_BIDIRECTIONAL);
> >  		if (unlikely(dma_mapping_error(dev, addr)))
> >  			goto err_map;
> >
> > @@ -926,7 +928,7 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
> >
> >  		/* tracing point */
> >  		trace_dpaa2_eth_buf_seed(priv->net_dev,
> > -					 buf, dpaa2_eth_buf_raw_size(priv),
> > +					 page,
> DPAA2_ETH_RX_BUF_RAW_SIZE,
> >  					 addr, DPAA2_ETH_RX_BUF_SIZE,
> >  					 bpid);
> >  	}
> > @@ -948,7 +950,7 @@ static int add_bufs(struct dpaa2_eth_priv *priv,
> >  	return i;
> >
> >  err_map:
> > -	skb_free_frag(buf);
> > +	__free_pages(page, 0);
> >  err_alloc:
> >  	/* If we managed to allocate at least some buffers,
> >  	 * release them to hardware
> > @@ -2134,6 +2136,7 @@ static int set_buffer_layout(struct
> dpaa2_eth_priv *priv)
> >  {
> >  	struct device *dev = priv->net_dev->dev.parent;
> >  	struct dpni_buffer_layout buf_layout = {0};
> > +	u16 rx_buf_align;
> >  	int err;
> >
> >  	/* We need to check for WRIOP version 1.0.0, but depending on the
> MC
> > @@ -2142,9 +2145,9 @@ static int set_buffer_layout(struct
> dpaa2_eth_priv *priv)
> >  	 */
> >  	if (priv->dpni_attrs.wriop_version == DPAA2_WRIOP_VERSION(0, 0,
> 0) ||
> >  	    priv->dpni_attrs.wriop_version == DPAA2_WRIOP_VERSION(1, 0,
> 0))
> > -		priv->rx_buf_align = DPAA2_ETH_RX_BUF_ALIGN_REV1;
> > +		rx_buf_align = DPAA2_ETH_RX_BUF_ALIGN_REV1;
> >  	else
> > -		priv->rx_buf_align = DPAA2_ETH_RX_BUF_ALIGN;
> > +		rx_buf_align = DPAA2_ETH_RX_BUF_ALIGN;
> >
> >  	/* tx buffer */
> >  	buf_layout.private_data_size = DPAA2_ETH_SWA_SIZE;
> > @@ -2184,7 +2187,7 @@ static int set_buffer_layout(struct
> dpaa2_eth_priv *priv)
> >  	/* rx buffer */
> >  	buf_layout.pass_frame_status = true;
> >  	buf_layout.pass_parser_result = true;
> > -	buf_layout.data_align = priv->rx_buf_align;
> > +	buf_layout.data_align = rx_buf_align;
> >  	buf_layout.data_head_room = dpaa2_eth_rx_head_room(priv);
> >  	buf_layout.private_data_size = 0;
> >  	buf_layout.options = DPNI_BUF_LAYOUT_OPT_PARSER_RESULT |
> > diff --git a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
> b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
> > index 31fe486..da3d039 100644
> > --- a/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
> > +++ b/drivers/net/ethernet/freescale/dpaa2/dpaa2-eth.h
> > @@ -63,9 +63,11 @@
> >  /* Hardware requires alignment for ingress/egress buffer addresses */
> >  #define DPAA2_ETH_TX_BUF_ALIGN		64
> >
> > -#define DPAA2_ETH_RX_BUF_SIZE		2048
> > -#define DPAA2_ETH_SKB_SIZE \
> > -	(DPAA2_ETH_RX_BUF_SIZE + SKB_DATA_ALIGN(sizeof(struct
> skb_shared_info)))
> > +#define DPAA2_ETH_RX_BUF_RAW_SIZE	PAGE_SIZE
> > +#define DPAA2_ETH_RX_BUF_TAILROOM \
> > +	SKB_DATA_ALIGN(sizeof(struct skb_shared_info))
> > +#define DPAA2_ETH_RX_BUF_SIZE \
> > +	(DPAA2_ETH_RX_BUF_RAW_SIZE -
> DPAA2_ETH_RX_BUF_TAILROOM)
> >
> >  /* Hardware annotation area in RX/TX buffers */
> >  #define DPAA2_ETH_RX_HWA_SIZE		64
> > @@ -343,7 +345,6 @@ struct dpaa2_eth_priv {
> >  	bool rx_tstamp; /* Rx timestamping enabled */
> >
> >  	u16 tx_qdid;
> > -	u16 rx_buf_align;
> >  	struct fsl_mc_io *mc_io;
> >  	/* Cores which have an affine DPIO/DPCON.
> >  	 * This is the cpu set on which Rx and Tx conf frames are processed
> > @@ -418,15 +419,6 @@ enum dpaa2_eth_rx_dist {
> >  	DPAA2_ETH_RX_DIST_CLS
> >  };
> >
> > -/* Hardware only sees DPAA2_ETH_RX_BUF_SIZE, but the skb built
> around
> > - * the buffer also needs space for its shared info struct, and we need
> > - * to allocate enough to accommodate hardware alignment restrictions
> > - */
> > -static inline unsigned int dpaa2_eth_buf_raw_size(struct dpaa2_eth_priv
> *priv)
> > -{
> > -	return DPAA2_ETH_SKB_SIZE + priv->rx_buf_align;
> > -}
> > -
> >  static inline
> >  unsigned int dpaa2_eth_needed_headroom(struct dpaa2_eth_priv *priv,
> >  				       struct sk_buff *skb)
> > @@ -451,8 +443,7 @@ unsigned int dpaa2_eth_needed_headroom(struct
> dpaa2_eth_priv *priv,
> >   */
> >  static inline unsigned int dpaa2_eth_rx_head_room(struct dpaa2_eth_priv
> *priv)
> >  {
> > -	return priv->tx_data_offset + DPAA2_ETH_TX_BUF_ALIGN -
> > -	       DPAA2_ETH_RX_HWA_SIZE;
> > +	return priv->tx_data_offset - DPAA2_ETH_RX_HWA_SIZE;
> >  }
> >
> >  int dpaa2_eth_set_hash(struct net_device *net_dev, u64 flags);
> > --
> > 2.7.4
> >

^ permalink raw reply

* Re: [RFC PATCH 0/4] Initial support for allocating BPF JITs in vmalloc for x86
From: Daniel Borkmann @ 2019-02-06 15:38 UTC (permalink / raw)
  To: Alexei Starovoitov, Edgecombe, Rick P
  Cc: kristen@linux.intel.com, netdev@vger.kernel.org,
	ard.biesheuvel@linaro.org, Hansen, Dave
In-Reply-To: <730f7250-0b01-e2d5-6253-e15541ad4298@fb.com>

On 02/06/2019 02:40 AM, Alexei Starovoitov wrote:
> On 2/5/19 5:11 PM, Edgecombe, Rick P wrote:
>> On Wed, 2019-02-06 at 00:35 +0000, Alexei Starovoitov wrote:
>>> On 2/5/19 2:50 PM, Rick Edgecombe wrote:
>>>> This introduces a new capability for BPF program JIT's to be located in
>>>> vmalloc
>>>> space on x86_64. This can serve as a backup area for
>>>> CONFIG_BPF_JIT_ALWAYS_ON in
>>>> case an unprivileged app uses all of the module space allowed by
>>>> bpf_jit_limit.
>>>>
>>>> In order to allow for calls from the increased distance of vmalloc from
>>>> kernel/module space, relative calls are emitted as full indirect calls if
>>>> the
>>>> maximum relative call distance is exceeded. So the resulting performance of
>>>> call
>>>> BPF instructions in this case is similar to the BPF interpreter.
>>>
>>> If I read this correctly the patches introduce retpoline overhead
>>> to direct function call because JITed progs are more than 32-bit apart
>>> and they're far away only because of dubious security concern ?
>>> Nack.
>>>
>> There really isn't any overhead, because they are only far away if the module
>> space is full, or the bpf_jit_limit is exceeded for non-admin. So cases today
>> when insertions would succeed it emits the same code, but cases where the
>> insertion would fail due to lack of space, it now at least works with the
>> described performance.
> 
> I disagree with the problem statement.
> x86 classic BPF jit has been around forever and no one
> complained that _unprivileged_ bpf progs exhaust module space.
> With bpf_jit_limit we got an extra knob to close this
> remote possibility of an attack.
> It's more than enough.

Agree here, also for cap_sys_admin programs there is kind of unpredictability
when we cross the condition that module space is full and thus fall back having
to emit retpolines for every call, which *is* overhead and hard to debug behavior
wrt performance regression from user pov. So I'd rather have it fail and enlarge
the module space instead, for example.

Thanks,
Daniel

^ permalink raw reply

* Re: [PATCH] mlx4_ib: Increase the timeout for CM cache
From: Håkon Bugge @ 2019-02-06 15:40 UTC (permalink / raw)
  To: Jack Morgenstein
  Cc: Jason Gunthorpe, netdev, OFED mailing list, rds-devel,
	linux-kernel
In-Reply-To: <13750147-482A-4F90-976A-033C52DCF85E@oracle.com>



> On 6 Feb 2019, at 09:50, Håkon Bugge <haakon.bugge@oracle.com> wrote:
> 
> 
> 
>> On 5 Feb 2019, at 23:36, Jason Gunthorpe <jgg@ziepe.ca> wrote:
>> 
>> On Thu, Jan 31, 2019 at 06:09:51PM +0100, Håkon Bugge wrote:
>>> Using CX-3 virtual functions, either from a bare-metal machine or
>>> pass-through from a VM, MAD packets are proxied through the PF driver.
>>> 
>>> Since the VMs have separate name spaces for MAD Transaction Ids
>>> (TIDs), the PF driver has to re-map the TIDs and keep the book keeping
>>> in a cache.
>>> 
>>> Following the RDMA CM protocol, it is clear when an entry has to
>>> evicted form the cache. But life is not perfect, remote peers may die
>>> or be rebooted. Hence, it's a timeout to wipe out a cache entry, when
>>> the PF driver assumes the remote peer has gone.
>>> 
>>> We have experienced excessive amount of DREQ retries during fail-over
>>> testing, when running with eight VMs per database server.
>>> 
>>> The problem has been reproduced in a bare-metal system using one VM
>>> per physical node. In this environment, running 256 processes in each
>>> VM, each process uses RDMA CM to create an RC QP between himself and
>>> all (256) remote processes. All in all 16K QPs.
>>> 
>>> When tearing down these 16K QPs, excessive DREQ retries (and
>>> duplicates) are observed. With some cat/paste/awk wizardry on the
>>> infiniband_cm sysfs, we observe:
>>> 
>>>     dreq:       5007
>>> cm_rx_msgs:
>>>     drep:       3838
>>>     dreq:      13018
>>>      rep:       8128
>>>      req:       8256
>>>      rtu:       8256
>>> cm_tx_msgs:
>>>     drep:       8011
>>>     dreq:      68856
>>>      rep:       8256
>>>      req:       8128
>>>      rtu:       8128
>>> cm_tx_retries:
>>>     dreq:      60483
>>> 
>>> Note that the active/passive side is distributed.
>>> 
>>> Enabling pr_debug in cm.c gives tons of:
>>> 
>>> [171778.814239] <mlx4_ib> mlx4_ib_multiplex_cm_handler: id{slave:
>>> 1,sl_cm_id: 0xd393089f} is NULL!
>>> 
>>> By increasing the CM_CLEANUP_CACHE_TIMEOUT from 5 to 30 seconds, the
>>> tear-down phase of the application is reduced from 113 to 67
>>> seconds. Retries/duplicates are also significantly reduced:
>>> 
>>> cm_rx_duplicates:
>>>     dreq:       7726
>>> []
>>> cm_tx_retries:
>>>     drep:          1
>>>     dreq:       7779
>>> 
>>> Increasing the timeout further didn't help, as these duplicates and
>>> retries stem from a too short CMA timeout, which was 20 (~4 seconds)
>>> on the systems. By increasing the CMA timeout to 22 (~17 seconds), the
>>> numbers fell down to about one hundred for both of them.
>>> 
>>> Adjustment of the CMA timeout is _not_ part of this commit.
>>> 
>>> Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
>>> ---
>>> drivers/infiniband/hw/mlx4/cm.c | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>> 
>> Jack? What do you think?
> 
> I am tempted to send a v2 making this a sysctl tuneable. This because, full-rack testing using 8 servers, each with 8 VMs, only showed 33% reduction in the occurrences of "mlx4_ib_multiplex_cm_handler: id{slave:1,sl_cm_id: 0xd393089f} is NULL" with this commit.
> 
> But sure, Jack's opinion matters.

Jack,

A major contributor to the long processing time in the PF driver proxying QP1 packets is:

create_pv_resources
   -> ib_create_cq(ctx->ib_dev, mlx4_ib_tunnel_comp_handler,
                               NULL, ctx, cq_size, 0);

That is, comp_vector is zero.

Due to commit 6ba1eb776461 ("IB/mlx4: Scatter CQs to different EQs"), the zero comp_vector has the intent of let the mlx4_core driver select the least used vector.

But, in mlx4_ib_create_cq(), we have:

        pr_info("eq_table: %p\n", dev->eq_table);
        if (dev->eq_table) {
		vector = dev->eq_table[mlx4_choose_vector(dev->dev, vector,
                                                          ibdev->num_comp_vectors)];
        }

        cq->vector = vector;

and dev->eq_table is NULL, so all the CQs for the proxy QPs get comp_vector zero.

I have to make some reservations, as this analysis is based on uek4, but I think the code here is equal upstream, but need to double check.


Thxs, Håkon








> 
> 
> Thxs, Håkon
> 
>> 
>>> diff --git a/drivers/infiniband/hw/mlx4/cm.c b/drivers/infiniband/hw/mlx4/cm.c
>>> index fedaf8260105..8c79a480f2b7 100644
>>> --- a/drivers/infiniband/hw/mlx4/cm.c
>>> +++ b/drivers/infiniband/hw/mlx4/cm.c
>>> @@ -39,7 +39,7 @@
>>> 
>>> #include "mlx4_ib.h"
>>> 
>>> -#define CM_CLEANUP_CACHE_TIMEOUT  (5 * HZ)
>>> +#define CM_CLEANUP_CACHE_TIMEOUT  (30 * HZ)
>>> 
>>> struct id_map_entry {
>>> 	struct rb_node node;
>>> -- 
>>> 2.20.1


^ permalink raw reply

* Re: [PATCH net] virtio_net: Account for tx bytes and packets on sending xdp_frames
From: Jakub Kicinski @ 2019-02-06 15:52 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: Saeed Mahameed, dsahern@gmail.com, thoiland@redhat.com,
	virtualization@lists.linux-foundation.org, borkmann@iogearbox.net,
	Tariq Toukan, john.fastabend@gmail.com, mst@redhat.com,
	netdev@vger.kernel.org, jasowang@redhat.com, davem@davemloft.net,
	makita.toshiaki@lab.ntt.co.jp
In-Reply-To: <20190206144814.46996933@carbon>

On Wed, 6 Feb 2019 14:48:14 +0100, Jesper Dangaard Brouer wrote:
> On Wed, 6 Feb 2019 00:06:33 +0000
> Saeed Mahameed <saeedm@mellanox.com> wrote:
> 
> > 3) Unrelated, In non XDP case, if skb allocation fails or driver fails
> > to pass the skb up to the stack for somereason, should the driver
> > increase rx packets ? IMHO the answer should be yes if we want to have
> > similar behavior between XDP and non XDP cases.  
> 
> I don't think "skb allocation fails" should increase rx packets
> counter.  The difference is that these events are outside sysadm/users
> control, and is an error detected inside the driver.  The XDP program
> takes a policy choice to XDP_DROP a packet, which can be accounted
> inside the XDP prog (as the samples show) or as we also discuss via a
> more generic XDP-action counters.

FWIW that's my understanding as well.  My understanding of Linux stats
is that they are incrementing one counter per packet.  I.e. in RX
direction success packets are those given to the stack, and for TX
those given to the hardware.  Standards (IETF/IEEE) usually count stats
on the same layer boundary, but I think software generally counts when
it's done with the packet.

I haven't seen it documented anywhere, yet.  I have tried to document
it in the docs of the recent RFC:
https://patchwork.ozlabs.org/patch/1032332/

Incidentally XDP_DROP may have been better named XDP_DISCARD from stats
perspective ;)

^ permalink raw reply

* [ISSUE][4.20.6] mlx5 and checksum failures
From: Ian Kumlien @ 2019-02-06 16:16 UTC (permalink / raw)
  To: Linux Kernel Network Developers

Hi,

I'm hitting an issue that i think is fixed by the following patch,
i haven't verified it yet - but it looks like it should go on the
stable queue(?)

(And yes, I did look, and couldn't find it ;))

commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
Author: Cong Wang <xiyou.wangcong@gmail.com>
Date:   Mon Dec 3 22:14:04 2018 -0800

    net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames

    When an ethernet frame is padded to meet the minimum ethernet frame
    size, the padding octets are not covered by the hardware checksum.
    Fortunately the padding octets are usually zero's, which don't affect
    checksum. However, we have a switch which pads non-zero octets, this
    causes kernel hardware checksum fault repeatedly.

    Prior to:
    commit '88078d98d1bb ("net: pskb_trim_rcsum() and
CHECKSUM_COMPLETE ...")'
    skb checksum was forced to be CHECKSUM_NONE when padding is detected.
    After it, we need to keep skb->csum updated, like what we do for RXFCS.
    However, fixing up CHECKSUM_COMPLETE requires to verify and parse IP
    headers, it is not worthy the effort as the packets are so small that
    CHECKSUM_COMPLETE can't save anything.

    Fixes: 88078d98d1bb ("net: pskb_trim_rcsum() and CHECKSUM_COMPLETE
are friends"),
    Cc: Eric Dumazet <edumazet@google.com>
    Cc: Tariq Toukan <tariqt@mellanox.com>
    Cc: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
    Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
    Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 1d0bb5ff8c26..f86e4804e83e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -732,6 +732,8 @@ static u8 get_ip_proto(struct sk_buff *skb, int
network_depth, __be16 proto)
                                            ((struct ipv6hdr
*)ip_p)->nexthdr;
 }

+#define short_frame(size) ((size) <= ETH_ZLEN + ETH_FCS_LEN)
+
 static inline void mlx5e_handle_csum(struct net_device *netdev,
                                     struct mlx5_cqe64 *cqe,
                                     struct mlx5e_rq *rq,
@@ -754,6 +756,17 @@ static inline void mlx5e_handle_csum(struct
net_device *netdev,
        if (unlikely(test_bit(MLX5E_RQ_STATE_NO_CSUM_COMPLETE, &rq->state)))
                goto csum_unnecessary;

+       /* CQE csum doesn't cover padding octets in short ethernet
+        * frames. And the pad field is appended prior to calculating
+        * and appending the FCS field.
+        *
+        * Detecting these padded frames requires to verify and parse
+        * IP headers, so we simply force all those small frames to be
+        * CHECKSUM_UNNECESSARY even if they are not padded.
+        */
+       if (short_frame(skb->len))
+               goto csum_unnecessary;
+
        if (likely(is_last_ethertype_ip(skb, &network_depth, &proto))) {
                if (unlikely(get_ip_proto(skb, network_depth, proto)
== IPPROTO_SCTP))
                        goto csum_unnecessary;
---

Kernel log:
[ 3226.017424] bond0: hw csum failure
[ 3226.018387] CPU: 13 PID: 0 Comm: swapper/13 Tainted: G          I
    4.20.6-1.el7.elrepo.x86_64 #1
[ 3226.020928] Hardware name: HP ProLiant DL380 G6, BIOS P62 01/22/2015
[ 3226.022649] Call Trace:
[ 3226.023409]  <IRQ>
[ 3226.024039]  dump_stack+0x63/0x88
[ 3226.025066]  netdev_rx_csum_fault+0x3a/0x40
[ 3226.026208]  __skb_checksum_complete+0xd5/0xe0
[ 3226.027418]  nf_ip_checksum+0xc9/0xf0
[ 3226.028474]  nf_checksum+0x2d/0x40
[ 3226.029504]  tcp_packet+0x2ce/0xa20 [nf_conntrack]
[ 3226.030913]  ? tcp_v4_do_rcv+0x77/0x1f0
[ 3226.032094]  ? sock_put+0x19/0x20
[ 3226.033070]  ? nf_ct_deliver_cached_events+0xd0/0x110 [nf_conntrack]
[ 3226.034754]  nf_conntrack_in+0x140/0x510 [nf_conntrack]
[ 3226.036228]  ipv4_conntrack_in+0x14/0x20 [nf_conntrack]
[ 3226.037646]  nf_hook_slow+0x42/0xc0
[ 3226.038626]  ip_rcv+0xb5/0xd0
[ 3226.039480]  ? ip_local_deliver_finish+0x1e0/0x1e0
[ 3226.040767]  __netif_receive_skb_one_core+0x57/0x80
[ 3226.042155]  __netif_receive_skb+0x18/0x60
[ 3226.043275]  netif_receive_skb_internal+0x45/0xf0
[ 3226.044530]  napi_gro_receive+0xd0/0xf0
[ 3226.045665]  mlx5e_handle_rx_cqe+0x1e6/0x540 [mlx5_core]
[ 3226.047167]  mlx5e_poll_rx_cq+0xd6/0x9c0 [mlx5_core]
[ 3226.048516]  mlx5e_napi_poll+0xc2/0xcd0 [mlx5_core]
[ 3226.049836]  ? mlx5_eq_int+0x4b4/0x6c0 [mlx5_core]
[ 3226.051118]  net_rx_action+0x289/0x3d0
[ 3226.052257]  __do_softirq+0xd5/0x2a2
[ 3226.053277]  irq_exit+0xe8/0x100
[ 3226.054183]  do_IRQ+0x59/0xe0
[ 3226.055014]  common_interrupt+0xf/0xf
[ 3226.056038]  </IRQ>
[ 3226.056722] RIP: 0010:cpuidle_enter_state+0xba/0x2f0
[ 3226.058087] Code: d0 95 7e e8 38 07 a1 ff 41 8b 5c 24 04 49 89 c6
66 66 66 66 90 31 ff e8 34 19 a1 ff 80 7d cf 00 0f 85 8c 01 00 00 fb
66 66 90 <66> 66 90 45 85 ed 0f 88 94 01 00 00 4c 2b 75 c0 48 ba cf f7
53 e3
[ 3226.062925] RSP: 0018:ffffc9000c547e50 EFLAGS: 00000246 ORIG_RAX:
ffffffffffffffd6
[ 3226.064974] RAX: ffff88a3df7a2dc0 RBX: 000000000000000d RCX: 000000000000001f
[ 3226.066866] RDX: 0000000000000000 RSI: 0000000000000002 RDI: 0000000000000000
[ 3226.068747] RBP: ffffc9000c547e90 R08: 0000000000000002 R09: ffffffcdc506f2e7
[ 3226.070622] R10: 0000000000000018 R11: 071c71c71c71c71c R12: ffffe8ffffb96f00
[ 3226.072525] R13: 0000000000000004 R14: 000002ef1d9f1e10 R15: ffff88a3d8900000
[ 3226.074479]  cpuidle_enter+0x17/0x20
[ 3226.075463]  call_cpuidle+0x23/0x40
[ 3226.076412]  do_idle+0x1db/0x280
[ 3226.077323]  cpu_startup_entry+0x1d/0x30
[ 3226.078417]  start_secondary+0x1ae/0x200
[ 3226.079490]  secondary_startup_64+0xa4/0xb0

^ permalink raw reply related

* Re: [xfrm, backport request] Request backport of e2612cd496e7 - set-mark backwards compatibility
From: Greg Kroah-Hartman @ 2019-02-06 16:29 UTC (permalink / raw)
  To: Maciej Żenczykowski
  Cc: Benedict Wong, Greg Kroah-Hartman, Linux NetDev, Nathan Harold,
	Lorenzo Colitti, Steffen Klassert, Eyal Birger, tobias
In-Reply-To: <CANP3RGcdptMfnJXQNjNdiD4kuoZxj4vETro_aWRQqzsxY=yAfQ@mail.gmail.com>

On Tue, Feb 05, 2019 at 02:54:29PM -0800, Maciej Żenczykowski wrote:
> > I propose backporting commit e2612cd496e7 ("xfrm: Make set-mark default
> > behavior backward compatible") to 4.19 and 4.20 kernels to fix a backwards
> > compatibility bug introduced in 9b42c1f179a6 (“xfrm: Extend the
> > output_mark to support input direction and masking”).
> >
> > The fix is small, relatively simple, and has unit tests. :)
> >
> > Without this change, systems using mark-based routing on 4.19 or 4.20
> > kernels will by fail to route IPsec tunnel mode packets correctly in the
> > default case. This specifically affects Android devices.
> 
> Looks like it already includes a 'fixes: sha1' tag.
> I'm not sure what causes these patches to get picked up for stable...
> I'm guessing it's some sort of Greg-fu-style-magic...?

<formletter>

This is not the correct way to submit patches for inclusion in the
stable kernel tree.  Please read:
    https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html
for how to do this properly.

</formletter>

Hint, putting a fixes tag there does NOT trigger my scripts...

^ permalink raw reply

* Re: [PATCH net 1/2] geneve: should not call rt6_lookup() when ipv6 was disabled
From: Eric Dumazet @ 2019-02-06 16:43 UTC (permalink / raw)
  To: Hangbin Liu, netdev; +Cc: Stefano Brivio, David Miller
In-Reply-To: <20190206125111.5286-2-liuhangbin@gmail.com>



On 02/06/2019 04:51 AM, Hangbin Liu wrote:
> When we add a new GENEVE device with IPv6 remote, checking only for
> IS_ENABLED(CONFIG_IPV6) is not enough as we may disable IPv6 in kernel
> cmd(ipv6.disable=1), which will cause a NULL pointer dereference.
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  drivers/net/geneve.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/geneve.c b/drivers/net/geneve.c
> index 58bbba8582b0..0658715581e3 100644
> --- a/drivers/net/geneve.c
> +++ b/drivers/net/geneve.c
> @@ -1512,6 +1512,10 @@ static void geneve_link_config(struct net_device *dev,
>  	}
>  #if IS_ENABLED(CONFIG_IPV6)
>  	case AF_INET6: {
> +		struct inet6_dev *idev = in6_dev_get(dev);
> +		if (!idev)
> +			break;

Do not mix declarations and code.

Instead, use :
		struct inet6_dev *idev = in6_dev_get(dev);
		struct rt6_info *rt;

		if (!idev)
			break;
		rt = rt6_lookup(geneve->net, ...);

> +
>  		struct rt6_info *rt = rt6_lookup(geneve->net,
>  						 &info->key.u.ipv6.dst, NULL, 0,
>  						 NULL, 0);
> @@ -1519,6 +1523,8 @@ static void geneve_link_config(struct net_device *dev,
>  		if (rt && rt->dst.dev)
>  			ldev_mtu = rt->dst.dev->mtu - GENEVE_IPV6_HLEN;
>  		ip6_rt_put(rt);
> +
> +		in6_dev_put(idev);
>  		break;
>  	}
>  #endif
> 

^ permalink raw reply

* Re: [PATCH net 2/2] sit: check if IPv6 enabled before call ip6_err_gen_icmpv6_unreach()
From: Eric Dumazet @ 2019-02-06 16:45 UTC (permalink / raw)
  To: Hangbin Liu, netdev; +Cc: Stefano Brivio, David Miller
In-Reply-To: <20190206125111.5286-3-liuhangbin@gmail.com>



On 02/06/2019 04:51 AM, Hangbin Liu wrote:
> If we disabled IPv6 from kernel boot up cmd(ipv6.disable=1), we should not
> call ip6_err_gen_icmpv6_unreach().
> 
> Reproducer:
> ip link add sit1 type sit local 10.10.0.1 remote 10.10.1.1 ttl 1
> ip link set sit1 up
> ip addr add 192.168.0.1/24 dev sit1
> ping 192.168.0.2
> 
> Reported-by: Jianlin Shi <jishi@redhat.com>
> Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
> ---
>  net/ipv6/sit.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/net/ipv6/sit.c b/net/ipv6/sit.c
> index 1e03305c0549..e43fbac0fd1a 100644
> --- a/net/ipv6/sit.c
> +++ b/net/ipv6/sit.c
> @@ -493,6 +493,7 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
>  	const int type = icmp_hdr(skb)->type;
>  	const int code = icmp_hdr(skb)->code;
>  	unsigned int data_len = 0;
> +	struct inet6_dev *idev;
>  	struct ip_tunnel *t;
>  	int sifindex;
>  	int err;
> @@ -546,8 +547,13 @@ static int ipip6_err(struct sk_buff *skb, u32 info)
>  	}
>  
>  	err = 0;
> -	if (!ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4, type, data_len))
> +
> +	idev = in6_dev_get(skb->dev);
> +	if (idev &&
> +	    !ip6_err_gen_icmpv6_unreach(skb, iph->ihl * 4, type, data_len)) {
> +		in6_dev_put(idev);
>  		goto out;
> +	}
>  


It seems there is a missing in6_dev_put(idev) depending on ip6_err_gen_icmpv6_unreach()() return value ?

^ permalink raw reply

* INFO: task hung in tipc_bcast_stop
From: syzbot @ 2019-02-06 16:47 UTC (permalink / raw)
  To: davem, jon.maloy, linux-kernel, netdev, syzkaller-bugs,
	tipc-discussion, ying.xue

Hello,

syzbot found the following crash on:

HEAD commit:    c8101f7729da net: dsa: Fix lockdep false positive splat
git tree:       net
console output: https://syzkaller.appspot.com/x/log.txt?x=1553f728c00000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2e0064f906afee10
dashboard link: https://syzkaller.appspot.com/bug?extid=8118fd903ae608d128e1
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+8118fd903ae608d128e1@syzkaller.appspotmail.com

bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:3e:c4:ef:c6:e1:9c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:a6:63:10:7b:ed:f3, vlan:0)
INFO: task kworker/u4:4:3054 blocked for more than 140 seconds.
       Not tainted 5.0.0-rc4+ #63
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
net_ratelimit: 66009 callbacks suppressed
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on bridge_slave_0 with own address as source  
address (addr:aa:aa:aa:aa:aa:1b, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on bridge_slave_0 with own address as source  
address (addr:aa:aa:aa:aa:aa:1b, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:3e:c4:ef:c6:e1:9c, vlan:0)
bridge0: received packet on bridge_slave_0 with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:a6:63:10:7b:ed:f3, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
bridge0: received packet on veth0_to_bridge with own address as source  
address (addr:aa:aa:aa:aa:aa:0c, vlan:0)
kworker/u4:4    D24832  3054      2 0x80000000
Workqueue: netns cleanup_net
Call Trace:
  context_switch kernel/sched/core.c:2844 [inline]
  __schedule+0x817/0x1cc0 kernel/sched/core.c:3485
  schedule+0x92/0x180 kernel/sched/core.c:3529
  exp_funnel_lock kernel/rcu/tree_exp.h:319 [inline]
  _synchronize_rcu_expedited.constprop.0+0x4d4/0x530  
kernel/rcu/tree_exp.h:622
  synchronize_rcu_expedited+0x27/0xa0 kernel/rcu/tree_exp.h:762
  synchronize_net+0x3b/0x60 net/core/dev.c:9165
  tipc_bcast_stop+0x198/0x310 net/tipc/bcast.c:545
  tipc_exit_net+0x26/0x40 net/tipc/core.c:102
  ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153
  cleanup_net+0x3fb/0x960 net/core/net_namespace.c:551
  process_one_work+0x98e/0x1760 kernel/workqueue.c:2153
  worker_thread+0x98/0xe40 kernel/workqueue.c:2296
  kthread+0x357/0x430 kernel/kthread.c:246
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
INFO: task syz-executor1:17203 blocked for more than 140 seconds.
       Not tainted 5.0.0-rc4+ #63
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor1   D30144 17203   7626 0x00000004
Call Trace:
  context_switch kernel/sched/core.c:2844 [inline]
  __schedule+0x817/0x1cc0 kernel/sched/core.c:3485
  schedule+0x92/0x180 kernel/sched/core.c:3529
  __rwsem_down_write_failed_common kernel/locking/rwsem-xadd.c:584 [inline]
  rwsem_down_write_failed+0x774/0xc30 kernel/locking/rwsem-xadd.c:613
  call_rwsem_down_write_failed+0x17/0x30 arch/x86/lib/rwsem.S:117
  __down_write arch/x86/include/asm/rwsem.h:142 [inline]
  down_write+0x53/0x90 kernel/locking/rwsem.c:72
  register_netdevice_notifier+0x7e/0x630 net/core/dev.c:1634
  bcm_init+0x1a8/0x220 net/can/bcm.c:1502
  can_create+0x28a/0x4b0 net/can/af_can.c:183
  __sock_create+0x3e6/0x750 net/socket.c:1275
  sock_create net/socket.c:1315 [inline]
  __sys_socket+0x103/0x220 net/socket.c:1345
  __do_sys_socket net/socket.c:1354 [inline]
  __se_sys_socket net/socket.c:1352 [inline]
  __x64_sys_socket+0x73/0xb0 net/socket.c:1352
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e39
Code: Bad RIP value.
RSP: 002b:00007f2b02d39c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39
RDX: 0000000000000002 RSI: 0000000000000002 RDI: 000000000000001d
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2b02d3a6d4
R13: 00000000004c60f6 R14: 00000000004db1e0 R15: 00000000ffffffff
INFO: task syz-executor1:17207 blocked for more than 140 seconds.
       Not tainted 5.0.0-rc4+ #63
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
syz-executor1   D29872 17207   7626 0x00000004
Call Trace:
  context_switch kernel/sched/core.c:2844 [inline]
  __schedule+0x817/0x1cc0 kernel/sched/core.c:3485
  schedule+0x92/0x180 kernel/sched/core.c:3529
  __rwsem_down_write_failed_common kernel/locking/rwsem-xadd.c:584 [inline]
  rwsem_down_write_failed+0x774/0xc30 kernel/locking/rwsem-xadd.c:613
  call_rwsem_down_write_failed+0x17/0x30 arch/x86/lib/rwsem.S:117
  __down_write arch/x86/include/asm/rwsem.h:142 [inline]
  down_write+0x53/0x90 kernel/locking/rwsem.c:72
  register_netdevice_notifier+0x7e/0x630 net/core/dev.c:1634
  bcm_init+0x1a8/0x220 net/can/bcm.c:1502
  can_create+0x28a/0x4b0 net/can/af_can.c:183
  __sock_create+0x3e6/0x750 net/socket.c:1275
  sock_create net/socket.c:1315 [inline]
  __sys_socket+0x103/0x220 net/socket.c:1345
  __do_sys_socket net/socket.c:1354 [inline]
  __se_sys_socket net/socket.c:1352 [inline]
  __x64_sys_socket+0x73/0xb0 net/socket.c:1352
  do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e39
Code: Bad RIP value.
RSP: 002b:00007f2b02d18c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39
RDX: 0000000000000002 RSI: 0000000000000002 RDI: 000000000000001d
RBP: 000000000073bfa0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f2b02d196d4
R13: 00000000004c60f6 R14: 00000000004db1e0 R15: 00000000ffffffff

Showing all locks held in the system:
3 locks held by ksoftirqd/1/16:
1 lock held by khungtaskd/1040:
  #0: 00000000e3131bf8 (rcu_read_lock){....}, at:  
debug_show_all_locks+0x5f/0x27e kernel/locking/lockdep.c:4389
3 locks held by kworker/u4:4/3054:
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
__write_once_size include/linux/compiler.h:220 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
arch_atomic64_set arch/x86/include/asm/atomic64_64.h:34 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at: atomic64_set  
include/asm-generic/atomic-instrumented.h:40 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
atomic_long_set include/asm-generic/atomic-long.h:59 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at: set_work_data  
kernel/workqueue.c:617 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
set_work_pool_and_clear_pending kernel/workqueue.c:644 [inline]
  #0: 000000002f6c8bac ((wq_completion)"%s""netns"){+.+.}, at:  
process_one_work+0x87e/0x1760 kernel/workqueue.c:2124
  #1: 00000000356e77de (net_cleanup_work){+.+.}, at:  
process_one_work+0x8b4/0x1760 kernel/workqueue.c:2128
  #2: 0000000070308617 (pernet_ops_rwsem){++++}, at: cleanup_net+0xae/0x960  
net/core/net_namespace.c:518
1 lock held by rsyslogd/7473:
2 locks held by getty/7586:
  #0: 000000007d8d2e99 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 000000009ac587d5 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7587:
  #0: 000000009302afb1 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000be8c1a8d (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7588:
  #0: 0000000082755b9e (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000a2bc657f (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7589:
  #0: 00000000d211856e (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000279c75af (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7590:
  #0: 000000000111820e (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000db1c3ad5 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7591:
  #0: 00000000756df6ea (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 0000000026a342a6 (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
2 locks held by getty/7592:
  #0: 00000000f85959b5 (&tty->ldisc_sem){++++}, at:  
ldsem_down_read+0x33/0x40 drivers/tty/tty_ldsem.c:341
  #1: 00000000a7cf75cc (&ldata->atomic_read_lock){+.+.}, at:  
n_tty_read+0x232/0x1b70 drivers/tty/n_tty.c:2154
1 lock held by syz-executor1/17203:
  #0: 0000000070308617 (pernet_ops_rwsem){++++}, at:  
register_netdevice_notifier+0x7e/0x630 net/core/dev.c:1634
1 lock held by syz-executor1/17207:
  #0: 0000000070308617 (pernet_ops_rwsem){++++}, at:  
register_netdevice_notifier+0x7e/0x630 net/core/dev.c:1634

=============================================

NMI backtrace for cpu 1
CPU: 1 PID: 1040 Comm: khungtaskd Not tainted 5.0.0-rc4+ #63
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x172/0x1f0 lib/dump_stack.c:113
  nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
  nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
  arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
  trigger_all_cpu_backtrace include/linux/nmi.h:146 [inline]
  check_hung_uninterruptible_tasks kernel/hung_task.c:203 [inline]
  watchdog+0x9df/0xee0 kernel/hung_task.c:287
  kthread+0x357/0x430 kernel/kthread.c:246
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
Sending NMI from CPU 1 to CPUs 0:
INFO: NMI handler (nmi_cpu_backtrace_handler) took too long to run: 1.389  
msecs
NMI backtrace for cpu 0
CPU: 0 PID: 9 Comm: ksoftirqd/0 Not tainted 5.0.0-rc4+ #63
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:rol32 include/linux/bitops.h:83 [inline]
RIP: 0010:jhash2 include/linux/jhash.h:128 [inline]
RIP: 0010:hash_stack lib/stackdepot.c:161 [inline]
RIP: 0010:depot_save_stack+0x76/0x460 lib/stackdepot.c:230
Code: 0f 86 ec 01 00 00 44 89 cb 44 89 c8 48 89 ca 03 5a 08 41 83 e8 03 48  
83 c2 0c 44 03 4a f4 03 42 f8 45 89 cb 41 89 d9 41 29 db <41> c1 c1 04 01  
c3 45 31 d9 44 29 c8 41 89 c2 44 89 c8 41 01 d9 c1
RSP: 0018:ffff8880a9886ed8 EFLAGS: 00000206
RAX: 00000000ce5caf29 RBX: 00000000b5fb600c RCX: ffff8880a9886f38
RDX: ffff8880a9886f68 RSI: 0000000000490220 RDI: 0000000000000016
RBP: ffff8880a9886f10 R08: 0000000000000020 R09: 00000000b5fb600c
R10: 000000008a18bda7 R11: 000000001a539322 R12: 0000000000490220
R13: ffff8880a9886f20 R14: 0000000000000000 R15: ffff88812c3f0940
FS:  0000000000000000(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000457e0f CR3: 00000000a4f2e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  save_stack+0xa9/0xd0 mm/kasan/common.c:79
  set_track mm/kasan/common.c:85 [inline]
  __kasan_kmalloc mm/kasan/common.c:496 [inline]
  __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
  kasan_kmalloc mm/kasan/common.c:504 [inline]
  kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:411
  slab_post_alloc_hook mm/slab.h:444 [inline]
  slab_alloc_node mm/slab.c:3324 [inline]
  kmem_cache_alloc_node_trace+0x13c/0x720 mm/slab.c:3650
  __do_kmalloc_node mm/slab.c:3672 [inline]
  __kmalloc_node_track_caller+0x3d/0x70 mm/slab.c:3687
  __kmalloc_reserve.isra.0+0x40/0xf0 net/core/skbuff.c:140
  __alloc_skb+0x10b/0x5e0 net/core/skbuff.c:208
  alloc_skb include/linux/skbuff.h:1011 [inline]
  nlmsg_new include/net/netlink.h:654 [inline]
  fdb_notify+0x9f/0x190 net/bridge/br_fdb.c:706
  br_fdb_update net/bridge/br_fdb.c:601 [inline]
  br_fdb_update+0x2fd/0xa70 net/bridge/br_fdb.c:562
  br_handle_frame_finish+0x84f/0x14c0 net/bridge/br_input.c:97
  br_nf_hook_thresh+0x2ec/0x380 net/bridge/br_netfilter_hooks.c:1009
  br_nf_pre_routing_finish_ipv6+0x708/0xdc0  
net/bridge/br_netfilter_ipv6.c:210
  NF_HOOK include/linux/netfilter.h:289 [inline]
  br_nf_pre_routing_ipv6+0x3c4/0x740 net/bridge/br_netfilter_ipv6.c:238
  br_nf_pre_routing+0xe80/0x13a0 net/bridge/br_netfilter_hooks.c:482
  nf_hook_entry_hookfn include/linux/netfilter.h:119 [inline]
  nf_hook_slow+0xbf/0x1f0 net/netfilter/core.c:511
  nf_hook include/linux/netfilter.h:244 [inline]
  NF_HOOK include/linux/netfilter.h:287 [inline]
  br_handle_frame+0x95b/0x1450 net/bridge/br_input.c:305
  __netif_receive_skb_core+0xa96/0x3010 net/core/dev.c:4902
  __netif_receive_skb_one_core+0xa8/0x1a0 net/core/dev.c:4971
  __netif_receive_skb+0x2c/0x1c0 net/core/dev.c:5083
  process_backlog+0x206/0x750 net/core/dev.c:5923
  napi_poll net/core/dev.c:6346 [inline]
  net_rx_action+0x4fa/0x1070 net/core/dev.c:6412
  __do_softirq+0x266/0x95a kernel/softirq.c:292
  run_ksoftirqd kernel/softirq.c:654 [inline]
  run_ksoftirqd+0x8e/0x110 kernel/softirq.c:646
  smpboot_thread_fn+0x6ab/0xa10 kernel/smpboot.c:164
  kthread+0x357/0x430 kernel/kthread.c:246
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with  
syzbot.

^ permalink raw reply

* Re: [Patch net-next] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Saeed Mahameed @ 2019-02-06 16:55 UTC (permalink / raw)
  To: netdev@vger.kernel.org, Tariq Toukan, xiyou.wangcong@gmail.com
In-Reply-To: <1620af96-f60c-9ec2-dd71-7c9081828bce@mellanox.com>

On Wed, 2019-02-06 at 12:02 +0000, Tariq Toukan wrote:
> 
> On 2/6/2019 2:35 AM, Cong Wang wrote:
> > mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
> > gets a lot of contentions when we test some heavy workload
> > with 60 RX queues and 80 CPU's, and it is clearly shown in the
> > flame graph.
> > 


Hi Cong,

The patch is ok to me, but i really doubt that you can hit a contention
on latest upstream driver, since we already have spinlock per EQ, which
means spinlock per core,  each EQ (core) msix handler can only access
one spinlock (its own), so I am surprised how you got the contention,
Maybe you are not running on latest upstream driver ?

what is the workload ? 

> > In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> > we don't have to take a spinlock on this hot path. It is pretty
> > much
> > similar to commit 291c566a2891
> > ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> > are still serialized with the spinlock, and with synchronize_irq()
> > it should be safe to just move the fast path to RCU read lock.
> > 
> > This patch itself reduces the latency by about 50% with our
> > workload.
> > 
> > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > Cc: Tariq Toukan <tariqt@mellanox.com>
> > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > ---
> >   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
> >   1 file changed, 6 insertions(+), 6 deletions(-)
> > 
> > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > index ee04aab65a9f..7092457705a2 100644
> > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > @@ -114,11 +114,11 @@ static struct mlx5_core_cq
> > *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
> >   	struct mlx5_cq_table *table = &eq->cq_table;
> >   	struct mlx5_core_cq *cq = NULL;
> >   
> > -	spin_lock(&table->lock);
> > +	rcu_read_lock();
> >   	cq = radix_tree_lookup(&table->tree, cqn);
> >   	if (likely(cq))
> >   		mlx5_cq_hold(cq);
> > -	spin_unlock(&table->lock);
> > +	rcu_read_unlock();
> 
> Thanks for you patch.
> 
> I think we can improve it further, by taking the if statement out of
> the 
> critical section.
> 

No, mlx5_cq_hold must stay under RCU read, otherwise cq might get freed
before the irq gets a change to increment ref count on it.

another way to do it is not to do any refcounting in the irq handler
and fence cq removal via synchronize_irq(eq->irqn) on mlx5_eq_del_cq.
But let's keep one approach (refcounting), synchronize_irq/rcu can be
heavy sometimes especially on RDMA workloads with many create/destroy
cq in loops.

> Other than that, patch LGTM.
> 
> Regards,
> Tariq
> 
> >   
> >   	return cq;
> >   }
> > @@ -371,9 +371,9 @@ int mlx5_eq_add_cq(struct mlx5_eq *eq, struct
> > mlx5_core_cq *cq)
> >   	struct mlx5_cq_table *table = &eq->cq_table;
> >   	int err;
> >   
> > -	spin_lock_irq(&table->lock);
> > +	spin_lock(&table->lock);
> >   	err = radix_tree_insert(&table->tree, cq->cqn, cq);
> > -	spin_unlock_irq(&table->lock);
> > +	spin_unlock(&table->lock);
> >   
> >   	return err;
> >   }
> > @@ -383,9 +383,9 @@ int mlx5_eq_del_cq(struct mlx5_eq *eq, struct
> > mlx5_core_cq *cq)
> >   	struct mlx5_cq_table *table = &eq->cq_table;
> >   	struct mlx5_core_cq *tmp;
> >   
> > -	spin_lock_irq(&table->lock);
> > +	spin_lock(&table->lock);
> >   	tmp = radix_tree_delete(&table->tree, cq->cqn);
> > -	spin_unlock_irq(&table->lock);
> > +	spin_unlock(&table->lock);
> >   
> >   	if (!tmp) {
> >   		mlx5_core_warn(eq->dev, "cq 0x%x not found in eq 0x%x
> > tree\n", eq->eqn, cq->cqn);
> > 

^ permalink raw reply

* Re: [Patch net-next] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Cong Wang @ 2019-02-06 17:15 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <e88ac12b80b510648f7ab1d4cee50c43908ba49d.camel@mellanox.com>

On Wed, Feb 6, 2019 at 8:55 AM Saeed Mahameed <saeedm@mellanox.com> wrote:
> Hi Cong,
>
> The patch is ok to me, but i really doubt that you can hit a contention
> on latest upstream driver, since we already have spinlock per EQ, which
> means spinlock per core,  each EQ (core) msix handler can only access
> one spinlock (its own), so I am surprised how you got the contention,
> Maybe you are not running on latest upstream driver ?

We are running 4.14 stable release. Which commit changes the game
here? We can consider to backport it unless it is complicated.

Also, if you don't like this patch, we are happy to carry it for our own,
sometimes it isn't worth the time to push into upstream.

>
> what is the workload ?
>

It's a memcached RPC performance test, that is all I can tell.
(Apparently I have almost zero knowledge about memcached.)


> > > In fact, radix_tree_lookup() is perfectly fine with RCU read lock,
> > > we don't have to take a spinlock on this hot path. It is pretty
> > > much
> > > similar to commit 291c566a2891
> > > ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow paths
> > > are still serialized with the spinlock, and with synchronize_irq()
> > > it should be safe to just move the fast path to RCU read lock.
> > >
> > > This patch itself reduces the latency by about 50% with our
> > > workload.
> > >
> > > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > > Cc: Tariq Toukan <tariqt@mellanox.com>
> > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > > ---
> > >   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++------
> > >   1 file changed, 6 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > index ee04aab65a9f..7092457705a2 100644
> > > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > @@ -114,11 +114,11 @@ static struct mlx5_core_cq
> > > *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
> > >     struct mlx5_cq_table *table = &eq->cq_table;
> > >     struct mlx5_core_cq *cq = NULL;
> > >
> > > -   spin_lock(&table->lock);
> > > +   rcu_read_lock();
> > >     cq = radix_tree_lookup(&table->tree, cqn);
> > >     if (likely(cq))
> > >             mlx5_cq_hold(cq);
> > > -   spin_unlock(&table->lock);
> > > +   rcu_read_unlock();
> >
> > Thanks for you patch.
> >
> > I think we can improve it further, by taking the if statement out of
> > the
> > critical section.
> >
>
> No, mlx5_cq_hold must stay under RCU read, otherwise cq might get freed
> before the irq gets a change to increment ref count on it.
>

Agreed.


Thanks.

^ permalink raw reply

* Re: [Patch net-next] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Eric Dumazet @ 2019-02-06 17:17 UTC (permalink / raw)
  To: Saeed Mahameed, netdev@vger.kernel.org, Tariq Toukan,
	xiyou.wangcong@gmail.com
In-Reply-To: <e88ac12b80b510648f7ab1d4cee50c43908ba49d.camel@mellanox.com>



On 02/06/2019 08:55 AM, Saeed Mahameed wrote:
> On Wed, 2019-02-06 at 12:02 +0000, Tariq Toukan wrote:
>>
>> On 2/6/2019 2:35 AM, Cong Wang wrote:
>>> mlx5_eq_cq_get() is called in IRQ handler, the spinlock inside
>>> gets a lot of contentions when we test some heavy workload
>>> with 60 RX queues and 80 CPU's, and it is clearly shown in the
>>> flame graph.
>>>
> 
> 
> Hi Cong,
> 
> The patch is ok to me, but i really doubt that you can hit a contention
> on latest upstream driver, since we already have spinlock per EQ, which
> means spinlock per core,  each EQ (core) msix handler can only access
> one spinlock (its own), so I am surprised how you got the contention,
> Maybe you are not running on latest upstream driver ?
> 
> what is the workload ? 

Surprisingly (or not), atomic operations, even on _not_ contended cache lines can
stall the cpu enough for perf tools to notice...

If the atomic operation can be trivially replaced by RCU, then do it by any mean.



^ permalink raw reply

* Re: [ISSUE][4.20.6] mlx5 and checksum failures
From: Saeed Mahameed @ 2019-02-06 17:18 UTC (permalink / raw)
  To: ian.kumlien@gmail.com, netdev@vger.kernel.org,
	davem@davemloft.net
In-Reply-To: <CAA85sZtE7Gv8mKL5tUh8AJ4yG9xd_HZh9svWkHXm=j7VohD1Cw@mail.gmail.com>

On Wed, 2019-02-06 at 17:16 +0100, Ian Kumlien wrote:
> Hi,
> 
> I'm hitting an issue that i think is fixed by the following patch,
> i haven't verified it yet - but it looks like it should go on the
> stable queue(?)
> 
> (And yes, I did look, and couldn't find it ;))
> 

Yes, i couldn't find it neither,

It should have been queued up for 4.18 by now.
Dave said he will take care of it, maybe he just forgot or something.
since the patch needed some extra care..

https://patchwork.ozlabs.org/patch/1027837/

Dave, Is there anything i can do here ?

Thanks,
Saeed.

> commit e8c8b53ccaff568fef4c13a6ccaf08bf241aa01a
> Author: Cong Wang <xiyou.wangcong@gmail.com>
> Date:   Mon Dec 3 22:14:04 2018 -0800
> 
>     net/mlx5e: Force CHECKSUM_UNNECESSARY for short ethernet frames


^ permalink raw reply

* Re: [PATCH iproute2-next v3] devlink: report cell size
From: David Ahern @ 2019-02-06 17:19 UTC (permalink / raw)
  To: Jakub Kicinski, idosch, jiri; +Cc: stephen, oss-drivers, netdev
In-Reply-To: <20190204152859.6667-1-jakub.kicinski@netronome.com>

On 2/4/19 7:28 AM, Jakub Kicinski wrote:
> Print the value of DEVLINK_ATTR_SB_POOL_CELL_SIZE, if reported.
> 
> Example:
> pci/0000:82:00.0:
>   sb 1 pool 0 type egress size 40945664 thtype static cell_size 2048
>   sb 2 pool 0 type egress size 258867200 thtype static cell_size 10240
> ...
> 
> v3: - don't double space.
> v2: - fix spelling.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
> ---
>  devlink/devlink.c     |  3 +++
>  man/man8/devlink-sb.8 | 10 ++++++++++
>  2 files changed, 13 insertions(+)

applied to iproute2-next. Thanks

^ permalink raw reply

* Re: [PATCH iproute2-next v5] devlink: add info subcommand
From: David Ahern @ 2019-02-06 17:20 UTC (permalink / raw)
  To: Jakub Kicinski, jiri; +Cc: stephen, netdev, oss-drivers
In-Reply-To: <20190204161011.7808-1-jakub.kicinski@netronome.com>

On 2/4/19 8:10 AM, Jakub Kicinski wrote:
> Add support for reading the device serial number, driver name
> and various versions.  Example:
> 
> $ devlink dev info pci/0000:82:00.0
> pci/0000:82:00.0:
>   driver nfp
>   serial_number 16240145
>   versions:
>       fixed:
>         board.id AMDA0081-0001
>         board.rev 15
>         board.vendor SMA
>         board.model hydrogen
>       running:
>         fw.mgmt 010181.010181.0101d4
>         fw.cpld 0x1030000
>         fw.app abm-d372b6
>         fw.undi 0.0.2
>         chip.init AMDA-0081-0001  20160318164536
>       stored:
>         fw.mgmt 010181.010181.0101d4
>         fw.app abm-d372b6
>         fw.undi 0.0.2
>         chip.init AMDA-0081-0001  20160318164536
> 
> $ devlink -jp dev info pci/0000:82:00.0
> {
>     "info": {
>         "pci/0000:82:00.0": {
>             "driver": "nfp",
>             "serial_number": "16240145",
>             "versions": {
>                 "fixed": {
>                     "board.id": "AMDA0081-0001",
>                     "board.rev": "15",
>                     "board.vendor": "SMA",
>                     "board.model": "hydrogen"
>                 },
>                 "running": {
>                     "fw.mgmt": "010181.010181.0101d4",
>                     "fw.cpld": "0x1030000",
>                     "fw.app": "abm-d372b6",
>                     "fw.undi": "0.0.2",
>                     "chip.init": "AMDA-0081-0001  20160318164536"
>                 },
>                 "stored": {
>                     "fw.mgmt": "010181.010181.0101d4",
>                     "fw.app": "abm-d372b6",
>                     "fw.undi": "0.0.2",
>                     "chip.init": "AMDA-0081-0001  20160318164536"
>                 }
>             }
>         }
>     }
> }
> 
> v5:
>  - remove spurious new line.
> v4:
>  - more commit message improvements.
> v3:
>  - show up-to-date output in the commit message.
> v2 (Jiri):
>  - remove filtering;
>  - add example in the commit message.
> RFCv2:
>  - make info subcommand of dev.
> 
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
>  devlink/devlink.c      | 171 +++++++++++++++++++++++++++++++++++++++++
>  man/man8/devlink-dev.8 |  26 +++++++
>  2 files changed, 197 insertions(+)
> 
applied to iproute2-next.
Thanks for adding the example output


^ permalink raw reply

* Re: [PATCH net-next 5/6] net: marvell: neta: add comphy support
From: Russell King - ARM Linux admin @ 2019-02-06 17:23 UTC (permalink / raw)
  To: Maxime Chevallier
  Cc: Andrew Lunn, Gregory Clement, Jason Cooper,
	Kishon Vijay Abraham I, Sebastian Hesselbarth, Thomas Petazzoni,
	devicetree, linux-arm-kernel, netdev, David S. Miller
In-Reply-To: <20190206160928.56fb4808@bootlin.com>

On Wed, Feb 06, 2019 at 04:09:28PM +0100, Maxime Chevallier wrote:
> Hello Russell,
> 
> On Wed, 06 Feb 2019 11:35:07 +0000
> Russell King <rmk+kernel@armlinux.org.uk> wrote:
> 
> 
> >+	if (pp->comphy) {
> >+		enum phy_mode mode = PHY_MODE_INVALID;
> >+
> >+		switch (state->interface) {
> >+		case PHY_INTERFACE_MODE_SGMII:
> >+		case PHY_INTERFACE_MODE_1000BASEX:
> >+			mode = PHY_MODE_SGMII;
> 
> PHY_MODE_SGMII no longer exists, the new PHY_MODE_ETHERNET must be used
> with the correct submode now, so this doesn't build unfortunately.
> 
> >+			break;
> >+		case PHY_INTERFACE_MODE_2500BASEX:
> >+			mode = PHY_MODE_2500SGMII;
> 
> Same here.

Of course, I knew there was some reason I hadn't sent them yet... thanks
for spotting.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
According to speedtest.net: 11.9Mbps down 500kbps up

^ permalink raw reply

* [PATCH 0/3] iw_cxgb4: add support for completing cached SRQ buffers
From: Raju Rangoju @ 2019-02-06 17:24 UTC (permalink / raw)
  To: jgg, davem, linux-rdma; +Cc: netdev, swise, rajur

This series adds support for completing the SRQ buffers that were
fetched but could not be completed by hw due to connection aborts,
also fixes the potential srqidx leak during the connection abort.

This series has both net(cxgb4) and rdma(iw_cxgb4) changes,
and I would request this merge via rdma repo.

I have made sure this series applies cleanly on both net-next
and rdma-for-next and doesn't cause any merge conflicts.

Raju Rangoju (3):
  cxgb4: add tcb flags and tcb rpl struct
  iw_cxgb4: complete the cached SRQ buffers
  iw_cxgb4: fix srqidx leak during connection abort

 drivers/infiniband/hw/cxgb4/cm.c            | 166 ++++++++++++++++++++++++++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h      |   3 +
 drivers/infiniband/hw/cxgb4/t4.h            |   1 +
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |   8 ++
 drivers/net/ethernet/chelsio/cxgb4/t4_tcb.h |  12 ++
 5 files changed, 180 insertions(+), 10 deletions(-)

-- 
2.13.0

^ permalink raw reply

* [PATCH 1/3] cxgb4: add tcb flags and tcb rpl struct
From: Raju Rangoju @ 2019-02-06 17:24 UTC (permalink / raw)
  To: jgg, davem, linux-rdma; +Cc: netdev, swise, rajur
In-Reply-To: <20190206172444.21997-1-rajur@chelsio.com>

This patch adds the tcb flags and structures needed for querying tcb
information.

Signed-off-by: Raju Rangoju <rajur@chelsio.com>
---
 drivers/net/ethernet/chelsio/cxgb4/t4_msg.h |  8 ++++++++
 drivers/net/ethernet/chelsio/cxgb4/t4_tcb.h | 12 ++++++++++++
 2 files changed, 20 insertions(+)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
index c62a0c830705..38dd41eb959e 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_msg.h
@@ -56,6 +56,7 @@ enum {
 	CPL_TX_DATA_ISO	      = 0x1F,
 
 	CPL_CLOSE_LISTSRV_RPL = 0x20,
+	CPL_GET_TCB_RPL       = 0x22,
 	CPL_L2T_WRITE_RPL     = 0x23,
 	CPL_PASS_OPEN_RPL     = 0x24,
 	CPL_ACT_OPEN_RPL      = 0x25,
@@ -688,6 +689,13 @@ struct cpl_get_tcb {
 #define NO_REPLY_V(x) ((x) << NO_REPLY_S)
 #define NO_REPLY_F    NO_REPLY_V(1U)
 
+struct cpl_get_tcb_rpl {
+	union opcode_tid ot;
+	__u8 cookie;
+	__u8 status;
+	__be16 len;
+};
+
 struct cpl_set_tcb_field {
 	WR_HDR;
 	union opcode_tid ot;
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_tcb.h b/drivers/net/ethernet/chelsio/cxgb4/t4_tcb.h
index 3297ce025e8b..1b9afb192f7f 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_tcb.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_tcb.h
@@ -41,6 +41,14 @@
 #define TCB_SMAC_SEL_V(x)	((x) << TCB_SMAC_SEL_S)
 
 #define TCB_T_FLAGS_W		1
+#define TCB_T_FLAGS_S		0
+#define TCB_T_FLAGS_M		0xffffffffffffffffULL
+#define TCB_T_FLAGS_V(x)	((__u64)(x) << TCB_T_FLAGS_S)
+
+#define TCB_RQ_START_W		30
+#define TCB_RQ_START_S		0
+#define TCB_RQ_START_M		0x3ffffffULL
+#define TCB_RQ_START_V(x)	((x) << TCB_RQ_START_S)
 
 #define TF_CCTRL_ECE_S		60
 #define TF_CCTRL_CWR_S		61
@@ -66,4 +74,8 @@
 #define TCB_RX_FRAG3_LEN_RAW_W	29
 #define TCB_RX_FRAG3_START_IDX_OFFSET_RAW_W	30
 #define TCB_PDU_HDR_LEN_W	31
+
+#define TF_RX_PDU_OUT_S		49
+#define TF_RX_PDU_OUT_V(x)	((__u64)(x) << TF_RX_PDU_OUT_S)
+
 #endif /* __T4_TCB_H */
-- 
2.13.0


^ permalink raw reply related

* [PATCH 2/3] iw_cxgb4: complete the cached SRQ buffers
From: Raju Rangoju @ 2019-02-06 17:24 UTC (permalink / raw)
  To: jgg, davem, linux-rdma; +Cc: netdev, swise, rajur
In-Reply-To: <20190206172444.21997-1-rajur@chelsio.com>

If TP fetches an SRQ buffer but ends up not using it before the
connection is aborted, then it passes the index of that SRQ
buffer to the host in ABORT_REQ_RSS or ABORT_RPL CPL message.

But, if the srqidx field is zero in the received ABORT_RPL or
ABORT_REQ_RSS CPL, then we need to read the tcb.rq_start field
to see if it really did have an RQE cached. This works around a
case where HW does not include the srqidx in the
ABORT_RPL/ABORT_REQ_RSS CPL.

The final value of rq_start is the one present in TCB with the
TF_RX_PDU_OUT bit cleared. So, we need to read the TCB, examine
the TF_RX_PDU_OUT (bit 49 of t_flags) in order to determine if
there's a rx PDU feedback event pending.

Signed-off-by: Raju Rangoju <rajur@chelsio.com>
---
 drivers/infiniband/hw/cxgb4/cm.c       | 161 +++++++++++++++++++++++++++++++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   3 +
 drivers/infiniband/hw/cxgb4/t4.h       |   1 +
 3 files changed, 157 insertions(+), 8 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 8221813219e5..7aca5d73c19e 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -655,7 +655,33 @@ static int send_halfclose(struct c4iw_ep *ep)
 	return c4iw_l2t_send(&ep->com.dev->rdev, skb, ep->l2t);
 }
 
-static int send_abort(struct c4iw_ep *ep)
+void read_tcb(struct c4iw_ep *ep)
+{
+	struct sk_buff *skb;
+	struct cpl_get_tcb *req;
+	int wrlen = roundup(sizeof(*req), 16);
+
+	skb = get_skb(NULL, sizeof(*req), GFP_KERNEL);
+	if (WARN_ON(!skb))
+		return;
+
+	set_wr_txq(skb, CPL_PRIORITY_CONTROL, ep->ctrlq_idx);
+	req = (struct cpl_get_tcb *) skb_put(skb, wrlen);
+	memset(req, 0, wrlen);
+	INIT_TP_WR(req, ep->hwtid);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_GET_TCB, ep->hwtid));
+	req->reply_ctrl = htons(REPLY_CHAN_V(0) | QUEUENO_V(ep->rss_qid));
+
+	/*
+	 * keep a ref on the ep so the tcb is not unlocked before this
+	 * cpl completes. The ref is released in read_tcb_rpl().
+	 */
+	c4iw_get_ep(&ep->com);
+	if (WARN_ON(c4iw_ofld_send(&ep->com.dev->rdev, skb)))
+		c4iw_put_ep(&ep->com);
+}
+
+static int send_abort_req(struct c4iw_ep *ep)
 {
 	u32 wrlen = roundup(sizeof(struct cpl_abort_req), 16);
 	struct sk_buff *req_skb = skb_dequeue(&ep->com.ep_skb_list);
@@ -670,6 +696,17 @@ static int send_abort(struct c4iw_ep *ep)
 	return c4iw_l2t_send(&ep->com.dev->rdev, req_skb, ep->l2t);
 }
 
+static int send_abort(struct c4iw_ep *ep)
+{
+	if (!ep->com.qp || !ep->com.qp->srq) {
+		send_abort_req(ep);
+		return 0;
+	}
+	set_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags);
+	read_tcb(ep);
+	return 0;
+}
+
 static int send_connect(struct c4iw_ep *ep)
 {
 	struct cpl_act_open_req *req = NULL;
@@ -1851,14 +1888,11 @@ static int rx_data(struct c4iw_dev *dev, struct sk_buff *skb)
 	return 0;
 }
 
-static void complete_cached_srq_buffers(struct c4iw_ep *ep,
-					__be32 srqidx_status)
+static void complete_cached_srq_buffers(struct c4iw_ep *ep, u32 srqidx)
 {
 	enum chip_type adapter_type;
-	u32 srqidx;
 
 	adapter_type = ep->com.dev->rdev.lldi.adapter_type;
-	srqidx = ABORT_RSS_SRQIDX_G(be32_to_cpu(srqidx_status));
 
 	/*
 	 * If this TCB had a srq buffer cached, then we must complete
@@ -1876,6 +1910,7 @@ static void complete_cached_srq_buffers(struct c4iw_ep *ep,
 
 static int abort_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
 {
+	u32 srqidx;
 	struct c4iw_ep *ep;
 	struct cpl_abort_rpl_rss6 *rpl = cplhdr(skb);
 	int release = 0;
@@ -1887,7 +1922,10 @@ static int abort_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
 		return 0;
 	}
 
-	complete_cached_srq_buffers(ep, rpl->srqidx_status);
+	if (ep->com.qp && ep->com.qp->srq) {
+		srqidx = ABORT_RSS_SRQIDX_G(be32_to_cpu(rpl->srqidx_status));
+		complete_cached_srq_buffers(ep, srqidx ? srqidx : ep->srqe_idx);
+	}
 
 	pr_debug("ep %p tid %u\n", ep, ep->hwtid);
 	mutex_lock(&ep->com.mutex);
@@ -2740,6 +2778,21 @@ static int peer_close(struct c4iw_dev *dev, struct sk_buff *skb)
 	return 0;
 }
 
+static void finish_peer_abort(struct c4iw_dev *dev, struct c4iw_ep *ep)
+{
+	complete_cached_srq_buffers(ep, ep->srqe_idx);
+	if (ep->com.cm_id && ep->com.qp) {
+		struct c4iw_qp_attributes attrs;
+
+		attrs.next_state = C4IW_QP_STATE_ERROR;
+		c4iw_modify_qp(ep->com.qp->rhp, ep->com.qp,
+			       C4IW_QP_ATTR_NEXT_STATE,	&attrs, 1);
+	}
+	peer_abort_upcall(ep);
+	release_ep_resources(ep);
+	c4iw_put_ep(&ep->com);
+}
+
 static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 {
 	struct cpl_abort_req_rss6 *req = cplhdr(skb);
@@ -2750,6 +2803,7 @@ static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 	int release = 0;
 	unsigned int tid = GET_TID(req);
 	u8 status;
+	u32 srqidx;
 
 	u32 len = roundup(sizeof(struct cpl_abort_rpl), 16);
 
@@ -2769,8 +2823,6 @@ static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 		goto deref_ep;
 	}
 
-	complete_cached_srq_buffers(ep, req->srqidx_status);
-
 	pr_debug("ep %p tid %u state %u\n", ep, ep->hwtid,
 		 ep->com.state);
 	set_bit(PEER_ABORT, &ep->com.history);
@@ -2819,6 +2871,23 @@ static int peer_abort(struct c4iw_dev *dev, struct sk_buff *skb)
 		stop_ep_timer(ep);
 		/*FALLTHROUGH*/
 	case FPDU_MODE:
+		if (ep->com.qp && ep->com.qp->srq) {
+			srqidx = ABORT_RSS_SRQIDX_G(
+					be32_to_cpu(req->srqidx_status));
+			if (srqidx) {
+				complete_cached_srq_buffers(ep,
+							    req->srqidx_status);
+			} else {
+				/* Hold ep ref until finish_peer_abort() */
+				c4iw_get_ep(&ep->com);
+				__state_set(&ep->com, ABORTING);
+				set_bit(PEER_ABORT_IN_PROGRESS, &ep->com.flags);
+				read_tcb(ep);
+				break;
+
+			}
+		}
+
 		if (ep->com.cm_id && ep->com.qp) {
 			attrs.next_state = C4IW_QP_STATE_ERROR;
 			ret = c4iw_modify_qp(ep->com.qp->rhp,
@@ -3717,6 +3786,80 @@ static void passive_ofld_conn_reply(struct c4iw_dev *dev, struct sk_buff *skb,
 	return;
 }
 
+static inline u64 t4_tcb_get_field64(__be64 *tcb, u16 word)
+{
+	u64 tlo = be64_to_cpu(tcb[((31 - word) / 2)]);
+	u64 thi = be64_to_cpu(tcb[((31 - word) / 2) - 1]);
+	u64 t;
+	u32 shift = 32;
+
+	t = (thi << shift) | (tlo >> shift);
+
+	return t;
+}
+
+static inline u32 t4_tcb_get_field32(__be64 *tcb, u16 word, u32 mask, u32 shift)
+{
+	u32 v;
+	u64 t = be64_to_cpu(tcb[(31 - word) / 2]);
+
+	if (word & 0x1)
+		shift += 32;
+	v = (t >> shift) & mask;
+	return v;
+}
+
+static int read_tcb_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
+{
+	struct cpl_get_tcb_rpl *rpl = cplhdr(skb);
+	__be64 *tcb = (__be64 *)(rpl + 1);
+	unsigned int tid = GET_TID(rpl);
+	struct c4iw_ep *ep;
+	u64 t_flags_64;
+	u32 rx_pdu_out;
+
+	ep = get_ep_from_tid(dev, tid);
+	if (!ep)
+		return 0;
+	/* Examine the TF_RX_PDU_OUT (bit 49 of the t_flags) in order to
+	 * determine if there's a rx PDU feedback event pending.
+	 *
+	 * If that bit is set, it means we'll need to re-read the TCB's
+	 * rq_start value. The final value is the one present in a TCB
+	 * with the TF_RX_PDU_OUT bit cleared.
+	 */
+
+	t_flags_64 = t4_tcb_get_field64(tcb, TCB_T_FLAGS_W);
+	rx_pdu_out = (t_flags_64 & TF_RX_PDU_OUT_V(1)) >> TF_RX_PDU_OUT_S;
+
+	c4iw_put_ep(&ep->com); /* from get_ep_from_tid() */
+	c4iw_put_ep(&ep->com); /* from read_tcb() */
+
+	/* If TF_RX_PDU_OUT bit is set, re-read the TCB */
+	if (rx_pdu_out) {
+		if (++ep->rx_pdu_out_cnt >= 2) {
+			WARN_ONCE(1, "tcb re-read() reached the guard limit, finishing the cleanup\n");
+			goto cleanup;
+		}
+		read_tcb(ep);
+		return 0;
+	}
+
+	ep->srqe_idx = t4_tcb_get_field32(tcb, TCB_RQ_START_W, TCB_RQ_START_W,
+			TCB_RQ_START_S);
+cleanup:
+	pr_debug("ep %p tid %u %016x\n", ep, ep->hwtid, ep->srqe_idx);
+
+	if (test_bit(PEER_ABORT_IN_PROGRESS, &ep->com.flags))
+		finish_peer_abort(dev, ep);
+	else if (test_bit(ABORT_REQ_IN_PROGRESS, &ep->com.flags))
+		send_abort_req(ep);
+	else
+		WARN_ONCE(1, "unexpected state!");
+
+	return 0;
+}
+
 static int deferred_fw6_msg(struct c4iw_dev *dev, struct sk_buff *skb)
 {
 	struct cpl_fw6_msg *rpl = cplhdr(skb);
@@ -4037,6 +4180,7 @@ static c4iw_handler_func work_handlers[NUM_CPL_CMDS + NUM_FAKE_CPLS] = {
 	[CPL_CLOSE_CON_RPL] = close_con_rpl,
 	[CPL_RDMA_TERMINATE] = terminate,
 	[CPL_FW4_ACK] = fw4_ack,
+	[CPL_GET_TCB_RPL] = read_tcb_rpl,
 	[CPL_FW6_MSG] = deferred_fw6_msg,
 	[CPL_RX_PKT] = rx_pkt,
 	[FAKE_CPL_PUT_EP_SAFE] = _put_ep_safe,
@@ -4268,6 +4412,7 @@ c4iw_handler_func c4iw_handlers[NUM_CPL_CMDS] = {
 	[CPL_RDMA_TERMINATE] = sched,
 	[CPL_FW4_ACK] = sched,
 	[CPL_SET_TCB_RPL] = set_tcb_rpl,
+	[CPL_GET_TCB_RPL] = sched,
 	[CPL_FW6_MSG] = fw6_msg,
 	[CPL_RX_PKT] = sched
 };
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index f0fceadd0d12..3a0923f7c60e 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -982,6 +982,9 @@ struct c4iw_ep {
 	int rcv_win;
 	u32 snd_wscale;
 	struct c4iw_ep_stats stats;
+	u32 srqe_idx;
+	u32 rx_pdu_out_cnt;
+	struct sk_buff *peer_abort_skb;
 };
 
 static inline struct c4iw_ep *to_ep(struct iw_cm_id *cm_id)
diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index fff6d48d262f..b170817b2741 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -35,6 +35,7 @@
 #include "t4_regs.h"
 #include "t4_values.h"
 #include "t4_msg.h"
+#include "t4_tcb.h"
 #include "t4fw_ri_api.h"
 
 #define T4_MAX_NUM_PD 65536
-- 
2.13.0


^ permalink raw reply related

* [PATCH 3/3] iw_cxgb4: fix srqidx leak during connection abort
From: Raju Rangoju @ 2019-02-06 17:24 UTC (permalink / raw)
  To: jgg, davem, linux-rdma; +Cc: netdev, swise, rajur
In-Reply-To: <20190206172444.21997-1-rajur@chelsio.com>

When an application aborts the connection by moving QP from RTS to
ERROR, then iw_cxgb4's modify_rc_qp() RTS->ERROR logic sets the
*srqidxp to 0 via t4_set_wq_in_error(&qhp->wq, 0), and aborts the
connection by calling c4iw_ep_disconnect().

c4iw_ep_disconnect() does the following:
 1. sends up a close_complete_upcall(ep, -ECONNRESET) to libcxgb4.
 2. sends abort request CPL to hw.

But, since the close_complete_upcall() is sent before sending the
ABORT_REQ to hw, libcxgb4 would fail to release the srqidx if the
connection holds one. Because, the srqidx is passed up to libcxgb4
only after corresponding ABORT_RPL is processed by kernel in
abort_rpl().

This patch handle the corner-case by moving the call to
close_complete_upcall() from c4iw_ep_disconnect() to abort_rpl().
So that libcxgb4 is notified about the -ECONNRESET only after
abort_rpl(), and libcxgb4 can relinquish the srqidx properly.

Signed-off-by: Raju Rangoju <rajur@chelsio.com>
---
 drivers/infiniband/hw/cxgb4/cm.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 7aca5d73c19e..f45044d8de4d 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1941,8 +1941,10 @@ static int abort_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
 	}
 	mutex_unlock(&ep->com.mutex);

-	if (release)
+	if (release) {
+		close_complete_upcall(ep, -ECONNRESET);
 		release_ep_resources(ep);
+	}
 	c4iw_put_ep(&ep->com);
 	return 0;
 }
@@ -3675,7 +3677,6 @@ int c4iw_ep_disconnect(struct c4iw_ep *ep, int abrupt, gfp_t gfp)
 	if (close) {
 		if (abrupt) {
 			set_bit(EP_DISC_ABORT, &ep->com.history);
-			close_complete_upcall(ep, -ECONNRESET);
 			ret = send_abort(ep);
 		} else {
 			set_bit(EP_DISC_CLOSE, &ep->com.history);
-- 
2.13.0

^ permalink raw reply related

* Re: [Patch net-next] mlx5: use RCU lock in mlx5_eq_cq_get()
From: Saeed Mahameed @ 2019-02-06 17:35 UTC (permalink / raw)
  To: xiyou.wangcong@gmail.com; +Cc: netdev@vger.kernel.org, Tariq Toukan
In-Reply-To: <CAM_iQpW5GQD8PS1d34tNF5Dfd63pTJDfz_5kJWyqJOOrAMdU7g@mail.gmail.com>

On Wed, 2019-02-06 at 09:15 -0800, Cong Wang wrote:
> On Wed, Feb 6, 2019 at 8:55 AM Saeed Mahameed <saeedm@mellanox.com>
> wrote:
> > Hi Cong,
> > 
> > The patch is ok to me, but i really doubt that you can hit a
> > contention
> > on latest upstream driver, since we already have spinlock per EQ,
> > which
> > means spinlock per core,  each EQ (core) msix handler can only
> > access
> > one spinlock (its own), so I am surprised how you got the
> > contention,
> > Maybe you are not running on latest upstream driver ?
> 
> We are running 4.14 stable release. Which commit changes the game
> here? We can consider to backport it unless it is complicated.
> 

Ok, so there is no issue upstream, you are just missing the following
patch:

commit 02d92f7903647119e125b24f5470f96cee0d4b4b
Author: Saeed Mahameed <saeedm@mellanox.com>
Date:   Fri Jan 19 16:13:01 2018 -0800

    net/mlx5: CQ Database per EQ
    
    Before this patch the driver had one CQ database protected via one
    spinlock, this spinlock is meant to synchronize between CQ
    adding/removing and CQ IRQ interrupt handling.
[...]

> Also, if you don't like this patch, we are happy to carry it for our
> own,
> sometimes it isn't worth the time to push into upstream.

I Do like it and it always worth it to push upstream, we all get to
learn cool new stuff.

> 
> > what is the workload ?
> > 
> 
> It's a memcached RPC performance test, that is all I can tell.

cool, thanks, so the missing commit should fix your issue.

> (Apparently I have almost zero knowledge about memcached.)
> 
> 
> > > > In fact, radix_tree_lookup() is perfectly fine with RCU read
> > > > lock,
> > > > we don't have to take a spinlock on this hot path. It is pretty
> > > > much
> > > > similar to commit 291c566a2891
> > > > ("net/mlx4_core: Fix racy CQ (Completion Queue) free"). Slow
> > > > paths
> > > > are still serialized with the spinlock, and with
> > > > synchronize_irq()
> > > > it should be safe to just move the fast path to RCU read lock.
> > > > 
> > > > This patch itself reduces the latency by about 50% with our
> > > > workload.
> > > > 
> > > > Cc: Saeed Mahameed <saeedm@mellanox.com>
> > > > Cc: Tariq Toukan <tariqt@mellanox.com>
> > > > Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
> > > > ---
> > > >   drivers/net/ethernet/mellanox/mlx5/core/eq.c | 12 ++++++-----
> > > > -
> > > >   1 file changed, 6 insertions(+), 6 deletions(-)
> > > > 
> > > > diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > > b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > > index ee04aab65a9f..7092457705a2 100644
> > > > --- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > > +++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
> > > > @@ -114,11 +114,11 @@ static struct mlx5_core_cq
> > > > *mlx5_eq_cq_get(struct mlx5_eq *eq, u32 cqn)
> > > >     struct mlx5_cq_table *table = &eq->cq_table;
> > > >     struct mlx5_core_cq *cq = NULL;
> > > > 
> > > > -   spin_lock(&table->lock);
> > > > +   rcu_read_lock();
> > > >     cq = radix_tree_lookup(&table->tree, cqn);
> > > >     if (likely(cq))
> > > >             mlx5_cq_hold(cq);
> > > > -   spin_unlock(&table->lock);
> > > > +   rcu_read_unlock();
> > > 
> > > Thanks for you patch.
> > > 
> > > I think we can improve it further, by taking the if statement out
> > > of
> > > the
> > > critical section.
> > > 
> > 
> > No, mlx5_cq_hold must stay under RCU read, otherwise cq might get
> > freed
> > before the irq gets a change to increment ref count on it.
> > 
> 
> Agreed.
> 

Cool, I will ack the patch.. 

> 
> Thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox