Netdev List


* Re: [PATCH bpf] bpf: fix several offset tests in bpf_msg_pull_data
From: Alexei Starovoitov @ 2018-08-29  5:28 UTC
  To: Daniel Borkmann; +Cc: john.fastabend, netdev
In-Reply-To: <20180828141535.11727-1-daniel@iogearbox.net>

On Tue, Aug 28, 2018 at 04:15:35PM +0200, Daniel Borkmann wrote:
> While recently going over bpf_msg_pull_data(), I noticed three
> issues which are fixed in here:
> 
> 1) When we attempt to find the first scatterlist element (sge)
>    for the start offset, we add len to the offset before we check
>    for start < offset + len, whereas it should come after when
>    we iterate to the next sge to accumulate the offsets. For
>    example, given a start offset of 12 with a sge length of 8
>    for the first sge in the list would lead us to determine this
>    sge as the first sge thinking it covers first 16 bytes where
>    start is located, whereas start sits in subsequent sges so
>    we would end up pulling in the wrong data.
> 
> 2) After figuring out the starting sge, we have a short-cut test
>    in !msg->sg_copy[i] && bytes <= len. This checks whether it's
>    not needed to make the page at the sge private where we can
>    just exit by updating msg->data and msg->data_end. However,
>    the length test is not fully correct. bytes <= len checks
>    whether the requested bytes (end - start offsets) fit into the
>    sge's length. The part that is missing is that start must not
>    be sge length aligned. Meaning, the start offset into the sge
>    needs to be accounted as well on top of the requested bytes
>    as otherwise we can access the sge out of bounds. For example
>    the sge could have length of 8, our requested bytes could have
>    length of 8, but at a start offset of 4, so we also would need
>    to pull in 4 bytes of the next sge, when we jump to the out
>    label we do set msg->data to sg_virt(&sg[i]) + start - offset
>    and msg->data_end to msg->data + bytes which would be oob.
> 
> 3) The subsequent bytes < copy test for finding the last sge has
>    the same issue as in point 2) but also it tests for less than
>    rather than less or equal to. Meaning if the sge length is of
>    8 and requested bytes of 8 while having the start aligned with
>    the sge, we would unnecessarily go and pull in the next sge as
>    well to make it private.
> 
> Fixes: 015632bb30da ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Acked-by: John Fastabend <john.fastabend@gmail.com>

Applied to bpf tree, Thanks
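
For readers following the thread, the offset handling described in points
1) to 3) can be modelled in userspace roughly as below. This is only a
sketch built from the commit message, not the kernel code: the helper
names find_first_sge() and fits_in_sge() are hypothetical, and the sge
list is reduced to a plain array of lengths.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: return the index of the sge that contains the
 * absolute byte offset 'start', or -1 if 'start' is past the list.
 * On success *sge_off holds the absolute offset where that sge begins.
 */
int find_first_sge(const unsigned int *sge_len, size_t n,
		   unsigned int start, unsigned int *sge_off)
{
	unsigned int offset = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		/* Point 1: test first, accumulate only when moving on. */
		if (start < offset + sge_len[i]) {
			*sge_off = offset;
			return (int)i;
		}
		offset += sge_len[i];
	}
	return -1;
}

/* Hypothetical helper for points 2 and 3: the request fits into a single
 * sge only if the start offset *into* the sge plus the requested bytes
 * does not exceed the sge length. Note the "<=": an exact fit needs no
 * further sge.
 */
bool fits_in_sge(unsigned int start, unsigned int sge_off,
		 unsigned int bytes, unsigned int len)
{
	return (start - sge_off) + bytes <= len;
}
```

With three sges of length 8, a start offset of 12 resolves to the second
sge (index 1, beginning at offset 8). A request of 8 bytes from there
does not fit in that sge alone, while an aligned request of 8 bytes does.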


* Re: [PATCH 2/5] net: mvneta: fix the wrong function to unmap rx buf
From: Gregory CLEMENT @ 2018-08-29  9:21 UTC
  To: Jisheng Zhang
  Cc: thomas.petazzoni, David S. Miller, netdev, linux-kernel,
	linux-arm-kernel, Andrew Lunn
In-Reply-To: <20180829162751.018acbb6@xhacker.debian>

Hi Jisheng,
 
 On Wed, Aug 29 2018, Jisheng Zhang <Jisheng.Zhang@synaptics.com> wrote:

> Commit 7e47fd84b56b ("net: mvneta: Allocate page for the descriptor")
> always allocates one page for each rx descriptor, so the rx buffer is
> now mapped with dma_map_page(), but the unmap routine wasn't updated
> at the same time.
>
> Fix this by using dma_unmap_page() in corresponding places.
>
> Fixes: 7e47fd84b56b ("net: mvneta: Allocate page for the descriptor")
> Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 14 +++++++-------
>  1 file changed, 7 insertions(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index 0ce94f6587a5..d9206094fce3 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -1890,8 +1890,9 @@ static void mvneta_rxq_drop_pkts(struct mvneta_port *pp,
>  		if (!data || !(rx_desc->buf_phys_addr))
>  			continue;
>  
> -		dma_unmap_single(pp->dev->dev.parent, rx_desc->buf_phys_addr,
> -				 MVNETA_RX_BUF_SIZE(pp->pkt_size), DMA_FROM_DEVICE);
> +		dma_unmap_page(pp->dev->dev.parent, rx_desc->buf_phys_addr,
> +			       MVNETA_RX_BUF_SIZE(pp->pkt_size),
> +			       DMA_FROM_DEVICE);
>  		__free_page(data);
>  	}
>  }
This one can also be called when the buffers were allocated with HWBM,
which in that case uses dma_map_single().

Gregory



> @@ -2008,8 +2009,8 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
>  				skb_add_rx_frag(rxq->skb, frag_num, page,
>  						frag_offset, frag_size,
>  						PAGE_SIZE);
> -				dma_unmap_single(dev->dev.parent, phys_addr,
> -						 PAGE_SIZE, DMA_FROM_DEVICE);
> +				dma_unmap_page(dev->dev.parent, phys_addr,
> +					       PAGE_SIZE, DMA_FROM_DEVICE);
>  				rxq->left_size -= frag_size;
>  			}
>  		} else {
> @@ -2039,9 +2040,8 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
>  						frag_offset, frag_size,
>  						PAGE_SIZE);
>  
> -				dma_unmap_single(dev->dev.parent, phys_addr,
> -						 PAGE_SIZE,
> -						 DMA_FROM_DEVICE);
> +				dma_unmap_page(dev->dev.parent, phys_addr,
> +					       PAGE_SIZE, DMA_FROM_DEVICE);
>  
>  				rxq->left_size -= frag_size;
>  			}
> -- 
> 2.18.0
>
>
> _______________________________________________
> linux-arm-kernel mailing list
> linux-arm-kernel@lists.infradead.org
> http://lists.infradead.org/mailman/listinfo/linux-arm-kernel

-- 
Gregory Clement, Bootlin
Embedded Linux and Kernel engineering
http://bootlin.com


* Re: [PATCH 1/5] net: mvneta: fix rx_offset_correction set and usage
From: Jisheng Zhang @ 2018-08-29  9:16 UTC
  To: Gregory CLEMENT
  Cc: Andrew Lunn, netdev, linux-kernel, thomas.petazzoni,
	David S. Miller, linux-arm-kernel
In-Reply-To: <87efehk0eu.fsf@bootlin.com>

Hi,

On Wed, 29 Aug 2018 11:05:45 +0200 Gregory CLEMENT wrote:

> Hi Jisheng,
>  
>  On Wed, Aug 29 2018, Jisheng Zhang <Jisheng.Zhang@synaptics.com> wrote:
> 
> > The rx_offset_correction is the RX packet offset correction for
> > platforms. It is not related to SW BM; it is only related to the
> > platform's NET_SKB_PAD.
> >  
> 
> But if I understood correctly, the value of rx_offset_correction only
> has an influence when we use HW BM.

rx_offset_correction was introduced by commit 8d5047cf9ca2 ("net: mvneta:
Convert to be 64 bits compatible") to support mvneta on 64-bit
platforms such as the Armada 3700. It is not related to HW BM.

> 
> However since d93277b9839b ("Revert "arm64: Increase the max granular
> size""), NET_SKB_PAD is 64 for arm64, so in the end rx_offset_correction
> is always 0 for recent kernels.

Yes, I mentioned this in the email "[query] about recent mvneta patches".
IMHO we had better not rely on the platform's L1_CACHE_BYTES value; we
don't know whether the max granular size will be increased again in the future.
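
To make the dependency on L1_CACHE_BYTES concrete, here is a small
userspace sketch of the arithmetic from the patch. It rests on two
assumptions that are not stated in this thread: that NET_SKB_PAD is
defined as max(32, L1_CACHE_BYTES), and that
MVNETA_RX_PKT_OFFSET_CORRECTION is 64. The function names are
hypothetical and exist only for this illustration.

```c
/* Assumed value of the driver constant; not confirmed in this thread. */
#define MVNETA_RX_PKT_OFFSET_CORRECTION 64

int max_int(int a, int b)
{
	return a > b ? a : b;
}

/* Assumption: NET_SKB_PAD is max(32, L1_CACHE_BYTES). */
int net_skb_pad(int l1_cache_bytes)
{
	return max_int(32, l1_cache_bytes);
}

/* Mirrors max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION). */
int rx_offset_correction(int l1_cache_bytes)
{
	return max_int(0, net_skb_pad(l1_cache_bytes) -
			  MVNETA_RX_PKT_OFFSET_CORRECTION);
}

/* Value passed to mvneta_rxq_offset_set() in the patch:
 * NET_SKB_PAD - rx_offset_correction.
 */
int rxq_offset(int l1_cache_bytes)
{
	return net_skb_pad(l1_cache_bytes) -
	       rx_offset_correction(l1_cache_bytes);
}
```

Under these assumptions the correction is 0 for L1_CACHE_BYTES of 32 or
64 and becomes 64 only when L1_CACHE_BYTES is 128, so the programmed
offset never exceeds 64 bytes.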

Thanks

> 
> Gregory
> 
> 
> > Fix the issue by reverting to the original behavior.
> >
> > Fixes: 562e2f467e71 ("net: mvneta: Improve the buffer allocation method for SWBM")
> > Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
> > ---
> >  drivers/net/ethernet/marvell/mvneta.c | 24 ++++++++++--------------
> >  1 file changed, 10 insertions(+), 14 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> > index bc80a678abc3..0ce94f6587a5 100644
> > --- a/drivers/net/ethernet/marvell/mvneta.c
> > +++ b/drivers/net/ethernet/marvell/mvneta.c
> > @@ -2899,21 +2899,18 @@ static void mvneta_rxq_hw_init(struct mvneta_port *pp,
> >  	mvreg_write(pp, MVNETA_RXQ_BASE_ADDR_REG(rxq->id), rxq->descs_phys);
> >  	mvreg_write(pp, MVNETA_RXQ_SIZE_REG(rxq->id), rxq->size);
> >  
> > +	/* Set Offset */
> > +	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD - pp->rx_offset_correction);
> > +
> >  	/* Set coalescing pkts and time */
> >  	mvneta_rx_pkts_coal_set(pp, rxq, rxq->pkts_coal);
> >  	mvneta_rx_time_coal_set(pp, rxq, rxq->time_coal);
> >  
> >  	if (!pp->bm_priv) {
> > -		/* Set Offset */
> > -		mvneta_rxq_offset_set(pp, rxq, 0);
> >  		mvneta_rxq_buf_size_set(pp, rxq, pp->frag_size);
> >  		mvneta_rxq_bm_disable(pp, rxq);
> >  		mvneta_rxq_fill(pp, rxq, rxq->size);
> >  	} else {
> > -		/* Set Offset */
> > -		mvneta_rxq_offset_set(pp, rxq,
> > -				      NET_SKB_PAD - pp->rx_offset_correction);
> > -
> >  		mvneta_rxq_bm_enable(pp, rxq);
> >  		/* Fill RXQ with buffers from RX pool */
> >  		mvneta_rxq_long_pool_set(pp, rxq);
> > @@ -4547,7 +4544,13 @@ static int mvneta_probe(struct platform_device *pdev)
> >  	SET_NETDEV_DEV(dev, &pdev->dev);
> >  
> >  	pp->id = global_port_id++;
> > -	pp->rx_offset_correction = 0; /* not relevant for SW BM */
> > +
> > +	/* Set RX packet offset correction for platforms, whose
> > +	 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
> > +	 * platforms and 0B for 32-bit ones.
> > +	 */
> > +	pp->rx_offset_correction =
> > +		max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION);
> >  
> >  	/* Obtain access to BM resources if enabled and already initialized */
> >  	bm_node = of_parse_phandle(dn, "buffer-manager", 0);
> > @@ -4562,13 +4565,6 @@ static int mvneta_probe(struct platform_device *pdev)
> >  				pp->bm_priv = NULL;
> >  			}
> >  		}
> > -		/* Set RX packet offset correction for platforms, whose
> > -		 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
> > -		 * platforms and 0B for 32-bit ones.
> > -		 */
> > -		pp->rx_offset_correction = max(0,
> > -					       NET_SKB_PAD -
> > -					       MVNETA_RX_PKT_OFFSET_CORRECTION);
> >  	}
> >  	of_node_put(bm_node);
> >  
> > -- 
> > 2.18.0
> >  
> 




* Re: [RFC RFT PATCH 0/4] gpiolib: speed up GPIO array processing
From: Linus Walleij @ 2018-08-29  9:06 UTC
  To: Janusz Krzysztofik
  Cc: Jonathan Corbet, Miguel Ojeda Sandonis, Peter Korsgaard,
	Peter Rosin, Ulf Hansson, Andrew Lunn, Florian Fainelli,
	David S. Miller, Dominik Brodowski, kishon, Lars-Peter Clausen,
	Michael Hennerich, Jonathan Cameron, Hartmut Knaack,
	Peter Meerwald, Greg KH, Jiri Slaby, open list:GPIO SUBSYSTEM,
	linux-doc, linux-i2c
In-Reply-To: <20180820234341.5271-1-jmkrzyszt@gmail.com>

On Tue, Aug 21, 2018 at 1:42 AM Janusz Krzysztofik <jmkrzyszt@gmail.com> wrote:

> This series is a follow-up to the former "mtd: rawnand: ams-delta: Use
> gpio-omap accessors for data I/O", which already contained some changes
> to gpiolib.  Those previous attempts were commented on by Boris Brezillon,
> who suggested using a GPIO API modified to accept bitmaps, and by Linus
> Walleij, who suggested still more great ideas for further improvement
> of the proposed API changes - thanks!
>
> The goal is to boost the performance of the get/set array functions while
> processing GPIO arrays which represent pins of a single chip in
> hardware order.  If the resulting performance is close to PIO, the GPIO API
> can be used for data I/O without much loss of speed.

Hands down, this is a very pretty patch set. I'm a big fan already.

This is mainly because it fulfills the requirement for libraries
to be narrow and deep, which is what we want.
This refers to John Ousterhout's software design philosophy;
here is a great lecture if you haven't seen it already:
https://www.youtube.com/watch?v=bmSAYlu0NcY

Let's get this into v1, get some testing and merge it for v4.20
ASAP so we get some proper testing before the v4.20 merge
window. It would be excellent if some of the current users of
the array API could provide Tested-by tags or at least ACKs.

For example ts-nbus.c must be a big beneficiary.

Yours,
Linus Walleij


* Re: [PATCH 1/5] net: mvneta: fix rx_offset_correction set and usage
From: Gregory CLEMENT @ 2018-08-29  9:05 UTC
  To: Jisheng Zhang
  Cc: thomas.petazzoni, David S. Miller, netdev, linux-kernel,
	Andrew Lunn, linux-arm-kernel
In-Reply-To: <20180829162706.24111f9c@xhacker.debian>

Hi Jisheng,
 
 On Wed, Aug 29 2018, Jisheng Zhang <Jisheng.Zhang@synaptics.com> wrote:

> The rx_offset_correction is the RX packet offset correction for
> platforms. It is not related to SW BM; it is only related to the
> platform's NET_SKB_PAD.
>

But if I understood correctly, the value of rx_offset_correction only
has an influence when we use HW BM.

However since d93277b9839b ("Revert "arm64: Increase the max granular
size""), NET_SKB_PAD is 64 for arm64, so in the end rx_offset_correction
is always 0 for recent kernels.

Gregory


> Fix the issue by reverting to the original behavior.
>
> Fixes: 562e2f467e71 ("net: mvneta: Improve the buffer allocation method for SWBM")
> Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
> ---
>  drivers/net/ethernet/marvell/mvneta.c | 24 ++++++++++--------------
>  1 file changed, 10 insertions(+), 14 deletions(-)
>
> diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
> index bc80a678abc3..0ce94f6587a5 100644
> --- a/drivers/net/ethernet/marvell/mvneta.c
> +++ b/drivers/net/ethernet/marvell/mvneta.c
> @@ -2899,21 +2899,18 @@ static void mvneta_rxq_hw_init(struct mvneta_port *pp,
>  	mvreg_write(pp, MVNETA_RXQ_BASE_ADDR_REG(rxq->id), rxq->descs_phys);
>  	mvreg_write(pp, MVNETA_RXQ_SIZE_REG(rxq->id), rxq->size);
>  
> +	/* Set Offset */
> +	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD - pp->rx_offset_correction);
> +
>  	/* Set coalescing pkts and time */
>  	mvneta_rx_pkts_coal_set(pp, rxq, rxq->pkts_coal);
>  	mvneta_rx_time_coal_set(pp, rxq, rxq->time_coal);
>  
>  	if (!pp->bm_priv) {
> -		/* Set Offset */
> -		mvneta_rxq_offset_set(pp, rxq, 0);
>  		mvneta_rxq_buf_size_set(pp, rxq, pp->frag_size);
>  		mvneta_rxq_bm_disable(pp, rxq);
>  		mvneta_rxq_fill(pp, rxq, rxq->size);
>  	} else {
> -		/* Set Offset */
> -		mvneta_rxq_offset_set(pp, rxq,
> -				      NET_SKB_PAD - pp->rx_offset_correction);
> -
>  		mvneta_rxq_bm_enable(pp, rxq);
>  		/* Fill RXQ with buffers from RX pool */
>  		mvneta_rxq_long_pool_set(pp, rxq);
> @@ -4547,7 +4544,13 @@ static int mvneta_probe(struct platform_device *pdev)
>  	SET_NETDEV_DEV(dev, &pdev->dev);
>  
>  	pp->id = global_port_id++;
> -	pp->rx_offset_correction = 0; /* not relevant for SW BM */
> +
> +	/* Set RX packet offset correction for platforms, whose
> +	 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
> +	 * platforms and 0B for 32-bit ones.
> +	 */
> +	pp->rx_offset_correction =
> +		max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION);
>  
>  	/* Obtain access to BM resources if enabled and already initialized */
>  	bm_node = of_parse_phandle(dn, "buffer-manager", 0);
> @@ -4562,13 +4565,6 @@ static int mvneta_probe(struct platform_device *pdev)
>  				pp->bm_priv = NULL;
>  			}
>  		}
> -		/* Set RX packet offset correction for platforms, whose
> -		 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
> -		 * platforms and 0B for 32-bit ones.
> -		 */
> -		pp->rx_offset_correction = max(0,
> -					       NET_SKB_PAD -
> -					       MVNETA_RX_PKT_OFFSET_CORRECTION);
>  	}
>  	of_node_put(bm_node);
>  
> -- 
> 2.18.0
>

-- 
Gregory Clement, Bootlin
Embedded Linux and Kernel engineering
http://bootlin.com


* Re: [PATCH 0/5] net: mvneta: some bug fix and trivial improvement
From: Jisheng Zhang @ 2018-08-29  8:51 UTC
  To: thomas.petazzoni, David S. Miller
  Cc: netdev, linux-kernel, Andrew Lunn, Gregory CLEMENT,
	linux-arm-kernel, Yelena Krivosheev
In-Reply-To: <20180829164024.41e8439d@xhacker.debian>

On Wed, 29 Aug 2018 16:40:24 +0800 Jisheng Zhang wrote:

> On Wed, 29 Aug 2018 16:25:57 +0800
> Jisheng Zhang wrote:
> 
> > patch1 fixes how rx_offset_correction is set and used. The
> > rx_offset_correction is the RX packet offset correction for platforms;
> > it is not related to SW BM, only to the platform's NET_SKB_PAD.
> > 
> > patch2 fixes the wrong function to unmap rx buf  
> 
> I have a question about the following two commits:
> 
> 7e47fd84b56b ("net: mvneta: Allocate page for the descriptor") causes
> waste. For a normal 1500 MTU, before this patch we allocated 1920 bytes
> for each rx buffer; after this patch we always allocate PAGE_SIZE bytes.
> If PAGE_SIZE=4096, we waste 53% of the memory for each rx buf. I'm not
> sure whether the performance improvement is worth the cost.
> 
> 562e2f467e71 ("net: mvneta: Improve the buffer allocation method for SWBM")
> mentions that "With system having a small memory (around 256MB), the state
> "cannot allocate memory to refill with new buffer" is reach pretty quickly".
> Is that due to the memory waste described above? Anyway, this commit is
> meant to improve the situation on small-memory systems, so should we first
> revert commit 7e47fd84b56b ("net: mvneta: Allocate page for the descriptor")?
> 

If the maintainers decide to revert the two commits 7e47fd84b56b and
562e2f467e71, then patches 1, 2 and 3 become unnecessary and can be
dropped. Only patch4 and patch5 would still be useful.

Thanks

> Any comments are welcome!
> 
> Thanks
> 
> 
> > 
> > patch3 removes the NETIF_F_GRO check ourself, because the net subsystem
> > will handle it for us.
> > 
> > patch4 enables NETIF_F_RXCSUM by default, since the driver and HW
> > supports the feature.
> > 
> > patch5 is a trivial optimization, to reduce smp_processor_id() calling
> > in mvneta_tx_done_gbe.
> > 
> > Jisheng Zhang (5):
> >   net: mvneta: fix rx_offset_correction set and usage
> >   net: mvneta: fix the wrong function to unmap rx buf
> >   net: mvneta: Don't check NETIF_F_GRO ourself
> >   net: mvneta: enable NETIF_F_RXCSUM by default
> >   net: mvneta: reduce smp_processor_id() calling in mvneta_tx_done_gbe
> > 
> >  drivers/net/ethernet/marvell/mvneta.c | 49 ++++++++++++---------------
> >  1 file changed, 22 insertions(+), 27 deletions(-)
> >   
> 


* Re: [PATCH net-next] esp: remove redundant define esph
From: Steffen Klassert @ 2018-08-29  8:41 UTC
  To: Haishuang Yan; +Cc: Herbert Xu, David S. Miller, netdev, linux-kernel
In-Reply-To: <1534492260-2639-1-git-send-email-yanhaishuang@cmss.chinamobile.com>

On Fri, Aug 17, 2018 at 03:51:00PM +0800, Haishuang Yan wrote:
> The pointer 'esph' is defined but never used, hence it is redundant
> and can be removed.
> 
> Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>

Applied to ipsec-next, thanks!


* Re: [PATCH 0/5] net: mvneta: some bug fix and trivial improvement
From: Jisheng Zhang @ 2018-08-29  8:40 UTC
  To: thomas.petazzoni, David S. Miller
  Cc: netdev, linux-kernel, Andrew Lunn, Gregory CLEMENT,
	linux-arm-kernel, Yelena Krivosheev
In-Reply-To: <20180829162456.2bd69796@xhacker.debian>

On Wed, 29 Aug 2018 16:25:57 +0800
Jisheng Zhang <Jisheng.Zhang@synaptics.com> wrote:

> patch1 fixes how rx_offset_correction is set and used. The
> rx_offset_correction is the RX packet offset correction for platforms;
> it is not related to SW BM, only to the platform's NET_SKB_PAD.
> 
> patch2 fixes the wrong function to unmap rx buf

I have a question about the following two commits:

7e47fd84b56b ("net: mvneta: Allocate page for the descriptor") causes
waste. For a normal 1500 MTU, before this patch we allocated 1920 bytes
for each rx buffer; after this patch we always allocate PAGE_SIZE bytes.
If PAGE_SIZE=4096, we waste 53% of the memory for each rx buf. I'm not
sure whether the performance improvement is worth the cost.
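
The 53% figure follows from simple arithmetic, sketched below. The helper
name rx_buf_waste_percent() is hypothetical and only used here to show
the calculation; the 1920 and 4096 byte figures are the ones quoted above.

```c
/* Hypothetical helper: percentage of an allocation that is left unused
 * when only 'used_bytes' of 'alloc_bytes' are needed. The result is
 * truncated to a whole percent by integer division.
 */
unsigned int rx_buf_waste_percent(unsigned int used_bytes,
				  unsigned int alloc_bytes)
{
	if (alloc_bytes == 0 || used_bytes >= alloc_bytes)
		return 0;

	return (alloc_bytes - used_bytes) * 100 / alloc_bytes;
}
```

For 1920 bytes used out of a 4096 byte page this gives
(4096 - 1920) * 100 / 4096 = 53. The share grows with the page size: for
a hypothetical 64 KiB page the same formula gives 97.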

562e2f467e71 ("net: mvneta: Improve the buffer allocation method for SWBM")
mentions that "With system having a small memory (around 256MB), the state
"cannot allocate memory to refill with new buffer" is reach pretty quickly".
Is that due to the memory waste described above? Anyway, this commit is
meant to improve the situation on small-memory systems, so should we first
revert commit 7e47fd84b56b ("net: mvneta: Allocate page for the descriptor")?

Any comments are welcome!

Thanks

> 
> patch3 removes the NETIF_F_GRO check ourself, because the net subsystem
> will handle it for us.
> 
> patch4 enables NETIF_F_RXCSUM by default, since the driver and HW
> supports the feature.
> 
> patch5 is a trivial optimization, to reduce smp_processor_id() calling
> in mvneta_tx_done_gbe.
> 
> Jisheng Zhang (5):
>   net: mvneta: fix rx_offset_correction set and usage
>   net: mvneta: fix the wrong function to unmap rx buf
>   net: mvneta: Don't check NETIF_F_GRO ourself
>   net: mvneta: enable NETIF_F_RXCSUM by default
>   net: mvneta: reduce smp_processor_id() calling in mvneta_tx_done_gbe
> 
>  drivers/net/ethernet/marvell/mvneta.c | 49 ++++++++++++---------------
>  1 file changed, 22 insertions(+), 27 deletions(-)
> 


* [PATCH net-next v1] selftests/tls: Add test for recv(PEEK) spanning across multiple records
From: Vakul Garg @ 2018-08-29 10:00 UTC
  To: netdev; +Cc: borisp, aviadye, davejwatson, davem, Vakul Garg

Added a test case to receive multiple records with a single recvmsg()
operation with MSG_PEEK set.
---
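
As a rough illustration of the MSG_PEEK behaviour the test relies on,
here is a userspace sketch that can be run without kTLS. It uses a plain
AF_UNIX stream socketpair, so there are no TLS records involved; it only
shows that data returned by a peek is still there for the next read. The
function name peek_does_not_consume() is hypothetical.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send two chunks, peek at them, then read them for real.
 * Returns 0 if the peeked bytes are delivered again by the normal read
 * and the full string arrives intact, -1 otherwise.
 */
int peek_does_not_consume(void)
{
	const char *first = "test_read_peek";
	const char *second = "_mult_recs";
	const char *whole = "test_read_peek_mult_recs";
	size_t total = strlen(whole) + 1;	/* include the trailing NUL */
	char peeked[64] = { 0 };
	char data[64] = { 0 };
	ssize_t n_peek, n_read;
	int sv[2];
	int ret = -1;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;

	if (send(sv[0], first, strlen(first), 0) != (ssize_t)strlen(first))
		goto out;
	if (send(sv[0], second, strlen(second) + 1, 0) !=
	    (ssize_t)(strlen(second) + 1))
		goto out;

	/* Peek: copies data out but leaves it queued on the socket. */
	n_peek = recv(sv[1], peeked, sizeof(peeked), MSG_PEEK);
	if (n_peek <= 0)
		goto out;

	/* Normal read: must see the same bytes again, from the start. */
	n_read = recv(sv[1], data, total, MSG_WAITALL);
	if (n_read != (ssize_t)total)
		goto out;

	if (memcmp(peeked, data, (size_t)n_peek) == 0 &&
	    strcmp(data, whole) == 0)
		ret = 0;
out:
	close(sv[0]);
	close(sv[1]);
	return ret;
}
```

The sketch deliberately does not assert how many bytes a single peek
returns, since that depends on the socket type; the selftest below checks
that for TLS sockets.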
 tools/testing/selftests/net/tls.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/tools/testing/selftests/net/tls.c b/tools/testing/selftests/net/tls.c
index b3ebf2646e52..07daff076ce0 100644
--- a/tools/testing/selftests/net/tls.c
+++ b/tools/testing/selftests/net/tls.c
@@ -502,6 +502,28 @@ TEST_F(tls, recv_peek_multiple)
 	EXPECT_EQ(memcmp(test_str, buf, send_len), 0);
 }
 
+TEST_F(tls, recv_peek_large_buf_mult_recs)
+{
+	char const *test_str = "test_read_peek_mult_recs";
+	char const *test_str_first = "test_read_peek";
+	char const *test_str_second = "_mult_recs";
+	int len;
+	char buf[64];
+
+	len = strlen(test_str_first);
+	EXPECT_EQ(send(self->fd, test_str_first, len, 0), len);
+
+	len = strlen(test_str_second) + 1;
+	EXPECT_EQ(send(self->fd, test_str_second, len, 0), len);
+
+	len = sizeof(buf);
+	memset(buf, 0, len);
+	EXPECT_NE(recv(self->cfd, buf, len, MSG_PEEK), -1);
+
+	len = strlen(test_str) + 1;
+	EXPECT_EQ(memcmp(test_str, buf, len), 0);
+}
+
 TEST_F(tls, pollin)
 {
 	char const *test_str = "test_poll";
-- 
2.13.6


* [PATCH net-next v2] net/tls: Add support for async decryption of tls records
From: Vakul Garg @ 2018-08-29  9:56 UTC
  To: netdev; +Cc: borisp, aviadye, davejwatson, davem, Vakul Garg

When tls records are decrypted using asynchronous accelerators such as
the NXP CAAM engine, the crypto apis return -EINPROGRESS. Presently, on
getting -EINPROGRESS, the tls record processing stops until the
crypto accelerator finishes and returns the result. This incurs a
context switch and is not an efficient way of accessing the crypto
accelerators. Crypto accelerators work efficiently when they are queued
with multiple crypto jobs without having to wait for the previous ones
to complete.

The patch submits multiple crypto requests without having to wait for
previous ones to complete. This has been implemented for records
which are decrypted in zero-copy mode. At the end of recvmsg(), we wait
for all the asynchronous decryption requests to complete.

The references to records which have been sent for async decryption are
dropped. For cases where record decryption is not possible in zero-copy
mode, asynchronous decryption is not used and we wait for decryption
crypto api to complete.

For crypto requests executing in async fashion, the memory for
aead_request, sglists, skb etc. is freed from the decryption
completion handler. The decryption completion handler wakes up the
sleeping user context when recvmsg() flags that it has finished sending
all the decryption requests and there are no more decryption requests
pending to be completed.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Reviewed-by: Dave Watson <davejwatson@fb.com>
---

Changes since v1:
	- Simplified recvmsg() so as to drop the reference to the skb in case it
	  was submitted for async decryption.
	- Modified tls_sw_advance_skb() to handle case when input skb is
	  NULL.
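
The bookkeeping between recvmsg() and the completion handler can be
modelled in userspace roughly as follows. This is a simplified sketch,
not the kernel code: struct rx_model and its functions are hypothetical,
C11 atomics stand in for atomic_t and READ_ONCE/WRITE_ONCE, and the
'woken' flag stands in for complete() on the async_wait completion.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical model of the rx context fields added by the patch. */
struct rx_model {
	atomic_int decrypt_pending;	/* submitted, not yet completed */
	atomic_bool async_notify;	/* receiver has stopped submitting */
	atomic_bool woken;		/* stands in for complete() */
};

void rx_model_init(struct rx_model *m)
{
	atomic_init(&m->decrypt_pending, 0);
	atomic_init(&m->async_notify, false);
	atomic_init(&m->woken, false);
}

/* Receiver side: account for one request handed to the accelerator. */
void submit_request(struct rx_model *m)
{
	atomic_fetch_add(&m->decrypt_pending, 1);
}

/* Completion side: wake the receiver only when this was the last
 * outstanding request and the receiver has asked to be notified.
 */
void request_done(struct rx_model *m)
{
	int pending = atomic_fetch_sub(&m->decrypt_pending, 1) - 1;

	if (pending == 0 && atomic_load(&m->async_notify))
		atomic_store(&m->woken, true);
}

/* Receiver side, end of recvmsg(): publish the notify flag first, then
 * check the counter. Returns true if the receiver has to sleep until
 * the completion side wakes it.
 */
bool receiver_done_submitting(struct rx_model *m)
{
	atomic_store(&m->async_notify, true);
	return atomic_load(&m->decrypt_pending) != 0;
}
```

Setting the notify flag before reading the counter is the point of the
ordering: a completion that drops the counter to zero after the flag is
visible will issue the wakeup, and one that finished earlier is seen by
the receiver as a zero counter, so no wakeup is lost.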

 include/net/tls.h |   6 +++
 net/tls/tls_sw.c  | 134 ++++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 127 insertions(+), 13 deletions(-)

diff --git a/include/net/tls.h b/include/net/tls.h
index d5c683e8bb22..cd0a65bd92f9 100644
--- a/include/net/tls.h
+++ b/include/net/tls.h
@@ -124,6 +124,12 @@ struct tls_sw_context_rx {
 	struct sk_buff *recv_pkt;
 	u8 control;
 	bool decrypted;
+	atomic_t decrypt_pending;
+	bool async_notify;
+};
+
+struct decrypt_req_ctx {
+	struct sock *sk;
 };
 
 struct tls_record_info {
diff --git a/net/tls/tls_sw.c b/net/tls/tls_sw.c
index 52fbe727d7c1..9503e5a4c27e 100644
--- a/net/tls/tls_sw.c
+++ b/net/tls/tls_sw.c
@@ -43,12 +43,50 @@
 
 #define MAX_IV_SIZE	TLS_CIPHER_AES_GCM_128_IV_SIZE
 
+static void tls_decrypt_done(struct crypto_async_request *req, int err)
+{
+	struct aead_request *aead_req = (struct aead_request *)req;
+	struct decrypt_req_ctx *req_ctx =
+			(struct decrypt_req_ctx *)(aead_req + 1);
+
+	struct scatterlist *sgout = aead_req->dst;
+
+	struct tls_context *tls_ctx = tls_get_ctx(req_ctx->sk);
+	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
+	int pending = atomic_dec_return(&ctx->decrypt_pending);
+	struct scatterlist *sg;
+	unsigned int pages;
+
+	/* Propagate if there was an err */
+	if (err) {
+		ctx->async_wait.err = err;
+		tls_err_abort(req_ctx->sk, err);
+	}
+
+	/* Release the skb, pages and memory allocated for crypto req */
+	kfree_skb(req->data);
+
+	/* Skip the first S/G entry as it points to AAD */
+	for_each_sg(sg_next(sgout), sg, UINT_MAX, pages) {
+		if (!sg)
+			break;
+		put_page(sg_page(sg));
+	}
+
+	kfree(aead_req);
+
+	if (!pending && READ_ONCE(ctx->async_notify))
+		complete(&ctx->async_wait.completion);
+}
+
 static int tls_do_decryption(struct sock *sk,
+			     struct sk_buff *skb,
 			     struct scatterlist *sgin,
 			     struct scatterlist *sgout,
 			     char *iv_recv,
 			     size_t data_len,
-			     struct aead_request *aead_req)
+			     struct aead_request *aead_req,
+			     bool async)
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
@@ -59,10 +97,34 @@ static int tls_do_decryption(struct sock *sk,
 	aead_request_set_crypt(aead_req, sgin, sgout,
 			       data_len + tls_ctx->rx.tag_size,
 			       (u8 *)iv_recv);
-	aead_request_set_callback(aead_req, CRYPTO_TFM_REQ_MAY_BACKLOG,
-				  crypto_req_done, &ctx->async_wait);
 
-	ret = crypto_wait_req(crypto_aead_decrypt(aead_req), &ctx->async_wait);
+	if (async) {
+		struct decrypt_req_ctx *req_ctx;
+
+		req_ctx = (struct decrypt_req_ctx *)(aead_req + 1);
+		req_ctx->sk = sk;
+
+		aead_request_set_callback(aead_req,
+					  CRYPTO_TFM_REQ_MAY_BACKLOG,
+					  tls_decrypt_done, skb);
+		atomic_inc(&ctx->decrypt_pending);
+	} else {
+		aead_request_set_callback(aead_req,
+					  CRYPTO_TFM_REQ_MAY_BACKLOG,
+					  crypto_req_done, &ctx->async_wait);
+	}
+
+	ret = crypto_aead_decrypt(aead_req);
+	if (ret == -EINPROGRESS) {
+		if (async)
+			return ret;
+
+		ret = crypto_wait_req(ret, &ctx->async_wait);
+	}
+
+	if (async)
+		atomic_dec(&ctx->decrypt_pending);
+
 	return ret;
 }
 
@@ -763,7 +825,10 @@ static int decrypt_internal(struct sock *sk, struct sk_buff *skb,
 	}
 
 	/* Prepare and submit AEAD request */
-	err = tls_do_decryption(sk, sgin, sgout, iv, data_len, aead_req);
+	err = tls_do_decryption(sk, skb, sgin, sgout, iv,
+				data_len, aead_req, *zc);
+	if (err == -EINPROGRESS)
+		return err;
 
 	/* Release the pages in case iov was mapped to pages */
 	for (; pages > 0; pages--)
@@ -788,8 +853,12 @@ static int decrypt_skb_update(struct sock *sk, struct sk_buff *skb,
 #endif
 	if (!ctx->decrypted) {
 		err = decrypt_internal(sk, skb, dest, NULL, chunk, zc);
-		if (err < 0)
+		if (err < 0) {
+			if (err == -EINPROGRESS)
+				tls_advance_record_sn(sk, &tls_ctx->rx);
+
 			return err;
+		}
 	} else {
 		*zc = false;
 	}
@@ -817,18 +886,20 @@ static bool tls_sw_advance_skb(struct sock *sk, struct sk_buff *skb,
 {
 	struct tls_context *tls_ctx = tls_get_ctx(sk);
 	struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_ctx);
-	struct strp_msg *rxm = strp_msg(skb);
 
-	if (len < rxm->full_len) {
-		rxm->offset += len;
-		rxm->full_len -= len;
+	if (skb) {
+		struct strp_msg *rxm = strp_msg(skb);
 
-		return false;
+		if (len < rxm->full_len) {
+			rxm->offset += len;
+			rxm->full_len -= len;
+			return false;
+		}
+		kfree_skb(skb);
 	}
 
 	/* Finished with message */
 	ctx->recv_pkt = NULL;
-	kfree_skb(skb);
 	__strp_unpause(&ctx->strp);
 
 	return true;
@@ -851,6 +922,7 @@ int tls_sw_recvmsg(struct sock *sk,
 	int target, err = 0;
 	long timeo;
 	bool is_kvec = msg->msg_iter.type & ITER_KVEC;
+	int num_async = 0;
 
 	flags |= nonblock;
 
@@ -863,6 +935,7 @@ int tls_sw_recvmsg(struct sock *sk,
 	timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT);
 	do {
 		bool zc = false;
+		bool async = false;
 		int chunk = 0;
 
 		skb = tls_wait_data(sk, flags, timeo, &err);
@@ -870,6 +943,7 @@ int tls_sw_recvmsg(struct sock *sk,
 			goto recv_end;
 
 		rxm = strp_msg(skb);
+
 		if (!cmsg) {
 			int cerr;
 
@@ -896,26 +970,39 @@ int tls_sw_recvmsg(struct sock *sk,
 
 			err = decrypt_skb_update(sk, skb, &msg->msg_iter,
 						 &chunk, &zc);
-			if (err < 0) {
+			if (err < 0 && err != -EINPROGRESS) {
 				tls_err_abort(sk, EBADMSG);
 				goto recv_end;
 			}
+
+			if (err == -EINPROGRESS) {
+				async = true;
+				num_async++;
+				goto pick_next_record;
+			}
+
 			ctx->decrypted = true;
 		}
 
 		if (!zc) {
 			chunk = min_t(unsigned int, rxm->full_len, len);
+
 			err = skb_copy_datagram_msg(skb, rxm->offset, msg,
 						    chunk);
 			if (err < 0)
 				goto recv_end;
 		}
 
+pick_next_record:
 		copied += chunk;
 		len -= chunk;
 		if (likely(!(flags & MSG_PEEK))) {
 			u8 control = ctx->control;
 
+			/* For async, drop current skb reference */
+			if (async)
+				skb = NULL;
+
 			if (tls_sw_advance_skb(sk, skb, chunk)) {
 				/* Return full control message to
 				 * userspace before trying to parse
@@ -924,14 +1011,33 @@ int tls_sw_recvmsg(struct sock *sk,
 				msg->msg_flags |= MSG_EOR;
 				if (control != TLS_RECORD_TYPE_DATA)
 					goto recv_end;
+			} else {
+				break;
 			}
 		}
+
 		/* If we have a new message from strparser, continue now. */
 		if (copied >= target && !ctx->recv_pkt)
 			break;
 	} while (len);
 
 recv_end:
+	if (num_async) {
+		/* Wait for all previously submitted records to be decrypted */
+		smp_store_mb(ctx->async_notify, true);
+		if (atomic_read(&ctx->decrypt_pending)) {
+			err = crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
+			if (err) {
+				/* one of async decrypt failed */
+				tls_err_abort(sk, err);
+				copied = 0;
+			}
+		} else {
+			reinit_completion(&ctx->async_wait.completion);
+		}
+		WRITE_ONCE(ctx->async_notify, false);
+	}
+
 	release_sock(sk);
 	return copied ? : err;
 }
@@ -1271,6 +1377,8 @@ int tls_set_sw_offload(struct sock *sk, struct tls_context *ctx, int tx)
 		goto free_aead;
 
 	if (sw_ctx_rx) {
+		(*aead)->reqsize = sizeof(struct decrypt_req_ctx);
+
 		/* Set up strparser */
 		memset(&cb, 0, sizeof(cb));
 		cb.rcv_msg = tls_queue;
-- 
2.13.6


* Re: Oops running iptables -F OUTPUT
From: Nicholas Piggin @ 2018-08-29  4:36 UTC
  To: Ard Biesheuvel
  Cc: Andreas Schwab, <netdev@vger.kernel.org>, linuxppc-dev,
	Jessica Yu, Michael Ellerman, Will Deacon, Ingo Molnar,
	Andrew Morton, linux-arch, Linus Torvalds
In-Reply-To: <CAKv+Gu_hPaxrVtsBOoviRraYk4FWnT9zQVCVF=i27xd_nGHryw@mail.gmail.com>

On Tue, 28 Aug 2018 18:09:09 +0200
Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:

> On 28 August 2018 at 15:56, Ard Biesheuvel <ard.biesheuvel@linaro.org> wrote:
> > Hello Andreas, Nick,
> >
> > On 28 August 2018 at 06:06, Nicholas Piggin <nicholas.piggin@gmail.com> wrote:  
> >> On Mon, 27 Aug 2018 19:11:01 +0200
> >> Andreas Schwab <schwab@linux-m68k.org> wrote:
> >>  
> >>> I'm getting this Oops when running iptables -F OUTPUT:
> >>>
> >>> [   91.139409] Unable to handle kernel paging request for data at address 0xd0000001fff12f34
> >>> [   91.139414] Faulting instruction address: 0xd0000000016a5718
> >>> [   91.139419] Oops: Kernel access of bad area, sig: 11 [#1]
> >>> [   91.139426] BE SMP NR_CPUS=2 PowerMac
> >>> [   91.139434] Modules linked in: iptable_filter ip_tables x_tables bpfilter nfsd auth_rpcgss lockd grace nfs_acl sunrpc tun af_packet snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus snd_pcm_oss snd_pcm snd_seq snd_timer snd_seq_device snd_mixer_oss snd sungem sr_mod firewire_ohci cdrom sungem_phy soundcore firewire_core pata_macio crc_itu_t sg hid_generic usbhid linear md_mod ohci_pci ohci_hcd ehci_pci ehci_hcd usbcore usb_common dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sata_svw
> >>> [   91.139522] CPU: 1 PID: 3620 Comm: iptables Not tainted 4.19.0-rc1 #1
> >>> [   91.139526] NIP:  d0000000016a5718 LR: d0000000016a569c CTR: c0000000006f560c
> >>> [   91.139531] REGS: c0000001fa577670 TRAP: 0300   Not tainted  (4.19.0-rc1)
> >>> [   91.139534] MSR:  900000000200b032 <SF,HV,VEC,EE,FP,ME,IR,DR,RI>  CR: 84002484  XER: 20000000
> >>> [   91.139553] DAR: d0000001fff12f34 DSISR: 40000000 IRQMASK: 0
> >>> GPR00: d0000000016a569c c0000001fa5778f0 d0000000016b0400 0000000000000000
> >>> GPR04: 0000000000000002 0000000000000000 80000001fa46418e c0000001fa0d05c8
> >>> GPR08: d0000000016b0400 d00037fffff13000 00000001ff3e7000 d0000000016a6fb8
> >>> GPR12: c0000000006f560c c00000000ffff780 0000000000000000 0000000000000000
> >>> GPR16: 0000000011635010 00003fffa1b7aa68 0000000000000000 0000000000000000
> >>> GPR20: 0000000000000003 0000000010013918 00000000116350c0 c000000000b88990
> >>> GPR24: c000000000b88ba4 0000000000000000 d0000001fff12f34 0000000000000000
> >>> GPR28: d0000000016b8000 c0000001fa20f400 c0000001fa20f440 0000000000000000
> >>> [   91.139627] NIP [d0000000016a5718] .alloc_counters.isra.10+0xbc/0x140 [ip_tables]
> >>> [   91.139634] LR [d0000000016a569c] .alloc_counters.isra.10+0x40/0x140 [ip_tables]
> >>> [   91.139638] Call Trace:
> >>> [   91.139645] [c0000001fa5778f0] [d0000000016a569c] .alloc_counters.isra.10+0x40/0x140 [ip_tables] (unreliable)
> >>> [   91.139655] [c0000001fa5779b0] [d0000000016a5b54] .do_ipt_get_ctl+0x110/0x2ec [ip_tables]
> >>> [   91.139666] [c0000001fa577aa0] [c0000000006233e0] .nf_getsockopt+0x68/0x88
> >>> [   91.139674] [c0000001fa577b40] [c000000000631608] .ip_getsockopt+0xbc/0x128
> >>> [   91.139682] [c0000001fa577bf0] [c00000000065adf4] .raw_getsockopt+0x18/0x5c
> >>> [   91.139690] [c0000001fa577c60] [c0000000005b5f60] .sock_common_getsockopt+0x2c/0x40
> >>> [   91.139697] [c0000001fa577cd0] [c0000000005b3394] .__sys_getsockopt+0xa4/0xd0
> >>> [   91.139704] [c0000001fa577d80] [c0000000005b5ab0] .__se_sys_socketcall+0x238/0x2b4
> >>> [   91.139712] [c0000001fa577e30] [c00000000000a31c] system_call+0x5c/0x70
> >>> [   91.139716] Instruction dump:
> >>> [   91.139721] 39290040 7d3d4a14 7fbe4840 409cff98 81380000 2b890001 419d000c 393e0060
> >>> [   91.139736] 48000010 7d57c82a e93e0060 7d295214 <815a0000> 794807e1 41e20010 7c210b78
> >>> [   91.139752] ---[ end trace f5d1d5431651845d ]---  
> >>
> >> This is due to 7290d58095 ("module: use relative references for
> >> __ksymtab entries"). This part of kernel/module.c -
> >>
> >>    /* Divert to percpu allocation if a percpu var. */
> >>    if (sym[i].st_shndx == info->index.pcpu)
> >>        secbase = (unsigned long)mod_percpu(mod);
> >>    else
> >>        secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
> >>    sym[i].st_value += secbase;
> >>
> >> Causes the distance to the target to exceed 32-bits on powerpc, so
> >> it doesn't fit in a rel32 reloc. Not sure how other archs cope.
> >>  
> >
> > Apologies for the breakage. It does indeed appear to affect all
> > architectures, and I'm a bit puzzled why you are the first one to spot
> > it.
> >
> > I will try to find a clean way to special case the per-CPU variable
> > __ksymtab references in the generic module code, and if that is too
> > cumbersome, we can switch to 64-bit relative references (or rather,
> > native word size relative references) instead. Or revert the whole
> > thing ...  
> 
> OK, after a bit of digging, and confirming that the arm64
> implementation works as expected (its module loader actually detects
> overflows of the 32-bit place relative relocations, so the problem
> definitely does not occur there), I think I found the explanation why
> this occurs on powerpc and not on x86 or arm64.
> 
> Could you please check whether this change makes the issue go away?
> (whitespace damage courtesy of Gmail)
> 
> diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
> index 6a501b25dd85..57d09d5ceb1a 100644
> --- a/arch/powerpc/kernel/setup_64.c
> +++ b/arch/powerpc/kernel/setup_64.c
> @@ -779,7 +779,6 @@ EXPORT_SYMBOL(__per_cpu_offset);
> 
>  void __init setup_per_cpu_areas(void)
>  {
> -       const size_t dyn_size = PERCPU_MODULE_RESERVE + PERCPU_DYNAMIC_RESERVE;
>         size_t atom_size;
>         unsigned long delta;
>         unsigned int cpu;
> @@ -795,7 +794,9 @@ void __init setup_per_cpu_areas(void)
>         else
>                 atom_size = 1 << 20;
> 
> -       rc = pcpu_embed_first_chunk(0, dyn_size, atom_size, pcpu_cpu_distance,
> +       rc = pcpu_embed_first_chunk(PERCPU_MODULE_RESERVE,
> +                                   PERCPU_DYNAMIC_RESERVE,
> +                                   atom_size, pcpu_cpu_distance,
>                                     pcpu_fc_alloc, pcpu_fc_free);
>         if (rc < 0)
>                 panic("cannot initialize percpu area (err=%d)", rc);
> 
> The git log does not explain why power deviates from x86 and arm64 in
> the way it initializes the percpu areas.

The reason for 64-bit powerpc is actually that modules are allocated in
vmalloc space which is a long way out from the linear map where the per
cpu embedded chunk is.

It does look like x86 and arm64 are probably okay because they set up a
module vmalloc area close to their kernel text in the linear map, which
should be close to per-cpu I guess.

I'm not entirely sure why pcpu setup is different on powerpc, but I
think the module vmalloc addresses bite first anyway.
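
To make the range problem concrete, here is a minimal sketch of the check
a rel32 reference has to pass. fits_rel32() is a hypothetical helper, not
a kernel function; the addresses in the note below are only illustrative
values picked from the register dump above (one in the 0xd... module
area, one in the 0xc... linear map), not the real __ksymtab entry or
per-cpu symbol locations.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not kernel code): report whether the distance
 * from 'place' (where the relative reference is stored) to 'target'
 * (the symbol it refers to) fits in a signed 32-bit offset.
 */
static bool fits_rel32(uint64_t place, uint64_t target)
{
	/*
	 * The unsigned subtraction wraps modulo 2^64; converting the
	 * result to int64_t yields the signed distance on the usual
	 * two's complement compilers (gcc, clang).
	 */
	int64_t delta = (int64_t)(target - place);

	return delta >= INT32_MIN && delta <= INT32_MAX;
}
```

With a place of 0xd0000000016a6fb8 and a target of 0xc0000001fa0d05c8
the helper reports that the offset does not fit, whereas two addresses
inside the same module region (for example 0xd0000000016a5718 and
0xd0000000016b0400) are well within range.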

Okay I'd say let's just remove powerpc for now.

Thanks,
Nick

* Re: [PATCH net-next 0/5] rtnetlink: add IFA_IF_NETNSID for RTM_GETADDR
From: Kirill Tkhai @ 2018-08-29  8:30 UTC (permalink / raw)
  To: Christian Brauner, netdev, linux-kernel
  Cc: davem, kuznet, yoshfuji, pombredanne, kstewart, gregkh, dsahern,
	fw, lucien.xin, jakub.kicinski, jbenc, nicolas.dichtel,
	Christian Brauner
In-Reply-To: <20180828231859.29758-1-christian@brauner.io>

Hi, Christian,

On 29.08.2018 02:18, Christian Brauner wrote:
> From: Christian Brauner <christian.brauner@ubuntu.com>
> 
> Hey,
> 
> A while back we introduced and enabled IFLA_IF_NETNSID in
> RTM_{DEL,GET,NEW}LINK requests (cf. [1], [2], [3], [4], [5]). This has led
> to significant performance increases since it allows userspace to avoid
> taking the hit of a setns(netns_fd, CLONE_NEWNET), then getting the
> interfaces from the netns associated with the netns_fd. Especially when a
> lot of network namespaces are in use, using setns() becomes increasingly
> problematic when performance matters.

Could you please give a real example where setns()+socket(AF_NETLINK) causes
performance problems? You only need to do this once at application
startup, and after that you have netlink sockets in every net namespace you
need. What is the problem here?

> Usually, RTM_GETLINK requests are followed by RTM_GETADDR requests (cf.
> getifaddrs() style functions and friends). But currently, RTM_GETADDR
> requests do not support a similar property like IFLA_IF_NETNSID for
> RTM_*LINK requests.
> This is problematic since userspace can retrieve interfaces from another
> network namespace by sending an IFLA_IF_NETNSID property along with an
> RTM_GETLINK request but is still forced to use the legacy setns() style of
> retrieving interfaces in RTM_GETADDR requests.
> 
> The goal of this series is to make it possible to perform RTM_GETADDR
> requests on different network namespaces. To this end a new IFA_IF_NETNSID
> property for RTM_*ADDR requests is introduced. It can be used to send a
> network namespace identifier along in RTM_*ADDR requests.  The network
> namespace identifier will be used to retrieve the target network namespace
> in which the request is supposed to be fulfilled.  This aligns the behavior
> of RTM_*ADDR requests with the behavior of RTM_*LINK requests.
> 
> Security:
> - The caller must have assigned a valid network namespace identifier for
>   the target network namespace.
> - The caller must have CAP_NET_ADMIN in the owning user namespace of the
>   target network namespace.
> 
> Thanks!
> Christian
> 
> [1]: commit 7973bfd8758d ("rtnetlink: remove check for IFLA_IF_NETNSID")
> [2]: commit 5bb8ed075428 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_NEWLINK")
> [3]: commit b61ad68a9fe8 ("rtnetlink: enable IFLA_IF_NETNSID for RTM_DELLINK")
> [4]: commit c310bfcb6e1b ("rtnetlink: enable IFLA_IF_NETNSID for RTM_SETLINK")
> [5]: commit 7c4f63ba8243 ("rtnetlink: enable IFLA_IF_NETNSID in do_setlink()")
> 
> Christian Brauner (5):
>   rtnetlink: add rtnl_get_net_ns_capable()
>   if_addr: add IFA_IF_NETNSID
>   ipv4: enable IFA_IF_NETNSID for RTM_GETADDR
>   ipv6: enable IFA_IF_NETNSID for RTM_GETADDR
>   rtnetlink: move type calculation out of loop
> 
>  include/net/rtnetlink.h      |  1 +
>  include/uapi/linux/if_addr.h |  1 +
>  net/core/rtnetlink.c         | 15 +++++---
>  net/ipv4/devinet.c           | 38 +++++++++++++++-----
>  net/ipv6/addrconf.c          | 70 ++++++++++++++++++++++++++++--------
>  5 files changed, 97 insertions(+), 28 deletions(-)
> 

* [PATCH 5/5] net: mvneta: reduce smp_processor_id() calling in mvneta_tx_done_gbe
From: Jisheng Zhang @ 2018-08-29  8:30 UTC (permalink / raw)
  To: thomas.petazzoni, David S. Miller
  Cc: netdev, linux-kernel, Andrew Lunn, Gregory CLEMENT,
	linux-arm-kernel
In-Reply-To: <20180829162456.2bd69796@xhacker.debian>

In the loop of mvneta_tx_done_gbe(), we call smp_processor_id() on
every iteration. Move the call out of the loop to optimize the code a bit.

Before the patch, the loop looks like (under arm64):

        ldr     x1, [x29,#120]
        ...
        ldr     w24, [x1,#36]
        ...
        bl      0 <_raw_spin_lock>
        str     w24, [x27,#132]
        ...

After the patch, the loop looks like (under arm64):

        ...
        bl      0 <_raw_spin_lock>
        str     w23, [x28,#132]
        ...
where w23 is loaded before the loop, so it is ready when needed.

In addition, mvneta_tx_done_gbe() is called from mvneta_poll(), which
runs in non-preemptible context, so it is safe to call
smp_processor_id() once.
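
The effect can be sketched in plain userspace C. Everything below is a
stand-in: fake_processor_id() and fake_tx_lock() are hypothetical
replacements for smp_processor_id() and __netif_tx_lock(), and the loop
simply clears the lowest set bit instead of using the driver's queue
selection policy. The counter only shows how many times the lookup runs
in each variant.

```c
#include <stdint.h>

static int cpu_id_calls;	/* how often the stand-in lookup was invoked */

/* Hypothetical stand-in for smp_processor_id(); always reports CPU 0. */
static int fake_processor_id(void)
{
	cpu_id_calls++;
	return 0;
}

/* Hypothetical stand-in for __netif_tx_lock(nq, cpu): records the owner. */
static void fake_tx_lock(int *owner, int cpu)
{
	*owner = cpu;
}

/* Before: one lookup per pending queue bit. Returns queues handled. */
static int tx_done_per_iteration(uint32_t cause_tx_done)
{
	int owner = -1, queues = 0;

	while (cause_tx_done) {
		fake_tx_lock(&owner, fake_processor_id());
		cause_tx_done &= cause_tx_done - 1;	/* clear lowest set bit */
		queues++;
	}
	(void)owner;
	return queues;
}

/* After: the lookup is hoisted out of the loop and done exactly once. */
static int tx_done_hoisted(uint32_t cause_tx_done)
{
	int cpu = fake_processor_id();
	int owner = -1, queues = 0;

	while (cause_tx_done) {
		fake_tx_lock(&owner, cpu);
		cause_tx_done &= cause_tx_done - 1;	/* clear lowest set bit */
		queues++;
	}
	(void)owner;
	return queues;
}
```

Hoisting is only valid because the caller cannot migrate between CPUs
while the loop runs, which is the non-preemptible context argument made
above.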

Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 7d98f7828a30..62e81e267e13 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2507,12 +2507,13 @@ static void mvneta_tx_done_gbe(struct mvneta_port *pp, u32 cause_tx_done)
 {
 	struct mvneta_tx_queue *txq;
 	struct netdev_queue *nq;
+	int cpu = smp_processor_id();
 
 	while (cause_tx_done) {
 		txq = mvneta_tx_done_policy(pp, cause_tx_done);
 
 		nq = netdev_get_tx_queue(pp->dev, txq->id);
-		__netif_tx_lock(nq, smp_processor_id());
+		__netif_tx_lock(nq, cpu);
 
 		if (txq->count)
 			mvneta_txq_done(pp, txq);
-- 
2.18.0

* [PATCH 4/5] net: mvneta: enable NETIF_F_RXCSUM by default
From: Jisheng Zhang @ 2018-08-29  8:29 UTC (permalink / raw)
  To: thomas.petazzoni, David S. Miller
  Cc: netdev, linux-kernel, Andrew Lunn, Gregory CLEMENT,
	linux-arm-kernel
In-Reply-To: <20180829162456.2bd69796@xhacker.debian>

The driver and HW support NETIF_F_RXCSUM, so let's enable it by default.

Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index 06634d4f9b94..7d98f7828a30 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -4591,7 +4591,8 @@ static int mvneta_probe(struct platform_device *pdev)
 		}
 	}
 
-	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM | NETIF_F_TSO;
+	dev->features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
+			NETIF_F_TSO | NETIF_F_RXCSUM;
 	dev->hw_features |= dev->features;
 	dev->vlan_features |= dev->features;
 	dev->priv_flags |= IFF_LIVE_ADDR_CHANGE;
-- 
2.18.0

* [PATCH 3/5] net: mvneta: Don't check NETIF_F_GRO ourself
From: Jisheng Zhang @ 2018-08-29  8:28 UTC (permalink / raw)
  To: thomas.petazzoni, David S. Miller
  Cc: netdev, linux-kernel, Andrew Lunn, Gregory CLEMENT,
	linux-arm-kernel
In-Reply-To: <20180829162456.2bd69796@xhacker.debian>

napi_gro_receive() checks the NETIF_F_GRO bit as well. If the bit is not
set, we go through GRO_NORMAL in napi_skb_finish() and so fall back to
netif_receive_skb_internal(). We therefore don't need to check
NETIF_F_GRO ourselves.

Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index d9206094fce3..06634d4f9b94 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2065,10 +2065,7 @@ static int mvneta_rx_swbm(struct napi_struct *napi,
 		/* Linux processing */
 		rxq->skb->protocol = eth_type_trans(rxq->skb, dev);
 
-		if (dev->features & NETIF_F_GRO)
-			napi_gro_receive(napi, rxq->skb);
-		else
-			netif_receive_skb(rxq->skb);
+		napi_gro_receive(napi, rxq->skb);
 
 		/* clean uncomplete skb pointer in queue */
 		rxq->skb = NULL;
-- 
2.18.0

* [PATCH 1/5] net: mvneta: fix rx_offset_correction set and usage
From: Jisheng Zhang @ 2018-08-29  8:27 UTC (permalink / raw)
  To: thomas.petazzoni, David S. Miller
  Cc: netdev, linux-kernel, Andrew Lunn, Gregory CLEMENT,
	linux-arm-kernel
In-Reply-To: <20180829162456.2bd69796@xhacker.debian>

The rx_offset_correction is the RX packet offset correction for
platforms. It is not related to SW BM; it depends only on the
platform's NET_SKB_PAD.

Fix the issue by reverting to the original behavior.
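
As a rough illustration of the arithmetic involved, the sketch below
mirrors the two expressions touched by this patch. The constant value of
64 is an assumption inferred from the "exceeds 64B" comment in the
driver, and the NET_SKB_PAD values in the note are only examples; both
function names are hypothetical.

```c
/* Assumed value, inferred from the driver comment ("exceeds 64B"). */
#define RX_PKT_OFFSET_CORRECTION 64

/* Mirrors max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION). */
static int rx_offset_correction(int net_skb_pad)
{
	int excess = net_skb_pad - RX_PKT_OFFSET_CORRECTION;

	return excess > 0 ? excess : 0;
}

/* Mirrors the value passed to mvneta_rxq_offset_set(). */
static int rxq_offset(int net_skb_pad)
{
	return net_skb_pad - rx_offset_correction(net_skb_pad);
}
```

With a NET_SKB_PAD of 128 the correction is 64 and the programmed offset
is 64; with a NET_SKB_PAD of 64 the correction is 0 and the offset is
again 64. Neither result depends on whether a buffer manager is in use.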

Fixes: 562e2f467e71 ("net: mvneta: Improve the buffer allocation method for SWBM")
Signed-off-by: Jisheng Zhang <Jisheng.Zhang@synaptics.com>
---
 drivers/net/ethernet/marvell/mvneta.c | 24 ++++++++++--------------
 1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvneta.c b/drivers/net/ethernet/marvell/mvneta.c
index bc80a678abc3..0ce94f6587a5 100644
--- a/drivers/net/ethernet/marvell/mvneta.c
+++ b/drivers/net/ethernet/marvell/mvneta.c
@@ -2899,21 +2899,18 @@ static void mvneta_rxq_hw_init(struct mvneta_port *pp,
 	mvreg_write(pp, MVNETA_RXQ_BASE_ADDR_REG(rxq->id), rxq->descs_phys);
 	mvreg_write(pp, MVNETA_RXQ_SIZE_REG(rxq->id), rxq->size);
 
+	/* Set Offset */
+	mvneta_rxq_offset_set(pp, rxq, NET_SKB_PAD - pp->rx_offset_correction);
+
 	/* Set coalescing pkts and time */
 	mvneta_rx_pkts_coal_set(pp, rxq, rxq->pkts_coal);
 	mvneta_rx_time_coal_set(pp, rxq, rxq->time_coal);
 
 	if (!pp->bm_priv) {
-		/* Set Offset */
-		mvneta_rxq_offset_set(pp, rxq, 0);
 		mvneta_rxq_buf_size_set(pp, rxq, pp->frag_size);
 		mvneta_rxq_bm_disable(pp, rxq);
 		mvneta_rxq_fill(pp, rxq, rxq->size);
 	} else {
-		/* Set Offset */
-		mvneta_rxq_offset_set(pp, rxq,
-				      NET_SKB_PAD - pp->rx_offset_correction);
-
 		mvneta_rxq_bm_enable(pp, rxq);
 		/* Fill RXQ with buffers from RX pool */
 		mvneta_rxq_long_pool_set(pp, rxq);
@@ -4547,7 +4544,13 @@ static int mvneta_probe(struct platform_device *pdev)
 	SET_NETDEV_DEV(dev, &pdev->dev);
 
 	pp->id = global_port_id++;
-	pp->rx_offset_correction = 0; /* not relevant for SW BM */
+
+	/* Set RX packet offset correction for platforms, whose
+	 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
+	 * platforms and 0B for 32-bit ones.
+	 */
+	pp->rx_offset_correction =
+		max(0, NET_SKB_PAD - MVNETA_RX_PKT_OFFSET_CORRECTION);
 
 	/* Obtain access to BM resources if enabled and already initialized */
 	bm_node = of_parse_phandle(dn, "buffer-manager", 0);
@@ -4562,13 +4565,6 @@ static int mvneta_probe(struct platform_device *pdev)
 				pp->bm_priv = NULL;
 			}
 		}
-		/* Set RX packet offset correction for platforms, whose
-		 * NET_SKB_PAD, exceeds 64B. It should be 64B for 64-bit
-		 * platforms and 0B for 32-bit ones.
-		 */
-		pp->rx_offset_correction = max(0,
-					       NET_SKB_PAD -
-					       MVNETA_RX_PKT_OFFSET_CORRECTION);
 	}
 	of_node_put(bm_node);
 
-- 
2.18.0

* [PATCH 0/5] net: mvneta: some bug fix and trivial improvement
From: Jisheng Zhang @ 2018-08-29  8:25 UTC (permalink / raw)
  To: thomas.petazzoni, David S. Miller
  Cc: netdev, linux-kernel, Andrew Lunn, Gregory CLEMENT,
	linux-arm-kernel

patch1 fixes how rx_offset_correction is set and used. The
rx_offset_correction is the RX packet offset correction for platforms.
It is not related to SW BM; it depends only on the platform's
NET_SKB_PAD.

patch2 fixes the wrong function being used to unmap the rx buf.

patch3 removes our own NETIF_F_GRO check, because the net subsystem
handles it for us.

patch4 enables NETIF_F_RXCSUM by default, since the driver and HW
support the feature.

patch5 is a trivial optimization that reduces the number of
smp_processor_id() calls in mvneta_tx_done_gbe().

Jisheng Zhang (5):
  net: mvneta: fix rx_offset_correction set and usage
  net: mvneta: fix the wrong function to unmap rx buf
  net: mvneta: Don't check NETIF_F_GRO ourself
  net: mvneta: enable NETIF_F_RXCSUM by default
  net: mvneta: reduce smp_processor_id() calling in mvneta_tx_done_gbe

 drivers/net/ethernet/marvell/mvneta.c | 49 ++++++++++++---------------
 1 file changed, 22 insertions(+), 27 deletions(-)

-- 
2.18.0

* Re: Oops running iptables -F OUTPUT
From: Nicholas Piggin @ 2018-08-29  3:53 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: netdev, linuxppc-dev, Ard Biesheuvel, Jessica Yu,
	Michael Ellerman, Will Deacon, Ingo Molnar, Andrew Morton,
	linux-arch, Linus Torvalds, x86, linux-arm-kernel
In-Reply-To: <20180829132827.6dbc4352@roar.ozlabs.ibm.com>

On Wed, 29 Aug 2018 13:28:27 +1000
Nicholas Piggin <npiggin@gmail.com> wrote:

> On Tue, 28 Aug 2018 14:06:32 +1000
> Nicholas Piggin <nicholas.piggin@gmail.com> wrote:
> 
> > On Mon, 27 Aug 2018 19:11:01 +0200
> > Andreas Schwab <schwab@linux-m68k.org> wrote:
> >   
> > > I'm getting this Oops when running iptables -F OUTPUT:
> > > 
> > > [   91.139409] Unable to handle kernel paging request for data at address 0xd0000001fff12f34
> > > [   91.139414] Faulting instruction address: 0xd0000000016a5718
> > > [   91.139419] Oops: Kernel access of bad area, sig: 11 [#1]
> > > [   91.139426] BE SMP NR_CPUS=2 PowerMac
> > > [   91.139434] Modules linked in: iptable_filter ip_tables x_tables bpfilter nfsd auth_rpcgss lockd grace nfs_acl sunrpc tun af_packet snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus snd_pcm_oss snd_pcm snd_seq snd_timer snd_seq_device snd_mixer_oss snd sungem sr_mod firewire_ohci cdrom sungem_phy soundcore firewire_core pata_macio crc_itu_t sg hid_generic usbhid linear md_mod ohci_pci ohci_hcd ehci_pci ehci_hcd usbcore usb_common dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sata_svw
> > > [   91.139522] CPU: 1 PID: 3620 Comm: iptables Not tainted 4.19.0-rc1 #1
> > > [   91.139526] NIP:  d0000000016a5718 LR: d0000000016a569c CTR: c0000000006f560c
> > > [   91.139531] REGS: c0000001fa577670 TRAP: 0300   Not tainted  (4.19.0-rc1)
> > > [   91.139534] MSR:  900000000200b032 <SF,HV,VEC,EE,FP,ME,IR,DR,RI>  CR: 84002484  XER: 20000000
> > > [   91.139553] DAR: d0000001fff12f34 DSISR: 40000000 IRQMASK: 0 
> > > GPR00: d0000000016a569c c0000001fa5778f0 d0000000016b0400 0000000000000000 
> > > GPR04: 0000000000000002 0000000000000000 80000001fa46418e c0000001fa0d05c8 
> > > GPR08: d0000000016b0400 d00037fffff13000 00000001ff3e7000 d0000000016a6fb8 
> > > GPR12: c0000000006f560c c00000000ffff780 0000000000000000 0000000000000000 
> > > GPR16: 0000000011635010 00003fffa1b7aa68 0000000000000000 0000000000000000 
> > > GPR20: 0000000000000003 0000000010013918 00000000116350c0 c000000000b88990 
> > > GPR24: c000000000b88ba4 0000000000000000 d0000001fff12f34 0000000000000000 
> > > GPR28: d0000000016b8000 c0000001fa20f400 c0000001fa20f440 0000000000000000 
> > > [   91.139627] NIP [d0000000016a5718] .alloc_counters.isra.10+0xbc/0x140 [ip_tables]
> > > [   91.139634] LR [d0000000016a569c] .alloc_counters.isra.10+0x40/0x140 [ip_tables]
> > > [   91.139638] Call Trace:
> > > [   91.139645] [c0000001fa5778f0] [d0000000016a569c] .alloc_counters.isra.10+0x40/0x140 [ip_tables] (unreliable)
> > > [   91.139655] [c0000001fa5779b0] [d0000000016a5b54] .do_ipt_get_ctl+0x110/0x2ec [ip_tables]
> > > [   91.139666] [c0000001fa577aa0] [c0000000006233e0] .nf_getsockopt+0x68/0x88
> > > [   91.139674] [c0000001fa577b40] [c000000000631608] .ip_getsockopt+0xbc/0x128
> > > [   91.139682] [c0000001fa577bf0] [c00000000065adf4] .raw_getsockopt+0x18/0x5c
> > > [   91.139690] [c0000001fa577c60] [c0000000005b5f60] .sock_common_getsockopt+0x2c/0x40
> > > [   91.139697] [c0000001fa577cd0] [c0000000005b3394] .__sys_getsockopt+0xa4/0xd0
> > > [   91.139704] [c0000001fa577d80] [c0000000005b5ab0] .__se_sys_socketcall+0x238/0x2b4
> > > [   91.139712] [c0000001fa577e30] [c00000000000a31c] system_call+0x5c/0x70
> > > [   91.139716] Instruction dump:
> > > [   91.139721] 39290040 7d3d4a14 7fbe4840 409cff98 81380000 2b890001 419d000c 393e0060 
> > > [   91.139736] 48000010 7d57c82a e93e0060 7d295214 <815a0000> 794807e1 41e20010 7c210b78 
> > > [   91.139752] ---[ end trace f5d1d5431651845d ]---    
> > 
> > This is due to 7290d58095 ("module: use relative references for
> > __ksymtab entries"). This part of kernel/module.c -
> > 
> >    /* Divert to percpu allocation if a percpu var. */
> >    if (sym[i].st_shndx == info->index.pcpu)
> >        secbase = (unsigned long)mod_percpu(mod);
> >    else
> >        secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
> >    sym[i].st_value += secbase;
> > 
> > Causes the distance to the target to exceed 32-bits on powerpc, so
> > it doesn't fit in a rel32 reloc. Not sure how other archs cope.  
> 
> Any progress on this one? I had a bit of a look but can't see a really
> trivial fix and don't have a lot of time to work on it. Maybe use 64
> bit relative offsets for per-cpu exports, or better might be apply the
> per-cpu fixup when linking against the symbol rather than when writing
> the module symbol table.
> 
> Until then I'd like to just remove HAVE_ARCH_PREL32_RELOCATIONS from
> powerpc/Kconfig, but if other archs are going to have issues too, we
> could just revert
> 
> 271ca788774aa ("arch: enable relative relocations for arm64, power and x86")
> 
> arm64, x86 -- can the distance between your module percpu data link
> location -> module percpu runtime allocation location exceed 31 bits?

[Sorry ignore this, I missed some mail, will reply in the thread]

Thanks,
Nick

* [PATCH][net-next] vxlan: reduce dirty cache line in vxlan_find_mac
From: Li RongQing @ 2018-08-29  3:52 UTC (permalink / raw)
  To: netdev

vxlan_find_mac() unconditionally sets f->used for every packet. This
causes a cache miss for every packet, since the remote, hlist and used
fields of vxlan_fdb share the same cache line and are accessed when
sending every packet.

So set f->used only if it is not equal to jiffies, to reduce how often
the cache line is dirtied. This gives a 3% speed-up with small packets.
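
The pattern is a plain "check before store". A minimal userspace sketch,
with a hypothetical struct fdb_entry standing in for vxlan_fdb and a
caller-supplied tick value standing in for jiffies:

```c
#include <stddef.h>

/* Hypothetical stand-in for the relevant part of struct vxlan_fdb. */
struct fdb_entry {
	unsigned long used;	/* last-use timestamp, in ticks */
};

/*
 * Update the timestamp only when it would actually change, so that
 * repeated lookups within the same tick only read the cache line.
 * Returns 1 if the store was performed, 0 if it was skipped.
 */
static int fdb_touch(struct fdb_entry *f, unsigned long now)
{
	if (f != NULL && f->used != now) {
		f->used = now;
		return 1;
	}
	return 0;
}
```

The extra comparison costs a load of a line the lookup has already
brought in, while skipping the store avoids dirtying that line for the
other CPUs that read it.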

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
---
 drivers/net/vxlan.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/vxlan.c b/drivers/net/vxlan.c
index ababba37d735..e5d236595206 100644
--- a/drivers/net/vxlan.c
+++ b/drivers/net/vxlan.c
@@ -464,7 +464,7 @@ static struct vxlan_fdb *vxlan_find_mac(struct vxlan_dev *vxlan,
 	struct vxlan_fdb *f;

 	f = __vxlan_find_mac(vxlan, mac, vni);
-	if (f)
+	if (f && f->used != jiffies)
 		f->used = jiffies;

 	return f;
-- 
2.16.2

* Re: Oops running iptables -F OUTPUT
From: Nicholas Piggin @ 2018-08-29  3:28 UTC (permalink / raw)
  To: Andreas Schwab
  Cc: netdev, linuxppc-dev, Ard Biesheuvel, Jessica Yu,
	Michael Ellerman, Will Deacon, Ingo Molnar, Andrew Morton,
	linux-arch, Linus Torvalds, x86, linux-arm-kernel
In-Reply-To: <20180828140632.593291cf@roar.ozlabs.ibm.com>

On Tue, 28 Aug 2018 14:06:32 +1000
Nicholas Piggin <nicholas.piggin@gmail.com> wrote:

> On Mon, 27 Aug 2018 19:11:01 +0200
> Andreas Schwab <schwab@linux-m68k.org> wrote:
> 
> > I'm getting this Oops when running iptables -F OUTPUT:
> > 
> > [   91.139409] Unable to handle kernel paging request for data at address 0xd0000001fff12f34
> > [   91.139414] Faulting instruction address: 0xd0000000016a5718
> > [   91.139419] Oops: Kernel access of bad area, sig: 11 [#1]
> > [   91.139426] BE SMP NR_CPUS=2 PowerMac
> > [   91.139434] Modules linked in: iptable_filter ip_tables x_tables bpfilter nfsd auth_rpcgss lockd grace nfs_acl sunrpc tun af_packet snd_aoa_codec_tas snd_aoa_fabric_layout snd_aoa snd_aoa_i2sbus snd_aoa_soundbus snd_pcm_oss snd_pcm snd_seq snd_timer snd_seq_device snd_mixer_oss snd sungem sr_mod firewire_ohci cdrom sungem_phy soundcore firewire_core pata_macio crc_itu_t sg hid_generic usbhid linear md_mod ohci_pci ohci_hcd ehci_pci ehci_hcd usbcore usb_common dm_snapshot dm_bufio dm_mirror dm_region_hash dm_log dm_mod sata_svw
> > [   91.139522] CPU: 1 PID: 3620 Comm: iptables Not tainted 4.19.0-rc1 #1
> > [   91.139526] NIP:  d0000000016a5718 LR: d0000000016a569c CTR: c0000000006f560c
> > [   91.139531] REGS: c0000001fa577670 TRAP: 0300   Not tainted  (4.19.0-rc1)
> > [   91.139534] MSR:  900000000200b032 <SF,HV,VEC,EE,FP,ME,IR,DR,RI>  CR: 84002484  XER: 20000000
> > [   91.139553] DAR: d0000001fff12f34 DSISR: 40000000 IRQMASK: 0 
> > GPR00: d0000000016a569c c0000001fa5778f0 d0000000016b0400 0000000000000000 
> > GPR04: 0000000000000002 0000000000000000 80000001fa46418e c0000001fa0d05c8 
> > GPR08: d0000000016b0400 d00037fffff13000 00000001ff3e7000 d0000000016a6fb8 
> > GPR12: c0000000006f560c c00000000ffff780 0000000000000000 0000000000000000 
> > GPR16: 0000000011635010 00003fffa1b7aa68 0000000000000000 0000000000000000 
> > GPR20: 0000000000000003 0000000010013918 00000000116350c0 c000000000b88990 
> > GPR24: c000000000b88ba4 0000000000000000 d0000001fff12f34 0000000000000000 
> > GPR28: d0000000016b8000 c0000001fa20f400 c0000001fa20f440 0000000000000000 
> > [   91.139627] NIP [d0000000016a5718] .alloc_counters.isra.10+0xbc/0x140 [ip_tables]
> > [   91.139634] LR [d0000000016a569c] .alloc_counters.isra.10+0x40/0x140 [ip_tables]
> > [   91.139638] Call Trace:
> > [   91.139645] [c0000001fa5778f0] [d0000000016a569c] .alloc_counters.isra.10+0x40/0x140 [ip_tables] (unreliable)
> > [   91.139655] [c0000001fa5779b0] [d0000000016a5b54] .do_ipt_get_ctl+0x110/0x2ec [ip_tables]
> > [   91.139666] [c0000001fa577aa0] [c0000000006233e0] .nf_getsockopt+0x68/0x88
> > [   91.139674] [c0000001fa577b40] [c000000000631608] .ip_getsockopt+0xbc/0x128
> > [   91.139682] [c0000001fa577bf0] [c00000000065adf4] .raw_getsockopt+0x18/0x5c
> > [   91.139690] [c0000001fa577c60] [c0000000005b5f60] .sock_common_getsockopt+0x2c/0x40
> > [   91.139697] [c0000001fa577cd0] [c0000000005b3394] .__sys_getsockopt+0xa4/0xd0
> > [   91.139704] [c0000001fa577d80] [c0000000005b5ab0] .__se_sys_socketcall+0x238/0x2b4
> > [   91.139712] [c0000001fa577e30] [c00000000000a31c] system_call+0x5c/0x70
> > [   91.139716] Instruction dump:
> > [   91.139721] 39290040 7d3d4a14 7fbe4840 409cff98 81380000 2b890001 419d000c 393e0060 
> > [   91.139736] 48000010 7d57c82a e93e0060 7d295214 <815a0000> 794807e1 41e20010 7c210b78 
> > [   91.139752] ---[ end trace f5d1d5431651845d ]---  
> 
> This is due to 7290d58095 ("module: use relative references for
> __ksymtab entries"). This part of kernel/module.c -
> 
>    /* Divert to percpu allocation if a percpu var. */
>    if (sym[i].st_shndx == info->index.pcpu)
>        secbase = (unsigned long)mod_percpu(mod);
>    else
>        secbase = info->sechdrs[sym[i].st_shndx].sh_addr;
>    sym[i].st_value += secbase;
> 
> Causes the distance to the target to exceed 32-bits on powerpc, so
> it doesn't fit in a rel32 reloc. Not sure how other archs cope.

Any progress on this one? I had a bit of a look but can't see a really
trivial fix and don't have a lot of time to work on it. Maybe use 64
bit relative offsets for per-cpu exports, or better might be apply the
per-cpu fixup when linking against the symbol rather than when writing
the module symbol table.

Until then I'd like to just remove HAVE_ARCH_PREL32_RELOCATIONS from
powerpc/Kconfig, but if other archs are going to have issues too, we
could just revert

271ca788774aa ("arch: enable relative relocations for arm64, power and x86")

arm64, x86 -- can the distance between your module percpu data link
location -> module percpu runtime allocation location exceed 31 bits?

Thanks,
Nick

* Re: [PATCH RFT] net: dsa: Allow configuring CPU port VLANs
From: Maxim Uvarov @ 2018-08-29  7:14 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Ilias Apalodimas, petrm, netdev, jiri, Andrew Lunn,
	Vivien Didelot, David Miller, kernel list
In-Reply-To: <25821da3-b923-485f-0991-e1dc943aefec@gmail.com>

Tue, 28 Aug 2018 at 22:17, Florian Fainelli <f.fainelli@gmail.com> wrote:
>
> On 08/28/2018 12:08 PM, Maxim Uvarov wrote:
> > Tue, 28 Aug 2018 at 20:00, Florian Fainelli <f.fainelli@gmail.com> wrote:
> >>
> >> On 08/28/2018 01:32 AM, Ilias Apalodimas wrote:
> >>> On Fri, Aug 10, 2018 at 04:58:10PM -0700, Florian Fainelli wrote:
> >>>> On 06/25/2018 02:17 AM, Ilias Apalodimas wrote:
> >>>>> On Mon, Jun 25, 2018 at 12:13:10PM +0300, Petr Machata wrote:
> >>>>>> Florian Fainelli <f.fainelli@gmail.com> writes:
> >>>>>>
> >>>>>>>   if (netif_is_bridge_master(vlan->obj.orig_dev))
> >>>>>>> -         return -EOPNOTSUPP;
> >>>>>>> +         info.port = dp->cpu_dp->index;
> >>>>>>
> >>>>>> The condition above will trigger also when a VLAN is added on a member
> >>>>>> port, and there's no other port with that VLAN. In that case the VLAN
> >>>>>> comes without the BRIDGE_VLAN_INFO_BRENTRY flag. In mlxsw we have this
> >>>>>> to get the bridge VLANs:
> >>>>>>
> >>>>>>    if (netif_is_bridge_master(orig_dev)) {
> >>>>>>            [...]
> >>>>>>            if ((vlan->flags & BRIDGE_VLAN_INFO_BRENTRY) &&
> >>>>>>            [...]
> >>>>>>
> >>>>>> This doesn't appear to be done in DSA unless I'm missing something.
> >>>>> Petr's right. This will trigger for VLANs added on 'not cpu ports' if the VLAN
> >>>>> is not already a member.
> >>>>>
> >>>>> This command has BRIDGE_VLAN_INFO_BRENTRY set:
> >>>>> bridge vlan add dev br0 vid 100 pvid untagged self
> >>>>> I had the same issue on my CPSW RFC and solved it
> >>>>> exactly the same way as Petr suggested.
> >>>>
> >>>> Humm, there must be something obvious I am missing, but the following
> >>>> don't exactly result in what I would expect after adding a check for
> >>>> vlan->flags & BRIDGE_VLAN_INFO_BRENTRY:
> >>>>
> >>>> brctl addbr br0
> >>>> echo 1 > /sys/class/net/br0/bridge/vlan_filtering
> >>>> brctl addif br0 lan1
> >>>>
> >>>> #1 results in lan1 being programmed with VID 1, PVID, untagged, but not
> >>>> the CPU port. I would have sort of expected that the bridge layer would
> >>>> also push the configuration to br0/CPU port since this is the default VLAN:
> >>>>
> >>>> bridge vlan show dev br0
> >>>> port vlan ids
> >>>> br0  1 PVID Egress Untagged
> >>>>
> >>>> But it does not.
> >>>>
> >>>> bridge vlan add vid 2 dev lan1
> >>>>
> >>>> #2 same thing, results in only lan1 being programmed with VID 2, tagged
> >>>> but that is expected because we are creating the VLAN only for the
> >>>> user-facing port.
> >>>>
> >>>> bridge vlan add vid 3 dev br0 self
> >>>>
> >>>> #3 results in the CPU port being programmed with VID 3, tagged, again,
> >>>> this is expected because we are only programming the bridge master/CPU
> >>>> port here.
> >>>>
> >>>> Does #1 also happen for cpsw and mlxsw or do you actually get events
> >>>> about the bridge's default VLAN configuration? Or does the switch driver
> >>>> actually need to obtain that at the time the port is enslaved somehow?
> >>> As long as ports are attached you get the events (one event per attached port
> >>> iirc)
> >>> if the event is checked against BRIDGE_VLAN_INFO_BRENTRY, the only way to add a
> >>> VLAN to the cpu port is via 'bridge vlan add vid 3 dev br0 self'
> >>
> >> Do we have a guarantee that upon port enslavement, whatever default_pvid
> >> is configured on the bridge master device also happens to be the port's
> >> default_pvid settings as well?
> >
> > I think default pvid is a per-port thing. I.e. each port can have its
> > own pvid (i.e. it will tag untagged incoming packets on that port
> > with that vlan id),
>
> We are talking about the bridge master device's default_pvid which can
> be set prior to any port being enslaved into the bridge. As of today, if
> you enslave a port of a switch into a bridge, you need to properly
> configure the CPU/management port as well otherwise things just won't be
> working. At the time we enslave the first port into the bridge, there is
> no notification AFAICT that is generated to tell us about what the
> bridge master device's default_pvid is.
>
> > I did not exactly understand use case. With adding vlan filtering to
> > cpu port you filter out packets from other vlan groups to cpu port.
> > This might be useful
> > only for multicast packets or a missing fdb entry on some dsa port. Is
> > filtering multicast the main problem to solve here?
> > Linux is missing a vlan ingress policy. I.e. filtering (echo 1 >
> > /sys/br0/vlan_filter) has to be one of 3 policies: secure (default
> > now), check and fallback. With current secure mode it
> > might work, but with check mode it will be needed to add all vlans to
> > cpu port. Btw, on some hardware vlan ingress policies are also per
> > port, not per bridge.
>
> The general use case is that the CPU port on switches that have such a
> thing is just a normal port on which you should be able to configure
> exactly the VLAN membership and attributes.
>

That would be a good feature to add.

> With DSA switches today, we cannot do that, because there is no network
> interface exposed for the CPU port (and there should not be one), so
> when you target the bridge master device, e.g: br0, we can generate
> events towards the switch driver that map to the CPU port.
>
> There are many reasons for trying to do that, if we don't support such a
> thing, then we need to have the CPU port be part of all VLAN IDs that
> get added to ports, as a tagged member (because if untagged, you can't
> differentiate traffic anymore).
>
> Regarding your suggestion, we could certainly change vlan_filtering to
> take several values:
>
> 0: disabled
> 1: secure
> 2: check
>
> Or something like that.

I think that will work.
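
To make the idea a bit more concrete, here is a rough userspace sketch of
how such a mode could gate ingress frames. The enum names, the extra
"fallback" value and the exact semantics of each mode are my assumptions
(modelled on how some switch chips describe their 802.1Q modes); none of
this is existing bridge code.

```c
#include <stdbool.h>

/* Hypothetical policy values. 0-2 follow the proposal above; "fallback"
 * is added as a fourth value purely for illustration.
 */
enum vlan_ingress_policy {
	VLAN_POLICY_DISABLED = 0,
	VLAN_POLICY_SECURE   = 1,
	VLAN_POLICY_CHECK    = 2,
	VLAN_POLICY_FALLBACK = 3,
};

/* Return true if an ingress frame should be accepted.
 * vid_known:   the frame's VID exists in the VLAN table
 * port_member: the ingress port is a member of that VLAN
 */
static bool vlan_ingress_allowed(enum vlan_ingress_policy policy,
				 bool vid_known, bool port_member)
{
	switch (policy) {
	case VLAN_POLICY_SECURE:
		/* assumed: VID must exist and the port must be a member */
		return vid_known && port_member;
	case VLAN_POLICY_CHECK:
		/* assumed: VID must exist, membership is not enforced */
		return vid_known;
	case VLAN_POLICY_DISABLED:
	case VLAN_POLICY_FALLBACK:
	default:
		/* assumed: nothing is dropped at ingress */
		return true;
	}
}
```

Under these assumed semantics a frame with a known VID arriving on a
non-member port is dropped in secure mode and accepted in check mode.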

Maxim.

> --
> Florian



-- 
Best regards,
Maxim Uvarov

^ permalink raw reply

* RE: [PATCH 4/6] net/wan/fsl_ucc_hdlc: default hmask value
From: Qiang Zhao @ 2018-08-29  2:54 UTC (permalink / raw)
  To: David Gounaris, netdev@vger.kernel.org,
	linuxppc-dev@lists.ozlabs.org
  Cc: David Gounaris
In-Reply-To: <20180828110921.2542-5-david.gounaris@infinera.com>

From: David Gounaris <david.gounaris@infinera.com>
Date: 2018/8/28 19:09
> Subject: [PATCH 4/6] net/wan/fsl_ucc_hdlc: default hmask value
> 
> Set default HMASK to 0x0000 to use
> promiscuous mode in the hdlc controller.
> 
> Signed-off-by: David Gounaris <david.gounaris@infinera.com>
> ---
>  drivers/net/wan/fsl_ucc_hdlc.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/wan/fsl_ucc_hdlc.h b/drivers/net/wan/fsl_ucc_hdlc.h
> index c21134c1f180..5bc3d1a6ca6e 100644
> --- a/drivers/net/wan/fsl_ucc_hdlc.h
> +++ b/drivers/net/wan/fsl_ucc_hdlc.h
> @@ -134,7 +134,7 @@ struct ucc_hdlc_private {
> 
>  #define HDLC_HEAD_MASK		0x0000
>  #define DEFAULT_HDLC_HEAD	0xff44
> -#define DEFAULT_ADDR_MASK	0x00ff
> +#define DEFAULT_ADDR_MASK	0x0000
>  #define DEFAULT_HDLC_ADDR	0x00ff
> 
>  #define BMR_GBL			0x20000000
> --

It is not proper to set the default HMASK to 0x0000. How about adding a new device tree property for hmask?
If the property is present in the dtb, use its value; otherwise, use the default HMASK 0x00ff.
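
A minimal userspace sketch of that fallback logic follows. The helper
name hdlc_pick_hmask() and the way the property value is passed in are
hypothetical; only the 0x00ff default comes from the current header.

```c
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_ADDR_MASK 0x00ff

/* dt_hmask points at the property value when the (hypothetical) hmask
 * property is present in the device tree, or is NULL when it is absent.
 */
static uint16_t hdlc_pick_hmask(const uint16_t *dt_hmask)
{
	return dt_hmask ? *dt_hmask : DEFAULT_ADDR_MASK;
}
```

In the driver the lookup would presumably go through something like
of_property_read_u16() in the probe path, so a board that wants
promiscuous mode sets the property to 0x0000 and everyone else keeps
the current behaviour.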

Best Regards
Qiang Zhao

^ permalink raw reply

* [PATCH] neighbour: confirm neigh entries when ARP packet is received
From: Vasily Khoruzhick @ 2018-08-29  2:48 UTC (permalink / raw)
  To: Ihar Hrachyshka, David S. Miller, Roopa Prabhu, Alexey Dobriyan,
	Jim Westfall, Stephen Hemminger, Vasily Khoruzhick, Kees Cook,
	Wolfgang Bumiller, Eric Dumazet, netdev
  Cc: Vasily Khoruzhick

Update 'confirmed' timestamp when ARP packet is received. It shouldn't
affect locktime logic and anyway entry can be confirmed by any higher-layer
protocol. Thus it makes no sense not to confirm it when ARP packet is
received.
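
The intended timestamp behaviour can be sketched in userspace as below.
This is a simplified model, not the real neigh_update(): struct
neigh_model and neigh_touch() are made-up names, and everything except
the two timestamp rules is left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUD_REACHABLE 0x02
#define NUD_STALE     0x04
#define NUD_NOARP     0x40
#define NUD_PERMANENT 0x80
#define NUD_CONNECTED (NUD_PERMANENT | NUD_NOARP | NUD_REACHABLE)

/* Hypothetical cut-down neighbour entry: state plus the two timestamps. */
struct neigh_model {
	uint8_t state;
	unsigned long confirmed;
	unsigned long updated;
};

static void neigh_touch(struct neigh_model *n, uint8_t new_state,
			bool lladdr_changed, unsigned long now)
{
	/* With this patch: any update to a connected state confirms. */
	if (new_state & NUD_CONNECTED)
		n->confirmed = now;

	/* Unchanged: 'updated' (the locktime anchor) only moves when the
	 * state or the link-layer address really changes.
	 */
	if (new_state != n->state || lladdr_changed)
		n->updated = now;

	n->state = new_state;
}
```

A repeated REACHABLE update with the same address moves 'confirmed' but
leaves 'updated' alone, so the locktime window is not shifted by noop
updates.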

Fixes: 77d7123342 ("neighbour: update neigh timestamps iff update is effective")

Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
---
 net/core/neighbour.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index aa19d86937af..901418ef70ea 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -1180,6 +1180,9 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
 		lladdr = neigh->ha;
 	}
 
+	if (new & NUD_CONNECTED)
+		neigh->confirmed = jiffies;
+
 	/* If entry was valid and address is not changed,
 	   do not change entry state, if new one is STALE.
 	 */
@@ -1205,11 +1208,8 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
 	 * neighbour entry. Otherwise we risk to move the locktime window with
 	 * noop updates and ignore relevant ARP updates.
 	 */
-	if (new != old || lladdr != neigh->ha) {
-		if (new & NUD_CONNECTED)
-			neigh->confirmed = jiffies;
+	if (new != old || lladdr != neigh->ha)
 		neigh->updated = jiffies;
-	}
 
 	if (new != old) {
 		neigh_del_timer(neigh);
-- 
2.18.0

^ permalink raw reply related

* [PATCH net-next 1/4] liquidio: improve soft command handling
From: Felix Manlunas @ 2018-08-29  1:51 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	felix.manlunas, weilin.chang
In-Reply-To: <20180829015058.GA7898@felix-thinkpad.cavium.com>

1. Set LIO_SC_MAX_TMO_MS as the maximum timeout value for a soft command
   (sc).  All sc's use this value as a hard timeout value. Add expiry_time
   in struct octeon_soft_command to keep the hard timeout value. The field
   wait_time and timeout in struct octeon_soft_command will be obsoleted in
   the last patch of this patch series.
2. Add processing of synchronous sc's in the sc response thread
   lio_process_ordered_list(). The memory allocated for a synchronous sc
   will be freed by lio_process_ordered_list() to the sc pool.
3. Add two response lists for lio_process_ordered_list to process the
   storage allocated for sc's:
   OCTEON_DONE_SC_LIST response list keeps all sc's which will be freed to
   the pool after their requestors have finished processing the responses.
   OCTEON_ZOMBIE_SC_LIST response list keeps all sc's which have got
   LIO_SC_MAX_TMO_MS timeout.
   When an sc gets a hard timeout, lio_process_ordered_list() will recheck
   its status 1 ms later. If the status has not been updated by the firmware
   at that time, the sc will be moved from OCTEON_DONE_SC_LIST response list
   to OCTEON_ZOMBIE_SC_LIST response list. The sc's in the
   OCTEON_ZOMBIE_SC_LIST response list will be freed when the driver is
   unloaded.
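
The done/zombie decision described in item 3 can be modelled in userspace
roughly as follows. The struct, enum and function names are illustrative
only, and the real code walks locked kernel lists rather than an array;
the reason given for parking an sc on the zombie list (the firmware might
still write to its buffer) is an inference, not something stated above.

```c
#include <stdbool.h>
#include <stdint.h>

#define COMPLETION_WORD_INIT 0xffffffffffffffffULL

/* Where a modelled soft command currently lives. */
enum sc_place { SC_ON_DONE_LIST, SC_ON_ZOMBIE_LIST, SC_FREED };

struct sc_model {
	uint64_t status_word;	/* overwritten by firmware on response */
	bool caller_is_done;	/* set by the requester when finished */
	enum sc_place place;
};

/* One pass over the done list, mirroring the decision described above. */
static void sweep_done_list(struct sc_model *scs, int n)
{
	for (int i = 0; i < n; i++) {
		struct sc_model *sc = &scs[i];

		/* Skip entries the requester is still using. */
		if (sc->place != SC_ON_DONE_LIST || !sc->caller_is_done)
			continue;

		if (sc->status_word == COMPLETION_WORD_INIT)
			/* no response yet: keep the buffer out of the pool */
			sc->place = SC_ON_ZOMBIE_LIST;
		else
			/* response arrived: safe to return to the pool */
			sc->place = SC_FREED;
	}
}
```

Zombie entries are then released only at teardown, which in this model
would be a final loop that marks every SC_ON_ZOMBIE_LIST entry as freed.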

Signed-off-by: Weilin Chang <weilin.chang@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 drivers/net/ethernet/cavium/liquidio/lio_main.c    |  31 +++++-
 drivers/net/ethernet/cavium/liquidio/lio_vf_main.c |  34 +++++-
 .../net/ethernet/cavium/liquidio/octeon_config.h   |   2 +-
 drivers/net/ethernet/cavium/liquidio/octeon_iq.h   |  11 ++
 drivers/net/ethernet/cavium/liquidio/octeon_nic.c  |   3 +-
 .../net/ethernet/cavium/liquidio/request_manager.c | 114 +++++++++++++++------
 .../ethernet/cavium/liquidio/response_manager.c    |  82 +++++++++++++--
 .../ethernet/cavium/liquidio/response_manager.h    |   4 +-
 8 files changed, 232 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 6fb13fa..6663749 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -1037,12 +1037,12 @@ static void octeon_destroy_resources(struct octeon_device *oct)
 
 		/* fallthrough */
 	case OCT_DEV_IO_QUEUES_DONE:
-		if (wait_for_pending_requests(oct))
-			dev_err(&oct->pci_dev->dev, "There were pending requests\n");
-
 		if (lio_wait_for_instr_fetch(oct))
 			dev_err(&oct->pci_dev->dev, "IQ had pending instructions\n");
 
+		if (wait_for_pending_requests(oct))
+			dev_err(&oct->pci_dev->dev, "There were pending requests\n");
+
 		/* Disable the input and output queues now. No more packets will
 		 * arrive from Octeon, but we should wait for all packet
 		 * processing to finish.
@@ -1052,6 +1052,31 @@ static void octeon_destroy_resources(struct octeon_device *oct)
 		if (lio_wait_for_oq_pkts(oct))
 			dev_err(&oct->pci_dev->dev, "OQ had pending packets\n");
 
+		/* Force all requests waiting to be fetched by OCTEON to
+		 * complete.
+		 */
+		for (i = 0; i < MAX_OCTEON_INSTR_QUEUES(oct); i++) {
+			struct octeon_instr_queue *iq;
+
+			if (!(oct->io_qmask.iq & BIT_ULL(i)))
+				continue;
+			iq = oct->instr_queue[i];
+
+			if (atomic_read(&iq->instr_pending)) {
+				spin_lock_bh(&iq->lock);
+				iq->fill_cnt = 0;
+				iq->octeon_read_index = iq->host_write_index;
+				iq->stats.instr_processed +=
+					atomic_read(&iq->instr_pending);
+				lio_process_iq_request_list(oct, iq, 0);
+				spin_unlock_bh(&iq->lock);
+			}
+		}
+
+		lio_process_ordered_list(oct, 1);
+		octeon_free_sc_done_list(oct);
+		octeon_free_sc_zombie_list(oct);
+
 	/* fallthrough */
 	case OCT_DEV_INTR_SET_DONE:
 		/* Disable interrupts  */
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
index b778357..59c2dd9 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_main.c
@@ -471,12 +471,12 @@ static void octeon_destroy_resources(struct octeon_device *oct)
 	case OCT_DEV_HOST_OK:
 		/* fallthrough */
 	case OCT_DEV_IO_QUEUES_DONE:
-		if (wait_for_pending_requests(oct))
-			dev_err(&oct->pci_dev->dev, "There were pending requests\n");
-
 		if (lio_wait_for_instr_fetch(oct))
 			dev_err(&oct->pci_dev->dev, "IQ had pending instructions\n");
 
+		if (wait_for_pending_requests(oct))
+			dev_err(&oct->pci_dev->dev, "There were pending requests\n");
+
 		/* Disable the input and output queues now. No more packets will
 		 * arrive from Octeon, but we should wait for all packet
 		 * processing to finish.
@@ -485,7 +485,33 @@ static void octeon_destroy_resources(struct octeon_device *oct)
 
 		if (lio_wait_for_oq_pkts(oct))
 			dev_err(&oct->pci_dev->dev, "OQ had pending packets\n");
-		/* fall through */
+
+		/* Force all requests waiting to be fetched by OCTEON to
+		 * complete.
+		 */
+		for (i = 0; i < MAX_OCTEON_INSTR_QUEUES(oct); i++) {
+			struct octeon_instr_queue *iq;
+
+			if (!(oct->io_qmask.iq & BIT_ULL(i)))
+				continue;
+			iq = oct->instr_queue[i];
+
+			if (atomic_read(&iq->instr_pending)) {
+				spin_lock_bh(&iq->lock);
+				iq->fill_cnt = 0;
+				iq->octeon_read_index = iq->host_write_index;
+				iq->stats.instr_processed +=
+					atomic_read(&iq->instr_pending);
+				lio_process_iq_request_list(oct, iq, 0);
+				spin_unlock_bh(&iq->lock);
+			}
+		}
+
+		lio_process_ordered_list(oct, 1);
+		octeon_free_sc_done_list(oct);
+		octeon_free_sc_zombie_list(oct);
+
+	/* fall through */
 	case OCT_DEV_INTR_SET_DONE:
 		/* Disable interrupts  */
 		oct->fn_list.disable_interrupt(oct, OCTEON_ALL_INTR);
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_config.h b/drivers/net/ethernet/cavium/liquidio/octeon_config.h
index ceac743..056dceb 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_config.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_config.h
@@ -440,7 +440,7 @@ struct octeon_config {
 /* Response lists - 1 ordered, 1 unordered-blocking, 1 unordered-nonblocking
  * NoResponse Lists are now maintained with each IQ. (Dec' 2007).
  */
-#define MAX_RESPONSE_LISTS           4
+#define MAX_RESPONSE_LISTS           6
 
 /* Opcode hash bits. The opcode is hashed on the lower 6-bits to lookup the
  * dispatch table.
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_iq.h b/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
index aecd0d3..3437d7f 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_iq.h
@@ -294,11 +294,20 @@ struct octeon_soft_command {
 	/** Time out and callback */
 	size_t wait_time;
 	size_t timeout;
+	size_t expiry_time;
+
 	u32 iq_no;
 	void (*callback)(struct octeon_device *, u32, void *);
 	void *callback_arg;
+
+	int caller_is_done;
+	u32 sc_status;
+	struct completion complete;
 };
 
+/* max timeout (in milli sec) for soft request */
+#define LIO_SC_MAX_TMO_MS       60000
+
 /** Maximum number of buffers to allocate into soft command buffer pool
  */
 #define  MAX_SOFT_COMMAND_BUFFERS	256
@@ -319,6 +328,8 @@ struct octeon_sc_buffer_pool {
 		(((octeon_dev_ptr)->instr_queue[iq_no]->stats.field) += count)
 
 int octeon_setup_sc_buffer_pool(struct octeon_device *oct);
+int octeon_free_sc_done_list(struct octeon_device *oct);
+int octeon_free_sc_zombie_list(struct octeon_device *oct);
 int octeon_free_sc_buffer_pool(struct octeon_device *oct);
 struct octeon_soft_command *
 	octeon_alloc_soft_command(struct octeon_device *oct,
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_nic.c b/drivers/net/ethernet/cavium/liquidio/octeon_nic.c
index 150609b..b7364bb 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_nic.c
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_nic.c
@@ -75,8 +75,7 @@ octeon_alloc_soft_command_resp(struct octeon_device    *oct,
 	else
 		sc->cmd.cmd2.rptr =  sc->dmarptr;
 
-	sc->wait_time = 1000;
-	sc->timeout = jiffies + sc->wait_time;
+	sc->expiry_time = jiffies + msecs_to_jiffies(LIO_SC_MAX_TMO_MS);
 
 	return sc;
 }
diff --git a/drivers/net/ethernet/cavium/liquidio/request_manager.c b/drivers/net/ethernet/cavium/liquidio/request_manager.c
index 5de5ce9..bd0153e 100644
--- a/drivers/net/ethernet/cavium/liquidio/request_manager.c
+++ b/drivers/net/ethernet/cavium/liquidio/request_manager.c
@@ -409,33 +409,22 @@ lio_process_iq_request_list(struct octeon_device *oct,
 			else
 				irh = (struct octeon_instr_irh *)
 					&sc->cmd.cmd2.irh;
-			if (irh->rflag) {
-				/* We're expecting a response from Octeon.
-				 * It's up to lio_process_ordered_list() to
-				 * process  sc. Add sc to the ordered soft
-				 * command response list because we expect
-				 * a response from Octeon.
-				 */
-				spin_lock_irqsave
-					(&oct->response_list
-					 [OCTEON_ORDERED_SC_LIST].lock,
-					 flags);
-				atomic_inc(&oct->response_list
-					[OCTEON_ORDERED_SC_LIST].
-					pending_req_count);
-				list_add_tail(&sc->node, &oct->response_list
-					[OCTEON_ORDERED_SC_LIST].head);
-				spin_unlock_irqrestore
-					(&oct->response_list
-					 [OCTEON_ORDERED_SC_LIST].lock,
-					 flags);
-			} else {
-				if (sc->callback) {
-					/* This callback must not sleep */
-					sc->callback(oct, OCTEON_REQUEST_DONE,
-						     sc->callback_arg);
-				}
-			}
+
+			/* We're expecting a response from Octeon.
+			 * It's up to lio_process_ordered_list() to
+			 * process  sc. Add sc to the ordered soft
+			 * command response list because we expect
+			 * a response from Octeon.
+			 */
+			spin_lock_irqsave(&oct->response_list
+					  [OCTEON_ORDERED_SC_LIST].lock, flags);
+			atomic_inc(&oct->response_list
+				   [OCTEON_ORDERED_SC_LIST].pending_req_count);
+			list_add_tail(&sc->node, &oct->response_list
+				[OCTEON_ORDERED_SC_LIST].head);
+			spin_unlock_irqrestore(&oct->response_list
+					       [OCTEON_ORDERED_SC_LIST].lock,
+					       flags);
 			break;
 		default:
 			dev_err(&oct->pci_dev->dev,
@@ -755,8 +744,7 @@ int octeon_send_soft_command(struct octeon_device *oct,
 		len = (u32)ih2->dlengsz;
 	}
 
-	if (sc->wait_time)
-		sc->timeout = jiffies + sc->wait_time;
+	sc->expiry_time = jiffies + msecs_to_jiffies(LIO_SC_MAX_TMO_MS);
 
 	return (octeon_send_command(oct, sc->iq_no, 1, &sc->cmd, sc,
 				    len, REQTYPE_SOFT_COMMAND));
@@ -791,11 +779,76 @@ int octeon_setup_sc_buffer_pool(struct octeon_device *oct)
 	return 0;
 }
 
+int octeon_free_sc_done_list(struct octeon_device *oct)
+{
+	struct octeon_response_list *done_sc_list, *zombie_sc_list;
+	struct octeon_soft_command *sc;
+	struct list_head *tmp, *tmp2;
+	spinlock_t *sc_lists_lock; /* lock for response_list */
+
+	done_sc_list = &oct->response_list[OCTEON_DONE_SC_LIST];
+	zombie_sc_list = &oct->response_list[OCTEON_ZOMBIE_SC_LIST];
+
+	if (!atomic_read(&done_sc_list->pending_req_count))
+		return 0;
+
+	sc_lists_lock = &oct->response_list[OCTEON_ORDERED_SC_LIST].lock;
+
+	spin_lock_bh(sc_lists_lock);
+
+	list_for_each_safe(tmp, tmp2, &done_sc_list->head) {
+		sc = list_entry(tmp, struct octeon_soft_command, node);
+
+		if (READ_ONCE(sc->caller_is_done)) {
+			list_del(&sc->node);
+			atomic_dec(&done_sc_list->pending_req_count);
+
+			if (*sc->status_word == COMPLETION_WORD_INIT) {
+				/* timeout; move sc to zombie list */
+				list_add_tail(&sc->node, &zombie_sc_list->head);
+				atomic_inc(&zombie_sc_list->pending_req_count);
+			} else {
+				octeon_free_soft_command(oct, sc);
+			}
+		}
+	}
+
+	spin_unlock_bh(sc_lists_lock);
+
+	return 0;
+}
+
+int octeon_free_sc_zombie_list(struct octeon_device *oct)
+{
+	struct octeon_response_list *zombie_sc_list;
+	struct octeon_soft_command *sc;
+	struct list_head *tmp, *tmp2;
+	spinlock_t *sc_lists_lock; /* lock for response_list */
+
+	zombie_sc_list = &oct->response_list[OCTEON_ZOMBIE_SC_LIST];
+	sc_lists_lock = &oct->response_list[OCTEON_ORDERED_SC_LIST].lock;
+
+	spin_lock_bh(sc_lists_lock);
+
+	list_for_each_safe(tmp, tmp2, &zombie_sc_list->head) {
+		list_del(tmp);
+		atomic_dec(&zombie_sc_list->pending_req_count);
+		sc = list_entry(tmp, struct octeon_soft_command, node);
+		octeon_free_soft_command(oct, sc);
+	}
+
+	spin_unlock_bh(sc_lists_lock);
+
+	return 0;
+}
+
 int octeon_free_sc_buffer_pool(struct octeon_device *oct)
 {
 	struct list_head *tmp, *tmp2;
 	struct octeon_soft_command *sc;
 
+	octeon_free_sc_zombie_list(oct);
+
 	spin_lock_bh(&oct->sc_buf_pool.lock);
 
 	list_for_each_safe(tmp, tmp2, &oct->sc_buf_pool.head) {
@@ -824,6 +877,9 @@ struct octeon_soft_command *octeon_alloc_soft_command(struct octeon_device *oct,
 	struct octeon_soft_command *sc = NULL;
 	struct list_head *tmp;
 
+	if (!rdatasize)
+		rdatasize = 16;
+
 	WARN_ON((offset + datasize + rdatasize + ctxsize) >
 	       SOFT_COMMAND_BUFFER_SIZE);
 
diff --git a/drivers/net/ethernet/cavium/liquidio/response_manager.c b/drivers/net/ethernet/cavium/liquidio/response_manager.c
index fe5b537..ac7747c 100644
--- a/drivers/net/ethernet/cavium/liquidio/response_manager.c
+++ b/drivers/net/ethernet/cavium/liquidio/response_manager.c
@@ -69,6 +69,8 @@ int lio_process_ordered_list(struct octeon_device *octeon_dev,
 	u32 status;
 	u64 status64;
 
+	octeon_free_sc_done_list(octeon_dev);
+
 	ordered_sc_list = &octeon_dev->response_list[OCTEON_ORDERED_SC_LIST];
 
 	do {
@@ -111,26 +113,88 @@ int lio_process_ordered_list(struct octeon_device *octeon_dev,
 					}
 				}
 			}
-		} else if (force_quit || (sc->timeout &&
-			time_after(jiffies, (unsigned long)sc->timeout))) {
-			dev_err(&octeon_dev->pci_dev->dev, "%s: cmd failed, timeout (%ld, %ld)\n",
-				__func__, (long)jiffies, (long)sc->timeout);
+		} else if (unlikely(force_quit) || (sc->expiry_time &&
+			time_after(jiffies, (unsigned long)sc->expiry_time))) {
+			struct octeon_instr_irh *irh =
+				(struct octeon_instr_irh *)&sc->cmd.cmd3.irh;
+
+			dev_err(&octeon_dev->pci_dev->dev, "%s: ", __func__);
+			dev_err(&octeon_dev->pci_dev->dev,
+				"cmd %x/%x/%llx/%llx failed, ",
+				irh->opcode, irh->subcode,
+				sc->cmd.cmd3.ossp[0], sc->cmd.cmd3.ossp[1]);
+			dev_err(&octeon_dev->pci_dev->dev,
+				"timeout (%ld, %ld)\n",
+				(long)jiffies, (long)sc->expiry_time);
 			status = OCTEON_REQUEST_TIMEOUT;
 		}
 
 		if (status != OCTEON_REQUEST_PENDING) {
+			sc->sc_status = status;
+
 			/* we have received a response or we have timed out */
 			/* remove node from linked list */
 			list_del(&sc->node);
 			atomic_dec(&octeon_dev->response_list
-					  [OCTEON_ORDERED_SC_LIST].
-					  pending_req_count);
-			spin_unlock_bh
-			    (&ordered_sc_list->lock);
+				   [OCTEON_ORDERED_SC_LIST].
+				   pending_req_count);
+
+			if (!sc->callback) {
+				atomic_inc(&octeon_dev->response_list
+					   [OCTEON_DONE_SC_LIST].
+					   pending_req_count);
+				list_add_tail(&sc->node,
+					      &octeon_dev->response_list
+					      [OCTEON_DONE_SC_LIST].head);
+
+				if (unlikely(READ_ONCE(sc->caller_is_done))) {
+					/* caller does not wait for response
+					 * from firmware
+					 */
+					if (status != OCTEON_REQUEST_DONE) {
+						struct octeon_instr_irh *irh;
+
+						irh =
+						    (struct octeon_instr_irh *)
+						    &sc->cmd.cmd3.irh;
+						dev_dbg
+						    (&octeon_dev->pci_dev->dev,
+						    "%s: sc failed: opcode=%x, ",
+						    __func__, irh->opcode);
+						dev_dbg
+						    (&octeon_dev->pci_dev->dev,
+						    "subcode=%x, ossp[0]=%llx, ",
+						    irh->subcode,
+						    sc->cmd.cmd3.ossp[0]);
+						dev_dbg
+						    (&octeon_dev->pci_dev->dev,
+						    "ossp[1]=%llx, status=%d\n",
+						    sc->cmd.cmd3.ossp[1],
+						    status);
+					}
+				} else {
+					complete(&sc->complete);
+				}
+
+				spin_unlock_bh(&ordered_sc_list->lock);
+			} else {
+				/* sc with callback function */
+				if (status == OCTEON_REQUEST_TIMEOUT) {
+					atomic_inc(&octeon_dev->response_list
+						   [OCTEON_ZOMBIE_SC_LIST].
+						   pending_req_count);
+					list_add_tail(&sc->node,
+						      &octeon_dev->response_list
+						      [OCTEON_ZOMBIE_SC_LIST].
+						      head);
+				}
+
+				spin_unlock_bh(&ordered_sc_list->lock);
 
-			if (sc->callback)
 				sc->callback(octeon_dev, status,
 					     sc->callback_arg);
+				/* sc is freed by caller */
+			}
 
 			request_complete++;
 
diff --git a/drivers/net/ethernet/cavium/liquidio/response_manager.h b/drivers/net/ethernet/cavium/liquidio/response_manager.h
index 9169c28..ed4020d 100644
--- a/drivers/net/ethernet/cavium/liquidio/response_manager.h
+++ b/drivers/net/ethernet/cavium/liquidio/response_manager.h
@@ -53,7 +53,9 @@ enum {
 	OCTEON_ORDERED_LIST = 0,
 	OCTEON_UNORDERED_NONBLOCKING_LIST = 1,
 	OCTEON_UNORDERED_BLOCKING_LIST = 2,
-	OCTEON_ORDERED_SC_LIST = 3
+	OCTEON_ORDERED_SC_LIST = 3,
+	OCTEON_DONE_SC_LIST = 4,
+	OCTEON_ZOMBIE_SC_LIST = 5
 };
 
 /** Response Order values for a Octeon Request. */
-- 
2.9.0

^ permalink raw reply related

* [PATCH net-next 2/4] liquidio: make soft command calls synchronous
From: Felix Manlunas @ 2018-08-29  1:51 UTC (permalink / raw)
  To: davem
  Cc: netdev, raghu.vatsavayi, derek.chickles, satananda.burla,
	felix.manlunas, weilin.chang
In-Reply-To: <20180829015058.GA7898@felix-thinkpad.cavium.com>

1. Add wait_for_sc_completion_timeout() to wait for the response and
   handle common response errors.
2. Send sc's synchronously: remove the unused callback functions
   and context structures; use wait_for_sc_completion_timeout() to wait
   for the response.
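
As a rough userspace model of the resulting ownership rule: when the wait
fails, the helper presumably marks the sc as done on the caller's behalf,
so the caller can just return; on success the caller reads the response
and sets caller_is_done itself. All names and the error codes below are
assumptions for illustration, not the driver's actual helper.

```c
#include <errno.h>
#include <stdbool.h>

/* Possible outcomes of the (modelled) wait on the completion. */
enum wait_result { WAIT_COMPLETED, WAIT_TIMED_OUT, WAIT_INTERRUPTED };

/* Status recorded by the response thread for the modelled sc. */
enum sync_sc_status { SYNC_SC_DONE = 0, SYNC_SC_PENDING, SYNC_SC_FW_TIMEOUT };

struct sync_sc {
	enum sync_sc_status sc_status;
	bool caller_is_done;
};

/* Hypothetical stand-in for the wait helper: map the wait outcome to an
 * error code and, on any failure, hand the sc back to the response
 * thread by setting caller_is_done.
 */
static int model_wait_for_sc(struct sync_sc *sc, enum wait_result res)
{
	int err = 0;

	if (res == WAIT_TIMED_OUT)
		err = -ETIMEDOUT;
	else if (res == WAIT_INTERRUPTED)
		err = -EINTR;
	else if (sc->sc_status == SYNC_SC_FW_TIMEOUT)
		err = -EBUSY;

	if (err)
		sc->caller_is_done = true;

	return err;
}
```

On the success path caller_is_done stays false until the caller has
finished reading the response buffer, which is what keeps the response
thread from freeing the sc underneath it.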

Signed-off-by: Weilin Chang <weilin.chang@cavium.com>
Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
---
 drivers/net/ethernet/cavium/liquidio/lio_core.c    | 134 ++++++---------------
 drivers/net/ethernet/cavium/liquidio/lio_main.c    |  42 ++-----
 drivers/net/ethernet/cavium/liquidio/lio_vf_rep.c  |  42 ++-----
 drivers/net/ethernet/cavium/liquidio/octeon_main.h |  66 ++++++++++
 .../net/ethernet/cavium/liquidio/octeon_network.h  |   6 -
 5 files changed, 129 insertions(+), 161 deletions(-)

diff --git a/drivers/net/ethernet/cavium/liquidio/lio_core.c b/drivers/net/ethernet/cavium/liquidio/lio_core.c
index 8093c5e..822ce0f 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_core.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_core.c
@@ -1333,8 +1333,6 @@ octnet_nic_stats_callback(struct octeon_device *oct_dev,
 	struct octeon_soft_command *sc = (struct octeon_soft_command *)ptr;
 	struct oct_nic_stats_resp *resp =
 	    (struct oct_nic_stats_resp *)sc->virtrptr;
-	struct oct_nic_stats_ctrl *ctrl =
-	    (struct oct_nic_stats_ctrl *)sc->ctxptr;
 	struct nic_rx_stats *rsp_rstats = &resp->stats.fromwire;
 	struct nic_tx_stats *rsp_tstats = &resp->stats.fromhost;
 	struct nic_rx_stats *rstats = &oct_dev->link_stats.fromwire;
@@ -1424,7 +1422,6 @@ octnet_nic_stats_callback(struct octeon_device *oct_dev,
 	} else {
 		resp->status = -1;
 	}
-	complete(&ctrl->complete);
 }
 
 int octnet_get_link_stats(struct net_device *netdev)
@@ -1432,7 +1429,6 @@ int octnet_get_link_stats(struct net_device *netdev)
 	struct lio *lio = GET_LIO(netdev);
 	struct octeon_device *oct_dev = lio->oct_dev;
 	struct octeon_soft_command *sc;
-	struct oct_nic_stats_ctrl *ctrl;
 	struct oct_nic_stats_resp *resp;
 	int retval;
 
@@ -1441,7 +1437,7 @@ int octnet_get_link_stats(struct net_device *netdev)
 		octeon_alloc_soft_command(oct_dev,
 					  0,
 					  sizeof(struct oct_nic_stats_resp),
-					  sizeof(struct octnic_ctrl_pkt));
+					  0);
 
 	if (!sc)
 		return -ENOMEM;
@@ -1449,66 +1445,39 @@ int octnet_get_link_stats(struct net_device *netdev)
 	resp = (struct oct_nic_stats_resp *)sc->virtrptr;
 	memset(resp, 0, sizeof(struct oct_nic_stats_resp));
 
-	ctrl = (struct oct_nic_stats_ctrl *)sc->ctxptr;
-	memset(ctrl, 0, sizeof(struct oct_nic_stats_ctrl));
-	ctrl->netdev = netdev;
-	init_completion(&ctrl->complete);
+	init_completion(&sc->complete);
+	sc->sc_status = OCTEON_REQUEST_PENDING;
 
 	sc->iq_no = lio->linfo.txpciq[0].s.q_no;
 
 	octeon_prepare_soft_command(oct_dev, sc, OPCODE_NIC,
 				    OPCODE_NIC_PORT_STATS, 0, 0, 0);
 
-	sc->callback = octnet_nic_stats_callback;
-	sc->callback_arg = sc;
-	sc->wait_time = 500;	/*in milli seconds*/
-
 	retval = octeon_send_soft_command(oct_dev, sc);
 	if (retval == IQ_SEND_FAILED) {
 		octeon_free_soft_command(oct_dev, sc);
 		return -EINVAL;
 	}
 
-	wait_for_completion_timeout(&ctrl->complete, msecs_to_jiffies(1000));
-
-	if (resp->status != 1) {
-		octeon_free_soft_command(oct_dev, sc);
-
-		return -EINVAL;
+	retval = wait_for_sc_completion_timeout(oct_dev, sc,
+						(2 * LIO_SC_MAX_TMO_MS));
+	if (retval)  {
+		dev_err(&oct_dev->pci_dev->dev, "sc OPCODE_NIC_PORT_STATS command failed\n");
+		return retval;
 	}
 
-	octeon_free_soft_command(oct_dev, sc);
+	octnet_nic_stats_callback(oct_dev, sc->sc_status, sc);
+	WRITE_ONCE(sc->caller_is_done, true);
 
 	return 0;
 }
 
-static void liquidio_nic_seapi_ctl_callback(struct octeon_device *oct,
-					    u32 status,
-					    void *buf)
-{
-	struct liquidio_nic_seapi_ctl_context *ctx;
-	struct octeon_soft_command *sc = buf;
-
-	ctx = sc->ctxptr;
-
-	oct = lio_get_device(ctx->octeon_id);
-	if (status) {
-		dev_err(&oct->pci_dev->dev, "%s: instruction failed. Status: %llx\n",
-			__func__,
-			CVM_CAST64(status));
-	}
-	ctx->status = status;
-	complete(&ctx->complete);
-}
-
 int liquidio_set_speed(struct lio *lio, int speed)
 {
-	struct liquidio_nic_seapi_ctl_context *ctx;
 	struct octeon_device *oct = lio->oct_dev;
 	struct oct_nic_seapi_resp *resp;
 	struct octeon_soft_command *sc;
 	union octnet_cmd *ncmd;
-	u32 ctx_size;
 	int retval;
 	u32 var;
 
@@ -1521,21 +1490,18 @@ int liquidio_set_speed(struct lio *lio, int speed)
 		return -EOPNOTSUPP;
 	}
 
-	ctx_size = sizeof(struct liquidio_nic_seapi_ctl_context);
 	sc = octeon_alloc_soft_command(oct, OCTNET_CMD_SIZE,
 				       sizeof(struct oct_nic_seapi_resp),
-				       ctx_size);
+				       0);
 	if (!sc)
 		return -ENOMEM;
 
 	ncmd = sc->virtdptr;
-	ctx  = sc->ctxptr;
 	resp = sc->virtrptr;
 	memset(resp, 0, sizeof(struct oct_nic_seapi_resp));
 
-	ctx->octeon_id = lio_get_device_id(oct);
-	ctx->status = 0;
-	init_completion(&ctx->complete);
+	init_completion(&sc->complete);
+	sc->sc_status = OCTEON_REQUEST_PENDING;
 
 	ncmd->u64 = 0;
 	ncmd->s.cmd = SEAPI_CMD_SPEED_SET;
@@ -1548,30 +1514,24 @@ int liquidio_set_speed(struct lio *lio, int speed)
 	octeon_prepare_soft_command(oct, sc, OPCODE_NIC,
 				    OPCODE_NIC_UBOOT_CTL, 0, 0, 0);
 
-	sc->callback = liquidio_nic_seapi_ctl_callback;
-	sc->callback_arg = sc;
-	sc->wait_time = 5000;
-
 	retval = octeon_send_soft_command(oct, sc);
 	if (retval == IQ_SEND_FAILED) {
 		dev_info(&oct->pci_dev->dev, "Failed to send soft command\n");
+		octeon_free_soft_command(oct, sc);
 		retval = -EBUSY;
 	} else {
 		/* Wait for response or timeout */
-		if (wait_for_completion_timeout(&ctx->complete,
-						msecs_to_jiffies(10000)) == 0) {
-			dev_err(&oct->pci_dev->dev, "%s: sc timeout\n",
-				__func__);
-			octeon_free_soft_command(oct, sc);
-			return -EINTR;
-		}
+		retval = wait_for_sc_completion_timeout(oct, sc, 0);
+		if (retval)
+			return retval;
 
 		retval = resp->status;
 
 		if (retval) {
 			dev_err(&oct->pci_dev->dev, "%s failed, retval=%d\n",
 				__func__, retval);
-			octeon_free_soft_command(oct, sc);
+			WRITE_ONCE(sc->caller_is_done, true);
+
 			return -EIO;
 		}
 
@@ -1583,38 +1543,32 @@ int liquidio_set_speed(struct lio *lio, int speed)
 		}
 
 		oct->speed_setting = var;
+		WRITE_ONCE(sc->caller_is_done, true);
 	}
 
-	octeon_free_soft_command(oct, sc);
-
 	return retval;
 }
 
 int liquidio_get_speed(struct lio *lio)
 {
-	struct liquidio_nic_seapi_ctl_context *ctx;
 	struct octeon_device *oct = lio->oct_dev;
 	struct oct_nic_seapi_resp *resp;
 	struct octeon_soft_command *sc;
 	union octnet_cmd *ncmd;
-	u32 ctx_size;
 	int retval;
 
-	ctx_size = sizeof(struct liquidio_nic_seapi_ctl_context);
 	sc = octeon_alloc_soft_command(oct, OCTNET_CMD_SIZE,
 				       sizeof(struct oct_nic_seapi_resp),
-				       ctx_size);
+				       0);
 	if (!sc)
 		return -ENOMEM;
 
 	ncmd = sc->virtdptr;
-	ctx  = sc->ctxptr;
 	resp = sc->virtrptr;
 	memset(resp, 0, sizeof(struct oct_nic_seapi_resp));
 
-	ctx->octeon_id = lio_get_device_id(oct);
-	ctx->status = 0;
-	init_completion(&ctx->complete);
+	init_completion(&sc->complete);
+	sc->sc_status = OCTEON_REQUEST_PENDING;
 
 	ncmd->u64 = 0;
 	ncmd->s.cmd = SEAPI_CMD_SPEED_GET;
@@ -1626,37 +1580,20 @@ int liquidio_get_speed(struct lio *lio)
 	octeon_prepare_soft_command(oct, sc, OPCODE_NIC,
 				    OPCODE_NIC_UBOOT_CTL, 0, 0, 0);
 
-	sc->callback = liquidio_nic_seapi_ctl_callback;
-	sc->callback_arg = sc;
-	sc->wait_time = 5000;
-
 	retval = octeon_send_soft_command(oct, sc);
 	if (retval == IQ_SEND_FAILED) {
 		dev_info(&oct->pci_dev->dev, "Failed to send soft command\n");
-		oct->no_speed_setting = 1;
-		oct->speed_setting = 25;
-
-		retval = -EBUSY;
+		octeon_free_soft_command(oct, sc);
+		retval = -EIO;
 	} else {
-		if (wait_for_completion_timeout(&ctx->complete,
-						msecs_to_jiffies(10000)) == 0) {
-			dev_err(&oct->pci_dev->dev, "%s: sc timeout\n",
-				__func__);
-
-			oct->speed_setting = 25;
-			oct->no_speed_setting = 1;
+		retval = wait_for_sc_completion_timeout(oct, sc, 0);
+		if (retval)
+			return retval;
 
-			octeon_free_soft_command(oct, sc);
-
-			return -EINTR;
-		}
 		retval = resp->status;
 		if (retval) {
 			dev_err(&oct->pci_dev->dev,
 				"%s failed retval=%d\n", __func__, retval);
-			oct->no_speed_setting = 1;
-			oct->speed_setting = 25;
-			octeon_free_soft_command(oct, sc);
 			retval = -EIO;
 		} else {
 			u32 var;
@@ -1664,16 +1601,23 @@ int liquidio_get_speed(struct lio *lio)
 			var = be32_to_cpu((__force __be32)resp->speed);
 			oct->speed_setting = var;
 			if (var == 0xffff) {
-				oct->no_speed_setting = 1;
 				/* unable to access boot variables
 				 * get the default value based on the NIC type
 				 */
-				oct->speed_setting = 25;
+				if (oct->subsystem_id ==
+						OCTEON_CN2350_25GB_SUBSYS_ID ||
+				    oct->subsystem_id ==
+						OCTEON_CN2360_25GB_SUBSYS_ID) {
+					oct->no_speed_setting = 1;
+					oct->speed_setting = 25;
+				} else {
+					oct->speed_setting = 10;
+				}
 			}
+
 		}
+		WRITE_ONCE(sc->caller_is_done, true);
 	}
 
-	octeon_free_soft_command(oct, sc);
-
 	return retval;
 }
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_main.c b/drivers/net/ethernet/cavium/liquidio/lio_main.c
index 6663749..8ddc191 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_main.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_main.c
@@ -2969,30 +2969,15 @@ static int liquidio_get_vf_config(struct net_device *netdev, int vfidx,
 	return 0;
 }
 
-static void trusted_vf_callback(struct octeon_device *oct_dev,
-				u32 status, void *ptr)
-{
-	struct octeon_soft_command *sc = (struct octeon_soft_command *)ptr;
-	struct lio_trusted_vf_ctx *ctx;
-
-	ctx = (struct lio_trusted_vf_ctx *)sc->ctxptr;
-	ctx->status = status;
-
-	complete(&ctx->complete);
-}
-
 static int liquidio_send_vf_trust_cmd(struct lio *lio, int vfidx, bool trusted)
 {
 	struct octeon_device *oct = lio->oct_dev;
-	struct lio_trusted_vf_ctx *ctx;
 	struct octeon_soft_command *sc;
-	int ctx_size, retval;
-
-	ctx_size = sizeof(struct lio_trusted_vf_ctx);
-	sc = octeon_alloc_soft_command(oct, 0, 0, ctx_size);
+	int retval;
 
-	ctx  = (struct lio_trusted_vf_ctx *)sc->ctxptr;
-	init_completion(&ctx->complete);
+	sc = octeon_alloc_soft_command(oct, 0, 16, 0);
+	if (!sc)
+		return -ENOMEM;
 
 	sc->iq_no = lio->linfo.txpciq[0].s.q_no;
 
@@ -3001,23 +2986,21 @@ static int liquidio_send_vf_trust_cmd(struct lio *lio, int vfidx, bool trusted)
 				    OPCODE_NIC_SET_TRUSTED_VF, 0, vfidx + 1,
 				    trusted);
 
-	sc->callback = trusted_vf_callback;
-	sc->callback_arg = sc;
-	sc->wait_time = 1000;
+	init_completion(&sc->complete);
+	sc->sc_status = OCTEON_REQUEST_PENDING;
 
 	retval = octeon_send_soft_command(oct, sc);
 	if (retval == IQ_SEND_FAILED) {
+		octeon_free_soft_command(oct, sc);
 		retval = -1;
 	} else {
 		/* Wait for response or timeout */
-		if (wait_for_completion_timeout(&ctx->complete,
-						msecs_to_jiffies(2000)))
-			retval = ctx->status;
-		else
-			retval = -1;
-	}
+		retval = wait_for_sc_completion_timeout(oct, sc, 0);
+		if (retval)
+			return retval;
 
-	octeon_free_soft_command(oct, sc);
+		WRITE_ONCE(sc->caller_is_done, true);
+	}
 
 	return retval;
 }
@@ -3733,7 +3716,6 @@ static int setup_nic_devices(struct octeon_device *octeon_dev)
 			octeon_dev->speed_setting = 10;
 		}
 		octeon_dev->speed_boot = octeon_dev->speed_setting;
-
 	}
 
 	devlink = devlink_alloc(&liquidio_devlink_ops,
diff --git a/drivers/net/ethernet/cavium/liquidio/lio_vf_rep.c b/drivers/net/ethernet/cavium/liquidio/lio_vf_rep.c
index ddd7431..dfd4d10 100644
--- a/drivers/net/ethernet/cavium/liquidio/lio_vf_rep.c
+++ b/drivers/net/ethernet/cavium/liquidio/lio_vf_rep.c
@@ -49,44 +49,25 @@ static const struct net_device_ops lio_vf_rep_ndev_ops = {
 	.ndo_change_mtu = lio_vf_rep_change_mtu,
 };
 
-static void
-lio_vf_rep_send_sc_complete(struct octeon_device *oct,
-			    u32 status, void *ptr)
-{
-	struct octeon_soft_command *sc = (struct octeon_soft_command *)ptr;
-	struct lio_vf_rep_sc_ctx *ctx =
-		(struct lio_vf_rep_sc_ctx *)sc->ctxptr;
-	struct lio_vf_rep_resp *resp =
-		(struct lio_vf_rep_resp *)sc->virtrptr;
-
-	if (status != OCTEON_REQUEST_TIMEOUT && READ_ONCE(resp->status))
-		WRITE_ONCE(resp->status, 0);
-
-	complete(&ctx->complete);
-}
-
 static int
 lio_vf_rep_send_soft_command(struct octeon_device *oct,
 			     void *req, int req_size,
 			     void *resp, int resp_size)
 {
 	int tot_resp_size = sizeof(struct lio_vf_rep_resp) + resp_size;
-	int ctx_size = sizeof(struct lio_vf_rep_sc_ctx);
 	struct octeon_soft_command *sc = NULL;
 	struct lio_vf_rep_resp *rep_resp;
-	struct lio_vf_rep_sc_ctx *ctx;
 	void *sc_req;
 	int err;
 
 	sc = (struct octeon_soft_command *)
 		octeon_alloc_soft_command(oct, req_size,
-					  tot_resp_size, ctx_size);
+					  tot_resp_size, 0);
 	if (!sc)
 		return -ENOMEM;
 
-	ctx = (struct lio_vf_rep_sc_ctx *)sc->ctxptr;
-	memset(ctx, 0, ctx_size);
-	init_completion(&ctx->complete);
+	init_completion(&sc->complete);
+	sc->sc_status = OCTEON_REQUEST_PENDING;
 
 	sc_req = (struct lio_vf_rep_req *)sc->virtdptr;
 	memcpy(sc_req, req, req_size);
@@ -98,23 +79,24 @@ lio_vf_rep_send_soft_command(struct octeon_device *oct,
 	sc->iq_no = 0;
 	octeon_prepare_soft_command(oct, sc, OPCODE_NIC,
 				    OPCODE_NIC_VF_REP_CMD, 0, 0, 0);
-	sc->callback = lio_vf_rep_send_sc_complete;
-	sc->callback_arg = sc;
-	sc->wait_time = LIO_VF_REP_REQ_TMO_MS;
 
 	err = octeon_send_soft_command(oct, sc);
 	if (err == IQ_SEND_FAILED)
 		goto free_buff;
 
-	wait_for_completion_timeout(&ctx->complete,
-				    msecs_to_jiffies
-				    (2 * LIO_VF_REP_REQ_TMO_MS));
+	err = wait_for_sc_completion_timeout(oct, sc, 0);
+	if (err)
+		return err;
+
 	err = READ_ONCE(rep_resp->status) ? -EBUSY : 0;
 	if (err)
 		dev_err(&oct->pci_dev->dev, "VF rep send config failed\n");
-
-	if (resp)
+	else if (resp)
 		memcpy(resp, (rep_resp + 1), resp_size);
+
+	WRITE_ONCE(sc->caller_is_done, true);
+	return err;
+
 free_buff:
 	octeon_free_soft_command(oct, sc);
 
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_main.h b/drivers/net/ethernet/cavium/liquidio/octeon_main.h
index c846eec..de2a229 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_main.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_main.h
@@ -188,6 +188,72 @@ sleep_timeout_cond(wait_queue_head_t *wait_queue,
 	remove_wait_queue(wait_queue, &we);
 }
 
+/* input parameter:
+ * sc: pointer to a soft request
+ * timeout: time in milliseconds that the caller wants to wait for
+ *	    the response to the request.
+ *          0: the request waits until its response comes back
+ *	       from the firmware within LIO_SC_MAX_TMO_MS milliseconds.
+ *	       If the response does not return within
+ *	       LIO_SC_MAX_TMO_MS milliseconds, lio_process_ordered_list()
+ *	       will move the request to the zombie response list.
+ *
+ * return value:
+ * 0: got the response from firmware for the sc request.
+ * errno -EINTR: the user aborted the command.
+ * errno -ETIME: the user-specified timeout has expired.
+ * errno -EBUSY: the response to the request did not return in
+ *               reasonable time (LIO_SC_MAX_TMO_MS).
+ *               the sc will be moved to the zombie response list by
+ *               lio_process_ordered_list()
+ *
+ * For a request with a non-zero return value, sc->caller_is_done
+ *  is set to true by this function.
+ * For a request with a zero return value, the requestor
+ *  should set sc->caller_is_done to true after examining the
+ *  response of sc.
+ * lio_process_ordered_list() will free the soft command on behalf
+ * of the soft command requestor.
+ * This fixes a possible race condition in which both the timeout
+ * path and lio_process_ordered_list()/the callback function free
+ * the same sc structure.
+ */
+static inline int
+wait_for_sc_completion_timeout(struct octeon_device *oct_dev,
+			       struct octeon_soft_command *sc,
+			       unsigned long timeout)
+{
+	int errno = 0;
+	long timeout_jiff;
+
+	if (timeout)
+		timeout_jiff = msecs_to_jiffies(timeout);
+	else
+		timeout_jiff = MAX_SCHEDULE_TIMEOUT;
+
+	timeout_jiff =
+		wait_for_completion_interruptible_timeout(&sc->complete,
+							  timeout_jiff);
+	if (timeout_jiff == 0) {
+		dev_err(&oct_dev->pci_dev->dev, "%s: sc timed out\n",
+			__func__);
+		WRITE_ONCE(sc->caller_is_done, true);
+		errno = -ETIME;
+	} else if (timeout_jiff == -ERESTARTSYS) {
+		dev_err(&oct_dev->pci_dev->dev, "%s: sc was interrupted\n",
+			__func__);
+		WRITE_ONCE(sc->caller_is_done, true);
+		errno = -EINTR;
+	} else if (sc->sc_status == OCTEON_REQUEST_TIMEOUT) {
+		dev_err(&oct_dev->pci_dev->dev, "%s: sc has fatal timeout\n",
+			__func__);
+		WRITE_ONCE(sc->caller_is_done, true);
+		errno = -EBUSY;
+	}
+
+	return errno;
+}
+
 #ifndef ROUNDUP4
 #define ROUNDUP4(val) (((val) + 3) & 0xfffffffc)
 #endif
diff --git a/drivers/net/ethernet/cavium/liquidio/octeon_network.h b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
index d7a3916..a62826a 100644
--- a/drivers/net/ethernet/cavium/liquidio/octeon_network.h
+++ b/drivers/net/ethernet/cavium/liquidio/octeon_network.h
@@ -87,12 +87,6 @@ struct oct_nic_seapi_resp {
 	u64 status;
 };
 
-struct liquidio_nic_seapi_ctl_context {
-	int octeon_id;
-	u32 status;
-	struct completion complete;
-};
-
 /** LiquidIO per-interface network private data */
 struct lio {
 	/** State of the interface. Rx/Tx happens only in the RUNNING state.  */
-- 
2.9.0
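
For readers following the ownership change, here is a minimal userspace
sketch of the caller_is_done handoff that the comment above
wait_for_sc_completion_timeout() describes. It is not driver code: struct
soft_cmd, sc_alloc(), sc_wait(), sc_reap() and live_cmds are hypothetical
stand-ins, and the zombie response list is reduced to "not reaped yet".
The only point it illustrates is that the requestor never frees a command
once it has been sent; it only sets the flag, and a single reaper frees.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the driver's types, not the real definitions. */
enum sc_status { SC_PENDING, SC_DONE, SC_TIMEOUT };

struct soft_cmd {
	enum sc_status status;	/* written by the response-processing side */
	bool caller_is_done;	/* written by the requestor when it is finished */
	int response;		/* payload the requestor reads on success */
};

static int live_cmds;		/* number of commands not yet freed */

static struct soft_cmd *sc_alloc(void)
{
	struct soft_cmd *sc = calloc(1, sizeof(*sc));

	if (sc) {
		sc->status = SC_PENDING;
		live_cmds++;
	}
	return sc;
}

/* Requestor side. gave_up models the caller's own timeout expiring.
 * On any non-zero return the flag is already set, so the caller must
 * not touch sc again. On 0 the caller reads the response and then sets
 * caller_is_done itself.
 */
static int sc_wait(struct soft_cmd *sc, bool gave_up)
{
	assert(sc);
	if (gave_up) {
		sc->caller_is_done = true;
		return -ETIME;
	}
	if (sc->status == SC_TIMEOUT) {
		sc->caller_is_done = true;
		return -EBUSY;
	}
	return 0;
}

/* Reaper side: the only place a sent command is freed. Returns true if
 * the command was freed, false if it must stay queued for a later pass.
 */
static bool sc_reap(struct soft_cmd *sc)
{
	assert(sc);
	if (sc->status == SC_PENDING || !sc->caller_is_done)
		return false;
	free(sc);
	live_cmds--;
	return true;
}
```

With a single owner of the free, the double-free window between the
timeout path and the completion path cannot occur in this model, at the
cost of a command staying allocated until the reaper's next pass.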
