Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v4 3/4] net: socket: simplify dev_ifconf handling
From: Jakub Kicinski @ 2020-11-24 20:52 UTC (permalink / raw)
  To: Arnd Bergmann; +Cc: netdev, Arnd Bergmann
In-Reply-To: <20201124151828.169152-4-arnd@kernel.org>

On Tue, 24 Nov 2020 16:18:27 +0100 Arnd Bergmann wrote:
> From: Arnd Bergmann <arnd@arndb.de>
> 
> The dev_ifconf() calling conventions make compat handling
> more complicated than necessary, simplify this by moving
> the in_compat_syscall() check into the function.
> The implementation can be simplified further, based on the
> knowledge that the dynamic registration is only ever used
> for IPv4.

Looks like this one breaks bisection (/breaks build which patch 4 then
fixes):


In file included from ../arch/x86/include/asm/sigframe.h:8,
                 from ../arch/x86/kernel/asm-offsets.c:17:
../include/linux/compat.h:348:29: error: field ‘ifru_settings’ has incomplete type
  348 |   struct compat_if_settings ifru_settings;
      |                             ^~~~~~~~~~~~~
../include/linux/compat.h:352:8: error: redefinition of ‘struct compat_ifconf’
  352 | struct compat_ifconf {
      |        ^~~~~~~~~~~~~
../include/linux/compat.h:108:8: note: originally defined here
  108 | struct compat_ifconf {
      |        ^~~~~~~~~~~~~
make[2]: *** [arch/x86/kernel/asm-offsets.s] Error 1
make[1]: *** [prepare0] Error 2
make: *** [__sub-make] Error 2

^ permalink raw reply

* Re: VRF NS for lladdr sent on the wrong interface
From: David Ahern @ 2020-11-24 20:43 UTC (permalink / raw)
  To: Stephen Suryaputra, netdev
In-Reply-To: <20201124002345.GA42222@ubuntu>

On 11/23/20 5:23 PM, Stephen Suryaputra wrote:
> Hi,
> 
> I'm running into a problem with lladdr pinging all-host mcast all nodes
> addr. The ping intially works but after cycling the interface that
> receives the ping, the echo request packet causes a neigh solicitation
> being sent on a different interface.
> 
> To repro, I included the attached namespace scripts. This is the
> topology and an output of my test.

Can you run your test script on 4.14-4.17 kernel? I am wondering if the
changes in 4.19-next changed this behavior.



^ permalink raw reply

* [PATCH net] gro_cells: reduce number of synchronize_net() calls
From: Eric Dumazet @ 2020-11-24 20:38 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski; +Cc: netdev, Eric Dumazet, Eric Dumazet

From: Eric Dumazet <edumazet@google.com>

After cited commit, gro_cells_destroy() became damn slow
on hosts with a lot of cores.

This is because we have one additional synchronize_net() per cpu as
stated in the changelog.

gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough
to not have one synchronize_net() call per netif_napi_del()

We can factorize all the synchronize_net() to a single one,
right before freeing per-cpu memory.

Fixes: 5198d545dba8 ("net: remove napi_hash_del() from driver-facing API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/gro_cells.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c
index e095fb871d9120787bfdf62149f4d82e0e3b0a51..6eb2e5ec2c5068e1d798557e55d084b785187a9b 100644
--- a/net/core/gro_cells.c
+++ b/net/core/gro_cells.c
@@ -99,9 +99,14 @@ void gro_cells_destroy(struct gro_cells *gcells)
 		struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
 
 		napi_disable(&cell->napi);
-		netif_napi_del(&cell->napi);
+		__netif_napi_del(&cell->napi);
 		__skb_queue_purge(&cell->napi_skbs);
 	}
+	/* This barrier is needed because netpoll could access dev->napi_list
+	 * under rcu protection.
+	 */
+	synchronize_net();
+
 	free_percpu(gcells->cells);
 	gcells->cells = NULL;
 }
-- 
2.29.2.454.gaff20da3a2-goog


^ permalink raw reply related

* Re: [PATCH net-next 1/3] dsa: add support for Arrow XRS700x tag trailer
From: George McCollister @ 2020-11-24 20:26 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Vivien Didelot, Vladimir Oltean, David S . Miller,
	netdev, open list:OPEN FIRMWARE AND...
In-Reply-To: <919273d3-6aa6-33b3-a8fe-d59ace9b1342@gmail.com>

On Mon, Nov 23, 2020 at 4:18 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>
>
> On 11/20/2020 10:16 AM, George McCollister wrote:
> > Add support for Arrow SpeedChips XRS700x single byte tag trailer. This
> > is modeled on tag_trailer.c which works in a similar way.
> >
> > Signed-off-by: George McCollister <george.mccollister@gmail.com>
>
> One question below:
>
> [snip]
>
> > +     if (pskb_trim_rcsum(skb, skb->len - 1))
> > +             return NULL;
> > +
> > +     /* Frame is forwarded by hardware, don't forward in software. */
> > +     skb->offload_fwd_mark = 1;
>
> Given the switch does not give you a forwarding reason, I suppose this
> is fine, but do you possibly have to qualify this against different
> source MAC addresses?

I don't see any reason why I'd need to do that but maybe I'm missing something.

> --
> Florian

^ permalink raw reply

* Re: [PATCH net-next 0/3] mvneta: access skb_shared_info only on last frag
From: Jakub Kicinski @ 2020-11-24 20:26 UTC (permalink / raw)
  To: Lorenzo Bianconi
  Cc: netdev, lorenzo.bianconi, davem, brouer, echaudro, john.fastabend
In-Reply-To: <cover.1605889258.git.lorenzo@kernel.org>

On Fri, 20 Nov 2020 18:05:41 +0100 Lorenzo Bianconi wrote:
> Build skb_shared_info on mvneta_rx_swbm stack and sync it to xdp_buff
> skb_shared_info area only on the last fragment.
> Avoid avoid unnecessary xdp_buff initialization in mvneta_rx_swbm routine.
> This a preliminary series to complete xdp multi-buff in mvneta driver.

Looks fine, but since you need this for XDP multi-buff it should
probably go via bpf-next, right?

Reviewed-by: Jakub Kicinski <kuba@kernel.org>

^ permalink raw reply

* Re: [PATCH net-next 00/10] mlxsw: Add support for blackhole nexthops
From: Jakub Kicinski @ 2020-11-24 20:19 UTC (permalink / raw)
  To: Ido Schimmel; +Cc: netdev, davem, jiri, dsahern, mlxsw, Ido Schimmel
In-Reply-To: <20201123071230.676469-1-idosch@idosch.org>

On Mon, 23 Nov 2020 09:12:20 +0200 Ido Schimmel wrote:
> This patch set adds support for blackhole nexthops in mlxsw. These
> nexthops are exactly the same as other nexthops, but instead of
> forwarding packets to an egress router interface (RIF), they are
> programmed to silently drop them.
> 
> Patches #1-#4 are preparations.
> 
> Patch #5 adds support for blackhole nexthops and removes the check that
> prevented them from being programmed.
> 
> Patch #6 adds a selftests over mlxsw which tests that blackhole nexthops
> can be programmed and are marked as offloaded.
> 
> Patch #7 extends the existing nexthop forwarding test to also test
> blackhole functionality.
> 
> Patches #8-#10 add support for a new packet trap ('blackhole_nexthop')
> which should be triggered whenever packets are dropped by a blackhole
> nexthop. Obviously, by default, the trap action is set to 'drop' so that
> dropped packets will not be reported.

Applied, thanks!

^ permalink raw reply

* Re: [PATCH net-next v4 5/7] dpaa_eth: add XDP_REDIRECT support
From: Maciej Fijalkowski @ 2020-11-24 20:07 UTC (permalink / raw)
  To: Camelia Groza
  Cc: kuba, brouer, saeed, davem, madalin.bucur, ioana.ciornei, netdev
In-Reply-To: <89e611f6ffed0a4604c5f70d0050ca5ac243c222.1606150838.git.camelia.groza@nxp.com>

On Mon, Nov 23, 2020 at 07:36:23PM +0200, Camelia Groza wrote:
> After transmission, the frame is returned on confirmation queues for
> cleanup. For this, store a backpointer to the xdp_frame in the private
> reserved area at the start of the TX buffer.
> 
> No TX batching support is implemented at this time.
> 
> Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
> Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
> ---
>  drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 48 +++++++++++++++++++++++++-
>  drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |  1 +
>  2 files changed, 48 insertions(+), 1 deletion(-)
> 

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

^ permalink raw reply

* Re: [PATCH net 00/17] rxrpc: Prelude to gssapi support
From: Jakub Kicinski @ 2020-11-24 20:08 UTC (permalink / raw)
  To: David Howells; +Cc: netdev, linux-afs, linux-kernel
In-Reply-To: <160616220405.830164.2239716599743995145.stgit@warthog.procyon.org.uk>

On Mon, 23 Nov 2020 20:10:04 +0000 David Howells wrote:
> Here are some patches that do some reorganisation of the security class
> handling in rxrpc to allow implementation of the RxGK security class that
> will allow AF_RXRPC to use GSSAPI-negotiated tokens and better crypto.  The
> RxGK security class is not included in this patchset.
> 
> It does the following things:
> 
>  (1) Add a keyrings patch to provide the original key description, as
>      provided to add_key(), to the payload preparser so that it can
>      interpret the content on that basis.  Unfortunately, the rxrpc_s key
>      type wasn't written to interpret its payload as anything other than a
>      string of bytes comprising a key, but for RxGK, more information is
>      required as multiple Kerberos enctypes are supported.
> 
>  (2) Remove the rxk5 security class key parsing.  The rxk5 class never got
>      rolled out in OpenAFS and got replaced with rxgk.
> 
>  (3) Support the creation of rxrpc keys with multiple tokens of different
>      types.  If some types are not supported, the ENOPKG error is
>      suppressed if at least one other token's type is supported.
> 
>  (4) Punt the handling of server keys (rxrpc_s type) to the appropriate
>      security class.
> 
>  (5) Organise the security bits in the rxrpc_connection struct into a
>      union to make it easier to override for other classes.
> 
>  (6) Move some bits from core code into rxkad that won't be appropriate to
>      rxgk.

Pulled into net-next, thank you!

^ permalink raw reply

* Re: [PATCH net-next v4 4/7] dpaa_eth: add XDP_TX support
From: Maciej Fijalkowski @ 2020-11-24 19:52 UTC (permalink / raw)
  To: Camelia Groza
  Cc: kuba, brouer, saeed, davem, madalin.bucur, ioana.ciornei, netdev
In-Reply-To: <6491d6ba855c7e736383e7f603321fe7184681bc.1606150838.git.camelia.groza@nxp.com>

On Mon, Nov 23, 2020 at 07:36:22PM +0200, Camelia Groza wrote:
> Use an xdp_frame structure for managing the frame. Store a backpointer to
> the structure at the start of the buffer before enqueueing for cleanup
> on TX confirmation. Reserve DPAA_TX_PRIV_DATA_SIZE bytes from the frame
> size shared with the XDP program for this purpose. Use the XDP
> API for freeing the buffer when it returns to the driver on the TX
> confirmation path.
> 
> The frame queues are shared with the netstack.

Can you also provide the info from cover letter about locklessness (is
that even a word?) in here?

One question below and:

Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> 
> This approach will be reused for XDP REDIRECT.
> 
> Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
> Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
> ---
> Changes in v4:
> - call xdp_rxq_info_is_reg() before unregistering
> - minor cleanups (remove unneeded variable, print error code)
> - add more details in the commit message
> - did not call qman_destroy_fq() in case of xdp_rxq_info_reg() failure
> since it would lead to a double free of the fq resources
> 
>  drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 128 ++++++++++++++++++++++++-
>  drivers/net/ethernet/freescale/dpaa/dpaa_eth.h |   2 +
>  2 files changed, 125 insertions(+), 5 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> index ee076f4..0deffcc 100644
> --- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> +++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> @@ -1130,6 +1130,24 @@ static int dpaa_fq_init(struct dpaa_fq *dpaa_fq, bool td_enable)
> 
>  	dpaa_fq->fqid = qman_fq_fqid(fq);
> 
> +	if (dpaa_fq->fq_type == FQ_TYPE_RX_DEFAULT ||
> +	    dpaa_fq->fq_type == FQ_TYPE_RX_PCD) {
> +		err = xdp_rxq_info_reg(&dpaa_fq->xdp_rxq, dpaa_fq->net_dev,
> +				       dpaa_fq->fqid);
> +		if (err) {
> +			dev_err(dev, "xdp_rxq_info_reg() = %d\n", err);
> +			return err;
> +		}
> +
> +		err = xdp_rxq_info_reg_mem_model(&dpaa_fq->xdp_rxq,
> +						 MEM_TYPE_PAGE_ORDER0, NULL);
> +		if (err) {
> +			dev_err(dev, "xdp_rxq_info_reg_mem_model() = %d\n", err);
> +			xdp_rxq_info_unreg(&dpaa_fq->xdp_rxq);
> +			return err;
> +		}
> +	}
> +
>  	return 0;
>  }
> 
> @@ -1159,6 +1177,11 @@ static int dpaa_fq_free_entry(struct device *dev, struct qman_fq *fq)
>  		}
>  	}
> 
> +	if ((dpaa_fq->fq_type == FQ_TYPE_RX_DEFAULT ||
> +	     dpaa_fq->fq_type == FQ_TYPE_RX_PCD) &&
> +	    xdp_rxq_info_is_reg(&dpaa_fq->xdp_rxq))
> +		xdp_rxq_info_unreg(&dpaa_fq->xdp_rxq);
> +
>  	qman_destroy_fq(fq);
>  	list_del(&dpaa_fq->list);
> 
> @@ -1625,6 +1648,9 @@ static int dpaa_eth_refill_bpools(struct dpaa_priv *priv)
>   *
>   * Return the skb backpointer, since for S/G frames the buffer containing it
>   * gets freed here.
> + *
> + * No skb backpointer is set when transmitting XDP frames. Cleanup the buffer
> + * and return NULL in this case.
>   */
>  static struct sk_buff *dpaa_cleanup_tx_fd(const struct dpaa_priv *priv,
>  					  const struct qm_fd *fd, bool ts)
> @@ -1664,13 +1690,21 @@ static struct sk_buff *dpaa_cleanup_tx_fd(const struct dpaa_priv *priv,
>  		}
>  	} else {
>  		dma_unmap_single(priv->tx_dma_dev, addr,
> -				 priv->tx_headroom + qm_fd_get_length(fd),
> +				 qm_fd_get_offset(fd) + qm_fd_get_length(fd),
>  				 dma_dir);
>  	}
> 
>  	swbp = (struct dpaa_eth_swbp *)vaddr;
>  	skb = swbp->skb;
> 
> +	/* No skb backpointer is set when running XDP. An xdp_frame
> +	 * backpointer is saved instead.
> +	 */
> +	if (!skb) {
> +		xdp_return_frame(swbp->xdpf);
> +		return NULL;
> +	}
> +
>  	/* DMA unmapping is required before accessing the HW provided info */
>  	if (ts && priv->tx_tstamp &&
>  	    skb_shinfo(skb)->tx_flags & SKBTX_HW_TSTAMP) {
> @@ -2350,11 +2384,76 @@ static enum qman_cb_dqrr_result rx_error_dqrr(struct qman_portal *portal,
>  	return qman_cb_dqrr_consume;
>  }
> 
> +static int dpaa_xdp_xmit_frame(struct net_device *net_dev,
> +			       struct xdp_frame *xdpf)
> +{
> +	struct dpaa_priv *priv = netdev_priv(net_dev);
> +	struct rtnl_link_stats64 *percpu_stats;
> +	struct dpaa_percpu_priv *percpu_priv;
> +	struct dpaa_eth_swbp *swbp;
> +	struct netdev_queue *txq;
> +	void *buff_start;
> +	struct qm_fd fd;
> +	dma_addr_t addr;
> +	int err;
> +
> +	percpu_priv = this_cpu_ptr(priv->percpu_priv);
> +	percpu_stats = &percpu_priv->stats;
> +
> +	if (xdpf->headroom < DPAA_TX_PRIV_DATA_SIZE) {
> +		err = -EINVAL;
> +		goto out_error;
> +	}
> +
> +	buff_start = xdpf->data - xdpf->headroom;
> +
> +	/* Leave empty the skb backpointer at the start of the buffer.
> +	 * Save the XDP frame for easy cleanup on confirmation.
> +	 */
> +	swbp = (struct dpaa_eth_swbp *)buff_start;
> +	swbp->skb = NULL;
> +	swbp->xdpf = xdpf;
> +
> +	qm_fd_clear_fd(&fd);
> +	fd.bpid = FSL_DPAA_BPID_INV;
> +	fd.cmd |= cpu_to_be32(FM_FD_CMD_FCO);
> +	qm_fd_set_contig(&fd, xdpf->headroom, xdpf->len);
> +
> +	addr = dma_map_single(priv->tx_dma_dev, buff_start,
> +			      xdpf->headroom + xdpf->len,
> +			      DMA_TO_DEVICE);

Not sure if I asked that.  What is the purpose for including the headroom
in frame being set? I would expect to take into account only frame from
xdpf->data.

> +	if (unlikely(dma_mapping_error(priv->tx_dma_dev, addr))) {
> +		err = -EINVAL;
> +		goto out_error;
> +	}
> +
> +	qm_fd_addr_set64(&fd, addr);
> +
> +	/* Bump the trans_start */
> +	txq = netdev_get_tx_queue(net_dev, smp_processor_id());
> +	txq->trans_start = jiffies;
> +
> +	err = dpaa_xmit(priv, percpu_stats, smp_processor_id(), &fd);
> +	if (err) {
> +		dma_unmap_single(priv->tx_dma_dev, addr,
> +				 qm_fd_get_offset(&fd) + qm_fd_get_length(&fd),
> +				 DMA_TO_DEVICE);
> +		goto out_error;
> +	}
> +
> +	return 0;
> +
> +out_error:
> +	percpu_stats->tx_errors++;
> +	return err;
> +}
> +
>  static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
> -			unsigned int *xdp_meta_len)
> +			struct dpaa_fq *dpaa_fq, unsigned int *xdp_meta_len)
>  {
>  	ssize_t fd_off = qm_fd_get_offset(fd);
>  	struct bpf_prog *xdp_prog;
> +	struct xdp_frame *xdpf;
>  	struct xdp_buff xdp;
>  	u32 xdp_act;
> 
> @@ -2370,7 +2469,8 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
>  	xdp.data_meta = xdp.data;
>  	xdp.data_hard_start = xdp.data - XDP_PACKET_HEADROOM;
>  	xdp.data_end = xdp.data + qm_fd_get_length(fd);
> -	xdp.frame_sz = DPAA_BP_RAW_SIZE;
> +	xdp.frame_sz = DPAA_BP_RAW_SIZE - DPAA_TX_PRIV_DATA_SIZE;
> +	xdp.rxq = &dpaa_fq->xdp_rxq;
> 
>  	xdp_act = bpf_prog_run_xdp(xdp_prog, &xdp);
> 
> @@ -2381,6 +2481,22 @@ static u32 dpaa_run_xdp(struct dpaa_priv *priv, struct qm_fd *fd, void *vaddr,
>  	case XDP_PASS:
>  		*xdp_meta_len = xdp.data - xdp.data_meta;
>  		break;
> +	case XDP_TX:
> +		/* We can access the full headroom when sending the frame
> +		 * back out
> +		 */
> +		xdp.data_hard_start = vaddr;
> +		xdp.frame_sz = DPAA_BP_RAW_SIZE;
> +		xdpf = xdp_convert_buff_to_frame(&xdp);
> +		if (unlikely(!xdpf)) {
> +			free_pages((unsigned long)vaddr, 0);
> +			break;
> +		}
> +
> +		if (dpaa_xdp_xmit_frame(priv->net_dev, xdpf))
> +			xdp_return_frame_rx_napi(xdpf);
> +
> +		break;
>  	default:
>  		bpf_warn_invalid_xdp_action(xdp_act);
>  		fallthrough;
> @@ -2415,6 +2531,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
>  	u32 fd_status, hash_offset;
>  	struct qm_sg_entry *sgt;
>  	struct dpaa_bp *dpaa_bp;
> +	struct dpaa_fq *dpaa_fq;
>  	struct dpaa_priv *priv;
>  	struct sk_buff *skb;
>  	int *count_ptr;
> @@ -2423,9 +2540,10 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
>  	u32 hash;
>  	u64 ns;
> 
> +	dpaa_fq = container_of(fq, struct dpaa_fq, fq_base);
>  	fd_status = be32_to_cpu(fd->status);
>  	fd_format = qm_fd_get_format(fd);
> -	net_dev = ((struct dpaa_fq *)fq)->net_dev;
> +	net_dev = dpaa_fq->net_dev;
>  	priv = netdev_priv(net_dev);
>  	dpaa_bp = dpaa_bpid2pool(dq->fd.bpid);
>  	if (!dpaa_bp)
> @@ -2494,7 +2612,7 @@ static enum qman_cb_dqrr_result rx_default_dqrr(struct qman_portal *portal,
> 
>  	if (likely(fd_format == qm_fd_contig)) {
>  		xdp_act = dpaa_run_xdp(priv, (struct qm_fd *)fd, vaddr,
> -				       &xdp_meta_len);
> +				       dpaa_fq, &xdp_meta_len);
>  		if (xdp_act != XDP_PASS) {
>  			percpu_stats->rx_packets++;
>  			percpu_stats->rx_bytes += qm_fd_get_length(fd);
> diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
> index 94e8613..5c8d52a 100644
> --- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
> +++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.h
> @@ -68,6 +68,7 @@ struct dpaa_fq {
>  	u16 channel;
>  	u8 wq;
>  	enum dpaa_fq_type fq_type;
> +	struct xdp_rxq_info xdp_rxq;
>  };
> 
>  struct dpaa_fq_cbs {
> @@ -150,6 +151,7 @@ struct dpaa_buffer_layout {
>   */
>  struct dpaa_eth_swbp {
>  	struct sk_buff *skb;
> +	struct xdp_frame *xdpf;
>  };
> 
>  struct dpaa_priv {
> --
> 1.9.1
> 

^ permalink raw reply

* Re: [PATCH mlx5-next 11/16] net/mlx5: Add VDPA priority to NIC RX namespace
From: Jason Gunthorpe @ 2020-11-24 19:44 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: Saeed Mahameed, Eli Cohen, Leon Romanovsky, netdev, linux-rdma,
	Eli Cohen, Mark Bloch, Maor Gottlieb
In-Reply-To: <20201124104106.0b1201b2@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>

On Tue, Nov 24, 2020 at 10:41:06AM -0800, Jakub Kicinski wrote:
> On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> > On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:  
> > > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:  
> > > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:    
> > > > > > From: Eli Cohen <eli@mellanox.com>
> > > > > > 
> > > > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > > > DPDK to have precedence in packet processing.    
> > > > > 
> > > > > How does DPDK and VDPA relate in this context?    
> > > > 
> > > > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > > > Up till now, the VDPA implementation would insert a rule into the
> > > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > > > all the incoming traffic.
> > > > 
> > > > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > > > MLX5_FLOW_NAMESPACE_BYPASS.  
> > > 
> > > Our policy was no DPDK driver bifurcation. There's no asterisk saying
> > > "unless you pretend you need flow filters for RDMA, get them upstream
> > > and then drop the act".  
> > 
> > Huh?
> > 
> > mlx5 DPDK is an *RDMA* userspace application. 
> 
> Forgive me for my naiveté. 
> 
> Here I thought the RDMA subsystem is for doing RDMA.

RDMA covers a wide range of accelerated networking these days.. Where
else are you going to put this stuff in the kernel?

> I'm sure if you start doing crypto over ibverbs crypto people will want
> to have a look.

Well, RDMA has crypto transforms for a few years now too. Why would
crypto subsystem people be involved? It isn't using or duplicating
their APIs.

> > libibverbs. It runs on the RDMA stack. It uses RDMA flow filtering and
> > RDMA raw ethernet QPs. 
> 
> I'm not saying that's not the case. I'm saying I don't think this was
> something that netdev developers signed-off on.

Part of the point of the subsystem split was to end the fighting that
started all of it. It was very clear during the whole iWarp and TCP
Offload Engine buisness in the mid 2000's that netdev wanted nothing
to do with the accelerator world.

So why would netdev need sign off on any accelerator stuff?  Do you
want to start co-operating now? I'm willing to talk about how to do
that.

> And our policy on DPDK is pretty widely known.

I honestly have no idea on the netdev DPDK policy, I'm maintaining the
RDMA subsystem not DPDK :)

> Would you mind pointing us to the introduction of raw Ethernet QPs?
> 
> Is there any production use for that without DPDK?

Hmm.. It is very old. RAW (InfiniBand) QPs were part of the original
IBA specification cira 2000. When RoCE was defined (around 2010) they
were naturally carried forward to Ethernet. The "flow steering"
concept to make raw ethernet QP useful was added to verbs around 2012
- 2013. It officially made it upstream in commit 436f2ad05a0b
("IB/core: Export ib_create/destroy_flow through uverbs")

If I recall properly the first real application was ultra low latency
ethernet processing for financial applications.

dpdk later adopted the first mlx4 PMD using this libibverbs API around
2015. Interestingly the mlx4 PMD was made through an open source
process with minimal involvment from Mellanox, based on the
pre-existing RDMA work.

Currently there are many projects, and many open source, built on top
of the RDMA raw ethernet QP and RDMA flow steering model. It is now
long established kernel ABI.

> > It has been like this for years, it is not some "act".
> > 
> > It is long standing uABI that accelerators like RDMA/etc get to take
> > the traffic before netdev. This cannot be reverted. I don't really
> > understand what you are expecting here?
> 
> Same. I don't really know what you expect me to do either. I don't
> think I can sign-off on kernel changes needed for DPDK.

This patch is fine tuning the shared logic that splits the traffic to
accelerator subsystems, I don't think netdev should have a veto
here. This needs to be consensus among the various communities and
subsystems that rely on this.

Eli did not explain this well in his commit message. When he said DPDK
he means RDMA which is the owner of the FLOW_NAMESPACE. Each
accelerator subsystem gets hooked into this, so here VPDA is getting
its own hook because re-using the the same hook between two kernel
subsystems is buggy.

Jason

^ permalink raw reply

* Re: [PATCH v4 1/4] ethtool: improve compat ioctl handling
From: Arnd Bergmann @ 2020-11-24 19:42 UTC (permalink / raw)
  To: David Laight; +Cc: netdev@vger.kernel.org, Arnd Bergmann, Christoph Hellwig
In-Reply-To: <4d1a587e7a9e4b65ac3a0c20554abdd3@AcuMS.aculab.com>

On Tue, Nov 24, 2020 at 5:19 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Arnd Bergmann
> > Sent: 24 November 2020 15:18
> >
> > The ethtool compat ioctl handling is hidden away in net/socket.c,
> > which introduces a couple of minor oddities:
> >
> ...
> > +
> > +static int ethtool_rxnfc_copy_from_compat(struct ethtool_rxnfc *rxnfc,
> > +                                       const struct compat_ethtool_rxnfc __user *useraddr,
> > +                                       size_t size)
> > +{
>
> I think this (and possibly others) want a 'noinline_for_stack'.
> So that both the normal and compat structures aren't both on the
> stack when the real code is called.

Yes, makes sense. I checked that the difference is small unless
CONFIG_KASAN_STACK is enabled, and that gcc is smart enough
not to inline these by default, but adding noinline_for_stack is
both consistent with the rest of the file and the safe bet here.

     Arnd

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: remove napi_hash_del() from driver-facing API
From: Jakub Kicinski @ 2020-11-24 19:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, kernel-team
In-Reply-To: <a9ab432f-6d3c-aa8e-66bd-a82fac5d1098@gmail.com>

On Tue, 24 Nov 2020 20:11:42 +0100 Eric Dumazet wrote:
> >> gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough
> >> to not have one synchronize_net() call per netif_napi_del()
> >>
> >> I will test something like :
> >> I am not yet convinced the synchronize_net() is needed, since these
> >> NAPI structs are not involved in busy polling.  
> > 
> > IDK how this squares against netpoll, though?
> >   
> 
> Can we actually attach netpoll to a virtual device using gro_cells ?

Not 100% sure but I grabbed a machine running 5.2 and it appears to
work:

# ip tunnel add gre1 mode gre local 192.168.123.1 remote 192.168.123.2 ttl 255
# ip link set gre1 up
# ip addr add 10.12.34.56/30 dev gre1

# cd /sys/kernel/config/netconsole/
# cat enabled 
0
# echo gre1 > dev_name 
# echo 10.12.34.57 > remote_ip 
# echo cb:a9:87:65:43:21 > remote_mac 
# echo 1 > enabled
# dmesg | tail
[16859.016632] gre: GRE over IPv4 demultiplexor driver
[16859.019831] ip_gre: GRE over IPv4 tunneling driver
[16945.776625] netpoll: netconsole: local port 6665
[16945.776627] netpoll: netconsole: local IPv4 address 0.0.0.0
[16945.776628] netpoll: netconsole: interface 'gre1'
[16945.776629] netpoll: netconsole: remote port 6666
[16945.776630] netpoll: netconsole: remote IPv4 address 10.12.34.57
[16945.776631] netpoll: netconsole: remote ethernet address cb:a9:87:65:43:21
[16945.776633] netpoll: netconsole: local IP 10.12.34.56
[16945.776635] netconsole: network logging started
# ip li show dev gre1
6: gre1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1476 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/gre 192.168.123.1 peer 192.168.123.2


Not sure anything actually comes out on the other end of the pipe :)

^ permalink raw reply

* Re: [PATCH v3 net-next 3/3] net/sched: sch_frag: add generic packet fragment support.
From: Jakub Kicinski @ 2020-11-24 19:24 UTC (permalink / raw)
  To: wenxu; +Cc: marcelo.leitner, vladbu, jhs, xiyou.wangcong, netdev
In-Reply-To: <1605829116-10056-4-git-send-email-wenxu@ucloud.cn>

On Fri, 20 Nov 2020 07:38:36 +0800 wenxu@ucloud.cn wrote:
> +int tcf_dev_queue_xmit(struct sk_buff *skb, int (*xmit)(struct sk_buff *skb))
> +{
> +	xmit_hook_func *xmit_hook;
> +
> +	xmit_hook = rcu_dereference(tcf_xmit_hook);
> +	if (xmit_hook)
> +		return xmit_hook(skb, xmit);
> +	else
> +		return xmit(skb);
> +}
> +EXPORT_SYMBOL_GPL(tcf_dev_queue_xmit);

I'm concerned about the performance impact of these indirect calls.

Did you check what code compiler will generate? What the impact with
retpolines enabled is going to be?

Now that sch_frag is no longer a module this could be simplified.

First of all - xmit_hook can only be sch_frag_xmit_hook, so please use
that directly. 

	if (READ_ONCE(tcf_xmit_hook_count)) 
		sch_frag_xmit_hook(...
	else
		dev_queue_xmit(...

The abstraction is costly and not necessary right now IMO.

Then probably the counter should be:

	u32 __read_mostly tcf_xmit_hook_count;

To avoid byte loads and having it be places in an unlucky cache line.

You could also make the helper a static inline in a header.


Unless I'm not giving the compiler enough credit and the performance
impact of this patch with retpolines on is indiscernible, but that'd
need to be proven by testing...

^ permalink raw reply

* Re: [PATCH net-next v4 3/7] dpaa_eth: limit the possible MTU range when XDP is enabled
From: Maciej Fijalkowski @ 2020-11-24 19:11 UTC (permalink / raw)
  To: Camelia Groza
  Cc: kuba, brouer, saeed, davem, madalin.bucur, ioana.ciornei, netdev
In-Reply-To: <654d6300001825e542341bc052c31433b48b1913.1606150838.git.camelia.groza@nxp.com>

On Mon, Nov 23, 2020 at 07:36:21PM +0200, Camelia Groza wrote:
> Implement the ndo_change_mtu callback to prevent users from setting an
> MTU that would permit processing of S/G frames. The maximum MTU size
> is dependent on the buffer size.
> 
> Acked-by: Madalin Bucur <madalin.bucur@oss.nxp.com>
> Signed-off-by: Camelia Groza <camelia.groza@nxp.com>
> ---
>  drivers/net/ethernet/freescale/dpaa/dpaa_eth.c | 40 ++++++++++++++++++++------
>  1 file changed, 31 insertions(+), 9 deletions(-)
> 
> diff --git a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> index 8acce62..ee076f4 100644
> --- a/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> +++ b/drivers/net/ethernet/freescale/dpaa/dpaa_eth.c
> @@ -2756,23 +2756,44 @@ static int dpaa_eth_stop(struct net_device *net_dev)
>  	return err;
>  }
>  
> +static bool xdp_validate_mtu(struct dpaa_priv *priv, int mtu)
> +{
> +	int max_contig_data = priv->dpaa_bp->size - priv->rx_headroom;
> +
> +	/* We do not support S/G fragments when XDP is enabled.
> +	 * Limit the MTU in relation to the buffer size.
> +	 */
> +	if (mtu + VLAN_ETH_HLEN + ETH_FCS_LEN > max_contig_data) {

Do you support VLAN double tagging? We normally take into acount to two vlan
headers in these checks.

Other than that:
Reviewed-by: Maciej Fijalkowski <maciej.fijalkowski@intel.com>

> +		dev_warn(priv->net_dev->dev.parent,
> +			 "The maximum MTU for XDP is %d\n",
> +			 max_contig_data - VLAN_ETH_HLEN - ETH_FCS_LEN);
> +		return false;
> +	}
> +
> +	return true;
> +}
> +
> +static int dpaa_change_mtu(struct net_device *net_dev, int new_mtu)
> +{
> +	struct dpaa_priv *priv = netdev_priv(net_dev);
> +
> +	if (priv->xdp_prog && !xdp_validate_mtu(priv, new_mtu))
> +		return -EINVAL;
> +
> +	net_dev->mtu = new_mtu;
> +	return 0;
> +}
> +
>  static int dpaa_setup_xdp(struct net_device *net_dev, struct bpf_prog *prog)
>  {
>  	struct dpaa_priv *priv = netdev_priv(net_dev);
>  	struct bpf_prog *old_prog;
> -	int err, max_contig_data;
> +	int err;
>  	bool up;
>  
> -	max_contig_data = priv->dpaa_bp->size - priv->rx_headroom;
> -
>  	/* S/G fragments are not supported in XDP-mode */
> -	if (prog &&
> -	    (net_dev->mtu + VLAN_ETH_HLEN + ETH_FCS_LEN > max_contig_data)) {
> -		dev_warn(net_dev->dev.parent,
> -			 "The maximum MTU for XDP is %d\n",
> -			 max_contig_data - VLAN_ETH_HLEN - ETH_FCS_LEN);
> +	if (prog && !xdp_validate_mtu(priv, net_dev->mtu))
>  		return -EINVAL;
> -	}
>  
>  	up = netif_running(net_dev);
>  
> @@ -2870,6 +2891,7 @@ static int dpaa_ioctl(struct net_device *net_dev, struct ifreq *rq, int cmd)
>  	.ndo_set_rx_mode = dpaa_set_rx_mode,
>  	.ndo_do_ioctl = dpaa_ioctl,
>  	.ndo_setup_tc = dpaa_setup_tc,
> +	.ndo_change_mtu = dpaa_change_mtu,
>  	.ndo_bpf = dpaa_xdp,
>  };
>  
> -- 
> 1.9.1
> 

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: remove napi_hash_del() from driver-facing API
From: Eric Dumazet @ 2020-11-24 19:11 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: davem, netdev, kernel-team
In-Reply-To: <20201124105413.0406e879@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>



On 11/24/20 7:54 PM, Jakub Kicinski wrote:
> On Tue, 24 Nov 2020 19:00:50 +0100 Eric Dumazet wrote:
>> On 9/9/20 7:37 PM, Jakub Kicinski wrote:
>>> We allow drivers to call napi_hash_del() before calling
>>> netif_napi_del() to batch RCU grace periods. This makes
>>> the API asymmetric and leaks internal implementation details.
>>> Soon we will want the grace period to protect more than just
>>> the NAPI hash table.
>>>
>>> Restructure the API and have drivers call a new function -
>>> __netif_napi_del() if they want to take care of RCU waits.
>>>
>>> Note that only core was checking the return status from
>>> napi_hash_del() so the new helper does not report if the
>>> NAPI was actually deleted.
>>>
>>> Some notes on driver oddness:
>>>  - veth observed the grace period before calling netif_napi_del()
>>>    but that should not matter
>>>  - myri10ge observed normal RCU flavor
>>>  - bnx2x and enic did not actually observe the grace period
>>>    (unless they did so implicitly)
>>>  - virtio_net and enic only unhashed Rx NAPIs
>>>
>>> The last two points seem to indicate that the calls to
>>> napi_hash_del() were a left over rather than an optimization.
>>> Regardless, it's easy enough to correct them.
>>>
>>> This patch may introduce extra synchronize_net() calls for
>>> interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
>>> free_netdev() to call netif_napi_del(). This seems inevitable
>>> since we want to use RCU for netpoll dev->napi_list traversal,
>>> and almost no drivers set IFF_DISABLE_NETPOLL.
>>>
>>> Signed-off-by: Jakub Kicinski <kuba@kernel.org>  
>>
>> After this patch, gro_cells_destroy() became damn slow
>> on hosts with a lot of cores.
>>
>> After your change, we have one additional synchronize_net() per cpu as
>> you stated in your changelog.
> 
> Sorry :S  I hope it didn't waste too much of your time..

Do not worry ;)

> 
>> gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough
>> to not have one synchronize_net() call per netif_napi_del()
>>
>> I will test something like :
>> I am not yet convinced the synchronize_net() is needed, since these
>> NAPI structs are not involved in busy polling.
> 
> IDK how this squares against netpoll, though?
> 

Can we actually attach netpoll to a virtual device using gro_cells ?


^ permalink raw reply

* Re: [PATCH v4 2/4] net: socket: rework SIOC?IFMAP ioctls
From: Arnd Bergmann @ 2020-11-24 19:05 UTC (permalink / raw)
  To: David Laight; +Cc: netdev@vger.kernel.org, Arnd Bergmann
In-Reply-To: <e86a5d8a3aed44139010dac219dfcf08@AcuMS.aculab.com>

On Tue, Nov 24, 2020 at 5:13 PM David Laight <David.Laight@aculab.com> wrote:
>
> From: Arnd Bergmann
> > Sent: 24 November 2020 15:18
> >
> > SIOCGIFMAP and SIOCSIFMAP currently require compat_alloc_user_space()
> > and copy_in_user() for compat mode.
> >
> > Move the compat handling into the location where the structures are
> > actually used, to avoid using those interfaces and get a clearer
> > implementation.
> >
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> > changes in v3:
> >  - complete rewrite
> ...
> >  include/linux/compat.h | 18 ++++++------
> >  net/core/dev_ioctl.c   | 64 +++++++++++++++++++++++++++++++++---------
> >  net/socket.c           | 39 ++-----------------------
> >  3 files changed, 62 insertions(+), 59 deletions(-)
> >
> > diff --git a/include/linux/compat.h b/include/linux/compat.h
> > index 08dbd34bb7a5..47496c5eb5eb 100644
> > --- a/include/linux/compat.h
> > +++ b/include/linux/compat.h
> > @@ -96,6 +96,15 @@ struct compat_iovec {
> >       compat_size_t   iov_len;
> >  };
> >
> > +struct compat_ifmap {
> > +     compat_ulong_t mem_start;
> > +     compat_ulong_t mem_end;
> > +     unsigned short base_addr;
> > +     unsigned char irq;
> > +     unsigned char dma;
> > +     unsigned char port;
> > +};
>
> Isn't the only difference the number of pad bytes at the end?

No, the main difference is in the first two fields, which are
'unsigned long' and therefore different. The three-byte padding
is in fact the same on all architectures (including x86) that
have a compat mode, though it might be different on
m68k and arm-oabi, which have slightly special struct
alignment rules.

It could be done with two assignments and a memcpy, but
I like the individual assignments better here.

      Arnd

^ permalink raw reply

* Re: [PATCH net-next v6 2/5] net/lapb: support netdev events
From: Xie He @ 2020-11-24 19:05 UTC (permalink / raw)
  To: Martin Schiller
  Cc: Andrew Hendry, David S. Miller, Jakub Kicinski, Linux X25,
	Linux Kernel Network Developers, LKML
In-Reply-To: <20201124093938.22012-3-ms@dev.tdt.de>

On Tue, Nov 24, 2020 at 1:40 AM Martin Schiller <ms@dev.tdt.de> wrote:
>
> This patch allows layer2 (LAPB) to react to netdev events itself and
> avoids the detour via layer3 (X.25).
>
> 1. Establish layer2 on NETDEV_UP events, if the carrier is already up.
>
> 2. Call lapb_disconnect_request() on NETDEV_GOING_DOWN events to signal
>    the peer that the connection will go down.
>    (Only when the carrier is up.)
>
> 3. When a NETDEV_DOWN event occur, clear all queues, enter state
>    LAPB_STATE_0 and stop all timers.
>
> 4. The NETDEV_CHANGE event makes it possible to handle carrier loss and
>    detection.
>
>    In case of Carrier Loss, clear all queues, enter state LAPB_STATE_0
>    and stop all timers.
>
>    In case of Carrier Detection, we start timer t1 on a DCE interface,
>    and on a DTE interface we change to state LAPB_STATE_1 and start
>    sending SABM(E).
>
> Signed-off-by: Martin Schiller <ms@dev.tdt.de>

Acked-by: Xie He <xie.he.0141@gmail.com>

Thanks!

^ permalink raw reply

* Re: [PATCH 0/3] Bluetooth: Power down controller when suspending
From: Abhishek Pandit-Subedi @ 2020-11-24 19:04 UTC (permalink / raw)
  To: Marcel Holtmann, chin-ran.lo, amitkumar.karwar
  Cc: BlueZ development, ChromeOS Bluetooth Upstreaming, Miao-chen Chou,
	Daniel Winkler, David S. Miller, Johan Hedberg, netdev, LKML,
	Jakub Kicinski
In-Reply-To: <CANFp7mVSGNbwCkWCj=bVzbE8L38nwu0+UMR9jkOYcYQmGBaAEw@mail.gmail.com>

Re-send to NXP email addresses for Chin-Ran Lo and Amitkumar Karwar
(Marvell wireless IP acquired by NXP)



On Tue, Nov 24, 2020 at 11:02 AM Abhishek Pandit-Subedi
<abhishekpandit@chromium.org> wrote:
>
> Hi Marcel,
>
>
> On Mon, Nov 23, 2020 at 3:46 AM Marcel Holtmann <marcel@holtmann.org> wrote:
> >
> > Hi Abhishek,
> >
> > > This patch series adds support for a quirk that will power down the
> > > Bluetooth controller when suspending and power it back up when resuming.
> > >
> > > On Marvell SDIO Bluetooth controllers (SD8897 and SD8997), we are seeing
> > > a large number of suspend failures with the following log messages:
> > >
> > > [ 4764.773873] Bluetooth: hci_cmd_timeout() hci0 command 0x0c14 tx timeout
> > > [ 4767.777897] Bluetooth: btmrvl_enable_hs() Host sleep enable command failed
> > > [ 4767.777920] Bluetooth: btmrvl_sdio_suspend() HS not actived, suspend failed!
> > > [ 4767.777946] dpm_run_callback(): pm_generic_suspend+0x0/0x48 returns -16
> > > [ 4767.777963] call mmc2:0001:2+ returned -16 after 4882288 usecs
> > >
> > > The daily failure rate with this signature is quite significant and
> > > users are likely facing this at least once a day (and some unlucky users
> > > are likely facing it multiple times a day).
> > >
> > > Given the severity, we'd like to power off the controller during suspend
> > > so the driver doesn't need to take any action (or block in any way) when
> > > suspending and power on during resume. This will break wake-on-bt for
> > > users but should improve the reliability of suspend.
> > >
> > > We don't want to force all users of MVL8897 and MVL8997 to encounter
> > > this behavior if they're not affected (especially users that depend on
> > > Bluetooth for keyboard/mouse input) so the new behavior is enabled via
> > > module param. We are limiting this quirk to only Chromebooks (i.e.
> > > laptop). Chromeboxes will continue to have the old behavior since users
> > > may depend on BT HID to wake and use the system.
> >
> > I don’t have a super great feeling with this change.
> >
> > So historically only hciconfig hci0 up/down was doing a power cycle of the controller and when adding the mgmt interface we moved that to the mgmt interface. In addition we added a special case of power up via hdev->setup. We never had an intention that the kernel otherwise can power up/down the controller as it pleases.
>
> Aside from the powered setting, the stack is resilient to the
> controller crashing (which would be akin to a power off and power on).
> From the view of bluez, adapter lost and power down should be almost
> equivalent right? ChromeOS has several platforms where Bluetooth has
> been reset after suspend, usually due USB being powered off in S3, and
> the stack is still well-behaving when that occurs.
>
> >
> > Can we ask Marvell first to investigate why this is fundamentally broken with their hardware?
>
> +Chin-Ran Lo and +Amitkumar Karwar (added based on changes to
> drivers/bluetooth/btmrvl_main.c)
>
> Could you please take a look at the original cover letter and comment
> (or add others at Marvell who may be able to)? Is this a known issue
> or a fix?
>
> >Since what you are proposing is a pretty heavy change that might has side affects. For example the state machine for the mgmt interface has no concept of a power down/up from the kernel. It is all triggered by bluetoothd.
> >
> > I am careful here since the whole power up/down path is already complicated enough.
> >
>
> That sounds reasonable. I have landed this within ChromeOS so we can
> test whether a) this improves stability enough and b) whether the
> power off/on in the kernel has significant side effects. This will go
> through our automated testing and dogfooding over the next few weeks
> and hopefully identify those side-effects. I will re-raise this topic
> with updates once we have more data.
>
> Also, in case it wasn't very clear, I put this behind a module param
> that defaults to False because this is so heavy handed. We're only
> using it on specific Chromebooks that are exhibiting the worst
> behavior and not disabling it wholesale for all btmrvl controllers.
>
> Thanks
> Abhishek
>
> > Regards
> >
> > Marcel
> >

^ permalink raw reply

* Re: [PATCH 0/3] Bluetooth: Power down controller when suspending
From: Abhishek Pandit-Subedi @ 2020-11-24 19:02 UTC (permalink / raw)
  To: Marcel Holtmann, crlo, akarwar
  Cc: BlueZ development, ChromeOS Bluetooth Upstreaming, Miao-chen Chou,
	Daniel Winkler, David S. Miller, Johan Hedberg, netdev, LKML,
	Jakub Kicinski
In-Reply-To: <7235CD4E-963C-4BCB-B891-62494AD7F10D@holtmann.org>

Hi Marcel,


On Mon, Nov 23, 2020 at 3:46 AM Marcel Holtmann <marcel@holtmann.org> wrote:
>
> Hi Abhishek,
>
> > This patch series adds support for a quirk that will power down the
> > Bluetooth controller when suspending and power it back up when resuming.
> >
> > On Marvell SDIO Bluetooth controllers (SD8897 and SD8997), we are seeing
> > a large number of suspend failures with the following log messages:
> >
> > [ 4764.773873] Bluetooth: hci_cmd_timeout() hci0 command 0x0c14 tx timeout
> > [ 4767.777897] Bluetooth: btmrvl_enable_hs() Host sleep enable command failed
> > [ 4767.777920] Bluetooth: btmrvl_sdio_suspend() HS not actived, suspend failed!
> > [ 4767.777946] dpm_run_callback(): pm_generic_suspend+0x0/0x48 returns -16
> > [ 4767.777963] call mmc2:0001:2+ returned -16 after 4882288 usecs
> >
> > The daily failure rate with this signature is quite significant and
> > users are likely facing this at least once a day (and some unlucky users
> > are likely facing it multiple times a day).
> >
> > Given the severity, we'd like to power off the controller during suspend
> > so the driver doesn't need to take any action (or block in any way) when
> > suspending and power on during resume. This will break wake-on-bt for
> > users but should improve the reliability of suspend.
> >
> > We don't want to force all users of MVL8897 and MVL8997 to encounter
> > this behavior if they're not affected (especially users that depend on
> > Bluetooth for keyboard/mouse input) so the new behavior is enabled via
> > module param. We are limiting this quirk to only Chromebooks (i.e.
> > laptop). Chromeboxes will continue to have the old behavior since users
> > may depend on BT HID to wake and use the system.
>
> I don’t have a super great feeling with this change.
>
> So historically only hciconfig hci0 up/down was doing a power cycle of the controller and when adding the mgmt interface we moved that to the mgmt interface. In addition we added a special case of power up via hdev->setup. We never had an intention that the kernel otherwise can power up/down the controller as it pleases.

Aside from the powered setting, the stack is resilient to the
controller crashing (which would be akin to a power off and power on).
From the view of bluez, adapter lost and power down should be almost
equivalent right? ChromeOS has several platforms where Bluetooth has
been reset after suspend, usually due USB being powered off in S3, and
the stack is still well-behaving when that occurs.

>
> Can we ask Marvell first to investigate why this is fundamentally broken with their hardware?

+Chin-Ran Lo and +Amitkumar Karwar (added based on changes to
drivers/bluetooth/btmrvl_main.c)

Could you please take a look at the original cover letter and comment
(or add others at Marvell who may be able to)? Is this a known issue
or a fix?

>Since what you are proposing is a pretty heavy change that might has side affects. For example the state machine for the mgmt interface has no concept of a power down/up from the kernel. It is all triggered by bluetoothd.
>
> I am careful here since the whole power up/down path is already complicated enough.
>

That sounds reasonable. I have landed this within ChromeOS so we can
test whether a) this improves stability enough and b) whether the
power off/on in the kernel has significant side effects. This will go
through our automated testing and dogfooding over the next few weeks
and hopefully identify those side-effects. I will re-raise this topic
with updates once we have more data.

Also, in case it wasn't very clear, I put this behind a module param
that defaults to False because this is so heavy handed. We're only
using it on specific Chromebooks that are exhibiting the worst
behavior and not disabling it wholesale for all btmrvl controllers.

Thanks
Abhishek

> Regards
>
> Marcel
>

^ permalink raw reply

* Re: [PATCH net v5] aquantia: Remove the build_skb path
From: Jakub Kicinski @ 2020-11-24 19:02 UTC (permalink / raw)
  To: Ramsay, Lincoln
  Cc: Florian Westphal, Igor Russkikh, David S. Miller,
	netdev@vger.kernel.org, Dmitry Bogdanov
In-Reply-To: <MWHPR1001MB23184F3EAFA413E0D1910EC9E8FC0@MWHPR1001MB2318.namprd10.prod.outlook.com>

On Mon, 23 Nov 2020 21:40:43 +0000 Ramsay, Lincoln wrote:
> From: Lincoln Ramsay <lincoln.ramsay@opengear.com>
> 
> When performing IPv6 forwarding, there is an expectation that SKBs
> will have some headroom. When forwarding a packet from the aquantia
> driver, this does not always happen, triggering a kernel warning.
> 
> aq_ring.c has this code (edited slightly for brevity):
> 
> if (buff->is_eop && buff->len <= AQ_CFG_RX_FRAME_MAX - AQ_SKB_ALIGN) {
>     skb = build_skb(aq_buf_vaddr(&buff->rxdata), AQ_CFG_RX_FRAME_MAX);
> } else {
>     skb = napi_alloc_skb(napi, AQ_CFG_RX_HDR_SIZE);
> 
> There is a significant difference between the SKB produced by these
> 2 code paths. When napi_alloc_skb creates an SKB, there is a certain
> amount of headroom reserved. However, this is not done in the
> build_skb codepath.
> 
> As the hardware buffer that build_skb is built around does not
> handle the presence of the SKB header, this code path is being
> removed and the napi_alloc_skb path will always be used. This code
> path does have to copy the packet header into the SKB, but it adds
> the packet data as a frag.
> 
> Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")
> Signed-off-by: Lincoln Ramsay <lincoln.ramsay@opengear.com>

Applied, queued of stable.

Thanks!

^ permalink raw reply

* Re: [PATCH] net: stmmac: add flexible PPS to dwmac 4.10a
From: Jakub Kicinski @ 2020-11-24 18:56 UTC (permalink / raw)
  To: Antonio Borneo
  Cc: Ahmad Fatoum, Giuseppe Cavallaro, Alexandre Torgue, Jose Abreu,
	David S. Miller, netdev, Maxime Coquelin, linux-stm32,
	linux-arm-kernel, linux-kernel, Pengutronix Kernel Team, has
In-Reply-To: <4a53794f1a0cea5eb009fce0b4b4c4846771f8be.camel@st.com>

On Tue, 24 Nov 2020 19:27:03 +0100 Antonio Borneo wrote:
> On Tue, 2020-11-24 at 10:20 -0800, Jakub Kicinski wrote:
> > On Tue, 24 Nov 2020 15:23:27 +0100 Antonio Borneo wrote:  
> > > On Tue, 2020-11-24 at 15:15 +0100, Ahmad Fatoum wrote:  
> > > > On 10.10.19 00:26, Jakub Kicinski wrote:    
> > > > > On Mon, 7 Oct 2019 17:43:06 +0200, Antonio Borneo wrote:    
> > > > > > All the registers and the functionalities used in the callback
> > > > > > dwmac5_flex_pps_config() are common between dwmac 4.10a [1] and
> > > > > > 5.00a [2].
> > > > > > 
> > > > > > Reuse the same callback for dwmac 4.10a too.
> > > > > > 
> > > > > > Tested on STM32MP15x, based on dwmac 4.10a.
> > > > > > 
> > > > > > [1] DWC Ethernet QoS Databook 4.10a October 2014
> > > > > > [2] DWC Ethernet QoS Databook 5.00a September 2017
> > > > > > 
> > > > > > Signed-off-by: Antonio Borneo <antonio.borneo@st.com>    
> > > > > 
> > > > > Applied to net-next.    
> > > > 
> > > > This patch seems to have been fuzzily applied at the wrong location.
> > > > The diff describes extension of dwmac 4.10a and so does the @@ line:
> > > > 
> > > >   @@ -864,6 +864,7 @@ const struct stmmac_ops dwmac410_ops = {
> > > > 
> > > > The patch was applied mainline as 757926247836 ("net: stmmac: add
> > > > flexible PPS to dwmac 4.10a"), but it extends dwmac4_ops instead:
> > > > 
> > > >   @@ -938,6 +938,7 @@ const struct stmmac_ops dwmac4_ops = {
> > > > 
> > > > I don't know if dwmac4 actually supports FlexPPS, so I think it's
> > > > better to be on the safe side and revert 757926247836 and add the
> > > > change for the correct variant.    
> > > 
> > > Agree,
> > > the patch get applied to the wrong place!  
> > 
> > :-o
> > 
> > This happens sometimes with stable backports but I've never seen it
> > happen working on "current" branches.
> > 
> > Sorry about that!
> > 
> > Would you mind sending the appropriate patches? I can do the revert if
> > you prefer, but since you need to send the fix anyway..  
> 
> You mean sending two patches one for revert and one to re-apply the code?
> Or a single patch for the fix?

Either way is fine by me. If I was doing it - I'd probably send just one
patch, but if you prefer to revert first - nothing wrong with that.

^ permalink raw reply

* Re: [PATCH net-next 1/3] net: remove napi_hash_del() from driver-facing API
From: Jakub Kicinski @ 2020-11-24 18:54 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: davem, netdev, kernel-team
In-Reply-To: <8735d11e-e734-2ba9-7ced-d047682f9f3e@gmail.com>

On Tue, 24 Nov 2020 19:00:50 +0100 Eric Dumazet wrote:
> On 9/9/20 7:37 PM, Jakub Kicinski wrote:
> > We allow drivers to call napi_hash_del() before calling
> > netif_napi_del() to batch RCU grace periods. This makes
> > the API asymmetric and leaks internal implementation details.
> > Soon we will want the grace period to protect more than just
> > the NAPI hash table.
> > 
> > Restructure the API and have drivers call a new function -
> > __netif_napi_del() if they want to take care of RCU waits.
> > 
> > Note that only core was checking the return status from
> > napi_hash_del() so the new helper does not report if the
> > NAPI was actually deleted.
> > 
> > Some notes on driver oddness:
> >  - veth observed the grace period before calling netif_napi_del()
> >    but that should not matter
> >  - myri10ge observed normal RCU flavor
> >  - bnx2x and enic did not actually observe the grace period
> >    (unless they did so implicitly)
> >  - virtio_net and enic only unhashed Rx NAPIs
> > 
> > The last two points seem to indicate that the calls to
> > napi_hash_del() were a left over rather than an optimization.
> > Regardless, it's easy enough to correct them.
> > 
> > This patch may introduce extra synchronize_net() calls for
> > interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
> > free_netdev() to call netif_napi_del(). This seems inevitable
> > since we want to use RCU for netpoll dev->napi_list traversal,
> > and almost no drivers set IFF_DISABLE_NETPOLL.
> > 
> > Signed-off-by: Jakub Kicinski <kuba@kernel.org>  
> 
> After this patch, gro_cells_destroy() became damn slow
> on hosts with a lot of cores.
> 
> After your change, we have one additional synchronize_net() per cpu as
> you stated in your changelog.

Sorry :S  I hope it didn't waste too much of your time..

> gro_cells_init() is setting NAPI_STATE_NO_BUSY_POLL, and this was enough
> to not have one synchronize_net() call per netif_napi_del()
> 
> I will test something like :
> I am not yet convinced the synchronize_net() is needed, since these
> NAPI structs are not involved in busy polling.

IDK how this squares against netpoll, though?

> diff --git a/net/core/gro_cells.c b/net/core/gro_cells.c
> index e095fb871d9120787bfdf62149f4d82e0e3b0a51..8cfa6ce0738977290cc9f76a3f5daa617308e107 100644
> --- a/net/core/gro_cells.c
> +++ b/net/core/gro_cells.c
> @@ -99,9 +99,10 @@ void gro_cells_destroy(struct gro_cells *gcells)
>                 struct gro_cell *cell = per_cpu_ptr(gcells->cells, i);
>  
>                 napi_disable(&cell->napi);
> -               netif_napi_del(&cell->napi);
> +               __netif_napi_del(&cell->napi);
>                 __skb_queue_purge(&cell->napi_skbs);
>         }
> +       synchronize_net();
>         free_percpu(gcells->cells);
>         gcells->cells = NULL;
>  }
> 
> 


^ permalink raw reply

* Re: [PATCH net-next] mptcp: be careful on MPTCP-level ack.
From: Paolo Abeni @ 2020-11-24 18:53 UTC (permalink / raw)
  To: Jakub Kicinski; +Cc: netdev, mptcp, Eric Dumazet
In-Reply-To: <20201124100809.08360e4c@kicinski-fedora-pc1c0hjn.dhcp.thefacebook.com>

On Tue, 2020-11-24 at 10:08 -0800, Jakub Kicinski wrote:
> On Tue, 24 Nov 2020 12:20:11 +0100 Paolo Abeni wrote:
> > -static void mptcp_send_ack(struct mptcp_sock *msk, bool force)
> > +static inline bool tcp_can_send_ack(const struct sock *ssk)
> > +{
> > +	return !((1 << inet_sk_state_load(ssk)) &
> > +	       (TCPF_SYN_SENT | TCPF_SYN_RECV | TCPF_TIME_WAIT | TCPF_CLOSE));
> > +}
> 
> Does the compiler really not inline this trivial static function?

whoops... That is just me adding an unneeded keyword. I'll send a v2.

Thanks,

Paolo




^ permalink raw reply

* bonding in active/backup with ARP monitoring does not silently switch the slaves
From: Finer, Howard @ 2020-11-24 18:44 UTC (permalink / raw)
  To: netdev@vger.kernel.org

We use the bond driver in an active-backup configuration with ARP monitoring. We also use the TIPC protocol which we run over the bond device. We are consistently seeing an issue in both the 3.16 and 4.19 kernels whereby when the bond slave is switched TIPC is being notified of the change rather than it happening silently.  The problem that we see is that when the active slave fails, a NETDEV_CHANGE event is being sent to the TIPC driver to notify it that the link is down. This causes the TIPC driver to reset its bearers and therefore break communication between the nodes that are clustered.
With some additional instrumentation in thee driver, I see this in /var/log/syslog:
<6> 1 2020-11-20T18:14:19.159524+01:00 LABNBS5B kernel - - - [65818.378287] bond0: link status definitely down for interface eth0, disabling it
<6> 1 2020-11-20T18:14:19.159536+01:00 LABNBS5B kernel - - - [65818.378296] bond0: now running without any active interface!
<6> 1 2020-11-20T18:14:19.159537+01:00 LABNBS5B kernel - - - [65818.378304] bond0: bond_activebackup_arp_mon: notify_rtnl, slave state notify/slave link notify
<6> 1 2020-11-20T18:14:19.159538+01:00 LABNBS5B kernel - - - [65818.378835] netdev change bearer <eth:bond0>
<6> 1 2020-11-20T18:14:19.263523+01:00 LABNBS5B kernel - - - [65818.482384] bond0: link status definitely up for interface eth1
<6> 1 2020-11-20T18:14:19.263534+01:00 LABNBS5B kernel - - - [65818.482387] bond0: making interface eth1 the new active one
<6> 1 2020-11-20T18:14:19.263536+01:00 LABNBS5B kernel - - - [65818.482633] bond0: first active interface up!
<6> 1 2020-11-20T18:14:19.263537+01:00 LABNBS5B kernel - - - [65818.482671] netdev change bearer <eth:bond0>
<6> 1 2020-11-20T18:14:19.367523+01:00 LABNBS5B kernel - - - [65818.586228] bond0: bond_activebackup_arp_mon: call_netdevice_notifiers NETDEV_NOTIFY_PEERS

There is no issue when using MII monitoring instead of ARP monitoring since when the slave is detected as down, it immediately switches to the backup as it sees that slave as being up and ready.    But when using ARP monitoring, only one of the slaves is 'up'.  So when the active slave goes down, the bond driver will see no active slaves until it brings up the backup slave on the next call to bond_activebackup_arp_mon.  Brining up that backup slave has to be attempted prior to notifying any peers of a change or else they will see the outage.  In this case it seems the should_notify_rtnl flag has to be set to false.    However, I also question if the switch to the backup slave should actually occur immediately like it does for MII and that the backup should be immediately brought up/switched to without having to wait for the next iteration.

As it currently is implemented there is no way to run TIPC over an active-backup ARP-monitored bond device.  I suspect there are other situations/uses that would likewise have an issue with the 'erroneous' NETDEV_CHANGE being issued.   Since TIPC (and others) have no idea what the dev is, it is not possible to ignore the event nor should it be ignored.  It therefore seems the event shouldn't be sent for this situation.   Please confirm the analysis above and provide a path forward since as currently implemented the functionality is broken.

Thanks,
Howard Finer



^ permalink raw reply

* Re: [PATCH mlx5-next 11/16] net/mlx5: Add VDPA priority to NIC RX namespace
From: Jakub Kicinski @ 2020-11-24 18:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Saeed Mahameed, Eli Cohen, Leon Romanovsky, netdev, linux-rdma,
	Eli Cohen, Mark Bloch, Maor Gottlieb
In-Reply-To: <20201124180210.GJ5487@ziepe.ca>

On Tue, 24 Nov 2020 14:02:10 -0400 Jason Gunthorpe wrote:
> On Tue, Nov 24, 2020 at 09:12:19AM -0800, Jakub Kicinski wrote:
> > On Sun, 22 Nov 2020 08:41:58 +0200 Eli Cohen wrote:  
> > > On Sat, Nov 21, 2020 at 04:01:55PM -0800, Jakub Kicinski wrote:  
> > > > On Fri, 20 Nov 2020 15:03:34 -0800 Saeed Mahameed wrote:    
> > > > > From: Eli Cohen <eli@mellanox.com>
> > > > > 
> > > > > Add a new namespace type to the NIC RX root namespace to allow for
> > > > > inserting VDPA rules before regular NIC but after bypass, thus allowing
> > > > > DPDK to have precedence in packet processing.    
> > > > 
> > > > How does DPDK and VDPA relate in this context?    
> > > 
> > > mlx5 steering is hierarchical and defines precedence amongst namespaces.
> > > Up till now, the VDPA implementation would insert a rule into the
> > > MLX5_FLOW_NAMESPACE_BYPASS hierarchy which is used by DPDK thus taking
> > > all the incoming traffic.
> > > 
> > > The MLX5_FLOW_NAMESPACE_VDPA hirerachy comes after
> > > MLX5_FLOW_NAMESPACE_BYPASS.  
> > 
> > Our policy was no DPDK driver bifurcation. There's no asterisk saying
> > "unless you pretend you need flow filters for RDMA, get them upstream
> > and then drop the act".  
> 
> Huh?
> 
> mlx5 DPDK is an *RDMA* userspace application. 

Forgive me for my naiveté. 

Here I thought the RDMA subsystem is for doing RDMA.

I'm sure if you start doing crypto over ibverbs crypto people will want
to have a look.

> libibverbs. It runs on the RDMA stack. It uses RDMA flow filtering and
> RDMA raw ethernet QPs. 

I'm not saying that's not the case. I'm saying I don't think this was
something that netdev developers signed-off on. And our policy on DPDK
is pretty widely known.

Would you mind pointing us to the introduction of raw Ethernet QPs?

Is there any production use for that without DPDK?

> It has been like this for years, it is not some "act".
> 
> It is long standing uABI that accelerators like RDMA/etc get to take
> the traffic before netdev. This cannot be reverted. I don't really
> understand what you are expecting here?

Same. I don't really know what you expect me to do either. I don't
think I can sign-off on kernel changes needed for DPDK.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox