Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v3 net-next 3/3] openvswitch: Fix skb->protocol for vlan frames.
From: Jiri Benc @ 2016-11-30 14:30 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev, pshelar, e
In-Reply-To: <1480462253-114713-3-git-send-email-jarno@ovn.org>

On Tue, 29 Nov 2016 15:30:53 -0800, Jarno Rajahalme wrote:
> Do not always set skb->protocol to be the ethertype of the L3 header.
> For a packet with non-accelerated VLAN tags skb->protocol needs to be
> the ethertype of the outermost non-accelerated VLAN ethertype.

Well, the current handling of skb->protocol matches what used to be the
handling of the kernel net stack before Jiri Pirko cleaned up the vlan
code.

I'm not opposed to changing this but I'm afraid it needs much deeper
review. Because with this in place, no core kernel functions that
depend on skb->protocol may be called from within openvswitch.

> @@ -361,6 +362,11 @@ static int parse_vlan(struct sk_buff *skb, struct sw_flow_key *key)
>  	if (res <= 0)
>  		return res;
>  
> +	/* If the outer vlan tag was accelerated, skb->protocol should
> +	 * refelect the inner vlan type. */
> +	if (!eth_type_vlan(skb->protocol))
> +		skb->protocol = key->eth.cvlan.tpid;

This should not depend on the current value in skb->protocol which
could be arbitrary at this point (from the point of view of how this
patch understands the skb->protocol values). It's easy to fix, though -
just add a local bool variable tracking whether the skb->protocol has
been set.

> @@ -531,15 +544,16 @@ static int key_extract(struct sk_buff *skb, struct sw_flow_key *key)
>  		if (unlikely(parse_vlan(skb, key)))
>  			return -ENOMEM;
>  
> -		skb->protocol = parse_ethertype(skb);
> -		if (unlikely(skb->protocol == htons(0)))
> +		key->eth.type = parse_ethertype(skb);
> +		if (unlikely(key->eth.type == htons(0)))
>  			return -ENOMEM;
>  
>  		skb_reset_network_header(skb);
>  		__skb_push(skb, skb->data - skb_mac_header(skb));
>  	}
>  	skb_reset_mac_len(skb);
> -	key->eth.type = skb->protocol;
> +	if (!eth_type_vlan(skb->protocol))
> +		skb->protocol = key->eth.type;

This leaves key->eth.type undefined for key->mac_proto ==
MAC_PROTO_NONE.

Plus the same problem as above with unknown value of skb->protocol. But
this is more complicated here, as skb->protocol may be either
uninitialized at this point or already initialized by parse_vlan.

>  
>  	/* Network layer. */
>  	if (key->eth.type == htons(ETH_P_IP)) {
> @@ -800,29 +814,15 @@ int ovs_flow_key_extract_userspace(struct net *net, const struct nlattr *attr,
>  	if (err)
>  		return err;
>  
> -	if (ovs_key_mac_proto(key) == MAC_PROTO_NONE) {
> -		/* key_extract assumes that skb->protocol is set-up for
> -		 * layer 3 packets which is the case for other callers,
> -		 * in particular packets recieved from the network stack.
> -		 * Here the correct value can be set from the metadata
> -		 * extracted above.
> -		 */
> -		skb->protocol = key->eth.type;
> -	} else {
> -		struct ethhdr *eth;
> -
> -		skb_reset_mac_header(skb);
> -		eth = eth_hdr(skb);
> -
> -		/* Normally, setting the skb 'protocol' field would be
> -		 * handled by a call to eth_type_trans(), but it assumes
> -		 * there's a sending device, which we may not have.
> -		 */
> -		if (eth_proto_is_802_3(eth->h_proto))
> -			skb->protocol = eth->h_proto;
> -		else
> -			skb->protocol = htons(ETH_P_802_2);
> -	}
> +	/* key_extract assumes that skb->protocol is set-up for
> +	 * layer 3 packets which is the case for other callers,
> +	 * in particular packets recieved from the network stack.
> +	 * Here the correct value can be set from the metadata
> +	 * extracted above.  For layer 2 packets we initialize
> +         * skb->protocol to zero and set it in key_extract() while
> +         * parsing the L2 headers.
> +	 */
> +	skb->protocol = key->eth.type;
>  
>  	return key_extract(skb, key);
>  }

Interesting. This hunk looks safe even without the rest of the patch.
You should fix the comment indentation, though.

 Jiri

^ permalink raw reply

* Re: [net-next PATCH v3 6/6] virtio_net: xdp, add slowpath case for non contiguous buffers
From: Jakub Kicinski @ 2016-11-30 14:30 UTC (permalink / raw)
  To: John Fastabend, Michael S. Tsirkin
  Cc: eric.dumazet, daniel, shm, davem, tgraf, alexei.starovoitov,
	john.r.fastabend, netdev, bblanco, brouer
In-Reply-To: <20161129201133.26851.31803.stgit@john-Precision-Tower-5810>

[add MST]

On Tue, 29 Nov 2016 12:11:33 -0800, John Fastabend wrote:
> virtio_net XDP support expects receive buffers to be contiguous.
> If this is not the case we enable a slowpath to allow connectivity
> to continue but at a significan performance overhead associated with
> linearizing data. To make it painfully aware to users that XDP is
> running in a degraded mode we throw an xdp buffer error.
> 
> To linearize packets we allocate a page and copy the segments of
> the data, including the header, into it. After this the page can be
> handled by XDP code flow as normal.
> 
> Then depending on the return code the page is either freed or sent
> to the XDP xmit path. There is no attempt to optimize this path.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/virtio_net.c |   70 +++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 68 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index 9604e55..b0ce4ef 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -449,6 +449,56 @@ static struct sk_buff *receive_big(struct net_device *dev,
>  	return NULL;
>  }
>  
> +/* The conditions to enable XDP should preclude the underlying device from
> + * sending packets across multiple buffers (num_buf > 1). However per spec
> + * it does not appear to be illegal to do so but rather just against convention.
> + * So in order to avoid making a system unresponsive the packets are pushed
> + * into a page and the XDP program is run. This will be extremely slow and we
> + * push a warning to the user to fix this as soon as possible. Fixing this may
> + * require resolving the underlying hardware to determine why multiple buffers
> + * are being received or simply loading the XDP program in the ingress stack
> + * after the skb is built because there is no advantage to running it here
> + * anymore.
> + */
> +static struct page *xdp_linearize_page(struct receive_queue *rq,
> +				       u16 num_buf,
> +				       struct page *p,
> +				       int offset,
> +				       unsigned int *len)
> +{
> +	struct page *page = alloc_page(GFP_ATOMIC);
> +	unsigned int page_off = 0;
> +
> +	if (!page)
> +		return NULL;
> +
> +	memcpy(page_address(page) + page_off, page_address(p) + offset, *len);
> +	while (--num_buf) {
> +		unsigned int buflen;
> +		unsigned long ctx;
> +		void *buf;
> +		int off;
> +
> +		ctx = (unsigned long)virtqueue_get_buf(rq->vq, &buflen);
> +		if (unlikely(!ctx))
> +			goto err_buf;
> +
> +		buf = mergeable_ctx_to_buf_address(ctx);
> +		p = virt_to_head_page(buf);
> +		off = buf - page_address(p);
> +
> +		memcpy(page_address(page) + page_off,
> +		       page_address(p) + off, buflen);
> +		page_off += buflen;

Could malicious user potentially submit a frame bigger than MTU?

> +	}
> +
> +	*len = page_off;
> +	return page;
> +err_buf:
> +	__free_pages(page, 0);
> +	return NULL;
> +}
> +
>  static struct sk_buff *receive_mergeable(struct net_device *dev,
>  					 struct virtnet_info *vi,
>  					 struct receive_queue *rq,
> @@ -469,21 +519,37 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
>  	rcu_read_lock();
>  	xdp_prog = rcu_dereference(rq->xdp_prog);
>  	if (xdp_prog) {
> +		struct page *xdp_page;
>  		u32 act;
>  
>  		if (num_buf > 1) {
>  			bpf_warn_invalid_xdp_buffer();
> -			goto err_xdp;
> +
> +			/* linearize data for XDP */
> +			xdp_page = xdp_linearize_page(rq, num_buf,
> +						      page, offset, &len);
> +			if (!xdp_page)
> +				goto err_xdp;
> +			offset = len;
> +		} else {
> +			xdp_page = page;
>  		}
>  
> -		act = do_xdp_prog(vi, xdp_prog, page, offset, len);
> +		act = do_xdp_prog(vi, xdp_prog, xdp_page, offset, len);
>  		switch (act) {
>  		case XDP_PASS:
> +			if (unlikely(xdp_page != page))
> +				__free_pages(xdp_page, 0);
>  			break;
>  		case XDP_TX:
> +			if (unlikely(xdp_page != page))
> +				goto err_xdp;
> +			rcu_read_unlock();

Only if there is a reason for v4 - this unlock could go to the previous
patch.

^ permalink raw reply

* [PATCH net 7/7] net: ethernet: stmmac: fix of-node and fixed-link-phydev leaks
From: Johan Hovold @ 2016-11-30 14:29 UTC (permalink / raw)
  To: David S. Miller
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Joachim Eastwood,
	Carlo Caione, Kevin Hilman, Maxime Coquelin, Maxime Ripard,
	Chen-Yu Tsai, netdev, linux-kernel, Johan Hovold
In-Reply-To: <1480516195-27696-1-git-send-email-johan@kernel.org>

Make sure to deregister and free any fixed-link phy registered during
probe on probe errors and on driver unbind by adding a new glue helper
function.

Drop the of-node reference taken in the same path also on late probe
errors (and not just on driver unbind) by moving the put from
stmmac_dvr_remove() to the new helper.

Fixes: 277323814e49 ("stmmac: add fixed-link device-tree support")
Fixes: 4613b279bee7 ("ethernet: stmicro: stmmac: add missing of_node_put
after calling of_parse_phandle")
Signed-off-by: Johan Hovold <johan@kernel.org>
---
 .../net/ethernet/stmicro/stmmac/dwmac-generic.c    |  5 +++-
 .../net/ethernet/stmicro/stmmac/dwmac-ipq806x.c    | 25 +++++++++++++----
 .../net/ethernet/stmicro/stmmac/dwmac-lpc18xx.c    | 17 ++++++++++--
 drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c  | 23 ++++++++++++----
 .../net/ethernet/stmicro/stmmac/dwmac-meson8b.c    | 21 +++++++++-----
 drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c     | 10 +++++--
 .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c    | 12 +++++---
 drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c    | 12 +++++---
 drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c  | 19 +++++++++----
 drivers/net/ethernet/stmicro/stmmac/dwmac-sunxi.c  | 26 +++++++++++++-----
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c  |  1 -
 .../net/ethernet/stmicro/stmmac/stmmac_platform.c  | 32 ++++++++++++++++++++--
 .../net/ethernet/stmicro/stmmac/stmmac_platform.h  |  2 ++
 13 files changed, 156 insertions(+), 49 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c
index 05e46a82cdb1..e6e6c2fcc4b7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c
@@ -50,7 +50,7 @@ static int dwmac_generic_probe(struct platform_device *pdev)
 	if (plat_dat->init) {
 		ret = plat_dat->init(pdev, plat_dat->bsp_priv);
 		if (ret)
-			return ret;
+			goto err_remove_config_dt;
 	}
 
 	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
@@ -62,6 +62,9 @@ static int dwmac_generic_probe(struct platform_device *pdev)
 err_exit:
 	if (plat_dat->exit)
 		plat_dat->exit(pdev, plat_dat->bsp_priv);
+err_remove_config_dt:
+	if (pdev->dev.of_node)
+		stmmac_remove_config_dt(pdev, plat_dat);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
index 36d3355f2fb0..866444b6c82f 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
@@ -271,15 +271,17 @@ static int ipq806x_gmac_probe(struct platform_device *pdev)
 		return PTR_ERR(plat_dat);
 
 	gmac = devm_kzalloc(dev, sizeof(*gmac), GFP_KERNEL);
-	if (!gmac)
-		return -ENOMEM;
+	if (!gmac) {
+		err = -ENOMEM;
+		goto err_remove_config_dt;
+	}
 
 	gmac->pdev = pdev;
 
 	err = ipq806x_gmac_of_parse(gmac);
 	if (err) {
 		dev_err(dev, "device tree parsing error\n");
-		return err;
+		goto err_remove_config_dt;
 	}
 
 	regmap_write(gmac->qsgmii_csr, QSGMII_PCS_CAL_LCKDT_CTL,
@@ -300,7 +302,8 @@ static int ipq806x_gmac_probe(struct platform_device *pdev)
 	default:
 		dev_err(&pdev->dev, "Unsupported PHY mode: \"%s\"\n",
 			phy_modes(gmac->phy_mode));
-		return -EINVAL;
+		err = -EINVAL;
+		goto err_remove_config_dt;
 	}
 	regmap_write(gmac->nss_common, NSS_COMMON_GMAC_CTL(gmac->id), val);
 
@@ -319,7 +322,8 @@ static int ipq806x_gmac_probe(struct platform_device *pdev)
 	default:
 		dev_err(&pdev->dev, "Unsupported PHY mode: \"%s\"\n",
 			phy_modes(gmac->phy_mode));
-		return -EINVAL;
+		err = -EINVAL;
+		goto err_remove_config_dt;
 	}
 	regmap_write(gmac->nss_common, NSS_COMMON_CLK_SRC_CTRL, val);
 
@@ -346,7 +350,16 @@ static int ipq806x_gmac_probe(struct platform_device *pdev)
 	plat_dat->bsp_priv = gmac;
 	plat_dat->fix_mac_speed = ipq806x_gmac_fix_mac_speed;
 
-	return stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	err = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	if (err)
+		goto err_remove_config_dt;
+
+	return 0;
+
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
+
+	return err;
 }
 
 static const struct of_device_id ipq806x_gmac_dwmac_match[] = {
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-lpc18xx.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-lpc18xx.c
index 78e9d1861896..3d3f43d91b98 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-lpc18xx.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-lpc18xx.c
@@ -46,7 +46,8 @@ static int lpc18xx_dwmac_probe(struct platform_device *pdev)
 	reg = syscon_regmap_lookup_by_compatible("nxp,lpc1850-creg");
 	if (IS_ERR(reg)) {
 		dev_err(&pdev->dev, "syscon lookup failed\n");
-		return PTR_ERR(reg);
+		ret = PTR_ERR(reg);
+		goto err_remove_config_dt;
 	}
 
 	if (plat_dat->interface == PHY_INTERFACE_MODE_MII) {
@@ -55,13 +56,23 @@ static int lpc18xx_dwmac_probe(struct platform_device *pdev)
 		ethmode = LPC18XX_CREG_CREG6_ETHMODE_RMII;
 	} else {
 		dev_err(&pdev->dev, "Only MII and RMII mode supported\n");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto err_remove_config_dt;
 	}
 
 	regmap_update_bits(reg, LPC18XX_CREG_CREG6,
 			   LPC18XX_CREG_CREG6_ETHMODE_MASK, ethmode);
 
-	return stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	if (ret)
+		goto err_remove_config_dt;
+
+	return 0;
+
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
+
+	return ret;
 }
 
 static const struct of_device_id lpc18xx_dwmac_match[] = {
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c
index 309d99536a2c..7fdd1760a74c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson.c
@@ -64,18 +64,31 @@ static int meson6_dwmac_probe(struct platform_device *pdev)
 		return PTR_ERR(plat_dat);
 
 	dwmac = devm_kzalloc(&pdev->dev, sizeof(*dwmac), GFP_KERNEL);
-	if (!dwmac)
-		return -ENOMEM;
+	if (!dwmac) {
+		ret = -ENOMEM;
+		goto err_remove_config_dt;
+	}
 
 	res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
 	dwmac->reg = devm_ioremap_resource(&pdev->dev, res);
-	if (IS_ERR(dwmac->reg))
-		return PTR_ERR(dwmac->reg);
+	if (IS_ERR(dwmac->reg)) {
+		ret = PTR_ERR(dwmac->reg);
+		goto err_remove_config_dt;
+	}
 
 	plat_dat->bsp_priv = dwmac;
 	plat_dat->fix_mac_speed = meson6_dwmac_fix_mac_speed;
 
-	return stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	if (ret)
+		goto err_remove_config_dt;
+
+	return 0;
+
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
+
+	return ret;
 }
 
 static const struct of_device_id meson6_dwmac_match[] = {
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index 45e7aaf0170d..ffaed1f35efe 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -264,28 +264,33 @@ static int meson8b_dwmac_probe(struct platform_device *pdev)
 		return PTR_ERR(plat_dat);
 
 	dwmac = devm_kzalloc(&pdev->dev, sizeof(*dwmac), GFP_KERNEL);
-	if (!dwmac)
-		return -ENOMEM;
+	if (!dwmac) {
+		ret = -ENOMEM;
+		goto err_remove_config_dt;
+	}
 
 	res = platform_get_resource(pdev, IORESOURCE_MEM, 1);
 	dwmac->regs = devm_ioremap_resource(&pdev->dev, res);
-	if (IS_ERR(dwmac->regs))
-		return PTR_ERR(dwmac->regs);
+	if (IS_ERR(dwmac->regs)) {
+		ret = PTR_ERR(dwmac->regs);
+		goto err_remove_config_dt;
+	}
 
 	dwmac->pdev = pdev;
 	dwmac->phy_mode = of_get_phy_mode(pdev->dev.of_node);
 	if (dwmac->phy_mode < 0) {
 		dev_err(&pdev->dev, "missing phy-mode property\n");
-		return -EINVAL;
+		ret = -EINVAL;
+		goto err_remove_config_dt;
 	}
 
 	ret = meson8b_init_clk(dwmac);
 	if (ret)
-		return ret;
+		goto err_remove_config_dt;
 
 	ret = meson8b_init_prg_eth(dwmac);
 	if (ret)
-		return ret;
+		goto err_remove_config_dt;
 
 	plat_dat->bsp_priv = dwmac;
 
@@ -297,6 +302,8 @@ static int meson8b_dwmac_probe(struct platform_device *pdev)
 
 err_clk_disable:
 	clk_disable_unprepare(dwmac->m25_div_clk);
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
index e7aabe56c15a..d80c88bd2bba 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-rk.c
@@ -981,12 +981,14 @@ static int rk_gmac_probe(struct platform_device *pdev)
 	plat_dat->resume = rk_gmac_resume;
 
 	plat_dat->bsp_priv = rk_gmac_setup(pdev, data);
-	if (IS_ERR(plat_dat->bsp_priv))
-		return PTR_ERR(plat_dat->bsp_priv);
+	if (IS_ERR(plat_dat->bsp_priv)) {
+		ret = PTR_ERR(plat_dat->bsp_priv);
+		goto err_remove_config_dt;
+	}
 
 	ret = rk_gmac_init(pdev, plat_dat->bsp_priv);
 	if (ret)
-		return ret;
+		goto err_remove_config_dt;
 
 	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
 	if (ret)
@@ -996,6 +998,8 @@ static int rk_gmac_probe(struct platform_device *pdev)
 
 err_gmac_exit:
 	rk_gmac_exit(pdev, plat_dat->bsp_priv);
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
index 47db157da3e8..0c420e97de1e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
@@ -316,13 +316,15 @@ static int socfpga_dwmac_probe(struct platform_device *pdev)
 		return PTR_ERR(plat_dat);
 
 	dwmac = devm_kzalloc(dev, sizeof(*dwmac), GFP_KERNEL);
-	if (!dwmac)
-		return -ENOMEM;
+	if (!dwmac) {
+		ret = -ENOMEM;
+		goto err_remove_config_dt;
+	}
 
 	ret = socfpga_dwmac_parse_data(dwmac, dev);
 	if (ret) {
 		dev_err(dev, "Unable to parse OF data\n");
-		return ret;
+		goto err_remove_config_dt;
 	}
 
 	plat_dat->bsp_priv = dwmac;
@@ -330,7 +332,7 @@ static int socfpga_dwmac_probe(struct platform_device *pdev)
 
 	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
 	if (ret)
-		return ret;
+		goto err_remove_config_dt;
 
 	ndev = platform_get_drvdata(pdev);
 	stpriv = netdev_priv(ndev);
@@ -349,6 +351,8 @@ static int socfpga_dwmac_probe(struct platform_device *pdev)
 
 err_dvr_remove:
 	stmmac_dvr_remove(&pdev->dev);
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c
index a1ce018bf844..060b98c37a85 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sti.c
@@ -345,13 +345,15 @@ static int sti_dwmac_probe(struct platform_device *pdev)
 		return PTR_ERR(plat_dat);
 
 	dwmac = devm_kzalloc(&pdev->dev, sizeof(*dwmac), GFP_KERNEL);
-	if (!dwmac)
-		return -ENOMEM;
+	if (!dwmac) {
+		ret = -ENOMEM;
+		goto err_remove_config_dt;
+	}
 
 	ret = sti_dwmac_parse_data(dwmac, pdev);
 	if (ret) {
 		dev_err(&pdev->dev, "Unable to parse OF data\n");
-		return ret;
+		goto err_remove_config_dt;
 	}
 
 	dwmac->fix_retime_src = data->fix_retime_src;
@@ -363,7 +365,7 @@ static int sti_dwmac_probe(struct platform_device *pdev)
 
 	ret = sti_dwmac_init(pdev, plat_dat->bsp_priv);
 	if (ret)
-		return ret;
+		goto err_remove_config_dt;
 
 	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
 	if (ret)
@@ -373,6 +375,8 @@ static int sti_dwmac_probe(struct platform_device *pdev)
 
 err_dwmac_exit:
 	sti_dwmac_exit(pdev, plat_dat->bsp_priv);
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c
index e5a926b8bee7..61cb24810d10 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-stm32.c
@@ -107,24 +107,33 @@ static int stm32_dwmac_probe(struct platform_device *pdev)
 		return PTR_ERR(plat_dat);
 
 	dwmac = devm_kzalloc(&pdev->dev, sizeof(*dwmac), GFP_KERNEL);
-	if (!dwmac)
-		return -ENOMEM;
+	if (!dwmac) {
+		ret = -ENOMEM;
+		goto err_remove_config_dt;
+	}
 
 	ret = stm32_dwmac_parse_data(dwmac, &pdev->dev);
 	if (ret) {
 		dev_err(&pdev->dev, "Unable to parse OF data\n");
-		return ret;
+		goto err_remove_config_dt;
 	}
 
 	plat_dat->bsp_priv = dwmac;
 
 	ret = stm32_dwmac_init(plat_dat);
 	if (ret)
-		return ret;
+		goto err_remove_config_dt;
 
 	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
 	if (ret)
-		stm32_dwmac_clk_disable(dwmac);
+		goto err_clk_disable;
+
+	return 0;
+
+err_clk_disable:
+	stm32_dwmac_clk_disable(dwmac);
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-sunxi.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-sunxi.c
index adff46375a32..d07520fb969e 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-sunxi.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-sunxi.c
@@ -120,22 +120,27 @@ static int sun7i_gmac_probe(struct platform_device *pdev)
 		return PTR_ERR(plat_dat);
 
 	gmac = devm_kzalloc(dev, sizeof(*gmac), GFP_KERNEL);
-	if (!gmac)
-		return -ENOMEM;
+	if (!gmac) {
+		ret = -ENOMEM;
+		goto err_remove_config_dt;
+	}
 
 	gmac->interface = of_get_phy_mode(dev->of_node);
 
 	gmac->tx_clk = devm_clk_get(dev, "allwinner_gmac_tx");
 	if (IS_ERR(gmac->tx_clk)) {
 		dev_err(dev, "could not get tx clock\n");
-		return PTR_ERR(gmac->tx_clk);
+		ret = PTR_ERR(gmac->tx_clk);
+		goto err_remove_config_dt;
 	}
 
 	/* Optional regulator for PHY */
 	gmac->regulator = devm_regulator_get_optional(dev, "phy");
 	if (IS_ERR(gmac->regulator)) {
-		if (PTR_ERR(gmac->regulator) == -EPROBE_DEFER)
-			return -EPROBE_DEFER;
+		if (PTR_ERR(gmac->regulator) == -EPROBE_DEFER) {
+			ret = -EPROBE_DEFER;
+			goto err_remove_config_dt;
+		}
 		dev_info(dev, "no regulator found\n");
 		gmac->regulator = NULL;
 	}
@@ -151,11 +156,18 @@ static int sun7i_gmac_probe(struct platform_device *pdev)
 
 	ret = sun7i_gmac_init(pdev, plat_dat->bsp_priv);
 	if (ret)
-		return ret;
+		goto err_remove_config_dt;
 
 	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
 	if (ret)
-		sun7i_gmac_exit(pdev, plat_dat->bsp_priv);
+		goto err_gmac_exit;
+
+	return 0;
+
+err_gmac_exit:
+	sun7i_gmac_exit(pdev, plat_dat->bsp_priv);
+err_remove_config_dt:
+	stmmac_remove_config_dt(pdev, plat_dat);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 1f9ec02fa7f8..caf069a465f2 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -3416,7 +3416,6 @@ int stmmac_dvr_remove(struct device *dev)
 	stmmac_set_mac(priv->ioaddr, false);
 	netif_carrier_off(ndev);
 	unregister_netdev(ndev);
-	of_node_put(priv->plat->phy_node);
 	if (priv->stmmac_rst)
 		reset_control_assert(priv->stmmac_rst);
 	clk_disable_unprepare(priv->pclk);
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
index bcbf123d5ba2..a840818bf4df 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.c
@@ -305,7 +305,7 @@ stmmac_probe_config_dt(struct platform_device *pdev, const char **mac)
 		dma_cfg = devm_kzalloc(&pdev->dev, sizeof(*dma_cfg),
 				       GFP_KERNEL);
 		if (!dma_cfg) {
-			of_node_put(plat->phy_node);
+			stmmac_remove_config_dt(pdev, plat);
 			return ERR_PTR(-ENOMEM);
 		}
 		plat->dma_cfg = dma_cfg;
@@ -328,14 +328,37 @@ stmmac_probe_config_dt(struct platform_device *pdev, const char **mac)
 
 	return plat;
 }
+
+/**
+ * stmmac_remove_config_dt - undo the effects of stmmac_probe_config_dt()
+ * @pdev: platform_device structure
+ * @plat: driver data platform structure
+ *
+ * Release resources claimed by stmmac_probe_config_dt().
+ */
+void stmmac_remove_config_dt(struct platform_device *pdev,
+			     struct plat_stmmacenet_data *plat)
+{
+	struct device_node *np = pdev->dev.of_node;
+
+	if (of_phy_is_fixed_link(np))
+		of_phy_deregister_fixed_link(np);
+	of_node_put(plat->phy_node);
+}
 #else
 struct plat_stmmacenet_data *
 stmmac_probe_config_dt(struct platform_device *pdev, const char **mac)
 {
 	return ERR_PTR(-ENOSYS);
 }
+
+void stmmac_remove_config_dt(struct platform_device *pdev,
+			     struct plat_stmmacenet_data *plat)
+{
+}
 #endif /* CONFIG_OF */
 EXPORT_SYMBOL_GPL(stmmac_probe_config_dt);
+EXPORT_SYMBOL_GPL(stmmac_remove_config_dt);
 
 int stmmac_get_platform_resources(struct platform_device *pdev,
 				  struct stmmac_resources *stmmac_res)
@@ -391,10 +414,13 @@ int stmmac_pltfr_remove(struct platform_device *pdev)
 {
 	struct net_device *ndev = platform_get_drvdata(pdev);
 	struct stmmac_priv *priv = netdev_priv(ndev);
+	struct plat_stmmacenet_data *plat = priv->plat;
 	int ret = stmmac_dvr_remove(&pdev->dev);
 
-	if (priv->plat->exit)
-		priv->plat->exit(pdev, priv->plat->bsp_priv);
+	if (plat->exit)
+		plat->exit(pdev, plat->bsp_priv);
+
+	stmmac_remove_config_dt(pdev, plat);
 
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
index 64e147f53a9c..b72eb0de57b7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_platform.h
@@ -23,6 +23,8 @@
 
 struct plat_stmmacenet_data *
 stmmac_probe_config_dt(struct platform_device *pdev, const char **mac);
+void stmmac_remove_config_dt(struct platform_device *pdev,
+			     struct plat_stmmacenet_data *plat);
 
 int stmmac_get_platform_resources(struct platform_device *pdev,
 				  struct stmmac_resources *stmmac_res);
-- 
2.7.3

^ permalink raw reply related

* [PATCH net 5/7] net: ethernet: stmmac: dwmac-meson8b: fix probe error path
From: Johan Hovold @ 2016-11-30 14:29 UTC (permalink / raw)
  To: David S. Miller
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Joachim Eastwood,
	Carlo Caione, Kevin Hilman, Maxime Coquelin, Maxime Ripard,
	Chen-Yu Tsai, netdev, linux-kernel, Johan Hovold
In-Reply-To: <1480516195-27696-1-git-send-email-johan@kernel.org>

Make sure to disable clocks before returning on late probe errors.

Fixes: 566e82516253 ("net: stmmac: add a glue driver for the Amlogic
Meson 8b / GXBB DWMAC")
Signed-off-by: Johan Hovold <johan@kernel.org>
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
index 250e4ceafc8d..45e7aaf0170d 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-meson8b.c
@@ -289,7 +289,16 @@ static int meson8b_dwmac_probe(struct platform_device *pdev)
 
 	plat_dat->bsp_priv = dwmac;
 
-	return stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	if (ret)
+		goto err_clk_disable;
+
+	return 0;
+
+err_clk_disable:
+	clk_disable_unprepare(dwmac->m25_div_clk);
+
+	return ret;
 }
 
 static int meson8b_dwmac_remove(struct platform_device *pdev)
-- 
2.7.3

^ permalink raw reply related

* [PATCH net 4/7] net: ethernet: stmmac: dwmac-generic: fix probe error path
From: Johan Hovold @ 2016-11-30 14:29 UTC (permalink / raw)
  To: David S. Miller
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Joachim Eastwood,
	Carlo Caione, Kevin Hilman, Maxime Coquelin, Maxime Ripard,
	Chen-Yu Tsai, netdev, linux-kernel, Johan Hovold
In-Reply-To: <1480516195-27696-1-git-send-email-johan@kernel.org>

Make sure to call any exit() callback to undo the effect of init()
before returning on late probe errors.

Fixes: cf3f047b9af4 ("stmmac: move hw init in the probe (v2)")
Signed-off-by: Johan Hovold <johan@kernel.org>
---
 drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c | 12 +++++++++++-
 1 file changed, 11 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c
index b1e5f24708c9..05e46a82cdb1 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-generic.c
@@ -53,7 +53,17 @@ static int dwmac_generic_probe(struct platform_device *pdev)
 			return ret;
 	}
 
-	return stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	if (ret)
+		goto err_exit;
+
+	return 0;
+
+err_exit:
+	if (plat_dat->exit)
+		plat_dat->exit(pdev, plat_dat->bsp_priv);
+
+	return ret;
 }
 
 static const struct of_device_id dwmac_generic_match[] = {
-- 
2.7.3

^ permalink raw reply related

* [PATCH net 1/7] net: ethernet: stmmac: dwmac-socfpga: fix use-after-free on probe errors
From: Johan Hovold @ 2016-11-30 14:29 UTC (permalink / raw)
  To: David S. Miller
  Cc: Giuseppe Cavallaro, Alexandre Torgue, Joachim Eastwood,
	Carlo Caione, Kevin Hilman, Maxime Coquelin, Maxime Ripard,
	Chen-Yu Tsai, netdev, linux-kernel, Johan Hovold
In-Reply-To: <1480516195-27696-1-git-send-email-johan@kernel.org>

Make sure to call stmmac_dvr_remove() before returning on late probe
errors so that memory is freed, clocks are disabled, and the netdev is
deregistered before its resources go away.

Fixes: 3c201b5a84ed ("net: stmmac: socfpga: Remove re-registration of
reset controller")
Signed-off-by: Johan Hovold <johan@kernel.org>
---
 .../net/ethernet/stmicro/stmmac/dwmac-socfpga.c    | 29 ++++++++++++++--------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
index bec6963ac71e..47db157da3e8 100644
--- a/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
+++ b/drivers/net/ethernet/stmicro/stmmac/dwmac-socfpga.c
@@ -304,6 +304,8 @@ static int socfpga_dwmac_probe(struct platform_device *pdev)
 	struct device		*dev = &pdev->dev;
 	int			ret;
 	struct socfpga_dwmac	*dwmac;
+	struct net_device	*ndev;
+	struct stmmac_priv	*stpriv;
 
 	ret = stmmac_get_platform_resources(pdev, &stmmac_res);
 	if (ret)
@@ -327,19 +329,26 @@ static int socfpga_dwmac_probe(struct platform_device *pdev)
 	plat_dat->fix_mac_speed = socfpga_dwmac_fix_mac_speed;
 
 	ret = stmmac_dvr_probe(&pdev->dev, plat_dat, &stmmac_res);
+	if (ret)
+		return ret;
 
-	if (!ret) {
-		struct net_device *ndev = platform_get_drvdata(pdev);
-		struct stmmac_priv *stpriv = netdev_priv(ndev);
+	ndev = platform_get_drvdata(pdev);
+	stpriv = netdev_priv(ndev);
 
-		/* The socfpga driver needs to control the stmmac reset to
-		 * set the phy mode. Create a copy of the core reset handel
-		 * so it can be used by the driver later.
-		 */
-		dwmac->stmmac_rst = stpriv->stmmac_rst;
+	/* The socfpga driver needs to control the stmmac reset to set the phy
+	 * mode. Create a copy of the core reset handle so it can be used by
+	 * the driver later.
+	 */
+	dwmac->stmmac_rst = stpriv->stmmac_rst;
 
-		ret = socfpga_dwmac_set_phy_mode(dwmac);
-	}
+	ret = socfpga_dwmac_set_phy_mode(dwmac);
+	if (ret)
+		goto err_dvr_remove;
+
+	return 0;
+
+err_dvr_remove:
+	stmmac_dvr_remove(&pdev->dev);
 
 	return ret;
 }
-- 
2.7.3

^ permalink raw reply related

* [PATCH 2/2] net: ethernet: altera_tse: add support for SGMII PCS
From: Neill Whillans @ 2016-11-30 13:41 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-kernel, nios2-dev, vbridger,
	f.fainelli
In-Reply-To: <cover.1480497966.git.neill.whillans@codethink.co.uk>

Add support for the (optional) SGMII PCS functionality of the Altera
TSE MAC. If the phy-mode is set to 'sgmii' then we attempt to discover
and initialise the PCS so that the MAC can communicate to the PHY.

The PCS IP block provides a scratch register for testing presence of
the PCS, which is mapped into one of the two MDIO spaces present in
the MAC's register space.  Once we have determined that the scratch
register is functioning, we attempt to initialise the PCS to
auto-negotiate an SGMII link with the PHY. There is no need to monitor
or manage the SGMII link beyond this, since the normal PHY MDIO will
then be used to monitor the media layer.

The Altera TSE MAC has only one way in which it can be configured with an
SGMII PCS, and as such, this patch only looks to the phy-mode to select
whether or not to attempt to initialise the PCS registers.  During
initialisation, we report the PCS's equivalent of a PHY ID register.
This can be parameterised during the IP instantiation and is often left
as '0x00000000' which is not an error.

Signed-off-by: Neill Whillans <neill.whillans@codethink.co.uk>
Reviewed-by: Daniel Silverstone <daniel.silverstone@codethink.co.uk>
---
 drivers/net/ethernet/altera/altera_tse.h      | 11 ++++
 drivers/net/ethernet/altera/altera_tse_main.c | 91 +++++++++++++++++++++++++++
 2 files changed, 102 insertions(+)

diff --git a/drivers/net/ethernet/altera/altera_tse.h b/drivers/net/ethernet/altera/altera_tse.h
index e005200..446e778 100644
--- a/drivers/net/ethernet/altera/altera_tse.h
+++ b/drivers/net/ethernet/altera/altera_tse.h
@@ -120,6 +120,17 @@
 #define MAC_CMDCFG_DISABLE_READ_TIMEOUT_GET(v)	GET_BIT_VALUE(v, 27)
 #define MAC_CMDCFG_CNT_RESET_GET(v)		GET_BIT_VALUE(v, 31)
 
+/* SGMII PCS register addresses
+ */
+#define SGMII_PCS_SCRATCH	0x10
+#define SGMII_PCS_REV		0x11
+#define SGMII_PCS_LINK_TIMER_0	0x12
+#define SGMII_PCS_LINK_TIMER_1	0x13
+#define SGMII_PCS_IF_MODE	0x14
+#define SGMII_PCS_DIS_READ_TO	0x15
+#define SGMII_PCS_READ_TO	0x16
+#define SGMII_PCS_SW_RESET_TIMEOUT 100 /* usecs */
+
 /* MDIO registers within MAC register Space
  */
 struct altera_tse_mdio {
diff --git a/drivers/net/ethernet/altera/altera_tse_main.c b/drivers/net/ethernet/altera/altera_tse_main.c
index bda31f3..afecdd1 100644
--- a/drivers/net/ethernet/altera/altera_tse_main.c
+++ b/drivers/net/ethernet/altera/altera_tse_main.c
@@ -37,6 +37,7 @@
 #include <linux/io.h>
 #include <linux/kernel.h>
 #include <linux/module.h>
+#include <linux/mii.h>
 #include <linux/netdevice.h>
 #include <linux/of_device.h>
 #include <linux/of_mdio.h>
@@ -96,6 +97,27 @@ static inline u32 tse_tx_avail(struct altera_tse_private *priv)
 	return priv->tx_cons + priv->tx_ring_size - priv->tx_prod - 1;
 }
 
+/* PCS Register read/write functions
+ */
+static u16 sgmii_pcs_read(struct altera_tse_private *priv, int regnum)
+{
+	return csrrd32(priv->mac_dev,
+		       tse_csroffs(mdio_phy0) + regnum * 4) & 0xffff;
+}
+
+static void sgmii_pcs_write(struct altera_tse_private *priv, int regnum,
+				u16 value)
+{
+	csrwr32(value, priv->mac_dev, tse_csroffs(mdio_phy0) + regnum * 4);
+}
+
+/* Check PCS scratch memory */
+static int sgmii_pcs_scratch_test(struct altera_tse_private *priv, u16 value)
+{
+	sgmii_pcs_write(priv, SGMII_PCS_SCRATCH, value);
+	return (sgmii_pcs_read(priv, SGMII_PCS_SCRATCH) == value);
+}
+
 /* MDIO specific functions
  */
 static int altera_tse_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
@@ -1092,6 +1114,66 @@ static void tse_set_rx_mode(struct net_device *dev)
 	spin_unlock(&priv->mac_cfg_lock);
 }
 
+/* Initialise (if necessary) the SGMII PCS component
+ */
+static int init_sgmii_pcs(struct net_device *dev)
+{
+	struct altera_tse_private *priv = netdev_priv(dev);
+	int n;
+	unsigned int tmp_reg = 0;
+
+	if (priv->phy_iface != PHY_INTERFACE_MODE_SGMII)
+		return 0; /* Nothing to do, not in SGMII mode */
+
+	/* The TSE SGMII PCS block looks a little like a PHY, it is
+	 * mapped into the zeroth MDIO space of the MAC and it has
+	 * ID registers like a PHY would.  Sadly this is often
+	 * configured to zeroes, so don't be surprised if it does
+	 * show 0x00000000.
+	 */
+
+	if (sgmii_pcs_scratch_test(priv, 0x0000) &&
+		sgmii_pcs_scratch_test(priv, 0xffff) &&
+		sgmii_pcs_scratch_test(priv, 0xa5a5) &&
+		sgmii_pcs_scratch_test(priv, 0x5a5a)) {
+		netdev_info(dev, "PCS PHY ID: 0x%04x%04x\n",
+				sgmii_pcs_read(priv, MII_PHYSID1),
+				sgmii_pcs_read(priv, MII_PHYSID2));
+	} else {
+		netdev_err(dev, "SGMII PCS Scratch memory test failed.\n");
+		return -ENOMEM;
+	}
+
+	/* Starting on page 5-29 of the MegaCore Function User Guide
+	 * Set SGMII Link timer to 1.6ms
+	 */
+	sgmii_pcs_write(priv, SGMII_PCS_LINK_TIMER_0, 0x0D40);
+	sgmii_pcs_write(priv, SGMII_PCS_LINK_TIMER_1, 0x03);
+
+	/* Enable SGMII Interface and Enable SGMII Auto Negotiation */
+	sgmii_pcs_write(priv, SGMII_PCS_IF_MODE, 0x3);
+
+	/* Enable Autonegotiation */
+	tmp_reg = sgmii_pcs_read(priv, MII_BMCR);
+	tmp_reg |= (BMCR_SPEED1000 | BMCR_FULLDPLX | BMCR_ANENABLE);
+	sgmii_pcs_write(priv, MII_BMCR, tmp_reg);
+
+	/* Reset PCS block */
+	tmp_reg |= BMCR_RESET;
+	sgmii_pcs_write(priv, MII_BMCR, tmp_reg);
+	for (n = 0; n < SGMII_PCS_SW_RESET_TIMEOUT; n++) {
+		if (!(sgmii_pcs_read(priv, MII_BMCR) & BMCR_RESET)) {
+			netdev_info(dev, "SGMII PCS block initialised OK\n");
+			return 0;
+		}
+		udelay(1);
+	}
+
+	/* We failed to reset the block, return a timeout */
+	netdev_err(dev, "SGMII PCS block reset failed.\n");
+	return -ETIMEDOUT;
+}
+
 /* Open and initialize the interface
  */
 static int tse_open(struct net_device *dev)
@@ -1116,6 +1198,15 @@ static int tse_open(struct net_device *dev)
 		netdev_warn(dev, "TSE revision %x\n", priv->revision);
 
 	spin_lock(&priv->mac_cfg_lock);
+	/* no-op if MAC not operating in SGMII mode*/
+	ret = init_sgmii_pcs(dev);
+	if (ret) {
+		netdev_err(dev,
+			   "Cannot init the SGMII PCS (error: %d)\n", ret);
+		spin_unlock(&priv->mac_cfg_lock);
+		goto phy_error;
+	}
+
 	ret = reset_mac(priv);
 	/* Note that reset_mac will fail if the clocks are gated by the PHY
 	 * due to the PHY being put into isolation or power down mode.
-- 
2.1.4

^ permalink raw reply related

* [PATCH 1/2] net: phy: vitesse: add support for VSC8572
From: Neill Whillans @ 2016-11-30 13:41 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-kernel, nios2-dev, vbridger,
	f.fainelli
In-Reply-To: <cover.1480497966.git.neill.whillans@codethink.co.uk>

From: Stephen Agate <stephen.agate@uk.thalesgroup.com>

Add support for the Vitesse VSC8572 which is functionally equivalent to
the already supported VSC8574. As such, all the same handling functions
are used since the VSC8572 merely has half the number of phy blocks
internally.

Signed-off-by: Stephen Agate <stephen.agate@uk.thalesgroup.com>
Signed-off-by: Neill Whillans <neill.whillans@codethink.co.uk>
Reviewed-by: Daniel Silverstone <daniel.silverstone@codethink.co.uk>
---
 drivers/net/phy/vitesse.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/drivers/net/phy/vitesse.c b/drivers/net/phy/vitesse.c
index 24b4a09..f78ff02 100644
--- a/drivers/net/phy/vitesse.c
+++ b/drivers/net/phy/vitesse.c
@@ -69,6 +69,7 @@
 #define PHY_ID_VSC8234			0x000fc620
 #define PHY_ID_VSC8244			0x000fc6c0
 #define PHY_ID_VSC8514			0x00070670
+#define PHY_ID_VSC8572			0x000704d0
 #define PHY_ID_VSC8574			0x000704a0
 #define PHY_ID_VSC8601			0x00070420
 #define PHY_ID_VSC8662			0x00070660
@@ -166,6 +167,7 @@ static int vsc82xx_config_intr(struct phy_device *phydev)
 			(phydev->drv->phy_id == PHY_ID_VSC8234 ||
 			 phydev->drv->phy_id == PHY_ID_VSC8244 ||
 			 phydev->drv->phy_id == PHY_ID_VSC8514 ||
+			 phydev->drv->phy_id == PHY_ID_VSC8572 ||
 			 phydev->drv->phy_id == PHY_ID_VSC8574 ||
 			 phydev->drv->phy_id == PHY_ID_VSC8601) ?
 				MII_VSC8244_IMASK_MASK :
@@ -291,6 +293,17 @@ static struct phy_driver vsc82xx_driver[] = {
 	.ack_interrupt	= &vsc824x_ack_interrupt,
 	.config_intr	= &vsc82xx_config_intr,
 }, {
+	.phy_id         = PHY_ID_VSC8572,
+	.name           = "Vitesse VSC8572",
+	.phy_id_mask    = 0x000ffff0,
+	.features       = PHY_GBIT_FEATURES,
+	.flags          = PHY_HAS_INTERRUPT,
+	.config_init    = &vsc824x_config_init,
+	.config_aneg    = &vsc82x4_config_aneg,
+	.read_status    = &genphy_read_status,
+	.ack_interrupt  = &vsc824x_ack_interrupt,
+	.config_intr    = &vsc82xx_config_intr,
+}, {
 	.phy_id         = PHY_ID_VSC8574,
 	.name           = "Vitesse VSC8574",
 	.phy_id_mask    = 0x000ffff0,
@@ -355,6 +368,7 @@ static struct mdio_device_id __maybe_unused vitesse_tbl[] = {
 	{ PHY_ID_VSC8234, 0x000ffff0 },
 	{ PHY_ID_VSC8244, 0x000fffc0 },
 	{ PHY_ID_VSC8514, 0x000ffff0 },
+	{ PHY_ID_VSC8572, 0x000ffff0 },
 	{ PHY_ID_VSC8574, 0x000ffff0 },
 	{ PHY_ID_VSC8662, 0x000ffff0 },
 	{ PHY_ID_VSC8221, 0x000ffff0 },
-- 
2.1.4

^ permalink raw reply related

* Regression: [PATCH] mlx4: give precise rx/tx bytes/packets counters
From: Jesper Dangaard Brouer @ 2016-11-30 14:08 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: brouer, David Miller, netdev, Tariq Toukan
In-Reply-To: <1480088780.8455.543.camel@edumazet-glaptop3.roam.corp.google.com>


On Fri, 25 Nov 2016 07:46:20 -0800 Eric Dumazet <eric.dumazet@gmail.com> wrote:

> From: Eric Dumazet <edumazet@google.com>

Ended up-in net-next as:

 commit 40931b85113dad7881d49e8759e5ad41d30a5e6c
 Author: Eric Dumazet <edumazet@google.com>
 Date:   Fri Nov 25 07:46:20 2016 -0800

    mlx4: give precise rx/tx bytes/packets counters
    
    mlx4 stats are chaotic because a deferred work queue is responsible
    to update them every 250 ms.

Likely after this patch I get this crash (below), when rebooting my machine.
Looks like a device removal order thing.
Tested with net-next at commit 93ba22225504.   

[...]
[ 1967.248453] mlx5_core 0000:02:00.1: Shutdown was called
[ 1967.854556] mlx5_core 0000:02:00.0: Shutdown was called
[ 1968.443015] e1000e: EEE TX LPI TIMER: 00000011
[ 1968.484676] sd 3:0:0:0: [sda] Synchronizing SCSI cache
[ 1968.528354] mlx4_core 0000:01:00.0: mlx4_shutdown was called
[ 1968.534054] mlx4_en: mlx4p1: Close port called
[ 1968.571156] mlx4_en 0000:01:00.0: removed PHC
[ 1968.575677] mlx4_en: mlx4p2: Close port called
[ 1969.506602] BUG: unable to handle kernel NULL pointer dereference at 0000000000000d08
[ 1969.514530] IP: [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
[ 1969.522963] PGD 0 [ 1969.524803] 
[ 1969.526332] Oops: 0000 [#1] PREEMPT SMP
[ 1969.530201] Modules linked in: coretemp kvm_intel kvm irqbypass intel_cstate mxm_wmi i2c_i801 intel_rapl_perf i2c_smbus sg pcspkr i2c_core shpchp nfsd wmi video acpi_pad auth_rpcgss oid_registry nfs_acl lockd grace sunrpc ip_tables x_tables mlx4_en e1000e mlx5_core ptp serio_raw sd_mod mlx4_core pps_core devlink hid_generic
[ 1969.559616] CPU: 3 PID: 3104 Comm: kworker/3:1 Not tainted 4.9.0-rc6-net-next3-01390-g93ba22225504 #12
[ 1969.568984] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P2.10 05/12/2015
[ 1969.578877] Workqueue: events linkwatch_event
[ 1969.583285] task: ffff8803f42a0000 task.stack: ffff88040b2d0000
[ 1969.589238] RIP: 0010:[<ffffffffa0127ca4>]  [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
[ 1969.600102] RSP: 0018:ffff88040b2d3bd8  EFLAGS: 00010282
[ 1969.605442] RAX: ffff8803f432efc8 RBX: ffff8803f4320000 RCX: 0000000000000000
[ 1969.612604] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8803f4320000
[ 1969.619772] RBP: ffff88040b2d3bd8 R08: 000000000000000c R09: ffff8803f432f000
[ 1969.626938] R10: 0000000000000000 R11: ffff88040d64ac00 R12: ffff8803e5aff8dc
[ 1969.634104] R13: ffff8803f4320a28 R14: ffff8803e5aff800 R15: 0000000000000000
[ 1969.641273] FS:  0000000000000000(0000) GS:ffff88041fac0000(0000) knlGS:0000000000000000
[ 1969.649422] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1969.655197] CR2: 0000000000000d08 CR3: 0000000001c07000 CR4: 00000000001406e0
[ 1969.662366] Stack:
[ 1969.664412]  ffff88040b2d3be8 ffffffffa0127f8e ffff88040b2d3c10 ffffffffa012a23b
[ 1969.671948]  ffff8803e5aff8dc ffff8803f4320000 ffff8803f4320000 ffff88040b2d3c30
[ 1969.679478]  ffffffff8160ae29 ffff8803e5aff8d8 ffff8804088ff300 ffff88040b2d3c58
[ 1969.687001] Call Trace:
[ 1969.689484]  [<ffffffffa0127f8e>] mlx4_en_fold_software_stats+0x1e/0x20 [mlx4_en]
[ 1969.697026]  [<ffffffffa012a23b>] mlx4_en_get_stats64+0x2b/0x50 [mlx4_en]
[ 1969.703844]  [<ffffffff8160ae29>] dev_get_stats+0x39/0xa0
[ 1969.709274]  [<ffffffff81622470>] rtnl_fill_stats+0x40/0x130
[ 1969.714968]  [<ffffffff8162631b>] rtnl_fill_ifinfo+0x55b/0x1010
[ 1969.720921]  [<ffffffff816285d3>] rtmsg_ifinfo_build_skb+0x73/0xd0
[ 1969.727136]  [<ffffffff81628646>] rtmsg_ifinfo.part.25+0x16/0x50
[ 1969.733176]  [<ffffffff81628698>] rtmsg_ifinfo+0x18/0x20
[ 1969.738522]  [<ffffffff8160e947>] netdev_state_change+0x47/0x50
[ 1969.744478]  [<ffffffff81629018>] linkwatch_do_dev+0x38/0x50
[ 1969.750170]  [<ffffffff81629257>] __linkwatch_run_queue+0xe7/0x160
[ 1969.756385]  [<ffffffff816292f5>] linkwatch_event+0x25/0x30
[ 1969.761991]  [<ffffffff8107b3cb>] process_one_work+0x15b/0x460
[ 1969.767857]  [<ffffffff8107b71e>] worker_thread+0x4e/0x480
[ 1969.773378]  [<ffffffff8107b6d0>] ? process_one_work+0x460/0x460
[ 1969.779420]  [<ffffffff8107b6d0>] ? process_one_work+0x460/0x460
[ 1969.785460]  [<ffffffff810811ea>] kthread+0xca/0xe0
[ 1969.790372]  [<ffffffff81081120>] ? kthread_worker_fn+0x120/0x120
[ 1969.796495]  [<ffffffff817302d2>] ret_from_fork+0x22/0x30
[ 1969.801924] Code: 00 00 55 48 89 e5 85 d2 0f 84 90 00 00 00 83 ea 01 31 c9 31 f6 48 8d 87 c0 ef 00 00 4c 8d 8c d7 c8 ef 00 00 48 8b 10 48 83 c0 08 <4c> 8b 82 08 0d 00 00 48 8b 92 00 0d 00 00 4c 01 c6 48 01 d1 4c 
[ 1969.821969] RIP  [<ffffffffa0127ca4>] mlx4_en_fold_software_stats.part.1+0x34/0xb0 [mlx4_en]
[ 1969.830486]  RSP <ffff88040b2d3bd8>
[ 1969.834002] CR2: 0000000000000d08
[ 1969.837440] ---[ end trace 80b9fbc1e7baed9b ]---
[ 1969.842102] Kernel panic - not syncing: Fatal exception in interrupt
[ 1969.848520] Kernel Offset: disabled
[ 1969.852050] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
(END)

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* Re: [PATCH net 2/2] macvtap: handle ubuf refcount correctly when meet erros
From: Michael S. Tsirkin @ 2016-11-30 13:58 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, wangyunjian
In-Reply-To: <1480483072-14201-2-git-send-email-jasowang@redhat.com>

On Wed, Nov 30, 2016 at 01:17:52PM +0800, Jason Wang wrote:
> We trigger uarg->callback() immediately after we decide do datacopy
> even if caller want to do zerocopy. This will cause the callback
> (vhost_net_zerocopy_callback) decrease the refcount. But when we meet
> an error afterwards, the error handling in vhost handle_tx() will try
> to decrease it again. This is wrong and fix this by delay the
> uarg->callback() until we're sure there's no errors.
> 
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> The patch is needed for -stable.
> ---
>  drivers/net/macvtap.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> index bceca28..7869b06 100644
> --- a/drivers/net/macvtap.c
> +++ b/drivers/net/macvtap.c
> @@ -742,13 +742,8 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
>  
>  	if (zerocopy)
>  		err = zerocopy_sg_from_iter(skb, from);
> -	else {
> +	else
>  		err = skb_copy_datagram_from_iter(skb, 0, from, len);
> -		if (!err && m && m->msg_control) {
> -			struct ubuf_info *uarg = m->msg_control;
> -			uarg->callback(uarg, false);
> -		}
> -	}
>  
>  	if (err)
>  		goto err_kfree;
> @@ -779,7 +774,11 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
>  		skb_shinfo(skb)->destructor_arg = m->msg_control;
>  		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
>  		skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> +	} else if (m && m->msg_control) {
> +		struct ubuf_info *uarg = m->msg_control;
> +		uarg->callback(uarg, false);
>  	}
> +
>  	if (vlan) {
>  		skb->dev = vlan->dev;
>  		dev_queue_xmit(skb);
> -- 
> 2.7.4

^ permalink raw reply

* Re: [PATCH net 1/2] tun: handle ubuf refcount correctly when meet erros
From: Michael S. Tsirkin @ 2016-11-30 13:57 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, wangyunjian
In-Reply-To: <1480483072-14201-1-git-send-email-jasowang@redhat.com>

On Wed, Nov 30, 2016 at 01:17:51PM +0800, Jason Wang wrote:
> We trigger uarg->callback() immediately after we decide do datacopy
> even if caller want to do zerocopy. This will cause the callback
> (vhost_net_zerocopy_callback) decrease the refcount. But when we meet
> an error afterwards, the error handling in vhost handle_tx() will try
> to decrease it again. This is wrong and fix this by delay the
> uarg->callback() until we're sure there's no errors.
> 
> Reported-by: wangyunjian <wangyunjian@huawei.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> ---
> The patch is needed for -stable.
> ---
>  drivers/net/tun.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index 8093e39..db6acec 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1246,13 +1246,8 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>  
>  	if (zerocopy)
>  		err = zerocopy_sg_from_iter(skb, from);
> -	else {
> +	else
>  		err = skb_copy_datagram_from_iter(skb, 0, from, len);
> -		if (!err && msg_control) {
> -			struct ubuf_info *uarg = msg_control;
> -			uarg->callback(uarg, false);
> -		}
> -	}
>  
>  	if (err) {
>  		this_cpu_inc(tun->pcpu_stats->rx_dropped);
> @@ -1298,6 +1293,9 @@ static ssize_t tun_get_user(struct tun_struct *tun, struct tun_file *tfile,
>  		skb_shinfo(skb)->destructor_arg = msg_control;
>  		skb_shinfo(skb)->tx_flags |= SKBTX_DEV_ZEROCOPY;
>  		skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> +	} else if (msg_control) {
> +		struct ubuf_info *uarg = msg_control;
> +		uarg->callback(uarg, false);
>  	}
>  
>  	skb_reset_network_header(skb);
> -- 
> 2.7.4

^ permalink raw reply

* Re: DSA vs. SWTICHDEV ?
From: Andrew Lunn @ 2016-11-30 13:52 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: netdev@vger.kernel.org
In-Reply-To: <1480495831.3563.135.camel@infinera.com>

On Wed, Nov 30, 2016 at 08:50:34AM +0000, Joakim Tjernlund wrote:
> I am trying to wrap my head around these two "devices" and have a hard time telling them apart.
> We are looking att adding a faily large switch(over PCIe) to our board and from what I can tell
> switchdev is the new way to do it but DSA is still there. Is it possible to just list
> how they differ?

Hi Joakim

If the interface you use to send frames from the host to the switch is
PCIe, you probably want to use switchdev directly.

DSA devices all use a host Ethernet interface to send frames to the
switch. DSA sits under switchdev, and effectively provides a lot of
the common stuff needed for implementing switch drivers of this
sort. It creates the slave interfaces, links the MAC to the PHY, has
one uniform device tree binding which all DSA switches have, deals
with encapsulation/decapsulating frames sent over the master device,
etc.

	    Andrew

^ permalink raw reply

* Re: [PATCH v3 net-next 2/3] openvswitch: Use is_skb_forwardable() for length check.
From: Jiri Benc @ 2016-11-30 13:51 UTC (permalink / raw)
  To: Jarno Rajahalme; +Cc: netdev, pshelar, e
In-Reply-To: <1480462253-114713-2-git-send-email-jarno@ovn.org>

On Tue, 29 Nov 2016 15:30:52 -0800, Jarno Rajahalme wrote:
> @@ -504,11 +485,20 @@ void ovs_vport_send(struct vport *vport, struct sk_buff *skb, u8 mac_proto)
>  		goto drop;
>  	}
>  
> -	if (unlikely(packet_length(skb, vport->dev) > mtu &&
> -		     !skb_is_gso(skb))) {
> -		net_warn_ratelimited("%s: dropped over-mtu packet: %d > %d\n",
> -				     vport->dev->name,
> -				     packet_length(skb, vport->dev), mtu);
> +	if (unlikely(!is_skb_forwardable(vport->dev, skb))) {

How does this work when the vlan tag is accelerated? Then we can be
over MTU, yet the check will pass.

 Jiri

^ permalink raw reply

* Re: [WIP] net+mlx4: auto doorbell
From: Saeed Mahameed @ 2016-11-30 13:50 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesper Dangaard Brouer, Rick Jones, Linux Netdev List,
	Saeed Mahameed, Tariq Toukan
In-Reply-To: <1480402716.18162.124.camel@edumazet-glaptop3.roam.corp.google.com>

On Tue, Nov 29, 2016 at 8:58 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Mon, 2016-11-21 at 10:10 -0800, Eric Dumazet wrote:
>
>
>> Not sure it this has been tried before, but the doorbell avoidance could
>> be done by the driver itself, because it knows a TX completion will come
>> shortly (well... if softirqs are not delayed too much !)
>>
>> Doorbell would be forced only if :
>>
>> (    "skb->xmit_more is not set" AND "TX engine is not 'started yet'" )
>> OR
>> ( too many [1] packets were put in TX ring buffer, no point deferring
>> more)
>>
>> Start the pump, but once it is started, let the doorbells being done by
>> TX completion.
>>
>> ndo_start_xmit and TX completion handler would have to maintain a shared
>> state describing if packets were ready but doorbell deferred.
>>
>>
>> Note that TX completion means "if at least one packet was drained",
>> otherwise busy polling, constantly calling napi->poll() would force a
>> doorbell too soon for devices sharing a NAPI for both RX and TX.
>>
>> But then, maybe busy poll would like to force a doorbell...
>>
>> I could try these ideas on mlx4 shortly.
>>
>>
>> [1] limit could be derived from active "ethtool -c" params, eg tx-frames
>
> I have a WIP, that increases pktgen rate by 75 % on mlx4 when bulking is
> not used.

Hi Eric, Nice Idea indeed and we need something like this,
today we almost don't exploit the TX bulking at all.

But please see below, i am not sure different contexts should share
the doorbell ringing, it is really risky.

>  drivers/net/ethernet/mellanox/mlx4/en_rx.c   |    2
>  drivers/net/ethernet/mellanox/mlx4/en_tx.c   |   90 +++++++++++------
>  drivers/net/ethernet/mellanox/mlx4/mlx4_en.h |    4
>  include/linux/netdevice.h                    |    1
>  net/core/net-sysfs.c                         |   18 +++
>  5 files changed, 83 insertions(+), 32 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_rx.c b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> index 6562f78b07f4..fbea83218fc0 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_rx.c
> @@ -1089,7 +1089,7 @@ int mlx4_en_process_rx_cq(struct net_device *dev, struct mlx4_en_cq *cq, int bud
>
>         if (polled) {
>                 if (doorbell_pending)
> -                       mlx4_en_xmit_doorbell(priv->tx_ring[TX_XDP][cq->ring]);
> +                       mlx4_en_xmit_doorbell(dev, priv->tx_ring[TX_XDP][cq->ring]);
>
>                 mlx4_cq_set_ci(&cq->mcq);
>                 wmb(); /* ensure HW sees CQ consumer before we post new buffers */
> diff --git a/drivers/net/ethernet/mellanox/mlx4/en_tx.c b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> index 4b597dca5c52..affebb435679 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/en_tx.c
> @@ -67,7 +67,7 @@ int mlx4_en_create_tx_ring(struct mlx4_en_priv *priv,
>         ring->size = size;
>         ring->size_mask = size - 1;
>         ring->sp_stride = stride;
> -       ring->full_size = ring->size - HEADROOM - MAX_DESC_TXBBS;
> +       ring->full_size = ring->size - HEADROOM - 2*MAX_DESC_TXBBS;
>
>         tmp = size * sizeof(struct mlx4_en_tx_info);
>         ring->tx_info = kmalloc_node(tmp, GFP_KERNEL | __GFP_NOWARN, node);
> @@ -193,6 +193,7 @@ int mlx4_en_activate_tx_ring(struct mlx4_en_priv *priv,
>         ring->sp_cqn = cq;
>         ring->prod = 0;
>         ring->cons = 0xffffffff;
> +       ring->ncons = 0;
>         ring->last_nr_txbb = 1;
>         memset(ring->tx_info, 0, ring->size * sizeof(struct mlx4_en_tx_info));
>         memset(ring->buf, 0, ring->buf_size);
> @@ -227,9 +228,9 @@ void mlx4_en_deactivate_tx_ring(struct mlx4_en_priv *priv,
>                        MLX4_QP_STATE_RST, NULL, 0, 0, &ring->sp_qp);
>  }
>
> -static inline bool mlx4_en_is_tx_ring_full(struct mlx4_en_tx_ring *ring)
> +static inline bool mlx4_en_is_tx_ring_full(const struct mlx4_en_tx_ring *ring)
>  {
> -       return ring->prod - ring->cons > ring->full_size;
> +       return READ_ONCE(ring->prod) - READ_ONCE(ring->cons) > ring->full_size;
>  }
>
>  static void mlx4_en_stamp_wqe(struct mlx4_en_priv *priv,
> @@ -374,6 +375,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>
>         /* Skip last polled descriptor */
>         ring->cons += ring->last_nr_txbb;
> +       ring->ncons += ring->last_nr_txbb;
>         en_dbg(DRV, priv, "Freeing Tx buf - cons:0x%x prod:0x%x\n",
>                  ring->cons, ring->prod);
>
> @@ -389,6 +391,7 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>                                                 !!(ring->cons & ring->size), 0,
>                                                 0 /* Non-NAPI caller */);
>                 ring->cons += ring->last_nr_txbb;
> +               ring->ncons += ring->last_nr_txbb;
>                 cnt++;
>         }
>
> @@ -401,6 +404,38 @@ int mlx4_en_free_tx_buf(struct net_device *dev, struct mlx4_en_tx_ring *ring)
>         return cnt;
>  }
>
> +void mlx4_en_xmit_doorbell(const struct net_device *dev,
> +                          struct mlx4_en_tx_ring *ring)
> +{
> +
> +       if (dev->doorbell_opt & 1) {
> +               u32 oval = READ_ONCE(ring->prod_bell);
> +               u32 nval = READ_ONCE(ring->prod);
> +
> +               if (oval == nval)
> +                       return;
> +
> +               /* I can not tell yet if a cmpxchg() is needed or not */
> +               if (dev->doorbell_opt & 2)
> +                       WRITE_ONCE(ring->prod_bell, nval);
> +               else
> +                       if (cmpxchg(&ring->prod_bell, oval, nval) != oval)
> +                               return;
> +       }
> +       /* Since there is no iowrite*_native() that writes the
> +        * value as is, without byteswapping - using the one
> +        * the doesn't do byteswapping in the relevant arch
> +        * endianness.
> +        */
> +#if defined(__LITTLE_ENDIAN)
> +       iowrite32(
> +#else
> +       iowrite32be(
> +#endif
> +                 ring->doorbell_qpn,
> +                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> +}
> +
>  static bool mlx4_en_process_tx_cq(struct net_device *dev,
>                                   struct mlx4_en_cq *cq, int napi_budget)
>  {
> @@ -496,8 +531,13 @@ static bool mlx4_en_process_tx_cq(struct net_device *dev,
>         wmb();
>
>         /* we want to dirty this cache line once */
> -       ACCESS_ONCE(ring->last_nr_txbb) = last_nr_txbb;
> -       ACCESS_ONCE(ring->cons) = ring_cons + txbbs_skipped;
> +       WRITE_ONCE(ring->last_nr_txbb, last_nr_txbb);
> +       ring_cons += txbbs_skipped;
> +       WRITE_ONCE(ring->cons, ring_cons);
> +       WRITE_ONCE(ring->ncons, ring_cons + last_nr_txbb);
> +
> +       if (dev->doorbell_opt)
> +               mlx4_en_xmit_doorbell(dev, ring);
>
>         if (ring->free_tx_desc == mlx4_en_recycle_tx_desc)
>                 return done < budget;
> @@ -725,29 +765,14 @@ static void mlx4_bf_copy(void __iomem *dst, const void *src,
>         __iowrite64_copy(dst, src, bytecnt / 8);
>  }
>
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring)
> -{
> -       wmb();

you missed/removed this "wmb()" in the new function, why ? where did
you compensate for it ?

> -       /* Since there is no iowrite*_native() that writes the
> -        * value as is, without byteswapping - using the one
> -        * the doesn't do byteswapping in the relevant arch
> -        * endianness.
> -        */
> -#if defined(__LITTLE_ENDIAN)
> -       iowrite32(
> -#else
> -       iowrite32be(
> -#endif
> -                 ring->doorbell_qpn,
> -                 ring->bf.uar->map + MLX4_SEND_DOORBELL);
> -}
>
>  static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                                   struct mlx4_en_tx_desc *tx_desc,
>                                   union mlx4_wqe_qpn_vlan qpn_vlan,
>                                   int desc_size, int bf_index,
>                                   __be32 op_own, bool bf_ok,
> -                                 bool send_doorbell)
> +                                 bool send_doorbell,
> +                                 const struct net_device *dev, int nr_txbb)
>  {
>         tx_desc->ctrl.qpn_vlan = qpn_vlan;
>
> @@ -761,6 +786,7 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>
>                 wmb();
>
> +               ring->prod += nr_txbb;
>                 mlx4_bf_copy(ring->bf.reg + ring->bf.offset, &tx_desc->ctrl,
>                              desc_size);
>
> @@ -773,8 +799,9 @@ static void mlx4_en_tx_write_desc(struct mlx4_en_tx_ring *ring,
>                  */
>                 dma_wmb();
>                 tx_desc->ctrl.owner_opcode = op_own;
> +               ring->prod += nr_txbb;
>                 if (send_doorbell)
> -                       mlx4_en_xmit_doorbell(ring);
> +                       mlx4_en_xmit_doorbell(dev, ring);
>                 else
>                         ring->xmit_more++;
>         }
> @@ -1017,8 +1044,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                         op_own |= cpu_to_be32(MLX4_WQE_CTRL_IIP);
>         }
>
> -       ring->prod += nr_txbb;
> -
>         /* If we used a bounce buffer then copy descriptor back into place */
>         if (unlikely(bounce))
>                 tx_desc = mlx4_en_bounce_to_desc(priv, ring, index, desc_size);
> @@ -1033,6 +1058,14 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>         }
>         send_doorbell = !skb->xmit_more || netif_xmit_stopped(ring->tx_queue);
>
> +       /* Doorbell avoidance : We can omit doorbell if we know a TX completion
> +        * will happen shortly.
> +        */
> +       if (send_doorbell &&
> +           dev->doorbell_opt &&
> +           (s32)(READ_ONCE(ring->prod_bell) - READ_ONCE(ring->ncons)) > 0)

Aelexi already expressed his worries about synchronization, and i
think here (in this exact line) sits the problem,
what about if exactly at this point the TX completion handler just
finished and rang the last doorbell,
you didn't write the new TX descriptor yet (mlx4_en_tx_write_desc), so
the last doorbell from the CQ handler missed it.
even if you wrote the TX descriptor before the doorbell decision here,
you will need a memory barrier to make sure the HW sees
the new packet, which was typically done before ringing the doorbell.

All in all, this is risky business :),  the right way to go is to
force the upper layer to use xmit-more and delay doorbells/use bulking
but from the same context
(xmit routine).  For example see Achiad's suggestion (attached in
Jesper's response), he used stop queue to force the stack to queue up
packets (TX bulking)
which would set xmit-more and will use the next completion to release
the "stopped" ring TXQ rather than hit the doorbell on behalf of it.

> +               send_doorbell = false;
> +
>         real_size = (real_size / 16) & 0x3f;
>
>         bf_ok &= desc_size <= MAX_BF && send_doorbell;
> @@ -1043,7 +1076,7 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, desc_size, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>
>         if (unlikely(stop_queue)) {
>                 /* If queue was emptied after the if (stop_queue) , and before
> @@ -1054,7 +1087,6 @@ netdev_tx_t mlx4_en_xmit(struct sk_buff *skb, struct net_device *dev)
>                  */
>                 smp_rmb();
>
> -               ring_cons = ACCESS_ONCE(ring->cons);
>                 if (unlikely(!mlx4_en_is_tx_ring_full(ring))) {
>                         netif_tx_wake_queue(ring->tx_queue);
>                         ring->wake_queue++;
> @@ -1158,8 +1190,6 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>         rx_ring->xdp_tx++;
>         AVG_PERF_COUNTER(priv->pstats.tx_pktsz_avg, length);
>
> -       ring->prod += nr_txbb;
> -
>         stop_queue = mlx4_en_is_tx_ring_full(ring);
>         send_doorbell = stop_queue ||
>                                 *doorbell_pending > MLX4_EN_DOORBELL_BUDGET;
> @@ -1173,7 +1203,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                 qpn_vlan.fence_size = real_size;
>
>         mlx4_en_tx_write_desc(ring, tx_desc, qpn_vlan, TXBB_SIZE, bf_index,
> -                             op_own, bf_ok, send_doorbell);
> +                             op_own, bf_ok, send_doorbell, dev, nr_txbb);
>         *doorbell_pending = send_doorbell ? 0 : *doorbell_pending + 1;
>
>         return NETDEV_TX_OK;
> diff --git a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> index 574bcbb1b38f..c3fd0deda198 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> +++ b/drivers/net/ethernet/mellanox/mlx4/mlx4_en.h
> @@ -280,6 +280,7 @@ struct mlx4_en_tx_ring {
>          */
>         u32                     last_nr_txbb;
>         u32                     cons;
> +       u32                     ncons;
>         unsigned long           wake_queue;
>         struct netdev_queue     *tx_queue;
>         u32                     (*free_tx_desc)(struct mlx4_en_priv *priv,
> @@ -290,6 +291,7 @@ struct mlx4_en_tx_ring {
>
>         /* cache line used and dirtied in mlx4_en_xmit() */
>         u32                     prod ____cacheline_aligned_in_smp;
> +       u32                     prod_bell;
>         unsigned int            tx_dropped;
>         unsigned long           bytes;
>         unsigned long           packets;
> @@ -699,7 +701,7 @@ netdev_tx_t mlx4_en_xmit_frame(struct mlx4_en_rx_ring *rx_ring,
>                                struct mlx4_en_rx_alloc *frame,
>                                struct net_device *dev, unsigned int length,
>                                int tx_ind, int *doorbell_pending);
> -void mlx4_en_xmit_doorbell(struct mlx4_en_tx_ring *ring);
> +void mlx4_en_xmit_doorbell(const struct net_device *dev, struct mlx4_en_tx_ring *ring);
>  bool mlx4_en_rx_recycle(struct mlx4_en_rx_ring *ring,
>                         struct mlx4_en_rx_alloc *frame);
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 4ffcd874cc20..39565b5425a6 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1816,6 +1816,7 @@ struct net_device {
>         DECLARE_HASHTABLE       (qdisc_hash, 4);
>  #endif
>         unsigned long           tx_queue_len;
> +       unsigned long           doorbell_opt;
>         spinlock_t              tx_global_lock;
>         int                     watchdog_timeo;
>
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index b0c04cf4851d..df05f81f5150 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -367,6 +367,23 @@ static ssize_t gro_flush_timeout_store(struct device *dev,
>  }
>  NETDEVICE_SHOW_RW(gro_flush_timeout, fmt_ulong);
>
> +static int change_doorbell_opt(struct net_device *dev, unsigned long val)
> +{
> +       dev->doorbell_opt = val;
> +       return 0;
> +}
> +
> +static ssize_t doorbell_opt_store(struct device *dev,
> +                                 struct device_attribute *attr,
> +                                 const char *buf, size_t len)
> +{
> +       if (!capable(CAP_NET_ADMIN))
> +               return -EPERM;
> +
> +       return netdev_store(dev, attr, buf, len, change_doorbell_opt);
> +}
> +NETDEVICE_SHOW_RW(doorbell_opt, fmt_ulong);
> +
>  static ssize_t ifalias_store(struct device *dev, struct device_attribute *attr,
>                              const char *buf, size_t len)
>  {
> @@ -531,6 +548,7 @@ static struct attribute *net_class_attrs[] = {
>         &dev_attr_phys_port_name.attr,
>         &dev_attr_phys_switch_id.attr,
>         &dev_attr_proto_down.attr,
> +       &dev_attr_doorbell_opt.attr,
>         NULL,
>  };
>  ATTRIBUTE_GROUPS(net_class);
>
>

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-11-30 13:43 UTC (permalink / raw)
  To: Jason Wang; +Cc: Yunjian Wang, netdev, linux-kernel, caihe
In-Reply-To: <e4ac063d-db9f-ec93-30cd-163375ececfc@redhat.com>

On Wed, Nov 30, 2016 at 09:07:16PM +0800, Jason Wang wrote:
> 
> 
> On 2016年11月30日 20:10, Yunjian Wang wrote:
> > When we meet an error(err=-EBADFD) recvmsg, the error handling in vhost
> > handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
> > 
> > Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> > ---
> >   drivers/vhost/net.c | 3 +++
> >   1 file changed, 3 insertions(+)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 5dc128a..edc470b 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
> >   			pr_debug("Discarded rx packet: "
> >   				 " len %d, expected %zd\n", err, sock_len);
> >   			vhost_discard_vq_desc(vq, headcount);
> > +			/* Don't continue to do, when meet errors. */
> > +			if (err < 0)
> > +				goto out;
> >   			continue;
> >   		}
> >   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> 
> Acked-by: Jason Wang <jasowang@redhat.com>
> 
> We may want to rename vhost_discard_vq_desc() in the future, since it does
> not discard the desc in fact.

To me this looks a bit too aggressive. I would also like commit log
to explain better what is going on, to make sure we are
not just treating the symptoms.
-- 
MST

^ permalink raw reply

* Re: [PATCH net-next v2 3/4] Documentation: net: phy: Add blurb about RGMII
From: Måns Rullgård @ 2016-11-30 13:43 UTC (permalink / raw)
  To: David Laight
  Cc: 'Florian Fainelli', Timur Tabi, netdev@vger.kernel.org,
	davem@davemloft.net, andrew@lunn.ch, sf84@laposte.net,
	martin.blumenstingl@googlemail.com, alexandre.torgue@st.com,
	peppe.cavallaro@st.com, jbrunet@baylibre.com
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DB023015E@AcuExch.aculab.com>

David Laight <David.Laight@ACULAB.COM> writes:

> From: Florian Fainelli
>> Sent: 27 November 2016 23:03
>> Le 27/11/2016  14:24, Timur Tabi a crit :
>> >> + * PHY device drivers in PHYLIB being reusable by nature, being able to
>> >> +   configure correctly a specified delay enables more designs with
>> >> similar delay
>> >> +   requirements to be operate correctly
>> >
>> > Ok, this one I don't know how to fix.  I'm not really sure what you're
>> > trying to say.
>> 
>> What I am trying to say is that once a PHY driver properly configures a
>> delay that you have specified, there is no reason why this is not
>> applicable to other platforms using this same PHY driver.
>
> As has been stated earlier it can depend on the track lengths on the
> board itself.
> (Although 1ns is about 1 foot - so track delays of that length are unlikely.)

There could be a delay element.

-- 
Måns Rullgård

^ permalink raw reply

* [PATCH 0/2] net: Add support for SGMII PCS on Altera TSE MAC
From: Neill Whillans @ 2016-11-30 13:41 UTC (permalink / raw)
  To: netdev, linux-kernel, linux-kernel, nios2-dev, vbridger,
	f.fainelli

These patches were created as part of work to add support for SGMII
PCS functionality to the Altera TSE MAC. Patches are based on 4.9-rc6
git tree.

The first patch in the series adds support for the VSC8572 dual-port
Gigabit Ethernet transceiver, used in integration testing.

The second patch adds support for the SGMII PCS functionality to the
Altera TSE driver.

 
Neill Whillans (1):
  net: ethernet: altera_tse: add support for SGMII PCS

Stephen Agate (1):
  net: phy: vitesse: add support for VSC8572

 drivers/net/ethernet/altera/altera_tse.h      | 11 ++++
 drivers/net/ethernet/altera/altera_tse_main.c | 91 +++++++++++++++++++++++++++
 drivers/net/phy/vitesse.c                     | 14 +++++
 3 files changed, 116 insertions(+)

-- 
2.1.4

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Michael S. Tsirkin @ 2016-11-30 13:40 UTC (permalink / raw)
  To: Yunjian Wang; +Cc: jasowang, netdev, linux-kernel, caihe
In-Reply-To: <1480507857-22976-1-git-send-email-wangyunjian@huawei.com>

On Wed, Nov 30, 2016 at 08:10:57PM +0800, Yunjian Wang wrote:
> When we meet an error(err=-EBADFD) recvmsg,

How do you get EBADFD? Won't vhost_net_rx_peek_head_len
return 0 in this case, breaking the loop?

> the error handling in vhost
> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
> 
> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> ---
>  drivers/vhost/net.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 5dc128a..edc470b 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>  			pr_debug("Discarded rx packet: "
>  				 " len %d, expected %zd\n", err, sock_len);
>  			vhost_discard_vq_desc(vq, headcount);
> +			/* Don't continue to do, when meet errors. */
> +			if (err < 0)
> +				goto out;

You might get e.g. EAGAIN and I think you need to retry
in this case.

>  			continue;
>  		}
>  		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */
> -- 
> 1.9.5.msysgit.1
> 

^ permalink raw reply

* Re: [PATCH net-next v3 2/2] net: phy: bcm7xxx: Plug in support for reading PHY error counters
From: Andrew Lunn @ 2016-11-30 13:37 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: netdev, davem, bcm-kernel-feedback-list, allan.nielsen,
	raju.lakkaraju
In-Reply-To: <20161129175718.20213-3-f.fainelli@gmail.com>

On Tue, Nov 29, 2016 at 09:57:18AM -0800, Florian Fainelli wrote:
> Broadcom BCM7xxx internal PHYs can leverage the Broadcom PHY library
> module PHY error counters helper functions, just implement the
> appropriate PHY driver function calls to do so. We need to allocate some
> storage space for our PHY statistics, and provide it to the Broadcom PHY
> library, so do this in a specific probe function, and slightly wrap the
> get_stats function call.
> 
> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>

Hi Florian

Nice to see another PHY driver make use of this.

Reviewed-by: Andrew Lunn <andrew@lunn.ch>

    Andrew

^ permalink raw reply

* [PATCH] net/rtnetlink: fix attribute name in nlmsg_size() comments
From: Tobias Klauser @ 2016-11-30 13:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Eric Dumazet

Use the correct attribute constant names IFLA_GSO_MAX_{SEGS,SIZE}
instead of IFLA_MAX_GSO_{SEGS,SIZE} for the comments int nlmsg_size().

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
---
 net/core/rtnetlink.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index deb35acbefd0..a6196cf844f6 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -931,8 +931,8 @@ static noinline size_t if_nlmsg_size(const struct net_device *dev,
 	       + nla_total_size(4) /* IFLA_PROMISCUITY */
 	       + nla_total_size(4) /* IFLA_NUM_TX_QUEUES */
 	       + nla_total_size(4) /* IFLA_NUM_RX_QUEUES */
-	       + nla_total_size(4) /* IFLA_MAX_GSO_SEGS */
-	       + nla_total_size(4) /* IFLA_MAX_GSO_SIZE */
+	       + nla_total_size(4) /* IFLA_GSO_MAX_SEGS */
+	       + nla_total_size(4) /* IFLA_GSO_MAX_SIZE */
 	       + nla_total_size(1) /* IFLA_OPERSTATE */
 	       + nla_total_size(1) /* IFLA_LINKMODE */
 	       + nla_total_size(4) /* IFLA_CARRIER_CHANGES */
-- 
2.11.0

^ permalink raw reply related

* Re: [PATCH net] tipc: check minimum bearer MTU
From: Michal Kubecek @ 2016-11-30 13:23 UTC (permalink / raw)
  To: Ying Xue
  Cc: Jon Maloy, David S. Miller, tipc-discussion, netdev, linux-kernel,
	Ben Hutchings, Qian Zhang
In-Reply-To: <4a2da388-d798-11cf-bf2c-84207cae6159@windriver.com>

On Wed, Nov 30, 2016 at 06:28:14PM +0800, Ying Xue wrote:
...
> >diff --git a/net/tipc/bearer.h b/net/tipc/bearer.h
> >index 78892e2f53e3..1a0b7434ec24 100644
> >--- a/net/tipc/bearer.h
> >+++ b/net/tipc/bearer.h
> >@@ -39,6 +39,7 @@
> >
> > #include "netlink.h"
> > #include "core.h"
> >+#include "msg.h"
> > #include <net/genetlink.h>
> >
> > #define MAX_MEDIA	3
> >@@ -59,6 +60,9 @@
> > #define TIPC_MEDIA_TYPE_IB	2
> > #define TIPC_MEDIA_TYPE_UDP	3
> >
> >+/* minimum bearer MTU */
> >+#define TIPC_MIN_BEARER_MTU	(MAX_H_SIZE + INT_H_SIZE)
> >+
> > /**
> >  * struct tipc_media_addr - destination address used by TIPC bearers
> >  * @value: address info (format defined by media)
> >@@ -215,4 +219,13 @@ void tipc_bearer_xmit(struct net *net, u32 bearer_id,
> > void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id,
> > 			 struct sk_buff_head *xmitq);
> >
> >+/* check if device MTU is sufficient for tipc headers */
> >+inline bool tipc_check_mtu(struct net_device *dev, unsigned int reserve)
> 
> It's unnecessary to explicitly declare a function as inline, instead
> please let GCC smartly decide this.

This is a header file. But I just noticed the last change (adding
missing "static" keyword) is missing in the version I sent.

> 
> >+{
> >+	if (dev->mtu >= TIPC_MIN_BEARER_MTU + reserve)
> >+		return false;
> >+	netdev_warn(dev, "MTU too low for tipc bearer\n");
> >+	return true;
> >+}
> >+
> > #endif	/* _TIPC_BEARER_H */
> >diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
> >index 78cab9c5a445..376ed3e3ed46 100644
> >--- a/net/tipc/udp_media.c
> >+++ b/net/tipc/udp_media.c
> >@@ -697,6 +697,11 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b,
> > 		udp_conf.local_ip.s_addr = htonl(INADDR_ANY);
> > 		udp_conf.use_udp_checksums = false;
> > 		ub->ifindex = dev->ifindex;
> >+		if (tipc_check_mtu(dev, sizeof(struct iphdr) +
> >+					sizeof(struct udphdr))) {
> >+			err = -EINVAL;
> >+			goto err;
> >+		}
> 
> For UDP bearer, it seems insufficient for us to check MTU size only
> when UDP bearer is enabled. Meanwhile, we should update MTU size for
> UDP bearer with Path MTU discovery protocol once MTU size is changed
> after bearer is enabled.

I should admit I'm not that familiar with tipc. Do you mean updating
b->mtu in response to PMTU updates of the route used for ub->ubsock?
The way I understand it, it would be certainly useful but it's not
directly related to the security issue this patch addresses as if there
are no updates, b->mtu cannot get too low and there is no risk of
a buffer overflow. In other words, reflecting PMTU updates is something
that can be IMHO left for later.

                                                      Michal Kubecek

^ permalink raw reply

* Re: [PATCH net-next v2 2/4] Documentation: net: phy: Add a paragraph about pause frames/flow control
From: Sebastian Frias @ 2016-11-30 13:20 UTC (permalink / raw)
  To: Florian Fainelli, netdev
  Cc: davem, andrew, martin.blumenstingl, mans, alexandre.torgue,
	peppe.cavallaro, timur, jbrunet
In-Reply-To: <0a923150-582c-16dc-d14d-11d8a2620871@gmail.com>

On 28/11/16 18:33, Florian Fainelli wrote:
> On 11/28/2016 02:38 AM, Sebastian Frias wrote:
>> On 27/11/16 19:44, Florian Fainelli wrote:
>>> Describe that the Ethernet MAC controller is ultimately responsible for
>>> dealing with proper pause frames/flow control advertisement and
>>> enabling, and that it is therefore allowed to have it change
>>> phydev->supported/advertising with SUPPORTED_Pause and
>>> SUPPORTED_AsymPause.
>>>
>>> Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
>>> ---
>>>  Documentation/networking/phy.txt | 18 ++++++++++++++++--
>>>  1 file changed, 16 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
>>> index 4b25c0f24201..9a42a9414cea 100644
>>> --- a/Documentation/networking/phy.txt
>>> +++ b/Documentation/networking/phy.txt
>>> @@ -127,8 +127,9 @@ Letting the PHY Abstraction Layer do Everything
>>>   values pruned from them which don't make sense for your controller (a 10/100
>>>   controller may be connected to a gigabit capable PHY, so you would need to
>>>   mask off SUPPORTED_1000baseT*).  See include/linux/ethtool.h for definitions
>>> - for these bitfields. Note that you should not SET any bits, or the PHY may
>>> - get put into an unsupported state.
>>> + for these bitfields. Note that you should not SET any bits, except the
>>> + SUPPORTED_Pause and SUPPORTED_AsymPause bits (see below), or the PHY may get
>>> + put into an unsupported state.
>>>  
>>>   Lastly, once the controller is ready to handle network traffic, you call
>>>   phy_start(phydev).  This tells the PAL that you are ready, and configures the
>>> @@ -139,6 +140,19 @@ Letting the PHY Abstraction Layer do Everything
>>>   When you want to disconnect from the network (even if just briefly), you call
>>>   phy_stop(phydev).
>>>  
>>> +Pause frames / flow control
>>> +
>>> + The PHY does not participate directly in flow control/pause frames except by
>>> + making sure that the SUPPORTED_Pause and SUPPORTED_AsymPause bits are set in
>>> + MII_ADVERTISE to indicate towards the link partner that the Ethernet MAC
>>> + controller supports such a thing. Since flow control/pause frames generation
>>> + involves the Ethernet MAC driver, it is recommended that this driver takes care
>>> + of properly indicating advertisement and support for such features by setting
>>> + the SUPPORTED_Pause and SUPPORTED_AsymPause bits accordingly. This can be done
>>> + either before or after phy_connect() 
>>
>> If the bits are set after phy_connect(), how does the PHY framework knows there's
>> an update to the bits? Should some call be made?
> 
> You would most likely either call phy_start() to start the PHY state
> machine (again) or have to re-negotiate the link with e.g:
> genphy_restart_aneg().
> 

Thanks, I think that would be worth adding to the documentation, right?

^ permalink raw reply

* [PATCH net] tcp: warn on bogus MSS and try to amend it
From: Marcelo Ricardo Leitner @ 2016-11-30 13:14 UTC (permalink / raw)
  To: netdev
  Cc: Jon Maxwell, Alex Sidorenko, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, tlfalcon, Brian King,
	Eric Dumazet, davem

There have been some reports lately about TCP connection stalls caused
by NIC drivers that aren't setting gso_size on aggregated packets on rx
path. This causes TCP to assume that the MSS is actually the size of the
aggregated packet, which is invalid.

Although the proper fix is to be done at each driver, it's often hard
and cumbersome for one to debug, come to such root cause and report/fix
it.

This patch amends this situation in two ways. First, it adds a warning
on when this situation occurs, so it gives a hint to those trying to
debug this. It also limit the maximum probed MSS to the adverised MSS,
as it should never be any higher than that.

The result is that the connection may not have the best performance ever
but it shouldn't stall, and the admin will have a hint on what to look
for.

Tested with virtio by forcing gso_size to 0.

Cc: Jonathan Maxwell <jmaxwell37@gmail.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
---
 net/ipv4/tcp_input.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a27b9c0e27c08b4e4aeaff3d0bfdf3ae561ba4d8..ecc86105eb479de9b80db71af6a16a5af612a61c 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -144,7 +144,10 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
 	 */
 	len = skb_shinfo(skb)->gso_size ? : skb->len;
 	if (len >= icsk->icsk_ack.rcv_mss) {
-		icsk->icsk_ack.rcv_mss = len;
+		icsk->icsk_ack.rcv_mss = min_t(unsigned int, len,
+					       tcp_sk(sk)->advmss);
+		if (icsk->icsk_ack.rcv_mss != len)
+			pr_warn_once("Seems your NIC driver is doing bad RX acceleration. TCP performance may be compromised.\n");
 	} else {
 		/* Otherwise, we make more careful check taking into account,
 		 * that SACKs block is variable.
-- 
2.9.3

^ permalink raw reply related

* Re: [RFC PATCH net-next] ipv6: implement consistent hashing for equal-cost multipath routing
From: Hannes Frederic Sowa @ 2016-11-30 13:12 UTC (permalink / raw)
  To: David Miller, david.lebrun; +Cc: netdev
In-Reply-To: <20161128.153209.2135257061368558724.davem@davemloft.net>

On Mon, Nov 28, 2016, at 21:32, David Miller wrote:
> From: David Lebrun <david.lebrun@uclouvain.be>
> Date: Mon, 28 Nov 2016 21:16:19 +0100
> 
> > The advantage of my solution over RFC2992 is lowest possible disruption
> > and equal rebalancing of affected flows. The disadvantage is the lookup
> > complexity of O(log n) vs O(1). Although from a theoretical viewpoint
> > O(1) is obviously better, would O(log n) have an effectively measurable
> > negative impact on scalability ? If we consider 32 next-hops for a route
> > and 100 pseudo-random numbers generated per next-hop, the lookup
> > algorithm would have to perform in the worst case log2 3200 = 11
> > comparisons to select a next-hop for that route.
> 
> When I was working on the routing cache removal in ipv4 I compared
> using a stupid O(1) hash lookup of the FIB entries vs. the O(log n)
> fib_trie stuff actually in use.
> 
> It did make a difference.
> 
> This is a lookup that can be invoked 20 million times per second or
> more.
> 
> Every cycle matters.
> 
> We already have a lot of trouble getting under the cycle budget one
> has for routing at wire speed for very high link rates, please don't
> make it worse.

David, one question: do you remember if you measured with linked lists
at that time or also with arrays. I actually would expect small arrays
that entirely fit into cachelines to be actually faster than our current
approach, which also walks a linked list, probably the best algorithm to
trash cache lines. I ask because I currently prefer this approach more
than having large allocations in the O(1) case because of easier code
and easier management.

Thanks,
Hannes

^ permalink raw reply

* Re: [PATCH net] vhost_net: don't continue to call the recvmsg when meet errors
From: Jason Wang @ 2016-11-30 13:07 UTC (permalink / raw)
  To: Yunjian Wang, mst, netdev, linux-kernel; +Cc: caihe
In-Reply-To: <1480507857-22976-1-git-send-email-wangyunjian@huawei.com>



On 2016年11月30日 20:10, Yunjian Wang wrote:
> When we meet an error(err=-EBADFD) recvmsg, the error handling in vhost
> handle_rx() will continue. This will cause a soft CPU lockup in vhost thread.
>
> Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
> ---
>   drivers/vhost/net.c | 3 +++
>   1 file changed, 3 insertions(+)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 5dc128a..edc470b 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -717,6 +717,9 @@ static void handle_rx(struct vhost_net *net)
>   			pr_debug("Discarded rx packet: "
>   				 " len %d, expected %zd\n", err, sock_len);
>   			vhost_discard_vq_desc(vq, headcount);
> +			/* Don't continue to do, when meet errors. */
> +			if (err < 0)
> +				goto out;
>   			continue;
>   		}
>   		/* Supply virtio_net_hdr if VHOST_NET_F_VIRTIO_NET_HDR */

Acked-by: Jason Wang <jasowang@redhat.com>

We may want to rename vhost_discard_vq_desc() in the future, since it 
does not discard the desc in fact.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox