Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] bnx2x: avoid atomic allocations during initialization
From: David Miller @ 2013-09-07  4:29 UTC (permalink / raw)
  To: dmitry; +Cc: mschmidt, netdev, ariele, eilong
In-Reply-To: <504C9EFCA2D0054393414C9CB605C37F20D90E3D@SJEXCHMB06.corp.ad.broadcom.com>

From: "Dmitry Kravkov" <dmitry@broadcom.com>
Date: Sat, 7 Sep 2013 01:45:10 +0000

> Once you allocated the memory during initialization , you will most
> probably fail to allocate its replacement during RX handling (on
> this machine).

Not true, these two events can exist in totally different timeframes
and totally different memory pressure situations.

Probe should never fail because of an atomic memory allocation if at
all possible.

You should use sleeping allocations in every possible situation for
maximum stability, and this absolutely matches the approach taken
by every other major driver.

^ permalink raw reply

* [PATCH] net: korina: remove deprecated IRQF_DISABLED
From: Michael Opdenacker @ 2013-09-07  5:20 UTC (permalink / raw)
  To: davem
  Cc: emilio, mugunthanvnm, jg1.han, hsweeten, netdev, linux-kernel,
	Michael Opdenacker

This patch proposes to remove the IRQF_DISABLED flag from
drivers/net/ethernet/korina.c

It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
---
 drivers/net/ethernet/korina.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/korina.c b/drivers/net/ethernet/korina.c
index 270e65f..a36fa80 100644
--- a/drivers/net/ethernet/korina.c
+++ b/drivers/net/ethernet/korina.c
@@ -996,14 +996,14 @@ static int korina_open(struct net_device *dev)
 	 * that handles the Done Finished
 	 * Ovr and Und Events */
 	ret = request_irq(lp->rx_irq, korina_rx_dma_interrupt,
-			IRQF_DISABLED, "Korina ethernet Rx", dev);
+			0, "Korina ethernet Rx", dev);
 	if (ret < 0) {
 		printk(KERN_ERR "%s: unable to get Rx DMA IRQ %d\n",
 		    dev->name, lp->rx_irq);
 		goto err_release;
 	}
 	ret = request_irq(lp->tx_irq, korina_tx_dma_interrupt,
-			IRQF_DISABLED, "Korina ethernet Tx", dev);
+			0, "Korina ethernet Tx", dev);
 	if (ret < 0) {
 		printk(KERN_ERR "%s: unable to get Tx DMA IRQ %d\n",
 		    dev->name, lp->tx_irq);
@@ -1012,7 +1012,7 @@ static int korina_open(struct net_device *dev)
 
 	/* Install handler for overrun error. */
 	ret = request_irq(lp->ovr_irq, korina_ovr_interrupt,
-			IRQF_DISABLED, "Ethernet Overflow", dev);
+			0, "Ethernet Overflow", dev);
 	if (ret < 0) {
 		printk(KERN_ERR "%s: unable to get OVR IRQ %d\n",
 		    dev->name, lp->ovr_irq);
@@ -1021,7 +1021,7 @@ static int korina_open(struct net_device *dev)
 
 	/* Install handler for underflow error. */
 	ret = request_irq(lp->und_irq, korina_und_interrupt,
-			IRQF_DISABLED, "Ethernet Underflow", dev);
+			0, "Ethernet Underflow", dev);
 	if (ret < 0) {
 		printk(KERN_ERR "%s: unable to get UND IRQ %d\n",
 		    dev->name, lp->und_irq);
-- 
1.8.1.2

^ permalink raw reply related

* Re: [net v6 1/8] i40e: main driver core
From: Daniel Borkmann @ 2013-09-07  6:44 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: e1000-devel, netdev, Jesse Brandeburg, gospo, davem, sassmann
In-Reply-To: <1378510875-10704-2-git-send-email-jeffrey.t.kirsher@intel.com>

[...]
+i40e_status i40e_allocate_dma_mem_d(struct i40e_hw *hw,
+				    struct i40e_dma_mem *mem,
+				    u64 size, u32 alignment)
+{
+	struct i40e_pf *pf = (struct i40e_pf *)hw->back;
+
+	if (!mem)
+		return I40E_ERR_PARAM;
+
+	mem->size = ALIGN(size, alignment);
+	mem->va = dma_zalloc_coherent(&pf->pdev->dev, mem->size,
+				      &mem->pa, GFP_KERNEL);
+	if (mem->va)
+		return I40E_SUCCESS;
+
+	return I40E_ERR_NO_MEMORY;
+}

Nitpicking ... at some point in time i40e_status should be removed plus
I40E_ERR_PARAM, I40E_SUCCESS, I40E_ERR_NO_MEMORY and the like, as we have
int and -EINVAL, 0, -ENOMEM for that. ;-)

[...]
+i40e_status i40e_put_mac_in_vlan(struct i40e_vsi *vsi, u8 *macaddr,
+				 bool is_vf, bool is_netdev)
+{
+	struct i40e_mac_filter *f, *add_f;
+
+	list_for_each_entry(f, &vsi->mac_filter_list, list) {
[...]
+			if (!add_f) {
+				dev_info(&vsi->back->pdev->dev, "Could not add filter %d for %pM\n",
+					 f->vlan, f->macaddr);
+				return -ENOMEM;
+			}
+		}
+	}
+	return I40E_SUCCESS;
+}

Their usage seems also to be mixed anyway: -ENOMEM vs. I40E_SUCCESS.

[...]
+void i40e_vsi_reset_stats(struct i40e_vsi *vsi)
+{
+	struct rtnl_link_stats64 *ns;
+	int i;
+
+	if (!vsi)
+		return;
+
[...]
+static struct i40e_mac_filter *i40e_find_filter(struct i40e_vsi *vsi,
+						u8 *macaddr, s16 vlan,
+						bool is_vf, bool is_netdev)
+{
+	struct i40e_mac_filter *f;
+
+	if (!vsi || !macaddr)
+		return NULL;
[...]

Probably the code could also be scanned to remove such checks as well ...

------------------------------------------------------------------------------
Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
Discover the easy way to master current and previous Microsoft technologies
and advance your career. Get an incredible 1,500+ hours of step-by-step
tutorial videos with LearnDevNow. Subscribe today and save!
http://pubads.g.doubleclick.net/gampad/clk?id=58041391&iu=/4140/ostg.clktrk
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* [PATCH] bcm63xx_enet: remove deprecated IRQF_DISABLED
From: Michael Opdenacker @ 2013-09-07  6:56 UTC (permalink / raw)
  To: davem; +Cc: jogo, joe, jg1.han, mbizon, netdev, linux-kernel,
	Michael Opdenacker

This patch proposes to remove the IRQF_DISABLED flag from
drivers/net/ethernet/broadcom/bcm63xx_enet.c

It's a NOOP since 2.6.35 and it will be removed one day.

Signed-off-by: Michael Opdenacker <michael.opdenacker@free-electrons.com>
---
 drivers/net/ethernet/broadcom/bcm63xx_enet.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index 8ac48fb..b9a5fb6 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -926,13 +926,13 @@ static int bcm_enet_open(struct net_device *dev)
 	if (ret)
 		goto out_phy_disconnect;
 
-	ret = request_irq(priv->irq_rx, bcm_enet_isr_dma, IRQF_DISABLED,
+	ret = request_irq(priv->irq_rx, bcm_enet_isr_dma, 0,
 			  dev->name, dev);
 	if (ret)
 		goto out_freeirq;
 
 	ret = request_irq(priv->irq_tx, bcm_enet_isr_dma,
-			  IRQF_DISABLED, dev->name, dev);
+			  0, dev->name, dev);
 	if (ret)
 		goto out_freeirq_rx;
 
@@ -2156,13 +2156,13 @@ static int bcm_enetsw_open(struct net_device *dev)
 	enet_dmac_writel(priv, 0, ENETDMAC_IRMASK, priv->tx_chan);
 
 	ret = request_irq(priv->irq_rx, bcm_enet_isr_dma,
-			  IRQF_DISABLED, dev->name, dev);
+			  0, dev->name, dev);
 	if (ret)
 		goto out_freeirq;
 
 	if (priv->irq_tx != -1) {
 		ret = request_irq(priv->irq_tx, bcm_enet_isr_dma,
-				  IRQF_DISABLED, dev->name, dev);
+				  0, dev->name, dev);
 		if (ret)
 			goto out_freeirq_rx;
 	}
-- 
1.8.1.2

^ permalink raw reply related

* Re: [net v6 1/8] i40e: main driver core
From: Daniel Borkmann @ 2013-09-07  7:15 UTC (permalink / raw)
  To: Jeff Kirsher
  Cc: e1000-devel, netdev, Jesse Brandeburg, gospo, davem, sassmann
In-Reply-To: <1378510875-10704-2-git-send-email-jeffrey.t.kirsher@intel.com>

[...]

> +/**
> + * i40e_config_netdev - Setup the netdev flags
> + * @vsi: the VSI being configured
> + *
> + * Returns 0 on success, negative value on failure
> + **/
> +static s32 i40e_config_netdev(struct i40e_vsi *vsi)
> +{
> +	struct i40e_pf *pf = vsi->back;
> +	struct i40e_hw *hw = &pf->hw;
> +	struct i40e_netdev_priv *np;
> +	struct net_device *netdev;
> +	u8 mac_addr[ETH_ALEN];
> +	int etherdev_size;
> +
> +	etherdev_size = sizeof(struct i40e_netdev_priv);
> +	netdev = alloc_etherdev_mq(etherdev_size, vsi->alloc_queue_pairs);
> +	if (!netdev)
> +		return -ENOMEM;
> +
> +	vsi->netdev = netdev;
> +	np = netdev_priv(netdev);
> +	np->vsi = vsi;
> +
> +	netdev->hw_enc_features = NETIF_F_IP_CSUM	 |
> +				  NETIF_F_GSO_UDP_TUNNEL |
> +				  NETIF_F_TSO		 |
> +				  NETIF_F_SG;
> +
> +	netdev->features = NETIF_F_SG		       |
> +			   NETIF_F_IP_CSUM	       |
> +			   NETIF_F_SCTP_CSUM	       |

Thanks for including SCTP! ;-)

> +			   NETIF_F_HIGHDMA	       |
> +			   NETIF_F_GSO_UDP_TUNNEL      |
> +			   NETIF_F_HW_VLAN_CTAG_TX     |
> +			   NETIF_F_HW_VLAN_CTAG_RX     |
> +			   NETIF_F_HW_VLAN_CTAG_FILTER |
> +			   NETIF_F_IPV6_CSUM	       |
> +			   NETIF_F_TSO		       |
> +			   NETIF_F_TSO6		       |
> +			   NETIF_F_RXCSUM	       |
> +			   NETIF_F_RXHASH	       |
> +			   0;
[..]

> +/**
> + * i40e_veb_mem_alloc - Allocates the next available struct veb in the PF
> + * @pf: board private structure
> + *
> + * On error: returns error code (negative)
> + * On success: returns vsi index in PF (positive)
> + **/
> +static s32 i40e_veb_mem_alloc(struct i40e_pf *pf)
> +{
> +	int ret = I40E_ERR_NO_AVAILABLE_VSI;
> +	struct i40e_veb *veb;
> +	int i;
> +
> +	/* Need to protect the allocation of switch elements at the PF level */
> +	mutex_lock(&pf->switch_mutex);
> +
> +	/* VEB list may be fragmented if VEB creation/destruction has
> +	 * been happening.  We can afford to do a quick scan to look
> +	 * for any free slots in the list.
> +	 *
> +	 * find next empty veb slot, looping back around if necessary
> +	 */
> +	i = 0;
> +	while ((i < I40E_MAX_VEB) && (pf->veb[i] != NULL))
> +		i++;
> +	if (i >= I40E_MAX_VEB) {
> +		ret = I40E_ERR_NO_MEMORY;
> +		goto err_alloc_veb;  /* out of VEB slots! */
> +	}
> +
> +	veb = kzalloc(sizeof(*veb), GFP_KERNEL);
> +	if (!veb) {
> +		ret = -ENOMEM;
> +		goto err_alloc_veb;
> +	}

Here as well, I40E_ERR_NO_MEMORY vs -ENOMEM.

> +	veb->pf = pf;
> +	veb->idx = i;
> +	veb->enabled_tc = 1;
> +
> +	pf->veb[i] = veb;
> +	ret = i;
> +err_alloc_veb:
> +	mutex_unlock(&pf->switch_mutex);
> +	return ret;
> +}

[...]

> +/**
> + * i40e_add_veb - create the VEB in the switch
> + * @veb: the VEB to be instantiated
> + * @vsi: the controlling VSI
> + **/
> +static s32 i40e_add_veb(struct i40e_veb *veb, struct i40e_vsi *vsi)
> +{
> +	bool is_default = (vsi->idx == vsi->back->lan_vsi);
> +	int ret;
> +
> +	/* get a VEB from the hardware */
> +	ret = i40e_aq_add_veb(&veb->pf->hw, veb->uplink_seid, vsi->seid,
> +			      veb->enabled_tc, is_default, &veb->seid, NULL);
> +	if (ret != I40E_SUCCESS) {
> +		dev_info(&veb->pf->pdev->dev,
> +			 "couldn't add VEB, err %d, aq_err %d\n",
> +			 ret, veb->pf->hw.aq.asq_last_status);
> +		return ret;
> +	}

Nitpick: why s32 in the signature? There are a lot of such places, just int would have
been fine probably.

> +
> +	return I40E_SUCCESS;
> +}

[...]

> +struct i40e_veb *i40e_veb_setup(struct i40e_pf *pf, u16 flags,
> +				u16 uplink_seid, u16 vsi_seid,
> +				u8 enabled_tc)
> +{
[...]
> +
> +	/* get veb sw struct */
> +	veb_idx = i40e_veb_mem_alloc(pf);
> +	if (veb_idx < 0)
> +		goto err_alloc;

See below.

> +	veb = pf->veb[veb_idx];
> +	veb->flags = flags;
> +	veb->uplink_seid = uplink_seid;
> +	veb->veb_idx = (uplink_veb ? uplink_veb->idx : I40E_NO_VEB);
> +	veb->enabled_tc = (enabled_tc ? enabled_tc : 0x1);
> +
> +	/* create the VEB in the switch */
> +	ret = i40e_add_veb(veb, pf->vsi[vsi_idx]);
> +	if (ret != I40E_SUCCESS)
> +		goto err_veb;
> +
> +	return veb;
> +
> +err_veb:
> +	i40e_veb_clear(veb);
> +err_alloc:
> +	return NULL;
> +}
> +
> +/**
> + * i40e_setup_pf_switch_element - set pf vars based on switch type
> + * @pf: board private structure
> + * @ele: element we are building info from
> + * @num_reported: total number of elements
> + * @printconfig: should we print the contents
> + *
> + * helper function to assist in extracting a few useful SEID values.
> + **/
> +static void i40e_setup_pf_switch_element(struct i40e_pf *pf,
> +				struct i40e_aqc_switch_config_element_resp *ele,
> +				u16 num_reported, bool printconfig)
> +{
> +	u16 downlink_seid = le16_to_cpu(ele->downlink_seid);
> +	u16 uplink_seid = le16_to_cpu(ele->uplink_seid);
> +	u8 element_type = ele->element_type;
> +	u16 seid = le16_to_cpu(ele->seid);
> +
> +	if (printconfig)
> +		dev_info(&pf->pdev->dev,
> +			 "type=%d seid=%d uplink=%d downlink=%d\n",
> +			 element_type, seid, uplink_seid, downlink_seid);
> +
> +	switch (element_type) {
> +	case I40E_SWITCH_ELEMENT_TYPE_MAC:
> +		pf->mac_seid = seid;
> +		break;
> +	case I40E_SWITCH_ELEMENT_TYPE_VEB:
> +		/* Main VEB? */
> +		if (uplink_seid != pf->mac_seid)
> +			break;
> +		if (pf->lan_veb == I40E_NO_VEB) {
> +			int v;
> +
> +			/* find existing or else empty VEB */
> +			for (v = 0; v < I40E_MAX_VEB; v++) {
> +				if (pf->veb[v] && (pf->veb[v]->seid == seid)) {
> +					pf->lan_veb = v;
> +					break;
> +				}
> +			}
> +			if (pf->lan_veb == I40E_NO_VEB) {
> +				v = i40e_veb_mem_alloc(pf);
> +				if (v < 0)
> +					break;

Nitpick: I'd expect from *alloc() functions to return NULL, but fair enough.

> +				pf->lan_veb = v;
> +			}
> +		}
> +
> +		pf->veb[pf->lan_veb]->seid = seid;
> +		pf->veb[pf->lan_veb]->uplink_seid = pf->mac_seid;
> +		pf->veb[pf->lan_veb]->pf = pf;
> +		pf->veb[pf->lan_veb]->veb_idx = I40E_NO_VEB;
> +		break;
> +	case I40E_SWITCH_ELEMENT_TYPE_VSI:
> +		if (num_reported != 1)
> +			break;
> +		/* This is immediately after a reset so we can assume this is
> +		 * the PF's VSI
> +		 */
> +		pf->mac_seid = uplink_seid;
> +		pf->pf_seid = downlink_seid;
> +		pf->main_vsi_seid = seid;
> +		if (printconfig)
> +			dev_info(&pf->pdev->dev,
> +				 "pf_seid=%d main_vsi_seid=%d\n",
> +				 pf->pf_seid, pf->main_vsi_seid);
> +		break;
> +	case I40E_SWITCH_ELEMENT_TYPE_PF:
> +	case I40E_SWITCH_ELEMENT_TYPE_VF:
> +	case I40E_SWITCH_ELEMENT_TYPE_EMP:
> +	case I40E_SWITCH_ELEMENT_TYPE_BMC:
> +	case I40E_SWITCH_ELEMENT_TYPE_PE:
> +	case I40E_SWITCH_ELEMENT_TYPE_PA:
> +		/* ignore these for now */
> +		break;
> +	default:
> +		dev_info(&pf->pdev->dev, "unknown element type=%d seid=%d\n",
> +			 element_type, seid);
> +		break;
> +	}
> +}

[...]

> +/**
> + * i40e_fetch_switch_configuration - Get switch config from firmware
> + * @pf: board private structure
> + * @printconfig: should we print the contents
> + *
> + * Get the current switch configuration from the device and
> + * extract a few useful SEID values.
> + **/
> +s32 i40e_fetch_switch_configuration(struct i40e_pf *pf, bool printconfig)
> +{
> +	struct i40e_aqc_get_switch_config_resp *sw_config;
> +	int ret = I40E_SUCCESS;
> +	u16 next_seid = 0;
> +	u8 *aq_buf;
> +	int i;
> +
> +	if (!pf)
> +		return I40E_ERR_BAD_PTR;
> +
> +	aq_buf = kzalloc(I40E_AQ_LARGE_BUF, GFP_KERNEL);
> +	if (!aq_buf)
> +		return -ENOMEM;

More of such examples ... ;-) I40E_ERR_BAD_PTR vs. -ENOMEM on a s32 instead of int.

> +	sw_config = (struct i40e_aqc_get_switch_config_resp *)aq_buf;
> +	do {
> +		u16 num_reported, num_total;
> +
> +		ret = i40e_aq_get_switch_config(&pf->hw, sw_config,
> +						I40E_AQ_LARGE_BUF,
> +						&next_seid, NULL);
> +		if (ret) {
> +			dev_info(&pf->pdev->dev,
> +				 "get switch config failed %d aq_err=%x\n",
> +				 ret, pf->hw.aq.asq_last_status);
> +			kfree(aq_buf);
> +			return ret;
> +		}
> +
> +		num_reported = le16_to_cpu(sw_config->header.num_reported);
> +		num_total = le16_to_cpu(sw_config->header.num_total);
> +
> +		if (printconfig)
> +			dev_info(&pf->pdev->dev,
> +				 "header: %d reported %d total\n",
> +				 num_reported, num_total);
> +
> +		if (num_reported) {
> +			int sz = sizeof(*sw_config) * num_reported;
> +
> +			kfree(pf->sw_config);
> +			pf->sw_config = kzalloc(sz, GFP_KERNEL);
> +			if (pf->sw_config)
> +				memcpy(pf->sw_config, sw_config, sz);
> +		}
> +
> +		for (i = 0; i < num_reported; i++) {
> +			struct i40e_aqc_switch_config_element_resp *ele =
> +				&sw_config->element[i];
> +
> +			i40e_setup_pf_switch_element(pf, ele, num_reported,
> +						     printconfig);
> +		}
> +	} while (next_seid != 0);
> +
> +	kfree(aq_buf);
> +	return ret;
> +}

[...]

> +/**
> + * i40e_determine_queue_usage - Work out queue distribution
> + * @pf: board private structure
> + **/
> +static i40e_status i40e_determine_queue_usage(struct i40e_pf *pf)
> +{
> +	int accum_tc_size;
> +	int queues_left;
> +	int num_tc0;
> +
> +	pf->num_lan_qps = 0;
> +	pf->num_tc_qps = rounddown_pow_of_two(pf->num_tc_qps);
> +	accum_tc_size = (I40E_MAX_TRAFFIC_CLASS - 1) * pf->num_tc_qps;
> +
> +	/* a quicky macro for a repeated set of lines */
> +#define SET_RSS_SIZE do { \
> +num_tc0 = min_t(int, queues_left, pf->rss_size_max);           \
> +num_tc0 = min_t(int, num_tc0, nr_cpus_node(numa_node_id()));   \
> +num_tc0 = rounddown_pow_of_two(num_tc0);                       \
> +pf->rss_size = num_tc0;                                        \
> +} while (0)
> +
> +	/* Find the max queues to be put into basic use.  We'll always be
> +	 * using TC0, whether or not DCB is running, and TC0 will get the
> +	 * big RSS set.
> +	 */
> +	queues_left = pf->hw.func_caps.num_tx_qp;
> +
> +	if   (!((pf->flags & I40E_FLAG_MSIX_ENABLED)		 &&
> +		(pf->flags & I40E_FLAG_MQ_ENABLED))		 ||
> +		!(pf->flags & (I40E_FLAG_RSS_ENABLED |
> +		I40E_FLAG_FDIR_ENABLED | I40E_FLAG_DCB_ENABLED)) ||
> +		(queues_left == 1)) {
> +
> +		/* one qp for PF, no queues for anything else */
> +		queues_left = 0;
> +		pf->rss_size = pf->num_lan_qps = 1;
> +
> +		/* make sure all the fancies are disabled */
> +		pf->flags &= ~(I40E_FLAG_RSS_ENABLED       |
> +				I40E_FLAG_MQ_ENABLED	   |
> +				I40E_FLAG_FDIR_ENABLED	   |
> +				I40E_FLAG_FDIR_ATR_ENABLED |
> +				I40E_FLAG_DCB_ENABLED	   |
> +				I40E_FLAG_SRIOV_ENABLED	   |
> +				I40E_FLAG_VMDQ_ENABLED);
> +
> +	} else if (pf->flags & I40E_FLAG_RSS_ENABLED	  &&
> +		   !(pf->flags & I40E_FLAG_FDIR_ENABLED)  &&
> +		   !(pf->flags & I40E_FLAG_DCB_ENABLED)) {
> +
> +		SET_RSS_SIZE;

Can't these macros be done in a small inline function instead?

> +		queues_left -= pf->rss_size;
> +		pf->num_lan_qps = pf->rss_size;
> +
> +	} else if (pf->flags & I40E_FLAG_RSS_ENABLED	  &&
> +		   !(pf->flags & I40E_FLAG_FDIR_ENABLED)  &&
> +		   (pf->flags & I40E_FLAG_DCB_ENABLED)) {
> +
> +		/* save num_tc_qps queues for TCs 1 thru 7 and the rest
> +		 * are set up for RSS in TC0
> +		 */
> +		queues_left -= accum_tc_size;
> +
> +		SET_RSS_SIZE;

ditto

> +		queues_left -= pf->rss_size;
> +		if (queues_left < 0) {
> +			dev_info(&pf->pdev->dev, "not enough queues for DCB\n");
> +			return I40E_ERR_CONFIG;
> +		}
> +
> +		pf->num_lan_qps = pf->rss_size + accum_tc_size;
> +
> +	} else if (pf->flags & I40E_FLAG_RSS_ENABLED   &&
> +		  (pf->flags & I40E_FLAG_FDIR_ENABLED) &&
> +		  !(pf->flags & I40E_FLAG_DCB_ENABLED)) {
> +
> +		queues_left -= 1; /* save 1 queue for FD */
> +
> +		SET_RSS_SIZE;

ditto

> +		queues_left -= pf->rss_size;
> +		if (queues_left < 0) {
> +			dev_info(&pf->pdev->dev, "not enough queues for Flow Director\n");
> +			return I40E_ERR_CONFIG;
> +		}
> +
> +		pf->num_lan_qps = pf->rss_size;
> +
> +	} else if (pf->flags & I40E_FLAG_RSS_ENABLED   &&
> +		  (pf->flags & I40E_FLAG_FDIR_ENABLED) &&
> +		  (pf->flags & I40E_FLAG_DCB_ENABLED)) {
> +
> +		/* save 1 queue for TCs 1 thru 7,
> +		 * 1 queue for flow director,
> +		 * and the rest are set up for RSS in TC0
> +		 */
> +		queues_left -= 1;
> +		queues_left -= accum_tc_size;
> +
> +		SET_RSS_SIZE;

ditto

> +		queues_left -= pf->rss_size;
> +		if (queues_left < 0) {
> +			dev_info(&pf->pdev->dev, "not enough queues for DCB and Flow Director\n");
> +			return I40E_ERR_CONFIG;
> +		}
> +
> +		pf->num_lan_qps = pf->rss_size + accum_tc_size;
> +
> +	} else {
> +		dev_info(&pf->pdev->dev,
> +			 "Invalid configuration, flags=0x%08llx\n", pf->flags);
> +		return I40E_ERR_CONFIG;
> +	}
> +
> +	if ((pf->flags & I40E_FLAG_SRIOV_ENABLED) &&
> +	    pf->num_vf_qps && pf->num_req_vfs && queues_left) {
> +		pf->num_req_vfs = min_t(int, pf->num_req_vfs, (queues_left /
> +							       pf->num_vf_qps));
> +		queues_left -= (pf->num_req_vfs * pf->num_vf_qps);
> +	}
> +
> +	if ((pf->flags & I40E_FLAG_VMDQ_ENABLED) &&
> +	    pf->num_vmdq_vsis && pf->num_vmdq_qps && queues_left) {
> +		pf->num_vmdq_vsis = min_t(int, pf->num_vmdq_vsis,
> +					  (queues_left / pf->num_vmdq_qps));
> +		queues_left -= (pf->num_vmdq_vsis * pf->num_vmdq_qps);
> +	}
> +
> +	return I40E_SUCCESS;
> +}

------------------------------------------------------------------------------
Learn the latest--Visual Studio 2012, SharePoint 2013, SQL 2012, more!
Discover the easy way to master current and previous Microsoft technologies
and advance your career. Get an incredible 1,500+ hours of step-by-step
tutorial videos with LearnDevNow. Subscribe today and save!
http://pubads.g.doubleclick.net/gampad/clk?id=58041391&iu=/4140/ostg.clktrk
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* [PATCH net] net: ovs: flow: fix potential illegal memory access in __parse_flow_nlattrs
From: Daniel Borkmann @ 2013-09-07  7:41 UTC (permalink / raw)
  To: davem; +Cc: netdev, jesse, Andy Zhou

In function __parse_flow_nlattrs(), we check for condition
(type > OVS_KEY_ATTR_MAX) and if true, print an error, but we do
not return from this function as in other checks. It seems this
has been forgotten, as otherwise, we could access beyond the
memory of ovs_key_lens, which is of ovs_key_lens[OVS_KEY_ATTR_MAX + 1].
Hence, a maliciously prepared nla_type from user space could access
beyond this upper limit.

Introduced by 03f0d916a ("openvswitch: Mega flow implementation").

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Andy Zhou <azhou@nicira.com>
---
 net/openvswitch/flow.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/openvswitch/flow.c b/net/openvswitch/flow.c
index fb36f85..410db90 100644
--- a/net/openvswitch/flow.c
+++ b/net/openvswitch/flow.c
@@ -1178,6 +1178,7 @@ static int __parse_flow_nlattrs(const struct nlattr *attr,
 		if (type > OVS_KEY_ATTR_MAX) {
 			OVS_NLERR("Unknown key attribute (type=%d, max=%d).\n",
 				  type, OVS_KEY_ATTR_MAX);
+			return -EINVAL;
 		}
 
 		if (attrs & (1 << type)) {
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH] bnx2x: avoid atomic allocations during initialization
From: Dmitry Kravkov @ 2013-09-07  9:07 UTC (permalink / raw)
  To: David Miller
  Cc: Dmitry Kravkov, Tereza Schmidtová, netdev@vger.kernel.org,
	Ariel Elior, Eilon Greenstein
In-Reply-To: <20130907.002952.1951928882694928547.davem@davemloft.net>

On Sat, Sep 7, 2013 at 7:29 AM, David Miller <davem@davemloft.net> wrote:
> From: "Dmitry Kravkov" <dmitry@broadcom.com>
> Date: Sat, 7 Sep 2013 01:45:10 +0000
>
>> Once you allocated the memory during initialization , you will most
>> probably fail to allocate its replacement during RX handling (on
>> this machine).
>
> Not true, these two events can exist in totally different timeframes
> and totally different memory pressure situations.
As seen from the customers, these allocation fail when system
constantly under memory pressure - like VM environment or 32-bit
platforms

>
> Probe should never fail because of an atomic memory allocation if at
> all possible.
We here are in open() flow, not probe()

>
> You should use sleeping allocations in every possible situation for
> maximum stability, and this absolutely matches the approach taken
> by every other major driver.
Sleeping allocation will be valid only for one ring overlap - it just
delays the issue, not resolving it, after ring overlap ALL of them
will be replaced by allocations from interrupt context. And again if
we, discover issue earlier we can provide system with fully working
networking with very acceptable performance using SW GRO stack.
Instead of having device which drops every TPA/FW_GRO aggregation -
user will not able to ssh :(

> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* RE: [PATCH net] bnx2x: bail out if unable to acquire stats_sema
From: Dmitry Kravkov @ 2013-09-07  9:34 UTC (permalink / raw)
  To: Neal Cardwell, Dmitry Kravkov
  Cc: Michal Schmidt, davem@davemloft.net, netdev@vger.kernel.org,
	Ariel Elior, Eilon Greenstein, Havard Skinnemoen, Eric Dumazet
In-Reply-To: <CADVnQyn-i1SsdqPR1SN5pqbE6sFFD-5_HLqVa6hXmCu2-f8h1Q@mail.gmail.com>

> -----Original Message-----
> From: Neal Cardwell [mailto:ncardwell@google.com]
> Sent: Saturday, September 07, 2013 6:54 AM
> To: Dmitry Kravkov
> Cc: Dmitry Kravkov; Michal Schmidt; davem@davemloft.net;
> netdev@vger.kernel.org; Ariel Elior; Eilon Greenstein; Havard Skinnemoen;
> Eric Dumazet
> Subject: Re: [PATCH net] bnx2x: bail out if unable to acquire stats_sema
> 
> On Fri, Sep 6, 2013 at 2:40 AM, Dmitry Kravkov <dkravkov@gmail.com>
> wrote:
> > On Wed, Sep 4, 2013 at 8:17 PM, Neal Cardwell <ncardwell@google.com>
> wrote:
> >> On Tue, Sep 3, 2013 at 11:51 AM, Dmitry Kravkov
> <dmitry@broadcom.com> wrote:
> >>>> -----Original Message-----
> >>>> From: Michal Schmidt [mailto:mschmidt@redhat.com]
> >>>> Sent: Tuesday, September 03, 2013 6:46 PM
> >>>> To: davem@davemloft.net
> >>>> Cc: netdev@vger.kernel.org; Dmitry Kravkov; Ariel Elior; Eilon
> >>>> Greenstein
> >>>> Subject: [PATCH net] bnx2x: bail out if unable to acquire
> >>>> stats_sema
> >>>>
> >>>> If we fail to acquire stats_sema in the specified time limit, the
> >>>> chip is probably dead. It probably does not matter whether we try
> >>>> to continue or not, but certainly we should not up() the semaphore
> afterwards.
> >>>>
> >>>> Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
> >>>> ---
> >>>>  drivers/net/ethernet/broadcom/bnx2x/bnx2x_stats.c | 24
> >>>> +++++++++++++++++------
> >>
> >> It seems like this patch has the downside that if the down_timeout()
> >> fails, then bnx2x_stats_handle() ends up updating the stats state
> >> machine's state without really executing the real body of the
> >> action().
> >>
> >> In fact it seems like there is a more general pre-existing problem of
> >> this flavor with the bnx2x stats state machine: the
> >> bnx2x_stats_handle() function updates the state machine
> >> bp->stats_state while holding the spin lock, but does not execute the
> >> action() while holding any sort of synchronization, so AFAICT there
> >> is nothing to guarantee that the state machine actions happen in the
> >> order the state machine wants them to happen. For example, if stats
> >> events fire such that we want to execute actions that disable and
> >> then enable stats, we could instead end up executing the actions in
> >> the order that would attempt to enable and then disable them, if we
> >> get unlucky with respect to when interrupts fire, etc.
> >>
> >> It seems to me that instead of having all of the callees of
> >> bnx2x_stats_handle() try to down/up the semaphore, instead
> >> bnx2x_stats_handle() should try to down the stats_sema at the top,
> >> and then if successful, it should change the bp->stats_state, call
> >> the action, and up the stats_sema. Would that work?
> >>
> > handle() is called from sleepable context (open/close) and timer
> > context, then it's not possible to use semaphore for pretection
> 
> But it seems like all of the callees of bnx2x_stats_handle() are already using
> the stats_sema for protection, and the only difference is that
> bnx2x_stats_update uses down_trylock and the other callees use
> down_timeout. What about making this explicit in bnx2x_stats_handle, with
> something like:
> 
> if (event == STATS_EVENT_UPDATE) {
>   if (down_trylock(&bp->stats_sema)) {
>        BNX2X_ERR("stats down_trylock failed\n");
>        goto out;
>   }
> else {
>   if (down_timeout(&bp->stats_sema, HZ/10)) {
>        BNX2X_ERR("stats down_timeout failed\n");
>        goto out;
>   }
> }
> bp->stats_state = ...;
> action = ...;
> action(bp);
> up(&bp->stats_sema);
> 
> That would allow us to protect the bnx2x state machine so that the
> action() events happen in the correct order, and we don't (e.g.) accidentally
> end up doing an enable-then-disable when we wanted disable-then-
> enable.
> 
> Would that work?
Looks like there is no a lot of difference  from current implementation - I will try it and report back...

> 
> neal
> 
>

^ permalink raw reply

* Re: Realtek r8168 hangs when sending data at full speed on a gigabit link
From: Francois Romieu @ 2013-09-07 10:15 UTC (permalink / raw)
  To: Frédéric Leroy
  Cc: netdev, Realtek linux nic maintainers, David R, Hayes Wang
In-Reply-To: <52245E82.5020004@starox.org>

Frédéric Leroy <fredo@starox.org> :
[...]

Sorry for the delay. It was a busy week.

Can you give the hack below a try ?

David, could you send me the r8169 XID line from a kernel running on
the hardware for which I sent you a similar patch back in 2013/04 ?
You appeared to own a 8168f and it could be a RTL_GIGA_MAC_VER_36.

Thanks.

Hayes, see http://marc.info/?l=linux-netdev&m=137794473416308&w=1 for
history. It could be eb2dc35d99028b698cdedba4f5522bc43e576bd2
("r8169: RxConfig hack for the 8168evl.") return, with a revenge.

---
 drivers/net/ethernet/realtek/r8169.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
index 6f87f2c..3397cee 100644
--- a/drivers/net/ethernet/realtek/r8169.c
+++ b/drivers/net/ethernet/realtek/r8169.c
@@ -4231,6 +4231,7 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
 	case RTL_GIGA_MAC_VER_23:
 	case RTL_GIGA_MAC_VER_24:
 	case RTL_GIGA_MAC_VER_34:
+	case RTL_GIGA_MAC_VER_35:
 		RTL_W32(RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
 		break;
 	case RTL_GIGA_MAC_VER_40:
-- 
1.8.3.1

^ permalink raw reply related

* Re: [RFC PATCHv2 0/4] Add DT support for fixed PHYs
From: Thomas Petazzoni @ 2013-09-07 10:27 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: David S. Miller, netdev, devicetree, Lior Amsalem,
	Gregory Clement, Ezequiel Garcia, linux-arm-kernel, Mark Rutland,
	Sascha Hauer, Christian Gmeiner
In-Reply-To: <2916397.Vz9aZYzLyb@lenovo>

Dear Florian Fainelli,

On Fri, 06 Sep 2013 21:42:42 +0100, Florian Fainelli wrote:

> > Here is a second version of the patch set that adds a Device Tree
> > binding and the related code to support fixed PHYs. Marked as RFC,
> > this patch set is obviously not intended for merging in 3.12.
> 
> Thanks a lot for continuing on this work, I really like the state of it now.

Thanks for your feedback.

> > Since the first version, the changes have been:
> > 
> >  * Instead of using a 'fixed-link' property inside the Ethernet device
> >    DT node, with a fairly cryptic succession of integer values, we now
> >    use a PHY subnode under the Ethernet device DT node, with explicit
> >    properties to configure the duplex, speed, pause and other PHY
> >    properties.
> > 
> >  * The PHY address is automatically allocated by the kernel and no
> >    longer visible in the Device Tree binding.
> > 
> >  * The PHY device is created directly when the network driver calls
> >    of_phy_connect_fixed_link(), and associated to the PHY DT node,
> >    which allows the existing of_phy_connect() function to work,
> >    without the need to use the deprecated of_phy_connect_fixed_link().
> > 
> > The things I am not entirely happy with yet are:
> > 
> >  * The PHY ID is hardcoded to 0xdeadbeef. Ideally, it should be a
> >    properly reserved vendor/device identifier, but it isn't clear how
> >    to get one allocated for this purpose.
> 
> Right, we should try to get something better, but we obviously cannot use an 
> already allocated OUI for this. Can we ask the Linux foundation or a Linux-
> friendly company to allocate one maybe?

According to http://standards.ieee.org/develop/regauth/oui/oui.txt, the
Linux Foundation doesn't seem to own any OUI. Should we simply be
contacting some random companies in this list?

> >  * The fixed_phy_register() function in drivers/net/phy/fixed.c has
> >    some OF references. So ideally, I would have preferred to put this
> >    code in drivers/of/of_mdio.c, but to call get_phy_device(), we need
> >    a reference to the mii_bus structure that represents the fixed MDIO
> >    bus.
> 
> This is not a big deal, not everything in drivers/ is consistent with this, 
> and making the fixed MDIO bus globally accessible does not sound too great.

Indeed.

> >  * There is some error management missing in fixed_phy_register(), but
> >    it can certainly be added easily. This RFC is meant to sort out the
> >    general idea.
> 
> Do you think you could add these to got beyond the RFC state? The patchset as 
> it currently is fine with me if you can address these.

Sure, it shouldn't be too difficult.

In the mean time, I'm interested in hearing comments from other people,
especially from the Device Tree bindings maintainers: while the
internal implementation details can always be fixed later on, the DT
binding should obviously get an approval from the DT maintainers.

Thanks,

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH net] bnx2x: Restore a call to config_init
From: Dmitry Kravkov @ 2013-09-07 10:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, Eilon Greenstein, netdev@vger.kernel.org, davej
In-Reply-To: <1378507183.31445.53.camel@edumazet-glaptop>

On Sat, Sep 7, 2013 at 1:39 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2013-09-06 at 14:47 -0400, David Miller wrote:
>> From: "Eilon Greenstein" <eilong@broadcom.com>
>> Date: Fri, 6 Sep 2013 12:55:02 +0300
>>
>> > Commit c0a77ec74f295013d7ba3204dd3ed25fccf83cb4 'bnx2x: Add missing braces in
>> > bnx2x:bnx2x_link_initialize' identified indentation problem, but resolved it
>> > by adding braces instead of fixing the indentation. The braces now prevents a
>> > config_init call in some cases, though it should be called regardless of that
>> > condition. This patch removes the braces and fix the confusing indentation
>> > that caused this mess.
>> >
>> > Signed-off-by: Eilon Greenstein <eilong@broadcom.com>
>>
>> Applied.
>
> Oh well, latest net tree broke bnx2x again.
>
> No idea why.
>
> [   61.203313] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [   74.021238] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [   76.109176] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [   78.198178] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [   80.286960] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [   93.031584] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [   95.069939] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [   97.133493] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [   99.195912] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [  112.026026] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [  114.087991] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [  116.148976] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> [  118.237310] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)

Unable to reproduce ,,,
Which HW you have - i will try to get similar...
Thanks

^ permalink raw reply

* Re: [patch net/stable] ipv6/exthdrs: accept tlv which includes only padding
From: Eldad Zack @ 2013-09-07 12:31 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <1378476145-6282-1-git-send-email-jiri@resnulli.us>


Hi Jiri,

On Fri, 6 Sep 2013, Jiri Pirko wrote:

> In rfc4942 and rfc2460 I cannot find anything which would implicate to
> drop packets which have only padding in tlv.

NAK from my side.
Please read RFC4942 2.1.9.5 "Misuse of Pad1 and PadN Options".

While it doesn't specifically discusses this corner case, you can 
understand from "There is no legitimate reason for padding beyond the 
next eight octet..." that there's also no legitimate reason for an 
option header containing only padding.
I can't imagine a sane use-case for this.

> Current behaviour breaks TAHI Test v6LC.1.2.6.

I'm not familiar with this, but IMHO the test should be reversed :)

Cheers,
Eldad

> 
> Problem was intruduced in:
> 9b905fe6843 "ipv6/exthdrs: strict Pad1 and PadN check"
> 
> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
> ---
>  net/ipv6/exthdrs.c | 6 ------
>  1 file changed, 6 deletions(-)
> 
> diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
> index 07a7d65..8d67900 100644
> --- a/net/ipv6/exthdrs.c
> +++ b/net/ipv6/exthdrs.c
> @@ -162,12 +162,6 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, struct sk_buff *skb)
>  		off += optlen;
>  		len -= optlen;
>  	}
> -	/* This case will not be caught by above check since its padding
> -	 * length is smaller than 7:
> -	 * 1 byte NH + 1 byte Length + 6 bytes Padding
> -	 */
> -	if ((padlen == 6) && ((off - skb_network_header_len(skb)) == 8))
> -		goto bad;
>  
>  	if (len == 0)
>  		return true;
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Informationen.
From: TUB @ 2013-09-07 12:54 UTC (permalink / raw)



Informationen zu allen E-Mail-Benutzer TUB,

Hiermit möchten wir Sie informieren, dass wir die hohe Spam zu reduzieren
emials, empfehlen wir, dass Sie auf ein Postfach zu aktualisieren, um Spam
zu vermeiden E-Mails und Aktualisierung notifcation. Klicken Sie oder
fügen Sie den folgenden Link kopieren, füllen und aktivieren Sie den
Spam-Schutz.

================>

http://www.tu-berlin-de-mail-konto-anmelden-schutz.tk/

Wenn wir nicht bekommen, eine Antwort von Ihnen innerhalb von 48 Stunden,
wir zerstören Ihre Mailbox.


Mit freundlichen Grüßen,

Webmaster@tu-berlin.de.

^ permalink raw reply

* netfilter: active obj WARN when cleaning up
From: Sasha Levin @ 2013-09-07 13:10 UTC (permalink / raw)
  To: pablo, kaber, kadlec, David S. Miller
  Cc: netfilter-devel, coreteam, netdev, LKML

Hi all,

While fuzzing with trinity inside a KVM tools guest, running latest -next kernel, I've
stumbled on the following:

[  418.311956] ------------[ cut here ]------------
[  418.312449] WARNING: CPU: 6 PID: 4178 at lib/debugobjects.c:260 debug_print_object+0x8d/0xb0()
[  418.313243] ODEBUG: free active (active state 0) object type: timer_list hint: 
delayed_work_timer_fn+0x0/0x20
[  418.314038] Modules linked in:
[  418.314038] CPU: 6 PID: 4178 Comm: kworker/u16:2 Tainted: G        W 
3.11.0-next-20130906-sasha #3984
[  418.314038] Workqueue: netns cleanup_net
[  418.314038]  0000000000000104 ffff880410371a58 ffffffff841b7815 0000000000000104
[  418.314038]  ffff880410371aa8 ffff880410371a98 ffffffff8112540c ffff880410371a78
[  418.314038]  ffffffff853ea0ab ffff8800bdc68b80 ffffffff85a65c00 ffffffff87676658
[  418.314038] Call Trace:
[  418.314038]  [<ffffffff841b7815>] dump_stack+0x52/0x87
[  418.320315]  [<ffffffff8112540c>] warn_slowpath_common+0x8c/0xc0
[  418.321040]  [<ffffffff811254f6>] warn_slowpath_fmt+0x46/0x50
[  418.321101]  [<ffffffff81a60ded>] debug_print_object+0x8d/0xb0
[  418.321101]  [<ffffffff811444e0>] ? __queue_work+0x390/0x390
[  418.321101]  [<ffffffff81a61605>] __debug_check_no_obj_freed+0xa5/0x220
[  418.321101]  [<ffffffff81249e76>] ? kmem_cache_destroy+0x86/0xe0
[  418.321101]  [<ffffffff81a61795>] debug_check_no_obj_freed+0x15/0x20
[  418.321101]  [<ffffffff812874d7>] kmem_cache_free+0x197/0x340
[  418.321101]  [<ffffffff81249e76>] kmem_cache_destroy+0x86/0xe0
[  418.321101]  [<ffffffff83d5d681>] nf_conntrack_cleanup_net_list+0x131/0x170
[  418.321101]  [<ffffffff83d5ec5d>] nf_conntrack_pernet_exit+0x5d/0x70
[  418.321101]  [<ffffffff83cda1ce>] ops_exit_list+0x5e/0x70
[  418.321101]  [<ffffffff83cda80b>] cleanup_net+0xfb/0x1c0
[  418.321101]  [<ffffffff81145c28>] process_one_work+0x338/0x550
[  418.321101]  [<ffffffff81145b50>] ? process_one_work+0x260/0x550
[  418.321883]  [<ffffffff81147605>] worker_thread+0x215/0x350
[  418.321883]  [<ffffffff811473f0>] ? manage_workers+0x180/0x180
[  418.321883]  [<ffffffff81150887>] kthread+0xe7/0xf0
[  418.321883]  [<ffffffff811507a0>] ? __init_kthread_worker+0x70/0x70
[  418.321883]  [<ffffffff841c7a2c>] ret_from_fork+0x7c/0xb0
[  418.321883]  [<ffffffff811507a0>] ? __init_kthread_worker+0x70/0x70
[  418.321883] ---[ end trace c974221542b61d0d ]---


Thanks,
Sasha

^ permalink raw reply

* [PATCH net] net: fib: fib6_add: fix potential NULL pointer dereference
From: Daniel Borkmann @ 2013-09-07 13:13 UTC (permalink / raw)
  To: davem; +Cc: netdev, Lin Ming, Matti Vaittinen, Hannes Frederic Sowa

When the kernel is compiled with CONFIG_IPV6_SUBTREES, and we return
with an error in fn = fib6_add_1(), then error codes are encoded into
the return pointer e.g. ERR_PTR(-ENOENT). In such an error case, we
write the error code into err and jump to out, hence enter the if(err)
condition. Now, if CONFIG_IPV6_SUBTREES is enabled, we check for:

  if (pn != fn && pn->leaf == rt)
    ...
  if (pn != fn && !pn->leaf && !(pn->fn_flags & RTN_RTINFO))
    ...

Since pn is NULL and fn is f.e. ERR_PTR(-ENOENT), then pn != fn
evaluates to true and causes a NULL-pointer dereference on further
checks on pn. Fix it, by setting both NULL in error case, so that
pn != fn already evaluates to false and no further dereference
takes place.

This was first correctly implemented in 4a287eba2 ("IPv6 routing,
NLM_F_* flag support: REPLACE and EXCL flags support, warn about
missing CREATE flag"), but the bug got later on introduced by
188c517a0 ("ipv6: return errno pointers consistently for fib6_add_1()").

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Lin Ming <mlin@ss.pku.edu.cn>
Cc: Matti Vaittinen <matti.vaittinen@nsn.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
---
 net/ipv6/ip6_fib.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index 73db48e..5bec666 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -825,9 +825,9 @@ int fib6_add(struct fib6_node *root, struct rt6_info *rt, struct nl_info *info)
 	fn = fib6_add_1(root, &rt->rt6i_dst.addr, rt->rt6i_dst.plen,
 			offsetof(struct rt6_info, rt6i_dst), allow_create,
 			replace_required);
-
 	if (IS_ERR(fn)) {
 		err = PTR_ERR(fn);
+		fn = NULL;
 		goto out;
 	}
 
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH net-next v4 1/6] bonding: simplify and use RCU protection for 3ad xmit path
From: Veaceslav Falico @ 2013-09-07 14:20 UTC (permalink / raw)
  To: Ding Tianhong
  Cc: Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Nikolay Aleksandrov, Netdev
In-Reply-To: <52298407.9040103@huawei.com>

On Fri, Sep 06, 2013 at 03:28:07PM +0800, Ding Tianhong wrote:
...snip...
>+/**
>+ * IMPORTANT: bond_first/last_slave_rcu can return NULL in case of an empty list
>+ * Caller must hold rcu_read_lock
>+ */
>+#define bond_first_slave_rcu(bond) \
>+	(bond_is_empty_rcu(bond) ? NULL : \
>+					bond_to_slave_rcu((bond)->slave_list.next))
>+#define bond_last_slave_rcu(bond) \
>+	(bond_is_empty_rcu(bond) ? NULL : \
>+					bond_to_slave_rcu((bond)->slave_list.prev))
>+

Hi Ding,

I didn't have time to actually go into detail about this last week, and, as
Eric correctly said, RCU is hard even for experienced kernel developers, so
it's really not the easiest part to understand. I, for one, can't say that
I understand it completely, though I know the overall, basic usage.

That being said, I really advise you to go through, first of all,
Documentation/RCU. There are also tons of online materials already written
on this topic - and every one takes its own approach in explaining it.

I'll now try to explain what is wrong with the approach of verifying for
list emptyness while holding only RCU read lock.

First of all, you've replied to one of my emails that, quoting - "the
slave_list will not changed in the rcu_read_lock". This is completely wrong
- rcu_read_lock() is not a lock, in its classical meaning. It allows the
    list, and everything else, to be *modified* (in the most basic case) while
holding it. I.e. if we, concurrently, run list_for_each_entry_rcu() and, on
another CPU, list_del_rcu() - then we might end up holding a reference to
the entry that *was* deleted from the list already. In other words - the
list can be freely modified while we're holding rcu_read_lock().

What rcu_read_lock() *does* guarantee is that the entry we've dereferenced
won't 'go away' - as in - get freed, get deinitialized (in most cases, and
in this one - with slave_list) or whatever else that can stop us from using
it. That's, again, a really broad assumption and there are *lots* of small
things to be taken care of, and, in some cases it's not even true -
however, the basic idea is this - once you rcu_dereference() something -
you can use it while holding rcu_read_lock().

So, again - while holding the rcu_read_lock(), working with a list - the
list itself can change easily, can become empty, or can become non-empty,
or whatever else - even while going through it via
list_for_each_entry_rcu(). However, if we dereference an entry there - we
can use it.

Now, about the list emptiness under rcu_read_lock(). Lets suppose the
following code (imagine that we have list_empty_rcu(), and we're holding
the rcu_read_lock()):

    1 if (!list_empty_rcu(&my_list)) {
    2         do_something(get_first_entry(&my_list));
    3 }

on the 1st line, we verify if the list is not empty - and, sometimes, it
will be true. For example, we'll have only one element there.

Now, after we execute the first line, on another CPU we *remove* the only
remaining element from the list, and thus my_list->next will point to
&my_list - classical empty list.

So, back to the first CPU, when we run our code, we execute
get_first_entry(&my_list) - which, usually, translates to
container_of(my_list->next, ...) - getting the container of the next
element of the list - which is our *head* - and not a valid entry. And we
will, eventually, bug here.

This is exactly what you're doing:

+#define bond_last_slave_rcu(bond) \
+	(bond_is_empty_rcu(bond) ? NULL : \
+					bond_to_slave_rcu((bond)->slave_list.prev))

You first verify if the list is empty - which can be non-empty *in the time
of the verification* - and then get the bond_to_slave_rcu() of the .prev -
when the list can already become empty and the prev will point to the list
head, and not to a valid slave.

That's exactly why we don't have the list_empty_rcu() - because it's
meaningless - cause exactly after it was run the list can become empty or
non-empty. It guarantees us nothing. The only, maybe, slightly useful usage
of it could be some kind of optimization:

   1 rcu_read_lock();
   2 
   3 if (unlikely(list_empty(&bond->slave_list))) {
   4         rcu_read_unlock();
   5         return -ENODEV;
   6 }
   7 
   8 heavy_preparation1(bond);
   9 heavy_preparation2(bond);
  10 ...
  11 
  12 bond_for_each_slave(bond, slave)
  13         do_something_with_a_slave(slave);
  14 
  15 rcu_read_unlock();

Here, as you can see, we must make some heavy (time/memory-consuming)
preparations to the bond *if it has slaves* before working *with each
slave*, and if he doesn't - we can just omit those preparations (if we
can...). But still it's a bit useless, and usually we anyway need proper
RTNL or bond->lock locking before doing something like that.

Back to the real world - to implement the bond_first/last_slave_rcu(), we
*must not* rely on list emptiness before taking the reference to the
first/last element of the list. To make this happen, we *first* take the
reference to list.next/list.prev, and only *after* it we verify if what
we've dereferenced is an actual slave or the list head (i.e. the list was
empty). If it's the list head - then we assume that the list was empty and
return NULL, otherwise - if it's a normal slave - we return the
container_of(what we've dereferenced).

Here's the function that gets the first element in the list:

268 #define list_first_or_null_rcu(ptr, type, member) \
269         ({struct list_head *__ptr = (ptr); \
270           struct list_head __rcu *__next = list_next_rcu(__ptr); \
271           likely(__ptr != __next) ? container_of(__next, type, member) : NULL; \
272         })

As you can see, we first make a copy of the list_head pointer, which we
were given - cause in the meanwhile *it can also, possibly, change*.

269         ({struct list_head *__ptr = (ptr); \

Then we dereference the pointer's copy.next, and also save it - cause it
can *also* change - become (non)empty, for example.

270           struct list_head __rcu *__next = list_next_rcu(__ptr); \

And here comes the rcu magic, actually. We know that the ptr can change,
the ptr.next can also change, however we also know that if ptr.next, when
we've dereferenced it, while holding the rcu_read_lock(), was a normal list
element - then we're sure that this element *will not go away*. Thus,
everything what remained for us is just to verify - is our next pointer the
list head or not? I.e. do we actually have an element in __next, or was the
list empty at that time and we can return NULL? And, of course, if the list
was not empty - we have a valid __next, and we can return its container:

271           likely(__ptr != __next) ? container_of(__next, type, member) : NULL; \

(Also, mind the "likely(__ptr != __next)" - usually, and bonding is a very
good example, we always have at least one element in the list.)

This way, I'd recommend you to do bond_last_slave_rcu() the same way as
here - you, though, might omit saving the slave_list pointer, it's not
needed in case of bonding AFAIK. Something like that (I'm writing it in my
mail editor - so it's only for the reference):

#define bond_last_slave_rcu(bond) \
	({struct list_head *__slave_ptr = list_next_rcu(&bond->slave_list); \
	  likely(__slave_ptr != &bond->slave_list) ? \
		bond_to_slave_rcu(__slave_ptr) : NULL;})


Or, even better (from my POV), add a generic macro to rculist.h (again,
didn't even compile it) - it can be used later on:

diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index f4b1001..37b49d1 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -23,6 +23,7 @@
   * way, we must not access it directly
   */
  #define list_next_rcu(list)	(*((struct list_head __rcu **)(&(list)->next)))
+#define list_prev_rcu(list)	(*((struct list_head __rcu **)(&(list)->prev)))
  
  /*
   * Insert a new entry between two known consecutive entries.
@@ -271,6 +272,12 @@ static inline void list_splice_init_rcu(struct list_head *list,
  	  likely(__ptr != __next) ? container_of(__next, type, member) : NULL; \
  	})
  
+#define list_last_or_null_rcu(ptr, type, member) \
+	({struct list_head *__ptr = (ptr); \
+	  struct list_head __rcu *__last = list_prev_rcu(__ptr); \
+	  likely(__ptr != __last) ? container_of(__prev, type, member) : NULL; \
+	})
+
  /**
   * list_for_each_entry_rcu	-	iterate over rcu list of given type
   * @pos:	the type * to use as a loop cursor.
------- END OF PATCH ------

Anyway, it's up to you.

Hope that helps.

^ permalink raw reply related

* Re: [PATCH net] bnx2x: Restore a call to config_init
From: Eric Dumazet @ 2013-09-07 14:43 UTC (permalink / raw)
  To: Dmitry Kravkov
  Cc: David Miller, Eilon Greenstein, netdev@vger.kernel.org, davej
In-Reply-To: <CAM8tLiNj-s0RKUeKePDURK7W7HZF=8Zz7D073qcqCLmy75CevQ@mail.gmail.com>

On Sat, 2013-09-07 at 13:35 +0300, Dmitry Kravkov wrote:
> On Sat, Sep 7, 2013 at 1:39 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:

> > Oh well, latest net tree broke bnx2x again.
> >
> > No idea why.
> >
> > [   61.203313] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [   74.021238] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [   76.109176] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [   78.198178] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [   80.286960] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [   93.031584] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [   95.069939] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [   97.133493] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [   99.195912] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [  112.026026] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [  114.087991] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [  116.148976] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[1]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> > [  118.237310] bnx2x: [bnx2x_clean_tx_queue:1259(eth0)]timeout waiting for queue[2]: txdata->tx_pkt_prod(1) != txdata->tx_pkt_cons(0)
> 
> Unable to reproduce ,,,
> Which HW you have - i will try to get similar...
> Thanks
> --


To have a working NIC, I had to revert 

937e5c3 bnx2x: Restore a call to config_init
9b0be65 bnx2x: fix broken compilation with CONFIG_BNX2X_SRIOV is not set
c0a77ec bnx2x: Add missing braces in bnx2x:bnx2x_link_initialize
60cad4e bnx2x: VF RSS support - VF side
b9871bc bnx2x: VF RSS support - PF side

# lspci -s 03:00.0
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM57712 10 Gigabit Ethernet (rev 01)

$ grep BNX2X .config
# CONFIG_SCSI_BNX2X_FCOE is not set
CONFIG_BNX2X=m

^ permalink raw reply

* [PATCH net] net: sctp: fix bug in sctp_poll for SOCK_SELECT_ERR_QUEUE
From: Daniel Borkmann @ 2013-09-07 14:44 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-sctp, Jacob Keller

If we do not add braces around ...

  mask |= POLLERR |
          sock_flag(sk, SOCK_SELECT_ERR_QUEUE) ? POLLPRI : 0;

... then this condition always evaluates to true as POLLERR is
defined as 8 and binary or'd with whatever result comes out of
sock_flag(). Hence instead of (X | Y) ? A : B, transform it into
X | (Y ? A : B). Unfortunatelty, commit 8facd5fb73 ("net: fix
smatch warnings inside datagram_poll") forgot about SCTP. :-(

Introduced by 7d4c04fc170 ("net: add option to enable error queue
packets waking select").

Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Jacob Keller <jacob.e.keller@intel.com>
---
 net/sctp/socket.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index d5d5882..5462bbb 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -6176,7 +6176,7 @@ unsigned int sctp_poll(struct file *file, struct socket *sock, poll_table *wait)
 	/* Is there any exceptional events?  */
 	if (sk->sk_err || !skb_queue_empty(&sk->sk_error_queue))
 		mask |= POLLERR |
-			sock_flag(sk, SOCK_SELECT_ERR_QUEUE) ? POLLPRI : 0;
+			(sock_flag(sk, SOCK_SELECT_ERR_QUEUE) ? POLLPRI : 0);
 	if (sk->sk_shutdown & RCV_SHUTDOWN)
 		mask |= POLLRDHUP | POLLIN | POLLRDNORM;
 	if (sk->sk_shutdown == SHUTDOWN_MASK)
-- 
1.7.11.7

^ permalink raw reply related

* Re: [PATCH net-next v4 1/6] bonding: simplify and use RCU protection for 3ad xmit path
From: Nikolay Aleksandrov @ 2013-09-07 14:45 UTC (permalink / raw)
  To: Veaceslav Falico
  Cc: Ding Tianhong, Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Netdev
In-Reply-To: <20130907142041.GA20237@redhat.com>


On 09/07/2013 04:20 PM, Veaceslav Falico wrote:
> On Fri, Sep 06, 2013 at 03:28:07PM +0800, Ding Tianhong wrote:
<snip>
> (Also, mind the "likely(__ptr != __next)" - usually, and bonding is a very
> good example, we always have at least one element in the list.)
> 
> This way, I'd recommend you to do bond_last_slave_rcu() the same way as
> here - you, though, might omit saving the slave_list pointer, it's not
> needed in case of bonding AFAIK. Something like that (I'm writing it in my
> mail editor - so it's only for the reference):
> 
> #define bond_last_slave_rcu(bond) \
>     ({struct list_head *__slave_ptr = list_next_rcu(&bond->slave_list); \
>       likely(__slave_ptr != &bond->slave_list) ? \
>         bond_to_slave_rcu(__slave_ptr) : NULL;})
>
> 
> Or, even better (from my POV), add a generic macro to rculist.h (again,
> didn't even compile it) - it can be used later on:
> 
> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
> index f4b1001..37b49d1 100644
> --- a/include/linux/rculist.h
> +++ b/include/linux/rculist.h
> @@ -23,6 +23,7 @@
>   * way, we must not access it directly
>   */
>  #define list_next_rcu(list)    (*((struct list_head __rcu
> **)(&(list)->next)))
> +#define list_prev_rcu(list)    (*((struct list_head __rcu
> **)(&(list)->prev)))
>  
>  /*
>   * Insert a new entry between two known consecutive entries.
> @@ -271,6 +272,12 @@ static inline void list_splice_init_rcu(struct
> list_head *list,
>        likely(__ptr != __next) ? container_of(__next, type, member) : NULL; \
>      })
>  
> +#define list_last_or_null_rcu(ptr, type, member) \
> +    ({struct list_head *__ptr = (ptr); \
> +      struct list_head __rcu *__last = list_prev_rcu(__ptr); \
> +      likely(__ptr != __last) ? container_of(__prev, type, member) : NULL; \
> +    })
> +
Hi,
Actually I don't think you can dereference ->prev and use the standard
list_del_rcu because it guarantees only the ->next ptr will be valid and
->prev is set to LIST_POISON2.
IMO, you'll need something like this: https://lkml.org/lkml/2012/7/25/193
with the bidir_del and all that.

But in any case I complete agree with Veaceslav here. Read all the
documentation carefully :-)

Cheers,
 Nik

>  /**
>   * list_for_each_entry_rcu    -    iterate over rcu list of given type
>   * @pos:    the type * to use as a loop cursor.
> ------- END OF PATCH ------
> 
> Anyway, it's up to you.
> 
> Hope that helps.

^ permalink raw reply

* Re: [PATCH net-next v4 1/6] bonding: simplify and use RCU protection for 3ad xmit path
From: Veaceslav Falico @ 2013-09-07 15:03 UTC (permalink / raw)
  To: Nikolay Aleksandrov
  Cc: Ding Tianhong, Jay Vosburgh, Andy Gospodarek, David S. Miller,
	Netdev
In-Reply-To: <522B3BF1.2020208@redhat.com>

On Sat, Sep 07, 2013 at 04:45:05PM +0200, Nikolay Aleksandrov wrote:
>
>On 09/07/2013 04:20 PM, Veaceslav Falico wrote:
>> On Fri, Sep 06, 2013 at 03:28:07PM +0800, Ding Tianhong wrote:
...snip...
>> diff --git a/include/linux/rculist.h b/include/linux/rculist.h
>> index f4b1001..37b49d1 100644
>> --- a/include/linux/rculist.h
>> +++ b/include/linux/rculist.h
>> @@ -23,6 +23,7 @@
>>   * way, we must not access it directly
>>   */
>>  #define list_next_rcu(list)    (*((struct list_head __rcu
>> **)(&(list)->next)))
>> +#define list_prev_rcu(list)    (*((struct list_head __rcu
>> **)(&(list)->prev)))
>>
>>  /*
>>   * Insert a new entry between two known consecutive entries.
>> @@ -271,6 +272,12 @@ static inline void list_splice_init_rcu(struct
>> list_head *list,
>>        likely(__ptr != __next) ? container_of(__next, type, member) : NULL; \
>>      })
>>
>> +#define list_last_or_null_rcu(ptr, type, member) \
>> +    ({struct list_head *__ptr = (ptr); \
>> +      struct list_head __rcu *__last = list_prev_rcu(__ptr); \
>> +      likely(__ptr != __last) ? container_of(__prev, type, member) : NULL; \
>> +    })
>> +
>Hi,
>Actually I don't think you can dereference ->prev and use the standard
>list_del_rcu because it guarantees only the ->next ptr will be valid and
>->prev is set to LIST_POISON2.
>IMO, you'll need something like this: https://lkml.org/lkml/2012/7/25/193
>with the bidir_del and all that.

Yeah, right, my bad - we can rely only on the ->next pointer, indeed,
missed that part. RCU is hard :).

So it'll be a lot harder to implement bond_last_slave_rcu() in a
'straightforward' approach.

I'd rather go in the opposite direction here - i.e. drop the 'reverse'
traversal completely, and all the use cases for bond_last_slave_rcu(). I've
got some patches already - http://patchwork.ozlabs.org/patch/272076/ doing
that, and hopefully will remove the whole 'backword' traversal completely
in the future.

>
>But in any case I complete agree with Veaceslav here. Read all the
>documentation carefully :-)
>
>Cheers,
> Nik
>
>>  /**
>>   * list_for_each_entry_rcu    -    iterate over rcu list of given type
>>   * @pos:    the type * to use as a loop cursor.
>> ------- END OF PATCH ------
>>
>> Anyway, it's up to you.
>>
>> Hope that helps.
>

^ permalink raw reply

* Re: [patch net/stable] ipv6/exthdrs: accept tlv which includes only padding
From: Jiri Pirko @ 2013-09-07 15:46 UTC (permalink / raw)
  To: Eldad Zack; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <alpine.LNX.2.00.1309071427120.1262@heraclitus>

Sat, Sep 07, 2013 at 02:31:36PM CEST, eldad@fogrefinery.com wrote:
>
>Hi Jiri,
>
>On Fri, 6 Sep 2013, Jiri Pirko wrote:
>
>> In rfc4942 and rfc2460 I cannot find anything which would implicate to
>> drop packets which have only padding in tlv.
>
>NAK from my side.
>Please read RFC4942 2.1.9.5 "Misuse of Pad1 and PadN Options".

I did.

>
>While it doesn't specifically discusses this corner case, you can 
>understand from "There is no legitimate reason for padding beyond the 
>next eight octet..." that there's also no legitimate reason for an 
>option header containing only padding.

Okay. I'm glad you agree with me and that we both understand the rfc the
same way. And since the rfc does not say that "here's no legitimate
reason for an option header containing only padding", this should be
possible. I say we respect rfc and do not add stronger restrictions than
it dictates. No need for them.

>I can't imagine a sane use-case for this.

rfcs are not about sanity...

>
>> Current behaviour breaks TAHI Test v6LC.1.2.6.
>
>I'm not familiar with this, but IMHO the test should be reversed :)
>
>Cheers,
>Eldad
>
>> 
>> Problem was intruduced in:
>> 9b905fe6843 "ipv6/exthdrs: strict Pad1 and PadN check"
>> 
>> Signed-off-by: Jiri Pirko <jiri@resnulli.us>
>> ---
>>  net/ipv6/exthdrs.c | 6 ------
>>  1 file changed, 6 deletions(-)
>> 
>> diff --git a/net/ipv6/exthdrs.c b/net/ipv6/exthdrs.c
>> index 07a7d65..8d67900 100644
>> --- a/net/ipv6/exthdrs.c
>> +++ b/net/ipv6/exthdrs.c
>> @@ -162,12 +162,6 @@ static bool ip6_parse_tlv(const struct tlvtype_proc *procs, struct sk_buff *skb)
>>  		off += optlen;
>>  		len -= optlen;
>>  	}
>> -	/* This case will not be caught by above check since its padding
>> -	 * length is smaller than 7:
>> -	 * 1 byte NH + 1 byte Length + 6 bytes Padding
>> -	 */
>> -	if ((padlen == 6) && ((off - skb_network_header_len(skb)) == 8))
>> -		goto bad;
>>  
>>  	if (len == 0)
>>  		return true;
>> -- 
>> 1.8.3.1
>> 

^ permalink raw reply

* Re: [patch net/stable] ipv6/exthdrs: accept tlv which includes only padding
From: Eldad Zack @ 2013-09-07 16:46 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, kuznet, jmorris, yoshfuji, kaber
In-Reply-To: <20130907154602.GA1442@minipsycho.orion>



On Sat, 7 Sep 2013, Jiri Pirko wrote:

> Sat, Sep 07, 2013 at 02:31:36PM CEST, eldad@fogrefinery.com wrote:
> >
> >Hi Jiri,
> >
> >On Fri, 6 Sep 2013, Jiri Pirko wrote:
> >
> >> In rfc4942 and rfc2460 I cannot find anything which would implicate to
> >> drop packets which have only padding in tlv.
> >
> >NAK from my side.
> >Please read RFC4942 2.1.9.5 "Misuse of Pad1 and PadN Options".
> 
> I did.
> 
> >
> >While it doesn't specifically discusses this corner case, you can 
> >understand from "There is no legitimate reason for padding beyond the 
> >next eight octet..." that there's also no legitimate reason for an 
> >option header containing only padding.
> 
> Okay. I'm glad you agree with me and that we both understand the rfc the
> same way. And since the rfc does not say that "here's no legitimate
> reason for an option header containing only padding", this should be
> possible. I say we respect rfc and do not add stronger restrictions than
> it dictates. No need for them.

Strictly speaking, this RFC is informational, so it is doesn't dictate 
per se. I hope you don't suggest removing the other checks as well...

> >I can't imagine a sane use-case for this.
> 
> rfcs are not about sanity...

Great, we're in agreement again :) But I think the networking code is 
or should be.
What IPv6 stack would generate such a packet, given that the only usage 
of the padding options is to align other options?
Why should I accept a packet which is most likely an artificially 
crafted packet (RFC 3514)?

Cheers,
Eldad

^ permalink raw reply

* Re: TSQ accounting skb->truesize degrades throughput for large packets
From: Eric Dumazet @ 2013-09-07 17:21 UTC (permalink / raw)
  To: Zoltan Kiss; +Cc: Wei Liu, Jonathan Davies, Ian Campbell, netdev, xen-devel
In-Reply-To: <1378486840.31445.36.camel@edumazet-glaptop>

On Fri, 2013-09-06 at 10:00 -0700, Eric Dumazet wrote:
> On Fri, 2013-09-06 at 17:36 +0100, Zoltan Kiss wrote:
> 
> > So I guess it would be good to revisit the default value of this 
> > setting.
> 
> If ixgbe requires 3 TSO packets in TX ring to get line rate, you also
> can tweak dev->gso_max_size from 65535 to 64000.

Another idea would be to no longer use tcp_limit_output_bytes but

max(sk_pacing_rate / 1000, 2*MSS)

This means that number of packets in FQ would be limited to the
equivalent of 1ms, so TCP could have faster response to packet losses : 

Retransmitted packets would not have to wait for prior packets being
drained from FQ

For a 8Gbps flow, 1Gbyte/s, sk_pacing_rate would be 2Gbyte, this would
translate to ~2 Mbytes in Qdisc/TX ring.

sk_pacing_rate was introduced in linux-3.12, but could be backported
easily.

^ permalink raw reply

* Re: Realtek r8168 hangs when sending data at full speed on a gigabit link
From: David R @ 2013-09-07 17:19 UTC (permalink / raw)
  To: Francois Romieu
  Cc: Frédéric Leroy, netdev, Realtek linux nic maintainers,
	Hayes Wang
In-Reply-To: <20130907101546.GA19560@electric-eye.fr.zoreil.com>

Hi

You mean this line from the dmesg? :-

[    6.015979] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    6.016285] r8169 0000:02:00.0: irq 73 for MSI/MSI-X
[    6.016549] r8169 0000:02:00.0 eth0: RTL8168f/8111f at
0xffffc9000060e000, 60:a4:4c:2c:ff:a1, XID 08000800 IRQ 73

Cheers
David

On 07/09/13 11:15, Francois Romieu wrote:
> Frédéric Leroy <fredo@starox.org> :
> [...]
>
> Sorry for the delay. It was a busy week.
>
> Can you give the hack below a try ?
>
> David, could you send me the r8169 XID line from a kernel running on
> the hardware for which I sent you a similar patch back in 2013/04 ?
> You appeared to own a 8168f and it could be a RTL_GIGA_MAC_VER_36.
>
> Thanks.
>
> Hayes, see http://marc.info/?l=linux-netdev&m=137794473416308&w=1 for
> history. It could be eb2dc35d99028b698cdedba4f5522bc43e576bd2
> ("r8169: RxConfig hack for the 8168evl.") return, with a revenge.
>
> ---
>  drivers/net/ethernet/realtek/r8169.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/drivers/net/ethernet/realtek/r8169.c b/drivers/net/ethernet/realtek/r8169.c
> index 6f87f2c..3397cee 100644
> --- a/drivers/net/ethernet/realtek/r8169.c
> +++ b/drivers/net/ethernet/realtek/r8169.c
> @@ -4231,6 +4231,7 @@ static void rtl_init_rxcfg(struct rtl8169_private *tp)
>  	case RTL_GIGA_MAC_VER_23:
>  	case RTL_GIGA_MAC_VER_24:
>  	case RTL_GIGA_MAC_VER_34:
> +	case RTL_GIGA_MAC_VER_35:
>  		RTL_W32(RxConfig, RX128_INT_EN | RX_MULTI_EN | RX_DMA_BURST);
>  		break;
>  	case RTL_GIGA_MAC_VER_40:

^ permalink raw reply

* [PATCH net] net: sctp: fix smatch warning in sctp_send_asconf_del_ip
From: Daniel Borkmann @ 2013-09-07 18:51 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-sctp, Neil Horman, Michio Honda

This was originally reported in [1] and posted by Neil Horman [2], he said:

  Fix up a missed null pointer check in the asconf code. If we don't find
  a local address, but we pass in an address length of more than 1, we may
  dereference a NULL laddr pointer. Currently this can't happen, as the only
  users of the function pass in the value 1 as the addrcnt parameter, but
  its not hot path, and it doesn't hurt to check for NULL should that ever
  be the case.

The callpath from sctp_asconf_mgmt() looks okay. But this could be triggered
from sctp_setsockopt_bindx() call with SCTP_BINDX_REM_ADDR and addrcnt > 1
while passing all possible addresses from the bind list to SCTP_BINDX_REM_ADDR
so that we do *not* find a single address in the association's bind address
list that is not in the packed array of addresses. If this happens when we
have an established association with ASCONF-capable peers, then we could get
a NULL pointer dereference as we only check for laddr == NULL && addrcnt == 1
and call later sctp_make_asconf_update_ip() with NULL laddr.

BUT: this actually won't happen as sctp_bindx_rem() will catch such a case
and return with an error earlier. As this is incredably unintuitive and error
prone, add a check to catch at least future bugs here. As Neil says, its not
hot path. Introduced by 8a07eb0a5 ("sctp: Add ASCONF operation on the
single-homed host").

 [1] http://www.spinics.net/lists/linux-sctp/msg02132.html
 [2] http://www.spinics.net/lists/linux-sctp/msg02133.html

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
Cc: Michio Honda <micchie@sfc.wide.ad.jp>
---
 net/sctp/socket.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index 5462bbb..911b71b 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -806,6 +806,9 @@ static int sctp_send_asconf_del_ip(struct sock		*sk,
 			goto skip_mkasconf;
 		}
 
+		if (laddr == NULL)
+			return -EINVAL;
+
 		/* We do not need RCU protection throughout this loop
 		 * because this is done under a socket lock from the
 		 * setsockopt call.
-- 
1.7.11.7

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox