Netdev List


* Re: [PATCH] xen-netback: use correct index for invalidation in netbk_tx_check_mop()
From: Ian Campbell @ 2011-11-18 15:09 UTC (permalink / raw)
  To: netdev@vger.kernel.org, stable; +Cc: xen-devel@lists.xensource.com, Jan Beulich
In-Reply-To: <1321625599-3739-1-git-send-email-ian.campbell@citrix.com>

On Fri, 2011-11-18 at 14:13 +0000, Ian Campbell wrote:
> From: Jan Beulich <JBeulich@suse.com>
> 
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Ian Campbell <ian.campbell@citrix.com>

I forgot to CC stable@ here. This applies to 2.6.39 onwards.

> ---
>  drivers/net/xen-netback/netback.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> index 0cb594c..1ae270e 100644
> --- a/drivers/net/xen-netback/netback.c
> +++ b/drivers/net/xen-netback/netback.c
> @@ -1021,7 +1021,7 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
>  		pending_idx = *((u16 *)skb->data);
>  		xen_netbk_idx_release(netbk, pending_idx);
>  		for (j = start; j < i; j++) {
> -			pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
> +			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
>  			xen_netbk_idx_release(netbk, pending_idx);
>  		}
>  


* Re: [PATCH] xen-netback: use correct index for invalidation in netbk_tx_check_mop()
From: Ian Campbell @ 2011-11-18 15:38 UTC (permalink / raw)
  To: netdev@vger.kernel.org; +Cc: stable, xen-devel@lists.xensource.com, Jan Beulich
In-Reply-To: <1321628966.3664.363.camel@zakaz.uk.xensource.com>

On Fri, 2011-11-18 at 15:09 +0000, Ian Campbell wrote:
> On Fri, 2011-11-18 at 14:13 +0000, Ian Campbell wrote:
> > From: Jan Beulich <JBeulich@suse.com>
> > 
> > Signed-off-by: Jan Beulich <jbeulich@suse.com>
> > Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
> 
> I forgot to CC stable@ here. This applies to 2.6.39 onwards.

I also neglected to change the subject line to the correct upstream
function name. Please ignore this patch, I'll try again...

Ian.

> 
> > ---
> >  drivers/net/xen-netback/netback.c |    2 +-
> >  1 files changed, 1 insertions(+), 1 deletions(-)
> > 
> > diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
> > index 0cb594c..1ae270e 100644
> > --- a/drivers/net/xen-netback/netback.c
> > +++ b/drivers/net/xen-netback/netback.c
> > @@ -1021,7 +1021,7 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
> >  		pending_idx = *((u16 *)skb->data);
> >  		xen_netbk_idx_release(netbk, pending_idx);
> >  		for (j = start; j < i; j++) {
> > -			pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
> > +			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
> >  			xen_netbk_idx_release(netbk, pending_idx);
> >  		}
> >  
> 
> 
> 


* Re: PROBLEM: pppol2tp over pppoe NULL pointer dereference
From: Окунев Дмитрий Юрьевич @ 2011-11-18 15:35 UTC (permalink / raw)
  To: netdev@vger.kernel.org

Hello.

I just want to report that the patch in the subject works well. The panics have disappeared.

Best regards, Dmitry.


* Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
From: Ben Hutchings @ 2011-11-18 15:40 UTC (permalink / raw)
  To: Sasha Levin; +Cc: Krishna Kumar, rusty, mst, netdev, kvm, davem, virtualization
In-Reply-To: <1321597481.8010.19.camel@lappy>

On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
> On Fri, 2011-11-18 at 01:08 +0000, Ben Hutchings wrote:
> > On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
> > > Changes for multiqueue virtio_net driver.
> > [...]
> > > @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
> > >  {
> > >  	struct virtnet_info *vi = netdev_priv(dev);
> > >  	int cpu;
> > > -	unsigned int start;
> > >  
> > >  	for_each_possible_cpu(cpu) {
> > > -		struct virtnet_stats __percpu *stats
> > > -			= per_cpu_ptr(vi->stats, cpu);
> > > -		u64 tpackets, tbytes, rpackets, rbytes;
> > > -
> > > -		do {
> > > -			start = u64_stats_fetch_begin(&stats->syncp);
> > > -			tpackets = stats->tx_packets;
> > > -			tbytes   = stats->tx_bytes;
> > > -			rpackets = stats->rx_packets;
> > > -			rbytes   = stats->rx_bytes;
> > > -		} while (u64_stats_fetch_retry(&stats->syncp, start));
> > > -
> > > -		tot->rx_packets += rpackets;
> > > -		tot->tx_packets += tpackets;
> > > -		tot->rx_bytes   += rbytes;
> > > -		tot->tx_bytes   += tbytes;
> > > +		int qpair;
> > > +
> > > +		for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) {
> > > +			struct virtnet_send_stats __percpu *tx_stat;
> > > +			struct virtnet_recv_stats __percpu *rx_stat;
> > 
> > While you're at it, you can drop the per-CPU stats and make them only
> > per-queue.  There is unlikely to be any benefit in maintaining them
> > per-CPU while receive and transmit processing is serialised per-queue.
> 
> It allows you to update stats without a lock.

But you'll already be holding a lock related to the queue.

> What's the benefit of having them per queue?

It should save some memory (and a little time when summing stats, though
that's unlikely to matter much).

The important thing is that splitting up stats per-CPU *and* per-queue
is a waste.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
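
The memory argument above can be made concrete with a small user-space sketch. Everything in it is an assumption made for illustration: the NR_CPUS and NR_QUEUES values, the reduced queue_stats struct, and the helper names do not come from the virtio_net patch. It compares the footprint of counters kept per-CPU and per-queue with counters kept per-queue only, and shows the summation for the per-queue layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes, chosen for illustration only. */
enum { NR_CPUS = 8, NR_QUEUES = 4 };

/* Simplified stand-in for a driver's packet/byte counters. */
struct queue_stats {
	uint64_t packets;
	uint64_t bytes;
};

/* Footprint when every queue keeps one counter set per CPU. */
static size_t percpu_perqueue_bytes(void)
{
	return (size_t)NR_CPUS * NR_QUEUES * sizeof(struct queue_stats);
}

/* Footprint when each queue keeps a single counter set. */
static size_t perqueue_bytes(void)
{
	return (size_t)NR_QUEUES * sizeof(struct queue_stats);
}

/* Sum per-queue counters into a device-wide total. */
static struct queue_stats sum_queues(const struct queue_stats *q, size_t n)
{
	struct queue_stats tot = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		tot.packets += q[i].packets;
		tot.bytes += q[i].bytes;
	}
	return tot;
}
```

If updates to a queue's counters are already serialised by that queue's own processing, the per-queue layout needs a factor of NR_CPUS less memory and a single loop to total.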


* [PATCH] xen-netback: use correct index for invalidation in xen_netbk_tx_check_gop()
From: Ian Campbell @ 2011-11-18 15:42 UTC (permalink / raw)
  To: netdev; +Cc: xen-devel, Jan Beulich, Jan Beulich, Ian Campbell, stable

From: Jan Beulich <JBeulich@suse.com>

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Cc: stable@vger.kernel.org
---
 drivers/net/xen-netback/netback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/xen-netback/netback.c b/drivers/net/xen-netback/netback.c
index 0cb594c..1ae270e 100644
--- a/drivers/net/xen-netback/netback.c
+++ b/drivers/net/xen-netback/netback.c
@@ -1021,7 +1021,7 @@ static int xen_netbk_tx_check_gop(struct xen_netbk *netbk,
 		pending_idx = *((u16 *)skb->data);
 		xen_netbk_idx_release(netbk, pending_idx);
 		for (j = start; j < i; j++) {
-			pending_idx = frag_get_pending_idx(&shinfo->frags[i]);
+			pending_idx = frag_get_pending_idx(&shinfo->frags[j]);
 			xen_netbk_idx_release(netbk, pending_idx);
 		}
 
-- 
1.7.2.5
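
The one-character fix above is an instance of a common loop mistake: indexing with the loop bound instead of the loop variable. A minimal user-space sketch follows, assuming a hypothetical frag struct and helper name (neither is taken from netback.c). It walks frags[start..end-1] and records each pending index, which only works if j is used as the subscript.

```c
#include <stddef.h>

/* Hypothetical stand-in for a fragment that carries a pending index. */
struct frag {
	unsigned short pending_idx;
};

/*
 * Collect the pending indices of frags[start..end-1] into out[].
 * The loop variable j, not the bound end, must index the array;
 * indexing with end would read the same fragment on every pass.
 * Returns the number of indices written.
 */
static size_t collect_pending(const struct frag *frags, size_t start,
			      size_t end, unsigned short *out)
{
	size_t j, n = 0;

	for (j = start; j < end; j++)
		out[n++] = frags[j].pending_idx;
	return n;
}
```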


* Re: [PATCH] ethtool: Use kmemdup rather than duplicating its implementation
From: Ben Hutchings @ 2011-11-18 15:47 UTC (permalink / raw)
  To: Thomas Meyer; +Cc: netdev, linux-kernel
In-Reply-To: <1321571135.1624.313.camel@localhost.localdomain>

On Fri, 2011-11-18 at 00:05 +0100, Thomas Meyer wrote:
> The semantic patch that makes this change is available
> in scripts/coccinelle/api/memdup.cocci.
> 
> Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
> ---

The one-line description should refer to 'gianfar', not 'ethtool'.
(I expect David will fix this up.)

Ben.

> diff -u -p a/drivers/net/ethernet/freescale/gianfar_ethtool.c b/drivers/net/ethernet/freescale/gianfar_ethtool.c
> --- a/drivers/net/ethernet/freescale/gianfar_ethtool.c 2011-11-07 19:37:57.036756543 +0100
> +++ b/drivers/net/ethernet/freescale/gianfar_ethtool.c 2011-11-08 10:42:14.842512269 +0100
> @@ -1410,10 +1410,9 @@ static int gfar_optimize_filer_masks(str
>  
>  	/* We need a copy of the filer table because
>  	 * we want to change its order */
> -	temp_table = kmalloc(sizeof(*temp_table), GFP_KERNEL);
> +	temp_table = kmemdup(tab, sizeof(*temp_table), GFP_KERNEL);
>  	if (temp_table == NULL)
>  		return -ENOMEM;
> -	memcpy(temp_table, tab, sizeof(*temp_table));
>  
>  	mask_table = kcalloc(MAX_FILER_CACHE_IDX / 2 + 1,
>  			sizeof(struct gfar_mask_entry), GFP_KERNEL);
> 


-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
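
For readers unfamiliar with the helper, the transformation in the quoted patch folds an allocation and a copy into one call. The following is a user-space analogue only, built on malloc() and memcpy(); the name memdup_sketch is made up for this example and is not the kernel implementation.

```c
#include <stdlib.h>
#include <string.h>

/*
 * User-space analogue of the kernel's kmemdup(): allocate len bytes
 * and copy src into them. Returns NULL if the allocation fails.
 * The name memdup_sketch is made up for this example.
 */
static void *memdup_sketch(const void *src, size_t len)
{
	void *p = malloc(len);

	if (p)
		memcpy(p, src, len);
	return p;
}
```

The caller still checks for NULL exactly as before; only the separate memcpy() disappears.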


* Re: [PATCH] net: Use kmemdup rather than duplicating its implementation
From: Ben Hutchings @ 2011-11-18 15:48 UTC (permalink / raw)
  To: Thomas Meyer; +Cc: samuel, davem, netdev, linux-kernel
In-Reply-To: <1321569820.1624.307.camel@localhost.localdomain>

On Thu, 2011-11-17 at 23:43 +0100, Thomas Meyer wrote:
> The semantic patch that makes this change is available
> in scripts/coccinelle/api/memdup.cocci.
> 
> Signed-off-by: Thomas Meyer <thomas@m3y3r.de>
> ---

Similarly this is for the specific driver 'irttp' and not generally
'net'.

Ben.

> diff -u -p a/net/irda/irttp.c b/net/irda/irttp.c
> --- a/net/irda/irttp.c 2011-11-07 19:39:06.071138486 +0100
> +++ b/net/irda/irttp.c 2011-11-08 10:59:07.152748948 +0100
> @@ -1461,14 +1461,13 @@ struct tsap_cb *irttp_dup(struct tsap_cb
>  	}
>  
>  	/* Allocate a new instance */
> -	new = kmalloc(sizeof(struct tsap_cb), GFP_ATOMIC);
> +	new = kmemdup(orig, sizeof(struct tsap_cb), GFP_ATOMIC);
>  	if (!new) {
>  		IRDA_DEBUG(0, "%s(), unable to kmalloc\n", __func__);
>  		spin_unlock_irqrestore(&irttp->tsaps->hb_spinlock, flags);
>  		return NULL;
>  	}
>  	/* Dup */
> -	memcpy(new, orig, sizeof(struct tsap_cb));
>  	spin_lock_init(&new->lock);
>  
>  	/* We don't need the old instance any more */

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: query : unregister/register netdev
From: Ben Hutchings @ 2011-11-18 15:52 UTC (permalink / raw)
  To: Madhvapathi Sriram; +Cc: netdev
In-Reply-To: <CAAvRe=jhfXEnLFkwbPmWEroDE-qHUpZ54TSSdw=WpmfPGZpsPQ@mail.gmail.com>

On Fri, 2011-11-18 at 17:53 +0530, Madhvapathi Sriram wrote:
> Hi,
> 
> In register_netdevice(), BUG_ON(dev->reg_state != NETREG_UNINITIALIZED) is
> used to check if the device that is being registered is indeed a new one.
> 
> However, I see that this state is never moved to. It only happens when a
> netdevice is allocated (by default to 0 using kzalloc).
> 
> So, the cycle register-->unregister-->register would fail since in the
> unregister_netdevice the state is only moved to NETREG_UNREGISTERED (at max
> to NETREG_RELEASED using free_netdev)
> 
> So, I presume that to reinitialize a netdevice one has to free the
> netdevice, re allocate netdevice and only then re register.
> 
> Wondering why unregister and reregister is not allowed, rather than having
> go through the free/alloc cycle - I am not an expert though.

Why do you think that would be useful?

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.
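
The behaviour described in the quoted question can be modelled in a few lines of user-space C. This is a simplified sketch, not kernel code: the state names, struct, and functions are hypothetical, and the model returns an error where the kernel would hit BUG_ON().

```c
#include <stdlib.h>

/* Simplified model of the states named in the question above. */
enum reg_state {
	STATE_UNINITIALIZED = 0,	/* what a zeroed allocation yields */
	STATE_REGISTERED,
	STATE_UNREGISTERED,
};

struct dev_sketch {
	enum reg_state reg_state;
};

/* A zeroing allocation is the only way to reach STATE_UNINITIALIZED. */
static struct dev_sketch *dev_alloc_sketch(void)
{
	return calloc(1, sizeof(struct dev_sketch));
}

/* Registration is refused unless the device is freshly allocated. */
static int dev_register_sketch(struct dev_sketch *d)
{
	if (d->reg_state != STATE_UNINITIALIZED)
		return -1;
	d->reg_state = STATE_REGISTERED;
	return 0;
}

/* Unregistration never returns the device to STATE_UNINITIALIZED. */
static void dev_unregister_sketch(struct dev_sketch *d)
{
	d->reg_state = STATE_UNREGISTERED;
}
```

In this model a register, unregister, register sequence fails on the second registration, while free followed by a fresh allocation succeeds.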


* Re: [PATCH 1/1] net/macb: add DT support
From: Jamie Iles @ 2011-11-18 15:58 UTC (permalink / raw)
  To: Jean-Christophe PLAGNIOL-VILLARD
  Cc: devicetree-discuss, netdev, Jamie Iles, Nicolas Ferre
In-Reply-To: <1321626565-11261-1-git-send-email-plagnioj@jcrosoft.com>

Hi Jean-Christophe,

On Fri, Nov 18, 2011 at 03:29:25PM +0100, Jean-Christophe PLAGNIOL-VILLARD wrote:
> allow the DT to pass the mac address and the phy mode
> 
> Signed-off-by: Jean-Christophe PLAGNIOL-VILLARD <plagnioj@jcrosoft.com>
> Cc: Jamie Iles <jamie@jamieiles.com>
> Cc: Nicolas Ferre <nicolas.ferre@atmel.com>

This looks OK to me in principle.  I can't easily test this at the
moment, though, as I don't have a DT platform that has the clk framework up
and running.  A couple of nits/questions inline, but thanks for doing
this!

Jamie

> ---
>  Documentation/devicetree/bindings/net/macb.txt |   22 ++++++++
>  drivers/net/ethernet/cadence/macb.c            |   65 +++++++++++++++++++++---
>  drivers/net/ethernet/cadence/macb.h            |    2 +
>  3 files changed, 81 insertions(+), 8 deletions(-)
>  create mode 100644 Documentation/devicetree/bindings/net/macb.txt
> 
> diff --git a/Documentation/devicetree/bindings/net/macb.txt b/Documentation/devicetree/bindings/net/macb.txt
> new file mode 100644
> index 0000000..2b727ec
> --- /dev/null
> +++ b/Documentation/devicetree/bindings/net/macb.txt
> @@ -0,0 +1,22 @@
> +* Cadence EMACB
> +
> +Implemented on Atmel AT91 & AVR32 SoC

I think something along the lines of "Binding for the Cadence MACB 
Ethernet controller" rather than listing specific parts might be 
clearer.

> +
> +Required properties:
> +- compatible : Should be "atmel,macb" for Atmel
> +- reg : Address and length of the register set for the device
> +- interrupts : Should contain macb interrupt
> +- phy-mode : String, operation mode of the PHY interface.
> +  Supported values are: "mii", "rmii",
> +
> +Optional properties:
> +- local-mac-address : 6 bytes, mac address
> +
> +Examples:
> +
> +	macb0: macb@fffc4000 {

Rob pointed out to me a little while ago that the preferred naming from 
the ePAPR document would be:

	macb0: ethernet@fffc4000

so it might be worth being consistent here.

> +		compatible = "atmel,macb";

This should be "cdns,macb" as it isn't Atmel specific.  I believe cdns 
is the correct stock ticker symbol for Cadence.

> +		reg = <0xfffc4000 0x4000>;
> +		interrupts = <21>;
> +		phy-mode = "mii";
> +	};
> diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
> index a437b46..2c345bc 100644
> --- a/drivers/net/ethernet/cadence/macb.c
> +++ b/drivers/net/ethernet/cadence/macb.c
> @@ -20,6 +20,9 @@
>  #include <linux/etherdevice.h>
>  #include <linux/dma-mapping.h>
>  #include <linux/platform_device.h>
> +#include <linux/of.h>
> +#include <linux/of_device.h>
> +#include <linux/of_net.h>
>  #include <linux/phy.h>
>  
>  #include <mach/board.h>
> @@ -81,6 +84,20 @@ static void __init macb_get_hwaddr(struct macb *bp)
>  	addr[4] = top & 0xff;
>  	addr[5] = (top >> 8) & 0xff;
>  
> +#ifdef CONFIG_OF
> +	/*
> +	 * 2) from device tree data
> +	 */
> +	if (!is_valid_ether_addr(addr)) {
> +		struct device_node *np = bp->pdev->dev.of_node;
> +		if (np) {
> +			const char *mac = of_get_mac_address(np);
> +			if (mac)
> +				memcpy(addr, mac, sizeof(addr));
> +		}
> +	}
> +#endif

I'm a bit conflicted here.  I think we should always use the MAC address 
from the device tree if it is present even if the current MAC address is 
valid.

> +
>  	if (is_valid_ether_addr(addr)) {
>  		memcpy(bp->dev->dev_addr, addr, sizeof(addr));
>  	} else {
> @@ -191,7 +208,6 @@ static int macb_mii_probe(struct net_device *dev)
>  {
>  	struct macb *bp = netdev_priv(dev);
>  	struct phy_device *phydev;
> -	struct eth_platform_data *pdata;
>  	int ret;
>  
>  	phydev = phy_find_first(bp->mii_bus);
> @@ -200,14 +216,11 @@ static int macb_mii_probe(struct net_device *dev)
>  		return -1;
>  	}
>  
> -	pdata = bp->pdev->dev.platform_data;
>  	/* TODO : add pin_irq */
>  
>  	/* attach the mac to the phy */
>  	ret = phy_connect_direct(dev, phydev, &macb_handle_link_change, 0,
> -				 pdata && pdata->is_rmii ?
> -				 PHY_INTERFACE_MODE_RMII :
> -				 PHY_INTERFACE_MODE_MII);
> +				 bp->phy_interface);
>  	if (ret) {
>  		printk(KERN_ERR "%s: Could not attach to PHY\n", dev->name);
>  		return ret;
> @@ -1117,6 +1130,30 @@ static const struct net_device_ops macb_netdev_ops = {
>  #endif
>  };
>  
> +#if defined(CONFIG_OF)
> +static const struct of_device_id macb_dt_ids[] = {
> +	{ .compatible = "atmel,macb" },

cdns,macb again.

> +	{ /* sentinel */ }
> +};
> +
> +MODULE_DEVICE_TABLE(of, macb_dt_ids);
> +
> +static int __devinit macb_get_phy_mode_dt(struct platform_device *pdev)
> +{
> +	struct device_node *np = pdev->dev.of_node;
> +
> +	if (np)
> +		return of_get_phy_mode(np);
> +
> +	return -ENODEV;
> +}
> +#else
> +static int __devinit macb_get_phy_mode_dt(struct platform_device *pdev)
> +{
> +	return -ENODEV;
> +}
> +#endif
> +
>  static int __init macb_probe(struct platform_device *pdev)
>  {
>  	struct eth_platform_data *pdata;
> @@ -1210,20 +1247,31 @@ static int __init macb_probe(struct platform_device *pdev)
>  	macb_writel(bp, NCFGR, config);
>  
>  	macb_get_hwaddr(bp);
> -	pdata = pdev->dev.platform_data;
>  
> -	if (pdata && pdata->is_rmii)
> +	err = macb_get_phy_mode_dt(pdev);
> +	if (err < 0) {
> +		pdata = pdev->dev.platform_data;
> +		if (pdata && pdata->is_rmii)
> +			bp->phy_interface = PHY_INTERFACE_MODE_RMII;
> +		else
> +			bp->phy_interface = PHY_INTERFACE_MODE_MII;
> +	} else {
> +		bp->phy_interface = err;
> +	}
> +
> +	if (bp->phy_interface == PHY_INTERFACE_MODE_RMII) {
>  #if defined(CONFIG_ARCH_AT91)
>  		macb_writel(bp, USRIO, (MACB_BIT(RMII) | MACB_BIT(CLKEN)) );
>  #else
>  		macb_writel(bp, USRIO, 0);
>  #endif
> -	else
> +	} else {
>  #if defined(CONFIG_ARCH_AT91)
>  		macb_writel(bp, USRIO, MACB_BIT(CLKEN));
>  #else
>  		macb_writel(bp, USRIO, MACB_BIT(MII));
>  #endif
> +	}
>  
>  	bp->tx_pending = DEF_TX_RING_PENDING;
>  
> @@ -1344,6 +1392,7 @@ static struct platform_driver macb_driver = {
>  	.driver		= {
>  		.name		= "macb",
>  		.owner	= THIS_MODULE,
> +		.of_match_table	= of_match_ptr(macb_dt_ids),
>  	},
>  };
>  
> diff --git a/drivers/net/ethernet/cadence/macb.h b/drivers/net/ethernet/cadence/macb.h
> index d3212f6..8342744 100644
> --- a/drivers/net/ethernet/cadence/macb.h
> +++ b/drivers/net/ethernet/cadence/macb.h
> @@ -389,6 +389,8 @@ struct macb {
>  	unsigned int 		link;
>  	unsigned int 		speed;
>  	unsigned int 		duplex;
> +
> +	phy_interface_t		phy_interface;
>  };
>  
>  #endif /* _MACB_H */
> -- 
> 1.7.7
> 
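
The MAC address precedence suggested in the review (prefer the device tree address whenever one is present) can be sketched in user-space C. The function names and the ETH_ALEN_SKETCH constant are hypothetical; the validity test is an assumption that treats an address as valid when it is neither all zeroes nor multicast.

```c
#include <stddef.h>
#include <string.h>

#define ETH_ALEN_SKETCH 6

/* Valid here means: not all zeroes and not a multicast address. */
static int mac_is_valid(const unsigned char *a)
{
	static const unsigned char zero[ETH_ALEN_SKETCH];

	return !(a[0] & 0x01) && memcmp(a, zero, ETH_ALEN_SKETCH) != 0;
}

/*
 * Policy suggested in the review: a device tree address, when present
 * (dt_mac != NULL), overrides whatever the hardware registers hold.
 * out keeps the hardware address only when no DT address is supplied.
 * Returns 1 if out ends up holding a valid address, 0 otherwise.
 */
static int pick_mac(const unsigned char *hw_mac, const unsigned char *dt_mac,
		    unsigned char *out)
{
	memcpy(out, dt_mac ? dt_mac : hw_mac, ETH_ALEN_SKETCH);
	return mac_is_valid(out);
}
```

A return value of 0 is where a driver would fall back to some other source, such as a random address.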


* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-18 16:02 UTC (permalink / raw)
  To: Flavio Leitner, David Miller; +Cc: Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321551536.2751.87.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>



David, unless I missed something, we should revert commit f39925dbde77
(ipv4: Cache learned redirect information in inetpeer.)

With following patch, redirects now work for me.

Thanks !



[PATCH net-next] ipv4: fix redirect handling

commit f39925dbde77 (ipv4: Cache learned redirect information in
inetpeer.) introduced a regression in ICMP redirect handling.

It assumed ipv4_dst_check() would be called because all possible routes
were attached to the inetpeer we modify in ip_rt_redirect(), but that's
not true.

commit 7cc9150ebe (route: fix ICMP redirect validation) tried to fix
this, but the solution was not complete. (It fixed only one route.)

So we must look up existing routes (including different TOS values) and
call check_peer_redir() on them.

Reported-by: Ivan Zahariev <famzah@icdsoft.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Flavio Leitner <fbl@redhat.com>
---
 net/ipv4/route.c |  110 ++++++++++++++++++++++++---------------------
 1 file changed, 59 insertions(+), 51 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 511f4a7..0c74da8 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1304,16 +1304,42 @@ static void rt_del(unsigned hash, struct rtable *rt)
 	spin_unlock_bh(rt_hash_lock_addr(hash));
 }
 
+static int check_peer_redir(struct dst_entry *dst, struct inet_peer *peer)
+{
+	struct rtable *rt = (struct rtable *) dst;
+	__be32 orig_gw = rt->rt_gateway;
+	struct neighbour *n, *old_n;
+
+	dst_confirm(&rt->dst);
+
+	rt->rt_gateway = peer->redirect_learned.a4;
+
+	n = ipv4_neigh_lookup(&rt->dst, &rt->rt_gateway);
+	if (IS_ERR(n))
+		return PTR_ERR(n);
+	old_n = xchg(&rt->dst._neighbour, n);
+	if (old_n)
+		neigh_release(old_n);
+	if (!n || !(n->nud_state & NUD_VALID)) {
+		if (n)
+			neigh_event_send(n, NULL);
+		rt->rt_gateway = orig_gw;
+		return -EAGAIN;
+	} else {
+		rt->rt_flags |= RTCF_REDIRECTED;
+		call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, n);
+	}
+	return 0;
+}
+
 /* called in rcu_read_lock() section */
 void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 		    __be32 saddr, struct net_device *dev)
 {
 	int s, i;
 	struct in_device *in_dev = __in_dev_get_rcu(dev);
-	struct rtable *rt;
 	__be32 skeys[2] = { saddr, 0 };
 	int    ikeys[2] = { dev->ifindex, 0 };
-	struct flowi4 fl4;
 	struct inet_peer *peer;
 	struct net *net;
 
@@ -1336,33 +1362,42 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
 			goto reject_redirect;
 	}
 
-	memset(&fl4, 0, sizeof(fl4));
-	fl4.daddr = daddr;
 	for (s = 0; s < 2; s++) {
 		for (i = 0; i < 2; i++) {
-			fl4.flowi4_oif = ikeys[i];
-			fl4.saddr = skeys[s];
-			rt = __ip_route_output_key(net, &fl4);
-			if (IS_ERR(rt))
-				continue;
-
-			if (rt->dst.error || rt->dst.dev != dev ||
-			    rt->rt_gateway != old_gw) {
-				ip_rt_put(rt);
-				continue;
-			}
+			unsigned int hash;
+			struct rtable __rcu **rthp;
+			struct rtable *rt;
+
+			hash = rt_hash(daddr, skeys[s], ikeys[i], rt_genid(net));
+
+			rthp = &rt_hash_table[hash].chain;
+
+			while ((rt = rcu_dereference(*rthp)) != NULL) {
+				rthp = &rt->dst.rt_next;
+
+				if (rt->rt_key_dst != daddr ||
+				    rt->rt_key_src != skeys[s] ||
+				    rt->rt_oif != ikeys[i] ||
+				    rt_is_input_route(rt) ||
+				    rt_is_expired(rt) ||
+				    !net_eq(dev_net(rt->dst.dev), net) ||
+				    rt->dst.error ||
+				    rt->dst.dev != dev ||
+				    rt->rt_gateway != old_gw)
+					continue;
 
-			if (!rt->peer)
-				rt_bind_peer(rt, rt->rt_dst, 1);
+				if (!rt->peer)
+					rt_bind_peer(rt, rt->rt_dst, 1);
 
-			peer = rt->peer;
-			if (peer) {
-				peer->redirect_learned.a4 = new_gw;
-				atomic_inc(&__rt_peer_genid);
+				peer = rt->peer;
+				if (peer) {
+					if (peer->redirect_learned.a4 != new_gw) {
+						peer->redirect_learned.a4 = new_gw;
+						atomic_inc(&__rt_peer_genid);
+					}
+					check_peer_redir(&rt->dst, peer);
+				}
 			}
-
-			ip_rt_put(rt);
-			return;
 		}
 	}
 	return;
@@ -1649,33 +1684,6 @@ static void ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu)
 	}
 }
 
-static int check_peer_redir(struct dst_entry *dst, struct inet_peer *peer)
-{
-	struct rtable *rt = (struct rtable *) dst;
-	__be32 orig_gw = rt->rt_gateway;
-	struct neighbour *n, *old_n;
-
-	dst_confirm(&rt->dst);
-
-	rt->rt_gateway = peer->redirect_learned.a4;
-
-	n = ipv4_neigh_lookup(&rt->dst, &rt->rt_gateway);
-	if (IS_ERR(n))
-		return PTR_ERR(n);
-	old_n = xchg(&rt->dst._neighbour, n);
-	if (old_n)
-		neigh_release(old_n);
-	if (!n || !(n->nud_state & NUD_VALID)) {
-		if (n)
-			neigh_event_send(n, NULL);
-		rt->rt_gateway = orig_gw;
-		return -EAGAIN;
-	} else {
-		rt->rt_flags |= RTCF_REDIRECTED;
-		call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, n);
-	}
-	return 0;
-}
 
 static struct dst_entry *ipv4_dst_check(struct dst_entry *dst, u32 cookie)
 {
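
The core idea of the patch, visiting every cached entry that matches instead of stopping after one lookup, can be shown with a plain linked list. This is a much-reduced user-space sketch: the route_sketch struct and redirect_all() are hypothetical and leave out RCU, hashing, neighbour handling and the other match conditions used in route.c.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, much-reduced stand-in for a cached route entry. */
struct route_sketch {
	uint32_t dst;
	uint32_t gateway;
	uint8_t tos;
	struct route_sketch *next;
};

/*
 * Walk the whole chain and redirect every entry for dst that still
 * points at old_gw, whatever its TOS. Stopping at the first match would
 * leave the other cached entries on the old gateway.
 * Returns the number of entries updated.
 */
static int redirect_all(struct route_sketch *head, uint32_t dst,
			uint32_t old_gw, uint32_t new_gw)
{
	struct route_sketch *rt;
	int updated = 0;

	for (rt = head; rt != NULL; rt = rt->next) {
		if (rt->dst != dst || rt->gateway != old_gw)
			continue;
		rt->gateway = new_gw;
		updated++;
	}
	return updated;
}
```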


* [PATCH 0/2] net: Add network priority cgroup (v3)
From: Neil Horman @ 2011-11-18 16:13 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <1321476666-8225-1-git-send-email-nhorman@tuxdriver.com>

Data Center Bridging environments are currently somewhat limited in their
ability to provide a general mechanism for controlling traffic priority.
Specifically they are unable to administratively control the priority at which
various types of network traffic are sent.

Currently, the only ways to set the priority of a network buffer are:

1) Through the use of the SO_PRIORITY socket option
2) By using low level hooks, like a tc action

(1) is difficult from an administrative perspective because it requires that the
application be coded not to assume the default priority is sufficient, and
to expose an administrative interface that allows priority adjustment.  Such
a solution is not scalable in a DCB environment.

(2) is also difficult, as it requires constant administrative oversight of
applications so as to build appropriate rules to match traffic belonging to
various classes, so that priority can be appropriately set. It is further
limiting when DCB enabled hardware is in use, due to the fact that tc rules are
only run after a root qdisc has been selected (DCB enabled hardware may reserve
hw queues for various traffic classes and needs the priority to be set prior to
selecting the root qdisc)

I've discussed various solutions with John Fastabend, and we saw a cgroup as
being a good general solution to this problem.  The network priority cgroup
allows for a per-interface priority map to be built per cgroup.  Any traffic
originating from an application in a cgroup, that does not explicitly set its
priority with SO_PRIORITY will have its priority assigned to the value
designated for that group on that interface.  This allows a user space daemon,
when conducting LLDP negotiation with a DCB enabled peer to create a cgroup
based on the APP_TLV value received and administratively assign applications to
that priority using the existing cgroup utility infrastructure.

Tested by John and myself, with good results

(v2)
Based on reviews from John F., Amerigo Wang and Neerav Parikh, I've cleaned up
the rcu locking, fixed a memory leak in an error path, and corrected some typos.

(v3)
Converted rcu_dereference to rtnl_dereference where appropriate as per request
from John F.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
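
The selection rule described above can be sketched in user-space C. The struct and function names are hypothetical. Two assumptions are made for this sketch: a priority of 0 is treated as "not explicitly set", and the cgroup index is bounds-checked against the map length.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-interface map: cgroup index -> priority. */
struct prio_map_sketch {
	uint32_t len;
	const uint32_t *prio;
};

/*
 * Return the priority a packet should carry. A non-zero current
 * priority (for example one set through SO_PRIORITY) is kept;
 * otherwise the cgroup's entry in the interface map is used.
 * The bounds check is an addition made for this sketch.
 */
static uint32_t effective_prio(uint32_t current, uint32_t cgroup_idx,
			       const struct prio_map_sketch *map)
{
	if (current != 0 || map == NULL || cgroup_idx >= map->len)
		return current;
	return map->prio[cgroup_idx];
}
```

Because the lookup happens per interface, the same cgroup can map to different priorities on different devices.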


* [PATCH 1/2] net: add network priority cgroup infrastructure (v3)
From: Neil Horman @ 2011-11-18 16:13 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <1321632821-11640-1-git-send-email-nhorman@tuxdriver.com>

This patch adds the infrastructure code to create the network priority
cgroup.  The cgroup, in addition to the standard processes file, creates two
control files:

1) prioidx - This is a read-only file that exports the index of this cgroup.
This is a value that is both arbitrary and unique to a cgroup in this subsystem,
and is used to index the per-device priority map.

2) priomap - This is a writeable file.  On read it reports a table of 2-tuples
<name:priority> where name is the name of a network interface and priority
indicates the priority assigned to frames egressing on the named interface and
originating from a pid in this cgroup.

This cgroup allows for skb priority to be set prior to a root qdisc getting
selected. This is beneficial for DCB enabled systems, in that it allows any
application to use DCB configured priorities without application modification.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 include/linux/cgroup_subsys.h |    8 +
 include/linux/netdevice.h     |    4 +
 include/net/netprio_cgroup.h  |   66 ++++++++
 include/net/sock.h            |    3 +
 net/Kconfig                   |    7 +
 net/core/Makefile             |    1 +
 net/core/dev.c                |   13 ++
 net/core/netprio_cgroup.c     |  347 +++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c               |   22 +++-
 net/socket.c                  |    2 +
 10 files changed, 472 insertions(+), 1 deletions(-)
 create mode 100644 include/net/netprio_cgroup.h
 create mode 100644 net/core/netprio_cgroup.c

diff --git a/include/linux/cgroup_subsys.h b/include/linux/cgroup_subsys.h
index ac663c1..0bd390c 100644
--- a/include/linux/cgroup_subsys.h
+++ b/include/linux/cgroup_subsys.h
@@ -59,8 +59,16 @@ SUBSYS(net_cls)
 SUBSYS(blkio)
 #endif
 
+/* */
+
 #ifdef CONFIG_CGROUP_PERF
 SUBSYS(perf)
 #endif
 
 /* */
+
+#ifdef CONFIG_NETPRIO_CGROUP
+SUBSYS(net_prio)
+#endif
+
+/* */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0db1f5f..750ea8e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -50,6 +50,7 @@
 #ifdef CONFIG_DCB
 #include <net/dcbnl.h>
 #endif
+#include <net/netprio_cgroup.h>
 
 struct vlan_group;
 struct netpoll_info;
@@ -1312,6 +1313,9 @@ struct net_device {
 	/* max exchange id for FCoE LRO by ddp */
 	unsigned int		fcoe_ddp_xid;
 #endif
+#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+	struct netprio_map __rcu *priomap;
+#endif
 	/* phy device may attach itself for hardware timestamping */
 	struct phy_device *phydev;
 
diff --git a/include/net/netprio_cgroup.h b/include/net/netprio_cgroup.h
new file mode 100644
index 0000000..6b65936
--- /dev/null
+++ b/include/net/netprio_cgroup.h
@@ -0,0 +1,66 @@
+/*
+ * netprio_cgroup.h			Control Group Priority set 
+ *
+ *
+ * Authors:	Neil Horman <nhorman@tuxdriver.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation; either version 2 of the License, or (at your option)
+ * any later version.
+ *
+ */
+
+#ifndef _NETPRIO_CGROUP_H
+#define _NETPRIO_CGROUP_H
+#include <linux/module.h>
+#include <linux/cgroup.h>
+#include <linux/hardirq.h>
+#include <linux/rcupdate.h>
+
+struct cgroup_netprio_state
+{
+	struct cgroup_subsys_state css;
+	u32 prioidx;
+};
+
+struct netprio_map {
+	struct rcu_head rcu;
+	u32 priomap_len;
+	u32 priomap[];
+};
+
+#ifdef CONFIG_CGROUPS
+
+#ifndef CONFIG_NETPRIO_CGROUP
+extern int net_prio_subsys_id;
+#endif
+
+extern void sock_update_netprioidx(struct sock *sk);
+extern void skb_update_prio(struct sk_buff *skb);
+
+static inline struct cgroup_netprio_state
+		*task_netprio_state(struct task_struct *p)
+{
+#if IS_ENABLED(CONFIG_NETPRIO_CGROUP)
+	return container_of(task_subsys_state(p, net_prio_subsys_id),
+			    struct cgroup_netprio_state, css);
+#else
+	return NULL;
+#endif
+}
+
+#else
+
+#define sock_update_netprioidx(sk)
+#define skb_update_prio(skb)
+
+static inline struct cgroup_netprio_state
+		*task_netprio_state(struct task_struct *p)
+{
+	return NULL;
+}
+
+#endif
+
+#endif  /* _NETPRIO_CGROUP_H */
diff --git a/include/net/sock.h b/include/net/sock.h
index 5ac682f..87b24aa 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -321,6 +321,9 @@ struct sock {
 	unsigned short		sk_ack_backlog;
 	unsigned short		sk_max_ack_backlog;
 	__u32			sk_priority;
+#ifdef CONFIG_CGROUPS
+	__u32			sk_cgrp_prioidx;
+#endif
 	struct pid		*sk_peer_pid;
 	const struct cred	*sk_peer_cred;
 	long			sk_rcvtimeo;
diff --git a/net/Kconfig b/net/Kconfig
index a073148..63d2c5d 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -232,6 +232,13 @@ config XPS
 	depends on SMP && SYSFS && USE_GENERIC_SMP_HELPERS
 	default y
 
+config NETPRIO_CGROUP
+	tristate "Network priority cgroup"
+	depends on CGROUPS
+	---help---
+	  Cgroup subsystem for use in assigning processes to network priorities on
+	  a per-interface basis
+
 config HAVE_BPF_JIT
 	bool
 
diff --git a/net/core/Makefile b/net/core/Makefile
index 0d357b1..3606d40 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -19,3 +19,4 @@ obj-$(CONFIG_FIB_RULES) += fib_rules.o
 obj-$(CONFIG_TRACEPOINTS) += net-traces.o
 obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
 obj-$(CONFIG_NETWORK_PHY_TIMESTAMPING) += timestamping.o
+obj-$(CONFIG_NETPRIO_CGROUP) += netprio_cgroup.o
diff --git a/net/core/dev.c b/net/core/dev.c
index b7ba81a..a1dca83 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2456,6 +2456,17 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 	return rc;
 }
 
+#ifdef CONFIG_CGROUPS
+void skb_update_prio(struct sk_buff *skb)
+{
+	struct netprio_map *map = rcu_dereference(skb->dev->priomap);
+
+	if ((!skb->priority) && (skb->sk) && map)
+		skb->priority = map->priomap[skb->sk->sk_cgrp_prioidx];
+}
+EXPORT_SYMBOL_GPL(skb_update_prio);
+#endif
+
 static DEFINE_PER_CPU(int, xmit_recursion);
 #define RECURSION_LIMIT 10
 
@@ -2496,6 +2507,8 @@ int dev_queue_xmit(struct sk_buff *skb)
 	 */
 	rcu_read_lock_bh();
 
+	skb_update_prio(skb);
+
 	txq = dev_pick_tx(dev, skb);
 	q = rcu_dereference_bh(txq->qdisc);
 
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
new file mode 100644
index 0000000..1e043db
--- /dev/null
+++ b/net/core/netprio_cgroup.c
@@ -0,0 +1,347 @@
+/*
+ * net/core/netprio_cgroup.c	Priority Control Group
+ *
+ *		This program is free software; you can redistribute it and/or
+ *		modify it under the terms of the GNU General Public License
+ *		as published by the Free Software Foundation; either version
+ *		2 of the License, or (at your option) any later version.
+ *
+ * Authors:	Neil Horman <nhorman@tuxdriver.com>
+ */
+
+#include <linux/module.h>
+#include <linux/slab.h>
+#include <linux/types.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <linux/cgroup.h>
+#include <linux/rcupdate.h>
+#include <linux/atomic.h>
+#include <net/rtnetlink.h>
+#include <net/pkt_cls.h>
+#include <net/sock.h>
+#include <net/netprio_cgroup.h>
+
+static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
+					       struct cgroup *cgrp);
+static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp);
+static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp);
+
+struct cgroup_subsys net_prio_subsys = {
+	.name		= "net_prio",
+	.create		= cgrp_create,
+	.destroy	= cgrp_destroy,
+	.populate	= cgrp_populate,
+#ifdef CONFIG_NETPRIO_CGROUP
+	.subsys_id	= net_prio_subsys_id,
+#endif
+	.module		= THIS_MODULE
+};
+
+#define PRIOIDX_SZ 128
+
+static unsigned long prioidx_map[PRIOIDX_SZ];
+static DEFINE_SPINLOCK(prioidx_map_lock);
+static atomic_t max_prioidx = ATOMIC_INIT(0);
+
+static inline struct cgroup_netprio_state *cgrp_netprio_state(struct cgroup *cgrp)
+{
+	return container_of(cgroup_subsys_state(cgrp, net_prio_subsys_id),
+			    struct cgroup_netprio_state, css);
+}
+
+static int get_prioidx(u32 *prio)
+{
+	unsigned long flags;
+	u32 prioidx;
+
+	spin_lock_irqsave(&prioidx_map_lock, flags);
+	prioidx = find_first_zero_bit(prioidx_map, sizeof(unsigned long) * PRIOIDX_SZ);
+	set_bit(prioidx, prioidx_map);
+	spin_unlock_irqrestore(&prioidx_map_lock, flags);
+	if (prioidx == sizeof(unsigned long) * PRIOIDX_SZ)
+		return -ENOSPC;
+
+	atomic_set(&max_prioidx, prioidx);
+	*prio = prioidx;
+	return 0;
+}
+
+static void put_prioidx(u32 idx)
+{
+	unsigned long flags;
+	spin_lock_irqsave(&prioidx_map_lock, flags);
+	clear_bit(idx, prioidx_map);
+	spin_unlock_irqrestore(&prioidx_map_lock, flags);
+}
+
+static void extend_netdev_table(struct net_device *dev, u32 new_len)
+{
+	size_t new_size = sizeof(struct netprio_map) +
+			   ((sizeof(u32) * new_len));
+	struct netprio_map *new_priomap = kzalloc(new_size, GFP_KERNEL);
+	struct netprio_map *old_priomap;
+	int i;
+
+	old_priomap  = rtnl_dereference(dev->priomap);
+
+	if (!new_priomap) {
+		printk(KERN_WARNING "Unable to alloc new priomap!\n");
+		return;
+	}
+
+	for (i = 0;
+	     old_priomap && (i < old_priomap->priomap_len);
+	     i++)
+		new_priomap->priomap[i] = old_priomap->priomap[i];
+
+	new_priomap->priomap_len = new_len;
+
+	rcu_assign_pointer(dev->priomap, new_priomap);
+	if (old_priomap)
+		kfree_rcu(old_priomap, rcu);
+}
+
+static void update_netdev_tables(void)
+{
+	struct net_device *dev;
+	u32 max_len = atomic_read(&max_prioidx);
+	struct netprio_map *map;
+
+	rtnl_lock();
+
+
+	for_each_netdev(&init_net, dev) {
+		map = rtnl_dereference(dev->priomap);
+		if ((!map) ||
+		    (map->priomap_len < max_len))
+			extend_netdev_table(dev, max_len);
+	}
+
+	rtnl_unlock();
+}
+
+static struct cgroup_subsys_state *cgrp_create(struct cgroup_subsys *ss,
+						 struct cgroup *cgrp)
+{
+	struct cgroup_netprio_state *cs;
+	int ret;
+
+	cs = kzalloc(sizeof(*cs), GFP_KERNEL);
+	if (!cs)
+		return ERR_PTR(-ENOMEM);
+
+	if (cgrp->parent && cgrp_netprio_state(cgrp->parent)->prioidx) {
+		kfree(cs);
+		return ERR_PTR(-EINVAL);
+	}
+
+	ret = get_prioidx(&cs->prioidx);
+	if (ret != 0) {
+		printk(KERN_WARNING "No space in priority index array\n");
+		kfree(cs);
+		return ERR_PTR(ret);
+	}
+
+	return &cs->css;
+}
+
+static void cgrp_destroy(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	struct cgroup_netprio_state *cs;
+	struct net_device *dev;
+	struct netprio_map *map;
+
+	cs = cgrp_netprio_state(cgrp);
+	rtnl_lock();
+	for_each_netdev(&init_net, dev) {
+		map = rtnl_dereference(dev->priomap);
+		if (map)
+			map->priomap[cs->prioidx] = 0;
+	}
+	rtnl_unlock();
+	put_prioidx(cs->prioidx);
+	kfree(cs);
+}
+
+static u64 read_prioidx(struct cgroup *cgrp, struct cftype *cft)
+{
+	return (u64)cgrp_netprio_state(cgrp)->prioidx;
+}
+
+static int read_priomap(struct cgroup *cont, struct cftype *cft,
+			struct cgroup_map_cb *cb)
+{
+	struct net_device *dev;
+	u32 prioidx = cgrp_netprio_state(cont)->prioidx;
+	u32 priority;
+	struct netprio_map *map;
+
+	rcu_read_lock();
+
+	for_each_netdev_rcu(&init_net, dev) {
+		map = rcu_dereference(dev->priomap);
+		priority = map ? map->priomap[prioidx] : 0;
+		cb->fill(cb, dev->name, priority);
+	}
+	rcu_read_unlock();
+	return 0;
+}
+
+static int write_priomap(struct cgroup *cgrp, struct cftype *cft,
+			 const char *buffer)
+{
+	char *devname = kstrdup(buffer, GFP_KERNEL);
+	int ret = -EINVAL;
+	u32 prioidx = cgrp_netprio_state(cgrp)->prioidx;
+	unsigned long priority;
+	char *priostr;
+	struct net_device *dev;
+	struct netprio_map *map;
+
+	if (!devname)
+		return -ENOMEM;
+
+	/*
+	 * Minimally sized valid priomap string
+	 */
+	if (strlen(devname) < 3)
+		goto out_free_devname;
+
+	priostr = strstr(devname, " ");
+	if (!priostr)
+		goto out_free_devname;
+
+	/*
+	 *Separate the devname from the associated priority
+	 *and advance the priostr pointer to the priority value
+	 */
+	*priostr = '\0';
+	priostr++;
+
+	/*
+	 * If priostr points to the terminating NUL, we're at the end of the
+	 * passed-in string, and it's not a valid write
+	 */
+	if (*priostr == '\0')
+		goto out_free_devname;
+
+	ret = kstrtoul(priostr, 10, &priority);
+	if (ret < 0)
+		goto out_free_devname;
+
+	ret = -ENODEV;
+
+	dev = dev_get_by_name(&init_net, devname);
+	if (!dev)
+		goto out_free_devname;
+
+	update_netdev_tables();
+	ret = 0;
+	rcu_read_lock();
+	map = rcu_dereference(dev->priomap);
+	if (map)
+		map->priomap[prioidx] = priority;
+	rcu_read_unlock();
+	dev_put(dev);
+
+out_free_devname:
+	kfree(devname);
+	return ret;
+}
+
+static struct cftype ss_files[] = {
+	{
+		.name = "prioidx",
+		.read_u64 = read_prioidx,
+	},
+	{
+		.name = "ifpriomap",
+		.read_map = read_priomap,
+		.write_string = write_priomap,
+	},
+};
+
+static int cgrp_populate(struct cgroup_subsys *ss, struct cgroup *cgrp)
+{
+	return cgroup_add_files(cgrp, ss, ss_files, ARRAY_SIZE(ss_files));
+}
+
+static int netprio_device_event(struct notifier_block *unused,
+				unsigned long event, void *ptr)
+{
+	struct net_device *dev = ptr;
+	struct netprio_map *old;
+	u32 max_len = atomic_read(&max_prioidx);
+
+	/*
+	 * Note this is called with rtnl_lock held so we have update side
+	 * protection on our rcu assignments
+	 */
+
+	switch (event) {
+
+	case NETDEV_REGISTER:
+		if (max_len)
+			extend_netdev_table(dev, max_len);
+		break;
+	case NETDEV_UNREGISTER:
+		old = rtnl_dereference(dev->priomap);
+		rcu_assign_pointer(dev->priomap, NULL);
+		if (old)
+			kfree_rcu(old, rcu);
+		break;
+	}
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block netprio_device_notifier = {
+	.notifier_call = netprio_device_event
+};
+
+static int __init init_cgroup_netprio(void)
+{
+	int ret;
+
+	ret = cgroup_load_subsys(&net_prio_subsys);
+	if (ret)
+		goto out;
+#ifndef CONFIG_NETPRIO_CGROUP
+	smp_wmb();
+	net_prio_subsys_id = net_prio_subsys.subsys_id;
+#endif
+
+	register_netdevice_notifier(&netprio_device_notifier);
+
+out:
+	return ret;
+}
+
+static void __exit exit_cgroup_netprio(void)
+{
+	struct netprio_map *old;
+	struct net_device *dev;
+
+	unregister_netdevice_notifier(&netprio_device_notifier);
+
+	cgroup_unload_subsys(&net_prio_subsys);
+
+#ifndef CONFIG_NETPRIO_CGROUP
+	net_prio_subsys_id = -1;
+	synchronize_rcu();
+#endif
+
+	rtnl_lock();
+	for_each_netdev(&init_net, dev) {
+		old = rtnl_dereference(dev->priomap);
+		rcu_assign_pointer(dev->priomap, NULL);
+		if (old)
+			kfree_rcu(old, rcu);
+	}
+	rtnl_unlock();
+}
+
+module_init(init_cgroup_netprio);
+module_exit(exit_cgroup_netprio);
+MODULE_LICENSE("GPL v2");
diff --git a/net/core/sock.c b/net/core/sock.c
index 5a08762..77a4888 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -125,6 +125,7 @@
 #include <net/xfrm.h>
 #include <linux/ipsec.h>
 #include <net/cls_cgroup.h>
+#include <net/netprio_cgroup.h>
 
 #include <linux/filter.h>
 
@@ -221,10 +222,16 @@ __u32 sysctl_rmem_default __read_mostly = SK_RMEM_MAX;
 int sysctl_optmem_max __read_mostly = sizeof(unsigned long)*(2*UIO_MAXIOV+512);
 EXPORT_SYMBOL(sysctl_optmem_max);
 
-#if defined(CONFIG_CGROUPS) && !defined(CONFIG_NET_CLS_CGROUP)
+#if defined(CONFIG_CGROUPS)
+#if !defined(CONFIG_NET_CLS_CGROUP)
 int net_cls_subsys_id = -1;
 EXPORT_SYMBOL_GPL(net_cls_subsys_id);
 #endif
+#if !defined(CONFIG_NETPRIO_CGROUP)
+int net_prio_subsys_id = -1;
+EXPORT_SYMBOL_GPL(net_prio_subsys_id);
+#endif
+#endif
 
 static int sock_set_timeout(long *timeo_p, char __user *optval, int optlen)
 {
@@ -1111,6 +1118,18 @@ void sock_update_classid(struct sock *sk)
 		sk->sk_classid = classid;
 }
 EXPORT_SYMBOL(sock_update_classid);
+
+void sock_update_netprioidx(struct sock *sk)
+{
+	struct cgroup_netprio_state *state;
+	if (in_interrupt())
+		return;
+	rcu_read_lock();
+	state = task_netprio_state(current);
+	sk->sk_cgrp_prioidx = state ? state->prioidx : 0;
+	rcu_read_unlock();
+}
+EXPORT_SYMBOL_GPL(sock_update_netprioidx);
 #endif
 
 /**
@@ -1138,6 +1157,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 		atomic_set(&sk->sk_wmem_alloc, 1);
 
 		sock_update_classid(sk);
+		sock_update_netprioidx(sk);
 	}
 
 	return sk;
diff --git a/net/socket.c b/net/socket.c
index 2877647..108716f 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -549,6 +549,8 @@ static inline int __sock_sendmsg_nosec(struct kiocb *iocb, struct socket *sock,
 
 	sock_update_classid(sock->sk);
 
+	sock_update_netprioidx(sock->sk);
+
 	si->sock = sock;
 	si->scm = NULL;
 	si->msg = msg;
-- 
1.7.6.4

^ permalink raw reply related
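
For readers following the index bookkeeping in the netprio_cgroup.c patch
above, here is a minimal userspace sketch of a bitmap allocator in the spirit
of get_prioidx()/put_prioidx(). It is not the patch's code: the names
prioidx_get, prioidx_put, PRIOIDX_WORDS and PRIOIDX_BITS are made up for
illustration, there is no locking, and a plain loop stands in for the kernel's
find_first_zero_bit(). The sketch sizes the search in bits and checks for a
full map before marking anything; both are assumptions about the intended
behaviour rather than something the posted patch states.

```c
#include <limits.h>
#include <stddef.h>

/* Hypothetical sizes, loosely modelled on PRIOIDX_SZ in the patch. */
#define PRIOIDX_WORDS 128
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define PRIOIDX_BITS  (PRIOIDX_WORDS * BITS_PER_WORD)

static unsigned long prioidx_map[PRIOIDX_WORDS];

/* Return the lowest free index and mark it used, or -1 if the map is full. */
static long prioidx_get(void)
{
	size_t idx;

	for (idx = 0; idx < PRIOIDX_BITS; idx++) {
		unsigned long mask = 1UL << (idx % BITS_PER_WORD);
		unsigned long *word = &prioidx_map[idx / BITS_PER_WORD];

		if (!(*word & mask)) {
			/* Only mark the bit once we know it is in range and free. */
			*word |= mask;
			return (long)idx;
		}
	}
	return -1;
}

/* Release an index so a later prioidx_get() can hand it out again. */
static void prioidx_put(size_t idx)
{
	if (idx < PRIOIDX_BITS)
		prioidx_map[idx / BITS_PER_WORD] &= ~(1UL << (idx % BITS_PER_WORD));
}
```

In this sketch a released index is reused by the next allocation, so a caller
that sizes per-device tables from the highest index handed out would need to
track that maximum separately.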

* [PATCH 2/2] net: add documentation for net_prio cgroups (v3)
From: Neil Horman @ 2011-11-18 16:13 UTC (permalink / raw)
  To: netdev; +Cc: Neil Horman, John Fastabend, Robert Love, David S. Miller
In-Reply-To: <1321632821-11640-1-git-send-email-nhorman@tuxdriver.com>

Add the requisite documentation to explain to new users how net_prio cgroups work.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
CC: Robert Love <robert.w.love@intel.com>
CC: "David S. Miller" <davem@davemloft.net>
---
 Documentation/cgroups/net_prio.txt |   53 ++++++++++++++++++++++++++++++++++++
 1 files changed, 53 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/cgroups/net_prio.txt

diff --git a/Documentation/cgroups/net_prio.txt b/Documentation/cgroups/net_prio.txt
new file mode 100644
index 0000000..01b3226
--- /dev/null
+++ b/Documentation/cgroups/net_prio.txt
@@ -0,0 +1,53 @@
+Network priority cgroup
+-------------------------
+
+The Network priority cgroup provides an interface to allow an administrator to
+dynamically set the priority of network traffic generated by various
+applications.
+
+Nominally, an application would set the priority of its traffic via the
+SO_PRIORITY socket option.  This, however, is not always possible because:
+
+1) The application may not have been coded to set this value
+2) The priority of application traffic is often a site-specific administrative
+   decision rather than an application defined one.
+
+This cgroup allows an administrator to assign a process to a group which defines
+the priority of egress traffic on a given interface. Network priority groups can
+be created by first mounting the cgroup filesystem.
+
+# mount -t cgroup -onet_prio none /sys/fs/cgroup/net_prio
+
+With the above step, the initial group acting as the parent accounting group
+becomes visible at '/sys/fs/cgroup/net_prio'.  This group includes all tasks in
+the system. '/sys/fs/cgroup/net_prio/tasks' lists the tasks in this cgroup.
+
+Each net_prio cgroup contains two subsystem-specific files:
+
+net_prio.prioidx
+This file is read-only, and is simply informative.  It contains a unique integer
+value that the kernel uses as an internal representation of this cgroup.
+
+net_prio.ifpriomap
+This file contains a map of the priorities assigned to traffic originating from
+processes in this group and egressing the system on various interfaces. It
+contains a list of tuples in the form <ifname priority>.  Contents of this file
+can be modified by echoing a string into the file using the same tuple format.
+For example:
+
+echo "eth0 5" > /sys/fs/cgroup/net_prio/iscsi/net_prio.ifpriomap
+
+This command would force any traffic originating from processes belonging to the
+iscsi net_prio cgroup and egressing on interface eth0 to have the priority of
+said traffic set to the value 5. The parent accounting group also has a
+writeable 'net_prio.ifpriomap' file that can be used to set a system default
+priority.
+
+Priorities are set immediately prior to queueing a frame to the device
+queueing discipline (qdisc) so priorities will be assigned prior to the hardware
+queue selection being made.
+
+One use of the net_prio cgroup is with the mqprio qdisc, allowing application
+traffic to be steered to hardware/driver based traffic classes. These mappings
+can then be managed by administrators or other networking protocols such as
+DCBX.
-- 
1.7.6.4

^ permalink raw reply related

* Re: [PATCH net-next] W5300: Add WIZnet W5300 Ethernet driver
From: Taehun Kim @ 2011-11-18 16:15 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: David S. Miller, netdev, linux-kernel, suhwan, bongbong
In-Reply-To: <1320779368.2799.55.camel@bwh-desktop>

Thank you for your detailed feedback.

Does anybody have any more comments?

^ permalink raw reply

* Re: [RFC] [ver3 PATCH 3/6] virtio_net: virtio_net driver changes
From: Sasha Levin @ 2011-11-18 16:18 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Krishna Kumar, rusty, mst, netdev, kvm, davem, virtualization
In-Reply-To: <1321630839.2885.117.camel@deadeye>

On Fri, 2011-11-18 at 15:40 +0000, Ben Hutchings wrote:
> On Fri, 2011-11-18 at 08:24 +0200, Sasha Levin wrote:
> > On Fri, 2011-11-18 at 01:08 +0000, Ben Hutchings wrote:
> > > On Fri, 2011-11-11 at 18:34 +0530, Krishna Kumar wrote:
> > > > Changes for multiqueue virtio_net driver.
> > > [...]
> > > > @@ -677,25 +730,35 @@ static struct rtnl_link_stats64 *virtnet
> > > >  {
> > > >  	struct virtnet_info *vi = netdev_priv(dev);
> > > >  	int cpu;
> > > > -	unsigned int start;
> > > >  
> > > >  	for_each_possible_cpu(cpu) {
> > > > -		struct virtnet_stats __percpu *stats
> > > > -			= per_cpu_ptr(vi->stats, cpu);
> > > > -		u64 tpackets, tbytes, rpackets, rbytes;
> > > > -
> > > > -		do {
> > > > -			start = u64_stats_fetch_begin(&stats->syncp);
> > > > -			tpackets = stats->tx_packets;
> > > > -			tbytes   = stats->tx_bytes;
> > > > -			rpackets = stats->rx_packets;
> > > > -			rbytes   = stats->rx_bytes;
> > > > -		} while (u64_stats_fetch_retry(&stats->syncp, start));
> > > > -
> > > > -		tot->rx_packets += rpackets;
> > > > -		tot->tx_packets += tpackets;
> > > > -		tot->rx_bytes   += rbytes;
> > > > -		tot->tx_bytes   += tbytes;
> > > > +		int qpair;
> > > > +
> > > > +		for (qpair = 0; qpair < vi->num_queue_pairs; qpair++) {
> > > > +			struct virtnet_send_stats __percpu *tx_stat;
> > > > +			struct virtnet_recv_stats __percpu *rx_stat;
> > > 
> > > While you're at it, you can drop the per-CPU stats and make them only
> > > per-queue.  There is unlikely to be any benefit in maintaining them
> > > per-CPU while receive and transmit processing is serialised per-queue.
> > 
> > It allows you to update stats without a lock.
> 
> But you'll already be holding a lock related to the queue.

Right, but the queue lock is held only while manipulating the queue; we
don't hold it while processing the data, which is when we usually need
to update stats.

> > What's the benefit of having them per queue?
> 
> It should save some memory (and a little time when summing stats, though
> that's unlikely to matter much).
> 
> The important thing is that splitting up stats per-CPU *and* per-queue
> is a waste.
> 
> Ben.
> 


-- 

Sasha.


^ permalink raw reply

* Re: query : unregister/register netdev
From: Stephen Hemminger @ 2011-11-18 16:22 UTC (permalink / raw)
  To: Madhvapathi Sriram; +Cc: netdev
In-Reply-To: <CAAvRe=jhfXEnLFkwbPmWEroDE-qHUpZ54TSSdw=WpmfPGZpsPQ@mail.gmail.com>

On Fri, 18 Nov 2011 17:53:37 +0530
Madhvapathi Sriram <sriram.madhvapathi@gmail.com> wrote:

> In register_netdevice(), BUG_ON(dev->reg_state != NETREG_UNINITIALIZED) is
> used to check if the device that is being registered is indeed a new one.
> 
> However, I see that this state is never moved to. It only happens when a
> netdevice is allocated (by default to 0 using kzalloc).
> 
> So, the cycle register-->unregister-->register would fail since in the
> unregister_netdevice the state is only moved to NETREG_UNREGISTERED (at max
> to NETREG_RELEASED using free_netdev)
> 
> So, I presume that to reinitialize a netdevice one has to free the
> netdevice, re allocate netdevice and only then re register.
> 
> Wondering why unregister and reregister is not allowed, rather than having
> to go through the free/alloc cycle - I am not an expert though.

There are two reasons. First, because of RCU, the device object might still
be in use after unregister (until after the RCU quiescent period). But the
real reason is that no existing device driver has needed it. The normal use
case for unregister is when a device is being removed or the session for a
virtual device is going away.

^ permalink raw reply
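
The lifecycle discussed in this exchange can be modelled in a few lines of
userspace C. This is only a toy under stated assumptions: the toy_* names are
hypothetical, the states mirror the NETREG_* values mentioned in the question,
and where the real register_netdevice() would hit a BUG_ON this sketch simply
returns -1.

```c
#include <stdlib.h>

/* Toy mirror of the registration states named in the thread. */
enum toy_reg_state {
	TOY_UNINITIALIZED = 0,	/* what a zeroed allocation starts as */
	TOY_REGISTERED,
	TOY_UNREGISTERED,
};

struct toy_netdev {
	enum toy_reg_state reg_state;
};

/* Zeroed allocation, so reg_state starts as TOY_UNINITIALIZED. */
static struct toy_netdev *toy_alloc(void)
{
	return calloc(1, sizeof(struct toy_netdev));
}

/* Only a freshly allocated device may be registered. */
static int toy_register(struct toy_netdev *dev)
{
	if (dev->reg_state != TOY_UNINITIALIZED)
		return -1;
	dev->reg_state = TOY_REGISTERED;
	return 0;
}

/* Unregistering never returns the device to TOY_UNINITIALIZED. */
static void toy_unregister(struct toy_netdev *dev)
{
	dev->reg_state = TOY_UNREGISTERED;
}

static void toy_free(struct toy_netdev *dev)
{
	free(dev);
}
```

In this model a second toy_register() on the same object fails after
toy_unregister(); only the free/alloc cycle yields an object that can be
registered again, which matches the behaviour the question describes.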

* Re: Occasional oops with IPSec and IPv6.
From: Nick Bowler @ 2011-11-18 16:27 UTC (permalink / raw)
  To: netdev; +Cc: David S. Miller, Timo Teras
In-Reply-To: <20111117190925.GA23214@elliptictech.com>

On 2011-11-17 14:09 -0500, Nick Bowler wrote:
> One of the tests we do with IPsec involves sending and receiving UDP
> datagrams of all sizes from 1 to N bytes, where N is much larger than
> the MTU.  In this particular instance, the MTU is 1500 bytes and N is
> 10000 bytes.  This test works fine with IPv4, but I'm getting an
> occasional oops on Linus' master with IPv6 (output at end of email).  We
> also run the same test where N is less than the MTU, and it does not
> trigger this issue.  The resulting fallout seems to eventually lock up
> the box (although it continues to work for a little while afterwards).
> 
> The issue appears timing related, and it doesn't always occur.  This
> probably also explains why I've not seen this issue before now, as we
> recently upgraded all our lab systems to machines from this century
> (with newfangled dual core processors).  This also makes it somewhat
> hard to reproduce, but I can trigger it pretty reliably by running 'yes'
> in an ssh session (which doesn't use IPsec) while running the test:
> it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
> appears to be irrelevant.
> 
> I built a relatively old kernel (2.6.34) and could not reproduce the
> issue there, so I ran a git bisect.  It pointed to the following, which
> (unsurprisingly) no longer reverts cleanly.
> 
> Let me know if you need any more info.  I'll see if I can reproduce the
> issue with a smaller test case...

OK, here's a somewhat straightforward way to reproduce it that I've
found.  It uses a short test program called "udp_burst" which simply
transmits a bunch of UDP datagrams at all sizes between 1 and 10000,
included at the end of this mail.

 * Build the test program

    % gcc -o udp_burst udp_burst.c

 * Setup transport mode IPv6 SAs between two hosts so that they can
   communicate using IPsec.  Choose your favourite cipher suite.
   In this example, my two hosts are "fec0::3/64" and "fec0::2/64": I
   will be crashing the former.

   It can be reproduced with just one host transmitting to the bit
   bucket, but it seems to go much faster with two.

 * Create some constant non-IPsec network traffic on the machine to be
   crashed (for example, log in via SSH and run "yes").
 
 * On the machine to be crashed, run

    % while :; do ./udp_burst remote; done

   where remote is the other host (fec0::2 in my case).
 
 * Wait a few seconds and watch the fireworks.

% cat >udp_burst.c <<'EOF'
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>

#define MAX_DGRAM_SIZE 10000

static char buf[MAX_DGRAM_SIZE];

int main(int argc, char **argv)
{
	char *addr = NULL, *port = "9000";
	struct addrinfo *info, hints = {
		.ai_family   = AF_UNSPEC,
		.ai_socktype = SOCK_DGRAM,
		.ai_flags    = AI_PASSIVE,
	};
	int i, rc, sock;

	if (argc > 1)
		addr = argv[1];
	if (argc > 2)
		port = argv[2];
	if (!addr) {
		fprintf(stderr, "usage: %s addr [port]\n", argv[0]);
		return EXIT_FAILURE;
	}

	rc = getaddrinfo(addr, port, &hints, &info);
	if (rc != 0) {
		fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
		return EXIT_FAILURE;
	}

	sock = socket(info->ai_family, info->ai_socktype, info->ai_protocol);
	if (sock == -1) {
		perror("socket");
		return EXIT_FAILURE;
	}

	if (connect(sock, info->ai_addr, info->ai_addrlen) == -1) {
		perror("connect");
		return EXIT_FAILURE;
	}

	for (i = 0; i < MAX_DGRAM_SIZE; i++) {
		if (send(sock, buf, i+1, MSG_DONTWAIT) == -1) {
			if (errno != EAGAIN && errno != ECONNREFUSED) {
				perror("send");
			}
		}
	}

	return 0;
}
EOF

Cheers,
-- 
Nick Bowler, Elliptic Technologies (http://www.elliptictech.com/)

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-18 16:30 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321632128.3277.29.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Fri, 18 Nov 2011 17:02:08 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> 
> 
> David, unless I missed something, we should revert commit f39925dbde77
> (ipv4: Cache learned redirect information in inetpeer.)
> 
> With following patch, redirects now work for me.
> 
> Thanks !
> 
> 
> 
> [PATCH net-next] ipv4: fix redirect handling
> 
> commit f39925dbde77 (ipv4: Cache learned redirect information in
> inetpeer.) introduced a regression in ICMP redirect handling.
> 
> It assumed ipv4_dst_check() would be called because all possible
> routes were attached to the inetpeer we modify in ip_rt_redirect(),
> but that's not true.
> 
> commit 7cc9150ebe (route: fix ICMP redirect validation) tried to fix
> this, but the solution was not complete. (It fixed only one route.)
> 
> So we must lookup existing routes (including different TOS values) and
> call check_peer_redir() on them.
> 
> Reported-by: Ivan Zahariev <famzah@icdsoft.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Flavio Leitner <fbl@redhat.com>
> ---
>  net/ipv4/route.c |  110 ++++++++++++++++++++++++---------------------
>  1 file changed, 59 insertions(+), 51 deletions(-)
> 
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index 511f4a7..0c74da8 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1304,16 +1304,42 @@ static void rt_del(unsigned hash, struct rtable *rt)
>  	spin_unlock_bh(rt_hash_lock_addr(hash));
>  }
>  
> +static int check_peer_redir(struct dst_entry *dst, struct inet_peer *peer)
> +{
> +	struct rtable *rt = (struct rtable *) dst;
> +	__be32 orig_gw = rt->rt_gateway;
> +	struct neighbour *n, *old_n;
> +
> +	dst_confirm(&rt->dst);
> +
> +	rt->rt_gateway = peer->redirect_learned.a4;
> +
> +	n = ipv4_neigh_lookup(&rt->dst, &rt->rt_gateway);
> +	if (IS_ERR(n))
> +		return PTR_ERR(n);
> +	old_n = xchg(&rt->dst._neighbour, n);
> +	if (old_n)
> +		neigh_release(old_n);
> +	if (!n || !(n->nud_state & NUD_VALID)) {
> +		if (n)
> +			neigh_event_send(n, NULL);
> +		rt->rt_gateway = orig_gw;
> +		return -EAGAIN;
> +	} else {
> +		rt->rt_flags |= RTCF_REDIRECTED;
> +		call_netevent_notifiers(NETEVENT_NEIGH_UPDATE, n);
> +	}
> +	return 0;
> +}
> +
>  /* called in rcu_read_lock() section */
>  void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
>  		    __be32 saddr, struct net_device *dev)
>  {
>  	int s, i;
>  	struct in_device *in_dev = __in_dev_get_rcu(dev);
> -	struct rtable *rt;
>  	__be32 skeys[2] = { saddr, 0 };
>  	int    ikeys[2] = { dev->ifindex, 0 };
> -	struct flowi4 fl4;
>  	struct inet_peer *peer;
>  	struct net *net;
>  
> @@ -1336,33 +1362,42 @@ void ip_rt_redirect(__be32 old_gw, __be32 daddr, __be32 new_gw,
>  			goto reject_redirect;
>  	}
>  
> -	memset(&fl4, 0, sizeof(fl4));
> -	fl4.daddr = daddr;
>  	for (s = 0; s < 2; s++) {
>  		for (i = 0; i < 2; i++) {
> -			fl4.flowi4_oif = ikeys[i];
> -			fl4.saddr = skeys[s];
> -			rt = __ip_route_output_key(net, &fl4);
> -			if (IS_ERR(rt))
> -				continue;
> -
> -			if (rt->dst.error || rt->dst.dev != dev ||
> -			    rt->rt_gateway != old_gw) {
> -				ip_rt_put(rt);
> -				continue;
> -			}
> +			unsigned int hash;
> +			struct rtable __rcu **rthp;
> +			struct rtable *rt;
> +
> +			hash = rt_hash(daddr, skeys[s], ikeys[i], rt_genid(net));
> +
> +			rthp = &rt_hash_table[hash].chain;
> +
> +			while ((rt = rcu_dereference(*rthp)) != NULL) {
> +				rthp = &rt->dst.rt_next;
> +
> +				if (rt->rt_key_dst != daddr ||
> +				    rt->rt_key_src != skeys[s] ||
> +				    rt->rt_oif != ikeys[i] ||
> +				    rt_is_input_route(rt) ||
> +				    rt_is_expired(rt) ||
> +				    !net_eq(dev_net(rt->dst.dev), net) ||
> +				    rt->dst.error ||
> +				    rt->dst.dev != dev ||
> +				    rt->rt_gateway != old_gw)
> +					continue;
>  

I know we are reverting to get it fixed, but this adds the routing
cache back, so what is the plan? Revert to get it working and then
think about a new approach to remove the route cache again later?

I had one previous patch using the routing cache posted to the list,
but it won't fix the route flush problem.

thanks,
fbl

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Eric Dumazet @ 2011-11-18 16:34 UTC (permalink / raw)
  To: Flavio Leitner; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <20111118143016.01e24b37@asterix.rh>

Le vendredi 18 novembre 2011 à 14:30 -0200, Flavio Leitner a écrit :

> I know we are reverting to get it fixed, but this adds the routing
> cache back, so what is the plan? Revert to get it working and then
> think about a new approach to remove the route cache again later?
> 
> I had one previous patch using the routing cache posted to the list,
> but it won't fix the route flush problem.
> 

I don't "add the routing cache back".

Note I only fix existing route entries in the cache ;)

A "revert" is probably safe, since we should push a fix for 3.0/3.1/3.2
kernels...

^ permalink raw reply

* Re: Occasional oops with IPSec and IPv6.
From: Eric Dumazet @ 2011-11-18 16:39 UTC (permalink / raw)
  To: Nick Bowler; +Cc: netdev, David S. Miller, Timo Teras
In-Reply-To: <20111118162709.GA8342@elliptictech.com>

Le vendredi 18 novembre 2011 à 11:27 -0500, Nick Bowler a écrit :
> On 2011-11-17 14:09 -0500, Nick Bowler wrote:
> > One of the tests we do with IPsec involves sending and receiving UDP
> > datagrams of all sizes from 1 to N bytes, where N is much larger than
> > the MTU.  In this particular instance, the MTU is 1500 bytes and N is
> > 10000 bytes.  This test works fine with IPv4, but I'm getting an
> > occasional oops on Linus' master with IPv6 (output at end of email).  We
> > also run the same test where N is less than the MTU, and it does not
> > trigger this issue.  The resulting fallout seems to eventually lock up
> > the box (although it continues to work for a little while afterwards).
> > 
> > The issue appears timing related, and it doesn't always occur.  This
> > probably also explains why I've not seen this issue before now, as we
> > recently upgraded all our lab systems to machines from this century
> > (with newfangled dual core processors).  This also makes it somewhat
> > hard to reproduce, but I can trigger it pretty reliably by running 'yes'
> > in an ssh session (which doesn't use IPsec) while running the test:
> > it'll usually trigger in 2 or 3 runs.  The choice of cipher suite
> > appears to be irrelevant.
> > 
> > I built a relatively old kernel (2.6.34) and could not reproduce the
> > issue there, so I ran a git bisect.  It pointed to the following, which
> > (unsurprisingly) no longer reverts cleanly.
> > 
> > Let me know if you need any more info.  I'll see if I can reproduce the
> > issue with a smaller test case...
> 
> OK, here's a somewhat straightforward way to reproduce it that I've
> found.  It uses a short test program called "udp_burst" which simply
> transmits a bunch of UDP datagrams at all sizes between 1 and 10000,
> included at the end of this mail.
> 
>  * Build the test program
> 
>     % gcc -o udp_burst udp_burst.c
> 
>  * Setup transport mode IPv6 SAs between two hosts so that they can
>    communicate using IPsec.  Choose your favourite cipher suite.
>    In this example, my two hosts are "fec0::3/64" and "fec0::2/64": I
>    will be crashing the former.
> 
>    It can be reproduced with just one host transmitting to the bit
>    bucket, but it seems to go much faster with two.
> 
>  * Create some constant non-IPsec network traffic on the machine to be
>    crashed (for example, log in via SSH and run "yes").
>  
>  * On the machine to be crashed, run
> 
>     % while :; do ./udp_burst remote; done
> 
>    where remote is the other host (fec0::2 in my case).
>  
>  * Wait a few seconds and watch the fireworks.
> 
> % cat >udp_burst.c <<'EOF'
> #include <stdio.h>
> #include <stdlib.h>
> #include <unistd.h>
> #include <errno.h>
> #include <sys/types.h>
> #include <sys/socket.h>
> #include <netdb.h>
> 
> #define MAX_DGRAM_SIZE 10000
> 
> static char buf[MAX_DGRAM_SIZE];
> 
> int main(int argc, char **argv)
> {
> 	char *addr = NULL, *port = "9000";
> 	struct addrinfo *info, hints = {
> 		.ai_family   = AF_UNSPEC,
> 		.ai_socktype = SOCK_DGRAM,
> 		.ai_flags    = AI_PASSIVE,
> 	};
> 	int i, rc, sock;
> 
> 	if (argc > 1)
> 		addr = argv[1];
> 	if (argc > 2)
> 		port = argv[2];
> 	if (!addr) {
> 		fprintf(stderr, "usage: %s addr [port]\n", argv[0]);
> 		return EXIT_FAILURE;
> 	}
> 
> 	rc = getaddrinfo(addr, port, &hints, &info);
> 	if (rc != 0) {
> 		fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
> 		return EXIT_FAILURE;
> 	}
> 
> 	sock = socket(info->ai_family, info->ai_socktype, info->ai_protocol);
> 	if (sock == -1) {
> 		perror("socket");
> 		return EXIT_FAILURE;
> 	}
> 
> 	if (connect(sock, info->ai_addr, info->ai_addrlen) == -1) {
> 		perror("connect");
> 		return EXIT_FAILURE;
> 	}
> 
> 	for (i = 0; i < MAX_DGRAM_SIZE; i++) {
> 		if (send(sock, buf, i+1, MSG_DONTWAIT) == -1) {
> 			if (errno != EAGAIN && errno != ECONNREFUSED) {
> 				perror("send");
> 			}
> 		}
> 	}
> 
> 	return 0;
> }
> EOF
> 

Please note commit 80c802f307 added a known bug, fixed in commit
0b150932197b (xfrm: avoid possible oopse in xfrm_alloc_dst)

Given commit 80c802f307 complexity, we can assume other bugs are to be
fixed as well.

Unfortunately, Timo seems unresponsive.

^ permalink raw reply

* [PATCH net-next] tg3: switch to build_skb() infrastructure
From: Eric Dumazet @ 2011-11-18 16:47 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Eilon Greenstein, Michael Chan, Matt Carlson

This is very similar to the bnx2x conversion, but simpler since no special
alignment is required, so the goal was not to reduce skb truesize.

Using build_skb() reduces cache line misses in the driver, since we
use cache-hot skbs instead of cold ones. The number of in-flight sk_buff
structures is lower, so they are more likely to be recycled in SLUB
caches while still hot.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Matt Carlson <mcarlson@broadcom.com>
CC: Michael Chan <mchan@broadcom.com>
CC: Eilon Greenstein <eilong@broadcom.com>
---
 drivers/net/ethernet/broadcom/tg3.c |  115 +++++++++++++-------------
 drivers/net/ethernet/broadcom/tg3.h |    6 +
 2 files changed, 66 insertions(+), 55 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 024ca1d..5060766 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -194,7 +194,7 @@ static inline void _tg3_flag_clear(enum TG3_FLAGS flag, unsigned long *bits)
 #if (NET_IP_ALIGN != 0)
 #define TG3_RX_OFFSET(tp)	((tp)->rx_offset)
 #else
-#define TG3_RX_OFFSET(tp)	0
+#define TG3_RX_OFFSET(tp)	(NET_SKB_PAD)
 #endif
 
 /* minimum number of free TX descriptors required to wake up TX process */
@@ -5370,15 +5370,15 @@ static void tg3_tx(struct tg3_napi *tnapi)
 	}
 }
 
-static void tg3_rx_skb_free(struct tg3 *tp, struct ring_info *ri, u32 map_sz)
+static void tg3_rx_data_free(struct tg3 *tp, struct ring_info *ri, u32 map_sz)
 {
-	if (!ri->skb)
+	if (!ri->data)
 		return;
 
 	pci_unmap_single(tp->pdev, dma_unmap_addr(ri, mapping),
 			 map_sz, PCI_DMA_FROMDEVICE);
-	dev_kfree_skb_any(ri->skb);
-	ri->skb = NULL;
+	kfree(ri->data);
+	ri->data = NULL;
 }
 
 /* Returns size of skb allocated or < 0 on error.
@@ -5392,28 +5392,28 @@ static void tg3_rx_skb_free(struct tg3 *tp, struct ring_info *ri, u32 map_sz)
  * buffers the cpu only reads the last cacheline of the RX descriptor
  * (to fetch the error flags, vlan tag, checksum, and opaque cookie).
  */
-static int tg3_alloc_rx_skb(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
+static int tg3_alloc_rx_data(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
 			    u32 opaque_key, u32 dest_idx_unmasked)
 {
 	struct tg3_rx_buffer_desc *desc;
 	struct ring_info *map;
-	struct sk_buff *skb;
+	u8 *data;
 	dma_addr_t mapping;
-	int skb_size, dest_idx;
+	int skb_size, data_size, dest_idx;
 
 	switch (opaque_key) {
 	case RXD_OPAQUE_RING_STD:
 		dest_idx = dest_idx_unmasked & tp->rx_std_ring_mask;
 		desc = &tpr->rx_std[dest_idx];
 		map = &tpr->rx_std_buffers[dest_idx];
-		skb_size = tp->rx_pkt_map_sz;
+		data_size = tp->rx_pkt_map_sz;
 		break;
 
 	case RXD_OPAQUE_RING_JUMBO:
 		dest_idx = dest_idx_unmasked & tp->rx_jmb_ring_mask;
 		desc = &tpr->rx_jmb[dest_idx].std;
 		map = &tpr->rx_jmb_buffers[dest_idx];
-		skb_size = TG3_RX_JMB_MAP_SZ;
+		data_size = TG3_RX_JMB_MAP_SZ;
 		break;
 
 	default:
@@ -5426,31 +5426,33 @@ static int tg3_alloc_rx_skb(struct tg3 *tp, struct tg3_rx_prodring_set *tpr,
 	 * Callers depend upon this behavior and assume that
 	 * we leave everything unchanged if we fail.
 	 */
-	skb = netdev_alloc_skb(tp->dev, skb_size + TG3_RX_OFFSET(tp));
-	if (skb == NULL)
+	skb_size = SKB_DATA_ALIGN(data_size + TG3_RX_OFFSET(tp)) +
+		   SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	data = kmalloc(skb_size, GFP_ATOMIC);
+	if (!data)
 		return -ENOMEM;
 
-	skb_reserve(skb, TG3_RX_OFFSET(tp));
-
-	mapping = pci_map_single(tp->pdev, skb->data, skb_size,
+	mapping = pci_map_single(tp->pdev,
+				 data + TG3_RX_OFFSET(tp),
+				 data_size,
 				 PCI_DMA_FROMDEVICE);
 	if (pci_dma_mapping_error(tp->pdev, mapping)) {
-		dev_kfree_skb(skb);
+		kfree(data);
 		return -EIO;
 	}
 
-	map->skb = skb;
+	map->data = data;
 	dma_unmap_addr_set(map, mapping, mapping);
 
 	desc->addr_hi = ((u64)mapping >> 32);
 	desc->addr_lo = ((u64)mapping & 0xffffffff);
 
-	return skb_size;
+	return data_size;
 }
 
 /* We only need to move over in the address because the other
  * members of the RX descriptor are invariant.  See notes above
- * tg3_alloc_rx_skb for full details.
+ * tg3_alloc_rx_data for full details.
  */
 static void tg3_recycle_rx(struct tg3_napi *tnapi,
 			   struct tg3_rx_prodring_set *dpr,
@@ -5484,7 +5486,7 @@ static void tg3_recycle_rx(struct tg3_napi *tnapi,
 		return;
 	}
 
-	dest_map->skb = src_map->skb;
+	dest_map->data = src_map->data;
 	dma_unmap_addr_set(dest_map, mapping,
 			   dma_unmap_addr(src_map, mapping));
 	dest_desc->addr_hi = src_desc->addr_hi;
@@ -5495,7 +5497,7 @@ static void tg3_recycle_rx(struct tg3_napi *tnapi,
 	 */
 	smp_wmb();
 
-	src_map->skb = NULL;
+	src_map->data = NULL;
 }
 
 /* The RX ring scheme is composed of multiple rings which post fresh
@@ -5549,19 +5551,20 @@ static int tg3_rx(struct tg3_napi *tnapi, int budget)
 		struct sk_buff *skb;
 		dma_addr_t dma_addr;
 		u32 opaque_key, desc_idx, *post_ptr;
+		u8 *data;
 
 		desc_idx = desc->opaque & RXD_OPAQUE_INDEX_MASK;
 		opaque_key = desc->opaque & RXD_OPAQUE_RING_MASK;
 		if (opaque_key == RXD_OPAQUE_RING_STD) {
 			ri = &tp->napi[0].prodring.rx_std_buffers[desc_idx];
 			dma_addr = dma_unmap_addr(ri, mapping);
-			skb = ri->skb;
+			data = ri->data;
 			post_ptr = &std_prod_idx;
 			rx_std_posted++;
 		} else if (opaque_key == RXD_OPAQUE_RING_JUMBO) {
 			ri = &tp->napi[0].prodring.rx_jmb_buffers[desc_idx];
 			dma_addr = dma_unmap_addr(ri, mapping);
-			skb = ri->skb;
+			data = ri->data;
 			post_ptr = &jmb_prod_idx;
 		} else
 			goto next_pkt_nopost;
@@ -5579,13 +5582,14 @@ static int tg3_rx(struct tg3_napi *tnapi, int budget)
 			goto next_pkt;
 		}
 
+		prefetch(data + TG3_RX_OFFSET(tp));
 		len = ((desc->idx_len & RXD_LEN_MASK) >> RXD_LEN_SHIFT) -
 		      ETH_FCS_LEN;
 
 		if (len > TG3_RX_COPY_THRESH(tp)) {
 			int skb_size;
 
-			skb_size = tg3_alloc_rx_skb(tp, tpr, opaque_key,
+			skb_size = tg3_alloc_rx_data(tp, tpr, opaque_key,
 						    *post_ptr);
 			if (skb_size < 0)
 				goto drop_it;
@@ -5593,35 +5597,37 @@ static int tg3_rx(struct tg3_napi *tnapi, int budget)
 			pci_unmap_single(tp->pdev, dma_addr, skb_size,
 					 PCI_DMA_FROMDEVICE);
 
-			/* Ensure that the update to the skb happens
+			skb = build_skb(data);
+			if (!skb) {
+				kfree(data);
+				goto drop_it_no_recycle;
+			}
+			skb_reserve(skb, TG3_RX_OFFSET(tp));
+			/* Ensure that the update to the data happens
 			 * after the usage of the old DMA mapping.
 			 */
 			smp_wmb();
 
-			ri->skb = NULL;
+			ri->data = NULL;
 
-			skb_put(skb, len);
 		} else {
-			struct sk_buff *copy_skb;
-
 			tg3_recycle_rx(tnapi, tpr, opaque_key,
 				       desc_idx, *post_ptr);
 
-			copy_skb = netdev_alloc_skb(tp->dev, len +
-						    TG3_RAW_IP_ALIGN);
-			if (copy_skb == NULL)
+			skb = netdev_alloc_skb(tp->dev,
+					       len + TG3_RAW_IP_ALIGN);
+			if (skb == NULL)
 				goto drop_it_no_recycle;
 
-			skb_reserve(copy_skb, TG3_RAW_IP_ALIGN);
-			skb_put(copy_skb, len);
+			skb_reserve(skb, TG3_RAW_IP_ALIGN);
 			pci_dma_sync_single_for_cpu(tp->pdev, dma_addr, len, PCI_DMA_FROMDEVICE);
-			skb_copy_from_linear_data(skb, copy_skb->data, len);
+			memcpy(skb->data,
+			       data + TG3_RX_OFFSET(tp),
+			       len);
 			pci_dma_sync_single_for_device(tp->pdev, dma_addr, len, PCI_DMA_FROMDEVICE);
-
-			/* We'll reuse the original ring buffer. */
-			skb = copy_skb;
 		}
 
+		skb_put(skb, len);
 		if ((tp->dev->features & NETIF_F_RXCSUM) &&
 		    (desc->type_flags & RXD_FLAG_TCPUDP_CSUM) &&
 		    (((desc->ip_tcp_csum & RXD_TCPCSUM_MASK)
@@ -5760,7 +5766,7 @@ static int tg3_rx_prodring_xfer(struct tg3 *tp,
 		di = dpr->rx_std_prod_idx;
 
 		for (i = di; i < di + cpycnt; i++) {
-			if (dpr->rx_std_buffers[i].skb) {
+			if (dpr->rx_std_buffers[i].data) {
 				cpycnt = i - di;
 				err = -ENOSPC;
 				break;
@@ -5818,7 +5824,7 @@ static int tg3_rx_prodring_xfer(struct tg3 *tp,
 		di = dpr->rx_jmb_prod_idx;
 
 		for (i = di; i < di + cpycnt; i++) {
-			if (dpr->rx_jmb_buffers[i].skb) {
+			if (dpr->rx_jmb_buffers[i].data) {
 				cpycnt = i - di;
 				err = -ENOSPC;
 				break;
@@ -7056,14 +7062,14 @@ static void tg3_rx_prodring_free(struct tg3 *tp,
 	if (tpr != &tp->napi[0].prodring) {
 		for (i = tpr->rx_std_cons_idx; i != tpr->rx_std_prod_idx;
 		     i = (i + 1) & tp->rx_std_ring_mask)
-			tg3_rx_skb_free(tp, &tpr->rx_std_buffers[i],
+			tg3_rx_data_free(tp, &tpr->rx_std_buffers[i],
 					tp->rx_pkt_map_sz);
 
 		if (tg3_flag(tp, JUMBO_CAPABLE)) {
 			for (i = tpr->rx_jmb_cons_idx;
 			     i != tpr->rx_jmb_prod_idx;
 			     i = (i + 1) & tp->rx_jmb_ring_mask) {
-				tg3_rx_skb_free(tp, &tpr->rx_jmb_buffers[i],
+				tg3_rx_data_free(tp, &tpr->rx_jmb_buffers[i],
 						TG3_RX_JMB_MAP_SZ);
 			}
 		}
@@ -7072,12 +7078,12 @@ static void tg3_rx_prodring_free(struct tg3 *tp,
 	}
 
 	for (i = 0; i <= tp->rx_std_ring_mask; i++)
-		tg3_rx_skb_free(tp, &tpr->rx_std_buffers[i],
+		tg3_rx_data_free(tp, &tpr->rx_std_buffers[i],
 				tp->rx_pkt_map_sz);
 
 	if (tg3_flag(tp, JUMBO_CAPABLE) && !tg3_flag(tp, 5780_CLASS)) {
 		for (i = 0; i <= tp->rx_jmb_ring_mask; i++)
-			tg3_rx_skb_free(tp, &tpr->rx_jmb_buffers[i],
+			tg3_rx_data_free(tp, &tpr->rx_jmb_buffers[i],
 					TG3_RX_JMB_MAP_SZ);
 	}
 }
@@ -7133,7 +7139,7 @@ static int tg3_rx_prodring_alloc(struct tg3 *tp,
 
 	/* Now allocate fresh SKBs for each rx ring. */
 	for (i = 0; i < tp->rx_pending; i++) {
-		if (tg3_alloc_rx_skb(tp, tpr, RXD_OPAQUE_RING_STD, i) < 0) {
+		if (tg3_alloc_rx_data(tp, tpr, RXD_OPAQUE_RING_STD, i) < 0) {
 			netdev_warn(tp->dev,
 				    "Using a smaller RX standard ring. Only "
 				    "%d out of %d buffers were allocated "
@@ -7165,7 +7171,7 @@ static int tg3_rx_prodring_alloc(struct tg3 *tp,
 	}
 
 	for (i = 0; i < tp->rx_jumbo_pending; i++) {
-		if (tg3_alloc_rx_skb(tp, tpr, RXD_OPAQUE_RING_JUMBO, i) < 0) {
+		if (tg3_alloc_rx_data(tp, tpr, RXD_OPAQUE_RING_JUMBO, i) < 0) {
 			netdev_warn(tp->dev,
 				    "Using a smaller RX jumbo ring. Only %d "
 				    "out of %d buffers were allocated "
@@ -11374,8 +11380,8 @@ static int tg3_run_loopback(struct tg3 *tp, u32 pktsz, bool tso_loopback)
 	u32 rx_start_idx, rx_idx, tx_idx, opaque_key;
 	u32 base_flags = 0, mss = 0, desc_idx, coal_now, data_off, val;
 	u32 budget;
-	struct sk_buff *skb, *rx_skb;
-	u8 *tx_data;
+	struct sk_buff *skb;
+	u8 *tx_data, *rx_data;
 	dma_addr_t map;
 	int num_pkts, tx_len, rx_len, i, err;
 	struct tg3_rx_buffer_desc *desc;
@@ -11543,11 +11549,11 @@ static int tg3_run_loopback(struct tg3 *tp, u32 pktsz, bool tso_loopback)
 		}
 
 		if (opaque_key == RXD_OPAQUE_RING_STD) {
-			rx_skb = tpr->rx_std_buffers[desc_idx].skb;
+			rx_data = tpr->rx_std_buffers[desc_idx].data;
 			map = dma_unmap_addr(&tpr->rx_std_buffers[desc_idx],
 					     mapping);
 		} else if (opaque_key == RXD_OPAQUE_RING_JUMBO) {
-			rx_skb = tpr->rx_jmb_buffers[desc_idx].skb;
+			rx_data = tpr->rx_jmb_buffers[desc_idx].data;
 			map = dma_unmap_addr(&tpr->rx_jmb_buffers[desc_idx],
 					     mapping);
 		} else
@@ -11556,15 +11562,16 @@ static int tg3_run_loopback(struct tg3 *tp, u32 pktsz, bool tso_loopback)
 		pci_dma_sync_single_for_cpu(tp->pdev, map, rx_len,
 					    PCI_DMA_FROMDEVICE);
 
+		rx_data += TG3_RX_OFFSET(tp);
 		for (i = data_off; i < rx_len; i++, val++) {
-			if (*(rx_skb->data + i) != (u8) (val & 0xff))
+			if (*(rx_data + i) != (u8) (val & 0xff))
 				goto out;
 		}
 	}
 
 	err = 0;
 
-	/* tg3_free_rings will unmap and free the rx_skb */
+	/* tg3_free_rings will unmap and free the rx_data */
 out:
 	return err;
 }
@@ -14522,11 +14529,11 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 	else
 		tg3_flag_clear(tp, POLL_SERDES);
 
-	tp->rx_offset = NET_IP_ALIGN;
+	tp->rx_offset = NET_SKB_PAD + NET_IP_ALIGN;
 	tp->rx_copy_thresh = TG3_RX_COPY_THRESHOLD;
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701 &&
 	    tg3_flag(tp, PCIX_MODE)) {
-		tp->rx_offset = 0;
+		tp->rx_offset = NET_SKB_PAD;
 #ifndef CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS
 		tp->rx_copy_thresh = ~(u16)0;
 #endif
diff --git a/drivers/net/ethernet/broadcom/tg3.h b/drivers/net/ethernet/broadcom/tg3.h
index 94b4bd0..8e2f380 100644
--- a/drivers/net/ethernet/broadcom/tg3.h
+++ b/drivers/net/ethernet/broadcom/tg3.h
@@ -2662,9 +2662,13 @@ struct tg3_hw_stats {
 /* 'mapping' is superfluous as the chip does not write into
  * the tx/rx post rings so we could just fetch it from there.
  * But the cache behavior is better how we are doing it now.
+ *
+ * This driver uses new build_skb() API :
+ * RX ring buffer contains pointer to kmalloc() data only,
+ * skb are built only after Hardware filled the frame.
  */
 struct ring_info {
-	struct sk_buff			*skb;
+	u8				*data;
 	DEFINE_DMA_UNMAP_ADDR(mapping);
 };
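
The allocation size computed in tg3_alloc_rx_data() above can be tried out in
userspace. What follows is only a sketch under stated assumptions, not driver
code: the cache line size (64) and the size standing in for
struct skb_shared_info (320) are made-up illustration values, and the
sketch_* names are hypothetical. It shows why the kmalloc() size is larger
than the DMA-mapped size: the buffer must also hold the headroom and the
shared info that build_skb() places at its end.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed values for illustration only. The kernel derives the real ones
 * from the architecture and from sizeof(struct skb_shared_info). */
#define SKETCH_CACHE_BYTES 64u
#define SKETCH_SHINFO_SIZE 320u

/* Userspace stand-in for SKB_DATA_ALIGN(): round up to a multiple of the
 * (assumed) cache line size, which must be a power of two. */
static size_t sketch_data_align(size_t n)
{
	assert((SKETCH_CACHE_BYTES & (SKETCH_CACHE_BYTES - 1)) == 0);
	return (n + SKETCH_CACHE_BYTES - 1) & ~((size_t)SKETCH_CACHE_BYTES - 1);
}

/* Mirrors the skb_size expression in the patch: aligned (payload area +
 * headroom) followed by the aligned shared-info area. */
static size_t sketch_rx_alloc_size(size_t data_size, size_t rx_offset)
{
	return sketch_data_align(data_size + rx_offset) +
	       sketch_data_align(SKETCH_SHINFO_SIZE);
}
```

With these assumed constants, a 1536-byte mapping and 66 bytes of headroom
give sketch_rx_alloc_size(1536, 66) == 1984, while only the 1536 bytes
starting at data + offset are handed to the device.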
 

^ permalink raw reply related

* [PATCH] xen-netfront: report link speed to ethtool
From: Olaf Hering @ 2011-11-18 16:48 UTC (permalink / raw)
  To: netdev, xen-devel, Jeremy Fitzhardinge, Konrad Rzeszutek Wilk


Add a .get_settings function that returns fake data so that ethtool can
get enough information. This is useful for some applications such as VCS,
where some of the application logic otherwise fails.
The reported data is modeled on VMware vmxnet.

Signed-off-by: Xin Wei Hu <xwhu@suse.com>
Signed-off-by: Chunyan Liu <cyliu@suse.com>
Signed-off-by: Olaf Hering <olaf@aepfle.de>

---
 drivers/net/xen-netfront.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

Index: linux-3.2-rc2/drivers/net/xen-netfront.c
===================================================================
--- linux-3.2-rc2.orig/drivers/net/xen-netfront.c
+++ linux-3.2-rc2/drivers/net/xen-netfront.c
@@ -1727,6 +1727,17 @@ static void netback_changed(struct xenbu
 	}
 }
 
+static int xennet_get_settings(struct net_device *netdev, struct ethtool_cmd *ecmd)
+{
+	ecmd->supported = SUPPORTED_1000baseT_Full | SUPPORTED_TP;
+	ecmd->advertising = ADVERTISED_TP;
+	ecmd->port = PORT_TP;
+	ecmd->transceiver = XCVR_INTERNAL;
+	ecmd->speed = SPEED_1000;
+	ecmd->duplex = DUPLEX_FULL;
+	return 0;
+}
+
 static const struct xennet_stat {
 	char name[ETH_GSTRING_LEN];
 	u16 offset;
@@ -1774,6 +1785,7 @@ static const struct ethtool_ops xennet_e
 {
 	.get_link = ethtool_op_get_link,
 
+	.get_settings = xennet_get_settings,
 	.get_sset_count = xennet_get_sset_count,
 	.get_ethtool_stats = xennet_get_ethtool_stats,
 	.get_strings = xennet_get_strings,

^ permalink raw reply

* Re: [BUG] e1000: possible deadlock scenario caught by lockdep
From: Jesse Brandeburg @ 2011-11-18 16:57 UTC (permalink / raw)
  To: Steven Rostedt; +Cc: e1000-devel, netdev, LKML, Thomas Gleixner, Brown, Aaron F
In-Reply-To: <1321579620.3533.29.camel@frodo>

CC'd netdev, and e1000-devel

On Thu, 17 Nov 2011 17:27:00 -0800
Steven Rostedt <rostedt@goodmis.org> wrote:

> I hit the following lockdep splat:
> 
> ======================================================
> [ INFO: possible circular locking dependency detected ]
> 3.2.0-rc2-test+ #14
> -------------------------------------------------------
> reboot/2316 is trying to acquire lock:
>  ((&(&adapter->watchdog_task)->work)){+.+...}, at: [<ffffffff81069553>] wait_on_work+0x0/0xac
> 
> but task is already holding lock:
>  (&adapter->mutex){+.+...}, at: [<ffffffff81359b1d>] __e1000_shutdown+0x56/0x1f5
> 
> which lock already depends on the new lock.
> 
> 
> the existing dependency chain (in reverse order) is:
> 
> -> #1 (&adapter->mutex){+.+...}:
>        [<ffffffff8108261a>] lock_acquire+0x103/0x158
>        [<ffffffff8150bcf3>] __mutex_lock_common+0x6a/0x441
>        [<ffffffff8150c13d>] mutex_lock_nested+0x1b/0x1d
>        [<ffffffff81359288>] e1000_watchdog+0x56/0x4a4
>        [<ffffffff8106a1b0>] process_one_work+0x1ef/0x3e0
>        [<ffffffff8106b4e0>] worker_thread+0xda/0x15e
>        [<ffffffff8106f00e>] kthread+0x9f/0xa7
>        [<ffffffff81514e84>] kernel_thread_helper+0x4/0x10
> 
> -> #0 ((&(&adapter->watchdog_task)->work)){+.+...}:
>        [<ffffffff81081e4a>] __lock_acquire+0xa29/0xd06
>        [<ffffffff8108261a>] lock_acquire+0x103/0x158
>        [<ffffffff81069590>] wait_on_work+0x3d/0xac
>        [<ffffffff8106a616>] __cancel_work_timer+0xb9/0xff
>        [<ffffffff8106a66e>] cancel_delayed_work_sync+0x12/0x14
>        [<ffffffff81355c8f>] e1000_down_and_stop+0x2e/0x4a
>        [<ffffffff813581ed>] e1000_down+0x116/0x176
>        [<ffffffff81359b4a>] __e1000_shutdown+0x83/0x1f5
>        [<ffffffff81359cd6>] e1000_shutdown+0x1a/0x43
>        [<ffffffff8126fdad>] pci_device_shutdown+0x29/0x3d
>        [<ffffffff8130c601>] device_shutdown+0xbe/0xf9
>        [<ffffffff81065b17>] kernel_restart_prepare+0x31/0x38
>        [<ffffffff81065b32>] kernel_restart+0x14/0x51
>        [<ffffffff81065cd8>] sys_reboot+0x157/0x1b0
>        [<ffffffff81513882>] system_call_fastpath+0x16/0x1b
> 
> other info that might help us debug this:
> 
>  Possible unsafe locking scenario:
> 
>        CPU0                    CPU1
>        ----                    ----
>   lock(&adapter->mutex);
>                                lock((&(&adapter->watchdog_task)->work));
>                                lock(&adapter->mutex);
>   lock((&(&adapter->watchdog_task)->work));
> 
>  *** DEADLOCK ***
> 
> 2 locks held by reboot/2316:
>  #0:  (reboot_mutex){+.+.+.}, at: [<ffffffff81065c20>] sys_reboot+0x9f/0x1b0
>  #1:  (&adapter->mutex){+.+...}, at: [<ffffffff81359b1d>] __e1000_shutdown+0x56/0x1f5
> 
> stack backtrace:
> Pid: 2316, comm: reboot Not tainted 3.2.0-rc2-test+ #14
> Call Trace:
>  [<ffffffff81503eb2>] print_circular_bug+0x1f8/0x209
>  [<ffffffff81081e4a>] __lock_acquire+0xa29/0xd06
>  [<ffffffff81069553>] ? wait_on_cpu_work+0x94/0x94
>  [<ffffffff8108261a>] lock_acquire+0x103/0x158
>  [<ffffffff81069553>] ? wait_on_cpu_work+0x94/0x94
>  [<ffffffff810c7caf>] ? trace_preempt_on+0x2a/0x2f
>  [<ffffffff81069590>] wait_on_work+0x3d/0xac
>  [<ffffffff81069553>] ? wait_on_cpu_work+0x94/0x94
>  [<ffffffff8106a616>] __cancel_work_timer+0xb9/0xff
>  [<ffffffff8106a66e>] cancel_delayed_work_sync+0x12/0x14
>  [<ffffffff81355c8f>] e1000_down_and_stop+0x2e/0x4a
>  [<ffffffff813581ed>] e1000_down+0x116/0x176
>  [<ffffffff81359b4a>] __e1000_shutdown+0x83/0x1f5
>  [<ffffffff8150d51c>] ? _raw_spin_unlock+0x33/0x56
>  [<ffffffff8130c583>] ? device_shutdown+0x40/0xf9
>  [<ffffffff81359cd6>] e1000_shutdown+0x1a/0x43
>  [<ffffffff81510757>] ? sub_preempt_count+0xa1/0xb4
>  [<ffffffff8126fdad>] pci_device_shutdown+0x29/0x3d
>  [<ffffffff8130c601>] device_shutdown+0xbe/0xf9
>  [<ffffffff81065b17>] kernel_restart_prepare+0x31/0x38
>  [<ffffffff81065b32>] kernel_restart+0x14/0x51
>  [<ffffffff81065cd8>] sys_reboot+0x157/0x1b0
>  [<ffffffff81072ccb>] ? hrtimer_cancel+0x17/0x24
>  [<ffffffff8150c304>] ? do_nanosleep+0x74/0xac
>  [<ffffffff8125c72d>] ? trace_hardirqs_off_thunk+0x3a/0x3c
>  [<ffffffff8150e066>] ? error_sti+0x5/0x6
>  [<ffffffff810c7c80>] ? time_hardirqs_off+0x2a/0x2f
>  [<ffffffff8125c6ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff8150db5d>] ? retint_swapgs+0x13/0x1b
>  [<ffffffff8150db5d>] ? retint_swapgs+0x13/0x1b
>  [<ffffffff81082a78>] ? trace_hardirqs_on_caller+0x12d/0x164
>  [<ffffffff810a74ce>] ? audit_syscall_entry+0x11c/0x148
>  [<ffffffff8125c6ee>] ? trace_hardirqs_on_thunk+0x3a/0x3f
>  [<ffffffff81513882>] system_call_fastpath+0x16/0x1b
> 
> 
> The issue comes from two recent commits:
> 
> commit a4010afef585b7142eb605e3a6e4210c0e1b2957
> Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Date:   Wed Oct 5 07:24:41 2011 +0000
> e1000: convert hardware management from timers to threads
> 
> and
> 
> commit 0ef4eedc2e98edd51cd106e1f6a27178622b7e57
> Author: Jesse Brandeburg <jesse.brandeburg@intel.com>
> Date:   Wed Oct 5 07:24:51 2011 +0000
> e1000: convert to private mutex from rtnl
> 
> 
> What we have is on __e1000_shutdown():
> 
> 	mutex_lock(&adapter->mutex);
> 
> 	if (netif_running(netdev)) {
> 		WARN_ON(test_bit(__E1000_RESETTING, &adapter->flags));
> 		e1000_down(adapter);
> 	}
> 
> but e1000_down() calls: e1000_down_and_stop():
> 
> static void e1000_down_and_stop(struct e1000_adapter *adapter)
> {
> 	set_bit(__E1000_DOWN, &adapter->flags);
> 	cancel_work_sync(&adapter->reset_task);
> 	cancel_delayed_work_sync(&adapter->watchdog_task);
> 	cancel_delayed_work_sync(&adapter->phy_info_task);
> 	cancel_delayed_work_sync(&adapter->fifo_stall_task);
> }
> 
> 
> Here you see that we are calling cancel_delayed_work_sync(&adapter->watchdog_task);
> 
> The problem is that adapter->watchdog_task grabs the mutex &adapter->mutex.
> 
> If the work has started and it blocked on that mutex, the
> cancel_delayed_work_sync() will block indefinitely and we have a
> deadlock.
> 
> Not sure what's the best way around this. Can we call e1000_down()
> without grabbing the adapter->mutex?

Thanks for the report, I'll look at it today and see if I can work out
a way to avoid the bonk.
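
The ordering problem described above can be reproduced in userspace with
pthreads. The sketch below is only an analogue under stated assumptions, not
e1000 code and not necessarily the fix that will be chosen: pthread_join()
stands in for cancel_delayed_work_sync(), and the sketch_* names are
hypothetical. The point it illustrates is that waiting for the worker must
happen before the mutex the worker needs is taken.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

struct sketch_adapter {
	pthread_mutex_t mutex;
	pthread_t watchdog;
	atomic_bool down;
	int ticks;
};

/* Periodic worker that takes the adapter mutex, like the watchdog task. */
static void *sketch_watchdog(void *arg)
{
	struct sketch_adapter *a = arg;

	while (!atomic_load(&a->down)) {
		pthread_mutex_lock(&a->mutex);
		a->ticks++;
		pthread_mutex_unlock(&a->mutex);
		usleep(1000);
	}
	return NULL;
}

static int sketch_start(struct sketch_adapter *a)
{
	pthread_mutex_init(&a->mutex, NULL);
	atomic_init(&a->down, false);
	a->ticks = 0;
	return pthread_create(&a->watchdog, NULL, sketch_watchdog, a);
}

/* Shutdown path. Joining while holding a->mutex could block forever if the
 * worker is itself waiting for that mutex, so stop and wait for the worker
 * first, and only then take the mutex. */
static int sketch_shutdown(struct sketch_adapter *a)
{
	int ticks;

	atomic_store(&a->down, true);
	pthread_join(a->watchdog, NULL);

	pthread_mutex_lock(&a->mutex);
	ticks = a->ticks;
	pthread_mutex_unlock(&a->mutex);
	pthread_mutex_destroy(&a->mutex);
	return ticks;
}
```

Swapping the join and the lock in sketch_shutdown() recreates the reported
scenario: the shutdown path holds the mutex and waits for a worker that is
blocked on the same mutex.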


^ permalink raw reply

* Re: [net-next-2.6 PATCH 0/6 v4] macvlan: MAC Address filtering support for passthru mode
From: Greg Rose @ 2011-11-18 16:58 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: Roopa Prabhu, netdev@vger.kernel.org, davem@davemloft.net,
	chrisw@redhat.com, sri@us.ibm.com, dragos.tatulea@gmail.com,
	kvm@vger.kernel.org, arnd@arndb.de, mst@redhat.com,
	mchan@broadcom.com, dwang2@cisco.com, shemminger@vyatta.com,
	eric.dumazet@gmail.com, kaber@trash.net, benve@cisco.com
In-Reply-To: <1321577078.2749.58.camel@bwh-desktop>


On 11/17/2011 4:44 PM, Ben Hutchings wrote:
> On Thu, 2011-11-17 at 16:32 -0800, Greg Rose wrote:
>> On 11/17/2011 4:15 PM, Ben Hutchings wrote:
>>> Sorry to come to this rather late.
>>>
>>> On Tue, 2011-11-08 at 23:55 -0800, Roopa Prabhu wrote:
>>> [...]
>>>> v2 ->   v3
>>>> - Moved set and get filter ops from rtnl_link_ops to netdev_ops
>>>> - Support for SRIOV VFs.
>>>>           [Note: The get filters msg (in the way current get rtnetlink handles
>>>>           it) might get too big for SRIOV vfs. This patch follows existing sriov
>>>>           vf get code and tries to accommodate filters for all VF's in a PF.
>>>>           And for the SRIOV case I have only tested the fact that the VF
>>>>           arguments are getting delivered to rtnetlink correctly. The code
>>>>           follows existing sriov vf handling code so rest of it should work fine]
>>> [...]
>>>
>>> This is already broken for large numbers of VFs, and increasing the
>>> amount of information per VF is going to make the situation worse.  I am
>>> no netlink expert but I think that the current approach of bundling all
>>> information about an interface in a single message may not be
>>> sustainable.
>>>
>>> Also, I'm unclear on why this interface is to be used to set filtering
>>> for the (PF) net device as well as for related VFs.  Doesn't that
>>> duplicate the functionality of ndo_set_rx_mode and
>>> ndo_vlan_rx_{add,kill}_vid?
>>
>> Functionally yes but contextually no.  This allows the PF driver to know
>> that it is setting these filters in the context of the existence of VFs,
>> allowing it to take appropriate action.  The other two functions may be
>> called without the presence of SR-IOV enablement and the existence of VFs.
>>
>> Anyway, that's why I asked Roopa to add that capability.
>
> I don't follow.  The PF driver already knows whether it has enabled VFs.
>
> How do filters set this way interact with filters set through the
> existing operations?  Should they override promiscuous mode?  None of
> this has been specified.

Promiscuous mode is exactly the issue this feature is intended for.  I'm 
not familiar with the Solarflare device but Intel HW promiscuous mode is
only promiscuous on the physical port, not on the VEB.  So a packet sent 
from a VF will not be captured by the PF across the VEB unless the MAC 
and VLAN filters have been programmed into the HW.  So you may not need 
the feature for your devices but it is required for Intel devices.  And 
it's a fairly simple request, just allow -1 to indicate that the target 
of the filter requests is for the PF itself.  Using the already existing 
set_rx_mode function won't work because the PF driver will look at it
and figure it's in promiscuous mode anyway, so it won't set the filters 
into the HW.  At least that is how it is in the case of our HW and 
driver.  Again, the behavior of your HW and driver is unknown to me and 
thus you may not require this feature.

- Greg

^ permalink raw reply

* Re: Unable to flush ICMP redirect routes in kernel 3.0+
From: Flavio Leitner @ 2011-11-18 17:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, Ivan Zahariev, netdev, Vasiliy Kulikov
In-Reply-To: <1321634046.3277.33.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On Fri, 18 Nov 2011 17:34:06 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> On Friday, 18 November 2011 at 14:30 -0200, Flavio Leitner wrote:
> 
> > I know we are reverting to get it fixed, but this adds the routing
> > cache back, so what is the plan? Revert to get it working and then
> > think about a new approach to removing the route cache again later?
> > 
> > I had one previous patch using the routing cache posted to the list,
> > but it won't fix the route flush problem.
> > 
> 
> I don't "add the routing cache back".

Sorry, I meant that we are trying to avoid doing this:
+			hash = rt_hash(daddr, skeys[s], ikeys[i],rt_genid(net));
+
+			rthp = &rt_hash_table[hash].chain;
+
+			while ((rt = rcu_dereference(*rthp)) != NULL) {
+				rthp = &rt->dst.rt_next;

anyway, see below.

> Note I only fix existing route entries in the cache ;)
Exactly.
 
> A "revert" is probably safe, since we should push a fix for
> 3.0/3.1/3.2 kernels...

I agree that reverting is probably safe.
fbl

^ permalink raw reply
