Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH RFC] tun: experimental zero copy tx support
From: Shirley Ma @ 2012-05-14 17:34 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David S. Miller, Stephen Hemminger, Joe Perches, Jason Wang,
	netdev, linux-kernel, Ian.Campbell, kvm
In-Reply-To: <20120513155206.GA26847@redhat.com>

On Sun, 2012-05-13 at 18:52 +0300, Michael S. Tsirkin wrote:
> Let vhost-net utilize zero copy tx when the experimental
> zero copy mode is enabled and when used with tun.  This works on
> top of the patchset 'copy aside frags with destructors' that I posted
> previously. This is not using tcp so doesn't have the
> issue with early skb cloning noticed by Ian.
> 
> For those that wish to test this with kvm, I intend to post a patchset
> +
> git tree with just the necessary bits from the destructor patch
> a bit later.
> 
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> 

Hello Mike,

Have you tested this patch? I think the difference between macvtap and
tap is tap forwarding the packet to bridge. The zerocopy is disabled in
this case.

Shirley

^ permalink raw reply

* Re: [PATCH] net: codel: fix build errors
From: Sasha Levin @ 2012-05-14 17:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: edumazet, dave.taht, davem, linux-kernel, netdev
In-Reply-To: <20120514103526.1f1a4399@nehalam.linuxnetplumber.net>

On Mon, May 14, 2012 at 7:35 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
> On Mon, 14 May 2012 19:14:47 +0200
> Sasha Levin <levinsasha928@gmail.com> wrote:
>
>> Fix the following build error:
>>
>> net/sched/sch_fq_codel.c: In function 'fq_codel_dump_stats':
>> net/sched/sch_fq_codel.c:464:3: error: unknown field 'qdisc_stats' specified in initializer
>> net/sched/sch_fq_codel.c:464:3: warning: missing braces around initializer
>> net/sched/sch_fq_codel.c:464:3: warning: (near initialization for 'st.<anonymous>')
>> net/sched/sch_fq_codel.c:465:3: error: unknown field 'qdisc_stats' specified in initializer
>> net/sched/sch_fq_codel.c:465:3: warning: excess elements in struct initializer
>> net/sched/sch_fq_codel.c:465:3: warning: (near initialization for 'st')
>> net/sched/sch_fq_codel.c:466:3: error: unknown field 'qdisc_stats' specified in initializer
>> net/sched/sch_fq_codel.c:466:3: warning: excess elements in struct initializer
>> net/sched/sch_fq_codel.c:466:3: warning: (near initialization for 'st')
>> net/sched/sch_fq_codel.c:467:3: error: unknown field 'qdisc_stats' specified in initializer
>> net/sched/sch_fq_codel.c:467:3: warning: excess elements in struct initializer
>> net/sched/sch_fq_codel.c:467:3: warning: (near initialization for 'st')
>> make[1]: *** [net/sched/sch_fq_codel.o] Error 1
>>
>> Signed-off-by: Sasha Levin <levinsasha928@gmail.com>
>
> Which Gcc version, looks like your compiler is old.

gcc version 4.5.3 (Gentoo 4.5.3-r2 p1.1, pie-0.4.7)

This is the latest gcc provided by gentoo.

^ permalink raw reply

* Re: [PATCH 1/4] netfilter: ipset: fix timeout value overflow bug
From: Jozsef Kadlecsik @ 2012-05-14 17:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Pablo Neira Ayuso, David Laight, netfilter-devel, davem, netdev
In-Reply-To: <1337006844.8512.491.camel@edumazet-glaptop>

On Mon, 14 May 2012, Eric Dumazet wrote:

> On Mon, 2012-05-14 at 16:36 +0200, Pablo Neira Ayuso wrote:
> > On Mon, May 14, 2012 at 03:19:49PM +0100, David Laight wrote:
> > >  
> > > > --- a/include/linux/netfilter/ipset/ip_set_timeout.h
> > > > +++ b/include/linux/netfilter/ipset/ip_set_timeout.h
> > > > @@ -30,6 +30,10 @@ ip_set_timeout_uget(struct nlattr *tb)
> > > >  {
> > > >  	unsigned int timeout = ip_set_get_h32(tb);
> > > >  
> > > > +	/* Normalize to fit into jiffies */
> > > > +	if (timeout > UINT_MAX/1000)
> > > > +		timeout = UINT_MAX/1000;
> > > > +
> > > 
> > > Doesn't that rather assume that HZ is 1000 ?
> > 
> > Indeed. I overlooked that. Thanks David.
> 
> I dont think so.
> 
> 1000 here is really MSEC_PER_SEC
> 
> (All occurrences should be changed in this file)

Yes, exactly. Pablo, shall I produce a little patch or could you change 
1000 to MSEC_PER_SEC?

Best regards, 
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply

* [RESEND PATCH V2 12/17] NET: MIPS: lantiq: implement OF support inside the etop driver
From: John Crispin @ 2012-05-14 17:42 UTC (permalink / raw)
  To: Ralf Baechle; +Cc: linux-mips, John Crispin, David S. Miller, netdev
In-Reply-To: <1337017363-14424-1-git-send-email-blogic@openwrt.org>

This patch makes it possible to load the driver for the ETOP ethernet on
Lantiq SoC from a devicetree.

Additionally we convert the driver to using the new clkdev clock in favour of
the old ltq_pmu_*() api.

Signed-off-by: John Crispin <blogic@openwrt.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
---
This patch is part of a series moving the mips/lantiq target to OF and clkdev
support. The patch, once Acked, should go upstream via Ralf's MIPS tree.

Changes in V2
* rebase on Ralf's next tree

 drivers/net/ethernet/lantiq_etop.c |   63 +++++++++++++++++++++++++++++------
 1 files changed, 52 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/lantiq_etop.c b/drivers/net/ethernet/lantiq_etop.c
index 5dc9cbd..c0443d4 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -29,19 +29,22 @@
 #include <linux/tcp.h>
 #include <linux/skbuff.h>
 #include <linux/mm.h>
-#include <linux/platform_device.h>
 #include <linux/ethtool.h>
 #include <linux/init.h>
 #include <linux/delay.h>
 #include <linux/io.h>
 #include <linux/dma-mapping.h>
 #include <linux/module.h>
+#include <linux/of_platform.h>
+#include <linux/of_net.h>
+#include <linux/of_address.h>
+#include <linux/of_irq.h>
+#include <linux/of_gpio.h>
 
 #include <asm/checksum.h>
 
 #include <lantiq_soc.h>
 #include <xway_dma.h>
-#include <lantiq_platform.h>
 
 #define LTQ_ETOP_MDIO		0x11804
 #define MDIO_REQUEST		0x80000000
@@ -73,6 +76,7 @@
 #define ETOP_CGEN		0x800
 
 /* use 2 static channels for TX/RX */
+#define LTQ_DMA_CH0_INT		INT_NUM_IM2_IRL0
 #define LTQ_ETOP_TX_CHANNEL	1
 #define LTQ_ETOP_RX_CHANNEL	6
 #define IS_TX(x)		(x == LTQ_ETOP_TX_CHANNEL)
@@ -99,12 +103,17 @@ struct ltq_etop_chan {
 struct ltq_etop_priv {
 	struct net_device *netdev;
 	struct platform_device *pdev;
-	struct ltq_eth_data *pldata;
 	struct resource *res;
 
 	struct mii_bus *mii_bus;
 	struct phy_device *phydev;
 
+	struct clk *clk;
+	const void *mac;
+	int mii_mode;
+	int tx_irq;
+	int rx_irq;
+
 	struct ltq_etop_chan ch[MAX_DMA_CHAN];
 	int tx_free[MAX_DMA_CHAN >> 1];
 
@@ -239,7 +248,7 @@ ltq_etop_hw_exit(struct net_device *dev)
 	struct ltq_etop_priv *priv = netdev_priv(dev);
 	int i;
 
-	ltq_pmu_disable(PMU_PPE);
+	clk_disable(priv->clk);
 	for (i = 0; i < MAX_DMA_CHAN; i++)
 		if (IS_TX(i) || IS_RX(i))
 			ltq_etop_free_channel(dev, &priv->ch[i]);
@@ -251,9 +260,9 @@ ltq_etop_hw_init(struct net_device *dev)
 	struct ltq_etop_priv *priv = netdev_priv(dev);
 	int i;
 
-	ltq_pmu_enable(PMU_PPE);
+	clk_enable(priv->clk);
 
-	switch (priv->pldata->mii_mode) {
+	switch (priv->mii_mode) {
 	case PHY_INTERFACE_MODE_RMII:
 		ltq_etop_w32_mask(ETOP_MII_MASK,
 			ETOP_MII_REVERSE, LTQ_ETOP_CFG);
@@ -266,7 +275,7 @@ ltq_etop_hw_init(struct net_device *dev)
 
 	default:
 		netdev_err(dev, "unknown mii mode %d\n",
-			priv->pldata->mii_mode);
+			priv->mii_mode);
 		return -ENOTSUPP;
 	}
 
@@ -395,7 +404,7 @@ ltq_etop_mdio_probe(struct net_device *dev)
 	}
 
 	phydev = phy_connect(dev, dev_name(&phydev->dev), &ltq_etop_mdio_link,
-			0, priv->pldata->mii_mode);
+			0, priv->mii_mode);
 
 	if (IS_ERR(phydev)) {
 		netdev_err(dev, "Could not attach to PHY\n");
@@ -643,7 +652,7 @@ ltq_etop_init(struct net_device *dev)
 		goto err_hw;
 	ltq_etop_change_mtu(dev, 1500);
 
-	memcpy(&mac, &priv->pldata->mac, sizeof(struct sockaddr));
+	memcpy(&mac.sa_data, &priv->mac, ETH_ALEN);
 	if (!is_valid_ether_addr(mac.sa_data)) {
 		pr_warn("etop: invalid MAC, using random\n");
 		random_ether_addr(mac.sa_data);
@@ -707,9 +716,12 @@ static const struct net_device_ops ltq_eth_netdev_ops = {
 static int __init
 ltq_etop_probe(struct platform_device *pdev)
 {
+	struct device_node *node = pdev->dev.of_node;
 	struct net_device *dev;
 	struct ltq_etop_priv *priv;
 	struct resource *res;
+	struct clk *clk;
+	struct resource irqres[2];
 	int err;
 	int i;
 
@@ -737,6 +749,20 @@ ltq_etop_probe(struct platform_device *pdev)
 		goto err_out;
 	}
 
+	err = of_irq_to_resource_table(node, irqres, 2);
+	if (err != 2) {
+		dev_err(&pdev->dev, "not enough irqs defined\n");
+		err = -EINVAL;
+		goto err_out;
+	}
+
+	clk = clk_get(&pdev->dev, NULL);
+	if (IS_ERR(clk)) {
+		dev_err(&pdev->dev, "Failed to get clock\n");
+		err = PTR_ERR(clk);
+		goto err_out;
+	}
+
 	dev = alloc_etherdev_mq(sizeof(struct ltq_etop_priv), 4);
 	if (!dev) {
 		err = -ENOMEM;
@@ -748,9 +774,15 @@ ltq_etop_probe(struct platform_device *pdev)
 	priv = netdev_priv(dev);
 	priv->res = res;
 	priv->pdev = pdev;
-	priv->pldata = dev_get_platdata(&pdev->dev);
 	priv->netdev = dev;
 	spin_lock_init(&priv->lock);
+	priv->tx_irq = irqres[0].start;
+	priv->rx_irq = irqres[1].start;
+	priv->mii_mode = of_get_phy_mode(node);
+	priv->mac = of_get_mac_address(node);
+	priv->clk = clk;
+	if (priv->mii_mode < 0)
+		priv->mii_mode = PHY_INTERFACE_MODE_MII;
 
 	for (i = 0; i < MAX_DMA_CHAN; i++) {
 		if (IS_TX(i))
@@ -779,21 +811,30 @@ static int __devexit
 ltq_etop_remove(struct platform_device *pdev)
 {
 	struct net_device *dev = platform_get_drvdata(pdev);
+	struct ltq_etop_priv *priv = netdev_priv(dev);
 
 	if (dev) {
 		netif_tx_stop_all_queues(dev);
 		ltq_etop_hw_exit(dev);
 		ltq_etop_mdio_cleanup(dev);
+		clk_put(priv->clk);
 		unregister_netdev(dev);
 	}
 	return 0;
 }
 
+static const struct of_device_id ltq_etop_match[] = {
+	{ .compatible = "lantiq,etop-xway" },
+	{},
+};
+MODULE_DEVICE_TABLE(of, ltq_etop_match);
+
 static struct platform_driver ltq_mii_driver = {
 	.remove = __devexit_p(ltq_etop_remove),
 	.driver = {
-		.name = "ltq_etop",
+		.name = "etop-xway",
 		.owner = THIS_MODULE,
+		.of_match_table = ltq_etop_match,
 	},
 };
 
-- 
1.7.9.1

^ permalink raw reply related

* Re: [PATCH] netfilter: xt_HMARK: endian bugs
From: Hans Schillstrom @ 2012-05-14 17:51 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jan Engelhardt, Pablo Neira Ayuso, kaber@trash.net,
	jengelh@medozas.de, netfilter-devel@vger.kernel.org,
	netdev@vger.kernel.org, dan.carpenter@oracle.com,
	hans@schillstrom.com
In-Reply-To: <1337012674.8512.589.camel@edumazet-glaptop>

On Monday 14 May 2012 18:24:34 Eric Dumazet wrote:
> On Mon, 2012-05-14 at 18:09 +0200, Hans Schillstrom wrote:
> 
> > This context can contain both le & be machines,
> > so at least in hmark it make sense
> 
> Before jhash() and its shuffle ? What do you mean ?

I want that a Big endian machine should produce the same
hash value independent of flow direction as a Little endian.

OK, I missed ntohl() before calling jhash_3words()

Correct me if I'm wrong here (have no big endian machine available for test)
jhash_3words() and __jhash_final() seems to be "endian" safe.

So by doing the expensive ntohl on addresses and ports into jhash_3words()
it will produce the same value on both be and le.

That's why I want to have the ntohs() / ntohl() when comparing.

> 
> Please respin your patch using (__force u16/u32) instead of
> useless/expensive ntohs() / ntohl() (in _this_ context of hashing)
> 
> If you compare two 32bits values, of course they must have same
> ordering, but seeding jhash() is another matter.
> 
> (Granted all calls use the same ordering of course)
> 
> sparse is great tool, but if you add useless ntohl() calls to make
> sparse silent, then its probably better to not use sparse.
> 

^ permalink raw reply

* Re: [RFC] net/hsr: Add support for IEC 62439-3 High-availability Seamless Redundancy
From: Arvid Brodin @ 2012-05-14 18:11 UTC (permalink / raw)
  To: netdev@vger.kernel.org
  Cc: David S. Miller, Stephen Hemminger, Bruno Ferreira, Arvid Brodin,
	Christian Borntraeger, Herbert Xu
In-Reply-To: <4F71BEAD.5080605@enea.com>

On 2012-03-27 15:20, Arvid Brodin wrote:
> Hi!

*snip*
> 
> 2) I have a locking problem that I haven't managed to figure out. This happens
>    the first time I send any packet (hsr_dev_xmit() in hsr_device.c:121, called
>    from hsr_device.c:147). It happens even if I set skb2 to NULL (i.e. only send
>    one copy):
> 
> =============================================
> [ INFO: possible recursive locking detected ]
> 2.6.37 #118
> ---------------------------------------------
> swapper/0 is trying to acquire lock:
>  (_xmit_ETHER#2){+.-...}, at: [<901bf38e>] sch_direct_xmit+0x24/0x152
> 
> but task is already holding lock:
>  (_xmit_ETHER#2){+.-...}, at: [<901b4d1a>] dev_queue_xmit+0x31e/0x3cc
> 
> other info that might help us debug this:
> 4 locks held by swapper/0:
>  #0:  (&n->timer){+.-...}, at: [<9002bc20>] run_timer_softirq+0x98/0x184
>  #1:  (rcu_read_lock_bh){.+....}, at: [<901b49fc>] dev_queue_xmit+0x0/0x3cc
>  #2:  (_xmit_ETHER#2){+.-...}, at: [<901b4d1a>] dev_queue_xmit+0x31e/0x3cc
>  #3:  (rcu_read_lock_bh){.+....}, at: [<901b49fc>] dev_queue_xmit+0x0/0x3cc
> 
> stack backtrace:
> Call trace:
>  [<9001c640>] dump_stack+0x18/0x20
>  [<90040eac>] validate_chain+0x40c/0x9ac
>  [<90041a58>] __lock_acquire+0x60c/0x670
>  [<90042f32>] lock_acquire+0x3a/0x48
>  [<902201a4>] _raw_spin_lock+0x20/0x44
>  [<901bf38e>] sch_direct_xmit+0x24/0x152
>  [<901b4c14>] dev_queue_xmit+0x218/0x3cc
>  [<9021c2e0>] slave_xmit+0x10/0x14
>  [<9021c540>] hsr_dev_xmit+0x88/0x8c
>  [<901b4942>] dev_hard_start_xmit+0x3c6/0x480
>  [<901b4d34>] dev_queue_xmit+0x338/0x3cc
>  [<901e3cd8>] arp_xmit+0x8/0xc
>  [<901e4436>] arp_send+0x2a/0x2c
>  [<901e4e74>] arp_solicit+0x15c/0x170
>  [<901bad0c>] neigh_timer_handler+0x1c0/0x204
>  [<9002bc8a>] run_timer_softirq+0x102/0x184
>  [<900287d8>] __do_softirq+0x64/0xe0
>  [<9002896a>] do_softirq+0x26/0x48
>  [<90028a66>] irq_exit+0x2e/0x64
>  [<90019f16>] do_IRQ+0x46/0x5c
>  [<90018428>] irq_level0+0x18/0x60
>  [<9021cc16>] rest_init+0x72/0x98
>  [<9000063c>] start_kernel+0x21c/0x258
>  [<00000000>] 0x0
> 
>    Any idea why this happens? I need help!

I've spent a few days digging into this and the key apparently is NETIF_F_LLTX.

The problem seems to be that HARD_TX_LOCK is called more than once, first for my virtual
hsr device and then, recursively, for each of the slaves in turn. (At least that's where
lockdep complains - at __netif_tx_lock(), that is.)

At first I just could not understand why both the VLAN and the bonding code got away with
recursive calls to dev_queue_xmit() but I didn't. After some gooling (a lot, actually) I
found some references to the NETIF_F_LLTX flag (here's one:
http://lwn.net/Articles/121566/). I realised both VLAN and bonding code set this flag. And
sure enough, if I set it for my hsr device lockdep does not complain any more.

But NETIF_F_LLTX is described as deprecated in both netdevice.h and in
Documentation/networking/netdevices.txt. Is there an alternative solution that I should
use instead?

(To recap, a hsr device is a virtual device which uses two Ethernet devices as slaves.
This gives redundancy with instant failover, and since nodes are connected in a ring
topology, uses less cabling than duplication.)

-- 
Arvid Brodin | Consultant (Linux)
XDIN AB | Jan Stenbecks Torg 17 | SE-164 40 Kista | Sweden | xdin.com

^ permalink raw reply

* Re: [PATCH v5 2/2] decrement static keys on real destroy time
From: Tejun Heo @ 2012-05-14 18:12 UTC (permalink / raw)
  To: Glauber Costa
  Cc: cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg,
	devel-GEFAQzZX7r8dnm+yROfE0A,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	netdev-u79uwXL29TY76Z2rM5mHXA, Li Zefan, Johannes Weiner,
	Michal Hocko
In-Reply-To: <1336767077-25351-3-git-send-email-glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>

On Fri, May 11, 2012 at 05:11:17PM -0300, Glauber Costa wrote:
> We call the destroy function when a cgroup starts to be removed,
> such as by a rmdir event.
> 
> However, because of our reference counters, some objects are still
> inflight. Right now, we are decrementing the static_keys at destroy()
> time, meaning that if we get rid of the last static_key reference,
> some objects will still have charges, but the code to properly
> uncharge them won't be run.
> 
> This becomes a problem specially if it is ever enabled again, because
> now new charges will be added to the staled charges making keeping
> it pretty much impossible.
> 
> We just need to be careful with the static branch activation:
> since there is no particular preferred order of their activation,
> we need to make sure that we only start using it after all
> call sites are active. This is achieved by having a per-memcg
> flag that is only updated after static_key_slow_inc() returns.
> At this time, we are sure all sites are active.
> 
> This is made per-memcg, not global, for a reason:
> it also has the effect of making socket accounting more
> consistent. The first memcg to be limited will trigger static_key()
> activation, therefore, accounting. But all the others will then be
> accounted no matter what. After this patch, only limited memcgs
> will have its sockets accounted.
> 
> [v2: changed a tcp limited flag for a generic proto limited flag ]
> [v3: update the current active flag only after the static_key update ]
> [v4: disarm_static_keys() inside free_work ]
> [v5: got rid of tcp_limit_mutex, now in the static_key interface ]
> 
> Signed-off-by: Glauber Costa <glommer-bzQdu9zFT3WakBO8gow8eQ@public.gmane.org>
> CC: Tejun Heo <tj-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> CC: Li Zefan <lizefan-hv44wF8Li93QT0dZR+AlfA@public.gmane.org>
> CC: Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
> CC: Johannes Weiner <hannes-druUgvl0LCNAfugRpC6u6w@public.gmane.org>
> CC: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>

Generally looks sane to me.  Please feel free to addmy  Reviewed-by.

> +	if (val == RESOURCE_MAX)
> +		cg_proto->active = false;
> +	else if (val != RESOURCE_MAX) {

Minor nitpick: CodingStyle says not to omit { } if other branches need
them.

Thanks.

-- 
tejun

^ permalink raw reply

* Re: [PATCH] netfilter: xt_HMARK: endian bugs
From: Jan Engelhardt @ 2012-05-14 18:24 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Eric Dumazet, Pablo Neira Ayuso, kaber@trash.net,
	jengelh@medozas.de, netfilter-devel@vger.kernel.org,
	netdev@vger.kernel.org, dan.carpenter@oracle.com,
	hans@schillstrom.com
In-Reply-To: <201205141951.36692.hans.schillstrom@ericsson.com>

On Monday 2012-05-14 19:51, Hans Schillstrom wrote:

>On Monday 14 May 2012 18:24:34 Eric Dumazet wrote:
>> On Mon, 2012-05-14 at 18:09 +0200, Hans Schillstrom wrote:
>> 
>> > This context can contain both le & be machines,
>> > so at least in hmark it make sense
>> 
>> Before jhash() and its shuffle ? What do you mean ?
>
>I want that a Big endian machine should produce the same
>hash value independent of flow direction as a Little endian.

But one does not really need that, since the hash is only used locally.

^ permalink raw reply

* Re: [RFC] net/hsr: Add support for IEC 62439-3 High-availability Seamless Redundancy
From: Stephen Hemminger @ 2012-05-14 18:28 UTC (permalink / raw)
  To: Arvid Brodin
  Cc: netdev@vger.kernel.org, David S. Miller, Bruno Ferreira,
	Christian Borntraeger, Herbert Xu
In-Reply-To: <4FB14ADD.3050708@xdin.com>

On Mon, 14 May 2012 18:11:44 +0000
Arvid Brodin <Arvid.Brodin@xdin.com> wrote:

> On 2012-03-27 15:20, Arvid Brodin wrote:
> > Hi!
> 
> *snip*
> > 
> > 2) I have a locking problem that I haven't managed to figure out. This happens
> >    the first time I send any packet (hsr_dev_xmit() in hsr_device.c:121, called
> >    from hsr_device.c:147). It happens even if I set skb2 to NULL (i.e. only send
> >    one copy):
> > 
> > =============================================
> > [ INFO: possible recursive locking detected ]
> > 2.6.37 #118
> > ---------------------------------------------
> > swapper/0 is trying to acquire lock:
> >  (_xmit_ETHER#2){+.-...}, at: [<901bf38e>] sch_direct_xmit+0x24/0x152
> > 
> > but task is already holding lock:
> >  (_xmit_ETHER#2){+.-...}, at: [<901b4d1a>] dev_queue_xmit+0x31e/0x3cc
> > 
> > other info that might help us debug this:
> > 4 locks held by swapper/0:
> >  #0:  (&n->timer){+.-...}, at: [<9002bc20>] run_timer_softirq+0x98/0x184
> >  #1:  (rcu_read_lock_bh){.+....}, at: [<901b49fc>] dev_queue_xmit+0x0/0x3cc
> >  #2:  (_xmit_ETHER#2){+.-...}, at: [<901b4d1a>] dev_queue_xmit+0x31e/0x3cc
> >  #3:  (rcu_read_lock_bh){.+....}, at: [<901b49fc>] dev_queue_xmit+0x0/0x3cc
> > 
> > stack backtrace:
> > Call trace:
> >  [<9001c640>] dump_stack+0x18/0x20
> >  [<90040eac>] validate_chain+0x40c/0x9ac
> >  [<90041a58>] __lock_acquire+0x60c/0x670
> >  [<90042f32>] lock_acquire+0x3a/0x48
> >  [<902201a4>] _raw_spin_lock+0x20/0x44
> >  [<901bf38e>] sch_direct_xmit+0x24/0x152
> >  [<901b4c14>] dev_queue_xmit+0x218/0x3cc
> >  [<9021c2e0>] slave_xmit+0x10/0x14
> >  [<9021c540>] hsr_dev_xmit+0x88/0x8c
> >  [<901b4942>] dev_hard_start_xmit+0x3c6/0x480
> >  [<901b4d34>] dev_queue_xmit+0x338/0x3cc
> >  [<901e3cd8>] arp_xmit+0x8/0xc
> >  [<901e4436>] arp_send+0x2a/0x2c
> >  [<901e4e74>] arp_solicit+0x15c/0x170
> >  [<901bad0c>] neigh_timer_handler+0x1c0/0x204
> >  [<9002bc8a>] run_timer_softirq+0x102/0x184
> >  [<900287d8>] __do_softirq+0x64/0xe0
> >  [<9002896a>] do_softirq+0x26/0x48
> >  [<90028a66>] irq_exit+0x2e/0x64
> >  [<90019f16>] do_IRQ+0x46/0x5c
> >  [<90018428>] irq_level0+0x18/0x60
> >  [<9021cc16>] rest_init+0x72/0x98
> >  [<9000063c>] start_kernel+0x21c/0x258
> >  [<00000000>] 0x0
> > 
> >    Any idea why this happens? I need help!
> 
> 
> I've spent a few days digging into this and the key apparently is NETIF_F_LLTX.
> 
> The problem seems to be that HARD_TX_LOCK is called more than once, first for my virtual
> hsr device and then, recursively, for each of the slaves in turn. (At least that's where
> lockdep complains - at __netif_tx_lock(), that is.)
> 
> At first I just could not understand why both the VLAN and the bonding code got away with
> recursive calls to dev_queue_xmit() but I didn't. After some gooling (a lot, actually) I
> found some references to the NETIF_F_LLTX flag (here's one:
> http://lwn.net/Articles/121566/). I realised both VLAN and bonding code set this flag. And
> sure enough, if I set it for my hsr device lockdep does not complain any more.
> 
> But NETIF_F_LLTX is described as deprecated in both netdevice.h and in
> Documentation/networking/netdevices.txt. Is there an alternative solution that I should
> use instead?
> 
> (To recap, a hsr device is a virtual device which uses two Ethernet devices as slaves.
> This gives redundancy with instant failover, and since nodes are connected in a ring
> topology, uses less cabling than duplication.)
> 

LLTX is deprecated (ie should not be used) for physical devices.

Also, for virtual devices, there should be non transmit queue, this
causes mulit-queue lockless semantics to be preserved as the call passes
through the virtual device.

^ permalink raw reply

* Re: [PATCH] netfilter: xt_HMARK: endian bugs
From: Eric Dumazet @ 2012-05-14 18:28 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Jan Engelhardt, Pablo Neira Ayuso, kaber@trash.net,
	jengelh@medozas.de, netfilter-devel@vger.kernel.org,
	netdev@vger.kernel.org, dan.carpenter@oracle.com,
	hans@schillstrom.com
In-Reply-To: <201205141951.36692.hans.schillstrom@ericsson.com>

On Mon, 2012-05-14 at 19:51 +0200, Hans Schillstrom wrote:
> On Monday 14 May 2012 18:24:34 Eric Dumazet wrote:
> > On Mon, 2012-05-14 at 18:09 +0200, Hans Schillstrom wrote:
> > 
> > > This context can contain both le & be machines,
> > > so at least in hmark it make sense
> > 
> > Before jhash() and its shuffle ? What do you mean ?
> 
> I want that a Big endian machine should produce the same
> hash value independent of flow direction as a Little endian.
> 

So one machine can be both le and be ? at the same time ?

> OK, I missed ntohl() before calling jhash_3words()
> 
> Correct me if I'm wrong here (have no big endian machine available for test)
> jhash_3words() and __jhash_final() seems to be "endian" safe.
> 
> So by doing the expensive ntohl on addresses and ports into jhash_3words()
> it will produce the same value on both be and le.
> 

And what is the purpose of the jhash output ? Is is sent to other
machines on the network, or only localy used ?

> That's why I want to have the ntohs() / ntohl() when comparing.

If xt_HMARK depends on a particular bit ordering to jhash() input, then
something is really wrong. I mean it.

jhash() primary purpose it to shuffle input.

We use (__force u32) everywhere in network tree to avoid sparse
warnings. Please grep for them.




^ permalink raw reply

* Re: [PATCH] netfilter: xt_HMARK: endian bugs
From: Jozsef Kadlecsik @ 2012-05-14 18:35 UTC (permalink / raw)
  To: Hans Schillstrom
  Cc: Eric Dumazet, Jan Engelhardt, Pablo Neira Ayuso, kaber@trash.net,
	jengelh@medozas.de, netfilter-devel@vger.kernel.org,
	netdev@vger.kernel.org, dan.carpenter@oracle.com,
	hans@schillstrom.com
In-Reply-To: <201205141951.36692.hans.schillstrom@ericsson.com>

On Mon, 14 May 2012, Hans Schillstrom wrote:

> On Monday 14 May 2012 18:24:34 Eric Dumazet wrote:
> > On Mon, 2012-05-14 at 18:09 +0200, Hans Schillstrom wrote:
> > 
> > > This context can contain both le & be machines,
> > > so at least in hmark it make sense
> > 
> > Before jhash() and its shuffle ? What do you mean ?
> 
> I want that a Big endian machine should produce the same
> hash value independent of flow direction as a Little endian.
> 
> OK, I missed ntohl() before calling jhash_3words()
> 
> Correct me if I'm wrong here (have no big endian machine available for test)
> jhash_3words() and __jhash_final() seems to be "endian" safe.

No, but as Eric wrote: what is the point in forcing the same hash value 
for the same input on big endian and little endian machines? Are you going 
to transfer the hash value between machines?

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply

* Re: [PATCH RFC] tun: experimental zero copy tx support
From: Michael S. Tsirkin @ 2012-05-14 18:39 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David S. Miller, Stephen Hemminger, Joe Perches, Jason Wang,
	netdev, linux-kernel, Ian.Campbell, kvm
In-Reply-To: <1337016890.3851.20.camel@oc3660625478.ibm.com>

On Mon, May 14, 2012 at 10:34:50AM -0700, Shirley Ma wrote:
> On Sun, 2012-05-13 at 18:52 +0300, Michael S. Tsirkin wrote:
> > Let vhost-net utilize zero copy tx when the experimental
> > zero copy mode is enabled and when used with tun.  This works on
> > top of the patchset 'copy aside frags with destructors' that I posted
> > previously. This is not using tcp so doesn't have the
> > issue with early skb cloning noticed by Ian.
> > 
> > For those that wish to test this with kvm, I intend to post a patchset
> > +
> > git tree with just the necessary bits from the destructor patch
> > a bit later.
> > 
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> 
> 
> Hello Mike,
> 
> Have you tested this patch? I think the difference between macvtap and
> tap is tap forwarding the packet to bridge. The zerocopy is disabled in
> this case.
> 
> Shirley

Testing in progress, but the patchset I pointed to enables
zerocopy with bridge.

^ permalink raw reply

* RE: [PATCH ethtool] Add command to dump module EEPROM
From: Yaniv Rosner @ 2012-05-14 18:57 UTC (permalink / raw)
  To: Stuart Hodgson, Ben Hutchings
  Cc: David Miller, netdev@vger.kernel.org, Eilon Greenstein
In-Reply-To: <4FB1267E.2090006@solarflare.com>

> On 14/05/12 15:18, Ben Hutchings wrote:
> > On Mon, 2012-05-14 at 18:13 +0300, Yaniv Rosner wrote:
> >> Hi Ben,
> >> This patch adds a new option to dump (SFP+, XFP, ...) module EEPROM following
> >> recent support to kernel side. Below some examples:
> >>
> >> bash-3.00# ethtool -m eth1 offset 0x14 length 32 raw on
> >> JDSU            PLRXPLSCS432
> >>
> >> bash-3.00# ethtool -m eth1 offset 0x14 length 32
> >> Offset          Values
> >> ------          ------
> >> 0x0014          4a 44 53 55 20 20 20 20 20 20 20 20 20 20 20 20
> >> 0x0024          00 00 01 9c 50 4c 52 58 50 4c 53 43 53 34 33 32
> >>
> >> Please consider applying to ethtool.
> >
> > I agree there should be ASCII-hex and binary dump modes, but we should
> > also support decoding of recognised EEPROM types (as Stuart proposed
> > earlier).
> 
> I have a patch to do this as well, but also parse the SFP+ EEPROM.
> I need to fix it up after some of the changes to the patches that added
> kernel support but can submit in the next day or so if this would be of
> use.
> 
> Stu

Sounds good. I'll wait for your patch.


^ permalink raw reply

* Re: [PATCH 1/4] netfilter: ipset: fix timeout value overflow bug
From: Pablo Neira Ayuso @ 2012-05-14 19:00 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Eric Dumazet, David Laight, netfilter-devel, davem, netdev
In-Reply-To: <alpine.DEB.2.00.1205141943240.28291@blackhole.kfki.hu>

On Mon, May 14, 2012 at 07:45:07PM +0200, Jozsef Kadlecsik wrote:
> On Mon, 14 May 2012, Eric Dumazet wrote:
> 
> > On Mon, 2012-05-14 at 16:36 +0200, Pablo Neira Ayuso wrote:
> > > On Mon, May 14, 2012 at 03:19:49PM +0100, David Laight wrote:
> > > >  
> > > > > --- a/include/linux/netfilter/ipset/ip_set_timeout.h
> > > > > +++ b/include/linux/netfilter/ipset/ip_set_timeout.h
> > > > > @@ -30,6 +30,10 @@ ip_set_timeout_uget(struct nlattr *tb)
> > > > >  {
> > > > >  	unsigned int timeout = ip_set_get_h32(tb);
> > > > >  
> > > > > +	/* Normalize to fit into jiffies */
> > > > > +	if (timeout > UINT_MAX/1000)
> > > > > +		timeout = UINT_MAX/1000;
> > > > > +
> > > > 
> > > > Doesn't that rather assume that HZ is 1000 ?
> > > 
> > > Indeed. I overlooked that. Thanks David.
> > 
> > I dont think so.
> > 
> > 1000 here is really MSEC_PER_SEC
> > 
> > (All occurrences should be changed in this file)
> 
> Yes, exactly. Pablo, shall I produce a little patch or could you change 
> 1000 to MSEC_PER_SEC?

I'll mangle this myself. No need to send a new patch.

^ permalink raw reply

* Re: [PATCH] netfilter: xt_HMARK: endian bugs
From: Pablo Neira Ayuso @ 2012-05-14 19:02 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Hans Schillstrom, Eric Dumazet, Jan Engelhardt, kaber@trash.net,
	jengelh@medozas.de, netfilter-devel@vger.kernel.org,
	netdev@vger.kernel.org, dan.carpenter@oracle.com,
	hans@schillstrom.com
In-Reply-To: <alpine.DEB.2.00.1205142032570.28548@blackhole.kfki.hu>

On Mon, May 14, 2012 at 08:35:26PM +0200, Jozsef Kadlecsik wrote:
> On Mon, 14 May 2012, Hans Schillstrom wrote:
> 
> > On Monday 14 May 2012 18:24:34 Eric Dumazet wrote:
> > > On Mon, 2012-05-14 at 18:09 +0200, Hans Schillstrom wrote:
> > > 
> > > > This context can contain both le & be machines,
> > > > so at least in hmark it make sense
> > > 
> > > Before jhash() and its shuffle ? What do you mean ?
> > 
> > I want that a Big endian machine should produce the same
> > hash value independent of flow direction as a Little endian.
> > 
> > OK, I missed ntohl() before calling jhash_3words()
> > 
> > Correct me if I'm wrong here (have no big endian machine available for test)
> > jhash_3words() and __jhash_final() seems to be "endian" safe.
> 
> No, but as Eric wrote: what is the point in forcing the same hash value 
> for the same input on big endian and little endian machines? Are you going 
> to transfer the hash value between machines?

IIRC, Hans wants that, in case you have a cluster composed of system
with different endianess, the hash mark calculated will be the same
in both systems. To ensure that the distribution is consistent with
independency of the endianess.

^ permalink raw reply

* Re: [PATCH RFC] tun: experimental zero copy tx support
From: Eric Dumazet @ 2012-05-14 19:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, David S. Miller, Joe Perches, Jason Wang,
	netdev, linux-kernel, Ian.Campbell, kvm
In-Reply-To: <20120514170412.GA17086@redhat.com>

On Mon, 2012-05-14 at 20:04 +0300, Michael S. Tsirkin wrote:
> On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
> > On Sun, 13 May 2012 18:52:06 +0300
> > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > 
> > > +		/* Userspace may produce vectors with count greater than
> > > +		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
> > > +		 * to let the rest of data to be fit in the frags.
> > > +		 */
> > Rather than complex partial code, just go through slow path for
> > requests with too many frags (or for really small requests).
> > Creating mixed skb's seems too easy to get wrong.
> 
> I don't object in principle but macvtap has same code
> so seems better to stay consistent.
> 

If I remember well, code in vtap was buggy and still is.

Jason Wang fixes are not yet in, so maybe wait a bit, so that we dont
add a pile of new bugs ?

^ permalink raw reply

* Re: [PATCH v2 1/6] netfilter: sanity checks on NFPROTO_NUMPROTO
From: Pablo Neira Ayuso @ 2012-05-14 19:04 UTC (permalink / raw)
  To: Alban Crequy
  Cc: Patrick McHardy, Vincent Sanders, Javier Martinez Canillas,
	netfilter-devel, netdev
In-Reply-To: <20120514170410.6c2f1c5b@rainbow.cbg.collabora.co.uk>

On Mon, May 14, 2012 at 05:04:10PM +0100, Alban Crequy wrote:
> Le Mon, 14 May 2012 16:39:49 +0100,
> Alban Crequy <alban.crequy@collabora.co.uk> a écrit :
> 
> > Le Mon, 14 May 2012 16:42:35 +0200,
> > Pablo Neira Ayuso <pablo@netfilter.org> a écrit :
> > 
> > > On Mon, May 14, 2012 at 02:56:34PM +0100, Alban Crequy wrote:
> > > > With the NFPROTO_* constants introduced by commit 7e9c6e
> > > > ("netfilter: Introduce NFPROTO_* constants"), it is too easy to
> > > > confuse PF_* and NFPROTO_* constants in new protocols.
> > > > 
> > > > Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
> > > > Reviewed-by: Javier Martinez Canillas
> > > > <javier.martinez@collabora.co.uk> Reviewed-by: Vincent Sanders
> > > > <vincent.sanders@collabora.co.uk> ---
> > > >  net/netfilter/core.c |    5 +++++
> > > >  1 files changed, 5 insertions(+), 0 deletions(-)
> > > > 
> > > > diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> > > > index e1b7e05..4f16552 100644
> > > > --- a/net/netfilter/core.c
> > > > +++ b/net/netfilter/core.c
> > > > @@ -67,6 +67,11 @@ int nf_register_hook(struct nf_hook_ops *reg)
> > > >  	struct nf_hook_ops *elem;
> > > >  	int err;
> > > >  
> > > > +	if (reg->pf >= NFPROTO_NUMPROTO || reg->hooknum >=
> > > > NF_MAX_HOOKS) {
> > > > +		BUG();
> > > > +		return 1;
> > > 
> > > nf_register_hook returns a negative value on error. -EINVAL can be
> > > fine.
> > 
> > Is it the patch you mean? Do you want me to do a series repost?
> 
> Please disregard the previous patch, this is the correct one.
> 
> 
> From: Alban Crequy <alban.crequy@collabora.co.uk>
> 
> netfilter: sanity checks on NFPROTO_NUMPROTO
> 
> With the NFPROTO_* constants introduced by commit 7e9c6e ("netfilter: Introduce
> NFPROTO_* constants"), it is too easy to confuse PF_* and NFPROTO_* constants
> in new protocols.
> 
> Signed-off-by: Alban Crequy <alban.crequy@collabora.co.uk>
> ---
>  net/netfilter/core.c |    8 ++++++++
>  1 files changed, 8 insertions(+), 0 deletions(-)
> 
> diff --git a/net/netfilter/core.c b/net/netfilter/core.c
> index e1b7e05..7422989 100644
> --- a/net/netfilter/core.c
> +++ b/net/netfilter/core.c
> @@ -67,6 +67,14 @@ int nf_register_hook(struct nf_hook_ops *reg)
>  	struct nf_hook_ops *elem;
>  	int err;
>  
> +	if (reg->pf >= NFPROTO_NUMPROTO || reg->hooknum >= NF_MAX_HOOKS) {
> +		WARN(reg->pf >= NFPROTO_NUMPROTO,
> +		     "netfilter: Invalid nfproto %d\n", reg->pf);
> +		WARN(reg->hooknum >= NF_MAX_HOOKS,
> +		     "netfilter: Invalid hooknum %d\n", reg->hooknum);

Then, better add two checkings. One to spot the first warning, and
another to spot the second.

I havent seen such a code in any netfilter code and I like that things
remain consistent.

> +		return -EINVAL;
> +	}
> +
>  	err = mutex_lock_interruptible(&nf_hook_mutex);
>  	if (err < 0)
>  		return err;
> -- 
> 1.7.2.5
> 

^ permalink raw reply

* Re: [PATCH] [RFC] netfilter: don't assume NFPROTO_* are like PF_*
From: Pablo Neira Ayuso @ 2012-05-14 19:06 UTC (permalink / raw)
  To: Jan Engelhardt
  Cc: Alban Crequy, Patrick McHardy, Vincent Sanders,
	Javier Martinez Canillas, netfilter-devel, netdev
In-Reply-To: <alpine.LNX.2.01.1205141707490.11548@frira.vanv.qr>

On Mon, May 14, 2012 at 05:09:54PM +0200, Jan Engelhardt wrote:
> 
> On Monday 2012-05-14 15:58, Alban Crequy wrote:
> >--- a/include/linux/netfilter.h
> >+++ b/include/linux/netfilter.h
> >@@ -62,11 +62,11 @@ enum nf_inet_hooks {
> > 
> > enum {
> > 	NFPROTO_UNSPEC =  0,
> >-	NFPROTO_IPV4   =  2,
> >-	NFPROTO_ARP    =  3,
> >-	NFPROTO_BRIDGE =  7,
> >-	NFPROTO_IPV6   = 10,
> >-	NFPROTO_DECNET = 12,
> >+	NFPROTO_IPV4,
> >+	NFPROTO_ARP,
> >+	NFPROTO_BRIDGE,
> >+	NFPROTO_IPV6,
> >+	NFPROTO_DECNET,
> > 	NFPROTO_NUMPROTO,
> > };
> 
> This must not be changed under any circumstances. It is exported to
> and used by userspace. (Except perhaps for NFPROTO_DECNET, which
> refers to a quite dead protocol that I think no user parts have ever
> used NFPROTO_DECNET.) I would consider it acceptable to change the
> value for NFPROTO_DECNET if Pablo joins.

If there is some remote posibility to break userspace code, I won't
take the patch, sorry.

^ permalink raw reply

* Re: [PATCH] netfilter: xt_HMARK: endian bugs
From: Eric Dumazet @ 2012-05-14 19:13 UTC (permalink / raw)
  To: Pablo Neira Ayuso
  Cc: Jozsef Kadlecsik, Hans Schillstrom, Jan Engelhardt,
	kaber@trash.net, jengelh@medozas.de,
	netfilter-devel@vger.kernel.org, netdev@vger.kernel.org,
	dan.carpenter@oracle.com, hans@schillstrom.com
In-Reply-To: <20120514190211.GC14897@1984>

On Mon, 2012-05-14 at 21:02 +0200, Pablo Neira Ayuso wrote:

> IIRC, Hans wants that, in case you have a cluster composed of system
> with different endianess, the hash mark calculated will be the same
> in both systems. To ensure that the distribution is consistent with
> independency of the endianess.

Then jhash() must be audited to make sure its output is OK with this
requirement.






^ permalink raw reply

* Re: [PATCH RFC] tun: experimental zero copy tx support
From: Michael S. Tsirkin @ 2012-05-14 19:14 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, David S. Miller, Joe Perches, Jason Wang,
	netdev, linux-kernel, Ian.Campbell, kvm
In-Reply-To: <1337022146.8512.606.camel@edumazet-glaptop>

On Mon, May 14, 2012 at 09:02:26PM +0200, Eric Dumazet wrote:
> On Mon, 2012-05-14 at 20:04 +0300, Michael S. Tsirkin wrote:
> > On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
> > > On Sun, 13 May 2012 18:52:06 +0300
> > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > 
> > > > +		/* Userspace may produce vectors with count greater than
> > > > +		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
> > > > +		 * to let the rest of data to be fit in the frags.
> > > > +		 */
> > > Rather than complex partial code, just go through slow path for
> > > requests with too many frags (or for really small requests).
> > > Creating mixed skb's seems too easy to get wrong.
> > 
> > I don't object in principle but macvtap has same code
> > so seems better to stay consistent.
> > 
> 
> If I remember well, code in vtap was buggy and still is.
> 
> Jason Wang fixes are not yet in,

They seem to be in net-next, or did I miss something?

> so maybe wait a bit, so that we dont
> add a pile of new bugs ?
> 
> 
> 

^ permalink raw reply

* Re: [PATCH RFC] tun: experimental zero copy tx support
From: Michael S. Tsirkin @ 2012-05-14 19:17 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Stephen Hemminger, David S. Miller, Joe Perches, Jason Wang,
	netdev, linux-kernel, Ian.Campbell, kvm
In-Reply-To: <20120514191456.GC17086@redhat.com>

On Mon, May 14, 2012 at 10:14:56PM +0300, Michael S. Tsirkin wrote:
> On Mon, May 14, 2012 at 09:02:26PM +0200, Eric Dumazet wrote:
> > On Mon, 2012-05-14 at 20:04 +0300, Michael S. Tsirkin wrote:
> > > On Mon, May 14, 2012 at 09:54:46AM -0700, Stephen Hemminger wrote:
> > > > On Sun, 13 May 2012 18:52:06 +0300
> > > > "Michael S. Tsirkin" <mst@redhat.com> wrote:
> > > > 
> > > > > +		/* Userspace may produce vectors with count greater than
> > > > > +		 * MAX_SKB_FRAGS, so we need to linearize parts of the skb
> > > > > +		 * to let the rest of data to be fit in the frags.
> > > > > +		 */
> > > > Rather than complex partial code, just go through slow path for
> > > > requests with too many frags (or for really small requests).
> > > > Creating mixed skb's seems too easy to get wrong.
> > > 
> > > I don't object in principle but macvtap has same code
> > > so seems better to stay consistent.
> > > 
> > 
> > If I remember well, code in vtap was buggy and still is.
> > 
> > Jason Wang fixes are not yet in,
> 
> They seem to be in net-next, or did I miss something?
> 
> > so maybe wait a bit, so that we dont
> > add a pile of new bugs ?
> > 
> > 

Things progress smoother upstream than out of tree
is my experience.

Also everything is guarded by a mod param in vhost
which is off by default and the name
experimental_zcopytx makes it hopefully clear there's
risk involved.

So the chance of hurting someone is imo minimal.

^ permalink raw reply

* Re: [PATCH RFC 5/6] net: orphan frags on receive
From: Michael S. Tsirkin @ 2012-05-14 19:18 UTC (permalink / raw)
  To: Ian Campbell; +Cc: David Miller, netdev@vger.kernel.org, eric.dumazet@gmail.com
In-Reply-To: <1336571692.25514.122.camel@zakaz.uk.xensource.com>

On Wed, May 09, 2012 at 02:54:52PM +0100, Ian Campbell wrote:
> On Mon, 2012-05-07 at 14:54 +0100, Michael S. Tsirkin wrote:
> > zero copy packets are normally sent to the outside
> > network, but bridging, tun etc might loop them
> > back to host networking stack. If this happens
> > destructors will never be called, so orphan
> > the frags immediately on receive.
> 
> I think this deceptively simply patch is actually the meat of the
> series.

Eric, does this patch look ok to you?
It's key to both my and ian's zerocopy work.

-- 
MST

^ permalink raw reply

* Re: [PATCH RFC] tun: experimental zero copy tx support
From: Eric Dumazet @ 2012-05-14 19:21 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Stephen Hemminger, David S. Miller, Joe Perches, Jason Wang,
	netdev, linux-kernel, Ian.Campbell, kvm
In-Reply-To: <20120514191456.GC17086@redhat.com>

On Mon, 2012-05-14 at 22:14 +0300, Michael S. Tsirkin wrote:

> They seem to be in net-next, or did I miss something?

Yes, you re-introduce in this patch some bugs already fixed in macvtap

^ permalink raw reply

* [PATCH] pch_gbe: fix transmit races
From: Eric Dumazet @ 2012-05-14 19:26 UTC (permalink / raw)
  To: Andy Cress, David Miller; +Cc: netdev
In-Reply-To: <40680C535D6FE6498883F1640FACD44DDF93F1@ka-exchange-1.kontronamerica.local>

From: Eric Dumazet <edumazet@google.com>

Andy reported pch_gbe triggered "NETDEV WATCHDOG" errors.

May 11 11:06:09 kontron kernel: WARNING: at net/sched/sch_generic.c:261
dev_watchdog+0x1ec/0x200() (Not tainted)
May 11 11:06:09 kontron kernel: Hardware name: N/A
May 11 11:06:09 kontron kernel: NETDEV WATCHDOG: eth0 (pch_gbe):
transmit queue 0 timed out

It seems pch_gbe has a racy tx path (races with TX completion path)

Remove tx_queue_lock lock since it has no purpose, we must use tx_lock
instead.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andy Cress <andy.cress@us.kontron.com>
Tested-by: Andy Cress <andy.cress@us.kontron.com>
---

patch rebased on net tree.

 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h      |    2 
 drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c |   25 ++++------
 2 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
index dd14915..ba78174 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe.h
@@ -584,7 +584,6 @@ struct pch_gbe_hw_stats {
 /**
  * struct pch_gbe_adapter - board specific private data structure
  * @stats_lock:	Spinlock structure for status
- * @tx_queue_lock:	Spinlock structure for transmit
  * @ethtool_lock:	Spinlock structure for ethtool
  * @irq_sem:		Semaphore for interrupt
  * @netdev:		Pointer of network device structure
@@ -609,7 +608,6 @@ struct pch_gbe_hw_stats {
 
 struct pch_gbe_adapter {
 	spinlock_t stats_lock;
-	spinlock_t tx_queue_lock;
 	spinlock_t ethtool_lock;
 	atomic_t irq_sem;
 	struct net_device *netdev;
diff --git a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
index 8035e5f..1e38d50 100644
--- a/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
+++ b/drivers/net/ethernet/oki-semi/pch_gbe/pch_gbe_main.c
@@ -640,14 +640,11 @@ static void pch_gbe_mac_set_pause_packet(struct pch_gbe_hw *hw)
  */
 static int pch_gbe_alloc_queues(struct pch_gbe_adapter *adapter)
 {
-	int size;
-
-	size = (int)sizeof(struct pch_gbe_tx_ring);
-	adapter->tx_ring = kzalloc(size, GFP_KERNEL);
+	adapter->tx_ring = kzalloc(sizeof(*adapter->tx_ring), GFP_KERNEL);
 	if (!adapter->tx_ring)
 		return -ENOMEM;
-	size = (int)sizeof(struct pch_gbe_rx_ring);
-	adapter->rx_ring = kzalloc(size, GFP_KERNEL);
+
+	adapter->rx_ring = kzalloc(sizeof(*adapter->rx_ring), GFP_KERNEL);
 	if (!adapter->rx_ring) {
 		kfree(adapter->tx_ring);
 		return -ENOMEM;
@@ -1162,7 +1159,6 @@ static void pch_gbe_tx_queue(struct pch_gbe_adapter *adapter,
 	struct sk_buff *tmp_skb;
 	unsigned int frame_ctrl;
 	unsigned int ring_num;
-	unsigned long flags;
 
 	/*-- Set frame control --*/
 	frame_ctrl = 0;
@@ -1211,14 +1207,14 @@ static void pch_gbe_tx_queue(struct pch_gbe_adapter *adapter,
 			}
 		}
 	}
-	spin_lock_irqsave(&tx_ring->tx_lock, flags);
+
 	ring_num = tx_ring->next_to_use;
 	if (unlikely((ring_num + 1) == tx_ring->count))
 		tx_ring->next_to_use = 0;
 	else
 		tx_ring->next_to_use = ring_num + 1;
 
-	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+
 	buffer_info = &tx_ring->buffer_info[ring_num];
 	tmp_skb = buffer_info->skb;
 
@@ -1518,7 +1514,7 @@ pch_gbe_alloc_rx_buffers_pool(struct pch_gbe_adapter *adapter,
 						&rx_ring->rx_buff_pool_logic,
 						GFP_KERNEL);
 	if (!rx_ring->rx_buff_pool) {
-		pr_err("Unable to allocate memory for the receive poll buffer\n");
+		pr_err("Unable to allocate memory for the receive pool buffer\n");
 		return -ENOMEM;
 	}
 	memset(rx_ring->rx_buff_pool, 0, size);
@@ -1637,15 +1633,17 @@ pch_gbe_clean_tx(struct pch_gbe_adapter *adapter,
 	pr_debug("called pch_gbe_unmap_and_free_tx_resource() %d count\n",
 		 cleaned_count);
 	/* Recover from running out of Tx resources in xmit_frame */
+	spin_lock(&tx_ring->tx_lock);
 	if (unlikely(cleaned && (netif_queue_stopped(adapter->netdev)))) {
 		netif_wake_queue(adapter->netdev);
 		adapter->stats.tx_restart_count++;
 		pr_debug("Tx wake queue\n");
 	}
-	spin_lock(&adapter->tx_queue_lock);
+
 	tx_ring->next_to_clean = i;
-	spin_unlock(&adapter->tx_queue_lock);
+
 	pr_debug("next_to_clean : %d\n", tx_ring->next_to_clean);
+	spin_unlock(&tx_ring->tx_lock);
 	return cleaned;
 }
 
@@ -2037,7 +2035,6 @@ static int pch_gbe_sw_init(struct pch_gbe_adapter *adapter)
 		return -ENOMEM;
 	}
 	spin_lock_init(&adapter->hw.miim_lock);
-	spin_lock_init(&adapter->tx_queue_lock);
 	spin_lock_init(&adapter->stats_lock);
 	spin_lock_init(&adapter->ethtool_lock);
 	atomic_set(&adapter->irq_sem, 0);
@@ -2142,10 +2139,10 @@ static int pch_gbe_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 			 tx_ring->next_to_use, tx_ring->next_to_clean);
 		return NETDEV_TX_BUSY;
 	}
-	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 
 	/* CRC,ITAG no support */
 	pch_gbe_tx_queue(adapter, tx_ring, skb);
+	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 	return NETDEV_TX_OK;
 }
 

^ permalink raw reply related

* Re: [PATCH net-next 0/2] extend sch_mqprio to distribute traffic not only by ETS TC
From: Amir Vadai @ 2012-05-14 19:24 UTC (permalink / raw)
  To: John Fastabend
  Cc: David S. Miller, netdev, Oren Duer, Liran Liss, Jamal Hadi Salim,
	Diego Crupnicoff, Or Gerlitz
In-Reply-To: <4FAA0D2C.6040306@intel.com>

On 05/09/2012 09:22 AM, John Fastabend wrote:
> On 5/8/2012 6:56 AM, Amir Vadai wrote:
>> On 05/08/2012 03:54 AM, John Fastabend wrote:
>>> On 5/6/2012 12:05 AM, Amir Vadai wrote:
>>>> This series comes to revive the discussion initiated on the thread "net:
>>>> support tx_ring per UP in HW based QoS mechanism" (see
>>>> http://marc.info/?t=133165957200004&r=1&w=2) with the major issue to be address
>>>> is - how should sk_prio<=>    TC be done, for both, tagged and untagged traffic.
>>>> Following is a staged description addressing the background, problem
>>>> description, current situation, suggestion for the change and implementation of
>>>> it.
>>>
>>> OK but the mqprio qdisc is only concerned with mapping skb->priority to
>>> queue sets I perhaps unfortunately called the queue sets tc's. Try not
>>> to get hung up on my perhaps limiting naming of variables and functions.
>> I understand that.
>
> I figured you did. Just wanted to state it again.
>
>>
>>>
>>> mqprio is used outside of 802.1Q as well in these cases a traffic class
>>> is not usually even defined.
>>>
>>>>
>
> [...]
>
>>>> Implementation
>>>> --------------
>>>> Extended mqprio hw attribute:
>>>> * Bit 1: is queue offset/count owned by HW
>>>> * Bits 2-7: HW queueing type.
>>>>     * 0 - by ETS TC
>>>>     * 1 - by UP
>>>>
>>>> __skb_tx_hash() is now aware to the HW queuing type (pg_type): for pg_type
>>>> being ETS TC, traffic is distributed as it was before - tagged and untagged
>>>> packets are distributed by netdev_get_prio_tc_map. For pg_type being UP, tagged
>>>> and untagged packets are distributed by UP (taken from egress map for tagged
>>>> traffic, or netdev_get_prio_tc_map for untagged).
>>>
>>> I guess I don't see why we need this. If you keep the mqprio priority to
>>> queue set mapping set to 1:1. Then modify the egress map accordingly then
>>> this should work right?
>>>
>>> For example:
>>>
>>> If we want to map 8 802.1Q priority code points onto 4 traffic classes this
>>> should work,
>>>
>>> egress map: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- vlan layer inserts correct tag
>>> mqprio up2tc: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- mqprio with unfortunate 'tc' name maps priority to queues
>> But mqprio is mapping sk_priority to queue set, which is different from UP to queue set.
>> The term UP which we use, is the 8021q PCP in tagged traffic, and a priority mapped by the host admin for untagged traffic.
>> What you wrote, is actually, that the application will set the priority without enabling the host admin to control it.
>>> dcbnl up2tc: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbx pushes up2tc mapping correctly
>>
>> For example, lets take an application that calls setsockopt(SO_PRIORITY, 2):
>> according to egress map: 8021q PCP field in vlan tag should be set to 1 (=UP)
>> according to mqprio: a tx ring belonging to UP 2 will be selected.
>> according to dcbnl: traffic will have ETS attributes of TC 1
>>
>> Except some conceptual problems, it won't work:
>> 8021q PCP field is set by HW according to the tx ring (UP=2). which
>> is different from the one that the user configured in the egress_map
>> (UP=1).
>
> sorry I set it up wrong above but it _can_ be made to work as best I
> can tell (for a single egress_map at least)
>
> egress map: 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- ignored
> mqprio    : 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- map priority to up queue sets
> up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbx negotiated up2tc map
>
> With your example application does setsockopt(SO_PRIORITY, 2):
> 	according to egress map : insert PCP 2
> 	according to mqprio	: tx ring belonging to UP2 is selected
> 	accroding to dcbnl	: traffic will have ETS attributes of TC1
>
> The point is either you use the skb->priority to PCP map and then a 1:1
> mqprio map or you use the mqprio map directly. Agree?
>
> One more example translating the case with these patches onto a case
> without these patches as I understand them.
>
> first with these patches mapping:
>
> up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbnl up2tc map
> egress map: 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- sets pcp bits
> mqprio    : 0:0 1:1 2:2 3:3 4:4 5:5 6:6 7:7<-- with PGROUP_UP set
>
> equivalent mapping without PGROUP_UP set
>
> up2tc     : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- dcbnl negotiated up2tc map
> egress map: 0:0 1:0 2:0 3:0 4:0 5:0 6:0 7:0<-- default unset egress map
> mqprio    : 0:0 1:0 2:1 3:1 4:2 5:2 6:3 7:3<-- with PGROUP_UP set
>
> Application sets skb->priority to 2, in the first case egress map
> sets the PCP to 1 and mqprio maps it to queues for UP1 based on PCP bits.
>
> In case two egress map does nothing but skb->priority is still 2 so the
> mqprio maps these to queues associated with UP1 so everything works
> still.
>
> I apologize if I'm missing something here but it seems correct to me. Spell
> it out for me if I am still wrong.
>
>>
>>>
>>> We may need to fixup userspace tools some but I think the kernel mechanisms
>>> are in place.
>>>
>>> BTW I did think about this while implementing the existing code and decided
>>> that rather than create more if/else branches the case you are describing
>>> could be handled by independently controlling the priority to queue set mappings
>>> in mqprio and the egress map.
>>>
>>> Feel free to let me know where I went wrong or why this doesn't work on
>>> your hardware. I agree though we may need to fixup lldpad and maybe even
>>> 'tc' some to get this to look correct in user space and get automagically
>>> setup correctly.
>>>
>>> Thanks,
>>> .John
>>
>> I think I understand your stand, that mqprio mapping should be generic
>> and should not be TC nor UP oriented.
>>
>
> Right. Also I want to avoid user space having to somehow "know" which
> mode to put the qdisc in. Checking if the hardware is a mellanox card
> does not scale.
>
>> The problem is that our HW needs the queues to be selected by UP. ETS
>> TC mapping is configured before traffic starts, and therefore ETS
>> attributes are enforced by HW according to the UP, which the tx ring
>> belongs to. Same thing for vlan tag in tagged traffic. 8021q PCP is
>> placed by HW according to the tx ring.
>
> OK. Can you agree though in the case where we restrict the egress_map
> to be the same for all vlan's on a real_dev the existing mqprio map
> _can_ work?
>
>>
>> For backward compatibility, tagged traffic should be steered to a tx
>> ring by UP taken from egress map - skb_tx_hash() as it is today, can't
>> do it. Even if we implement ndo_select_queue(), we will need to
>> duplicate most of the code from skb_tx_hash(), because there are more
>> than one tx ring per UP, and we need skb_tx_hash() to be able to select
>> a ring from a range of rings belonging to a UP.
>
> agreed implementing this in select_queue() is no good.
>
>>
>> Also, there are some configurations that can't be done by mqprio and
>> egress map. For example when having two vlans with different egress
>> maps.
>
> This is a good example thanks. This currently doesn't work for hardware
> that maps tx_rings to traffic classes either. So if we need/want to
> solve this case then I guess we need something like your patches. Note
> I think the patch you submitted should work regardless if you map
> the PCP bits to UP queues or PCP bits to TC queues. We could use this
> in both cases which helps my user space concerns.
>
>>
>> And in the conceptual level, I think that kernel should not accept bad
>> configurations and rely on user space to protect it.
>
> I disagree policy should be managed in user space. Also I see no reason
> to create a strong coupling between egress maps and mqprio maps. I can
> imagine a case where they could be used independently. DCB is just one
> usage model for mqprio. Using a bit like you did seems fine though.
>
>>
>> - Amir
>
> Assuming we want the multiple different egress map case to work I guess
> we will have to do something like this. At least right now I can't think
> of anything better but I'll think on it tomorrow some more.
>
> The other thought would be to provide a qdisc hook before queue selection
> to attach 'tc filters' and provide a new action to hash a skb across
> queue sets.
>
> Thanks,
> John

John Hi,

After some internal discussions, it was agreed to line up with your 
approach, to leave mqprio an abstract skb->priority <=> queue set 
mapping and to ignore egress_map if mqprio is enabled.

It would be very nice, if the term 'tc' in kernel code would be replaced 
to queue set, since it is very misleading.

There still might be some small issues with skb_tx_hash for tagged 
traffic, which I will work on tomorrow, and hopefully will send a new 
patch set with the solution.

Thanks,
Amir

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox