Netdev List
* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07  3:54 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Jesse Gross, Michael Leun, Michael Chan, David Miller, Ben Greear,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20110107034130.GA18028@mcarlson.broadcom.com>

On Thursday, 06 January 2011 at 19:41 -0800, Matt Carlson wrote:
> On Thu, Jan 06, 2011 at 07:04:46PM -0800, Eric Dumazet wrote:
> > On Thursday, 06 January 2011 at 18:59 -0800, Matt Carlson wrote:
> > > On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
> > > > On Friday, 07 January 2011 at 03:41 +0100, Eric Dumazet wrote:
> > > > > On Thursday, 06 January 2011 at 18:29 -0800, Matt Carlson wrote:
> > > > > 
> > > > > > Hi Eric.  Sorry for the delay.  I was under the impression that your
> > > > > > problems were software related and that you just needed a revised
> > > > > > version of these VLAN patches I was sending to Michael.  Is this not
> > > > > > true?
> > > > > > 
> > > > > > Having a hardware stat increment suggests this is a new problem.
> > > > > > Maybe I missed it, but I didn't see what hardware you are working
> > > > > > with and whether or not management firmware was enabled.  Could you tell
> > > > > > me that info?
> > > > > > 
> > > > > 
> > > > > Hi Matt
> > > > > 
> > > > > I started a bisection, because I couldn't sleep tonight anyway :(
> > > > > 
> > > > > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> > > > > Gigabit Ethernet (rev a3)
> > > > > 	Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> > > > > 	Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> > > > > 	Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> > > > > 	Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> > > > > 	[virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> > > > > 	Capabilities: [40] PCI-X non-bridge device
> > > > > 	Capabilities: [48] Power Management version 2
> > > > > 	Capabilities: [50] Vital Product Data
> > > > > 	Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> > > > > 	Kernel driver in use: tg3
> > > > > 	Kernel modules: tg3
> > > > > 
> > > > > 
> > > > 
> > > > $ ethtool -i eth2
> > > > driver: tg3
> > > > version: 3.115
> > > > firmware-version: 5715s-v3.28
> > > > bus-info: 0000:14:04.0
> > > > $ dmesg | grep ASF
> > > > [    6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
> > > > ASF[0] TSOcap[1]
> > > > [    6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
> > > > ASF[0] TSOcap[1]
> > > 
> > > Thanks.  So management firmware is disabled.  This should be a
> > > straightforward case.
> > > 
> > > I'm wondering if I'm misunderstanding something though.  You said earlier
> > > that VLAN tagging doesn't work unless you applied my patch.  Is this no
> > > longer true?
> > > 
> > 
> > I don't apply your patch because Jesse said it was not a good patch ;)
> 
> Oh.
> 
> > Maybe I missed something and it must be applied?  The problem is that the
> > current Linus tree now includes net-next-2.6 and vlan doesn't work.
> > Perhaps you should resubmit it?
> 
> Yes, something needs to be submitted.  I want to make sure we aren't
> chasing the same problem though.  If the patch(es) fix your problem,
> then I can concentrate on finalizing the patch.
> 

I believe it did, I can test your next patch ;)

> I can combine my last patch (the one that always enabled VLAN tag
> stripping) and the previous patch (that implements all your comments so
> far) into one patch, but that still leaves the behavior Michael noted
> unaddressed.
> 
> Michael, did you ever find out whether or not RXD_FLAG_VLAN was being
> set?
> 

Here is the bisect log, just in case:

d2394e6bb1aa636f3bd142cb6f7845a4332514b5 is first bad commit
commit d2394e6bb1aa636f3bd142cb6f7845a4332514b5
Author: Matt Carlson <mcarlson@broadcom.com>
Date:   Wed Nov 24 08:31:47 2010 +0000

    tg3: Always turn on APE features in mac_mode reg
    
    The APE needs certain bits in the mac_mode register to be enabled for
    traffic to flow correctly.  This patch changes the code to always enable
    these bits in the presence of the APE.
    
    Signed-off-by: Matt Carlson <mcarlson@broadcom.com>
    Reviewed-by: Michael Chan <mchan@broadcom.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 086382ce3ecd909222faf53ca23c48d1200dd60c 0a325ac6e7aa87a6737610292ab3025156a4ed80 M	drivers



$ git bisect log
git bisect start
# bad: [3c0cb7c31c206aaedb967e44b98442bbeb17a6c4] Merge branch 'devel' of master.kernel.org:/home/rmk/linux-2.6-arm
git bisect bad 3c0cb7c31c206aaedb967e44b98442bbeb17a6c4
# good: [3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5] Linux 2.6.37
git bisect good 3c0eee3fe6a3a1c745379547c7e7c904aa64f6d5
# bad: [63e35cd9bd4c8ae085c8b9a70554595b529c4100] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/linville/wireless-next-2.6 into for-davem
git bisect bad 63e35cd9bd4c8ae085c8b9a70554595b529c4100
# bad: [67d5288049f46f816181f63eaa8f1371877ad8ea] vxge: update driver version
git bisect bad 67d5288049f46f816181f63eaa8f1371877ad8ea
# good: [b382b191ea9e9ccefc437433d23befe91f4a8925] ipv6: AF_INET6 link address family
git bisect good b382b191ea9e9ccefc437433d23befe91f4a8925
# good: [bedbbb959d2c1d1dbb4c2215f5b7074b1da3030a] ath: Add a driver_info bitmask field
git bisect good bedbbb959d2c1d1dbb4c2215f5b7074b1da3030a
# bad: [cf7afbfeb8ceb0187348d0a1a0db61305e25f05f] rtnl: make link af-specific updates atomic
git bisect bad cf7afbfeb8ceb0187348d0a1a0db61305e25f05f
# good: [22674a24b44ac53f244ef6edadd02021a270df5a] Net: dns_resolver: Makefile: Remove deprecated kbuild goal definitions
git bisect good 22674a24b44ac53f244ef6edadd02021a270df5a
# bad: [cf79003d598b1f82a4caa0564107283b4f560e14] tg3: Fix 5719 internal FIFO overflow problem
git bisect bad cf79003d598b1f82a4caa0564107283b4f560e14
# good: [094f2faaa2c4973e50979158f655a1d31a97ba98] Net: rds: Makefile: Remove deprecated items
git bisect good 094f2faaa2c4973e50979158f655a1d31a97ba98
# good: [04f6d70f6e64900a5d70a5fc199dd9d5fa787738] SELinux: Only return netlink error when we know the return is fatal
git bisect good 04f6d70f6e64900a5d70a5fc199dd9d5fa787738
# good: [5093eedc8bdfd7d906836a44a248f66a99e27d22] tg3: Apply 10Mbps fix to all 57765 revisions
git bisect good 5093eedc8bdfd7d906836a44a248f66a99e27d22
# bad: [d2394e6bb1aa636f3bd142cb6f7845a4332514b5] tg3: Always turn on APE features in mac_mode reg
git bisect bad d2394e6bb1aa636f3bd142cb6f7845a4332514b5
# good: [b75cc0e4c1caac63941d96a73b2214e8007b934b] tg3: Assign correct tx margin for 5719
git bisect good b75cc0e4c1caac63941d96a73b2214e8007b934b


* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Matt Carlson @ 2011-01-07  3:41 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Matthew Carlson, Jesse Gross, Michael Leun, Michael Chan,
	David Miller, Ben Greear, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <1294369486.2704.53.camel@edumazet-laptop>

On Thu, Jan 06, 2011 at 07:04:46PM -0800, Eric Dumazet wrote:
> On Thursday, 06 January 2011 at 18:59 -0800, Matt Carlson wrote:
> > On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
> > > On Friday, 07 January 2011 at 03:41 +0100, Eric Dumazet wrote:
> > > > On Thursday, 06 January 2011 at 18:29 -0800, Matt Carlson wrote:
> > > > 
> > > > > Hi Eric.  Sorry for the delay.  I was under the impression that your
> > > > > problems were software related and that you just needed a revised
> > > > > version of these VLAN patches I was sending to Michael.  Is this not
> > > > > true?
> > > > > 
> > > > > Having a hardware stat increment suggests this is a new problem.
> > > > > Maybe I missed it, but I didn't see what hardware you are working
> > > > > with and whether or not management firmware was enabled.  Could you tell
> > > > > me that info?
> > > > > 
> > > > 
> > > > Hi Matt
> > > > 
> > > > I started a bisection, because I couldn't sleep tonight anyway :(
> > > > 
> > > > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> > > > Gigabit Ethernet (rev a3)
> > > > 	Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> > > > 	Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> > > > 	Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> > > > 	Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> > > > 	[virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> > > > 	Capabilities: [40] PCI-X non-bridge device
> > > > 	Capabilities: [48] Power Management version 2
> > > > 	Capabilities: [50] Vital Product Data
> > > > 	Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> > > > 	Kernel driver in use: tg3
> > > > 	Kernel modules: tg3
> > > > 
> > > > 
> > > 
> > > $ ethtool -i eth2
> > > driver: tg3
> > > version: 3.115
> > > firmware-version: 5715s-v3.28
> > > bus-info: 0000:14:04.0
> > > $ dmesg | grep ASF
> > > [    6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
> > > ASF[0] TSOcap[1]
> > > [    6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
> > > ASF[0] TSOcap[1]
> > 
> > Thanks.  So management firmware is disabled.  This should be a
> > straightforward case.
> > 
> > I'm wondering if I'm misunderstanding something though.  You said earlier
> > that VLAN tagging doesn't work unless you applied my patch.  Is this no
> > longer true?
> > 
> 
> I don't apply your patch because Jesse said it was not a good patch ;)

Oh.

> Maybe I missed something and it must be applied?  The problem is that the
> current Linus tree now includes net-next-2.6 and vlan doesn't work.
> Perhaps you should resubmit it?

Yes, something needs to be submitted.  I want to make sure we aren't
chasing the same problem though.  If the patch(es) fix your problem,
then I can concentrate on finalizing the patch.

I can combine my last patch (the one that always enabled VLAN tag
stripping) and the previous patch (that implements all your comments so
far) into one patch, but that still leaves the behavior Michael noted
unaddressed.

Michael, did you ever find out whether or not RXD_FLAG_VLAN was being
set?



* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Matt Carlson @ 2011-01-07  3:24 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Michael Leun, Matthew Carlson, Michael Chan, Eric Dumazet,
	David Miller, Ben Greear, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <AANLkTi=hGRW+DwoKQzxPAZao6-y_tvn6nNXS-tj_-Y6T@mail.gmail.com>

On Sat, Dec 18, 2010 at 07:38:00PM -0800, Jesse Gross wrote:
> On Tue, Dec 14, 2010 at 11:16 PM, Michael Leun
> <lkml20101129@newton.leun.net> wrote:
> > OK - all tests done on that DL320G5:
> >
> > For completeness, 2.6.37-rc5 unpatched:
> >
> > eth0, no vlan configured: totally broken - see double tagged vlans
> > without tag, single or untagged packets missing at all
> 
> Random behavior?  This one is somewhat hard to explain - maybe there
> are some other factors.  eth0 has ASF on, so it always strips tags.  I
> would expect it to behave like the vlan configured case.
> 
> >
> > eth0, vlan configured: see packets without vlan tag (see double tagged
> > packets with one vlan tag)
> 
> Both ASF and vlan group configured cause tag stripping to be enabled.
> Missing tag.
> 
> >
> > eth1 same as originally reported:
> > without vlan configured see vlan tags (single and double tagged as
> > expected)
> 
> No ASF and no vlan group means tag stripping is disabled.  Have tag.
> 
> > with vlan configured: see packets without vlan tag (see double tagged
> > packets with one vlan tag)
> 
> Configuring vlan group causes stripping to be enabled.  Missing tag.
> 
> >
> >
> > 2.6.37-rc5, your tg3 use new vlan-code patch:
> >
> > eth0, no vlan configured: see packets without vlan tag (see double
> > tagged packets with one vlan tag)
> 
> ASF enables tag stripping.  Missing tag.
> 
> > eth1, no vlan configured: see vlan tags (single and double tagged as
> > expected)
> 
> No ASF, no vlan group means no stripping.  Have tag.
> 
> >
> >
> > eth0, vlan configured: as without vlan
> 
> ASF enables stripping.  Missing tag.
> 
> > eth1, vlan configured: as without vlan
> 
> With this patch vlan stripping is only enabled when ASF is on, so no
> stripping.  Have tag.
> 
> >
> > 2.6.37-rc5, your tg3 use new vlan-code patch with test patch ontop
> >
> > eth1 no vlan configured: see packets without vlan tag (see double tagged
> > packets with one vlan tag)
> 
> With the second patch, vlan stripping is always enabled.  Missing tag.
> 
> > eth1 with vlan: the same
> 
> Stripping still always enabled.  Missing tag.
> 
> The bottom line is whenever vlan stripping is enabled we're missing
> the outer tag.  It might be worth adding some debugging in the area
> before napi_gro_receive/vlan_gro_receive (depending on version).  My
> guess is that (desc->type_flags & RXD_FLAG_VLAN) is false even for
> vlan packets on this NIC.
> 
> You said that everything works on the 5752?  Matt, is it possible that
> the 5714 either has a problem with vlan stripping or a different way
> of reporting it?

I don't think this is a 5714 specific issue.  I think the problem is
rooted in the fact that the VLAN tag stripping is enabled.

Your RXD_FLAG_VLAN idea sounds unlikely to me, but it's worth a check.

The patch here is using __vlan_hwaccel_put_tag(), which informs the
stack a VLAN tag is present.  If this is indeed a reporting problem, I'm
not sure what else the driver should be doing.

> Also, why does ASF require vlan stripping?

This is a firmware limitation.


* Re: [PATCH] ehea: Add some info messages and fix an issue
From: Anton Blanchard @ 2011-01-07  3:24 UTC (permalink / raw)
  To: leitao; +Cc: joe, netdev, davem
In-Reply-To: <1290792387-12331-1-git-send-email-leitao@linux.vnet.ibm.com>

Hi,

> From: Breno Leitao <breno@cafe.(none)>
> 
> This patch adds some debug information about ehea not being able to
> allocate enough spaces. Also it correctly updates the amount of
> available skb.

I'm seeing issues on a number of machines with the ehea device.
Sometime after boot I see a bunch of:

ehea: Error in ehea_proc_rwqes: LL rq1: skb=NULL
ehea: Error in ehea_proc_rwqes: LL rq1: skb=NULL
ehea: Error in ehea_proc_rwqes: LL rq1: skb=NULL
ehea: Error in ehea_proc_rwqes: LL rq1: skb=NULL

which eventually stop.

-       for (i = 0; i < pr->rq1_skba.len; i++) {
+       for (i = 0; i < nr_rq1a; i++) {

It looks like you are now only initialising half the ring, but still
telling the hardware to use the whole ring. Once you get through the
entire ring once the errors go away.
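
The suspected failure mode can be sketched in user space. This is only a
toy model, not ehea code: the ring length, the struct, and both function
names are hypothetical, and it assumes the receive path refills a slot
after finding it empty, which is what the errors stopping after one pass
would suggest.

```c
#include <stddef.h>

#define RING_LEN 8  /* hypothetical ring size, chosen for illustration */

/* Hypothetical stand-in for the driver's rq1 skb array. */
struct ring_sketch {
    int filled[RING_LEN];  /* 1 if the slot currently holds a buffer */
};

/* Populate only the first nr slots, as a shortened loop bound would. */
static void ring_init(struct ring_sketch *r, size_t nr)
{
    size_t i;

    for (i = 0; i < RING_LEN; i++)
        r->filled[i] = i < nr;
}

/*
 * One full pass of the hardware over the whole ring. Each empty slot
 * corresponds to one "skb=NULL" message; every slot is refilled after
 * use (an assumption about the receive path, see above).
 */
static unsigned int ring_pass(struct ring_sketch *r)
{
    unsigned int missing = 0;
    size_t i;

    for (i = 0; i < RING_LEN; i++) {
        if (!r->filled[i])
            missing++;
        r->filled[i] = 1;
    }
    return missing;
}
```

With only half the slots populated, the first pass reports RING_LEN / 2
empty slots and the second pass reports none, which matches the pattern
of errors that appear after boot and eventually stop.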

Anton


* [net-next-2.6 PATCH v6 2/2] net_sched: implement a root container qdisc sch_mqprio
From: John Fastabend @ 2011-01-07  3:12 UTC (permalink / raw)
  To: davem, jarkao2
  Cc: hadi, eric.dumazet, shemminger, tgraf, bhutchings, nhorman,
	netdev
In-Reply-To: <20110107031211.2446.35715.stgit@jf-dev1-dcblab>

This implements a mqprio queueing discipline that by default creates
a pfifo_fast qdisc per tx queue and provides the needed configuration
interface.

Using the mqprio qdisc the number of tcs currently in use along
with the range of queues allotted to each class can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.

Configurable parameters:

struct tc_mqprio_qopt {
        __u8    num_tc;
        __u8    prio_tc_map[TC_BITMASK + 1];
        __u8    hw;
        __u16   count[TC_MAX_QUEUE];
        __u16   offset[TC_MAX_QUEUE];
};

Here the count/offset pairing gives the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc.
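
As a rough illustration of how these fields fit together, here is a
user-space sketch. The struct mirrors the layout quoted above with
fixed-width types standing in for __u8/__u16; the example values (3
traffic classes over 8 tx queues) and the helper names prio_to_tc() and
tc_queue_range() are made up for this sketch and are not part of the
patch.

```c
#include <stdint.h>

#define TC_BITMASK   15
#define TC_MAX_QUEUE 16

/* User-space mirror of the quoted layout (illustration only). */
struct qopt_sketch {
    uint8_t  num_tc;
    uint8_t  prio_tc_map[TC_BITMASK + 1];
    uint8_t  hw;
    uint16_t count[TC_MAX_QUEUE];
    uint16_t offset[TC_MAX_QUEUE];
};

/* Hypothetical configuration: 3 traffic classes over 8 tx queues. */
static const struct qopt_sketch example = {
    .num_tc      = 3,
    /* priorities 0-1 -> tc 0, 2-3 -> tc 1, 4-7 -> tc 2, 8-15 -> tc 0 */
    .prio_tc_map = { 0, 0, 1, 1, 2, 2, 2, 2 },
    .hw          = 0,
    .count       = { 4, 2, 2 },
    .offset      = { 0, 4, 6 },
};

/* Traffic class for a priority; only the low 4 bits index the map. */
static unsigned int prio_to_tc(const struct qopt_sketch *q, uint32_t priority)
{
    return q->prio_tc_map[priority & TC_BITMASK];
}

/* First and last tx queue index that belong to a traffic class. */
static void tc_queue_range(const struct qopt_sketch *q, unsigned int tc,
                           unsigned int *first, unsigned int *last)
{
    *first = q->offset[tc];
    *last  = q->offset[tc] + q->count[tc] - 1;
}
```

In this example, priority 2 resolves to tc 1, which owns queues 4 and 5.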

The hw bit determines if the hardware should configure the count
and offset values. If the hardware bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.
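
The bounds checking described here can be tried out in user space. The
sketch below follows the same comparisons as mqprio_parse_opt() in this
patch, but the function name check_queue_ranges(), its parameter list,
and the -1 return value are choices made for this illustration only.

```c
#include <stdint.h>

/*
 * Returns 0 if every class has a non-empty queue range that fits within
 * num_tx_queues and does not run into a later class, -1 otherwise.
 */
static int check_queue_ranges(unsigned int num_tc,
                              const uint16_t *count,
                              const uint16_t *offset,
                              unsigned int num_tx_queues)
{
    unsigned int i, j;

    for (i = 0; i < num_tc; i++) {
        /* One past the last queue used by class i. */
        unsigned int last = offset[i] + count[i];

        if (offset[i] >= num_tx_queues || !count[i] ||
            last > num_tx_queues)
            return -1;

        /* Compare against every later class. */
        for (j = i + 1; j < num_tc; j++) {
            if (last > offset[j])
                return -1;
        }
    }
    return 0;
}
```

One consequence of comparing each range only against the offsets of
later classes is that classes have to be listed in ascending offset
order; a non-overlapping layout given in a different order would be
rejected by this check.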

It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_setup_tc().

One expected use case is for drivers to use ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/pkt_sched.h |   12 +
 net/sched/Kconfig         |   12 +
 net/sched/Makefile        |    1 
 net/sched/sch_generic.c   |    4 
 net/sched/sch_mqprio.c    |  415 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 444 insertions(+), 0 deletions(-)
 create mode 100644 net/sched/sch_mqprio.c

diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 2cfa4bc..776cd93 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -481,4 +481,16 @@ struct tc_drr_stats {
 	__u32	deficit;
 };
 
+/* MQPRIO */
+#define TC_QOPT_BITMASK 15
+#define TC_QOPT_MAX_QUEUE 16
+
+struct tc_mqprio_qopt {
+	__u8	num_tc;
+	__u8	prio_tc_map[TC_QOPT_BITMASK + 1];
+	__u8	hw;
+	__u16	count[TC_QOPT_MAX_QUEUE];
+	__u16	offset[TC_QOPT_MAX_QUEUE];
+};
+
 #endif
diff --git a/net/sched/Kconfig b/net/sched/Kconfig
index a36270a..f52f5eb 100644
--- a/net/sched/Kconfig
+++ b/net/sched/Kconfig
@@ -205,6 +205,18 @@ config NET_SCH_DRR
 
 	  If unsure, say N.
 
+config NET_SCH_MQPRIO
+	tristate "Multi-queue priority scheduler (MQPRIO)"
+	help
+	  Say Y here if you want to use the Multi-queue Priority scheduler.
+	  This scheduler allows QOS to be offloaded on NICs that have support
+	  for offloading QOS schedulers.
+
+	  To compile this driver as a module, choose M here: the module will
+	  be called sch_mqprio.
+
+	  If unsure, say N.
+
 config NET_SCH_INGRESS
 	tristate "Ingress Qdisc"
 	depends on NET_CLS_ACT
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 960f5db..26ce681 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -32,6 +32,7 @@ obj-$(CONFIG_NET_SCH_MULTIQ)	+= sch_multiq.o
 obj-$(CONFIG_NET_SCH_ATM)	+= sch_atm.o
 obj-$(CONFIG_NET_SCH_NETEM)	+= sch_netem.o
 obj-$(CONFIG_NET_SCH_DRR)	+= sch_drr.o
+obj-$(CONFIG_NET_SCH_MQPRIO)	+= sch_mqprio.o
 obj-$(CONFIG_NET_CLS_U32)	+= cls_u32.o
 obj-$(CONFIG_NET_CLS_ROUTE4)	+= cls_route.o
 obj-$(CONFIG_NET_CLS_FW)	+= cls_fw.o
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 34dc598..723b278 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -540,6 +540,7 @@ struct Qdisc_ops pfifo_fast_ops __read_mostly = {
 	.dump		=	pfifo_fast_dump,
 	.owner		=	THIS_MODULE,
 };
+EXPORT_SYMBOL(pfifo_fast_ops);
 
 struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 			  struct Qdisc_ops *ops)
@@ -674,6 +675,7 @@ struct Qdisc *dev_graft_qdisc(struct netdev_queue *dev_queue,
 
 	return oqdisc;
 }
+EXPORT_SYMBOL(dev_graft_qdisc);
 
 static void attach_one_default_qdisc(struct net_device *dev,
 				     struct netdev_queue *dev_queue,
@@ -761,6 +763,7 @@ void dev_activate(struct net_device *dev)
 		dev_watchdog_up(dev);
 	}
 }
+EXPORT_SYMBOL(dev_activate);
 
 static void dev_deactivate_queue(struct net_device *dev,
 				 struct netdev_queue *dev_queue,
@@ -840,6 +843,7 @@ void dev_deactivate(struct net_device *dev)
 	list_add(&dev->unreg_list, &single);
 	dev_deactivate_many(&single);
 }
+EXPORT_SYMBOL(dev_deactivate);
 
 static void dev_init_scheduler_queue(struct net_device *dev,
 				     struct netdev_queue *dev_queue,
diff --git a/net/sched/sch_mqprio.c b/net/sched/sch_mqprio.c
new file mode 100644
index 0000000..4363f95
--- /dev/null
+++ b/net/sched/sch_mqprio.c
@@ -0,0 +1,415 @@
+/*
+ * net/sched/sch_mqprio.c
+ *
+ * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+
+struct mqprio_sched {
+	struct Qdisc		**qdiscs;
+	int hw_owned;
+};
+
+static void mqprio_destroy(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	unsigned int ntx;
+
+	if (!priv->qdiscs)
+		return;
+
+	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+		qdisc_destroy(priv->qdiscs[ntx]);
+
+	if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
+		dev->netdev_ops->ndo_setup_tc(dev, 0, dev->real_num_tx_queues);
+	else
+		netdev_set_num_tc(dev, 0);
+
+	kfree(priv->qdiscs);
+}
+
+static int mqprio_parse_opt(struct net_device *dev, struct tc_mqprio_qopt *qopt)
+{
+	int i, j;
+
+	/* Verify num_tc is not out of max range */
+	if (qopt->num_tc > TC_QOPT_MAX_QUEUE)
+		return -EINVAL;
+
+	/* Verify priority mapping uses valid tcs */
+	for (i = 0; i < TC_QOPT_BITMASK + 1; i++) {
+		if (qopt->prio_tc_map[i] >= qopt->num_tc)
+			return -EINVAL;
+	}
+
+	/* net_device does not support requested operation */
+	if (qopt->hw && !dev->netdev_ops->ndo_setup_tc)
+		return -EINVAL;
+
+	/* if hw owned qcount and qoffset are taken from LLD so
+	 * no reason to verify them here
+	 */
+	if (qopt->hw)
+		return 0;
+
+	for (i = 0; i < qopt->num_tc; i++) {
+		unsigned int last = qopt->offset[i] + qopt->count[i];
+
+		/* Verify the queue count is in tx range being equal to the
+		 * real_num_tx_queues indicates the last queue is in use.
+		 */
+		if (qopt->offset[i] >= dev->real_num_tx_queues ||
+		    !qopt->count[i] ||
+		    last > dev->real_num_tx_queues)
+			return -EINVAL;
+
+		/* Verify that the offset and counts do not overlap */
+		for (j = i + 1; j < qopt->num_tc; j++) {
+			if (last > qopt->offset[j])
+				return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int mqprio_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct netdev_queue *dev_queue;
+	struct Qdisc *qdisc;
+	int i, err = -EOPNOTSUPP;
+	struct tc_mqprio_qopt *qopt = NULL;
+
+	if (sch->parent != TC_H_ROOT)
+		return -EOPNOTSUPP;
+
+	if (!netif_is_multiqueue(dev))
+		return -EOPNOTSUPP;
+
+	if (nla_len(opt) < sizeof(*qopt))
+		return -EINVAL;
+
+	qopt = nla_data(opt);
+	if (mqprio_parse_opt(dev, qopt))
+		return -EINVAL;
+
+	/* pre-allocate qdisc, attachment can't fail */
+	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
+			       GFP_KERNEL);
+	if (priv->qdiscs == NULL) {
+		err = -ENOMEM;
+		goto err;
+	}
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		dev_queue = netdev_get_tx_queue(dev, i);
+		qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
+					  TC_H_MAKE(TC_H_MAJ(sch->handle),
+						    TC_H_MIN(i + 1)));
+		if (qdisc == NULL) {
+			err = -ENOMEM;
+			goto err;
+		}
+		qdisc->flags |= TCQ_F_CAN_BYPASS;
+		priv->qdiscs[i] = qdisc;
+	}
+
+	/* If the mqprio options indicate that hardware should own
+	 * the queue mapping then run ndo_setup_tc otherwise use the
+	 * supplied and verified mapping
+	 */
+	if (qopt->hw) {
+		priv->hw_owned = 1;
+		err = dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc,
+						    dev->real_num_tx_queues);
+		if (err)
+			goto err;
+	} else {
+		netdev_set_num_tc(dev, qopt->num_tc);
+		for (i = 0; i < qopt->num_tc; i++)
+			netdev_set_tc_queue(dev, i,
+					    qopt->count[i], qopt->offset[i]);
+	}
+
+	/* Always use supplied priority mappings */
+	for (i = 0; i < TC_QOPT_BITMASK + 1; i++)
+		netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i]);
+
+	sch->flags |= TCQ_F_MQROOT;
+	return 0;
+
+err:
+	mqprio_destroy(sch);
+	return err;
+}
+
+static void mqprio_attach(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	unsigned int ntx;
+
+	/* Attach underlying qdisc */
+	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+		qdisc = priv->qdiscs[ntx];
+		qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
+		if (qdisc)
+			qdisc_destroy(qdisc);
+	}
+	kfree(priv->qdiscs);
+	priv->qdiscs = NULL;
+}
+
+static struct netdev_queue *mqprio_queue_get(struct Qdisc *sch,
+					     unsigned long cl)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned long ntx = cl - 1 - netdev_get_num_tc(dev);
+
+	if (ntx >= dev->num_tx_queues)
+		return NULL;
+	return netdev_get_tx_queue(dev, ntx);
+}
+
+static int mqprio_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
+		    struct Qdisc **old)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+	if (!dev_queue)
+		return -EINVAL;
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+
+	*old = dev_graft_qdisc(dev_queue, new);
+
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	return 0;
+}
+
+static int mqprio_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mqprio_sched *priv = qdisc_priv(sch);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tc_mqprio_qopt opt;
+	struct Qdisc *qdisc;
+	unsigned int i;
+
+	sch->q.qlen = 0;
+	memset(&sch->bstats, 0, sizeof(sch->bstats));
+	memset(&sch->qstats, 0, sizeof(sch->qstats));
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+		spin_lock_bh(qdisc_lock(qdisc));
+		sch->q.qlen		+= qdisc->q.qlen;
+		sch->bstats.bytes	+= qdisc->bstats.bytes;
+		sch->bstats.packets	+= qdisc->bstats.packets;
+		sch->qstats.qlen	+= qdisc->qstats.qlen;
+		sch->qstats.backlog	+= qdisc->qstats.backlog;
+		sch->qstats.drops	+= qdisc->qstats.drops;
+		sch->qstats.requeues	+= qdisc->qstats.requeues;
+		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
+
+	opt.num_tc = netdev_get_num_tc(dev);
+	memcpy(opt.prio_tc_map, dev->prio_tc_map, sizeof(opt.prio_tc_map));
+	opt.hw = priv->hw_owned;
+
+	for (i = 0; i < netdev_get_num_tc(dev); i++) {
+		opt.count[i] = dev->tc_to_txq[i].count;
+		opt.offset[i] = dev->tc_to_txq[i].offset;
+	}
+
+	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	return skb->len;
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static struct Qdisc *mqprio_leaf(struct Qdisc *sch, unsigned long cl)
+{
+	struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+	if (!dev_queue)
+		return NULL;
+
+	return dev_queue->qdisc_sleeping;
+}
+
+static unsigned long mqprio_get(struct Qdisc *sch, u32 classid)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntx = TC_H_MIN(classid);
+
+	if (ntx > dev->num_tx_queues + netdev_get_num_tc(dev))
+		return 0;
+	return ntx;
+}
+
+static void mqprio_put(struct Qdisc *sch, unsigned long cl)
+{
+}
+
+static int mqprio_dump_class(struct Qdisc *sch, unsigned long cl,
+			 struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct net_device *dev = qdisc_dev(sch);
+
+	if (cl <= netdev_get_num_tc(dev)) {
+		tcm->tcm_parent = TC_H_ROOT;
+		tcm->tcm_info = 0;
+	} else {
+		int i;
+		struct netdev_queue *dev_queue;
+
+		dev_queue = mqprio_queue_get(sch, cl);
+		tcm->tcm_parent = 0;
+		for (i = 0; i < netdev_get_num_tc(dev); i++) {
+			struct netdev_tc_txq tc = dev->tc_to_txq[i];
+			int q_idx = cl - netdev_get_num_tc(dev);
+
+			if (q_idx > tc.offset &&
+			    q_idx <= tc.offset + tc.count) {
+				tcm->tcm_parent =
+					TC_H_MAKE(TC_H_MAJ(sch->handle),
+						  TC_H_MIN(i + 1));
+				break;
+			}
+		}
+		tcm->tcm_info = dev_queue->qdisc_sleeping->handle;
+	}
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	return 0;
+}
+
+static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+			       struct gnet_dump *d)
+{
+	struct net_device *dev = qdisc_dev(sch);
+
+	if (cl <= netdev_get_num_tc(dev)) {
+		int i;
+		struct Qdisc *qdisc;
+		struct gnet_stats_queue qstats = {0};
+		struct gnet_stats_basic_packed bstats = {0};
+		struct netdev_tc_txq tc = dev->tc_to_txq[cl - 1];
+
+		/* Drop the lock here; it will be reclaimed before touching
+		 * statistics. This is required because the d->lock we
+		 * hold here is the lock on dev_queue->qdisc_sleeping,
+		 * which is also acquired below.
+		 */
+		spin_unlock_bh(d->lock);
+
+		for (i = tc.offset; i < tc.offset + tc.count; i++) {
+			qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+			spin_lock_bh(qdisc_lock(qdisc));
+			bstats.bytes      += qdisc->bstats.bytes;
+			bstats.packets    += qdisc->bstats.packets;
+			qstats.qlen       += qdisc->qstats.qlen;
+			qstats.backlog    += qdisc->qstats.backlog;
+			qstats.drops      += qdisc->qstats.drops;
+			qstats.requeues   += qdisc->qstats.requeues;
+			qstats.overlimits += qdisc->qstats.overlimits;
+			spin_unlock_bh(qdisc_lock(qdisc));
+		}
+		/* Reclaim root sleeping lock before completing stats */
+		spin_lock_bh(d->lock);
+		if (gnet_stats_copy_basic(d, &bstats) < 0 ||
+		    gnet_stats_copy_queue(d, &qstats) < 0)
+			return -1;
+	} else {
+		struct netdev_queue *dev_queue = mqprio_queue_get(sch, cl);
+
+		sch = dev_queue->qdisc_sleeping;
+		sch->qstats.qlen = sch->q.qlen;
+		if (gnet_stats_copy_basic(d, &sch->bstats) < 0 ||
+		    gnet_stats_copy_queue(d, &sch->qstats) < 0)
+			return -1;
+	}
+	return 0;
+}
+
+static void mqprio_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned long ntx;
+
+	if (arg->stop)
+		return;
+
+	/* Walk hierarchy with a virtual class per tc */
+	arg->count = arg->skip;
+	for (ntx = arg->skip;
+	     ntx < dev->num_tx_queues + netdev_get_num_tc(dev);
+	     ntx++) {
+		if (arg->fn(sch, ntx + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops mqprio_class_ops = {
+	.graft		= mqprio_graft,
+	.leaf		= mqprio_leaf,
+	.get		= mqprio_get,
+	.put		= mqprio_put,
+	.walk		= mqprio_walk,
+	.dump		= mqprio_dump_class,
+	.dump_stats	= mqprio_dump_class_stats,
+};
+
+struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
+	.cl_ops		= &mqprio_class_ops,
+	.id		= "mqprio",
+	.priv_size	= sizeof(struct mqprio_sched),
+	.init		= mqprio_init,
+	.destroy	= mqprio_destroy,
+	.attach		= mqprio_attach,
+	.dump		= mqprio_dump,
+	.owner		= THIS_MODULE,
+};
+
+static int __init mqprio_module_init(void)
+{
+	return register_qdisc(&mqprio_qdisc_ops);
+}
+
+static void __exit mqprio_module_exit(void)
+{
+	unregister_qdisc(&mqprio_qdisc_ops);
+}
+
+module_init(mqprio_module_init);
+module_exit(mqprio_module_exit);
+
+MODULE_LICENSE("GPL");
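
For readers following the class handling in mqprio_queue_get() and
mqprio_dump_class() above, the class id layout appears to be: ids 1
through num_tc name the virtual per-tc classes, and the ids after that
name individual tx queues. The user-space sketch below restates that
arithmetic; the helper names cl_is_tc() and cl_to_txq() are invented for
this illustration and do not exist in the patch.

```c
/* Returns 1 if the class id names one of the virtual per-tc classes. */
static int cl_is_tc(unsigned long cl, unsigned int num_tc)
{
    return cl >= 1 && cl <= num_tc;
}

/*
 * Returns the tx queue index named by a class id, or -1 if the id does
 * not name a queue. Mirrors "cl - 1 - netdev_get_num_tc(dev)".
 */
static long cl_to_txq(unsigned long cl, unsigned int num_tc,
                      unsigned int num_tx_queues)
{
    if (cl <= num_tc || cl > (unsigned long)num_tc + num_tx_queues)
        return -1;
    return (long)(cl - 1 - num_tc);
}
```

With 3 traffic classes and 8 tx queues, class ids 1 to 3 are the traffic
classes and ids 4 to 11 map to tx queues 0 to 7.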



* [net-next-2.6 PATCH v6 1/2] net: implement mechanism for HW based QOS
From: John Fastabend @ 2011-01-07  3:12 UTC (permalink / raw)
  To: davem, jarkao2
  Cc: hadi, eric.dumazet, shemminger, tgraf, bhutchings, nhorman,
	netdev

This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
for hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock, while reliably receiving skbs on the correct tx ring
to avoid head of line blocking resulting from shuffling in
the LLD. Finally, all the goodness from txq caching and xps/rps
can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

By using select_queue for this, drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally if admins are expected to build a
qdisc and filter rules to steer traffic this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
Also this approach incurs all the overhead of a qdisc with filters.

With the mechanism in this patch users can set skb priority using
expected methods, i.e. setsockopt(), or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case, with
a single traffic class and all queues in this class, everything
works as is until the LLD enables multiple tcs.

To steer the skb we mask the priority down to its lower 4 bits
and allow the hardware to configure up to 16 distinct classes
of traffic (TC_MAX_QUEUE). This is expected to be sufficient for
most applications; at any rate it is more than the 8021Q spec
designates and is equal to the number of prio bands currently
implemented in the default qdisc.

This, in conjunction with a userspace application such as
lldpad, can be used to implement 8021Q transmission selection
algorithms, one of these being the extended transmission
selection algorithm currently used for DCB.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |   65 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |   52 +++++++++++++++++++++++++++++++++++-
 2 files changed, 116 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0f6b1c9..12fff42 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -646,6 +646,14 @@ struct xps_dev_maps {
     (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */
 
+#define TC_MAX_QUEUE	16
+#define TC_BITMASK	15
+/* HW offloaded queuing disciplines txq count and offset maps */
+struct netdev_tc_txq {
+	u16 count;
+	u16 offset;
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -756,6 +764,7 @@ struct xps_dev_maps {
  * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
  *			  struct nlattr *port[]);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ * int (*ndo_setup_tc)(struct net_device *dev, u8 tc, unsigned int txq);
  */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -814,6 +823,8 @@ struct net_device_ops {
 						   struct nlattr *port[]);
 	int			(*ndo_get_vf_port)(struct net_device *dev,
 						   int vf, struct sk_buff *skb);
+	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc,
+						unsigned int txq);
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	int			(*ndo_fcoe_enable)(struct net_device *dev);
 	int			(*ndo_fcoe_disable)(struct net_device *dev);
@@ -1146,6 +1157,9 @@ struct net_device {
 	/* Data Center Bridging netlink ops */
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
+	u8 num_tc;
+	struct netdev_tc_txq tc_to_txq[TC_MAX_QUEUE];
+	u8 prio_tc_map[TC_BITMASK + 1];
 
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	/* max exchange id for FCoE LRO by ddp */
@@ -1162,6 +1176,57 @@ struct net_device {
 #define	NETDEV_ALIGN		32
 
 static inline
+int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
+{
+	return dev->prio_tc_map[prio & TC_BITMASK];
+}
+
+static inline
+int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->prio_tc_map[prio & TC_BITMASK] = tc & TC_BITMASK;
+	return 0;
+}
+
+static inline
+void netdev_reset_tc(struct net_device *dev)
+{
+	dev->num_tc = 0;
+	memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
+	memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
+}
+
+static inline
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->tc_to_txq[tc].count = count;
+	dev->tc_to_txq[tc].offset = offset;
+	return 0;
+}
+
+static inline
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
+{
+	if (num_tc > TC_MAX_QUEUE)
+		return -EINVAL;
+
+	dev->num_tc = num_tc;
+	return 0;
+}
+
+static inline
+int netdev_get_num_tc(struct net_device *dev)
+{
+	return dev->num_tc;
+}
+
+static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index a215269..12a2c2a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1593,6 +1593,45 @@ static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 	rcu_read_unlock();
 }
 
+/* netif_setup_tc - Handle tc mappings on real_num_tx_queues change
+ * @dev: Network device
+ * @txq: number of queues available
+ *
+ * If real_num_tx_queues is changed the tc mappings may no longer be
+ * valid. To resolve this if the net_device supports ndo_setup_tc
+ * call the ops routine with the new queue number. If the ops is not
+ * available verify the tc mapping remains valid and if not NULL the
+ * mapping. With no priorities mapping to this offset/count pair it
+ * will no longer be used. In the worst case TC0 is invalid nothing
+ * can be done so disable priority mappings.
+ */
+void netif_setup_tc(struct net_device *dev, unsigned int txq)
+{
+	const struct net_device_ops *ops = dev->netdev_ops;
+
+	if (ops->ndo_setup_tc) {
+		ops->ndo_setup_tc(dev, dev->num_tc, txq);
+	} else {
+		int i;
+		struct netdev_tc_txq *tc = &dev->tc_to_txq[0];
+
+		/* If TC0 is invalidated disable TC mapping */
+		if (tc->offset + tc->count > txq) {
+			dev->num_tc = 0;
+			return;
+		}
+
+		/* Invalidated prio to tc mappings set to TC0 */
+		for (i = 1; i < TC_BITMASK + 1; i++) {
+			int q = netdev_get_prio_tc_map(dev, i);
+			tc = &dev->tc_to_txq[q];
+
+			if (tc->offset + tc->count > txq)
+				netdev_set_prio_tc_map(dev, i, 0);
+		}
+	}
+}
+
 /*
  * Routine to help set real_num_tx_queues. To avoid skbs mapped to queues
  * greater then real_num_tx_queues stale skbs on the qdisc must be flushed.
@@ -1614,6 +1653,9 @@ int netif_set_real_num_tx_queues(struct net_device *dev, unsigned int txq)
 
 		if (txq < dev->real_num_tx_queues)
 			qdisc_reset_all_tx_gt(dev, txq);
+
+		if (dev->num_tc)
+			netif_setup_tc(dev, txq);
 	}
 
 	dev->real_num_tx_queues = txq;
@@ -2165,6 +2207,8 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		  unsigned int num_tx_queues)
 {
 	u32 hash;
+	u16 qoffset = 0;
+	u16 qcount = num_tx_queues;
 
 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
@@ -2173,13 +2217,19 @@ u16 __skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb,
 		return hash;
 	}
 
+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else
 		hash = (__force u16) skb->protocol ^ skb->rxhash;
 	hash = jhash_1word(hash, hashrnd);
 
-	return (u16) (((u64) hash * num_tx_queues) >> 32);
+	return (u16) (((u64) hash * qcount) >> 32) + qoffset;
 }
 EXPORT_SYMBOL(__skb_tx_hash);
 

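For readers following the queue selection described in this patch, here is a rough user-space sketch of the same arithmetic that the patch adds to __skb_tx_hash(). The struct tc_dev type, the pick_txq() and example_dev() helpers, the real_num_tx_queues stand-in field and the 8-queue layout are all assumptions made for illustration; none of them are part of the patch.

```c
#include <stdint.h>

#define TC_MAX_QUEUE 16
#define TC_BITMASK   15

/* User-space stand-ins for the fields the patch adds to struct net_device. */
struct tc_txq {
	uint16_t count;   /* number of tx queues owned by this traffic class */
	uint16_t offset;  /* index of the first tx queue of this class */
};

struct tc_dev {
	uint8_t num_tc;                         /* 0 means no tc mapping */
	struct tc_txq tc_to_txq[TC_MAX_QUEUE];
	uint8_t prio_tc_map[TC_BITMASK + 1];
	uint16_t real_num_tx_queues;            /* assumed stand-in for num_tx_queues */
};

/* Same arithmetic as the tail of __skb_tx_hash() in the patch:
 * scale a 32-bit hash into [offset, offset + count).
 */
uint16_t pick_txq(const struct tc_dev *dev, uint32_t priority, uint32_t hash)
{
	uint16_t qoffset = 0;
	uint16_t qcount = dev->real_num_tx_queues;

	if (dev->num_tc) {
		/* Only the lower 4 bits of the priority select the class. */
		uint8_t tc = dev->prio_tc_map[priority & TC_BITMASK];

		qoffset = dev->tc_to_txq[tc].offset;
		qcount = dev->tc_to_txq[tc].count;
	}

	return (uint16_t)(((uint64_t)hash * qcount) >> 32) + qoffset;
}

/* Hypothetical layout: 8 queues, TC0 = queues 0-3, TC1 = queues 4-7,
 * priorities 0-3 in TC0 and priorities 4-15 in TC1.
 */
struct tc_dev example_dev(void)
{
	struct tc_dev dev = { .num_tc = 2, .real_num_tx_queues = 8 };
	unsigned int prio;

	dev.tc_to_txq[0] = (struct tc_txq){ .count = 4, .offset = 0 };
	dev.tc_to_txq[1] = (struct tc_txq){ .count = 4, .offset = 4 };
	for (prio = 0; prio <= TC_BITMASK; prio++)
		dev.prio_tc_map[prio] = prio < 4 ? 0 : 1;
	return dev;
}
```

With this layout, any skb whose priority maps to TC1 stays within queues 4-7 regardless of its hash, and setting num_tc back to 0 restores hashing across all 8 queues.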


* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07  3:04 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Jesse Gross, Michael Leun, Michael Chan, David Miller, Ben Greear,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20110107025930.GA17808@mcarlson.broadcom.com>

Le jeudi 06 janvier 2011 à 18:59 -0800, Matt Carlson a écrit :
> On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
> > Le vendredi 07 janvier 2011 à 03:41 +0100, Eric Dumazet a écrit :
> > > Le jeudi 06 janvier 2011 à 18:29 -0800, Matt Carlson a écrit :
> > > 
> > > > Hi Eric.  Sorry for the delay.  I was under the impression that your
> > > > problems were software related and that you just needed a revised
> > > > version of these VLAN patches I was sending to Michael.  Is this not
> > > > true?
> > > > 
> > > > Having a hardware stat increment suggests this is a new problem.
> > > > Maybe I missed it, but I didn't see what hardware you are working
> > > > with and whether or not management firmware was enabled.  Could you tell
> > > > me that info?
> > > > 
> > > 
> > > Hi Matt
> > > 
> > > I started a bisection, because I couldn't sleep tonight anyway :(
> > > 
> > > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> > > Gigabit Ethernet (rev a3)
> > > 	Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> > > 	Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> > > 	Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> > > 	Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> > > 	[virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> > > 	Capabilities: [40] PCI-X non-bridge device
> > > 	Capabilities: [48] Power Management version 2
> > > 	Capabilities: [50] Vital Product Data
> > > 	Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> > > 	Kernel driver in use: tg3
> > > 	Kernel modules: tg3
> > > 
> > > 
> > 
> > $ ethtool -i eth2
> > driver: tg3
> > version: 3.115
> > firmware-version: 5715s-v3.28
> > bus-info: 0000:14:04.0
> > $ dmesg | grep ASF
> > [    6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
> > ASF[0] TSOcap[1]
> > [    6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
> > ASF[0] TSOcap[1]
> 
> Thanks.  So management firmware is disabled.  This should be
> a straightforward case.
> 
> I'm wondering if I'm misunderstanding something though.  You said earlier
> that VLAN tagging doesn't work unless you applied my patch.  Is this no
> longer true?
> 

I didn't apply your patch because Jesse said it was not a good patch ;)

Maybe I missed something and it must be applied? The problem is that the
current Linus tree now includes net-next-2.6 and vlan doesn't work.
Perhaps you should resubmit it?


* Re: [PATCH v2] net: ppp: use {get,put}_unaligned_be{16,32}
From: Paul Mackerras @ 2011-01-07  3:01 UTC (permalink / raw)
  To: Changli Gao; +Cc: David S. Miller, Harvey Harrison, linux-ppp, netdev
In-Reply-To: <1294357056-25889-1-git-send-email-xiaosuo@gmail.com>

On Fri, Jan 07, 2011 at 07:37:36AM +0800, Changli Gao wrote:

> Signed-off-by: Changli Gao <xiaosuo@gmail.com>

This patch description is inadequate.  It should tell us why you are
making this change.  Does it result in smaller and/or faster code, and
if so by how much on what sort of machine?  Do you think it makes the
code clearer?  (I don't.)  Or is there some other motivation for this?

Paul.


* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Matt Carlson @ 2011-01-07  2:59 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Matthew Carlson, Jesse Gross, Michael Leun, Michael Chan,
	David Miller, Ben Greear, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <1294368202.2704.50.camel@edumazet-laptop>

On Thu, Jan 06, 2011 at 06:43:22PM -0800, Eric Dumazet wrote:
> Le vendredi 07 janvier 2011 à 03:41 +0100, Eric Dumazet a écrit :
> > Le jeudi 06 janvier 2011 à 18:29 -0800, Matt Carlson a écrit :
> > 
> > > Hi Eric.  Sorry for the delay.  I was under the impression that your
> > > problems were software related and that you just needed a revised
> > > version of these VLAN patches I was sending to Michael.  Is this not
> > > true?
> > > 
> > > Having a hardware stat increment suggests this is a new problem.
> > > Maybe I missed it, but I didn't see what hardware you are working
> > > with and whether or not management firmware was enabled.  Could you tell
> > > me that info?
> > > 
> > 
> > Hi Matt
> > 
> > I started a bisection, because I couldn't sleep tonight anyway :(
> > 
> > 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> > Gigabit Ethernet (rev a3)
> > 	Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> > 	Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> > 	Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> > 	Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> > 	[virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> > 	Capabilities: [40] PCI-X non-bridge device
> > 	Capabilities: [48] Power Management version 2
> > 	Capabilities: [50] Vital Product Data
> > 	Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> > 	Kernel driver in use: tg3
> > 	Kernel modules: tg3
> > 
> > 
> 
> $ ethtool -i eth2
> driver: tg3
> version: 3.115
> firmware-version: 5715s-v3.28
> bus-info: 0000:14:04.0
> $ dmesg | grep ASF
> [    6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
> ASF[0] TSOcap[1]
> [    6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
> ASF[0] TSOcap[1]

Thanks.  So management firmware is disabled.  This should be
a straightforward case.

I'm wondering if I'm misunderstanding something though.  You said earlier
that VLAN tagging doesn't work unless you applied my patch.  Is this no
longer true?


* Re: [Bugme-new] [Bug 25062] New: Bonding packet deduplication doesn't work properly anymore
From: Jay Vosburgh @ 2011-01-07  2:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: netdev, bugzilla-daemon, bugme-new, bugme-daemon, kevin.lapagna
In-Reply-To: <20110104133936.60d389e2.akpm@linux-foundation.org>

Andrew Morton <akpm@linux-foundation.org> wrote:

>On Fri, 17 Dec 2010 11:45:18 GMT
>bugzilla-daemon@bugzilla.kernel.org wrote:
>
>> https://bugzilla.kernel.org/show_bug.cgi?id=25062
>> 
>>            Summary: Bonding packet deduplication doesn't work properly
>>                     anymore
>>            Product: Networking
>>            Version: 2.5
>>     Kernel Version: > 2.6.33
>>           Platform: All
>>         OS/Version: Linux
>>               Tree: Mainline
>>             Status: NEW
>>           Severity: high
>>           Priority: P1
>>          Component: Other
>>         AssignedTo: acme@ghostprotocols.net
>>         ReportedBy: kevin.lapagna@bigtag.ch
>>         Regression: No
>> 
>> 
>> Here's the setup:
>> 
>> switch: ordinary cisco switch
>> eth0: NIC with kernel module tg3
>> eth1: NIC with kernel module e1000e
>> bond0: bond with slaves eth0,eth1 in mode 1 (or 5)
>> bond0.100: vlan device created with vconfig
>> bridge100: bridge created with brctl
>> tap1: tap device created with tunctl
>> vguest: qemu-kvm vguest with emulated e1000 NIC
>> 
>> 
>>  ________                                                          ________
>> |        |-- eth0 \                                               |        |
>> | switch |          -- bond0 -- bond0.100 -- bridge100 -- tap1 -- | vguest |
>> |________|-- eth1 /                                               |________|
>> 
>> When the vguest emits an ethernet broadcast (DHCP request), it's forwarded all
>> the way up to the switch, through eth0. The switch forwards the broadcast,
>> including to eth1. The packet then travels all the way back to bridge100. So the
>> last state known to bridge100 regarding the MAC address of the vguest is
>> that it is behind bond0.100 (instead of tap1). If a DHCP server responds to the
>> request, the packet travels to bridge100, which now has a faulty
>> MAC address table, so the packet is rejected and never reaches tap1 and
>> therefore never reaches the vguest.
>> 
>> I witnessed this wrong behavior in kernel 2.6.37-rc5 (debian package), 2.6.36.2
>> and 2.6.35.9 (self compiled -  vanilla). The setup has worked with kernels <=
>> 2.6.33.7. I've never tried 2.6.34.
>> 
>> I assume the setup above is a common way for the separation of virtual guests
>> on a network level. So this could become a major issue for a lot of people when
>> upgrading their kernels.

	Just a note that I have reproduced what I believe is the same
problem (I didn't use tap, and assigned an IP to the bridge).  I used
arping to generate ethernet broadcasts.  I see the problem on 2.6.36.2,
but not on today's net-next-2.6.

	I'll see if I can dig up the root cause tomorrow.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
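The failure described in the bug report comes down to how a learning bridge tracks source addresses. The following toy model (not the kernel bridge code) sketches that mechanism under stated assumptions: the struct fdb table, the fdb_learn() and fdb_lookup() helpers and the fixed table size are invented for this example. It says nothing about why the duplicate frame is no longer dropped on the inactive slave, which is the regression being chased here.

```c
#include <stdio.h>
#include <string.h>

#define FDB_SIZE 8

/* One learned address: which port a source MAC was last seen on. */
struct fdb_entry {
	unsigned char mac[6];
	char port[16];
	int used;
};

struct fdb {
	struct fdb_entry entries[FDB_SIZE];
};

/* Record (or overwrite) the port behind which a source MAC was seen. */
void fdb_learn(struct fdb *fdb, const unsigned char *mac, const char *port)
{
	struct fdb_entry *slot = NULL;
	int i;

	for (i = 0; i < FDB_SIZE; i++) {
		struct fdb_entry *e = &fdb->entries[i];

		if (e->used && memcmp(e->mac, mac, 6) == 0) {
			slot = e;       /* known MAC: move it to the new port */
			break;
		}
		if (!e->used && !slot)
			slot = e;       /* remember the first free slot */
	}
	if (!slot)
		return;                 /* table full: the toy model gives up */

	memcpy(slot->mac, mac, 6);
	snprintf(slot->port, sizeof(slot->port), "%s", port);
	slot->used = 1;
}

/* Return the port a destination MAC would be forwarded to, or NULL. */
const char *fdb_lookup(const struct fdb *fdb, const unsigned char *mac)
{
	int i;

	for (i = 0; i < FDB_SIZE; i++) {
		const struct fdb_entry *e = &fdb->entries[i];

		if (e->used && memcmp(e->mac, mac, 6) == 0)
			return e->port;
	}
	return NULL;
}
```

Learning the guest's MAC first on tap1 and then again on bond0.100 (the looped-back broadcast) leaves the table pointing at bond0.100, so a later reply addressed to the guest is looked up on the wrong port.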


* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07  2:43 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Jesse Gross, Michael Leun, Michael Chan, David Miller, Ben Greear,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <1294368071.2704.49.camel@edumazet-laptop>

Le vendredi 07 janvier 2011 à 03:41 +0100, Eric Dumazet a écrit :
> Le jeudi 06 janvier 2011 à 18:29 -0800, Matt Carlson a écrit :
> 
> > Hi Eric.  Sorry for the delay.  I was under the impression that your
> > problems were software related and that you just needed a revised
> > version of these VLAN patches I was sending to Michael.  Is this not
> > true?
> > 
> > Having a hardware stat increment suggests this is a new problem.
> > Maybe I missed it, but I didn't see what hardware you are working
> > with and whether or not management firmware was enabled.  Could you tell
> > me that info?
> > 
> 
> Hi Matt
> 
> I started a bisection, because I couldn't sleep tonight anyway :(
> 
> 14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
> Gigabit Ethernet (rev a3)
> 	Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
> 	Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
> 	Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
> 	Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
> 	[virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
> 	Capabilities: [40] PCI-X non-bridge device
> 	Capabilities: [48] Power Management version 2
> 	Capabilities: [50] Vital Product Data
> 	Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
> 	Kernel driver in use: tg3
> 	Kernel modules: tg3
> 
> 

$ ethtool -i eth2
driver: tg3
version: 3.115
firmware-version: 5715s-v3.28
bus-info: 0000:14:04.0
$ dmesg | grep ASF
[    6.220577] tg3 0000:14:04.0: eth2: RXcsums[1] LinkChgREG[0] MIirq[0]
ASF[0] TSOcap[1]
[    6.228586] tg3 0000:14:04.1: eth3: RXcsums[1] LinkChgREG[0] MIirq[0]
ASF[0] TSOcap[1]


* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07  2:41 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Jesse Gross, Michael Leun, Michael Chan, David Miller, Ben Greear,
	linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <20110107022912.GA17757@mcarlson.broadcom.com>

Le jeudi 06 janvier 2011 à 18:29 -0800, Matt Carlson a écrit :

> Hi Eric.  Sorry for the delay.  I was under the impression that your
> problems were software related and that you just needed a revised
> version of these VLAN patches I was sending to Michael.  Is this not
> true?
> 
> Having a hardware stat increment suggests this is a new problem.
> Maybe I missed it, but I didn't see what hardware you are working
> with and whether or not management firmware was enabled.  Could you tell
> me that info?
> 

Hi Matt

I started a bisection, because I couldn't sleep tonight anyway :(

14:04.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5715S
Gigabit Ethernet (rev a3)
	Subsystem: Hewlett-Packard Company NC326m PCIe Dual Port Adapter
	Flags: bus master, 66MHz, medium devsel, latency 64, IRQ 43
	Memory at fdff0000 (64-bit, non-prefetchable) [size=64K]
	Memory at fdfe0000 (64-bit, non-prefetchable) [size=64K]
	[virtual] Expansion ROM at fdbe0000 [disabled] [size=128K]
	Capabilities: [40] PCI-X non-bridge device
	Capabilities: [48] Power Management version 2
	Capabilities: [50] Vital Product Data
	Capabilities: [58] MSI: Enable+ Count=1/8 Maskable- 64bit+
	Kernel driver in use: tg3
	Kernel modules: tg3


* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Matt Carlson @ 2011-01-07  2:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jesse Gross, Matthew Carlson, Michael Leun, Michael Chan,
	David Miller, Ben Greear, linux-kernel@vger.kernel.org,
	netdev@vger.kernel.org
In-Reply-To: <1294363201.2704.19.camel@edumazet-laptop>

On Thu, Jan 06, 2011 at 05:20:01PM -0800, Eric Dumazet wrote:
> Le vendredi 07 janvier 2011 à 00:34 +0100, Eric Dumazet a écrit :
> > Le jeudi 06 janvier 2011 à 16:01 -0500, Jesse Gross a écrit :
> > 
> > > Hmm, I thought that it might be some interaction with a corner case in
> > > the networking core but now it seems less likely.  There weren't too
> > > many vlan changes between the working and non-working states.  Plus,
> > > since the rx counter isn't increasing, the packets probably aren't
> > > making it anywhere.
> > > 
> > > I see that tg3 increases the drop counter in one place, which also
> > > happens to be checking for vlan errors (at tg3.c:4753).  That seems
> > > suspicious - maybe the NIC is only partially configured for vlan
> > > offloading.  If we can confirm that is where the drop counter is being
> > > incremented and what the error code is maybe it would shed some light.
> > > 
> > 
> > Hmm... I am pretty sure the drop counter is the dev rx_dropped (core
> > network handled, not tg3 one) incremented at the end of
> > __netif_receive_skb() : We found no suitable handler for packets.
> > 
> > atomic_long_inc(&skb->dev->rx_dropped);
> > 
> > But that's a guess, I'll have to check
> > 
> 
> Wrong guess. It's really the tg3 that drops frames,
> 
> increasing rx_missed_errors (get_stat64(&hw_stats->rx_discards)).
> 
> ip -s -s link show dev eth2
> 5: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq
> master bond0 state UP qlen 1000
>     link/ether 00:1e:0b:92:78:50 brd ff:ff:ff:ff:ff:ff
>     RX: bytes  packets  errors  dropped overrun mcast   
>     11627      167      0       0       0       2      
>     RX errors: length  crc     frame   fifo    missed
>                0        0       0       0       2713   
>     TX: bytes  packets  errors  dropped carrier collsns 
>     2274       31       0       0       0       0      
>     TX errors: aborted fifo    window  heartbeat
>                0        0       0       0      
> 
> 
> 
> It would be nice if the Broadcom guys could help a bit?

Hi Eric.  Sorry for the delay.  I was under the impression that your
problems were software related and that you just needed a revised
version of these VLAN patches I was sending to Michael.  Is this not
true?

Having a hardware stat increment suggests this is a new problem.
Maybe I missed it, but I didn't see what hardware you are working
with and whether or not management firmware was enabled.  Could you tell
me that info?


* Re: Bad TCP timestamps on non-PC platforms
From: Eric Dumazet @ 2011-01-07  2:11 UTC (permalink / raw)
  To: Alex Dubov; +Cc: netdev, David Miller
In-Reply-To: <117536.54377.qm@web37607.mail.mud.yahoo.com>

Le jeudi 06 janvier 2011 à 17:55 -0800, Alex Dubov a écrit :
> Sorry for the awful synopsis of my problem. I never cease to amaze myself
> at how bad those usually turn up. :-)
> 
> What I really meant to write is:
> 
> I have a dev board running 2.6.37-rc7. Normal kernel config, nothing fancy.
> Remote machines are just usual linux boxes in constant operation (I tried
> several of those).
> 
> UDP/DHCP works correctly all the time, so ethernet side is probably ok.
> 
> When tcp_timestamps are enabled, SYN packets from dev board just get
> ignored by the remote side. I see them arrive in wireshark, but nothing
> else happens.
> 
> When I disable tcp_timestamps on the dev board everything works.
> 
> The problem is reproducible every single time.
> 
> The only difference is the "Options" block of the SYN packets.
> If timestamps are not really to blame, then it is probably the window scale
> parameters. That's what I see on a usual dropped packet:
> 
> Options: (20 bytes)
>         Maximum segment size: 1460 bytes
>         SACK permitted
>         Timestamps: TSval 4294893842, TSecr 0
>         NOP
>         Window scale: 5 (multiply by 32)
> 
> 
> 
>       


You don't give any new information ;)

I asked if you could give information about the other side: the bug is
that it drops this legal packet.

uname -a
sysctl -a | grep tcp




* Re: Bad TCP timestamps on non-PC platforms
From: Alex Dubov @ 2011-01-07  1:55 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, David Miller
In-Reply-To: <1294301356.2723.73.camel@edumazet-laptop>

Sorry for the awful synopsis of my problem. I never cease to amaze myself
at how bad those usually turn up. :-)

What I really meant to write is:

I have a dev board running 2.6.37-rc7. Normal kernel config, nothing fancy.
Remote machines are just usual linux boxes in constant operation (I tried
several of those).

UDP/DHCP works correctly all the time, so ethernet side is probably ok.

When tcp_timestamps are enabled, SYN packets from dev board just get
ignored by the remote side. I see them arrive in wireshark, but nothing
else happens.

When I disable tcp_timestamps on the dev board everything works.

The problem is reproducible every single time.

The only difference is the "Options" block of the SYN packets.
If timestamps are not really to blame, then it is probably the window scale
parameters. That's what I see on a usual dropped packet:

Options: (20 bytes)
        Maximum segment size: 1460 bytes
        SACK permitted
        Timestamps: TSval 4294893842, TSecr 0
        NOP
        Window scale: 5 (multiply by 32)
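The TSval in the dump above sits close to the top of the 32-bit range. TSval is an unsigned 32-bit field, so a value like this is valid on the wire, and receivers are expected to compare timestamps with wraparound (serial number) arithmetic as in RFC 1323 PAWS. A minimal sketch of that comparison follows; the helper names ts_before() and ts_ticks_until_wrap() are made up for this example and are not kernel functions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Serial-number comparison on 32-bit timestamps: true if a is before b,
 * even when the counter wrapped between them. The cast to int32_t relies
 * on two's complement conversion, which common compilers provide.
 */
bool ts_before(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) < 0;
}

/* Number of ticks left before a 32-bit timestamp wraps around to 0. */
uint32_t ts_ticks_until_wrap(uint32_t tsval)
{
	return (uint32_t)(0u - tsval);
}
```

For the value in the dump, ts_ticks_until_wrap(4294893842u) gives 73454 ticks, and ts_before(4294893842u, 100u) is true: a timestamp of 100 sent shortly after the wrap still compares as newer.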



      


* Re: About disabling congestion control
From: Stephen Hemminger @ 2011-01-07  1:37 UTC (permalink / raw)
  To: Syed Obaid Amin; +Cc: netdev
In-Reply-To: <AANLkTin53drsv18TZ0CPZM7==6G=rK6vScUb=SCQ4-xr@mail.gmail.com>

On Thu, 6 Jan 2011 20:25:18 -0500
Syed Obaid Amin <obaidasyed@gmail.com> wrote:

> Hey all,
> 
> I am currently working on a socket option to disable the tcp
> congestion control. I think the simplest approach to do this is to
> ignore cwnd before sending out a packet.
> 
> After going through the tcp output engine it seems that tcp_cwnd_test is
> the function that decides how many segments can be sent out on the
> wire. To test this, I changed the function so that if the no-cc
> option is ON, it just returns a big constant value. But it didn't work
> and I am unable to see a burst of pkts. It looks like I am
> missing something here.
> 
> Any suggestions on the right place to look for disabling the
> congestion control?
> 
> Thanks much!
> 
> Obaid

I assume this is just a local hack experiment; not something you
want to actually submit for other users to use...

The easiest/safest way to do this would be to build/define a new TCP congestion
control type that does nothing.
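To make the shape of that idea concrete, here is a user-space toy model of pluggable congestion control in which one algorithm never shrinks the window. It is not kernel code: the callback names echo the kernel's struct tcp_congestion_ops, but struct toy_sock, struct toy_cong_ops, the simplified Reno-like growth and the toy_on_ack()/toy_on_loss() drivers are all assumptions made for the example.

```c
#include <stdint.h>

struct toy_sock {
	uint32_t cwnd;        /* congestion window, in segments */
	uint32_t ssthresh;    /* slow-start threshold, in segments */
	uint32_t cwnd_clamp;  /* upper bound on cwnd */
};

struct toy_cong_ops {
	const char *name;
	uint32_t (*ssthresh)(const struct toy_sock *sk);  /* called on loss */
	void (*cong_avoid)(struct toy_sock *sk);          /* called per ACK */
};

/* Simplified Reno-like behaviour: halve on loss, grow by one per ACK. */
uint32_t reno_ssthresh(const struct toy_sock *sk)
{
	uint32_t half = sk->cwnd / 2;

	return half > 2 ? half : 2;
}

void reno_cong_avoid(struct toy_sock *sk)
{
	if (sk->cwnd < sk->cwnd_clamp)
		sk->cwnd++;
}

/* "Does nothing" variant: the window is pinned at the clamp. */
uint32_t nocc_ssthresh(const struct toy_sock *sk)
{
	return sk->cwnd_clamp;
}

void nocc_cong_avoid(struct toy_sock *sk)
{
	sk->cwnd = sk->cwnd_clamp;
}

const struct toy_cong_ops toy_reno = { "toy_reno", reno_ssthresh, reno_cong_avoid };
const struct toy_cong_ops toy_nocc = { "toy_nocc", nocc_ssthresh, nocc_cong_avoid };

/* Toy drivers standing in for the stack's ACK and loss handling. */
void toy_on_ack(const struct toy_cong_ops *ops, struct toy_sock *sk)
{
	ops->cong_avoid(sk);
}

void toy_on_loss(const struct toy_cong_ops *ops, struct toy_sock *sk)
{
	sk->ssthresh = ops->ssthresh(sk);
	sk->cwnd = sk->ssthresh;
}
```

The point of the design is that the decision lives behind the ops table, so the send path itself is left alone; a real module would still be subject to whatever the rest of the stack does during loss recovery.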



* Re: genetlink misinterprets NEW as GET
From: Pablo Neira Ayuso @ 2011-01-07  1:31 UTC (permalink / raw)
  To: Ben Pfaff
  Cc: Jan Engelhardt, Netfilter Developer Mailing List,
	Linux Networking Developer Mailing List
In-Reply-To: <878vyyvtci.fsf@benpfaff.org>

On 06/01/11 18:23, Ben Pfaff wrote:
> Pablo Neira Ayuso <pablo@netfilter.org> writes:
> 
>> On 04/01/11 03:14, Jan Engelhardt wrote:
>>> 	/* Modifiers to GET request */
>>> 	#define NLM_F_ROOT      0x100
>>> 	#define NLM_F_MATCH     0x200
>>> 	#define NLM_F_ATOMIC    0x400
>>> 	#define NLM_F_DUMP      (NLM_F_ROOT|NLM_F_MATCH)
> [...]
>>> [N.B.: I am also wondering whether
>>> 	(nlh->nlmsg_flags & NLM_F_DUMP) == NLM_F_DUMP
>>> may have been desired, because NLM_F_DUMP is composed of two bits.]
>>
>> Someone may include NLM_F_ATOMIC to a dump operation, in that case the
>> checking that you propose is not valid.
> 
> Are you saying that NLM_F_MATCH and NLM_F_ATOMIC are mutually
> exclusive, and that NLM_F_ROOT|NLM_F_ATOMIC would also signal a
> dump operation?  Otherwise the test that Jan proposes looks valid
> to me.

Indeed, Jan's test is fine to fix this. Please, send a patch to Davem asap.
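The difference between the two tests discussed above can be shown in a few lines. The flag values below match include/linux/netlink.h, but they are redefined with an EX_ prefix so the sketch compiles without kernel headers; the is_dump_loose() and is_dump_strict() helpers are hypothetical names for this example only.

```c
#include <stdbool.h>
#include <stdint.h>

/* GET modifiers */
#define EX_NLM_F_ROOT    0x100
#define EX_NLM_F_MATCH   0x200
#define EX_NLM_F_ATOMIC  0x400
#define EX_NLM_F_DUMP    (EX_NLM_F_ROOT | EX_NLM_F_MATCH)

/* NEW modifiers, which reuse the same bit positions */
#define EX_NLM_F_REPLACE 0x100
#define EX_NLM_F_EXCL    0x200
#define EX_NLM_F_CREATE  0x400

/* Loose test: either one of the two DUMP bits is enough. */
bool is_dump_loose(uint16_t nlmsg_flags)
{
	return (nlmsg_flags & EX_NLM_F_DUMP) != 0;
}

/* Strict test proposed in the thread: both DUMP bits must be set. */
bool is_dump_strict(uint16_t nlmsg_flags)
{
	return (nlmsg_flags & EX_NLM_F_DUMP) == EX_NLM_F_DUMP;
}
```

A NEW request carrying NLM_F_CREATE|NLM_F_EXCL passes the loose test, because NLM_F_EXCL shares its bit with NLM_F_MATCH, but fails the strict one. A dump that also sets NLM_F_ATOMIC still passes the strict test. The strict test cannot distinguish a request that sets both 0x100 and 0x200 for some other purpose.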


* Re: [PATCH v2] net: Allow ethtool to set interface in loopback mode.
From: Ben Hutchings @ 2011-01-07  1:30 UTC (permalink / raw)
  To: Mahesh Bandewar
  Cc: Jeff Garzik, Stephen Hemminger, David Miller, Laurent Chavey,
	Tom Herbert, netdev
In-Reply-To: <AANLkTi=3ewzgz=z-WmqT=vBvci3-H6HE3CCVk4ZGuFED@mail.gmail.com>

On Thu, 2011-01-06 at 16:47 -0800, Mahesh Bandewar wrote:
> On Thu, Jan 6, 2011 at 2:13 PM, Ben Hutchings <bhutchings@solarflare.com> wrote:
> > On Wed, 2011-01-05 at 11:22 -0500, Jeff Garzik wrote:
> >> On 01/04/2011 08:21 PM, Ben Hutchings wrote:
> >> > On Tue, 2011-01-04 at 16:36 -0800, Stephen Hemminger wrote:
> >> >> On Tue,  4 Jan 2011 16:30:01 -0800
> >> >> Mahesh Bandewar<maheshb@google.com>  wrote:
> >> >>
> >> >>> This patch enables ethtool to set the loopback mode on a given interface.
> >> >>> By configuring the interface in loopback mode in conjunction with a policy
> >> >>> route / rule, a userland application can stress the egress / ingress path
> >> >>> exposing the flows of the change in progress and potentially help developer(s)
> >> >>> understand the impact of those changes without even sending a packet out
> >> >>> on the network.
> >> >>>
> >> >>> Following set of commands illustrates one such example -
> >> >>>   a) ip -4 addr add 192.168.1.1/24 dev eth1
> >> >>>   b) ip -4 rule add from all iif eth1 lookup 250
> >> >>>   c) ip -4 route add local 0/0 dev lo proto kernel scope host table 250
> >> >>>   d) arp -Ds 192.168.1.100 eth1
> >> >>>   e) arp -Ds 192.168.1.200 eth1
> >> >>>   f) sysctl -w net.ipv4.ip_nonlocal_bind=1
> >> >>>   g) sysctl -w net.ipv4.conf.all.accept_local=1
> >> >>>   # Assuming that the machine has 8 cores
> >> >>>   h) taskset 000f netserver -L 192.168.1.200
> >> >>>   i) taskset 00f0 netperf -t TCP_CRR -L 192.168.1.100 -H 192.168.1.200 -l 30
> >> >>>
> >> >>> Signed-off-by: Mahesh Bandewar<maheshb@google.com>
> >> >>> Reviewed-by: Ben Hutchings<bhutchings@solarflare.com>
> >> >>
> >> >> Since this is a boolean it SHOULD go into ethtool_flags rather than
> >> >> being a high level operation.
> >> >
> >> > It could do, but I thought ETHTOOL_{G,S}FLAGS were intended for
> >> > controlling offload features.
> >>
> >> It doesn't have to be.  As Stephen guessed, [GS]FLAGS are basically
> >> common flags -- as differentiated from private,
> >> driver-specific/hardware-specific flags.
> >
> > Well, that would allow the patch to be simplified quite a bit. :-)
> 
> Ben, Are you suggesting to use ETH_FLAG_LOOPBACK instead of
> ETHTOOL_{G|S}LOOPBACK flags?
[...]

Exactly.

An example implementation (untested):

--- a/drivers/net/sfc/ethtool.c
+++ b/drivers/net/sfc/ethtool.c
@@ -548,11 +548,24 @@ static u32 efx_ethtool_get_rx_csum(struct net_device *net_dev)
 	return efx->rx_checksum_enabled;
 }
 
+static u32 efx_ethtool_get_flags(struct net_device *net_dev)
+{
+	struct efx_nic *efx = netdev_priv(net_dev);
+	u32 flags;
+
+	flags = ethtool_op_get_flags(net_dev);
+	if (efx->loopback_mode != LOOPBACK_NONE)
+		flags |= ETH_FLAG_LOOPBACK;
+	return flags;
+}
+
 static int efx_ethtool_set_flags(struct net_device *net_dev, u32 data)
 {
 	struct efx_nic *efx = netdev_priv(net_dev);
-	u32 supported = (efx->type->offload_features &
-			 (ETH_FLAG_RXHASH | ETH_FLAG_NTUPLE));
+	u32 supported = (ETH_FLAG_LOOPBACK |
+			 (efx->type->offload_features &
+			  (ETH_FLAG_RXHASH | ETH_FLAG_NTUPLE)));
+	enum efx_loopback_mode loopback;
 	int rc;
 
 	rc = ethtool_op_set_flags(net_dev, data, supported);
@@ -562,7 +575,15 @@ static int efx_ethtool_set_flags(struct net_device *net_dev, u32 data)
 	if (!(data & ETH_FLAG_NTUPLE))
 		efx_filter_clear_rx(efx, EFX_FILTER_PRI_MANUAL);
 
-	return 0;
+	loopback = (data & ETH_FLAG_LOOPBACK) ? LOOPBACK_DATA : LOOPBACK_NONE;
+	mutex_lock(&efx->mac_lock);
+	if (efx->loopback_mode != loopback) {
+		efx->loopback_mode = loopback;
+		rc = __efx_reconfigure_port(efx);
+	}
+	mutex_unlock(&efx->mac_lock);
+
+	return rc;
 }
 
 static void efx_ethtool_self_test(struct net_device *net_dev,
@@ -1057,7 +1078,7 @@ const struct ethtool_ops efx_ethtool_ops = {
 	.get_tso		= ethtool_op_get_tso,
 	/* Need to enable/disable TSO-IPv6 too */
 	.set_tso		= efx_ethtool_set_tso,
-	.get_flags		= ethtool_op_get_flags,
+	.get_flags		= efx_ethtool_get_flags,
 	.set_flags		= efx_ethtool_set_flags,
 	.get_sset_count		= efx_ethtool_get_sset_count,
 	.self_test		= efx_ethtool_self_test,
---
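
The flag handling in the patch above can be modelled in user space. This is
only a sketch: the TOY_* bit values are made up for illustration (the real
ETH_FLAG_* constants come from <linux/ethtool.h>, and ETH_FLAG_LOOPBACK is the
flag proposed in this thread, not an existing one), and toy_check_flags()
assumes that the generic helper rejects any requested bit outside the
"supported" mask.

```c
#include <errno.h>
#include <stdint.h>

/* Illustrative bit positions only, not the kernel's values. */
#define TOY_FLAG_NTUPLE   (1u << 0)
#define TOY_FLAG_RXHASH   (1u << 1)
#define TOY_FLAG_LOOPBACK (1u << 2)	/* stands in for the proposed flag */

/* Build the "supported" mask the way the patch does: loopback is always
 * allowed, the offload flags only if the hardware advertises them. */
uint32_t toy_supported(uint32_t offload_features)
{
	return TOY_FLAG_LOOPBACK |
	       (offload_features & (TOY_FLAG_RXHASH | TOY_FLAG_NTUPLE));
}

/* Assumed behaviour of the generic check: any bit outside the supported
 * mask makes the whole request fail with -EINVAL. */
int toy_check_flags(uint32_t requested, uint32_t supported)
{
	return (requested & ~supported) ? -EINVAL : 0;
}

/* Map the loopback bit onto an on/off decision, mirroring the
 * LOOPBACK_DATA / LOOPBACK_NONE choice in the patch. */
int toy_loopback_enabled(uint32_t flags)
{
	return (flags & TOY_FLAG_LOOPBACK) != 0;
}
```

With no offload features advertised, a request for TOY_FLAG_LOOPBACK alone
passes the check, while a request for TOY_FLAG_NTUPLE is rejected.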

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* About disabling congestion control
From: Syed Obaid Amin @ 2011-01-07  1:25 UTC (permalink / raw)
  To: netdev

Hey all,

I am currently working on a socket option to disable TCP congestion
control. I think the simplest approach is to ignore cwnd before
sending out a packet.

After going through the TCP output engine, it seems that tcp_cwnd_test
is the function that decides how many segments can be sent out on the
wire. To test this, I changed the function so that if the no-cc option
is on, it just returns a big constant value. But it didn't work: I am
unable to see a burst of packets. It looks like I am missing something
here.

Any suggestions on the right place to look for disabling congestion
control?

Thanks much!

Obaid
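
The experiment described above can be modelled in user space. This is a toy
sketch, not kernel code: struct toy_sender, its fields and the no_cc switch
are all hypothetical names. It illustrates one possible reason a large return
value from the cwnd check alone may not produce a burst: a TCP sender is also
bounded by the peer's advertised receive window, so the effective quota is the
smaller of the two limits. Whether that is what happened in this case is not
established in the thread.

```c
#include <limits.h>

/* Toy model of sender state; all names here are illustrative. */
struct toy_sender {
	unsigned int cwnd;	/* congestion window, in segments */
	unsigned int in_flight;	/* segments sent but not yet acknowledged */
	unsigned int rwnd_segs;	/* room left in the peer's window, in segments */
	int no_cc;		/* hypothetical "disable congestion control" option */
};

/* Congestion-window gate: how many more segments cwnd permits. */
unsigned int toy_cwnd_quota(const struct toy_sender *s)
{
	if (s->no_cc)
		return UINT_MAX;	/* the "big constant" from the message */
	return s->in_flight < s->cwnd ? s->cwnd - s->in_flight : 0;
}

/* Effective quota: the smaller of the cwnd quota and the receive window. */
unsigned int toy_send_quota(const struct toy_sender *s)
{
	unsigned int q = toy_cwnd_quota(s);

	return q < s->rwnd_segs ? q : s->rwnd_segs;
}
```

With cwnd = 10, in_flight = 4 and rwnd_segs = 8, the quota is 6 segments with
congestion control and 8 without it: larger, but still capped by the receiver.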

^ permalink raw reply

* Re: Flow Control and Port Mirroring Revisited
From: Simon Horman @ 2011-01-07  1:23 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Eric Dumazet, Rusty Russell, virtualization, dev, virtualization,
	netdev, kvm, Michael S. Tsirkin
In-Reply-To: <AANLkTinJK-nbkP5_ee2cuS8RA7jTB4-bcWmAf4bjSouP@mail.gmail.com>

On Thu, Jan 06, 2011 at 05:38:01PM -0500, Jesse Gross wrote:

[ snip ]
> 
> I know that everyone likes a nice netperf result but I agree with
> Michael that this probably isn't the right question to be asking.  I
> don't think that socket buffers are a real solution to the flow
> control problem: they happen to provide that functionality but it's
> more of a side effect than anything.  It's just that the amount of
> memory consumed by packets in the queue(s) doesn't really have any
> implicit meaning for flow control (think multiple physical adapters,
> all with the same speed instead of a virtual device and a physical
> device with wildly different speeds).  The analog in the physical
> world that you're looking for would be Ethernet flow control.
> Obviously, if the question is limiting CPU or memory consumption then
> that's a different story.

Point taken. I will see if I can control CPU (and thus memory) consumption
using cgroups and/or tc.

> This patch also double counts memory, since the full size of the
> packet will be accounted for by each clone, even though they share the
> actual packet data.  Probably not too significant here but it might be
> when flooding/mirroring to many interfaces.  This is at least fixable
> (the Xen-style accounting through page tracking deals with it, though
> it has its own problems).

Agreed on all counts.
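
The double counting mentioned above can be shown with simple arithmetic. This
is a sketch with hypothetical numbers and names (toy_charge_*), not the
kernel's accounting code: it compares charging every clone for the full packet
against charging the shared payload once plus a per-clone metadata cost.

```c
#include <stddef.h>

/* Naive accounting: every clone is charged payload plus metadata. */
size_t toy_charge_per_clone(size_t data_bytes, size_t meta_bytes,
			    unsigned int clones)
{
	return (size_t)clones * (data_bytes + meta_bytes);
}

/* Shared accounting: the payload is charged once, metadata per clone. */
size_t toy_charge_shared(size_t data_bytes, size_t meta_bytes,
			 unsigned int clones)
{
	if (clones == 0)
		return 0;
	return data_bytes + (size_t)clones * meta_bytes;
}
```

For a 1500-byte payload, 200 bytes of metadata (an assumed figure) and 4
clones, the naive scheme charges 6800 bytes while the shared scheme charges
2300, and the gap grows with the number of mirror or flood targets.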



^ permalink raw reply

* Re: [PATCH 2.6.36] vlan: Avoid hwaccel vlan packets when vid not used
From: Eric Dumazet @ 2011-01-07  1:20 UTC (permalink / raw)
  To: Jesse Gross
  Cc: Matt Carlson, Michael Leun, Michael Chan, David Miller,
	Ben Greear, linux-kernel@vger.kernel.org, netdev@vger.kernel.org
In-Reply-To: <1294356887.2704.13.camel@edumazet-laptop>

Le vendredi 07 janvier 2011 à 00:34 +0100, Eric Dumazet a écrit :
> Le jeudi 06 janvier 2011 à 16:01 -0500, Jesse Gross a écrit :
> 
> > Hmm, I thought that it might be some interaction with a corner case in
> > the networking core but now it seems less likely.  There weren't too
> > many vlan changes between the working and non-working states.  Plus,
> > since the rx counter isn't increasing, the packets probably aren't
> > making it anywhere.
> > 
> > I see that tg3 increases the drop counter in one place, which also
> > happens to be checking for vlan errors (at tg3.c:4753).  That seems
> > suspicious - maybe the NIC is only partially configured for vlan
> > offloading.  If we can confirm that is where the drop counter is being
> > incremented and what the error code is maybe it would shed some light.
> > 
> 
> Hmm... I am pretty sure the drop counter is the dev rx_dropped (handled
> by the core network stack, not the tg3 one), incremented at the end of
> __netif_receive_skb() when no suitable handler is found for a packet:
> 
> atomic_long_inc(&skb->dev->rx_dropped);
> 
> But that's a guess, I'll have to check.
> 

Wrong guess. It is really tg3 that drops the frames, increasing
rx_missed_errors (get_stat64(&hw_stats->rx_discards)):

ip -s -s link show dev eth2
5: eth2: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc mq master bond0 state UP qlen 1000
    link/ether 00:1e:0b:92:78:50 brd ff:ff:ff:ff:ff:ff
    RX: bytes  packets  errors  dropped overrun mcast   
    11627      167      0       0       0       2      
    RX errors: length  crc     frame   fifo    missed
               0        0       0       0       2713   
    TX: bytes  packets  errors  dropped carrier collsns 
    2274       31       0       0       0       0      
    TX errors: aborted fifo    window  heartbeat
               0        0       0       0      



It would be nice if the Broadcom guys could help a bit.

^ permalink raw reply

* Re: linux-next: manual merge of the security-testing tree with the net tree
From: Casey Schaufler @ 2011-01-07  1:05 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: James Morris, linux-next, linux-kernel, David Miller, netdev,
	Casey Schaufler
In-Reply-To: <20110107114400.e4d1d33b.sfr@canb.auug.org.au>

On 1/6/2011 4:44 PM, Stephen Rothwell wrote:
> Hi James,
>
> Today's linux-next merge of the security-testing tree got a conflict in
> security/smack/smack_lsm.c between commit
> 3610cda53f247e176bcbb7a7cca64bc53b12acdb ("af_unix: Avoid socket->sk NULL
> OOPS in stream connect security hooks") from the net tree and commit
> b4e0d5f0791bd6dd12a1c1edea0340969c7c1f90 ("Smack: UDS revision") from the
> security-testing tree.
>
> I fixed it up (I think - see below) and can carry the fix as necessary.

The change looks like it addresses the change in interface. Thank you.

^ permalink raw reply

* Re: [net-next 12/12] ixgbe: update ntuple filter configuration
From: Ben Hutchings @ 2011-01-07  1:02 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: davem, Alexander Duyck, netdev, gosp, bphilips
In-Reply-To: <1294360199-9860-13-git-send-email-jeffrey.t.kirsher@intel.com>

On Thu, 2011-01-06 at 16:29 -0800, jeffrey.t.kirsher@intel.com wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> 
> This change fixes several issues found in ntuple filtering while I was
> doing the ATR refactor.
> 
> Specifically I updated the masks to work correctly with the latest version
> of ethtool,
[...]

Did the previous code not correctly handle a zero value with a non-zero
mask for some fields?  If so, I can revert that change to ethtool.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: 2.6.37 vlans on bnx2 not functional, panic with tcpdump
From: Michael Chan @ 2011-01-07  0:46 UTC (permalink / raw)
  To: Iain Paton; +Cc: netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <1294357941.21580.2.camel@HP1>


On Thu, 2011-01-06 at 15:52 -0800, Michael Chan wrote:
> On Thu, 2011-01-06 at 13:32 -0800, Iain Paton wrote:
> > Hi,
> > 
> > vlans don't appear to be functional on my HP DL380G6 with onboard bnx2
> > adapter using vanilla 2.6.37 kernel. No tagged vlan traffic 
> > is arriving at the vlan interface.
> 
> VLANs on the net-next-2.6 kernel work for me on bnx2 devices.  I'll try
> 2.6.37 next.

Maybe you have management firmware running on your devices that can
change the behavior.  Can you provide the output of ethtool -i for both
bnx2 devices on your system?



^ permalink raw reply

* Re: [net-next 03/12] e1000e: properly bounds-check string functions
From: Ben Hutchings @ 2011-01-07  0:48 UTC (permalink / raw)
  To: jeffrey.t.kirsher; +Cc: davem, Bruce Allan, netdev, gosp, bphilips
In-Reply-To: <1294360199-9860-4-git-send-email-jeffrey.t.kirsher@intel.com>

On Thu, 2011-01-06 at 16:29 -0800, jeffrey.t.kirsher@intel.com wrote:
> From: Bruce Allan <bruce.w.allan@intel.com>
> 
> Use string functions with bounds checking rather than their non-bounds
> checking counterparts, and do not hard code these boundaries.
[...]
> --- a/drivers/net/e1000e/netdev.c
> +++ b/drivers/net/e1000e/netdev.c
[...]
> @@ -5968,7 +5968,7 @@ static int __devinit e1000_probe(struct pci_dev *pdev,
>  	if (!(adapter->flags & FLAG_HAS_AMT))
>  		e1000_get_hw_control(adapter);
>  
> -	strcpy(netdev->name, "eth%d");
> +	strncpy(netdev->name, "eth%d", sizeof(netdev->name) - 1);
>  	err = register_netdev(netdev);
>  	if (err)
>  		goto err_register;
[...]

This statement is actually redundant - alloc_etherdev() sets the name
for you.
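
Separately from the redundancy, the strncpy() idiom in the quoted hunk has a
property that is easy to miss. The sketch below uses hypothetical helper names
(toy_copy_name*) and a deliberately tiny buffer. With a limit of size - 1,
strncpy() never writes the last byte of the destination, so a truncated copy
is NUL-terminated only if that byte was already zero. In the quoted line
"eth%d" is much shorter than the name buffer, so strncpy() pads with NULs and
the result is terminated either way; the sketch only shows what happens when
the source does not fit.

```c
#include <string.h>

/* Same pattern as the quoted patch: limit of dst_size - 1, no explicit
 * terminator.  dst[dst_size - 1] is left untouched. */
void toy_copy_name(char *dst, size_t dst_size, const char *src)
{
	strncpy(dst, src, dst_size - 1);
}

/* Variant that terminates explicitly, so it does not rely on dst having
 * been zero-filled beforehand. */
void toy_copy_name_terminated(char *dst, size_t dst_size, const char *src)
{
	strncpy(dst, src, dst_size - 1);
	dst[dst_size - 1] = '\0';
}
```

Copying "eth%d" into a 4-byte buffer pre-filled with 'X' leaves "ethX" with no
terminator in the first variant, and the string "eth" in the second.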

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

