Netdev List


* Re: [BUG] 2.6.37-rc5 Memory leak in net/ipv4/udp.c
From: Lothar Waßmann @ 2010-12-17 11:47 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1292585534.2906.12.camel@edumazet-laptop>

Eric Dumazet writes:
> On Friday 17 December 2010 at 12:11 +0100, Lothar Waßmann wrote:
> > Hi,
> > 
> > Eric Dumazet writes:
> > > On Friday 17 December 2010 at 11:18 +0100, Lothar Waßmann wrote:
> > > > The offending code in net/ipv4/udp.c is:
> > > > |void __init udp_table_init(struct udp_table *table, const char *name)
> > > > |{
> > > > |	unsigned int i;
> > > > |
> > > > |	if (!CONFIG_BASE_SMALL)
> > > > |		table->hash = alloc_large_system_hash(name,
> > > > |			2 * sizeof(struct udp_hslot),
> > > > |			uhash_entries,
> > > > |			21, /* one slot per 2 MB */
> > > > |			0,
> > > > |			&table->log,
> > > > |			&table->mask,
> > > > |			64 * 1024);
> > > > |	/*
> > > > |	 * Make sure hash table has the minimum size
> > > > |	 */
> > > > |	if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
> > > > |		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
> > > > |				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
> > > > In case of !CONFIG_BASE_SMALL and 'table->mask < UDP_HTABLE_SIZE_MIN - 1'
> > > > the memory allocated in the previous if clause becomes inaccessible!
> > > > 
> > > > Shouldn't this be:
> > > > |	if (!CONFIG_BASE_SMALL && table->mask >= UDP_HTABLE_SIZE_MIN - 1) {
> > > > |		table->hash = alloc_large_system_hash(name,
> > > > |			2 * sizeof(struct udp_hslot),
> > > > |			uhash_entries,
> > > > |			21, /* one slot per 2 MB */
> > > > |			0,
> > > > |			&table->log,
> > > > |			&table->mask,
> > > > |			64 * 1024);
> > > > |	} else {
> > > > |		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
> > > > |				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
> > > > [...]
> > > > 
> > > 
> > > Nothing we can do about it, there is no API to reverse the
> > > alloc_large_system_hash() effect. We could call kmemleak api to at least
> > > avoid this false alarm.
> > > 
> > Do you have to call it at all in case of table->mask < UDP_HTABLE_SIZE_MIN - 1?
> > 
> 
> We call alloc_large_system_hash() asking it to size the table _itself_.
> We give some hints : 
> 
> - How many slots per MB of avail memory.
> - An upper limit (64*1024 slots because we only handle 65536 udp ports)
> - but not a lower limit (not available in the API)
> 
> Problem is in your case, alloc_large_system_hash() allocates a very
> small area. Then we catch the problem, seeing table->mask is too small
> for our needs. We prefer to 'lose' this too-small memory rather than
> crash the kernel later.
> 
table->mask is not altered by alloc_large_system_hash(), so you could
detect the situation beforehand and avoid calling that function in this
case. As far as I can tell there is no need for
alloc_large_system_hash() if you later decide to use kmalloc'ed memory
instead.

The current situation is
if (!CONFIG_BASE_SMALL)
	call alloc_large_system_hash()
if (CONFIG_BASE_SMALL || table->mask < MIN)
	call kmalloc(), dropping any memory allocated by the
	previous if clause

My proposal was:
if (!CONFIG_BASE_SMALL && table->mask >= MIN)
	call alloc_large_system_hash()
else
	call kmalloc()

which is functionally equivalent except for the missing call to
alloc_large_system_hash() if the memory allocated by that function is
not used.

> > > We really want a minimum size for the UDP hash table, because our algos
> > > depend on this.
> > > 
> > I can't see why this could not be achieved by doing _either_
> > alloc_large_system_hash() _OR_ kmalloc() as stated above, but not
> > both.
> 
> We definitely want alloc_large_system_hash() for the general case
> (nice NUMA spread, while kmalloc() would allocate the hash table on a
> single memory node. Not so nice)
> 
That would still be the case with my proposed solution.


Lothar Waßmann
-- 
___________________________________________________________

Ka-Ro electronics GmbH | Pascalstraße 22 | D - 52076 Aachen
Phone: +49 2408 1402-0 | Fax: +49 2408 1402-10
Geschäftsführer: Matthias Kaussen
Handelsregistereintrag: Amtsgericht Aachen, HRB 4996

www.karo-electronics.de | info@karo-electronics.de
___________________________________________________________


* Re: [BUG] 2.6.37-rc5 Memory leak in net/ipv4/udp.c
From: Lothar Waßmann @ 2010-12-17 11:56 UTC (permalink / raw)
  To: Eric Dumazet, netdev
In-Reply-To: <19723.19914.961119.861405@ipc1.ka-ro>

Hi again,

Lothar Waßmann writes:
> Eric Dumazet writes:
> > On Friday 17 December 2010 at 12:11 +0100, Lothar Waßmann wrote:
> > > Hi,
> > > 
> > > Eric Dumazet writes:
> > > > On Friday 17 December 2010 at 11:18 +0100, Lothar Waßmann wrote:
> > > > > The offending code in net/ipv4/udp.c is:
> > > > > |void __init udp_table_init(struct udp_table *table, const char *name)
> > > > > |{
> > > > > |	unsigned int i;
> > > > > |
> > > > > |	if (!CONFIG_BASE_SMALL)
> > > > > |		table->hash = alloc_large_system_hash(name,
> > > > > |			2 * sizeof(struct udp_hslot),
> > > > > |			uhash_entries,
> > > > > |			21, /* one slot per 2 MB */
> > > > > |			0,
> > > > > |			&table->log,
> > > > > |			&table->mask,
> > > > > |			64 * 1024);
> > > > > |	/*
> > > > > |	 * Make sure hash table has the minimum size
> > > > > |	 */
> > > > > |	if (CONFIG_BASE_SMALL || table->mask < UDP_HTABLE_SIZE_MIN - 1) {
> > > > > |		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
> > > > > |				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
> > > > > In case of !CONFIG_BASE_SMALL and 'table->mask < UDP_HTABLE_SIZE_MIN - 1'
> > > > > the memory allocated in the previous if clause becomes inaccessible!
> > > > > 
> > > > > Shouldn't this be:
> > > > > |	if (!CONFIG_BASE_SMALL && table->mask >= UDP_HTABLE_SIZE_MIN - 1) {
> > > > > |		table->hash = alloc_large_system_hash(name,
> > > > > |			2 * sizeof(struct udp_hslot),
> > > > > |			uhash_entries,
> > > > > |			21, /* one slot per 2 MB */
> > > > > |			0,
> > > > > |			&table->log,
> > > > > |			&table->mask,
> > > > > |			64 * 1024);
> > > > > |	} else {
> > > > > |		table->hash = kmalloc(UDP_HTABLE_SIZE_MIN *
> > > > > |				      2 * sizeof(struct udp_hslot), GFP_KERNEL);
> > > > > [...]
> > > > > 
> > > > 
> > > > Nothing we can do about it, there is no API to reverse the
> > > > alloc_large_system_hash() effect. We could call kmemleak api to at least
> > > > avoid this false alarm.
> > > > 
> > > Do you have to call it at all in case of table->mask < UDP_HTABLE_SIZE_MIN - 1?
> > > 
> > 
> > We call alloc_large_system_hash() asking it to size the table _itself_.
> > We give some hints : 
> > 
> > - How many slots per MB of avail memory.
> > - An upper limit (64*1024 slots because we only handle 65536 udp ports)
> > - but not a lower limit (not available in the API)
> > 
> > Problem is in your case, alloc_large_system_hash() allocates a very
> > small area. Then we catch the problem, seeing table->mask is too small
> > for our needs. We prefer to 'lose' this too-small memory rather than
> > crash the kernel later.
> > 
> table->mask is not altered by alloc_large_system_hash(), so you could
> detect the situation beforehand and avoid calling that function in this
> case. As far as I can tell there is no need for
> alloc_large_system_hash() if you later decide to use kmalloc'ed memory
> instead.
> 
Forget about this. I was a little confused when reading the
alloc_large_system_hash() function. Now I understand.

Sorry for the noise.
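
The point conceded here can be modelled in user-space C. This is only a sketch under stated assumptions: model_large_hash() is a hypothetical stand-in for alloc_large_system_hash(), the 256-slot minimum is an assumed value for UDP_HTABLE_SIZE_MIN, and the model frees the rejected table although, as noted earlier in the thread, the kernel has no API to do that. What it shows is that the mask is an output of the allocator, so the minimum-size check can only run after the call.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed stand-in for UDP_HTABLE_SIZE_MIN; not taken from the thread. */
#define MODEL_HTABLE_SIZE_MIN 256u

struct model_table {
	void *hash;
	unsigned int mask;
};

/*
 * Hypothetical stand-in for alloc_large_system_hash(): the allocator
 * decides the size and reports it back through *mask.
 */
static void *model_large_hash(size_t bucket_size, unsigned int chosen_entries,
			      unsigned int *mask)
{
	/* a power of two, so that entries - 1 is usable as a mask */
	assert(chosen_entries && (chosen_entries & (chosen_entries - 1)) == 0);
	*mask = chosen_entries - 1;
	return calloc(chosen_entries, bucket_size);
}

/* Returns 1 if the fallback allocation was needed, 0 otherwise. */
static int model_table_init(struct model_table *t, unsigned int chosen_entries)
{
	const size_t bucket_size = 16; /* arbitrary placeholder size */

	t->hash = model_large_hash(bucket_size, chosen_entries, &t->mask);
	/* Only now is t->mask known, so the check cannot come earlier. */
	if (!t->hash || t->mask < MODEL_HTABLE_SIZE_MIN - 1) {
		free(t->hash); /* possible here, not in the kernel case */
		t->hash = calloc(MODEL_HTABLE_SIZE_MIN, bucket_size);
		t->mask = MODEL_HTABLE_SIZE_MIN - 1;
		return 1;
	}
	return 0;
}
```

Calling model_table_init() with 64 chosen entries takes the fallback path; with 1024 it keeps the first allocation.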


Lothar Waßmann


* Re: [PATCH net-next-2.6] ifb: use netif_receive_skb() instead of netif_rx()
From: jamal @ 2010-12-17 12:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Changli Gao, David S. Miller, Stephen Hemminger, Tom Herbert,
	Jiri Pirko, netdev, netem
In-Reply-To: <1292422896.3427.251.camel@edumazet-laptop>

On Wed, 2010-12-15 at 15:21 +0100, Eric Dumazet wrote:
> On Wednesday 15 December 2010 at 07:49 -0500, jamal wrote:

> > Eric, did you do at least a simple test on this one? 
> > It used to be problematic (I cant remember why or
> > what use case was problematic).
> 
> Yes, I run SFQ / IFB right now on my dev machine, and found SFQ bugs by
> the way ;)

Ok, thanks;->

cheers,
jamal




* Re: [PATCH] iproute2: initialize the ll_map only once
From: Denys Fedoryshchenko @ 2010-12-17 13:06 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Octavian Purdila, netdev, Lucian Adrian Grijincu, Vlad Dogaru
In-Reply-To: <20101210113809.56ea259e@nehalam>

On Friday 10 December 2010 21:38:09 Stephen Hemminger wrote:
> On Fri, 10 Dec 2010 16:59:50 +0200
> 
> Octavian Purdila <opurdila@ixiacom.com> wrote:
> > Avoid initializing the LL map (which involves a costly RTNL dump)
> > multiple times. This can happen when running in batch mode.
> > 
> > Signed-off-by: Octavian Purdila <opurdila@ixiacom.com>
> 
> applied
There is a longstanding bug in the current hashing system.
To work around it I wrote my own "flush" command to flush the hashes, but with
this patch the situation becomes more difficult to handle.
Here is how to reproduce it:

ip -force -batch -
link add link eth0 name new0 type macvlan
link show dev new0
link delete dev new0 type macvlan
link add link eth0 name new0 type macvlan
link show dev new0

The last command will not show the link, because the index of the old one is
still stored in the hash.

I guess this is more a bug report for an old problem than a problem with the
current patch. Of course it is possible to flush the hash on del/add
operations, but interfaces can also appear or disappear during a batch run
(e.g. a NAS with thousands of ppp interfaces). Maybe I could still do a patch
with a flag to dump rtnl before each command, plus an additional "flush hash"
command?
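
The stale-index problem described above can be sketched with a small name-to-index cache. This is a hypothetical model, not iproute2's ll_map code; all names (model_cache_add, model_cache_lookup, model_cache_flush) and the interface indexes are made up for illustration.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAX_LINKS 8
#define MODEL_NAME_LEN 16

struct model_link {
	char name[MODEL_NAME_LEN];
	int ifindex;
};

struct model_cache {
	struct model_link links[MODEL_MAX_LINKS];
	int count;
};

/* Remember a name/index pair, as a one-time dump would. */
static void model_cache_add(struct model_cache *c, const char *name, int ifindex)
{
	struct model_link *l;

	assert(c->count < MODEL_MAX_LINKS);
	l = &c->links[c->count++];
	strncpy(l->name, name, MODEL_NAME_LEN - 1);
	l->name[MODEL_NAME_LEN - 1] = '\0';
	l->ifindex = ifindex;
}

/* Return the cached index for name, or 0 if the name is not cached. */
static int model_cache_lookup(const struct model_cache *c, const char *name)
{
	int i;

	for (i = 0; i < c->count; i++)
		if (strcmp(c->links[i].name, name) == 0)
			return c->links[i].ifindex;
	return 0;
}

/* Drop everything so the next fill sees the current state. */
static void model_cache_flush(struct model_cache *c)
{
	c->count = 0;
}
```

If "new0" is cached with index 5, then deleted and re-created with a new index, a lookup still returns 5 until the cache is flushed and refilled. That is the failure mode the proposed "flush hash" command is meant to address.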


* Re: [PATCH 5/5 v4] net: add old_queue_mapping into skb->cb
From: jamal @ 2010-12-17 13:09 UTC (permalink / raw)
  To: Changli Gao
  Cc: David S. Miller, Stephen Hemminger, Eric Dumazet, Tom Herbert,
	Jiri Pirko, netdev, netem
In-Reply-To: <1292475410-24665-1-git-send-email-xiaosuo@gmail.com>

On Thu, 2010-12-16 at 12:56 +0800, Changli Gao wrote:
> For the skbs returned from ifb, we should use the queue_mapping
> saved before ifb.
> 
> We save old queue_mapping in old_queue_mapping just before calling 
> dev_queue_xmit, and restore the old_queue_mapping to queue_mapping
> just before reinjecting the ingress packets.
> 
> A new struct dev_skb_cb is added, and valid in qdisc and gso layer.
> The original qdisc_skb_cb and DEV_GSO_CB use dev_skb_cb as the first
> member.
> 
> netem_skb_cb is changed to contain qdisc_skb_cb.

I am sorry Changli - I think we are talking past each other. I
am conflicted on the whole point of saving and restoring these
devqueue mappings. I understand that for ifb, saving and restoring the
original devs is fundamental for its operation- but i am not sure i see
it for the queues. As an example:

---
# For all packets arriving on ifb0, change mapping to 4 and
# redirect to ifb1

tc filter add dev ifb0 parent 1:0 protocol ip prio 10 u32 \
match u32 0 0 flowid 1:2 \
action skbedit queue_mapping 4 \
action mirred egress redirect dev ifb1
#
# redirect all packets arriving in eth0 to ifb0
tc filter add dev eth0 parent 1:0 protocol ip prio 10 u32 \
match u32 0 0 flowid 1:2 action mirred egress redirect dev ifb0
----

what is the expected behavior?

cheers,
jamal



* Re: [PATCH net-next-2.6] ifb: fix a lockdep splat
From: jamal @ 2010-12-17 13:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, changli Gao, netdev
In-Reply-To: <1292493175.2883.56.camel@edumazet-laptop>

On Thu, 2010-12-16 at 10:52 +0100, Eric Dumazet wrote:
> After recent ifb changes, we must use lockless __skb_dequeue() since
> lock is not anymore initialized.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Jamal Hadi Salim <hadi@cyberus.ca>
> Cc: Changli Gao <xiaosuo@gmail.com>

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

cheers,
jamal



* Re: [PATCH net-next-2.6] ifb: fix a lockdep splat
From: Eric Dumazet @ 2010-12-17 13:15 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, changli Gao, netdev
In-Reply-To: <1292591398.2668.20.camel@mojatatu>

On Friday 17 December 2010 at 08:09 -0500, jamal wrote:
> On Thu, 2010-12-16 at 10:52 +0100, Eric Dumazet wrote:
> > After recent ifb changes, we must use lockless __skb_dequeue() since
> > lock is not anymore initialized.
> > 
> > Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> > Cc: Jamal Hadi Salim <hadi@cyberus.ca>
> > Cc: Changli Gao <xiaosuo@gmail.com>
> 
> Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

Thanks for reviewing Jamal !




* Re: [PATCH net-next] bnx2x: Add Nic partitioning mode (57712 devices)
From: Ben Hutchings @ 2010-12-17 13:22 UTC (permalink / raw)
  To: Matt Domsch
  Cc: Eilon Greenstein, Dimitris Michailidis, Dmitry Kravkov,
	davem@davemloft.net, netdev@vger.kernel.org, narendra_k@dell.com,
	jordan_hargrave@dell.com
In-Reply-To: <20101217024509.GA5854@auslistsprd01.us.dell.com>

On Thu, 2010-12-16 at 20:45 -0600, Matt Domsch wrote:
> On Thu, Dec 09, 2010 at 04:49:25PM +0200, Eilon Greenstein wrote:
> > On Mon, 2010-12-06 at 10:21 -0800, Dimitris Michailidis wrote:
> > > Matt Domsch wrote:
> > ...
> > > /sys/class/net/<ifname>/dev_id indicates the physical port <ifname> is 
> > > associated with.  At least a few drivers set up dev_id this way.
> > > 
> > > 
> > 
> > So we are in agreement? Can this satisfy all needs? If so, we will add
> > this scheme to the bnx2x as well.
> 
> I don't think that's enough.  Necessary, but not sufficient.
> 
> If dev_id is a field that starts over with each PCI device (e.g. is
> used to distinguish multiple ports that share the same PCI
> device), that's enough to handle the Chelsio case, but not the NPAR &
> SR-IOV case.
> 
> If the above is true, then a value of dev_id=0 for all 1:1 PCI Device
> : Port relations is fine, leaving the three drivers that set dev_id
> non-zero are all multi-port, single PCI device controllers.
> 
> cxgb4/t4_hw.c:          adap->port[i]->dev_id = j;
> mlx4/en_netdev.c:       dev->dev_id =  port - 1;
> sfc/siena.c:    efx->net_dev->dev_id = EFX_OWORD_FIELD(reg, FRF_CZ_CS_PORT_NUM) - 1;
> 
> Is that truly how these three controllers work: they set dev_id when
> there are multiple physical ports that a single PCI d/b/d/f drives?
[...]

In the case of sfc, each port has a separate PCI function.  We read this
register field to find out which port we're talking to, as
virtualisation can alter the function number.  I don't know about the
others.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* [patch] USB: mcs7830: return negative if auto negotiate fails
From: Dan Carpenter @ 2010-12-17 13:25 UTC (permalink / raw)
  To: Greg Kroah-Hartman; +Cc: linux-usb, netdev, kernel-janitors

The original code returns 0 on success and 1 on failure.  In fact, at
this point, "ret" is already either zero or a negative error code so
we can just return it directly.

Signed-off-by: Dan Carpenter <error27@gmail.com>

diff --git a/drivers/net/usb/mcs7830.c b/drivers/net/usb/mcs7830.c
index a6281e3..f3fe8e1 100644
--- a/drivers/net/usb/mcs7830.c
+++ b/drivers/net/usb/mcs7830.c
@@ -351,7 +351,7 @@ static int mcs7830_set_autoneg(struct usbnet *dev, int ptrUserPhyMode)
 	if (!ret)
 		ret = mcs7830_write_phy(dev, MII_BMCR,
 				BMCR_ANENABLE | BMCR_ANRESTART	);
-	return ret < 0 ? : 0;
+	return ret;
 }
 
 

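The behaviour described in the commit message can be checked in plain C. This is a sketch, not driver code: the removed line uses the GNU conditional with an omitted middle operand, where "a ? : b" means "a ? a : b", so it is spelled out here in standard C. The function names and the -5 error value are placeholders.

```c
#include <assert.h>

/* Old tail: "ret < 0 ? : 0" is GNU shorthand for the expression below. */
static int model_old_tail(int ret)
{
	return (ret < 0) ? (ret < 0) : 0;
}

/* New tail: pass the zero-or-negative value straight through. */
static int model_new_tail(int ret)
{
	return ret;
}

/* Compare both tails on a failure value and on success. */
static void model_tail_demo(void)
{
	const int fake_err = -5; /* placeholder negative error code */

	assert(model_old_tail(fake_err) == 1);        /* error code lost */
	assert(model_new_tail(fake_err) == fake_err); /* error code kept */
	assert(model_old_tail(0) == 0 && model_new_tail(0) == 0);
}
```

The old form collapses every failure to 1, which callers testing for a negative return would not treat as an error.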

* Re: [PATCH] iproute2: ip: add wilcard support for device matching
From: jamal @ 2010-12-17 13:38 UTC (permalink / raw)
  To: Octavian Purdila
  Cc: Eric Dumazet, Stephen Hemminger, netdev, Lucian Adrian Grijincu,
	Vlad Dogaru
In-Reply-To: <201012102006.03401.opurdila@ixiacom.com>

On Fri, 2010-12-10 at 20:06 +0200, Octavian Purdila wrote:

> > $ ip link add link bond0 "vlan\*199" type vlan id 199
> > $ ifconfig "vlan\*199"
> 
> :) Then use a special dev keyword like dev* ?
> 
> $ ip link set dev* dummy set
> 
> Or use a new flag to allow expansion?
> 
> $ ip -e link set dev dummy* set

There is something I have always wanted to do but
haven't had time for. It would greatly cut down the
user-kernel interaction in precisely your situation.

Add a new general purpose netdev 32 bit tag. You can use this
feature to "group" netdevs. The group "all netdevs" is 0 - which
is the default.
I can group individual dummy interfaces into group 1. 
ip link dev dummy0 set group 1
..
..
ip link dev dummy99 set group 1

Then i can send a query to only ifup and ignore
the 1000 vlans that exist.

ip link dev ls group 1

or if i didnt list the group, then group 0 is assumed.

As a warning - this would be a general purpose tag.
So i can use it in conjunction with skb->mark to enable
filtering for example on ingress side with some action.
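
A minimal sketch of the grouping idea, in user-space C. Everything here is hypothetical: struct model_netdev and model_select_group() are invented for illustration, and group 0 is modelled simply as the default group that untagged devices belong to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct model_netdev {
	const char *name;
	uint32_t group; /* 0 is the default group */
};

/*
 * Collect pointers to the devices carrying the given group tag.
 * Returns how many were stored in out[] (at most out_cap).
 */
static size_t model_select_group(const struct model_netdev *devs, size_t ndevs,
				 uint32_t group,
				 const struct model_netdev **out, size_t out_cap)
{
	size_t i, found = 0;

	assert(devs != NULL && out != NULL);
	for (i = 0; i < ndevs && found < out_cap; i++) {
		if (devs[i].group == group)
			out[found++] = &devs[i];
	}
	return found;
}
```

With dummy devices tagged as group 1 and the vlans left in group 0, selecting group 1 returns only the dummies, which is the saving the proposal is after: the filter runs once instead of the caller walking every device.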

cheers,
jamal

^ permalink raw reply

* Re: [NETWORK] Firmware file for tehuti
From: Ben Hutchings @ 2010-12-17 14:04 UTC (permalink / raw)
  To: Joe Jin
  Cc: Alexander Indenbaum, Andy Gospodarek, netdev, linux-kernel,
	Guru Anbalagane, greg.marsden@oracle.com, DuanZhenzhong,
	Jaswinder Singh Rajput
In-Reply-To: <4D0B0B70.9050306@oracle.com>

[-- Attachment #1: Type: text/plain, Size: 826 bytes --]

On Fri, 2010-12-17 at 15:04 +0800, Joe Jin wrote:
> Hi,
> 
> Regarding the firmware for the tehuti driver: the firmware name it requests
> is tehuti/firmware.bin, but the kernel source has no such file, only
> tehuti/bdx.bin.ihex. So should the firmware be bdx.bin rather than
> firmware.bin?

Yes it should.  I'm sorry for this mistake.

My first version of the patch to use request_firmware() used
"tehuti/firmware.bin".  I then found Jaswinder's patch at
<http://git.infradead.org/users/jaswinder/firm-jsr-2.6.git?a=commitdiff;h=e41f3e5f8c5110871e376a2566b8eea2932b813b>,
which used the name "tehuti/bdx.bin".  For some reason I changed the
name of the firmware file in my patch to match that, but not the code.

Ben.

-- 
Ben Hutchings
Once a job is fouled up, anything done to improve it makes it worse.



* Re: [PATCH] e1000e: workaround missing power down mii control bit on 82571
From: Arthur Jones @ 2010-12-17 14:04 UTC (permalink / raw)
  To: Allan, Bruce W; +Cc: Ben Hutchings, Kirsher, Jeffrey T, netdev@vger.kernel.org
In-Reply-To: <8DD2590731AB5D4C9DBF71A877482A9001773F7AD6@orsmsx509.amr.corp.intel.com>

Hi Bruce, ...

On Thu, Dec 16, 2010 at 05:46:02PM -0800, Allan, Bruce W wrote:
> >-----Original Message-----
> >From: Arthur Jones [mailto:arthur.jones@riverbed.com]
> >Sent: Thursday, December 16, 2010 2:14 PM
> >To: Allan, Bruce W
> >Cc: Ben Hutchings; Kirsher, Jeffrey T; netdev@vger.kernel.org
> >Subject: Re: [PATCH] e1000e: workaround missing power down mii control bit on
> >82571
> >
> >> > It's the reset in e1000_set_settings() which ignores that we had previously
> >> > powered off the Phy.  I'll go through the rest of the code and fix up this
> >> > and any other occurrences of similar issues properly.
> >>
> >> Thanks for having a look!
> >>
> >> We do a read-modify-write there of
> >> the PHY control register.  We take
> >> the rest of the bits as being good,
> >> but, for some reason we don't get the
> >> power down bit (always reads back
> >> zero).  Is this a known 82571 issue?
> >> On 82574, e.g., we seem to get the
> >> power down bit back when we read...
> >
> >BTW:  The 802.3 spec seems to indicate
> >that this bit _should_ be readable even
> >when the PHY is powered down (i.e.  this
> >is a PHY bug)...
> >
> >Arthur
> >
> >>
> >> Are you sure you want to spread that
> >> 82571 specific logic all over the driver?
> >>
> >> Arthur
> 
> No, not a PHY bug.  One difference between 82571 and 82574 is during a
> hardware reset (which is done by the ethtool command in your example
> repro case), the reset on 82571 is a much more aggressive reset than on
> 82574 which causes the bit to be cleared automatically.

That's not what I saw.  I saw the bit always
read back zero, even right after I set it.

Have you tried writing the power down bit and
reading it back?

Arthur



* [PATCH net-2.6] tehuti: Firmware filename is tehuti/bdx.bin
From: Ben Hutchings @ 2010-12-17 14:13 UTC (permalink / raw)
  To: David Miller
  Cc: Joe Jin, Alexander Indenbaum, Andy Gospodarek, netdev,
	linux-kernel, Guru Anbalagane, greg.marsden@oracle.com,
	DuanZhenzhong, Jaswinder Singh Rajput

My conversion of tehuti to use request_firmware() was confused about
the filename of the firmware blob.  Change the driver to match the
blob.

Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
Cc: stable@kernel.org [2.6.32+]
---
 drivers/net/tehuti.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/tehuti.c b/drivers/net/tehuti.c
index 8b3dc1e..296000b 100644
--- a/drivers/net/tehuti.c
+++ b/drivers/net/tehuti.c
@@ -324,7 +324,7 @@ static int bdx_fw_load(struct bdx_priv *priv)
 	ENTER;
 	master = READ_REG(priv, regINIT_SEMAPHORE);
 	if (!READ_REG(priv, regINIT_STATUS) && master) {
-		rc = request_firmware(&fw, "tehuti/firmware.bin", &priv->pdev->dev);
+		rc = request_firmware(&fw, "tehuti/bdx.bin", &priv->pdev->dev);
 		if (rc)
 			goto out;
 		bdx_tx_push_desc_safe(priv, (char *)fw->data, fw->size);
@@ -2510,4 +2510,4 @@ module_exit(bdx_module_exit);
 MODULE_LICENSE("GPL");
 MODULE_AUTHOR(DRIVER_AUTHOR);
 MODULE_DESCRIPTION(BDX_DRV_DESC);
-MODULE_FIRMWARE("tehuti/firmware.bin");
+MODULE_FIRMWARE("tehuti/bdx.bin");
-- 
1.7.2.3


* [ANNOUNCE] libmnl 1.0.0 release
From: Pablo Neira Ayuso @ 2010-12-17 14:20 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netfilter, netfilter-announce, lwn, Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 1298 bytes --]

Hi!

The Netfilter project presents libmnl-1.0.0

libmnl is a minimalistic user-space library oriented to Netlink
developers. There are a lot of common tasks in parsing, validating,
constructing of both the Netlink header and TLVs that are repetitive and
easy to get wrong. This library aims to provide simple helpers that
allows you to re-use code and to avoid re-inventing the wheel.

This library is released under LGPLv2+.

Features:
* Small: the shared library requires around 45KB for an x86-based computer.
* Simple: this library avoids complexity and elaborated abstractions
that tend to hide Netlink details.
* Easy to use: the library simplifies the work for Netlink-wise
developers. It provides functions to make socket handling, message
building, validating, parsing and sequence tracking, easier.
* Easy to re-use: you can use the library to build your own abstraction
layer on top of this library.
* Decoupling: the interdependency of the main bricks that compose the
library is reduced, i.e. the library provides many helpers, but the
programmer is not forced to use them.
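
To illustrate the kind of repetitive TLV bookkeeping the announcement refers to, here is a hand-rolled sketch of appending one Netlink-style attribute. None of these names are libmnl's API; model_attr_put() and struct model_attr_hdr are invented for this example. It assumes the usual Netlink attribute layout: a 4-byte header (length, type), the payload, and padding up to a 4-byte boundary, with the length field covering header plus payload but not the padding.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

struct model_attr_hdr {
	uint16_t len;  /* header + payload, padding excluded */
	uint16_t type;
};

/* Round len up to the next multiple of 4. */
static size_t model_align(size_t len)
{
	return (len + 3) & ~(size_t)3;
}

/*
 * Append one attribute at buf + used.
 * Returns the new used size, or 0 if the attribute does not fit.
 */
static size_t model_attr_put(unsigned char *buf, size_t cap, size_t used,
			     uint16_t type, const void *data, size_t data_len)
{
	struct model_attr_hdr hdr;
	size_t raw = sizeof(hdr) + data_len;
	size_t padded = model_align(raw);

	assert(used == model_align(used)); /* attributes start aligned */
	if (used > cap || raw > UINT16_MAX || padded > cap - used)
		return 0;

	hdr.len = (uint16_t)raw;
	hdr.type = type;
	memcpy(buf + used, &hdr, sizeof(hdr));
	memcpy(buf + used + sizeof(hdr), data, data_len);
	memset(buf + used + raw, 0, padded - raw); /* zero the padding */
	return used + padded;
}
```

Forgetting the padding, or counting it in the length field, are the sort of mistakes a shared helper library is meant to rule out.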

More info at:
http://www.netfilter.org/projects/libmnl/

Doxygen documentation at:
http://www.netfilter.org/projects/libmnl/doxygen/

You can download it via FTP at:
ftp://ftp.netfilter.org/pub/libmnl

Enjoy!

[-- Attachment #2: changes-libmnl-1.0.0.txt --]
[-- Type: text/plain, Size: 5769 bytes --]

Cristian Rodríguez (1):
      src: implement both GCC visibility support and export script

Jan Engelhardt (36):
      build: just use autoreconf
      build: do not abuse AM_INIT_AUTOMAKE for autoconf options
      build: automake options should be in AM_INIT_AUTOMAKE
      build: use subdir-objects and CC_C_O
      build: run autoupdate
      build: rebuild .pc files when configure status changed
      build: resolve compiler warnings
      build: remove unneeded -dynamic -ldl -nostartfiles flags
      build: default to not build static libraries
      Add .gitignore files
      build: fix disable_static functionality
      src: avoid using deprecated unspecified argument lists
      src: add const qualifiers
      src: remove redundant casts
      socket: remove statement with no effect
      doc: documentation updates
      include: consistent usage of "extern"
      attr: string functions should take char *
      callback: mnl_cb_run should use a void *
      socket: use more appropriate types for mnl_socket_bind
      include: add cplusplus guards for extern
      nlmsg: use bool return type for yes-no functions
      attr: rename str_null from NULL away
      examples: remove redundant casts
      build: remove -fPIC flag
      build: remove statements without obvious effect
      socket: constify a struct sockaddr_nl
      nlmsg: use bool for mnl_nlmsg_ok()
      attr: remove redundant check for NULL
      include: use C++ headers in C++ mode
      attr: avoid multiple definition of hidden variable
      socket: propagate sendto/recvmsg's return types
      Update .gitignore
      build: tag function headers rather than decls as exported
      ld: add some more precautionary CFLAGS
      nlmsg: remove unused function mnl_nlmsg_aligned_size()

Jozsef Kadlecsik (2):
      fix for mnl_attr_for_each_nested()
      fix mnl_attr_parse()

Pablo Neira Ayuso (75):
      initial libmnl import
      fix leak in mnl_socket_open()
      remove libnfnetlink stuff from autogen.sh
      finish API documentation
      fix mnl_cb_run() and mnl_cb_run2() return value logic
      add COPYING file
      use `unsigned int' for number of bytes and array size in callback API
      partially revert previous commit
      fix mnl_socket_bind() to support the selection of the netlink portID
      constify several mnl_socket_* parameters and use size_t instead of int
      use C99 types uintXX_t instead of POSIX u_intXX_t
      check portid of received messages in examples
      revert abcaad6b65ed368c13c353ed71619332f76d9c2a
      add validation infrastructure and rework attribute parsing
      check source of the netlink message and fix sequence tracking logic
      remove mnl_align() as it's been replaced by MNL_ALIGN()
      rename mnl_attr_type_invalid() by mnl_attr_type_ok()
      remove bogus checking in mnl_attr_validate() and mnl_attr_validate2()
      add -Wextra to spot more errors in compilation
      fix warning in compilation due to different signess
      rename mnl_attr_type_ok() by mnl_attr_type_valid() for consistency
      rename msg.c to nlmsg.c
      rename mnl_nlmsg_payload_size() to mnl_nlmsg_get_payload_len() for consistency
      x
      more consistency name issues: rename get_data*() to get_payload*()
      add new README file
      review documentation on netlink attribute helpers
      improve documentation of netlink message helpers
      remove bogus casting in mnl_nlmsg_get_payload_tail()
      remove mnl_nlmsg_get_len() function
      minor update in README (library is around 30KB here, not 20KB)
      update socket helper documentation
      add mnl_nlmsg_fprintf() function for debugging purposes
      review data types for input parameters of mnl_attr_*() functions
      use size_t to indicate the buffer size in mnl_cb_run*()
      remove redudant alignment in mnl_nlmsg_size()
      fix warning in mnl_cb_run2()
      add -Wextra -Wall for example files
      fix lots of compilation warnings in example files
      remove references to 'generic' in header file
      add rtnl-route-add.c to examples
      add helpers to nest attributes
      add licensing terms of example files
      add nf-queue.c example file for nfnetlink_queue
      statify function in nf-queue.c example
      change errno values for mnl_cb_run[2]()
      relax mnl_attr_type_valid() checkings and change errno value
      fix rtnl-link-dump3.c
      add nfct-event example
      nlmsg: use size_t instead of int for several input parameters
      socket: remove mnl_socket_sendmsg() and mnl_socket_recvmsg()
      examples: fix rtnl-set-link
      examples: fix byte-order in nfct-event
      build: add notice on how to update library API version
      skip PortID and sequence checking if zero
      add missing .gitignore file to m4/ directory
      examples: put examples files into specific directories
      doxygen documentation
      add quote from Thoureau to documentation
      src: define MNL_SOCKET_BUFFER_SIZE to 8192UL
      doc: git tree update (now at netfilter.org) and fix listing in doxygen
      examples: add nflog example
      nlmsg: rework mnl_nlmsg_fprintf
      Merge branch 'master' of git://dev.medozas.de/libmnl
      license: change licensing terms from GPLv2+ to LGPLv2.1+
      nlmsg: remove unexisting mnl_nlmsg_total_size
      add libmnl.map file to src/Makefile.am
      attr: add mnl_attr_nest_cancel()
      header: add MNL_ARRAY_SIZE(x)
      callback: use of inline in mnl_cb_run*() function
      attr: add put function that allows to check buffer size
      header: use getpagesize() for MNL_SOCKET_BUFFER_SIZE
      header: missing parenthesis in MNL_SOCKET_BUFFER_SIZE definition
      nlmsg: add new message batching infrastructure
      build: 1.0.0 release



* Re: [BUG] 2.6.37-rc5 Memory leak in net/ipv4/udp.c
From: Eric Dumazet @ 2010-12-17 14:22 UTC (permalink / raw)
  To: Lothar Waßmann; +Cc: netdev
In-Reply-To: <19723.20449.974043.309608@ipc1.ka-ro>

On Friday 17 December 2010 at 12:56 +0100, Lothar Waßmann wrote:

> Forget about this. I was a little confused when reading the
> alloc_large_system_hash() function. Now I understand.

Any idea why you got a small allocation ?

How much LOWMEM do you have on your platform ?




* Re: [BUG] 2.6.37-rc5 Memory leak in net/ipv4/udp.c
From: Lothar Waßmann @ 2010-12-17 14:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1292595727.2906.14.camel@edumazet-laptop>

Hi,

Eric Dumazet writes:
> On Friday 17 December 2010 at 12:56 +0100, Lothar Waßmann wrote:
> 
> > Forget about this. I was a little confused when reading the
> > alloc_large_system_hash() function. Now I understand.
> 
> Any idea why you got a small allocation ?
> 
How much should I expect?

> How much LOWMEM do you have on your platform ?
> 
It's a new platform with a Freescale i.MX28 SoC:

Linux version 2.6.37-rc5-karo+ (lothar@ipc1) (gcc version 4.4.1 (GCC) ) #43 PREEMPT Fri Dec 17 13:59:40 CET 2010
CPU: ARM926EJ-S [41069265] revision 5 (ARMv5TEJ), cr=00053177
CPU: VIVT data cache, VIVT instruction cache
Machine: Ka-Ro electronics TX28 module
Memory policy: ECC disabled, Data cache writeback
On node 0 totalpages: 32768
free_area_init_node: node 0, pgdat c0454c9c, node_mem_map c0994000
  Normal zone: 288 pages used for memmap
  Normal zone: 0 pages reserved
  Normal zone: 32480 pages, LIFO batch:7
pcpu-alloc: s0 r0 d32768 u32768 alloc=1*32768
pcpu-alloc: [0] 0 
Built 1 zonelists in Zone order, mobility grouping on.  Total pages: 32480
Kernel command line: init=/linuxrc tx28_baseboard=stk5-v3 root=/dev/nfs nfsroot=192.168.1.225:/tftpboot/KARO/imx28,nolock ip=bootp debug panic=1 console=ttyAM0,115200 ro
PID hash table entries: 512 (order: -1, 2048 bytes)
Dentry cache hash table entries: 16384 (order: 4, 65536 bytes)
Inode-cache hash table entries: 8192 (order: 3, 32768 bytes)
Memory: 128MB = 128MB total
Memory: 119968k/119968k available, 11104k reserved, 0K highmem
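
As a rough worked example of why the automatic sizing comes out small on this board, the "one slot per 2 MB" hint (scale 21) can be applied to the page counts in the log above. This is a simplified model, not the kernel's alloc_large_system_hash(): the function names are hypothetical, a 4 KiB page size is assumed (32768 pages for 128MB), and the real function performs additional rounding steps.

```c
#include <assert.h>

/* Round n (n >= 1) up to the next power of two. */
static unsigned long model_roundup_pow2(unsigned long n)
{
	unsigned long p = 1;

	assert(n >= 1);
	while (p < n)
		p <<= 1;
	return p;
}

/*
 * Simplified sizing: one slot per 2^scale bytes of memory,
 * rounded up to a power of two and capped at limit.
 */
static unsigned long model_auto_entries(unsigned long pages,
					unsigned int page_shift,
					unsigned int scale,
					unsigned long limit)
{
	unsigned long entries;

	assert(scale > page_shift);
	entries = pages >> (scale - page_shift);
	if (entries < 1)
		entries = 1;
	entries = model_roundup_pow2(entries);
	if (entries > limit)
		entries = limit;
	return entries;
}
```

Under these assumptions, both 32768 and 32480 pages give 64 slots, i.e. a mask of 63. If the minimum table size is 256 slots (an assumed value for UDP_HTABLE_SIZE_MIN), this model would need about 512 MB of memory before the automatic size reaches it.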

Lothar Waßmann


* Re: [PATCH 5/5 v4] net: add old_queue_mapping into skb->cb
From: Changli Gao @ 2010-12-17 13:41 UTC (permalink / raw)
  To: hadi
  Cc: David S. Miller, Stephen Hemminger, Eric Dumazet, Tom Herbert,
	Jiri Pirko, netdev, netem
In-Reply-To: <1292591363.2668.19.camel@mojatatu>

On Fri, Dec 17, 2010 at 9:09 PM, jamal <hadi@cyberus.ca> wrote:
> On Thu, 2010-12-16 at 12:56 +0800, Changli Gao wrote:
>
> I am sorry Changli - I think we are talking past each other. I
> am conflicted on the whole point of saving and restoring these
> devqueue mappings. I understand that for ifb, saving and restoring the
> original devs is fundamental for its operation- but i am not sure i see
> it for the queues. As an example:
>
> ---
> # For all packets arriving on ifb0, change mapping to 3 and
> # redirect to ifb1
>
> tc filter add dev ifb0 parent 1:0 protocol ip prio 10 u32 \
> match u32 0 0 flowid 1:2 \
> action skbedit queue_mapping 4 \
> action mirred egress redirect dev ifb1
> #
> # redirect all packets arriving in eth0 to ifb0
> $TC filter add dev eth0 parent 1:0 protocol ip prio 10 u32 \
> match u32 0 0 flowid 1:2 action mirred egress redirect dev ifb0
> ----
>
> what is the expected behavior?
>

I doubt it can work.

eth0 -> ifb0: skb->skb_iif = eth0.
ifb0 -> ifb1: skb->skb_iif = ifb0
ifb1 -> ifb0: as skb->skb_iif == ifb0, skb->skb_iif = ifb1
ifb0 -> ifb1...
...

Did you test it?

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* kernel bug 14839
From: Ian Shorter @ 2010-12-17 15:18 UTC (permalink / raw)
  To: netdev

At the end of 2009 I reported kernel bug 14839: "Trying to use a TUN device for IPv6 traffic, cannot set destination address". I was wondering whether any progress has been made towards resolving the problem? Regards.


* Re: kernel bug 14839
From: David Lamparter @ 2010-12-17 15:31 UTC (permalink / raw)
  To: Ian Shorter; +Cc: netdev
In-Reply-To: <588003.11315.qm@web113905.mail.gq1.yahoo.com>


On Fri, Dec 17, 2010 at 07:18:58AM -0800, Ian Shorter wrote:
> At the end of 2009 I reported kernel bug 14839: "Trying to use a TUN
> device for IPv6 traffic, cannot set destination address". I was
> wondering whether any progress has been made towards resolving the
> problem? Regards.

The bug is invalid. You are incorrectly assuming the IPv4 pattern of a
"local" and a "remote" address applies to IPv6. Also, you are
incorrectly using an fe80:: address with a /128 mask. (fe80:: with
anything but /64 is usually a gross misconfiguration)

Please change the application to do the following:
* add a fe80::/64 address to the tun device
* use either device-only routes ("2001:db8::/32 dev tun0") or get the
  peer link-local address (might be complicated) and use that as
  nexthop.

-David



* Re: [PATCH net-2.6] tehuti: Firmware filename is tehuti/bdx.bin
From: Andy Gospodarek @ 2010-12-17 15:34 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, Joe Jin, Alexander Indenbaum, Andy Gospodarek,
	netdev, linux-kernel, Guru Anbalagane, greg.marsden@oracle.com,
	DuanZhenzhong, Jaswinder Singh Rajput
In-Reply-To: <1292595184.3136.843.camel@localhost>

On Fri, Dec 17, 2010 at 02:13:03PM +0000, Ben Hutchings wrote:
> My conversion of tehuti to use request_firmware() was confused about
> the filename of the firmware blob.  Change the driver to match the
> blob.
> 
> Signed-off-by: Ben Hutchings <ben@decadent.org.uk>

Thanks for doing that, Ben.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>


* [net-next-2.6 PATCH 1/4] net: implement mechanism for HW based QOS
From: John Fastabend @ 2010-12-17 15:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, hadi, shemminger, tgraf, eric.dumazet, nhorman

This patch provides a mechanism for lower layer devices to
steer traffic using skb->priority to tx queues. This allows
hardware based QOS schemes to use the default qdisc without
incurring the penalties related to global state and the qdisc
lock, while skbs are still reliably received on the correct tx
ring, avoiding the head of line blocking that results from
shuffling in the LLD. Finally, all the goodness from txq caching
and xps/rps can still be leveraged.

Many drivers and hardware exist with the ability to implement
QOS schemes in the hardware but currently these drivers tend
to rely on firmware to reroute specific traffic, a driver
specific select_queue or the queue_mapping action in the
qdisc.

If select_queue is used for this, drivers need to be updated for
each and every traffic type and we lose the goodness of much
of the upstream work. Firmware solutions are inherently
inflexible. And finally, if admins are expected to build a
qdisc and filter rules to steer traffic, this requires knowledge
of how the hardware is currently configured. The number of tx
queues and the queue offsets may change depending on resources.
This approach also incurs all the overhead of a qdisc with filters.

With the mechanism in this patch users can set skb priority using
expected methods, i.e. setsockopt(), or the stack can set the priority
directly. Then the skb will be steered to the correct tx queues
aligned with hardware QOS traffic classes. In the normal case with
a single traffic class and all queues in this class everything
works as is until the LLD enables multiple tcs.

To steer the skb we mask the priority down to its lower 4 bits
and allow the hardware to configure up to 16 distinct classes
of traffic. This is expected to be sufficient for most applications;
at any rate it is more than the 802.1Q spec designates and is
equal to the number of prio bands currently implemented in
the default qdisc.
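
The selection described above can be sketched in userspace as follows. This is a simplified model of the skb_tx_hash() change, not the kernel code itself: struct tc_dev and pick_txq() are stand-ins invented for illustration, and the rx-queue-recorded path and the hash computation are left out.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TC 16

/* Userspace stand-ins for the fields the patch adds to struct net_device. */
struct tc_txq {
	uint16_t count;   /* number of tx queues in this traffic class */
	uint16_t offset;  /* index of the first tx queue in the class */
};

struct tc_dev {
	uint8_t num_tc;
	struct tc_txq tc_to_txq[MAX_TC];
	uint8_t prio_tc_map[MAX_TC];
	uint16_t real_num_tx_queues;
};

/* Pick a tx queue: keep the low 4 bits of the priority, look up the
 * traffic class, then scale the 32-bit hash into the class's queue
 * range [offset, offset + count). */
uint16_t pick_txq(const struct tc_dev *dev, uint32_t prio, uint32_t hash)
{
	uint16_t qoffset = 0;
	uint16_t qcount = dev->real_num_tx_queues;

	if (dev->num_tc) {
		uint8_t tc = dev->prio_tc_map[prio & 15];

		qoffset = dev->tc_to_txq[tc].offset;
		qcount = dev->tc_to_txq[tc].count;
	}

	/* (hash * qcount) >> 32 maps a 32-bit hash onto 0..qcount-1 */
	return (uint16_t)(((uint64_t)hash * qcount) >> 32) + qoffset;
}
```

With no traffic classes configured the function degenerates to the existing behaviour of hashing across all real tx queues.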

This, in conjunction with a userspace application such as
lldpad, can be used to implement 802.1Q transmission selection
algorithms, one of these being the extended transmission
selection algorithm currently used for DCB.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |   60 +++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c            |   10 +++++++-
 2 files changed, 69 insertions(+), 1 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a9ac5dc..9694138 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -646,6 +646,12 @@ struct xps_dev_maps {
     (nr_cpu_ids * sizeof(struct xps_map *)))
 #endif /* CONFIG_XPS */

+/* HW offloaded queuing disciplines txq count and offset maps */
+struct netdev_tc_txq {
+	u16 count;
+	u16 offset;
+};
+
 /*
  * This structure defines the management hooks for network devices.
  * The following hooks can be defined; unless noted otherwise, they are
@@ -1146,6 +1152,9 @@ struct net_device {
 	/* Data Center Bridging netlink ops */
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
+	u8 num_tc;
+	struct netdev_tc_txq tc_to_txq[16];
+	u8 prio_tc_map[16];

 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	/* max exchange id for FCoE LRO by ddp */
@@ -1162,6 +1171,57 @@ struct net_device {
 #define	NETDEV_ALIGN		32

 static inline
+int netdev_get_prio_tc_map(const struct net_device *dev, u32 prio)
+{
+	return dev->prio_tc_map[prio & 15];
+}
+
+static inline
+int netdev_set_prio_tc_map(struct net_device *dev, u8 prio, u8 tc)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->prio_tc_map[prio & 15] = tc & 15;
+	return 0;
+}
+
+static inline
+void netdev_reset_tc(struct net_device *dev)
+{
+	dev->num_tc = 0;
+	memset(dev->tc_to_txq, 0, sizeof(dev->tc_to_txq));
+	memset(dev->prio_tc_map, 0, sizeof(dev->prio_tc_map));
+}
+
+static inline
+int netdev_set_tc_queue(struct net_device *dev, u8 tc, u16 count, u16 offset)
+{
+	if (tc >= dev->num_tc)
+		return -EINVAL;
+
+	dev->tc_to_txq[tc].count = count;
+	dev->tc_to_txq[tc].offset = offset;
+	return 0;
+}
+
+static inline
+int netdev_set_num_tc(struct net_device *dev, u8 num_tc)
+{
+	if (num_tc > 16)
+		return -EINVAL;
+
+	dev->num_tc = num_tc;
+	return 0;
+}
+
+static inline
+u8 netdev_get_num_tc(const struct net_device *dev)
+{
+	return dev->num_tc;
+}
+
+static inline
 struct netdev_queue *netdev_get_tx_queue(const struct net_device *dev,
 					 unsigned int index)
 {
diff --git a/net/core/dev.c b/net/core/dev.c
index 55ff66f..58e04ba 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2118,6 +2118,8 @@ static u32 hashrnd __read_mostly;
 u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 {
 	u32 hash;
+	u16 qoffset = 0;
+	u16 qcount = dev->real_num_tx_queues;

 	if (skb_rx_queue_recorded(skb)) {
 		hash = skb_get_rx_queue(skb);
@@ -2126,13 +2128,19 @@ u16 skb_tx_hash(const struct net_device *dev, const struct sk_buff *skb)
 		return hash;
 	}

+	if (dev->num_tc) {
+		u8 tc = netdev_get_prio_tc_map(dev, skb->priority);
+		qoffset = dev->tc_to_txq[tc].offset;
+		qcount = dev->tc_to_txq[tc].count;
+	}
+
 	if (skb->sk && skb->sk->sk_hash)
 		hash = skb->sk->sk_hash;
 	else
 		hash = (__force u16) skb->protocol ^ skb->rxhash;
 	hash = jhash_1word(hash, hashrnd);

-	return (u16) (((u64) hash * dev->real_num_tx_queues) >> 32);
+	return (u16) ((((u64) hash * qcount)) >> 32) + qoffset;
 }
 EXPORT_SYMBOL(skb_tx_hash);


* [net-next-2.6 PATCH 2/4] net_sched: Allow multiple mq qdisc to be used as non-root
From: John Fastabend @ 2010-12-17 15:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, hadi, shemminger, tgraf, eric.dumazet, nhorman
In-Reply-To: <20101217153439.12170.39538.stgit@jf-dev1-dcblab>

This patch modifies the mq qdisc to allow multiple mq qdiscs
to be used, allowing TX queues to be grouped for management.

This allows a root container qdisc to create multiple traffic
classes and use the mq qdisc as a default queueing discipline. It
is expected other queueing disciplines can then be grafted to the
container as needed.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 net/sched/sch_mq.c |   70 ++++++++++++++++++++++++++++++++++++++++------------
 1 files changed, 54 insertions(+), 16 deletions(-)

diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index ecc302f..35ed26d 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -19,17 +19,39 @@
 
 struct mq_sched {
 	struct Qdisc		**qdiscs;
+	u8 num_tc;
 };
 
+static void mq_queues(struct net_device *dev, struct Qdisc *sch,
+		      unsigned int *count, unsigned int *offset)
+{
+	struct mq_sched *priv = qdisc_priv(sch);
+	if (priv->num_tc) {
+		int queue = TC_H_MIN(sch->parent) - 1;
+		if (count)
+			*count = dev->tc_to_txq[queue].count;
+		if (offset)
+			*offset = dev->tc_to_txq[queue].offset;
+	} else {
+		if (count)
+			*count = dev->num_tx_queues;
+		if (offset)
+			*offset = 0;
+	}
+}
+
 static void mq_destroy(struct Qdisc *sch)
 {
 	struct net_device *dev = qdisc_dev(sch);
 	struct mq_sched *priv = qdisc_priv(sch);
-	unsigned int ntx;
+	unsigned int ntx, count;
 
 	if (!priv->qdiscs)
 		return;
-	for (ntx = 0; ntx < dev->num_tx_queues && priv->qdiscs[ntx]; ntx++)
+
+	mq_queues(dev, sch, &count, NULL);
+
+	for (ntx = 0; ntx < count && priv->qdiscs[ntx]; ntx++)
 		qdisc_destroy(priv->qdiscs[ntx]);
 	kfree(priv->qdiscs);
 }
@@ -41,21 +63,26 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 	struct netdev_queue *dev_queue;
 	struct Qdisc *qdisc;
 	unsigned int ntx;
+	unsigned int count, offset;
 
-	if (sch->parent != TC_H_ROOT)
+	if (sch->parent != TC_H_ROOT && !dev->num_tc)
 		return -EOPNOTSUPP;
 
 	if (!netif_is_multiqueue(dev))
 		return -EOPNOTSUPP;
 
+	/* Record num tc's in priv so we can tear down cleanly */
+	priv->num_tc = dev->num_tc;
+	mq_queues(dev, sch, &count, &offset);
+
 	/* pre-allocate qdiscs, attachment can't fail */
-	priv->qdiscs = kcalloc(dev->num_tx_queues, sizeof(priv->qdiscs[0]),
+	priv->qdiscs = kcalloc(count, sizeof(priv->qdiscs[0]),
 			       GFP_KERNEL);
 	if (priv->qdiscs == NULL)
 		return -ENOMEM;
 
-	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
-		dev_queue = netdev_get_tx_queue(dev, ntx);
+	for (ntx = 0; ntx < count; ntx++) {
+		dev_queue = netdev_get_tx_queue(dev, ntx + offset);
 		qdisc = qdisc_create_dflt(dev_queue, &pfifo_fast_ops,
 					  TC_H_MAKE(TC_H_MAJ(sch->handle),
 						    TC_H_MIN(ntx + 1)));
@@ -65,7 +92,8 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 		priv->qdiscs[ntx] = qdisc;
 	}
 
-	sch->flags |= TCQ_F_MQROOT;
+	if (!priv->num_tc)
+		sch->flags |= TCQ_F_MQROOT;
 	return 0;
 
 err:
@@ -78,9 +106,11 @@ static void mq_attach(struct Qdisc *sch)
 	struct net_device *dev = qdisc_dev(sch);
 	struct mq_sched *priv = qdisc_priv(sch);
 	struct Qdisc *qdisc;
-	unsigned int ntx;
+	unsigned int ntx, count;
 
-	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
+	mq_queues(dev, sch, &count, NULL);
+
+	for (ntx = 0; ntx < count; ntx++) {
 		qdisc = priv->qdiscs[ntx];
 		qdisc = dev_graft_qdisc(qdisc->dev_queue, qdisc);
 		if (qdisc)
@@ -94,14 +124,17 @@ static int mq_dump(struct Qdisc *sch, struct sk_buff *skb)
 {
 	struct net_device *dev = qdisc_dev(sch);
 	struct Qdisc *qdisc;
-	unsigned int ntx;
+	unsigned int ntx, count, offset;
+
+	mq_queues(dev, sch, &count, &offset);
 
 	sch->q.qlen = 0;
 	memset(&sch->bstats, 0, sizeof(sch->bstats));
 	memset(&sch->qstats, 0, sizeof(sch->qstats));
 
-	for (ntx = 0; ntx < dev->num_tx_queues; ntx++) {
-		qdisc = netdev_get_tx_queue(dev, ntx)->qdisc_sleeping;
+	for (ntx = 0; ntx < count; ntx++) {
+		int txq = ntx + offset;
+		qdisc = netdev_get_tx_queue(dev, txq)->qdisc_sleeping;
 		spin_lock_bh(qdisc_lock(qdisc));
 		sch->q.qlen		+= qdisc->q.qlen;
 		sch->bstats.bytes	+= qdisc->bstats.bytes;
@@ -120,10 +153,13 @@ static struct netdev_queue *mq_queue_get(struct Qdisc *sch, unsigned long cl)
 {
 	struct net_device *dev = qdisc_dev(sch);
 	unsigned long ntx = cl - 1;
+	unsigned int count, offset;
+
+	mq_queues(dev, sch, &count, &offset);
 
-	if (ntx >= dev->num_tx_queues)
+	if (ntx >= count)
 		return NULL;
-	return netdev_get_tx_queue(dev, ntx);
+	return netdev_get_tx_queue(dev, offset + ntx);
 }
 
 static struct netdev_queue *mq_select_queue(struct Qdisc *sch,
@@ -203,13 +239,15 @@ static int mq_dump_class_stats(struct Qdisc *sch, unsigned long cl,
 static void mq_walk(struct Qdisc *sch, struct qdisc_walker *arg)
 {
 	struct net_device *dev = qdisc_dev(sch);
-	unsigned int ntx;
+	unsigned int ntx, count;
+
+	mq_queues(dev, sch, &count, NULL);
 
 	if (arg->stop)
 		return;
 
 	arg->count = arg->skip;
-	for (ntx = arg->skip; ntx < dev->num_tx_queues; ntx++) {
+	for (ntx = arg->skip; ntx < count; ntx++) {
 		if (arg->fn(sch, ntx + 1, arg) < 0) {
 			arg->stop = 1;
 			break;



* [net-next-2.6 PATCH 3/4] net_sched: implement a root container qdisc sch_mclass
From: John Fastabend @ 2010-12-17 15:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, hadi, shemminger, tgraf, eric.dumazet, nhorman
In-Reply-To: <20101217153439.12170.39538.stgit@jf-dev1-dcblab>

This implements an mclass 'multi-class' queueing discipline that by
default creates multiple mq qdiscs, one for each traffic class. Each
mq qdisc then owns a range of queues per the netdev_tc_txq mappings.

Using the mclass qdisc, the number of tcs currently in use, along
with the range of queues allotted to each class, can be configured. By
default skbs are mapped to traffic classes using the skb priority.
This mapping is configurable.

Configurable parameters:

struct tc_mclass_qopt {
        __u8    num_tc;
        __u8    prio_tc_map[16];
        __u8    hw;
        __u16   count[16];
        __u16   offset[16];
};

Here the count/offset pairing gives the queue alignment and the
prio_tc_map gives the mapping from skb->priority to tc. The
hw bit determines if the hardware should configure the count
and offset values. If the hardware bit is set then the operation
will fail if the hardware does not implement the ndo_setup_tc
operation. This is to avoid undetermined states where the hardware
may or may not control the queue mapping. Also minimal bounds
checking is done on the count/offset to verify a queue does not
exceed num_tx_queues and that queue ranges do not overlap. Otherwise
it is left to user policy or hardware configuration to create
useful mappings.
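
The bounds checking described above can be sketched in userspace as follows. This is an illustration, not the code in the patch: tc_ranges_valid() is a made-up name, and unlike mclass_parse_opt(), which appears to assume classes are listed in ascending offset order, this variant compares every pair of ranges.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_TC 16

/* Returns 0 when every class range [offset, offset + count) fits inside
 * num_tx_queues and no two non-empty classes share a queue, -1 otherwise. */
int tc_ranges_valid(uint8_t num_tc, const uint16_t *count,
		    const uint16_t *offset, unsigned int num_tx_queues)
{
	unsigned int i, j;

	if (num_tc > MAX_TC)
		return -1;

	for (i = 0; i < num_tc; i++) {
		unsigned int first_i = offset[i];
		unsigned int last_i = first_i + count[i]; /* one past the end */

		if (last_i > num_tx_queues)
			return -1;

		for (j = i + 1; j < num_tc; j++) {
			unsigned int first_j = offset[j];
			unsigned int last_j = first_j + count[j];

			/* Two non-empty half-open ranges overlap when each
			 * starts before the other one ends. */
			if (count[i] && count[j] &&
			    first_i < last_j && first_j < last_i)
				return -1;
		}
	}
	return 0;
}
```

Comparing every pair costs at most 120 comparisons for 16 classes, so the extra generality is cheap at configuration time.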

It is expected that hardware QOS schemes can be implemented by
creating appropriate mappings of queues in ndo_setup_tc(). This
scheme can be expanded as needed with additional qdiscs being grafted
onto the root qdisc to provide per tc queuing disciplines, allowing
software and hardware queuing disciplines to be used together.

One expected use case is that drivers will use ndo_setup_tc to map
queue ranges onto 802.1Q traffic classes. This provides a generic
mechanism to map network traffic onto these traffic classes and
removes the need for lower layer drivers to know specifics about
traffic types.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/linux/netdevice.h |    3 
 include/linux/pkt_sched.h |    9 +
 include/net/sch_generic.h |    1 
 net/sched/Makefile        |    2 
 net/sched/sch_api.c       |    1 
 net/sched/sch_generic.c   |    8 +
 net/sched/sch_mclass.c    |  375 +++++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 397 insertions(+), 2 deletions(-)
 create mode 100644 net/sched/sch_mclass.c

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9694138..169a23f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -762,6 +762,8 @@ struct netdev_tc_txq {
  * int (*ndo_set_vf_port)(struct net_device *dev, int vf,
  *			  struct nlattr *port[]);
  * int (*ndo_get_vf_port)(struct net_device *dev, int vf, struct sk_buff *skb);
+ *
+ * int (*ndo_setup_tc)(struct net_device *dev, int tc);
  */
 #define HAVE_NET_DEVICE_OPS
 struct net_device_ops {
@@ -820,6 +822,7 @@ struct net_device_ops {
 						   struct nlattr *port[]);
 	int			(*ndo_get_vf_port)(struct net_device *dev,
 						   int vf, struct sk_buff *skb);
+	int			(*ndo_setup_tc)(struct net_device *dev, u8 tc);
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
 	int			(*ndo_fcoe_enable)(struct net_device *dev);
 	int			(*ndo_fcoe_disable)(struct net_device *dev);
diff --git a/include/linux/pkt_sched.h b/include/linux/pkt_sched.h
index 2cfa4bc..0134ed4 100644
--- a/include/linux/pkt_sched.h
+++ b/include/linux/pkt_sched.h
@@ -481,4 +481,13 @@ struct tc_drr_stats {
 	__u32	deficit;
 };
 
+/* MCLASS */
+struct tc_mclass_qopt {
+	__u8	num_tc;
+	__u8	prio_tc_map[16];
+	__u8	hw;
+	__u16	count[16];
+	__u16	offset[16];
+};
+
 #endif
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index ea1f8a8..2bbcd09 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -276,6 +276,7 @@ extern struct Qdisc noop_qdisc;
 extern struct Qdisc_ops noop_qdisc_ops;
 extern struct Qdisc_ops pfifo_fast_ops;
 extern struct Qdisc_ops mq_qdisc_ops;
+extern struct Qdisc_ops mclass_qdisc_ops;
 
 struct Qdisc_class_common {
 	u32			classid;
diff --git a/net/sched/Makefile b/net/sched/Makefile
index 960f5db..76dcf5b 100644
--- a/net/sched/Makefile
+++ b/net/sched/Makefile
@@ -2,7 +2,7 @@
 # Makefile for the Linux Traffic Control Unit.
 #
 
-obj-y	:= sch_generic.o sch_mq.o
+obj-y	:= sch_generic.o sch_mq.o sch_mclass.o
 
 obj-$(CONFIG_NET_SCHED)		+= sch_api.o sch_blackhole.o
 obj-$(CONFIG_NET_CLS)		+= cls_api.o
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index b22ca2d..24f40e0 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1770,6 +1770,7 @@ static int __init pktsched_init(void)
 	register_qdisc(&bfifo_qdisc_ops);
 	register_qdisc(&pfifo_head_drop_qdisc_ops);
 	register_qdisc(&mq_qdisc_ops);
+	register_qdisc(&mclass_qdisc_ops);
 
 	rtnl_register(PF_UNSPEC, RTM_NEWQDISC, tc_modify_qdisc, NULL);
 	rtnl_register(PF_UNSPEC, RTM_DELQDISC, tc_get_qdisc, NULL);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 0918834..73ed9b7 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -709,7 +709,13 @@ static void attach_default_qdiscs(struct net_device *dev)
 		dev->qdisc = txq->qdisc_sleeping;
 		atomic_inc(&dev->qdisc->refcnt);
 	} else {
-		qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops, TC_H_ROOT);
+		if (dev->num_tc)
+			qdisc = qdisc_create_dflt(txq, &mclass_qdisc_ops,
+						  TC_H_ROOT);
+		else
+			qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops,
+						  TC_H_ROOT);
+
 		if (qdisc) {
 			qdisc->ops->attach(qdisc);
 			dev->qdisc = qdisc;
diff --git a/net/sched/sch_mclass.c b/net/sched/sch_mclass.c
new file mode 100644
index 0000000..551b660
--- /dev/null
+++ b/net/sched/sch_mclass.c
@@ -0,0 +1,375 @@
+/*
+ * net/sched/sch_mclass.c
+ *
+ * Copyright (c) 2010 John Fastabend <john.r.fastabend@intel.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * version 2 as published by the Free Software Foundation.
+ */
+
+#include <linux/types.h>
+#include <linux/slab.h>
+#include <linux/kernel.h>
+#include <linux/string.h>
+#include <linux/errno.h>
+#include <linux/skbuff.h>
+#include <net/netlink.h>
+#include <net/pkt_sched.h>
+#include <net/sch_generic.h>
+
+struct mclass_sched {
+	struct Qdisc		**qdiscs;
+	int hw_owned;
+};
+
+static void mclass_destroy(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned int ntc;
+
+	if (!priv->qdiscs)
+		return;
+
+	for (ntc = 0; ntc < dev->num_tc && priv->qdiscs[ntc]; ntc++)
+		qdisc_destroy(priv->qdiscs[ntc]);
+
+	if (priv->hw_owned && dev->netdev_ops->ndo_setup_tc)
+		dev->netdev_ops->ndo_setup_tc(dev, 0);
+	else
+		netdev_set_num_tc(dev, 0);
+
+	kfree(priv->qdiscs);
+}
+
+static int mclass_parse_opt(struct net_device *dev, struct tc_mclass_qopt *qopt)
+{
+	int i, j;
+
+	/* Verify TC offset and count are sane */
+	for (i = 0; i < qopt->num_tc; i++) {
+		int last = qopt->offset[i] + qopt->count[i];
+		if (last > dev->num_tx_queues)
+			return -EINVAL;
+		for (j = i + 1; j < qopt->num_tc; j++) {
+			if (last > qopt->offset[j])
+				return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int mclass_init(struct Qdisc *sch, struct nlattr *opt)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	struct netdev_queue *dev_queue;
+	struct Qdisc *qdisc;
+	int i, err = -EOPNOTSUPP;
+	struct tc_mclass_qopt *qopt = NULL;
+
+	/* Unwind attributes on failure */
+	u8 unwnd_tc = dev->num_tc;
+	u8 unwnd_map[16];
+	struct netdev_tc_txq unwnd_txq[16];
+
+	if (sch->parent != TC_H_ROOT)
+		return -EOPNOTSUPP;
+
+	if (!netif_is_multiqueue(dev))
+		return -EOPNOTSUPP;
+
+	if (nla_len(opt) < sizeof(*qopt))
+		return -EINVAL;
+	qopt = nla_data(opt);
+
+	memcpy(unwnd_map, dev->prio_tc_map, sizeof(unwnd_map));
+	memcpy(unwnd_txq, dev->tc_to_txq, sizeof(unwnd_txq));
+
+	/* If the mclass options indicate that hardware should own
+	 * the queue mapping then run ndo_setup_tc if this can not
+	 * be done fail immediately.
+	 */
+	if (qopt->hw && dev->netdev_ops->ndo_setup_tc) {
+		priv->hw_owned = 1;
+		if (dev->netdev_ops->ndo_setup_tc(dev, qopt->num_tc))
+			return -EINVAL;
+	} else if (!qopt->hw) {
+		if (mclass_parse_opt(dev, qopt))
+			return -EINVAL;
+
+		if (netdev_set_num_tc(dev, qopt->num_tc))
+			return -ENOMEM;
+
+		for (i = 0; i < qopt->num_tc; i++)
+			netdev_set_tc_queue(dev, i,
+					    qopt->count[i], qopt->offset[i]);
+	} else {
+		return -EINVAL;
+	}
+
+	/* Always use supplied priority mappings */
+	for (i = 0; i < 16; i++) {
+		if (netdev_set_prio_tc_map(dev, i, qopt->prio_tc_map[i])) {
+			err = -EINVAL;
+			goto tc_err;
+		}
+	}
+
+	/* pre-allocate qdisc, attachment can't fail */
+	priv->qdiscs = kcalloc(qopt->num_tc,
+			       sizeof(priv->qdiscs[0]), GFP_KERNEL);
+	if (priv->qdiscs == NULL) {
+		err = -ENOMEM;
+		goto tc_err;
+	}
+
+	for (i = 0; i < dev->num_tc; i++) {
+		dev_queue = netdev_get_tx_queue(dev, dev->tc_to_txq[i].offset);
+		qdisc = qdisc_create_dflt(dev_queue, &mq_qdisc_ops,
+					  TC_H_MAKE(TC_H_MAJ(sch->handle),
+						    TC_H_MIN(i + 1)));
+		if (qdisc == NULL) {
+			err = -ENOMEM;
+			goto err;
+		}
+		qdisc->flags |= TCQ_F_CAN_BYPASS;
+		priv->qdiscs[i] = qdisc;
+	}
+
+	sch->flags |= TCQ_F_MQROOT;
+	return 0;
+
+err:
+	mclass_destroy(sch);
+tc_err:
+	if (priv->hw_owned)
+		dev->netdev_ops->ndo_setup_tc(dev, unwnd_tc);
+	else
+		netdev_set_num_tc(dev, unwnd_tc);
+
+	memcpy(dev->prio_tc_map, unwnd_map, sizeof(unwnd_map));
+	memcpy(dev->tc_to_txq, unwnd_txq, sizeof(unwnd_txq));
+
+	return err;
+}
+
+static void mclass_attach(struct Qdisc *sch)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	struct Qdisc *qdisc;
+	unsigned int ntc;
+
+	/* Attach underlying qdisc */
+	for (ntc = 0; ntc < dev->num_tc; ntc++) {
+		qdisc = priv->qdiscs[ntc];
+		if (qdisc->ops && qdisc->ops->attach)
+			qdisc->ops->attach(qdisc);
+	}
+}
+
+static int mclass_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
+		    struct Qdisc **old)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned long ntc = cl - 1;
+
+	if (ntc >= dev->num_tc)
+		return -EINVAL;
+
+	if (dev->flags & IFF_UP)
+		dev_deactivate(dev);
+
+	*old = priv->qdiscs[ntc];
+	if (new == NULL)
+		new = &noop_qdisc;
+	priv->qdiscs[ntc] = new;
+	qdisc_reset(*old);
+
+	if (dev->flags & IFF_UP)
+		dev_activate(dev);
+
+	return 0;
+}
+
+static int mclass_dump(struct Qdisc *sch, struct sk_buff *skb)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned char *b = skb_tail_pointer(skb);
+	struct tc_mclass_qopt opt;
+	struct Qdisc *qdisc;
+	unsigned int i;
+
+	sch->q.qlen = 0;
+	memset(&sch->bstats, 0, sizeof(sch->bstats));
+	memset(&sch->qstats, 0, sizeof(sch->qstats));
+
+	for (i = 0; i < dev->num_tx_queues; i++) {
+		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+		spin_lock_bh(qdisc_lock(qdisc));
+		sch->q.qlen		+= qdisc->q.qlen;
+		sch->bstats.bytes	+= qdisc->bstats.bytes;
+		sch->bstats.packets	+= qdisc->bstats.packets;
+		sch->qstats.qlen	+= qdisc->qstats.qlen;
+		sch->qstats.backlog	+= qdisc->qstats.backlog;
+		sch->qstats.drops	+= qdisc->qstats.drops;
+		sch->qstats.requeues	+= qdisc->qstats.requeues;
+		sch->qstats.overlimits	+= qdisc->qstats.overlimits;
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
+
+	opt.num_tc = dev->num_tc;
+	memcpy(opt.prio_tc_map, dev->prio_tc_map, 16);
+	opt.hw = priv->hw_owned;
+
+	for (i = 0; i < dev->num_tc; i++) {
+		opt.count[i] = dev->tc_to_txq[i].count;
+		opt.offset[i] = dev->tc_to_txq[i].offset;
+	}
+
+	NLA_PUT(skb, TCA_OPTIONS, sizeof(opt), &opt);
+
+	return skb->len;
+nla_put_failure:
+	nlmsg_trim(skb, b);
+	return -1;
+}
+
+static struct Qdisc *mclass_leaf(struct Qdisc *sch, unsigned long cl)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned long ntc = cl - 1;
+
+	if (ntc >= dev->num_tc)
+		return NULL;
+	return priv->qdiscs[ntc];
+}
+
+static unsigned long mclass_get(struct Qdisc *sch, u32 classid)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned int ntc = TC_H_MIN(classid);
+
+	if (ntc >= dev->num_tc)
+		return 0;
+	return ntc;
+}
+
+static void mclass_put(struct Qdisc *sch, unsigned long cl)
+{
+}
+
+static int mclass_dump_class(struct Qdisc *sch, unsigned long cl,
+			 struct sk_buff *skb, struct tcmsg *tcm)
+{
+	struct Qdisc *class;
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned long ntc = cl - 1;
+
+	if (ntc >= dev->num_tc)
+		return -EINVAL;
+
+	class = priv->qdiscs[ntc];
+
+	tcm->tcm_parent = TC_H_ROOT;
+	tcm->tcm_handle |= TC_H_MIN(cl);
+	tcm->tcm_info = class->handle;
+	return 0;
+}
+
+static int mclass_dump_class_stats(struct Qdisc *sch, unsigned long cl,
+			       struct gnet_dump *d)
+{
+	struct Qdisc *class, *qdisc;
+	struct net_device *dev = qdisc_dev(sch);
+	struct mclass_sched *priv = qdisc_priv(sch);
+	unsigned long ntc = cl - 1;
+	unsigned int i;
+	u16 count, offset;
+
+	if (ntc >= dev->num_tc)
+		return -EINVAL;
+
+	class = priv->qdiscs[ntc];
+	count = dev->tc_to_txq[ntc].count;
+	offset = dev->tc_to_txq[ntc].offset;
+
+	memset(&class->bstats, 0, sizeof(class->bstats));
+	memset(&class->qstats, 0, sizeof(class->qstats));
+
+	/* Drop lock here it will be reclaimed before touching statistics
+	 * this is required because the qdisc_root_sleeping_lock we hold
+	 * here is the lock on dev_queue->qdisc_sleeping also acquired
+	 * below.
+	 */
+	spin_unlock_bh(d->lock);
+
+	for (i = offset; i < offset + count; i++) {
+		qdisc = netdev_get_tx_queue(dev, i)->qdisc;
+		spin_lock_bh(qdisc_lock(qdisc));
+		class->q.qlen		 += qdisc->q.qlen;
+		class->bstats.bytes	 += qdisc->bstats.bytes;
+		class->bstats.packets	 += qdisc->bstats.packets;
+		class->qstats.qlen	 += qdisc->qstats.qlen;
+		class->qstats.backlog	 += qdisc->qstats.backlog;
+		class->qstats.drops	 += qdisc->qstats.drops;
+		class->qstats.requeues	 += qdisc->qstats.requeues;
+		class->qstats.overlimits += qdisc->qstats.overlimits;
+		spin_unlock_bh(qdisc_lock(qdisc));
+	}
+
+	/* Reclaim root sleeping lock before completing stats */
+	spin_lock_bh(d->lock);
+
+	class->qstats.qlen = class->q.qlen;
+	if (gnet_stats_copy_basic(d, &class->bstats) < 0 ||
+	    gnet_stats_copy_queue(d, &class->qstats) < 0)
+		return -1;
+	return 0;
+}
+
+static void mclass_walk(struct Qdisc *sch, struct qdisc_walker *arg)
+{
+	struct net_device *dev = qdisc_dev(sch);
+	unsigned long ntc;
+
+	if (arg->stop)
+		return;
+
+	arg->count = arg->skip;
+	for (ntc = arg->skip; ntc < dev->num_tc; ntc++) {
+		if (arg->fn(sch, ntc + 1, arg) < 0) {
+			arg->stop = 1;
+			break;
+		}
+		arg->count++;
+	}
+}
+
+static const struct Qdisc_class_ops mclass_class_ops = {
+	.graft		= mclass_graft,
+	.leaf		= mclass_leaf,
+	.get		= mclass_get,
+	.put		= mclass_put,
+	.walk		= mclass_walk,
+	.dump		= mclass_dump_class,
+	.dump_stats	= mclass_dump_class_stats,
+};
+
+struct Qdisc_ops mclass_qdisc_ops __read_mostly = {
+	.cl_ops		= &mclass_class_ops,
+	.id		= "mclass",
+	.priv_size	= sizeof(struct mclass_sched),
+	.init		= mclass_init,
+	.destroy	= mclass_destroy,
+	.attach		= mclass_attach,
+	.dump		= mclass_dump,
+	.owner		= THIS_MODULE,
+};



* [net-next-2.6 PATCH 4/4] net_sched: add MQSAFE flag to qdisc to identify mq like qdiscs
From: John Fastabend @ 2010-12-17 15:34 UTC (permalink / raw)
  To: davem; +Cc: netdev, hadi, shemminger, tgraf, eric.dumazet, nhorman
In-Reply-To: <20101217153439.12170.39538.stgit@jf-dev1-dcblab>

Add an MQSAFE flag to the qdisc schedulers that can be safely
managed by sch_mclass. Without this flag, schedulers that are
not aware of multiple tx queues can be grafted under the
mclass qdisc. Allowing incorrect qdiscs to be grafted causes
an invalid mapping from qdiscs to netdevice queues.
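
The graft check this flag enables can be sketched in userspace as follows. The flag values are the ones used in this patch; struct fake_qdisc and graft_allowed() are stand-ins invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Flag values as defined in include/net/sch_generic.h by this series. */
#define TCQ_F_MQROOT 16
#define TCQ_F_MQSAFE 32

/* Userspace stand-in for struct Qdisc; only the flags matter here. */
struct fake_qdisc {
	unsigned int flags;
};

/* Mirrors the check added to mclass_graft(): a NULL replacement is
 * accepted (the patch substitutes noop_qdisc, which is marked MQSAFE),
 * otherwise the replacement must carry TCQ_F_MQSAFE.
 * Returns 1 if the graft may proceed, 0 if it should be rejected. */
int graft_allowed(const struct fake_qdisc *new_q)
{
	if (new_q && !(new_q->flags & TCQ_F_MQSAFE))
		return 0;
	return 1;
}
```

In the patch itself the rejected case returns -EINVAL before the device is deactivated.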

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

 include/net/sch_generic.h |    1 +
 net/sched/sch_generic.c   |    2 +-
 net/sched/sch_mclass.c    |    5 +++--
 net/sched/sch_mq.c        |    3 +++
 4 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 2bbcd09..791df75 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -50,6 +50,7 @@ struct Qdisc {
 #define TCQ_F_INGRESS		4
 #define TCQ_F_CAN_BYPASS	8
 #define TCQ_F_MQROOT		16
+#define TCQ_F_MQSAFE		32
 #define TCQ_F_WARN_NONWC	(1 << 16)
 	int			padded;
 	struct Qdisc_ops	*ops;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 73ed9b7..1bcc0ed 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -376,7 +376,7 @@ static struct netdev_queue noop_netdev_queue = {
 struct Qdisc noop_qdisc = {
 	.enqueue	=	noop_enqueue,
 	.dequeue	=	noop_dequeue,
-	.flags		=	TCQ_F_BUILTIN,
+	.flags		=	TCQ_F_BUILTIN | TCQ_F_MQSAFE,
 	.ops		=	&noop_qdisc_ops,
 	.list		=	LIST_HEAD_INIT(noop_qdisc.list),
 	.q.lock		=	__SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
diff --git a/net/sched/sch_mclass.c b/net/sched/sch_mclass.c
index 551b660..444492a 100644
--- a/net/sched/sch_mclass.c
+++ b/net/sched/sch_mclass.c
@@ -178,15 +178,16 @@ static int mclass_graft(struct Qdisc *sch, unsigned long cl, struct Qdisc *new,
 	struct mclass_sched *priv = qdisc_priv(sch);
 	unsigned long ntc = cl - 1;
 
-	if (ntc >= dev->num_tc)
+	if (ntc >= dev->num_tc || (new && !(new->flags & TCQ_F_MQSAFE)))
 		return -EINVAL;
 
 	if (dev->flags & IFF_UP)
 		dev_deactivate(dev);
 
-	*old = priv->qdiscs[ntc];
 	if (new == NULL)
 		new = &noop_qdisc;
+
+	*old = priv->qdiscs[ntc];
 	priv->qdiscs[ntc] = new;
 	qdisc_reset(*old);
 
diff --git a/net/sched/sch_mq.c b/net/sched/sch_mq.c
index 35ed26d..493eaab 100644
--- a/net/sched/sch_mq.c
+++ b/net/sched/sch_mq.c
@@ -94,6 +94,9 @@ static int mq_init(struct Qdisc *sch, struct nlattr *opt)
 
 	if (!priv->num_tc)
 		sch->flags |= TCQ_F_MQROOT;
+	else
+		sch->flags |= TCQ_F_MQSAFE;
+
 	return 0;
 
 err:



* RE: [PATCH] e1000e: workaround missing power down mii control bit on 82571
From: Allan, Bruce W @ 2010-12-17 15:53 UTC (permalink / raw)
  To: Arthur Jones; +Cc: Ben Hutchings, Kirsher, Jeffrey T, netdev@vger.kernel.org
In-Reply-To: <20101217140453.GS18990@ajones-laptop.nbttech.com>

>-----Original Message-----
>From: netdev-owner@vger.kernel.org [mailto:netdev-owner@vger.kernel.org] On
>Behalf Of Arthur Jones
>Sent: Friday, December 17, 2010 6:05 AM
>To: Allan, Bruce W
>Cc: Ben Hutchings; Kirsher, Jeffrey T; netdev@vger.kernel.org
>Subject: Re: [PATCH] e1000e: workaround missing power down mii control bit on
>82571
>
>That's not what I saw.  I saw the bit always
>read back zero, even right after I set it.
>
>Have you tried writing the power down bit and
>reading it back?
>
>Arthur

Yes, and it worked as expected.


