Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next-2.6] tc: report informations for multiqueue devices
From: Eric Dumazet @ 2009-09-02 13:49 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David Miller, jarkao2, cl, netdev
In-Reply-To: <4A9E708D.5040806@trash.net>

Patrick McHardy a écrit :
> Eric Dumazet wrote:
>> [PATCH net-next-2.6] tc: report informations for multiqueue devices
>>
>> qdisc and classes are not yet displayed by "tc -s -d {qdisc|class} show"
>> for multiqueue devices.
>>
>> We use a new TCA_QINDEX attribute, to report queue index to user space.
>> iproute2 tc should be changed to eventually display this queue index as in :
>>
>> $ tc -s -d qdisc
>> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>  Sent 52498 bytes 465 pkt (dropped 0, overlimits 0 requeues 0)
>>  rate 0bit 0pps backlog 0b 0p requeues 0
>> qdisc pfifo_fast 0: dev eth0 qindex 1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>  rate 0bit 0pps backlog 0b 0p requeues 0
> 
> This might confuse existing userspace since the handle is not unique
> anymore. libnl f.i. will treat all but the first root qdisc as an
> update and use it to update the state of the first one. There's also
> no combined view for applications unaware of multiqueue.
> 
> Please have a look at the mail I just wrote for some possible ways
> around this.

Hum, how can we combine infos on qdisc/class if in the future we allow each queue index
to have its own qdisc/classes ?

htb on queue index 0
cbq on queue index 1

Combining info would lock us and not allow for special configurations.
Say 
   macvlan device 0 mapped to queue index 0
   macvlan device 1 mapped to queue index 1...

For old apps, just give informations for queue 0 as we do now, and
allow kernel to give more informations only if new application provided a TCA_INDEX attribute
in its request ?

(-1 : all queue indexes,  >=0 for a given queue index)

^ permalink raw reply

* Re: [PATCH] drop_monitor: make last_rx timestamp private
From: Neil Horman @ 2009-09-02 13:36 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eric.dumazet
In-Reply-To: <20090902105535.GB402@hmsreliant.think-freely.org>

Ok, heres a repost, unchanged saved for being massaged to fit in with the recent
changes that my origional patch raced against.  Thanks!
Neil


It was recently pointed out to me that the last_rx field of the net_device
structure wasn't updated regularly.  In fact only the bonding driver really uses
it currently.  Since the drop_monitor code relies on the last_rx field to detect
drops on recevie in hardware, We need to find a more reliable way to rate limit
our drop checks (so that we don't check for drops on every frame recevied, which
would be inefficient.  This patch makes a last_rx timestamp that is private to
the drop monitor code and is updated for every device that we track.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>


 drop_monitor.c |   15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
index d311202..0a113f2 100644
--- a/net/core/drop_monitor.c
+++ b/net/core/drop_monitor.c
@@ -52,6 +52,7 @@ struct per_cpu_dm_data {
 
 struct dm_hw_stat_delta {
 	struct net_device *dev;
+	unsigned long last_rx;
 	struct list_head list;
 	struct rcu_head rcu;
 	unsigned long last_drop_val;
@@ -180,18 +181,25 @@ static void trace_napi_poll_hit(struct napi_struct *napi)
 	struct dm_hw_stat_delta *new_stat;
 
 	/*
-	 * Ratelimit our check time to dm_hw_check_delta jiffies
+	 * Don't check napi structures with no associated device
 	 */
-	if (!napi->dev ||
-	    !time_after(jiffies, napi->dev->last_rx + dm_hw_check_delta))
+	if (!napi->dev)
 		return;
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
+		/*
+		 * only add a note to our monitor buffer if:
+		 * 1) this is the dev we received on
+		 * 2) its after the last_rx delta
+		 * 3) our rx_dropped count has gone up
+		 */
 		if ((new_stat->dev == napi->dev)  &&
+		    (time_after(jiffies, new_stat->last_rx + dm_hw_check_delta)) &&
 		    (napi->dev->stats.rx_dropped != new_stat->last_drop_val)) {
 			trace_drop_common(NULL, NULL);
 			new_stat->last_drop_val = napi->dev->stats.rx_dropped;
+			new_stat->last_rx = jiffies;
 			break;
 		}
 	}
@@ -287,6 +295,7 @@ static int dropmon_net_event(struct notifier_block *ev_block,
 			goto out;
 
 		new_stat->dev = dev;
+		new_stat->last_rx = jiffies;
 		INIT_RCU_HEAD(&new_stat->rcu);
 		spin_lock(&trace_state_lock);
 		list_add_rcu(&new_stat->list, &hw_stats_list);

^ permalink raw reply related

* RE: [PATCH] [V3] net: add Xilinx emac lite device driver
From: John Linn @ 2009-09-02 13:33 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: netdev, linuxppc-dev, davem, jgarzik, grant.likely, Josh Boyer,
	John Williams, Michal Simek, Sadanand Mutyala
In-Reply-To: <20090820094914.46f1db9c@nehalam>

> -----Original Message-----
> From: Stephen Hemminger [mailto:shemminger@vyatta.com]
> Sent: Thursday, August 20, 2009 5:49 PM
> To: John Linn
> Cc: netdev@vger.kernel.org; linuxppc-dev@ozlabs.org;
davem@davemloft.net; jgarzik@pobox.com; John
> Linn; grant.likely@secretlab.ca; Josh Boyer; John Williams; Michal
Simek; Sadanand Mutyala
> Subject: Re: [PATCH] [V3] net: add Xilinx emac lite device driver
> 
> On Thu, 20 Aug 2009 03:49:51 -0600
> John Linn <john.linn@xilinx.com> wrote:
> 
> > +/**
> > + * xemaclite_ioctl - Perform IO Control operations on the network
device
> > + * @dev:	Pointer to the network device
> > + * @rq:		Pointer to the interface request structure
> > + * @cmd:	IOCTL command
> > + *
> > + * The only IOCTL operation supported by this function is setting
the MAC
> > + * address. An error is reported if any other operations are
requested.
> > + *
> > + * Return:	0 to indicate success, or a negative error for failure.
> > + */
> > +static int xemaclite_ioctl(struct net_device *dev, struct ifreq
*rq, int cmd)
> > +{
> > +	struct net_local *lp = (struct net_local *) netdev_priv(dev);
> > +	struct hw_addr_data *hw_addr = (struct hw_addr_data *)
&rq->ifr_hwaddr;
> > +
> > +	switch (cmd) {
> > +	case SIOCETHTOOL:
> > +		return -EIO;
> > +
> > +	case SIOCSIFHWADDR:
> > +		dev_err(&lp->ndev->dev, "SIOCSIFHWADDR\n");
> > +
> > +		/* Copy MAC address in from user space */
> > +		copy_from_user((void __force *) dev->dev_addr,
> > +			       (void __user __force *) hw_addr,
> > +			       IFHWADDRLEN);
> > +		xemaclite_set_mac_address(lp, dev->dev_addr);
> > +		break;
> > +	default:
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	return 0;
> > +}
> 
> Do you really need this? I doubt the SIOCSIFHWADDR even reaches
driver!
> 
> The normal call path for setting hardware address is:
>   dev_ifsioc
>          dev_set_mac_address
>              ops->ndo_set_mac_address -->
> 
> The driver should be:
>    1. defining new code to do ndo_set_mac_address
>    2. remove existing xmaclite_ioctl - all ioctl's handled by upper
layers
> 
> FYI - the only ioctl's that make it to network device ndo_ioctl
>  are listed in dev_ifsioc
>    SIOCDEVPRIVATE ... SIOCDEVPRIVATE + 15
>    SIOCBOND*
>    SIOCMII*
>    SIOCBR*
>    SIOCHWTSTAMP
>    SIOCWANDEV
> 
> 

We agree and will be updating the driver for this change. Sorry about
the delay as I was on vacation.

Thanks,
John


This email and any attachments are intended for the sole use of the named recipient(s) and contain(s) confidential information that may be proprietary, privileged or copyrighted under applicable law. If you are not the intended recipient, do not read, copy, or forward this email message or any attachments. Delete this email message and any attachments immediately.



^ permalink raw reply

* Re: [PATCH net-next-2.6] tc: report informations for multiqueue devices
From: Patrick McHardy @ 2009-09-02 13:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, jarkao2, cl, netdev
In-Reply-To: <4A9E6551.4030209@gmail.com>

Eric Dumazet wrote:
> [PATCH net-next-2.6] tc: report informations for multiqueue devices
> 
> qdisc and classes are not yet displayed by "tc -s -d {qdisc|class} show"
> for multiqueue devices.
> 
> We use a new TCA_QINDEX attribute, to report queue index to user space.
> iproute2 tc should be changed to eventually display this queue index as in :
> 
> $ tc -s -d qdisc
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 52498 bytes 465 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 qindex 1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0

This might confuse existing userspace since the handle is not unique
anymore. libnl f.i. will treat all but the first root qdisc as an
update and use it to update the state of the first one. There's also
no combined view for applications unaware of multiqueue.

Please have a look at the mail I just wrote for some possible ways
around this.

^ permalink raw reply

* Re: about packets forwarding
From: Xiaofei Wu @ 2009-09-02 13:14 UTC (permalink / raw)
  To: Mark Smith; +Cc: linux netdev
In-Reply-To: <20090902201800.1dbb4988.lk-netdev@lk-netdev.nosense.org>





> Hello,
> 
> I have something to ask here.
> 
> The topology of the network is as follows.
> There are six Nodes (A, B, C, D, M, N).
> 
>   M
>   |
>   A
>  / \
> B   D
>  \ /
>   C
>   |
>   N
> 
> M-A, C-N are wired links.
> A-B, B-C, A-D, D-C are wireless links.
> 
> Node M wnats to communicate with node N. Because the wireless links
> are not very reliable, I want to forward the packets through A-B-C and
> A-D-C simultaneously (When Node A receives packets(from Node M) from
> its wired interface eth0, It will forward the same packets to its
> wireless interfaces wlan0 and wlan1 simultaneously) . How to implement
> this?
> 

------
Stephen Hemminger  <shemminger@linux-foundation.org> said,

* The traffic control command 'tc filter    mirred'  probably does this, but not sure.
* Other way to do the same thing would be iptables, not sure if there is an iptables target to mirror.
------

Can you give more information about how to implement this?


      


^ permalink raw reply

* Re: [PATCH 08/14] pktgen: reorganize transmit loop
From: Patrick McHardy @ 2009-09-02 13:06 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, greearb, robert.olsson, netdev, tglx
In-Reply-To: <20090901143033.0e9cb7ec@nehalam>

Stephen Hemminger wrote:
> On Fri, 28 Aug 2009 23:04:28 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
> 
>> I think Patrick's goals are quite sound, and with his patch we
>> could let pktgen transmit over vlan just like any other device.
>>
>> Otherwise we give no way to use pktgen to test the VLAN transmit path.
> 
> The only valid returns from device are OK, BUSY, or LOCKED
> anything else is an error and should never occur.
> 
> If the vlan code returns other values, it must translate.

That was part of my patches to allow hard_start_xmit() to return
NETDEV_TX codes, NET_XMIT codes to propagate transmission qdisc state
upwards through virtual devices, as well as errno codes to propagate
errors from virtual devices, like EHOSTUNREACH from an IP tunnel device.

I'll refresh that patch and will post it tonight or tommorrow.

^ permalink raw reply

* Re: [RFC] net/core: Delay neighbor only if it has been used after confirmed
From: Jens Rosenboom @ 2009-09-02 13:03 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki; +Cc: Linux Network Developers
In-Reply-To: <4A9E639B.20907@linux-ipv6.org>

On Wed, 2009-09-02 at 21:22 +0900, YOSHIFUJI Hideaki wrote:
> Hello.
> 
> Jens Rosenboom wrote:
> > When doing some IPv6 testing with the router advertising a small (e.g. 5
> > seconds) reachable time, I noticed that after the traffic has stopped,
> > hosts continue to exchange ND packets every 10 seconds. This is due to
> > neigh_timer_handler() only checking neigh->used and puts a neighbor into
> > NUD_DELAY state even if neigh->confirmed may be >= neigh->used.
> 
> Well, as you can see in neigh_periodic_timer():
> |                if (time_before(n->used, n->confirmed))
> |                        n->used = n->confirmed;
> time_after_eq(n->used, n->confirmed) should be taken valid;
> confirmed <= used <= now <= jiffies

Isn't there a chance that neigh_timer_handler() is run before the
periodic timer fixes this? Otherwise I agree that the test could just
say (n->used != n->confirmed).

> > The following patch for net-next-2.6 fixes this behaviour for my IPv6
> > setup, however I would like to hear some opinion on whether this might
> > have some negative influence on other protocols that use this code.
> > 
> > I also think that it would make more sense to compute the time for the
> > delay timer starting from neigh->used instead of using now (second part
> > of the patch).
> 
> okay, but I would rather have this in another patch.
> 
> > diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> > index 5bc4ad5..ca20162 100644
> > --- a/net/core/neighbour.c
> > +++ b/net/core/neighbour.c
> > @@ -820,12 +820,13 @@ static void neigh_timer_handler(unsigned long arg)
> >  			NEIGH_PRINTK2("neigh %p is still alive.\n", neigh);
> >  			next = neigh->confirmed + neigh->parms->reachable_time;
> >  		} else if (time_before_eq(now,
> > -					  neigh->used + neigh->parms->delay_probe_time)) {
> > +					  neigh->used + neigh->parms->delay_probe_time) &&
> > +			   time_after(neigh->confirmed, neigh->used)) {
> >  			NEIGH_PRINTK2("neigh %p is delayed.\n", neigh);
> >  			neigh->nud_state = NUD_DELAY;
> 
> I think your change should be
> | time_after(neigh->used, neigh->confirmed)
> or
> | time_before(neigh->confirmed, neigh->used)
> 
> ("_eq" is removed because there is a little chance
> that the neighbor had been confirmed just before it was
> used.  It is not interesting for us at this moment.)
> 
> No?

Yes, you are right, this test should be reversed. But together with your
remarks above, this probably means that the whole stuff also works fine
if we completely remove this if-branch.

> And, this "if" for REACHABLE->DELAY may be completely needless.
> Timer in REACHABLE is only for state transition for toward REACHABLE
> or STALE.

Well, this part has been there for a long time, at least it looks pretty
much the same in the first git version, so I would be a bit reluctant to
completely remove it, but since that would probably also solve my
problem, I also wouldn't object that proposition. ;-)


^ permalink raw reply

* Re: [patch] ipvs: Use atomic operations atomicly
From: Patrick McHardy @ 2009-09-02 12:56 UTC (permalink / raw)
  To: Simon Horman
  Cc: lvs-devel, netdev, netfilter-devel, ?? shin hong, David Miller
In-Reply-To: <20090901015909.GA12252@verge.net.au>

Simon Horman wrote:
> On Mon, Aug 31, 2009 at 02:22:26PM +0200, Patrick McHardy wrote:
>> It seems that proc_do_sync_threshold() should check whether this value
>> is zero. The current checks also look racy since incorrect values are
>> first updated, then overwritten again.
> 
> I'm wondering if an approach along the lines of the following is valid.
> The idea is that the value in the ctl_table is essentially a scratch
> value that is used by the parser and then copied into ip_vs_sync_threshold
> if it is valid.

Even simpler would be to use a temporary buffer on the stack for copying
the values from userspace and then copy them to the final buffer after
validation.

> I'm concerned that there are atomicity issues
> surrounding writing ip_vs_sync_threshold while there might be readers.

That might be a problem if they are required to be "synchronized".

> --- a/net/netfilter/ipvs/ip_vs_core.c
> +++ b/net/netfilter/ipvs/ip_vs_core.c
> @@ -1362,8 +1362,7 @@ ip_vs_in(unsigned int hooknum, struct sk_buff *skb,
>  	    (ip_vs_sync_state & IP_VS_STATE_MASTER) &&
>  	    (((cp->protocol != IPPROTO_TCP ||
>  	       cp->state == IP_VS_TCP_S_ESTABLISHED) &&
> -	      (pkts % sysctl_ip_vs_sync_threshold[1]
> -	       == sysctl_ip_vs_sync_threshold[0])) ||
> +	      (pkts % ip_vs_sync_threshold[1] == ip_vs_sync_threshold[0])) ||
>  	     ((cp->protocol == IPPROTO_TCP) && (cp->old_state != cp->state) &&
>  	      ((cp->state == IP_VS_TCP_S_FIN_WAIT) ||
>  	       (cp->state == IP_VS_TCP_S_CLOSE_WAIT) ||
> diff --git a/net/netfilter/ipvs/ip_vs_ctl.c b/net/netfilter/ipvs/ip_vs_ctl.c
> index fba2892..8a9ff21 100644
> --- a/net/netfilter/ipvs/ip_vs_ctl.c
> +++ b/net/netfilter/ipvs/ip_vs_ctl.c
> @@ -76,6 +76,11 @@ static atomic_t ip_vs_dropentry = ATOMIC_INIT(0);
>  /* number of virtual services */
>  static int ip_vs_num_services = 0;
>  
> +/* threshold handling */
> +static int ip_vs_sync_threshold_min = 0;
> +static int ip_vs_sync_threshold_max = INT_MAX;
> +int ip_vs_sync_threshold[2] = { 3, 50 };
> +

min should be 1 I guess or you still need to manually check
that ip_vs_sync_threshold[1] != 0 to avoid a division be zero.

^ permalink raw reply

* Re: [PATCH net-next-2.6] tc: report informations for multiqueue devices
From: Eric Dumazet @ 2009-09-02 12:48 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, jarkao2, cl, kaber, netdev
In-Reply-To: <4A9E6551.4030209@gmail.com>

Eric Dumazet a écrit :
> David Miller a écrit :
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Wed, 02 Sep 2009 10:28:55 +0200
>>
>>> What naming convention should we choose for multiqueue devices ?
>> We could give an index field to multiple root qdiscs assigned
>> to a device.
> 
> Here is a patch then :)
> 
> Only point is that I am iterating from 0 to dev->real_num_tx_queues
> instead of dev->num_tx_queues. I hope it's fine, because there are
> allocated qdisc, but not really used.
> 
> Next patches to allow selective qdisc change/fetch (providing a TCA_QINDEX
> selector value to kernel)
> 
> Thanks
> 
> 
> [PATCH net-next-2.6] tc: report informations for multiqueue devices
> 
> qdisc and classes are not yet displayed by "tc -s -d {qdisc|class} show"
> for multiqueue devices.
> 
> We use a new TCA_QINDEX attribute, to report queue index to user space.
> iproute2 tc should be changed to eventually display this queue index as in :
> 
> $ tc -s -d qdisc
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 52498 bytes 465 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 qindex 1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> 


Here is the iproute2 patch as well, to display queue indexes

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index ba3254e..b80e0f6 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -490,6 +490,7 @@ enum
 	TCA_FCNT,
 	TCA_STATS2,
 	TCA_STAB,
+	TCA_QINDEX,
 	__TCA_MAX
 };
 
diff --git a/tc/tc_class.c b/tc/tc_class.c
index 9d4eea5..1bc4bc6 100644
--- a/tc/tc_class.c
+++ b/tc/tc_class.c
@@ -196,6 +196,13 @@ int print_class(const struct sockaddr_nl *who,
 	if (filter_ifindex == 0)
 		fprintf(fp, "dev %s ", ll_index_to_name(t->tcm_ifindex));
 
+	if (tb[TCA_QINDEX]) {
+		int qindex = 0;
+
+		memcpy(&qindex, RTA_DATA(tb[TCA_QINDEX]), MIN(RTA_PAYLOAD(tb[TCA_QINDEX]), sizeof(int)));
+		fprintf(fp, "qindex %d ", qindex);
+	}
+
 	if (t->tcm_parent == TC_H_ROOT)
 		fprintf(fp, "root ");
 	else {
diff --git a/tc/tc_qdisc.c b/tc/tc_qdisc.c
index c7f2988..3ec2cf4 100644
--- a/tc/tc_qdisc.c
+++ b/tc/tc_qdisc.c
@@ -232,6 +232,14 @@ int print_qdisc(const struct sockaddr_nl *who,
 	fprintf(fp, "qdisc %s %x: ", (char*)RTA_DATA(tb[TCA_KIND]), t->tcm_handle>>16);
 	if (filter_ifindex == 0)
 		fprintf(fp, "dev %s ", ll_index_to_name(t->tcm_ifindex));
+
+	if (tb[TCA_QINDEX]) {
+		int qindex = 0;
+
+		memcpy(&qindex, RTA_DATA(tb[TCA_QINDEX]), MIN(RTA_PAYLOAD(tb[TCA_QINDEX]), sizeof(int)));
+		fprintf(fp, "qindex %d ", qindex);
+	}
+
 	if (t->tcm_parent == TC_H_ROOT)
 		fprintf(fp, "root ");
 	else if (t->tcm_parent) {

^ permalink raw reply related

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Patrick McHardy @ 2009-09-02 12:44 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: Eric Dumazet, Christoph Lameter, David Miller, netdev
In-Reply-To: <20090902093710.GA7097@ff.dom.local>

Jarek Poplawski wrote:
> On Wed, Sep 02, 2009 at 09:33:29AM +0000, Jarek Poplawski wrote:
>> On Wed, Sep 02, 2009 at 09:18:54AM +0000, Jarek Poplawski wrote:
>>> On Wed, Sep 02, 2009 at 10:28:55AM +0200, Eric Dumazet wrote:
>> ...
>>>> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>>>>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>>>>  rate 0bit 0pps backlog 0b 0p requeues 0
>>>>
>>>>
>>>> Same name "eth0" is displayed, that might confuse parsers...
>>>>
>>>> What naming convention should we choose for multiqueue devices ?
>>>>
>>> Hmm... anything could break here something for somebody, so there is
>>> still a (Patrick's) question if not sum it all? Otherwise, I wonder
>>> about using the qdisc handle (tcm_handle>>16): there would be at
>>> least one "pfifo_fast 0:" looking like proper root for somebody...
>> I meant "proper" for pfifo_fast. On the other hand, I wonder why these
>> multiqueue qdisc handles can't be really given such unique per dev
> 
> should be:
> multiqueue qdiscs can't be really given such unique per dev
> 
>> (instead of per queue) handles?

In my opinion the best way would be to sum up the stats and display
them for the "main" qdisc to avoid any compatibility problems and
additionally dump each queue for informational purposes.

This raises a few more questions however. First of all, there's no
"main" qdisc, so if we just use the first one for the summed up stats,
the question is whether the parameters of all root qdiscs are the same.
Currently they should be since pfifo_fast doesn't support changing
parameters, but this might change in the future, in which case we
might be displaying "incorrect" parameters.

The next question would be whether and how to support changing
parameters of individual multiq qdiscs. Similar to dumps, when
changing qdisc parameters we always pick queue number 0. It we want
to support changing parameters of different queues, we could
replicate changes to all queues, which would avoid the problem
stated above, or we could add an optional identifier for the
queue number, or we could use a combination of both which only
replicates settings when no queue number is specified.

In the second and third case, one possibility to get around the
incorrect parameters for the "main" qdisc would be to display
a virtual (non-existant) qdisc as the root, which only contains
the stats. I don't think the default choice of root qdisc type
should be counted as belonging to the API and userspace should
be prepared to deal with this.

And finally, the TC API used to be symetrical in the sense that
you could replay a dump to the kernel to get the same configuration.
Dumping the individual queues would break that property unless
we also support creating and configuring them in a symetrical
fashion.

So here's something of a possible solution, assuming that at some
point we do want to support changing parameters for individual root
qdiscs and taking the above into account:

- when creating a qdisc on a multiqueue-device through the TC
  API, we keep the current behaviour and share it between all
  queues. Configuration changes don't need to be replicated since
  the qdisc is shared. Dumps only contain a single instance of
  the qdisc. This is exactly what is done today.

- for non-shared root qdiscs on a multiqueue device like the
  pfifo_fast qdiscs created by default, we dump a virtual root
  qdisc (multiqueue) which only contains the number of bands
  and the statistics and dump the real root qdiscs as children
  of that qdisc.

- to create or configure a qdisc for a specific queue number, we
  require the user to create the virtual root qdisc (which only
  sets a flag or something like that) and then address each queue
  number. When the virtual qdisc has been created, we don't
  replicate any changes. This restores symetry with dumps and
  allows to address individual queues.

This is just a first collection of thoughts and probably can be
improved. Comments welcome :)

^ permalink raw reply

* [PATCH] tc: Fix unitialized kernel memory leak
From: Eric Dumazet @ 2009-09-02 12:40 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

Three bytes of uninitialized kernel memory are currently leaked to user

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 24d17ce..fdb694e 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -1456,6 +1456,8 @@ static int tc_fill_tclass(struct sk_buff *skb, struct Qdisc *q,
 	nlh = NLMSG_NEW(skb, pid, seq, event, sizeof(*tcm), flags);
 	tcm = NLMSG_DATA(nlh);
 	tcm->tcm_family = AF_UNSPEC;
+	tcm->tcm__pad1 = 0;
+	tcm->tcm__pad2 = 0;
 	tcm->tcm_ifindex = qdisc_dev(q)->ifindex;
 	tcm->tcm_parent = q->handle;
 	tcm->tcm_handle = q->handle;

^ permalink raw reply related

* Re: about packets forwarding
From: wu xiaofei @ 2009-09-02 12:31 UTC (permalink / raw)
  To: Mark Smith; +Cc: netdev
In-Reply-To: <20090902201800.1dbb4988.lk-netdev@lk-netdev.nosense.org>

2009/9/2 Mark Smith <lk-netdev@lk-netdev.nosense.org>:
> On Wed, 2 Sep 2009 17:03:42 +0800
> wu xiaofei <lampsu@gmail.com> wrote:
>
>> Hello,
>>
>> I have something to ask here.
>>
>> The topology of the network is as follows.
>> There are six Nodes (A, B, C, D, M, N).
>>
>>   M
>>   |
>>   A
>>  / \
>> B   D
>>  \ /
>>   C
>>   |
>>   N
>>
>> M-A, C-N are wired links.
>> A-B, B-C, A-D, D-C are wireless links.
>>
>> Node M wnats to communicate with node N. Because the wireless links
>> are not very reliable, I want to forward the packets through A-B-C and
>> A-D-C simultaneously (When Node A receives packets(from Node M) from
>> its wired interface eth0, It will forward the same packets to its
>> wireless interfaces wlan0 and wlan1 simultaneously) . How to implement
>> this?
>>
>
> Packets being duplicated in this manner is generally considered an
> error - protocols are designed to deal with it, but only as an error
> robustness mechanism. In other words, it won't break anything, but
> performance is very likely to suffer.
>

I will process the duplicated packets on node C, just forward one copy
to node N.
My purpose:
to improve the reliablity of the wireless links, make our individual
network more robust.
If the path A-B-C is not available, maybe the path A-D-C is still
available, so the communication between node M and node N will not be
interrupted.
The probability of the path A-B-C and A-D-C broken at the same time   is small.

> The Internet protocols are designed on the assumption of a low
> possibility of packet loss (1 in 100)- they only assume each
> link in the path will have an error detection mechanism (and drop them
> when errors occur), but not an error correction mechanism. If the
> possibility of packet loss is high (1 in 10), then it is expected that
> the link layer itself will implement error detection and recovery.
>

We use 802.11b/g wireless cards on our nodes.
The wireless links may have varying quality in terms of packet loss,
data rates, and interference. It's not very stable. It's very
different from wired links.

--
Wu

^ permalink raw reply

* Re: [RFC] net/core: Delay neighbor only if it has been used after confirmed
From: YOSHIFUJI Hideaki @ 2009-09-02 12:22 UTC (permalink / raw)
  To: Jens Rosenboom; +Cc: Linux Network Developers, YOSHIFUJI Hideaki
In-Reply-To: <1251883079.5813.18.camel@fnki-nb00130>

Hello.

Jens Rosenboom wrote:
> When doing some IPv6 testing with the router advertising a small (e.g. 5
> seconds) reachable time, I noticed that after the traffic has stopped,
> hosts continue to exchange ND packets every 10 seconds. This is due to
> neigh_timer_handler() only checking neigh->used and puts a neighbor into
> NUD_DELAY state even if neigh->confirmed may be >= neigh->used.

Well, as you can see in neigh_periodic_timer():
|                if (time_before(n->used, n->confirmed))
|                        n->used = n->confirmed;
time_after_eq(n->used, n->confirmed) should be taken valid;
confirmed <= used <= now <= jiffies

> The following patch for net-next-2.6 fixes this behaviour for my IPv6
> setup, however I would like to hear some opinion on whether this might
> have some negative influence on other protocols that use this code.
> 
> I also think that it would make more sense to compute the time for the
> delay timer starting from neigh->used instead of using now (second part
> of the patch).

okay, but I would rather have this in another patch.

> diff --git a/net/core/neighbour.c b/net/core/neighbour.c
> index 5bc4ad5..ca20162 100644
> --- a/net/core/neighbour.c
> +++ b/net/core/neighbour.c
> @@ -820,12 +820,13 @@ static void neigh_timer_handler(unsigned long arg)
>  			NEIGH_PRINTK2("neigh %p is still alive.\n", neigh);
>  			next = neigh->confirmed + neigh->parms->reachable_time;
>  		} else if (time_before_eq(now,
> -					  neigh->used + neigh->parms->delay_probe_time)) {
> +					  neigh->used + neigh->parms->delay_probe_time) &&
> +			   time_after(neigh->confirmed, neigh->used)) {
>  			NEIGH_PRINTK2("neigh %p is delayed.\n", neigh);
>  			neigh->nud_state = NUD_DELAY;

I think your change should be
| time_after(neigh->used, neigh->confirmed)
or
| time_before(neigh->confirmed, neigh->used)

("_eq" is removed because there is a little chance
that the neighbor had been confirmed just before it was
used.  It is not interesting for us at this moment.)

No?

And, this "if" for REACHABLE->DELAY may be completely needless.
Timer in REACHABLE is only for state transition for toward REACHABLE
or STALE.

--yoshfuji

^ permalink raw reply

* [PATCH net-next-2.6] tc: report informations for multiqueue devices
From: Eric Dumazet @ 2009-09-02 12:30 UTC (permalink / raw)
  To: David Miller; +Cc: jarkao2, cl, kaber, netdev
In-Reply-To: <20090902.013002.181288977.davem@davemloft.net>

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Wed, 02 Sep 2009 10:28:55 +0200
> 
>> What naming convention should we choose for multiqueue devices ?
> 
> We could give an index field to multiple root qdiscs assigned
> to a device.

Here is a patch then :)

Only point is that I am iterating from 0 to dev->real_num_tx_queues
instead of dev->num_tx_queues. I hope it's fine, because there are
allocated qdisc, but not really used.

Next patches to allow selective qdisc change/fetch (providing a TCA_QINDEX
selector value to kernel)

Thanks


[PATCH net-next-2.6] tc: report informations for multiqueue devices

qdisc and classes are not yet displayed by "tc -s -d {qdisc|class} show"
for multiqueue devices.

We use a new TCA_QINDEX attribute, to report queue index to user space.
iproute2 tc should be changed to eventually display this queue index as in :

$ tc -s -d qdisc
qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 52498 bytes 465 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0
qdisc pfifo_fast 0: dev eth0 qindex 1 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
 Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
 rate 0bit 0pps backlog 0b 0p requeues 0



Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/linux/rtnetlink.h |    1
 net/sched/sch_api.c       |  118 ++++++++++++++++++++----------------
 2 files changed, 67 insertions(+), 52 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index ba3254e..b80e0f6 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -490,6 +490,7 @@ enum
 	TCA_FCNT,
 	TCA_STATS2,
 	TCA_STAB,
+	TCA_QINDEX,
 	__TCA_MAX
 };
 
diff --git a/net/sched/sch_api.c b/net/sched/sch_api.c
index 24d17ce..74cde83 100644
--- a/net/sched/sch_api.c
+++ b/net/sched/sch_api.c
@@ -35,9 +35,9 @@
 #include <net/pkt_sched.h>
 
 static int qdisc_notify(struct sk_buff *oskb, struct nlmsghdr *n, u32 clid,
-			struct Qdisc *old, struct Qdisc *new);
+			struct Qdisc *old, struct Qdisc *new, int qnum);
 static int tclass_notify(struct sk_buff *oskb, struct nlmsghdr *n,
-			 struct Qdisc *q, unsigned long cl, int event);
+			 struct Qdisc *q, unsigned long cl, int event, int qnum);
 
 /*
 
@@ -671,10 +671,10 @@ void qdisc_tree_decrease_qlen(struct Qdisc *sch, unsigned int n)
 EXPORT_SYMBOL(qdisc_tree_decrease_qlen);
 
 static void notify_and_destroy(struct sk_buff *skb, struct nlmsghdr *n, u32 clid,
-			       struct Qdisc *old, struct Qdisc *new)
+			       struct Qdisc *old, struct Qdisc *new, int qnum)
 {
 	if (new || old)
-		qdisc_notify(skb, n, clid, old, new);
+		qdisc_notify(skb, n, clid, old, new, qnum);
 
 	if (old)
 		qdisc_destroy(old);
@@ -720,7 +720,7 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 			if (new && i > 0)
 				atomic_inc(&new->refcnt);
 
-			notify_and_destroy(skb, n, classid, old, new);
+			notify_and_destroy(skb, n, classid, old, new, i);
 		}
 
 		if (dev->flags & IFF_UP)
@@ -738,7 +738,7 @@ static int qdisc_graft(struct net_device *dev, struct Qdisc *parent,
 			}
 		}
 		if (!err)
-			notify_and_destroy(skb, n, classid, old, new);
+			notify_and_destroy(skb, n, classid, old, new, 0);
 	}
 	return err;
 }
@@ -999,7 +999,7 @@ static int tc_get_qdisc(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
 		if ((err = qdisc_graft(dev, p, skb, n, clid, NULL, q)) != 0)
 			return err;
 	} else {
-		qdisc_notify(skb, n, clid, NULL, q);
+		qdisc_notify(skb, n, clid, NULL, q, 0);
 	}
 	return 0;
 }
@@ -1116,7 +1116,7 @@ replay:
 		return -EINVAL;
 	err = qdisc_change(q, tca);
 	if (err == 0)
-		qdisc_notify(skb, n, clid, NULL, q);
+		qdisc_notify(skb, n, clid, NULL, q, 0);
 	return err;
 
 create_n_graft:
@@ -1148,7 +1148,7 @@ graft:
 }
 
 static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
-			 u32 pid, u32 seq, u16 flags, int event)
+			 u32 pid, u32 seq, u16 flags, int event, int qnum)
 {
 	struct tcmsg *tcm;
 	struct nlmsghdr  *nlh;
@@ -1187,6 +1187,9 @@ static int tc_fill_qdisc(struct sk_buff *skb, struct Qdisc *q, u32 clid,
 	if (gnet_stats_finish_copy(&d) < 0)
 		goto nla_put_failure;
 
+	if (qnum)
+		NLA_PUT_U32(skb, TCA_QINDEX, qnum);
+
 	nlh->nlmsg_len = skb_tail_pointer(skb) - b;
 	return skb->len;
 
@@ -1197,7 +1200,8 @@ nla_put_failure:
 }
 
 static int qdisc_notify(struct sk_buff *oskb, struct nlmsghdr *n,
-			u32 clid, struct Qdisc *old, struct Qdisc *new)
+			u32 clid, struct Qdisc *old, struct Qdisc *new,
+			int qnum)
 {
 	struct sk_buff *skb;
 	u32 pid = oskb ? NETLINK_CB(oskb).pid : 0;
@@ -1207,11 +1211,13 @@ static int qdisc_notify(struct sk_buff *oskb, struct nlmsghdr *n,
 		return -ENOBUFS;
 
 	if (old && old->handle) {
-		if (tc_fill_qdisc(skb, old, clid, pid, n->nlmsg_seq, 0, RTM_DELQDISC) < 0)
+		if (tc_fill_qdisc(skb, old, clid, pid, n->nlmsg_seq, 0,
+				  RTM_DELQDISC, qnum) < 0)
 			goto err_out;
 	}
 	if (new) {
-		if (tc_fill_qdisc(skb, new, clid, pid, n->nlmsg_seq, old ? NLM_F_REPLACE : 0, RTM_NEWQDISC) < 0)
+		if (tc_fill_qdisc(skb, new, clid, pid, n->nlmsg_seq,
+				  old ? NLM_F_REPLACE : 0, RTM_NEWQDISC, qnum) < 0)
 			goto err_out;
 	}
 
@@ -1230,7 +1236,7 @@ static bool tc_qdisc_dump_ignore(struct Qdisc *q)
 
 static int tc_dump_qdisc_root(struct Qdisc *root, struct sk_buff *skb,
 			      struct netlink_callback *cb,
-			      int *q_idx_p, int s_q_idx)
+			      int *q_idx_p, int s_q_idx, int qnum)
 {
 	int ret = 0, q_idx = *q_idx_p;
 	struct Qdisc *q;
@@ -1239,23 +1245,18 @@ static int tc_dump_qdisc_root(struct Qdisc *root, struct sk_buff *skb,
 		return 0;
 
 	q = root;
-	if (q_idx < s_q_idx) {
-		q_idx++;
-	} else {
-		if (!tc_qdisc_dump_ignore(q) &&
-		    tc_fill_qdisc(skb, q, q->parent, NETLINK_CB(cb->skb).pid,
-				  cb->nlh->nlmsg_seq, NLM_F_MULTI, RTM_NEWQDISC) <= 0)
-			goto done;
-		q_idx++;
-	}
+	if (q_idx >= s_q_idx &&
+	    !tc_qdisc_dump_ignore(q) &&
+	    tc_fill_qdisc(skb, q, q->parent, NETLINK_CB(cb->skb).pid,
+			  cb->nlh->nlmsg_seq, NLM_F_MULTI, RTM_NEWQDISC, qnum) <= 0)
+		goto done;
+	q_idx++;
+
 	list_for_each_entry(q, &root->list, list) {
-		if (q_idx < s_q_idx) {
-			q_idx++;
-			continue;
-		}
-		if (!tc_qdisc_dump_ignore(q) && 
+		if (q_idx >= s_q_idx &&
+		    !tc_qdisc_dump_ignore(q) && 
 		    tc_fill_qdisc(skb, q, q->parent, NETLINK_CB(cb->skb).pid,
-				  cb->nlh->nlmsg_seq, NLM_F_MULTI, RTM_NEWQDISC) <= 0)
+				  cb->nlh->nlmsg_seq, NLM_F_MULTI, RTM_NEWQDISC, qnum) <= 0)
 			goto done;
 		q_idx++;
 	}
@@ -1284,6 +1285,7 @@ static int tc_dump_qdisc(struct sk_buff *skb, struct netlink_callback *cb)
 	idx = 0;
 	for_each_netdev(&init_net, dev) {
 		struct netdev_queue *dev_queue;
+		int ntx;
 
 		if (idx < s_idx)
 			goto cont;
@@ -1291,12 +1293,15 @@ static int tc_dump_qdisc(struct sk_buff *skb, struct netlink_callback *cb)
 			s_q_idx = 0;
 		q_idx = 0;
 
-		dev_queue = netdev_get_tx_queue(dev, 0);
-		if (tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb, &q_idx, s_q_idx) < 0)
-			goto done;
-
+		for (ntx = 0 ; ntx < dev->real_num_tx_queues; ntx++) {
+			dev_queue = netdev_get_tx_queue(dev, ntx);
+			if (tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb,
+					       cb, &q_idx, s_q_idx, ntx) < 0)
+				goto done;
+		}
 		dev_queue = &dev->rx_queue;
-		if (tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb, &q_idx, s_q_idx) < 0)
+		if (tc_dump_qdisc_root(dev_queue->qdisc_sleeping, skb, cb,
+				       &q_idx, s_q_idx, 0) < 0)
 			goto done;
 
 cont:
@@ -1419,10 +1424,10 @@ static int tc_ctl_tclass(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
 		case RTM_DELTCLASS:
 			err = cops->delete(q, cl);
 			if (err == 0)
-				tclass_notify(skb, n, q, cl, RTM_DELTCLASS);
+				tclass_notify(skb, n, q, cl, RTM_DELTCLASS, 0);
 			goto out;
 		case RTM_GETTCLASS:
-			err = tclass_notify(skb, n, q, cl, RTM_NEWTCLASS);
+			err = tclass_notify(skb, n, q, cl, RTM_NEWTCLASS, 0);
 			goto out;
 		default:
 			err = -EINVAL;
@@ -1433,7 +1438,7 @@ static int tc_ctl_tclass(struct sk_buff *skb, struct nlmsghdr *n, void *arg)
 	new_cl = cl;
 	err = cops->change(q, clid, pid, tca, &new_cl);
 	if (err == 0)
-		tclass_notify(skb, n, q, new_cl, RTM_NEWTCLASS);
+		tclass_notify(skb, n, q, new_cl, RTM_NEWTCLASS, 0);
 
 out:
 	if (cl)
@@ -1445,7 +1450,7 @@ out:
 
 static int tc_fill_tclass(struct sk_buff *skb, struct Qdisc *q,
 			  unsigned long cl,
-			  u32 pid, u32 seq, u16 flags, int event)
+			  u32 pid, u32 seq, u16 flags, int event, int qnum)
 {
 	struct tcmsg *tcm;
 	struct nlmsghdr  *nlh;
@@ -1474,6 +1479,9 @@ static int tc_fill_tclass(struct sk_buff *skb, struct Qdisc *q,
 	if (gnet_stats_finish_copy(&d) < 0)
 		goto nla_put_failure;
 
+	if (qnum)
+		NLA_PUT_U32(skb, TCA_QINDEX, qnum);
+
 	nlh->nlmsg_len = skb_tail_pointer(skb) - b;
 	return skb->len;
 
@@ -1484,7 +1492,8 @@ nla_put_failure:
 }
 
 static int tclass_notify(struct sk_buff *oskb, struct nlmsghdr *n,
-			  struct Qdisc *q, unsigned long cl, int event)
+			 struct Qdisc *q, unsigned long cl, int event,
+			 int qnum)
 {
 	struct sk_buff *skb;
 	u32 pid = oskb ? NETLINK_CB(oskb).pid : 0;
@@ -1493,7 +1502,7 @@ static int tclass_notify(struct sk_buff *oskb, struct nlmsghdr *n,
 	if (!skb)
 		return -ENOBUFS;
 
-	if (tc_fill_tclass(skb, q, cl, pid, n->nlmsg_seq, 0, event) < 0) {
+	if (tc_fill_tclass(skb, q, cl, pid, n->nlmsg_seq, 0, event, qnum) < 0) {
 		kfree_skb(skb);
 		return -EINVAL;
 	}
@@ -1503,9 +1512,10 @@ static int tclass_notify(struct sk_buff *oskb, struct nlmsghdr *n,
 
 struct qdisc_dump_args
 {
-	struct qdisc_walker w;
-	struct sk_buff *skb;
+	struct qdisc_walker	w;
+	struct sk_buff		*skb;
 	struct netlink_callback *cb;
+	int			qnum;
 };
 
 static int qdisc_class_dump(struct Qdisc *q, unsigned long cl, struct qdisc_walker *arg)
@@ -1513,12 +1523,13 @@ static int qdisc_class_dump(struct Qdisc *q, unsigned long cl, struct qdisc_walk
 	struct qdisc_dump_args *a = (struct qdisc_dump_args *)arg;
 
 	return tc_fill_tclass(a->skb, q, cl, NETLINK_CB(a->cb->skb).pid,
-			      a->cb->nlh->nlmsg_seq, NLM_F_MULTI, RTM_NEWTCLASS);
+			      a->cb->nlh->nlmsg_seq, NLM_F_MULTI, RTM_NEWTCLASS,
+			      a->qnum);
 }
 
 static int tc_dump_tclass_qdisc(struct Qdisc *q, struct sk_buff *skb,
 				struct tcmsg *tcm, struct netlink_callback *cb,
-				int *t_p, int s_t)
+				int *t_p, int s_t, int qnum)
 {
 	struct qdisc_dump_args arg;
 
@@ -1537,6 +1548,7 @@ static int tc_dump_tclass_qdisc(struct Qdisc *q, struct sk_buff *skb,
 	arg.w.stop  = 0;
 	arg.w.skip = cb->args[1];
 	arg.w.count = 0;
+	arg.qnum = qnum;
 	q->ops->cl_ops->walk(q, &arg.w);
 	cb->args[1] = arg.w.count;
 	if (arg.w.stop)
@@ -1547,18 +1559,18 @@ static int tc_dump_tclass_qdisc(struct Qdisc *q, struct sk_buff *skb,
 
 static int tc_dump_tclass_root(struct Qdisc *root, struct sk_buff *skb,
 			       struct tcmsg *tcm, struct netlink_callback *cb,
-			       int *t_p, int s_t)
+			       int *t_p, int s_t, int qnum)
 {
 	struct Qdisc *q;
 
 	if (!root)
 		return 0;
 
-	if (tc_dump_tclass_qdisc(root, skb, tcm, cb, t_p, s_t) < 0)
+	if (tc_dump_tclass_qdisc(root, skb, tcm, cb, t_p, s_t, qnum) < 0)
 		return -1;
 
 	list_for_each_entry(q, &root->list, list) {
-		if (tc_dump_tclass_qdisc(q, skb, tcm, cb, t_p, s_t) < 0)
+		if (tc_dump_tclass_qdisc(q, skb, tcm, cb, t_p, s_t, qnum) < 0)
 			return -1;
 	}
 
@@ -1571,7 +1583,7 @@ static int tc_dump_tclass(struct sk_buff *skb, struct netlink_callback *cb)
 	struct net *net = sock_net(skb->sk);
 	struct netdev_queue *dev_queue;
 	struct net_device *dev;
-	int t, s_t;
+	int t, s_t, ntx;
 
 	if (net != &init_net)
 		return 0;
@@ -1584,12 +1596,14 @@ static int tc_dump_tclass(struct sk_buff *skb, struct netlink_callback *cb)
 	s_t = cb->args[0];
 	t = 0;
 
-	dev_queue = netdev_get_tx_queue(dev, 0);
-	if (tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb, &t, s_t) < 0)
-		goto done;
-
+	for (ntx = 0 ; ntx < dev->real_num_tx_queues; ntx++) {
+		dev_queue = netdev_get_tx_queue(dev, ntx);
+		if (tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm,
+					cb, &t, s_t, ntx) < 0)
+			goto done;
+	}
 	dev_queue = &dev->rx_queue;
-	if (tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb, &t, s_t) < 0)
+	if (tc_dump_tclass_root(dev_queue->qdisc_sleeping, skb, tcm, cb, &t, s_t, 0) < 0)
 		goto done;
 
 done:

^ permalink raw reply related

* Re: 2.6.31 ARP related problems
From: Or Gerlitz @ 2009-09-02 12:20 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: netdev, Eric Dumazet, Duyck, Alexander H, Kirsher, Jeffrey T,
	David Miller
In-Reply-To: <m163c2ztrf.fsf@fess.ebiederm.org>

Eric W. Biederman wrote:
> I just tested.  If the two macvlans are in separate network namespaces all is well,
> so definitely not macvlan. As you have observed there are no real changes to arp.c

Yes, it's not macvlan to blame, I just tested it in SR-IOV scheme with igb/igbvf and
I see the same problem, only ping that goes through / targeted to the IP address of the first
VF device routing hit works, which means SR-IOV isn't really usable with 2.6.31 when you want 
a multiple VMs scheme, each attached to a different VF and all VMs on the same network segment.

Or.

^ permalink raw reply

* [RFC] net/fs_enet: send a reset request to the PHY on init
From: Sebastian Andrzej Siewior @ 2009-09-02 11:04 UTC (permalink / raw)
  To: Pantelis Antoniou, Vitaly Bordug; +Cc: linuxppc-dev, netdev

Usually u-boot sends a phy request in its network init routine. An uboot
without network support doesn't do it and I endup without working
network. I still can switch between 10/100Mbit (according to the LED on
the hub and phy registers) but I can't send or receive any data.

At this point I'm not sure if the PowerON Reset takes the PHY a few
nsecs too early out of reset or if this reset is required and everyone
relies on U-boot performing this reset.

Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
This is done on a custom mpc512x board. Unfortunately I don't have other
boards to check. The PHY is a AMD Am79C874, phylib uses the generic one.

 drivers/net/fs_enet/fs_enet-main.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/net/fs_enet/fs_enet-main.c b/drivers/net/fs_enet/fs_enet-main.c
index ee15402..a3c962b 100644
--- a/drivers/net/fs_enet/fs_enet-main.c
+++ b/drivers/net/fs_enet/fs_enet-main.c
@@ -823,7 +823,8 @@ static int fs_init_phy(struct net_device *dev)
 	}

 	fep->phydev = phydev;
-
+	phy_write(phydev, MII_BMCR, BMCR_RESET);
+	udelay(1);
 	return 0;
 }

-- 
1.6.4.GIT

^ permalink raw reply related

* Re: [PATCH] drop_monitor: make last_rx timestamp private
From: Neil Horman @ 2009-09-02 10:55 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, eric.dumazet
In-Reply-To: <20090901.182127.96935772.davem@davemloft.net>

On Tue, Sep 01, 2009 at 06:21:27PM -0700, David Miller wrote:
> From: Neil Horman <nhorman@tuxdriver.com>
> Date: Mon, 31 Aug 2009 15:58:47 -0400
> 
> > It was recently pointed out to me that the last_rx field of the net_device
> > structure wasn't updated regularly.  In fact only the bonding driver really uses
> > it currently.  Since the drop_monitor code relies on the last_rx field to detect
> > drops on recevie in hardware, We need to find a more reliable way to rate limit
> > our drop checks (so that we don't check for drops on every frame recevied, which
> > would be inefficient.  This patch makes a last_rx timestamp that is private to
> > the drop monitor code and is updated for every device that we track.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> 
> Neil, this doesn't apply to net-next-2.6:
> 
> > diff --git a/net/core/drop_monitor.c b/net/core/drop_monitor.c
> > index 9d66fa9..34a05ce 100644
> > --- a/net/core/drop_monitor.c
> > +++ b/net/core/drop_monitor.c
>  ...
> > @@ -179,18 +180,21 @@ static void trace_napi_poll_hit(struct napi_struct *napi)
> >  {
> >  	struct dm_hw_stat_delta *new_stat;
> >  
> > -	/*
> > -	 * Ratelimit our check time to dm_hw_check_delta jiffies
> > -	 */
> > -	if (!time_after(jiffies, napi->dev->last_rx + dm_hw_check_delta))
> > -		return;
> >  
> >  	rcu_read_lock();
> >  	list_for_each_entry_rcu(new_stat, &hw_stats_list, list) {
> 
> In net-next-2.6 this test reads:
> 
> 	/*
> 	 * Ratelimit our check time to dm_hw_check_delta jiffies
> 	 */
> 	if (!napi->dev ||
> 	    !time_after(jiffies, napi->dev->last_rx + dm_hw_check_delta))
> 		return;
> 
> and you must retain the napi->dev NULL check there as otherwise
> the list traversal tests will blindly dereference that pointer.
> 

Apologies, this raced with Xiao fix to trace_napi_poll hit, which introduced
that null check.  I'll rediff/repost shortly.

Neil


^ permalink raw reply

* Re: Receive side performance issue with multi-10-GigE and NUMA
From: Neil Horman @ 2009-09-02 10:49 UTC (permalink / raw)
  To: Bill Fink; +Cc: Linux Network Developers, brice, gallatin
In-Reply-To: <20090902011143.89359828.billfink@mindspring.com>

On Wed, Sep 02, 2009 at 01:11:43AM -0400, Bill Fink wrote:
> On Thu, 27 Aug 2009, Neil Horman wrote:
> 
> > On Thu, Aug 27, 2009 at 01:44:29PM -0400, Bill Fink wrote:
> > > On Wed, 26 Aug 2009, Neil Horman wrote:
> > > 
> > > > On Wed, Aug 26, 2009 at 07:00:13AM -0400, Neil Horman wrote:
> > > > > On Wed, Aug 26, 2009 at 03:10:57AM -0400, Bill Fink wrote:
> > > > > 
> > > > > > Fortunately, in this specific case, the SuperMicro X8DAH+-F system
> > > > > > does have a serial console, and after a fair amount of effort I was
> > > > > > able to get it to work as desired, and was able to finally capture
> > > > > > a backtrace of the kernel oops.  BTW I believe the reason the
> > > > > > kexec/kdump didn't work was probably because it couldn't find
> > > > > > a /proc/vmcore file, although I don't know why that would be,
> > > > > > and the Fedora 10 /etc/init.d/kdump script will then just boot
> > > > > > up normally if it fails to find the /proc/vmcore file (or it's
> > > > > > zero size).
> > > > > > 
> > > > > I take care of kdump for fedora and RHEL.  If you file a bug on this, I'd be
> > > > > happy to look into it further.
> > > > > 
> > > > > > The following shows a simple ping test usage of the skb_sources
> > > > > > tracing feature:
> > > > > > 
> > > > > > [root@xeontest1 tracing]# numactl --membind=1 taskset -c 4 ping -c 5 -s 1472 192.168.1.10
> > > > > > PING 192.168.1.10 (192.168.1.10) 1472(1500) bytes of data.
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=1 ttl=64 time=0.139 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=2 ttl=64 time=0.182 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=3 ttl=64 time=0.178 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=4 ttl=64 time=0.188 ms
> > > > > > 1480 bytes from 192.168.1.10: icmp_seq=5 ttl=64 time=0.178 ms
> > > > > > 
> > > > > > --- 192.168.1.10 ping statistics ---
> > > > > > 5 packets transmitted, 5 received, 0% packet loss, time 3999ms
> > > > > > rtt min/avg/max/mdev = 0.139/0.173/0.188/0.017 ms
> > > > > > 
> > > > > > [root@xeontest1 tracing]# cat trace
> > > > > > # tracer: skb_sources
> > > > > > #
> > > > > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > > > > #        |       |       |       |       |       |       |
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > >         4217    1       1       eth2    0       4       1500
> > > > > > 
> > > > > > All is as was expected.
> > > > > > 
> > > > > > But if I try an actual nuttcp performance test (even rate limited
> > > > > > to 1 Mbps), I get the following kernel oops:
> > > > > > 
> > > > > thank you, I think I see the problem, I'll have a patch for you in just a bit
> > > > > 
> > > > > Thanks
> > > > > Neil
> > > > > 
> > > > > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -Ri1m -xc4/0 192.168.1.10
> > > > > > BUG: unable to handle kernel NULL pointer dereference at 0000000000000038
> > > > > > IP: [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > > > PGD 337d12067 PUD 337d11067 PMD 0
> > > > > > Oops: 0000 [#1] SMP
> > > > > > last sysfs file: /sys/devices/pci0000:80/0000:80:07.0/0000:8b:00.0/0000:8c:04.0e
> > > > > > CPU 4
> > > > > > Modules linked in: w83627ehf hwmon_vid coretemp hwmon ipv6 dm_multipath uinput ]
> > > > > > Pid: 4222, comm: nuttcp Not tainted 2.6.31-rc6-bf #3 X8DAH
> > > > > > RIP: 0010:[<ffffffff810b01ab>]  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x12
> > > > > > RSP: 0018:ffff8801a5811a88  EFLAGS: 00010213
> > > > > > RAX: 0000000000000000 RBX: ffff88033906d154 RCX: 000000000000000d
> > > > > > RDX: 000000000000f88c RSI: 000000000000000b RDI: ffff8803383d3044
> > > > > > RBP: ffff8801a5811ab8 R08: 0000000000000001 R09: ffff8801ab311a00
> > > > > > R10: 0000000000000005 R11: ffffc9000080e2b0 R12: ffff880337c45400
> > > > > > R13: ffff88033906d150 R14: 0000000000000014 R15: ffffffff818bb890
> > > > > > FS:  00007fa976d326f0(0000) GS:ffffc90000800000(0000) knlGS:0000000000000000
> > > > > > CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > > > > > CR2: 0000000000000038 CR3: 000000033801e000 CR4: 00000000000006e0
> > > > > > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > > > > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > > > > Process nuttcp (pid: 4222, threadinfo ffff8801a5810000, task ffff8801ab2e5d00)
> > > > > > Stack:
> > > > > >  ffff8801a5811ab8 ffff8801b35d4ab0 0000000000000014 0000000000000000
> > > > > > <0> 0000000000000014 0000000000000014 ffff8801a5811b18 ffffffff81366ae8
> > > > > > <0> ffff8801a5811ed8 0000001439084000 ffff880337c45400 00000001001416ef
> > > > > > Call Trace:
> > > > > >  [<ffffffff81366ae8>] skb_copy_datagram_iovec+0x50/0x1f5
> > > > > >  [<ffffffff813ac875>] tcp_rcv_established+0x278/0x6db
> > > > > >  [<ffffffff813b3ef5>] tcp_v4_do_rcv+0x1b8/0x366
> > > > > >  [<ffffffff8135f99e>] ? release_sock+0xab/0xb4
> > > > > >  [<ffffffff8136004d>] ? sk_wait_data+0xc8/0xd6
> > > > > >  [<ffffffff813a32d6>] tcp_prequeue_process+0x79/0x8f
> > > > > >  [<ffffffff813a455d>] tcp_recvmsg+0x4e8/0xaa0
> > > > > >  [<ffffffff8135ec90>] sock_common_recvmsg+0x37/0x4c
> > > > > >  [<ffffffff8135cb06>] __sock_recvmsg+0x72/0x7f
> > > > > >  [<ffffffff8135cbdd>] sock_aio_read+0xca/0xda
> > > > > >  [<ffffffff810d9536>] ? vma_merge+0x2a0/0x318
> > > > > >  [<ffffffff810f6d4f>] do_sync_read+0xec/0x132
> > > > > >  [<ffffffff81067ddc>] ? autoremove_wake_function+0x0/0x3d
> > > > > >  [<ffffffff811b646c>] ? security_file_permission+0x16/0x18
> > > > > >  [<ffffffff810f785c>] vfs_read+0xc0/0x107
> > > > > >  [<ffffffff810f7971>] sys_read+0x4c/0x75
> > > > > >  [<ffffffff81011c82>] system_call_fastpath+0x16/0x1b
> > > > > > Code: 44 89 73 30 89 43 14 41 0f b7 84 24 ac 00 00 00 89 43 28 65 8b 04 25 98 e
> > > > > > RIP  [<ffffffff810b01ab>] probe_skb_dequeue+0xf7/0x152
> > > > > >  RSP <ffff8801a5811a88>
> > > > > > CR2: 0000000000000038
> > > > 
> > > > 
> > > > 
> > > > Here  you go, I think this will fix your oops.
> > > > 
> > > > 
> > > >     Fix NULL pointer deref in skb sources ftracer
> > > >     
> > > >     Its possible that skb->sk will be null in this path, so we shouldn't just assume
> > > >     we can pass it to sock_net
> > > >     
> > > >     Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > > > 
> > > >  trace_skb_sources.c |    6 ++++--
> > > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/kernel/trace/trace_skb_sources.c b/kernel/trace/trace_skb_sources.c
> > > > index 40eb071..8bf518f 100644
> > > > --- a/kernel/trace/trace_skb_sources.c
> > > > +++ b/kernel/trace/trace_skb_sources.c
> > > > @@ -29,7 +29,7 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > >  	struct ring_buffer_event *event;
> > > >  	struct trace_skb_event *entry;
> > > >  	struct trace_array *tr = skb_trace;
> > > > -	struct net_device *dev;
> > > > +	struct net_device *dev = NULL;
> > > >  
> > > >  	if (!trace_skb_source_enabled)
> > > >  		return;
> > > > @@ -50,7 +50,9 @@ static void probe_skb_dequeue(const struct sk_buff *skb, int len)
> > > >  	entry->event_data.rx_queue = skb->queue_mapping;
> > > >  	entry->event_data.ccpu = smp_processor_id();
> > > >  
> > > > -	dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > +	if (skb->sk)
> > > > +		dev = dev_get_by_index(sock_net(skb->sk), skb->iif);
> > > > +
> > > >  	if (dev) {
> > > >  		memcpy(entry->event_data.ifname, dev->name, IFNAMSIZ);
> > > >  		dev_put(dev);
> > > 
> > > 
> > > 
> > > On the positive side, it did fix the oops.  But the results of the
> > > skb_sources tracing was not that useful.
> > > 
> > > [root@xeontest1 tracing]# numactl --membind=1 nuttcp -In2 -xc4/0 192.168.1.10 & ps ax | grep nuttcp
> > >  5521 ttyS0    S      0:00 nuttcp -In2 -xc4/0 192.168.1.10
> > > n2: 11819.0786 MB /  10.01 sec = 9905.6427 Mbps 26 %TX 37 %RX 0 retrans 0.18 msRTT
> > > 
> > > First off, only 10 trace entries were made:
> > > 
> > > [root@xeontest1 tracing]# wc trace
> > > 14 90 334 trace
> > > 
> > > And here they are:
> > > 
> > > [root@xeontest1 tracing]# cat trace
> > > # tracer: skb_sources
> > > #
> > > #       PID     ANID    CNID    IFC     RXQ     CCPU    LEN
> > > #        |       |       |       |       |       |       |
> > >         5521    0       0       Unknown 0       3       888
> > >         5521    0       0       Unknown 0       3       896
> > >         5521    0       0       Unknown 0       3       20
> > >         5521    0       0       Unknown 0       3       888
> > >         5521    0       0       Unknown 0       3       896
> > >         5521    0       0       Unknown 0       3       20
> > >         5521    1       1       Unknown 0       4       20
> > >         5521    1       1       Unknown 0       4       11
> > >         5521    1       1       Unknown 0       4       540
> > >         5521    1       1       Unknown 0       4       0
> > > 
> > > Even for these 10 entries, why is the IFC Unknown, and the LENs
> > > seem to be wrong too.
> > > 
> > > 						-Bill
> > > 
> > I'm not sure why you're getting Unknown Interface names.  Nominally that
> > indicates that the skb->iif value in the skb was incorrect or otherwise not set,
> > which shouldn't be the case.  As for the lengths that just seems wrong.  That
> > length value is taken directly from skb->len, so if its not right, it seems like
> > its not getting set correctly someplace.
> > 
> > As you may have seen we're removing the ftrace module, and replacing it with the
> > use of raw trace events.  When I have that working, I'll see if I get simmilar
> > results.  I never did in my local testing of the ftrace module, but perhaps its
> > related to load or something.
> 
> IIUC I should keep the first of your original three ftrace patches,
> revert all the rest, and then apply your very latest patch that
> augments the skb_copy_datagram_iovec TRACE_EVENT.  Do I have that
> basically correct?
> 
Thats exactly correct, yes.

> Then I just need to ask how do I use this new method?
> 
It works in basically the same way.  Except instead of doing this:
echo skb_ftracer > /sys/kernel/debug/tracing/current_tracer
you do this:
echo 1 > /sys/kernel/debug/tracing/events/skb/skb_copy_datagram_iovec/enable
Then the events should should up in /sys/kernel/debug/tracing/trace[_pipe]

Best
Neil

> 						-Thanks
> 
> 						-Bill
> 

^ permalink raw reply

* Re: about packets forwarding
From: Mark Smith @ 2009-09-02 10:48 UTC (permalink / raw)
  To: wu xiaofei; +Cc: netdev
In-Reply-To: <f4f837ab0909020203w67bcba16pbb317c587c58e70f@mail.gmail.com>

On Wed, 2 Sep 2009 17:03:42 +0800
wu xiaofei <lampsu@gmail.com> wrote:

> Hello,
> 
> I have something to ask here.
> 
> The topology of the network is as follows.
> There are six Nodes (A, B, C, D, M, N).
> 
>   M
>   |
>   A
>  / \
> B   D
>  \ /
>   C
>   |
>   N
> 
> M-A, C-N are wired links.
> A-B, B-C, A-D, D-C are wireless links.
> 
> Node M wnats to communicate with node N. Because the wireless links
> are not very reliable, I want to forward the packets through A-B-C and
> A-D-C simultaneously (When Node A receives packets(from Node M) from
> its wired interface eth0, It will forward the same packets to its
> wireless interfaces wlan0 and wlan1 simultaneously) . How to implement
> this?
> 

Packets being duplicated in this manner is generally considered an
error - protocols are designed to deal with it, but only as an error
robustness mechanism. In other words, it won't break anything, but
performance is very likely to suffer.

The Internet protocols are designed on the assumption of a low
possibility of packet loss (1 in 100)- they only assume each
link in the path will have an error detection mechanism (and drop them
when errors occur), but not an error correction mechanism. If the
possibility of packet loss is high (1 in 10), then it is expected that
the link layer itself will implement error detection and recovery.

You can read about the design philosophies of the Internet Protocols in
the following paper:

"The Design Philosophy of the DARPA Internet Protocols"

http://www.acm.org/sigs/sigcomm/ccr/archive/1995/jan95/ccr-9501-clark.pdf

As for your specific scenario, to increase the reliablity of the
individual wireless links, you could look into running X.25 over LAPB
over Ethernet (LAPB over Ethernet driver kernel option). LAPB adds the
error correction/recovery that's normally not present with Ethernet, and
X.25 might be necessary to be carry IP frames. According to the help
for that driver, it creates a /dev/labp0 interface, which you may then
alternatively be able to direct pppd to use, which would then allow you
to carry IP over PPP over LAPB over Ethernet, again gaining link layer
reliability below IP.

It's purely a theoretical suggestion though, as I haven't ever tried it.

Regards,
Mark.

^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Jarek Poplawski @ 2009-09-02  9:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Lameter, David Miller, Patrick McHardy, netdev
In-Reply-To: <20090902093329.GB6530@ff.dom.local>

On Wed, Sep 02, 2009 at 09:33:29AM +0000, Jarek Poplawski wrote:
> On Wed, Sep 02, 2009 at 09:18:54AM +0000, Jarek Poplawski wrote:
> > On Wed, Sep 02, 2009 at 10:28:55AM +0200, Eric Dumazet wrote:
> ...
> > > qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> > >  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> > >  rate 0bit 0pps backlog 0b 0p requeues 0
> > > 
> > > 
> > > Same name "eth0" is displayed, that might confuse parsers...
> > > 
> > > What naming convention should we choose for multiqueue devices ?
> > > 
> > 
> > Hmm... anything could break here something for somebody, so there is
> > still a (Patrick's) question if not sum it all? Otherwise, I wonder
> > about using the qdisc handle (tcm_handle>>16): there would be at
> > least one "pfifo_fast 0:" looking like proper root for somebody...
> 
> I meant "proper" for pfifo_fast. On the other hand, I wonder why these
> multiqueue qdisc handles can't be really given such unique per dev

should be:
multiqueue qdiscs can't be really given such unique per dev

> (instead of per queue) handles?
> 
> Jarek P.

^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Jarek Poplawski @ 2009-09-02  9:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Lameter, David Miller, Patrick McHardy, netdev
In-Reply-To: <20090902091854.GA6530@ff.dom.local>

On Wed, Sep 02, 2009 at 09:18:54AM +0000, Jarek Poplawski wrote:
> On Wed, Sep 02, 2009 at 10:28:55AM +0200, Eric Dumazet wrote:
...
> > qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
> >  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
> >  rate 0bit 0pps backlog 0b 0p requeues 0
> > 
> > 
> > Same name "eth0" is displayed, that might confuse parsers...
> > 
> > What naming convention should we choose for multiqueue devices ?
> > 
> 
> Hmm... anything could break here something for somebody, so there is
> still a (Patrick's) question if not sum it all? Otherwise, I wonder
> about using the qdisc handle (tcm_handle>>16): there would be at
> least one "pfifo_fast 0:" looking like proper root for somebody...

I meant "proper" for pfifo_fast. On the other hand, I wonder why these
multiqueue qdisc handles can't be really given such unique per dev
(instead of per queue) handles?

Jarek P.

^ permalink raw reply

* Re: [NET] Add proc file to display the state of all qdiscs.
From: Jarek Poplawski @ 2009-09-02  9:18 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Christoph Lameter, David Miller, Patrick McHardy, netdev
In-Reply-To: <4A9E2CC7.1010103@gmail.com>

On Wed, Sep 02, 2009 at 10:28:55AM +0200, Eric Dumazet wrote:
> Jarek Poplawski a écrit :
> > [Resent with fixed netdev@ address]
> > 
> > On 02-09-2009 01:52, Christoph Lameter wrote:
> >> [NET] Add proc file to display the state of all qdiscs
> >>
> >> TC is a complicated tool and it currently does not allow the display of all
> >> qdisc states. It does not support multiple tx queues and also not
> >> localhost, nor does it display the current operating state of the queues.
> > 
> > I think, tc should've no problem with displaying summary stats of
> > multiqueue qdiscs or even all of them separately, as mentioned by
> > Patrick. And, maybe I still miss something, but there should be
> > nothing special with tc vs. localhost either.
> >
> 
> I made a patch, but for a 8 queue device (bnx2), here is the "tc -s -d qdisc" result :
> 
> $ tc -s -d qdisc show
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 51814 bytes 459 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> qdisc pfifo_fast 0: dev eth0 root bands 3 priomap  1 2 2 2 1 2 0 0 1 1 1 1 1 1 1 1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  rate 0bit 0pps backlog 0b 0p requeues 0
> 
> 
> Same name "eth0" is displayed, that might confuse parsers...
> 
> What naming convention should we choose for multiqueue devices ?
> 

Hmm... anything could break here something for somebody, so there is
still a (Patrick's) question if not sum it all? Otherwise, I wonder
about using the qdisc handle (tcm_handle>>16): there would be at
least one "pfifo_fast 0:" looking like proper root for somebody...

Jarek P.

^ permalink raw reply

* [RFC] net/core: Delay neighbor only if it has been used after confirmed
From: Jens Rosenboom @ 2009-09-02  9:17 UTC (permalink / raw)
  To: Linux Network Developers

When doing some IPv6 testing with the router advertising a small (e.g. 5
seconds) reachable time, I noticed that after the traffic has stopped,
hosts continue to exchange ND packets every 10 seconds. This is due to
neigh_timer_handler() only checking neigh->used and puts a neighbor into
NUD_DELAY state even if neigh->confirmed may be >= neigh->used.

The following patch for net-next-2.6 fixes this behaviour for my IPv6
setup, however I would like to hear some opinion on whether this might
have some negative influence on other protocols that use this code.

I also think that it would make more sense to compute the time for the
delay timer starting from neigh->used instead of using now (second part
of the patch).

diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 5bc4ad5..ca20162 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -820,12 +820,13 @@ static void neigh_timer_handler(unsigned long arg)
 			NEIGH_PRINTK2("neigh %p is still alive.\n", neigh);
 			next = neigh->confirmed + neigh->parms->reachable_time;
 		} else if (time_before_eq(now,
-					  neigh->used + neigh->parms->delay_probe_time)) {
+					  neigh->used + neigh->parms->delay_probe_time) &&
+			   time_after(neigh->confirmed, neigh->used)) {
 			NEIGH_PRINTK2("neigh %p is delayed.\n", neigh);
 			neigh->nud_state = NUD_DELAY;
 			neigh->updated = jiffies;
 			neigh_suspect(neigh);
-			next = now + neigh->parms->delay_probe_time;
+			next = neigh->used + neigh->parms->delay_probe_time;
 		} else {
 			NEIGH_PRINTK2("neigh %p is suspected.\n", neigh);
 			neigh->nud_state = NUD_STALE;

^ permalink raw reply related

* [PATCH 2/2] fec: don't enable irqs in hard irq context
From: Uwe Kleine-König @ 2009-09-02  9:14 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Greg Ungerer, Ben Hutchings, Patrick McHardy, Sascha Hauer,
	Matt Waddel, netdev, Tim Sander
In-Reply-To: <1251882856-23549-1-git-send-email-u.kleine-koenig@pengutronix.de>

fec_enet_mii, fec_enet_rx and fec_enet_tx are both only called by
fec_enet_interrupt in interrupt context.  So they must not use
spin_lock_irq/spin_unlock_irq.

This fixes:
	WARNING: at kernel/lockdep.c:2140 trace_hardirqs_on_caller+0x130/0x194()
	...

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Matt Waddel <Matt.Waddel@freescale.com>
Cc: netdev@vger.kernel.org
Cc: Tim Sander <tim01@vlsi.informatik.tu-darmstadt.de>
---
 drivers/net/fec.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/net/fec.c b/drivers/net/fec.c
index ef82606..9c49d56 100644
--- a/drivers/net/fec.c
+++ b/drivers/net/fec.c
@@ -427,7 +427,7 @@ fec_enet_tx(struct net_device *dev)
 	struct	sk_buff	*skb;
 
 	fep = netdev_priv(dev);
-	spin_lock_irq(&fep->hw_lock);
+	spin_lock(&fep->hw_lock);
 	bdp = fep->dirty_tx;
 
 	while (((status = bdp->cbd_sc) & BD_ENET_TX_READY) == 0) {
@@ -486,7 +486,7 @@ fec_enet_tx(struct net_device *dev)
 		}
 	}
 	fep->dirty_tx = bdp;
-	spin_unlock_irq(&fep->hw_lock);
+	spin_unlock(&fep->hw_lock);
 }
 
 
@@ -509,7 +509,7 @@ fec_enet_rx(struct net_device *dev)
 	flush_cache_all();
 #endif
 
-	spin_lock_irq(&fep->hw_lock);
+	spin_lock(&fep->hw_lock);
 
 	/* First, grab all of the stats for the incoming packet.
 	 * These get messed up if we get called due to a busy condition.
@@ -604,7 +604,7 @@ rx_processing_done:
 	}
 	fep->cur_rx = bdp;
 
-	spin_unlock_irq(&fep->hw_lock);
+	spin_unlock(&fep->hw_lock);
 }
 
 /* called from interrupt context */
@@ -615,7 +615,7 @@ fec_enet_mii(struct net_device *dev)
 	mii_list_t	*mip;
 
 	fep = netdev_priv(dev);
-	spin_lock_irq(&fep->mii_lock);
+	spin_lock(&fep->mii_lock);
 
 	if ((mip = mii_head) == NULL) {
 		printk("MII and no head!\n");
@@ -633,7 +633,7 @@ fec_enet_mii(struct net_device *dev)
 		writel(mip->mii_regval, fep->hwp + FEC_MII_DATA);
 
 unlock:
-	spin_unlock_irq(&fep->mii_lock);
+	spin_unlock(&fep->mii_lock);
 }
 
 static int
-- 
1.6.3.3

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 1/2] fec: fix recursive locking of mii_lock
From: Uwe Kleine-König @ 2009-09-02  9:14 UTC (permalink / raw)
  To: linux-rt-users
  Cc: Greg Ungerer, Ben Hutchings, Patrick McHardy, Sascha Hauer,
	Matt Waddel, netdev, Tim Sander
In-Reply-To: <20090831132200.GA21836@pengutronix.de>

mii_discover_phy is only called by fec_enet_mii (via mip->mii_func).  So
&fep->mii_lock is already held and mii_discover_phy must not call
mii_queue which locks &fep->mii_lock, too.

This was noticed by lockdep:

	=============================================
	[ INFO: possible recursive locking detected ]
	2.6.31-rc8-00038-g37d0892 #109
	---------------------------------------------
	swapper/1 is trying to acquire lock:
	 (&fep->mii_lock){-.....}, at: [<c01569f8>] mii_queue+0x2c/0xcc

	but task is already holding lock:
	 (&fep->mii_lock){-.....}, at: [<c0156328>] fec_enet_interrupt+0x78/0x460

	other info that might help us debug this:
	2 locks held by swapper/1:
	 #0:  (rtnl_mutex){+.+.+.}, at: [<c0183534>] rtnl_lock+0x18/0x20
	 #1:  (&fep->mii_lock){-.....}, at: [<c0156328>] fec_enet_interrupt+0x78/0x460

	stack backtrace:
	Backtrace:
	[<c00226fc>] (dump_backtrace+0x0/0x108) from [<c01eac14>] (dump_stack+0x18/0x1c)
	 r6:c781d118 r5:c03e41d8 r4:00000001
	[<c01eabfc>] (dump_stack+0x0/0x1c) from [<c005bae4>] (__lock_acquire+0x1a20/0x1a88)
	[<c005a0c4>] (__lock_acquire+0x0/0x1a88) from [<c005bbac>] (lock_acquire+0x60/0x74)
	[<c005bb4c>] (lock_acquire+0x0/0x74) from [<c01edda8>] (_spin_lock_irqsave+0x54/0x68)
	 r7:60000093 r6:c01569f8 r5:c785e468 r4:00000000
	[<c01edd54>] (_spin_lock_irqsave+0x0/0x68) from [<c01569f8>] (mii_queue+0x2c/0xcc)
	 r7:c785e468 r6:c0156b24 r5:600a0000 r4:c785e000
	[<c01569cc>] (mii_queue+0x0/0xcc) from [<c0156b78>] (mii_discover_phy+0x54/0xa8)
	 r8:00000002 r7:00000032 r6:c785e000 r5:c785e360 r4:c785e000
	[<c0156b24>] (mii_discover_phy+0x0/0xa8) from [<c0156354>] (fec_enet_interrupt+0xa4/0x460)
	 r5:c785e360 r4:c077a170
	[<c01562b0>] (fec_enet_interrupt+0x0/0x460) from [<c0066674>] (handle_IRQ_event+0x48/0x120)
	[<c006662c>] (handle_IRQ_event+0x0/0x120) from [<c0068438>] (handle_level_irq+0x94/0x11c)
	...

Signed-off-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Cc: Greg Ungerer <gerg@uclinux.org>
Cc: Ben Hutchings <ben@decadent.org.uk>
Cc: Patrick McHardy <kaber@trash.net>
Cc: Sascha Hauer <s.hauer@pengutronix.de>
Cc: Matt Waddel <Matt.Waddel@freescale.com>
Cc: netdev@vger.kernel.org
Cc: Tim Sander <tim01@vlsi.informatik.tu-darmstadt.de>
---
 drivers/net/fec.c |   22 +++++++++++++++++-----
 1 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/drivers/net/fec.c b/drivers/net/fec.c
index c9fd82d..ef82606 100644
--- a/drivers/net/fec.c
+++ b/drivers/net/fec.c
@@ -637,16 +637,15 @@ unlock:
 }
 
 static int
-mii_queue(struct net_device *dev, int regval, void (*func)(uint, struct net_device *))
+mii_queue_unlocked(struct net_device *dev, int regval,
+		void (*func)(uint, struct net_device *))
 {
 	struct fec_enet_private *fep;
-	unsigned long	flags;
 	mii_list_t	*mip;
 	int		retval;
 
 	/* Add PHY address to register command */
 	fep = netdev_priv(dev);
-	spin_lock_irqsave(&fep->mii_lock, flags);
 
 	regval |= fep->phy_addr << 23;
 	retval = 0;
@@ -667,6 +666,19 @@ mii_queue(struct net_device *dev, int regval, void (*func)(uint, struct net_devi
 		retval = 1;
 	}
 
+	return retval;
+}
+
+static int
+mii_queue(struct net_device *dev, int regval,
+		void (*func)(uint, struct net_device *))
+{
+	struct fec_enet_private *fep;
+	unsigned long   flags;
+	int             retval;
+	fep = netdev_priv(dev);
+	spin_lock_irqsave(&fep->mii_lock, flags);
+	retval = mii_queue_unlocked(dev, regval, func);
 	spin_unlock_irqrestore(&fep->mii_lock, flags);
 	return retval;
 }
@@ -1373,11 +1385,11 @@ mii_discover_phy(uint mii_reg, struct net_device *dev)
 
 			/* Got first part of ID, now get remainder */
 			fep->phy_id = phytype << 16;
-			mii_queue(dev, mk_mii_read(MII_REG_PHYIR2),
+			mii_queue_unlocked(dev, mk_mii_read(MII_REG_PHYIR2),
 							mii_discover_phy3);
 		} else {
 			fep->phy_addr++;
-			mii_queue(dev, mk_mii_read(MII_REG_PHYIR1),
+			mii_queue_unlocked(dev, mk_mii_read(MII_REG_PHYIR1),
 							mii_discover_phy);
 		}
 	} else {
-- 
1.6.3.3

--
To unsubscribe from this list: send the line "unsubscribe linux-rt-users" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox