Netdev List


* Re: [PATCH v2 net-next-2.6] ifb: add performance flags
From: David Miller @ 2011-01-03 19:40 UTC (permalink / raw)
  To: jarkao2; +Cc: eric.dumazet, xiaosuo, pstaszewski, netdev
In-Reply-To: <20110103193703.GA1977@del.dom.local>

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 3 Jan 2011 20:37:03 +0100

> On Sun, Jan 02, 2011 at 09:24:36PM +0100, Eric Dumazet wrote:
>> On Wednesday, 29 December 2010 at 00:07 +0100, Jarek Poplawski wrote:
>> 
>> > Ingress is before vlans handler so these features and the
>> > NETIF_F_HW_VLAN_TX flag seem useful for ifb considering
>> > dev_hard_start_xmit() checks.
>> 
>> OK, here is v2 of the patch then, thanks everybody.
>> 
>> 
>> [PATCH v2 net-next-2.6] ifb: add performance flags
>> 
>> IFB can use the full set of features flags (NETIF_F_SG |
>> NETIF_F_FRAGLIST | NETIF_F_TSO | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA) to
>> avoid unnecessary split of some packets (GRO for example)
>> 
>> Changli suggested to also set vlan_features,
> 
> He also suggested more GSO flags of which especially NETIF_F_TSO6
> seems interesting (wrt GRO)?

I think at least TSO6 would very much be appropriate here.


* Re: [PATCH] net: eepro testing positive EBUSY return by request_irq()?
From: David Miller @ 2011-01-03 19:37 UTC (permalink / raw)
  To: roel.kluin; +Cc: ben, netdev, akpm, linux-kernel
In-Reply-To: <4D209127.5080505@gmail.com>

From: roel kluin <roel.kluin@gmail.com>
Date: Sun, 02 Jan 2011 15:52:23 +0100

> +		for (i = 0; i < ARRAY_SIZE(irqlist); i++) {
> +			retval = request_irq (irqlist[i], NULL, 0, "bogus", NULL);
> +			if (retval != -EBUSY)
> +				continue;
> +			if (retval < 0)
> +				goto out;
> +			dev->irq = irqlist[i];
> +			break;

This series of tests doesn't make much sense.

If we get to the "retval < 0" check, retval must be -EBUSY.  So at
best it's superfluous, at worst it's confusing.
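
The control flow of the quoted hunk can be modeled in plain C to make the point concrete. This is a minimal user-space sketch, not driver code: quoted_loop_pick() and the results[] array are hypothetical stand-ins for the loop and for the values request_irq() would have returned.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of the quoted loop: results[i] stands in for the
 * value request_irq() returned for irqlist[i].  Returns the index the
 * loop would store in dev->irq, or -1 if it never stores one.
 */
static int quoted_loop_pick(const int *results, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int retval = results[i];

		if (retval != -EBUSY)
			continue;
		/* Only reachable with retval == -EBUSY, so always true. */
		if (retval < 0)
			return -1;	/* models "goto out" */
		return (int)i;		/* models the assignment; unreachable */
	}
	return -1;
}
```

Whatever results[] holds, the function returns -1: any value other than -EBUSY is skipped, and -EBUSY always takes the error exit, so the assignment is never reached.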


* Re: [PATCH v2 net-next-2.6] ifb: add performance flags
From: Jarek Poplawski @ 2011-01-03 19:37 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, xiaosuo, pstaszewski, netdev
In-Reply-To: <1293999876.2535.211.camel@edumazet-laptop>

On Sun, Jan 02, 2011 at 09:24:36PM +0100, Eric Dumazet wrote:
> On Wednesday, 29 December 2010 at 00:07 +0100, Jarek Poplawski wrote:
> 
> > Ingress is before vlans handler so these features and the
> > NETIF_F_HW_VLAN_TX flag seem useful for ifb considering
> > dev_hard_start_xmit() checks.
> 
> OK, here is v2 of the patch then, thanks everybody.
> 
> 
> [PATCH v2 net-next-2.6] ifb: add performance flags
> 
> IFB can use the full set of features flags (NETIF_F_SG |
> NETIF_F_FRAGLIST | NETIF_F_TSO | NETIF_F_NO_CSUM | NETIF_F_HIGHDMA) to
> avoid unnecessary split of some packets (GRO for example)
> 
> Changli suggested to also set vlan_features,

He also suggested more GSO flags of which especially NETIF_F_TSO6
seems interesting (wrt GRO)?

Jarek P.

> Jarek suggested to add NETIF_F_HW_VLAN_TX as well.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Changli Gao <xiaosuo@gmail.com>
> Cc: Jarek Poplawski <jarkao2@gmail.com>
> Cc: Pawel Staszewski <pstaszewski@itcare.pl>
> ---
>  drivers/net/ifb.c |    6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c
> index 124dac4..66ca7bf 100644
> --- a/drivers/net/ifb.c
> +++ b/drivers/net/ifb.c
> @@ -126,6 +126,9 @@ static const struct net_device_ops ifb_netdev_ops = {
>  	.ndo_validate_addr = eth_validate_addr,
>  };
>  
> +#define IFB_FEATURES (NETIF_F_NO_CSUM | NETIF_F_SG  | NETIF_F_FRAGLIST | \
> +		      NETIF_F_HIGHDMA | NETIF_F_TSO | NETIF_F_HW_VLAN_TX)
> +
>  static void ifb_setup(struct net_device *dev)
>  {
>  	/* Initialize the device structure. */
> @@ -136,6 +139,9 @@ static void ifb_setup(struct net_device *dev)
>  	ether_setup(dev);
>  	dev->tx_queue_len = TX_Q_LIMIT;
>  
> +	dev->features |= IFB_FEATURES;
> +	dev->vlan_features |= IFB_FEATURES;
> +
>  	dev->flags |= IFF_NOARP;
>  	dev->flags &= ~IFF_MULTICAST;
>  	dev->priv_flags &= ~IFF_XMIT_DST_RELEASE;
> 
> 


* Re: [PATCH net-next-2.6] bonding: remove meaningless /sys/module/bonding/parameters entries.
From: David Miller @ 2011-01-03 19:32 UTC (permalink / raw)
  To: nicolas.2p.debian; +Cc: bonding-devel, netdev, fubar
In-Reply-To: <1293978915-29674-1-git-send-email-nicolas.2p.debian@free.fr>

From: Nicolas de Pesloüan <nicolas.2p.debian@free.fr>
Date: Sun,  2 Jan 2011 15:35:15 +0100

> Only two bonding parameters are exposed in /sys/module/bonding:
> 
> num_grat_arp
> num_unsol_na
> 
> Those values are not module global, but per device.
> 
> The per device values are available in /sys/class/net/<device>/bonding.
> 
> The values exposed in /sys/module/bonding are those given at module load time
> and only used as default values when creating a device. They are read-only and
> cannot change in any way.
> 
> As such, they are mostly meaningless.

First, you forgot to provide a proper "Signed-off-by: " line in your
patch submission, please read Documentation/SubmittingPatches

Secondly, you can't remove these, people might be using them.  It
could be useful, for example, to debug problems with passing module
parameters in.  This is the one way to find out what actually got
passed to the module when it loaded.

Therefore I'm not applying this patch, sorry.


* Re: [RFC PATCH 0/3] Simplified 16 bit Toeplitz hash algorithm
From: Ben Hutchings @ 2011-01-03 19:30 UTC (permalink / raw)
  To: David Miller; +Cc: therbert, alexander.h.duyck, netdev
In-Reply-To: <20110103.110244.183045594.davem@davemloft.net>

On Mon, 2011-01-03 at 11:02 -0800, David Miller wrote:
> From: Tom Herbert <therbert@google.com>
> Date: Mon, 3 Jan 2011 10:47:20 -0800
> 
> > I'm not sure why this would be needed.  What is the advantage in
> > making the TX and RX queues match?
> 
> That's how their hardware based RFS essentially works.
> 
> Instead of watching for "I/O system calls" like we do in software, the
> chip watches for which TX queue a flow ends up on and matches things
> up on the receive side with the same numbered RX queue to match.

ixgbe also implements IRQ affinity setting (or rather hinting) and TX
queue selection by CPU, the inverse of IRQ affinity setting.  Together
with the hardware/firmware Flow Director feature, this should indeed
result in hardware RFS.  (However, irqbalanced does not yet follow the
affinity hints AFAIK, so this requires some manual intervention.  Maybe
the OOT driver is different?)

The proposed change to make TX queue selection hash-based seems to be a
step backwards.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


* Re: [PATCH V4] bridge: fix br_multicast_ipv6_rcv for paged skbs
From: David Miller @ 2011-01-03 19:29 UTC (permalink / raw)
  To: tomas.winkler; +Cc: netdev, johannes, shemminger
In-Reply-To: <1294051080-29492-1-git-send-email-tomas.winkler@intel.com>

From: Tomas Winkler <tomas.winkler@intel.com>
Date: Mon,  3 Jan 2011 12:37:59 +0200

> use pskb_may_pull to access ipv6 header correctly for paged skbs
> It was omitted in the bridge code leading to crash in blind
> __skb_pull
> 
> since the skb is cloned unconditionally we also simplify the
> exit path
> 
> this fixes bug https://bugzilla.kernel.org/show_bug.cgi?id=25202
 ...
> Cc: David Miller <davem@davemloft.net>
> Cc: Johannes Berg <johannes@sipsolutions.net>
> Cc: Stephen Hemminger <shemminger@vyatta.com>
> Signed-off-by: Tomas Winkler <tomas.winkler@intel.com>

Looks good, applied thanks Tomas.

There are several simplifications we can make in the net-next-2.6
tree.

Most of this code just wants the query type and then optionally
the ipv6 address the operation applies to.  For such simple value
fetching, skb_header_pointer() is probably ideal compared to
all of this pskb_may_pull() business.
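
The difference between the two approaches can be sketched in user space. This is only an illustration of the idea behind skb_header_pointer(), not kernel code: struct toy_pkt and toy_header_pointer() are hypothetical names, and a real sk_buff can carry many fragments rather than one.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical two-part packet: a linear head plus one non-linear
 * fragment.  It only models the linear/paged split.
 */
struct toy_pkt {
	const uint8_t *head;
	size_t head_len;
	const uint8_t *frag;
	size_t frag_len;
};

/*
 * Return a pointer to len bytes starting at offset.  If they lie
 * entirely in the linear head, point straight at them; otherwise copy
 * them into the caller's buffer.  Returns NULL if the packet is too
 * short.  The packet itself is never modified.
 */
static const void *toy_header_pointer(const struct toy_pkt *p, size_t offset,
				      size_t len, void *buf)
{
	uint8_t *out = buf;
	size_t i;

	if (offset + len <= p->head_len)
		return p->head + offset;
	if (offset + len > p->head_len + p->frag_len)
		return NULL;
	for (i = 0; i < len; i++) {
		size_t pos = offset + i;

		out[i] = pos < p->head_len ? p->head[pos]
					   : p->frag[pos - p->head_len];
	}
	return buf;
}
```

A caller that only needs to read a few header fields passes a small on-stack buffer and uses whichever pointer comes back, without having to make the header linear first.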


* Re: [PATCH net-next] netdev: Update status of 8390 based drivers in MAINTAINERS
From: David Miller @ 2011-01-03 19:06 UTC (permalink / raw)
  To: paul.gortmaker; +Cc: netdev
In-Reply-To: <1293924510-14290-1-git-send-email-paul.gortmaker@windriver.com>

From: Paul Gortmaker <paul.gortmaker@windriver.com>
Date: Sat,  1 Jan 2011 18:28:30 -0500

> With the original 8 bit ISA ne1000 card being over 20 years old, it
> only makes sense to consider ne.c and all the other toplevel 8390
> based driver files as legacy for obsolete hardware.  The most
> recent thing made in large quantities that was 8390 based were
> those crazy PCI ne2k clones - and even they are now 10+ years old.
> 
> Also remove myself as maintainer, since the only changes to these
> drivers going forward will be the generic API type changes that
> touch all drivers.
> 
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

Also applied, thanks Paul.


* Re: [PATCH net-next] net/Space: delete orphaned externs from deleted drivers
From: David Miller @ 2011-01-03 19:06 UTC (permalink / raw)
  To: paul.gortmaker; +Cc: netdev
In-Reply-To: <1293923701-14147-1-git-send-email-paul.gortmaker@windriver.com>

From: Paul Gortmaker <paul.gortmaker@windriver.com>
Date: Sat,  1 Jan 2011 18:15:01 -0500

> The drivers associated with the prototypes in this commit have
> been deleted some time ago, but the externs escaped detection.
> Using a simple "git grep" shows that these references are
> historical artefacts, only mentioned by the deleted lines.
> 
> Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

Applied.


* Re: [PATCH] atl1: fix oops when changing tx/rx ring params
From: David Miller @ 2011-01-03 19:04 UTC (permalink / raw)
  To: jcliburn; +Cc: netdev, stable, jussuf, chris.snook, kronos.it, Xiong.Huang
In-Reply-To: <20110101090212.7149010d@osprey.hogchain.net>

From: "J. K. Cliburn" <jcliburn@gmail.com>
Date: Sat, 1 Jan 2011 09:02:12 -0600

> Commit 3f5a2a713aad28480d86b0add00c68484b54febc zeroes out the statistics
> message block (SMB) and coalescing message block (CMB) when adapter ring
> resources are freed.  This is desirable behavior, but, as a side effect,
> the commit leads to an oops when atl1_set_ringparam() attempts to alter
> the number of rx or tx elements in the ring buffer (by using ethtool
> -G, for example).  We don't want SMB or CMB to change during this
> operation.
> 
> Modify atl1_set_ringparam() to preserve SMB and CMB when changing ring
> parameters.
> 
> Cc: stable@kernel.org
> Signed-off-by: Jay Cliburn <jcliburn@gmail.com>
> Reported-by: Tõnu Raitviir <jussuf@linux.ee>

I'll apply this, thanks Jay.


* Re: [RFC PATCH 0/3] Simplified 16 bit Toeplitz hash algorithm
From: David Miller @ 2011-01-03 19:02 UTC (permalink / raw)
  To: therbert; +Cc: alexander.h.duyck, netdev
In-Reply-To: <AANLkTiki5ZePtdj4ni2++z1KvHOttev1ZciaV-bRFbWA@mail.gmail.com>

From: Tom Herbert <therbert@google.com>
Date: Mon, 3 Jan 2011 10:47:20 -0800

> I'm not sure why this would be needed.  What is the advantage in
> making the TX and RX queues match?

That's how their hardware based RFS essentially works.

Instead of watching for "I/O system calls" like we do in software, the
chip watches for which TX queue a flow ends up on and matches things
up on the receive side with the same numbered RX queue to match.


* Re: [RFC PATCH 0/3] Simplified 16 bit Toeplitz hash algorithm
From: Alexander Duyck @ 2011-01-03 19:00 UTC (permalink / raw)
  To: Tom Herbert; +Cc: netdev@vger.kernel.org
In-Reply-To: <AANLkTiki5ZePtdj4ni2++z1KvHOttev1ZciaV-bRFbWA@mail.gmail.com>

On 1/3/2011 10:47 AM, Tom Herbert wrote:
> I'm not sure why this would be needed.  What is the advantage in
> making the TX and RX queues match?
>

If the application is affinitized and you are working with RX/TX pairs 
as we have in ixgbe then you can be certain that your buffers are 
staying in the same NUMA node or CPU as the application.  Having them on 
different NUMA nodes can hurt performance for either TX or RX.

The other advantage was that I didn't have to bother with trying to 
reorder the source and destination values when computing an RX hash or a 
TX hash.  I can just call the same function and regardless of direction 
I would get the same hash.  That way I could be guaranteed in a routing 
test that if I was using the RX hash to determine the TX queue that the 
queue number shouldn't change.

I believe the same thing is being accomplished in RPS/TPS via a test for 
the values and swapping them if source is greater than destination.

Thanks,

Alex
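
The swap-based approach mentioned for the software path can be sketched as follows. This is a minimal sketch under stated assumptions: struct flow_key, mix32() and symmetric_flow_hash() are hypothetical names, and the FNV-1a style mixer is only a placeholder for whatever hash the stack really uses.

```c
#include <stdint.h>

/* Hypothetical flow key; fields are in host byte order for simplicity. */
struct flow_key {
	uint32_t saddr;
	uint32_t daddr;
	uint16_t sport;
	uint16_t dport;
};

/* Placeholder mixer (FNV-1a over the four bytes of v). */
static uint32_t mix32(uint32_t h, uint32_t v)
{
	int i;

	for (i = 0; i < 4; i++) {
		h ^= (v >> (8 * i)) & 0xff;
		h *= 16777619u;
	}
	return h;
}

/*
 * Put addresses and ports into a canonical order before hashing, so
 * that A->B and B->A produce the same value.
 */
static uint32_t symmetric_flow_hash(struct flow_key k)
{
	uint32_t h = 2166136261u;

	if (k.saddr > k.daddr) {
		uint32_t t = k.saddr;

		k.saddr = k.daddr;
		k.daddr = t;
	}
	if (k.sport > k.dport) {
		uint16_t t = k.sport;

		k.sport = k.dport;
		k.dport = t;
	}
	h = mix32(h, k.saddr);
	h = mix32(h, k.daddr);
	h = mix32(h, ((uint32_t)k.sport << 16) | k.dport);
	return h;
}
```

Because addresses and ports are ordered independently here, two different flows that differ only in which side owns which port also collide; that is the price of this particular canonical form.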


* Re: [RFC PATCH 0/3] Simplified 16 bit Toeplitz hash algorithm
From: Tom Herbert @ 2011-01-03 18:47 UTC (permalink / raw)
  To: Alexander Duyck; +Cc: netdev
In-Reply-To: <20101218004210.28602.18499.stgit@gitlad.jf.intel.com>

I'm not sure why this would be needed.  What is the advantage in
making the TX and RX queues match?

On Fri, Dec 17, 2010 at 5:00 PM, Alexander Duyck
<alexander.h.duyck@intel.com> wrote:
> This patch series is meant to be a proof of concept for simplifying the cost
> of Toeplitz hashing by reducing the complexity of the key to a 16 bit
> repeating value.  The resultant advantages are that the hash computation
> performance is significantly increased, and that the resultant hash is the
> same for flows in either direction.
>
> The idea for this occurred to me while working on the ATR hashing algorithms
> and improving their performance.  ATR implements a 32 bit repeating key which
> results in us being able to XOR everything down to a 32 bit value.  By using a
> 16 bit key we are able to cut down the 12 to 36 byte input value to only 2
> bytes via XOR operations.  This reduces the resultant hash to 16 bits, however
> since queue selection only requires 7 bits for RSS that still leaves us with a
> large enough resultant key.
>
> I'm currently not planning to do any more work on this in the near future as I
> have several other projects in which I am currently engaged.  However I just
> wanted to put this code out there in case anyone had a use for it.
>
> Thanks,
>
> Alex
>
> ---
>
> Alexander Duyck (3):
>      igb: example of how to update igb to make use of in-kernel Toeplitz hashing
>      ixgbe: example of how to update ixgbe to make use of in-kernel Toeplitz hash
>      net: add simplified 16 bit Toeplitz hash function for transmit side hashing
>
>
>  drivers/net/igb/igb_main.c     |   22 ++++------
>  drivers/net/ixgbe/ixgbe_main.c |   47 ++++++++++++---------
>  include/linux/netdevice.h      |    2 +
>  include/linux/toeplitz.h       |   89 ++++++++++++++++++++++++++++++++++++++++
>  net/core/dev.c                 |   68 +++++++++++++++++++++++++++++++
>  5 files changed, 195 insertions(+), 33 deletions(-)
>  create mode 100644 include/linux/toeplitz.h
>
>
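
The folding argument in the cover letter can be checked with a small user-space sketch. It is not taken from the patch series: toeplitz_repeat16() and toeplitz_fold16() are hypothetical names, and the layout assumed for the input (source address, destination address, source port, destination port, each starting on a 16-bit boundary) is an assumption. The first function is a bit-by-bit Toeplitz hash whose key is a 16-bit value repeated forever; the second XORs the input down to 2 bytes first and hashes only those.

```c
#include <stddef.h>
#include <stdint.h>

/* Bit-by-bit Toeplitz hash with a key that repeats every 16 bits. */
static uint32_t toeplitz_repeat16(const uint8_t *data, size_t len,
				  uint16_t key)
{
	/* 32-bit key window at input bit 0: the 16-bit key, twice. */
	uint32_t window = ((uint32_t)key << 16) | key;
	uint32_t hash = 0;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			/*
			 * Slide the window by one key bit.  The key has
			 * period 16, so the bit entering on the right
			 * equals the bit leaving on the left: a rotate.
			 */
			window = (window << 1) | (window >> 31);
		}
	}
	return hash;
}

/* Same result, computed by first XOR-folding the input to 16 bits. */
static uint32_t toeplitz_fold16(const uint8_t *data, size_t len,
				uint16_t key)
{
	uint16_t folded = 0;
	uint8_t two[2];
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		folded ^= (uint16_t)((data[i] << 8) | data[i + 1]);
	if (len & 1)	/* odd trailing byte lands in the high half */
		folded ^= (uint16_t)(data[len - 1] << 8);

	two[0] = (uint8_t)(folded >> 8);
	two[1] = (uint8_t)(folded & 0xff);
	return toeplitz_repeat16(two, 2, key);
}
```

Input bits whose positions are equal modulo 16 select the same key window, so they can be XORed together before hashing. Swapping source and destination fields that are multiples of 16 bits wide leaves the folded value unchanged, which is why both directions of a flow hash alike. The upper and lower halves of the 32-bit result are identical, so only 16 bits carry information.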


* Re: bridge not routing packets via source bridgeport
From: Eric Dumazet @ 2011-01-03 18:42 UTC (permalink / raw)
  To: Sebastian J. Bronner; +Cc: netdev, Daniel Kraft
In-Reply-To: <4D220CFB.5060300@d9t.de>

On Monday, 3 January 2011 at 18:52 +0100, Sebastian J. Bronner wrote:
> Hi all,
> 
> we recently upgraded from 2.6.32.25 to 2.6.35.24 and discovered that our
> virtual machines can no longer access their own external IP addresses.
> Testing revealed that 2.6.34 was the last version not to have the
> problem. 2.6.36 still had it. But on to the details.
> 
> Our setup:
> 
> We use KVM to virtualise our guests. The physical machines (nodes) act
> as One-to-One NAT routers to the virtual machines. The virtual machines
> are connected via virtio interfaces in a bridge.
> 
> Since the virtual machines only know about their RFC-1918 addresses, any
> request they make to their NATed global addresses requires a trip
> through the node's netfilter to perform the needed SNAT and DNAT operations.
> 
> Take the following setup:
> 
>    {internet}
>        |
>      (eth0)       <- 1.1.1.254, proxy_arp=1
>        |
>      [node]       <- ip_forward=1, routes*, nat**
>        |
>     (virbr1)      <- 10.0.0.1
>     /      \
> (vnet0)     |
>    |     (vnet1)
> (veth0)     |     <- 10.0.0.2
>    |     (veth0)  <- 10.0.0.3
>  [vm1]      |
>           [vm2]
> 
> * The static routes on the node for the vms mentioned above are as follows:
> # ip r
> 1.1.1.2 dev virbr1 scope link
> 1.1.1.3 dev virbr1 scope link
> 
> ** The NAT rules are set up as follows (in reality, they're a bit more
> complicated - but this suffices to illustrate the problem at hand):
> # iptables-save -t nat
> -A PREROUTING -d 1.1.1.2 -j DNAT --to-destination 10.0.0.2
> -A PREROUTING -d 1.1.1.3 -j DNAT --to-destination 10.0.0.3
> -A POSTROUTING -s 10.0.0.2 -j SNAT --to-source 1.1.1.2
> -A POSTROUTING -s 10.0.0.3 -j SNAT --to-source 1.1.1.3
> 
> This means that 1.1.1.2 maps to 10.0.0.2 (vm1) and
>                 1.1.1.3 maps to 10.0.0.3 (vm2).
> 
> Assuming ssh is running on both vms, running 'nc -v 1.1.1.3 22' from vm1
> gets me ssh's introductory message.
> 
> Assuming no service is running on port 23, running 'nc -v 1.1.1.3 23'
> from vm1 gets me 'Connection refused'.
> 
> That's all fine and exactly as it should be. The vms are accessible from
> the internet as well, and can access the internet.
> 
> If, however, I run 'nc -v 1.1.1.2 22' from vm1 (or any port for that
> matter), I get a timeout!
> 
> Running tcpdump on all the involved interfaces showed me that the
> packets successfully traverse veth0 and vnet0 and appear to get lost
> upon reaching virbr1.
> 
> So, then I decided to set up a packet trace with iptables:
> [on the node]
> # modprobe ipt_LOG
> # iptables -t raw -A PREROUTING -p tcp --dport 4577 -j TRACE
> # tail -f /var/log/messages | grep TRACE
> [on vm1]
> # nc -v 1.1.1.2 4577
> 
> The results were very interesting, if somewhat dumbfounding. They are
> attached for easier perusal. The gist of it is that the packet in
> question disappears without a trace after going through the DNAT rule in
> the PREROUTING chain of the NAT table. This can be seen happening three
> times in vm1-to-1.1.1.2.txt in three and six second intervals (retries).
> 
> For comparison, I have also included a trace of a successful packet
> traversal that ends in a 'Connection refused'. It is in vm1-to-1.1.1.3.txt.
> 
> As a last note, I should add that the problem isn't related to the IP
> address. I eliminated that by putting two RFC-1918 IPs on vm1 and
> mapping two IPs to it, then running nc on one IP, while the other one
> was being used as the source IP.
> 
> The problem appears to be that packets can't be routed out the same
> bridgeport that they arrived from.
> 
> I hope this all makes sense and that you can reproduce the problem. One
> virtual machine will suffice to see the problem at work.
> 
> Feel free to contact me if you need more information or have suggestions
> for me.
> 
> Cheers,
> Sebastian Bronner
> 
> P.S.: The IP addresses are faked. I used vim to replace all instances of
> the real IPs with the fake ones used in this e-mail consistently.

random guess: maybe rp_filter hits you ?


With 2.6.36, a new SNMP counter was added, 
"netstat -s | grep IPReversePathFilter"





* Re: [PATCH] net: bridge: check the length of skb after nf_bridge_maybe_copy_header()
From: Stephen Hemminger @ 2011-01-03 18:15 UTC (permalink / raw)
  To: David Miller; +Cc: xiaosuo, bridge, netdev
In-Reply-To: <20110103.092214.193706896.davem@davemloft.net>

On Mon, 03 Jan 2011 09:22:14 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Changli Gao <xiaosuo@gmail.com>
> Date: Mon, 3 Jan 2011 18:44:59 +0800
> 
> > On Sat, Jan 1, 2011 at 3:10 AM, David Miller <davem@davemloft.net> wrote:
> >> From: Changli Gao <xiaosuo@gmail.com>
> >> Date: Sat, 25 Dec 2010 21:41:30 +0800
> >>
> >>> Since nf_bridge_maybe_copy_header() may change the length of skb,
> >>> we should check the length of skb after it to handle the PPPoE skbs.
> >>>
> >>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
> >>
> >> This is really strange.
> >>
> >> packet_length() subtracts VLAN_HLEN from the value it returns, so the
> >> correct fix seems to be to make this function handle the PPPOE case
> >> too.
> >>
> > 
> > It is correct. The actual MTU of 802.1q frame is 4 bytes larger. For
> > example, the MTU of ethernet is normally 1500, however the actual MTU
> > of the 802.1Q is 1504.
> 
> Yes, I understand this, but I don't see why packet_length() cannot
> simply account for PPPOE's encapsulation overhead just as it does for
> VLAN's special MTU considerations.

Because PPPOE happens afterwards and is not part of the calculation.
The check should be moved until after skb has reached final form.
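
The ordering issue can be illustrated with a user-space model. This is a sketch under stated assumptions, not the bridge code itself: struct toy_skb, fits_mtu() and restore_pppoe_header() are hypothetical, the protocol field is kept in host byte order, and the 8-byte figure is the PPPoE session header plus the PPP protocol field.

```c
#include <stdint.h>

#define ETH_P_8021Q	0x8100
#define VLAN_HLEN	4
#define PPPOE_SES_HLEN	8	/* PPPoE session header + PPP protocol */

/* Hypothetical stand-in for the two sk_buff fields that matter here. */
struct toy_skb {
	unsigned int len;
	uint16_t protocol;	/* host byte order in this sketch */
};

/* Length counted against the MTU: a VLAN tag is allowed on top. */
static unsigned int packet_length(const struct toy_skb *skb)
{
	return skb->len - (skb->protocol == ETH_P_8021Q ? VLAN_HLEN : 0);
}

static int fits_mtu(const struct toy_skb *skb, unsigned int mtu)
{
	return packet_length(skb) <= mtu;
}

/* Models putting a saved PPPoE header back in front of the payload. */
static void restore_pppoe_header(struct toy_skb *skb)
{
	skb->len += PPPOE_SES_HLEN;
}
```

A 1496-byte payload passes fits_mtu() against an MTU of 1500 before restore_pppoe_header() runs and fails afterwards at 1504 bytes, so a check made before the header is restored lets an oversized frame through.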


* [PATCH] sch_red: report backlog information
From: Eric Dumazet @ 2011-01-03 18:11 UTC (permalink / raw)
  To: hadi
  Cc: Jarek Poplawski, David Miller, Jesper Dangaard Brouer,
	Patrick McHardy, netdev
In-Reply-To: <1294063372.2892.408.camel@edumazet-laptop>

Provide child qdisc backlog (byte count) information so that "tc -s
qdisc" can report it to user.

packet count is already correctly provided.

qdisc red 11: parent 1:11 limit 60Kb min 15Kb max 45Kb ecn 
 Sent 3116427684 bytes 1415782 pkt (dropped 8, overlimits 7866 requeues 0) 
 rate 242385Kbit 13630pps backlog 13560b 8p requeues 0 
  marked 7865 early 1 pdrop 7 other 0

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/sched/sch_red.c |    1 +
 1 files changed, 1 insertion(+)

diff --git a/net/sched/sch_red.c b/net/sched/sch_red.c
index 8d42bb3..a67ba3c 100644
--- a/net/sched/sch_red.c
+++ b/net/sched/sch_red.c
@@ -239,6 +239,7 @@ static int red_dump(struct Qdisc *sch, struct sk_buff *skb)
 		.Scell_log	= q->parms.Scell_log,
 	};
 
+	sch->qstats.backlog = q->qdisc->qstats.backlog;
 	opts = nla_nest_start(skb, TCA_OPTIONS);
 	if (opts == NULL)
 		goto nla_put_failure;




* [net-next-2.6 PATCH] dcbnl: more informed return values for new dcbnl routines
From: Shmulik Ravid @ 2011-01-03 18:04 UTC (permalink / raw)
  To: davem

More accurate return values for the following (new) dcbnl routines:
dcbnl_getdcbx()
dcbnl_setdcbx()
dcbnl_getfeatcfg()
dcbnl_setfeatcfg()

Signed-off-by: Shmulik Ravid <shmulikr@broadcom.com>
---
 net/dcb/dcbnl.c |   81 +++++++++++++++++++++++++------------------------------
 1 files changed, 37 insertions(+), 44 deletions(-)

diff --git a/net/dcb/dcbnl.c b/net/dcb/dcbnl.c
index ff3c12d..9399af5 100644
--- a/net/dcb/dcbnl.c
+++ b/net/dcb/dcbnl.c
@@ -1286,10 +1286,10 @@ nlmsg_failure:
 static int dcbnl_getdcbx(struct net_device *netdev, struct nlattr **tb,
 			 u32 pid, u32 seq, u16 flags)
 {
-	int ret = -EINVAL;
+	int ret;
 
 	if (!netdev->dcbnl_ops->getdcbx)
-		return ret;
+		return -EOPNOTSUPP;
 
 	ret = dcbnl_reply(netdev->dcbnl_ops->getdcbx(netdev), RTM_GETDCB,
 			  DCB_CMD_GDCBX, DCB_ATTR_DCBX, pid, seq, flags);
@@ -1300,11 +1300,14 @@ static int dcbnl_getdcbx(struct net_device *netdev, struct nlattr **tb,
 static int dcbnl_setdcbx(struct net_device *netdev, struct nlattr **tb,
 			 u32 pid, u32 seq, u16 flags)
 {
-	int ret = -EINVAL;
+	int ret;
 	u8 value;
 
-	if (!tb[DCB_ATTR_DCBX] || !netdev->dcbnl_ops->setdcbx)
-		return ret;
+	if (!netdev->dcbnl_ops->setdcbx)
+		return -EOPNOTSUPP;
+
+	if (!tb[DCB_ATTR_DCBX])
+		return -EINVAL;
 
 	value = nla_get_u8(tb[DCB_ATTR_DCBX]);
 
@@ -1323,23 +1326,23 @@ static int dcbnl_getfeatcfg(struct net_device *netdev, struct nlattr **tb,
 	struct dcbmsg *dcb;
 	struct nlattr *data[DCB_FEATCFG_ATTR_MAX + 1], *nest;
 	u8 value;
-	int ret = -EINVAL;
-	int i;
+	int ret, i;
 	int getall = 0;
 
-	if (!tb[DCB_ATTR_FEATCFG] || !netdev->dcbnl_ops->getfeatcfg)
-		return ret;
+	if (!netdev->dcbnl_ops->getfeatcfg)
+		return -EOPNOTSUPP;
+
+	if (!tb[DCB_ATTR_FEATCFG])
+		return -EINVAL;
 
 	ret = nla_parse_nested(data, DCB_FEATCFG_ATTR_MAX, tb[DCB_ATTR_FEATCFG],
 			       dcbnl_featcfg_nest);
-	if (ret) {
-		ret = -EINVAL;
+	if (ret)
 		goto err_out;
-	}
 
 	dcbnl_skb = nlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
 	if (!dcbnl_skb) {
-		ret = -EINVAL;
+		ret = -ENOBUFS;
 		goto err_out;
 	}
 
@@ -1351,8 +1354,8 @@ static int dcbnl_getfeatcfg(struct net_device *netdev, struct nlattr **tb,
 
 	nest = nla_nest_start(dcbnl_skb, DCB_ATTR_FEATCFG);
 	if (!nest) {
-		ret = -EINVAL;
-		goto err;
+		ret = -EMSGSIZE;
+		goto nla_put_failure;
 	}
 
 	if (data[DCB_FEATCFG_ATTR_ALL])
@@ -1363,30 +1366,22 @@ static int dcbnl_getfeatcfg(struct net_device *netdev, struct nlattr **tb,
 			continue;
 
 		ret = netdev->dcbnl_ops->getfeatcfg(netdev, i, &value);
-		if (!ret) {
+		if (!ret)
 			ret = nla_put_u8(dcbnl_skb, i, value);
 
-			if (ret) {
-				nla_nest_cancel(dcbnl_skb, nest);
-				ret = -EINVAL;
-				goto err;
-			}
-		} else
-			goto err;
+		if (ret) {
+			nla_nest_cancel(dcbnl_skb, nest);
+			goto nla_put_failure;
+		}
 	}
 	nla_nest_end(dcbnl_skb, nest);
 
 	nlmsg_end(dcbnl_skb, nlh);
 
-	ret = rtnl_unicast(dcbnl_skb, &init_net, pid);
-	if (ret) {
-		ret = -EINVAL;
-		goto err_out;
-	}
-
-	return 0;
+	return rtnl_unicast(dcbnl_skb, &init_net, pid);
+nla_put_failure:
+	nlmsg_cancel(dcbnl_skb, nlh);
 nlmsg_failure:
-err:
 	kfree_skb(dcbnl_skb);
 err_out:
 	return ret;
@@ -1396,20 +1391,20 @@ static int dcbnl_setfeatcfg(struct net_device *netdev, struct nlattr **tb,
 			    u32 pid, u32 seq, u16 flags)
 {
 	struct nlattr *data[DCB_FEATCFG_ATTR_MAX + 1];
-	int ret = -EINVAL;
+	int ret, i;
 	u8 value;
-	int i;
 
-	if (!tb[DCB_ATTR_FEATCFG] || !netdev->dcbnl_ops->setfeatcfg)
-		return ret;
+	if (!netdev->dcbnl_ops->setfeatcfg)
+		return -ENOTSUPP;
+
+	if (!tb[DCB_ATTR_FEATCFG])
+		return -EINVAL;
 
 	ret = nla_parse_nested(data, DCB_FEATCFG_ATTR_MAX, tb[DCB_ATTR_FEATCFG],
 			       dcbnl_featcfg_nest);
 
-	if (ret) {
-		ret = -EINVAL;
+	if (ret)
 		goto err;
-	}
 
 	for (i = DCB_FEATCFG_ATTR_ALL+1; i <= DCB_FEATCFG_ATTR_MAX; i++) {
 		if (data[i] == NULL)
@@ -1420,14 +1415,12 @@ static int dcbnl_setfeatcfg(struct net_device *netdev, struct nlattr **tb,
 		ret = netdev->dcbnl_ops->setfeatcfg(netdev, i, value);
 
 		if (ret)
-			goto operr;
+			goto err;
 	}
-
-operr:
-	ret = dcbnl_reply(!!ret, RTM_SETDCB, DCB_CMD_SFEATCFG,
-			  DCB_ATTR_FEATCFG, pid, seq, flags);
-
 err:
+	dcbnl_reply(ret, RTM_SETDCB, DCB_CMD_SFEATCFG, DCB_ATTR_FEATCFG,
+		    pid, seq, flags);
+
 	return ret;
 }
 
-- 
1.7.1
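
The return-value convention the patch moves towards can be sketched outside the kernel. This is a minimal illustration, not the dcbnl code: struct toy_dcb_ops, toy_accept() and toy_setdcbx() are hypothetical, and the attribute is reduced to a plain pointer that is NULL when the caller left it out.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical ops table with a single optional callback. */
struct toy_dcb_ops {
	int (*setdcbx)(unsigned char mode);
};

/* Hypothetical driver callback that accepts any mode. */
static int toy_accept(unsigned char mode)
{
	(void)mode;
	return 0;
}

/*
 * Distinguish "the driver cannot do this" from "the request is
 * malformed" instead of returning -EINVAL for both.
 */
static int toy_setdcbx(const struct toy_dcb_ops *ops,
		       const unsigned char *attr)
{
	if (!ops->setdcbx)
		return -EOPNOTSUPP;	/* operation not implemented */
	if (!attr)
		return -EINVAL;		/* required attribute missing */
	return ops->setdcbx(*attr);
}
```

Checking for the callback first means a caller talking to a driver without the operation gets -EOPNOTSUPP even when the request is also malformed, which matches the order of the checks in the hunks above.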






* bridge not routing packets via source bridgeport
From: Sebastian J. Bronner @ 2011-01-03 17:52 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Kraft

[-- Attachment #1: Type: text/plain, Size: 3983 bytes --]

Hi all,

we recently upgraded from 2.6.32.25 to 2.6.35.24 and discovered that our
virtual machines can no longer access their own external IP addresses.
Testing revealed that 2.6.34 was the last version not to have the
problem. 2.6.36 still had it. But on to the details.

Our setup:

We use KVM to virtualise our guests. The physical machines (nodes) act
as One-to-One NAT routers to the virtual machines. The virtual machines
are connected via virtio interfaces in a bridge.

Since the virtual machines only know about their RFC-1918 addresses, any
request they make to their NATed global addresses requires a trip
through the node's netfilter to perform the needed SNAT and DNAT operations.

Take the following setup:

   {internet}
       |
     (eth0)       <- 1.1.1.254, proxy_arp=1
       |
     [node]       <- ip_forward=1, routes*, nat**
       |
    (virbr1)      <- 10.0.0.1
    /      \
(vnet0)     |
   |     (vnet1)
(veth0)     |     <- 10.0.0.2
   |     (veth0)  <- 10.0.0.3
 [vm1]      |
          [vm2]

* The static routes on the node for the vms mentioned above are as follows:
# ip r
1.1.1.2 dev virbr1 scope link
1.1.1.3 dev virbr1 scope link

** The NAT rules are set up as follows (in reality, they're a bit more
complicated - but this suffices to illustrate the problem at hand):
# iptables-save -t nat
-A PREROUTING -d 1.1.1.2 -j DNAT --to-destination 10.0.0.2
-A PREROUTING -d 1.1.1.3 -j DNAT --to-destination 10.0.0.3
-A POSTROUTING -s 10.0.0.2 -j SNAT --to-source 1.1.1.2
-A POSTROUTING -s 10.0.0.3 -j SNAT --to-source 1.1.1.3

This means that 1.1.1.2 maps to 10.0.0.2 (vm1) and
                1.1.1.3 maps to 10.0.0.3 (vm2).

Assuming ssh is running on both vms, running 'nc -v 1.1.1.3 22' from vm1
gets me ssh's introductory message.

Assuming no service is running on port 23, running 'nc -v 1.1.1.3 23'
from vm1 gets me 'Connection refused'.

That's all fine and exactly as it should be. The vms are accessible from
the internet as well, and can access the internet.

If, however, I run 'nc -v 1.1.1.2 22' from vm1 (or any port for that
matter), I get a timeout!

Running tcpdump on all the involved interfaces showed me that the
packets successfully traverse veth0 and vnet0 and appear to get lost
upon reaching virbr1.

So, then I decided to set up a packet trace with iptables:
[on the node]
# modprobe ipt_LOG
# iptables -t raw -A PREROUTING -p tcp --dport 4577 -j TRACE
# tail -f /var/log/messages | grep TRACE
[on vm1]
# nc -v 1.1.1.2 4577

The results were very interesting, if somewhat dumbfounding. They are
attached for easier perusal. The gist of it is that the packet in
question disappears without a trace after going through the DNAT rule in
the PREROUTING chain of the NAT table. This can be seen happening three
times in vm1-to-1.1.1.2.txt in three and six second intervals (retries).

For comparison, I have also included a trace of a successful packet
traversal that ends in a 'Connection refused'. It is in vm1-to-1.1.1.3.txt.

As a last note, I should add that the problem isn't related to the IP
address. I eliminated that by putting two RFC-1918 IPs on vm1 and
mapping two IPs to it, then running nc on one IP, while the other one
was being used as the source IP.

The problem appears to be that packets can't be routed out the same
bridgeport that they arrived from.

I hope this all makes sense and that you can reproduce the problem. One
virtual machine will suffice to see the problem at work.

Feel free to contact me if you need more information or have suggestions
for me.

Cheers,
Sebastian Bronner

P.S.: The IP addresses are faked. I used vim to replace all instances of
the real IPs with the fake ones used in this e-mail consistently.
-- 
*Sebastian J. Bronner*
Administrator

D9T GmbH - Magirusstr. 39/1 - D-89077 Ulm
Tel: +49 731 1411 696-0 - Fax: +49 731 3799-220

Geschäftsführer: Daniel Kraft
Sitz und Register: Ulm, HRB 722416
Ust.IdNr: DE 260484638

http://d9t.de - D9T High Performance Hosting
info@d9t.de

[-- Attachment #2: vm1-to-1.1.1.2.txt --]
[-- Type: text/plain, Size: 3243 bytes --]

Jan  3 18:29:35 s14 kernel: [15791.001685] TRACE: raw:PREROUTING:policy:2 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62671 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E78D50000000001030307) UID=0 GID=0 
Jan  3 18:29:35 s14 kernel: [15791.001730] TRACE: mangle:PREROUTING:policy:1 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62671 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E78D50000000001030307) UID=0 GID=0 
Jan  3 18:29:35 s14 kernel: [15791.001762] TRACE: nat:PREROUTING:rule:1 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62671 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E78D50000000001030307) UID=0 GID=0 
Jan  3 18:29:38 s14 kernel: [15793.995583] TRACE: raw:PREROUTING:policy:2 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62672 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E7A010000000001030307) UID=0 GID=0 
Jan  3 18:29:38 s14 kernel: [15793.995624] TRACE: mangle:PREROUTING:policy:1 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62672 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E7A010000000001030307) UID=0 GID=0 
Jan  3 18:29:38 s14 kernel: [15793.995656] TRACE: nat:PREROUTING:rule:1 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62672 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E7A010000000001030307) UID=0 GID=0 
Jan  3 18:29:44 s14 kernel: [15799.995658] TRACE: raw:PREROUTING:policy:2 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62673 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E7C590000000001030307) UID=0 GID=0 
Jan  3 18:29:44 s14 kernel: [15799.995700] TRACE: mangle:PREROUTING:policy:1 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62673 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E7C590000000001030307) UID=0 GID=0 
Jan  3 18:29:44 s14 kernel: [15799.995732] TRACE: nat:PREROUTING:rule:1 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.2 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=62673 DF PROTO=TCP SPT=49068 DPT=4577 SEQ=1589611546 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000E7C590000000001030307) UID=0 GID=0 

[-- Attachment #3: vm1-to-1.1.1.3.txt --]
[-- Type: text/plain, Size: 2284 bytes --]

Jan  3 18:32:33 s14 kernel: [15968.856178] TRACE: raw:PREROUTING:policy:2 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.3 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=36414 DF PROTO=TCP SPT=37569 DPT=4577 SEQ=80883825 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000EBE4F0000000001030307) UID=0 GID=0 
Jan  3 18:32:33 s14 kernel: [15968.856211] TRACE: mangle:PREROUTING:policy:1 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.3 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=36414 DF PROTO=TCP SPT=37569 DPT=4577 SEQ=80883825 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000EBE4F0000000001030307) UID=0 GID=0 
Jan  3 18:32:33 s14 kernel: [15968.856233] TRACE: nat:PREROUTING:policy:2 IN=virbr1 OUT= PHYSIN=vnet0 MAC=02:00:00:00:00:16:52:54:00:4a:25:72:08:00 SRC=10.0.0.2 DST=1.1.1.3 LEN=60 TOS=0x00 PREC=0x00 TTL=64 ID=36414 DF PROTO=TCP SPT=37569 DPT=4577 SEQ=80883825 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000EBE4F0000000001030307) UID=0 GID=0 
Jan  3 18:32:33 s14 kernel: [15968.856272] TRACE: mangle:FORWARD:policy:1 IN=virbr1 OUT=eth0 PHYSIN=vnet0 SRC=10.0.0.2 DST=1.1.1.3 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=36414 DF PROTO=TCP SPT=37569 DPT=4577 SEQ=80883825 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000EBE4F0000000001030307) 
Jan  3 18:32:33 s14 kernel: [15968.856288] TRACE: filter:FORWARD:policy:1 IN=virbr1 OUT=eth0 PHYSIN=vnet0 SRC=10.0.0.2 DST=1.1.1.3 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=36414 DF PROTO=TCP SPT=37569 DPT=4577 SEQ=80883825 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000EBE4F0000000001030307) 
Jan  3 18:32:33 s14 kernel: [15968.856305] TRACE: mangle:POSTROUTING:policy:1 IN= OUT=eth0 PHYSIN=vnet0 SRC=10.0.0.2 DST=1.1.1.3 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=36414 DF PROTO=TCP SPT=37569 DPT=4577 SEQ=80883825 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000EBE4F0000000001030307) 
Jan  3 18:32:33 s14 kernel: [15968.856321] TRACE: nat:POSTROUTING:rule:1 IN= OUT=eth0 PHYSIN=vnet0 SRC=10.0.0.2 DST=1.1.1.3 LEN=60 TOS=0x00 PREC=0x00 TTL=63 ID=36414 DF PROTO=TCP SPT=37569 DPT=4577 SEQ=80883825 ACK=0 WINDOW=5840 RES=0x00 SYN URGP=0 OPT (020405B40402080A000EBE4F0000000001030307) 

[-- Attachment #4: iptables-nat.txt --]
[-- Type: text/plain, Size: 645 bytes --]

Chain PREROUTING (policy ACCEPT 4027 packets, 296K bytes)
 pkts bytes target     prot opt in     out     source               destination         
    8   488 DNAT       all  --  *      *       0.0.0.0/0            1.1.1.2             to:10.0.0.2 

Chain OUTPUT (policy ACCEPT 24412 packets, 1578K bytes)
 pkts bytes target     prot opt in     out     source               destination         

Chain POSTROUTING (policy ACCEPT 24430 packets, 1579K bytes)
 pkts bytes target     prot opt in     out     source               destination         
    4   240 SNAT       all  --  *      *       10.0.0.2             0.0.0.0/0           to:1.1.1.2 

^ permalink raw reply

* Re: [RFC] net_sched: mark packet staying on queue too long
From: Stephen Hemminger @ 2011-01-03 17:58 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: hadi, Jarek Poplawski, David Miller, Jesper Dangaard Brouer,
	Patrick McHardy, netdev
In-Reply-To: <1294003631.2535.253.camel@edumazet-laptop>

On Sun, 02 Jan 2011 22:27:11 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> While playing with SFQ and other AQM, I was bothered to see how easy it
> was for a single tcp flow to 'fill the pipe' and consume a lot of memory
> buffers in queues. I know Jesper uses more than 50.000 SFQ on his
> routers, and with GRO packets this can consume a lot of memory.
> 
> I played a bit adding ECN in SFQ, first by marking packets for a
> particular flow if this flow qlen was above a given threshold, and later
> using another trick : ECN mark packet if it stayed longer than a given
> delay in the queue. This of course could be done on other modules, what
> do you think ?
> 
> The idea is to take into account the time packet stayed in the queue,
> regardless of other class parameters.
> 
> Following quick and dirty patch to show the idea. Of course, the delay
> should be configured on each SFQ/RED/XXXX class, so it would need an
> iproute2 patch, and the delay unit should be "ms" (or even "us"), not
> ticks, but as I said this is a quick and dirty patch (net-next-2.6
> based)
> 
> Using jiffies allows only delays above 3 or 4 ticks...
> 
> Or maybe ECN is just a dream :(

You might want to look into CHOKe and ECSFQ which are other AQM models
that have shown up in research.
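
For illustration, the delay-based marking Eric describes above (ECN mark a
packet if it stayed in the queue longer than a given delay) could look
roughly like the following user-space sketch. It is not his patch: the
struct, the function names, and the microsecond timestamps are all
hypothetical, and it assumes the caller supplies a monotonic clock value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical packet descriptor; NOT the kernel's struct sk_buff. */
struct demo_pkt {
	uint64_t enqueue_us;	/* timestamp taken at enqueue, in microseconds */
	bool ect;		/* sender is ECN-capable (ECT codepoint set) */
	bool ce;		/* Congestion Experienced mark */
};

/* Record the arrival time when the packet enters the queue. */
static void demo_enqueue(struct demo_pkt *p, uint64_t now_us)
{
	assert(p != NULL);
	p->enqueue_us = now_us;
}

/*
 * At dequeue, mark the packet if it waited longer than max_delay_us.
 * Returns true when a CE mark was applied. Packets that are not
 * ECN-capable are left untouched here; a real qdisc would have to
 * decide separately whether to drop them instead.
 */
static bool demo_dequeue_mark(struct demo_pkt *p, uint64_t now_us,
			      uint64_t max_delay_us)
{
	assert(p != NULL);
	if (now_us - p->enqueue_us <= max_delay_us)
		return false;
	if (!p->ect)
		return false;
	p->ce = true;
	return true;
}
```

Using a microsecond timestamp rather than jiffies is one way around the
"only delays above 3 or 4 ticks" limitation mentioned above, at the cost of
a clock read per packet.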


-- 

^ permalink raw reply

* Re: [PATCH net-next-2.6 1/2] can: add driver for Softing card
From: David Miller @ 2011-01-03 17:33 UTC (permalink / raw)
  To: kurt.van.dijck-/BeEPy95v10
  Cc: socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	netdev-u79uwXL29TY76Z2rM5mHXA, mkl-bIcnvbaLZ9MEGnE8C9+IrQ
In-Reply-To: <20110103163835.GD320-MxZ6Iy/zr/UdbCeoMzGj59i2O/JbrIOy@public.gmane.org>

From: Kurt Van Dijck <kurt.van.dijck-/BeEPy95v10@public.gmane.org>
Date: Mon, 3 Jan 2011 17:38:35 +0100

> On Fri, Dec 24, 2010 at 12:44:08PM +0100, Marc Kleine-Budde wrote:
>> 
>> >> hmmm..all stuff behind dpram is __iomem, isn't it? I think it should
>> >> only be accessed via the ioread/iowrite operators. Please check
>> > I did an ioremap_nocache. Since it is unaligned, ioread/iowrite would render
>> > a lot of statements.
>> 
>> The thing is, ioremapped mem should not be accessed directly. Instead
>> ioread/iowrite should be used. The softing driver should work on non x86
>> platforms, too.
>> 
> I use __attribute__((packed)) structs to refer to the iomemory.
> To read an unaligned uint16_t, I should then use 2 readb()'s ??
> 
> I could of course turn that sequence into a macro ....

Yes, this is what you'll need to do.
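
A minimal sketch of such a helper, written as an inline function rather
than a macro so it can be compiled and tried in user space. The
demo_readb() stand-in only imitates the kernel's readb(), and the
little-endian byte order is an assumption about the device memory layout,
not something stated in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the kernel's readb(); here it is just a volatile byte load. */
static inline uint8_t demo_readb(const volatile void *addr)
{
	return *(const volatile uint8_t *)addr;
}

/*
 * Read a 16-bit value at any alignment with two single-byte accesses.
 * Assumes the low byte is stored first (little-endian layout).
 */
static inline uint16_t demo_read_le16(const volatile void *addr)
{
	const volatile uint8_t *p = addr;

	assert(addr != NULL);
	return (uint16_t)(demo_readb(p) | (demo_readb(p + 1) << 8));
}
```

Because every access is a single byte, the helper never issues an
unaligned 16-bit load, which is the portability concern raised for
non-x86 platforms.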

^ permalink raw reply

* spurious netconsole: network logging stopped messages
From: Ferenc Wagner @ 2011-01-03 16:57 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 816 bytes --]

Hi,

In a running system, I can load the netconsole module, and gather the
messages on the other side all right. Now if I modprobe dummy and rmmod
dummy, the following message gets logged (via netconsole as well):

netconsole: network logging stopped, interface dummy0 unregistered

although I never asked netconsole to log through dummy0. The problem is
fairly obvious in netconsole_netdev_event() and could probably be fixed
with something like the first attached patch.  I didn't even compile-test
it, though, because the second attached patch made me realise
that I don't quite understand the bridge logic here.  Why should
netconsole stop logging through a bridge device if that loses a slave?
Or do I misunderstand the meaning of NETDEV_BONDING_DESLAVE?

Please Cc me, I'm not subscribed.
-- 
Thanks,
Feri.


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-netconsole-don-t-announce-stopping-if-nothing-happen.patch --]
[-- Type: text/x-diff, Size: 1568 bytes --]

>From b324e4425f47fcde54757c134a7fdd98f3dc9521 Mon Sep 17 00:00:00 2001
Message-Id: <b324e4425f47fcde54757c134a7fdd98f3dc9521.1294073227.git.wferi@niif.hu>
From: Ferenc Wagner <wferi@niif.hu>
Date: Mon, 3 Jan 2011 17:34:55 +0100
Subject: [PATCH 1/2] netconsole: don't announce stopping if nothing happened


Signed-off-by: Ferenc Wagner <wferi@niif.hu>
---
 drivers/net/netconsole.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index 94255f0..b2ad998 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -664,6 +664,7 @@ static int netconsole_netdev_event(struct notifier_block *this,
 	unsigned long flags;
 	struct netconsole_target *nt;
 	struct net_device *dev = ptr;
+	bool stopped = false;
 
 	if (!(event == NETDEV_CHANGENAME || event == NETDEV_UNREGISTER ||
 	      event == NETDEV_BONDING_DESLAVE || event == NETDEV_GOING_DOWN))
@@ -690,13 +691,14 @@ static int netconsole_netdev_event(struct notifier_block *this,
 			case NETDEV_GOING_DOWN:
 			case NETDEV_BONDING_DESLAVE:
 				nt->enabled = 0;
+				stopped = true;
 				break;
 			}
 		}
 		netconsole_target_put(nt);
 	}
 	spin_unlock_irqrestore(&target_list_lock, flags);
-	if (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE)
+	if (stopped && (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE))
 		printk(KERN_INFO "netconsole: network logging stopped, "
 			"interface %s %s\n",  dev->name,
 			event == NETDEV_UNREGISTER ? "unregistered" : "released slaves");
-- 
1.6.5


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #3: 0002-netconsole-clarify-stopping-message.patch --]
[-- Type: text/x-diff, Size: 1267 bytes --]

>From 211feede3c64419e6997e62a60f4c3c70f237ef8 Mon Sep 17 00:00:00 2001
Message-Id: <211feede3c64419e6997e62a60f4c3c70f237ef8.1294073227.git.wferi@niif.hu>
In-Reply-To: <b324e4425f47fcde54757c134a7fdd98f3dc9521.1294073227.git.wferi@niif.hu>
References: <b324e4425f47fcde54757c134a7fdd98f3dc9521.1294073227.git.wferi@niif.hu>
From: Ferenc Wagner <wferi@niif.hu>
Date: Mon, 3 Jan 2011 17:44:25 +0100
Subject: [PATCH 2/2] netconsole: clarify stopping message


Signed-off-by: Ferenc Wagner <wferi@niif.hu>
---
 drivers/net/netconsole.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/netconsole.c b/drivers/net/netconsole.c
index b2ad998..dfb67eb 100644
--- a/drivers/net/netconsole.c
+++ b/drivers/net/netconsole.c
@@ -699,8 +699,8 @@ static int netconsole_netdev_event(struct notifier_block *this,
 	}
 	spin_unlock_irqrestore(&target_list_lock, flags);
 	if (stopped && (event == NETDEV_UNREGISTER || event == NETDEV_BONDING_DESLAVE))
-		printk(KERN_INFO "netconsole: network logging stopped, "
-			"interface %s %s\n",  dev->name,
+		printk(KERN_INFO "netconsole: network logging stopped on "
+			"interface %s as it %s\n",  dev->name,
 			event == NETDEV_UNREGISTER ? "unregistered" : "released slaves");
 
 done:
-- 
1.6.5


^ permalink raw reply related

* Re: [PATCH] net: bridge: check the length of skb after nf_bridge_maybe_copy_header()
From: David Miller @ 2011-01-03 17:22 UTC (permalink / raw)
  To: xiaosuo; +Cc: shemminger, bridge, netdev
In-Reply-To: <AANLkTingFk8qJ8tBwSRCxD=9k-0O3GnTv9V9TbF4N=bg@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Mon, 3 Jan 2011 18:44:59 +0800

> On Sat, Jan 1, 2011 at 3:10 AM, David Miller <davem@davemloft.net> wrote:
>> From: Changli Gao <xiaosuo@gmail.com>
>> Date: Sat, 25 Dec 2010 21:41:30 +0800
>>
>>> Since nf_bridge_maybe_copy_header() may change the length of skb,
>>> we should check the length of skb after it to handle the PPPoE skbs.
>>>
>>> Signed-off-by: Changli Gao <xiaosuo@gmail.com>
>>
>> This is really strange.
>>
>> packet_length() subtracts VLAN_HLEN from the value it returns, so the
>> correct fix seems to be to make this function handle the PPPOE case
>> too.
>>
> 
> It is correct. The actual MTU of 802.1q frame is 4 bytes larger. For
> example, the MTU of ethernet is normally 1500, however the actual MTU
> of the 802.1Q is 1504.

Yes, I understand this, but I don't see why packet_length() cannot
simply account for PPPOE's encapsulation overhead just as it does for
VLAN's special MTU considerations.
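
To make the suggestion concrete, here is a user-space sketch of a
packet_length()-style helper that subtracts the encapsulation overhead for
both cases. The struct and all DEMO_ names are hypothetical; the values
mirror the usual 802.1Q tag length (4 bytes) and PPPoE session overhead
(6-byte PPPoE header plus 2-byte PPP protocol field), and the protocol
field is kept in host byte order for simplicity. It is not the fix that
was eventually applied.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_P_8021Q	0x8100	/* 802.1Q VLAN ethertype */
#define DEMO_ETH_P_PPP_SES	0x8864	/* PPPoE session ethertype */
#define DEMO_VLAN_HLEN		4	/* VLAN tag length */
#define DEMO_PPPOE_SES_HLEN	8	/* PPPoE header + PPP protocol field */

/* Hypothetical stand-in for the two sk_buff fields the check needs. */
struct demo_skb {
	unsigned int len;	/* payload length after the Ethernet header */
	uint16_t protocol;	/* ethertype, host byte order in this sketch */
};

/* Length to compare against the device MTU, minus encapsulation overhead. */
static unsigned int demo_packet_length(const struct demo_skb *skb)
{
	assert(skb != NULL);
	switch (skb->protocol) {
	case DEMO_ETH_P_8021Q:
		return skb->len - DEMO_VLAN_HLEN;
	case DEMO_ETH_P_PPP_SES:
		return skb->len - DEMO_PPPOE_SES_HLEN;
	default:
		return skb->len;
	}
}
```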

^ permalink raw reply

* Re: [PATCH V7 7/8] ptp: Added a clock driver for the IXP46x.
From: Richard Cochran @ 2011-01-03 17:07 UTC (permalink / raw)
  To: Pavel Machek
  Cc: linux-kernel, linux-api, netdev, Alan Cox, Arnd Bergmann,
	Christoph Lameter, David Miller, John Stultz, Krzysztof Halasa,
	Peter Zijlstra, Rodolfo Giometti, Thomas Gleixner
In-Reply-To: <20110102092042.GA14165@ucw.cz>

On Sun, Jan 02, 2011 at 10:20:42AM +0100, Pavel Machek wrote:
> Given the comments -- does manual really use camelCase crap?
> And... the identifiers actually combine camelCase with _. Better fix
> it.

That's the way Intel likes it:

http://download.intel.com/design/network/manuals/30626204.pdf

Page 837

^ permalink raw reply

* Re: [net-next-2.6 PATCH v2 3/3] net_sched: implement a root container qdisc sch_mclass
From: Jarek Poplawski @ 2011-01-03 17:04 UTC (permalink / raw)
  To: John Fastabend
  Cc: davem@davemloft.net, netdev@vger.kernel.org, hadi@cyberus.ca,
	shemminger@vyatta.com, tgraf@infradead.org,
	eric.dumazet@gmail.com, bhutchings@solarflare.com,
	nhorman@tuxdriver.com
In-Reply-To: <4D2162A8.60305@intel.com>

On Sun, Jan 02, 2011 at 09:46:16PM -0800, John Fastabend wrote:
> On 12/31/2010 1:25 AM, Jarek Poplawski wrote:
> > On 2010-12-21 20:29, John Fastabend wrote:
> >> +static int mclass_parse_opt(struct net_device *dev, struct tc_mclass_qopt *qopt)
> >> +{
> >> +	int i, j;
> >> +
> >> +	/* Verify TC offset and count are sane */
> > 
> > if (qopt->num_tc > TC_MAX_QUEUE) ?
> > 	return -EINVAL;
> 
> This would be caught later when netdev_set_num_tc() fails, although it
> is probably best to catch all failures in this function as early as
> possible.

Plus reading beyond the table range wouldn't look nice.

> 
> > 
> >> +	for (i = 0; i < qopt->num_tc; i++) {
> >> +		int last = qopt->offset[i] + qopt->count[i];
> >> +		if (last > dev->num_tx_queues)
> > 
> > if (last >= dev->num_tx_queues) ?
> > 
> >> +			return -EINVAL;
> >> +		for (j = i + 1; j < qopt->num_tc; j++) {
> >> +			if (last > qopt->offset[j])
> > 
> > if (last >= qopt->offset[j]) ?
> 	
> I believe the below works as expected. The offset needs to be verified
> (this I missed), but offset+count can be equal to num_tx_queues,
> indicating the last queue is in use. With 8 tx queues and num_tc=2, a
> valid configuration is tc1 with an offset of 0 and a count of 7, and
> tc2 with an offset of 7 and a count of 1.
> 
> 
>         /* Verify num_tc is in max range */
>         if (qopt->num_tc > TC_MAX_QUEUE)
>                 return -EINVAL;
> 
>         for (i = 0; i < qopt->num_tc; i++) {
>                 /* Verify the queue offset is in the num tx range */
>                 if (qopt->offset[i] >= dev->num_tx_queues)
>                         return -EINVAL;
>                 /* Verify the queue count is in tx range being equal to the
>                  * num_tx_queues indicates the last queue is in use.
>                  */
>                 else if (qopt->offset[i] + qopt->count[i] > dev->num_tx_queues)
>                         return -EINVAL;
> 
>                 /* Verify that the offset and counts do not overlap */
>                 for (j = i + 1; j < qopt->num_tc; j++) {
>                         if (last > qopt->offset[j])
>                                 return -EINVAL;
>                 }
>         }

Yes, after assigning the 'last' it should work OK ;-)

Thanks,
Jarek P.
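
For reference, the checks discussed above can be put together into a
self-contained user-space sketch with 'last' assigned before it is used.
The struct, the DEMO_TC_MAX_QUEUE value of 16, and the function signature
are assumptions made for illustration; they are not copied from the patch.
Like the code in the thread, the overlap test assumes the classes are
listed in ascending offset order.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define DEMO_TC_MAX_QUEUE 16	/* assumed upper bound on traffic classes */

/* Hypothetical stand-in for the qdisc options structure. */
struct demo_qopt {
	unsigned int num_tc;
	unsigned int offset[DEMO_TC_MAX_QUEUE];
	unsigned int count[DEMO_TC_MAX_QUEUE];
};

/* Return 0 if the offset/count ranges are sane, -EINVAL otherwise. */
static int demo_parse_opt(unsigned int num_tx_queues,
			  const struct demo_qopt *qopt)
{
	unsigned int i, j;

	assert(qopt != NULL);

	/* Verify num_tc is in range before indexing the arrays */
	if (qopt->num_tc > DEMO_TC_MAX_QUEUE)
		return -EINVAL;

	for (i = 0; i < qopt->num_tc; i++) {
		/* One past the last queue used by class i */
		unsigned int last = qopt->offset[i] + qopt->count[i];

		/* The first queue of the class must exist */
		if (qopt->offset[i] >= num_tx_queues)
			return -EINVAL;
		/* last == num_tx_queues means the final queue is in use */
		if (last > num_tx_queues)
			return -EINVAL;
		/* Later classes must start at or after 'last' */
		for (j = i + 1; j < qopt->num_tc; j++) {
			if (last > qopt->offset[j])
				return -EINVAL;
		}
	}
	return 0;
}
```

With 8 tx queues, the configuration from the thread (offset 0, count 7 for
the first class and offset 7, count 1 for the second) passes, while
stretching the first class to a count of 8 is rejected as overlapping.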

^ permalink raw reply

* Re: [PATCH] ll_temac: Fix section mismatch from the temac_of_probe
From: Grant Likely @ 2011-01-03 17:03 UTC (permalink / raw)
  To: Michal Simek; +Cc: netdev, linux-kernel, devicetree-discuss, davem
In-Reply-To: <1294050756-31099-1-git-send-email-monstr@monstr.eu>

On Mon, Jan 03, 2011 at 11:32:36AM +0100, Michal Simek wrote:
> Replace __init by __devinit.
> 
> Warning message:
> WARNING: vmlinux.o(.data+0xbc14): Section mismatch in reference from the variable
> temac_of_driver to the function .init.text:temac_of_probe()
> The variable temac_of_driver references
> the function __init temac_of_probe()
> If the reference is valid then annotate the
> variable with __init* or __refdata (see linux/init.h) or name the variable:
> *_template, *_timer, *_sht, *_ops, *_probe, *_probe_one, *_console,
> 
> Signed-off-by: Michal Simek <monstr@monstr.eu>

Acked-by: Grant Likely <grant.likely@secretlab.ca>

> ---
>  drivers/net/ll_temac_main.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ll_temac_main.c b/drivers/net/ll_temac_main.c
> index 9f8e702..beb6ed8 100644
> --- a/drivers/net/ll_temac_main.c
> +++ b/drivers/net/ll_temac_main.c
> @@ -952,7 +952,7 @@ static const struct attribute_group temac_attr_group = {
>  	.attrs = temac_device_attrs,
>  };
>  
> -static int __init
> +static int __devinit
>  temac_of_probe(struct platform_device *op, const struct of_device_id *match)
>  {
>  	struct device_node *np;
> -- 
> 1.5.5.6
> 

^ permalink raw reply

* Re: [net-next-2.6 PATCH v2 3/3] net_sched: implement a root container qdisc sch_mclass
From: Jarek Poplawski @ 2011-01-03 17:02 UTC (permalink / raw)
  To: John Fastabend
  Cc: davem@davemloft.net, netdev@vger.kernel.org, hadi@cyberus.ca,
	shemminger@vyatta.com, tgraf@infradead.org,
	eric.dumazet@gmail.com, bhutchings@solarflare.com,
	nhorman@tuxdriver.com
In-Reply-To: <4D2161FF.4070804@intel.com>

On Sun, Jan 02, 2011 at 09:43:27PM -0800, John Fastabend wrote:
> On 12/30/2010 3:37 PM, Jarek Poplawski wrote:
> > John Fastabend wrote:
> >> This implements a mclass 'multi-class' queueing discipline that by
> >> default creates multiple mq qdisc's one for each traffic class. Each
> >> mq qdisc then owns a range of queues per the netdev_tc_txq mappings.
> > 
> > Is it really necessary to add one more abstraction layer for this,
> > probably not most often used (or even asked by users), functionality?
> > Why mclass can't simply do these few things more instead of attaching
> > (and changing) mq?
> > 
> 
> The statistics work nicely when the mq qdisc is used. 

Well, I sometimes add leaf qdiscs only to get class stats with less
typing, too ;-)

> 
> qdisc mclass 8002: root  tc 4 map 0 1 2 3 0 1 2 3 1 1 1 1 1 1 1 1
>              queues:(0:1) (2:3) (4:5) (6:15)
>  Sent 140 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc mq 8003: parent 8002:1
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc mq 8004: parent 8002:2
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc mq 8005: parent 8002:3
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc mq 8006: parent 8002:4
>  Sent 140 bytes 2 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc sfq 8007: parent 8005:1 limit 127p quantum 1514b
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> qdisc sfq 8008: parent 8005:2 limit 127p quantum 1514b
>  Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
>  backlog 0b 0p requeues 0
> 
> The mclass gives the statistics for the interface, and the statistics
> on each mq qdisc give the statistics for each traffic class. Also, when
> using the mq qdisc with this abstraction, other qdiscs can be grafted
> onto the queues. For example, sch_sfq is used in the output above.

IMHO, these tc offsets and counts simply make a two-level hierarchy
(classes with leaf subclasses), similar to (or simpler than) other
classful qdiscs, which manage it all inside one module. Of course,
we could think of another way of code organization, but it should
be rather done at the beginning of schedulers design. The mq qdisc
broke the design a bit adding a fake root, but I doubt we should go
deeper unless it's necessary. Doing mclass (or something) as a more
complex alternative to mq should be enough. Why couldn't mclass graft
sch_sfq the same way as mq?

> 
> Although I am not too hung up on this use case, it does seem to be a
> good abstraction to me. Is it strictly necessary, though? No: the class
> statistics of mclass could be used to get stats per traffic class.

I am not too hung up on this either, especially if it's OK to others,
especially to DaveM ;-)

> 
> > ...
> >> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> >> index 0af57eb..723ee52 100644
> >> --- a/include/net/sch_generic.h
> >> +++ b/include/net/sch_generic.h
> >> @@ -50,6 +50,7 @@ struct Qdisc {
> >>  #define TCQ_F_INGRESS		4
> >>  #define TCQ_F_CAN_BYPASS	8
> >>  #define TCQ_F_MQROOT		16
> >> +#define TCQ_F_MQSAFE		32
> > 
> > If every other qdisc added a flag for qdiscs it likes...
> > 
> 
> then we run out of bits and get unneeded complexity. I think I will drop the MQSAFE bit completely and let user space catch this. The worst that should happen is the noop qdisc is used.

Maybe you're right. On the other hand, usually flags are added for
more general purpose and the optimal/wrong configs are the matter of
documentation.

> 
> >> @@ -709,7 +709,13 @@ static void attach_default_qdiscs(struct net_device *dev)
> >>  		dev->qdisc = txq->qdisc_sleeping;
> >>  		atomic_inc(&dev->qdisc->refcnt);
> >>  	} else {
> >> -		qdisc = qdisc_create_dflt(txq, &mq_qdisc_ops, TC_H_ROOT);
> >> +		if (dev->num_tc)
> > 
> > Actually, where is this num_tc expected to be set? I can see it inside
> > mclass only, with unsetting on destruction, but probably I miss something.
> 
> Either through mclass as you noted or a driver could set the num_tc. One of the RFC's I sent out has ixgbe setting the num_tc when DCB was enabled.

OK, I probably missed this second possibility in the last version.

...
> >> +	/* Unwind attributes on failure */
> >> +	u8 unwnd_tc = dev->num_tc;
> >> +	u8 unwnd_map[16];
> > 
> > [TC_MAX_QUEUE] ?
> 
> Actually TC_BITMASK+1 is probably more accurate. This array maps the skb priority to a traffic class after the priority is masked with TC_BITMASK.
> 
> > 
> >> +	struct netdev_tc_txq unwnd_txq[16];
> >> +
> 
> Although unwnd_txq should be TC_MAX_QUEUE.
...
> >> +	/* Always use supplied priority mappings */
> >> +	for (i = 0; i < 16; i++) {
> > 
> > i < qopt->num_tc ?
> 
> Nope, TC_BITMASK+1 here. If we only have 4 tcs for example we still need to map all 16 priority values to a tc.

OK, anyway, all these '16' should be 'upgraded'.
 
Thanks,
Jarek P.
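
The point about always mapping all 16 priority values can be sketched as
follows. The DEMO_TC_BITMASK value of 15 and the helper are assumptions
for illustration only; the example table used in the usage note is the
"map" line from the tc output quoted earlier in this message (4 traffic
classes, 16 priorities).

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TC_BITMASK 15	/* assumed mask: 16 priority slots */

/*
 * Map an skb priority to a traffic class. The priority is masked first,
 * so the table needs DEMO_TC_BITMASK + 1 entries no matter how many
 * traffic classes are actually configured.
 */
static unsigned int demo_prio_to_tc(const unsigned char *map,
				    unsigned int prio)
{
	assert(map != NULL);
	return map[prio & DEMO_TC_BITMASK];
}
```

With the table { 0, 1, 2, 3, 0, 1, 2, 3, 1, 1, 1, 1, 1, 1, 1, 1 },
priority 5 lands in traffic class 1, and priority 22 is masked down to 6
and lands in traffic class 2. Sizing the table by num_tc instead would
leave priorities 4 through 15 without an entry.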

^ permalink raw reply
