Netdev List


* Re: [PATCH 1/2] bonding: fix incorrect transmit queue offset
From: Phil Oester @ 2011-02-25 22:56 UTC (permalink / raw)
  To: David Miller; +Cc: bhutchings, andy, netdev, fubar
In-Reply-To: <20110223.155451.70202591.davem@davemloft.net>

On Wed, Feb 23, 2011 at 03:54:51PM -0800, David Miller wrote:
> From: Ben Hutchings <bhutchings@solarflare.com>
> Date: Wed, 23 Feb 2011 23:37:49 +0000
> 
> > On Wed, 2011-02-23 at 15:13 -0800, David Miller wrote:
> >> From: Phil Oester <kernel@linuxace.com>
> >> Date: Wed, 23 Feb 2011 15:08:44 -0800
> >> 
> >> > On Wed, Feb 23, 2011 at 02:42:49PM -0500, Andy Gospodarek wrote:
> >> >> +     while (txq >= dev->real_num_tx_queues) {
> >> >> +             /* let the user know if we do not have enough tx queues */
> >> >> +             if (net_ratelimit())
> >> >> +                     pr_warning("%s selects invalid tx queue %d.  Consider"
> >> >> +                                " setting module option tx_queues > %d.",
> >> >> +                                dev->name, txq, dev->real_num_tx_queues);
> >> >> +             txq -= dev->real_num_tx_queues;
> >> >> +     }
> >> > 
> >> > Think this would be better as a WARN_ONCE, as otherwise syslog will still
> >> > get flooded with this - even when ratelimited.  See get_rps_cpu in 
> >> > net/core/dev.c as an example.
> >> 
> >> Agreed.
> > 
> > This shouldn't WARN at all.  It is perfectly valid (though non-optimal)
> > to have different numbers of queues on two different multiqueue devices.
> 
> That's also a good point.

The patch works as expected.  Do we have any agreement on a final version?

Phil
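
For readers skimming the thread, the queue handling under discussion can be
sketched outside the kernel. This is only an illustration of the quoted loop,
not the final patch: wrap_txq() is a hypothetical name, and the warning that
the thread is debating is deliberately left out.

```c
#include <assert.h>

/*
 * Hypothetical helper (not a kernel function): fold a recorded queue
 * index into the range [0, real_num_tx_queues) by repeated subtraction,
 * mirroring the while loop in the quoted patch.
 */
unsigned int wrap_txq(unsigned int txq, unsigned int real_num_tx_queues)
{
	/* A device with zero queues would make the loop spin forever. */
	assert(real_num_tx_queues > 0);

	while (txq >= real_num_tx_queues)
		txq -= real_num_tx_queues;

	return txq;
}
```

An index that is already valid is returned unchanged, so devices with enough
queues see no difference in behavior.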


* Re: TX VLAN acceleration on bridges broken in 2.6.37?
From: Jesse Gross @ 2011-02-25 22:53 UTC (permalink / raw)
  To: Jan Niehusmann; +Cc: linux-kernel, netdev
In-Reply-To: <20110221232902.GA3440@x61s.reliablesolutions.de>

On Mon, Feb 21, 2011 at 3:29 PM, Jan Niehusmann <jan@gondor.com> wrote:
> With the following configuration, sending vlan tagged traffic from a
> bridged interface doesn't work in 2.6.37.
> The same configuration does work with 2.6.36:
>
> - bridge br0 with physical interface eth0
> - eth0 being an e1000e device (don't know if that's important)
> - vlan interface br0.10
> - (on 2.6.37) tx vlan acceleration active on br0 (default)
>
> Networking on br0.10 doesn't work, and tcpdump on eth0 shows packets
> sent on br0.10 as untagged, instead of vlan 10 tagged.

I looked at this and wasn't able to reproduce it with either 2.6.37 or
net-next on either of my NICs (ixgbe and bnx2).  I'm guessing that it
is specific to the e1000e driver.  I know that some other Intel NICs
require vlan stripping on receive to be enabled for vlan insertion on
transmit to work.  Since this driver has not been converted over to
use the new vlan model yet, it only enables these things if a vlan is
directly configured on it.  To confirm this can you try a few things:

* Directly configure the vlan on the device instead of going through the bridge.
* Use the bridge but also configure an unused vlan device on the
physical interface.
* Double check that tcpdump with the settings that you are using shows
vlan tags in other situations.  In some cases you need to use the 'e'
flag with tcpdump in order for it to show vlan tags.  If it is the
driver/NIC that is dropping the tags, tcpdump should still show them.

Thanks.


* Re: SO_REUSEPORT - can it be done in kernel?
From: Thomas Graf @ 2011-02-25 22:48 UTC (permalink / raw)
  To: Rick Jones; +Cc: Tom Herbert, Bill Sommerfeld, Daniel Baluta, netdev
In-Reply-To: <1298661495.14113.152.camel@tardy>

On Fri, Feb 25, 2011 at 11:18:15AM -0800, Rick Jones wrote:
> I think the idea is goodness, but will ask, was the (first) bottleneck
> actually in the kernel, or was it in bind itself?  I've seen
> single-instance, single-byte burst-mode netperf TCP_RR do in excess of
> 300K transactions per second (with TCP_NODELAY set) on an X5560 core.
> 
> ftp://ftp.netperf.org/netperf/misc/dl380g6_X5560_rhel54_ad386_cxgb3_1.4.1.2_b2b_to_same_agg_1500mtu_20100513-2.csv
> 
> and that was with now ancient RHEL5.4 bits...  yes, there is a bit of
> apples, oranges and kumquats but still, I am wondering if this didn't
> also "work around" some internal BIND scaling issues as well.

Yes it is. We have observed two separate bottlenecks.

The first one we discovered is within BIND. With more than one worker
thread in use, strace showed a ton of futex() system calls to the
kernel as soon as the number of queries crossed a magic barrier.
This suggested heavy lock contention within BIND.

This BIND lock contention was not visible on all systems having scalability
issues though. Some machines were not able to deliver enough queries to
BIND in order for the lock contention to appear.


* Re: [PATCH net-2.6] bonding: drop frames received with master's source MAC
From: Jay Vosburgh @ 2011-02-25 22:28 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Andy Gospodarek, netdev, David Miller, Herbert Xu, Jiri Pirko
In-Reply-To: <4D68276B.90104@gmail.com>

Nicolas de Pesloüan 	<nicolas.2p.debian@gmail.com> wrote:

>Le 25/02/2011 22:13, Andy Gospodarek a écrit :
>> I was looking at my system and wondering why I sometimes saw these
>> DAD messages in my logs:
>>
>> bond0: IPv6 duplicate address fe80::21b:21ff:fe38:2ec4 detected!
>>
>> I traced it back and realized the IPv6 Neighbor Solicitations I was
>> sending were also coming back into the stack on the slave(s) that did
>> not transmit the frames.  I could not think of a compelling reason to
>> notify the user that a NS we sent came back, so I set out to just drop
>> the frame silently in ndisc_recv_ns drop.
>>
>> That seemed to work well, but when I thought about it I could not think of a
>> compelling reason to save any of these frames.  Dropping them as soon as
>> we get them seems like a much better idea as it fixes other issues that
>> may exist for more than just IPv6 DAD.
>>
>> I chose to check the incoming frame against the master's MAC address as
>> that should be the MAC address used anytime a broadcast frame is sent by
>> the bonding driver that had the chance to make its way back into one of
>> the other devices.
>
>I think this could break the ARP monitoring. ARP monitoring relies on a
>normal protocol handler, registered in bond_main.c.
>
>void bond_register_arp(struct bonding *bond)
>{
>        struct packet_type *pt = &bond->arp_mon_pt;
>
>        if (pt->type)
>                return;
>
>        pt->type = htons(ETH_P_ARP);
>        pt->dev = bond->dev;
>        pt->func = bond_arp_rcv;
>        dev_add_pack(pt);
>}
>
>As far as I understand, some variants of arp_validate require the
>backup interfaces to receive ARP requests sent from the master, through
>the active interface, presumably with the master MAC as the source MAC.
>
>As this protocol handler is registered at the master level, the exact
>match logic in __netif_receive_skb(), which applies at the slave level,
>shouldn't deliver this skb to bond_arp_rcv().
>
>Can someone confirm ? Jay ?

	Yes, this is how the ARP monitor works for inactive slaves in
active-backup mode.  It expects to see the broadcast ARP requests loop
back around to the inactive slaves.  If arp_validate is on, the ARP
frames will be inspected to ensure that they were sent by the appropriate
master.

	Still, though, if the NS packets are coming in on an inactive
slave, why aren't they already being dropped?  Even in alb mode, there
is a loose concept of "active" and "inactive" in the sense that only one
slave is used for things like broadcast or multicast.

	Andy, what configuration are you seeing this problem in?

	-J

>	Nicolas.
>
>> Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
>> Cc: David Miller <davem@davemloft.net>
>> Cc: Herbert Xu <herbert@gondor.apana.org.au>
>> Cc: Jay Vosburgh <fubar@us.ibm.com>
>> Cc: Jiri Pirko <jpirko@redhat.com>
>>
>> ---
>>
>> I realize this patch comes right in the middle of Jiri Pirko's attempts
>> to move this functionality to the bonding driver, but I think this may
>> be important enough to include now (possibly in 2.6.38 and to -stable)
>> if others agree.
>>
>> ---
>>   net/core/dev.c |    5 +++++
>>   1 files changed, 5 insertions(+), 0 deletions(-)
>>
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 8ae6631..4a76ccd 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -2971,6 +2971,11 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
>>   int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
>>   {
>>   	struct net_device *dev = skb->dev;
>> +	struct ethhdr *eth = eth_hdr(skb);
>> +
>> +	/* Drop all frames with the bond master's source address. */
>> +	if (unlikely(!compare_ether_addr(eth->h_source, master->dev_addr)))
>> +		return 1;
>>
>>   	if (master->priv_flags & IFF_MASTER_ARPMON)
>>   		dev->last_rx = jiffies;

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: [PATCH net-2.6] bonding: drop frames received with master's source MAC
From: Andy Gospodarek @ 2011-02-25 22:24 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Andy Gospodarek, netdev, David Miller, Herbert Xu, Jay Vosburgh,
	Jiri Pirko
In-Reply-To: <4D68276B.90104@gmail.com>

On Fri, Feb 25, 2011 at 11:04:27PM +0100, Nicolas de Pesloüan wrote:
> Le 25/02/2011 22:13, Andy Gospodarek a écrit :
>> I was looking at my system and wondering why I sometimes saw these
>> DAD messages in my logs:
>>
>> bond0: IPv6 duplicate address fe80::21b:21ff:fe38:2ec4 detected!
>>
>> I traced it back and realized the IPv6 Neighbor Solicitations I was
>> sending were also coming back into the stack on the slave(s) that did
>> not transmit the frames.  I could not think of a compelling reason to
>> notify the user that a NS we sent came back, so I set out to just drop
>> the frame silently in ndisc_recv_ns drop.
>>
>> That seemed to work well, but when I thought about it I could not think of a
>> compelling reason to save any of these frames.  Dropping them as soon as
>> we get them seems like a much better idea as it fixes other issues that
>> may exist for more than just IPv6 DAD.
>>
>> I chose to check the incoming frame against the master's MAC address as
>> that should be the MAC address used anytime a broadcast frame is sent by
>> the bonding driver that had the chance to make its way back into one of
>> the other devices.
>
> I think this could break the ARP monitoring. ARP monitoring relies on a
> normal protocol handler, registered in bond_main.c.
>
> void bond_register_arp(struct bonding *bond)
> {
>         struct packet_type *pt = &bond->arp_mon_pt;
>
>         if (pt->type)
>                 return;
>
>         pt->type = htons(ETH_P_ARP);
>         pt->dev = bond->dev;
>         pt->func = bond_arp_rcv;
>         dev_add_pack(pt);
> }
>
> As far as I understand, some variants of arp_validate require the
> backup interfaces to receive ARP requests sent from the master, through 
> the active interface, presumably with the master MAC as the source MAC.
>
> As this protocol handler is registered at the master level, the exact 
> match logic in __netif_receive_skb(), which applies at the slave level,
> shouldn't deliver this skb to bond_arp_rcv().
>
> Can someone confirm ? Jay ?
>
> 	Nicolas.
>

I confirmed your suspicion, this breaks ARP monitoring.  I would still
welcome other opinions though as I think it would be nice to fix this as
low as possible.



* ANNOUNCE: debloat-testing kernel git tree
From: John W. Linville @ 2011-02-25 22:22 UTC (permalink / raw)
  To: linux-kernel, netdev, linux-wireless, bloat-devel

Announcement

The bufferbloat project [1] is pleased to announce the availability
of the debloat-testing Linux kernel git tree:

	git://git.infradead.org/debloat-testing.git

The purpose of this tree is to provide a reasonably stable base for
the development and testing of new algorithms, miscellaneous fixes,
and maybe a few hacks intended to advance the cause of eliminating
or at least mitigating bufferbloat in the Linux world.

Introduction

Bufferbloat is a term coined by Jim Gettys to describe the increasing
prevalence of large and (particularly) unmanaged network buffers along
the network links that comprise the Internet [2].  If you are not aware
of the problems with network latency under load that the Internet is
already encountering, we encourage you to visit Jim Gettys' blog [3].
There Jim has begun to fit together enough puzzle pieces to at least
frame the issue.

Jim has also made available slides and an audio recording (edited
for time) from a presentation on this topic:

	http://mirrors.bufferbloat.net/Talks/BellLabs01192011/

Kernel Bits

The debloat-testing tree is intended to track full and -rc releases
from linux-2.6, with interesting patches cherry-picked from net-next
and various experimental bits added on top.  The current stable of
such patches includes the following:

Eric Dumazet (based on original work by Juliusz Chroboczek):
      net_sched: SFB flow scheduler

stephen hemminger:
      sched: CHOKe flow scheduler

John Fastabend:
      net: implement mechanism for HW based QOS
      net_sched: implement a root container qdisc sch_mqprio

John W. Linville:
      mac80211: implement eBDP algorithm to fight bufferbloat

Nathaniel J. Smith:
      iwlwifi: Simplify tx queue management
      iwlwifi: Convert the tx queue high_mark to an atomic_t
      iwlwifi: Invert the sense of the queue high_mark
      iwlwifi: auto-tune tx queue size to minimize latency
      iwlwifi: make current tx queue sizes visible in debugfs

Dave Taht:
      Bufferbloat reduction for the e1000 driver that started it all
      Reduce bufferbloated default for e1000e, increase dynamic range
      Smash bufferbloat in the ath9k driver

Userland Bits

Patches for the userspace tc utility incorporating support for both the
CHOKe AQM and the Stochastic Fair Blue scheduler (SFB) are available:

	https://github.com/dtaht/iproute2bufferbloat

Contributions

Please send any experimental or research-oriented patches related to
bufferbloat to the bloat-devel@lists.bufferbloat.net list.  Reminders
of more mainstream patches that may be relevant and/or interesting
for cherry-picking into debloat-testing are welcome there as well.

Obviously, patches that are ready for normal merge consideration
should continue to be sent to netdev, linux-wireless, linux-kernel,
or whatever other existing list is appropriate for them.

Thanks

Finally, we want to offer a huge thanks to the 130+ new members of
the bloat mailing list [4] for leaping into the fray, and to David
Woodhouse for hosting the debloat-testing tree at infradead.

Please help us beat the bloat.  Good luck, and happy debloating!

Notes

[1] http://bufferbloat.net
[2] http://gettys.wordpress.com/what-is-bufferbloat-anyway/
[3] http://en.wordpress.com/tag/bufferbloat/
[4] https://lists.bufferbloat.net
--
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.


* Re: [PATCH net-2.6] bonding: drop frames received with master's source MAC
From: Nicolas de Pesloüan @ 2011-02-25 22:04 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: netdev, David Miller, Herbert Xu, Jay Vosburgh, Jiri Pirko
In-Reply-To: <1298668408-14849-1-git-send-email-andy@greyhouse.net>

Le 25/02/2011 22:13, Andy Gospodarek a écrit :
> I was looking at my system and wondering why I sometimes saw these
> DAD messages in my logs:
>
> bond0: IPv6 duplicate address fe80::21b:21ff:fe38:2ec4 detected!
>
> I traced it back and realized the IPv6 Neighbor Solicitations I was
> sending were also coming back into the stack on the slave(s) that did
> not transmit the frames.  I could not think of a compelling reason to
> notify the user that a NS we sent came back, so I set out to just drop
> the frame silently in ndisc_recv_ns drop.
>
> That seemed to work well, but when I thought about it I could not think of a
> compelling reason to save any of these frames.  Dropping them as soon as
> we get them seems like a much better idea as it fixes other issues that
> may exist for more than just IPv6 DAD.
>
> I chose to check the incoming frame against the master's MAC address as
> that should be the MAC address used anytime a broadcast frame is sent by
> the bonding driver that had the chance to make its way back into one of
> the other devices.

I think this could break the ARP monitoring. ARP monitoring relies on a normal protocol handler,
registered in bond_main.c.

void bond_register_arp(struct bonding *bond)
{
         struct packet_type *pt = &bond->arp_mon_pt;

         if (pt->type)
                 return;

         pt->type = htons(ETH_P_ARP);
         pt->dev = bond->dev;
         pt->func = bond_arp_rcv;
         dev_add_pack(pt);
}

As far as I understand, some variants of arp_validate require the backup interfaces to receive
ARP requests sent from the master, through the active interface, presumably with the master MAC as 
the source MAC.

As this protocol handler is registered at the master level, the exact match logic in 
__netif_receive_skb(), which applies at the slave level, shouldn't deliver this skb to bond_arp_rcv().

Can someone confirm ? Jay ?

	Nicolas.

> Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
> Cc: David Miller <davem@davemloft.net>
> Cc: Herbert Xu <herbert@gondor.apana.org.au>
> Cc: Jay Vosburgh <fubar@us.ibm.com>
> Cc: Jiri Pirko <jpirko@redhat.com>
>
> ---
>
> I realize this patch comes right in the middle of Jiri Pirko's attempts
> to move this functionality to the bonding driver, but I think this may
> be important enough to include now (possibly in 2.6.38 and to -stable)
> if others agree.
>
> ---
>   net/core/dev.c |    5 +++++
>   1 files changed, 5 insertions(+), 0 deletions(-)
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8ae6631..4a76ccd 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2971,6 +2971,11 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
>   int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
>   {
>   	struct net_device *dev = skb->dev;
> +	struct ethhdr *eth = eth_hdr(skb);
> +
> +	/* Drop all frames with the bond master's source address. */
> +	if (unlikely(!compare_ether_addr(eth->h_source, master->dev_addr)))
> +		return 1;
>
>   	if (master->priv_flags & IFF_MASTER_ARPMON)
>   		dev->last_rx = jiffies;



* Re: [patch 0/1] [resend] s390: one more qeth patches for net-next
From: David Miller @ 2011-02-25 22:04 UTC (permalink / raw)
  To: frank.blaschka; +Cc: netdev, linux-s390
In-Reply-To: <20110225123208.764828326@de.ibm.com>

From: frank.blaschka@de.ibm.com
Date: Fri, 25 Feb 2011 13:32:08 +0100

> I did not find this patch in net-next so far

It's in my backlog; a simple query of the active patches in patchwork
would have shown you this.

	http://patchwork.ozlabs.org/project/netdev/

Please do not resubmit patches which are properly queued up in
patchwork, as this makes more work for me.


* Re: [PATCH 7/7] sched: protocol only needed when CONFIG_NET_CLS_ACT is enabled
From: David Miller @ 2011-02-25 22:01 UTC (permalink / raw)
  To: hagen; +Cc: netdev, fw
In-Reply-To: <1298648721-3026-7-git-send-email-hagen@jauu.net>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Fri, 25 Feb 2011 16:45:21 +0100

> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>

Applied to net-next-2.6


* Re: [PATCH 6/7] ipv6: ignore rtnl_unicast() return code
From: David Miller @ 2011-02-25 22:01 UTC (permalink / raw)
  To: hagen; +Cc: netdev, fw
In-Reply-To: <1298648721-3026-6-git-send-email-hagen@jauu.net>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Fri, 25 Feb 2011 16:45:20 +0100

> rtnl_unicast() return value is not of interest, we can silently ignore
> it, save some instructions and four bytes on the stack.
> 
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>

Applied to net-next-2.6


* Re: [PATCH 5/7] ipv6: variable next is never used in this function
From: David Miller @ 2011-02-25 22:01 UTC (permalink / raw)
  To: hagen; +Cc: netdev, fw
In-Reply-To: <1298648721-3026-5-git-send-email-hagen@jauu.net>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Fri, 25 Feb 2011 16:45:19 +0100

> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>

Applied to net-next-2.6


* Re: [PATCH 4/7] ipv6: hash is calculated but not used afterwards
From: David Miller @ 2011-02-25 22:00 UTC (permalink / raw)
  To: hagen; +Cc: netdev, fw
In-Reply-To: <1298648721-3026-4-git-send-email-hagen@jauu.net>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Fri, 25 Feb 2011 16:45:18 +0100

> hash is declared and assigned but not used anymore. ipv6_addr_hash()
> exhibits no side effects.
> 
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>

Applied to net-next-2.6


* Re: [PATCH 3/7] ipv6: totlen is declared and assigned but not used
From: David Miller @ 2011-02-25 22:00 UTC (permalink / raw)
  To: hagen; +Cc: netdev, fw
In-Reply-To: <1298648721-3026-3-git-send-email-hagen@jauu.net>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Fri, 25 Feb 2011 16:45:17 +0100

> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>

Applied to net-next-2.6


* Re: [PATCH 2/7] dccp: newdp is declared/assigned but never be used
From: David Miller @ 2011-02-25 22:00 UTC (permalink / raw)
  To: hagen; +Cc: netdev, fw
In-Reply-To: <1298648721-3026-2-git-send-email-hagen@jauu.net>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Fri, 25 Feb 2011 16:45:16 +0100

> Declaration and assignment of newdp is removed. Usage of dccp_sk()
> exhibits no side effects.
> 
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>

Applied to net-next-2.6


* Re: [PATCH 1/7] net: handle addr_type of 0 properly
From: David Miller @ 2011-02-25 21:59 UTC (permalink / raw)
  To: hagen; +Cc: netdev, fw
In-Reply-To: <1298648721-3026-1-git-send-email-hagen@jauu.net>

From: Hagen Paul Pfeifer <hagen@jauu.net>
Date: Fri, 25 Feb 2011 16:45:15 +0100

> addr_type of 0 means that the type should be adopted from from_dev and
> not from __hw_addr_del_multiple(). Unfortunately it isn't so and
> addr_type will always be considered. Fix this by implementing the
> considered and documented behavior.
> 
> Signed-off-by: Hagen Paul Pfeifer <hagen@jauu.net>

Applied to net-2.6, thanks.


* Re: [RFC] be2net: add rxhash support
From: Ajit Khaparde @ 2011-02-25 21:35 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

> -----Original Message-----
> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Friday, February 25, 2011 1:45 PM
> To: Khaparde, Ajit
> Cc: netdev@vger.kernel.org
> Subject: Re: [RFC] be2net: add rxhash support
> 
> Le vendredi 25 février 2011 à 13:36 -0600, Ajit Khaparde a écrit :
> > > -----Original Message-----
> > > From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> > >
> > > How is it possible ?
> > >
> > > (I have a VLAN config on top of a bonding)
> > >
> > I'm looking at this..
> > There is no switch involved in your test, just back to back?
> >
> 
> There is a switch.
> 
> Machines are HP ProLiant BL460c G7
> 
> But why do you ask ?
> 

I asked that because, if a switch is part of the configuration,
the ASIC can receive packets other than the tcp flow.

And if hashing is enabled for IP packets, we can see this behavior.
The other values indicate that hashing has been enabled for IPv4 packets.

Thanks
-Ajit


* [PATCH net-2.6] bonding: drop frames received with master's source MAC
From: Andy Gospodarek @ 2011-02-25 21:13 UTC (permalink / raw)
  To: netdev; +Cc: David Miller, Herbert Xu, Jay Vosburgh, Jiri Pirko

I was looking at my system and wondering why I sometimes saw these
DAD messages in my logs:

bond0: IPv6 duplicate address fe80::21b:21ff:fe38:2ec4 detected!

I traced it back and realized the IPv6 Neighbor Solicitations I was
sending were also coming back into the stack on the slave(s) that did
not transmit the frames.  I could not think of a compelling reason to
notify the user that a NS we sent came back, so I set out to just drop
the frame silently in ndisc_recv_ns drop.

That seemed to work well, but when I thought about it I could not think of a
compelling reason to save any of these frames.  Dropping them as soon as
we get them seems like a much better idea as it fixes other issues that
may exist for more than just IPv6 DAD.

I chose to check the incoming frame against the master's MAC address as
that should be the MAC address used anytime a broadcast frame is sent by
the bonding driver that had the chance to make its way back into one of
the other devices.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Cc: David Miller <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Jay Vosburgh <fubar@us.ibm.com>
Cc: Jiri Pirko <jpirko@redhat.com>

---

I realize this patch comes right in the middle of Jiri Pirko's attempts
to move this functionality to the bonding driver, but I think this may
be important enough to include now (possibly in 2.6.38 and to -stable)
if others agree.

---
 net/core/dev.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8ae6631..4a76ccd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2971,6 +2971,11 @@ static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
 int __skb_bond_should_drop(struct sk_buff *skb, struct net_device *master)
 {
 	struct net_device *dev = skb->dev;
+	struct ethhdr *eth = eth_hdr(skb);
+
+	/* Drop all frames with the bond master's source address. */
+	if (unlikely(!compare_ether_addr(eth->h_source, master->dev_addr)))
+		return 1;

 	if (master->priv_flags & IFF_MASTER_ARPMON)
 		dev->last_rx = jiffies;
-- 
1.7.4
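
The test that the patch adds can be tried in userspace with a simplified
frame header. This is a sketch only: struct eth_header and
frame_sent_by_master() are stand-ins invented for illustration, and memcmp()
takes the place of the kernel's compare_ether_addr(). Note that replies in
this thread report that dropping on this condition breaks ARP monitoring, so
the sketch shows what the check does, not a recommended fix.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN 6

/* Simplified stand-in for the kernel's struct ethhdr (illustration only). */
struct eth_header {
	uint8_t  h_dest[MAC_LEN];
	uint8_t  h_source[MAC_LEN];
	uint16_t h_proto;
};

/*
 * Hypothetical helper: true when the frame's source MAC equals the bond
 * master's MAC, i.e. the condition on which the patch returns 1 (drop).
 */
bool frame_sent_by_master(const struct eth_header *eth,
			  const uint8_t master_addr[MAC_LEN])
{
	assert(eth != NULL && master_addr != NULL);

	return memcmp(eth->h_source, master_addr, MAC_LEN) == 0;
}
```

A frame that loops back with the master's own source address matches; a frame
from any other station does not.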


* route embedding
From: David Miller @ 2011-02-25 21:13 UTC (permalink / raw)
  To: netdev

So out of curiosity I took some cycle counts on my Niagara2 for three
different cases of output route resolution.  1) full routing cache
2) routing cache removed and 3) fib_table_lookup() on RT_TABLE_MAIN.

ip_route_output_key() stock net-next-2.6:		463 cycles
ip_route_output_key() cache removed, no optimizations:	3832 cycles
fib_table_lookup() on RT_TABLE_MAIN:			489 cycles

(For reference, doing two fib_table_lookup() calls, one on RT_TABLE_LOCAL
 then one on RT_TABLE_MAIN, costs about 800 cycles)

Most of the cost with the routing cache removed is quite silly in
nature:

1) As mentioned the other week, we do two fib_table_lookup() calls,
   one on RT_TABLE_LOCAL and one on RT_TABLE_MAIN.

   I think this could be consolidated into one lookup, and I'm running
   with that assumption in mind.

2) Allocating and freeing up the one-time dst_entry, as well as
   initializing all of its state, has costs too.

To test out my theories about #1 and #2 above I wrote some quick hacks
against the tree with the routing cache removed, each change is
cumulative and depends upon the previous changes being present:

1) Remove struct flowi from struct rtable and just have it contain the
   necessary keys.  Replacing the struct flowi are:

	__be32	rt_key_dst;
	__be32	rt_key_src;
	...
	__u8	rt_tos;
	...
	int	rt_iif;
	int	rt_oif;
	__u32	rt_mark;

   Result: 3932 cycles

   Strangely this made things 100 cycles slower, but bear with me.

   Also, in a routing-cache-less world, these values are almost
   entirely superfluous.  All of these values can be reconstituted
   in the code paths that currently read them.

2) Expand dst_alloc() arguments so that more explicit initializations
   can be specified, dst_alloc()'s function signature now looks like:

	void *dst_alloc(struct dst_ops * ops, struct net_device *dev,
			int initial_ref, int initial_obsolete, int flags);

   Result: 3905 cycles

3) We currently write nearly every byte of the dst_entry objects twice
   when we create them; this is because we use kmem_cache_zalloc().

   So I changed dst_alloc() to use plain kmem_cache_alloc() and added
   explicit initialization to even struct members that start as zero,
   and then propagated the same kinds of changes into decnet, ipv4,
   and ipv6.

   Result: 3481 cycles.

   So, a pretty nice result.

4) I have a hack patch that puts all routes into RT_TABLE_MAIN and never
   uses RT_TABLE_LOCAL, then I force all fib_lookup() calls to consult
   RT_TABLE_MAIN only, thus reducing the number of fib_table_lookup()
   calls for output route resolution to one.

   Result: 3454 cycles.

   Not as much improvement as we would like, but it is something.

Anyways, this series of tests shows that object allocation, initialization,
et al. hold some non-trivial costs.
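
The double-write effect from step 3 can be illustrated in userspace. This is
a sketch under loud assumptions: struct demo_entry is a made-up miniature,
not the real dst_entry, and calloc()/malloc() stand in for
kmem_cache_zalloc()/kmem_cache_alloc(). It shows the pattern only and makes
no claim about the cycle counts above.

```c
#include <assert.h>
#include <stdlib.h>

/* Made-up miniature object; the real dst_entry has many more members. */
struct demo_entry {
	void *dev;
	int refcnt;
	int obsolete;
	int flags;
	unsigned long metrics;
};

/* Zeroing allocator: every byte is cleared, then three fields are
 * written a second time with their real values. */
struct demo_entry *entry_alloc_zeroed(void *dev, int initial_ref,
				      int initial_obsolete, int flags)
{
	struct demo_entry *e = calloc(1, sizeof(*e));

	assert(initial_ref >= 0);
	if (!e)
		return NULL;
	e->dev = dev;
	e->refcnt = initial_ref;
	e->obsolete = initial_obsolete;
	e->flags = flags;
	return e;
}

/* Plain allocator: each member is written exactly once, including the
 * ones that start out as zero. */
struct demo_entry *entry_alloc_explicit(void *dev, int initial_ref,
					int initial_obsolete, int flags)
{
	struct demo_entry *e = malloc(sizeof(*e));

	assert(initial_ref >= 0);
	if (!e)
		return NULL;
	e->dev = dev;
	e->refcnt = initial_ref;
	e->obsolete = initial_obsolete;
	e->flags = flags;
	e->metrics = 0;
	return e;
}
```

The price of the second form is that every new struct member needs an
explicit initializer in every allocation path, which is why the change had to
be propagated into decnet, ipv4, and ipv6.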

So I wrote a test where route object references are provided by the
caller (think inside of sk_buff, a socket, on the stack, etc.) and we
use a structure that gets rid of all the aspects which are irrelevant
without global dst caching.

The structure might look something like this for ipv4:

struct net_route {
        struct net_device       *dev;
        unsigned long           _metrics;
        struct neighbour        *neighbour;
        struct hh_cache         *hh;
        atomic_t                __refcnt;
};

struct ipv4_route {
        struct net_route        base;

        int                     rt_genid;

        __be32                  rt_dst;
        __be32                  rt_src;
        __u16                   rt_type;
        __u8                    rt_tos;

        int                     rt_iif;
        int                     rt_oif;
        __u32                   rt_mark;

        __be32                  rt_gateway;
        __be32                  rt_spec_dst;

        u32                     rt_peer_genid;
        struct inet_peer        *peer;
        struct fib_info         *fi;
};

Then I cycle counted a test which essentially does a fib_table_lookup() on
RT_TABLE_MAIN and then uses the result to fill in an ipv4_route object, it
looks like:

	table = fib_get_table(&init_net, RT_TABLE_MAIN);
	if (!table) {
		... error handling ...
	}
        err = fib_table_lookup(table, fl, &res, FIB_LOOKUP_NOREF);
        if (err)
                return err;

        rt->rt_genid = atomic_read(&gen_id);
	rt->rt_dst = fl->fl4_dst;
        rt->rt_src = fl->fl4_src;
        rt->rt_type = res.type;
        rt->rt_tos = fl->fl4_tos;
	rt->rt_iif = fl->iif;
        rt->rt_oif = fl->oif;
        rt->rt_mark = fl->mark;
	rt->rt_gateway = FIB_RES_GW(res);
        rt->rt_spec_dst = fl->fl4_src;
        rt->rt_peer_genid = 0;
        rt->peer = NULL;
        rt->fi = NULL;

And this costs about 500 cycles.

The idea is that we would move to a model where the route information
lives embedded inside of a containing object (the user), either the
socket or the sk_buff itself in the most common cases.

Of course, we would need to make amends to handle strings or bundles
of routes such as those created by IPSEC.  We could take care of this
by simply using the embedded storage plus a pointer.  The pointer
could point to an externally allocated object such as an IPSEC bundle,
instead of the embedded route memory.

There are a few funnies to attend to wrt. reference counting.  For the
embedded case we could likely get away with using the containing
object's reference counting to refcount the route (win).  Then the
issue is how to deal properly with cases where the route pointer
"escapes", as is currently handled by dst_clone().  We could either
take another reference to the route's containing object, or we could
simply COW the route object.

Anyways I think this would make our routing implementation not only perform
independently of traffic patterns, but also be within a few clock cycles of
the performance we get right now.
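
The "embedded storage plus a pointer" idea can be sketched as follows. All
names here (demo_route, demo_sock, and the helpers) are hypothetical and only
illustrate the layout; reference counting and the COW question raised above
are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily trimmed route object. */
struct demo_route {
	unsigned int dst;
	unsigned int gateway;
	int oif;
};

/* Hypothetical container, standing in for a socket or sk_buff. */
struct demo_sock {
	struct demo_route rt_storage;	/* embedded: no separate allocation */
	struct demo_route *rt;		/* rt_storage or an external object */
};

/* Common case: fill the embedded storage and point at it. */
void demo_sock_set_embedded(struct demo_sock *sk, unsigned int dst,
			    unsigned int gateway, int oif)
{
	assert(sk != NULL);
	sk->rt_storage.dst = dst;
	sk->rt_storage.gateway = gateway;
	sk->rt_storage.oif = oif;
	sk->rt = &sk->rt_storage;
}

/* Special case: point at an externally allocated object, e.g. a bundle. */
void demo_sock_set_external(struct demo_sock *sk, struct demo_route *ext)
{
	assert(sk != NULL && ext != NULL);
	sk->rt = ext;
}

/* Lets cleanup code decide whether the route needs a separate release. */
int demo_sock_route_is_embedded(const struct demo_sock *sk)
{
	assert(sk != NULL);
	return sk->rt == &sk->rt_storage;
}
```

Readers of the route always go through the pointer, so the common embedded
case and the external case look the same to them.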

^ permalink raw reply

* Re: [PATCH iproute2] sfq: add divisor support
From: Stephen Hemminger @ 2011-02-25 21:00 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <1295777346.17333.26.camel@edumazet-laptop>

On Sun, 23 Jan 2011 11:09:06 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> In 2.6.39, we can build SFQ queues with a given hash table size,
> different from the default one (1024 slots)
> 
> # tc qdisc add dev eth0 sfq help
> Usage: ... sfq [ limit NUMBER ] [ perturb SECS ] [ quantum BYTES ]
>                [ divisor NUMBER ]
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied to iproute2 net-next

^ permalink raw reply

* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: Michał Mirosław @ 2011-02-25 20:59 UTC (permalink / raw)
  To: Ben Hutchings
  Cc: David Miller, segoon, netdev, linux-kernel, kuznet, pekkas,
	jmorris, yoshfuji, kaber, eric.dumazet, therbert, xiaosuo, jesse,
	kees.cook, eugene, dan.j.rosenberg, akpm
In-Reply-To: <1298666310.2554.47.camel@bwh-desktop>

2011/2/25 Ben Hutchings <bhutchings@solarflare.com>:
> I bet something like this (plus Vasiliy's changes to static module
> aliases) would cover 99.9% of legitimate uses of this feature:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 54aaca6..0d09baa 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1120,8 +1120,20 @@ void dev_load(struct net *net, const char *name)
>        dev = dev_get_by_name_rcu(net, name);
>        rcu_read_unlock();
>
> -       if (!dev && capable(CAP_NET_ADMIN))
> -               request_module("%s", name);
> +       if (!dev && capable(CAP_NET_ADMIN)) {
> +               /* Check whether the name looks like one that a net
> +                * driver will generate initially.  If not, require a
> +                * module alias with a suitable prefix, so that this
> +                * can't be used to load arbitrary modules.
> +                */
> +               if ((strncmp(name, "eth", 3) == 0 &&
> +                    isdigit((unsigned char)name[3])) ||
> +                   (strncmp(name, "wlan", 4) == 0 &&
> +                    isdigit((unsigned char)name[4])))
> +                       request_module("%s", name);
> +               else
> +                       request_module("netdev-%s", name);
> +       }
>  }
>  EXPORT_SYMBOL(dev_load);
>

This might be better as:

if (request_module("netdev-%s", name))
    ... fallback

Then after some years the fallback could be removed if announced properly.
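
Spelled out a bit more, and only as a userspace sketch: the loader is
passed in as a callback so the logic can be exercised outside the
kernel.  load_fn, load_netdev_module() and demo_load() are hypothetical
names made up for this example; the only thing assumed about
request_module() is that it returns 0 on success and nonzero on
failure.

```c
#include <stdio.h>
#include <string.h>

/* Loader callback: 0 on success, nonzero on failure. */
typedef int (*load_fn)(const char *modname);

/*
 * Try the prefixed alias first and fall back to the bare name only if
 * that fails.  Returns 0 if either attempt succeeded.
 */
static int load_netdev_module(load_fn load, const char *name)
{
	char alias[64];
	int n;

	n = snprintf(alias, sizeof(alias), "netdev-%s", name);
	if (n < 0 || (size_t)n >= sizeof(alias))
		return -1;	/* name too long for this sketch */

	if (load(alias) == 0)
		return 0;

	/* Transitional fallback, to be dropped once aliases are in place. */
	return load(name);
}

/* Hypothetical stub loader: knows one alias and one legacy bare name. */
static int demo_load(const char *modname)
{
	if (strcmp(modname, "netdev-tun0") == 0 ||
	    strcmp(modname, "legacy0") == 0)
		return 0;
	return -1;
}
```

With the stub above, load_netdev_module(demo_load, "tun0") succeeds via
the alias, "legacy0" succeeds only via the fallback, and an unknown
name fails both attempts.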

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [iproute2 PATCH 3/3]: xfrm security context support
From: Stephen Hemminger @ 2011-02-25 20:46 UTC (permalink / raw)
  To: Joy Latten; +Cc: netdev
In-Reply-To: <201102022332.p12NWxXP029398@faith.austin.ibm.com>

On Wed, 02 Feb 2011 17:32:59 -0600
Joy Latten <jml@austin.ibm.com> wrote:

> Adds security context support to ip xfrm state.
> 
> Signed-off-by: Joy Latten <latten@austin.ibm.com>

All three applied to iproute2 net-next

^ permalink raw reply

* Re: [PATCH v4 1/2] iproute2: support listing devices by group
From: Stephen Hemminger @ 2011-02-25 20:43 UTC (permalink / raw)
  To: Vlad Dogaru; +Cc: netdev, Patrick McHardy
In-Reply-To: <1296671021-24421-2-git-send-email-ddvlad@rosedu.org>

On Wed,  2 Feb 2011 20:23:40 +0200
Vlad Dogaru <ddvlad@rosedu.org> wrote:

> User can specify device group to list by using the group keyword:
> 
> 	ip link show group test
> 
> If no group is specified, 0 (default) is implied.
> 
> Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>

I applied this to net-next for iproute2,
but INIT_NETDEV_GROUP is in a part of netdevice.h that is not exported
(i.e. inside #ifdef __KERNEL__).


^ permalink raw reply

* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: Ben Hutchings @ 2011-02-25 20:38 UTC (permalink / raw)
  To: David Miller
  Cc: segoon, netdev, linux-kernel, kuznet, pekkas, jmorris, yoshfuji,
	kaber, eric.dumazet, therbert, xiaosuo, jesse, kees.cook, eugene,
	dan.j.rosenberg, akpm
In-Reply-To: <1298663585.2554.39.camel@bwh-desktop>

On Fri, 2011-02-25 at 19:53 +0000, Ben Hutchings wrote:
> On Fri, 2011-02-25 at 11:43 -0800, David Miller wrote:
> > From: Ben Hutchings <bhutchings@solarflare.com>
> > Date: Fri, 25 Feb 2011 19:30:16 +0000
> > 
> > > On Fri, 2011-02-25 at 11:16 -0800, David Miller wrote:
> > >> From: Ben Hutchings <bhutchings@solarflare.com>
> > >> Date: Fri, 25 Feb 2011 19:07:59 +0000
> > >> 
> > >> > You realise that module loading doesn't actually run in the context of
> > >> > request_module(), right?
> > >> 
> > >> Why is that a barrier?  We could simply pass a capability mask into
> > >> request_module if necessary.
> > >> 
> > >> It's an implementation detail, and not a deterrent to my suggested
> > >> scheme.
> > > 
> > > It's not an implementation detail.  modprobe currently runs with full
> > > capabilities; your proposal requires its capabilities to be limited to
> > > those of the process that triggered the
> > > request_module() (plus, presumably, CAP_SYS_MODULE).
> > 
> > The idea was that the kernel will be the entity that will inspect the
> > elf sections and validate the capability bits, not the userspace
> > module loader.
> 
> Yes, I understand that.
> 
> > Surely if we can pass an arbitrary string out to the loading
> > process as part of the module loading context, we can pass along
> > capability bits as well.
> 
> If you want insert_module() to be able to deny loading some modules
> based on the capabilities of the process calling request_module() then
> you either have to *reduce* the capabilities given to modprobe or create
> some extra process state, separate from the usual capability state,
> specifically for this purpose.

I bet something like this (plus Vasiliy's changes to static module
aliases) would cover 99.9% of legitimate uses of this feature:

diff --git a/net/core/dev.c b/net/core/dev.c
index 54aaca6..0d09baa 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1120,8 +1120,20 @@ void dev_load(struct net *net, const char *name)
 	dev = dev_get_by_name_rcu(net, name);
 	rcu_read_unlock();
 
-	if (!dev && capable(CAP_NET_ADMIN))
-		request_module("%s", name);
+	if (!dev && capable(CAP_NET_ADMIN)) {
+		/* Check whether the name looks like one that a net
+		 * driver will generate initially.  If not, require a
+		 * module alias with a suitable prefix, so that this
+		 * can't be used to load arbitrary modules.
+		 */
+		if ((strncmp(name, "eth", 3) == 0 &&
+		     isdigit((unsigned char)name[3])) ||
+		    (strncmp(name, "wlan", 4) == 0 &&
+		     isdigit((unsigned char)name[4])))
+			request_module("%s", name);
+		else
+			request_module("netdev-%s", name);
+	}
 }
 EXPORT_SYMBOL(dev_load);
 
---

Note that we don't have to care about interfaces that get renamed from
eth%d or wlan%d, since renaming is triggered asynchronously and
therefore can't be used in conjunction with the auto-loading feature.
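
For anyone who wants to poke at the name check outside the kernel, here
is the same test lifted into a small userspace sketch.  The helper
names looks_like_default_name() and module_request_string() are made up
for this example; the comparison itself is copied from the patch above.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Nonzero if name starts with "eth" or "wlan" immediately followed by
 * a digit, i.e. the same test the patch applies.
 */
static int looks_like_default_name(const char *name)
{
	return (strncmp(name, "eth", 3) == 0 &&
		isdigit((unsigned char)name[3])) ||
	       (strncmp(name, "wlan", 4) == 0 &&
		isdigit((unsigned char)name[4]));
}

/*
 * Write the string that would be handed to request_module() into buf.
 * Returns what snprintf() returns.
 */
static int module_request_string(const char *name, char *buf, size_t len)
{
	if (looks_like_default_name(name))
		return snprintf(buf, len, "%s", name);
	return snprintf(buf, len, "netdev-%s", name);
}
```

Note that, like the patch, this only looks at the first character after
the prefix, so a name such as "eth0foo" is still requested without the
"netdev-" prefix.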

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply related

* Re: [PATCH] don't allow CAP_NET_ADMIN to load non-netdev kernel modules
From: David Miller @ 2011-02-25 20:37 UTC (permalink / raw)
  To: bhutchings
  Cc: segoon, netdev, linux-kernel, kuznet, pekkas, jmorris, yoshfuji,
	kaber, eric.dumazet, therbert, xiaosuo, jesse, kees.cook, eugene,
	dan.j.rosenberg, akpm
In-Reply-To: <1298663585.2554.39.camel@bwh-desktop>

From: Ben Hutchings <bhutchings@solarflare.com>
Date: Fri, 25 Feb 2011 19:53:05 +0000

> On Fri, 2011-02-25 at 11:43 -0800, David Miller wrote:
>> Surely if we can pass an arbitrary string out to the loading
>> process as part of the module loading context, we can pass along
>> capability bits as well.
> 
> If you want insert_module() to be able to deny loading some modules
> based on the capabilities of the process calling request_module() then
> you either have to *reduce* the capabilities given to modprobe or create
> some extra process state, separate from the usual capability state,
> specifically for this purpose.

How is this any different from the patch posted which ties
capabilities to the prefix of the name of the module to be loaded?

There is simply no difference, except that in my proposal existing
things do not break since the module name will not change.

I don't see where the complexity is.  If the only place we can pass the
capability bits is in the execv args, then in the worst case we could
take a peek at those in the module load system call.

^ permalink raw reply

* Re: pull request: wireless-next-2.6 2011-02-22
From: John W. Linville @ 2011-02-25 19:48 UTC (permalink / raw)
  To: David Miller; +Cc: linux-wireless, linux-bluetooth, netdev, padovan
In-Reply-To: <20110225.111500.59674472.davem@davemloft.net>

On Fri, Feb 25, 2011 at 11:15:00AM -0800, David Miller wrote:
> From: David Miller <davem@davemloft.net>
> Date: Thu, 24 Feb 2011 22:43:44 -0800 (PST)

> > Pulled, thanks a lot John.
> 
> John a few things:
> 
> 1) I had to add some vmalloc.h includes to fix the build on sparc64,
>    see commit b08cd667c4b6641c4d16a3f87f4550f81a6d69ac in net-next-2.6

I have a patch in my tree for that -- seems they hit it on ARM as well.

> 2) Something is screwy with the bluetooth config options now.
> 
>    I have an allmodconfig tree, and when I run "make oldconfig" after
>    this pull, BT_L2CAP and BT_SCO both prompt me, claiming that they
>    can only be built statically.
> 
>    I give it 'y' just to make it happen, for both, and afterwards no
>    matter how many times I rerun "make oldconfig" I keep seeing things
>    like this in my build:
> 
> scripts/kconfig/conf --silentoldconfig Kconfig
> include/config/auto.conf:986:warning: symbol value 'm' invalid for BT_SCO
> include/config/auto.conf:3156:warning: symbol value 'm' invalid for BT_L2CAP
> 
>    First, what the heck is going on here?  Second, why the heck can't these
>    non-trivial pieces of code be built modular any more?
> 
>    You can't make something "bool", have it depend on something that
>    might be modular, and then build it into what could in fact be a
>    module.  That's exactly what the bluetooth stuff seems to be doing
>    now.
> 
>    I suspect commit 642745184f82688eb3ef0cdfaa4ba632055be9af
> 
> Thanks.

Sorry, I overlooked that.  Hopefully Gustavo will figure it out quickly.

Thanks,

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply
