Netdev List


* Re: 2.6.37 regression: adding main interface to a bridge breaks vlan interface RX
From: chriss @ 2011-02-26 11:51 UTC (permalink / raw)
  To: netdev
In-Reply-To: <AANLkTikSvs7jF9BZzbsYkLAawpCH2h1Z0r09ft219uaa@mail.gmail.com>

Jesse Gross <jesse <at> nicira.com> writes:

> 
> Can you confirm this by running tcpdump -eni br0?  I would expect that
> you see the correct packets but without vlan tags.
> 

That's correct. I see the packets in br0 without tags and tagged in eth1. That's
why I added the brouting rule in ebtables to drop it at eth1, and then it appears
in eth1.3 (untagged)...

^ permalink raw reply

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Nicolas de Pesloüan @ 2011-02-26 14:24 UTC (permalink / raw)
  To: Jay Vosburgh
  Cc: Jiri Pirko, David Miller, kaber, eric.dumazet, netdev, shemminger,
	andy, Fischer, Anna
In-Reply-To: <4D62F324.6020301@gmail.com>

Le 22/02/2011 00:20, Nicolas de Pesloüan a écrit :

> After checking every protocol handlers installed by dev_add_pack(), it
> appears that only 4 of them really use the orig_dev parameter given by
> __netif_receive_skb():
>
> - bond_3ad_lacpdu_recv() @ drivers/net/bonding/bond_3ad.c
> - bond_arp_recv() @ drivers/net/bonding/bond_main.c
> - packet_rcv() @ net/packet/af_packet.c
> - tpacket_rcv() @ net/packet/af_packet.c
>
>  From the bonding point of view, the meaning of orig_dev is obviously
> "the device one layer below the bonding device, through which the packet
> reached the bonding device". It is used by bond_3ad_lacpdu_recv() and
> bond_arp_recv(), to find the underlying slave device through which the
> LACPDU or ARP was received. (The protocol handler is registered at the
> bonding device level).
>
>  From the af_packet point of view, the meaning is documented (in commit
> "[AF_PACKET]: Add option to return orig_dev to userspace") as the
> "physical device [that] actually received the traffic, instead of having
> the encapsulating device hide that information."
>
> When the bonding device is just one level above the physical device, the
> two meanings happen to match the same device, by chance.
>
> So, currently, a bonding device cannot stack properly on top of anything
> but physical devices. It might not be a problem today, but may change in
> the future...

Hi Jay,

Still thinking about this orig_dev stuff, I wonder why the protocol handlers used in bonding
(bond_3ad_lacpdu_recv() and bond_arp_rcv()) are registered at the master level instead of at the
slave level?

If they were registered at the slave level, they would simply receive skb->dev as the ingress 
interface and use this value instead of needing the orig_dev value given to them when they are 
registered at the master level.

As orig_dev is only used by bonding and by af_packet, but they disagree on the exact meaning of
orig_dev, one way to fix this discrepancy would be to remove one of the usages. As the af_packet
usage is exposed to user space, bonding seems the right place to stop using orig_dev, even if
orig_dev was introduced for bonding :-)

I understand that this would add one entry per slave device to the ptype_base list, but this seems 
to be the only bad effect of registering at the slave level. Can you confirm that this was the 
reason to register at the master level instead?

If you think registering at the slave level would cause too much impact on ptype_base, then we might 
have another way to stop using orig_dev for bonding:

In __skb_bond_should_drop(), we already test for the two interesting protocols:

if ((dev->priv_flags & IFF_SLAVE_NEEDARP) && skb->protocol == __cpu_to_be16(ETH_P_ARP))
	return 0;

if (master->priv_flags & IFF_MASTER_8023AD && skb->protocol == __cpu_to_be16(ETH_P_SLOW))
	return 0;

Would it be possible to call the right handlers directly from inside __skb_bond_should_drop() then 
let __skb_bond_should_drop() return 1 ("should drop") after processing the frames that are only of 
interest for bonding?

	Nicolas.
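
The idea in the last paragraph can be sketched in userspace. This is a
minimal model under stated assumptions, not kernel code: the MODEL_* flags
stand in for IFF_SLAVE_NEEDARP and IFF_MASTER_8023AD, the counters stand in
for the ARP and LACPDU handlers, and model_bond_should_drop() is a
hypothetical name. The protocol is kept in host byte order to keep the
sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* EtherType values as in <linux/if_ether.h>, repeated so the sketch builds anywhere. */
#define MODEL_ETH_P_ARP   0x0806
#define MODEL_ETH_P_SLOW  0x8809

/* Hypothetical stand-ins for IFF_SLAVE_NEEDARP and IFF_MASTER_8023AD. */
#define MODEL_SLAVE_NEEDARP  0x1u
#define MODEL_MASTER_8023AD  0x2u

/* Counters standing in for the bonding-internal handlers. */
struct model_bond_stats {
	int arp_seen;
	int lacpdu_seen;
};

/*
 * Returns 1 ("should drop") once a frame of interest to bonding has been
 * handed to the matching handler, 0 when this sketch makes no decision.
 */
static int model_bond_should_drop(unsigned int slave_flags,
				  unsigned int master_flags,
				  uint16_t protocol,
				  struct model_bond_stats *stats)
{
	assert(stats != NULL);

	if ((slave_flags & MODEL_SLAVE_NEEDARP) && protocol == MODEL_ETH_P_ARP) {
		stats->arp_seen++;	/* stands in for the ARP handler */
		return 1;
	}
	if ((master_flags & MODEL_MASTER_8023AD) && protocol == MODEL_ETH_P_SLOW) {
		stats->lacpdu_seen++;	/* stands in for the LACPDU handler */
		return 1;
	}
	return 0;
}
```

In this model the handler runs while skb->dev is still the slave, so no
orig_dev value is needed to find the ingress device.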

^ permalink raw reply

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-26 14:58 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: David Miller, kaber, eric.dumazet, netdev, shemminger, fubar,
	andy
In-Reply-To: <4D68E31E.7060807@gmail.com>

Sat, Feb 26, 2011 at 12:25:18PM CET, nicolas.2p.debian@gmail.com wrote:
>Le 26/02/2011 08:14, Jiri Pirko a écrit :
>>Sat, Feb 26, 2011 at 12:46:53AM CET, nicolas.2p.debian@gmail.com wrote:
>>>Le 23/02/2011 20:05, Jiri Pirko a écrit :
>>>>This patch converts bonding to use rx_handler. Results in cleaner
>>>>__netif_receive_skb() with much less exceptions needed. Also
>>>>bond-specific work is moved into bond code.
>>>>
>>>>Did performance test using pktgen and counting incoming packets by
>>>>iptables. No regression noted.
>>>>
>>>>Signed-off-by: Jiri Pirko<jpirko@redhat.com>
>>>>
>>>>v1->v2:
>>>>         using skb_iif instead of new input_dev to remember original
>>>>	device
>>>>
>>>>v2->v3:
>>>>	do another loop in case skb->dev is changed. That way orig_dev
>>>>	core can be left untouched.
>>>
>>>Hi Jiri,
>>>
>>>Finally taking enough time for a review.
>>>
>>>I think we should split this change :
>>>
>>>1/ Change __netif_receive_skb() to call rx_handler for diverted net_device, until rx_handler is NULL.
>>>
>>>2/ Convert currently existing rx_handlers (bridge and macvlan) to use
>>>this new "loop" feature, removing the need to call netif_rx() inside
>>>their respective rx_handler and also removing the associated
>>>overhead.
>>
>>This might not be possible. Macvlan uses result of called netif_rx for
>>counting, bridge calls netdev_receive_skb via NF_HOOK. Nevertheless,
>>this can be eventually handled later, not as a part of this patch.
>
>Yes, I agree. Step 2 and step 3 can be swapped.
>
>Anyway, we need to describe the options given to a rx_handler:
>
>- Return skb unchanged. This would cause normal delivery (ptype->dev == NULL or ptype->dev == skb->dev).
>- Return skb->dev changed. __netif_receive_skb() will loop to the new
>device. This would cause exact match delivery only (ptype->dev !=
>NULL and ptype->dev == one of the orig_dev).
>- Manage the skb another way and return NULL. This would stop any
>protocol handlers from receiving the skb, unless the rx_handler
>arranges to re-inject the skb somewhere.
>
>>>3/ Convert bonding to use rx_handlers.
>>>
>>>Also, on step 1, we definitely need to clarify what orig_dev should be.
>>>
>>>I now think that orig_dev should be "the device one level below the
>>>current one" or NULL if current device was not diverted from another
>>>one. It means that we should keep an array of crossed (diverted)
>>>devices and the associated orig_dev. This array would be used to pass
>>>the right orig_dev to protocol handlers, depending on the device they
>>>register on :
>>
>>I constructed the patch so that origdev is the same in all situations
>>as before the patch. I think that this decision can be omitted at the
>>moment.
>
>Agreed, even if the current handling of orig_dev is far from
>bulletproof and needs to be clarified at some point.
>
>>>eth0 ->  bond0 ->  br0
>>>
>>>A protocol handler registered on bond0 would receive eth0 as orig_dev.
>>>A protocol handler registered on br0 would receive bond0 as orig_dev.
>>>
>>>[snip]
>>>
>>>>@@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>
>>>[snip]
>>>
>>>>+another_round:
>>>>+
>>>>+	__this_cpu_inc(softnet_data.processed);
>>>>+
>>>>  #ifdef CONFIG_NET_CLS_ACT
>>>>  	if (skb->tc_verd & TC_NCLS) {
>>>>  		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
>>>>@@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>  #endif
>>>>
>>>>  	list_for_each_entry_rcu(ptype, &ptype_all, list) {
>>>>-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
>>>>-		    ptype->dev == orig_dev) {
>>>>+		if (!ptype->dev || ptype->dev == skb->dev) {
>>>>  			if (pt_prev)
>>>>  				ret = deliver_skb(skb, pt_prev, orig_dev);
>>>>  			pt_prev = ptype;
>>>>@@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
>>>>  ncls:
>>>>  #endif
>>>>
>>>
>>>Why do you loop to ptype_all before calling rx_handler ?
>>>
>>>I don't understand why ptype_all and ptype_base are not handled at
>>>the same place in current __netif_receive_skb() but I think we should
>>>take the opportunity to change that, unless someone knows of a good
>>>reason not to do so.
>>
>>Again, the patch tries to make as few changes as it can. So this stays
>>the same as before. In case you want to change it, feel free to submit
>>a patch doing that as a follow-on.
>
>The point here is that bridge and macvlan handling used to be after
>the ptype_all loop (hence the place you inserted the call to
>rx_handler last summer), but the bonding part is currently before the
>ptype_all loop.
>
>Moving bonding handling after the ptype_all loop will cause the ptype_all loop to be run twice:
>- first time, with skb->dev == eth0 and orig_dev == eth0.
>- second time, with skb->dev == bond0 and orig_dev == eth0.
>
>The first time currently does not exist. And because bonding wasn't
>given a chance yet to decide that the frame should be dropped, the
>packet will always be delivered to eth0, causing duplicate
>deliveries. Note that this is probably true for bridge and macvlan
>too, and that those duplicate deliveries probably already exist.

Yes, and in fact that is what I like about this patch: deliveries then
become similar to bridge.

>
>Also, delivering skb inside a loop that may change the skb (skb->dev
>at least) is guaranteed to produce strange behaviors.
>
>Can someone, knowing the history of
>ptype_all/ptype_base/bridge/macvlan/bonding/vlan handling in
>__netif_receive_skb(), comment on this?
>
>Are there any reasons not to process ptype_all and ptype_base at the
>same location, at the end of __netif_receive_skb(), and to manage all
>divert features (bridge/macvlan/bonding/vlan) before?

That is a very good set of questions. I would like to hear the answers too.

>
>	Nicolas.
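
The double pass through ptype_all discussed above can be modelled in
userspace. This is a minimal sketch under stated assumptions, not kernel
code: model_dev, model_skb, tap_log, model_rx_handler() and model_receive()
are hypothetical names, and the only behaviour modelled is a wildcard tap
(ptype->dev == NULL) seeing the skb once per round while the rx_handler
moves skb->dev to the master.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-ins, not kernel structures. */
struct model_dev {
	const char *name;
	struct model_dev *master;	/* device this one diverts to, or NULL */
};

struct model_skb {
	struct model_dev *dev;
};

#define MODEL_MAX_TAPS 8

/* What a wildcard tap saw: one skb->dev name per round. */
struct tap_log {
	const char *seen[MODEL_MAX_TAPS];
	int count;
};

/* Hypothetical rx_handler: divert to the master. Returns 1 if skb->dev changed. */
static int model_rx_handler(struct model_skb *skb)
{
	if (!skb->dev->master)
		return 0;
	skb->dev = skb->dev->master;
	return 1;
}

/* Models "another_round": tap pass first, then rx_handler, repeated while skb->dev changes. */
static void model_receive(struct model_skb *skb, struct tap_log *log)
{
	assert(skb != NULL && skb->dev != NULL && log != NULL);

	do {
		if (log->count < MODEL_MAX_TAPS)
			log->seen[log->count++] = skb->dev->name;
	} while (model_rx_handler(skb));
}
```

With eth0 enslaved to bond0, the log records eth0 and then bond0, so in this
model the same frame reaches a wildcard tap twice.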

^ permalink raw reply

* dccp: null-pointer dereference on close
From: Johan Hovold @ 2011-02-26 17:45 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: David S. Miller, dccp, netdev

Hi,

I triggered the null-pointer dereference below when closing a dccp
socket on 2.6.37 the other day. The receive path is hit during
close, and the socket has already been unhashed in dccp_set_state from
dccp_close.

Thanks,
Johan


root@overo:~# [84140.128631] ------------[ cut here ]------------
[84140.133575] WARNING: at net/ipv4/inet_timewait_sock.c:141 __inet_twsk_hashdance+0x48/0x128()
[84140.142517] Modules linked in: arc4 ecb carl9170 rt2870sta(C) mac80211 r8712u(C) crc_ccitt ah
[84140.151794] [<c0038850>] (unwind_backtrace+0x0/0xec) from [<c0055364>] (warn_slowpath_common)
[84140.161743] [<c0055364>] (warn_slowpath_common+0x4c/0x64) from [<c0055398>] (warn_slowpath_n)
[84140.171966] [<c0055398>] (warn_slowpath_null+0x1c/0x24) from [<c02b72d0>] (__inet_twsk_hashd)
[84140.182373] [<c02b72d0>] (__inet_twsk_hashdance+0x48/0x128) from [<c031caa0>] (dccp_time_wai)
[84140.192413] [<c031caa0>] (dccp_time_wait+0x40/0xc8) from [<c031c15c>] (dccp_rcv_state_proces)
[84140.202636] [<c031c15c>] (dccp_rcv_state_process+0x120/0x538) from [<c032609c>] (dccp_v4_do_)
[84140.213043] [<c032609c>] (dccp_v4_do_rcv+0x11c/0x14c) from [<c0286594>] (release_sock+0xac/0)
[84140.222442] [<c0286594>] (release_sock+0xac/0x110) from [<c031fd34>] (dccp_close+0x28c/0x380)
[84140.231475] [<c031fd34>] (dccp_close+0x28c/0x380) from [<c02d9a78>] (inet_release+0x64/0x70)
[84140.240386] [<c02d9a78>] (inet_release+0x64/0x70) from [<c0284ddc>] (sock_release+0x24/0xb8)
[84140.249328] [<c0284ddc>] (sock_release+0x24/0xb8) from [<c0284e94>] (sock_close+0x24/0x34)
[84140.258087] [<c0284e94>] (sock_close+0x24/0x34) from [<c00c2e4c>] (fput+0x108/0x1f4)
[84140.266296] [<c00c2e4c>] (fput+0x108/0x1f4) from [<c00c0104>] (filp_close+0x70/0x7c)
[84140.274505] [<c00c0104>] (filp_close+0x70/0x7c) from [<c00c01c4>] (sys_close+0xb4/0x10c)
[84140.283081] [<c00c01c4>] (sys_close+0xb4/0x10c) from [<c0033a80>] (ret_fast_syscall+0x0/0x30)
[84140.292114] ---[ end trace b8877ec9d542c32e ]---
[84140.296997] Unable to handle kernel NULL pointer dereference at virtual address 00000010
[84140.305541] pgd = cedb0000
[84140.308410] [00000010] *pgd=8ed22031, *pte=00000000, *ppte=00000000
[84140.315032] Internal error: Oops: 17 [#1] PREEMPT
[84140.320007] last sysfs file: /sys/kernel/uevent_seqnum
[84140.325408] Modules linked in: arc4 ecb carl9170 rt2870sta(C) mac80211 r8712u(C) crc_ccitt ah
[84140.334533] CPU: 0    Tainted: G        WC   (2.6.37+ #47)
[84140.340332] PC is at __inet_twsk_hashdance+0x4c/0x128
[84140.345642] LR is at warn_slowpath_null+0x1c/0x24
[84140.350616] pc : [<c02b72d4>]    lr : [<c0055398>]    psr: 60000013
[84140.350616] sp : ce975e68  ip : ce975db8  fp : cfbc5c00
[84140.362701] r10: cfa3e400  r9 : cfbc5c18  r8 : 00000000
[84140.368225] r7 : 00000006  r6 : cfa96110  r5 : cfa3e400  r4 : cfb54000
[84140.375091] r3 : 00000002  r2 : 00000006  r1 : 00000000  r0 : 00000000
[84140.381988] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment user
[84140.389495] Control: 10c5387d  Table: 8edb0019  DAC: 00000015
[84140.395538] Process be2p_ctrl (pid: 2207, stack limit = 0xce9742f0)
[84140.402160] Stack: (0xce975e68 to 0xce976000)
[84140.406738] 5e60:                   cfb54000 00000180 cfa3e400 c031caa0 00000007 cfbc5c00
[84140.415374] 5e80: cfbc9824 00000020 00000007 c031c15c 00000000 00000022 00000000 00000008
[84140.424011] 5ea0: 00000001 cfbc5c00 cfbc5c00 cfa3e400 cfbc9824 00000000 00000001 c04c11b8
[84140.432617] 5ec0: be8ffc1c c032609c fa200000 c0033608 cfa3e400 cfa3e7b0 be8ffc1c ce975ee8
[84140.441253] 5ee0: be8ffc1c cfbc5c00 cfa3e400 ce974000 00000000 c0286594 cfa3e474 cfa3e400
[84140.449859] 5f00: cfa3e408 00000007 cf487c20 cf805840 cf60ca00 c031fd34 00000000 00000000
[84140.458496] 5f20: cfb20288 cfa3e400 cf487c00 00000008 00000000 c02d9a78 00000003 00000000
[84140.467102] 5f40: cf487c00 c0284ddc 00000000 cfb20288 cfb20280 c0284e94 00000000 c00c2e4c
[84140.475738] 5f60: 00000000 00000000 cfb20280 00000000 cfbc50c0 00000006 c0033c04 ce974000
[84140.484375] 5f80: 00000000 c00c0104 00000004 cfbc50c0 cfb20280 c00c01c4 400a1000 00000000
[84140.492980] 5fa0: 0000891c c0033a80 400a1000 00000000 00000004 00000000 403d3014 00000000
[84140.501617] 5fc0: 400a1000 00000000 0000891c 00000006 00000000 00000000 400a9000 be8ffc1c
[84140.510223] 5fe0: 00000000 be8ffbe0 00009584 4036320c 60000010 00000004 00005153 bf0fa7d0
[84140.518859] [<c02b72d4>] (__inet_twsk_hashdance+0x4c/0x128) from [<c031caa0>] (dccp_time_wai)
[84140.528869] [<c031caa0>] (dccp_time_wait+0x40/0xc8) from [<c031c15c>] (dccp_rcv_state_proces)
[84140.539062] [<c031c15c>] (dccp_rcv_state_process+0x120/0x538) from [<c032609c>] (dccp_v4_do_)
[84140.549407] [<c032609c>] (dccp_v4_do_rcv+0x11c/0x14c) from [<c0286594>] (release_sock+0xac/0)
[84140.558776] [<c0286594>] (release_sock+0xac/0x110) from [<c031fd34>] (dccp_close+0x28c/0x380)
[84140.567779] [<c031fd34>] (dccp_close+0x28c/0x380) from [<c02d9a78>] (inet_release+0x64/0x70)
[84140.576660] [<c02d9a78>] (inet_release+0x64/0x70) from [<c0284ddc>] (sock_release+0x24/0xb8)
[84140.585571] [<c0284ddc>] (sock_release+0x24/0xb8) from [<c0284e94>] (sock_close+0x24/0x34)
[84140.594299] [<c0284e94>] (sock_close+0x24/0x34) from [<c00c2e4c>] (fput+0x108/0x1f4)
[84140.602447] [<c00c2e4c>] (fput+0x108/0x1f4) from [<c00c0104>] (filp_close+0x70/0x7c)
[84140.610626] [<c00c0104>] (filp_close+0x70/0x7c) from [<c00c01c4>] (sys_close+0xb4/0x10c)
[84140.619171] [<c00c01c4>] (sys_close+0xb4/0x10c) from [<c0033a80>] (ret_fast_syscall+0x0/0x30)
[84140.628143] Code: e59f00dc e3a0108d ebf6782a e5941044 (e5912010) 
[84140.634643] ---[ end trace b8877ec9d542c32f ]---
[84140.639526] Kernel panic - not syncing: Fatal exception in interrupt


^ permalink raw reply

* Re: [PATCH] Bluetooth: Fix BT_L2CAP and BT_SCO in Kconfig
From: Vitaly Wool @ 2011-02-26 17:52 UTC (permalink / raw)
  To: Gustavo F. Padovan; +Cc: davem, linville, linux-bluetooth, netdev
In-Reply-To: <1298684485-3081-1-git-send-email-padovan@profusion.mobi>

Hi Gustavo,

On Sat, Feb 26, 2011 at 2:41 AM, Gustavo F. Padovan
<padovan@profusion.mobi> wrote:
> If we want something "bool" built-in in something "tristate" it can't
> "depend on" the tristate config option.
>
> Report by DaveM:
>
>   I give it 'y' just to make it happen, for both, and afterwards no
>   matter how many times I rerun "make oldconfig" I keep seeing things
>   like this in my build:
>
> scripts/kconfig/conf --silentoldconfig Kconfig
> include/config/auto.conf:986:warning: symbol value 'm' invalid for BT_SCO
> include/config/auto.conf:3156:warning: symbol value 'm' invalid for BT_L2CAP
>
> Reported-by: David S. Miller <davem@davemloft.net>
> Signed-off-by: Gustavo F. Padovan <padovan@profusion.mobi>
> ---
>  net/bluetooth/Kconfig |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/bluetooth/Kconfig b/net/bluetooth/Kconfig
> index c6f9c2f..6ae5ec5 100644
> --- a/net/bluetooth/Kconfig
> +++ b/net/bluetooth/Kconfig
> @@ -31,9 +31,10 @@ menuconfig BT
>          to Bluetooth kernel modules are provided in the BlueZ packages.  For
>          more information, see <http://www.bluez.org/>.
>
> +if BT != n
> +
>  config BT_L2CAP
>        bool "L2CAP protocol support"
> -       depends on BT
>        select CRC16
>        help
>          L2CAP (Logical Link Control and Adaptation Protocol) provides
> @@ -42,11 +43,12 @@ config BT_L2CAP
>
>  config BT_SCO
>        bool "SCO links support"
> -       depends on BT
>        help
>          SCO link provides voice transport over Bluetooth.  SCO support is
>          required for voice applications like Headset and Audio.
>
> +endif
> +

Ugh, isn't it far cleaner to change initial dependencies to "depends
on BT != n" ?

Thanks,
   Vitaly

^ permalink raw reply

* Re: [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jay Vosburgh @ 2011-02-26 19:42 UTC (permalink / raw)
  To: Nicolas de Pesloüan
  Cc: Jiri Pirko, David Miller, kaber, eric.dumazet, netdev, shemminger,
	andy, Fischer, Anna
In-Reply-To: <4D690D16.8020503@gmail.com>

Nicolas de Pesloüan 	<nicolas.2p.debian@gmail.com> wrote:

>Le 22/02/2011 00:20, Nicolas de Pesloüan a écrit :
>
>> After checking every protocol handlers installed by dev_add_pack(), it
>> appears that only 4 of them really use the orig_dev parameter given by
>> __netif_receive_skb():
>>
>> - bond_3ad_lacpdu_recv() @ drivers/net/bonding/bond_3ad.c
>> - bond_arp_recv() @ drivers/net/bonding/bond_main.c
>> - packet_rcv() @ net/packet/af_packet.c
>> - tpacket_rcv() @ net/packet/af_packet.c
>>
>>  From the bonding point of view, the meaning of orig_dev is obviously
>> "the device one layer below the bonding device, through which the packet
>> reached the bonding device". It is used by bond_3ad_lacpdu_recv() and
>> bond_arp_recv(), to find the underlying slave device through which the
>> LACPDU or ARP was received. (The protocol handler is registered at the
>> bonding device level).
>>
>>  From the af_packet point of view, the meaning is documented (in commit
>> "[AF_PACKET]: Add option to return orig_dev to userspace") as the
>> "physical device [that] actually received the traffic, instead of having
>> the encapsulating device hide that information."
>>
>> When the bonding device is just one level above the physical device, the
>> two meanings happen to match the same device, by chance.
>>
>> So, currently, a bonding device cannot stack properly on top of anything
>> but physical devices. It might not be a problem today, but may change in
>> the future...
>
>Hi Jay,
>
>Still thinking about this orig_dev stuff, I wonder why the protocol
>handlers used in bonding (bond_3ad_lacpdu_recv() and bond_arp_rcv()) are
>registered at the master level instead of at the slave level?
>
>If they were registered at the slave level, they would simply receive
>skb->dev as the ingress interface and use this value instead of needing
>the orig_dev value given to them when they are registered at the master
>level.
>
>As orig_dev is only used by bonding and by af_packet, but they disagree on
>the exact meaning of orig_dev, one way to fix this discrepancy would be to
>remove one of the usages. As the af_packet usage is exposed to user space,
>bonding seems the right place to stop using orig_dev, even if orig_dev was
>introduced for bonding :-)
>
>I understand that this would add one entry per slave device to the
>ptype_base list, but this seems to be the only bad effect of registering
>at the slave level. Can you confirm that this was the reason to register
>at the master level instead?

	My recollection is that it was done the way it is because there
was no "orig_dev" delivery logic at the time.  A handler registered to a
slave dev would receive no packets at all because assignment of skb->dev
to the master happened first, and the "orig_dev" knowledge was lost.

	When 802.3ad was added, a skb->real_dev field was created, but
it wasn't used for delivery.  802.3ad used real_dev to figure out which
slave a LACPDU arrived on.  The skb->real_dev was eventually replaced
with the orig_dev business that's there now.

	Later, I did the arp_validate stuff the same way as 802.3ad
because it worked and was easier than registering a handler per slave.

>If you think registering at the slave level would cause too much impact on
>ptype_base, then we might have another way to stop using orig_dev for
>bonding:
>
>In __skb_bond_should_drop(), we already test for the two interesting protocols:
>
>if ((dev->priv_flags & IFF_SLAVE_NEEDARP) && skb->protocol == __cpu_to_be16(ETH_P_ARP))
>	return 0;
>
>if (master->priv_flags & IFF_MASTER_8023AD && skb->protocol == __cpu_to_be16(ETH_P_SLOW))
>	return 0;
>
>Would it be possible to call the right handlers directly from inside
>__skb_bond_should_drop() then let __skb_bond_should_drop() return 1
>("should drop") after processing the frames that are only of interest for
>bonding?

	Isn't one purpose of switching to rx_handler that there won't
need to be any skb_bond_should_drop logic in __netif_receive_skb at all?

	Still, if you're just trying to simplify __netif_receive_skb
first, I don't see any reason not to register the packet handlers at the
slave level.  Looking at the ptype_base hash, I don't think that the
protocols bonding is registering (ARP and SLOW) will hash collide with
IP or IPv6, so I suspect there won't be much impact.

	Once an rx_handler is used, then I suspect there's no need for
the packet handlers at all, since the rx_handler is within bonding and
can just deal with the ARP or LACPDU directly.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com

^ permalink raw reply

* Re: [Lxc-users] Bad checksums and lost packets with macvlan on dummy
From: Andrian Nord @ 2011-02-26 20:38 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: lxc-users, Patrick McHardy, Linux Netdev List
In-Reply-To: <4D6630D9.2050400@free.fr>

[-- Attachment #1: Type: text/plain, Size: 323 bytes --]

On Thu, Feb 24, 2011 at 11:20:09AM +0100, Daniel Lezcano wrote:
> I saw you were using the command 'nc6', do you use netcat with ipv6 ?

Well, yes and no. I've tried both ipv4 and ipv6 and my notebook has no
ipv6 address assigned, so the most terrible connection was through ipv4 =).

At another server there is no ipv6 at all.

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: [RFC] be2net: add rxhash support
From: Ajit Khaparde @ 2011-02-26 21:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev


> From: Eric Dumazet [mailto:eric.dumazet@gmail.com]
> Sent: Saturday, February 26, 2011 4:31 AM
> To: Khaparde, Ajit
> Cc: netdev@vger.kernel.org
> Subject: Re: [RFC] be2net: add rxhash support
> 
> Le vendredi 25 février 2011 à 15:35 -0600, Ajit Khaparde a écrit :
> 
> > I asked that because, if a switch is part of the configuration,
> > the ASIC can receive packets other than the tcp flow.
> >
> > And if hashing is enabled for IP packets, we can see this behavior.
> > The other values indicate that hashing has been enabled for IPv4
> packets.
> 
> To make sure RSS (and rxhash) was OK, I added following debugging aid :
> 
> diff --git a/include/net/sock.h b/include/net/sock.h
> index da0534d..e9b1180 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -688,6 +688,7 @@ static inline void sock_rps_save_rxhash(struct sock *sk, u32 rxhash)
>  {
>  #ifdef CONFIG_RPS
>  	if (unlikely(sk->sk_rxhash != rxhash)) {
> +		pr_err("rxhash change from %x to %x\n", sk->sk_rxhash, rxhash);
>  		sock_rps_reset_flow(sk);
>  		sk->sk_rxhash = rxhash;
>  	}
> 
> 
> And got following traces :
> 
> [  201.170297] change rxhash from 0 to be0b5a87
> [  232.607474] bonding: bond1: Setting eth3 as active slave.
> [  232.607478] bonding: bond1: making interface eth3 the new active
> one.
> [  232.710848] change rxhash from be0b5a87 to e56a3c1e
> [  300.047500] bonding: bond1: Setting eth1 as active slave.
> [  300.047504] bonding: bond1: making interface eth1 the new active
> one.
> [  300.159162] change rxhash from e56a3c1e to be0b5a87
> 
> The flip occurred when I changed my active slave (bonding mode=1).
> 
> eth1 is a bnx2 NIC, while eth3 is a be2net one, so it's OK to change the
> rxhash in this case
> (different firmware/algo)
> 
> So as far as be2net is concerned, everything seems OK: all packets for
> a given flow get a unique RSS hash and can feed skb->rxhash
> 
Fair enough. Thanks.
I guess a fresh patch with the ethtool support included will be ideal,
instead of the previous patch?

-Ajit

^ permalink raw reply

* IPv6 source address selection and privacy extensions
From: Bruno Prémont @ 2011-02-26 22:16 UTC (permalink / raw)
  To: netdev

From Documentation/networking/ip-sysctl.txt:

use_tempaddr - INTEGER
        Preference for Privacy Extensions (RFC3041).
          <= 0 : disable Privacy Extensions
          == 1 : enable Privacy Extensions, but prefer public
                 addresses over temporary addresses.
          >  1 : enable Privacy Extensions and prefer temporary
                 addresses over public addresses.
        Default:  0 (for most devices)
                 -1 (for point-to-point devices and loopback devices)

Is it possible with current kernel to have >1 make temporary addresses
used by default but have manual or dynamic (e.g. MAC based) address used
for some destination addresses/subnets?
If it's possible, how can this be done? (Adding a hint to ip-sysctl.txt
would then make it easy for others to find.)

With IPv4 this can be done via `ip route add $subnet/$prefix src $addr`
though the same does not work for IPv6.

Thanks,
Bruno

^ permalink raw reply

* Re: [net-next-2.6 PATCH 02/10] ethtool: add ntuple flow specifier to network flow classifier
From: David Miller @ 2011-02-27  0:05 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: jeffrey.t.kirsher, bhutchings, netdev
In-Reply-To: <20110225233249.7920.70334.stgit@gitlad.jf.intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Fri, 25 Feb 2011 15:32:49 -0800

> @@ -396,8 +411,10 @@ struct ethtool_rx_flow_spec {
>  		struct ethtool_ah_espip4_spec		esp_ip4_spec;
>  		struct ethtool_usrip4_spec		usr_ip4_spec;
>  		struct ethhdr				ether_spec;
> +		struct ethtool_ntuple_spec_ext		ntuple_spec;
>  		__u8					hdata[72];
>  	} h_u, m_u;
> +	__u32		flow_type_ext;
>  	__u64		ring_cookie;
>  	__u32		location;
>  };

How can you add this flow_type_ext member to this user visible structure
without utterly breaking userspace?  It changes the offsets of the
ring_cookie and location members.
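
One way to check this concern on a given ABI is to compare offsetof()
before and after the change. The sketch below uses simplified stand-in
structs, not the real ethtool header: model_flow_union, model_spec_before
and model_spec_after are hypothetical names, and a leading flow_type member
is assumed because the quoted hunk does not show the top of the struct. In
this model, whether the offsets move depends on how the ABI aligns 64-bit
integers: where the compiler already pads before ring_cookie, a new 4-byte
member can land in that hole, while an ABI that aligns 64-bit integers to
4 bytes shifts both members.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the h_u/m_u union: 72 bytes, 4-byte aligned (assumption). */
union model_flow_union {
	uint32_t words[18];
	uint8_t  hdata[72];
};

/* Layout before the quoted hunk (simplified). */
struct model_spec_before {
	uint32_t flow_type;
	union model_flow_union h_u, m_u;
	uint64_t ring_cookie;
	uint32_t location;
};

/* Layout after the quoted hunk adds flow_type_ext (simplified). */
struct model_spec_after {
	uint32_t flow_type;
	union model_flow_union h_u, m_u;
	uint32_t flow_type_ext;
	uint64_t ring_cookie;
	uint32_t location;
};

/* Bytes by which ring_cookie moves between the two layouts. */
static long model_ring_cookie_shift(void)
{
	assert(sizeof(union model_flow_union) == 72);
	return (long)offsetof(struct model_spec_after, ring_cookie) -
	       (long)offsetof(struct model_spec_before, ring_cookie);
}

/* Bytes by which location moves between the two layouts. */
static long model_location_shift(void)
{
	return (long)offsetof(struct model_spec_after, location) -
	       (long)offsetof(struct model_spec_before, location);
}
```

Printing both values on each target architecture shows whether binaries
built against the old layout would read ring_cookie and location from the
wrong place.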


^ permalink raw reply

* Re: [net-next-2.6 PATCH 01/10] ethtool: prevent null pointer dereference with NTUPLE set but no set_rx_ntuple
From: David Miller @ 2011-02-27  0:07 UTC (permalink / raw)
  To: alexander.h.duyck; +Cc: bhutchings, jeffrey.t.kirsher, netdev
In-Reply-To: <4D684BED.20805@intel.com>

From: Alexander Duyck <alexander.h.duyck@intel.com>
Date: Fri, 25 Feb 2011 16:40:13 -0800

> It cannot occur with any of the in-kernel drivers since they all set
> the NETIF_F_NTUPLE flag and have the function defined.  However going
> forward I would like to have the option of using the network flow
> classifier interface instead of the set_rx_ntuple interface due to the
> fact that it supports many of the features I needed.

This still doesn't explain to me why a driver would set the feature
flag, but not actually implement the feature.

I'm not applying this patch.

When you create the situation that causes the potentially NULL
dereference, then you can use that patch to show why this seemingly
illogical situation can indeed occur.

Until then no driver causes this issue, therefore the problem does
not exist.
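
For reference, the kind of guard under discussion can be sketched in
userspace. This is only an illustration under stated assumptions:
model_ops, model_netdev, MODEL_F_NTUPLE, model_accept_filter() and
model_set_rx_ntuple() are hypothetical names, and the error codes are a
choice made for this sketch, not taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the NETIF_F_NTUPLE feature bit. */
#define MODEL_F_NTUPLE 0x1u

struct model_ops {
	int (*set_rx_ntuple)(int filter_id);	/* may be NULL */
};

struct model_netdev {
	unsigned int features;
	const struct model_ops *ops;
};

/* Example driver callback: accepts any non-negative filter id. */
static int model_accept_filter(int filter_id)
{
	return filter_id >= 0 ? 0 : -EINVAL;
}

/* Dispatch that tolerates the feature flag being set without a callback. */
static int model_set_rx_ntuple(const struct model_netdev *dev, int filter_id)
{
	assert(dev != NULL);

	if (!(dev->features & MODEL_F_NTUPLE))
		return -EINVAL;
	if (!dev->ops || !dev->ops->set_rx_ntuple)
		return -EOPNOTSUPP;	/* flag set, but nothing to call */
	return dev->ops->set_rx_ntuple(filter_id);
}
```

Without the second check, a device in this model that sets the flag but
leaves the callback NULL would be called through a NULL function pointer.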

^ permalink raw reply

* net-next: warnings from sysctl_net_exit
From: Stephen Hemminger @ 2011-02-27  0:56 UTC (permalink / raw)
  To: Alexey Dobriyan, David Miller; +Cc: netdev

Seeing lots of these messages in dmesg. Something broke recently
in net-next.


[26207.669668] ------------[ cut here ]------------
[26207.669673] WARNING: at net/sysctl_net.c:84 sysctl_net_exit+0x2a/0x2c()
[26207.669675] Hardware name: System Product Name
[26207.669676] Modules linked in: ip6table_filter ip6_tables nfs lockd fscache nfs_acl auth_rpcgss sunrpc binfmt_misc ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp llc kvm_intel kvm snd_hda_codec_analog lm63 snd_hda_intel snd_hda_codec snd_hwdep radeon pl2303 snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event ttm drm_kms_helper snd_seq drm usbserial i7core_edac snd_timer snd_seq_device snd sky2 e1000e edac_core psmouse igb serio_raw soundcore snd_page_alloc i2c_algo_bit asus_atk0110 hid_belkin usbhid hid pata_marvell ahci libahci dca floppy btrfs lzo_compress zlib_deflate crc32c libcrc32c
[26207.669725] Pid: 67, comm: kworker/u:5 Tainted: G        W   2.6.38-rc5-net-next+ #38
[26207.669726] Call Trace:
[26207.669731]  [<ffffffff81040dd2>] ? warn_slowpath_common+0x85/0x9d
[26207.669735]  [<ffffffff813617f6>] ? cleanup_net+0x0/0x19a
[26207.669738]  [<ffffffff81040e04>] ? warn_slowpath_null+0x1a/0x1c
[26207.669740]  [<ffffffff814154ad>] ? sysctl_net_exit+0x2a/0x2c
[26207.669742]  [<ffffffff8136144e>] ? ops_exit_list+0x2a/0x5b
[26207.669745]  [<ffffffff813618f0>] ? cleanup_net+0xfa/0x19a
[26207.669749]  [<ffffffff810575c1>] ? process_one_work+0x233/0x3aa
[26207.669752]  [<ffffffff81057528>] ? process_one_work+0x19a/0x3aa
[26207.669755]  [<ffffffff810599c2>] ? worker_thread+0x13b/0x25a
[26207.669757]  [<ffffffff81059887>] ? worker_thread+0x0/0x25a
[26207.669760]  [<ffffffff8105d0f5>] ? kthread+0x9d/0xa5
[26207.669763]  [<ffffffff8106d618>] ? trace_hardirqs_on_caller+0x10c/0x130
[26207.669766]  [<ffffffff810030d4>] ? kernel_thread_helper+0x4/0x10
[26207.669770]  [<ffffffff8142f300>] ? restore_args+0x0/0x30
[26207.669772]  [<ffffffff8105d058>] ? kthread+0x0/0xa5
[26207.669774]  [<ffffffff810030d0>] ? kernel_thread_helper+0x0/0x10
[26207.669776] ---[ end trace 0cd6e119ada0eab1 ]---

^ permalink raw reply

* Re: [net-next-2.6 PATCH 01/10] ethtool: prevent null pointer dereference with NTUPLE set but no set_rx_ntuple
From: Alexander Duyck @ 2011-02-27  2:16 UTC (permalink / raw)
  To: David Miller; +Cc: alexander.h.duyck, bhutchings, jeffrey.t.kirsher, netdev
In-Reply-To: <20110226.160747.226765885.davem@davemloft.net>

On Sat, Feb 26, 2011 at 4:07 PM, David Miller <davem@davemloft.net> wrote:
> From: Alexander Duyck <alexander.h.duyck@intel.com>
> Date: Fri, 25 Feb 2011 16:40:13 -0800
>
>> It cannot occur with any of the in-kernel drivers since they all set
>> the NETIF_F_NTUPLE flag and have the function defined.  However going
>> forward I would like to have the option of using the network flow
>> classifier interface instead of the set_rx_ntuple interface due to the
>> fact that it supports many of the features I needed.
>
> This still doesn't explain to me why a driver would set the feature
> flag, but not actually implement the feature.

Actually the reason I ran into this is because of the patches in the
RFC set.  Basically I was looking at moving the ntuple support in
ixgbe over to network flow classifier rules.  As such I was leaving
the ntuple flag set, but using set_rxnfc via the filter rules instead.
 If you recommend adding a new flag to do that I am fine with that.

> I'm not applying this patch.
>
> When you create the situation that causes the potentially NULL
> dereference, then you can use that patch to show why this seemingly
> illogical situation can indeed occur.
>
> Until then no driver causes this issue, therefore the problem does
> not exist.

I'll do some digging late next week to see if there are any other
means of encountering the issue and will get back to you if I find
anything.

Thanks,

Alex

^ permalink raw reply

* dccp: Change maintainer
From: Arnaldo Carvalho de Melo @ 2011-02-27  2:28 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Networking Development Mailing List

Today was as good as any other day, but I felt I had to do things I love
to when paying homage to somebody I love, so please apply this one,
something he would be proud of, even if so geekly.

    Way past it was/is deserved.

    Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

diff --git a/MAINTAINERS b/MAINTAINERS
index 5dd6c75..1752436 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -2026,7 +2026,7 @@ F:	Documentation/scsi/dc395x.txt
 F:	drivers/scsi/dc395x.*

 DCCP PROTOCOL
-M:	Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
+M:	Gerrit Renker <gerrit@erg.abdn.ac.uk>
 L:	dccp@vger.kernel.org
 W:	http://www.linuxfoundation.org/collaborate/workgroups/networking/dccp
 S:	Maintained

^ permalink raw reply related

* txqueuelen has wrong units; should be time
From: Albert Cahalan @ 2011-02-27  5:44 UTC (permalink / raw)
  To: linux-kernel, netdev

(thinking about the bufferbloat problem here)

Setting txqueuelen to some fixed number of packets
seems pretty broken if:

1. a link can vary in speed (802.11 especially)

2. a packet can vary in size (9 KiB jumbograms, etc.)

3. there is other weirdness (PPP compression, etc.)

It really needs to be set to some amount of time,
with the OS accounting for packets in terms of the
time it will take to transmit them. This would need
to account for physical-layer packet headers and
minimum spacing requirements.

I think it could also account for estimated congestion
on the local link, because that affects the rate at which
the queue can empty. An OS can directly observe this
on some types of hardware.

Nanoseconds seems fine; it's unlikely you'd ever want
more than 4.2 seconds (32-bit unsigned) of queue.

I guess there are at least 2 queues of interest, with the
second one being under control of the hardware driver.
Having the kernel split the max time as appropriate for
the hardware seems nicest.
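
A minimal sketch of the accounting described above, written as plain
userspace C rather than kernel code. The names tx_time_ns, time_queue,
tq_enqueue and tq_dequeue are hypothetical, and the 38-byte Ethernet
overhead (preamble, header, FCS, inter-frame gap) is an assumption that
only holds for that link type. Each packet is charged its estimated
transmit time, and the queue refuses packets once the time budget is used up:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed per-frame overhead on Ethernet: preamble+SFD (8) + header (14)
 * + FCS (4) + inter-frame gap (12). Other link types differ. */
#define ETH_OVERHEAD_BYTES 38u

/* Nanoseconds needed to put one packet on the wire at link_bps bits/s. */
static uint64_t tx_time_ns(uint32_t packet_bytes, uint32_t overhead_bytes,
                           uint64_t link_bps)
{
    uint64_t bits = ((uint64_t)packet_bytes + overhead_bytes) * 8u;

    assert(link_bps > 0);
    /* Keep bits * 1e9 inside 64 bits. */
    assert(bits <= UINT64_MAX / 1000000000ull);
    return bits * 1000000000ull / link_bps;
}

/* Hypothetical queue whose limit is a time budget, not a packet count. */
struct time_queue {
    uint64_t limit_ns;   /* configured budget */
    uint64_t queued_ns;  /* estimated drain time of queued packets */
};

/* Admit a packet only if its transmit time still fits in the budget. */
static bool tq_enqueue(struct time_queue *q, uint64_t cost_ns)
{
    if (cost_ns > q->limit_ns - q->queued_ns)
        return false;
    q->queued_ns += cost_ns;
    return true;
}

/* Give the time back once the packet has left the queue. */
static void tq_dequeue(struct time_queue *q, uint64_t cost_ns)
{
    assert(cost_ns <= q->queued_ns);
    q->queued_ns -= cost_ns;
}
```

With these assumptions a 1500-byte packet costs 12304 ns at 1 Gbit/s and
123040 ns at 100 Mbit/s, so the same time budget holds ten times fewer
packets on the slower link. A varying link rate would mean recomputing the
cost with the current rate estimate.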

^ permalink raw reply

* Re: dccp: Change maintainer
From: David Miller @ 2011-02-27  5:47 UTC (permalink / raw)
  To: acme; +Cc: netdev
In-Reply-To: <20110227022854.GB19108@ghostprotocols.net>

From: Arnaldo Carvalho de Melo <acme@ghostprotocols.net>
Date: Sat, 26 Feb 2011 23:28:54 -0300

> Today was as good as any other day, but I felt I had to do things I love
> to when paying homage to somebody I love, so please apply this one,
> something he would be proud of, even if so geekly.

I think you're trying to say "I wish people would stop sending me DCCP bug
reports, damn..." :-)

>     Way past it was/is deserved.
>     
>     Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

Applied, thanks.

^ permalink raw reply

* Re: net-next: warnings from sysctl_net_exit
From: David Miller @ 2011-02-27  6:23 UTC (permalink / raw)
  To: shemminger; +Cc: adobriyan, netdev
In-Reply-To: <20110226165601.48858003@nehalam>

From: Stephen Hemminger <shemminger@vyatta.com>
Date: Sat, 26 Feb 2011 16:56:01 -0800

> Seeing lots of these messages in dmesg. Something is broken
> recently in net-next.

Did you by chance pull plain net-2.6 into that tree?  Because one
commit which is in net-2.6 but not in net-next-2.6 catches my eye:

commit c486da34390846b430896a407b47f0cea3a4189c
Author: Lucian Adrian Grijincu <lucian.grijincu@gmail.com>
Date:   Thu Feb 24 19:48:03 2011 +0000

    sysctl: ipv6: use correct net in ipv6_sysctl_rtcache_flush
    
    Before this patch issuing these commands:
    
      fd = open("/proc/sys/net/ipv6/route/flush")
      unshare(CLONE_NEWNET)
      write(fd, "stuff")
    
    would flush the newly created net, not the original one.
    
    The equivalent ipv4 code is correct (stores the net inside ->extra1).
    Acked-by: Daniel Lezcano <daniel.lezcano@free.fr>
    
    Signed-off-by: David S. Miller <davem@davemloft.net>


^ permalink raw reply

* Re: [patch 1/1] [PATCH] qeth: remove needless IPA-commands in offline
From: David Miller @ 2011-02-27  6:41 UTC (permalink / raw)
  To: frank.blaschka; +Cc: netdev, linux-s390, ursula.braun
In-Reply-To: <20110218142343.763210392@de.ibm.com>

From: frank.blaschka@de.ibm.com
Date: Fri, 18 Feb 2011 15:22:59 +0100

> From: Ursula Braun <ursula.braun@de.ibm.com>
> 
> If a qeth device is set offline, data and control subchannels are
> cleared, which means removal of all IP Assist Primitive settings
> implicitly. There is no need to delete those settings explicitly.
> This patch removes all IP Assist invocations from offline.
> 
> Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
> Signed-off-by: Frank Blaschka <frank.blaschka@de.ibm.com>

Applied.

^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: Mikael Abrahamsson @ 2011-02-27  7:02 UTC (permalink / raw)
  To: Albert Cahalan; +Cc: linux-kernel, netdev
In-Reply-To: <AANLkTimd5GQwtUFP2fD_An=M8ajBD8DJpzxQJezv8fB8@mail.gmail.com>

On Sun, 27 Feb 2011, Albert Cahalan wrote:

> Nanoseconds seems fine; it's unlikely you'd ever want
> more than 4.2 seconds (32-bit unsigned) of queue.

I think this is shortsighted and I'm sure someone will come up with a case 
where 4.2 seconds isn't enough. Let's not build in those kinds of 
limitations from the start.

Why not make it 64-bit and go to picoseconds from the start?

If you need to make it 32-bit unsigned, I'd suggest starting from 
microseconds instead. It's less likely someone would want less than a 
microsecond of queue, than someone wanting more than 4.2 seconds of queue.
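
The ranges being compared can be checked with a small calculation. This is
only an arithmetic sketch; max_span_seconds is a hypothetical helper, not
anything from the kernel:

```c
#include <assert.h>
#include <stdint.h>

/* Longest interval, in seconds, that a counter able to hold max_ticks
 * ticks can express when one second contains ticks_per_second ticks. */
static double max_span_seconds(uint64_t max_ticks, double ticks_per_second)
{
    assert(ticks_per_second > 0.0);
    return (double)max_ticks / ticks_per_second;
}
```

max_span_seconds(UINT32_MAX, 1e9) is about 4.29 s (32-bit nanoseconds),
max_span_seconds(UINT32_MAX, 1e6) is about 4295 s or roughly 71.6 minutes
(32-bit microseconds), and max_span_seconds(UINT64_MAX, 1e12) is about
1.84e7 s or roughly 213 days (64-bit picoseconds).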

-- 
Mikael Abrahamsson    email: swmike@swm.pp.se

^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: Eric Dumazet @ 2011-02-27  7:54 UTC (permalink / raw)
  To: Mikael Abrahamsson; +Cc: Albert Cahalan, linux-kernel, netdev
In-Reply-To: <alpine.DEB.1.10.1102270758580.11974@uplift.swm.pp.se>

Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit :
> On Sun, 27 Feb 2011, Albert Cahalan wrote:
> 
> > Nanoseconds seems fine; it's unlikely you'd ever want
> > more than 4.2 seconds (32-bit unsigned) of queue.
> 
> I think this is shortsighted and I'm sure someone will come up with a case 
> where 4.2 seconds isn't enough. Let's not build in those kinds of 
> limitations from the start.
> 
> Why not make it 64-bit and go to picoseconds from the start?
> 
> If you need to make it 32-bit unsigned, I'd suggest starting from 
> microseconds instead. It's less likely someone would want less than a 
> microsecond of queue, than someone wanting more than 4.2 seconds of queue.
> 

32 or 64 bits doesn't matter a lot. At Qdisc stage we have up to 40 bytes
available in skb->cb[] for our usage.

Problem is some machines have slow High Resolution timing services.

_If_ we have a time limit, it will probably use the low resolution (aka
jiffies), unless high resolution services are cheap.

I was thinking of not having an absolute hard limit, but an EWMA-based one.
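
For readers unfamiliar with the term, an exponentially weighted moving
average can be kept with integer arithmetic as sketched below. The message
does not say which quantity would be averaged; the sketch assumes it is a
per-packet queueing delay in some time unit. struct ewma, ewma_add and
ewma_over_limit are hypothetical names, not existing kernel helpers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ewma {
    int64_t avg;     /* current smoothed value */
    unsigned shift;  /* a new sample has weight 1 / 2^shift */
};

/* Move the average a fraction 1 / 2^shift of the way toward the sample. */
static void ewma_add(struct ewma *e, int64_t sample)
{
    assert(e->shift < 32);
    e->avg += (sample - e->avg) / ((int64_t)1 << e->shift);
}

/* Soft limit: act on the smoothed value, not on a single sample. */
static bool ewma_over_limit(const struct ewma *e, int64_t limit)
{
    return e->avg > limit;
}
```

With shift = 3 and the average starting at 0, two samples of 800 give 100
and then 187, so a short burst does not trip the limit immediately while
a sustained high delay eventually does.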




^ permalink raw reply

* Re: txqueuelen has wrong units; should be time
From: Albert Cahalan @ 2011-02-27  8:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Mikael Abrahamsson, linux-kernel, netdev
In-Reply-To: <1298793252.8726.45.camel@edumazet-laptop>

On Sun, Feb 27, 2011 at 2:54 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le dimanche 27 février 2011 à 08:02 +0100, Mikael Abrahamsson a écrit :
>> On Sun, 27 Feb 2011, Albert Cahalan wrote:
>>
>> > Nanoseconds seems fine; it's unlikely you'd ever want
>> > more than 4.2 seconds (32-bit unsigned) of queue.
...
> Problem is some machines have slow High Resolution timing services.
>
> _If_ we have a time limit, it will probably use the low resolution (aka
> jiffies), unless high resolution services are cheap.

As long as that is totally internal to the kernel and never
getting exposed by some API for setting the amount, sure.

> I was thinking of not having an absolute hard limit, but an EWMA-based one.

The whole point is to prevent stale packets, especially to prevent
them from messing with TCP, so I really don't think so. I suppose
you do get this to some extent via early drop.

^ permalink raw reply

* [PATCH net-next] netdevice: make initial group visible to userspace
From: Vlad Dogaru @ 2011-02-27  8:39 UTC (permalink / raw)
  To: NetDev; +Cc: Stephen Hemminger, David Miller, Patrick McHardy
In-Reply-To: <20110225124345.0d691789@nehalam>

On Fri, Feb 25, 2011 at 12:43:45PM -0800, Stephen Hemminger wrote:
> On Wed,  2 Feb 2011 20:23:40 +0200
> Vlad Dogaru <ddvlad@rosedu.org> wrote:
> 
> > User can specify device group to list by using the group keyword:
> > 
> > 	ip link show group test
> > 
> > If no group is specified, 0 (default) is implied.
> > 
> > Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
> 
> I applied this to net-next for iproute2
> but INIT_NETDEV_GROUP is in a part of netdevice.h that is not exported 
> (i.e. inside #ifdef __KERNEL__).

Sorry, here is a patch for net-next that fixes the issue:


[PATCH net-next] netdevice: make initial group visible to userspace

INIT_NETDEV_GROUP is needed by userspace, move it outside __KERNEL__
guards.

Signed-off-by: Vlad Dogaru <ddvlad@rosedu.org>
---
 include/linux/netdevice.h |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ffe56c1..8be4056 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -75,9 +75,6 @@ struct wireless_dev;
 #define NET_RX_SUCCESS		0	/* keep 'em coming, baby */
 #define NET_RX_DROP		1	/* packet dropped */
 
-/* Initial net device group. All devices belong to group 0 by default. */
-#define INIT_NETDEV_GROUP	0
-
 /*
  * Transmit return codes: transmit return codes originate from three different
  * namespaces:
@@ -141,6 +138,9 @@ static inline bool dev_xmit_complete(int rc)
 
 #define MAX_ADDR_LEN	32		/* Largest hardware address length */
 
+/* Initial net device group. All devices belong to group 0 by default. */
+#define INIT_NETDEV_GROUP	0
+
 #ifdef  __KERNEL__
 /*
  *	Compute the worst case header length according to the protocols
-- 
1.7.1


^ permalink raw reply related

* Re: EPT: Misconfiguration
From: Avi Kivity @ 2011-02-27 10:46 UTC (permalink / raw)
  To: Ruben Kerkhof; +Cc: Marcelo Tosatti, kvm, netdev
In-Reply-To: <AANLkTiknMneQtYqgmX7gvXsMoSO-yiLXr-dwbEej80Uy@mail.gmail.com>


Copying netdev: looks like memory corruption in the networking stack.

Archive link: http://www.spinics.net/lists/kvm/msg50651.html (for the 
attachment).

On 02/24/2011 11:15 PM, Ruben Kerkhof wrote:
> >
> >  On Tue, Feb 15, 2011 at 18:16, Marcelo Tosatti<mtosatti@redhat.com>  wrote:
>
> >>  This and the others reported. So yes, it looks like something is corrupting
> >>  memory. Ruben, you can try to boot with slub_debug=ZFPU kernel option.
>
> Ok, there are now only 6 vms left on this host, and I've booted it
> with the slub_debug=ZFPU option.
> After a few hours, I got the following result:
>
> 2011-02-24T21:41:30.818496+01:00 phy005 kernel:
> =============================================================================
> 2011-02-24T21:41:30.818517+01:00 phy005 kernel: BUG kmalloc-2048 (Not
> tainted): Object padding overwritten
> 2011-02-24T21:41:30.818523+01:00 phy005 kernel:
> -----------------------------------------------------------------------------
> 2011-02-24T21:41:30.818526+01:00 phy005 kernel:
> 2011-02-24T21:41:30.818530+01:00 phy005 kernel: INFO:
> 0xffff8806230752ca-0xffff8806230752cf. First byte 0x0 instead of 0x5a
> 2011-02-24T21:41:30.818534+01:00 phy005 kernel: INFO: Allocated in
> __netdev_alloc_skb+0x34/0x51 age=2231 cpu=8 pid=0
> 2011-02-24T21:41:30.818537+01:00 phy005 kernel: INFO: Freed in
> skb_release_data+0xc9/0xce age=2368 cpu=8 pid=2159
> 2011-02-24T21:41:30.818541+01:00 phy005 kernel: INFO: Slab
> 0xffffea00157a9880 objects=15 used=13 fp=0xffff8806230752d0
> flags=0x40000000004083
> 2011-02-24T21:41:30.818545+01:00 phy005 kernel: INFO: Object
> 0xffff880623074a88 @offset=19080 fp=0xffff8806230752d0
>
> The rest of the output is attached since it's quite large.
>
> Kind regards,
>
> Ruben


-- 
error compiling committee.c: too many arguments to function


^ permalink raw reply

* Re: SO_REUSEPORT - can it be done in kernel?
From: Thomas Graf @ 2011-02-27 11:02 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, rick.jones2, therbert, wsommerfeld, daniel.baluta,
	netdev
In-Reply-To: <20110226005718.GA19889@gondor.apana.org.au>

On Sat, Feb 26, 2011 at 08:57:18AM +0800, Herbert Xu wrote:
> I'm fairly certain the bottleneck is indeed in the kernel, and
> in the UDP stack in particular.
> 
> This is borne out by a test where I used two named worker threads,
> both working on the same socket.  Stracing shows that they're
> working flat out only doing sendmsg/recvmsg.
> 
> The result was that they obtained (in aggregate) half the throughput
> of a single worker thread.

I agree. This is the bottleneck that I described where the kernel is
not able to deliver enough queries for BIND to show the lock
contention issues.

But there is also the situation where netperf RR performance numbers
indicate a much higher kernel capability but BIND is not able to
deliver more even though the CPU utilization is very low. This is
the situation where we see the large number of futex calls indicating
the lock contention due to too many queries on a single socket.

> Which is why I'm quite skeptical about this REUSEPORT patch as
> IMHO the only reason it produces a great result is solely because
> it is allowing parallel sends going out.
> 
> Rather than modifying all UDP applications out there to fix what
> is fundamentally a kernel problem, I think what we should do is
> fix the UDP stack so that it actually scales.

I am not suggesting that this is the ultimate and final fix for this
problem. It is fixing a symptom rather than fixing the cause but
sometimes being able to fix the symptom becomes really handy :-)

Adding SO_REUSEPORT does not prevent us from fixing the UDP stack
in the long run.

> It isn't all that hard since the easy way would be to only take
> the lock if we're already corked or about to cork.
> 
> For the receive side we also don't need REUSEPORT as we can simply
> make our UDP stack multiqueue.

OK, it is not required and there is definitely a better way to fix
the kernel bottleneck in the long term. Even better.

I still suggest merging this patch as an immediate workaround
until we scale properly on a single socket and also as a workaround
for applications which can't get rid of their per-socket mutex quickly.

^ permalink raw reply
