Netdev List
* RE: [PATCH net-next V1] net/mlx4_en: ethtool force speed when asking for autoneg=off
From: David Laight @ 2014-12-08 10:56 UTC
  To: 'Amir Vadai', David S. Miller
  Cc: netdev@vger.kernel.org, Or Gerlitz, Yevgeny Petrilin,
	Saeed Mahameed
In-Reply-To: <1417969655-28028-1-git-send-email-amirv@mellanox.com>

From: Amir Vadai
> From: Saeed Mahameed <saeedm@mellanox.com>
> 
> Use cmd->autoneg == AUTONEG_DISABLE as a user hint to force specific speed.
> We don't want to rely on ethtool to calculate advertised link modes when
> forcing specific speed, a user can request a specific speed and specify
> "autoneg off" in ethtool command to give a hint for forcing this speed.

I'm not 100% sure what you are trying to achieve.

By far the safest way to 'force' a specific speed is to set the
advertised modes to contain only the desired speed.
Doing anything else on links that are capable of auto-negotiation
is a complete recipe for disaster.

Even if you fix the operating mode of the PHY and MAC you almost
certainly want to advertise that mode to the remote system.

Yes, I know this is made all the more complicated by 10/100M autodetect.

	David

* Re: [PATCH net-next v3 2/2] rocker: remove swdev mode
From: Jiri Pirko @ 2014-12-08 11:03 UTC
  To: Thomas Graf
  Cc: roopa, sfeldma, jhs, bcrl, john.fastabend, stephen, linville,
	vyasevic, netdev, davem, shm, gospo
In-Reply-To: <20141207081928.GA2215@casper.infradead.org>

Sun, Dec 07, 2014 at 09:19:28AM CET, tgraf@suug.ch wrote:
>On 12/06/14 at 10:54pm, roopa@cumulusnetworks.com wrote:
>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>> 
>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>> ---
>>  drivers/net/ethernet/rocker/rocker.c |   18 +-----------------
>>  include/linux/rtnetlink.h            |    2 +-
>>  net/core/rtnetlink.c                 |   12 +++++++++---
>>  3 files changed, 11 insertions(+), 21 deletions(-)
>> 
>> diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
>> index fded127..9f1d256 100644
>> --- a/drivers/net/ethernet/rocker/rocker.c
>> +++ b/drivers/net/ethernet/rocker/rocker.c
>> @@ -3755,7 +3739,7 @@ static int rocker_port_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
>>  				      u32 filter_mask)
>>  {
>>  	struct rocker_port *rocker_port = netdev_priv(dev);
>> -	u16 mode = BRIDGE_MODE_SWDEV;
>> +	u16 mode = -1;
>        ^^^
>
>I assume you meant s16

well, I see no problem in using u16. IFLA_BRIDGE_MODE attr is u16 so
mode should stay u16.

But maybe better to add:
#define BRIDGE_MODE_UNDEF 0xFFFF

?
>
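
The `u16 mode = -1` question can be checked in isolation. The following is a minimal sketch, assuming `uint16_t` stands in for the kernel's `u16`. `BRIDGE_MODE_UNDEF` is only the name proposed in this thread, and `mode_is_undef()` and `mode_from_minus_one()` are hypothetical helpers for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Name and value as proposed in the thread; an assumption, not an
 * existing kernel definition. */
#define BRIDGE_MODE_UNDEF 0xFFFF

/* Hypothetical helper: true if no bridge mode has been reported. */
static bool mode_is_undef(uint16_t mode)
{
	return mode == BRIDGE_MODE_UNDEF;
}

/* What "u16 mode = -1" stores: -1 converted to a 16-bit unsigned value. */
static uint16_t mode_from_minus_one(void)
{
	uint16_t mode = -1;	/* wraps modulo 2^16, giving 0xFFFF */

	return mode;
}
```

Both spellings store the same bit pattern; the named constant only makes the intent visible at the call site.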

* Re: [PATCH net-next v4 0/2] remove bridge BRIDGE_MODE_SWDEV
From: Jiri Pirko @ 2014-12-08 11:04 UTC
  To: roopa
  Cc: sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen, linville,
	vyasevic, netdev, davem, shm, gospo
In-Reply-To: <1417972147-62196-1-git-send-email-roopa@cumulusnetworks.com>

Please send the patches in reverse order. That would prevent a compile
error during bisection. Thanks.


Sun, Dec 07, 2014 at 06:09:04PM CET, roopa@cumulusnetworks.com wrote:
>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
>
>Roopa Prabhu (2):
>  bridge: remove mode 'swdev'
>  rocker: remove swdev mode
>
>Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
>
> drivers/net/ethernet/rocker/rocker.c |   18 +-----------------
> include/linux/rtnetlink.h            |    2 +-
> include/uapi/linux/if_bridge.h       |    1 -
> net/core/rtnetlink.c                 |   12 +++++++++---
> 4 files changed, 11 insertions(+), 22 deletions(-)
>
>-- 
>1.7.10.4
>

* Re: [PATCH v3 iproute2] bridge link: add option 'self'
From: Jiri Pirko @ 2014-12-08 11:06 UTC
  To: roopa
  Cc: sfeldma, jhs, bcrl, tgraf, john.fastabend, stephen, linville,
	vyasevic, netdev, davem, shm, gospo
In-Reply-To: <1417854061-4675-1-git-send-email-roopa@cumulusnetworks.com>

Sat, Dec 06, 2014 at 09:21:01AM CET, roopa@cumulusnetworks.com wrote:
>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>
>Currently self is set internally only if hwmode is set.
>This makes it necessary for the hw to have a mode.
>There is no hwmode really required to go to hardware. So, introduce
>self for anybody who wants to target hardware.
>
>v1 -> v2
>    - fix a few bugs. Initialize flags to zero: this was required to
>    keep the current behaviour unchanged.
>
>v2 -> v3
>    - fix comment
>
>Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> 

Reviewed-by: Jiri Pirko <jiri@resnulli.us>

* Re: [Xen-devel] [PATCH] xen-netfront: Fix handling packets on compound pages with skb_linearize
From: David Vrabel @ 2014-12-08 11:11 UTC
  To: Luis Henriques, Stefan Bader
  Cc: Wei Liu, Ian Campbell, netdev, Kamal Mostafa, linux-kernel,
	Paul Durrant, David Vrabel, Zoltan Kiss, xen-devel,
	Boris Ostrovsky
In-Reply-To: <20141208101936.GA7491@hercules>

On 08/12/14 10:19, Luis Henriques wrote:
> On Mon, Dec 01, 2014 at 09:55:24AM +0100, Stefan Bader wrote:
>> On 11.08.2014 19:32, Zoltan Kiss wrote:
>>> There is a long known problem with the netfront/netback interface: if the guest
>>> tries to send a packet which constitutes more than MAX_SKB_FRAGS + 1 ring slots,
>>> it gets dropped. The reason is that netback maps these slots to a frag in the
>>> frags array, which is limited by size. Having so many slots can occur since
>>> compound pages were introduced, as the ring protocol slices them up into
>>> individual (non-compound) page aligned slots. The theoretical worst case
>>> scenario looks like this (note, skbs are limited to 64 Kb here):
>>> linear buffer: at most PAGE_SIZE - 17 * 2 bytes, overlapping page boundary,
>>> using 2 slots
>>> first 15 frags: 1 + PAGE_SIZE + 1 bytes long, first and last bytes are at the
>>> end and the beginning of a page, therefore they use 3 * 15 = 45 slots
>>> last 2 frags: 1 + 1 bytes, overlapping page boundary, 2 * 2 = 4 slots
>>> Although I don't think this 51 slots skb can really happen, we need a solution
>>> which can deal with every scenario. In real life the limit is exceeded by only
>>> a few slots, but usually it causes the TCP stream to be blocked, as the retry will
>>> most likely have the same buffer layout.
>>> This patch solves this problem by linearizing the packet. This is not the
>>> fastest way, and it can fail much more easily as it tries to allocate a big linear
>>> area for the whole packet, but it is probably easier by an order of magnitude than
>>> anything else. Probably this code path is not touched very frequently anyway.
>>>
>>> Signed-off-by: Zoltan Kiss <zoltan.kiss@citrix.com>
>>> Cc: Wei Liu <wei.liu2@citrix.com>
>>> Cc: Ian Campbell <Ian.Campbell@citrix.com>
>>> Cc: Paul Durrant <paul.durrant@citrix.com>
>>> Cc: netdev@vger.kernel.org
>>> Cc: linux-kernel@vger.kernel.org
>>> Cc: xen-devel@lists.xenproject.org
>>
>> This does not seem to be marked explicitly as stable. Has someone already asked
>> David Miller to put it on his stable queue? IMO it qualifies quite well and the
>> actual change should be simple to pick/backport.
>>
> 
> Thank you Stefan, I'm queuing this for the next 3.16 kernel release.

Don't backport this yet.  It's broken.  It produces malformed requests
and netback will report a fatal error and stop all traffic on the VIF.

David
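
The worst-case slot count in the quoted commit message can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 4 KiB pages and a `MAX_SKB_FRAGS` of 17; `PAGE_SZ`, `ASSUMED_MAX_SKB_FRAGS`, `slots_for()` and `worst_case_slots()` are hypothetical names used only for illustration, not netfront code.

```c
#define PAGE_SZ 4096u			/* assumption: 4 KiB pages */
#define ASSUMED_MAX_SKB_FRAGS 17u	/* assumption for 4 KiB pages */

/* Ring slots needed for `len` bytes starting `offset` bytes into a page:
 * one slot per page the buffer touches. */
static unsigned int slots_for(unsigned int offset, unsigned int len)
{
	return (offset % PAGE_SZ + len + PAGE_SZ - 1) / PAGE_SZ;
}

/* Worst-case layout from the commit message: linear area plus 17 frags,
 * each placed so that it starts on the last byte of a page. */
static unsigned int worst_case_slots(void)
{
	unsigned int slots = 0;
	int i;

	/* linear buffer: PAGE_SIZE - 17 * 2 bytes, crossing a page boundary */
	slots += slots_for(PAGE_SZ - 1, PAGE_SZ - 17 * 2);
	/* first 15 frags: 1 + PAGE_SIZE + 1 bytes, touching three pages each */
	for (i = 0; i < 15; i++)
		slots += slots_for(PAGE_SZ - 1, 1 + PAGE_SZ + 1);
	/* last 2 frags: 1 + 1 bytes, crossing a page boundary */
	for (i = 0; i < 2; i++)
		slots += slots_for(PAGE_SZ - 1, 2);
	return slots;
}
```

Under these assumptions the total is 2 + 45 + 4 = 51 slots, well above the 18 slots that `MAX_SKB_FRAGS + 1` would allow.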

* Re: [PATCH 2/3] bridge: offload bridge port attributes to switch asic if feature flag set
From: Jiri Pirko @ 2014-12-08 11:14 UTC
  To: Roopa Prabhu
  Cc: Scott Feldman, Arad, Ronen, Netdev, Jamal Hadi Salim,
	Benjamin LaHaise, Thomas Graf, john fastabend,
	stephen@networkplumber.org, John Linville, nhorman@tuxdriver.com,
	Nicolas Dichtel, vyasevic@redhat.com, Florian Fainelli,
	buytenh@wantstofly.org, Aviad Raveh, David S. Miller,
	shm@cumulusnetworks.com, Andy Gospodarek
In-Reply-To: <5484B773.7000809@cumulusnetworks.com>

Sun, Dec 07, 2014 at 09:24:19PM CET, roopa@cumulusnetworks.com wrote:
>On 12/5/14, 10:54 PM, Scott Feldman wrote:
>>On Fri, Dec 5, 2014 at 3:21 PM, Arad, Ronen <ronen.arad@intel.com> wrote:
>>>
>>>>-----Original Message-----
>>>>From: netdev-owner@vger.kernel.org [mailto:netdev-
>>>>owner@vger.kernel.org] On Behalf Of Roopa Prabhu
>>>>Sent: Thursday, December 04, 2014 11:02 PM
>>>>To: Scott Feldman
>>>>Cc: Jiří Pírko; Jamal Hadi Salim; Benjamin LaHaise; Thomas Graf; john
>>>>fastabend; stephen@networkplumber.org; John Linville;
>>>>nhorman@tuxdriver.com; Nicolas Dichtel; vyasevic@redhat.com; Florian
>>>>Fainelli; buytenh@wantstofly.org; Aviad Raveh; Netdev; David S. Miller;
>>>>shm@cumulusnetworks.com; Andy Gospodarek
>>>>Subject: Re: [PATCH 2/3] bridge: offload bridge port attributes to switch asic
>>>>if feature flag set
>>>>
>>>>On 12/4/14, 10:41 PM, Scott Feldman wrote:
>>>>>On Thu, Dec 4, 2014 at 6:26 PM,  <roopa@cumulusnetworks.com> wrote:
>>>>>>From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>>>>
>>>>>>This allows offloading to switch asic without having the user to set
>>>>>>any flag. And this is done in the bridge driver to rollback kernel
>>>>>>settings on hw offload failure if required in the future.
>>>>>>
>>>>>>With this, it also makes sure a notification goes out only after the
>>>>>>attributes are set both in the kernel and hw.
>>>>>I like this approach as it streamlines the steps for the user in
>>>>>setting port flags.  There is one case for FLOODING where you'll have
>>>>>to turn off flooding for both, and then turn on flooding in hw.  You
>>>>>don't want flooding turned on on kernel and hw.
>>>>ok, maybe using the higher bits as in
>>>>https://patchwork.ozlabs.org/patch/413211/
>>>>
>>>>might help with that. Let me think some more.
>>>>>>---
>>>>>>   net/bridge/br_netlink.c |   27 ++++++++++++++++++++++++++-
>>>>>>   1 file changed, 26 insertions(+), 1 deletion(-)
>>>>>>
>>>>>>diff --git a/net/bridge/br_netlink.c b/net/bridge/br_netlink.c index
>>>>>>9f5eb55..ce173f0 100644
>>>>>>--- a/net/bridge/br_netlink.c
>>>>>>+++ b/net/bridge/br_netlink.c
>>>>>>@@ -407,9 +407,21 @@ int br_setlink(struct net_device *dev, struct
>>>>nlmsghdr *nlh)
>>>>>>                                  afspec, RTM_SETLINK);
>>>>>>          }
>>>>>>
>>>>>>+       if ((dev->features & NETIF_F_HW_SWITCH_OFFLOAD) &&
>>>>>>+                       dev->netdev_ops->ndo_bridge_setlink) {
>>>>>>+               int ret = dev->netdev_ops->ndo_bridge_setlink(dev,
>>>>>>+ nlh);
>>>>>I think you want to up-level this to net/core/rtnetlink.c because
>>>>>you're only enabling the feature for one instance of a driver that
>>>>>implements ndo_bridge_setlink: the bridge driver.  If another driver
>>>>>was MASTER and implemented ndo_bridge_setlink, you'd want same check
>>>>>to push setting down to SELF port driver.
>>>>yeah, i thought about that. But i moved it here so that rollback would be
>>>>easier.
>>>There is a need for propagating setlink/dellink requests down multiple levels.
>>>The use-case I have in mind is a bridge at the top, team/bond in the middle, and port devices at the bottom.
>>>A setlink for VLAN filtering attributes would come with MASTER flag set, and either port or bond/team netdev.
>>>How would this be handled?
>>>
>>>The propagation rules between bridge and enslaved port device could be different from those between bond/team and enslaved devices.
>>>The current bridge driver does not propagate VLAN filtering from bridge to its ports as each port could have different configuration. In the case of a bond/team, all members need to have the same configuration so that the bond/team is indistinguishable from a simple port.
>>>
>>>Therefore rtnetlink.c might not have the knowledge for propagation across multiple levels.
>>>It seems that each device which implements ndo_bridge_setlink/ndo_bridge_dellink and could have a master role needs to take care of propagation to its slaves.
>>Thanks Ronen for bringing up this use-case of stacked masters.  I
>>think for VLAN filtering, the stacked master case is handled, not by
>>ndo_setlink/dellink at each level, but with ndo_vlan_rx_kill_vid and
>>ndo_vlan_rx_add_vid.  So the switch port driver can know VLAN
>>membership for port even if port is under bond which is under bridge,
>>by using ndo_vlan_rx_xxx and setting NETIF_F_HW_VLAN_CTAG_FILTER.  The
>>bonding driver's ndo_vlan_rx_xxx handlers seem to keep ports in bond
>>VLAN membership consistent across bond.
>>
>>But in general, ndo_setlink/dellink don't work for the stack use-case
>>for other non-VLAN attributes.  Maybe the answer is to use the VLAN
>>propagation model for other attributes.  ndo_setlink/dellink/getlink
>>have enough weird-isms it might be time to define cleaner ndo ops to
>>propagate the other attrs down.
>And, only the switch asic driver is interested in these attrs. So, seems like
>for these cases, we need to send these attrs to the switchdriver directly
>instead of going through the stack of netdevs ?. see my response to ronen's
>other email.

I think that this should be handled similarly to ndo_vlan_rx_add_vid,
ndo_vlan_rx_kill_vid, ndo_change_mtu and others. Master devices like
bridge, bond, team, etc. should take care of propagating the calls to
lower devices. It might not make sense sometimes, so let the masters
decide.

I think that the feature bit (ethtool flag) should serve only for the user
to actually enable or disable the offload. And thinking about that,
maybe the bit checking should be implemented in switch drivers, not in
bridge and friends.

* Re: 3.12.33 - BUG xfrm_selector_match+0x25/0x2f6
From: Smart Weblications GmbH - Florian Wiessner @ 2014-12-08 11:19 UTC
  To: Julian Anastasov
  Cc: Steffen Klassert, netdev, LKML, stable, Simon Horman, lvs-devel
In-Reply-To: <alpine.LFD.2.11.1412072019410.1885@ja.home.ssi.bg>

Hi Julian,

On 07.12.2014 19:27, Julian Anastasov wrote:
> 	Hello,
>
> On Fri, 5 Dec 2014, Smart Weblications GmbH - Florian Wiessner wrote:
>
>> thank you for the fast responses! I would like to test any patch for 3.12.
>
> 	I'm attaching a patch that avoids rerouting in
> IPVS for LOCAL_IN. Please test it in your setup. My tests
> were with NAT on today's net tree. I checked that it
> compiles for 3.12.33. You can use the default snat_reroute=1.
>

I'm sorry to tell you that your patch does not fix the problem. The BUG happens
as soon as the client sends PASV; the FTP server never returns "Entering
Passive Mode":

[   91.862502] BUG: unable to handle kernel NULL pointer dereference at 0000000000000014
[   91.862735] IP: [<ffffffffa013a470>] nf_ct_seqadj_set+0x60/0x90 [nf_conntrack]
[   91.862889] PGD 0
[   91.863026] Oops: 0000 [#1] SMP
[   91.863235] Modules linked in: netconsole xt_nat xt_multiport ip_vs_rr veth
iptable_mangle xt_mark nf_conntrack_netlink nfnetlink ipt_MASQUERADE iptable_nat
nf_nat_ipv4 nf_conntrack_ipv4 nf_defrag_ipv4 ipt_REJECT xt_tcpudp iptable_filter
ip_tables cpufreq_ondemand cpufreq_powersave cpufreq_conservative
cpufreq_userspace ocfs2_stack_o2cb ocfs2_dlm bridge stp llc bonding fuse
nf_conntrack_ftp 8021q openvswitch gre vxlan xt_conntrack x_tables ocfs2_dlmfs
dlm sctp ocfs2 ocfs2_nodemanager ocfs2_stackglue configfs rbd kvm_intel kvm
coretemp ip_vs_ftp ip_vs nf_nat nf_conntrack i2c_i801 psmouse serio_raw lpc_ich
mfd_core evdev btrfs lzo_decompress lzo_compress
[   91.866846] CPU: 1 PID: 18895 Comm: vsftpd Not tainted 3.12.33 #5
[   91.866927] Hardware name: Supermicro X9SCI/X9SCA/X9SCI/X9SCA, BIOS 1.1a 09/28/2011
[   91.867023] task: ffff8807b9360540 ti: ffff8807afe90000 task.ti: ffff8807afe90000
[   91.867116] RIP: 0010:[<ffffffffa013a470>]  [<ffffffffa013a470>] nf_ct_seqadj_set+0x60/0x90 [nf_conntrack]
[   91.867268] RSP: 0018:ffff88083fc43988  EFLAGS: 00010206
[   91.867346] RAX: 000000000000000c RBX: ffff88079aeb006c RCX: 0000000000000003
[   91.867428] RDX: 000000000000002a RSI: 0000000000000003 RDI: ffff88079aeb006c
[   91.867509] RBP: 00000000ce63f6dd R08: ffff8807b2eed780 R09: ffff88083fc43998
[   91.867598] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
[   91.867679] R13: 0000000000000000 R14: 0000000000000003 R15: ffff880815d948bc
[   91.867761] FS:  00007f1a8aad5700(0000) GS:ffff88083fc40000(0000) knlGS:0000000000000000
[   91.867855] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   91.867926] CR2: 0000000000000014 CR3: 00000007a386a000 CR4: 00000000000407e0
[   91.868008] Stack:
[   91.868073]  ffff88081690d220 0000000000000012 0000000000000014 ffff88079aeb0068
[   91.868383]  ffff880815d94801 ffffffffa014f681 0000000000000000 ffffffff00000045
[   91.868694]  ffff880800000048 0000001b00000003 ffff88083fc43a60 ffff88081690d220
[   91.869003] Call Trace:
[   91.869077]  <IRQ>
[   91.869136]  [<ffffffffa014f681>] ? __nf_nat_mangle_tcp_packet+0x109/0x120 [nf_nat]
[   91.869356]  [<ffffffffa017749e>] ? ip_vs_ftp_out.part.8+0x2b2/0x338 [ip_vs_ftp]
[   91.869460]  [<ffffffffa015f884>] ? ip_vs_app_pkt_out+0x105/0x18b [ip_vs]
[   91.869539]  [<ffffffffa0163028>] ? tcp_snat_handler+0x6b/0x320 [ip_vs]
[   91.869622]  [<ffffffffa0155d3d>] ? ip_vs_conn_out_get_proto+0x1c/0x25 [ip_vs]
[   91.869736]  [<ffffffffa015893c>] ? ip_vs_out+0x2a5/0x5f6 [ip_vs]
[   91.869826]  [<ffffffff8150f544>] ? ip_frag_mem+0x2a/0x2a
[   91.869906]  [<ffffffff81508e1f>] ? nf_iterate+0x42/0x80
[   91.869996]  [<ffffffff81508ec6>] ? nf_hook_slow+0x69/0xff
[   91.870073]  [<ffffffff8150f544>] ? ip_frag_mem+0x2a/0x2a
[   91.870153]  [<ffffffff8150f8ae>] ? ip_forward+0x22d/0x2cf
[   91.870230]  [<ffffffff814e57ce>] ? __netif_receive_skb_core+0x5f0/0x66c
[   91.870311]  [<ffffffff814e59df>] ? process_backlog+0x13e/0x13e
[   91.870389]  [<ffffffffa0455e09>] ? br_handle_frame_finish+0x382/0x382 [bridge]
[   91.870482]  [<ffffffff814e5a2b>] ? netif_receive_skb+0x4c/0x7d
[   91.870561]  [<ffffffffa0455d95>] ? br_handle_frame_finish+0x30e/0x382 [bridge]
[   91.870652]  [<ffffffffa0455fda>] ? br_handle_frame+0x1d1/0x217 [bridge]
[   91.870733]  [<ffffffff814e567d>] ? __netif_receive_skb_core+0x49f/0x66c
[   91.870817]  [<ffffffff8104daa3>] ? call_timer_fn+0x4b/0xf6
[   91.870893]  [<ffffffff814e592b>] ? process_backlog+0x8a/0x13e
[   91.870972]  [<ffffffff814e5c31>] ? net_rx_action+0xa2/0x1c0
[   91.871051]  [<ffffffff81047e2e>] ? __do_softirq+0xf6/0x24f
[   91.871132]  [<ffffffff815ad7dc>] ? call_softirq+0x1c/0x30
[   91.871203]  <EOI>
[   91.871260]  [<ffffffff8100464d>] ? do_softirq+0x2c/0x5f
[   91.871470]  [<ffffffff81047ca1>] ? local_bh_enable+0x67/0x85
[   91.871545]  [<ffffffff81511689>] ? ip_finish_output+0x2c9/0x322
[   91.871628]  [<ffffffff8151240a>] ? ip_queue_xmit+0x2b7/0x2f0
[   91.871714]  [<ffffffff81524772>] ? tcp_transmit_skb+0x6ef/0x755
[   91.871792]  [<ffffffff815250e8>] ? tcp_write_xmit+0x886/0x9cb
[   91.871872]  [<ffffffff8152527a>] ? __tcp_push_pending_frames+0x24/0x7e
[   91.871951]  [<ffffffff8151a33c>] ? tcp_sendmsg+0xa4c/0xbfc
[   91.872036]  [<ffffffff814d3477>] ? sock_aio_write+0xe3/0xfd
[   91.872129]  [<ffffffff81122f4d>] ? do_sync_write+0x59/0x79
[   91.872215]  [<ffffffff811239e3>] ? vfs_write+0xc4/0x182
[   91.872298]  [<ffffffff81123daf>] ? SyS_write+0x45/0x7c
[   91.872382]  [<ffffffff815ac35b>] ? tracesys+0xdd/0xe2
[   91.872461] Code: 68 14 4d 01 c5 45 85 e4 74 46 f0 80 4f 78 40 48 8d 5f 04 48
89 df e8 00 12 47 e1 31 c0 41 83 fe 02 0f 97 c0 48 6b c0 0c 4c 01 e8 <8b> 70 08
39 70 04 74 08 89 ea 0f ca 39 10 79 0d 89 70 04 44 01
[   91.876166] RIP  [<ffffffffa013a470>] nf_ct_seqadj_set+0x60/0x90 [nf_conntrack]
[   91.876327]  RSP <ffff88083fc43988>
[   91.876400] CR2: 0000000000000014
[   91.876497] ---[ end trace 2c6d9f405db2170c ]---
[   91.876578] Kernel panic - not syncing: Fatal exception in interrupt
[   91.876666] Rebooting in 10 seconds..
[  101.935360] ACPI MEMORY or I/O RESET_REG.



node01:/ocfs2/usr/src/linux-3.12.33/scripts# ./decodecode </tmp/node01-kernel-ipvs.log
[ 91.872461] Code: 68 14 4d 01 c5 45 85 e4 74 46 f0 80 4f 78 40 48 8d 5f 04 48
89 df e8 00 12 47 e1 31 c0 41 83 fe 02 0f 97 c0 48 6b c0 0c 4c 01 e8 <8b> 70 08
39 70 04 74 08 89 ea 0f ca 39 10 79 0d 89 70 04 44 01
All code
========
   0:   68 14 4d 01 c5          pushq  $0xffffffffc5014d14
   5:   45 85 e4                test   %r12d,%r12d
   8:   74 46                   je     0x50
   a:   f0 80 4f 78 40          lock orb $0x40,0x78(%rdi)
   f:   48 8d 5f 04             lea    0x4(%rdi),%rbx
  13:   48 89 df                mov    %rbx,%rdi
  16:   e8 00 12 47 e1          callq  0xffffffffe147121b
  1b:   31 c0                   xor    %eax,%eax
  1d:   41 83 fe 02             cmp    $0x2,%r14d
  21:   0f 97 c0                seta   %al
  24:   48 6b c0 0c             imul   $0xc,%rax,%rax
  28:   4c 01 e8                add    %r13,%rax
  2b:*  8b 70 08                mov    0x8(%rax),%esi           <-- trapping instruction
  2e:   39 70 04                cmp    %esi,0x4(%rax)
  31:   74 08                   je     0x3b
  33:   89 ea                   mov    %ebp,%edx
  35:   0f ca                   bswap  %edx
  37:   39 10                   cmp    %edx,(%rax)
  39:   79 0d                   jns    0x48
  3b:   89 70 04                mov    %esi,0x4(%rax)
  3e:   44                      rex.R
  3f:   01                      .byte 0x1

Code starting with the faulting instruction
===========================================
   0:   8b 70 08                mov    0x8(%rax),%esi
   3:   39 70 04                cmp    %esi,0x4(%rax)
   6:   74 08                   je     0x10
   8:   89 ea                   mov    %ebp,%edx
   a:   0f ca                   bswap  %edx
   c:   39 10                   cmp    %edx,(%rax)
   e:   79 0d                   jns    0x1d
  10:   89 70 04                mov    %esi,0x4(%rax)
  13:   44                      rex.R
  14:   01                      .byte 0x1
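
The register dump and the decoded instructions are consistent with indexing an array whose base pointer is NULL: R13 is 0, the `seta`/`imul $0xc` pair selects element 1 of a 12-byte element array (RAX = 0xc), and the trapping load reads at offset 8, which gives CR2 = 0x14. A minimal sketch of that address arithmetic follows. The element layout is inferred only from the disassembly, and `struct elem`, its field names and `trap_address()` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 12-byte element, inferred only from the disassembly:
 * "imul $0xc" gives the element size; fields are accessed at +0, +4, +8. */
struct elem {
	uint32_t f0;	/* compared at (%rax) */
	uint32_t f4;	/* compared at 0x4(%rax) */
	uint32_t f8;	/* loaded at 0x8(%rax): the trapping access */
};

/* Address touched by "mov 0x8(%rax),%esi" for a given array base (%r13)
 * and element index (the seta result, 0 or 1). */
static uintptr_t trap_address(uintptr_t base, unsigned int index)
{
	uintptr_t rax = base + index * sizeof(struct elem);

	return rax + offsetof(struct elem, f8);
}
```

With a base of 0 and index 1, `trap_address()` yields 0x14, matching the CR2 value in the oops.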



-- 

Kind regards,

Florian Wiessner

Smart Weblications GmbH
Martinsberger Str. 1
D-95119 Naila

phone: +49 9282 9638 200
fax: +49 9282 9638 205
24/7: +49 900 144 000 00 - 0.99 EUR/min*
http://www.smart-weblications.de

--
Registered office: Naila
Managing director: Florian Wiessner
Commercial register: HRB 3840, Amtsgericht Hof
*from German landlines; prices from mobile networks may differ

* Re: Where exactly will arch_fast_hash be used
From: Hannes Frederic Sowa @ 2014-12-08 11:25 UTC
  To: George Spelvin
  Cc: davem, dborkman, herbert, linux-kernel, netdev, tgraf, tytso
In-Reply-To: <20141207213354.20910.qmail@ns.horizon.com>

Hi,

On Sun, Dec 7, 2014, at 22:33, George Spelvin wrote:
> On Sun, 2014-12-07 at 15:06 +0100, Hannes Frederic Sowa wrote:
> > In case of openvswitch it shows a performance improvement. The seed
> > parameter could be used as an initial biasing of the crc32 function, but
> > in case of openvswitch it is only set to 0.
> 
> NACK.
> 
> This is the Fatal Error in thinking that Herbert was warning about.
> The seed parameter doesn't affect CRC32 collisions *at all* if the inputs
> are the same size.
> 
> For fixed-size inputs, a non-zero seed is equivalent to XORing a
> constant into the output of the CRC computation.

Sorry for being unclear, I understood that and didn't bother patching
that '0' with a random seed exactly because of this.

> For *different* sized inputs, a non-zero seed detects zero-padding
> better than a zero one, but *which* non-zero value is also irrelevant;
> all-ones is the traditional choice because it's simplest in hardware.
> 
> 
> A CRC is inherently linear.  CRC(a^b) = CRC(a) ^ CRC(b).  This makes
> them easy to analyze mathematically and gives them a number of nice
> properties for detecting hardware corruption.
> 
> But that same simplicity makes it *ridiculously* easy to generate
> collisions if you try.

Yes, understood and I totally agree we shouldn't use crc32 hashing in a
lot of places where unsafe data is going to be hashed and inserted into
hash tables.
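
Both quoted claims, linearity and the seed acting as a constant XOR for fixed-size inputs, can be checked with a small bitwise CRC. This is a minimal sketch, assuming the raw CRC32c update (reflected polynomial 0x82F63B78) with no initial or final inversion, so the seed is used exactly as passed in. `crc32c_raw()`, `crc_is_linear()` and `seed_effect()` are hypothetical helpers, not the kernel's arch_fast_hash implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c update (reflected polynomial 0x82F63B78), with no
 * initial or final inversion: the seed is the starting state. */
static uint32_t crc32c_raw(uint32_t crc, const uint8_t *buf, size_t len)
{
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Returns 1 if CRC(a ^ b) == CRC(a) ^ CRC(b) for two 8-byte inputs. */
static int crc_is_linear(const uint8_t a[8], const uint8_t b[8])
{
	uint8_t x[8];
	size_t i;

	for (i = 0; i < 8; i++)
		x[i] = a[i] ^ b[i];
	return crc32c_raw(0, x, 8) ==
	       (crc32c_raw(0, a, 8) ^ crc32c_raw(0, b, 8));
}

/* What the seed contributes for an 8-byte input: CRC(seed, m) ^ CRC(0, m).
 * For a fixed length this does not depend on m. */
static uint32_t seed_effect(uint32_t seed, const uint8_t m[8])
{
	return crc32c_raw(seed, m, 8) ^ crc32c_raw(0, m, 8);
}
```

Because `seed_effect()` returns the same value for every 8-byte message, two inputs that collide with seed 0 collide with any other seed too.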

> One way of looking at a CRC is to say that each bit in the input
> has a CRC.  The CRC of a message string is just the XOR of the CRCs
> of the individual bits that are set in the message.
> 
> Now, a CRC polynomial is chosen so that all of the bits of a
> message have different CRCs.  Obviously, there's a limit: when the
> message is 2^n bits long, it's not possible for all the bits to
> have different, non-zero n-bit CRCs.
> 
> But a CRC is a really efficient way of assigning different bit patterns
> to different input bits up to that limit.
> 
> (Something like CRC32c is also chosen so that, for messages up to a
> reasonable length, no 3-bit, 4-bit, etc. combinations have CRCs that
> XOR to zero.)
> 
> 
> But, and this might be what Herbert was trying to say and I was
> misunderstanding, if you then *truncate* that CRC, the CRCs of the
> message bits lose that uniqueness guarantee.  They're just pseudorandom
> numbers, and a CRC loses its special collision-resistance properties.
> 
> It's just an ordinary random hash, and thanks to the birthday paradox,
> you're likely to find two bits whose CRCs agree in any particular 8 bits
> within roughly sqrt(2*256) or 22 bits.
> 
> Here are a few such collisions for the least significant 8 bits of
> CRC32c:
> 
> Msg1    CRC32c          Msg2    CRC32c          Match
> 1<<11   3fc5f181        1<<30   bf672381        81
> 1<<12   9d14c3b8        1<<31   dd45aab8        b8
> 1<<5    c79a971f        1<<44   6006181f        1f
> 1<<15   13a29877        1<<45   b2f53777        77
> 
> There's nothing special about the lsbits of the CRC.
> Within 64 bits, the most significant 8 bits have it worse:
> 
> 1<<5    c79a971f        1<<17   c76580d9        c7
> 1<<6    e13b70f7        1<<18   e144fb14        e1
> 1<<19   70a27d8a        1<<38   7022df58        70
> 1<<20   38513ec5        1<<39   38116fac        38
> 1<<13   4e8a61dc        1<<52   4e2dfd53        4e
> 1<<23   a541927e        1<<53   a5e0c5d1        a5
> 
> 
> Now, I'd like to stress that this collision rate is no worse than any
> *other* hash function.  A truncated CRC loses its special resistance to
> the birthday paradox (you'd have been much smarter to use 8-bit CRC),
> but it doesn't become especially bad.  A truncated SHA-1 will have
> collisions just as often.
> 
> The concern with a CRC is that, once you've found one collision, you've
> found a huge number of them.  Just XOR the bit pattern of your choice
> into both of the colliding messages, and you have a new collision.

Ack.
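
The "one collision gives you many" point can be sketched as follows, assuming the same raw CRC32c update (no initial or final inversion) truncated to its low 8 bits. The search over 257 four-byte messages is guaranteed to find a collision by the pigeonhole principle. `crc_low8()`, `find_collision()` and `collision_survives()` are hypothetical helpers for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC32c update (reflected polynomial 0x82F63B78), raw state. */
static uint32_t crc32c_raw(uint32_t crc, const uint8_t *buf, size_t len)
{
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		crc ^= buf[i];
		for (bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
	}
	return crc;
}

/* Low 8 bits of the CRC of v, encoded as a 4-byte little-endian message. */
static uint8_t crc_low8(uint32_t v)
{
	uint8_t m[4] = {
		(uint8_t)v, (uint8_t)(v >> 8),
		(uint8_t)(v >> 16), (uint8_t)(v >> 24),
	};

	return (uint8_t)crc32c_raw(0, m, 4);
}

/* Find two distinct values below 257 whose truncated CRCs agree.
 * 257 messages into 256 buckets: a hit is guaranteed. Returns 1 on success. */
static int find_collision(uint32_t *m1, uint32_t *m2)
{
	int first_seen[256];
	uint32_t v;
	int i;

	for (i = 0; i < 256; i++)
		first_seen[i] = -1;
	for (v = 0; v < 257; v++) {
		uint8_t h = crc_low8(v);

		if (first_seen[h] >= 0) {
			*m1 = (uint32_t)first_seen[h];
			*m2 = v;
			return 1;
		}
		first_seen[h] = (int)v;
	}
	return 0;
}

/* XOR the same pattern into both messages; returns 1 if they still collide. */
static int collision_survives(uint32_t m1, uint32_t m2, uint32_t pattern)
{
	return crc_low8(m1 ^ pattern) == crc_low8(m2 ^ pattern);
}
```

By linearity, `collision_survives()` holds for every pattern once `find_collision()` has produced a pair, which is why a single collision is enough to generate as many as an attacker wants.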

> For another example, if you consider the CRC32c of all possible 1-byte
> messages *and then take only the low byte*, there are only 128 possible
> values.
> 
> It turns out that the byte 0x5d has a CRC32c of 0xee0d9600.  This ends
> in 00, so if I XOR 0x5d into anything, the low 8 bits of the CRC
> don't change.
> 
> Likewise, the message "23 00" has a CRC32c of 0x00ee0d96.  So you can
> XOR 0x23 into the second-last byte of anything, and the high 8 bits of
> the CRC don't change.

A very interesting read, thanks for your mail!

Bye,
Hannes

* Re: [PATCH][net-next] net: avoid to call skb_queue_len again
From: Sergei Shtylyov @ 2014-12-08 11:28 UTC
  To: Li RongQing, Eric Dumazet; +Cc: netdev
In-Reply-To: <CAJFZqHzeESfMBqb1gQr4DsE-v_fSfUX+AyWbt_VriJJ8Dj2FnA@mail.gmail.com>

Hello.

On 12/8/2014 3:46 AM, Li RongQing wrote:

>>>> From: Li RongQing <roy.qing.li@gmail.com>

>>>> the queue length of sd->input_pkt_queue has been putted into qlen,

>>>      s/putted/put/, it's an irregular verb.

> I will fix it and resend this patch

>>>> and impossible to change, since hold the lock

>>>      I can't parse that. Who holds the lock?

>> This thread/cpu holds the lock to manipulate input_pkt_queue.

>> Otherwise, the following would break horribly....

>> __skb_queue_tail(&sd->input_pkt_queue, skb);

> Thanks Eric

    I expect you to also refine the description, so that it's meaningful, 
unlike now.

> -Roy

WBR, Sergei

* Re: [PATCH net v2 3/5] cxgb4i: handle non-pdu-aligned rx and additional types of negative advice
From: Sergei Shtylyov @ 2014-12-08 11:36 UTC
  To: kxie, linux-scsi, netdev
  Cc: hariprasad, anish, hch, James.Bottomley, michaelc, davem
In-Reply-To: <201412080958.sB89wEsj005499@localhost6.localdomain6>

Hello.

On 12/8/2014 12:58 PM, kxie@chelsio.com wrote:

> [PATCH net v2 3/5] cxgb4i: handle non-pdu-aligned rx data and additional types of negative advice

    The patch summary shouldn't be duplicated in the change log, at least not 
like this...

> From: Karen Xie <kxie@chelsio.com>

> - abort the connection upon receiving of cpl_rx_data, which means the pdu cannot be recovered from the tcp stream. This could be due to pdu header corruption.
> - handle additional types of negative advice returned by h/w.

> Signed-off-by: Karen Xie <kxie@chelsio.com>
> ---
>   drivers/scsi/cxgbi/cxgb4i/cxgb4i.c |   34 +++++++++++++++++++++++++++++++---
>   1 files changed, 31 insertions(+), 3 deletions(-)

> diff --git a/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c b/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
> index b834bde..051adab 100644
> --- a/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
> +++ b/drivers/scsi/cxgbi/cxgb4i/cxgb4i.c
> @@ -845,6 +845,13 @@ static void csk_act_open_retry_timer(unsigned long data)
>
>   }
>
> +static inline int is_neg_adv(unsigned int status)

    'bool' fits better.

> +{
> +	return status == CPL_ERR_RTX_NEG_ADVICE ||
> +		status == CPL_ERR_KEEPALV_NEG_ADVICE ||
> +		status == CPL_ERR_PERSIST_NEG_ADVICE;
> +}
> +
[...]
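
The `bool` suggestion amounts to a one-word change in the helper's return type. A minimal standalone sketch follows; the numeric `CPL_ERR_*` values are placeholders chosen for illustration, since the real constants come from the cxgb4 headers and may differ.

```c
#include <stdbool.h>

/* Placeholder values for illustration only; the real CPL_ERR_* constants
 * are defined in the cxgb4 headers and may differ. */
enum {
	CPL_ERR_RTX_NEG_ADVICE = 35,
	CPL_ERR_PERSIST_NEG_ADVICE = 36,
	CPL_ERR_KEEPALV_NEG_ADVICE = 37,
};

/* Same predicate as in the patch, returning bool instead of int. */
static inline bool is_neg_adv(unsigned int status)
{
	return status == CPL_ERR_RTX_NEG_ADVICE ||
	       status == CPL_ERR_KEEPALV_NEG_ADVICE ||
	       status == CPL_ERR_PERSIST_NEG_ADVICE;
}
```

Callers that only test the result in a condition need no change.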
> @@ -1027,6 +1033,27 @@ rel_skb:
>   	__kfree_skb(skb);
>   }
>
> +static void do_rx_data(struct cxgbi_device *cdev, struct sk_buff *skb)
> +{
> +	struct cxgbi_sock *csk;
> +	struct cpl_rx_data *cpl = (struct cpl_rx_data *)skb->data;
> +	unsigned int tid = GET_TID(cpl);
> +	struct cxgb4_lld_info *lldi = cxgbi_cdev_priv(cdev);
> +	struct tid_info *t = lldi->tids;
> +
> +	csk = lookup_tid(t, tid);
> +	if (!csk) {
> +		pr_err("can't find connection for tid %u.\n", tid);
> +	} else {
> +		/* not expecting this, reset the connection. */
> +		pr_err("csk 0x%p, tid %u, rcv cpl_rx_data.\n", csk, tid);

    Are both situations considered an error?

> +		spin_lock_bh(&csk->lock);
> +		send_abort_req(csk);
> +		spin_unlock_bh(&csk->lock);
> +	}
> +	__kfree_skb(skb);
> +}
> +
>   static void do_rx_iscsi_hdr(struct cxgbi_device *cdev, struct sk_buff *skb)
>   {
>   	struct cxgbi_sock *csk;
[...]

WBR, Sergei


* Re: [PATCH net-next v3 2/2] rocker: remove swdev mode
From: Daniel Borkmann @ 2014-12-08 11:41 UTC
  To: Jiri Pirko
  Cc: Thomas Graf, roopa, sfeldma, jhs, bcrl, john.fastabend, stephen,
	linville, vyasevic, netdev, davem, shm, gospo
In-Reply-To: <20141208110301.GA1885@nanopsycho.brq.redhat.com>

On 12/08/2014 12:03 PM, Jiri Pirko wrote:
> Sun, Dec 07, 2014 at 09:19:28AM CET, tgraf@suug.ch wrote:
>> On 12/06/14 at 10:54pm, roopa@cumulusnetworks.com wrote:
>>> From: Roopa Prabhu <roopa@cumulusnetworks.com>
>>>

Please provide a normal, non-empty commit message like
everyone else ...

>>> Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
...
>>> diff --git a/drivers/net/ethernet/rocker/rocker.c b/drivers/net/ethernet/rocker/rocker.c
>>> index fded127..9f1d256 100644
>>> --- a/drivers/net/ethernet/rocker/rocker.c
>>> +++ b/drivers/net/ethernet/rocker/rocker.c
>>> @@ -3755,7 +3739,7 @@ static int rocker_port_bridge_getlink(struct sk_buff *skb, u32 pid, u32 seq,
>>>   				      u32 filter_mask)
>>>   {
>>>   	struct rocker_port *rocker_port = netdev_priv(dev);
>>> -	u16 mode = BRIDGE_MODE_SWDEV;
>>> +	u16 mode = -1;
>>         ^^^
>> I assume you meant s16
>
> well, I see no problem in using u16. IFLA_BRIDGE_MODE attr is u16 so
> mode should stay u16.
>
> But maybe better to add:
> #define BRIDGE_MODE_UNDEF 0xFFFF

Yep, something along these lines seems better.

* Re: [PATCH v2] sh_eth: Optimization for RX excess judgement
From: Sergei Shtylyov @ 2014-12-08 11:47 UTC
  To: Yoshihiro Kaneko, netdev
  Cc: David S. Miller, Simon Horman, Magnus Damm, linux-sh
In-Reply-To: <1418035701-3871-1-git-send-email-ykaneko0929@gmail.com>

Hello.

On 12/8/2014 1:48 PM, Yoshihiro Kaneko wrote:

> From: Mitsuhiro Kimura <mitsuhiro.kimura.kc@renesas.com>

> Both of 'boguscnt' and 'quota' have nearly meaning as the condition of
> the reception loop.
> In order to cut down redundant processing, this patch changes excess
> judgement.

> Signed-off-by: Mitsuhiro Kimura <mitsuhiro.kimura.kc@renesas.com>
> Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com>
> ---

> This patch is based on net-next tree.

> v2 [Yoshihiro Kaneko]
> * re-spin for net-next.
> * remove unneeded check of "quota".

    This is not a complete list. :-/

>   drivers/net/ethernet/renesas/sh_eth.c | 10 +++++-----
>   1 file changed, 5 insertions(+), 5 deletions(-)

> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
> index dbe8606..266c9b2 100644
> --- a/drivers/net/ethernet/renesas/sh_eth.c
> +++ b/drivers/net/ethernet/renesas/sh_eth.c
[...]
> @@ -1501,6 +1499,8 @@ static int sh_eth_rx(struct net_device *ndev, u32 intr_status, int *quota)
>   		sh_eth_write(ndev, EDRRR_R, EDRRR);
>   	}
>
> +	*quota -= limit - boguscnt + 1;
> +

    Sorry for the wrong previous suggestion, it clearly should have been -1, 
not +1. :-<

[...]

WBR, Sergei


^ permalink raw reply

* Re: [PATCH net-next V1] net/mlx4_en: ethtool force speed when asking for autoneg=off
From: Saeed Mahameed @ 2014-12-08 11:42 UTC (permalink / raw)
  To: David Laight
  Cc: Amir Vadai, David S. Miller, netdev@vger.kernel.org, Or Gerlitz,
	Yevgeny Petrilin
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1CA04B67@AcuExch.aculab.com>


> On Dec 8, 2014, at 12:57 PM, David Laight <David.Laight@ACULAB.COM> wrote:
> 
> From: Amir Vadai
>> From: Saeed Mahameed <saeedm@mellanox.com>
>> 
>> Use cmd->autoneg == AUTONEG_DISABLE as a user hint to force specific speed.
>> We don't want to rely on ethtool to calculate advertised link modes when
>> forcing specific speed, a user can request a specific speed and specify
>> "autoneg off" in ethtool command to give a hint for forcing this speed.
> 
> I'm not 100% sure what you are trying to achieve?
Hey David
I am not trying to fix the operating mode of the PHY with this patch.
Here I am trying to let the driver choose what to advertise when the user wants to force a specific speed using cmd->autoneg = off.

In the driver:
if (cmd->autoneg == AUTONEG_DISABLE)
	advertise_according_speed(cmd->speed);
else
	advertise(cmd->advertising);
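
The two branches above could be sketched in plain C roughly as follows. This is only an illustration under stated assumptions: the struct, the mode bits and the helper names are hypothetical stand-ins, not the mlx4_en or ethtool definitions.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the ethtool autoneg values. */
enum { SKETCH_AUTONEG_DISABLE = 0, SKETCH_AUTONEG_ENABLE = 1 };

/* Hypothetical link-mode bits, one per speed, for illustration only. */
#define SKETCH_ADV_1G	(1u << 0)
#define SKETCH_ADV_10G	(1u << 1)
#define SKETCH_ADV_40G	(1u << 2)

/* Reduced model of the request coming from ethtool. */
struct sketch_cmd {
	uint32_t advertising;	/* modes requested by user space */
	uint32_t speed;		/* requested speed in Mb/s */
	uint8_t autoneg;	/* SKETCH_AUTONEG_* */
};

/* Map a requested speed to the single mode that should be advertised. */
static uint32_t sketch_modes_for_speed(uint32_t speed)
{
	switch (speed) {
	case 1000:
		return SKETCH_ADV_1G;
	case 10000:
		return SKETCH_ADV_10G;
	case 40000:
		return SKETCH_ADV_40G;
	default:
		return 0;	/* unknown speed: nothing to advertise */
	}
}

/* autoneg off is treated as a hint: derive the modes from the speed. */
static uint32_t sketch_pick_advertising(const struct sketch_cmd *cmd)
{
	if (cmd->autoneg == SKETCH_AUTONEG_DISABLE)
		return sketch_modes_for_speed(cmd->speed);
	return cmd->advertising;
}
```

Note that in both branches something is still advertised to the link partner; only the source of the mode mask differs.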

Thanks
-Saeed
> By far the safest way to 'force' a specific speed is to set the
> advertised modes to contain only the desired speed.
> Doing anything else on links that are capable of auto-negotiation
> is a complete recipe for disaster.
> 
> Even if you fix the operating mode of the PHY and MAC you almost
> certainly want to advertise that mode to the remote system.
> 
> Yes, I know this is made all the more complicated by 10/100M autodetect.
> 
>    David
> 
> 
> 

^ permalink raw reply

* Re: [PATCH v2] sh_eth: Optimization for RX excess judgement
From: Yoshihiro Kaneko @ 2014-12-08 12:16 UTC (permalink / raw)
  To: Sergei Shtylyov
  Cc: netdev, David S. Miller, Simon Horman, Magnus Damm, Linux-sh list
In-Reply-To: <54858FCD.3080206@cogentembedded.com>

Hello Sergei,

2014-12-08 20:47 GMT+09:00 Sergei Shtylyov <sergei.shtylyov@cogentembedded.com>:
> Hello.
>
> On 12/8/2014 1:48 PM, Yoshihiro Kaneko wrote:
>
>> From: Mitsuhiro Kimura <mitsuhiro.kimura.kc@renesas.com>
>
>
>> Both of 'boguscnt' and 'quota' have nearly meaning as the condition of
>> the reception loop.
>> In order to cut down redundant processing, this patch changes excess
>> judgement.
>
>
>> Signed-off-by: Mitsuhiro Kimura <mitsuhiro.kimura.kc@renesas.com>
>> Signed-off-by: Yoshihiro Kaneko <ykaneko0929@gmail.com>
>> ---
>
>
>> This patch is based on net-next tree.
>
>
>> v2 [Yoshihiro Kaneko]
>> * re-spin for net-next.
>> * remove unneeded check of "quota".
>
>
>    This is not a complete list. :-/

Sorry, I'll update it next time.

>
>>   drivers/net/ethernet/renesas/sh_eth.c | 10 +++++-----
>>   1 file changed, 5 insertions(+), 5 deletions(-)
>
>
>> diff --git a/drivers/net/ethernet/renesas/sh_eth.c b/drivers/net/ethernet/renesas/sh_eth.c
>> index dbe8606..266c9b2 100644
>> --- a/drivers/net/ethernet/renesas/sh_eth.c
>> +++ b/drivers/net/ethernet/renesas/sh_eth.c
>
> [...]
>>
>> @@ -1501,6 +1499,8 @@ static int sh_eth_rx(struct net_device *ndev, u32 intr_status, int *quota)
>>                 sh_eth_write(ndev, EDRRR_R, EDRRR);
>>         }
>>
>> +       *quota -= limit - boguscnt + 1;
>> +
>
>
>    Sorry for the wrong previous suggestion, it clearly should have been -1,
> not +1. :-<

Oh, I agree.

Thanks,
Kaneko

>
> [...]
>
> WBR, Sergei
>

^ permalink raw reply

* Q: need effective backlog for listen()
From: Ulrich Windl @ 2014-12-08 12:51 UTC (permalink / raw)
  To: netdev

(not subscribed to the list, please keep me on CC:)

Hi!

I have a problem for which I could not find an answer. I suspect the problem arises from Linux deviating from standard functionality...

I have written a server that should accept n TCP connections at most. I was expecting that the backlog parameter of listen() would cause extra connection requests either
1) to be refused
or
2) to time out eventually

(The standard seems to say that extra connections are refused)

However, none of the above seems to be true. Even if my server delays accept()ing new connections, no client ever sees a "connection refused" or "connection timed out". Is there any chance to signal the client that no more connections are accepted at the moment?

Regards,
Ulrich Windl

^ permalink raw reply

* Re: [PATCH net] net/mlx4_en: correct the endianness of doorbell_qpn on big endian platform
From: Wei Yang @ 2014-12-08 14:42 UTC (permalink / raw)
  To: David Laight
  Cc: 'Eric Dumazet', David Miller, weiyang@linux.vnet.ibm.com,
	netdev@vger.kernel.org, gideonn@mellanox.com, edumazet@google.com,
	amirv@mellanox.com
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6D1CA04A51@AcuExch.aculab.com>

On Mon, Dec 08, 2014 at 10:00:19AM +0000, David Laight wrote:
>From: Eric Dumazet
>> On Fri, 2014-12-05 at 21:31 -0800, David Miller wrote:
>> 
>> > Guys, let's figure out what we are doing with this patch.
>> > --
>> 
>> Oh well, patch is fine, please apply it, thanks !
>
>I'm not too sure that the patch doesn't generate a software byteswap
>followed by a byteswapping write on ppc - clearly not ideal.
>It might even generate back to back software byteswaps.
>
>If the write to the doorbell register includes a byteswap on BE (ppc)
>then there is no real value in keeping the value as BE.
>
>OTOH ppc ought to have ways of doing IO writes without the byteswap
>(and byteswapping accesses to non-io memory for that matter).
>
>What happens on a BE system with BE peripherals is another matter.

David

Thanks for your comment.

How about using __raw_writel() in place of iowrite32()? That looks better;
if so, I will prepare another version of the patch.
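
For readers following the byteswap argument, a small userspace model may help. It is a sketch under assumptions, not kernel code: sketch_write_le32() models a register write that always presents little-endian bytes to the device (the usual iowrite32() convention), sketch_write_be32() models a big-endian write, and all names are hypothetical. It shows that a software swap followed by a little-endian write produces the same bytes as a single big-endian write, which is the double work being discussed.

```c
#include <stdint.h>

/* Software byte swap of a 32-bit value. */
static uint32_t sketch_swab32(uint32_t v)
{
	return ((v & 0x000000ffu) << 24) |
	       ((v & 0x0000ff00u) << 8) |
	       ((v & 0x00ff0000u) >> 8) |
	       ((v & 0xff000000u) >> 24);
}

/* Model of a write that always puts little-endian bytes on the bus. */
static void sketch_write_le32(uint8_t reg[4], uint32_t v)
{
	reg[0] = v & 0xff;
	reg[1] = (v >> 8) & 0xff;
	reg[2] = (v >> 16) & 0xff;
	reg[3] = (v >> 24) & 0xff;
}

/* Model of a write that always puts big-endian bytes on the bus. */
static void sketch_write_be32(uint8_t reg[4], uint32_t v)
{
	reg[0] = (v >> 24) & 0xff;
	reg[1] = (v >> 16) & 0xff;
	reg[2] = (v >> 8) & 0xff;
	reg[3] = v & 0xff;
}
```

A raw write with no swap at all only gives the device big-endian bytes when the value already sits in memory in that byte order, so whether it is the better choice depends on how doorbell_qpn is stored.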

>
>	David
>

-- 
Richard Yang
Help you, Help me

^ permalink raw reply

* Re: [PATCH][net-next] net: avoid to call skb_queue_len again
From: Eric Dumazet @ 2014-12-08 15:10 UTC (permalink / raw)
  To: Sergei Shtylyov; +Cc: Li RongQing, netdev
In-Reply-To: <54858B56.1020607@cogentembedded.com>

On Mon, 2014-12-08 at 14:28 +0300, Sergei Shtylyov wrote:

> 
>     I expect you to also refine the description, so that it's meaningful, 
> unlike now.


It seems obvious to me Li is not a native English speaker. I understood
the patch very well, and the changelog seemed fine to me.

How about you provide this description instead, since you seem to care
very much?

Thanks !

^ permalink raw reply

* Re: [PATCH net-next 2/3] netlink: IFLA_PHYS_SWITCH_ID to IFLA_PHYS_PARENT_ID
From: Jiri Pirko @ 2014-12-08 15:17 UTC (permalink / raw)
  To: Andy Gospodarek; +Cc: netdev, sfeldma, jpirko
In-Reply-To: <1417802537-20020-2-git-send-email-gospo@cumulusnetworks.com>

Fri, Dec 05, 2014 at 07:02:16PM CET, gospo@cumulusnetworks.com wrote:
>There has been much discussion about proper nomenclature to use for this
>and I would prefer parent rather than calling every forwarding element a
>switch.

Andy, I must say I really do not like just plain "parent". It is really
not clear what it means as it can mean 1000 things.

I know "switch" is not ideal, but every time anyone talks about this
kind of forwarding device, they use the word "switch" even if it is not
accurate, and everyone knows what they are talking about. Nobody uses
"parent".

For me this is a NACK for this patchset.

Jiri

>
>Signed-off-by: Andy Gospodarek <gospo@cumulusnetworks.com>
>---
> include/uapi/linux/if_link.h | 2 +-
> net/core/rtnetlink.c         | 4 ++--
> 2 files changed, 3 insertions(+), 3 deletions(-)
>
>diff --git a/include/uapi/linux/if_link.h b/include/uapi/linux/if_link.h
>index f7d0d2d..3d8edd8 100644
>--- a/include/uapi/linux/if_link.h
>+++ b/include/uapi/linux/if_link.h
>@@ -145,7 +145,7 @@ enum {
> 	IFLA_CARRIER,
> 	IFLA_PHYS_PORT_ID,
> 	IFLA_CARRIER_CHANGES,
>-	IFLA_PHYS_SWITCH_ID,
>+	IFLA_PHYS_PARENT_ID,
> 	__IFLA_MAX
> };
> 
>diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
>index 61cb7e7..1fe0a16 100644
>--- a/net/core/rtnetlink.c
>+++ b/net/core/rtnetlink.c
>@@ -982,7 +982,7 @@ static int rtnl_phys_switch_id_fill(struct sk_buff *skb, struct net_device *dev)
> 		return err;
> 	}
> 
>-	if (nla_put(skb, IFLA_PHYS_SWITCH_ID, psid.id_len, psid.id))
>+	if (nla_put(skb, IFLA_PHYS_PARENT_ID, psid.id_len, psid.id))
> 		return -EMSGSIZE;
> 
> 	return 0;
>@@ -1222,7 +1222,7 @@ static const struct nla_policy ifla_policy[IFLA_MAX+1] = {
> 	[IFLA_NUM_RX_QUEUES]	= { .type = NLA_U32 },
> 	[IFLA_PHYS_PORT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
> 	[IFLA_CARRIER_CHANGES]	= { .type = NLA_U32 },  /* ignored */
>-	[IFLA_PHYS_SWITCH_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
>+	[IFLA_PHYS_PARENT_ID]	= { .type = NLA_BINARY, .len = MAX_PHYS_ITEM_ID_LEN },
> };
> 
> static const struct nla_policy ifla_info_policy[IFLA_INFO_MAX+1] = {
>-- 
>1.9.3
>

^ permalink raw reply

* Re: wl1251: NVS firmware data
From: Ming Lei @ 2014-12-08 15:18 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Pavel Machek, Greg Kroah-Hartman, John W. Linville,
	Grazvydas Ignotas, linux-wireless@vger.kernel.org,
	Network Development, Linux Kernel Mailing List, Ivaylo Dimitrov,
	Aaro Koskinen, Kalle Valo, Sebastian Reichel, David Gnedt
In-Reply-To: <201412061402.21514@pali>

On Sat, Dec 6, 2014 at 9:02 PM, Pali Rohár <pali.rohar@gmail.com> wrote:
> On Saturday 06 December 2014 13:49:54 Pavel Machek wrote:

>
>  /**
> + * request_firmware_prefer_user: - prefer usermode helper for loading firmware
> + * @firmware_p: pointer to firmware image
> + * @name: name of firmware file
> + * @device: device for which firmware is being loaded
> + *
> + * This function works pretty much like request_firmware(), but it prefer
> + * usermode helper. If usermode helper fails then it fallback to direct access.
> + * Usefull for dynamic or model specific firmware data.
> + **/
> +int request_firmware_prefer_user(const struct firmware **firmware_p,
> +                           const char *name, struct device *device)
> +{
> +       int ret;
> +       __module_get(THIS_MODULE);
> +       ret = _request_firmware(firmware_p, name, device,
> +                               FW_OPT_UEVENT | FW_OPT_PREFER_USER);
> +       module_put(THIS_MODULE);
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(request_firmware_prefer_user);

I'd like to introduce request_firmware_user() which only requests
firmware from user space, and this way is simpler and more flexible
since we have request_firmware_direct() already.

Thanks,
Ming Lei

^ permalink raw reply

* Re: Q: need effective backlog for listen()
From: Eric Dumazet @ 2014-12-08 15:18 UTC (permalink / raw)
  To: Ulrich Windl; +Cc: netdev
In-Reply-To: <5485ACC9020000A100018394@gwsmtp1.uni-regensburg.de>

On Mon, 2014-12-08 at 13:51 +0100, Ulrich Windl wrote:
> (not subscribed to the list, please keep me on CC:)
> 
> Hi!
> 
> I have a problem for which I could not find an answer. I suspect the
> problem arises from Linux deviating from standard functionality...
> 
> I have written a server that should accept n TCP connections at most.
> I was expecting that the backlog parameter of listen will cause extra
> connection requests either
> 1) to be refused
> or
> 2) to time out eventually
> 
> (The standard seems to say that extra connections are refused)
> 
> However, none of the above seems to be true. Even if my server delays
> accept()ing new connections, no client ever sees a "connection
> refused" or "connection timed out". Is there any chance to signal the
> client that no more connections are accepted at the moment?

This 'standard' makes no sense to me, in light of SYNFLOOD attacks.

It actually makes SYNFLOOD attacks very effective.

Have you tried to disable syncookies for a start ?
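
For reference, a minimal sketch of the server side being discussed: a TCP listener on loopback with a caller-chosen backlog. The function name is hypothetical and error handling is reduced to the bare minimum. As far as listen(2) documents it, on Linux the backlog bounds the queue of fully established connections waiting for accept(), and values above net.core.somaxconn are silently capped, so a small backlog alone does not make extra clients see "connection refused".

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: open a loopback TCP listener with the given backlog.
 * Returns the listening socket, or -1 on failure.
 */
static int sketch_open_listener(int backlog)
{
	struct sockaddr_in addr;
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	memset(&addr, 0, sizeof(addr));
	addr.sin_family = AF_INET;
	addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	addr.sin_port = 0;	/* let the kernel pick a free port */

	if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
	    listen(fd, backlog) < 0) {
		close(fd);
		return -1;
	}

	return fd;
}
```

Besides syncookies, the tcp_abort_on_overflow sysctl described in tcp(7) may be worth checking, since it controls whether connections are reset when the accept queue overflows.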

^ permalink raw reply

* Re: wl1251: NVS firmware data
From: Pali Rohár @ 2014-12-08 15:22 UTC (permalink / raw)
  To: Ming Lei
  Cc: Pavel Machek, Greg Kroah-Hartman, John W. Linville,
	Grazvydas Ignotas, linux-wireless@vger.kernel.org,
	Network Development, Linux Kernel Mailing List, Ivaylo Dimitrov,
	Aaro Koskinen, Kalle Valo, Sebastian Reichel, David Gnedt
In-Reply-To: <CACVXFVPOLfDuqc0nLb-zM8vH618DLXy0xtZbUOn5_XvdxRZSDw@mail.gmail.com>

On Monday 08 December 2014 16:18:18 Ming Lei wrote:
> On Sat, Dec 6, 2014 at 9:02 PM, Pali Rohár <pali.rohar@gmail.com> wrote:
> > On Saturday 06 December 2014 13:49:54 Pavel Machek wrote:
> >  /**
> > + * request_firmware_prefer_user: - prefer usermode helper for loading firmware
> > + * @firmware_p: pointer to firmware image
> > + * @name: name of firmware file
> > + * @device: device for which firmware is being loaded
> > + *
> > + * This function works pretty much like request_firmware(), but it prefer
> > + * usermode helper. If usermode helper fails then it fallback to direct access.
> > + * Usefull for dynamic or model specific firmware data.
> > + **/
> > +int request_firmware_prefer_user(const struct firmware **firmware_p,
> > +                           const char *name, struct device *device)
> > +{
> > +       int ret;
> > +       __module_get(THIS_MODULE);
> > +       ret = _request_firmware(firmware_p, name, device,
> > +                               FW_OPT_UEVENT | FW_OPT_PREFER_USER);
> > +       module_put(THIS_MODULE);
> > +       return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(request_firmware_prefer_user);
> 
> I'd like to introduce request_firmware_user() which only
> requests firmware from user space, and this way is simpler
> and more flexible since we have request_firmware_direct()
> already.
> 
> Thanks,
> Ming Lei

Ming, for wl1251 NVS data we need to use the usermode helper first and
fall back to direct load. So I think it is better to handle this
request in the firmware code and not in the driver.

-- 
Pali Rohár
pali.rohar@gmail.com

^ permalink raw reply

* [PATCH] bridge: Remove BR_PROXYARP flooding check code
From: Jouni Malinen @ 2014-12-08 15:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Kyeyoon Park, Jouni Malinen

From: Kyeyoon Park <kyeyoonp@codeaurora.org>

Because dropping broadcast packets for IEEE 802.11 Proxy ARP is more
selective than previously thought, it is better to remove the direct
dropping logic in the bridge code in favor of using the netfilter
infrastructure to provide more control on which frames get dropped. This
code was added in commit 958501163ddd ("bridge: Add support for IEEE
802.11 Proxy ARP").

Signed-off-by: Kyeyoon Park <kyeyoonp@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
---
 net/bridge/br_forward.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/bridge/br_forward.c b/net/bridge/br_forward.c
index f96933a..8a025a7 100644
--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
@@ -185,10 +185,6 @@ static void br_flood(struct net_bridge *br, struct sk_buff *skb,
 		if (unicast && !(p->flags & BR_FLOOD))
 			continue;
 
-		/* Do not flood to ports that enable proxy ARP */
-		if (p->flags & BR_PROXYARP)
-			continue;
-
 		prev = maybe_deliver(prev, p, skb, __packet_hook);
 		if (IS_ERR(prev))
 			goto out;
-- 
1.9.1

^ permalink raw reply related

* Re: wl1251: NVS firmware data
From: Ming Lei @ 2014-12-08 15:35 UTC (permalink / raw)
  To: Pali Rohár
  Cc: Pavel Machek, Greg Kroah-Hartman, John W. Linville,
	Grazvydas Ignotas, linux-wireless@vger.kernel.org,
	Network Development, Linux Kernel Mailing List, Ivaylo Dimitrov,
	Aaro Koskinen, Kalle Valo, Sebastian Reichel, David Gnedt
In-Reply-To: <201412081622.25541@pali>

On Mon, Dec 8, 2014 at 11:22 PM, Pali Rohár <pali.rohar@gmail.com> wrote:
> On Monday 08 December 2014 16:18:18 Ming Lei wrote:
>> On Sat, Dec 6, 2014 at 9:02 PM, Pali Rohár <pali.rohar@gmail.com> wrote:
>> > On Saturday 06 December 2014 13:49:54 Pavel Machek wrote:
>> >  /**
>> > + * request_firmware_prefer_user: - prefer usermode helper for loading firmware
>> > + * @firmware_p: pointer to firmware image
>> > + * @name: name of firmware file
>> > + * @device: device for which firmware is being loaded
>> > + *
>> > + * This function works pretty much like request_firmware(), but it prefer
>> > + * usermode helper. If usermode helper fails then it fallback to direct access.
>> > + * Usefull for dynamic or model specific firmware data.
>> > + **/
>> > +int request_firmware_prefer_user(const struct firmware **firmware_p,
>> > +                           const char *name, struct device *device)
>> > +{
>> > +       int ret;
>> > +       __module_get(THIS_MODULE);
>> > +       ret = _request_firmware(firmware_p, name, device,
>> > +                               FW_OPT_UEVENT | FW_OPT_PREFER_USER);
>> > +       module_put(THIS_MODULE);
>> > +       return ret;
>> > +}
>> > +EXPORT_SYMBOL_GPL(request_firmware_prefer_user);
>>
>> I'd like to introduce request_firmware_user() which only
>> requests firmware from user space, and this way is simpler
>> and more flexible since we have request_firmware_direct()
>> already.
>>
>> Thanks,
>> Ming Lei
>
> Ming, for wl1251 NVS data we need to use the usermode helper first and
> fall back to direct load. So I think it is better to handle this
> request in the firmware code and not in the driver.

Please do that in the driver and don't mess with the firmware loader.

By introducing request_firmware_user(), it is even possible to
clean up the firmware loader further.
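
A driver-side version of the "user helper first, then direct load" order could look roughly like the sketch below. It is a userspace model under assumptions: the two loaders are passed in as function pointers standing in for the proposed request_firmware_user() and the existing request_firmware_direct(), and every sketch_* name, including the stub loaders, is hypothetical.

```c
#include <stddef.h>

/* Reduced stand-in for struct firmware. */
struct sketch_fw {
	const unsigned char *data;
	size_t size;
};

/* Common shape of the two loaders in this model: 0 on success, <0 on error. */
typedef int (*sketch_loader)(const struct sketch_fw **fw, const char *name);

/* Driver-side policy: try the usermode helper, fall back to direct load. */
static int sketch_load_nvs(const struct sketch_fw **fw, const char *name,
			   sketch_loader from_user, sketch_loader direct)
{
	int ret = from_user(fw, name);

	if (ret == 0)
		return 0;
	return direct(fw, name);
}

/* Stub loaders, only here so the policy can be exercised. */
static const unsigned char sketch_blob[] = { 0x01, 0x02 };
static const struct sketch_fw sketch_image = { sketch_blob, sizeof(sketch_blob) };

static int sketch_user_fails(const struct sketch_fw **fw, const char *name)
{
	(void)fw;
	(void)name;
	return -2;	/* pretend the helper found nothing */
}

static int sketch_direct_ok(const struct sketch_fw **fw, const char *name)
{
	(void)name;
	*fw = &sketch_image;
	return 0;
}
```

Keeping the ordering in the driver means the loader only has to offer the two primitives, and each driver can pick the order it needs.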

Thanks,
Ming Lei

^ permalink raw reply

* [PATCH nf-next 1/2] netfilter: conntrack: cache route for forwarded connections
From: Florian Westphal @ 2014-12-08 15:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, brouer, Florian Westphal
In-Reply-To: <1418052964-4632-1-git-send-email-fw@strlen.de>

... to avoid per-packet FIB lookup if possible.

The cached dst is re-used provided the input interface
is the same as that of the previous packet in the same direction.

If not, the cached dst is invalidated.

For ipv6 we also need to store sernum, else dst_check doesn't work,
pointed out by Eric Dumazet.

This should speed up forwarding when conntrack is already in use
anyway, especially when using reverse path filtering -- active RPF
enforces two FIB lookups for each packet.

Before the routing cache removal this didn't matter since RPF was performed
only when route cache didn't yield a result; but without route cache it
comes at higher price.

Julian Anastasov suggested to add NETDEV_UNREGISTER handler to
avoid holding on to dsts of 'frozen' conntracks.

Signed-off-by: Florian Westphal <fw@strlen.de>
---
 Changes since RFC:
  - add NETDEV_UNREGISTER, suggested by Julian
  - cache fib sernum to make ipv6 work, pointed out by Eric
  - make module unload work
  - remove ASSURED test, in case of -j DROP in prerouting or forward
    cache forward hook won't be reached anyway

 include/net/netfilter/nf_conntrack_extend.h  |   4 +
 include/net/netfilter/nf_conntrack_rtcache.h |  34 +++
 net/netfilter/Kconfig                        |  12 +
 net/netfilter/Makefile                       |   3 +
 net/netfilter/nf_conntrack_rtcache.c         | 387 +++++++++++++++++++++++++++
 5 files changed, 440 insertions(+)
 create mode 100644 include/net/netfilter/nf_conntrack_rtcache.h
 create mode 100644 net/netfilter/nf_conntrack_rtcache.c

diff --git a/include/net/netfilter/nf_conntrack_extend.h b/include/net/netfilter/nf_conntrack_extend.h
index 55d1504..1b00d57 100644
--- a/include/net/netfilter/nf_conntrack_extend.h
+++ b/include/net/netfilter/nf_conntrack_extend.h
@@ -30,6 +30,9 @@ enum nf_ct_ext_id {
 #if IS_ENABLED(CONFIG_NETFILTER_SYNPROXY)
 	NF_CT_EXT_SYNPROXY,
 #endif
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_RTCACHE)
+	NF_CT_EXT_RTCACHE,
+#endif
 	NF_CT_EXT_NUM,
 };
 
@@ -43,6 +46,7 @@ enum nf_ct_ext_id {
 #define NF_CT_EXT_TIMEOUT_TYPE struct nf_conn_timeout
 #define NF_CT_EXT_LABELS_TYPE struct nf_conn_labels
 #define NF_CT_EXT_SYNPROXY_TYPE struct nf_conn_synproxy
+#define NF_CT_EXT_RTCACHE_TYPE struct nf_conn_rtcache
 
 /* Extensions: optional stuff which isn't permanently in struct. */
 struct nf_ct_ext {
diff --git a/include/net/netfilter/nf_conntrack_rtcache.h b/include/net/netfilter/nf_conntrack_rtcache.h
new file mode 100644
index 0000000..e2fb302
--- /dev/null
+++ b/include/net/netfilter/nf_conntrack_rtcache.h
@@ -0,0 +1,34 @@
+#include <linux/gfp.h>
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_extend.h>
+
+struct dst_entry;
+
+struct nf_conn_dst_cache {
+	struct dst_entry *dst;
+	int iif;
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV6)
+	u32 cookie;
+#endif
+
+};
+
+struct nf_conn_rtcache {
+	struct nf_conn_dst_cache cached_dst[IP_CT_DIR_MAX];
+};
+
+static inline
+struct nf_conn_rtcache *nf_ct_rtcache_find(const struct nf_conn *ct)
+{
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_RTCACHE)
+	return nf_ct_ext_find(ct, NF_CT_EXT_RTCACHE);
+#else
+	return NULL;
+#endif
+}
+
+static inline int nf_conn_rtcache_iif_get(const struct nf_conn_rtcache *rtc,
+					  enum ip_conntrack_dir dir)
+{
+	return rtc->cached_dst[dir].iif;
+}
diff --git a/net/netfilter/Kconfig b/net/netfilter/Kconfig
index b02660f..c213a61 100644
--- a/net/netfilter/Kconfig
+++ b/net/netfilter/Kconfig
@@ -106,6 +106,18 @@ config NF_CONNTRACK_EVENTS
 
 	  If unsure, say `N'.
 
+config NF_CONNTRACK_RTCACHE
+	tristate "Cache route entries in conntrack objects"
+	depends on NETFILTER_ADVANCED
+	depends on NF_CONNTRACK
+	help
+	  If this option is enabled, the connection tracking code will
+	  cache routing information for each connection that is being
+	  forwarded, at a cost of 32 bytes per conntrack object.
+
+	  To compile it as a module, choose M here.  If unsure, say N.
+	  The module will be called nf_conntrack_rtcache.
+
 config NF_CONNTRACK_TIMEOUT
 	bool  'Connection tracking timeout'
 	depends on NETFILTER_ADVANCED
diff --git a/net/netfilter/Makefile b/net/netfilter/Makefile
index 89f73a9..c174ab2 100644
--- a/net/netfilter/Makefile
+++ b/net/netfilter/Makefile
@@ -18,6 +18,9 @@ obj-$(CONFIG_NETFILTER_NETLINK_LOG) += nfnetlink_log.o
 # connection tracking
 obj-$(CONFIG_NF_CONNTRACK) += nf_conntrack.o
 
+# optional conntrack route cache extension
+obj-$(CONFIG_NF_CONNTRACK_RTCACHE) += nf_conntrack_rtcache.o
+
 # SCTP protocol connection tracking
 obj-$(CONFIG_NF_CT_PROTO_DCCP) += nf_conntrack_proto_dccp.o
 obj-$(CONFIG_NF_CT_PROTO_GRE) += nf_conntrack_proto_gre.o
diff --git a/net/netfilter/nf_conntrack_rtcache.c b/net/netfilter/nf_conntrack_rtcache.c
new file mode 100644
index 0000000..65fef44
--- /dev/null
+++ b/net/netfilter/nf_conntrack_rtcache.c
@@ -0,0 +1,387 @@
+/* route cache for netfilter.
+ *
+ * (C) 2014 Red Hat GmbH
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+#include <linux/types.h>
+#include <linux/netfilter.h>
+#include <linux/skbuff.h>
+#include <linux/stddef.h>
+#include <linux/kernel.h>
+#include <linux/netdevice.h>
+#include <linux/export.h>
+#include <linux/module.h>
+
+#include <net/dst.h>
+
+#include <net/netfilter/nf_conntrack.h>
+#include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_extend.h>
+#include <net/netfilter/nf_conntrack_rtcache.h>
+
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV6)
+#include <net/ip6_fib.h>
+#endif
+
+static void __nf_conn_rtcache_destroy(struct nf_conn_rtcache *rtc,
+				      enum ip_conntrack_dir dir)
+{
+	struct dst_entry *dst = rtc->cached_dst[dir].dst;
+
+	dst_release(dst);
+}
+
+static void nf_conn_rtcache_destroy(struct nf_conn *ct)
+{
+	struct nf_conn_rtcache *rtc = nf_ct_rtcache_find(ct);
+
+	if (!rtc)
+		return;
+
+	__nf_conn_rtcache_destroy(rtc, IP_CT_DIR_ORIGINAL);
+	__nf_conn_rtcache_destroy(rtc, IP_CT_DIR_REPLY);
+}
+
+static void nf_ct_rtcache_ext_add(struct nf_conn *ct)
+{
+	struct nf_conn_rtcache *rtc;
+
+	rtc = nf_ct_ext_add(ct, NF_CT_EXT_RTCACHE, GFP_ATOMIC);
+	if (rtc) {
+		rtc->cached_dst[IP_CT_DIR_ORIGINAL].iif = -1;
+		rtc->cached_dst[IP_CT_DIR_ORIGINAL].dst = NULL;
+		rtc->cached_dst[IP_CT_DIR_REPLY].iif = -1;
+		rtc->cached_dst[IP_CT_DIR_REPLY].dst = NULL;
+	}
+}
+
+static struct nf_conn_rtcache *nf_ct_rtcache_find_usable(struct nf_conn *ct)
+{
+	if (nf_ct_is_untracked(ct))
+		return NULL;
+	return nf_ct_rtcache_find(ct);
+}
+
+static struct dst_entry *
+nf_conn_rtcache_dst_get(const struct nf_conn_rtcache *rtc,
+			enum ip_conntrack_dir dir)
+{
+	return rtc->cached_dst[dir].dst;
+}
+
+static u32 nf_rtcache_get_cookie(int pf, const struct dst_entry *dst)
+{
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV6)
+	if (pf == NFPROTO_IPV6) {
+		const struct rt6_info *rt = (const struct rt6_info *)dst;
+
+		if (rt->rt6i_node)
+			return (u32)rt->rt6i_node->fn_sernum;
+	}
+#endif
+	return 0;
+}
+
+static void nf_conn_rtcache_dst_set(int pf,
+				    struct nf_conn_rtcache *rtc,
+				    struct dst_entry *dst,
+				    enum ip_conntrack_dir dir, int iif)
+{
+	if (rtc->cached_dst[dir].iif != iif)
+		rtc->cached_dst[dir].iif = iif;
+
+	if (rtc->cached_dst[dir].dst != dst) {
+		struct dst_entry *old;
+
+		dst_hold(dst);
+
+		old = xchg(&rtc->cached_dst[dir].dst, dst);
+		dst_release(old);
+
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV6)
+		if (pf == NFPROTO_IPV6)
+			rtc->cached_dst[dir].cookie =
+				nf_rtcache_get_cookie(pf, dst);
+#endif
+	}
+}
+
+static void nf_conn_rtcache_dst_obsolete(struct nf_conn_rtcache *rtc,
+					 enum ip_conntrack_dir dir)
+{
+	struct dst_entry *old;
+
+	pr_debug("Invalidate iif %d for dir %d on cache %p\n",
+		 rtc->cached_dst[dir].iif, dir, rtc);
+
+	old = xchg(&rtc->cached_dst[dir].dst, NULL);
+	dst_release(old);
+	rtc->cached_dst[dir].iif = -1;
+}
+
+static unsigned int nf_rtcache_in(const struct nf_hook_ops *ops,
+				  struct sk_buff *skb,
+				  const struct net_device *in,
+				  const struct net_device *out,
+				  int (*okfn)(struct sk_buff *))
+{
+	struct nf_conn_rtcache *rtc;
+	enum ip_conntrack_info ctinfo;
+	enum ip_conntrack_dir dir;
+	struct dst_entry *dst;
+	struct nf_conn *ct;
+	int iif;
+	u32 cookie;
+
+	if (skb_dst(skb) || skb->sk)
+		return NF_ACCEPT;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct)
+		return NF_ACCEPT;
+
+	rtc = nf_ct_rtcache_find_usable(ct);
+	if (!rtc)
+		return NF_ACCEPT;
+
+	/* if iif changes, don't use cache and let ip stack
+	 * do route lookup.
+	 *
+	 * If rp_filter is enabled it might toss skb, so
+	 * we don't want to avoid these checks.
+	 */
+	dir = CTINFO2DIR(ctinfo);
+	iif = nf_conn_rtcache_iif_get(rtc, dir);
+	if (in->ifindex != iif) {
+		pr_debug("ct %p, iif %d, cached iif %d, skip cached entry\n",
+			 ct, iif, in->ifindex);
+		return NF_ACCEPT;
+	}
+	dst = nf_conn_rtcache_dst_get(rtc, dir);
+	if (dst == NULL)
+		return NF_ACCEPT;
+
+	cookie = nf_rtcache_get_cookie(ops->pf, dst);
+
+	dst = dst_check(dst, cookie);
+	pr_debug("obtained dst %p for skb %p, cookie %d\n", dst, skb, cookie);
+	if (likely(dst))
+		skb_dst_set_noref_force(skb, dst);
+	else
+		nf_conn_rtcache_dst_obsolete(rtc, dir);
+
+	return NF_ACCEPT;
+}
+
+static unsigned int nf_rtcache_forward(const struct nf_hook_ops *ops,
+				       struct sk_buff *skb,
+				       const struct net_device *in,
+				       const struct net_device *out,
+				       int (*okfn)(struct sk_buff *))
+{
+	struct nf_conn_rtcache *rtc;
+	enum ip_conntrack_info ctinfo;
+	enum ip_conntrack_dir dir;
+	struct nf_conn *ct;
+	int iif;
+
+	ct = nf_ct_get(skb, &ctinfo);
+	if (!ct)
+		return NF_ACCEPT;
+
+	if (!nf_ct_is_confirmed(ct)) {
+		if (WARN_ON(nf_ct_rtcache_find(ct)))
+			return NF_ACCEPT;
+		nf_ct_rtcache_ext_add(ct);
+		return NF_ACCEPT;
+	}
+
+	rtc = nf_ct_rtcache_find_usable(ct);
+	if (!rtc)
+		return NF_ACCEPT;
+
+	dir = CTINFO2DIR(ctinfo);
+	iif = nf_conn_rtcache_iif_get(rtc, dir);
+	pr_debug("ct %p, skb %p, dir %d, iif %d, cached iif %d\n",
+		 ct, skb, dir, iif, in->ifindex);
+	if (likely(in->ifindex == iif))
+		return NF_ACCEPT;
+
+	nf_conn_rtcache_dst_set(ops->pf, rtc, skb_dst(skb), dir, in->ifindex);
+	return NF_ACCEPT;
+}
+
+static int nf_rtcache_dst_remove(struct nf_conn *ct, void *data)
+{
+	struct nf_conn_rtcache *rtc = nf_ct_rtcache_find(ct);
+	struct net_device *dev = data;
+
+	if (!rtc)
+		return 0;
+
+	if (dev->ifindex == rtc->cached_dst[IP_CT_DIR_ORIGINAL].iif ||
+	    dev->ifindex == rtc->cached_dst[IP_CT_DIR_REPLY].iif) {
+		nf_conn_rtcache_dst_obsolete(rtc, IP_CT_DIR_ORIGINAL);
+		nf_conn_rtcache_dst_obsolete(rtc, IP_CT_DIR_REPLY);
+	}
+
+	return 0;
+}
+
+static int nf_rtcache_netdev_event(struct notifier_block *this,
+				   unsigned long event, void *ptr)
+{
+	struct net_device *dev = netdev_notifier_info_to_dev(ptr);
+	struct net *net = dev_net(dev);
+
+	if (event == NETDEV_DOWN)
+		nf_ct_iterate_cleanup(net, nf_rtcache_dst_remove, dev, 0, 0);
+
+	return NOTIFY_DONE;
+}
+
+static struct notifier_block nf_rtcache_notifier = {
+	.notifier_call = nf_rtcache_netdev_event,
+};
+
+static struct nf_hook_ops rtcache_ops[] = {
+	{
+		.hook		= nf_rtcache_in,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV4,
+		.hooknum	= NF_INET_PRE_ROUTING,
+		.priority       = NF_IP_PRI_LAST,
+	},
+	{
+		.hook           = nf_rtcache_forward,
+		.owner          = THIS_MODULE,
+		.pf             = NFPROTO_IPV4,
+		.hooknum        = NF_INET_FORWARD,
+		.priority       = NF_IP_PRI_LAST,
+	},
+#if IS_ENABLED(CONFIG_NF_CONNTRACK_IPV6)
+	{
+		.hook		= nf_rtcache_in,
+		.owner		= THIS_MODULE,
+		.pf		= NFPROTO_IPV6,
+		.hooknum	= NF_INET_PRE_ROUTING,
+		.priority       = NF_IP_PRI_LAST,
+	},
+	{
+		.hook           = nf_rtcache_forward,
+		.owner          = THIS_MODULE,
+		.pf             = NFPROTO_IPV6,
+		.hooknum        = NF_INET_FORWARD,
+		.priority       = NF_IP_PRI_LAST,
+	},
+#endif
+};
+
+static struct nf_ct_ext_type rtcache_extend __read_mostly = {
+	.len	= sizeof(struct nf_conn_rtcache),
+	.align	= __alignof__(struct nf_conn_rtcache),
+	.id	= NF_CT_EXT_RTCACHE,
+	.destroy = nf_conn_rtcache_destroy,
+};
+
+static int __init nf_conntrack_rtcache_init(void)
+{
+	int ret = nf_ct_extend_register(&rtcache_extend);
+
+	if (ret < 0) {
+		pr_err("nf_conntrack_rtcache: Unable to register extension\n");
+		return ret;
+	}
+
+	ret = nf_register_hooks(rtcache_ops, ARRAY_SIZE(rtcache_ops));
+	if (ret < 0) {
+		nf_ct_extend_unregister(&rtcache_extend);
+		return ret;
+	}
+
+	ret = register_netdevice_notifier(&nf_rtcache_notifier);
+	if (ret) {
+		nf_unregister_hooks(rtcache_ops, ARRAY_SIZE(rtcache_ops));
+		nf_ct_extend_unregister(&rtcache_extend);
+	}
+
+	return ret;
+}
+
+static int nf_rtcache_ext_remove(struct nf_conn *ct, void *data)
+{
+	struct nf_conn_rtcache *rtc = nf_ct_rtcache_find(ct);
+
+	return rtc != NULL;
+}
+
+static bool __exit nf_conntrack_rtcache_wait_for_dying(struct net *net)
+{
+	bool wait = false;
+	int cpu;
+
+	for_each_possible_cpu(cpu) {
+		struct nf_conntrack_tuple_hash *h;
+		struct hlist_nulls_node *n;
+		struct nf_conn *ct;
+		struct ct_pcpu *pcpu = per_cpu_ptr(net->ct.pcpu_lists, cpu);
+
+		rcu_read_lock();
+		spin_lock_bh(&pcpu->lock);
+
+		hlist_nulls_for_each_entry(h, n, &pcpu->dying, hnnode) {
+			ct = nf_ct_tuplehash_to_ctrack(h);
+			if (nf_ct_rtcache_find(ct) != NULL) {
+				wait = true;
+				break;
+			}
+		}
+		spin_unlock_bh(&pcpu->lock);
+		rcu_read_unlock();
+	}
+
+	return wait;
+}
+
+static void __exit nf_conntrack_rtcache_fini(void)
+{
+	struct net *net;
+	int count = 0;
+
+	/* remove hooks so no new connections get rtcache extension */
+	nf_unregister_hooks(rtcache_ops, ARRAY_SIZE(rtcache_ops));
+
+	synchronize_net();
+
+	unregister_netdevice_notifier(&nf_rtcache_notifier);
+
+	rtnl_lock();
+
+	/* zap all conntracks with rtcache extension */
+	for_each_net(net)
+		nf_ct_iterate_cleanup(net, nf_rtcache_ext_remove, NULL, 0, 0);
+
+	for_each_net(net) {
+		/* .. and make sure they're gone from dying list, too */
+		while (nf_conntrack_rtcache_wait_for_dying(net)) {
+			msleep(200);
+			WARN_ONCE(++count > 25, "Waiting for all rtcache conntracks to go away\n");
+		}
+	}
+
+	rtnl_unlock();
+	synchronize_net();
+	nf_ct_extend_unregister(&rtcache_extend);
+}
+module_init(nf_conntrack_rtcache_init);
+module_exit(nf_conntrack_rtcache_fini);
+
+MODULE_LICENSE("GPL");
+MODULE_AUTHOR("Florian Westphal <fw@strlen.de>");
+MODULE_DESCRIPTION("Conntrack route cache extension");
-- 
2.0.4



* [PATCH nf-next 0/2] netfilter: conntrack: route cache for forwarded connections
From: Florian Westphal @ 2014-12-08 15:36 UTC (permalink / raw)
  To: netfilter-devel; +Cc: netdev, brouer

[ Pablo, in case you deem this too late for -next, just let me know
and I will resend once it's open again ]

This adds an optional forward routing cache extension for netfilter
connection tracking.

The memory cost is an additional 32 bytes per conntrack entry
on x86_64.

Unlike any other currently implemented connection tracking
extension, the rtcache has no run-time tunables; it is always active.

Also, unlike other conntrack extensions, it can be built as a module;
in that case, modprobe/rmmod are used to enable/disable the cache.

Forward test using netperf UDP_STREAM between two network namespaces
(connected via veth devices), throughput:

With conntrack + reverse path filtering (rp_filter sysctl=1):
MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.1.12.2 () port 0 AF_INET
Socket  Message  Elapsed      Messages
 Size    Size     Time         Okay Errors   Throughput
 bytes   bytes    secs            #      #   10^6bits/sec

  212992      64   120.00    26333996      0     112.36
  212992           120.00    26279399            112.13

same, but with rtcache (this patch series):
  212992      64   120.00    34508693      0     147.24
  212992           120.00    34507838            147.23

same but with rp_filter=0 and no conntrack modules active:
  212992      64   120.00    42288748      0     180.43
  212992           120.00    42283439            180.41

IOW, this is only useful if conntrack is used anyway.
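
To put the three runs side by side, here is a minimal sketch that recomputes
the throughput column from the message counts and derives the relative
differences. It assumes netperf reports throughput as
messages * message size * 8 / elapsed time, which matches the figures above;
the function names (throughput_mbit, rel_change_pct, print_summary) are made
up for illustration and are not part of the patch series.

```c
#include <stdio.h>

/* Throughput in 10^6 bits/sec: messages * bytes per message * 8 bits,
 * divided by the elapsed time in seconds (assumed netperf formula). */
double throughput_mbit(unsigned long long msgs, unsigned int msg_bytes,
		       double secs)
{
	return (double)msgs * msg_bytes * 8.0 / secs / 1e6;
}

/* Relative change of 'value' versus 'base', in percent. */
double rel_change_pct(double value, double base)
{
	return (value - base) / base * 100.0;
}

/* Hypothetical helper: compare the sender-side numbers of the three runs. */
void print_summary(void)
{
	/* message counts taken from the first result line of each run */
	double ct      = throughput_mbit(26333996ULL, 64, 120.0);
	double rtcache = throughput_mbit(34508693ULL, 64, 120.0);
	double no_ct   = throughput_mbit(42288748ULL, 64, 120.0);

	printf("conntrack + rp_filter: %.2f\n", ct);
	printf("with rtcache:          %.2f (%+.1f%% vs. conntrack)\n",
	       rtcache, rel_change_pct(rtcache, ct));
	printf("no conntrack:          %.2f (rtcache is %+.1f%% vs. this)\n",
	       no_ct, rel_change_pct(rtcache, no_ct));
}
```

Calling print_summary() from a main() shows the rtcache run at about 31%
above plain conntrack with rp_filter=1, and still about 18% below the
run without conntrack.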
