Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH net v2 2/3] tunnels: Don't apply GRO to multiple layers of encapsulation.
From: Tom Herbert @ 2016-03-28 16:31 UTC (permalink / raw)
  To: Jesse Gross; +Cc: David Miller, Linux Kernel Network Developers
In-Reply-To: <CAEh+42hbYUMK4WMZYDtzfynd1XFb7OxKkmdPOdT1Z7=n-kw4sA@mail.gmail.com>

On Sun, Mar 27, 2016 at 9:38 PM, Jesse Gross <jesse@kernel.org> wrote:
> On Sat, Mar 26, 2016 at 12:41 PM, Tom Herbert <tom@herbertland.com> wrote:
>> On Sat, Mar 19, 2016 at 9:32 AM, Jesse Gross <jesse@kernel.org> wrote:
>>> When drivers express support for TSO of encapsulated packets, they
>>> only mean that they can do it for one layer of encapsulation.
>>> Supporting additional levels would mean updating, at a minimum,
>>> more IP length fields and they are unaware of this.
>>>
>> This patch completely breaks GRO for FOU and GUE.
>>
>>> No encapsulation device expresses support for handling offloaded
>>> encapsulated packets, so we won't generate these types of frames
>>> in the transmit path. However, GRO doesn't have a check for
>>> multiple levels of encapsulation and will attempt to build them.
>>>
>>> UDP tunnel GRO actually does prevent this situation but it only
>>> handles multiple UDP tunnels stacked on top of each other. This
>>> generalizes that solution to prevent any kind of tunnel stacking
>>> that would cause problems.
>>>
>> GUE and FOU regularly create packets that will be both GSO UDP tunnels
>> and IPIP, SIT, GRE, etc. This is by design. There should be no
>> ambiguity in the drivers as to what this means. For instance, if
>> SKB_GSO_UDP_TUNNEL and SKB_GSO_GRE are set that is GRE/UDP packet, a
>> driver can use ndo_features_check to validate.
>>
>> So multiple levels of encapsulation with GRO is perfectly valid and I
>> would suggest to simply revert this patch. The one potential issue we
>> could have is the potential for GRO to construct a packet which is a
>> UDP-encapsulation inside another encapsulation, like UDP-encap in GRE.
>> In this case the GSO flags don't provide enough information to
>> distinguish say between GRE/UDP (common case) and UDP/GRE (uncommon
>> case). To make this clear we can set udp_mark in GRE, ipip, and sit
>> but *not* check for it.
>
> Generally speaking, multiple levels of encapsulation offload are not
> valid. I think this is pretty clear from the fact that none of the
> encapsulation drivers expose support for encapsulation offloads on
> transmit and the drivers supporting NETIF_F_GSO_GRE and
> NETIF_F_GSO_UDP_TUNNEL do not mean they can handle GRE in VXLAN.
>

Kernel software offload does support this combination just fine.
Seriously, I've tested that more than a thousand times. This is a few
HW implementations you're referring to. The limitations of these
drivers should not dictate how we build the software-- it needs to
work the other way around.

> Asking drivers to assume that this combination of flags means FOU
> doesn't seem right to me. To the best of my knowledge, no driver uses
> ndo_feature_check to do validation of multiple tunnel offload flags
> since the assumption is that the stack will never try to do this.
> Since FOU is being treated as only a single level of encapsulation, I
> think it would be better to just advertise it that way on transmit
> (i.e. only set SKB_GSO_UDP_TUNNEL).
>
If it's not FOU it will be something else. Arbitrarily declaring
multi-levels of encapsulation invalid is simply the wrong direction,
we should be increasing generality and capabilities of the kernel not
holding it down with artificial limits. This is why the generic TSO
work that Alexander and Edward are looking at is so important-- if
this flies then we can offload any combination of encapsulations with
out protocol specific information.

> The change that you are suggesting would result in packets generated
> by GRO that cannot be handled properly on transmit in some cases and
> would likely end up being dropped or malformed. GRO is just an
> optimization and correctness must come first so we cannot use it if it
> might corrupt traffic.

That's (a few) drivers problem. It's no different than if they had
decided that SKB_GSO_UDP_TUNNEL means vxlan, or they can support GRE
in IPv4 offload but not GRE in IPv6, or they can only handle headers
of a certain size, can't handle IPv6 ext. hdrs., etc. As I mentioned,
the long term solution is to eliminate the GSO_ flags and use a
generic segmentation offload interface. But in the interim, it is
*incumbent* on drivers to determine if they can handle a GSO packet
and the interfaces to do that exist. Restricting the capabilities of
GRO just to appease those drivers is not right. Breaking existing GRO
for their benefit is definitely not right.

Tom

^ permalink raw reply

* [PATCH net] team: team should sync the port's uc/mc addrs when add a port
From: Xin Long @ 2016-03-28 16:42 UTC (permalink / raw)
  To: network dev; +Cc: davem, Jiri Pirko, Marcelo Ricardo Leitner

There is an issue when we use mavtap over team:
When we replug nic links from team0, the real nics's mc list will not
include the maddr for macvtap any more. then we can't receive pkts to
macvtap device, as they are filterred by mc list of nic.

In Bonding Driver, it syncs the uc/mc addrs in bond_enslave().

We will fix this issue on team by adding the port's uc/mc addrs sync in
team_port_add.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
---
 drivers/net/team/team.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
index 26c64d2..17ff367 100644
--- a/drivers/net/team/team.c
+++ b/drivers/net/team/team.c
@@ -1198,6 +1198,9 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
 		goto err_dev_open;
 	}
 
+	dev_uc_sync_multiple(port_dev, dev);
+	dev_mc_sync_multiple(port_dev, dev);
+
 	err = vlan_vids_add_by_dev(port_dev, dev);
 	if (err) {
 		netdev_err(dev, "Failed to add vlan ids to device %s\n",
-- 
2.1.0

^ permalink raw reply related

* Re: BUG: net/netfilter: KASAN: stack-out-of-bounds in tcp_packet
From: Jozsef Kadlecsik @ 2016-03-28 16:48 UTC (permalink / raw)
  To: Baozeng Ding
  Cc: Pablo Neira Ayuso, Patrick McHardy, David Miller, netfilter-devel,
	netdev
In-Reply-To: <56F92E21.30509@gmail.com>

Hi David, Pablo,

David, do you agree with the patch for net/ipv4/tcp_input.c? If yes, how 
should I proceed? Should I send the whole patch to you or is it OK to send 
to Pablo?

Best regards,
Jozsef

On Mon, 28 Mar 2016, Baozeng Ding wrote:

> 
> 
> On 2016/3/28 10:35, Baozeng Ding wrote:
> > 
> > 
> > On 2016/3/28 6:25, Jozsef Kadlecsik wrote:
> > > On Mon, 28 Mar 2016, Jozsef Kadlecsik wrote:
> > > 
> > > > On Sun, 27 Mar 2016, Baozeng Ding wrote:
> > > > 
> > > > > The following program triggers stack-out-of-bounds in tcp_packet. The
> > > > > kernel version is 4.5 (on Mar 16 commit
> > > > > 09fd671ccb2475436bd5f597f751ca4a7d177aea).
> > > > > Uncovered with syzkaller. Thanks.
> > > > > 
> > > > > ==================================================================
> > > > > BUG: KASAN: stack-out-of-bounds in tcp_packet+0x4b77/0x51c0 at addr
> > > > > ffff8800a45df3c8
> > > > > Read of size 1 by task 0327/11132
> > > > > page:ffffea00029177c0 count:0 mapcount:0 mapping: (null) index:0x0
> > > > > flags: 0x1fffc0000000000()
> > > > > page dumped because: kasan: bad access detected
> > > > > CPU: 1 PID: 11132 Comm: 0327 Tainted: G    B 4.5.0+ #12
> > > > > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
> > > > > rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> > > > >   0000000000000001 ffff8800a45df148 ffffffff82945051 ffff8800a45df1d8
> > > > >   ffff8800a45df3c8 0000000000000027 0000000000000001 ffff8800a45df1c8
> > > > >   ffffffff81709f88 ffff8800b4f7e3d0 0000000000000028 0000000000000286
> > > > > Call Trace:
> > > > > [<     inline     >] __dump_stack /kernel/lib/dump_stack.c:15
> > > > > [<ffffffff82945051>] dump_stack+0xb3/0x112 /kernel/lib/dump_stack.c:51
> > > > > [<     inline     >] print_address_description
> > > > > /kernel/mm/kasan/report.c:150
> > > > > [<ffffffff81709f88>] kasan_report_error+0x4f8/0x530
> > > > > /kernel/mm/kasan/report.c:236
> > > > > [<ffffffff84c54b8d>] ? skb_copy_bits+0x49d/0x6d0
> > > > > /kernel/net/core/skbuff.c:1675
> > > > > [<     inline     >] ? spin_lock_bh
> > > > > /kernel/include/linux/spinlock.h:307
> > > > > [<ffffffff84e0e9b9>] ? tcp_packet+0x1c9/0x51c0
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:833
> > > > > [<     inline     >] kasan_report /kernel/mm/kasan/report.c:259
> > > > > [<ffffffff81709ffe>] __asan_report_load1_noabort+0x3e/0x40
> > > > > /kernel/mm/kasan/report.c:277
> > > > > [<     inline     >] ? tcp_sack
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:473
> > > > > [<     inline     >] ? tcp_in_window
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:527
> > > > > [<ffffffff84e13367>] ? tcp_packet+0x4b77/0x51c0
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:1036
> > > > > [<     inline     >] tcp_sack
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:473
> > > > > [<     inline     >] tcp_in_window
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:527
> > > > > [<ffffffff84e13367>] tcp_packet+0x4b77/0x51c0
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:1036
> > > > > [<ffffffff817094b8>] ? memset+0x28/0x30 /kernel/mm/kasan/kasan.c:302
> > > > > [<ffffffff84e0dd74>] ? tcp_new+0x1a4/0xc20
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:1122
> > > > > [<     inline     >] ? build_report /kernel/include/net/netlink.h:499
> > > > > [<ffffffff8518c4d6>] ? xfrm_send_report+0x426/0x450
> > > > > /kernel/net/xfrm/xfrm_user.c:3039
> > > > > [<ffffffff84e0e7f0>] ? tcp_new+0xc20/0xc20
> > > > > /kernel/net/netfilter/nf_conntrack_proto_tcp.c:1169
> > > > > [<ffffffff84dfb03a>] ? init_conntrack+0xca/0x9e0
> > > > > /kernel/net/netfilter/nf_conntrack_core.c:972
> > > > > [<ffffffff84dfaf70>] ? nf_conntrack_alloc+0x40/0x40
> > > > > /kernel/net/netfilter/nf_conntrack_core.c:903
> > > > > [<ffffffff84e0cdf0>] ? tcp_init_net+0x6e0/0x6e0
> > > > > /kernel/include/net/netfilter/nf_conntrack_l4proto.h:137
> > > > > [<ffffffff85121732>] ? ipv4_get_l4proto+0x262/0x390
> > > > > /kernel/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c:89
> > > > > [<ffffffff84df372f>] ? nf_ct_get_tuple+0xaf/0x190
> > > > > /kernel/net/netfilter/nf_conntrack_core.c:197
> > > > > [<ffffffff84dfc23e>] nf_conntrack_in+0x8ee/0x1170
> > > > > /kernel/net/netfilter/nf_conntrack_core.c:1177
> > > > > [<ffffffff84dfb950>] ? init_conntrack+0x9e0/0x9e0
> > > > > /kernel/net/netfilter/nf_conntrack_core.c:287
> > > > > [<ffffffff8512ab06>] ? ipt_do_table+0xa16/0x1260
> > > > > /kernel/net/ipv4/netfilter/ip_tables.c:423
> > > > > [<ffffffff81405ced>] ? trace_hardirqs_on+0xd/0x10
> > > > > /kernel/kernel/locking/lockdep.c:2635
> > > > > [<ffffffff81311fcb>] ? __local_bh_enable_ip+0x6b/0xc0
> > > > > /kernel/kernel/softirq.c:175
> > > > > [<ffffffff8512a0f0>] ? check_entry.isra.4+0x190/0x190
> > > > > /kernel/net/ipv6/netfilter/ip6_tables.c:594
> > > > > [<ffffffff84f9d4e0>] ? ip_reply_glue_bits+0xc0/0xc0
> > > > > /kernel/net/ipv4/ip_output.c:1530
> > > > > [<ffffffff851219ae>] ipv4_conntrack_local+0x14e/0x1a0
> > > > > /kernel/net/ipv4/netfilter/nf_conntrack_l3proto_ipv4.c:161
> > > > > [<ffffffff85131b3d>] ? iptable_raw_hook+0x9d/0x1e0
> > > > > /kernel/net/ipv4/netfilter/iptable_raw.c:32
> > > > > [<ffffffff84de5b7d>] nf_iterate+0x15d/0x230
> > > > > /kernel/net/netfilter/core.c:274
> > > > > [<ffffffff84de5c50>] ? nf_iterate+0x230/0x230
> > > > > /kernel/net/netfilter/core.c:268
> > > > > [<ffffffff84de5dfd>] nf_hook_slow+0x1ad/0x310
> > > > > /kernel/net/netfilter/core.c:306
> > > > > [<ffffffff84de5c50>] ? nf_iterate+0x230/0x230
> > > > > /kernel/net/netfilter/core.c:268
> > > > > [<ffffffff84de5c50>] ? nf_iterate+0x230/0x230
> > > > > /kernel/net/netfilter/core.c:268
> > > > > [<ffffffff82979274>] ? prandom_u32+0x24/0x30 /kernel/lib/random32.c:83
> > > > > [<ffffffff84f747ff>] ? ip_idents_reserve+0x9f/0xf0
> > > > > /kernel/net/ipv4/route.c:484
> > > > > [<     inline     >] nf_hook_thresh
> > > > > /kernel/include/linux/netfilter.h:187
> > > > > [<     inline     >] nf_hook /kernel/include/linux/netfilter.h:197
> > > > > [<ffffffff84fa4f53>] __ip_local_out+0x263/0x3c0
> > > > > /kernel/net/ipv4/ip_output.c:104
> > > > > [<ffffffff84fa4cf0>] ? ip_finish_output+0xd00/0xd00
> > > > > /kernel/include/net/ip.h:322
> > > > > [<ffffffff84fa0230>] ? __ip_flush_pending_frames.isra.45+0x2e0/0x2e0
> > > > > /kernel/net/ipv4/ip_output.c:1337
> > > > > [<ffffffff84faa336>] ? __ip_make_skb+0xfe6/0x1610
> > > > > /kernel/net/ipv4/ip_output.c:1436
> > > > > [<ffffffff84fa50dd>] ip_local_out+0x2d/0x1c0
> > > > > /kernel/net/ipv4/ip_output.c:113
> > > > > [<ffffffff84faa99c>] ip_send_skb+0x3c/0xc0
> > > > > /kernel/net/ipv4/ip_output.c:1443
> > > > > [<ffffffff84faaa84>] ip_push_pending_frames+0x64/0x80
> > > > > /kernel/net/ipv4/ip_output.c:1463
> > > > > [<     inline     >] rcu_read_unlock
> > > > > /kernel/include/linux/rcupdate.h:922
> > > > > [<ffffffff8504e10b>] raw_sendmsg+0x17bb/0x25c0
> > > > > /kernel/net/ieee802154/socket.c:53
> > > > > [<ffffffff8504c950>] ? dst_output+0x190/0x190
> > > > > /kernel/include/net/dst.h:492
> > > > > [<     inline     >] ? trace_mm_page_alloc
> > > > > /kernel/include/trace/events/kmem.h:217
> > > > > [<ffffffff81621609>] ? __alloc_pages_nodemask+0x559/0x16b0
> > > > > /kernel/mm/page_alloc.c:3368
> > > > > [<ffffffff81406260>] ? debug_check_no_locks_freed+0x290/0x290
> > > > > /kernel/kernel/locking/lockdep.c:4104
> > > > > [<ffffffff814c0e30>] ? is_module_text_address+0x10/0x20
> > > > > /kernel/kernel/module.c:4057
> > > > > [<ffffffff81360533>] ? __kernel_text_address+0x73/0xa0
> > > > > /kernel/kernel/extable.c:103
> > > > > [<ffffffff81406260>] ? debug_check_no_locks_freed+0x290/0x290
> > > > > /kernel/kernel/locking/lockdep.c:4104
> > > > > [<ffffffff81406260>] ? debug_check_no_locks_freed+0x290/0x290
> > > > > /kernel/kernel/locking/lockdep.c:4104
> > > > > [<ffffffff81405ced>] ? trace_hardirqs_on+0xd/0x10
> > > > > /kernel/kernel/locking/lockdep.c:2635
> > > > > [<ffffffff81406260>] ? debug_check_no_locks_freed+0x290/0x290
> > > > > /kernel/kernel/locking/lockdep.c:4104
> > > > > [<     inline     >] ? sock_rps_record_flow
> > > > > /kernel/include/net/sock.h:874
> > > > > [<ffffffff85089113>] ? inet_sendmsg+0x73/0x4c0
> > > > > /kernel/net/ipv4/af_inet.c:729
> > > > > [<     inline     >] ? rcu_read_unlock
> > > > > /kernel/include/linux/rcupdate.h:922
> > > > > [<     inline     >] ? sock_rps_record_flow_hash
> > > > > /kernel/include/net/sock.h:867
> > > > > [<     inline     >] ? sock_rps_record_flow
> > > > > /kernel/include/net/sock.h:874
> > > > > [<ffffffff8508929a>] ? inet_sendmsg+0x1fa/0x4c0
> > > > > /kernel/net/ipv4/af_inet.c:729
> > > > > [<ffffffff85089395>] inet_sendmsg+0x2f5/0x4c0
> > > > > /kernel/net/ipv4/af_inet.c:736
> > > > > [<     inline     >] ? sock_rps_record_flow
> > > > > /kernel/include/net/sock.h:874
> > > > > [<ffffffff85089113>] ? inet_sendmsg+0x73/0x4c0
> > > > > /kernel/net/ipv4/af_inet.c:729
> > > > > [<ffffffff850890a0>] ? inet_recvmsg+0x4a0/0x4a0
> > > > > /kernel/include/linux/compiler.h:222
> > > > > [<     inline     >] sock_sendmsg_nosec /kernel/net/socket.c:611
> > > > > [<ffffffff84c3434a>] sock_sendmsg+0xca/0x110 /kernel/net/socket.c:621
> > > > > [<ffffffff84c35448>] SYSC_sendto+0x208/0x350 /kernel/net/socket.c:1651
> > > > > [<ffffffff84c35240>] ? SYSC_connect+0x2e0/0x2e0
> > > > > /kernel/net/socket.c:1543
> > > > > [<ffffffff81698650>] ? __pmd_alloc+0x350/0x350
> > > > > /kernel/mm/memory.c:3928
> > > > > [<ffffffff81230b3b>] ? __do_page_fault+0x2ab/0x8e0
> > > > > /kernel/arch/x86/mm/fault.c:1184
> > > > > [<ffffffff81230c30>] ? __do_page_fault+0x3a0/0x8e0
> > > > > /kernel/arch/x86/mm/fault.c:1271
> > > > > [<ffffffff813fb5da>] ? up_read+0x1a/0x40
> > > > > /kernel/kernel/locking/rwsem.c:79
> > > > > [<ffffffff81230a29>] ? __do_page_fault+0x199/0x8e0
> > > > > /kernel/arch/x86/mm/fault.c:1187
> > > > > [<ffffffff84c379b0>] SyS_sendto+0x40/0x50 /kernel/net/socket.c:1619
> > > > > [<ffffffff85dab940>] entry_SYSCALL_64_fastpath+0x23/0xc1
> > > > > /kernel/arch/x86/entry/entry_64.S:207
> > > > > Memory state around the buggy address:
> > > > >   ffff8800a45df280: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > >   ffff8800a45df300: f1 f1 f1 f1 00 00 04 f4 f2 f2 f2 f2 00 00 04 f4
> > > > > > ffff8800a45df380: f2 f2 f2 f2 00 00 00 00 00 f4 f4 f4 f3 f3 f3 f3
> > > > >                                                ^
> > > > >   ffff8800a45df400: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> > > > >   ffff8800a45df480: 00 00 00 00 00 00 00 00 f1 f1 f1 f1 01 f4 f4 f4
> > > > > ==================================================================
> > > > > 
> > > > > #include <unistd.h>
> > > > > #include <sys/syscall.h>
> > > > > #include <string.h>
> > > > > #include <stdint.h>
> > > > > #include <pthread.h>
> > > > > #include <sys/socket.h>
> > > > > #include <sys/mman.h>
> > > > > #include <netinet/in.h>
> > > > > int main()
> > > > > {
> > > > >          mmap((void *)0x20000000ul, 0x19000ul, PROT_READ|PROT_WRITE,
> > > > > MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0);
> > > > >          int sock = socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
> > > > >          int sock_dup = dup(sock);
> > > > >          memcpy((void*)0x2000b000,
> > > > > "\x11\xaf\x7d\x99\x91\x3c\x87\x34\x85\x18\xc4\xd6\xf2\x30\x0a", 15);
> > > > >          *(uint16_t*)0x20002fec = (uint16_t)0x2;
> > > > >          *(uint16_t*)0x20002fee = (uint16_t)0x11ab;
> > > > >          *(uint32_t*)0x20002ff0 = (uint32_t)0x100007f;
> > > > >          sendto(sock_dup, (void *)0x2000b000ul, 0xful, 0x8800ul,
> > > > > (struct
> > > > > sockaddr *)0x20002fe4ul, 0x1cul);
> > > > >          memcpy((void*)0x2001504f,
> > > > > "\x7e\xb1\x52\x5b\x78\x85\x27\xe7\xcc\x3d\xf5\x18\x1b\xba\xda\x97\x6c\x18\x72\x0c\xd2\x0a\xa6\x77\xb7\x8b\xa2\xd2\x1d\xf0\x6b\xf6\x1a\x27\x6b\x98\x3e\x0b\x49\x8d\x54\x6e\x9e\xbb\x21\x4a\x72\x79\x1f\x82\xaf\x89\x2c\xf6\xd3\xc9\xd7\xed\x18\x29\x4d\x2e\x03\x15\xe2\x03\x14\xd0\xac\xa5\x81\x37\x73\x88\xa9\xf5\x08\xe5\xef\x5b\x56\xb7\x18\x8f\xe6\x19\xea\x91\x82\x23\xdd\x2c\x5c\xa5\xf0\xfc\xd8\xe2\x8b\x91\x48\x70\x24\xed\xae\xf9\x06\xac\xc4\x53\x01\xc3\xf5\xa3\x10\xef\xf1\xa6\x2b\xae\x72\xc7\x1a\x02\xee\x78\xcd\xd1\x7e\x8c\x9c\x1a\x36\xc7\xd4\x7c\x82\x64\xf7\x8b\x5a\xb0\x72\xa8\x87\x3c\xdc\xd0\xba\xfe\x70\x7d\x8c\x23\x78\xad\x7c\x31\x04\xec\xab\x1e\x4c\xee\xae\x84\xd8\x1a\x1d\x85\xa5\x57\xa8\x24\x53\x08\x1c\x4f\xda\x49\xe5\x3a\x99\x8c\x29\xa1\xed\x4b\x42\x7a\x15\x48\x2a\x22\x3b\x81\xfe\
 x47\x74\xc1\x2f\x64\xcf\x10\xd4\x71\x72\x50\x71\xd7\xf6\xb0\xca\x41\x9a\x5e\x3e\xe4\x31\x19\xd1\x19\x46\x20\x66\x4c\x2f\xea\x76\x17\x2d\x94", 
> > > > > 232);
> > > > >          *(uint16_t*)0x2001501c = (uint16_t)0xa;
> > > > >          *(uint16_t*)0x2001501e = (uint16_t)0x11ab;
> > > > >          *(uint32_t*)0x20015020 = (uint32_t)0xbdc;
> > > > >          *(uint32_t*)0x20015024 = (uint32_t)0x0;
> > > > >          *(uint32_t*)0x20015028 = (uint32_t)0x0;
> > > > >          *(uint32_t*)0x2001502c = (uint32_t)0x0;
> > > > >          *(uint32_t*)0x20015030 = (uint32_t)0x1000000;
> > > > >          *(uint32_t*)0x20015034 = (uint32_t)0x3;
> > > > >          sendto(sock_dup, (void *)0x2001504ful, 0xe8ul, 0x880ul,
> > > > > (struct
> > > > > sockaddr *)0x20015000ul, 0x1cul);
> > > > >          return 0;
> > > > > }
> > > Actually, in order to fix the non-conntrack case too, I believe the next
> > > patch is required:
> > > 
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index d4c5115..365f4fb 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -3716,6 +3716,8 @@ void tcp_parse_options(const struct sk_buff *skb,
> > >               length--;
> > >               continue;
> > >           default:
> > > +            if (length < 2)
> > > +                return;
> > >               opsize = *ptr++;
> > >               if (opsize < 2) /* "silly options" */
> > >                   return;
> > > @@ -3873,6 +3875,8 @@ const u8 *tcp_parse_md5sig_option(const struct
> > > tcphdr *th)
> > >               length--;
> > >               continue;
> > >           default:
> > > +            if (length < 2)
> > > +                return;
> > >               opsize = *ptr++;
> > >               if (opsize < 2 || opsize > length)
> > >                   return NULL;
> > > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c
> > > b/net/netfilter/nf_conntrack_proto_tcp.c
> > > index 278f3b9..7cc1d9c 100644
> > > --- a/net/netfilter/nf_conntrack_proto_tcp.c
> > > +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> > > @@ -410,6 +410,8 @@ static void tcp_options(const struct sk_buff *skb,
> > >               length--;
> > >               continue;
> > >           default:
> > > +            if (length < 2)
> > > +                return;
> > >               opsize=*ptr++;
> > >               if (opsize < 2) /* "silly options" */
> > >                   return;
> > > @@ -470,6 +472,8 @@ static void tcp_sack(const struct sk_buff *skb,
> > > unsigned int dataoff,
> > >               length--;
> > >               continue;
> > >           default:
> > > +            if (length < 2)
> > > +                return;
> > >               opsize = *ptr++;
> > >               if (opsize < 2) /* "silly options" */
> > >                   return; 
> I tested with the patch and it fixed the bug. Thanks.
> > > Best regards,
> > > Jozsef
> > > -
> > > E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
> > > PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
> > > Address : Wigner Research Centre for Physics, Hungarian Academy of
> > > Sciences
> > >            H-1525 Budapest 114, POB. 49, Hungary
> > 
> > 
> 
> 
> 

-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply

* Re: [PATCH net] team: team should sync the port's uc/mc addrs when add a port
From: Marcelo Ricardo Leitner @ 2016-03-28 16:49 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem, Jiri Pirko
In-Reply-To: <320fc71dbbf5ed164ca506cf5750b7bf4e048d44.1459183351.git.lucien.xin@gmail.com>

On Tue, Mar 29, 2016 at 12:42:31AM +0800, Xin Long wrote:
> There is an issue when we use mavtap over team:
> When we replug nic links from team0, the real nics's mc list will not
> include the maddr for macvtap any more. then we can't receive pkts to
> macvtap device, as they are filterred by mc list of nic.
> 
> In Bonding Driver, it syncs the uc/mc addrs in bond_enslave().
> 
> We will fix this issue on team by adding the port's uc/mc addrs sync in
> team_port_add.
> 
> Signed-off-by: Xin Long <lucien.xin@gmail.com>

Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>

> ---
>  drivers/net/team/team.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/drivers/net/team/team.c b/drivers/net/team/team.c
> index 26c64d2..17ff367 100644
> --- a/drivers/net/team/team.c
> +++ b/drivers/net/team/team.c
> @@ -1198,6 +1198,9 @@ static int team_port_add(struct team *team, struct net_device *port_dev)
>  		goto err_dev_open;
>  	}
>  
> +	dev_uc_sync_multiple(port_dev, dev);
> +	dev_mc_sync_multiple(port_dev, dev);
> +
>  	err = vlan_vids_add_by_dev(port_dev, dev);
>  	if (err) {
>  		netdev_err(dev, "Failed to add vlan ids to device %s\n",
> -- 
> 2.1.0
> 

^ permalink raw reply

* Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
From: Tom Herbert @ 2016-03-28 16:54 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, David S . Miller, netdev, Eric Dumazet
In-Reply-To: <56F95884.8010901@hpe.com>

On Mon, Mar 28, 2016 at 9:15 AM, Rick Jones <rick.jones2@hpe.com> wrote:
> On 03/25/2016 03:29 PM, Eric Dumazet wrote:
>>
>> UDP sockets are not short lived in the high usage case, so the added
>> cost of call_rcu() should not be a concern.
>
>
> Even a busy DNS resolver?
>
Should use connectionless UDP sockets.

> rick jones

^ permalink raw reply

* Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
From: Eric Dumazet @ 2016-03-28 17:00 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, David S . Miller, netdev, Tom Herbert
In-Reply-To: <56F95884.8010901@hpe.com>

On Mon, 2016-03-28 at 09:15 -0700, Rick Jones wrote:
> On 03/25/2016 03:29 PM, Eric Dumazet wrote:
> > UDP sockets are not short lived in the high usage case, so the added
> > cost of call_rcu() should not be a concern.
> 
> Even a busy DNS resolver?

If you mean that a busy DNS resolver spends _most_ of its time doing :

fd = socket()
bind(fd  port=0)
  < send and receive one frame >
close(fd)

(If this is the case, may I suggest doing something different, and use
some kind of caches ? It will be way faster.)

Then the result for 10,000,000 loops of <socket()+bind()+close()> are

Before patch : 

real	0m13.665s
user	0m0.548s
sys	0m12.372s

After patch :

real	0m20.599s
user	0m0.465s
sys	0m17.965s

So the worst overhead is 700 ns

This is roughly the cost for bringing 960 bytes from memory, or 15 cache
lines (on x86_64)

# grep UDP /proc/slabinfo 
UDPLITEv6              0      0   1088    7    2 : tunables   24   12    8 : slabdata      0      0      0
UDPv6                 24     49   1088    7    2 : tunables   24   12    8 : slabdata      7      7      0
UDP-Lite               0      0    960    4    1 : tunables   54   27    8 : slabdata      0      0      0
UDP                   30     36    960    4    1 : tunables   54   27    8 : slabdata      9      9      2

In reality, chances that UDP sockets are re-opened right after being
freed and their 15 cache lines are very hot in cpu caches is quite
small, so I would not worry at all about this rather stupid benchmark.

int main(int argc, char *argv[]) {
	struct sockaddr_in addr;
	int i, fd, loops = 10000000;

	for (i = 0; i < loops; i++) {
		fd = socket(AF_INET, SOCK_DGRAM, 0);
		if (fd == -1) {
			perror("socket");
			break;
		}
		memset(&addr, 0, sizeof(addr));
		addr.sin_family = AF_INET;
		if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) == -1) {
			perror("bind");
			break;
		}
		close(fd);
	}
	return 0;
}

^ permalink raw reply

* Re: BUG: net/netfilter: KASAN: stack-out-of-bounds in tcp_packet
From: Pablo Neira Ayuso @ 2016-03-28 17:05 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Baozeng Ding, Patrick McHardy, David Miller, netfilter-devel,
	netdev
In-Reply-To: <alpine.DEB.2.10.1603281844200.23985@blackhole.kfki.hu>

On Mon, Mar 28, 2016 at 06:48:51PM +0200, Jozsef Kadlecsik wrote:
> Hi David, Pablo,
> 
> David, do you agree with the patch for net/ipv4/tcp_input.c? If yes, how 
> should I proceed? Should I send the whole patch to you or is it OK to send 
> to Pablo?

Submit a formal patch and Cc: netdev@vger.kernel.org and
netfilter-devel@vger.kernel.org, then we can request David to ack this
if he considers it fine or the other way around (I ack it he takes
it).

Thanks!

^ permalink raw reply

* Re: [PATCH] net: macb: Only call GPIO functions if there is a valid GPIO
From: Sergei Shtylyov @ 2016-03-28 17:30 UTC (permalink / raw)
  To: Charles Keepax, nicolas.ferre, davem; +Cc: netdev, linux-kernel, patches
In-Reply-To: <1459169262-3941-1-git-send-email-ckeepax@opensource.wolfsonmicro.com>

Hello.

On 03/28/2016 03:47 PM, Charles Keepax wrote:

> GPIOlib will print warning messages if we call GPIO functions without a
> valid GPIO. Change the code to avoid doing so.
>
> Signed-off-by: Charles Keepax <ckeepax@opensource.wolfsonmicro.com>
> ---
>   drivers/net/ethernet/cadence/macb.c | 8 +++++---
>   1 file changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
> index 6619178..71bb42e 100644
> --- a/drivers/net/ethernet/cadence/macb.c
> +++ b/drivers/net/ethernet/cadence/macb.c
> @@ -2957,9 +2957,10 @@ static int macb_probe(struct platform_device *pdev)
>   	phy_node =  of_get_next_available_child(np, NULL);
>   	if (phy_node) {
>   		int gpio = of_get_named_gpio(phy_node, "reset-gpios", 0);
> -		if (gpio_is_valid(gpio))
> +		if (gpio_is_valid(gpio)) {
>   			bp->reset_gpio = gpio_to_desc(gpio);
> -		gpiod_direction_output(bp->reset_gpio, 1);
> +			gpiod_direction_output(bp->reset_gpio, 1);
> +		}

    Oops, sorry for missing this while fixing this code.

>   	}
>   	of_node_put(phy_node);
>
> @@ -3029,7 +3030,8 @@ static int macb_remove(struct platform_device *pdev)
>   		mdiobus_free(bp->mii_bus);
>
>   		/* Shutdown the PHY if there is a GPIO reset */
> -		gpiod_set_value(bp->reset_gpio, 0);
> +		if (bp->reset_gpio)
> +			gpiod_set_value(bp->reset_gpio, 0);

    Hm, this function was previously OK to call with NULL (it didn't curse)...

>
>   		unregister_netdev(dev);
>   		clk_disable_unprepare(bp->tx_clk);

MBR, Sergei

^ permalink raw reply

* Re: [PATCH net v2 2/3] tunnels: Don't apply GRO to multiple layers of encapsulation.
From: Alexander Duyck @ 2016-03-28 17:37 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Jesse Gross, David Miller, Linux Kernel Network Developers
In-Reply-To: <CALx6S35WiXiKDJL6s_nW4xWNx8M6atZVQYEgb8kO5ty2WhjFyQ@mail.gmail.com>

On Mon, Mar 28, 2016 at 9:31 AM, Tom Herbert <tom@herbertland.com> wrote:
> On Sun, Mar 27, 2016 at 9:38 PM, Jesse Gross <jesse@kernel.org> wrote:
>> On Sat, Mar 26, 2016 at 12:41 PM, Tom Herbert <tom@herbertland.com> wrote:
>>> On Sat, Mar 19, 2016 at 9:32 AM, Jesse Gross <jesse@kernel.org> wrote:
>>>> When drivers express support for TSO of encapsulated packets, they
>>>> only mean that they can do it for one layer of encapsulation.
>>>> Supporting additional levels would mean updating, at a minimum,
>>>> more IP length fields and they are unaware of this.
>>>>
>>> This patch completely breaks GRO for FOU and GUE.
>>>
>>>> No encapsulation device expresses support for handling offloaded
>>>> encapsulated packets, so we won't generate these types of frames
>>>> in the transmit path. However, GRO doesn't have a check for
>>>> multiple levels of encapsulation and will attempt to build them.
>>>>
>>>> UDP tunnel GRO actually does prevent this situation but it only
>>>> handles multiple UDP tunnels stacked on top of each other. This
>>>> generalizes that solution to prevent any kind of tunnel stacking
>>>> that would cause problems.
>>>>
>>> GUE and FOU regularly create packets that will be both GSO UDP tunnels
>>> and IPIP, SIT, GRE, etc. This is by design. There should be no
>>> ambiguity in the drivers as to what this means. For instance, if
>>> SKB_GSO_UDP_TUNNEL and SKB_GSO_GRE are set that is GRE/UDP packet, a
>>> driver can use ndo_features_check to validate.
>>>
>>> So multiple levels of encapsulation with GRO is perfectly valid and I
>>> would suggest to simply revert this patch. The one potential issue we
>>> could have is the potential for GRO to construct a packet which is a
>>> UDP-encapsulation inside another encapsulation, like UDP-encap in GRE.
>>> In this case the GSO flags don't provide enough information to
>>> distinguish say between GRE/UDP (common case) and UDP/GRE (uncommon
>>> case). To make this clear we can set udp_mark in GRE, ipip, and sit
>>> but *not* check for it.
>>
>> Generally speaking, multiple levels of encapsulation offload are not
>> valid. I think this is pretty clear from the fact that none of the
>> encapsulation drivers expose support for encapsulation offloads on
>> transmit and the drivers supporting NETIF_F_GSO_GRE and
>> NETIF_F_GSO_UDP_TUNNEL do not mean they can handle GRE in VXLAN.
>>
>
> Kernel software offload does support this combination just fine.
> Seriously, I've tested that more than a thousand times. This is a few
> HW implementations you're referring to. The limitations of these
> drivers should not dictate how we build the software-- it needs to
> work the other way around.

Jesse does have a point.  Drivers aren't checking for this kind of
thing currently as the transmit side doesn't generate these kind of
frames.  The fact that GRO is makes things a bit more messy as we will
really need to restructure the GSO code in order to handle it.  As far
as drivers testing for it I am pretty certain the i40e isn't.  I would
wonder if we need to add yet another GSO bit to indicate that we
support multiple layers of encapsulation.  I'm pretty sure the only
way we could possibly handle it would be in software since what you
are indicating is a indeterminate number of headers that all require
updates.

>> Asking drivers to assume that this combination of flags means FOU
>> doesn't seem right to me. To the best of my knowledge, no driver uses
>> ndo_feature_check to do validation of multiple tunnel offload flags
>> since the assumption is that the stack will never try to do this.
>> Since FOU is being treated as only a single level of encapsulation, I
>> think it would be better to just advertise it that way on transmit
>> (i.e. only set SKB_GSO_UDP_TUNNEL).
>>
> If it's not FOU it will be something else. Arbitrarily declaring
> multi-levels of encapsulation invalid is simply the wrong direction,
> we should be increasing generality and capabilities of the kernel not
> holding it down with artificial limits. This is why the generic TSO
> work that Alexander and Edward are looking at is so important-- if
> this flies then we can offload any combination of encapsulations with
> out protocol specific information.

The segmentation code isn't designed to handle more than 2 layers of
headers.  Currently we have the pointers for the inner headers and the
outer headers.  If you are talking about adding yet another level we
would need additional pointers in the skbuff to handle them and we
currently don't have them at present.

>> The change that you are suggesting would result in packets generated
>> by GRO that cannot be handled properly on transmit in some cases and
>> would likely end up being dropped or malformed. GRO is just an
>> optimization and correctness must come first so we cannot use it if it
>> might corrupt traffic.
>
> That's (a few) drivers problem. It's no different than if they had
> decided that SKB_GSO_UDP_TUNNEL means vxlan, or they can support GRE
> in IPv4 offload but not GRE in IPv6, or they can only handle headers
> of a certain size, can't handle IPv6 ext. hdrs., etc. As I mentioned,
> the long term solution is to eliminate the GSO_ flags and use a
> generic segmentation offload interface. But in the interim, it is
> *incumbent* on drivers to determine if they can handle a GSO packet
> and the interfaces to do that exist. Restricting the capabilities of
> GRO just to appease those drivers is not right. Breaking existing GRO
> for their benefit is definitely not right.

This isn't about if drivers can handle it.  It is about if the skbuff
can handle it.  The problem as it stands right now is that we start
losing data once we go past 1 level of encapsulation.  We only have
the standard mac_header, network_header, transport_header, and then
the inner set of the same pointers.  If we try to go multiple levels
deep we start losing data.

If we are wanting to start allowing unlimited headers then maybe we
need to start allowing for "partial" GRO to complement the partial GSO
implementation I have.  With that at least we might be able to track
the first and last headers and we could leave the remaining headers
static.  Then when the frame made it to the driver it would know that
the headers in the long list can be replicated and it only has to
update the outer network header and the inner transport header,
otherwise if it doesn't support that it can just chop up the frame in
software and send that out.

- Alex

^ permalink raw reply

* [PATCH 4/9] netfilter: x_tables: validate e->target_offset early
From: Pablo Neira Ayuso @ 2016-03-28 17:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

We should check that e->target_offset is sane before
mark_source_chains gets called since it will fetch the target entry
for loop detection.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/arp_tables.c | 17 ++++++++---------
 net/ipv4/netfilter/ip_tables.c  | 17 ++++++++---------
 net/ipv6/netfilter/ip6_tables.c | 17 ++++++++---------
 3 files changed, 24 insertions(+), 27 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index bf08192..830bbe8 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -474,14 +474,12 @@ next:
 	return 1;
 }
 
-static inline int check_entry(const struct arpt_entry *e, const char *name)
+static inline int check_entry(const struct arpt_entry *e)
 {
 	const struct xt_entry_target *t;
 
-	if (!arp_checkentry(&e->arp)) {
-		duprintf("arp_tables: arp check failed %p %s.\n", e, name);
+	if (!arp_checkentry(&e->arp))
 		return -EINVAL;
-	}
 
 	if (e->target_offset + sizeof(struct xt_entry_target) > e->next_offset)
 		return -EINVAL;
@@ -522,10 +520,6 @@ find_check_entry(struct arpt_entry *e, const char *name, unsigned int size)
 	struct xt_target *target;
 	int ret;
 
-	ret = check_entry(e, name);
-	if (ret)
-		return ret;
-
 	e->counters.pcnt = xt_percpu_counter_alloc();
 	if (IS_ERR_VALUE(e->counters.pcnt))
 		return -ENOMEM;
@@ -576,6 +570,7 @@ static inline int check_entry_size_and_hooks(struct arpt_entry *e,
 					     unsigned int valid_hooks)
 {
 	unsigned int h;
+	int err;
 
 	if ((unsigned long)e % __alignof__(struct arpt_entry) != 0 ||
 	    (unsigned char *)e + sizeof(struct arpt_entry) >= limit) {
@@ -590,6 +585,10 @@ static inline int check_entry_size_and_hooks(struct arpt_entry *e,
 		return -EINVAL;
 	}
 
+	err = check_entry(e);
+	if (err)
+		return err;
+
 	/* Check hooks & underflows */
 	for (h = 0; h < NF_ARP_NUMHOOKS; h++) {
 		if (!(valid_hooks & (1 << h)))
@@ -1246,7 +1245,7 @@ check_compat_entry_size_and_hooks(struct compat_arpt_entry *e,
 	}
 
 	/* For purposes of check_entry casting the compat entry is fine */
-	ret = check_entry((struct arpt_entry *)e, name);
+	ret = check_entry((struct arpt_entry *)e);
 	if (ret)
 		return ret;
 
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index e53f8d6..1d72a3c 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -569,14 +569,12 @@ static void cleanup_match(struct xt_entry_match *m, struct net *net)
 }
 
 static int
-check_entry(const struct ipt_entry *e, const char *name)
+check_entry(const struct ipt_entry *e)
 {
 	const struct xt_entry_target *t;
 
-	if (!ip_checkentry(&e->ip)) {
-		duprintf("ip check failed %p %s.\n", e, name);
+	if (!ip_checkentry(&e->ip))
 		return -EINVAL;
-	}
 
 	if (e->target_offset + sizeof(struct xt_entry_target) >
 	    e->next_offset)
@@ -666,10 +664,6 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
 	struct xt_mtchk_param mtpar;
 	struct xt_entry_match *ematch;
 
-	ret = check_entry(e, name);
-	if (ret)
-		return ret;
-
 	e->counters.pcnt = xt_percpu_counter_alloc();
 	if (IS_ERR_VALUE(e->counters.pcnt))
 		return -ENOMEM;
@@ -741,6 +735,7 @@ check_entry_size_and_hooks(struct ipt_entry *e,
 			   unsigned int valid_hooks)
 {
 	unsigned int h;
+	int err;
 
 	if ((unsigned long)e % __alignof__(struct ipt_entry) != 0 ||
 	    (unsigned char *)e + sizeof(struct ipt_entry) >= limit) {
@@ -755,6 +750,10 @@ check_entry_size_and_hooks(struct ipt_entry *e,
 		return -EINVAL;
 	}
 
+	err = check_entry(e);
+	if (err)
+		return err;
+
 	/* Check hooks & underflows */
 	for (h = 0; h < NF_INET_NUMHOOKS; h++) {
 		if (!(valid_hooks & (1 << h)))
@@ -1506,7 +1505,7 @@ check_compat_entry_size_and_hooks(struct compat_ipt_entry *e,
 	}
 
 	/* For purposes of check_entry casting the compat entry is fine */
-	ret = check_entry((struct ipt_entry *)e, name);
+	ret = check_entry((struct ipt_entry *)e);
 	if (ret)
 		return ret;
 
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 84f9baf..26a5ad1 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -581,14 +581,12 @@ static void cleanup_match(struct xt_entry_match *m, struct net *net)
 }
 
 static int
-check_entry(const struct ip6t_entry *e, const char *name)
+check_entry(const struct ip6t_entry *e)
 {
 	const struct xt_entry_target *t;
 
-	if (!ip6_checkentry(&e->ipv6)) {
-		duprintf("ip_tables: ip check failed %p %s.\n", e, name);
+	if (!ip6_checkentry(&e->ipv6))
 		return -EINVAL;
-	}
 
 	if (e->target_offset + sizeof(struct xt_entry_target) >
 	    e->next_offset)
@@ -679,10 +677,6 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
 	struct xt_mtchk_param mtpar;
 	struct xt_entry_match *ematch;
 
-	ret = check_entry(e, name);
-	if (ret)
-		return ret;
-
 	e->counters.pcnt = xt_percpu_counter_alloc();
 	if (IS_ERR_VALUE(e->counters.pcnt))
 		return -ENOMEM;
@@ -753,6 +747,7 @@ check_entry_size_and_hooks(struct ip6t_entry *e,
 			   unsigned int valid_hooks)
 {
 	unsigned int h;
+	int err;
 
 	if ((unsigned long)e % __alignof__(struct ip6t_entry) != 0 ||
 	    (unsigned char *)e + sizeof(struct ip6t_entry) >= limit) {
@@ -767,6 +762,10 @@ check_entry_size_and_hooks(struct ip6t_entry *e,
 		return -EINVAL;
 	}
 
+	err = check_entry(e);
+	if (err)
+		return err;
+
 	/* Check hooks & underflows */
 	for (h = 0; h < NF_INET_NUMHOOKS; h++) {
 		if (!(valid_hooks & (1 << h)))
@@ -1518,7 +1517,7 @@ check_compat_entry_size_and_hooks(struct compat_ip6t_entry *e,
 	}
 
 	/* For purposes of check_entry casting the compat entry is fine */
-	ret = check_entry((struct ip6t_entry *)e, name);
+	ret = check_entry((struct ip6t_entry *)e);
 	if (ret)
 		return ret;
 
-- 
2.1.4


^ permalink raw reply related

* [PATCH 5/9] netfilter: x_tables: make sure e->next_offset covers remaining blob size
From: Pablo Neira Ayuso @ 2016-03-28 17:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

Otherwise this function may read data beyond the ruleset blob.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/arp_tables.c | 6 ++++--
 net/ipv4/netfilter/ip_tables.c  | 6 ++++--
 net/ipv6/netfilter/ip6_tables.c | 6 ++++--
 3 files changed, 12 insertions(+), 6 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 830bbe8..51d4fe5 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -573,7 +573,8 @@ static inline int check_entry_size_and_hooks(struct arpt_entry *e,
 	int err;
 
 	if ((unsigned long)e % __alignof__(struct arpt_entry) != 0 ||
-	    (unsigned char *)e + sizeof(struct arpt_entry) >= limit) {
+	    (unsigned char *)e + sizeof(struct arpt_entry) >= limit ||
+	    (unsigned char *)e + e->next_offset > limit) {
 		duprintf("Bad offset %p\n", e);
 		return -EINVAL;
 	}
@@ -1232,7 +1233,8 @@ check_compat_entry_size_and_hooks(struct compat_arpt_entry *e,
 
 	duprintf("check_compat_entry_size_and_hooks %p\n", e);
 	if ((unsigned long)e % __alignof__(struct compat_arpt_entry) != 0 ||
-	    (unsigned char *)e + sizeof(struct compat_arpt_entry) >= limit) {
+	    (unsigned char *)e + sizeof(struct compat_arpt_entry) >= limit ||
+	    (unsigned char *)e + e->next_offset > limit) {
 		duprintf("Bad offset %p, limit = %p\n", e, limit);
 		return -EINVAL;
 	}
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 1d72a3c..fb7694e 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -738,7 +738,8 @@ check_entry_size_and_hooks(struct ipt_entry *e,
 	int err;
 
 	if ((unsigned long)e % __alignof__(struct ipt_entry) != 0 ||
-	    (unsigned char *)e + sizeof(struct ipt_entry) >= limit) {
+	    (unsigned char *)e + sizeof(struct ipt_entry) >= limit ||
+	    (unsigned char *)e + e->next_offset > limit) {
 		duprintf("Bad offset %p\n", e);
 		return -EINVAL;
 	}
@@ -1492,7 +1493,8 @@ check_compat_entry_size_and_hooks(struct compat_ipt_entry *e,
 
 	duprintf("check_compat_entry_size_and_hooks %p\n", e);
 	if ((unsigned long)e % __alignof__(struct compat_ipt_entry) != 0 ||
-	    (unsigned char *)e + sizeof(struct compat_ipt_entry) >= limit) {
+	    (unsigned char *)e + sizeof(struct compat_ipt_entry) >= limit ||
+	    (unsigned char *)e + e->next_offset > limit) {
 		duprintf("Bad offset %p, limit = %p\n", e, limit);
 		return -EINVAL;
 	}
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 26a5ad1..b248528f 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -750,7 +750,8 @@ check_entry_size_and_hooks(struct ip6t_entry *e,
 	int err;
 
 	if ((unsigned long)e % __alignof__(struct ip6t_entry) != 0 ||
-	    (unsigned char *)e + sizeof(struct ip6t_entry) >= limit) {
+	    (unsigned char *)e + sizeof(struct ip6t_entry) >= limit ||
+	    (unsigned char *)e + e->next_offset > limit) {
 		duprintf("Bad offset %p\n", e);
 		return -EINVAL;
 	}
@@ -1504,7 +1505,8 @@ check_compat_entry_size_and_hooks(struct compat_ip6t_entry *e,
 
 	duprintf("check_compat_entry_size_and_hooks %p\n", e);
 	if ((unsigned long)e % __alignof__(struct compat_ip6t_entry) != 0 ||
-	    (unsigned char *)e + sizeof(struct compat_ip6t_entry) >= limit) {
+	    (unsigned char *)e + sizeof(struct compat_ip6t_entry) >= limit ||
+	    (unsigned char *)e + e->next_offset > limit) {
 		duprintf("Bad offset %p, limit = %p\n", e, limit);
 		return -EINVAL;
 	}
-- 
2.1.4


^ permalink raw reply related

* [PATCH 6/9] netfilter: x_tables: fix unconditional helper
From: Pablo Neira Ayuso @ 2016-03-28 17:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

From: Florian Westphal <fw@strlen.de>

Ben Hawkes says:

 In the mark_source_chains function (net/ipv4/netfilter/ip_tables.c) it
 is possible for a user-supplied ipt_entry structure to have a large
 next_offset field. This field is not bounds checked prior to writing a
 counter value at the supplied offset.

Problem is that mark_source_chains should not have been called --
the rule doesn't have a next entry, so its supposed to return
an absolute verdict of either ACCEPT or DROP.

However, the function conditional() doesn't work as the name implies.
It only checks that the rule is using wildcard address matching.

However, an unconditional rule must also not be using any matches
(no -m args).

The underflow validator only checked the addresses, therefore
passing the 'unconditional absolute verdict' test, while
mark_source_chains also tested for presence of matches, and thus
proceeeded to the next (not-existent) rule.

Unify this so that all the callers have same idea of 'unconditional rule'.

Reported-by: Ben Hawkes <hawkes@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/ipv4/netfilter/arp_tables.c | 18 +++++++++---------
 net/ipv4/netfilter/ip_tables.c  | 23 +++++++++++------------
 net/ipv6/netfilter/ip6_tables.c | 23 +++++++++++------------
 3 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 51d4fe5..a1bb5e7 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -359,11 +359,12 @@ unsigned int arpt_do_table(struct sk_buff *skb,
 }
 
 /* All zeroes == unconditional rule. */
-static inline bool unconditional(const struct arpt_arp *arp)
+static inline bool unconditional(const struct arpt_entry *e)
 {
 	static const struct arpt_arp uncond;
 
-	return memcmp(arp, &uncond, sizeof(uncond)) == 0;
+	return e->target_offset == sizeof(struct arpt_entry) &&
+	       memcmp(&e->arp, &uncond, sizeof(uncond)) == 0;
 }
 
 /* Figures out from what hook each rule can be called: returns 0 if
@@ -402,11 +403,10 @@ static int mark_source_chains(const struct xt_table_info *newinfo,
 				|= ((1 << hook) | (1 << NF_ARP_NUMHOOKS));
 
 			/* Unconditional return/END. */
-			if ((e->target_offset == sizeof(struct arpt_entry) &&
+			if ((unconditional(e) &&
 			     (strcmp(t->target.u.user.name,
 				     XT_STANDARD_TARGET) == 0) &&
-			     t->verdict < 0 && unconditional(&e->arp)) ||
-			    visited) {
+			     t->verdict < 0) || visited) {
 				unsigned int oldpos, size;
 
 				if ((strcmp(t->target.u.user.name,
@@ -551,7 +551,7 @@ static bool check_underflow(const struct arpt_entry *e)
 	const struct xt_entry_target *t;
 	unsigned int verdict;
 
-	if (!unconditional(&e->arp))
+	if (!unconditional(e))
 		return false;
 	t = arpt_get_target_c(e);
 	if (strcmp(t->u.user.name, XT_STANDARD_TARGET) != 0)
@@ -598,9 +598,9 @@ static inline int check_entry_size_and_hooks(struct arpt_entry *e,
 			newinfo->hook_entry[h] = hook_entries[h];
 		if ((unsigned char *)e - base == underflows[h]) {
 			if (!check_underflow(e)) {
-				pr_err("Underflows must be unconditional and "
-				       "use the STANDARD target with "
-				       "ACCEPT/DROP\n");
+				pr_debug("Underflows must be unconditional and "
+					 "use the STANDARD target with "
+					 "ACCEPT/DROP\n");
 				return -EINVAL;
 			}
 			newinfo->underflow[h] = underflows[h];
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index fb7694e..89b5d95 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -168,11 +168,12 @@ get_entry(const void *base, unsigned int offset)
 
 /* All zeroes == unconditional rule. */
 /* Mildly perf critical (only if packet tracing is on) */
-static inline bool unconditional(const struct ipt_ip *ip)
+static inline bool unconditional(const struct ipt_entry *e)
 {
 	static const struct ipt_ip uncond;
 
-	return memcmp(ip, &uncond, sizeof(uncond)) == 0;
+	return e->target_offset == sizeof(struct ipt_entry) &&
+	       memcmp(&e->ip, &uncond, sizeof(uncond)) == 0;
 #undef FWINV
 }
 
@@ -229,11 +230,10 @@ get_chainname_rulenum(const struct ipt_entry *s, const struct ipt_entry *e,
 	} else if (s == e) {
 		(*rulenum)++;
 
-		if (s->target_offset == sizeof(struct ipt_entry) &&
+		if (unconditional(s) &&
 		    strcmp(t->target.u.kernel.target->name,
 			   XT_STANDARD_TARGET) == 0 &&
-		   t->verdict < 0 &&
-		   unconditional(&s->ip)) {
+		   t->verdict < 0) {
 			/* Tail of chains: STANDARD target (return/policy) */
 			*comment = *chainname == hookname
 				? comments[NF_IP_TRACE_COMMENT_POLICY]
@@ -476,11 +476,10 @@ mark_source_chains(const struct xt_table_info *newinfo,
 			e->comefrom |= ((1 << hook) | (1 << NF_INET_NUMHOOKS));
 
 			/* Unconditional return/END. */
-			if ((e->target_offset == sizeof(struct ipt_entry) &&
+			if ((unconditional(e) &&
 			     (strcmp(t->target.u.user.name,
 				     XT_STANDARD_TARGET) == 0) &&
-			     t->verdict < 0 && unconditional(&e->ip)) ||
-			    visited) {
+			     t->verdict < 0) || visited) {
 				unsigned int oldpos, size;
 
 				if ((strcmp(t->target.u.user.name,
@@ -715,7 +714,7 @@ static bool check_underflow(const struct ipt_entry *e)
 	const struct xt_entry_target *t;
 	unsigned int verdict;
 
-	if (!unconditional(&e->ip))
+	if (!unconditional(e))
 		return false;
 	t = ipt_get_target_c(e);
 	if (strcmp(t->u.user.name, XT_STANDARD_TARGET) != 0)
@@ -763,9 +762,9 @@ check_entry_size_and_hooks(struct ipt_entry *e,
 			newinfo->hook_entry[h] = hook_entries[h];
 		if ((unsigned char *)e - base == underflows[h]) {
 			if (!check_underflow(e)) {
-				pr_err("Underflows must be unconditional and "
-				       "use the STANDARD target with "
-				       "ACCEPT/DROP\n");
+				pr_debug("Underflows must be unconditional and "
+					 "use the STANDARD target with "
+					 "ACCEPT/DROP\n");
 				return -EINVAL;
 			}
 			newinfo->underflow[h] = underflows[h];
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index b248528f..541b59f 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -198,11 +198,12 @@ get_entry(const void *base, unsigned int offset)
 
 /* All zeroes == unconditional rule. */
 /* Mildly perf critical (only if packet tracing is on) */
-static inline bool unconditional(const struct ip6t_ip6 *ipv6)
+static inline bool unconditional(const struct ip6t_entry *e)
 {
 	static const struct ip6t_ip6 uncond;
 
-	return memcmp(ipv6, &uncond, sizeof(uncond)) == 0;
+	return e->target_offset == sizeof(struct ip6t_entry) &&
+	       memcmp(&e->ipv6, &uncond, sizeof(uncond)) == 0;
 }
 
 static inline const struct xt_entry_target *
@@ -258,11 +259,10 @@ get_chainname_rulenum(const struct ip6t_entry *s, const struct ip6t_entry *e,
 	} else if (s == e) {
 		(*rulenum)++;
 
-		if (s->target_offset == sizeof(struct ip6t_entry) &&
+		if (unconditional(s) &&
 		    strcmp(t->target.u.kernel.target->name,
 			   XT_STANDARD_TARGET) == 0 &&
-		    t->verdict < 0 &&
-		    unconditional(&s->ipv6)) {
+		    t->verdict < 0) {
 			/* Tail of chains: STANDARD target (return/policy) */
 			*comment = *chainname == hookname
 				? comments[NF_IP6_TRACE_COMMENT_POLICY]
@@ -488,11 +488,10 @@ mark_source_chains(const struct xt_table_info *newinfo,
 			e->comefrom |= ((1 << hook) | (1 << NF_INET_NUMHOOKS));
 
 			/* Unconditional return/END. */
-			if ((e->target_offset == sizeof(struct ip6t_entry) &&
+			if ((unconditional(e) &&
 			     (strcmp(t->target.u.user.name,
 				     XT_STANDARD_TARGET) == 0) &&
-			     t->verdict < 0 &&
-			     unconditional(&e->ipv6)) || visited) {
+			     t->verdict < 0) || visited) {
 				unsigned int oldpos, size;
 
 				if ((strcmp(t->target.u.user.name,
@@ -727,7 +726,7 @@ static bool check_underflow(const struct ip6t_entry *e)
 	const struct xt_entry_target *t;
 	unsigned int verdict;
 
-	if (!unconditional(&e->ipv6))
+	if (!unconditional(e))
 		return false;
 	t = ip6t_get_target_c(e);
 	if (strcmp(t->u.user.name, XT_STANDARD_TARGET) != 0)
@@ -775,9 +774,9 @@ check_entry_size_and_hooks(struct ip6t_entry *e,
 			newinfo->hook_entry[h] = hook_entries[h];
 		if ((unsigned char *)e - base == underflows[h]) {
 			if (!check_underflow(e)) {
-				pr_err("Underflows must be unconditional and "
-				       "use the STANDARD target with "
-				       "ACCEPT/DROP\n");
+				pr_debug("Underflows must be unconditional and "
+					 "use the STANDARD target with "
+					 "ACCEPT/DROP\n");
 				return -EINVAL;
 			}
 			newinfo->underflow[h] = underflows[h];
-- 
2.1.4


^ permalink raw reply related

* [PATCH 7/9] netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails
From: Pablo Neira Ayuso @ 2016-03-28 17:58 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

When netlink unicast fails to deliver the message to userspace, we
should also check if the NFQA_CFG_F_FAIL_OPEN flag is set so we reinject
the packet back to the stack.

I think the user expects no packet drops when this flag is set due to
queueing to userspace errors, no matter if related to the internal queue
or when sending the netlink message to userspace.

The userspace application will still get the ENOBUFS error via recvmsg()
so the user still knows that, with the current configuration that is in
place, the userspace application is not consuming the messages at the
pace that the kernel needs.

Reported-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Tested-by: "Yigal Reiss (yreiss)" <yreiss@cisco.com>
---
 net/netfilter/nfnetlink_queue.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/net/netfilter/nfnetlink_queue.c b/net/netfilter/nfnetlink_queue.c
index 7542999..cb5b630 100644
--- a/net/netfilter/nfnetlink_queue.c
+++ b/net/netfilter/nfnetlink_queue.c
@@ -582,7 +582,12 @@ __nfqnl_enqueue_packet(struct net *net, struct nfqnl_instance *queue,
 	/* nfnetlink_unicast will either free the nskb or add it to a socket */
 	err = nfnetlink_unicast(nskb, net, queue->peer_portid, MSG_DONTWAIT);
 	if (err < 0) {
-		queue->queue_user_dropped++;
+		if (queue->flags & NFQA_CFG_F_FAIL_OPEN) {
+			failopen = 1;
+			err = 0;
+		} else {
+			queue->queue_user_dropped++;
+		}
 		goto err_out_unlock;
 	}
 
-- 
2.1.4


^ permalink raw reply related

* [PATCH 0/9] Netfilter fixes for net
From: Pablo Neira Ayuso @ 2016-03-28 17:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev

Hi David,

The following patchset contains Netfilter fixes for you net tree,
they are:

1) There was a race condition between parallel save/swap and delete,
   which resulted a kernel crash due to the increase ref for save, swap,
   wrong ref decrease operations. Reported and fixed by Vishwanath Pai.

2) OVS should call into CT NAT for packets of new expected connections only
   when the conntrack state is persisted with the 'commit' option to the
   OVS CT action. From Jarno Rajahalme.

3) Resolve kconfig dependencies with new OVS NAT support. From Arnd Bergmann.

4) Early validation of entry->target_offset to make sure it doesn't take us
   out from the blob, from Florian Westphal.

5) Again early validation of entry->next_offset to make sure it doesn't take
   out from the blob, also from Florian.

6) Check that entry->target_offset is always of of sizeof(struct xt_entry)
   for unconditional entries, when checking both from check_underflow()
   and when checking for loops in mark_source_chains(), again from
   Florian.

7) Fix inconsistent behaviour in nfnetlink_queue when
   NFQA_CFG_F_FAIL_OPEN is set and netlink_unicast() fails due to buffer
   overrun, we have to reinject the packet as the user expects.

8) Enforce nul-terminated table names from getsockopt GET_ENTRIES
   requests.

9) Don't assume skb->sk is set from nft_bridge_reject and synproxy,
   this fixes a recent update of the code to namespaceify
   ip_default_ttl, patch from Liping Zhang.

This batch comes with four patches to validate x_tables blobs coming
from userspace. CONFIG_USERNS exposes the x_tables interface to
unpriviledged users and to be honest this interface never received the
attention for this move away from the CAP_NET_ADMIN domain. Florian is
working on another round with more patches with more sanity checks, so
expect a bit more Netfilter fixes in this development cycle than usual.

You can pull these changes from:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git

Thanks!

----------------------------------------------------------------

The following changes since commit d7be81a5916bdb1d904803958e5991a16f7ae4b2:

  ravb: fix software timestamping (2016-03-27 22:41:37 -0400)

are available in the git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf.git HEAD

for you to fetch changes up to 29421198c3a860092e27c2ad8499dfe603398817:

  netfilter: ipv4: fix NULL dereference (2016-03-28 17:59:29 +0200)

----------------------------------------------------------------
Arnd Bergmann (1):
      openvswitch: call only into reachable nf-nat code

Florian Westphal (3):
      netfilter: x_tables: validate e->target_offset early
      netfilter: x_tables: make sure e->next_offset covers remaining blob size
      netfilter: x_tables: fix unconditional helper

Jarno Rajahalme (1):
      openvswitch: Fix checking for new expected connections.

Liping Zhang (1):
      netfilter: ipv4: fix NULL dereference

Pablo Neira Ayuso (2):
      netfilter: nfnetlink_queue: honor NFQA_CFG_F_FAIL_OPEN when netlink unicast fails
      netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES

Vishwanath Pai (1):
      netfilter: ipset: fix race condition in ipset save, swap and delete

 include/linux/netfilter/ipset/ip_set.h   |  4 +++
 net/bridge/netfilter/ebtables.c          |  4 +++
 net/bridge/netfilter/nft_reject_bridge.c | 20 ++++++------
 net/ipv4/netfilter/arp_tables.c          | 43 +++++++++++++------------
 net/ipv4/netfilter/ip_tables.c           | 48 ++++++++++++++--------------
 net/ipv4/netfilter/ipt_SYNPROXY.c        | 54 +++++++++++++++++---------------
 net/ipv6/netfilter/ip6_tables.c          | 48 ++++++++++++++--------------
 net/netfilter/ipset/ip_set_bitmap_gen.h  |  2 +-
 net/netfilter/ipset/ip_set_core.c        | 33 ++++++++++++++++---
 net/netfilter/ipset/ip_set_hash_gen.h    |  2 +-
 net/netfilter/ipset/ip_set_list_set.c    |  2 +-
 net/netfilter/nfnetlink_queue.c          |  7 ++++-
 net/openvswitch/Kconfig                  |  4 ++-
 net/openvswitch/conntrack.c              | 21 +++++++------
 14 files changed, 170 insertions(+), 122 deletions(-)

^ permalink raw reply

* [PATCH 1/9] netfilter: ipset: fix race condition in ipset save, swap and delete
From: Pablo Neira Ayuso @ 2016-03-28 17:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

From: Vishwanath Pai <vpai@akamai.com>

This fix adds a new reference counter (ref_netlink) for the struct ip_set.
The other reference counter (ref) can be swapped out by ip_set_swap and we
need a separate counter to keep track of references for netlink events
like dump. Using the same ref counter for dump causes a race condition
which can be demonstrated by the following script:

ipset create hash_ip1 hash:ip family inet hashsize 1024 maxelem 500000 \
counters
ipset create hash_ip2 hash:ip family inet hashsize 300000 maxelem 500000 \
counters
ipset create hash_ip3 hash:ip family inet hashsize 1024 maxelem 500000 \
counters

ipset save &

ipset swap hash_ip3 hash_ip2
ipset destroy hash_ip3 /* will crash the machine */

Swap will exchange the values of ref so destroy will see ref = 0 instead of
ref = 1. With this fix in place swap will not succeed because ipset save
still has ref_netlink on the set (ip_set_swap doesn't swap ref_netlink).

Both delete and swap will error out if ref_netlink != 0 on the set.

Note: The changes to *_head functions is because previously we would
increment ref whenever we called these functions, we don't do that
anymore.

Reviewed-by: Joshua Hunt <johunt@akamai.com>
Signed-off-by: Vishwanath Pai <vpai@akamai.com>
Signed-off-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 include/linux/netfilter/ipset/ip_set.h  |  4 ++++
 net/netfilter/ipset/ip_set_bitmap_gen.h |  2 +-
 net/netfilter/ipset/ip_set_core.c       | 33 ++++++++++++++++++++++++++++-----
 net/netfilter/ipset/ip_set_hash_gen.h   |  2 +-
 net/netfilter/ipset/ip_set_list_set.c   |  2 +-
 5 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/linux/netfilter/ipset/ip_set.h b/include/linux/netfilter/ipset/ip_set.h
index 0e1f433..f48b8a6 100644
--- a/include/linux/netfilter/ipset/ip_set.h
+++ b/include/linux/netfilter/ipset/ip_set.h
@@ -234,6 +234,10 @@ struct ip_set {
 	spinlock_t lock;
 	/* References to the set */
 	u32 ref;
+	/* References to the set for netlink events like dump,
+	 * ref can be swapped out by ip_set_swap
+	 */
+	u32 ref_netlink;
 	/* The core set type */
 	struct ip_set_type *type;
 	/* The type variant doing the real job */
diff --git a/net/netfilter/ipset/ip_set_bitmap_gen.h b/net/netfilter/ipset/ip_set_bitmap_gen.h
index b0bc475..2e8e7e5 100644
--- a/net/netfilter/ipset/ip_set_bitmap_gen.h
+++ b/net/netfilter/ipset/ip_set_bitmap_gen.h
@@ -95,7 +95,7 @@ mtype_head(struct ip_set *set, struct sk_buff *skb)
 	if (!nested)
 		goto nla_put_failure;
 	if (mtype_do_head(skb, map) ||
-	    nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref - 1)) ||
+	    nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref)) ||
 	    nla_put_net32(skb, IPSET_ATTR_MEMSIZE, htonl(memsize)))
 		goto nla_put_failure;
 	if (unlikely(ip_set_put_flags(skb, set)))
diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
index 7e6568c..a748b0c 100644
--- a/net/netfilter/ipset/ip_set_core.c
+++ b/net/netfilter/ipset/ip_set_core.c
@@ -497,6 +497,26 @@ __ip_set_put(struct ip_set *set)
 	write_unlock_bh(&ip_set_ref_lock);
 }
 
+/* set->ref can be swapped out by ip_set_swap, netlink events (like dump) need
+ * a separate reference counter
+ */
+static inline void
+__ip_set_get_netlink(struct ip_set *set)
+{
+	write_lock_bh(&ip_set_ref_lock);
+	set->ref_netlink++;
+	write_unlock_bh(&ip_set_ref_lock);
+}
+
+static inline void
+__ip_set_put_netlink(struct ip_set *set)
+{
+	write_lock_bh(&ip_set_ref_lock);
+	BUG_ON(set->ref_netlink == 0);
+	set->ref_netlink--;
+	write_unlock_bh(&ip_set_ref_lock);
+}
+
 /* Add, del and test set entries from kernel.
  *
  * The set behind the index must exist and must be referenced
@@ -1002,7 +1022,7 @@ static int ip_set_destroy(struct net *net, struct sock *ctnl,
 	if (!attr[IPSET_ATTR_SETNAME]) {
 		for (i = 0; i < inst->ip_set_max; i++) {
 			s = ip_set(inst, i);
-			if (s && s->ref) {
+			if (s && (s->ref || s->ref_netlink)) {
 				ret = -IPSET_ERR_BUSY;
 				goto out;
 			}
@@ -1024,7 +1044,7 @@ static int ip_set_destroy(struct net *net, struct sock *ctnl,
 		if (!s) {
 			ret = -ENOENT;
 			goto out;
-		} else if (s->ref) {
+		} else if (s->ref || s->ref_netlink) {
 			ret = -IPSET_ERR_BUSY;
 			goto out;
 		}
@@ -1171,6 +1191,9 @@ static int ip_set_swap(struct net *net, struct sock *ctnl, struct sk_buff *skb,
 	      from->family == to->family))
 		return -IPSET_ERR_TYPE_MISMATCH;
 
+	if (from->ref_netlink || to->ref_netlink)
+		return -EBUSY;
+
 	strncpy(from_name, from->name, IPSET_MAXNAMELEN);
 	strncpy(from->name, to->name, IPSET_MAXNAMELEN);
 	strncpy(to->name, from_name, IPSET_MAXNAMELEN);
@@ -1206,7 +1229,7 @@ ip_set_dump_done(struct netlink_callback *cb)
 		if (set->variant->uref)
 			set->variant->uref(set, cb, false);
 		pr_debug("release set %s\n", set->name);
-		__ip_set_put_byindex(inst, index);
+		__ip_set_put_netlink(set);
 	}
 	return 0;
 }
@@ -1328,7 +1351,7 @@ dump_last:
 		if (!cb->args[IPSET_CB_ARG0]) {
 			/* Start listing: make sure set won't be destroyed */
 			pr_debug("reference set\n");
-			set->ref++;
+			set->ref_netlink++;
 		}
 		write_unlock_bh(&ip_set_ref_lock);
 		nlh = start_msg(skb, NETLINK_CB(cb->skb).portid,
@@ -1396,7 +1419,7 @@ release_refcount:
 		if (set->variant->uref)
 			set->variant->uref(set, cb, false);
 		pr_debug("release set %s\n", set->name);
-		__ip_set_put_byindex(inst, index);
+		__ip_set_put_netlink(set);
 		cb->args[IPSET_CB_ARG0] = 0;
 	}
 out:
diff --git a/net/netfilter/ipset/ip_set_hash_gen.h b/net/netfilter/ipset/ip_set_hash_gen.h
index e5336ab..d32fd6b 100644
--- a/net/netfilter/ipset/ip_set_hash_gen.h
+++ b/net/netfilter/ipset/ip_set_hash_gen.h
@@ -1082,7 +1082,7 @@ mtype_head(struct ip_set *set, struct sk_buff *skb)
 	if (nla_put_u32(skb, IPSET_ATTR_MARKMASK, h->markmask))
 		goto nla_put_failure;
 #endif
-	if (nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref - 1)) ||
+	if (nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref)) ||
 	    nla_put_net32(skb, IPSET_ATTR_MEMSIZE, htonl(memsize)))
 		goto nla_put_failure;
 	if (unlikely(ip_set_put_flags(skb, set)))
diff --git a/net/netfilter/ipset/ip_set_list_set.c b/net/netfilter/ipset/ip_set_list_set.c
index 24c6c19..a2a89e4 100644
--- a/net/netfilter/ipset/ip_set_list_set.c
+++ b/net/netfilter/ipset/ip_set_list_set.c
@@ -458,7 +458,7 @@ list_set_head(struct ip_set *set, struct sk_buff *skb)
 	if (!nested)
 		goto nla_put_failure;
 	if (nla_put_net32(skb, IPSET_ATTR_SIZE, htonl(map->size)) ||
-	    nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref - 1)) ||
+	    nla_put_net32(skb, IPSET_ATTR_REFERENCES, htonl(set->ref)) ||
 	    nla_put_net32(skb, IPSET_ATTR_MEMSIZE,
 			  htonl(sizeof(*map) + n * set->dsize)))
 		goto nla_put_failure;
-- 
2.1.4

^ permalink raw reply related

* [PATCH 2/9] openvswitch: Fix checking for new expected connections.
From: Pablo Neira Ayuso @ 2016-03-28 17:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

From: Jarno Rajahalme <jarno@ovn.org>

OVS should call into CT NAT for packets of new expected connections only
when the conntrack state is persisted with the 'commit' option to the
OVS CT action.  The test for this condition is doubly wrong, as the CT
status field is ANDed with the bit number (IPS_EXPECTED_BIT) rather
than the mask (IPS_EXPECTED), and due to the wrong assumption that the
expected bit would apply only for the first (i.e., 'new') packet of a
connection, while in fact the expected bit remains on for the lifetime of
an expected connection.  The 'ctinfo' value IP_CT_RELATED derived from
the ct status can be used instead, as it is only ever applicable to
the 'new' packets of the expected connection.

Fixes: 05752523e565 ('openvswitch: Interface with NAT.')
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Jarno Rajahalme <jarno@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/openvswitch/conntrack.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index dc5eb29..47f7c62 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -664,11 +664,12 @@ static int ovs_ct_nat(struct net *net, struct sw_flow_key *key,
 
 	/* Determine NAT type.
 	 * Check if the NAT type can be deduced from the tracked connection.
-	 * Make sure expected traffic is NATted only when committing.
+	 * Make sure new expected connections (IP_CT_RELATED) are NATted only
+	 * when committing.
 	 */
 	if (info->nat & OVS_CT_NAT && ctinfo != IP_CT_NEW &&
 	    ct->status & IPS_NAT_MASK &&
-	    (!(ct->status & IPS_EXPECTED_BIT) || info->commit)) {
+	    (ctinfo != IP_CT_RELATED || info->commit)) {
 		/* NAT an established or related connection like before. */
 		if (CTINFO2DIR(ctinfo) == IP_CT_DIR_REPLY)
 			/* This is the REPLY direction for a connection
-- 
2.1.4

^ permalink raw reply related

* [PATCH 3/9] openvswitch: call only into reachable nf-nat code
From: Pablo Neira Ayuso @ 2016-03-28 17:57 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

From: Arnd Bergmann <arnd@arndb.de>

The openvswitch code has gained support for calling into the
nf-nat-ipv4/ipv6 modules, however those can be loadable modules
in a configuration in which openvswitch is built-in, leading
to link errors:

net/built-in.o: In function `__ovs_ct_lookup':
:(.text+0x2cc2c8): undefined reference to `nf_nat_icmp_reply_translation'
:(.text+0x2cc66c): undefined reference to `nf_nat_icmpv6_reply_translation'

The dependency on (!NF_NAT || NF_NAT) prevents similar issues,
but NF_NAT is set to 'y' if any of the symbols selecting
it are built-in, but the link error happens when any of them
are modular.

A second issue is that even if CONFIG_NF_NAT_IPV6 is built-in,
CONFIG_NF_NAT_IPV4 might be completely disabled. This is unlikely
to be useful in practice, but the driver currently only handles
IPv6 being optional.

This patch improves the Kconfig dependency so that openvswitch
cannot be built-in if either of the two other symbols are set
to 'm', and it replaces the incorrect #ifdef in ovs_ct_nat_execute()
with two "if (IS_ENABLED())" checks that should catch all corner
cases also make the code more readable.

The same #ifdef exists ovs_ct_nat_to_attr(), where it does not
cause a link error, but for consistency I'm changing it the same
way.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Fixes: 05752523e565 ("openvswitch: Interface with NAT.")
Acked-by: Joe Stringer <joe@ovn.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/openvswitch/Kconfig     |  4 +++-
 net/openvswitch/conntrack.c | 16 ++++++++--------
 2 files changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 234a733..ce94729 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -7,7 +7,9 @@ config OPENVSWITCH
 	depends on INET
 	depends on !NF_CONNTRACK || \
 		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
-				     (!NF_NAT || NF_NAT)))
+				     (!NF_NAT || NF_NAT) && \
+				     (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
+				     (!NF_NAT_IPV6 || NF_NAT_IPV6)))
 	select LIBCRC32C
 	select MPLS
 	select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index 47f7c62..3797879 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -535,14 +535,15 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
 	switch (ctinfo) {
 	case IP_CT_RELATED:
 	case IP_CT_RELATED_REPLY:
-		if (skb->protocol == htons(ETH_P_IP) &&
+		if (IS_ENABLED(CONFIG_NF_NAT_IPV4) &&
+		    skb->protocol == htons(ETH_P_IP) &&
 		    ip_hdr(skb)->protocol == IPPROTO_ICMP) {
 			if (!nf_nat_icmp_reply_translation(skb, ct, ctinfo,
 							   hooknum))
 				err = NF_DROP;
 			goto push;
-#if IS_ENABLED(CONFIG_NF_NAT_IPV6)
-		} else if (skb->protocol == htons(ETH_P_IPV6)) {
+		} else if (IS_ENABLED(CONFIG_NF_NAT_IPV6) &&
+			   skb->protocol == htons(ETH_P_IPV6)) {
 			__be16 frag_off;
 			u8 nexthdr = ipv6_hdr(skb)->nexthdr;
 			int hdrlen = ipv6_skip_exthdr(skb,
@@ -557,7 +558,6 @@ static int ovs_ct_nat_execute(struct sk_buff *skb, struct nf_conn *ct,
 					err = NF_DROP;
 				goto push;
 			}
-#endif
 		}
 		/* Non-ICMP, fall thru to initialize if needed. */
 	case IP_CT_NEW:
@@ -1239,7 +1239,8 @@ static bool ovs_ct_nat_to_attr(const struct ovs_conntrack_info *info,
 	}
 
 	if (info->range.flags & NF_NAT_RANGE_MAP_IPS) {
-		if (info->family == NFPROTO_IPV4) {
+		if (IS_ENABLED(CONFIG_NF_NAT_IPV4) &&
+		    info->family == NFPROTO_IPV4) {
 			if (nla_put_in_addr(skb, OVS_NAT_ATTR_IP_MIN,
 					    info->range.min_addr.ip) ||
 			    (info->range.max_addr.ip
@@ -1247,8 +1248,8 @@ static bool ovs_ct_nat_to_attr(const struct ovs_conntrack_info *info,
 			     (nla_put_in_addr(skb, OVS_NAT_ATTR_IP_MAX,
 					      info->range.max_addr.ip))))
 				return false;
-#if IS_ENABLED(CONFIG_NF_NAT_IPV6)
-		} else if (info->family == NFPROTO_IPV6) {
+		} else if (IS_ENABLED(CONFIG_NF_NAT_IPV6) &&
+			   info->family == NFPROTO_IPV6) {
 			if (nla_put_in6_addr(skb, OVS_NAT_ATTR_IP_MIN,
 					     &info->range.min_addr.in6) ||
 			    (memcmp(&info->range.max_addr.in6,
@@ -1257,7 +1258,6 @@ static bool ovs_ct_nat_to_attr(const struct ovs_conntrack_info *info,
 			     (nla_put_in6_addr(skb, OVS_NAT_ATTR_IP_MAX,
 					       &info->range.max_addr.in6))))
 				return false;
-#endif
 		} else {
 			return false;
 		}
-- 
2.1.4

^ permalink raw reply related

* [PATCH 8/9] netfilter: x_tables: enforce nul-terminated table name from getsockopt GET_ENTRIES
From: Pablo Neira Ayuso @ 2016-03-28 17:58 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

Make sure the table names via getsockopt GET_ENTRIES is nul-terminated
in ebtables and all the x_tables variants and their respective compat
code. Uncovered by KASAN.

Reported-by: Baozeng Ding <sploving1@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/bridge/netfilter/ebtables.c | 4 ++++
 net/ipv4/netfilter/arp_tables.c | 2 ++
 net/ipv4/netfilter/ip_tables.c  | 2 ++
 net/ipv6/netfilter/ip6_tables.c | 2 ++
 4 files changed, 10 insertions(+)

diff --git a/net/bridge/netfilter/ebtables.c b/net/bridge/netfilter/ebtables.c
index 67b2e27..8570bc7 100644
--- a/net/bridge/netfilter/ebtables.c
+++ b/net/bridge/netfilter/ebtables.c
@@ -1521,6 +1521,8 @@ static int do_ebt_get_ctl(struct sock *sk, int cmd, void __user *user, int *len)
 	if (copy_from_user(&tmp, user, sizeof(tmp)))
 		return -EFAULT;
 
+	tmp.name[sizeof(tmp.name) - 1] = '\0';
+
 	t = find_table_lock(net, tmp.name, &ret, &ebt_mutex);
 	if (!t)
 		return ret;
@@ -2332,6 +2334,8 @@ static int compat_do_ebt_get_ctl(struct sock *sk, int cmd,
 	if (copy_from_user(&tmp, user, sizeof(tmp)))
 		return -EFAULT;
 
+	tmp.name[sizeof(tmp.name) - 1] = '\0';
+
 	t = find_table_lock(net, tmp.name, &ret, &ebt_mutex);
 	if (!t)
 		return ret;
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index a1bb5e7..4133b0f 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -969,6 +969,7 @@ static int get_entries(struct net *net, struct arpt_get_entries __user *uptr,
 			 sizeof(struct arpt_get_entries) + get.size);
 		return -EINVAL;
 	}
+	get.name[sizeof(get.name) - 1] = '\0';
 
 	t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
 	if (!IS_ERR_OR_NULL(t)) {
@@ -1663,6 +1664,7 @@ static int compat_get_entries(struct net *net,
 			 *len, sizeof(get) + get.size);
 		return -EINVAL;
 	}
+	get.name[sizeof(get.name) - 1] = '\0';
 
 	xt_compat_lock(NFPROTO_ARP);
 	t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 89b5d95..631c100 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -1156,6 +1156,7 @@ get_entries(struct net *net, struct ipt_get_entries __user *uptr,
 			 *len, sizeof(get) + get.size);
 		return -EINVAL;
 	}
+	get.name[sizeof(get.name) - 1] = '\0';
 
 	t = xt_find_table_lock(net, AF_INET, get.name);
 	if (!IS_ERR_OR_NULL(t)) {
@@ -1935,6 +1936,7 @@ compat_get_entries(struct net *net, struct compat_ipt_get_entries __user *uptr,
 			 *len, sizeof(get) + get.size);
 		return -EINVAL;
 	}
+	get.name[sizeof(get.name) - 1] = '\0';
 
 	xt_compat_lock(AF_INET);
 	t = xt_find_table_lock(net, AF_INET, get.name);
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 541b59f..86b67b7 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -1168,6 +1168,7 @@ get_entries(struct net *net, struct ip6t_get_entries __user *uptr,
 			 *len, sizeof(get) + get.size);
 		return -EINVAL;
 	}
+	get.name[sizeof(get.name) - 1] = '\0';
 
 	t = xt_find_table_lock(net, AF_INET6, get.name);
 	if (!IS_ERR_OR_NULL(t)) {
@@ -1944,6 +1945,7 @@ compat_get_entries(struct net *net, struct compat_ip6t_get_entries __user *uptr,
 			 *len, sizeof(get) + get.size);
 		return -EINVAL;
 	}
+	get.name[sizeof(get.name) - 1] = '\0';
 
 	xt_compat_lock(AF_INET6);
 	t = xt_find_table_lock(net, AF_INET6, get.name);
-- 
2.1.4

^ permalink raw reply related

* [PATCH 9/9] netfilter: ipv4: fix NULL dereference
From: Pablo Neira Ayuso @ 2016-03-28 17:58 UTC (permalink / raw)
  To: netfilter-devel; +Cc: davem, netdev
In-Reply-To: <1459187882-5357-1-git-send-email-pablo@netfilter.org>

From: Liping Zhang <liping.zhang@spreadtrum.com>

Commit fa50d974d104 ("ipv4: Namespaceify ip_default_ttl sysctl knob")
use sock_net(skb->sk) to get the net namespace, but we can't assume
that sk_buff->sk is always exist, so when it is NULL, oops will happen.

Signed-off-by: Liping Zhang <liping.zhang@spreadtrum.com>
Reviewed-by: Nikolay Borisov <kernel@kyup.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
---
 net/bridge/netfilter/nft_reject_bridge.c | 20 ++++++------
 net/ipv4/netfilter/ipt_SYNPROXY.c        | 54 +++++++++++++++++---------------
 2 files changed, 38 insertions(+), 36 deletions(-)

diff --git a/net/bridge/netfilter/nft_reject_bridge.c b/net/bridge/netfilter/nft_reject_bridge.c
index adc8d72..77f7e7a 100644
--- a/net/bridge/netfilter/nft_reject_bridge.c
+++ b/net/bridge/netfilter/nft_reject_bridge.c
@@ -40,7 +40,8 @@ static void nft_reject_br_push_etherhdr(struct sk_buff *oldskb,
 /* We cannot use oldskb->dev, it can be either bridge device (NF_BRIDGE INPUT)
  * or the bridge port (NF_BRIDGE PREROUTING).
  */
-static void nft_reject_br_send_v4_tcp_reset(struct sk_buff *oldskb,
+static void nft_reject_br_send_v4_tcp_reset(struct net *net,
+					    struct sk_buff *oldskb,
 					    const struct net_device *dev,
 					    int hook)
 {
@@ -48,7 +49,6 @@ static void nft_reject_br_send_v4_tcp_reset(struct sk_buff *oldskb,
 	struct iphdr *niph;
 	const struct tcphdr *oth;
 	struct tcphdr _oth;
-	struct net *net = sock_net(oldskb->sk);
 
 	if (!nft_bridge_iphdr_validate(oldskb))
 		return;
@@ -75,7 +75,8 @@ static void nft_reject_br_send_v4_tcp_reset(struct sk_buff *oldskb,
 	br_deliver(br_port_get_rcu(dev), nskb);
 }
 
-static void nft_reject_br_send_v4_unreach(struct sk_buff *oldskb,
+static void nft_reject_br_send_v4_unreach(struct net *net,
+					  struct sk_buff *oldskb,
 					  const struct net_device *dev,
 					  int hook, u8 code)
 {
@@ -86,7 +87,6 @@ static void nft_reject_br_send_v4_unreach(struct sk_buff *oldskb,
 	void *payload;
 	__wsum csum;
 	u8 proto;
-	struct net *net = sock_net(oldskb->sk);
 
 	if (oldskb->csum_bad || !nft_bridge_iphdr_validate(oldskb))
 		return;
@@ -273,17 +273,17 @@ static void nft_reject_bridge_eval(const struct nft_expr *expr,
 	case htons(ETH_P_IP):
 		switch (priv->type) {
 		case NFT_REJECT_ICMP_UNREACH:
-			nft_reject_br_send_v4_unreach(pkt->skb, pkt->in,
-						      pkt->hook,
+			nft_reject_br_send_v4_unreach(pkt->net, pkt->skb,
+						      pkt->in, pkt->hook,
 						      priv->icmp_code);
 			break;
 		case NFT_REJECT_TCP_RST:
-			nft_reject_br_send_v4_tcp_reset(pkt->skb, pkt->in,
-							pkt->hook);
+			nft_reject_br_send_v4_tcp_reset(pkt->net, pkt->skb,
+							pkt->in, pkt->hook);
 			break;
 		case NFT_REJECT_ICMPX_UNREACH:
-			nft_reject_br_send_v4_unreach(pkt->skb, pkt->in,
-						      pkt->hook,
+			nft_reject_br_send_v4_unreach(pkt->net, pkt->skb,
+						      pkt->in, pkt->hook,
 						      nft_reject_icmp_code(priv->icmp_code));
 			break;
 		}
diff --git a/net/ipv4/netfilter/ipt_SYNPROXY.c b/net/ipv4/netfilter/ipt_SYNPROXY.c
index 7b8fbb3..db5b875 100644
--- a/net/ipv4/netfilter/ipt_SYNPROXY.c
+++ b/net/ipv4/netfilter/ipt_SYNPROXY.c
@@ -18,10 +18,10 @@
 #include <net/netfilter/nf_conntrack_synproxy.h>
 
 static struct iphdr *
-synproxy_build_ip(struct sk_buff *skb, __be32 saddr, __be32 daddr)
+synproxy_build_ip(struct net *net, struct sk_buff *skb, __be32 saddr,
+		  __be32 daddr)
 {
 	struct iphdr *iph;
-	struct net *net = sock_net(skb->sk);
 
 	skb_reset_network_header(skb);
 	iph = (struct iphdr *)skb_put(skb, sizeof(*iph));
@@ -40,14 +40,12 @@ synproxy_build_ip(struct sk_buff *skb, __be32 saddr, __be32 daddr)
 }
 
 static void
-synproxy_send_tcp(const struct synproxy_net *snet,
+synproxy_send_tcp(struct net *net,
 		  const struct sk_buff *skb, struct sk_buff *nskb,
 		  struct nf_conntrack *nfct, enum ip_conntrack_info ctinfo,
 		  struct iphdr *niph, struct tcphdr *nth,
 		  unsigned int tcp_hdr_size)
 {
-	struct net *net = nf_ct_net(snet->tmpl);
-
 	nth->check = ~tcp_v4_check(tcp_hdr_size, niph->saddr, niph->daddr, 0);
 	nskb->ip_summed   = CHECKSUM_PARTIAL;
 	nskb->csum_start  = (unsigned char *)nth - nskb->head;
@@ -72,7 +70,7 @@ free_nskb:
 }
 
 static void
-synproxy_send_client_synack(const struct synproxy_net *snet,
+synproxy_send_client_synack(struct net *net,
 			    const struct sk_buff *skb, const struct tcphdr *th,
 			    const struct synproxy_options *opts)
 {
@@ -91,7 +89,7 @@ synproxy_send_client_synack(const struct synproxy_net *snet,
 		return;
 	skb_reserve(nskb, MAX_TCP_HEADER);
 
-	niph = synproxy_build_ip(nskb, iph->daddr, iph->saddr);
+	niph = synproxy_build_ip(net, nskb, iph->daddr, iph->saddr);
 
 	skb_reset_transport_header(nskb);
 	nth = (struct tcphdr *)skb_put(nskb, tcp_hdr_size);
@@ -109,15 +107,16 @@ synproxy_send_client_synack(const struct synproxy_net *snet,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(snet, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+	synproxy_send_tcp(net, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
 			  niph, nth, tcp_hdr_size);
 }
 
 static void
-synproxy_send_server_syn(const struct synproxy_net *snet,
+synproxy_send_server_syn(struct net *net,
 			 const struct sk_buff *skb, const struct tcphdr *th,
 			 const struct synproxy_options *opts, u32 recv_seq)
 {
+	struct synproxy_net *snet = synproxy_pernet(net);
 	struct sk_buff *nskb;
 	struct iphdr *iph, *niph;
 	struct tcphdr *nth;
@@ -132,7 +131,7 @@ synproxy_send_server_syn(const struct synproxy_net *snet,
 		return;
 	skb_reserve(nskb, MAX_TCP_HEADER);
 
-	niph = synproxy_build_ip(nskb, iph->saddr, iph->daddr);
+	niph = synproxy_build_ip(net, nskb, iph->saddr, iph->daddr);
 
 	skb_reset_transport_header(nskb);
 	nth = (struct tcphdr *)skb_put(nskb, tcp_hdr_size);
@@ -153,12 +152,12 @@ synproxy_send_server_syn(const struct synproxy_net *snet,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(snet, skb, nskb, &snet->tmpl->ct_general, IP_CT_NEW,
+	synproxy_send_tcp(net, skb, nskb, &snet->tmpl->ct_general, IP_CT_NEW,
 			  niph, nth, tcp_hdr_size);
 }
 
 static void
-synproxy_send_server_ack(const struct synproxy_net *snet,
+synproxy_send_server_ack(struct net *net,
 			 const struct ip_ct_tcp *state,
 			 const struct sk_buff *skb, const struct tcphdr *th,
 			 const struct synproxy_options *opts)
@@ -177,7 +176,7 @@ synproxy_send_server_ack(const struct synproxy_net *snet,
 		return;
 	skb_reserve(nskb, MAX_TCP_HEADER);
 
-	niph = synproxy_build_ip(nskb, iph->daddr, iph->saddr);
+	niph = synproxy_build_ip(net, nskb, iph->daddr, iph->saddr);
 
 	skb_reset_transport_header(nskb);
 	nth = (struct tcphdr *)skb_put(nskb, tcp_hdr_size);
@@ -193,11 +192,11 @@ synproxy_send_server_ack(const struct synproxy_net *snet,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(snet, skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
+	synproxy_send_tcp(net, skb, nskb, NULL, 0, niph, nth, tcp_hdr_size);
 }
 
 static void
-synproxy_send_client_ack(const struct synproxy_net *snet,
+synproxy_send_client_ack(struct net *net,
 			 const struct sk_buff *skb, const struct tcphdr *th,
 			 const struct synproxy_options *opts)
 {
@@ -215,7 +214,7 @@ synproxy_send_client_ack(const struct synproxy_net *snet,
 		return;
 	skb_reserve(nskb, MAX_TCP_HEADER);
 
-	niph = synproxy_build_ip(nskb, iph->saddr, iph->daddr);
+	niph = synproxy_build_ip(net, nskb, iph->saddr, iph->daddr);
 
 	skb_reset_transport_header(nskb);
 	nth = (struct tcphdr *)skb_put(nskb, tcp_hdr_size);
@@ -231,15 +230,16 @@ synproxy_send_client_ack(const struct synproxy_net *snet,
 
 	synproxy_build_options(nth, opts);
 
-	synproxy_send_tcp(snet, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
+	synproxy_send_tcp(net, skb, nskb, skb->nfct, IP_CT_ESTABLISHED_REPLY,
 			  niph, nth, tcp_hdr_size);
 }
 
 static bool
-synproxy_recv_client_ack(const struct synproxy_net *snet,
+synproxy_recv_client_ack(struct net *net,
 			 const struct sk_buff *skb, const struct tcphdr *th,
 			 struct synproxy_options *opts, u32 recv_seq)
 {
+	struct synproxy_net *snet = synproxy_pernet(net);
 	int mss;
 
 	mss = __cookie_v4_check(ip_hdr(skb), th, ntohl(th->ack_seq) - 1);
@@ -255,7 +255,7 @@ synproxy_recv_client_ack(const struct synproxy_net *snet,
 	if (opts->options & XT_SYNPROXY_OPT_TIMESTAMP)
 		synproxy_check_timestamp_cookie(opts);
 
-	synproxy_send_server_syn(snet, skb, th, opts, recv_seq);
+	synproxy_send_server_syn(net, skb, th, opts, recv_seq);
 	return true;
 }
 
@@ -263,7 +263,8 @@ static unsigned int
 synproxy_tg4(struct sk_buff *skb, const struct xt_action_param *par)
 {
 	const struct xt_synproxy_info *info = par->targinfo;
-	struct synproxy_net *snet = synproxy_pernet(par->net);
+	struct net *net = par->net;
+	struct synproxy_net *snet = synproxy_pernet(net);
 	struct synproxy_options opts = {};
 	struct tcphdr *th, _th;
 
@@ -292,12 +293,12 @@ synproxy_tg4(struct sk_buff *skb, const struct xt_action_param *par)
 					  XT_SYNPROXY_OPT_SACK_PERM |
 					  XT_SYNPROXY_OPT_ECN);
 
-		synproxy_send_client_synack(snet, skb, th, &opts);
+		synproxy_send_client_synack(net, skb, th, &opts);
 		return NF_DROP;
 
 	} else if (th->ack && !(th->fin || th->rst || th->syn)) {
 		/* ACK from client */
-		synproxy_recv_client_ack(snet, skb, th, &opts, ntohl(th->seq));
+		synproxy_recv_client_ack(net, skb, th, &opts, ntohl(th->seq));
 		return NF_DROP;
 	}
 
@@ -308,7 +309,8 @@ static unsigned int ipv4_synproxy_hook(void *priv,
 				       struct sk_buff *skb,
 				       const struct nf_hook_state *nhs)
 {
-	struct synproxy_net *snet = synproxy_pernet(nhs->net);
+	struct net *net = nhs->net;
+	struct synproxy_net *snet = synproxy_pernet(net);
 	enum ip_conntrack_info ctinfo;
 	struct nf_conn *ct;
 	struct nf_conn_synproxy *synproxy;
@@ -365,7 +367,7 @@ static unsigned int ipv4_synproxy_hook(void *priv,
 			 * therefore we need to add 1 to make the SYN sequence
 			 * number match the one of first SYN.
 			 */
-			if (synproxy_recv_client_ack(snet, skb, th, &opts,
+			if (synproxy_recv_client_ack(net, skb, th, &opts,
 						     ntohl(th->seq) + 1))
 				this_cpu_inc(snet->stats->cookie_retrans);
 
@@ -391,12 +393,12 @@ static unsigned int ipv4_synproxy_hook(void *priv,
 				  XT_SYNPROXY_OPT_SACK_PERM);
 
 		swap(opts.tsval, opts.tsecr);
-		synproxy_send_server_ack(snet, state, skb, th, &opts);
+		synproxy_send_server_ack(net, state, skb, th, &opts);
 
 		nf_ct_seqadj_init(ct, ctinfo, synproxy->isn - ntohl(th->seq));
 
 		swap(opts.tsval, opts.tsecr);
-		synproxy_send_client_ack(snet, skb, th, &opts);
+		synproxy_send_client_ack(net, skb, th, &opts);
 
 		consume_skb(skb);
 		return NF_STOLEN;
-- 
2.1.4

^ permalink raw reply related

* Re: [PATCH net] team: team should sync the port's uc/mc addrs when add a port
From: Jiri Pirko @ 2016-03-28 18:04 UTC (permalink / raw)
  To: Xin Long; +Cc: network dev, davem, Marcelo Ricardo Leitner
In-Reply-To: <320fc71dbbf5ed164ca506cf5750b7bf4e048d44.1459183351.git.lucien.xin@gmail.com>

Mon, Mar 28, 2016 at 06:42:31PM CEST, lucien.xin@gmail.com wrote:
>There is an issue when we use mavtap over team:
>When we replug nic links from team0, the real nics's mc list will not
>include the maddr for macvtap any more. then we can't receive pkts to
>macvtap device, as they are filterred by mc list of nic.
>
>In Bonding Driver, it syncs the uc/mc addrs in bond_enslave().
>
>We will fix this issue on team by adding the port's uc/mc addrs sync in
>team_port_add.
>
>Signed-off-by: Xin Long <lucien.xin@gmail.com>

Acked-by: Jiri Pirko <jiri@mellanox.com>

^ permalink raw reply

* Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
From: Rick Jones @ 2016-03-28 18:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David S . Miller, netdev, Tom Herbert
In-Reply-To: <1459184434.6473.104.camel@edumazet-glaptop3.roam.corp.google.com>

On 03/28/2016 10:00 AM, Eric Dumazet wrote:
> On Mon, 2016-03-28 at 09:15 -0700, Rick Jones wrote:
>> On 03/25/2016 03:29 PM, Eric Dumazet wrote:
>>> UDP sockets are not short lived in the high usage case, so the added
>>> cost of call_rcu() should not be a concern.
>>
>> Even a busy DNS resolver?
>
> If you mean that a busy DNS resolver spends _most_ of its time doing :
>
> fd = socket()
> bind(fd  port=0)
>    < send and receive one frame >
> close(fd)

Yes.  Although it has been a long time, I thought that say the likes of 
a caching named in the middle between hosts and the rest of the DNS 
would behave that way as it was looking-up names on behalf those who 
asked it.

rick

>
> (If this is the case, may I suggest doing something different, and use
> some kind of caches ? It will be way faster.)
>
> Then the result for 10,000,000 loops of <socket()+bind()+close()> are
>
> Before patch :
>
> real	0m13.665s
> user	0m0.548s
> sys	0m12.372s
>
> After patch :
>
> real	0m20.599s
> user	0m0.465s
> sys	0m17.965s
>
> So the worst overhead is 700 ns
>
> This is roughly the cost for bringing 960 bytes from memory, or 15 cache
> lines (on x86_64)
>
> # grep UDP /proc/slabinfo
> UDPLITEv6              0      0   1088    7    2 : tunables   24   12    8 : slabdata      0      0      0
> UDPv6                 24     49   1088    7    2 : tunables   24   12    8 : slabdata      7      7      0
> UDP-Lite               0      0    960    4    1 : tunables   54   27    8 : slabdata      0      0      0
> UDP                   30     36    960    4    1 : tunables   54   27    8 : slabdata      9      9      2
>
> In reality, chances that UDP sockets are re-opened right after being
> freed and their 15 cache lines are very hot in cpu caches is quite
> small, so I would not worry at all about this rather stupid benchmark.
>
> int main(int argc, char *argv[]) {
> 	struct sockaddr_in addr;
> 	int i, fd, loops = 10000000;
>
> 	for (i = 0; i < loops; i++) {
> 		fd = socket(AF_INET, SOCK_DGRAM, 0);
> 		if (fd == -1) {
> 			perror("socket");
> 			break;
> 		}
> 		memset(&addr, 0, sizeof(addr));
> 		addr.sin_family = AF_INET;
> 		if (bind(fd, (const struct sockaddr *)&addr, sizeof(addr)) == -1) {
> 			perror("bind");
> 			break;
> 		}
> 		close(fd);
> 	}
> 	return 0;
> }
>

^ permalink raw reply

* Re: [PATCH net v2 2/3] tunnels: Don't apply GRO to multiple layers of encapsulation.
From: Tom Herbert @ 2016-03-28 18:47 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Jesse Gross, David Miller, Linux Kernel Network Developers
In-Reply-To: <CAKgT0UdiqecAMihYP=gfmBYp7-Ko54EkO23=GWg4jEAuHocKJQ@mail.gmail.com>

On Mon, Mar 28, 2016 at 10:37 AM, Alexander Duyck
<alexander.duyck@gmail.com> wrote:
> On Mon, Mar 28, 2016 at 9:31 AM, Tom Herbert <tom@herbertland.com> wrote:
>> On Sun, Mar 27, 2016 at 9:38 PM, Jesse Gross <jesse@kernel.org> wrote:
>>> On Sat, Mar 26, 2016 at 12:41 PM, Tom Herbert <tom@herbertland.com> wrote:
>>>> On Sat, Mar 19, 2016 at 9:32 AM, Jesse Gross <jesse@kernel.org> wrote:
>>>>> When drivers express support for TSO of encapsulated packets, they
>>>>> only mean that they can do it for one layer of encapsulation.
>>>>> Supporting additional levels would mean updating, at a minimum,
>>>>> more IP length fields and they are unaware of this.
>>>>>
>>>> This patch completely breaks GRO for FOU and GUE.
>>>>
>>>>> No encapsulation device expresses support for handling offloaded
>>>>> encapsulated packets, so we won't generate these types of frames
>>>>> in the transmit path. However, GRO doesn't have a check for
>>>>> multiple levels of encapsulation and will attempt to build them.
>>>>>
>>>>> UDP tunnel GRO actually does prevent this situation but it only
>>>>> handles multiple UDP tunnels stacked on top of each other. This
>>>>> generalizes that solution to prevent any kind of tunnel stacking
>>>>> that would cause problems.
>>>>>
>>>> GUE and FOU regularly create packets that will be both GSO UDP tunnels
>>>> and IPIP, SIT, GRE, etc. This is by design. There should be no
>>>> ambiguity in the drivers as to what this means. For instance, if
>>>> SKB_GSO_UDP_TUNNEL and SKB_GSO_GRE are set that is GRE/UDP packet, a
>>>> driver can use ndo_features_check to validate.
>>>>
>>>> So multiple levels of encapsulation with GRO is perfectly valid and I
>>>> would suggest to simply revert this patch. The one potential issue we
>>>> could have is the potential for GRO to construct a packet which is a
>>>> UDP-encapsulation inside another encapsulation, like UDP-encap in GRE.
>>>> In this case the GSO flags don't provide enough information to
>>>> distinguish say between GRE/UDP (common case) and UDP/GRE (uncommon
>>>> case). To make this clear we can set udp_mark in GRE, ipip, and sit
>>>> but *not* check for it.
>>>
>>> Generally speaking, multiple levels of encapsulation offload are not
>>> valid. I think this is pretty clear from the fact that none of the
>>> encapsulation drivers expose support for encapsulation offloads on
>>> transmit and the drivers supporting NETIF_F_GSO_GRE and
>>> NETIF_F_GSO_UDP_TUNNEL do not mean they can handle GRE in VXLAN.
>>>
>>
>> Kernel software offload does support this combination just fine.
>> Seriously, I've tested that more than a thousand times. This is a few
>> HW implementations you're referring to. The limitations of these
>> drivers should not dictate how we build the software-- it needs to
>> work the other way around.
>
> Jesse does have a point.  Drivers aren't checking for this kind of
> thing currently as the transmit side doesn't generate these kind of
> frames.  The fact that GRO is makes things a bit more messy as we will
> really need to restructure the GSO code in order to handle it.  As far
> as drivers testing for it I am pretty certain the i40e isn't.  I would
> wonder if we need to add yet another GSO bit to indicate that we
> support multiple layers of encapsulation.  I'm pretty sure the only
> way we could possibly handle it would be in software since what you
> are indicating is a indeterminate number of headers that all require
> updates.
>
>>> Asking drivers to assume that this combination of flags means FOU
>>> doesn't seem right to me. To the best of my knowledge, no driver uses
>>> ndo_feature_check to do validation of multiple tunnel offload flags
>>> since the assumption is that the stack will never try to do this.
>>> Since FOU is being treated as only a single level of encapsulation, I
>>> think it would be better to just advertise it that way on transmit
>>> (i.e. only set SKB_GSO_UDP_TUNNEL).
>>>
>> If it's not FOU it will be something else. Arbitrarily declaring
>> multi-levels of encapsulation invalid is simply the wrong direction,
>> we should be increasing generality and capabilities of the kernel not
>> holding it down with artificial limits. This is why the generic TSO
>> work that Alexander and Edward are looking at is so important-- if
>> this flies then we can offload any combination of encapsulations with
>> out protocol specific information.
>
> The segmentation code isn't designed to handle more than 2 layers of
> headers.  Currently we have the pointers for the inner headers and the
> outer headers.  If you are talking about adding yet another level we
> would need additional pointers in the skbuff to handle them and we
> currently don't have them at present.
>
>>> The change that you are suggesting would result in packets generated
>>> by GRO that cannot be handled properly on transmit in some cases and
>>> would likely end up being dropped or malformed. GRO is just an
>>> optimization and correctness must come first so we cannot use it if it
>>> might corrupt traffic.
>>
>> That's (a few) drivers problem. It's no different than if they had
>> decided that SKB_GSO_UDP_TUNNEL means vxlan, or they can support GRE
>> in IPv4 offload but not GRE in IPv6, or they can only handle headers
>> of a certain size, can't handle IPv6 ext. hdrs., etc. As I mentioned,
>> the long term solution is to eliminate the GSO_ flags and use a
>> generic segmentation offload interface. But in the interim, it is
>> *incumbent* on drivers to determine if they can handle a GSO packet
>> and the interfaces to do that exist. Restricting the capabilities of
>> GRO just to appease those drivers is not right. Breaking existing GRO
>> for their benefit is definitely not right.
>
> This isn't about if drivers can handle it.  It is about if the skbuff
> can handle it.  The problem as it stands right now is that we start
> losing data once we go past 1 level of encapsulation.  We only have
> the standard mac_header, network_header, transport_header, and then
> the inner set of the same pointers.  If we try to go multiple levels
> deep we start losing data.
>
Huh? GUE/FOU has been running perfectly well with two levels of
encapsulation for over a year now. We never had to add anything to
skbuff to make that work. If "losing data" is a problem please provide
the *reproducible* test case for that and we'll debug that.

> If we are wanting to start allowing unlimited headers then maybe we
> need to start allowing for "partial" GRO to complement the partial GSO
> implementation I have.  With that at least we might be able to track
> the first and last headers and we could leave the remaining headers
> static.  Then when the frame made it to the driver it would know that
> the headers in the long list can be replicated and it only has to
> update the outer network header and the inner transport header,
> otherwise if it doesn't support that it can just chop up the frame in
> software and send that out.
>
> - Alex

^ permalink raw reply

* Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
From: Eric Dumazet @ 2016-03-28 18:55 UTC (permalink / raw)
  To: Rick Jones; +Cc: Eric Dumazet, David S . Miller, netdev, Tom Herbert
In-Reply-To: <56F97B70.1000904@hpe.com>

On Mon, 2016-03-28 at 11:44 -0700, Rick Jones wrote:
> On 03/28/2016 10:00 AM, Eric Dumazet wrote:
> > On Mon, 2016-03-28 at 09:15 -0700, Rick Jones wrote:
> >> On 03/25/2016 03:29 PM, Eric Dumazet wrote:
> >>> UDP sockets are not short lived in the high usage case, so the added
> >>> cost of call_rcu() should not be a concern.
> >>
> >> Even a busy DNS resolver?
> >
> > If you mean that a busy DNS resolver spends _most_ of its time doing :
> >
> > fd = socket()
> > bind(fd  port=0)
> >    < send and receive one frame >
> > close(fd)
> 
> Yes.  Although it has been a long time, I thought that say the likes of 
> a caching named in the middle between hosts and the rest of the DNS 
> would behave that way as it was looking-up names on behalf those who 
> asked it.

I really doubt a modern program would dynamically allocate one UDP port
for every in-flight request, as it would limit them to number of
ephemeral ports concurrent requests (~30000 assuming the process can get
them all on the host)

Managing a pool would be more efficient (The 1.3 usec penalty becomes
more like 4 usec in multi threaded programs)

Sure, you always can find badly written programs, but they already hit
scalability issues anyway.

UDP refcounting cost about 2 cache line misses per packet in stress
situations, this really has to go, so that well written programs can get
full speed.

^ permalink raw reply

* Re: [RFC net-next 2/2] udp: No longer use SLAB_DESTROY_BY_RCU
From: Rick Jones @ 2016-03-28 19:11 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Eric Dumazet, David S . Miller, netdev, Tom Herbert
In-Reply-To: <1459191346.6473.111.camel@edumazet-glaptop3.roam.corp.google.com>

On 03/28/2016 11:55 AM, Eric Dumazet wrote:
> On Mon, 2016-03-28 at 11:44 -0700, Rick Jones wrote:
>> On 03/28/2016 10:00 AM, Eric Dumazet wrote:
>>> If you mean that a busy DNS resolver spends _most_ of its time doing :
>>>
>>> fd = socket()
>>> bind(fd  port=0)
>>>     < send and receive one frame >
>>> close(fd)
>>
>> Yes.  Although it has been a long time, I thought that say the likes of
>> a caching named in the middle between hosts and the rest of the DNS
>> would behave that way as it was looking-up names on behalf those who
>> asked it.
>
> I really doubt a modern program would dynamically allocate one UDP port
> for every in-flight request, as it would limit them to number of
> ephemeral ports concurrent requests (~30000 assuming the process can get
> them all on the host)

I was under the impression that individual DNS queries were supposed to 
have not only random DNS query IDs but also originate from random UDP 
source ports.  https://tools.ietf.org/html/rfc5452 4.5 at least touches 
on the topic but I don't see it making it hard and fast.  By section 10 
though it is more explicit:

    This document recommends the use of UDP source port number
    randomization to extend the effective DNS transaction ID beyond the
    available 16 bits.

That being the case, if indeed there were to be 30000-odd concurrent 
requests outstanding "upstream" from that location there'd have to be 
30000 ephemeral ports in play.

rick

>
> Managing a pool would be more efficient (The 1.3 usec penalty becomes
> more like 4 usec in multi threaded programs)
>
> Sure, you always can find badly written programs, but they already hit
> scalability issues anyway.
>
> UDP refcounting cost about 2 cache line misses per packet in stress
> situations, this really has to go, so that well written programs can get
> full speed.
>
>

^ permalink raw reply

* Re: BUG: net/netfilter: KASAN: stack-out-of-bounds in tcp_packet
From: David Miller @ 2016-03-28 19:29 UTC (permalink / raw)
  To: kadlec; +Cc: sploving1, pablo, kaber, netfilter-devel, netdev
In-Reply-To: <alpine.DEB.2.10.1603281844200.23985@blackhole.kfki.hu>

From: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Date: Mon, 28 Mar 2016 18:48:51 +0200 (CEST)

>> > > @@ -3716,6 +3716,8 @@ void tcp_parse_options(const struct sk_buff *skb,
>> > >               length--;
>> > >               continue;
>> > >           default:
>> > > +            if (length < 2)
>> > > +                return;
>> > >               opsize = *ptr++;
>> > >               if (opsize < 2) /* "silly options" */
>> > >                   return;

I'm trying to figure out how this can even matter.

If we are in the loop, length is at least one.

That means it is legal to read the opsize byte.

By the next check, opsize is at least 2.

And then the very next line in this code makes sure length >= opsize:

			if (opsize > length)
				return;	/* don't parse partial options */

Therefore no out-of-range access is possible as far as I can see.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox