Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf-next 06/10] [bpf]: make bnxt compatible w/ bpf_xdp_adjust_tail
From: Alexei Starovoitov @ 2018-04-17 23:07 UTC (permalink / raw)
  To: Nikita V. Shirokov
  Cc: Alexei Starovoitov, Daniel Borkmann, Michael Chan, netdev
In-Reply-To: <20180417065131.3632-7-tehnerd@tehnerd.com>

On Mon, Apr 16, 2018 at 11:51:27PM -0700, Nikita V. Shirokov wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for bnxt driver we will just calculate packet's length unconditionally
> 
> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next 07/10] [bpf]: make cavium thunder compatible w/ bpf_xdp_adjust_tail
From: Alexei Starovoitov @ 2018-04-17 23:07 UTC (permalink / raw)
  To: Nikita V. Shirokov
  Cc: Alexei Starovoitov, Daniel Borkmann, Robert Richter,
	Sunil Goutham, netdev
In-Reply-To: <20180417065131.3632-8-tehnerd@tehnerd.com>

On Mon, Apr 16, 2018 at 11:51:28PM -0700, Nikita V. Shirokov wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for cavium's thunder driver we will just calculate packet's length
> unconditionally
> 
> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [PATCH bpf-next 08/10] [bpf]: make netronome nfp compatible w/ bpf_xdp_adjust_tail
From: Alexei Starovoitov @ 2018-04-17 23:08 UTC (permalink / raw)
  To: Nikita V. Shirokov
  Cc: Alexei Starovoitov, Daniel Borkmann, Jakub Kicinski, netdev
In-Reply-To: <20180417065131.3632-9-tehnerd@tehnerd.com>

On Mon, Apr 16, 2018 at 11:51:29PM -0700, Nikita V. Shirokov wrote:
> w/ bpf_xdp_adjust_tail helper xdp's data_end pointer could be changed as
> well (only "decrease" of pointer's location is going to be supported).
> changing of this pointer will change packet's size.
> for nfp driver we will just calculate packet's length unconditionally
> 
> Signed-off-by: Nikita V. Shirokov <tehnerd@tehnerd.com>
> ---
>  drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> index 1eb6549f2a54..d9111c077699 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
> @@ -1722,7 +1722,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
>  
>  			act = bpf_prog_run_xdp(xdp_prog, &xdp);
>  
> -			pkt_len -= xdp.data - orig_data;
> +			pkt_len = xdp.data_end - xdp.data;

Looks correct, but Jakub please review.

^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/3] bpftool: Support new prog types and attach types
From: Jakub Kicinski @ 2018-04-17 23:09 UTC (permalink / raw)
  To: Andrey Ignatov; +Cc: ast, daniel, quentin.monnet, netdev, kernel-team
In-Reply-To: <b5413a9cf18bb0a5a2480346e95101eb3377c0a6.1523985784.git.rdna@fb.com>

On Tue, 17 Apr 2018 10:28:44 -0700, Andrey Ignatov wrote:
> Add recently added prog types to `bpftool prog` and attach types to
> `bpftool cgroup`.
> 
> Update bpftool documentation and bash completion appropriately.
> 
> Signed-off-by: Andrey Ignatov <rdna@fb.com>

Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>

Thank you!!

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Jakub Kicinski @ 2018-04-17 23:19 UTC (permalink / raw)
  To: Andy Gospodarek
  Cc: Or Gerlitz, Samudrala, Sridhar, David Miller, Anjali Singhai Jain,
	Michael Chan, Simon Horman, John Fastabend, Saeed Mahameed,
	Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <20180417144700.GJ33938@C02RW35GFVH8.dhcp.broadcom.net>

On Tue, 17 Apr 2018 10:47:00 -0400, Andy Gospodarek wrote:
> There is also a school of thought that the VF reps could be
> pre-allocated on the SmartNIC so that any application processing that
> traffic would sit idle when no traffic arrives on the rep, but could
> process frames that do arrive when the VFs were created on the host.
> This implementation will depend on how resources are allocated on a
> given bit of hardware, but can really work well.

+1 if there is no FW resource allocation issues IMHO it's okay to
just show all reprs for "remote PCIes (PFs and VFs)" on the SmartNIC/
controller.  The reprs should just show link down as if PCIe cable
was unpluged until host actually enables them.  

A similar issue exists on multi-host for PFs, right?  If one of the
hosts is down do we still show their PF repr?  IMHO yes.

That makes the thing looks more like a switch with cables being plugged
in and out.

^ permalink raw reply

* Re: Repeatable inet6_dump_fib crash in stock 4.12.0-rc4+
From: Ben Greear @ 2018-04-17 23:29 UTC (permalink / raw)
  To: David Ahern, Michal Kubecek; +Cc: Cong Wang, Eric Dumazet, netdev
In-Reply-To: <763bdb6c-bd5f-2398-53ca-6d9dc28c3df6@candelatech.com>

On 01/24/2018 03:59 PM, Ben Greear wrote:
> On 06/20/2017 08:03 PM, David Ahern wrote:
>> On 6/20/17 5:41 PM, Ben Greear wrote:
>>> On 06/20/2017 11:05 AM, Michal Kubecek wrote:
>>>> On Tue, Jun 20, 2017 at 07:12:27AM -0700, Ben Greear wrote:
>>>>> On 06/14/2017 03:25 PM, David Ahern wrote:
>>>>>> On 6/14/17 4:23 PM, Ben Greear wrote:
>>>>>>> On 06/13/2017 07:27 PM, David Ahern wrote:
>>>>>>>
>>>>>>>> Let's try a targeted debug patch. See attached
>>>>>>>
>>>>>>> I had to change it to pr_err so it would go to our serial console
>>>>>>> since the system locked hard on crash,
>>>>>>> and that appears to be enough to change the timing where we can no
>>>>>>> longer
>>>>>>> reproduce the problem.
>>>>>>
>>>>>>
>>>>>> ok, let's figure out which one is doing that. There are 3 debug
>>>>>> statements. I suspect fib6_del_route is the one setting the state to
>>>>>> FWS_U. Can you remove the debug prints in fib6_repair_tree and
>>>>>> fib6_walk_continue and try again?
>>>>>
>>>>> We cannot reproduce with just that one printf in the kernel either.  It
>>>>> must change the timing too much to trigger the bug.
>>>>
>>>> You might try trace_printk() which should have less impact (don't forget
>>>> to enable /proc/sys/kernel/ftrace_dump_on_oops).
>>>
>>> We cannot reproduce with trace_printk() either.
>>
>> I think that suggests the walker state is set to FWS_U in
>> fib6_del_route, and it is the FWS_U case in fib6_walk_continue that
>> triggers the fault -- the null parent (pn = fn->parent). So we have the
>> 2 areas of code that are interacting.
>>
>> I'm on a road trip through the end of this week with little time to
>> focus on this problem. I'll get back to you another suggestion when I can.

FYI, problem still happens in 4.16.  I'm going to re-enable my hack below
for this kernel as well...I had hopes it might be fixed...

BUG: unable to handle kernel NULL pointer dereference at 8
IP: fib6_walk_continue+0x5b/0x140 [ipv6]
PGD 80000007dfc0c067 P4D 80000007dfc0c067 PUD 7e66ff067 PMD 0
Oops: 0000 [#1] PREEMPT SMP PTI
Modules linked in: nf_conntrack_netlink nf_conntrack nfnetlink nf_defrag_ipv4 libcrc32c vrf]
CPU: 3 PID: 15117 Comm: ip Tainted: G           O     4.16.0+ #5
Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
RIP: 0010:fib6_walk_continue+0x5b/0x140 [ipv6]
RSP: 0018:ffffc90008c3bc10 EFLAGS: 00010287
RAX: ffff88085ac45050 RBX: ffff8807e03008a0 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffc90008c3bc48 RDI: ffffffff8232b240
RBP: ffff880819167600 R08: 0000000000000008 R09: ffff8807dff10071
R10: ffffc90008c3bbd0 R11: 0000000000000000 R12: ffff8807e03008a0
R13: 0000000000000002 R14: ffff8807e05744c8 R15: ffff8807e08ef000
FS:  00007f2f04342700(0000) GS:ffff88087fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 00000007e0556002 CR4: 00000000003606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  inet6_dump_fib+0x14b/0x2c0 [ipv6]
  netlink_dump+0x216/0x2a0
  netlink_recvmsg+0x254/0x400
  ? copy_msghdr_from_user+0xb5/0x110
  ___sys_recvmsg+0xe9/0x230
  ? find_held_lock+0x3b/0xb0
  ? __handle_mm_fault+0x617/0x1180
  ? __audit_syscall_entry+0xb3/0x110
  ? __sys_recvmsg+0x39/0x70
  __sys_recvmsg+0x39/0x70
  do_syscall_64+0x63/0x120
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x7f2f03a72030
RSP: 002b:00007fffab3de508 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
RAX: ffffffffffffffda RBX: 00007fffab3e641c RCX: 00007f2f03a72030
RDX: 0000000000000000 RSI: 00007fffab3de570 RDI: 0000000000000004
RBP: 0000000000000000 R08: 0000000000007e6c R09: 00007fffab3e63a8
R10: 00007fffab3de5b0 R11: 0000000000000246 R12: 00007fffab3e6608
R13: 000000000066b460 R14: 0000000000007e6c R15: 0000000000000000
Code: 85 d2 74 17 f6 40 2a 04 74 11 8b 53 2c 85 d2 0f 84 d7 00 00 00 83 ea 01 89 53 2c c7 4
RIP: fib6_walk_continue+0x5b/0x140 [ipv6] RSP: ffffc90008c3bc10
CR2: 0000000000000008
---[ end trace bd03458864eb266c ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 10 seconds..
ACPI MEMORY or I/O RESET_REG.

>
> So, though I don't know the right way to fix it, the patch below appears
> to make the system not crash.
>
>
> diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
> index 68b9cc7..bf19a14 100644
> --- a/net/ipv6/ip6_fib.c
> +++ b/net/ipv6/ip6_fib.c
> @@ -1614,6 +1614,12 @@ static int fib6_walk_continue(struct fib6_walker *w)
>                         pn = fn->parent;
>                         w->node = pn;
>  #ifdef CONFIG_IPV6_SUBTREES
> +                       if (WARN_ON_ONCE(!pn)) {
> +                               pr_err("FWS-U, w: %p  fn: %p  pn: %p\n",
> +                                      w, fn, pn);
> +                               /* Attempt to work around crash that has been here forever. --Ben */
> +                               return 0;
> +                       }
>                         if (FIB6_SUBTREE(pn) == fn) {
>                                 WARN_ON(!(fn->fn_flags & RTN_ROOT));
>                                 w->state = FWS_L;
>
>
>
> The printout looks like this (when adding 4000 mac-vlans, so it is pretty rare).  PN is definitely NULL sometimes:
>
> [root@2u-6n ~]# journalctl -f|grep FWS
> Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: ffff8807ea121ba0  fn: ffff880856a09260  pn:           (null)
> Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: ffff8807e3963de0  fn: ffff880856a09260  pn:           (null)
> Jan 24 15:51:15 2u-6n kernel: IPv6: FWS-U, w: ffff88081ac22de0  fn: ffff880856a09260  pn:           (null)
> Jan 24 15:53:13 2u-6n kernel: IPv6: FWS-U, w: ffff8808290c69c0  fn: ffff8807e369f920  pn:           (null)
> Jan 24 15:53:24 2u-6n kernel: IPv6: FWS-U, w: ffff8807ea3156c0  fn: ffff88082d1eeb60  pn:           (null)
>
>
>
> 8066 Jan 24 15:48:04 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1006
>  8067 Jan 24 15:48:05 2u-6n kernel: ------------[ cut here ]------------
>  8068 Jan 24 15:48:05 2u-6n kernel: WARNING: CPU: 5 PID: 3346 at /home/greearb/git/linux-4.13.dev.y/net/ipv6/ip6_fib.c:1617 fib6_walk_continue+ 0x154/0x1b0 [ipv6]
>  8069 Jan 24 15:48:05 2u-6n kernel: Modules linked in: 8021q garp mrp stp llc fuse macvlan wanlink(O) pktgen ipmi_ssif coretemp intel_rapl            sb_edac
> x86_pkg_temp_thermal intel_powerclamp kvm_intel kvm ath9k irqbypass iTCO_wdt ath9k_common iTCO_vendor_support ath9k_hw ath              i2c_i801 mac80211 joydev
> lpc_ich cfg80211 ioatdma shpchp tpm_tis tpm_tis_core wmi tpm ipmi_si ipmi_devintf ipmi_msghandler acpi_pad             acpi_power_meter nfsd auth_rpcgss nfs_acl
> sch_fq_codel lockd grace sunrpc ast drm_kms_helper ttm drm igb hwmon ptp pps_core dca                 i2c_algo_bit i2c_core ipv6 crc_ccitt
>  8070 Jan 24 15:48:05 2u-6n kernel: CPU: 5 PID: 3346 Comm: ip Tainted: G           O    4.13.16+ #22
>  8071 Jan 24 15:48:05 2u-6n kernel: Hardware name: Iron_Systems,Inc CS-CAD-2U-A02/X10SRL-F, BIOS 2.0b 05/02/2017
>  8072 Jan 24 15:48:05 2u-6n kernel: task: ffff8807e9ef1dc0 task.stack: ffffc9002083c000
>  8073 Jan 24 15:48:05 2u-6n kernel: RIP: 0010:fib6_walk_continue+0x154/0x1b0 [ipv6]
>  8074 Jan 24 15:48:05 2u-6n kernel: RSP: 0018:ffffc9002083fbc0 EFLAGS: 00010246
>  8075 Jan 24 15:48:05 2u-6n kernel: RAX: 0000000000000000 RBX: ffff8807ea121ba0 RCX: 0000000000000000
>  8076 Jan 24 15:48:05 2u-6n kernel: RDX: ffff880856a09260 RSI: ffffc9002083fc00 RDI: ffffffff81ef2140
>  8077 Jan 24 15:48:05 2u-6n kernel: RBP: ffffc9002083fbc8 R08: 0000000000000008 R09: ffff8807e36f6b25
>  8078 Jan 24 15:48:05 2u-6n kernel: R10: ffffc9002083fb70 R11: 0000000000000000 R12: 0000000000000002
>  8079 Jan 24 15:48:05 2u-6n kernel: R13: 0000000000000002 R14: ffff8807ea121ba0 R15: ffff8807ebcc8d80
>  8080 Jan 24 15:48:05 2u-6n kernel: FS:  00007f77a5d0f700(0000) GS:ffff88087fd40000(0000) knlGS:0000000000000000
>  8081 Jan 24 15:48:05 2u-6n kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>  8082 Jan 24 15:48:05 2u-6n kernel: CR2: 0000000003d56c88 CR3: 00000007f3106000 CR4: 00000000003406e0
>  8083 Jan 24 15:48:05 2u-6n kernel: DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>  8084 Jan 24 15:48:05 2u-6n kernel: DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>  8085 Jan 24 15:48:05 2u-6n kernel: Call Trace:
>  8086 Jan 24 15:48:05 2u-6n kernel:  inet6_dump_fib+0x1ab/0x2a0 [ipv6]
>  8087 Jan 24 15:48:05 2u-6n kernel:  netlink_dump+0x22c/0x2b0
>  8088 Jan 24 15:48:05 2u-6n kernel:  netlink_recvmsg+0x260/0x3f0
>  8089 Jan 24 15:48:05 2u-6n kernel:  sock_recvmsg+0x38/0x40
>  8090 Jan 24 15:48:05 2u-6n kernel:  ___sys_recvmsg+0xe9/0x230
>  8091 Jan 24 15:48:05 2u-6n kernel:  ? alloc_pages_vma+0x83/0x1e0
>  8092 Jan 24 15:48:05 2u-6n kernel:  ? page_add_new_anon_rmap+0x88/0xc0
>  8093 Jan 24 15:48:05 2u-6n kernel:  ? lru_cache_add_active_or_unevictable+0x31/0xb0
>  8094 Jan 24 15:48:05 2u-6n kernel:  ? __handle_mm_fault+0x5e5/0xfa0
>  8095 Jan 24 15:48:05 2u-6n kernel:  __sys_recvmsg+0x3d/0x70
>  8096 Jan 24 15:48:05 2u-6n kernel:  ? __sys_recvmsg+0x3d/0x70
>  8097 Jan 24 15:48:05 2u-6n kernel:  SyS_recvmsg+0xd/0x20
>  8098 Jan 24 15:48:05 2u-6n kernel:  do_syscall_64+0x56/0xc0
>  8099 Jan 24 15:48:05 2u-6n kernel:  entry_SYSCALL64_slow_path+0x25/0x25
>  8100 Jan 24 15:48:05 2u-6n kernel: RIP: 0033:0x7f77a5644030
>  8101 Jan 24 15:48:05 2u-6n kernel: RSP: 002b:00007ffc3e783e68 EFLAGS: 00000246 ORIG_RAX: 000000000000002f
>  8102 Jan 24 15:48:05 2u-6n kernel: RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f77a5644030
>  8103 Jan 24 15:48:05 2u-6n kernel: RDX: 0000000000000000 RSI: 00007ffc3e783ed0 RDI: 0000000000000004
>  8104 Jan 24 15:48:05 2u-6n kernel: RBP: 00007ffc3e787ef4 R08: 0000000000003fe4 R09: 0
> 8105 Jan 24 15:48:05 2u-6n kernel: R10: 00007ffc3e783f10 R11: 0000000000000246 R12: 000000000064f360
>  8106 Jan 24 15:48:05 2u-6n kernel: R13: 00007ffc3e787f60 R14: 0000000000003fe4 R15: 0000000000000000
>  8107 Jan 24 15:48:05 2u-6n kernel: Code: ff 24 c5 a8 e5 04 a0 f6 42 2a 02 74 68 c7 43 28 01 00 00 00 48 89 c2 e9 c7 fe ff ff c7 43 28 02 00 00       00 48 89
> c2 e9 b8 fe ff ff <0f> ff 31 c9 48 89 de 48 c7 c7 78 36 05 a0 e8 65 e4 14 e1 31 c0
>  8108 Jan 24 15:48:05 2u-6n kernel: ---[ end trace 1d1c7028c9dec459 ]---
>  8109 Jan 24 15:48:05 2u-6n kernel: IPv6: FWS-U, w: ffff8807ea121ba0  fn: ffff880856a09260  pn:           (null)
>  8110 Jan 24 15:48:05 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1008
>  8111 Jan 24 15:48:05 2u-6n kernel: 8021q: adding VLAN 0 to HW filter on device eth2#1009
> ....
>
> Thanks,
> Ben
>


-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com

^ permalink raw reply

* Re: [PATCH] samples/bpf: correct comment in sock_example.c
From: Alexei Starovoitov @ 2018-04-17 23:39 UTC (permalink / raw)
  To: Wang Sheng-Hui; +Cc: ast, daniel, netdev
In-Reply-To: <20180417022520.2412-1-shhuiw@foxmail.com>

On Tue, Apr 17, 2018 at 10:25:20AM +0800, Wang Sheng-Hui wrote:
> The program run against loopback interace "lo", not "eth0".
> Correct the comment.
> 
> Signed-off-by: Wang Sheng-Hui <shhuiw@foxmail.com>

Acked-by: Alexei Starovoitov <ast@kernel.org>

for future patches please use the following format for the subject:
[PATCH bpf-next] samples/bpf: ...

^ permalink raw reply

* [PATCH net] net: qualcomm: rmnet: Fix warning seen with fill_info
From: Subash Abhinov Kasiviswanathan @ 2018-04-17 23:40 UTC (permalink / raw)
  To: davem, netdev; +Cc: Subash Abhinov Kasiviswanathan

When the last rmnet device attached to a real device is removed, the
real device is unregistered from rmnet. As a result, the real device
lookup fails resulting in a warning when the fill_info handler is
called as part of the rmnet device unregistration.

Fix this by returning the rmnet flags as 0 when no real device is
present.

WARNING: CPU: 0 PID: 1779 at net/core/rtnetlink.c:3254
rtmsg_ifinfo_build_skb+0xca/0x10d
Modules linked in:
CPU: 0 PID: 1779 Comm: ip Not tainted 4.16.0-11872-g7ce2367 #1
Stack:
 7fe655f0 60371ea3 00000000 00000000
 60282bc6 6006b116 7fe65600 60371ee8
 7fe65660 6003a68c 00000000 900000000
Call Trace:
 [<6006b116>] ? printk+0x0/0x94
 [<6001f375>] show_stack+0xfe/0x158
 [<60371ea3>] ? dump_stack_print_info+0xe8/0xf1
 [<60282bc6>] ? rtmsg_ifinfo_build_skb+0xca/0x10d
 [<6006b116>] ? printk+0x0/0x94
 [<60371ee8>] dump_stack+0x2a/0x2c
 [<6003a68c>] __warn+0x10e/0x13e
 [<6003a82c>] warn_slowpath_null+0x48/0x4f
 [<60282bc6>] rtmsg_ifinfo_build_skb+0xca/0x10d
 [<60282c4d>] rtmsg_ifinfo_event.part.37+0x1e/0x43
 [<60282c2f>] ? rtmsg_ifinfo_event.part.37+0x0/0x43
 [<60282d03>] rtmsg_ifinfo+0x24/0x28
 [<60264e86>] dev_close_many+0xba/0x119
 [<60282cdf>] ? rtmsg_ifinfo+0x0/0x28
 [<6027c225>] ? rtnl_is_locked+0x0/0x1c
 [<6026ca67>] rollback_registered_many+0x1ae/0x4ae
 [<600314be>] ? unblock_signals+0x0/0xae
 [<6026cdc0>] ? unregister_netdevice_queue+0x19/0xec
 [<6026ceec>] unregister_netdevice_many+0x21/0xa1
 [<6027c765>] rtnl_delete_link+0x3e/0x4e
 [<60280ecb>] rtnl_dellink+0x262/0x29c
 [<6027c241>] ? rtnl_get_link+0x0/0x3e
 [<6027f867>] rtnetlink_rcv_msg+0x235/0x274

Fixes: be81a85f5f87 ("net: qualcomm: rmnet: Implement fill_info")
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
---
 drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
index d339885..5f4e447 100644
--- a/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
+++ b/drivers/net/ethernet/qualcomm/rmnet/rmnet_config.c
@@ -350,15 +350,16 @@ static int rmnet_fill_info(struct sk_buff *skb, const struct net_device *dev)
 
 	real_dev = priv->real_dev;
 
-	if (!rmnet_is_real_dev_registered(real_dev))
-		return -ENODEV;
-
 	if (nla_put_u16(skb, IFLA_RMNET_MUX_ID, priv->mux_id))
 		goto nla_put_failure;
 
-	port = rmnet_get_port_rtnl(real_dev);
+	if (rmnet_is_real_dev_registered(real_dev)) {
+		port = rmnet_get_port_rtnl(real_dev);
+		f.flags = port->data_format;
+	} else {
+		f.flags = 0;
+	}
 
-	f.flags = port->data_format;
 	f.mask  = ~0;
 
 	if (nla_put(skb, IFLA_RMNET_FLAGS, sizeof(f), &f))
-- 
1.9.1

^ permalink raw reply related

* Re: [patch net-next RFC 00/12] devlink: introduce port flavours and common phys_port_name generation
From: Jakub Kicinski @ 2018-04-17 23:42 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Jiri Pirko, Linux Netdev List, David Miller, Ido Schimmel, mlxsw,
	Andrew Lunn, Vivien Didelot, Florian Fainelli, Michael Chan,
	Ganesh Goudar, Saeed Mahameed, Simon Horman,
	Pieter Jansen van Vuuren, John Hurley, Dirk van der Merwe,
	Alexander Duyck, Or Gerlitz, David Ahern, vijaya.guvva
In-Reply-To: <CAJ3xEMhX621ed7s-z8OO6rmwP2avczk-bh6HpERwwAU+ufGzTw@mail.gmail.com>

On Tue, 17 Apr 2018 16:23:48 +0300, Or Gerlitz wrote:
> On Thu, Mar 22, 2018 at 1:55 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> > From: Jiri Pirko <jiri@mellanox.com>
> >
> > This patchset resolves 2 issues we have right now:
> > 1) There are many netdevices / ports in the system, for port, pf, vf
> >    represenatation but the user has no way to see which is which
> > 2) The ndo_get_phys_port_name is implemented in each driver separatelly,
> >    which may lead to inconsistent names between drivers.
> >
> > This patchset introduces port flavours which should address the first
> > problem. I'm testing this with Netronome nfp hardware. When the user
> > has 2 physical ports, 1 pf, and 4 vfs, he should see something like this:  
> 
> J/J (Jiri/Jakub) --
> 
> re "2 physical ports, 1 pf, and 4 vfs" --- does NFP exposes one PF for
> both physical ports?

Yes there are multiple PCIe PFs on the card, but the basic CX card just
uses one for all uplinks (like mlx4).

> FWIW note that in mlx5 and AFAIK any other device except for mlx4 (...)
> folks have FPP (Function Per Port) scheme.
> 
> [..]
> 
> > The desired output should look like this:
> > # devlink port
> > pci/0000:05:00.0/0: type eth netdev enp5s0np0 flavour physical number 0
> > pci/0000:05:00.0/1: type eth netdev enp5s0np1 flavour physical number 1
> > pci/0000:05:00.0/2: type eth netdev enp5s0npf0 flavour pf_rep number 0
> > pci/0000:05:00.0/3: type eth netdev enp5s0nvf0 flavour vf_rep number 0
> > pci/0000:05:00.0/4: type eth netdev enp5s0nvf1 flavour vf_rep number 1
> > pci/0000:05:00.0/5: type eth netdev enp5s0nvf2 flavour vf_rep number 2
> > pci/0000:05:00.0/6: type eth netdev enp5s0nvf3 flavour vf_rep number 3
> > As you can see, the netdev names are generated according to the flavour
> > and port number. In case the port is split, the split subnumber is also included.  
> 
> What is the purpose/role in getting dev link ports here? is it such
> that @ the end
> of the day the driver would do a devlink_port_get_phys_port_name() call in their
> get phys port name ndo? or we buy more advantages out of doing so?

IMHO having way to get all netdevs and the netdev type from devlink is
quite user friendly.  As of today we also use the devlink ports for port
splitting on 40/100G parts.  Hopefully more functionality migrates over
from ethtool over time.

^ permalink raw reply

* Re: fix for bnx2x panic during ethtool reporting
From: Florian Fainelli @ 2018-04-17 23:42 UTC (permalink / raw)
  To: Sebastian Kuzminsky, linux-kernel, netdev, ariel.elior,
	everest-linux-l2
In-Reply-To: <CAOTh5U2NwGU68cwzX2amhzPyj_=f0pm4CO4biR4m1DX49MJkUw@mail.gmail.com>

+netdev, Ariel,

On 04/17/2018 10:21 AM, Sebastian Kuzminsky wrote:
> "ethtool -i" on a bnx2x interface causes kernel panic when the
> firmware version is longer than expected.  The attached patch fixes
> the problem by simplifying the string handling in bnx2x_fill_fw_str().
> It applies cleanly to 4.14 and 4.17-rc1.

If you want to have a chance of getting your patch included, your should
make sure you copy the driver maintainers and the network mailinglist,
doing that.
-- 
Florian

^ permalink raw reply

* Re: [PATCH net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Alexei Starovoitov @ 2018-04-17 23:44 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Yafang Shao, davem, songliubraving, netdev, linux-kernel
In-Reply-To: <d2ab1647-c832-c982-5952-ab7a415f7c76@gmail.com>

On Mon, Apr 16, 2018 at 08:43:31AM -0700, Eric Dumazet wrote:
> 
> 
> On 04/16/2018 08:33 AM, Yafang Shao wrote:
> > tcp_rcv_space_adjust is called every time data is copied to user space,
> > introducing a tcp tracepoint for which could show us when the packet is
> > copied to user.
> > This could help us figure out whether there's latency in user process.
> > 
> > When a tcp packet arrives, tcp_rcv_established() will be called and with
> > the existed tracepoint tcp_probe we could get the time when this packet
> > arrives.
> > Then this packet will be copied to user, and tcp_rcv_space_adjust will
> > be called and with this new introduced tracepoint we could get the time
> > when this packet is copied to user.
> > 
> > 	arrives time : user process time    => latency caused by user
> > 	tcp_probe      tcp_rcv_space_adjust
> > 
> > Hence in the prink message, sk is printed as a key to connect these two
> > tracepoints.
> > 
> 
> socket pointer is not a key.
> 
> TCP sockets can be reused pretty fast after free.
> 
> I suggest you go for cookie instead, this is an unique 64bit identifier.
> ( sock_gen_cookie() for details )

I think would be even better if the stack would do this sock_gen_cookie()
on its own in some way that user cannnot infere the order.
In many cases we wanted to use socket cookie, but since it's not inited
by default it's kinda useless.
Turning this tracepoint on just to get cookie would be an ugly workaround.

^ permalink raw reply

* Re: [RFC PATCH v3 bpf-next 2/5] bpf/verifier: rewrite subprog boundary detection
From: Alexei Starovoitov @ 2018-04-17 23:48 UTC (permalink / raw)
  To: Edward Cree; +Cc: Daniel Borkmann, netdev
In-Reply-To: <99e70dfe-66a1-911a-6616-60eae4ddc689@solarflare.com>

On Fri, Apr 06, 2018 at 06:13:59PM +0100, Edward Cree wrote:
> By storing a subprogno in each insn's aux data, we avoid the need to keep
>  the list of subprog starts sorted or bsearch() it in find_subprog().
> Also, get rid of the weird one-based indexing of subprog numbers.
> 
> Signed-off-by: Edward Cree <ecree@solarflare.com>
> ---
>  include/linux/bpf_verifier.h |   3 +-
>  kernel/bpf/verifier.c        | 284 ++++++++++++++++++++++++++-----------------
>  2 files changed, 177 insertions(+), 110 deletions(-)
> 
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 8f70dc181e23..17990dd56e65 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -146,6 +146,7 @@ struct bpf_insn_aux_data {
>  		s32 call_imm;			/* saved imm field of call insn */
>  	};
>  	int ctx_field_size; /* the ctx field size for load insn, maybe 0 */
> +	u16 subprogno; /* subprog in which this insn resides */
>  	bool seen; /* this insn was processed by the verifier */
>  };

as I was saying before this is no go.
subprogno is meaningless in the hierarchy of: prog -> func -> bb -> insn
Soon bpf will have libraries and this field would need to become
a pointer back to bb or func structure creating unnecessary circular dependency.

^ permalink raw reply

* Re: One question about __tcp_select_window()
From: Wang Jian @ 2018-04-18  0:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev
In-Reply-To: <6b7c1a82-e253-09bd-fd3e-363ce6a84653@gmail.com>

Thanks for your reply, Eric.

Actually, this is a query about the code while I am reading code.
>From my instinct and the comment, I think we should choose the bigger
one but maybe I miss something(like your said, autotuning)
Anyway, I will read more codes and do more tests.

Thanks.

On Tue, Apr 17, 2018 at 10:43 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>
> On 04/17/2018 06:53 AM, Wang Jian wrote:
>> I test the fix with 4.17.0-rc1+ and it seems work.
>>
>> 1. iperf -c IP -i 20 -t 60 -w 1K
>>  with-fix vs without-fix : 1.15Gbits/sec vs 1.05Gbits/sec
>> I also try other windows and have similar results.
>>
>> 2. Use tcp probe trace snd_wind.
>> with-fix vs without-fix: 1245568 vs 1042816
>>
>> 3. I don't see extra retransmit/drops.
>>
>
> Unfortunately I have no idea what exact problem you had to solve.
>
> Setting small windows is not exactly the path we are taking.
>
> And I do not know how many side effects your change will have for 'standard' flows
> using autotuning or sane windows.
>

^ permalink raw reply

* Re: [RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice
From: Siwei Liu @ 2018-04-18  0:26 UTC (permalink / raw)
  To: David Miller
  Cc: David Ahern, Jiri Pirko, si-wei liu, Michael S. Tsirkin,
	Stephen Hemminger, Alexander Duyck, Brandeburg, Jesse,
	Jakub Kicinski, Jason Wang, Samudrala, Sridhar, Netdev,
	virtualization, virtio-dev
In-Reply-To: <CADGSJ233AmbBrTP-A_u3-5fkr=HUJka_xcSNWxO1HWGua1J92Q@mail.gmail.com>

I ran this with a few folks offline and gathered some good feedbacks
that I'd like to share thus revive the discussion.

First of all, as illustrated in the reply below, cloud service
providers require transparent live migration. Specifically, the main
target of our case is to support SR-IOV live migration via kernel
upgrade while keeping the userspace of old distros unmodified. If it's
because this use case is not appealing enough for the mainline to
adopt, I will shut up and not continue discussing, although
technically it's entirely possible (and there's precedent in other
implementation) to do so to benefit any cloud service providers.

If it's just the implementation of hiding netdev itself needs to be
improved, such as implementing it as attribute flag or adding linkdump
API, that's completely fine and we can look into that. However, the
specific issue needs to be undestood beforehand is to make transparent
SR-IOV to be able to take over the name (so inherit all the configs)
from the lower netdev, which needs some games with uevents and name
space reservation. So far I don't think it's been well discussed.

One thing in particular I'd like to point out is that the 3-netdev
model currently missed to address the core problem of live migration:
migration of hardware specific feature/state, for e.g. ethtool configs
and hardware offloading states. Only general network state (IP
address, gateway, for eg.) associated with the bypass interface can be
migrated. As a follow-up work, bypass driver can/should be enhanced to
save and apply those hardware specific configs before or after
migration as needed. The transparent 1-netdev model being proposed as
part of this patch series will be able to solve that problem naturally
by making all hardware specific configurations go through the central
bypass driver, such that hardware configurations can be replayed when
new VF or passthrough gets plugged back in. Although that
corresponding function hasn't been implemented today, I'd like to
refresh everyone's mind that is the core problem any live migration
proposal should have addressed.

If it would make things more clear to defer netdev hiding until all
functionalities regarding centralizing and replay are implemented,
we'd take advices like that and move on to implementing those features
as follow-up patches. Once all needed features get done, we'd resume
the work for hiding lower netdev at that point. Think it would be the
best to make everyone understand the big picture in advance before
going too far.

Thanks, comments welcome.

-Siwei


On Mon, Apr 9, 2018 at 11:48 PM, Siwei Liu <loseweigh@gmail.com> wrote:
> On Sun, Apr 8, 2018 at 9:32 AM, David Miller <davem@davemloft.net> wrote:
>> From: Siwei Liu <loseweigh@gmail.com>
>> Date: Fri, 6 Apr 2018 19:32:05 -0700
>>
>>> And I assume everyone here understands the use case for live
>>> migration (in the context of providing cloud service) is very
>>> different, and we have to hide the netdevs. If not, I'm more than
>>> happy to clarify.
>>
>> I think you still need to clarify.
>
> OK. The short answer is cloud users really want *transparent* live migration.
>
> By being transparent it means they don't and shouldn't care about the
> existence and the occurence of live migration, but they do if
> userspace toolstack and libraries have to be updated or modified,
> which means potential dependency brokeness of their applications. They
> don't like any change to the userspace envinroment (existing apps
> lift-and-shift, no recompilation, no re-packaging, no re-certification
> needed), while no one barely cares about ABI or API compatibility in
> the kernel level, as long as their applications don't break.
>
> I agree the current bypass solution for SR-IOV live migration requires
> guest cooperation. Though it doesn't mean guest *userspace*
> cooperation. As a matter of fact, techinically it shouldn't invovle
> userspace at all to get SR-IOV migration working. It's the kernel that
> does the real work. If I understand the goal of this in-kernel
> approach correctly, it was meant to save userspace from modification
> or corresponding toolstack support, as those additional 2 interfaces
> is more a side product of this approach, rather than being neccessary
> for users to be aware of. All what the user needs to deal with is one
> single interface, and that's what they care about. It's more a trouble
> than help when they see 2 extra interfaces are present. Management
> tools in the old distros don't recoginze them and try to bring up
> those extra interfaces for its own. Various odd warnings start to spew
> out, and there's a lot of caveats for the users to get around...
>
> On the other hand, if we "teach" those cloud users to update the
> userspace toolstack just for trading a feature they don't need, no one
> is likely going to embrace the change. As such there's just no real
> value of adopting this in-kernel bypass facility for any cloud service
> provider. It does not look more appealing than just configure generic
> bonding using its own set of daemons or scripts. But again, cloud
> users don't welcome that facility. And basically it would get to
> nearly the same set of problems if leaving userspace alone.
>
> IMHO we're not hiding the devices, think it the way we're adding a
> feature transparent to user. Those auto-managed slaves are ones users
> don't care about much. And user is still able to see and configure the
> lower netdevs if they really desires to do so. But generally the
> target user for this feature won't need to know that. Why they care
> how many interfaces a VM virtually has rather than how many interfaces
> are actually _useable_ to them??
>
> Thanks,
> -Siwei
>
>
>>
>> netdevs are netdevs.  If they have special attributes, mark them as
>> such and the tools base their actions upon that.
>>
>> "Hiding", or changing classes, doesn't make any sense to me still.

^ permalink raw reply

* [PATCH v2] net: change the comment of dev_mc_init
From: sunlianwen @ 2018-04-18  0:29 UTC (permalink / raw)
  To: netdev

The comment of dev_mc_init() is wrong. which use dev_mc_flush
instead of dev_mc_init.

Signed-off-by: Lianwen Sun <sunlw.fnst@cn.fujitsu.com>
---
  net/core/dev_addr_lists.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/core/dev_addr_lists.c b/net/core/dev_addr_lists.c
index e3e6a3e2ca22..d884d8f5f0e5 100644
--- a/net/core/dev_addr_lists.c
+++ b/net/core/dev_addr_lists.c
@@ -839,7 +839,7 @@ void dev_mc_flush(struct net_device *dev)
  EXPORT_SYMBOL(dev_mc_flush);

  /**
- *     dev_mc_flush - Init multicast address list
+ *     dev_mc_init - Init multicast address list
   *     @dev: device
   *
   *     Init multicast address list.
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH net-next 2/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-04-18  0:30 UTC (permalink / raw)
  To: Gregory Rose; +Cc: netdev
In-Reply-To: <4ac1904b-043f-faaf-043d-0a3c68da8c88@gmail.com>

> s/to commit/from committing/
> s/entry/entries/

Thanks, will fix that in both patches in v2.


> I think this is a great idea but I suggest porting to the iproute2 package
> so everyone can use it.  Then git rid of the OVS specific prefixes.
> Presuming of course that the conntrack connection
> limit backend works there as well I guess.  If it doesn't, then I'd suggest
> extending
> it.  This is a nice feature for all users in my opinion and then OVS
> can take advantage of it as well.

Thanks for the comment.  And yes, I think currently, iptables’s
connlimit extension does support limiting the # of connections.  Users
need to configure the zone properly, and the iptable’s connlimit
extension is using netfilter's nf_conncount backend already.

The main goal for this patch is to utilize netfilter backend
(nf_conncount) to count and limit the number of connections. OVS needs
the proposed OVS_CT_LIMIT netlink API and the corresponding booking
data structure because the current nf_conncount backend only counts
the # of connections, but it does not keep track of the connection
limit in nf_conncount.

Thanks,

-Yi-Hung

^ permalink raw reply

* [PATCH net-next v2 0/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-04-18  0:30 UTC (permalink / raw)
  To: netdev; +Cc: Yi-Hung Wei

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, OVS will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The first patch defines the conntrack limit netlink definition, and the
second patch provides the implementation.

v1->v2:
  - Fixes commit log typos suggested by Greg.
  - Fixes memory free issue that Julia found.

Yi-Hung Wei (2):
  openvswitch: Add conntrack limit netlink definition
  openvswitch: Support conntrack zone limit

 include/uapi/linux/openvswitch.h |  62 +++++
 net/openvswitch/Kconfig          |   3 +-
 net/openvswitch/conntrack.c      | 498 ++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h      |   9 +-
 net/openvswitch/datapath.c       |   7 +-
 net/openvswitch/datapath.h       |   1 +
 6 files changed, 574 insertions(+), 6 deletions(-)

-- 
2.7.4

^ permalink raw reply

* [PATCH net-next v2 1/2] openvswitch: Add conntrack limit netlink definition
From: Yi-Hung Wei @ 2018-04-18  0:30 UTC (permalink / raw)
  To: netdev; +Cc: Yi-Hung Wei
In-Reply-To: <1524011429-14500-1-git-send-email-yihung.wei@gmail.com>

Define netlink messages and attributes to support user kernel
communication that uses the conntrack limit feature.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 include/uapi/linux/openvswitch.h | 62 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/uapi/linux/openvswitch.h b/include/uapi/linux/openvswitch.h
index 713e56ce681f..ca63c16375ce 100644
--- a/include/uapi/linux/openvswitch.h
+++ b/include/uapi/linux/openvswitch.h
@@ -937,4 +937,66 @@ enum ovs_meter_band_type {
 
 #define OVS_METER_BAND_TYPE_MAX (__OVS_METER_BAND_TYPE_MAX - 1)
 
+/* Conntrack limit */
+#define OVS_CT_LIMIT_FAMILY  "ovs_ct_limit"
+#define OVS_CT_LIMIT_MCGROUP "ovs_ct_limit"
+#define OVS_CT_LIMIT_VERSION 0x1
+
+enum ovs_ct_limit_cmd {
+	OVS_CT_LIMIT_CMD_UNSPEC,
+	OVS_CT_LIMIT_CMD_SET,		/* Add or modify ct limit. */
+	OVS_CT_LIMIT_CMD_DEL,		/* Delete ct limit. */
+	OVS_CT_LIMIT_CMD_GET		/* Get ct limit. */
+};
+
+enum ovs_ct_limit_attr {
+	OVS_CT_LIMIT_ATTR_UNSPEC,
+	OVS_CT_LIMIT_ATTR_OPTION,	/* Nested OVS_CT_LIMIT_ATTR_* */
+	__OVS_CT_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_ATTR_MAX (__OVS_CT_LIMIT_ATTR_MAX - 1)
+
+/**
+ * @OVS_CT_ZONE_LIMIT_ATTR_SET_REQ: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a pair of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE and OVS_CT_ZONE_LIMIT_ATTR_LIMIT.
+ * @OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_REQ: Contains OVS_CT_ZONE_LIMIT_ATTR_ZONE.
+ * @OVS_CT_ZONE_LIMIT_ATTR_GET_RLY: Contains either
+ * OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT or a triple of
+ * OVS_CT_ZONE_LIMIT_ATTR_ZONE, OVS_CT_ZONE_LIMIT_ATTR_LIMIT and
+ * OVS_CT_ZONE_LIMIT_ATTR_COUNT.
+ */
+enum ovs_ct_limit_option_attr {
+	OVS_CT_LIMIT_OPTION_ATTR_UNSPEC,
+	OVS_CT_ZONE_LIMIT_ATTR_SET_REQ,	/* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+					 * attributes. */
+	OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ,	/* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+					 * attributes. */
+	OVS_CT_ZONE_LIMIT_ATTR_GET_REQ,	/* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+					 * attributes. */
+	OVS_CT_ZONE_LIMIT_ATTR_GET_RLY,	/* Nested OVS_CT_ZONE_LIMIT_ATTR_*
+					 * attributes. */
+	__OVS_CT_LIMIT_OPTION_ATTR_MAX
+};
+
+#define OVS_CT_LIMIT_OPTION_ATTR_MAX (__OVS_CT_LIMIT_OPTION_ATTR_MAX - 1)
+
+enum ovs_ct_zone_limit_attr {
+	OVS_CT_ZONE_LIMIT_ATTR_UNSPEC,
+	OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,	/* u32 default conntrack limit
+						 * for all zones. */
+	OVS_CT_ZONE_LIMIT_ATTR_ZONE,		/* u16 conntrack zone id. */
+	OVS_CT_ZONE_LIMIT_ATTR_LIMIT,		/* u32 max number of conntrack
+						 * entries allowed in the
+						 * corresponding zone. */
+	OVS_CT_ZONE_LIMIT_ATTR_COUNT,		/* u32 number of conntrack
+						 * entries in the corresponding
+						 * zone. */
+	__OVS_CT_ZONE_LIMIT_ATTR_MAX
+};
+
+#define OVS_CT_ZONE_LIMIT_ATTR_MAX (__OVS_CT_ZONE_LIMIT_ATTR_MAX - 1)
+
 #endif /* _LINUX_OPENVSWITCH_H */
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v2 2/2] openvswitch: Support conntrack zone limit
From: Yi-Hung Wei @ 2018-04-18  0:30 UTC (permalink / raw)
  To: netdev; +Cc: Yi-Hung Wei
In-Reply-To: <1524011429-14500-1-git-send-email-yihung.wei@gmail.com>

Currently, nf_conntrack_max is used to limit the maximum number of
conntrack entries in the conntrack table for every network namespace.
For the VMs and containers that reside in the same namespace,
they share the same conntrack table, and the total # of conntrack entries
for all the VMs and containers are limited by nf_conntrack_max.  In this
case, if one of the VM/container abuses the usage the conntrack entries,
it blocks the others from committing valid conntrack entries into the
conntrack table.  Even if we can possibly put the VM in different network
namespace, the current nf_conntrack_max configuration is kind of rigid
that we cannot limit different VM/container to have different # conntrack
entries.

To address the aforementioned issue, this patch proposes to have a
fine-grained mechanism that could further limit the # of conntrack entries
per-zone.  For example, we can designate different zone to different VM,
and set conntrack limit to each zone.  By providing this isolation, a
mis-behaved VM only consumes the conntrack entries in its own zone, and
it will not influence other well-behaved VMs.  Moreover, the users can
set various conntrack limit to different zone based on their preference.

The proposed implementation utilizes Netfilter's nf_conncount backend
to count the number of connections in a particular zone.  If the number of
connection is above a configured limitation, ovs will return ENOMEM to the
userspace.  If userspace does not configure the zone limit, the limit
defaults to zero that is no limitation, which is backward compatible to
the behavior without this patch.

The following high leve APIs are provided to the userspace:
  - OVS_CT_LIMIT_CMD_SET:
    * set default connection limit for all zones
    * set the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_DEL:
    * remove the connection limit for a particular zone
  - OVS_CT_LIMIT_CMD_GET:
    * get the default connection limit for all zones
    * get the connection limit for a particular zone

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
---
 net/openvswitch/Kconfig     |   3 +-
 net/openvswitch/conntrack.c | 498 +++++++++++++++++++++++++++++++++++++++++++-
 net/openvswitch/conntrack.h |   9 +-
 net/openvswitch/datapath.c  |   7 +-
 net/openvswitch/datapath.h  |   1 +
 5 files changed, 512 insertions(+), 6 deletions(-)

diff --git a/net/openvswitch/Kconfig b/net/openvswitch/Kconfig
index 2650205cdaf9..89da9512ec1e 100644
--- a/net/openvswitch/Kconfig
+++ b/net/openvswitch/Kconfig
@@ -9,7 +9,8 @@ config OPENVSWITCH
 		   (NF_CONNTRACK && ((!NF_DEFRAG_IPV6 || NF_DEFRAG_IPV6) && \
 				     (!NF_NAT || NF_NAT) && \
 				     (!NF_NAT_IPV4 || NF_NAT_IPV4) && \
-				     (!NF_NAT_IPV6 || NF_NAT_IPV6)))
+				     (!NF_NAT_IPV6 || NF_NAT_IPV6) && \
+				     (!NETFILTER_CONNCOUNT || NETFILTER_CONNCOUNT)))
 	select LIBCRC32C
 	select MPLS
 	select NET_MPLS_GSO
diff --git a/net/openvswitch/conntrack.c b/net/openvswitch/conntrack.c
index c5904f629091..d09b572f72b4 100644
--- a/net/openvswitch/conntrack.c
+++ b/net/openvswitch/conntrack.c
@@ -17,7 +17,9 @@
 #include <linux/udp.h>
 #include <linux/sctp.h>
 #include <net/ip.h>
+#include <net/genetlink.h>
 #include <net/netfilter/nf_conntrack_core.h>
+#include <net/netfilter/nf_conntrack_count.h>
 #include <net/netfilter/nf_conntrack_helper.h>
 #include <net/netfilter/nf_conntrack_labels.h>
 #include <net/netfilter/nf_conntrack_seqadj.h>
@@ -76,6 +78,38 @@ struct ovs_conntrack_info {
 #endif
 };
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+#define OVS_CT_LIMIT_UNLIMITED	0
+#define OVS_CT_LIMIT_DEFAULT OVS_CT_LIMIT_UNLIMITED
+#define CT_LIMIT_HASH_BUCKETS 512
+
+struct ovs_ct_limit {
+	/* Elements in ovs_ct_limit_info->limits hash table */
+	struct hlist_node hlist_node;
+	struct rcu_head rcu;
+	u16 zone;
+	u32 limit;
+};
+
+struct ovs_ct_limit_info {
+	u32 default_limit;
+	struct hlist_head *limits;
+	struct nf_conncount_data *data __aligned(8);
+};
+
+static const struct nla_policy ct_limit_policy[OVS_CT_LIMIT_ATTR_MAX + 1] = {
+	[OVS_CT_LIMIT_ATTR_OPTION] = { .type = NLA_NESTED, },
+};
+
+static const struct nla_policy
+	ct_zone_limit_policy[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1] = {
+		[OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT] = { .type = NLA_U32, },
+		[OVS_CT_ZONE_LIMIT_ATTR_ZONE] = { .type = NLA_U16, },
+		[OVS_CT_ZONE_LIMIT_ATTR_LIMIT] = { .type = NLA_U32, },
+		[OVS_CT_ZONE_LIMIT_ATTR_COUNT] = { .type = NLA_U32, },
+};
+#endif
+
 static bool labels_nonzero(const struct ovs_key_ct_labels *labels);
 
 static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info);
@@ -1036,6 +1070,94 @@ static bool labels_nonzero(const struct ovs_key_ct_labels *labels)
 	return false;
 }
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static struct hlist_head *ct_limit_hash_bucket(
+	const struct ovs_ct_limit_info *info, u16 zone)
+{
+	return &info->limits[zone & (CT_LIMIT_HASH_BUCKETS - 1)];
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_set(const struct ovs_ct_limit_info *info,
+			 struct ovs_ct_limit *new_ct_limit)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, new_ct_limit->zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == new_ct_limit->zone) {
+			hlist_replace_rcu(&ct_limit->hlist_node,
+					  &new_ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+
+	hlist_add_head_rcu(&new_ct_limit->hlist_node, head);
+}
+
+/* Call with ovs_mutex */
+static void ct_limit_del(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == zone) {
+			hlist_del_rcu(&ct_limit->hlist_node);
+			kfree_rcu(ct_limit, rcu);
+			return;
+		}
+	}
+}
+
+/* Call with RCU read lock */
+static u32 ct_limit_get(const struct ovs_ct_limit_info *info, u16 zone)
+{
+	struct ovs_ct_limit *ct_limit;
+	struct hlist_head *head;
+
+	head = ct_limit_hash_bucket(info, zone);
+	hlist_for_each_entry_rcu(ct_limit, head, hlist_node) {
+		if (ct_limit->zone == zone)
+			return ct_limit->limit;
+	}
+
+	return info->default_limit;
+}
+
+static int ovs_ct_check_limit(struct net *net,
+			      const struct ovs_conntrack_info *info,
+			      const struct nf_conntrack_tuple *tuple)
+{
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	const struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	u32 per_zone_limit, connections;
+	u32 conncount_key[5];
+
+	conncount_key[0] = info->zone.id;
+
+	rcu_read_lock();
+	per_zone_limit = ct_limit_get(ct_limit_info, info->zone.id);
+	if (per_zone_limit == OVS_CT_LIMIT_UNLIMITED) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	connections = nf_conncount_count(net, ct_limit_info->data,
+					 conncount_key, tuple, &info->zone);
+	if (connections > per_zone_limit) {
+		rcu_read_unlock();
+		return -ENOMEM;
+	}
+
+	rcu_read_unlock();
+	return 0;
+}
+#endif
+
 /* Lookup connection and confirm if unconfirmed. */
 static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 			 const struct ovs_conntrack_info *info,
@@ -1054,6 +1176,13 @@ static int ovs_ct_commit(struct net *net, struct sw_flow_key *key,
 	if (!ct)
 		return 0;
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	err = ovs_ct_check_limit(net, info,
+				 &ct->tuplehash[IP_CT_DIR_ORIGINAL].tuple);
+	if (err)
+		return err;
+#endif
+
 	/* Set the conntrack event mask if given.  NEW and DELETE events have
 	 * their own groups, but the NFNLGRP_CONNTRACK_UPDATE group listener
 	 * typically would receive many kinds of updates.  Setting the event
@@ -1655,7 +1784,364 @@ static void __ovs_ct_free_action(struct ovs_conntrack_info *ct_info)
 		nf_ct_tmpl_free(ct_info->ct);
 }
 
-void ovs_ct_init(struct net *net)
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+static int ovs_ct_limit_init(struct net *net, struct ovs_net *ovs_net)
+{
+	int i, err;
+
+	ovs_net->ct_limit_info = kmalloc(sizeof *ovs_net->ct_limit_info,
+					 GFP_KERNEL);
+	if (!ovs_net->ct_limit_info)
+		return -ENOMEM;
+
+	ovs_net->ct_limit_info->default_limit = OVS_CT_LIMIT_DEFAULT;
+	ovs_net->ct_limit_info->limits =
+		kmalloc_array(CT_LIMIT_HASH_BUCKETS, sizeof(struct hlist_head),
+			      GFP_KERNEL);
+	if (!ovs_net->ct_limit_info->limits) {
+		kfree(ovs_net->ct_limit_info);
+		return -ENOMEM;
+	}
+
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; i++)
+		INIT_HLIST_HEAD(&ovs_net->ct_limit_info->limits[i]);
+
+	ovs_net->ct_limit_info->data =
+		nf_conncount_init(net, NFPROTO_INET, sizeof(u32));
+
+	if (IS_ERR(ovs_net->ct_limit_info->data)) {
+		err = PTR_ERR(ovs_net->ct_limit_info->data);
+		kfree(ovs_net->ct_limit_info->limits);
+		kfree(ovs_net->ct_limit_info);
+		return err;
+	}
+	return 0;
+}
+
+static void ovs_ct_limit_exit(struct net *net, struct ovs_net *ovs_net)
+{
+	const struct ovs_ct_limit_info *info = ovs_net->ct_limit_info;
+	int i;
+
+	nf_conncount_destroy(net, NFPROTO_INET, info->data);
+	for (i = 0; i < CT_LIMIT_HASH_BUCKETS; ++i) {
+		struct hlist_head *head = &info->limits[i];
+		struct ovs_ct_limit *ct_limit;
+
+		hlist_for_each_entry_rcu(ct_limit, head, hlist_node)
+			kfree_rcu(ct_limit, rcu);
+	}
+	kfree(ovs_net->ct_limit_info->limits);
+	kfree(ovs_net->ct_limit_info);
+}
+
+static struct sk_buff *
+ovs_ct_limit_cmd_reply_start(struct genl_info *info, u8 cmd,
+			     struct ovs_header **ovs_reply_header)
+{
+	struct sk_buff *skb;
+	struct ovs_header *ovs_header = info->userhdr;
+
+	skb = genlmsg_new(NLMSG_DEFAULT_SIZE, GFP_KERNEL);
+	if (!skb)
+		return ERR_PTR(-ENOMEM);
+
+	*ovs_reply_header = genlmsg_put(skb, info->snd_portid,
+					info->snd_seq,
+					&dp_ct_limit_genl_family, 0, cmd);
+
+	if (!*ovs_reply_header) {
+		nlmsg_free(skb);
+		return ERR_PTR(-EMSGSIZE);
+	}
+	(*ovs_reply_header)->dp_ifindex = ovs_header->dp_ifindex;
+
+	return skb;
+}
+
+static int ovs_ct_limit_set_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct nlattr *nla;
+	int rem, err;
+
+	nla_for_each_nested(nla, nla_zone_limit, rem) {
+		struct nlattr *attr[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1];
+		struct ovs_ct_limit *ct_limit;
+
+		if (nla_type(nla) != OVS_CT_ZONE_LIMIT_ATTR_SET_REQ)
+			return  -EINVAL;
+
+		err = nla_parse((struct nlattr **)&attr,
+				OVS_CT_ZONE_LIMIT_ATTR_MAX, nla_data(nla),
+				nla_len(nla), ct_zone_limit_policy, NULL);
+		if (err)
+			return err;
+
+		if (attr[OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT]) {
+			u32 default_limit = nla_get_u32(
+				attr[OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT]);
+			ovs_lock();
+			info->default_limit = default_limit;
+			ovs_unlock();
+		} else {
+			if (!attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE] ||
+			    !attr[OVS_CT_ZONE_LIMIT_ATTR_LIMIT]) {
+				return -EINVAL;
+			}
+
+			ct_limit = kmalloc(sizeof(*ct_limit), GFP_KERNEL);
+			if (!ct_limit)
+				return -ENOMEM;
+
+			ct_limit->zone = nla_get_u16(
+				attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE]);
+			ct_limit->limit = nla_get_u32(
+				attr[OVS_CT_ZONE_LIMIT_ATTR_LIMIT]);
+
+			ovs_lock();
+			ct_limit_set(info, ct_limit);
+			ovs_unlock();
+		}
+	}
+	return 0;
+}
+
+static int ovs_ct_limit_del_zone_limit(struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info)
+{
+	struct nlattr *nla;
+	int rem, err;
+
+	nla_for_each_nested(nla, nla_zone_limit, rem) {
+		struct nlattr *attr[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1];
+		u16 zone;
+
+		if (nla_type(nla) != OVS_CT_ZONE_LIMIT_ATTR_DEL_REQ)
+			return  -EINVAL;
+
+		err = nla_parse((struct nlattr **)&attr,
+				OVS_CT_ZONE_LIMIT_ATTR_MAX, nla_data(nla),
+				nla_len(nla), ct_zone_limit_policy, NULL);
+		if (err)
+			return err;
+
+		if (!attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE])
+			return -EINVAL;
+
+		zone = nla_get_u16(attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE]);
+
+		ovs_lock();
+		ct_limit_del(info, zone);
+		ovs_unlock();
+	}
+	return 0;
+}
+
+static int ovs_ct_limit_get_default_limit(struct ovs_ct_limit_info *info,
+					  struct sk_buff *reply)
+{
+	int err;
+	struct nlattr *nla_nested;
+
+	nla_nested = nla_nest_start(reply, OVS_CT_ZONE_LIMIT_ATTR_GET_RLY);
+
+	err = nla_put_u32(reply, OVS_CT_ZONE_LIMIT_ATTR_DEFAULT_LIMIT,
+			  info->default_limit);
+	if (err)
+		return err;
+
+	nla_nest_end(reply, nla_nested);
+	return 0;
+}
+
+static int ovs_ct_limit_get_zone_limit(struct net *net,
+				       struct nlattr *nla_zone_limit,
+				       struct ovs_ct_limit_info *info,
+				       struct sk_buff *reply)
+{
+	struct nlattr *nla, *nla_nested;
+	int rem, err;
+	u16 zone;
+	u32 limit, count, conncount_key[5];
+	struct nf_conntrack_zone ct_zone;
+
+	nla_for_each_nested(nla, nla_zone_limit, rem) {
+		struct nlattr *attr[OVS_CT_ZONE_LIMIT_ATTR_MAX + 1];
+
+		if (nla_type(nla) != OVS_CT_ZONE_LIMIT_ATTR_GET_REQ)
+			return -EINVAL;
+
+		err = nla_parse((struct nlattr **)&attr,
+				OVS_CT_ZONE_LIMIT_ATTR_MAX, nla_data(nla),
+				nla_len(nla), ct_zone_limit_policy, NULL);
+		if (err)
+			return err;
+
+		if (!attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE])
+			return -EINVAL;
+
+		zone = nla_get_u16(attr[OVS_CT_ZONE_LIMIT_ATTR_ZONE]);
+		nf_ct_zone_init(&ct_zone, zone, NF_CT_DEFAULT_ZONE_DIR, 0);
+		rcu_read_lock();
+		limit = ct_limit_get(info, zone);
+		rcu_read_unlock();
+
+		conncount_key[0] = zone;
+		count = nf_conncount_count(net, info->data, conncount_key,
+					   NULL, &ct_zone);
+
+		nla_nested = nla_nest_start(reply,
+					    OVS_CT_ZONE_LIMIT_ATTR_GET_RLY);
+		if (nla_put_u16(reply, OVS_CT_ZONE_LIMIT_ATTR_ZONE, zone) ||
+		    nla_put_u32(reply, OVS_CT_ZONE_LIMIT_ATTR_LIMIT, limit) ||
+		    nla_put_u32(reply, OVS_CT_ZONE_LIMIT_ATTR_COUNT, count))
+			return -EMSGSIZE;
+		nla_nest_end(reply, nla_nested);
+	}
+
+	return 0;
+}
+
+static int ovs_ct_limit_cmd_set(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_SET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	if (!a[OVS_CT_LIMIT_ATTR_OPTION])
+		return -EINVAL;
+
+	err = ovs_ct_limit_set_zone_limit(a[OVS_CT_LIMIT_ATTR_OPTION],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_del(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct ovs_net *ovs_net = net_generic(sock_net(skb->sk), ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_DEL,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	if (!a[OVS_CT_LIMIT_ATTR_OPTION])
+		return -EINVAL;
+
+	err = ovs_ct_limit_del_zone_limit(a[OVS_CT_LIMIT_ATTR_OPTION],
+					  ct_limit_info);
+	if (err)
+		goto exit_err;
+
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static int ovs_ct_limit_cmd_get(struct sk_buff *skb, struct genl_info *info)
+{
+	struct nlattr **a = info->attrs;
+	struct nlattr *nla_reply;
+	struct sk_buff *reply;
+	struct ovs_header *ovs_reply_header;
+	struct net *net = sock_net(skb->sk);
+	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
+	struct ovs_ct_limit_info *ct_limit_info = ovs_net->ct_limit_info;
+	int err;
+
+	reply = ovs_ct_limit_cmd_reply_start(info, OVS_CT_LIMIT_CMD_GET,
+					     &ovs_reply_header);
+	if (IS_ERR(reply))
+		return PTR_ERR(reply);
+
+	nla_reply = nla_nest_start(reply, OVS_CT_LIMIT_ATTR_OPTION);
+
+	err = ovs_ct_limit_get_default_limit(ct_limit_info, reply);
+	if (err)
+		goto exit_err;
+
+	if (a[OVS_CT_LIMIT_ATTR_OPTION]) {
+		err = ovs_ct_limit_get_zone_limit(
+			net, a[OVS_CT_LIMIT_ATTR_OPTION], ct_limit_info,
+			reply);
+		if (err)
+			goto exit_err;
+	}
+
+	nla_nest_end(reply, nla_reply);
+	genlmsg_end(reply, ovs_reply_header);
+	return genlmsg_reply(reply, info);
+
+exit_err:
+	nlmsg_free(reply);
+	return err;
+}
+
+static struct genl_ops ct_limit_genl_ops[] = {
+	{ .cmd = OVS_CT_LIMIT_CMD_SET,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_set,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_DEL,
+		.flags = GENL_ADMIN_PERM, /* Requires CAP_NET_ADMIN
+					   * privilege. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_del,
+	},
+	{ .cmd = OVS_CT_LIMIT_CMD_GET,
+		.flags = 0,		  /* OK for unprivileged users. */
+		.policy = ct_limit_policy,
+		.doit = ovs_ct_limit_cmd_get,
+	},
+};
+
+static const struct genl_multicast_group ovs_ct_limit_multicast_group = {
+	.name = OVS_CT_LIMIT_MCGROUP,
+};
+
+struct genl_family dp_ct_limit_genl_family __ro_after_init = {
+	.hdrsize = sizeof(struct ovs_header),
+	.name = OVS_CT_LIMIT_FAMILY,
+	.version = OVS_CT_LIMIT_VERSION,
+	.maxattr = OVS_CT_LIMIT_ATTR_MAX,
+	.netnsok = true,
+	.parallel_ops = true,
+	.ops = ct_limit_genl_ops,
+	.n_ops = ARRAY_SIZE(ct_limit_genl_ops),
+	.mcgrps = &ovs_ct_limit_multicast_group,
+	.n_mcgrps = 1,
+	.module = THIS_MODULE,
+};
+#endif
+
+int ovs_ct_init(struct net *net)
 {
 	unsigned int n_bits = sizeof(struct ovs_key_ct_labels) * BITS_PER_BYTE;
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
@@ -1666,12 +2152,22 @@ void ovs_ct_init(struct net *net)
 	} else {
 		ovs_net->xt_label = true;
 	}
+
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	return ovs_ct_limit_init(net, ovs_net);
+#else
+	return 0;
+#endif
 }
 
 void ovs_ct_exit(struct net *net)
 {
 	struct ovs_net *ovs_net = net_generic(net, ovs_net_id);
 
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	ovs_ct_limit_exit(net, ovs_net);
+#endif
+
 	if (ovs_net->xt_label)
 		nf_connlabels_put(net);
 }
diff --git a/net/openvswitch/conntrack.h b/net/openvswitch/conntrack.h
index 399dfdd2c4f9..900dadd70974 100644
--- a/net/openvswitch/conntrack.h
+++ b/net/openvswitch/conntrack.h
@@ -17,10 +17,11 @@
 #include "flow.h"
 
 struct ovs_conntrack_info;
+struct ovs_ct_limit_info;
 enum ovs_key_attr;
 
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
-void ovs_ct_init(struct net *);
+int ovs_ct_init(struct net *);
 void ovs_ct_exit(struct net *);
 bool ovs_ct_verify(struct net *, enum ovs_key_attr attr);
 int ovs_ct_copy_action(struct net *, const struct nlattr *,
@@ -44,7 +45,7 @@ void ovs_ct_free_action(const struct nlattr *a);
 #else
 #include <linux/errno.h>
 
-static inline void ovs_ct_init(struct net *net) { }
+static inline int ovs_ct_init(struct net *net) { return 0; }
 
 static inline void ovs_ct_exit(struct net *net) { }
 
@@ -104,4 +105,8 @@ static inline void ovs_ct_free_action(const struct nlattr *a) { }
 
 #define CT_SUPPORTED_MASK 0
 #endif /* CONFIG_NF_CONNTRACK */
+
+#if IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+extern struct genl_family dp_ct_limit_genl_family;
+#endif
 #endif /* ovs_conntrack.h */
diff --git a/net/openvswitch/datapath.c b/net/openvswitch/datapath.c
index 015e24e08909..a61818e94396 100644
--- a/net/openvswitch/datapath.c
+++ b/net/openvswitch/datapath.c
@@ -2288,6 +2288,9 @@ static struct genl_family * const dp_genl_families[] = {
 	&dp_flow_genl_family,
 	&dp_packet_genl_family,
 	&dp_meter_genl_family,
+#if	IS_ENABLED(CONFIG_NETFILTER_CONNCOUNT)
+	&dp_ct_limit_genl_family,
+#endif
 };
 
 static void dp_unregister_genl(int n_families)
@@ -2323,8 +2326,7 @@ static int __net_init ovs_init_net(struct net *net)
 
 	INIT_LIST_HEAD(&ovs_net->dps);
 	INIT_WORK(&ovs_net->dp_notify_work, ovs_dp_notify_wq);
-	ovs_ct_init(net);
-	return 0;
+	return ovs_ct_init(net);
 }
 
 static void __net_exit list_vports_from_net(struct net *net, struct net *dnet,
@@ -2469,3 +2471,4 @@ MODULE_ALIAS_GENL_FAMILY(OVS_VPORT_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_FLOW_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_PACKET_FAMILY);
 MODULE_ALIAS_GENL_FAMILY(OVS_METER_FAMILY);
+MODULE_ALIAS_GENL_FAMILY(OVS_CT_LIMIT_FAMILY);
diff --git a/net/openvswitch/datapath.h b/net/openvswitch/datapath.h
index 523d65526766..51bd4dcb6c8b 100644
--- a/net/openvswitch/datapath.h
+++ b/net/openvswitch/datapath.h
@@ -144,6 +144,7 @@ struct dp_upcall_info {
 struct ovs_net {
 	struct list_head dps;
 	struct work_struct dp_notify_work;
+	struct ovs_ct_limit_info *ct_limit_info;
 
 	/* Module reference for configuring conntrack. */
 	bool xt_label;
-- 
2.7.4

^ permalink raw reply related

* [PATCH net-next v2 00/21] net/ipv6: Separate data structures for FIB and data path
From: David Ahern @ 2018-04-18  0:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern

IPv6 uses the same data struct for both control plane (FIB entries) and
data path (dst entries). This struct has elements needed for both paths
adding memory overhead and complexity (taking a dst hold in most places
but an additional reference on rt6i_ref in a few). Furthermore, because
of the dst_alloc tie, all FIB entries are allocated with GFP_ATOMIC.

This patch set separates FIB entries from dst entries, better aligning
IPv6 code with IPv4, simplifying the reference counting and allowing
FIB entries added by userspace (not autoconf) to use GFP_KERNEL. It is
first step to a number of performance and scalability changes.

The end result of this patch set:
  - FIB entries (fib6_info):
        /* size: 208, cachelines: 4, members: 25 */
        /* sum members: 207, holes: 1, sum holes: 1 */

  - dst entries (rt6_info)
       /* size: 240, cachelines: 4, members: 11 */

Versus the the single rt6_info struct today for both paths:
      /* size: 320, cachelines: 5, members: 28 */

This amounts to a 35% reduction in memory use for FIB entries and a
25% reduction for dst entries.

With respect to locking FIB entries use RCU and a single atomic
counter with fib6_info_hold and fib6_info_release helpers to manage
the reference counting. dst entries use only the traditional dst
refcounts with dst_hold and dst_release.

FIB entries for host routes are referenced by inet6_ifaddr and
ifacaddr6. In both cases, additional holds are taken -- similar to
what is done for devices.

This set is the first of many changes to improve the scalability of the
IPv6 code. Follow on changes include:
- consolidating duplicate fib6_info references like IPv4 does with
  duplicate fib_info

- moving fib6_info into a slab cache to avoid allocation roundups to
  power of 2 (the 208 size becomes a 256 actual allocation)

- Allow FIB lookups without generating a dst (e.g., most rt6_lookup
  users just want to verify the egress device). Means moving dst
  allocation to the other side of fib6_rule_lookup which again aligns
  with IPv4 behavior

- using separate standalone nexthop objects which have performance
  benefits beyond fib_info consolidation

At this point I am not seeing any refcount leaks or underflows, no
oops or bug_ons, or warnings from kasan, so I think it is ready for
others to beat up on it finding errors in code paths I have missed.

v2 changes
- rebased to top of tree
- improved commit message on patch 7

v1 changes
- rebased to top of tree
- fix memory leak of metrics as noted by Ido
- MTU fixes based on pmtu tests (thanks Stefano Brivio for writing)

RFC v2 changes
- improved commit messages
- move common metrics code from dst.c to net/ipv4/metrics.c (comment
  from DaveM)
- address comments from Wei Wang and Martin KaFai Lau (let me know if
  I missed something)
- fixes detected by kernel test robots
  + added fib6_metric_set to change metric on a FIB entry which could
    be pointing to read-only dst_default_metrics
  + 0day testing found a problem with an intermediate patch; added
    dst_hold_safe on rt->from. Code is removed 3 patches later
- allow cacheinfo to handle NULL dst; means only expires is pushed to
  userspace

David Ahern (21):
  net: Move fib_convert_metrics to metrics file
  net: Handle null dst in rtnl_put_cacheinfo
  vrf: Move fib6_table into net_vrf
  net/ipv6: Pass net to fib6_update_sernum
  net/ipv6: Pass net namespace to route functions
  net/ipv6: Move support functions up in route.c
  net/ipv6: Save route type in rt6_info
  net/ipv6: Move nexthop data to fib6_nh
  net/ipv6: Defer initialization of dst to data path
  net/ipv6: move metrics from dst to rt6_info
  net/ipv6: move expires into rt6_info
  net/ipv6: Add fib6_null_entry
  net/ipv6: Add rt6_info create function for ip6_pol_route_lookup
  net/ipv6: Move dst flags to booleans in fib entries
  net/ipv6: Create a neigh_lookup for FIB entries
  net/ipv6: Add gfp_flags to route add functions
  net/ipv6: Cleanup exception and cache route handling
  net/ipv6: introduce fib6_info struct and helpers
  net/ipv6: separate handling of FIB entries from dst based routes
  net/ipv6: Flip FIB entries to fib6_info
  net/ipv6: Remove unused code and variables for rt6_info

 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |   96 +-
 drivers/net/vrf.c                                  |   25 +-
 include/net/if_inet6.h                             |    4 +-
 include/net/ip.h                                   |    3 +
 include/net/ip6_fib.h                              |  151 ++-
 include/net/ip6_route.h                            |   45 +-
 include/net/netns/ipv6.h                           |    3 +-
 net/core/dst.c                                     |    1 +
 net/core/rtnetlink.c                               |    8 +-
 net/ipv4/Makefile                                  |    3 +-
 net/ipv4/fib_semantics.c                           |   43 +-
 net/ipv4/metrics.c                                 |   53 +
 net/ipv6/addrconf.c                                |  118 +-
 net/ipv6/anycast.c                                 |   21 +-
 net/ipv6/ip6_fib.c                                 |  366 +++---
 net/ipv6/ip6_output.c                              |    3 +-
 net/ipv6/ndisc.c                                   |   40 +-
 net/ipv6/route.c                                   | 1369 ++++++++++----------
 net/ipv6/xfrm6_policy.c                            |    2 -
 19 files changed, 1218 insertions(+), 1136 deletions(-)
 create mode 100644 net/ipv4/metrics.c

-- 
2.11.0

^ permalink raw reply

* [PATCH net-next v2 01/21] net: Move fib_convert_metrics to metrics file
From: David Ahern @ 2018-04-18  0:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180418003327.19992-1-dsahern@gmail.com>

Move logic of fib_convert_metrics into ip_metrics_convert. This allows
the code that converts netlink attributes into metrics struct to be
re-used in a later patch by IPv6.

This is mostly a code move with the following changes to variable names:
  - fi->fib_net becomes net
  - fc_mx and fc_mx_len are passed as inputs pulled from fib_config
  - metrics array is passed as an input from fi->fib_metrics->metrics

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip.h         |  3 +++
 net/ipv4/Makefile        |  3 ++-
 net/ipv4/fib_semantics.c | 43 ++-------------------------------------
 net/ipv4/metrics.c       | 53 ++++++++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 60 insertions(+), 42 deletions(-)
 create mode 100644 net/ipv4/metrics.c

diff --git a/include/net/ip.h b/include/net/ip.h
index ecffd843e7b8..dc4a2d6e58a5 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -396,6 +396,9 @@ static inline unsigned int ip_skb_dst_mtu(struct sock *sk,
 	return min(READ_ONCE(skb_dst(skb)->dev->mtu), IP_MAX_MTU);
 }
 
+int ip_metrics_convert(struct net *net, struct nlattr *fc_mx, int fc_mx_len,
+		       u32 *metrics);
+
 u32 ip_idents_reserve(u32 hash, int segs);
 void __ip_select_ident(struct net *net, struct iphdr *iph, int segs);
 
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index a07b7dd06def..b379520f9133 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -13,7 +13,8 @@ obj-y     := route.o inetpeer.o protocol.o \
 	     tcp_offload.o datagram.o raw.o udp.o udplite.o \
 	     udp_offload.o arp.o icmp.o devinet.o af_inet.o igmp.o \
 	     fib_frontend.o fib_semantics.o fib_trie.o fib_notifier.o \
-	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o
+	     inet_fragment.o ping.o ip_tunnel_core.o gre_offload.o \
+	     metrics.o
 
 obj-$(CONFIG_NET_IP_TUNNEL) += ip_tunnel.o
 obj-$(CONFIG_SYSCTL) += sysctl_net_ipv4.o
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index c27122f01b87..6608db23f54b 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -1019,47 +1019,8 @@ static bool fib_valid_prefsrc(struct fib_config *cfg, __be32 fib_prefsrc)
 static int
 fib_convert_metrics(struct fib_info *fi, const struct fib_config *cfg)
 {
-	bool ecn_ca = false;
-	struct nlattr *nla;
-	int remaining;
-
-	if (!cfg->fc_mx)
-		return 0;
-
-	nla_for_each_attr(nla, cfg->fc_mx, cfg->fc_mx_len, remaining) {
-		int type = nla_type(nla);
-		u32 val;
-
-		if (!type)
-			continue;
-		if (type > RTAX_MAX)
-			return -EINVAL;
-
-		if (type == RTAX_CC_ALGO) {
-			char tmp[TCP_CA_NAME_MAX];
-
-			nla_strlcpy(tmp, nla, sizeof(tmp));
-			val = tcp_ca_get_key_by_name(fi->fib_net, tmp, &ecn_ca);
-			if (val == TCP_CA_UNSPEC)
-				return -EINVAL;
-		} else {
-			val = nla_get_u32(nla);
-		}
-		if (type == RTAX_ADVMSS && val > 65535 - 40)
-			val = 65535 - 40;
-		if (type == RTAX_MTU && val > 65535 - 15)
-			val = 65535 - 15;
-		if (type == RTAX_HOPLIMIT && val > 255)
-			val = 255;
-		if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK))
-			return -EINVAL;
-		fi->fib_metrics->metrics[type - 1] = val;
-	}
-
-	if (ecn_ca)
-		fi->fib_metrics->metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;
-
-	return 0;
+	return ip_metrics_convert(fi->fib_net, cfg->fc_mx, cfg->fc_mx_len,
+				  fi->fib_metrics->metrics);
 }
 
 struct fib_info *fib_create_info(struct fib_config *cfg,
diff --git a/net/ipv4/metrics.c b/net/ipv4/metrics.c
new file mode 100644
index 000000000000..5121c6475e6b
--- /dev/null
+++ b/net/ipv4/metrics.c
@@ -0,0 +1,53 @@
+#include <linux/netlink.h>
+#include <linux/rtnetlink.h>
+#include <linux/types.h>
+#include <net/ip.h>
+#include <net/net_namespace.h>
+#include <net/tcp.h>
+
+int ip_metrics_convert(struct net *net, struct nlattr *fc_mx, int fc_mx_len,
+		       u32 *metrics)
+{
+	bool ecn_ca = false;
+	struct nlattr *nla;
+	int remaining;
+
+	if (!fc_mx)
+		return 0;
+
+	nla_for_each_attr(nla, fc_mx, fc_mx_len, remaining) {
+		int type = nla_type(nla);
+		u32 val;
+
+		if (!type)
+			continue;
+		if (type > RTAX_MAX)
+			return -EINVAL;
+
+		if (type == RTAX_CC_ALGO) {
+			char tmp[TCP_CA_NAME_MAX];
+
+			nla_strlcpy(tmp, nla, sizeof(tmp));
+			val = tcp_ca_get_key_by_name(net, tmp, &ecn_ca);
+			if (val == TCP_CA_UNSPEC)
+				return -EINVAL;
+		} else {
+			val = nla_get_u32(nla);
+		}
+		if (type == RTAX_ADVMSS && val > 65535 - 40)
+			val = 65535 - 40;
+		if (type == RTAX_MTU && val > 65535 - 15)
+			val = 65535 - 15;
+		if (type == RTAX_HOPLIMIT && val > 255)
+			val = 255;
+		if (type == RTAX_FEATURES && (val & ~RTAX_FEATURE_MASK))
+			return -EINVAL;
+		metrics[type - 1] = val;
+	}
+
+	if (ecn_ca)
+		metrics[RTAX_FEATURES - 1] |= DST_FEATURE_ECN_CA;
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(ip_metrics_convert);
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v2 02/21] net: Handle null dst in rtnl_put_cacheinfo
From: David Ahern @ 2018-04-18  0:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180418003327.19992-1-dsahern@gmail.com>

Need to keep expires time for IPv6 routes in a dump of FIB entries.
Update rtnl_put_cacheinfo to allow dst to be NULL in which case
rta_cacheinfo will only contain non-dst data.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 net/core/rtnetlink.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index 45936922d7e2..80802546c279 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -785,13 +785,15 @@ int rtnl_put_cacheinfo(struct sk_buff *skb, struct dst_entry *dst, u32 id,
 		       long expires, u32 error)
 {
 	struct rta_cacheinfo ci = {
-		.rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse),
-		.rta_used = dst->__use,
-		.rta_clntref = atomic_read(&(dst->__refcnt)),
 		.rta_error = error,
 		.rta_id =  id,
 	};
 
+	if (dst) {
+		ci.rta_lastuse = jiffies_delta_to_clock_t(jiffies - dst->lastuse);
+		ci.rta_used = dst->__use;
+		ci.rta_clntref = atomic_read(&dst->__refcnt);
+	}
 	if (expires) {
 		unsigned long clock;
 
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v2 03/21] vrf: Move fib6_table into net_vrf
From: David Ahern @ 2018-04-18  0:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180418003327.19992-1-dsahern@gmail.com>

A later patch removes rt6i_table from rt6_info. Save the ipv6
table for a VRF in net_vrf. fib tables can not be deleted so
no reference counting or locking is required.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 drivers/net/vrf.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/drivers/net/vrf.c b/drivers/net/vrf.c
index 0a2b180d138a..90b5f3900c22 100644
--- a/drivers/net/vrf.c
+++ b/drivers/net/vrf.c
@@ -48,6 +48,9 @@ static unsigned int vrf_net_id;
 struct net_vrf {
 	struct rtable __rcu	*rth;
 	struct rt6_info	__rcu	*rt6;
+#if IS_ENABLED(CONFIG_IPV6)
+	struct fib6_table	*fib6_table;
+#endif
 	u32                     tb_id;
 };
 
@@ -496,7 +499,6 @@ static int vrf_rt6_create(struct net_device *dev)
 	int flags = DST_HOST | DST_NOPOLICY | DST_NOXFRM;
 	struct net_vrf *vrf = netdev_priv(dev);
 	struct net *net = dev_net(dev);
-	struct fib6_table *rt6i_table;
 	struct rt6_info *rt6;
 	int rc = -ENOMEM;
 
@@ -504,8 +506,8 @@ static int vrf_rt6_create(struct net_device *dev)
 	if (!ipv6_mod_enabled())
 		return 0;
 
-	rt6i_table = fib6_new_table(net, vrf->tb_id);
-	if (!rt6i_table)
+	vrf->fib6_table = fib6_new_table(net, vrf->tb_id);
+	if (!vrf->fib6_table)
 		goto out;
 
 	/* create a dst for routing packets out a VRF device */
@@ -513,7 +515,6 @@ static int vrf_rt6_create(struct net_device *dev)
 	if (!rt6)
 		goto out;
 
-	rt6->rt6i_table = rt6i_table;
 	rt6->dst.output	= vrf_output6;
 
 	rcu_assign_pointer(vrf->rt6, rt6);
@@ -946,22 +947,8 @@ static struct rt6_info *vrf_ip6_route_lookup(struct net *net,
 					     int flags)
 {
 	struct net_vrf *vrf = netdev_priv(dev);
-	struct fib6_table *table = NULL;
-	struct rt6_info *rt6;
-
-	rcu_read_lock();
-
-	/* fib6_table does not have a refcnt and can not be freed */
-	rt6 = rcu_dereference(vrf->rt6);
-	if (likely(rt6))
-		table = rt6->rt6i_table;
-
-	rcu_read_unlock();
-
-	if (!table)
-		return NULL;
 
-	return ip6_pol_route(net, table, ifindex, fl6, skb, flags);
+	return ip6_pol_route(net, vrf->fib6_table, ifindex, fl6, skb, flags);
 }
 
 static void vrf_ip6_input_dst(struct sk_buff *skb, struct net_device *vrf_dev,
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v2 04/21] net/ipv6: Pass net to fib6_update_sernum
From: David Ahern @ 2018-04-18  0:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180418003327.19992-1-dsahern@gmail.com>

Pass net namespace to fib6_update_sernum. It can not be marked const
as fib6_new_sernum will change ipv6.fib6_sernum.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_fib.h |  2 +-
 net/ipv6/ip6_fib.c    |  3 +--
 net/ipv6/route.c      | 10 +++++-----
 3 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/include/net/ip6_fib.h b/include/net/ip6_fib.h
index 5e86fd9dc857..f0aaf1c8f1a8 100644
--- a/include/net/ip6_fib.h
+++ b/include/net/ip6_fib.h
@@ -408,7 +408,7 @@ void __net_exit fib6_notifier_exit(struct net *net);
 unsigned int fib6_tables_seq_read(struct net *net);
 int fib6_tables_dump(struct net *net, struct notifier_block *nb);
 
-void fib6_update_sernum(struct rt6_info *rt);
+void fib6_update_sernum(struct net *net, struct rt6_info *rt);
 void fib6_update_sernum_upto_root(struct net *net, struct rt6_info *rt);
 
 #ifdef CONFIG_IPV6_MULTIPLE_TABLES
diff --git a/net/ipv6/ip6_fib.c b/net/ipv6/ip6_fib.c
index deab2db6692e..74d2a3748e2f 100644
--- a/net/ipv6/ip6_fib.c
+++ b/net/ipv6/ip6_fib.c
@@ -105,9 +105,8 @@ enum {
 	FIB6_NO_SERNUM_CHANGE = 0,
 };
 
-void fib6_update_sernum(struct rt6_info *rt)
+void fib6_update_sernum(struct net *net, struct rt6_info *rt)
 {
-	struct net *net = dev_net(rt->dst.dev);
 	struct fib6_node *fn;
 
 	fn = rcu_dereference_protected(rt->rt6i_node,
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 1d738bfe893b..0a99cda9fd7b 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -1352,7 +1352,7 @@ static int rt6_insert_exception(struct rt6_info *nrt,
 	/* Update fn->fn_sernum to invalidate all cached dst */
 	if (!err) {
 		spin_lock_bh(&ort->rt6i_table->tb6_lock);
-		fib6_update_sernum(ort);
+		fib6_update_sernum(net, ort);
 		spin_unlock_bh(&ort->rt6i_table->tb6_lock);
 		fib6_force_start_gc(net);
 	}
@@ -3786,11 +3786,11 @@ void rt6_multipath_rebalance(struct rt6_info *rt)
 static int fib6_ifup(struct rt6_info *rt, void *p_arg)
 {
 	const struct arg_netdev_event *arg = p_arg;
-	const struct net *net = dev_net(arg->dev);
+	struct net *net = dev_net(arg->dev);
 
 	if (rt != net->ipv6.ip6_null_entry && rt->dst.dev == arg->dev) {
 		rt->rt6i_nh_flags &= ~arg->nh_flags;
-		fib6_update_sernum_upto_root(dev_net(rt->dst.dev), rt);
+		fib6_update_sernum_upto_root(net, rt);
 		rt6_multipath_rebalance(rt);
 	}
 
@@ -3869,7 +3869,7 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 {
 	const struct arg_netdev_event *arg = p_arg;
 	const struct net_device *dev = arg->dev;
-	const struct net *net = dev_net(dev);
+	struct net *net = dev_net(dev);
 
 	if (rt == net->ipv6.ip6_null_entry)
 		return 0;
@@ -3892,7 +3892,7 @@ static int fib6_ifdown(struct rt6_info *rt, void *p_arg)
 			}
 			rt6_multipath_nh_flags_set(rt, dev, RTNH_F_DEAD |
 						   RTNH_F_LINKDOWN);
-			fib6_update_sernum(rt);
+			fib6_update_sernum(net, rt);
 			rt6_multipath_rebalance(rt);
 		}
 		return -2;
-- 
2.11.0

^ permalink raw reply related

* [PATCH net-next v2 05/21] net/ipv6: Pass net namespace to route functions
From: David Ahern @ 2018-04-18  0:33 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, roopa, eric.dumazet, weiwan, kafai, yoshfuji,
	David Ahern
In-Reply-To: <20180418003327.19992-1-dsahern@gmail.com>

Pass network namespace reference into route add, delete and get
functions.

Signed-off-by: David Ahern <dsahern@gmail.com>
---
 include/net/ip6_route.h | 12 ++++++-----
 net/ipv6/addrconf.c     | 33 ++++++++++++++++--------------
 net/ipv6/anycast.c      | 10 +++++----
 net/ipv6/ndisc.c        | 12 ++++++-----
 net/ipv6/route.c        | 54 +++++++++++++++++++++++++------------------------
 5 files changed, 66 insertions(+), 55 deletions(-)

diff --git a/include/net/ip6_route.h b/include/net/ip6_route.h
index 08b132381984..1130a1144dfd 100644
--- a/include/net/ip6_route.h
+++ b/include/net/ip6_route.h
@@ -101,8 +101,8 @@ void ip6_route_cleanup(void);
 int ipv6_route_ioctl(struct net *net, unsigned int cmd, void __user *arg);
 
 int ip6_route_add(struct fib6_config *cfg, struct netlink_ext_ack *extack);
-int ip6_ins_rt(struct rt6_info *);
-int ip6_del_rt(struct rt6_info *);
+int ip6_ins_rt(struct net *net, struct rt6_info *rt);
+int ip6_del_rt(struct net *net, struct rt6_info *rt);
 
 void rt6_flush_exceptions(struct rt6_info *rt);
 int rt6_remove_exception_rt(struct rt6_info *rt);
@@ -137,7 +137,7 @@ struct dst_entry *icmp6_dst_alloc(struct net_device *dev, struct flowi6 *fl6);
 
 void fib6_force_start_gc(struct net *net);
 
-struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
+struct rt6_info *addrconf_dst_alloc(struct net *net, struct inet6_dev *idev,
 				    const struct in6_addr *addr, bool anycast);
 
 struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev,
@@ -147,9 +147,11 @@ struct rt6_info *ip6_dst_alloc(struct net *net, struct net_device *dev,
  *	support functions for ND
  *
  */
-struct rt6_info *rt6_get_dflt_router(const struct in6_addr *addr,
+struct rt6_info *rt6_get_dflt_router(struct net *net,
+				     const struct in6_addr *addr,
 				     struct net_device *dev);
-struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr,
+struct rt6_info *rt6_add_dflt_router(struct net *net,
+				     const struct in6_addr *gwaddr,
 				     struct net_device *dev, unsigned int pref);
 
 void rt6_purge_dflt_routers(struct net *net);
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index b2c0175125db..7ff7466c52e5 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1037,7 +1037,7 @@ ipv6_add_addr(struct inet6_dev *idev, const struct in6_addr *addr,
 		goto out;
 	}
 
-	rt = addrconf_dst_alloc(idev, addr, false);
+	rt = addrconf_dst_alloc(net, idev, addr, false);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		rt = NULL;
@@ -1187,7 +1187,7 @@ cleanup_prefix_route(struct inet6_ifaddr *ifp, unsigned long expires, bool del_r
 				       0, RTF_GATEWAY | RTF_DEFAULT);
 	if (rt) {
 		if (del_rt)
-			ip6_del_rt(rt);
+			ip6_del_rt(dev_net(ifp->idev->dev), rt);
 		else {
 			if (!(rt->rt6i_flags & RTF_EXPIRES))
 				rt6_set_expires(rt, expires);
@@ -2666,7 +2666,7 @@ void addrconf_prefix_rcv(struct net_device *dev, u8 *opt, int len, bool sllao)
 		if (rt) {
 			/* Autoconf prefix route */
 			if (valid_lft == 0) {
-				ip6_del_rt(rt);
+				ip6_del_rt(net, rt);
 				rt = NULL;
 			} else if (addrconf_finite_timeout(rt_expires)) {
 				/* not infinity */
@@ -3333,7 +3333,8 @@ static void addrconf_gre_config(struct net_device *dev)
 }
 #endif
 
-static int fixup_permanent_addr(struct inet6_dev *idev,
+static int fixup_permanent_addr(struct net *net,
+				struct inet6_dev *idev,
 				struct inet6_ifaddr *ifp)
 {
 	/* !rt6i_node means the host route was removed from the
@@ -3343,7 +3344,7 @@ static int fixup_permanent_addr(struct inet6_dev *idev,
 	if (!ifp->rt || !ifp->rt->rt6i_node) {
 		struct rt6_info *rt, *prev;
 
-		rt = addrconf_dst_alloc(idev, &ifp->addr, false);
+		rt = addrconf_dst_alloc(net, idev, &ifp->addr, false);
 		if (IS_ERR(rt))
 			return PTR_ERR(rt);
 
@@ -3367,7 +3368,7 @@ static int fixup_permanent_addr(struct inet6_dev *idev,
 	return 0;
 }
 
-static void addrconf_permanent_addr(struct net_device *dev)
+static void addrconf_permanent_addr(struct net *net, struct net_device *dev)
 {
 	struct inet6_ifaddr *ifp, *tmp;
 	struct inet6_dev *idev;
@@ -3380,7 +3381,7 @@ static void addrconf_permanent_addr(struct net_device *dev)
 
 	list_for_each_entry_safe(ifp, tmp, &idev->addr_list, if_list) {
 		if ((ifp->flags & IFA_F_PERMANENT) &&
-		    fixup_permanent_addr(idev, ifp) < 0) {
+		    fixup_permanent_addr(net, idev, ifp) < 0) {
 			write_unlock_bh(&idev->lock);
 			in6_ifa_hold(ifp);
 			ipv6_del_addr(ifp);
@@ -3449,7 +3450,7 @@ static int addrconf_notify(struct notifier_block *this, unsigned long event,
 
 		if (event == NETDEV_UP) {
 			/* restore routes for permanent addresses */
-			addrconf_permanent_addr(dev);
+			addrconf_permanent_addr(net, dev);
 
 			if (!addrconf_link_ready(dev)) {
 				/* device is not ready yet. */
@@ -3735,7 +3736,7 @@ static int addrconf_ifdown(struct net_device *dev, int how)
 		spin_unlock_bh(&ifa->lock);
 
 		if (rt)
-			ip6_del_rt(rt);
+			ip6_del_rt(net, rt);
 
 		if (state != INET6_IFADDR_STATE_DEAD) {
 			__ipv6_ifa_notify(RTM_DELADDR, ifa);
@@ -3853,6 +3854,7 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
 	struct inet6_dev *idev = ifp->idev;
 	struct net_device *dev = idev->dev;
 	bool bump_id, notify = false;
+	struct net *net;
 
 	addrconf_join_solict(dev, &ifp->addr);
 
@@ -3863,8 +3865,9 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
 	if (ifp->state == INET6_IFADDR_STATE_DEAD)
 		goto out;
 
+	net = dev_net(dev);
 	if (dev->flags&(IFF_NOARP|IFF_LOOPBACK) ||
-	    (dev_net(dev)->ipv6.devconf_all->accept_dad < 1 &&
+	    (net->ipv6.devconf_all->accept_dad < 1 &&
 	     idev->cnf.accept_dad < 1) ||
 	    !(ifp->flags&IFA_F_TENTATIVE) ||
 	    ifp->flags & IFA_F_NODAD) {
@@ -3900,8 +3903,8 @@ static void addrconf_dad_begin(struct inet6_ifaddr *ifp)
 	 * Frames right away
 	 */
 	if (ifp->flags & IFA_F_OPTIMISTIC) {
-		ip6_ins_rt(ifp->rt);
-		if (ipv6_use_optimistic_addr(dev_net(dev), idev)) {
+		ip6_ins_rt(net, ifp->rt);
+		if (ipv6_use_optimistic_addr(net, idev)) {
 			/* Because optimistic nodes can use this address,
 			 * notify listeners. If DAD fails, RTM_DELADDR is sent.
 			 */
@@ -5604,7 +5607,7 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
 		 * to do it again
 		 */
 		if (!rcu_access_pointer(ifp->rt->rt6i_node))
-			ip6_ins_rt(ifp->rt);
+			ip6_ins_rt(net, ifp->rt);
 		if (ifp->idev->cnf.forwarding)
 			addrconf_join_anycast(ifp);
 		if (!ipv6_addr_any(&ifp->peer_addr))
@@ -5621,11 +5624,11 @@ static void __ipv6_ifa_notify(int event, struct inet6_ifaddr *ifp)
 			rt = addrconf_get_prefix_route(&ifp->peer_addr, 128,
 						       ifp->idev->dev, 0, 0);
 			if (rt)
-				ip6_del_rt(rt);
+				ip6_del_rt(net, rt);
 		}
 		if (ifp->rt) {
 			if (dst_hold_safe(&ifp->rt->dst))
-				ip6_del_rt(ifp->rt);
+				ip6_del_rt(net, ifp->rt);
 		}
 		rt_genid_bump_ipv6(net);
 		break;
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index bbcabbba9bd8..1122fb299b86 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -247,6 +247,7 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
 {
 	struct ifacaddr6 *aca;
 	struct rt6_info *rt;
+	struct net *net;
 	int err;
 
 	ASSERT_RTNL();
@@ -265,7 +266,8 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
 		}
 	}
 
-	rt = addrconf_dst_alloc(idev, addr, true);
+	net = dev_net(idev->dev);
+	rt = addrconf_dst_alloc(net, idev, addr, true);
 	if (IS_ERR(rt)) {
 		err = PTR_ERR(rt);
 		goto out;
@@ -286,7 +288,7 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
 	aca_get(aca);
 	write_unlock_bh(&idev->lock);
 
-	ip6_ins_rt(rt);
+	ip6_ins_rt(net, rt);
 
 	addrconf_join_solict(idev->dev, &aca->aca_addr);
 
@@ -329,7 +331,7 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr)
 	addrconf_leave_solict(idev, &aca->aca_addr);
 
 	dst_hold(&aca->aca_rt->dst);
-	ip6_del_rt(aca->aca_rt);
+	ip6_del_rt(dev_net(idev->dev), aca->aca_rt);
 
 	aca_put(aca);
 	return 0;
@@ -357,7 +359,7 @@ void ipv6_ac_destroy_dev(struct inet6_dev *idev)
 		addrconf_leave_solict(idev, &aca->aca_addr);
 
 		dst_hold(&aca->aca_rt->dst);
-		ip6_del_rt(aca->aca_rt);
+		ip6_del_rt(dev_net(idev->dev), aca->aca_rt);
 
 		aca_put(aca);
 
diff --git a/net/ipv6/ndisc.c b/net/ipv6/ndisc.c
index 9de4dfb126ba..3fbc3805e69b 100644
--- a/net/ipv6/ndisc.c
+++ b/net/ipv6/ndisc.c
@@ -1156,6 +1156,7 @@ static void ndisc_router_discovery(struct sk_buff *skb)
 	struct neighbour *neigh = NULL;
 	struct inet6_dev *in6_dev;
 	struct rt6_info *rt = NULL;
+	struct net *net;
 	int lifetime;
 	struct ndisc_options ndopts;
 	int optlen;
@@ -1253,9 +1254,9 @@ static void ndisc_router_discovery(struct sk_buff *skb)
 	/* Do not accept RA with source-addr found on local machine unless
 	 * accept_ra_from_local is set to true.
 	 */
+	net = dev_net(in6_dev->dev);
 	if (!in6_dev->cnf.accept_ra_from_local &&
-	    ipv6_chk_addr(dev_net(in6_dev->dev), &ipv6_hdr(skb)->saddr,
-			  in6_dev->dev, 0)) {
+	    ipv6_chk_addr(net, &ipv6_hdr(skb)->saddr, in6_dev->dev, 0)) {
 		ND_PRINTK(2, info,
 			  "RA from local address detected on dev: %s: default router ignored\n",
 			  skb->dev->name);
@@ -1272,7 +1273,7 @@ static void ndisc_router_discovery(struct sk_buff *skb)
 		pref = ICMPV6_ROUTER_PREF_MEDIUM;
 #endif
 
-	rt = rt6_get_dflt_router(&ipv6_hdr(skb)->saddr, skb->dev);
+	rt = rt6_get_dflt_router(net, &ipv6_hdr(skb)->saddr, skb->dev);
 
 	if (rt) {
 		neigh = dst_neigh_lookup(&rt->dst, &ipv6_hdr(skb)->saddr);
@@ -1285,7 +1286,7 @@ static void ndisc_router_discovery(struct sk_buff *skb)
 		}
 	}
 	if (rt && lifetime == 0) {
-		ip6_del_rt(rt);
+		ip6_del_rt(net, rt);
 		rt = NULL;
 	}
 
@@ -1294,7 +1295,8 @@ static void ndisc_router_discovery(struct sk_buff *skb)
 	if (!rt && lifetime) {
 		ND_PRINTK(3, info, "RA: adding default router\n");
 
-		rt = rt6_add_dflt_router(&ipv6_hdr(skb)->saddr, skb->dev, pref);
+		rt = rt6_add_dflt_router(net, &ipv6_hdr(skb)->saddr,
+					 skb->dev, pref);
 		if (!rt) {
 			ND_PRINTK(0, err,
 				  "RA: %s failed to add default route\n",
diff --git a/net/ipv6/route.c b/net/ipv6/route.c
index 0a99cda9fd7b..045811a3da76 100644
--- a/net/ipv6/route.c
+++ b/net/ipv6/route.c
@@ -850,13 +850,13 @@ int rt6_route_rcv(struct net_device *dev, u8 *opt, int len,
 	}
 
 	if (rinfo->prefix_len == 0)
-		rt = rt6_get_dflt_router(gwaddr, dev);
+		rt = rt6_get_dflt_router(net, gwaddr, dev);
 	else
 		rt = rt6_get_route_info(net, prefix, rinfo->prefix_len,
 					gwaddr, dev);
 
 	if (rt && !lifetime) {
-		ip6_del_rt(rt);
+		ip6_del_rt(net, rt);
 		rt = NULL;
 	}
 
@@ -1014,9 +1014,9 @@ static int __ip6_ins_rt(struct rt6_info *rt, struct nl_info *info,
 	return err;
 }
 
-int ip6_ins_rt(struct rt6_info *rt)
+int ip6_ins_rt(struct net *net, struct rt6_info *rt)
 {
-	struct nl_info info = {	.nl_net = dev_net(rt->dst.dev), };
+	struct nl_info info = {	.nl_net = net, };
 	struct mx6_config mxc = { .mx = NULL, };
 
 	/* Hold dst to account for the reference from the fib6 tree */
@@ -1121,14 +1121,13 @@ static struct rt6_info *rt6_get_pcpu_route(struct rt6_info *rt)
 	return pcpu_rt;
 }
 
-static struct rt6_info *rt6_make_pcpu_route(struct rt6_info *rt)
+static struct rt6_info *rt6_make_pcpu_route(struct net *net,
+					    struct rt6_info *rt)
 {
 	struct rt6_info *pcpu_rt, *prev, **p;
 
 	pcpu_rt = ip6_rt_pcpu_alloc(rt);
 	if (!pcpu_rt) {
-		struct net *net = dev_net(rt->dst.dev);
-
 		dst_hold(&net->ipv6.ip6_null_entry->dst);
 		return net->ipv6.ip6_null_entry;
 	}
@@ -1787,7 +1786,7 @@ struct rt6_info *ip6_pol_route(struct net *net, struct fib6_table *table,
 				/* No dst_hold() on rt is needed because grabbing
 				 * rt->rt6i_ref makes sure rt can't be released.
 				 */
-				pcpu_rt = rt6_make_pcpu_route(rt);
+				pcpu_rt = rt6_make_pcpu_route(net, rt);
 				rt6_release(rt);
 			} else {
 				/* rt is already removed from tree */
@@ -2088,7 +2087,7 @@ static struct dst_entry *ip6_negative_advice(struct dst_entry *dst)
 	if (rt) {
 		if (rt->rt6i_flags & RTF_CACHE) {
 			if (rt6_check_expired(rt)) {
-				ip6_del_rt(rt);
+				ip6_del_rt(dev_net(dst->dev), rt);
 				dst = NULL;
 			}
 		} else {
@@ -2109,7 +2108,7 @@ static void ip6_link_failure(struct sk_buff *skb)
 	if (rt) {
 		if (rt->rt6i_flags & RTF_CACHE) {
 			if (dst_hold_safe(&rt->dst))
-				ip6_del_rt(rt);
+				ip6_del_rt(dev_net(rt->dst.dev), rt);
 		} else {
 			struct fib6_node *fn;
 
@@ -3018,9 +3017,9 @@ int ip6_route_add(struct fib6_config *cfg,
 
 static int __ip6_del_rt(struct rt6_info *rt, struct nl_info *info)
 {
-	int err;
+	struct net *net = info->nl_net;
 	struct fib6_table *table;
-	struct net *net = dev_net(rt->dst.dev);
+	int err;
 
 	if (rt == net->ipv6.ip6_null_entry) {
 		err = -ENOENT;
@@ -3037,11 +3036,10 @@ static int __ip6_del_rt(struct rt6_info *rt, struct nl_info *info)
 	return err;
 }
 
-int ip6_del_rt(struct rt6_info *rt)
+int ip6_del_rt(struct net *net, struct rt6_info *rt)
 {
-	struct nl_info info = {
-		.nl_net = dev_net(rt->dst.dev),
-	};
+	struct nl_info info = { .nl_net = net };
+
 	return __ip6_del_rt(rt, &info);
 }
 
@@ -3376,13 +3374,15 @@ static struct rt6_info *rt6_add_route_info(struct net *net,
 }
 #endif
 
-struct rt6_info *rt6_get_dflt_router(const struct in6_addr *addr, struct net_device *dev)
+struct rt6_info *rt6_get_dflt_router(struct net *net,
+				     const struct in6_addr *addr,
+				     struct net_device *dev)
 {
 	u32 tb_id = l3mdev_fib_table(dev) ? : RT6_TABLE_DFLT;
 	struct rt6_info *rt;
 	struct fib6_table *table;
 
-	table = fib6_get_table(dev_net(dev), tb_id);
+	table = fib6_get_table(net, tb_id);
 	if (!table)
 		return NULL;
 
@@ -3399,7 +3399,8 @@ struct rt6_info *rt6_get_dflt_router(const struct in6_addr *addr, struct net_dev
 	return rt;
 }
 
-struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr,
+struct rt6_info *rt6_add_dflt_router(struct net *net,
+				     const struct in6_addr *gwaddr,
 				     struct net_device *dev,
 				     unsigned int pref)
 {
@@ -3412,7 +3413,7 @@ struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr,
 		.fc_protocol = RTPROT_RA,
 		.fc_nlinfo.portid = 0,
 		.fc_nlinfo.nlh = NULL,
-		.fc_nlinfo.nl_net = dev_net(dev),
+		.fc_nlinfo.nl_net = net,
 	};
 
 	cfg.fc_gateway = *gwaddr;
@@ -3425,10 +3426,11 @@ struct rt6_info *rt6_add_dflt_router(const struct in6_addr *gwaddr,
 			table->flags |= RT6_TABLE_HAS_DFLT_ROUTER;
 	}
 
-	return rt6_get_dflt_router(gwaddr, dev);
+	return rt6_get_dflt_router(net, gwaddr, dev);
 }
 
-static void __rt6_purge_dflt_routers(struct fib6_table *table)
+static void __rt6_purge_dflt_routers(struct net *net,
+				     struct fib6_table *table)
 {
 	struct rt6_info *rt;
 
@@ -3439,7 +3441,7 @@ static void __rt6_purge_dflt_routers(struct fib6_table *table)
 		    (!rt->rt6i_idev || rt->rt6i_idev->cnf.accept_ra != 2)) {
 			if (dst_hold_safe(&rt->dst)) {
 				rcu_read_unlock();
-				ip6_del_rt(rt);
+				ip6_del_rt(net, rt);
 			} else {
 				rcu_read_unlock();
 			}
@@ -3463,7 +3465,7 @@ void rt6_purge_dflt_routers(struct net *net)
 		head = &net->ipv6.fib_table_hash[h];
 		hlist_for_each_entry_rcu(table, head, tb6_hlist) {
 			if (table->flags & RT6_TABLE_HAS_DFLT_ROUTER)
-				__rt6_purge_dflt_routers(table);
+				__rt6_purge_dflt_routers(net, table);
 		}
 	}
 
@@ -3583,12 +3585,12 @@ static int ip6_pkt_prohibit_out(struct net *net, struct sock *sk, struct sk_buff
  *	Allocate a dst for local (unicast / anycast) address.
  */
 
-struct rt6_info *addrconf_dst_alloc(struct inet6_dev *idev,
+struct rt6_info *addrconf_dst_alloc(struct net *net,
+				    struct inet6_dev *idev,
 				    const struct in6_addr *addr,
 				    bool anycast)
 {
 	u32 tb_id;
-	struct net *net = dev_net(idev->dev);
 	struct net_device *dev = idev->dev;
 	struct rt6_info *rt;
 
-- 
2.11.0

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox