Netdev List


* Re: net: GPF in rt6_get_cookie
From: Hannes Frederic Sowa @ 2016-11-30 11:00 UTC
  To: Andrey Konovalov, syzkaller
  Cc: David Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, LKML, Eric Dumazet
In-Reply-To: <CAAeHK+wvAZByn7-fONWYk1P8fXA9wNdkVLGtXfQsdFb-NSdn+g@mail.gmail.com>

Hi

On 30.11.2016 11:39, Andrey Konovalov wrote:
> On Sat, Nov 26, 2016 at 5:23 PM, 'Dmitry Vyukov' via syzkaller
> <syzkaller@googlegroups.com> wrote:
>> Hello,
>>
>> I got several GPFs in rt6_get_cookie while running syzkaller:
>>
>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>> Dumping ftrace buffer:
>>    (ftrace buffer empty)
>> Modules linked in:
>> CPU: 2 PID: 10156 Comm: syz-executor Not tainted 4.9.0-rc5+ #54
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> task: ffff880016f40480 task.stack: ffff88000fc00000
>> RIP: 0010:[<ffffffff87a209e8>]  [<     inline     >] rt6_get_cookie
>> include/net/ip6_fib.h:174
>> RIP: 0010:[<ffffffff87a209e8>]  [<ffffffff87a209e8>]
>> sctp_v6_get_dst+0x7c8/0x1960 net/sctp/ipv6.c:340
>> RSP: 0018:ffff88000fc07298  EFLAGS: 00010202
>> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc900029f5000
>> RDX: 0000000000000015 RSI: 0000000000000001 RDI: 00000000000000a8
>> RBP: ffff88000fc07580 R08: 0000000000000000 R09: 0000000000000001
>> R10: 0000000000000000 R11: 0000000000000000 R12: ffff880066cd0068
>> R13: 1ffff10001f80e92 R14: ffff880066cd0040 R15: ffff88005f2d2808
>> FS:  00007f52c41f7700(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 0000000020016000 CR3: 0000000065dd7000 CR4: 00000000000006e0
>> DR0: 0000000000000400 DR1: 0000000000000400 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>> Stack:
>>  ffffffff87a210f6 ffffffff8701ad45 ffff88006768ec20 ffff88006768ec20
>>  0000000000000000 0000000016f40480 ffff88000fc07450 1ffff1000cd9a017
>>  ffff88006768ec00 ffff880066fc0730 ffff880066cd0068 1ffff10001f80e66
>> Call Trace:
>>  [<ffffffff879a313d>] sctp_transport_route+0xad/0x430 net/sctp/transport.c:279
>>  [<ffffffff8799b106>] sctp_assoc_add_peer+0x5a6/0x13e0 net/sctp/associola.c:641
>>  [<ffffffff879e8911>] sctp_sendmsg+0x1921/0x3bc0 net/sctp/socket.c:1864
>>  [<ffffffff8701ad45>] inet_sendmsg+0x385/0x590 net/ipv4/af_inet.c:734
>>  [<     inline     >] sock_sendmsg_nosec net/socket.c:621
>>  [<ffffffff86a6d54f>] sock_sendmsg+0xcf/0x110 net/socket.c:631
>>  [<ffffffff86a6ede0>] SYSC_sendto+0x660/0x810 net/socket.c:1656
>>  [<ffffffff86a71dd5>] SyS_sendto+0x45/0x60 net/socket.c:1624
>>  [<ffffffff88149dc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
>> Code: 00 00 48 8b 84 24 88 00 00 00 48 8b 58 40 e8 80 76 cc f9 48 8d
>> bb a8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80>
>> 3c 02 00 0f 85 56 0f 00 00 48 8b 9b a8 00 00 00 45 31 ed 48
>> RIP  [<     inline     >] rt6_get_cookie include/net/ip6_fib.h:174
>> RIP  [<ffffffff87a209e8>] sctp_v6_get_dst+0x7c8/0x1960 net/sctp/ipv6.c:340
>>  RSP <ffff88000fc07298>
>> ---[ end trace b8d1354fa571700d ]---
>>
>>
>> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
>> Dumping ftrace buffer:
>>    (ftrace buffer empty)
>> Modules linked in:
>> CPU: 3 PID: 22744 Comm: syz-executor Not tainted 4.9.0-rc5+ #54
>> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
>> task: ffff88006b92a840 task.stack: ffff88006a730000
>> RIP: 0010:[<ffffffff87a209e8>]  [<     inline     >] rt6_get_cookie
>> include/net/ip6_fib.h:174
>> RIP: 0010:[<ffffffff87a209e8>]  [<ffffffff87a209e8>]
>> sctp_v6_get_dst+0x7c8/0x1960 net/sctp/ipv6.c:340
>> RSP: 0018:ffff88006a736b88  EFLAGS: 00010202
>> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc90003c4f000
>> RDX: 0000000000000015 RSI: 0000000000000001 RDI: 00000000000000a8
>> RBP: ffff88006a736e68 R08: 0000000000000000 R09: 0000000000000001
>> R10: 0000000000000000 R11: 0000000000000000 R12: ffff880064cff268
>> R13: 1ffff1000d4e6db0 R14: ffff880064cff240 R15: ffff88006a4b6808
>> FS:  00007f74f4ec9700(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
>> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> CR2: 000000002070effc CR3: 000000003bd2f000 CR4: 00000000000006e0
>> DR0: 0000000000000400 DR1: 0000000000000400 DR2: 0000000000000000
>> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
>> Stack:
>>  ffffffff87a210f6 ffffffff000bbd2d ffff88006c2cd5a0 ffff88006c2cd5a0
>>  0000000000000000 000000006ccb46c0 ffff88006a736d40 1ffff1000c99fe57
>>  ffff88006c2cd500 ffff8800658b1f30 ffff880064cff268 1ffff1000d4e6d84
>> Call Trace:
>>  [<ffffffff879a313d>] sctp_transport_route+0xad/0x430 net/sctp/transport.c:279
>>  [<ffffffff8799b106>] sctp_assoc_add_peer+0x5a6/0x13e0 net/sctp/associola.c:641
>>  [<ffffffff879e4358>] __sctp_connect+0x288/0xc90 net/sctp/socket.c:1178
>>  [<ffffffff879e4f0b>] __sctp_setsockopt_connectx+0x1ab/0x200
>> net/sctp/socket.c:1332
>>  [<     inline     >] sctp_getsockopt_connectx3 net/sctp/socket.c:1417
>>  [<ffffffff879fd2bd>] sctp_getsockopt+0x36ed/0x6800 net/sctp/socket.c:6474
>>  [<ffffffff86a76c0a>] sock_common_getsockopt+0x9a/0xe0 net/core/sock.c:2649
>>  [<     inline     >] SYSC_getsockopt net/socket.c:1788
>>  [<ffffffff86a724d7>] SyS_getsockopt+0x257/0x390 net/socket.c:1770
>>  [<ffffffff88149dc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
>> Code: 00 00 48 8b 84 24 88 00 00 00 48 8b 58 40 e8 80 76 cc f9 48 8d
>> bb a8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80>
>> 3c 02 00 0f 85 56 0f 00 00 48 8b 9b a8 00 00 00 45 31 ed 48
>> RIP  [<     inline     >] rt6_get_cookie include/net/ip6_fib.h:174
>> RIP  [<ffffffff87a209e8>] sctp_v6_get_dst+0x7c8/0x1960 net/sctp/ipv6.c:340
>>  RSP <ffff88006a736b88>
>> ---[ end trace f42d1c14cb6d2835 ]---
>>
>> This happened on commit a25f0944ba9b1d8a6813fd6f1a86f1bd59ac25a6 (Nov 13).
>>
>> Unfortunately this is not reproducible.
>>
>> The line is:
>>
>>     return rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
>>
>> Can it be a data race? rt->rt6i_node != NULL, but the next moment it
>> is already NULL? That would explain the crash and non-reproducibility
>> (need ThreadSanitizer!).
>>
>> This always happened when called from sctp code, but I don't know if
>> it is relevant or not. It happened only 3 times.
> 
> I'm seeing similar crashes from ipv6 and dccp code, reports below.
> 
> [...]

Thanks for the report.

Do you have a thread running that concurrently mutates the routing table?
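
If the double read that Dmitry suspects does turn out to be the problem, the usual pattern is to load the pointer exactly once and test the local copy. Below is a minimal user-space sketch of that idea in C11. The *_model structs and get_cookie_once() are hypothetical stand-ins, not the kernel's definitions, and the sketch says nothing about the lifetime of the node or about the case where rt itself is invalid.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures. */
struct fib6_node_model {
	unsigned int fn_sernum;
};

struct rt6_info_model {
	_Atomic(struct fib6_node_model *) rt6i_node;
};

/*
 * Load the node pointer once, then test and dereference the local copy.
 * A concurrent writer that clears rt6i_node between the NULL check and
 * the dereference can no longer hand us a NULL pointer.
 */
unsigned int get_cookie_once(struct rt6_info_model *rt)
{
	struct fib6_node_model *fn =
		atomic_load_explicit(&rt->rt6i_node, memory_order_acquire);

	return fn ? fn->fn_sernum : 0;
}
```

This only removes the window between the check and the use; whether the node can be freed while a reader still holds the pointer is a separate question.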

Bye,
Hannes


* net/ipv6: null-ptr-deref in ip6_rt_cache_alloc
From: Andrey Konovalov @ 2016-11-30 10:58 UTC
  To: David S. Miller, Alexey Kuznetsov, James Morris,
	Hideaki YOSHIFUJI, Patrick McHardy, netdev, LKML
  Cc: Dmitry Vyukov, Eric Dumazet, Kostya Serebryany, syzkaller

Hi!

I've got the following error report while running the syzkaller fuzzer.

On commit d8e435f3ab6fea2ea324dce72b51dd7761747523 (Nov 26).

This might be related to the crash in rt6_get_cookie that Dmitry
reported, since it also happens when accessing ort->dst:
https://groups.google.com/forum/#!msg/syzkaller/3uDn6P5bwzA/gdzgPxeYAgAJ

general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 3 PID: 5315 Comm: syz-executor6 Not tainted 4.9.0-rc6+ #468
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88003b729700 task.stack: ffff880038be8000
RIP: 0010:[<ffffffff83442c35>]  [<ffffffff83442c35>]
ip6_rt_cache_alloc+0xa5/0x580 net/ipv6/route.c:953
RSP: 0018:ffff880038bef168  EFLAGS: 00010206
RAX: ffff88003b729700 RBX: 0000000000000007 RCX: 0000000000000001
RDX: 0000000000000000 RSI: ffffc90001aa7000 RDI: 0000000000000018
RBP: ffff880038bef198 R08: 0000000000004000 R09: 0000000000000003
R10: dffffc0000000000 R11: dffffc0000000000 R12: 0000000000000000
R13: ffff880038befa60 R14: 0000000000000000 R15: ffff880069ee1a40
FS:  00007fedfbb9f700(0000) GS:ffff88006e100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000003109cb8 CR3: 000000006c633000 CR4: 00000000000006e0
Stack:
 ffffffff8125141d ffff880069ee1a40 00000000fffd635a 1ffffffff0981200
 0000000000000000 ffff880069ee1a40 ffff880038bef310 ffffffff8344f233
 ffff880038befa60 1ffff1000717de49 ffff880038befa4f ffffffff850a0a68
Call Trace:
 [<ffffffff8344f233>] ip6_pol_route+0x13c3/0x1b20 net/ipv6/route.c:1106
 [<ffffffff8344fa4d>] ip6_pol_route_output+0x4d/0x60 net/ipv6/route.c:1190
 [<ffffffff834f606d>] fib6_rule_action+0x23d/0x740 net/ipv6/fib6_rules.c:100
 [<ffffffff82d82c36>] fib_rules_lookup+0x2b6/0x850 net/core/fib_rules.c:227
 [<ffffffff834f6b46>] fib6_rule_lookup+0xd6/0x260 net/ipv6/fib6_rules.c:44
 [<ffffffff83443426>] ip6_route_output_flags+0x276/0x310 net/ipv6/route.c:1218
 [<ffffffff83408f8d>] ip6_dst_lookup_tail+0xf9d/0x1410 net/ipv6/ip6_output.c:965
 [<ffffffff83409501>] ip6_dst_lookup_flow+0xa1/0x200 net/ipv6/ip6_output.c:1061
 [<ffffffff83488a3c>] rawv6_sendmsg+0xc0c/0x2c20 net/ipv6/raw.c:893
 [<ffffffff832a1037>] inet_sendmsg+0x317/0x4e0 net/ipv4/af_inet.c:734
 [<     inline     >] sock_sendmsg_nosec net/socket.c:621
 [<ffffffff82c9d76c>] sock_sendmsg+0xcc/0x110 net/socket.c:631
 [<ffffffff82c9f651>] ___sys_sendmsg+0x771/0x8b0 net/socket.c:1954
 [<ffffffff82ca163e>] __sys_sendmsg+0xce/0x170 net/socket.c:1988
 [<     inline     >] SYSC_sendmsg net/socket.c:1999
 [<ffffffff82ca170d>] SyS_sendmsg+0x2d/0x50 net/socket.c:1995
 [<ffffffff840f2d81>] entry_SYSCALL_64_fastpath+0x1f/0xc2
Code: 42 80 3c 06 00 0f 85 54 04 00 00 4d 8b 64 24 40 e8 11 11 01 fe
49 8d 7c 24 18 49 ba 00 00 00 00 00 fc ff df 49 89 f9 49 c1 e9 03 <43>
80 3c 11 00 0f 85 77 04 00 00 49 8b 74 24 18 49 bf 00 00 00
RIP  [<ffffffff83442c35>] ip6_rt_cache_alloc+0xa5/0x580 net/ipv6/route.c:953
 RSP <ffff880038bef168>
---[ end trace fefbac32da74ad88 ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled


* Re: Netperf UDP issue with connected sockets
From: Jesper Dangaard Brouer @ 2016-11-30 10:43 UTC
  To: Rick Jones; +Cc: Eric Dumazet, netdev, brouer
In-Reply-To: <20730b37-d218-a1bb-d0fb-0f838e2a77b5@hpe.com>


On Mon, 28 Nov 2016 10:33:49 -0800 Rick Jones <rick.jones2@hpe.com> wrote:

> On 11/17/2016 12:16 AM, Jesper Dangaard Brouer wrote:
> >> time to try IP_MTU_DISCOVER ;)  
> >
> > To Rick, maybe you can find a good solution or option with Eric's hint,
> > to send appropriate sized UDP packets with Don't Fragment (DF).  
> 
> Jesper -
> 
> Top of trunk has a change adding an omni, test-specific -f option which 
> will set IP_MTU_DISCOVER:IP_PMTUDISC_DO on the data socket.  Is that 
> sufficient to your needs?

The "-- -f" option makes the __ip_select_ident lookup go away.  So,
confirming your new option works.

Notice the "fib_lookup" cost is still present, even when I use
option "-- -n -N" to create a connected socket.  As Eric taught us,
this is because we should use syscalls "send" or "write" on a connected
socket.

My udp_flood tool[1] cycles through the different syscalls:

taskset -c 2 ~/git/network-testing/src/udp_flood 198.18.50.1 --count $((10**7)) --pmtu 2
             ns/pkt   pps          cycles/pkt
send         473.08   2113816.28   1891
sendto       558.58   1790265.84   2233
sendmsg      587.24   1702873.80   2348
sendMmsg/32  547.57   1826265.90   2189
write        518.36   1929175.52   2072

Using "send" seems to be the fastest option.
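
For readers who want to reproduce the connected-socket case outside udp_flood, here is a minimal sketch in C. It sets IP_MTU_DISCOVER to IP_PMTUDISC_DO (the same thing "--pmtu 2" selects), connect()s the UDP socket, and then uses send(). The helper names open_connected_udp() and send_one() are made up for this illustration and are not taken from udp_flood.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open an IPv4 UDP socket, request DF via IP_PMTUDISC_DO, and connect()
 * it to ip:port. Returns the file descriptor, or -1 on any failure.
 */
int open_connected_udp(const char *ip, unsigned short port)
{
	struct sockaddr_in dst;
	int pmtu = IP_PMTUDISC_DO;
	int fd;

	memset(&dst, 0, sizeof(dst));
	dst.sin_family = AF_INET;
	dst.sin_port = htons(port);
	if (inet_pton(AF_INET, ip, &dst.sin_addr) != 1)
		return -1;

	fd = socket(AF_INET, SOCK_DGRAM, 0);
	if (fd < 0)
		return -1;

	if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER,
		       &pmtu, sizeof(pmtu)) < 0 ||
	    connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}

/* On a connected socket, plain send() needs no destination address. */
ssize_t send_one(int fd, const void *buf, size_t len)
{
	return send(fd, buf, len, 0);
}
```

A caller would open the socket once with open_connected_udp("198.18.50.1", port) and then call send_one() in a loop; with DF set, the payload has to fit within the path MTU.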

Some notes on the test: I've forced TX completions to happen on another
CPU (CPU0) and pinned the udp_flood program to CPU2, because I want to
keep the CPU scheduler from moving udp_flood around; that causes
fluctuations in the results (as it stresses the memory allocations more).

My udp_flood --pmtu option is documented in the --help usage text (see below the signature).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

$ uname -a
Linux canyon 4.9.0-rc6-page_pool07-baseline+ #185 SMP PREEMPT Wed Nov 30 10:07:51 CET 2016 x86_64

[1] udp_flood tool:
 https://github.com/netoptimizer/network-testing/blob/master/src/udp_flood.c

Quick commands used to verify that __ip_select_ident is gone:

 # First run benchmark
 sudo perf record -g -a ~/tools/netperf2-svn/src/netperf -H 198.18.50.1 \
  -t UDP_STREAM -l 3 -- -m 1472 -f

 # Second grep in perf output for functions
 sudo perf report --no-children --call-graph none --stdio |\
  egrep -e '__ip_select_ident|fib_table_lookup'


$ ./udp_flood --help

DOCUMENTATION:
 This tool is a UDP flood that measures the outgoing packet rate.
 Default cycles through tests with different send system calls.
 What function-call to invoke can also be specified as a command
 line option (see below).

 Default transmit 1000000 packets per test, adjust via --count

 Usage: ./udp_flood (options-see-below) IPADDR
 Listing options:
 --help         short-option: -h
 --ipv4         short-option: -4
 --ipv6         short-option: -6
 --sendmsg      short-option: -u
 --sendmmsg     short-option: -U
 --sendto       short-option: -t
 --write        short-option: -T
 --send         short-option: -S
 --batch        short-option: -b
 --count        short-option: -c
 --port         short-option: -p
 --payload      short-option: -m
 --pmtu         short-option: -d
 --verbose      short-option: -v

 Multiple tests can be selected:
     default: all tests
     -u -U -t -T -S: run any combination of sendmsg/sendmmsg/sendto/write/send

Option --pmtu <N>  for Path MTU discover socket option IP_MTU_DISCOVER
 This affects the DF(Don't-Fragment) bit setting.
 Following values are selectable:
  0 = IP_PMTUDISC_DONT
  1 = IP_PMTUDISC_WANT
  2 = IP_PMTUDISC_DO
  3 = IP_PMTUDISC_PROBE
  4 = IP_PMTUDISC_INTERFACE
  5 = IP_PMTUDISC_OMIT
 Documentation see under IP_MTU_DISCOVER in 'man 7 ip'


* Re: net: GPF in rt6_get_cookie
From: Andrey Konovalov @ 2016-11-30 10:39 UTC
  To: syzkaller
  Cc: David Miller, Alexey Kuznetsov, James Morris, Hideaki YOSHIFUJI,
	Patrick McHardy, netdev, LKML, Eric Dumazet
In-Reply-To: <CACT4Y+Y-B_GCDGpcLHt1CtXs3u9MBFD82MSFWTcL_v4Vi3+=HQ@mail.gmail.com>

On Sat, Nov 26, 2016 at 5:23 PM, 'Dmitry Vyukov' via syzkaller
<syzkaller@googlegroups.com> wrote:
> Hello,
>
> I got several GPFs in rt6_get_cookie while running syzkaller:
>
> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 2 PID: 10156 Comm: syz-executor Not tainted 4.9.0-rc5+ #54
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: ffff880016f40480 task.stack: ffff88000fc00000
> RIP: 0010:[<ffffffff87a209e8>]  [<     inline     >] rt6_get_cookie
> include/net/ip6_fib.h:174
> RIP: 0010:[<ffffffff87a209e8>]  [<ffffffff87a209e8>]
> sctp_v6_get_dst+0x7c8/0x1960 net/sctp/ipv6.c:340
> RSP: 0018:ffff88000fc07298  EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc900029f5000
> RDX: 0000000000000015 RSI: 0000000000000001 RDI: 00000000000000a8
> RBP: ffff88000fc07580 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff880066cd0068
> R13: 1ffff10001f80e92 R14: ffff880066cd0040 R15: ffff88005f2d2808
> FS:  00007f52c41f7700(0000) GS:ffff88006d000000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 0000000020016000 CR3: 0000000065dd7000 CR4: 00000000000006e0
> DR0: 0000000000000400 DR1: 0000000000000400 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Stack:
>  ffffffff87a210f6 ffffffff8701ad45 ffff88006768ec20 ffff88006768ec20
>  0000000000000000 0000000016f40480 ffff88000fc07450 1ffff1000cd9a017
>  ffff88006768ec00 ffff880066fc0730 ffff880066cd0068 1ffff10001f80e66
> Call Trace:
>  [<ffffffff879a313d>] sctp_transport_route+0xad/0x430 net/sctp/transport.c:279
>  [<ffffffff8799b106>] sctp_assoc_add_peer+0x5a6/0x13e0 net/sctp/associola.c:641
>  [<ffffffff879e8911>] sctp_sendmsg+0x1921/0x3bc0 net/sctp/socket.c:1864
>  [<ffffffff8701ad45>] inet_sendmsg+0x385/0x590 net/ipv4/af_inet.c:734
>  [<     inline     >] sock_sendmsg_nosec net/socket.c:621
>  [<ffffffff86a6d54f>] sock_sendmsg+0xcf/0x110 net/socket.c:631
>  [<ffffffff86a6ede0>] SYSC_sendto+0x660/0x810 net/socket.c:1656
>  [<ffffffff86a71dd5>] SyS_sendto+0x45/0x60 net/socket.c:1624
>  [<ffffffff88149dc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
> Code: 00 00 48 8b 84 24 88 00 00 00 48 8b 58 40 e8 80 76 cc f9 48 8d
> bb a8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80>
> 3c 02 00 0f 85 56 0f 00 00 48 8b 9b a8 00 00 00 45 31 ed 48
> RIP  [<     inline     >] rt6_get_cookie include/net/ip6_fib.h:174
> RIP  [<ffffffff87a209e8>] sctp_v6_get_dst+0x7c8/0x1960 net/sctp/ipv6.c:340
>  RSP <ffff88000fc07298>
> ---[ end trace b8d1354fa571700d ]---
>
>
> general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN
> Dumping ftrace buffer:
>    (ftrace buffer empty)
> Modules linked in:
> CPU: 3 PID: 22744 Comm: syz-executor Not tainted 4.9.0-rc5+ #54
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
> task: ffff88006b92a840 task.stack: ffff88006a730000
> RIP: 0010:[<ffffffff87a209e8>]  [<     inline     >] rt6_get_cookie
> include/net/ip6_fib.h:174
> RIP: 0010:[<ffffffff87a209e8>]  [<ffffffff87a209e8>]
> sctp_v6_get_dst+0x7c8/0x1960 net/sctp/ipv6.c:340
> RSP: 0018:ffff88006a736b88  EFLAGS: 00010202
> RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffc90003c4f000
> RDX: 0000000000000015 RSI: 0000000000000001 RDI: 00000000000000a8
> RBP: ffff88006a736e68 R08: 0000000000000000 R09: 0000000000000001
> R10: 0000000000000000 R11: 0000000000000000 R12: ffff880064cff268
> R13: 1ffff1000d4e6db0 R14: ffff880064cff240 R15: ffff88006a4b6808
> FS:  00007f74f4ec9700(0000) GS:ffff88006d100000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 000000002070effc CR3: 000000003bd2f000 CR4: 00000000000006e0
> DR0: 0000000000000400 DR1: 0000000000000400 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
> Stack:
>  ffffffff87a210f6 ffffffff000bbd2d ffff88006c2cd5a0 ffff88006c2cd5a0
>  0000000000000000 000000006ccb46c0 ffff88006a736d40 1ffff1000c99fe57
>  ffff88006c2cd500 ffff8800658b1f30 ffff880064cff268 1ffff1000d4e6d84
> Call Trace:
>  [<ffffffff879a313d>] sctp_transport_route+0xad/0x430 net/sctp/transport.c:279
>  [<ffffffff8799b106>] sctp_assoc_add_peer+0x5a6/0x13e0 net/sctp/associola.c:641
>  [<ffffffff879e4358>] __sctp_connect+0x288/0xc90 net/sctp/socket.c:1178
>  [<ffffffff879e4f0b>] __sctp_setsockopt_connectx+0x1ab/0x200
> net/sctp/socket.c:1332
>  [<     inline     >] sctp_getsockopt_connectx3 net/sctp/socket.c:1417
>  [<ffffffff879fd2bd>] sctp_getsockopt+0x36ed/0x6800 net/sctp/socket.c:6474
>  [<ffffffff86a76c0a>] sock_common_getsockopt+0x9a/0xe0 net/core/sock.c:2649
>  [<     inline     >] SYSC_getsockopt net/socket.c:1788
>  [<ffffffff86a724d7>] SyS_getsockopt+0x257/0x390 net/socket.c:1770
>  [<ffffffff88149dc5>] entry_SYSCALL_64_fastpath+0x23/0xc6
> Code: 00 00 48 8b 84 24 88 00 00 00 48 8b 58 40 e8 80 76 cc f9 48 8d
> bb a8 00 00 00 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80>
> 3c 02 00 0f 85 56 0f 00 00 48 8b 9b a8 00 00 00 45 31 ed 48
> RIP  [<     inline     >] rt6_get_cookie include/net/ip6_fib.h:174
> RIP  [<ffffffff87a209e8>] sctp_v6_get_dst+0x7c8/0x1960 net/sctp/ipv6.c:340
>  RSP <ffff88006a736b88>
> ---[ end trace f42d1c14cb6d2835 ]---
>
> This happened on commit a25f0944ba9b1d8a6813fd6f1a86f1bd59ac25a6 (Nov 13).
>
> Unfortunately this is not reproducible.
>
> The line is:
>
>     return rt->rt6i_node ? rt->rt6i_node->fn_sernum : 0;
>
> Can it be a data race? rt->rt6i_node != NULL, but the next moment it
> is already NULL? That would explain the crash and non-reproducibility
> (need ThreadSanitizer!).
>
> This always happened when called from sctp code, but I don't know if
> it is relevant or not. It happened only 3 times.

I'm seeing similar crashes from ipv6 and dccp code, reports below.

===

general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 1 PID: 30320 Comm: syz-executor0 Not tainted 4.9.0-rc6+ #462
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff880069c10040 task.stack: ffff880069f20000
RIP: 0010:[<ffffffff839eec82>]  [<     inline     >] rt6_get_cookie
include/net/ip6_fib.h:174
RIP: 0010:[<ffffffff839eec82>]  [<     inline     >] ip6_dst_store
include/net/ip6_route.h:174
RIP: 0010:[<ffffffff839eec82>]  [<ffffffff839eec82>]
dccp_v6_connect+0x762/0x14e0 net/dccp/ipv6.c:899
RSP: 0018:ffff880069f27ab0  EFLAGS: 00010202
RAX: ffff880069c10040 RBX: ffff88003b5b0040 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: ffffc90000e6c000 RDI: 00000000000000a8
RBP: ffff880069f27c08 R08: 0000000000000015 R09: ffffffff839eec65
R10: 1ffff1000d2d78e0 R11: dffffc0000000000 R12: ffff880069f27e00
R13: ffff8800696bc6c0 R14: ffff88003b5b08e8 R15: ffff88003b5b08e8
FS:  00007f04712ef700(0000) GS:ffff88003ed00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004ad010 CR3: 0000000037c6a000 CR4: 00000000000006e0
Stack:
 ffff880069f27c58 ffff88003b5b04a0 ffff88003b5b0088 ffff88003b5b0078
 ffff880069f27e02 ffff88003b5b0052 00000000ffffffff 0000000000000000
 1ffff1000d3e4f60 0000000000000000 0000000041b58ab3 ffffffff84a3599d
Call Trace:
 [<ffffffff8329b087>] __inet_stream_connect+0x2a7/0xb30 net/ipv4/af_inet.c:594
 [<ffffffff8329b965>] inet_stream_connect+0x55/0xa0 net/ipv4/af_inet.c:655
 [<ffffffff82c9e5f4>] SYSC_connect+0x244/0x2f0 net/socket.c:1548
 [<ffffffff82ca0de4>] SyS_connect+0x24/0x30 net/socket.c:1529
 [<ffffffff81006485>] do_syscall_64+0x195/0x490 arch/x86/entry/common.c:280
 [<ffffffff840f2d09>] entry_SYSCALL64_slow_path+0x25/0x25
Code: 49 8b 7d 40 48 89 7c 24 38 e8 cb 50 a6 fd 48 8b 4c 24 38 48 ba
00 00 00 00 00 fc ff df 48 8d b9 a8 00 00 00 49 89 f8 49 c1 e8 03 <41>
80 3c 10 00 0f 85 52 0d 00 00 48 8b 81 a8 00 00 00 48 85 c0
RIP  [<     inline     >] rt6_get_cookie include/net/ip6_fib.h:174
RIP  [<     inline     >] ip6_dst_store include/net/ip6_route.h:174
RIP  [<ffffffff839eec82>] dccp_v6_connect+0x762/0x14e0 net/dccp/ipv6.c:899
 RSP <ffff880069f27ab0>
---[ end trace e7d9d916f3bf26c5 ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

===

general protection fault: 0000 [#1] SMP KASAN
Dumping ftrace buffer:
   (ftrace buffer empty)
Modules linked in:
CPU: 3 PID: 21865 Comm: syz-executor0 Not tainted 4.9.0-rc6+ #462
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
task: ffff88003bacc480 task.stack: ffff88003a038000
RIP: 0010:[<ffffffff834c9c7f>]  [<     inline     >] rt6_get_cookie
include/net/ip6_fib.h:174
RIP: 0010:[<ffffffff834c9c7f>]  [<     inline     >] ip6_dst_store
include/net/ip6_route.h:174
RIP: 0010:[<ffffffff834c9c7f>]  [<ffffffff834c9c7f>]
ip6_datagram_dst_update+0x75f/0xe70 net/ipv6/datagram.c:108
RSP: 0018:ffff88003a03fb48  EFLAGS: 00010202
RAX: ffff88003bacc480 RBX: ffff880068e887c0 RCX: 0000000000000001
RDX: 0000000000000015 RSI: ffffc90000de4000 RDI: 00000000000000a8
RBP: ffff88003a03fc88 R08: 0000000000004000 R09: ffffffff834c9c67
R10: dffffc0000000000 R11: dffffc0000000000 R12: ffff880068e88cf0
R13: ffff88006ba75a40 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f1d4c9f3700(0000) GS:ffff88006e100000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004ad010 CR3: 000000006cde5000 CR4: 00000000000006e0
Stack:
 ffffffff834c997b 0000000041b58ab3 ffffffff849dee58 1ffff10007407f70
 ffff880068e887f8 ffff880068e887f8 ffff880068e88cf0 0000000041b58ab3
 ffffffff84a35071 ffffffff834c9520 0000000000000007 ffff88003bacc480
Call Trace:
 [<     inline     >] __ip6_datagram_connect net/ipv6/datagram.c:246
 [<ffffffff834cab95>] ip6_datagram_connect+0x375/0xcc0 net/ipv6/datagram.c:261
 [<ffffffff834cb53f>] ip6_datagram_connect_v6_only+0x5f/0x80
net/ipv6/datagram.c:273
 [<ffffffff8329a3ab>] inet_dgram_connect+0x11b/0x200 net/ipv4/af_inet.c:530
 [<ffffffff82c9e5f4>] SYSC_connect+0x244/0x2f0 net/socket.c:1548
 [<ffffffff82ca0de4>] SyS_connect+0x24/0x30 net/socket.c:1529
 [<ffffffff840f2c41>] entry_SYSCALL_64_fastpath+0x1f/0xc2
Code: 80 3c 08 00 0f 85 96 05 00 00 4d 8b 7d 40 e8 c9 a0 f8 fd 49 8d
bf a8 00 00 00 49 ba 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <42>
80 3c 12 00 0f 85 74 05 00 00 4d 8b bf a8 00 00 00 4d 85 ff
RIP  [<     inline     >] rt6_get_cookie include/net/ip6_fib.h:174
RIP  [<     inline     >] ip6_dst_store include/net/ip6_route.h:174
RIP  [<ffffffff834c9c7f>] ip6_datagram_dst_update+0x75f/0xe70
net/ipv6/datagram.c:108
 RSP <ffff88003a03fb48>
---[ end trace 148fc8ac80034c6f ]---
Kernel panic - not syncing: Fatal exception
Dumping ftrace buffer:
   (ftrace buffer empty)
Kernel Offset: disabled

===

>
> Thanks.


* Re: stmmac ethernet in kernel 4.4: coalescing related pauses?
From: Pavel Machek @ 2016-11-30 10:28 UTC
  To: Eric Dumazet; +Cc: David Miller, lsanfil, peppe.cavallaro, netdev, linux-kernel
In-Reply-To: <1480347103.18162.58.camel@edumazet-glaptop3.roam.corp.google.com>


On Mon 2016-11-28 07:31:43, Eric Dumazet wrote:
> On Mon, 2016-11-28 at 09:54 -0500, David Miller wrote:
> > From: Lino Sanfilippo <lsanfil@marvell.com>
> > Date: Mon, 28 Nov 2016 14:07:51 +0100
> > 
> > > Calling skb_orphan() in the xmit handler made this issue disappear.
> > 
> > This is not the way to handle this problem.
> > 
> > The solution is to free the SKBs in a timely manner after the
> > chip has transmitted the frame.
> 
> Note that the 'pauses' described by Pavel are also caused by a too small
> SO_SNDBUF value on the UDP socket.
> 
> An immediate fix, with no kernel change is to increase it.
> 
> echo 1000000 >/proc/sys/net/core/wmem_default

Thanks a lot. For the record, that works around the problem, too. (Or
at least it helps a lot; if I read the sources correctly, the problem
may still remain when a continuous stream of packets triggers it.)
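
For an application that cannot change the system-wide default, the same workaround can presumably be applied per socket with SO_SNDBUF. The sketch below is only an illustration: grow_sndbuf() is a made-up helper name, and the size actually granted depends on net.core.wmem_max (per socket(7), Linux doubles the requested value and caps the request at that limit).

```c
#include <sys/socket.h>

/*
 * Ask for a larger send buffer on one socket and report what the kernel
 * actually granted. Returns the effective SO_SNDBUF value, or -1 on error.
 */
int grow_sndbuf(int fd, int bytes)
{
	int effective = 0;
	socklen_t len = sizeof(effective);

	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &effective, &len) < 0)
		return -1;
	return effective;
}
```

Calling grow_sndbuf(fd, 1000000) right after socket() mirrors the wmem_default setting above for that one socket, within the wmem_max limit.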

Best regards,
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html



* Re: [PATCH net] tipc: check minimum bearer MTU
From: Ying Xue @ 2016-11-30 10:28 UTC
  To: Michal Kubecek, Jon Maloy
  Cc: Qian, netdev, Zhang, linux-kernel, tipc-discussion, Ben Hutchings,
	David S. Miller
In-Reply-To: <20161130095702.DD033A0F14@unicorn.suse.cz>

On 11/30/2016 05:57 PM, Michal Kubecek wrote:
> Qian Zhang (张谦) reported a potential socket buffer overflow in
> tipc_msg_build() which is also known as CVE-2016-8632: due to
> insufficient checks, a buffer overflow can occur if MTU is too short for
> even tipc headers. As anyone can set device MTU in a user/net namespace,
> this issue can be abused by a regular user.
>
> As agreed in the discussion on Ben Hutchings' original patch, we should
> check the MTU at the moment a bearer is attached rather than for each
> processed packet. We also need to repeat the check when bearer MTU is
> adjusted to new device MTU. UDP case also needs a check to avoid
> overflow when calculating bearer MTU.
>
> Fixes: b97bf3fd8f6a ("[TIPC] Initial merge")
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>
> ---
>  net/tipc/bearer.c    |  9 +++++++--
>  net/tipc/bearer.h    | 13 +++++++++++++
>  net/tipc/udp_media.c |  5 +++++
>  3 files changed, 25 insertions(+), 2 deletions(-)
>
> diff --git a/net/tipc/bearer.c b/net/tipc/bearer.c
> index 975dbeb60ab0..dd4b19e8bb43 100644
> --- a/net/tipc/bearer.c
> +++ b/net/tipc/bearer.c
> @@ -421,6 +421,10 @@ int tipc_enable_l2_media(struct net *net, struct tipc_bearer *b,
>  	dev = dev_get_by_name(net, driver_name);
>  	if (!dev)
>  		return -ENODEV;
> +	if (tipc_check_mtu(dev, 0)) {
> +		dev_put(dev);
> +		return -EINVAL;
> +	}
>
>  	/* Associate TIPC bearer with L2 bearer */
>  	rcu_assign_pointer(b->media_ptr, dev);
> @@ -610,8 +614,6 @@ static int tipc_l2_device_event(struct notifier_block *nb, unsigned long evt,
>  	if (!b)
>  		return NOTIFY_DONE;
>
> -	b->mtu = dev->mtu;
> -
>  	switch (evt) {
>  	case NETDEV_CHANGE:
>  		if (netif_carrier_ok(dev))
> @@ -624,6 +626,9 @@ static int tipc_l2_device_event(struct notifier_block *nb, unsigned long evt,
>  		tipc_reset_bearer(net, b);
>  		break;
>  	case NETDEV_CHANGEMTU:
> +		if (tipc_check_mtu(dev, 0))
> +			return -EINVAL;
> +		b->mtu = dev->mtu;
>  		tipc_reset_bearer(net, b);
>  		break;
>  	case NETDEV_CHANGEADDR:
> diff --git a/net/tipc/bearer.h b/net/tipc/bearer.h
> index 78892e2f53e3..1a0b7434ec24 100644
> --- a/net/tipc/bearer.h
> +++ b/net/tipc/bearer.h
> @@ -39,6 +39,7 @@
>
>  #include "netlink.h"
>  #include "core.h"
> +#include "msg.h"
>  #include <net/genetlink.h>
>
>  #define MAX_MEDIA	3
> @@ -59,6 +60,9 @@
>  #define TIPC_MEDIA_TYPE_IB	2
>  #define TIPC_MEDIA_TYPE_UDP	3
>
> +/* minimum bearer MTU */
> +#define TIPC_MIN_BEARER_MTU	(MAX_H_SIZE + INT_H_SIZE)
> +
>  /**
>   * struct tipc_media_addr - destination address used by TIPC bearers
>   * @value: address info (format defined by media)
> @@ -215,4 +219,13 @@ void tipc_bearer_xmit(struct net *net, u32 bearer_id,
>  void tipc_bearer_bc_xmit(struct net *net, u32 bearer_id,
>  			 struct sk_buff_head *xmitq);
>
> +/* check if device MTU is sufficient for tipc headers */
> +inline bool tipc_check_mtu(struct net_device *dev, unsigned int reserve)

It's unnecessary to explicitly declare a function as inline; please
let GCC decide this instead.

> +{
> +	if (dev->mtu >= TIPC_MIN_BEARER_MTU + reserve)
> +		return false;
> +	netdev_warn(dev, "MTU too low for tipc bearer\n");
> +	return true;
> +}
> +
>  #endif	/* _TIPC_BEARER_H */
> diff --git a/net/tipc/udp_media.c b/net/tipc/udp_media.c
> index 78cab9c5a445..376ed3e3ed46 100644
> --- a/net/tipc/udp_media.c
> +++ b/net/tipc/udp_media.c
> @@ -697,6 +697,11 @@ static int tipc_udp_enable(struct net *net, struct tipc_bearer *b,
>  		udp_conf.local_ip.s_addr = htonl(INADDR_ANY);
>  		udp_conf.use_udp_checksums = false;
>  		ub->ifindex = dev->ifindex;
> +		if (tipc_check_mtu(dev, sizeof(struct iphdr) +
> +					sizeof(struct udphdr))) {
> +			err = -EINVAL;
> +			goto err;
> +		}

For the UDP bearer, checking the MTU only when the bearer is enabled
seems insufficient. We should also update the UDP bearer's MTU, via
Path MTU discovery, when the MTU changes after the bearer has been
enabled.
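
To make the arithmetic under discussion concrete, here is a small user-space sketch of the check. The header sizes (60 for MAX_H_SIZE, 40 for INT_H_SIZE, 20 for an IPv4 header without options, 8 for UDP) are assumptions on my side rather than values quoted from the patch, and mtu_too_small() and udp_bearer_mtu() are hypothetical names, not TIPC functions.

```c
#include <stdbool.h>

/* Assumed TIPC header sizes; check net/tipc/msg.h for the real values. */
#define MAX_H_SIZE		60u
#define INT_H_SIZE		40u
#define TIPC_MIN_BEARER_MTU	(MAX_H_SIZE + INT_H_SIZE)

/* Assumed encapsulation overhead for the UDP bearer over IPv4. */
#define IPV4_HDR_LEN		20u
#define UDP_HDR_LEN		8u

/* Same sense as tipc_check_mtu() in the patch: true means "too small". */
bool mtu_too_small(unsigned int dev_mtu, unsigned int reserve)
{
	return dev_mtu < TIPC_MIN_BEARER_MTU + reserve;
}

/*
 * Bearer MTU for the UDP media. Returns 0 when the device MTU cannot
 * hold the encapsulation plus the TIPC headers, so the subtraction
 * below can never wrap around.
 */
unsigned int udp_bearer_mtu(unsigned int dev_mtu)
{
	unsigned int reserve = IPV4_HDR_LEN + UDP_HDR_LEN;

	if (mtu_too_small(dev_mtu, reserve))
		return 0;
	return dev_mtu - reserve;
}
```

Under these assumptions a 1500-byte device MTU gives a 1472-byte bearer MTU, and anything below 128 bytes is rejected for the UDP case. Re-running the same function whenever the path MTU changes would be one way to address the point above.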

Regards,
Ying

>  		b->mtu = dev->mtu - sizeof(struct iphdr)
>  			- sizeof(struct udphdr);
>  #if IS_ENABLED(CONFIG_IPV6)
>




* [PATCH] net:phy fix driver reference count error when attach and detach phy device
From: Mao Wenan @ 2016-11-30 10:22 UTC
  To: netdev, f.fainelli, dingtianhong

The NIC on my board uses a Marvell PHY, and the system loads the Marvell
PHY driver automatically. When I remove the PHY driver, the system
panics immediately:
Call trace:
[ 2582.834493] [<ffff800000715384>] phy_state_machine+0x3c/0x438
[ 2582.851754] [<ffff8000000db3b8>] process_one_work+0x150/0x428
[ 2582.868188] [<ffff8000000db7d4>] worker_thread+0x144/0x4b0
[ 2582.883882] [<ffff8000000e1d0c>] kthread+0xfc/0x110

There should be proper reference counting in place to avoid that.
I found that phy_attach_direct() forgets to take a reference on the PHY
device driver module, and phy_detach() forgets to drop it.
This patch fixes the bug; with it applied, the panic no longer occurs
when removing marvell.ko.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
---
 drivers/net/phy/phy_device.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 1a4bf8a..a7ec7c2 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -866,6 +866,11 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
 		return -EIO;
 	}
 
+	if (!try_module_get(d->driver->owner)) {
+		dev_err(&dev->dev, "failed to get the device driver module\n");
+		return -EIO;
+	}
+
 	get_device(d);
 
 	/* Assume that if there is no driver, that it doesn't
@@ -921,6 +926,7 @@ int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
 
 error:
 	put_device(d);
+	module_put(d->driver->owner);
 	module_put(bus->owner);
 	return err;
 }
@@ -998,6 +1004,7 @@ void phy_detach(struct phy_device *phydev)
 	bus = phydev->mdio.bus;
 
 	put_device(&phydev->mdio.dev);
+	module_put(phydev->mdio.dev.driver->owner);
 	module_put(bus->owner);
 }
 EXPORT_SYMBOL(phy_detach);
-- 
2.7.0
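
The pairing the commit message describes can be illustrated outside the
kernel. The following is a userspace toy, not the phylib code: sketch_module
and its helpers are hypothetical stand-ins for a module's reference count,
try_module_get() and module_put(). It only shows that every reference taken
at attach time is dropped at detach time, and that a failed attach unwinds
what it already took.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a module's reference count. */
struct sketch_module {
	int refcnt;
	bool unloading;	/* set once removal has started */
};

/* Take a reference unless the module is already going away. */
static bool sketch_try_get(struct sketch_module *m)
{
	if (m->unloading)
		return false;
	m->refcnt++;
	return true;
}

/* Drop a reference previously taken with sketch_try_get(). */
static void sketch_put(struct sketch_module *m)
{
	assert(m->refcnt > 0);
	m->refcnt--;
}

/* A module may only be removed when nobody holds a reference. */
static bool sketch_can_unload(const struct sketch_module *m)
{
	return m->refcnt == 0;
}

/* Attach pins the bus module and the driver module, unwinding on failure. */
static int sketch_attach(struct sketch_module *bus, struct sketch_module *drv)
{
	if (!sketch_try_get(bus))
		return -1;
	if (!sketch_try_get(drv)) {
		sketch_put(bus);
		return -1;
	}
	return 0;
}

/* Detach releases both references in reverse order. */
static void sketch_detach(struct sketch_module *bus, struct sketch_module *drv)
{
	sketch_put(drv);
	sketch_put(bus);
}
```

While attached, sketch_can_unload() reports false for the driver module, which
is the property that would keep the state machine from running code that has
already been unloaded.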


* Re: [PATCH net] tipc: check minimum bearer MTU
From: Michal Kubecek @ 2016-11-30 10:24 UTC (permalink / raw)
  To: Jon Maloy, Ying Xue
  Cc: David S. Miller, tipc-discussion, netdev, linux-kernel,
	Ben Hutchings, Qian Zhang
In-Reply-To: <20161130095702.DD033A0F14@unicorn.suse.cz>

On Wed, Nov 30, 2016 at 10:57:02AM +0100, Michal Kubecek wrote:
> Qian Zhang (张谦) reported a potential socket buffer overflow in
> tipc_msg_build() which is also known as CVE-2016-8632: due to
> insufficient checks, a buffer overflow can occur if MTU is too short for
> even tipc headers. As anyone can set device MTU in a user/net namespace,
> this issue can be abused by a regular user.
> 
> As agreed in the discussion on Ben Hutchings' original patch, we should
> check the MTU at the moment a bearer is attached rather than for each
> processed packet. We also need to repeat the check when bearer MTU is
> adjusted to new device MTU. UDP case also needs a check to avoid
> overflow when calculating bearer MTU.
> 
> Fixes: b97bf3fd8f6a ("[TIPC] Initial merge")
> Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
> Reported-by: Qian Zhang (张谦) <zhangqian-c@360.cn>

Self-NACK.

I'm sorry; while testing this, I overlooked that an attempt to change
the MTU of an underlying device to a low value issues a warning but
succeeds anyway.

> @@ -624,6 +626,9 @@ static int tipc_l2_device_event(struct notifier_block *nb, unsigned long evt,
>  		tipc_reset_bearer(net, b);
>  		break;
>  	case NETDEV_CHANGEMTU:
> +		if (tipc_check_mtu(dev, 0))
> +			return -EINVAL;
> +		b->mtu = dev->mtu;
>  		tipc_reset_bearer(net, b);
>  		break;
>  	case NETDEV_CHANGEADDR:

This is a notifier, so the error value needs to be encoded as a notifier
error. I'll send v2 after retesting.

Michal Kubecek
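
For readers unfamiliar with the encoding mentioned above, here is a userspace
sketch modelled on the kernel's notifier_from_errno() and notifier_to_errno()
helpers from include/linux/notifier.h, reproduced from memory. The sketch_
names and constants are local to this example and should be checked against
the kernel tree before relying on them.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_NOTIFY_OK		0x0001
#define SKETCH_NOTIFY_STOP_MASK	0x8000

/* Encode 0 or a negative errno as a notifier return value. */
static int sketch_notifier_from_errno(int err)
{
	assert(err <= 0);
	if (err)
		return SKETCH_NOTIFY_STOP_MASK | (SKETCH_NOTIFY_OK - err);
	return SKETCH_NOTIFY_OK;
}

/* Decode a notifier return value back into 0 or a negative errno. */
static int sketch_notifier_to_errno(int ret)
{
	ret &= ~SKETCH_NOTIFY_STOP_MASK;
	return ret > SKETCH_NOTIFY_OK ? SKETCH_NOTIFY_OK - ret : 0;
}
```

In this sketch, decoding a raw -EINVAL that was never encoded yields 0, which
would be consistent with the observation that the MTU change succeeds despite
the warning.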


* Re: [PATCH 3/6] net: ethernet: ti: cpts: add support of cpts HW_TS_PUSH
From: Richard Cochran @ 2016-11-30 10:19 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: David S. Miller, netdev-u79uwXL29TY76Z2rM5mHXA, Mugunthan V N,
	Sekhar Nori, linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-omap-u79uwXL29TY76Z2rM5mHXA, Rob Herring,
	devicetree-u79uwXL29TY76Z2rM5mHXA, Murali Karicheri, Wingman Kwok
In-Reply-To: <20161128230428.6872-4-grygorii.strashko-l0cyMroinI0@public.gmane.org>

On Mon, Nov 28, 2016 at 05:04:25PM -0600, Grygorii Strashko wrote:
> +/* HW TS */
> +static int cpts_extts_enable(struct cpts *cpts, u32 index, int on)
> +{
> +	unsigned long flags;
> +	u32 v;
> +
> +	if (index >= cpts->info.n_ext_ts)
> +		return -ENXIO;
> +
> +	if (((cpts->hw_ts_enable & BIT(index)) >> index) == on)
> +		return 0;
> +
> +	spin_lock_irqsave(&cpts->lock, flags);
> +
> +	v = cpts_read32(cpts, control);
> +	if (on) {
> +		v |= BIT(8 + index);
> +		cpts->hw_ts_enable |= BIT(index);
> +	} else {
> +		v &= ~BIT(8 + index);
> +		cpts->hw_ts_enable &= ~BIT(index);
> +	}
> +	cpts_write32(cpts, v, control);
> +
> +	spin_unlock_irqrestore(&cpts->lock, flags);
> +
> +	if (cpts->hw_ts_enable)
> +		/* poll for events faster - evry 200 ms */

every

> +		cpts->ov_check_period =
> +			msecs_to_jiffies(CPTS_EVENT_HWSTAMP_TIMEOUT);

Bad indentation.  Use braces {} to contain the comment and assignment
statement.

> +	else
> +		cpts->ov_check_period = cpts->ov_check_period_slow;
> +
> +	mod_delayed_work(system_wq, &cpts->overflow_work,
> +			 cpts->ov_check_period);
> +
> +	return 0;
> +}

Thanks,
Richard
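
One way to read the brace suggestion above, shown as a standalone sketch.
sketch_cpts and sketch_update_period are hypothetical names that mirror only
the fields used in the quoted hunk; this is not the driver code.

```c
#include <assert.h>
#include <stddef.h>

struct sketch_cpts {
	unsigned int hw_ts_enable;		/* bitmask of enabled HW_TS inputs */
	unsigned long ov_check_period;
	unsigned long ov_check_period_slow;
};

/* Pick the overflow-check period; braces keep the comment inside the branch. */
static void sketch_update_period(struct sketch_cpts *cpts, unsigned long fast)
{
	assert(cpts != NULL);
	if (cpts->hw_ts_enable) {
		/* poll for events faster - every 200 ms */
		cpts->ov_check_period = fast;
	} else {
		cpts->ov_check_period = cpts->ov_check_period_slow;
	}
}
```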


* [patch net-next v3 12/12] rocker: Register FIB notifier before creating ports
From: Jiri Pirko @ 2016-11-30 10:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

Unlike mlxsw, rocker only supports the reflection of routes pointing to
its own netdevs. Therefore, instead of requesting a FIB dump during
init, simply register the FIB notifier before creating the ports.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/rocker/rocker_main.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 914e9e1..8c9c90a 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2804,6 +2804,9 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_alloc_ordered_workqueue;
 	}
 
+	rocker->fib_nb.notifier_call = rocker_router_fib_event;
+	register_fib_notifier(&rocker->fib_nb);
+
 	rocker->hw.id = rocker_read64(rocker, SWITCH_ID);
 
 	err = rocker_probe_ports(rocker);
@@ -2812,15 +2815,13 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_probe_ports;
 	}
 
-	rocker->fib_nb.notifier_call = rocker_router_fib_event;
-	register_fib_notifier(&rocker->fib_nb);
-
 	dev_info(&pdev->dev, "Rocker switch with id %*phN\n",
 		 (int)sizeof(rocker->hw.id), &rocker->hw.id);
 
 	return 0;
 
 err_probe_ports:
+	unregister_fib_notifier(&rocker->fib_nb);
 	destroy_workqueue(rocker->rocker_owq);
 err_alloc_ordered_workqueue:
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
@@ -2848,9 +2849,9 @@ static void rocker_remove(struct pci_dev *pdev)
 {
 	struct rocker *rocker = pci_get_drvdata(pdev);
 
+	rocker_remove_ports(rocker);
 	unregister_fib_notifier(&rocker->fib_nb);
 	rocker_write32(rocker, CONTROL, ROCKER_CONTROL_RESET);
-	rocker_remove_ports(rocker);
 	destroy_workqueue(rocker->rocker_owq);
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
-- 
2.7.4
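
The ordering change can be pictured with a small userspace model. Nothing
below is rocker code: the step names and sketch_ helpers are invented for
illustration. The point is only that teardown, both on the error path and in
remove, undoes the steps in the reverse of the order in which probe performed
them.

```c
#include <assert.h>
#include <stdbool.h>

enum sketch_step {
	SKETCH_ALLOC_WQ, SKETCH_REGISTER_NB, SKETCH_PROBE_PORTS,
	SKETCH_REMOVE_PORTS, SKETCH_UNREGISTER_NB, SKETCH_DESTROY_WQ,
};

#define SKETCH_MAX_LOG 8

struct sketch_dev {
	enum sketch_step log[SKETCH_MAX_LOG];	/* steps in the order executed */
	int n;
	bool fail_ports;			/* simulate a port probe failure */
};

static void sketch_log(struct sketch_dev *d, enum sketch_step s)
{
	assert(d->n < SKETCH_MAX_LOG);
	d->log[d->n++] = s;
}

static int sketch_probe(struct sketch_dev *d)
{
	sketch_log(d, SKETCH_ALLOC_WQ);
	sketch_log(d, SKETCH_REGISTER_NB);	/* listener in place before ports exist */
	if (d->fail_ports)
		goto err_probe_ports;
	sketch_log(d, SKETCH_PROBE_PORTS);
	return 0;

err_probe_ports:
	sketch_log(d, SKETCH_UNREGISTER_NB);
	sketch_log(d, SKETCH_DESTROY_WQ);
	return -1;
}

static void sketch_remove(struct sketch_dev *d)
{
	sketch_log(d, SKETCH_REMOVE_PORTS);
	sketch_log(d, SKETCH_UNREGISTER_NB);
	sketch_log(d, SKETCH_DESTROY_WQ);
}
```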


* [patch net-next v3 11/12] mlxsw: spectrum_router: Request a dump of FIB tables during init
From: Jiri Pirko @ 2016-11-30 10:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

Make sure the device has a complete view of the FIB tables by invoking
their dump during module init.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 23 ++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 14bed1d..d176047 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -2027,8 +2027,23 @@ static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
 	return NOTIFY_DONE;
 }
 
+static void mlxsw_sp_router_fib_dump_flush(struct notifier_block *nb)
+{
+	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
+
+	/* Flush pending FIB notifications and then flush the device's
+	 * table before requesting another dump. Do that with RTNL held,
+	 * as FIB notification block is already registered.
+	 */
+	mlxsw_core_flush_owq();
+	rtnl_lock();
+	mlxsw_sp_router_fib_flush(mlxsw_sp);
+	rtnl_unlock();
+}
+
 int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
 {
+	fib_dump_cb_t *cb = mlxsw_sp_router_fib_dump_flush;
 	int err;
 
 	INIT_LIST_HEAD(&mlxsw_sp->router.nexthop_neighs_list);
@@ -2048,8 +2063,16 @@ int mlxsw_sp_router_init(struct mlxsw_sp *mlxsw_sp)
 
 	mlxsw_sp->fib_nb.notifier_call = mlxsw_sp_router_fib_event;
 	register_fib_notifier(&mlxsw_sp->fib_nb);
+	if (!fib_notifier_dump(&mlxsw_sp->fib_nb, &init_net, cb)) {
+		err = -EBUSY;
+		goto err_fib_notifier_dump;
+	}
+
 	return 0;
 
+err_fib_notifier_dump:
+	unregister_fib_notifier(&mlxsw_sp->fib_nb);
+	mlxsw_sp_neigh_fini(mlxsw_sp);
 err_neigh_init:
 	mlxsw_sp_vrs_fini(mlxsw_sp);
 err_vrs_init:
-- 
2.7.4


* [patch net-next v3 10/12] ipv4: fib: Add an API to request a FIB dump
From: Jiri Pirko @ 2016-11-30 10:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

Commit b90eb7549499 ("fib: introduce FIB notification infrastructure")
introduced a new notification chain to notify listeners (e.g., switchdev
drivers) about addition and deletion of routes.

However, upon registration to the chain the FIB tables can already be
populated, which means potential listeners will have an incomplete view
of the tables.

Solve that by adding an API to request a FIB dump. The dump itself is
done using RCU in order not to starve consumers that need RTNL to make
progress.

The integrity of the dump is ensured by reading the FIB change sequence
counter before and after the dump. This allows us to avoid the
problematic situation in which the dumping process sends an ENTRY_ADD
notification following an ENTRY_DEL generated by another process holding
RTNL.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/ip_fib.h |   4 ++
 net/ipv4/fib_trie.c  | 138 +++++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 142 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 6c67b93..1388bfc 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -221,6 +221,10 @@ enum fib_event_type {
 	FIB_EVENT_RULE_DEL,
 };
 
+typedef void fib_dump_cb_t(struct notifier_block *nb);
+
+bool fib_notifier_dump(struct notifier_block *nb, struct net *net,
+		       fib_dump_cb_t *cb);
 int register_fib_notifier(struct notifier_block *nb);
 int unregister_fib_notifier(struct notifier_block *nb);
 int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 2891356..d944308 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -84,8 +84,90 @@
 #include <trace/events/fib.h>
 #include "fib_lookup.h"
 
+static unsigned int net_fib_seq(const struct net *net)
+{
+	unsigned int fib_seq;
+
+	rtnl_lock();
+	fib_seq = net->ipv4.fib_seq;
+	rtnl_unlock();
+
+	return fib_seq;
+}
+
 static ATOMIC_NOTIFIER_HEAD(fib_chain);
 
+static int call_fib_notifier(struct notifier_block *nb, struct net *net,
+			     enum fib_event_type event_type,
+			     struct fib_notifier_info *info)
+{
+	info->net = net;
+	return nb->notifier_call(nb, event_type, info);
+}
+
+static void fib_rules_notify(struct net *net, struct notifier_block *nb,
+			     enum fib_event_type event_type)
+{
+#ifdef CONFIG_IP_MULTIPLE_TABLES
+	struct fib_notifier_info info;
+
+	if (net->ipv4.fib_has_custom_rules)
+		call_fib_notifier(nb, net, event_type, &info);
+#endif
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+		       enum fib_event_type event_type);
+
+static int call_fib_entry_notifier(struct notifier_block *nb, struct net *net,
+				   enum fib_event_type event_type, u32 dst,
+				   int dst_len, struct fib_info *fi,
+				   u8 tos, u8 type, u32 tb_id, u32 nlflags)
+{
+	struct fib_entry_notifier_info info = {
+		.dst = dst,
+		.dst_len = dst_len,
+		.fi = fi,
+		.tos = tos,
+		.type = type,
+		.tb_id = tb_id,
+		.nlflags = nlflags,
+	};
+	return call_fib_notifier(nb, net, event_type, &info.info);
+}
+
+static bool fib_dump_is_consistent(struct notifier_block *nb,
+				   const struct net *net, fib_dump_cb_t *cb,
+				   unsigned int fib_seq)
+{
+	if (fib_seq == net_fib_seq(net))
+		return true;
+	if (cb)
+		cb(nb);
+	return false;
+}
+
+bool fib_notifier_dump(struct notifier_block *nb, struct net *net,
+		       fib_dump_cb_t *cb)
+{
+	int retries = 0;
+
+	do {
+		unsigned int fib_seq = net_fib_seq(net);
+
+		rcu_read_lock();
+		fib_rules_notify(net, nb, FIB_EVENT_RULE_ADD);
+		fib_notify(net, nb, FIB_EVENT_ENTRY_ADD);
+		rcu_read_unlock();
+
+		if (fib_dump_is_consistent(nb, net, cb, fib_seq))
+			return true;
+	} while (++retries < net->ipv4.sysctl_fib_dump_max_retries);
+
+	return false;
+}
+EXPORT_SYMBOL(fib_notifier_dump);
+
 int register_fib_notifier(struct notifier_block *nb)
 {
 	return atomic_notifier_chain_register(&fib_chain, nb);
@@ -1902,6 +1984,62 @@ int fib_table_flush(struct net *net, struct fib_table *tb)
 	return found;
 }
 
+static void fib_leaf_notify(struct net *net, struct key_vector *l,
+			    struct fib_table *tb, struct notifier_block *nb,
+			    enum fib_event_type event_type)
+{
+	struct fib_alias *fa;
+
+	hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
+		struct fib_info *fi = fa->fa_info;
+
+		if (!fi)
+			continue;
+
+		/* local and main table can share the same trie,
+		 * so don't notify twice for the same entry.
+		 */
+		if (tb->tb_id != fa->tb_id)
+			continue;
+
+		call_fib_entry_notifier(nb, net, event_type, l->key,
+					KEYLENGTH - fa->fa_slen, fi, fa->fa_tos,
+					fa->fa_type, fa->tb_id, 0);
+	}
+}
+
+static void fib_table_notify(struct net *net, struct fib_table *tb,
+			     struct notifier_block *nb,
+			     enum fib_event_type event_type)
+{
+	struct trie *t = (struct trie *)tb->tb_data;
+	struct key_vector *l, *tp = t->kv;
+	t_key key = 0;
+
+	while ((l = leaf_walk_rcu(&tp, key)) != NULL) {
+		fib_leaf_notify(net, l, tb, nb, event_type);
+
+		key = l->key + 1;
+		/* stop in case of wrap around */
+		if (key < l->key)
+			break;
+	}
+}
+
+static void fib_notify(struct net *net, struct notifier_block *nb,
+		       enum fib_event_type event_type)
+{
+	unsigned int h;
+
+	for (h = 0; h < FIB_TABLE_HASHSZ; h++) {
+		struct hlist_head *head = &net->ipv4.fib_table_hash[h];
+		struct fib_table *tb;
+
+		hlist_for_each_entry_rcu(tb, head, tb_hlist)
+			fib_table_notify(net, tb, nb, event_type);
+	}
+}
+
 static void __trie_free_rcu(struct rcu_head *head)
 {
 	struct fib_table *tb = container_of(head, struct fib_table, rcu);
-- 
2.7.4
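
The consistency check described in the commit message can be modelled in
userspace. This is a sketch under stated assumptions, not the kernel code:
sketch_fib and its helpers are hypothetical, there is no RCU or RTNL here, and
the concurrent writer is simulated by a counter of changes that land while a
dump is in progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_fib {
	unsigned int seq;	/* bumped on every table change */
	int racing_changes;	/* simulated changes that hit mid-dump */
	int dumps;		/* how many dump passes were made */
	int flushes;		/* how many times the listener was flushed */
};

/* One dump pass; a simulated writer may change the table meanwhile. */
static void sketch_dump_once(struct sketch_fib *fib)
{
	fib->dumps++;
	if (fib->racing_changes > 0) {
		fib->racing_changes--;
		fib->seq++;
	}
}

/* Stand-in for the listener's callback that discards a stale dump. */
static void sketch_flush(struct sketch_fib *fib)
{
	fib->flushes++;
}

/* Retry until the sequence counter is unchanged across a whole pass. */
static bool sketch_notifier_dump(struct sketch_fib *fib, int max_retries)
{
	int retries = 0;

	assert(fib != NULL);
	do {
		unsigned int seq = fib->seq;

		sketch_dump_once(fib);
		if (seq == fib->seq)
			return true;
		sketch_flush(fib);
	} while (++retries < max_retries);

	return false;
}
```

With two racing changes and a limit of five, the third pass is the first
consistent one; with more racing changes than the limit allows, the function
gives up and reports failure, which a caller could map to an error such as
-EBUSY.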


* RE: [patch net / RFC] net: fec: increase frame size limitation to actually available buffer
From: Andy Duan @ 2016-11-30  2:37 UTC (permalink / raw)
  To: Nikita Yushchenko, David S. Miller, Troy Kisky, Andrew Lunn,
	Eric Nelson, Philippe Reynes, Johannes Berg,
	netdev@vger.kernel.org
  Cc: Chris Healy, Fabio Estevam, linux-kernel@vger.kernel.org
In-Reply-To: <1480444528-30054-1-git-send-email-nikita.yoush@cogentembedded.com>

From: Nikita Yushchenko <nikita.yoush@cogentembedded.com> Sent: Wednesday, November 30, 2016 2:35 AM
 >To: David S. Miller <davem@davemloft.net>; Andy Duan
 ><fugang.duan@nxp.com>; Troy Kisky <troy.kisky@boundarydevices.com>;
 >Andrew Lunn <andrew@lunn.ch>; Eric Nelson <eric@nelint.com>; Philippe
 >Reynes <tremyfr@gmail.com>; Johannes Berg <johannes@sipsolutions.net>;
 >netdev@vger.kernel.org
 >Cc: Chris Healy <cphealy@gmail.com>; Fabio Estevam
 ><fabio.estevam@nxp.com>; linux-kernel@vger.kernel.org; Nikita
 >Yushchenko <nikita.yoush@cogentembedded.com>
 >Subject: [patch net / RFC] net: fec: increase frame size limitation to actually
 >available buffer
 >
 >Fec driver uses Rx buffers of 2k, but programs hardware to limit incoming
 >frames to 1522 bytes. This raises issues when FEC device is used with DSA
 >(since DSA tag can make frame larger), and also disallows manual sending and
 >receiving larger frames.
 >
 >This patch removes the limitation, allowing Rx size up to entire buffer.
 >At the same time possible Tx size is increased as well, because hardware uses
 >the same register field to limit Rx and Tx.
 >
 >Signed-off-by: Nikita Yushchenko <nikita.yoush@cogentembedded.com>
 >---
 > drivers/net/ethernet/freescale/fec_main.c | 33 +++++++++++--------------------
 > 1 file changed, 12 insertions(+), 21 deletions(-)
 >
 >diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
 >index 73ac35780611..c09789a71024 100644
 >--- a/drivers/net/ethernet/freescale/fec_main.c
 >+++ b/drivers/net/ethernet/freescale/fec_main.c
 >@@ -171,30 +171,12 @@ MODULE_PARM_DESC(macaddr, "FEC Ethernet MAC address");
 > #endif
 > #endif /* CONFIG_M5272 */
 >
 >-/* The FEC stores dest/src/type/vlan, data, and checksum for receive packets.
 >- */
 >-#define PKT_MAXBUF_SIZE		1522
 >-#define PKT_MINBUF_SIZE		64
 >-#define PKT_MAXBLR_SIZE		1536
 >-
 > /* FEC receive acceleration */
 > #define FEC_RACC_IPDIS		(1 << 1)
 > #define FEC_RACC_PRODIS		(1 << 2)
 > #define FEC_RACC_SHIFT16	BIT(7)
 > #define FEC_RACC_OPTIONS	(FEC_RACC_IPDIS | FEC_RACC_PRODIS)
 >
 >-/*
 >- * The 5270/5271/5280/5282/532x RX control register also contains maximum frame
 >- * size bits. Other FEC hardware does not, so we need to take that into
 >- * account when setting it.
 >- */
 >-#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
 >-    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM)
 >-#define	OPT_FRAME_SIZE	(PKT_MAXBUF_SIZE << 16)
 >-#else
 >-#define	OPT_FRAME_SIZE	0
 >-#endif
 >-
 > /* FEC MII MMFR bits definition */
 > #define FEC_MMFR_ST		(1 << 30)
 > #define FEC_MMFR_OP_READ	(2 << 28)
 >@@ -847,7 +829,8 @@ static void fec_enet_enable_ring(struct net_device *ndev)
 > 	for (i = 0; i < fep->num_rx_queues; i++) {
 > 		rxq = fep->rx_queue[i];
 > 		writel(rxq->bd.dma, fep->hwp + FEC_R_DES_START(i));
 >-		writel(PKT_MAXBLR_SIZE, fep->hwp + FEC_R_BUFF_SIZE(i));
 >+		writel(FEC_ENET_RX_FRSIZE - fep->rx_align,
 >+		       fep->hwp + FEC_R_BUFF_SIZE(i));
 >
 > 		/* enable DMA1/2 */
 > 		if (i)
 >@@ -895,9 +878,17 @@ fec_restart(struct net_device *ndev)
 > 	struct fec_enet_private *fep = netdev_priv(ndev);
 > 	u32 val;
 > 	u32 temp_mac[2];
 >-	u32 rcntl = OPT_FRAME_SIZE | 0x04;
 >+	u32 rcntl = 0x04;
 > 	u32 ecntl = 0x2; /* ETHEREN */
 >
 >+	/* The 5270/5271/5280/5282/532x RX control register also contains
 >+	 * maximum frame * size bits. Other FEC hardware does not.
 >+	 */
 >+#if defined(CONFIG_M523x) || defined(CONFIG_M527x) || defined(CONFIG_M528x) || \
 >+    defined(CONFIG_M520x) || defined(CONFIG_M532x) || defined(CONFIG_ARM)
 >+	rcntl |= (FEC_ENET_RX_FRSIZE - fep->rx_align) << 16;
 >+#endif
 >+
 > 	/* Whack a reset.  We should wait for this.
 > 	 * For i.MX6SX SOC, enet use AXI bus, we use disable MAC
 > 	 * instead of reset MAC itself.
 >@@ -953,7 +944,7 @@ fec_restart(struct net_device *ndev)
 > 		else
 > 			val &= ~FEC_RACC_OPTIONS;
 > 		writel(val, fep->hwp + FEC_RACC);
 >-		writel(PKT_MAXBUF_SIZE, fep->hwp + FEC_FTRL);
 >+		writel(FEC_ENET_RX_FRSIZE - fep->rx_align, fep->hwp + FEC_FTRL);

The patch itself seems fine, except that MRBRn must be evenly divisible by 64.
But I think that is not necessary, since the driver doesn't support jumbo frames.

 > 	}
 > #endif
 >
 >--
 >2.1.4
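
If the divisible-by-64 constraint mentioned above does need to be honoured,
one option is to round the programmed size down to a multiple of 64. This is
a sketch only: sketch_round_down is a hypothetical helper, the 2048 in the
usage note comes from the "Rx buffers of 2k" statement in the patch
description, and the alignment allowance is a made-up example rather than the
driver's actual rx_align.

```c
#include <assert.h>

/* Round n down to a multiple of align; align must be a power of two. */
static unsigned int sketch_round_down(unsigned int n, unsigned int align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return n & ~(align - 1);
}
```

For example, a 2048-byte buffer minus a 15-byte allowance gives 2033, which
rounds down to 1984; 2048 itself is already a multiple of 64 and is left
unchanged.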


* [patch net-next v3 09/12] ipv4: fib: Add sysctl to limit number of FIB dump retries
From: Jiri Pirko @ 2016-11-30 10:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

When dumping the FIB tables in the next commit, the dump will be
considered invalid if notifications were sent in the FIB notification
chain mid-dump. In systems where routing changes are frequent, the dump
might need to be restarted multiple times.

Add sysctl to limit the number of FIB dump retries, thereby preventing
callers from looping for long periods of time.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 Documentation/networking/ip-sysctl.txt | 8 ++++++++
 include/net/netns/ipv4.h               | 1 +
 net/ipv4/fib_frontend.c                | 1 +
 net/ipv4/sysctl_net_ipv4.c             | 7 +++++++
 4 files changed, 17 insertions(+)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index 5af48dd..5182b23 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -73,6 +73,14 @@ fib_multipath_use_neigh - BOOLEAN
 	0 - disabled
 	1 - enabled
 
+fib_dump_max_retries - INTEGER
+	Maximum number of retries until the FIB dump is aborted. For a
+	given net namespace, a FIB dump is considered invalid if
+	notifications were sent in the FIB notification chain mid-dump.
+	The dump will be retried until it is successful or maximum
+	number of retries has been reached.
+	Default: 5
+
 route/max_size - INTEGER
 	Maximum number of routes allowed in the kernel.  Increase
 	this when using large numbers of interfaces and/or routes.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index f0cf5a1..71c4ca8 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -136,6 +136,7 @@ struct netns_ipv4 {
 	int sysctl_fib_multipath_use_neigh;
 #endif
 
+	int sysctl_fib_dump_max_retries;
 	unsigned int	fib_seq;	/* protected by rtnl_mutex */
 
 	atomic_t	rt_genid;
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index dbad5a1..43f7557 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1219,6 +1219,7 @@ static int __net_init ip_fib_net_init(struct net *net)
 	int err;
 	size_t size = sizeof(struct hlist_head) * FIB_TABLE_HASHSZ;
 
+	net->ipv4.sysctl_fib_dump_max_retries = 5;
 	net->ipv4.fib_seq = 0;
 
 	/* Avoid false sharing : Use at least a full cache line */
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 80bc36b..046147c 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -971,6 +971,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &one,
 	},
 #endif
+	{
+		.procname	= "fib_dump_max_retries",
+		.data		= &init_net.ipv4.sysctl_fib_dump_max_retries,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
-- 
2.7.4


* [patch net-next v3 08/12] ipv4: fib: Allow for consistent FIB dumping
From: Jiri Pirko @ 2016-11-30 10:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

The next patch will enable listeners of the FIB notification chain to
request a dump of the FIB tables. However, since RTNL isn't taken during
the dump, it's possible for the FIB tables to change mid-dump, which
will result in inconsistency between the listener's table and the
kernel's.

Allow listeners to know about changes that occurred mid-dump, by adding
a change sequence counter to each net namespace. The counter is
incremented just before a notification is sent in the FIB chain.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/netns/ipv4.h | 3 +++
 net/ipv4/fib_frontend.c  | 2 ++
 net/ipv4/fib_trie.c      | 1 +
 3 files changed, 6 insertions(+)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 7adf438..f0cf5a1 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -135,6 +135,9 @@ struct netns_ipv4 {
 #ifdef CONFIG_IP_ROUTE_MULTIPATH
 	int sysctl_fib_multipath_use_neigh;
 #endif
+
+	unsigned int	fib_seq;	/* protected by rtnl_mutex */
+
 	atomic_t	rt_genid;
 };
 #endif
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 121384b..dbad5a1 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -1219,6 +1219,8 @@ static int __net_init ip_fib_net_init(struct net *net)
 	int err;
 	size_t size = sizeof(struct hlist_head) * FIB_TABLE_HASHSZ;
 
+	net->ipv4.fib_seq = 0;
+
 	/* Avoid false sharing : Use at least a full cache line */
 	size = max_t(size_t, size, L1_CACHE_BYTES);
 
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 9bfce0d..2891356 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -101,6 +101,7 @@ EXPORT_SYMBOL(unregister_fib_notifier);
 int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
 		       struct fib_notifier_info *info)
 {
+	net->ipv4.fib_seq++;
 	info->net = net;
 	return atomic_notifier_call_chain(&fib_chain, event_type, info);
 }
-- 
2.7.4


* [patch net-next v3 07/12] ipv4: fib: Convert FIB notification chain to be atomic
From: Jiri Pirko @ 2016-11-30 10:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

In order not to hold RTNL for long periods of time we're going to dump
the FIB tables using RCU.

Convert the FIB notification chain to be atomic, as we can't block in
RCU critical sections.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 net/ipv4/fib_trie.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 026f309..9bfce0d 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -84,17 +84,17 @@
 #include <trace/events/fib.h>
 #include "fib_lookup.h"
 
-static BLOCKING_NOTIFIER_HEAD(fib_chain);
+static ATOMIC_NOTIFIER_HEAD(fib_chain);
 
 int register_fib_notifier(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_register(&fib_chain, nb);
+	return atomic_notifier_chain_register(&fib_chain, nb);
 }
 EXPORT_SYMBOL(register_fib_notifier);
 
 int unregister_fib_notifier(struct notifier_block *nb)
 {
-	return blocking_notifier_chain_unregister(&fib_chain, nb);
+	return atomic_notifier_chain_unregister(&fib_chain, nb);
 }
 EXPORT_SYMBOL(unregister_fib_notifier);
 
@@ -102,7 +102,7 @@ int call_fib_notifiers(struct net *net, enum fib_event_type event_type,
 		       struct fib_notifier_info *info)
 {
 	info->net = net;
-	return blocking_notifier_call_chain(&fib_chain, event_type, info);
+	return atomic_notifier_call_chain(&fib_chain, event_type, info);
 }
 
 static int call_fib_entry_notifiers(struct net *net,
-- 
2.7.4


* [patch net-next v3 06/12] rocker: Implement FIB offload in deferred work
From: Jiri Pirko @ 2016-11-30 10:09 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

Convert rocker to offload FIBs in deferred work in a similar fashion to
mlxsw, which was converted in the previous commits.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/rocker/rocker_main.c  | 58 +++++++++++++++++++++++++-----
 drivers/net/ethernet/rocker/rocker_ofdpa.c |  1 +
 2 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 424be96..914e9e1 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -2166,28 +2166,70 @@ static const struct switchdev_ops rocker_port_switchdev_ops = {
 	.switchdev_port_obj_dump	= rocker_port_obj_dump,
 };
 
-static int rocker_router_fib_event(struct notifier_block *nb,
-				   unsigned long event, void *ptr)
+struct rocker_fib_event_work {
+	struct work_struct work;
+	struct fib_entry_notifier_info fen_info;
+	struct rocker *rocker;
+	unsigned long event;
+};
+
+static void rocker_router_fib_event_work(struct work_struct *work)
 {
-	struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
-	struct fib_entry_notifier_info *fen_info = ptr;
+	struct rocker_fib_event_work *fib_work =
+		container_of(work, struct rocker_fib_event_work, work);
+	struct rocker *rocker = fib_work->rocker;
 	int err;
 
-	switch (event) {
+	/* Protect internal structures from changes */
+	rtnl_lock();
+	switch (fib_work->event) {
 	case FIB_EVENT_ENTRY_ADD:
-		err = rocker_world_fib4_add(rocker, fen_info);
+		err = rocker_world_fib4_add(rocker, &fib_work->fen_info);
 		if (err)
 			rocker_world_fib4_abort(rocker);
-		else
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_ENTRY_DEL:
-		rocker_world_fib4_del(rocker, fen_info);
+		rocker_world_fib4_del(rocker, &fib_work->fen_info);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_RULE_ADD: /* fall through */
 	case FIB_EVENT_RULE_DEL:
 		rocker_world_fib4_abort(rocker);
 		break;
 	}
+	rtnl_unlock();
+	kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int rocker_router_fib_event(struct notifier_block *nb,
+				   unsigned long event, void *ptr)
+{
+	struct rocker *rocker = container_of(nb, struct rocker, fib_nb);
+	struct rocker_fib_event_work *fib_work;
+
+	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+	if (WARN_ON(!fib_work))
+		return NOTIFY_BAD;
+
+	INIT_WORK(&fib_work->work, rocker_router_fib_event_work);
+	fib_work->rocker = rocker;
+	fib_work->event = event;
+
+	switch (event) {
+	case FIB_EVENT_ENTRY_ADD: /* fall through */
+	case FIB_EVENT_ENTRY_DEL:
+		memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+		/* Take reference on fib_info to prevent it from being
+		 * freed while work is queued. Release it afterwards.
+		 */
+		fib_info_hold(fib_work->fen_info.fi);
+		break;
+	}
+
+	queue_work(rocker->rocker_owq, &fib_work->work);
+
 	return NOTIFY_DONE;
 }
 
diff --git a/drivers/net/ethernet/rocker/rocker_ofdpa.c b/drivers/net/ethernet/rocker/rocker_ofdpa.c
index 4ca4613..7cd76b6 100644
--- a/drivers/net/ethernet/rocker/rocker_ofdpa.c
+++ b/drivers/net/ethernet/rocker/rocker_ofdpa.c
@@ -2516,6 +2516,7 @@ static void ofdpa_fini(struct rocker *rocker)
 	int bkt;
 
 	del_timer_sync(&ofdpa->fdb_cleanup_timer);
+	flush_workqueue(rocker->rocker_owq);
 
 	spin_lock_irqsave(&ofdpa->flow_tbl_lock, flags);
 	hash_for_each_safe(ofdpa->flow_tbl, bkt, tmp, flow_entry, entry)
-- 
2.7.4
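
The work handler in this patch only receives a pointer to the embedded
work item and uses container_of() to get back to the per-event context.
Below is a minimal userspace sketch of that idiom. The types fake_work
and fib_event_ctx and the helper run_work() are hypothetical stand-ins
for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel macro: step back from a pointer to
 * a member to the structure that embeds it.
 */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical stand-in for struct work_struct. */
struct fake_work {
	void (*func)(struct fake_work *work);
};

/* Hypothetical per-event context, shaped like rocker_fib_event_work. */
struct fib_event_ctx {
	struct fake_work work;
	unsigned long event;
	int handled;
};

/* The handler is handed only the embedded member, as a workqueue would. */
static void fib_event_handler(struct fake_work *work)
{
	struct fib_event_ctx *ctx =
		container_of(work, struct fib_event_ctx, work);

	ctx->handled = 1;
}

/* Run a queued item the way a worker would: through its callback. */
static void run_work(struct fake_work *work)
{
	assert(work->func);
	work->func(work);
}
```

Because the context is recovered from the member's address, the same
allocation can carry the copied notifier data and be freed by the
handler once it is done.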

^ permalink raw reply related

* [patch net-next v3 05/12] rocker: Create an ordered workqueue for FIB offload
From: Jiri Pirko @ 2016-11-30 10:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

As explained in the previous commits, we need to process FIB entries
addition / deletion events in FIFO order or otherwise we can have a
mismatch between the kernel's FIB table and the device's.

Create an ordered workqueue for rocker to which these work items will be
submitted.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/rocker/rocker.h      |  1 +
 drivers/net/ethernet/rocker/rocker_main.c | 11 +++++++++++
 2 files changed, 12 insertions(+)

diff --git a/drivers/net/ethernet/rocker/rocker.h b/drivers/net/ethernet/rocker/rocker.h
index 2eb9b49..ee9675d 100644
--- a/drivers/net/ethernet/rocker/rocker.h
+++ b/drivers/net/ethernet/rocker/rocker.h
@@ -72,6 +72,7 @@ struct rocker {
 	struct rocker_dma_ring_info event_ring;
 	struct notifier_block fib_nb;
 	struct rocker_world_ops *wops;
+	struct workqueue_struct *rocker_owq;
 	void *wpriv;
 };
 
diff --git a/drivers/net/ethernet/rocker/rocker_main.c b/drivers/net/ethernet/rocker/rocker_main.c
index 67df4cf..424be96 100644
--- a/drivers/net/ethernet/rocker/rocker_main.c
+++ b/drivers/net/ethernet/rocker/rocker_main.c
@@ -28,6 +28,7 @@
 #include <linux/if_bridge.h>
 #include <linux/bitops.h>
 #include <linux/ctype.h>
+#include <linux/workqueue.h>
 #include <net/switchdev.h>
 #include <net/rtnetlink.h>
 #include <net/netevent.h>
@@ -2754,6 +2755,13 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		goto err_request_event_irq;
 	}
 
+	rocker->rocker_owq = alloc_ordered_workqueue(rocker_driver_name,
+						     WQ_MEM_RECLAIM);
+	if (!rocker->rocker_owq) {
+		err = -ENOMEM;
+		goto err_alloc_ordered_workqueue;
+	}
+
 	rocker->hw.id = rocker_read64(rocker, SWITCH_ID);
 
 	err = rocker_probe_ports(rocker);
@@ -2771,6 +2779,8 @@ static int rocker_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	return 0;
 
 err_probe_ports:
+	destroy_workqueue(rocker->rocker_owq);
+err_alloc_ordered_workqueue:
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
 err_request_event_irq:
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
@@ -2799,6 +2809,7 @@ static void rocker_remove(struct pci_dev *pdev)
 	unregister_fib_notifier(&rocker->fib_nb);
 	rocker_write32(rocker, CONTROL, ROCKER_CONTROL_RESET);
 	rocker_remove_ports(rocker);
+	destroy_workqueue(rocker->rocker_owq);
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_EVENT), rocker);
 	free_irq(rocker_msix_vector(rocker, ROCKER_MSIX_VEC_CMD), rocker);
 	rocker_dma_rings_fini(rocker);
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v3 04/12] mlxsw: spectrum_router: Implement FIB offload in deferred work
From: Jiri Pirko @ 2016-11-30 10:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

FIB offload is currently done in process context with RTNL held, but
we're about to dump the FIB tables in RCU critical section, so we can no
longer sleep.

Instead, defer the operation to process context using deferred work. Make
sure fib info isn't freed while the work is queued by taking a reference
on it and releasing it after the operation is done.

Deferring the operation is valid because the upper layers always assume
the operation was successful. If it's not, then the driver-specific
abort mechanism is called and all routed traffic is directed to slow
path.

The work items are submitted to an ordered workqueue to prevent a
mismatch between the kernel's FIB table and the device's.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  | 72 +++++++++++++++++++---
 1 file changed, 62 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
index 683f045..14bed1d 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum_router.c
@@ -593,6 +593,14 @@ static void mlxsw_sp_router_fib_flush(struct mlxsw_sp *mlxsw_sp);
 
 static void mlxsw_sp_vrs_fini(struct mlxsw_sp *mlxsw_sp)
 {
+	/* At this stage we're guaranteed not to have new incoming
+	 * FIB notifications and the work queue is free from FIBs
+	 * sitting on top of mlxsw netdevs. However, we can still
+	 * have other FIBs queued. Flush the queue before flushing
+	 * the device's tables. No need for locks, as we're the only
+	 * writer.
+	 */
+	mlxsw_core_flush_owq();
 	mlxsw_sp_router_fib_flush(mlxsw_sp);
 	kfree(mlxsw_sp->router.vrs);
 }
@@ -1948,30 +1956,74 @@ static void __mlxsw_sp_router_fini(struct mlxsw_sp *mlxsw_sp)
 	kfree(mlxsw_sp->rifs);
 }
 
-static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
-				     unsigned long event, void *ptr)
+struct mlxsw_sp_fib_event_work {
+	struct delayed_work dw;
+	struct fib_entry_notifier_info fen_info;
+	struct mlxsw_sp *mlxsw_sp;
+	unsigned long event;
+};
+
+static void mlxsw_sp_router_fib_event_work(struct work_struct *work)
 {
-	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
-	struct fib_entry_notifier_info *fen_info = ptr;
+	struct mlxsw_sp_fib_event_work *fib_work =
+		container_of(work, struct mlxsw_sp_fib_event_work, dw.work);
+	struct mlxsw_sp *mlxsw_sp = fib_work->mlxsw_sp;
 	int err;
 
-	if (!net_eq(fen_info->info.net, &init_net))
-		return NOTIFY_DONE;
-
-	switch (event) {
+	/* Protect internal structures from changes */
+	rtnl_lock();
+	switch (fib_work->event) {
 	case FIB_EVENT_ENTRY_ADD:
-		err = mlxsw_sp_router_fib4_add(mlxsw_sp, fen_info);
+		err = mlxsw_sp_router_fib4_add(mlxsw_sp, &fib_work->fen_info);
 		if (err)
 			mlxsw_sp_router_fib4_abort(mlxsw_sp);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_ENTRY_DEL:
-		mlxsw_sp_router_fib4_del(mlxsw_sp, fen_info);
+		mlxsw_sp_router_fib4_del(mlxsw_sp, &fib_work->fen_info);
+		fib_info_put(fib_work->fen_info.fi);
 		break;
 	case FIB_EVENT_RULE_ADD: /* fall through */
 	case FIB_EVENT_RULE_DEL:
 		mlxsw_sp_router_fib4_abort(mlxsw_sp);
 		break;
 	}
+	rtnl_unlock();
+	kfree(fib_work);
+}
+
+/* Called with rcu_read_lock() */
+static int mlxsw_sp_router_fib_event(struct notifier_block *nb,
+				     unsigned long event, void *ptr)
+{
+	struct mlxsw_sp *mlxsw_sp = container_of(nb, struct mlxsw_sp, fib_nb);
+	struct mlxsw_sp_fib_event_work *fib_work;
+	struct fib_notifier_info *info = ptr;
+
+	if (!net_eq(info->net, &init_net))
+		return NOTIFY_DONE;
+
+	fib_work = kzalloc(sizeof(*fib_work), GFP_ATOMIC);
+	if (WARN_ON(!fib_work))
+		return NOTIFY_BAD;
+
+	INIT_DELAYED_WORK(&fib_work->dw, mlxsw_sp_router_fib_event_work);
+	fib_work->mlxsw_sp = mlxsw_sp;
+	fib_work->event = event;
+
+	switch (event) {
+	case FIB_EVENT_ENTRY_ADD: /* fall through */
+	case FIB_EVENT_ENTRY_DEL:
+		memcpy(&fib_work->fen_info, ptr, sizeof(fib_work->fen_info));
+		/* Take reference on fib_info to prevent it from being
+		 * freed while work is queued. Release it afterwards.
+		 */
+		fib_info_hold(fib_work->fen_info.fi);
+		break;
+	}
+
+	mlxsw_core_schedule_odw(&fib_work->dw, 0);
+
 	return NOTIFY_DONE;
 }
 
-- 
2.7.4
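
The pattern described in this commit message (copy the notifier data in
atomic context, hold a reference, queue the work, then process and
release in order) can be sketched in plain userspace C. This is a
single-threaded model only: route_info, entry_info, ordered_queue and
the functions below are hypothetical names, and a linked list stands in
for the ordered workqueue.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for struct fib_info: just a reference count. */
struct route_info {
	int refcnt;
};

/* Hypothetical stand-in for struct fib_entry_notifier_info. */
struct entry_info {
	unsigned int dst;
	struct route_info *ri;
};

struct event_work {
	struct event_work *next;
	struct entry_info info;	/* private copy of the caller's data */
	unsigned long event;
};

/* Single-consumer FIFO modelling an ordered workqueue. */
struct ordered_queue {
	struct event_work *head;
	struct event_work **tail;
};

static void queue_init(struct ordered_queue *q)
{
	q->head = NULL;
	q->tail = &q->head;
}

/* Notifier side: must not sleep, so only copy, hold and enqueue. */
static int notify_event(struct ordered_queue *q, unsigned long event,
			const struct entry_info *info)
{
	struct event_work *w = calloc(1, sizeof(*w));

	if (!w)
		return -1;
	w->event = event;
	memcpy(&w->info, info, sizeof(w->info));
	w->info.ri->refcnt++;		/* hold until the work has run */
	*q->tail = w;
	q->tail = &w->next;
	return 0;
}

/* Worker side: handle items strictly in submission order, recording the
 * destinations seen in dst_log (up to max entries). Returns item count.
 */
static size_t drain_queue(struct ordered_queue *q, unsigned int *dst_log,
			  size_t max)
{
	size_t n = 0;

	while (q->head) {
		struct event_work *w = q->head;

		q->head = w->next;
		if (n < max)
			dst_log[n] = w->info.dst;
		n++;
		w->info.ri->refcnt--;	/* release after the operation */
		free(w);
	}
	q->tail = &q->head;
	return n;
}
```

The reference taken in notify_event() is what keeps the route object
valid between queueing and processing; the FIFO is what keeps an add
followed by a delete from being applied in the opposite order.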

^ permalink raw reply related

* [patch net-next v3 03/12] mlxsw: core: Create an ordered workqueue for FIB offload
From: Jiri Pirko @ 2016-11-30 10:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

We're going to start processing FIB entries addition / deletion events
in deferred work. These work items must be processed in the order they
were submitted or otherwise we can have differences between the kernel's
FIB table and the device's.

Solve this by creating an ordered workqueue to which these work items
will be submitted. Note that we can't simply convert the current
workqueue to be ordered, as EMAD retransmissions are also processed in
deferred work.

Later on, we can migrate other work items to this workqueue, such as FDB
notification processing and nexthop resolution, since they all take the
same lock anyway.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/core.c | 22 ++++++++++++++++++++++
 drivers/net/ethernet/mellanox/mlxsw/core.h |  2 ++
 2 files changed, 24 insertions(+)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.c b/drivers/net/ethernet/mellanox/mlxsw/core.c
index 4dc028b..57a9884 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.c
@@ -77,6 +77,7 @@ static const char mlxsw_core_driver_name[] = "mlxsw_core";
 static struct dentry *mlxsw_core_dbg_root;
 
 static struct workqueue_struct *mlxsw_wq;
+static struct workqueue_struct *mlxsw_owq;
 
 struct mlxsw_core_pcpu_stats {
 	u64			trap_rx_packets[MLXSW_TRAP_ID_MAX];
@@ -1900,6 +1901,18 @@ int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay)
 }
 EXPORT_SYMBOL(mlxsw_core_schedule_dw);
 
+int mlxsw_core_schedule_odw(struct delayed_work *dwork, unsigned long delay)
+{
+	return queue_delayed_work(mlxsw_owq, dwork, delay);
+}
+EXPORT_SYMBOL(mlxsw_core_schedule_odw);
+
+void mlxsw_core_flush_owq(void)
+{
+	flush_workqueue(mlxsw_owq);
+}
+EXPORT_SYMBOL(mlxsw_core_flush_owq);
+
 static int __init mlxsw_core_module_init(void)
 {
 	int err;
@@ -1907,6 +1920,12 @@ static int __init mlxsw_core_module_init(void)
 	mlxsw_wq = alloc_workqueue(mlxsw_core_driver_name, WQ_MEM_RECLAIM, 0);
 	if (!mlxsw_wq)
 		return -ENOMEM;
+	mlxsw_owq = alloc_ordered_workqueue("%s_ordered", WQ_MEM_RECLAIM,
+					    mlxsw_core_driver_name);
+	if (!mlxsw_owq) {
+		err = -ENOMEM;
+		goto err_alloc_ordered_workqueue;
+	}
 	mlxsw_core_dbg_root = debugfs_create_dir(mlxsw_core_driver_name, NULL);
 	if (!mlxsw_core_dbg_root) {
 		err = -ENOMEM;
@@ -1915,6 +1934,8 @@ static int __init mlxsw_core_module_init(void)
 	return 0;
 
 err_debugfs_create_dir:
+	destroy_workqueue(mlxsw_owq);
+err_alloc_ordered_workqueue:
 	destroy_workqueue(mlxsw_wq);
 	return err;
 }
@@ -1922,6 +1943,7 @@ static int __init mlxsw_core_module_init(void)
 static void __exit mlxsw_core_module_exit(void)
 {
 	debugfs_remove_recursive(mlxsw_core_dbg_root);
+	destroy_workqueue(mlxsw_owq);
 	destroy_workqueue(mlxsw_wq);
 }
 
diff --git a/drivers/net/ethernet/mellanox/mlxsw/core.h b/drivers/net/ethernet/mellanox/mlxsw/core.h
index e856b49..a7f94fb 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/core.h
+++ b/drivers/net/ethernet/mellanox/mlxsw/core.h
@@ -207,6 +207,8 @@ enum devlink_port_type mlxsw_core_port_type_get(struct mlxsw_core *mlxsw_core,
 						u8 local_port);
 
 int mlxsw_core_schedule_dw(struct delayed_work *dwork, unsigned long delay);
+int mlxsw_core_schedule_odw(struct delayed_work *dwork, unsigned long delay);
+void mlxsw_core_flush_owq(void);
 
 #define MLXSW_CONFIG_PROFILE_SWID_COUNT 8
 
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v3 02/12] ipv4: fib: Add fib_info_hold() helper
From: Jiri Pirko @ 2016-11-30 10:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

As explained in the previous commit, modules are going to need to take a
reference on fib info and then drop it using fib_info_put().

Add the fib_info_hold() helper to make the code more readable and also
symmetric with fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 include/net/ip_fib.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index f390c3b..6c67b93 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -397,6 +397,11 @@ static inline void fib_combine_itag(u32 *itag, const struct fib_result *res)
 
 void free_fib_info(struct fib_info *fi);
 
+static inline void fib_info_hold(struct fib_info *fi)
+{
+	atomic_inc(&fi->fib_clntref);
+}
+
 static inline void fib_info_put(struct fib_info *fi)
 {
 	if (atomic_dec_and_test(&fi->fib_clntref))
-- 
2.7.4
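
The hold/put pairing added here follows the usual reference-counting
pattern: every hold is matched by one put, and the put that drops the
last reference frees the object. A minimal userspace sketch with C11
atomics follows. struct obj and its helpers are hypothetical names, and
unlike free_fib_info() the sketch frees immediately rather than after an
RCU grace period.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical object; stands in for struct fib_info. */
struct obj {
	atomic_int refcnt;
	int *freed_flag;	/* lets the caller observe the release */
};

static struct obj *obj_alloc(int *freed_flag)
{
	struct obj *o = malloc(sizeof(*o));

	if (!o)
		return NULL;
	atomic_init(&o->refcnt, 1);	/* creator owns the first reference */
	o->freed_flag = freed_flag;
	return o;
}

static void obj_hold(struct obj *o)
{
	atomic_fetch_add(&o->refcnt, 1);
}

static void obj_put(struct obj *o)
{
	/* fetch_sub returns the old value: 1 means this was the last ref. */
	if (atomic_fetch_sub(&o->refcnt, 1) == 1) {
		*o->freed_flag = 1;
		free(o);
	}
}
```

A caller that hands the object to deferred work would call obj_hold()
before queueing and obj_put() once the work has finished with it.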

^ permalink raw reply related

* [patch net-next v3 01/12] ipv4: fib: Export free_fib_info()
From: Jiri Pirko @ 2016-11-30 10:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber
In-Reply-To: <1480500546-2544-1-git-send-email-jiri@resnulli.us>

From: Ido Schimmel <idosch@mellanox.com>

The FIB notification chain is going to be converted to an atomic chain,
which means switchdev drivers will have to offload FIB entries in
deferred work, as hardware operations entail sleeping.

However, while the work is queued fib info might be freed, so a
reference must be taken. To release the reference (and potentially free
the fib info) fib_info_put() will be called, which in turn calls
free_fib_info().

Export free_fib_info() so that modules will be able to invoke
fib_info_put().

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
---
 net/ipv4/fib_semantics.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 388d3e2..c1bc1e9 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -234,6 +234,7 @@ void free_fib_info(struct fib_info *fi)
 #endif
 	call_rcu(&fi->rcu, free_fib_info_rcu);
 }
+EXPORT_SYMBOL_GPL(free_fib_info);

 void fib_release_info(struct fib_info *fi)
 {
-- 
2.7.4

^ permalink raw reply related

* [patch net-next v3 00/12] ipv4: fib: Allow modules to dump FIB tables
From: Jiri Pirko @ 2016-11-30 10:08 UTC (permalink / raw)
  To: netdev
  Cc: davem, idosch, eladr, yotamg, nogahf, arkadis, ogerlitz, roopa,
	dsa, nikolay, andy, vivien.didelot, andrew, f.fainelli,
	alexander.h.duyck, hannes, kaber

From: Jiri Pirko <jiri@mellanox.com>

Ido says:

In kernel 4.9 the switchdev-specific FIB offload mechanism was replaced
by a new FIB notification chain to which modules could register in order
to be notified about the addition and deletion of FIB entries. The
motivation for this change was that switchdev drivers need to be able to
reflect the entire FIB table and not only FIBs configured on top of the
port netdevs themselves. This is useful in case of in-band management.

The fundamental problem with this approach is that upon registration
listeners lose all the information previously sent in the chain and
thus have an incomplete view of the FIB tables, which can result in
packet loss. This patchset fixes that by introducing a new API to dump
the FIB tables.

The entire dump process is done under RCU and thus the FIB notification
chain is converted to be atomic. The listeners are modified accordingly.
This is done in the first seven patches.

The eighth and ninth patches add a change sequence counter to ensure the
integrity of the FIB dump and a sysctl to set the number of retries,
respectively. The tenth patch finally introduces the FIB dump itself.

The last two patches modify current listeners of the FIB notification
chain to invoke the dump during their init.

v2->v3:
- Add sysctl to set the number of FIB dump retries
  (Hannes Frederic Sowa).
- Read the sequence counter under RTNL to ensure synchronization
  between the dump process and other processes changing the routing
  tables (Hannes Frederic Sowa).
- Pass a callback to the dump function to be executed prior to a retry.
- Limit the dump to a single net namespace.

v1->v2:
- Add a sequence counter to ensure the integrity of the FIB dump
  (David S. Miller, Hannes Frederic Sowa).
- Protect notifications from re-ordering in listeners by using an
  ordered workqueue (Hannes Frederic Sowa).
- Introduce fib_info_hold() (Jiri Pirko).
- Relieve rocker from the need to invoke the FIB dump by registering
  to the FIB notification chain prior to ports creation.

Ido Schimmel (12):
  ipv4: fib: Export free_fib_info()
  ipv4: fib: Add fib_info_hold() helper
  mlxsw: core: Create an ordered workqueue for FIB offload
  mlxsw: spectrum_router: Implement FIB offload in deferred work
  rocker: Create an ordered workqueue for FIB offload
  rocker: Implement FIB offload in deferred work
  ipv4: fib: Convert FIB notification chain to be atomic
  ipv4: fib: Allow for consistent FIB dumping
  ipv4: fib: Add sysctl to limit number of FIB dump retries
  ipv4: fib: Add an API to request a FIB dump
  mlxsw: spectrum_router: Request a dump of FIB tables during init
  rocker: Register FIB notifier before creating ports

 Documentation/networking/ip-sysctl.txt             |   8 ++
 drivers/net/ethernet/mellanox/mlxsw/core.c         |  22 +++
 drivers/net/ethernet/mellanox/mlxsw/core.h         |   2 +
 .../net/ethernet/mellanox/mlxsw/spectrum_router.c  |  95 +++++++++++--
 drivers/net/ethernet/rocker/rocker.h               |   1 +
 drivers/net/ethernet/rocker/rocker_main.c          |  78 +++++++++--
 drivers/net/ethernet/rocker/rocker_ofdpa.c         |   1 +
 include/net/ip_fib.h                               |   9 ++
 include/net/netns/ipv4.h                           |   4 +
 net/ipv4/fib_frontend.c                            |   3 +
 net/ipv4/fib_semantics.c                           |   1 +
 net/ipv4/fib_trie.c                                | 147 ++++++++++++++++++++-
 net/ipv4/sysctl_net_ipv4.c                         |   7 +
 13 files changed, 352 insertions(+), 26 deletions(-)

-- 
2.7.4
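
The change sequence counter and bounded retries described above can be
illustrated with a small userspace sketch. It is not taken from the
patches themselves: struct table, dump_consistent() and racy_dump() are
hypothetical, and the model is single-threaded, with the "writer"
simulated inside the dump callback.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical table: a change counter bumped on every modification. */
struct table {
	unsigned int change_seq;
	int nentries;
};

/* Hypothetical dump callback; returns the number of entries it saw. */
typedef int (*dump_fn)(struct table *t, void *arg);

/* Dump the table and accept the result only if the change counter is
 * the same before and after. Tries once plus up to max_retries retries.
 */
static bool dump_consistent(struct table *t, dump_fn dump, void *arg,
			    int max_retries, int *seen)
{
	int attempt;

	for (attempt = 0; attempt <= max_retries; attempt++) {
		unsigned int seq = t->change_seq;

		*seen = dump(t, arg);
		if (t->change_seq == seq)
			return true;	/* nothing changed during the dump */
	}
	return false;			/* gave up: table kept changing */
}

/* Demo callback: a writer interferes during the first *arg calls. */
static int racy_dump(struct table *t, void *arg)
{
	int *interfere = arg;
	int seen = t->nentries;

	if (*interfere > 0) {
		(*interfere)--;
		t->nentries++;
		t->change_seq++;	/* table changed mid-dump */
	}
	return seen;
}
```

Bounding the number of retries keeps a listener from looping forever on
a table that changes continuously; what to do after giving up is left to
the caller in this sketch.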

^ permalink raw reply

* Re: [PATCH 4/6] net: ethernet: ti: cpts: add ptp pps support
From: Richard Cochran @ 2016-11-30 10:05 UTC (permalink / raw)
  To: Grygorii Strashko
  Cc: David S. Miller, netdev, Mugunthan V N, Sekhar Nori, linux-kernel,
	linux-omap, Rob Herring, devicetree, Murali Karicheri,
	Wingman Kwok
In-Reply-To: <20161128230428.6872-5-grygorii.strashko@ti.com>

On Mon, Nov 28, 2016 at 05:04:26PM -0600, Grygorii Strashko wrote:
> The TS_COMP output in the CPSW CPTS module is asserted for
> ts_comp_length[15:0] RCLK periods when the time_stamp value compares
> with the ts_comp_val[31:0] and the length value is non-zero. The
> TS_COMP pulse edge occurs three RCLK periods after the values
> compare. A timestamp compare event is pushed into the event FIFO when
> TS_COMP is asserted.
> 
> This patch adds support of Pulse-Per-Second (PPS) by using the
> timestamp compare output. The CPTS driver adds one second of counter
> value to the ts_comp_val register after each assertion of the TS_COMP
> output. The TS_COMP pulse polarity and width are configurable in DT.

I really dislike this patch.  You go through contortions to get from
the timecounter back to the raw HW counter.  That is rather ugly.

Can you adjust the frequency of the keystone devices in hardware?  If
so, then please implement it, and just disable PPS for the CPSW.

The only reason I used the timecounter for frequency adjustment was
because the am335x HW is broken.  But this shouldn't hold back other
newer HW without the same silicon flaws.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH net-next 1/7] net/mlx5e: Implement Fragmented Work Queue (WQ)
From: Tariq Toukan @ 2016-11-30 10:02 UTC (permalink / raw)
  To: Eric Dumazet, Saeed Mahameed
  Cc: David S. Miller, netdev, Tariq Toukan, Or Gerlitz, Roi Dayan,
	Sebastian Ott
In-Reply-To: <1480462282.18162.161.camel@edumazet-glaptop3.roam.corp.google.com>


On 30/11/2016 1:31 AM, Eric Dumazet wrote:
> On Wed, 2016-11-30 at 00:19 +0200, Saeed Mahameed wrote:
>> From: Tariq Toukan <tariqt@mellanox.com>
>>
>> Add new type of struct mlx5_frag_buf which is used to allocate fragmented
>> buffers rather than contiguous, and make the Completion Queues (CQs) use
>> it as they are big (default of 2MB per CQ in Striding RQ).
>>
>> This fixes the failures of type:
>> "mlx5e_open_locked: mlx5e_open_channels failed, -12"
>> due to dma_zalloc_coherent insufficient contiguous coherent memory to
>> satisfy the driver's request when the user tries to setup more or larger
>> rings.
>>
>> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
>> Reported-by: Sebastian Ott <sebott@linux.vnet.ibm.com>
>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>>   drivers/net/ethernet/mellanox/mlx5/core/alloc.c   | 66 +++++++++++++++++++++++
>>   drivers/net/ethernet/mellanox/mlx5/core/en.h      |  2 +-
>>   drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 10 ++--
>>   drivers/net/ethernet/mellanox/mlx5/core/wq.c      | 26 ++++++---
>>   drivers/net/ethernet/mellanox/mlx5/core/wq.h      | 18 +++++--
>>   include/linux/mlx5/driver.h                       | 11 ++++
>>   6 files changed, 116 insertions(+), 17 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/alloc.c b/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
>> index 2c6e3c7..bc8357d 100644
>> --- a/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
>> +++ b/drivers/net/ethernet/mellanox/mlx5/core/alloc.c
>> @@ -106,6 +106,63 @@ void mlx5_buf_free(struct mlx5_core_dev *dev, struct mlx5_buf *buf)
>>   }
>>   EXPORT_SYMBOL_GPL(mlx5_buf_free);
>>   
>> +int mlx5_frag_buf_alloc_node(struct mlx5_core_dev *dev, int size,
>> +			     struct mlx5_frag_buf *buf, int node)
>> +{
>> +	int i;
>> +
>> +	buf->size = size;
>> +	buf->npages = 1 << get_order(size);
>> +	buf->page_shift = PAGE_SHIFT;
>> +	buf->frags = kcalloc(buf->npages, sizeof(struct mlx5_buf_list),
>> +			     GFP_KERNEL);
>> +	if (!buf->frags)
>> +		goto err_out;
>> +
>> +	for (i = 0; i < buf->npages; i++) {
>> +		struct mlx5_buf_list *frag = &buf->frags[i];
>> +		int frag_sz = min_t(int, size, PAGE_SIZE);
>> +
>> +		frag->buf = mlx5_dma_zalloc_coherent_node(dev, frag_sz,
>> +							  &frag->map, node);
>> +		if (!frag->buf)
>> +			goto err_free_buf;
>> +		if (frag->map & ((1 << buf->page_shift) - 1)) {
>> +			dma_free_coherent(&dev->pdev->dev, frag_sz,
>> +					  buf->frags[i].buf, buf->frags[i].map);
> There is a bug if this happens with i = 0
>
>> +			mlx5_core_warn(dev, "unexpected map alignment: 0x%p, page_shift=%d\n",
>> +				       (void *)frag->map, buf->page_shift);
>> +			goto err_free_buf;
>> +		}
>> +		size -= frag_sz;
>> +	}
>> +
>> +	return 0;
>> +
>> +err_free_buf:
>> +	while (--i)
> Because this loop will be done about 2^32 times.
Right. I'll fix this.

Thanks,
Tariq.
>
>> +		dma_free_coherent(&dev->pdev->dev, PAGE_SIZE, buf->frags[i].buf,
>> +				  buf->frags[i].map);
>> +	kfree(buf->frags);
>> +err_out:
>> +	return -ENOMEM;
>> +}
>
>
>
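
For reference, the unwind problem pointed out above comes from the
pre-decrement: with i == 0, "while (--i)" first makes i negative, which
is non-zero, so the loop does not stop; with i > 0 it stops before
freeing index 0. A common fix is "while (i--)". The sketch below uses
hypothetical userspace helpers (frag_alloc, frags_alloc, frags_free)
rather than the mlx5 functions.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator: fails for index fails_at (never if negative). */
static void *frag_alloc(int idx, int fails_at)
{
	return idx == fails_at ? NULL : malloc(16);
}

/* Allocate n fragments; on failure free exactly those already allocated.
 * Returns 0 on success, -1 on failure; *freed counts the unwound frees.
 */
static int frags_alloc(void **frags, int n, int fails_at, int *freed)
{
	int i;

	*freed = 0;
	for (i = 0; i < n; i++) {
		frags[i] = frag_alloc(i, fails_at);
		if (!frags[i])
			goto err_free;
	}
	return 0;

err_free:
	/* i is the failed slot. "while (i--)" visits i-1 down to 0 and
	 * does nothing at all when i == 0.
	 */
	while (i--) {
		free(frags[i]);
		frags[i] = NULL;
		(*freed)++;
	}
	return -1;
}

/* Release a fully allocated array. */
static void frags_free(void **frags, int n)
{
	while (n--) {
		free(frags[n]);
		frags[n] = NULL;
	}
}
```

A failure at index 0 frees nothing, and a failure at index 2 frees
indices 1 and 0, which is the behaviour the error path needs.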

^ permalink raw reply
