Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH bpf] bpf: fix wrong helper enablement in cgroup local storage
From: Roman Gushchin @ 2018-10-26 22:57 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: ast@kernel.org, netdev@vger.kernel.org
In-Reply-To: <20181026224902.12455-1-daniel@iogearbox.net>

On Sat, Oct 27, 2018 at 12:49:02AM +0200, Daniel Borkmann wrote:
> Commit cd3394317653 ("bpf: introduce the bpf_get_local_storage()
> helper function") enabled the bpf_get_local_storage() helper also
> for BPF program types where it does not make sense to use them.
> 
> They have been added both in sk_skb_func_proto() and sk_msg_func_proto()
> even though both program types are not invoked in combination with
> cgroups, and neither through BPF_PROG_RUN_ARRAY(). In the latter the
> bpf_cgroup_storage_set() is set shortly before BPF program invocation.
> 
> Later, the helper bpf_get_local_storage() retrieves this prior set
> up per-cpu pointer and hands the buffer to the BPF program. The map
> argument in there solely retrieves the enum bpf_cgroup_storage_type
> from a local storage map associated with the program and based on the
> type returns either the global or per-cpu storage. However, there
> is no specific association between the program's map and the actual
> content in bpf_cgroup_storage[].
> 
> Meaning, any BPF program that would have been properly run from the
> cgroup side through BPF_PROG_RUN_ARRAY() where bpf_cgroup_storage_set()
> was performed, and that is later unloaded such that prog / maps are
> teared down will cause a use after free if that pointer is retrieved
> from programs that are not run through BPF_PROG_RUN_ARRAY() but have
> the cgroup local storage helper enabled in their func proto.
> 
> Lets just remove it from the two sock_map program types to fix it.
> Auditing through the types where this helper is enabled, it appears
> that these are the only ones where it was mistakenly allowed.
> 
> Fixes: cd3394317653 ("bpf: introduce the bpf_get_local_storage() helper function")
> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
> Cc: Roman Gushchin <guro@fb.com>
> Acked-by: John Fastabend <john.fastabend@gmail.com>


Acked-by: Roman Gushchin <guro@fb.com>

Thanks, Daniel!

^ permalink raw reply

* Re: ethernet "bus" number in DTS ?
From: Florian Fainelli @ 2018-10-26 22:57 UTC (permalink / raw)
  To: Michal Suchánek
  Cc: Joakim Tjernlund, linuxppc-dev@lists.ozlabs.org,
	netdev@vger.kernel.org, andrew@lunn.ch, robh
In-Reply-To: <20181024082239.5ee41017@naga.suse.cz>

On 10/23/18 11:22 PM, Michal Suchánek wrote:
> On Tue, 23 Oct 2018 11:20:36 -0700
> Florian Fainelli <f.fainelli@gmail.com> wrote:
> 
>> On 10/23/18 11:02 AM, Joakim Tjernlund wrote:
>>> On Tue, 2018-10-23 at 10:03 -0700, Florian Fainelli wrote:  
> 
>>>
>>> I also noted that using status = "disabled" didn't work either to
>>> create a fix name scheme. Even worse, all the eth I/F after gets
>>> renumbered. It seems to me there is value in having stability in
>>> eth I/F naming at boot. Then userspace(udev) can rename if need be. 
>>>
>>> Sure would like to known more about why this feature is not wanted ?
>>>
>>> I found
>>>   https://patchwork.kernel.org/patch/4122441/
>>> You quote policy as reason but surely it must be better to
>>> have something stable, connected to the hardware name, than
>>> semirandom naming?  
>>
>> If the Device Tree nodes are ordered by ascending base register
>> address, my understanding is that you get the same order as far as
>> platform_device creation goes, this may not be true in the future if
>> Rob decides to randomize that, but AFAICT this is still true. This
>> may not work well with status = disabled properties being inserted
>> here and there, but we have used that here and it has worked for as
>> far as I can remember doing it.
> 
> So this is unstable in several respects. First is changing the
> enabled/disabled status in the deivecetrees provided by the kernel.
> 
> Second is if you have hardware hotplug mechanism either by firmware or
> by loading device overlays.
> 
> Third is the case when the devicetree is not built as part of the
> kernel but is instead provided by firmware that initializes the
> low-level hardware details. Then the ordering by address is not
> guaranteed nor is that the same address will be used to access the same
> interface every time. There might be multiple ways to configure the
> hardware depending on firmware configuration and/or version.
> 
>  
>> Second, you might want to name network devices ethX, but what if I
>> want to name them ethernetX or fooX or barX? Should we be accepting a
>> mechanism in the kernel that would allow someone to name the
>> interfaces the way they want straight from a name being provided in
>> Device Tree?
> 
> Clearly if there is text Ethernet1 printed above the Ethernet port we
> should provide a mechanism to name the port Ethernet1 by default.

Yes, but then have a specific property that is flexible enough to
retrieve things like "vital product information". For DSA switches, we
have an optional "label" property which names the network device
directly like it would be found on the product's case. Ideally this
should be much more flexible such that it can point to the appropriate
node/firmware service/whatever to get that name, which is what some
people have been working on lately, see [1].

[1]: https://lkml.org/lkml/2018/8/14/1039

The point is don't re-purpose the aliases which is something that exists
within Device Tree to basically provide a shorter path to a specific set
of nodes for the boot program to mangle and muck with instead of having
to resolve a full path to a node. At least that is how I conceive it.

Now what complicates the matter is that some OSes like Linux tend to use
it to also generate seemingly stable index for peripherals that are
numbered by index such as SPI, I2C, etc buses, which is why we are
having this conversation here, see below for more.

> 
>>
>> Aliases are fine for providing relative stability within the Device
>> Tree itself and boot programs that might need to modify the Device
>> Tree (e.g: inserting MAC addresses) such that you don't have to
>> encode logic to search for nodes by compatible strings etc. but
>> outside of that use case, it seems to me that you can resolve every
>> naming decision in user-space.
> 
> However, this is pushing platform-specific knowledge to userspace. The
> way the Ethernet interface is attached and hence the device properties
> usable for identifying the device uniquely are platform-specific.

There is always going to be some amount of platform specific knowledge
to user-space, what matters is the level of abstraction that is
presented to you.

> 
> On the other hand, aliases are universal when provided. If they are
> good enough to assign a MAC address they are good enough to provide a
> stable default name.

We are not talking about the same aliases then. The special Device Tree
node named "aliases" is a way to create shorted Device Tree node paths,
it is not by any means an equivalent for what I would rather call a
"label", or "port name" or silk screen annotation on a PCB for instance.

> 
> I think this is indeed forcing the userspace to reinvent several wheels
> for no good reason.

udev or systemd will come up with some stable names for your network
device right off the bat. If you are deeply embedded maybe you don't
want those, but then use something like the full path in /sys to get
some stable names based on the SoC's topology.

> 
> What is the problem with adding the aliases?

aliases is IMHO the wrong tool for the job because it is too limiting,
you want it to be used to have Ethernet controller instance N to be
named "ethN", what if someone tomorrows says, no this is not good, I
want the aliases to automatically name my network devices
"ethernet-controllerN" or "blahblahN"? You see where I am getting with that?

And yes, I do realize that Linux has traditionally named Ethernet
adapters ethN, but also allows people to name interfaces just the way
they want even from within the drivers themselves.

Now imagine your platform uses ACPI, and there are no aliases there to
point a node with a shorter name, how we would go about naming your
Ethernet controller unless there is a specific VPD/label/sticker
property that can be somehow be retried?
-- 
Florian

^ permalink raw reply

* Re: KASAN: use-after-free Read in sctp_outq_select_transport
From: Xin Long @ 2018-10-27  7:33 UTC (permalink / raw)
  To: syzbot+56a40ceee5fb35932f4d
  Cc: davem, LKML, linux-sctp, Marcelo Ricardo Leitner, network dev,
	Neil Horman, syzkaller-bugs, Vlad Yasevich
In-Reply-To: <00000000000073bd310578e29d8d@google.com>

On Tue, Oct 23, 2018 at 7:14 PM syzbot
<syzbot+56a40ceee5fb35932f4d@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    23469de647c4 Merge git://git.kernel.org/pub/scm/linux/kern..
> git tree:       net
> console output: https://syzkaller.appspot.com/x/log.txt?x=13580ce5400000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=b3f55cb3dfcc6c33
> dashboard link: https://syzkaller.appspot.com/bug?extid=56a40ceee5fb35932f4d
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
>
> Unfortunately, I don't have any reproducer for this crash yet.
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+56a40ceee5fb35932f4d@syzkaller.appspotmail.com
>
> ==================================================================
> BUG: KASAN: use-after-free in sctp_outq_select_transport+0x77e/0x9a0
> net/sctp/outqueue.c:840
> Read of size 4 at addr ffff8801c5ff2614 by task syz-executor5/8275
>
> CPU: 0 PID: 8275 Comm: syz-executor5 Not tainted 4.19.0-rc8+ #152
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>   <IRQ>
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x1c4/0x2b6 lib/dump_stack.c:113
>   print_address_description.cold.8+0x9/0x1ff mm/kasan/report.c:256
>   kasan_report_error mm/kasan/report.c:354 [inline]
>   kasan_report.cold.9+0x242/0x309 mm/kasan/report.c:412
>   __asan_report_load4_noabort+0x14/0x20 mm/kasan/report.c:432
>   sctp_outq_select_transport+0x77e/0x9a0 net/sctp/outqueue.c:840
I think we need a fix like:
@@ -586,6 +586,10 @@ void sctp_assoc_rm_peer(struct sctp_association *asoc,
                                sctp_transport_hold(active);
        }

+        list_for_each_entry(ch, &asoc->outqueue.out_chunk_list, list)
+               if (ch->transport == transport)
+                       ch->transport = NULL;
+
        asoc->peer.transport_count--;

        sctp_transport_free(peer);

>   sctp_outq_flush_data net/sctp/outqueue.c:1100 [inline]
>   sctp_outq_flush+0x14f7/0x34f0 net/sctp/outqueue.c:1209
>   sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
>   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1820 [inline]
>   sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>   sctp_do_sm+0x608/0x7190 net/sctp/sm_sideeffect.c:1191
>   sctp_assoc_bh_rcv+0x346/0x670 net/sctp/associola.c:1067
>   sctp_inq_push+0x280/0x370 net/sctp/inqueue.c:95
>   sctp_rcv+0x2d93/0x3c00 net/sctp/input.c:268
>   sctp6_rcv+0x15/0x30 net/sctp/ipv6.c:1061
>   ip6_input_finish+0x3fc/0x1aa0 net/ipv6/ip6_input.c:383
>   NF_HOOK include/linux/netfilter.h:289 [inline]
>   ip6_input+0xe9/0x600 net/ipv6/ip6_input.c:426
>   dst_input include/net/dst.h:450 [inline]
>   ip6_rcv_finish+0x17a/0x330 net/ipv6/ip6_input.c:76
>   NF_HOOK include/linux/netfilter.h:289 [inline]
>   ipv6_rcv+0x120/0x640 net/ipv6/ip6_input.c:271
>   __netif_receive_skb_one_core+0x14d/0x200 net/core/dev.c:4913
>   __netif_receive_skb+0x2c/0x1e0 net/core/dev.c:5023
>   process_backlog+0x218/0x6f0 net/core/dev.c:5829
>   napi_poll net/core/dev.c:6249 [inline]
>   net_rx_action+0x7c5/0x1950 net/core/dev.c:6315
> IPv6: ADDRCONF(NETDEV_CHANGE): vcan0: link becomes ready
> IPv6: ADDRCONF(NETDEV_CHANGE): vcan0: link becomes ready
>   __do_softirq+0x30c/0xb03 kernel/softirq.c:292
>   do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1047
>   </IRQ>
>   do_softirq.part.13+0x126/0x160 kernel/softirq.c:336
>   do_softirq kernel/softirq.c:328 [inline]
>   __local_bh_enable_ip+0x21d/0x260 kernel/softirq.c:189
>   local_bh_enable include/linux/bottom_half.h:32 [inline]
>   rcu_read_unlock_bh include/linux/rcupdate.h:723 [inline]
>   ip6_finish_output2+0xce4/0x27a0 net/ipv6/ip6_output.c:121
>   ip6_finish_output+0x58c/0xc60 net/ipv6/ip6_output.c:154
>   NF_HOOK_COND include/linux/netfilter.h:278 [inline]
>   ip6_output+0x232/0x9d0 net/ipv6/ip6_output.c:171
>   dst_output include/net/dst.h:444 [inline]
>   NF_HOOK include/linux/netfilter.h:289 [inline]
>   ip6_xmit+0xf69/0x2420 net/ipv6/ip6_output.c:275
>   sctp_v6_xmit+0x3b2/0x790 net/sctp/ipv6.c:230
>   sctp_packet_transmit+0x298e/0x3db0 net/sctp/output.c:656
>   sctp_packet_singleton net/sctp/outqueue.c:791 [inline]
>   sctp_outq_flush_ctrl.constprop.11+0x7a9/0xe50 net/sctp/outqueue.c:922
>   sctp_outq_flush+0x310/0x34f0 net/sctp/outqueue.c:1204
>   sctp_outq_uncork+0x6a/0x80 net/sctp/outqueue.c:776
> kobject: 'rfkill9' (0000000067be229d): fill_kobj_path: path
> = '/devices/virtual/mac80211_hwsim/hwsim7/ieee80211/phy7/rfkill9'
>   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1349 [inline]
>   sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>   sctp_do_sm+0x4f25/0x7190 net/sctp/sm_sideeffect.c:1191
> ieee80211 phy7: Selected rate control algorithm 'minstrel_ht'
> kobject: 'net' (00000000cdd117dc): kobject_add_internal: parent: 'hwsim7',
> set: '(null)'
>   sctp_endpoint_bh_rcv+0x465/0x960 net/sctp/endpointola.c:456
> kobject: 'wlan0' (00000000c9d333fc): kobject_add_internal: parent: 'net',
> set: 'devices'
>   sctp_inq_push+0x280/0x370 net/sctp/inqueue.c:95
>   sctp_backlog_rcv+0x1a8/0xd50 net/sctp/input.c:351
> kobject: 'wlan0' (00000000c9d333fc): kobject_uevent_env
> kobject: 'wlan0' (00000000c9d333fc): fill_kobj_path: path
> = '/devices/virtual/mac80211_hwsim/hwsim7/net/wlan0'
>   sk_backlog_rcv include/net/sock.h:931 [inline]
>   __release_sock+0x12f/0x3a0 net/core/sock.c:2336
> kobject: 'queues' (00000000c5d009e9): kobject_add_internal:
> parent: 'wlan0', set: '<NULL>'
>   release_sock+0xad/0x2c0 net/core/sock.c:2849
>   sock_setsockopt+0x60d/0x2280 net/core/sock.c:1054
> kobject: 'queues' (00000000c5d009e9): kobject_uevent_env
> kobject: 'queues' (00000000c5d009e9): kobject_uevent_env: filter function
> caused the event to drop!
>   __sys_setsockopt+0x304/0x3c0 net/socket.c:1898
> kobject: 'rx-0' (000000007736ba36): kobject_add_internal: parent: 'queues',
> set: 'queues'
>   __do_sys_setsockopt net/socket.c:1913 [inline]
>   __se_sys_setsockopt net/socket.c:1910 [inline]
>   __x64_sys_setsockopt+0xbe/0x150 net/socket.c:1910
>   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
> kobject: 'rx-0' (000000007736ba36): kobject_uevent_env
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> kobject: 'rx-0' (000000007736ba36): fill_kobj_path: path
> = '/devices/virtual/mac80211_hwsim/hwsim7/net/wlan0/queues/rx-0'
> RIP: 0033:0x457569
> Code: fd b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
> ff 0f 83 cb b3 fb ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007f0613beec78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
> RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 0000000000457569
> RDX: 000000000000001a RSI: 0000000000000001 RDI: 0000000000000003
> RBP: 000000000072c040 R08: 0000000000000010 R09: 0000000000000000
> R10: 00000000200000c0 R11: 0000000000000246 R12: 00007f0613bef6d4
> R13: 00000000004c3b61 R14: 00000000004d5c88 R15: 00000000ffffffff
>
> Allocated by task 8261:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:448
> kobject: 'tx-0' (000000008f0eae4c): kobject_add_internal: parent: 'queues',
> set: 'queues'
>   set_track mm/kasan/kasan.c:460 [inline]
>   kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
>   kmem_cache_alloc_trace+0x152/0x750 mm/slab.c:3620
>   kmalloc include/linux/slab.h:513 [inline]
>   kzalloc include/linux/slab.h:707 [inline]
>   sctp_transport_new+0x10a/0x880 net/sctp/transport.c:111
>   sctp_assoc_add_peer+0x2de/0x10d0 net/sctp/associola.c:630
>   sctp_process_param net/sctp/sm_make_chunk.c:2540 [inline]
>   sctp_process_init+0xfc0/0x29e0 net/sctp/sm_make_chunk.c:2356
>   sctp_cmd_process_init net/sctp/sm_sideeffect.c:682 [inline]
>   sctp_cmd_interpreter net/sctp/sm_sideeffect.c:1410 [inline]
>   sctp_side_effects net/sctp/sm_sideeffect.c:1220 [inline]
>   sctp_do_sm+0x13b9/0x7190 net/sctp/sm_sideeffect.c:1191
>   sctp_assoc_bh_rcv+0x346/0x670 net/sctp/associola.c:1067
>   sctp_inq_push+0x280/0x370 net/sctp/inqueue.c:95
>   sctp_backlog_rcv+0x1a8/0xd50 net/sctp/input.c:351
>   sk_backlog_rcv include/net/sock.h:931 [inline]
>   __release_sock+0x12f/0x3a0 net/core/sock.c:2336
> kobject: 'tx-0' (000000008f0eae4c): kobject_uevent_env
>   release_sock+0xad/0x2c0 net/core/sock.c:2849
>   sctp_wait_for_connect+0x391/0x640 net/sctp/socket.c:8667
>   sctp_sendmsg_to_asoc+0x1d0f/0x2230 net/sctp/socket.c:1985
>   sctp_sendmsg+0x13c2/0x1da0 net/sctp/socket.c:2131
>   inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
>   sock_sendmsg_nosec net/socket.c:621 [inline]
>   sock_sendmsg+0xd5/0x120 net/socket.c:631
>   __sys_sendto+0x3d7/0x670 net/socket.c:1788
>   __do_sys_sendto net/socket.c:1800 [inline]
>   __se_sys_sendto net/socket.c:1796 [inline]
>   __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1796
>   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> kobject: 'tx-0' (000000008f0eae4c): fill_kobj_path: path
> = '/devices/virtual/mac80211_hwsim/hwsim7/net/wlan0/queues/tx-0'
>
> Freed by task 9:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>   set_track mm/kasan/kasan.c:460 [inline]
>   __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
>   kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
>   __cache_free mm/slab.c:3498 [inline]
>   kfree+0xcf/0x230 mm/slab.c:3813
>   sctp_transport_destroy_rcu+0x4a/0x60 net/sctp/transport.c:163
>   __rcu_reclaim kernel/rcu/rcu.h:236 [inline]
>   rcu_do_batch kernel/rcu/tree.c:2576 [inline]
>   invoke_rcu_callbacks kernel/rcu/tree.c:2880 [inline]
>   __rcu_process_callbacks kernel/rcu/tree.c:2847 [inline]
>   rcu_process_callbacks+0xf23/0x2670 kernel/rcu/tree.c:2864
>   __do_softirq+0x30c/0xb03 kernel/softirq.c:292
> kobject: 'tx-1' (000000009aaee194): kobject_add_internal: parent: 'queues',
> set: 'queues'
>
> The buggy address belongs to the object at ffff8801c5ff24c0
>   which belongs to the cache kmalloc-1024 of size 1024
> The buggy address is located 340 bytes inside of
>   1024-byte region [ffff8801c5ff24c0, ffff8801c5ff28c0)
> The buggy address belongs to the page:
> page:ffffea000717fc80 count:1 mapcount:0 mapping:ffff8801da800ac0 index:0x0
> compound_mapcount: 0
> flags: 0x2fffc0000008100(slab|head)
> raw: 02fffc0000008100 ffffea00070ed808 ffffea00071b2488 ffff8801da800ac0
> raw: 0000000000000000 ffff8801c5ff2040 0000000100000007 0000000000000000
> kobject: 'tx-1' (000000009aaee194): kobject_uevent_env
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>   ffff8801c5ff2500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>   ffff8801c5ff2580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> > ffff8801c5ff2600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>                           ^
>   ffff8801c5ff2680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>   ffff8801c5ff2700: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
> ==================================================================
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.

^ permalink raw reply

* [PATCH net] net: sched: gred: pass the right attribute to gred_change_table_def()
From: Jakub Kicinski @ 2018-10-26 22:51 UTC (permalink / raw)
  To: davem; +Cc: tgraf, jhs, xiyou.wangcong, netdev, oss-drivers, Jakub Kicinski

gred_change_table_def() takes a pointer to TCA_GRED_DPS attribute,
and expects it will be able to interpret its contents as
struct tc_gred_sopt.  Pass the correct gred attribute, instead of
TCA_OPTIONS.

This bug meant the table definition could never be changed after
Qdisc was initialized (unless whatever TCA_OPTIONS contained both
passed netlink validation and was a valid struct tc_gred_sopt...).

Old behaviour:
$ ip link add type dummy
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0
RTNETLINK answers: Invalid argument

Now:
$ ip link add type dummy
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0
$ tc qdisc replace dev dummy0 parent root handle 7: \
     gred setup vqs 4 default 0

Fixes: f62d6b936df5 ("[PKT_SCHED]: GRED: Use central VQ change procedure")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 net/sched/sch_gred.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sched/sch_gred.c b/net/sched/sch_gred.c
index cbe4831f46f4..4a042abf844c 100644
--- a/net/sched/sch_gred.c
+++ b/net/sched/sch_gred.c
@@ -413,7 +413,7 @@ static int gred_change(struct Qdisc *sch, struct nlattr *opt,
 	if (tb[TCA_GRED_PARMS] == NULL && tb[TCA_GRED_STAB] == NULL) {
 		if (tb[TCA_GRED_LIMIT] != NULL)
 			sch->limit = nla_get_u32(tb[TCA_GRED_LIMIT]);
-		return gred_change_table_def(sch, opt);
+		return gred_change_table_def(sch, tb[TCA_GRED_DPS]);
 	}
 
 	if (tb[TCA_GRED_PARMS] == NULL ||
-- 
2.17.1

^ permalink raw reply related

* [PATCH bpf] bpf: fix wrong helper enablement in cgroup local storage
From: Daniel Borkmann @ 2018-10-26 22:49 UTC (permalink / raw)
  To: ast; +Cc: netdev, Daniel Borkmann, Roman Gushchin

Commit cd3394317653 ("bpf: introduce the bpf_get_local_storage()
helper function") enabled the bpf_get_local_storage() helper also
for BPF program types where it does not make sense to use them.

They have been added both in sk_skb_func_proto() and sk_msg_func_proto()
even though both program types are not invoked in combination with
cgroups, and neither through BPF_PROG_RUN_ARRAY(). In the latter the
bpf_cgroup_storage_set() is set shortly before BPF program invocation.

Later, the helper bpf_get_local_storage() retrieves this prior set
up per-cpu pointer and hands the buffer to the BPF program. The map
argument in there solely retrieves the enum bpf_cgroup_storage_type
from a local storage map associated with the program and based on the
type returns either the global or per-cpu storage. However, there
is no specific association between the program's map and the actual
content in bpf_cgroup_storage[].

Meaning, any BPF program that would have been properly run from the
cgroup side through BPF_PROG_RUN_ARRAY() where bpf_cgroup_storage_set()
was performed, and that is later unloaded such that prog / maps are
teared down will cause a use after free if that pointer is retrieved
from programs that are not run through BPF_PROG_RUN_ARRAY() but have
the cgroup local storage helper enabled in their func proto.

Lets just remove it from the two sock_map program types to fix it.
Auditing through the types where this helper is enabled, it appears
that these are the only ones where it was mistakenly allowed.

Fixes: cd3394317653 ("bpf: introduce the bpf_get_local_storage() helper function")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Roman Gushchin <guro@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
---
 net/core/filter.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/net/core/filter.c b/net/core/filter.c
index 35c6933..4d7c511 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -5264,8 +5264,6 @@ sk_msg_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_msg_pull_data_proto;
 	case BPF_FUNC_msg_push_data:
 		return &bpf_msg_push_data_proto;
-	case BPF_FUNC_get_local_storage:
-		return &bpf_get_local_storage_proto;
 	default:
 		return bpf_base_func_proto(func_id);
 	}
@@ -5296,8 +5294,6 @@ sk_skb_func_proto(enum bpf_func_id func_id, const struct bpf_prog *prog)
 		return &bpf_sk_redirect_map_proto;
 	case BPF_FUNC_sk_redirect_hash:
 		return &bpf_sk_redirect_hash_proto;
-	case BPF_FUNC_get_local_storage:
-		return &bpf_get_local_storage_proto;
 #ifdef CONFIG_INET
 	case BPF_FUNC_sk_lookup_tcp:
 		return &bpf_sk_lookup_tcp_proto;
-- 
2.9.5

^ permalink raw reply related

* Re: [Intel-wired-lan] [RFC PATCH 1/4] ptp: add PTP_SYS_OFFSET_EXTENDED ioctl
From: Vinicius Costa Gomes @ 2018-10-26 22:16 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev; +Cc: Richard Cochran, intel-wired-lan
In-Reply-To: <20181026162742.631-2-mlichvar@redhat.com>

Hi Miroslav,

Miroslav Lichvar <mlichvar@redhat.com> writes:

> The PTP_SYS_OFFSET ioctl, which can be used to measure the offset
> between a PHC and the system clock, includes the total time that the
> gettime64 function of a driver needs to read the PHC timestamp.
>
> This typically involves reading of multiple PCI registers (sometimes in
> multiple iterations) and the register that contains the lowest bits of
> the timestamp is not read in the middle between the two readings of the
> system clock. This asymmetry causes the measured offset to have a
> significant error.
>
> Introduce a new ioctl, driver function, and helper functions, which
> allow the reading of the lowest register to be isolated from the other
> readings in order to reduce the asymmetry. The ioctl and driver function
> return three timestamps for each measurement:
> - system time right before reading the lowest bits of the PHC timestamp
> - PHC time
> - system time immediately after reading the lowest bits of the PHC
>   timestamp

Cool stuff!

Just one little thing below. Feel free to add my ack to the series.

Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>

>
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Jacob Keller <jacob.e.keller@intel.com>
> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
> ---
>  drivers/ptp/ptp_chardev.c        | 39 ++++++++++++++++++++++++++++++++
>  include/linux/ptp_clock_kernel.h | 26 +++++++++++++++++++++
>  include/uapi/linux/ptp_clock.h   | 12 ++++++++++
>  3 files changed, 77 insertions(+)
>
> diff --git a/drivers/ptp/ptp_chardev.c b/drivers/ptp/ptp_chardev.c
> index 2012551d93e0..1a04c437fd4f 100644
> --- a/drivers/ptp/ptp_chardev.c
> +++ b/drivers/ptp/ptp_chardev.c
> @@ -124,11 +124,13 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  	struct ptp_clock_caps caps;
>  	struct ptp_clock_request req;
>  	struct ptp_sys_offset *sysoff = NULL;
> +	struct ptp_sys_offset_extended *sysoff_extended = NULL;
>  	struct ptp_sys_offset_precise precise_offset;
>  	struct ptp_pin_desc pd;
>  	struct ptp_clock *ptp = container_of(pc, struct ptp_clock, clock);
>  	struct ptp_clock_info *ops = ptp->info;
>  	struct ptp_clock_time *pct;
> +	struct ptp_system_timestamp sts;
>  	struct timespec64 ts;
>  	struct system_device_crosststamp xtstamp;
>  	int enable, err = 0;
> @@ -211,6 +213,43 @@ long ptp_ioctl(struct posix_clock *pc, unsigned int cmd, unsigned long arg)
>  			err = -EFAULT;
>  		break;
>  
> +	case PTP_SYS_OFFSET_EXTENDED:
> +		if (!ptp->info->gettimex64) {
> +			err = -EOPNOTSUPP;
> +			break;
> +		}
> +		sysoff_extended = memdup_user((void __user *)arg,
> +					      sizeof(*sysoff_extended));

Looks like you forgot to free 'sysoff_extended', no? 

^ permalink raw reply

* [iproute2 PATCH v2] bridge: fix vlan show stats formatting
From: Tobias Jungel @ 2018-10-26 21:51 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev
In-Reply-To: <20181022100001.3413fe1c@xeon-e3>

The output of -statistics vlan show was broken previous change for json
output. This aligns the format to vlan show.

v2: fixed too greedy deletion that caused a -Wmaybe-uninitialized

Signed-off-by: Tobias Jungel <tobias.jungel@gmail.com>
---
 bridge/vlan.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/bridge/vlan.c b/bridge/vlan.c
index a111d5e6..d075a42d 100644
--- a/bridge/vlan.c
+++ b/bridge/vlan.c
@@ -484,7 +484,7 @@ static void print_vlan_stats_attr(struct rtattr *attr, int ifindex)
 	rem = RTA_PAYLOAD(list);

 	ifname = ll_index_to_name(ifindex);
-	open_json_object(ifname);
+	open_vlan_port(ifindex);

 	print_color_string(PRINT_FP, COLOR_IFNAME,
 			   NULL, "%-16s", ifname);
@@ -505,8 +505,7 @@ static void print_vlan_stats_attr(struct rtattr *attr, int ifindex)

 		print_one_vlan_stats(vstats);
 	}
-	close_json_object();
-
+	close_vlan_port();
 }

 static int print_vlan_stats(struct nlmsghdr *n, void *arg)

^ permalink raw reply related

* Re: [PATCH net v3] net/ipv6: Add anycast addresses to a global hashtable
From: David Ahern @ 2018-10-26 21:44 UTC (permalink / raw)
  To: Jeff Barnhill, netdev; +Cc: davem, kuznet, yoshfuji
In-Reply-To: <20181026212242.9661-1-0xeffeff@gmail.com>

On 10/26/18 3:22 PM, Jeff Barnhill wrote:
> @@ -275,6 +356,13 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
>  		err = -ENOMEM;
>  		goto out;
>  	}
> +	err = ipv6_add_acaddr_hash(dev_net(idev->dev), addr);
> +	if (err) {
> +		fib6_info_release(f6i);
> +		fib6_info_release(f6i);
> +		kfree(aca);
> +		goto out;
> +	}

I think aca_put() makes this less confusing as it will do the
fib6_info_release(f6i) and kfree(aca);

^ permalink raw reply

* Re: [PATCH] net: allow traceroute with a specified interface in a vrf
From: David Ahern @ 2018-10-26 21:23 UTC (permalink / raw)
  To: Mike Manning, netdev
In-Reply-To: <20181026112435.12159-1-mmanning@vyatta.att-mail.com>

On 10/26/18 5:24 AM, Mike Manning wrote:
> Traceroute executed in a vrf succeeds if no device is given or if the
> vrf is given as the device, but fails if the interface is given as the
> device. This is for default UDP probes, it succeeds for TCP SYN or ICMP
> ECHO probes. As the skb bound dev is the interface and the sk dev is
> the vrf, sk lookup fails for ICMP_DEST_UNREACH and ICMP_TIME_EXCEEDED
> messages. The solution is for the secondary dev to be passed so that
> the interface is available for the device match to succeed, in the same
> way as is already done for non-error cases.
> 
> Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
> ---
>  net/ipv4/udp.c | 4 ++--
>  net/ipv6/udp.c | 2 +-
>  2 files changed, 3 insertions(+), 3 deletions(-)
> 


Reviewed-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* [PATCH] ptp: drop redundant kasprintf() to create worker name
From: Rasmus Villemoes @ 2018-10-26 21:22 UTC (permalink / raw)
  To: Richard Cochran; +Cc: Kees Cook, Rasmus Villemoes, netdev, linux-kernel

Building with -Wformat-nonliteral, gcc complains

drivers/ptp/ptp_clock.c: In function ‘ptp_clock_register’:
drivers/ptp/ptp_clock.c:239:26: warning: format not a string literal and no format arguments [-Wformat-nonliteral]
            worker_name : info->name);

kthread_create_worker takes fmt+varargs to set the name of the
worker, and that happens with a vsnprintf() to a stack buffer (that is
then copied into task_comm). So there's no reason not to just pass
"ptp%d", ptp->index to kthread_create_worker() and avoid the
intermediate worker_name variable.

Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
---
 drivers/ptp/ptp_clock.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/drivers/ptp/ptp_clock.c b/drivers/ptp/ptp_clock.c
index 7eacc1c4b3b1..5419a89d300e 100644
--- a/drivers/ptp/ptp_clock.c
+++ b/drivers/ptp/ptp_clock.c
@@ -232,12 +232,8 @@ struct ptp_clock *ptp_clock_register(struct ptp_clock_info *info,
 	init_waitqueue_head(&ptp->tsev_wq);
 
 	if (ptp->info->do_aux_work) {
-		char *worker_name = kasprintf(GFP_KERNEL, "ptp%d", ptp->index);
-
 		kthread_init_delayed_work(&ptp->aux_work, ptp_aux_kworker);
-		ptp->kworker = kthread_create_worker(0, worker_name ?
-						     worker_name : info->name);
-		kfree(worker_name);
+		ptp->kworker = kthread_create_worker(0, "ptp%d", ptp->index);
 		if (IS_ERR(ptp->kworker)) {
 			err = PTR_ERR(ptp->kworker);
 			pr_err("failed to create ptp aux_worker %d\n", err);
-- 
2.19.1.6.gbde171bbf5

^ permalink raw reply related

* [PATCH net v3] net/ipv6: Add anycast addresses to a global hashtable
From: Jeff Barnhill @ 2018-10-26 21:22 UTC (permalink / raw)
  To: netdev; +Cc: davem, kuznet, yoshfuji, Jeff Barnhill
In-Reply-To: <CAL6e_pdbk9+5g-6dr54f9aT4v1RHV_thxM+zb4w41J0n_ydNcA@mail.gmail.com>

icmp6_send() function is expensive on systems with a large number of
interfaces. Every time it’s called, it has to verify that the source
address does not correspond to an existing anycast address by looping
through every device and every anycast address on the device.  This can
result in significant delays for a CPU when there are a large number of
neighbors and ND timers are frequently timing out and calling
neigh_invalidate().

Add anycast addresses to a global hashtable to allow quick searching for
matching anycast addresses.  This is based on inet6_addr_lst in addrconf.c.

Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
---
 include/net/addrconf.h |   2 +
 include/net/if_inet6.h |   8 ++++
 net/ipv6/af_inet6.c    |   5 ++
 net/ipv6/anycast.c     | 122 ++++++++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 135 insertions(+), 2 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 14b789a123e7..799af1a037d1 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -317,6 +317,8 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr);
 bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
 			     const struct in6_addr *addr);
+int anycast_init(void);
+void anycast_cleanup(void);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/include/net/if_inet6.h b/include/net/if_inet6.h
index d7578cf49c3a..a445014b981d 100644
--- a/include/net/if_inet6.h
+++ b/include/net/if_inet6.h
@@ -142,6 +142,14 @@ struct ipv6_ac_socklist {
 	struct ipv6_ac_socklist *acl_next;
 };
 
+struct ipv6_ac_addrlist {
+	struct in6_addr		acal_addr;
+	possible_net_t		acal_pnet;
+	refcount_t		acal_users;
+	struct hlist_node	acal_lst; /* inet6_acaddr_lst */
+	struct rcu_head		rcu;
+};
+
 struct ifacaddr6 {
 	struct in6_addr		aca_addr;
 	struct fib6_info	*aca_rt;
diff --git a/net/ipv6/af_inet6.c b/net/ipv6/af_inet6.c
index 3f4d61017a69..ddc8a6dbfba2 100644
--- a/net/ipv6/af_inet6.c
+++ b/net/ipv6/af_inet6.c
@@ -1001,6 +1001,9 @@ static int __init inet6_init(void)
 	err = ip6_flowlabel_init();
 	if (err)
 		goto ip6_flowlabel_fail;
+	err = anycast_init();
+	if (err)
+		goto anycast_fail;
 	err = addrconf_init();
 	if (err)
 		goto addrconf_fail;
@@ -1091,6 +1094,8 @@ static int __init inet6_init(void)
 ipv6_exthdrs_fail:
 	addrconf_cleanup();
 addrconf_fail:
+	anycast_cleanup();
+anycast_fail:
 	ip6_flowlabel_cleanup();
 ip6_flowlabel_fail:
 	ndisc_late_cleanup();
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 4e0ff7031edd..1040d08867ab 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -44,8 +44,22 @@
 
 #include <net/checksum.h>
 
+#define IN6_ADDR_HSIZE_SHIFT	8
+#define IN6_ADDR_HSIZE		BIT(IN6_ADDR_HSIZE_SHIFT)
+/*	anycast address hash table
+ */
+static struct hlist_head inet6_acaddr_lst[IN6_ADDR_HSIZE];
+static DEFINE_SPINLOCK(acaddr_hash_lock);
+
 static int ipv6_dev_ac_dec(struct net_device *dev, const struct in6_addr *addr);
 
+static u32 inet6_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	u32 val = ipv6_addr_hash(addr) ^ net_hash_mix(net);
+
+	return hash_32(val, IN6_ADDR_HSIZE_SHIFT);
+}
+
 /*
  *	socket join an anycast group
  */
@@ -204,6 +218,73 @@ void ipv6_sock_ac_close(struct sock *sk)
 	rtnl_unlock();
 }
 
+static struct ipv6_ac_addrlist *acal_alloc(struct net *net,
+					   const struct in6_addr *addr)
+{
+	struct ipv6_ac_addrlist *acal;
+
+	acal = kzalloc(sizeof(*acal), GFP_ATOMIC);
+	if (!acal)
+		return NULL;
+
+	acal->acal_addr = *addr;
+	write_pnet(&acal->acal_pnet, net);
+	refcount_set(&acal->acal_users, 1);
+	INIT_HLIST_NODE(&acal->acal_lst);
+
+	return acal;
+}
+
+static int ipv6_add_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	unsigned int hash = inet6_acaddr_hash(net, addr);
+	struct ipv6_ac_addrlist *acal;
+	int err = 0;
+
+	spin_lock(&acaddr_hash_lock);
+	hlist_for_each_entry(acal, &inet6_acaddr_lst[hash], acal_lst) {
+		if (!net_eq(read_pnet(&acal->acal_pnet), net))
+			continue;
+		if (ipv6_addr_equal(&acal->acal_addr, addr)) {
+			refcount_inc(&acal->acal_users);
+			goto out;
+		}
+	}
+
+	acal = acal_alloc(net, addr);
+	if (!acal) {
+		err = -ENOMEM;
+		goto out;
+	}
+
+	hlist_add_head_rcu(&acal->acal_lst, &inet6_acaddr_lst[hash]);
+
+out:
+	spin_unlock(&acaddr_hash_lock);
+	return err;
+}
+
+static void ipv6_del_acaddr_hash(struct net *net, const struct in6_addr *addr)
+{
+	unsigned int hash = inet6_acaddr_hash(net, addr);
+	struct ipv6_ac_addrlist *acal;
+
+	spin_lock(&acaddr_hash_lock);
+	hlist_for_each_entry(acal, &inet6_acaddr_lst[hash], acal_lst) {
+		if (!net_eq(read_pnet(&acal->acal_pnet), net))
+			continue;
+		if (ipv6_addr_equal(&acal->acal_addr, addr)) {
+			if (refcount_dec_and_test(&acal->acal_users)) {
+				hlist_del_init_rcu(&acal->acal_lst);
+				kfree_rcu(acal, rcu);
+			}
+			spin_unlock(&acaddr_hash_lock);
+			return;
+		}
+	}
+	spin_unlock(&acaddr_hash_lock);
+}
+
 static void aca_get(struct ifacaddr6 *aca)
 {
 	refcount_inc(&aca->aca_refcnt);
@@ -275,6 +356,13 @@ int __ipv6_dev_ac_inc(struct inet6_dev *idev, const struct in6_addr *addr)
 		err = -ENOMEM;
 		goto out;
 	}
+	err = ipv6_add_acaddr_hash(dev_net(idev->dev), addr);
+	if (err) {
+		fib6_info_release(f6i);
+		fib6_info_release(f6i);
+		kfree(aca);
+		goto out;
+	}
 
 	aca->aca_next = idev->ac_list;
 	idev->ac_list = aca;
@@ -324,6 +412,7 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr)
 		prev_aca->aca_next = aca->aca_next;
 	else
 		idev->ac_list = aca->aca_next;
+	ipv6_del_acaddr_hash(dev_net(idev->dev), &aca->aca_addr);
 	write_unlock_bh(&idev->lock);
 	addrconf_leave_solict(idev, &aca->aca_addr);
 
@@ -350,6 +439,8 @@ void ipv6_ac_destroy_dev(struct inet6_dev *idev)
 	write_lock_bh(&idev->lock);
 	while ((aca = idev->ac_list) != NULL) {
 		idev->ac_list = aca->aca_next;
+		ipv6_del_acaddr_hash(dev_net(idev->dev), &aca->aca_addr);
+
 		write_unlock_bh(&idev->lock);
 
 		addrconf_leave_solict(idev, &aca->aca_addr);
@@ -390,17 +481,23 @@ static bool ipv6_chk_acast_dev(struct net_device *dev, const struct in6_addr *ad
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 			 const struct in6_addr *addr)
 {
+	unsigned int hash = inet6_acaddr_hash(net, addr);
+	struct ipv6_ac_addrlist *acal;
 	bool found = false;
 
 	rcu_read_lock();
 	if (dev)
 		found = ipv6_chk_acast_dev(dev, addr);
 	else
-		for_each_netdev_rcu(net, dev)
-			if (ipv6_chk_acast_dev(dev, addr)) {
+		hlist_for_each_entry_rcu(acal, &inet6_acaddr_lst[hash],
+					 acal_lst) {
+			if (!net_eq(read_pnet(&acal->acal_pnet), net))
+				continue;
+			if (ipv6_addr_equal(&acal->acal_addr, addr)) {
 				found = true;
 				break;
 			}
+		}
 	rcu_read_unlock();
 	return found;
 }
@@ -539,4 +636,25 @@ void ac6_proc_exit(struct net *net)
 {
 	remove_proc_entry("anycast6", net->proc_net);
 }
+
+/*	Init / cleanup code
+ */
+int __init anycast_init(void)
+{
+	int i;
+
+	for (i = 0; i < IN6_ADDR_HSIZE; i++)
+		INIT_HLIST_HEAD(&inet6_acaddr_lst[i]);
+	return 0;
+}
+
+void anycast_cleanup(void)
+{
+	int i;
+
+	spin_lock(&acaddr_hash_lock);
+	for (i = 0; i < IN6_ADDR_HSIZE; i++)
+		WARN_ON(!hlist_empty(&inet6_acaddr_lst[i]));
+	spin_unlock(&acaddr_hash_lock);
+}
 #endif
-- 
2.14.1

^ permalink raw reply related

* Re: KASAN: slab-out-of-bounds Read in sctp_getsockopt
From: Xin Long @ 2018-10-27  5:58 UTC (permalink / raw)
  To: syzbot+5da0d0a72a9e7d791748
  Cc: davem, LKML, linux-sctp, Marcelo Ricardo Leitner, network dev,
	Neil Horman, syzkaller-bugs, Vlad Yasevich
In-Reply-To: <000000000000e80c930579245751@google.com>

On Sat, Oct 27, 2018 at 1:38 AM syzbot
<syzbot+5da0d0a72a9e7d791748@syzkaller.appspotmail.com> wrote:
>
> Hello,
>
> syzbot found the following crash on:
>
> HEAD commit:    bd6bf7c10484 Merge tag 'pci-v4.20-changes' of git://git.ke..
> git tree:       upstream
> console output: https://syzkaller.appspot.com/x/log.txt?x=16fd6bcb400000
> kernel config:  https://syzkaller.appspot.com/x/.config?x=2dd8629d56664133
> dashboard link: https://syzkaller.appspot.com/bug?extid=5da0d0a72a9e7d791748
> compiler:       gcc (GCC) 8.0.1 20180413 (experimental)
> syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=16b3ea33400000
> C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=17f9f1bd400000
>
> IMPORTANT: if you fix the bug, please add the following tag to the commit:
> Reported-by: syzbot+5da0d0a72a9e7d791748@syzkaller.appspotmail.com
>
> ==================================================================
> BUG: KASAN: slab-out-of-bounds in sctp_getsockopt_pr_streamstatus
> net/sctp/socket.c:7174 [inline]
Forgot to change to use "&" in sctp_getsockopt_pr_streamstatus() in
"sctp: get pr_assoc and pr_stream all status with SCTP_PR_SCTP_ALL instead"

@@ -7158,7 +7158,7 @@ static int
sctp_getsockopt_pr_streamstatus(struct sock *sk, int len,
                goto out;
        }

-       if (policy == SCTP_PR_SCTP_ALL) {
+       if (policy & SCTP_PR_SCTP_ALL) {

> BUG: KASAN: slab-out-of-bounds in sctp_getsockopt+0x7516/0x7cc2
> net/sctp/socket.c:7582
> Read of size 8 at addr ffff8801d89f0968 by task syz-executor278/5330
>
> CPU: 1 PID: 5330 Comm: syz-executor278 Not tainted 4.19.0+ #303
> Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> Google 01/01/2011
> Call Trace:
>   __dump_stack lib/dump_stack.c:77 [inline]
>   dump_stack+0x244/0x39d lib/dump_stack.c:113
>   print_address_description.cold.7+0x9/0x1ff mm/kasan/report.c:256
>   kasan_report_error mm/kasan/report.c:354 [inline]
>   kasan_report.cold.8+0x242/0x309 mm/kasan/report.c:412
>   __asan_report_load8_noabort+0x14/0x20 mm/kasan/report.c:433
>   sctp_getsockopt_pr_streamstatus net/sctp/socket.c:7174 [inline]
>   sctp_getsockopt+0x7516/0x7cc2 net/sctp/socket.c:7582
>   sock_common_getsockopt+0x9a/0xe0 net/core/sock.c:2937
>   __sys_getsockopt+0x1ad/0x390 net/socket.c:1939
>   __do_sys_getsockopt net/socket.c:1950 [inline]
>   __se_sys_getsockopt net/socket.c:1947 [inline]
>   __x64_sys_getsockopt+0xbe/0x150 net/socket.c:1947
>   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
> RIP: 0033:0x445789
> Code: e8 6c b6 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7
> 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
> ff 0f 83 2b 12 fc ff c3 66 2e 0f 1f 84 00 00 00 00
> RSP: 002b:00007effdb293db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
> RAX: ffffffffffffffda RBX: 00000000006dac48 RCX: 0000000000445789
> RDX: 0000000000000074 RSI: 0000000000000084 RDI: 0000000000000003
> RBP: 00000000006dac40 R08: 0000000020000040 R09: 0000000000000000
> R10: 0000000020000080 R11: 0000000000000246 R12: 00000000006dac4c
> R13: 00007ffcfc408c6f R14: 00007effdb2949c0 R15: 00000000006dad2c
>
> Allocated by task 5329:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>   set_track mm/kasan/kasan.c:460 [inline]
>   kasan_kmalloc+0xc7/0xe0 mm/kasan/kasan.c:553
>   kmem_cache_alloc_trace+0x152/0x750 mm/slab.c:3620
>   kmalloc include/linux/slab.h:513 [inline]
>   kzalloc include/linux/slab.h:707 [inline]
>   sctp_stream_init_ext+0x4f/0xf0 net/sctp/stream.c:237
>   sctp_sendmsg_to_asoc+0x1308/0x1a20 net/sctp/socket.c:1896
>   sctp_sendmsg+0x13c2/0x1da0 net/sctp/socket.c:2113
>   inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
>   sock_sendmsg_nosec net/socket.c:621 [inline]
>   sock_sendmsg+0xd5/0x120 net/socket.c:631
>   __sys_sendto+0x3d7/0x670 net/socket.c:1788
>   __do_sys_sendto net/socket.c:1800 [inline]
>   __se_sys_sendto net/socket.c:1796 [inline]
>   __x64_sys_sendto+0xe1/0x1a0 net/socket.c:1796
>   do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> Freed by task 3223:
>   save_stack+0x43/0xd0 mm/kasan/kasan.c:448
>   set_track mm/kasan/kasan.c:460 [inline]
>   __kasan_slab_free+0x102/0x150 mm/kasan/kasan.c:521
>   kasan_slab_free+0xe/0x10 mm/kasan/kasan.c:528
>   __cache_free mm/slab.c:3498 [inline]
>   kfree+0xcf/0x230 mm/slab.c:3813
>   kzfree+0x28/0x30 mm/slab_common.c:1543
>   aa_free_file_ctx security/apparmor/include/file.h:76 [inline]
>   apparmor_file_free_security+0x133/0x1a0 security/apparmor/lsm.c:448
>   security_file_free+0x4a/0x80 security/security.c:900
>   file_free fs/file_table.c:54 [inline]
>   __fput+0x4e8/0xa30 fs/file_table.c:294
>   ____fput+0x15/0x20 fs/file_table.c:309
>   task_work_run+0x1e8/0x2a0 kernel/task_work.c:113
>   tracehook_notify_resume include/linux/tracehook.h:188 [inline]
>   exit_to_usermode_loop+0x318/0x380 arch/x86/entry/common.c:166
>   prepare_exit_to_usermode arch/x86/entry/common.c:197 [inline]
>   syscall_return_slowpath arch/x86/entry/common.c:268 [inline]
>   do_syscall_64+0x6be/0x820 arch/x86/entry/common.c:293
>   entry_SYSCALL_64_after_hwframe+0x49/0xbe
>
> The buggy address belongs to the object at ffff8801d89f0900
>   which belongs to the cache kmalloc-96 of size 96
> The buggy address is located 8 bytes to the right of
>   96-byte region [ffff8801d89f0900, ffff8801d89f0960)
> The buggy address belongs to the page:
> page:ffffea0007627c00 count:1 mapcount:0 mapping:ffff8801da8004c0 index:0x0
> flags: 0x2fffc0000000100(slab)
> raw: 02fffc0000000100 ffffea0007646748 ffffea0007613488 ffff8801da8004c0
> raw: 0000000000000000 ffff8801d89f0000 0000000100000020 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
>   ffff8801d89f0800: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>   ffff8801d89f0880: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> > ffff8801d89f0900: 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc fc
>                                                            ^
>   ffff8801d89f0980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
>   ffff8801d89f0a00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
> ==================================================================
>
>
> ---
> This bug is generated by a bot. It may contain errors.
> See https://goo.gl/tpsmEJ for more information about syzbot.
> syzbot engineers can be reached at syzkaller@googlegroups.com.
>
> syzbot will keep track of this bug report. See:
> https://goo.gl/tpsmEJ#bug-status-tracking for how to communicate with
> syzbot.
> syzbot can test patches for this bug, for details see:
> https://goo.gl/tpsmEJ#testing-patches

^ permalink raw reply

* [PATCH] Fix ss Netid column and Local/Peer_Address
From: Yoann P. @ 2018-10-26 20:53 UTC (permalink / raw)
  To: netdev; +Cc: yoann.p.public

When using ss -Hutn4 or -utn3, Netid and State columns are sometime merged, it 
can be confusing when trying to pipe into awk or column.
Details (before and after output) are available on this github issue: https://
github.com/shemminger/iproute2/issues/20

Signed-off-by: YoyPa <yoann.p.public@gmail.com>
---
 misc/ss.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/misc/ss.c b/misc/ss.c
index c8970438..5e46cc0e 100644
--- a/misc/ss.c
+++ b/misc/ss.c
@@ -144,9 +144,9 @@ static struct column columns[] = {
        { ALIGN_LEFT,   "State",                " ",    0, 0, 0 },
        { ALIGN_LEFT,   "Recv-Q",               " ",    0, 0, 0 },
        { ALIGN_LEFT,   "Send-Q",               " ",    0, 0, 0 },
-       { ALIGN_RIGHT,  "Local Address:",       " ",    0, 0, 0 },
+       { ALIGN_RIGHT,  "Local_Address:",       " ",    0, 0, 0 },
        { ALIGN_LEFT,   "Port",                 "",     0, 0, 0 },
-       { ALIGN_RIGHT,  "Peer Address:",        " ",    0, 0, 0 },
+       { ALIGN_RIGHT,  "Peer_Address:",        " ",    0, 0, 0 },
        { ALIGN_LEFT,   "Port",                 "",     0, 0, 0 },
        { ALIGN_LEFT,   "",                     "",     0, 0, 0 },
 };
@@ -1334,7 +1334,7 @@ static void sock_state_print(struct sockstat *s)
                out("`- %s", sctp_sstate_name[s->state]);
        } else {
                field_set(COL_NETID);
-               out("%s", sock_name);
+               out("%-6s", sock_name);
                field_set(COL_STATE);
                out("%s", sstate_name[s->state]);
        }
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH v2 00/17] octeontx2-af: NPC parser and NIX blocks initialization
From: Arnd Bergmann @ 2018-10-26 19:28 UTC (permalink / raw)
  To: Sunil Kovvuri; +Cc: David Miller, Networking, linux-soc, Sunil Goutham
In-Reply-To: <CA+sq2Cdia2ETBOEHo5mezWe6urrARNa2azbZnWfxUjmEbgyJgw@mail.gmail.com>

On Fri, Oct 26, 2018 at 6:33 PM Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
> On Fri, Oct 26, 2018 at 9:56 PM Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
> > On Fri, Oct 26, 2018 at 7:34 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > > On 10/26/18, Sunil Kovvuri <sunil.kovvuri@gmail.com> wrote:
> > > > On Fri, Oct 26, 2018 at 6:24 PM Arnd Bergmann <arnd@arndb.de> wrote:
> > > >
> > > > As of now it's only mbox based configuration that is supported.
> > >
> > > Ok, thanks for the clarification.
> > >
> > > Does this mean that you intend to have user space tools that use
> > > the mbox based interface on VFIO devices to perform configuration
> > > for virtual network devices, or just that the configuration interface
> > > is something that needs to be designed later?
> > >
> >
> > No there is no need for any userspace tools.
> > It's the virtual network device's driver which will send commands
> > like resource allocation, configuration, stats retrieval to this
> > AF device via mbox interface.
> >
> > eg: A user using ethtool changes RSS settings for the network device,
> > network device's driver receives the data, prepares a mailbox command
> > sends it to this driver for configuring the same in HW.

Ok, that part is mostly fine, as within a given host you can just have
multiple network interfaces that you can each configure independently,
and the mailbox interface for the most part is an implementation detail.

Doing the same in virtual machines means that the mailbox interface
becomes an ABI between the driver in the guest and the driver in the
host. This is still not too bad, in the worst case the guest might have
to detect the version of the host that's running and support both
an old and new version of the interface. There is reasonable hope
that you don't need to revise the interface here; it's not much different
from talking to firmware, and having both sides of the interface under
the control of Linux may in fact be better.

Once the interface gets exposed to stuff like ODP or DPDK, it effectively
becomes a user space interface, and that carries risk of being abused
for passing lots of other stuff, so this is the point where we have to be
very careful.

Aside from this, there is the stuff that Andrew mentioned, which is the
most important: For anything that should /not/ be controlled by a
network interface for itself, you still need an administrative interface.
An example of this would be creating additional virtual functions,
assigning bandwidth allocation between them, or limiting the
data that can be transferred to and from a virtual function.

Can you explain what your plan is to handle those?

> To be more clear there is no mbox 'interface' as such.
> Here PCI devices shares a memory region, one device prepares a command
> in this shared memory and writes into a doorbell kind of register which triggers
> an IRQ to other device. Which then takes the command processes it.

Yes, this part was already clear to me.

       Arnd

^ permalink raw reply

* Re: [PATCH 1/1] ipmr: Make cache queue length configurable
From: Stephen Hemminger @ 2018-10-26 18:24 UTC (permalink / raw)
  To: Brodie Greenfield
  Cc: davem, kuznet, yoshfuji, netdev, linux-kernel, chris.packham,
	luuk.paulussen
In-Reply-To: <20181026020219.22401-2-brodie.greenfield@alliedtelesis.co.nz>

On Fri, 26 Oct 2018 15:02:19 +1300
Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz> wrote:

> We want to be able to keep more spaces available in our queue for
> processing incoming multicast traffic (adding (S,G) entries) - this lets
> us learn more groups faster, rather than dropping them at this stage.
> 
> Signed-off-by: Brodie Greenfield <brodie.greenfield@alliedtelesis.co.nz>
> ---
>  Documentation/networking/ip-sysctl.txt | 7 +++++++
>  include/net/netns/ipv4.h               | 1 +
>  include/uapi/linux/sysctl.h            | 1 +
>  kernel/sysctl_binary.c                 | 1 +
>  net/ipv4/af_inet.c                     | 2 ++
>  net/ipv4/ipmr.c                        | 4 +++-
>  net/ipv4/sysctl_net_ipv4.c             | 7 +++++++
>  7 files changed, 22 insertions(+), 1 deletion(-)
> 
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index 960de8fe3f40..dfc70ef6c42b 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -864,6 +864,13 @@ ip_local_reserved_ports - list of comma separated ranges
>  
>  	Default: Empty
>  
> +ip_mr_cache_queue_length - INTEGER
> +	Limit the number of multicast packets we can have in the queue to be
> +	resolved.
> +	Bear in mind that this causes an O(n) traversal of the same size when
> +	the queue is full. This should be considered if increasing.
> +	Default: 10
>

Thanks for updating documentation.  The second two sentences aren't clear.
Does it mean that setting queue length causes O(n) traversal or that each
multicast packet received causes O(n) traversal

>  ip_unprivileged_port_start - INTEGER
>  	This is a per-namespace sysctl.  It defines the first
>  	unprivileged port in the network namespace.  Privileged ports
> diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
> index e47503b4e4d1..1ca5cabe2d3b 100644
> --- a/include/net/netns/ipv4.h
> +++ b/include/net/netns/ipv4.h
> @@ -184,6 +184,7 @@ struct netns_ipv4 {
>  	int sysctl_igmp_max_msf;
>  	int sysctl_igmp_llm_reports;
>  	int sysctl_igmp_qrv;
> +	int sysctl_ip_mr_cache_queue_length;

Maybe unsigned because negative value is not meaningful.

>  
>  	struct ping_group_range ping_group_range;
>  
> diff --git a/include/uapi/linux/sysctl.h b/include/uapi/linux/sysctl.h
> index d71013fffaf6..32e32d4904cd 100644
> --- a/include/uapi/linux/sysctl.h
> +++ b/include/uapi/linux/sysctl.h
> @@ -425,6 +425,7 @@ enum
>  	NET_TCP_ALLOWED_CONG_CONTROL=123,
>  	NET_TCP_MAX_SSTHRESH=124,
>  	NET_TCP_FRTO_RESPONSE=125,
> +	NET_IPV4_IP_MR_CACHE_QUEUE_LENGTH=126,
>  };

The numeric sysctl enum is considered deprecated, new sysctl's
need not be added here.

>  enum {
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 07148b497451..8db94e8d97ed 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -367,6 +367,7 @@ static const struct bin_table bin_net_ipv4_table[] = {
>  	{ CTL_INT,	NET_IPV4_LOCAL_PORT_RANGE,		"ip_local_port_range" },
>  	{ CTL_INT,	NET_IPV4_IGMP_MAX_MEMBERSHIPS,		"igmp_max_memberships" },
>  	{ CTL_INT,	NET_IPV4_IGMP_MAX_MSF,			"igmp_max_msf" },
> +	{ CTL_INT,	NET_IPV4_IP_MR_CACHE_QUEUE_LENGTH,	"ip_mr_cache_queue_length" },
>  	{ CTL_INT,	NET_IPV4_INET_PEER_THRESHOLD,		"inet_peer_threshold" },
>  	{ CTL_INT,	NET_IPV4_INET_PEER_MINTTL,		"inet_peer_minttl" },
>  	{ CTL_INT,	NET_IPV4_INET_PEER_MAXTTL,		"inet_peer_maxttl" },
> diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
> index 1fbe2f815474..4b78d12aca36 100644
> --- a/net/ipv4/af_inet.c
> +++ b/net/ipv4/af_inet.c
> @@ -1818,6 +1818,8 @@ static __net_init int inet_init_net(struct net *net)
>  	net->ipv4.sysctl_igmp_llm_reports = 1;
>  	net->ipv4.sysctl_igmp_qrv = 2;
>  
> +	/* ipmr unresolved queue length max */
> +	net->ipv4.sysctl_ip_mr_cache_queue_length = 10;

Comment here is not necessary, is obvious.

>  	return 0;
>  }
>  
> diff --git a/net/ipv4/ipmr.c b/net/ipv4/ipmr.c
> index 5660adcf7a04..2864f80e2f2a 100644
> --- a/net/ipv4/ipmr.c
> +++ b/net/ipv4/ipmr.c
> @@ -1128,6 +1128,7 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>  	struct mfc_cache *c;
>  	bool found = false;
>  	int err;
> +	struct net *net = dev_net(dev);

The network layer coding style is to use reverse christmas tree
style declarations, move this up.

>  
>  	spin_lock_bh(&mfc_unres_lock);
>  	list_for_each_entry(c, &mrt->mfc_unres_queue, _c.list) {
> @@ -1140,7 +1141,8 @@ static int ipmr_cache_unresolved(struct mr_table *mrt, vifi_t vifi,
>  
>  	if (!found) {
>  		/* Create a new entry if allowable */
> -		if (atomic_read(&mrt->cache_resolve_queue_len) >= 10 ||
> +		if (atomic_read(&mrt->cache_resolve_queue_len) >=
> +		    net->ipv4.sysctl_ip_mr_cache_queue_length ||
>  		    (c = ipmr_cache_alloc_unres()) == NULL) {
>  			spin_unlock_bh(&mfc_unres_lock);
>  
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 891ed2f91467..b249932ee24e 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -772,6 +772,13 @@ static struct ctl_table ipv4_net_table[] = {
>  		.mode		= 0644,
>  		.proc_handler	= proc_dointvec
>  	},
> +	{
> +		.procname	= "ip_mr_cache_queue_length",
> +		.data		= &init_net.ipv4.sysctl_ip_mr_cache_queue_length,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= proc_dointvec
> +	},
>  #ifdef CONFIG_IP_MULTICAST
>  	{
>  		.procname	= "igmp_qrv",

This sysctl is not needed if CONFIG_IP_MULTICAST is not defined.

^ permalink raw reply

* Re: [PATCH net] net: sched: Remove TCA_OPTIONS from policy
From: Cong Wang @ 2018-10-26 18:19 UTC (permalink / raw)
  To: Marco Berizzi; +Cc: dsahern, David Miller, Linux Kernel Network Developers
In-Reply-To: <1305358874.1795395.1540553653206@mail.libero.it>

On Fri, Oct 26, 2018 at 4:35 AM Marco Berizzi <pupilla@libero.it> wrote:
> Apologies for bothering you again.
> I applied your patch to 4.19, but after issuing this
> command:
>
> root@Calimero:~# tc qdisc add dev eth0 root handle 1:0 hfsc default 1
> root@Calimero:~# ping 10.81.104.1
> PING 10.81.104.1 (10.81.104.1) 56(84) bytes of data.
> ^C
> --- 10.81.104.1 ping statistics ---
> 2 packets transmitted, 0 received, 100% packet loss, time 1001ms
>
> I'm losing ipv4 connectivity.
> If I remove the qdisc everything is going to work again:

Did this really work before?

You specify a default class without adding it, so the packets are dropped.

How would you expect this to work? :)

^ permalink raw reply

* [RFT net-next] net: stmmac: Fix RX packet size > 8191
From: thor.thayer @ 2018-10-26 18:15 UTC (permalink / raw)
  To: peppe.cavallaro, alexandre.torgue, joabreu; +Cc: davem, netdev, Thor Thayer

From: Thor Thayer <thor.thayer@linux.intel.com>

Ping problems with packets > 8191 as shown:

PING 192.168.1.99 (192.168.1.99) 8150(8178) bytes of data.
8158 bytes from 192.168.1.99: icmp_seq=1 ttl=64 time=0.669 ms
wrong data byte 8144 should be 0xd0 but was 0x0
16    10 11 12 13 14 15 16 17 18 19 1a 1b 1c 1d 1e 1f
      20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f
%< ---------------snip--------------------------------------
8112  b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc bd be bf
      c0 c1 c2 c3 c4 c5 c6 c7 c8 c9 ca cb cc cd ce cf
8144  0 0 0 0 d0 d1
      ^^^^^^^
Notice the 4 bytes of 0 before the expected byte of d0.

Databook notes that the RX buffer must be a multiple of 4/8/16
bytes [1].

Add a new define for RX DMA Buffer size since the TX descriptors
don't have this limitation. Use this new define in all the RX
buffer setup and refill functions.
Also fixup the normal descriptor RX buffer size since that has
the same limitation.

[1] Synopsys DesignWare Cores Ethernet MAC Universal v3.70a
    [section 8.4.2 - Table 8-24]

[RFT] Request testing on a platform that has normal descriptors.

Tested on SoCFPGA Stratix10 with ping sweep from 100 to 8300 byte packets.

Fixes: 286a83721720 ("stmmac: add CHAINED descriptor mode support (V4)")
Suggested-by: Jose Abreu <jose.abreu@synopsys.com>
Signed-off-by: Thor Thayer <thor.thayer@linux.intel.com>
---
 drivers/net/ethernet/stmicro/stmmac/common.h      | 2 ++
 drivers/net/ethernet/stmicro/stmmac/descs_com.h   | 4 ++--
 drivers/net/ethernet/stmicro/stmmac/norm_desc.c   | 2 +-
 drivers/net/ethernet/stmicro/stmmac/ring_mode.c   | 8 ++++----
 drivers/net/ethernet/stmicro/stmmac/stmmac_main.c | 2 +-
 5 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/stmicro/stmmac/common.h b/drivers/net/ethernet/stmicro/stmmac/common.h
index b1b305f8f414..ffc6b344a81c 100644
--- a/drivers/net/ethernet/stmicro/stmmac/common.h
+++ b/drivers/net/ethernet/stmicro/stmmac/common.h
@@ -366,6 +366,8 @@ struct dma_features {
 /* GMAC TX FIFO is 8K, Rx FIFO is 16K */
 #define BUF_SIZE_16KiB 16384
 #define BUF_SIZE_8KiB 8192
+/* RX Buffer size must be < 8191 and multiple of 4/8/16 bytes */
+#define RX_BUF_SIZE_8KiB 8188
 #define BUF_SIZE_4KiB 4096
 #define BUF_SIZE_2KiB 2048
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/descs_com.h b/drivers/net/ethernet/stmicro/stmmac/descs_com.h
index ca9d7e48034c..4043ef6e8698 100644
--- a/drivers/net/ethernet/stmicro/stmmac/descs_com.h
+++ b/drivers/net/ethernet/stmicro/stmmac/descs_com.h
@@ -31,7 +31,7 @@
 /* Enhanced descriptors */
 static inline void ehn_desc_rx_set_on_ring(struct dma_desc *p, int end)
 {
-	p->des1 |= cpu_to_le32(((BUF_SIZE_8KiB - 1)
+	p->des1 |= cpu_to_le32((RX_BUF_SIZE_8KiB
 			<< ERDES1_BUFFER2_SIZE_SHIFT)
 		   & ERDES1_BUFFER2_SIZE_MASK);
 
@@ -61,7 +61,7 @@ static inline void enh_set_tx_desc_len_on_ring(struct dma_desc *p, int len)
 /* Normal descriptors */
 static inline void ndesc_rx_set_on_ring(struct dma_desc *p, int end)
 {
-	p->des1 |= cpu_to_le32(((BUF_SIZE_2KiB - 1)
+	p->des1 |= cpu_to_le32((BUF_SIZE_2KiB
 				<< RDES1_BUFFER2_SIZE_SHIFT)
 		    & RDES1_BUFFER2_SIZE_MASK);
 
diff --git a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
index de65bb29feba..74a563682945 100644
--- a/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
+++ b/drivers/net/ethernet/stmicro/stmmac/norm_desc.c
@@ -138,7 +138,7 @@ static void ndesc_init_rx_desc(struct dma_desc *p, int disable_rx_ic, int mode,
 			       int end)
 {
 	p->des0 |= cpu_to_le32(RDES0_OWN);
-	p->des1 |= cpu_to_le32((BUF_SIZE_2KiB - 1) & RDES1_BUFFER1_SIZE_MASK);
+	p->des1 |= cpu_to_le32(BUF_SIZE_2KiB & RDES1_BUFFER1_SIZE_MASK);
 
 	if (mode == STMMAC_CHAIN_MODE)
 		ndesc_rx_set_on_chain(p, end);
diff --git a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
index abc3f85270cd..09974a626b49 100644
--- a/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
+++ b/drivers/net/ethernet/stmicro/stmmac/ring_mode.c
@@ -114,14 +114,14 @@ static void refill_desc3(void *priv_ptr, struct dma_desc *p)
 	struct stmmac_priv *priv = (struct stmmac_priv *)priv_ptr;
 
 	/* Fill DES3 in case of RING mode */
-	if (priv->dma_buf_sz >= BUF_SIZE_8KiB)
-		p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + BUF_SIZE_8KiB);
+	if (priv->dma_buf_sz >= RX_BUF_SIZE_8KiB)
+		p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + RX_BUF_SIZE_8KiB);
 }
 
 /* In ring mode we need to fill the desc3 because it is used as buffer */
 static void init_desc3(struct dma_desc *p)
 {
-	p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + BUF_SIZE_8KiB);
+	p->des3 = cpu_to_le32(le32_to_cpu(p->des2) + RX_BUF_SIZE_8KiB);
 }
 
 static void clean_desc3(void *priv_ptr, struct dma_desc *p)
@@ -140,7 +140,7 @@ static void clean_desc3(void *priv_ptr, struct dma_desc *p)
 static int set_16kib_bfsize(int mtu)
 {
 	int ret = 0;
-	if (unlikely(mtu >= BUF_SIZE_8KiB))
+	if (unlikely(mtu > RX_BUF_SIZE_8KiB))
 		ret = BUF_SIZE_16KiB;
 	return ret;
 }
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 076a8be18d67..5e314050eb38 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -1086,7 +1086,7 @@ static int stmmac_set_bfsize(int mtu, int bufsize)
 	int ret = bufsize;
 
 	if (mtu >= BUF_SIZE_4KiB)
-		ret = BUF_SIZE_8KiB;
+		ret = RX_BUF_SIZE_8KiB;
 	else if (mtu >= BUF_SIZE_2KiB)
 		ret = BUF_SIZE_4KiB;
 	else if (mtu > DEFAULT_BUFSIZE)
-- 
2.7.4

^ permalink raw reply related

* RE: [PATCH v2 net] igb: shorten maximum PHC timecounter update interval
From: Keller, Jacob E @ 2018-10-26 17:20 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev@vger.kernel.org
  Cc: intel-wired-lan@lists.osuosl.org, Richard Cochran,
	Thomas Gleixner
In-Reply-To: <20181026171300.12720-1-mlichvar@redhat.com>

> -----Original Message-----
> From: Miroslav Lichvar [mailto:mlichvar@redhat.com]
> Sent: Friday, October 26, 2018 10:13 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-lan@lists.osuosl.org; Miroslav Lichvar <mlichvar@redhat.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Richard Cochran
> <richardcochran@gmail.com>; Thomas Gleixner <tglx@linutronix.de>
> Subject: [PATCH v2 net] igb: shorten maximum PHC timecounter update interval
> 
> The timecounter needs to be updated at least once per ~550 seconds in
> order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
> timestamp.
> 
> Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"),
> scheduling of delayed work seems to be less accurate and a requested
> delay of 540 seconds may actually be longer than 550 seconds. Also, the
> PHC may be adjusted to run up to 6% faster than real time and the system
> clock up to 10% slower. Shorten the delay to 360 seconds to be sure the
> timecounter is updated in time.
> 
> This fixes an issue with HW timestamps on 82580/I350/I354 being off by
> ~1100 seconds for few seconds every ~9 minutes.
> 

Acked-by: Jacob Keller <jacob.e.keller@intel.com>

> Cc: Jacob Keller <jacob.e.keller@intel.com>
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
> ---
>  drivers/net/ethernet/intel/igb/igb_ptp.c | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c
> b/drivers/net/ethernet/intel/igb/igb_ptp.c
> index 9f4d700e09df..2b95dc9c7a6a 100644
> --- a/drivers/net/ethernet/intel/igb/igb_ptp.c
> +++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
> @@ -51,9 +51,17 @@
>   *
>   * The 40 bit 82580 SYSTIM overflows every
>   *   2^40 * 10^-9 /  60  = 18.3 minutes.
> + *
> + * SYSTIM is converted to real time using a timecounter. As
> + * timecounter_cyc2time() allows old timestamps, the timecounter needs
> + * to be updated at least once per half of the SYSTIM interval.
> + * Scheduling of delayed work is not very accurate, and also the NIC
> + * clock can be adjusted to run up to 6% faster and the system clock
> + * up to 10% slower, so we aim for 6 minutes to be sure the actual
> + * interval in the NIC time is shorter than 9.16 minutes.
>   */
> 
> -#define IGB_SYSTIM_OVERFLOW_PERIOD	(HZ * 60 * 9)
> +#define IGB_SYSTIM_OVERFLOW_PERIOD	(HZ * 60 * 6)
>  #define IGB_PTP_TX_TIMEOUT		(HZ * 15)
>  #define INCPERIOD_82576			BIT(E1000_TIMINCA_16NS_SHIFT)
>  #define INCVALUE_82576_MASK
> 	GENMASK(E1000_TIMINCA_16NS_SHIFT - 1, 0)
> --
> 2.17.2

^ permalink raw reply

* Re: [PATCH v2 00/17] octeontx2-af: NPC parser and NIX blocks initialization
From: David Miller @ 2018-10-26 17:19 UTC (permalink / raw)
  To: arnd; +Cc: sunil.kovvuri, netdev, linux-soc, sgoutham
In-Reply-To: <CAK8P3a3qL15dmeRYX7W2EC5bGO1qFCXEDRj+xTcGt8Fakwc3vw@mail.gmail.com>

From: Arnd Bergmann <arnd@arndb.de>
Date: Fri, 26 Oct 2018 16:04:44 +0200

> I fear that setting a precedent of using the mbox for user-level
> configuration management would mean that we would have to
> treat each of these interfaces as an ABI, which in turn requires
> much deeper review as well as raising the fundamental question
> on how this should be done across drivers. The mailbox interface
> seem inherently nonportable to other hardware here, which is
> a significant downside.

+1

^ permalink raw reply

* Re: [PATCH v2 00/17] octeontx2-af: NPC parser and NIX blocks initialization
From: Andrew Lunn @ 2018-10-26 17:16 UTC (permalink / raw)
  To: Sunil Kovvuri
  Cc: Arnd Bergmann, David S. Miller, Linux Netdev List, linux-soc,
	Sunil Goutham
In-Reply-To: <CA+sq2CcuUYNoFHaAXGYMSTUqZy9LQT+Xk+g+zVCRTSDUcw27sQ@mail.gmail.com>

> No there is no need for any userspace tools.

Cool.

The problem with the Freescale architecture is that you seem to need
to provision the hardware. Tell it how many instances of various
things to create, and how to logically connect them together. They
have a user space tool to do this, in an opaque way.

But it sounds like you don't need anything like this.

    Andrew

^ permalink raw reply

* [PATCH v2 net] igb: shorten maximum PHC timecounter update interval
From: Miroslav Lichvar @ 2018-10-26 17:13 UTC (permalink / raw)
  To: netdev
  Cc: intel-wired-lan, Miroslav Lichvar, Jacob Keller, Richard Cochran,
	Thomas Gleixner

The timecounter needs to be updated at least once per ~550 seconds in
order to avoid a 40-bit SYSTIM timestamp to be misinterpreted as an old
timestamp.

Since commit 500462a9de65 ("timers: Switch to a non-cascading wheel"),
scheduling of delayed work seems to be less accurate and a requested
delay of 540 seconds may actually be longer than 550 seconds. Also, the
PHC may be adjusted to run up to 6% faster than real time and the system
clock up to 10% slower. Shorten the delay to 360 seconds to be sure the
timecounter is updated in time.

This fixes an issue with HW timestamps on 82580/I350/I354 being off by
~1100 seconds for few seconds every ~9 minutes.

Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Richard Cochran <richardcochran@gmail.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
---
 drivers/net/ethernet/intel/igb/igb_ptp.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/igb/igb_ptp.c b/drivers/net/ethernet/intel/igb/igb_ptp.c
index 9f4d700e09df..2b95dc9c7a6a 100644
--- a/drivers/net/ethernet/intel/igb/igb_ptp.c
+++ b/drivers/net/ethernet/intel/igb/igb_ptp.c
@@ -51,9 +51,17 @@
  *
  * The 40 bit 82580 SYSTIM overflows every
  *   2^40 * 10^-9 /  60  = 18.3 minutes.
+ *
+ * SYSTIM is converted to real time using a timecounter. As
+ * timecounter_cyc2time() allows old timestamps, the timecounter needs
+ * to be updated at least once per half of the SYSTIM interval.
+ * Scheduling of delayed work is not very accurate, and also the NIC
+ * clock can be adjusted to run up to 6% faster and the system clock
+ * up to 10% slower, so we aim for 6 minutes to be sure the actual
+ * interval in the NIC time is shorter than 9.16 minutes.
  */
 
-#define IGB_SYSTIM_OVERFLOW_PERIOD	(HZ * 60 * 9)
+#define IGB_SYSTIM_OVERFLOW_PERIOD	(HZ * 60 * 6)
 #define IGB_PTP_TX_TIMEOUT		(HZ * 15)
 #define INCPERIOD_82576			BIT(E1000_TIMINCA_16NS_SHIFT)
 #define INCVALUE_82576_MASK		GENMASK(E1000_TIMINCA_16NS_SHIFT - 1, 0)
-- 
2.17.2

^ permalink raw reply related

* Re: [PATCH net] net/neigh: fix NULL deref in pneigh_dump_table()
From: David Ahern @ 2018-10-26 17:07 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller; +Cc: netdev, Eric Dumazet
In-Reply-To: <20181026163327.219797-1-edumazet@google.com>

On 10/26/18 10:33 AM, Eric Dumazet wrote:
> pneigh can have NULL device pointer, so we need to make

hmmm.. missed that.

> neigh_master_filtered() and neigh_ifindex_filtered() more robust.
> 
...

> Fixes: 6f52f80e8530 ("net/neigh: Extend dump filter to proxy neighbor dumps")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: David Ahern <dsahern@gmail.com>
> Reported-by: syzbot <syzkaller@googlegroups.com>
> ---
>  net/core/neighbour.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 

Thanks for the fix

Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: David Ahern <dsahern@gmail.com>

^ permalink raw reply

* RE: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
From: Keller, Jacob E @ 2018-10-26 16:54 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev@vger.kernel.org
  Cc: intel-wired-lan@lists.osuosl.org, Richard Cochran
In-Reply-To: <20181026162742.631-5-mlichvar@redhat.com>



> -----Original Message-----
> From: Miroslav Lichvar [mailto:mlichvar@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-lan@lists.osuosl.org; Richard Cochran <richardcochran@gmail.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Miroslav Lichvar <mlichvar@redhat.com>
> Subject: [RFC PATCH 4/4] ixgbe: add support for extended PHC gettime
> 
> Cc: Richard Cochran <richardcochran@gmail.com>
> Cc: Jacob Keller <jacob.e.keller@intel.com>
> Signed-off-by: Miroslav Lichvar <mlichvar@redhat.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 57 ++++++++++++++++++++
>  1 file changed, 57 insertions(+)
> 
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> index b3e0d8bb5cbd..d31e8d3effc7 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c
> @@ -466,6 +466,60 @@ static int ixgbe_ptp_gettime(struct ptp_clock_info *ptp,
> struct timespec64 *ts)
>  	return 0;
>  }
> 
> +/**
> + * ixgbe_ptp_gettimex
> + * @ptp: the ptp clock structure
> + * @sts: structure to hold the system time before reading the PHC,
> + * the PHC timestamp, and system time after reading the PHC
> + *
> + * read the timecounter and return the correct value on ns,
> + * after converting it into a struct timespec.
> + */
> +static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp,
> +			      struct ptp_system_timestamp *sts)
> +{
> +	struct ixgbe_adapter *adapter =
> +		container_of(ptp, struct ixgbe_adapter, ptp_caps);
> +	struct ixgbe_hw *hw = &adapter->hw;
> +	unsigned long flags;
> +	struct timespec64 ts;
> +	u64 ns, stamp;
> +
> +	spin_lock_irqsave(&adapter->tmreg_lock, flags);
> +
> +	switch (adapter->hw.mac.type) {
> +	case ixgbe_mac_X550:
> +	case ixgbe_mac_X550EM_x:
> +	case ixgbe_mac_x550em_a:
> +		/* Upper 32 bits represent billions of cycles, lower 32 bits
> +		 * represent cycles. However, we use timespec64_to_ns for the
> +		 * correct math even though the units haven't been corrected
> +		 * yet.
> +		 */
> +		ptp_read_system_prets(sts);
> +		IXGBE_READ_REG(hw, IXGBE_SYSTIMR);
> +		ptp_read_system_postts(sts);
> +		ts.tv_nsec = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
> +		ts.tv_sec = IXGBE_READ_REG(hw, IXGBE_SYSTIMH);
> +		stamp = timespec64_to_ns(&ts);
> +		break;
> +	default:
> +		ptp_read_system_prets(sts);
> +		stamp = IXGBE_READ_REG(hw, IXGBE_SYSTIML);
> +		ptp_read_system_postts(sts);
> +		stamp |= (u64)IXGBE_READ_REG(hw, IXGBE_SYSTIMH) << 32;
> +		break;
> +	}
> +
> +	ns = timecounter_cyc2time(&adapter->hw_tc, stamp);
> +
> +	spin_unlock_irqrestore(&adapter->tmreg_lock, flags);
> +
> +	sts->phc_ts = ns_to_timespec64(ns);
> +
> +	return 0;
> +}
> +


What about replacing gettime64 with:

static int ixgbe_ptp_gettimex(struct ptp_clock_info *ptp, struct timespec64 *ts)
{
    struct ptp_system_timestamp sts
    
    ixgbe_ptp_gettimex(ptp, &tst);
    *ts = sts.phc_ts
}

Actually, could that even just be provided by the PTP core if gettime64 isn't implemented? This way new drivers only have to implement the new interface, and userspace will just get the old behavior if they use the old call?

Thanks,
Jake

>  /**
>   * ixgbe_ptp_settime
>   * @ptp: the ptp clock structure
> @@ -1217,6 +1271,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>  		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_82599;
>  		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
>  		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
> +		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
>  		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
>  		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
>  		adapter->ptp_setup_sdp = ixgbe_ptp_setup_sdp_x540;
> @@ -1234,6 +1289,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>  		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_82599;
>  		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
>  		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
> +		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
>  		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
>  		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
>  		break;
> @@ -1250,6 +1306,7 @@ static long ixgbe_ptp_create_clock(struct ixgbe_adapter
> *adapter)
>  		adapter->ptp_caps.adjfreq = ixgbe_ptp_adjfreq_X550;
>  		adapter->ptp_caps.adjtime = ixgbe_ptp_adjtime;
>  		adapter->ptp_caps.gettime64 = ixgbe_ptp_gettime;
> +		adapter->ptp_caps.gettimex64 = ixgbe_ptp_gettimex;
>  		adapter->ptp_caps.settime64 = ixgbe_ptp_settime;
>  		adapter->ptp_caps.enable = ixgbe_ptp_feature_enable;
>  		adapter->ptp_setup_sdp = NULL;
> --
> 2.17.2

^ permalink raw reply

* RE: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
From: Keller, Jacob E @ 2018-10-26 16:51 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev@vger.kernel.org
  Cc: intel-wired-lan@lists.osuosl.org, Richard Cochran
In-Reply-To: <20181026162742.631-1-mlichvar@redhat.com>

> -----Original Message-----
> From: Miroslav Lichvar [mailto:mlichvar@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-lan@lists.osuosl.org; Richard Cochran <richardcochran@gmail.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Miroslav Lichvar <mlichvar@redhat.com>
> Subject: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
> 

I read the whole series, and it looks good to me.

Acked-by: Jacob Keller <jacob.e.keller@intel.com>

> There is some duplication of code in the igb and ixgbe drivers, which I
> don't like very much, but I thought it's better than extending and
> wrapping the existing functions like in the e1000e driver. Also, mixing
> SYSTIM and "system time" in the code will probably be confusing.
> 
> I wasn't able to find a better name for the ioctl, the structures, and
> the driver function. If anyone has suggestions, please let me know.
> 

Hmm.. Yea, I don't really have better names either.

> Miroslav Lichvar (4):
>   ptp: add PTP_SYS_OFFSET_EXTENDED ioctl
>   e1000e: add support for extended PHC gettime
>   igb: add support for extended PHC gettime
>   ixgbe: add support for extended PHC gettime
> 
>  drivers/net/ethernet/intel/e1000e/e1000.h    |  3 ++
>  drivers/net/ethernet/intel/e1000e/netdev.c   | 48 +++++++++++++----
>  drivers/net/ethernet/intel/e1000e/ptp.c      | 21 ++++++++
>  drivers/net/ethernet/intel/igb/igb_ptp.c     | 43 +++++++++++++++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ptp.c | 57 ++++++++++++++++++++
>  drivers/ptp/ptp_chardev.c                    | 39 ++++++++++++++
>  include/linux/ptp_clock_kernel.h             | 26 +++++++++
>  include/uapi/linux/ptp_clock.h               | 12 +++++
>  8 files changed, 239 insertions(+), 10 deletions(-)
> 
> --
> 2.17.2

^ permalink raw reply

* RE: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
From: Keller, Jacob E @ 2018-10-26 16:40 UTC (permalink / raw)
  To: Miroslav Lichvar, netdev@vger.kernel.org
  Cc: intel-wired-lan@lists.osuosl.org, Richard Cochran
In-Reply-To: <20181026162742.631-1-mlichvar@redhat.com>

Hi Miroslav,

> -----Original Message-----
> From: Miroslav Lichvar [mailto:mlichvar@redhat.com]
> Sent: Friday, October 26, 2018 9:28 AM
> To: netdev@vger.kernel.org
> Cc: intel-wired-lan@lists.osuosl.org; Richard Cochran <richardcochran@gmail.com>;
> Keller, Jacob E <jacob.e.keller@intel.com>; Miroslav Lichvar <mlichvar@redhat.com>
> Subject: [RFC PATCH 0/4] More accurate PHC<->system clock synchronization
> 
> This series adds support for a more accurate synchronization between a
> PTP hardware clock and the system clock.
> 
> The first patch adds an extended version of the PTP_SYS_OFFSET ioctl,
> which returns three timestamps for each measurement. The idea is to
> shorten the interval between the system timestamps to contain just the
> reading of the lowest register of the PHC in order to reduce the error
> in the measured offset and give a better bound on the maximum error.
> 
> The other patches add support for the new ioctl to the e1000e, igb,
> and ixgbe driver. Tests with few different NICs in different machines
> (and PCIe slots) show that:
> - with an I219 (e1000e) the measured delay improved from 2500 to 1300 ns
>   and the error in the measured offset, when compared to cross
>   timestamping, was reduced by a factor of 5
> - with an I210 (igb) the delay improved from 5100 to 1700 ns
> - with an I350 (igb) the delay improved from 2300 to 750 ns
> - with an X550 (ixgbe) the delay improved from 1950 to 650 ns
> 

That is some very significant improvements! Excellent find.

> There is some duplication of code in the igb and ixgbe drivers, which I
> don't like very much, but I thought it's better than extending and
> wrapping the existing functions like in the e1000e driver. Also, mixing
> SYSTIM and "system time" in the code will probably be confusing.
> 

Yea...

> I wasn't able to find a better name for the ioctl, the structures, and
> the driver function. If anyone has suggestions, please let me know.
> 

I don't have any good suggestions yet. I'll reply after reviewing if I think of any.

Thanks,
Jake

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox