Netdev List

Netdev List
 help / color / mirror / Atom feed

* general protection fault in sctp_inq_pop
From: syzbot @ 2019-08-22 14:53 UTC (permalink / raw)
  To: davem, linux-kernel, linux-sctp, marcelo.leitner, netdev, nhorman,
	syzkaller-bugs, vyasevich

Hello,

syzbot found the following crash on:

HEAD commit:    20e79a0a net: hns: add phy_attached_info() to the hns driver
git tree:       net-next
console output: https://syzkaller.appspot.com/x/log.txt?x=10d6dfba600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=ce5e88233f2f83b0
dashboard link: https://syzkaller.appspot.com/bug?extid=4a0643a653ac375612d1
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)

Unfortunately, I don't have any reproducer for this crash yet.

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+4a0643a653ac375612d1@syzkaller.appspotmail.com

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 15509 Comm: syz-executor.1 Not tainted 5.3.0-rc3+ #139
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS  
Google 01/01/2011
RIP: 0010:sctp_inq_pop+0x294/0xd80 net/sctp/inqueue.c:201
Code: 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 e3 08 00 00 49 8d 7d 02 4d 89 6c  
24 60 48 b8 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 <0f> b6 0c 01 48  
89 f8 83 e0 07 83 c0 01 38 c8 7c 08 84 c9 0f 85 4b
RSP: 0018:ffff888097d9ee40 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8880968405d8 RCX: 0000000000000001
RDX: 0000000000005803 RSI: ffffffff86b236aa RDI: 000000000000000a
RBP: ffff888097d9ee90 R08: ffff88808fd44240 R09: fffffbfff14a914f
R10: fffffbfff14a914e R11: ffffffff8a548a77 R12: ffff888096840580
R13: 0000000000000008 R14: 0000000000000000 R15: ffff888097d9f478
FS:  00007f59f27b8700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f40b77fd000 CR3: 0000000063e8e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  sctp_endpoint_bh_rcv+0x184/0x8d0 net/sctp/endpointola.c:385
  sctp_inq_push+0x1e4/0x280 net/sctp/inqueue.c:80
  sctp_rcv+0x2807/0x3590 net/sctp/input.c:256
  sctp6_rcv+0x17/0x30 net/sctp/ipv6.c:1049
  ip6_protocol_deliver_rcu+0x2fe/0x1660 net/ipv6/ip6_input.c:397
  ip6_input_finish+0x84/0x170 net/ipv6/ip6_input.c:438
  NF_HOOK include/linux/netfilter.h:305 [inline]
  NF_HOOK include/linux/netfilter.h:299 [inline]
  ip6_input+0xe4/0x3f0 net/ipv6/ip6_input.c:447
  dst_input include/net/dst.h:442 [inline]
  ip6_sublist_rcv_finish+0x98/0x1e0 net/ipv6/ip6_input.c:84
  ip6_list_rcv_finish net/ipv6/ip6_input.c:118 [inline]
  ip6_sublist_rcv+0x80c/0xcf0 net/ipv6/ip6_input.c:282
  ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:316
  __netif_receive_skb_list_ptype net/core/dev.c:5049 [inline]
  __netif_receive_skb_list_core+0x1a2/0x9d0 net/core/dev.c:5087
  __netif_receive_skb_list net/core/dev.c:5149 [inline]
  netif_receive_skb_list_internal+0x7eb/0xe60 net/core/dev.c:5244
  gro_normal_list.part.0+0x1e/0xb0 net/core/dev.c:5757
  gro_normal_list net/core/dev.c:5755 [inline]
  gro_normal_one net/core/dev.c:5769 [inline]
  napi_frags_finish net/core/dev.c:5782 [inline]
  napi_gro_frags+0xa6a/0xea0 net/core/dev.c:5855
  tun_get_user+0x2e98/0x3fa0 drivers/net/tun.c:1974
  tun_chr_write_iter+0xbd/0x156 drivers/net/tun.c:2020
  call_write_iter include/linux/fs.h:1870 [inline]
  do_iter_readv_writev+0x5f8/0x8f0 fs/read_write.c:693
  do_iter_write fs/read_write.c:970 [inline]
  do_iter_write+0x184/0x610 fs/read_write.c:951
  vfs_writev+0x1b3/0x2f0 fs/read_write.c:1015
  do_writev+0x15b/0x330 fs/read_write.c:1058
  __do_sys_writev fs/read_write.c:1131 [inline]
  __se_sys_writev fs/read_write.c:1128 [inline]
  __x64_sys_writev+0x75/0xb0 fs/read_write.c:1128
  do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4596e1
Code: 75 14 b8 14 00 00 00 0f 05 48 3d 01 f0 ff ff 0f 83 34 b9 fb ff c3 48  
83 ec 08 e8 fa 2c 00 00 48 89 04 24 b8 14 00 00 00 0f 05 <48> 8b 3c 24 48  
89 c2 e8 43 2d 00 00 48 89 d0 48 83 c4 08 48 3d 01
RSP: 002b:00007f59f27b7ba0 EFLAGS: 00000293 ORIG_RAX: 0000000000000014
RAX: ffffffffffffffda RBX: 000000000000010c RCX: 00000000004596e1
RDX: 0000000000000001 RSI: 00007f59f27b7c00 RDI: 00000000000000f0
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007f59f27b86d4
R13: 00000000004c8783 R14: 00000000004df5a0 R15: 00000000ffffffff
Modules linked in:
---[ end trace 4d09ea96a0c7705b ]---
RIP: 0010:sctp_inq_pop+0x294/0xd80 net/sctp/inqueue.c:201
Code: 89 f9 48 c1 e9 03 80 3c 01 00 0f 85 e3 08 00 00 49 8d 7d 02 4d 89 6c  
24 60 48 b8 00 00 00 00 00 fc ff df 48 89 f9 48 c1 e9 03 <0f> b6 0c 01 48  
89 f8 83 e0 07 83 c0 01 38 c8 7c 08 84 c9 0f 85 4b
RSP: 0018:ffff888097d9ee40 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff8880968405d8 RCX: 0000000000000001
RDX: 0000000000005803 RSI: ffffffff86b236aa RDI: 000000000000000a
RBP: ffff888097d9ee90 R08: ffff88808fd44240 R09: fffffbfff14a914f
R10: fffffbfff14a914e R11: ffffffff8a548a77 R12: ffff888096840580
R13: 0000000000000008 R14: 0000000000000000 R15: ffff888097d9f478
FS:  00007f59f27b8700(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f40b77fd000 CR3: 0000000063e8e000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.

^ permalink raw reply

* Re: WARNING in rollback_registered_many (2)
From: Andrey Konovalov @ 2019-08-22 14:54 UTC (permalink / raw)
  To: syzbot, USB list
  Cc: Larry Finger, avagin, David S. Miller, devel, Eric W . Biederman,
	Eric Dumazet, florian.c.schilhabel, Greg Kroah-Hartman,
	Kai Heng Feng, ktkhai, LKML, netdev, straube.linux,
	syzkaller-bugs, tyhicks, Matthew Wilcox, Oliver Neukum,
	Alan Stern
In-Reply-To: <CAAeHK+y-2DZ1sWUE5bESrd=dUAaGrHXzR5+gFJFgiAaWo+D2dw@mail.gmail.com>

On Thu, Aug 22, 2019 at 3:07 PM Andrey Konovalov <andreyknvl@google.com> wrote:
>
> On Wed, Aug 7, 2019 at 4:03 PM Andrey Konovalov <andreyknvl@google.com> wrote:
> >
> > On Fri, Apr 12, 2019 at 1:32 PM Andrey Konovalov <andreyknvl@google.com> wrote:
> > >
> > > On Fri, Apr 12, 2019 at 1:29 AM syzbot
> > > <syzbot+40918e4d826fb2ff9b96@syzkaller.appspotmail.com> wrote:
> > > >
> > > > syzbot has found a reproducer for the following crash on:
> > > >
> > > > HEAD commit:    9a33b369 usb-fuzzer: main usb gadget fuzzer driver
> > > > git tree:       https://github.com/google/kasan/tree/usb-fuzzer
> > > > console output: https://syzkaller.appspot.com/x/log.txt?x=10d552b7200000
> > > > kernel config:  https://syzkaller.appspot.com/x/.config?x=23e37f59d94ddd15
> > > > dashboard link: https://syzkaller.appspot.com/bug?extid=40918e4d826fb2ff9b96
> > > > compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
> > > > syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=17a4c1af200000
> > > > C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=121b274b200000
> > > >
> > > > IMPORTANT: if you fix the bug, please add the following tag to the commit:
> > > > Reported-by: syzbot+40918e4d826fb2ff9b96@syzkaller.appspotmail.com
> > > >
> > > > usb 1-1: r8712u: MAC Address from efuse = 00:e0:4c:87:00:00
> > > > usb 1-1: r8712u: Loading firmware from "rtlwifi/rtl8712u.bin"
> > > > usb 1-1: USB disconnect, device number 2
> > > > usb 1-1: Direct firmware load for rtlwifi/rtl8712u.bin failed with error -2
> > > > usb 1-1: r8712u: Firmware request failed
> > > > WARNING: CPU: 0 PID: 575 at net/core/dev.c:8152
> > > > rollback_registered_many+0x1f3/0xe70 net/core/dev.c:8152
> > > > Kernel panic - not syncing: panic_on_warn set ...
> > > > CPU: 0 PID: 575 Comm: kworker/0:4 Not tainted 5.1.0-rc4-319354-g9a33b36 #3
> > > > Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
> > > > Google 01/01/2011
> > > > Workqueue: usb_hub_wq hub_event
> > > > Call Trace:
> > > >   __dump_stack lib/dump_stack.c:77 [inline]
> > > >   dump_stack+0xe8/0x16e lib/dump_stack.c:113
> > > >   panic+0x29d/0x5f2 kernel/panic.c:214
> > > >   __warn.cold+0x20/0x48 kernel/panic.c:571
> > > >   report_bug+0x262/0x2a0 lib/bug.c:186
> > > >   fixup_bug arch/x86/kernel/traps.c:179 [inline]
> > > >   fixup_bug arch/x86/kernel/traps.c:174 [inline]
> > > >   do_error_trap+0x130/0x1f0 arch/x86/kernel/traps.c:272
> > > >   do_invalid_op+0x37/0x40 arch/x86/kernel/traps.c:291
> > > >   invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:973
> > > > RIP: 0010:rollback_registered_many+0x1f3/0xe70 net/core/dev.c:8152
> > > > Code: 05 00 00 31 ff 44 89 fe e8 5a 15 f3 f4 45 84 ff 0f 85 49 ff ff ff e8
> > > > 1c 14 f3 f4 0f 1f 44 00 00 e8 12 14 f3 f4 e8 0d 14 f3 f4 <0f> 0b 4c 89 e7
> > > > e8 33 72 f2 f6 31 ff 41 89 c4 89 c6 e8 27 15 f3 f4
> > > > RSP: 0018:ffff88809d087698 EFLAGS: 00010293
> > > > RAX: ffff88809d058000 RBX: ffff888096240000 RCX: ffffffff8c7eb146
> > > > RDX: 0000000000000000 RSI: ffffffff8c7eb163 RDI: 0000000000000001
> > > > RBP: ffff88809d0877c8 R08: ffff88809d058000 R09: fffffbfff2708111
> > > > R10: fffffbfff2708110 R11: ffffffff93840887 R12: ffff888096240070
> > > > R13: dffffc0000000000 R14: ffff88809d087758 R15: 0000000000000000
> > > >   rollback_registered+0xf7/0x1c0 net/core/dev.c:8228
> > > >   unregister_netdevice_queue net/core/dev.c:9275 [inline]
> > > >   unregister_netdevice_queue+0x1dc/0x2b0 net/core/dev.c:9268
> > > >   unregister_netdevice include/linux/netdevice.h:2655 [inline]
> > > >   unregister_netdev+0x1d/0x30 net/core/dev.c:9316
> > > >   r871xu_dev_remove+0xe7/0x223 drivers/staging/rtl8712/usb_intf.c:604
> > > >   usb_unbind_interface+0x1c9/0x980 drivers/usb/core/driver.c:423
> > > >   __device_release_driver drivers/base/dd.c:1082 [inline]
> > > >   device_release_driver_internal+0x436/0x4f0 drivers/base/dd.c:1113
> > > >   bus_remove_device+0x302/0x5c0 drivers/base/bus.c:556
> > > >   device_del+0x467/0xb90 drivers/base/core.c:2269
> > > >   usb_disable_device+0x242/0x790 drivers/usb/core/message.c:1235
> > > >   usb_disconnect+0x298/0x870 drivers/usb/core/hub.c:2197
> > > >   hub_port_connect drivers/usb/core/hub.c:4940 [inline]
> > > >   hub_port_connect_change drivers/usb/core/hub.c:5204 [inline]
> > > >   port_event drivers/usb/core/hub.c:5350 [inline]
> > > >   hub_event+0xcd2/0x3b00 drivers/usb/core/hub.c:5432
> > > >   process_one_work+0x90f/0x1580 kernel/workqueue.c:2269
> > > >   process_scheduled_works kernel/workqueue.c:2331 [inline]
> > > >   worker_thread+0x7b0/0xe20 kernel/workqueue.c:2417
> > > >   kthread+0x313/0x420 kernel/kthread.c:253
> > > >   ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
> > > > Kernel Offset: disabled
> > > > Rebooting in 86400 seconds..
> > > >
> > >
> > > +linux-usb mailing list
> >
> > This USB bug is the most frequently triggered one right now with over
> > 27k kernel crashes.
>
> OK, this report is confusing. It was initially reported on the
> upstream instance a long time ago, but since then has stopped
> happening, as it was probably fixed. Then when we launched the USB
> fuzzing instance, it has started producing similarly named reports
> (with a different root cause though), and they were bucketed into this
> bug by syzkaller. I've improved parsing titles of such reports in
> syzkaller, so I'm invalidating this one, and syzbot should send a
> properly attributed USB report soon.
>
> #syz invalid

This has been reported as:

WARNING in r871xu_dev_remove

https://syzkaller.appspot.com/bug?extid=80899a8a8efe8968cde7

^ permalink raw reply

* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Andrew Lunn @ 2019-08-22 14:56 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Vladimir Oltean, Mark Brown, Hubert Feurstein, Miroslav Lichvar,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <20190822141641.GB1437@localhost>

> Thinking back...
> 
> One problem is this.  PTP requires a delay measurement.  You can send
> a delay request from the host, but there will never be a reply.
> 
> Another problem is this.  A Sync message arriving on an external port
> is time stamped there, but then it is encapsulated as a tagged DSA
> management message and delivered out the CPU port.  At this point, it
> is no longer a PTP frame and will not be time stamped at the CPU port
> on egress.

I think so that both the host interface and the CPU port recognize the
frame and time stamp it, it needs to be untagged. Otherwise, as you
said, the hardware does not recognise it. I've never tried sending
untagged frames to the CPU port. I expect they are just dropped.

However, somebody might want to play with the TCAM. The TCAM can
redirect a packet out any port. I've no idea what the pipeline
ordering is, but it might be possible for the TCAM to redirect a frame
back to the host interface, before it gets dropped because it does not
have DSA tags?  But is the TCAM before or after PTP in the pipeline?
Could you then get 4 timestamps for the same frame?  Host egress,
switch ingress, switch egress, host ingress?

But how do you make this generic? Can other switches also loop a frame
back like this and do the same time stamping? How do you actually get
access to these time stamps split over two blocks of hardware?

So in theory, this might be possible, but in practice?

     Andrew

^ permalink raw reply

* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Vladimir Oltean @ 2019-08-22 14:58 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <20190822141641.GB1437@localhost>

On Thu, 22 Aug 2019 at 17:16, Richard Cochran <richardcochran@gmail.com> wrote:
>
> On Wed, Aug 21, 2019 at 11:17:23PM +0300, Vladimir Oltean wrote:
> > Of course PPS with a dedicated hardware receiver that can take input
> > compare timestamps is always preferable. However non-Ethernet
> > synchronization in the field looks to me like "make do with whatever
> > you can". I'm not sure a plain GPIO that raises an interrupt is better
> > than an interrupt-driven serial protocol controller - it's (mostly)
> > the interrupts that throw off the precision of the software timestamp.
> > And use Miroslav's pps-gpio-poll module and you're back from where you
> > started (try to make a sw timestamp as precise as possible).
>
> Right, it might be better, might not.  You can consider hacking a
> local time stamp into the ISR.  Also, if one of your MACs has a input
> event pin, you can feed the switch's PPS output in there.
>
> > wouldn't be my first choice. But DSA could have that built-in, and
> > with the added latency benefit of a MAC-to-MAC connection.
> > Too bad the mv88e6xxx driver can't do loopback timestamping, that's
> > already 50% of the DSA drivers that support PTP at all. An embedded
> > solution for this is less compelling now.
>
> Let me back track on my statement about mv88e6xxx.  At the time, I
> didn't see any practical way to use the CPU port for synchronization,
> but I forget exactly the details.  Maybe it is indeed possible,
> somehow.  If you can find a way that will work on your switch and on
> the Marvell, then I'd like to hear about it.
>
> Thinking back...
>
> One problem is this.  PTP requires a delay measurement.  You can send
> a delay request from the host, but there will never be a reply.
>

I don't think I understand the problem here.
You need to think about this as a sort of degenerate PTP where the
master and the slave are under the same device's management, not the
full stack. I never actually said I want to make ptp4l work over the
CPU port.
So instead of the typical:

Master (device A)                Slave (device B)

    |                            |         Tstamps known
 t1 |------\      Sync           |         to slave
    |       \-----\              |
    |              \-----\       |
    |                     \----->| t2      t2
    |------\    Follow_up        |
    |       \-----\              |
    |              \-----\       |
    |                     \----->| t1      t1, t2
    |                            |
    |          Delay_req  /------| t3      t1, t2, t3
    |              /-----/       |
    |       /-----/              |
 t4 |<-----/                     |
    |                            |
    |------\    Follow_up        |
    |       \-----\              |
    |              \-----\       |
    |                     \----->|         t1, t2, t3, t4
    |                            |
    |                            |
    |                            |
    |                            |
    v           time             v

You'd have something like this:

Master (DSA master port)         Slave (switch CPU port)

    |                            |         Tstamps known
    |                            |         to slave
    |       Local_sync_req       |
 t1 |------\                     |         t1
    |       \-----\              |
    |              \-----\       |
    |                     \----->| t2      t1, t2
    |                            |
    |     Local_sync_resp /------| t3      t1, t2, t3
    |              /-----/       |
    |       /-----/              |
 t4 |<-----/                     |         t1, t2, t3, t4
    |                            |
    |                            |
    v           time             v

As far as I understand PTP, the other messages are just protocol
blah-blah because the slave doesn't know what the master knows, which
is clearly not applicable here.
t1, t2, t3, t4 still keep the same definitions though (master TX
timestamp, slave RX timestamp, slave TX timestamp, master RX
timestamp).
I'm 90% certain the sja1105 can take an RX timestamp for a management
frame (e.g. one for which a TX timestamp was requested) sent in
loopback.

> Another problem is this.  A Sync message arriving on an external port
> is time stamped there, but then it is encapsulated as a tagged DSA
> management message and delivered out the CPU port.  At this point, it
> is no longer a PTP frame and will not be time stamped at the CPU port
> on egress.
>

But you don't mean a TX timestamp at the egress of swp4 here, do you?

 +---------------------------------+
 |  Management CPU                 |
 |                                 |
 |           DSA master            |
 |             +-----+             |
 |             |     |             |
 |             |     |             |
 +-------------+-----+-------------+
           eth0   ^ RX tstamp
                  |
                  | TX tstamp
        (swp4) CPU port
 +-------------+-----+-------------+
 |  Switch     |     |             |
 |             |     |             |
 |             +-----+             |
 |                ^ T              |
 |                |                |
 |    +-----------+                |
 |    |                            |
 |    |                            |
 | +-----+ +-----+ +-----+ +-----+ |
 | |     | |     | |     | |     | |
 | |     | |     | |     | |     | |
 +-+-----+-+-----+-+-----+-+-----+-+
   swp0    swp1    swp2    swp3
      ^
      | RX tstamp

Why would that matter?
Or just the RX timestamp at eth0?
If you mean the latter, then yes, having HWTSTAMP_FILTER_ALL in the
rx_filter of the DSA master is a hard requirement.

> Thanks,
> Richard

^ permalink raw reply

* Re: [PATCH net-next v2 1/3] net: ethernet: mediatek: Add basic PHYLINK support
From: René van Dorst @ 2019-08-22 15:11 UTC (permalink / raw)
  To: Russell King - ARM Linux admin
  Cc: John Crispin, Sean Wang, Nelson Chang, David S . Miller,
	Matthias Brugger, Frank Wunderlich, netdev, linux-mips,
	linux-mediatek, Stefan Roese, linux-arm-kernel
In-Reply-To: <20190822142739.GS13294@shell.armlinux.org.uk>

Hi Russell,

Quoting Russell King - ARM Linux admin <linux@armlinux.org.uk>:

> On Wed, Aug 21, 2019 at 04:43:34PM +0200, René van Dorst wrote:
>> +static void mtk_mac_link_down(struct phylink_config *config,  
>> unsigned int mode,
>> +			      phy_interface_t interface)
>> +{
>> +	struct mtk_mac *mac = container_of(config, struct mtk_mac,
>> +					   phylink_config);
>>
>> -	return 0;
>> +	mtk_w32(mac->hw, MAC_MCR_FORCE_LINK_DOWN, MTK_MAC_MCR(mac->id));
>>  }
>
> You set the MAC_MCR_FORCE_MODE bit here...
>
>> +static void mtk_mac_link_up(struct phylink_config *config,  
>> unsigned int mode,
>> +			    phy_interface_t interface,
>> +			    struct phy_device *phy)
>>  {
>> +	struct mtk_mac *mac = container_of(config, struct mtk_mac,
>> +					   phylink_config);
>> +	u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));
>>
>> +	mcr |= MAC_MCR_TX_EN | MAC_MCR_RX_EN;
>> +	mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
>> +}
>
> Looking at this, a link_down() followed by a link_up() would result in
> this register containing MAC_MCR_FORCE_MODE | MAC_MCR_TX_EN |
> MAC_MCR_RX_EN ?  Is that actually correct?  (MAC_MCR_FORCE_LINK isn't
> set, so it looks to me like it still forces the link down.)

Thanks for reviewing.

Probably not.
I assumed that mac_config() is always called before link_up()

I simply can make it the opposite of link_up()

like this:
static void mtk_mac_link_down(struct phylink_config *config, unsigned  
int mode,
                               phy_interface_t interface)
{
       struct mtk_mac *mac = container_of(config, struct mtk_mac,
                                  phylink_config);
       u32 mcr = mtk_r32(mac->hw, MTK_MAC_MCR(mac->id));

       mcr &= (MAC_MCR_TX_EN | MAC_MCR_RX_EN);
       mtk_w32(mac->hw, mcr, MTK_MAC_MCR(mac->id));
}

>
> Note that link up/down forcing should not be done for in-band AN.
>

This means that mac_config() of the SGMII patch is also incorrect?

mac_config() always sets the MAC in a force mode.
But the SGMII block is set in AN.

Mainline code seems to do the same.
Puts the SGMII block in AN or forced mode and always set the MAC in  
forced mode.

>> +static void mtk_validate(struct phylink_config *config,
>> +			 unsigned long *supported,
>> +			 struct phylink_link_state *state)
>> +{
>> +	struct mtk_mac *mac = container_of(config, struct mtk_mac,
>> +					   phylink_config);
>> +	__ETHTOOL_DECLARE_LINK_MODE_MASK(mask) = { 0, };
>>
>> +	if (state->interface != PHY_INTERFACE_MODE_NA &&
>> +	    state->interface != PHY_INTERFACE_MODE_MII &&
>> +	    state->interface != PHY_INTERFACE_MODE_GMII &&
>> +	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_RGMII) &&
>> +	      phy_interface_mode_is_rgmii(state->interface)) &&
>> +	    !(MTK_HAS_CAPS(mac->hw->soc->caps, MTK_TRGMII) &&
>> +	      !mac->id && state->interface == PHY_INTERFACE_MODE_TRGMII)) {
>> +		linkmode_zero(supported);
>> +		return;
>>  	}
>>
>> +	phylink_set_port_modes(mask);
>> +	phylink_set(mask, Autoneg);
>>
>> +	if (state->interface == PHY_INTERFACE_MODE_TRGMII) {
>> +		phylink_set(mask, 1000baseT_Full);
>> +	} else {
>> +		phylink_set(mask, 10baseT_Half);
>> +		phylink_set(mask, 10baseT_Full);
>> +		phylink_set(mask, 100baseT_Half);
>> +		phylink_set(mask, 100baseT_Full);
>> +
>> +		if (state->interface != PHY_INTERFACE_MODE_MII) {
>> +			phylink_set(mask, 1000baseT_Half);
>> +			phylink_set(mask, 1000baseT_Full);
>> +			phylink_set(mask, 1000baseX_Full);
>> +		}
>> +	}
>>
>> +	phylink_set(mask, Pause);
>> +	phylink_set(mask, Asym_Pause);
>>
>> +	linkmode_and(supported, supported, mask);
>> +	linkmode_and(state->advertising, state->advertising, mask);
>>  }
>
> This looks fine.

OK.

> Thanks.
>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line in suburbia: sync at 12.1Mbps down 622kbps up
> According to speedtest.net: 11.9Mbps down 500kbps up

Greats,

René



^ permalink raw reply

* Re: [PATCH v2 bpf-next 1/4] bpf: unprivileged BPF access via /dev/bpf
From: Andy Lutomirski @ 2019-08-22 15:16 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Andy Lutomirski, Alexei Starovoitov, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng
In-Reply-To: <98fee747-795a-ff10-fa98-10ddb5afcc03@iogearbox.net>

On Thu, Aug 22, 2019 at 7:17 AM Daniel Borkmann <daniel@iogearbox.net> wrote:
>
> On 8/7/19 7:24 AM, Andy Lutomirski wrote:
> > On Mon, Aug 5, 2019 at 6:11 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> On Mon, Aug 05, 2019 at 02:25:35PM -0700, Andy Lutomirski wrote:
> >>> It tries to make the kernel respect the access modes for fds.  Without
> >>> this patch, there seem to be some holes: nothing looked at program fds
> >>> and, unless I missed something, you could take a readonly fd for a
> >>> program, pin the program, and reopen it RW.
> >>
> >> I think it's by design. iirc Daniel had a use case for something like this.
> >
> > That seems odd.  Daniel, can you elaborate?
>
> [ ... catching up late. ]
>
> Not from my side, the change was added by Chenbo back then for Android
> use-case to replace xt_qtaguid and xt_owner with BPF programs and to
> allow unprivileged applications to read maps. More on their architecture:
>
>    https://source.android.com/devices/tech/datausage/ebpf-traffic-monitor
>
>  From the cover-letter:
>
>    [...]
>    The network-control daemon (netd) creates and loads an eBPF object for
>    network packet filtering and analysis. It passes the object FD to an
>    unprivileged network monitor app (netmonitor), which is not allowed to
>    create, modify or load eBPF objects, but is allowed to read the traffic
>    stats from the map.
>    [...]

I suspect that this use case is, in fact, mostly broken in current
kernels.  An unprivileged process with a read-only fd to a bpf map can
BPF_OBJ_PIN the map and then re-open it read-write.  As far as I can
tell, the only thing mitigating this is that it won't work unless the
attacker has write access to some directory in bpffs.

> > Trusted by whom?  In a non-nested container, the container manager
> > *might* be trusted by the outside world.  In a *nested* container,
> > unless the inner container management is controlled from outside the
> > outer container, it's not trusted.  I don't know much about how
> > Facebook's containers work, but the LXC/LXD/Podman world is moving
> > very strongly toward user namespaces and maximally-untrusted
> > containers, and I think bpf() should work in that context.
>
> [...] and if we opt-in with CAP_NET_ADMIN, for example, then it should
> ideally be possible for that container to install BPF programs for
> mangling, dropping, forwarding etc as long as it's only affecting it's
> /own/ netns like the rest of networking subsystem controls that work
> in that case. I would actually like to get to this at some point and
> make it more approachable as long as there is a way for an admin to
> /opt into it/ via policy (aka not by default).

For better or for worse, I think this would need a massive
re-architecting of the way bpf filtering works.  bpf filters attach to
cgroups, which aren't scoped to network namespaces at all.  So we need
a different permission model.

> Thinking out loud, I'd
> love some sort of a hybrid, that is, a mixture of CAP_BPF_ADMIN and
> customizable seccomp policy. Meaning, there could be several CAP_BPF
> type sub-policies e.g. from allowing everything (equivalent to the
> /dev/bpf on/off handle or CAP_SYS_ADMIN we have today) down to
> programmable user defined policy that can be tailored to specific
> needs like granting apps to BPF_OBJ_GET and BPF_MAP_LOOKUP elements
> or granting to load+mangle a specific subset of maps (e.g. BPF_MAP_TYPE_{ARRAY,
> HASH,LRU_HASH,LPM_TRIE}) and prog types (...) plus attaching them to
> their own netns, and if that is untrusted, then same restrictions/
> mitigations could be done by the verifier as with (current) unprivileged
> BPF, enabled via programmable policy as well. We wouldn't make any
> static/fixed assumptions, but allow users to define them based on their
> own use-cases. Haven't looked how feasible this would be, but something
> to take into consideration when we rework the current [admittedly
> suboptimal all-or-nothing] model we have. Is this something you had in
> mind as well for your wip proposal, Andy?
>

Hmm.  Fine-grained seccomp stuff like this is very much in scope for
the seccomp discussion that's happening at LPC this year.
Unfortunately, I'm not there, but I'm participating via the mailing
list.

I also finally finished typing a very rough draft of my bpf ideas.
I'll email them out momentarily in a separate email.  I think it
should come fairly close to doing what you want.

^ permalink raw reply

* RFC: very rough draft of a bpf permission model
From: Andy Lutomirski @ 2019-08-22 15:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Daniel Borkmann, Alexei Starovoitov, Song Liu, Kees Cook,
	Networking, bpf, Alexei Starovoitov, Kernel Team, Lorenz Bauer,
	Jann Horn, Greg KH, Linux API, LSM List, Chenbo Feng
In-Reply-To: <CALCETrUWQbPK3Pc6P5i_UqHPXJmZVyvuYXfq+VRtD6A3emaRhw@mail.gmail.com>

BPF security strawman, v0.1

This is very rough.  Most of this, especially the API details, needs
work before it's ready to implement.  The whole concept also needs
review.

= Goals =

The overall goal is to make it possible to use eBPF without having
what is effectively administrator access.  For example, an eBPF user
should not be able to directly tamper with other processes (unless
this permission is explicitly granted) and should not be able to
read or write other users' eBPF maps.

It should be possible to use eBPF inside a user namespace without breaking
the userns security model.

Due to the risk of speculation attacks and such being carried out via
eBPF, it should not become possible to use too much of eBPF without the
administrator's permission.  (NB: it is already possible to use
*classic* BPF without any permission, and classic BPF is translated
internally to eBPF, so this goal can only be met to a limited extent.)

= Definitions =

Global capability: A capability bit in the caller's effective mask, so
long as the caller is in the root user namespace.  Tasks in non-root
user namespaces never have global capabilibies.  This is what capable()
checks.

Namespace capability: A capability over a specific user namespace.
Tasks in a user namespace have all the capabilities in their effective
mask over their user namespace.  A namespace capability generally
indicates that the capability applies to the user namespace itself and
to all non-user namespaces that live in the user namespace.  For
example, CAP_NET_ADMIN means that you can configure all networks
namespaces in the current user namespace.  This is what ns_capable()
checks.

Anything that requires a global capability will not work in a non-root
user namespace.

= unprivileged_bpf_disabled =

Nothing in here supercedes unprivileged_bpf_disabled.  If
unprivileged_bpf_disabled = 1, then these proposals should not allow anything
that is disallowed today.  The idea is to make unprivileged_bpf_disabled=0
both safer and more useful.

= Test runs =

Global CAP_SYS_ADMIN is needed to test-run a program.  Test-running a program
exposes its own attack surface.  It's also the only way to run a program at
all if you merely have permission to load the program but not to attach it
anywhere.  Some of the proposed changes below will make it possible to load
most program types without

= Access to programs and maps =

There are two basic security concerns when accessing programs and maps:
the attack surface against the kernel and the ability to access other
people's maps.

Unprivileged processes may read a map if they have an FMODE_READ descriptor
for the map.  Unprivileged processes may write a map if they have an
FMODE_WRITE descriptor to the map.  Unprivileged processes may open a
persistent map with a mode consistent with the permissions in bpffs.

Unprivileged processes may create a bpffs inode for an existing map
if the have an RW file descriptor for the map.  (This is a change to
current behavior.  Daniel, Alexei thought the current behavior was
intentional.  Do you recall whether this is the case?)

The _BY_ID map APIs inherently have no concept of ownership of maps.  These
APIs will continue to require global CAP_SYS_ADMIN.

The small number of things that currently require the _BY_ID APIs, e.g.,
reading maps of maps, can be addressed if needed with new APIs that
return fds instead of ids.  Otherwise using them will continue to require
global capabilities.

Unprivileged processes may create exactly the set of maps that they can
create today.  Future proposals may extend this by a variety of means;
this current proposal makes no changes.

= Program loading =

Loading a program carries the following risks:

 - It exposes the attack surface in the program verifier itself.  That is
   possible, although unlikely, that merely verifying a malicious program
   could crash or otherwise cause a kernel malfunction.

 - It exposes the attack surface of insufficient checks in the verifier.
   That is, a verifier bug could allow a malicious program that is dangerous
   when run.

 - It exposes all of the functions that the program type can call.
   Some functions, e.g. bpf_probe_read(), should require privilege to call.

 - It exposes resource attacks.  Currently, privileged users can load programs
   that use more resources than unprivileged users can load.

 - It exposes pointer-to-integer conversions.  This requires global
   capabilities.

 - The program could contain speculation attack gadgets.

 - Loading a program is a prerequisite to attaching the program.

I propose the following:

Flag functions that require privilege as such.  Loading a program that calls
such a function will require a global capability.  The privileged functions are
mainly used for tracing, I think, and kernel tracing should require global
capabilities.

Loading a program that uses privileged verifier features (function calls or
pointer-to-integer-conversions) will continue to require privileges.

Loading a function that uses excessive resources can continue to require
global capabilities or it could use a new set of cgroup settings that
adjust the bpf complexity limits.

Loading a function that bypasses the various speculation attack hardening
features (e.g. constant blinding) requires global capabilities.

Other than this, bpf program types can have a new flag set to allow
them to be loaded without any privileges.  Some bpf program types
may need additional care, e.g. perf bpf events.  They can be attached
without privilege even in current kernels, and this might need to change.

(optional) Add an API to load a program where the program source comes from a
file specified by id instead of in memory.  This would allow LSMs to require
that bpf() programs be appropriately labeled.  If LSMs require use of this
API, it will make it much harder to exploit the verifier or speculation bugs.

As a possible future extension, a way to selectively grant the ability to
use specific program types without privilege could be useful.  This
could be done with a cgroup option, for example.

= Cgroup attach =

Cgroups have their own hierarchy that does not necessarily follow the
namespace hierarchy.  Unless cgroups integrate with namespaces in ways
that they currently don't, namespace capabilites cannot be used to grant
permission to operate on cgroups.

I propose that attaching and detaching bpf programs to cgroups use a
permission model similar to the model for changing non-bpf cgroup
settings.  In particular, each bpf_attach_type will get a new file in a
cgroup directory.  So there will be
/sys/fs/cgroup/cgroup_name/bpf.inet_ingress, bpf.inet_egress, etc.

A new API will be added to bpf() to attach and detach programs.  The new
API will take an fd to the bpf.attach_type file instead of to the cgroup
directory.  It will require FMODE_WRITE.  This API will *not* require
any capability.

To prevent anyone with a delegated cgroup from automatically being
able to use all bpf program types, the new bpf.attach_type files
will be opt-in as part of the hierarchy.  This could be done by writing
"+bpf.*" or "+bpf.inet_ingress" to cgroup.subtree_control to make
all the bpf.attach_type files or just bpf.inet_ingress available
in descendents of the cgroup in question.  This could alternatively
be a new bpf.subtree_control file if that seems better.

The result of these changes will be that root can use the old
attach API or the new attach API.  Unprivileged programs cannot
use the old attach API.  Unprivileged programs can use the new
attach API if they are explicitly granted permission by all their
ancestor cgroup managers.

= Additional mitigations =

Optional: there may be cases where a user can load a bpf program
but can't attach or otherwise execute it.  Nonetheless, it's plausible
that such a program could be speculatively executed.  The kernel could
mitigate this by only marking a JITted bpf program executable when it
is first attached or test-run.

^ permalink raw reply

* Re: [RFC bpf-next 4/5] iproute2: Allow compiling against libbpf
From: Toke Høiland-Jørgensen @ 2019-08-22 15:28 UTC (permalink / raw)
  To: Daniel Borkmann, Stephen Hemminger, Alexei Starovoitov
  Cc: Martin KaFai Lau, Song Liu, Yonghong Song, David Miller,
	Jesper Dangaard Brouer, netdev, bpf, andrii.nakryiko
In-Reply-To: <2cf18fed-f2ec-bea3-658e-07ba8124148a@iogearbox.net>

Daniel Borkmann <daniel@iogearbox.net> writes:

> On 8/22/19 3:38 PM, Toke Høiland-Jørgensen wrote:
>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>> On 8/22/19 2:04 PM, Toke Høiland-Jørgensen wrote:
>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>> On 8/22/19 12:43 PM, Toke Høiland-Jørgensen wrote:
>>>>>> Daniel Borkmann <daniel@iogearbox.net> writes:
>>>>>>> On 8/20/19 1:47 PM, Toke Høiland-Jørgensen wrote:
>>>>>>>> This adds a configure check for libbpf and renames functions to allow
>>>>>>>> lib/bpf.c to be compiled with it present. This makes it possible to
>>>>>>>> port functionality piecemeal to use libbpf.
>>>>>>>>
>>>>>>>> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
>>>>>>>> ---
>>>>>>>>      configure          | 16 ++++++++++++++++
>>>>>>>>      include/bpf_util.h |  6 +++---
>>>>>>>>      ip/ipvrf.c         |  4 ++--
>>>>>>>>      lib/bpf.c          | 33 +++++++++++++++++++--------------
>>>>>>>>      4 files changed, 40 insertions(+), 19 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/configure b/configure
>>>>>>>> index 45fcffb6..5a89ee9f 100755
>>>>>>>> --- a/configure
>>>>>>>> +++ b/configure
>>>>>>>> @@ -238,6 +238,19 @@ check_elf()
>>>>>>>>          fi
>>>>>>>>      }
>>>>>>>>      
>>>>>>>> +check_libbpf()
>>>>>>>> +{
>>>>>>>> +    if ${PKG_CONFIG} libbpf --exists; then
>>>>>>>> +	echo "HAVE_LIBBPF:=y" >>$CONFIG
>>>>>>>> +	echo "yes"
>>>>>>>> +
>>>>>>>> +	echo 'CFLAGS += -DHAVE_LIBBPF' `${PKG_CONFIG} libbpf --cflags` >> $CONFIG
>>>>>>>> +	echo 'LDLIBS += ' `${PKG_CONFIG} libbpf --libs` >>$CONFIG
>>>>>>>> +    else
>>>>>>>> +	echo "no"
>>>>>>>> +    fi
>>>>>>>> +}
>>>>>>>> +
>>>>>>>>      check_selinux()
>>>>>>>
>>>>>>> More of an implementation detail at this point in time, but want to
>>>>>>> make sure this doesn't get missed along the way: as discussed at
>>>>>>> bpfconf [0] best for iproute2 to handle libbpf support would be the
>>>>>>> same way of integration as pahole does, that is, to integrate it via
>>>>>>> submodule [1] to allow kernel and libbpf features to be in sync with
>>>>>>> iproute2 releases and therefore easily consume extensions we're adding
>>>>>>> to libbpf to aide iproute2 integration.
>>>>>>
>>>>>> I can sorta see the point wrt keeping in sync with kernel features. But
>>>>>> how will this work with distros that package libbpf as a regular
>>>>>> library? Have you guys given up on regular library symbol versioning for
>>>>>> libbpf?
>>>>>
>>>>> Not at all, and I hope you know that. ;-)
>>>>
>>>> Good! Didn't really expect you had, just checking ;)
>>>>
>>>>> The reason I added lib/bpf.c integration into iproute2 directly back
>>>>> then was exactly such that users can start consuming BPF for tc and
>>>>> XDP via iproute2 /everywhere/ with only a simple libelf dependency
>>>>> which is also available on all distros since pretty much forever. If
>>>>> it was an external library, we could have waited till hell freezes
>>>>> over and initial distro adoption would have pretty much taken forever:
>>>>> to pick one random example here wrt the pace of some downstream
>>>>> distros [0]. The main rationale is pretty much the same as with added
>>>>> kernel features that land complementary iproute2 patches for that
>>>>> kernel release and as libbpf is developed alongside it is reasonable
>>>>> to guarantee user expectations that iproute2 released for kernel
>>>>> version x can make use of BPF features added to kernel x with same
>>>>> loader support from x.
>>>>
>>>> Well, for iproute2 I would expect this to be solved by version
>>>> dependencies. I.e. iproute2 version X would depend on libbpf version Y+
>>>> (like, I dunno, the version of libbpf included in the same kernel source
>>>> tree as the kernel version iproute2 is targeting? :)).
>>>
>>> This sounds nice in theory, but from what I've seen major (!) distros
>>> already seem to have a hard time releasing kernel x along with iproute2
>>> package x, concrete example was that distro kernel was on 4.13 and its
>>> official iproute2 package on 4.9,
>> 
>> If the iproute2 package is not being updated at all I don't really see
>> how it would make any difference whether libbpf is vendored or not? :)
>> 
>>> adding yet another variable that needs to be in sync with kernel is
>>> simply impractical especially for a _core_ package like iproute2 that
>>> should have as little dependencies as possible. I also don't want to
>>> make a bet on whether libbpf will be available on every distro that
>>> also ships iproute2. Hence approach that pahole (and also bcc by the
>>> way) takes is most reasonable to have the best user experience.
>> 
>> Most users are going to get iproute2 from their distro packages anyway,
>> so if distros are incompetent at packaging, my bet is that you're going
>> to run into issues one way or another.
>> 
>> But OK, if you think it is easier to work around bad distros by
>> vendoring, you guys are the maintainers, so that's up to you. But can we
>> at least put in the version dependency and let the build system pick up
>> a system libbpf if it exists and is compatible? That way distros that
>> *are* competent can still link it dynamically...
>
> Yeah that would be fine by me to use this as a fallback, and I think that
> iproute2's configure script should be able to easily handle this
> situation.

Cool, I can live with that :)

-Toke

^ permalink raw reply

* Re: [PATCH rdma-next 0/3] RDMA RX RoCE Steering Support
From: Doug Ledford @ 2019-08-22 15:29 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Jason Gunthorpe, RDMA mailing list, Mark Bloch, Mark Zhang,
	Saeed Mahameed, linux-netdev
In-Reply-To: <20190821140204.GG4459@mtr-leonro.mtl.com>

[-- Attachment #1: Type: text/plain, Size: 1197 bytes --]

On Wed, 2019-08-21 at 17:02 +0300, Leon Romanovsky wrote:
> On Tue, Aug 20, 2019 at 01:54:59PM -0400, Doug Ledford wrote:
> > On Mon, 2019-08-19 at 14:36 +0300, Leon Romanovsky wrote:
> > > From: Leon Romanovsky <leonro@mellanox.com>
> > > 
> > > Hi,
> > > 
> > > This series from Mark extends mlx5 with RDMA_RX RoCE flow steering
> > > support
> > > for DEVX and QP objects.
> > > 
> > > Thanks
> > > 
> > > Mark Zhang (3):
> > >   net/mlx5: Add per-namespace flow table default miss action
> > > support
> > >   net/mlx5: Create bypass and loopback flow steering namespaces
> > > for
> > > RDMA
> > >     RX
> > >   RDMA/mlx5: RDMA_RX flow type support for user applications
> > 
> > I have no objection to this series.
> 
> Thanks, first two patches were applied to mlx5-next
> 
> e6806e9a63a7 net/mlx5: Create bypass and loopback flow steering
> namespaces for RDMA RX
> f66ad830b114 net/mlx5: Add per-namespace flow table default miss
> action support

mlx5-next merged into for-next, final patch applied, thanks.

-- 
Doug Ledford <dledford@redhat.com>
    GPG KeyID: B826A3330E572FDD
    Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply

* Aw: [PATCH net-next v2 0/3] net: dsa: mt7530: Convert to PHYLINK and add support for port 5
From: Frank Wunderlich @ 2019-08-22 15:44 UTC (permalink / raw)
  To: "René van Dorst"
  Cc: Sean Wang, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David S . Miller, Matthias Brugger, netdev, linux-mips,
	"René van Dorst", linux-mediatek, John Crispin,
	linux-arm-kernel
In-Reply-To: <20190821144547.15113-1-opensource@vdorst.com>

Hi,

tested on BPI-R2 (mt7623) with 2 Problems (already reported to Rene, just to inform everyone)...maybe anybody has an idea

- linux-next (i know it's not part of the series, but a pitfall on testing other devices) seems to break power-regulator somewhere here:

priv->core_pwr = devm_regulator_get(&mdiodev->dev, "core"); returns 517

#define EPROBE_DEFER517/* Driver requests probe retry */

https://elixir.bootlin.com/linux/latest/source/drivers/regulator/core.c#L1726

without linux-next switch came up including dsa-ports

- RX-traffic (run iperf3 -c x.x.x.x -R) is only 780 Mbits/sec (TX=940 Mbits/sec), same measure with 5.3-rc4 gives 940 MBit/s with same devices,
maybe caused by changes for mt76x8?

regards Frank

^ permalink raw reply

* Re: [PATCH rdma-next 0/3] RDMA RX RoCE Steering Support
From: Leon Romanovsky @ 2019-08-22 15:45 UTC (permalink / raw)
  To: Doug Ledford
  Cc: Jason Gunthorpe, RDMA mailing list, Mark Bloch, Mark Zhang,
	Saeed Mahameed, linux-netdev
In-Reply-To: <c7caa8eece02f3d15a0928663e9f64f99572f3ab.camel@redhat.com>

On Thu, Aug 22, 2019 at 11:29:02AM -0400, Doug Ledford wrote:
> On Wed, 2019-08-21 at 17:02 +0300, Leon Romanovsky wrote:
> > On Tue, Aug 20, 2019 at 01:54:59PM -0400, Doug Ledford wrote:
> > > On Mon, 2019-08-19 at 14:36 +0300, Leon Romanovsky wrote:
> > > > From: Leon Romanovsky <leonro@mellanox.com>
> > > >
> > > > Hi,
> > > >
> > > > This series from Mark extends mlx5 with RDMA_RX RoCE flow steering
> > > > support
> > > > for DEVX and QP objects.
> > > >
> > > > Thanks
> > > >
> > > > Mark Zhang (3):
> > > >   net/mlx5: Add per-namespace flow table default miss action
> > > > support
> > > >   net/mlx5: Create bypass and loopback flow steering namespaces
> > > > for
> > > > RDMA
> > > >     RX
> > > >   RDMA/mlx5: RDMA_RX flow type support for user applications
> > >
> > > I have no objection to this series.
> >
> > Thanks, first two patches were applied to mlx5-next
> >
> > e6806e9a63a7 net/mlx5: Create bypass and loopback flow steering
> > namespaces for RDMA RX
> > f66ad830b114 net/mlx5: Add per-namespace flow table default miss
> > action support
>
> mlx5-next merged into for-next, final patch applied, thanks.

Thanks
>
> --
> Doug Ledford <dledford@redhat.com>
>     GPG KeyID: B826A3330E572FDD
>     Fingerprint = AE6B 1BDA 122B 23B4 265B  1274 B826 A333 0E57 2FDD



^ permalink raw reply

* Re: [PATCH v12 3/5] dt-bindings: can: tcan4x5x: Add DT bindings for TCAN4x5X driver
From: Marc Kleine-Budde @ 2019-08-22 15:46 UTC (permalink / raw)
  To: Dan Murphy, wg, davem; +Cc: linux-can, netdev, linux-kernel
In-Reply-To: <ff9e007b-6e39-3d64-b62b-93c281d69113@ti.com>


[-- Attachment #1.1: Type: text/plain, Size: 1334 bytes --]

On 8/22/19 4:20 PM, Dan Murphy wrote:
>>> +tcan4x5x: tcan4x5x@0 {
>>> +		compatible = "ti,tcan4x5x";
>>> +		reg = <0>;
>>> +		#address-cells = <1>;
>>> +		#size-cells = <1>;
>>> +		spi-max-frequency = <10000000>;
>>> +		bosch,mram-cfg = <0x0 0 0 32 0 0 1 1>;
>>> +		data-ready-gpios = <&gpio1 14 GPIO_ACTIVE_LOW>;
>> Can you convert this into a proper interrupt property? E.g.:
> 
> OK.  Do you want v13 or do you want patches on top for net-next?

Please use net-next/master as the base.

>>>                  interrupt-parent = <&gpio4>;
>>>                  interrupts = <13 0x2>;
>> See:
>> https://elixir.bootlin.com/linux/latest/source/Documentation/devicetree/bindings/net/can/microchip,mcp251x.txt#L21
>> https://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next.git/tree/drivers/net/can/spi/mcp251x.c?h=mcp251x#n945

I've removed the branch, as the patches are already upstream, have a
look here:

https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/net/can/spi/mcp251x.c#n921

Marc

-- 
Pengutronix e.K.                  | Marc Kleine-Budde           |
Industrial Linux Solutions        | Phone: +49-231-2826-924     |
Vertretung West/Dortmund          | Fax:   +49-5121-206917-5555 |
Amtsgericht Hildesheim, HRA 2686  | http://www.pengutronix.de   |


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH net-next] net/mlx5e: Use PTR_ERR_OR_ZERO in mlx5e_tc_add_nic_flow()
From: Leon Romanovsky @ 2019-08-22 15:50 UTC (permalink / raw)
  To: YueHaibing
  Cc: Saeed Mahameed, davem, netdev, linux-rdma, kernel-janitors,
	linux-kernel
In-Reply-To: <20190822065219.73945-1-yuehaibing@huawei.com>

On Thu, Aug 22, 2019 at 06:52:19AM +0000, YueHaibing wrote:
> Use PTR_ERR_OR_ZERO rather than if(IS_ERR(...)) + PTR_ERR
>
> Signed-off-by: YueHaibing <yuehaibing@huawei.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_tc.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
>

Thanks,
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>

^ permalink raw reply

* Re: [PATCH v12 3/5] dt-bindings: can: tcan4x5x: Add DT bindings for TCAN4x5X driver
From: Dan Murphy @ 2019-08-22 15:58 UTC (permalink / raw)
  To: Marc Kleine-Budde, wg, davem; +Cc: linux-can, netdev, linux-kernel
In-Reply-To: <6c2bf55f-e360-c51a-e7bb-effc86aa5b6c@pengutronix.de>

Marc

On 8/22/19 10:46 AM, Marc Kleine-Budde wrote:
> On 8/22/19 4:20 PM, Dan Murphy wrote:
>>>> +tcan4x5x: tcan4x5x@0 {
>>>> +		compatible = "ti,tcan4x5x";
>>>> +		reg = <0>;
>>>> +		#address-cells = <1>;
>>>> +		#size-cells = <1>;
>>>> +		spi-max-frequency = <10000000>;
>>>> +		bosch,mram-cfg = <0x0 0 0 32 0 0 1 1>;
>>>> +		data-ready-gpios = <&gpio1 14 GPIO_ACTIVE_LOW>;
>>> Can you convert this into a proper interrupt property? E.g.:
>> OK.  Do you want v13 or do you want patches on top for net-next?
> Please use net-next/master as the base.


Thanks for the reply.  I see that that there are patches on top of the 
driver so I will send patches on top of that.

Dan

<snip>

^ permalink raw reply

* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Richard Cochran @ 2019-08-22 16:05 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <CA+h21hpJm-3svfV93pYYrpoiV12jDjuROHCgvCjPivAjXTB_VA@mail.gmail.com>

On Thu, Aug 22, 2019 at 05:58:49PM +0300, Vladimir Oltean wrote:
> I don't think I understand the problem here.

On the contrary, I do.

> You'd have something like this:
> 
> Master (DSA master port)         Slave (switch CPU port)
> 
>     |                            |         Tstamps known
>     |                            |         to slave
>     |       Local_sync_req       |
>  t1 |------\                     |         t1
>     |       \-----\              |
>     |              \-----\       |
>     |                     \----->| t2      t1, t2
>     |                            |
>     |     Local_sync_resp /------| t3      t1, t2, t3
>     |              /-----/       |
>     |       /-----/              |
>  t4 |<-----/                     |         t1, t2, t3, t4
>     |                            |
>     |                            |
>     v           time             v

And who generates Local_sync_resp?

Also, what sort of frame is it?  PTP has no Sync request or response.

> But you don't mean a TX timestamp at the egress of swp4 here, do you?

Yes, I do.
 
> Why would that matter?

Because in order to synchronize to an external GM, you need to measure
two things:

1. the (unchanging) delay from MAC to MAC
2. the (per-packet) switch residence time

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: William Tu @ 2019-08-22 16:07 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Alexander Duyck, Björn Töpel, Netdev, LKML, bpf,
	David S. Miller, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron
In-Reply-To: <cbf7c51b-9ce7-6ef6-32c4-981258d4af4c@samsung.com>

On Thu, Aug 22, 2019 at 1:17 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> On 22.08.2019 0:38, William Tu wrote:
> > On Wed, Aug 21, 2019 at 9:57 AM Alexander Duyck
> > <alexander.duyck@gmail.com> wrote:
> >>
> >> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>>
> >>> On 21.08.2019 4:17, Alexander Duyck wrote:
> >>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>>>>
> >>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
> >>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets <i.maximets@samsung.com> wrote:
> >>>>>>>
> >>>>>>> Tx code doesn't clear the descriptor status after cleaning.
> >>>>>>> So, if the budget is larger than number of used elems in a ring, some
> >>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
> >>>>>>> prod_tail far beyond the prod_head breaking the comletion queue ring.
> >>>>>>>
> >>>>>>> Fix that by limiting the number of descriptors to clean by the number
> >>>>>>> of used descriptors in the tx ring.
> >>>>>>>
> >>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> >>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> >>>>>>
> >>>>>> I'm not sure this is the best way to go. My preference would be to
> >>>>>> have something in the ring that would prevent us from racing which I
> >>>>>> don't think this really addresses. I am pretty sure this code is safe
> >>>>>> on x86 but I would be worried about weak ordered systems such as
> >>>>>> PowerPC.
> >>>>>>
> >>>>>> It might make sense to look at adding the eop_desc logic like we have
> >>>>>> in the regular path with a proper barrier before we write it and after
> >>>>>> we read it. So for example we could hold of on writing the bytecount
> >>>>>> value until the end of an iteration and call smp_wmb before we write
> >>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
> >>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
> >>>>>> clearing the value. Otherwise this code is going to just keep popping
> >>>>>> up with issues.
> >>>>>
> >>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
> >>>>> tx ring always happens in the same NAPI context and even on the same
> >>>>> CPU core.
> >>>>>
> >>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
> >>>>> use 'next_to_watch' field just as a flag of descriptor existence,
> >>>>> but it seems unnecessarily complicated. Am I missing something?
> >>>>>
> >>>>
> >>>> So is it always in the same NAPI context?. I forgot, I was thinking
> >>>> that somehow the socket could possibly make use of XDP for transmit.
> >>>
> >>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
> >>> is used in zero-copy mode. Real xmit happens inside
> >>> ixgbe_poll()
> >>>  -> ixgbe_clean_xdp_tx_irq()
> >>>     -> ixgbe_xmit_zc()
> >>>
> >>> This should be not possible to bound another XDP socket to the same netdev
> >>> queue.
> >>>
> >>> It also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
> >>> actions. REDIRECT could happen from different netdev with different NAPI
> >>> context, but this operation is bound to specific CPU core and each core has
> >>> its own xdp_ring.
> >>>
> >>> However, I'm not an expert here.
> >>> Björn, maybe you could comment on this?
> >>>
> >>>>
> >>>> As far as the logic to use I would be good with just using a value you
> >>>> are already setting such as the bytecount value. All that would need
> >>>> to happen is to guarantee that the value is cleared in the Tx path. So
> >>>> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
> >>>> theoretically just use that as well to flag that a descriptor has been
> >>>> populated and is ready to be cleaned. Assuming the logic about this
> >>>> all being in the same NAPI context anyway you wouldn't need to mess
> >>>> with the barrier stuff I mentioned before.
> >>>
> >>> Checking the number of used descs, i.e. next_to_use - next_to_clean,
> >>> makes iteration in this function logically equal to the iteration inside
> >>> 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the later
> >>> function too to follow same 'bytecount' approach? I don't like having
> >>> two different ways to determine number of used descriptors in the same file.
> >>>
> >>> Best regards, Ilya Maximets.
> >>
> >> As far as ixgbe_clean_xdp_tx_irq() vs ixgbe_xsk_clean_tx_ring(), I
> >> would say that if you got rid of budget and framed things more like
> >> how ixgbe_xsk_clean_tx_ring was framed with the ntc != ntu being
> >> obvious I would prefer to see us go that route.
> >>
> >> Really there is no need for budget in ixgbe_clean_xdp_tx_irq() if you
> >> are going to be working with a static ntu value since you will only
> >> ever process one iteration through the ring anyway. It might make more
> >> sense if you just went through and got rid of budget and i, and
> >> instead used ntc and ntu like what was done in
> >> ixgbe_xsk_clean_tx_ring().
> >>
> >> Thanks.
> >>
> >> - Alex
> >
> > Not familiar with the driver details.
> > I tested this patch and the issue mentioned in OVS mailing list.
> > https://www.mail-archive.com/ovs-dev@openvswitch.org/msg35362.html
> > and indeed the problem goes away.
>
> Good. Thanks for testing!
>
> > But I saw a huge performance drop,
> > my AF_XDP tx performance drops from >9Mpps to <5Mpps.
>
> I didn't expect so big performance difference with this change.
> What is your test scenario?

I was using OVS with dual port NIC, setting one OpenFlow rule
in_port=eth2 actions=output:eth3
and eth2 for rx and measure eth3 tx
'sar -n DEV 1'  shows pretty huge drop on eth3 tx.

> Is it possible that you're accounting same
> packet several times due to broken completion queue?

That's possible.
Let me double check on your v2 patch.

@Eelco: do you also see some performance difference?

Regards,
William

>
> Looking at samples/bpf/xdpsock_user.c:complete_tx_only(), it accounts
> sent packets (tx_npkts) by accumulating results of xsk_ring_cons__peek()
> for completion queue, so it's not a trusted source of pps information.
>
> Best regards, Ilya Maximets.
>
> >
> > Tested using kernel 5.3.0-rc3+
> > 03:00.0 Ethernet controller: Intel Corporation Ethernet Controller
> > 10-Gigabit X540-AT2 (rev 01)
> > Subsystem: Intel Corporation Ethernet 10G 2P X540-t Adapter
> > Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
> > Stepping- SERR- FastB2B- DisINTx+
> >
> > Regards,
> > William

^ permalink raw reply

* [PATCH] net/core/skmsg: Delete an unnecessary check before the function call “consume_skb”
From: Markus Elfring @ 2019-08-22 16:08 UTC (permalink / raw)
  To: netdev, bpf, Daniel Borkmann, David S. Miller, John Fastabend
  Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 22 Aug 2019 18:00:40 +0200

The consume_skb() function performs also input parameter validation.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 net/core/skmsg.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/core/skmsg.c b/net/core/skmsg.c
index 6832eeb4b785..cf390e0aa73d 100644
--- a/net/core/skmsg.c
+++ b/net/core/skmsg.c
@@ -190,8 +190,7 @@ static int __sk_msg_free(struct sock *sk, struct sk_msg *msg, u32 i,
 		sk_msg_check_to_free(msg, i, msg->sg.size);
 		sge = sk_msg_elem(msg, i);
 	}
-	if (msg->skb)
-		consume_skb(msg->skb);
+	consume_skb(msg->skb);
 	sk_msg_init(msg);
 	return freed;
 }
--
2.23.0

^ permalink raw reply related

* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Richard Cochran @ 2019-08-22 16:10 UTC (permalink / raw)
  To: Vladimir Oltean
  Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <CA+h21hpJm-3svfV93pYYrpoiV12jDjuROHCgvCjPivAjXTB_VA@mail.gmail.com>

On Thu, Aug 22, 2019 at 05:58:49PM +0300, Vladimir Oltean wrote:
> If you mean the latter, then yes, having HWTSTAMP_FILTER_ALL in the
> rx_filter of the DSA master is a hard requirement.

And to clear, the marvell only recognizes PTP frames.  That means that
DSA frames cannot be time stamped.

Thanks,
Richard

^ permalink raw reply

* Re: [PATCH spi for-5.4 0/5] Deterministic SPI latency with NXP DSPI driver
From: Vladimir Oltean @ 2019-08-22 16:13 UTC (permalink / raw)
  To: Richard Cochran
  Cc: Mark Brown, Hubert Feurstein, Miroslav Lichvar, Andrew Lunn,
	Florian Fainelli, linux-spi, netdev
In-Reply-To: <20190822160521.GC4522@localhost>

On Thu, 22 Aug 2019 at 19:05, Richard Cochran <richardcochran@gmail.com> wrote:
>
> On Thu, Aug 22, 2019 at 05:58:49PM +0300, Vladimir Oltean wrote:
> > I don't think I understand the problem here.
>
> On the contrary, I do.
>

You do think that I understand the problem? But I don't!

> > You'd have something like this:
> >
> > Master (DSA master port)         Slave (switch CPU port)
> >
> >     |                            |         Tstamps known
> >     |                            |         to slave
> >     |       Local_sync_req       |
> >  t1 |------\                     |         t1
> >     |       \-----\              |
> >     |              \-----\       |
> >     |                     \----->| t2      t1, t2
> >     |                            |
> >     |     Local_sync_resp /------| t3      t1, t2, t3
> >     |              /-----/       |
> >     |       /-----/              |
> >  t4 |<-----/                     |         t1, t2, t3, t4
> >     |                            |
> >     |                            |
> >     v           time             v
>
> And who generates Local_sync_resp?
>

Local_sync_resp is the same as Local_sync_req except maybe with a
custom tag added by the switch. Irrelevant as long as the DSA master
can timestamp it.

> Also, what sort of frame is it?  PTP has no Sync request or response.
>

A frame that can be timestamped on RX and TX by the DSA switch and
master, that is not a PTP frame.

> > But you don't mean a TX timestamp at the egress of swp4 here, do you?
>
> Yes, I do.
>
> > Why would that matter?
>
> Because in order to synchronize to an external GM, you need to measure
> two things:
>
> 1. the (unchanging) delay from MAC to MAC
> 2. the (per-packet) switch residence time
>

But since when are we discussing the synchronization to an external
grandmaster? I don't see the connection.

> Thanks,
> Richard

Regards,
-Vladimir

^ permalink raw reply

* Re: [PATCH net v2] ixgbe: fix double clean of tx descriptors with xdp
From: William Tu @ 2019-08-22 16:23 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Linux Kernel Network Developers, LKML, bpf, David S. Miller,
	Björn Töpel, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron, Alexander Duyck
In-Reply-To: <20190822123037.28068-1-i.maximets@samsung.com>

On Thu, Aug 22, 2019 at 5:30 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> Tx code doesn't clear the descriptors' status after cleaning.
> So, if the budget is larger than number of used elems in a ring, some
> descriptors will be accounted twice and xsk_umem_complete_tx will move
> prod_tail far beyond the prod_head breaking the comletion queue ring.

s/comletion/completion/

>
> Fix that by limiting the number of descriptors to clean by the number
> of used descriptors in the tx ring.
>
> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> 'ixgbe_xsk_clean_tx_ring()' since we don't need most of the
> complications implemented in the regular 'ixgbe_clean_tx_irq()'
> and we're allowed to directly use 'next_to_clean' and 'next_to_use'
> indexes.
>
> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> ---

Tested-by: William Tu <u9012063@gmail.com>

Instead of measuring tx performance at the tx machine, I measured the TX
performance at the other side (the traffic generating machine).  This time it
is more consistent and showing not much difference with (5.9Mpps) and
without this patch (6.1Mpps).

>
> Version 2:
>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>     'ixgbe_xsk_clean_tx_ring()'.
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 34 ++++++++------------
>  1 file changed, 13 insertions(+), 21 deletions(-)
>

<snip>

^ permalink raw reply

* Aw: [PATCH net-next v2 0/3] net: dsa: mt7530: Convert to PHYLINK and add support for port 5
From: Frank Wunderlich @ 2019-08-22 16:26 UTC (permalink / raw)
  To: Frank Wunderlich
  Cc: "René van Dorst", Andrew Lunn, Florian Fainelli,
	netdev, Sean Wang, linux-mips, David S . Miller, linux-mediatek,
	John Crispin, Matthias Brugger, Vivien Didelot, linux-arm-kernel
In-Reply-To: <trinity-b1f48e51-af73-466d-9ecf-d560a7d7c1ee-1566488653737@3c-app-gmx-bap07>

tested now also on bpi-r64 (mt7622) v0.1 (rtl8367 switch), without linux-next to avoid power-regulator-problems like on bpi-r2

dmesg without warnings/errors caused by this patches
link came up as desired
iperf3 looks good: 943 Mbits/sec in both directions and no other issues

so it is currently only the rx-throughput-problem on mt7623/bpi-r2

regards Frank

^ permalink raw reply

* Re: [PATCH net] ixgbe: fix double clean of tx descriptors with xdp
From: Ilya Maximets @ 2019-08-22 16:30 UTC (permalink / raw)
  To: William Tu
  Cc: Alexander Duyck, Björn Töpel, Netdev, LKML, bpf,
	David S. Miller, Magnus Karlsson, Jakub Kicinski,
	Alexei Starovoitov, Daniel Borkmann, Jeff Kirsher,
	intel-wired-lan, Eelco Chaudron
In-Reply-To: <CALDO+SaRNMvmXrQqOtNiRsOkgfOQAW4EA2yVgmeoGQto2zvfMQ@mail.gmail.com>

On 22.08.2019 19:07, William Tu wrote:
> On Thu, Aug 22, 2019 at 1:17 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>
>> On 22.08.2019 0:38, William Tu wrote:
>>> On Wed, Aug 21, 2019 at 9:57 AM Alexander Duyck
>>> <alexander.duyck@gmail.com> wrote:
>>>>
>>>> On Wed, Aug 21, 2019 at 9:22 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>
>>>>> On 21.08.2019 4:17, Alexander Duyck wrote:
>>>>>> On Tue, Aug 20, 2019 at 8:58 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>>>
>>>>>>> On 20.08.2019 18:35, Alexander Duyck wrote:
>>>>>>>> On Tue, Aug 20, 2019 at 8:18 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>>>>>>>>>
>>>>>>>>> Tx code doesn't clear the descriptor status after cleaning.
>>>>>>>>> So, if the budget is larger than number of used elems in a ring, some
>>>>>>>>> descriptors will be accounted twice and xsk_umem_complete_tx will move
>>>>>>>>> prod_tail far beyond the prod_head breaking the comletion queue ring.
>>>>>>>>>
>>>>>>>>> Fix that by limiting the number of descriptors to clean by the number
>>>>>>>>> of used descriptors in the tx ring.
>>>>>>>>>
>>>>>>>>> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
>>>>>>>>> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
>>>>>>>>
>>>>>>>> I'm not sure this is the best way to go. My preference would be to
>>>>>>>> have something in the ring that would prevent us from racing which I
>>>>>>>> don't think this really addresses. I am pretty sure this code is safe
>>>>>>>> on x86 but I would be worried about weak ordered systems such as
>>>>>>>> PowerPC.
>>>>>>>>
>>>>>>>> It might make sense to look at adding the eop_desc logic like we have
>>>>>>>> in the regular path with a proper barrier before we write it and after
>>>>>>>> we read it. So for example we could hold of on writing the bytecount
>>>>>>>> value until the end of an iteration and call smp_wmb before we write
>>>>>>>> it. Then on the cleanup we could read it and if it is non-zero we take
>>>>>>>> an smp_rmb before proceeding further to process the Tx descriptor and
>>>>>>>> clearing the value. Otherwise this code is going to just keep popping
>>>>>>>> up with issues.
>>>>>>>
>>>>>>> But, unlike regular case, xdp zero-copy xmit and clean for particular
>>>>>>> tx ring always happens in the same NAPI context and even on the same
>>>>>>> CPU core.
>>>>>>>
>>>>>>> I saw the 'eop_desc' manipulations in regular case and yes, we could
>>>>>>> use 'next_to_watch' field just as a flag of descriptor existence,
>>>>>>> but it seems unnecessarily complicated. Am I missing something?
>>>>>>>
>>>>>>
>>>>>> So is it always in the same NAPI context?. I forgot, I was thinking
>>>>>> that somehow the socket could possibly make use of XDP for transmit.
>>>>>
>>>>> AF_XDP socket only triggers tx interrupt on ndo_xsk_async_xmit() which
>>>>> is used in zero-copy mode. Real xmit happens inside
>>>>> ixgbe_poll()
>>>>>  -> ixgbe_clean_xdp_tx_irq()
>>>>>     -> ixgbe_xmit_zc()
>>>>>
>>>>> This should be not possible to bound another XDP socket to the same netdev
>>>>> queue.
>>>>>
>>>>> It also possible to xmit frames in xdp_ring while performing XDP_TX/REDIRECT
>>>>> actions. REDIRECT could happen from different netdev with different NAPI
>>>>> context, but this operation is bound to specific CPU core and each core has
>>>>> its own xdp_ring.
>>>>>
>>>>> However, I'm not an expert here.
>>>>> Björn, maybe you could comment on this?
>>>>>
>>>>>>
>>>>>> As far as the logic to use I would be good with just using a value you
>>>>>> are already setting such as the bytecount value. All that would need
>>>>>> to happen is to guarantee that the value is cleared in the Tx path. So
>>>>>> if you clear the bytecount in ixgbe_clean_xdp_tx_irq you could
>>>>>> theoretically just use that as well to flag that a descriptor has been
>>>>>> populated and is ready to be cleaned. Assuming the logic about this
>>>>>> all being in the same NAPI context anyway you wouldn't need to mess
>>>>>> with the barrier stuff I mentioned before.
>>>>>
>>>>> Checking the number of used descs, i.e. next_to_use - next_to_clean,
>>>>> makes iteration in this function logically equal to the iteration inside
>>>>> 'ixgbe_xsk_clean_tx_ring()'. Do you think we need to change the later
>>>>> function too to follow same 'bytecount' approach? I don't like having
>>>>> two different ways to determine number of used descriptors in the same file.
>>>>>
>>>>> Best regards, Ilya Maximets.
>>>>
>>>> As far as ixgbe_clean_xdp_tx_irq() vs ixgbe_xsk_clean_tx_ring(), I
>>>> would say that if you got rid of budget and framed things more like
>>>> how ixgbe_xsk_clean_tx_ring was framed with the ntc != ntu being
>>>> obvious I would prefer to see us go that route.
>>>>
>>>> Really there is no need for budget in ixgbe_clean_xdp_tx_irq() if you
>>>> are going to be working with a static ntu value since you will only
>>>> ever process one iteration through the ring anyway. It might make more
>>>> sense if you just went through and got rid of budget and i, and
>>>> instead used ntc and ntu like what was done in
>>>> ixgbe_xsk_clean_tx_ring().
>>>>
>>>> Thanks.
>>>>
>>>> - Alex
>>>
>>> Not familiar with the driver details.
>>> I tested this patch and the issue mentioned in OVS mailing list.
>>> https://www.mail-archive.com/ovs-dev@openvswitch.org/msg35362.html
>>> and indeed the problem goes away.
>>
>> Good. Thanks for testing!
>>
>>> But I saw a huge performance drop,
>>> my AF_XDP tx performance drops from >9Mpps to <5Mpps.
>>
>> I didn't expect so big performance difference with this change.
>> What is your test scenario?
> 
> I was using OVS with dual port NIC, setting one OpenFlow rule
> in_port=eth2 actions=output:eth3
> and eth2 for rx and measure eth3 tx
> 'sar -n DEV 1'  shows pretty huge drop on eth3 tx.

'sar' just parses net procfs to obtain interface stats, but interface stats
are not correct due to this bug (same packets accounted several times).

> 
>> Is it possible that you're accounting same
>> packet several times due to broken completion queue?
> 
> That's possible.
> Let me double check on your v2 patch.
> 
> @Eelco: do you also see some performance difference?
> 
> Regards,
> William
> 
>>
>> Looking at samples/bpf/xdpsock_user.c:complete_tx_only(), it accounts
>> sent packets (tx_npkts) by accumulating results of xsk_ring_cons__peek()
>> for completion queue, so it's not a trusted source of pps information.
>>
>> Best regards, Ilya Maximets.
>>
>>>
>>> Tested using kernel 5.3.0-rc3+
>>> 03:00.0 Ethernet controller: Intel Corporation Ethernet Controller
>>> 10-Gigabit X540-AT2 (rev 01)
>>> Subsystem: Intel Corporation Ethernet 10G 2P X540-t Adapter
>>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr-
>>> Stepping- SERR- FastB2B- DisINTx+
>>>
>>> Regards,
>>> William
> 
> 

^ permalink raw reply

* Re: New skb extension for use by LSMs (skb "security blob")?
From: Paul Moore @ 2019-08-22 16:32 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netdev, linux-security-module, selinux
In-Reply-To: <20190822070358.GE20113@breakpoint.cc>

On Thu, Aug 22, 2019 at 3:03 AM Florian Westphal <fw@strlen.de> wrote:
> Paul Moore <paul@paul-moore.com> wrote:
> > Hello netdev,
> >
> > I was just made aware of the skb extension work, and it looks very
> > appealing from a LSM perspective.  As some of you probably remember,
> > we (the LSM folks) have wanted a proper security blob in the skb for
> > quite some time, but netdev has been resistant to this idea thus far.
>
> Is that "blob" in addition to skb->secmark, or a replacement?

That's a good question.  While I thought about that, I wasn't sure if
that was worth bringing up as previous attempts to trade the secmark
field for a void pointer met with failure.  Last time I played with it
I was able to take the additional 32-bits from holes in the skb, and
possibly even improve some of the cacheline groupings (but that is
always going to be a dependent on use case I think), but that wasn't
enough.

I think we could consider freeing up the secmark in the main skb, and
move it to a skb extension, but this would potentially increase the
chances that we would need to add an extension to a skb.  I don't have
any hard numbers, but based on discussions and questions I suspect
Secmark is more widely used than NetLabel and/or labeled IPsec;
although I'm confident it is still a minor percentage of the overall
Linux installed base.

For me the big question is what would it take for us to get a security
blob associated with the skb?  Would moving the secmark into the skb
extension be enough?  Something else?  Or is this simply never going
to happen?  I want to remain optimistic, but I've been trying for this
off-and-on for over a decade and keep running into a brick wall ;)

-- 
paul moore
www.paul-moore.com

^ permalink raw reply

* [PATCH] net/ipv4/tcp_bpf: Delete an unnecessary check before the function call “consume_skb”
From: Markus Elfring @ 2019-08-22 16:32 UTC (permalink / raw)
  To: netdev, bpf, Alexei Starovoitov, Alexey Kuznetsov,
	Daniel Borkmann, David S. Miller, Eric Dumazet, Hideaki Yoshifuji,
	John Fastabend, Martin KaFai Lau, Song Liu, Yonghong Song
  Cc: LKML, kernel-janitors

From: Markus Elfring <elfring@users.sourceforge.net>
Date: Thu, 22 Aug 2019 18:20:42 +0200

The consume_skb() function performs also input parameter validation.
Thus the test around the call is not needed.

This issue was detected by using the Coccinelle software.

Signed-off-by: Markus Elfring <elfring@users.sourceforge.net>
---
 net/ipv4/tcp_bpf.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv4/tcp_bpf.c b/net/ipv4/tcp_bpf.c
index 8a56e09cfb0e..4ae18bd431a0 100644
--- a/net/ipv4/tcp_bpf.c
+++ b/net/ipv4/tcp_bpf.c
@@ -103,8 +103,7 @@ int __tcp_bpf_recvmsg(struct sock *sk, struct sk_psock *psock,
 		msg_rx->sg.start = i;
 		if (!sge->length && msg_rx->sg.start == msg_rx->sg.end) {
 			list_del(&msg_rx->list);
-			if (msg_rx->skb)
-				consume_skb(msg_rx->skb);
+			consume_skb(msg_rx->skb);
 			kfree(msg_rx);
 		}
 		msg_rx = list_first_entry_or_null(&psock->ingress_msg,
--
2.23.0


^ permalink raw reply related

* Re: [PATCH net v2] ixgbe: fix double clean of tx descriptors with xdp
From: Alexander Duyck @ 2019-08-22 16:38 UTC (permalink / raw)
  To: Ilya Maximets
  Cc: Netdev, LKML, bpf, David S. Miller, Björn Töpel,
	Magnus Karlsson, Jakub Kicinski, Alexei Starovoitov,
	Daniel Borkmann, Jeff Kirsher, intel-wired-lan, Eelco Chaudron,
	William Tu
In-Reply-To: <20190822123037.28068-1-i.maximets@samsung.com>

On Thu, Aug 22, 2019 at 5:30 AM Ilya Maximets <i.maximets@samsung.com> wrote:
>
> Tx code doesn't clear the descriptors' status after cleaning.
> So, if the budget is larger than number of used elems in a ring, some
> descriptors will be accounted twice and xsk_umem_complete_tx will move
> prod_tail far beyond the prod_head breaking the comletion queue ring.
>
> Fix that by limiting the number of descriptors to clean by the number
> of used descriptors in the tx ring.
>
> 'ixgbe_clean_xdp_tx_irq()' function refactored to look more like
> 'ixgbe_xsk_clean_tx_ring()' since we don't need most of the
> complications implemented in the regular 'ixgbe_clean_tx_irq()'
> and we're allowed to directly use 'next_to_clean' and 'next_to_use'
> indexes.
>
> Fixes: 8221c5eba8c1 ("ixgbe: add AF_XDP zero-copy Tx support")
> Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
> ---
>
> Version 2:
>   * 'ixgbe_clean_xdp_tx_irq()' refactored to look more like
>     'ixgbe_xsk_clean_tx_ring()'.
>
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 34 ++++++++------------
>  1 file changed, 13 insertions(+), 21 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> index 6b609553329f..d1297660e14a 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
> @@ -633,22 +633,23 @@ static void ixgbe_clean_xdp_tx_buffer(struct ixgbe_ring *tx_ring,
>  bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>                             struct ixgbe_ring *tx_ring, int napi_budget)
>  {
> +       u16 ntc = tx_ring->next_to_clean, ntu = tx_ring->next_to_use;
>         unsigned int total_packets = 0, total_bytes = 0;
> -       u32 i = tx_ring->next_to_clean, xsk_frames = 0;
>         unsigned int budget = q_vector->tx.work_limit;
>         struct xdp_umem *umem = tx_ring->xsk_umem;
> -       union ixgbe_adv_tx_desc *tx_desc;
> -       struct ixgbe_tx_buffer *tx_bi;
> +       u32 xsk_frames = 0;
>         bool xmit_done;
>
> -       tx_bi = &tx_ring->tx_buffer_info[i];
> -       tx_desc = IXGBE_TX_DESC(tx_ring, i);
> -       i -= tx_ring->count;
> +       while (likely(ntc != ntu && budget)) {

I would say you can get rid of budget entirely. It was only really
needed for the regular Tx case where you can have multiple CPUs
feeding a single Tx queue and causing a stall. Since we have a 1:1
mapping we should never have more than the Rx budget worth of packets
to really process. In addition we can only make one pass through the
ring since the ntu value is not updated while running the loop.

> +               union ixgbe_adv_tx_desc *tx_desc;
> +               struct ixgbe_tx_buffer *tx_bi;
> +
> +               tx_desc = IXGBE_TX_DESC(tx_ring, ntc);
>
> -       do {
>                 if (!(tx_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)))
>                         break;
>
> +               tx_bi = &tx_ring->tx_buffer_info[ntc];

Please don't move this logic into the loop. We were intentionally
processing this outside of the loop once and then just doing the
increments because it is faster that way. It takes several operations
to compute tx_bi based on ntc, whereas just incrementing is a single
operation.

>                 total_bytes += tx_bi->bytecount;
>                 total_packets += tx_bi->gso_segs;
>
> @@ -659,24 +660,15 @@ bool ixgbe_clean_xdp_tx_irq(struct ixgbe_q_vector *q_vector,
>
>                 tx_bi->xdpf = NULL;
>
> -               tx_bi++;
> -               tx_desc++;
> -               i++;
> -               if (unlikely(!i)) {
> -                       i -= tx_ring->count;

So these two lines can probably just be replaced by:
if (unlikely(ntc == tx_ring->count)) {
        ntc = 0;

> -                       tx_bi = tx_ring->tx_buffer_info;
> -                       tx_desc = IXGBE_TX_DESC(tx_ring, 0);
> -               }
> -
> -               /* issue prefetch for next Tx descriptor */
> -               prefetch(tx_desc);

Did you just drop the prefetch? You are changing way too much with
this patch. All you should need to do is replace i with ntc, replace
the "do {" with "while (ntc != ntu) {", and remove the while at the
end.

> +               ntc++;
> +               if (unlikely(ntc == tx_ring->count))
> +                       ntc = 0;
>
>                 /* update budget accounting */
>                 budget--;
> -       } while (likely(budget));

As I stated earlier, budget can be removed entirely.

> +       }
>
> -       i += tx_ring->count;
> -       tx_ring->next_to_clean = i;
> +       tx_ring->next_to_clean = ntc;
>
>         u64_stats_update_begin(&tx_ring->syncp);
>         tx_ring->stats.bytes += total_bytes;
> --
> 2.17.1
>

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox