Netdev List


* [net-next 02/15] ixgbevf: Link lost in VM on ixgbevf when restoring from freeze or suspend
From: Jeff Kirsher @ 2019-08-28  6:43 UTC
  To: davem; +Cc: Radoslaw Tyl, netdev, nhorman, sassmann, Andrew Bowers,
	Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>

From: Radoslaw Tyl <radoslawx.tyl@intel.com>

This patch fixes an issue where a VM shows no link after the hypervisor is
restored from a low-power state. The driver is responsible for re-enabling
any features of the device that had been disabled during suspend calls,
such as IRQs and bus mastering.

Signed-off-by: Radoslaw Tyl <radoslawx.tyl@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
 drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 8c011d4ce7a9..75e849a64db7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -2517,6 +2517,7 @@ void ixgbevf_reinit_locked(struct ixgbevf_adapter *adapter)
 		msleep(1);
 
 	ixgbevf_down(adapter);
+	pci_set_master(adapter->pdev);
 	ixgbevf_up(adapter);
 
 	clear_bit(__IXGBEVF_RESETTING, &adapter->state);
-- 
2.21.0
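
The effect of the one-line change can be modelled in user space. This is
a minimal sketch, not driver code: `struct fake_pdev`, `fake_suspend()`,
`fake_set_master()` and `fake_reinit()` are hypothetical names, and only
the bus-master bit position (0x4) is meant to mirror the PCI command
register.

```c
#include <stdbool.h>
#include <stdint.h>

/* Same bit position as the bus-master enable bit in the PCI command register. */
#define FAKE_PCI_COMMAND_MASTER 0x4

/* Hypothetical stand-in for a PCI device; not the kernel's struct pci_dev. */
struct fake_pdev {
	uint16_t command;
};

/* Suspend path: bus mastering is switched off. */
static void fake_suspend(struct fake_pdev *pdev)
{
	pdev->command &= (uint16_t)~FAKE_PCI_COMMAND_MASTER;
}

/* Stand-in for what pci_set_master() achieves: bus mastering back on. */
static void fake_set_master(struct fake_pdev *pdev)
{
	pdev->command |= FAKE_PCI_COMMAND_MASTER;
}

/* Without bus mastering the device cannot DMA, so no traffic and no link. */
static bool fake_can_dma(const struct fake_pdev *pdev)
{
	return (pdev->command & FAKE_PCI_COMMAND_MASTER) != 0;
}

/*
 * Model of the reinit path (down, optionally restore bus mastering, up).
 * Returns true if the device is usable afterwards.
 */
static bool fake_reinit(struct fake_pdev *pdev, bool restore_master)
{
	if (restore_master)
		fake_set_master(pdev);
	return fake_can_dma(pdev);
}
```

In this model, calling `fake_reinit()` after `fake_suspend()` only yields
a usable device when `restore_master` is true, which is the behaviour the
patch aims for.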



* [net-next 00/15][pull request] Intel Wired LAN Driver Updates 2019-08-27
From: Jeff Kirsher @ 2019-08-28  6:43 UTC
  To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann

This series contains a variety of cold and hot savoury changes to Intel
drivers.  Some of the fixes could be considered for stable even though
the author did not request it.

Hulk Robert cleans up (i.e. removes) a function that has no caller for
the iavf driver.

Radoslaw fixes an issue when there is no link in the VM after the
hypervisor is restored from a low-power state due to the driver not
properly restoring features in the device that had been disabled during
the suspension for ixgbevf.

Kai-Heng Feng modified e1000e to use mod_delayed_work() to help resolve
a hot plug speed detection issue by adding a deterministic 1 second
delay before running watchdog task after an interrupt.

Sasha moves functions around to avoid forward declarations, since the
forward declarations are not necessary for these static functions in
igc.  Also added a check for igc during driver probe to validate the NVM
checksum.  Cleaned up code defines that were not being used in the igc
driver.  Adds support for IP generic transmit checksum offload in the
igc driver.

Updated the iavf kernel documentation by a developer with no life.

Jake provides another fm10k update to a local variable for ease of code
readability.

Mitch fixes the iavf driver to allow the VF to override the MAC address
set by the host, if the VF is in "trusted" mode.

Mauro S. M. Rodrigues provides several changes for the i40e driver, first
with resolving hw_dbg usage and referencing an i40e_hw attribute.  Also
implemented a debug macro using pr_debug, since the use of netdev_dbg
could cause a NULL pointer dereference during probe.  Finally cleaned up
code that is no longer used or needed.

Firo Yang provides a change in the ixgbe driver to ensure we sync the
first fragment unconditionally to help resolve an issue seen in the XEN
environment when the upper network stack could receive an incomplete
network packet.

Mariusz adds a missing device to the i40e PCI table in the driver.

The following are changes since commit 68aaf4459556b1f9370c259fd486aecad2257552:
  Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
and are available in the git repository at:
  git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE

Firo Yang (1):
  ixgbe: sync the first fragment unconditionally

Jacob Keller (1):
  fm10k: use a local variable for the frag pointer

Jeff Kirsher (1):
  Documentation: iavf: Update the Intel LAN driver doc for iavf

Kai-Heng Feng (1):
  e1000e: Make speed detection on hotplugging cable more reliable

Mariusz Stachura (1):
  i40e: Add support for X710 device

Mauro S. M. Rodrigues (3):
  i40e: fix hw_dbg usage in i40e_hmc_get_object_va
  i40e: Implement debug macro hw_dbg using pr_debug
  i40e: Remove EMPR traces from debugfs facility

Mitch Williams (1):
  iavf: allow permanent MAC address to change

Radoslaw Tyl (1):
  ixgbevf: Link lost in VM on ixgbevf when restoring from freeze or
    suspend

Sasha Neftin (4):
  igc: Remove useless forward declaration
  igc: Add NVM checksum validation
  igc: Remove unneeded PCI bus defines
  igc: Add tx_csum offload functionality

YueHaibing (1):
  iavf: remove unused debug function iavf_debug_d

 .../networking/device_drivers/intel/iavf.rst  | 115 ++++++++---
 drivers/net/ethernet/intel/e1000e/netdev.c    |  12 +-
 drivers/net/ethernet/intel/fm10k/fm10k_main.c |   8 +-
 drivers/net/ethernet/intel/i40e/i40e.h        |   1 -
 drivers/net/ethernet/intel/i40e/i40e_common.c |   1 +
 .../net/ethernet/intel/i40e/i40e_debugfs.c    |   4 -
 drivers/net/ethernet/intel/i40e/i40e_hmc.c    |   1 +
 .../net/ethernet/intel/i40e/i40e_lan_hmc.c    |  14 +-
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   1 +
 drivers/net/ethernet/intel/i40e/i40e_osdep.h  |   7 +-
 drivers/net/ethernet/intel/iavf/iavf.h        |   1 -
 drivers/net/ethernet/intel/iavf/iavf_main.c   |  26 ---
 drivers/net/ethernet/intel/igc/igc.h          |   4 +
 drivers/net/ethernet/intel/igc/igc_base.h     |   8 +
 drivers/net/ethernet/intel/igc/igc_defines.h  |   9 +-
 drivers/net/ethernet/intel/igc/igc_mac.c      |  73 ++++---
 drivers/net/ethernet/intel/igc/igc_main.c     | 106 ++++++++++
 drivers/net/ethernet/intel/igc/igc_phy.c      | 192 +++++++++---------
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |  16 +-
 .../net/ethernet/intel/ixgbevf/ixgbevf_main.c |   1 +
 20 files changed, 372 insertions(+), 228 deletions(-)

-- 
2.21.0


* general protection fault in tls_sk_proto_close (2)
From: syzbot @ 2019-08-28  6:38 UTC
  To: aviadye, borisp, daniel, davejwatson, davem, jakub.kicinski,
	john.fastabend, linux-kernel, netdev, syzkaller-bugs

Hello,

syzbot found the following crash on:

HEAD commit:    a55aa89a Linux 5.3-rc6
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=16c26ebc600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=2a6a2b9826fdadf9
dashboard link: https://syzkaller.appspot.com/bug?extid=7a6ee4d0078eac6bf782
compiler:       gcc (GCC) 9.0.0 20181231 (experimental)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=1112a4de600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+7a6ee4d0078eac6bf782@syzkaller.appspotmail.com

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 1 PID: 10290 Comm: syz-executor.0 Not tainted 5.3.0-rc6 #120
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:tls_sk_proto_close+0xe5/0x990 net/tls/tls_main.c:298
Code: 0f 85 3f 08 00 00 49 8b 84 24 c0 02 00 00 4d 8d 75 14 4c 89 f2 48 c1 ea 03 48 89 85 50 ff ff ff 48 b8 00 00 00 00 00 fc ff df <0f> b6 04 02 4c 89 f2 83 e2 07 38 d0 7f 08 84 c0 0f 85 2e 06 00 00
RSP: 0018:ffff88809b23fb90 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: ffffffff862cb8db
RDX: 0000000000000002 RSI: ffffffff862cb639 RDI: ffff8880a155ef00
RBP: ffff88809b23fc48 R08: ffff888094344640 R09: ffffed10142abd9a
R10: ffffed10142abd99 R11: ffff8880a155eccb R12: ffff8880a155ec40
R13: 0000000000000000 R14: 0000000000000014 R15: 0000000000000001
FS:  00005555556a8940(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f353458e000 CR3: 00000000a9174000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
  tls_sk_proto_close+0x35b/0x990 net/tls/tls_main.c:321
  tcp_bpf_close+0x17c/0x390 net/ipv4/tcp_bpf.c:582
  inet_release+0xed/0x200 net/ipv4/af_inet.c:427
  inet6_release+0x53/0x80 net/ipv6/af_inet6.c:470
  __sock_release+0xce/0x280 net/socket.c:590
  sock_close+0x1e/0x30 net/socket.c:1268
  __fput+0x2ff/0x890 fs/file_table.c:280
  ____fput+0x16/0x20 fs/file_table.c:313
  task_work_run+0x145/0x1c0 kernel/task_work.c:113
  tracehook_notify_resume include/linux/tracehook.h:188 [inline]
  exit_to_usermode_loop+0x316/0x380 arch/x86/entry/common.c:163
  prepare_exit_to_usermode arch/x86/entry/common.c:194 [inline]
  syscall_return_slowpath arch/x86/entry/common.c:274 [inline]
  do_syscall_64+0x5a9/0x6a0 arch/x86/entry/common.c:299
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x413540
Code: 01 f0 ff ff 0f 83 30 1b 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d 4d 2d 66 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 04 1b 00 00 c3 48 83 ec 08 e8 0a fc ff ff
RSP: 002b:00007fff5d481778 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000006 RCX: 0000000000413540
RDX: 0000001b2e520000 RSI: 0000000000000000 RDI: 0000000000000005
RBP: 0000000000000001 R08: 0000000000000000 R09: ffffffffffffffff
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000075bf20
R13: 0000000000000003 R14: 0000000000761220 R15: ffffffffffffffff
Modules linked in:
---[ end trace bdfd4385a0f1f76d ]---
RIP: 0010:tls_sk_proto_close+0xe5/0x990 net/tls/tls_main.c:298
Code: 0f 85 3f 08 00 00 49 8b 84 24 c0 02 00 00 4d 8d 75 14 4c 89 f2 48 c1 ea 03 48 89 85 50 ff ff ff 48 b8 00 00 00 00 00 fc ff df <0f> b6 04 02 4c 89 f2 83 e2 07 38 d0 7f 08 84 c0 0f 85 2e 06 00 00
RSP: 0018:ffff88809b23fb90 EFLAGS: 00010203
RAX: dffffc0000000000 RBX: dffffc0000000000 RCX: ffffffff862cb8db
RDX: 0000000000000002 RSI: ffffffff862cb639 RDI: ffff8880a155ef00
RBP: ffff88809b23fc48 R08: ffff888094344640 R09: ffffed10142abd9a
R10: ffffed10142abd99 R11: ffff8880a155eccb R12: ffff8880a155ec40
R13: 0000000000000000 R14: 0000000000000014 R15: 0000000000000001
FS:  00005555556a8940(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f353458e000 CR3: 00000000a9174000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches


* WARNING in smc_unhash_sk (3)
From: syzbot @ 2019-08-28  6:38 UTC
  To: davem, kgraul, linux-kernel, linux-s390, netdev, syzkaller-bugs,
	ubraun

Hello,

syzbot found the following crash on:

HEAD commit:    a55aa89a Linux 5.3-rc6
git tree:       upstream
console output: https://syzkaller.appspot.com/x/log.txt?x=112dd212600000
kernel config:  https://syzkaller.appspot.com/x/.config?x=58485246ad14eafe
dashboard link: https://syzkaller.appspot.com/bug?extid=8488cc4cf1c9e09b8b86
compiler:       clang version 9.0.0 (/home/glider/llvm/clang 80fee25776c2fb61e74c1ecb1a523375c2500b69)
syz repro:      https://syzkaller.appspot.com/x/repro.syz?x=15426ebc600000
C reproducer:   https://syzkaller.appspot.com/x/repro.c?x=116aca7a600000

IMPORTANT: if you fix the bug, please add the following tag to the commit:
Reported-by: syzbot+8488cc4cf1c9e09b8b86@syzkaller.appspotmail.com

------------[ cut here ]------------
WARNING: CPU: 0 PID: 9198 at ./include/net/sock.h:666 sk_del_node_init include/net/sock.h:666 [inline]
WARNING: CPU: 0 PID: 9198 at ./include/net/sock.h:666 smc_unhash_sk+0x21b/0x240 net/smc/af_smc.c:96
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 9198 Comm: syz-executor057 Not tainted 5.3.0-rc6 #93
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1d8/0x2f8 lib/dump_stack.c:113
  panic+0x25c/0x799 kernel/panic.c:219
  __warn+0x22f/0x230 kernel/panic.c:576
  report_bug+0x190/0x290 lib/bug.c:186
  fixup_bug arch/x86/kernel/traps.c:179 [inline]
  do_error_trap+0xd7/0x440 arch/x86/kernel/traps.c:272
  do_invalid_op+0x36/0x40 arch/x86/kernel/traps.c:291
  invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1028
RIP: 0010:sk_del_node_init include/net/sock.h:666 [inline]
RIP: 0010:smc_unhash_sk+0x21b/0x240 net/smc/af_smc.c:96
Code: 48 89 df e8 07 b1 39 00 48 83 c4 20 5b 41 5c 41 5d 41 5e 41 5f 5d c3 e8 03 d7 31 fa 48 c7 c7 f2 c3 3a 88 31 c0 e8 28 1d 1b fa <0f> 0b eb 85 44 89 f1 80 e1 07 80 c1 03 38 c1 0f 8c 5b ff ff ff 4c
RSP: 0018:ffff888094177b68 EFLAGS: 00010246
RAX: 0000000000000024 RBX: 0000000000000001 RCX: b964ece25f6b7c00
RDX: 0000000000000000 RSI: 0000000000000201 RDI: 0000000000000000
RBP: ffff888094177bb0 R08: ffffffff815cf7d4 R09: ffffed1015d46088
R10: ffffed1015d46088 R11: 0000000000000000 R12: ffff888098ccb240
R13: dffffc0000000000 R14: ffff888098ccb2c0 R15: ffff888098ccb268
  __smc_release+0x1f8/0x3a0 net/smc/af_smc.c:146
  smc_release+0x15b/0x2c0 net/smc/af_smc.c:185
  __sock_release net/socket.c:590 [inline]
  sock_close+0xe1/0x260 net/socket.c:1268
  __fput+0x2e4/0x740 fs/file_table.c:280
  ____fput+0x15/0x20 fs/file_table.c:313
  task_work_run+0x17e/0x1b0 kernel/task_work.c:113
  exit_task_work include/linux/task_work.h:22 [inline]
  do_exit+0x5e8/0x21a0 kernel/exit.c:879
  do_group_exit+0x15c/0x2b0 kernel/exit.c:983
  __do_sys_exit_group+0x17/0x20 kernel/exit.c:994
  __se_sys_exit_group+0x14/0x20 kernel/exit.c:992
  __x64_sys_exit_group+0x3b/0x40 kernel/exit.c:992
  do_syscall_64+0xfe/0x140 arch/x86/entry/common.c:296
  entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x43ff28
Code: 00 00 be 3c 00 00 00 eb 19 66 0f 1f 84 00 00 00 00 00 48 89 d7 89 f0 0f 05 48 3d 00 f0 ff ff 77 21 f4 48 89 d7 44 89 c0 0f 05 <48> 3d 00 f0 ff ff 76 e0 f7 d8 64 41 89 01 eb d8 0f 1f 84 00 00 00
RSP: 002b:00007ffefacce238 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000043ff28
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 00000000004bf750 R08: 00000000000000e7 R09: ffffffffffffffd0
R10: 00000000200000c0 R11: 0000000000000246 R12: 0000000000000001
R13: 00000000006d1180 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..


---
This bug is generated by a bot. It may contain errors.
See https://goo.gl/tpsmEJ for more information about syzbot.
syzbot engineers can be reached at syzkaller@googlegroups.com.

syzbot will keep track of this bug report. See:
https://goo.gl/tpsmEJ#status for how to communicate with syzbot.
syzbot can test patches for this bug, for details see:
https://goo.gl/tpsmEJ#testing-patches


* Re: [PATCH net-next v2 2/3] dt-bindings: net: dsa: mt7530: Add support for port 5
From: René van Dorst @ 2019-08-28  6:35 UTC
  To: Rob Herring
  Cc: Sean Wang, Andrew Lunn, Vivien Didelot, Florian Fainelli,
	David S . Miller, Matthias Brugger, netdev, linux-arm-kernel,
	linux-mediatek, John Crispin, linux-mips, Frank Wunderlich,
	devicetree
In-Reply-To: <20190827222251.GA30507@bogus>

Hi Rob,

Quoting Rob Herring <robh@kernel.org>:

> On Wed, Aug 21, 2019 at 04:45:46PM +0200, René van Dorst wrote:
>> MT7530 port 5 has many modes/configurations.
>> Update the documentation how to use port 5.
>>
>> Signed-off-by: René van Dorst <opensource@vdorst.com>
>> Cc: devicetree@vger.kernel.org
>> Cc: Rob Herring <robh@kernel.org>
>
>> v1->v2:
>> * Adding extra note about RGMII2 and gpio use.
>> rfc->v1:
>> * No change
>
> The changelog goes below the '---'
>

Thanks for the review,
I shall fix that.

>> ---
>>  .../devicetree/bindings/net/dsa/mt7530.txt    | 218 ++++++++++++++++++
>>  1 file changed, 218 insertions(+)
>>
>> diff --git a/Documentation/devicetree/bindings/net/dsa/mt7530.txt b/Documentation/devicetree/bindings/net/dsa/mt7530.txt
>> index 47aa205ee0bd..43993aae3f9c 100644
>> --- a/Documentation/devicetree/bindings/net/dsa/mt7530.txt
>> +++ b/Documentation/devicetree/bindings/net/dsa/mt7530.txt
>> @@ -35,6 +35,42 @@ Required properties for the child nodes within ports container:
>>  - phy-mode: String, must be either "trgmii" or "rgmii" for port labeled
>>  	 "cpu".
>>
>> +Port 5 of the switch is muxed between:
>> +1. GMAC5: GMAC5 can interface with another external MAC or PHY.
>> +2. PHY of port 0 or port 4: PHY interfaces with an external MAC like 2nd GMAC
>> +   of the SOC. Used in many setups where port 0/4 becomes the WAN port.
>> +   Note: On a MT7621 SOC with integrated switch: 2nd GMAC can only connected to
>> +	 GMAC5 when the gpios for RGMII2 (GPIO 22-33) are not used and not
>> +	 connected to external component!
>> +
>> +Port 5 modes/configurations:
>> +1. Port 5 is disabled and isolated: An external phy can interface to the 2nd
>> +   GMAC of the SOC.
>> +   In the case of a build-in MT7530 switch, port 5 shares the RGMII bus with 2nd
>> +   GMAC and an optional external phy. Mind the GPIO/pinctl settings of the SOC!
>> +2. Port 5 is muxed to PHY of port 0/4: Port 0/4 interfaces with 2nd GMAC.
>> +   It is a simple MAC to PHY interface, port 5 needs to be setup for xMII mode
>> +   and RGMII delay.
>> +3. Port 5 is muxed to GMAC5 and can interface to an external phy.
>> +   Port 5 becomes an extra switch port.
>> +   Only works on platform where external phy TX<->RX lines are swapped.
>> +   Like in the Ubiquiti ER-X-SFP.
>> +4. Port 5 is muxed to GMAC5 and interfaces with the 2nd GAMC as 2nd CPU port.
>> +   Currently a 2nd CPU port is not supported by DSA code.
>> +
>> +Depending on how the external PHY is wired:
>> +1. normal: The PHY can only connect to 2nd GMAC but not to the switch
>> +2. swapped: RGMII TX, RX are swapped; external phy interface with the switch as
>> +   a ethernet port. But can't interface to the 2nd GMAC.
>> +
>> +Based on the DT the port 5 mode is configured.
>> +
>> +Driver tries to lookup the phy-handle of the 2nd GMAC of the master device.
>> +When phy-handle matches PHY of port 0 or 4 then port 5 set-up as mode 2.
>> +phy-mode must be set, see also example 2 below!
>> + * mt7621: phy-mode = "rgmii-txid";
>> + * mt7623: phy-mode = "rgmii";
>> +
>>  See Documentation/devicetree/bindings/net/dsa/dsa.txt for a list of additional
>>  required, optional properties and how the integrated switch subnodes must
>>  be specified.
>> @@ -94,3 +130,185 @@ Example:
>>  			};
>>  		};
>>  	};
>> +
>> +Example 2: MT7621: Port 4 is WAN port: 2nd GMAC -> Port 5 -> PHY port 4.
>> +
>> +&eth {
>> +	status = "okay";
>
> Don't show status in examples.

OK.

> This should show the complete node.
>

To be clear: I should take the ethernet node from mt7621.dtsi [0] or
mt7623.dtsi [1] and insert the example below, right?

Greets,

René

[0]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/drivers/staging/mt7621-dts/mt7621.dtsi#n397
[1]: https://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git/tree/arch/arm/boot/dts/mt7623.dtsi#n1023

>> +
>> +	gmac0: mac@0 {
>> +		compatible = "mediatek,eth-mac";
>> +		reg = <0>;
>> +		phy-mode = "rgmii";
>> +
>> +		fixed-link {
>> +			speed = <1000>;
>> +			full-duplex;
>> +			pause;
>> +		};
>> +	};
>> +
>> +	gmac1: mac@1 {
>> +		compatible = "mediatek,eth-mac";
>> +		reg = <1>;
>> +		phy-mode = "rgmii-txid";
>> +		phy-handle = <&phy4>;
>> +	};
>> +
>> +	mdio: mdio-bus {
>> +		#address-cells = <1>;
>> +		#size-cells = <0>;
>> +
>> +		/* Internal phy */
>> +		phy4: ethernet-phy@4 {
>> +			reg = <4>;
>> +		};
>> +
>> +		mt7530: switch@1f {
>> +			compatible = "mediatek,mt7621";
>> +			#address-cells = <1>;
>> +			#size-cells = <0>;
>> +			reg = <0x1f>;
>> +			pinctrl-names = "default";
>> +			mediatek,mcm;
>> +
>> +			resets = <&rstctrl 2>;
>> +			reset-names = "mcm";
>> +
>> +			ports {
>> +				#address-cells = <1>;
>> +				#size-cells = <0>;
>> +
>> +				port@0 {
>> +					reg = <0>;
>> +					label = "lan0";
>> +				};
>> +
>> +				port@1 {
>> +					reg = <1>;
>> +					label = "lan1";
>> +				};
>> +
>> +				port@2 {
>> +					reg = <2>;
>> +					label = "lan2";
>> +				};
>> +
>> +				port@3 {
>> +					reg = <3>;
>> +					label = "lan3";
>> +				};
>> +
>> +/* Commented out. Port 4 is handled by 2nd GMAC.
>> +				port@4 {
>> +					reg = <4>;
>> +					label = "lan4";
>> +				};
>> +*/
>> +
>> +				cpu_port0: port@6 {
>> +					reg = <6>;
>> +					label = "cpu";
>> +					ethernet = <&gmac0>;
>> +					phy-mode = "rgmii";
>> +
>> +					fixed-link {
>> +						speed = <1000>;
>> +						full-duplex;
>> +						pause;
>> +					};
>> +				};
>> +			};
>> +		};
>> +	};
>> +};
>> +
>> +Example 3: MT7621: Port 5 is connected to external PHY: Port 5 -> external PHY.
>> +
>> +&eth {
>> +	status = "okay";
>> +
>> +	gmac0: mac@0 {
>> +		compatible = "mediatek,eth-mac";
>> +		reg = <0>;
>> +		phy-mode = "rgmii";
>> +
>> +		fixed-link {
>> +			speed = <1000>;
>> +			full-duplex;
>> +			pause;
>> +		};
>> +	};
>> +
>> +	mdio: mdio-bus {
>> +		#address-cells = <1>;
>> +		#size-cells = <0>;
>> +
>> +		/* External phy */
>> +		ephy5: ethernet-phy@7 {
>> +			reg = <7>;
>> +		};
>> +
>> +		mt7530: switch@1f {
>> +			compatible = "mediatek,mt7621";
>> +			#address-cells = <1>;
>> +			#size-cells = <0>;
>> +			reg = <0x1f>;
>> +			pinctrl-names = "default";
>> +			mediatek,mcm;
>> +
>> +			resets = <&rstctrl 2>;
>> +			reset-names = "mcm";
>> +
>> +			ports {
>> +				#address-cells = <1>;
>> +				#size-cells = <0>;
>> +
>> +				port@0 {
>> +					reg = <0>;
>> +					label = "lan0";
>> +				};
>> +
>> +				port@1 {
>> +					reg = <1>;
>> +					label = "lan1";
>> +				};
>> +
>> +				port@2 {
>> +					reg = <2>;
>> +					label = "lan2";
>> +				};
>> +
>> +				port@3 {
>> +					reg = <3>;
>> +					label = "lan3";
>> +				};
>> +
>> +				port@4 {
>> +					reg = <4>;
>> +					label = "lan4";
>> +				};
>> +
>> +				port@5 {
>> +					reg = <5>;
>> +					label = "lan5";
>> +					phy-mode = "rgmii";
>> +					phy-handle = <&ephy5>;
>> +				};
>> +
>> +				cpu_port0: port@6 {
>> +					reg = <6>;
>> +					label = "cpu";
>> +					ethernet = <&gmac0>;
>> +					phy-mode = "rgmii";
>> +
>> +					fixed-link {
>> +						speed = <1000>;
>> +						full-duplex;
>> +						pause;
>> +					};
>> +				};
>> +			};
>> +		};
>> +	};
>> +};
>> --
>> 2.20.1
>>





* Re: [PATCH net] net/sched: pfifo_fast: fix wrong dereference in pfifo_fast_enqueue
From: Paolo Abeni @ 2019-08-28  6:31 UTC
  To: Davide Caratti, Cong Wang, Jamal Hadi Salim, Jiri Pirko,
	David S. Miller, netdev
  Cc: Stefano Brivio, Li Shuang
In-Reply-To: <d5a7a167ab57e035685445ee641840a0c5fd39ae.1566940693.git.dcaratti@redhat.com>

On Tue, 2019-08-27 at 23:18 +0200, Davide Caratti wrote:
> Now that 'TCQ_F_CPUSTATS' bit can be cleared, depending on the value of
> 'TCQ_F_NOLOCK' bit in the parent qdisc, we can't assume anymore that
> per-cpu counters are there in the error path of skb_array_produce().
> Otherwise, the following splat can be seen:
> 
>  Unable to handle kernel paging request at virtual address 0000600dea430008
>  Mem abort info:
>    ESR = 0x96000005
>    Exception class = DABT (current EL), IL = 32 bits
>    SET = 0, FnV = 0
>    EA = 0, S1PTW = 0
>  Data abort info:
>    ISV = 0, ISS = 0x00000005
>    CM = 0, WnR = 0
>  user pgtable: 64k pages, 48-bit VAs, pgdp = 000000007b97530e
>  [0000600dea430008] pgd=0000000000000000, pud=0000000000000000
>  Internal error: Oops: 96000005 [#1] SMP
> [...]
>  pstate: 10000005 (nzcV daif -PAN -UAO)
>  pc : pfifo_fast_enqueue+0x524/0x6e8
>  lr : pfifo_fast_enqueue+0x46c/0x6e8
>  sp : ffff800d39376fe0
>  x29: ffff800d39376fe0 x28: 1ffff001a07d1e40
>  x27: ffff800d03e8f188 x26: ffff800d03e8f200
>  x25: 0000000000000062 x24: ffff800d393772f0
>  x23: 0000000000000000 x22: 0000000000000403
>  x21: ffff800cca569a00 x20: ffff800d03e8ee00
>  x19: ffff800cca569a10 x18: 00000000000000bf
>  x17: 0000000000000000 x16: 0000000000000000
>  x15: 0000000000000000 x14: ffff1001a726edd0
>  x13: 1fffe4000276a9a4 x12: 0000000000000000
>  x11: dfff200000000000 x10: ffff800d03e8f1a0
>  x9 : 0000000000000003 x8 : 0000000000000000
>  x7 : 00000000f1f1f1f1 x6 : ffff1001a726edea
>  x5 : ffff800cca56a53c x4 : 1ffff001bf9a8003
>  x3 : 1ffff001bf9a8003 x2 : 1ffff001a07d1dcb
>  x1 : 0000600dea430000 x0 : 0000600dea430008
>  Process ping (pid: 6067, stack limit = 0x00000000dc0aa557)
>  Call trace:
>   pfifo_fast_enqueue+0x524/0x6e8
>   htb_enqueue+0x660/0x10e0 [sch_htb]
>   __dev_queue_xmit+0x123c/0x2de0
>   dev_queue_xmit+0x24/0x30
>   ip_finish_output2+0xc48/0x1720
>   ip_finish_output+0x548/0x9d8
>   ip_output+0x334/0x788
>   ip_local_out+0x90/0x138
>   ip_send_skb+0x44/0x1d0
>   ip_push_pending_frames+0x5c/0x78
>   raw_sendmsg+0xed8/0x28d0
>   inet_sendmsg+0xc4/0x5c0
>   sock_sendmsg+0xac/0x108
>   __sys_sendto+0x1ac/0x2a0
>   __arm64_sys_sendto+0xc4/0x138
>   el0_svc_handler+0x13c/0x298
>   el0_svc+0x8/0xc
>  Code: f9402e80 d538d081 91002000 8b010000 (885f7c03)
> 
> Fix this by testing the value of 'TCQ_F_CPUSTATS' bit in 'qdisc->flags',
> before dereferencing 'qdisc->cpu_qstats'.
> 
> Fixes: 8a53e616de29 ("net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too")
> CC: Paolo Abeni <pabeni@redhat.com>
> CC: Stefano Brivio <sbrivio@redhat.com>
> Reported-by: Li Shuang <shuali@redhat.com>
> Signed-off-by: Davide Caratti <dcaratti@redhat.com>
> ---
>  net/sched/sch_generic.c | 8 ++++++--
>  1 file changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index 099797e5409d..137db1cbde85 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -624,8 +624,12 @@ static int pfifo_fast_enqueue(struct sk_buff *skb, struct Qdisc *qdisc,
>  
>  	err = skb_array_produce(q, skb);
>  
> -	if (unlikely(err))
> -		return qdisc_drop_cpu(skb, qdisc, to_free);
> +	if (unlikely(err)) {
> +		if (qdisc_is_percpu_stats(qdisc))
> +			return qdisc_drop_cpu(skb, qdisc, to_free);
> +		else
> +			return qdisc_drop(skb, qdisc, to_free);
> +	}
>  
>  	qdisc_update_stats_at_enqueue(qdisc, pkt_len);
>  	return NET_XMIT_SUCCESS;

LGTM, thanks Davide!

I just did a code audit of the other pfifo_fast callbacks, and I think
this is the last spot in need of such a fix.

Acked-by: Paolo Abeni <pabeni@redhat.com>
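
For readers following along, the branch added by the patch can be
modelled outside the kernel. This is only a sketch under stated
assumptions: `struct fake_qdisc`, `struct fake_qstats`,
`fake_is_percpu_stats()` and `fake_enqueue_failed()` are hypothetical
stand-ins, and the flag value is illustrative rather than taken from the
kernel headers.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag value; stands in for TCQ_F_CPUSTATS. */
#define FAKE_TCQ_F_CPUSTATS 0x20

struct fake_qstats {
	unsigned int drops;
};

/* Hypothetical model of the fields the fix cares about. */
struct fake_qdisc {
	unsigned int flags;
	struct fake_qstats qstats;      /* always present */
	struct fake_qstats *cpu_qstats; /* NULL when per-CPU stats are off */
};

/* Mirrors the idea of qdisc_is_percpu_stats(): test the flag, not the pointer. */
static bool fake_is_percpu_stats(const struct fake_qdisc *q)
{
	return (q->flags & FAKE_TCQ_F_CPUSTATS) != 0;
}

/*
 * Error path of enqueue: account the drop in whichever counter exists.
 * Before the fix, the per-CPU counter was used unconditionally, which
 * dereferences an invalid pointer once the flag has been cleared.
 */
static void fake_enqueue_failed(struct fake_qdisc *q)
{
	if (fake_is_percpu_stats(q))
		q->cpu_qstats->drops++;
	else
		q->qstats.drops++;
}
```

With the flag cleared and `cpu_qstats` set to NULL, the drop lands in
`qstats` instead of faulting, which is the case the splat above came from.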



* [PATCH] sky2: Disable MSI on yet another ASUS boards (P6Xxxx)
From: Takashi Iwai @ 2019-08-28  6:31 UTC
  To: Mirko Lindner, Stephen Hemminger
  Cc: David S . Miller, netdev, linux-kernel, SteveM

A similar workaround for the suspend/resume problem is needed for yet
another set of ASUS machines, the P6X models.  As with the previous fix,
the BIOS doesn't provide the standard DMI_SYS_* entry, so again
DMI_BOARD_* entries are used instead.

Reported-and-tested-by: SteveM <swm@swm1.com>
Signed-off-by: Takashi Iwai <tiwai@suse.de>
---
 drivers/net/ethernet/marvell/sky2.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/drivers/net/ethernet/marvell/sky2.c b/drivers/net/ethernet/marvell/sky2.c
index c2e00bb587cd..5f56ee83e3b1 100644
--- a/drivers/net/ethernet/marvell/sky2.c
+++ b/drivers/net/ethernet/marvell/sky2.c
@@ -4931,6 +4931,13 @@ static const struct dmi_system_id msi_blacklist[] = {
 			DMI_MATCH(DMI_BOARD_NAME, "P6T"),
 		},
 	},
+	{
+		.ident = "ASUS P6X",
+		.matches = {
+			DMI_MATCH(DMI_BOARD_VENDOR, "ASUSTeK Computer INC."),
+			DMI_MATCH(DMI_BOARD_NAME, "P6X"),
+		},
+	},
 	{}
 };

-- 
2.16.4
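
How the new table entry would be matched can be sketched in user space.
The sketch assumes DMI_MATCH() uses substring semantics, so that "P6X"
covers a whole family of board names. `struct fake_dmi_id`,
`fake_msi_blacklist` and `fake_dmi_check()` are hypothetical names, and
the board name used in the usage note is only an example input.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct dmi_system_id. */
struct fake_dmi_id {
	const char *ident;
	const char *board_vendor;
	const char *board_name;
};

/* Only the entry added by this patch is modelled here. */
static const struct fake_dmi_id fake_msi_blacklist[] = {
	{ "ASUS P6X", "ASUSTeK Computer INC.", "P6X" },
	{ NULL, NULL, NULL } /* terminator, like the empty {} entry */
};

/*
 * Returns the ident of the first entry whose vendor and board name both
 * occur as substrings of the reported DMI strings, or NULL if none match.
 */
static const char *fake_dmi_check(const char *vendor, const char *name)
{
	const struct fake_dmi_id *id;

	for (id = fake_msi_blacklist; id->ident; id++) {
		if (strstr(vendor, id->board_vendor) &&
		    strstr(name, id->board_name))
			return id->ident;
	}
	return NULL;
}
```

Under that assumption, a board reporting a name such as "P6X58D-E" with
the ASUSTeK vendor string would match the entry and have MSI disabled,
while boards from other vendors would not.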


* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Andy Lutomirski @ 2019-08-28  6:20 UTC
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Alexei Starovoitov, Kees Cook, LSM List,
	James Morris, Jann Horn, Peter Zijlstra, Masami Hiramatsu,
	Steven Rostedt, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20190828044903.nv3hvinkkolnnxtv@ast-mbp.dhcp.thefacebook.com>

On Tue, Aug 27, 2019 at 9:49 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 27, 2019 at 07:00:40PM -0700, Andy Lutomirski wrote:
> >
> > Let me put this a bit differently. Part of the point is that
> > CAP_TRACING should allow a user or program to trace without being able
> > to corrupt the system. CAP_BPF as you’ve proposed it *can* likely
> > crash the system.
>
> Really? I'm still waiting for your example where bpf+kprobe crashes the system...
>

That's not what I meant.  bpf+kprobe causing a crash is a bug.  I'm
referring to a totally different issue.  On my laptop:

$ sudo bpftool map
48: hash  name foobar  flags 0x0
    key 8B  value 8B  max_entries 64  memlock 8192B
181: lpm_trie  flags 0x1
    key 8B  value 8B  max_entries 1  memlock 4096B
182: lpm_trie  flags 0x1
    key 20B  value 8B  max_entries 1  memlock 4096B
183: lpm_trie  flags 0x1
    key 8B  value 8B  max_entries 1  memlock 4096B
184: lpm_trie  flags 0x1
    key 20B  value 8B  max_entries 1  memlock 4096B
185: lpm_trie  flags 0x1
    key 8B  value 8B  max_entries 1  memlock 4096B
186: lpm_trie  flags 0x1
    key 20B  value 8B  max_entries 1  memlock 4096B
187: lpm_trie  flags 0x1
    key 8B  value 8B  max_entries 1  memlock 4096B
188: lpm_trie  flags 0x1
    key 20B  value 8B  max_entries 1  memlock 4096B

$ sudo bpftool map dump id 186
key:
00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
00 00 00 00
value:
02 00 00 00 00 00 00 00
Found 1 element

$ sudo bpftool map delete id 186 key hex 00 00 00 00 00 00 00 00 00 00
00 00 00 00 00 00 00 00 00 00
[this worked]

I don't know what my laptop was doing with map id 186 in particular,
but, whatever it was, I definitely broke it.  If a BPF firewall is in
use on something important enough, this could easily remove
connectivity from part or all of the system.  Right now, this needs
CAP_SYS_ADMIN.  With your patch, CAP_BPF is sufficient to do this, but
you *also* need CAP_BPF to trace the system using BPF.  Tracing with
BPF is 'safe' in the absence of bugs.  Modifying other people's maps
is not.

One possible answer to this would be to limit CAP_BPF to the subset of
BPF that is totally safe in the absence of bugs (e.g. loading most
program types if they don't have dangerous BPF_CALL instructions but
not *_BY_ID).  Another answer would be to say that CAP_BPF will not be
needed by future unprivileged bpf mechanisms, and that CAP_TRACING
plus unprivileged bpf will be enough to do most or all interesting BPF
tracing operations.

If the answer is the latter, then maybe it would make sense to try to
implement some of the unprivileged bpf stuff and then to see whether
CAP_BPF is still needed.
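
For reference, the bpftool delete shown above boils down to two bpf()
commands: look the map up by ID, then delete a key through the returned
fd. The following is a rough sketch using the raw syscall; the helper
names `sys_bpf()` and `delete_elem_by_map_id()` are made up for
illustration, and the caller is assumed to pass a key buffer of the
map's key size.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the bpf(2) syscall; glibc provides no wrapper. */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/*
 * Delete one element from a map that belongs to someone else, found
 * only by its global ID. Returns 0 on success, -1 on any failure
 * (missing privilege, unknown ID, unknown key).
 */
static int delete_elem_by_map_id(uint32_t map_id, const void *key)
{
	union bpf_attr attr;
	long ret;
	int fd;

	/* Step 1: the *_BY_ID lookup turns a global map ID into an fd. */
	memset(&attr, 0, sizeof(attr));
	attr.map_id = map_id;
	fd = (int)sys_bpf(BPF_MAP_GET_FD_BY_ID, &attr);
	if (fd < 0)
		return -1;

	/* Step 2: with the fd in hand, modify the map. */
	memset(&attr, 0, sizeof(attr));
	attr.map_fd = (uint32_t)fd;
	attr.key = (uint64_t)(uintptr_t)key;
	ret = sys_bpf(BPF_MAP_DELETE_ELEM, &attr);

	close(fd);
	return ret < 0 ? -1 : 0;
}
```

The privilege check that matters for this discussion sits in step 1: the
lookup by ID is what hands out access to maps the caller did not create.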


* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Andy Lutomirski @ 2019-08-28  6:12 UTC
  To: Alexei Starovoitov
  Cc: Andy Lutomirski, Alexei Starovoitov, Kees Cook, LSM List,
	James Morris, Jann Horn, Peter Zijlstra, Masami Hiramatsu,
	Steven Rostedt, David S. Miller, Daniel Borkmann,
	Network Development, bpf, kernel-team, Linux API
In-Reply-To: <20190828044340.zeha3k3cmmxgfqj7@ast-mbp.dhcp.thefacebook.com>

On Tue, Aug 27, 2019 at 9:43 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Aug 27, 2019 at 05:55:41PM -0700, Andy Lutomirski wrote:
> >
> > I was hoping for something in Documentation/admin-guide, not in a
> > changelog that's hard to find.
>
> eventually yes.
>
> > >
> > > > Changing the capability that some existing operation requires could
> > > > break existing programs.  The old capability may need to be accepted
> > > > as well.
> > >
> > > As far as I can see there is no ABI breakage. Please point out
> > > which line of the patch may break it.
> >
> > As a more or less arbitrary selection:
> >
> >  void bpf_prog_kallsyms_add(struct bpf_prog *fp)
> >  {
> >         if (!bpf_prog_kallsyms_candidate(fp) ||
> > -           !capable(CAP_SYS_ADMIN))
> > +           !capable(CAP_BPF))
> >                 return;
> >
> > Before your patch, a task with CAP_SYS_ADMIN could do this.  Now it
> > can't.  Per the usual Linux definition of "ABI break", this is an ABI
> > break if and only if someone actually did this in a context where they
> > have CAP_SYS_ADMIN but not all capabilities.  How confident are you
> > that no one does things like this?
>
> Yes. I'm confident that apps don't drop everything and
> leave cap_sys_admin only before doing bpf() syscall, since it would
> break their own use of networking.
> Hence I'm not going to do the cap_syslog-like "deprecated" message mess
> because of this unfounded concern.
> If I turn out to be wrong we will add this "deprecated mess" later.
>
> >
> > From the previous discussion, you want to make progress toward solving
> > a lot of problems with CAP_BPF.  One of them was making BPF
> > firewalling more generally useful. By making CAP_BPF grant the ability
> > to read kernel memory, you will make administrators much more nervous
> > to grant CAP_BPF.
>
> Andy, was your email hacked?
> I explained several times that in this proposal
> CAP_BPF _and_ CAP_TRACING _both_ are necessary to read kernel memory.
> CAP_BPF alone is _not enough_.

You have indeed said this many times.  You've stated it as a matter of
fact as though it cannot possibly be discussed.  I'm asking you to
justify it.

> > Similarly, and correct me if I'm wrong, most of
> > these capabilities are primarily or only useful for tracing, so I
> > don't see why users without CAP_TRACING should get them.
> > bpf_trace_printk(), in particular, even has "trace" in its name :)
> >
> > Also, if a task has CAP_TRACING, it's expected to be able to trace the
> > system -- that's the whole point.  Why shouldn't it be able to use BPF
> > to trace the system better?
>
> CAP_TRACING shouldn't be able to do BPF because BPF is not tracing only.

What does "do BPF" even mean?  seccomp() does BPF.  SO_ATTACH_FILTER
does BPF.  Saying that using BPF should require a specific capability
seems kind of like saying that using the network should require a
specific capability.  Linux (and Unixy systems in general) distinguish
between binding low-number ports, binding high-number ports, using raw
sockets, and changing the system's IP address.  These have different
implications and require different capabilities.
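
For reference, the network distinctions map onto capabilities roughly
as in the lookup below. This is a sketch, not kernel code: the enum and
net_op_capability() are invented for illustration, the capability names
are the standard ones, and the 1024 boundary is the default one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operation labels used only by this sketch. */
enum net_op {
	NET_BIND_LOW_PORT,	/* bind() to a port below 1024 */
	NET_BIND_HIGH_PORT,	/* bind() to a port at or above 1024 */
	NET_RAW_SOCKET,		/* open a raw socket */
	NET_SET_ADDRESS,	/* change an interface's IP address */
};

/* Name of the capability the operation needs, or NULL if none is needed. */
const char *net_op_capability(enum net_op op)
{
	switch (op) {
	case NET_BIND_LOW_PORT:
		return "CAP_NET_BIND_SERVICE";
	case NET_BIND_HIGH_PORT:
		return NULL;
	case NET_RAW_SOCKET:
		return "CAP_NET_RAW";
	case NET_SET_ADDRESS:
		return "CAP_NET_ADMIN";
	}
	return NULL;
}
```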

It seems like you are specifically trying to add a new switch to turn
as much of BPF as possible on and off.  Why?

> >
> > test_run allows fully controlled inputs, in a context where a program
> > can trivially flush caches, mistrain branch predictors, etc first.  It
> > seems to me that, if a JITted bpf program contains an exploitable
> > speculation gadget (MDS, Spectre v1, RSB, or anything else),
>
> speaking of MDS... I already asked you to help investigate its
> applicability with existing bpf exposure. Are you going to do that?

I am blissfully uninvolved in MDS, and I don't know all that much more
about the overall mechanism than a random reader of tech news :)  ISTM
there are two meaningful ways that BPF could be involved: a BPF
program could leak info into the state exposed by MDS, or a BPF
program could try to read that state.  From what little I understand,
it's essentially inevitable that BPF leaks information into MDS state,
and this is probably even controllable by an attacker that understands
MDS in enough detail.  So the interesting questions are: can BPF be
used to read MDS state and can BPF be used to leak information in a
more useful way than the rest of the kernel to an attacker.

Keeping in mind that the kernel will flush MDS state on every exit to
usermode, I think the most likely attack is to try to read MDS state
with BPF.  This could happen, I suppose -- BPF programs can easily
contain the usual speculation gadgets of "do something and read an
address that depends on the outcome".  Fortunately, outside of
bpf_probe_read(), AFAIK BPF programs can't directly touch user memory,
and an attacker that is allowed to use bpf_probe_read() doesn't need
MDS to read things.

So it's not entirely obvious to me how an attack would be mounted.
test_run would make it a lot easier, I think.

>
> > it will
> > be *much* easier to exploit it using test_run than using normal
> > network traffic.  Similarly, normal network traffic will have network
> > headers that are valid enough to have caused the BPF program to be
> > invoked in the first place.  test_run can inject arbitrary garbage.
>
> Please take a look at Jann's var1 exploit. Was it hard to run bpf prog
> in controlled environment without test_run command ?
>

Can you send me a link?

^ permalink raw reply

* [PATCH net 2/2] nfp: flower: handle neighbour events on internal ports
From: Jakub Kicinski @ 2019-08-28  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, oss-drivers, John Hurley, Simon Horman, Jakub Kicinski
In-Reply-To: <20190828055630.17331-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Recent code changes to NFP allowed the offload of neighbour entries to FW
when the next hop device was an internal port. This allows for offload of
tunnel encap when the end-point IP address is applied to such a port.

Unfortunately, the neighbour event handler still rejects events that are
not associated with a repr dev and so the firmware neighbour table may get
out of sync for internal ports.

Fix this by allowing internal port neighbour events to be correctly
processed.
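
As a small standalone model of the filter change (the function names
are made up for this sketch, and the two booleans stand in for
nfp_netdev_is_nfp_repr() and nfp_flower_internal_port_can_offload()):

```c
#include <assert.h>
#include <stdbool.h>

/* Before the fix: every event not tied to a repr dev was ignored. */
bool neigh_event_ignored_old(bool dev_is_repr)
{
	return !dev_is_repr;
}

/*
 * After the fix: an event is ignored (NOTIFY_DONE in the driver) only
 * when the device is neither a repr nor an offloadable internal port.
 */
bool neigh_event_ignored_new(bool dev_is_repr, bool dev_is_internal_port)
{
	return !dev_is_repr && !dev_is_internal_port;
}
```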

Fixes: 45756dfedab5 ("nfp: flower: allow tunnels to output to internal port")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
index a7a80f4b722a..f0ee982eb1b5 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c
@@ -328,13 +328,13 @@ nfp_tun_neigh_event_handler(struct notifier_block *nb, unsigned long event,
 
 	flow.daddr = *(__be32 *)n->primary_key;
 
-	/* Only concerned with route changes for representors. */
-	if (!nfp_netdev_is_nfp_repr(n->dev))
-		return NOTIFY_DONE;
-
 	app_priv = container_of(nb, struct nfp_flower_priv, tun.neigh_nb);
 	app = app_priv->app;
 
+	if (!nfp_netdev_is_nfp_repr(n->dev) &&
+	    !nfp_flower_internal_port_can_offload(app, n->dev))
+		return NOTIFY_DONE;
+
 	/* Only concerned with changes to routes already added to NFP. */
 	if (!nfp_tun_has_route(app, flow.daddr))
 		return NOTIFY_DONE;
-- 
2.21.0


^ permalink raw reply related

* [PATCH net 1/2] nfp: flower: prevent ingress block binds on internal ports
From: Jakub Kicinski @ 2019-08-28  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, oss-drivers, John Hurley, Jakub Kicinski
In-Reply-To: <20190828055630.17331-1-jakub.kicinski@netronome.com>

From: John Hurley <john.hurley@netronome.com>

Internal port TC offload is implemented through user-space applications
(such as OvS) by adding filters at egress via TC clsact qdiscs. Indirect
block offload support in the NFP driver accepts both ingress qdisc binds
and egress binds if the device is an internal port. However, clsact sends
bind notification for both ingress and egress block binds which can lead
to the driver registering multiple callbacks and receiving multiple
notifications of new filters.

Fix this by rejecting ingress block bind callbacks when the port is
internal and only adding filter callbacks for egress binds.
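
The old and new conditions can be compared with a small standalone
model. This is only a sketch: the enum is a stand-in for the two
FLOW_BLOCK_BINDER_TYPE_CLSACT_* values (other binder types are left
out), and the function names are invented here.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the clsact ingress and egress binder types. */
enum binder { BINDER_INGRESS, BINDER_EGRESS };

/* Old check: true means the bind is rejected with -EOPNOTSUPP. */
bool bind_rejected_old(enum binder type, bool internal_port)
{
	return type != BINDER_INGRESS &&
	       !(type == BINDER_EGRESS && internal_port);
}

/* New check: ingress only for normal ports, egress only for internal ones. */
bool bind_rejected_new(enum binder type, bool internal_port)
{
	return (type != BINDER_INGRESS && !internal_port) ||
	       (type != BINDER_EGRESS && internal_port);
}
```

The only combination whose result changes is an ingress bind on an
internal port: it was accepted before and is rejected now, which is
what stops the second callback registration.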

Fixes: 4d12ba42787b ("nfp: flower: allow offloading of matches on 'internal' ports")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/flower/offload.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/flower/offload.c b/drivers/net/ethernet/netronome/nfp/flower/offload.c
index 9917d64694c6..457bdc60f3ee 100644
--- a/drivers/net/ethernet/netronome/nfp/flower/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/flower/offload.c
@@ -1409,9 +1409,10 @@ nfp_flower_setup_indr_tc_block(struct net_device *netdev, struct nfp_app *app,
 	struct nfp_flower_priv *priv = app->priv;
 	struct flow_block_cb *block_cb;
 
-	if (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS &&
-	    !(f->binder_type == FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS &&
-	      nfp_flower_internal_port_can_offload(app, netdev)))
+	if ((f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_INGRESS &&
+	     !nfp_flower_internal_port_can_offload(app, netdev)) ||
+	    (f->binder_type != FLOW_BLOCK_BINDER_TYPE_CLSACT_EGRESS &&
+	     nfp_flower_internal_port_can_offload(app, netdev)))
 		return -EOPNOTSUPP;
 
 	switch (f->command) {
-- 
2.21.0


^ permalink raw reply related

* [PATCH net 0/2] nfp: flower: fix bugs in merge tunnel encap code
From: Jakub Kicinski @ 2019-08-28  5:56 UTC (permalink / raw)
  To: davem; +Cc: netdev, oss-drivers, Jakub Kicinski

John says:

There are a few bugs in the merge encap code that have come to light with
recent driver changes. Effectively, flow bind callbacks were being
registered twice when using internal ports (new 'busy' code triggers
this). There was also an issue with neighbour notifier messages being
ignored for internal ports.

John Hurley (2):
  nfp: flower: prevent ingress block binds on internal ports
  nfp: flower: handle neighbour events on internal ports

 drivers/net/ethernet/netronome/nfp/flower/offload.c     | 7 ++++---
 drivers/net/ethernet/netronome/nfp/flower/tunnel_conf.c | 8 ++++----
 2 files changed, 8 insertions(+), 7 deletions(-)

-- 
2.21.0

^ permalink raw reply

* Re: BUG_ON in skb_segment, after bpf_skb_change_proto was applied
From: Shmulik Ladkani @ 2019-08-28  5:56 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Eric Dumazet, netdev, Alexander Duyck, Alexei Starovoitov,
	Yonghong Song, Steffen Klassert, shmulik, eyal
In-Reply-To: <88a3da53-fecc-0d8c-56dc-a4c3b0e11dfd@iogearbox.net>

On Tue, 27 Aug 2019 14:10:35 +0200
Daniel Borkmann <daniel@iogearbox.net> wrote:

> Given first point above wrt hitting rarely, it would be good to first get a
> better understanding for writing a reproducer. Back then Yonghong added one
> to the BPF kernel test suite [0], so it would be desirable to extend it for
> the case you're hitting. Given NAT64 use-case is needed and used by multiple
> parties, we should try to (fully) fix it generically.

Thanks Daniel for the advice.

I'm working on a reproducer that resembles the input skb which triggers
this BUG_ON.

^ permalink raw reply

* Re: [PATCH net-next 1/4] r8169: prepare for adding RTL8125 support
From: Heiner Kallweit @ 2019-08-28  5:52 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: Realtek linux nic maintainers, David Miller,
	netdev@vger.kernel.org, Chun-Hao Lin
In-Reply-To: <20190827232713.GE26248@lunn.ch>

On 28.08.2019 01:27, Andrew Lunn wrote:
> On Tue, Aug 27, 2019 at 08:41:00PM +0200, Heiner Kallweit wrote:
>> This patch prepares the driver for adding RTL8125 support:
>> - change type of interrupt mask to u32
>> - restrict rtl_is_8168evl_up to RTL8168 chip versions
>> - factor out reading MAC address from registers
>> - re-add function rtl_get_events
>> - move disabling interrupt coalescing to RTL8169/RTL8168 init
>> - read different register for PCI commit
>> - don't use bit LastFrag in tx descriptor after send, RTL8125 clears it
> 
> Hi Heiner
> 
> That is a lot of changes in one patch. Although there is no planned
> functional change, r8169 has a habit of breaking. Having lots of small
> changes would help tracking down which change caused a breakage, via a
> git bisect.
> 
> So you might want to consider splitting this up into a number of small
> patches.
> 
> 	Andrew
> 
Hi Andrew,

most of the changes are trivial, but you're right. I'll split this patch.

Heiner

^ permalink raw reply

* [PATCH 1/2] vhost/test: fix build for vhost test
From: Tiwei Bie @ 2019-08-28  5:36 UTC (permalink / raw)
  To: mst, jasowang; +Cc: kvm, virtualization, netdev, linux-kernel, stable

Since the commit below, callers need to specify the iov_limit in
vhost_dev_init() explicitly.

Fixes: b46a0bf78ad7 ("vhost: fix OOB in get_rx_bufs()")
Cc: stable@vger.kernel.org
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/vhost/test.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index 9e90e969af55..ac4f762c4f65 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -115,7 +115,7 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 	dev = &n->dev;
 	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX);
+	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV);
 
 	f->private_data = n;
 
-- 
2.17.1


^ permalink raw reply related

* [PATCH 2/2] vhost/test: fix build for vhost test
From: Tiwei Bie @ 2019-08-28  5:37 UTC (permalink / raw)
  To: mst, jasowang; +Cc: kvm, virtualization, netdev, linux-kernel, stable
In-Reply-To: <20190828053700.26022-1-tiwei.bie@intel.com>

Since vhost_exceeds_weight() was introduced, callers need to specify
the packet weight and byte weight in vhost_dev_init(). Note that the
packet weight isn't counted in this patch to keep the original behavior
unchanged.
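
A simplified userspace model of why passing 0 as the packet count keeps
the old behavior. It assumes the helper compares both counters against
the limits given to vhost_dev_init(); the struct and function names are
invented for this sketch, and the requeueing of the poll work that the
real helper takes over from the caller is left out.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical container for the two limits passed at init time. */
struct weight_limits {
	int pkt_weight;		/* e.g. VHOST_TEST_PKT_WEIGHT (256) */
	int byte_weight;	/* e.g. VHOST_TEST_WEIGHT (0x80000) */
};

/* True once either the packet budget or the byte budget is used up. */
bool exceeds_weight(const struct weight_limits *lim, int pkts, int total_len)
{
	return total_len >= lim->byte_weight || pkts >= lim->pkt_weight;
}
```

With pkts fixed at 0 the packet comparison can never fire for a
positive pkt_weight, so only the byte limit matters, as before.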

Fixes: e82b9b0727ff ("vhost: introduce vhost_exceeds_weight()")
Cc: stable@vger.kernel.org
Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/vhost/test.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/test.c b/drivers/vhost/test.c
index ac4f762c4f65..7804869c6a31 100644
--- a/drivers/vhost/test.c
+++ b/drivers/vhost/test.c
@@ -22,6 +22,12 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_TEST_WEIGHT 0x80000
 
+/* Max number of packets transferred before requeueing the job.
+ * Using this limit prevents one virtqueue from starving others with
+ * pkts.
+ */
+#define VHOST_TEST_PKT_WEIGHT 256
+
 enum {
 	VHOST_TEST_VQ = 0,
 	VHOST_TEST_VQ_MAX = 1,
@@ -80,10 +86,8 @@ static void handle_vq(struct vhost_test *n)
 		}
 		vhost_add_used_and_signal(&n->dev, vq, head, 0);
 		total_len += len;
-		if (unlikely(total_len >= VHOST_TEST_WEIGHT)) {
-			vhost_poll_queue(&vq->poll);
+		if (unlikely(vhost_exceeds_weight(vq, 0, total_len)))
 			break;
-		}
 	}
 
 	mutex_unlock(&vq->mutex);
@@ -115,7 +119,8 @@ static int vhost_test_open(struct inode *inode, struct file *f)
 	dev = &n->dev;
 	vqs[VHOST_TEST_VQ] = &n->vqs[VHOST_TEST_VQ];
 	n->vqs[VHOST_TEST_VQ].handle_kick = handle_vq_kick;
-	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV);
+	vhost_dev_init(dev, vqs, VHOST_TEST_VQ_MAX, UIO_MAXIOV,
+		       VHOST_TEST_PKT_WEIGHT, VHOST_TEST_WEIGHT);
 
 	f->private_data = n;
 
-- 
2.17.1


^ permalink raw reply related

* [RFC v3] vhost: introduce mdev based hardware vhost backend
From: Tiwei Bie @ 2019-08-28  5:37 UTC (permalink / raw)
  To: mst, jasowang, alex.williamson, maxime.coquelin
  Cc: linux-kernel, kvm, virtualization, netdev, dan.daly,
	cunming.liang, zhihong.wang, lingshan.zhu, tiwei.bie

Details about this can be found here:

https://lwn.net/Articles/750770/

What's new in this version
==========================

There are three choices based on the discussion [1] in RFC v2:

> #1. We expose a VFIO device, so we can reuse the VFIO container/group
>     based DMA API and potentially reuse a lot of VFIO code in QEMU.
>
>     But in this case, we have two choices for the VFIO device interface
>     (i.e. the interface on top of VFIO device fd):
>
>     A) we may invent a new vhost protocol (as demonstrated by the code
>        in this RFC) on VFIO device fd to make it work in VFIO's way,
>        i.e. regions and irqs.
>
>     B) Or as you proposed, instead of inventing a new vhost protocol,
>        we can reuse most existing vhost ioctls on the VFIO device fd
>        directly. There should be no conflicts between the VFIO ioctls
>        (type is 0x3B) and VHOST ioctls (type is 0xAF) currently.
>
> #2. Instead of exposing a VFIO device, we may expose a VHOST device.
>     And we will introduce a new mdev driver vhost-mdev to do this.
>     It would be natural to reuse the existing kernel vhost interface
>     (ioctls) on it as much as possible. But we will need to invent
>     some APIs for DMA programming (reusing VHOST_SET_MEM_TABLE is a
>     choice, but it's too heavy and doesn't support vIOMMU by itself).

This version is more like a quick PoC to try Jason's proposal on
reusing vhost ioctls. The second option (#1/B) of the three choices
above was chosen in this version to demonstrate the idea quickly.
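
The "no conflicts" remark in #1/B rests on the type byte encoded in
every ioctl number. A small sketch, assuming Linux and its
<linux/ioctl.h> macros; the SKETCH_* names, the nr values and
is_vhost_ioctl() are made up for illustration:

```c
#include <assert.h>
#include <stdbool.h>
#include <linux/ioctl.h>

#define SKETCH_VFIO_TYPE	0x3B	/* VFIO ioctl type, as quoted above */
#define SKETCH_VHOST_TYPE	0xAF	/* vhost ioctl type, as quoted above */

/* Example command numbers built the usual way; the nr values are arbitrary. */
#define SKETCH_VFIO_CMD		_IO(SKETCH_VFIO_TYPE, 107)
#define SKETCH_VHOST_CMD	_IOW(SKETCH_VHOST_TYPE, 0x70, unsigned long long)

/* A dispatcher can tell the two families apart by the type byte alone. */
bool is_vhost_ioctl(unsigned int cmd)
{
	return _IOC_TYPE(cmd) == SKETCH_VHOST_TYPE;
}
```

Because the type bytes differ, a command from one family can never be
numerically equal to a command from the other, whatever nr and size
are used.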

Now the userspace API looks like this:

- VFIO's container/group based IOMMU API is used to do the
  DMA programming.

- Vhost's existing ioctls are used to set up the device.

And the device will report device_api as "vfio-vhost".

Note that there are dirty hacks in this version. If we decide to
go this way, some refactoring in vhost.c/vhost.h may be needed.

PS. The direct mapping of the notify registers isn't implemented
    in this version.

[1] https://lkml.org/lkml/2019/7/9/101

Signed-off-by: Tiwei Bie <tiwei.bie@intel.com>
---
 drivers/vhost/Kconfig      |   9 +
 drivers/vhost/Makefile     |   3 +
 drivers/vhost/mdev.c       | 382 +++++++++++++++++++++++++++++++++++++
 include/linux/vhost_mdev.h |  58 ++++++
 include/uapi/linux/vfio.h  |   2 +
 include/uapi/linux/vhost.h |   8 +
 6 files changed, 462 insertions(+)
 create mode 100644 drivers/vhost/mdev.c
 create mode 100644 include/linux/vhost_mdev.h

diff --git a/drivers/vhost/Kconfig b/drivers/vhost/Kconfig
index 3d03ccbd1adc..2ba54fcf43b7 100644
--- a/drivers/vhost/Kconfig
+++ b/drivers/vhost/Kconfig
@@ -34,6 +34,15 @@ config VHOST_VSOCK
 	To compile this driver as a module, choose M here: the module will be called
 	vhost_vsock.
 
+config VHOST_MDEV
+	tristate "Hardware vhost accelerator abstraction"
+	depends on EVENTFD && VFIO && VFIO_MDEV
+	select VHOST
+	default n
+	---help---
+	Say Y here to enable the vhost_mdev module
+	for use with hardware vhost accelerators
+
 config VHOST
 	tristate
 	---help---
diff --git a/drivers/vhost/Makefile b/drivers/vhost/Makefile
index 6c6df24f770c..ad9c0f8c6d8c 100644
--- a/drivers/vhost/Makefile
+++ b/drivers/vhost/Makefile
@@ -10,4 +10,7 @@ vhost_vsock-y := vsock.o
 
 obj-$(CONFIG_VHOST_RING) += vringh.o
 
+obj-$(CONFIG_VHOST_MDEV) += vhost_mdev.o
+vhost_mdev-y := mdev.o
+
 obj-$(CONFIG_VHOST)	+= vhost.o
diff --git a/drivers/vhost/mdev.c b/drivers/vhost/mdev.c
new file mode 100644
index 000000000000..6bef1d9ae2e6
--- /dev/null
+++ b/drivers/vhost/mdev.c
@@ -0,0 +1,382 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#include <linux/module.h>
+#include <linux/kernel.h>
+#include <linux/vfio.h>
+#include <linux/vhost.h>
+#include <linux/mdev.h>
+#include <linux/vhost_mdev.h>
+
+#include "vhost.h"
+
+struct vhost_mdev {
+	struct vhost_dev dev;
+	bool opened;
+	int nvqs;
+	u64 state;
+	u64 acked_features;
+	u64 features;
+	const struct vhost_mdev_device_ops *ops;
+	struct mdev_device *mdev;
+	void *private;
+	struct vhost_virtqueue vqs[];
+};
+
+static void handle_vq_kick(struct vhost_work *work)
+{
+	struct vhost_virtqueue *vq = container_of(work, struct vhost_virtqueue,
+						  poll.work);
+	struct vhost_mdev *vdpa = container_of(vq->dev, struct vhost_mdev, dev);
+
+	vdpa->ops->notify(vdpa, vq - vdpa->vqs);
+}
+
+static int vhost_set_state(struct vhost_mdev *vdpa, u64 __user *statep)
+{
+	u64 state;
+
+	if (copy_from_user(&state, statep, sizeof(state)))
+		return -EFAULT;
+
+	if (state >= VHOST_MDEV_S_MAX)
+		return -EINVAL;
+
+	if (vdpa->state == state)
+		return 0;
+
+	mutex_lock(&vdpa->dev.mutex);
+
+	vdpa->state = state;
+
+	switch (vdpa->state) {
+	case VHOST_MDEV_S_RUNNING:
+		vdpa->ops->start(vdpa);
+		break;
+	case VHOST_MDEV_S_STOPPED:
+		vdpa->ops->stop(vdpa);
+		break;
+	}
+
+	mutex_unlock(&vdpa->dev.mutex);
+
+	return 0;
+}
+
+static int vhost_set_features(struct vhost_mdev *vdpa, u64 __user *featurep)
+{
+	u64 features;
+
+	if (copy_from_user(&features, featurep, sizeof(features)))
+		return -EFAULT;
+
+	if (features & ~vdpa->features)
+		return -EINVAL;
+
+	vdpa->acked_features = features;
+	vdpa->ops->features_changed(vdpa);
+	return 0;
+}
+
+static int vhost_get_features(struct vhost_mdev *vdpa, u64 __user *featurep)
+{
+	if (copy_to_user(featurep, &vdpa->features, sizeof(vdpa->features)))
+		return -EFAULT;
+	return 0;
+}
+
+static int vhost_get_vring_base(struct vhost_mdev *vdpa, void __user *argp)
+{
+	struct vhost_virtqueue *vq;
+	u32 idx;
+	int r;
+
+	r = get_user(idx, (u32 __user *)argp);
+	if (r < 0)
+		return r;
+
+	vq = &vdpa->vqs[idx];
+	vq->last_avail_idx = vdpa->ops->get_vring_base(vdpa, idx);
+
+	return vhost_vring_ioctl(&vdpa->dev, VHOST_GET_VRING_BASE, argp);
+}
+
+/*
+ * Helpers for backend to register mdev.
+ */
+
+struct vhost_mdev *vhost_mdev_alloc(struct mdev_device *mdev, void *private,
+				    int nvqs)
+{
+	struct vhost_mdev *vdpa;
+	struct vhost_dev *dev;
+	struct vhost_virtqueue **vqs;
+	size_t size;
+	int i;
+
+	size = sizeof(struct vhost_mdev) + nvqs * sizeof(struct vhost_virtqueue);
+
+	vdpa = kzalloc(size, GFP_KERNEL);
+	if (!vdpa)
+		return NULL;
+
+	vdpa->nvqs = nvqs;
+
+	vqs = kmalloc_array(nvqs, sizeof(*vqs), GFP_KERNEL);
+	if (!vqs) {
+		kfree(vdpa);
+		return NULL;
+	}
+
+	dev = &vdpa->dev;
+	for (i = 0; i < nvqs; i++) {
+		vqs[i] = &vdpa->vqs[i];
+		vqs[i]->handle_kick = handle_vq_kick;
+	}
+	vhost_dev_init(dev, vqs, nvqs, 0, 0, 0);
+
+	vdpa->private = private;
+	vdpa->mdev = mdev;
+
+	mdev_set_drvdata(mdev, vdpa);
+
+	return vdpa;
+}
+EXPORT_SYMBOL(vhost_mdev_alloc);
+
+void vhost_mdev_free(struct vhost_mdev *vdpa)
+{
+	struct mdev_device *mdev;
+
+	mdev = vdpa->mdev;
+	mdev_set_drvdata(mdev, NULL);
+
+	vhost_dev_stop(&vdpa->dev);
+	vhost_dev_cleanup(&vdpa->dev);
+	kfree(vdpa->dev.vqs);
+	kfree(vdpa);
+}
+EXPORT_SYMBOL(vhost_mdev_free);
+
+ssize_t vhost_mdev_read(struct mdev_device *mdev, char __user *buf,
+		  size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+EXPORT_SYMBOL(vhost_mdev_read);
+
+
+ssize_t vhost_mdev_write(struct mdev_device *mdev, const char __user *buf,
+		   size_t count, loff_t *ppos)
+{
+	return -EINVAL;
+}
+EXPORT_SYMBOL(vhost_mdev_write);
+
+int vhost_mdev_mmap(struct mdev_device *mdev, struct vm_area_struct *vma)
+{
+	// TODO
+	return -EINVAL;
+}
+EXPORT_SYMBOL(vhost_mdev_mmap);
+
+long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
+		      unsigned long arg)
+{
+	void __user *argp = (void __user *)arg;
+	struct vhost_mdev *vdpa;
+	unsigned long minsz;
+	int ret = 0;
+
+	if (!mdev)
+		return -EINVAL;
+
+	vdpa = mdev_get_drvdata(mdev);
+	if (!vdpa)
+		return -ENODEV;
+
+	switch (cmd) {
+	case VFIO_DEVICE_GET_INFO:
+	{
+		struct vfio_device_info info;
+
+		minsz = offsetofend(struct vfio_device_info, num_irqs);
+
+		if (copy_from_user(&info, (void __user *)arg, minsz)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		if (info.argsz < minsz) {
+			ret = -EINVAL;
+			break;
+		}
+
+		info.flags = VFIO_DEVICE_FLAGS_VHOST;
+		info.num_regions = 0;
+		info.num_irqs = 0;
+
+		if (copy_to_user((void __user *)arg, &info, minsz)) {
+			ret = -EFAULT;
+			break;
+		}
+
+		break;
+	}
+	case VFIO_DEVICE_GET_REGION_INFO:
+	case VFIO_DEVICE_GET_IRQ_INFO:
+	case VFIO_DEVICE_SET_IRQS:
+	case VFIO_DEVICE_RESET:
+		ret = -EINVAL;
+		break;
+
+	case VHOST_MDEV_SET_STATE:
+		ret = vhost_set_state(vdpa, argp);
+		break;
+	case VHOST_GET_FEATURES:
+		ret = vhost_get_features(vdpa, argp);
+		break;
+	case VHOST_SET_FEATURES:
+		ret = vhost_set_features(vdpa, argp);
+		break;
+	case VHOST_GET_VRING_BASE:
+		ret = vhost_get_vring_base(vdpa, argp);
+		break;
+	default:
+		ret = vhost_dev_ioctl(&vdpa->dev, cmd, argp);
+		if (ret == -ENOIOCTLCMD)
+			ret = vhost_vring_ioctl(&vdpa->dev, cmd, argp);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(vhost_mdev_ioctl);
+
+int vhost_mdev_open(struct mdev_device *mdev)
+{
+	struct vhost_mdev *vdpa;
+	int ret = 0;
+
+	vdpa = mdev_get_drvdata(mdev);
+	if (!vdpa)
+		return -ENODEV;
+
+	mutex_lock(&vdpa->dev.mutex);
+
+	if (vdpa->opened)
+		ret = -EBUSY;
+	else
+		vdpa->opened = true;
+
+	mutex_unlock(&vdpa->dev.mutex);
+
+	return ret;
+}
+EXPORT_SYMBOL(vhost_mdev_open);
+
+void vhost_mdev_close(struct mdev_device *mdev)
+{
+	struct vhost_mdev *vdpa;
+
+	vdpa = mdev_get_drvdata(mdev);
+
+	mutex_lock(&vdpa->dev.mutex);
+
+	vhost_dev_stop(&vdpa->dev);
+	vhost_dev_cleanup(&vdpa->dev);
+
+	vdpa->opened = false;
+	mutex_unlock(&vdpa->dev.mutex);
+}
+EXPORT_SYMBOL(vhost_mdev_close);
+
+/*
+ * Helpers for backend to set/get information.
+ */
+
+int vhost_mdev_set_device_ops(struct vhost_mdev *vdpa,
+			      const struct vhost_mdev_device_ops *ops)
+{
+	vdpa->ops = ops;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_set_device_ops);
+
+int vhost_mdev_set_features(struct vhost_mdev *vdpa, u64 features)
+{
+	vdpa->features = features;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_set_features);
+
+struct eventfd_ctx *
+vhost_mdev_get_call_ctx(struct vhost_mdev *vdpa, int queue_id)
+{
+	return vdpa->vqs[queue_id].call_ctx;
+}
+EXPORT_SYMBOL(vhost_mdev_get_call_ctx);
+
+int vhost_mdev_get_acked_features(struct vhost_mdev *vdpa, u64 *features)
+{
+	*features = vdpa->acked_features;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_acked_features);
+
+int vhost_mdev_get_vring_num(struct vhost_mdev *vdpa, int queue_id, u16 *num)
+{
+	*num = vdpa->vqs[queue_id].num;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_vring_num);
+
+int vhost_mdev_get_vring_base(struct vhost_mdev *vdpa, int queue_id, u16 *base)
+{
+	*base = vdpa->vqs[queue_id].last_avail_idx;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_vring_base);
+
+int vhost_mdev_get_vring_addr(struct vhost_mdev *vdpa, int queue_id,
+			      struct vhost_vring_addr *addr)
+{
+	struct vhost_virtqueue *vq = &vdpa->vqs[queue_id];
+
+	/*
+	 * XXX: we need userspace to pass guest physical address or
+	 *      IOVA directly.
+	 */
+	addr->flags = vq->log_used ? (0x1 << VHOST_VRING_F_LOG) : 0;
+	addr->desc_user_addr = (__u64)vq->desc;
+	addr->avail_user_addr = (__u64)vq->avail;
+	addr->used_user_addr = (__u64)vq->used;
+	addr->log_guest_addr = (__u64)vq->log_addr;
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_vring_addr);
+
+int vhost_mdev_get_log_base(struct vhost_mdev *vdpa, int queue_id,
+			    void **log_base, u64 *log_size)
+{
+	// TODO
+	return 0;
+}
+EXPORT_SYMBOL(vhost_mdev_get_log_base);
+
+struct mdev_device *vhost_mdev_get_mdev(struct vhost_mdev *vdpa)
+{
+	return vdpa->mdev;
+}
+EXPORT_SYMBOL(vhost_mdev_get_mdev);
+
+void *vhost_mdev_get_private(struct vhost_mdev *vdpa)
+{
+	return vdpa->private;
+}
+EXPORT_SYMBOL(vhost_mdev_get_private);
+
+MODULE_VERSION("0.0.0");
+MODULE_LICENSE("GPL v2");
+MODULE_DESCRIPTION("Hardware vhost accelerator abstraction");
diff --git a/include/linux/vhost_mdev.h b/include/linux/vhost_mdev.h
new file mode 100644
index 000000000000..070787ce6b36
--- /dev/null
+++ b/include/linux/vhost_mdev.h
@@ -0,0 +1,58 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (C) 2018-2019 Intel Corporation.
+ */
+
+#ifndef _VHOST_MDEV_H
+#define _VHOST_MDEV_H
+
+struct mdev_device;
+struct vhost_mdev;
+
+typedef int (*vhost_mdev_start_device_t)(struct vhost_mdev *vdpa);
+typedef int (*vhost_mdev_stop_device_t)(struct vhost_mdev *vdpa);
+typedef int (*vhost_mdev_set_features_t)(struct vhost_mdev *vdpa);
+typedef void (*vhost_mdev_notify_device_t)(struct vhost_mdev *vdpa, int queue_id);
+typedef u64 (*vhost_mdev_get_notify_addr_t)(struct vhost_mdev *vdpa, int queue_id);
+typedef u16 (*vhost_mdev_get_vring_base_t)(struct vhost_mdev *vdpa, int queue_id);
+typedef void (*vhost_mdev_features_changed_t)(struct vhost_mdev *vdpa);
+
+struct vhost_mdev_device_ops {
+	vhost_mdev_start_device_t	start;
+	vhost_mdev_stop_device_t	stop;
+	vhost_mdev_notify_device_t	notify;
+	vhost_mdev_get_notify_addr_t	get_notify_addr;
+	vhost_mdev_get_vring_base_t	get_vring_base;
+	vhost_mdev_features_changed_t	features_changed;
+};
+
+struct vhost_mdev *vhost_mdev_alloc(struct mdev_device *mdev,
+		void *private, int nvqs);
+void vhost_mdev_free(struct vhost_mdev *vdpa);
+
+ssize_t vhost_mdev_read(struct mdev_device *mdev, char __user *buf,
+		size_t count, loff_t *ppos);
+ssize_t vhost_mdev_write(struct mdev_device *mdev, const char __user *buf,
+		size_t count, loff_t *ppos);
+long vhost_mdev_ioctl(struct mdev_device *mdev, unsigned int cmd,
+		unsigned long arg);
+int vhost_mdev_mmap(struct mdev_device *mdev, struct vm_area_struct *vma);
+int vhost_mdev_open(struct mdev_device *mdev);
+void vhost_mdev_close(struct mdev_device *mdev);
+
+int vhost_mdev_set_device_ops(struct vhost_mdev *vdpa,
+		const struct vhost_mdev_device_ops *ops);
+int vhost_mdev_set_features(struct vhost_mdev *vdpa, u64 features);
+struct eventfd_ctx *vhost_mdev_get_call_ctx(struct vhost_mdev *vdpa,
+		int queue_id);
+int vhost_mdev_get_acked_features(struct vhost_mdev *vdpa, u64 *features);
+int vhost_mdev_get_vring_num(struct vhost_mdev *vdpa, int queue_id, u16 *num);
+int vhost_mdev_get_vring_base(struct vhost_mdev *vdpa, int queue_id, u16 *base);
+int vhost_mdev_get_vring_addr(struct vhost_mdev *vdpa, int queue_id,
+		struct vhost_vring_addr *addr);
+int vhost_mdev_get_log_base(struct vhost_mdev *vdpa, int queue_id,
+		void **log_base, u64 *log_size);
+struct mdev_device *vhost_mdev_get_mdev(struct vhost_mdev *vdpa);
+void *vhost_mdev_get_private(struct vhost_mdev *vdpa);
+
+#endif /* _VHOST_MDEV_H */
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 8f10748dac79..0300d6831cc5 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -201,6 +201,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_FLAGS_AMBA  (1 << 3)	/* vfio-amba device */
 #define VFIO_DEVICE_FLAGS_CCW	(1 << 4)	/* vfio-ccw device */
 #define VFIO_DEVICE_FLAGS_AP	(1 << 5)	/* vfio-ap device */
+#define VFIO_DEVICE_FLAGS_VHOST	(1 << 6)	/* vfio-vhost device */
 	__u32	num_regions;	/* Max region index + 1 */
 	__u32	num_irqs;	/* Max IRQ index + 1 */
 };
@@ -217,6 +218,7 @@ struct vfio_device_info {
 #define VFIO_DEVICE_API_AMBA_STRING		"vfio-amba"
 #define VFIO_DEVICE_API_CCW_STRING		"vfio-ccw"
 #define VFIO_DEVICE_API_AP_STRING		"vfio-ap"
+#define VFIO_DEVICE_API_VHOST_STRING		"vfio-vhost"
 
 /**
  * VFIO_DEVICE_GET_REGION_INFO - _IOWR(VFIO_TYPE, VFIO_BASE + 8,
diff --git a/include/uapi/linux/vhost.h b/include/uapi/linux/vhost.h
index 40d028eed645..5afbc2f08fa3 100644
--- a/include/uapi/linux/vhost.h
+++ b/include/uapi/linux/vhost.h
@@ -116,4 +116,12 @@
 #define VHOST_VSOCK_SET_GUEST_CID	_IOW(VHOST_VIRTIO, 0x60, __u64)
 #define VHOST_VSOCK_SET_RUNNING		_IOW(VHOST_VIRTIO, 0x61, int)
 
+/* VHOST_MDEV specific defines */
+
+#define VHOST_MDEV_SET_STATE	_IOW(VHOST_VIRTIO, 0x70, __u64)
+
+#define VHOST_MDEV_S_STOPPED	0
+#define VHOST_MDEV_S_RUNNING	1
+#define VHOST_MDEV_S_MAX	2
+
 #endif
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH net-next] net: phy: force phy suspend when calling phy_stop
From: Heiner Kallweit @ 2019-08-28  5:38 UTC (permalink / raw)
  To: Jian Shen, andrew, f.fainelli, davem, sergei.shtylyov
  Cc: netdev, forest.zhouchang, linuxarm
In-Reply-To: <1566956087-37096-1-git-send-email-shenjian15@huawei.com>

On 28.08.2019 03:34, Jian Shen wrote:
> Some ethernet drivers may call phy_start() and phy_stop() from
> ndo_open() and ndo_close() respectively.
> 
> When the network cable is unplugged and the following steps are performed:
> step 1: ifconfig ethX up -> ndo_open -> phy_start -> start
> autoneg, and the phy has no link.
> step 2: ifconfig ethX down -> ndo_close -> phy_stop -> just stops
> the phy state machine, without suspending the phy.
> 
> This patch forces phy suspend even when phydev->link is off.
> 
> Signed-off-by: Jian Shen <shenjian15@huawei.com>
> ---
>  drivers/net/phy/phy.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index f3adea9..0acd5b4 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -911,8 +911,8 @@ void phy_state_machine(struct work_struct *work)
>  		if (phydev->link) {
>  			phydev->link = 0;
>  			phy_link_down(phydev, true);
> -			do_suspend = true;
>  		}
> +		do_suspend = true;
>  		break;
>  	}
>  
> 
Reviewed-by: Heiner Kallweit <hkallweit1@gmail.com>

^ permalink raw reply

* [PATCH bpf-next 2/2] nfp: bpf: add simple map op cache
From: Jakub Kicinski @ 2019-08-28  5:36 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: netdev, oss-drivers, jaco.gericke, Jakub Kicinski, Quentin Monnet
In-Reply-To: <20190828053629.28658-1-jakub.kicinski@netronome.com>

Each get_next and lookup call requires a round trip to the device.
However, the device is capable of giving us a few entries back,
instead of just one.

In this patch we ask for a small yet reasonable number of entries
(4) on every get_next call, and on subsequent get_next/lookup calls
check this little cache for a hit. The cache is only kept for 250us,
and is invalidated on every operation which may modify the map
(e.g. delete or update call). Note that operations may be performed
simultaneously, so we have to keep track of operations in flight.
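
The scheme described above can be modelled in a few lines of userspace C.
This is only a sketch under stated assumptions, not the driver code:
struct map_cache, cache_lookup(), cache_block(), cache_unblock() and
cache_fill() are hypothetical names, keys and values are reduced to u32
for brevity, timestamps are passed in by the caller, and locking is left
out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CACHE_CNT	4U		/* entries requested per get_next */
#define CACHE_TIME_NS	(250 * 1000)	/* cache lifetime: 250us */

/* Hypothetical fixed-size entry; real keys/values are byte arrays. */
struct entry {
	uint32_t key;
	uint32_t val;
};

struct map_cache {
	struct entry ent[CACHE_CNT];
	unsigned int count;	/* valid entries, 0 means empty */
	uint64_t expires;	/* time (ns) after which entries are stale */
	unsigned int blockers;	/* modifying ops currently in flight */
	uint32_t gen;		/* bumped when a modifying op completes */
};

/* Serve a lookup from the cache; returns true and fills *val on a hit. */
static bool cache_lookup(struct map_cache *c, uint64_t now, uint32_t key,
			 uint32_t *val)
{
	unsigned int i;

	if (!c->count)
		return false;
	if (c->expires < now) {
		c->count = 0;	/* stale, invalidate */
		return false;
	}
	for (i = 0; i < c->count; i++) {
		if (c->ent[i].key == key) {
			*val = c->ent[i].val;
			return true;
		}
	}
	return false;
}

/* A modifying op (update/delete) starts: drop the cache, block fills. */
static void cache_block(struct map_cache *c)
{
	c->blockers++;
	c->count = 0;
}

/* A modifying op finished: unblock and start a new generation. */
static void cache_unblock(struct map_cache *c)
{
	assert(c->blockers > 0);
	c->blockers--;
	c->gen++;
}

/* Store a device reply, unless a modifying op ran since gen_snapshot. */
static bool cache_fill(struct map_cache *c, uint64_t now, uint32_t gen_snapshot,
		       const struct entry *ent, unsigned int count)
{
	if (c->blockers || c->gen != gen_snapshot || count > CACHE_CNT)
		return false;
	memcpy(c->ent, ent, count * sizeof(*ent));
	c->count = count;
	c->expires = now + CACHE_TIME_NS;
	return true;
}
```

In this model a get_next caller would snapshot gen before talking to the
device and hand it to cache_fill() afterwards, so a reply that raced with
an update or delete is dropped rather than cached.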

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 179 +++++++++++++++++-
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   |   1 +
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  18 ++
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  23 +++
 .../net/ethernet/netronome/nfp/bpf/offload.c  |   3 +
 5 files changed, 215 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
index fcf880c82f3f..0e2db6ea79e9 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
@@ -6,6 +6,7 @@
 #include <linux/bug.h>
 #include <linux/jiffies.h>
 #include <linux/skbuff.h>
+#include <linux/timekeeping.h>
 
 #include "../ccm.h"
 #include "../nfp_app.h"
@@ -175,29 +176,151 @@ nfp_bpf_ctrl_reply_val(struct nfp_app_bpf *bpf, struct cmsg_reply_map_op *reply,
 	return &reply->data[bpf->cmsg_key_sz * (n + 1) + bpf->cmsg_val_sz * n];
 }
 
+static bool nfp_bpf_ctrl_op_cache_invalidate(enum nfp_ccm_type op)
+{
+	return op == NFP_CCM_TYPE_BPF_MAP_UPDATE ||
+	       op == NFP_CCM_TYPE_BPF_MAP_DELETE;
+}
+
+static bool nfp_bpf_ctrl_op_cache_capable(enum nfp_ccm_type op)
+{
+	return op == NFP_CCM_TYPE_BPF_MAP_LOOKUP ||
+	       op == NFP_CCM_TYPE_BPF_MAP_GETNEXT;
+}
+
+static bool nfp_bpf_ctrl_op_cache_fill(enum nfp_ccm_type op)
+{
+	return op == NFP_CCM_TYPE_BPF_MAP_GETFIRST ||
+	       op == NFP_CCM_TYPE_BPF_MAP_GETNEXT;
+}
+
+static unsigned int
+nfp_bpf_ctrl_op_cache_get(struct nfp_bpf_map *nfp_map, enum nfp_ccm_type op,
+			  const u8 *key, u8 *out_key, u8 *out_value,
+			  u32 *cache_gen)
+{
+	struct bpf_map *map = &nfp_map->offmap->map;
+	struct nfp_app_bpf *bpf = nfp_map->bpf;
+	unsigned int i, count, n_entries;
+	struct cmsg_reply_map_op *reply;
+
+	n_entries = nfp_bpf_ctrl_op_cache_fill(op) ? bpf->cmsg_cache_cnt : 1;
+
+	spin_lock(&nfp_map->cache_lock);
+	*cache_gen = nfp_map->cache_gen;
+	if (nfp_map->cache_blockers)
+		n_entries = 1;
+
+	if (nfp_bpf_ctrl_op_cache_invalidate(op))
+		goto exit_block;
+	if (!nfp_bpf_ctrl_op_cache_capable(op))
+		goto exit_unlock;
+
+	if (!nfp_map->cache)
+		goto exit_unlock;
+	if (nfp_map->cache_to < ktime_get_ns())
+		goto exit_invalidate;
+
+	reply = (void *)nfp_map->cache->data;
+	count = be32_to_cpu(reply->count);
+
+	for (i = 0; i < count; i++) {
+		void *cached_key;
+
+		cached_key = nfp_bpf_ctrl_reply_key(bpf, reply, i);
+		if (memcmp(cached_key, key, map->key_size))
+			continue;
+
+		if (op == NFP_CCM_TYPE_BPF_MAP_LOOKUP)
+			memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, i),
+			       map->value_size);
+		if (op == NFP_CCM_TYPE_BPF_MAP_GETNEXT) {
+			if (i + 1 == count)
+				break;
+
+			memcpy(out_key,
+			       nfp_bpf_ctrl_reply_key(bpf, reply, i + 1),
+			       map->key_size);
+		}
+
+		n_entries = 0;
+		goto exit_unlock;
+	}
+	goto exit_unlock;
+
+exit_block:
+	nfp_map->cache_blockers++;
+exit_invalidate:
+	dev_consume_skb_any(nfp_map->cache);
+	nfp_map->cache = NULL;
+exit_unlock:
+	spin_unlock(&nfp_map->cache_lock);
+	return n_entries;
+}
+
+static void
+nfp_bpf_ctrl_op_cache_put(struct nfp_bpf_map *nfp_map, enum nfp_ccm_type op,
+			  struct sk_buff *skb, u32 cache_gen)
+{
+	bool blocker, filler;
+
+	blocker = nfp_bpf_ctrl_op_cache_invalidate(op);
+	filler = nfp_bpf_ctrl_op_cache_fill(op);
+	if (blocker || filler) {
+		u64 to = 0;
+
+		if (filler)
+			to = ktime_get_ns() + NFP_BPF_MAP_CACHE_TIME_NS;
+
+		spin_lock(&nfp_map->cache_lock);
+		if (blocker) {
+			nfp_map->cache_blockers--;
+			nfp_map->cache_gen++;
+		}
+		if (filler && !nfp_map->cache_blockers &&
+		    nfp_map->cache_gen == cache_gen) {
+			nfp_map->cache_to = to;
+			swap(nfp_map->cache, skb);
+		}
+		spin_unlock(&nfp_map->cache_lock);
+	}
+
+	dev_consume_skb_any(skb);
+}
+
 static int
 nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
 		      u8 *key, u8 *value, u64 flags, u8 *out_key, u8 *out_value)
 {
 	struct nfp_bpf_map *nfp_map = offmap->dev_priv;
+	unsigned int n_entries, reply_entries, count;
 	struct nfp_app_bpf *bpf = nfp_map->bpf;
 	struct bpf_map *map = &offmap->map;
 	struct cmsg_reply_map_op *reply;
 	struct cmsg_req_map_op *req;
 	struct sk_buff *skb;
+	u32 cache_gen;
 	int err;
 
 	/* FW messages have no space for more than 32 bits of flags */
 	if (flags >> 32)
 		return -EOPNOTSUPP;
 
+	/* Handle op cache */
+	n_entries = nfp_bpf_ctrl_op_cache_get(nfp_map, op, key, out_key,
+					      out_value, &cache_gen);
+	if (!n_entries)
+		return 0;
+
 	skb = nfp_bpf_cmsg_map_req_alloc(bpf, 1);
-	if (!skb)
-		return -ENOMEM;
+	if (!skb) {
+		err = -ENOMEM;
+		goto err_cache_put;
+	}
 
 	req = (void *)skb->data;
 	req->tid = cpu_to_be32(nfp_map->tid);
-	req->count = cpu_to_be32(1);
+	req->count = cpu_to_be32(n_entries);
 	req->flags = cpu_to_be32(flags);
 
 	/* Copy inputs */
@@ -207,16 +330,38 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
 		memcpy(nfp_bpf_ctrl_req_val(bpf, req, 0), value,
 		       map->value_size);
 
-	skb = nfp_ccm_communicate(&bpf->ccm, skb, op,
-				  nfp_bpf_cmsg_map_reply_size(bpf, 1));
-	if (IS_ERR(skb))
-		return PTR_ERR(skb);
+	skb = nfp_ccm_communicate(&bpf->ccm, skb, op, 0);
+	if (IS_ERR(skb)) {
+		err = PTR_ERR(skb);
+		goto err_cache_put;
+	}
+
+	if (skb->len < sizeof(*reply)) {
+		cmsg_warn(bpf, "cmsg drop - type 0x%02x too short %d!\n",
+			  op, skb->len);
+		err = -EIO;
+		goto err_free;
+	}
 
 	reply = (void *)skb->data;
+	count = be32_to_cpu(reply->count);
 	err = nfp_bpf_ctrl_rc_to_errno(bpf, &reply->reply_hdr);
+	/* FW responds with message sized to hold the good entries,
+	 * plus one extra entry if there was an error.
+	 */
+	reply_entries = count + !!err;
+	if (n_entries > 1 && count)
+		err = 0;
 	if (err)
 		goto err_free;
 
+	if (skb->len != nfp_bpf_cmsg_map_reply_size(bpf, reply_entries)) {
+		cmsg_warn(bpf, "cmsg drop - type 0x%02x too short %d for %d entries!\n",
+			  op, skb->len, reply_entries);
+		err = -EIO;
+		goto err_free;
+	}
+
 	/* Copy outputs */
 	if (out_key)
 		memcpy(out_key, nfp_bpf_ctrl_reply_key(bpf, reply, 0),
@@ -225,11 +370,13 @@ nfp_bpf_ctrl_entry_op(struct bpf_offloaded_map *offmap, enum nfp_ccm_type op,
 		memcpy(out_value, nfp_bpf_ctrl_reply_val(bpf, reply, 0),
 		       map->value_size);
 
-	dev_consume_skb_any(skb);
+	nfp_bpf_ctrl_op_cache_put(nfp_map, op, skb, cache_gen);
 
 	return 0;
 err_free:
 	dev_kfree_skb_any(skb);
+err_cache_put:
+	nfp_bpf_ctrl_op_cache_put(nfp_map, op, NULL, cache_gen);
 	return err;
 }
 
@@ -275,7 +422,21 @@ unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf)
 
 unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf)
 {
-	return max(NFP_NET_DEFAULT_MTU, nfp_bpf_ctrl_cmsg_min_mtu(bpf));
+	return max3(NFP_NET_DEFAULT_MTU,
+		    nfp_bpf_cmsg_map_req_size(bpf, NFP_BPF_MAP_CACHE_CNT),
+		    nfp_bpf_cmsg_map_reply_size(bpf, NFP_BPF_MAP_CACHE_CNT));
+}
+
+unsigned int nfp_bpf_ctrl_cmsg_cache_cnt(struct nfp_app_bpf *bpf)
+{
+	unsigned int mtu, req_max, reply_max, entry_sz;
+
+	mtu = bpf->app->ctrl->dp.mtu;
+	entry_sz = bpf->cmsg_key_sz + bpf->cmsg_val_sz;
+	req_max = (mtu - sizeof(struct cmsg_req_map_op)) / entry_sz;
+	reply_max = (mtu - sizeof(struct cmsg_reply_map_op)) / entry_sz;
+
+	return min3(req_max, reply_max, NFP_BPF_MAP_CACHE_CNT);
 }
 
 void nfp_bpf_ctrl_msg_rx(struct nfp_app *app, struct sk_buff *skb)
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/fw.h b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
index 06c4286bd79e..a83a0ad5e27d 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/fw.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/fw.h
@@ -24,6 +24,7 @@ enum bpf_cap_tlv_type {
 	NFP_BPF_CAP_TYPE_QUEUE_SELECT	= 5,
 	NFP_BPF_CAP_TYPE_ADJUST_TAIL	= 6,
 	NFP_BPF_CAP_TYPE_ABI_VERSION	= 7,
+	NFP_BPF_CAP_TYPE_CMSG_MULTI_ENT	= 8,
 };
 
 struct nfp_bpf_cap_tlv_func {
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 2b1773ed3de9..8f732771d3fa 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -299,6 +299,14 @@ nfp_bpf_parse_cap_adjust_tail(struct nfp_app_bpf *bpf, void __iomem *value,
 	return 0;
 }
 
+static int
+nfp_bpf_parse_cap_cmsg_multi_ent(struct nfp_app_bpf *bpf, void __iomem *value,
+				 u32 length)
+{
+	bpf->cmsg_multi_ent = true;
+	return 0;
+}
+
 static int
 nfp_bpf_parse_cap_abi_version(struct nfp_app_bpf *bpf, void __iomem *value,
 			      u32 length)
@@ -375,6 +383,11 @@ static int nfp_bpf_parse_capabilities(struct nfp_app *app)
 							  length))
 				goto err_release_free;
 			break;
+		case NFP_BPF_CAP_TYPE_CMSG_MULTI_ENT:
+			if (nfp_bpf_parse_cap_cmsg_multi_ent(app->priv, value,
+							     length))
+				goto err_release_free;
+			break;
 		default:
 			nfp_dbg(cpp, "unknown BPF capability: %d\n", type);
 			break;
@@ -426,6 +439,11 @@ static int nfp_bpf_start(struct nfp_app *app)
 		return -EINVAL;
 	}
 
+	if (bpf->cmsg_multi_ent)
+		bpf->cmsg_cache_cnt = nfp_bpf_ctrl_cmsg_cache_cnt(bpf);
+	else
+		bpf->cmsg_cache_cnt = 1;
+
 	return 0;
 }
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index f4802036eb42..fac9c6f9e197 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -99,6 +99,7 @@ enum pkt_vec {
  * @maps_neutral:	hash table of offload-neutral maps (on pointer)
  *
  * @abi_version:	global BPF ABI version
+ * @cmsg_cache_cnt:	number of entries to read for caching
  *
  * @adjust_head:	adjust head capability
  * @adjust_head.flags:		extra flags for adjust head
@@ -124,6 +125,7 @@ enum pkt_vec {
  * @pseudo_random:	FW initialized the pseudo-random machinery (CSRs)
  * @queue_select:	BPF can set the RX queue ID in packet vector
  * @adjust_tail:	BPF can simply trunc packet size for adjust tail
+ * @cmsg_multi_ent:	FW can pack multiple map entries in a single cmsg
  */
 struct nfp_app_bpf {
 	struct nfp_app *app;
@@ -134,6 +136,8 @@ struct nfp_app_bpf {
 	unsigned int cmsg_key_sz;
 	unsigned int cmsg_val_sz;
 
+	unsigned int cmsg_cache_cnt;
+
 	struct list_head map_list;
 	unsigned int maps_in_use;
 	unsigned int map_elems_in_use;
@@ -169,6 +173,7 @@ struct nfp_app_bpf {
 	bool pseudo_random;
 	bool queue_select;
 	bool adjust_tail;
+	bool cmsg_multi_ent;
 };
 
 enum nfp_bpf_map_use {
@@ -183,11 +188,21 @@ struct nfp_bpf_map_word {
 	unsigned char non_zero_update	:1;
 };
 
+#define NFP_BPF_MAP_CACHE_CNT		4U
+#define NFP_BPF_MAP_CACHE_TIME_NS	(250 * 1000)
+
 /**
  * struct nfp_bpf_map - private per-map data attached to BPF maps for offload
  * @offmap:	pointer to the offloaded BPF map
  * @bpf:	back pointer to bpf app private structure
  * @tid:	table id identifying map on datapath
+ *
+ * @cache_lock:	protects @cache_blockers, @cache_to, @cache
+ * @cache_blockers:	number of ops in flight which block caching
+ * @cache_gen:	counter incremented by every blocker on exit
+ * @cache_to:	time when cache will no longer be valid (ns)
+ * @cache:	skb with cached response
+ *
  * @l:		link on the nfp_app_bpf->map_list list
  * @use_map:	map of how the value is used (in 4B chunks)
  */
@@ -195,6 +210,13 @@ struct nfp_bpf_map {
 	struct bpf_offloaded_map *offmap;
 	struct nfp_app_bpf *bpf;
 	u32 tid;
+
+	spinlock_t cache_lock;
+	u32 cache_blockers;
+	u32 cache_gen;
+	u64 cache_to;
+	struct sk_buff *cache;
+
 	struct list_head l;
 	struct nfp_bpf_map_word use_map[];
 };
@@ -566,6 +588,7 @@ void *nfp_bpf_relo_for_vnic(struct nfp_prog *nfp_prog, struct nfp_bpf_vnic *bv);
 
 unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf);
 unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf);
+unsigned int nfp_bpf_ctrl_cmsg_cache_cnt(struct nfp_app_bpf *bpf);
 long long int
 nfp_bpf_ctrl_alloc_map(struct nfp_app_bpf *bpf, struct bpf_map *map);
 void
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/offload.c b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
index 39c9fec222b4..88fab6a82acf 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/offload.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/offload.c
@@ -385,6 +385,7 @@ nfp_bpf_map_alloc(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap)
 	offmap->dev_priv = nfp_map;
 	nfp_map->offmap = offmap;
 	nfp_map->bpf = bpf;
+	spin_lock_init(&nfp_map->cache_lock);
 
 	res = nfp_bpf_ctrl_alloc_map(bpf, &offmap->map);
 	if (res < 0) {
@@ -407,6 +408,8 @@ nfp_bpf_map_free(struct nfp_app_bpf *bpf, struct bpf_offloaded_map *offmap)
 	struct nfp_bpf_map *nfp_map = offmap->dev_priv;
 
 	nfp_bpf_ctrl_free_map(bpf, nfp_map);
+	dev_consume_skb_any(nfp_map->cache);
+	WARN_ON_ONCE(nfp_map->cache_blockers);
 	list_del_init(&nfp_map->l);
 	bpf->map_elems_in_use -= offmap->map.max_entries;
 	bpf->maps_in_use--;
-- 
2.21.0


^ permalink raw reply related

* [PATCH bpf-next 1/2] nfp: bpf: rework MTU checking
From: Jakub Kicinski @ 2019-08-28  5:36 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: netdev, oss-drivers, jaco.gericke, Jakub Kicinski, Quentin Monnet
In-Reply-To: <20190828053629.28658-1-jakub.kicinski@netronome.com>

If the control channel MTU is too low to support map operations, a warning
is printed. This is not enough; we want to make sure probe fails in such
a scenario, as this would clearly be a faulty configuration.
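
The check itself boils down to comparing the channel MTU against the size
of the largest single-entry message. A minimal standalone sketch, assuming
a message is a fixed header followed by key/value pairs; struct ctrl_cfg,
msg_size(), ctrl_min_mtu() and ctrl_start() are hypothetical names and the
header sizes are placeholders, not the real cmsg layouts:

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical description of a control channel; not a driver struct. */
struct ctrl_cfg {
	unsigned int mtu;		/* control channel MTU */
	unsigned int key_sz, val_sz;	/* per-entry key and value size */
	unsigned int req_hdr_sz;	/* assumed request header size */
	unsigned int reply_hdr_sz;	/* assumed reply header size */
};

/* Size of a message carrying n key/value entries after its header. */
static unsigned int msg_size(unsigned int hdr_sz, const struct ctrl_cfg *cfg,
			     unsigned int n)
{
	assert(cfg);
	return hdr_sz + (cfg->key_sz + cfg->val_sz) * n;
}

/* Smallest MTU that still fits a one-entry request and reply. */
static unsigned int ctrl_min_mtu(const struct ctrl_cfg *cfg)
{
	unsigned int req = msg_size(cfg->req_hdr_sz, cfg, 1);
	unsigned int reply = msg_size(cfg->reply_hdr_sz, cfg, 1);

	return req > reply ? req : reply;
}

/* Fail start-up instead of only warning when the MTU is too small. */
static int ctrl_start(const struct ctrl_cfg *cfg)
{
	unsigned int min_mtu = ctrl_min_mtu(cfg);

	if (cfg->mtu < min_mtu) {
		fprintf(stderr,
			"ctrl channel MTU below min required %u < %u\n",
			cfg->mtu, min_mtu);
		return -EINVAL;
	}
	return 0;
}
```

Returning an error from the start hook, rather than clamping the MTU and
carrying on, is what turns the misconfiguration into a probe failure.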

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
---
 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c     | 10 +++++++---
 drivers/net/ethernet/netronome/nfp/bpf/main.c     | 15 +++++++++++++++
 drivers/net/ethernet/netronome/nfp/bpf/main.h     |  1 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h      |  2 +-
 .../net/ethernet/netronome/nfp/nfp_net_common.c   |  9 +--------
 5 files changed, 25 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
index bc9850e4ec5e..fcf880c82f3f 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/cmsg.c
@@ -267,11 +267,15 @@ int nfp_bpf_ctrl_getnext_entry(struct bpf_offloaded_map *offmap,
 				     key, NULL, 0, next_key, NULL);
 }
 
+unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf)
+{
+	return max(nfp_bpf_cmsg_map_req_size(bpf, 1),
+		   nfp_bpf_cmsg_map_reply_size(bpf, 1));
+}
+
 unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf)
 {
-	return max3((unsigned int)NFP_NET_DEFAULT_MTU,
-		    nfp_bpf_cmsg_map_req_size(bpf, 1),
-		    nfp_bpf_cmsg_map_reply_size(bpf, 1));
+	return max(NFP_NET_DEFAULT_MTU, nfp_bpf_ctrl_cmsg_min_mtu(bpf));
 }
 
 void nfp_bpf_ctrl_msg_rx(struct nfp_app *app, struct sk_buff *skb)
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.c b/drivers/net/ethernet/netronome/nfp/bpf/main.c
index 1c9fb11470df..2b1773ed3de9 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.c
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.c
@@ -415,6 +415,20 @@ static void nfp_bpf_ndo_uninit(struct nfp_app *app, struct net_device *netdev)
 	bpf_offload_dev_netdev_unregister(bpf->bpf_dev, netdev);
 }
 
+static int nfp_bpf_start(struct nfp_app *app)
+{
+	struct nfp_app_bpf *bpf = app->priv;
+
+	if (app->ctrl->dp.mtu < nfp_bpf_ctrl_cmsg_min_mtu(bpf)) {
+		nfp_err(bpf->app->cpp,
+			"ctrl channel MTU below min required %u < %u\n",
+			app->ctrl->dp.mtu, nfp_bpf_ctrl_cmsg_min_mtu(bpf));
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
 static int nfp_bpf_init(struct nfp_app *app)
 {
 	struct nfp_app_bpf *bpf;
@@ -488,6 +502,7 @@ const struct nfp_app_type app_bpf = {
 
 	.init		= nfp_bpf_init,
 	.clean		= nfp_bpf_clean,
+	.start		= nfp_bpf_start,
 
 	.check_mtu	= nfp_bpf_check_mtu,
 
diff --git a/drivers/net/ethernet/netronome/nfp/bpf/main.h b/drivers/net/ethernet/netronome/nfp/bpf/main.h
index 57d6ff51e980..f4802036eb42 100644
--- a/drivers/net/ethernet/netronome/nfp/bpf/main.h
+++ b/drivers/net/ethernet/netronome/nfp/bpf/main.h
@@ -564,6 +564,7 @@ nfp_bpf_goto_meta(struct nfp_prog *nfp_prog, struct nfp_insn_meta *meta,
 
 void *nfp_bpf_relo_for_vnic(struct nfp_prog *nfp_prog, struct nfp_bpf_vnic *bv);
 
+unsigned int nfp_bpf_ctrl_cmsg_min_mtu(struct nfp_app_bpf *bpf);
 unsigned int nfp_bpf_ctrl_cmsg_mtu(struct nfp_app_bpf *bpf);
 long long int
 nfp_bpf_ctrl_alloc_map(struct nfp_app_bpf *bpf, struct bpf_map *map);
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 5d6c3738b494..250f510b1d21 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -66,7 +66,7 @@
 #define NFP_NET_MAX_DMA_BITS	40
 
 /* Default size for MTU and freelist buffer sizes */
-#define NFP_NET_DEFAULT_MTU		1500
+#define NFP_NET_DEFAULT_MTU		1500U
 
 /* Maximum number of bytes prepended to a packet */
 #define NFP_NET_MAX_PREPEND		64
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 6f97b554f7da..61aabffc8888 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -4116,14 +4116,7 @@ int nfp_net_init(struct nfp_net *nn)
 
 	/* Set default MTU and Freelist buffer size */
 	if (!nfp_net_is_data_vnic(nn) && nn->app->ctrl_mtu) {
-		if (nn->app->ctrl_mtu <= nn->max_mtu) {
-			nn->dp.mtu = nn->app->ctrl_mtu;
-		} else {
-			if (nn->app->ctrl_mtu != NFP_APP_CTRL_MTU_MAX)
-				nn_warn(nn, "app requested MTU above max supported %u > %u\n",
-					nn->app->ctrl_mtu, nn->max_mtu);
-			nn->dp.mtu = nn->max_mtu;
-		}
+		nn->dp.mtu = min(nn->app->ctrl_mtu, nn->max_mtu);
 	} else if (nn->max_mtu < NFP_NET_DEFAULT_MTU) {
 		nn->dp.mtu = nn->max_mtu;
 	} else {
-- 
2.21.0


^ permalink raw reply related

* [PATCH bpf-next 0/2] nfp: bpf: add simple map op cache
From: Jakub Kicinski @ 2019-08-28  5:36 UTC (permalink / raw)
  To: alexei.starovoitov, daniel
  Cc: netdev, oss-drivers, jaco.gericke, Jakub Kicinski

Hi!

This set adds a small batching and cache mechanism to the driver.
Map dumps require two operations per element: get next and
lookup. Each of those needs a round trip to the device and, on
a loaded system, scheduling the dumping process out and back in.
This set makes the driver request a number of entries at the same
time, and if no operation which would modify the map happens
from the host side, those entries are used to serve lookup
requests for up to 250us, at which point they are considered
stale.

This set has been measured to provide almost 4x dumping speed
improvement, Jaco says:

OLD dump times
    500 000 elements: 26.1s
  1 000 000 elements: 54.5s

NEW dump times
    500 000 elements: 7.6s
  1 000 000 elements: 16.5s
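
Taking the figures above at face value, the ratios work out to roughly
3.4x for 500 000 elements and 3.3x for 1 000 000 elements. A small helper
to redo the arithmetic; speedup(), us_per_elem() and report() are
illustrative names only:

```c
#include <stdio.h>

/* Ratio of old to new wall-clock time. */
static double speedup(double old_s, double new_s)
{
	return old_s / new_s;
}

/* Average time spent per map element, in microseconds. */
static double us_per_elem(double total_s, double n_elems)
{
	return total_s * 1e6 / n_elems;
}

/* Print one line of the comparison for a given element count. */
static void report(const char *label, double n_elems, double old_s,
		   double new_s)
{
	printf("%s: %.2fx (%.1fus -> %.1fus per element)\n", label,
	       speedup(old_s, new_s), us_per_elem(old_s, n_elems),
	       us_per_elem(new_s, n_elems));
}
```

Calling report("500k", 500000, 26.1, 7.6) and report("1M", 1000000, 54.5,
16.5) gives the per-element cost before and after for both runs.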

Jakub Kicinski (2):
  nfp: bpf: rework MTU checking
  nfp: bpf: add simple map op cache

 drivers/net/ethernet/netronome/nfp/bpf/cmsg.c | 187 ++++++++++++++++--
 drivers/net/ethernet/netronome/nfp/bpf/fw.h   |   1 +
 drivers/net/ethernet/netronome/nfp/bpf/main.c |  33 ++++
 drivers/net/ethernet/netronome/nfp/bpf/main.h |  24 +++
 .../net/ethernet/netronome/nfp/bpf/offload.c  |   3 +
 drivers/net/ethernet/netronome/nfp/nfp_net.h  |   2 +-
 .../ethernet/netronome/nfp/nfp_net_common.c   |   9 +-
 7 files changed, 239 insertions(+), 20 deletions(-)

-- 
2.21.0

^ permalink raw reply

* Re: [PATCH] powerpc/kmcent2: update the ethernet devices' phy properties
From: Scott Wood @ 2019-08-28  4:19 UTC (permalink / raw)
  To: Valentin Longchamp, Madalin-cristian Bucur
  Cc: linuxppc-dev@lists.ozlabs.org, galak@kernel.crashing.org,
	netdev@vger.kernel.org
In-Reply-To: <CADYrJDxsQ3H7b_BHOfmfTNb1OuXt+vzTg4k8Goj8tKPaaOMz_g@mail.gmail.com>

On Thu, 2019-08-08 at 23:09 +0200, Valentin Longchamp wrote:
> On Tue, 30 Jul 2019 at 11:44, Madalin-cristian Bucur
> <madalin.bucur@nxp.com> wrote:
> > 
> > > -----Original Message-----
> > > 
> > > > On Sun, 14 Jul 2019 at 22:05, Valentin Longchamp
> > > > <valentin@longchamp.me> wrote:
> > > > > 
> > > > > Change all phy-connection-type properties to phy-mode that are better
> > > > > supported by the fman driver.
> > > > > 
> > > > > Use the more readable fixed-link node for the 2 sgmii links.
> > > > > 
> > > > > Change the RGMII link to rgmii-id as the clock delays are added by the
> > > > > phy.
> > > > > Signed-off-by: Valentin Longchamp <valentin@longchamp.me>
> > > 
> > > I don't see any other uses of phy-mode in arch/powerpc/boot/dts/fsl, and
> > > I see lots of phy-connection-type with fman.  Madalin, does this patch
> > > look OK?
> > > 
> > > -Scott
> > 
> > Hi,
> > 
> > we are using "phy-connection-type" not "phy-mode" for the NXP (former
> > Freescale) DPAA platforms. While the two seem to be interchangeable
> > ("phy-mode" seems to be more recent, looking at the device tree bindings),
> > the driver code in Linux seems to use one or the other, not both so one
> > should stick with the variant the driver is using. To make things more
> > complex, there may be dependencies in bootloaders, I see code in u-boot
> > using only "phy-connection-type" or only "phy-mode".
> > 
> > I'd leave "phy-connection-type" as is.
> 
> So I have finally had time to have a look and now I understand what
> happens. You are right, there are bootloader dependencies: u-boot
> calls fdt_fixup_phy_connection() that somehow in our case adds (or
> changes if already in the device tree) the phy-connection-type
> property to a wrong value! By having a phy-mode in the device tree,
> that is not changed by u-boot and by chance picked up by the kernel
> fman driver (of_get_phy_mode()) over phy-connection-type, the below
> patch fixes it for us.
> 
> I agree with you, it's not correct to have both phy-connection-type
> and phy-mode. Ideally, u-boot on the board should be reworked so that
> it does not perform the above wrong fixup. However, in an "unfixed"
> .dtb (I have disabled fdt_fixup_phy_connection), the device tree in
> the end only has either phy-connection-type or phy-mode, according to
> what was chosen in the .dts file. And the fman driver works well with
> both (thanks to the call to of_get_phy_mode() ). I would therefore
> argue that even if all other DPAA platforms use phy-connection-type,
> phy-mode is valid as well. (Furthermore we already have hundreds of
> such boards in the field and we don't really support "remote" u-boot
> update, so the u-boot fix is going to be difficult for us to pull).
> 
> Valentin

Madalin, are you OK with the patch given this explanation?

-Scott



^ permalink raw reply

* Re: [PATCH v1 net-next 0/4] Add EHL and TGL PCI info and PCI ID
From: David Miller @ 2019-08-28  4:59 UTC (permalink / raw)
  To: weifeng.voon
  Cc: mcoquelin.stm32, netdev, linux-kernel, joabreu, peppe.cavallaro,
	andrew, alexandre.torgue, boon.leong.ong
In-Reply-To: <1566869891-29239-1-git-send-email-weifeng.voon@intel.com>

From: Voon Weifeng <weifeng.voon@intel.com>
Date: Tue, 27 Aug 2019 09:38:07 +0800

> In order to keep PCI info simple and neat, this patch series introduces
> a three-level hierarchy of structs. The first layer is the
> intel_mgbe_common_data struct, which keeps all common Intel configuration.
> The second layer is xxx_common_data, which keeps the configuration for each
> Intel microarchitecture, e.g. tgl, ehl. The third layer is the configuration
> that is tied only to the PCI ID, based on speed and RGMII/SGMII interface.
> 
> EHL and TGL will also have a higher system clock of 200 MHz.

Series applied.

^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28  4:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
	Peter Zijlstra, Masami Hiramatsu, Steven Rostedt, David S. Miller,
	Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <CALCETrVVQs1s27y8fB17JtQi-VzTq1YZPTPy3k=fKhQB1X-KKA@mail.gmail.com>

On Tue, Aug 27, 2019 at 07:00:40PM -0700, Andy Lutomirski wrote:
> 
> Let me put this a bit differently. Part of the point is that
> CAP_TRACING should allow a user or program to trace without being able
> to corrupt the system. CAP_BPF as you’ve proposed it *can* likely
> crash the system.

Really? I'm still waiting for your example where bpf+kprobe crashes the system...


^ permalink raw reply

* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Alexei Starovoitov @ 2019-08-28  4:47 UTC (permalink / raw)
  To: Masami Hiramatsu
  Cc: Steven Rostedt, Andy Lutomirski, Alexei Starovoitov, Kees Cook,
	LSM List, James Morris, Jann Horn, Peter Zijlstra,
	David S. Miller, Daniel Borkmann, Network Development, bpf,
	kernel-team, Linux API
In-Reply-To: <20190828123041.c0c90c15865897461ee819a2@kernel.org>

On Wed, Aug 28, 2019 at 12:30:41PM +0900, Masami Hiramatsu wrote:
> > kprobes can be created in the tracefs filesystem (which is separate from
> > debugfs, tracefs just gets automatically mounted
> > in /sys/kernel/debug/tracing when debugfs is mounted) from the
> > kprobe_events file. /sys/kernel/tracing is just the tracefs
> > directory without debugfs, and was created specifically to allow
> > tracing to be accessed without opening up the can of worms in debugfs.
> 
> I like the CAP_TRACING for tracefs. Can we make the tracefs itself
> check the CAP_TRACING and call file_ops? or each tracefs file-ops
> handlers must check it?

Thanks for the feedback.
I'll hack a prototype of CAP_TRACING for perf bits that I understand
and you folks will be able to use it in ftrace when initial support lands.
imo the question above is an implementation detail that you can resolve later.
I see it as a followup to initial CAP_TRACING drop.


^ permalink raw reply
