Netdev List
* [PATCH mlx5-next 09/10] net/mlx5: Remove unused function
From: Leon Romanovsky @ 2018-11-08 19:10 UTC
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Artemy Kovalyov, Majd Dibbiny,
	Moni Shoua, Saeed Mahameed, linux-netdev
In-Reply-To: <20181108191017.21891-1-leon@kernel.org>

From: Moni Shoua <monis@mellanox.com>

The function mlx5_core_page_fault_resume() has no callers, so remove it
together with its declaration.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/eq.c | 16 ----------------
 include/linux/mlx5/driver.h                  |  4 ----
 2 files changed, 20 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/eq.c b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
index aeab0c4f60f4..22eabb80e122 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/eq.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/eq.c
@@ -373,22 +373,6 @@ static int init_pf_ctx(struct mlx5_eq_pagefault *pf_ctx, const char *name)
 	return -ENOMEM;
 }
 
-int mlx5_core_page_fault_resume(struct mlx5_core_dev *dev, u32 token,
-				u32 wq_num, u8 type, int error)
-{
-	u32 out[MLX5_ST_SZ_DW(page_fault_resume_out)] = {0};
-	u32 in[MLX5_ST_SZ_DW(page_fault_resume_in)]   = {0};
-
-	MLX5_SET(page_fault_resume_in, in, opcode,
-		 MLX5_CMD_OP_PAGE_FAULT_RESUME);
-	MLX5_SET(page_fault_resume_in, in, error, !!error);
-	MLX5_SET(page_fault_resume_in, in, page_fault_type, type);
-	MLX5_SET(page_fault_resume_in, in, wq_number, wq_num);
-	MLX5_SET(page_fault_resume_in, in, token, token);
-
-	return mlx5_cmd_exec(dev, in, sizeof(in), out, sizeof(out));
-}
-EXPORT_SYMBOL_GPL(mlx5_core_page_fault_resume);
 #endif
 
 static void general_event_handler(struct mlx5_core_dev *dev,
diff --git a/include/linux/mlx5/driver.h b/include/linux/mlx5/driver.h
index 85275612a473..1de1e8e91382 100644
--- a/include/linux/mlx5/driver.h
+++ b/include/linux/mlx5/driver.h
@@ -1141,10 +1141,6 @@ int mlx5_query_odp_caps(struct mlx5_core_dev *dev,
 			struct mlx5_odp_caps *odp_caps);
 int mlx5_core_query_ib_ppcnt(struct mlx5_core_dev *dev,
 			     u8 port_num, void *out, size_t sz);
-#ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
-int mlx5_core_page_fault_resume(struct mlx5_core_dev *dev, u32 token,
-				u32 wq_num, u8 type, int error);
-#endif
 
 int mlx5_init_rl_table(struct mlx5_core_dev *dev);
 void mlx5_cleanup_rl_table(struct mlx5_core_dev *dev);
-- 
2.19.1


* [PATCH mlx5-next 10/10] IB/mlx5: Improve ODP debugging messages
From: Leon Romanovsky @ 2018-11-08 19:10 UTC
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Artemy Kovalyov, Majd Dibbiny,
	Moni Shoua, Saeed Mahameed, linux-netdev
In-Reply-To: <20181108191017.21891-1-leon@kernel.org>

From: Moni Shoua <monis@mellanox.com>

Add and modify debug messages in ODP-related error flows.
In that context, the return code -EAGAIN is considered less severe, so it
is printed at debug level instead of warn.

Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/infiniband/core/umem_odp.c | 14 ++++++++++++--
 drivers/infiniband/hw/mlx5/odp.c   |  4 ++--
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 2b4c5e7dd5a1..e334480b56bf 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -655,8 +655,13 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 				flags, local_page_list, NULL, NULL);
 		up_read(&owning_mm->mmap_sem);

-		if (npages < 0)
+		if (npages < 0) {
+			if (npages != -EAGAIN)
+				pr_warn("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
+			else
+				pr_debug("fail to get %zu user pages with error %d\n", gup_num_pages, npages);
 			break;
+		}

 		bcnt -= min_t(size_t, npages << PAGE_SHIFT, bcnt);
 		mutex_lock(&umem_odp->umem_mutex);
@@ -674,8 +679,13 @@ int ib_umem_odp_map_dma_pages(struct ib_umem_odp *umem_odp, u64 user_virt,
 			ret = ib_umem_odp_map_dma_single_page(
 					umem_odp, k, local_page_list[j],
 					access_mask, current_seq);
-			if (ret < 0)
+			if (ret < 0) {
+				if (ret != -EAGAIN)
+					pr_warn("ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
+				else
+					pr_debug("ib_umem_odp_map_dma_single_page failed with error %d\n", ret);
 				break;
+			}

 			p = page_to_phys(local_page_list[j]);
 			k++;
diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
index 0c4f469cdd5b..7bca1592d811 100644
--- a/drivers/infiniband/hw/mlx5/odp.c
+++ b/drivers/infiniband/hw/mlx5/odp.c
@@ -665,8 +665,8 @@ static int pagefault_mr(struct mlx5_ib_dev *dev, struct mlx5_ib_mr *mr,
 			if (!wait_for_completion_timeout(
 					&odp->notifier_completion,
 					timeout)) {
-				mlx5_ib_warn(dev, "timeout waiting for mmu notifier. seq %d against %d\n",
-					     current_seq, odp->notifiers_seq);
+				mlx5_ib_warn(dev, "timeout waiting for mmu notifier. seq %d against %d. notifiers_count=%d\n",
+					     current_seq, odp->notifiers_seq, odp->notifiers_count);
 			}
 		} else {
 			/* The MR is being killed, kill the QP as well. */

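The log-level choice described in the commit message above can be sketched in plain user-space C. This is only an illustration under stated assumptions: `enum log_level` and `odp_log_level()` are hypothetical names that do not exist in the kernel, where the patch calls pr_warn() and pr_debug() directly.

```c
#include <errno.h>

/* Hypothetical log levels for illustration; the patch itself uses pr_warn()/pr_debug(). */
enum log_level { LOG_LVL_DEBUG, LOG_LVL_WARN };

/*
 * Pick a print level for a negative errno-style return code:
 * -EAGAIN is treated as a transient condition and logged at debug level,
 * any other failure is logged as a warning.
 */
enum log_level odp_log_level(int err)
{
	return (err == -EAGAIN) ? LOG_LVL_DEBUG : LOG_LVL_WARN;
}
```

With this helper, odp_log_level(-EAGAIN) yields LOG_LVL_DEBUG and odp_log_level(-ENOMEM) yields LOG_LVL_WARN, which mirrors the two branches added to ib_umem_odp_map_dma_pages().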

* [PATCH mlx5-next 06/10] net/mlx5: Return success for PAGE_FAULT_RESUME in internal error state
From: Leon Romanovsky @ 2018-11-08 19:10 UTC
  To: Doug Ledford, Jason Gunthorpe
  Cc: Leon Romanovsky, RDMA mailing list, Artemy Kovalyov, Majd Dibbiny,
	Moni Shoua, Saeed Mahameed, linux-netdev
In-Reply-To: <20181108191017.21891-1-leon@kernel.org>

From: Moni Shoua <monis@mellanox.com>

When the device is in internal error state, the command interface isn't
accessible, and the driver decides which commands to fail and which to
pass.
Move the PAGE_FAULT_RESUME command to the pass list to avoid redundant
failure messages.

Fixes: 89d44f0a6c73 ("net/mlx5_core: Add pci error handlers to mlx5_core driver")
Signed-off-by: Moni Shoua <monis@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/cmd.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
index a5a0823e5ada..7b18aff955f1 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/cmd.c
@@ -313,6 +313,7 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_FPGA_DESTROY_QP:
 	case MLX5_CMD_OP_DESTROY_GENERAL_OBJECT:
 	case MLX5_CMD_OP_DEALLOC_MEMIC:
+	case MLX5_CMD_OP_PAGE_FAULT_RESUME:
 		return MLX5_CMD_STAT_OK;
 
 	case MLX5_CMD_OP_QUERY_HCA_CAP:
@@ -326,7 +327,6 @@ static int mlx5_internal_err_ret_value(struct mlx5_core_dev *dev, u16 op,
 	case MLX5_CMD_OP_CREATE_MKEY:
 	case MLX5_CMD_OP_QUERY_MKEY:
 	case MLX5_CMD_OP_QUERY_SPECIAL_CONTEXTS:
-	case MLX5_CMD_OP_PAGE_FAULT_RESUME:
 	case MLX5_CMD_OP_CREATE_EQ:
 	case MLX5_CMD_OP_QUERY_EQ:
 	case MLX5_CMD_OP_GEN_EQE:
-- 
2.19.1

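The pass/fail split that the commit message describes can be sketched as a small stand-alone C function. The opcode enum and the function name below are hypothetical and exist only for this illustration; the real opcodes and the real switch live in the mlx5 headers and in mlx5_internal_err_ret_value().

```c
#include <stdbool.h>

/* Hypothetical opcodes for illustration; the real values are defined by mlx5. */
enum demo_op {
	DEMO_OP_DEALLOC_MEMIC,
	DEMO_OP_PAGE_FAULT_RESUME,
	DEMO_OP_CREATE_MKEY,
	DEMO_OP_CREATE_EQ,
};

/*
 * Return true when a command is reported as successful while the device
 * is in internal error state, and false when the driver fails it.
 */
bool demo_passes_in_internal_error(enum demo_op op)
{
	switch (op) {
	case DEMO_OP_DEALLOC_MEMIC:
	case DEMO_OP_PAGE_FAULT_RESUME:	/* moved to the pass list by this patch */
		return true;
	case DEMO_OP_CREATE_MKEY:
	case DEMO_OP_CREATE_EQ:
	default:
		return false;
	}
}
```

The sketch keeps the same shape as the patched switch: a command changes behaviour only by moving its case label from one group to the other.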

* Re: Kernel 4.19 network performance - forwarding/routing normal users traffic
From: Paweł Staszewski @ 2018-11-08 19:12 UTC
  To: Saeed Mahameed, netdev@vger.kernel.org
In-Reply-To: <920c2665-781f-5f62-efbe-347e63063a24@itcare.pl>



On 03.11.2018 at 01:18, Paweł Staszewski wrote:
>
>
> On 01.11.2018 at 21:37, Saeed Mahameed wrote:
>> On Thu, 2018-11-01 at 12:09 +0100, Paweł Staszewski wrote:
>>> On 01.11.2018 at 10:50, Saeed Mahameed wrote:
>>>> On Wed, 2018-10-31 at 22:57 +0100, Paweł Staszewski wrote:
>>>>> Hi
>>>>>
>>>>> So maybe someone will be interested in how the Linux kernel handles
>>>>> normal traffic (not pktgen :) )
>>>>>
>>>>>
>>>>> Server HW configuration:
>>>>>
>>>>> CPU : Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz
>>>>>
>>>>> NICs: 2x 100G Mellanox ConnectX-4 (connected to x16 pcie 8GT)
>>>>>
>>>>>
>>>>> Server software:
>>>>>
>>>>> FRR - as routing daemon
>>>>>
>>>>> enp175s0f0 (100G) - 16 vlans from upstreams (28 RSS bound to the
>>>>> local numa node)
>>>>>
>>>>> enp175s0f1 (100G) - 343 vlans to clients (28 RSS bound to the
>>>>> local numa node)
>>>>>
>>>>> Maximum traffic that server can handle:
>>>>>
>>>>> Bandwidth
>>>>>
>>>>>     bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>>>      input: /proc/net/dev type: rate
>>>>>      \         iface                   Rx                   Tx                Total
>>>>> ==============================================================================
>>>>>           enp175s0f1:          28.51 Gb/s           37.24 Gb/s           65.74 Gb/s
>>>>>           enp175s0f0:          38.07 Gb/s           28.44 Gb/s           66.51 Gb/s
>>>>> ------------------------------------------------------------------------------
>>>>>                total:          66.58 Gb/s           65.67 Gb/s          132.25 Gb/s
>>>>>
>>>>>
>>>>> Packets per second:
>>>>>
>>>>>     bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
>>>>>      input: /proc/net/dev type: rate
>>>>>      -         iface                   Rx                   Tx                Total
>>>>> ==============================================================================
>>>>>           enp175s0f1:      5248589.00 P/s       3486617.75 P/s       8735207.00 P/s
>>>>>           enp175s0f0:      3557944.25 P/s       5232516.00 P/s       8790460.00 P/s
>>>>> ------------------------------------------------------------------------------
>>>>>                total:      8806533.00 P/s       8719134.00 P/s      17525668.00 P/s
>>>>>
>>>>>
>>>>> After reaching those limits, the NICs on the upstream side (more RX
>>>>> traffic) start to drop packets.
>>>>>
>>>>>
>>>>> I just don't understand why the server can't handle more bandwidth
>>>>> (~40Gbit/s is the limit where all CPUs are at 100% util) while pps on
>>>>> the RX side are increasing.
>>>>>
>>>> Where do you see 40 Gb/s? You showed that both ports on the same NIC
>>>> (same pcie link) are doing 66.58 Gb/s (RX) + 65.67 Gb/s (TX) = 132.25
>>>> Gb/s, which aligns with your pcie link limit. What am I missing?
>>> Hmm, yes, that was my concern also, because I can't find any information
>>> anywhere on whether that bandwidth is uni- or bidirectional - so if
>>> 126Gbit for x16 8GT is unidir - then bidir will be 126/2 ~63Gbit - which
>>> will fit the total bw on both ports
>> I think it is bidir
> So yes - we are hitting some other problem there. I think pcie is most
> probably bidirectional with max bw 126Gbit, so RX 126Gbit and at the same
> time TX should be 126Gbit
>
>
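The 126Gbit figure discussed above can be reproduced with a short calculation. The sketch below assumes the standard PCIe Gen3 parameters (8 GT/s per lane, 128b/130b line encoding) and ignores protocol overhead such as TLP headers and flow control, so the real usable throughput is lower. The function name is made up for this example. PCIe lanes carry traffic in both directions at once, so the result applies to each direction separately.

```c
/*
 * Raw PCIe link bandwidth per direction, in Gbit/s:
 * transfer rate (GT/s) x lane count x line-encoding efficiency.
 * Protocol overhead (TLP/DLLP headers, flow control) is not included.
 */
double pcie_gbps_per_direction(double gt_per_s, int lanes,
			       int payload_bits, int encoded_bits)
{
	return gt_per_s * lanes * payload_bits / encoded_bits;
}
```

For a Gen3 x16 slot, pcie_gbps_per_direction(8.0, 16, 128, 130) gives roughly 126.03 Gbit/s in each direction.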
So the one 2-port 100G ConnectX-4 card was replaced with two separate
ConnectX-5 cards placed in two different PCIe x16 gen 3.0 slots:
lspci -vvv -s af:00.0
af:00.0 Ethernet controller: Mellanox Technologies MT27800 Family 
[ConnectX-5]
         Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
ParErr- Stepping- SERR+ FastB2B- DisINTx+
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
<TAbort- <MAbort- >SERR- <PERR- INTx-
         Latency: 0, Cache Line Size: 32 bytes
         Interrupt: pin A routed to IRQ 90
         NUMA node: 1
         Region 0: Memory at 39bffe000000 (64-bit, prefetchable) [size=32M]
         Expansion ROM at ee600000 [disabled] [size=1M]
         Capabilities: [60] Express (v2) Endpoint, MSI 00
                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s 
unlimited, L1 unlimited
                         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ 
SlotPowerLimit 0.000W
                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
Unsupported-
                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ 
FLReset-
                         MaxPayload 256 bytes, MaxReadReq 4096 bytes
                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ 
AuxPwr- TransPend-
                 LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not 
supported, Exit Latency L0s unlimited, L1 unlimited
                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                 LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ 
DLActive- BWMgmt- ABWMgmt-
                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, 
LTR-, OBFF Not Supported
                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, 
LTR-, OBFF Disabled
                 LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- 
SpeedDis-
                          Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
                          Compliance De-emphasis: -6dB
                 LnkSta2: Current De-emphasis Level: -6dB, 
EqualizationComplete+, EqualizationPhase1+
                          EqualizationPhase2+, EqualizationPhase3+, 
LinkEqualizationRequest-
         Capabilities: [48] Vital Product Data
                 Product Name: CX515A - ConnectX-5 QSFP28
                 Read-only fields:
                         [PN] Part number: MCX515A-CCAT
                         [EC] Engineering changes: A6
                         [V2] Vendor specific: MCX515A-CCAT
                         [SN] Serial number: MT1831J00221
                         [V3] Vendor specific: 
14a5c73bee92e811800098039b1ee5f0
                         [VA] Vendor specific: 
MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
                         [V0] Vendor specific: PCIeGen3 x16
                         [RV] Reserved: checksum good, 2 byte(s) reserved
                 End
         Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
                 Vector table: BAR=0 offset=00002000
                 PBA: BAR=0 offset=00003000
         Capabilities: [c0] Vendor Specific Information: Len=18 <?>
         Capabilities: [40] Power Management version 3
                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA 
PME(D0-,D1-,D2-,D3hot-,D3cold+)
                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [100 v1] Advanced Error Reporting
                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
NonFatalErr+
                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
NonFatalErr+
                 AERCap: First Error Pointer: 04, GenCap+ CGenEn- 
ChkCap+ ChkEn-
         Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                 ARICap: MFVC- ACS-, Next Function: 0
                 ARICtl: MFVC- ACS-, Function Group: 0
         Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
                 IOVCap: Migration-, Interrupt Message Number: 000
                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                 IOVSta: Migration-
                 Initial VFs: 0, Total VFs: 0, Number of VFs: 0, 
Function Dependency Link: 00
                 VF offset: 1, stride: 1, Device ID: 1018
                 Supported Page Size: 000007ff, System Page Size: 00000001
                 Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
                 VF Migration: offset: 00000000, BIR: 0
         Capabilities: [1c0 v1] #19
         Kernel driver in use: mlx5_core

d8:00.0 Ethernet controller: Mellanox Technologies MT27800 Family 
[ConnectX-5]
         Subsystem: Mellanox Technologies MT27800 Family [ConnectX-5]
         Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- 
ParErr- Stepping- SERR+ FastB2B- DisINTx+
         Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- 
<TAbort- <MAbort- >SERR- <PERR- INTx-
         Latency: 0, Cache Line Size: 32 bytes
         Interrupt: pin A routed to IRQ 159
         NUMA node: 1
         Region 0: Memory at 39fffe000000 (64-bit, prefetchable) [size=32M]
         Expansion ROM at fbe00000 [disabled] [size=1M]
         Capabilities: [60] Express (v2) Endpoint, MSI 00
                 DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s 
unlimited, L1 unlimited
                         ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ 
SlotPowerLimit 0.000W
                 DevCtl: Report errors: Correctable- Non-Fatal- Fatal- 
Unsupported-
                         RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ 
FLReset-
                         MaxPayload 256 bytes, MaxReadReq 4096 bytes
                 DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ 
AuxPwr- TransPend-
                 LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM not 
supported, Exit Latency L0s unlimited, L1 unlimited
                         ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                 LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
                         ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
                 LnkSta: Speed 8GT/s, Width x16, TrErr- Train- SlotClk+ 
DLActive- BWMgmt- ABWMgmt-
                 DevCap2: Completion Timeout: Range ABCD, TimeoutDis+, 
LTR-, OBFF Not Supported
                 DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, 
LTR-, OBFF Disabled
                 LnkCtl2: Target Link Speed: 8GT/s, EnterCompliance- 
SpeedDis-
                          Transmit Margin: Normal Operating Range, 
EnterModifiedCompliance- ComplianceSOS-
                          Compliance De-emphasis: -6dB
                 LnkSta2: Current De-emphasis Level: -6dB, 
EqualizationComplete+, EqualizationPhase1+
                          EqualizationPhase2+, EqualizationPhase3+, 
LinkEqualizationRequest-
         Capabilities: [48] Vital Product Data
                 Product Name: CX515A - ConnectX-5 QSFP28
                 Read-only fields:
                         [PN] Part number: MCX515A-CCAT
                         [EC] Engineering changes: A6
                         [V2] Vendor specific: MCX515A-CCAT
                         [SN] Serial number: MT1831J00169
                         [V3] Vendor specific: 
c06757e6e092e811800098039b1ee520
                         [VA] Vendor specific: 
MLX:MODL=CX515A:MN=MLNX:CSKU=V2:UUID=V3:PCI=V0
                         [V0] Vendor specific: PCIeGen3 x16
                         [RV] Reserved: checksum good, 2 byte(s) reserved
                 End
         Capabilities: [9c] MSI-X: Enable+ Count=64 Masked-
                 Vector table: BAR=0 offset=00002000
                 PBA: BAR=0 offset=00003000
         Capabilities: [c0] Vendor Specific Information: Len=18 <?>
         Capabilities: [40] Power Management version 3
                 Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA 
PME(D0-,D1-,D2-,D3hot-,D3cold+)
                 Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
         Capabilities: [100 v1] Advanced Error Reporting
                 UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                 UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
                 UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- 
UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
                 CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
NonFatalErr+
                 CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- 
NonFatalErr+
                 AERCap: First Error Pointer: 04, GenCap+ CGenEn- 
ChkCap+ ChkEn-
         Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
                 ARICap: MFVC- ACS-, Next Function: 0
                 ARICtl: MFVC- ACS-, Function Group: 0
         Capabilities: [180 v1] Single Root I/O Virtualization (SR-IOV)
                 IOVCap: Migration-, Interrupt Message Number: 000
                 IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
                 IOVSta: Migration-
                 Initial VFs: 0, Total VFs: 0, Number of VFs: 0, 
Function Dependency Link: 00
                 VF offset: 1, stride: 1, Device ID: 1018
                 Supported Page Size: 000007ff, System Page Size: 00000001
                 Region 0: Memory at 0000000000000000 (64-bit, prefetchable)
                 VF Migration: offset: 00000000, BIR: 0
         Capabilities: [1c0 v1] #19
         Kernel driver in use: mlx5_core



CPU load is lower than with the ConnectX-4, but it looks like the bandwidth
limit is the same :)
Again, after reaching 60Gbit/60Gbit:

  bwm-ng v0.6.1 (probing every 1.000s), press 'h' for help
   input: /proc/net/dev type: rate
   -         iface                   Rx                   Tx                Total
==============================================================================
          enp175s0:          45.09 Gb/s           15.09 Gb/s           60.18 Gb/s
          enp216s0:          15.14 Gb/s           45.19 Gb/s           60.33 Gb/s
------------------------------------------------------------------------------
             total:          60.45 Gb/s           60.48 Gb/s          120.93 Gb/s


the NICs start to drop packets (discards on the NIC with more RX traffic):
ethtool -S enp175s0 |grep 'disc'
      rx_discards_phy: 47265611

after 20 secs

ethtool -S enp175s0 |grep 'disc'
      rx_discards_phy: 49434472
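
The two rx_discards_phy samples above can be turned into an average drop rate. This is a minimal sketch, assuming the samples were taken exactly 20 seconds apart and that the counter only increases; counter_rate() is a hypothetical helper name, not part of ethtool.

```c
#include <stdint.h>

/*
 * Average per-second rate of a monotonically increasing counter,
 * computed from two samples taken `seconds` apart.
 */
double counter_rate(uint64_t before, uint64_t after, double seconds)
{
	return (double)(after - before) / seconds;
}
```

With the values shown, counter_rate(47265611, 49434472, 20.0) comes to about 108443 discarded packets per second on enp175s0.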


Current coalescing params:
ethtool -c enp175s0
Coalesce parameters for enp175s0:
Adaptive RX: off  TX: on
stats-block-usecs: 0
sample-interval: 0
pkt-rate-low: 0
pkt-rate-high: 0
dmac: 32651

rx-usecs: 128
rx-frames: 128
rx-usecs-irq: 0
rx-frames-irq: 0

tx-usecs: 8
tx-frames: 128
tx-usecs-irq: 0
tx-frames-irq: 0

rx-usecs-low: 0
rx-frame-low: 0
tx-usecs-low: 0
tx-frame-low: 0

rx-usecs-high: 0
rx-frame-high: 0
tx-usecs-high: 0
tx-frame-high: 0


and perf top:
    PerfTop:   86898 irqs/sec  kernel:99.5%  exact:  0.0% [4000Hz cycles],  (all, 56 CPUs)
------------------------------------------------------------------------------

     12.76%  [kernel]       [k] mlx5e_skb_from_cqe_mpwrq_linear
      8.68%  [kernel]       [k] mlx5e_sq_xmit
      6.47%  [kernel]       [k] build_skb
      4.78%  [kernel]       [k] fib_table_lookup
      4.58%  [kernel]       [k] memcpy_erms
      3.47%  [kernel]       [k] mlx5e_poll_rx_cq
      2.59%  [kernel]       [k] mlx5e_handle_rx_cqe_mpwrq
      2.37%  [kernel]       [k] mlx5e_post_rx_mpwqes
      2.33%  [kernel]       [k] vlan_do_receive
      1.94%  [kernel]       [k] __dev_queue_xmit
      1.89%  [kernel]       [k] mlx5e_poll_tx_cq
      1.74%  [kernel]       [k] ip_finish_output2
      1.67%  [kernel]       [k] dev_gro_receive
      1.64%  [kernel]       [k] ipt_do_table
      1.58%  [kernel]       [k] tcp_gro_receive
      1.49%  [kernel]       [k] pfifo_fast_dequeue
      1.28%  [kernel]       [k] mlx5_eq_int
      1.26%  [kernel]       [k] inet_gro_receive
      1.26%  [kernel]       [k] _raw_spin_lock
      1.20%  [kernel]       [k] __netif_receive_skb_core
      1.19%  [kernel]       [k] irq_entries_start
      1.17%  [kernel]       [k] swiotlb_map_page
      1.13%  [kernel]       [k] vlan_dev_hard_start_xmit
      1.12%  [kernel]       [k] ip_route_input_rcu
      0.97%  [kernel]       [k] __build_skb
      0.84%  [kernel]       [k] _raw_spin_lock_irqsave
      0.78%  [kernel]       [k] kmem_cache_alloc
      0.77%  [kernel]       [k] mlx5e_xmit
      0.77%  [kernel]       [k] dev_hard_start_xmit
      0.76%  [kernel]       [k] ip_forward
      0.73%  [kernel]       [k] netif_skb_features
      0.70%  [kernel]       [k] tasklet_action_common.isra.21
      0.58%  [kernel]       [k] validate_xmit_skb.isra.142
      0.55%  [kernel]       [k] ip_rcv_core.isra.20.constprop.25
      0.55%  [kernel]       [k] mlx5e_page_release
      0.55%  [kernel]       [k] __qdisc_run
      0.51%  [kernel]       [k] __memcpy
      0.48%  [kernel]       [k] kmem_cache_free_bulk
      0.48%  [kernel]       [k] page_frag_free
      0.47%  [kernel]       [k] inet_lookup_ifaddr_rcu
      0.47%  [kernel]       [k] queued_spin_lock_slowpath
      0.46%  [kernel]       [k] pfifo_fast_enqueue
      0.43%  [kernel]       [k] tcp4_gro_receive
      0.40%  [kernel]       [k] skb_gro_receive
      0.39%  [kernel]       [k] skb_release_data
      0.38%  [kernel]       [k] find_busiest_group
      0.36%  [kernel]       [k] _raw_spin_trylock
      0.36%  [kernel]       [k] skb_segment
      0.33%  [kernel]       [k] eth_type_trans
      0.32%  [kernel]       [k] __sched_text_start
      0.32%  [kernel]       [k] __netif_schedule
      0.32%  [kernel]       [k] try_to_wake_up
      0.31%  [kernel]       [k] _raw_spin_lock_irq
      0.31%  [kernel]       [k] __local_bh_enable_ip



Also mpstat:
Average:     CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
Average:     all    0.06    0.00    1.00    0.02    0.00   21.61    0.00    0.00    0.00   77.32
Average:       0    0.00    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00   99.40
Average:       1    0.10    0.00    1.30    0.00    0.00    0.00    0.00    0.00    0.00   98.60
Average:       2    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:       3    0.00    0.00    1.60    0.00    0.00    0.00    0.00    0.00    0.00   98.40
Average:       4    0.00    0.00    1.00    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:       5    0.20    0.00    4.60    0.00    0.00    0.00    0.00    0.00    0.00   95.20
Average:       6    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:       7    0.60    0.00    3.00    0.00    0.00    0.00    0.00    0.00    0.00   96.40
Average:       8    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:       9    0.70    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:      10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      11    0.00    0.00    2.00    0.00    0.00    0.00    0.00    0.00    0.00   98.00
Average:      12    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      13    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      14    0.00    0.00    1.00    0.00    0.00   50.40    0.00    0.00    0.00   48.60
Average:      15    0.00    0.00    1.30    0.00    0.00   47.90    0.00    0.00    0.00   50.80
Average:      16    0.00    0.00    2.00    0.00    0.00   47.80    0.00    0.00    0.00   50.20
Average:      17    0.00    0.00    1.30    0.00    0.00   50.20    0.00    0.00    0.00   48.50
Average:      18    0.10    0.00    1.10    0.00    0.00   42.40    0.00    0.00    0.00   56.40
Average:      19    0.00    0.00    1.50    0.00    0.00   44.40    0.00    0.00    0.00   54.10
Average:      20    0.00    0.00    1.40    0.00    0.00   45.90    0.00    0.00    0.00   52.70
Average:      21    0.00    0.00    0.70    0.00    0.00   44.50    0.00    0.00    0.00   54.80
Average:      22    0.10    0.00    1.40    0.00    0.00   47.00    0.00    0.00    0.00   51.50
Average:      23    0.00    0.00    0.30    0.00    0.00   45.50    0.00    0.00    0.00   54.20
Average:      24    0.00    0.00    1.60    0.00    0.00   50.00    0.00    0.00    0.00   48.40
Average:      25    0.10    0.00    0.70    0.00    0.00   47.00    0.00    0.00    0.00   52.20
Average:      26    0.00    0.00    1.80    0.00    0.00   48.70    0.00    0.00    0.00   49.50
Average:      27    0.00    0.00    1.10    0.00    0.00   44.80    0.00    0.00    0.00   54.10
Average:      28    0.30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00   99.70
Average:      29    0.10    0.00    0.60    0.00    0.00    0.00    0.00    0.00    0.00   99.30
Average:      30    0.00    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.80
Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      32    0.00    0.00    1.20    0.00    0.00    0.00    0.00    0.00    0.00   98.80
Average:      33    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      36    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      37    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      38    0.20    0.00    0.80    0.00    0.00    0.00    0.00    0.00    0.00   99.00
Average:      39    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      40    0.00    0.00    3.30    0.00    0.00    0.00    0.00    0.00    0.00   96.70
Average:      41    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
Average:      42    0.00    0.00    0.80    0.00    0.00   45.00    0.00    0.00    0.00   54.20
Average:      43    0.00    0.00    1.60    0.00    0.00   48.30    0.00    0.00    0.00   50.10
Average:      44    0.00    0.00    1.60    0.00    0.00   37.90    0.00    0.00    0.00   60.50
Average:      45    0.30    0.00    1.40    0.00    0.00   32.90    0.00    0.00    0.00   65.40
Average:      46    0.00    0.00    1.50    0.90    0.00   37.60    0.00    0.00    0.00   60.00
Average:      47    0.10    0.00    0.40    0.00    0.00   41.40    0.00    0.00    0.00   58.10
Average:      48    0.20    0.00    1.70    0.00    0.00   38.20    0.00    0.00    0.00   59.90
Average:      49    0.00    0.00    1.40    0.00    0.00   37.20    0.00    0.00    0.00   61.40
Average:      50    0.00    0.00    1.30    0.00    0.00   38.10    0.00    0.00    0.00   60.60
Average:      51    0.00    0.00    0.80    0.00    0.00   39.40    0.00    0.00    0.00   59.80
Average:      52    0.00    0.00    1.70    0.00    0.00   39.50    0.00    0.00    0.00   58.80
Average:      53    0.10    0.00    0.90    0.00    0.00   38.20    0.00    0.00    0.00   60.80
Average:      54    0.00    0.00    1.30    0.00    0.00   42.10    0.00    0.00    0.00   56.60
Average:      55    0.00    0.00    1.60    0.00    0.00   37.70    0.00    0.00    0.00   60.70


So it looks like PCI Express x16 was not the problem previously either.

>
>
>
>>> This can maybe also explain why cpu load is rising rapidly from
>>> 120Gbit/s in total to 132Gbit (counters of bwm-ng are from /proc/net -
>>> so there can be some error in reading them when offloading (gro/gso/tso)
>>> on the NICs is enabled that is why
>>>
>>>>> Was thinking that maybe I reached some pcie x16 limit - but x16 8GT
>>>>> is 126Gbit - and also when testing with pktgen I can reach more bw
>>>>> and pps (like 4x more compared to normal internet traffic)
>>>>>
>>>> Are you forwarding when using pktgen as well, or are you just testing
>>>> the RX side pps?
>>> Yes, pktgen was tested on a single port RX.
>>> I can also check forwarding to eliminate PCIe limits.
>>>
>> So this explains why you have more RX pps, since tx is idle and pcie
>> will be free to do only rx.
>>
>> [...]
>>
>>
>>>>> ethtool -S enp175s0f1
>>>>> NIC statistics:
>>>>>         rx_packets: 173730800927
>>>>>         rx_bytes: 99827422751332
>>>>>         tx_packets: 142532009512
>>>>>         tx_bytes: 184633045911222
>>>>>         tx_tso_packets: 25989113891
>>>>>         tx_tso_bytes: 132933363384458
>>>>>         tx_tso_inner_packets: 0
>>>>>         tx_tso_inner_bytes: 0
>>>>>         tx_added_vlan_packets: 74630239613
>>>>>         tx_nop: 2029817748
>>>>>         rx_lro_packets: 0
>>>>>         rx_lro_bytes: 0
>>>>>         rx_ecn_mark: 0
>>>>>         rx_removed_vlan_packets: 173730800927
>>>>>         rx_csum_unnecessary: 0
>>>>>         rx_csum_none: 434357
>>>>>         rx_csum_complete: 173730366570
>>>>>         rx_csum_unnecessary_inner: 0
>>>>>         rx_xdp_drop: 0
>>>>>         rx_xdp_redirect: 0
>>>>>         rx_xdp_tx_xmit: 0
>>>>>         rx_xdp_tx_full: 0
>>>>>         rx_xdp_tx_err: 0
>>>>>         rx_xdp_tx_cqe: 0
>>>>>         tx_csum_none: 38260960853
>>>>>         tx_csum_partial: 36369278774
>>>>>         tx_csum_partial_inner: 0
>>>>>         tx_queue_stopped: 1
>>>>>         tx_queue_dropped: 0
>>>>>         tx_xmit_more: 748638099
>>>>>         tx_recover: 0
>>>>>         tx_cqes: 73881645031
>>>>>         tx_queue_wake: 1
>>>>>         tx_udp_seg_rem: 0
>>>>>         tx_cqe_err: 0
>>>>>         tx_xdp_xmit: 0
>>>>>         tx_xdp_full: 0
>>>>>         tx_xdp_err: 0
>>>>>         tx_xdp_cqes: 0
>>>>>         rx_wqe_err: 0
>>>>>         rx_mpwqe_filler_cqes: 0
>>>>>         rx_mpwqe_filler_strides: 0
>>>>>         rx_buff_alloc_err: 0
>>>>>         rx_cqe_compress_blks: 0
>>>>>         rx_cqe_compress_pkts: 0
>>>> If this is a pcie bottleneck it might be useful to  enable CQE
>>>> compression (to reduce PCIe completion descriptors transactions)
>>>> you should see the above rx_cqe_compress_pkts increasing when
>>>> enabled.
>>>>
>>>> $ ethtool  --set-priv-flags enp175s0f1 rx_cqe_compress on
>>>> $ ethtool --show-priv-flags enp175s0f1
>>>> Private flags for p6p1:
>>>> rx_cqe_moder       : on
>>>> cqe_moder          : off
>>>> rx_cqe_compress    : on
>>>> ...
>>>>
>>>> try this on both interfaces.
>>> Done
>>> ethtool --show-priv-flags enp175s0f1
>>> Private flags for enp175s0f1:
>>> rx_cqe_moder       : on
>>> tx_cqe_moder       : off
>>> rx_cqe_compress    : on
>>> rx_striding_rq     : off
>>> rx_no_csum_complete: off
>>>
>>> ethtool --show-priv-flags enp175s0f0
>>> Private flags for enp175s0f0:
>>> rx_cqe_moder       : on
>>> tx_cqe_moder       : off
>>> rx_cqe_compress    : on
>>> rx_striding_rq     : off
>>> rx_no_csum_complete: off
>>>
>> Did it help reduce the load on the pcie? Do you see more pps?
>> What is the ratio between rx_cqe_compress_pkts and overall rx packets?
>>
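
One way to get that ratio from `ethtool -S` output is sketched below in Python. The helper names are made up for illustration, and the sample text is a trimmed copy of the enp175s0f0 counters quoted in this thread (taken before the flag was enabled, hence zero):

```python
# Share of received packets that arrived via compressed CQEs,
# computed from `ethtool -S <iface>` text.

def parse_ethtool_stats(text: str) -> dict[str, int]:
    """Turn 'name: value' lines into a dict, skipping non-numeric lines."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        if sep and value.isdigit():
            stats[name.strip()] = int(value)
    return stats


def cqe_compress_ratio(stats: dict[str, int]) -> float:
    """rx_cqe_compress_pkts / rx_packets, or 0.0 when nothing was received."""
    rx = stats.get("rx_packets", 0)
    return stats.get("rx_cqe_compress_pkts", 0) / rx if rx else 0.0


SAMPLE = """\
NIC statistics:
        rx_packets: 141574897253
        rx_cqe_compress_blks: 0
        rx_cqe_compress_pkts: 0
"""

if __name__ == "__main__":
    ratio = cqe_compress_ratio(parse_ethtool_stats(SAMPLE))
    print(f"compressed share of rx packets: {ratio:.2%}")
```

Because these counters are cumulative, the ratio after enabling the flag is better taken from the difference between two snapshots than from the absolute values.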
>> [...]
>>
>>>>> ethtool -S enp175s0f0
>>>>> NIC statistics:
>>>>>         rx_packets: 141574897253
>>>>>         rx_bytes: 184445040406258
>>>>>         tx_packets: 172569543894
>>>>>         tx_bytes: 99486882076365
>>>>>         tx_tso_packets: 9367664195
>>>>>         tx_tso_bytes: 56435233992948
>>>>>         tx_tso_inner_packets: 0
>>>>>         tx_tso_inner_bytes: 0
>>>>>         tx_added_vlan_packets: 141297671626
>>>>>         tx_nop: 2102916272
>>>>>         rx_lro_packets: 0
>>>>>         rx_lro_bytes: 0
>>>>>         rx_ecn_mark: 0
>>>>>         rx_removed_vlan_packets: 141574897252
>>>>>         rx_csum_unnecessary: 0
>>>>>         rx_csum_none: 23135854
>>>>>         rx_csum_complete: 141551761398
>>>>>         rx_csum_unnecessary_inner: 0
>>>>>         rx_xdp_drop: 0
>>>>>         rx_xdp_redirect: 0
>>>>>         rx_xdp_tx_xmit: 0
>>>>>         rx_xdp_tx_full: 0
>>>>>         rx_xdp_tx_err: 0
>>>>>         rx_xdp_tx_cqe: 0
>>>>>         tx_csum_none: 127934791664
>>>> It is a good idea to look into this: tx is not requesting hw tx
>>>> csumming for a lot of packets. Maybe you are wasting a lot of cpu on
>>>> calculating csum, or maybe this is just the rx csum complete..
>>>>
>>>>>         tx_csum_partial: 13362879974
>>>>>         tx_csum_partial_inner: 0
>>>>>         tx_queue_stopped: 232561
>>>> TX queues are stalling, which could be an indication of the pcie
>>>> bottleneck.
>>>>
>>>>>         tx_queue_dropped: 0
>>>>>         tx_xmit_more: 1266021946
>>>>>         tx_recover: 0
>>>>>         tx_cqes: 140031716469
>>>>>         tx_queue_wake: 232561
>>>>>         tx_udp_seg_rem: 0
>>>>>         tx_cqe_err: 0
>>>>>         tx_xdp_xmit: 0
>>>>>         tx_xdp_full: 0
>>>>>         tx_xdp_err: 0
>>>>>         tx_xdp_cqes: 0
>>>>>         rx_wqe_err: 0
>>>>>         rx_mpwqe_filler_cqes: 0
>>>>>         rx_mpwqe_filler_strides: 0
>>>>>         rx_buff_alloc_err: 0
>>>>>         rx_cqe_compress_blks: 0
>>>>>         rx_cqe_compress_pkts: 0
>>>>>         rx_page_reuse: 0
>>>>>         rx_cache_reuse: 16625975793
>>>>>         rx_cache_full: 54161465914
>>>>>         rx_cache_empty: 258048
>>>>>         rx_cache_busy: 54161472735
>>>>>         rx_cache_waive: 0
>>>>>         rx_congst_umr: 0
>>>>>         rx_arfs_err: 0
>>>>>         ch_events: 40572621887
>>>>>         ch_poll: 40885650979
>>>>>         ch_arm: 40429276692
>>>>>         ch_aff_change: 0
>>>>>         ch_eq_rearm: 0
>>>>>         rx_out_of_buffer: 2791690
>>>>>         rx_if_down_packets: 74
>>>>>         rx_vport_unicast_packets: 141843476308
>>>>>         rx_vport_unicast_bytes: 185421265403318
>>>>>         tx_vport_unicast_packets: 172569484005
>>>>>         tx_vport_unicast_bytes: 100019940094298
>>>>>         rx_vport_multicast_packets: 85122935
>>>>>         rx_vport_multicast_bytes: 5761316431
>>>>>         tx_vport_multicast_packets: 6452
>>>>>         tx_vport_multicast_bytes: 643540
>>>>>         rx_vport_broadcast_packets: 22423624
>>>>>         rx_vport_broadcast_bytes: 1390127090
>>>>>         tx_vport_broadcast_packets: 22024
>>>>>         tx_vport_broadcast_bytes: 1321440
>>>>>         rx_vport_rdma_unicast_packets: 0
>>>>>         rx_vport_rdma_unicast_bytes: 0
>>>>>         tx_vport_rdma_unicast_packets: 0
>>>>>         tx_vport_rdma_unicast_bytes: 0
>>>>>         rx_vport_rdma_multicast_packets: 0
>>>>>         rx_vport_rdma_multicast_bytes: 0
>>>>>         tx_vport_rdma_multicast_packets: 0
>>>>>         tx_vport_rdma_multicast_bytes: 0
>>>>>         tx_packets_phy: 172569501577
>>>>>         rx_packets_phy: 142871314588
>>>>>         rx_crc_errors_phy: 0
>>>>>         tx_bytes_phy: 100710212814151
>>>>>         rx_bytes_phy: 187209224289564
>>>>>         tx_multicast_phy: 6452
>>>>>         tx_broadcast_phy: 22024
>>>>>         rx_multicast_phy: 85122933
>>>>>         rx_broadcast_phy: 22423623
>>>>>         rx_in_range_len_errors_phy: 2
>>>>>         rx_out_of_range_len_phy: 0
>>>>>         rx_oversize_pkts_phy: 0
>>>>>         rx_symbol_err_phy: 0
>>>>>         tx_mac_control_phy: 0
>>>>>         rx_mac_control_phy: 0
>>>>>         rx_unsupported_op_phy: 0
>>>>>         rx_pause_ctrl_phy: 0
>>>>>         tx_pause_ctrl_phy: 0
>>>>>         rx_discards_phy: 920161423
>>>> Ok, this port seems to be suffering more, RX is congested, maybe due to
>>>> the pcie bottleneck.
>>> Yes this side is receiving more traffic - second port is +10G more tx
>>>
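
To put the counters discussed above into proportion, here is a small Python sketch that turns them into ratios. It assumes every transmitted skb is counted in exactly one of tx_csum_none / tx_csum_partial (an assumption about the driver's accounting, not something stated in this thread); all values are copied from the two `ethtool -S` dumps:

```python
# Counter ratios for the two ports, using the cumulative values quoted above.
# Assumption: tx_csum_none + tx_csum_partial approximates the number of skbs sent.

def share(part: int, whole: int) -> float:
    """part / whole, or 0.0 for an empty denominator."""
    return part / whole if whole else 0.0


PORTS = {
    "enp175s0f0": {"tx_csum_none": 127934791664, "tx_csum_partial": 13362879974,
                   "tx_queue_stopped": 232561},
    "enp175s0f1": {"tx_csum_none": 38260960853, "tx_csum_partial": 36369278774,
                   "tx_queue_stopped": 1},
}


def csum_none_share(port: str) -> float:
    """Fraction of skbs sent without a hardware checksum request."""
    c = PORTS[port]
    return share(c["tx_csum_none"], c["tx_csum_none"] + c["tx_csum_partial"])


# enp175s0f0 only: discards at the physical port relative to packets seen there.
F0_DISCARD_SHARE = share(920161423, 142871314588)

if __name__ == "__main__":
    for port in PORTS:
        print(f"{port}: tx without hw csum request {csum_none_share(port):.1%}, "
              f"queue stops {PORTS[port]['tx_queue_stopped']}")
    print(f"enp175s0f0: rx_discards_phy / rx_packets_phy {F0_DISCARD_SHARE:.2%}")
```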
>> [...]
>>
>>
>>>>> Average:      17    0.00    0.00   16.60    0.00    0.00   52.10    0.00    0.00    0.00   31.30
>>>>> Average:      18    0.00    0.00   13.90    0.00    0.00   61.20    0.00    0.00    0.00   24.90
>>>>> Average:      19    0.00    0.00    9.99    0.00    0.00   70.33    0.00    0.00    0.00   19.68
>>>>> Average:      20    0.00    0.00    9.00    0.00    0.00   73.00    0.00    0.00    0.00   18.00
>>>>> Average:      21    0.00    0.00    8.70    0.00    0.00   73.90    0.00    0.00    0.00   17.40
>>>>> Average:      22    0.00    0.00   15.42    0.00    0.00   58.56    0.00    0.00    0.00   26.03
>>>>> Average:      23    0.00    0.00   10.81    0.00    0.00   71.67    0.00    0.00    0.00   17.52
>>>>> Average:      24    0.00    0.00   10.00    0.00    0.00   71.80    0.00    0.00    0.00   18.20
>>>>> Average:      25    0.00    0.00   11.19    0.00    0.00   71.13    0.00    0.00    0.00   17.68
>>>>> Average:      26    0.00    0.00   11.00    0.00    0.00   70.80    0.00    0.00    0.00   18.20
>>>>> Average:      27    0.00    0.00   10.01    0.00    0.00   69.57    0.00    0.00    0.00   20.42
>>>> The numa cores are not at 100% util, you have around 20% of idle on
>>>> each one.
>>> Yes - no 100% cpu - but the difference between 80% and 100% is like
>>> pushing an additional 1-2Gbit/s
>>>
>> yes, but it doesn't look like the bottleneck is the cpu, although it is
>> close to being one :)..
>>
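
The "around 20% idle" estimate can be checked against the rows for cores 17-27 quoted above. A short Python sketch, with the %idle values (last column of each "Average:" row) copied by hand:

```python
# Mean %idle over the busy cores 17-27, from the mpstat "Average:" rows above.
from statistics import mean

idle_by_core = {
    17: 31.30, 18: 24.90, 19: 19.68, 20: 18.00, 21: 17.40, 22: 26.03,
    23: 17.52, 24: 18.20, 25: 17.68, 26: 18.20, 27: 20.42,
}

mean_idle = mean(list(idle_by_core.values()))
# The core with the least idle time is the one closest to saturation.
busiest_core = min(idle_by_core, key=idle_by_core.get)

if __name__ == "__main__":
    print(f"mean idle: {mean_idle:.1f}%  busiest core: {busiest_core} "
          f"({idle_by_core[busiest_core]:.2f}% idle)")
```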
>>>>> Average:      28    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>> Average:      29    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>> Average:      30    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>> Average:      31    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>> Average:      32    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>> Average:      33    0.00    0.00    3.90    0.00    0.00    0.00    0.00    0.00    0.00   96.10
>>>>> Average:      34    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>> Average:      35    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>> Average:      36    0.10    0.00    0.20    0.00    0.00    0.00    0.00    0.00    0.00   99.70
>>>>> Average:      37    0.20    0.00    0.30    0.00    0.00    0.00    0.00    0.00    0.00   99.50
>>>>> Average:      38    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00  100.00
>>>>> Average:      39    0.00    0.00    2.60    0.00    0.00    0.00    0.00    0.00    0.00   97.40
>>>>> Average:      40    0.00    0.00    0.90    0.00    0.00    0.00    0.00    0.00    0.00   99.10
>>>>> Average:      41    0.10    0.00    0.50    0.00    0.00    0.00    0.00    0.00    0.00   99.40
>>>>> Average:      42    0.00    0.00    9.91    0.00    0.00   70.67    0.00    0.00    0.00   19.42
>>>>> Average:      43    0.00    0.00   15.90    0.00    0.00   57.50    0.00    0.00    0.00   26.60
>>>>> Average:      44    0.00    0.00   12.20    0.00    0.00   66.20    0.00    0.00    0.00   21.60
>>>>> Average:      45    0.00    0.00   12.00    0.00    0.00   67.50    0.00    0.00    0.00   20.50
>>>>> Average:      46    0.00    0.00   12.90    0.00    0.00   65.50    0.00    0.00    0.00   21.60
>>>>> Average:      47    0.00    0.00   14.59    0.00    0.00   60.84    0.00    0.00    0.00   24.58
>>>>> Average:      48    0.00    0.00   13.59    0.00    0.00   61.74    0.00    0.00    0.00   24.68
>>>>> Average:      49    0.00    0.00   18.36    0.00    0.00   53.29    0.00    0.00    0.00   28.34
>>>>> Average:      50    0.00    0.00   15.32    0.00    0.00   58.86    0.00    0.00    0.00   25.83
>>>>> Average:      51    0.00    0.00   17.60    0.00    0.00   55.20    0.00    0.00    0.00   27.20
>>>>> Average:      52    0.00    0.00   15.92    0.00    0.00   56.06    0.00    0.00    0.00   28.03
>>>>> Average:      53    0.00    0.00   13.00    0.00    0.00   62.30    0.00    0.00    0.00   24.70
>>>>> Average:      54    0.00    0.00   13.20    0.00    0.00   61.50    0.00    0.00    0.00   25.30
>>>>> Average:      55    0.00    0.00   14.59    0.00    0.00   58.64    0.00    0.00    0.00   26.77
>>>>>
>>>>>
>>>>> ethtool -k enp175s0f0
>>>>> Features for enp175s0f0:
>>>>> rx-checksumming: on
>>>>> tx-checksumming: on
>>>>>            tx-checksum-ipv4: on
>>>>>            tx-checksum-ip-generic: off [fixed]
>>>>>            tx-checksum-ipv6: on
>>>>>            tx-checksum-fcoe-crc: off [fixed]
>>>>>            tx-checksum-sctp: off [fixed]
>>>>> scatter-gather: on
>>>>>            tx-scatter-gather: on
>>>>>            tx-scatter-gather-fraglist: off [fixed]
>>>>> tcp-segmentation-offload: on
>>>>>            tx-tcp-segmentation: on
>>>>>            tx-tcp-ecn-segmentation: off [fixed]
>>>>>            tx-tcp-mangleid-segmentation: off
>>>>>            tx-tcp6-segmentation: on
>>>>> udp-fragmentation-offload: off
>>>>> generic-segmentation-offload: on
>>>>> generic-receive-offload: on
>>>>> large-receive-offload: off [fixed]
>>>>> rx-vlan-offload: on
>>>>> tx-vlan-offload: on
>>>>> ntuple-filters: off
>>>>> receive-hashing: on
>>>>> highdma: on [fixed]
>>>>> rx-vlan-filter: on
>>>>> vlan-challenged: off [fixed]
>>>>> tx-lockless: off [fixed]
>>>>> netns-local: off [fixed]
>>>>> tx-gso-robust: off [fixed]
>>>>> tx-fcoe-segmentation: off [fixed]
>>>>> tx-gre-segmentation: on
>>>>> tx-gre-csum-segmentation: on
>>>>> tx-ipxip4-segmentation: off [fixed]
>>>>> tx-ipxip6-segmentation: off [fixed]
>>>>> tx-udp_tnl-segmentation: on
>>>>> tx-udp_tnl-csum-segmentation: on
>>>>> tx-gso-partial: on
>>>>> tx-sctp-segmentation: off [fixed]
>>>>> tx-esp-segmentation: off [fixed]
>>>>> tx-udp-segmentation: on
>>>>> fcoe-mtu: off [fixed]
>>>>> tx-nocache-copy: off
>>>>> loopback: off [fixed]
>>>>> rx-fcs: off
>>>>> rx-all: off
>>>>> tx-vlan-stag-hw-insert: on
>>>>> rx-vlan-stag-hw-parse: off [fixed]
>>>>> rx-vlan-stag-filter: on [fixed]
>>>>> l2-fwd-offload: off [fixed]
>>>>> hw-tc-offload: off
>>>>> esp-hw-offload: off [fixed]
>>>>> esp-tx-csum-hw-offload: off [fixed]
>>>>> rx-udp_tunnel-port-offload: on
>>>>> tls-hw-tx-offload: off [fixed]
>>>>> tls-hw-rx-offload: off [fixed]
>>>>> rx-gro-hw: off [fixed]
>>>>> tls-hw-record: off [fixed]
>>>>>
>>>>> ethtool -c enp175s0f0
>>>>> Coalesce parameters for enp175s0f0:
>>>>> Adaptive RX: off  TX: on
>>>>> stats-block-usecs: 0
>>>>> sample-interval: 0
>>>>> pkt-rate-low: 0
>>>>> pkt-rate-high: 0
>>>>> dmac: 32703
>>>>>
>>>>> rx-usecs: 256
>>>>> rx-frames: 128
>>>>> rx-usecs-irq: 0
>>>>> rx-frames-irq: 0
>>>>>
>>>>> tx-usecs: 8
>>>>> tx-frames: 128
>>>>> tx-usecs-irq: 0
>>>>> tx-frames-irq: 0
>>>>>
>>>>> rx-usecs-low: 0
>>>>> rx-frame-low: 0
>>>>> tx-usecs-low: 0
>>>>> tx-frame-low: 0
>>>>>
>>>>> rx-usecs-high: 0
>>>>> rx-frame-high: 0
>>>>> tx-usecs-high: 0
>>>>> tx-frame-high: 0
>>>>>
>>>>> ethtool -g enp175s0f0
>>>>> Ring parameters for enp175s0f0:
>>>>> Pre-set maximums:
>>>>> RX:             8192
>>>>> RX Mini:        0
>>>>> RX Jumbo:       0
>>>>> TX:             8192
>>>>> Current hardware settings:
>>>>> RX:             4096
>>>>> RX Mini:        0
>>>>> RX Jumbo:       0
>>>>> TX:             4096
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>> Also changed the coalesce params a little - the best for this config are:
>>> ethtool -c enp175s0f0
>>> Coalesce parameters for enp175s0f0:
>>> Adaptive RX: off  TX: off
>>> stats-block-usecs: 0
>>> sample-interval: 0
>>> pkt-rate-low: 0
>>> pkt-rate-high: 0
>>> dmac: 32573
>>>
>>> rx-usecs: 40
>>> rx-frames: 128
>>> rx-usecs-irq: 0
>>> rx-frames-irq: 0
>>>
>>> tx-usecs: 8
>>> tx-frames: 8
>>> tx-usecs-irq: 0
>>> tx-frames-irq: 0
>>>
>>> rx-usecs-low: 0
>>> rx-frame-low: 0
>>> tx-usecs-low: 0
>>> tx-frame-low: 0
>>>
>>> rx-usecs-high: 0
>>> rx-frame-high: 0
>>> tx-usecs-high: 0
>>> tx-frame-high: 0
>>>
>>>
>>> Fewer drops on the RX side - and more pps forwarded overall.
>>>
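
As a rough way to read the change from rx-usecs 256 to 40: if the timer, rather than the rx-frames threshold, is what ends each coalescing window, a queue raises at most about 1e6 / rx-usecs interrupts per second. The Python sketch below is a simplified model under that assumption, not the NIC's actual moderation logic, and the 500k pps per-queue rate is a made-up illustration value:

```python
# Simplified interrupt-coalescing model: one interrupt per timer expiry.

def max_irq_per_sec(usecs: int) -> float:
    """Upper bound on timer-driven interrupts per second for one queue."""
    if usecs <= 0:
        raise ValueError("usecs must be positive for a timer-driven bound")
    return 1_000_000 / usecs


def pkts_per_window(pps_per_queue: float, usecs: int) -> float:
    """Packets that accumulate on one queue during a single timer window."""
    return pps_per_queue * usecs / 1_000_000


if __name__ == "__main__":
    hypothetical_pps = 500_000  # illustration only, not measured in this thread
    for usecs in (256, 40):
        print(f"rx-usecs {usecs}: <= {max_irq_per_sec(usecs):.0f} irq/s, "
              f"~{pkts_per_window(hypothetical_pps, usecs):.0f} pkts per window")
```

In this model a shorter window means each interrupt hands fewer packets to the poll loop, so less has to be buffered between interrupts.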
>> How much improvement? Maybe we can improve our adaptive rx coal to be
>> efficient for this workload.
>>
>>
>
>

^ permalink raw reply

* Re: [PATCH v3 bpf-next 4/4] bpftool: support loading flow dissector
From: Jakub Kicinski @ 2018-11-08 19:35 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Stanislav Fomichev, Stanislav Fomichev, netdev, linux-kselftest,
	ast, daniel, shuah, guro, jiong.wang, bhole_prashant_q7,
	john.fastabend, jbenc, treeze.taeung, yhs, osk, sandipan
In-Reply-To: <d61e20cb-ccb1-1502-3119-8f71fb8d3570@netronome.com>

On Thu, 8 Nov 2018 18:21:24 +0000, Quentin Monnet wrote:
> >>> @@ -79,8 +82,11 @@ DESCRIPTION
> >>>   		  contain a dot character ('.'), which is reserved for future
> >>>   		  extensions of *bpffs*.
> >>> -	**bpftool prog load** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
> >>> +	**bpftool prog { load | loadall }** *OBJ* *FILE* [**type** *TYPE*] [**map** {**idx** *IDX* | **name** *NAME*} *MAP*] [**dev** *NAME*]
> >>>   		  Load bpf program from binary *OBJ* and pin as *FILE*.
> >>> +		  **bpftool prog load** will pin only the first bpf program
> >>> +		  from the *OBJ*, **bpftool prog loadall** will pin all maps
> >>> +		  and programs from the *OBJ*.  
> >>
> >> This could be improved regarding maps: with "bpftool prog load" I think we
> >> also load and pin all maps, but your description implies this is only the
> >> case with "loadall"  
> > I don't think we pin any maps with `bpftool prog load`, we certainly load
> > them, but we don't pin any afaict. Can you point me to the code where we
> > pin the maps?
> >   
> 
> My bad. I read "pin" but thought "load". It does not pin them indeed,
> sorry about that.

Right, but I don't see much reason why prog loadall should pin maps.
The reason to pin program(s) is to hold some reference and to be able
to find them.  Since we have the programs pinned we should be able to
find their maps with relative ease.

$ bpftool prog show pinned /sys/fs/bpf/prog
7: cgroup_skb  tag 2a142ef67aaad174  gpl
	loaded_at 2018-11-08T11:02:25-0800  uid 0
	xlated 296B  jited 229B  memlock 4096B  map_ids 6,7

possibly:

$ bpftool -j prog show pinned /sys/fs/bpf/prog | jq '.map_ids[0]'
6
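
The same lookup from a script, as a Python sketch. The helper names are hypothetical, and the sample object is a trimmed stand-in for the JSON output that only carries fields visible in the plain-text output above:

```python
import json
import subprocess


def pinned_prog_info(path: str) -> dict:
    """Run bpftool in JSON mode for one pinned program (needs bpftool and privileges)."""
    out = subprocess.run(["bpftool", "-j", "prog", "show", "pinned", path],
                         capture_output=True, text=True, check=True).stdout
    return json.loads(out)


def map_ids_of(prog: dict) -> list:
    """Map IDs referenced by a program description; empty if it uses no maps."""
    return list(prog.get("map_ids", []))


# Assumption: reduced sample, not a full bpftool record.
SAMPLE = json.loads('{"id": 7, "type": "cgroup_skb", "map_ids": [6, 7]}')

if __name__ == "__main__":
    print(map_ids_of(SAMPLE))
```

On a live system, `map_ids_of(pinned_prog_info("/sys/fs/bpf/prog"))` would replace the sample.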

Moreover, I think program and map names may collide making ELFs
unloadable..

^ permalink raw reply

* Re: [PATCH net-next 2/2] net: phy: realtek: remove boilerplate code from driver configs
From: Heiner Kallweit @ 2018-11-08 19:38 UTC (permalink / raw)
  To: Andrew Lunn; +Cc: Florian Fainelli, David Miller, netdev@vger.kernel.org
In-Reply-To: <20181108183721.GE5259@lunn.ch>

On 08.11.2018 19:37, Andrew Lunn wrote:
>>  	{
>>  		.phy_id         = 0x00008201,
>>  		.name           = "RTL8201CP Ethernet",
>> -		.phy_id_mask    = 0x0000ffff,
>>  		.features       = PHY_BASIC_FEATURES,
>>  		.flags          = PHY_HAS_INTERRUPT,
>>  	}, {
>>  		.phy_id		= 0x001cc816,
>>  		.name		= "RTL8201F Fast Ethernet",
>> -		.phy_id_mask	= 0x001fffff,
> 
> Hi Heiner
> 
> "RTL8201CP Ethernet" has a mask of 0x0000ffff, where as all the others
> use 0x001fffff. Is this correct?
> 
IMO none of the masks is correct. All of them should be 0xffffffff.
Nowadays the 32 bit PHYID  is assembled from (MSB to LSB):
22 bits vendor OUI, 6 bit model number, 4 bit revision number
Just the old 8201 doesn't follow this scheme.

With the current masks, a PHYID 0x12348201 would be recognized as
Realtek 8201 too, which is obviously wrong.
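
The false match is easy to reproduce with the masked comparison that the PHY core uses. A small Python sketch (the function name is made up; the comparison mirrors the `(id & mask) == (dev_id & mask)` test in phy_bus_match()):

```python
def phy_id_matches(drv_id: int, drv_mask: int, dev_id: int) -> bool:
    """Masked comparison: bits cleared in the driver's mask are ignored."""
    return (drv_id & drv_mask) == (dev_id & drv_mask)


RTL8201CP_ID = 0x00008201
UNRELATED_ID = 0x12348201  # the example ID from the mail above

if __name__ == "__main__":
    # Current 16-bit mask: the upper half of the ID is never looked at.
    print("mask 0x0000ffff:", phy_id_matches(RTL8201CP_ID, 0x0000ffff, UNRELATED_ID))
    # Full 32-bit mask: only the exact ID matches.
    print("mask 0xffffffff:", phy_id_matches(RTL8201CP_ID, 0xffffffff, UNRELATED_ID))
```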

>     Andrew
> 

^ permalink raw reply

* [PATCH] net: smsc95xx: Fix MTU range
From: Stefan Wahren @ 2018-11-08 19:38 UTC (permalink / raw)
  To: David S. Miller, Steve Glendinning
  Cc: UNGLinuxDriver, Raghuram Chary J, netdev, linux-usb,
	Stefan Wahren

Commit f77f0aee4da4 ("net: use core MTU range checking in USB NIC
drivers") introduced common MTU handling for usbnet, but it is missing
the necessary changes for smsc95xx. So set the MTU range accordingly.

This patch has been tested on a Raspberry Pi 3.

Fixes: f77f0aee4da4 ("net: use core MTU range checking in USB NIC drivers")
Signed-off-by: Stefan Wahren <stefan.wahren@i2se.com>
---
 drivers/net/usb/smsc95xx.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/usb/smsc95xx.c b/drivers/net/usb/smsc95xx.c
index 262e7a3..5974478 100644
--- a/drivers/net/usb/smsc95xx.c
+++ b/drivers/net/usb/smsc95xx.c
@@ -1321,6 +1321,8 @@ static int smsc95xx_bind(struct usbnet *dev, struct usb_interface *intf)
 	dev->net->ethtool_ops = &smsc95xx_ethtool_ops;
 	dev->net->flags |= IFF_MULTICAST;
 	dev->net->hard_header_len += SMSC95XX_TX_OVERHEAD_CSUM;
+	dev->net->min_mtu = ETH_MIN_MTU;
+	dev->net->max_mtu = ETH_DATA_LEN;
 	dev->hard_mtu = dev->net->mtu + dev->net->hard_header_len;
 
 	pdata->dev = dev;
-- 
2.7.4
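
For readers unfamiliar with the core range check, the effect of the two new assignments can be modelled as follows. This is a Python sketch of the check's logic as I understand it (reject values below min_mtu, and values above max_mtu when max_mtu is non-zero), not the kernel code itself; the constants are the values of ETH_MIN_MTU and ETH_DATA_LEN from if_ether.h:

```python
ETH_MIN_MTU = 68     # value of ETH_MIN_MTU in if_ether.h
ETH_DATA_LEN = 1500  # value of ETH_DATA_LEN in if_ether.h


def mtu_allowed(new_mtu: int, min_mtu: int, max_mtu: int) -> bool:
    """Model of the range check: below min is rejected, above a non-zero max too."""
    if new_mtu < min_mtu:
        return False
    if max_mtu > 0 and new_mtu > max_mtu:
        return False
    return True


if __name__ == "__main__":
    for mtu in (67, 68, 1500, 1501):
        print(mtu, mtu_allowed(mtu, ETH_MIN_MTU, ETH_DATA_LEN))
```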

^ permalink raw reply related

* Re: [PATCH bpf-next v2 02/13] bpf: btf: Add BTF_KIND_FUNC and BTF_KIND_FUNC_PROTO
From: Alexei Starovoitov @ 2018-11-08 19:42 UTC (permalink / raw)
  To: Edward Cree
  Cc: Martin KaFai Lau, Yonghong Song, Alexei Starovoitov,
	Daniel Borkmann, Network Development, Kernel Team
In-Reply-To: <20181108182101.5ncf7epfjydeeteq@ast-mbp.dhcp.thefacebook.com>

On Thu, Nov 8, 2018 at 10:21 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Thu, Nov 08, 2018 at 05:58:56PM +0000, Edward Cree wrote:
> >
> > > Happy to jump on the call to explain it again.
> > > 10:30am pacific time works for me tomorrow.
> > That works for me (that's in ~30 minutes from now if I've converted
> >  correctly.)  Please email me offlist with the phone number to call.
>
> no offlist. public link for anyone to join:
> https://bluejeans.com/867080076/
>
> I have hard cutoff at 11am though.

same link let's continue at 1pm PST.

^ permalink raw reply

* Re: [PATCH net-next 1/2] net: phy: use phy_id_mask value zero for exact match
From: Florian Fainelli @ 2018-11-08 19:44 UTC (permalink / raw)
  To: Heiner Kallweit, Andrew Lunn, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <e69ac41d-7c1a-78dd-06b2-cb7bffab9e80@gmail.com>

On 11/7/18 12:53 PM, Heiner Kallweit wrote:
> A phy_id_mask value zero means every PHYID matches, therefore
> value zero isn't used. So we can safely redefine the semantics
> of value zero to mean "exact match". This allows to avoid some
> boilerplate code in PHY driver configs.

Having recently run into some ethtool quirkiness about how masks are
supposed to be specified between ntuple/nfc, where the meaning of 0 is
either don't care or match, I would rather we stick with the current
behavior where every bit set to 0 is a don't care and bits set to 1 are not.

Maybe we can find a clever way with a macro to specify only the PHY OUI
and compute a suitable mask automatically?
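
One way to picture that idea: derive the mask from how many low-order bits should be ignored. The sketch below is in Python for brevity and assumes the ID layout of 22-bit OUI, 6-bit model and 4-bit revision; the helper and constant names are hypothetical, chosen for illustration only:

```python
MODEL_BITS = 6
REV_BITS = 4


def mask_ignoring(low_bits: int) -> int:
    """32-bit mask with the given number of low-order bits treated as don't-care."""
    return (0xFFFFFFFF << low_bits) & 0xFFFFFFFF


EXACT_MASK = mask_ignoring(0)                       # every bit must match
MODEL_MASK = mask_ignoring(REV_BITS)                # ignore the revision
VENDOR_MASK = mask_ignoring(REV_BITS + MODEL_BITS)  # compare the OUI field only


def split_phy_id(phy_id: int) -> tuple:
    """Split a 32-bit PHY ID into (oui_field, model, revision)."""
    return phy_id >> 10, (phy_id >> REV_BITS) & 0x3F, phy_id & 0xF


if __name__ == "__main__":
    for name, mask in (("exact", EXACT_MASK), ("model", MODEL_MASK),
                       ("vendor", VENDOR_MASK)):
        print(f"{name:6s} 0x{mask:08x}")
    print(split_phy_id(0x001CC816))
```

A driver entry would then state its intent (exact, model or vendor match) and never spell out a mask by hand, while 0 bits keep their current don't-care meaning.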

> 
> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
> ---
>  drivers/net/phy/phy_device.c | 21 +++++++++++++++------
>  include/linux/phy.h          |  2 +-
>  2 files changed, 16 insertions(+), 7 deletions(-)
> 
> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index ab33d1777..d165a2c82 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -483,15 +483,24 @@ static int phy_bus_match(struct device *dev, struct device_driver *drv)
>  			if (!(phydev->c45_ids.devices_in_package & (1 << i)))
>  				continue;
>  
> -			if ((phydrv->phy_id & phydrv->phy_id_mask) ==
> -			    (phydev->c45_ids.device_ids[i] &
> -			     phydrv->phy_id_mask))
> -				return 1;
> +			if (!phydrv->phy_id_mask) {
> +				if (phydrv->phy_id ==
> +				    phydev->c45_ids.device_ids[i])
> +					return 1;
> +			} else {
> +				if ((phydrv->phy_id & phydrv->phy_id_mask) ==
> +				    (phydev->c45_ids.device_ids[i] &
> +				     phydrv->phy_id_mask))
> +					return 1;
> +			}
>  		}
>  		return 0;
>  	} else {
> -		return (phydrv->phy_id & phydrv->phy_id_mask) ==
> -			(phydev->phy_id & phydrv->phy_id_mask);
> +		if (!phydrv->phy_id_mask)
> +			return phydrv->phy_id == phydev->phy_id;
> +		else
> +			return (phydrv->phy_id & phydrv->phy_id_mask) ==
> +			       (phydev->phy_id & phydrv->phy_id_mask);
>  	}
>  }
>  
> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 2090277ea..e30ca2fdd 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
> @@ -500,7 +500,7 @@ struct phy_driver {
>  	struct mdio_driver_common mdiodrv;
>  	u32 phy_id;
>  	char *name;
> -	u32 phy_id_mask;
> +	u32 phy_id_mask; /* value 0 means exact match */
>  	const unsigned long * const features;
>  	u32 flags;
>  	const void *driver_data;
> 


-- 
Florian

^ permalink raw reply

* Re: [PATCH mlx5-next 02/10] IB/mlx5: Avoid hangs due to a missing completion
From: Jason Gunthorpe @ 2018-11-08 19:44 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Artemy Kovalyov,
	Majd Dibbiny, Moni Shoua, Saeed Mahameed, linux-netdev
In-Reply-To: <20181108191017.21891-3-leon@kernel.org>

On Thu, Nov 08, 2018 at 09:10:09PM +0200, Leon Romanovsky wrote:
> From: Moni Shoua <monis@mellanox.com>
> 
> Fix 2 flows that may cause a process to hang on wait_for_completion():
> 
> 1. When callback for create MKEY command returns with bad status
> 2. When callback for create MKEY command is called before executer of
> command calls wait_for_completion()
> 
> The following call trace might be triggered in the above flows:
> 
> INFO: task echo_server:1655 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> echo_server     D ffff880813fb6898     0  1655      1 0x00000004
>  ffff880423f5b880 0000000000000086 ffff880402290fd0 ffff880423f5bfd8
>  ffff880423f5bfd8 ffff880423f5bfd8 ffff880402290fd0 ffff880813fb68a0
>  7fffffffffffffff ffff880813fb6898 ffff880402290fd0 ffff880813fb6898
> Call Trace:
>  [<ffffffff816a94c9>] schedule+0x29/0x70
>  [<ffffffff816a6fd9>] schedule_timeout+0x239/0x2c0
>  [<ffffffffc07309e2>] ? mlx5_cmd_exec_cb+0x22/0x30 [mlx5_core]
>  [<ffffffffc073e697>] ? mlx5_core_create_mkey_cb+0xb7/0x220 [mlx5_core]
>  [<ffffffff811b94b7>] ? mm_drop_all_locks+0xd7/0x110
>  [<ffffffff816a987d>] wait_for_completion+0xfd/0x140
>  [<ffffffff810c4810>] ? wake_up_state+0x20/0x20
>  [<ffffffffc08fd308>] mlx5_mr_cache_alloc+0xa8/0x170 [mlx5_ib]
>  [<ffffffffc0909626>] implicit_mr_alloc+0x36/0x190 [mlx5_ib]
>  [<ffffffffc090a26e>] mlx5_ib_alloc_implicit_mr+0x4e/0xa0 [mlx5_ib]
>  [<ffffffffc08ff2f3>] mlx5_ib_reg_user_mr+0x93/0x6a0 [mlx5_ib]
>  [<ffffffffc0907410>] ? mlx5_ib_exp_query_device+0xab0/0xbc0 [mlx5_ib]
>  [<ffffffffc04998be>] ib_uverbs_exp_reg_mr+0x2fe/0x550 [ib_uverbs]
>  [<ffffffff811edaff>] ? do_huge_pmd_anonymous_page+0x2bf/0x530
>  [<ffffffffc048f6cc>] ib_uverbs_write+0x3ec/0x490 [ib_uverbs]
>  [<ffffffff81200d2d>] vfs_write+0xbd/0x1e0
>  [<ffffffff81201b3f>] SyS_write+0x7f/0xe0
>  [<ffffffff816b4fc9>] system_call_fastpath+0x16/0x1b
> 
> Fixes: 49780d42dfc9 ("IB/mlx5: Expose MR cache for mlx5_ib")
> Signed-off-by: Moni Shoua <monis@mellanox.com>
> Reviewed-by: Majd Dibbiny <majd@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>  drivers/infiniband/hw/mlx5/mlx5_ib.h |  1 +
>  drivers/infiniband/hw/mlx5/mr.c      | 15 ++++++++++++---
>  2 files changed, 13 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> index b651a7a6fde9..cd9335e368bd 100644
> +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
> @@ -644,6 +644,7 @@ struct mlx5_cache_ent {
>  	struct delayed_work	dwork;
>  	int			pending;
>  	struct completion	compl;
> +	atomic_t		do_complete;
>  };
> 
>  struct mlx5_mr_cache {
> diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
> index 9b195d65a13e..259dd49c6874 100644
> +++ b/drivers/infiniband/hw/mlx5/mr.c
> @@ -143,7 +143,7 @@ static void reg_mr_callback(int status, void *context)
>  		kfree(mr);
>  		dev->fill_delay = 1;
>  		mod_timer(&dev->delay_timer, jiffies + HZ);
> -		return;
> +		goto do_complete;
>  	}
> 
>  	mr->mmkey.type = MLX5_MKEY_MR;
> @@ -167,8 +167,13 @@ static void reg_mr_callback(int status, void *context)
>  		pr_err("Error inserting to mkey tree. 0x%x\n", -err);
>  	write_unlock_irqrestore(&table->lock, flags);
> 
> -	if (!completion_done(&ent->compl))
> +do_complete:
> +	spin_lock_irqsave(&ent->lock, flags);
> +	if (atomic_read(&ent->do_complete)) {
>  		complete(&ent->compl);
> +		atomic_dec(&ent->do_complete);
> +	}
> +	spin_unlock_irqrestore(&ent->lock, flags);

Oh, this is quite an ugly way to use completions, I think this has
veered into misusing completion territory.. The completion_done was
never right...

add_keys should accept a flag indicating that this MR has a completor
waiting and should trigger complete() on CB finishing... Can probably
store the flag someplace in the MR.
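
A rough model of that suggestion, written in Python with threading.Event standing in for the completion. The class and field names are hypothetical; the sketch only illustrates "record that a waiter exists, then always signal it from the callback", including on the error path:

```python
import threading


class CacheMR:
    """Hypothetical stand-in for an MR created on behalf of a waiting caller."""

    def __init__(self, has_waiter: bool):
        self.has_waiter = has_waiter   # set by the requester before issuing the command
        self.done = threading.Event()  # plays the role of the completion
        self.status = None


def reg_mr_callback(mr: CacheMR, status: int) -> None:
    """Runs when the async create-mkey command finishes, successfully or not."""
    mr.status = status
    # Success-only bookkeeping would go here; errors skip it but still fall through.
    if mr.has_waiter:
        mr.done.set()


def alloc_and_wait(status: int, timeout: float = 1.0) -> int:
    """Issue the 'command', let the callback run first, then wait for it."""
    mr = CacheMR(has_waiter=True)
    worker = threading.Thread(target=reg_mr_callback, args=(mr, status))
    worker.start()
    worker.join()  # callback has already run before the wait starts
    if not mr.done.wait(timeout):
        raise TimeoutError("callback never signalled completion")
    return mr.status


if __name__ == "__main__":
    print(alloc_and_wait(0), alloc_and_wait(-5))
```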

Jason

^ permalink raw reply

* Re: [PATCH v3 bpf-next 4/4] bpftool: support loading flow dissector
From: Jakub Kicinski @ 2018-11-08 19:45 UTC (permalink / raw)
  To: Quentin Monnet
  Cc: Stanislav Fomichev, netdev, linux-kselftest, ast, daniel, shuah,
	guro, jiong.wang, bhole_prashant_q7, john.fastabend, jbenc,
	treeze.taeung, yhs, osk, sandipan
In-Reply-To: <8c35340e-3ed7-70cd-3123-7cd0fb8824a7@netronome.com>

On Thu, 8 Nov 2018 11:16:43 +0000, Quentin Monnet wrote:
> > -	bpf_program__set_ifindex(prog, ifindex);
> >   	if (attr.prog_type == BPF_PROG_TYPE_UNSPEC) {
> > +		if (!prog) {
> > +			p_err("can not guess program type when loading all programs\n");

No newlines in p_err(), it breaks JSON.
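
In general terms: a raw newline inside a JSON string is not valid JSON, so a message ending in '\n' that gets written into a JSON string without escaping produces output that parsers reject. A small Python demonstration, using the message text from the patch; how bpftool actually builds the string is not modelled here:

```python
import json

MSG = "can not guess program type when loading all programs\n"


def naive_json_error(msg: str) -> str:
    """Embed msg in a JSON object the way an unescaped printf-style writer would."""
    return '{"error": "' + msg + '"}'


def is_valid_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    print("with trailing newline:   ", is_valid_json(naive_json_error(MSG)))
    print("without trailing newline:", is_valid_json(naive_json_error(MSG.rstrip())))
```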

> > +			goto err_close_obj;
> > +		}
> > +
> >   		const char *sec_name = bpf_program__title(prog, false);
> >   
> >   		err = libbpf_prog_type_by_name(sec_name, &attr.prog_type,
> > @@ -936,8 +958,13 @@ static int do_load(int argc, char **argv)
> >   			goto err_close_obj;
> >   		}
> >   	}
> > -	bpf_program__set_type(prog, attr.prog_type);
> > -	bpf_program__set_expected_attach_type(prog, expected_attach_type);
> > +
> > +	bpf_object__for_each_program(pos, obj) {
> > +		bpf_program__set_ifindex(pos, ifindex);
> > +		bpf_program__set_type(pos, attr.prog_type);
> > +		bpf_program__set_expected_attach_type(pos,
> > +						      expected_attach_type);
> > +	}  
> 
> I still believe you can have programs of different types here, and be 
> able to load them. I tried it and managed to have it working fine. If no 
> type is provided from command line we can retrieve types for each 
> program from its section name. If a type is provided on the command 
> line, we can do the same, but I am not sure we should do it, or impose 
> that type for all programs instead.

attr->prog_type is one per object, though.  How do we set that one?

^ permalink raw reply

* Re: [PATCH mlx5-next 00/10] Collection of ODP fixes
From: Jason Gunthorpe @ 2018-11-08 19:45 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Artemy Kovalyov,
	Majd Dibbiny, Moni Shoua, Saeed Mahameed, linux-netdev
In-Reply-To: <20181108191017.21891-1-leon@kernel.org>

On Thu, Nov 08, 2018 at 09:10:07PM +0200, Leon Romanovsky wrote:
> From: Leon Romanovsky <leonro@mellanox.com>
> 
> Hi,
> 
> This is collection of fixes to mlx5_core and mlx5_ib ODP logic.
> There are two main reasons why we decided to forward it to mlx5-next
> and not to rdma-rc or net like our other fixes.
> 
> 1. We have large number of patches exactly in that area that are ready
> for submission and we don't want to create extra merge work for
> maintainers due to that. Among those future series (already ready and
> tested): internal change of mlx5_core pagefaults handling, moving ODP
> code to RDMA, change in EQ mechanism, simplification and moving SRQ QP
> to RDMA, extra fixes to mlx5_ib ODP and ODP SRQ support.
> 
> 2. Most of those fixes are for old bugs and there is no rush to fix them
> right now (in -rc1/-rc2).
> 
> Thanks
> 
> Moni Shoua (10):
>   net/mlx5: Release resource on error flow
>   IB/mlx5: Avoid hangs due to a missing completion
>   net/mlx5: Add interface to hold and release core resources
>   net/mlx5: Enumerate page fault types
>   IB/mlx5: Lock QP during page fault handling
>   net/mlx5: Return success for PAGE_FAULT_RESUME in internal error state
>   net/mlx5: Use multi threaded workqueue for page fault handling
>   IB/mlx5: Call PAGE_FAULT_RESUME command asynchronously
>   net/mlx5: Remove unused function
>   IB/mlx5: Improve ODP debugging messages
> 
>  drivers/infiniband/core/umem_odp.c            |  14 +-
>  drivers/infiniband/hw/mlx5/mlx5_ib.h          |   1 +
>  drivers/infiniband/hw/mlx5/mr.c               |  15 +-
>  drivers/infiniband/hw/mlx5/odp.c              | 158 ++++++++++++++----
>  drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   2 +-
>  drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  21 +--
>  drivers/net/ethernet/mellanox/mlx5/core/qp.c  |  20 ++-

There are a lot of ethernet files here; parts of this should probably go
through the shared branch?

Jason

^ permalink raw reply

* Re: [PATCH v3 bpf-next 4/4] bpftool: support loading flow dissector
From: Jakub Kicinski @ 2018-11-08 19:46 UTC (permalink / raw)
  To: Stanislav Fomichev
  Cc: netdev, linux-kselftest, ast, daniel, shuah, quentin.monnet, guro,
	jiong.wang, bhole_prashant_q7, john.fastabend, jbenc,
	treeze.taeung, yhs, osk, sandipan
In-Reply-To: <20181108053957.205681-5-sdf@google.com>

On Wed,  7 Nov 2018 21:39:57 -0800, Stanislav Fomichev wrote:
> This commit adds support for loading/attaching/detaching flow
> dissector program. The structure of the flow dissector program is
> assumed to be the same as in the selftests:
> 
> * flow_dissector section with the main entry point
> * a bunch of tail call progs
> * a jmp_table map that is populated with the tail call progs

Could you split the loadall changes and the flow_dissector changes into
two separate patches?

^ permalink raw reply

* [PATCH net-next] sfc: use the new __netdev_tx_sent_queue BQL optimisation
From: Edward Cree @ 2018-11-08 19:47 UTC (permalink / raw)
  To: linux-net-drivers, davem; +Cc: netdev

As added in 3e59020abf0f ("net: bql: add __netdev_tx_sent_queue()"), which
 see for performance rationale.

Signed-off-by: Edward Cree <ecree@solarflare.com>
---
 drivers/net/ethernet/sfc/tx.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
index c3ad564ac4c0..22eb059086f7 100644
--- a/drivers/net/ethernet/sfc/tx.c
+++ b/drivers/net/ethernet/sfc/tx.c
@@ -553,13 +553,10 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
 	if (!data_mapped && (efx_tx_map_data(tx_queue, skb, segments)))
 		goto err;
 
-	/* Update BQL */
-	netdev_tx_sent_queue(tx_queue->core_txq, skb_len);
-
 	efx_tx_maybe_stop_queue(tx_queue);
 
 	/* Pass off to hardware */
-	if (!xmit_more || netif_xmit_stopped(tx_queue->core_txq)) {
+	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
 		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
 
 		/* There could be packets left on the partner queue if those

^ permalink raw reply related

* Re: [PATCH mlx5-next 08/10] IB/mlx5: Call PAGE_FAULT_RESUME command asynchronously
From: Jason Gunthorpe @ 2018-11-08 19:49 UTC (permalink / raw)
  To: Leon Romanovsky
  Cc: Doug Ledford, Leon Romanovsky, RDMA mailing list, Artemy Kovalyov,
	Majd Dibbiny, Moni Shoua, Saeed Mahameed, linux-netdev
In-Reply-To: <20181108191017.21891-9-leon@kernel.org>

On Thu, Nov 08, 2018 at 09:10:15PM +0200, Leon Romanovsky wrote:
> From: Moni Shoua <monis@mellanox.com>
> 
> Telling the HCA that page fault handling is done and QP can resume
> its flow is done in the context of the page fault handler. This blocks
> the handling of the next work in queue without a need.
> Call the PAGE_FAULT_RESUME command in an asynchronous manner and free
> the workqueue to pick the next work item for handling. All tasks that
> were executed after PAGE_FAULT_RESUME need to be done now
> in the callback of the asynchronous command mechanism.
> 
> Signed-off-by: Moni Shoua <monis@mellanox.com>
> Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
>  drivers/infiniband/hw/mlx5/odp.c | 110 +++++++++++++++++++++++++------
>  include/linux/mlx5/driver.h      |   3 +
>  2 files changed, 94 insertions(+), 19 deletions(-)
> 
> diff --git a/drivers/infiniband/hw/mlx5/odp.c b/drivers/infiniband/hw/mlx5/odp.c
> index abce55b8b9ba..0c4f469cdd5b 100644
> +++ b/drivers/infiniband/hw/mlx5/odp.c
> @@ -298,20 +298,78 @@ void mlx5_ib_internal_fill_odp_caps(struct mlx5_ib_dev *dev)
>  	return;
>  }
>  
> +struct pfault_resume_cb_ctx {
> +	struct mlx5_ib_dev *dev;
> +	struct mlx5_core_rsc_common *res;
> +	struct mlx5_pagefault *pfault;
> +};
> +
> +static void page_fault_resume_callback(int status, void *context)
> +{
> +	struct pfault_resume_cb_ctx *ctx = context;
> +	struct mlx5_pagefault *pfault = ctx->pfault;
> +
> +	if (status)
> +		mlx5_ib_err(ctx->dev, "Resolve the page fault failed with status %d\n",
> +			    status);
> +
> +	if (ctx->res)
> +		mlx5_core_res_put(ctx->res);
> +	kfree(pfault);
> +	kfree(ctx);
> +}
> +
>  static void mlx5_ib_page_fault_resume(struct mlx5_ib_dev *dev,
> +				      struct mlx5_core_rsc_common *res,
>  				      struct mlx5_pagefault *pfault,
> -				      int error)
> +				      int error,
> +				      bool async)
>  {
> +	int ret = 0;
> +	u32 *out = pfault->out_pf_resume;
> +	u32 *in = pfault->in_pf_resume;
> +	u32 token = pfault->token;
>  	int wq_num = pfault->event_subtype == MLX5_PFAULT_SUBTYPE_WQE ?
> -		     pfault->wqe.wq_num : pfault->token;
> -	int ret = mlx5_core_page_fault_resume(dev->mdev,
> -					      pfault->token,
> -					      wq_num,
> -					      pfault->type,
> -					      error);
> -	if (ret)
> -		mlx5_ib_err(dev, "Failed to resolve the page fault on WQ 0x%x\n",
> -			    wq_num);
> +		pfault->wqe.wq_num : pfault->token;
> +	u8 type = pfault->type;
> +	struct pfault_resume_cb_ctx *ctx = NULL;
> +
> +	if (async)
> +		ctx = kmalloc(sizeof(*ctx), GFP_KERNEL);

Why not allocate this ctx as part of the mlx5_pagefault and avoid
this allocation failure strategy?
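
A minimal userspace sketch of that idea, assuming a reduced stand-in for
struct mlx5_pagefault. The struct layouts, the resume_ctx member name and
the helper names are hypothetical, not the driver's definitions. Embedding
the callback context in the page fault object means the async path has
nothing extra to allocate, and the callback can recover the parent object
with a container_of-style macro:

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the callback context from the patch. */
struct pfault_resume_cb_ctx {
	void *dev;	/* would be struct mlx5_ib_dev * */
	void *res;	/* would be struct mlx5_core_rsc_common * */
};

/* Hypothetical, reduced stand-in for struct mlx5_pagefault. */
struct pagefault {
	unsigned int token;
	struct pfault_resume_cb_ctx resume_ctx;	/* embedded, not kmalloc'ed */
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* One allocation covers both the fault and its callback context. */
static struct pagefault *pagefault_alloc(unsigned int token)
{
	struct pagefault *pf = calloc(1, sizeof(*pf));

	if (pf)
		pf->token = token;
	return pf;
}

/* The callback gets the context pointer and recovers the parent object. */
static unsigned int resume_callback(int status, void *context)
{
	struct pfault_resume_cb_ctx *ctx = context;
	struct pagefault *pf = container_of(ctx, struct pagefault, resume_ctx);
	unsigned int token = pf->token;

	(void)status;
	free(pf);	/* a single free releases fault and context together */
	return token;
}
```

With this layout there is no separate allocation that can fail between
handling the fault and issuing the resume command.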

Jason

^ permalink raw reply

* [Patch net-next] net: move __skb_checksum_complete*() to skbuff.c
From: Cong Wang @ 2018-11-08 19:49 UTC (permalink / raw)
  To: netdev; +Cc: Cong Wang

__skb_checksum_complete_head() and __skb_checksum_complete()
are both declared in skbuff.h, they fit better in skbuff.c
than datagram.c.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
---
 net/core/datagram.c | 43 -------------------------------------------
 net/core/skbuff.c   | 43 +++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 43 insertions(+), 43 deletions(-)

diff --git a/net/core/datagram.c b/net/core/datagram.c
index 57f3a6fcfc1e..07983b90d2bd 100644
--- a/net/core/datagram.c
+++ b/net/core/datagram.c
@@ -728,49 +728,6 @@ static int skb_copy_and_csum_datagram(const struct sk_buff *skb, int offset,
 	return -EFAULT;
 }
 
-__sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
-{
-	__sum16 sum;
-
-	sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
-	if (likely(!sum)) {
-		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
-		    !skb->csum_complete_sw)
-			netdev_rx_csum_fault(skb->dev);
-	}
-	if (!skb_shared(skb))
-		skb->csum_valid = !sum;
-	return sum;
-}
-EXPORT_SYMBOL(__skb_checksum_complete_head);
-
-__sum16 __skb_checksum_complete(struct sk_buff *skb)
-{
-	__wsum csum;
-	__sum16 sum;
-
-	csum = skb_checksum(skb, 0, skb->len, 0);
-
-	/* skb->csum holds pseudo checksum */
-	sum = csum_fold(csum_add(skb->csum, csum));
-	if (likely(!sum)) {
-		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
-		    !skb->csum_complete_sw)
-			netdev_rx_csum_fault(skb->dev);
-	}
-
-	if (!skb_shared(skb)) {
-		/* Save full packet checksum */
-		skb->csum = csum;
-		skb->ip_summed = CHECKSUM_COMPLETE;
-		skb->csum_complete_sw = 1;
-		skb->csum_valid = !sum;
-	}
-
-	return sum;
-}
-EXPORT_SYMBOL(__skb_checksum_complete);
-
 /**
  *	skb_copy_and_csum_datagram_msg - Copy and checksum skb to user iovec.
  *	@skb: skbuff
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index b4ee5c8b928f..7db6c520ed92 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -2559,6 +2559,49 @@ __wsum skb_checksum(const struct sk_buff *skb, int offset,
 }
 EXPORT_SYMBOL(skb_checksum);
 
+__sum16 __skb_checksum_complete_head(struct sk_buff *skb, int len)
+{
+	__sum16 sum;
+
+	sum = csum_fold(skb_checksum(skb, 0, len, skb->csum));
+	if (likely(!sum)) {
+		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
+		    !skb->csum_complete_sw)
+			netdev_rx_csum_fault(skb->dev);
+	}
+	if (!skb_shared(skb))
+		skb->csum_valid = !sum;
+	return sum;
+}
+EXPORT_SYMBOL(__skb_checksum_complete_head);
+
+__sum16 __skb_checksum_complete(struct sk_buff *skb)
+{
+	__wsum csum;
+	__sum16 sum;
+
+	csum = skb_checksum(skb, 0, skb->len, 0);
+
+	/* skb->csum holds pseudo checksum */
+	sum = csum_fold(csum_add(skb->csum, csum));
+	if (likely(!sum)) {
+		if (unlikely(skb->ip_summed == CHECKSUM_COMPLETE) &&
+		    !skb->csum_complete_sw)
+			netdev_rx_csum_fault(skb->dev);
+	}
+
+	if (!skb_shared(skb)) {
+		/* Save full packet checksum */
+		skb->csum = csum;
+		skb->ip_summed = CHECKSUM_COMPLETE;
+		skb->csum_complete_sw = 1;
+		skb->csum_valid = !sum;
+	}
+
+	return sum;
+}
+EXPORT_SYMBOL(__skb_checksum_complete);
+
 /* Both of above in one bottle. */
 
 __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH mlx5-next 00/10] Collection of ODP fixes
From: Saeed Mahameed @ 2018-11-08 19:50 UTC (permalink / raw)
  To: Jason Gunthorpe, leon@kernel.org
  Cc: netdev@vger.kernel.org, Majd Dibbiny, Moni Shoua, Leon Romanovsky,
	linux-rdma@vger.kernel.org, dledford@redhat.com, Artemy Kovalyov
In-Reply-To: <20181108194542.GE5548@mellanox.com>

On Thu, 2018-11-08 at 19:45 +0000, Jason Gunthorpe wrote:
> On Thu, Nov 08, 2018 at 09:10:07PM +0200, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@mellanox.com>
> > 
> > Hi,
> > 
> > This is collection of fixes to mlx5_core and mlx5_ib ODP logic.
> > There are two main reasons why we decided to forward it to mlx5-next
> > and not to rdma-rc or net like our other fixes.
> > 
> > 1. We have large number of patches exactly in that area that are ready
> > for submission and we don't want to create extra merge work for
> > maintainers due to that. Among those future series (already ready and
> > tested): internal change of mlx5_core pagefaults handling, moving ODP
> > code to RDMA, change in EQ mechanism, simplification and moving SRQ QP
> > to RDMA, extra fixes to mlx5_ib ODP and ODP SRQ support.
> > 
> > 2. Most of those fixes are for old bugs and there is no rush to fix them
> > right now (in -rc1/-rc2).
> > 
> > Thanks
> > 
> > Moni Shoua (10):
> >   net/mlx5: Release resource on error flow
> >   IB/mlx5: Avoid hangs due to a missing completion
> >   net/mlx5: Add interface to hold and release core resources
> >   net/mlx5: Enumerate page fault types
> >   IB/mlx5: Lock QP during page fault handling
> >   net/mlx5: Return success for PAGE_FAULT_RESUME in internal error state
> >   net/mlx5: Use multi threaded workqueue for page fault handling
> >   IB/mlx5: Call PAGE_FAULT_RESUME command asynchronously
> >   net/mlx5: Remove unused function
> >   IB/mlx5: Improve ODP debugging messages
> > 
> >  drivers/infiniband/core/umem_odp.c            |  14 +-
> >  drivers/infiniband/hw/mlx5/mlx5_ib.h          |   1 +
> >  drivers/infiniband/hw/mlx5/mr.c               |  15 +-
> >  drivers/infiniband/hw/mlx5/odp.c              | 158 ++++++++++++++----
> >  drivers/net/ethernet/mellanox/mlx5/core/cmd.c |   2 +-
> >  drivers/net/ethernet/mellanox/mlx5/core/eq.c  |  21 +--
> >  drivers/net/ethernet/mellanox/mlx5/core/qp.c  |  20 ++-
> 
> There are a lot of ethernet files here; parts of this should probably go
> through the shared branch?
> 

All of this should go to the shared branch. After this ODP series we
have another series that moves all the ODP logic out of mlx5_core into
mlx5_ib.

> Jason

^ permalink raw reply

* [PATCH] staging: rtl8723bs: Add missing return for cfg80211_rtw_get_station
From: Larry Finger @ 2018-11-09  5:30 UTC (permalink / raw)
  To: gregkh; +Cc: devel, netdev, youling257, linux-wireless, Larry Finger

With Androidx86 8.1, wificond returns "failed to get
nl80211_sta_info_tx_failed" and wificondControl returns "Invalid signal
poll result from wificond". The fix is to OR sinfo->filled with
BIT_ULL(NL80211_STA_INFO_TX_FAILED).

This missing bit is apparently not needed with NetworkManager, but it
does no harm in that case.

Reported-and-Tested-by: youling257 <youling257@gmail.com>
Cc: linux-wireless@vger.kernel.org
Cc: youling257 <youling257@gmail.com>
Signed-off-by: Larry Finger <Larry.Finger@lwfinger.net>
---
 drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c b/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
index af2234798fa8..db553f2e4c0b 100644
--- a/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
+++ b/drivers/staging/rtl8723bs/os_dep/ioctl_cfg80211.c
@@ -1277,7 +1277,7 @@ static int cfg80211_rtw_get_station(struct wiphy *wiphy,
 
 		sinfo->filled |= BIT_ULL(NL80211_STA_INFO_TX_PACKETS);
 		sinfo->tx_packets = psta->sta_stats.tx_pkts;
-
+		sinfo->filled |= BIT_ULL(NL80211_STA_INFO_TX_FAILED);
 	}
 
 	/* for Ad-Hoc/AP mode */
-- 
2.19.1

^ permalink raw reply related

* Re: [Patch net-next] net: move __skb_checksum_complete*() to skbuff.c
From: Stefano Brivio @ 2018-11-08 19:54 UTC (permalink / raw)
  To: Cong Wang; +Cc: netdev
In-Reply-To: <20181108194949.11866-1-xiyou.wangcong@gmail.com>

Hi,

On Thu,  8 Nov 2018 11:49:49 -0800
Cong Wang <xiyou.wangcong@gmail.com> wrote:

> +EXPORT_SYMBOL(__skb_checksum_complete);
> +
>  /* Both of above in one bottle. */

Maybe you should also update/drop this comment now?
 
>  __wsum skb_copy_and_csum_bits(const struct sk_buff *skb, int offset,

-- 
Stefano

^ permalink raw reply

* Re: [Patch net-next] net: move __skb_checksum_complete*() to skbuff.c
From: Cong Wang @ 2018-11-08 20:02 UTC (permalink / raw)
  To: Stefano Brivio; +Cc: Linux Kernel Network Developers
In-Reply-To: <20181108205413.2008a004@redhat.com>

On Thu, Nov 8, 2018 at 11:54 AM Stefano Brivio <sbrivio@redhat.com> wrote:
>
> Hi,
>
> On Thu,  8 Nov 2018 11:49:49 -0800
> Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> > +EXPORT_SYMBOL(__skb_checksum_complete);
> > +
> >  /* Both of above in one bottle. */
>
> Maybe you should also update/drop this comment now?

I have no idea what that comment means. Do you?

It dates back to pre-git history.

Thanks.

^ permalink raw reply

* [PATCH] net: Add trace events for all receive exit points
From: Geneviève Bastien @ 2018-11-08 19:56 UTC (permalink / raw)
  To: davem
  Cc: netdev, Geneviève Bastien, Mathieu Desnoyers, Steven Rostedt,
	Ingo Molnar

Trace events are already present for the receive entry points, to indicate
how the reception entered the stack.

This patch adds the corresponding exit trace events that will bound the
reception such that all events occurring between the entry and the exit
can be considered as part of the reception context. This greatly helps
for dependency and root cause analyses.

Without this, it is impossible to determine whether a sched_wakeup
event following a netif_receive_skb event is the result of the packet
reception or a simple coincidence after further processing by the
thread.

Signed-off-by: Geneviève Bastien <gbastien@versatic.net>
CC: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@redhat.com>
CC: David S. Miller <davem@davemloft.net>
---
 include/trace/events/net.h | 59 ++++++++++++++++++++++++++++++++++++++
 net/core/dev.c             | 30 ++++++++++++++++---
 2 files changed, 85 insertions(+), 4 deletions(-)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
index 00aa72ce0e7c..318307511018 100644
--- a/include/trace/events/net.h
+++ b/include/trace/events/net.h
@@ -117,6 +117,23 @@ DECLARE_EVENT_CLASS(net_dev_template,
 		__get_str(name), __entry->skbaddr, __entry->len)
 )
 
+DECLARE_EVENT_CLASS(net_dev_template_simple,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(void *,	skbaddr)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+	),
+
+	TP_printk("skbaddr=%p", __entry->skbaddr)
+)
+
 DEFINE_EVENT(net_dev_template, net_dev_queue,
 
 	TP_PROTO(struct sk_buff *skb),
@@ -244,6 +261,48 @@ DEFINE_EVENT(net_dev_rx_verbose_template, netif_rx_ni_entry,
 	TP_ARGS(skb)
 );
 
+DEFINE_EVENT(net_dev_template_simple, napi_gro_frags_exit,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_template_simple, napi_gro_receive_exit,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_template_simple, netif_receive_skb_exit,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_template_simple, netif_receive_skb_list_exit,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_template_simple, netif_rx_exit,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
+DEFINE_EVENT(net_dev_template_simple, netif_rx_ni_exit,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb)
+);
+
 #endif /* _TRACE_NET_H */
 
 /* This part must be outside protection */
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ffcbdd55fa9..e670ca27e829 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4520,9 +4520,14 @@ static int netif_rx_internal(struct sk_buff *skb)
 
 int netif_rx(struct sk_buff *skb)
 {
+	int ret;
+
 	trace_netif_rx_entry(skb);
 
-	return netif_rx_internal(skb);
+	ret = netif_rx_internal(skb);
+	trace_netif_rx_exit(skb);
+
+	return ret;
 }
 EXPORT_SYMBOL(netif_rx);
 
@@ -4537,6 +4542,7 @@ int netif_rx_ni(struct sk_buff *skb)
 	if (local_softirq_pending())
 		do_softirq();
 	preempt_enable();
+	trace_netif_rx_ni_exit(skb);
 
 	return err;
 }
@@ -5222,9 +5228,14 @@ static void netif_receive_skb_list_internal(struct list_head *head)
  */
 int netif_receive_skb(struct sk_buff *skb)
 {
+	int ret;
+
 	trace_netif_receive_skb_entry(skb);
 
-	return netif_receive_skb_internal(skb);
+	ret = netif_receive_skb_internal(skb);
+	trace_netif_receive_skb_exit(skb);
+
+	return ret;
 }
 EXPORT_SYMBOL(netif_receive_skb);
 
@@ -5247,6 +5258,8 @@ void netif_receive_skb_list(struct list_head *head)
 	list_for_each_entry(skb, head, list)
 		trace_netif_receive_skb_list_entry(skb);
 	netif_receive_skb_list_internal(head);
+	list_for_each_entry(skb, head, list)
+		trace_netif_receive_skb_list_exit(skb);
 }
 EXPORT_SYMBOL(netif_receive_skb_list);
 
@@ -5634,12 +5647,17 @@ static gro_result_t napi_skb_finish(gro_result_t ret, struct sk_buff *skb)
 
 gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb)
 {
+	gro_result_t ret;
+
 	skb_mark_napi_id(skb, napi);
 	trace_napi_gro_receive_entry(skb);
 
 	skb_gro_reset_offset(skb);
 
-	return napi_skb_finish(dev_gro_receive(napi, skb), skb);
+	ret = napi_skb_finish(dev_gro_receive(napi, skb), skb);
+	trace_napi_gro_receive_exit(skb);
+
+	return ret;
 }
 EXPORT_SYMBOL(napi_gro_receive);
 
@@ -5753,6 +5771,7 @@ static struct sk_buff *napi_frags_skb(struct napi_struct *napi)
 
 gro_result_t napi_gro_frags(struct napi_struct *napi)
 {
+	gro_result_t ret;
 	struct sk_buff *skb = napi_frags_skb(napi);
 
 	if (!skb)
@@ -5760,7 +5779,10 @@ gro_result_t napi_gro_frags(struct napi_struct *napi)
 
 	trace_napi_gro_frags_entry(skb);
 
-	return napi_frags_finish(napi, skb, dev_gro_receive(napi, skb));
+	ret = napi_frags_finish(napi, skb, dev_gro_receive(napi, skb));
+	trace_napi_gro_frags_exit(skb);
+
+	return ret;
 }
 EXPORT_SYMBOL(napi_gro_frags);
 
-- 
2.19.1

^ permalink raw reply related

* Re: [PATCH net-next 1/2] net: phy: use phy_id_mask value zero for exact match
From: Heiner Kallweit @ 2018-11-08 20:06 UTC (permalink / raw)
  To: Florian Fainelli, Andrew Lunn, David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <827c53cd-c9f5-7b1c-84fd-af1a49317fe7@gmail.com>

On 08.11.2018 20:44, Florian Fainelli wrote:
> On 11/7/18 12:53 PM, Heiner Kallweit wrote:
>> A phy_id_mask value zero means every PHYID matches, therefore
>> value zero isn't used. So we can safely redefine the semantics
>> of value zero to mean "exact match". This allows to avoid some
>> boilerplate code in PHY driver configs.
> 
> Having run recently into some ethtool quirkiness about how masks are
> supposed to be specified between ntuple/nfc, where the meaning of 0 is
> either don't care or match, I would rather we stick with the current
> behavior where every bit set to 0 is a don't care and bits set to 1 are not.
> 
I agree that using a mask value 0 as either exact match or don't care
can be confusing. However I think that the advantages here outweigh
this confusion aspect.
- We get a meaningful default in case a driver author forgets to set
  the phy_id_mask.
- If a driver author wants to enforce an exact match, he has to do
  nothing and can rely on the core. This avoids mistakes like in the
  Realtek case where the driver author meant exact match but specified
  something completely different.
- Avoid boilerplate code

> Maybe we can find a clever way with a macro to specify only the PHY OUI
> and compute a suitable mask automatically?
> 
I don't think so. For Realtek each driver is specific even to a model
revision (therefore mask 0xffffffff). Same applies to intel-xway.
In broadcom.c we have masks 0xfffffff0, so for each model, independent
of revision number. There is no general rule.
Also we can't simply check for the first-bit-set to derive a mask.
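
For illustration, the matching rule proposed in the patch can be modelled
in plain C. This is a userspace sketch, not the kernel code:
phy_id_matches() is a hypothetical helper, and any ID values fed to it are
made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the comparison in the quoted patch:
 * a mask of 0 requests an exact match, any other mask compares only
 * the bits that are set in the mask.
 */
static bool phy_id_matches(uint32_t drv_id, uint32_t drv_mask, uint32_t dev_id)
{
	if (!drv_mask)
		return drv_id == dev_id;

	return (drv_id & drv_mask) == (dev_id & drv_mask);
}
```

Under this rule a driver entry with mask 0 behaves like one with mask
0xffffffff, while a mask of 0xfffffff0 still ignores the low four bits,
as in the broadcom.c case.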

>>
>> Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
>> ---
>>  drivers/net/phy/phy_device.c | 21 +++++++++++++++------
>>  include/linux/phy.h          |  2 +-
>>  2 files changed, 16 insertions(+), 7 deletions(-)
>>
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index ab33d1777..d165a2c82 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -483,15 +483,24 @@ static int phy_bus_match(struct device *dev, struct device_driver *drv)
>>  			if (!(phydev->c45_ids.devices_in_package & (1 << i)))
>>  				continue;
>>  
>> -			if ((phydrv->phy_id & phydrv->phy_id_mask) ==
>> -			    (phydev->c45_ids.device_ids[i] &
>> -			     phydrv->phy_id_mask))
>> -				return 1;
>> +			if (!phydrv->phy_id_mask) {
>> +				if (phydrv->phy_id ==
>> +				    phydev->c45_ids.device_ids[i])
>> +					return 1;
>> +			} else {
>> +				if ((phydrv->phy_id & phydrv->phy_id_mask) ==
>> +				    (phydev->c45_ids.device_ids[i] &
>> +				     phydrv->phy_id_mask))
>> +					return 1;
>> +			}
>>  		}
>>  		return 0;
>>  	} else {
>> -		return (phydrv->phy_id & phydrv->phy_id_mask) ==
>> -			(phydev->phy_id & phydrv->phy_id_mask);
>> +		if (!phydrv->phy_id_mask)
>> +			return phydrv->phy_id == phydev->phy_id;
>> +		else
>> +			return (phydrv->phy_id & phydrv->phy_id_mask) ==
>> +			       (phydev->phy_id & phydrv->phy_id_mask);
>>  	}
>>  }
>>  
>> diff --git a/include/linux/phy.h b/include/linux/phy.h
>> index 2090277ea..e30ca2fdd 100644
>> --- a/include/linux/phy.h
>> +++ b/include/linux/phy.h
>> @@ -500,7 +500,7 @@ struct phy_driver {
>>  	struct mdio_driver_common mdiodrv;
>>  	u32 phy_id;
>>  	char *name;
>> -	u32 phy_id_mask;
>> +	u32 phy_id_mask; /* value 0 means exact match */
>>  	const unsigned long * const features;
>>  	u32 flags;
>>  	const void *driver_data;
>>
> 
> 

^ permalink raw reply

* Re: [PATCH net-next] sfc: use the new __netdev_tx_sent_queue BQL optimisation
From: Eric Dumazet @ 2018-11-08 20:13 UTC (permalink / raw)
  To: Edward Cree, linux-net-drivers, davem; +Cc: netdev
In-Reply-To: <31ee2a24-103c-ac0a-9e60-b3204bd61167@solarflare.com>



On 11/08/2018 11:47 AM, Edward Cree wrote:
> As added in 3e59020abf0f ("net: bql: add __netdev_tx_sent_queue()"), which
>  see for performance rationale.
> 
> Signed-off-by: Edward Cree <ecree@solarflare.com>
> ---
>  drivers/net/ethernet/sfc/tx.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/drivers/net/ethernet/sfc/tx.c b/drivers/net/ethernet/sfc/tx.c
> index c3ad564ac4c0..22eb059086f7 100644
> --- a/drivers/net/ethernet/sfc/tx.c
> +++ b/drivers/net/ethernet/sfc/tx.c
> @@ -553,13 +553,10 @@ netdev_tx_t efx_enqueue_skb(struct efx_tx_queue *tx_queue, struct sk_buff *skb)
>  	if (!data_mapped && (efx_tx_map_data(tx_queue, skb, segments)))
>  		goto err;
>  
> -	/* Update BQL */
> -	netdev_tx_sent_queue(tx_queue->core_txq, skb_len);
> -
>  	efx_tx_maybe_stop_queue(tx_queue);
>  
>  	/* Pass off to hardware */
> -	if (!xmit_more || netif_xmit_stopped(tx_queue->core_txq)) {
> +	if (__netdev_tx_sent_queue(tx_queue->core_txq, skb_len, xmit_more)) {
>  		struct efx_tx_queue *txq2 = efx_tx_queue_partner(tx_queue);
>  
>  		/* There could be packets left on the partner queue if those
> 

Reviewed-by: Eric Dumazet <edumazet@google.com>

Thanks !

^ permalink raw reply

* Re: [Patch net-next] net: move __skb_checksum_complete*() to skbuff.c
From: Stefano Brivio @ 2018-11-08 20:14 UTC (permalink / raw)
  To: Cong Wang; +Cc: Linux Kernel Network Developers
In-Reply-To: <CAM_iQpXh03pCAp-t-Zh-etTPRbiBwCSuJLL5LTfr5EK0CT=Qzg@mail.gmail.com>

On Thu, 8 Nov 2018 12:02:00 -0800
Cong Wang <xiyou.wangcong@gmail.com> wrote:

> On Thu, Nov 8, 2018 at 11:54 AM Stefano Brivio <sbrivio@redhat.com> wrote:
> >
> > Hi,
> >
> > On Thu,  8 Nov 2018 11:49:49 -0800
> > Cong Wang <xiyou.wangcong@gmail.com> wrote:
> >  
> > > +EXPORT_SYMBOL(__skb_checksum_complete);
> > > +
> > >  /* Both of above in one bottle. */  
> >
> > Maybe you should also update/drop this comment now?  
> 
> I have no idea what that comment means. Do you?

I think it refers to the fact that skb_copy_and_csum_bits() implements
both skb_checksum() and skb_copy_bits(), that were just above it at
1da177e4c3f4.

Then more stuff was moved in between, and the comment was never updated
or dropped.
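
To illustrate what "both in one bottle" would mean under that reading,
here is a userspace sketch over a flat byte buffer. It is not the skb
code: sum_bytes(), copy_bytes() and copy_and_sum_bytes() are hypothetical
helpers, and the checksum is a plain 16-bit ones' complement sum rather
than the kernel's csum routines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fold a wide accumulator into a 16-bit ones' complement sum. */
static uint16_t fold16(uint64_t acc)
{
	while (acc >> 16)
		acc = (acc & 0xffff) + (acc >> 16);
	return (uint16_t)acc;
}

/* Checksum only: plays the role of skb_checksum() in this sketch. */
static uint16_t sum_bytes(const uint8_t *src, size_t len)
{
	uint64_t acc = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2)
		acc += ((uint32_t)src[i] << 8) | src[i + 1];
	if (i < len)
		acc += (uint32_t)src[i] << 8;	/* pad odd trailing byte */
	return fold16(acc);
}

/* Copy only: plays the role of skb_copy_bits() in this sketch. */
static void copy_bytes(uint8_t *dst, const uint8_t *src, size_t len)
{
	memcpy(dst, src, len);
}

/* Both in one pass: plays the role of skb_copy_and_csum_bits(). */
static uint16_t copy_and_sum_bytes(uint8_t *dst, const uint8_t *src, size_t len)
{
	uint64_t acc = 0;
	size_t i;

	for (i = 0; i + 1 < len; i += 2) {
		dst[i] = src[i];
		dst[i + 1] = src[i + 1];
		acc += ((uint32_t)src[i] << 8) | src[i + 1];
	}
	if (i < len) {
		dst[i] = src[i];
		acc += (uint32_t)src[i] << 8;
	}
	return fold16(acc);
}
```

The combined helper touches each source byte once and yields the same
copy and the same sum as calling the two separate helpers.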

-- 
Stefano

^ permalink raw reply

* RE: [PATCH net-next] dpaa2-eth: Introduce TX congestion management
From: Ioana Ciocoi Radulescu @ 2018-11-08 20:21 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, Ioana Ciornei
In-Reply-To: <20181107.220759.1889877577682317113.davem@davemloft.net>

> -----Original Message-----
> From: David Miller <davem@davemloft.net>
> Sent: Thursday, November 8, 2018 8:08 AM
> To: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
> Cc: netdev@vger.kernel.org; Ioana Ciornei <ioana.ciornei@nxp.com>
> Subject: Re: [PATCH net-next] dpaa2-eth: Introduce TX congestion
> management
> 
> From: Ioana Ciocoi Radulescu <ruxandra.radulescu@nxp.com>
> Date: Wed, 7 Nov 2018 10:31:16 +0000
> 
> > We chose this mechanism over BQL (to which it is conceptually
> > very similar) because a) we can take advantage of the hardware
> > offloading and b) BQL doesn't match well with our driver fastpath
> > (we process ingress (Rx or Tx conf) frames in batches of up to 16,
> > which in certain scenarios confuses the BQL adaptive algorithm,
> > resulting in too low values of the limit and low performance).
> 
> First, this kind of explanation belongs in the commit message.
> 
> Second, you'll have to describe better what BQL, which is the
> ultimate standard mechanism for every single driver in the
> kernel to deal with this issue.
> 
> Are you saying that if 15 TX frames are pending, no TX interrupt
> will arrive at all?

I meant up to 16, not exactly 16.

> 
> There absolutely must be some timeout or similar interrupt that gets
> sent in that kind of situation.  You cannot leave stale TX packets
> on your ring unprocessed just because a non-multiple of 16 packets
> were queued up and then TX activity stopped.

I'll try to detail a bit how our hardware works, since it's not the standard
ring-based architecture.

In our driver implementation, we have multiple ingress queues (for Rx frames
and Tx confirmation frames) grouped into core-affine channels. Each channel
may contain one or more queues of each type.

When queues inside a channel have available frames, the hardware triggers
a notification for us; we then issue a dequeue command for that channel,
in response to which hardware will write a number of frames (between 1 and
16) at a memory location of our choice. Frames obtained as a result of one
dequeue operation are all from the same queue, but consecutive dequeues
on the same channel may service different queues.

So my initial attempt at implementing BQL called netdev_tx_completed_queue()
for each batch of Tx confirmation frames obtained from a single dequeue
operation. I don't fully understand the BQL algorithm yet, but I think it had
a problem with the fact that we never reported as completed more than
16 frames at a time.

Consider a scenario where a DPAA2 platform does IP forwarding from port1
to port2; to keep things simple, let's say we're single core and have only
one Rx and one Tx confirmation queue per interface.

Without BQL, the egress interface can transmit up to 64 frames at a time
(one for each Rx frame processed by the ingress interface in its NAPI poll)
and then process all 64 confirmations (grouped in 4 dequeues) during its
own NAPI cycle.

With BQL support added, the number of consecutive transmitted frames is
never higher than 33; after that, BQL stops the netdev tx queue until
we mark some of those transmits as completed. This results in a performance
drop of over 30%.

Today I tried to further coalesce the confirmation frames such that I call
netdev_tx_completed_queue() only at the end of the NAPI poll, once for each
confirmation queue that was serviced during that NAPI. I need to do more
testing, but so far it performs *almost* on par with the non-BQL driver
version. But it does complicate the fastpath code and feels somewhat
unnatural.
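
The coalescing variant described above can be sketched roughly as
follows. This is a userspace model, not driver code: struct txconf_queue,
account_dequeue(), napi_poll_done() and report_completed() are
hypothetical names, and report_completed() only stands in for a call such
as netdev_tx_completed_queue().

```c
#include <stddef.h>

#define MAX_QUEUES 4	/* hypothetical per-channel queue count */

/* Hypothetical per-queue accumulator, drained at the end of each poll. */
struct txconf_queue {
	unsigned int pending_pkts;
	unsigned int pending_bytes;
	unsigned int reports;		/* how often completion was reported */
	unsigned int reported_pkts;
	unsigned int reported_bytes;
};

/* Stand-in for netdev_tx_completed_queue(); it only records the call. */
static void report_completed(struct txconf_queue *q,
			     unsigned int pkts, unsigned int bytes)
{
	q->reports++;
	q->reported_pkts += pkts;
	q->reported_bytes += bytes;
}

/* Called once per dequeue (1 to 16 frames, all from the same queue). */
static void account_dequeue(struct txconf_queue *q,
			    unsigned int pkts, unsigned int bytes)
{
	q->pending_pkts += pkts;
	q->pending_bytes += bytes;
}

/* Called at the end of the poll: one report per queue that saw frames. */
static void napi_poll_done(struct txconf_queue *queues, size_t nqueues)
{
	size_t i;

	for (i = 0; i < nqueues; i++) {
		struct txconf_queue *q = &queues[i];

		if (!q->pending_pkts)
			continue;
		report_completed(q, q->pending_pkts, q->pending_bytes);
		q->pending_pkts = 0;
		q->pending_bytes = 0;
	}
}
```

In the 64-frame forwarding scenario, four dequeues of 16 frames would
then produce one completion report of 64 frames instead of four reports
of 16.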

So I would still prefer using the hardware mechanism, which I think fits
more smoothly with both our hardware and driver implementation, and adds
less overhead. But if you feel strongly about it I can drop this patch and
polish the BQL support until it looks decent enough (and hopefully doesn't
mess too much with our performance benchmarks).

Thanks,
Ioana

^ permalink raw reply

