* [net-next 00/15][pull request] Intel Wired LAN Driver Updates 2019-08-27
From: Jeff Kirsher @ 2019-08-28 6:43 UTC
To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann
This series contains a variety of cold and hot savoury changes to Intel
drivers. Some of the fixes could be considered for stable even though
the author did not request it.
YueHaibing removes a function that has no caller in the iavf driver, as
reported by Hulk Robot.
Radoslaw fixes an issue in ixgbevf where the VM shows no link after the
hypervisor is restored from a low-power state, because the driver did
not re-enable device features that had been disabled during suspend.
Kai-Heng Feng modifies e1000e to use mod_delayed_work() to help resolve
a hot plug speed detection issue by adding a deterministic 1 second
delay before running the watchdog task after an interrupt.
Sasha moves functions around in igc to avoid forward declarations, which
are not necessary for these static functions. He also adds a check
during driver probe to validate the NVM checksum, cleans up defines
that were not being used in the igc driver, and adds support for IP
generic transmit checksum offload.
Jeff updates the iavf kernel documentation.
Jake provides another fm10k update to a local variable for ease of code
readability.
Mitch fixes the iavf driver to allow the VF to override the MAC address
set by the host, if the VF is in "trusted" mode.
Mauro S. M. Rodrigues provides several changes for the i40e driver, first
fixing a hw_dbg usage that referenced an i40e_hw attribute not available
in the function scope. He also implements the hw_dbg debug macro using
pr_debug, since the use of netdev_dbg could cause a NULL pointer
dereference during probe. Finally, he removes code that is no longer
used or needed.
Firo Yang provides a change in the ixgbe driver to ensure we sync the
first fragment unconditionally, to help resolve an issue seen in the Xen
environment in which the upper network stack could receive an
incomplete network packet.
Mariusz adds a missing device to the i40e PCI table in the driver.
The following are changes since commit 68aaf4459556b1f9370c259fd486aecad2257552:
Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
and are available in the git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue 10GbE
Firo Yang (1):
ixgbe: sync the first fragment unconditionally
Jacob Keller (1):
fm10k: use a local variable for the frag pointer
Jeff Kirsher (1):
Documentation: iavf: Update the Intel LAN driver doc for iavf
Kai-Heng Feng (1):
e1000e: Make speed detection on hotplugging cable more reliable
Mariusz Stachura (1):
i40e: Add support for X710 device
Mauro S. M. Rodrigues (3):
i40e: fix hw_dbg usage in i40e_hmc_get_object_va
i40e: Implement debug macro hw_dbg using pr_debug
i40e: Remove EMPR traces from debugfs facility
Mitch Williams (1):
iavf: allow permanent MAC address to change
Radoslaw Tyl (1):
ixgbevf: Link lost in VM on ixgbevf when restoring from freeze or
suspend
Sasha Neftin (4):
igc: Remove useless forward declaration
igc: Add NVM checksum validation
igc: Remove unneeded PCI bus defines
igc: Add tx_csum offload functionality
YueHaibing (1):
iavf: remove unused debug function iavf_debug_d
.../networking/device_drivers/intel/iavf.rst | 115 ++++++++---
drivers/net/ethernet/intel/e1000e/netdev.c | 12 +-
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 8 +-
drivers/net/ethernet/intel/i40e/i40e.h | 1 -
drivers/net/ethernet/intel/i40e/i40e_common.c | 1 +
.../net/ethernet/intel/i40e/i40e_debugfs.c | 4 -
drivers/net/ethernet/intel/i40e/i40e_hmc.c | 1 +
.../net/ethernet/intel/i40e/i40e_lan_hmc.c | 14 +-
drivers/net/ethernet/intel/i40e/i40e_main.c | 1 +
drivers/net/ethernet/intel/i40e/i40e_osdep.h | 7 +-
drivers/net/ethernet/intel/iavf/iavf.h | 1 -
drivers/net/ethernet/intel/iavf/iavf_main.c | 26 ---
drivers/net/ethernet/intel/igc/igc.h | 4 +
drivers/net/ethernet/intel/igc/igc_base.h | 8 +
drivers/net/ethernet/intel/igc/igc_defines.h | 9 +-
drivers/net/ethernet/intel/igc/igc_mac.c | 73 ++++---
drivers/net/ethernet/intel/igc/igc_main.c | 106 ++++++++++
drivers/net/ethernet/intel/igc/igc_phy.c | 192 +++++++++---------
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 16 +-
.../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 +
20 files changed, 372 insertions(+), 228 deletions(-)
--
2.21.0
* [net-next 02/15] ixgbevf: Link lost in VM on ixgbevf when restoring from freeze or suspend
From: Jeff Kirsher @ 2019-08-28 6:43 UTC
To: davem; +Cc: Radoslaw Tyl, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Radoslaw Tyl <radoslawx.tyl@intel.com>
This patch fixes an issue where the VM shows no link when the hypervisor
is restored from a low-power state. The driver is responsible for
re-enabling any features of the device that had been disabled during
suspend calls, such as IRQs and bus mastering.
Signed-off-by: Radoslaw Tyl <radoslawx.tyl@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
index 8c011d4ce7a9..75e849a64db7 100644
--- a/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
+++ b/drivers/net/ethernet/intel/ixgbevf/ixgbevf_main.c
@@ -2517,6 +2517,7 @@ void ixgbevf_reinit_locked(struct ixgbevf_adapter *adapter)
msleep(1);
ixgbevf_down(adapter);
+ pci_set_master(adapter->pdev);
ixgbevf_up(adapter);
clear_bit(__IXGBEVF_RESETTING, &adapter->state);
--
2.21.0
* [net-next 03/15] e1000e: Make speed detection on hotplugging cable more reliable
From: Jeff Kirsher @ 2019-08-28 6:43 UTC
To: davem; +Cc: Kai-Heng Feng, netdev, nhorman, sassmann, Aaron Brown,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Kai-Heng Feng <kai.heng.feng@canonical.com>
After hot plugging a 1Gbps Ethernet cable with a 1Gbps link partner, the
MII_BMSR may report 10Mbps, rendering the network rather slow.
The issue has much lower fail rate after commit 59653e6497d1 ("e1000e:
Make watchdog use delayed work"), which essentially introduces some
delay before running the watchdog task.
But there's still a chance that the hot plugging event and the queued
watchdog task run at the same time, in which case the original issue
can be observed once again.
So let's use mod_delayed_work() to add a deterministic 1 second delay
before running the watchdog task after an interrupt.
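For readers less familiar with the two workqueue calls, here is a minimal userspace sketch of the difference this patch relies on. It is not kernel code and not part of the patch: struct toy_delayed_work and both toy_* helpers are hypothetical stand-ins that only model whether a timer is pending and when it would fire.

```c
#include <stdbool.h>

/* Toy model of a delayed work item (hypothetical, userspace only). */
struct toy_delayed_work {
	bool pending;		/* is a timer currently armed? */
	unsigned long expires;	/* absolute expiry time, in ticks */
};

/* Models queue_delayed_work(): does nothing if the work is already
 * pending, so an earlier expiry time stays in effect.
 * Returns true only if the work was newly queued. */
bool toy_queue_delayed_work(struct toy_delayed_work *w,
			    unsigned long now, unsigned long delay)
{
	if (w->pending)
		return false;
	w->pending = true;
	w->expires = now + delay;
	return true;
}

/* Models mod_delayed_work(): always re-arms the timer relative to
 * "now". Returns true if a pending timer was modified, false if the
 * work was idle and has just been queued. */
bool toy_mod_delayed_work(struct toy_delayed_work *w,
			  unsigned long now, unsigned long delay)
{
	bool was_pending = w->pending;

	w->pending = true;
	w->expires = now + delay;
	return was_pending;
}
```

In this model, an interrupt that arrives while the periodic watchdog is already pending leaves the old expiry untouched with the queue variant, whereas the mod variant always moves it to a full delay after the interrupt.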
Signed-off-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/e1000e/netdev.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/intel/e1000e/netdev.c b/drivers/net/ethernet/intel/e1000e/netdev.c
index 8a3f035c3a5f..d7d56e42a6aa 100644
--- a/drivers/net/ethernet/intel/e1000e/netdev.c
+++ b/drivers/net/ethernet/intel/e1000e/netdev.c
@@ -1780,8 +1780,8 @@ static irqreturn_t e1000_intr_msi(int __always_unused irq, void *data)
}
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
- queue_delayed_work(adapter->e1000_workqueue,
- &adapter->watchdog_task, 1);
+ mod_delayed_work(adapter->e1000_workqueue,
+ &adapter->watchdog_task, HZ);
}
/* Reset on uncorrectable ECC error */
@@ -1861,8 +1861,8 @@ static irqreturn_t e1000_intr(int __always_unused irq, void *data)
}
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
- queue_delayed_work(adapter->e1000_workqueue,
- &adapter->watchdog_task, 1);
+ mod_delayed_work(adapter->e1000_workqueue,
+ &adapter->watchdog_task, HZ);
}
/* Reset on uncorrectable ECC error */
@@ -1907,8 +1907,8 @@ static irqreturn_t e1000_msix_other(int __always_unused irq, void *data)
hw->mac.get_link_status = true;
/* guard against interrupt when we're going down */
if (!test_bit(__E1000_DOWN, &adapter->state))
- queue_delayed_work(adapter->e1000_workqueue,
- &adapter->watchdog_task, 1);
+ mod_delayed_work(adapter->e1000_workqueue,
+ &adapter->watchdog_task, HZ);
}
if (!test_bit(__E1000_DOWN, &adapter->state))
--
2.21.0
* [net-next 01/15] iavf: remove unused debug function iavf_debug_d
From: Jeff Kirsher @ 2019-08-28 6:43 UTC
To: davem
Cc: YueHaibing, netdev, nhorman, sassmann, Hulk Robot, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: YueHaibing <yuehaibing@huawei.com>
There is no caller of function iavf_debug_d() in tree since
commit 75051ce4c5d8 ("iavf: Fix up debug print macro"),
so it can be removed.
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/iavf/iavf_main.c | 22 ---------------------
1 file changed, 22 deletions(-)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 9d2b50964a08..554aa619ff02 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -142,28 +142,6 @@ enum iavf_status iavf_free_virt_mem_d(struct iavf_hw *hw,
return 0;
}
-/**
- * iavf_debug_d - OS dependent version of debug printing
- * @hw: pointer to the HW structure
- * @mask: debug level mask
- * @fmt_str: printf-type format description
- **/
-void iavf_debug_d(void *hw, u32 mask, char *fmt_str, ...)
-{
- char buf[512];
- va_list argptr;
-
- if (!(mask & ((struct iavf_hw *)hw)->debug_mask))
- return;
-
- va_start(argptr, fmt_str);
- vsnprintf(buf, sizeof(buf), fmt_str, argptr);
- va_end(argptr);
-
- /* the debug string is already formatted with a newline */
- pr_info("%s", buf);
-}
-
/**
* iavf_schedule_reset - Set the flags and schedule a reset event
* @adapter: board private structure
--
2.21.0
* [net-next 11/15] i40e: Implement debug macro hw_dbg using pr_debug
From: Jeff Kirsher @ 2019-08-28 6:44 UTC
To: davem
Cc: Mauro S. M. Rodrigues, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: "Mauro S. M. Rodrigues" <maurosr@linux.vnet.ibm.com>
There are several uses of hw_dbg in the code, producing no output. This
patch implements it using pr_debug.
Initially the intention was to implement it using netdev_dbg, analogously
to what is done in ixgbe for instance. That approach was avoided due to
some early usages of hw_dbg, like i40e_pf_reset, before the VSI structure
initialization causing NULL pointer dereference during the driver probe if
the dbg messages were turned on as soon as the module is probed.
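As a rough userspace illustration of the macro shape (not the driver code itself), the sketch below builds a prefixed debug message from fields reachable through the hw pointer alone, so no net device has to exist yet. struct dbg_demo_hw, dbg_demo_log and dbg_demo_hw_dbg are hypothetical names, snprintf into a buffer stands in for pr_debug, and ##__VA_ARGS__ is a GNU extension accepted by gcc and clang.

```c
#include <stdio.h>

/* Hypothetical stand-ins for the bus fields the macro reads. */
struct dbg_demo_bus {
	unsigned int domain, bus_id, device, func;
};

struct dbg_demo_hw {
	struct dbg_demo_bus bus;
};

/* Stands in for the kernel log so the result can be inspected. */
char dbg_demo_log[128];

/* do { } while (0) makes the macro behave like a single statement;
 * the PCI-style prefix is glued onto the caller's format string. */
#define dbg_demo_hw_dbg(hw, fmt, ...)					\
do {									\
	snprintf(dbg_demo_log, sizeof(dbg_demo_log),			\
		 "i40e %04x:%02x:%02x.%x " fmt,				\
		 (hw)->bus.domain, (hw)->bus.bus_id,			\
		 (hw)->bus.device, (hw)->bus.func, ##__VA_ARGS__);	\
} while (0)

/* Example caller: only the hw pointer is needed to log. */
void dbg_demo_report_reset(const struct dbg_demo_hw *hw, int err)
{
	dbg_demo_hw_dbg(hw, "PF reset failed: %d\n", err);
}
```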
Signed-off-by: "Mauro S. M. Rodrigues" <maurosr@linux.vnet.ibm.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_common.c | 1 +
drivers/net/ethernet/intel/i40e/i40e_hmc.c | 1 +
drivers/net/ethernet/intel/i40e/i40e_osdep.h | 7 ++++++-
3 files changed, 8 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_common.c b/drivers/net/ethernet/intel/i40e/i40e_common.c
index 46e649c09f72..d37c6e0e5f08 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_common.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_common.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright(c) 2013 - 2018 Intel Corporation. */
+#include "i40e.h"
#include "i40e_type.h"
#include "i40e_adminq.h"
#include "i40e_prototype.h"
diff --git a/drivers/net/ethernet/intel/i40e/i40e_hmc.c b/drivers/net/ethernet/intel/i40e/i40e_hmc.c
index 19ce93d7fd0a..163ee8c6311c 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_hmc.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_hmc.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright(c) 2013 - 2018 Intel Corporation. */
+#include "i40e.h"
#include "i40e_osdep.h"
#include "i40e_register.h"
#include "i40e_status.h"
diff --git a/drivers/net/ethernet/intel/i40e/i40e_osdep.h b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
index a07574bff550..c0c9ce3eab23 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_osdep.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_osdep.h
@@ -18,7 +18,12 @@
* actual OS primitives
*/
-#define hw_dbg(hw, S, A...) do {} while (0)
+#define hw_dbg(hw, S, A...) \
+do { \
+ int domain = pci_domain_nr(((struct i40e_pf *)(hw)->back)->pdev->bus); \
+ pr_debug("i40e %04x:%02x:%02x.%x " S, domain, (hw)->bus.bus_id, \
+ (hw)->bus.device, (hw)->bus.func, ## A); \
+} while (0)
#define wr32(a, reg, value) writel((value), ((a)->hw_addr + (reg)))
#define rd32(a, reg) readl((a)->hw_addr + (reg))
--
2.21.0
* [net-next 14/15] igc: Add tx_csum offload functionality
From: Jeff Kirsher @ 2019-08-28 6:44 UTC
To: davem; +Cc: Sasha Neftin, netdev, nhorman, sassmann, Aaron Brown,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Sasha Neftin <sasha.neftin@intel.com>
Add IP generic TX checksum offload functionality.
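To make the descriptor packing in this patch easier to follow, here is a small userspace sketch of how the vlan_macip_lens word is assembled in igc_tx_csum(). The shift and mask values mirror IGC_ADVTXD_MACLEN_SHIFT and IGC_TX_FLAGS_VLAN_MASK from the diff; the function name and its plain integer parameters are hypothetical and replace the skb accessors used in the driver.

```c
#include <stdint.h>

#define CSUM_DEMO_MACLEN_SHIFT	9		/* as IGC_ADVTXD_MACLEN_SHIFT */
#define CSUM_DEMO_VLAN_MASK	0xffff0000u	/* as IGC_TX_FLAGS_VLAN_MASK */

/* Pack the context descriptor word:
 *   low bits   : L3 header length (checksum start minus network offset)
 *   bits 9 up  : MAC header length (the network offset)
 *   bits 16-31 : VLAN information carried in tx_flags */
uint32_t csum_demo_vlan_macip_lens(uint32_t tx_flags,
				   uint32_t network_offset,
				   uint32_t csum_start_offset)
{
	uint32_t v = csum_start_offset - network_offset;

	v |= network_offset << CSUM_DEMO_MACLEN_SHIFT;
	v |= tx_flags & CSUM_DEMO_VLAN_MASK;
	return v;
}
```

For an untagged frame with a 14-byte Ethernet header and a 20-byte IPv4 header, the word works out to 20 | (14 << 9), that is 0x1c14.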
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/igc/igc.h | 4 +
drivers/net/ethernet/intel/igc/igc_base.h | 8 ++
drivers/net/ethernet/intel/igc/igc_defines.h | 5 +
drivers/net/ethernet/intel/igc/igc_main.c | 97 ++++++++++++++++++++
4 files changed, 114 insertions(+)
diff --git a/drivers/net/ethernet/intel/igc/igc.h b/drivers/net/ethernet/intel/igc/igc.h
index 0f5534ce27b0..7e16345d836e 100644
--- a/drivers/net/ethernet/intel/igc/igc.h
+++ b/drivers/net/ethernet/intel/igc/igc.h
@@ -135,6 +135,9 @@ extern char igc_driver_version[];
/* How many Rx Buffers do we bundle into one write to the hardware ? */
#define IGC_RX_BUFFER_WRITE 16 /* Must be power of 2 */
+/* VLAN info */
+#define IGC_TX_FLAGS_VLAN_MASK 0xffff0000
+
/* igc_test_staterr - tests bits within Rx descriptor status and error fields */
static inline __le32 igc_test_staterr(union igc_adv_rx_desc *rx_desc,
const u32 stat_err_bits)
@@ -254,6 +257,7 @@ struct igc_ring {
u16 count; /* number of desc. in the ring */
u8 queue_index; /* logical index of the ring*/
u8 reg_idx; /* physical index of the ring */
+ bool launchtime_enable; /* true if LaunchTime is enabled */
/* everything past this point are written often */
u16 next_to_clean;
diff --git a/drivers/net/ethernet/intel/igc/igc_base.h b/drivers/net/ethernet/intel/igc/igc_base.h
index 58d1109d7f3f..ea627ce52525 100644
--- a/drivers/net/ethernet/intel/igc/igc_base.h
+++ b/drivers/net/ethernet/intel/igc/igc_base.h
@@ -22,6 +22,14 @@ union igc_adv_tx_desc {
} wb;
};
+/* Context descriptors */
+struct igc_adv_tx_context_desc {
+ __le32 vlan_macip_lens;
+ __le32 launch_time;
+ __le32 type_tucmd_mlhl;
+ __le32 mss_l4len_idx;
+};
+
/* Adv Transmit Descriptor Config Masks */
#define IGC_ADVTXD_MAC_TSTAMP 0x00080000 /* IEEE1588 Timestamp packet */
#define IGC_ADVTXD_DTYP_CTXT 0x00200000 /* Advanced Context Descriptor */
diff --git a/drivers/net/ethernet/intel/igc/igc_defines.h b/drivers/net/ethernet/intel/igc/igc_defines.h
index 549134ecd105..f3f2325fe567 100644
--- a/drivers/net/ethernet/intel/igc/igc_defines.h
+++ b/drivers/net/ethernet/intel/igc/igc_defines.h
@@ -397,4 +397,9 @@
#define IGC_VLAPQF_P_VALID(_n) (0x1 << (3 + (_n) * 4))
#define IGC_VLAPQF_QUEUE_MASK 0x03
+#define IGC_ADVTXD_MACLEN_SHIFT 9 /* Adv ctxt desc mac len shift */
+#define IGC_ADVTXD_TUCMD_IPV4 0x00000400 /* IP Packet Type:1=IPv4 */
+#define IGC_ADVTXD_TUCMD_L4T_TCP 0x00000800 /* L4 Packet Type of TCP */
+#define IGC_ADVTXD_TUCMD_L4T_SCTP 0x00001000 /* L4 packet TYPE of SCTP */
+
#endif /* _IGC_DEFINES_H_ */
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 965d1c939f0f..63b62d74f961 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -5,6 +5,11 @@
#include <linux/types.h>
#include <linux/if_vlan.h>
#include <linux/aer.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+#include <linux/ip.h>
+
+#include <net/ipv6.h>
#include "igc.h"
#include "igc_hw.h"
@@ -790,8 +795,96 @@ static int igc_set_mac(struct net_device *netdev, void *p)
return 0;
}
+static void igc_tx_ctxtdesc(struct igc_ring *tx_ring,
+ struct igc_tx_buffer *first,
+ u32 vlan_macip_lens, u32 type_tucmd,
+ u32 mss_l4len_idx)
+{
+ struct igc_adv_tx_context_desc *context_desc;
+ u16 i = tx_ring->next_to_use;
+ struct timespec64 ts;
+
+ context_desc = IGC_TX_CTXTDESC(tx_ring, i);
+
+ i++;
+ tx_ring->next_to_use = (i < tx_ring->count) ? i : 0;
+
+ /* set bits to identify this as an advanced context descriptor */
+ type_tucmd |= IGC_TXD_CMD_DEXT | IGC_ADVTXD_DTYP_CTXT;
+
+ /* For 82575, context index must be unique per ring. */
+ if (test_bit(IGC_RING_FLAG_TX_CTX_IDX, &tx_ring->flags))
+ mss_l4len_idx |= tx_ring->reg_idx << 4;
+
+ context_desc->vlan_macip_lens = cpu_to_le32(vlan_macip_lens);
+ context_desc->type_tucmd_mlhl = cpu_to_le32(type_tucmd);
+ context_desc->mss_l4len_idx = cpu_to_le32(mss_l4len_idx);
+
+ /* We assume there is always a valid Tx time available. Invalid times
+ * should have been handled by the upper layers.
+ */
+ if (tx_ring->launchtime_enable) {
+ ts = ns_to_timespec64(first->skb->tstamp);
+ first->skb->tstamp = 0;
+ context_desc->launch_time = cpu_to_le32(ts.tv_nsec / 32);
+ } else {
+ context_desc->launch_time = 0;
+ }
+}
+
+static inline bool igc_ipv6_csum_is_sctp(struct sk_buff *skb)
+{
+ unsigned int offset = 0;
+
+ ipv6_find_hdr(skb, &offset, IPPROTO_SCTP, NULL, NULL);
+
+ return offset == skb_checksum_start_offset(skb);
+}
+
static void igc_tx_csum(struct igc_ring *tx_ring, struct igc_tx_buffer *first)
{
+ struct sk_buff *skb = first->skb;
+ u32 vlan_macip_lens = 0;
+ u32 type_tucmd = 0;
+
+ if (skb->ip_summed != CHECKSUM_PARTIAL) {
+csum_failed:
+ if (!(first->tx_flags & IGC_TX_FLAGS_VLAN) &&
+ !tx_ring->launchtime_enable)
+ return;
+ goto no_csum;
+ }
+
+ switch (skb->csum_offset) {
+ case offsetof(struct tcphdr, check):
+ type_tucmd = IGC_ADVTXD_TUCMD_L4T_TCP;
+ /* fall through */
+ case offsetof(struct udphdr, check):
+ break;
+ case offsetof(struct sctphdr, checksum):
+ /* validate that this is actually an SCTP request */
+ if ((first->protocol == htons(ETH_P_IP) &&
+ (ip_hdr(skb)->protocol == IPPROTO_SCTP)) ||
+ (first->protocol == htons(ETH_P_IPV6) &&
+ igc_ipv6_csum_is_sctp(skb))) {
+ type_tucmd = IGC_ADVTXD_TUCMD_L4T_SCTP;
+ break;
+ }
+ /* fall through */
+ default:
+ skb_checksum_help(skb);
+ goto csum_failed;
+ }
+
+ /* update TX checksum flag */
+ first->tx_flags |= IGC_TX_FLAGS_CSUM;
+ vlan_macip_lens = skb_checksum_start_offset(skb) -
+ skb_network_offset(skb);
+no_csum:
+ vlan_macip_lens |= skb_network_offset(skb) << IGC_ADVTXD_MACLEN_SHIFT;
+ vlan_macip_lens |= first->tx_flags & IGC_TX_FLAGS_VLAN_MASK;
+
+ igc_tx_ctxtdesc(tx_ring, first, vlan_macip_lens, type_tucmd, 0);
}
static int __igc_maybe_stop_tx(struct igc_ring *tx_ring, const u16 size)
@@ -4116,6 +4209,9 @@ static int igc_probe(struct pci_dev *pdev,
if (err)
goto err_sw_init;
+ /* Add supported features to the features list*/
+ netdev->features |= NETIF_F_HW_CSUM;
+
/* setup the private structure */
err = igc_sw_init(adapter);
if (err)
@@ -4123,6 +4219,7 @@ static int igc_probe(struct pci_dev *pdev,
/* copy netdev features into list of user selectable features */
netdev->hw_features |= NETIF_F_NTUPLE;
+ netdev->hw_features |= netdev->features;
/* MTU range: 68 - 9216 */
netdev->min_mtu = ETH_MIN_MTU;
--
2.21.0
* [net-next 15/15] i40e: Add support for X710 device
From: Jeff Kirsher @ 2019-08-28 6:44 UTC
To: davem
Cc: Mariusz Stachura, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Mariusz Stachura <mariusz.stachura@intel.com>
Add I40E_DEV_ID_10G_BASE_T_BC to i40e_pci_tbl
Signed-off-by: Mariusz Stachura <mariusz.stachura@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_main.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index fdf43d87e983..a71369546c23 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -73,6 +73,7 @@ static const struct pci_device_id i40e_pci_tbl[] = {
{PCI_VDEVICE(INTEL, I40E_DEV_ID_QSFP_C), 0},
{PCI_VDEVICE(INTEL, I40E_DEV_ID_10G_BASE_T), 0},
{PCI_VDEVICE(INTEL, I40E_DEV_ID_10G_BASE_T4), 0},
+ {PCI_VDEVICE(INTEL, I40E_DEV_ID_10G_BASE_T_BC), 0},
{PCI_VDEVICE(INTEL, I40E_DEV_ID_10G_SFP), 0},
{PCI_VDEVICE(INTEL, I40E_DEV_ID_10G_B), 0},
{PCI_VDEVICE(INTEL, I40E_DEV_ID_KX_X722), 0},
--
2.21.0
* [net-next 12/15] i40e: Remove EMPR traces from debugfs facility
From: Jeff Kirsher @ 2019-08-28 6:44 UTC
To: davem
Cc: Mauro S. M. Rodrigues, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: "Mauro S. M. Rodrigues" <maurosr@linux.vnet.ibm.com>
Since commit
'5098850c9b9b ("i40e/i40evf: i40e_register.h updates")'
it is no longer possible to trigger an EMP Reset from debugfs. The
command is still accepted, however, and ends in a bad reset request:
echo empr > /sys/kernel/debug/i40e/0002\:01\:00.1/command
i40e 0002:01:00.1: debugfs: forcing EMPR
i40e 0002:01:00.1: bad reset request 0x00010000
So let's remove this piece of code and show the list of valid commands,
as is already done when any invalid command is issued.
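The effect can be pictured with a tiny userspace sketch of an if/else-if command chain. The enum, the function name and the "globr" command string are assumptions for illustration only; the point is that once the "empr" branch is gone, that input falls through to the same path as any other unknown command.

```c
#include <string.h>

/* Hypothetical command identifiers for the sketch. */
enum dbgfs_demo_cmd {
	DBGFS_DEMO_GLOBR,
	DBGFS_DEMO_READ,
	DBGFS_DEMO_HELP,	/* unknown input: list the valid commands */
};

/* Prefix-match the buffer against known commands, in the same
 * strncmp() style as the debugfs handler. There is deliberately no
 * "empr" branch, so that input ends up in the help path. */
enum dbgfs_demo_cmd dbgfs_demo_parse(const char *cmd_buf)
{
	if (strncmp(cmd_buf, "globr", 5) == 0)
		return DBGFS_DEMO_GLOBR;
	else if (strncmp(cmd_buf, "read", 4) == 0)
		return DBGFS_DEMO_READ;

	return DBGFS_DEMO_HELP;
}
```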
Signed-off-by: "Mauro S. M. Rodrigues" <maurosr@linux.vnet.ibm.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e.h | 1 -
drivers/net/ethernet/intel/i40e/i40e_debugfs.c | 4 ----
2 files changed, 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 3e535d3263b3..f1a1bd324b50 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -131,7 +131,6 @@ enum i40e_state_t {
__I40E_PF_RESET_REQUESTED,
__I40E_CORE_RESET_REQUESTED,
__I40E_GLOBAL_RESET_REQUESTED,
- __I40E_EMP_RESET_REQUESTED,
__I40E_EMP_RESET_INTR_RECEIVED,
__I40E_SUSPENDED,
__I40E_PTP_TX_IN_PROGRESS,
diff --git a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
index 41232898d8ae..99ea543dd245 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_debugfs.c
@@ -1125,10 +1125,6 @@ static ssize_t i40e_dbg_command_write(struct file *filp,
dev_info(&pf->pdev->dev, "debugfs: forcing GlobR\n");
i40e_do_reset_safe(pf, BIT(__I40E_GLOBAL_RESET_REQUESTED));
- } else if (strncmp(cmd_buf, "empr", 4) == 0) {
- dev_info(&pf->pdev->dev, "debugfs: forcing EMPR\n");
- i40e_do_reset_safe(pf, BIT(__I40E_EMP_RESET_REQUESTED));
-
} else if (strncmp(cmd_buf, "read", 4) == 0) {
u32 address;
u32 value;
--
2.21.0
* [net-next 13/15] ixgbe: sync the first fragment unconditionally
From: Jeff Kirsher @ 2019-08-28 6:44 UTC
To: davem
Cc: Firo Yang, netdev, nhorman, sassmann, Alexander Duyck,
Andrew Bowers, Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Firo Yang <firo.yang@suse.com>
In a Xen environment, if Xen-swiotlb is enabled, the ixgbe driver
could possibly allocate a page (a DMA memory buffer) for the first
fragment which is not suitable for Xen-swiotlb to do DMA operations.
Xen-swiotlb has to internally allocate another page for doing DMA
operations. This mechanism requires syncing the data from the internal
page to the page which ixgbe sends to the upper network stack. However,
since commit f3213d932173 ("ixgbe: Update driver to make use of DMA
attributes in Rx path"), the unmap operation is performed with
DMA_ATTR_SKIP_CPU_SYNC. As a result, the sync is not performed.
Since the sync isn't performed, the upper network stack could receive
an incomplete network packet. By incomplete, it means the linear data
on the first fragment (between skb->head and skb->end) is invalid. So
we have to copy the data from the internal Xen-swiotlb page to the page
which ixgbe sends to the upper network stack through the sync operation.
More details from Alexander Duyck:
Specifically since we are mapping the frame with
DMA_ATTR_SKIP_CPU_SYNC we have to unmap with that as well. As a result
a sync is not performed on an unmap and must be done manually as we
skipped it for the first frag. As such we need to always sync before
possibly performing a page unmap operation.
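The ordering problem can be modelled in userspace with a toy bounce buffer. Everything below is a hypothetical sketch, not the DMA API: the "device" writes into bounce, the network stack reads page, and data only moves between them on an explicit CPU sync. The skip_cpu_sync flag plays the role of DMA_ATTR_SKIP_CPU_SYNC.

```c
#include <stdbool.h>
#include <string.h>

#define DMA_DEMO_BUF_SIZE 16

/* Toy swiotlb-style mapping (hypothetical). */
struct dma_demo_mapping {
	char bounce[DMA_DEMO_BUF_SIZE];	/* what the device wrote */
	char page[DMA_DEMO_BUF_SIZE];	/* what the stack will read */
	bool mapped;
};

/* Copy device-written data to the page the CPU reads. */
void dma_demo_sync_for_cpu(struct dma_demo_mapping *m)
{
	memcpy(m->page, m->bounce, DMA_DEMO_BUF_SIZE);
}

/* Tear down the mapping; with skip_cpu_sync set, no copy happens. */
void dma_demo_unmap(struct dma_demo_mapping *m, bool skip_cpu_sync)
{
	if (!skip_cpu_sync)
		dma_demo_sync_for_cpu(m);
	m->mapped = false;
}

/* Ordering before the patch: a released page was only unmapped. */
void dma_demo_sync_frag_old(struct dma_demo_mapping *m, bool page_released)
{
	if (page_released)
		dma_demo_unmap(m, true);
	else
		dma_demo_sync_for_cpu(m);
}

/* Ordering after the patch: always sync, then unmap if released. */
void dma_demo_sync_frag_new(struct dma_demo_mapping *m, bool page_released)
{
	dma_demo_sync_for_cpu(m);
	if (page_released)
		dma_demo_unmap(m, true);
}
```

In this model the old ordering leaves page empty whenever the page was released, which corresponds to the incomplete packet described above; the new ordering always copies first.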
Fixes: f3213d932173 ("ixgbe: Update driver to make use of DMA
attributes in Rx path")
Signed-off-by: Firo Yang <firo.yang@suse.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 16 +++++++++-------
1 file changed, 9 insertions(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 17b7ae9f46ec..f5fc5929a15d 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1825,13 +1825,7 @@ static void ixgbe_pull_tail(struct ixgbe_ring *rx_ring,
static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
struct sk_buff *skb)
{
- /* if the page was released unmap it, else just sync our portion */
- if (unlikely(IXGBE_CB(skb)->page_released)) {
- dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
- ixgbe_rx_pg_size(rx_ring),
- DMA_FROM_DEVICE,
- IXGBE_RX_DMA_ATTR);
- } else if (ring_uses_build_skb(rx_ring)) {
+ if (ring_uses_build_skb(rx_ring)) {
unsigned long offset = (unsigned long)(skb->data) & ~PAGE_MASK;
dma_sync_single_range_for_cpu(rx_ring->dev,
@@ -1848,6 +1842,14 @@ static void ixgbe_dma_sync_frag(struct ixgbe_ring *rx_ring,
skb_frag_size(frag),
DMA_FROM_DEVICE);
}
+
+ /* If the page was released, just unmap it. */
+ if (unlikely(IXGBE_CB(skb)->page_released)) {
+ dma_unmap_page_attrs(rx_ring->dev, IXGBE_CB(skb)->dma,
+ ixgbe_rx_pg_size(rx_ring),
+ DMA_FROM_DEVICE,
+ IXGBE_RX_DMA_ATTR);
+ }
}
/**
--
2.21.0
* [net-next 10/15] i40e: fix hw_dbg usage in i40e_hmc_get_object_va
From: Jeff Kirsher @ 2019-08-28 6:44 UTC
To: davem
Cc: Mauro S. M. Rodrigues, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: "Mauro S. M. Rodrigues" <maurosr@linux.vnet.ibm.com>
The mentioned function references an i40e_hw attribute as a parameter
for hw_dbg, but it doesn't exist in the function scope.
Fix it by changing the parameter from i40e_hmc_info to i40e_hw, from
which the necessary i40e_hmc_info can be retrieved.
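A minimal userspace sketch of the same refactoring pattern: the helper takes the parent structure so that both the embedded info and the parent's own fields (needed for logging) are in scope. All names here are hypothetical and only mirror the shape of the change; snprintf into a buffer stands in for hw_dbg.

```c
#include <stdio.h>

/* Hypothetical stand-ins for i40e_hmc_info and i40e_hw. */
struct hmc_demo_info {
	unsigned int obj_cnt;
};

struct hmc_demo_hw {
	unsigned int func;
	struct hmc_demo_info hmc;
};

/* Stands in for the debug log. */
char hmc_demo_log[64];

/* Taking the parent struct keeps "hw" available for the debug
 * message, while the embedded info is derived locally. */
int hmc_demo_get_object(struct hmc_demo_hw *hw, unsigned int obj_idx)
{
	struct hmc_demo_info *hmc_info = &hw->hmc;

	if (obj_idx >= hmc_info->obj_cnt) {
		snprintf(hmc_demo_log, sizeof(hmc_demo_log),
			 "func %u: invalid object idx %u\n",
			 hw->func, obj_idx);
		return -1;
	}
	return 0;
}
```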
Signed-off-by: "Mauro S. M. Rodrigues" <maurosr@linux.vnet.ibm.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)
diff --git a/drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c b/drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c
index 994011c38fb4..f059de33a0fd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_lan_hmc.c
@@ -1,6 +1,7 @@
// SPDX-License-Identifier: GPL-2.0
/* Copyright(c) 2013 - 2018 Intel Corporation. */
+#include "i40e.h"
#include "i40e_osdep.h"
#include "i40e_register.h"
#include "i40e_type.h"
@@ -963,7 +964,7 @@ static i40e_status i40e_set_hmc_context(u8 *context_bytes,
/**
* i40e_hmc_get_object_va - retrieves an object's virtual address
- * @hmc_info: pointer to i40e_hmc_info struct
+ * @hw: the hardware struct, from which we obtain the i40e_hmc_info pointer
* @object_base: pointer to u64 to get the va
* @rsrc_type: the hmc resource type
* @obj_idx: hmc object index
@@ -972,7 +973,7 @@ static i40e_status i40e_set_hmc_context(u8 *context_bytes,
* base pointer. This function is used for LAN Queue contexts.
**/
static
-i40e_status i40e_hmc_get_object_va(struct i40e_hmc_info *hmc_info,
+i40e_status i40e_hmc_get_object_va(struct i40e_hw *hw,
u8 **object_base,
enum i40e_hmc_lan_rsrc_type rsrc_type,
u32 obj_idx)
@@ -982,6 +983,7 @@ i40e_status i40e_hmc_get_object_va(struct i40e_hmc_info *hmc_info,
struct i40e_hmc_sd_entry *sd_entry;
struct i40e_hmc_pd_entry *pd_entry;
u32 pd_idx, pd_lmt, rel_pd_idx;
+ struct i40e_hmc_info *hmc_info = &hw->hmc;
u64 obj_offset_in_fpm;
u32 sd_idx, sd_lmt;
@@ -1047,7 +1049,7 @@ i40e_status i40e_clear_lan_tx_queue_context(struct i40e_hw *hw,
i40e_status err;
u8 *context_bytes;
- err = i40e_hmc_get_object_va(&hw->hmc, &context_bytes,
+ err = i40e_hmc_get_object_va(hw, &context_bytes,
I40E_HMC_LAN_TX, queue);
if (err < 0)
return err;
@@ -1068,7 +1070,7 @@ i40e_status i40e_set_lan_tx_queue_context(struct i40e_hw *hw,
i40e_status err;
u8 *context_bytes;
- err = i40e_hmc_get_object_va(&hw->hmc, &context_bytes,
+ err = i40e_hmc_get_object_va(hw, &context_bytes,
I40E_HMC_LAN_TX, queue);
if (err < 0)
return err;
@@ -1088,7 +1090,7 @@ i40e_status i40e_clear_lan_rx_queue_context(struct i40e_hw *hw,
i40e_status err;
u8 *context_bytes;
- err = i40e_hmc_get_object_va(&hw->hmc, &context_bytes,
+ err = i40e_hmc_get_object_va(hw, &context_bytes,
I40E_HMC_LAN_RX, queue);
if (err < 0)
return err;
@@ -1109,7 +1111,7 @@ i40e_status i40e_set_lan_rx_queue_context(struct i40e_hw *hw,
i40e_status err;
u8 *context_bytes;
- err = i40e_hmc_get_object_va(&hw->hmc, &context_bytes,
+ err = i40e_hmc_get_object_va(hw, &context_bytes,
I40E_HMC_LAN_RX, queue);
if (err < 0)
return err;
--
2.21.0
* [net-next 09/15] igc: Remove unneeded PCI bus defines
From: Jeff Kirsher @ 2019-08-28 6:44 UTC
To: davem; +Cc: Sasha Neftin, netdev, nhorman, sassmann, Aaron Brown,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Sasha Neftin <sasha.neftin@intel.com>
The PCIe device control 2 defines are not used internally.
This patch removes them.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/igc/igc_defines.h | 4 ----
1 file changed, 4 deletions(-)
diff --git a/drivers/net/ethernet/intel/igc/igc_defines.h b/drivers/net/ethernet/intel/igc/igc_defines.h
index 11b99acf4abe..549134ecd105 100644
--- a/drivers/net/ethernet/intel/igc/igc_defines.h
+++ b/drivers/net/ethernet/intel/igc/igc_defines.h
@@ -10,10 +10,6 @@
#define IGC_CTRL_EXT_DRV_LOAD 0x10000000 /* Drv loaded bit for FW */
-/* PCI Bus Info */
-#define PCIE_DEVICE_CONTROL2 0x28
-#define PCIE_DEVICE_CONTROL2_16ms 0x0005
-
/* Physical Func Reset Done Indication */
#define IGC_CTRL_EXT_LINK_MODE_MASK 0x00C00000
--
2.21.0
* [net-next 06/15] fm10k: use a local variable for the frag pointer
From: Jeff Kirsher @ 2019-08-28 6:43 UTC
To: davem; +Cc: Jacob Keller, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Jacob Keller <jacob.e.keller@intel.com>
In the function fm10k_xmit_frame_ring, we recently switched to using
the skb_frag_size accessor instead of directly using the size member of
the skb fragment.
This made the for loop slightly harder to read because it created a very
long line that is difficult to split up. Avoid this by using a local
variable in the for loop, so that we do not have to break the line on an
open parenthesis.
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/fm10k/fm10k_main.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)
diff --git a/drivers/net/ethernet/intel/fm10k/fm10k_main.c b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
index e0a2be534b20..2be9222510e7 100644
--- a/drivers/net/ethernet/intel/fm10k/fm10k_main.c
+++ b/drivers/net/ethernet/intel/fm10k/fm10k_main.c
@@ -1073,9 +1073,11 @@ netdev_tx_t fm10k_xmit_frame_ring(struct sk_buff *skb,
* + 2 desc gap to keep tail from touching head
* otherwise try next time
*/
- for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
- count += TXD_USE_COUNT(skb_frag_size(
- &skb_shinfo(skb)->frags[f]));
+ for (f = 0; f < skb_shinfo(skb)->nr_frags; f++) {
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[f];
+
+ count += TXD_USE_COUNT(skb_frag_size(frag));
+ }
if (fm10k_maybe_stop_tx(tx_ring, count + 3)) {
tx_ring->tx_stats.tx_busy++;
--
2.21.0
^ permalink raw reply related
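The readability change in the fm10k patch above can be tried outside the kernel. What follows is a minimal userspace sketch, not the driver code: struct demo_skb, struct demo_frag, demo_frag_size() and DEMO_TXD_USE_COUNT() are hypothetical stand-ins for sk_buff, skb_frag_t, skb_frag_size() and TXD_USE_COUNT(), and the 16384-byte per-descriptor limit is an assumption made for illustration.

```c
#include <stddef.h>

/* Hypothetical stand-ins for the kernel definitions; the per-descriptor
 * data limit below is an assumed value, not taken from the patch.
 */
#define DEMO_MAX_DATA_PER_TXD	16384u
#define DEMO_TXD_USE_COUNT(s) \
	(((s) + DEMO_MAX_DATA_PER_TXD - 1) / DEMO_MAX_DATA_PER_TXD)

struct demo_frag {
	unsigned int size;
};

struct demo_skb {
	unsigned int nr_frags;
	struct demo_frag frags[17];
};

/* Accessor in the spirit of skb_frag_size(). */
static unsigned int demo_frag_size(const struct demo_frag *frag)
{
	return frag->size;
}

/* Count the descriptors needed for all fragments. The local "frag"
 * pointer keeps the accumulating line short, so it never has to be
 * broken on an open parenthesis.
 */
static unsigned int demo_count_frag_descs(const struct demo_skb *skb)
{
	unsigned int count = 0;
	unsigned int f;

	for (f = 0; f < skb->nr_frags; f++) {
		const struct demo_frag *frag = &skb->frags[f];

		count += DEMO_TXD_USE_COUNT(demo_frag_size(frag));
	}

	return count;
}
```

The behavior is the same as the single-statement loop; only the line layout differs, which matches the intent stated in the commit message.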
* [net-next 07/15] igc: Add NVM checksum validation
From: Jeff Kirsher @ 2019-08-28 6:43 UTC (permalink / raw)
To: davem; +Cc: Sasha Neftin, netdev, nhorman, sassmann, Aaron Brown,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Sasha Neftin <sasha.neftin@intel.com>
Add NVM checksum validation during probe.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/igc/igc_main.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index 251552855c40..965d1c939f0f 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -4133,6 +4133,15 @@ static int igc_probe(struct pci_dev *pdev,
*/
hw->mac.ops.reset_hw(hw);
+ if (igc_get_flash_presence_i225(hw)) {
+ if (hw->nvm.ops.validate(hw) < 0) {
+ dev_err(&pdev->dev,
+ "The NVM Checksum Is Not Valid\n");
+ err = -EIO;
+ goto err_eeprom;
+ }
+ }
+
if (eth_platform_get_mac_address(&pdev->dev, hw->mac.addr)) {
/* copy the MAC address out of the NVM */
if (hw->mac.ops.read_mac_addr(hw))
--
2.21.0
^ permalink raw reply related
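The patch above only shows the call to hw->nvm.ops.validate(), not what the callback does. As a rough illustration, here is a userspace sketch of the probe-time gate, assuming the scheme commonly used by Intel wired drivers in which the 16-bit sum of NVM words 0x00 through 0x3F must equal 0xBABA. The constants, the demo_ names and the literal error code are assumptions for this sketch, not taken from the igc sources.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout: the checksum word sits at offset 0x3F and the sum of
 * words 0x00..0x3F must equal 0xBABA. Both values are assumptions here.
 */
#define DEMO_NVM_CHECKSUM_WORD	0x3F
#define DEMO_NVM_SUM		0xBABA
#define DEMO_EIO		5	/* stands in for the kernel's EIO */

/* Return true when the 16-bit wrapping sum matches the expected value. */
static bool demo_nvm_checksum_valid(const uint16_t *words, size_t nwords)
{
	uint16_t sum = 0;
	size_t i;

	if (nwords <= DEMO_NVM_CHECKSUM_WORD)
		return false;

	for (i = 0; i <= DEMO_NVM_CHECKSUM_WORD; i++)
		sum = (uint16_t)(sum + words[i]);

	return sum == DEMO_NVM_SUM;
}

/* Mirror the shape of the probe check: validate only when flash is
 * present, and fail the probe with an I/O error on a bad checksum.
 */
static int demo_probe_check(const uint16_t *words, size_t nwords,
			    bool flash_present)
{
	if (flash_present && !demo_nvm_checksum_valid(words, nwords))
		return -DEMO_EIO;

	return 0;
}
```

As in the patch, a device without flash skips validation entirely, so the probe is not failed on parts where there is nothing to check.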
* [net-next 08/15] iavf: allow permanent MAC address to change
From: Jeff Kirsher @ 2019-08-28 6:44 UTC (permalink / raw)
To: davem
Cc: Mitch Williams, netdev, nhorman, sassmann, Andrew Bowers,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Mitch Williams <mitch.a.williams@intel.com>
Allow the VF to override the "permanent" MAC address set by the host.
This allows bonding to work in the case where the administrator has set
the VF MAC.
Note that the VF must still be set to Trusted on the host if this change
is to be accepted by the PF driver.
Signed-off-by: Mitch Williams <mitch.a.williams@intel.com>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/iavf/iavf.h | 1 -
drivers/net/ethernet/intel/iavf/iavf_main.c | 4 ----
2 files changed, 5 deletions(-)
diff --git a/drivers/net/ethernet/intel/iavf/iavf.h b/drivers/net/ethernet/intel/iavf/iavf.h
index 9fc635d816d2..29de3ae96ef2 100644
--- a/drivers/net/ethernet/intel/iavf/iavf.h
+++ b/drivers/net/ethernet/intel/iavf/iavf.h
@@ -253,7 +253,6 @@ struct iavf_adapter {
#define IAVF_FLAG_RESET_PENDING BIT(4)
#define IAVF_FLAG_RESET_NEEDED BIT(5)
#define IAVF_FLAG_WB_ON_ITR_CAPABLE BIT(6)
-#define IAVF_FLAG_ADDR_SET_BY_PF BIT(8)
#define IAVF_FLAG_SERVICE_CLIENT_REQUESTED BIT(9)
#define IAVF_FLAG_CLIENT_NEEDS_OPEN BIT(10)
#define IAVF_FLAG_CLIENT_NEEDS_CLOSE BIT(11)
diff --git a/drivers/net/ethernet/intel/iavf/iavf_main.c b/drivers/net/ethernet/intel/iavf/iavf_main.c
index 554aa619ff02..07f5541a0f01 100644
--- a/drivers/net/ethernet/intel/iavf/iavf_main.c
+++ b/drivers/net/ethernet/intel/iavf/iavf_main.c
@@ -790,9 +790,6 @@ static int iavf_set_mac(struct net_device *netdev, void *p)
if (ether_addr_equal(netdev->dev_addr, addr->sa_data))
return 0;
- if (adapter->flags & IAVF_FLAG_ADDR_SET_BY_PF)
- return -EPERM;
-
spin_lock_bh(&adapter->mac_vlan_list_lock);
f = iavf_find_filter(adapter, hw->mac.addr);
@@ -1811,7 +1808,6 @@ static int iavf_init_get_resources(struct iavf_adapter *adapter)
eth_hw_addr_random(netdev);
ether_addr_copy(adapter->hw.mac.addr, netdev->dev_addr);
} else {
- adapter->flags |= IAVF_FLAG_ADDR_SET_BY_PF;
ether_addr_copy(netdev->dev_addr, adapter->hw.mac.addr);
ether_addr_copy(netdev->perm_addr, adapter->hw.mac.addr);
}
--
2.21.0
^ permalink raw reply related
* [net-next 05/15] Documentation: iavf: Update the Intel LAN driver doc for iavf
From: Jeff Kirsher @ 2019-08-28 6:43 UTC (permalink / raw)
To: davem; +Cc: Jeff Kirsher, netdev, nhorman, sassmann, Aaron Brown
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
Update the LAN driver documentation to include the latest feature
implementation and driver capabilities.
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
---
.../networking/device_drivers/intel/iavf.rst | 115 +++++++++++++-----
1 file changed, 82 insertions(+), 33 deletions(-)
diff --git a/Documentation/networking/device_drivers/intel/iavf.rst b/Documentation/networking/device_drivers/intel/iavf.rst
index 2d0c3baa1752..cfc08842e32c 100644
--- a/Documentation/networking/device_drivers/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/intel/iavf.rst
@@ -10,11 +10,15 @@ Copyright(c) 2013-2018 Intel Corporation.
Contents
========
+- Overview
- Identifying Your Adapter
- Additional Configurations
- Known Issues/Troubleshooting
- Support
+Overview
+========
+
This file describes the iavf Linux* Base Driver. This driver was formerly
called i40evf.
@@ -27,6 +31,7 @@ The guest OS loading the iavf driver must support MSI-X interrupts.
Identifying Your Adapter
========================
+
The driver in this kernel is compatible with devices based on the following:
* Intel(R) XL710 X710 Virtual Function
* Intel(R) X722 Virtual Function
@@ -50,9 +55,10 @@ Link messages will not be displayed to the console if the distribution is
restricting system messages. In order to see network driver link messages on
your console, set dmesg to eight by entering the following::
- dmesg -n 8
+ # dmesg -n 8
-NOTE: This setting is not saved across reboots.
+NOTE:
+ This setting is not saved across reboots.
ethtool
-------
@@ -72,11 +78,11 @@ then requests from that VF to set VLAN tag stripping will be ignored.
To enable/disable VLAN tag stripping for a VF, issue the following command
from inside the VM in which you are running the VF::
- ethtool -K <if_name> rxvlan on/off
+ # ethtool -K <if_name> rxvlan on/off
or alternatively::
- ethtool --offload <if_name> rxvlan on/off
+ # ethtool --offload <if_name> rxvlan on/off
Adaptive Virtual Function
-------------------------
@@ -91,21 +97,21 @@ additional features depending on what features are available in the PF with
which the AVF is associated. The following are base mode features:
- 4 Queue Pairs (QP) and associated Configuration Status Registers (CSRs)
- for Tx/Rx.
-- i40e descriptors and ring format.
-- Descriptor write-back completion.
-- 1 control queue, with i40e descriptors, CSRs and ring format.
-- 5 MSI-X interrupt vectors and corresponding i40e CSRs.
-- 1 Interrupt Throttle Rate (ITR) index.
-- 1 Virtual Station Interface (VSI) per VF.
+ for Tx/Rx
+- i40e descriptors and ring format
+- Descriptor write-back completion
+- 1 control queue, with i40e descriptors, CSRs and ring format
+- 5 MSI-X interrupt vectors and corresponding i40e CSRs
+- 1 Interrupt Throttle Rate (ITR) index
+- 1 Virtual Station Interface (VSI) per VF
- 1 Traffic Class (TC), TC0
- Receive Side Scaling (RSS) with 64 entry indirection table and key,
- configured through the PF.
-- 1 unicast MAC address reserved per VF.
-- 16 MAC address filters for each VF.
-- Stateless offloads - non-tunneled checksums.
-- AVF device ID.
-- HW mailbox is used for VF to PF communications (including on Windows).
+ configured through the PF
+- 1 unicast MAC address reserved per VF
+- 16 MAC address filters for each VF
+- Stateless offloads - non-tunneled checksums
+- AVF device ID
+- HW mailbox is used for VF to PF communications (including on Windows)
IEEE 802.1ad (QinQ) Support
---------------------------
@@ -117,8 +123,8 @@ VLAN ID, among other uses.
The following are examples of how to configure 802.1ad (QinQ)::
- ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
- ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
+ # ip link add link eth0 eth0.24 type vlan proto 802.1ad id 24
+ # ip link add link eth0.24 eth0.24.371 type vlan proto 802.1Q id 371
Where "24" and "371" are example VLAN IDs.
@@ -133,6 +139,19 @@ specific application. This can reduce latency for the specified application,
and allow Tx traffic to be rate limited per application. Follow the steps below
to set ADq.
+Requirements:
+
+- The sch_mqprio, act_mirred and cls_flower modules must be loaded
+- The latest version of iproute2
+- If another driver (for example, DPDK) has set cloud filters, you cannot
+ enable ADQ
+- Depending on the underlying PF device, ADQ cannot be enabled when the
+ following features are enabled:
+
+ + Data Center Bridging (DCB)
+ + Multiple Functions per Port (MFP)
+ + Sideband Filters
+
1. Create traffic classes (TCs). Maximum of 8 TCs can be created per interface.
The shaper bw_rlimit parameter is optional.
@@ -141,9 +160,9 @@ to 1Gbit for tc0 and 3Gbit for tc1.
::
- # tc qdisc add dev <interface> root mqprio num_tc 2 map 0 0 0 0 1 1 1 1
- queues 16@0 16@16 hw 1 mode channel shaper bw_rlimit min_rate 1Gbit 2Gbit
- max_rate 1Gbit 3Gbit
+ tc qdisc add dev <interface> root mqprio num_tc 2 map 0 0 0 0 1 1 1 1
+ queues 16@0 16@16 hw 1 mode channel shaper bw_rlimit min_rate 1Gbit 2Gbit
+ max_rate 1Gbit 3Gbit
map: priority mapping for up to 16 priorities to tcs (e.g. map 0 0 0 0 1 1 1 1
sets priorities 0-3 to use tc0 and 4-7 to use tc1)
@@ -162,6 +181,10 @@ Totals must be equal or less than port speed.
For example: min_rate 1Gbit 3Gbit: Verify bandwidth limit using network
monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
+NOTE:
+ Setting up channels via ethtool (ethtool -L) is not supported when the
+ TCs are configured using mqprio.
+
2. Enable HW TC offload on interface::
# ethtool -K <interface> hw-tc-offload on
@@ -171,16 +194,16 @@ monitoring tools such as ifstat or sar –n DEV [interval] [number of samples]
# tc qdisc add dev <interface> ingress
NOTES:
- - Run all tc commands from the iproute2 <pathtoiproute2>/tc/ directory.
- - ADq is not compatible with cloud filters.
+ - Run all tc commands from the iproute2 <pathtoiproute2>/tc/ directory
+ - ADq is not compatible with cloud filters
- Setting up channels via ethtool (ethtool -L) is not supported when the TCs
- are configured using mqprio.
+ are configured using mqprio
- You must have iproute2 latest version
- - NVM version 6.01 or later is required.
+ - NVM version 6.01 or later is required
- ADq cannot be enabled when any the following features are enabled: Data
- Center Bridging (DCB), Multiple Functions per Port (MFP), or Sideband Filters.
+ Center Bridging (DCB), Multiple Functions per Port (MFP), or Sideband Filters
- If another driver (for example, DPDK) has set cloud filters, you cannot
- enable ADq.
+ enable ADq
- Tunnel filters are not supported in ADq. If encapsulated packets do arrive
in non-tunnel mode, filtering will be done on the inner headers. For example,
for VXLAN traffic in non-tunnel mode, PCTYPE is identified as a VXLAN
@@ -198,6 +221,16 @@ NOTES:
Known Issues/Troubleshooting
============================
+Bonding fails with VFs bound to an Intel(R) Ethernet Controller 700 series device
+---------------------------------------------------------------------------------
+If you bind Virtual Functions (VFs) to an Intel(R) Ethernet Controller 700
+series based device, the VF slaves may fail when they become the active slave.
+If the MAC address of the VF is set by the PF (Physical Function) of the
+device, when you add a slave, or change the active-backup slave, Linux bonding
+tries to sync the backup slave's MAC address to the same MAC address as the
+active slave. Linux bonding will fail at this point. This issue will not occur
+if the VF's MAC address is not set by the PF.
+
Traffic Is Not Being Passed Between VM and Client
-------------------------------------------------
You may not be able to pass traffic between a client system and a
@@ -215,13 +248,28 @@ Do not unload a port's driver if a Virtual Function (VF) with an active Virtual
Machine (VM) is bound to it. Doing so will cause the port to appear to hang.
Once the VM shuts down, or otherwise releases the VF, the command will complete.
+Using four traffic classes fails
+--------------------------------
+Do not try to reserve more than three traffic classes in the iavf driver. Doing
+so will fail to set any traffic classes and will cause the driver to write
+errors to stdout. Use a maximum of three queues to avoid this issue.
+
+Multiple log error messages on iavf driver removal
+--------------------------------------------------
+If you have several VFs and you remove the iavf driver, several instances of
+the following log errors are written to the log::
+
+ Unable to send opcode 2 to PF, err I40E_ERR_QUEUE_EMPTY, aq_err ok
+ Unable to send the message to VF 2 aq_err 12
+ ARQ Overflow Error detected
+
Virtual machine does not get link
---------------------------------
If the virtual machine has more than one virtual port assigned to it, and those
virtual ports are bound to different physical ports, you may not get link on
all of the virtual ports. The following command may work around the issue::
- ethtool -r <PF>
+ # ethtool -r <PF>
Where <PF> is the PF interface in the host, for example: p5p1. You may need to
run the command more than once to get link on all virtual ports.
@@ -251,12 +299,13 @@ traffic.
If you have multiple interfaces in a server, either turn on ARP filtering by
entering::
- echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
+ # echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter
-NOTE: This setting is not saved across reboots. The configuration change can be
-made permanent by adding the following line to the file /etc/sysctl.conf::
+NOTE:
+ This setting is not saved across reboots. The configuration change can be
+ made permanent by adding the following line to the file /etc/sysctl.conf::
- net.ipv4.conf.all.arp_filter = 1
+ net.ipv4.conf.all.arp_filter = 1
Another alternative is to install the interfaces in separate broadcast domains
(either in different switches or in a switch partitioned to VLANs).
--
2.21.0
^ permalink raw reply related
* [net-next 04/15] igc: Remove useless forward declaration
From: Jeff Kirsher @ 2019-08-28 6:43 UTC (permalink / raw)
To: davem; +Cc: Sasha Neftin, netdev, nhorman, sassmann, Aaron Brown,
Jeff Kirsher
In-Reply-To: <20190828064407.30168-1-jeffrey.t.kirsher@intel.com>
From: Sasha Neftin <sasha.neftin@intel.com>
Move igc_phy_setup_autoneg, igc_wait_autoneg and igc_set_fc_watermarks
up to avoid forward declaration.
It is not necessary to forward declare these static methods.
Signed-off-by: Sasha Neftin <sasha.neftin@intel.com>
Tested-by: Aaron Brown <aaron.f.brown@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
---
drivers/net/ethernet/intel/igc/igc_mac.c | 73 +++++----
drivers/net/ethernet/intel/igc/igc_phy.c | 192 +++++++++++------------
2 files changed, 129 insertions(+), 136 deletions(-)
diff --git a/drivers/net/ethernet/intel/igc/igc_mac.c b/drivers/net/ethernet/intel/igc/igc_mac.c
index ba4646737288..5eeb4c8caf4a 100644
--- a/drivers/net/ethernet/intel/igc/igc_mac.c
+++ b/drivers/net/ethernet/intel/igc/igc_mac.c
@@ -7,9 +7,6 @@
#include "igc_mac.h"
#include "igc_hw.h"
-/* forward declaration */
-static s32 igc_set_fc_watermarks(struct igc_hw *hw);
-
/**
* igc_disable_pcie_master - Disables PCI-express master access
* @hw: pointer to the HW structure
@@ -74,6 +71,41 @@ void igc_init_rx_addrs(struct igc_hw *hw, u16 rar_count)
hw->mac.ops.rar_set(hw, mac_addr, i);
}
+/**
+ * igc_set_fc_watermarks - Set flow control high/low watermarks
+ * @hw: pointer to the HW structure
+ *
+ * Sets the flow control high/low threshold (watermark) registers. If
+ * flow control XON frame transmission is enabled, then set XON frame
+ * transmission as well.
+ */
+static s32 igc_set_fc_watermarks(struct igc_hw *hw)
+{
+ u32 fcrtl = 0, fcrth = 0;
+
+ /* Set the flow control receive threshold registers. Normally,
+ * these registers will be set to a default threshold that may be
+ * adjusted later by the driver's runtime code. However, if the
+ * ability to transmit pause frames is not enabled, then these
+ * registers will be set to 0.
+ */
+ if (hw->fc.current_mode & igc_fc_tx_pause) {
+ /* We need to set up the Receive Threshold high and low water
+ * marks as well as (optionally) enabling the transmission of
+ * XON frames.
+ */
+ fcrtl = hw->fc.low_water;
+ if (hw->fc.send_xon)
+ fcrtl |= IGC_FCRTL_XONE;
+
+ fcrth = hw->fc.high_water;
+ }
+ wr32(IGC_FCRTL, fcrtl);
+ wr32(IGC_FCRTH, fcrth);
+
+ return 0;
+}
+
/**
* igc_setup_link - Setup flow control and link settings
* @hw: pointer to the HW structure
@@ -194,41 +226,6 @@ s32 igc_force_mac_fc(struct igc_hw *hw)
return ret_val;
}
-/**
- * igc_set_fc_watermarks - Set flow control high/low watermarks
- * @hw: pointer to the HW structure
- *
- * Sets the flow control high/low threshold (watermark) registers. If
- * flow control XON frame transmission is enabled, then set XON frame
- * transmission as well.
- */
-static s32 igc_set_fc_watermarks(struct igc_hw *hw)
-{
- u32 fcrtl = 0, fcrth = 0;
-
- /* Set the flow control receive threshold registers. Normally,
- * these registers will be set to a default threshold that may be
- * adjusted later by the driver's runtime code. However, if the
- * ability to transmit pause frames is not enabled, then these
- * registers will be set to 0.
- */
- if (hw->fc.current_mode & igc_fc_tx_pause) {
- /* We need to set up the Receive Threshold high and low water
- * marks as well as (optionally) enabling the transmission of
- * XON frames.
- */
- fcrtl = hw->fc.low_water;
- if (hw->fc.send_xon)
- fcrtl |= IGC_FCRTL_XONE;
-
- fcrth = hw->fc.high_water;
- }
- wr32(IGC_FCRTL, fcrtl);
- wr32(IGC_FCRTH, fcrth);
-
- return 0;
-}
-
/**
* igc_clear_hw_cntrs_base - Clear base hardware counters
* @hw: pointer to the HW structure
diff --git a/drivers/net/ethernet/intel/igc/igc_phy.c b/drivers/net/ethernet/intel/igc/igc_phy.c
index 4c8f96a9a148..f4b05af0dd2f 100644
--- a/drivers/net/ethernet/intel/igc/igc_phy.c
+++ b/drivers/net/ethernet/intel/igc/igc_phy.c
@@ -3,10 +3,6 @@
#include "igc_phy.h"
-/* forward declaration */
-static s32 igc_phy_setup_autoneg(struct igc_hw *hw);
-static s32 igc_wait_autoneg(struct igc_hw *hw);
-
/**
* igc_check_reset_block - Check if PHY reset is blocked
* @hw: pointer to the HW structure
@@ -207,100 +203,6 @@ s32 igc_phy_hw_reset(struct igc_hw *hw)
return ret_val;
}
-/**
- * igc_copper_link_autoneg - Setup/Enable autoneg for copper link
- * @hw: pointer to the HW structure
- *
- * Performs initial bounds checking on autoneg advertisement parameter, then
- * configure to advertise the full capability. Setup the PHY to autoneg
- * and restart the negotiation process between the link partner. If
- * autoneg_wait_to_complete, then wait for autoneg to complete before exiting.
- */
-static s32 igc_copper_link_autoneg(struct igc_hw *hw)
-{
- struct igc_phy_info *phy = &hw->phy;
- u16 phy_ctrl;
- s32 ret_val;
-
- /* Perform some bounds checking on the autoneg advertisement
- * parameter.
- */
- phy->autoneg_advertised &= phy->autoneg_mask;
-
- /* If autoneg_advertised is zero, we assume it was not defaulted
- * by the calling code so we set to advertise full capability.
- */
- if (phy->autoneg_advertised == 0)
- phy->autoneg_advertised = phy->autoneg_mask;
-
- hw_dbg("Reconfiguring auto-neg advertisement params\n");
- ret_val = igc_phy_setup_autoneg(hw);
- if (ret_val) {
- hw_dbg("Error Setting up Auto-Negotiation\n");
- goto out;
- }
- hw_dbg("Restarting Auto-Neg\n");
-
- /* Restart auto-negotiation by setting the Auto Neg Enable bit and
- * the Auto Neg Restart bit in the PHY control register.
- */
- ret_val = phy->ops.read_reg(hw, PHY_CONTROL, &phy_ctrl);
- if (ret_val)
- goto out;
-
- phy_ctrl |= (MII_CR_AUTO_NEG_EN | MII_CR_RESTART_AUTO_NEG);
- ret_val = phy->ops.write_reg(hw, PHY_CONTROL, phy_ctrl);
- if (ret_val)
- goto out;
-
- /* Does the user want to wait for Auto-Neg to complete here, or
- * check at a later time (for example, callback routine).
- */
- if (phy->autoneg_wait_to_complete) {
- ret_val = igc_wait_autoneg(hw);
- if (ret_val) {
- hw_dbg("Error while waiting for autoneg to complete\n");
- goto out;
- }
- }
-
- hw->mac.get_link_status = true;
-
-out:
- return ret_val;
-}
-
-/**
- * igc_wait_autoneg - Wait for auto-neg completion
- * @hw: pointer to the HW structure
- *
- * Waits for auto-negotiation to complete or for the auto-negotiation time
- * limit to expire, which ever happens first.
- */
-static s32 igc_wait_autoneg(struct igc_hw *hw)
-{
- u16 i, phy_status;
- s32 ret_val = 0;
-
- /* Break after autoneg completes or PHY_AUTO_NEG_LIMIT expires. */
- for (i = PHY_AUTO_NEG_LIMIT; i > 0; i--) {
- ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS, &phy_status);
- if (ret_val)
- break;
- ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS, &phy_status);
- if (ret_val)
- break;
- if (phy_status & MII_SR_AUTONEG_COMPLETE)
- break;
- msleep(100);
- }
-
- /* PHY_AUTO_NEG_TIME expiration doesn't guarantee auto-negotiation
- * has completed.
- */
- return ret_val;
-}
-
/**
* igc_phy_setup_autoneg - Configure PHY for auto-negotiation
* @hw: pointer to the HW structure
@@ -485,6 +387,100 @@ static s32 igc_phy_setup_autoneg(struct igc_hw *hw)
return ret_val;
}
+/**
+ * igc_wait_autoneg - Wait for auto-neg completion
+ * @hw: pointer to the HW structure
+ *
+ * Waits for auto-negotiation to complete or for the auto-negotiation time
+ * limit to expire, which ever happens first.
+ */
+static s32 igc_wait_autoneg(struct igc_hw *hw)
+{
+ u16 i, phy_status;
+ s32 ret_val = 0;
+
+ /* Break after autoneg completes or PHY_AUTO_NEG_LIMIT expires. */
+ for (i = PHY_AUTO_NEG_LIMIT; i > 0; i--) {
+ ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS, &phy_status);
+ if (ret_val)
+ break;
+ ret_val = hw->phy.ops.read_reg(hw, PHY_STATUS, &phy_status);
+ if (ret_val)
+ break;
+ if (phy_status & MII_SR_AUTONEG_COMPLETE)
+ break;
+ msleep(100);
+ }
+
+ /* PHY_AUTO_NEG_TIME expiration doesn't guarantee auto-negotiation
+ * has completed.
+ */
+ return ret_val;
+}
+
+/**
+ * igc_copper_link_autoneg - Setup/Enable autoneg for copper link
+ * @hw: pointer to the HW structure
+ *
+ * Performs initial bounds checking on autoneg advertisement parameter, then
+ * configure to advertise the full capability. Setup the PHY to autoneg
+ * and restart the negotiation process between the link partner. If
+ * autoneg_wait_to_complete, then wait for autoneg to complete before exiting.
+ */
+static s32 igc_copper_link_autoneg(struct igc_hw *hw)
+{
+ struct igc_phy_info *phy = &hw->phy;
+ u16 phy_ctrl;
+ s32 ret_val;
+
+ /* Perform some bounds checking on the autoneg advertisement
+ * parameter.
+ */
+ phy->autoneg_advertised &= phy->autoneg_mask;
+
+ /* If autoneg_advertised is zero, we assume it was not defaulted
+ * by the calling code so we set to advertise full capability.
+ */
+ if (phy->autoneg_advertised == 0)
+ phy->autoneg_advertised = phy->autoneg_mask;
+
+ hw_dbg("Reconfiguring auto-neg advertisement params\n");
+ ret_val = igc_phy_setup_autoneg(hw);
+ if (ret_val) {
+ hw_dbg("Error Setting up Auto-Negotiation\n");
+ goto out;
+ }
+ hw_dbg("Restarting Auto-Neg\n");
+
+ /* Restart auto-negotiation by setting the Auto Neg Enable bit and
+ * the Auto Neg Restart bit in the PHY control register.
+ */
+ ret_val = phy->ops.read_reg(hw, PHY_CONTROL, &phy_ctrl);
+ if (ret_val)
+ goto out;
+
+ phy_ctrl |= (MII_CR_AUTO_NEG_EN | MII_CR_RESTART_AUTO_NEG);
+ ret_val = phy->ops.write_reg(hw, PHY_CONTROL, phy_ctrl);
+ if (ret_val)
+ goto out;
+
+ /* Does the user want to wait for Auto-Neg to complete here, or
+ * check at a later time (for example, callback routine).
+ */
+ if (phy->autoneg_wait_to_complete) {
+ ret_val = igc_wait_autoneg(hw);
+ if (ret_val) {
+ hw_dbg("Error while waiting for autoneg to complete\n");
+ goto out;
+ }
+ }
+
+ hw->mac.get_link_status = true;
+
+out:
+ return ret_val;
+}
+
/**
* igc_setup_copper_link - Configure copper link settings
* @hw: pointer to the HW structure
--
2.21.0
^ permalink raw reply related
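One of the helpers moved by the patch above, igc_wait_autoneg(), is a bounded polling loop: read the PHY status register, stop on a read error or once the autoneg-complete bit is set, otherwise sleep and retry until the retry budget runs out. Below is a userspace sketch of that pattern. All demo_ names, the callback type and the fake PHY are hypothetical; the 0x0020 mask is assumed to be the standard MII "auto-negotiation complete" status bit. Unlike the driver function, the sketch reports completion through an out parameter and does not sleep between polls.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match the standard MII status "autoneg complete" bit. */
#define DEMO_AUTONEG_COMPLETE	0x0020u

/* Hypothetical register-read callback: returns 0 on success. */
typedef int (*demo_read_status_fn)(void *ctx, uint16_t *status);

/* Poll until autoneg completes, a read fails, or max_polls is used up.
 * Returns the last read result; *completed tells a timeout apart from
 * success, which the plain return value cannot do.
 */
static int demo_wait_autoneg(demo_read_status_fn read_status, void *ctx,
			     unsigned int max_polls, bool *completed)
{
	uint16_t status = 0;
	unsigned int i;
	int ret = 0;

	*completed = false;

	for (i = max_polls; i > 0; i--) {
		ret = read_status(ctx, &status);
		if (ret)
			break;
		if (status & DEMO_AUTONEG_COMPLETE) {
			*completed = true;
			break;
		}
		/* The driver sleeps 100 ms here; omitted for portability. */
	}

	return ret;
}

/* Fake PHY used only to exercise the loop. */
struct demo_fake_phy {
	unsigned int reads_until_done;
	unsigned int reads;
};

static int demo_fake_read_status(void *ctx, uint16_t *status)
{
	struct demo_fake_phy *phy = ctx;

	phy->reads++;
	*status = (phy->reads >= phy->reads_until_done) ?
		  DEMO_AUTONEG_COMPLETE : 0;

	return 0;
}
```

Because the helper is defined before its only caller in the sketch, no forward declaration is needed, which is the same ordering the patch establishes in igc_phy.c.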
* Re: [PATCH rdma-next v3 0/3] ODP support for mlx5 DC QPs
From: Leon Romanovsky @ 2019-08-28 7:06 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Doug Ledford, RDMA mailing list, Michael Guralnik, Saeed Mahameed,
linux-netdev
In-Reply-To: <20190827155140.GA15153@ziepe.ca>
On Tue, Aug 27, 2019 at 12:51:40PM -0300, Jason Gunthorpe wrote:
> On Mon, Aug 19, 2019 at 03:08:12PM +0300, Leon Romanovsky wrote:
> > From: Leon Romanovsky <leonro@mellanox.com>
> >
> > Changelog
> > v3:
> > * Rewrote patches to expose through DEVX without need to change mlx5-abi.h at all.
> > v2: https://lore.kernel.org/linux-rdma/20190806074807.9111-1-leon@kernel.org
> > * Fixed reserved_* field wrong name (Saeed M.)
> > * Split first patch to two patches, one for mlx5-next and one for rdma-next. (Saeed M.)
> > v1: https://lore.kernel.org/linux-rdma/20190804100048.32671-1-leon@kernel.org
> > * Fixed alignment to u64 in mlx5-abi.h (Gal P.)
> > v0: https://lore.kernel.org/linux-rdma/20190801122139.25224-1-leon@kernel.org
> >
> > From Michael,
> >
> > The series adds support for on-demand paging for DC transport.
> >
> > As DC is mlx-only transport, the capabilities are exposed
> > to the user using DEVX objects and later on through mlx5dv_query_device.
> >
> > Thanks
> >
> > Michael Guralnik (3):
> > net/mlx5: Set ODP capabilities for DC transport to max
> > IB/mlx5: Remove check of FW capabilities in ODP page fault handling
> > IB/mlx5: Add page fault handler for DC initiator WQE
>
> This seems fine, can you put the commit on the shared branch?
Thanks, applied to mlx5-next
00679b631edd net/mlx5: Set ODP capabilities for DC transport to max
>
> Thanks,
> Jason
^ permalink raw reply
* Re: [patch net-next rfc 3/7] net: rtnetlink: add commands to add and delete alternative ifnames
From: Jiri Pirko @ 2019-08-28 7:07 UTC (permalink / raw)
To: Roopa Prabhu
Cc: David Miller, Jakub Kicinski, David Ahern, netdev,
Stephen Hemminger, dcbw, Michal Kubecek, Andrew Lunn, parav,
Saeed Mahameed, mlxsw
In-Reply-To: <CAJieiUjpE+o-=x2hQcsKQJNxB8O7VLHYw2tSnqzTFRuy_vtOxw@mail.gmail.com>
Tue, Aug 27, 2019 at 05:14:49PM CEST, roopa@cumulusnetworks.com wrote:
>On Tue, Aug 27, 2019 at 2:35 AM Jiri Pirko <jiri@resnulli.us> wrote:
>>
>> Tue, Aug 27, 2019 at 10:22:42AM CEST, davem@davemloft.net wrote:
>> >From: Jiri Pirko <jiri@resnulli.us>
>> >Date: Tue, 27 Aug 2019 09:08:08 +0200
>> >
>> >> Okay, so if I understand correctly, on top of separate commands for
>> >> add/del of alternative names, you also suggest making get/dump a
>> >> separate command instead of filling this into the existing
>> >> newlink/getlink commands.
>> >
>> >I'm not sure what to do yet.
>> >
>> >David has a point, because the only way these ifnames are useful is
>> >as ways to specify and choose net devices. So based upon that I'm
>> >slightly leaning towards not using separate commands.
>>
>> Well yeah, one can use it to handle existing commands instead of
>> IFLA_NAME.
>>
>> But why does it rule out separate commands? I think it is cleaner than
>> to put everything in poor setlink messages :/ The fact that we would
>> need to add "OP" to the setlink message just feels off. Other similar
>> needs may show up in the future and we may end up with ridiculous messages
>> like:
>>
>> SETLINK
>> IFLA_NAME eth0
>> IFLA_ALTNAME_LIST (nest)
>> IFLA_ALTNAME_OP add
>> IFLA_ALTNAME somereallylongname
>> IFLA_ALTNAME_OP del
>> IFLA_ALTNAME somereallyreallylongname
>> IFLA_ALTNAME_OP add
>> IFLA_ALTNAME someotherreallylongname
>> IFLA_SOMETHING_ELSE_LIST (nest)
>> IFLA_SOMETHING_ELSE_OP add
>> ...
>> IFLA_SOMETHING_ELSE_OP del
>> ...
>> IFLA_SOMETHING_ELSE_OP add
>> ...
>>
>> I don't know what to think about it. Rollbacks are going to be pure hell :/
>
>I don't see a huge problem with the above. We need a way to solve this
>anyways for other list types in the future, correct?
>The approach taken by this series will not scale if we have to add a
>new msg type and header for every such list attribute in the future.
Do you have some other examples in mind? So far, this was not needed.
>
>A good parallel here is bridge vlan, which uses RTM_SETLINK and
>RTM_DELLINK for vlan adds and deletes. But it does have the advantage of
>a separate msg space under AF_BRIDGE, which makes it cleaner. Maybe
>something closer to that can be made to work (possibly with a msg flag)?
1) Not sure if AF_BRIDGE is the right example of how to do things.
2) See br_vlan_info(). It is not an OP-PER-VLAN. You either add or
delete all passed info, depending on the cmd (RTM_SETLINK/RTM_DELLINK).
>
>Would be good to have a consistent way to update list attributes for
>future needs too.
Okay. Do you suggest having a new set of commands to handle
adding/deleting lists of items? altNames now, others (other nests) later?
Something like:
CMD SETLISTS
IFLA_NAME eth0
IFLA_ALTNAME_LIST (nest)
IFLA_ALTNAME somereallylongname
IFLA_ALTNAME somereallyreallylongname
IFLA_ALTNAME someotherreallylongname
IFLA_SOMETHING_ELSE_LIST (nest)
IFLA_SOMETHING_ELSE
IFLA_SOMETHING_ELSE
IFLA_SOMETHING_ELSE
CMD DELLISTS
IFLA_NAME eth0
IFLA_ALTNAME_LIST (nest)
IFLA_ALTNAME somereallylongname
IFLA_ALTNAME somereallyreallylongname
IFLA_ALTNAME someotherreallylongname
IFLA_SOMETHING_ELSE_LIST (nest)
IFLA_SOMETHING_ELSE
IFLA_SOMETHING_ELSE
IFLA_SOMETHING_ELSE
How does this sound?
^ permalink raw reply
* Re: libbpf distro packaging
From: Jiri Olsa @ 2019-08-28 7:12 UTC (permalink / raw)
To: Julia Kartseva
Cc: Alexei Starovoitov, Andrii Nakryiko, labbott@redhat.com,
acme@kernel.org, debian-kernel@lists.debian.org,
netdev@vger.kernel.org, Andrey Ignatov, Yonghong Song,
jolsa@kernel.org, Daniel Borkmann
In-Reply-To: <A2E805DD-8237-4703-BE6F-CC96A4D4D909@fb.com>
On Tue, Aug 27, 2019 at 10:30:24PM +0000, Julia Kartseva wrote:
> On 8/25/19, 11:42 PM, "Jiri Olsa" <jolsa@redhat.com> wrote:
>
> > On Fri, Aug 23, 2019 at 04:00:01PM +0000, Alexei Starovoitov wrote:
> > >
> > > Technically we can bump it at any time.
> > > The goal was to bump it only when new kernel is released
> > > to capture a collection of new APIs in a given 0.0.X release.
> > > So that libbpf versions are synchronized with kernel versions
> > > in a somewhat loose way.
> > > In this case we can make an exception and bump it now.
> >
> > I see, I don't think it's worth the exception now,
> > the patch is simple or we'll start with 0.0.3
>
> PR introducing 0.0.5 ABI was merged:
> https://github.com/libbpf/libbpf/commit/476e158
> Jiri, if you'd like to avoid patching, you can start w/ 0.0.5.
> Also if you're planning to use *.spec from libbpf as a source of truth,
> it may be enhanced by syncing spec and ABI versions, similar to
> https://github.com/libbpf/libbpf/commit/d60f568
cool, anyway I started with v0.0.3 ;-) I'll update
to latest once we are merged in
the spec/srpm is currently under Fedora review:
https://bugzilla.redhat.com/show_bug.cgi?id=1745478
you can check it in here:
http://people.redhat.com/~jolsa/libbpf/v2/
I think it's a little different from what you have,
but not in the essential parts
jirka
^ permalink raw reply
* Re: [PATCH bpf-next] bpf, capabilities: introduce CAP_BPF
From: Peter Zijlstra @ 2019-08-28 7:14 UTC (permalink / raw)
To: Andy Lutomirski
Cc: Alexei Starovoitov, Kees Cook, LSM List, James Morris, Jann Horn,
Masami Hiramatsu, Steven Rostedt, David S. Miller,
Daniel Borkmann, Network Development, bpf, kernel-team, Linux API
In-Reply-To: <CALCETrV8iJv9+Ai11_1_r6MapPhhwt9hjxi=6EoixytabTScqg@mail.gmail.com>
On Tue, Aug 27, 2019 at 04:01:08PM -0700, Andy Lutomirski wrote:
> > Tracing:
> >
> > CAP_BPF and perf_paranoid_tracepoint_raw() (which is kernel.perf_event_paranoid == -1)
> > are necessary to:
That's not tracing, that's perf.
> > +bool cap_bpf_tracing(void)
> > +{
> > + return capable(CAP_SYS_ADMIN) ||
> > + (capable(CAP_BPF) && !perf_paranoid_tracepoint_raw());
> > +}
A whole long time ago, I proposed we introduce CAP_PERF or something
along those lines; as a replacement for that horrible crap Android and
Debian ship. But nobody was ever interested enough.
The nice thing about that is that you can then disallow perf/tracing in
general, but tag the perf executable (and similar tools) with the
capability so that unpriv users can still use it, but only limited
through the tool, not the syscalls directly.
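For reference, the predicate in the quoted hunk can be modelled in plain userspace C. This is only a sketch of the boolean logic: the function and parameter names below are made up for illustration, and the third argument stands in for the return value of perf_paranoid_tracepoint_raw().

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical userspace model of the quoted cap_bpf_tracing() check.
 * paranoid_tracepoint_raw models perf_paranoid_tracepoint_raw(), which is
 * false only when kernel.perf_event_paranoid == -1.
 */
static bool cap_bpf_tracing_model(bool has_sys_admin, bool has_cap_bpf,
				  bool paranoid_tracepoint_raw)
{
	return has_sys_admin || (has_cap_bpf && !paranoid_tracepoint_raw);
}
```

In this model CAP_BPF alone is not enough: without CAP_SYS_ADMIN the check passes only when the perf paranoid setting is fully relaxed.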
^ permalink raw reply
* Re: [E1000-devel] SFP+ EEPROM readouts fail on X722 (ethtool -m: Invalid argument)
From: Jakub Jankowski @ 2019-08-28 7:18 UTC (permalink / raw)
To: Fujinaka, Todd, e1000-devel@lists.sourceforge.net
Cc: netdev@vger.kernel.org, mhemsley@open-systems.com, Jeff Kirsher,
Lihong Yang
In-Reply-To: <9B4A1B1917080E46B64F07F2989DADD69B01402F@ORSMSX115.amr.corp.intel.com>
This commit suggests that it should be possible:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c271dd6c391b535226cf1a81aaad9f33cb5899d3
(It has been in upstream kernel since v4.12, so my test kernel does have
it, and so does the out-of-tree driver I'm testing with)
On 8/28/19 2:53 AM, Fujinaka, Todd wrote:
> Sorry about the top posting, but if I don't do it this way I can't read anything in Outlook (not my preferred MUA).
>
> I think I may have been wrong about things. I'm not as familiar with the x722, and the NVM versions are completely different than the x710 and I was confused.
>
> Even worse, I'm not sure if the x722 is able to read the data from the SFP/SFP+ EEPROM. I remembered that was a feature we requested internally but I don't remember what the progress was.
>
> I'm asking around to see if I can get clarification. I haven't heard anything yet.
>
> Todd Fujinaka
> Software Application Engineer
> Datacenter Engineering Group
> Intel Corporation
> todd.fujinaka@intel.com
>
>
> -----Original Message-----
> From: Jakub Jankowski [mailto:shasta@toxcorp.com]
> Sent: Tuesday, August 27, 2019 4:01 PM
> To: Fujinaka, Todd <todd.fujinaka@intel.com>; e1000-devel@lists.sourceforge.net
> Cc: netdev@vger.kernel.org; mhemsley@open-systems.com
> Subject: Re: [E1000-devel] SFP+ EEPROM readouts fail on X722 (ethtool -m: Invalid argument)
>
> Hi,
>
> On 8/27/19 7:56 PM, Fujinaka, Todd wrote:
>> The hints should be:
>> # ethtool -m eth10
>> Cannot get module EEPROM information: Invalid argument
>> # dmesg | tail -n 1
>> [ 445.971974] i40e 0000:3d:00.3 eth10: Module EEPROM memory read not supported. Please update the NVM image.
>>
>> # ethtool -i eth10
>> driver: i40e
>> version: 2.9.21
>> firmware-version: 3.31 0x80000d31 1.1767.0
>>
>> And the working case:
>> # ethtool -i eth8
>> driver: i40e
>> version: 2.9.21
>> firmware-version: 6.01 0x800035cf 1.1876.0
>>
>> If you don't see it, 6.01 > 3.31.
> The reason why firmware between the two is (that much) different is because the non-working case is from X722 NIC, while the working one is from X710.
>
>> The NVM update tool should be available on downloadcenter.intel.com
> Thanks for the pointer to NVM updater. I'd like to offer some additional comments about my experience with the newest one (v4.00):
>
> a) running ./nvmupdate64e (from X722_NVMUpdate_Linux_x64 subdir) errors out without really saying what's wrong:
>
> # ./nvmupdate64e
>
> Intel(R) Ethernet NVM Update Tool
> NVMUpdate version 1.30.2.11
> Copyright (C) 2013 - 2017 Intel Corporation.
>
>
> WARNING: To avoid damage to your device, do not stop the update or reboot or power off the system during this update.
> Inventory in progress. Please wait [+.........]
> Tool execution completed with the following status: The configuration file could not be opened/read, or a syntax error was discovered in the file
> Press any key to exit.
>
> after enabling logging (-l out.log) a bit more is revealed:
>
> # tail -n 2 out.log
> Error: Config file line 2: Not supported config file version.
> Error: Missing CONFIG VERSION parameter in configuration file.
>
> but that's not entirely true, CONFIG VERSION is set in the default configuration file:
>
> # head -n 2 nvmupdate.cfg
> CURRENT FAMILY: 1.0.0
> CONFIG VERSION: 1.14.0
>
> so why isn't this understood?
> Manually editing nvmupdate.cfg and setting CONFIG VERSION: 1.11.0 seems to make this particular problem go away.
>
> b) Re-doing this with downgraded config version exposes another problem:
>
> Config file read.
> Error: Can't open NVM map file [Immediate_offset_2.txt]
>
> and indeed, there is no Immediate_offset_2.txt in NVMUpdatePackage_WFT_WFQ&WF0_v4.00/X722_NVMUpdate_Linux_x64/
> There is one, however, in
> NVMUpdatePackage_WFT_WFQ&WF0_v4.00/X722_NVMUpdate_EFIx64/ subdir.
> Copying it over to the _Linux_x64 resolves this particular problem
>
> c) Re-doing this with Immediate_offset_2.txt in place exposes a third problem:
>
> Error: Can't open NVM image file
> [LBG_B2_Wolf_Pass_WFT_X557_P01_PHY_Auto_Detect_P23_NCSI_v3.31_800016DB.bin]
>
> and once again - same story. It exists in NVMUpdatePackage_WFT_WFQ&WF0_v4.00/X722_NVMUpdate_EFIx64/ but not NVMUpdatePackage_WFT_WFQ&WF0_v4.00/X722_NVMUpdate_Linux_x64/ - had to copy it over.
>
>
> Once I managed to get all these out of the way, the tool finally ran:
>
> Num Description                              Ver.  DevId S:B    Status
> === ======================================== ===== ===== ====== ===============
> 01) Intel(R) Ethernet Server Adapter I350-T4 1.99  1521  00:024 Update not available
> 02) Intel(R) Ethernet Connection X722 for    3.49  37D2  00:061 Update available
>     10GBASE-T
> 03) Intel(R) Ethernet Server Adapter I350-T4 1.99  1521  00:175 Update not available
>
>
> The initial starting point was:
>
> 0) firmware-version: 3.31 0x80000d31 1.1767.0
>
> After first update+reboot, this was bumped to:
>
> 1) firmware-version: 3.1d 0x800016db 1.1767.0 (but ethtool -m ethX still doesn't work)
>
> So I ran the tool the second time, it said 'Update available' again, but this time:
>
> Num Description                              Ver.  DevId S:B    Status
> === ======================================== ===== ===== ====== ===============
> 01) Intel(R) Ethernet Server Adapter I350-T4 1.99  1521  00:024 Update not available
> 02) Intel(R) Ethernet Connection X722 for    3.29  37D2  00:061 Update available
>     10GBASE-T
> 03) Intel(R) Ethernet Server Adapter I350-T4 1.99  1521  00:175 Update not available
>
> Options: Adapter Index List (comma-separated), [A]ll, e[X]it
> Enter selection:02
> Would you like to back up the NVM images? [Y]es/[N]o: Y
> Update in progress. This operation may take several minutes.
> [*******+..]
> Tool execution completed with the following status: <---------- why is there no status printed?
> Press any key to exit.
>
>
> Checking output log:
>
> # cat out3.log
> Intel(R) Ethernet NVM Update Tool
> NVMUpdate version 1.30.2.11
> Copyright (C) 2013 - 2017 Intel Corporation.
>
> ./nvmupdate64e -c nvmupdate.cfg -l out3.log
>
> Config file read.
> Inventory
> [00:061:00:00]: Intel(R) Ethernet Connection X722 for 10GBASE-T
> Flash inventory started
> Shadow RAM inventory started
> Alternate MAC address is not set
> Shadow RAM inventory finished
> Flash inventory finished
> OROM inventory started
> OROM inventory finished
> PHY NVM inventory started
> PHY NVM inventory finished
> [00:061:00:01]: Intel(R) Ethernet Connection X722 for 10GBASE-T
> Device already inventoried.
> [00:061:00:02]: Intel(R) Ethernet Connection X722 for 10GbE SFP+
> Device already inventoried.
> PHY NVM inventory started
> PHY NVM inventory finished
> [00:061:00:03]: Intel(R) Ethernet Connection X722 for 10GbE SFP+
> Device already inventoried.
> Update
> [00:061:00:00]: Intel(R) Ethernet Connection X722 for 10GBASE-T
> Creating backup images in directory: A4BF0164884A
> Backup images created.
> Flash update started
> NVM image verification started
> Shadow RAM image verification started
>
> Image differences found at offset 0x3AE [Device=0xF, Buffer=0x0] -
> update required.
> Error: Flash update failed
> [00:061:00:02]: Intel(R) Ethernet Connection X722 for 10GbE SFP+
> #
>
> However, ethtool -i suggests that firmware was updated to:
>
> 2) firmware-version: 4.00 0x80001577 1.1580.0 <------- so it did
> _something_ after all?
>
> At this point, every subsequent attempt to run the NVM updater yields
> the same results: an update is available, but attempting to apply it
> fails with the same message in log.
>
> And my initial issue still persists - ethtool -m <iface> still returns
> "invalid argument" with "Module EEPROM memory read not supported. Please
> update the NVM image" logged in dmesg.
>
> How can I resolve this?
>
> Cheers,
> Jakub.
>
>> Todd Fujinaka
>> Software Application Engineer
>> Datacenter Engineering Group
>> Intel Corporation
>> todd.fujinaka@intel.com
>>
>>
>> -----Original Message-----
>> From: Jakub Jankowski [mailto:shasta@toxcorp.com]
>> Sent: Tuesday, August 27, 2019 4:03 AM
>> To: e1000-devel@lists.sourceforge.net
>> Cc: netdev@vger.kernel.org; shasta@toxcorp.com; mhemsley@open-systems.com
>> Subject: [E1000-devel] SFP+ EEPROM readouts fail on X722 (ethtool -m: Invalid argument)
>>
>> Hi,
>>
>> We can't get SFP+ EEPROM readouts for X722 to work at all:
>>
>> # ethtool -m eth10
>> Cannot get module EEPROM information: Invalid argument
>> # dmesg | tail -n 1
>> [ 445.971974] i40e 0000:3d:00.3 eth10: Module EEPROM memory read not supported. Please update the NVM image.
>> # lspci | grep 3d:00.3
>> 3d:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
>>
>>
>> We're running 4.19.65 kernel at the moment, testing using the newest out-of-tree Intel module
>>
>> # modinfo -F version i40e
>> 2.9.21
>>
>> We also tried:
>> - 4.19.65 with in-tree i40e (2.3.2-k)
>> - stock Arch Linux (kernel 5.2.5, driver 2.8.20-k) and the results are the same, as shown above.
>>
>> # ethtool -i eth10
>> driver: i40e
>> version: 2.9.21
>> firmware-version: 3.31 0x80000d31 1.1767.0
>> expansion-rom-version:
>> bus-info: 0000:3d:00.3
>> supports-statistics: yes
>> supports-test: yes
>> supports-eeprom-access: yes
>> supports-register-dump: yes
>> supports-priv-flags: yes
>> # dmidecode -s baseboard-manufacturer
>> Intel Corporation
>> # dmidecode -s baseboard-product-name
>> S2600WFT
>> # dmidecode -s baseboard-version
>> H48104-853
>>
>> # lspci -vvv
>> (...)
>> 3d:00.3 Ethernet controller: Intel Corporation Ethernet Connection X722 for 10GbE SFP+ (rev 09)
>> DeviceName: Intel PCH Integrated 10 Gigabit Ethernet Controller
>> Subsystem: Intel Corporation Ethernet Connection X722 for 10GbE SFP+
>> Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx+
>> Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
>> Latency: 0, Cache Line Size: 32 bytes
>> Interrupt: pin A routed to IRQ 112
>> NUMA node: 0
>> Region 0: Memory at ab000000 (64-bit, prefetchable) [size=16M]
>> Region 3: Memory at b0000000 (64-bit, prefetchable) [size=32K]
>> Expansion ROM at <ignored> [disabled]
>> Capabilities: [40] Power Management version 3
>> Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
>> Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
>> Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
>> Address: 0000000000000000 Data: 0000
>> Masking: 00000000 Pending: 00000000
>> Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
>> Vector table: BAR=3 offset=00000000
>> PBA: BAR=3 offset=00001000
>> Capabilities: [a0] Express (v2) Endpoint, MSI 00
>> DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
>> ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
>> DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
>> RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
>> MaxPayload 256 bytes, MaxReadReq 512 bytes
>> DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr+ TransPend-
>> LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
>> ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
>> LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
>> ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
>> LnkSta: Speed 2.5GT/s (ok), Width x1 (ok)
>> TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
>> DevCap2: Completion Timeout: Range AB, TimeoutDis+, LTR-, OBFF Not Supported
>> AtomicOpsCap: 32bit- 64bit- 128bitCAS-
>> DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
>> AtomicOpsCtl: ReqEn-
>> LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete-, EqualizationPhase1-
>> EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
>> Capabilities: [e0] Vital Product Data
>> Product Name: Example VPD
>> Read-only fields:
>> [V0] Vendor specific:
>> [RV] Reserved: checksum good, 0 byte(s) reserved
>> End
>> Capabilities: [100 v2] Advanced Error Reporting
>> UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
>> UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq+ ACSViol-
>> UESvrt: DLP+ SDES- TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
>> CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>> CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
>> AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
>> MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
>> HeaderLog: 00000000 00000000 00000000 00000000
>> Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
>> ARICap: MFVC- ACS-, Next Function: 0
>> ARICtl: MFVC- ACS-, Function Group: 0
>> Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
>> IOVCap: Migration-, Interrupt Message Number: 000
>> IOVCtl: Enable- Migration- Interrupt- MSE- ARIHierarchy-
>> IOVSta: Migration-
>> Initial VFs: 32, Total VFs: 32, Number of VFs: 0, Function Dependency Link: 03
>> VF offset: 109, stride: 1, Device ID: 37cd
>> Supported Page Size: 00000553, System Page Size: 00000001
>> Region 0: Memory at 00000000af000000 (64-bit, prefetchable)
>> Region 3: Memory at 00000000b0020000 (64-bit, prefetchable)
>> VF Migration: offset: 00000000, BIR: 0
>> Capabilities: [1a0 v1] Transaction Processing Hints
>> Device specific mode supported
>> No steering table available
>> Capabilities: [1b0 v1] Access Control Services
>> ACSCap: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>> ACSCtl: SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
>> Kernel driver in use: i40e
>> Kernel modules: i40e
>>
>>
>> Same kernel+i40e, same SFP+ module - but on Intel X710, works like a treat:
>>
>> # lspci | grep X7
>> 81:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
>> 81:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 01)
>> # ethtool -m eth8
>> Identifier : 0x03 (SFP)
>> Extended identifier : 0x04 (GBIC/SFP defined by 2-wire interface ID)
>> Connector : 0x07 (LC)
>> Transceiver codes : 0x10 0x00 0x00 0x01 0x00 0x00 0x00 0x00 0x00
>> Transceiver type : 10G Ethernet: 10G Base-SR
>> Transceiver type : Ethernet: 1000BASE-SX
>> Encoding : 0x06 (64B/66B)
>> BR, Nominal : 10300MBd
>> (...)
>> # ethtool -i eth8
>> driver: i40e
>> version: 2.9.21
>> firmware-version: 6.01 0x800035cf 1.1876.0
>> expansion-rom-version:
>> bus-info: 0000:81:00.0
>> supports-statistics: yes
>> supports-test: yes
>> supports-eeprom-access: yes
>> supports-register-dump: yes
>> supports-priv-flags: yes
>> #
>>
>>
>> Is this a known problem?
>>
>>
>> Best regards,
>> Jakub
>>
>>
>>
>> _______________________________________________
>> E1000-devel mailing list
>> E1000-devel@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/e1000-devel
>> To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
--
Jakub Jankowski|shasta@toxcorp.com|http://toxcorp.com/
GPG: FCBF F03D 9ADB B768 8B92 BB52 0341 9037 A875 942D
^ permalink raw reply
* [RFCv2 bpf-next 00/12] Programming socket lookup with BPF
From: Jakub Sitnicki @ 2019-08-28 7:22 UTC (permalink / raw)
To: bpf, netdev; +Cc: kernel-team, Lorenz Bauer, Marek Majkowski
This patch set adds a mechanism for programming mappings between the local
addresses and listening/receiving sockets with BPF.
It introduces a new per-netns BPF program type, called inet_lookup, which
runs during the socket lookup. The program is allowed to select a
listening/receiving socket from a SOCKARRAY map that the packet will be
delivered to.
BPF inet_lookup is intended as an alternative to:
* SO_BINDTOPREFIX [1] - a mechanism that provides a way to listen/receive
on all local addresses that belong to a network prefix. An alternative to
binding to INADDR_ANY that allows applications bound to disjoint network
prefixes to share a port. Not generic. Never got upstreamed.
* TPROXY [2] - a powerful mechanism that allows steering packets destined
to non-local addresses to a local socket. It also works for local
addresses, which is a less restrictive case. Can be used to implement
what SO_BINDTOPREFIX does, and more - in particular, all ports can be
redirected to a single socket. Socket dispatch happens early in ingress
path (PREROUTING hook). Versatile but comes with complexities.
Compared to the above, inet_lookup aims to be a programmatic way to map
(address, port) pairs to a socket. It runs after a routing decision for
local delivery was made, and hence is limited to local addresses only.
Being part of the socket lookup has the desired effect that redirection is
visible to XDP programs which call bpf_sk_lookup helpers.
When it comes to use cases, we have presented them in the RFCv1 [3] cover
letter and also at the last Netconf [4]. To recap, they are:
1) sharing a port between two services
Services are accepting connections on different (disjoint) IP ranges but
same port. Requests going to 192.0.2.0/24 tcp/80 are handled by NGINX,
while 198.51.100.0/24 tcp/80 IP range is handled by Apache server.
Applications are running as different users, in a flat single-netns
setup.
2) receiving traffic on all ports
We have a proxy server that accepts connections to _any_ port [5].
A simple demo program that implements (1) could look like
#define NET1 (IP4(192, 0, 2, 0) >> 8)
#define NET2 (IP4(198, 51, 100, 0) >> 8)
#define MAX_SERVERS 2
struct {
	__uint(type, BPF_MAP_TYPE_REUSEPORT_SOCKARRAY);
	__uint(max_entries, MAX_SERVERS);
	__type(key, __u32);
	__type(value, __u64);
} redir_map SEC(".maps");
SEC("inet_lookup/demo_two_servers")
int demo_two_http_servers(struct bpf_inet_lookup *ctx)
{
	__u32 index = 0;
	__u64 flags = 0;
	if (ctx->family != AF_INET)
		return BPF_OK;
	if (ctx->protocol != IPPROTO_TCP)
		return BPF_OK;
	if (ctx->local_port != 80)
		return BPF_OK;
	switch (bpf_ntohl(ctx->local_ip4) >> 8) {
	case NET1:
		index = 0;
		break;
	case NET2:
		index = 1;
		break;
	default:
		return BPF_OK;
	}
	return bpf_redirect_lookup(ctx, &redir_map, &index, flags);
}
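The /24 match in the demo can be checked in isolation from userspace. The sketch below is not part of the patch set: it assumes IP4() builds a host-byte-order address from four octets (the macro's definition is not shown above), and pick_server_index() is a hypothetical helper that mirrors only the switch statement.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: IP4() packs four octets into a host-byte-order IPv4 address. */
#define IP4(a, b, c, d) \
	(((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | \
	 ((uint32_t)(c) << 8) | (uint32_t)(d))

/* Dropping the low 8 bits leaves the 24-bit network prefix. */
#define NET1 (IP4(192, 0, 2, 0) >> 8)
#define NET2 (IP4(198, 51, 100, 0) >> 8)

/*
 * Hypothetical helper: map a host-order destination address to the
 * redir_map index used in the demo, or -1 when neither /24 matches
 * (the case where the demo returns BPF_OK without redirecting).
 */
static int pick_server_index(uint32_t local_ip4_host_order)
{
	switch (local_ip4_host_order >> 8) {
	case NET1:
		return 0;
	case NET2:
		return 1;
	default:
		return -1;
	}
}
```

With these definitions, any address in 192.0.2.0/24 selects index 0, any address in 198.51.100.0/24 selects index 1, and everything else falls through to the regular lookup.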
Since RFCv1, we've changed the approach from rewriting the lookup key to
map-based redirection. This has been suggested at Netconf, and is a
recurring pattern in existing BPF program types.
We're posting the 2nd version of RFC patch set to collect further feedback
and set context for the presentation and discussions at the upcoming
Network Summit at LPC '19 [6].
Patches are also available on GitHub [7].
Thanks,
Jakub
[1] https://www.spinics.net/lists/netdev/msg370789.html
[2] https://www.kernel.org/doc/Documentation/networking/tproxy.txt
[3] https://lore.kernel.org/netdev/20190618130050.8344-1-jakub@cloudflare.com/
[4] http://vger.kernel.org/netconf2019_files/Programmable%20socket%20lookup.pdf
[5] https://blog.cloudflare.com/how-we-built-spectrum/
[6] https://linuxplumbersconf.org/event/4/contributions/487/
[7] https://github.com/jsitnicki/linux/commits/bpf-inet-lookup
Changes RFCv1 -> RFCv2:
- Make socket lookup redirection map-based. BPF program now uses a
dedicated helper and a SOCKARRAY map to select the socket to redirect to.
A consequence of this change is that bpf_inet_lookup context is now
read-only.
- Look for connected UDP sockets before allowing redirection from BPF.
This makes connected UDP socket work as expected in the presence of
inet_lookup prog.
- Share the code for BPF_PROG_{ATTACH,DETACH,QUERY} with flow_dissector,
the only other per-netns BPF prog type.
Jakub Sitnicki (12):
flow_dissector: Extract attach/detach/query helpers
bpf: Introduce inet_lookup program type for redirecting socket lookup
bpf: Add verifier tests for inet_lookup context access
inet: Store layer 4 protocol in inet_hashinfo
udp: Store layer 4 protocol in udp_table
inet: Run inet_lookup bpf program on socket lookup
inet6: Run inet_lookup bpf program on socket lookup
udp: Run inet_lookup bpf program on socket lookup
udp6: Run inet_lookup bpf program on socket lookup
bpf: Sync linux/bpf.h to tools/
libbpf: Add support for inet_lookup program type
bpf: Test redirecting listening/receiving socket lookup
include/linux/bpf.h | 8 +
include/linux/bpf_types.h | 1 +
include/linux/filter.h | 18 +
include/net/inet6_hashtables.h | 19 +
include/net/inet_hashtables.h | 36 +
include/net/net_namespace.h | 2 +
include/net/udp.h | 10 +-
include/uapi/linux/bpf.h | 58 +-
kernel/bpf/syscall.c | 10 +
kernel/bpf/verifier.c | 7 +-
net/core/filter.c | 304 ++++++++
net/core/flow_dissector.c | 65 +-
net/dccp/proto.c | 2 +-
net/ipv4/inet_hashtables.c | 5 +
net/ipv4/tcp_ipv4.c | 2 +-
net/ipv4/udp.c | 59 +-
net/ipv4/udp_impl.h | 2 +-
net/ipv4/udplite.c | 4 +-
net/ipv6/inet6_hashtables.c | 5 +
net/ipv6/udp.c | 54 +-
net/ipv6/udp_impl.h | 2 +-
net/ipv6/udplite.c | 2 +-
tools/include/uapi/linux/bpf.h | 58 +-
tools/lib/bpf/libbpf.c | 4 +
tools/lib/bpf/libbpf.h | 2 +
tools/lib/bpf/libbpf.map | 2 +
tools/lib/bpf/libbpf_probes.c | 1 +
tools/testing/selftests/bpf/.gitignore | 1 +
tools/testing/selftests/bpf/Makefile | 5 +-
tools/testing/selftests/bpf/bpf_helpers.h | 3 +
.../selftests/bpf/progs/inet_lookup_progs.c | 78 ++
.../testing/selftests/bpf/test_inet_lookup.c | 522 +++++++++++++
.../testing/selftests/bpf/test_inet_lookup.sh | 35 +
.../selftests/bpf/verifier/ctx_inet_lookup.c | 696 ++++++++++++++++++
34 files changed, 1974 insertions(+), 108 deletions(-)
create mode 100644 tools/testing/selftests/bpf/progs/inet_lookup_progs.c
create mode 100644 tools/testing/selftests/bpf/test_inet_lookup.c
create mode 100755 tools/testing/selftests/bpf/test_inet_lookup.sh
create mode 100644 tools/testing/selftests/bpf/verifier/ctx_inet_lookup.c
--
2.20.1
^ permalink raw reply
* [RFCv2 bpf-next 01/12] flow_dissector: Extract attach/detach/query helpers
From: Jakub Sitnicki @ 2019-08-28 7:22 UTC (permalink / raw)
To: bpf, netdev; +Cc: kernel-team, Lorenz Bauer, Marek Majkowski
In-Reply-To: <20190828072250.29828-1-jakub@cloudflare.com>
Move generic parts of callbacks for querying, attaching, and detaching a
single BPF program for reuse by other BPF program types.
Subsequent patch makes use of the extracted routines.
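The single-slot attach/detach semantics being extracted can be sketched in userspace as follows. This is an illustrative model only, not the kernel code: it uses a pthread mutex in place of the kernel mutex, a plain pointer in place of the RCU-protected one, and leaves out reference counting. The names struct prog, attach_one() and detach_one() are made up for the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

/* Stand-in for struct bpf_prog; only identity matters here. */
struct prog {
	int id;
};

/* Attach p to an empty slot. Flags are unsupported, as in the patch. */
static int attach_one(struct prog **slot, pthread_mutex_t *lock,
		      struct prog *p, unsigned int flags)
{
	int ret = 0;

	if (flags)
		return -EINVAL;

	pthread_mutex_lock(lock);
	if (*slot)
		ret = -EEXIST;	/* only one program can be attached at a time */
	else
		*slot = p;
	pthread_mutex_unlock(lock);

	return ret;
}

/* Clear the slot, or report -ENOENT when nothing is attached. */
static int detach_one(struct prog **slot, pthread_mutex_t *lock)
{
	int ret = 0;

	pthread_mutex_lock(lock);
	if (!*slot)
		ret = -ENOENT;
	else
		*slot = NULL;
	pthread_mutex_unlock(lock);

	return ret;
}
```

Passing the slot and the lock as parameters is what lets one pair of routines serve several per-netns program pointers, each with its own mutex.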
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
include/linux/bpf.h | 8 +++++
net/core/filter.c | 73 +++++++++++++++++++++++++++++++++++++++
net/core/flow_dissector.c | 65 ++++++----------------------------
3 files changed, 92 insertions(+), 54 deletions(-)
diff --git a/include/linux/bpf.h b/include/linux/bpf.h
index 5b9d22338606..b301e0c03a8c 100644
--- a/include/linux/bpf.h
+++ b/include/linux/bpf.h
@@ -23,6 +23,7 @@ struct sock;
struct seq_file;
struct btf;
struct btf_type;
+struct mutex;
extern struct idr btf_idr;
extern spinlock_t btf_idr_lock;
@@ -1145,4 +1146,11 @@ static inline u32 bpf_xdp_sock_convert_ctx_access(enum bpf_access_type type,
}
#endif /* CONFIG_INET */
+int bpf_prog_query_one(struct bpf_prog __rcu **pprog,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+int bpf_prog_attach_one(struct bpf_prog __rcu **pprog, struct mutex *lock,
+ struct bpf_prog *prog, u32 flags);
+int bpf_prog_detach_one(struct bpf_prog __rcu **pprog, struct mutex *lock);
+
#endif /* _LINUX_BPF_H */
diff --git a/net/core/filter.c b/net/core/filter.c
index 0c1059cdad3d..a498fbaa2d50 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -8668,6 +8668,79 @@ int sk_get_filter(struct sock *sk, struct sock_filter __user *ubuf,
return ret;
}
+int bpf_prog_query_one(struct bpf_prog __rcu **pprog,
+ const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ __u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids);
+ u32 prog_id, prog_cnt = 0, flags = 0;
+ struct bpf_prog *attached;
+
+ if (attr->query.query_flags)
+ return -EINVAL;
+
+ rcu_read_lock();
+ attached = rcu_dereference(*pprog);
+ if (attached) {
+ prog_cnt = 1;
+ prog_id = attached->aux->id;
+ }
+ rcu_read_unlock();
+
+ if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
+ return -EFAULT;
+ if (copy_to_user(&uattr->query.prog_cnt, &prog_cnt, sizeof(prog_cnt)))
+ return -EFAULT;
+
+ if (!attr->query.prog_cnt || !prog_ids || !prog_cnt)
+ return 0;
+
+ if (copy_to_user(prog_ids, &prog_id, sizeof(u32)))
+ return -EFAULT;
+
+ return 0;
+}
+
+int bpf_prog_attach_one(struct bpf_prog __rcu **pprog, struct mutex *lock,
+ struct bpf_prog *prog, u32 flags)
+{
+ struct bpf_prog *attached;
+
+ if (flags)
+ return -EINVAL;
+
+ mutex_lock(lock);
+ attached = rcu_dereference_protected(*pprog,
+ lockdep_is_held(lock));
+ if (attached) {
+ /* Only one BPF program can be attached at a time */
+ mutex_unlock(lock);
+ return -EEXIST;
+ }
+ rcu_assign_pointer(*pprog, prog);
+ mutex_unlock(lock);
+
+ return 0;
+}
+
+int bpf_prog_detach_one(struct bpf_prog __rcu **pprog, struct mutex *lock)
+{
+ struct bpf_prog *attached;
+
+ mutex_lock(lock);
+ attached = rcu_dereference_protected(*pprog,
+ lockdep_is_held(lock));
+ if (!attached) {
+ mutex_unlock(lock);
+ return -ENOENT;
+ }
+ RCU_INIT_POINTER(*pprog, NULL);
+ bpf_prog_put(attached);
+ mutex_unlock(lock);
+
+ return 0;
+}
+
#ifdef CONFIG_INET
struct sk_reuseport_kern {
struct sk_buff *skb;
diff --git a/net/core/flow_dissector.c b/net/core/flow_dissector.c
index 9741b593ea53..c51602158906 100644
--- a/net/core/flow_dissector.c
+++ b/net/core/flow_dissector.c
@@ -73,80 +73,37 @@ EXPORT_SYMBOL(skb_flow_dissector_init);
int skb_flow_dissector_prog_query(const union bpf_attr *attr,
union bpf_attr __user *uattr)
{
- __u32 __user *prog_ids = u64_to_user_ptr(attr->query.prog_ids);
- u32 prog_id, prog_cnt = 0, flags = 0;
- struct bpf_prog *attached;
struct net *net;
-
- if (attr->query.query_flags)
- return -EINVAL;
+ int ret;
net = get_net_ns_by_fd(attr->query.target_fd);
if (IS_ERR(net))
return PTR_ERR(net);
- rcu_read_lock();
- attached = rcu_dereference(net->flow_dissector_prog);
- if (attached) {
- prog_cnt = 1;
- prog_id = attached->aux->id;
- }
- rcu_read_unlock();
+ ret = bpf_prog_query_one(&net->flow_dissector_prog, attr, uattr);
put_net(net);
-
- if (copy_to_user(&uattr->query.attach_flags, &flags, sizeof(flags)))
- return -EFAULT;
- if (copy_to_user(&uattr->query.prog_cnt, &prog_cnt, sizeof(prog_cnt)))
- return -EFAULT;
-
- if (!attr->query.prog_cnt || !prog_ids || !prog_cnt)
- return 0;
-
- if (copy_to_user(prog_ids, &prog_id, sizeof(u32)))
- return -EFAULT;
-
- return 0;
+ return ret;
}
int skb_flow_dissector_bpf_prog_attach(const union bpf_attr *attr,
struct bpf_prog *prog)
{
- struct bpf_prog *attached;
- struct net *net;
+ struct net *net = current->nsproxy->net_ns;
- net = current->nsproxy->net_ns;
- mutex_lock(&flow_dissector_mutex);
- attached = rcu_dereference_protected(net->flow_dissector_prog,
- lockdep_is_held(&flow_dissector_mutex));
- if (attached) {
- /* Only one BPF program can be attached at a time */
- mutex_unlock(&flow_dissector_mutex);
- return -EEXIST;
- }
- rcu_assign_pointer(net->flow_dissector_prog, prog);
- mutex_unlock(&flow_dissector_mutex);
- return 0;
+ return bpf_prog_attach_one(&net->flow_dissector_prog,
+ &flow_dissector_mutex, prog,
+ attr->attach_flags);
}
int skb_flow_dissector_bpf_prog_detach(const union bpf_attr *attr)
{
- struct bpf_prog *attached;
- struct net *net;
+ struct net *net = current->nsproxy->net_ns;
- net = current->nsproxy->net_ns;
- mutex_lock(&flow_dissector_mutex);
- attached = rcu_dereference_protected(net->flow_dissector_prog,
- lockdep_is_held(&flow_dissector_mutex));
- if (!attached) {
- mutex_unlock(&flow_dissector_mutex);
- return -ENOENT;
- }
- bpf_prog_put(attached);
- RCU_INIT_POINTER(net->flow_dissector_prog, NULL);
- mutex_unlock(&flow_dissector_mutex);
- return 0;
+ return bpf_prog_detach_one(&net->flow_dissector_prog,
+ &flow_dissector_mutex);
}
+
/**
* skb_flow_get_be16 - extract be16 entity
* @skb: sk_buff to extract from
--
2.20.1
^ permalink raw reply related
* [RFCv2 bpf-next 02/12] bpf: Introduce inet_lookup program type for redirecting socket lookup
From: Jakub Sitnicki @ 2019-08-28 7:22 UTC (permalink / raw)
To: bpf, netdev; +Cc: kernel-team, Lorenz Bauer, Marek Majkowski
In-Reply-To: <20190828072250.29828-1-jakub@cloudflare.com>
Add a new program type for redirecting the listening/bound socket lookup
from BPF. The program attaches to a network namespace. It is allowed to
select a socket from a SOCKARRAY, which will be used as a result of socket
lookup.
This provides a mechanism for programming the mapping between
local (address, port) pairs and listening/receiving sockets.
The program receives the 4-tuple, as well as the IP version and L4
protocol, of the packet that triggered the lookup as its context for making
a decision.
The netns-attached program is not called anywhere yet. The following patches
hook it up to the IPv4 and IPv6 stacks.
Suggested-by: Marek Majkowski <marek@cloudflare.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
include/linux/bpf_types.h | 1 +
include/linux/filter.h | 18 +++
include/net/net_namespace.h | 2 +
include/uapi/linux/bpf.h | 58 ++++++++-
kernel/bpf/syscall.c | 10 ++
kernel/bpf/verifier.c | 7 +-
net/core/filter.c | 231 ++++++++++++++++++++++++++++++++++++
7 files changed, 325 insertions(+), 2 deletions(-)
diff --git a/include/linux/bpf_types.h b/include/linux/bpf_types.h
index 36a9c2325176..cc5c4ece748a 100644
--- a/include/linux/bpf_types.h
+++ b/include/linux/bpf_types.h
@@ -37,6 +37,7 @@ BPF_PROG_TYPE(BPF_PROG_TYPE_LIRC_MODE2, lirc_mode2)
#endif
#ifdef CONFIG_INET
BPF_PROG_TYPE(BPF_PROG_TYPE_SK_REUSEPORT, sk_reuseport)
+BPF_PROG_TYPE(BPF_PROG_TYPE_INET_LOOKUP, inet_lookup)
#endif
BPF_MAP_TYPE(BPF_MAP_TYPE_ARRAY, array_map_ops)
diff --git a/include/linux/filter.h b/include/linux/filter.h
index 92c6e31fb008..5b1b3b754c28 100644
--- a/include/linux/filter.h
+++ b/include/linux/filter.h
@@ -1229,4 +1229,22 @@ struct bpf_sockopt_kern {
s32 retval;
};
+struct bpf_inet_lookup_kern {
+ unsigned short family;
+ u8 protocol;
+ __be32 saddr;
+ struct in6_addr saddr6;
+ __be16 sport;
+ __be32 daddr;
+ struct in6_addr daddr6;
+ unsigned short hnum;
+ struct sock *redir_sk;
+};
+
+int inet_lookup_bpf_prog_attach(const union bpf_attr *attr,
+ struct bpf_prog *prog);
+int inet_lookup_bpf_prog_detach(const union bpf_attr *attr);
+int inet_lookup_bpf_prog_query(const union bpf_attr *attr,
+ union bpf_attr __user *uattr);
+
#endif /* __LINUX_FILTER_H__ */
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 4a9da951a794..bd01147cc064 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -171,6 +171,8 @@ struct net {
#ifdef CONFIG_XDP_SOCKETS
struct netns_xdp xdp;
#endif
+ struct bpf_prog __rcu *inet_lookup_prog;
+
struct sock *diag_nlsk;
atomic_t fnhe_genid;
} __randomize_layout;
diff --git a/include/uapi/linux/bpf.h b/include/uapi/linux/bpf.h
index b5889257cc33..639abfa96779 100644
--- a/include/uapi/linux/bpf.h
+++ b/include/uapi/linux/bpf.h
@@ -173,6 +173,7 @@ enum bpf_prog_type {
BPF_PROG_TYPE_CGROUP_SYSCTL,
BPF_PROG_TYPE_RAW_TRACEPOINT_WRITABLE,
BPF_PROG_TYPE_CGROUP_SOCKOPT,
+ BPF_PROG_TYPE_INET_LOOKUP,
};
enum bpf_attach_type {
@@ -199,6 +200,7 @@ enum bpf_attach_type {
BPF_CGROUP_UDP6_RECVMSG,
BPF_CGROUP_GETSOCKOPT,
BPF_CGROUP_SETSOCKOPT,
+ BPF_INET_LOOKUP,
__MAX_BPF_ATTACH_TYPE
};
@@ -2747,6 +2749,33 @@ union bpf_attr {
* **-EOPNOTSUPP** kernel configuration does not enable SYN cookies
*
* **-EPROTONOSUPPORT** IP packet version is not 4 or 6
+ *
+ * int bpf_redirect_lookup(struct bpf_inet_lookup *ctx, struct bpf_map *map, void *key, u64 flags)
+ * Description
+ * Select a socket referenced by *map* (of type
+ * **BPF_MAP_TYPE_REUSEPORT_SOCKARRAY**) at index *key* to use as a
+ * result of listening (TCP) or bound (UDP) socket lookup.
+ *
+ * The IP family and L4 protocol in *ctx* object, populated from
+ * the packet that triggered the lookup, must match the selected
+ * socket's family and protocol. IPV6_V6ONLY socket option is
+ * honored.
+ *
+ * To be used by **BPF_INET_LOOKUP** programs attached to the
+ * network namespace. Program needs to return **BPF_REDIRECT**, the
+ * helper's success return value, for the selected socket to be
+ * actually used.
+ *
+ * Return
+ * **BPF_REDIRECT** on success, if the socket at index *key* was selected.
+ *
+ * **-EINVAL** if *flags* are invalid (not zero).
+ *
+ * **-ENOENT** if there is no socket at index *key*.
+ *
+ * **-EPROTOTYPE** if *ctx->protocol* does not match the socket protocol.
+ *
+ * **-EAFNOSUPPORT** if socket does not accept IP version in *ctx->family*.
*/
#define __BPF_FUNC_MAPPER(FN) \
FN(unspec), \
@@ -2859,7 +2888,8 @@ union bpf_attr {
FN(sk_storage_get), \
FN(sk_storage_delete), \
FN(send_signal), \
- FN(tcp_gen_syncookie),
+ FN(tcp_gen_syncookie), \
+ FN(redirect_lookup),
/* integer value in 'imm' field of BPF_CALL instruction selects which helper
* function eBPF program intends to call
@@ -3116,6 +3146,32 @@ struct bpf_tcp_sock {
__u32 icsk_retransmits; /* Number of unrecovered [RTO] timeouts */
};
+/* User accessible data for inet_lookup programs.
+ * New fields must be added at the end.
+ */
+struct bpf_inet_lookup {
+ __u32 family; /* AF_INET, AF_INET6 */
+ __u32 protocol; /* IPPROTO_TCP, IPPROTO_UDP */
+ __u32 remote_ip4; /* Allows 1,2,4-byte read but no write.
+ * Stored in network byte order.
+ */
+ __u32 local_ip4; /* Allows 1,2,4-byte read and 4-byte write.
+ * Stored in network byte order.
+ */
+ __u32 remote_ip6[4]; /* Allows 1,2,4-byte read but no write.
+ * Stored in network byte order.
+ */
+ __u32 local_ip6[4]; /* Allows 1,2,4-byte read and 4-byte write.
+ * Stored in network byte order.
+ */
+ __u32 remote_port; /* Allows 4-byte read but no write.
+ * Stored in network byte order.
+ */
+ __u32 local_port; /* Allows 4-byte read and write.
+ * Stored in host byte order.
+ */
+};
+
struct bpf_sock_tuple {
union {
struct {
diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c0f62fd67c6b..763f2352ff7f 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -1935,6 +1935,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_CGROUP_SETSOCKOPT:
ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
break;
+ case BPF_INET_LOOKUP:
+ ptype = BPF_PROG_TYPE_INET_LOOKUP;
+ break;
default:
return -EINVAL;
}
@@ -1959,6 +1962,9 @@ static int bpf_prog_attach(const union bpf_attr *attr)
case BPF_PROG_TYPE_FLOW_DISSECTOR:
ret = skb_flow_dissector_bpf_prog_attach(attr, prog);
break;
+ case BPF_PROG_TYPE_INET_LOOKUP:
+ ret = inet_lookup_bpf_prog_attach(attr, prog);
+ break;
default:
ret = cgroup_bpf_prog_attach(attr, ptype, prog);
}
@@ -2022,6 +2028,8 @@ static int bpf_prog_detach(const union bpf_attr *attr)
case BPF_CGROUP_SETSOCKOPT:
ptype = BPF_PROG_TYPE_CGROUP_SOCKOPT;
break;
+ case BPF_INET_LOOKUP:
+ return inet_lookup_bpf_prog_detach(attr);
default:
return -EINVAL;
}
@@ -2065,6 +2073,8 @@ static int bpf_prog_query(const union bpf_attr *attr,
return lirc_prog_query(attr, uattr);
case BPF_FLOW_DISSECTOR:
return skb_flow_dissector_prog_query(attr, uattr);
+ case BPF_INET_LOOKUP:
+ return inet_lookup_bpf_prog_query(attr, uattr);
default:
return -EINVAL;
}
diff --git a/kernel/bpf/verifier.c b/kernel/bpf/verifier.c
index 10c0ff93f52b..5717dd10cc4d 100644
--- a/kernel/bpf/verifier.c
+++ b/kernel/bpf/verifier.c
@@ -3494,7 +3494,8 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
goto error;
break;
case BPF_MAP_TYPE_REUSEPORT_SOCKARRAY:
- if (func_id != BPF_FUNC_sk_select_reuseport)
+ if (func_id != BPF_FUNC_sk_select_reuseport &&
+ func_id != BPF_FUNC_redirect_lookup)
goto error;
break;
case BPF_MAP_TYPE_QUEUE:
@@ -3578,6 +3579,10 @@ static int check_map_func_compatibility(struct bpf_verifier_env *env,
if (map->map_type != BPF_MAP_TYPE_SK_STORAGE)
goto error;
break;
+ case BPF_FUNC_redirect_lookup:
+ if (map->map_type != BPF_MAP_TYPE_REUSEPORT_SOCKARRAY)
+ goto error;
+ break;
default:
break;
}
diff --git a/net/core/filter.c b/net/core/filter.c
index a498fbaa2d50..d9375a7e60f5 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -9007,4 +9007,235 @@ const struct bpf_verifier_ops sk_reuseport_verifier_ops = {
const struct bpf_prog_ops sk_reuseport_prog_ops = {
};
+
#endif /* CONFIG_INET */
+
+static DEFINE_MUTEX(inet_lookup_prog_mutex);
+
+BPF_CALL_4(redirect_lookup, struct bpf_inet_lookup_kern *, ctx,
+ struct bpf_map *, map, void *, key, u64, flags)
+{
+ struct sock_reuseport *reuse;
+ struct sock *redir_sk;
+
+ if (unlikely(flags))
+ return -EINVAL;
+
+ /* Lookup socket in the map */
+ redir_sk = map->ops->map_lookup_elem(map, key);
+ if (!redir_sk)
+ return -ENOENT;
+
+ /* Check if socket got unhashed from sockets table, e.g. by
+ * close(), after the above map_lookup_elem(). Treat it as
+ * removed from the map.
+ */
+ reuse = rcu_dereference(redir_sk->sk_reuseport_cb);
+ if (!reuse)
+ return -ENOENT;
+
+ /* Check protocol & family are a match */
+ if (ctx->protocol != redir_sk->sk_protocol)
+ return -EPROTOTYPE;
+ if (ctx->family != redir_sk->sk_family &&
+ (redir_sk->sk_family == AF_INET || ipv6_only_sock(redir_sk)))
+ return -EAFNOSUPPORT;
+
+ /* Store socket in context */
+ ctx->redir_sk = redir_sk;
+
+ /* Signal redirect action */
+ return BPF_REDIRECT;
+}
+
+static const struct bpf_func_proto bpf_redirect_lookup_proto = {
+ .func = redirect_lookup,
+ .gpl_only = false,
+ .ret_type = RET_INTEGER,
+ .arg1_type = ARG_PTR_TO_CTX,
+ .arg2_type = ARG_CONST_MAP_PTR,
+ .arg3_type = ARG_PTR_TO_MAP_KEY,
+ .arg4_type = ARG_ANYTHING,
+};
+
+static const struct bpf_func_proto *
+inet_lookup_func_proto(enum bpf_func_id func_id,
+ const struct bpf_prog *prog)
+{
+ switch (func_id) {
+ case BPF_FUNC_redirect_lookup:
+ return &bpf_redirect_lookup_proto;
+ default:
+ return bpf_base_func_proto(func_id);
+ }
+}
+
+int inet_lookup_bpf_prog_attach(const union bpf_attr *attr,
+ struct bpf_prog *prog)
+{
+ struct net *net = current->nsproxy->net_ns;
+
+ return bpf_prog_attach_one(&net->inet_lookup_prog,
+ &inet_lookup_prog_mutex, prog,
+ attr->attach_flags);
+}
+
+int inet_lookup_bpf_prog_detach(const union bpf_attr *attr)
+{
+ struct net *net = current->nsproxy->net_ns;
+
+ return bpf_prog_detach_one(&net->inet_lookup_prog,
+ &inet_lookup_prog_mutex);
+}
+
+int inet_lookup_bpf_prog_query(const union bpf_attr *attr,
+ union bpf_attr __user *uattr)
+{
+ struct net *net;
+ int ret;
+
+ net = get_net_ns_by_fd(attr->query.target_fd);
+ if (IS_ERR(net))
+ return PTR_ERR(net);
+
+ ret = bpf_prog_query_one(&net->inet_lookup_prog, attr, uattr);
+
+ put_net(net);
+ return ret;
+}
+
+static bool inet_lookup_is_valid_access(int off, int size,
+ enum bpf_access_type type,
+ const struct bpf_prog *prog,
+ struct bpf_insn_access_aux *info)
+{
+ const int size_default = sizeof(__u32);
+
+ if (off < 0 || off >= sizeof(struct bpf_inet_lookup))
+ return false;
+ if (off % size != 0)
+ return false;
+ if (type != BPF_READ)
+ return false;
+
+ switch (off) {
+ case bpf_ctx_range(struct bpf_inet_lookup, remote_ip4):
+ case bpf_ctx_range(struct bpf_inet_lookup, local_ip4):
+ case bpf_ctx_range_till(struct bpf_inet_lookup,
+ remote_ip6[0], remote_ip6[3]):
+ case bpf_ctx_range_till(struct bpf_inet_lookup,
+ local_ip6[0], local_ip6[3]):
+ if (!bpf_ctx_narrow_access_ok(off, size, size_default))
+ return false;
+ bpf_ctx_record_field_size(info, size_default);
+ break;
+
+ case bpf_ctx_range(struct bpf_inet_lookup, family):
+ case bpf_ctx_range(struct bpf_inet_lookup, protocol):
+ case bpf_ctx_range(struct bpf_inet_lookup, remote_port):
+ case bpf_ctx_range(struct bpf_inet_lookup, local_port):
+ if (size != size_default)
+ return false;
+ break;
+
+ default:
+ return false;
+ }
+
+ return true;
+}
+
+#define LOAD_FIELD_SIZE_OFF(TYPE, FIELD, SIZE, OFF) ({ \
+ *insn++ = BPF_LDX_MEM(SIZE, si->dst_reg, si->src_reg, \
+ bpf_target_off(TYPE, FIELD, \
+ FIELD_SIZEOF(TYPE, FIELD), \
+ target_size) + (OFF)); \
+})
+
+#define LOAD_FIELD_SIZE(TYPE, FIELD, SIZE) \
+ LOAD_FIELD_SIZE_OFF(TYPE, FIELD, SIZE, 0)
+
+#define LOAD_FIELD(TYPE, FIELD) \
+ LOAD_FIELD_SIZE(TYPE, FIELD, BPF_FIELD_SIZEOF(TYPE, FIELD))
+
+static u32 inet_lookup_convert_ctx_access(enum bpf_access_type type,
+ const struct bpf_insn *si,
+ struct bpf_insn *insn_buf,
+ struct bpf_prog *prog,
+ u32 *target_size)
+{
+ struct bpf_insn *insn = insn_buf;
+ int off;
+
+ switch (si->off) {
+ case offsetof(struct bpf_inet_lookup, family):
+ LOAD_FIELD(struct bpf_inet_lookup_kern, family);
+ break;
+
+ case offsetof(struct bpf_inet_lookup, protocol):
+ LOAD_FIELD(struct bpf_inet_lookup_kern, protocol);
+ break;
+
+ case offsetof(struct bpf_inet_lookup, remote_ip4):
+ LOAD_FIELD_SIZE(struct bpf_inet_lookup_kern, saddr,
+ BPF_SIZE(si->code));
+ break;
+
+ case offsetof(struct bpf_inet_lookup, local_ip4):
+ LOAD_FIELD_SIZE(struct bpf_inet_lookup_kern, daddr,
+ BPF_SIZE(si->code));
+
+ break;
+
+ case bpf_ctx_range_till(struct bpf_inet_lookup,
+ remote_ip6[0], remote_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+ off = si->off;
+ off -= offsetof(struct bpf_inet_lookup, remote_ip6[0]);
+
+ LOAD_FIELD_SIZE_OFF(struct bpf_inet_lookup_kern,
+ saddr6.s6_addr32[0],
+ BPF_SIZE(si->code), off);
+#else
+ (void)off;
+
+ *insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+ break;
+
+ case bpf_ctx_range_till(struct bpf_inet_lookup,
+ local_ip6[0], local_ip6[3]):
+#if IS_ENABLED(CONFIG_IPV6)
+ off = si->off;
+ off -= offsetof(struct bpf_inet_lookup, local_ip6[0]);
+
+ LOAD_FIELD_SIZE_OFF(struct bpf_inet_lookup_kern,
+ daddr6.s6_addr32[0],
+ BPF_SIZE(si->code), off);
+#else
+ (void)off;
+
+ *insn++ = BPF_MOV32_IMM(si->dst_reg, 0);
+#endif
+ break;
+
+ case offsetof(struct bpf_inet_lookup, remote_port):
+ LOAD_FIELD(struct bpf_inet_lookup_kern, sport);
+ break;
+
+ case offsetof(struct bpf_inet_lookup, local_port):
+ LOAD_FIELD(struct bpf_inet_lookup_kern, hnum);
+ break;
+ }
+
+ return insn - insn_buf;
+}
+
+const struct bpf_prog_ops inet_lookup_prog_ops = {
+};
+
+const struct bpf_verifier_ops inet_lookup_verifier_ops = {
+ .get_func_proto = inet_lookup_func_proto,
+ .is_valid_access = inet_lookup_is_valid_access,
+ .convert_ctx_access = inet_lookup_convert_ctx_access,
+};
--
2.20.1
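
The family and protocol checks in redirect_lookup() above can be modeled in
userspace to see which (packet, socket) combinations are accepted. This is
only a sketch under stated assumptions: struct mock_sock and
mock_redirect_check() are hypothetical names that do not exist in the kernel,
the ipv6_only flag stands in for ipv6_only_sock(), and success is reported as
0 rather than BPF_REDIRECT.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the few struct sock fields the helper reads. */
struct mock_sock {
	int family;      /* AF_INET or AF_INET6 */
	int protocol;    /* IPPROTO_TCP or IPPROTO_UDP */
	bool ipv6_only;  /* models ipv6_only_sock(), i.e. IPV6_V6ONLY */
};

/*
 * Returns 0 if the socket may serve the lookup, or a negative errno that
 * mirrors the error the helper in the patch returns for the same case.
 */
static int mock_redirect_check(int ctx_family, int ctx_protocol,
			       const struct mock_sock *sk)
{
	/* No socket at the requested index */
	if (!sk)
		return -ENOENT;

	/* L4 protocol must match exactly */
	if (ctx_protocol != sk->protocol)
		return -EPROTOTYPE;

	/*
	 * A family mismatch is tolerated only for a dual-stack AF_INET6
	 * socket receiving an IPv4 packet.
	 */
	if (ctx_family != sk->family &&
	    (sk->family == AF_INET || sk->ipv6_only))
		return -EAFNOSUPPORT;

	return 0;
}
```

Under this model an IPv4 packet can be steered to an AF_INET6 socket that does
not set IPV6_V6ONLY, while an IPv6 packet is never steered to an AF_INET
socket.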
* [RFCv2 bpf-next 04/12] inet: Store layer 4 protocol in inet_hashinfo
From: Jakub Sitnicki @ 2019-08-28 7:22 UTC (permalink / raw)
To: bpf, netdev; +Cc: kernel-team, Lorenz Bauer, Marek Majkowski
In-Reply-To: <20190828072250.29828-1-jakub@cloudflare.com>
Make it possible to identify the protocol of the sockets stored in hashinfo
without looking one up.
Subsequent patches make use of the new field at socket lookup time to
enforce that the BPF program selects only sockets with a matching protocol.
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
include/net/inet_hashtables.h | 3 +++
net/dccp/proto.c | 2 +-
net/ipv4/tcp_ipv4.c | 2 +-
3 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/include/net/inet_hashtables.h b/include/net/inet_hashtables.h
index af2b4c065a04..b2d43ee72dc1 100644
--- a/include/net/inet_hashtables.h
+++ b/include/net/inet_hashtables.h
@@ -138,6 +138,9 @@ struct inet_hashinfo {
unsigned int lhash2_mask;
struct inet_listen_hashbucket *lhash2;
+ /* Layer 4 protocol of the stored sockets */
+ int protocol;
+
/* All the above members are written once at bootup and
* never written again _or_ are predominantly read-access.
*
diff --git a/net/dccp/proto.c b/net/dccp/proto.c
index 5bad08dc4316..805eee1b4fb0 100644
--- a/net/dccp/proto.c
+++ b/net/dccp/proto.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL_GPL(dccp_statistics);
struct percpu_counter dccp_orphan_count;
EXPORT_SYMBOL_GPL(dccp_orphan_count);
-struct inet_hashinfo dccp_hashinfo;
+struct inet_hashinfo dccp_hashinfo = { .protocol = IPPROTO_DCCP };
EXPORT_SYMBOL_GPL(dccp_hashinfo);
/* the maximum queue length for tx in packets. 0 is no limit */
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index fd394ad179a0..5d2afbcc45cc 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -87,7 +87,7 @@ static int tcp_v4_md5_hash_hdr(char *md5_hash, const struct tcp_md5sig_key *key,
__be32 daddr, __be32 saddr, const struct tcphdr *th);
#endif
-struct inet_hashinfo tcp_hashinfo;
+struct inet_hashinfo tcp_hashinfo = { .protocol = IPPROTO_TCP };
EXPORT_SYMBOL(tcp_hashinfo);
static u32 tcp_v4_init_seq(const struct sk_buff *skb)
--
2.20.1
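
As a rough illustration of the idea in this patch, the sketch below tags a
table with its L4 protocol through a designated initializer and compares a
candidate socket's protocol against it. It is a userspace model, not kernel
code: struct mock_hashinfo, mock_tcp_hashinfo and mock_protocol_matches() are
hypothetical names, and how the later patches actually perform the check is
not shown in this message.

```c
#include <netinet/in.h>

/* Hypothetical, trimmed-down model of struct inet_hashinfo. */
struct mock_hashinfo {
	unsigned int lhash2_mask;
	int protocol;	/* Layer 4 protocol of the stored sockets */
};

/*
 * A designated initializer sets only .protocol; every other member of an
 * object with static storage duration is zero-initialized.
 */
static struct mock_hashinfo mock_tcp_hashinfo = { .protocol = IPPROTO_TCP };

/* Nonzero if a socket with sk_protocol belongs in table h. */
static int mock_protocol_matches(const struct mock_hashinfo *h,
				 int sk_protocol)
{
	return h->protocol == sk_protocol;
}
```

With such a tag the lookup path can reject a mismatched socket by reading the
table alone, without first fetching a socket from it.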