Netdev List
 help / color / mirror / Atom feed
* [PATCH net-next] ipvs: Avoid null-pointer deref in debug code
From: Alex Gartrell @ 2014-10-06  0:54 UTC (permalink / raw)
  To: horms; +Cc: ja, dan.carpenter, lvs-devel, netdev, kernel-team, Alex Gartrell
In-Reply-To: <20141001174526.GA16206@mwanda>

Ensure that the pointer is non-NULL before dereferencing it for debugging
purposes.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Alex Gartrell <agartrell@fb.com>
---
 net/netfilter/ipvs/ip_vs_xmit.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/netfilter/ipvs/ip_vs_xmit.c b/net/netfilter/ipvs/ip_vs_xmit.c
index 91f17c1..06bba9b 100644
--- a/net/netfilter/ipvs/ip_vs_xmit.c
+++ b/net/netfilter/ipvs/ip_vs_xmit.c
@@ -316,7 +316,7 @@ __ip_vs_get_out_rt(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
 						  local))) {
 		IP_VS_DBG_RL("We are crossing local and non-local addresses"
-			     " daddr=%pI4\n", &dest->addr.ip);
+			     " daddr=%pI4\n", dest ? &dest->addr.ip : NULL);
 		goto err_put;
 	}
 
@@ -458,7 +458,7 @@ __ip_vs_get_out_rt_v6(int skb_af, struct sk_buff *skb, struct ip_vs_dest *dest,
 	if (unlikely(crosses_local_route_boundary(skb_af, skb, rt_mode,
 						  local))) {
 		IP_VS_DBG_RL("We are crossing local and non-local addresses"
-			     " daddr=%pI6\n", &dest->addr.in6);
+			     " daddr=%pI6\n", dest ? &dest->addr.in6 : NULL);
 		goto err_put;
 	}
 
-- 
Alex Gartrell <agartrell@fb.com>

^ permalink raw reply related

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: Florian Westphal @ 2014-10-06  0:29 UTC (permalink / raw)
  To: John Fastabend
  Cc: dborkman, fw, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou
In-Reply-To: <20141006000629.32055.2295.stgit@nitbit.x32>

John Fastabend <john.fastabend@gmail.com> wrote:
> There is one critical difference when running with these interfaces
> vs running without them. In the normal case the af_packet module
> uses a standard descriptor format exported by the af_packet user
> space headers. In this model because we are working directly with
> driver queues the descriptor format maps to the descriptor format
> used by the device. User space applications can learn device
> information from the socket option PACKET_DEV_DESC_INFO which
> should provide enough details to extrapulate the descriptor formats.
> Although this adds some complexity to user space it removes the
> requirement to copy descriptor fields around.

I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.

How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?

Thanks,
Florian

^ permalink raw reply

* Re: [PATCH net-next] net: better IFF_XMIT_DST_RELEASE support
From: Eric Dumazet @ 2014-10-06  0:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20141004.202900.1107815015981261576.davem@davemloft.net>

On Sat, 2014-10-04 at 20:29 -0400, David Miller wrote:

> I assume you are going to rework this to use a counter indication
> in order to deal with the packet scheduler issues Julian brought
> up.

Absolutely, I will provide a v2.

^ permalink raw reply

* [net-next PATCH v1 3/3] net: packet: Document PACKET_DEV_QPAIR_SPLIT and friends
From: John Fastabend @ 2014-10-06  0:07 UTC (permalink / raw)
  To: dborkman, fw, gerlitz.or, hannes
  Cc: netdev, john.ronciak, amirv, eric.dumazet, danny.zhou
In-Reply-To: <20141006000629.32055.2295.stgit@nitbit.x32>

This adds a section to the packet interface kernel documentation
describing the set of socket options to get direct queue
assignment working.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 Documentation/networking/packet_mmap.txt |   44 ++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index a6d7cb9..ad26194 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -1047,6 +1047,50 @@ See include/linux/net_tstamp.h and Documentation/networking/timestamping
 for more information on hardware timestamps.
 
 -------------------------------------------------------------------------------
++ PACKET_RXTX_QPAIRS_SPLIT and friends
+-------------------------------------------------------------------------------
+
+The PACKET_RXTX_QPAIRS_SLIT setting allows direct access to the hardware
+packet rings. If your NIC is capable of supporting hardware packet steering
+and the driver has this feature enabled you can use the hardware to steer
+packets directly to user mapped memory and use user space descriptor rings.
+
+The user space flow should be,
+
+	bind(fd, &sockaddr, sizeof(sockaddr));
+
+	/* Get the device type and info */
+	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
+		   &optlen);
+
+	/* With device info we can look up descriptor format */
+
+	/* Get the layout of ring space offset, page_sz, cnt */
+	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
+                   &info, &optlen);
+
+        /* request some queues from the driver */
+	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
+		   &qpairs_info, sizeof(qpairs_info));
+
+	/* if we let the driver pick us queues learn which queues
+	 * we were given
+	 */
+	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
+		   &qpairs_info, sizeof(qpairs_info));
+
+	/* And mmap queue pairs to user space */
+	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
+	     MAP_SHARED, fd, 0);
+
+	/* Now we have some user space queues to read/write to*/
+
+After this user space can directly manipulate the drivers descriptor rings.
+The descriptor rings use the native descriptor format of the hardware device.
+The device specifics are returned from the PACKET_DEV_DESC_INFO call which
+allows user space to determine the correct descriptor format to use.
+
+-------------------------------------------------------------------------------
 + Miscellaneous bits
 -------------------------------------------------------------------------------
 

^ permalink raw reply related

* [net-next PATCH v1 2/3] net: sched: add direct ring acces via af_packet to ixgbe
From: John Fastabend @ 2014-10-06  0:07 UTC (permalink / raw)
  To: dborkman, fw, gerlitz.or, hannes
  Cc: netdev, john.ronciak, amirv, eric.dumazet, danny.zhou
In-Reply-To: <20141006000629.32055.2295.stgit@nitbit.x32>

This implements the necessary ndo ops to support the af_packet
interface to directly own and manipulate queues.

Signed-off-by: Danny Zhou <danny.zhou@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   23 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  232 ++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h    |    1 
 4 files changed, 251 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 673d820..2f6eadf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -678,6 +678,9 @@ struct ixgbe_adapter {
 
 	struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];
 
+	/* Direct User Space Queues */
+	struct sock *sk_handles[MAX_RX_QUEUES];
+
 	/* DCB parameters */
 	struct ieee_pfc *ixgbe_ieee_pfc;
 	struct ieee_ets *ixgbe_ieee_ets;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index cff383b..01a6e55 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2581,12 +2581,17 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 	if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
 		return -EOPNOTSUPP;
 
+	if (fsp->ring_cookie > MAX_RX_QUEUES)
+		return -EINVAL;
+
 	/*
 	 * Don't allow programming if the action is a queue greater than
-	 * the number of online Rx queues.
+	 * the number of online Rx queues unless it is a user space
+	 * queue.
 	 */
 	if ((fsp->ring_cookie != RX_CLS_FLOW_DISC) &&
-	    (fsp->ring_cookie >= adapter->num_rx_queues))
+	    (fsp->ring_cookie >= adapter->num_rx_queues) &&
+	    !adapter->sk_handles[fsp->ring_cookie])
 		return -EINVAL;
 
 	/* Don't allow indexes to exist outside of available space */
@@ -2663,12 +2668,18 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 	/* apply mask and compute/store hash */
 	ixgbe_atr_compute_perfect_hash_82599(&input->filter, &mask);
 
+	/* Set input action to reg_idx for driver owned queues otherwise
+	 * use the absolute index for user space queues.
+	 */
+	if (fsp->ring_cookie < adapter->num_rx_queues &&
+	    fsp->ring_cookie != IXGBE_FDIR_DROP_QUEUE)
+		input->action = adapter->rx_ring[input->action]->reg_idx;
+
 	/* program filters to filter memory */
 	err = ixgbe_fdir_write_perfect_filter_82599(hw,
-				&input->filter, input->sw_idx,
-				(input->action == IXGBE_FDIR_DROP_QUEUE) ?
-				IXGBE_FDIR_DROP_QUEUE :
-				adapter->rx_ring[input->action]->reg_idx);
+						    &input->filter,
+						    input->sw_idx,
+						    input->action);
 	if (err)
 		goto err_out_w_lock;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 06ef5a3..6506550 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -48,7 +48,9 @@
 #include <linux/if_macvlan.h>
 #include <linux/if_bridge.h>
 #include <linux/prefetch.h>
+#include <linux/mm.h>
 #include <scsi/fc/fc_fcoe.h>
+#include <linux/if_packet.h>
 
 #include "ixgbe.h"
 #include "ixgbe_common.h"
@@ -70,6 +72,8 @@ const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
 				"Copyright (c) 1999-2014 Intel Corporation.";
 
+static unsigned int *dummy_page_buf;
+
 static const struct ixgbe_info *ixgbe_info_tbl[] = {
 	[board_82598] = &ixgbe_82598_info,
 	[board_82599] = &ixgbe_82599_info,
@@ -3122,6 +3126,16 @@ static void ixgbe_enable_rx_drop(struct ixgbe_adapter *adapter,
 	IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(reg_idx), srrctl);
 }
 
+static bool ixgbe_have_user_queues(struct ixgbe_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < MAX_RX_QUEUES; i++)
+		if (adapter->sk_handles[i])
+			return true;
+	return false;
+}
+
 static void ixgbe_disable_rx_drop(struct ixgbe_adapter *adapter,
 				  struct ixgbe_ring *ring)
 {
@@ -3156,7 +3170,8 @@ static void ixgbe_set_rx_drop_en(struct ixgbe_adapter *adapter)
 	 *  and performance reasons.
 	 */
 	if (adapter->num_vfs || (adapter->num_rx_queues > 1 &&
-	    !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en)) {
+	    !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en) ||
+	    ixgbe_have_user_queues(adapter)) {
 		for (i = 0; i < adapter->num_rx_queues; i++)
 			ixgbe_enable_rx_drop(adapter, adapter->rx_ring[i]);
 	} else {
@@ -7812,6 +7827,210 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 	kfree(fwd_adapter);
 }
 
+static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
+				       unsigned int start_from,
+				       unsigned int qpairs_num,
+				       struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+
+	/* allocate whatever availiable qpairs */
+	if (start_from == PACKET_QPAIRS_START_ANY) {
+		unsigned int count = 0;
+
+		for (qpair_index = adapter->num_rx_queues;
+		     qpair_index < MAX_RX_QUEUES;
+		     qpair_index++) {
+			if (!adapter->sk_handles[qpair_index]) {
+				count++;
+				if (count == qpairs_num) {
+					start_from = qpair_index - count + 1;
+					break;
+				}
+			} else {
+				count = 0;
+			}
+		}
+	}
+
+	/* otherwise the caller specified exact queues */
+	if ((start_from > MAX_TX_QUEUES) ||
+	    (start_from > MAX_RX_QUEUES) ||
+	    (start_from + qpairs_num > MAX_TX_QUEUES) ||
+	    (start_from + qpairs_num > MAX_RX_QUEUES))
+		return -EINVAL;
+
+	/* If the qpairs are being used by the driver do not let user space
+	 * consume the queues. Also if the queue has already been allocated
+	 * to a socket do fail the request.
+	 */
+	for (qpair_index = start_from;
+	     qpair_index < start_from + qpairs_num;
+	     qpair_index++) {
+		if ((qpair_index < adapter->num_tx_queues) ||
+		    (qpair_index < adapter->num_rx_queues))
+			return -EINVAL;
+
+		if (adapter->sk_handles[qpair_index] != NULL)
+			return -EBUSY;
+	}
+
+	/* remember the sk handle for each queue pair */
+	for (qpair_index = start_from;
+	     qpair_index < start_from + qpairs_num;
+	     qpair_index++)
+		adapter->sk_handles[qpair_index] = sk;
+
+	return start_from;
+}
+
+static int ixgbe_ndo_get_queue_pairs(struct net_device *dev,
+				     unsigned int *start_from,
+				     unsigned int *qpairs_num,
+				     struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+
+	*qpairs_num = 0;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		if (adapter->sk_handles[qpair_index] == sk) {
+			if (*qpairs_num == 0)
+				*start_from = qpair_index;
+			*qpairs_num = *qpairs_num + 1;
+		}
+	}
+
+	return 0;
+}
+
+static int ixgbe_ndo_return_queue_pairs(struct net_device *dev, struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_TX_QUEUES;
+	     qpair_index++) {
+		if (adapter->sk_handles[qpair_index] == sk)
+			adapter->sk_handles[qpair_index] = NULL;
+	}
+
+	return 0;
+}
+
+/* Rx descriptor starts from 0x1000 and Tx descriptor starts from 0x6000
+ * both the TX and RX descriptors use 4K pages.
+ */
+#define RX_DESC_ADDR_OFFSET		0x1000
+#define TX_DESC_ADDR_OFFSET		0x6000
+#define PAGE_SIZE_4K			4096
+
+static int
+ixgbe_ndo_qpair_map_region(struct net_device *dev,
+			   struct tpacket_dev_qpair_map_region_info *info)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+	/* no need to map systme memory to userspace for ixgbe */
+	info->tp_dev_sysm_sz = 0;
+	info->tp_num_sysm_map_regions = 0;
+
+	info->tp_dev_bar_sz = pci_resource_len(adapter->pdev, 0);
+	info->tp_num_map_regions = 2;
+
+	info->regions[0].page_offset = RX_DESC_ADDR_OFFSET;
+	info->regions[0].page_sz = PAGE_SIZE;
+	info->regions[0].page_cnt = 1;
+	info->regions[1].page_offset = TX_DESC_ADDR_OFFSET;
+	info->regions[1].page_sz = PAGE_SIZE;
+	info->regions[1].page_cnt = 1;
+
+	return 0;
+}
+
+static int ixgbe_ndo_get_device_desc_info(struct net_device *dev,
+					  struct tpacket_dev_info *dev_info)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	int max_queues;
+
+	max_queues = max(adapter->num_rx_queues, adapter->num_tx_queues);
+
+	dev_info->tp_device_id = adapter->hw.device_id;
+	dev_info->tp_vendor_id = adapter->hw.vendor_id;
+	dev_info->tp_subsystem_device_id = adapter->hw.subsystem_device_id;
+	dev_info->tp_subsystem_vendor_id = adapter->hw.subsystem_vendor_id;
+	dev_info->tp_revision_id = adapter->hw.revision_id;
+	dev_info->tp_numa_node = dev_to_node(&dev->dev);
+
+	dev_info->tp_num_total_qpairs = min(MAX_RX_QUEUES, MAX_TX_QUEUES);
+	dev_info->tp_num_inuse_qpairs = max_queues;
+
+	dev_info->tp_rxdesc_size = sizeof(union ixgbe_adv_rx_desc);
+	dev_info->tp_rxdesc_ver = 1;
+	dev_info->tp_txdesc_size = sizeof(union ixgbe_adv_tx_desc);
+	dev_info->tp_txdesc_ver = 1;
+
+	return 0;
+}
+
+static int
+ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
+	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+	unsigned long dummy_page_phy;
+	pgprot_t pre_vm_page_prot;
+	unsigned long start;
+	unsigned int i;
+	int err;
+
+	if (!dummy_page_buf) {
+		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
+		if (!dummy_page_buf)
+			return -ENOMEM;
+
+		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
+			dummy_page_buf[i] = 0xdeadbeef;
+	}
+
+	dummy_page_phy = virt_to_phys(dummy_page_buf);
+	pre_vm_page_prot = vma->vm_page_prot;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	/* assume the vm_start is 4K aligned address */
+	for (start = vma->vm_start;
+	     start < vma->vm_end;
+	     start += PAGE_SIZE_4K) {
+		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
+			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
+					      vma->vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
+			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
+					      vma->vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		} else {
+			unsigned long addr = dummy_page_phy > PAGE_SHIFT;
+
+			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
+					      pre_vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		}
+	}
+	return 0;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -7856,6 +8075,12 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_bridge_getlink	= ixgbe_ndo_bridge_getlink,
 	.ndo_dfwd_add_station	= ixgbe_fwd_add,
 	.ndo_dfwd_del_station	= ixgbe_fwd_del,
+	.ndo_split_queue_pairs	= ixgbe_ndo_split_queue_pairs,
+	.ndo_get_queue_pairs	= ixgbe_ndo_get_queue_pairs,
+	.ndo_return_queue_pairs = ixgbe_ndo_return_queue_pairs,
+	.ndo_get_device_qpair_map_region_info = ixgbe_ndo_qpair_map_region,
+	.ndo_get_device_desc_info  = ixgbe_ndo_get_device_desc_info,
+	.ndo_direct_qpair_page_map = ixgbe_ndo_qpair_page_map,
 };
 
 /**
@@ -8054,7 +8279,9 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	hw->back = adapter;
 	adapter->msg_enable = netif_msg_init(debug, DEFAULT_MSG_ENABLE);
 
-	hw->hw_addr = ioremap(pci_resource_start(pdev, 0),
+	hw->pci_hw_addr = pci_resource_start(pdev, 0);
+
+	hw->hw_addr = ioremap(hw->pci_hw_addr,
 			      pci_resource_len(pdev, 0));
 	adapter->io_addr = hw->hw_addr;
 	if (!hw->hw_addr) {
@@ -8705,6 +8932,7 @@ module_init(ixgbe_init_module);
  **/
 static void __exit ixgbe_exit_module(void)
 {
+	kfree(dummy_page_buf);
 #ifdef CONFIG_IXGBE_DCA
 	dca_unregister_notify(&dca_notifier);
 #endif
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index dfd55d8..26e9163 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -3022,6 +3022,7 @@ struct ixgbe_mbx_info {
 
 struct ixgbe_hw {
 	u8 __iomem			*hw_addr;
+	phys_addr_t			pci_hw_addr;
 	void				*back;
 	struct ixgbe_mac_info		mac;
 	struct ixgbe_addr_filter_info	addr_ctrl;

^ permalink raw reply related

* [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
From: John Fastabend @ 2014-10-06  0:06 UTC (permalink / raw)
  To: dborkman, fw, gerlitz.or, hannes
  Cc: netdev, john.ronciak, amirv, eric.dumazet, danny.zhou

This patch adds a net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be directly manipulated from user space. For
raw packet interface this removes any overhead from the kernel network
stack.

Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed so this code is not run. So the
hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().

Fortunately there is already a flow classification interface which
is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
currently supported by multiple drivers including sfc, mlx4, niu,
ixgbe, and i40e. Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support the interface, plus
the driver needs to implement the correct ndo ops. A follow on
patch adds support for ixgbe but we expect at least the subset of
drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.

The interface is driven over an af_packet socket which we believe
is the most natural interface to use. Because it is already used
for raw packet interfaces which is what we are providing here.
 The high level flow for this interface looks like:

	bind(fd, &sockaddr, sizeof(sockaddr));

	/* Get the device type and info */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
		   &optlen);

	/* With device info we can look up descriptor format */

	/* Get the layout of ring space offset, page_sz, cnt */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
		   &info, &optlen);

	/* request some queues from the driver */
	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* if we let the driver pick us queues learn which queues
         * we were given
         */
	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* And mmap queue pairs to user space */
	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
	     MAP_SHARED, fd, 0);

	/* Now we have some user space queues to read/write to*/

There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapulate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.

The formats are usually provided by the device vendor documentation
If folks want I can provide a follow up patch to provide the formats
in a .h file in ./include/uapi/linux/ for ease of use. I have access
to formats for ixgbe and mlx drivers other driver owners would need to
provide their formats.

We tested this interface using traffic generators and doing basic
L2 forwarding tests on ixgbe devices. Our tests use a set of patches
to DPDK to enable an interface using this socket interfaace. With
this interface we can xmit/receive @ line rate from a test user space
application on a single core.

Additionally we have a set of DPDK patches to enable DPDK with this
interface. DPDK can be downloaded @ dpdk.org although as I hope is
clear from above DPDK is just our paticular test environment we
expect other libraries could be built on this interface.

Signed-off-by: Danny Zhou <danny.zhou@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/netdevice.h      |   63 ++++++++++++++
 include/uapi/linux/if_packet.h |   42 +++++++++
 net/packet/af_packet.c         |  181 ++++++++++++++++++++++++++++++++++++++++
 net/packet/internal.h          |    1 
 4 files changed, 287 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9f5d293..dae96dc2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -51,6 +51,8 @@
 #include <linux/neighbour.h>
 #include <uapi/linux/netdevice.h>
 
+#include <linux/if_packet.h>
+
 struct netpoll_info;
 struct device;
 struct phy_device;
@@ -997,6 +999,47 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	Callback to use for xmit over the accelerated station. This
  *	is used in place of ndo_start_xmit on accelerated net
  *	devices.
+ *
+ * int (*ndo_split_queue_pairs) (struct net_device *dev,
+ *				 unsigned int qpairs_start_from,
+ *				 unsigned int qpairs_num,
+ *				 struct sock *sk)
+ *	Called to request a set of queues from the driver to be
+ *	handed to the callee for management. After this returns the
+ *	driver will not use the queues. The call should return zero
+ *	on success otherwise an appropriate error code can be used.
+ *	If qpairs_start_from is less than zero driver can start at
+ *	any available slot.
+ *
+ * int (*ndo_get_queue_pairs)(struct net_device *dev,
+ *			      unsigned int *qpairs_start_from,
+ *			      unsigned int *qpairs_num,
+ *			      struct sock *sk);
+ *	Called to get the queues assigned by the driver to this sock
+ *	when ndo_split_queue_pairs does not specify a start_from and
+ *	qpairs_num field. Returns zero on success.
+ *
+ * int (*ndo_return_queue_pairs) (struct net_device *dev,
+ *				  struct sock *sk)
+ *	Called to return a set of queues identified by sock to
+ *	the driver. The socket must have previously requested
+ *	the queues via ndo_split_queue_pairs for this action to
+ *	be performed.
+ *
+ * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
+ *				struct tpacket_dev_qpair_map_region_info *info)
+ *	Called to return mapping of queue memory region
+ *
+ * int (*ndo_get_device_desc_info) (struct net_device *dev,
+ *				    struct tpacket_dev_info *dev_info)
+ *	Called to get device specific information. This should
+ *	uniquely identify the hardware so that descriptor formats
+ *	can be learned by the stack/user space.
+ *
+ * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
+ *				     struct net_device *dev)
+ *	Called to map queue pair range from split_queue_pairs into
+ *	mmap region.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1146,6 +1189,26 @@ struct net_device_ops {
 							struct net_device *dev,
 							void *priv);
 	int			(*ndo_get_lock_subclass)(struct net_device *dev);
+	int			(*ndo_split_queue_pairs)(struct net_device *dev,
+					 unsigned int qpairs_start_from,
+					 unsigned int qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_get_queue_pairs)(struct net_device *dev,
+					 unsigned int *qpairs_start_from,
+					 unsigned int *qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_return_queue_pairs)(
+					 struct net_device *dev,
+					 struct sock *sk);
+	int			(*ndo_get_device_qpair_map_region_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_qpair_map_region_info *info);
+	int			(*ndo_get_device_desc_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_info *dev_info);
+	int			(*ndo_direct_qpair_page_map)
+					(struct vm_area_struct *vma,
+					 struct net_device *dev);
 };
 
 /**
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index da2d668..fa94b65 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,10 @@ struct sockaddr_ll {
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
 #define PACKET_QDISC_BYPASS		20
+#define PACKET_RXTX_QPAIRS_SPLIT	21
+#define PACKET_RXTX_QPAIRS_RETURN	22
+#define PACKET_DEV_QPAIR_MAP_REGION_INFO	23
+#define PACKET_DEV_DESC_INFO		24
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
@@ -64,6 +68,44 @@ struct sockaddr_ll {
 #define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
 #define PACKET_FANOUT_FLAG_DEFRAG	0x8000
 
+#define MAX_MAP_MEMORY_REGIONS		64
+
+struct tpacket_dev_qpair_map_region_info {
+	unsigned int tp_dev_bar_sz;		/* size of BAR */
+	unsigned int tp_dev_sysm_sz;		/* size of systerm memory */
+	/* number of contiguous memory on BAR mapping to user space */
+	unsigned int tp_num_map_regions;
+	/* number of contiguous memory on system mapping to user apce */
+	unsigned int tp_num_sysm_map_regions;
+	struct map_page_region {
+		unsigned page_offset;		/* offset to start of region */
+		unsigned page_sz;		/* size of page */
+		unsigned page_cnt;		/* number of pages */
+	} regions[MAX_MAP_MEMORY_REGIONS];
+};
+
+#define PACKET_QPAIRS_START_ANY		-1
+
+struct tpacket_dev_qpairs_info {
+	unsigned int tp_qpairs_start_from;	/* qpairs index to start from */
+	unsigned int tp_qpairs_num;		/* number of qpairs */
+};
+
+struct tpacket_dev_info {
+	__u16		tp_device_id;
+	__u16		tp_vendor_id;
+	__u16		tp_subsystem_device_id;
+	__u16		tp_subsystem_vendor_id;
+	__u32		tp_numa_node;
+	__u32		tp_revision_id;
+	__u32		tp_num_total_qpairs;
+	__u32		tp_num_inuse_qpairs;
+	unsigned int	tp_rxdesc_size;	/* rx desc size in bytes */
+	__u16		tp_rxdesc_ver;
+	unsigned int	tp_txdesc_size;	/* tx desc size in bytes */
+	__u16		tp_txdesc_ver;
+};
+
 struct tpacket_stats {
 	unsigned int	tp_packets;
 	unsigned int	tp_drops;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 87d20f4..19b43ee 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2611,6 +2611,14 @@ static int packet_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	preempt_enable();
 
+	if (po->tp_owns_queue_pairs) {
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (dev)
+			dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+	}
+
 	spin_lock(&po->bind_lock);
 	unregister_prot_hook(sk, false);
 	packet_cached_dev_reset(po);
@@ -3403,6 +3411,70 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
 		return 0;
 	}
+
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct tpacket_dev_qpairs_info qpairs;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs))
+			return -EINVAL;
+		if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
+			return -EFAULT;
+
+		/* Only allow one set of queues to be owned by userspace */
+		if (po->tp_owns_queue_pairs)
+			return -EBUSY;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		ops = dev->netdev_ops;
+		if (!ops->ndo_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err =  ops->ndo_split_queue_pairs(dev,
+						  qpairs.tp_qpairs_start_from,
+						  qpairs.tp_qpairs_num, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = true;
+
+		return err;
+	}
+
+	case PACKET_RXTX_QPAIRS_RETURN:
+	{
+		struct tpacket_dev_qpairs_info qpairs_info;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = false;
+
+		return err;
+	}
+
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3498,6 +3570,99 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_QDISC_BYPASS:
 		val = packet_use_direct_xmit(po);
 		break;
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_qpairs_info qpairs_info;
+		int err;
+
+		if (len != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		/* This call only works after a successful queue pairs split-off
+		 * operation via setsockopt()
+		 */
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err = dev->netdev_ops->ndo_get_queue_pairs(dev,
+					&qpairs_info.tp_qpairs_start_from,
+					&qpairs_info.tp_qpairs_num, sk);
+
+		lv = sizeof(qpairs_info);
+		data = &qpairs_info;
+		break;
+	}
+	case PACKET_DEV_QPAIR_MAP_REGION_INFO:
+	{
+		struct tpacket_dev_qpair_map_region_info info;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		ops = dev->netdev_ops;
+		if (!ops->ndo_get_device_qpair_map_region_info)
+			return -EOPNOTSUPP;
+
+		err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_qpair_map_region_info);
+		data = &info;
+		break;
+	}
+	case PACKET_DEV_DESC_INFO:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_info info;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_get_device_desc_info)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_info);
+		data = &info;
+		break;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3904,6 +4069,21 @@ static int packet_mmap(struct file *file, struct socket *sock,
 
 	mutex_lock(&po->pg_vec_lock);
 
+	if (po->tp_owns_queue_pairs) {
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		ops = dev->netdev_ops;
+		err = ops->ndo_direct_qpair_page_map(vma, dev);
+		if (err)
+			goto out;
+		goto done;
+	}
+
 	expected_size = 0;
 	for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
 		if (rb->pg_vec) {
@@ -3941,6 +4121,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
 		}
 	}
 
+done:
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index cdddf6a..55cadbc 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -113,6 +113,7 @@ struct packet_sock {
 	unsigned int		tp_reserve;
 	unsigned int		tp_loss:1;
 	unsigned int		tp_tx_has_off:1;
+	unsigned int		tp_owns_queue_pairs:1;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);

^ permalink raw reply related

* Good day
From: generaldavid @ 2014-10-05 23:15 UTC (permalink / raw)


Good day
please did you receive what i sent to you recently?

^ permalink raw reply

* Lieber Freund,
From: test @ 2014-10-05 22:16 UTC (permalink / raw)
  To: info; +Cc: info



Lieber Freund, 
Mein Name ist H. Sahid, einem libyschen Geschäftsmann, wegen der anhaltenden Krisen Ich bin dringend auf Ihre Hilfe, um mich behandeln riesige Menge an Geld, wenn Sie interessiert, dann kontaktieren Sie mich durch Rückseite sind: hamzaksahid@aol.com 
aufrichtig 
Hamza Sahid 
hamzaksahid@aol.com

______________________________________________________________________
This email has been scanned by the Symantec Email Security.cloud service.
For more information please visit http://www.symanteccloud.com
______________________________________________________________________

^ permalink raw reply

* Re: r8168 is needed to enter P-state: Package State 6 (pc6) on Haswell hardware
From: Ceriel Jacobs @ 2014-10-05 22:22 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Realtek linux nic maintainers, netdev
In-Reply-To: <20141005165920.GA21926@electric-eye.fr.zoreil.com>

This is the requested "XID" dmesg line is:

[    1.479298] r8169 0000:03:00.0 eth0: RTL8168g/8111g at 
0xffffc90004752000, d0:50:99:0b:09:90, XID 0c000800 IRQ 27

Francois Romieu schreef op 05-10-14 om 18:59:
> Ceriel Jacobs <linux-ide@crashplan.pro> :
>> With in-kernel r8169 module, only P-state package C3 (pc3) can be reached
>> when enabling ASPM.
>>
>> Only after installing r8168 driver (and enabling ASPM), a Haswell Celeron
>> G1820/G1840 processer will enter package C6 state (pc6).
>
> Vanilla kernel r8169 driver logs a message that contains an "XID" string.
> Please grep for it in your dmesg so that I can narrow the search for
> meaningful differences in Realtek's r8168 driver.
>
> Thanks.
>

^ permalink raw reply

* Re: [PATCH] net: Add ndo_gso_check
From: Or Gerlitz @ 2014-10-05 19:13 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Alexander Duyck, John Fastabend, Jeff Kirsher, David Miller,
	Linux Netdev List, Thomas Graf, Pravin Shelar, Andy Zhou
In-Reply-To: <CA+mtBx8-w74f_f9JEfpwH9=Mjr_zaJJAYCMMdYAWdCy-51pzqg@mail.gmail.com>

On Sun, Oct 5, 2014 at 9:49 PM, Tom Herbert <therbert@google.com> wrote:
> On Sun, Oct 5, 2014 at 7:04 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> On Thu, Oct 2, 2014 at 2:06 AM, Tom Herbert <therbert@google.com> wrote:
>>> On Wed, Oct 1, 2014 at 1:58 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>>> On Tue, Sep 30, 2014 at 6:34 PM, Tom Herbert <therbert@google.com> wrote:
>>>>> On Tue, Sep 30, 2014 at 7:30 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>> [...]
>>> Solution #4: apply this patch and implement the check functions as
>>> needed in those 4 or 5 drivers. If a device can only do VXLAN/NVGRE
>>> then I believe the check function is something like:
>>>
>>> bool mydev_gso_check(struct sk_buff *skb, struct net_device *dev)
>>> {
>>>         if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) &&
>>>             ((skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
>>>               skb->protocol != htons(ETH_P_TEB) ||
>>>               skb_inner_mac_header(skb) - skb_transport_header(skb) != 12)
>>>                 return false;
>>>
>>>         return true;
>>> }
>>
>> Yep, such helper can can be basically made to work and let the 4-5
>> drivers that can
>> do GSO offloading for vxlan but not for any FOU/GUE packets signal
>> that to the stack.
>>
>> Re the 12 constant, you were referring to the udp+vxlan headers? it's 8+8
>>
>> Also, we need a way for drivers that can support VXLAN or NVGRE but
>> not concurrently
>> on the same port @ the same time to only let vxlan packet to pass
>> successfully through the helper.

> Or, there should be no difference in GSO processing between VXLAN and
> NVGRE. Can you explain why you feel you need to differentiate them for GSO?


RX wise, Linux tells the driver that UDP port X would be used for
VXLAN, right? and indeed, it's possible for some HW implementations
not to support RX offloading (checksum) for both VXLAN and NVGRE @ the
same time over the same port. But TX/GRO wise, you're probably
correct. The thing is that from the user POV they need solution that
works for both RX and TX offloading.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce netdevice gso_min_segs attribute
From: Eric Dumazet @ 2014-10-05 18:58 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Amir Vadai, David S. Miller, Eric Dumazet, Linux Netdev List,
	Yevgeny Petrilin, Or Gerlitz, Ido Shamay
In-Reply-To: <CA+mtBx9qLU_6OgLQ=GW+V2UwZ2h_GqrzAHhYL_0BP8p8AzVd=g@mail.gmail.com>

On Sun, 2014-10-05 at 11:45 -0700, Tom Herbert wrote:
> On Sun, Oct 5, 2014 at 10:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> > From: Eric Dumazet <edumazet@google.com>
> >
> > Some TSO engines might have a too heavy setup cost, that impacts
> > performance on hosts sending small bursts (2 MSS per packet).
> >
> > This patch adds a device gso_min_segs, allowing drivers to set
> > a minimum segment size for TSO packets, according to the NIC
> > performance.
> >
> Eric, this seems like another device specific limitation in TSO we are
> exposing to the stack. Can't we just put things like this in
> ndo_gso_check so driver can apply whatever criteria it wants in
> deciding if it worth it or even to rather do TSO?

I don't think it really matters where the check is done.

If you want to move the check away from netif_skb_features(), its fine
with me, but please note you did not suggest this for current check
against gso_max_segs.

I based my patch on current net-next, where ndo_gso_check() does not
exist.

BTW, I wonder now why we added gso_max_segs support to bonding driver.
(commit 0e376bd0b791ac6ac6bdb051492df0769c840848)

It makes little sense, gso should be done at the last possible stage.

^ permalink raw reply

* Re: [PATCH] net: Add ndo_gso_check
From: Tom Herbert @ 2014-10-05 18:49 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Alexander Duyck, John Fastabend, Jeff Kirsher, David Miller,
	Linux Netdev List, Thomas Graf, Pravin Shelar, Andy Zhou
In-Reply-To: <CAJ3xEMhzxfP3Sqp=_QSQehd2j2s=Gr3B1iO2S--OcoT7VC94nQ@mail.gmail.com>

On Sun, Oct 5, 2014 at 7:04 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> On Thu, Oct 2, 2014 at 2:06 AM, Tom Herbert <therbert@google.com> wrote:
>> On Wed, Oct 1, 2014 at 1:58 PM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
>>> On Tue, Sep 30, 2014 at 6:34 PM, Tom Herbert <therbert@google.com> wrote:
>>>> On Tue, Sep 30, 2014 at 7:30 AM, Or Gerlitz <gerlitz.or@gmail.com> wrote:
> [...]
>> Solution #4: apply this patch and implement the check functions as
>> needed in those 4 or 5 drivers. If a device can only do VXLAN/NVGRE
>> then I believe the check function is something like:
>>
>> bool mydev_gso_check(struct sk_buff *skb, struct net_device *dev)
>> {
>>         if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP_TUNNEL) &&
>>             ((skb->inner_protocol_type != ENCAP_TYPE_ETHER ||
>>               skb->protocol != htons(ETH_P_TEB) ||
>>               skb_inner_mac_header(skb) - skb_transport_header(skb) != 12)
>>                 return false;
>>
>>         return true;
>> }
>
> Yep, such helper can can be basically made to work and let the 4-5
> drivers that can
> do GSO offloading for vxlan but not for any FOU/GUE packets signal
> that to the stack.
>
> Re the 12 constant, you were referring to the udp+vxlan headers? it's 8+8
>
> Also, we need a way for drivers that can support VXLAN or NVGRE but
> not concurrently
> on the same port @ the same time to only let vxlan packet to pass
> successfully through the helper.
>
Or, there should be no difference in GSO processing between VXLAN and
NVGRE. Can you explain why you feel you need to differentiate them for
GSO?

Thanks,
Tom

> There was that coloring scheme I suggested, but you didn't like it...
>
>>>> Would any other driver maintainers like to chime in on this?
>>> Alex? John?
>
>
> Or.

^ permalink raw reply

* Re: [PATCH net-next] net: introduce netdevice gso_min_segs attribute
From: Tom Herbert @ 2014-10-05 18:45 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Amir Vadai, David S. Miller, Eric Dumazet, Linux Netdev List,
	Yevgeny Petrilin, Or Gerlitz, Ido Shamay
In-Reply-To: <1412529087.11091.14.camel@edumazet-glaptop2.roam.corp.google.com>

On Sun, Oct 5, 2014 at 10:11 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> Some TSO engines might have a too heavy setup cost, that impacts
> performance on hosts sending small bursts (2 MSS per packet).
>
> This patch adds a device gso_min_segs, allowing drivers to set
> a minimum segment size for TSO packets, according to the NIC
> performance.
>
Eric, this seems like another device specific limitation in TSO we are
exposing to the stack. Can't we just put things like this in
ndo_gso_check so driver can apply whatever criteria it wants in
deciding if it worth it or even to rather do TSO?

> Tested on a mlx4 NIC, this allows to get a ~110% increase of
> throughput when sending 2 MSS per packet.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> mlx4 patch will be sent later, its a one liner.
>
>  include/linux/netdevice.h |    4 +++-
>  net/core/dev.c            |    9 ++++++---
>  2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 22d54b9b700d..2df86f50261c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -1416,6 +1416,8 @@ enum netdev_priv_flags {
>   *     @gso_max_size:  Maximum size of generic segmentation offload
>   *     @gso_max_segs:  Maximum number of segments that can be passed to the
>   *                     NIC for GSO
> + *     @gso_min_segs:  Minimum number of segments that can be passed to the
> + *                     NIC for GSO
>   *
>   *     @dcbnl_ops:     Data Center Bridging netlink ops
>   *     @num_tc:        Number of traffic classes in the net device
> @@ -1666,7 +1668,7 @@ struct net_device {
>         unsigned int            gso_max_size;
>  #define GSO_MAX_SEGS           65535
>         u16                     gso_max_segs;
> -
> +       u16                     gso_min_segs;
>  #ifdef CONFIG_DCB
>         const struct dcbnl_rtnl_ops *dcbnl_ops;
>  #endif
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 1a90530f83ff..16e8ebbd3316 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -2567,10 +2567,12 @@ static netdev_features_t harmonize_features(struct sk_buff *skb,
>
>  netdev_features_t netif_skb_features(struct sk_buff *skb)
>  {
> +       const struct net_device *dev = skb->dev;
> +       netdev_features_t features = dev->features;
> +       u16 gso_segs = skb_shinfo(skb)->gso_segs;
>         __be16 protocol = skb->protocol;
> -       netdev_features_t features = skb->dev->features;
>
> -       if (skb_shinfo(skb)->gso_segs > skb->dev->gso_max_segs)
> +       if (gso_segs > dev->gso_max_segs || gso_segs < dev->gso_min_segs)
>                 features &= ~NETIF_F_GSO_MASK;
>
>         if (protocol == htons(ETH_P_8021Q) || protocol == htons(ETH_P_8021AD)) {
> @@ -2581,7 +2583,7 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
>         }
>
>         features = netdev_intersect_features(features,
> -                                            skb->dev->vlan_features |
> +                                            dev->vlan_features |
>                                              NETIF_F_HW_VLAN_CTAG_TX |
>                                              NETIF_F_HW_VLAN_STAG_TX);
>
> @@ -6658,6 +6660,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
>
>         dev->gso_max_size = GSO_MAX_SIZE;
>         dev->gso_max_segs = GSO_MAX_SEGS;
> +       dev->gso_min_segs = 0;
>
>         INIT_LIST_HEAD(&dev->napi_list);
>         INIT_LIST_HEAD(&dev->unreg_list);
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next v2 1/2] if_link: add client name to port profile
From: Govindarajulu Varadarajan @ 2014-10-05 18:44 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Govindarajulu Varadarajan, davem, netdev, ssujith, benve
In-Reply-To: <20141004200523.4fc102e3@urahara>

On Sat, 4 Oct 2014, Stephen Hemminger wrote:
> Maybe you could use the existing IFLA_IFALIAS?
> It is already supported by iproute tools and sysfs, and can be used by net-snmp as well.
> You would just be setting the default string, users could change it.
>

I do not think we can use ifalias here. Port profile is for SRIOV-VF
pci passthrough. vfio-pci will own the pci device and there would not be
net_device registeres for that interface on host. The client (libvirt)
unregisters the vf before binding it to vfio-pci.

^ permalink raw reply

* [PATCH net 1/1] hyperv: Fix a bug in netvsc_send()
From: K. Y. Srinivasan @ 2014-10-05 17:42 UTC (permalink / raw)
  To: davem, netdev, linux-kernel, devel, olaf, apw, jasowang

After the packet is successfully sent, we should not touch the packet 
as it may have been freed. This patch is based on the work done by
Long Li <longli@microsoft.com>.

David, please queue this up for stable.

Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
Reported-by: Sitsofe Wheeler <sitsofe@yahoo.com>
---
 drivers/net/hyperv/netvsc.c |   15 ++++++++-------
 1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/net/hyperv/netvsc.c b/drivers/net/hyperv/netvsc.c
index 977984b..7d76c95 100644
--- a/drivers/net/hyperv/netvsc.c
+++ b/drivers/net/hyperv/netvsc.c
@@ -717,6 +717,7 @@ int netvsc_send(struct hv_device *device,
 	unsigned int section_index = NETVSC_INVALID_INDEX;
 	u32 msg_size = 0;
 	struct sk_buff *skb;
+	u16 q_idx = packet->q_idx;
 
 
 	net_device = get_outbound_net_device(device);
@@ -781,24 +782,24 @@ int netvsc_send(struct hv_device *device,
 
 	if (ret == 0) {
 		atomic_inc(&net_device->num_outstanding_sends);
-		atomic_inc(&net_device->queue_sends[packet->q_idx]);
+		atomic_inc(&net_device->queue_sends[q_idx]);
 
 		if (hv_ringbuf_avail_percent(&out_channel->outbound) <
 			RING_AVAIL_PERCENT_LOWATER) {
 			netif_tx_stop_queue(netdev_get_tx_queue(
-					    ndev, packet->q_idx));
+					    ndev, q_idx));
 
 			if (atomic_read(&net_device->
-				queue_sends[packet->q_idx]) < 1)
+				queue_sends[q_idx]) < 1)
 				netif_tx_wake_queue(netdev_get_tx_queue(
-						    ndev, packet->q_idx));
+						    ndev, q_idx));
 		}
 	} else if (ret == -EAGAIN) {
 		netif_tx_stop_queue(netdev_get_tx_queue(
-				    ndev, packet->q_idx));
-		if (atomic_read(&net_device->queue_sends[packet->q_idx]) < 1) {
+				    ndev, q_idx));
+		if (atomic_read(&net_device->queue_sends[q_idx]) < 1) {
 			netif_tx_wake_queue(netdev_get_tx_queue(
-					    ndev, packet->q_idx));
+					    ndev, q_idx));
 			ret = -ENOSPC;
 		}
 	} else {
-- 
1.7.4.1

^ permalink raw reply related

* Re: [PATCH net-next] ipv4: igmp: fix v3 general query drop monitor false positive
From: Eric Dumazet @ 2014-10-05 17:12 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: davem, edumazet, netdev
In-Reply-To: <1412522870-26335-1-git-send-email-dborkman@redhat.com>

On Sun, 2014-10-05 at 17:27 +0200, Daniel Borkmann wrote:
> In case we find a general query with non-zero number of sources, we
> are dropping the skb as it's malformed.
> 
> RFC3376, section 4.1.8. Number of Sources (N):
> 
>   This number is zero in a General Query or a Group-Specific Query,
>   and non-zero in a Group-and-Source-Specific Query.
> 
> Therefore, reflect that by using kfree_skb() instead of consume_skb().
> 
> Fixes: d679c5324d9a ("igmp: avoid drop_monitor false positives")
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
> ---

Acked-by: Eric Dumazet <edumazet@google.com>

^ permalink raw reply

* [PATCH net-next] net: introduce netdevice gso_min_segs attribute
From: Eric Dumazet @ 2014-10-05 17:11 UTC (permalink / raw)
  To: Amir Vadai, David S. Miller
  Cc: Eric Dumazet, netdev, Yevgeny Petrilin, Or Gerlitz, Ido Shamay
In-Reply-To: <1412524252.11091.3.camel@edumazet-glaptop2.roam.corp.google.com>

From: Eric Dumazet <edumazet@google.com>

Some TSO engines might have a too heavy setup cost, that impacts
performance on hosts sending small bursts (2 MSS per packet).

This patch adds a device gso_min_segs, allowing drivers to set
a minimum segment size for TSO packets, according to the NIC
performance.

Tested on a mlx4 NIC, this allows to get a ~110% increase of
throughput when sending 2 MSS per packet.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
mlx4 patch will be sent later, its a one liner.

 include/linux/netdevice.h |    4 +++-
 net/core/dev.c            |    9 ++++++---
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 22d54b9b700d..2df86f50261c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1416,6 +1416,8 @@ enum netdev_priv_flags {
  *	@gso_max_size:	Maximum size of generic segmentation offload
  *	@gso_max_segs:	Maximum number of segments that can be passed to the
  *			NIC for GSO
+ *	@gso_min_segs:	Minimum number of segments that can be passed to the
+ *			NIC for GSO
  *
  *	@dcbnl_ops:	Data Center Bridging netlink ops
  *	@num_tc:	Number of traffic classes in the net device
@@ -1666,7 +1668,7 @@ struct net_device {
 	unsigned int		gso_max_size;
 #define GSO_MAX_SEGS		65535
 	u16			gso_max_segs;
-
+	u16			gso_min_segs;
 #ifdef CONFIG_DCB
 	const struct dcbnl_rtnl_ops *dcbnl_ops;
 #endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 1a90530f83ff..16e8ebbd3316 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2567,10 +2567,12 @@ static netdev_features_t harmonize_features(struct sk_buff *skb,
 
 netdev_features_t netif_skb_features(struct sk_buff *skb)
 {
+	const struct net_device *dev = skb->dev;
+	netdev_features_t features = dev->features;
+	u16 gso_segs = skb_shinfo(skb)->gso_segs;
 	__be16 protocol = skb->protocol;
-	netdev_features_t features = skb->dev->features;
 
-	if (skb_shinfo(skb)->gso_segs > skb->dev->gso_max_segs)
+	if (gso_segs > dev->gso_max_segs || gso_segs < dev->gso_min_segs)
 		features &= ~NETIF_F_GSO_MASK;
 
 	if (protocol == htons(ETH_P_8021Q) || protocol == htons(ETH_P_8021AD)) {
@@ -2581,7 +2583,7 @@ netdev_features_t netif_skb_features(struct sk_buff *skb)
 	}
 
 	features = netdev_intersect_features(features,
-					     skb->dev->vlan_features |
+					     dev->vlan_features |
 					     NETIF_F_HW_VLAN_CTAG_TX |
 					     NETIF_F_HW_VLAN_STAG_TX);
 
@@ -6658,6 +6660,7 @@ struct net_device *alloc_netdev_mqs(int sizeof_priv, const char *name,
 
 	dev->gso_max_size = GSO_MAX_SIZE;
 	dev->gso_max_segs = GSO_MAX_SEGS;
+	dev->gso_min_segs = 0;
 
 	INIT_LIST_HEAD(&dev->napi_list);
 	INIT_LIST_HEAD(&dev->unreg_list);

^ permalink raw reply related

* Re: r8168 is needed to enter P-state: Package State 6 (pc6) on Haswell hardware
From: Francois Romieu @ 2014-10-05 16:59 UTC (permalink / raw)
  To: Ceriel Jacobs; +Cc: Realtek linux nic maintainers, netdev
In-Reply-To: <542B3829.3010108@crashplan.pro>

Ceriel Jacobs <linux-ide@crashplan.pro> :
> With in-kernel r8169 module, only P-state package C3 (pc3) can be reached
> when enabling ASPM.
> 
> Only after installing r8168 driver (and enabling ASPM), a Haswell Celeron
> G1820/G1840 processer will enter package C6 state (pc6).

Vanilla kernel r8169 driver logs a message that contains an "XID" string.
Please grep for it in your dmesg so that I can narrow the search for 
meaningful differences in Realtek's r8168 driver.

Thanks.

-- 
Ueimor

^ permalink raw reply

* Re: [PATCH] team: add rescheduling jiffy delay on !rtnl_trylock
From: Tejun Heo @ 2014-10-05 16:11 UTC (permalink / raw)
  To: Paul E. McKenney; +Cc: Joe Lawrence, netdev, Jiri Pirko
In-Reply-To: <20141005140855.GO5015@linux.vnet.ibm.com>

Hello, Paul, Joe.

On Sun, Oct 05, 2014 at 07:08:55AM -0700, Paul E. McKenney wrote:
> > I wasn't sure if cond_resched_rcu_qs will be backported, so how should
> > -stable handle this condition?
> 
> If it is needed to backport a fix, it can of course be backported.
> The various -stable maintainers also have the option of avoiding
> the need for a backport by expanding it inline.  I am happy to let
> them choose.  ;-)

Let's go for separate cond_resched + rcu_qs calls first and then apply
a separate patch to use cond_resched_rcu_qs() so that the first one
can be marked for -stable.

Thanks.

-- 
tejun

^ permalink raw reply

* [PATCH 15/16] 9p/trans_virtio: enable VQs early
From: Michael S. Tsirkin @ 2014-10-05 16:07 UTC (permalink / raw)
  To: linux-kernel
  Cc: Eric Van Hensbergen, Ron Minnich, Latchesar Ionkov,
	David S. Miller, v9fs-developer, netdev
In-Reply-To: <1412525038-15871-1-git-send-email-mst@redhat.com>

virtio spec requires drivers to set DRIVER_OK before using VQs.
This is set automatically after probe returns, but virtio 9p device
adds self to channel list within probe, at which point VQ can be
used in violation of the spec.

To fix, call virtio_early_enable_vqs before using VQs.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 net/9p/trans_virtio.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/net/9p/trans_virtio.c b/net/9p/trans_virtio.c
index 6940d8f..0e8bbb3 100644
--- a/net/9p/trans_virtio.c
+++ b/net/9p/trans_virtio.c
@@ -575,6 +575,8 @@ static int p9_virtio_probe(struct virtio_device *vdev)
 	/* Ceiling limit to avoid denial of service attacks */
 	chan->p9_max_pages = nr_free_buffer_pages()/4;
 
+	virtio_early_enable_vqs(vdev);
+
 	mutex_lock(&virtio_9p_lock);
 	list_add_tail(&chan->chan_list, &virtio_chan_list);
 	mutex_unlock(&virtio_9p_lock);
-- 
MST

^ permalink raw reply related

* [PATCH 12/16] virtio_net: enable VQs early
From: Michael S. Tsirkin @ 2014-10-05 16:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rusty Russell, virtualization, netdev
In-Reply-To: <1412525038-15871-1-git-send-email-mst@redhat.com>

virtio spec requires drivers to set DRIVER_OK before using VQs.
This is set automatically after probe returns, virtio net violated this
rule by using receive VQs within probe.

To fix, call virtio_early_enable_vqs before using VQs.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 4c8c314..7afc990 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1792,6 +1792,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 		goto free_vqs;
 	}
 
+	virtio_early_enable_vqs(vdev);
+
 	/* Last of all, set up some receive buffers. */
 	for (i = 0; i < vi->curr_queue_pairs; i++) {
 		try_fill_recv(&vi->rq[i], GFP_KERNEL);
-- 
MST

^ permalink raw reply related

* [PATCH 11/16] virtio_net: minor cleanup
From: Michael S. Tsirkin @ 2014-10-05 16:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1412525038-15871-1-git-send-email-mst@redhat.com>

	goto done;
done:
	return;
is ugly, it was put there to make diff review easier.
replace by open-coded return.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index d80fef4..4c8c314 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1403,7 +1403,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
 
 	if (virtio_cread_feature(vi->vdev, VIRTIO_NET_F_STATUS,
 				 struct virtio_net_config, status, &v) < 0)
-		goto done;
+		return;
 
 	if (v & VIRTIO_NET_S_ANNOUNCE) {
 		netdev_notify_peers(vi->dev);
@@ -1414,7 +1414,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
 	v &= VIRTIO_NET_S_LINK_UP;
 
 	if (vi->status == v)
-		goto done;
+		return;
 
 	vi->status = v;
 
@@ -1425,8 +1425,6 @@ static void virtnet_config_changed_work(struct work_struct *work)
 		netif_carrier_off(vi->dev);
 		netif_tx_stop_all_queues(vi->dev);
 	}
-done:
-	return;
 }
 
 static void virtnet_config_changed(struct virtio_device *vdev)
-- 
MST

^ permalink raw reply related

* [PATCH 09/16] virtio-net: drop config_mutex
From: Michael S. Tsirkin @ 2014-10-05 16:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: netdev, virtualization
In-Reply-To: <1412525038-15871-1-git-send-email-mst@redhat.com>

config_mutex served two purposes: prevent multiple concurrent config
change handlers, and synchronize access to config_enable flag.

Since commit dbf2576e37da0fcc7aacbfbb9fd5d3de7888a3c1
    workqueue: make all workqueues non-reentrant
all workqueues are non-reentrant, and config_enable
is now gone.

Get rid of the unnecessary lock.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 7 +------
 1 file changed, 1 insertion(+), 6 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index fa17afa..d80fef4 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -132,9 +132,6 @@ struct virtnet_info {
 	/* Work struct for config space updates */
 	struct work_struct config_work;
 
-	/* Lock for config space updates */
-	struct mutex config_lock;
-
 	/* Does the affinity hint is set for virtqueues? */
 	bool affinity_hint_set;
 
@@ -1404,7 +1401,6 @@ static void virtnet_config_changed_work(struct work_struct *work)
 		container_of(work, struct virtnet_info, config_work);
 	u16 v;
 
-	mutex_lock(&vi->config_lock);
 	if (virtio_cread_feature(vi->vdev, VIRTIO_NET_F_STATUS,
 				 struct virtio_net_config, status, &v) < 0)
 		goto done;
@@ -1430,7 +1426,7 @@ static void virtnet_config_changed_work(struct work_struct *work)
 		netif_tx_stop_all_queues(vi->dev);
 	}
 done:
-	mutex_unlock(&vi->config_lock);
+	return;
 }
 
 static void virtnet_config_changed(struct virtio_device *vdev)
@@ -1751,7 +1747,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 		u64_stats_init(&virtnet_stats->rx_syncp);
 	}
 
-	mutex_init(&vi->config_lock);
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
-- 
MST

^ permalink raw reply related

* [PATCH 08/16] virtio_net: drop config_enable
From: Michael S. Tsirkin @ 2014-10-05 16:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rusty Russell, virtualization, netdev
In-Reply-To: <1412525038-15871-1-git-send-email-mst@redhat.com>

Now that virtio core ensures config changes don't arrive during probing,
drop config_enable flag in virtio net.
On removal, flush is now sufficient to guarantee that no change work is
queued.

This help simplify the driver, and will allow setting DRIVER_OK earlier
without losing config change notifications.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 23 ++---------------------
 1 file changed, 2 insertions(+), 21 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 59caa06..fa17afa 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -123,9 +123,6 @@ struct virtnet_info {
 	/* Host can handle any s/g split between our header and packet data */
 	bool any_header_sg;
 
-	/* enable config space updates */
-	bool config_enable;
-
 	/* Active statistics */
 	struct virtnet_stats __percpu *stats;
 
@@ -1408,9 +1405,6 @@ static void virtnet_config_changed_work(struct work_struct *work)
 	u16 v;
 
 	mutex_lock(&vi->config_lock);
-	if (!vi->config_enable)
-		goto done;
-
 	if (virtio_cread_feature(vi->vdev, VIRTIO_NET_F_STATUS,
 				 struct virtio_net_config, status, &v) < 0)
 		goto done;
@@ -1758,7 +1752,6 @@ static int virtnet_probe(struct virtio_device *vdev)
 	}
 
 	mutex_init(&vi->config_lock);
-	vi->config_enable = true;
 	INIT_WORK(&vi->config_work, virtnet_config_changed_work);
 
 	/* If we can receive ANY GSO packets, we must allocate large ones. */
@@ -1876,16 +1869,12 @@ static void virtnet_remove(struct virtio_device *vdev)
 	unregister_hotcpu_notifier(&vi->nb);
 
 	/* Prevent config work handler from accessing the device. */
-	mutex_lock(&vi->config_lock);
-	vi->config_enable = false;
-	mutex_unlock(&vi->config_lock);
+	flush_work(&vi->config_work);
 
 	unregister_netdev(vi->dev);
 
 	remove_vq_common(vi);
 
-	flush_work(&vi->config_work);
-
 	free_percpu(vi->stats);
 	free_netdev(vi->dev);
 }
@@ -1899,9 +1888,7 @@ static int virtnet_freeze(struct virtio_device *vdev)
 	unregister_hotcpu_notifier(&vi->nb);
 
 	/* Prevent config work handler from accessing the device */
-	mutex_lock(&vi->config_lock);
-	vi->config_enable = false;
-	mutex_unlock(&vi->config_lock);
+	flush_work(&vi->config_work);
 
 	netif_device_detach(vi->dev);
 	cancel_delayed_work_sync(&vi->refill);
@@ -1916,8 +1903,6 @@ static int virtnet_freeze(struct virtio_device *vdev)
 
 	remove_vq_common(vi);
 
-	flush_work(&vi->config_work);
-
 	return 0;
 }
 
@@ -1941,10 +1926,6 @@ static int virtnet_restore(struct virtio_device *vdev)
 
 	netif_device_attach(vi->dev);
 
-	mutex_lock(&vi->config_lock);
-	vi->config_enable = true;
-	mutex_unlock(&vi->config_lock);
-
 	rtnl_lock();
 	virtnet_set_queues(vi, vi->curr_queue_pairs);
 	rtnl_unlock();
-- 
MST

^ permalink raw reply related

* [PATCH 16/16] virtio_net: fix use after free on allocation failure
From: Michael S. Tsirkin @ 2014-10-05 16:07 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rusty Russell, virtualization, netdev
In-Reply-To: <1412525038-15871-1-git-send-email-mst@redhat.com>

In the extremely unlikely event that driver initialization fails after
RX buffers are added, virtio net frees RX buffers while VQs are
still active, potentially causing device to use a freed buffer.

To fix, reset device first - same as we do on device removal.

Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
---
 drivers/net/virtio_net.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 7afc990..85e6098 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -1830,6 +1830,8 @@ static int virtnet_probe(struct virtio_device *vdev)
 	return 0;
 
 free_recv_bufs:
+	vi->vdev->config->reset(vdev);
+
 	free_receive_bufs(vi);
 	unregister_netdev(dev);
 free_vqs:
-- 
MST

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox