[net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
@ 2014-10-06  0:06 John Fastabend
  2014-10-06  0:07 ` [net-next PATCH v1 2/3] net: sched: add direct ring acces via af_packet to ixgbe John Fastabend
                   ` (5 more replies)
  0 siblings, 6 replies; 36+ messages in thread
From: John Fastabend @ 2014-10-06  0:06 UTC (permalink / raw)
  To: dborkman, fw, gerlitz.or, hannes
  Cc: netdev, john.ronciak, amirv, eric.dumazet, danny.zhou

This patch adds a net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be directly manipulated from user space. For
raw packet interface this removes any overhead from the kernel network
stack.

Typically in an af_packet interface a packet_type handler is
registered and used to filter traffic to the socket and do other
things such as fan out traffic to multiple sockets. In this case the
networking stack is being bypassed so this code is not run. So the
hardware must push the correct traffic to the queues obtained from
the ndo callback ndo_split_queue_pairs().

Fortunately there is already a flow classification interface which
is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
currently supported by multiple drivers including sfc, mlx4, niu,
ixgbe, and i40e. Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support the interface, plus
the driver needs to implement the correct ndo ops. A follow on
patch adds support for ixgbe but we expect at least the subset of
drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.

The interface is driven over an af_packet socket which we believe
is the most natural interface to use. Because it is already used
for raw packet interfaces which is what we are providing here.
 The high level flow for this interface looks like:

	bind(fd, &sockaddr, sizeof(sockaddr));

	/* Get the device type and info */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
		   &optlen);

	/* With device info we can look up descriptor format */

	/* Get the layout of ring space offset, page_sz, cnt */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
		   &info, &optlen);

	/* request some queues from the driver */
	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* if we let the driver pick us queues learn which queues
         * we were given
         */
	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* And mmap queue pairs to user space */
	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
	     MAP_SHARED, fd, 0);

	/* Now we have some user space queues to read/write to*/

There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model because we are working directly with
driver queues the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO which
should provide enough details to extrapulate the descriptor formats.
Although this adds some complexity to user space it removes the
requirement to copy descriptor fields around.

The formats are usually provided by the device vendor documentation
If folks want I can provide a follow up patch to provide the formats
in a .h file in ./include/uapi/linux/ for ease of use. I have access
to formats for ixgbe and mlx drivers other driver owners would need to
provide their formats.

We tested this interface using traffic generators and doing basic
L2 forwarding tests on ixgbe devices. Our tests use a set of patches
to DPDK to enable an interface using this socket interfaace. With
this interface we can xmit/receive @ line rate from a test user space
application on a single core.

Additionally we have a set of DPDK patches to enable DPDK with this
interface. DPDK can be downloaded @ dpdk.org although as I hope is
clear from above DPDK is just our paticular test environment we
expect other libraries could be built on this interface.

Signed-off-by: Danny Zhou <danny.zhou@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/netdevice.h      |   63 ++++++++++++++
 include/uapi/linux/if_packet.h |   42 +++++++++
 net/packet/af_packet.c         |  181 ++++++++++++++++++++++++++++++++++++++++
 net/packet/internal.h          |    1 
 4 files changed, 287 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9f5d293..dae96dc2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -51,6 +51,8 @@
 #include <linux/neighbour.h>
 #include <uapi/linux/netdevice.h>
 
+#include <linux/if_packet.h>
+
 struct netpoll_info;
 struct device;
 struct phy_device;
@@ -997,6 +999,47 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  *	Callback to use for xmit over the accelerated station. This
  *	is used in place of ndo_start_xmit on accelerated net
  *	devices.
+ *
+ * int (*ndo_split_queue_pairs) (struct net_device *dev,
+ *				 unsigned int qpairs_start_from,
+ *				 unsigned int qpairs_num,
+ *				 struct sock *sk)
+ *	Called to request a set of queues from the driver to be
+ *	handed to the callee for management. After this returns the
+ *	driver will not use the queues. The call should return zero
+ *	on success otherwise an appropriate error code can be used.
+ *	If qpairs_start_from is less than zero driver can start at
+ *	any available slot.
+ *
+ * int (*ndo_get_queue_pairs)(struct net_device *dev,
+ *			      unsigned int *qpairs_start_from,
+ *			      unsigned int *qpairs_num,
+ *			      struct sock *sk);
+ *	Called to get the queues assigned by the driver to this sock
+ *	when ndo_split_queue_pairs does not specify a start_from and
+ *	qpairs_num field. Returns zero on success.
+ *
+ * int (*ndo_return_queue_pairs) (struct net_device *dev,
+ *				  struct sock *sk)
+ *	Called to return a set of queues identified by sock to
+ *	the driver. The socket must have previously requested
+ *	the queues via ndo_split_queue_pairs for this action to
+ *	be performed.
+ *
+ * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
+ *				struct tpacket_dev_qpair_map_region_info *info)
+ *	Called to return mapping of queue memory region
+ *
+ * int (*ndo_get_device_desc_info) (struct net_device *dev,
+ *				    struct tpacket_dev_info *dev_info)
+ *	Called to get device specific information. This should
+ *	uniquely identify the hardware so that descriptor formats
+ *	can be learned by the stack/user space.
+ *
+ * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
+ *				     struct net_device *dev)
+ *	Called to map queue pair range from split_queue_pairs into
+ *	mmap region.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1146,6 +1189,26 @@ struct net_device_ops {
 							struct net_device *dev,
 							void *priv);
 	int			(*ndo_get_lock_subclass)(struct net_device *dev);
+	int			(*ndo_split_queue_pairs)(struct net_device *dev,
+					 unsigned int qpairs_start_from,
+					 unsigned int qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_get_queue_pairs)(struct net_device *dev,
+					 unsigned int *qpairs_start_from,
+					 unsigned int *qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_return_queue_pairs)(
+					 struct net_device *dev,
+					 struct sock *sk);
+	int			(*ndo_get_device_qpair_map_region_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_qpair_map_region_info *info);
+	int			(*ndo_get_device_desc_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_info *dev_info);
+	int			(*ndo_direct_qpair_page_map)
+					(struct vm_area_struct *vma,
+					 struct net_device *dev);
 };
 
 /**
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index da2d668..fa94b65 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,10 @@ struct sockaddr_ll {
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
 #define PACKET_QDISC_BYPASS		20
+#define PACKET_RXTX_QPAIRS_SPLIT	21
+#define PACKET_RXTX_QPAIRS_RETURN	22
+#define PACKET_DEV_QPAIR_MAP_REGION_INFO	23
+#define PACKET_DEV_DESC_INFO		24
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
@@ -64,6 +68,44 @@ struct sockaddr_ll {
 #define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
 #define PACKET_FANOUT_FLAG_DEFRAG	0x8000
 
+#define MAX_MAP_MEMORY_REGIONS		64
+
+struct tpacket_dev_qpair_map_region_info {
+	unsigned int tp_dev_bar_sz;		/* size of BAR */
+	unsigned int tp_dev_sysm_sz;		/* size of systerm memory */
+	/* number of contiguous memory on BAR mapping to user space */
+	unsigned int tp_num_map_regions;
+	/* number of contiguous memory on system mapping to user apce */
+	unsigned int tp_num_sysm_map_regions;
+	struct map_page_region {
+		unsigned page_offset;		/* offset to start of region */
+		unsigned page_sz;		/* size of page */
+		unsigned page_cnt;		/* number of pages */
+	} regions[MAX_MAP_MEMORY_REGIONS];
+};
+
+#define PACKET_QPAIRS_START_ANY		-1
+
+struct tpacket_dev_qpairs_info {
+	unsigned int tp_qpairs_start_from;	/* qpairs index to start from */
+	unsigned int tp_qpairs_num;		/* number of qpairs */
+};
+
+struct tpacket_dev_info {
+	__u16		tp_device_id;
+	__u16		tp_vendor_id;
+	__u16		tp_subsystem_device_id;
+	__u16		tp_subsystem_vendor_id;
+	__u32		tp_numa_node;
+	__u32		tp_revision_id;
+	__u32		tp_num_total_qpairs;
+	__u32		tp_num_inuse_qpairs;
+	unsigned int	tp_rxdesc_size;	/* rx desc size in bytes */
+	__u16		tp_rxdesc_ver;
+	unsigned int	tp_txdesc_size;	/* tx desc size in bytes */
+	__u16		tp_txdesc_ver;
+};
+
 struct tpacket_stats {
 	unsigned int	tp_packets;
 	unsigned int	tp_drops;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 87d20f4..19b43ee 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -2611,6 +2611,14 @@ static int packet_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	preempt_enable();
 
+	if (po->tp_owns_queue_pairs) {
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (dev)
+			dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+	}
+
 	spin_lock(&po->bind_lock);
 	unregister_prot_hook(sk, false);
 	packet_cached_dev_reset(po);
@@ -3403,6 +3411,70 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
 		return 0;
 	}
+
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct tpacket_dev_qpairs_info qpairs;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs))
+			return -EINVAL;
+		if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
+			return -EFAULT;
+
+		/* Only allow one set of queues to be owned by userspace */
+		if (po->tp_owns_queue_pairs)
+			return -EBUSY;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		ops = dev->netdev_ops;
+		if (!ops->ndo_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err =  ops->ndo_split_queue_pairs(dev,
+						  qpairs.tp_qpairs_start_from,
+						  qpairs.tp_qpairs_num, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = true;
+
+		return err;
+	}
+
+	case PACKET_RXTX_QPAIRS_RETURN:
+	{
+		struct tpacket_dev_qpairs_info qpairs_info;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = false;
+
+		return err;
+	}
+
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3498,6 +3570,99 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_QDISC_BYPASS:
 		val = packet_use_direct_xmit(po);
 		break;
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_qpairs_info qpairs_info;
+		int err;
+
+		if (len != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		/* This call only works after a successful queue pairs split-off
+		 * operation via setsockopt()
+		 */
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err = dev->netdev_ops->ndo_get_queue_pairs(dev,
+					&qpairs_info.tp_qpairs_start_from,
+					&qpairs_info.tp_qpairs_num, sk);
+
+		lv = sizeof(qpairs_info);
+		data = &qpairs_info;
+		break;
+	}
+	case PACKET_DEV_QPAIR_MAP_REGION_INFO:
+	{
+		struct tpacket_dev_qpair_map_region_info info;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		ops = dev->netdev_ops;
+		if (!ops->ndo_get_device_qpair_map_region_info)
+			return -EOPNOTSUPP;
+
+		err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_qpair_map_region_info);
+		data = &info;
+		break;
+	}
+	case PACKET_DEV_DESC_INFO:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_info info;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only work after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_get_device_desc_info)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_info);
+		data = &info;
+		break;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3904,6 +4069,21 @@ static int packet_mmap(struct file *file, struct socket *sock,
 
 	mutex_lock(&po->pg_vec_lock);
 
+	if (po->tp_owns_queue_pairs) {
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		ops = dev->netdev_ops;
+		err = ops->ndo_direct_qpair_page_map(vma, dev);
+		if (err)
+			goto out;
+		goto done;
+	}
+
 	expected_size = 0;
 	for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
 		if (rb->pg_vec) {
@@ -3941,6 +4121,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
 		}
 	}
 
+done:
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index cdddf6a..55cadbc 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -113,6 +113,7 @@ struct packet_sock {
 	unsigned int		tp_reserve;
 	unsigned int		tp_loss:1;
 	unsigned int		tp_tx_has_off:1;
+	unsigned int		tp_owns_queue_pairs:1;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [net-next PATCH v1 2/3] net: sched: add direct ring acces via af_packet to ixgbe
  2014-10-06  0:06 [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access John Fastabend
@ 2014-10-06  0:07 ` John Fastabend
  2014-10-06  0:07 ` [net-next PATCH v1 3/3] net: packet: Document PACKET_DEV_QPAIR_SPLIT and friends John Fastabend
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: John Fastabend @ 2014-10-06  0:07 UTC (permalink / raw)
  To: dborkman, fw, gerlitz.or, hannes
  Cc: netdev, john.ronciak, amirv, eric.dumazet, danny.zhou

This implements the necessary ndo ops to support the af_packet
interface to directly own and manipulate queues.

Signed-off-by: Danny Zhou <danny.zhou@intel.com>
Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |    3 
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   23 ++
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  232 ++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h    |    1 
 4 files changed, 251 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 673d820..2f6eadf 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -678,6 +678,9 @@ struct ixgbe_adapter {
 
 	struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];
 
+	/* Direct User Space Queues */
+	struct sock *sk_handles[MAX_RX_QUEUES];
+
 	/* DCB parameters */
 	struct ieee_pfc *ixgbe_ieee_pfc;
 	struct ieee_ets *ixgbe_ieee_ets;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index cff383b..01a6e55 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2581,12 +2581,17 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 	if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
 		return -EOPNOTSUPP;
 
+	if (fsp->ring_cookie > MAX_RX_QUEUES)
+		return -EINVAL;
+
 	/*
 	 * Don't allow programming if the action is a queue greater than
-	 * the number of online Rx queues.
+	 * the number of online Rx queues unless it is a user space
+	 * queue.
 	 */
 	if ((fsp->ring_cookie != RX_CLS_FLOW_DISC) &&
-	    (fsp->ring_cookie >= adapter->num_rx_queues))
+	    (fsp->ring_cookie >= adapter->num_rx_queues) &&
+	    !adapter->sk_handles[fsp->ring_cookie])
 		return -EINVAL;
 
 	/* Don't allow indexes to exist outside of available space */
@@ -2663,12 +2668,18 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 	/* apply mask and compute/store hash */
 	ixgbe_atr_compute_perfect_hash_82599(&input->filter, &mask);
 
+	/* Set input action to reg_idx for driver owned queues otherwise
+	 * use the absolute index for user space queues.
+	 */
+	if (fsp->ring_cookie < adapter->num_rx_queues &&
+	    fsp->ring_cookie != IXGBE_FDIR_DROP_QUEUE)
+		input->action = adapter->rx_ring[input->action]->reg_idx;
+
 	/* program filters to filter memory */
 	err = ixgbe_fdir_write_perfect_filter_82599(hw,
-				&input->filter, input->sw_idx,
-				(input->action == IXGBE_FDIR_DROP_QUEUE) ?
-				IXGBE_FDIR_DROP_QUEUE :
-				adapter->rx_ring[input->action]->reg_idx);
+						    &input->filter,
+						    input->sw_idx,
+						    input->action);
 	if (err)
 		goto err_out_w_lock;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 06ef5a3..6506550 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -48,7 +48,9 @@
 #include <linux/if_macvlan.h>
 #include <linux/if_bridge.h>
 #include <linux/prefetch.h>
+#include <linux/mm.h>
 #include <scsi/fc/fc_fcoe.h>
+#include <linux/if_packet.h>
 
 #include "ixgbe.h"
 #include "ixgbe_common.h"
@@ -70,6 +72,8 @@ const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
 				"Copyright (c) 1999-2014 Intel Corporation.";
 
+static unsigned int *dummy_page_buf;
+
 static const struct ixgbe_info *ixgbe_info_tbl[] = {
 	[board_82598] = &ixgbe_82598_info,
 	[board_82599] = &ixgbe_82599_info,
@@ -3122,6 +3126,16 @@ static void ixgbe_enable_rx_drop(struct ixgbe_adapter *adapter,
 	IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(reg_idx), srrctl);
 }
 
+static bool ixgbe_have_user_queues(struct ixgbe_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < MAX_RX_QUEUES; i++)
+		if (adapter->sk_handles[i])
+			return true;
+	return false;
+}
+
 static void ixgbe_disable_rx_drop(struct ixgbe_adapter *adapter,
 				  struct ixgbe_ring *ring)
 {
@@ -3156,7 +3170,8 @@ static void ixgbe_set_rx_drop_en(struct ixgbe_adapter *adapter)
 	 *  and performance reasons.
 	 */
 	if (adapter->num_vfs || (adapter->num_rx_queues > 1 &&
-	    !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en)) {
+	    !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en) ||
+	    ixgbe_have_user_queues(adapter)) {
 		for (i = 0; i < adapter->num_rx_queues; i++)
 			ixgbe_enable_rx_drop(adapter, adapter->rx_ring[i]);
 	} else {
@@ -7812,6 +7827,210 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 	kfree(fwd_adapter);
 }
 
+static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
+				       unsigned int start_from,
+				       unsigned int qpairs_num,
+				       struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+
+	/* allocate whatever availiable qpairs */
+	if (start_from == PACKET_QPAIRS_START_ANY) {
+		unsigned int count = 0;
+
+		for (qpair_index = adapter->num_rx_queues;
+		     qpair_index < MAX_RX_QUEUES;
+		     qpair_index++) {
+			if (!adapter->sk_handles[qpair_index]) {
+				count++;
+				if (count == qpairs_num) {
+					start_from = qpair_index - count + 1;
+					break;
+				}
+			} else {
+				count = 0;
+			}
+		}
+	}
+
+	/* otherwise the caller specified exact queues */
+	if ((start_from > MAX_TX_QUEUES) ||
+	    (start_from > MAX_RX_QUEUES) ||
+	    (start_from + qpairs_num > MAX_TX_QUEUES) ||
+	    (start_from + qpairs_num > MAX_RX_QUEUES))
+		return -EINVAL;
+
+	/* If the qpairs are being used by the driver do not let user space
+	 * consume the queues. Also if the queue has already been allocated
+	 * to a socket do fail the request.
+	 */
+	for (qpair_index = start_from;
+	     qpair_index < start_from + qpairs_num;
+	     qpair_index++) {
+		if ((qpair_index < adapter->num_tx_queues) ||
+		    (qpair_index < adapter->num_rx_queues))
+			return -EINVAL;
+
+		if (adapter->sk_handles[qpair_index] != NULL)
+			return -EBUSY;
+	}
+
+	/* remember the sk handle for each queue pair */
+	for (qpair_index = start_from;
+	     qpair_index < start_from + qpairs_num;
+	     qpair_index++)
+		adapter->sk_handles[qpair_index] = sk;
+
+	return start_from;
+}
+
+static int ixgbe_ndo_get_queue_pairs(struct net_device *dev,
+				     unsigned int *start_from,
+				     unsigned int *qpairs_num,
+				     struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+
+	*qpairs_num = 0;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		if (adapter->sk_handles[qpair_index] == sk) {
+			if (*qpairs_num == 0)
+				*start_from = qpair_index;
+			*qpairs_num = *qpairs_num + 1;
+		}
+	}
+
+	return 0;
+}
+
+static int ixgbe_ndo_return_queue_pairs(struct net_device *dev, struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_TX_QUEUES;
+	     qpair_index++) {
+		if (adapter->sk_handles[qpair_index] == sk)
+			adapter->sk_handles[qpair_index] = NULL;
+	}
+
+	return 0;
+}
+
+/* Rx descriptor starts from 0x1000 and Tx descriptor starts from 0x6000
+ * both the TX and RX descriptors use 4K pages.
+ */
+#define RX_DESC_ADDR_OFFSET		0x1000
+#define TX_DESC_ADDR_OFFSET		0x6000
+#define PAGE_SIZE_4K			4096
+
+static int
+ixgbe_ndo_qpair_map_region(struct net_device *dev,
+			   struct tpacket_dev_qpair_map_region_info *info)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+	/* no need to map systme memory to userspace for ixgbe */
+	info->tp_dev_sysm_sz = 0;
+	info->tp_num_sysm_map_regions = 0;
+
+	info->tp_dev_bar_sz = pci_resource_len(adapter->pdev, 0);
+	info->tp_num_map_regions = 2;
+
+	info->regions[0].page_offset = RX_DESC_ADDR_OFFSET;
+	info->regions[0].page_sz = PAGE_SIZE;
+	info->regions[0].page_cnt = 1;
+	info->regions[1].page_offset = TX_DESC_ADDR_OFFSET;
+	info->regions[1].page_sz = PAGE_SIZE;
+	info->regions[1].page_cnt = 1;
+
+	return 0;
+}
+
+static int ixgbe_ndo_get_device_desc_info(struct net_device *dev,
+					  struct tpacket_dev_info *dev_info)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	int max_queues;
+
+	max_queues = max(adapter->num_rx_queues, adapter->num_tx_queues);
+
+	dev_info->tp_device_id = adapter->hw.device_id;
+	dev_info->tp_vendor_id = adapter->hw.vendor_id;
+	dev_info->tp_subsystem_device_id = adapter->hw.subsystem_device_id;
+	dev_info->tp_subsystem_vendor_id = adapter->hw.subsystem_vendor_id;
+	dev_info->tp_revision_id = adapter->hw.revision_id;
+	dev_info->tp_numa_node = dev_to_node(&dev->dev);
+
+	dev_info->tp_num_total_qpairs = min(MAX_RX_QUEUES, MAX_TX_QUEUES);
+	dev_info->tp_num_inuse_qpairs = max_queues;
+
+	dev_info->tp_rxdesc_size = sizeof(union ixgbe_adv_rx_desc);
+	dev_info->tp_rxdesc_ver = 1;
+	dev_info->tp_txdesc_size = sizeof(union ixgbe_adv_tx_desc);
+	dev_info->tp_txdesc_ver = 1;
+
+	return 0;
+}
+
+static int
+ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
+	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+	unsigned long dummy_page_phy;
+	pgprot_t pre_vm_page_prot;
+	unsigned long start;
+	unsigned int i;
+	int err;
+
+	if (!dummy_page_buf) {
+		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
+		if (!dummy_page_buf)
+			return -ENOMEM;
+
+		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
+			dummy_page_buf[i] = 0xdeadbeef;
+	}
+
+	dummy_page_phy = virt_to_phys(dummy_page_buf);
+	pre_vm_page_prot = vma->vm_page_prot;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	/* assume the vm_start is 4K aligned address */
+	for (start = vma->vm_start;
+	     start < vma->vm_end;
+	     start += PAGE_SIZE_4K) {
+		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
+			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
+					      vma->vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
+			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
+					      vma->vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		} else {
+			unsigned long addr = dummy_page_phy > PAGE_SHIFT;
+
+			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
+					      pre_vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		}
+	}
+	return 0;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -7856,6 +8075,12 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_bridge_getlink	= ixgbe_ndo_bridge_getlink,
 	.ndo_dfwd_add_station	= ixgbe_fwd_add,
 	.ndo_dfwd_del_station	= ixgbe_fwd_del,
+	.ndo_split_queue_pairs	= ixgbe_ndo_split_queue_pairs,
+	.ndo_get_queue_pairs	= ixgbe_ndo_get_queue_pairs,
+	.ndo_return_queue_pairs = ixgbe_ndo_return_queue_pairs,
+	.ndo_get_device_qpair_map_region_info = ixgbe_ndo_qpair_map_region,
+	.ndo_get_device_desc_info  = ixgbe_ndo_get_device_desc_info,
+	.ndo_direct_qpair_page_map = ixgbe_ndo_qpair_page_map,
 };
 
 /**
@@ -8054,7 +8279,9 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	hw->back = adapter;
 	adapter->msg_enable = netif_msg_init(debug, DEFAULT_MSG_ENABLE);
 
-	hw->hw_addr = ioremap(pci_resource_start(pdev, 0),
+	hw->pci_hw_addr = pci_resource_start(pdev, 0);
+
+	hw->hw_addr = ioremap(hw->pci_hw_addr,
 			      pci_resource_len(pdev, 0));
 	adapter->io_addr = hw->hw_addr;
 	if (!hw->hw_addr) {
@@ -8705,6 +8932,7 @@ module_init(ixgbe_init_module);
  **/
 static void __exit ixgbe_exit_module(void)
 {
+	kfree(dummy_page_buf);
 #ifdef CONFIG_IXGBE_DCA
 	dca_unregister_notify(&dca_notifier);
 #endif
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index dfd55d8..26e9163 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -3022,6 +3022,7 @@ struct ixgbe_mbx_info {
 
 struct ixgbe_hw {
 	u8 __iomem			*hw_addr;
+	phys_addr_t			pci_hw_addr;
 	void				*back;
 	struct ixgbe_mac_info		mac;
 	struct ixgbe_addr_filter_info	addr_ctrl;

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* [net-next PATCH v1 3/3] net: packet: Document PACKET_DEV_QPAIR_SPLIT and friends
  2014-10-06  0:06 [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access John Fastabend
  2014-10-06  0:07 ` [net-next PATCH v1 2/3] net: sched: add direct ring acces via af_packet to ixgbe John Fastabend
@ 2014-10-06  0:07 ` John Fastabend
  2014-10-06  0:29 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Florian Westphal
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 36+ messages in thread
From: John Fastabend @ 2014-10-06  0:07 UTC (permalink / raw)
  To: dborkman, fw, gerlitz.or, hannes
  Cc: netdev, john.ronciak, amirv, eric.dumazet, danny.zhou

This adds a section to the packet interface kernel documentation
describing the set of socket options to get direct queue
assignment working.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 Documentation/networking/packet_mmap.txt |   44 ++++++++++++++++++++++++++++++
 1 file changed, 44 insertions(+)

diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt
index a6d7cb9..ad26194 100644
--- a/Documentation/networking/packet_mmap.txt
+++ b/Documentation/networking/packet_mmap.txt
@@ -1047,6 +1047,50 @@ See include/linux/net_tstamp.h and Documentation/networking/timestamping
 for more information on hardware timestamps.
 
 -------------------------------------------------------------------------------
++ PACKET_RXTX_QPAIRS_SPLIT and friends
+-------------------------------------------------------------------------------
+
+The PACKET_RXTX_QPAIRS_SLIT setting allows direct access to the hardware
+packet rings. If your NIC is capable of supporting hardware packet steering
+and the driver has this feature enabled you can use the hardware to steer
+packets directly to user mapped memory and use user space descriptor rings.
+
+The user space flow should be,
+
+	bind(fd, &sockaddr, sizeof(sockaddr));
+
+	/* Get the device type and info */
+	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
+		   &optlen);
+
+	/* With device info we can look up descriptor format */
+
+	/* Get the layout of ring space offset, page_sz, cnt */
+	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
+                   &info, &optlen);
+
+        /* request some queues from the driver */
+	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
+		   &qpairs_info, sizeof(qpairs_info));
+
+	/* if we let the driver pick us queues learn which queues
+	 * we were given
+	 */
+	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
+		   &qpairs_info, sizeof(qpairs_info));
+
+	/* And mmap queue pairs to user space */
+	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
+	     MAP_SHARED, fd, 0);
+
+	/* Now we have some user space queues to read/write to*/
+
+After this user space can directly manipulate the drivers descriptor rings.
+The descriptor rings use the native descriptor format of the hardware device.
+The device specifics are returned from the PACKET_DEV_DESC_INFO call which
+allows user space to determine the correct descriptor format to use.
+
+-------------------------------------------------------------------------------
 + Miscellaneous bits
 -------------------------------------------------------------------------------
 

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  0:06 [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access John Fastabend
  2014-10-06  0:07 ` [net-next PATCH v1 2/3] net: sched: add direct ring acces via af_packet to ixgbe John Fastabend
  2014-10-06  0:07 ` [net-next PATCH v1 3/3] net: packet: Document PACKET_DEV_QPAIR_SPLIT and friends John Fastabend
@ 2014-10-06  0:29 ` Florian Westphal
  2014-10-06  1:09   ` David Miller
  2014-10-06  1:12   ` John Fastabend
  2014-10-06 16:55 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Stephen Hemminger
                   ` (2 subsequent siblings)
  5 siblings, 2 replies; 36+ messages in thread
From: Florian Westphal @ 2014-10-06  0:29 UTC (permalink / raw)
  To: John Fastabend
  Cc: dborkman, fw, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

John Fastabend <john.fastabend@gmail.com> wrote:
> There is one critical difference when running with these interfaces
> vs running without them. In the normal case the af_packet module
> uses a standard descriptor format exported by the af_packet user
> space headers. In this model because we are working directly with
> driver queues the descriptor format maps to the descriptor format
> used by the device. User space applications can learn device
> information from the socket option PACKET_DEV_DESC_INFO which
> should provide enough details to extrapulate the descriptor formats.
> Although this adds some complexity to user space it removes the
> requirement to copy descriptor fields around.

I find it very disappointing that we seem to have to expose such
hardware specific details to userspace via hw-independent interface.

How big of a cost are we talking about when you say that it 'removes
the requirement to copy descriptor fields'?

Thanks,
Florian

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  0:29 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Florian Westphal
@ 2014-10-06  1:09   ` David Miller
  2014-10-06  1:18     ` John Fastabend
  2014-10-06  1:12   ` John Fastabend
  1 sibling, 1 reply; 36+ messages in thread
From: David Miller @ 2014-10-06  1:09 UTC (permalink / raw)
  To: fw
  Cc: john.fastabend, dborkman, gerlitz.or, hannes, netdev,
	john.ronciak, amirv, eric.dumazet, danny.zhou

From: Florian Westphal <fw@strlen.de>
Date: Mon, 6 Oct 2014 02:29:51 +0200

> John Fastabend <john.fastabend@gmail.com> wrote:
>> There is one critical difference when running with these interfaces
>> vs running without them. In the normal case the af_packet module
>> uses a standard descriptor format exported by the af_packet user
>> space headers. In this model because we are working directly with
>> driver queues the descriptor format maps to the descriptor format
>> used by the device. User space applications can learn device
>> information from the socket option PACKET_DEV_DESC_INFO which
>> should provide enough details to extrapulate the descriptor formats.
>> Although this adds some complexity to user space it removes the
>> requirement to copy descriptor fields around.
> 
> I find it very disappointing that we seem to have to expose such
> hardware specific details to userspace via hw-independent interface.
> 
> How big of a cost are we talking about when you say that it 'removes
> the requirement to copy descriptor fields'?

FWIW, it also avoids the domain switch (which is just a fancy way
to refer to performing the system call), both in and out.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  1:09   ` David Miller
@ 2014-10-06  1:18     ` John Fastabend
  0 siblings, 0 replies; 36+ messages in thread
From: John Fastabend @ 2014-10-06  1:18 UTC (permalink / raw)
  To: David Miller
  Cc: fw, dborkman, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

On 10/05/2014 06:09 PM, David Miller wrote:
> From: Florian Westphal <fw@strlen.de>
> Date: Mon, 6 Oct 2014 02:29:51 +0200
>
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>> There is one critical difference when running with these interfaces
>>> vs running without them. In the normal case the af_packet module
>>> uses a standard descriptor format exported by the af_packet user
>>> space headers. In this model because we are working directly with
>>> driver queues the descriptor format maps to the descriptor format
>>> used by the device. User space applications can learn device
>>> information from the socket option PACKET_DEV_DESC_INFO which
>>> should provide enough details to extrapulate the descriptor formats.
>>> Although this adds some complexity to user space it removes the
>>> requirement to copy descriptor fields around.
>>
>> I find it very disappointing that we seem to have to expose such
>> hardware specific details to userspace via hw-independent interface.
>>
>> How big of a cost are we talking about when you say that it 'removes
>> the requirement to copy descriptor fields'?
>
> FWIW, it also avoids the domain switch (which is just a fancy way
> to refer to performing the system call), both in and out.
>

Right, my description could have been better and called this out.

Thanks.

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  0:29 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Florian Westphal
  2014-10-06  1:09   ` David Miller
@ 2014-10-06  1:12   ` John Fastabend
  2014-10-06  9:49     ` Daniel Borkmann
  1 sibling, 1 reply; 36+ messages in thread
From: John Fastabend @ 2014-10-06  1:12 UTC (permalink / raw)
  To: Florian Westphal
  Cc: dborkman, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

On 10/05/2014 05:29 PM, Florian Westphal wrote:
> John Fastabend <john.fastabend@gmail.com> wrote:
>> There is one critical difference when running with these interfaces
>> vs running without them. In the normal case the af_packet module
>> uses a standard descriptor format exported by the af_packet user
>> space headers. In this model because we are working directly with
>> driver queues the descriptor format maps to the descriptor format
>> used by the device. User space applications can learn device
>> information from the socket option PACKET_DEV_DESC_INFO which
>> should provide enough details to extrapulate the descriptor formats.
>> Although this adds some complexity to user space it removes the
>> requirement to copy descriptor fields around.
>
> I find it very disappointing that we seem to have to expose such
> hardware specific details to userspace via hw-independent interface.
>

Well it was only for convenience if it doesn't fit as a socket
option we can remove it. We can look up the device using the netdev
name from the bind call. I see your point though so if there is
consensus that this is not needed that is fine.

> How big of a cost are we talking about when you say that it 'removes
> the requirement to copy descriptor fields'?
>

This was likely a poor description. If you want to let user space
poll on the ring (without using system calls or interrupts) then
I don't see how you can _not_ expose the ring directly complete with
the vendor descriptor formats.

> Thanks,
> Florian
>


-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  1:12   ` John Fastabend
@ 2014-10-06  9:49     ` Daniel Borkmann
  2014-10-06 15:01       ` John Fastabend
  0 siblings, 1 reply; 36+ messages in thread
From: Daniel Borkmann @ 2014-10-06  9:49 UTC (permalink / raw)
  To: John Fastabend
  Cc: Florian Westphal, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

Hi John,

On 10/06/2014 03:12 AM, John Fastabend wrote:
> On 10/05/2014 05:29 PM, Florian Westphal wrote:
>> John Fastabend <john.fastabend@gmail.com> wrote:
>>> There is one critical difference when running with these interfaces
>>> vs running without them. In the normal case the af_packet module
>>> uses a standard descriptor format exported by the af_packet user
>>> space headers. In this model because we are working directly with
>>> driver queues the descriptor format maps to the descriptor format
>>> used by the device. User space applications can learn device
>>> information from the socket option PACKET_DEV_DESC_INFO which
>>> should provide enough details to extrapulate the descriptor formats.
>>> Although this adds some complexity to user space it removes the
>>> requirement to copy descriptor fields around.
>>
>> I find it very disappointing that we seem to have to expose such
>> hardware specific details to userspace via hw-independent interface.
>
> Well it was only for convenience if it doesn't fit as a socket
> option we can remove it. We can look up the device using the netdev
> name from the bind call. I see your point though so if there is
> consensus that this is not needed that is fine.
>
>> How big of a cost are we talking about when you say that it 'removes
>> the requirement to copy descriptor fields'?
>
> This was likely a poor description. If you want to let user space
> poll on the ring (without using system calls or interrupts) then
> I don't see how you can _not_ expose the ring directly complete with
> the vendor descriptor formats.

But how big is the concrete performance degradation you're seeing if you
use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
that does *not* directly expose hw descriptor formats to user space?

With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
on 40G when run on decent hardware though.

It would really be great if we have something vendor neutral exposed as
a stable ABI and could leverage emerging infrastructure we already have
in the kernel such as eBPF and recent qdisc batching for raw sockets
instead of reinventing the wheels. (Don't get me wrong, I would love to
see AF_PACKET improved ...)

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  9:49     ` Daniel Borkmann
@ 2014-10-06 15:01       ` John Fastabend
  2014-10-06 16:35         ` Jesper Dangaard Brouer
  2014-10-06 17:03         ` Hannes Frederic Sowa
  0 siblings, 2 replies; 36+ messages in thread
From: John Fastabend @ 2014-10-06 15:01 UTC (permalink / raw)
  To: Daniel Borkmann, John Fastabend, Jesper Dangaard Brouer,
	John W. Linville, Neil Horman
  Cc: Florian Westphal, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

On 10/06/2014 02:49 AM, Daniel Borkmann wrote:
> Hi John,
> 
> On 10/06/2014 03:12 AM, John Fastabend wrote:
>> On 10/05/2014 05:29 PM, Florian Westphal wrote:
>>> John Fastabend <john.fastabend@gmail.com> wrote:
>>>> There is one critical difference when running with these interfaces
>>>> vs running without them. In the normal case the af_packet module
>>>> uses a standard descriptor format exported by the af_packet user
>>>> space headers. In this model because we are working directly with
>>>> driver queues the descriptor format maps to the descriptor format
>>>> used by the device. User space applications can learn device
>>>> information from the socket option PACKET_DEV_DESC_INFO which
>>>> should provide enough details to extrapulate the descriptor formats.
>>>> Although this adds some complexity to user space it removes the
>>>> requirement to copy descriptor fields around.
>>>
>>> I find it very disappointing that we seem to have to expose such
>>> hardware specific details to userspace via hw-independent interface.
>>
>> Well it was only for convenience if it doesn't fit as a socket
>> option we can remove it. We can look up the device using the netdev
>> name from the bind call. I see your point though so if there is
>> consensus that this is not needed that is fine.
>>
>>> How big of a cost are we talking about when you say that it 'removes
>>> the requirement to copy descriptor fields'?
>>
>> This was likely a poor description. If you want to let user space
>> poll on the ring (without using system calls or interrupts) then
>> I don't see how you can _not_ expose the ring directly complete with
>> the vendor descriptor formats.
> 
> But how big is the concrete performance degradation you're seeing if you
> use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
> that does *not* directly expose hw descriptor formats to user space?

If we don't directly expose the hardware descriptor formats then we
need to somehow kick the driver when we want it to do the copy from
the driver descriptor format to the common descriptor format.

This requires a system call as far as I can tell. Which has unwanted
overhead. I can micro-benchmark this if its helpful. But if we dredge
up Jesper's slides here we are really counting cycles so even small
numbers count if we want to hit line rate in a user space application
with 40Gpbs hardware.

> 
> With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
> on 40G when run on decent hardware though.
> 
> It would really be great if we have something vendor neutral exposed as
> a stable ABI and could leverage emerging infrastructure we already have
> in the kernel such as eBPF and recent qdisc batching for raw sockets
> instead of reinventing the wheels. (Don't get me wrong, I would love to
> see AF_PACKET improved ...)

I don't think the interface is vendor specific. It does require some
knowledge of the hardware descriptor layout though. It is though vendor
neutral from my point of view. I provided the ixgbe patch simple because
I'm most familiar with it and have a NIC here. If someone wants to send me
a Mellanox NIC I can give it a try although I was hoping to recruit Or or
Amir? The only hardware feature required is flow classification to queues
which seems to be common across 10Gbps and 40/100Gbps devices. So most
of the drivers should be able to support this.

If your worried driver writers will implement the interface but not make
their descriptor formats easily available I considered putting the layout
in a header file in the uapi somewhere. Then we could just reject any
implementation that doesn't include the header file needed to use it
from user space.

With regards to leveraging eBPF and qdisc batching I don't see how this
works with direct DMA and polling. Needed to give the lowest overhead
between kernel and user space. In this case we want to use the hardware
to do the filtering that would normally be done for eBPF and for many
use cases the hardware flow classifiers is sufficient. 

We already added a qdisc bypass option I see this as taking this path
further. I believe there is room for a continuum here. For basic cases
use af_packet v1,v2 for mmap rings but using common descriptors use
af_packet v3 and set QOS_BYASS. For absolute lowest overhead and
specific applications that don't need QOS, eBPF use this interface.

Thanks.

> 
> Thanks,
> Daniel
> -- 
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06 15:01       ` John Fastabend
@ 2014-10-06 16:35         ` Jesper Dangaard Brouer
  2014-10-06 17:03         ` Hannes Frederic Sowa
  1 sibling, 0 replies; 36+ messages in thread
From: Jesper Dangaard Brouer @ 2014-10-06 16:35 UTC (permalink / raw)
  To: John Fastabend
  Cc: brouer, Daniel Borkmann, John Fastabend, John W. Linville,
	Neil Horman, Florian Westphal, gerlitz.or, hannes, netdev,
	john.ronciak, amirv, eric.dumazet, danny.zhou


On Mon, 06 Oct 2014 08:01:52 -0700 John Fastabend <john.r.fastabend@intel.com> wrote:
 
> This requires a system call as far as I can tell. Which has unwanted
> overhead. I can micro-benchmark this if its helpful. But if we dredge
> up Jesper's slides here we are really counting cycles so even small
> numbers count if we want to hit line rate in a user space application
> with 40Gpbs hardware.

The micro-benchmarked syscall[2] cost is approx 42 ns [1] (when
disabling CONFIG_AUDITSYSCALL else its approx 88ns), which is
significant compared to the 10G wirespeed smallest packet size budget
of 67.2ns.

See:
 [1] http://netoptimizer.blogspot.dk/2014/05/the-calculations-10gbits-wirespeed.html
 [2] https://github.com/netoptimizer/network-testing/blob/master/src/syscall_overhead.c

[...] 
> We already added a qdisc bypass option I see this as taking this path
> further. I believe there is room for a continuum here. For basic cases
> use af_packet v1,v2 for mmap rings but using common descriptors use
> af_packet v3 and set QOS_BYASS. For absolute lowest overhead and
> specific applications that don't need QOS, eBPF use this interface.

Well, after the qdisc bulking changes, when bulking kicks in then the
qdisc path is faster than the qdisc bypass (measured with trafgen).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06 15:01       ` John Fastabend
  2014-10-06 16:35         ` Jesper Dangaard Brouer
@ 2014-10-06 17:03         ` Hannes Frederic Sowa
  2014-10-06 20:37           ` John Fastabend
  1 sibling, 1 reply; 36+ messages in thread
From: Hannes Frederic Sowa @ 2014-10-06 17:03 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, John Fastabend, Jesper Dangaard Brouer,
	John W. Linville, Neil Horman, Florian Westphal, gerlitz.or,
	netdev, john.ronciak, amirv, eric.dumazet, danny.zhou

Hi John,

On Mo, 2014-10-06 at 08:01 -0700, John Fastabend wrote:
> On 10/06/2014 02:49 AM, Daniel Borkmann wrote:
> > Hi John,
> > 
> > On 10/06/2014 03:12 AM, John Fastabend wrote:
> >> On 10/05/2014 05:29 PM, Florian Westphal wrote:
> >>> John Fastabend <john.fastabend@gmail.com> wrote:
> >>>> There is one critical difference when running with these interfaces
> >>>> vs running without them. In the normal case the af_packet module
> >>>> uses a standard descriptor format exported by the af_packet user
> >>>> space headers. In this model because we are working directly with
> >>>> driver queues the descriptor format maps to the descriptor format
> >>>> used by the device. User space applications can learn device
> >>>> information from the socket option PACKET_DEV_DESC_INFO which
> >>>> should provide enough details to extrapulate the descriptor formats.
> >>>> Although this adds some complexity to user space it removes the
> >>>> requirement to copy descriptor fields around.
> >>>
> >>> I find it very disappointing that we seem to have to expose such
> >>> hardware specific details to userspace via hw-independent interface.
> >>
> >> Well it was only for convenience if it doesn't fit as a socket
> >> option we can remove it. We can look up the device using the netdev
> >> name from the bind call. I see your point though so if there is
> >> consensus that this is not needed that is fine.
> >>
> >>> How big of a cost are we talking about when you say that it 'removes
> >>> the requirement to copy descriptor fields'?
> >>
> >> This was likely a poor description. If you want to let user space
> >> poll on the ring (without using system calls or interrupts) then
> >> I don't see how you can _not_ expose the ring directly complete with
> >> the vendor descriptor formats.
> > 
> > But how big is the concrete performance degradation you're seeing if you
> > use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
> > that does *not* directly expose hw descriptor formats to user space?
> 
> If we don't directly expose the hardware descriptor formats then we
> need to somehow kick the driver when we want it to do the copy from
> the driver descriptor format to the common descriptor format.
>
> This requires a system call as far as I can tell. Which has unwanted
> overhead. I can micro-benchmark this if its helpful. But if we dredge
> up Jesper's slides here we are really counting cycles so even small
> numbers count if we want to hit line rate in a user space application
> with 40Gpbs hardware.

I agree, it seems pretty hard to achieve non-syscall sending on the same
core, as we somehow must transfer control over to the kernel without
doing a syscall.

The only other idea would be to export machine code up to user space,
which you can mmap(MAP_EXEC) from the socket somehow to make this API
truly NIC agnostic without recompiling. This code then would transform
the generic descriptors to the hardware specific ones. Seems also pretty
hairy to do that correctly, maybe.

> > With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
> > on 40G when run on decent hardware though.
> > 
> > It would really be great if we have something vendor neutral exposed as
> > a stable ABI and could leverage emerging infrastructure we already have
> > in the kernel such as eBPF and recent qdisc batching for raw sockets
> > instead of reinventing the wheels. (Don't get me wrong, I would love to
> > see AF_PACKET improved ...)
> 
> I don't think the interface is vendor specific. It does require some
> knowledge of the hardware descriptor layout though. It is though vendor
> neutral from my point of view. I provided the ixgbe patch simple because
> I'm most familiar with it and have a NIC here. If someone wants to send me
> a Mellanox NIC I can give it a try although I was hoping to recruit Or or
> Amir? The only hardware feature required is flow classification to queues
> which seems to be common across 10Gbps and 40/100Gbps devices. So most
> of the drivers should be able to support this.

Does flow classification work at the same level as registering network
addresses? Do I have to bind a e.g. multicast address wie ip maddr and
then set up flow director/ntuple to get the packets on the correct user
space facing queue or is it in case of the ixgbe card enough to just add
those addresses via fdir? Have you thought about letting the
kernel/driver handle that? In case one would like to connect their
virtual machines via this interface to the network maybe we need central
policing and resource constraints for queue management here?

Do other drivers need a separate af-packet managed way to bind addresses
to the queue? Maybe there are other quirks we might need to add to
actually build support for other network interface cards. Would be great
to at least examine one other driver in regard to this.
        
What other properties of the NIC must be exported? I think we also have
to deal with MTUs currently configured in the NIC, promisc mode and
maybe TSO?

> If your worried driver writers will implement the interface but not make
> their descriptor formats easily available I considered putting the layout
> in a header file in the uapi somewhere. Then we could just reject any
> implementation that doesn't include the header file needed to use it
> from user space.
> 
> With regards to leveraging eBPF and qdisc batching I don't see how this
> works with direct DMA and polling. Needed to give the lowest overhead
> between kernel and user space. In this case we want to use the hardware
> to do the filtering that would normally be done for eBPF and for many
> use cases the hardware flow classifiers is sufficient.

I agree, those features are hard to connect.

> We already added a qdisc bypass option I see this as taking this path
> further. I believe there is room for a continuum here. For basic cases
> use af_packet v1,v2 for mmap rings but using common descriptors use
> af_packet v3 and set QOS_BYASS. For absolute lowest overhead and
> specific applications that don't need QOS, eBPF use this interface.

You can simply write C code instead of eBPF code, yes.

I find the six additional ndo ops a bit worrisome as we are adding more
and more subsystem specific ndoops to this struct. I would like to see
some unification here, but currently cannot make concrete proposals,
sorry.

Patch 2/3 does not yet expose hw ring descriptors in uapi headers it
seems?

Are there plans to push a user space framework (maybe even into the
kernel), too? Will this be dpdk (alike) in the end?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06 17:03         ` Hannes Frederic Sowa
@ 2014-10-06 20:37           ` John Fastabend
  2014-10-06 23:26             ` Hannes Frederic Sowa
  0 siblings, 1 reply; 36+ messages in thread
From: John Fastabend @ 2014-10-06 20:37 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: Daniel Borkmann, John Fastabend, Jesper Dangaard Brouer,
	John W. Linville, Neil Horman, Florian Westphal, gerlitz.or,
	netdev, john.ronciak, amirv, eric.dumazet, danny.zhou

On 10/06/2014 10:03 AM, Hannes Frederic Sowa wrote:
> Hi John,
> 
> On Mo, 2014-10-06 at 08:01 -0700, John Fastabend wrote:
>> On 10/06/2014 02:49 AM, Daniel Borkmann wrote:
>>> Hi John,
>>>
>>> On 10/06/2014 03:12 AM, John Fastabend wrote:
>>>> On 10/05/2014 05:29 PM, Florian Westphal wrote:
>>>>> John Fastabend <john.fastabend@gmail.com> wrote:
>>>>>> There is one critical difference when running with these interfaces
>>>>>> vs running without them. In the normal case the af_packet module
>>>>>> uses a standard descriptor format exported by the af_packet user
>>>>>> space headers. In this model because we are working directly with
>>>>>> driver queues the descriptor format maps to the descriptor format
>>>>>> used by the device. User space applications can learn device
>>>>>> information from the socket option PACKET_DEV_DESC_INFO which
>>>>>> should provide enough details to extrapulate the descriptor formats.
>>>>>> Although this adds some complexity to user space it removes the
>>>>>> requirement to copy descriptor fields around.
>>>>>
>>>>> I find it very disappointing that we seem to have to expose such
>>>>> hardware specific details to userspace via hw-independent interface.
>>>>
>>>> Well it was only for convenience if it doesn't fit as a socket
>>>> option we can remove it. We can look up the device using the netdev
>>>> name from the bind call. I see your point though so if there is
>>>> consensus that this is not needed that is fine.
>>>>
>>>>> How big of a cost are we talking about when you say that it 'removes
>>>>> the requirement to copy descriptor fields'?
>>>>
>>>> This was likely a poor description. If you want to let user space
>>>> poll on the ring (without using system calls or interrupts) then
>>>> I don't see how you can _not_ expose the ring directly complete with
>>>> the vendor descriptor formats.
>>>
>>> But how big is the concrete performance degradation you're seeing if you
>>> use an e.g. `netmap-alike` Linux-own variant as a hw-neutral interface
>>> that does *not* directly expose hw descriptor formats to user space?
>>
>> If we don't directly expose the hardware descriptor formats then we
>> need to somehow kick the driver when we want it to do the copy from
>> the driver descriptor format to the common descriptor format.
>>
>> This requires a system call as far as I can tell. Which has unwanted
>> overhead. I can micro-benchmark this if its helpful. But if we dredge
>> up Jesper's slides here we are really counting cycles so even small
>> numbers count if we want to hit line rate in a user space application
>> with 40Gpbs hardware.
> 
> I agree, it seems pretty hard to achieve non-syscall sending on the same
> core, as we somehow must transfer control over to the kernel without
> doing a syscall.
> 
> The only other idea would be to export machine code up to user space,
> which you can mmap(MAP_EXEC) from the socket somehow to make this API
> truly NIC agnostic without recompiling. This code then would transform
> the generic descriptors to the hardware specific ones. Seems also pretty
> hairy to do that correctly, maybe.

This seems more complicated then a minimal library in userspace to
load descriptor handling code based on device id.

> 
>>> With 1 core netmap does 10G line-rate on 64b; I don't know their numbers
>>> on 40G when run on decent hardware though.
>>>
>>> It would really be great if we have something vendor neutral exposed as
>>> a stable ABI and could leverage emerging infrastructure we already have
>>> in the kernel such as eBPF and recent qdisc batching for raw sockets
>>> instead of reinventing the wheels. (Don't get me wrong, I would love to
>>> see AF_PACKET improved ...)
>>
>> I don't think the interface is vendor specific. It does require some
>> knowledge of the hardware descriptor layout though. It is though vendor
>> neutral from my point of view. I provided the ixgbe patch simple because
>> I'm most familiar with it and have a NIC here. If someone wants to send me
>> a Mellanox NIC I can give it a try although I was hoping to recruit Or or
>> Amir? The only hardware feature required is flow classification to queues
>> which seems to be common across 10Gbps and 40/100Gbps devices. So most
>> of the drivers should be able to support this.
> 
> Does flow classification work at the same level as registering network
> addresses? Do I have to bind a e.g. multicast address wie ip maddr and
> then set up flow director/ntuple to get the packets on the correct user
> space facing queue or is it in case of the ixgbe card enough to just add
> those addresses via fdir? Have you thought about letting the
> kernel/driver handle that? In case one would like to connect their
> virtual machines via this interface to the network maybe we need central
> policing and resource constraints for queue management here?

Right now it is enough to program the addresses via fdir. This shouldn't
be ixgbe specific and I did take a quick look at the other drivers use
for fdir and believe it should work the same.

I'm not sure what you mean by kernel/driver handle this? Maybe you mean
from the socket interface? I thought about it briefly but opted for what
I think is more straight forward and using the fdir APIs.

As far as policing and resource constraints I think that is a user space
task. And yes I am working on some user space management applications but
they are still fairly rough.

> 
> Do other drivers need a separate af-packet managed way to bind addresses
> to the queue? Maybe there are other quirks we might need to add to
> actually build support for other network interface cards. Would be great
> to at least examine one other driver in regard to this.

I _believe_ this interface is sufficient. I think one of the mellanox
interfaces could be implemented fairly easily.

>         
> What other properties of the NIC must be exported? I think we also have
> to deal with MTUs currently configured in the NIC, promisc mode and
> maybe TSO?

We don't support per queue MTUs in the kernel. So I think this can be
learned using existing interfaces.

> 
>> If your worried driver writers will implement the interface but not make
>> their descriptor formats easily available I considered putting the layout
>> in a header file in the uapi somewhere. Then we could just reject any
>> implementation that doesn't include the header file needed to use it
>> from user space.
>>
>> With regards to leveraging eBPF and qdisc batching I don't see how this
>> works with direct DMA and polling. Needed to give the lowest overhead
>> between kernel and user space. In this case we want to use the hardware
>> to do the filtering that would normally be done for eBPF and for many
>> use cases the hardware flow classifiers is sufficient.
> 
> I agree, those features are hard to connect.
> 
>> We already added a qdisc bypass option I see this as taking this path
>> further. I believe there is room for a continuum here. For basic cases
>> use af_packet v1,v2 for mmap rings but using common descriptors use
>> af_packet v3 and set QOS_BYASS. For absolute lowest overhead and
>> specific applications that don't need QOS, eBPF use this interface.
> 
> You can simply write C code instead of eBPF code, yes.
> 
> I find the six additional ndo ops a bit worrisome as we are adding more
> and more subsystem specific ndoops to this struct. I would like to see
> some unification here, but currently cannot make concrete proposals,
> sorry.

I agree it seems like a bit much. One thought was to split the ndo
ops into categories. Switch ops, MACVLAN ops, basic ops and with this
userspace queue ops. This sort of goes along with some of the switch
offload work which is going to add a handful more ops as best I can
tell.

> 
> Patch 2/3 does not yet expose hw ring descriptors in uapi headers it
> seems?
> 

Nope I wasn't sure if we wanted to put the ring desciptors in UAPI or
not. If we do I would likely push that as a 4th patch.

> Are there plans to push a user space framework (maybe even into the
> kernel), too? Will this be dpdk (alike) in the end?

I have patches to enable this interface on DPDK and it gets the
same performance as the other DPDK interfaces.

I've considered creating a minimal library to do basic tx/rx and
descriptor processing maybe in ./test or ,/scripts to give a usage
example that is easier to get ahold of and review without having
to pull in all the other things DPDK does that may or may not be
interesting depending on what you are doing and on what hardware.

> 
> Bye,
> Hannes
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06 20:37           ` John Fastabend
@ 2014-10-06 23:26             ` Hannes Frederic Sowa
  2014-10-07 18:59               ` Neil Horman
  0 siblings, 1 reply; 36+ messages in thread
From: Hannes Frederic Sowa @ 2014-10-06 23:26 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, John Fastabend, Jesper Dangaard Brouer,
	John W. Linville, Neil Horman, Florian Westphal, gerlitz.or,
	netdev, john.ronciak, amirv, eric.dumazet, danny.zhou,
	Willem de Bruijn

Hi John,

On Mon, Oct 6, 2014, at 22:37, John Fastabend wrote:
> > I find the six additional ndo ops a bit worrisome as we are adding more
> > and more subsystem specific ndoops to this struct. I would like to see
> > some unification here, but currently cannot make concrete proposals,
> > sorry.
> 
> I agree it seems like a bit much. One thought was to split the ndo
> ops into categories. Switch ops, MACVLAN ops, basic ops and with this
> userspace queue ops. This sort of goes along with some of the switch
> offload work which is going to add a handful more ops as best I can
> tell.

Thanks for your mail, you answered all of my questions.

Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
Willem (also in Cc) used sysfs files which get mmaped to represent the
tx/rx descriptors. The representation was independent of the device and
IIRC the prototype used a write(fd, "", 1) to signal the kernel it
should proceed with tx. I agree, it would be great to be syscall-free
here.

For the semantics of the descriptors we could also easily generate files
in sysfs. I thought about something like tracepoints already do for
representing the data in the ringbuffer depending on the event:

-- >8 --
# cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format 
name: net_dev_queue
ID: 1006
format:
	field:unsigned short common_type;       offset:0;       size:2;
	signed:0;
	field:unsigned char common_flags;       offset:2;       size:1;
	signed:0;
	field:unsigned char common_preempt_count;       offset:3;      
	size:1; signed:0;
	field:int common_pid;   offset:4;       size:4; signed:1;

	field:void * skbaddr;   offset:8;       size:8; signed:0;
	field:unsigned int len; offset:16;      size:4; signed:0;
	field:__data_loc char[] name;   offset:20;      size:4;
	signed:1;

print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr,
REC->len
-- >8 --

Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
endianess would need to be added. Hopefully there is already a user
space parser somewhere in the perf sources. An easier to parse binary
representation could be added easily and maybe even something vDSO alike
if people care about that.

Maybe this open/mmap per queue also kills some of the ndo_ops?

Bye,
Hannes

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06 23:26             ` Hannes Frederic Sowa
@ 2014-10-07 18:59               ` Neil Horman
  2014-10-08 17:20                 ` John Fastabend
  0 siblings, 1 reply; 36+ messages in thread
From: Neil Horman @ 2014-10-07 18:59 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: John Fastabend, Daniel Borkmann, John Fastabend,
	Jesper Dangaard Brouer, John W. Linville, Florian Westphal,
	gerlitz.or, netdev, john.ronciak, amirv, eric.dumazet, danny.zhou,
	Willem de Bruijn

On Tue, Oct 07, 2014 at 01:26:11AM +0200, Hannes Frederic Sowa wrote:
> Hi John,
> 
> On Mon, Oct 6, 2014, at 22:37, John Fastabend wrote:
> > > I find the six additional ndo ops a bit worrisome as we are adding more
> > > and more subsystem specific ndoops to this struct. I would like to see
> > > some unification here, but currently cannot make concrete proposals,
> > > sorry.
> > 
> > I agree it seems like a bit much. One thought was to split the ndo
> > ops into categories. Switch ops, MACVLAN ops, basic ops and with this
> > userspace queue ops. This sort of goes along with some of the switch
> > offload work which is going to add a handful more ops as best I can
> > tell.
> 
> Thanks for your mail, you answered all of my questions.
> 
> Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
> Willem (also in Cc) used sysfs files which get mmaped to represent the
> tx/rx descriptors. The representation was independent of the device and
> IIRC the prototype used a write(fd, "", 1) to signal the kernel it
> should proceed with tx. I agree, it would be great to be syscall-free
> here.
> 
> For the semantics of the descriptors we could also easily generate files
> in sysfs. I thought about something like tracepoints already do for
> representing the data in the ringbuffer depending on the event:
> 
> -- >8 --
> # cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format 
> name: net_dev_queue
> ID: 1006
> format:
> 	field:unsigned short common_type;       offset:0;       size:2;
> 	signed:0;
> 	field:unsigned char common_flags;       offset:2;       size:1;
> 	signed:0;
> 	field:unsigned char common_preempt_count;       offset:3;      
> 	size:1; signed:0;
> 	field:int common_pid;   offset:4;       size:4; signed:1;
> 
> 	field:void * skbaddr;   offset:8;       size:8; signed:0;
> 	field:unsigned int len; offset:16;      size:4; signed:0;
> 	field:__data_loc char[] name;   offset:20;      size:4;
> 	signed:1;
> 
> print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr,
> REC->len
> -- >8 --
> 
> Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
> endianess would need to be added. Hopefully there is already a user
> space parser somewhere in the perf sources. An easier to parse binary
> representation could be added easily and maybe even something vDSO alike
> if people care about that.
> 
> Maybe this open/mmap per queue also kills some of the ndo_ops?
> 
> Bye,
> Hannes
> 


John-
	I don't know if its of use to you here, but I was experimenting awhile
ago with af_packet memory mapping, using the protection bits in the page tables
as a doorbell mechanism.  I scrapped the work as the performance bottleneck for
af_packet wasn't found in the syscall trap time, but it occurs to me, it might
be useful for you here, in that, using this mechanism, if you keep the transmit
ring non-empty, you only encur the cost of a single trap to start the transmit
process.  Let me know if you want to see it.

Neil

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 18:59               ` Neil Horman
@ 2014-10-08 17:20                 ` John Fastabend
  2014-10-09 13:36                   ` [PATCH] af_packet: Add Doorbell transmit mode to AF_PACKET sockets Neil Horman
  0 siblings, 1 reply; 36+ messages in thread
From: John Fastabend @ 2014-10-08 17:20 UTC (permalink / raw)
  To: Neil Horman
  Cc: Hannes Frederic Sowa, John Fastabend, Daniel Borkmann,
	Jesper Dangaard Brouer, John W. Linville, Florian Westphal,
	gerlitz.or, netdev, john.ronciak, amirv, eric.dumazet, danny.zhou,
	Willem de Bruijn

On 10/07/2014 11:59 AM, Neil Horman wrote:
> On Tue, Oct 07, 2014 at 01:26:11AM +0200, Hannes Frederic Sowa wrote:
>> Hi John,
>>
>> On Mon, Oct 6, 2014, at 22:37, John Fastabend wrote:
>>>> I find the six additional ndo ops a bit worrisome as we are adding more
>>>> and more subsystem specific ndoops to this struct. I would like to see
>>>> some unification here, but currently cannot make concrete proposals,
>>>> sorry.
>>>
>>> I agree it seems like a bit much. One thought was to split the ndo
>>> ops into categories. Switch ops, MACVLAN ops, basic ops and with this
>>> userspace queue ops. This sort of goes along with some of the switch
>>> offload work which is going to add a handful more ops as best I can
>>> tell.
>>
>> Thanks for your mail, you answered all of my questions.
>>
>> Have you looked at <https://code.google.com/p/kernel/wiki/ProjectUnetq>?
>> Willem (also in Cc) used sysfs files which get mmaped to represent the
>> tx/rx descriptors. The representation was independent of the device and
>> IIRC the prototype used a write(fd, "", 1) to signal the kernel it
>> should proceed with tx. I agree, it would be great to be syscall-free
>> here.
>>
>> For the semantics of the descriptors we could also easily generate files
>> in sysfs. I thought about something like tracepoints already do for
>> representing the data in the ringbuffer depending on the event:
>>
>> -- >8 --
>> # cat /sys/kernel/debug/tracing/events/net/net_dev_queue/format
>> name: net_dev_queue
>> ID: 1006
>> format:
>> 	field:unsigned short common_type;       offset:0;       size:2;
>> 	signed:0;
>> 	field:unsigned char common_flags;       offset:2;       size:1;
>> 	signed:0;
>> 	field:unsigned char common_preempt_count;       offset:3;
>> 	size:1; signed:0;
>> 	field:int common_pid;   offset:4;       size:4; signed:1;
>>
>> 	field:void * skbaddr;   offset:8;       size:8; signed:0;
>> 	field:unsigned int len; offset:16;      size:4; signed:0;
>> 	field:__data_loc char[] name;   offset:20;      size:4;
>> 	signed:1;
>>
>> print fmt: "dev=%s skbaddr=%p len=%u", __get_str(name), REC->skbaddr,
>> REC->len
>> -- >8 --
>>
>> Maybe the macros from tracing are reusable (TP_STRUCT__entry), e.g.
>> endianess would need to be added. Hopefully there is already a user
>> space parser somewhere in the perf sources. An easier to parse binary
>> representation could be added easily and maybe even something vDSO alike
>> if people care about that.
>>
>> Maybe this open/mmap per queue also kills some of the ndo_ops?
>>
>> Bye,
>> Hannes
>>
>
>
> John-
> 	I don't know if its of use to you here, but I was experimenting awhile
> ago with af_packet memory mapping, using the protection bits in the page tables
> as a doorbell mechanism.  I scrapped the work as the performance bottleneck for
> af_packet wasn't found in the syscall trap time, but it occurs to me, it might
> be useful for you here, in that, using this mechanism, if you keep the transmit
> ring non-empty, you only encur the cost of a single trap to start the transmit
> process.  Let me know if you want to see it.
>
> Neil
>


Hi Neil,

If you could forward it along I'll take a look. It seems like something
along these lines will be needed.

Thanks,
John


-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 36+ messages in thread

* [PATCH] af_packet:  Add Doorbell transmit mode to AF_PACKET sockets
  2014-10-08 17:20                 ` John Fastabend
@ 2014-10-09 13:36                   ` Neil Horman
  2014-10-09 15:01                     ` John Fastabend
  0 siblings, 1 reply; 36+ messages in thread
From: Neil Horman @ 2014-10-09 13:36 UTC (permalink / raw)
  To: netdev
  Cc: John Fastabend, Daniel Borkmann, Jesper Dangaard Brouer,
	John W. Linville, Florian Westphal, gerlitz.or, john.ronciak,
	amirv, eric.dumazet, danny.zhou, Willem de Bruijn, John Fastabend,
	Neil Horman

This patch adds a variation to the AF_PACKET memory mapped socket transmit
mechanism.  Nominally, when using a memory mapped AF_PACKET socket, frames are
written into the memory mapped buffer, and then the application calls sendmsg
with a NULL buffer which triggers then cleans the mapped space of all pending
buffers.

While this provides clean, synchronous operation, improvements can be made.  To
this end, I've introduced a doorbell mode of operation to memory mapped packet
sockets. When a packet socket is placed into doorbell mode, it write protects
the mappings of any process using the packet socket, so that on the first write
to it, a kernel trap is generated, which returns the mapping to a read-write
state, and forks a task to begin cleaning the buffers on the applications
behalf.  This thread contains some hysterisis to continue running a short while
after the last buffer has been cleaned, allowing subsquent wrtites to be sent
without needing to fork another task.  This allows for additional parallelism in
that an application on an smp system can run in parallel with a cleaning task,
so that the socket buffer can be filled and emptied in parallel without having
to incur multiple system call traps.

I've only done some very rough performance estimates, but early results are
promising.  Using this code here:
http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

I made some modifications to support using doorbell mode and compared the time
it took to send 1500 packets (each of size 1492 bytes), in basic mmap and
doorbell mmaped mode, and used tcpdump to capture the output.  Results:

trace	packets	start time	end time	delta		p/s	size
ndb	1500	2.755605	3.000886	0.245281	6115.43	1492b
db	1500	4.716448	4.846382	0.129934	11544.32 1492b

Its very rough of course but it would seem I get a 40% increase in throughput
when using this method.  I'm sure thats an overestimate, and so more testing is
required, but initial results look good.

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
---
 include/uapi/linux/if_packet.h |   1 +
 net/packet/af_packet.c         | 215 +++++++++++++++++++++++++++++++++++++++--
 net/packet/internal.h          |  10 ++
 3 files changed, 217 insertions(+), 9 deletions(-)

diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index bac27fa..efce7e1 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,7 @@ struct sockaddr_ll {
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
 #define PACKET_QDISC_BYPASS		20
+#define PACKET_MMAP_DOORBELL		21
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 6a2bb37..27849c5 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -66,6 +66,8 @@
 #include <linux/kmod.h>
 #include <linux/slab.h>
 #include <linux/vmalloc.h>
+#include <linux/rmap.h>
+#include <linux/async.h>
 #include <net/net_namespace.h>
 #include <net/ip.h>
 #include <net/protocol.h>
@@ -234,9 +236,18 @@ struct packet_skb_cb {
 	(((x)->kactive_blk_num < ((x)->knum_blocks-1)) ? \
 	((x)->kactive_blk_num+1) : 0)
 
+ASYNC_DOMAIN_EXCLUSIVE(packet_doorbell_domain);
+
 static void __fanout_unlink(struct sock *sk, struct packet_sock *po);
 static void __fanout_link(struct sock *sk, struct packet_sock *po);
 
+
+static void packet_mod_tx_doorbell(struct packet_sock *po,
+				   struct vm_area_struct *vma, bool arm);
+
+#define packet_arm_tx_doorbell(p, v) packet_mod_tx_doorbell(p, v, true)
+#define packet_disarm_tx_doorbell(p, v) packet_mod_tx_doorbell(p, v, false)
+
 static int packet_direct_xmit(struct sk_buff *skb)
 {
 	struct net_device *dev = skb->dev;
@@ -2215,7 +2226,8 @@ static int tpacket_snd(struct packet_sock *po, struct msghdr *msg)
 	int status = TP_STATUS_AVAILABLE;
 	int hlen, tlen;
 
-	mutex_lock(&po->pg_vec_lock);
+	if (!po->tp_doorbell_mode)
+		mutex_lock(&po->pg_vec_lock);
 
 	if (likely(saddr == NULL)) {
 		dev	= packet_cached_dev_get(po);
@@ -2326,7 +2338,8 @@ out_status:
 out_put:
 	dev_put(dev);
 out:
-	mutex_unlock(&po->pg_vec_lock);
+	if (!po->tp_doorbell_mode)
+		mutex_unlock(&po->pg_vec_lock);
 	return err;
 }
 
@@ -2548,9 +2561,13 @@ static int packet_sendmsg(struct kiocb *iocb, struct socket *sock,
 	struct sock *sk = sock->sk;
 	struct packet_sock *po = pkt_sk(sk);
 
-	if (po->tx_ring.pg_vec)
-		return tpacket_snd(po, msg);
-	else
+	if (po->tx_ring.pg_vec) {
+		if (po->tp_doorbell_mode) {
+			async_synchronize_full_domain(&packet_doorbell_domain);
+			return 0;
+		} else
+			return tpacket_snd(po, msg);
+	} else
 		return packet_snd(sock, msg, len);
 }
 
@@ -2592,6 +2609,10 @@ static int packet_release(struct socket *sock)
 
 	packet_flush_mclist(sk);
 
+	if (po->tp_doorbell_mode)
+		async_synchronize_full_domain(&packet_doorbell_domain);
+	
+
 	if (po->rx_ring.pg_vec) {
 		memset(&req_u, 0, sizeof(req_u));
 		packet_set_ring(sk, &req_u, 1, 0);
@@ -2772,6 +2793,9 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	sock_init_data(sock, sk);
 
 	po = pkt_sk(sk);
+	INIT_LIST_HEAD(&po->doorbell_vmas);
+	spin_lock_init(&po->doorbell_lock);
+	atomic_set(&po->doorbell_thread_count, 0);
 	sk->sk_family = PF_PACKET;
 	po->num = proto;
 	po->xmit = dev_queue_xmit;
@@ -3374,6 +3398,21 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
 		return 0;
 	}
+	case PACKET_MMAP_DOORBELL:
+	{
+		unsigned int val;
+		if (optlen != sizeof(val))
+			return -EINVAL;
+		if (atomic_read(&po->mapped))
+			return -EBUSY;
+		if (copy_from_user(&val, optval, sizeof(val)))
+			return -EFAULT;
+
+		po->tp_doorbell_mode = !!val;
+		if (!po->tp_doorbell_mode)
+			async_synchronize_full_domain(&packet_doorbell_domain);
+		return 0;
+	}
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3469,6 +3508,9 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_QDISC_BYPASS:
 		val = packet_use_direct_xmit(po);
 		break;
+	case PACKET_MMAP_DOORBELL:
+		val = po->tp_doorbell_mode;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3610,6 +3652,74 @@ static unsigned int packet_poll(struct file *file, struct socket *sock,
 	return mask;
 }
 
+void packet_doorbell_send(void *data, async_cookie_t cookie)
+{
+	struct sock *sk = (struct sock *)data;
+	struct packet_sock *po = pkt_sk(sk);
+	struct msghdr msg;
+	int ret;
+	int retry_count;
+	struct doorbell_vma *db_vma;
+	void *more_work;
+
+	WARN_ON(!po);
+
+restart:
+	for (retry_count = 2; retry_count > 0; retry_count--) {
+		do {
+			msg.msg_flags = 0;
+			msg.msg_name = NULL;
+			ret = tpacket_snd(po, &msg);
+		} while (ret > 0);
+		schedule_timeout(1);
+	}
+	atomic_dec(&po->doorbell_thread_count);
+	rcu_read_lock();
+	list_for_each_entry_rcu(db_vma, &po->doorbell_vmas, list)
+		packet_arm_tx_doorbell(po, db_vma->vma);
+	rcu_read_unlock();
+
+	more_work = packet_current_frame(po, &po->tx_ring, TP_STATUS_SEND_REQUEST);
+
+	if (more_work &&
+	    atomic_add_unless(&po->doorbell_thread_count, 1, 1)) {
+		/*
+		 * We have more to send and we won the race to be the cleaning
+		 * thread.  go back and try again
+		 */
+		rcu_read_lock();
+		list_for_each_entry_rcu(db_vma, &po->doorbell_vmas, list)
+			packet_disarm_tx_doorbell(po, db_vma->vma);
+		rcu_read_unlock();
+		goto restart;
+	}
+	
+}
+
+static int packet_mm_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+	struct file *file = vma->vm_file;
+	struct socket *sock = file->private_data;
+	struct sock *sk = sock->sk;
+	struct doorbell_vma *db_vma;
+	struct packet_sock *po = sk ? pkt_sk(sk) : NULL;
+
+	if (po) {
+		rcu_read_lock();
+		list_for_each_entry_rcu(db_vma, &po->doorbell_vmas, list)
+			packet_disarm_tx_doorbell(po, db_vma->vma);
+		rcu_read_unlock();
+		if (atomic_add_unless(&po->doorbell_thread_count, 1, 1)) {
+			if (po->tp_doorbell_mode)
+				async_schedule_domain(packet_doorbell_send, sk,
+						      &packet_doorbell_domain);
+			else
+				atomic_dec(&po->doorbell_thread_count);
+		}
+
+	}
+	return VM_FAULT_RETRY;
+}
 
 /* Dirty? Well, I still did not learn better way to account
  * for user mmaps.
@@ -3627,17 +3737,29 @@ static void packet_mm_open(struct vm_area_struct *vma)
 
 static void packet_mm_close(struct vm_area_struct *vma)
 {
+	struct doorbell_vma *db_vma;
 	struct file *file = vma->vm_file;
 	struct socket *sock = file->private_data;
 	struct sock *sk = sock->sk;
-
-	if (sk)
-		atomic_dec(&pkt_sk(sk)->mapped);
+	struct packet_sock *po = sk ? pkt_sk(sk) : NULL;
+
+	if (po) {
+		spin_lock(&po->doorbell_lock);
+		list_for_each_entry_rcu(db_vma, &po->doorbell_vmas, list) {
+			if (db_vma->vma == vma) {
+				list_del_rcu(&db_vma->list);
+				kfree_rcu(db_vma, rcu);
+			}
+		}
+		spin_unlock(&po->doorbell_lock);
+		atomic_dec(&po->mapped);
+	}
 }
 
-static const struct vm_operations_struct packet_mmap_ops = {
+const struct vm_operations_struct packet_mmap_ops = {
 	.open	=	packet_mm_open,
 	.close	=	packet_mm_close,
+	.page_mkwrite	=	packet_mm_mkwrite,
 };
 
 static void free_pg_vec(struct pgv *pg_vec, unsigned int order,
@@ -3855,6 +3977,62 @@ out:
 	return err;
 }
 
+static void packet_mod_tx_doorbell(struct packet_sock *po,
+				   struct vm_area_struct *vma, bool arm)
+{
+	void *kaddr;
+	int pg_num;
+	struct packet_ring_buffer *rb;
+	pte_t entry;
+	struct page *page;
+	int i;
+	pte_t *ptep;
+	unsigned long start;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	rb = &po->tx_ring;
+
+	for (i = 0; i < rb->pg_vec_len; i++) {
+		kaddr = rb->pg_vec[i].buffer;
+		start = vma->vm_start;
+		for (pg_num = 0; pg_num < rb->pg_vec_pages; pg_num++) {
+			ptep = NULL;
+			page = pgv_to_page(kaddr);
+
+
+			pgd = pgd_offset(vma->vm_mm, start);
+			if (!pgd_present(*pgd))
+				goto next;
+
+			pud = pud_offset(pgd, start);
+			if (!pud_present(*pud))
+				goto next;
+
+			pmd = pmd_offset(pud, start);
+			if (!pmd_present(*pmd))
+				goto next;
+
+			ptep = pte_offset_kernel(pmd, start);
+
+			if (arm)
+				entry = pte_wrprotect(*ptep);
+			else
+				entry = pte_mkwrite(*ptep);
+
+			flush_dcache_page(page);
+			set_pte_at(vma->vm_mm, start, ptep, entry);
+
+next:
+			kaddr += PAGE_SIZE;
+			start += PAGE_SIZE;
+		}
+	}
+
+	
+}
+
 static int packet_mmap(struct file *file, struct socket *sock,
 		struct vm_area_struct *vma)
 {
@@ -3865,10 +4043,17 @@ static int packet_mmap(struct file *file, struct socket *sock,
 	unsigned long start;
 	int err = -EINVAL;
 	int i;
+	struct doorbell_vma *db_vma = NULL;
 
 	if (vma->vm_pgoff)
 		return -EINVAL;
 
+	if (po->tp_doorbell_mode) {
+		db_vma = kzalloc(sizeof(struct doorbell_vma), GFP_KERNEL);
+		if (!db_vma)
+			return -ENOMEM;
+	}
+
 	mutex_lock(&po->pg_vec_lock);
 
 	expected_size = 0;
@@ -3905,9 +4090,21 @@ static int packet_mmap(struct file *file, struct socket *sock,
 				start += PAGE_SIZE;
 				kaddr += PAGE_SIZE;
 			}
+#ifdef CONFIG_X86
+			set_pages_uc(pgv_to_page(rb->pg_vec[i].buffer), rb->pg_vec_pages);
+#endif
 		}
 	}
 
+	if (po->tp_doorbell_mode) {
+		vma->vm_flags |= VM_SHARED;
+		db_vma->vma = vma;
+		spin_lock(&po->doorbell_lock);
+		list_add_rcu(&db_vma->list, &po->doorbell_vmas);
+		spin_unlock(&po->doorbell_lock);
+		packet_arm_tx_doorbell(po, vma);
+	}
+
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index eb9580a..2e1f5f7 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -89,9 +89,17 @@ struct packet_fanout {
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
 };
 
+struct doorbell_vma {
+	struct list_head list;
+	struct vm_area_struct *vma;
+	struct rcu_head rcu;
+};
+
 struct packet_sock {
 	/* struct sock has to be the first member of packet_sock */
 	struct sock		sk;
+	struct list_head __rcu	doorbell_vmas;
+	spinlock_t	doorbell_lock;
 	struct packet_fanout	*fanout;
 	union  tpacket_stats_u	stats;
 	struct packet_ring_buffer	rx_ring;
@@ -112,6 +120,8 @@ struct packet_sock {
 	unsigned int		tp_reserve;
 	unsigned int		tp_loss:1;
 	unsigned int		tp_tx_has_off:1;
+	unsigned int		tp_doorbell_mode:1;
+	atomic_t		doorbell_thread_count;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);
-- 
1.9.3

^ permalink raw reply related	[flat|nested] 36+ messages in thread

* Re: [PATCH] af_packet:  Add Doorbell transmit mode to AF_PACKET sockets
  2014-10-09 13:36                   ` [PATCH] af_packet: Add Doorbell transmit mode to AF_PACKET sockets Neil Horman
@ 2014-10-09 15:01                     ` John Fastabend
  2014-10-09 16:05                       ` Neil Horman
  0 siblings, 1 reply; 36+ messages in thread
From: John Fastabend @ 2014-10-09 15:01 UTC (permalink / raw)
  To: Neil Horman, netdev
  Cc: John Fastabend, Daniel Borkmann, Jesper Dangaard Brouer,
	John W. Linville, Florian Westphal, gerlitz.or, john.ronciak,
	amirv, eric.dumazet, danny.zhou, Willem de Bruijn

On 10/09/2014 06:36 AM, Neil Horman wrote:
> This patch adds a variation to the AF_PACKET memory mapped socket transmit
> mechanism.  Nominally, when using a memory mapped AF_PACKET socket, frames are
> written into the memory mapped buffer, and then the application calls sendmsg
> with a NULL buffer which triggers then cleans the mapped space of all pending
> buffers.
> 
> While this provides clean, synchronous operation, improvements can be made.  To
> this end, I've introduced a doorbell mode of operation to memory mapped packet
> sockets. When a packet socket is placed into doorbell mode, it write protects
> the mappings of any process using the packet socket, so that on the first write
> to it, a kernel trap is generated, which returns the mapping to a read-write
> state, and forks a task to begin cleaning the buffers on the applications
> behalf.  This thread contains some hysterisis to continue running a short while
> after the last buffer has been cleaned, allowing subsquent wrtites to be sent
> without needing to fork another task.  This allows for additional parallelism in
> that an application on an smp system can run in parallel with a cleaning task,
> so that the socket buffer can be filled and emptied in parallel without having
> to incur multiple system call traps.
> 
> I've only done some very rough performance estimates, but early results are
> promising.  Using this code here:
> http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
> 
> I made some modifications to support using doorbell mode and compared the time
> it took to send 1500 packets (each of size 1492 bytes), in basic mmap and
> doorbell mmaped mode, and used tcpdump to capture the output.  Results:
> 
> trace	packets	start time	end time	delta		p/s	size
> ndb	1500	2.755605	3.000886	0.245281	6115.43	1492b
> db	1500	4.716448	4.846382	0.129934	11544.32 1492b
> 
> Its very rough of course but it would seem I get a 40% increase in throughput
> when using this method.  I'm sure thats an overestimate, and so more testing is
> required, but initial results look good.
> 
> Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> ---


Thanks Neil, This looks helpful I'll see if I can merge something like this with
my previous patch. Not likely to have anything by next week though ;)

.John

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [PATCH] af_packet:  Add Doorbell transmit mode to AF_PACKET sockets
  2014-10-09 15:01                     ` John Fastabend
@ 2014-10-09 16:05                       ` Neil Horman
  0 siblings, 0 replies; 36+ messages in thread
From: Neil Horman @ 2014-10-09 16:05 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, John Fastabend, Daniel Borkmann, Jesper Dangaard Brouer,
	John W. Linville, Florian Westphal, gerlitz.or, john.ronciak,
	amirv, eric.dumazet, danny.zhou, Willem de Bruijn

On Thu, Oct 09, 2014 at 08:01:07AM -0700, John Fastabend wrote:
> On 10/09/2014 06:36 AM, Neil Horman wrote:
> > This patch adds a variation to the AF_PACKET memory mapped socket transmit
> > mechanism.  Nominally, when using a memory mapped AF_PACKET socket, frames are
> > written into the memory mapped buffer, and then the application calls sendmsg
> > with a NULL buffer which triggers then cleans the mapped space of all pending
> > buffers.
> > 
> > While this provides clean, synchronous operation, improvements can be made.  To
> > this end, I've introduced a doorbell mode of operation to memory mapped packet
> > sockets. When a packet socket is placed into doorbell mode, it write protects
> > the mappings of any process using the packet socket, so that on the first write
> > to it, a kernel trap is generated, which returns the mapping to a read-write
> > state, and forks a task to begin cleaning the buffers on the applications
> > behalf.  This thread contains some hysterisis to continue running a short while
> > after the last buffer has been cleaned, allowing subsquent wrtites to be sent
> > without needing to fork another task.  This allows for additional parallelism in
> > that an application on an smp system can run in parallel with a cleaning task,
> > so that the socket buffer can be filled and emptied in parallel without having
> > to incur multiple system call traps.
> > 
> > I've only done some very rough performance estimates, but early results are
> > promising.  Using this code here:
> > http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap
> > 
> > I made some modifications to support using doorbell mode and compared the time
> > it took to send 1500 packets (each of size 1492 bytes), in basic mmap and
> > doorbell mmaped mode, and used tcpdump to capture the output.  Results:
> > 
> > trace	packets	start time	end time	delta		p/s	size
> > ndb	1500	2.755605	3.000886	0.245281	6115.43	1492b
> > db	1500	4.716448	4.846382	0.129934	11544.32 1492b
> > 
> > Its very rough of course but it would seem I get a 40% increase in throughput
> > when using this method.  I'm sure thats an overestimate, and so more testing is
> > required, but initial results look good.
> > 
> > Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
> > ---
> 
> 
> Thanks Neil, This looks helpful I'll see if I can merge something like this with
> my previous patch. Not likely to have anything by next week though ;)
> 
No worries, I'm on vacation next week anyway :)
Thanks!
Neil

> .John
> 
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  0:06 [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access John Fastabend
                   ` (2 preceding siblings ...)
  2014-10-06  0:29 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Florian Westphal
@ 2014-10-06 16:55 ` Stephen Hemminger
  2014-10-06 20:42   ` John Fastabend
  2014-10-06 21:42 ` David Miller
  2014-10-07  4:24 ` Willem de Bruijn
  5 siblings, 1 reply; 36+ messages in thread
From: Stephen Hemminger @ 2014-10-06 16:55 UTC (permalink / raw)
  To: John Fastabend
  Cc: dborkman, fw, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

On Sun, 05 Oct 2014 17:06:31 -0700
John Fastabend <john.fastabend@gmail.com> wrote:

> This patch adds a net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.
> 
> Typically in an af_packet interface a packet_type handler is
> registered and used to filter traffic to the socket and do other
> things such as fan out traffic to multiple sockets. In this case the
> networking stack is being bypassed so this code is not run. So the
> hardware must push the correct traffic to the queues obtained from
> the ndo callback ndo_split_queue_pairs().
> 
> Fortunately there is already a flow classification interface which
> is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
> currently supported by multiple drivers including sfc, mlx4, niu,
> ixgbe, and i40e. Supporting some way to steer traffic to a queue
> is the _only_ hardware requirement to support the interface, plus
> the driver needs to implement the correct ndo ops. A follow on
> patch adds support for ixgbe but we expect at least the subset of
> drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.
> 
> The interface is driven over an af_packet socket which we believe
> is the most natural interface to use. Because it is already used
> for raw packet interfaces which is what we are providing here.
>  The high level flow for this interface looks like:
> 
> 	bind(fd, &sockaddr, sizeof(sockaddr));
> 
> 	/* Get the device type and info */
> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
> 		   &optlen);
> 
> 	/* With device info we can look up descriptor format */
> 
> 	/* Get the layout of ring space offset, page_sz, cnt */
> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> 		   &info, &optlen);
> 
> 	/* request some queues from the driver */
> 	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> 		   &qpairs_info, sizeof(qpairs_info));
> 
> 	/* if we let the driver pick us queues learn which queues
>          * we were given
>          */
> 	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> 		   &qpairs_info, sizeof(qpairs_info));
> 
> 	/* And mmap queue pairs to user space */
> 	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> 	     MAP_SHARED, fd, 0);
> 
> 	/* Now we have some user space queues to read/write to*/
> 
> There is one critical difference when running with these interfaces
> vs running without them. In the normal case the af_packet module
> uses a standard descriptor format exported by the af_packet user
> space headers. In this model because we are working directly with
> driver queues the descriptor format maps to the descriptor format
> used by the device. User space applications can learn device
> information from the socket option PACKET_DEV_DESC_INFO which
> should provide enough details to extrapulate the descriptor formats.
> Although this adds some complexity to user space it removes the
> requirement to copy descriptor fields around.
> 
> The formats are usually provided by the device vendor documentation
> If folks want I can provide a follow up patch to provide the formats
> in a .h file in ./include/uapi/linux/ for ease of use. I have access
> to formats for ixgbe and mlx drivers other driver owners would need to
> provide their formats.
> 
> We tested this interface using traffic generators and doing basic
> L2 forwarding tests on ixgbe devices. Our tests use a set of patches
> to DPDK to enable an interface using this socket interfaace. With
> this interface we can xmit/receive @ line rate from a test user space
> application on a single core.
> 
> Additionally we have a set of DPDK patches to enable DPDK with this
> interface. DPDK can be downloaded @ dpdk.org although as I hope is
> clear from above DPDK is just our paticular test environment we
> expect other libraries could be built on this interface.
> 
> Signed-off-by: Danny Zhou <danny.zhou@intel.com>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>

I like the ability to share a device between kernel and user mode networking.
The model used for DPDK for this is really ugly and fragile/broken.
Your proposal assumes that you fully trust the user mode networking application
which is not a generally safe assumption.

A device can DMA from/to any arbitrary physical memory. 
And it would be hard to use IOMMU to protect because the
IOMMU doesn't know that the difference between the applications queue and
the rest of the queues.

At least with DPDK you can use VFIO, and you are claiming the whole device to
allow protection against random memory being read/written.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06 16:55 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Stephen Hemminger
@ 2014-10-06 20:42   ` John Fastabend
  0 siblings, 0 replies; 36+ messages in thread
From: John Fastabend @ 2014-10-06 20:42 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dborkman, fw, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

On 10/06/2014 09:55 AM, Stephen Hemminger wrote:
> On Sun, 05 Oct 2014 17:06:31 -0700
> John Fastabend <john.fastabend@gmail.com> wrote:
>
>> This patch adds a net_device ops to split off a set of driver queues
>> from the driver and map the queues into user space via mmap. This
>> allows the queues to be directly manipulated from user space. For
>> raw packet interface this removes any overhead from the kernel network
>> stack.
>>
>> Typically in an af_packet interface a packet_type handler is
>> registered and used to filter traffic to the socket and do other
>> things such as fan out traffic to multiple sockets. In this case the
>> networking stack is being bypassed so this code is not run. So the
>> hardware must push the correct traffic to the queues obtained from
>> the ndo callback ndo_split_queue_pairs().
>>
>> Fortunately there is already a flow classification interface which
>> is part of the ethtool command set, ETHTOOL_SRXCLSRLINS. It is
>> currently supported by multiple drivers including sfc, mlx4, niu,
>> ixgbe, and i40e. Supporting some way to steer traffic to a queue
>> is the _only_ hardware requirement to support the interface, plus
>> the driver needs to implement the correct ndo ops. A follow on
>> patch adds support for ixgbe but we expect at least the subset of
>> drivers implementing ETHTOOL_SRXCLSRLINS to be implemented later.
>>
>> The interface is driven over an af_packet socket which we believe
>> is the most natural interface to use. Because it is already used
>> for raw packet interfaces which is what we are providing here.
>>   The high level flow for this interface looks like:
>>
>> 	bind(fd, &sockaddr, sizeof(sockaddr));
>>
>> 	/* Get the device type and info */
>> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
>> 		   &optlen);
>>
>> 	/* With device info we can look up descriptor format */
>>
>> 	/* Get the layout of ring space offset, page_sz, cnt */
>> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>> 		   &info, &optlen);
>>
>> 	/* request some queues from the driver */
>> 	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> 		   &qpairs_info, sizeof(qpairs_info));
>>
>> 	/* if we let the driver pick us queues learn which queues
>>           * we were given
>>           */
>> 	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> 		   &qpairs_info, sizeof(qpairs_info));
>>
>> 	/* And mmap queue pairs to user space */
>> 	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>> 	     MAP_SHARED, fd, 0);
>>
>> 	/* Now we have some user space queues to read/write to*/
>>
>> There is one critical difference when running with these interfaces
>> vs running without them. In the normal case the af_packet module
>> uses a standard descriptor format exported by the af_packet user
>> space headers. In this model because we are working directly with
>> driver queues the descriptor format maps to the descriptor format
>> used by the device. User space applications can learn device
>> information from the socket option PACKET_DEV_DESC_INFO which
>> should provide enough details to extrapulate the descriptor formats.
>> Although this adds some complexity to user space it removes the
>> requirement to copy descriptor fields around.
>>
>> The formats are usually provided by the device vendor documentation
>> If folks want I can provide a follow up patch to provide the formats
>> in a .h file in ./include/uapi/linux/ for ease of use. I have access
>> to formats for ixgbe and mlx drivers other driver owners would need to
>> provide their formats.
>>
>> We tested this interface using traffic generators and doing basic
>> L2 forwarding tests on ixgbe devices. Our tests use a set of patches
>> to DPDK to enable an interface using this socket interfaace. With
>> this interface we can xmit/receive @ line rate from a test user space
>> application on a single core.
>>
>> Additionally we have a set of DPDK patches to enable DPDK with this
>> interface. DPDK can be downloaded @ dpdk.org although as I hope is
>> clear from above DPDK is just our paticular test environment we
>> expect other libraries could be built on this interface.
>>
>> Signed-off-by: Danny Zhou <danny.zhou@intel.com>
>> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
>
> I like the ability to share a device between kernel and user mode networking.
> The model used for DPDK for this is really ugly and fragile/broken.
> Your proposal assumes that you fully trust the user mode networking application
> which is not a generally safe assumption.
>
> A device can DMA from/to any arbitrary physical memory.
> And it would be hard to use IOMMU to protect because the
> IOMMU doesn't know that the difference between the applications queue and
> the rest of the queues.
>
> At least with DPDK you can use VFIO, and you are claiming the whole device to
> allow protection against random memory being read/written.
>
>

However not all platforms support VFIO and when the application
only want to handle specific traffic types a queue maps well to
this.

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  0:06 [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access John Fastabend
                   ` (3 preceding siblings ...)
  2014-10-06 16:55 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Stephen Hemminger
@ 2014-10-06 21:42 ` David Miller
  2014-10-07  4:25   ` John Fastabend
  2014-10-07  4:24 ` Willem de Bruijn
  5 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2014-10-06 21:42 UTC (permalink / raw)
  To: john.fastabend
  Cc: dborkman, fw, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

From: John Fastabend <john.fastabend@gmail.com>
Date: Sun, 05 Oct 2014 17:06:31 -0700

> This patch adds a net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.

About the facility in general, I am generally in favor, as I expressed
at the networking track in Chiacgo.

But you missed the mark wrt. describing the descriptors.

I do not want you to give device IDs.

I want the code to be %100 agnostic to device or vendor IDs.

Really "describe" the descriptor.  Not just how large is it (32-bits,
64-bits, etc.), but also: 1) is it little or big endian 2) where is
the length field 3) where is control bit "foo" located, etc.

That's what I want to see in "struct tpacket_dev_info", rather than
device IDs and "versions".

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06 21:42 ` David Miller
@ 2014-10-07  4:25   ` John Fastabend
  0 siblings, 0 replies; 36+ messages in thread
From: John Fastabend @ 2014-10-07  4:25 UTC (permalink / raw)
  To: David Miller, john.fastabend
  Cc: dborkman, fw, gerlitz.or, hannes, netdev, john.ronciak, amirv,
	eric.dumazet, danny.zhou

On 10/06/2014 02:42 PM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Sun, 05 Oct 2014 17:06:31 -0700
> 
>> This patch adds a net_device ops to split off a set of driver queues
>> from the driver and map the queues into user space via mmap. This
>> allows the queues to be directly manipulated from user space. For
>> raw packet interface this removes any overhead from the kernel network
>> stack.
> 
> About the facility in general, I am generally in favor, as I expressed
> at the networking track in Chiacgo.
> 
> But you missed the mark wrt. describing the descriptors.
> 
> I do not want you to give device IDs.
> 
> I want the code to be %100 agnostic to device or vendor IDs.
> 

OK. I'll work on this and have a v2 after net-next opens again.

Thanks!

> Really "describe" the descriptor.  Not just how large is it (32-bits,
> 64-bits, etc.), but also: 1) is it little or big endian 2) where is
> the length field 3) where is control bit "foo" located, etc.
> 
> That's what I want to see in "struct tpacket_dev_info", rather than
> device IDs and "versions".
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-06  0:06 [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access John Fastabend
                   ` (4 preceding siblings ...)
  2014-10-06 21:42 ` David Miller
@ 2014-10-07  4:24 ` Willem de Bruijn
  2014-10-07  9:27   ` David Laight
  2014-10-07 15:21   ` Zhou, Danny
  5 siblings, 2 replies; 36+ messages in thread
From: Willem de Bruijn @ 2014-10-07  4:24 UTC (permalink / raw)
  To: John Fastabend
  Cc: Daniel Borkmann, Florian Westphal, gerlitz.or,
	Hannes Frederic Sowa, Network Development, john.ronciak,
	Amir Vadai, Eric Dumazet, danny.zhou

> Supporting some way to steer traffic to a queue
> is the _only_ hardware requirement to support the interface,

I would not impose his constraint. There may be legitimate use
cases for taking over all queues of a device. For instance, when
this is a secondary nic that does not carry any control traffic.

> Typically in an af_packet interface a packet_type handler is
> registered and used to filter traffic to the socket and do other
> things such as fan out traffic to multiple sockets. In this case the
> networking stack is being bypassed so this code is not run. So the
> hardware must push the correct traffic to the queues obtained from
> the ndo callback ndo_split_queue_pairs().

Why does the interface work at the level of queue_pairs instead of
individual queues?

>         /* Get the layout of ring space offset, page_sz, cnt */
>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>                    &info, &optlen);
>
>         /* request some queues from the driver */
>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>                    &qpairs_info, sizeof(qpairs_info));
>
>         /* if we let the driver pick us queues learn which queues
>          * we were given
>          */
>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>                    &qpairs_info, sizeof(qpairs_info));

If ethtool -U is used to steer traffic to a specific descriptor queue,
then the setsockopt can pass the exact id of that queue and there
is no need for a getsockopt follow-up.

>         /* And mmap queue pairs to user space */
>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>              MAP_SHARED, fd, 0);

How will packet data be mapped and how will userspace translate
from paddr to vaddr? Is the goal to maintain long lived mappings
and instruct drivers to allocate from this restricted range (to
avoid per-packet system calls and vma operations)?

For throughput-oriented workloads, the syscall overhead
involved in kicking the nic (on tx, or for increasing the ring
consumer index on rx) can be amortized. And the operation
can perhaps piggy-back on interrupts or other events
(as long as interrupts are not disabled for full userspace
polling). Latency would be harder to satisfy while maintaining
some kernel policy enforcement. An extreme solution
uses an asynchronously busy polling kernel worker thread
(at high cycle cost, so acceptable for few workloads).

When keeping the kernel in the loop, it is possible to do
some basic sanity checking and transparently translate between
vaddr and paddr, even when exposing the hardware descriptors
directly. Though at this point it may be just as cheap to expose
an idealized virtualized descriptor format and copy fields between
that and device descriptors.

One assumption underlying exposing the hardware descriptors
is that they are quire similar between devices. How true is this
in the context of formats that span multiple descriptors?

> + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> + *                              unsigned int qpairs_start_from,
> + *                              unsigned int qpairs_num,
> + *                              struct sock *sk)
> + *     Called to request a set of queues from the driver to be
> + *     handed to the callee for management. After this returns the
> + *     driver will not use the queues.

Are these queues also taken out of ethtool management, or is
this equivalent to taking removing them from the rss set with
ethtool -X?

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07  4:24 ` Willem de Bruijn
@ 2014-10-07  9:27   ` David Laight
  2014-10-07 15:43     ` David Miller
  2014-10-07 15:21   ` Zhou, Danny
  1 sibling, 1 reply; 36+ messages in thread
From: David Laight @ 2014-10-07  9:27 UTC (permalink / raw)
  To: 'Willem de Bruijn', John Fastabend
  Cc: Daniel Borkmann, Florian Westphal, gerlitz.or@gmail.com,
	Hannes Frederic Sowa, Network Development, john.ronciak@intel.com,
	Amir Vadai, Eric Dumazet, danny.zhou@intel.com

From: Willem de Bruijn
...
> When keeping the kernel in the loop, it is possible to do
> some basic sanity checking and transparently translate between
> vaddr and paddr, even when exposing the hardware descriptors
> directly.

The application could change the addresses after they have been
validated, but before they have been read by the device.

> Though at this point it may be just as cheap to expose
> an idealized virtualized descriptor format and copy fields between
> that and device descriptors.

That is (probably) the only scheme that stops the application
accessing random parts of physical memory.

> One assumption underlying exposing the hardware descriptors
> is that they are quire similar between devices. How true is this
> in the context of formats that span multiple descriptors?

I suspect you'd need to define complete ring entries for 'initial',
'middle', 'final' and 'complete' fragments, together with the
offsets and endianness (and size?) of the address and data fields.

Also whether there is a special 'last entry' in the ring.

Passing checksum offload flags through adds an extra level of complexity.

Rings like the xhci (actually USB, but could contain ethernet data)
require the 'owner' bit be written odd or even in alternating passes.
Actually mapping support for usbnet (especially xhci - usb3) might show
up some deficiencies in the definition.

You also need to know when transmits have completed.
This might be an 'owner' bit being cleared, but could be signalled
in an entirely different way.

	David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07  9:27   ` David Laight
@ 2014-10-07 15:43     ` David Miller
  2014-10-07 15:59       ` David Laight
  0 siblings, 1 reply; 36+ messages in thread
From: David Miller @ 2014-10-07 15:43 UTC (permalink / raw)
  To: David.Laight
  Cc: willemb, john.fastabend, dborkman, fw, gerlitz.or, hannes, netdev,
	john.ronciak, amirv, eric.dumazet, danny.zhou

From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 7 Oct 2014 09:27:03 +0000

> That is (probably) the only scheme that stops the application
> accessing random parts of physical memory.

I don't know where this claim keeps coming from, it's false.

The application has to attach memory to the ring, and then the
ring can only refer to that memory for the duration of the
session.

There is no way that the user can program the address field of the
descriptors to point at arbitrary physical memory locations.

There is protection and control.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 15:43     ` David Miller
@ 2014-10-07 15:59       ` David Laight
  2014-10-07 16:08         ` David Miller
  0 siblings, 1 reply; 36+ messages in thread
From: David Laight @ 2014-10-07 15:59 UTC (permalink / raw)
  To: 'David Miller'
  Cc: willemb@google.com, john.fastabend@gmail.com, dborkman@redhat.com,
	fw@strlen.de, gerlitz.or@gmail.com, hannes@stressinduktion.org,
	netdev@vger.kernel.org, john.ronciak@intel.com,
	amirv@mellanox.com, eric.dumazet@gmail.com, danny.zhou@intel.com

From: David 
> From: David Laight <David.Laight@ACULAB.COM>
> Date: Tue, 7 Oct 2014 09:27:03 +0000
> 
> > That is (probably) the only scheme that stops the application
> > accessing random parts of physical memory.
> 
> I don't know where this claim keeps coming from, it's false.
> 
> The application has to attach memory to the ring, and then the
> ring can only refer to that memory for the duration of the
> session.
> 
> There is no way that the user can program the address field of the
> descriptors to point at arbitrary physical memory locations.
> 
> There is protection and control.

I got the impression that the application was directly writing the ring
structure that the ethernet mac hardware uses to describe tx and rx buffers.
(ie they are mapped read-write into userspace).
Unless you have a system where you can limit the physical memory
ranges accessible to the mac hardware, I don't see how you can stop
the application putting rogue values into the ring.

Clearly I'm missing something in my quick read of the change.

	David

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 15:59       ` David Laight
@ 2014-10-07 16:08         ` David Miller
  0 siblings, 0 replies; 36+ messages in thread
From: David Miller @ 2014-10-07 16:08 UTC (permalink / raw)
  To: David.Laight
  Cc: willemb, john.fastabend, dborkman, fw, gerlitz.or, hannes, netdev,
	john.ronciak, amirv, eric.dumazet, danny.zhou

From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 7 Oct 2014 15:59:35 +0000

> From: David 
>> From: David Laight <David.Laight@ACULAB.COM>
>> Date: Tue, 7 Oct 2014 09:27:03 +0000
>> 
>> > That is (probably) the only scheme that stops the application
>> > accessing random parts of physical memory.
>> 
>> I don't know where this claim keeps coming from, it's false.
>> 
>> The application has to attach memory to the ring, and then the
>> ring can only refer to that memory for the duration of the
>> session.
>> 
>> There is no way that the user can program the address field of the
>> descriptors to point at arbitrary physical memory locations.
>> 
>> There is protection and control.
> 
> I got the impression that the application was directly writing the ring
> structure that the ethernet mac hardware uses to describe tx and rx buffers.
> (ie they are mapped read-write into userspace).
> Unless you have a system where you can limit the physical memory
> ranges accessible to the mac hardware, I don't see how you can stop
> the application putting rogue values into the ring.
> 
> Clearly I'm missing something in my quick read of the change.

No, I think I misunderstood, and apparently the Mellanox driver allows
the user to crap into arbitrary physical memory too.

All of this garbage must get fixed and this feature is a non-starter
until there is control over the memory the rings can point ti.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07  4:24 ` Willem de Bruijn
  2014-10-07  9:27   ` David Laight
@ 2014-10-07 15:21   ` Zhou, Danny
  2014-10-07 15:46     ` Willem de Bruijn
  2014-10-07 16:05     ` David Miller
  1 sibling, 2 replies; 36+ messages in thread
From: Zhou, Danny @ 2014-10-07 15:21 UTC (permalink / raw)
  To: Willem de Bruijn, John Fastabend
  Cc: Daniel Borkmann, Florian Westphal, gerlitz.or@gmail.com,
	Hannes Frederic Sowa, Network Development, Ronciak, John,
	Amir Vadai, Eric Dumazet

> -----Original Message-----
> From: Willem de Bruijn [mailto:willemb@google.com]
> Sent: Tuesday, October 07, 2014 12:24 PM
> To: John Fastabend
> Cc: Daniel Borkmann; Florian Westphal; gerlitz.or@gmail.com; Hannes Frederic Sowa; Network Development; Ronciak, John; Amir
> Vadai; Eric Dumazet; Zhou, Danny
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
> 
> > Supporting some way to steer traffic to a queue
> > is the _only_ hardware requirement to support the interface,
> 
> I would not impose his constraint. There may be legitimate use
> cases for taking over all queues of a device. For instance, when
> this is a secondary nic that does not carry any control traffic.

For the secondary NIC that carries the data plane traffics only, you can use UIO or VFIO 
to map the entire NIC's entire I/O space to user space. Then the user space poll-mode driver, 
like those have been supported and open-sourced in DPDK and those supports 
Mellanox/Emulex NICs but not open-sourced, can drive the NIC as a sole driver in user space. 

> 
> > Typically in an af_packet interface a packet_type handler is
> > registered and used to filter traffic to the socket and do other
> > things such as fan out traffic to multiple sockets. In this case the
> > networking stack is being bypassed so this code is not run. So the
> > hardware must push the correct traffic to the queues obtained from
> > the ndo callback ndo_split_queue_pairs().
> 
> Why does the interface work at the level of queue_pairs instead of
> individual queues?

The user mode "slave" driver(I call it slave driver because it is only responsible for packet I/O 
on certain queue pairs) needs at least take over one rx queue and one tx queue for ingress and 
egress traffics respectively, although the flow director only applies to ingress traffics.

> 
> >         /* Get the layout of ring space offset, page_sz, cnt */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> >                    &info, &optlen);
> >
> >         /* request some queues from the driver */
> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> >
> >         /* if we let the driver pick us queues learn which queues
> >          * we were given
> >          */
> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> 
> If ethtool -U is used to steer traffic to a specific descriptor queue,
> then the setsockopt can pass the exact id of that queue and there
> is no need for a getsockopt follow-up.

Very good point, it supports pass "-1" as queue id(following by number of qpairs needed) via 
setsockopt to af_packet and NIC kernel driver to ask the driver dynamically allocate free and 
available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
Initially, we had a implementation that calls getsockopt once and af_packet treats qpairs_info 
as a IN/OUT parameter, but it is semantic wrong, so we think above implementation is most suitable. 
But I agree with you, if setsockopt can pass the exact id with a valid queue pair index, there is no need 
to call getsocketopt.

> 
> >         /* And mmap queue pairs to user space */
> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> >              MAP_SHARED, fd, 0);
> 
> How will packet data be mapped and how will userspace translate
> from paddr to vaddr? Is the goal to maintain long lived mappings
> and instruct drivers to allocate from this restricted range (to
> avoid per-packet system calls and vma operations)?
> 

Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues 
completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages 
are translated to paddr) to fill in the packet descriptors.
As of security concern raised in previous discussion, the reason we think(BTW, correct me if I am wrong)
af_packet  is most suitable is because only user application with root permission is allowed to successfully 
split-off queue pairs and mmap a small window of PCIe I/O space to user space, so concern regarding "device 
can DMA from/to any arbitrary physical memory." is not that big. As all user space device drivers based on 
UIO mechanism has the same concern issue, VFIO adds protection but it is based on IOMMU which is
specific to Intel silicons.

> For throughput-oriented workloads, the syscall overhead
> involved in kicking the nic (on tx, or for increasing the ring
> consumer index on rx) can be amortized. And the operation
> can perhaps piggy-back on interrupts or other events
> (as long as interrupts are not disabled for full userspace
> polling). Latency would be harder to satisfy while maintaining
> some kernel policy enforcement. An extreme solution
> uses an asynchronously busy polling kernel worker thread
> (at high cycle cost, so acceptable for few workloads).
> 
> When keeping the kernel in the loop, it is possible to do
> some basic sanity checking and transparently translate between
> vaddr and paddr, even when exposing the hardware descriptors
> directly. Though at this point it may be just as cheap to expose
> an idealized virtualized descriptor format and copy fields between
> that and device descriptors.
> 
> One assumption underlying exposing the hardware descriptors
> is that they are quire similar between devices. How true is this
> in the context of formats that span multiple descriptors?
> 

Packet descriptors format varies for different vendors. On Intel NICs, 1G/10G/40G NICs have 
totally different formats. Even a same Intel 10G/40G NIC supports at least 2 different descriptor 
formats. IMHO, the idea behind those patches intent to skip descriptor difference among devices, 
as it just maps certain I/O space pages to user space and user space "slave" NIC driver can handle 
it using different descriptor struct based on vendor/device ID. But I am open to add support of generic 
packet descriptor format description, per David M' suggestion.

> > + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int qpairs_start_from,
> > + *                              unsigned int qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to request a set of queues from the driver to be
> > + *     handed to the callee for management. After this returns the
> > + *     driver will not use the queues.
> 
> Are these queues also taken out of ethtool management, or is
> this equivalent to taking removing them from the rss set with
> ethtool -X?

As a master driver, the NIC kernel driver still takes control of flow director as a ethtool backend. Generally, 
not all queues are initialized and used by NIC kernel driver, which reports actually used rx/tx numbers to stacks. 
Before splitting off certain queues, if you want use ethtool to direct traffics to those unused queues, 
ethtool reports invalid argument. Once certain stack-unaware queues are allocated for user space slave driver, 
ethtool allows directing packets to them as the NIC driver maintains a data struct about which queues are visible 
and used by kernel, which are used by user space.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 15:21   ` Zhou, Danny
@ 2014-10-07 15:46     ` Willem de Bruijn
  2014-10-07 15:55       ` John Fastabend
  2014-10-07 16:05     ` David Miller
  1 sibling, 1 reply; 36+ messages in thread
From: Willem de Bruijn @ 2014-10-07 15:46 UTC (permalink / raw)
  To: Zhou, Danny
  Cc: John Fastabend, Daniel Borkmann, Florian Westphal,
	gerlitz.or@gmail.com, Hannes Frederic Sowa, Network Development,
	Ronciak, John, Amir Vadai, Eric Dumazet

>> > Typically in an af_packet interface a packet_type handler is
>> > registered and used to filter traffic to the socket and do other
>> > things such as fan out traffic to multiple sockets. In this case the
>> > networking stack is being bypassed so this code is not run. So the
>> > hardware must push the correct traffic to the queues obtained from
>> > the ndo callback ndo_split_queue_pairs().
>>
>> Why does the interface work at the level of queue_pairs instead of
>> individual queues?
>
> The user mode "slave" driver(I call it slave driver because it is only responsible for packet I/O
> on certain queue pairs) needs at least take over one rx queue and one tx queue for ingress and
> egress traffics respectively, although the flow director only applies to ingress traffics.

That requirement of co-allocation is absent in existing packet
rings. Many applications only receive or transmit. For
receive-only, it would even be possible to map descriptor
rings read-only, if the kernel remains responsible for posting
buffers -- but I see below that that is not the case, so that's
not very relevant here.

Still, some workloads want asymmetric sets of rx and tx rings.
For instance, instead of using RSS, a process may want to
receive on as few rings as possible, load balance across
workers in software, but still give each worker thread its own
private transmit ring.


>>
>> >         /* Get the layout of ring space offset, page_sz, cnt */
>> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>> >                    &info, &optlen);
>> >
>> >         /* request some queues from the driver */
>> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> >                    &qpairs_info, sizeof(qpairs_info));
>> >
>> >         /* if we let the driver pick us queues learn which queues
>> >          * we were given
>> >          */
>> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>> >                    &qpairs_info, sizeof(qpairs_info));
>>
>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>> then the setsockopt can pass the exact id of that queue and there
>> is no need for a getsockopt follow-up.
>
> Very good point, it supports pass "-1" as queue id(following by number of qpairs needed) via
> setsockopt to af_packet and NIC kernel driver to ask the driver dynamically allocate free and
> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
> Initially, we had a implementation that calls getsockopt once and af_packet treats qpairs_info
> as a IN/OUT parameter, but it is semantic wrong, so we think above implementation is most suitable.
> But I agree with you, if setsockopt can pass the exact id with a valid queue pair index, there is no need
> to call getsocketopt.

One step further would be to move the entire configuration behind
the packet socket interface. It's perhaps out of scope of this patch,
but the difference between using `ethtool -U` and passing the same
expression through the packet socket is that in the latter case the
kernel can automatically rollback the configuration change when the
process dies.

>
>>
>> >         /* And mmap queue pairs to user space */
>> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>> >              MAP_SHARED, fd, 0);
>>
>> How will packet data be mapped and how will userspace translate
>> from paddr to vaddr? Is the goal to maintain long lived mappings
>> and instruct drivers to allocate from this restricted range (to
>> avoid per-packet system calls and vma operations)?
>>
>
> Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
> completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages
> are translated to paddr) to fill in the packet descriptors.

Ah, userspace is responsible for posting buffers and translation
from vaddr to paddr is straightforward. Yes that makes sense.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 15:46     ` Willem de Bruijn
@ 2014-10-07 15:55       ` John Fastabend
  2014-10-07 16:06         ` Zhou, Danny
  0 siblings, 1 reply; 36+ messages in thread
From: John Fastabend @ 2014-10-07 15:55 UTC (permalink / raw)
  To: Willem de Bruijn, Zhou, Danny
  Cc: John Fastabend, Daniel Borkmann, Florian Westphal,
	gerlitz.or@gmail.com, Hannes Frederic Sowa, Network Development,
	Ronciak, John, Amir Vadai, Eric Dumazet

On 10/07/2014 08:46 AM, Willem de Bruijn wrote:
>>>> Typically in an af_packet interface a packet_type handler is
>>>> registered and used to filter traffic to the socket and do other
>>>> things such as fan out traffic to multiple sockets. In this case the
>>>> networking stack is being bypassed so this code is not run. So the
>>>> hardware must push the correct traffic to the queues obtained from
>>>> the ndo callback ndo_split_queue_pairs().
>>>
>>> Why does the interface work at the level of queue_pairs instead of
>>> individual queues?
>>
>> The user mode "slave" driver(I call it slave driver because it is only responsible for packet I/O
>> on certain queue pairs) needs at least take over one rx queue and one tx queue for ingress and
>> egress traffics respectively, although the flow director only applies to ingress traffics.
> 
> That requirement of co-allocation is absent in existing packet
> rings. Many applications only receive or transmit. For
> receive-only, it would even be possible to map descriptor
> rings read-only, if the kernel remains responsible for posting
> buffers -- but I see below that that is not the case, so that's
> not very relevant here.
> 
> Still, some workloads want asymmetric sets of rx and tx rings.
> For instance, instead of using RSS, a process may want to
> receive on as few rings as possible, load balance across
> workers in software, but still give each worker thread its own
> private transmit ring.
> 

We can build this into the interface by having the setsockopt
provide both the number of tx rings and number of rx rings. It
might not be immediately available in any drivers because at
least ixgbe is pretty dependent on tx/rx pairing.

I would have to look through the other drivers to see how
much work it would be to support this on them. If I can't find
a good candidate we might leave it out until we can fix up the
drivers.

> 
>>>
>>>>         /* Get the layout of ring space offset, page_sz, cnt */
>>>>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>>>>                    &info, &optlen);
>>>>
>>>>         /* request some queues from the driver */
>>>>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, sizeof(qpairs_info));
>>>>
>>>>         /* if we let the driver pick us queues learn which queues
>>>>          * we were given
>>>>          */
>>>>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>>>>                    &qpairs_info, sizeof(qpairs_info));
>>>
>>> If ethtool -U is used to steer traffic to a specific descriptor queue,
>>> then the setsockopt can pass the exact id of that queue and there
>>> is no need for a getsockopt follow-up.
>>
>> Very good point, it supports pass "-1" as queue id(following by number of qpairs needed) via
>> setsockopt to af_packet and NIC kernel driver to ask the driver dynamically allocate free and
>> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
>> Initially, we had a implementation that calls getsockopt once and af_packet treats qpairs_info
>> as a IN/OUT parameter, but it is semantic wrong, so we think above implementation is most suitable.
>> But I agree with you, if setsockopt can pass the exact id with a valid queue pair index, there is no need
>> to call getsocketopt.
> 
> One step further would be to move the entire configuration behind
> the packet socket interface. It's perhaps out of scope of this patch,
> but the difference between using `ethtool -U` and passing the same
> expression through the packet socket is that in the latter case the
> kernel can automatically rollback the configuration change when the
> process dies.
> 

hmm might be interesting I think  this is a follow on path to
investigate after the initial support.

>>
>>>
>>>>         /* And mmap queue pairs to user space */
>>>>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>>>>              MAP_SHARED, fd, 0);
>>>
>>> How will packet data be mapped and how will userspace translate
>>> from paddr to vaddr? Is the goal to maintain long lived mappings
>>> and instruct drivers to allocate from this restricted range (to
>>> avoid per-packet system calls and vma operations)?
>>>
>>
>> Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
>> completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages
>> are translated to paddr) to fill in the packet descriptors.
> 
> Ah, userspace is responsible for posting buffers and translation
> from vaddr to paddr is straightforward. Yes that makes sense.
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 15:55       ` John Fastabend
@ 2014-10-07 16:06         ` Zhou, Danny
  0 siblings, 0 replies; 36+ messages in thread
From: Zhou, Danny @ 2014-10-07 16:06 UTC (permalink / raw)
  To: Fastabend, John R, Willem de Bruijn
  Cc: John Fastabend, Daniel Borkmann, Florian Westphal,
	gerlitz.or@gmail.com, Hannes Frederic Sowa, Network Development,
	Ronciak, John, Amir Vadai, Eric Dumazet



> -----Original Message-----
> From: Fastabend, John R
> Sent: Tuesday, October 07, 2014 11:55 PM
> To: Willem de Bruijn; Zhou, Danny
> Cc: John Fastabend; Daniel Borkmann; Florian Westphal; gerlitz.or@gmail.com; Hannes Frederic Sowa; Network Development;
> Ronciak, John; Amir Vadai; Eric Dumazet
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
> 
> On 10/07/2014 08:46 AM, Willem de Bruijn wrote:
> >>>> Typically in an af_packet interface a packet_type handler is
> >>>> registered and used to filter traffic to the socket and do other
> >>>> things such as fan out traffic to multiple sockets. In this case the
> >>>> networking stack is being bypassed so this code is not run. So the
> >>>> hardware must push the correct traffic to the queues obtained from
> >>>> the ndo callback ndo_split_queue_pairs().
> >>>
> >>> Why does the interface work at the level of queue_pairs instead of
> >>> individual queues?
> >>
> >> The user mode "slave" driver(I call it slave driver because it is only responsible for packet I/O
> >> on certain queue pairs) needs at least take over one rx queue and one tx queue for ingress and
> >> egress traffics respectively, although the flow director only applies to ingress traffics.
> >
> > That requirement of co-allocation is absent in existing packet
> > rings. Many applications only receive or transmit. For
> > receive-only, it would even be possible to map descriptor
> > rings read-only, if the kernel remains responsible for posting
> > buffers -- but I see below that that is not the case, so that's
> > not very relevant here.
> >
> > Still, some workloads want asymmetric sets of rx and tx rings.
> > For instance, instead of using RSS, a process may want to
> > receive on as few rings as possible, load balance across
> > workers in software, but still give each worker thread its own
> > private transmit ring.
> >

> 
> We can build this into the interface by having the setsockopt
> provide both the number of tx rings and number of rx rings. It
> might not be immediately available in any drivers because at
> least ixgbe is pretty dependent on tx/rx pairing.
> 
> I would have to look through the other drivers to see how
> much work it would be to support this on them. If I can't find
> a good candidate we might leave it out until we can fix up the
> drivers.
> 


Fully agreed. Unlike ixgbe, DPDK supports asymmetric sets of rx and tx rings. It should be easily enabled
if user space requests those free and available rx/tx queues, but it would cause a lot of troubles if certain 
requested queues have been occupied by ixgbe, as it breaks existing rx/tx pairing.

> >
> >>>
> >>>>         /* Get the layout of ring space offset, page_sz, cnt */
> >>>>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> >>>>                    &info, &optlen);
> >>>>
> >>>>         /* request some queues from the driver */
> >>>>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >>>>                    &qpairs_info, sizeof(qpairs_info));
> >>>>
> >>>>         /* if we let the driver pick us queues learn which queues
> >>>>          * we were given
> >>>>          */
> >>>>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >>>>                    &qpairs_info, sizeof(qpairs_info));
> >>>
> >>> If ethtool -U is used to steer traffic to a specific descriptor queue,
> >>> then the setsockopt can pass the exact id of that queue and there
> >>> is no need for a getsockopt follow-up.
> >>
> >> Very good point, it supports pass "-1" as queue id(following by number of qpairs needed) via
> >> setsockopt to af_packet and NIC kernel driver to ask the driver dynamically allocate free and
> >> available qpairs for this socket, so getsockopt() is needed to return the actually assigned queue pair indexes.
> >> Initially, we had a implementation that calls getsockopt once and af_packet treats qpairs_info
> >> as a IN/OUT parameter, but it is semantic wrong, so we think above implementation is most suitable.
> >> But I agree with you, if setsockopt can pass the exact id with a valid queue pair index, there is no need
> >> to call getsocketopt.
> >
> > One step further would be to move the entire configuration behind
> > the packet socket interface. It's perhaps out of scope of this patch,
> > but the difference between using `ethtool -U` and passing the same
> > expression through the packet socket is that in the latter case the
> > kernel can automatically rollback the configuration change when the
> > process dies.
> >
> 
> hmm might be interesting I think  this is a follow on path to
> investigate after the initial support.
> 
> >>
> >>>
> >>>>         /* And mmap queue pairs to user space */
> >>>>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> >>>>              MAP_SHARED, fd, 0);
> >>>
> >>> How will packet data be mapped and how will userspace translate
> >>> from paddr to vaddr? Is the goal to maintain long lived mappings
> >>> and instruct drivers to allocate from this restricted range (to
> >>> avoid per-packet system calls and vma operations)?
> >>>
> >>
> >> Once qpairs split-off is done, the user space driver, as a slave driver, will re-initialize those queues
> >> completely in user space by using paddr(in the case of DPDK, vaddr of DPDK used huge pages
> >> are translated to paddr) to fill in the packet descriptors.
> >
> > Ah, userspace is responsible for posting buffers and translation
> > from vaddr to paddr is straightforward. Yes that makes sense.
> > --
> > To unsubscribe from this list: send the line "unsubscribe netdev" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 15:21   ` Zhou, Danny
  2014-10-07 15:46     ` Willem de Bruijn
@ 2014-10-07 16:05     ` David Miller
  2014-10-10  3:49       ` Zhou, Danny
  1 sibling, 1 reply; 36+ messages in thread
From: David Miller @ 2014-10-07 16:05 UTC (permalink / raw)
  To: danny.zhou
  Cc: willemb, john.fastabend, dborkman, fw, gerlitz.or, hannes, netdev,
	john.ronciak, amirv, eric.dumazet

From: "Zhou, Danny" <danny.zhou@intel.com>
Date: Tue, 7 Oct 2014 15:21:15 +0000

> Once qpairs split-off is done, the user space driver, as a slave
> driver, will re-initialize those queues completely in user space by
> using paddr(in the case of DPDK, vaddr of DPDK used huge pages are
> translated to paddr) to fill in the packet descriptors.  As of
> security concern raised in previous discussion, the reason we
> think(BTW, correct me if I am wrong) af_packet is most suitable is
> because only user application with root permission is allowed to
> successfully split-off queue pairs and mmap a small window of PCIe
> I/O space to user space, so concern regarding "device can DMA
> from/to any arbitrary physical memory." is not that big. As all user
> space device drivers based on UIO mechanism has the same concern
> issue, VFIO adds protection but it is based on IOMMU which is
> specific to Intel silicons.

Wait a second.

If there is no memory protection performed I'm not merging this.

I thought the user has to associate a fixed pool of memory to the
queueus, the kernel attaches that memory, and then the user cannot
modify the addresses _AT_ _ALL_.

If the user can modify the addresses in the descriptors and make
the chip crap on random memory, this is a non-starter.

Sorry.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 16:05     ` David Miller
@ 2014-10-10  3:49       ` Zhou, Danny
  0 siblings, 0 replies; 36+ messages in thread
From: Zhou, Danny @ 2014-10-10  3:49 UTC (permalink / raw)
  To: David Miller
  Cc: willemb@google.com, john.fastabend@gmail.com, dborkman@redhat.com,
	fw@strlen.de, gerlitz.or@gmail.com, hannes@stressinduktion.org,
	netdev@vger.kernel.org, Ronciak, John, amirv@mellanox.com,
	eric.dumazet@gmail.com


> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, October 08, 2014 12:06 AM
> To: Zhou, Danny
> Cc: willemb@google.com; john.fastabend@gmail.com; dborkman@redhat.com; fw@strlen.de; gerlitz.or@gmail.com;
> hannes@stressinduktion.org; netdev@vger.kernel.org; Ronciak, John; amirv@mellanox.com; eric.dumazet@gmail.com
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
> 
> From: "Zhou, Danny" <danny.zhou@intel.com>
> Date: Tue, 7 Oct 2014 15:21:15 +0000
> 
> > Once qpairs split-off is done, the user space driver, as a slave
> > driver, will re-initialize those queues completely in user space by
> > using paddr(in the case of DPDK, vaddr of DPDK used huge pages are
> > translated to paddr) to fill in the packet descriptors.  As of
> > security concern raised in previous discussion, the reason we
> > think(BTW, correct me if I am wrong) af_packet is most suitable is
> > because only user application with root permission is allowed to
> > successfully split-off queue pairs and mmap a small window of PCIe
> > I/O space to user space, so concern regarding "device can DMA
> > from/to any arbitrary physical memory." is not that big. As all user
> > space device drivers based on UIO mechanism has the same concern
> > issue, VFIO adds protection but it is based on IOMMU which is
> > specific to Intel silicons.
> 
> Wait a second.
> 
> If there is no memory protection performed I'm not merging this.
> 
> I thought the user has to associate a fixed pool of memory to the
> queueus, the kernel attaches that memory, and then the user cannot
> modify the addresses _AT_ _ALL_.
> 
> If the user can modify the addresses in the descriptors and make
> the chip crap on random memory, this is a non-starter.
> 
> Sorry.

Fairly enough, we will manage to add memory protection in future versions. 
Several options are under investigation.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
@ 2014-10-07 16:33 Alexei Starovoitov
  2014-10-07 16:46 ` Zhou, Danny
  2014-10-07 17:01 ` David Miller
  0 siblings, 2 replies; 36+ messages in thread
From: Alexei Starovoitov @ 2014-10-07 16:33 UTC (permalink / raw)
  To: Zhou, Danny
  Cc: Willem de Bruijn, John Fastabend, Daniel Borkmann,
	Florian Westphal, gerlitz.or@gmail.com, Hannes Frederic Sowa,
	Network Development, Ronciak, John, Amir Vadai, Eric Dumazet,
	David S. Miller

On Tue, Oct 7, 2014 at 8:21 AM, Zhou, Danny <danny.zhou@intel.com> wrote:
>
> As a master driver, the NIC kernel driver still takes control of flow director as a ethtool backend. Generally,
> not all queues are initialized and used by NIC kernel driver, which reports actually used rx/tx numbers to stacks.
> Before splitting off certain queues, if you want use ethtool to direct traffics to those unused queues,
> ethtool reports invalid argument. Once certain stack-unaware queues are allocated for user space slave driver,
> ethtool allows directing packets to them as the NIC driver maintains a data struct about which queues are visible
> and used by kernel, which are used by user space.

this whole thing sounds like it's a way to let DPDK apps share physical
interfaces with kernel, so that tcp-based control plane traffic
can reach DPDK process via kernel while DPDK user space
data path does the rest of packet processing?
One still needs pcie register i/o and all of user space driver support
to make it work, right?
I guess that's great for DPDK users, but I don't think it's good for linux.

^ permalink raw reply	[flat|nested] 36+ messages in thread

* RE: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 16:33 Alexei Starovoitov
@ 2014-10-07 16:46 ` Zhou, Danny
  2014-10-07 17:01 ` David Miller
  1 sibling, 0 replies; 36+ messages in thread
From: Zhou, Danny @ 2014-10-07 16:46 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Willem de Bruijn, John Fastabend, Daniel Borkmann,
	Florian Westphal, gerlitz.or@gmail.com, Hannes Frederic Sowa,
	Network Development, Ronciak, John, Amir Vadai, Eric Dumazet,
	David S. Miller


> -----Original Message-----
> From: Alexei Starovoitov [mailto:alexei.starovoitov@gmail.com]
> Sent: Wednesday, October 08, 2014 12:33 AM
> To: Zhou, Danny
> Cc: Willem de Bruijn; John Fastabend; Daniel Borkmann; Florian Westphal; gerlitz.or@gmail.com; Hannes Frederic Sowa; Network
> Development; Ronciak, John; Amir Vadai; Eric Dumazet; David S. Miller
> Subject: Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
> 
> On Tue, Oct 7, 2014 at 8:21 AM, Zhou, Danny <danny.zhou@intel.com> wrote:
> >
> > As a master driver, the NIC kernel driver still takes control of flow director as a ethtool backend. Generally,
> > not all queues are initialized and used by NIC kernel driver, which reports actually used rx/tx numbers to stacks.
> > Before splitting off certain queues, if you want use ethtool to direct traffics to those unused queues,
> > ethtool reports invalid argument. Once certain stack-unaware queues are allocated for user space slave driver,
> > ethtool allows directing packets to them as the NIC driver maintains a data struct about which queues are visible
> > and used by kernel, which are used by user space.
> 
> this whole thing sounds like it's a way to let DPDK apps share physical
> interfaces with kernel, so that tcp-based control plane traffic
> can reach DPDK process via kernel while DPDK user space
> data path does the rest of packet processing?
> One still needs pcie register i/o and all of user space driver support
> to make it work, right?
> I guess that's great for DPDK users, but I don't think it's good for linux.

No, it is a generic NIC resource(e.g. rx/tx queue pairs) portioning mechanism among NIC kernel and user space drivers with 
different performance/latency/jitter characteristics. One only needs write efficient packet rx/tx routines by taking advantage
of whatever software optimizations, to manipulate limited rx/tx queue relevant registers on PCIe I/O space. In other words, no 
need to port the entire NIC driver from kernel to user space. 


^ permalink raw reply	[flat|nested] 36+ messages in thread

* Re: [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access
  2014-10-07 16:33 Alexei Starovoitov
  2014-10-07 16:46 ` Zhou, Danny
@ 2014-10-07 17:01 ` David Miller
  1 sibling, 0 replies; 36+ messages in thread
From: David Miller @ 2014-10-07 17:01 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: danny.zhou, willemb, john.fastabend, dborkman, fw, gerlitz.or,
	hannes, netdev, john.ronciak, amirv, eric.dumazet

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 7 Oct 2014 09:33:04 -0700

> I guess that's great for DPDK users, but I don't think it's good for
> linux.

Any use of a piece of hardware is fine with me, personally, as long
as adequate protections are in place.

If it's just a descriptor ring in software and a doorbell to trigger
a refetch of the head and tail pointers, with appropriate protection
and control of the memory attached to the ring, I don't see how I
could object to such a facility.

^ permalink raw reply	[flat|nested] 36+ messages in thread

end of thread, other threads:[~2014-10-10  3:49 UTC | newest]

Thread overview: 36+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-10-06  0:06 [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access John Fastabend
2014-10-06  0:07 ` [net-next PATCH v1 2/3] net: sched: add direct ring acces via af_packet to ixgbe John Fastabend
2014-10-06  0:07 ` [net-next PATCH v1 3/3] net: packet: Document PACKET_DEV_QPAIR_SPLIT and friends John Fastabend
2014-10-06  0:29 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Florian Westphal
2014-10-06  1:09   ` David Miller
2014-10-06  1:18     ` John Fastabend
2014-10-06  1:12   ` John Fastabend
2014-10-06  9:49     ` Daniel Borkmann
2014-10-06 15:01       ` John Fastabend
2014-10-06 16:35         ` Jesper Dangaard Brouer
2014-10-06 17:03         ` Hannes Frederic Sowa
2014-10-06 20:37           ` John Fastabend
2014-10-06 23:26             ` Hannes Frederic Sowa
2014-10-07 18:59               ` Neil Horman
2014-10-08 17:20                 ` John Fastabend
2014-10-09 13:36                   ` [PATCH] af_packet: Add Doorbell transmit mode to AF_PACKET sockets Neil Horman
2014-10-09 15:01                     ` John Fastabend
2014-10-09 16:05                       ` Neil Horman
2014-10-06 16:55 ` [net-next PATCH v1 1/3] net: sched: af_packet support for direct ring access Stephen Hemminger
2014-10-06 20:42   ` John Fastabend
2014-10-06 21:42 ` David Miller
2014-10-07  4:25   ` John Fastabend
2014-10-07  4:24 ` Willem de Bruijn
2014-10-07  9:27   ` David Laight
2014-10-07 15:43     ` David Miller
2014-10-07 15:59       ` David Laight
2014-10-07 16:08         ` David Miller
2014-10-07 15:21   ` Zhou, Danny
2014-10-07 15:46     ` Willem de Bruijn
2014-10-07 15:55       ` John Fastabend
2014-10-07 16:06         ` Zhou, Danny
2014-10-07 16:05     ` David Miller
2014-10-10  3:49       ` Zhou, Danny
  -- strict thread matches above, loose matches on Subject: below --
2014-10-07 16:33 Alexei Starovoitov
2014-10-07 16:46 ` Zhou, Danny
2014-10-07 17:01 ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).