netdev.vger.kernel.org archive mirror
* [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
@ 2015-01-13  4:35 John Fastabend
  2015-01-13  4:35 ` [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings John Fastabend
                   ` (6 more replies)
  0 siblings, 7 replies; 24+ messages in thread
From: John Fastabend @ 2015-01-13  4:35 UTC (permalink / raw)
  To: netdev; +Cc: danny.zhou, nhorman, dborkman, john.ronciak, hannes, brouer

This patch adds net_device ops to split off a set of driver queues
from the driver and map the queues into user space via mmap. This
allows the queues to be directly manipulated from user space. For
raw packet interface this removes any overhead from the kernel network
stack.

With these operations we bypass the network stack and packet_type
handlers that would typically send traffic to an af_packet socket.
This means hardware must do the forwarding. To do this we can use
the ETHTOOL_SRXCLSRLINS op in the ethtool command set. It is
currently supported by multiple drivers including sfc, mlx4, niu,
ixgbe, and i40e. Supporting some way to steer traffic to a queue
is the _only_ hardware requirement to support this interface.
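
For example, on a device with n-tuple filter support traffic can be
steered to a specific queue with a standard ethtool rule, something
along the lines of (device, port and queue index are illustrative
only):

	ethtool -N eth2 flow-type udp4 dst-port 5000 action 12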

A follow-on patch adds support for ixgbe, but we expect that at least
the subset of drivers already implementing ETHTOOL_SRXCLSRLINS can
add support later.

The high level flow, leveraging the af_packet control path, looks
like:

	bind(fd, &sockaddr, sizeof(sockaddr));

	/* Get the device type and info */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
		   &optlen);

	/* With device info we can look up descriptor format */

	/* Get the layout of ring space offset, page_sz, cnt */
	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
		   &info, &optlen);

	/* request some queues from the driver */
	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, sizeof(qpairs_info));

	/* if we let the driver pick our queues, learn which
	 * queues we were given
	 */
	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
		   &qpairs_info, &optlen);

	/* And mmap queue pairs to user space */
	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
	     MAP_SHARED, fd, 0);

	/* Now we have some user space queues to read/write to */
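
	/* When finished the queues can be handed back explicitly; they
	 * are also returned automatically when the socket is released.
	 */
	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_RETURN,
		   &qpairs_info, sizeof(qpairs_info));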

There is one critical difference when running with these interfaces
vs running without them. In the normal case the af_packet module
uses a standard descriptor format exported by the af_packet user
space headers. In this model, because we are working directly with
driver queues, the descriptor format maps to the descriptor format
used by the device. User space applications can learn device
information from the socket option PACKET_DEV_DESC_INFO. These
are described by giving the vendor/deviceid and a descriptor layout
in offset/length/width/alignment/byte_ordering.
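
As a rough illustration of the intended use (error handling omitted),
an application could walk the exported receive descriptor layout
like this:

	struct tpacket_dev_info di;
	socklen_t optlen = sizeof(di);
	int i;

	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &di, &optlen);

	for (i = 0; i < di.tp_rx_dexpr[0].num_of_fld; i++) {
		struct tpacket_nic_desc_fld *f = &di.tp_rx_dexpr[0].fields[i];

		/* offset and width are given in bits */
		printf("rx field %u: offset %u width %u align %u\n",
		       f->seqn, f->offset, f->width, f->align);
	}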

To protect against arbitrary DMA writes, IOMMU devices put the memory
in a single domain so the device cannot DMA to arbitrary system
memory. Note it would still be possible to DMA into another socket's
pages because most NIC devices only support a single domain; this
would require being able to guess the other socket's page layout.
However, the socket operation does require CAP_NET_ADMIN privileges.
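
Packet buffer memory is registered with the driver via the new
PACKET_DMA_MEM_REGION_MAP socket option before it is used for DMA.
Roughly, where buf, buf_phys and buf_len are placeholders and the
application is expected to know the physical address of its pinned
buffer (e.g. from hugepage mappings):

	struct tpacket_dma_mem_region region = {
		.addr      = buf,
		.phys_addr = buf_phys,
		.size      = buf_len,
		.direction = 0,		/* DMA_BIDIRECTIONAL */
	};
	socklen_t optlen = sizeof(region);

	setsockopt(fd, SOL_PACKET, PACKET_DMA_MEM_REGION_MAP,
		   &region, sizeof(region));

	/* read back the iova the device will use for this region */
	getsockopt(fd, SOL_PACKET, PACKET_DMA_MEM_REGION_MAP,
		   &region, &optlen);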

Additionally we have a set of DPDK patches to enable DPDK with this
interface. DPDK can be downloaded at dpdk.org although, as I hope is
clear from the above, DPDK is just our particular test environment;
we expect other libraries could be built on this interface.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 include/linux/netdevice.h      |   79 ++++++++
 include/uapi/linux/if_packet.h |   88 +++++++++
 net/packet/af_packet.c         |  397 ++++++++++++++++++++++++++++++++++++++++
 net/packet/internal.h          |   10 +
 4 files changed, 573 insertions(+), 1 deletion(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 679e6e9..b71c97d 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -52,6 +52,8 @@
 #include <linux/neighbour.h>
 #include <uapi/linux/netdevice.h>
 
+#include <linux/if_packet.h>
+
 struct netpoll_info;
 struct device;
 struct phy_device;
@@ -1030,6 +1032,54 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
  * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
  *	Called to notify switch device port of bridge port STP
  *	state change.
+ *
+ * int (*ndo_split_queue_pairs) (struct net_device *dev,
+ *				 unsigned int qpairs_start_from,
+ *				 unsigned int qpairs_num,
+ *				 struct sock *sk)
+ *	Called to request a set of queues from the driver to be handed to the
+ *	caller for management. After this returns the driver will not use the
+ *	queues.
+ *
+ * int (*ndo_get_split_queue_pairs) (struct net_device *dev,
+ *				 unsigned int *qpairs_start_from,
+ *				 unsigned int *qpairs_num,
+ *				 struct sock *sk)
+ *	Called to get the location of queues that have been split for user
+ *	space to use. The socket must have previously requested the queues via
+ *	ndo_split_queue_pairs successfully.
+ *
+ * int (*ndo_return_queue_pairs) (struct net_device *dev,
+ *				  struct sock *sk)
+ *	Called to return a set of queues identified by sock to the driver. The
+ *	socket must have previously requested the queues via
+ *	ndo_split_queue_pairs for this action to be performed.
+ *
+ * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
+ *				struct tpacket_dev_qpair_map_region_info *info)
+ *	Called to return mapping of queue memory region.
+ *
+ * int (*ndo_get_device_desc_info) (struct net_device *dev,
+ *				    struct tpacket_dev_info *dev_info)
+ *	Called to get device specific information. This should uniquely identify
+ *	the hardware so that descriptor formats can be learned by the stack/user
+ *	space.
+ *
+ * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
+ *				     struct net_device *dev)
+ *	Called to map queue pair range from split_queue_pairs into mmap region.
+ *
+ * int (*ndo_validate_dma_mem_region_map)
+ *					(struct net_device *dev,
+ *					 struct tpacket_dma_mem_region *region,
+ *					 struct sock *sk)
+ *	Called to validate DMA address remapping for a userspace memory region.
+ *
+ * int (*ndo_get_dma_region_info)
+ *				 (struct net_device *dev,
+ *				  struct tpacket_dma_mem_region *region,
+ *				  struct sock *sk)
+ *	Called to get DMA region information such as the iova.
  */
 struct net_device_ops {
 	int			(*ndo_init)(struct net_device *dev);
@@ -1190,6 +1240,35 @@ struct net_device_ops {
 	int			(*ndo_switch_port_stp_update)(struct net_device *dev,
 							      u8 state);
 #endif
+	int			(*ndo_split_queue_pairs)(struct net_device *dev,
+					 unsigned int qpairs_start_from,
+					 unsigned int qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_get_split_queue_pairs)
+					(struct net_device *dev,
+					 unsigned int *qpairs_start_from,
+					 unsigned int *qpairs_num,
+					 struct sock *sk);
+	int			(*ndo_return_queue_pairs)
+					(struct net_device *dev,
+					 struct sock *sk);
+	int			(*ndo_get_device_qpair_map_region_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_qpair_map_region_info *info);
+	int			(*ndo_get_device_desc_info)
+					(struct net_device *dev,
+					 struct tpacket_dev_info *dev_info);
+	int			(*ndo_direct_qpair_page_map)
+					(struct vm_area_struct *vma,
+					 struct net_device *dev);
+	int			(*ndo_validate_dma_mem_region_map)
+					(struct net_device *dev,
+					 struct tpacket_dma_mem_region *region,
+					 struct sock *sk);
+	int			(*ndo_get_dma_region_info)
+					(struct net_device *dev,
+					 struct tpacket_dma_mem_region *region,
+					 struct sock *sk);
 };
 
 /**
diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
index da2d668..eb7a727 100644
--- a/include/uapi/linux/if_packet.h
+++ b/include/uapi/linux/if_packet.h
@@ -54,6 +54,13 @@ struct sockaddr_ll {
 #define PACKET_FANOUT			18
 #define PACKET_TX_HAS_OFF		19
 #define PACKET_QDISC_BYPASS		20
+#define PACKET_RXTX_QPAIRS_SPLIT	21
+#define PACKET_RXTX_QPAIRS_RETURN	22
+#define PACKET_DEV_QPAIR_MAP_REGION_INFO	23
+#define PACKET_DEV_DESC_INFO		24
+#define PACKET_DMA_MEM_REGION_MAP       25
+#define PACKET_DMA_MEM_REGION_RELEASE   26
+
 
 #define PACKET_FANOUT_HASH		0
 #define PACKET_FANOUT_LB		1
@@ -64,6 +71,87 @@ struct sockaddr_ll {
 #define PACKET_FANOUT_FLAG_ROLLOVER	0x1000
 #define PACKET_FANOUT_FLAG_DEFRAG	0x8000
 
+#define PACKET_MAX_NUM_MAP_MEMORY_REGIONS 64
+#define PACKET_MAX_NUM_DESC_FORMATS	  8
+#define PACKET_MAX_NUM_DESC_FIELDS	  64
+#define PACKET_NIC_DESC_FIELD(fseq, foffset, fwidth, falign, fbo) \
+		.seqn = (__u8)fseq,				\
+		.offset = (__u8)foffset,			\
+		.width = (__u8)fwidth,				\
+		.align = (__u8)falign,				\
+		.byte_order = (__u8)fbo
+
+#define MAX_MAP_MEMORY_REGIONS	64
+
+/* setsockopt takes addr, size, direction parameters; getsockopt takes
+ * iova, size, direction.
+ */
+struct tpacket_dma_mem_region {
+	void *addr;		/* userspace virtual address */
+	__u64 phys_addr;	/* physical address */
+	__u64 iova;		/* IO virtual address used for DMA */
+	unsigned long size;	/* size of region */
+	int direction;		/* dma data direction */
+};
+
+struct tpacket_dev_qpair_map_region_info {
+	unsigned int tp_dev_bar_sz;		/* size of BAR */
+	unsigned int tp_dev_sysm_sz;		/* size of system memory */
+	/* number of contiguous memory regions on BAR mapped to user space */
+	unsigned int tp_num_map_regions;
+	/* number of contiguous system memory regions mapped to user space */
+	unsigned int tp_num_sysm_map_regions;
+	struct map_page_region {
+		unsigned page_offset;	/* offset to start of region */
+		unsigned page_sz;	/* size of page */
+		unsigned page_cnt;	/* number of pages */
+	} tp_regions[MAX_MAP_MEMORY_REGIONS];
+};
+
+struct tpacket_dev_qpairs_info {
+	unsigned int tp_qpairs_start_from;	/* qpairs index to start from */
+	unsigned int tp_qpairs_num;		/* number of qpairs */
+};
+
+enum tpack_desc_byte_order {
+	BO_NATIVE = 0,
+	BO_NETWORK,
+	BO_BIG_ENDIAN,
+	BO_LITTLE_ENDIAN,
+};
+
+struct tpacket_nic_desc_fld {
+	__u8 seqn;	/* Sequence index of descriptor field */
+	__u8 offset;	/* Offset to start */
+	__u8 width;	/* Width of field */
+	__u8 align;	/* Alignment in bits */
+	enum tpack_desc_byte_order byte_order;	/* Endian flag */
+};
+
+struct tpacket_nic_desc_expr {
+	__u8 version;		/* Version number */
+	__u8 size;		/* Descriptor size in bytes */
+	enum tpack_desc_byte_order byte_order;		/* Endian flag */
+	__u8 num_of_fld;	/* Number of valid fields */
+	/* List of each descriptor field */
+	struct tpacket_nic_desc_fld fields[PACKET_MAX_NUM_DESC_FIELDS];
+};
+
+struct tpacket_dev_info {
+	__u16	tp_device_id;
+	__u16	tp_vendor_id;
+	__u16	tp_subsystem_device_id;
+	__u16	tp_subsystem_vendor_id;
+	__u32	tp_numa_node;
+	__u32	tp_revision_id;
+	__u32	tp_num_total_qpairs;
+	__u32	tp_num_inuse_qpairs;
+	__u32	tp_num_rx_desc_fmt;
+	__u32	tp_num_tx_desc_fmt;
+	struct tpacket_nic_desc_expr tp_rx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
+	struct tpacket_nic_desc_expr tp_tx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
+};
+
 struct tpacket_stats {
 	unsigned int	tp_packets;
 	unsigned int	tp_drops;
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index 6880f34..8cd17da 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -214,6 +214,9 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
 static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
 		struct tpacket3_hdr *);
 static void packet_flush_mclist(struct sock *sk);
+static int umem_release(struct net_device *dev, struct packet_sock *po);
+static int get_umem_pages(struct tpacket_dma_mem_region *region,
+			  struct packet_umem_region *umem);
 
 struct packet_skb_cb {
 	unsigned int origlen;
@@ -2633,6 +2636,16 @@ static int packet_release(struct socket *sock)
 	sock_prot_inuse_add(net, sk->sk_prot, -1);
 	preempt_enable();
 
+	if (po->tp_owns_queue_pairs) {
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (dev) {
+			dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
+			umem_release(dev, po);
+		}
+	}
+
 	spin_lock(&po->bind_lock);
 	unregister_prot_hook(sk, false);
 	packet_cached_dev_reset(po);
@@ -2829,6 +2842,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
 	po->num = proto;
 	po->xmit = dev_queue_xmit;
 
+	INIT_LIST_HEAD(&po->umem_list);
+
 	err = packet_alloc_pending(po);
 	if (err)
 		goto out2;
@@ -3226,6 +3241,88 @@ static void packet_flush_mclist(struct sock *sk)
 }
 
 static int
+get_umem_pages(struct tpacket_dma_mem_region *region,
+	       struct packet_umem_region *umem)
+{
+	struct page **page_list;
+	unsigned long npages;
+	unsigned long offset;
+	unsigned long base;
+	unsigned long i;
+	int ret;
+	dma_addr_t phys_base;
+
+	phys_base = (region->phys_addr) & PAGE_MASK;
+	base = ((unsigned long)region->addr) & PAGE_MASK;
+	offset = ((unsigned long)region->addr) & (~PAGE_MASK);
+	npages = PAGE_ALIGN(region->size + offset) >> PAGE_SHIFT;
+
+	npages = min_t(unsigned long, npages, umem->nents);
+	sg_init_table(umem->sglist, npages);
+
+	umem->nmap = 0;
+	page_list = (struct page **)__get_free_page(GFP_KERNEL);
+	if (!page_list)
+		return -ENOMEM;
+
+	while (npages) {
+		unsigned long min = min_t(unsigned long, npages,
+					  PAGE_SIZE / sizeof(struct page *));
+
+		ret = get_user_pages(current, current->mm, base, min,
+				     1, 0, page_list, NULL);
+		if (ret < 0)
+			break;
+
+		base += ret * PAGE_SIZE;
+		npages -= ret;
+
+		/* validate that the memory region is physically contiguous */
+		for (i = 0; i < ret; i++) {
+			unsigned int page_index =
+				(page_to_phys(page_list[i]) - phys_base) /
+				PAGE_SIZE;
+
+			if (page_index != umem->nmap + i) {
+				int j;
+
+				for (j = 0; j < (umem->nmap + i); j++)
+					put_page(sg_page(&umem->sglist[j]));
+
+				free_page((unsigned long)page_list);
+				return -EFAULT;
+			}
+
+			sg_set_page(&umem->sglist[umem->nmap + i],
+				    page_list[i], PAGE_SIZE, 0);
+		}
+
+		umem->nmap += ret;
+	}
+
+	free_page((unsigned long)page_list);
+	return 0;
+}
+
+static int
+umem_release(struct net_device *dev, struct packet_sock *po)
+{
+	struct packet_umem_region *umem, *tmp;
+	int i;
+
+	list_for_each_entry_safe(umem, tmp, &po->umem_list, list) {
+		dma_unmap_sg(dev->dev.parent, umem->sglist,
+			     umem->nmap, umem->direction);
+		for (i = 0; i < umem->nmap; i++)
+			put_page(sg_page(&umem->sglist[i]));
+
+		vfree(umem);
+	}
+
+	return 0;
+}
+
+static int
 packet_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
 {
 	struct sock *sk = sock->sk;
@@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
 		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
 		return 0;
 	}
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct tpacket_dev_qpairs_info qpairs;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs))
+			return -EINVAL;
+		if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
+			return -EFAULT;
+
+		/* Only allow one set of queues to be owned by userspace */
+		if (po->tp_owns_queue_pairs)
+			return -EBUSY;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		ops = dev->netdev_ops;
+		if (!ops->ndo_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err =  ops->ndo_split_queue_pairs(dev,
+						  qpairs.tp_qpairs_start_from,
+						  qpairs.tp_qpairs_num, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = true;
+
+		return err;
+	}
+	case PACKET_RXTX_QPAIRS_RETURN:
+	{
+		struct tpacket_dev_qpairs_info qpairs_info;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (optlen != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		ops = dev->netdev_ops;
+		if (!ops->ndo_return_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err = ops->ndo_return_queue_pairs(dev, sk);
+		if (!err)
+			po->tp_owns_queue_pairs = false;
+
+		return err;
+	}
+	case PACKET_DMA_MEM_REGION_MAP:
+	{
+		struct tpacket_dma_mem_region region;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		struct packet_umem_region *umem;
+		unsigned long npages;
+		unsigned long offset;
+		unsigned long i;
+		int err;
+
+		if (optlen != sizeof(region))
+			return -EINVAL;
+		if (copy_from_user(&region, optval, sizeof(region)))
+			return -EFAULT;
+		if ((region.direction != DMA_BIDIRECTIONAL) &&
+		    (region.direction != DMA_TO_DEVICE) &&
+		    (region.direction != DMA_FROM_DEVICE))
+			return -EFAULT;
+
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		offset = ((unsigned long)region.addr) & (~PAGE_MASK);
+		npages = PAGE_ALIGN(region.size + offset) >> PAGE_SHIFT;
+
+		umem = vzalloc(sizeof(*umem) +
+			       sizeof(struct scatterlist) * npages);
+		if (!umem)
+			return -ENOMEM;
+
+		umem->nents = npages;
+		umem->direction = region.direction;
+
+		down_write(&current->mm->mmap_sem);
+		if (get_umem_pages(&region, umem) < 0) {
+			ret = -EFAULT;
+			goto exit;
+		}
+
+		if ((umem->nmap == npages) &&
+		    (0 != dma_map_sg(dev->dev.parent, umem->sglist,
+				     umem->nmap, region.direction))) {
+			region.iova = sg_dma_address(umem->sglist) + offset;
+
+			ops = dev->netdev_ops;
+			if (!ops->ndo_validate_dma_mem_region_map) {
+				ret = -EOPNOTSUPP;
+				goto unmap;
+			}
+
+			/* use driver to validate mapping of dma memory */
+			err = ops->ndo_validate_dma_mem_region_map(dev,
+								   &region,
+								   sk);
+			if (!err) {
+				list_add_tail(&umem->list, &po->umem_list);
+				ret = 0;
+				goto exit;
+			}
+		}
+
+unmap:
+		dma_unmap_sg(dev->dev.parent, umem->sglist,
+			     umem->nmap, umem->direction);
+		for (i = 0; i < umem->nmap; i++)
+			put_page(sg_page(&umem->sglist[i]));
+
+		vfree(umem);
+exit:
+		up_write(&current->mm->mmap_sem);
+
+		return ret;
+	}
+	case PACKET_DMA_MEM_REGION_RELEASE:
+	{
+		struct net_device *dev;
+
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		down_write(&current->mm->mmap_sem);
+		ret = umem_release(dev, po);
+		up_write(&current->mm->mmap_sem);
+
+		return ret;
+	}
+
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3523,6 +3781,129 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	case PACKET_QDISC_BYPASS:
 		val = packet_use_direct_xmit(po);
 		break;
+	case PACKET_RXTX_QPAIRS_SPLIT:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_qpairs_info qpairs_info;
+		int err;
+
+		if (len != sizeof(qpairs_info))
+			return -EINVAL;
+		if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
+			return -EFAULT;
+
+		/* This call only works after a successful queue pairs split-off
+		 * operation via setsockopt()
+		 */
+		if (!po->tp_owns_queue_pairs)
+			return -EINVAL;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_get_split_queue_pairs)
+			return -EOPNOTSUPP;
+
+		err = dev->netdev_ops->ndo_get_split_queue_pairs(dev,
+					&qpairs_info.tp_qpairs_start_from,
+					&qpairs_info.tp_qpairs_num, sk);
+
+		lv = sizeof(qpairs_info);
+		data = &qpairs_info;
+		break;
+	}
+	case PACKET_DEV_QPAIR_MAP_REGION_INFO:
+	{
+		struct tpacket_dev_qpair_map_region_info info;
+		const struct net_device_ops *ops;
+		struct net_device *dev;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		ops = dev->netdev_ops;
+		if (!ops->ndo_get_device_qpair_map_region_info)
+			return -EOPNOTSUPP;
+
+		err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_qpair_map_region_info);
+		data = &info;
+		break;
+	}
+	case PACKET_DEV_DESC_INFO:
+	{
+		struct net_device *dev;
+		struct tpacket_dev_info info;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+		if (!dev->netdev_ops->ndo_get_device_desc_info)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dev_info);
+		data = &info;
+		break;
+	}
+	case PACKET_DMA_MEM_REGION_MAP:
+	{
+		struct tpacket_dma_mem_region info;
+		struct net_device *dev;
+		int err;
+
+		if (len != sizeof(info))
+			return -EINVAL;
+		if (copy_from_user(&info, optval, sizeof(info)))
+			return -EFAULT;
+
+		/* This call only works after a bind call which calls a dev_hold
+		 * operation so we do not need to increment dev ref counter
+		 */
+		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+		if (!dev)
+			return -EINVAL;
+
+		if (!dev->netdev_ops->ndo_get_dma_region_info)
+			return -EOPNOTSUPP;
+
+		err =  dev->netdev_ops->ndo_get_dma_region_info(dev, &info, sk);
+		if (err)
+			return err;
+
+		lv = sizeof(struct tpacket_dma_mem_region);
+		data = &info;
+		break;
+	}
+
 	default:
 		return -ENOPROTOOPT;
 	}
@@ -3536,7 +3917,6 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
 	return 0;
 }
 
-
 static int packet_notifier(struct notifier_block *this,
 			   unsigned long msg, void *ptr)
 {
@@ -3920,6 +4300,8 @@ static int packet_mmap(struct file *file, struct socket *sock,
 	struct packet_sock *po = pkt_sk(sk);
 	unsigned long size, expected_size;
 	struct packet_ring_buffer *rb;
+	const struct net_device_ops *ops;
+	struct net_device *dev;
 	unsigned long start;
 	int err = -EINVAL;
 	int i;
@@ -3927,8 +4309,20 @@ static int packet_mmap(struct file *file, struct socket *sock,
 	if (vma->vm_pgoff)
 		return -EINVAL;
 
+	dev = __dev_get_by_index(sock_net(sk), po->ifindex);
+	if (!dev)
+		return -EINVAL;
+
 	mutex_lock(&po->pg_vec_lock);
 
+	if (po->tp_owns_queue_pairs) {
+		ops = dev->netdev_ops;
+		err = ops->ndo_direct_qpair_page_map(vma, dev);
+		if (err)
+			goto out;
+		goto done;
+	}
+
 	expected_size = 0;
 	for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
 		if (rb->pg_vec) {
@@ -3966,6 +4360,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
 		}
 	}
 
+done:
 	atomic_inc(&po->mapped);
 	vma->vm_ops = &packet_mmap_ops;
 	err = 0;
diff --git a/net/packet/internal.h b/net/packet/internal.h
index cdddf6a..55d2fce 100644
--- a/net/packet/internal.h
+++ b/net/packet/internal.h
@@ -90,6 +90,14 @@ struct packet_fanout {
 	struct packet_type	prot_hook ____cacheline_aligned_in_smp;
 };
 
+struct packet_umem_region {
+	struct list_head	list;
+	int			nents;
+	int			nmap;
+	int			direction;
+	struct scatterlist	sglist[0];
+};
+
 struct packet_sock {
 	/* struct sock has to be the first member of packet_sock */
 	struct sock		sk;
@@ -97,6 +105,7 @@ struct packet_sock {
 	union  tpacket_stats_u	stats;
 	struct packet_ring_buffer	rx_ring;
 	struct packet_ring_buffer	tx_ring;
+	struct list_head        umem_list;
 	int			copy_thresh;
 	spinlock_t		bind_lock;
 	struct mutex		pg_vec_lock;
@@ -113,6 +122,7 @@ struct packet_sock {
 	unsigned int		tp_reserve;
 	unsigned int		tp_loss:1;
 	unsigned int		tp_tx_has_off:1;
+	unsigned int		tp_owns_queue_pairs:1;
 	unsigned int		tp_tstamp;
 	struct net_device __rcu	*cached_dev;
 	int			(*xmit)(struct sk_buff *skb);


* [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
  2015-01-13  4:35 [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
@ 2015-01-13  4:35 ` John Fastabend
  2015-01-13 12:05   ` Hannes Frederic Sowa
                     ` (2 more replies)
  2015-01-13  4:42 ` [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 24+ messages in thread
From: John Fastabend @ 2015-01-13  4:35 UTC (permalink / raw)
  To: netdev; +Cc: danny.zhou, nhorman, dborkman, john.ronciak, hannes, brouer

This allows driver queues to be split off and mapped into user
space using af_packet.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h         |   17 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   23 +
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  407 ++++++++++++++++++++++
 drivers/net/ethernet/intel/ixgbe/ixgbe_type.h    |    1 
 4 files changed, 440 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index 38fc64c..aa4960e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -204,6 +204,20 @@ struct ixgbe_tx_queue_stats {
 	u64 tx_done_old;
 };
 
+#define MAX_USER_DMA_REGIONS_PER_SOCKET  16
+
+struct ixgbe_user_dma_region {
+	dma_addr_t dma_region_iova;
+	unsigned long dma_region_size;
+	int direction;
+};
+
+struct ixgbe_user_queue_info {
+	struct sock *sk_handle;
+	struct ixgbe_user_dma_region regions[MAX_USER_DMA_REGIONS_PER_SOCKET];
+	int num_of_regions;
+};
+
 struct ixgbe_rx_queue_stats {
 	u64 rsc_count;
 	u64 rsc_flush;
@@ -673,6 +687,9 @@ struct ixgbe_adapter {
 
 	struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];
 
+	/* Direct User Space Queues */
+	struct ixgbe_user_queue_info user_queue_info[MAX_RX_QUEUES];
+
 	/* DCB parameters */
 	struct ieee_pfc *ixgbe_ieee_pfc;
 	struct ieee_ets *ixgbe_ieee_ets;
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
index e5be0dd..f180a58 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
@@ -2598,12 +2598,17 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 	if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
 		return -EOPNOTSUPP;
 
+	if (fsp->ring_cookie > MAX_RX_QUEUES)
+		return -EINVAL;
+
 	/*
 	 * Don't allow programming if the action is a queue greater than
-	 * the number of online Rx queues.
+	 * the number of online Rx queues unless it is a user space
+	 * queue.
 	 */
 	if ((fsp->ring_cookie != RX_CLS_FLOW_DISC) &&
-	    (fsp->ring_cookie >= adapter->num_rx_queues))
+	    (fsp->ring_cookie >= adapter->num_rx_queues) &&
+	    !adapter->user_queue_info[fsp->ring_cookie].sk_handle)
 		return -EINVAL;
 
 	/* Don't allow indexes to exist outside of available space */
@@ -2680,12 +2685,18 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
 	/* apply mask and compute/store hash */
 	ixgbe_atr_compute_perfect_hash_82599(&input->filter, &mask);
 
+	/* Set input action to reg_idx for driver owned queues otherwise
+	 * use the absolute index for user space queues.
+	 */
+	if (fsp->ring_cookie < adapter->num_rx_queues &&
+	    fsp->ring_cookie != IXGBE_FDIR_DROP_QUEUE)
+		input->action = adapter->rx_ring[input->action]->reg_idx;
+
 	/* program filters to filter memory */
 	err = ixgbe_fdir_write_perfect_filter_82599(hw,
-				&input->filter, input->sw_idx,
-				(input->action == IXGBE_FDIR_DROP_QUEUE) ?
-				IXGBE_FDIR_DROP_QUEUE :
-				adapter->rx_ring[input->action]->reg_idx);
+						    &input->filter,
+						    input->sw_idx,
+						    input->action);
 	if (err)
 		goto err_out_w_lock;
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 2ed2c7d..be5bde86 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -50,6 +50,9 @@
 #include <linux/if_bridge.h>
 #include <linux/prefetch.h>
 #include <scsi/fc/fc_fcoe.h>
+#include <linux/mm.h>
+#include <linux/if_packet.h>
+#include <linux/iommu.h>
 
 #ifdef CONFIG_OF
 #include <linux/of_net.h>
@@ -80,6 +83,12 @@ const char ixgbe_driver_version[] = DRV_VERSION;
 static const char ixgbe_copyright[] =
 				"Copyright (c) 1999-2014 Intel Corporation.";
 
+static unsigned int *dummy_page_buf;
+
+#ifndef CONFIG_DMA_MEMORY_PROTECTION
+#define CONFIG_DMA_MEMORY_PROTECTION
+#endif
+
 static const struct ixgbe_info *ixgbe_info_tbl[] = {
 	[board_82598]		= &ixgbe_82598_info,
 	[board_82599]		= &ixgbe_82599_info,
@@ -167,6 +176,76 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit PCI Express Network Driver");
 MODULE_LICENSE("GPL");
 MODULE_VERSION(DRV_VERSION);
 
+enum ixgbe_legacy_rx_enum {
+	IXGBE_LEGACY_RX_FIELD_PKT_ADDR = 0,	/* Packet buffer address */
+	IXGBE_LEGACY_RX_FIELD_LENGTH,		/* Packet length */
+	IXGBE_LEGACY_RX_FIELD_CSUM,		/* Fragment checksum */
+	IXGBE_LEGACY_RX_FIELD_STATUS,		/* Descriptors status */
+	IXGBE_LEGACY_RX_FIELD_ERRORS,		/* Receive errors */
+	IXGBE_LEGACY_RX_FIELD_VLAN,		/* VLAN tag */
+};
+
+enum ixgbe_legacy_tx_enum {
+	IXGBE_LEGACY_TX_FIELD_PKT_ADDR = 0,	/* Packet buffer address */
+	IXGBE_LEGACY_TX_FIELD_LENGTH,		/* Packet length */
+	IXGBE_LEGACY_TX_FIELD_CSO,		/* Checksum offset */
+	IXGBE_LEGACY_TX_FIELD_CMD,		/* Descriptor control */
+	IXGBE_LEGACY_TX_FIELD_STATUS,		/* Descriptor status */
+	IXGBE_LEGACY_TX_FIELD_RSVD,		/* Reserved */
+	IXGBE_LEGACY_TX_FIELD_CSS,		/* Checksum start */
+	IXGBE_LEGACY_TX_FIELD_VLAN_TAG,		/* VLAN tag */
+};
+
+/* IXGBE Receive Descriptor - Legacy */
+static const struct tpacket_nic_desc_fld ixgbe_legacy_rx_desc[] = {
+	/* Packet buffer address */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_PKT_ADDR,
+				0,  64, 64,  BO_NATIVE)},
+	/* Packet length */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_LENGTH,
+				64, 16, 8,  BO_NATIVE)},
+	/* Fragment checksum */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_CSUM,
+				80, 16, 8,  BO_NATIVE)},
+	/* Descriptors status */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_STATUS,
+				96, 8, 8,  BO_NATIVE)},
+	/* Receive errors */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_ERRORS,
+				104, 8, 8,  BO_NATIVE)},
+	/* VLAN tag */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_VLAN,
+				112, 16, 8,  BO_NATIVE)},
+};
+
+/* IXGBE Transmit Descriptor - Legacy */
+static const struct tpacket_nic_desc_fld ixgbe_legacy_tx_desc[] = {
+	/* Packet buffer address */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_PKT_ADDR,
+				0,   64, 64,  BO_NATIVE)},
+	/* Data buffer length */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_LENGTH,
+				64,  16, 8,  BO_NATIVE)},
+	/* Checksum offset */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CSO,
+				80,  8, 8,  BO_NATIVE)},
+	/* Command byte */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CMD,
+				88,  8, 8,  BO_NATIVE)},
+	/* Transmitted status */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_STATUS,
+				96,  4, 1,  BO_NATIVE)},
+	/* Reserved */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_RSVD,
+				100, 4, 1,  BO_NATIVE)},
+	/* Checksum start */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CSS,
+				104, 8, 8,  BO_NATIVE)},
+	/* VLAN tag */
+	{PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_VLAN_TAG,
+				112, 16, 8,  BO_NATIVE)},
+};
+
 static bool ixgbe_check_cfg_remove(struct ixgbe_hw *hw, struct pci_dev *pdev);
 
 static int ixgbe_read_pci_cfg_word_parent(struct ixgbe_adapter *adapter,
@@ -3137,6 +3216,17 @@ static void ixgbe_enable_rx_drop(struct ixgbe_adapter *adapter,
 	IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(reg_idx), srrctl);
 }
 
+static bool ixgbe_have_user_queues(struct ixgbe_adapter *adapter)
+{
+	int i;
+
+	for (i = 0; i < MAX_RX_QUEUES; i++) {
+		if (adapter->user_queue_info[i].sk_handle)
+			return true;
+	}
+	return false;
+}
+
 static void ixgbe_disable_rx_drop(struct ixgbe_adapter *adapter,
 				  struct ixgbe_ring *ring)
 {
@@ -3171,7 +3261,8 @@ static void ixgbe_set_rx_drop_en(struct ixgbe_adapter *adapter)
 	 *  and performance reasons.
 	 */
 	if (adapter->num_vfs || (adapter->num_rx_queues > 1 &&
-	    !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en)) {
+	    !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en) ||
+	    ixgbe_have_user_queues(adapter)) {
 		for (i = 0; i < adapter->num_rx_queues; i++)
 			ixgbe_enable_rx_drop(adapter, adapter->rx_ring[i]);
 	} else {
@@ -7938,6 +8029,306 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
 	kfree(fwd_adapter);
 }
 
+static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
+				       unsigned int start_from,
+				       unsigned int qpairs_num,
+				       struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+
+	/* allocate whatever qpairs are available */
+	if (start_from == -1) {
+		unsigned int count = 0;
+
+		for (qpair_index = adapter->num_rx_queues;
+		     qpair_index < MAX_RX_QUEUES;
+		     qpair_index++) {
+			if (!adapter->user_queue_info[qpair_index].sk_handle) {
+				count++;
+				if (count == qpairs_num) {
+					start_from = qpair_index - count + 1;
+					break;
+				}
+			} else {
+				count = 0;
+			}
+		}
+	}
+
+	/* otherwise the caller specified exact queues */
+	if ((start_from > MAX_TX_QUEUES) ||
+	    (start_from > MAX_RX_QUEUES) ||
+	    (start_from + qpairs_num > MAX_TX_QUEUES) ||
+	    (start_from + qpairs_num > MAX_RX_QUEUES))
+		return -EINVAL;
+
+	/* If the qpairs are being used by the driver do not let user space
+	 * consume the queues. Also fail the request if the queue has
+	 * already been allocated to another socket.
+	 */
+	for (qpair_index = start_from;
+	     qpair_index < start_from + qpairs_num;
+	     qpair_index++) {
+		if ((qpair_index < adapter->num_tx_queues) ||
+		    (qpair_index < adapter->num_rx_queues))
+			return -EINVAL;
+
+		if (adapter->user_queue_info[qpair_index].sk_handle)
+			return -EBUSY;
+	}
+
+	/* remember the sk handle for each queue pair */
+	for (qpair_index = start_from;
+	     qpair_index < start_from + qpairs_num;
+	     qpair_index++) {
+		adapter->user_queue_info[qpair_index].sk_handle = sk;
+		adapter->user_queue_info[qpair_index].num_of_regions = 0;
+	}
+
+	return 0;
+}
+
+static int ixgbe_ndo_get_split_queue_pairs(struct net_device *dev,
+					   unsigned int *start_from,
+					   unsigned int *qpairs_num,
+					   struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index;
+	*qpairs_num = 0;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		if (adapter->user_queue_info[qpair_index].sk_handle == sk) {
+			if (*qpairs_num == 0)
+				*start_from = qpair_index;
+			*qpairs_num = *qpairs_num + 1;
+		}
+	}
+
+	return 0;
+}
+
+static int ixgbe_ndo_return_queue_pairs(struct net_device *dev, struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_user_queue_info *info;
+	unsigned int qpair_index;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		info = &adapter->user_queue_info[qpair_index];
+
+		if (info->sk_handle == sk) {
+			info->sk_handle = NULL;
+			info->num_of_regions = 0;
+		}
+	}
+
+	return 0;
+}
+
+/* Rx descriptors start at offset 0x1000 and Tx descriptors start at
+ * 0x6000; both the Tx and Rx descriptor regions use 4K pages.
+ */
+#define RX_DESC_ADDR_OFFSET		0x1000
+#define TX_DESC_ADDR_OFFSET		0x6000
+#define PAGE_SIZE_4K			4096
+
+static int
+ixgbe_ndo_qpair_map_region(struct net_device *dev,
+			   struct tpacket_dev_qpair_map_region_info *info)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+
+	/* no need to map system memory to userspace for ixgbe */
+	info->tp_dev_sysm_sz = 0;
+	info->tp_num_sysm_map_regions = 0;
+
+	info->tp_dev_bar_sz = pci_resource_len(adapter->pdev, 0);
+	info->tp_num_map_regions = 2;
+
+	info->tp_regions[0].page_offset = RX_DESC_ADDR_OFFSET;
+	info->tp_regions[0].page_sz = PAGE_SIZE;
+	info->tp_regions[0].page_cnt = 1;
+	info->tp_regions[1].page_offset = TX_DESC_ADDR_OFFSET;
+	info->tp_regions[1].page_sz = PAGE_SIZE;
+	info->tp_regions[1].page_cnt = 1;
+
+	return 0;
+}
+
+static int ixgbe_ndo_get_device_desc_info(struct net_device *dev,
+					  struct tpacket_dev_info *dev_info)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	int max_queues;
+	int i;
+	__u8 flds_rx = sizeof(ixgbe_legacy_rx_desc) /
+		       sizeof(struct tpacket_nic_desc_fld);
+	__u8 flds_tx = sizeof(ixgbe_legacy_tx_desc) /
+		       sizeof(struct tpacket_nic_desc_fld);
+
+	max_queues = max(adapter->num_rx_queues, adapter->num_tx_queues);
+
+	dev_info->tp_device_id = adapter->hw.device_id;
+	dev_info->tp_vendor_id = adapter->hw.vendor_id;
+	dev_info->tp_subsystem_device_id = adapter->hw.subsystem_device_id;
+	dev_info->tp_subsystem_vendor_id = adapter->hw.subsystem_vendor_id;
+	dev_info->tp_revision_id = adapter->hw.revision_id;
+	dev_info->tp_numa_node = dev_to_node(&dev->dev);
+
+	dev_info->tp_num_total_qpairs = min(MAX_RX_QUEUES, MAX_TX_QUEUES);
+	dev_info->tp_num_inuse_qpairs = max_queues;
+
+	dev_info->tp_num_rx_desc_fmt = 1;
+	dev_info->tp_num_tx_desc_fmt = 1;
+
+	dev_info->tp_rx_dexpr[0].version = 1;
+	dev_info->tp_rx_dexpr[0].size = sizeof(union ixgbe_adv_rx_desc);
+	dev_info->tp_rx_dexpr[0].byte_order = BO_NATIVE;
+	dev_info->tp_rx_dexpr[0].num_of_fld = flds_rx;
+	for (i = 0; i < dev_info->tp_rx_dexpr[0].num_of_fld; i++)
+		memcpy(&dev_info->tp_rx_dexpr[0].fields[i],
+		       &ixgbe_legacy_rx_desc[i],
+		       sizeof(struct tpacket_nic_desc_fld));
+
+	dev_info->tp_tx_dexpr[0].version = 1;
+	dev_info->tp_tx_dexpr[0].size = sizeof(union ixgbe_adv_tx_desc);
+	dev_info->tp_tx_dexpr[0].byte_order = BO_NATIVE;
+	dev_info->tp_tx_dexpr[0].num_of_fld = flds_tx;
+	for (i = 0; i < dev_info->tp_tx_dexpr[0].num_of_fld; i++)
+		memcpy(&dev_info->tp_tx_dexpr[0].fields[i],
+		       &ixgbe_legacy_tx_desc[i],
+		       sizeof(struct tpacket_nic_desc_fld));
+
+	return 0;
+}
+
+static int
+ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
+	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
+	unsigned long dummy_page_phy;
+	pgprot_t pre_vm_page_prot;
+	unsigned long start;
+	unsigned int i;
+	int err;
+
+	if (!dummy_page_buf) {
+		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
+		if (!dummy_page_buf)
+			return -ENOMEM;
+
+		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
+			dummy_page_buf[i] = 0xdeadbeef;
+	}
+
+	dummy_page_phy = virt_to_phys(dummy_page_buf);
+	pre_vm_page_prot = vma->vm_page_prot;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+
+	/* assume the vm_start is 4K aligned address */
+	for (start = vma->vm_start;
+	     start < vma->vm_end;
+	     start += PAGE_SIZE_4K) {
+		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
+			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
+					      vma->vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
+			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
+					      vma->vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		} else {
+			unsigned long addr = dummy_page_phy > PAGE_SHIFT;
+
+			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
+					      pre_vm_page_prot);
+			if (err)
+				return -EAGAIN;
+		}
+	}
+	return 0;
+}
+
+static int
+ixgbe_ndo_val_dma_mem_region_map(struct net_device *dev,
+				 struct tpacket_dma_mem_region *region,
+				 struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	unsigned int qpair_index, i;
+	struct ixgbe_user_queue_info *info;
+
+#ifdef CONFIG_DMA_MEMORY_PROTECTION
+	/* IOVA not equal to physical address means IOMMU takes effect */
+	if (region->phys_addr == region->iova)
+		return -EFAULT;
+#endif
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		info = &adapter->user_queue_info[qpair_index];
+		i = info->num_of_regions;
+
+		if (info->sk_handle != sk)
+			continue;
+
+		if (info->num_of_regions >= MAX_USER_DMA_REGIONS_PER_SOCKET)
+			return -EFAULT;
+
+		info->regions[i].dma_region_size = region->size;
+		info->regions[i].direction = region->direction;
+		info->regions[i].dma_region_iova = region->iova;
+		info->num_of_regions++;
+	}
+
+	return 0;
+}
+
+static int
+ixgbe_get_dma_region_info(struct net_device *dev,
+			  struct tpacket_dma_mem_region *region,
+			  struct sock *sk)
+{
+	struct ixgbe_adapter *adapter = netdev_priv(dev);
+	struct ixgbe_user_queue_info *info;
+	unsigned int qpair_index;
+
+	for (qpair_index = adapter->num_tx_queues;
+	     qpair_index < MAX_RX_QUEUES;
+	     qpair_index++) {
+		int i;
+
+		info = &adapter->user_queue_info[qpair_index];
+		if (info->sk_handle != sk)
+			continue;
+
+		for (i = 0; i < info->num_of_regions; i++) {
+			struct ixgbe_user_dma_region *r;
+
+			r = &info->regions[i];
+			if ((r->dma_region_size == region->size) &&
+			    (r->direction == region->direction)) {
+				region->iova = r->dma_region_iova;
+				return 0;
+			}
+		}
+	}
+
+	return -1;
+}
+
 static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_open		= ixgbe_open,
 	.ndo_stop		= ixgbe_close,
@@ -7982,6 +8373,15 @@ static const struct net_device_ops ixgbe_netdev_ops = {
 	.ndo_bridge_getlink	= ixgbe_ndo_bridge_getlink,
 	.ndo_dfwd_add_station	= ixgbe_fwd_add,
 	.ndo_dfwd_del_station	= ixgbe_fwd_del,
+
+	.ndo_split_queue_pairs	= ixgbe_ndo_split_queue_pairs,
+	.ndo_get_split_queue_pairs = ixgbe_ndo_get_split_queue_pairs,
+	.ndo_return_queue_pairs	   = ixgbe_ndo_return_queue_pairs,
+	.ndo_get_device_desc_info  = ixgbe_ndo_get_device_desc_info,
+	.ndo_direct_qpair_page_map = ixgbe_ndo_qpair_page_map,
+	.ndo_get_dma_region_info   = ixgbe_get_dma_region_info,
+	.ndo_get_device_qpair_map_region_info = ixgbe_ndo_qpair_map_region,
+	.ndo_validate_dma_mem_region_map = ixgbe_ndo_val_dma_mem_region_map,
 };
 
 /**
@@ -8203,7 +8603,9 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 	hw->back = adapter;
 	adapter->msg_enable = netif_msg_init(debug, DEFAULT_MSG_ENABLE);
 
-	hw->hw_addr = ioremap(pci_resource_start(pdev, 0),
+	hw->pci_hw_addr = pci_resource_start(pdev, 0);
+
+	hw->hw_addr = ioremap(hw->pci_hw_addr,
 			      pci_resource_len(pdev, 0));
 	adapter->io_addr = hw->hw_addr;
 	if (!hw->hw_addr) {
@@ -8875,6 +9277,7 @@ module_init(ixgbe_init_module);
  **/
 static void __exit ixgbe_exit_module(void)
 {
+	kfree(dummy_page_buf);
 #ifdef CONFIG_IXGBE_DCA
 	dca_unregister_notify(&dca_notifier);
 #endif
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
index d101b25..4034d31 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
@@ -3180,6 +3180,7 @@ struct ixgbe_mbx_info {
 
 struct ixgbe_hw {
 	u8 __iomem			*hw_addr;
+	phys_addr_t			pci_hw_addr;
 	void				*back;
 	struct ixgbe_mac_info		mac;
 	struct ixgbe_addr_filter_info	addr_ctrl;


* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13  4:35 [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
  2015-01-13  4:35 ` [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings John Fastabend
@ 2015-01-13  4:42 ` John Fastabend
  2015-01-13 12:35 ` Hannes Frederic Sowa
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2015-01-13  4:42 UTC (permalink / raw)
  To: netdev
  Cc: danny.zhou, nhorman, dborkman, john.ronciak, hannes, brouer,
	Or Gerlitz

On 01/12/2015 08:35 PM, John Fastabend wrote:
> This patch adds net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.
>

+cc: Or Gerlitz

[...]

> +
> +struct tpacket_dev_info {
> +	__u16	tp_device_id;
> +	__u16	tp_vendor_id;
> +	__u16	tp_subsystem_device_id;
> +	__u16	tp_subsystem_vendor_id;
> +	__u32	tp_numa_node;
> +	__u32	tp_revision_id;
> +	__u32	tp_num_total_qpairs;
> +	__u32	tp_num_inuse_qpairs;
> +	__u32	tp_num_rx_desc_fmt;
> +	__u32	tp_num_tx_desc_fmt;
> +	struct tpacket_nic_desc_expr tp_rx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> +	struct tpacket_nic_desc_expr tp_tx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];

At least one reason this is still an RFC is that this needs to be
cleaned up.

net/packet/af_packet.c: In function ‘packet_getsockopt’:
net/packet/af_packet.c:3918:1: warning: the frame size of 9264 bytes is 
larger than 2048 bytes [-Wframe-larger-than=]

but I wanted to see if there was any feedback.
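
One way to clean it up would be to take the large structure off the
stack, e.g. roughly,

	struct tpacket_dev_info *info = kzalloc(sizeof(*info), GFP_KERNEL);

	if (!info)
		return -ENOMEM;
	err = dev->netdev_ops->ndo_get_device_desc_info(dev, info);
	/* ... copy to user before the kfree ... */
	kfree(info);

or to shrink the uapi structure so the descriptor expressions are
queried one format at a time.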

Thanks,
John

-- 
John Fastabend         Intel Corporation


* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
  2015-01-13  4:35 ` [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings John Fastabend
@ 2015-01-13 12:05   ` Hannes Frederic Sowa
  2015-01-13 14:26   ` Daniel Borkmann
  2015-01-13 18:58   ` Willem de Bruijn
  2 siblings, 0 replies; 24+ messages in thread
From: Hannes Frederic Sowa @ 2015-01-13 12:05 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, danny.zhou, nhorman, dborkman, john.ronciak, brouer

On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
> +static int
> +ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> +	phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
> +	unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +	unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +	unsigned long dummy_page_phy;
> +	pgprot_t pre_vm_page_prot;
> +	unsigned long start;
> +	unsigned int i;
> +	int err;
> +
> +	if (!dummy_page_buf) {
> +		dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
> +		if (!dummy_page_buf)
> +			return -ENOMEM;
> +
> +		for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
> +			dummy_page_buf[i] = 0xdeadbeef;
> +	}
> +
> +	dummy_page_phy = virt_to_phys(dummy_page_buf);
> +	pre_vm_page_prot = vma->vm_page_prot;
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> +	/* assume the vm_start is 4K aligned address */
> +	for (start = vma->vm_start;
> +	     start < vma->vm_end;
> +	     start += PAGE_SIZE_4K) {
> +		if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
> +			err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
> +					      vma->vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		} else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
> +			err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
> +					      vma->vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		} else {
> +			unsigned long addr = dummy_page_phy > PAGE_SHIFT;

I guess you have forgotten to delete this line?
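(Or was the '>' meant to be a shift, i.e. dummy_page_phy >> PAGE_SHIFT,
to match the pfn_rx/pfn_tx calculation above?)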

> +
> +			err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
> +					      pre_vm_page_prot);
> +			if (err)
> +				return -EAGAIN;
> +		}
> +	}
> +	return 0;
> +}


* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13  4:35 [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
  2015-01-13  4:35 ` [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings John Fastabend
  2015-01-13  4:42 ` [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
@ 2015-01-13 12:35 ` Hannes Frederic Sowa
  2015-01-13 13:21   ` Daniel Borkmann
  2015-01-13 15:12 ` Daniel Borkmann
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Hannes Frederic Sowa @ 2015-01-13 12:35 UTC (permalink / raw)
  To: John Fastabend
  Cc: netdev, danny.zhou, nhorman, dborkman, john.ronciak, brouer

On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
> This patch adds net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.

[...]

> +#define PACKET_MAX_NUM_MAP_MEMORY_REGIONS 64
> +#define PACKET_MAX_NUM_DESC_FORMATS	  8
> +#define PACKET_MAX_NUM_DESC_FIELDS	  64
> +#define PACKET_NIC_DESC_FIELD(fseq, foffset, fwidth, falign, fbo) \
> +		.seqn = (__u8)fseq,				\
> +		.offset = (__u8)foffset,			\
> +		.width = (__u8)fwidth,				\
> +		.align = (__u8)falign,				\
> +		.byte_order = (__u8)fbo

Are the __u8 casts necessary? They seem like they would just hide compiler warnings.
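
E.g. (made-up usage, just to illustrate the concern -- not from the patch):

	static const struct tpacket_nic_desc_fld fld = {
		PACKET_NIC_DESC_FIELD(0, 300, 16, 16, BO_LITTLE_ENDIAN)
	};

With the (__u8) casts .offset silently becomes 44; without them gcc at
least flags the truncated constant with -Woverflow.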

> +
> +#define MAX_MAP_MEMORY_REGIONS	64
> +
> +/* setsockopt takes addr, size, direction parameters; getsockopt takes
> + * iova, size, direction.
> + * */
> +struct tpacket_dma_mem_region {
> +	void *addr;		/* userspace virtual address */
> +	__u64 phys_addr;	/* physical address */
> +	__u64 iova;		/* IO virtual address used for DMA */
> +	unsigned long size;	/* size of region */
> +	int direction;		/* dma data direction */
> +};

Have you tested this with 32 bit user space and 32 bit kernel, too?
I don't have any problem with only supporting 64 bit kernels for this
feature, but looking through the code I wonder if we handle the __u64
addresses correctly in all situations.

The other question I have: would it make sense to move the

+#ifdef CONFIG_DMA_MEMORY_PROTECTION
+	/* IOVA not equal to physical address means IOMMU takes effect */
+	if (region->phys_addr == region->iova)
+		return -EFAULT;
+#endif

check from the ixgbe driver into the kernel core, so we never expose
memory mapped io which is not protected by its own memory domain?
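
A rough sketch of what that could look like in the af_packet core (untested,
keeping the same CONFIG_DMA_MEMORY_PROTECTION guard and the region struct
from this patch, run after dma_map_sg() has filled in region.iova):

#ifdef CONFIG_DMA_MEMORY_PROTECTION
	/* iova equal to the physical address means no IOMMU translation
	 * is active, so refuse to set up user-triggered DMA here.
	 */
	if (region.phys_addr == region.iova)
		return -EFAULT;
#endif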

Thanks,
Hannes

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13 12:35 ` Hannes Frederic Sowa
@ 2015-01-13 13:21   ` Daniel Borkmann
  2015-01-13 15:24     ` John Fastabend
  0 siblings, 1 reply; 24+ messages in thread
From: Daniel Borkmann @ 2015-01-13 13:21 UTC (permalink / raw)
  To: Hannes Frederic Sowa
  Cc: John Fastabend, netdev, danny.zhou, nhorman, john.ronciak, brouer

On 01/13/2015 01:35 PM, Hannes Frederic Sowa wrote:
> On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
...
>> +/* setsockopt takes addr, size, direction parameters; getsockopt takes
>> + * iova, size, direction.
>> + * */
>> +struct tpacket_dma_mem_region {
>> +	void *addr;		/* userspace virtual address */
>> +	__u64 phys_addr;	/* physical address */
>> +	__u64 iova;		/* IO virtual address used for DMA */
>> +	unsigned long size;	/* size of region */
>> +	int direction;		/* dma data direction */
>> +};
>
>> Have you tested this with 32 bit user space and 32 bit kernel, too?
> I don't have any problem with only supporting 64 bit kernels for this
> feature, but looking through the code I wonder if we handle the __u64
> addresses correctly in all situations.

Given this is placed into uapi and transferred via setsockopt(2), this
would also need some form of compat handling, also for the case of mixed
environments (e.g. 64 bit kernel, 32 bit user space).
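
For illustration (my numbers, assuming standard x86-64 vs. i386 ABI rules,
not from the patch): the struct layout differs, so the
optlen != sizeof(region) check in setsockopt() already breaks for a 32 bit
binary on a 64 bit kernel:

	/*   member                x86-64   i386
	 *   void *addr               8       4
	 *   __u64 phys_addr          8       8
	 *   __u64 iova               8       8
	 *   unsigned long size       8       4
	 *   int direction            4       4
	 *   (tail padding)           4       0
	 *   sizeof()                40      28
	 */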

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
  2015-01-13  4:35 ` [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings John Fastabend
  2015-01-13 12:05   ` Hannes Frederic Sowa
@ 2015-01-13 14:26   ` Daniel Borkmann
  2015-01-13 15:46     ` John Fastabend
  2015-01-13 18:58   ` Willem de Bruijn
  2 siblings, 1 reply; 24+ messages in thread
From: Daniel Borkmann @ 2015-01-13 14:26 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer

On 01/13/2015 05:35 AM, John Fastabend wrote:
...
> +static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
> +				       unsigned int start_from,
> +				       unsigned int qpairs_num,
> +				       struct sock *sk)
> +{
> +	struct ixgbe_adapter *adapter = netdev_priv(dev);
> +	unsigned int qpair_index;

We should probably return -EINVAL, still from within the setsockopt
call when qpairs_num is 0?
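
I.e. something like this, right at the top (sketch only):

	if (!qpairs_num)
		return -EINVAL;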

> +	/* allocate whatever available qpairs */
> +	if (start_from == -1) {

I guess we should define the notion of auto-select into a uapi
define instead of -1, which might not be overly obvious.
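
Something like (sketch; the name is made up, not part of this series):

	/* include/uapi/linux/if_packet.h */
	#define PACKET_QPAIRS_AUTO_SELECT	((unsigned int)-1)

	/* driver side then becomes self-documenting */
	if (start_from == PACKET_QPAIRS_AUTO_SELECT) {
		/* pick whatever contiguous range is free */
	}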

Anyway, extending Documentation/networking/packet_mmap.txt with
API details/examples at least for a non-RFC version is encouraged. ;)

> +		unsigned int count = 0;
> +
> +		for (qpair_index = adapter->num_rx_queues;
> +		     qpair_index < MAX_RX_QUEUES;
> +		     qpair_index++) {
> +			if (!adapter->user_queue_info[qpair_index].sk_handle) {
> +				count++;
> +				if (count == qpairs_num) {
> +					start_from = qpair_index - count + 1;
> +					break;
> +				}
> +			} else {
> +				count = 0;
> +			}
> +		}
> +	}
> +
> +	/* otherwise the caller specified exact queues */
> +	if ((start_from > MAX_TX_QUEUES) ||
> +	    (start_from > MAX_RX_QUEUES) ||
> +	    (start_from + qpairs_num > MAX_TX_QUEUES) ||
> +	    (start_from + qpairs_num > MAX_RX_QUEUES))
> +		return -EINVAL;

Shouldn't this be '>=' if I see this correctly?

> +	/* If the qpairs are being used by the driver do not let user space
> +	 * consume the queues. Also if the queue has already been allocated
> +	 * to a socket do fail the request.
> +	 */
> +	for (qpair_index = start_from;
> +	     qpair_index < start_from + qpairs_num;
> +	     qpair_index++) {
> +		if ((qpair_index < adapter->num_tx_queues) ||
> +		    (qpair_index < adapter->num_rx_queues))
> +			return -EINVAL;
> +
> +		if (adapter->user_queue_info[qpair_index].sk_handle)
> +			return -EBUSY;
> +	}
> +
> +	/* remember the sk handle for each queue pair */
> +	for (qpair_index = start_from;
> +	     qpair_index < start_from + qpairs_num;
> +	     qpair_index++) {
> +		adapter->user_queue_info[qpair_index].sk_handle = sk;
> +		adapter->user_queue_info[qpair_index].num_of_regions = 0;
> +	}
> +
> +	return 0;
> +}

I guess many drivers would need to implement similar code, do you see
a chance to move generic parts to the core, at least for some helper
functions?
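
E.g. the contiguous free-range scan looks pretty device-independent, so a
core helper could look roughly like this (sketch only, name and callback
made up):

	int netdev_find_free_qpair_range(struct net_device *dev,
					 unsigned int first, unsigned int max,
					 unsigned int num,
					 bool (*busy)(struct net_device *dev,
						      unsigned int idx))
	{
		unsigned int i, count = 0;

		for (i = first; i < max; i++) {
			if (busy(dev, i)) {
				count = 0;
				continue;
			}
			if (++count == num)
				return i - count + 1;
		}
		return -ENOSPC;
	}

Drivers would then only provide the busy() test against their own queue
bookkeeping.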

Thanks,
Daniel

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13  4:35 [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
                   ` (2 preceding siblings ...)
  2015-01-13 12:35 ` Hannes Frederic Sowa
@ 2015-01-13 15:12 ` Daniel Borkmann
  2015-01-13 15:58   ` John Fastabend
  2015-01-13 16:19 ` Neil Horman
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 24+ messages in thread
From: Daniel Borkmann @ 2015-01-13 15:12 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer

On 01/13/2015 05:35 AM, John Fastabend wrote:
...
>   struct net_device_ops {
>   	int			(*ndo_init)(struct net_device *dev);
> @@ -1190,6 +1240,35 @@ struct net_device_ops {
>   	int			(*ndo_switch_port_stp_update)(struct net_device *dev,
>   							      u8 state);
>   #endif
> +	int			(*ndo_split_queue_pairs)(struct net_device *dev,
> +					 unsigned int qpairs_start_from,
> +					 unsigned int qpairs_num,
> +					 struct sock *sk);
...
> +	int			(*ndo_get_dma_region_info)
> +					(struct net_device *dev,
> +					 struct tpacket_dma_mem_region *region,
> +					 struct sock *sk);
>   };

Any slight chance these 8 ndo ops could be further reduced? ;)

>   /**
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index da2d668..eb7a727 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
...
> +struct tpacket_dev_qpair_map_region_info {
> +	unsigned int tp_dev_bar_sz;		/* size of BAR */
> +	unsigned int tp_dev_sysm_sz;		/* size of system memory */
> +	/* number of contiguous memory on BAR mapping to user space */
> +	unsigned int tp_num_map_regions;
> +	/* number of contiguous memory on system mapping to user space */
> +	unsigned int tp_num_sysm_map_regions;
> +	struct map_page_region {
> +		unsigned page_offset;	/* offset to start of region */
> +		unsigned page_sz;	/* size of page */
> +		unsigned page_cnt;	/* number of pages */

Please use unsigned int et al, or preferably __u* variants consistently
in the uapi structs.

> +	} tp_regions[MAX_MAP_MEMORY_REGIONS];
> +};
...
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 6880f34..8cd17da 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
...
> @@ -2633,6 +2636,16 @@ static int packet_release(struct socket *sock)
>   	sock_prot_inuse_add(net, sk->sk_prot, -1);
>   	preempt_enable();
>
> +	if (po->tp_owns_queue_pairs) {
> +		struct net_device *dev;
> +
> +		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +		if (dev) {
> +			dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> +			umem_release(dev, po);
> +		}
> +	}
> +
...
> +static int
>   packet_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
>   {
>   	struct sock *sk = sock->sk;
> @@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
>   		po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
>   		return 0;
>   	}
> +	case PACKET_RXTX_QPAIRS_SPLIT:
> +	{
...
> +		/* This call only works after a bind call which calls a dev_hold
> +		 * operation so we do not need to increment dev ref counter
> +		 */
> +		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +		if (!dev)
> +			return -EINVAL;
> +		ops = dev->netdev_ops;
> +		if (!ops->ndo_split_queue_pairs)
> +			return -EOPNOTSUPP;
> +
> +		err =  ops->ndo_split_queue_pairs(dev,
> +						  qpairs.tp_qpairs_start_from,
> +						  qpairs.tp_qpairs_num, sk);
> +		if (!err)
> +			po->tp_owns_queue_pairs = true;

When this is being set here, above test in packet_release() and the chunk
quoted below in packet_mmap() are not guaranteed to work since we don't
test if some ndos are actually implemented by the driver. Seems a bit
fragile, I'm wondering if we should test this capability as a _whole_,
iow if all necessary functions to make this work are being provided by the
driver, e.g. flag the netdev as such and test for that instead.
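
E.g. checked once at PACKET_RXTX_QPAIRS_SPLIT time, something like (sketch
only, helper name made up):

	static bool dev_has_af_packet_direct(const struct net_device *dev)
	{
		const struct net_device_ops *ops = dev->netdev_ops;

		return ops->ndo_split_queue_pairs &&
		       ops->ndo_get_split_queue_pairs &&
		       ops->ndo_return_queue_pairs &&
		       ops->ndo_get_device_qpair_map_region_info &&
		       ops->ndo_get_device_desc_info &&
		       ops->ndo_direct_qpair_page_map &&
		       ops->ndo_validate_dma_mem_region_map &&
		       ops->ndo_get_dma_region_info;
	}

Then packet_release()/packet_mmap() can rely on the whole set being there
once tp_owns_queue_pairs is set.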

> +		return err;
> +	}
> +	case PACKET_RXTX_QPAIRS_RETURN:
> +	{
...
> +		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +		if (!dev)
> +			return -EINVAL;
> +		ops = dev->netdev_ops;
> +		if (!ops->ndo_split_queue_pairs)
> +			return -EOPNOTSUPP;

Should test for ndo_return_queue_pairs.

> +		err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> +		if (!err)
> +			po->tp_owns_queue_pairs = false;
> +
...
> +	case PACKET_RXTX_QPAIRS_SPLIT:
> +	{
...
> +		/* This call only work after a bind call which calls a dev_hold
> +		 * operation so we do not need to increment dev ref counter
> +		 */
> +		dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +		if (!dev)
> +			return -EINVAL;
> +		if (!dev->netdev_ops->ndo_split_queue_pairs)
> +			return -EOPNOTSUPP;

Copy-paste (although not quite, since here's no extra ops var). :)
Should be ndo_get_split_queue_pairs.

> +		err =  dev->netdev_ops->ndo_get_split_queue_pairs(dev,
> +					&qpairs_info.tp_qpairs_start_from,
> +					&qpairs_info.tp_qpairs_num, sk);
> +
...
> @@ -3927,8 +4309,20 @@ static int packet_mmap(struct file *file, struct socket *sock,
>   	if (vma->vm_pgoff)
>   		return -EINVAL;
>
> +	dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +	if (!dev)
> +		return -EINVAL;
> +
>   	mutex_lock(&po->pg_vec_lock);
>
> +	if (po->tp_owns_queue_pairs) {
> +		ops = dev->netdev_ops;
> +		err = ops->ndo_direct_qpair_page_map(vma, dev);
> +		if (err)
> +			goto out;
> +		goto done;
> +	}
> +

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13 13:21   ` Daniel Borkmann
@ 2015-01-13 15:24     ` John Fastabend
  2015-01-13 17:15       ` David Laight
  0 siblings, 1 reply; 24+ messages in thread
From: John Fastabend @ 2015-01-13 15:24 UTC (permalink / raw)
  To: Daniel Borkmann
  Cc: Hannes Frederic Sowa, netdev, danny.zhou, nhorman, john.ronciak,
	brouer

On 01/13/2015 05:21 AM, Daniel Borkmann wrote:
> On 01/13/2015 01:35 PM, Hannes Frederic Sowa wrote:
>> On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
> ...
>>> +/* setsockopt takes addr, size, direction parameters; getsockopt takes
>>> + * iova, size, direction.
>>> + * */
>>> +struct tpacket_dma_mem_region {
>>> +    void *addr;        /* userspace virtual address */
>>> +    __u64 phys_addr;    /* physical address */
>>> +    __u64 iova;        /* IO virtual address used for DMA */
>>> +    unsigned long size;    /* size of region */
>>> +    int direction;        /* dma data direction */
>>> +};
>>
>> Have you tested this with 32 bit user space and 32 bit kernel, too?
>> I don't have any problem with only supporting 64 bit kernels for this
>> feature, but looking through the code I wonder if we handle the __u64
>> addresses correctly in all situations.

We still need to test/implement this; I'm going to guess there is some
more work needed for it to work correctly.

>
> Given this is placed into uapi and transferred via setsockopt(2), this
> would also need some form of compat handling, also for the case of mixed
> environments (e.g. 64 bit kernel, 32 bit user space).

noted, thanks!

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
  2015-01-13 14:26   ` Daniel Borkmann
@ 2015-01-13 15:46     ` John Fastabend
  2015-01-13 18:18       ` Daniel Borkmann
  0 siblings, 1 reply; 24+ messages in thread
From: John Fastabend @ 2015-01-13 15:46 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer

On 01/13/2015 06:26 AM, Daniel Borkmann wrote:
> On 01/13/2015 05:35 AM, John Fastabend wrote:
> ...
>> +static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
>> +                       unsigned int start_from,
>> +                       unsigned int qpairs_num,
>> +                       struct sock *sk)
>> +{
>> +    struct ixgbe_adapter *adapter = netdev_priv(dev);
>> +    unsigned int qpair_index;
>
> We should probably return -EINVAL, still from within the setsockopt
> call when qpairs_num is 0?

yep,

>
>> +    /* allocate whatever available qpairs */
>> +    if (start_from == -1) {
>
> I guess we should define the notion of auto-select into a uapi
> define instead of -1, which might not be overly obvious.

Certainly not obvious; it should be defined in the UAPI.

>
> Anyway, extending Documentation/networking/packet_mmap.txt with
> API details/examples at least for a non-RFC version is encouraged. ;)

Yep for the non-RFC version I'll add an example to packet_mmap.txt

>
>> +        unsigned int count = 0;
>> +
>> +        for (qpair_index = adapter->num_rx_queues;
>> +             qpair_index < MAX_RX_QUEUES;
>> +             qpair_index++) {
>> +            if (!adapter->user_queue_info[qpair_index].sk_handle) {
>> +                count++;
>> +                if (count == qpairs_num) {
>> +                    start_from = qpair_index - count + 1;
>> +                    break;
>> +                }
>> +            } else {
>> +                count = 0;
>> +            }
>> +        }
>> +    }
>> +
>> +    /* otherwise the caller specified exact queues */
>> +    if ((start_from > MAX_TX_QUEUES) ||
>> +        (start_from > MAX_RX_QUEUES) ||
>> +        (start_from + qpairs_num > MAX_TX_QUEUES) ||
>> +        (start_from + qpairs_num > MAX_RX_QUEUES))
>> +        return -EINVAL;
>
> Shouldn't this be '>=' if I see this correctly?

Hmm, I think this is correct: the device allocates MAX_TX_QUEUES queues, so
the valid queue index range is (0, MAX_TX_QUEUES - 1). An index of
MAX_TX_QUEUES or MAX_RX_QUEUES would therefore be an invalid access below,

>
>> +    /* If the qpairs are being used by the driver do not let user space
>> +     * consume the queues. Also if the queue has already been allocated
>> +     * to a socket do fail the request.
>> +     */
>> +    for (qpair_index = start_from;
>> +         qpair_index < start_from + qpairs_num;
>> +         qpair_index++) {
>> +        if ((qpair_index < adapter->num_tx_queues) ||
>> +            (qpair_index < adapter->num_rx_queues))
>> +            return -EINVAL;
>> +
>> +        if (adapter->user_queue_info[qpair_index].sk_handle)
>> +            return -EBUSY;
>> +    }
>> +
>> +    /* remember the sk handle for each queue pair */
>> +    for (qpair_index = start_from;
>> +         qpair_index < start_from + qpairs_num;
>> +         qpair_index++) {
>> +        adapter->user_queue_info[qpair_index].sk_handle = sk;
>> +        adapter->user_queue_info[qpair_index].num_of_regions = 0;
                                     ^^^^^^^^^^^^^^
                                     (0, MAX_TX_QUEUES - 1)

Hunk #2 of a/drivers/net/ethernet/intel/ixgbe/ixgbe.h sizes the array
accordingly:

@@ -673,6 +687,9 @@ struct ixgbe_adapter {

         struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];

+       /* Direct User Space Queues */
+       struct ixgbe_user_queue_info user_queue_info[MAX_RX_QUEUES];
+

>> +    }
>> +
>> +    return 0;
>> +}
>
> I guess many drivers would need to implement similar code, do you see
> a chance to move generic parts to the core, at least for some helper
> functions?

I'm not entirely sure about this. It depends on how the driver manages
its queue space. Many of the 10Gbps devices, it seems, could use similar
logic, so it might make sense. Other drivers might conjure the queues
out of some other bank of queues. Looking at the devices I have here,
i40e at least would manage the queues slightly differently than this,
I think.

>
> Thanks,
> Daniel


-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13 15:12 ` Daniel Borkmann
@ 2015-01-13 15:58   ` John Fastabend
  2015-01-13 16:05     ` Daniel Borkmann
  0 siblings, 1 reply; 24+ messages in thread
From: John Fastabend @ 2015-01-13 15:58 UTC (permalink / raw)
  To: Daniel Borkmann; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer

On 01/13/2015 07:12 AM, Daniel Borkmann wrote:
> On 01/13/2015 05:35 AM, John Fastabend wrote:
> ...
>>   struct net_device_ops {
>>       int            (*ndo_init)(struct net_device *dev);
>> @@ -1190,6 +1240,35 @@ struct net_device_ops {
>>       int            (*ndo_switch_port_stp_update)(struct net_device
>> *dev,
>>                                     u8 state);
>>   #endif
>> +    int            (*ndo_split_queue_pairs)(struct net_device *dev,
>> +                     unsigned int qpairs_start_from,
>> +                     unsigned int qpairs_num,
>> +                     struct sock *sk);
> ...
>> +    int            (*ndo_get_dma_region_info)
>> +                    (struct net_device *dev,
>> +                     struct tpacket_dma_mem_region *region,
>> +                     struct sock *sk);
>>   };
>
> Any slight chance these 8 ndo ops could be further reduced? ;)
>

It's possible we could collapse a few of these calls. I'll see if
we can get it a bit smaller. Another option would be to put a
pointer to the set of ops in the net_device struct. Something
like:

	struct net_device {
		...
		const struct af_packet_hw *afp_ops;
		...
	}

	struct af_packet_hw {
		int (*ndo_split_queue_pairs)(struct net_device *dev,
					     unsigned int qpairs_start_from,
					     unsigned int qpairs_num,
					     struct sock *sk);
		...
	}
		

>>   /**
>> diff --git a/include/uapi/linux/if_packet.h
>> b/include/uapi/linux/if_packet.h
>> index da2d668..eb7a727 100644
>> --- a/include/uapi/linux/if_packet.h
>> +++ b/include/uapi/linux/if_packet.h
> ...
>> +struct tpacket_dev_qpair_map_region_info {
>> +    unsigned int tp_dev_bar_sz;        /* size of BAR */
>> +    unsigned int tp_dev_sysm_sz;        /* size of system memory */
>> +    /* number of contiguous memory on BAR mapping to user space */
>> +    unsigned int tp_num_map_regions;
>> +    /* number of contiguous memory on system mapping to user space */
>> +    unsigned int tp_num_sysm_map_regions;
>> +    struct map_page_region {
>> +        unsigned page_offset;    /* offset to start of region */
>> +        unsigned page_sz;    /* size of page */
>> +        unsigned page_cnt;    /* number of pages */
>
> Please use unsigned int et al, or preferably __u* variants consistently
> in the uapi structs.

I'll turn this all into __u* variants.

[...]

> ...
>> +static int
>>   packet_setsockopt(struct socket *sock, int level, int optname, char
>> __user *optval, unsigned int optlen)
>>   {
>>       struct sock *sk = sock->sk;
>> @@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int
>> level, int optname, char __user *optv
>>           po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
>>           return 0;
>>       }
>> +    case PACKET_RXTX_QPAIRS_SPLIT:
>> +    {
> ...
>> +        /* This call only works after a bind call which calls a dev_hold
>> +         * operation so we do not need to increment dev ref counter
>> +         */
>> +        dev = __dev_get_by_index(sock_net(sk), po->ifindex);
>> +        if (!dev)
>> +            return -EINVAL;
>> +        ops = dev->netdev_ops;
>> +        if (!ops->ndo_split_queue_pairs)
>> +            return -EOPNOTSUPP;
>> +
>> +        err =  ops->ndo_split_queue_pairs(dev,
>> +                          qpairs.tp_qpairs_start_from,
>> +                          qpairs.tp_qpairs_num, sk);
>> +        if (!err)
>> +            po->tp_owns_queue_pairs = true;
>
> When this is being set here, above test in packet_release() and the chunk
> quoted below in packet_mmap() are not guaranteed to work since we don't
> test if some ndos are actually implemented by the driver. Seems a bit
> fragile, I'm wondering if we should test this capability as a _whole_,
> iow if all necessary functions to make this work are being provided by the
> driver, e.g. flag the netdev as such and test for that instead.

Sounds good to me, better than scattering ndo checks throughout. Also
with a feature flag administrators could disable it easily.

>
>> +        return err;
>> +    }
>> +    case PACKET_RXTX_QPAIRS_RETURN:
>> +    {
> ...
>> +        dev = __dev_get_by_index(sock_net(sk), po->ifindex);
>> +        if (!dev)
>> +            return -EINVAL;
>> +        ops = dev->netdev_ops;
>> +        if (!ops->ndo_split_queue_pairs)
>> +            return -EOPNOTSUPP;
>
> Should test for ndo_return_queue_pairs.

yep but I like the feature flag idea above.

>
>> +        err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
>> +        if (!err)
>> +            po->tp_owns_queue_pairs = false;
>> +
> ...
>> +    case PACKET_RXTX_QPAIRS_SPLIT:
>> +    {
> ...
>> +        /* This call only work after a bind call which calls a dev_hold
>> +         * operation so we do not need to increment dev ref counter
>> +         */
>> +        dev = __dev_get_by_index(sock_net(sk), po->ifindex);
>> +        if (!dev)
>> +            return -EINVAL;
>> +        if (!dev->netdev_ops->ndo_split_queue_pairs)
>> +            return -EOPNOTSUPP;
>
> Copy-paste (although not quite, since here's no extra ops var). :)
> Should be ndo_get_split_queue_pairs.

yep.

[...]

Thanks for reviewing!

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13 15:58   ` John Fastabend
@ 2015-01-13 16:05     ` Daniel Borkmann
  0 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2015-01-13 16:05 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer

On 01/13/2015 04:58 PM, John Fastabend wrote:
> On 01/13/2015 07:12 AM, Daniel Borkmann wrote:
...
>> Any slight chance these 8 ndo ops could be further reduced? ;)
>
> Its possible we could collapse a few of these calls. I'll see if
> we can get it a bit smaller. Another option would be to put a
> a pointer to the set of ops in the net_device struct. Something
> like,
>
>      struct net_device {
>          ...
>          const struct af_packet_hw *afp_ops;
>          ...
>      }
>
>      struct af_packet_hw {
>          int (*ndo_split_queue_pairs)(struct net_device *dev,
>                           unsigned int qpairs_start_from,
>                           unsigned int qpairs_num,
>                           struct sock *sk);
>          ...
>      }

I think trying to collapse might be better than two indirections.

...
>> When this is being set here, above test in packet_release() and the chunk
>> quoted below in packet_mmap() are not guaranteed to work since we don't
>> test if some ndos are actually implemented by the driver. Seems a bit
>> fragile, I'm wondering if we should test this capability as a _whole_,
>> iow if all necessary functions to make this work are being provided by the
>> driver, e.g. flag the netdev as such and test for that instead.
>
> Sounds good to me, better than scattering ndo checks throughout. Also
> with a feature flag administrators could disable it easily.

Sounds good to me, thanks John!

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13  4:35 [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
                   ` (3 preceding siblings ...)
  2015-01-13 15:12 ` Daniel Borkmann
@ 2015-01-13 16:19 ` Neil Horman
  2015-01-13 18:52 ` Willem de Bruijn
  2015-01-14 20:35 ` David Miller
  6 siblings, 0 replies; 24+ messages in thread
From: Neil Horman @ 2015-01-13 16:19 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, danny.zhou, dborkman, john.ronciak, hannes, brouer

On Mon, Jan 12, 2015 at 08:35:11PM -0800, John Fastabend wrote:
> This patch adds net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.
> 
> With these operations we bypass the network stack and packet_type
> handlers that would typically send traffic to an af_packet socket.
> This means hardware must do the forwarding. To do this ew can use
> the ETHTOOL_SRXCLSRLINS ops in the ethtool command set. It is
> currently supported by multiple drivers including sfc, mlx4, niu,
> ixgbe, and i40e. Supporting some way to steer traffic to a queue
> is the _only_ hardware requirement to support this interface.
> 
> A follow on patch adds support for ixgbe but we expect at least
> the subset of drivers implementing ETHTOOL_SRXCLSRLINS can be
> implemented later.
> 
> The high level flow, leveraging the af_packet control path, looks
> like:
> 
> 	bind(fd, &sockaddr, sizeof(sockaddr));
> 
> 	/* Get the device type and info */
> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
> 		   &optlen);
> 
> 	/* With device info we can look up descriptor format */
> 
> 	/* Get the layout of ring space offset, page_sz, cnt */
> 	getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> 		   &info, &optlen);
> 
> 	/* request some queues from the driver */
> 	setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> 		   &qpairs_info, sizeof(qpairs_info));
> 
> 	/* if we let the driver pick us queues learn which queues
>          * we were given
>          */
> 	getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> 		   &qpairs_info, sizeof(qpairs_info));
> 
> 	/* And mmap queue pairs to user space */
> 	mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> 	     MAP_SHARED, fd, 0);
> 
> 	/* Now we have some user space queues to read/write to*/
> 
> There is one critical difference when running with these interfaces
> vs running without them. In the normal case the af_packet module
> uses a standard descriptor format exported by the af_packet user
> space headers. In this model because we are working directly with
> driver queues the descriptor format maps to the descriptor format
> used by the device. User space applications can learn device
> information from the socket option PACKET_DEV_DESC_INFO. These
> are described by giving the vendor/deviceid and a descriptor layout
> in offset/length/width/alignment/byte_ordering.
> 
> To protect against arbitrary DMA writes IOMMU devices put memory
> in a single domain to stop arbitrary DMA to memory. Note it would
> be possible to dma into another sockets pages because most NIC
> devices only support a single domain. This would require being
> able to guess another sockets page layout. However the socket
> operation does require CAP_NET_ADMIN privileges.
> 
> Additionally we have a set of DPDK patches to enable DPDK with this
> interface. DPDK can be downloaded @ dpdk.org although as I hope is
> clear from above DPDK is just our paticular test environment we
> expect other libraries could be built on this interface.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>

Just thinking about this a bit: have you considered collapsing this work in
with the macvtap work you and I did when we enabled some NICs to allocate
queue pairs to those tap devices?  I ask because it seems like that
infrastructure already embodies the notion of reserving queues from
underlying hardware, so if you were to only allow queue mapping from
macvlan/tap devices, you could both reduce the API surface that you need to
add in your ndo ops (no more need for an ndo op to reserve/free queues) and
eliminate the need to explicitly reserve queues from user space (i.e.
reserving queues on a macvtap device automatically reserves all its queues).

Neil

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13 15:24     ` John Fastabend
@ 2015-01-13 17:15       ` David Laight
  2015-01-13 17:27         ` David Miller
  0 siblings, 1 reply; 24+ messages in thread
From: David Laight @ 2015-01-13 17:15 UTC (permalink / raw)
  To: 'John Fastabend', Daniel Borkmann
  Cc: Hannes Frederic Sowa, netdev@vger.kernel.org,
	danny.zhou@intel.com, nhorman@tuxdriver.com,
	john.ronciak@intel.com, brouer@redhat.com

From: John Fastabend
> On 01/13/2015 05:21 AM, Daniel Borkmann wrote:
> > On 01/13/2015 01:35 PM, Hannes Frederic Sowa wrote:
> >> On Mo, 2015-01-12 at 20:35 -0800, John Fastabend wrote:
> > ...
> >>> +/* setsockopt takes addr, size, direction parameters; getsockopt takes
> >>> + * iova, size, direction.
> >>> + * */
> >>> +struct tpacket_dma_mem_region {
> >>> +    void *addr;        /* userspace virtual address */
> >>> +    __u64 phys_addr;    /* physical address */
> >>> +    __u64 iova;        /* IO virtual address used for DMA */
> >>> +    unsigned long size;    /* size of region */
> >>> +    int direction;        /* dma data direction */
> >>> +};
> >>
> >> Have you tested this with 32 bit user space and 32 bit kernel, too?
> >> I don't have any problem with only supporting 64 bit kernels for this
> >> feature, but looking through the code I wonder if we handle the __u64
> >> addresses correctly in all situations.
> 
> We still need to test/implement this I'm going to guess there is some
> more work needed for this to work correctly.

How about something like:

struct tpacket_dma_mem_region {
    __u64 addr;        /* userspace virtual address */
    __u64 phys_addr;    /* physical address */
    __u64 iova;        /* IO virtual address used for DMA */
    __u64 size;    /* size of region */
    int direction;        /* dma data direction */
} aligned(8);

So that it is independent of 32/64 bits.
It is a shame that gcc has no way of defining a 64-bit 'void *' on 32-bit systems.
You can use a union, but you still need to zero-extend the value on LE (worse on BE).

	David



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13 17:15       ` David Laight
@ 2015-01-13 17:27         ` David Miller
  2015-01-14 15:28           ` Zhou, Danny
  0 siblings, 1 reply; 24+ messages in thread
From: David Miller @ 2015-01-13 17:27 UTC (permalink / raw)
  To: David.Laight
  Cc: john.fastabend, dborkman, hannes, netdev, danny.zhou, nhorman,
	john.ronciak, brouer

From: David Laight <David.Laight@ACULAB.COM>
Date: Tue, 13 Jan 2015 17:15:30 +0000

> How about something like:
> 
> struct tpacket_dma_mem_region {
>     __u64 addr;        /* userspace virtual address */
>     __u64 phys_addr;    /* physical address */
>     __u64 iova;        /* IO virtual address used for DMA */
>     __u64 size;    /* size of region */
>     int direction;        /* dma data direction */
> } aligned(8);
> 
> So that it is independent of 32/64 bits.
> It is a shame that gcc has no way of defining a 64-bit 'void *' on 32-bit systems.
> You can use a union, but you still need to zero-extend the value on LE (worse on BE).

We have an __aligned_u64, please use that.
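
I.e. something like this (just spelling it out, not in the patch):

	struct tpacket_dma_mem_region {
		__aligned_u64 addr;		/* userspace virtual address */
		__aligned_u64 phys_addr;	/* physical address */
		__aligned_u64 iova;		/* IO virtual address used for DMA */
		__aligned_u64 size;		/* size of region */
		__u32 direction;		/* dma data direction */
		__u32 pad;
	};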

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
  2015-01-13 15:46     ` John Fastabend
@ 2015-01-13 18:18       ` Daniel Borkmann
  0 siblings, 0 replies; 24+ messages in thread
From: Daniel Borkmann @ 2015-01-13 18:18 UTC (permalink / raw)
  To: John Fastabend; +Cc: netdev, danny.zhou, nhorman, john.ronciak, hannes, brouer

On 01/13/2015 04:46 PM, John Fastabend wrote:
> On 01/13/2015 06:26 AM, Daniel Borkmann wrote:
...
>> Anyway, extending Documentation/networking/packet_mmap.txt with
>> API details/examples at least for a non-RFC version is encouraged. ;)
>
> Yep for the non-RFC version I'll add an example to packet_mmap.txt

Cool. Alternatively, perhaps a mini library under samples/packet/
would work, similar to what we have with samples/bpf/, which would allow
for a simple starting point for users to rx/tx frames.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13  4:35 [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
                   ` (4 preceding siblings ...)
  2015-01-13 16:19 ` Neil Horman
@ 2015-01-13 18:52 ` Willem de Bruijn
  2015-01-14 15:26   ` Zhou, Danny
  2015-01-14 20:35 ` David Miller
  6 siblings, 1 reply; 24+ messages in thread
From: Willem de Bruijn @ 2015-01-13 18:52 UTC (permalink / raw)
  To: John Fastabend
  Cc: Network Development, Zhou, Danny, Neil Horman, Daniel Borkmann,
	Ronciak, John, Hannes Frederic Sowa, brouer

On Mon, Jan 12, 2015 at 11:35 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> This patch adds net_device ops to split off a set of driver queues
> from the driver and map the queues into user space via mmap. This
> allows the queues to be directly manipulated from user space. For
> raw packet interface this removes any overhead from the kernel network
> stack.

Can you elaborate how packet payload mapping is handled?
Processes are still responsible for translating from user virtual to
physical (and bus) addresses, correct? The IOMMU is only there
to restrict the physical address ranges that may be written.

>
> With these operations we bypass the network stack and packet_type
> handlers that would typically send traffic to an af_packet socket.
> This means hardware must do the forwarding. To do this ew can use
> the ETHTOOL_SRXCLSRLINS ops in the ethtool command set. It is
> currently supported by multiple drivers including sfc, mlx4, niu,
> ixgbe, and i40e. Supporting some way to steer traffic to a queue
> is the _only_ hardware requirement to support this interface.
>
> A follow on patch adds support for ixgbe but we expect at least
> the subset of drivers implementing ETHTOOL_SRXCLSRLINS can be
> implemented later.
>
> The high level flow, leveraging the af_packet control path, looks
> like:
>
>         bind(fd, &sockaddr, sizeof(sockaddr));
>
>         /* Get the device type and info */
>         getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
>                    &optlen);
>
>         /* With device info we can look up descriptor format */
>
>         /* Get the layout of ring space offset, page_sz, cnt */
>         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
>                    &info, &optlen);
>
>         /* request some queues from the driver */
>         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>                    &qpairs_info, sizeof(qpairs_info));
>
>         /* if we let the driver pick us queues learn which queues
>          * we were given
>          */
>         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
>                    &qpairs_info, sizeof(qpairs_info));
>
>         /* And mmap queue pairs to user space */
>         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
>              MAP_SHARED, fd, 0);
>
>         /* Now we have some user space queues to read/write to*/
>
> There is one critical difference when running with these interfaces
> vs running without them. In the normal case the af_packet module
> uses a standard descriptor format exported by the af_packet user
> space headers. In this model because we are working directly with
> driver queues the descriptor format maps to the descriptor format
> used by the device. User space applications can learn device
> information from the socket option PACKET_DEV_DESC_INFO. These
> are described by giving the vendor/deviceid and a descriptor layout
> in offset/length/width/alignment/byte_ordering.

Raising the issue of exposed vs. virtualized interface just once
more. I wonder if it is possible to keep the virtual to physical
translation in the kernel while avoiding syscall latency, by doing
the translation in a kernel thread on a coupled hyperthread that
waits with mwait on the virtual queue producer index. The page
table operations that Neil proposed in v1 of this patch may work
even better.

> To protect against arbitrary DMA writes IOMMU devices put memory
> in a single domain to stop arbitrary DMA to memory. Note it would
> be possible to dma into another sockets pages because most NIC
> devices only support a single domain. This would require being
> able to guess another sockets page layout. However the socket
> operation does require CAP_NET_ADMIN privileges.
>
> Additionally we have a set of DPDK patches to enable DPDK with this
> interface. DPDK can be downloaded @ dpdk.org although as I hope is
> clear from above DPDK is just our paticular test environment we
> expect other libraries could be built on this interface.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  include/linux/netdevice.h      |   79 ++++++++
>  include/uapi/linux/if_packet.h |   88 +++++++++
>  net/packet/af_packet.c         |  397 ++++++++++++++++++++++++++++++++++++++++
>  net/packet/internal.h          |   10 +
>  4 files changed, 573 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 679e6e9..b71c97d 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -52,6 +52,8 @@
>  #include <linux/neighbour.h>
>  #include <uapi/linux/netdevice.h>
>
> +#include <linux/if_packet.h>
> +
>  struct netpoll_info;
>  struct device;
>  struct phy_device;
> @@ -1030,6 +1032,54 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
>   * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
>   *     Called to notify switch device port of bridge port STP
>   *     state change.
> + *
> + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> + *                              unsigned int qpairs_start_from,
> + *                              unsigned int qpairs_num,
> + *                              struct sock *sk)
> + *     Called to request a set of queues from the driver to be handed to the
> + *     callee for management. After this returns the driver will not use the
> + *     queues.
> + *
> + * int (*ndo_get_split_queue_pairs) (struct net_device *dev,
> + *                              unsigned int *qpairs_start_from,
> + *                              unsigned int *qpairs_num,
> + *                              struct sock *sk)
> + *     Called to get the location of queues that have been split for user
> + *     space to use. The socket must have previously requested the queues via
> + *     ndo_split_queue_pairs successfully.
> + *
> + * int (*ndo_return_queue_pairs) (struct net_device *dev,
> + *                               struct sock *sk)
> + *     Called to return a set of queues identified by sock to the driver. The
> + *     socket must have previously requested the queues via
> + *     ndo_split_queue_pairs for this action to be performed.
> + *
> + * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
> + *                             struct tpacket_dev_qpair_map_region_info *info)
> + *     Called to return mapping of queue memory region.
> + *
> + * int (*ndo_get_device_desc_info) (struct net_device *dev,
> + *                                 struct tpacket_dev_info *dev_info)
> + *     Called to get device specific information. This should uniquely identify
> + *     the hardware so that descriptor formats can be learned by the stack/user
> + *     space.
> + *
> + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
> + *                                  struct net_device *dev)
> + *     Called to map queue pair range from split_queue_pairs into mmap region.
> + *
> + * int (*ndo_direct_validate_dma_mem_region_map)
> + *                                     (struct net_device *dev,
> + *                                      struct tpacket_dma_mem_region *region,
> + *                                      struct sock *sk)
> + *     Called to validate DMA address remapping for a userspace memory region
> + *
> + * int (*ndo_get_dma_region_info)
> + *                              (struct net_device *dev,
> + *                               struct tpacket_dma_mem_region *region,
> + *                               struct sock *sk)
> + *     Called to get DMA region information such as the iova.
>   */
>  struct net_device_ops {
>         int                     (*ndo_init)(struct net_device *dev);
> @@ -1190,6 +1240,35 @@ struct net_device_ops {
>         int                     (*ndo_switch_port_stp_update)(struct net_device *dev,
>                                                               u8 state);
>  #endif
> +       int                     (*ndo_split_queue_pairs)(struct net_device *dev,
> +                                        unsigned int qpairs_start_from,
> +                                        unsigned int qpairs_num,
> +                                        struct sock *sk);
> +       int                     (*ndo_get_split_queue_pairs)
> +                                       (struct net_device *dev,
> +                                        unsigned int *qpairs_start_from,
> +                                        unsigned int *qpairs_num,
> +                                        struct sock *sk);
> +       int                     (*ndo_return_queue_pairs)
> +                                       (struct net_device *dev,
> +                                        struct sock *sk);
> +       int                     (*ndo_get_device_qpair_map_region_info)
> +                                       (struct net_device *dev,
> +                                        struct tpacket_dev_qpair_map_region_info *info);
> +       int                     (*ndo_get_device_desc_info)
> +                                       (struct net_device *dev,
> +                                        struct tpacket_dev_info *dev_info);
> +       int                     (*ndo_direct_qpair_page_map)
> +                                       (struct vm_area_struct *vma,
> +                                        struct net_device *dev);
> +       int                     (*ndo_validate_dma_mem_region_map)
> +                                       (struct net_device *dev,
> +                                        struct tpacket_dma_mem_region *region,
> +                                        struct sock *sk);
> +       int                     (*ndo_get_dma_region_info)
> +                                       (struct net_device *dev,
> +                                        struct tpacket_dma_mem_region *region,
> +                                        struct sock *sk);
>  };
>
>  /**
> diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> index da2d668..eb7a727 100644
> --- a/include/uapi/linux/if_packet.h
> +++ b/include/uapi/linux/if_packet.h
> @@ -54,6 +54,13 @@ struct sockaddr_ll {
>  #define PACKET_FANOUT                  18
>  #define PACKET_TX_HAS_OFF              19
>  #define PACKET_QDISC_BYPASS            20
> +#define PACKET_RXTX_QPAIRS_SPLIT       21
> +#define PACKET_RXTX_QPAIRS_RETURN      22
> +#define PACKET_DEV_QPAIR_MAP_REGION_INFO       23
> +#define PACKET_DEV_DESC_INFO           24
> +#define PACKET_DMA_MEM_REGION_MAP       25
> +#define PACKET_DMA_MEM_REGION_RELEASE   26
> +
>
>  #define PACKET_FANOUT_HASH             0
>  #define PACKET_FANOUT_LB               1
> @@ -64,6 +71,87 @@ struct sockaddr_ll {
>  #define PACKET_FANOUT_FLAG_ROLLOVER    0x1000
>  #define PACKET_FANOUT_FLAG_DEFRAG      0x8000
>
> +#define PACKET_MAX_NUM_MAP_MEMORY_REGIONS 64
> +#define PACKET_MAX_NUM_DESC_FORMATS      8
> +#define PACKET_MAX_NUM_DESC_FIELDS       64
> +#define PACKET_NIC_DESC_FIELD(fseq, foffset, fwidth, falign, fbo) \
> +               .seqn = (__u8)fseq,                             \
> +               .offset = (__u8)foffset,                        \
> +               .width = (__u8)fwidth,                          \
> +               .align = (__u8)falign,                          \
> +               .byte_order = (__u8)fbo
> +
> +#define MAX_MAP_MEMORY_REGIONS 64
> +
> +/* setsockopt takes addr, size, direction parameters; getsockopt takes
> + * iova, size, direction.
> + * */
> +struct tpacket_dma_mem_region {
> +       void *addr;             /* userspace virtual address */
> +       __u64 phys_addr;        /* physical address */
> +       __u64 iova;             /* IO virtual address used for DMA */
> +       unsigned long size;     /* size of region */
> +       int direction;          /* dma data direction */
> +};
> +
> +struct tpacket_dev_qpair_map_region_info {
> +       unsigned int tp_dev_bar_sz;             /* size of BAR */
> +       unsigned int tp_dev_sysm_sz;            /* size of system memory */
> +       /* number of contiguous memory on BAR mapping to user space */
> +       unsigned int tp_num_map_regions;
> +       /* number of contiguous memory on system mapping to user space */
> +       unsigned int tp_num_sysm_map_regions;
> +       struct map_page_region {
> +               unsigned page_offset;   /* offset to start of region */
> +               unsigned page_sz;       /* size of page */
> +               unsigned page_cnt;      /* number of pages */
> +       } tp_regions[MAX_MAP_MEMORY_REGIONS];
> +};
> +
> +struct tpacket_dev_qpairs_info {
> +       unsigned int tp_qpairs_start_from;      /* qpairs index to start from */
> +       unsigned int tp_qpairs_num;             /* number of qpairs */
> +};
> +
> +enum tpack_desc_byte_order {
> +       BO_NATIVE = 0,
> +       BO_NETWORK,
> +       BO_BIG_ENDIAN,
> +       BO_LITTLE_ENDIAN,
> +};
> +
> +struct tpacket_nic_desc_fld {
> +       __u8 seqn;      /* Sequence index of descriptor field */
> +       __u8 offset;    /* Offset to start */
> +       __u8 width;     /* Width of field */
> +       __u8 align;     /* Alignment in bits */
> +       enum tpack_desc_byte_order byte_order;  /* Endian flag */
> +};
> +
> +struct tpacket_nic_desc_expr {
> +       __u8 version;           /* Version number */
> +       __u8 size;              /* Descriptor size in bytes */
> +       enum tpack_desc_byte_order byte_order;          /* Endian flag */
> +       __u8 num_of_fld;        /* Number of valid fields */
> +       /* List of each descriptor field */
> +       struct tpacket_nic_desc_fld fields[PACKET_MAX_NUM_DESC_FIELDS];
> +};
> +
> +struct tpacket_dev_info {
> +       __u16   tp_device_id;
> +       __u16   tp_vendor_id;
> +       __u16   tp_subsystem_device_id;
> +       __u16   tp_subsystem_vendor_id;
> +       __u32   tp_numa_node;
> +       __u32   tp_revision_id;
> +       __u32   tp_num_total_qpairs;
> +       __u32   tp_num_inuse_qpairs;
> +       __u32   tp_num_rx_desc_fmt;
> +       __u32   tp_num_tx_desc_fmt;
> +       struct tpacket_nic_desc_expr tp_rx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> +       struct tpacket_nic_desc_expr tp_tx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> +};
> +
>  struct tpacket_stats {
>         unsigned int    tp_packets;
>         unsigned int    tp_drops;
> diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> index 6880f34..8cd17da 100644
> --- a/net/packet/af_packet.c
> +++ b/net/packet/af_packet.c
> @@ -214,6 +214,9 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
>  static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
>                 struct tpacket3_hdr *);
>  static void packet_flush_mclist(struct sock *sk);
> +static int umem_release(struct net_device *dev, struct packet_sock *po);
> +static int get_umem_pages(struct tpacket_dma_mem_region *region,
> +                         struct packet_umem_region *umem);
>
>  struct packet_skb_cb {
>         unsigned int origlen;
> @@ -2633,6 +2636,16 @@ static int packet_release(struct socket *sock)
>         sock_prot_inuse_add(net, sk->sk_prot, -1);
>         preempt_enable();
>
> +       if (po->tp_owns_queue_pairs) {
> +               struct net_device *dev;
> +
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (dev) {
> +                       dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> +                       umem_release(dev, po);
> +               }
> +       }
> +
>         spin_lock(&po->bind_lock);
>         unregister_prot_hook(sk, false);
>         packet_cached_dev_reset(po);
> @@ -2829,6 +2842,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
>         po->num = proto;
>         po->xmit = dev_queue_xmit;
>
> +       INIT_LIST_HEAD(&po->umem_list);
> +
>         err = packet_alloc_pending(po);
>         if (err)
>                 goto out2;
> @@ -3226,6 +3241,88 @@ static void packet_flush_mclist(struct sock *sk)
>  }
>
>  static int
> +get_umem_pages(struct tpacket_dma_mem_region *region,
> +              struct packet_umem_region *umem)
> +{
> +       struct page **page_list;
> +       unsigned long npages;
> +       unsigned long offset;
> +       unsigned long base;
> +       unsigned long i;
> +       int ret;
> +       dma_addr_t phys_base;
> +
> +       phys_base = (region->phys_addr) & PAGE_MASK;
> +       base = ((unsigned long)region->addr) & PAGE_MASK;
> +       offset = ((unsigned long)region->addr) & (~PAGE_MASK);
> +       npages = PAGE_ALIGN(region->size + offset) >> PAGE_SHIFT;
> +
> +       npages = min_t(unsigned long, npages, umem->nents);
> +       sg_init_table(umem->sglist, npages);
> +
> +       umem->nmap = 0;
> +       page_list = (struct page **)__get_free_page(GFP_KERNEL);
> +       if (!page_list)
> +               return -ENOMEM;
> +
> +       while (npages) {
> +               unsigned long min = min_t(unsigned long, npages,
> +                                         PAGE_SIZE / sizeof(struct page *));
> +
> +               ret = get_user_pages(current, current->mm, base, min,
> +                                    1, 0, page_list, NULL);
> +               if (ret < 0)
> +                       break;
> +
> +               base += ret * PAGE_SIZE;
> +               npages -= ret;
> +
> +               /* validate that the memory region is physically contiguous */
> +               for (i = 0; i < ret; i++) {
> +                       unsigned int page_index =
> +                               (page_to_phys(page_list[i]) - phys_base) /
> +                               PAGE_SIZE;
> +
> +                       if (page_index != umem->nmap + i) {
> +                               int j;
> +
> +                               for (j = 0; j < (umem->nmap + i); j++)
> +                                       put_page(sg_page(&umem->sglist[j]));
> +
> +                               free_page((unsigned long)page_list);
> +                               return -EFAULT;
> +                       }
> +
> +                       sg_set_page(&umem->sglist[umem->nmap + i],
> +                                   page_list[i], PAGE_SIZE, 0);
> +               }
> +
> +               umem->nmap += ret;
> +       }
> +
> +       free_page((unsigned long)page_list);
> +       return 0;
> +}
> +
> +static int
> +umem_release(struct net_device *dev, struct packet_sock *po)
> +{
> +       struct packet_umem_region *umem, *tmp;
> +       int i;
> +
> +       list_for_each_entry_safe(umem, tmp, &po->umem_list, list) {
> +               dma_unmap_sg(dev->dev.parent, umem->sglist,
> +                            umem->nmap, umem->direction);
> +               for (i = 0; i < umem->nmap; i++)
> +                       put_page(sg_page(&umem->sglist[i]));
> +
> +               vfree(umem);
> +       }
> +
> +       return 0;
> +}
> +
> +static int
>  packet_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
>  {
>         struct sock *sk = sock->sk;
> @@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
>                 po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
>                 return 0;
>         }
> +       case PACKET_RXTX_QPAIRS_SPLIT:
> +       {
> +               struct tpacket_dev_qpairs_info qpairs;
> +               const struct net_device_ops *ops;
> +               struct net_device *dev;
> +               int err;
> +
> +               if (optlen != sizeof(qpairs))
> +                       return -EINVAL;
> +               if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
> +                       return -EFAULT;
> +
> +               /* Only allow one set of queues to be owned by userspace */
> +               if (po->tp_owns_queue_pairs)
> +                       return -EBUSY;
> +
> +               /* This call only works after a bind call which calls a dev_hold
> +                * operation so we do not need to increment dev ref counter
> +                */
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (!dev)
> +                       return -EINVAL;
> +               ops = dev->netdev_ops;
> +               if (!ops->ndo_split_queue_pairs)
> +                       return -EOPNOTSUPP;
> +
> +               err =  ops->ndo_split_queue_pairs(dev,
> +                                                 qpairs.tp_qpairs_start_from,
> +                                                 qpairs.tp_qpairs_num, sk);
> +               if (!err)
> +                       po->tp_owns_queue_pairs = true;
> +
> +               return err;
> +       }
> +       case PACKET_RXTX_QPAIRS_RETURN:
> +       {
> +               struct tpacket_dev_qpairs_info qpairs_info;
> +               const struct net_device_ops *ops;
> +               struct net_device *dev;
> +               int err;
> +
> +               if (optlen != sizeof(qpairs_info))
> +                       return -EINVAL;
> +               if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
> +                       return -EFAULT;
> +
> +               if (!po->tp_owns_queue_pairs)
> +                       return -EINVAL;
> +
> +               /* This call only work after a bind call which calls a dev_hold
> +                * operation so we do not need to increment dev ref counter
> +                */
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (!dev)
> +                       return -EINVAL;
> +               ops = dev->netdev_ops;
> +               if (!ops->ndo_return_queue_pairs)
> +                       return -EOPNOTSUPP;
> +
> +               err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> +               if (!err)
> +                       po->tp_owns_queue_pairs = false;
> +
> +               return err;
> +       }
> +       case PACKET_DMA_MEM_REGION_MAP:
> +       {
> +               struct tpacket_dma_mem_region region;
> +               const struct net_device_ops *ops;
> +               struct net_device *dev;
> +               struct packet_umem_region *umem;
> +               unsigned long npages;
> +               unsigned long offset;
> +               unsigned long i;
> +               int err;
> +
> +               if (optlen != sizeof(region))
> +                       return -EINVAL;
> +               if (copy_from_user(&region, optval, sizeof(region)))
> +                       return -EFAULT;
> +               if ((region.direction != DMA_BIDIRECTIONAL) &&
> +                   (region.direction != DMA_TO_DEVICE) &&
> +                   (region.direction != DMA_FROM_DEVICE))
> +                       return -EFAULT;
> +
> +               if (!po->tp_owns_queue_pairs)
> +                       return -EINVAL;
> +
> +               /* This call only works after a bind call which calls a dev_hold
> +                * operation so we do not need to increment dev ref counter
> +                */
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (!dev)
> +                       return -EINVAL;
> +
> +               offset = ((unsigned long)region.addr) & (~PAGE_MASK);
> +               npages = PAGE_ALIGN(region.size + offset) >> PAGE_SHIFT;
> +
> +               umem = vzalloc(sizeof(*umem) +
> +                              sizeof(struct scatterlist) * npages);
> +               if (!umem)
> +                       return -ENOMEM;
> +
> +               umem->nents = npages;
> +               umem->direction = region.direction;
> +
> +               down_write(&current->mm->mmap_sem);
> +               if (get_umem_pages(&region, umem) < 0) {
> +                       ret = -EFAULT;
> +                       goto exit;
> +               }
> +
> +               if ((umem->nmap == npages) &&
> +                   (0 != dma_map_sg(dev->dev.parent, umem->sglist,
> +                                    umem->nmap, region.direction))) {
> +                       region.iova = sg_dma_address(umem->sglist) + offset;
> +
> +                       ops = dev->netdev_ops;
> +                       if (!ops->ndo_validate_dma_mem_region_map) {
> +                               ret = -EOPNOTSUPP;
> +                               goto unmap;
> +                       }
> +
> +                       /* use driver to validate mapping of dma memory */
> +                       err = ops->ndo_validate_dma_mem_region_map(dev,
> +                                                                  &region,
> +                                                                  sk);
> +                       if (!err) {
> +                               list_add_tail(&umem->list, &po->umem_list);
> +                               ret = 0;
> +                               goto exit;
> +                       }
> +               }
> +
> +unmap:
> +               dma_unmap_sg(dev->dev.parent, umem->sglist,
> +                            umem->nmap, umem->direction);
> +               for (i = 0; i < umem->nmap; i++)
> +                       put_page(sg_page(&umem->sglist[i]));
> +
> +               vfree(umem);
> +exit:
> +               up_write(&current->mm->mmap_sem);
> +
> +               return ret;
> +       }
> +       case PACKET_DMA_MEM_REGION_RELEASE:
> +       {
> +               struct net_device *dev;
> +
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (!dev)
> +                       return -EINVAL;
> +
> +               down_write(&current->mm->mmap_sem);
> +               ret = umem_release(dev, po);
> +               up_write(&current->mm->mmap_sem);
> +
> +               return ret;
> +       }
> +
>         default:
>                 return -ENOPROTOOPT;
>         }
> @@ -3523,6 +3781,129 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
>         case PACKET_QDISC_BYPASS:
>                 val = packet_use_direct_xmit(po);
>                 break;
> +       case PACKET_RXTX_QPAIRS_SPLIT:
> +       {
> +               struct net_device *dev;
> +               struct tpacket_dev_qpairs_info qpairs_info;
> +               int err;
> +
> +               if (len != sizeof(qpairs_info))
> +                       return -EINVAL;
> +               if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
> +                       return -EFAULT;
> +
> +               /* This call only works after a successful queue pairs split-off
> +                * operation via setsockopt()
> +                */
> +               if (!po->tp_owns_queue_pairs)
> +                       return -EINVAL;
> +
> +               /* This call only works after a bind call which calls a dev_hold
> +                * operation so we do not need to increment dev ref counter
> +                */
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (!dev)
> +                       return -EINVAL;
> +               if (!dev->netdev_ops->ndo_get_split_queue_pairs)
> +                       return -EOPNOTSUPP;
> +
> +               err =  dev->netdev_ops->ndo_get_split_queue_pairs(dev,
> +                                       &qpairs_info.tp_qpairs_start_from,
> +                                       &qpairs_info.tp_qpairs_num, sk);
> +
> +               lv = sizeof(qpairs_info);
> +               data = &qpairs_info;
> +               break;
> +       }
> +       case PACKET_DEV_QPAIR_MAP_REGION_INFO:
> +       {
> +               struct tpacket_dev_qpair_map_region_info info;
> +               const struct net_device_ops *ops;
> +               struct net_device *dev;
> +               int err;
> +
> +               if (len != sizeof(info))
> +                       return -EINVAL;
> +               if (copy_from_user(&info, optval, sizeof(info)))
> +                       return -EFAULT;
> +
> +               /* This call only works after a bind call which calls a dev_hold
> +                * operation so we do not need to increment dev ref counter
> +                */
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (!dev)
> +                       return -EINVAL;
> +
> +               ops = dev->netdev_ops;
> +               if (!ops->ndo_get_device_qpair_map_region_info)
> +                       return -EOPNOTSUPP;
> +
> +               err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
> +               if (err)
> +                       return err;
> +
> +               lv = sizeof(struct tpacket_dev_qpair_map_region_info);
> +               data = &info;
> +               break;
> +       }
> +       case PACKET_DEV_DESC_INFO:
> +       {
> +               struct net_device *dev;
> +               struct tpacket_dev_info info;
> +               int err;
> +
> +               if (len != sizeof(info))
> +                       return -EINVAL;
> +               if (copy_from_user(&info, optval, sizeof(info)))
> +                       return -EFAULT;
> +
> +               /* This call only works after a bind call which calls a dev_hold
> +                * operation so we do not need to increment dev ref counter
> +                */
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (!dev)
> +                       return -EINVAL;
> +               if (!dev->netdev_ops->ndo_get_device_desc_info)
> +                       return -EOPNOTSUPP;
> +
> +               err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
> +               if (err)
> +                       return err;
> +
> +               lv = sizeof(struct tpacket_dev_info);
> +               data = &info;
> +               break;
> +       }
> +       case PACKET_DMA_MEM_REGION_MAP:
> +       {
> +               struct tpacket_dma_mem_region info;
> +               struct net_device *dev;
> +               int err;
> +
> +               if (len != sizeof(info))
> +                               return -EINVAL;
> +               if (copy_from_user(&info, optval, sizeof(info)))
> +                               return -EFAULT;
> +
> +               /* This call only works after a bind call which calls a dev_hold
> +                * operation so we do not need to increment dev ref counter
> +                */
> +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +               if (!dev)
> +                       return -EINVAL;
> +
> +               if (!dev->netdev_ops->ndo_get_dma_region_info)
> +                       return -EOPNOTSUPP;
> +
> +               err =  dev->netdev_ops->ndo_get_dma_region_info(dev, &info, sk);
> +               if (err)
> +                       return err;
> +
> +               lv = sizeof(struct tpacket_dma_mem_region);
> +               data = &info;
> +               break;
> +       }
> +
>         default:
>                 return -ENOPROTOOPT;
>         }
> @@ -3536,7 +3917,6 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
>         return 0;
>  }
>
> -
>  static int packet_notifier(struct notifier_block *this,
>                            unsigned long msg, void *ptr)
>  {
> @@ -3920,6 +4300,8 @@ static int packet_mmap(struct file *file, struct socket *sock,
>         struct packet_sock *po = pkt_sk(sk);
>         unsigned long size, expected_size;
>         struct packet_ring_buffer *rb;
> +       const struct net_device_ops *ops;
> +       struct net_device *dev;
>         unsigned long start;
>         int err = -EINVAL;
>         int i;
> @@ -3927,8 +4309,20 @@ static int packet_mmap(struct file *file, struct socket *sock,
>         if (vma->vm_pgoff)
>                 return -EINVAL;
>
> +       dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> +       if (!dev)
> +               return -EINVAL;
> +
>         mutex_lock(&po->pg_vec_lock);
>
> +       if (po->tp_owns_queue_pairs) {
> +               ops = dev->netdev_ops;
> +               err = ops->ndo_direct_qpair_page_map(vma, dev);
> +               if (err)
> +                       goto out;
> +               goto done;
> +       }
> +
>         expected_size = 0;
>         for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
>                 if (rb->pg_vec) {
> @@ -3966,6 +4360,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
>                 }
>         }
>
> +done:
>         atomic_inc(&po->mapped);
>         vma->vm_ops = &packet_mmap_ops;
>         err = 0;
> diff --git a/net/packet/internal.h b/net/packet/internal.h
> index cdddf6a..55d2fce 100644
> --- a/net/packet/internal.h
> +++ b/net/packet/internal.h
> @@ -90,6 +90,14 @@ struct packet_fanout {
>         struct packet_type      prot_hook ____cacheline_aligned_in_smp;
>  };
>
> +struct packet_umem_region {
> +       struct list_head        list;
> +       int                     nents;
> +       int                     nmap;
> +       int                     direction;
> +       struct scatterlist      sglist[0];
> +};
> +
>  struct packet_sock {
>         /* struct sock has to be the first member of packet_sock */
>         struct sock             sk;
> @@ -97,6 +105,7 @@ struct packet_sock {
>         union  tpacket_stats_u  stats;
>         struct packet_ring_buffer       rx_ring;
>         struct packet_ring_buffer       tx_ring;
> +       struct list_head        umem_list;
>         int                     copy_thresh;
>         spinlock_t              bind_lock;
>         struct mutex            pg_vec_lock;
> @@ -113,6 +122,7 @@ struct packet_sock {
>         unsigned int            tp_reserve;
>         unsigned int            tp_loss:1;
>         unsigned int            tp_tx_has_off:1;
> +       unsigned int            tp_owns_queue_pairs:1;
>         unsigned int            tp_tstamp;
>         struct net_device __rcu *cached_dev;
>         int                     (*xmit)(struct sk_buff *skb);
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings
  2015-01-13  4:35 ` [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings John Fastabend
  2015-01-13 12:05   ` Hannes Frederic Sowa
  2015-01-13 14:26   ` Daniel Borkmann
@ 2015-01-13 18:58   ` Willem de Bruijn
  2 siblings, 0 replies; 24+ messages in thread
From: Willem de Bruijn @ 2015-01-13 18:58 UTC (permalink / raw)
  To: John Fastabend
  Cc: Network Development, Zhou, Danny, Neil Horman, Daniel Borkmann,
	Ronciak, John, Hannes Frederic Sowa, brouer

On Mon, Jan 12, 2015 at 11:35 PM, John Fastabend
<john.fastabend@gmail.com> wrote:
> This allows driver queues to be split off and mapped into user
> space using af_packet.
>
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
>  drivers/net/ethernet/intel/ixgbe/ixgbe.h         |   17 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c |   23 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c    |  407 ++++++++++++++++++++++
>  drivers/net/ethernet/intel/ixgbe/ixgbe_type.h    |    1
>  4 files changed, 440 insertions(+), 8 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> index 38fc64c..aa4960e 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
> @@ -204,6 +204,20 @@ struct ixgbe_tx_queue_stats {
>         u64 tx_done_old;
>  };
>
> +#define MAX_USER_DMA_REGIONS_PER_SOCKET  16
> +
> +struct ixgbe_user_dma_region {
> +       dma_addr_t dma_region_iova;
> +       unsigned long dma_region_size;
> +       int direction;
> +};
> +
> +struct ixgbe_user_queue_info {
> +       struct sock *sk_handle;
> +       struct ixgbe_user_dma_region regions[MAX_USER_DMA_REGIONS_PER_SOCKET];
> +       int num_of_regions;
> +};
> +
>  struct ixgbe_rx_queue_stats {
>         u64 rsc_count;
>         u64 rsc_flush;
> @@ -673,6 +687,9 @@ struct ixgbe_adapter {
>
>         struct ixgbe_q_vector *q_vector[MAX_Q_VECTORS];
>
> +       /* Direct User Space Queues */
> +       struct ixgbe_user_queue_info user_queue_info[MAX_RX_QUEUES];
> +
>         /* DCB parameters */
>         struct ieee_pfc *ixgbe_ieee_pfc;
>         struct ieee_ets *ixgbe_ieee_ets;
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> index e5be0dd..f180a58 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_ethtool.c
> @@ -2598,12 +2598,17 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
>         if (!(adapter->flags & IXGBE_FLAG_FDIR_PERFECT_CAPABLE))
>                 return -EOPNOTSUPP;
>
> +       if (fsp->ring_cookie > MAX_RX_QUEUES)
> +               return -EINVAL;
> +
>         /*
>          * Don't allow programming if the action is a queue greater than
> -        * the number of online Rx queues.
> +        * the number of online Rx queues unless it is a user space
> +        * queue.
>          */
>         if ((fsp->ring_cookie != RX_CLS_FLOW_DISC) &&
> -           (fsp->ring_cookie >= adapter->num_rx_queues))
> +           (fsp->ring_cookie >= adapter->num_rx_queues) &&
> +           !adapter->user_queue_info[fsp->ring_cookie].sk_handle)
>                 return -EINVAL;
>
>         /* Don't allow indexes to exist outside of available space */
> @@ -2680,12 +2685,18 @@ static int ixgbe_add_ethtool_fdir_entry(struct ixgbe_adapter *adapter,
>         /* apply mask and compute/store hash */
>         ixgbe_atr_compute_perfect_hash_82599(&input->filter, &mask);
>
> +       /* Set input action to reg_idx for driver-owned queues, otherwise
> +        * use the absolute index for user space queues.
> +        */
> +       if (fsp->ring_cookie < adapter->num_rx_queues &&
> +           fsp->ring_cookie != IXGBE_FDIR_DROP_QUEUE)
> +               input->action = adapter->rx_ring[input->action]->reg_idx;
> +
>         /* program filters to filter memory */
>         err = ixgbe_fdir_write_perfect_filter_82599(hw,
> -                               &input->filter, input->sw_idx,
> -                               (input->action == IXGBE_FDIR_DROP_QUEUE) ?
> -                               IXGBE_FDIR_DROP_QUEUE :
> -                               adapter->rx_ring[input->action]->reg_idx);
> +                                                   &input->filter,
> +                                                   input->sw_idx,
> +                                                   input->action);
>         if (err)
>                 goto err_out_w_lock;
>
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> index 2ed2c7d..be5bde86 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
> @@ -50,6 +50,9 @@
>  #include <linux/if_bridge.h>
>  #include <linux/prefetch.h>
>  #include <scsi/fc/fc_fcoe.h>
> +#include <linux/mm.h>
> +#include <linux/if_packet.h>
> +#include <linux/iommu.h>
>
>  #ifdef CONFIG_OF
>  #include <linux/of_net.h>
> @@ -80,6 +83,12 @@ const char ixgbe_driver_version[] = DRV_VERSION;
>  static const char ixgbe_copyright[] =
>                                 "Copyright (c) 1999-2014 Intel Corporation.";
>
> +static unsigned int *dummy_page_buf;
> +
> +#ifndef CONFIG_DMA_MEMORY_PROTECTION
> +#define CONFIG_DMA_MEMORY_PROTECTION
> +#endif
> +
>  static const struct ixgbe_info *ixgbe_info_tbl[] = {
>         [board_82598]           = &ixgbe_82598_info,
>         [board_82599]           = &ixgbe_82599_info,
> @@ -167,6 +176,76 @@ MODULE_DESCRIPTION("Intel(R) 10 Gigabit PCI Express Network Driver");
>  MODULE_LICENSE("GPL");
>  MODULE_VERSION(DRV_VERSION);
>
> +enum ixgbe_legacy_rx_enum {
> +       IXGBE_LEGACY_RX_FIELD_PKT_ADDR = 0,     /* Packet buffer address */
> +       IXGBE_LEGACY_RX_FIELD_LENGTH,           /* Packet length */
> +       IXGBE_LEGACY_RX_FIELD_CSUM,             /* Fragment checksum */
> +       IXGBE_LEGACY_RX_FIELD_STATUS,           /* Descriptors status */
> +       IXGBE_LEGACY_RX_FIELD_ERRORS,           /* Receive errors */
> +       IXGBE_LEGACY_RX_FIELD_VLAN,             /* VLAN tag */
> +};
> +
> +enum ixgbe_legacy_tx_enum {
> +       IXGBE_LEGACY_TX_FIELD_PKT_ADDR = 0,     /* Packet buffer address */
> +       IXGBE_LEGACY_TX_FIELD_LENGTH,           /* Packet length */
> +       IXGBE_LEGACY_TX_FIELD_CSO,              /* Checksum offset*/
> +       IXGBE_LEGACY_TX_FIELD_CMD,              /* Descriptor control */
> +       IXGBE_LEGACY_TX_FIELD_STATUS,           /* Descriptor status */
> +       IXGBE_LEGACY_TX_FIELD_RSVD,             /* Reserved */
> +       IXGBE_LEGACY_TX_FIELD_CSS,              /* Checksum start */
> +       IXGBE_LEGACY_TX_FIELD_VLAN_TAG,         /* VLAN tag */
> +};
> +
> +/* IXGBE Receive Descriptor - Legacy */
> +static const struct tpacket_nic_desc_fld ixgbe_legacy_rx_desc[] = {
> +       /* Packet buffer address */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_PKT_ADDR,
> +                               0,  64, 64,  BO_NATIVE)},
> +       /* Packet length */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_LENGTH,
> +                               64, 16, 8,  BO_NATIVE)},
> +       /* Fragment checksum */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_CSUM,
> +                               80, 16, 8,  BO_NATIVE)},
> +       /* Descriptors status */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_STATUS,
> +                               96, 8, 8,  BO_NATIVE)},
> +       /* Receive errors */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_ERRORS,
> +                               104, 8, 8,  BO_NATIVE)},
> +       /* VLAN tag */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_RX_FIELD_VLAN,
> +                               112, 16, 8,  BO_NATIVE)},
> +};
> +
> +/* IXGBE Transmit Descriptor - Legacy */
> +static const struct tpacket_nic_desc_fld ixgbe_legacy_tx_desc[] = {
> +       /* Packet buffer address */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_PKT_ADDR,
> +                               0,   64, 64,  BO_NATIVE)},
> +       /* Data buffer length */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_LENGTH,
> +                               64,  16, 8,  BO_NATIVE)},
> +       /* Checksum offset */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CSO,
> +                               80,  8, 8,  BO_NATIVE)},
> +       /* Command byte */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CMD,
> +                               88,  8, 8,  BO_NATIVE)},
> +       /* Transmitted status */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_STATUS,
> +                               96,  4, 1,  BO_NATIVE)},
> +       /* Reserved */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_RSVD,
> +                               100, 4, 1,  BO_NATIVE)},
> +       /* Checksum start */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_CSS,
> +                               104, 8, 8,  BO_NATIVE)},
> +       /* VLAN tag */
> +       {PACKET_NIC_DESC_FIELD(IXGBE_LEGACY_TX_FIELD_VLAN_TAG,
> +                               112, 16, 8,  BO_NATIVE)},
> +};
> +
>  static bool ixgbe_check_cfg_remove(struct ixgbe_hw *hw, struct pci_dev *pdev);
>
>  static int ixgbe_read_pci_cfg_word_parent(struct ixgbe_adapter *adapter,
> @@ -3137,6 +3216,17 @@ static void ixgbe_enable_rx_drop(struct ixgbe_adapter *adapter,
>         IXGBE_WRITE_REG(hw, IXGBE_SRRCTL(reg_idx), srrctl);
>  }
>
> +static bool ixgbe_have_user_queues(struct ixgbe_adapter *adapter)
> +{
> +       int i;
> +
> +       for (i = 0; i < MAX_RX_QUEUES; i++) {
> +               if (adapter->user_queue_info[i].sk_handle)
> +                       return true;
> +       }
> +       return false;
> +}
> +
>  static void ixgbe_disable_rx_drop(struct ixgbe_adapter *adapter,
>                                   struct ixgbe_ring *ring)
>  {
> @@ -3171,7 +3261,8 @@ static void ixgbe_set_rx_drop_en(struct ixgbe_adapter *adapter)
>          *  and performance reasons.
>          */
>         if (adapter->num_vfs || (adapter->num_rx_queues > 1 &&
> -           !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en)) {
> +           !(adapter->hw.fc.current_mode & ixgbe_fc_tx_pause) && !pfc_en) ||
> +           ixgbe_have_user_queues(adapter)) {
>                 for (i = 0; i < adapter->num_rx_queues; i++)
>                         ixgbe_enable_rx_drop(adapter, adapter->rx_ring[i]);
>         } else {
> @@ -7938,6 +8029,306 @@ static void ixgbe_fwd_del(struct net_device *pdev, void *priv)
>         kfree(fwd_adapter);
>  }
>
> +static int ixgbe_ndo_split_queue_pairs(struct net_device *dev,
> +                                      unsigned int start_from,
> +                                      unsigned int qpairs_num,
> +                                      struct sock *sk)
> +{
> +       struct ixgbe_adapter *adapter = netdev_priv(dev);
> +       unsigned int qpair_index;
> +
> +       /* allocate whatever qpairs are available */
> +       if (start_from == -1) {

When is this wildcard case used? If the NIC is configured to send
specific traffic to a specific rx queue, then that queue has to be
mapped. When is an arbitrary queue acceptable?

> +               unsigned int count = 0;
> +
> +               for (qpair_index = adapter->num_rx_queues;
> +                    qpair_index < MAX_RX_QUEUES;
> +                    qpair_index++) {
> +                       if (!adapter->user_queue_info[qpair_index].sk_handle) {
> +                               count++;
> +                               if (count == qpairs_num) {
> +                                       start_from = qpair_index - count + 1;
> +                                       break;
> +                               }
> +                       } else {
> +                               count = 0;
> +                       }
> +               }
> +       }
> +
> +       /* otherwise the caller specified exact queues */
> +       if ((start_from > MAX_TX_QUEUES) ||
> +           (start_from > MAX_RX_QUEUES) ||
> +           (start_from + qpairs_num > MAX_TX_QUEUES) ||
> +           (start_from + qpairs_num > MAX_RX_QUEUES))
> +               return -EINVAL;
> +
> +       /* If the qpairs are being used by the driver, do not let user space
> +        * consume the queues. Also, if a queue has already been allocated
> +        * to a socket, fail the request.
> +        */
> +       for (qpair_index = start_from;
> +            qpair_index < start_from + qpairs_num;
> +            qpair_index++) {
> +               if ((qpair_index < adapter->num_tx_queues) ||
> +                   (qpair_index < adapter->num_rx_queues))
> +                       return -EINVAL;

Is there a similar check to ensure that the driver does not increase its
number of queues with ethtool -X and subsume user queues?

> +
> +               if (adapter->user_queue_info[qpair_index].sk_handle)
> +                       return -EBUSY;
> +       }
> +
> +       /* remember the sk handle for each queue pair */
> +       for (qpair_index = start_from;
> +            qpair_index < start_from + qpairs_num;
> +            qpair_index++) {
> +               adapter->user_queue_info[qpair_index].sk_handle = sk;
> +               adapter->user_queue_info[qpair_index].num_of_regions = 0;
> +       }
> +
> +       return 0;
> +}
> +
> +static int ixgbe_ndo_get_split_queue_pairs(struct net_device *dev,
> +                                          unsigned int *start_from,
> +                                          unsigned int *qpairs_num,
> +                                          struct sock *sk)
> +{
> +       struct ixgbe_adapter *adapter = netdev_priv(dev);
> +       unsigned int qpair_index;
> +       *qpairs_num = 0;
> +
> +       for (qpair_index = adapter->num_tx_queues;
> +            qpair_index < MAX_RX_QUEUES;
> +            qpair_index++) {
> +               if (adapter->user_queue_info[qpair_index].sk_handle == sk) {
> +                       if (*qpairs_num == 0)
> +                               *start_from = qpair_index;
> +                       *qpairs_num = *qpairs_num + 1;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +static int ixgbe_ndo_return_queue_pairs(struct net_device *dev, struct sock *sk)
> +{
> +       struct ixgbe_adapter *adapter = netdev_priv(dev);
> +       struct ixgbe_user_queue_info *info;
> +       unsigned int qpair_index;
> +
> +       for (qpair_index = adapter->num_tx_queues;
> +            qpair_index < MAX_RX_QUEUES;
> +            qpair_index++) {
> +               info = &adapter->user_queue_info[qpair_index];
> +
> +               if (info->sk_handle == sk) {
> +                       info->sk_handle = NULL;
> +                       info->num_of_regions = 0;
> +               }
> +       }
> +
> +       return 0;
> +}
> +
> +/* Rx descriptors start at offset 0x1000 and Tx descriptors start at 0x6000;
> + * both the Tx and Rx descriptor rings use 4K pages.
> + */
> +#define RX_DESC_ADDR_OFFSET            0x1000
> +#define TX_DESC_ADDR_OFFSET            0x6000
> +#define PAGE_SIZE_4K                   4096
> +
> +static int
> +ixgbe_ndo_qpair_map_region(struct net_device *dev,
> +                          struct tpacket_dev_qpair_map_region_info *info)
> +{
> +       struct ixgbe_adapter *adapter = netdev_priv(dev);
> +
> +       /* no need to map system memory to user space for ixgbe */
> +       info->tp_dev_sysm_sz = 0;
> +       info->tp_num_sysm_map_regions = 0;
> +
> +       info->tp_dev_bar_sz = pci_resource_len(adapter->pdev, 0);
> +       info->tp_num_map_regions = 2;
> +
> +       info->tp_regions[0].page_offset = RX_DESC_ADDR_OFFSET;
> +       info->tp_regions[0].page_sz = PAGE_SIZE;
> +       info->tp_regions[0].page_cnt = 1;
> +       info->tp_regions[1].page_offset = TX_DESC_ADDR_OFFSET;
> +       info->tp_regions[1].page_sz = PAGE_SIZE;
> +       info->tp_regions[1].page_cnt = 1;
> +
> +       return 0;
> +}
> +
> +static int ixgbe_ndo_get_device_desc_info(struct net_device *dev,
> +                                         struct tpacket_dev_info *dev_info)
> +{
> +       struct ixgbe_adapter *adapter = netdev_priv(dev);
> +       int max_queues;
> +       int i;
> +       __u8 flds_rx = sizeof(ixgbe_legacy_rx_desc) /
> +                      sizeof(struct tpacket_nic_desc_fld);
> +       __u8 flds_tx = sizeof(ixgbe_legacy_tx_desc) /
> +                      sizeof(struct tpacket_nic_desc_fld);
> +
> +       max_queues = max(adapter->num_rx_queues, adapter->num_tx_queues);
> +
> +       dev_info->tp_device_id = adapter->hw.device_id;
> +       dev_info->tp_vendor_id = adapter->hw.vendor_id;
> +       dev_info->tp_subsystem_device_id = adapter->hw.subsystem_device_id;
> +       dev_info->tp_subsystem_vendor_id = adapter->hw.subsystem_vendor_id;
> +       dev_info->tp_revision_id = adapter->hw.revision_id;
> +       dev_info->tp_numa_node = dev_to_node(&dev->dev);
> +
> +       dev_info->tp_num_total_qpairs = min(MAX_RX_QUEUES, MAX_TX_QUEUES);
> +       dev_info->tp_num_inuse_qpairs = max_queues;
> +
> +       dev_info->tp_num_rx_desc_fmt = 1;
> +       dev_info->tp_num_tx_desc_fmt = 1;
> +
> +       dev_info->tp_rx_dexpr[0].version = 1;
> +       dev_info->tp_rx_dexpr[0].size = sizeof(union ixgbe_adv_rx_desc);
> +       dev_info->tp_rx_dexpr[0].byte_order = BO_NATIVE;
> +       dev_info->tp_rx_dexpr[0].num_of_fld = flds_rx;
> +       for (i = 0; i < dev_info->tp_rx_dexpr[0].num_of_fld; i++)
> +               memcpy(&dev_info->tp_rx_dexpr[0].fields[i],
> +                      &ixgbe_legacy_rx_desc[i],
> +                      sizeof(struct tpacket_nic_desc_fld));
> +
> +       dev_info->tp_tx_dexpr[0].version = 1;
> +       dev_info->tp_tx_dexpr[0].size = sizeof(union ixgbe_adv_tx_desc);
> +       dev_info->tp_tx_dexpr[0].byte_order = BO_NATIVE;
> +       dev_info->tp_tx_dexpr[0].num_of_fld = flds_tx;
> +       for (i = 0; i < dev_info->tp_tx_dexpr[0].num_of_fld; i++)
> +               memcpy(&dev_info->tp_tx_dexpr[0].fields[i],
> +                      &ixgbe_legacy_tx_desc[i],
> +                      sizeof(struct tpacket_nic_desc_fld));
> +
> +       return 0;
> +}
> +
> +static int
> +ixgbe_ndo_qpair_page_map(struct vm_area_struct *vma, struct net_device *dev)
> +{
> +       struct ixgbe_adapter *adapter = netdev_priv(dev);
> +       phys_addr_t phy_addr = pci_resource_start(adapter->pdev, 0);
> +       unsigned long pfn_rx = (phy_addr + RX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +       unsigned long pfn_tx = (phy_addr + TX_DESC_ADDR_OFFSET) >> PAGE_SHIFT;
> +       unsigned long dummy_page_phy;
> +       pgprot_t pre_vm_page_prot;
> +       unsigned long start;
> +       unsigned int i;
> +       int err;
> +
> +       if (!dummy_page_buf) {
> +               dummy_page_buf = kzalloc(PAGE_SIZE_4K, GFP_KERNEL);
> +               if (!dummy_page_buf)
> +                       return -ENOMEM;
> +
> +               for (i = 0; i < PAGE_SIZE_4K / sizeof(unsigned int); i++)
> +                       dummy_page_buf[i] = 0xdeadbeef;
> +       }
> +
> +       dummy_page_phy = virt_to_phys(dummy_page_buf);
> +       pre_vm_page_prot = vma->vm_page_prot;
> +       vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +
> +       /* assume vm_start is a 4K-aligned address */
> +       for (start = vma->vm_start;
> +            start < vma->vm_end;
> +            start += PAGE_SIZE_4K) {
> +               if (start == vma->vm_start + RX_DESC_ADDR_OFFSET) {
> +                       err = remap_pfn_range(vma, start, pfn_rx, PAGE_SIZE_4K,
> +                                             vma->vm_page_prot);
> +                       if (err)
> +                               return -EAGAIN;
> +               } else if (start == vma->vm_start + TX_DESC_ADDR_OFFSET) {
> +                       err = remap_pfn_range(vma, start, pfn_tx, PAGE_SIZE_4K,
> +                                             vma->vm_page_prot);
> +                       if (err)
> +                               return -EAGAIN;
> +               } else {
> +                       unsigned long addr = dummy_page_phy >> PAGE_SHIFT;
> +
> +                       err = remap_pfn_range(vma, start, addr, PAGE_SIZE_4K,
> +                                             pre_vm_page_prot);
> +                       if (err)
> +                               return -EAGAIN;
> +               }
> +       }
> +       return 0;
> +}
> +
> +static int
> +ixgbe_ndo_val_dma_mem_region_map(struct net_device *dev,
> +                                struct tpacket_dma_mem_region *region,
> +                                struct sock *sk)
> +{
> +       struct ixgbe_adapter *adapter = netdev_priv(dev);
> +       unsigned int qpair_index, i;
> +       struct ixgbe_user_queue_info *info;
> +
> +#ifdef CONFIG_DMA_MEMORY_PROTECTION
> +       /* IOVA not equal to physical address means IOMMU takes effect */
> +       if (region->phys_addr == region->iova)
> +               return -EFAULT;
> +#endif
> +
> +       for (qpair_index = adapter->num_tx_queues;
> +            qpair_index < MAX_RX_QUEUES;
> +            qpair_index++) {
> +               info = &adapter->user_queue_info[qpair_index];
> +               i = info->num_of_regions;
> +
> +               if (info->sk_handle != sk)
> +                       continue;
> +
> +               if (info->num_of_regions >= MAX_USER_DMA_REGIONS_PER_SOCKET)
> +                       return -EFAULT;
> +
> +               info->regions[i].dma_region_size = region->size;
> +               info->regions[i].direction = region->direction;
> +               info->regions[i].dma_region_iova = region->iova;
> +               info->num_of_regions++;
> +       }
> +
> +       return 0;
> +}
> +
> +static int
> +ixgbe_get_dma_region_info(struct net_device *dev,
> +                         struct tpacket_dma_mem_region *region,
> +                         struct sock *sk)
> +{
> +       struct ixgbe_adapter *adapter = netdev_priv(dev);
> +       struct ixgbe_user_queue_info *info;
> +       unsigned int qpair_index;
> +
> +       for (qpair_index = adapter->num_tx_queues;
> +            qpair_index < MAX_RX_QUEUES;
> +            qpair_index++) {
> +               int i;
> +
> +               info = &adapter->user_queue_info[qpair_index];
> +               if (info->sk_handle != sk)
> +                       continue;
> +
> +               for (i = 0; i < info->num_of_regions; i++) {
> +                       struct ixgbe_user_dma_region *r;
> +
> +                       r = &info->regions[i];
> +                       if ((r->dma_region_size == region->size) &&
> +                           (r->direction == region->direction)) {
> +                               region->iova = r->dma_region_iova;
> +                               return 0;
> +                       }
> +               }
> +       }
> +
> +       return -1;
> +}
> +
>  static const struct net_device_ops ixgbe_netdev_ops = {
>         .ndo_open               = ixgbe_open,
>         .ndo_stop               = ixgbe_close,
> @@ -7982,6 +8373,15 @@ static const struct net_device_ops ixgbe_netdev_ops = {
>         .ndo_bridge_getlink     = ixgbe_ndo_bridge_getlink,
>         .ndo_dfwd_add_station   = ixgbe_fwd_add,
>         .ndo_dfwd_del_station   = ixgbe_fwd_del,
> +
> +       .ndo_split_queue_pairs  = ixgbe_ndo_split_queue_pairs,
> +       .ndo_get_split_queue_pairs = ixgbe_ndo_get_split_queue_pairs,
> +       .ndo_return_queue_pairs    = ixgbe_ndo_return_queue_pairs,
> +       .ndo_get_device_desc_info  = ixgbe_ndo_get_device_desc_info,
> +       .ndo_direct_qpair_page_map = ixgbe_ndo_qpair_page_map,
> +       .ndo_get_dma_region_info   = ixgbe_get_dma_region_info,
> +       .ndo_get_device_qpair_map_region_info = ixgbe_ndo_qpair_map_region,
> +       .ndo_validate_dma_mem_region_map = ixgbe_ndo_val_dma_mem_region_map,
>  };
>
>  /**
> @@ -8203,7 +8603,9 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
>         hw->back = adapter;
>         adapter->msg_enable = netif_msg_init(debug, DEFAULT_MSG_ENABLE);
>
> -       hw->hw_addr = ioremap(pci_resource_start(pdev, 0),
> +       hw->pci_hw_addr = pci_resource_start(pdev, 0);
> +
> +       hw->hw_addr = ioremap(hw->pci_hw_addr,
>                               pci_resource_len(pdev, 0));
>         adapter->io_addr = hw->hw_addr;
>         if (!hw->hw_addr) {
> @@ -8875,6 +9277,7 @@ module_init(ixgbe_init_module);
>   **/
>  static void __exit ixgbe_exit_module(void)
>  {
> +       kfree(dummy_page_buf);
>  #ifdef CONFIG_IXGBE_DCA
>         dca_unregister_notify(&dca_notifier);
>  #endif
> diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> index d101b25..4034d31 100644
> --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_type.h
> @@ -3180,6 +3180,7 @@ struct ixgbe_mbx_info {
>
>  struct ixgbe_hw {
>         u8 __iomem                      *hw_addr;
> +       phys_addr_t                     pci_hw_addr;
>         void                            *back;
>         struct ixgbe_mac_info           mac;
>         struct ixgbe_addr_filter_info   addr_ctrl;
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13 18:52 ` Willem de Bruijn
@ 2015-01-14 15:26   ` Zhou, Danny
  0 siblings, 0 replies; 24+ messages in thread
From: Zhou, Danny @ 2015-01-14 15:26 UTC (permalink / raw)
  To: Willem de Bruijn, John Fastabend
  Cc: Network Development, Neil Horman, Daniel Borkmann, Ronciak, John,
	Hannes Frederic Sowa, brouer@redhat.com



> -----Original Message-----
> From: Willem de Bruijn [mailto:willemb@google.com]
> Sent: Wednesday, January 14, 2015 2:53 AM
> To: John Fastabend
> Cc: Network Development; Zhou, Danny; Neil Horman; Daniel Borkmann; Ronciak, John; Hannes Frederic Sowa;
> brouer@redhat.com
> Subject: Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
> 
> On Mon, Jan 12, 2015 at 11:35 PM, John Fastabend
> <john.fastabend@gmail.com> wrote:
> > This patch adds net_device ops to split off a set of driver queues
> > from the driver and map the queues into user space via mmap. This
> > allows the queues to be directly manipulated from user space. For
> > raw packet interface this removes any overhead from the kernel network
> > stack.
> 
> Can you elaborate how packet payload mapping is handled?
> Processes are still responsible for translating from user virtual to
> physical (and bus) addresses, correct? The IOMMU is only there
> to restrict the physical address ranges that may be written.
> 

User space processes have to use the IOVA returned from af_packet to fill
the NIC's Rx (as well as Tx) descriptors. When a DMA request is triggered to
transfer an incoming packet from the NIC to host memory, the device ID field
(the PCIe device's bus/device/function) in the DMA request is used by the
IOMMU to find the address translation structure for that domain/device. The
IOMMU then uses the IOVA field in the DMA request as the key to look up the
per-device address translation structure and obtain the corresponding
physical address, i.e. where the packet should be transferred to.

If an invalid IOVA (e.g. an arbitrary address or a raw physical address) is
filled into the NIC's descriptors, the IOMMU prevents the DMA from happening
because the lookup above fails.
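
To make that concrete, here is a minimal user-space sketch of posting a
receive buffer. The descriptor struct and field layout below are purely
illustrative; a real application derives the offsets/widths from
PACKET_DEV_DESC_INFO for the actual device:

/* Illustrative legacy-style rx descriptor: a 64-bit buffer address at
 * offset 0 (the real layout comes from PACKET_DEV_DESC_INFO).
 */
struct rx_desc {
        volatile uint64_t buf_addr;     /* must be an IOVA, never a VA */
        volatile uint64_t status;
};

static void post_rx_buffer(struct rx_desc *ring, unsigned int idx,
                           uint64_t region_iova, unsigned long buf_off)
{
        /* region_iova is what getsockopt(PACKET_DMA_MEM_REGION_MAP)
         * returned for the registered region; user space only adds the
         * buffer's offset within that region. Writing a raw virtual or
         * physical address here would fail the IOMMU lookup and the DMA
         * would be blocked.
         */
        ring[idx].buf_addr = region_iova + buf_off;
        ring[idx].status = 0;
}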

> >
> > With these operations we bypass the network stack and packet_type
> > handlers that would typically send traffic to an af_packet socket.
> > This means hardware must do the forwarding. To do this ew can use
> > the ETHTOOL_SRXCLSRLINS ops in the ethtool command set. It is
> > currently supported by multiple drivers including sfc, mlx4, niu,
> > ixgbe, and i40e. Supporting some way to steer traffic to a queue
> > is the _only_ hardware requirement to support this interface.
> >
> > A follow on patch adds support for ixgbe but we expect at least
> > the subset of drivers implementing ETHTOOL_SRXCLSRLINS can be
> > implemented later.
> >
> > The high level flow, leveraging the af_packet control path, looks
> > like:
> >
> >         bind(fd, &sockaddr, sizeof(sockaddr));
> >
> >         /* Get the device type and info */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_DESC_INFO, &def_info,
> >                    &optlen);
> >
> >         /* With device info we can look up descriptor format */
> >
> >         /* Get the layout of ring space offset, page_sz, cnt */
> >         getsockopt(fd, SOL_PACKET, PACKET_DEV_QPAIR_MAP_REGION_INFO,
> >                    &info, &optlen);
> >
> >         /* request some queues from the driver */
> >         setsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> >
> >         /* if we let the driver pick us queues learn which queues
> >          * we were given
> >          */
> >         getsockopt(fd, SOL_PACKET, PACKET_RXTX_QPAIRS_SPLIT,
> >                    &qpairs_info, sizeof(qpairs_info));
> >
> >         /* And mmap queue pairs to user space */
> >         mmap(NULL, info.tp_dev_bar_sz, PROT_READ | PROT_WRITE,
> >              MAP_SHARED, fd, 0);
> >
> >         /* Now we have some user space queues to read/write to*/
> >
> > There is one critical difference when running with these interfaces
> > vs running without them. In the normal case the af_packet module
> > uses a standard descriptor format exported by the af_packet user
> > space headers. In this model because we are working directly with
> > driver queues the descriptor format maps to the descriptor format
> > used by the device. User space applications can learn device
> > information from the socket option PACKET_DEV_DESC_INFO. These
> > are described by giving the vendor/deviceid and a descriptor layout
> > in offset/length/width/alignment/byte_ordering.
> 
> Raising the issue of exposed vs. virtualized interface just once
> more. I wonder if it is possible to keep the virtual to physical
> translation in the kernel while avoiding syscall latency, by doing
> the translation in a kernel thread on a coupled hyperthread that
> waits with mwait on the virtual queue producer index. The page
> table operations that Neil proposed in v1 of this patch may work
> even better.
> 

This is a one-shot request during initialization, so it should be OK from a
latency perspective. The NIC requires a physically contiguous host memory
region to be used as the rx/tx packet buffer, so the physical address is
provided so that af_packet or the NIC driver can perform this check.
Otherwise it is hard to verify contiguity given only the virtual address and
size of the memory region.
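
For completeness, the registration itself is just the setsockopt()/getsockopt()
pair from patch 1/2. A rough sketch (error handling omitted; the hugepage
mmap() is only one example of obtaining a physically contiguous region, and
lookup_phys_addr() is a hypothetical application helper, e.g. built on
/proc/self/pagemap):

struct tpacket_dma_mem_region region;
socklen_t optlen = sizeof(region);
size_t sz = 2 * 1024 * 1024;

/* fd is the bound AF_PACKET socket that owns the split-off queue pairs */
memset(&region, 0, sizeof(region));
region.addr = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
region.size = sz;
region.phys_addr = lookup_phys_addr(region.addr); /* app-provided helper */
region.direction = 0;  /* DMA_BIDIRECTIONAL in enum dma_data_direction */

/* kernel pins the pages, dma_map_sg()s them and lets the driver
 * validate the mapping via ndo_validate_dma_mem_region_map
 */
setsockopt(fd, SOL_PACKET, PACKET_DMA_MEM_REGION_MAP,
           &region, sizeof(region));

/* read back the IOVA to put into descriptors */
getsockopt(fd, SOL_PACKET, PACKET_DMA_MEM_REGION_MAP,
           &region, &optlen);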

> > To protect against arbitrary DMA writes IOMMU devices put memory
> > in a single domain to stop arbitrary DMA to memory. Note it would
> > be possible to dma into another sockets pages because most NIC
> > devices only support a single domain. This would require being
> > able to guess another sockets page layout. However the socket
> > operation does require CAP_NET_ADMIN privileges.
> >
> > Additionally we have a set of DPDK patches to enable DPDK with this
> > interface. DPDK can be downloaded @ dpdk.org although as I hope is
> > clear from above DPDK is just our paticular test environment we
> > expect other libraries could be built on this interface.
> >
> > Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> > ---
> >  include/linux/netdevice.h      |   79 ++++++++
> >  include/uapi/linux/if_packet.h |   88 +++++++++
> >  net/packet/af_packet.c         |  397 ++++++++++++++++++++++++++++++++++++++++
> >  net/packet/internal.h          |   10 +
> >  4 files changed, 573 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> > index 679e6e9..b71c97d 100644
> > --- a/include/linux/netdevice.h
> > +++ b/include/linux/netdevice.h
> > @@ -52,6 +52,8 @@
> >  #include <linux/neighbour.h>
> >  #include <uapi/linux/netdevice.h>
> >
> > +#include <linux/if_packet.h>
> > +
> >  struct netpoll_info;
> >  struct device;
> >  struct phy_device;
> > @@ -1030,6 +1032,54 @@ typedef u16 (*select_queue_fallback_t)(struct net_device *dev,
> >   * int (*ndo_switch_port_stp_update)(struct net_device *dev, u8 state);
> >   *     Called to notify switch device port of bridge port STP
> >   *     state change.
> > + *
> > + * int (*ndo_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int qpairs_start_from,
> > + *                              unsigned int qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to request a set of queues from the driver to be handed to the
> > + *     callee for management. After this returns the driver will not use the
> > + *     queues.
> > + *
> > + * int (*ndo_get_split_queue_pairs) (struct net_device *dev,
> > + *                              unsigned int *qpairs_start_from,
> > + *                              unsigned int *qpairs_num,
> > + *                              struct sock *sk)
> > + *     Called to get the location of queues that have been split for user
> > + *     space to use. The socket must have previously requested the queues via
> > + *     ndo_split_queue_pairs successfully.
> > + *
> > + * int (*ndo_return_queue_pairs) (struct net_device *dev,
> > + *                               struct sock *sk)
> > + *     Called to return a set of queues identified by sock to the driver. The
> > + *     socket must have previously requested the queues via
> > + *     ndo_split_queue_pairs for this action to be performed.
> > + *
> > + * int (*ndo_get_device_qpair_map_region_info) (struct net_device *dev,
> > + *                             struct tpacket_dev_qpair_map_region_info *info)
> > + *     Called to return mapping of queue memory region.
> > + *
> > + * int (*ndo_get_device_desc_info) (struct net_device *dev,
> > + *                                 struct tpacket_dev_info *dev_info)
> > + *     Called to get device specific information. This should uniquely identify
> > + *     the hardware so that descriptor formats can be learned by the stack/user
> > + *     space.
> > + *
> > + * int (*ndo_direct_qpair_page_map) (struct vm_area_struct *vma,
> > + *                                  struct net_device *dev)
> > + *     Called to map queue pair range from split_queue_pairs into mmap region.
> > + *
> > + * int (*ndo_direct_validate_dma_mem_region_map)
> > + *                                     (struct net_device *dev,
> > + *                                      struct tpacket_dma_mem_region *region,
> > + *                                      struct sock *sk)
> > + *     Called to validate DMA address remaping for userspace memory region
> > + *
> > + * int (*ndo_get_dma_region_info)
> > + *                              (struct net_device *dev,
> > + *                               struct tpacket_dma_mem_region *region,
> > + *                               struct sock *sk)
> > + *     Called to get dma region' information such as iova.
> >   */
> >  struct net_device_ops {
> >         int                     (*ndo_init)(struct net_device *dev);
> > @@ -1190,6 +1240,35 @@ struct net_device_ops {
> >         int                     (*ndo_switch_port_stp_update)(struct net_device *dev,
> >                                                               u8 state);
> >  #endif
> > +       int                     (*ndo_split_queue_pairs)(struct net_device *dev,
> > +                                        unsigned int qpairs_start_from,
> > +                                        unsigned int qpairs_num,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_split_queue_pairs)
> > +                                       (struct net_device *dev,
> > +                                        unsigned int *qpairs_start_from,
> > +                                        unsigned int *qpairs_num,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_return_queue_pairs)
> > +                                       (struct net_device *dev,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_device_qpair_map_region_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dev_qpair_map_region_info *info);
> > +       int                     (*ndo_get_device_desc_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dev_info *dev_info);
> > +       int                     (*ndo_direct_qpair_page_map)
> > +                                       (struct vm_area_struct *vma,
> > +                                        struct net_device *dev);
> > +       int                     (*ndo_validate_dma_mem_region_map)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dma_mem_region *region,
> > +                                        struct sock *sk);
> > +       int                     (*ndo_get_dma_region_info)
> > +                                       (struct net_device *dev,
> > +                                        struct tpacket_dma_mem_region *region,
> > +                                        struct sock *sk);
> >  };
> >
> >  /**
> > diff --git a/include/uapi/linux/if_packet.h b/include/uapi/linux/if_packet.h
> > index da2d668..eb7a727 100644
> > --- a/include/uapi/linux/if_packet.h
> > +++ b/include/uapi/linux/if_packet.h
> > @@ -54,6 +54,13 @@ struct sockaddr_ll {
> >  #define PACKET_FANOUT                  18
> >  #define PACKET_TX_HAS_OFF              19
> >  #define PACKET_QDISC_BYPASS            20
> > +#define PACKET_RXTX_QPAIRS_SPLIT       21
> > +#define PACKET_RXTX_QPAIRS_RETURN      22
> > +#define PACKET_DEV_QPAIR_MAP_REGION_INFO       23
> > +#define PACKET_DEV_DESC_INFO           24
> > +#define PACKET_DMA_MEM_REGION_MAP       25
> > +#define PACKET_DMA_MEM_REGION_RELEASE   26
> > +
> >
> >  #define PACKET_FANOUT_HASH             0
> >  #define PACKET_FANOUT_LB               1
> > @@ -64,6 +71,87 @@ struct sockaddr_ll {
> >  #define PACKET_FANOUT_FLAG_ROLLOVER    0x1000
> >  #define PACKET_FANOUT_FLAG_DEFRAG      0x8000
> >
> > +#define PACKET_MAX_NUM_MAP_MEMORY_REGIONS 64
> > +#define PACKET_MAX_NUM_DESC_FORMATS      8
> > +#define PACKET_MAX_NUM_DESC_FIELDS       64
> > +#define PACKET_NIC_DESC_FIELD(fseq, foffset, fwidth, falign, fbo) \
> > +               .seqn = (__u8)fseq,                             \
> > +               .offset = (__u8)foffset,                        \
> > +               .width = (__u8)fwidth,                          \
> > +               .align = (__u8)falign,                          \
> > +               .byte_order = (__u8)fbo
> > +
> > +#define MAX_MAP_MEMORY_REGIONS 64
> > +
> > +/* setsockopt takes addr, size, direction parameters; getsockopt takes
> > + * iova, size, direction.
> > + */
> > +struct tpacket_dma_mem_region {
> > +       void *addr;             /* userspace virtual address */
> > +       __u64 phys_addr;        /* physical address */
> > +       __u64 iova;             /* IO virtual address used for DMA */
> > +       unsigned long size;     /* size of region */
> > +       int direction;          /* dma data direction */
> > +};
> > +
> > +struct tpacket_dev_qpair_map_region_info {
> > +       unsigned int tp_dev_bar_sz;             /* size of BAR */
> > +       unsigned int tp_dev_sysm_sz;            /* size of system memory */
> > +       /* number of contiguous memory regions on the BAR mapped to user space */
> > +       unsigned int tp_num_map_regions;
> > +       /* number of contiguous memory regions in system memory mapped to user space */
> > +       unsigned int tp_num_sysm_map_regions;
> > +       struct map_page_region {
> > +               unsigned page_offset;   /* offset to start of region */
> > +               unsigned page_sz;       /* size of page */
> > +               unsigned page_cnt;      /* number of pages */
> > +       } tp_regions[MAX_MAP_MEMORY_REGIONS];
> > +};
> > +
> > +struct tpacket_dev_qpairs_info {
> > +       unsigned int tp_qpairs_start_from;      /* qpairs index to start from */
> > +       unsigned int tp_qpairs_num;             /* number of qpairs */
> > +};
> > +
> > +enum tpack_desc_byte_order {
> > +       BO_NATIVE = 0,
> > +       BO_NETWORK,
> > +       BO_BIG_ENDIAN,
> > +       BO_LITTLE_ENDIAN,
> > +};
> > +
> > +struct tpacket_nic_desc_fld {
> > +       __u8 seqn;      /* Sequence index of the descriptor field */
> > +       __u8 offset;    /* Offset to start */
> > +       __u8 width;     /* Width of field */
> > +       __u8 align;     /* Alignment in bits */
> > +       enum tpack_desc_byte_order byte_order;  /* Endian flag */
> > +};
> > +
> > +struct tpacket_nic_desc_expr {
> > +       __u8 version;           /* Version number */
> > +       __u8 size;              /* Descriptor size in bytes */
> > +       enum tpack_desc_byte_order byte_order;          /* Endian flag */
> > +       __u8 num_of_fld;        /* Number of valid fields */
> > +       /* List of each descriptor field */
> > +       struct tpacket_nic_desc_fld fields[PACKET_MAX_NUM_DESC_FIELDS];
> > +};
> > +
> > +struct tpacket_dev_info {
> > +       __u16   tp_device_id;
> > +       __u16   tp_vendor_id;
> > +       __u16   tp_subsystem_device_id;
> > +       __u16   tp_subsystem_vendor_id;
> > +       __u32   tp_numa_node;
> > +       __u32   tp_revision_id;
> > +       __u32   tp_num_total_qpairs;
> > +       __u32   tp_num_inuse_qpairs;
> > +       __u32   tp_num_rx_desc_fmt;
> > +       __u32   tp_num_tx_desc_fmt;
> > +       struct tpacket_nic_desc_expr tp_rx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> > +       struct tpacket_nic_desc_expr tp_tx_dexpr[PACKET_MAX_NUM_DESC_FORMATS];
> > +};
> > +
> >  struct tpacket_stats {
> >         unsigned int    tp_packets;
> >         unsigned int    tp_drops;
> > diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
> > index 6880f34..8cd17da 100644
> > --- a/net/packet/af_packet.c
> > +++ b/net/packet/af_packet.c
> > @@ -214,6 +214,9 @@ static void prb_clear_rxhash(struct tpacket_kbdq_core *,
> >  static void prb_fill_vlan_info(struct tpacket_kbdq_core *,
> >                 struct tpacket3_hdr *);
> >  static void packet_flush_mclist(struct sock *sk);
> > +static int umem_release(struct net_device *dev, struct packet_sock *po);
> > +static int get_umem_pages(struct tpacket_dma_mem_region *region,
> > +                         struct packet_umem_region *umem);
> >
> >  struct packet_skb_cb {
> >         unsigned int origlen;
> > @@ -2633,6 +2636,16 @@ static int packet_release(struct socket *sock)
> >         sock_prot_inuse_add(net, sk->sk_prot, -1);
> >         preempt_enable();
> >
> > +       if (po->tp_owns_queue_pairs) {
> > +               struct net_device *dev;
> > +
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (dev) {
> > +                       dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> > +                       umem_release(dev, po);
> > +               }
> > +       }
> > +
> >         spin_lock(&po->bind_lock);
> >         unregister_prot_hook(sk, false);
> >         packet_cached_dev_reset(po);
> > @@ -2829,6 +2842,8 @@ static int packet_create(struct net *net, struct socket *sock, int protocol,
> >         po->num = proto;
> >         po->xmit = dev_queue_xmit;
> >
> > +       INIT_LIST_HEAD(&po->umem_list);
> > +
> >         err = packet_alloc_pending(po);
> >         if (err)
> >                 goto out2;
> > @@ -3226,6 +3241,88 @@ static void packet_flush_mclist(struct sock *sk)
> >  }
> >
> >  static int
> > +get_umem_pages(struct tpacket_dma_mem_region *region,
> > +              struct packet_umem_region *umem)
> > +{
> > +       struct page **page_list;
> > +       unsigned long npages;
> > +       unsigned long offset;
> > +       unsigned long base;
> > +       unsigned long i;
> > +       int ret;
> > +       dma_addr_t phys_base;
> > +
> > +       phys_base = (region->phys_addr) & PAGE_MASK;
> > +       base = ((unsigned long)region->addr) & PAGE_MASK;
> > +       offset = ((unsigned long)region->addr) & (~PAGE_MASK);
> > +       npages = PAGE_ALIGN(region->size + offset) >> PAGE_SHIFT;
> > +
> > +       npages = min_t(unsigned long, npages, umem->nents);
> > +       sg_init_table(umem->sglist, npages);
> > +
> > +       umem->nmap = 0;
> > +       page_list = (struct page **)__get_free_page(GFP_KERNEL);
> > +       if (!page_list)
> > +               return -ENOMEM;
> > +
> > +       while (npages) {
> > +               unsigned long min = min_t(unsigned long, npages,
> > +                                         PAGE_SIZE / sizeof(struct page *));
> > +
> > +               ret = get_user_pages(current, current->mm, base, min,
> > +                                    1, 0, page_list, NULL);
> > +               if (ret < 0)
> > +                       break;
> > +
> > +               base += ret * PAGE_SIZE;
> > +               npages -= ret;
> > +
> > +               /* validate that the memory region is physically contiguous */
> > +               for (i = 0; i < ret; i++) {
> > +                       unsigned int page_index =
> > +                               (page_to_phys(page_list[i]) - phys_base) /
> > +                               PAGE_SIZE;
> > +
> > +                       if (page_index != umem->nmap + i) {
> > +                               int j;
> > +
> > +                               for (j = 0; j < (umem->nmap + i); j++)
> > +                                       put_page(sg_page(&umem->sglist[j]));
> > +
> > +                               free_page((unsigned long)page_list);
> > +                               return -EFAULT;
> > +                       }
> > +
> > +                       sg_set_page(&umem->sglist[umem->nmap + i],
> > +                                   page_list[i], PAGE_SIZE, 0);
> > +               }
> > +
> > +               umem->nmap += ret;
> > +       }
> > +
> > +       free_page((unsigned long)page_list);
> > +       return 0;
> > +}
> > +
> > +static int
> > +umem_release(struct net_device *dev, struct packet_sock *po)
> > +{
> > +       struct packet_umem_region *umem, *tmp;
> > +       int i;
> > +
> > +       list_for_each_entry_safe(umem, tmp, &po->umem_list, list) {
> > +               dma_unmap_sg(dev->dev.parent, umem->sglist,
> > +                            umem->nmap, umem->direction);
> > +               for (i = 0; i < umem->nmap; i++)
> > +                       put_page(sg_page(&umem->sglist[i]));
> > +
> > +               vfree(umem);
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static int
> >  packet_setsockopt(struct socket *sock, int level, int optname, char __user *optval, unsigned int optlen)
> >  {
> >         struct sock *sk = sock->sk;
> > @@ -3428,6 +3525,167 @@ packet_setsockopt(struct socket *sock, int level, int optname, char __user *optv
> >                 po->xmit = val ? packet_direct_xmit : dev_queue_xmit;
> >                 return 0;
> >         }
> > +       case PACKET_RXTX_QPAIRS_SPLIT:
> > +       {
> > +               struct tpacket_dev_qpairs_info qpairs;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (optlen != sizeof(qpairs))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs, optval, sizeof(qpairs)))
> > +                       return -EFAULT;
> > +
> > +               /* Only allow one set of queues to be owned by userspace */
> > +               if (po->tp_owns_queue_pairs)
> > +                       return -EBUSY;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_split_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  ops->ndo_split_queue_pairs(dev,
> > +                                                 qpairs.tp_qpairs_start_from,
> > +                                                 qpairs.tp_qpairs_num, sk);
> > +               if (!err)
> > +                       po->tp_owns_queue_pairs = true;
> > +
> > +               return err;
> > +       }
> > +       case PACKET_RXTX_QPAIRS_RETURN:
> > +       {
> > +               struct tpacket_dev_qpairs_info qpairs_info;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (optlen != sizeof(qpairs_info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
> > +                       return -EFAULT;
> > +
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_split_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_return_queue_pairs(dev, sk);
> > +               if (!err)
> > +                       po->tp_owns_queue_pairs = false;
> > +
> > +               return err;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_MAP:
> > +       {
> > +               struct tpacket_dma_mem_region region;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               struct packet_umem_region *umem;
> > +               unsigned long npages;
> > +               unsigned long offset;
> > +               unsigned long i;
> > +               int err;
> > +
> > +               if (optlen != sizeof(region))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&region, optval, sizeof(region)))
> > +                       return -EFAULT;
> > +               if ((region.direction != DMA_BIDIRECTIONAL) &&
> > +                   (region.direction != DMA_TO_DEVICE) &&
> > +                   (region.direction != DMA_FROM_DEVICE))
> > +                       return -EFAULT;
> > +
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               offset = ((unsigned long)region.addr) & (~PAGE_MASK);
> > +               npages = PAGE_ALIGN(region.size + offset) >> PAGE_SHIFT;
> > +
> > +               umem = vzalloc(sizeof(*umem) +
> > +                              sizeof(struct scatterlist) * npages);
> > +               if (!umem)
> > +                       return -ENOMEM;
> > +
> > +               umem->nents = npages;
> > +               umem->direction = region.direction;
> > +
> > +               down_write(&current->mm->mmap_sem);
> > +               if (get_umem_pages(&region, umem) < 0) {
> > +                       ret = -EFAULT;
> > +                       goto exit;
> > +               }
> > +
> > +               if ((umem->nmap == npages) &&
> > +                   (0 != dma_map_sg(dev->dev.parent, umem->sglist,
> > +                                    umem->nmap, region.direction))) {
> > +                       region.iova = sg_dma_address(umem->sglist) + offset;
> > +
> > +                       ops = dev->netdev_ops;
> > +                       if (!ops->ndo_validate_dma_mem_region_map) {
> > +                               ret = -EOPNOTSUPP;
> > +                               goto unmap;
> > +                       }
> > +
> > +                       /* use driver to validate mapping of dma memory */
> > +                       err = ops->ndo_validate_dma_mem_region_map(dev,
> > +                                                                  &region,
> > +                                                                  sk);
> > +                       if (!err) {
> > +                               list_add_tail(&umem->list, &po->umem_list);
> > +                               ret = 0;
> > +                               goto exit;
> > +                       }
> > +               }
> > +
> > +unmap:
> > +               dma_unmap_sg(dev->dev.parent, umem->sglist,
> > +                            umem->nmap, umem->direction);
> > +               for (i = 0; i < umem->nmap; i++)
> > +                       put_page(sg_page(&umem->sglist[i]));
> > +
> > +               vfree(umem);
> > +exit:
> > +               up_write(&current->mm->mmap_sem);
> > +
> > +               return ret;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_RELEASE:
> > +       {
> > +               struct net_device *dev;
> > +
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               down_write(&current->mm->mmap_sem);
> > +               ret = umem_release(dev, po);
> > +               up_write(&current->mm->mmap_sem);
> > +
> > +               return ret;
> > +       }
> > +
> >         default:
> >                 return -ENOPROTOOPT;
> >         }
> > @@ -3523,6 +3781,129 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
> >         case PACKET_QDISC_BYPASS:
> >                 val = packet_use_direct_xmit(po);
> >                 break;
> > +       case PACKET_RXTX_QPAIRS_SPLIT:
> > +       {
> > +               struct net_device *dev;
> > +               struct tpacket_dev_qpairs_info qpairs_info;
> > +               int err;
> > +
> > +               if (len != sizeof(qpairs_info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&qpairs_info, optval, sizeof(qpairs_info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only works after a successful queue pairs split-off
> > +                * operation via setsockopt()
> > +                */
> > +               if (!po->tp_owns_queue_pairs)
> > +                       return -EINVAL;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               if (!dev->netdev_ops->ndo_get_split_queue_pairs)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_split_queue_pairs(dev,
> > +                                       &qpairs_info.tp_qpairs_start_from,
> > +                                       &qpairs_info.tp_qpairs_num, sk);
> > +
> > +               lv = sizeof(qpairs_info);
> > +               data = &qpairs_info;
> > +               break;
> > +       }
> > +       case PACKET_DEV_QPAIR_MAP_REGION_INFO:
> > +       {
> > +               struct tpacket_dev_qpair_map_region_info info;
> > +               const struct net_device_ops *ops;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               ops = dev->netdev_ops;
> > +               if (!ops->ndo_get_device_qpair_map_region_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err = ops->ndo_get_device_qpair_map_region_info(dev, &info);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dev_qpair_map_region_info);
> > +               data = &info;
> > +               break;
> > +       }
> > +       case PACKET_DEV_DESC_INFO:
> > +       {
> > +               struct net_device *dev;
> > +               struct tpacket_dev_info info;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +               if (!dev->netdev_ops->ndo_get_device_desc_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_device_desc_info(dev, &info);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dev_info);
> > +               data = &info;
> > +               break;
> > +       }
> > +       case PACKET_DMA_MEM_REGION_MAP:
> > +       {
> > +               struct tpacket_dma_mem_region info;
> > +               struct net_device *dev;
> > +               int err;
> > +
> > +               if (len != sizeof(info))
> > +                       return -EINVAL;
> > +               if (copy_from_user(&info, optval, sizeof(info)))
> > +                       return -EFAULT;
> > +
> > +               /* This call only works after a bind call which calls a dev_hold
> > +                * operation so we do not need to increment dev ref counter
> > +                */
> > +               dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +               if (!dev)
> > +                       return -EINVAL;
> > +
> > +               if (!dev->netdev_ops->ndo_get_dma_region_info)
> > +                       return -EOPNOTSUPP;
> > +
> > +               err =  dev->netdev_ops->ndo_get_dma_region_info(dev, &info, sk);
> > +               if (err)
> > +                       return err;
> > +
> > +               lv = sizeof(struct tpacket_dma_mem_region);
> > +               data = &info;
> > +               break;
> > +       }
> > +
> >         default:
> >                 return -ENOPROTOOPT;
> >         }
> > @@ -3536,7 +3917,6 @@ static int packet_getsockopt(struct socket *sock, int level, int optname,
> >         return 0;
> >  }
> >
> > -
> >  static int packet_notifier(struct notifier_block *this,
> >                            unsigned long msg, void *ptr)
> >  {
> > @@ -3920,6 +4300,8 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >         struct packet_sock *po = pkt_sk(sk);
> >         unsigned long size, expected_size;
> >         struct packet_ring_buffer *rb;
> > +       const struct net_device_ops *ops;
> > +       struct net_device *dev;
> >         unsigned long start;
> >         int err = -EINVAL;
> >         int i;
> > @@ -3927,8 +4309,20 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >         if (vma->vm_pgoff)
> >                 return -EINVAL;
> >
> > +       dev = __dev_get_by_index(sock_net(sk), po->ifindex);
> > +       if (!dev)
> > +               return -EINVAL;
> > +
> >         mutex_lock(&po->pg_vec_lock);
> >
> > +       if (po->tp_owns_queue_pairs) {
> > +               ops = dev->netdev_ops;
> > +               err = ops->ndo_direct_qpair_page_map(vma, dev);
> > +               if (err)
> > +                       goto out;
> > +               goto done;
> > +       }
> > +
> >         expected_size = 0;
> >         for (rb = &po->rx_ring; rb <= &po->tx_ring; rb++) {
> >                 if (rb->pg_vec) {
> > @@ -3966,6 +4360,7 @@ static int packet_mmap(struct file *file, struct socket *sock,
> >                 }
> >         }
> >
> > +done:
> >         atomic_inc(&po->mapped);
> >         vma->vm_ops = &packet_mmap_ops;
> >         err = 0;
> > diff --git a/net/packet/internal.h b/net/packet/internal.h
> > index cdddf6a..55d2fce 100644
> > --- a/net/packet/internal.h
> > +++ b/net/packet/internal.h
> > @@ -90,6 +90,14 @@ struct packet_fanout {
> >         struct packet_type      prot_hook ____cacheline_aligned_in_smp;
> >  };
> >
> > +struct packet_umem_region {
> > +       struct list_head        list;
> > +       int                     nents;
> > +       int                     nmap;
> > +       int                     direction;
> > +       struct scatterlist      sglist[0];
> > +};
> > +
> >  struct packet_sock {
> >         /* struct sock has to be the first member of packet_sock */
> >         struct sock             sk;
> > @@ -97,6 +105,7 @@ struct packet_sock {
> >         union  tpacket_stats_u  stats;
> >         struct packet_ring_buffer       rx_ring;
> >         struct packet_ring_buffer       tx_ring;
> > +       struct list_head        umem_list;
> >         int                     copy_thresh;
> >         spinlock_t              bind_lock;
> >         struct mutex            pg_vec_lock;
> > @@ -113,6 +122,7 @@ struct packet_sock {
> >         unsigned int            tp_reserve;
> >         unsigned int            tp_loss:1;
> >         unsigned int            tp_tx_has_off:1;
> > +       unsigned int            tp_owns_queue_pairs:1;
> >         unsigned int            tp_tstamp;
> >         struct net_device __rcu *cached_dev;
> >         int                     (*xmit)(struct sk_buff *skb);
> >

^ permalink raw reply	[flat|nested] 24+ messages in thread

* RE: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13 17:27         ` David Miller
@ 2015-01-14 15:28           ` Zhou, Danny
  0 siblings, 0 replies; 24+ messages in thread
From: Zhou, Danny @ 2015-01-14 15:28 UTC (permalink / raw)
  To: David Miller, David.Laight@ACULAB.COM
  Cc: john.fastabend@gmail.com, dborkman@redhat.com,
	hannes@stressinduktion.org, netdev@vger.kernel.org,
	nhorman@tuxdriver.com, Ronciak, John, brouer@redhat.com



> -----Original Message-----
> From: David Miller [mailto:davem@davemloft.net]
> Sent: Wednesday, January 14, 2015 1:28 AM
> To: David.Laight@ACULAB.COM
> Cc: john.fastabend@gmail.com; dborkman@redhat.com; hannes@stressinduktion.org; netdev@vger.kernel.org; Zhou, Danny;
> nhorman@tuxdriver.com; Ronciak, John; brouer@redhat.com
> Subject: Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
> 
> From: David Laight <David.Laight@ACULAB.COM>
> Date: Tue, 13 Jan 2015 17:15:30 +0000
> 
> > How about something like:
> >
> > struct tpacket_dma_mem_region {
> >     __u64 addr;        /* userspace virtual address */
> >     __u64 phys_addr;    /* physical address */
> >     __u64 iova;        /* IO virtual address used for DMA */
> >     __u64 size;    /* size of region */
> >     int direction;        /* dma data direction */
> > } aligned(8);
> >
> > So that it is independant of 32/64 bits.
> > It is a shame that gcc has no way of defining a 64bit 'void *' on 32bit systems.
> > You can use a union, but you still need to zero extend the value on LE (worse on BE).
> 
> We have an __aligned_u64, please use that.

Thanks, will do.
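
For reference, a v3 layout along those lines might look like the sketch
below (the trailing pad field and the widened direction type are assumptions,
added only to keep the structure the same size on 32-bit and 64-bit builds):

	#include <linux/types.h>

	/* sketch only: same fields as proposed above, but every 64-bit
	 * member uses __aligned_u64 so 32-bit and 64-bit user space agree
	 * on the layout; direction becomes a fixed-width __u32 plus pad.
	 */
	struct tpacket_dma_mem_region {
		__aligned_u64	addr;		/* userspace virtual address */
		__aligned_u64	phys_addr;	/* physical address */
		__aligned_u64	iova;		/* IO virtual address used for DMA */
		__aligned_u64	size;		/* size of region in bytes */
		__u32		direction;	/* dma data direction */
		__u32		pad;		/* keep an 8-byte size multiple */
	};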

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-13  4:35 [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
                   ` (5 preceding siblings ...)
  2015-01-13 18:52 ` Willem de Bruijn
@ 2015-01-14 20:35 ` David Miller
  2015-01-17 17:35   ` John Fastabend
                     ` (2 more replies)
  6 siblings, 3 replies; 24+ messages in thread
From: David Miller @ 2015-01-14 20:35 UTC (permalink / raw)
  To: john.fastabend
  Cc: netdev, danny.zhou, nhorman, dborkman, john.ronciak, hannes,
	brouer

From: John Fastabend <john.fastabend@gmail.com>
Date: Mon, 12 Jan 2015 20:35:11 -0800

> +		if ((region.direction != DMA_BIDIRECTIONAL) &&
> +		    (region.direction != DMA_TO_DEVICE) &&
> +		    (region.direction != DMA_FROM_DEVICE))
> +			return -EFAULT;
 ...
> +		if ((umem->nmap == npages) &&
> +		    (0 != dma_map_sg(dev->dev.parent, umem->sglist,
> +				     umem->nmap, region.direction))) {
> +			region.iova = sg_dma_address(umem->sglist) + offset;

I am having trouble seeing how this can work.

dma_map_{single,sg}() mappings need synchronization after a DMA
transfer takes place.

For example if the DMA occurs to the device, then that region can
be cached in the PCI controller's internal caches and thus future
cpu writes into that memory region will not be seen, until a
dma_sync_*() is invoked.

That isn't going to happen when the device transmit queue is
being completely managed in userspace.

And this takes us back to the issue of protection, I don't think
it is addressed properly yet.

CAP_NET_ADMIN privileges do not mean "can crap all over memory"
yet with this feature that can still happen.

If we are dealing with a device which cannot provide strict protection
to only the process's locked local pages, you have to do something
to implement that protection.

And you have _exactly_ one option to do that, abstracting the page
addresses and eating a system call to trigger the sends, so that you
can read from the user's (fake) descriptors and write into the real
descriptors (translating the DMA addresses along the way) and
triggering the TX doorbell.

I am not going to consider seriously an implementation that says "yeah
sometimes the user can crap onto other people's memory", this isn't
MS-DOS, it's a system where proper memory protections are mandatory
rather than optional.
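
The synchronization in question is the dma_sync_*() pair a driver normally
issues around each buffer on systems that are not DMA-coherent; a minimal
sketch of the receive-side pattern (the function and its arguments are
illustrative, not taken from the patch):

	#include <linux/dma-mapping.h>
	#include <linux/scatterlist.h>

	/* What an in-kernel consumer would do for each completed RX buffer:
	 * hand it to the CPU, read it, then hand it back to the device.
	 * Nothing equivalent happens if the ring is polled from user space.
	 */
	static void rx_buffer_complete(struct device *dev,
				       struct scatterlist *sgl, int nents)
	{
		/* make the device's writes visible to the CPU */
		dma_sync_sg_for_cpu(dev, sgl, nents, DMA_FROM_DEVICE);

		/* ... consume the packet data ... */

		/* return ownership to the device before reposting */
		dma_sync_sg_for_device(dev, sgl, nents, DMA_FROM_DEVICE);
	}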

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-14 20:35 ` David Miller
@ 2015-01-17 17:35   ` John Fastabend
  2015-01-18 22:02   ` Neil Horman
  2015-01-19 21:45   ` Neil Horman
  2 siblings, 0 replies; 24+ messages in thread
From: John Fastabend @ 2015-01-17 17:35 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, danny.zhou, nhorman, dborkman, john.ronciak, hannes,
	brouer

On 01/14/2015 12:35 PM, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Mon, 12 Jan 2015 20:35:11 -0800
>
>> +		if ((region.direction != DMA_BIDIRECTIONAL) &&
>> +		    (region.direction != DMA_TO_DEVICE) &&
>> +		    (region.direction != DMA_FROM_DEVICE))
>> +			return -EFAULT;
>   ...
>> +		if ((umem->nmap == npages) &&
>> +		    (0 != dma_map_sg(dev->dev.parent, umem->sglist,
>> +				     umem->nmap, region.direction))) {
>> +			region.iova = sg_dma_address(umem->sglist) + offset;
>
> I am having trouble seeing how this can work.
>
> dma_map_{single,sg}() mappings need synchronization after a DMA
> transfer takes place.
>
> For example if the DMA occurs to the device, then that region can
> be cached in the PCI controller's internal caches and thus future
> cpu writes into that memory region will not be seen, until a
> dma_sync_*() is invoked.
>
> That isn't going to happen when the device transmit queue is
> being completely managed in userspace.
>
> And this takes us back to the issue of protection, I don't think
> it is addressed properly yet.
>
> CAP_NET_ADMIN privileges do not mean "can crap all over memory"
> yet with this feature that can still happen.
>
> If we are dealing with a device which cannot provide strict protection
> to only the process's locked local pages, you have to do something
> to implement that protection.
>
> And you have _exactly_ one option to do that, abstracting the page
> addresses and eating a system call to trigger the sends, so that you
> can read from the user's (fake) descriptors and write into the real
> descriptors (translating the DMA addresses along the way) and
> triggering the TX doorbell.

OK, I think this brings us back to some of the original designs/ideas
we were thinking about with Daniel/Neil. We are going to take a look
at this. At least on the RX side we can have the af_packet logic give
us a set of DMA addresses. I wonder if we can also make the busy
poll logic per queue and use it.

>
> I am not going to consider seriously an implementation that says "yeah
> sometimes the user can crap onto other people's memory", this isn't
> MS-DOS, it's a system where proper memory protections are mandatory
> rather than optional.
>

More to sort out on our side. Thanks for looking at the patches.

.John

-- 
John Fastabend         Intel Corporation

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-14 20:35 ` David Miller
  2015-01-17 17:35   ` John Fastabend
@ 2015-01-18 22:02   ` Neil Horman
  2015-01-19 21:45   ` Neil Horman
  2 siblings, 0 replies; 24+ messages in thread
From: Neil Horman @ 2015-01-18 22:02 UTC (permalink / raw)
  To: David Miller
  Cc: john.fastabend, netdev, danny.zhou, dborkman, john.ronciak,
	hannes, brouer

On Wed, Jan 14, 2015 at 03:35:09PM -0500, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Mon, 12 Jan 2015 20:35:11 -0800
> 
> > +		if ((region.direction != DMA_BIDIRECTIONAL) &&
> > +		    (region.direction != DMA_TO_DEVICE) &&
> > +		    (region.direction != DMA_FROM_DEVICE))
> > +			return -EFAULT;
>  ...
> > +		if ((umem->nmap == npages) &&
> > +		    (0 != dma_map_sg(dev->dev.parent, umem->sglist,
> > +				     umem->nmap, region.direction))) {
> > +			region.iova = sg_dma_address(umem->sglist) + offset;
> 
> I am having trouble seeing how this can work.
> 
> dma_map_{single,sg}() mappings need synchronization after a DMA
> transfer takes place.
> 
> For example if the DMA occurs to the device, then that region can
> be cached in the PCI controller's internal caches and thus future
> cpu writes into that memory region will not be seen, until a
> dma_sync_*() is invoked.
> 
> That isn't going to happen when the device transmit queue is
> being completely managed in userspace.
> 
> And this takes us back to the issue of protection, I don't think
> it is addressed properly yet.
> 
> CAP_NET_ADMIN privileges do not mean "can crap all over memory"
> yet with this feature that can still happen.
> 
> If we are dealing with a device which cannot provide strict protection
> to only the process's locked local pages, you have to do something
> to implement that protection.
> 
> And you have _exactly_ one option to do that, abstracting the page
> addresses and eating a system call to trigger the sends, so that you
> can read from the user's (fake) descriptors and write into the real
> descriptors (translating the DMA addresses along the way) and
> triggering the TX doorbell.
> 
> I am not going to consider seriously an implementation that says "yeah
> sometimes the user can crap onto other people's memory", this isn't
> MS-DOS, it's a system where proper memory protections are mandatory
> rather than optional.
> 
This is probably a stupid question, but can you not dynamically mark the address
range that gets mapped for DMA as uncacheable? I.e. something similar to
ioremap_nocache(), but marking the region as uncacheable within the PCI
controller?  Would doing so not obviate the need for sync operations
(potentially at the cost of some performance, though perhaps not as much as
incurring a system call)?
Neil
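
The uncached mapping described above is roughly what the coherent DMA API
already provides; a minimal sketch, assuming the queue memory were allocated
in the kernel rather than pinned from user pages:

	#include <linux/dma-mapping.h>

	/* Coherent (consistent) memory is usable by CPU and device without
	 * dma_sync_*() calls, at the cost of uncached or write-combined
	 * access on platforms that are not cache-coherent for DMA.
	 */
	static void *alloc_coherent_queue(struct device *dev, size_t size,
					  dma_addr_t *dma_handle)
	{
		return dma_alloc_coherent(dev, size, dma_handle, GFP_KERNEL);
	}

	static void free_coherent_queue(struct device *dev, size_t size,
					void *cpu_addr, dma_addr_t dma_handle)
	{
		dma_free_coherent(dev, size, cpu_addr, dma_handle);
	}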

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space
  2015-01-14 20:35 ` David Miller
  2015-01-17 17:35   ` John Fastabend
  2015-01-18 22:02   ` Neil Horman
@ 2015-01-19 21:45   ` Neil Horman
  2 siblings, 0 replies; 24+ messages in thread
From: Neil Horman @ 2015-01-19 21:45 UTC (permalink / raw)
  To: David Miller
  Cc: john.fastabend, netdev, danny.zhou, dborkman, john.ronciak,
	hannes, brouer

On Wed, Jan 14, 2015 at 03:35:09PM -0500, David Miller wrote:
> From: John Fastabend <john.fastabend@gmail.com>
> Date: Mon, 12 Jan 2015 20:35:11 -0800
> 
> > +		if ((region.direction != DMA_BIDIRECTIONAL) &&
> > +		    (region.direction != DMA_TO_DEVICE) &&
> > +		    (region.direction != DMA_FROM_DEVICE))
> > +			return -EFAULT;
>  ...
> > +		if ((umem->nmap == npages) &&
> > +		    (0 != dma_map_sg(dev->dev.parent, umem->sglist,
> > +				     umem->nmap, region.direction))) {
> > +			region.iova = sg_dma_address(umem->sglist) + offset;
> 
> I am having trouble seeing how this can work.
> 
> dma_map_{single,sg}() mappings need synchronization after a DMA
> transfer takes place.
> 
> For example if the DMA occurs to the device, then that region can
> be cached in the PCI controller's internal caches and thus future
> cpu writes into that memory region will not be seen, until a
> dma_sync_*() is invoked.
> 
> That isn't going to happen when the device transmit queue is
> being completely managed in userspace.
> 
> And this takes us back to the issue of protection, I don't think
> it is addressed properly yet.
> 
> CAP_NET_ADMIN privileges do not mean "can crap all over memory"
> yet with this feature that can still happen.
> 
> If we are dealing with a device which cannot provide strict protection
> to only the process's locked local pages, you have to do something
> to implement that protection.
> 
> And you have _exactly_ one option to do that, abstracting the page
> addresses and eating a system call to trigger the sends, so that you
> can read from the user's (fake) descriptors and write into the real
> descriptors (translating the DMA addresses along the way) and
> triggering the TX doorbell.
> 
> I am not going to consider seriously an implementation that says "yeah
> sometimes the user can crap onto other people's memory", this isn't
> MS-DOS, it's a system where proper memory protections are mandatory
> rather than optional.
> 

Another stupid question: if we can't provide protection from the device to
ensure memory coherency, can we mitigate the problem by creating an IOMMU group
for the device?

I'd mentioned to John the possibility of using the existing dfwd offload
operations to do the queue allocation so that we could reuse that code instead
of having to create a set of new queue allocation routines.  What if, instead of
the dfwd queue allocation methods, we used SR-IOV functionality here?  I.e.,
plumb a virtual function and place it in its own IOMMU group, but instead of
passing it off to a guest, we just let the host use it?  That gives us the
opportunity to tear down the IOMMU mappings should the process exit, so if the
physical pages get re-allocated while DMA is in flight, we can just take the
IOMMU exception and avoid the memory corruption.

It's not perfect, in that we're still not syncing when we should be, but I think
it would be safe at least.

Thoughts?

Neil
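
A rough sketch of the containment suggested above, assuming the device (or
virtual function) sits in its own IOMMU group: allocate a domain, attach the
device, and map only the pages the process has pinned, so any stray DMA
faults instead of corrupting other memory (all names below are illustrative):

	#include <linux/err.h>
	#include <linux/iommu.h>

	/* Restrict a device's DMA to one explicitly mapped window; anything
	 * it touches outside that window raises an IOMMU fault.
	 */
	static struct iommu_domain *confine_device_dma(struct device *dev,
						       unsigned long iova,
						       phys_addr_t paddr,
						       size_t size)
	{
		struct iommu_domain *domain;
		int err;

		domain = iommu_domain_alloc(dev->bus);
		if (!domain)
			return ERR_PTR(-ENOMEM);

		err = iommu_attach_device(domain, dev);
		if (err)
			goto out_free;

		err = iommu_map(domain, iova, paddr, size,
				IOMMU_READ | IOMMU_WRITE);
		if (err)
			goto out_detach;

		return domain;

	out_detach:
		iommu_detach_device(domain, dev);
	out_free:
		iommu_domain_free(domain);
		return ERR_PTR(err);
	}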

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-01-19 21:45 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-13  4:35 [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
2015-01-13  4:35 ` [RFC PATCH v2 2/2] net: ixgbe: implement af_packet direct queue mappings John Fastabend
2015-01-13 12:05   ` Hannes Frederic Sowa
2015-01-13 14:26   ` Daniel Borkmann
2015-01-13 15:46     ` John Fastabend
2015-01-13 18:18       ` Daniel Borkmann
2015-01-13 18:58   ` Willem de Bruijn
2015-01-13  4:42 ` [RFC PATCH v2 1/2] net: af_packet support for direct ring access in user space John Fastabend
2015-01-13 12:35 ` Hannes Frederic Sowa
2015-01-13 13:21   ` Daniel Borkmann
2015-01-13 15:24     ` John Fastabend
2015-01-13 17:15       ` David Laight
2015-01-13 17:27         ` David Miller
2015-01-14 15:28           ` Zhou, Danny
2015-01-13 15:12 ` Daniel Borkmann
2015-01-13 15:58   ` John Fastabend
2015-01-13 16:05     ` Daniel Borkmann
2015-01-13 16:19 ` Neil Horman
2015-01-13 18:52 ` Willem de Bruijn
2015-01-14 15:26   ` Zhou, Danny
2015-01-14 20:35 ` David Miller
2015-01-17 17:35   ` John Fastabend
2015-01-18 22:02   ` Neil Horman
2015-01-19 21:45   ` Neil Horman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).