DPDK-dev Archive on lore.kernel.org

DPDK-dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 27/28] net/vmxnet3: use eal I/O device memory read/write API
From: Santosh Shukla @ 2016-12-15  5:48 UTC (permalink / raw)
  To: Yuanhan Liu
  Cc: Jerin Jacob, dev, konstantin.ananyev, thomas.monjalon,
	bruce.richardson, jianbo.liu, viktorin, Yong Wang
In-Reply-To: <20161214025534.GG18991@yliu-dev.sh.intel.com>

On Wed, Dec 14, 2016 at 10:55:34AM +0800, Yuanhan Liu wrote:
> On Wed, Dec 14, 2016 at 07:25:57AM +0530, Jerin Jacob wrote:
> > From: Santosh Shukla <santosh.shukla@caviumnetworks.com>
> > 
> > Replace the raw I/O device memory read/write access with eal
> > abstraction for I/O device memory read/write access to fix
> > portability issues across different architectures.
> > 
> > Signed-off-by: Santosh Shukla <santosh.shukla@caviumnetworks.com>
> > Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> > CC: Yong Wang <yongwang@vmware.com>
> > ---
> >  drivers/net/vmxnet3/vmxnet3_ethdev.h | 14 ++++++++++----
> >  1 file changed, 10 insertions(+), 4 deletions(-)
> > 
> > diff --git a/drivers/net/vmxnet3/vmxnet3_ethdev.h b/drivers/net/vmxnet3/vmxnet3_ethdev.h
> > index 7d3b11e..5b6501b 100644
> > --- a/drivers/net/vmxnet3/vmxnet3_ethdev.h
> > +++ b/drivers/net/vmxnet3/vmxnet3_ethdev.h
> > @@ -34,6 +34,8 @@
> >  #ifndef _VMXNET3_ETHDEV_H_
> >  #define _VMXNET3_ETHDEV_H_
> >  
> > +#include <rte_io.h>
> > +
> >  #define VMXNET3_MAX_MAC_ADDRS 1
> >  
> >  /* UPT feature to negotiate */
> > @@ -120,7 +122,11 @@ struct vmxnet3_hw {
> >  
> >  /* Config space read/writes */
> >  
> > -#define VMXNET3_PCI_REG(reg) (*((volatile uint32_t *)(reg)))
> > +#define VMXNET3_PCI_REG(reg) ({		\
> > +	uint32_t __val;			\
> > +	__val = rte_readl(reg);		\
> > +	__val;				\
> > +})
> 
> Why not simply using rte_readl directly?
> 
> 	#define VMXNET3_PCI_REG(reg)	rte_readl(reg)
>

Ok.

> >  
> >  static inline uint32_t
> >  vmxnet3_read_addr(volatile void *addr)
> > @@ -128,9 +134,9 @@ vmxnet3_read_addr(volatile void *addr)
> >  	return VMXNET3_PCI_REG(addr);
> >  }
> >  
> > -#define VMXNET3_PCI_REG_WRITE(reg, value) do { \
> > -	VMXNET3_PCI_REG((reg)) = (value); \
> > -} while(0)
> > +#define VMXNET3_PCI_REG_WRITE(reg, value) ({	\
> > +	rte_writel(value, reg);			\
> > +})
> 
> I think this could be done in one line.
>

Ok.
will take care in V2.

> 	--yliu

^ permalink raw reply

* Re: [PATCH 26/28] net/virtio: use eal I/O device memory read/write API
From: Santosh Shukla @ 2016-12-15  5:45 UTC (permalink / raw)
  To: Yuanhan Liu
  Cc: Jerin Jacob, dev, konstantin.ananyev, thomas.monjalon,
	bruce.richardson, jianbo.liu, viktorin, Huawei Xie
In-Reply-To: <20161214030223.GH18991@yliu-dev.sh.intel.com>

On Wed, Dec 14, 2016 at 11:02:23AM +0800, Yuanhan Liu wrote:
> On Wed, Dec 14, 2016 at 07:25:56AM +0530, Jerin Jacob wrote:
> >   * Following macros are derived from linux/pci_regs.h, however,
> >   * we can't simply include that header here, as there is no such
> > @@ -320,37 +322,37 @@ static const struct virtio_pci_ops legacy_ops = {
> >  static inline uint8_t
> >  io_read8(uint8_t *addr)
> >  {
> > -	return *(volatile uint8_t *)addr;
> > +	return rte_readb(addr);
> >  }
> 
> Oh, one more comments: why not replacing io_read8 with rte_readb(),
> and do similar for others? Then we don't have to define those wrappers.
> 
> I think you can also do something similar for other patches?

Make sense for the virtio-pci case where API name io_read/write as good as
rte_read/write. However, IMO for other drivers for example ADF_CSR_RD/WR
improves code readability compared to plain rte_read/write.

Also IMO replacing code incident like below 
 
static inline void writel(unsigned int val, volatile void __iomem *addr)
{
-    *(volatile unsigned int *)addr = val;
+    rte_writel(val, addr);
}

with direct rte_read/write more appropriate. does above said make sense
to you?
If so then I will take care for all such driver in V2.

--Santosh.

> 	--yliu
> >  
> >  static inline void
> >  io_write8(uint8_t val, uint8_t *addr)
> >  {
> > -	*(volatile uint8_t *)addr = val;
> > +	rte_writeb(val, addr);
> >  }
> >  
> >  static inline uint16_t
> >  io_read16(uint16_t *addr)
> >  {
> > -	return *(volatile uint16_t *)addr;
> > +	return rte_readw(addr);
> >  }
> >  
> >  static inline void
> >  io_write16(uint16_t val, uint16_t *addr)
> >  {
> > -	*(volatile uint16_t *)addr = val;
> > +	rte_writew(val, addr);
> >  }
> >  
> >  static inline uint32_t
> >  io_read32(uint32_t *addr)
> >  {
> > -	return *(volatile uint32_t *)addr;
> > +	return rte_readl(addr);
> >  }
> >  
> >  static inline void
> >  io_write32(uint32_t val, uint32_t *addr)
> >  {
> > -	*(volatile uint32_t *)addr = val;
> > +	rte_writel(val, addr);
> >  }
> >  
> >  static inline void
> > -- 
> > 2.5.5

^ permalink raw reply

* [PATCH 2/2] hyperv: VMBUS support infrastucture
From: Stephen Hemminger @ 2016-12-14 23:59 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger
In-Reply-To: <20161214235920.12877-1-sthemmin@microsoft.com>

Generalize existing bus support to handle VMBUS in Hyper-V.
Most of the code is based of existing model for PCI, the difference
is how bus is represented in sysfs and how addressing works.

This is based on earlier code contributed by Brocade.
It supports only 4.9 or later versions of the Linux kernel
at this time (not older kernels or BSD).

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 lib/librte_eal/common/Makefile              |   2 +-
 lib/librte_eal/common/eal_common_devargs.c  |   7 +
 lib/librte_eal/common/eal_common_options.c  |  38 ++
 lib/librte_eal/common/eal_internal_cfg.h    |   3 +-
 lib/librte_eal/common/eal_options.h         |   6 +
 lib/librte_eal/common/eal_private.h         |   5 +
 lib/librte_eal/common/include/rte_devargs.h |   8 +
 lib/librte_eal/common/include/rte_vmbus.h   | 247 ++++++++
 lib/librte_eal/linuxapp/eal/Makefile        |   6 +
 lib/librte_eal/linuxapp/eal/eal.c           |  11 +
 lib/librte_eal/linuxapp/eal/eal_vmbus.c     | 906 ++++++++++++++++++++++++++++
 lib/librte_ether/rte_ethdev.c               |  90 +++
 lib/librte_ether/rte_ethdev.h               |  28 +-
 mk/rte.app.mk                               |   1 +
 14 files changed, 1354 insertions(+), 4 deletions(-)
 create mode 100644 lib/librte_eal/common/include/rte_vmbus.h
 create mode 100644 lib/librte_eal/linuxapp/eal/eal_vmbus.c

diff --git a/lib/librte_eal/common/Makefile b/lib/librte_eal/common/Makefile
index a92c984..9254bae 100644
--- a/lib/librte_eal/common/Makefile
+++ b/lib/librte_eal/common/Makefile
@@ -33,7 +33,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 
 INC := rte_branch_prediction.h rte_common.h
 INC += rte_debug.h rte_eal.h rte_errno.h rte_launch.h rte_lcore.h
-INC += rte_log.h rte_memory.h rte_memzone.h rte_pci.h
+INC += rte_log.h rte_memory.h rte_memzone.h rte_pci.h rte_vmbus.h
 INC += rte_per_lcore.h rte_random.h
 INC += rte_tailq.h rte_interrupts.h rte_alarm.h
 INC += rte_string_fns.h rte_version.h
diff --git a/lib/librte_eal/common/eal_common_devargs.c b/lib/librte_eal/common/eal_common_devargs.c
index e403717..934ca84 100644
--- a/lib/librte_eal/common/eal_common_devargs.c
+++ b/lib/librte_eal/common/eal_common_devargs.c
@@ -113,6 +113,13 @@ rte_eal_devargs_add(enum rte_devtype devtype, const char *devargs_str)
 			goto fail;
 
 		break;
+	case RTE_DEVTYPE_WHITELISTED_VMBUS:
+	case RTE_DEVTYPE_BLACKLISTED_VMBUS:
+#ifdef RTE_LIBRTE_HV_PMD
+		if (uuid_parse(buf, devargs->uuid) == 0)
+			break;
+#endif
+		goto fail;
 	}
 
 	free(buf);
diff --git a/lib/librte_eal/common/eal_common_options.c b/lib/librte_eal/common/eal_common_options.c
index 6ca8af1..6aea87d 100644
--- a/lib/librte_eal/common/eal_common_options.c
+++ b/lib/librte_eal/common/eal_common_options.c
@@ -95,6 +95,11 @@ eal_long_options[] = {
 	{OPT_VFIO_INTR,         1, NULL, OPT_VFIO_INTR_NUM        },
 	{OPT_VMWARE_TSC_MAP,    0, NULL, OPT_VMWARE_TSC_MAP_NUM   },
 	{OPT_XEN_DOM0,          0, NULL, OPT_XEN_DOM0_NUM         },
+#ifdef RTE_LIBRTE_HV_PMD
+	{OPT_NO_VMBUS,          0, NULL, OPT_NO_VMBUS_NUM         },
+	{OPT_VMBUS_BLACKLIST,   1, NULL, OPT_VMBUS_BLACKLIST_NUM  },
+	{OPT_VMBUS_WHITELIST,   1, NULL, OPT_VMBUS_WHITELIST_NUM  },
+#endif
 	{0,                     0, NULL, 0                        }
 };
 
@@ -855,6 +860,21 @@ eal_parse_common_option(int opt, const char *optarg,
 		conf->no_pci = 1;
 		break;
 
+#ifdef RTE_LIBRTE_HV_PMD
+	case OPT_NO_VMBUS_NUM:
+		conf->no_vmbus = 1;
+		break;
+	case OPT_VMBUS_BLACKLIST_NUM:
+		if (rte_eal_devargs_add(RTE_DEVTYPE_BLACKLISTED_VMBUS,
+					optarg) < 0)
+			return -1;
+		break;
+	case OPT_VMBUS_WHITELIST_NUM:
+		if (rte_eal_devargs_add(RTE_DEVTYPE_WHITELISTED_VMBUS,
+				optarg) < 0)
+			return -1;
+		break;
+#endif
 	case OPT_NO_HPET_NUM:
 		conf->no_hpet = 1;
 		break;
@@ -987,6 +1007,14 @@ eal_check_common_options(struct internal_config *internal_cfg)
 		return -1;
 	}
 
+#ifdef RTE_LIBRTE_HV_PMD
+	if (rte_eal_devargs_type_count(RTE_DEVTYPE_WHITELISTED_VMBUS) != 0 &&
+		rte_eal_devargs_type_count(RTE_DEVTYPE_BLACKLISTED_VMBUS) != 0) {
+		RTE_LOG(ERR, EAL, "Options vmbus blacklist and whitelist "
+			"cannot be used at the same time\n");
+		return -1;
+	}
+#endif
 	return 0;
 }
 
@@ -1036,5 +1064,15 @@ eal_common_usage(void)
 	       "  --"OPT_NO_PCI"            Disable PCI\n"
 	       "  --"OPT_NO_HPET"           Disable HPET\n"
 	       "  --"OPT_NO_SHCONF"         No shared config (mmap'd files)\n"
+#ifdef RTE_LIBRTE_HV_PMD
+	       "  --"OPT_NO_VMBUS"          Disable VMBUS\n"
+	       "  --"OPT_VMBUS_BLACKLIST" Add a VMBUS device to black list.\n"
+	       "                      Prevent EAL from using this PCI device. The argument\n"
+	       "                      format is device UUID.\n"
+	       "  --"OPT_VMBUS_WHITELIST" Add a VMBUS device to white list.\n"
+	       "                      Only use the specified VMBUS devices. The argument format\n"
+	       "                      is device UUID This option can be present\n"
+	       "                      several times (once per device).\n"
+#endif
 	       "\n", RTE_MAX_LCORE);
 }
diff --git a/lib/librte_eal/common/eal_internal_cfg.h b/lib/librte_eal/common/eal_internal_cfg.h
index 5f1367e..1827194 100644
--- a/lib/librte_eal/common/eal_internal_cfg.h
+++ b/lib/librte_eal/common/eal_internal_cfg.h
@@ -69,7 +69,8 @@ struct internal_config {
 	volatile unsigned no_pci;         /**< true to disable PCI */
 	volatile unsigned no_hpet;        /**< true to disable HPET */
 	volatile unsigned vmware_tsc_map; /**< true to use VMware TSC mapping
-										* instead of native TSC */
+					   * instead of native TSC */
+	volatile unsigned no_vmbus;       /**< true to disable VMBUS */
 	volatile unsigned no_shconf;      /**< true if there is no shared config */
 	volatile unsigned create_uio_dev; /**< true to create /dev/uioX devices */
 	volatile enum rte_proc_type_t process_type; /**< multi-process proc type */
diff --git a/lib/librte_eal/common/eal_options.h b/lib/librte_eal/common/eal_options.h
index a881c62..156727e 100644
--- a/lib/librte_eal/common/eal_options.h
+++ b/lib/librte_eal/common/eal_options.h
@@ -83,6 +83,12 @@ enum {
 	OPT_VMWARE_TSC_MAP_NUM,
 #define OPT_XEN_DOM0          "xen-dom0"
 	OPT_XEN_DOM0_NUM,
+#define OPT_NO_VMBUS          "no-vmbus"
+	OPT_NO_VMBUS_NUM,
+#define OPT_VMBUS_BLACKLIST   "vmbus-blacklist"
+	OPT_VMBUS_BLACKLIST_NUM,
+#define OPT_VMBUS_WHITELIST   "vmbus-whitelist"
+	OPT_VMBUS_WHITELIST_NUM,
 	OPT_LONG_MAX_NUM
 };
 
diff --git a/lib/librte_eal/common/eal_private.h b/lib/librte_eal/common/eal_private.h
index 9e7d8f6..c856c63 100644
--- a/lib/librte_eal/common/eal_private.h
+++ b/lib/librte_eal/common/eal_private.h
@@ -210,6 +210,11 @@ int pci_uio_map_resource_by_index(struct rte_pci_device *dev, int res_idx,
 		struct mapped_pci_resource *uio_res, int map_idx);
 
 /**
+ * VMBUS related functions and structures
+ */
+int rte_eal_vmbus_init(void);
+
+/**
  * Init tail queues for non-EAL library structures. This is to allow
  * the rings, mempools, etc. lists to be shared among multiple processes
  *
diff --git a/lib/librte_eal/common/include/rte_devargs.h b/lib/librte_eal/common/include/rte_devargs.h
index 88120a1..c079d28 100644
--- a/lib/librte_eal/common/include/rte_devargs.h
+++ b/lib/librte_eal/common/include/rte_devargs.h
@@ -51,6 +51,9 @@ extern "C" {
 #include <stdio.h>
 #include <sys/queue.h>
 #include <rte_pci.h>
+#ifdef RTE_LIBRTE_HV_PMD
+#include <uuid/uuid.h>
+#endif
 
 /**
  * Type of generic device
@@ -59,6 +62,8 @@ enum rte_devtype {
 	RTE_DEVTYPE_WHITELISTED_PCI,
 	RTE_DEVTYPE_BLACKLISTED_PCI,
 	RTE_DEVTYPE_VIRTUAL,
+	RTE_DEVTYPE_WHITELISTED_VMBUS,
+	RTE_DEVTYPE_BLACKLISTED_VMBUS,
 };
 
 /**
@@ -88,6 +93,9 @@ struct rte_devargs {
 			/** Driver name. */
 			char drv_name[32];
 		} virt;
+#ifdef RTE_LIBRTE_HV_PMD
+		uuid_t uuid;
+#endif
 	};
 	/** Arguments string as given by user or "" for no argument. */
 	char *args;
diff --git a/lib/librte_eal/common/include/rte_vmbus.h b/lib/librte_eal/common/include/rte_vmbus.h
new file mode 100644
index 0000000..8540539
--- /dev/null
+++ b/lib/librte_eal/common/include/rte_vmbus.h
@@ -0,0 +1,247 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2013-2016 Brocade Communications Systems, Inc.
+ *   Copyright(c) 2016 Microsoft Corporation
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#ifndef _RTE_VMBUS_H_
+#define _RTE_VMBUS_H_
+
+/**
+ * @file
+ *
+ * RTE VMBUS Interface
+ */
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <limits.h>
+#include <errno.h>
+#include <uuid/uuid.h>
+#include <sys/queue.h>
+#include <stdint.h>
+#include <inttypes.h>
+
+#include <rte_debug.h>
+#include <rte_interrupts.h>
+#include <rte_dev.h>
+
+TAILQ_HEAD(vmbus_device_list, rte_vmbus_device);
+TAILQ_HEAD(vmbus_driver_list, rte_vmbus_driver);
+
+extern struct vmbus_driver_list vmbus_driver_list;
+extern struct vmbus_device_list vmbus_device_list;
+
+/** Pathname of VMBUS devices directory. */
+#define SYSFS_VMBUS_DEVICES "/sys/bus/vmbus/devices"
+
+#define UUID_BUF_SZ	(36 + 1)
+	
+
+/** Maximum number of VMBUS resources. */
+#define VMBUS_MAX_RESOURCE 7
+
+/**
+ * A structure describing a VMBUS device.
+ */
+struct rte_vmbus_device {
+	TAILQ_ENTRY(rte_vmbus_device) next;     /**< Next probed VMBUS device. */
+	struct rte_device device;               /**< Inherit core device */
+	uuid_t device_id;			/**< VMBUS device id */
+	uuid_t class_id;			/**< VMBUS device type */
+	uint32_t relid;				/**< VMBUS id for notification */
+	uint8_t	monitor_id;
+	struct rte_intr_handle intr_handle;     /**< Interrupt handle */
+	const struct rte_vmbus_driver *driver;  /**< Associated driver */
+
+	struct rte_mem_resource mem_resource[VMBUS_MAX_RESOURCE];
+						/**< VMBUS Memory Resource */
+	char sysfs_name[];			/**< Name in sysfs bus directory */
+};
+
+struct rte_vmbus_driver;
+
+/**
+ * Initialisation function for the driver called during VMBUS probing.
+ */
+typedef int (vmbus_probe_t)(struct rte_vmbus_driver *, struct rte_vmbus_device *);
+
+/**
+ * Uninitialisation function for the driver called during hotplugging.
+ */
+typedef int (vmbus_remove_t)(struct rte_vmbus_device *);
+
+/**
+ * A structure describing a VMBUS driver.
+ */
+struct rte_vmbus_driver {
+	TAILQ_ENTRY(rte_vmbus_driver) next;     /**< Next in list. */
+	struct rte_driver driver;
+	vmbus_probe_t *probe;                   /**< Device Probe function. */
+	vmbus_remove_t *remove;                 /**< Device Remove function. */
+
+	const uuid_t *id_table;			/**< ID table, NULL terminated. */
+};
+
+struct vmbus_map {
+	void *addr;
+	char *path;
+	uint64_t offset;
+	uint64_t size;
+	uint64_t phaddr;
+};
+
+/*
+ * For multi-process we need to reproduce all vmbus mappings in secondary
+ * processes, so save them in a tailq.
+ */
+struct mapped_vmbus_resource {
+	TAILQ_ENTRY(mapped_vmbus_resource) next;
+
+	uuid_t uuid;
+	char path[PATH_MAX];
+	int nb_maps;
+	struct vmbus_map maps[VMBUS_MAX_RESOURCE];
+};
+
+TAILQ_HEAD(mapped_vmbus_res_list, mapped_vmbus_resource);
+
+/**
+ * Scan the content of the VMBUS bus, and the devices in the devices list
+ *
+ * @return
+ *  0 on success, negative on error
+ */
+int rte_eal_vmbus_scan(void);
+
+/**
+ * Probe the VMBUS bus for registered drivers.
+ *
+ * Scan the content of the VMBUS bus, and call the probe() function for
+ * all registered drivers that have a matching entry in its id_table
+ * for discovered devices.
+ *
+ * @return
+ *   - 0 on success.
+ *   - Negative on error.
+ */
+int rte_eal_vmbus_probe(void);
+
+/**
+ * Map the VMBUS device resources in user space virtual memory address
+ *
+ * @param dev
+ *   A pointer to a rte_vmbus_device structure describing the device
+ *   to use
+ *
+ * @return
+ *   0 on success, negative on error and positive if no driver
+ *   is found for the device.
+ */
+int rte_eal_vmbus_map_device(struct rte_vmbus_device *dev);
+
+/**
+ * Unmap this device
+ *
+ * @param dev
+ *   A pointer to a rte_vmbus_device structure describing the device
+ *   to use
+ */
+void rte_eal_vmbus_unmap_device(struct rte_vmbus_device *dev);
+
+/**
+ * Probe the single VMBUS device.
+ *
+ * Scan the content of the VMBUS bus, and find the vmbus device
+ * specified by device uuid, then call the probe() function for
+ * registered driver that has a matching entry in its id_table for
+ * discovered device.
+ *
+ * @param id
+ * 	The VMBUS device uuid.
+ * @return
+ *   - 0 on success.
+ *   - Negative on error.
+ */
+int rte_eal_vmbus_probe_one(uuid_t id);
+
+/**
+ * Close the single VMBUS device.
+ *
+ * Scan the content of the VMBUS bus, and find the vmbus device id,
+ * then call the remove() function for registered driver that has a
+ * matching entry in its id_table for discovered device.
+ *
+ * @param id
+ * 	The VMBUS device uuid.
+ * @return
+ *   - 0 on success.
+ *   - Negative on error.
+ */
+int rte_eal_vmbus_detach(uuid_t id);
+
+/**
+ * Register a VMBUS driver.
+ *
+ * @param driver
+ *   A pointer to a rte_vmbus_driver structure describing the driver
+ *   to be registered.
+ */
+void rte_eal_vmbus_register(struct rte_vmbus_driver *driver);
+
+/** Helper for VMBUS device registration from driver nstance */
+#define RTE_PMD_REGISTER_VMBUS(nm, vmbus_drv) \
+RTE_INIT(vmbusinitfn_ ##nm); \
+static void vmbusinitfn_ ##nm(void) \
+{\
+	(vmbus_drv).driver.name = RTE_STR(nm);\
+	rte_eal_vmbus_register(&vmbus_drv); \
+} \
+RTE_PMD_EXPORT_NAME(nm, __COUNTER__)
+
+/**
+ * Unregister a VMBUS driver.
+ *
+ * @param driver
+ *   A pointer to a rte_vmbus_driver structure describing the driver
+ *   to be unregistered.
+ */
+void rte_eal_vmbus_unregister(struct rte_vmbus_driver *driver);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_VMBUS_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 4e206f0..f6ca384 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -71,6 +71,11 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_timer.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_interrupts.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_alarm.c
 
+ifeq ($(CONFIG_RTE_LIBRTE_HV_PMD),y)
+SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_vmbus.c
+LDLIBS += -luuid
+endif
+
 # from common dir
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_lcore.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += eal_common_timer.c
@@ -114,6 +119,7 @@ CFLAGS_eal_hugepage_info.o := -D_GNU_SOURCE
 CFLAGS_eal_pci.o := -D_GNU_SOURCE
 CFLAGS_eal_pci_uio.o := -D_GNU_SOURCE
 CFLAGS_eal_pci_vfio.o := -D_GNU_SOURCE
+CFLAGS_eal_vmbux.o := -D_GNU_SOURCE
 CFLAGS_eal_common_whitelist.o := -D_GNU_SOURCE
 CFLAGS_eal_common_options.o := -D_GNU_SOURCE
 CFLAGS_eal_common_thread.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/linuxapp/eal/eal.c b/lib/librte_eal/linuxapp/eal/eal.c
index 2075282..71083ec 100644
--- a/lib/librte_eal/linuxapp/eal/eal.c
+++ b/lib/librte_eal/linuxapp/eal/eal.c
@@ -70,6 +70,7 @@
 #include <rte_cpuflags.h>
 #include <rte_interrupts.h>
 #include <rte_pci.h>
+#include <rte_vmbus.h>
 #include <rte_dev.h>
 #include <rte_devargs.h>
 #include <rte_common.h>
@@ -830,6 +831,11 @@ rte_eal_init(int argc, char **argv)
 
 	eal_check_mem_on_local_socket();
 
+#ifdef RTE_LIBRTE_HV_PMD
+	if (rte_eal_vmbus_init() < 0)
+		RTE_LOG(ERR, EAL, "Cannot init VMBUS\n");
+#endif
+
 	if (eal_plugins_init() < 0)
 		rte_panic("Cannot init plugins\n");
 
@@ -887,6 +893,11 @@ rte_eal_init(int argc, char **argv)
 	if (rte_eal_pci_probe())
 		rte_panic("Cannot probe PCI\n");
 
+#ifdef RTE_LIBRTE_HV_PMD
+	if (rte_eal_vmbus_probe() < 0)
+		rte_panic("Cannot probe VMBUS\n");
+#endif
+
 	rte_eal_mcfg_complete();
 
 	return fctret;
diff --git a/lib/librte_eal/linuxapp/eal/eal_vmbus.c b/lib/librte_eal/linuxapp/eal/eal_vmbus.c
new file mode 100644
index 0000000..cbd8bd1
--- /dev/null
+++ b/lib/librte_eal/linuxapp/eal/eal_vmbus.c
@@ -0,0 +1,906 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2013-2016 Brocade Communications Systems, Inc.
+ *   Copyright(c) 2016 Microsoft Corporation
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <string.h>
+#include <unistd.h>
+#include <dirent.h>
+#include <fcntl.h>
+#include <sys/mman.h>
+
+#include <rte_eal.h>
+#include <rte_tailq.h>
+#include <rte_log.h>
+#include <rte_devargs.h>
+#include <rte_vmbus.h>
+#include <rte_malloc.h>
+
+#include "eal_private.h"
+#include "eal_pci_init.h"
+#include "eal_filesystem.h"
+
+struct vmbus_driver_list vmbus_driver_list =
+	TAILQ_HEAD_INITIALIZER(vmbus_driver_list);
+struct vmbus_device_list vmbus_device_list =
+	TAILQ_HEAD_INITIALIZER(vmbus_device_list);
+
+static void *vmbus_map_addr;
+
+static struct rte_tailq_elem rte_vmbus_uio_tailq = {
+	.name = "UIO_RESOURCE_LIST",
+};
+EAL_REGISTER_TAILQ(rte_vmbus_uio_tailq);
+
+/*
+ * parse a sysfs file containing one integer value
+ * different to the eal version, as it needs to work with 64-bit values
+ */
+static int
+vmbus_get_sysfs_uuid(const char *filename, uuid_t uu)
+{
+	char buf[BUFSIZ];
+	char *cp = NULL;
+	FILE *f;
+
+	f = fopen(filename, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs value %s\n",
+				__func__, filename);
+		return -1;
+	}
+
+	if (fgets(buf, sizeof(buf), f) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs value %s\n",
+				__func__, filename);
+		fclose(f);
+		return -1;
+	}
+	fclose(f);
+
+	cp = strchr(cp, '\n');
+	if (cp)
+		*cp = '\0';
+
+	/* strip { } notation */
+	if (buf[0] == '{' && (cp = strchr(buf, '}')))
+		*cp = '\0';
+
+	if (uuid_parse(buf, uu) < 0) {
+		RTE_LOG(ERR, EAL, "%s %s not a valid UUID\n",
+			filename, buf);
+		return -1;
+	}
+
+	return 0;
+}
+
+/* map a particular resource from a file */
+static void *
+vmbus_map_resource(void *requested_addr, int fd, off_t offset, size_t size,
+		   int flags)
+{
+	void *mapaddr;
+
+	/* Map the memory resource of device */
+	mapaddr = mmap(requested_addr, size, PROT_READ | PROT_WRITE,
+		       MAP_SHARED | flags, fd, offset);
+	if (mapaddr == MAP_FAILED ||
+	    (requested_addr != NULL && mapaddr != requested_addr)) {
+		RTE_LOG(ERR, EAL,
+			"%s(): cannot mmap(%d, %p, 0x%lx, 0x%lx): %s)\n",
+			__func__, fd, requested_addr,
+			(unsigned long)size, (unsigned long)offset,
+			strerror(errno));
+	} else
+		RTE_LOG(DEBUG, EAL, "  VMBUS memory mapped at %p\n", mapaddr);
+
+	return mapaddr;
+}
+
+/* unmap a particular resource */
+static void
+vmbus_unmap_resource(void *requested_addr, size_t size)
+{
+	if (requested_addr == NULL)
+		return;
+
+	/* Unmap the VMBUS memory resource of device */
+	if (munmap(requested_addr, size)) {
+		RTE_LOG(ERR, EAL, "%s(): cannot munmap(%p, 0x%lx): %s\n",
+			__func__, requested_addr, (unsigned long)size,
+			strerror(errno));
+	} else
+		RTE_LOG(DEBUG, EAL, "  VMBUS memory unmapped at %p\n",
+				requested_addr);
+}
+
+/* Only supports current kernel version
+ * Unlike PCI there is no option (or need) to create UIO device.
+ */
+static int vmbus_get_uio_dev(const char *name,
+			     char *dstbuf, size_t buflen)
+{
+	char dirname[PATH_MAX];
+	unsigned int uio_num;
+	struct dirent *e;
+	DIR *dir;
+
+	snprintf(dirname, sizeof(dirname),
+		 "/sys/bus/vmbus/devices/%s/uio", name);
+
+	dir = opendir(dirname);
+	if (dir == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot map uio resources for %s: %s\n",
+			name, strerror(errno));
+		return -1;
+	}
+
+	/* take the first file starting with "uio" */
+	while ((e = readdir(dir)) != NULL) {
+		if (sscanf(e->d_name, "uio%u", &uio_num) != 1)
+			continue;
+
+		snprintf(dstbuf, buflen, "%s/uio%u", dirname, uio_num);
+		break;
+	}
+	closedir(dir);
+
+	return e ? (int) uio_num : -1;
+}
+
+/*
+ * parse a sysfs file containing one integer value
+ * different to the eal version, as it needs to work with 64-bit values
+ */
+static int
+vmbus_parse_sysfs_value(const char *dir, const char *name,
+			uint64_t *val)
+{
+	char filename[PATH_MAX];
+	FILE *f;
+	char buf[BUFSIZ];
+	char *end = NULL;
+
+	snprintf(filename, sizeof(filename), "%s/%s", dir, name);
+	f = fopen(filename, "r");
+	if (f == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot open sysfs value %s\n",
+				__func__, filename);
+		return -1;
+	}
+
+	if (fgets(buf, sizeof(buf), f) == NULL) {
+		RTE_LOG(ERR, EAL, "%s(): cannot read sysfs value %s\n",
+				__func__, filename);
+		fclose(f);
+		return -1;
+	}
+	fclose(f);
+
+	*val = strtoull(buf, &end, 0);
+	if ((buf[0] == '\0') || (end == NULL) || (*end != '\n')) {
+		RTE_LOG(ERR, EAL, "%s(): cannot parse sysfs value %s\n",
+				__func__, filename);
+		return -1;
+	}
+	return 0;
+}
+
+/* Get mappings out of values provided by uio */
+static int
+vmbus_uio_get_mappings(const char *uioname,
+		       struct vmbus_map maps[])
+{
+	int i;
+
+	for (i = 0; i != VMBUS_MAX_RESOURCE; i++) {
+		struct vmbus_map *map = &maps[i];
+		char dirname[PATH_MAX];
+
+		/* check if map directory exists */
+		snprintf(dirname, sizeof(dirname),
+			 "%s/maps/map%d", uioname, i);
+
+		if (access(dirname, F_OK) != 0)
+			break;
+
+		/* get mapping offset */
+		if (vmbus_parse_sysfs_value(dirname, "offset",
+					    &map->offset) < 0)
+			return -1;
+
+		/* get mapping size */
+		if (vmbus_parse_sysfs_value(dirname, "size",
+					    &map->size) < 0)
+			return -1;
+
+		/* get mapping physical address */
+		if (vmbus_parse_sysfs_value(dirname, "addr",
+					    &maps->phaddr) < 0)
+			return -1;
+	}
+
+	return i;
+}
+
+static void
+vmbus_uio_free_resource(struct rte_vmbus_device *dev,
+		struct mapped_vmbus_resource *uio_res)
+{
+	rte_free(uio_res);
+
+	if (dev->intr_handle.fd) {
+		close(dev->intr_handle.fd);
+		dev->intr_handle.fd = -1;
+		dev->intr_handle.type = RTE_INTR_HANDLE_UNKNOWN;
+	}
+}
+
+static struct mapped_vmbus_resource *
+vmbus_uio_alloc_resource(struct rte_vmbus_device *dev)
+{
+	struct mapped_vmbus_resource *uio_res;
+	char dirname[PATH_MAX], devname[PATH_MAX];
+	int uio_num, nb_maps;
+
+	uio_num = vmbus_get_uio_dev(dev->sysfs_name, dirname, sizeof(dirname));
+	if (uio_num < 0) {
+		RTE_LOG(WARNING, EAL,
+			"  %s not managed by UIO driver, skipping\n",
+			dev->sysfs_name);
+		return NULL;
+	}
+
+	/* allocate the mapping details for secondary processes*/
+	uio_res = rte_zmalloc("UIO_RES", sizeof(*uio_res), 0);
+	if (uio_res == NULL) {
+		RTE_LOG(ERR, EAL,
+			"%s(): cannot store uio mmap details\n", __func__);
+		goto error;
+	}
+
+	snprintf(devname, sizeof(devname), "/dev/uio%u", uio_num);
+	dev->intr_handle.fd = open(devname, O_RDWR);
+	if (dev->intr_handle.fd < 0) {
+		RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+			devname, strerror(errno));
+		goto error;
+	}
+
+	dev->intr_handle.type = RTE_INTR_HANDLE_UIO_INTX;
+
+	snprintf(uio_res->path, sizeof(uio_res->path), "%s", devname);
+	uuid_copy(uio_res->uuid, dev->device_id);
+
+	nb_maps = vmbus_uio_get_mappings(dirname, uio_res->maps);
+	if (nb_maps < 0)
+		goto error;
+
+	RTE_LOG(DEBUG, EAL, "Found %d memory maps for device %s\n",
+		nb_maps, dev->sysfs_name);
+
+	return uio_res;
+
+ error:
+	vmbus_uio_free_resource(dev, uio_res);
+	return NULL;
+}
+
+static int
+vmbus_uio_map_resource_by_index(struct rte_vmbus_device *dev,
+				unsigned int res_idx,
+				struct mapped_vmbus_resource *uio_res,
+				unsigned int map_idx)
+{
+	struct vmbus_map *maps = uio_res->maps;
+	char devname[PATH_MAX];
+	void *mapaddr;
+	int fd;
+
+	snprintf(devname, sizeof(devname),
+		 "/sys/bus/vmbus/%s/resource%u", dev->sysfs_name, res_idx);
+
+	fd = open(devname, O_RDWR);
+	if (fd < 0) {
+		RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+				devname, strerror(errno));
+		return -1;
+	}
+
+	/* allocate memory to keep path */
+	maps[map_idx].path = rte_malloc(NULL, strlen(devname) + 1, 0);
+	if (maps[map_idx].path == NULL) {
+		RTE_LOG(ERR, EAL, "Cannot allocate memory for path: %s\n",
+				strerror(errno));
+		return -1;
+	}
+
+	/* try mapping somewhere close to the end of hugepages */
+	if (vmbus_map_addr == NULL)
+		vmbus_map_addr = pci_find_max_end_va();
+
+	mapaddr = vmbus_map_resource(vmbus_map_addr, fd, 0,
+				     dev->mem_resource[res_idx].len, 0);
+	close(fd);
+	if (mapaddr == MAP_FAILED) {
+		rte_free(maps[map_idx].path);
+		return -1;
+	}
+
+	vmbus_map_addr = RTE_PTR_ADD(mapaddr,
+				     dev->mem_resource[res_idx].len);
+
+	maps[map_idx].phaddr = dev->mem_resource[res_idx].phys_addr;
+	maps[map_idx].size = dev->mem_resource[res_idx].len;
+	maps[map_idx].addr = mapaddr;
+	maps[map_idx].offset = 0;
+	strcpy(maps[map_idx].path, devname);
+	dev->mem_resource[res_idx].addr = mapaddr;
+
+	return 0;
+}
+
+static void
+vmbus_uio_unmap(struct mapped_vmbus_resource *uio_res)
+{
+	int i;
+
+	if (uio_res == NULL)
+		return;
+
+	for (i = 0; i != uio_res->nb_maps; i++) {
+		vmbus_unmap_resource(uio_res->maps[i].addr,
+				     uio_res->maps[i].size);
+
+		if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+			rte_free(uio_res->maps[i].path);
+	}
+}
+
+static struct mapped_vmbus_resource *
+vmbus_uio_find_resource(struct rte_vmbus_device *dev)
+{
+	struct mapped_vmbus_resource *uio_res;
+	struct mapped_vmbus_res_list *uio_res_list =
+			RTE_TAILQ_CAST(rte_vmbus_uio_tailq.head, mapped_vmbus_res_list);
+
+	if (dev == NULL)
+		return NULL;
+
+	TAILQ_FOREACH(uio_res, uio_res_list, next) {
+		if (uuid_compare(uio_res->uuid, dev->device_id) == 0)
+			return uio_res;
+	}
+	return NULL;
+}
+
+/* unmap the VMBUS resource of a VMBUS device in virtual memory */
+static void
+vmbus_uio_unmap_resource(struct rte_vmbus_device *dev)
+{
+	struct mapped_vmbus_resource *uio_res;
+	struct mapped_vmbus_res_list *uio_res_list =
+			RTE_TAILQ_CAST(rte_vmbus_uio_tailq.head, mapped_vmbus_res_list);
+
+	if (dev == NULL)
+		return;
+
+	/* find an entry for the device */
+	uio_res = vmbus_uio_find_resource(dev);
+	if (uio_res == NULL)
+		return;
+
+	/* secondary processes - just free maps */
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return vmbus_uio_unmap(uio_res);
+
+	TAILQ_REMOVE(uio_res_list, uio_res, next);
+
+	/* unmap all resources */
+	vmbus_uio_unmap(uio_res);
+
+	/* free uio resource */
+	rte_free(uio_res);
+
+	/* close fd if in primary process */
+	close(dev->intr_handle.fd);
+	if (dev->intr_handle.uio_cfg_fd >= 0) {
+		close(dev->intr_handle.uio_cfg_fd);
+		dev->intr_handle.uio_cfg_fd = -1;
+	}
+
+	dev->intr_handle.fd = -1;
+	dev->intr_handle.type = RTE_INTR_HANDLE_UNKNOWN;
+}
+
+static int
+vmbus_uio_map_secondary(struct rte_vmbus_device *dev)
+{
+	struct mapped_vmbus_resource *uio_res;
+	struct mapped_vmbus_res_list *uio_res_list =
+			RTE_TAILQ_CAST(rte_vmbus_uio_tailq.head,
+				       mapped_vmbus_res_list);
+
+	TAILQ_FOREACH(uio_res, uio_res_list, next) {
+		int i;
+
+		/* skip this element if it doesn't match our id */
+		if (uuid_compare(uio_res->uuid, dev->device_id))
+			continue;
+
+		for (i = 0; i != uio_res->nb_maps; i++) {
+			void *mapaddr;
+			int fd;
+
+			fd = open(uio_res->maps[i].path, O_RDWR);
+			if (fd < 0) {
+				RTE_LOG(ERR, EAL, "Cannot open %s: %s\n",
+					uio_res->maps[i].path, strerror(errno));
+				return -1;
+			}
+
+			mapaddr = vmbus_map_resource(uio_res->maps[i].addr, fd,
+						     uio_res->maps[i].offset,
+						     uio_res->maps[i].size, 0);
+			/* fd is not needed in slave process, close it */
+			close(fd);
+
+			if (mapaddr == uio_res->maps[i].addr)
+				continue;
+
+			RTE_LOG(ERR, EAL,
+				"Cannot mmap device resource file %s to address: %p\n",
+				uio_res->maps[i].path,
+				uio_res->maps[i].addr);
+
+			/* unmap addrs correctly mapped */
+			while (i != 0) {
+				--i;
+ 				vmbus_unmap_resource(uio_res->maps[i].addr,
+						     uio_res->maps[i].size);
+			}
+			return -1;
+
+		}
+		return 0;
+	}
+
+	RTE_LOG(ERR, EAL, "Cannot find resource for device\n");
+	return 1;
+}
+
+/* map the resources of a vmbus device in virtual memory */
+int
+rte_eal_vmbus_map_device(struct rte_vmbus_device *dev)
+{
+	struct mapped_vmbus_resource *uio_res;
+	struct mapped_vmbus_res_list *uio_res_list =
+		RTE_TAILQ_CAST(rte_vmbus_uio_tailq.head, mapped_vmbus_res_list);
+	int i, ret, map_idx = 0;
+
+	dev->intr_handle.fd = -1;
+	dev->intr_handle.uio_cfg_fd = -1;
+	dev->intr_handle.type = RTE_INTR_HANDLE_UNKNOWN;
+
+	/* secondary processes - use already recorded details */
+	if (rte_eal_process_type() != RTE_PROC_PRIMARY)
+		return vmbus_uio_map_secondary(dev);
+
+	/* allocate uio resource */
+	uio_res = vmbus_uio_alloc_resource(dev);
+	if (uio_res == NULL)
+		return -1;
+
+	/* Map all BARs */
+	for (i = 0; i != VMBUS_MAX_RESOURCE; i++) {
+		uint64_t phaddr;
+
+		/* skip empty BAR */
+		phaddr = dev->mem_resource[i].phys_addr;
+		if (phaddr == 0)
+			continue;
+
+		ret = vmbus_uio_map_resource_by_index(dev, i,
+						      uio_res, map_idx);
+		if (ret)
+			goto error;
+
+		map_idx++;
+	}
+
+	uio_res->nb_maps = map_idx;
+
+	TAILQ_INSERT_TAIL(uio_res_list, uio_res, next);
+
+	return 0;
+error:
+	for (i = 0; i < map_idx; i++) {
+		vmbus_unmap_resource(uio_res->maps[i].addr,
+				     uio_res->maps[i].size);
+		rte_free(uio_res->maps[i].path);
+	}
+	vmbus_uio_free_resource(dev, uio_res);
+	return -1;
+}
+
+/* Scan one vmbus sysfs entry, and fill the devices list from it. */
+static int
+vmbus_scan_one(const char *name)
+{
+	struct rte_vmbus_device *dev, *dev2;
+	char filename[PATH_MAX];
+	char dirname[PATH_MAX];
+	unsigned long tmp;
+
+	dev = malloc(sizeof(*dev) + strlen(name) + 1);
+	if (dev == NULL)
+		return -1;
+
+	memset(dev, 0, sizeof(*dev));
+	strcpy(dev->sysfs_name, name);
+	if (dev->sysfs_name == NULL)
+		goto error;
+
+	/* sysfs base directory
+	 *   /sys/bus/vmbus/devices/7a08391f-f5a0-4ac0-9802-d13fd964f8df
+	 * or on older kernel
+	 *   /sys/bus/vmbus/devices/vmbus_1
+	 */
+	snprintf(dirname, sizeof(dirname), "%s/%s",
+		 SYSFS_VMBUS_DEVICES, name);
+
+	/* get device id */
+	snprintf(filename, sizeof(filename), "%s/device_id", dirname);
+	if (vmbus_get_sysfs_uuid(filename, dev->device_id) < 0)
+		goto error;
+
+	/* get device class  */
+	snprintf(filename, sizeof(filename), "%s/class_id", dirname);
+	if (vmbus_get_sysfs_uuid(filename, dev->class_id) < 0)
+		goto error;
+
+	/* get relid */
+	snprintf(filename, sizeof(filename), "%s/id", dirname);
+	if (eal_parse_sysfs_value(filename, &tmp) < 0)
+		goto error;
+	dev->relid = tmp;
+
+	/* get monitor id */
+	snprintf(filename, sizeof(filename), "%s/monitor_id", dirname);
+	if (eal_parse_sysfs_value(filename, &tmp) < 0)
+		goto error;
+	dev->monitor_id = tmp;
+
+	/* get numa node */
+	snprintf(filename, sizeof(filename), "%s/numa_node",
+		 dirname);
+	if (eal_parse_sysfs_value(filename, &tmp) < 0)
+		/* if no NUMA support, set default to 0 */
+		dev->device.numa_node = 0;
+	else
+		dev->device.numa_node = tmp;
+
+	/* device is valid, add in list (sorted) */
+	RTE_LOG(DEBUG, EAL, "Adding vmbus device %s\n", name);
+
+	TAILQ_FOREACH(dev2, &vmbus_device_list, next) {
+		int ret;
+
+		ret = uuid_compare(dev->device_id, dev->device_id);
+		if (ret > 0)
+			continue;
+
+		if (ret < 0) {
+			TAILQ_INSERT_BEFORE(dev2, dev, next);
+			rte_eal_device_insert(&dev->device);
+		} else { /* already registered */
+			memmove(dev2->mem_resource, dev->mem_resource,
+				sizeof(dev->mem_resource));
+			free(dev);
+		}
+		return 0;
+	}
+
+	rte_eal_device_insert(&dev->device);
+	TAILQ_INSERT_TAIL(&vmbus_device_list, dev, next);
+
+	return 0;
+error:
+	free(dev);
+	return -1;
+}
+
+/*
+ * Scan the content of the vmbus, and the devices in the devices list
+ */
+static int
+vmbus_scan(void)
+{
+	struct dirent *e;
+	DIR *dir;
+
+	dir = opendir(SYSFS_VMBUS_DEVICES);
+	if (dir == NULL) {
+		if (errno == ENOENT)
+			return 0;
+		else {
+			RTE_LOG(ERR, EAL, "%s(): opendir failed: %s\n",
+					__func__, strerror(errno));
+			return -1;
+		}
+	}
+
+	while ((e = readdir(dir)) != NULL) {
+		if (e->d_name[0] == '.')
+			continue;
+
+		if (vmbus_scan_one(e->d_name) < 0)
+			goto error;
+	}
+	closedir(dir);
+	return 0;
+
+error:
+	closedir(dir);
+	return -1;
+}
+
+/* Init the VMBUS EAL subsystem */
+int rte_eal_vmbus_init(void)
+{
+	/* VMBUS can be disabled */
+	if (internal_config.no_vmbus)
+		return 0;
+
+	if (vmbus_scan() < 0) {
+		RTE_LOG(ERR, EAL, "%s(): Cannot scan vmbus\n", __func__);
+		return -1;
+	}
+	return 0;
+}
+
+/* Below is PROBE part of eal_vmbus library */
+
+/*
+ * If device ID match, call the devinit() function of the driver.
+ */
+static int
+rte_eal_vmbus_probe_one_driver(struct rte_vmbus_driver *dr,
+			       struct rte_vmbus_device *dev)
+{
+	const uuid_t *id_table;
+
+	RTE_LOG(DEBUG, EAL, "  probe driver: %s\n", dr->driver.name);
+
+	for (id_table = dr->id_table; !uuid_is_null(*id_table); ++id_table) {
+		struct rte_devargs *args;
+		char guid[UUID_BUF_SZ];
+		int ret;
+
+		/* skip devices not assocaited with this device class */
+		if (uuid_compare(*id_table, dev->class_id) != 0)
+			continue;
+
+		uuid_unparse(dev->device_id, guid);
+		RTE_LOG(INFO, EAL, "VMBUS device %s on NUMA socket %i\n",
+			guid, dev->device.numa_node);
+
+		/* no initialization when blacklisted, return without error */
+		args = dev->device.devargs;
+		if (args && args->type == RTE_DEVTYPE_BLACKLISTED_VMBUS) {
+			RTE_LOG(INFO, EAL, "  Device is blacklisted, not initializing\n");
+			return 1;
+		}
+
+		RTE_LOG(INFO, EAL, "  probe driver: %s\n", dr->driver.name);
+
+		/* map resources for device */
+		ret = rte_eal_vmbus_map_device(dev);
+		if (ret != 0)
+			return ret;
+
+		/* reference driver structure */
+		dev->driver = dr;
+
+		/* call the driver probe() function */
+		ret = dr->probe(dr, dev);
+		if (ret)
+			dev->driver = NULL;
+
+		return ret;
+	}
+
+	/* return positive value if driver doesn't support this device */
+	return 1;
+}
+
+
+/*
+ * If vendor/device ID match, call the remove() function of the
+ * driver.
+ */
+static int
+vmbus_detach_dev(struct rte_vmbus_driver *dr,
+		 struct rte_vmbus_device *dev)
+{
+	const uuid_t *id_table;
+
+	for (id_table = dr->id_table; !uuid_is_null(*id_table); ++id_table) {
+		char guid[UUID_BUF_SZ];
+
+		/* skip devices not assocaited with this device class */
+		if (uuid_compare(*id_table, dev->class_id) != 0)
+			continue;
+
+		uuid_unparse(dev->device_id, guid);
+		RTE_LOG(INFO, EAL, "VMBUS device %s on NUMA socket %i\n",
+			guid, dev->device.numa_node);
+
+		RTE_LOG(DEBUG, EAL, "  remove driver: %s\n", dr->driver.name);
+
+		if (dr->remove && (dr->remove(dev) < 0))
+			return -1;	/* negative value is an error */
+
+		/* clear driver structure */
+		dev->driver = NULL;
+
+		vmbus_uio_unmap_resource(dev);
+		return 0;
+	}
+
+	/* return positive value if driver doesn't support this device */
+	return 1;
+}
+
+/*
+ * call the devinit() function of all
+ * registered drivers for the vmbus device. Return -1 if no driver is
+ * found for this class of vmbus device.
+ * The present assumption is that we have drivers only for vmbus network
+ * devices. That's why we don't check driver's id_table now.
+ */
+static int
+vmbus_probe_all_drivers(struct rte_vmbus_device *dev)
+{
+	struct rte_vmbus_driver *dr = NULL;
+	int ret;
+
+	TAILQ_FOREACH(dr, &vmbus_driver_list, next) {
+		ret = rte_eal_vmbus_probe_one_driver(dr, dev);
+		if (ret < 0) {
+			/* negative value is an error */
+			RTE_LOG(ERR, EAL, "Failed to probe driver %s\n",
+				dr->driver.name);
+			return -1;
+		}
+		/* positive value means driver doesn't support it */
+		if (ret > 0)
+			continue;
+
+		return 0;
+	}
+
+	return 1;
+}
+
+
+/*
+ * If device ID matches, call the remove() function of all
+ * registered driver for the given device. Return -1 if initialization
+ * failed, return 1 if no driver is found for this device.
+ */
+static int
+vmbus_detach_all_drivers(struct rte_vmbus_device *dev)
+{
+	struct rte_vmbus_driver *dr;
+	int rc = 0;
+
+	if (dev == NULL)
+		return -1;
+
+	TAILQ_FOREACH(dr, &vmbus_driver_list, next) {
+		rc = vmbus_detach_dev(dr, dev);
+		if (rc < 0)
+			/* negative value is an error */
+			return -1;
+		if (rc > 0)
+			/* positive value means driver doesn't support it */
+			continue;
+		return 0;
+	}
+	return 1;
+}
+
+/* Detach device specified by its VMBUS id */
+int
+rte_eal_vmbus_detach(uuid_t device_id)
+{
+	struct rte_vmbus_device *dev;
+	char ubuf[UUID_BUF_SZ];
+
+	TAILQ_FOREACH(dev, &vmbus_device_list, next) {
+		if (uuid_compare(dev->device_id, device_id) != 0)
+			continue;
+
+		if (vmbus_detach_all_drivers(dev) < 0)
+			goto err_return;
+
+		TAILQ_REMOVE(&vmbus_device_list, dev, next);
+		free(dev);
+		return 0;
+	}
+	return -1;
+
+err_return:
+	uuid_unparse(device_id, ubuf);
+	RTE_LOG(WARNING, EAL, "Requested device %s cannot be used\n",
+		ubuf);
+	return -1;
+}
+
+/*
+ * Scan the vmbus, and call the devinit() function for
+ * all registered drivers that have a matching entry in its id_table
+ * for discovered devices.
+ */
+int
+rte_eal_vmbus_probe(void)
+{
+	struct rte_vmbus_device *dev = NULL;
+
+	TAILQ_FOREACH(dev, &vmbus_device_list, next) {
+		char ubuf[UUID_BUF_SZ];
+
+		uuid_unparse(dev->device_id, ubuf);
+
+		RTE_LOG(DEBUG, EAL, "Probing driver for device %s ...\n",
+			ubuf);
+		vmbus_probe_all_drivers(dev);
+	}
+	return 0;
+}
+
+/* register vmbus driver */
+void
+rte_eal_vmbus_register(struct rte_vmbus_driver *driver)
+{
+	TAILQ_INSERT_TAIL(&vmbus_driver_list, driver, next);
+}
+
+/* unregister vmbus driver */
+void
+rte_eal_vmbus_unregister(struct rte_vmbus_driver *driver)
+{
+	TAILQ_REMOVE(&vmbus_driver_list, driver, next);
+}
diff --git a/lib/librte_ether/rte_ethdev.c b/lib/librte_ether/rte_ethdev.c
index 1e0f206..6298a8d 100644
--- a/lib/librte_ether/rte_ethdev.c
+++ b/lib/librte_ether/rte_ethdev.c
@@ -3282,3 +3282,93 @@ rte_eth_dev_l2_tunnel_offload_set(uint8_t port_id,
 				-ENOTSUP);
 	return (*dev->dev_ops->l2_tunnel_offload_set)(dev, l2_tunnel, mask, en);
 }
+
+
+#ifdef RTE_LIBRTE_HV_PMD
+int
+rte_eth_dev_vmbus_probe(struct rte_vmbus_driver *vmbus_drv,
+			struct rte_vmbus_device *vmbus_dev)
+{
+	struct eth_driver  *eth_drv = (struct eth_driver *)vmbus_drv;
+	struct rte_eth_dev *eth_dev;
+	char ustr[UUID_BUF_SZ];
+	int diag;
+
+	uuid_unparse(vmbus_dev->device_id, ustr);
+
+	eth_dev = rte_eth_dev_allocate(ustr);
+	if (eth_dev == NULL)
+		return -ENOMEM;
+
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY) {
+		eth_dev->data->dev_private = rte_zmalloc("ethdev private structure",
+				  eth_drv->dev_private_size,
+				  RTE_CACHE_LINE_SIZE);
+		if (eth_dev->data->dev_private == NULL)
+			rte_panic("Cannot allocate memzone for private port data\n");
+	}
+
+	eth_dev->vmbus_dev = vmbus_dev;
+	eth_dev->driver = eth_drv;
+	eth_dev->data->rx_mbuf_alloc_failed = 0;
+
+	/* init user callbacks */
+	TAILQ_INIT(&(eth_dev->link_intr_cbs));
+
+	/*
+	 * Set the default maximum frame size.
+	 */
+	eth_dev->data->mtu = ETHER_MTU;
+
+	/* Invoke PMD device initialization function */
+	diag = (*eth_drv->eth_dev_init)(eth_dev);
+	if (diag == 0)
+		return 0;
+
+	RTE_PMD_DEBUG_TRACE("driver %s: eth_dev_init(%s) failed\n",
+			    vmbus_drv->driver.name, ustr);
+
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+		rte_free(eth_dev->data->dev_private);
+
+	return diag;
+}
+
+int
+rte_eth_dev_vmbus_remove(struct rte_vmbus_device *vmbus_dev)
+{
+	const struct eth_driver *eth_drv;
+	struct rte_eth_dev *eth_dev;
+	char ustr[UUID_BUF_SZ];
+	int ret;
+
+	if (vmbus_dev == NULL)
+		return -EINVAL;
+
+	uuid_unparse(vmbus_dev->device_id, ustr);
+	eth_dev = rte_eth_dev_allocated(ustr);
+	if (eth_dev == NULL)
+		return -ENODEV;
+
+	eth_drv = (const struct eth_driver *)vmbus_dev->driver;
+
+	/* Invoke PMD device uninit function */
+	if (*eth_drv->eth_dev_uninit) {
+		ret = (*eth_drv->eth_dev_uninit)(eth_dev);
+		if (ret)
+			return ret;
+	}
+
+	/* free ether device */
+	rte_eth_dev_release_port(eth_dev);
+
+	if (rte_eal_process_type() == RTE_PROC_PRIMARY)
+		rte_free(eth_dev->data->dev_private);
+
+	eth_dev->pci_dev = NULL;
+	eth_dev->driver = NULL;
+	eth_dev->data = NULL;
+
+	return 0;
+}
+#endif
diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 3c85e33..5050087 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -180,6 +180,7 @@ extern "C" {
 #include <rte_log.h>
 #include <rte_interrupts.h>
 #include <rte_pci.h>
+#include <rte_vmbus.h>
 #include <rte_dev.h>
 #include <rte_devargs.h>
 #include "rte_ether.h"
@@ -1628,7 +1629,11 @@ struct rte_eth_dev {
 	struct rte_eth_dev_data *data;  /**< Pointer to device data */
 	const struct eth_driver *driver;/**< Driver for this device */
 	const struct eth_dev_ops *dev_ops; /**< Functions exported by PMD */
-	struct rte_pci_device *pci_dev; /**< PCI info. supplied by probing */
+	union {
+		struct rte_pci_device *pci_dev; /**< PCI info. */
+		struct rte_vmbus_device *vmbus_dev; /**< VMBUS info. */
+	};
+
 	/** User application callbacks for NIC interrupts */
 	struct rte_eth_dev_cb_list link_intr_cbs;
 	/**
@@ -1866,7 +1871,11 @@ typedef int (*eth_dev_uninit_t)(struct rte_eth_dev *eth_dev);
  * - The size of the private data to allocate for each matching device.
  */
 struct eth_driver {
-	struct rte_pci_driver pci_drv;    /**< The PMD is also a PCI driver. */
+	union {
+		struct rte_pci_driver pci_drv;    /**< The PMD PCI driver. */
+		struct rte_vmbus_driver vmbus_drv;/**< The PMD VMBUS drv. */
+	};
+
 	eth_dev_init_t eth_dev_init;      /**< Device init function. */
 	eth_dev_uninit_t eth_dev_uninit;  /**< Device uninit function. */
 	unsigned int dev_private_size;    /**< Size of device private data. */
@@ -4383,6 +4392,21 @@ int rte_eth_dev_pci_probe(struct rte_pci_driver *pci_drv,
  */
 int rte_eth_dev_pci_remove(struct rte_pci_device *pci_dev);
 
+/**
+ * @internal
+ * Wrapper for use by vmbus drivers as a .probe function to attach to a ethdev
+ * interface.
+ */
+int rte_eth_dev_vmbus_probe(struct rte_vmbus_driver *vmbus_drv,
+			  struct rte_vmbus_device *vmbus_dev);
+
+/**
+ * @internal
+ * Wrapper for use by vmbus drivers as a .remove function to detach a ethdev
+ * interface.
+ */
+int rte_eth_dev_vmbus_remove(struct rte_vmbus_device *vmbus_dev);
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index f75f0e2..6b30408 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -130,6 +130,7 @@ ifeq ($(CONFIG_RTE_LIBRTE_VHOST),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_VHOST)      += -lrte_pmd_vhost
 endif # $(CONFIG_RTE_LIBRTE_VHOST)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_VMXNET3_PMD)    += -lrte_pmd_vmxnet3_uio
+_LDLIBS-$(CONFIG_RTE_LIBRTE_HV_PMD)	    += -luuid
 
 ifeq ($(CONFIG_RTE_LIBRTE_CRYPTODEV),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_AESNI_MB)    += -lrte_pmd_aesni_mb
-- 
2.10.2

^ permalink raw reply related

* [PATCH 1/2] ethdev: increase length ethernet device internal name
From: Stephen Hemminger @ 2016-12-14 23:59 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger
In-Reply-To: <20161214235920.12877-1-sthemmin@microsoft.com>

Allow sufficicent space for UUID in string form (36+1).
Needed to use UUID with Hyper-V

Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
---
 doc/guides/rel_notes/deprecation.rst | 3 +++
 lib/librte_ether/rte_ethdev.h        | 6 +++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 2d17bc6..b83f23a 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -58,6 +58,9 @@ Deprecation Notices
   ``port`` field, may be moved or removed as part of this mbuf work. A
   ``timestamp`` will also be added.
 
+* ethdev: for 17.02 the size of internal device name will be increased
+  to 40 characters to allow for storing UUID.
+
 * The mbuf flags PKT_RX_VLAN_PKT and PKT_RX_QINQ_PKT are deprecated and
   are respectively replaced by PKT_RX_VLAN_STRIPPED and
   PKT_RX_QINQ_STRIPPED, that are better described. The old flags and
diff --git a/lib/librte_ether/rte_ethdev.h b/lib/librte_ether/rte_ethdev.h
index 9678179..3c85e33 100644
--- a/lib/librte_ether/rte_ethdev.h
+++ b/lib/librte_ether/rte_ethdev.h
@@ -1652,7 +1652,11 @@ struct rte_eth_dev_sriov {
 };
 #define RTE_ETH_DEV_SRIOV(dev)         ((dev)->data->sriov)
 
-#define RTE_ETH_NAME_MAX_LEN (32)
+/*
+ * Internal identifier length
+ * Sufficiently large to allow for UUID or PCI address
+ */
+#define RTE_ETH_NAME_MAX_LEN 40
 
 /**
  * @internal
-- 
2.10.2

^ permalink raw reply related

* [PATCH 0/2] support for Hyper-V VMBUS
From: Stephen Hemminger @ 2016-12-14 23:59 UTC (permalink / raw)
  To: dev; +Cc: Stephen Hemminger

This is the core changes required to support VMBUS.
It does cause some ABI change to ethdev etc.

Stephen Hemminger (2):
  ethdev: increase length ethernet device internal name
  hyperv: VMBUS support infrastucture

 doc/guides/rel_notes/deprecation.rst        |   3 +
 lib/librte_eal/common/Makefile              |   2 +-
 lib/librte_eal/common/eal_common_devargs.c  |   7 +
 lib/librte_eal/common/eal_common_options.c  |  38 ++
 lib/librte_eal/common/eal_internal_cfg.h    |   3 +-
 lib/librte_eal/common/eal_options.h         |   6 +
 lib/librte_eal/common/eal_private.h         |   5 +
 lib/librte_eal/common/include/rte_devargs.h |   8 +
 lib/librte_eal/common/include/rte_vmbus.h   | 247 ++++++++
 lib/librte_eal/linuxapp/eal/Makefile        |   6 +
 lib/librte_eal/linuxapp/eal/eal.c           |  11 +
 lib/librte_eal/linuxapp/eal/eal_vmbus.c     | 906 ++++++++++++++++++++++++++++
 lib/librte_ether/rte_ethdev.c               |  90 +++
 lib/librte_ether/rte_ethdev.h               |  34 +-
 mk/rte.app.mk                               |   1 +
 15 files changed, 1362 insertions(+), 5 deletions(-)
 create mode 100644 lib/librte_eal/common/include/rte_vmbus.h
 create mode 100644 lib/librte_eal/linuxapp/eal/eal_vmbus.c

-- 
2.10.2

^ permalink raw reply

* KNI broken again with 4.9 kernel
From: Stephen Hemminger @ 2016-12-14 23:40 UTC (permalink / raw)
  To: dev

/build/lib/librte_eal/linuxapp/kni/igb_main.c:2317:21: error: initialization from incompatible pointer type [-Werror=incompatible-pointer-types]
  .ndo_set_vf_vlan = igb_ndo_set_vf_vlan,
                     ^~~~~~~~~~~~~~~~~~~

I am sure Ferruh Yigit will fix it.

Which raises a couple of questions:
 1. Why is DPDK still keeping KNI support for Intel specific ethtool functionality.
    This always breaks, is code bloat, and means a 3rd copy of base code (Linux, DPDK PMD, + KNI)

 2. Why is KNI not upstream?
    If not acceptable due to security or supportablity then why does it still exist?

 3. If not upstream, then maintainer should track upstream kernel changes and fix DPDK before
    kernel is released.  The ABI is normally set early in the rc cycle weeks before release.

^ permalink raw reply

* Re: [PATCH v4] net/kni: add KNI PMD
From: Yong Wang @ 2016-12-14 19:25 UTC (permalink / raw)
  To: Ferruh Yigit, dev@dpdk.org
In-Reply-To: <c7978d58-7254-c92d-7792-3cfe47e7e296@intel.com>

> -----Original Message-----
> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> Sent: Wednesday, December 14, 2016 8:00 AM
> To: Yong Wang <yongwang@vmware.com>; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v4] net/kni: add KNI PMD
> 
> On 12/12/2016 9:59 PM, Yong Wang wrote:
> >> -----Original Message-----
> >> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
> >> Sent: Wednesday, November 30, 2016 10:12 AM
> >> To: dev@dpdk.org
> >> Cc: Ferruh Yigit <ferruh.yigit@intel.com>; Yong Wang
> >> <yongwang@vmware.com>
> >> Subject: [PATCH v4] net/kni: add KNI PMD
> >>
> >> Add KNI PMD which wraps librte_kni for ease of use.
> >>
> >> KNI PMD can be used as any regular PMD to send / receive packets to the
> >> Linux networking stack.
> >>
> >> Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
> >> ---
> >>
> >> v4:
> >> * allow only single queue
> >> * use driver.name as name
> >>
> >> v3:
> >> * rebase on top of latest master
> >>
> >> v2:
> >> * updated driver name eth_kni -> net_kni
> >> ---
> >>  config/common_base                      |   1 +
> >>  config/common_linuxapp                  |   1 +
> >>  drivers/net/Makefile                    |   1 +
> >>  drivers/net/kni/Makefile                |  63 +++++
> >>  drivers/net/kni/rte_eth_kni.c           | 462
> >> ++++++++++++++++++++++++++++++++
> >>  drivers/net/kni/rte_pmd_kni_version.map |   4 +
> >>  mk/rte.app.mk                           |  10 +-
> >>  7 files changed, 537 insertions(+), 5 deletions(-)
> >>  create mode 100644 drivers/net/kni/Makefile
> >>  create mode 100644 drivers/net/kni/rte_eth_kni.c
> >>  create mode 100644 drivers/net/kni/rte_pmd_kni_version.map
> >>
> >> diff --git a/config/common_base b/config/common_base
> >> index 4bff83a..3385879 100644
> >> --- a/config/common_base
> >> +++ b/config/common_base
> >> @@ -543,6 +543,7 @@ CONFIG_RTE_PIPELINE_STATS_COLLECT=n
> >>  # Compile librte_kni
> >>  #
> >>  CONFIG_RTE_LIBRTE_KNI=n
> >> +CONFIG_RTE_LIBRTE_PMD_KNI=n
> >>  CONFIG_RTE_KNI_KMOD=n
> >>  CONFIG_RTE_KNI_PREEMPT_DEFAULT=y
> >>  CONFIG_RTE_KNI_VHOST=n
> >> diff --git a/config/common_linuxapp b/config/common_linuxapp
> >> index 2483dfa..2ecd510 100644
> >> --- a/config/common_linuxapp
> >> +++ b/config/common_linuxapp
> >> @@ -39,6 +39,7 @@ CONFIG_RTE_EAL_IGB_UIO=y
> >>  CONFIG_RTE_EAL_VFIO=y
> >>  CONFIG_RTE_KNI_KMOD=y
> >>  CONFIG_RTE_LIBRTE_KNI=y
> >> +CONFIG_RTE_LIBRTE_PMD_KNI=y
> >>  CONFIG_RTE_LIBRTE_VHOST=y
> >>  CONFIG_RTE_LIBRTE_PMD_VHOST=y
> >>  CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
> >> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
> >> index bc93230..c4771cd 100644
> >> --- a/drivers/net/Makefile
> >> +++ b/drivers/net/Makefile
> >> @@ -41,6 +41,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_ENIC_PMD) += enic
> >>  DIRS-$(CONFIG_RTE_LIBRTE_FM10K_PMD) += fm10k
> >>  DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e
> >>  DIRS-$(CONFIG_RTE_LIBRTE_IXGBE_PMD) += ixgbe
> >> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += kni
> >>  DIRS-$(CONFIG_RTE_LIBRTE_MLX4_PMD) += mlx4
> >>  DIRS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5
> >>  DIRS-$(CONFIG_RTE_LIBRTE_MPIPE_PMD) += mpipe
> >> diff --git a/drivers/net/kni/Makefile b/drivers/net/kni/Makefile
> >> new file mode 100644
> >> index 0000000..0b7cf91
> >> --- /dev/null
> >> +++ b/drivers/net/kni/Makefile
> >> @@ -0,0 +1,63 @@
> >> +#   BSD LICENSE
> >> +#
> >> +#   Copyright(c) 2016 Intel Corporation. All rights reserved.
> >> +#
> >> +#   Redistribution and use in source and binary forms, with or without
> >> +#   modification, are permitted provided that the following conditions
> >> +#   are met:
> >> +#
> >> +#     * Redistributions of source code must retain the above copyright
> >> +#       notice, this list of conditions and the following disclaimer.
> >> +#     * Redistributions in binary form must reproduce the above copyright
> >> +#       notice, this list of conditions and the following disclaimer in
> >> +#       the documentation and/or other materials provided with the
> >> +#       distribution.
> >> +#     * Neither the name of Intel Corporation nor the names of its
> >> +#       contributors may be used to endorse or promote products derived
> >> +#       from this software without specific prior written permission.
> >> +#
> >> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> >> CONTRIBUTORS
> >> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
> BUT
> >> NOT
> >> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> >> FITNESS FOR
> >> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> >> COPYRIGHT
> >> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> >> INCIDENTAL,
> >> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
> BUT
> >> NOT
> >> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> LOSS
> >> OF USE,
> >> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
> >> AND ON ANY
> >> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
> >> TORT
> >> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
> OF
> >> THE USE
> >> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> >> DAMAGE.
> >> +
> >> +include $(RTE_SDK)/mk/rte.vars.mk
> >> +
> >> +#
> >> +# library name
> >> +#
> >> +LIB = librte_pmd_kni.a
> >> +
> >> +CFLAGS += -O3
> >> +CFLAGS += $(WERROR_FLAGS)
> >> +LDLIBS += -lpthread
> >> +
> >> +EXPORT_MAP := rte_pmd_kni_version.map
> >> +
> >> +LIBABIVER := 1
> >> +
> >> +#
> >> +# all source are stored in SRCS-y
> >> +#
> >> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += rte_eth_kni.c
> >> +
> >> +#
> >> +# Export include files
> >> +#
> >> +SYMLINK-y-include +=
> >> +
> >> +# this lib depends upon:
> >> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_eal
> >> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_ether
> >> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_kni
> >> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_mbuf
> >> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_mempool
> >> +
> >> +include $(RTE_SDK)/mk/rte.lib.mk
> >> diff --git a/drivers/net/kni/rte_eth_kni.c b/drivers/net/kni/rte_eth_kni.c
> >> new file mode 100644
> >> index 0000000..6c4df96
> >> --- /dev/null
> >> +++ b/drivers/net/kni/rte_eth_kni.c
> >> @@ -0,0 +1,462 @@
> >> +/*-
> >> + *   BSD LICENSE
> >> + *
> >> + *   Copyright(c) 2016 Intel Corporation. All rights reserved.
> >> + *   All rights reserved.
> >> + *
> >> + *   Redistribution and use in source and binary forms, with or without
> >> + *   modification, are permitted provided that the following conditions
> >> + *   are met:
> >> + *
> >> + *     * Redistributions of source code must retain the above copyright
> >> + *       notice, this list of conditions and the following disclaimer.
> >> + *     * Redistributions in binary form must reproduce the above copyright
> >> + *       notice, this list of conditions and the following disclaimer in
> >> + *       the documentation and/or other materials provided with the
> >> + *       distribution.
> >> + *     * Neither the name of Intel Corporation nor the names of its
> >> + *       contributors may be used to endorse or promote products derived
> >> + *       from this software without specific prior written permission.
> >> + *
> >> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
> >> CONTRIBUTORS
> >> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
> BUT
> >> NOT
> >> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
> >> FITNESS FOR
> >> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
> >> COPYRIGHT
> >> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
> >> INCIDENTAL,
> >> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING,
> BUT
> >> NOT
> >> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
> LOSS
> >> OF USE,
> >> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
> >> AND ON ANY
> >> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
> >> TORT
> >> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
> OF
> >> THE USE
> >> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
> >> DAMAGE.
> >> + */
> >> +
> >> +#include <fcntl.h>
> >> +#include <pthread.h>
> >> +#include <unistd.h>
> >> +
> >> +#include <rte_ethdev.h>
> >> +#include <rte_kni.h>
> >> +#include <rte_malloc.h>
> >> +#include <rte_vdev.h>
> >> +
> >> +/* Only single queue supported */
> >> +#define KNI_MAX_QUEUE_PER_PORT 1
> >> +
> >> +#define MAX_PACKET_SZ 2048
> >> +#define MAX_KNI_PORTS 8
> >> +
> >> +struct pmd_queue_stats {
> >> +	uint64_t pkts;
> >> +	uint64_t bytes;
> >> +	uint64_t err_pkts;
> >> +};
> >> +
> >> +struct pmd_queue {
> >> +	struct pmd_internals *internals;
> >> +	struct rte_mempool *mb_pool;
> >> +
> >> +	struct pmd_queue_stats rx;
> >> +	struct pmd_queue_stats tx;
> >> +};
> >> +
> >> +struct pmd_internals {
> >> +	struct rte_kni *kni;
> >> +	int is_kni_started;
> >> +
> >> +	pthread_t thread;
> >> +	int stop_thread;
> >> +
> >> +	struct pmd_queue rx_queues[KNI_MAX_QUEUE_PER_PORT];
> >> +	struct pmd_queue tx_queues[KNI_MAX_QUEUE_PER_PORT];
> >> +};
> >> +
> >> +static struct ether_addr eth_addr;
> >> +static struct rte_eth_link pmd_link = {
> >> +		.link_speed = ETH_SPEED_NUM_10G,
> >> +		.link_duplex = ETH_LINK_FULL_DUPLEX,
> >> +		.link_status = 0
> >> +};
> >> +static int is_kni_initialized;
> >> +
> >> +static uint16_t
> >> +eth_kni_rx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
> >> +{
> >> +	struct pmd_queue *kni_q = q;
> >> +	struct rte_kni *kni = kni_q->internals->kni;
> >> +	uint16_t nb_pkts;
> >> +
> >> +	nb_pkts = rte_kni_rx_burst(kni, bufs, nb_bufs);
> >> +
> >> +	kni_q->rx.pkts += nb_pkts;
> >> +	kni_q->rx.err_pkts += nb_bufs - nb_pkts;
> >> +
> >> +	return nb_pkts;
> >> +}
> >> +
> >> +static uint16_t
> >> +eth_kni_tx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
> >> +{
> >> +	struct pmd_queue *kni_q = q;
> >> +	struct rte_kni *kni = kni_q->internals->kni;
> >> +	uint16_t nb_pkts;
> >> +
> >> +	nb_pkts =  rte_kni_tx_burst(kni, bufs, nb_bufs);
> >> +
> >> +	kni_q->tx.pkts += nb_pkts;
> >> +	kni_q->tx.err_pkts += nb_bufs - nb_pkts;
> >> +
> >> +	return nb_pkts;
> >> +}
> >> +
> >> +static void *
> >> +kni_handle_request(void *param)
> >> +{
> >> +	struct pmd_internals *internals = param;
> >> +#define MS 1000
> >> +
> >> +	while (!internals->stop_thread) {
> >> +		rte_kni_handle_request(internals->kni);
> >> +		usleep(500 * MS);
> >> +	}
> >> +
> >> +	return param;
> >> +}
> >> +
> >
> > Do we really need a thread to handle request by default? I know there are
> apps that handle request their own way and having a separate thread could
> add synchronization problems.  Can we at least add an option to disable this?
> 
> I didn't think about there can be a use case that requires own request
> handling.
> 
> But, kni requests should be handled to make kni interface run properly,
> and to handle interface "kni" handler (internals->kni) required, which
> this PMD doesn't expose.
> 
> So, just disabling this thread won't work on its own.

I understand that and what I am asking is a way to at least disable this without having to make code changes for applications that have their own way of handling KNI request and the callback mentioned below sounds good to me.  I am fine with adding this capability with this commit or in a separate commit after you have this commit checked in.
 
> A solution can be found, like callback registraion, or get_handler API,
> but if an application has custom request handling, perhaps it may prefer
> to use kni library directly instead of this wrapper, since wrapper
> already doesn't expose all kni features.

I think one of the motivation of having KNI pmd is that it's abstracted the same way as other physical or virtual devices.  I think it makes sense to achieve  feature parity with the KNI library as much as possible.  What's currently supported in KNI library but missing in KNI PMD and any specific reason they are not supported?

> >
> >> +static int
> >> +eth_kni_start(struct rte_eth_dev *dev)
> >> +{
> >> +	struct pmd_internals *internals = dev->data->dev_private;
> >> +	uint16_t port_id = dev->data->port_id;
> >> +	struct rte_mempool *mb_pool;
> >> +	struct rte_kni_conf conf;
> >> +	const char *name = dev->data->name + 4; /* remove net_ */
> >> +
> >> +	snprintf(conf.name, RTE_KNI_NAMESIZE, "%s", name);
> >> +	conf.force_bind = 0;
> >> +	conf.group_id = port_id;
> >> +	conf.mbuf_size = MAX_PACKET_SZ;
> >> +	mb_pool = internals->rx_queues[0].mb_pool;
> >> +
> >> +	internals->kni = rte_kni_alloc(mb_pool, &conf, NULL);
> >> +	if (internals->kni == NULL) {
> >> +		RTE_LOG(ERR, PMD,
> >> +			"Fail to create kni for port: %d\n", port_id);
> >> +		return -1;
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int
> >> +eth_kni_dev_start(struct rte_eth_dev *dev)
> >> +{
> >> +	struct pmd_internals *internals = dev->data->dev_private;
> >> +	int ret;
> >> +
> >> +	if (internals->is_kni_started == 0) {
> >> +		ret = eth_kni_start(dev);
> >> +		if (ret)
> >> +			return -1;
> >> +		internals->is_kni_started = 1;
> >> +	}
> >> +
> >
> > In case is_kni_started is 1 already,  shouldn't we return directly instead of
> proceeding?
> 
> "is_kni_started" is just to protect "eth_kni_start()", as you can see it
> doesn't have a counterpart in eth_kni_dev_stop(). This flag is to be
> sure "eth_kni_start()" called only once during PMD life cycle.
> 
> The check you mentioned already done, start() / stop() functions already
> balanced by APIs calling these functions.

What about KNI request handing thread then?  Is it safe to have multiple threads calling into rte_kni_handle_request()? My understanding is that this is not safe as kni_fifo is not multi-thread safe.  It's also a bit wasteful to create multiple threads here.

> >
> >> +	ret = pthread_create(&internals->thread, NULL, kni_handle_request,
> >> +			internals);
> >> +	if (ret) {
> >> +		RTE_LOG(ERR, PMD, "Fail to create kni request thread\n");
> >> +		return -1;
> >> +	}
> >> +
> >> +	dev->data->dev_link.link_status = 1;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static void
> >> +eth_kni_dev_stop(struct rte_eth_dev *dev)
> >> +{
> >> +	struct pmd_internals *internals = dev->data->dev_private;
> >> +	int ret;
> >> +
> >> +	internals->stop_thread = 1;
> >> +
> >> +	ret = pthread_cancel(internals->thread);
> >> +	if (ret)
> >> +		RTE_LOG(ERR, PMD, "Can't cancel the thread\n");
> >> +
> >> +	ret = pthread_join(internals->thread, NULL);
> >> +	if (ret)
> >> +		RTE_LOG(ERR, PMD, "Can't join the thread\n");
> >> +
> >> +	internals->stop_thread = 0;
> >> +
> >> +	dev->data->dev_link.link_status = 0;
> >> +}
> >> +
> >> +static int
> >> +eth_kni_dev_configure(struct rte_eth_dev *dev __rte_unused)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static void
> >> +eth_kni_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info
> >> *dev_info)
> >> +{
> >> +	struct rte_eth_dev_data *data = dev->data;
> >> +
> >> +	dev_info->driver_name = data->drv_name;
> >> +	dev_info->max_mac_addrs = 1;
> >> +	dev_info->max_rx_pktlen = (uint32_t)-1;
> >> +	dev_info->max_rx_queues = KNI_MAX_QUEUE_PER_PORT;
> >> +	dev_info->max_tx_queues = KNI_MAX_QUEUE_PER_PORT;
> >> +	dev_info->min_rx_bufsize = 0;
> >> +	dev_info->pci_dev = NULL;
> >> +}
> >> +
> >> +static int
> >> +eth_kni_rx_queue_setup(struct rte_eth_dev *dev,
> >> +		uint16_t rx_queue_id,
> >> +		uint16_t nb_rx_desc __rte_unused,
> >> +		unsigned int socket_id __rte_unused,
> >> +		const struct rte_eth_rxconf *rx_conf __rte_unused,
> >> +		struct rte_mempool *mb_pool)
> >> +{
> >> +	struct pmd_internals *internals = dev->data->dev_private;
> >> +	struct pmd_queue *q;
> >> +
> >> +	q = &internals->rx_queues[rx_queue_id];
> >> +	q->internals = internals;
> >> +	q->mb_pool = mb_pool;
> >> +
> >> +	dev->data->rx_queues[rx_queue_id] = q;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int
> >> +eth_kni_tx_queue_setup(struct rte_eth_dev *dev,
> >> +		uint16_t tx_queue_id,
> >> +		uint16_t nb_tx_desc __rte_unused,
> >> +		unsigned int socket_id __rte_unused,
> >> +		const struct rte_eth_txconf *tx_conf __rte_unused)
> >> +{
> >> +	struct pmd_internals *internals = dev->data->dev_private;
> >> +	struct pmd_queue *q;
> >> +
> >> +	q = &internals->tx_queues[tx_queue_id];
> >> +	q->internals = internals;
> >> +
> >> +	dev->data->tx_queues[tx_queue_id] = q;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static void
> >> +eth_kni_queue_release(void *q __rte_unused)
> >> +{
> >> +}
> >> +
> >> +static int
> >> +eth_kni_link_update(struct rte_eth_dev *dev __rte_unused,
> >> +		int wait_to_complete __rte_unused)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static void
> >> +eth_kni_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
> >> +{
> >> +	unsigned long rx_packets_total = 0, rx_bytes_total = 0;
> >> +	unsigned long tx_packets_total = 0, tx_bytes_total = 0;
> >> +	struct rte_eth_dev_data *data = dev->data;
> >> +	unsigned long tx_packets_err_total = 0;
> >> +	unsigned int i, num_stats;
> >> +	struct pmd_queue *q;
> >> +
> >> +	num_stats = RTE_MIN((unsigned
> >> int)RTE_ETHDEV_QUEUE_STAT_CNTRS,
> >> +			data->nb_rx_queues);
> >> +	for (i = 0; i < num_stats; i++) {
> >> +		q = data->rx_queues[i];
> >> +		stats->q_ipackets[i] = q->rx.pkts;
> >> +		stats->q_ibytes[i] = q->rx.bytes;
> >> +		rx_packets_total += stats->q_ipackets[i];
> >> +		rx_bytes_total += stats->q_ibytes[i];
> >> +	}
> >> +
> >> +	num_stats = RTE_MIN((unsigned
> >> int)RTE_ETHDEV_QUEUE_STAT_CNTRS,
> >> +			data->nb_tx_queues);
> >> +	for (i = 0; i < num_stats; i++) {
> >> +		q = data->tx_queues[i];
> >> +		stats->q_opackets[i] = q->tx.pkts;
> >> +		stats->q_obytes[i] = q->tx.bytes;
> >> +		stats->q_errors[i] = q->tx.err_pkts;
> >> +		tx_packets_total += stats->q_opackets[i];
> >> +		tx_bytes_total += stats->q_obytes[i];
> >> +		tx_packets_err_total += stats->q_errors[i];
> >> +	}
> >> +
> >> +	stats->ipackets = rx_packets_total;
> >> +	stats->ibytes = rx_bytes_total;
> >> +	stats->opackets = tx_packets_total;
> >> +	stats->obytes = tx_bytes_total;
> >> +	stats->oerrors = tx_packets_err_total;
> >> +}
> >> +
> >> +static void
> >> +eth_kni_stats_reset(struct rte_eth_dev *dev)
> >> +{
> >> +	struct rte_eth_dev_data *data = dev->data;
> >> +	struct pmd_queue *q;
> >> +	unsigned int i;
> >> +
> >> +	for (i = 0; i < data->nb_rx_queues; i++) {
> >> +		q = data->rx_queues[i];
> >> +		q->rx.pkts = 0;
> >> +		q->rx.bytes = 0;
> >> +	}
> >> +	for (i = 0; i < data->nb_tx_queues; i++) {
> >> +		q = data->tx_queues[i];
> >> +		q->tx.pkts = 0;
> >> +		q->tx.bytes = 0;
> >> +		q->tx.err_pkts = 0;
> >> +	}
> >> +}
> >> +
> >> +static const struct eth_dev_ops eth_kni_ops = {
> >> +	.dev_start = eth_kni_dev_start,
> >> +	.dev_stop = eth_kni_dev_stop,
> >> +	.dev_configure = eth_kni_dev_configure,
> >> +	.dev_infos_get = eth_kni_dev_info,
> >> +	.rx_queue_setup = eth_kni_rx_queue_setup,
> >> +	.tx_queue_setup = eth_kni_tx_queue_setup,
> >> +	.rx_queue_release = eth_kni_queue_release,
> >> +	.tx_queue_release = eth_kni_queue_release,
> >> +	.link_update = eth_kni_link_update,
> >> +	.stats_get = eth_kni_stats_get,
> >> +	.stats_reset = eth_kni_stats_reset,
> >> +};
> >> +
> >> +static struct rte_vdev_driver eth_kni_drv;
> >> +
> >> +static struct rte_eth_dev *
> >> +eth_kni_create(const char *name, unsigned int numa_node)
> >> +{
> >> +	struct pmd_internals *internals = NULL;
> >> +	struct rte_eth_dev_data *data;
> >> +	struct rte_eth_dev *eth_dev;
> >> +
> >> +	RTE_LOG(INFO, PMD, "Creating kni ethdev on numa socket %u\n",
> >> +			numa_node);
> >> +
> >> +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
> >> +	if (data == NULL)
> >> +		goto error;
> >> +
> >> +	internals = rte_zmalloc_socket(name, sizeof(*internals), 0,
> >> numa_node);
> >> +	if (internals == NULL)
> >> +		goto error;
> >> +
> >> +	/* reserve an ethdev entry */
> >> +	eth_dev = rte_eth_dev_allocate(name);
> >> +	if (eth_dev == NULL)
> >> +		goto error;
> >> +
> >> +	data->dev_private = internals;
> >> +	data->port_id = eth_dev->data->port_id;
> >> +	memmove(data->name, eth_dev->data->name, sizeof(data-
> >>> name));
> >> +	data->nb_rx_queues = 1;
> >> +	data->nb_tx_queues = 1;
> >> +	data->dev_link = pmd_link;
> >> +	data->mac_addrs = &eth_addr;
> >> +
> >> +	eth_dev->data = data;
> >> +	eth_dev->dev_ops = &eth_kni_ops;
> >> +	eth_dev->driver = NULL;
> >> +
> >> +	data->dev_flags = RTE_ETH_DEV_DETACHABLE;
> >> +	data->kdrv = RTE_KDRV_NONE;
> >> +	data->drv_name = eth_kni_drv.driver.name;
> >> +	data->numa_node = numa_node;
> >> +
> >> +	return eth_dev;
> >> +
> >> +error:
> >> +	rte_free(data);
> >> +	rte_free(internals);
> >> +
> >> +	return NULL;
> >> +}
> >> +
> >> +static int
> >> +kni_init(void)
> >> +{
> >> +	if (is_kni_initialized == 0)
> >> +		rte_kni_init(MAX_KNI_PORTS);
> >> +
> >> +	is_kni_initialized += 1;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int
> >> +eth_kni_probe(const char *name, const char *params __rte_unused)
> >> +{
> >> +	struct rte_eth_dev *eth_dev;
> >> +	int ret;
> >> +
> >> +	RTE_LOG(INFO, PMD, "Initializing eth_kni for %s\n", name);
> >> +
> >> +	ret = kni_init();
> >> +	if (ret < 0)
> >> +		/* Not return error to prevent panic in rte_eal_init() */
> >> +		return 0;
> >
> > If we don't return error here, the application that needs to add KNI ports
> eventually will fail.  If it's a fail-stop situation, isn't it better to return error
> where the it happened?
> 
> I am not sure this is fail-stop situation, but instead this gives a
> chance to applicaton for a graceful exit.
> 
> If an error value returned here, it will lead to a rte_panic() and
> application terminated abnormally!
> 
> But if we return a success at this point, since no ethernet device
> created, there is no handler in application to use, which also means no
> KNI interface created.
> Application can check number of ports and recognize KNI port is missing,
> app may chose to terminate or not, also it prefers to terminate, can do
> it properly.

I might be wrong but as far as I know,  other virtual or physical PMDS do not have this behavior.  What you proposed makes sense but it also means that the application needs extra logic (checking if all ports are successfully initialized) to handle such failures (depending on the application, it might be able to proceed or it might need to fail-stop).  Personally I would prefer consistency across all PMDs here no matter what behavior we choose here as that's the "contract" the application needs to know.
 
> >
> >> +	eth_dev = eth_kni_create(name, rte_socket_id());
> >> +	if (eth_dev == NULL)
> >> +		return -1;
> >> +
> >> +	eth_dev->rx_pkt_burst = eth_kni_rx;
> >> +	eth_dev->tx_pkt_burst = eth_kni_tx;
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int
> >> +eth_kni_remove(const char *name)
> >> +{
> >> +	struct rte_eth_dev *eth_dev;
> >> +	struct pmd_internals *internals;
> >> +
> >> +	RTE_LOG(INFO, PMD, "Un-Initializing eth_kni for %s\n", name);
> >> +
> >> +	/* find the ethdev entry */
> >> +	eth_dev = rte_eth_dev_allocated(name);
> >> +	if (eth_dev == NULL)
> >> +		return -1;
> >> +
> >> +	eth_kni_dev_stop(eth_dev);
> >> +
> >> +	if (eth_dev->data) {
> >> +		internals = eth_dev->data->dev_private;
> >> +		rte_kni_release(internals->kni);
> >> +
> >> +		rte_free(internals);
> >> +	}
> >> +	rte_free(eth_dev->data);
> >> +
> >> +	rte_eth_dev_release_port(eth_dev);
> >> +
> >> +	is_kni_initialized -= 1;
> >> +	if (is_kni_initialized == 0)
> >> +		rte_kni_close();
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static struct rte_vdev_driver eth_kni_drv = {
> >> +	.probe = eth_kni_probe,
> >> +	.remove = eth_kni_remove,
> >> +};
> >> +
> >> +RTE_PMD_REGISTER_VDEV(net_kni, eth_kni_drv);
> >> diff --git a/drivers/net/kni/rte_pmd_kni_version.map
> >> b/drivers/net/kni/rte_pmd_kni_version.map
> >> new file mode 100644
> >> index 0000000..31eca32
> >> --- /dev/null
> >> +++ b/drivers/net/kni/rte_pmd_kni_version.map
> >> @@ -0,0 +1,4 @@
> >> +DPDK_17.02 {
> >> +
> >> +	local: *;
> >> +};
> >> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> >> index f75f0e2..af02816 100644
> >> --- a/mk/rte.app.mk
> >> +++ b/mk/rte.app.mk
> >> @@ -59,11 +59,6 @@ _LDLIBS-y += -L$(RTE_SDK_BIN)/lib
> >>  #
> >>  # Order is important: from higher level to lower level
> >>  #
> >> -
> >> -ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
> >> -_LDLIBS-$(CONFIG_RTE_LIBRTE_KNI)            += -lrte_kni
> >> -endif
> >> -
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_PIPELINE)       += -lrte_pipeline
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_TABLE)          += -lrte_table
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
> >> @@ -84,6 +79,10 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)          += -
> >> lrte_power
> >>
> >>  _LDLIBS-y += --whole-archive
> >>
> >> +ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
> >> +_LDLIBS-$(CONFIG_RTE_LIBRTE_KNI)            += -lrte_kni
> >> +endif
> >> +
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_TIMER)          += -lrte_timer
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
> >> @@ -115,6 +114,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_ENIC_PMD)       +=
> -
> >> lrte_pmd_enic
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_FM10K_PMD)      += -lrte_pmd_fm10k
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_I40E_PMD)       += -lrte_pmd_i40e
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_IXGBE_PMD)      += -lrte_pmd_ixgbe
> >> +_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KNI)        += -lrte_pmd_kni
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -
> >> libverbs
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX5_PMD)       += -lrte_pmd_mlx5 -
> >> libverbs
> >>  _LDLIBS-$(CONFIG_RTE_LIBRTE_MPIPE_PMD)      += -lrte_pmd_mpipe -
> lgxio
> >> --
> >> 2.9.3
> >

^ permalink raw reply

* [PATCH v4] null: fake PMD capabilities
From: Michał Mirosław @ 2016-12-14 19:16 UTC (permalink / raw)
  To: dev
In-Reply-To: <66806fbdd31de8610a0be5463d813eaf7b816e85.1481592848.git.mirq-linux@rere.qmqm.pl>

From: Paweł Małachowski <pawel.malachowski@atendesoftware.pl>

This allows for testing code which needs offloads to be supported.

Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
---
 drivers/net/null/rte_eth_null.c | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/drivers/net/null/rte_eth_null.c b/drivers/net/null/rte_eth_null.c
index 836d982..e60516f 100644
--- a/drivers/net/null/rte_eth_null.c
+++ b/drivers/net/null/rte_eth_null.c
@@ -284,6 +284,9 @@ eth_tx_queue_setup(struct rte_eth_dev *dev, uint16_t tx_queue_id,
 	return 0;
 }
 
+static void
+eth_dev_void_ok(struct rte_eth_dev *dev __rte_unused) { return; }
+
 
 static void
 eth_dev_info(struct rte_eth_dev *dev,
@@ -304,6 +307,19 @@ eth_dev_info(struct rte_eth_dev *dev,
 	dev_info->pci_dev = NULL;
 	dev_info->reta_size = internals->reta_size;
 	dev_info->flow_type_rss_offloads = internals->flow_type_rss_offloads;
+	/* We hereby declare we can RX offload VLAN-s out of thin air (they are not there)
+	 */
+	dev_info->rx_offload_capa = DEV_RX_OFFLOAD_VLAN_STRIP |
+		DEV_RX_OFFLOAD_QINQ_STRIP;
+	/* We promise we will do any TX offloads before passing packets to /dev/null
+	 */
+	dev_info->tx_offload_capa = DEV_TX_OFFLOAD_VLAN_INSERT |
+		DEV_TX_OFFLOAD_IPV4_CKSUM | DEV_TX_OFFLOAD_UDP_CKSUM |
+		DEV_TX_OFFLOAD_TCP_CKSUM | DEV_TX_OFFLOAD_SCTP_CKSUM |
+		DEV_TX_OFFLOAD_TCP_TSO | DEV_TX_OFFLOAD_UDP_TSO |
+		DEV_TX_OFFLOAD_OUTER_IPV4_CKSUM | DEV_TX_OFFLOAD_QINQ_INSERT |
+		DEV_TX_OFFLOAD_VXLAN_TNL_TSO | DEV_TX_OFFLOAD_GRE_TNL_TSO |
+		DEV_TX_OFFLOAD_IPIP_TNL_TSO | DEV_TX_OFFLOAD_GENEVE_TNL_TSO;
 }
 
 static void
@@ -477,7 +493,12 @@ static const struct eth_dev_ops ops = {
 	.reta_update = eth_rss_reta_update,
 	.reta_query = eth_rss_reta_query,
 	.rss_hash_update = eth_rss_hash_update,
-	.rss_hash_conf_get = eth_rss_hash_conf_get
+	.rss_hash_conf_get = eth_rss_hash_conf_get,
+	/* Fake our capabilities */
+	.promiscuous_enable   = eth_dev_void_ok,
+	.promiscuous_disable  = eth_dev_void_ok,
+	.allmulticast_enable  = eth_dev_void_ok,
+	.allmulticast_disable = eth_dev_void_ok
 };
 
 int
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH 11/13] KNI: guard against unterminated dev_info.name leading to BUG in alloc_netdev()
From: Ferruh Yigit @ 2016-12-14 17:48 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: dev
In-Reply-To: <20161214173755.lnfn3qlnqdmanqi2@rere.qmqm.pl>

On 12/14/2016 5:37 PM, Michał Mirosław wrote:
> On Wed, Dec 14, 2016 at 05:33:11PM +0000, Ferruh Yigit wrote:
>> On 12/13/2016 1:08 AM, Michał Mirosław wrote:
>>> Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
>>> ---
>>
>> Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
>>
>> I guess no harm doing extra check on user input.
> 
> This actually prevents an OOPS that happens when you try to create KNI with
> too long name.

Thanks for fixing then.

^ permalink raw reply

* Re: [PATCH 07/13] pcap: fix timestamps in output pcap file
From: Ferruh Yigit @ 2016-12-14 17:45 UTC (permalink / raw)
  To: Michał Mirosław, dev
In-Reply-To: <2a15ee2b33cf58e8cfbfe8d51d3af32b97595424.1481590851.git.mirq-linux@rere.qmqm.pl>

On 12/13/2016 1:08 AM, Michał Mirosław wrote:
> From: Piotr Bartosiewicz <piotr.bartosiewicz@atendesoftware.pl>
> 
> Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
> ---

Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>

^ permalink raw reply

* Re: [PATCH 11/13] KNI: guard against unterminated dev_info.name leading to BUG in alloc_netdev()
From: Michał Mirosław @ 2016-12-14 17:37 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev
In-Reply-To: <5b69bbf4-f570-26c4-79e4-c3342769d81d@intel.com>

On Wed, Dec 14, 2016 at 05:33:11PM +0000, Ferruh Yigit wrote:
> On 12/13/2016 1:08 AM, Michał Mirosław wrote:
> > Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
> > ---
> 
> Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>
> 
> I guess no harm doing extra check on user input.

This actually prevents an OOPS that happens when you try to create KNI with
too long name.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH 10/13] KNI: provided netif name's source is user-space
From: Ferruh Yigit @ 2016-12-14 17:35 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: dev
In-Reply-To: <ac5c7173-0250-de63-1383-122be21472c4@intel.com>

On 12/14/2016 5:35 PM, Ferruh Yigit wrote:
> On 12/14/2016 5:19 PM, Michał Mirosław wrote:
>> On Wed, Dec 14, 2016 at 05:06:23PM +0000, Ferruh Yigit wrote:
>>> Hi Michal,
>>>
>>> On 12/13/2016 1:08 AM, Michał Mirosław wrote:
>>>> Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
>>>> ---

Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>

^ permalink raw reply

* Re: [PATCH 10/13] KNI: provided netif name's source is user-space
From: Ferruh Yigit @ 2016-12-14 17:35 UTC (permalink / raw)
  To: Michał Mirosław; +Cc: dev
In-Reply-To: <20161214171911.nupex5gi5jxbbbhm@rere.qmqm.pl>

On 12/14/2016 5:19 PM, Michał Mirosław wrote:
> On Wed, Dec 14, 2016 at 05:06:23PM +0000, Ferruh Yigit wrote:
>> Hi Michal,
>>
>> On 12/13/2016 1:08 AM, Michał Mirosław wrote:
>>> Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
>>> ---
>>>  lib/librte_eal/linuxapp/kni/kni_misc.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/lib/librte_eal/linuxapp/kni/kni_misc.c b/lib/librte_eal/linuxapp/kni/kni_misc.c
>>> index 497db9b..f0247aa 100644
>>> --- a/lib/librte_eal/linuxapp/kni/kni_misc.c
>>> +++ b/lib/librte_eal/linuxapp/kni/kni_misc.c
>>> @@ -363,8 +363,8 @@ kni_ioctl_create(struct net *net, uint32_t ioctl_num,
>>>  	up_read(&knet->kni_list_lock);
>>>  
>>>  	net_dev = alloc_netdev(sizeof(struct kni_dev), dev_info.name,
>>> -#ifdef NET_NAME_UNKNOWN
>>> -							NET_NAME_UNKNOWN,
>>> +#ifdef NET_NAME_USER
>>> +							NET_NAME_USER,
>>
>> Probably NET_NAME_USER fits better to definition.
>> But functionally, I wonder if you have a use case to use
>> "name_assign_type" ?
> 
> I just read what NET_NAME_UNKNOWN/NET_NAME_USER is supposed to do.
> UNKNOWN is for "old" drivers that don't know the source of the name.
> USER is for a name, that comes from a user and as such is not to be
> renamed by udev.

That is what I wonder if you are doing anything special in userspace
related iface name. But it seems the patch is to make it more proper, I
have no objection.

> 
> Best Regards,
> Michał Mirosław
> 

^ permalink raw reply

* Re: [PATCH 11/13] KNI: guard against unterminated dev_info.name leading to BUG in alloc_netdev()
From: Ferruh Yigit @ 2016-12-14 17:33 UTC (permalink / raw)
  To: Michał Mirosław, dev
In-Reply-To: <d1c6dd8ca22b51b781b2ce9f305f6a39ba9a914b.1481590851.git.mirq-linux@rere.qmqm.pl>

On 12/13/2016 1:08 AM, Michał Mirosław wrote:
> Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
> ---

Acked-by: Ferruh Yigit <ferruh.yigit@intel.com>

I guess no harm doing extra check on user input.

^ permalink raw reply

* [PATCH] acl: remove invalid test
From: Michał Mirosław @ 2016-12-14 17:23 UTC (permalink / raw)
  To: dev; +Cc: Ananyev, Konstantin
In-Reply-To: <f6d14b640928f69ab747793cd8749b48a1b74afb.1481590851.git.mirq-linux@rere.qmqm.pl>

rte_acl_add_rules() has no way of checking rule size.

This was hidden because the test effectively checked that
adding a rule with userdata == 0 failed.

Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
---
 app/test/test_acl.c | 20 --------------------
 1 file changed, 20 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 28955f0..be744ec 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -1515,26 +1515,6 @@ test_invalid_parameters(void)
 	/* free ACL context */
 	rte_acl_free(acx);
 
-	/* set wrong rule_size so that adding any rules would fail */
-	param.rule_size = RTE_ACL_IPV4VLAN_RULE_SZ + 4;
-	acx = rte_acl_create(&param);
-	if (acx == NULL) {
-		printf("Line %i: ACL context creation failed!\n", __LINE__);
-		return -1;
-	}
-
-	/* try adding a rule with size different from context rule_size */
-	result = rte_acl_ipv4vlan_add_rules(acx, &rule, 1);
-	if (result == 0) {
-		printf("Line %i: Adding an invalid sized rule "
-				"should have failed!\n", __LINE__);
-		rte_acl_free(acx);
-		return -1;
-	}
-
-	/* free ACL context */
-	rte_acl_free(acx);
-
 
 	/**
 	 * rte_acl_ipv4vlan_build
-- 
2.10.2

^ permalink raw reply related

* [PATCH v2] acl: allow zero verdict
From: Michał Mirosław @ 2016-12-14 17:23 UTC (permalink / raw)
  To: dev; +Cc: Ananyev, Konstantin
In-Reply-To: <f6d14b640928f69ab747793cd8749b48a1b74afb.1481590851.git.mirq-linux@rere.qmqm.pl>

Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
---
v2: fixes prog_guide and ACL tests

---
 app/test/test_acl.c                                  | 13 -------------
 doc/guides/prog_guide/packet_classif_access_ctrl.rst |  3 ++-
 lib/librte_acl/rte_acl.c                             |  3 +--
 lib/librte_acl/rte_acl.h                             |  2 --
 lib/librte_table/rte_table_acl.c                     |  2 +-
 5 files changed, 4 insertions(+), 19 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index be744ec..c6b511f 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -1357,19 +1357,6 @@ test_invalid_rules(void)
 		goto err;
 	}
 
-	rule.dst_mask_len = 0;
-	rule.src_mask_len = 0;
-	rule.data.userdata = 0;
-
-	/* try adding this rule (it should fail because userdata is invalid) */
-	ret = rte_acl_ipv4vlan_add_rules(acx, &rule, 1);
-	if (ret == 0) {
-		printf("Line %i: Adding a rule with invalid user data "
-				"should have failed!\n", __LINE__);
-		rte_acl_free(acx);
-		return -1;
-	}
-
 	rte_acl_free(acx);
 
 	return 0;
diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index 5fd3d34..a6bee9b 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -329,8 +329,9 @@ When creating a set of rules, for each rule, additional information must be supp
     Each set could be assigned its own category and by combining them into a single database,
     one lookup returns a result for each of the four sets.
 
-*   **userdata**: A user-defined field that could be any value except zero.
+*   **userdata**: A user-defined value.
     For each category, a successful match returns the userdata field of the highest priority matched rule.
+    When no rules match, returned value is zero.
 
 .. note::
 
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 8b7e92c..d1f40be 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -313,8 +313,7 @@ acl_check_rule(const struct rte_acl_rule_data *rd)
 	if ((RTE_LEN2MASK(RTE_ACL_MAX_CATEGORIES, typeof(rd->category_mask)) &
 			rd->category_mask) == 0 ||
 			rd->priority > RTE_ACL_MAX_PRIORITY ||
-			rd->priority < RTE_ACL_MIN_PRIORITY ||
-			rd->userdata == RTE_ACL_INVALID_USERDATA)
+			rd->priority < RTE_ACL_MIN_PRIORITY)
 		return -EINVAL;
 	return 0;
 }
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index caa91f7..b53179a 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -120,8 +120,6 @@ enum {
 	RTE_ACL_MIN_PRIORITY = 0,
 };
 
-#define	RTE_ACL_INVALID_USERDATA	0
-
 #define	RTE_ACL_MASKLEN_TO_BITMASK(v, s)	\
 ((v) == 0 ? (v) : (typeof(v))((uint64_t)-1 << ((s) * CHAR_BIT - (v))))
 
diff --git a/lib/librte_table/rte_table_acl.c b/lib/librte_table/rte_table_acl.c
index 8f1f8ce..94b69a9 100644
--- a/lib/librte_table/rte_table_acl.c
+++ b/lib/librte_table/rte_table_acl.c
@@ -792,7 +792,7 @@ rte_table_acl_lookup(
 
 		pkts_mask &= ~pkt_mask;
 
-		if (action_table_pos != RTE_ACL_INVALID_USERDATA) {
+		if (action_table_pos != 0) {
 			pkts_out_mask |= pkt_mask;
 			entries[pkt_pos] = (void *)
 				&acl->memory[action_table_pos *
-- 
2.10.2

^ permalink raw reply related

* Re: [PATCH 10/13] KNI: provided netif name's source is user-space
From: Michał Mirosław @ 2016-12-14 17:19 UTC (permalink / raw)
  To: Ferruh Yigit; +Cc: dev
In-Reply-To: <817f957e-80c3-bfc0-e02b-55f397581708@intel.com>

On Wed, Dec 14, 2016 at 05:06:23PM +0000, Ferruh Yigit wrote:
> Hi Michal,
> 
> On 12/13/2016 1:08 AM, Michał Mirosław wrote:
> > Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
> > ---
> >  lib/librte_eal/linuxapp/kni/kni_misc.c | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/lib/librte_eal/linuxapp/kni/kni_misc.c b/lib/librte_eal/linuxapp/kni/kni_misc.c
> > index 497db9b..f0247aa 100644
> > --- a/lib/librte_eal/linuxapp/kni/kni_misc.c
> > +++ b/lib/librte_eal/linuxapp/kni/kni_misc.c
> > @@ -363,8 +363,8 @@ kni_ioctl_create(struct net *net, uint32_t ioctl_num,
> >  	up_read(&knet->kni_list_lock);
> >  
> >  	net_dev = alloc_netdev(sizeof(struct kni_dev), dev_info.name,
> > -#ifdef NET_NAME_UNKNOWN
> > -							NET_NAME_UNKNOWN,
> > +#ifdef NET_NAME_USER
> > +							NET_NAME_USER,
> 
> Probably NET_NAME_USER fits better to definition.
> But functionally, I wonder if you have a use case to use
> "name_assign_type" ?

I just read what NET_NAME_UNKNOWN/NET_NAME_USER is supposed to do.
UNKNOWN is for "old" drivers that don't know the source of the name.
USER is for a name, that comes from a user and as such is not to be
renamed by udev.

Best Regards,
Michał Mirosław

^ permalink raw reply

* Re: [PATCH 10/13] KNI: provided netif name's source is user-space
From: Ferruh Yigit @ 2016-12-14 17:06 UTC (permalink / raw)
  To: Michał Mirosław, dev
In-Reply-To: <ad817bb663ce3b8641b8884f1e994b145d141246.1481590851.git.mirq-linux@rere.qmqm.pl>

Hi Michal,

On 12/13/2016 1:08 AM, Michał Mirosław wrote:
> Signed-off-by: Michał Mirosław <michal.miroslaw@atendesoftware.pl>
> ---
>  lib/librte_eal/linuxapp/kni/kni_misc.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_eal/linuxapp/kni/kni_misc.c b/lib/librte_eal/linuxapp/kni/kni_misc.c
> index 497db9b..f0247aa 100644
> --- a/lib/librte_eal/linuxapp/kni/kni_misc.c
> +++ b/lib/librte_eal/linuxapp/kni/kni_misc.c
> @@ -363,8 +363,8 @@ kni_ioctl_create(struct net *net, uint32_t ioctl_num,
>  	up_read(&knet->kni_list_lock);
>  
>  	net_dev = alloc_netdev(sizeof(struct kni_dev), dev_info.name,
> -#ifdef NET_NAME_UNKNOWN
> -							NET_NAME_UNKNOWN,
> +#ifdef NET_NAME_USER
> +							NET_NAME_USER,

Probably NET_NAME_USER fits better to definition.
But functionally, I wonder if you have a use case to use
"name_assign_type" ?

>  #endif
>  							kni_net_init);
>  	if (net_dev == NULL) {
> 

^ permalink raw reply

* Re: No packets received if burst is too small in rte_eth_rx_burst
From: Bruce Richardson @ 2016-12-14 16:52 UTC (permalink / raw)
  To: tom.barbette; +Cc: dev
In-Reply-To: <597694261.17905196.1481728433903.JavaMail.zimbra@ulg.ac.be>

On Wed, Dec 14, 2016 at 04:13:53PM +0100, tom.barbette@ulg.ac.be wrote:
> Hi list,
> 
> Between 2.2.0 and 16.04 (up to at least 16.07.2 if not current), with the XL710 controller I do not get any packet when calling rte_eth_rx_burst if nb_pkts is too small. I would say smaller than 32. The input rate is not big, if that helps. But It should definitely get at least one packet per second.
> 
> Any ideas? Is that a bug or expected behaviour? Could be caused by other ABI changes?
> 
Does this issue still occur even if you disable the vector driver in
your build-time configuration?

/Bruce

^ permalink raw reply

* Re: [PATCH 01/22] ethdev: introduce generic flow API
From: Kevin Traynor @ 2016-12-14 16:11 UTC (permalink / raw)
  To: Adrien Mazarguil
  Cc: dev, Thomas Monjalon, Pablo de Lara, Olivier Matz,
	sugesh.chandran
In-Reply-To: <20161214135423.GZ10340@6wind.com>

On 12/14/2016 01:54 PM, Adrien Mazarguil wrote:

>>
>>>>>>> + * @param[out] error
>>>>>>> + *   Perform verbose error reporting if not NULL.
>>>>>>> + *
>>>>>>> + * @return
>>>>>>> + *   0 on success, a negative errno value otherwise and rte_errno is set.
>>>>>>> + */
>>>>>>> +int
>>>>>>> +rte_flow_query(uint8_t port_id,
>>>>>>> +	       struct rte_flow *flow,
>>>>>>> +	       enum rte_flow_action_type action,
>>>>>>> +	       void *data,
>>>>>>> +	       struct rte_flow_error *error);
>>>>>>> +
>>>>>>> +#ifdef __cplusplus
>>>>>>> +}
>>>>>>> +#endif
>>>>>>
>>>>>> I don't see a way to dump all the rules for a port out. I think this is
>>>>>> neccessary for degbugging. You could have a look through dpif.h in OVS
>>>>>> and see how dpif_flow_dump_next() is used, it might be a good reference.
>>>>>
>>>>> DPDK does not maintain flow rules and, depending on hardware capabilities
>>>>> and level of compliance, PMDs do not necessarily do it either, particularly
>>>>> since it requires space and application probably have a better method to
>>>>> store these pointers for their own needs.
>>>>
>>>> understood
>>>>
>>>>>
>>>>> What you see here is only a PMD interface. Depending on applications needs,
>>>>> generic helper functions built on top of these may be added to manage flow
>>>>> rules in the future.
>>>>
>>>> I'm thinking of the case where something goes wrong and I want to get a
>>>> dump of all the flow rules from hardware, not query the rules I think I
>>>> have. I don't see a way to do it or something to build a helper on top of?
>>>
>>> Generic helper functions would exist on top of this API and would likely
>>> maintain a list of flow rules themselves. The dump in that case would be
>>> entirely implemented in software. I think that recovering flow rules from HW
>>> may be complicated in many cases (even without taking storage allocation and
>>> rules conversion issues into account), therefore if there is really a need
>>> for it, we could perhaps add a dump() function that PMDs are free to
>>> implement later.
>>>
>>
>> ok. Maybe there are some more generic stats that can be got from the
>> hardware that would help debugging that would suffice, like total flow
>> rule hits/misses (i.e. not on a per flow rule basis).
>>
>> You can get this from the software flow caches and it's widely used for
>> debugging. e.g.
>>
>> pmd thread numa_id 0 core_id 3:
>> 	emc hits:0
>> 	megaflow hits:0
>> 	avg. subtable lookups per hit:0.00
>> 	miss:0
>>
> 
> Perhaps a rule such as the following could do the trick:
> 
>  group: 42 (or priority 42)
>  pattern: void
>  actions: count / passthru
> 
> Assuming useful flow rules are defined with higher priorities (using lower
> group ID or priority level) and provide a terminating action, this one would
> count all packets that were not caught by them.
> 
> That is one example to illustrate how "global" counters can be requested by
> applications.
> 
> Otherwise you could just make sure all rules contain mark / flag actions, in
> which case mbufs would tell directly if they went through them or need
> additional SW processing.
> 

ok, sounds like there's some options at least to work with on this which
is good. thanks.

^ permalink raw reply

* Re: [PATCH v4] net/kni: add KNI PMD
From: Ferruh Yigit @ 2016-12-14 15:59 UTC (permalink / raw)
  To: Yong Wang, dev@dpdk.org
In-Reply-To: <BY2PR05MB23595855DDDC3B10D01E651CAF980@BY2PR05MB2359.namprd05.prod.outlook.com>

On 12/12/2016 9:59 PM, Yong Wang wrote:
>> -----Original Message-----
>> From: Ferruh Yigit [mailto:ferruh.yigit@intel.com]
>> Sent: Wednesday, November 30, 2016 10:12 AM
>> To: dev@dpdk.org
>> Cc: Ferruh Yigit <ferruh.yigit@intel.com>; Yong Wang
>> <yongwang@vmware.com>
>> Subject: [PATCH v4] net/kni: add KNI PMD
>>
>> Add KNI PMD which wraps librte_kni for ease of use.
>>
>> KNI PMD can be used as any regular PMD to send / receive packets to the
>> Linux networking stack.
>>
>> Signed-off-by: Ferruh Yigit <ferruh.yigit@intel.com>
>> ---
>>
>> v4:
>> * allow only single queue
>> * use driver.name as name
>>
>> v3:
>> * rebase on top of latest master
>>
>> v2:
>> * updated driver name eth_kni -> net_kni
>> ---
>>  config/common_base                      |   1 +
>>  config/common_linuxapp                  |   1 +
>>  drivers/net/Makefile                    |   1 +
>>  drivers/net/kni/Makefile                |  63 +++++
>>  drivers/net/kni/rte_eth_kni.c           | 462
>> ++++++++++++++++++++++++++++++++
>>  drivers/net/kni/rte_pmd_kni_version.map |   4 +
>>  mk/rte.app.mk                           |  10 +-
>>  7 files changed, 537 insertions(+), 5 deletions(-)
>>  create mode 100644 drivers/net/kni/Makefile
>>  create mode 100644 drivers/net/kni/rte_eth_kni.c
>>  create mode 100644 drivers/net/kni/rte_pmd_kni_version.map
>>
>> diff --git a/config/common_base b/config/common_base
>> index 4bff83a..3385879 100644
>> --- a/config/common_base
>> +++ b/config/common_base
>> @@ -543,6 +543,7 @@ CONFIG_RTE_PIPELINE_STATS_COLLECT=n
>>  # Compile librte_kni
>>  #
>>  CONFIG_RTE_LIBRTE_KNI=n
>> +CONFIG_RTE_LIBRTE_PMD_KNI=n
>>  CONFIG_RTE_KNI_KMOD=n
>>  CONFIG_RTE_KNI_PREEMPT_DEFAULT=y
>>  CONFIG_RTE_KNI_VHOST=n
>> diff --git a/config/common_linuxapp b/config/common_linuxapp
>> index 2483dfa..2ecd510 100644
>> --- a/config/common_linuxapp
>> +++ b/config/common_linuxapp
>> @@ -39,6 +39,7 @@ CONFIG_RTE_EAL_IGB_UIO=y
>>  CONFIG_RTE_EAL_VFIO=y
>>  CONFIG_RTE_KNI_KMOD=y
>>  CONFIG_RTE_LIBRTE_KNI=y
>> +CONFIG_RTE_LIBRTE_PMD_KNI=y
>>  CONFIG_RTE_LIBRTE_VHOST=y
>>  CONFIG_RTE_LIBRTE_PMD_VHOST=y
>>  CONFIG_RTE_LIBRTE_PMD_AF_PACKET=y
>> diff --git a/drivers/net/Makefile b/drivers/net/Makefile
>> index bc93230..c4771cd 100644
>> --- a/drivers/net/Makefile
>> +++ b/drivers/net/Makefile
>> @@ -41,6 +41,7 @@ DIRS-$(CONFIG_RTE_LIBRTE_ENIC_PMD) += enic
>>  DIRS-$(CONFIG_RTE_LIBRTE_FM10K_PMD) += fm10k
>>  DIRS-$(CONFIG_RTE_LIBRTE_I40E_PMD) += i40e
>>  DIRS-$(CONFIG_RTE_LIBRTE_IXGBE_PMD) += ixgbe
>> +DIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += kni
>>  DIRS-$(CONFIG_RTE_LIBRTE_MLX4_PMD) += mlx4
>>  DIRS-$(CONFIG_RTE_LIBRTE_MLX5_PMD) += mlx5
>>  DIRS-$(CONFIG_RTE_LIBRTE_MPIPE_PMD) += mpipe
>> diff --git a/drivers/net/kni/Makefile b/drivers/net/kni/Makefile
>> new file mode 100644
>> index 0000000..0b7cf91
>> --- /dev/null
>> +++ b/drivers/net/kni/Makefile
>> @@ -0,0 +1,63 @@
>> +#   BSD LICENSE
>> +#
>> +#   Copyright(c) 2016 Intel Corporation. All rights reserved.
>> +#
>> +#   Redistribution and use in source and binary forms, with or without
>> +#   modification, are permitted provided that the following conditions
>> +#   are met:
>> +#
>> +#     * Redistributions of source code must retain the above copyright
>> +#       notice, this list of conditions and the following disclaimer.
>> +#     * Redistributions in binary form must reproduce the above copyright
>> +#       notice, this list of conditions and the following disclaimer in
>> +#       the documentation and/or other materials provided with the
>> +#       distribution.
>> +#     * Neither the name of Intel Corporation nor the names of its
>> +#       contributors may be used to endorse or promote products derived
>> +#       from this software without specific prior written permission.
>> +#
>> +#   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
>> CONTRIBUTORS
>> +#   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
>> NOT
>> +#   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
>> FITNESS FOR
>> +#   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
>> COPYRIGHT
>> +#   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>> INCIDENTAL,
>> +#   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
>> NOT
>> +#   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
>> OF USE,
>> +#   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
>> AND ON ANY
>> +#   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
>> TORT
>> +#   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
>> THE USE
>> +#   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>> DAMAGE.
>> +
>> +include $(RTE_SDK)/mk/rte.vars.mk
>> +
>> +#
>> +# library name
>> +#
>> +LIB = librte_pmd_kni.a
>> +
>> +CFLAGS += -O3
>> +CFLAGS += $(WERROR_FLAGS)
>> +LDLIBS += -lpthread
>> +
>> +EXPORT_MAP := rte_pmd_kni_version.map
>> +
>> +LIBABIVER := 1
>> +
>> +#
>> +# all source are stored in SRCS-y
>> +#
>> +SRCS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += rte_eth_kni.c
>> +
>> +#
>> +# Export include files
>> +#
>> +SYMLINK-y-include +=
>> +
>> +# this lib depends upon:
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_eal
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_ether
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_kni
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_mbuf
>> +DEPDIRS-$(CONFIG_RTE_LIBRTE_PMD_KNI) += lib/librte_mempool
>> +
>> +include $(RTE_SDK)/mk/rte.lib.mk
>> diff --git a/drivers/net/kni/rte_eth_kni.c b/drivers/net/kni/rte_eth_kni.c
>> new file mode 100644
>> index 0000000..6c4df96
>> --- /dev/null
>> +++ b/drivers/net/kni/rte_eth_kni.c
>> @@ -0,0 +1,462 @@
>> +/*-
>> + *   BSD LICENSE
>> + *
>> + *   Copyright(c) 2016 Intel Corporation. All rights reserved.
>> + *   All rights reserved.
>> + *
>> + *   Redistribution and use in source and binary forms, with or without
>> + *   modification, are permitted provided that the following conditions
>> + *   are met:
>> + *
>> + *     * Redistributions of source code must retain the above copyright
>> + *       notice, this list of conditions and the following disclaimer.
>> + *     * Redistributions in binary form must reproduce the above copyright
>> + *       notice, this list of conditions and the following disclaimer in
>> + *       the documentation and/or other materials provided with the
>> + *       distribution.
>> + *     * Neither the name of Intel Corporation nor the names of its
>> + *       contributors may be used to endorse or promote products derived
>> + *       from this software without specific prior written permission.
>> + *
>> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
>> CONTRIBUTORS
>> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT
>> NOT
>> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
>> FITNESS FOR
>> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE
>> COPYRIGHT
>> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT,
>> INCIDENTAL,
>> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT
>> NOT
>> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS
>> OF USE,
>> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED
>> AND ON ANY
>> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR
>> TORT
>> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF
>> THE USE
>> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
>> DAMAGE.
>> + */
>> +
>> +#include <fcntl.h>
>> +#include <pthread.h>
>> +#include <unistd.h>
>> +
>> +#include <rte_ethdev.h>
>> +#include <rte_kni.h>
>> +#include <rte_malloc.h>
>> +#include <rte_vdev.h>
>> +
>> +/* Only single queue supported */
>> +#define KNI_MAX_QUEUE_PER_PORT 1
>> +
>> +#define MAX_PACKET_SZ 2048
>> +#define MAX_KNI_PORTS 8
>> +
>> +struct pmd_queue_stats {
>> +	uint64_t pkts;
>> +	uint64_t bytes;
>> +	uint64_t err_pkts;
>> +};
>> +
>> +struct pmd_queue {
>> +	struct pmd_internals *internals;
>> +	struct rte_mempool *mb_pool;
>> +
>> +	struct pmd_queue_stats rx;
>> +	struct pmd_queue_stats tx;
>> +};
>> +
>> +struct pmd_internals {
>> +	struct rte_kni *kni;
>> +	int is_kni_started;
>> +
>> +	pthread_t thread;
>> +	int stop_thread;
>> +
>> +	struct pmd_queue rx_queues[KNI_MAX_QUEUE_PER_PORT];
>> +	struct pmd_queue tx_queues[KNI_MAX_QUEUE_PER_PORT];
>> +};
>> +
>> +static struct ether_addr eth_addr;
>> +static struct rte_eth_link pmd_link = {
>> +		.link_speed = ETH_SPEED_NUM_10G,
>> +		.link_duplex = ETH_LINK_FULL_DUPLEX,
>> +		.link_status = 0
>> +};
>> +static int is_kni_initialized;
>> +
>> +static uint16_t
>> +eth_kni_rx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
>> +{
>> +	struct pmd_queue *kni_q = q;
>> +	struct rte_kni *kni = kni_q->internals->kni;
>> +	uint16_t nb_pkts;
>> +
>> +	nb_pkts = rte_kni_rx_burst(kni, bufs, nb_bufs);
>> +
>> +	kni_q->rx.pkts += nb_pkts;
>> +	kni_q->rx.err_pkts += nb_bufs - nb_pkts;
>> +
>> +	return nb_pkts;
>> +}
>> +
>> +static uint16_t
>> +eth_kni_tx(void *q, struct rte_mbuf **bufs, uint16_t nb_bufs)
>> +{
>> +	struct pmd_queue *kni_q = q;
>> +	struct rte_kni *kni = kni_q->internals->kni;
>> +	uint16_t nb_pkts;
>> +
>> +	nb_pkts =  rte_kni_tx_burst(kni, bufs, nb_bufs);
>> +
>> +	kni_q->tx.pkts += nb_pkts;
>> +	kni_q->tx.err_pkts += nb_bufs - nb_pkts;
>> +
>> +	return nb_pkts;
>> +}
>> +
>> +static void *
>> +kni_handle_request(void *param)
>> +{
>> +	struct pmd_internals *internals = param;
>> +#define MS 1000
>> +
>> +	while (!internals->stop_thread) {
>> +		rte_kni_handle_request(internals->kni);
>> +		usleep(500 * MS);
>> +	}
>> +
>> +	return param;
>> +}
>> +
> 
> Do we really need a thread to handle request by default? I know there are apps that handle request their own way and having a separate thread could add synchronization problems.  Can we at least add an option to disable this?

I didn't think about there can be a use case that requires own request
handling.

But, kni requests should be handled to make kni interface run properly,
and to handle interface "kni" handler (internals->kni) required, which
this PMD doesn't expose.

So, just disabling this thread won't work on its own.

A solution can be found, like callback registraion, or get_handler API,
but if an application has custom request handling, perhaps it may prefer
to use kni library directly instead of this wrapper, since wrapper
already doesn't expose all kni features.

> 
>> +static int
>> +eth_kni_start(struct rte_eth_dev *dev)
>> +{
>> +	struct pmd_internals *internals = dev->data->dev_private;
>> +	uint16_t port_id = dev->data->port_id;
>> +	struct rte_mempool *mb_pool;
>> +	struct rte_kni_conf conf;
>> +	const char *name = dev->data->name + 4; /* remove net_ */
>> +
>> +	snprintf(conf.name, RTE_KNI_NAMESIZE, "%s", name);
>> +	conf.force_bind = 0;
>> +	conf.group_id = port_id;
>> +	conf.mbuf_size = MAX_PACKET_SZ;
>> +	mb_pool = internals->rx_queues[0].mb_pool;
>> +
>> +	internals->kni = rte_kni_alloc(mb_pool, &conf, NULL);
>> +	if (internals->kni == NULL) {
>> +		RTE_LOG(ERR, PMD,
>> +			"Fail to create kni for port: %d\n", port_id);
>> +		return -1;
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +eth_kni_dev_start(struct rte_eth_dev *dev)
>> +{
>> +	struct pmd_internals *internals = dev->data->dev_private;
>> +	int ret;
>> +
>> +	if (internals->is_kni_started == 0) {
>> +		ret = eth_kni_start(dev);
>> +		if (ret)
>> +			return -1;
>> +		internals->is_kni_started = 1;
>> +	}
>> +
> 
> In case is_kni_started is 1 already,  shouldn't we return directly instead of proceeding?

"is_kni_started" is just to protect "eth_kni_start()", as you can see it
doesn't have a counterpart in eth_kni_dev_stop(). This flag is to be
sure "eth_kni_start()" called only once during PMD life cycle.

The check you mentioned already done, start() / stop() functions already
balanced by APIs calling these functions.

> 
>> +	ret = pthread_create(&internals->thread, NULL, kni_handle_request,
>> +			internals);
>> +	if (ret) {
>> +		RTE_LOG(ERR, PMD, "Fail to create kni request thread\n");
>> +		return -1;
>> +	}
>> +
>> +	dev->data->dev_link.link_status = 1;
>> +
>> +	return 0;
>> +}
>> +
>> +static void
>> +eth_kni_dev_stop(struct rte_eth_dev *dev)
>> +{
>> +	struct pmd_internals *internals = dev->data->dev_private;
>> +	int ret;
>> +
>> +	internals->stop_thread = 1;
>> +
>> +	ret = pthread_cancel(internals->thread);
>> +	if (ret)
>> +		RTE_LOG(ERR, PMD, "Can't cancel the thread\n");
>> +
>> +	ret = pthread_join(internals->thread, NULL);
>> +	if (ret)
>> +		RTE_LOG(ERR, PMD, "Can't join the thread\n");
>> +
>> +	internals->stop_thread = 0;
>> +
>> +	dev->data->dev_link.link_status = 0;
>> +}
>> +
>> +static int
>> +eth_kni_dev_configure(struct rte_eth_dev *dev __rte_unused)
>> +{
>> +	return 0;
>> +}
>> +
>> +static void
>> +eth_kni_dev_info(struct rte_eth_dev *dev, struct rte_eth_dev_info
>> *dev_info)
>> +{
>> +	struct rte_eth_dev_data *data = dev->data;
>> +
>> +	dev_info->driver_name = data->drv_name;
>> +	dev_info->max_mac_addrs = 1;
>> +	dev_info->max_rx_pktlen = (uint32_t)-1;
>> +	dev_info->max_rx_queues = KNI_MAX_QUEUE_PER_PORT;
>> +	dev_info->max_tx_queues = KNI_MAX_QUEUE_PER_PORT;
>> +	dev_info->min_rx_bufsize = 0;
>> +	dev_info->pci_dev = NULL;
>> +}
>> +
>> +static int
>> +eth_kni_rx_queue_setup(struct rte_eth_dev *dev,
>> +		uint16_t rx_queue_id,
>> +		uint16_t nb_rx_desc __rte_unused,
>> +		unsigned int socket_id __rte_unused,
>> +		const struct rte_eth_rxconf *rx_conf __rte_unused,
>> +		struct rte_mempool *mb_pool)
>> +{
>> +	struct pmd_internals *internals = dev->data->dev_private;
>> +	struct pmd_queue *q;
>> +
>> +	q = &internals->rx_queues[rx_queue_id];
>> +	q->internals = internals;
>> +	q->mb_pool = mb_pool;
>> +
>> +	dev->data->rx_queues[rx_queue_id] = q;
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +eth_kni_tx_queue_setup(struct rte_eth_dev *dev,
>> +		uint16_t tx_queue_id,
>> +		uint16_t nb_tx_desc __rte_unused,
>> +		unsigned int socket_id __rte_unused,
>> +		const struct rte_eth_txconf *tx_conf __rte_unused)
>> +{
>> +	struct pmd_internals *internals = dev->data->dev_private;
>> +	struct pmd_queue *q;
>> +
>> +	q = &internals->tx_queues[tx_queue_id];
>> +	q->internals = internals;
>> +
>> +	dev->data->tx_queues[tx_queue_id] = q;
>> +
>> +	return 0;
>> +}
>> +
>> +static void
>> +eth_kni_queue_release(void *q __rte_unused)
>> +{
>> +}
>> +
>> +static int
>> +eth_kni_link_update(struct rte_eth_dev *dev __rte_unused,
>> +		int wait_to_complete __rte_unused)
>> +{
>> +	return 0;
>> +}
>> +
>> +static void
>> +eth_kni_stats_get(struct rte_eth_dev *dev, struct rte_eth_stats *stats)
>> +{
>> +	unsigned long rx_packets_total = 0, rx_bytes_total = 0;
>> +	unsigned long tx_packets_total = 0, tx_bytes_total = 0;
>> +	struct rte_eth_dev_data *data = dev->data;
>> +	unsigned long tx_packets_err_total = 0;
>> +	unsigned int i, num_stats;
>> +	struct pmd_queue *q;
>> +
>> +	num_stats = RTE_MIN((unsigned
>> int)RTE_ETHDEV_QUEUE_STAT_CNTRS,
>> +			data->nb_rx_queues);
>> +	for (i = 0; i < num_stats; i++) {
>> +		q = data->rx_queues[i];
>> +		stats->q_ipackets[i] = q->rx.pkts;
>> +		stats->q_ibytes[i] = q->rx.bytes;
>> +		rx_packets_total += stats->q_ipackets[i];
>> +		rx_bytes_total += stats->q_ibytes[i];
>> +	}
>> +
>> +	num_stats = RTE_MIN((unsigned
>> int)RTE_ETHDEV_QUEUE_STAT_CNTRS,
>> +			data->nb_tx_queues);
>> +	for (i = 0; i < num_stats; i++) {
>> +		q = data->tx_queues[i];
>> +		stats->q_opackets[i] = q->tx.pkts;
>> +		stats->q_obytes[i] = q->tx.bytes;
>> +		stats->q_errors[i] = q->tx.err_pkts;
>> +		tx_packets_total += stats->q_opackets[i];
>> +		tx_bytes_total += stats->q_obytes[i];
>> +		tx_packets_err_total += stats->q_errors[i];
>> +	}
>> +
>> +	stats->ipackets = rx_packets_total;
>> +	stats->ibytes = rx_bytes_total;
>> +	stats->opackets = tx_packets_total;
>> +	stats->obytes = tx_bytes_total;
>> +	stats->oerrors = tx_packets_err_total;
>> +}
>> +
>> +static void
>> +eth_kni_stats_reset(struct rte_eth_dev *dev)
>> +{
>> +	struct rte_eth_dev_data *data = dev->data;
>> +	struct pmd_queue *q;
>> +	unsigned int i;
>> +
>> +	for (i = 0; i < data->nb_rx_queues; i++) {
>> +		q = data->rx_queues[i];
>> +		q->rx.pkts = 0;
>> +		q->rx.bytes = 0;
>> +	}
>> +	for (i = 0; i < data->nb_tx_queues; i++) {
>> +		q = data->tx_queues[i];
>> +		q->tx.pkts = 0;
>> +		q->tx.bytes = 0;
>> +		q->tx.err_pkts = 0;
>> +	}
>> +}
>> +
>> +static const struct eth_dev_ops eth_kni_ops = {
>> +	.dev_start = eth_kni_dev_start,
>> +	.dev_stop = eth_kni_dev_stop,
>> +	.dev_configure = eth_kni_dev_configure,
>> +	.dev_infos_get = eth_kni_dev_info,
>> +	.rx_queue_setup = eth_kni_rx_queue_setup,
>> +	.tx_queue_setup = eth_kni_tx_queue_setup,
>> +	.rx_queue_release = eth_kni_queue_release,
>> +	.tx_queue_release = eth_kni_queue_release,
>> +	.link_update = eth_kni_link_update,
>> +	.stats_get = eth_kni_stats_get,
>> +	.stats_reset = eth_kni_stats_reset,
>> +};
>> +
>> +static struct rte_vdev_driver eth_kni_drv;
>> +
>> +static struct rte_eth_dev *
>> +eth_kni_create(const char *name, unsigned int numa_node)
>> +{
>> +	struct pmd_internals *internals = NULL;
>> +	struct rte_eth_dev_data *data;
>> +	struct rte_eth_dev *eth_dev;
>> +
>> +	RTE_LOG(INFO, PMD, "Creating kni ethdev on numa socket %u\n",
>> +			numa_node);
>> +
>> +	data = rte_zmalloc_socket(name, sizeof(*data), 0, numa_node);
>> +	if (data == NULL)
>> +		goto error;
>> +
>> +	internals = rte_zmalloc_socket(name, sizeof(*internals), 0,
>> numa_node);
>> +	if (internals == NULL)
>> +		goto error;
>> +
>> +	/* reserve an ethdev entry */
>> +	eth_dev = rte_eth_dev_allocate(name);
>> +	if (eth_dev == NULL)
>> +		goto error;
>> +
>> +	data->dev_private = internals;
>> +	data->port_id = eth_dev->data->port_id;
>> +	memmove(data->name, eth_dev->data->name, sizeof(data-
>>> name));
>> +	data->nb_rx_queues = 1;
>> +	data->nb_tx_queues = 1;
>> +	data->dev_link = pmd_link;
>> +	data->mac_addrs = &eth_addr;
>> +
>> +	eth_dev->data = data;
>> +	eth_dev->dev_ops = &eth_kni_ops;
>> +	eth_dev->driver = NULL;
>> +
>> +	data->dev_flags = RTE_ETH_DEV_DETACHABLE;
>> +	data->kdrv = RTE_KDRV_NONE;
>> +	data->drv_name = eth_kni_drv.driver.name;
>> +	data->numa_node = numa_node;
>> +
>> +	return eth_dev;
>> +
>> +error:
>> +	rte_free(data);
>> +	rte_free(internals);
>> +
>> +	return NULL;
>> +}
>> +
>> +static int
>> +kni_init(void)
>> +{
>> +	if (is_kni_initialized == 0)
>> +		rte_kni_init(MAX_KNI_PORTS);
>> +
>> +	is_kni_initialized += 1;
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +eth_kni_probe(const char *name, const char *params __rte_unused)
>> +{
>> +	struct rte_eth_dev *eth_dev;
>> +	int ret;
>> +
>> +	RTE_LOG(INFO, PMD, "Initializing eth_kni for %s\n", name);
>> +
>> +	ret = kni_init();
>> +	if (ret < 0)
>> +		/* Not return error to prevent panic in rte_eal_init() */
>> +		return 0;
> 
> If we don't return error here, the application that needs to add KNI ports eventually will fail.  If it's a fail-stop situation, isn't it better to return error where the it happened?

I am not sure this is fail-stop situation, but instead this gives a
chance to applicaton for a graceful exit.

If an error value returned here, it will lead to a rte_panic() and
application terminated abnormally!

But if we return a success at this point, since no ethernet device
created, there is no handler in application to use, which also means no
KNI interface created.
Application can check number of ports and recognize KNI port is missing,
app may chose to terminate or not, also it prefers to terminate, can do
it properly.

>  
>> +	eth_dev = eth_kni_create(name, rte_socket_id());
>> +	if (eth_dev == NULL)
>> +		return -1;
>> +
>> +	eth_dev->rx_pkt_burst = eth_kni_rx;
>> +	eth_dev->tx_pkt_burst = eth_kni_tx;
>> +
>> +	return 0;
>> +}
>> +
>> +static int
>> +eth_kni_remove(const char *name)
>> +{
>> +	struct rte_eth_dev *eth_dev;
>> +	struct pmd_internals *internals;
>> +
>> +	RTE_LOG(INFO, PMD, "Un-Initializing eth_kni for %s\n", name);
>> +
>> +	/* find the ethdev entry */
>> +	eth_dev = rte_eth_dev_allocated(name);
>> +	if (eth_dev == NULL)
>> +		return -1;
>> +
>> +	eth_kni_dev_stop(eth_dev);
>> +
>> +	if (eth_dev->data) {
>> +		internals = eth_dev->data->dev_private;
>> +		rte_kni_release(internals->kni);
>> +
>> +		rte_free(internals);
>> +	}
>> +	rte_free(eth_dev->data);
>> +
>> +	rte_eth_dev_release_port(eth_dev);
>> +
>> +	is_kni_initialized -= 1;
>> +	if (is_kni_initialized == 0)
>> +		rte_kni_close();
>> +
>> +	return 0;
>> +}
>> +
>> +static struct rte_vdev_driver eth_kni_drv = {
>> +	.probe = eth_kni_probe,
>> +	.remove = eth_kni_remove,
>> +};
>> +
>> +RTE_PMD_REGISTER_VDEV(net_kni, eth_kni_drv);
>> diff --git a/drivers/net/kni/rte_pmd_kni_version.map
>> b/drivers/net/kni/rte_pmd_kni_version.map
>> new file mode 100644
>> index 0000000..31eca32
>> --- /dev/null
>> +++ b/drivers/net/kni/rte_pmd_kni_version.map
>> @@ -0,0 +1,4 @@
>> +DPDK_17.02 {
>> +
>> +	local: *;
>> +};
>> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
>> index f75f0e2..af02816 100644
>> --- a/mk/rte.app.mk
>> +++ b/mk/rte.app.mk
>> @@ -59,11 +59,6 @@ _LDLIBS-y += -L$(RTE_SDK_BIN)/lib
>>  #
>>  # Order is important: from higher level to lower level
>>  #
>> -
>> -ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
>> -_LDLIBS-$(CONFIG_RTE_LIBRTE_KNI)            += -lrte_kni
>> -endif
>> -
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_PIPELINE)       += -lrte_pipeline
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_TABLE)          += -lrte_table
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_PORT)           += -lrte_port
>> @@ -84,6 +79,10 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_POWER)          += -
>> lrte_power
>>
>>  _LDLIBS-y += --whole-archive
>>
>> +ifeq ($(CONFIG_RTE_EXEC_ENV_LINUXAPP),y)
>> +_LDLIBS-$(CONFIG_RTE_LIBRTE_KNI)            += -lrte_kni
>> +endif
>> +
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_TIMER)          += -lrte_timer
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_HASH)           += -lrte_hash
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_VHOST)          += -lrte_vhost
>> @@ -115,6 +114,7 @@ _LDLIBS-$(CONFIG_RTE_LIBRTE_ENIC_PMD)       += -
>> lrte_pmd_enic
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_FM10K_PMD)      += -lrte_pmd_fm10k
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_I40E_PMD)       += -lrte_pmd_i40e
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_IXGBE_PMD)      += -lrte_pmd_ixgbe
>> +_LDLIBS-$(CONFIG_RTE_LIBRTE_PMD_KNI)        += -lrte_pmd_kni
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX4_PMD)       += -lrte_pmd_mlx4 -
>> libverbs
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_MLX5_PMD)       += -lrte_pmd_mlx5 -
>> libverbs
>>  _LDLIBS-$(CONFIG_RTE_LIBRTE_MPIPE_PMD)      += -lrte_pmd_mpipe -lgxio
>> --
>> 2.9.3
> 

^ permalink raw reply

* [PATCH v5] vhost: allow for many vhost user ports
From: Jan Wickbom @ 2016-12-14 15:30 UTC (permalink / raw)
  To: yuanhan.liu; +Cc: dev, patrik.r.andersson, Jan Wickbom
In-Reply-To: <1480606010-6132-1-git-send-email-jan.wickbom@ericsson.com>

Currently select() is used to monitor file descriptors for vhostuser
ports. This limits the number of ports possible to create since the
fd number is used as index in the fd_set and we have seen fds > 1023.
This patch changes select() to poll(). This way we can keep an
packed (pollfd) array for the fds, e.g. as many fds as the size of
the array.

Also see:
http://dpdk.org/ml/archives/dev/2016-April/037024.html

Signed-off-by: Jan Wickbom <jan.wickbom@ericsson.com>
Reported-by: Patrik Andersson <patrik.r.andersson@ericsson.com>
---

v5:
* pack of arrays moved after poll loop

v4:
* fdset_del can not shrink the array. Took care of that in sveral places
* Moved rwfds[] into struct fdset

v3:
* removed unnecessary include
* removed fdset_fill, made it functionally part of poll loop

v2:
* removed unnecessary casts
* static array replacing allocated memory

 lib/librte_vhost/fd_man.c | 234 ++++++++++++++++++++++++++--------------------
 lib/librte_vhost/fd_man.h |   2 +
 2 files changed, 135 insertions(+), 101 deletions(-)

diff --git a/lib/librte_vhost/fd_man.c b/lib/librte_vhost/fd_man.c
index 2d3eeb7..c7c2b35 100644
--- a/lib/librte_vhost/fd_man.c
+++ b/lib/librte_vhost/fd_man.c
@@ -35,93 +35,111 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <sys/socket.h>
-#include <sys/select.h>
 #include <sys/time.h>
 #include <sys/types.h>
 #include <unistd.h>
+#include <string.h>
 
 #include <rte_common.h>
 #include <rte_log.h>
 
 #include "fd_man.h"
 
+#define FDPOLLERR (POLLERR | POLLHUP | POLLNVAL)
+
 /**
- * Returns the index in the fdset for a given fd.
- * If fd is -1, it means to search for a free entry.
+ * Adjusts the highest index populated in the array of fds
  * @return
- *   index for the fd, or -1 if fd isn't in the fdset.
+ *   The new size of fdset.
  */
 static int
-fdset_find_fd(struct fdset *pfdset, int fd)
+fdset_shrink(struct fdset *pfdset)
 {
-	int i;
+	int idx;
 
-	if (pfdset == NULL)
-		return -1;
+	/* Remove all empty spots in the end */
 
-	for (i = 0; i < MAX_FDS && pfdset->fd[i].fd != fd; i++)
+	for (idx = pfdset->num - 1;
+	     idx >= 0 && pfdset->fd[idx].fd == -1;
+	     idx--)
 		;
 
-	return i ==  MAX_FDS ? -1 : i;
+	pfdset->num = idx + 1;
+
+	return pfdset->num;
 }
 
-static int
-fdset_find_free_slot(struct fdset *pfdset)
+/**
+ *  Moves the fd from last slot to specified slot, including
+ *  corresponding pollfd
+ */
+static void
+fdset_move_last(struct fdset *pfdset, int idx)
 {
-	return fdset_find_fd(pfdset, -1);
+	int last_idx = pfdset->num - 1;
+
+	if (idx < last_idx) {
+		pfdset->fd[idx] = pfdset->fd[last_idx];
+		pfdset->fd[last_idx].fd = -1;
+
+		pfdset->rwfds[idx] = pfdset->rwfds[last_idx];
+		pfdset->rwfds[last_idx].revents = 0;
+	}
 }
 
-static int
-fdset_add_fd(struct fdset  *pfdset, int idx, int fd,
-	fd_cb rcb, fd_cb wcb, void *dat)
-{
-	struct fdentry *pfdentry;
 
-	if (pfdset == NULL || idx >= MAX_FDS || fd >= FD_SETSIZE)
-		return -1;
+/**
+ *  Packs the two arrays in the fdset structure
+ */
+static void
+fdset_pack(struct fdset *pfdset)
+{
+	int idx;
+	int num = pfdset->num;
 
-	pfdentry = &pfdset->fd[idx];
-	pfdentry->fd = fd;
-	pfdentry->rcb = rcb;
-	pfdentry->wcb = wcb;
-	pfdentry->dat = dat;
+	for (idx = 0; idx < num; idx++) {
+		if (pfdset->fd[idx].fd == -1) {
+			num = fdset_shrink(pfdset);
+			fdset_move_last(pfdset, idx);
+		}
+	}
 
-	return 0;
+	fdset_shrink(pfdset);
 }
 
 /**
- * Fill the read/write fd_set with the fds in the fdset.
+ * Returns the index in the fdset for a given fd.
+ * If fd is -1, it means to search for a free entry.
  * @return
- *  the maximum fds filled in the read/write fd_set.
+ *   index for the fd, or -1 if fd isn't in the fdset.
  */
 static int
-fdset_fill(fd_set *rfset, fd_set *wfset, struct fdset *pfdset)
+fdset_find_fd(struct fdset *pfdset, int fd)
 {
-	struct fdentry *pfdentry;
-	int i, maxfds = -1;
-	int num = MAX_FDS;
+	int i;
 
-	if (pfdset == NULL)
-		return -1;
+	for (i = 0; i < pfdset->num && pfdset->fd[i].fd != fd; i++)
+		;
 
-	for (i = 0; i < num; i++) {
-		pfdentry = &pfdset->fd[i];
-		if (pfdentry->fd != -1) {
-			int added = 0;
-			if (pfdentry->rcb && rfset) {
-				FD_SET(pfdentry->fd, rfset);
-				added = 1;
-			}
-			if (pfdentry->wcb && wfset) {
-				FD_SET(pfdentry->fd, wfset);
-				added = 1;
-			}
-			if (added)
-				maxfds = pfdentry->fd < maxfds ?
-					maxfds : pfdentry->fd;
-		}
-	}
-	return maxfds;
+	return i == pfdset->num ? -1 : i;
+}
+
+static void
+fdset_add_fd(struct fdset *pfdset, int idx, int fd,
+	fd_cb rcb, fd_cb wcb, void *dat)
+{
+	struct fdentry *pfdentry = &pfdset->fd[idx];
+	struct pollfd *pfd = &pfdset->rwfds[idx];
+
+	pfdentry->fd = fd;
+	pfdentry->rcb = rcb;
+	pfdentry->wcb = wcb;
+	pfdentry->dat = dat;
+
+	pfd->fd = fd;
+	pfd->events = rcb ? POLLIN : 0;
+	pfd->events |= wcb ? POLLOUT : 0;
+	pfd->revents = 0;
 }
 
 void
@@ -132,11 +150,15 @@
 	if (pfdset == NULL)
 		return;
 
+	pthread_mutex_init(&pfdset->fd_mutex, NULL);
+
 	for (i = 0; i < MAX_FDS; i++) {
 		pfdset->fd[i].fd = -1;
 		pfdset->fd[i].dat = NULL;
 	}
 	pfdset->num = 0;
+
+	memset(pfdset->rwfds, 0, sizeof(pfdset->rwfds));
 }
 
 /**
@@ -152,14 +174,14 @@
 
 	pthread_mutex_lock(&pfdset->fd_mutex);
 
-	/* Find a free slot in the list. */
-	i = fdset_find_free_slot(pfdset);
-	if (i == -1 || fdset_add_fd(pfdset, i, fd, rcb, wcb, dat) < 0) {
+	i = pfdset->num < MAX_FDS ? pfdset->num++ : -1;
+
+	if (i == -1) {
 		pthread_mutex_unlock(&pfdset->fd_mutex);
 		return -2;
 	}
 
-	pfdset->num++;
+	fdset_add_fd(pfdset, i, fd, rcb, wcb, dat);
 
 	pthread_mutex_unlock(&pfdset->fd_mutex);
 
@@ -189,7 +211,6 @@
 			pfdset->fd[i].fd = -1;
 			pfdset->fd[i].rcb = pfdset->fd[i].wcb = NULL;
 			pfdset->fd[i].dat = NULL;
-			pfdset->num--;
 			i = -1;
 		}
 		pthread_mutex_unlock(&pfdset->fd_mutex);
@@ -198,24 +219,6 @@
 	return dat;
 }
 
-/**
- *  Unregister the fd at the specified slot from the fdset.
- */
-static void
-fdset_del_slot(struct fdset *pfdset, int index)
-{
-	if (pfdset == NULL || index < 0 || index >= MAX_FDS)
-		return;
-
-	pthread_mutex_lock(&pfdset->fd_mutex);
-
-	pfdset->fd[index].fd = -1;
-	pfdset->fd[index].rcb = pfdset->fd[index].wcb = NULL;
-	pfdset->fd[index].dat = NULL;
-	pfdset->num--;
-
-	pthread_mutex_unlock(&pfdset->fd_mutex);
-}
 
 /**
  * This functions runs in infinite blocking loop until there is no fd in
@@ -229,55 +232,74 @@
 void
 fdset_event_dispatch(struct fdset *pfdset)
 {
-	fd_set rfds, wfds;
-	int i, maxfds;
+	int i;
+	struct pollfd *pfd;
 	struct fdentry *pfdentry;
-	int num = MAX_FDS;
 	fd_cb rcb, wcb;
 	void *dat;
-	int fd;
+	int fd, numfds;
 	int remove1, remove2;
-	int ret;
 
 	if (pfdset == NULL)
 		return;
 
 	while (1) {
-		struct timeval tv;
-		tv.tv_sec = 1;
-		tv.tv_usec = 0;
-		FD_ZERO(&rfds);
-		FD_ZERO(&wfds);
-		pthread_mutex_lock(&pfdset->fd_mutex);
-
-		maxfds = fdset_fill(&rfds, &wfds, pfdset);
-
-		pthread_mutex_unlock(&pfdset->fd_mutex);
+		int has_holes;
 
 		/*
-		 * When select is blocked, other threads might unregister
+		 * When poll is blocked, other threads might unregister
 		 * listenfds from and register new listenfds into fdset.
-		 * When select returns, the entries for listenfds in the fdset
+		 * When poll returns, the entries for listenfds in the fdset
 		 * might have been updated. It is ok if there is unwanted call
 		 * for new listenfds.
 		 */
-		ret = select(maxfds + 1, &rfds, &wfds, NULL, &tv);
-		if (ret <= 0)
-			continue;
+		pthread_mutex_lock(&pfdset->fd_mutex);
 
-		for (i = 0; i < num; i++) {
-			remove1 = remove2 = 0;
+		numfds = pfdset->num;
+
+		pthread_mutex_unlock(&pfdset->fd_mutex);
+
+		poll(pfdset->rwfds, numfds, 1000 /* millisecs */);
+
+		has_holes = 0;
+		for (i = 0; i < numfds; i++) {
 			pthread_mutex_lock(&pfdset->fd_mutex);
+
 			pfdentry = &pfdset->fd[i];
 			fd = pfdentry->fd;
+			pfd = &pfdset->rwfds[i];
+
+			if (fd < 0) {
+				/* Removed during poll, care later */
+
+				has_holes = 1;
+
+				pthread_mutex_unlock(&pfdset->fd_mutex);
+
+				continue;
+			}
+
+			if (!pfd->revents) {
+
+				pthread_mutex_unlock(&pfdset->fd_mutex);
+
+				continue;
+			}
+
+			/* Valid fd, and at least one revent ... */
+
+			remove1 = remove2 = 0;
+
 			rcb = pfdentry->rcb;
 			wcb = pfdentry->wcb;
 			dat = pfdentry->dat;
 			pfdentry->busy = 1;
+
 			pthread_mutex_unlock(&pfdset->fd_mutex);
-			if (fd >= 0 && FD_ISSET(fd, &rfds) && rcb)
+
+			if (rcb && pfd->revents & (POLLIN | FDPOLLERR))
 				rcb(fd, dat, &remove1);
-			if (fd >= 0 && FD_ISSET(fd, &wfds) && wcb)
+			if (wcb && pfd->revents & (POLLOUT | FDPOLLERR))
 				wcb(fd, dat, &remove2);
 			pfdentry->busy = 0;
 			/*
@@ -292,8 +314,18 @@
 			 * listen fd in another thread, we couldn't call
 			 * fd_set_del.
 			 */
-			if (remove1 || remove2)
-				fdset_del_slot(pfdset, i);
+			if (remove1 || remove2) {
+				pfdentry->fd = -1;
+				has_holes = 1;
+			}
+		}
+
+		if (has_holes) {
+			pthread_mutex_lock(&pfdset->fd_mutex);
+
+			fdset_pack(pfdset);
+
+			pthread_mutex_unlock(&pfdset->fd_mutex);
 		}
 	}
 }
diff --git a/lib/librte_vhost/fd_man.h b/lib/librte_vhost/fd_man.h
index bd66ed1..d319cac 100644
--- a/lib/librte_vhost/fd_man.h
+++ b/lib/librte_vhost/fd_man.h
@@ -35,6 +35,7 @@
 #define _FD_MAN_H_
 #include <stdint.h>
 #include <pthread.h>
+#include <poll.h>
 
 #define MAX_FDS 1024
 
@@ -49,6 +50,7 @@ struct fdentry {
 };
 
 struct fdset {
+	struct pollfd rwfds[MAX_FDS];
 	struct fdentry fd[MAX_FDS];
 	pthread_mutex_t fd_mutex;
 	int num;	/* current fd number of this fdset */
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v2 1/6] eventdev: introduce event driven programming model
From: Bruce Richardson @ 2016-12-14 15:19 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: dev, thomas.monjalon, hemant.agrawal, gage.eads, harry.van.haaren
In-Reply-To: <1480996340-29871-2-git-send-email-jerin.jacob@caviumnetworks.com>

On Tue, Dec 06, 2016 at 09:22:15AM +0530, Jerin Jacob wrote:
> In a polling model, lcores poll ethdev ports and associated
> rx queues directly to look for packet. In an event driven model,
> by contrast, lcores call the scheduler that selects packets for
> them based on programmer-specified criteria. Eventdev library
> adds support for event driven programming model, which offer
> applications automatic multicore scaling, dynamic load balancing,
> pipelining, packet ingress order maintenance and
> synchronization services to simplify application packet processing.
> 
> By introducing event driven programming model, DPDK can support
> both polling and event driven programming models for packet processing,
> and applications are free to choose whatever model
> (or combination of the two) that best suits their needs.
> 
> This patch adds the eventdev specification header file.
> 
> Signed-off-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> ---
<snip>
> + *
> + * The *nb_events* parameter is the number of event objects to enqueue which are
> + * supplied in the *ev* array of *rte_event* structure.
> + *
> + * The rte_event_enqueue_burst() function returns the number of
> + * events objects it actually enqueued. A return value equal to *nb_events*
> + * means that all event objects have been enqueued.
> + *
> + * @param dev_id
> + *   The identifier of the device.
> + * @param port_id
> + *   The identifier of the event port.
> + * @param ev
> + *   Points to an array of *nb_events* objects of type *rte_event* structure
> + *   which contain the event object enqueue operations to be processed.
> + * @param nb_events
> + *   The number of event objects to enqueue, typically number of
> + *   rte_event_port_enqueue_depth() available for this port.
> + *
> + * @return
> + *   The number of event objects actually enqueued on the event device. The
> + *   return value can be less than the value of the *nb_events* parameter when
> + *   the event devices queue is full or if invalid parameters are specified in a
> + *   *rte_event*. If return value is less than *nb_events*, the remaining events
> + *   at the end of ev[] are not consumed,and the caller has to take care of them
> + *
> + * @see rte_event_port_enqueue_depth()
> + */
> +uint16_t
> +rte_event_enqueue_burst(uint8_t dev_id, uint8_t port_id, struct rte_event ev[],
> +			uint16_t nb_events);
> +
One suggestion - do we want to make the ev[] array const, to disallow
drivers from modifying the events passed in? Since the event structure
is only 16B big, it should be small enough to be copied around in
scheduler instances, allow the original events to remain unmodified.

/Bruce

^ permalink raw reply

* Re: [PATCH v2 1/6] eventdev: introduce event driven programming model
From: Bruce Richardson @ 2016-12-14 15:15 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Van Haaren, Harry, dev@dpdk.org, thomas.monjalon@6wind.com,
	hemant.agrawal@nxp.com, Eads, Gage
In-Reply-To: <20161214131356.GA4224@localhost.localdomain>

On Wed, Dec 14, 2016 at 06:43:58PM +0530, Jerin Jacob wrote:
> On Thu, Dec 08, 2016 at 11:02:16AM +0000, Van Haaren, Harry wrote:
> > > From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
> > > Sent: Thursday, December 8, 2016 1:24 AM
> > > To: Van Haaren, Harry <harry.van.haaren@intel.com>
> > 
> > <snip>
> > 
> > > > * Operation and sched_type *increased* to 4 bits each (from previous value of 2) to
> > > allow future expansion without ABI changes
> > > 
> > > Anyway it will break ABI if we add new operation. I would propose to keep 4bit
> > > reserved and add it when required.
> > 
> > Ok sounds good. I'll suggest to move it to the middle between operation or sched type, which would allow expanding operation without ABI breaks. On expanding the field would remain in the same place with the same bits available in that place (no ABI break), but new bits can be added into the currently reserved space.
> 
> OK. We will move the rsvd field as you suggested.
> 
> > 
> > 
> > > > * Restore flow_id to 24 bits of a 32 bit int (previous size was 20 bits)
> > > > * sub-event-type reduced to 4 bits (previous value was 8 bits). Can we think of
> > > situations where 16 values for application specified identifiers of each event-type is
> > > genuinely not enough?
> > > One packet will not go beyond 16 stages but an application may have more stages and
> > > each packet may go mutually exclusive stages. For example,
> > > 
> > > packet 0: stagex_0 ->stagex_1
> > > packet 1: stagey_0 ->stagey_1
> > > 
> > > In that sense, IMO, more than 16 is required.(AFIAK, VPP has any much larger limit on
> > > number of stages)
> > 
> > My understanding was that stages are linked to event queues, so the application can determine the stage the packet comes from by reading queue_id?
> 
> That is one way of doing it. But it is limited to number of queues
> therefore scalability issues.Another approach is through
> sub_event_type scheme without depended on the number of queues.
> 
> > 
> > I'm not opposed to having an 8 bit sub_event_type, but it seems unnecessarily large from my point of view. If you have a use for it, I'm ok with 8 bits.
> 
> OK
> 
> > 
> > 
> > > > In my opinion this structure layout is more balanced, and will perform better due to
> > > less loads that will need masking to access the required value.
> > > OK. Considering more balanced layout and above points. I propose following scheme(based on
> > > your input)
> > > 
> > > 	union {
> > > 		uint64_t event;
> > > 		struct {
> > > 			uint32_t flow_id: 20;
> > > 			uint32_t sub_event_type : 8;
> > > 			uint32_t event_type : 4;
> > > 
> > > 			uint8_t rsvd: 4; /* for future additions */
> > > 			uint8_t operation  : 2; /* new fwd drop */
> > > 			uint8_t sched_type : 2;
> > > 
> > > 			uint8_t queue_id;
> > > 			uint8_t priority;
> > > 			uint8_t impl_opaque;
> > > 		};
> > > 	};
> > > 
> > > Feedback and improvements welcomed,
> > 
> > 
> > So incorporating my latest suggestions on moving fields around, excluding sub_event_type *size* changes:
> > 
> > union {
> > 	uint64_t event;
> > 	struct {
> > 		uint32_t flow_id: 20;
> > 		uint32_t event_type : 4;
> > 		uint32_t sub_event_type : 8; /* 8 bits now naturally aligned */
> 
> Just one suggestion here. I am not sure about the correct split between
> number of bits to represent flow_id and sub_event_type fields. And its
> connected in our implementation, so I propose to move sub_event_type up so
> that future flow_id/sub_event_type bit field size change request wont
> impact our implementation. Since it is represented as 32bit as whole, I
> don't think there is an alignment issue.
> 
> So incorporating my latest suggestions on moving sub_event_type field around:
> 
> union {
> 	uint64_t event;
> 	struct {
> 		uint32_t flow_id: 20;
> 		uint32_t sub_event_type : 8;
> 		uint32_t event_type : 4;
> 

The issue with the above layout is that you have an 8-bit value which
can never be accessed as a byte. With the layout above proposed by
Harry, the sub_event_type can be accessed without any bit manipultaion
operations just by doing a byte read. With the layout you propose, all
fields require masking and/or shifting to access. It won't affect the
scheduler performance for us, but it means potentially more cycles in
the app to access those fields.

/Bruce

^ permalink raw reply

* No packets received if burst is too small in rte_eth_rx_burst
From: tom.barbette @ 2016-12-14 15:13 UTC (permalink / raw)
  To: dev
In-Reply-To: <415214732.17903310.1481728244157.JavaMail.zimbra@ulg.ac.be>

Hi list,

Between 2.2.0 and 16.04 (up to at least 16.07.2 if not current), with the XL710 controller I do not get any packet when calling rte_eth_rx_burst if nb_pkts is too small. I would say smaller than 32. The input rate is not big, if that helps. But It should definitely get at least one packet per second.

Any ideas? Is that a bug or expected behaviour? Could be caused by other ABI changes?

Thanks,
Tom

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox