Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH net-next-2.6] net: add rcu protection to netdev->ifalias
From: Eric Dumazet @ 2011-05-18  4:35 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, kaber, netdev, remi.denis-courmont
In-Reply-To: <1305671956.6741.25.camel@edumazet-laptop>

Le mercredi 18 mai 2011 à 00:39 +0200, Eric Dumazet a écrit :

> I dont know, if it happens to be too hard, we'll just stick again rtnl
> for "ip link show" ;)
> 
> 

I believe I'll take this path, its a bit too hard for the moment, and
need several preliminary steps :

It is making sense trying to get rid of rtnl_trylock() hack with more
fine grained locks, and thus lower pressure on RTNL.

Some synchronize_rcu() calls are done whith RTNL held : All processes
hitting rtnl_trylock() have to enter a busy loop, restarting syscall as
long as the RTNL owner is blocked in synchronize_rcu().

^ permalink raw reply

* [PATCH 0/2] bna: Add Debugfs & Generic Netlink Interfaces to BNA Driver
From: Rasesh Mody @ 2011-05-18  4:56 UTC (permalink / raw)
  To: davem, netdev; +Cc: adapter_linux_open_src_team, Rasesh Mody

This patch set adds debugfs and generic netlink interfaces to BNA driver for
debugging and managing the Brocade adapter.

Rasesh Mody (2):
  bna: Add Debugfs Interface
  bna: Add Generic Netlink Interface

 drivers/net/bna/Makefile       |    3 +-
 drivers/net/bna/bfa_defs.h     |   18 ++
 drivers/net/bna/bfa_ioc.c      |  113 +++++++++++++-
 drivers/net/bna/bfa_ioc.h      |    7 +
 drivers/net/bna/bfi.h          |    2 +
 drivers/net/bna/bna_ctrl.c     |    5 +-
 drivers/net/bna/bnad.c         |   48 +++++-
 drivers/net/bna/bnad.h         |   16 ++-
 drivers/net/bna/bnad_debugfs.c |  302 +++++++++++++++++++++++++++++++++++
 drivers/net/bna/bnad_genl.c    |  345 ++++++++++++++++++++++++++++++++++++++++
 drivers/net/bna/bnad_genl.h    |   87 ++++++++++
 11 files changed, 931 insertions(+), 15 deletions(-)
 create mode 100644 drivers/net/bna/bnad_debugfs.c
 create mode 100644 drivers/net/bna/bnad_genl.c
 create mode 100644 drivers/net/bna/bnad_genl.h


^ permalink raw reply

* [PATCH 1/2] bna: Add Debugfs Interface
From: Rasesh Mody @ 2011-05-18  4:57 UTC (permalink / raw)
  To: davem, netdev; +Cc: adapter_linux_open_src_team, Rasesh Mody, Debashis Dutt
In-Reply-To: <1305694621-28023-1-git-send-email-rmody@brocade.com>

This patch adds the debugfs interface to BNA driver for collecting both
live and saved firmware traces (saved, in case of a firmware heart beat
failure).

Signed-off-by: Debashis Dutt <ddutt@brocade.com>
Signed-off-by: Rasesh Mody <rmody@brocade.com>
---
 drivers/net/bna/Makefile       |    3 +-
 drivers/net/bna/bfa_ioc.c      |  105 ++++++++++++++-
 drivers/net/bna/bfa_ioc.h      |    6 +
 drivers/net/bna/bfi.h          |    2 +
 drivers/net/bna/bna_ctrl.c     |    5 +-
 drivers/net/bna/bnad.c         |   37 +++++-
 drivers/net/bna/bnad.h         |   15 ++-
 drivers/net/bna/bnad_debugfs.c |  302 ++++++++++++++++++++++++++++++++++++++++
 8 files changed, 466 insertions(+), 9 deletions(-)
 create mode 100644 drivers/net/bna/bnad_debugfs.c

diff --git a/drivers/net/bna/Makefile b/drivers/net/bna/Makefile
index a5d604d..4bb0d5d 100644
--- a/drivers/net/bna/Makefile
+++ b/drivers/net/bna/Makefile
@@ -5,7 +5,8 @@
 
 obj-$(CONFIG_BNA) += bna.o
 
-bna-objs := bnad.o bnad_ethtool.o bna_ctrl.o bna_txrx.o
+bna-objs := bnad.o bnad_debugfs.o bnad_ethtool.o
+bna-objs += bna_ctrl.o bna_txrx.o
 bna-objs += bfa_ioc.o bfa_ioc_ct.o bfa_cee.o cna_fwimg.o
 
 EXTRA_CFLAGS := -Idrivers/net/bna
diff --git a/drivers/net/bna/bfa_ioc.c b/drivers/net/bna/bfa_ioc.c
index fcb9bb3..15f9dec 100644
--- a/drivers/net/bna/bfa_ioc.c
+++ b/drivers/net/bna/bfa_ioc.c
@@ -1652,6 +1652,7 @@ bfa_ioc_fail_notify(struct bfa_ioc *ioc)
 {
 	struct list_head		*qe;
 	struct bfa_ioc_hbfail_notify	*notify;
+	int				tlen;
 
 	/**
 	 * Notify driver and common modules registered for notification.
@@ -1661,6 +1662,15 @@ bfa_ioc_fail_notify(struct bfa_ioc *ioc)
 		notify = (struct bfa_ioc_hbfail_notify *) qe;
 		notify->cbfn(notify->cbarg);
 	}
+
+	/* Save firmware trace if configured. */
+	if (ioc->dbg_fwsave_once) {
+		ioc->dbg_fwsave_once = false;
+		if (ioc->dbg_fwsave_len) {
+			tlen = ioc->dbg_fwsave_len;
+			bfa_nw_ioc_debug_fwtrc(ioc, ioc->dbg_fwsave, &tlen);
+		}
+	}
 }
 
 static void
@@ -1922,6 +1932,17 @@ bfa_ioc_smem_pgnum(struct bfa_ioc *ioc, u32 fmaddr)
 	return PSS_SMEM_PGNUM(ioc->ioc_regs.smem_pg0, fmaddr);
 }
 
+/*
+ * Initialize memory for saving firmware trace. Driver must initialize
+ * trace memory before call bfa_ioc_enable().
+ */
+void
+bfa_nw_ioc_debug_memclaim(struct bfa_ioc *ioc, void *dbg_fwsave)
+{
+	ioc->dbg_fwsave     = dbg_fwsave;
+	ioc->dbg_fwsave_len = (ioc->iocpf.auto_recover) ? BFA_DBG_FWTRC_LEN : 0;
+}
+
 /**
  * Register mailbox message handler function, to be called by common modules
  */
@@ -2209,13 +2230,95 @@ bfa_nw_ioc_get_mac(struct bfa_ioc *ioc)
 	return ioc->attr->mac;
 }
 
+static int
+bfa_ioc_smem_read(struct bfa_ioc *ioc, void *tbuf, u32 soff, u32 sz)
+{
+	u32 pgnum, loff;
+	__be32 r32;
+	int i, len;
+	u32 *buf = tbuf;
+
+	pgnum = PSS_SMEM_PGNUM(ioc->ioc_regs.smem_pg0, soff);
+	loff = PSS_SMEM_PGOFF(soff);
+
+	/*
+	 *  Hold semaphore to serialize pll init and fwtrc.
+	 */
+	if (!(bfa_nw_ioc_sem_get(ioc->ioc_regs.ioc_init_sem_reg)))
+		return 1;
+
+	writel(pgnum, ioc->ioc_regs.host_page_num_fn);
+
+	len = sz/sizeof(u32);
+	for (i = 0; i < len; i++) {
+		r32 = swab32(readl(ioc->ioc_regs.smem_page_start + loff));
+		buf[i] = be32_to_cpu(r32);
+		loff += sizeof(u32);
+
+		/*
+		 * handle page offset wrap around
+		 */
+		loff = PSS_SMEM_PGOFF(loff);
+		if (loff == 0) {
+			pgnum++;
+			writel(pgnum, ioc->ioc_regs.host_page_num_fn);
+		}
+	}
+	writel(PSS_SMEM_PGNUM(ioc->ioc_regs.smem_pg0, 0),
+			ioc->ioc_regs.host_page_num_fn);
+	/*
+	 *  release semaphore.
+	 */
+	writel(1, ioc->ioc_regs.ioc_init_sem_reg);
+
+	return 0;
+}
+
+int
+bfa_nw_ioc_debug_fwtrc(struct bfa_ioc *ioc, void *trcdata, int *trclen)
+{
+	u32 loff = (BFI_IOC_TRC_OFF + BFA_DBG_FWTRC_LEN * (ioc->port_id));
+	int tlen, status = 0;
+
+	tlen = *trclen;
+	if (tlen > BFA_DBG_FWTRC_LEN)
+		tlen = BFA_DBG_FWTRC_LEN;
+
+	status = bfa_ioc_smem_read(ioc, trcdata, loff, tlen);
+	*trclen = tlen;
+	return status;
+}
+
+/**
+ * Retrieve saved firmware trace from a prior IOC failure.
+ */
+int
+bfa_nw_ioc_debug_fwsave(struct bfa_ioc *ioc, void *trcdata, int *trclen)
+{
+	int	tlen;
+
+	if (ioc->iocpf.auto_recover)
+		ioc->dbg_fwsave_len = BFA_DBG_FWTRC_LEN;
+	else
+		return BFA_STATUS_ENOFSAVE;
+
+	tlen = *trclen;
+	if (tlen > ioc->dbg_fwsave_len)
+		tlen = ioc->dbg_fwsave_len;
+
+	memcpy(trcdata, ioc->dbg_fwsave, tlen);
+	*trclen = tlen;
+	return BFA_STATUS_OK;
+}
+
 /**
  * Firmware failure detected. Start recovery actions.
  */
 static void
 bfa_ioc_recover(struct bfa_ioc *ioc)
 {
-	pr_crit("Heart Beat of IOC has failed\n");
+	pr_crit("bna: Heart Beat of IOC has failed for pci funtion %u\n",
+		ioc->pcidev.pci_func);
 	bfa_ioc_stats(ioc, ioc_hbfails);
 	bfa_fsm_send_event(ioc, IOC_E_HBFAIL);
 }
diff --git a/drivers/net/bna/bfa_ioc.h b/drivers/net/bna/bfa_ioc.h
index bd48abe..49739cd 100644
--- a/drivers/net/bna/bfa_ioc.h
+++ b/drivers/net/bna/bfa_ioc.h
@@ -19,6 +19,9 @@
 #ifndef __BFA_IOC_H__
 #define __BFA_IOC_H__
 
+#define BFA_DBG_FWTRC_LEN	(BFI_IOC_TRC_ENTS * BFI_IOC_TRC_ENT_SZ + \
+				BFI_IOC_TRC_HDR_SZ)
+
 #include "bfa_sm.h"
 #include "bfi.h"
 #include "cna.h"
@@ -274,6 +277,9 @@ void bfa_nw_ioc_fwver_get(struct bfa_ioc *ioc,
 bool bfa_nw_ioc_fwver_cmp(struct bfa_ioc *ioc,
 			struct bfi_ioc_image_hdr *fwhdr);
 mac_t bfa_nw_ioc_get_mac(struct bfa_ioc *ioc);
+void bfa_nw_ioc_debug_memclaim(struct bfa_ioc *ioc, void *dbg_fwsave);
+int bfa_nw_ioc_debug_fwtrc(struct bfa_ioc *ioc, void *trcdata, int *trclen);
+int bfa_nw_ioc_debug_fwsave(struct bfa_ioc *ioc, void *trcdata, int *trclen);
 
 /*
  * Timeout APIs
diff --git a/drivers/net/bna/bfi.h b/drivers/net/bna/bfi.h
index 6050379..ee73b6f 100644
--- a/drivers/net/bna/bfi.h
+++ b/drivers/net/bna/bfi.h
@@ -277,6 +277,8 @@ struct bfi_ioc_getattr_reply {
  */
 #define BFI_IOC_TRC_OFF		(0x4b00)
 #define BFI_IOC_TRC_ENTS	256
+#define BFI_IOC_TRC_ENT_SZ	16
+#define BFI_IOC_TRC_HDR_SZ	32
 
 #define BFI_IOC_FW_SIGNATURE	(0xbfadbfad)
 #define BFI_IOC_MD5SUM_SZ	4
diff --git a/drivers/net/bna/bna_ctrl.c b/drivers/net/bna/bna_ctrl.c
index 53b1416..a4833e3 100644
--- a/drivers/net/bna/bna_ctrl.c
+++ b/drivers/net/bna/bna_ctrl.c
@@ -1681,6 +1681,7 @@ bna_adv_device_init(struct bna_device *device, struct bna *bna,
 	device->bna = bna;
 
 	kva = res_info[BNA_RES_MEM_T_FWTRC].res_u.mem_info.mdl[0].kva;
+	bfa_nw_ioc_debug_memclaim(&device->ioc, kva);
 
 	/**
 	 * Attach common modules (Diag, SFP, CEE, Port) and claim respective
@@ -1820,8 +1821,8 @@ bna_adv_res_req(struct bna_res_info *res_info)
 	/* Virtual memory for retreiving fw_trc */
 	res_info[BNA_RES_MEM_T_FWTRC].res_type = BNA_RES_T_MEM;
 	res_info[BNA_RES_MEM_T_FWTRC].res_u.mem_info.mem_type = BNA_MEM_T_KVA;
-	res_info[BNA_RES_MEM_T_FWTRC].res_u.mem_info.num = 0;
-	res_info[BNA_RES_MEM_T_FWTRC].res_u.mem_info.len = 0;
+	res_info[BNA_RES_MEM_T_FWTRC].res_u.mem_info.num = 1;
+	res_info[BNA_RES_MEM_T_FWTRC].res_u.mem_info.len = BFA_DBG_FWTRC_LEN;
 
 	/* DMA memory for retreiving stats */
 	res_info[BNA_RES_MEM_T_STATS].res_type = BNA_RES_T_MEM;
diff --git a/drivers/net/bna/bnad.c b/drivers/net/bna/bnad.c
index e588511..a997276 100644
--- a/drivers/net/bna/bnad.c
+++ b/drivers/net/bna/bnad.c
@@ -44,8 +44,12 @@ MODULE_PARM_DESC(bnad_ioc_auto_recover, "Enable / Disable auto recovery");
 /*
  * Global variables
  */
+u32 bna_id;
 u32 bnad_rxqs_per_cq = 2;
 
+struct mutex bnad_list_mutex;
+LIST_HEAD(bnad_list);
+
 static const u8 bnad_bcast_addr[] =  {0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
 
 /*
@@ -72,6 +76,23 @@ do {								\
 
 #define BNAD_TXRX_SYNC_MDELAY	250	/* 250 msecs */
 
+static void
+bnad_add_to_list(struct bnad *bnad)
+{
+	mutex_lock(&bnad_list_mutex);
+	list_add_tail(&bnad->list_entry, &bnad_list);
+	bnad->id = bna_id++;
+	mutex_unlock(&bnad_list_mutex);
+}
+
+static void
+bnad_remove_from_list(struct bnad *bnad)
+{
+	mutex_lock(&bnad_list_mutex);
+	list_del(&bnad->list_entry);
+	mutex_unlock(&bnad_list_mutex);
+}
+
 /*
  * Reinitialize completions in CQ, once Rx is taken down
  */
@@ -3087,6 +3108,8 @@ bnad_pci_probe(struct pci_dev *pdev,
 	}
 	bnad = netdev_priv(netdev);
 
+	bnad_add_to_list(bnad);
+
 	/*
 	 * PCI initialization
 	 * 	Output : using_dac = 1 for 64 bit DMA
@@ -3129,6 +3152,8 @@ bnad_pci_probe(struct pci_dev *pdev,
 	pcidev_info.device_id = bnad->pcidev->device;
 	pcidev_info.pci_bar_kva = bnad->bar0;
 
+	bnad_debugfs_init(bnad);
+
 	mutex_lock(&bnad->conf_mutex);
 
 	spin_lock_irqsave(&bnad->bna_lock, flags);
@@ -3169,7 +3194,7 @@ bnad_pci_probe(struct pci_dev *pdev,
 	/* Finally, reguister with net_device layer */
 	err = register_netdev(netdev);
 	if (err) {
-		pr_err("BNA : Registering with netdev failed\n");
+		pr_err("bna: Registering with netdev failed\n");
 		goto disable_device;
 	}
 
@@ -3189,6 +3214,8 @@ disable_device:
 	bnad_res_free(bnad);
 	bnad_disable_msix(bnad);
 pci_uninit:
+	bnad_debugfs_uninit(bnad);
+	bnad_remove_from_list(bnad);
 	bnad_pci_uninit(pdev);
 	bnad_lock_uninit(bnad);
 	bnad_uninit(bnad);
@@ -3226,6 +3253,8 @@ bnad_pci_remove(struct pci_dev *pdev)
 
 	bnad_res_free(bnad);
 	bnad_disable_msix(bnad);
+	bnad_debugfs_uninit(bnad);
+	bnad_remove_from_list(bnad);
 	bnad_pci_uninit(pdev);
 	bnad_lock_uninit(bnad);
 	bnad_uninit(bnad);
@@ -3255,13 +3284,14 @@ bnad_module_init(void)
 {
 	int err;
 
-	pr_info("Brocade 10G Ethernet driver\n");
+	pr_info("Brocade 10G Ethernet driver  - version: %s\n", BNAD_VERSION);
 
+	mutex_init(&bnad_list_mutex);
 	bfa_nw_ioc_auto_recover(bnad_ioc_auto_recover);
 
 	err = pci_register_driver(&bnad_pci_driver);
 	if (err < 0) {
-		pr_err("bna : PCI registration failed in module init "
+		pr_err("bna: PCI registration failed in module init "
 		       "(%d)\n", err);
 		return err;
 	}
@@ -3273,6 +3303,7 @@ static void __exit
 bnad_module_exit(void)
 {
 	pci_unregister_driver(&bnad_pci_driver);
+	mutex_destroy(&bnad_list_mutex);
 
 	if (bfi_fw)
 		release_firmware(bfi_fw);
diff --git a/drivers/net/bna/bnad.h b/drivers/net/bna/bnad.h
index ccdabad..2c1f283 100644
--- a/drivers/net/bna/bnad.h
+++ b/drivers/net/bna/bnad.h
@@ -279,13 +279,20 @@ struct bnad {
 	char			adapter_name[BNAD_NAME_LEN];
 	char 			port_name[BNAD_NAME_LEN];
 	char			mbox_irq_name[BNAD_NAME_LEN];
+
+	int			id;
+	struct list_head	list_entry;
+	struct dentry		*port_debugfs_root;
+	struct dentry		*bnad_dentry_files[2];
 };
 
 /*
  * EXTERN VARIABLES
  */
-extern struct firmware *bfi_fw;
-extern u32 		bnad_rxqs_per_cq;
+extern struct firmware		*bfi_fw;
+extern struct mutex		bnad_list_mutex;
+extern struct list_head		bnad_list;
+extern u32			bnad_rxqs_per_cq;
 
 /*
  * EXTERN PROTOTYPES
@@ -306,6 +313,10 @@ extern void bnad_cleanup_rx(struct bnad *bnad, uint rx_id);
 /* Timer start/stop protos */
 extern void bnad_dim_timer_start(struct bnad *bnad);
 
+/* Debugfs */
+extern void bnad_debugfs_init(struct bnad *bnad);
+extern void bnad_debugfs_uninit(struct bnad *bnad);
+
 /* Statistics */
 extern void bnad_netdev_qstats_fill(struct bnad *bnad,
 		struct rtnl_link_stats64 *stats);
diff --git a/drivers/net/bna/bnad_debugfs.c b/drivers/net/bna/bnad_debugfs.c
new file mode 100644
index 0000000..4351ca5
--- /dev/null
+++ b/drivers/net/bna/bnad_debugfs.c
@@ -0,0 +1,302 @@
+/*
+ * Linux network driver for Brocade Converged Network Adapter.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License (GPL) Version 2 as
+ * published by the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+/*
+ * Copyright (c) 2005-2011 Brocade Communications Systems, Inc.
+ * All rights reserved
+ * www.brocade.com
+ */
+
+/*
+ * BNAD debufs interface
+ *
+ * To access the interface, debugfs file system should be mounted
+ * if not already mounted using:
+ *	mount -t debugfs none /sys/kernel/debug
+ *
+ * BNAD Hierarchy:
+ *	- bnad/pci_dev:<pci_name>
+ *	  where the pci_name corresponds to the one under
+ *	  /sys/bus/pci/drivers/bnad
+ *
+ * Debugging service available per pci_dev:
+ *	fwtrc:  To collect current firmware trace.
+ *	fwsave: To collect last saved fw trace as a result of firmware crash.
+ */
+#include <linux/debugfs.h>
+
+#include "bnad.h"
+
+struct bnad_debug_info {
+	char *debug_buffer;
+	int buffer_len;
+};
+
+struct bnad_debugfs_entry {
+	const char *name;
+	mode_t	mode;
+	const struct file_operations *fops;
+};
+
+static int
+bnad_debugfs_open_fwtrc(struct inode *inode, struct file *file)
+{
+	struct bnad *bnad = inode->i_private;
+	struct bnad_debug_info *fw_debug;
+	unsigned long flags;
+	int rc;
+
+	fw_debug = kzalloc(sizeof(struct bnad_debug_info), GFP_KERNEL);
+	if (!fw_debug)
+		return -ENOMEM;
+
+	fw_debug->buffer_len = BFA_DBG_FWTRC_LEN;
+
+	fw_debug->debug_buffer = kzalloc(fw_debug->buffer_len, GFP_KERNEL);
+	if (!fw_debug->debug_buffer) {
+		kfree(fw_debug);
+		fw_debug = NULL;
+		pr_warn("bnad[%d]: Failed to allocate fwtrc buffer\n",
+			bnad->id);
+		return -ENOMEM;
+	}
+
+	spin_lock_irqsave(&bnad->bna_lock, flags);
+	rc = bfa_nw_ioc_debug_fwtrc(&bnad->bna.device.ioc,
+			fw_debug->debug_buffer,
+			&fw_debug->buffer_len);
+	spin_unlock_irqrestore(&bnad->bna_lock, flags);
+	if (rc != BFA_STATUS_OK) {
+		kfree(fw_debug->debug_buffer);
+		fw_debug->debug_buffer = NULL;
+		kfree(fw_debug);
+		fw_debug = NULL;
+		pr_warn("bnad[%d]: Failed to collect fwtrc\n", bnad->id);
+		return -ENOMEM;
+	}
+
+	file->private_data = fw_debug;
+
+	return 0;
+}
+
+static int
+bnad_debugfs_open_fwsave(struct inode *inode, struct file *file)
+{
+	struct bnad *bnad = inode->i_private;
+	struct bnad_debug_info *fw_debug;
+	unsigned long flags;
+	int rc;
+
+	fw_debug = kzalloc(sizeof(struct bnad_debug_info), GFP_KERNEL);
+	if (!fw_debug)
+		return -ENOMEM;
+
+	fw_debug->buffer_len = BFA_DBG_FWTRC_LEN;
+
+	fw_debug->debug_buffer = kzalloc(fw_debug->buffer_len, GFP_KERNEL);
+	if (!fw_debug->debug_buffer) {
+		kfree(fw_debug);
+		fw_debug = NULL;
+		pr_warn("bnad[%d]: Failed to allocate fwsave buffer\n",
+			bnad->id);
+		return -ENOMEM;
+	}
+
+	spin_lock_irqsave(&bnad->bna_lock, flags);
+	rc = bfa_nw_ioc_debug_fwsave(&bnad->bna.device.ioc,
+			fw_debug->debug_buffer,
+			&fw_debug->buffer_len);
+	spin_unlock_irqrestore(&bnad->bna_lock, flags);
+	if (rc != BFA_STATUS_OK && rc != BFA_STATUS_ENOFSAVE) {
+		kfree(fw_debug->debug_buffer);
+		fw_debug->debug_buffer = NULL;
+		kfree(fw_debug);
+		fw_debug = NULL;
+		pr_warn("bnad[%d]: Failed to collect fwsave\n", bnad->id);
+		return -ENOMEM;
+	}
+
+	file->private_data = fw_debug;
+
+	return 0;
+}
+
+/* Changes the current file position */
+static loff_t
+bnad_debugfs_lseek(struct file *file, loff_t offset, int orig)
+{
+	loff_t pos = file->f_pos;
+	struct bnad_debug_info *debug = file->private_data;
+
+	if (!debug)
+		return -EINVAL;
+
+	switch (orig) {
+	case 0:
+		file->f_pos = offset;
+		break;
+	case 1:
+		file->f_pos += offset;
+		break;
+	case 2:
+		file->f_pos = debug->buffer_len - offset;
+		break;
+	default:
+		return -EINVAL;
+	}
+
+	if (file->f_pos < 0 || file->f_pos > debug->buffer_len) {
+		file->f_pos = pos;
+		return -EINVAL;
+	}
+
+	return file->f_pos;
+}
+
+static ssize_t
+bnad_debugfs_read(struct file *file, char __user *buf,
+			size_t nbytes, loff_t *pos)
+{
+	struct bnad_debug_info *debug = file->private_data;
+
+	if (!debug || !debug->debug_buffer)
+		return 0;
+
+	return simple_read_from_buffer(buf, nbytes, pos,
+				debug->debug_buffer, debug->buffer_len);
+}
+
+static int
+bnad_debugfs_release_fwtrc(struct inode *inode, struct file *file)
+{
+	struct bnad_debug_info *fw_debug = file->private_data;
+
+	if (!fw_debug)
+		return 0;
+
+	kfree(fw_debug->debug_buffer);
+
+	file->private_data = NULL;
+	kfree(fw_debug);
+	fw_debug = NULL;
+	return 0;
+}
+
+static const struct file_operations bnad_debugfs_op_fwtrc = {
+	.owner		=	THIS_MODULE,
+	.open		=	bnad_debugfs_open_fwtrc,
+	.llseek		=	bnad_debugfs_lseek,
+	.read		=	bnad_debugfs_read,
+	.release	=	bnad_debugfs_release_fwtrc,
+};
+
+static const struct file_operations bnad_debugfs_op_fwsave = {
+	.owner		=	THIS_MODULE,
+	.open		=	bnad_debugfs_open_fwsave,
+	.llseek		=	bnad_debugfs_lseek,
+	.read		=	bnad_debugfs_read,
+	.release	=	bnad_debugfs_release_fwtrc,
+};
+
+static const struct bnad_debugfs_entry bnad_debugfs_files[] = {
+	{ "fwtrc",  S_IFREG|S_IRUGO, &bnad_debugfs_op_fwtrc,  },
+	{ "fwsave", S_IFREG|S_IRUGO, &bnad_debugfs_op_fwsave, },
+};
+
+/* Global varibales */
+static struct dentry *bnad_debugfs_root;
+static atomic_t bnad_debugfs_port_count;
+
+/* Initialize debugfs interface for BNAD */
+void
+bnad_debugfs_init(struct bnad *bnad)
+{
+	const struct bnad_debugfs_entry *file;
+	char name[64];
+	int i;
+
+	/* Setup the BNAD debugfs root directory*/
+	mutex_lock(&bnad_list_mutex);
+	if (!bnad_debugfs_root) {
+		bnad_debugfs_root = debugfs_create_dir("bnad", NULL);
+		atomic_set(&bnad_debugfs_port_count, 0);
+		if (!bnad_debugfs_root) {
+			mutex_unlock(&bnad_list_mutex);
+			pr_warn("BNAD debugfs root dir creation failed\n");
+			return;
+		}
+	}
+	mutex_unlock(&bnad_list_mutex);
+
+	/* Setup the pci_dev debugfs directory for the port */
+	snprintf(name, sizeof(name), "pci_dev:%s", pci_name(bnad->pcidev));
+	if (!bnad->port_debugfs_root) {
+		bnad->port_debugfs_root =
+			debugfs_create_dir(name, bnad_debugfs_root);
+		if (!bnad->port_debugfs_root) {
+			pr_warn("BNAD pci_dev:%s root dir creation failed\n",
+			       pci_name(bnad->pcidev));
+			return;
+		}
+
+		atomic_inc(&bnad_debugfs_port_count);
+
+		for (i = 0; i < ARRAY_SIZE(bnad_debugfs_files); i++) {
+			file = &bnad_debugfs_files[i];
+			bnad->bnad_dentry_files[i] =
+					debugfs_create_file(file->name,
+							file->mode,
+							bnad->port_debugfs_root,
+							bnad,
+							file->fops);
+			if (!bnad->bnad_dentry_files[i]) {
+				pr_warn(
+				       "BNAD pci_dev:%s: create %s entry \
+failed\n", pci_name(bnad->pcidev), file->name);
+				return;
+			}
+		}
+	}
+
+	pr_info("bnad[%d]: Initialized debugfs interface\n", bnad->id);
+}
+
+/* Uninitialize debugfs interface for BNAD */
+void
+bnad_debugfs_uninit(struct bnad *bnad)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(bnad_debugfs_files); i++) {
+		if (bnad->bnad_dentry_files[i]) {
+			debugfs_remove(bnad->bnad_dentry_files[i]);
+			bnad->bnad_dentry_files[i] = NULL;
+		}
+	}
+
+	/* Remove the pci_dev debugfs directory for the port */
+	if (bnad->port_debugfs_root) {
+		debugfs_remove(bnad->port_debugfs_root);
+		bnad->port_debugfs_root = NULL;
+		atomic_dec(&bnad_debugfs_port_count);
+	}
+
+	/* Remove the BNAD debugfs root directory */
+	mutex_lock(&bnad_list_mutex);
+	if (atomic_read(&bnad_debugfs_port_count) == 0) {
+		debugfs_remove(bnad_debugfs_root);
+		bnad_debugfs_root = NULL;
+	}
+	mutex_unlock(&bnad_list_mutex);
+	pr_info("bnad[%d]: Uninitialized debugfs interface\n", bnad->id);
+}
-- 
1.7.1


^ permalink raw reply related

* [PATCH 2/2] bna: Add Generic Netlink Interface
From: Rasesh Mody @ 2011-05-18  4:57 UTC (permalink / raw)
  To: davem, netdev; +Cc: adapter_linux_open_src_team, Rasesh Mody, Debashis Dutt
In-Reply-To: <1305694621-28023-1-git-send-email-rmody@brocade.com>

This patch adds the generic netlink communication interface to BNA driver. The
in-kernel generic netlink infrastructure can be used to collect hardware
specific control information and control attributes. The driver makes use of
the "doit" handler provided by the generic netlink layer to accomplish this.

Signed-off-by: Debashis Dutt <ddutt@brocade.com>
Signed-off-by: Rasesh Mody <rmody@brocade.com>
---
 drivers/net/bna/Makefile    |    2 +-
 drivers/net/bna/bfa_defs.h  |   18 +++
 drivers/net/bna/bfa_ioc.c   |    8 +-
 drivers/net/bna/bfa_ioc.h   |    1 +
 drivers/net/bna/bnad.c      |   11 ++-
 drivers/net/bna/bnad.h      |    1 +
 drivers/net/bna/bnad_genl.c |  345 +++++++++++++++++++++++++++++++++++++++++++
 drivers/net/bna/bnad_genl.h |   87 +++++++++++
 8 files changed, 466 insertions(+), 7 deletions(-)
 create mode 100644 drivers/net/bna/bnad_genl.c
 create mode 100644 drivers/net/bna/bnad_genl.h

diff --git a/drivers/net/bna/Makefile b/drivers/net/bna/Makefile
index 4bb0d5d..f3339dc 100644
--- a/drivers/net/bna/Makefile
+++ b/drivers/net/bna/Makefile
@@ -5,7 +5,7 @@
 
 obj-$(CONFIG_BNA) += bna.o
 
-bna-objs := bnad.o bnad_debugfs.o bnad_ethtool.o
+bna-objs := bnad.o bnad_debugfs.o bnad_ethtool.o bnad_genl.o
 bna-objs += bna_ctrl.o bna_txrx.o
 bna-objs += bfa_ioc.o bfa_ioc_ct.o bfa_cee.o cna_fwimg.o
 
diff --git a/drivers/net/bna/bfa_defs.h b/drivers/net/bna/bfa_defs.h
index 2ea0dfe..2dd0898 100644
--- a/drivers/net/bna/bfa_defs.h
+++ b/drivers/net/bna/bfa_defs.h
@@ -26,6 +26,24 @@
 #define BFA_STRING_32	32
 #define BFA_VERSION_LEN 64
 
+/*
+ * Check if the card having old wwn/mac handling
+ */
+#define bfa_mfg_is_old_wwn_mac_model(type) (( \
+	(type) == BFA_MFG_TYPE_CNA10P2 || \
+	(type) == BFA_MFG_TYPE_CNA10P1 || \
+	(type) == BFA_MFG_TYPE_WANCHESE))
+
+#define bfa_mfg_increment_wwn_mac(m, i)                         \
+do {                                                            \
+	u32 t = ((u32)(m)[0] << 16) | ((u32)(m)[1] << 8) | \
+		(u32)(m)[2];  \
+	t += (i);      \
+	(m)[0] = (t >> 16) & 0xFF;                              \
+	(m)[1] = (t >> 8) & 0xFF;                               \
+	(m)[2] = t & 0xFF;                                      \
+} while (0)
+
 /**
  * ---------------------- adapter definitions ------------
  */
diff --git a/drivers/net/bna/bfa_ioc.c b/drivers/net/bna/bfa_ioc.c
index 15f9dec..eeb7250 100644
--- a/drivers/net/bna/bfa_ioc.c
+++ b/drivers/net/bna/bfa_ioc.c
@@ -82,8 +82,6 @@ static void bfa_ioc_pf_fwmismatch(struct bfa_ioc *ioc);
 static void bfa_ioc_boot(struct bfa_ioc *ioc, u32 boot_type,
 			 u32 boot_param);
 static u32 bfa_ioc_smem_pgnum(struct bfa_ioc *ioc, u32 fmaddr);
-static void bfa_ioc_get_adapter_serial_num(struct bfa_ioc *ioc,
-						char *serial_num);
 static void bfa_ioc_get_adapter_fw_ver(struct bfa_ioc *ioc,
 						char *fw_ver);
 static void bfa_ioc_get_pci_chip_rev(struct bfa_ioc *ioc,
@@ -2045,7 +2043,7 @@ bfa_ioc_get_adapter_attr(struct bfa_ioc *ioc,
 
 	ioc_attr = ioc->attr;
 
-	bfa_ioc_get_adapter_serial_num(ioc, ad_attr->serial_num);
+	bfa_nw_ioc_get_adapter_serial_num(ioc, ad_attr->serial_num);
 	bfa_ioc_get_adapter_fw_ver(ioc, ad_attr->fw_ver);
 	bfa_ioc_get_adapter_optrom_ver(ioc, ad_attr->optrom_ver);
 	bfa_ioc_get_adapter_manufacturer(ioc, ad_attr->manufacturer);
@@ -2096,8 +2094,8 @@ bfa_ioc_get_type(struct bfa_ioc *ioc)
 	}
 }
 
-static void
-bfa_ioc_get_adapter_serial_num(struct bfa_ioc *ioc, char *serial_num)
+void
+bfa_nw_ioc_get_adapter_serial_num(struct bfa_ioc *ioc, char *serial_num)
 {
 	memset(serial_num, 0, BFA_ADAPTER_SERIAL_NUM_LEN);
 	memcpy(serial_num,
diff --git a/drivers/net/bna/bfa_ioc.h b/drivers/net/bna/bfa_ioc.h
index 49739cd..1f71865 100644
--- a/drivers/net/bna/bfa_ioc.h
+++ b/drivers/net/bna/bfa_ioc.h
@@ -280,6 +280,7 @@ mac_t bfa_nw_ioc_get_mac(struct bfa_ioc *ioc);
 void bfa_nw_ioc_debug_memclaim(struct bfa_ioc *ioc, void *dbg_fwsave);
 int bfa_nw_ioc_debug_fwtrc(struct bfa_ioc *ioc, void *trcdata, int *trclen);
 int bfa_nw_ioc_debug_fwsave(struct bfa_ioc *ioc, void *trcdata, int *trclen);
+void bfa_nw_ioc_get_adapter_serial_num(struct bfa_ioc *ioc, char *serial_num);
 
 /*
  * Timeout APIs
diff --git a/drivers/net/bna/bnad.c b/drivers/net/bna/bnad.c
index a997276..09aa132 100644
--- a/drivers/net/bna/bnad.c
+++ b/drivers/net/bna/bnad.c
@@ -25,6 +25,7 @@
 #include <linux/ip.h>
 
 #include "bnad.h"
+#include "bnad_genl.h"
 #include "bna.h"
 #include "cna.h"
 
@@ -721,7 +722,7 @@ bnad_disable_mbox_irq(struct bnad *bnad)
 	BNAD_UPDATE_CTR(bnad, mbox_intr_disabled);
 }
 
-static void
+void
 bnad_set_netdev_perm_addr(struct bnad *bnad)
 {
 	struct net_device *netdev = bnad->netdev;
@@ -3296,6 +3297,10 @@ bnad_module_init(void)
 		return err;
 	}
 
+	/* Register with generic netlink */
+	if (bnad_genl_init())
+		pr_err("bna: Generic Netlink Register failed\n");
+
 	return 0;
 }
 
@@ -3305,6 +3310,10 @@ bnad_module_exit(void)
 	pci_unregister_driver(&bnad_pci_driver);
 	mutex_destroy(&bnad_list_mutex);
 
+	/* Unegister with generic netlink */
+	if (bnad_genl_uninit())
+		pr_err("bna: Generic Netlink Unregister failed\n");
+
 	if (bfi_fw)
 		release_firmware(bfi_fw);
 }
diff --git a/drivers/net/bna/bnad.h b/drivers/net/bna/bnad.h
index 2c1f283..d3cb27b 100644
--- a/drivers/net/bna/bnad.h
+++ b/drivers/net/bna/bnad.h
@@ -302,6 +302,7 @@ extern u32 *cna_get_firmware_buf(struct pci_dev *pdev);
 extern void bnad_set_ethtool_ops(struct net_device *netdev);
 
 /* Configuration & setup */
+extern void bnad_set_netdev_perm_addr(struct bnad *bnad);
 extern void bnad_tx_coalescing_timeo_set(struct bnad *bnad);
 extern void bnad_rx_coalescing_timeo_set(struct bnad *bnad);
 
diff --git a/drivers/net/bna/bnad_genl.c b/drivers/net/bna/bnad_genl.c
new file mode 100644
index 0000000..eec2a56
--- /dev/null
+++ b/drivers/net/bna/bnad_genl.c
@@ -0,0 +1,345 @@
+/*
+ * Linux network driver for Brocade Converged Network Adapter.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License (GPL) Version 2 as
+ * published by the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+/*
+ * Copyright (c) 2005-2011 Brocade Communications Systems, Inc.
+ * All rights reserved
+ * www.brocade.com
+ */
+#include "bnad.h"
+#include "bnad_genl.h"
+#include "bna.h"
+
+static struct nla_policy bnad_genl_policy[BNAD_GENL_ATTR_MAX + 1] = {
+	[BNAD_GENL_ATTR_IOC_INFO]
+				= { .len = sizeof(struct bnad_genl_ioc_info) },
+	[BNAD_GENL_ATTR_IOC_ATTR]
+				= { .len = sizeof(struct bnad_genl_ioc_attr) },
+};
+
+static struct genl_family bnad_genl_family = {
+	.id = GENL_ID_GENERATE,
+	.name = "BNAD_GENL",
+	.version = BNAD_GENL_VERSION,
+	.hdrsize = 0,
+	.maxattr = BNAD_GENL_ATTR_MAX,
+};
+
+static struct bnad *
+bnad_get_bnadev(int bna_id)
+{
+	struct bnad *bnad;
+
+	mutex_lock(&bnad_list_mutex);
+	list_for_each_entry(bnad, &bnad_list, list_entry) {
+		if (bnad->id == bna_id) {
+			mutex_unlock(&bnad_list_mutex);
+			return bnad;
+		}
+	}
+	mutex_unlock(&bnad_list_mutex);
+	return NULL;
+}
+
+static void
+bnad_hwpath_get(struct bnad *bnad, char *hwpath, char *adapter_hwpath)
+{
+	int i;
+
+	strcpy(hwpath, pci_name(bnad->pcidev));
+	strcpy(adapter_hwpath, pci_name(bnad->pcidev));
+	i = strlen(adapter_hwpath) - 1;
+	while (i && (adapter_hwpath[i] != '.'))
+		i--;
+	adapter_hwpath[i] = '\0';
+}
+
+static void
+bnad_get_pci_attr(struct bnad *bnad, struct bfa_ioc_pci_attr *pci_attr)
+{
+	pci_attr->vendor_id = bnad->pcidev->vendor;
+	pci_attr->device_id = bnad->pcidev->device;
+	pci_attr->ssid = bnad->pcidev->subsystem_device;
+	pci_attr->ssvid = bnad->pcidev->subsystem_vendor;
+	pci_attr->pcifn = PCI_FUNC(bnad->pcidev->devfn);
+}
+
+static int
+bnad_genl_ioc_info_rsp(struct genl_info *info,
+	struct bnad_genl_ioc_info *genlcmd, size_t attr_size, u8 cmd)
+{
+	struct sk_buff *rsp_skb = NULL;
+	void *genl_msg_hdr = NULL;
+	size_t msg_size;
+	int err = 0;
+
+	msg_size = nla_total_size(attr_size);
+
+	rsp_skb = genlmsg_new(msg_size, GFP_KERNEL);
+	if (!rsp_skb) {
+		pr_warn("bnad[%d]: Failed to get response skb\n",
+			genlcmd->bnad_num);
+		return -ENOMEM;
+	}
+
+	genl_msg_hdr = genlmsg_put(rsp_skb, info->snd_pid, info->snd_seq,
+		&bnad_genl_family, 0, cmd);
+	if (!genl_msg_hdr) {
+		pr_warn("bnad[%d]: Failed to get the genl_msg_header\n",
+			genlcmd->bnad_num);
+		err = -ENOMEM;
+		goto failure;
+	}
+
+	NLA_PUT(rsp_skb, BNAD_GENL_ATTR_IOC_INFO, attr_size, genlcmd);
+
+	err = genlmsg_end(rsp_skb, genl_msg_hdr);
+	if (err < 0) {
+		pr_warn("bnad[%d]: Failed to do genlmsg_end\n",
+			genlcmd->bnad_num);
+		goto failure;
+	}
+
+	err = genlmsg_reply(rsp_skb, info);
+	if (err)
+		pr_warn("bnad[%d]: Could not do genlmsg_reply (%d)\n",
+			genlcmd->bnad_num, err);
+
+	return err;
+
+nla_put_failure:
+	pr_warn("bnad[%d]: Failed to do NLA_PUT\n", genlcmd->bnad_num);
+	err = -EMSGSIZE;
+failure:
+	genlmsg_cancel(rsp_skb, genl_msg_hdr);
+	kfree_skb(rsp_skb);
+	return err;
+}
+
+static int
+bnad_genl_ioc_attr_rsp(struct genl_info *info,
+	struct bnad_genl_ioc_attr *genlcmd, size_t attr_size, u8 cmd)
+{
+	struct sk_buff *rsp_skb = NULL;
+	void *genl_msg_hdr = NULL;
+	size_t msg_size;
+	int err = 0;
+
+	msg_size = nla_total_size(attr_size);
+
+	rsp_skb = genlmsg_new(msg_size, GFP_KERNEL);
+	if (!rsp_skb) {
+		pr_warn("bnad[%d]: Failed to get response skb\n",
+			genlcmd->bnad_num);
+		return -ENOMEM;
+	}
+
+	genl_msg_hdr = genlmsg_put(rsp_skb, info->snd_pid, info->snd_seq,
+		&bnad_genl_family, 0, cmd);
+	if (!genl_msg_hdr) {
+		pr_warn("bnad[%d]: Failed to get the genl_msg_header\n",
+			genlcmd->bnad_num);
+		err = -ENOMEM;
+		goto failure;
+	}
+
+	NLA_PUT(rsp_skb, BNAD_GENL_ATTR_IOC_ATTR, attr_size, genlcmd);
+
+	err = genlmsg_end(rsp_skb, genl_msg_hdr);
+	if (err < 0) {
+		pr_warn("bnad[%d]: Failed to do genlmsg_end\n",
+			genlcmd->bnad_num);
+		goto failure;
+	}
+
+	err = genlmsg_reply(rsp_skb, info);
+	if (err)
+		pr_warn("bnad[%d]: Could not do genlmsg_reply (%d)\n",
+			genlcmd->bnad_num, err);
+
+	return err;
+
+nla_put_failure:
+	pr_warn("bnad[%d]: Failed to do NLA_PUT\n", genlcmd->bnad_num);
+	err = -EMSGSIZE;
+failure:
+	genlmsg_cancel(rsp_skb, genl_msg_hdr);
+	kfree_skb(rsp_skb);
+	return err;
+}
+
+/* Note: Should be called holding bnad_conf_lock */
+static int
+bnad_genl_ioc_get_info(struct sk_buff *skb, struct genl_info *info)
+{
+	struct bnad *bnad = NULL;
+	struct bnad_genl_ioc_info *genlcmd = NULL;
+	struct bfa_ioc *ioc = NULL;
+	unsigned long flags = 0;
+	int err;
+
+	genlcmd = (struct bnad_genl_ioc_info *)
+		nla_data(info->attrs[BNAD_GENL_ATTR_IOC_INFO]);
+
+	bnad = bnad_get_bnadev(genlcmd->bnad_num);
+	if (!bnad) {
+		pr_warn("bna: Failed to get driver instance\n");
+		return -EINVAL;
+	}
+	ioc = &bnad->bna.device.ioc;
+
+	spin_lock_irqsave(&bnad->bna_lock, flags);
+	bfa_nw_ioc_get_adapter_serial_num(ioc, genlcmd->serialnum);
+	genlcmd->mac = bfa_nw_ioc_get_mac(ioc);
+	/* Get manufacturing MAC */
+	genlcmd->factory_mac = ioc->attr->mfg_mac;
+	if (bfa_mfg_is_old_wwn_mac_model(ioc->attr->card_type))
+		genlcmd->factory_mac.mac[MAC_ADDRLEN - 1] += bfa_ioc_pcifn(ioc);
+	else
+		bfa_mfg_increment_wwn_mac(
+			&(genlcmd->factory_mac.mac[MAC_ADDRLEN-3]),
+			bfa_ioc_pcifn(ioc));
+
+	/* Get stack MAC */
+	if (is_zero_ether_addr(&bnad->perm_addr.mac[0])) {
+		bna_port_mac_get(&bnad->bna.port, &bnad->perm_addr);
+		bnad_set_netdev_perm_addr(bnad);
+	}
+	memcpy(&genlcmd->current_mac, bnad->netdev->dev_addr, sizeof(mac_t));
+
+	genlcmd->bnad_num = bnad->id;
+	spin_unlock_irqrestore(&bnad->bna_lock, flags);
+	bnad_hwpath_get(bnad, genlcmd->hwpath, genlcmd->adapter_hwpath);
+	sprintf(genlcmd->eth_name, "%s", bnad->netdev->name);
+	strncpy(genlcmd->name, bnad->adapter_name, sizeof(genlcmd->name) - 1);
+	strncpy(genlcmd->port_name, bnad->port_name,
+		sizeof(genlcmd->port_name) - 1);
+	genlcmd->ioc_type = BFA_IOC_TYPE_LL;
+	genlcmd->status = BFA_STATUS_OK;
+
+	err = bnad_genl_ioc_info_rsp(info, genlcmd,
+		sizeof(struct bnad_genl_ioc_info), BNAD_GENL_CMD_IOC_INFO);
+
+	return err;
+}
+
+/* Note: Should be called holding bnad_conf_lock */
+static int
+bnad_genl_ioc_get_attr(struct sk_buff *skb, struct genl_info *info)
+{
+	struct bnad *bnad = NULL;
+	struct bnad_genl_ioc_attr *genlcmd = NULL;
+	struct bfa_ioc *ioc = NULL;
+	unsigned long flags = 0;
+	int err = 0;
+
+	genlcmd = (struct bnad_genl_ioc_attr *)
+		nla_data(info->attrs[BNAD_GENL_ATTR_IOC_ATTR]);
+
+	bnad = bnad_get_bnadev(genlcmd->bnad_num);
+	if (!bnad) {
+		pr_warn("bna: Failed to get driver instance\n");
+		return -EINVAL;
+	}
+	ioc = &bnad->bna.device.ioc;
+
+	memset(&genlcmd->ioc_attr, 0, sizeof(genlcmd->ioc_attr));
+	spin_lock_irqsave(&bnad->bna_lock, flags);
+	bfa_nw_ioc_get_attr(&bnad->bna.device.ioc, &genlcmd->ioc_attr);
+	spin_unlock_irqrestore(&bnad->bna_lock, flags);
+	genlcmd->ioc_attr.ioc_type = BFA_IOC_TYPE_LL;
+	strcpy(genlcmd->ioc_attr.driver_attr.driver, BNAD_NAME);
+	strncpy(genlcmd->ioc_attr.driver_attr.driver_ver,
+		BNAD_VERSION, BFA_VERSION_LEN);
+	strcpy(genlcmd->ioc_attr.driver_attr.fw_ver,
+	       genlcmd->ioc_attr.adapter_attr.fw_ver);
+	strcpy(genlcmd->ioc_attr.driver_attr.bios_ver,
+	       genlcmd->ioc_attr.adapter_attr.optrom_ver);
+	bnad_get_pci_attr(bnad, &genlcmd->ioc_attr.pci_attr);
+	genlcmd->status = BFA_STATUS_OK;
+
+	err = bnad_genl_ioc_attr_rsp(info, genlcmd,
+		sizeof(struct bnad_genl_ioc_attr), BNAD_GENL_CMD_IOC_ATTR);
+
+	return err;
+}
+
+/* BNAD generic netlink ops */
+static struct genl_ops bnad_genl_ops[] = {
+	{
+	.cmd = BNAD_GENL_CMD_IOC_INFO,
+	.flags = 0,
+	.policy = bnad_genl_policy,
+	.doit = bnad_genl_ioc_get_info,
+	.dumpit = NULL,
+	},
+	{
+	.cmd = BNAD_GENL_CMD_IOC_ATTR,
+	.flags = 0,
+	.policy = bnad_genl_policy,
+	.doit = bnad_genl_ioc_get_attr,
+	.dumpit = NULL,
+	},
+};
+
+/* Initialize generic netlink for BNAD */
+int
+bnad_genl_init(void)
+{
+	int i, err = 0;
+
+	/* Register family */
+	err = genl_register_family(&bnad_genl_family);
+	if (err) {
+		pr_warn("bna: failed to register with Netlink\n");
+		return err;
+	}
+	pr_info("bna: registered with Netlink\n");
+
+	/* Register ops */
+	for (i = 0; i < sizeof(bnad_genl_ops) / sizeof(bnad_genl_ops[0]); i++) {
+		err = genl_register_ops(&bnad_genl_family, &bnad_genl_ops[i]);
+		if (err)
+			pr_warn("bna: failed to register netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+		else
+			pr_info("bna: registered netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+	}
+
+	return err;
+}
+
+/* Uninitialize generic netlink for BNAD */
+int
+bnad_genl_uninit(void)
+{
+	int i, err = 0;
+
+	for (i = 0; i < sizeof(bnad_genl_ops) / sizeof(bnad_genl_ops[0]); i++) {
+		err = genl_unregister_ops(&bnad_genl_family, &bnad_genl_ops[i]);
+		if (err)
+			pr_warn("bna: failed to unregister netlink op %u)\n",
+				bnad_genl_ops[i].cmd);
+		else
+			pr_info("bna: unregistered netlink op %u\n",
+				bnad_genl_ops[i].cmd);
+	}
+
+	err = genl_unregister_family(&bnad_genl_family);
+	if (err)
+		pr_warn("bna: failed to unregister with Netlink\n");
+	else
+		pr_info("bna: unregistered with Netlink\n");
+
+	return err;
+}
diff --git a/drivers/net/bna/bnad_genl.h b/drivers/net/bna/bnad_genl.h
new file mode 100644
index 0000000..d469fe5
--- /dev/null
+++ b/drivers/net/bna/bnad_genl.h
@@ -0,0 +1,87 @@
+/*
+ * Linux network driver for Brocade Converged Network Adapter.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License (GPL) Version 2 as
+ * published by the Free Software Foundation
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ */
+/*
+ * Copyright (c) 2005-2011 Brocade Communications Systems, Inc.
+ * All rights reserved
+ * www.brocade.com
+ */
+#ifndef __BNAD_GENL_H__
+#define __BNAD_GENL_H__
+
+#include <linux/device.h>
+#include <linux/netdevice.h>
+#include <linux/gfp.h>
+#include <linux/skbuff.h>
+#include <linux/ethtool.h>
+#include <linux/rtnetlink.h>
+#include <net/genetlink.h>
+
+#include "cna.h"
+
+/* Attributes */
+enum {
+	BNAD_GENL_ATTR_UNSPEC,
+	BNAD_GENL_ATTR_IOC_INFO,
+	BNAD_GENL_ATTR_IOC_ATTR,
+	__BNAD_GENL_ATTR_MAX
+};
+
+/* Effectively a single attribute */
+#define BNAD_GENL_ATTR_MAX (__BNAD_GENL_ATTR_MAX - 1)
+
+enum {
+	BNAD_GENL_VERSION = 1,
+};
+
+/* Commands/Responses */
+enum {
+	BNAD_GENL_CMD_UNSPEC,
+	BNAD_GENL_CMD_IOC_INFO,
+	BNAD_GENL_CMD_IOC_ATTR,
+	__BNAD_GENL_CMD_MAX,
+};
+
+struct bnad_genl_ioc_info {
+	int		status;
+	u16		bnad_num;
+	u16		rsvd;
+	char		serialnum[64];
+	char		hwpath[BFA_STRING_32];
+	char		adapter_hwpath[BFA_STRING_32];
+	char		guid[BFA_ADAPTER_SYM_NAME_LEN*2];
+	char		name[BFA_ADAPTER_SYM_NAME_LEN];
+	char		port_name[BFA_ADAPTER_SYM_NAME_LEN];
+	char		eth_name[BFA_ADAPTER_SYM_NAME_LEN];
+	u64		rsvd1[4];
+	mac_t		mac;
+	mac_t		factory_mac;	/* Factory mac address */
+	mac_t		current_mac;	/* Currently assigned mac address */
+	enum bfa_ioc_type     ioc_type;
+	u16		pvid;		/* Port vlan id */
+	u16		rsvd2;
+	u32		host;
+	u32		bandwidth;	/* For PF support */
+	u32		rsvd3;
+};
+
+struct bnad_genl_ioc_attr {
+	int		status;
+	u16		bnad_num;
+	u16		rsvd;
+	struct bfa_ioc_attr  ioc_attr;
+};
+
+extern int bnad_genl_init(void);
+extern int bnad_genl_uninit(void);
+
+#endif /* __BNAD_GENL_H__ */
-- 
1.7.1


^ permalink raw reply related

* Re: [PATCH] Check the value of doi before referencing it
From: David Miller @ 2011-05-18  5:04 UTC (permalink / raw)
  To: huzaifas; +Cc: netdev, kaber, yoshfuji, jmorris, pekkas, kuznet
In-Reply-To: <1305692980-4730-1-git-send-email-huzaifas@redhat.com>

From: Huzaifa Sidhpurwala <huzaifas@redhat.com>
Date: Wed, 18 May 2011 09:59:40 +0530

> Value of doi is not checked before referencing it.
> Though this does not cause any null pointer dereference since
> all the callers of cipso_v4_doi_add check the value of doi
> before calling the function, but it would be a good programming
> practice to do so anyways :)
> 
> Signed-off-by: Huzaifa Sidhpurwala <huzaifas@redhat.com>

I don't think we should fix bugs that do not exist.

^ permalink raw reply

* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-18  5:14 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <20110517212827.GC7589@redhat.com>

On Wed, 2011-05-18 at 00:28 +0300, Michael S. Tsirkin wrote:
> On Tue, May 17, 2011 at 01:50:19PM -0700, Shirley Ma wrote:
> > Resubmit the patch with most update. This patch passed some
> > live-migration test against RHEL6.2. I will run more stress test w/i
> > live migration.
> > 
> > Signed-off-by: Shirley Ma <xma@us.ibm.com>
> 
> Cool. cleanup path needs a fix - are you use you can
> not use kobj or some other existing refcounting?
I will look at this.

> Is perf regressiion caused by tx ring overrun gone now?
Nope.

> I added some comments about how we might be aqble
> to complete requests out of order but it's not a must.
Ok.

> > ---
> > 
> >  drivers/vhost/net.c   |   37 +++++++++++++++++++++++++++++++-
> >  drivers/vhost/vhost.c |   55
> ++++++++++++++++++++++++++++++++++++++++++++++++-
> >  drivers/vhost/vhost.h |   12 ++++++++++
> >  3 files changed, 101 insertions(+), 3 deletions(-)
> > 
> > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > index 2f7c76a..6bd6e28 100644
> > --- a/drivers/vhost/net.c
> > +++ b/drivers/vhost/net.c
> > @@ -32,6 +32,9 @@
> >   * Using this limit prevents one virtqueue from starving others. */
> >  #define VHOST_NET_WEIGHT 0x80000
> >  
> > +/* MAX number of TX used buffers for outstanding zerocopy */
> > +#define VHOST_MAX_ZEROCOPY_PEND 128 
> > +
> >  enum {
> >       VHOST_NET_VQ_RX = 0,
> >       VHOST_NET_VQ_TX = 1,
> > @@ -129,6 +132,7 @@ static void handle_tx(struct vhost_net *net)
> >       int err, wmem;
> >       size_t hdr_size;
> >       struct socket *sock;
> > +     struct skb_ubuf_info pend;
> >  
> >       /* TODO: check that we are running from vhost_worker? */
> >       sock = rcu_dereference_check(vq->private_data, 1);
> > @@ -151,6 +155,10 @@ static void handle_tx(struct vhost_net *net)
> >       hdr_size = vq->vhost_hlen;
> >  
> >       for (;;) {
> > +             /* Release DMAs done buffers first */
> > +             if (atomic_read(&vq->refcnt) >
> VHOST_MAX_ZEROCOPY_PEND)
> > +                     vhost_zerocopy_signal_used(vq, false);
> > +
> >               head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> >                                        ARRAY_SIZE(vq->iov),
> >                                        &out, &in,
> > @@ -166,6 +174,13 @@ static void handle_tx(struct vhost_net *net)
> >                               set_bit(SOCK_ASYNC_NOSPACE,
> &sock->flags);
> >                               break;
> >                       }
> > +                     /* If more outstanding DMAs, queue the work */
> > +                     if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
> > +                         (atomic_read(&vq->refcnt) >
> VHOST_MAX_ZEROCOPY_PEND)) {
> > +                             tx_poll_start(net, sock);
> > +                             set_bit(SOCK_ASYNC_NOSPACE,
> &sock->flags);
> > +                             break;
> > +                     }
> >                       if (unlikely(vhost_enable_notify(vq))) {
> >                               vhost_disable_notify(vq);
> >                               continue;
> > @@ -188,17 +203,35 @@ static void handle_tx(struct vhost_net *net)
> >                              iov_length(vq->hdr, s), hdr_size);
> >                       break;
> >               }
> > +             /* use msg_control to pass vhost zerocopy ubuf info to
> skb */
> > +             if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> > +                     vq->heads[vq->upend_idx].id = head;
> > +                     if (len <= 128)
> 
> I thought we have a constant for that?

Yes, fixed it already. I might change it to PAGE_SIZE for now since the
small message sizes regression issue.

> > +                             vq->heads[vq->upend_idx].len =
> VHOST_DMA_DONE_LEN;
> > +                     else {
> > +                             vq->heads[vq->upend_idx].len = len;
> > +                             pend.callback =
> vhost_zerocopy_callback;
> > +                             pend.arg = vq;
> > +                             pend.desc = vq->upend_idx;
> > +                             msg.msg_control = &pend;
> > +                             msg.msg_controllen = sizeof(pend);
> > +                     }
> > +                     atomic_inc(&vq->refcnt);
> > +                     vq->upend_idx = (vq->upend_idx + 1) %
> UIO_MAXIOV;
> 
> Ok, so we deal with a cyclic ring apparently? What guarantees we don't
> overrun it?

VHOST_MAX_PEND (which is 128) should prevent it from overrun normally.
In the case of none DMAs can be done, the maximum entries are the ring
size, which is much smaller or equal to UIO_MAXIOV for current
implementation. 

Maybe I should add some condition check if it is overrun then exits?

> 
> > +             }
> >               /* TODO: Check specific error and bomb out unless
> ENOBUFS? */
> >               err = sock->ops->sendmsg(NULL, sock, &msg, len);
> >               if (unlikely(err < 0)) {
> > -                     vhost_discard_vq_desc(vq, 1);
> > +                     if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > +                             vhost_discard_vq_desc(vq, 1);
> 
> How are errors handled with zerocopy?

I thought about it before: if error after macvtap skb is allocated, then
that skb has a callback which will set DMA done for add_used, if the
error before macvtap skb is allocated, then it can reverse as copy case.
To avoid the complexity check, I decided to not handle it.

> 
> >                       tx_poll_start(net, sock);
> >                       break;
> >               }
> >               if (err != len)
> >                       pr_debug("Truncated TX packet: "
> >                                " len %d != %zd\n", err, len);
> > -             vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > +             if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > +                     vhost_add_used_and_signal(&net->dev, vq, head,
> 0);
> >               total_len += len;
> >               if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> >                       vhost_poll_queue(&vq->poll);
> > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > index 2ab2912..ce799d6 100644
> > --- a/drivers/vhost/vhost.c
> > +++ b/drivers/vhost/vhost.c
> > @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev
> *dev,
> >       vq->call_ctx = NULL;
> >       vq->call = NULL;
> >       vq->log_ctx = NULL;
> > +     vq->upend_idx = 0;
> > +     vq->done_idx = 0;
> > +     atomic_set(&vq->refcnt, 0);
> >  }
> >  
> >  static int vhost_worker(void *data)
> > @@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct
> vhost_dev *dev)
> >                                              UIO_MAXIOV,
> GFP_KERNEL);
> >               dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log *
> UIO_MAXIOV,
> >                                         GFP_KERNEL);
> > -             dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads
> *
> > +             dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads
> *
> >                                           UIO_MAXIOV, GFP_KERNEL);
> 
> Which fields need to be initialized actually?

Nope, already fixed it with kmalloc.

> >  
> >               if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> > @@ -385,6 +388,38 @@ long vhost_dev_reset_owner(struct vhost_dev
> *dev)
> >       return 0;
> >  }
> >  
> > +/* 
> > +     comments
> > +*/
> 
> Hmm.

Fixed it in previous patch already.

> > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool
> shutdown)
> > +{
> > +     int i, j = 0;
> > +
> > +     i = vq->done_idx;
> > +     while (i != vq->upend_idx) {
> 
> A for loop might be clearer.
Ok.

> > +             if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) ||
> shutdown) {
> 
> On shutdown, we signal all buffers used to the guest?
> Why?

We signal all outstand DMAs in the case of driver has some DMA issue to
prevent us from shutting down. I am afraid vhost cleanup could wait
forever.

> > +                     /* reset len = 0 */
> 
> comment not very helpful.
> Could you explain what this does instead?
> Or better use some constant instead of 0 ...

Fixed it already.

> > +                     vq->heads[i].len = 0;
> > +                     i = (i + 1) % UIO_MAXIOV;
> > +                     ++j;
> > +             } else
> > +                     break;
> 
> Hmm so if the 1st entry does not complete, you do not signal anything?

No used buffers to guest, should we signal still?


> > +     }
> 
> Looking at this loop, done idx is the consumer and used idx
> is the producer, right?

Yes.

> > +     if (j) {
> > +             /* comments */
> 
> Yes?
> 
> > +             if (i > vq->done_idx)
> > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> j);
> > +             else {
> > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> > +                                      UIO_MAXIOV - vq->done_idx);
> > +                     vhost_add_used_n(vq, vq->heads, i);
> > +             }
> > +             vq->done_idx = i;
> > +             vhost_signal(vq->dev, vq);
> > +             atomic_sub(j, &vq->refcnt);
> 
> Code will likely be simpler if you call vhost_add_used once for
> each head in the first loop. Possibly add_used_signal might be
> a good idea too.

Ok.

> > +     }
> > +}
> > +
> >  /* Caller should have device mutex */
> >  void vhost_dev_cleanup(struct vhost_dev *dev)
> >  {
> > @@ -395,6 +430,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> >                       vhost_poll_stop(&dev->vqs[i].poll);
> >                       vhost_poll_flush(&dev->vqs[i].poll);
> >               }
> > +             /* wait for all lower device DMAs done, then notify
> guest */
> > +             if (atomic_read(&dev->vqs[i].refcnt)) {
> > +                     msleep(1000);
> > +                     vhost_zerocopy_signal_used(&dev->vqs[i],
> true);
> > +             }
> 
> This needs to be fixed somehow. Use a completion object and wait
> on it?

Worried about what if the driver has some DMAs issue, which would
prevent vhost from shutting down.

> >               if (dev->vqs[i].error_ctx)
> >                       eventfd_ctx_put(dev->vqs[i].error_ctx);
> >               if (dev->vqs[i].error)
> > @@ -603,6 +643,10 @@ static long vhost_set_vring(struct vhost_dev
> *d, int ioctl, void __user *argp)
> >  
> >       mutex_lock(&vq->mutex);
> >  
> > +     /* force all lower device DMAs done */
> > +     if (atomic_read(&vq->refcnt)) 
> > +             vhost_zerocopy_signal_used(vq, true);
> > +
> >       switch (ioctl) {
> >       case VHOST_SET_VRING_NUM:
> >               /* Resizing ring with an active backend?
> > @@ -1416,3 +1460,12 @@ void vhost_disable_notify(struct
> vhost_virtqueue *vq)
> >               vq_err(vq, "Failed to enable notification at %p: %d
> \n",
> >                      &vq->used->flags, r);
> >  }
> > +
> > +void vhost_zerocopy_callback(struct sk_buff *skb)
> > +{
> > +     int idx = skb_shinfo(skb)->ubuf.desc;
> > +     struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> > +
> > +     /* set len = 1 to mark this desc buffers done DMA */
> 
> this comment can now go.

Yes, it's gone already.

> > +     vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> > +}
> > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > index b3363ae..8e3ecc7 100644
> > --- a/drivers/vhost/vhost.h
> > +++ b/drivers/vhost/vhost.h
> > @@ -13,6 +13,10 @@
> >  #include <linux/virtio_ring.h>
> >  #include <asm/atomic.h>
> >  
> > +/* This is for zerocopy, used buffer len is set to 1 when lower
> device DMA
> > + * done */
> > +#define VHOST_DMA_DONE_LEN   1
> > +
> >  struct vhost_device;
> >  
> >  struct vhost_work;
> > @@ -108,6 +112,12 @@ struct vhost_virtqueue {
> >       /* Log write descriptors */
> >       void __user *log_base;
> >       struct vhost_log *log;
> > +     /* vhost zerocopy support */
> > +     atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> 
> future enhancement idea: this is used apparently under vq lock
> so no need for an atomic?

It is also used in skb vhost zerocopy callback.

> > +     /* copy of avail idx to monitor outstanding DMA zerocopy
> buffers */
> 
> looking at code upend_idx seems to be calculated independently
> of guest avail idx - could you clarify pls?

Yes, you are right. Should change it to: upend_idx is used to track
vring ids for outstanding zero-copy DMA buffers?

> > +     int upend_idx;
> > +     /* copy of used idx to monintor DMA done zerocopy buffers */
> 
> monitor

Ok.

> > +     int done_idx;
> 
> 
> I think in reality these are just producer and consumer
> in the head structure which for zero copy is used
Yes.

> 
> >  };
> >  
> >  struct vhost_dev {
> > @@ -154,6 +164,8 @@ bool vhost_enable_notify(struct vhost_virtqueue
> *);
> >  
> >  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log
> *log,
> >                   unsigned int log_num, u64 len);
> > +void vhost_zerocopy_callback(struct sk_buff *skb);
> > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool
> shutdown);
> >  
> >  #define vq_err(vq, fmt, ...) do {
> \
> >               pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> > 

^ permalink raw reply

* [PATCH v3] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-18  5:22 UTC (permalink / raw)
  To: netdev; +Cc: Micha Nelissen, David Miller
In-Reply-To: <1305405386-25187-1-git-send-email-micha@neli.hopto.org>

David, forgive my clumsiness, the previous patch did not compile. This
one applies against 2.6.38 and does compile.

Currently the ip auto configuration has a hardcoded delay of 1 second.
When (ethernet) link takes longer to come up (e.g. more than 3 seconds),
nfs root may not be found.

Remove the hardcoded delay, and wait for carrier on at least one network
device.

Signed-off-by: Micha Nelissen <micha@neli.hopto.org>
Cc: David Miller <davem@davemloft.net>
---
 net/ipv4/ipconfig.c |   35 ++++++++++++++++++++++-------------
 1 files changed, 22 insertions(+), 13 deletions(-)

diff --git a/net/ipv4/ipconfig.c b/net/ipv4/ipconfig.c
index 2b09775..4225f3f 100644
--- a/net/ipv4/ipconfig.c
+++ b/net/ipv4/ipconfig.c
@@ -87,8 +87,8 @@
 #endif
 
 /* Define the friendly delay before and after opening net devices */
-#define CONF_PRE_OPEN		500	/* Before opening: 1/2 second */
-#define CONF_POST_OPEN		1	/* After opening: 1 second */
+#define CONF_POST_OPEN		10	/* After opening: 10 msecs */
+#define CONF_CARRIER_TIMEOUT	120000	/* Wait for carrier timeout */
 
 /* Define the timeout for waiting for a DHCP/BOOTP/RARP reply */
 #define CONF_OPEN_RETRIES 	2	/* (Re)open devices twice */
@@ -188,14 +188,14 @@ struct ic_device {
 static struct ic_device *ic_first_dev __initdata = NULL;/* List of open device */
 static struct net_device *ic_dev __initdata = NULL;	/* Selected device */
 
-static bool __init ic_device_match(struct net_device *dev)
+static bool __init ic_is_init_dev(struct net_device *dev)
 {
-	if (user_dev_name[0] ? !strcmp(dev->name, user_dev_name) :
+	if (dev->flags & IFF_LOOPBACK)
+		return 0;
+	return user_dev_name[0] ? !strcmp(dev->name, user_dev_name) :
 	    (!(dev->flags & IFF_LOOPBACK) &&
 	     (dev->flags & (IFF_POINTOPOINT|IFF_BROADCAST)) &&
-	     strncmp(dev->name, "dummy", 5)))
-		return true;
-	return false;
+	     strncmp(dev->name, "dummy", 5));
 }
 
 static int __init ic_open_devs(void)
@@ -203,6 +203,7 @@ static int __init ic_open_devs(void)
 	struct ic_device *d, **last;
 	struct net_device *dev;
 	unsigned short oflags;
+	unsigned long start;
 
 	last = &ic_first_dev;
 	rtnl_lock();
@@ -216,9 +217,7 @@ static int __init ic_open_devs(void)
 	}
 
 	for_each_netdev(&init_net, dev) {
-		if (dev->flags & IFF_LOOPBACK)
-			continue;
-		if (ic_device_match(dev)) {
+		if (ic_is_init_dev(dev)) {
 			int able = 0;
 			if (dev->mtu >= 364)
 				able |= IC_BOOTP;
@@ -252,6 +251,17 @@ static int __init ic_open_devs(void)
 				dev->name, able, d->xid));
 		}
 	}
+
+	/* wait for a carrier on at least one device */
+	start = jiffies;
+	while (jiffies - start < msecs_to_jiffies(CONF_CARRIER_TIMEOUT)) {
+		for_each_netdev(&init_net, dev)
+			if (ic_is_init_dev(dev) && netif_carrier_ok(dev))
+				goto have_carrier;
+
+		msleep(1);
+	}
+have_carrier:
 	rtnl_unlock();
 
 	*last = NULL;
@@ -1324,14 +1334,13 @@ static int __init wait_for_devices(void)
 {
 	int i;
 
-	msleep(CONF_PRE_OPEN);
 	for (i = 0; i < DEVICE_WAIT_MAX; i++) {
 		struct net_device *dev;
 		int found = 0;
 
 		rtnl_lock();
 		for_each_netdev(&init_net, dev) {
-			if (ic_device_match(dev)) {
+			if (ic_is_init_dev(dev)) {
 				found = 1;
 				break;
 			}
@@ -1378,7 +1387,7 @@ static int __init ip_auto_config(void)
 		return err;
 
 	/* Give drivers a chance to settle */
-	ssleep(CONF_POST_OPEN);
+	msleep(CONF_POST_OPEN);
 
 	/*
 	 * If the config information is insufficient (e.g., our IP address or
-- 
1.7.4.4


^ permalink raw reply related

* Re: [PATCH 09/18] virtio: use avail_event index
From: Michael S. Tsirkin @ 2011-05-18  5:43 UTC (permalink / raw)
  To: Rusty Russell
  Cc: Krishna Kumar, Carsten Otte, lguest-uLR06cmDAlY/bJ5BZ2RsiQ,
	Shirley Ma, kvm-u79uwXL29TY76Z2rM5mHXA,
	linux-s390-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA,
	habanero-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8, Heiko Carstens,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	virtualization-cunTk1MwBs9QetFLy7KEm3xJsTq8ys+cHZ5vskTnxNA,
	steved-r/Jw6+rmf7HQT0dZR+AlfA, Christian Borntraeger,
	Tom Lendacky, Martin Schwidefsky, linux390-tA70FqPdS9bQT0dZR+AlfA
In-Reply-To: <87tycsn9lt.fsf-8n+1lVoiYb80n/F98K4Iww@public.gmane.org>

On Wed, May 18, 2011 at 09:49:42AM +0930, Rusty Russell wrote:
> On Tue, 17 May 2011 09:10:31 +0300, "Michael S. Tsirkin" <mst-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org> wrote:
> > Well one can imagine a driver doing:
> > 
> > 	while (virtqueue_get_buf()) {
> > 		virtqueue_add_buf()
> > 	}
> > 	virtqueue_kick()
> > 
> > which looks sensible (batch kicks) but might
> > process any number of bufs between kicks.
> 
> No, we currently only expose the buffers in the kick, so it can only
> fill the ring doing that.
> 
> We could change that (and maybe that's worth looking at)...

Yes, I think we should - this way host and guest can process
data in parallel without a kick.

My patchset included that simply because it's one index
less to be confused about.


> > If we look at drivers closely enough, I think none
> > of them do the equivalent of the above, but not 100% sure.
> 
> I'm pretty sure we don't have this kind of 'echo' driver yet.  Drivers
> tend to take OS requests and queue them.  The only one which does
> anything even partially sophisticated is the net driver...
> 
> Thanks,
> Rusty.

^ permalink raw reply

* Re: [PATCH V6 4/6 net-next] vhost: vhost TX zero-copy support
From: Shirley Ma @ 2011-05-18  6:16 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305695674.10756.93.camel@localhost.localdomain>

Hello Michael,

Here is the update the patch based on all of your review comments except
the completion/wait for cleanup since I am worried about outstanding
DMAs would prevent vhost from shutting down. I am sending out this for
your review, and test it out later.

For error handling, I update macvtap.c so we can discard the desc even
in zero-copy case.

Signed-off-by: Shirley Ma <xma@us.ibm.com>
---
 drivers/vhost/net.c   |   42 +++++++++++++++++++++++++++++++++++++++++-
 drivers/vhost/vhost.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/vhost/vhost.h |   13 +++++++++++++
 3 files changed, 103 insertions(+), 1 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 2f7c76a..e87a1f8 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -32,6 +32,10 @@
  * Using this limit prevents one virtqueue from starving others. */
 #define VHOST_NET_WEIGHT 0x80000
 
+/* MAX number of TX used buffers for outstanding zerocopy */
+#define VHOST_MAX_PEND 128
+#define VHOST_GOODCOPY_LEN PAGE_SIZE
+
 enum {
 	VHOST_NET_VQ_RX = 0,
 	VHOST_NET_VQ_TX = 1,
@@ -129,6 +133,7 @@ static void handle_tx(struct vhost_net *net)
 	int err, wmem;
 	size_t hdr_size;
 	struct socket *sock;
+	struct skb_ubuf_info pend;
 
 	/* TODO: check that we are running from vhost_worker? */
 	sock = rcu_dereference_check(vq->private_data, 1);
@@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net)
 	hdr_size = vq->vhost_hlen;
 
 	for (;;) {
+		/* Release DMAs done buffers first */
+		if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND)
+			vhost_zerocopy_signal_used(vq, false);
+
 		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
 					 ARRAY_SIZE(vq->iov),
 					 &out, &in,
@@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net)
 				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
 				break;
 			}
+			/* If more outstanding DMAs, queue the work */
+			if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) {
+				tx_poll_start(net, sock);
+				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				break;
+			}
 			if (unlikely(vhost_enable_notify(vq))) {
 				vhost_disable_notify(vq);
 				continue;
@@ -188,6 +203,30 @@ static void handle_tx(struct vhost_net *net)
 			       iov_length(vq->hdr, s), hdr_size);
 			break;
 		}
+		/* use msg_control to pass vhost zerocopy ubuf info to skb */
+		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
+			vq->heads[vq->upend_idx].id = head;
+			if (len < VHOST_GOODCOPY_LEN)
+				/* copy don't need to wait for DMA done */
+				vq->heads[vq->upend_idx].len =
+							VHOST_DMA_DONE_LEN;
+			else {
+				vq->heads[vq->upend_idx].len = len;
+				pend.callback = vhost_zerocopy_callback;
+				pend.arg = vq;
+				pend.desc = vq->upend_idx;
+				msg.msg_control = &pend;
+				msg.msg_controllen = sizeof(pend);
+			}
+			atomic_inc(&vq->refcnt);
+			vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
+			/* if upend_idx is full, then wait for free more */
+			if (vq->upend_idx == vq->done_idx) {
+				tx_poll_start(net, sock);
+				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
+				break;
+			}
+		}
 		/* TODO: Check specific error and bomb out unless ENOBUFS? */
 		err = sock->ops->sendmsg(NULL, sock, &msg, len);
 		if (unlikely(err < 0)) {
@@ -198,7 +237,8 @@ static void handle_tx(struct vhost_net *net)
 		if (err != len)
 			pr_debug("Truncated TX packet: "
 				 " len %d != %zd\n", err, len);
-		vhost_add_used_and_signal(&net->dev, vq, head, 0);
+		if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
+			vhost_add_used_and_signal(&net->dev, vq, head, 0);
 		total_len += len;
 		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
 			vhost_poll_queue(&vq->poll);
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 2ab2912..f4c2730 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
 	vq->call_ctx = NULL;
 	vq->call = NULL;
 	vq->log_ctx = NULL;
+	vq->upend_idx = 0;
+	vq->done_idx = 0;
+	atomic_set(&vq->refcnt, 0);
 }
 
 static int vhost_worker(void *data)
@@ -385,16 +388,49 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
 	return 0;
 }
 
+/* In case of DMA done not in order in lower device driver for some reason.
+ * upend_idx is used to track end of used idx, done_idx is used to track head
+ * of used idx. Once lower device DMA done contiguously, we will signal KVM
+ * guest used idx.
+ */
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
+{
+	int i, j = 0;
+
+	for (i = vq->done_idx; i != vq->upend_idx; i = i++ % UIO_MAXIOV) {
+		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
+			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
+			vhost_add_used_and_signal(vq->dev, vq,
+						  vq->heads[i].id, 0);
+			++j;
+		} else
+			break;
+	}
+	if (j) {
+		vq->done_idx = i;
+		atomic_sub(j, &vq->refcnt);
+	}
+}
+
 /* Caller should have device mutex */
 void vhost_dev_cleanup(struct vhost_dev *dev)
 {
 	int i;
+	unsigned long begin = jiffies;
 
 	for (i = 0; i < dev->nvqs; ++i) {
 		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
 			vhost_poll_stop(&dev->vqs[i].poll);
 			vhost_poll_flush(&dev->vqs[i].poll);
 		}
+		/* Wait for all lower device DMAs done, then notify virtio_net
+		 * or just notify it without waiting for all DMA done here ?
+		 * in case of DMAs never done for some reason */
+		if (atomic_read(&dev->vqs[i].refcnt)) {
+			/* how long should we wait ? */
+			msleep(1000);
+			vhost_zerocopy_signal_used(&dev->vqs[i], true);
+		}
 		if (dev->vqs[i].error_ctx)
 			eventfd_ctx_put(dev->vqs[i].error_ctx);
 		if (dev->vqs[i].error)
@@ -603,6 +639,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
 
 	mutex_lock(&vq->mutex);
 
+	/* clean up lower device outstanding DMAs, before setting ring */
+	if (atomic_read(&vq->refcnt))
+		vhost_zerocopy_signal_used(vq, true);
+
 	switch (ioctl) {
 	case VHOST_SET_VRING_NUM:
 		/* Resizing ring with an active backend?
@@ -1416,3 +1456,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
 		vq_err(vq, "Failed to enable notification at %p: %d\n",
 		       &vq->used->flags, r);
 }
+
+void vhost_zerocopy_callback(struct sk_buff *skb)
+{
+	int idx = skb_shinfo(skb)->ubuf.desc;
+	struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
+
+	/* set len = 1 to mark this desc buffers done DMA */
+	vq->heads[idx].len = VHOST_DMA_DONE_LEN;
+}
diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
index b3363ae..d0e7ac6 100644
--- a/drivers/vhost/vhost.h
+++ b/drivers/vhost/vhost.h
@@ -13,6 +13,11 @@
 #include <linux/virtio_ring.h>
 #include <asm/atomic.h>
 
+/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
+ * done */
+#define VHOST_DMA_DONE_LEN	1
+#define VHOST_DMA_CLEAR_LEN	0
+
 struct vhost_device;
 
 struct vhost_work;
@@ -108,6 +113,12 @@ struct vhost_virtqueue {
 	/* Log write descriptors */
 	void __user *log_base;
 	struct vhost_log *log;
+	/* vhost zerocopy support */
+	atomic_t refcnt; /* num of outstanding zerocopy DMAs */
+	/* last used idx for outstanding DMA zerocopy buffers */
+	int upend_idx;
+	/* first used idx for DMA done zerocopy buffers */
+	int done_idx;
 };
 
 struct vhost_dev {
@@ -154,6 +165,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
 
 int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
 		    unsigned int log_num, u64 len);
+void vhost_zerocopy_callback(struct sk_buff *skb);
+void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
 
 #define vq_err(vq, fmt, ...) do {                                  \
 		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \




^ permalink raw reply related

* Re: [PATCH v3] ipconfig wait for carrier
From: David Miller @ 2011-05-18  6:07 UTC (permalink / raw)
  To: micha; +Cc: netdev
In-Reply-To: <1305696161-18277-1-git-send-email-micha@neli.hopto.org>

From: Micha Nelissen <micha@neli.hopto.org>
Date: Wed, 18 May 2011 07:22:41 +0200

> David, forgive my clumsiness, the previous patch did not compile. This
> one applies against 2.6.38 and does compile.

Please send patches against the current code, not some arbitrary older
source tree.

Thanks.

^ permalink raw reply

* Re: [PATCH net-2.6] net: add skb_dst_force() in sock_queue_err_skb()
From: David Miller @ 2011-05-18  6:21 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, baryluk, shemminger
In-Reply-To: <1305673846.6741.35.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 May 2011 01:10:46 +0200

> Commit 7fee226ad239 (add a noref bit on skb dst) forgot to use
> skb_dst_force() on packets queued in sk_error_queue
> 
> This triggers following warning, for applications using IP_CMSG_PKTINFO
> receiving one error status
 ...
> Close bug https://bugzilla.kernel.org/show_bug.cgi?id=34622
> 
> Reported-by: Witold Baryluk <baryluk@smp.if.uj.edu.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Stephen Hemminger <shemminger@vyatta.com>

Applied, thanks.

^ permalink raw reply

* Re: [PATCH v3] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-18  6:32 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110518.020716.447947355233874982.davem@davemloft.net>

Op 2011-05-18 8:07, David Miller schreef:
> From: Micha Nelissen<micha@neli.hopto.org>
> Date: Wed, 18 May 2011 07:22:41 +0200
>
>> David, forgive my clumsiness, the previous patch did not compile. This
>> one applies against 2.6.38 and does compile.
>
> Please send patches against the current code, not some arbitrary older
> source tree.

I'm confused. Against which tree/commit do you want it then?

Micha

^ permalink raw reply

* Re: [PATCH v3] ipconfig wait for carrier
From: David Miller @ 2011-05-18  6:37 UTC (permalink / raw)
  To: micha; +Cc: netdev
In-Reply-To: <4DD36803.1020001@neli.hopto.org>

From: Micha Nelissen <micha@neli.hopto.org>
Date: Wed, 18 May 2011 08:32:35 +0200

> Op 2011-05-18 8:07, David Miller schreef:
>> From: Micha Nelissen<micha@neli.hopto.org>
>> Date: Wed, 18 May 2011 07:22:41 +0200
>>
>>> David, forgive my clumsiness, the previous patch did not compile. This
>>> one applies against 2.6.38 and does compile.
>>
>> Please send patches against the current code, not some arbitrary older
>> source tree.
> 
> I'm confused. Against which tree/commit do you want it then?

Linus's current tree would be fine as would:

	git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.git

^ permalink raw reply

* Re: small RPS cache for fragments?
From: David Miller @ 2011-05-18  6:37 UTC (permalink / raw)
  To: xiaosuo; +Cc: therbert, netdev
In-Reply-To: <BANLkTimmvKn2e5Yrkd5TtktyxTWCRxqinw@mail.gmail.com>

From: Changli Gao <xiaosuo@gmail.com>
Date: Wed, 18 May 2011 07:59:05 +0800

> On Wed, May 18, 2011 at 4:49 AM, David Miller <davem@davemloft.net> wrote:
>>
>> Actually, I think it won't work.  Even Linux emits fragments last to
>> first, so we won't see the UDP header until the last packet where it's
>> no longer useful.
>>
> 
> No. Linux emits fragments first to last now. You should check the
> current code. :)

I forgot that we rearranged this, thanks :-)

So maybe the original idea can indeed work.

^ permalink raw reply

* Re: [PATCH v3] ipconfig wait for carrier
From: Micha Nelissen @ 2011-05-18  6:59 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <20110518.023725.441358922110721042.davem@davemloft.net>

Op 2011-05-18 8:37, David Miller schreef:
> From: Micha Nelissen<micha@neli.hopto.org>
> Date: Wed, 18 May 2011 08:32:35 +0200
>
>> I'm confused. Against which tree/commit do you want it then?
>
> Linus's current tree would be fine as would:
>
> 	git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6.git

Ok I see, thanks. The patch should apply just fine to your tree, there 
is only a spelling change since 2.6.38 which does not conflict.

Micha

^ permalink raw reply

* [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18  7:01 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1717 bytes --]

During the sctp_close() call, we do not use rcu primitives to
destroy the address list attached to the endpoint.  At the same
time, we do the removal of addresses from this list before
attempting to remove the socket from the port hash

As a result, it is possible for another process to find the socket
in the port hash that is in the process of being closed.  It then
proceeds to traverse the address list to find the conflict, only
to have that address list suddenly disappear without rcu() critical
section.

This can result in a kernel crash with general protection fault or
kernel NULL pointer dereference.

Fix issue by closing address list removal inside RCU critical
section.

Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>

---
 bind_addr.c |   12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..19d1329 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -155,8 +155,16 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
 /* Dispose of an SCTP_bind_addr structure  */
 void sctp_bind_addr_free(struct sctp_bind_addr *bp)
 {
-       /* Empty the bind address list. */
-       sctp_bind_addr_clean(bp);
+       struct sctp_sockaddr_entry *addr;
+
+       /* Empty the bind address list inside RCU section. */
+       rcu_read_lock();
+       list_for_each_entry_rcu(addr, &bp->address_list, list) {
+               list_del_rcu(&addr->list);
+               call_rcu(&addr->rcu, sctp_local_addr_free);
+               SCTP_DBG_OBJCNT_DEC(addr);
+       }
+       rcu_read_unlock();

        if (bp->malloced) {
                kfree(bp);

[-- Attachment #2: sctp_fix_close_vs_conflict_race_in_bind_addr.patch --]
[-- Type: application/octet-stream, Size: 732 bytes --]

diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
index faf71d1..19d1329 100644
--- a/net/sctp/bind_addr.c
+++ b/net/sctp/bind_addr.c
@@ -155,8 +155,16 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
 /* Dispose of an SCTP_bind_addr structure  */
 void sctp_bind_addr_free(struct sctp_bind_addr *bp)
 {
-	/* Empty the bind address list. */
-	sctp_bind_addr_clean(bp);
+	struct sctp_sockaddr_entry *addr;
+
+	/* Empty the bind address list inside RCU section. */
+	rcu_read_lock();
+	list_for_each_entry_rcu(addr, &bp->address_list, list) {
+		list_del_rcu(&addr->list);
+		call_rcu(&addr->rcu, sctp_local_addr_free);
+		SCTP_DBG_OBJCNT_DEC(addr);
+	}
+	rcu_read_unlock();
 
 	if (bp->malloced) {
 		kfree(bp);

^ permalink raw reply related

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-18  7:48 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: netdev, Vlad Yasevich
In-Reply-To: <BANLkTikeWzuE-384uT6RhZR6Wn=DBK+CNQ@mail.gmail.com>

Le mercredi 18 mai 2011 à 09:01 +0200, Jacek Luczak a écrit :
> During the sctp_close() call, we do not use rcu primitives to
> destroy the address list attached to the endpoint.  At the same
> time, we do the removal of addresses from this list before
> attempting to remove the socket from the port hash
> 
> As a result, it is possible for another process to find the socket
> in the port hash that is in the process of being closed.  It then
> proceeds to traverse the address list to find the conflict, only
> to have that address list suddenly disappear without rcu() critical
> section.
> 
> This can result in a kernel crash with general protection fault or
> kernel NULL pointer dereference.
> 
> Fix issue by closing address list removal inside RCU critical
> section.
> 
> Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
> 
> ---
>  bind_addr.c |   12 ++++++++++--
>  1 file changed, 10 insertions(+), 2 deletions(-)
> 
> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
> index faf71d1..19d1329 100644
> --- a/net/sctp/bind_addr.c
> +++ b/net/sctp/bind_addr.c
> @@ -155,8 +155,16 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>  /* Dispose of an SCTP_bind_addr structure  */
>  void sctp_bind_addr_free(struct sctp_bind_addr *bp)
>  {
> -       /* Empty the bind address list. */
> -       sctp_bind_addr_clean(bp);
> +       struct sctp_sockaddr_entry *addr;
> +
> +       /* Empty the bind address list inside RCU section. */
> +       rcu_read_lock();
> +       list_for_each_entry_rcu(addr, &bp->address_list, list) {
> +               list_del_rcu(&addr->list);
> +               call_rcu(&addr->rcu, sctp_local_addr_free);
> +               SCTP_DBG_OBJCNT_DEC(addr);
> +       }
> +       rcu_read_unlock();
> 

Sorry this looks odd.

If you're removing items from this list, you must be a writer here, with
exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.

Therefore, I guess following code is better :

list_for_each_entry(addr, &bp->address_list, list) {
	list_del_rcu(&addr->list);
	call_rcu(&addr->rcu, sctp_local_addr_free);
	SCTP_DBG_OBJCNT_DEC(addr);
}

Then, why dont you fix sctp_bind_addr_clean() instead ?

if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
be protected as well.




^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Jacek Luczak @ 2011-05-18  8:06 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Vlad Yasevich
In-Reply-To: <1305704885.2983.4.camel@edumazet-laptop>

2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
> Le mercredi 18 mai 2011 à 09:01 +0200, Jacek Luczak a écrit :
>> During the sctp_close() call, we do not use rcu primitives to
>> destroy the address list attached to the endpoint.  At the same
>> time, we do the removal of addresses from this list before
>> attempting to remove the socket from the port hash
>>
>> As a result, it is possible for another process to find the socket
>> in the port hash that is in the process of being closed.  It then
>> proceeds to traverse the address list to find the conflict, only
>> to have that address list suddenly disappear without rcu() critical
>> section.
>>
>> This can result in a kernel crash with general protection fault or
>> kernel NULL pointer dereference.
>>
>> Fix issue by closing address list removal inside RCU critical
>> section.
>>
>> Signed-off-by: Jacek Luczak <luczak.jacek@gmail.com>
>> Acked-by: Vlad Yasevich <vladislav.yasevich@hp.com>
>>
>> ---
>>  bind_addr.c |   12 ++++++++++--
>>  1 file changed, 10 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sctp/bind_addr.c b/net/sctp/bind_addr.c
>> index faf71d1..19d1329 100644
>> --- a/net/sctp/bind_addr.c
>> +++ b/net/sctp/bind_addr.c
>> @@ -155,8 +155,16 @@ static void sctp_bind_addr_clean(struct sctp_bind_addr *bp)
>>  /* Dispose of an SCTP_bind_addr structure  */
>>  void sctp_bind_addr_free(struct sctp_bind_addr *bp)
>>  {
>> -       /* Empty the bind address list. */
>> -       sctp_bind_addr_clean(bp);
>> +       struct sctp_sockaddr_entry *addr;
>> +
>> +       /* Empty the bind address list inside RCU section. */
>> +       rcu_read_lock();
>> +       list_for_each_entry_rcu(addr, &bp->address_list, list) {
>> +               list_del_rcu(&addr->list);
>> +               call_rcu(&addr->rcu, sctp_local_addr_free);
>> +               SCTP_DBG_OBJCNT_DEC(addr);
>> +       }
>> +       rcu_read_unlock();
>>
>
> Sorry this looks odd.
>
> If you're removing items from this list, you must be a writer here, with
> exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.

I could agree to some extend ... but strict RCU section IMO is needed here.
I can check this if the issue exists.

> Therefore, I guess following code is better :
>
> list_for_each_entry(addr, &bp->address_list, list) {
>        list_del_rcu(&addr->list);
>        call_rcu(&addr->rcu, sctp_local_addr_free);
>        SCTP_DBG_OBJCNT_DEC(addr);
> }
>
> Then, why dont you fix sctp_bind_addr_clean() instead ?
>
> if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
> be protected as well.

The _clean() as claimed by Vlad is called many times from various places
in code and this could give a overhead. I guess Vlad would need to comment.

-Jacek

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Eric Dumazet @ 2011-05-18  8:29 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: netdev, Vlad Yasevich
In-Reply-To: <BANLkTik75iqfgvWGKLJYu6E7mheOwOZX+A@mail.gmail.com>

Le mercredi 18 mai 2011 à 10:06 +0200, Jacek Luczak a écrit :
> 2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:

> > If you're removing items from this list, you must be a writer here, with
> > exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.
> 
> I could agree to some extend ... but strict RCU section IMO is needed here.
> I can check this if the issue exists.
> 

I can tell you for sure rcu_read_lock() is not needed here. Its only
showing confusion from code's author.

Please read Documentation/RCU/listRCU.txt for concise explanations,
line 117.


> > Therefore, I guess following code is better :
> >
> > list_for_each_entry(addr, &bp->address_list, list) {
> >        list_del_rcu(&addr->list);
> >        call_rcu(&addr->rcu, sctp_local_addr_free);
> >        SCTP_DBG_OBJCNT_DEC(addr);
> > }
> >
> > Then, why dont you fix sctp_bind_addr_clean() instead ?
> >
> > if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
> > be protected as well.
> 
> The _clean() as claimed by Vlad is called many times from various places
> in code and this could give a overhead. I guess Vlad would need to comment.

I guess a full review of this code is needed. You'll have to prove
sctp_bind_addr_clean() is always called after one RCU grace period if
you want to leave it as is.

You cant get RCU for free, some rules must be followed or you risk
crashes.




^ permalink raw reply

* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-18  8:32 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305695674.10756.93.camel@localhost.localdomain>

On Tue, May 17, 2011 at 10:14:34PM -0700, Shirley Ma wrote:
> On Wed, 2011-05-18 at 00:28 +0300, Michael S. Tsirkin wrote:
> > On Tue, May 17, 2011 at 01:50:19PM -0700, Shirley Ma wrote:
> > > Resubmit the patch with most update. This patch passed some
> > > live-migration test against RHEL6.2. I will run more stress test w/i
> > > live migration.
> > > 
> > > Signed-off-by: Shirley Ma <xma@us.ibm.com>
> > 
> > Cool. cleanup path needs a fix - are you use you can
> > not use kobj or some other existing refcounting?
> I will look at this.
> 
> > Is perf regressiion caused by tx ring overrun gone now?
> Nope.
> 
> > I added some comments about how we might be aqble
> > to complete requests out of order but it's not a must.
> Ok.
> 
> > > ---
> > > 
> > >  drivers/vhost/net.c   |   37 +++++++++++++++++++++++++++++++-
> > >  drivers/vhost/vhost.c |   55
> > ++++++++++++++++++++++++++++++++++++++++++++++++-
> > >  drivers/vhost/vhost.h |   12 ++++++++++
> > >  3 files changed, 101 insertions(+), 3 deletions(-)
> > > 
> > > diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> > > index 2f7c76a..6bd6e28 100644
> > > --- a/drivers/vhost/net.c
> > > +++ b/drivers/vhost/net.c
> > > @@ -32,6 +32,9 @@
> > >   * Using this limit prevents one virtqueue from starving others. */
> > >  #define VHOST_NET_WEIGHT 0x80000
> > >  
> > > +/* MAX number of TX used buffers for outstanding zerocopy */
> > > +#define VHOST_MAX_ZEROCOPY_PEND 128 
> > > +
> > >  enum {
> > >       VHOST_NET_VQ_RX = 0,
> > >       VHOST_NET_VQ_TX = 1,
> > > @@ -129,6 +132,7 @@ static void handle_tx(struct vhost_net *net)
> > >       int err, wmem;
> > >       size_t hdr_size;
> > >       struct socket *sock;
> > > +     struct skb_ubuf_info pend;
> > >  
> > >       /* TODO: check that we are running from vhost_worker? */
> > >       sock = rcu_dereference_check(vq->private_data, 1);
> > > @@ -151,6 +155,10 @@ static void handle_tx(struct vhost_net *net)
> > >       hdr_size = vq->vhost_hlen;
> > >  
> > >       for (;;) {
> > > +             /* Release DMAs done buffers first */
> > > +             if (atomic_read(&vq->refcnt) >
> > VHOST_MAX_ZEROCOPY_PEND)
> > > +                     vhost_zerocopy_signal_used(vq, false);
> > > +
> > >               head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
> > >                                        ARRAY_SIZE(vq->iov),
> > >                                        &out, &in,
> > > @@ -166,6 +174,13 @@ static void handle_tx(struct vhost_net *net)
> > >                               set_bit(SOCK_ASYNC_NOSPACE,
> > &sock->flags);
> > >                               break;
> > >                       }
> > > +                     /* If more outstanding DMAs, queue the work */
> > > +                     if (sock_flag(sock->sk, SOCK_ZEROCOPY) &&
> > > +                         (atomic_read(&vq->refcnt) >
> > VHOST_MAX_ZEROCOPY_PEND)) {
> > > +                             tx_poll_start(net, sock);
> > > +                             set_bit(SOCK_ASYNC_NOSPACE,
> > &sock->flags);
> > > +                             break;
> > > +                     }
> > >                       if (unlikely(vhost_enable_notify(vq))) {
> > >                               vhost_disable_notify(vq);
> > >                               continue;
> > > @@ -188,17 +203,35 @@ static void handle_tx(struct vhost_net *net)
> > >                              iov_length(vq->hdr, s), hdr_size);
> > >                       break;
> > >               }
> > > +             /* use msg_control to pass vhost zerocopy ubuf info to
> > skb */
> > > +             if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> > > +                     vq->heads[vq->upend_idx].id = head;
> > > +                     if (len <= 128)
> > 
> > I thought we have a constant for that?
> 
> Yes, fixed it already. I might change it to PAGE_SIZE for now since the
> small message sizes regression issue.
> 
> > > +                             vq->heads[vq->upend_idx].len =
> > VHOST_DMA_DONE_LEN;
> > > +                     else {
> > > +                             vq->heads[vq->upend_idx].len = len;
> > > +                             pend.callback =
> > vhost_zerocopy_callback;
> > > +                             pend.arg = vq;
> > > +                             pend.desc = vq->upend_idx;
> > > +                             msg.msg_control = &pend;
> > > +                             msg.msg_controllen = sizeof(pend);
> > > +                     }
> > > +                     atomic_inc(&vq->refcnt);
> > > +                     vq->upend_idx = (vq->upend_idx + 1) %
> > UIO_MAXIOV;
> > 
> > Ok, so we deal with a cyclic ring apparently? What guarantees we don't
> > overrun it?
> 
> VHOST_MAX_PEND (which is 128) should prevent it from overrun normally.
> In the case of none DMAs can be done, the maximum entries are the ring
> size, which is much smaller or equal to UIO_MAXIOV for current
> implementation. 
> 
> Maybe I should add some condition check if it is overrun then exits?

If this can't happen, maybe add a comment why. A check + WARN_ON
can't hurt either.

> > 
> > > +             }
> > >               /* TODO: Check specific error and bomb out unless
> > ENOBUFS? */
> > >               err = sock->ops->sendmsg(NULL, sock, &msg, len);
> > >               if (unlikely(err < 0)) {
> > > -                     vhost_discard_vq_desc(vq, 1);
> > > +                     if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > > +                             vhost_discard_vq_desc(vq, 1);
> > 
> > How are errors handled with zerocopy?
> 
> I thought about it before: if error after macvtap skb is allocated, then
> that skb has a callback which will set DMA done for add_used, if the
> error before macvtap skb is allocated, then it can reverse as copy case.
> To avoid the complexity check, I decided to not handle it.

I think we need to handle it, errors do happen.
macvtap should either don't invoke callback on error or return success.
Then it's simple.

> > 
> > >                       tx_poll_start(net, sock);
> > >                       break;
> > >               }
> > >               if (err != len)
> > >                       pr_debug("Truncated TX packet: "
> > >                                " len %d != %zd\n", err, len);
> > > -             vhost_add_used_and_signal(&net->dev, vq, head, 0);
> > > +             if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> > > +                     vhost_add_used_and_signal(&net->dev, vq, head,
> > 0);
> > >               total_len += len;
> > >               if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
> > >                       vhost_poll_queue(&vq->poll);
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> > > index 2ab2912..ce799d6 100644
> > > --- a/drivers/vhost/vhost.c
> > > +++ b/drivers/vhost/vhost.c
> > > @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev
> > *dev,
> > >       vq->call_ctx = NULL;
> > >       vq->call = NULL;
> > >       vq->log_ctx = NULL;
> > > +     vq->upend_idx = 0;
> > > +     vq->done_idx = 0;
> > > +     atomic_set(&vq->refcnt, 0);
> > >  }
> > >  
> > >  static int vhost_worker(void *data)
> > > @@ -230,7 +233,7 @@ static long vhost_dev_alloc_iovecs(struct
> > vhost_dev *dev)
> > >                                              UIO_MAXIOV,
> > GFP_KERNEL);
> > >               dev->vqs[i].log = kmalloc(sizeof *dev->vqs[i].log *
> > UIO_MAXIOV,
> > >                                         GFP_KERNEL);
> > > -             dev->vqs[i].heads = kmalloc(sizeof *dev->vqs[i].heads
> > *
> > > +             dev->vqs[i].heads = kzalloc(sizeof *dev->vqs[i].heads
> > *
> > >                                           UIO_MAXIOV, GFP_KERNEL);
> > 
> > Which fields need to be initialized actually?
> 
> Nope, already fixed it with kmalloc.
> 
> > >  
> > >               if (!dev->vqs[i].indirect || !dev->vqs[i].log ||
> > > @@ -385,6 +388,38 @@ long vhost_dev_reset_owner(struct vhost_dev
> > *dev)
> > >       return 0;
> > >  }
> > >  
> > > +/* 
> > > +     comments
> > > +*/
> > 
> > Hmm.
> 
> Fixed it in previous patch already.
> 
> > > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool
> > shutdown)
> > > +{
> > > +     int i, j = 0;
> > > +
> > > +     i = vq->done_idx;
> > > +     while (i != vq->upend_idx) {
> > 
> > A for loop might be clearer.
> Ok.
> 
> > > +             if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) ||
> > shutdown) {
> > 
> > On shutdown, we signal all buffers used to the guest?
> > Why?
> 
> We signal all outstand DMAs in the case of driver has some DMA issue to
> prevent us from shutting down. I am afraid vhost cleanup could wait
> forever.

Yes but then
1. guest can reuse the buffers for something else, before device reads them
2. vhost-net module can go away
etc

IMO that this happens in finite time is one of the things driver should
guarantee when it sets the zcopy flags.

BTW we must flush these on memory table change too, right?

> > > +                     /* reset len = 0 */
> > 
> > comment not very helpful.
> > Could you explain what this does instead?
> > Or better use some constant instead of 0 ...
> 
> Fixed it already.
> 
> > > +                     vq->heads[i].len = 0;
> > > +                     i = (i + 1) % UIO_MAXIOV;
> > > +                     ++j;
> > > +             } else
> > > +                     break;
> > 
> > Hmm so if the 1st entry does not complete, you do not signal anything?
> 
> No used buffers to guest, should we signal still?

This is just myself noting that we still force used entries
to be in order. It's ok from correctness pov but likely
is at least part of the reason you still see
TX ring overruns.

> > > +     }
> > 
> > Looking at this loop, done idx is the consumer and used idx
> > is the producer, right?
> 
> Yes.
> 
> > > +     if (j) {
> > > +             /* comments */
> > 
> > Yes?
> > 
> > > +             if (i > vq->done_idx)
> > > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> > j);
> > > +             else {
> > > +                     vhost_add_used_n(vq, &vq->heads[vq->done_idx],
> > > +                                      UIO_MAXIOV - vq->done_idx);
> > > +                     vhost_add_used_n(vq, vq->heads, i);
> > > +             }
> > > +             vq->done_idx = i;
> > > +             vhost_signal(vq->dev, vq);
> > > +             atomic_sub(j, &vq->refcnt);
> > 
> > Code will likely be simpler if you call vhost_add_used once for
> > each head in the first loop. Possibly add_used_signal might be
> > a good idea too.
> 
> Ok.
> 
> > > +     }
> > > +}
> > > +
> > >  /* Caller should have device mutex */
> > >  void vhost_dev_cleanup(struct vhost_dev *dev)
> > >  {
> > > @@ -395,6 +430,11 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> > >                       vhost_poll_stop(&dev->vqs[i].poll);
> > >                       vhost_poll_flush(&dev->vqs[i].poll);
> > >               }
> > > +             /* wait for all lower device DMAs done, then notify
> > guest */
> > > +             if (atomic_read(&dev->vqs[i].refcnt)) {
> > > +                     msleep(1000);
> > > +                     vhost_zerocopy_signal_used(&dev->vqs[i],
> > true);
> > > +             }
> > 
> > This needs to be fixed somehow. Use a completion object and wait
> > on it?
> 
> Worried about what if the driver has some DMAs issue, which would
> prevent vhost from shutting down.

This needs to be fixed in the driver then.

> > >               if (dev->vqs[i].error_ctx)
> > >                       eventfd_ctx_put(dev->vqs[i].error_ctx);
> > >               if (dev->vqs[i].error)
> > > @@ -603,6 +643,10 @@ static long vhost_set_vring(struct vhost_dev
> > *d, int ioctl, void __user *argp)
> > >  
> > >       mutex_lock(&vq->mutex);
> > >  
> > > +     /* force all lower device DMAs done */
> > > +     if (atomic_read(&vq->refcnt)) 
> > > +             vhost_zerocopy_signal_used(vq, true);
> > > +
> > >       switch (ioctl) {
> > >       case VHOST_SET_VRING_NUM:
> > >               /* Resizing ring with an active backend?
> > > @@ -1416,3 +1460,12 @@ void vhost_disable_notify(struct
> > vhost_virtqueue *vq)
> > >               vq_err(vq, "Failed to enable notification at %p: %d
> > \n",
> > >                      &vq->used->flags, r);
> > >  }
> > > +
> > > +void vhost_zerocopy_callback(struct sk_buff *skb)
> > > +{
> > > +     int idx = skb_shinfo(skb)->ubuf.desc;
> > > +     struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;
> > > +
> > > +     /* set len = 1 to mark this desc buffers done DMA */
> > 
> > this comment can now go.
> 
> Yes, it's gone already.
> 
> > > +     vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> > > +}
> > > diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> > > index b3363ae..8e3ecc7 100644
> > > --- a/drivers/vhost/vhost.h
> > > +++ b/drivers/vhost/vhost.h
> > > @@ -13,6 +13,10 @@
> > >  #include <linux/virtio_ring.h>
> > >  #include <asm/atomic.h>
> > >  
> > > +/* This is for zerocopy, used buffer len is set to 1 when lower
> > device DMA
> > > + * done */
> > > +#define VHOST_DMA_DONE_LEN   1
> > > +
> > >  struct vhost_device;
> > >  
> > >  struct vhost_work;
> > > @@ -108,6 +112,12 @@ struct vhost_virtqueue {
> > >       /* Log write descriptors */
> > >       void __user *log_base;
> > >       struct vhost_log *log;
> > > +     /* vhost zerocopy support */
> > > +     atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> > 
> > future enhancement idea: this is used apparently under vq lock
> > so no need for an atomic?
> 
> It is also used in skb vhost zerocopy callback.
> 
> > > +     /* copy of avail idx to monitor outstanding DMA zerocopy
> > buffers */
> > 
> > looking at code upend_idx seems to be calculated independently
> > of guest avail idx - could you clarify pls?
> 
> Yes, you are right. Should change it to: upend_idx is used to track
> vring ids for outstanding zero-copy DMA buffers?

mention producer-consumer index in a head array used as
a circular ring datastructure.

> > > +     int upend_idx;
> > > +     /* copy of used idx to monintor DMA done zerocopy buffers */
> > 
> > monitor
> 
> Ok.
> 
> > > +     int done_idx;
> > 
> > 
> > I think in reality these are just producer and consumer
> > in the head structure which for zero copy is used
> Yes.
> 
> > 
> > >  };
> > >  
> > >  struct vhost_dev {
> > > @@ -154,6 +164,8 @@ bool vhost_enable_notify(struct vhost_virtqueue
> > *);
> > >  
> > >  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log
> > *log,
> > >                   unsigned int log_num, u64 len);
> > > +void vhost_zerocopy_callback(struct sk_buff *skb);
> > > +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool
> > shutdown);
> > >  
> > >  #define vq_err(vq, fmt, ...) do {
> > \
> > >               pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> > > 

^ permalink raw reply

* Re: [PATCH V6 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-18  8:43 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305699383.10756.100.camel@localhost.localdomain>

On Tue, May 17, 2011 at 11:16:23PM -0700, Shirley Ma wrote:
> Hello Michael,
> 
> Here is the update the patch based on all of your review comments except
> the completion/wait for cleanup since I am worried about outstanding
> DMAs would prevent vhost from shutting down.

Don't see what you are worried about. Doing this in a timely fashion
is just one of the things driver commits to when it published the zcopy flag.
Otherwise guest networking will get stuck as entries in the tx ring
don't free up, which is also a problem. Let's just add a comment documenting this.

> I am sending out this for your review, and test it out later.

So let's use a completion to cleanly flush outstanding requests.
Any chance kref can be reused? It's not a must.

One other piece that is currently missing is flushing
requests on memory table update. Maybe we can stick some
info in high bits of the desc field for that?

> For error handling, I update macvtap.c so we can discard the desc even
> in zero-copy case.
> 
> Signed-off-by: Shirley Ma <xma@us.ibm.com>
> ---
>  drivers/vhost/net.c   |   42 +++++++++++++++++++++++++++++++++++++++++-
>  drivers/vhost/vhost.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
>  drivers/vhost/vhost.h |   13 +++++++++++++
>  3 files changed, 103 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 2f7c76a..e87a1f8 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -32,6 +32,10 @@
>   * Using this limit prevents one virtqueue from starving others. */
>  #define VHOST_NET_WEIGHT 0x80000
>  
> +/* MAX number of TX used buffers for outstanding zerocopy */
> +#define VHOST_MAX_PEND 128
> +#define VHOST_GOODCOPY_LEN PAGE_SIZE
> +
>  enum {
>  	VHOST_NET_VQ_RX = 0,
>  	VHOST_NET_VQ_TX = 1,
> @@ -129,6 +133,7 @@ static void handle_tx(struct vhost_net *net)
>  	int err, wmem;
>  	size_t hdr_size;
>  	struct socket *sock;
> +	struct skb_ubuf_info pend;
>  
>  	/* TODO: check that we are running from vhost_worker? */
>  	sock = rcu_dereference_check(vq->private_data, 1);
> @@ -151,6 +156,10 @@ static void handle_tx(struct vhost_net *net)
>  	hdr_size = vq->vhost_hlen;
>  
>  	for (;;) {
> +		/* Release DMAs done buffers first */
> +		if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND)
> +			vhost_zerocopy_signal_used(vq, false);
> +
>  		head = vhost_get_vq_desc(&net->dev, vq, vq->iov,
>  					 ARRAY_SIZE(vq->iov),
>  					 &out, &in,
> @@ -166,6 +175,12 @@ static void handle_tx(struct vhost_net *net)
>  				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
>  				break;
>  			}
> +			/* If more outstanding DMAs, queue the work */
> +			if (atomic_read(&vq->refcnt) > VHOST_MAX_PEND) {
> +				tx_poll_start(net, sock);
> +				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> +				break;
> +			}
>  			if (unlikely(vhost_enable_notify(vq))) {
>  				vhost_disable_notify(vq);
>  				continue;
> @@ -188,6 +203,30 @@ static void handle_tx(struct vhost_net *net)
>  			       iov_length(vq->hdr, s), hdr_size);
>  			break;
>  		}
> +		/* use msg_control to pass vhost zerocopy ubuf info to skb */
> +		if (sock_flag(sock->sk, SOCK_ZEROCOPY)) {
> +			vq->heads[vq->upend_idx].id = head;
> +			if (len < VHOST_GOODCOPY_LEN)
> +				/* copy don't need to wait for DMA done */
> +				vq->heads[vq->upend_idx].len =
> +							VHOST_DMA_DONE_LEN;
> +			else {
> +				vq->heads[vq->upend_idx].len = len;
> +				pend.callback = vhost_zerocopy_callback;
> +				pend.arg = vq;
> +				pend.desc = vq->upend_idx;
> +				msg.msg_control = &pend;
> +				msg.msg_controllen = sizeof(pend);
> +			}
> +			atomic_inc(&vq->refcnt);
> +			vq->upend_idx = (vq->upend_idx + 1) % UIO_MAXIOV;
> +			/* if upend_idx is full, then wait for free more */
> +			if (vq->upend_idx == vq->done_idx) {
> +				tx_poll_start(net, sock);
> +				set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
> +				break;
> +			}
> +		}
>  		/* TODO: Check specific error and bomb out unless ENOBUFS? */
>  		err = sock->ops->sendmsg(NULL, sock, &msg, len);
>  		if (unlikely(err < 0)) {
> @@ -198,7 +237,8 @@ static void handle_tx(struct vhost_net *net)
>  		if (err != len)
>  			pr_debug("Truncated TX packet: "
>  				 " len %d != %zd\n", err, len);
> -		vhost_add_used_and_signal(&net->dev, vq, head, 0);
> +		if (!sock_flag(sock->sk, SOCK_ZEROCOPY))
> +			vhost_add_used_and_signal(&net->dev, vq, head, 0);
>  		total_len += len;
>  		if (unlikely(total_len >= VHOST_NET_WEIGHT)) {
>  			vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2ab2912..f4c2730 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -174,6 +174,9 @@ static void vhost_vq_reset(struct vhost_dev *dev,
>  	vq->call_ctx = NULL;
>  	vq->call = NULL;
>  	vq->log_ctx = NULL;
> +	vq->upend_idx = 0;
> +	vq->done_idx = 0;
> +	atomic_set(&vq->refcnt, 0);
>  }
>  
>  static int vhost_worker(void *data)
> @@ -385,16 +388,49 @@ long vhost_dev_reset_owner(struct vhost_dev *dev)
>  	return 0;
>  }
>  
> +/* In case of DMA done not in order in lower device driver for some reason.
> + * upend_idx is used to track end of used idx, done_idx is used to track head
> + * of used idx. Once lower device DMA done contiguously, we will signal KVM
> + * guest used idx.
> + */
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown)
> +{
> +	int i, j = 0;
> +
> +	for (i = vq->done_idx; i != vq->upend_idx; i = i++ % UIO_MAXIOV) {
> +		if ((vq->heads[i].len == VHOST_DMA_DONE_LEN) || shutdown) {
> +			vq->heads[i].len = VHOST_DMA_CLEAR_LEN;
> +			vhost_add_used_and_signal(vq->dev, vq,
> +						  vq->heads[i].id, 0);
> +			++j;
> +		} else
> +			break;
> +	}
> +	if (j) {
> +		vq->done_idx = i;
> +		atomic_sub(j, &vq->refcnt);
> +	}
> +}
> +
>  /* Caller should have device mutex */
>  void vhost_dev_cleanup(struct vhost_dev *dev)
>  {
>  	int i;
> +	unsigned long begin = jiffies;
>  
>  	for (i = 0; i < dev->nvqs; ++i) {
>  		if (dev->vqs[i].kick && dev->vqs[i].handle_kick) {
>  			vhost_poll_stop(&dev->vqs[i].poll);
>  			vhost_poll_flush(&dev->vqs[i].poll);
>  		}
> +		/* Wait for all lower device DMAs done, then notify virtio_net
> +		 * or just notify it without waiting for all DMA done here ?
> +		 * in case of DMAs never done for some reason */
> +		if (atomic_read(&dev->vqs[i].refcnt)) {
> +			/* how long should we wait ? */
> +			msleep(1000);
> +			vhost_zerocopy_signal_used(&dev->vqs[i], true);
> +		}
>  		if (dev->vqs[i].error_ctx)
>  			eventfd_ctx_put(dev->vqs[i].error_ctx);
>  		if (dev->vqs[i].error)
> @@ -603,6 +639,10 @@ static long vhost_set_vring(struct vhost_dev *d, int ioctl, void __user *argp)
>  
>  	mutex_lock(&vq->mutex);
>  
> +	/* clean up lower device outstanding DMAs, before setting ring */
> +	if (atomic_read(&vq->refcnt))
> +		vhost_zerocopy_signal_used(vq, true);
> +
>  	switch (ioctl) {
>  	case VHOST_SET_VRING_NUM:
>  		/* Resizing ring with an active backend?
> @@ -1416,3 +1456,12 @@ void vhost_disable_notify(struct vhost_virtqueue *vq)
>  		vq_err(vq, "Failed to enable notification at %p: %d\n",
>  		       &vq->used->flags, r);
>  }
> +
> +void vhost_zerocopy_callback(struct sk_buff *skb)
> +{
> +	int idx = skb_shinfo(skb)->ubuf.desc;
> +	struct vhost_virtqueue *vq = skb_shinfo(skb)->ubuf.arg;


Let's do some sanity checking of idx here in case
there's a driver bug.

> +
> +	/* set len = 1 to mark this desc buffers done DMA */
> +	vq->heads[idx].len = VHOST_DMA_DONE_LEN;
> +}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index b3363ae..d0e7ac6 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -13,6 +13,11 @@
>  #include <linux/virtio_ring.h>
>  #include <asm/atomic.h>
>  
> +/* This is for zerocopy, used buffer len is set to 1 when lower device DMA
> + * done */
> +#define VHOST_DMA_DONE_LEN	1
> +#define VHOST_DMA_CLEAR_LEN	0
> +
>  struct vhost_device;
>  
>  struct vhost_work;
> @@ -108,6 +113,12 @@ struct vhost_virtqueue {
>  	/* Log write descriptors */
>  	void __user *log_base;
>  	struct vhost_log *log;
> +	/* vhost zerocopy support */
> +	atomic_t refcnt; /* num of outstanding zerocopy DMAs */
> +	/* last used idx for outstanding DMA zerocopy buffers */
> +	int upend_idx;
> +	/* first used idx for DMA done zerocopy buffers */
> +	int done_idx;
>  };
>  
>  struct vhost_dev {
> @@ -154,6 +165,8 @@ bool vhost_enable_notify(struct vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>  		    unsigned int log_num, u64 len);
> +void vhost_zerocopy_callback(struct sk_buff *skb);
> +void vhost_zerocopy_signal_used(struct vhost_virtqueue *vq, bool shutdown);
>  
>  #define vq_err(vq, fmt, ...) do {                                  \
>  		pr_debug(pr_fmt(fmt), ##__VA_ARGS__);       \
> 
> 

^ permalink raw reply

* Re: [PATCH V5 4/6 net-next] vhost: vhost TX zero-copy support
From: Michael S. Tsirkin @ 2011-05-18  8:45 UTC (permalink / raw)
  To: Shirley Ma
  Cc: David Miller, Eric Dumazet, Avi Kivity, Arnd Bergmann, netdev,
	kvm, linux-kernel
In-Reply-To: <1305695674.10756.93.camel@localhost.localdomain>

On Tue, May 17, 2011 at 10:14:34PM -0700, Shirley Ma wrote:
> Yes, fixed it already. I might change it to PAGE_SIZE for now since the
> small message sizes regression issue.

If that fixes it, let's do that for now.
-- 
MST

^ permalink raw reply

* packet received in a wrong rx-queue?
From: Jon Zhou @ 2011-05-18  9:00 UTC (permalink / raw)
  To: e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org

There are 2 packets in the traffic

#1 create PDP context request, IPV4--UDP--GTP, src_ip=A and dst_ip=B,src_port=C,dst_port=D

#2 create PDP context response, IPV4--UDP--GTP,src_ip=B, dst_ip=A, src_port=D,dst_port=C

I suppose both of them will be received in same rx-queue but actually it doesn't
Anything need to check?

ethtool -i eth5
driver: ixgbe
version: 3.3.9-NAPI
firmware-version: 0.9-3
bus-info: 0000:08:00.1

regards
jon





------------------------------------------------------------------------------
What Every C/C++ and Fortran developer Should Know!
Read this article and learn how Intel has extended the reach of its 
next-generation tools to help Windows* and Linux* C/C++ and Fortran 
developers boost performance applications - including clusters. 
http://p.sf.net/sfu/intel-dev2devmay
_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel&#174; Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* Re: [PATCH] SCTP: fix race between sctp_bind_addr_free() and sctp_bind_addr_conflict()
From: Wei Yongjun @ 2011-05-18  9:02 UTC (permalink / raw)
  To: Jacek Luczak; +Cc: Eric Dumazet, netdev, Vlad Yasevich
In-Reply-To: <1305707358.2983.14.camel@edumazet-laptop>


> Le mercredi 18 mai 2011 à 10:06 +0200, Jacek Luczak a écrit :
>> 2011/5/18 Eric Dumazet <eric.dumazet@gmail.com>:
>>> If you're removing items from this list, you must be a writer here, with
>>> exclusive access. So rcu_read_lock()/rcu_read_unlock() is not necessary.
>> I could agree to some extend ... but strict RCU section IMO is needed here.
>> I can check this if the issue exists.
>>
> I can tell you for sure rcu_read_lock() is not needed here. Its only
> showing confusion from code's author.
>
> Please read Documentation/RCU/listRCU.txt for concise explanations,
> line 117.
>
>
>>> Therefore, I guess following code is better :
>>>
>>> list_for_each_entry(addr, &bp->address_list, list) {
>>>        list_del_rcu(&addr->list);
>>>        call_rcu(&addr->rcu, sctp_local_addr_free);
>>>        SCTP_DBG_OBJCNT_DEC(addr);
>>> }
>>>
>>> Then, why dont you fix sctp_bind_addr_clean() instead ?
>>>
>>> if 'struct sctp_sockaddr_entry' is recu protected, then all frees should
>>> be protected as well.
>> The _clean() as claimed by Vlad is called many times from various places
>> in code and this could give a overhead. I guess Vlad would need to comment.
> I guess a full review of this code is needed. You'll have to prove
> sctp_bind_addr_clean() is always called after one RCU grace period if
> you want to leave it as is.
>
> You cant get RCU for free, some rules must be followed or you risk
> crashes.
>

fix the race between sctp_bind_addr_free() and sctp_bind_addr_conflict(), maybe you just
need to remove the socket from the port hash before empty the bind address list.
some thing like this, not test.

diff --git a/net/sctp/endpointola.c b/net/sctp/endpointola.c
index c8cc24e..924d846 100644
--- a/net/sctp/endpointola.c
+++ b/net/sctp/endpointola.c
@@ -268,12 +268,13 @@ static void sctp_endpoint_destroy(struct sctp_endpoint *ep)
 
 	/* Cleanup. */
 	sctp_inq_free(&ep->base.inqueue);
-	sctp_bind_addr_free(&ep->base.bind_addr);
 
 	/* Remove and free the port */
 	if (sctp_sk(ep->base.sk)->bind_hash)
 		sctp_put_port(ep->base.sk);
 
+	sctp_bind_addr_free(&ep->base.bind_addr);
+
 	/* Give up our hold on the sock. */
 	if (ep->base.sk)
 		sock_put(ep->base.sk);



^ permalink raw reply related

* Re: [PATCH V5 2/6 net-next] netdevice.h: Add zero-copy flag in netdevice
From: Michał Mirosław @ 2011-05-18  9:06 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, Ben Hutchings, David Miller, Eric Dumazet,
	Avi Kivity, Arnd Bergmann, netdev, kvm, linux-kernel
In-Reply-To: <1305675865.10756.63.camel@localhost.localdomain>

W dniu 18 maja 2011 01:44 użytkownik Shirley Ma <mashirle@us.ibm.com> napisał:
> On Wed, 2011-05-18 at 00:58 +0200, Michał Mirosław wrote:
>> W dniu 18 maja 2011 00:28 użytkownik Shirley Ma <mashirle@us.ibm.com>
>> napisał:
>> > On Tue, 2011-05-17 at 23:48 +0200, Michał Mirosław wrote:
>> >> 2011/5/17 Shirley Ma <mashirle@us.ibm.com>:
>> >> > Looks like to use a new flag requires more time/work. I am
>> thinking
>> >> > whether we can just use HIGHDMA flag to enable zero-copy in
>> macvtap
>> >> to
>> >> > avoid the new flag for now since mavctap uses real NICs as lower
>> >> device?
>> >>
>> >> Is there any other restriction besides requiring driver to not
>> recycle
>> >> the skb? Are there any drivers that recycle TX skbs?
>> > Not more other restrictions, skb clone is OK. pskb_expand_head()
>> looks
>> > OK to me from code review.
>>
>> > Currently there is no drivers recycle TX skbs.
>>
>> So why do you require the target device to have some flags at all?
> We could use macvtap to check lower device HIGHDMA to enable zero-copy,
> but I am not sure whether it is sufficient. If it's sufficient then we
> don't need to use a new flag here. To be safe, it's better to use a new
> flag to enable each device who can pass zero-copy test.

>> Do I understand correctly, that this zero-copy feature is about
>> packets received from VMs?
> Yes, packets sent from VMs, and received in local host for TX zero-copy
> here.

What is the zero-copy test? On some arches the HIGHDMA is not needed
at all so might be not enabled on anything. It looks like the correct
test would be per-packet check of !illegal_highdma() or maybe
NETIF_F_SG as returned from harmonize_features(). For virtual devices
or other software forwarding this might lead to skb_linearize() in
some cases, but is it that bad?

Best Regards,
Michał Mirosław

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox