Netdev List
 help / color / mirror / Atom feed
* cxgb4i_v4.3 submission
From: Rakesh Ranjan @ 2010-06-08  4:59 UTC (permalink / raw)
  To: LK-NetDev, LK-SCSIDev, LK-iSCSIDev
  Cc: LKML, Karen Xie, David Miller, James Bottomley, Mike Christie,
	Anish Bhatt

The following 3 patches add a new iscsi LLD driver cxgb4i to enable iscsi offload
support on Chelsio's new 1G and 10G cards. This is updated version of previous cxgb4i
patch. Please share you commnets after review.

Changes since cxgb4i_v4.2
1. Removed early returns from some functions, which got added for debugging.

[PATCH 1/3] cxgb4i_v4.3 : add build support
[PATCH 2/3] cxgb4i_v4.3 : libcxgbi common library part
[PATCH 3/3] cxgb4i_v4.3 : main driver files

Regards
Rakesh Ranjan

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply

* [PATCH 1/3] cxgb4i_v4.3 : add build support
From: Rakesh Ranjan @ 2010-06-08  4:59 UTC (permalink / raw)
  To: LK-NetDev, LK-SCSIDev, LK-iSCSIDev
  Cc: LKML, Karen Xie, David Miller, James Bottomley, Mike Christie,
	Anish Bhatt, Rakesh Ranjan
In-Reply-To: <1275973167-8640-1-git-send-email-rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

From: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>


Signed-off-by: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
 drivers/scsi/Kconfig       |    1 +
 drivers/scsi/Makefile      |    1 +
 drivers/scsi/cxgbi/Kbuild  |    5 +++++
 drivers/scsi/cxgbi/Kconfig |    7 +++++++
 4 files changed, 14 insertions(+), 0 deletions(-)
 create mode 100644 drivers/scsi/cxgbi/Kbuild
 create mode 100644 drivers/scsi/cxgbi/Kconfig

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 75f2336..70983a8 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -371,6 +371,7 @@ config ISCSI_TCP
 	 http://open-iscsi.org
 
 source "drivers/scsi/cxgb3i/Kconfig"
+source "drivers/scsi/cxgbi/Kconfig"
 source "drivers/scsi/bnx2i/Kconfig"
 source "drivers/scsi/be2iscsi/Kconfig"
 
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 1c7ac49..b0873aa 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -133,6 +133,7 @@ obj-$(CONFIG_SCSI_STEX)		+= stex.o
 obj-$(CONFIG_SCSI_MVSAS)	+= mvsas/
 obj-$(CONFIG_PS3_ROM)		+= ps3rom.o
 obj-$(CONFIG_SCSI_CXGB3_ISCSI)	+= libiscsi.o libiscsi_tcp.o cxgb3i/
+obj-$(CONFIG_SCSI_CXGB4_ISCSI)	+= libiscsi.o libiscsi_tcp.o cxgbi/
 obj-$(CONFIG_SCSI_BNX2_ISCSI)	+= libiscsi.o bnx2i/
 obj-$(CONFIG_BE2ISCSI)		+= libiscsi.o be2iscsi/
 obj-$(CONFIG_SCSI_PMCRAID)	+= pmcraid.o
diff --git a/drivers/scsi/cxgbi/Kbuild b/drivers/scsi/cxgbi/Kbuild
new file mode 100644
index 0000000..03291c5
--- /dev/null
+++ b/drivers/scsi/cxgbi/Kbuild
@@ -0,0 +1,5 @@
+EXTRA_CFLAGS += -I$(srctree)/drivers/net/cxgb4
+
+obj-$(CONFIG_SCSI_CXGB4_ISCSI) += cxgb4i.o libcxgbi.o
+cxgb4i-y := cxgb4i_init.o cxgb4i_offload.o cxgb4i_ddp.o
+
diff --git a/drivers/scsi/cxgbi/Kconfig b/drivers/scsi/cxgbi/Kconfig
new file mode 100644
index 0000000..3f33dc2
--- /dev/null
+++ b/drivers/scsi/cxgbi/Kconfig
@@ -0,0 +1,7 @@
+config SCSI_CXGB4_ISCSI
+	tristate "Chelsio T4 iSCSI support"
+	depends on CHELSIO_T4_DEPENDS
+	select CHELSIO_T4
+	select SCSI_ISCSI_ATTRS
+	---help---
+	This driver supports iSCSI offload for the Chelsio T4 series devices.
-- 
1.6.6.1

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply related

* [PATCH 2/3] cxgb4i_v4.3 : libcxgbi common library part
From: Rakesh Ranjan @ 2010-06-08  4:59 UTC (permalink / raw)
  To: LK-NetDev, LK-SCSIDev, LK-iSCSIDev
  Cc: LKML, Karen Xie, David Miller, James Bottomley, Mike Christie,
	Anish Bhatt, Rakesh Ranjan
In-Reply-To: <1275973167-8640-2-git-send-email-rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

From: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>


Signed-off-by: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
 drivers/scsi/cxgbi/libcxgbi.c | 1518 +++++++++++++++++++++++++++++++++++++++++
 drivers/scsi/cxgbi/libcxgbi.h |  556 +++++++++++++++
 2 files changed, 2074 insertions(+), 0 deletions(-)
 create mode 100644 drivers/scsi/cxgbi/libcxgbi.c
 create mode 100644 drivers/scsi/cxgbi/libcxgbi.h

diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
new file mode 100644
index 0000000..f6266a0
--- /dev/null
+++ b/drivers/scsi/cxgbi/libcxgbi.c
@@ -0,0 +1,1518 @@
+/*
+ * libcxgbi.c: Chelsio common library for T3/T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#include <linux/skbuff.h>
+#include <linux/crypto.h>
+#include <linux/scatterlist.h>
+#include <linux/pci.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_host.h>
+#include <linux/if_vlan.h>
+#include <net/dst.h>
+#include <net/route.h>
+#include <net/tcp.h>
+
+#include "libcxgbi.h"
+
+MODULE_AUTHOR("Chelsio Communications");
+MODULE_DESCRIPTION("Chelsio libcxgbi common library");
+MODULE_LICENSE("GPL");
+
+static LIST_HEAD(cdev_list);
+static DEFINE_MUTEX(cdev_rwlock);
+
+static void cxgbi_release_itt(struct iscsi_task *, itt_t);
+static int cxgbi_reserve_itt(struct iscsi_task *, itt_t *);
+
+struct cxgbi_device *cxgbi_device_register(unsigned int dd_size,
+					unsigned int nports)
+{
+	struct cxgbi_device *cdev;
+
+	cdev = kzalloc(sizeof(*cdev) + dd_size, GFP_KERNEL);
+	if (!cdev)
+		return NULL;
+
+	cdev->hbas = kzalloc(sizeof(struct cxgbi_hba **) *  nports, GFP_KERNEL);
+	if (!cdev->hbas) {
+		kfree(cdev);
+		return NULL;
+	}
+
+	mutex_lock(&cdev_rwlock);
+	list_add_tail(&cdev->list_head, &cdev_list);
+	mutex_unlock(&cdev_rwlock);
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(cxgbi_device_register);
+
+void cxgbi_device_unregister(struct cxgbi_device *cdev)
+{
+	mutex_lock(&cdev_rwlock);
+	list_del(&cdev->list_head);
+	mutex_unlock(&cdev_rwlock);
+
+	kfree(cdev->hbas);
+	kfree(cdev);
+}
+EXPORT_SYMBOL_GPL(cxgbi_device_unregister);
+
+static struct cxgbi_hba *cxgbi_hba_find_by_netdev(struct net_device *dev,
+						struct cxgbi_device *cdev)
+{
+	int i;
+
+	if (dev->priv_flags & IFF_802_1Q_VLAN)
+		dev = vlan_dev_real_dev(dev);
+
+	for (i = 0; i < cdev->nports; i++) {
+		if (cdev->hbas[i]->ndev == dev)
+			return cdev->hbas[i];
+	}
+
+	return NULL;
+}
+
+static struct rtable *find_route(struct net_device *dev,
+				__be32 saddr, __be32 daddr,
+				__be16 sport, __be16 dport,
+				u8 tos)
+{
+	struct rtable *rt;
+	struct flowi fl = {
+		.oif = dev ? dev->ifindex : 0,
+		.nl_u = {
+			.ip4_u = {
+				.daddr = daddr,
+				.saddr = saddr,
+				.tos = tos }
+			},
+		.proto = IPPROTO_TCP,
+		.uli_u = {
+			.ports = {
+				.sport = sport,
+				.dport = dport }
+			}
+	};
+
+	if (ip_route_output_flow(dev ? dev_net(dev) : &init_net,
+					&rt, &fl, NULL, 0))
+		return NULL;
+
+	return rt;
+}
+
+static struct net_device *cxgbi_find_dev(struct net_device *dev,
+					__be32 ipaddr)
+{
+	struct flowi fl;
+	struct rtable *rt;
+	int err;
+
+	memset(&fl, 0, sizeof(fl));
+	fl.nl_u.ip4_u.daddr = ipaddr;
+
+	err = ip_route_output_key(dev ? dev_net(dev) : &init_net, &rt, &fl);
+	if (!err)
+		return (&rt->u.dst)->dev;
+
+	return NULL;
+}
+
+static int is_cxgbi_dev(struct net_device *dev, struct cxgbi_device *cdev)
+{
+	struct net_device *ndev = dev;
+	int i;
+
+	if (dev->priv_flags & IFF_802_1Q_VLAN)
+		ndev = vlan_dev_real_dev(dev);
+
+	for (i = 0; i < cdev->nports; i++) {
+		if (ndev == cdev->ports[i])
+			return 1;
+	}
+	return 0;
+}
+
+static struct net_device *cxgbi_find_egress_dev(struct net_device *root_dev,
+						struct cxgbi_device *cdev)
+{
+	while (root_dev) {
+		if (root_dev->priv_flags & IFF_802_1Q_VLAN)
+			root_dev = vlan_dev_real_dev(root_dev);
+		else if (is_cxgbi_dev(root_dev, cdev))
+			return root_dev;
+		else
+			return NULL;
+	}
+
+	return NULL;
+}
+
+static struct cxgbi_device *cxgbi_find_cdev(struct net_device *dev,
+					    __be32 ipaddr)
+{
+	struct flowi fl;
+	struct rtable *rt;
+	struct net_device *sdev = NULL;
+	struct cxgbi_device *cdev = NULL, *tmp;
+	int err, i;
+
+	memset(&fl, 0, sizeof(fl));
+	fl.nl_u.ip4_u.daddr = ipaddr;
+
+	err = ip_route_output_key(dev ? dev_net(dev) : &init_net, &rt, &fl);
+	if (err)
+		goto out;
+
+	sdev = (&rt->u.dst)->dev;
+	mutex_lock(&cdev_rwlock);
+	list_for_each_entry_safe(cdev, tmp, &cdev_list, list_head) {
+		if (cdev) {
+			for (i = 0; i < cdev->nports; i++) {
+				if (sdev == cdev->ports[i]) {
+					mutex_unlock(&cdev_rwlock);
+					return cdev;
+				}
+			}
+		}
+	}
+	mutex_unlock(&cdev_rwlock);
+out:	return cdev;
+}
+
+/*
+ * pdu receive, interact with libiscsi_tcp
+ */
+static inline int read_pdu_skb(struct iscsi_conn *conn,
+			       struct sk_buff *skb,
+			       unsigned int offset,
+			       int offloaded)
+{
+	int status = 0;
+	int bytes_read;
+
+	bytes_read = iscsi_tcp_recv_skb(conn, skb, offset, offloaded, &status);
+	switch (status) {
+	case ISCSI_TCP_CONN_ERR:
+		return -EIO;
+	case ISCSI_TCP_SUSPENDED:
+		/* no transfer - just have caller flush queue */
+		return bytes_read;
+	case ISCSI_TCP_SKB_DONE:
+		/*
+		 * pdus should always fit in the skb and we should get
+		 * segment done notifcation.
+		 */
+		iscsi_conn_printk(KERN_ERR, conn, "Invalid pdu or skb.");
+		return -EFAULT;
+	case ISCSI_TCP_SEGMENT_DONE:
+		return bytes_read;
+	default:
+		iscsi_conn_printk(KERN_ERR, conn, "Invalid iscsi_tcp_recv_skb "
+				  "status %d\n", status);
+		return -EINVAL;
+	}
+}
+
+static int cxgbi_conn_read_bhs_pdu_skb(struct iscsi_conn *conn,
+				       struct sk_buff *skb)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	int rc;
+
+	cxgbi_rx_debug("conn 0x%p, skb 0x%p, len %u, flag 0x%x.\n",
+			conn, skb, skb->len, cdev->get_skb_ulp_mode(skb));
+
+	if (!iscsi_tcp_recv_segment_is_hdr(tcp_conn)) {
+		iscsi_conn_failure(conn, ISCSI_ERR_PROTO);
+		return -EIO;
+	}
+
+	if (conn->hdrdgst_en && (cdev->get_skb_ulp_mode(skb)
+				& ULP2_FLAG_HCRC_ERROR)) {
+		iscsi_conn_failure(conn, ISCSI_ERR_HDR_DGST);
+		return -EIO;
+	}
+
+	rc = read_pdu_skb(conn, skb, 0, 0);
+	if (rc <= 0)
+		return rc;
+
+	return 0;
+}
+
+static int cxgbi_conn_read_data_pdu_skb(struct iscsi_conn *conn,
+					struct sk_buff *skb)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	bool offloaded = 0;
+	unsigned int offset = 0;
+	int rc;
+
+	cxgbi_rx_debug("conn 0x%p, skb 0x%p, len %u, flag 0x%x.\n",
+			conn, skb, skb->len, cdev->get_skb_ulp_mode(skb));
+
+	if (conn->datadgst_en &&
+		(cdev->get_skb_ulp_mode(skb) & ULP2_FLAG_DCRC_ERROR)) {
+		iscsi_conn_failure(conn, ISCSI_ERR_DATA_DGST);
+		return -EIO;
+	}
+
+	if (iscsi_tcp_recv_segment_is_hdr(tcp_conn))
+		return 0;
+
+	if (conn->hdrdgst_en)
+		offset = ISCSI_DIGEST_SIZE;
+
+	if (cdev->get_skb_ulp_mode(skb) & ULP2_FLAG_DATA_DDPED) {
+		cxgbi_rx_debug("skb 0x%p, opcode 0x%x, data %u, ddp'ed, "
+				"itt 0x%x.\n",
+				skb,
+				tcp_conn->in.hdr->opcode & ISCSI_OPCODE_MASK,
+				tcp_conn->in.datalen,
+				ntohl(tcp_conn->in.hdr->itt));
+		offloaded = 1;
+	} else {
+		cxgbi_rx_debug("skb 0x%p, opcode 0x%x, data %u, NOT ddp'ed, "
+				"itt 0x%x.\n",
+				skb,
+				tcp_conn->in.hdr->opcode & ISCSI_OPCODE_MASK,
+				tcp_conn->in.datalen,
+				ntohl(tcp_conn->in.hdr->itt));
+	}
+
+	rc = read_pdu_skb(conn, skb, 0, offloaded);
+	if (rc < 0)
+		return rc;
+
+	return 0;
+}
+
+void cxgbi_conn_pdu_ready(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb;
+	unsigned int read = 0;
+	struct iscsi_conn *conn = csk->user_data;
+	int err = 0;
+
+	cxgbi_rx_debug("csk 0x%p.\n", csk);
+
+	read_lock(&csk->callback_lock);
+	if (unlikely(!conn || conn->suspend_rx)) {
+		cxgbi_rx_debug("conn 0x%p, id %d, suspend_rx %lu!\n",
+				conn, conn ? conn->id : 0xFF,
+				conn ? conn->suspend_rx : 0xFF);
+		read_unlock(&csk->callback_lock);
+		return;
+	}
+
+	skb = skb_peek(&csk->receive_queue);
+	while (!err && skb) {
+		__skb_unlink(skb, &csk->receive_queue);
+		read += csk->cdev->get_skb_rx_pdulen(skb);
+		cxgbi_rx_debug("conn 0x%p, csk 0x%p, rx skb 0x%p, pdulen %u\n",
+				conn, csk, skb,
+				csk->cdev->get_skb_rx_pdulen(skb));
+		if (csk->flags & CTPF_MSG_COALESCED) {
+			err = cxgbi_conn_read_bhs_pdu_skb(conn, skb);
+			err = cxgbi_conn_read_data_pdu_skb(conn, skb);
+		} else {
+			if (csk->cdev->get_skb_flags(skb) &
+			    CTP_SKCBF_HDR_RCVD)
+				err = cxgbi_conn_read_bhs_pdu_skb(conn, skb);
+			else if (csk->cdev->get_skb_flags(skb) ==
+				CTP_SKCBF_DATA_RCVD)
+				err = cxgbi_conn_read_data_pdu_skb(conn, skb);
+		}
+		__kfree_skb(skb);
+		skb = skb_peek(&csk->receive_queue);
+	}
+	cxgbi_log_debug("read %d\n", read);
+	read_unlock(&csk->callback_lock);
+	csk->copied_seq += read;
+	csk->cdev->sock_rx_credits(csk, read);
+	conn->rxdata_octets += read;
+
+	if (err) {
+		cxgbi_log_info("conn 0x%p rx failed err %d.\n", conn, err);
+		iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED);
+	}
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_pdu_ready);
+
+static int sgl_seek_offset(struct scatterlist *sgl, unsigned int sgcnt,
+				unsigned int offset, unsigned int *off,
+				struct scatterlist **sgp)
+{
+	int i;
+	struct scatterlist *sg;
+
+	for_each_sg(sgl, sg, sgcnt, i) {
+		if (offset < sg->length) {
+			*off = offset;
+			*sgp = sg;
+			return 0;
+		}
+		offset -= sg->length;
+	}
+	return -EFAULT;
+}
+
+static int sgl_read_to_frags(struct scatterlist *sg, unsigned int sgoffset,
+				unsigned int dlen, skb_frag_t *frags,
+				int frag_max)
+{
+	unsigned int datalen = dlen;
+	unsigned int sglen = sg->length - sgoffset;
+	struct page *page = sg_page(sg);
+	int i;
+
+	i = 0;
+	do {
+		unsigned int copy;
+
+		if (!sglen) {
+			sg = sg_next(sg);
+			if (!sg) {
+				cxgbi_log_error("sg NULL, len %u/%u.\n",
+								datalen, dlen);
+				return -EINVAL;
+			}
+			sgoffset = 0;
+			sglen = sg->length;
+			page = sg_page(sg);
+
+		}
+		copy = min(datalen, sglen);
+		if (i && page == frags[i - 1].page &&
+		    sgoffset + sg->offset ==
+			frags[i - 1].page_offset + frags[i - 1].size) {
+			frags[i - 1].size += copy;
+		} else {
+			if (i >= frag_max) {
+				cxgbi_log_error("too many pages %u, "
+						 "dlen %u.\n", frag_max, dlen);
+				return -EINVAL;
+			}
+
+			frags[i].page = page;
+			frags[i].page_offset = sg->offset + sgoffset;
+			frags[i].size = copy;
+			i++;
+		}
+		datalen -= copy;
+		sgoffset += copy;
+		sglen -= copy;
+	} while (datalen);
+
+	return i;
+}
+
+int cxgbi_conn_alloc_pdu(struct iscsi_task *task, u8 opcode)
+{
+	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	struct iscsi_conn *conn = task->conn;
+	struct iscsi_tcp_task *tcp_task = task->dd_data;
+	struct cxgbi_task_data *tdata = task->dd_data + sizeof(*tcp_task);
+	struct scsi_cmnd *sc = task->sc;
+	int headroom = SKB_TX_PDU_HEADER_LEN;
+
+	tcp_task->dd_data = tdata;
+	task->hdr = NULL;
+
+	/* write command, need to send data pdus */
+	if (cdev->skb_extra_headroom && (opcode == ISCSI_OP_SCSI_DATA_OUT ||
+	    (opcode == ISCSI_OP_SCSI_CMD &&
+	    (scsi_bidi_cmnd(sc) || sc->sc_data_direction == DMA_TO_DEVICE))))
+		headroom += min(cdev->skb_extra_headroom,
+					conn->max_xmit_dlength);
+
+	tdata->skb = alloc_skb(cdev->skb_tx_headroom + headroom, GFP_ATOMIC);
+	if (!tdata->skb)
+		return -ENOMEM;
+
+	skb_reserve(tdata->skb, cdev->skb_tx_headroom);
+	cxgbi_tx_debug("task 0x%p, opcode 0x%x, skb 0x%p.\n",
+			task, opcode, tdata->skb);
+	task->hdr = (struct iscsi_hdr *)tdata->skb->data;
+	task->hdr_max = SKB_TX_PDU_HEADER_LEN;
+
+	/* data_out uses scsi_cmd's itt */
+	if (opcode != ISCSI_OP_SCSI_DATA_OUT)
+		cxgbi_reserve_itt(task, &task->hdr->itt);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_alloc_pdu);
+
+int cxgbi_conn_init_pdu(struct iscsi_task *task, unsigned int offset,
+			      unsigned int count)
+{
+	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	struct iscsi_conn *conn = task->conn;
+	struct iscsi_tcp_task *tcp_task = task->dd_data;
+	struct cxgbi_task_data *tdata = tcp_task->dd_data;
+	struct sk_buff *skb = tdata->skb;
+	unsigned int datalen = count;
+	int i, padlen = iscsi_padding(count);
+	struct page *pg;
+
+	cxgbi_tx_debug("task 0x%p,0x%p, offset %u, count %u, skb 0x%p.\n",
+			task, task->sc, offset, count, skb);
+
+	skb_put(skb, task->hdr_len);
+	cdev->set_skb_txmode(skb, conn->hdrdgst_en,
+			     datalen ? conn->datadgst_en : 0);
+	if (!count)
+		return 0;
+
+	if (task->sc) {
+		struct scsi_data_buffer *sdb = scsi_out(task->sc);
+		struct scatterlist *sg = NULL;
+		int err;
+
+		tdata->offset = offset;
+		tdata->count = count;
+		err = sgl_seek_offset(sdb->table.sgl, sdb->table.nents,
+					tdata->offset, &tdata->sgoffset, &sg);
+		if (err < 0) {
+			cxgbi_log_warn("tpdu, sgl %u, bad offset %u/%u.\n",
+					sdb->table.nents, tdata->offset,
+					sdb->length);
+			return err;
+		}
+		err = sgl_read_to_frags(sg, tdata->sgoffset, tdata->count,
+					tdata->frags, MAX_PDU_FRAGS);
+		if (err < 0) {
+			cxgbi_log_warn("tpdu, sgl %u, bad offset %u + %u.\n",
+					sdb->table.nents, tdata->offset,
+					tdata->count);
+			return err;
+		}
+		tdata->nr_frags = err;
+
+		if (tdata->nr_frags > MAX_SKB_FRAGS ||
+		    (padlen && tdata->nr_frags == MAX_SKB_FRAGS)) {
+			char *dst = skb->data + task->hdr_len;
+			skb_frag_t *frag = tdata->frags;
+
+			/* data fits in the skb's headroom */
+			for (i = 0; i < tdata->nr_frags; i++, frag++) {
+				char *src = kmap_atomic(frag->page,
+							KM_SOFTIRQ0);
+
+				memcpy(dst, src+frag->page_offset, frag->size);
+				dst += frag->size;
+				kunmap_atomic(src, KM_SOFTIRQ0);
+			}
+			if (padlen) {
+				memset(dst, 0, padlen);
+				padlen = 0;
+			}
+			skb_put(skb, count + padlen);
+		} else {
+			/* data fit into frag_list */
+			for (i = 0; i < tdata->nr_frags; i++)
+				get_page(tdata->frags[i].page);
+
+			memcpy(skb_shinfo(skb)->frags, tdata->frags,
+				sizeof(skb_frag_t) * tdata->nr_frags);
+			skb_shinfo(skb)->nr_frags = tdata->nr_frags;
+			skb->len += count;
+			skb->data_len += count;
+			skb->truesize += count;
+		}
+
+	} else {
+		pg = virt_to_page(task->data);
+
+		get_page(pg);
+		skb_fill_page_desc(skb, 0, pg, offset_in_page(task->data),
+					count);
+		skb->len += count;
+		skb->data_len += count;
+		skb->truesize += count;
+	}
+
+	if (padlen) {
+		i = skb_shinfo(skb)->nr_frags;
+		get_page(cdev->pad_page);
+		skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
+					cdev->pad_page, 0, padlen);
+
+		skb->data_len += padlen;
+		skb->truesize += padlen;
+		skb->len += padlen;
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_init_pdu);
+
+int cxgbi_conn_xmit_pdu(struct iscsi_task *task)
+{
+	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	struct iscsi_tcp_task *tcp_task = task->dd_data;
+	struct cxgbi_task_data *tdata = tcp_task->dd_data;
+	struct sk_buff *skb = tdata->skb;
+	unsigned int datalen;
+	int err;
+
+	if (!skb)
+		return 0;
+
+	datalen = skb->data_len;
+	tdata->skb = NULL;
+	err = cdev->sock_send_pdus(cconn->cep->csk, skb);
+	if (err > 0) {
+		int pdulen = err;
+
+		cxgbi_tx_debug("task 0x%p, skb 0x%p, len %u/%u, rv %d.\n",
+				task, skb, skb->len, skb->data_len, err);
+
+		if (task->conn->hdrdgst_en)
+			pdulen += ISCSI_DIGEST_SIZE;
+
+		if (datalen && task->conn->datadgst_en)
+			pdulen += ISCSI_DIGEST_SIZE;
+
+		task->conn->txdata_octets += pdulen;
+		return 0;
+	}
+
+	if (err == -EAGAIN || err == -ENOBUFS) {
+		/* reset skb to send when we are called again */
+		tdata->skb = skb;
+		return err;
+	}
+
+	kfree_skb(skb);
+	cxgbi_tx_debug("itt 0x%x, skb 0x%p, len %u/%u, xmit err %d.\n",
+			task->itt, skb, skb->len, skb->data_len, err);
+	iscsi_conn_printk(KERN_ERR, task->conn, "xmit err %d.\n", err);
+	iscsi_conn_failure(task->conn, ISCSI_ERR_XMIT_FAILED);
+	return err;
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_xmit_pdu);
+
+int cxgbi_pdu_init(struct cxgbi_device *cdev)
+{
+	cdev->pad_page = alloc_page(GFP_KERNEL);
+	if (!cdev->pad_page)
+		return -ENOMEM;
+
+	memset(page_address(cdev->pad_page), 0, PAGE_SIZE);
+
+	if (cdev->skb_tx_headroom > (512 * MAX_SKB_FRAGS))
+		cdev->skb_extra_headroom = cdev->skb_tx_headroom;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxgbi_pdu_init);
+
+void cxgbi_pdu_cleanup(struct cxgbi_device *cdev)
+{
+	if (cdev->pad_page) {
+		__free_page(cdev->pad_page);
+		cdev->pad_page = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(cxgbi_pdu_cleanup);
+
+void cxgbi_conn_tx_open(struct cxgbi_sock *csk)
+{
+	struct iscsi_conn *conn = csk->user_data;
+
+	if (conn) {
+		cxgbi_tx_debug("cn 0x%p, cid %d.\n", csk, conn->id);
+		iscsi_conn_queue_work(conn);
+	}
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_tx_open);
+
+static int cxgbi_sock_get_port(struct cxgbi_sock *csk)
+{
+	struct cxgbi_device *cdev = csk->cdev;
+	unsigned int start;
+	int idx;
+
+	if (!cdev->pmap)
+		goto error_out;
+
+	if (csk->saddr.sin_port) {
+		cxgbi_log_error("connect, sin_port none ZERO %u\n",
+				ntohs(csk->saddr.sin_port));
+		return -EADDRINUSE;
+	}
+
+	spin_lock_bh(&cdev->pmap->lock);
+	start = idx = cdev->pmap->next;
+
+	do {
+		if (++idx >= cdev->pmap->max_connect)
+			idx = 0;
+		if (!cdev->pmap->port_csk[idx]) {
+			csk->saddr.sin_port =
+				htons(cdev->pmap->sport_base + idx);
+			cdev->pmap->next = idx;
+			cdev->pmap->port_csk[idx] = csk;
+			spin_unlock_bh(&cdev->pmap->lock);
+			cxgbi_conn_debug("reserved port %u\n",
+					cdev->pmap->sport_base + idx);
+			return 0;
+		}
+	} while (idx != start);
+	spin_unlock_bh(&cdev->pmap->lock);
+error_out:
+	return -EADDRNOTAVAIL;
+}
+
+static void cxgbi_sock_put_port(struct cxgbi_sock *csk)
+{
+	struct cxgbi_device *cdev = csk->cdev;
+
+	if (csk->saddr.sin_port) {
+		int idx = ntohs(csk->saddr.sin_port) - cdev->pmap->sport_base;
+
+		csk->saddr.sin_port = 0;
+		if (idx < 0 || idx >= cdev->pmap->max_connect)
+			return;
+
+		spin_lock_bh(&cdev->pmap->lock);
+		cdev->pmap->port_csk[idx] = NULL;
+		spin_unlock_bh(&cdev->pmap->lock);
+		cxgbi_conn_debug("released port %u\n",
+				cdev->pmap->sport_base + idx);
+	}
+}
+
+static struct cxgbi_sock *cxgbi_sock_create(struct cxgbi_device *cdev)
+{
+	struct cxgbi_sock *csk = NULL;
+
+	csk = kzalloc(sizeof(*csk), GFP_NOIO);
+	if (!csk)
+		return NULL;
+
+	if (cdev->alloc_cpl_skbs(csk) < 0)
+		goto free_csk;
+
+	cxgbi_conn_debug("alloc csk: 0x%p\n", csk);
+
+	csk->flags = 0;
+	spin_lock_init(&csk->lock);
+	kref_init(&csk->refcnt);
+	skb_queue_head_init(&csk->receive_queue);
+	skb_queue_head_init(&csk->write_queue);
+	setup_timer(&csk->retry_timer, NULL, (unsigned long)csk);
+	rwlock_init(&csk->callback_lock);
+	csk->cdev = cdev;
+	return csk;
+free_csk:
+	cxgbi_api_debug("csk alloc failed %p, baling out\n", csk);
+	kfree(csk);
+	return NULL;
+}
+
+static int cxgbi_sock_connect(struct net_device *dev, struct cxgbi_sock *csk,
+			      struct sockaddr_in *sin)
+{
+	struct rtable *rt;
+	__be32 sipv4 = 0;
+	struct net_device *dstdev;
+	struct cxgbi_hba *chba = NULL;
+	int err;
+
+	cxgbi_conn_debug("csk 0x%p, dev 0x%p\n", csk, dev);
+
+	if (sin->sin_family != AF_INET)
+		return -EAFNOSUPPORT;
+
+	csk->daddr.sin_port = sin->sin_port;
+	csk->daddr.sin_addr.s_addr = sin->sin_addr.s_addr;
+
+	dstdev = cxgbi_find_dev(dev, sin->sin_addr.s_addr);
+	if (!dstdev || !is_cxgbi_dev(dstdev, csk->cdev))
+		return -ENETUNREACH;
+
+	if (dstdev->priv_flags & IFF_802_1Q_VLAN)
+		dev = dstdev;
+
+	rt = find_route(dev, csk->saddr.sin_addr.s_addr,
+			csk->daddr.sin_addr.s_addr,
+			csk->saddr.sin_port,
+			csk->daddr.sin_port,
+			0);
+	if (rt == NULL) {
+		cxgbi_conn_debug("no route to %pI4, port %u, dev %s, "
+					"snic 0x%p\n",
+					&csk->daddr.sin_addr.s_addr,
+					ntohs(csk->daddr.sin_port),
+					dev ? dev->name : "any",
+					csk->dd_data);
+		return -ENETUNREACH;
+	}
+
+	if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
+		cxgbi_conn_debug("multi-cast route to %pI4, port %u, "
+					"dev %s, snic 0x%p\n",
+					&csk->daddr.sin_addr.s_addr,
+					ntohs(csk->daddr.sin_port),
+					dev ? dev->name : "any",
+					csk->dd_data);
+		ip_rt_put(rt);
+		return -ENETUNREACH;
+	}
+
+	if (!csk->saddr.sin_addr.s_addr)
+		csk->saddr.sin_addr.s_addr = rt->rt_src;
+
+	csk->dst = &rt->u.dst;
+
+	dev = cxgbi_find_egress_dev(csk->dst->dev, csk->cdev);
+	if (dev == NULL) {
+		cxgbi_conn_debug("csk: 0x%p, egress dev NULL\n", csk);
+		return -ENETUNREACH;
+	}
+
+	err = cxgbi_sock_get_port(csk);
+	if (err)
+		return err;
+
+	cxgbi_conn_debug("csk: 0x%p get port: %u\n",
+			csk, ntohs(csk->saddr.sin_port));
+
+	chba = cxgbi_hba_find_by_netdev(csk->dst->dev, csk->cdev);
+
+	sipv4 = cxgbi_get_iscsi_ipv4(chba);
+	if (!sipv4) {
+		cxgbi_conn_debug("csk: 0x%p, iscsi is not configured\n", csk);
+		sipv4 = csk->saddr.sin_addr.s_addr;
+		cxgbi_set_iscsi_ipv4(chba, sipv4);
+	} else
+		csk->saddr.sin_addr.s_addr = sipv4;
+
+	cxgbi_conn_debug("csk: 0x%p, %pI4:[%u], %pI4:[%u] SYN_SENT\n",
+				csk,
+				&csk->saddr.sin_addr.s_addr,
+				ntohs(csk->saddr.sin_port),
+				&csk->daddr.sin_addr.s_addr,
+				ntohs(csk->daddr.sin_port));
+
+	cxgbi_sock_set_state(csk, CTP_CONNECTING);
+
+	if (!csk->cdev->init_act_open(csk, dev))
+		return 0;
+
+	err = -ENOTSUPP;
+	cxgbi_conn_debug("csk 0x%p -> closed\n", csk);
+	cxgbi_sock_set_state(csk, CTP_CLOSED);
+	ip_rt_put(rt);
+	cxgbi_sock_put_port(csk);
+	return err;
+}
+
+void cxgbi_sock_conn_closing(struct cxgbi_sock *csk)
+{
+	struct iscsi_conn *conn = csk->user_data;
+
+	read_lock(&csk->callback_lock);
+	if (conn && csk->state != CTP_ESTABLISHED)
+		iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED);
+	read_unlock(&csk->callback_lock);
+}
+EXPORT_SYMBOL_GPL(cxgbi_sock_conn_closing);
+
+void cxgbi_sock_closed(struct cxgbi_sock *csk)
+{
+	cxgbi_conn_debug("csk 0x%p, state %u, flags 0x%lx\n",
+			csk, csk->state, csk->flags);
+
+	cxgbi_sock_put_port(csk);
+	csk->cdev->release_offload_resources(csk);
+	cxgbi_sock_set_state(csk, CTP_CLOSED);
+	cxgbi_sock_conn_closing(csk);
+}
+EXPORT_SYMBOL_GPL(cxgbi_sock_closed);
+
+static void cxgbi_sock_active_close(struct cxgbi_sock *csk)
+{
+	int data_lost;
+	int close_req = 0;
+
+	cxgbi_conn_debug("csk 0x%p, state %u, flags %lu\n",
+			csk, csk->state, csk->flags);
+	dst_confirm(csk->dst);
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	data_lost = skb_queue_len(&csk->receive_queue);
+	__skb_queue_purge(&csk->receive_queue);
+
+	switch (csk->state) {
+	case CTP_CLOSED:
+	case CTP_ACTIVE_CLOSE:
+	case CTP_CLOSE_WAIT_1:
+	case CTP_CLOSE_WAIT_2:
+	case CTP_ABORTING:
+		break;
+	case CTP_CONNECTING:
+		cxgbi_sock_set_flag(csk, CTPF_ACTIVE_CLOSE_NEEDED);
+		break;
+	case CTP_ESTABLISHED:
+		close_req = 1;
+		cxgbi_sock_set_flag(csk, CTP_ACTIVE_CLOSE);
+		break;
+	case CTP_PASSIVE_CLOSE:
+		close_req = 1;
+		cxgbi_sock_set_flag(csk, CTP_CLOSE_WAIT_2);
+		break;
+	}
+
+	if (close_req) {
+		if (data_lost)
+			csk->cdev->send_abort_req(csk);
+		else
+			csk->cdev->send_close_req(csk);
+	}
+
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+}
+
+static void cxgbi_sock_release(struct cxgbi_sock *csk)
+{
+	cxgbi_conn_debug("csk 0x%p, state %u, flags %lu\n",
+			csk, csk->state, csk->flags);
+	if (unlikely(csk->state == CTP_CONNECTING))
+		cxgbi_sock_set_state(csk, CTPF_ACTIVE_CLOSE_NEEDED);
+	else if (likely(csk->state != CTP_CLOSED))
+		cxgbi_sock_active_close(csk);
+	cxgbi_sock_put(csk);
+}
+
+static unsigned int cxgbi_sock_find_best_mtu(struct cxgbi_sock *csk,
+					     unsigned short mtu)
+{
+	int i = 0;
+
+	while (i < csk->cdev->nmtus - 1 && csk->cdev->mtus[i + 1] <= mtu)
+		++i;
+
+	return i;
+}
+
+unsigned int cxgbi_sock_select_mss(struct cxgbi_sock *csk, unsigned int pmtu)
+{
+	unsigned int idx;
+	struct dst_entry *dst = csk->dst;
+	u16 advmss = dst_metric(dst, RTAX_ADVMSS);
+
+	if (advmss > pmtu - 40)
+		advmss = pmtu - 40;
+	if (advmss < csk->cdev->mtus[0] - 40)
+		advmss = csk->cdev->mtus[0] - 40;
+	idx = cxgbi_sock_find_best_mtu(csk, advmss + 40);
+
+	return idx;
+}
+EXPORT_SYMBOL_GPL(cxgbi_sock_select_mss);
+
+static void cxgbi_release_itt(struct iscsi_task *task, itt_t hdr_itt)
+{
+	struct scsi_cmnd *sc = task->sc;
+	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_hba *chba = cconn->chba;
+	struct cxgbi_tag_format *tformat = &chba->cdev->tag_format;
+	u32 tag = ntohl((__force u32)hdr_itt);
+
+	cxgbi_tag_debug("release tag 0x%x.\n", tag);
+	if (sc && (scsi_bidi_cmnd(sc) ||
+	    sc->sc_data_direction == DMA_FROM_DEVICE) &&
+	    cxgbi_is_ddp_tag(tformat, tag))
+		chba->cdev->ddp_tag_release(chba, tag);
+}
+
+static int cxgbi_reserve_itt(struct iscsi_task *task, itt_t *hdr_itt)
+{
+	struct scsi_cmnd *sc = task->sc;
+	struct iscsi_conn *conn = task->conn;
+	struct iscsi_session *sess = conn->session;
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_hba *chba = cconn->chba;
+	struct cxgbi_tag_format *tformat = &chba->cdev->tag_format;
+	u32 sw_tag = (sess->age << cconn->task_idx_bits) | task->itt;
+	u32 tag;
+	int err = -EINVAL;
+
+	if (sc && (scsi_bidi_cmnd(sc) ||
+	    sc->sc_data_direction == DMA_FROM_DEVICE) &&
+			cxgbi_sw_tag_usable(tformat, sw_tag)) {
+		struct cxgbi_sock *csk = cconn->cep->csk;
+		struct cxgbi_gather_list *gl;
+
+		gl = chba->cdev->ddp_make_gl(scsi_in(sc)->length,
+					     scsi_in(sc)->table.sgl,
+					     scsi_in(sc)->table.nents,
+					     chba->cdev->pdev, GFP_ATOMIC);
+		if (gl) {
+			tag = sw_tag;
+			err = chba->cdev->ddp_tag_reserve(chba, csk->hwtid,
+							  tformat, &tag,
+							  gl, GFP_ATOMIC);
+			if (err < 0)
+				chba->cdev->ddp_release_gl(gl,
+							   chba->cdev->pdev);
+		}
+	}
+	if (err < 0)
+		tag = cxgbi_set_non_ddp_tag(tformat, sw_tag);
+	/*  the itt need to sent in big-endian order */
+	*hdr_itt = (__force itt_t)htonl(tag);
+
+	cxgbi_tag_debug("new sc 0x%p tag 0x%x/0x%x (itt 0x%x, age 0x%x).\n",
+			sc, tag, *hdr_itt, task->itt, sess->age);
+	return 0;
+}
+
+void cxgbi_parse_pdu_itt(struct iscsi_conn *conn, itt_t itt,
+				int *idx, int *age)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	u32 tag = ntohl((__force u32) itt);
+	u32 sw_bits;
+
+	sw_bits = cxgbi_tag_nonrsvd_bits(&cdev->tag_format, tag);
+	if (idx)
+		*idx = sw_bits & ((1 << cconn->task_idx_bits) - 1);
+	if (age)
+		*age = (sw_bits >> cconn->task_idx_bits) & ISCSI_AGE_MASK;
+
+	cxgbi_tag_debug("parse tag 0x%x/0x%x, sw 0x%x, itt 0x%x, age 0x%x.\n",
+			tag, itt, sw_bits, idx ? *idx : 0xFFFFF,
+			age ? *age : 0xFF);
+}
+EXPORT_SYMBOL_GPL(cxgbi_parse_pdu_itt);
+
+void cxgbi_cleanup_task(struct iscsi_task *task)
+{
+	struct cxgbi_task_data *tdata = task->dd_data +
+				sizeof(struct iscsi_tcp_task);
+
+	/*  never reached the xmit task callout */
+	if (tdata->skb)
+		__kfree_skb(tdata->skb);
+	memset(tdata, 0, sizeof(*tdata));
+
+	cxgbi_release_itt(task, task->hdr_itt);
+	iscsi_tcp_cleanup_task(task);
+}
+EXPORT_SYMBOL_GPL(cxgbi_cleanup_task);
+
+void cxgbi_get_conn_stats(struct iscsi_cls_conn *cls_conn,
+				struct iscsi_stats *stats)
+{
+	struct iscsi_conn *conn = cls_conn->dd_data;
+
+	stats->txdata_octets = conn->txdata_octets;
+	stats->rxdata_octets = conn->rxdata_octets;
+	stats->scsicmd_pdus = conn->scsicmd_pdus_cnt;
+	stats->dataout_pdus = conn->dataout_pdus_cnt;
+	stats->scsirsp_pdus = conn->scsirsp_pdus_cnt;
+	stats->datain_pdus = conn->datain_pdus_cnt;
+	stats->r2t_pdus = conn->r2t_pdus_cnt;
+	stats->tmfcmd_pdus = conn->tmfcmd_pdus_cnt;
+	stats->tmfrsp_pdus = conn->tmfrsp_pdus_cnt;
+	stats->digest_err = 0;
+	stats->timeout_err = 0;
+	stats->custom_length = 1;
+	strcpy(stats->custom[0].desc, "eh_abort_cnt");
+	stats->custom[0].value = conn->eh_abort_cnt;
+}
+EXPORT_SYMBOL_GPL(cxgbi_get_conn_stats);
+
+static int cxgbi_conn_max_xmit_dlength(struct iscsi_conn *conn)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	unsigned int skb_tx_headroom = cdev->skb_tx_headroom;
+	unsigned int max_def = 512 * MAX_SKB_FRAGS;
+	unsigned int max = max(max_def, skb_tx_headroom);
+
+	max = min(cconn->chba->cdev->tx_max_size, max);
+	if (conn->max_xmit_dlength)
+		conn->max_xmit_dlength = min(conn->max_xmit_dlength, max);
+	else
+		conn->max_xmit_dlength = max;
+	cxgbi_align_pdu_size(conn->max_xmit_dlength);
+	return 0;
+}
+
+static int cxgbi_conn_max_recv_dlength(struct iscsi_conn *conn)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	unsigned int max = cconn->chba->cdev->rx_max_size;
+
+	cxgbi_align_pdu_size(max);
+
+	if (conn->max_recv_dlength) {
+		if (conn->max_recv_dlength > max) {
+			cxgbi_log_error("MaxRecvDataSegmentLength %u too big."
+					" Need to be <= %u.\n",
+					conn->max_recv_dlength, max);
+			return -EINVAL;
+		}
+		conn->max_recv_dlength = min(conn->max_recv_dlength, max);
+		cxgbi_align_pdu_size(conn->max_recv_dlength);
+	} else
+		conn->max_recv_dlength = max;
+
+	return 0;
+}
+
+int cxgbi_set_conn_param(struct iscsi_cls_conn *cls_conn,
+			enum iscsi_param param, char *buf, int buflen)
+{
+	struct iscsi_conn *conn = cls_conn->dd_data;
+	struct iscsi_session *session = conn->session;
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_sock *csk = cconn->cep->csk;
+	int value, err = 0;
+
+	switch (param) {
+	case ISCSI_PARAM_HDRDGST_EN:
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err && conn->hdrdgst_en)
+			err = csk->cdev->ddp_setup_conn_digest(csk, csk->hwtid,
+							conn->hdrdgst_en,
+							conn->datadgst_en, 0);
+		break;
+	case ISCSI_PARAM_DATADGST_EN:
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err && conn->datadgst_en)
+			err = csk->cdev->ddp_setup_conn_digest(csk, csk->hwtid,
+							conn->hdrdgst_en,
+							conn->datadgst_en, 0);
+		break;
+	case ISCSI_PARAM_MAX_R2T:
+		sscanf(buf, "%d", &value);
+		if (value <= 0 || !is_power_of_2(value))
+			return -EINVAL;
+		if (session->max_r2t == value)
+			break;
+		iscsi_tcp_r2tpool_free(session);
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err && iscsi_tcp_r2tpool_alloc(session))
+			return -ENOMEM;
+	case ISCSI_PARAM_MAX_RECV_DLENGTH:
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err)
+			err = cxgbi_conn_max_recv_dlength(conn);
+		break;
+	case ISCSI_PARAM_MAX_XMIT_DLENGTH:
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err)
+			err = cxgbi_conn_max_xmit_dlength(conn);
+		break;
+	default:
+		return iscsi_set_param(cls_conn, param, buf, buflen);
+	}
+	return err;
+}
+EXPORT_SYMBOL_GPL(cxgbi_set_conn_param);
+
+int cxgbi_get_conn_param(struct iscsi_cls_conn *cls_conn,
+			enum iscsi_param param, char *buff)
+{
+	struct iscsi_conn *iconn = cls_conn->dd_data;
+	int len;
+
+	switch (param) {
+	case ISCSI_PARAM_CONN_PORT:
+		spin_lock_bh(&iconn->session->lock);
+		len = sprintf(buff, "%hu\n", iconn->portal_port);
+		spin_unlock_bh(&iconn->session->lock);
+		break;
+	case ISCSI_PARAM_CONN_ADDRESS:
+		spin_lock_bh(&iconn->session->lock);
+		len = sprintf(buff, "%s\n", iconn->portal_address);
+		spin_unlock_bh(&iconn->session->lock);
+		break;
+	default:
+		return iscsi_conn_get_param(cls_conn, param, buff);
+	}
+	return len;
+}
+EXPORT_SYMBOL_GPL(cxgbi_get_conn_param);
+
+struct iscsi_cls_conn *
+cxgbi_create_conn(struct iscsi_cls_session *cls_session, u32 cid)
+{
+	struct iscsi_cls_conn *cls_conn;
+	struct iscsi_conn *conn;
+	struct iscsi_tcp_conn *tcp_conn;
+	struct cxgbi_conn *cconn;
+
+	cls_conn = iscsi_tcp_conn_setup(cls_session, sizeof(*cconn), cid);
+	if (!cls_conn)
+		return NULL;
+
+	conn = cls_conn->dd_data;
+	tcp_conn = conn->dd_data;
+	cconn = tcp_conn->dd_data;
+	cconn->iconn = conn;
+	return cls_conn;
+}
+EXPORT_SYMBOL_GPL(cxgbi_create_conn);
+
+int cxgbi_bind_conn(struct iscsi_cls_session *cls_session,
+				struct iscsi_cls_conn *cls_conn,
+				u64 transport_eph, int is_leading)
+{
+	struct iscsi_conn *conn = cls_conn->dd_data;
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct iscsi_endpoint *ep;
+	struct cxgbi_endpoint *cep;
+	struct cxgbi_sock *csk;
+	int err;
+
+	ep = iscsi_lookup_endpoint(transport_eph);
+	if (!ep)
+		return -EINVAL;
+
+	/*  setup ddp pagesize */
+	cep = ep->dd_data;
+	csk = cep->csk;
+	err = csk->cdev->ddp_setup_conn_host_pgsz(csk, csk->hwtid, 0);
+	if (err < 0)
+		return err;
+
+	err = iscsi_conn_bind(cls_session, cls_conn, is_leading);
+	if (err)
+		return -EINVAL;
+
+	/*  calculate the tag idx bits needed for this conn based on cmds_max */
+	cconn->task_idx_bits = (__ilog2_u32(conn->session->cmds_max - 1)) + 1;
+
+	read_lock(&csk->callback_lock);
+	csk->user_data = conn;
+	cconn->chba = cep->chba;
+	cconn->cep = cep;
+	cep->cconn = cconn;
+	read_unlock(&csk->callback_lock);
+
+	cxgbi_conn_max_xmit_dlength(conn);
+	cxgbi_conn_max_recv_dlength(conn);
+
+	spin_lock_bh(&conn->session->lock);
+	sprintf(conn->portal_address, "%pI4", &csk->daddr.sin_addr.s_addr);
+	conn->portal_port = ntohs(csk->daddr.sin_port);
+	spin_unlock_bh(&conn->session->lock);
+
+	/*  init recv engine */
+	iscsi_tcp_hdr_recv_prep(tcp_conn);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxgbi_bind_conn);
+
+struct iscsi_cls_session *
+cxgbi_create_session(struct iscsi_endpoint *ep, u16 cmds_max, u16 qdepth,
+							u32 initial_cmdsn)
+{
+	struct cxgbi_endpoint *cep;
+	struct cxgbi_hba *chba;
+	struct Scsi_Host *shost;
+	struct iscsi_cls_session *cls_session;
+	struct iscsi_session *session;
+
+	if (!ep) {
+		cxgbi_log_error("missing endpoint\n");
+		return NULL;
+	}
+
+	cep = ep->dd_data;
+	chba = cep->chba;
+	shost = chba->shost;
+
+	BUG_ON(chba != iscsi_host_priv(shost));
+
+	cls_session = iscsi_session_setup(chba->cdev->itp, shost,
+					cmds_max, 0,
+					sizeof(struct iscsi_tcp_task) +
+					sizeof(struct cxgbi_task_data),
+					initial_cmdsn, ISCSI_MAX_TARGET);
+	if (!cls_session)
+		return NULL;
+
+	session = cls_session->dd_data;
+	if (iscsi_tcp_r2tpool_alloc(session))
+		goto remove_session;
+
+	return cls_session;
+
+remove_session:
+	iscsi_session_teardown(cls_session);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(cxgbi_create_session);
+
+void cxgbi_destroy_session(struct iscsi_cls_session *cls_session)
+{
+	iscsi_tcp_r2tpool_free(cls_session->dd_data);
+	iscsi_session_teardown(cls_session);
+}
+EXPORT_SYMBOL_GPL(cxgbi_destroy_session);
+
+int cxgbi_set_host_param(struct Scsi_Host *shost,
+			enum iscsi_host_param param, char *buff, int buflen)
+{
+	struct cxgbi_hba *chba = iscsi_host_priv(shost);
+
+	if (!chba->ndev) {
+		shost_printk(KERN_ERR, shost, "Could not set host param. "
+				"Netdev for host not set\n");
+		return -ENODEV;
+	}
+
+	cxgbi_api_debug("param %d, buff %s\n", param, buff);
+
+	switch (param) {
+	case ISCSI_HOST_PARAM_IPADDRESS:
+	{
+		__be32 addr = in_aton(buff);
+		cxgbi_set_iscsi_ipv4(chba, addr);
+		return 0;
+	}
+	case ISCSI_HOST_PARAM_HWADDRESS:
+	case ISCSI_HOST_PARAM_NETDEV_NAME:
+		return 0;
+	default:
+		return iscsi_host_set_param(shost, param, buff, buflen);
+	}
+}
+EXPORT_SYMBOL_GPL(cxgbi_set_host_param);
+
+int cxgbi_get_host_param(struct Scsi_Host *shost,
+			enum iscsi_host_param param, char *buff)
+{
+	struct cxgbi_hba *chba = iscsi_host_priv(shost);
+	int len = 0;
+
+	if (!chba->ndev) {
+		shost_printk(KERN_ERR, shost, "Could not set host param. "
+				"Netdev for host not set\n");
+		return -ENODEV;
+	}
+
+	cxgbi_api_debug("hba %s, param %d\n", chba->ndev->name, param);
+
+	switch (param) {
+	case ISCSI_HOST_PARAM_HWADDRESS:
+		len = sysfs_format_mac(buff, chba->ndev->dev_addr, 6);
+		break;
+	case ISCSI_HOST_PARAM_NETDEV_NAME:
+		len = sprintf(buff, "%s\n", chba->ndev->name);
+		break;
+	case ISCSI_HOST_PARAM_IPADDRESS:
+	{
+		__be32 addr;
+
+		addr = cxgbi_get_iscsi_ipv4(chba);
+		len = sprintf(buff, "%pI4", &addr);
+		break;
+	}
+	default:
+		return iscsi_host_get_param(shost, param, buff);
+	}
+
+	return len;
+}
+EXPORT_SYMBOL_GPL(cxgbi_get_host_param);
+
+struct iscsi_endpoint *cxgbi_ep_connect(struct Scsi_Host *shost,
+					struct sockaddr *dst_addr,
+					int non_blocking)
+{
+	struct iscsi_endpoint *iep;
+	struct cxgbi_endpoint *cep;
+	struct cxgbi_hba *hba = NULL;
+	struct cxgbi_sock *csk = NULL;
+	struct sockaddr_in *sin = (struct sockaddr_in *)dst_addr;
+	struct cxgbi_device *cdev;
+	int err = 0;
+
+	if (shost)
+		hba = iscsi_host_priv(shost);
+
+	cdev = cxgbi_find_cdev(hba ? hba->ndev : NULL,
+			((struct sockaddr_in *)dst_addr)->sin_addr.s_addr);
+	if (!cdev) {
+		cxgbi_log_info("ep connect no cdev\n");
+		err = -ENOSPC;
+		goto release_conn;
+	}
+
+	csk = cxgbi_sock_create(cdev);
+	if (!csk) {
+		cxgbi_log_info("ep connect OOM\n");
+		err = -ENOMEM;
+		goto release_conn;
+	}
+
+	err = cxgbi_sock_connect(hba ? hba->ndev : NULL, csk, sin);
+	if (err < 0) {
+		cxgbi_log_info("ep connect failed\n");
+		goto release_conn;
+	}
+
+	hba = cxgbi_hba_find_by_netdev(csk->dst->dev, cdev);
+	if (!hba) {
+		err = -ENOSPC;
+		cxgbi_log_info("Not going through cxgb4i device\n");
+		goto release_conn;
+	}
+
+	if (shost && hba != iscsi_host_priv(shost)) {
+		err = -ENOSPC;
+		cxgbi_log_info("Could not connect through request host %u\n",
+				shost->host_no);
+		goto release_conn;
+	}
+
+	if (cxgbi_sock_is_closing(csk)) {
+		err = -ENOSPC;
+		cxgbi_log_info("ep connect unable to connect\n");
+		goto release_conn;
+	}
+
+	iep = iscsi_create_endpoint(sizeof(*cep));
+	if (!iep) {
+		err = -ENOMEM;
+		cxgbi_log_info("iscsi alloc ep, OOM\n");
+		goto release_conn;
+	}
+
+	cep = iep->dd_data;
+	cep->csk = csk;
+	cep->chba = hba;
+	cxgbi_api_debug("iep 0x%p, cep 0x%p, csk 0x%p, hba 0x%p\n",
+			iep, cep, csk, hba);
+	return iep;
+release_conn:
+	cxgbi_api_debug("conn 0x%p failed, release\n", csk);
+	if (csk)
+		cxgbi_sock_release(csk);
+
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(cxgbi_ep_connect);
+
+int cxgbi_ep_poll(struct iscsi_endpoint *ep, int timeout_ms)
+{
+	struct cxgbi_endpoint *cep = ep->dd_data;
+	struct cxgbi_sock *csk = cep->csk;
+
+	if (!cxgbi_sock_is_established(csk))
+		return 0;
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(cxgbi_ep_poll);
+
+void cxgbi_ep_disconnect(struct iscsi_endpoint *ep)
+{
+	struct cxgbi_endpoint *cep = ep->dd_data;
+	struct cxgbi_conn *cconn = cep->cconn;
+
+	if (cconn && cconn->iconn) {
+		iscsi_suspend_tx(cconn->iconn);
+		write_lock_bh(&cep->csk->callback_lock);
+		cep->csk->user_data = NULL;
+		cconn->cep = NULL;
+		write_unlock_bh(&cep->csk->callback_lock);
+	}
+
+	cxgbi_sock_release(cep->csk);
+	iscsi_destroy_endpoint(ep);
+}
+EXPORT_SYMBOL_GPL(cxgbi_ep_disconnect);
+
+struct cxgbi_hba *cxgbi_hba_add(struct cxgbi_device *cdev,
+				unsigned int max_lun,
+				unsigned int max_id,
+				struct scsi_transport_template *stt,
+				struct scsi_host_template *sht,
+				struct net_device *dev)
+{
+	struct cxgbi_hba *chba;
+	struct Scsi_Host *shost;
+	int err;
+
+	shost = iscsi_host_alloc(sht, sizeof(*chba), 1);
+
+	if (!shost) {
+		cxgbi_log_info("cdev 0x%p, ndev 0x%p, host alloc failed\n",
+				cdev, dev);
+		return NULL;
+	}
+
+	shost->transportt = stt;
+	shost->max_lun = max_lun;
+	shost->max_id = max_id;
+	shost->max_channel = 0;
+	shost->max_cmd_len = 16;
+	chba = iscsi_host_priv(shost);
+	cxgbi_log_debug("cdev %p\n", cdev);
+	chba->cdev = cdev;
+	chba->ndev = dev;
+	chba->shost = shost;
+	pci_dev_get(cdev->pdev);
+	err = iscsi_host_add(shost, &cdev->pdev->dev);
+	if (err) {
+		cxgbi_log_info("cdev 0x%p, dev 0x%p, host add failed\n",
+				cdev, dev);
+		goto pci_dev_put;
+	}
+
+	return chba;
+pci_dev_put:
+	pci_dev_put(cdev->pdev);
+	scsi_host_put(shost);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(cxgbi_hba_add);
+
+void cxgbi_hba_remove(struct cxgbi_hba *chba)
+{
+	iscsi_host_remove(chba->shost);
+	pci_dev_put(chba->cdev->pdev);
+	iscsi_host_free(chba->shost);
+}
+EXPORT_SYMBOL_GPL(cxgbi_hba_remove);
diff --git a/drivers/scsi/cxgbi/libcxgbi.h b/drivers/scsi/cxgbi/libcxgbi.h
new file mode 100644
index 0000000..4e1aa61
--- /dev/null
+++ b/drivers/scsi/cxgbi/libcxgbi.h
@@ -0,0 +1,556 @@
+/*
+ * libcxgbi.h: Chelsio common library for T3/T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#ifndef	__LIBCXGBI_H__
+#define	__LIBCXGBI_H__
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <linux/scatterlist.h>
+#include <linux/skbuff.h>
+#include <scsi/libiscsi_tcp.h>
+
+
+#define	cxgbi_log_error(fmt...)	printk(KERN_ERR "cxgbi: ERR! " fmt)
+#define cxgbi_log_warn(fmt...)	printk(KERN_WARNING "cxgbi: WARN! " fmt)
+#define cxgbi_log_info(fmt...)	printk(KERN_INFO "cxgbi: " fmt)
+#define cxgbi_debug_log(fmt, args...) \
+	printk(KERN_INFO "cxgbi: %s - " fmt, __func__ , ## args)
+
+#ifdef	__DEBUG_CXGBI__
+#define	cxgbi_log_debug	cxgbi_debug_log
+#else
+#define cxgbi_log_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_TAG__
+#define cxgbi_tag_debug        cxgbi_log_debug
+#else
+#define cxgbi_tag_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_API__
+#define cxgbi_api_debug        cxgbi_log_debug
+#else
+#define cxgbi_api_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_CONN__
+#define cxgbi_conn_debug         cxgbi_log_debug
+#else
+#define cxgbi_conn_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_TX__
+#define cxgbi_tx_debug           cxgbi_log_debug
+#else
+#define cxgbi_tx_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_RX__
+#define cxgbi_rx_debug           cxgbi_log_debug
+#else
+#define cxgbi_rx_debug(fmt...)
+#endif
+
+/* always allocate rooms for AHS */
+#define SKB_TX_PDU_HEADER_LEN	\
+	(sizeof(struct iscsi_hdr) + ISCSI_MAX_AHS_SIZE)
+
+#define	ISCSI_PDU_NONPAYLOAD_LEN	312 /* bhs(48) + ahs(256) + digest(8)*/
+#define ULP2_MAX_PKT_SIZE		16224
+#define ULP2_MAX_PDU_PAYLOAD	\
+	(ULP2_MAX_PKT_SIZE - ISCSI_PDU_NONPAYLOAD_LEN)
+
+#define PPOD_PAGES_MAX			4
+#define PPOD_PAGES_SHIFT		2       /*  4 pages per pod */
+
+/*
+ * align pdu size to multiple of 512 for better performance
+ */
+#define cxgbi_align_pdu_size(n) do { n = (n) & (~511); } while (0)
+
+/*
+ * struct pagepod_hdr, pagepod - pagepod format
+ */
+struct pagepod_hdr {
+	unsigned int vld_tid;
+	unsigned int pgsz_tag_clr;
+	unsigned int max_offset;
+	unsigned int page_offset;
+	unsigned long long rsvd;
+};
+
+struct pagepod {
+	struct pagepod_hdr hdr;
+	unsigned long long addr[PPOD_PAGES_MAX + 1];
+};
+
+struct cxgbi_tag_format {
+	unsigned char sw_bits;
+	unsigned char rsvd_bits;
+	unsigned char rsvd_shift;
+	unsigned char filler[1];
+	unsigned int rsvd_mask;
+};
+
+struct cxgbi_gather_list {
+	unsigned int tag;
+	unsigned int length;
+	unsigned int offset;
+	unsigned int nelem;
+	struct page **pages;
+	dma_addr_t phys_addr[0];
+};
+
+/*
+ * sge_opaque_hdr -
+ * Opaque version of structure the SGE stores at skb->head of TX_DATA packets
+ * and for which we must reserve space.
+ */
+struct sge_opaque_hdr {
+	void *dev;
+	dma_addr_t addr[MAX_SKB_FRAGS + 1];
+};
+
+struct cxgbi_sock {
+	struct cxgbi_device *cdev;
+
+	unsigned long flags;
+	unsigned short rss_qid;
+	unsigned short txq_idx;
+	unsigned int hwtid;
+	unsigned int atid;
+	unsigned int tx_chan;
+	unsigned int rx_chan;
+	unsigned int mss_idx;
+	unsigned int smac_idx;
+	unsigned char port_id;
+	int wr_max_cred;
+	int wr_cred;
+	int wr_una_cred;
+	unsigned char hcrc_len;
+	unsigned char dcrc_len;
+
+	void *l2t;
+	struct sk_buff *wr_pending_head;
+	struct sk_buff *wr_pending_tail;
+	struct sk_buff *cpl_close;
+	struct sk_buff *cpl_abort_req;
+	struct sk_buff *cpl_abort_rpl;
+	struct sk_buff *skb_ulp_lhdr;
+	spinlock_t lock;
+	struct kref refcnt;
+	unsigned int state;
+	struct sockaddr_in saddr;
+	struct sockaddr_in daddr;
+	struct dst_entry *dst;
+	struct sk_buff_head receive_queue;
+	struct sk_buff_head write_queue;
+	struct timer_list retry_timer;
+	int err;
+	rwlock_t callback_lock;
+	void *user_data;
+
+	u32 rcv_nxt;
+	u32 copied_seq;
+	u32 rcv_wup;
+	u32 snd_nxt;
+	u32 snd_una;
+	u32 write_seq;
+};
+
+enum cxgbi_sock_states{
+	CTP_CONNECTING = 1,
+	CTP_ESTABLISHED,
+	CTP_ACTIVE_CLOSE,
+	CTP_PASSIVE_CLOSE,
+	CTP_CLOSE_WAIT_1,
+	CTP_CLOSE_WAIT_2,
+	CTP_ABORTING,
+	CTP_CLOSED,
+};
+
+enum cxgbi_sock_flags {
+	CTPF_ABORT_RPL_RCVD = 1,/*received one ABORT_RPL_RSS message */
+	CTPF_ABORT_REQ_RCVD,	/*received one ABORT_REQ_RSS message */
+	CTPF_ABORT_RPL_PENDING,	/* expecting an abort reply */
+	CTPF_TX_DATA_SENT,	/* already sent a TX_DATA WR */
+	CTPF_ACTIVE_CLOSE_NEEDED,	/* need to be closed */
+	CTPF_MSG_COALESCED,
+	CTPF_OFFLOAD_DOWN,		/* offload function off */
+};
+
+enum cxgbi_skcb_flags {
+	CTP_SKCBF_NEED_HDR = 1 << 0,	/* packet needs a header */
+	CTP_SKCBF_NO_APPEND = 1 << 1,	/* don't grow this skb */
+	CTP_SKCBF_COMPL = 1 << 2,	/* request WR completion */
+	CTP_SKCBF_HDR_RCVD = 1 << 3,	/* recieved header pdu */
+	CTP_SKCBF_DATA_RCVD = 1 << 4,	/*  recieved data pdu */
+	CTP_SKCBF_STATUS_RCVD = 1 << 5,	/* recieved ddp status */
+};
+
+static inline void cxgbi_sock_set_flag(struct cxgbi_sock *csk,
+					enum cxgbi_sock_flags flag)
+{
+	__set_bit(flag, &csk->flags);
+	cxgbi_conn_debug("csk 0x%p, set %d, state %u, flags 0x%lu\n",
+			csk, flag, csk->state, csk->flags);
+}
+
+static inline void cxgbi_sock_clear_flag(struct cxgbi_sock *csk,
+					enum cxgbi_sock_flags flag)
+{
+	__clear_bit(flag, &csk->flags);
+	cxgbi_conn_debug("csk 0x%p, clear %d, state %u, flags 0x%lu\n",
+			csk, flag, csk->state, csk->flags);
+}
+
+static inline int cxgbi_sock_flag(struct cxgbi_sock *csk,
+				enum cxgbi_sock_flags flag)
+{
+	if (csk == NULL)
+		return 0;
+
+	return test_bit(flag, &csk->flags);
+}
+
+static inline void cxgbi_sock_set_state(struct cxgbi_sock *csk, int state)
+{
+	csk->state = state;
+}
+
+static inline void cxgbi_sock_hold(struct cxgbi_sock *csk)
+{
+	kref_get(&csk->refcnt);
+}
+
+static inline void cxgbi_clean_sock(struct kref *kref)
+{
+	struct cxgbi_sock *csk = container_of(kref,
+						struct cxgbi_sock,
+						refcnt);
+	if (csk) {
+		cxgbi_log_debug("free csk 0x%p, state %u, flags 0x%lx\n",
+						csk, csk->state, csk->flags);
+		kfree(csk);
+	}
+}
+
+static inline void cxgbi_sock_put(struct cxgbi_sock *csk)
+{
+	if (csk)
+		kref_put(&csk->refcnt, cxgbi_clean_sock);
+}
+
+static inline unsigned int cxgbi_sock_is_closing(const struct cxgbi_sock *csk)
+{
+	return csk->state >= CTP_ACTIVE_CLOSE;
+}
+
+static inline unsigned int cxgbi_sock_is_established(
+						const struct cxgbi_sock *csk)
+{
+	return csk->state == CTP_ESTABLISHED;
+}
+
+static inline void cxgbi_sock_purge_write_queue(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(&csk->write_queue)))
+		__kfree_skb(skb);
+}
+
+static inline int cxgbi_sock_compute_wscale(int win)
+{
+	int wscale = 0;
+	while (wscale < 14 && (65535 << wscale) < win)
+		wscale++;
+	return wscale;
+}
+
+void cxgbi_sock_conn_closing(struct cxgbi_sock *);
+void cxgbi_sock_closed(struct cxgbi_sock *);
+unsigned int cxgbi_sock_select_mss(struct cxgbi_sock *, unsigned int);
+
+struct cxgbi_hba {
+	struct net_device *ndev;
+	struct Scsi_Host *shost;
+	struct cxgbi_device *cdev;
+	__be32 ipv4addr;
+	unsigned short txq_idx;
+	unsigned char port_id;
+};
+
+struct cxgbi_ports_map {
+	unsigned int max_connect;
+	unsigned short sport_base;
+	spinlock_t lock;
+	unsigned int next;
+	struct cxgbi_sock *port_csk[0];
+};
+
+struct cxgbi_device {
+	struct list_head list_head;
+	char *name;
+	struct net_device **ports;
+	struct cxgbi_hba **hbas;
+	const unsigned short *mtus;
+	unsigned char nmtus;
+	unsigned char nports;
+	struct pci_dev *pdev;
+
+	unsigned int skb_tx_headroom;
+	unsigned int skb_extra_headroom;
+	unsigned int tx_max_size;
+	unsigned int rx_max_size;
+	struct page *pad_page;
+	struct cxgbi_ports_map *pmap;
+	struct iscsi_transport *itp;
+	struct cxgbi_tag_format tag_format;
+
+	int (*ddp_tag_reserve)(struct cxgbi_hba *, unsigned int,
+				struct cxgbi_tag_format *, u32 *,
+				struct cxgbi_gather_list *, gfp_t);
+	void (*ddp_tag_release)(struct cxgbi_hba *, u32);
+	struct cxgbi_gather_list* (*ddp_make_gl)(unsigned int,
+						struct scatterlist *,
+						unsigned int,
+						struct pci_dev *,
+						gfp_t);
+	void (*ddp_release_gl)(struct cxgbi_gather_list *, struct pci_dev *);
+	int (*ddp_setup_conn_digest)(struct cxgbi_sock *,
+					unsigned int, int, int, int);
+	int (*ddp_setup_conn_host_pgsz)(struct cxgbi_sock *,
+					unsigned int, int);
+	__u16 (*get_skb_ulp_mode)(struct sk_buff *);
+	__u16 (*get_skb_flags)(struct sk_buff *);
+	__u32 (*get_skb_tcp_seq)(struct sk_buff *);
+	__u32 (*get_skb_rx_pdulen)(struct sk_buff *);
+	void (*set_skb_txmode)(struct sk_buff *, int, int);
+
+	void (*release_offload_resources)(struct cxgbi_sock *);
+	int (*sock_send_pdus)(struct cxgbi_sock *, struct sk_buff *);
+	void (*sock_rx_credits)(struct cxgbi_sock *, int);
+	void (*send_abort_req)(struct cxgbi_sock *);
+	void (*send_close_req)(struct cxgbi_sock *);
+	int (*alloc_cpl_skbs)(struct cxgbi_sock *);
+	int (*init_act_open)(struct cxgbi_sock *, struct net_device *);
+
+	unsigned long dd_data[0];
+};
+
+static inline void *cxgbi_cdev_priv(struct cxgbi_device *cdev)
+{
+	return (void *)cdev->dd_data;
+}
+
+struct cxgbi_device *cxgbi_device_register(unsigned int, unsigned int);
+void cxgbi_device_unregister(struct cxgbi_device *);
+
+struct cxgbi_conn {
+	struct cxgbi_endpoint *cep;
+	struct iscsi_conn *iconn;
+	struct cxgbi_hba *chba;
+	u32 task_idx_bits;
+};
+
+struct cxgbi_endpoint {
+	struct cxgbi_conn *cconn;
+	struct cxgbi_hba *chba;
+	struct cxgbi_sock *csk;
+};
+
+#define MAX_PDU_FRAGS	((ULP2_MAX_PDU_PAYLOAD + 512 - 1) / 512)
+struct cxgbi_task_data {
+	unsigned short nr_frags;
+	skb_frag_t frags[MAX_PDU_FRAGS];
+	struct sk_buff *skb;
+	unsigned int offset;
+	unsigned int count;
+	unsigned int sgoffset;
+};
+
+static inline int cxgbi_is_ddp_tag(struct cxgbi_tag_format *tformat, u32 tag)
+{
+	return !(tag & (1 << (tformat->rsvd_bits + tformat->rsvd_shift - 1)));
+}
+
+static inline int cxgbi_sw_tag_usable(struct cxgbi_tag_format *tformat,
+					u32 sw_tag)
+{
+	sw_tag >>= (32 - tformat->rsvd_bits);
+	return !sw_tag;
+}
+
+static inline u32 cxgbi_set_non_ddp_tag(struct cxgbi_tag_format *tformat,
+					u32 sw_tag)
+{
+	unsigned char shift = tformat->rsvd_bits + tformat->rsvd_shift - 1;
+
+	u32 mask = (1 << shift) - 1;
+
+	if (sw_tag && (sw_tag & ~mask)) {
+		u32 v1 = sw_tag & ((1 << shift) - 1);
+		u32 v2 = (sw_tag >> (shift - 1)) << shift;
+
+		return v2 | v1 | 1 << shift;
+	}
+
+	return sw_tag | 1 << shift;
+}
+
+static inline u32 cxgbi_ddp_tag_base(struct cxgbi_tag_format *tformat,
+					u32 sw_tag)
+{
+	u32 mask = (1 << tformat->rsvd_shift) - 1;
+
+	if (sw_tag && (sw_tag & ~mask)) {
+		u32 v1 = sw_tag & mask;
+		u32 v2 = sw_tag >> tformat->rsvd_shift;
+
+		v2 <<= tformat->rsvd_bits + tformat->rsvd_shift;
+
+		return v2 | v1;
+	}
+
+	return sw_tag;
+}
+
+static inline u32 cxgbi_tag_rsvd_bits(struct cxgbi_tag_format *tformat,
+					u32 tag)
+{
+	if (cxgbi_is_ddp_tag(tformat, tag))
+		return (tag >> tformat->rsvd_shift) & tformat->rsvd_mask;
+
+	return 0;
+}
+
+static inline u32 cxgbi_tag_nonrsvd_bits(struct cxgbi_tag_format *tformat,
+					u32 tag)
+{
+	unsigned char shift = tformat->rsvd_bits + tformat->rsvd_shift - 1;
+	u32 v1, v2;
+
+	if (cxgbi_is_ddp_tag(tformat, tag)) {
+		v1 = tag & ((1 << tformat->rsvd_shift) - 1);
+		v2 = (tag >> (shift + 1)) << tformat->rsvd_shift;
+	} else {
+		u32 mask = (1 << shift) - 1;
+		tag &= ~(1 << shift);
+		v1 = tag & mask;
+		v2 = (tag >> 1) & ~mask;
+	}
+	return v1 | v2;
+}
+
+static inline void *cxgbi_alloc_big_mem(unsigned int size,
+					gfp_t gfp)
+{
+	void *p = kmalloc(size, gfp);
+	if (!p)
+		p = vmalloc(size);
+	if (p)
+		memset(p, 0, size);
+	return p;
+}
+
+static inline void cxgbi_free_big_mem(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		vfree(addr);
+	else
+		kfree(addr);
+}
+
+#define RX_DDP_STATUS_IPP_SHIFT		27      /* invalid pagepod */
+#define RX_DDP_STATUS_TID_SHIFT		26      /* tid mismatch */
+#define RX_DDP_STATUS_COLOR_SHIFT	25      /* color mismatch */
+#define RX_DDP_STATUS_OFFSET_SHIFT	24      /* offset mismatch */
+#define RX_DDP_STATUS_ULIMIT_SHIFT	23      /* ulimit error */
+#define RX_DDP_STATUS_TAG_SHIFT		22      /* tag mismatch */
+#define RX_DDP_STATUS_DCRC_SHIFT	21      /* dcrc error */
+#define RX_DDP_STATUS_HCRC_SHIFT	20      /* hcrc error */
+#define RX_DDP_STATUS_PAD_SHIFT		19      /* pad error */
+#define RX_DDP_STATUS_PPP_SHIFT		18      /* pagepod parity error */
+#define RX_DDP_STATUS_LLIMIT_SHIFT	17      /* llimit error */
+#define RX_DDP_STATUS_DDP_SHIFT		16      /* ddp'able */
+#define RX_DDP_STATUS_PMM_SHIFT		15      /* pagepod mismatch */
+
+
+#define ULP2_FLAG_DATA_READY		0x1
+#define ULP2_FLAG_DATA_DDPED		0x2
+#define ULP2_FLAG_HCRC_ERROR		0x4
+#define ULP2_FLAG_DCRC_ERROR		0x8
+#define ULP2_FLAG_PAD_ERROR		0x10
+
+struct cxgbi_hba *cxgbi_hba_add(struct cxgbi_device *,
+				unsigned int, unsigned int,
+				struct scsi_transport_template *,
+				struct scsi_host_template *,
+				struct net_device *);
+void cxgbi_hba_remove(struct cxgbi_hba *);
+
+
+void cxgbi_parse_pdu_itt(struct iscsi_conn *conn, itt_t itt,
+				int *idx, int *age);
+void cxgbi_cleanup_task(struct iscsi_task *task);
+
+void cxgbi_conn_pdu_ready(struct cxgbi_sock *);
+void cxgbi_conn_tx_open(struct cxgbi_sock *);
+int cxgbi_conn_init_pdu(struct iscsi_task *, unsigned int , unsigned int);
+int cxgbi_conn_alloc_pdu(struct iscsi_task *, u8);
+int cxgbi_conn_xmit_pdu(struct iscsi_task *);
+void cxgbi_get_conn_stats(struct iscsi_cls_conn *, struct iscsi_stats *);
+int cxgbi_set_conn_param(struct iscsi_cls_conn *,
+			enum iscsi_param, char *, int);
+int cxgbi_get_conn_param(struct iscsi_cls_conn *, enum iscsi_param, char *);
+struct iscsi_cls_conn *cxgbi_create_conn(struct iscsi_cls_session *, u32);
+int cxgbi_bind_conn(struct iscsi_cls_session *,
+			struct iscsi_cls_conn *, u64, int);
+void cxgbi_destroy_session(struct iscsi_cls_session *);
+struct iscsi_cls_session *cxgbi_create_session(struct iscsi_endpoint *,
+						u16, u16, u32);
+int cxgbi_set_host_param(struct Scsi_Host *,
+				enum iscsi_host_param, char *, int);
+int cxgbi_get_host_param(struct Scsi_Host *, enum iscsi_host_param, char *);
+
+
+int cxgbi_pdu_init(struct cxgbi_device *);
+void cxgbi_pdu_cleanup(struct cxgbi_device *);
+
+struct iscsi_endpoint *cxgbi_ep_connect(struct Scsi_Host *,
+					struct sockaddr *, int);
+int cxgbi_ep_poll(struct iscsi_endpoint *, int);
+void cxgbi_ep_disconnect(struct iscsi_endpoint *);
+
+static inline void cxgbi_set_iscsi_ipv4(struct cxgbi_hba *chba, __be32 ipaddr)
+{
+	chba->ipv4addr = ipaddr;
+}
+
+static inline __be32 cxgbi_get_iscsi_ipv4(struct cxgbi_hba *chba)
+{
+	return chba->ipv4addr;
+}
+
+struct cxgbi_device *cxgbi_device_alloc(unsigned int dd_size);
+void cxgbi_device_free(struct cxgbi_device *cdev);
+void cxgbi_device_add(struct list_head *list_head);
+void cxgbi_device_remove(struct cxgbi_device *cdev);
+
+#endif	/*__LIBCXGBI_H__*/
-- 
1.6.6.1

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply related

* [PATCH 3/3] cxgb4i_v4.3 : main driver files
From: Rakesh Ranjan @ 2010-06-08  4:59 UTC (permalink / raw)
  To: LK-NetDev, LK-SCSIDev, LK-iSCSIDev
  Cc: LKML, Karen Xie, David Miller, James Bottomley, Mike Christie,
	Anish Bhatt, Rakesh Ranjan
In-Reply-To: <1275973167-8640-3-git-send-email-rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

From: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>


Signed-off-by: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
 drivers/scsi/cxgbi/cxgb4i.h         |  175 +++++
 drivers/scsi/cxgbi/cxgb4i_ddp.c     |  653 ++++++++++++++++
 drivers/scsi/cxgbi/cxgb4i_init.c    |  317 ++++++++
 drivers/scsi/cxgbi/cxgb4i_offload.c | 1409 +++++++++++++++++++++++++++++++++++
 4 files changed, 2554 insertions(+), 0 deletions(-)
 create mode 100644 drivers/scsi/cxgbi/cxgb4i.h
 create mode 100644 drivers/scsi/cxgbi/cxgb4i_ddp.c
 create mode 100644 drivers/scsi/cxgbi/cxgb4i_init.c
 create mode 100644 drivers/scsi/cxgbi/cxgb4i_offload.c

diff --git a/drivers/scsi/cxgbi/cxgb4i.h b/drivers/scsi/cxgbi/cxgb4i.h
new file mode 100644
index 0000000..41b0a25
--- /dev/null
+++ b/drivers/scsi/cxgbi/cxgb4i.h
@@ -0,0 +1,175 @@
+/*
+ * cxgb4i.h: Chelsio T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#ifndef	__CXGB4I_H__
+#define	__CXGB4I_H__
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <linux/scatterlist.h>
+#include <linux/skbuff.h>
+#include <scsi/libiscsi_tcp.h>
+
+#include "t4fw_api.h"
+#include "t4_msg.h"
+#include "l2t.h"
+#include "cxgb4.h"
+#include "cxgb4_uld.h"
+#include "libcxgbi.h"
+
+#define	CXGB4I_MAX_CONN		16384
+#define	CXGB4I_SCSI_HOST_QDEPTH	1024
+#define	CXGB4I_MAX_TARGET	CXGB4I_MAX_CONN
+#define	CXGB4I_MAX_LUN		0x1000
+
+struct cxgb4i_snic;
+typedef int (*cxgb4i_cplhandler_func)(struct cxgb4i_snic *, struct sk_buff *);
+
+struct cxgb4i_snic {
+	struct cxgb4_lld_info lldi;
+	struct cxgb4i_ddp_info *ddp;
+	cxgb4i_cplhandler_func *handlers;
+};
+
+enum {
+	CPL_RET_BUF_DONE = 1,
+	CPL_RET_BAD_MSG = 2,
+	CPL_RET_UNKNOWN_TID = 4
+};
+
+struct cxgb4i_skb_rx_cb {
+	__u32 ddigest;
+	__u32 pdulen;
+};
+
+struct cxgb4i_skb_tx_cb {
+	struct l2t_skb_cb l2t;
+	struct sk_buff *wr_next;
+};
+
+struct cxgb4i_skb_cb {
+	__u16 flags;
+	__u16 ulp_mode;
+	__u32 seq;
+
+	union {
+		struct cxgb4i_skb_rx_cb rx;
+		struct cxgb4i_skb_tx_cb tx;
+	};
+};
+
+#define CXGB4I_SKB_CB(skb)	((struct cxgb4i_skb_cb *)&((skb)->cb[0]))
+#define cxgb4i_skb_flags(skb)	(CXGB4I_SKB_CB(skb)->flags)
+#define cxgb4i_skb_ulp_mode(skb)	(CXGB4I_SKB_CB(skb)->ulp_mode)
+#define cxgb4i_skb_tcp_seq(skb)		(CXGB4I_SKB_CB(skb)->seq)
+#define cxgb4i_skb_rx_ddigest(skb)	(CXGB4I_SKB_CB(skb)->rx.ddigest)
+#define cxgb4i_skb_rx_pdulen(skb)	(CXGB4I_SKB_CB(skb)->rx.pdulen)
+#define cxgb4i_skb_tx_wr_next(skb)	(CXGB4I_SKB_CB(skb)->tx.wr_next)
+
+/* for TX: a skb must have a headroom of at least TX_HEADER_LEN bytes */
+#define CXGB4I_TX_HEADER_LEN \
+	(sizeof(struct fw_ofld_tx_data_wr) + sizeof(struct sge_opaque_hdr))
+
+int cxgb4i_ofld_init(struct cxgbi_device *);
+void cxgb4i_ofld_cleanup(struct cxgbi_device *);
+
+struct cxgb4i_ddp_info {
+	struct kref refcnt;
+	struct cxgb4i_snic *snic;
+	struct pci_dev *pdev;
+	unsigned int max_txsz;
+	unsigned int max_rxsz;
+	unsigned int llimit;
+	unsigned int ulimit;
+	unsigned int nppods;
+	unsigned int idx_last;
+	unsigned char idx_bits;
+	unsigned char filler[3];
+	unsigned int idx_mask;
+	unsigned int rsvd_tag_mask;
+	spinlock_t map_lock;
+	struct cxgbi_gather_list **gl_map;
+};
+
+struct cpl_rx_data_ddp {
+	union opcode_tid ot;
+	__be16 urg;
+	__be16 len;
+	__be32 seq;
+	union {
+		__be32 nxt_seq;
+		__be32 ddp_report;
+	};
+	__be32 ulp_crc;
+	__be32 ddpvld;
+};
+
+#define PPOD_SIZE               sizeof(struct pagepod)  /*  64 */
+#define PPOD_SIZE_SHIFT         6
+
+#define ULPMEM_DSGL_MAX_NPPODS	16	/*  1024/PPOD_SIZE */
+#define ULPMEM_IDATA_MAX_NPPODS	4	/*  256/PPOD_SIZE */
+#define PCIE_MEMWIN_MAX_NPPODS	16	/*  1024/PPOD_SIZE */
+
+#define PPOD_COLOR_SHIFT	0
+#define PPOD_COLOR_MASK		0x3F
+#define PPOD_COLOR_SIZE         6
+#define PPOD_COLOR(x)		((x) << PPOD_COLOR_SHIFT)
+
+#define PPOD_TAG_SHIFT	6
+#define PPOD_TAG_MASK	0xFFFFFF
+#define PPOD_TAG(x)	((x) << PPOD_TAG_SHIFT)
+
+#define PPOD_PGSZ_SHIFT	30
+#define PPOD_PGSZ_MASK	0x3
+#define PPOD_PGSZ(x)	((x) << PPOD_PGSZ_SHIFT)
+
+#define PPOD_TID_SHIFT	32
+#define PPOD_TID_MASK	0xFFFFFF
+#define PPOD_TID(x)	((__u64)(x) << PPOD_TID_SHIFT)
+
+#define PPOD_VALID_SHIFT	56
+#define PPOD_VALID(x)	((__u64)(x) << PPOD_VALID_SHIFT)
+#define PPOD_VALID_FLAG	PPOD_VALID(1ULL)
+
+#define PPOD_LEN_SHIFT	32
+#define PPOD_LEN_MASK	0xFFFFFFFF
+#define PPOD_LEN(x)	((__u64)(x) << PPOD_LEN_SHIFT)
+
+#define PPOD_OFST_SHIFT	0
+#define PPOD_OFST_MASK	0xFFFFFFFF
+#define PPOD_OFST(x)	((x) << PPOD_OFST_SHIFT)
+
+#define PPOD_IDX_SHIFT          PPOD_COLOR_SIZE
+#define PPOD_IDX_MAX_SIZE       24
+
+#define W_TCB_ULP_TYPE          0
+#define TCB_ULP_TYPE_SHIFT      0
+#define TCB_ULP_TYPE_MASK       0xfULL
+#define TCB_ULP_TYPE(x)         ((x) << TCB_ULP_TYPE_SHIFT)
+
+#define W_TCB_ULP_RAW           0
+#define TCB_ULP_RAW_SHIFT       4
+#define TCB_ULP_RAW_MASK        0xffULL
+#define TCB_ULP_RAW(x)          ((x) << TCB_ULP_RAW_SHIFT)
+
+int cxgb4i_ddp_init(struct cxgbi_device *);
+void cxgb4i_ddp_cleanup(struct cxgbi_device *);
+
+#endif	/* __CXGB4I_H__ */
+
diff --git a/drivers/scsi/cxgbi/cxgb4i_ddp.c b/drivers/scsi/cxgbi/cxgb4i_ddp.c
new file mode 100644
index 0000000..24debcf
--- /dev/null
+++ b/drivers/scsi/cxgbi/cxgb4i_ddp.c
@@ -0,0 +1,653 @@
+/*
+ * cxgb4i_ddp.c: Chelsio T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#include <linux/skbuff.h>
+#include <linux/scatterlist.h>
+
+#include "libcxgbi.h"
+#include "cxgb4i.h"
+
+#define DDP_PGIDX_MAX	4
+#define DDP_THRESHOLD	2048
+
+static unsigned char ddp_page_order[DDP_PGIDX_MAX] = {0, 1, 2, 4};
+static unsigned char ddp_page_shift[DDP_PGIDX_MAX] = {12, 13, 14, 16};
+static unsigned char page_idx = DDP_PGIDX_MAX;
+static unsigned char sw_tag_idx_bits;
+static unsigned char sw_tag_age_bits;
+
+static inline void cxgb4i_ddp_ppod_set(struct pagepod *ppod,
+				       struct pagepod_hdr *hdr,
+				       struct cxgbi_gather_list *gl,
+				       unsigned int pidx)
+{
+	int i;
+
+	memcpy(ppod, hdr, sizeof(*hdr));
+	for (i = 0; i < (PPOD_PAGES_MAX + 1); i++, pidx++) {
+		ppod->addr[i] = pidx < gl->nelem ?
+			cpu_to_be64(gl->phys_addr[pidx]) : 0ULL;
+	}
+}
+
+static inline void cxgb4i_ddp_ppod_clear(struct pagepod *ppod)
+{
+	memset(ppod, 0, sizeof(*ppod));
+}
+
+static inline void cxgb4i_ddp_ulp_mem_io_set_hdr(struct ulp_mem_io *req,
+						unsigned int wr_len,
+						unsigned int dlen,
+						unsigned int pm_addr)
+{
+	struct ulptx_sgl *sgl;
+
+	INIT_ULPTX_WR(req, wr_len, 0, 0);
+	req->cmd = htonl(ULPTX_CMD(ULP_TX_MEM_WRITE));
+	req->dlen = htonl(ULP_MEMIO_DATA_LEN(dlen >> 5));
+	req->len16 = htonl(DIV_ROUND_UP(wr_len - sizeof(req->wr), 16));
+	req->lock_addr = htonl(ULP_MEMIO_ADDR(pm_addr >> 5));
+	sgl = (struct ulptx_sgl *)(req + 1);
+	sgl->cmd_nsge = htonl(ULPTX_CMD(ULP_TX_SC_DSGL) | ULPTX_NSGE(1));
+	sgl->len0 = htonl(dlen);
+}
+
+static int cxgb4i_ddp_ppod_write_sgl(struct cxgbi_hba *chba,
+				     struct cxgb4i_ddp_info *ddp,
+				     struct pagepod_hdr *hdr,
+				     unsigned int idx,
+				     unsigned int npods,
+				     struct cxgbi_gather_list *gl,
+				     unsigned int gl_pidx)
+{
+	unsigned int dlen, pm_addr, wr_len;
+	struct sk_buff *skb;
+	struct ulp_mem_io *req;
+	struct ulptx_sgl *sgl;
+	struct pagepod *ppod;
+	unsigned int i;
+
+	dlen = PPOD_SIZE * npods;
+	pm_addr = idx * PPOD_SIZE + ddp->llimit;
+	wr_len = roundup(sizeof(struct ulp_mem_io) +
+			sizeof(struct ulptx_sgl), 16);
+
+	skb = alloc_skb(wr_len + dlen, GFP_ATOMIC);
+	if (!skb) {
+		cxgbi_log_error("snic 0x%p, idx %u, npods %u, OOM\n",
+				ddp->snic, idx, npods);
+		return -ENOMEM;
+	}
+
+	memset(skb->data, 0, wr_len + dlen);
+	set_wr_txq(skb, CPL_PRIORITY_CONTROL, chba->txq_idx);
+	req = (struct ulp_mem_io *)__skb_put(skb, wr_len);
+	cxgb4i_ddp_ulp_mem_io_set_hdr(req, wr_len, dlen, pm_addr);
+	sgl = (struct ulptx_sgl *)(req + 1);
+	ppod = (struct pagepod *)(sgl + 1);
+	sgl->addr0 = cpu_to_be64(virt_to_phys(ppod));
+
+	for (i = 0; i < npods; i++, ppod++, gl_pidx += PPOD_PAGES_MAX) {
+		if (!hdr && !gl)
+			cxgb4i_ddp_ppod_clear(ppod);
+		else
+			cxgb4i_ddp_ppod_set(ppod, hdr, gl, gl_pidx);
+
+	}
+
+	cxgb4_ofld_send(chba->cdev->ports[chba->port_id], skb);
+	return 0;
+}
+
+static int cxgb4i_ddp_set_map(struct cxgbi_hba *chba,
+				struct cxgb4i_ddp_info *ddp,
+				struct pagepod_hdr *hdr,
+				unsigned int idx,
+				unsigned int npods,
+				struct cxgbi_gather_list *gl)
+{
+	unsigned int pidx, w_npods, cnt;
+	int err = 0;
+
+	for (w_npods = 0, pidx = 0; w_npods < npods;
+		idx += cnt, w_npods += cnt, pidx += PPOD_PAGES_MAX) {
+		cnt = npods - w_npods;
+		if (cnt > ULPMEM_DSGL_MAX_NPPODS)
+			cnt = ULPMEM_DSGL_MAX_NPPODS;
+		err = cxgb4i_ddp_ppod_write_sgl(chba, ddp, hdr, idx, cnt, gl,
+						pidx);
+		if (err < 0)
+			break;
+	}
+	return err;
+}
+
+static void cxgb4i_ddp_clear_map(struct cxgbi_hba *chba,
+				struct cxgb4i_ddp_info *ddp,
+				unsigned int tag,
+				unsigned int idx,
+				unsigned int npods)
+{
+	int err;
+	unsigned int w_npods, cnt;
+
+	for (w_npods = 0; w_npods < npods; idx += cnt, w_npods += cnt) {
+		cnt = npods - w_npods;
+
+		if (cnt > ULPMEM_DSGL_MAX_NPPODS)
+			cnt = ULPMEM_DSGL_MAX_NPPODS;
+		err = cxgb4i_ddp_ppod_write_sgl(chba, ddp, NULL, idx, cnt,
+						NULL, 0);
+		if (err < 0)
+			break;
+	}
+}
+
+static inline int cxgb4i_ddp_find_unused_entries(struct cxgb4i_ddp_info *ddp,
+					unsigned int start, unsigned int max,
+					unsigned int count,
+					struct cxgbi_gather_list *gl)
+{
+	unsigned int i, j, k;
+
+	/*  not enough entries */
+	if ((max - start) < count)
+		return -EBUSY;
+
+	max -= count;
+	spin_lock(&ddp->map_lock);
+	for (i = start; i < max;) {
+		for (j = 0, k = i; j < count; j++, k++) {
+			if (ddp->gl_map[k])
+				break;
+		}
+		if (j == count) {
+			for (j = 0, k = i; j < count; j++, k++)
+				ddp->gl_map[k] = gl;
+			spin_unlock(&ddp->map_lock);
+			return i;
+		}
+		i += j + 1;
+	}
+	spin_unlock(&ddp->map_lock);
+	return -EBUSY;
+}
+
+static inline void cxgb4i_ddp_unmark_entries(struct cxgb4i_ddp_info *ddp,
+						int start, int count)
+{
+	spin_lock(&ddp->map_lock);
+	memset(&ddp->gl_map[start], 0,
+		count * sizeof(struct cxgbi_gather_list *));
+	spin_unlock(&ddp->map_lock);
+}
+
+static int cxgb4i_ddp_find_page_index(unsigned long pgsz)
+{
+	int i;
+
+	for (i = 0; i < DDP_PGIDX_MAX; i++) {
+		if (pgsz == (1UL << ddp_page_shift[i]))
+			return i;
+	}
+	cxgbi_log_debug("ddp page size 0x%lx not supported\n", pgsz);
+	return DDP_PGIDX_MAX;
+}
+
+static int cxgb4i_ddp_adjust_page_table(void)
+{
+	int i;
+	unsigned int base_order, order;
+
+	if (PAGE_SIZE < (1UL << ddp_page_shift[0])) {
+		cxgbi_log_info("PAGE_SIZE 0x%lx too small, min 0x%lx\n",
+				PAGE_SIZE, 1UL << ddp_page_shift[0]);
+		return -EINVAL;
+	}
+
+	base_order = get_order(1UL << ddp_page_shift[0]);
+	order = get_order(1UL << PAGE_SHIFT);
+
+	for (i = 0; i < DDP_PGIDX_MAX; i++) {
+		/* first is the kernel page size,
+		 * then just doubling the size */
+		ddp_page_order[i] = order - base_order + i;
+		ddp_page_shift[i] = PAGE_SHIFT + i;
+	}
+	return 0;
+}
+
+static inline void cxgb4i_ddp_gl_unmap(struct pci_dev *pdev,
+					struct cxgbi_gather_list *gl)
+{
+	int i;
+
+	for (i = 0; i < gl->nelem; i++)
+		dma_unmap_page(&pdev->dev, gl->phys_addr[i], PAGE_SIZE,
+				PCI_DMA_FROMDEVICE);
+}
+
+static inline int cxgb4i_ddp_gl_map(struct pci_dev *pdev,
+				    struct cxgbi_gather_list *gl)
+{
+	int i;
+
+	for (i = 0; i < gl->nelem; i++) {
+		gl->phys_addr[i] = dma_map_page(&pdev->dev, gl->pages[i], 0,
+						PAGE_SIZE,
+						PCI_DMA_FROMDEVICE);
+		if (unlikely(dma_mapping_error(&pdev->dev, gl->phys_addr[i])))
+			goto unmap;
+	}
+	return i;
+unmap:
+	if (i) {
+		unsigned int nelem = gl->nelem;
+
+		gl->nelem = i;
+		cxgb4i_ddp_gl_unmap(pdev, gl);
+		gl->nelem = nelem;
+	}
+	return -ENOMEM;
+}
+
+static void cxgb4i_ddp_release_gl(struct cxgbi_gather_list *gl,
+				  struct pci_dev *pdev)
+{
+	cxgb4i_ddp_gl_unmap(pdev, gl);
+	kfree(gl);
+}
+
+static struct cxgbi_gather_list *cxgb4i_ddp_make_gl(unsigned int xferlen,
+						    struct scatterlist *sgl,
+						    unsigned int sgcnt,
+						    struct pci_dev *pdev,
+						    gfp_t gfp)
+{
+	struct cxgbi_gather_list *gl;
+	struct scatterlist *sg = sgl;
+	struct page *sgpage = sg_page(sg);
+	unsigned int sglen = sg->length;
+	unsigned int sgoffset = sg->offset;
+	unsigned int npages = (xferlen + sgoffset + PAGE_SIZE - 1) >>
+				PAGE_SHIFT;
+	int i = 1, j = 0;
+
+	if (xferlen < DDP_THRESHOLD) {
+		cxgbi_log_debug("xfer %u < threshold %u, no ddp.\n",
+				xferlen, DDP_THRESHOLD);
+		return NULL;
+	}
+
+	gl = kzalloc(sizeof(struct cxgbi_gather_list) +
+		     npages * (sizeof(dma_addr_t) +
+		     sizeof(struct page *)), gfp);
+	if (!gl)
+		return NULL;
+
+	gl->pages = (struct page **)&gl->phys_addr[npages];
+	gl->length = xferlen;
+	gl->offset = sgoffset;
+	gl->pages[0] = sgpage;
+	sg = sg_next(sg);
+
+	while (sg) {
+		struct page *page = sg_page(sg);
+
+		if (sgpage == page && sg->offset == sgoffset + sglen)
+			sglen += sg->length;
+		else {
+			/*  make sure the sgl is fit for ddp:
+			 *  each has the same page size, and
+			 *  all of the middle pages are used completely
+			 */
+			if ((j && sgoffset) || ((i != sgcnt - 1) &&
+					 ((sglen + sgoffset) & ~PAGE_MASK)))
+				goto error_out;
+
+			j++;
+			if (j == gl->nelem || sg->offset)
+				goto error_out;
+			gl->pages[j] = page;
+			sglen = sg->length;
+			sgoffset = sg->offset;
+			sgpage = page;
+		}
+		i++;
+		sg = sg_next(sg);
+	}
+	gl->nelem = ++j;
+
+	if (cxgb4i_ddp_gl_map(pdev, gl) < 0)
+		goto error_out;
+	return gl;
+error_out:
+	kfree(gl);
+	return NULL;
+}
+
+static void cxgb4i_ddp_tag_release(struct cxgbi_hba *chba, u32 tag)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(chba->cdev);
+	struct cxgb4i_ddp_info *ddp = snic->ddp;
+	u32 idx;
+
+	if (!ddp) {
+		cxgbi_log_error("release ddp tag 0x%x, ddp NULL.\n", tag);
+		return;
+	}
+
+	idx = (tag >> PPOD_IDX_SHIFT) & ddp->idx_mask;
+	if (idx < ddp->nppods) {
+		struct cxgbi_gather_list *gl = ddp->gl_map[idx];
+		unsigned int npods;
+
+		if (!gl || !gl->nelem) {
+			cxgbi_log_error("rel 0x%x, idx 0x%x, gl 0x%p, %u\n",
+					tag, idx, gl, gl ? gl->nelem : 0);
+			return;
+		}
+		npods = (gl->nelem + PPOD_PAGES_MAX - 1) >> PPOD_PAGES_SHIFT;
+		cxgbi_log_debug("ddp tag 0x%x, release idx 0x%x, npods %u.\n",
+				tag, idx, npods);
+		cxgb4i_ddp_clear_map(chba, ddp, tag, idx, npods);
+		cxgb4i_ddp_unmark_entries(ddp, idx, npods);
+		cxgb4i_ddp_release_gl(gl, ddp->pdev);
+	} else
+		cxgbi_log_error("ddp tag 0x%x, idx 0x%x > max 0x%x.\n",
+				tag, idx, ddp->nppods);
+}
+
+static int cxgb4i_ddp_tag_reserve(struct cxgbi_hba *chba,
+				  unsigned int tid,
+				  struct cxgbi_tag_format *tformat,
+				  u32 *tagp,
+				  struct cxgbi_gather_list *gl,
+				  gfp_t gfp)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(chba->cdev);
+	struct cxgb4i_ddp_info *ddp = snic->ddp;
+	struct pagepod_hdr hdr;
+	unsigned int npods;
+	int idx = -1;
+	int err = -ENOMEM;
+	u32 sw_tag = *tagp;
+	u32 tag;
+
+	if (page_idx >= DDP_PGIDX_MAX || !ddp || !gl || !gl->nelem ||
+			gl->length < DDP_THRESHOLD) {
+		cxgbi_log_debug("pgidx %u, xfer %u/%u, NO ddp.\n",
+				page_idx, gl->length, DDP_THRESHOLD);
+		return -EINVAL;
+	}
+
+	npods = (gl->nelem + PPOD_PAGES_MAX - 1) >> PPOD_PAGES_SHIFT;
+
+	if (ddp->idx_last == ddp->nppods)
+		idx = cxgb4i_ddp_find_unused_entries(ddp, 0, ddp->nppods,
+							npods, gl);
+	else {
+		idx = cxgb4i_ddp_find_unused_entries(ddp, ddp->idx_last + 1,
+							ddp->nppods, npods,
+							gl);
+		if (idx < 0 && ddp->idx_last >= npods) {
+			idx = cxgb4i_ddp_find_unused_entries(ddp, 0,
+				min(ddp->idx_last + npods, ddp->nppods),
+							npods, gl);
+		}
+	}
+	if (idx < 0) {
+		cxgbi_log_debug("xferlen %u, gl %u, npods %u NO DDP.\n",
+				gl->length, gl->nelem, npods);
+		return idx;
+	}
+
+	tag = cxgbi_ddp_tag_base(tformat, sw_tag);
+	tag |= idx << PPOD_IDX_SHIFT;
+
+	hdr.rsvd = 0;
+	hdr.vld_tid = htonl(PPOD_VALID_FLAG | PPOD_TID(tid));
+	hdr.pgsz_tag_clr = htonl(tag & ddp->rsvd_tag_mask);
+	hdr.max_offset = htonl(gl->length);
+	hdr.page_offset = htonl(gl->offset);
+
+	err = cxgb4i_ddp_set_map(chba, ddp, &hdr, idx, npods, gl);
+	if (err < 0)
+		goto unmark_entries;
+
+	ddp->idx_last = idx;
+	cxgbi_log_debug("xfer %u, gl %u,%u, tid 0x%x, 0x%x -> 0x%x(%u,%u).\n",
+			gl->length, gl->nelem, gl->offset, tid, sw_tag, tag,
+			idx, npods);
+	*tagp = tag;
+	return 0;
+unmark_entries:
+	cxgb4i_ddp_unmark_entries(ddp, idx, npods);
+	return err;
+}
+
+static int cxgb4i_ddp_setup_conn_pgidx(struct cxgbi_sock *csk,
+					unsigned int tid,
+					int pg_idx,
+					bool reply)
+{
+	struct sk_buff *skb;
+	struct cpl_set_tcb_field *req;
+	u64 val = pg_idx < DDP_PGIDX_MAX ? pg_idx : 0;
+
+	skb = alloc_skb(sizeof(*req), GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	/*  set up ulp submode and page size */
+	val = (val & 0x03) << 2;
+	val |= TCB_ULP_TYPE(ULP_MODE_ISCSI);
+	req = (struct cpl_set_tcb_field *)skb_put(skb, sizeof(*req));
+	INIT_TP_WR(req, tid);
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, csk->hwtid));
+	req->reply_ctrl = htons(NO_REPLY(reply) | QUEUENO(csk->rss_qid));
+	req->word_cookie = htons(TCB_WORD(W_TCB_ULP_RAW));
+	req->mask = cpu_to_be64(TCB_ULP_TYPE(TCB_ULP_TYPE_MASK));
+	req->val = cpu_to_be64(val);
+	set_wr_txq(skb, CPL_PRIORITY_CONTROL, csk->txq_idx);
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+	return 0;
+}
+
+static int cxgb4i_ddp_setup_conn_host_pagesize(struct cxgbi_sock *csk,
+						unsigned int tid, int reply)
+{
+	return cxgb4i_ddp_setup_conn_pgidx(csk, tid, page_idx, reply);
+}
+
+static int cxgb4i_ddp_setup_conn_digest(struct cxgbi_sock *csk,
+					unsigned int tid, int hcrc,
+					int dcrc, int reply)
+{
+	struct sk_buff *skb;
+	struct cpl_set_tcb_field *req;
+	u64 val = (hcrc ? ULP_CRC_HEADER : 0) | (dcrc ? ULP_CRC_DATA : 0);
+
+	val = TCB_ULP_RAW(val);
+	val |= TCB_ULP_TYPE(ULP_MODE_ISCSI);
+
+	skb = alloc_skb(sizeof(*req), GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	csk->hcrc_len = (hcrc ? 4 : 0);
+	csk->dcrc_len = (dcrc ? 4 : 0);
+	/*  set up ulp submode and page size */
+	req = (struct cpl_set_tcb_field *)skb_put(skb, sizeof(*req));
+	INIT_TP_WR(req, tid);
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, tid));
+	req->reply_ctrl = htons(NO_REPLY(reply) | QUEUENO(csk->rss_qid));
+	req->word_cookie = htons(TCB_WORD(W_TCB_ULP_RAW));
+	req->mask = cpu_to_be64(TCB_ULP_RAW(TCB_ULP_RAW_MASK));
+	req->val = cpu_to_be64(val);
+	set_wr_txq(skb, CPL_PRIORITY_CONTROL, csk->txq_idx);
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+	return 0;
+}
+
+static void __cxgb4i_ddp_cleanup(struct kref *kref)
+{
+	int i = 0;
+	struct cxgb4i_ddp_info *ddp = container_of(kref,
+						struct cxgb4i_ddp_info,
+						refcnt);
+
+	cxgbi_log_info("kref release ddp 0x%p, snic 0x%p\n", ddp, ddp->snic);
+	ddp->snic->ddp = NULL;
+
+	while (i < ddp->nppods) {
+		struct cxgbi_gather_list *gl = ddp->gl_map[i];
+
+		if (gl) {
+			int npods = (gl->nelem + PPOD_PAGES_MAX - 1) >>
+							PPOD_PAGES_SHIFT;
+			cxgbi_log_info("snic 0x%p, ddp %d + %d\n",
+						ddp->snic, i, npods);
+			kfree(gl);
+			i += npods;
+		} else
+			i++;
+	}
+	cxgbi_free_big_mem(ddp);
+}
+
+static int __cxgb4i_ddp_init(struct cxgbi_device *cdev)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(cdev);
+	struct cxgb4i_ddp_info *ddp = snic->ddp;
+	unsigned int ppmax, bits, tagmask, pgsz_factor[4];
+	int i;
+
+	if (ddp) {
+		kref_get(&ddp->refcnt);
+		cxgbi_log_warn("snic 0x%p, ddp 0x%p already set up\n",
+				snic, snic->ddp);
+		return -EALREADY;
+	}
+	sw_tag_idx_bits = (__ilog2_u32(ISCSI_ITT_MASK)) + 1;
+	sw_tag_age_bits = (__ilog2_u32(ISCSI_AGE_MASK)) + 1;
+	cdev->tag_format.sw_bits = sw_tag_idx_bits + sw_tag_age_bits;
+	cxgbi_log_info("tag itt 0x%x, %u bits, age 0x%x, %u bits\n",
+			ISCSI_ITT_MASK, sw_tag_idx_bits,
+			ISCSI_AGE_MASK, sw_tag_age_bits);
+	ppmax = (snic->lldi.vr->iscsi.size >> PPOD_SIZE_SHIFT);
+	bits = __ilog2_u32(ppmax) + 1;
+	if (bits > PPOD_IDX_MAX_SIZE)
+		bits = PPOD_IDX_MAX_SIZE;
+	ppmax = (1 << (bits - 1)) - 1;
+	ddp = cxgbi_alloc_big_mem(sizeof(struct cxgb4i_ddp_info) +
+			ppmax * (sizeof(struct cxgbi_gather_list *) +
+				sizeof(struct sk_buff *)),
+				GFP_KERNEL);
+	if (!ddp) {
+		cxgbi_log_warn("snic 0x%p unable to alloc ddp 0x%d, "
+			       "ddp disabled\n", snic, ppmax);
+		return -ENOMEM;
+	}
+	ddp->gl_map = (struct cxgbi_gather_list **)(ddp + 1);
+	spin_lock_init(&ddp->map_lock);
+	kref_init(&ddp->refcnt);
+	ddp->snic = snic;
+	ddp->pdev = snic->lldi.pdev;
+	ddp->max_txsz = min_t(unsigned int,
+				snic->lldi.iscsi_iolen,
+				ULP2_MAX_PKT_SIZE);
+	ddp->max_rxsz = min_t(unsigned int,
+				snic->lldi.iscsi_iolen,
+				ULP2_MAX_PKT_SIZE);
+	ddp->llimit = snic->lldi.vr->iscsi.start;
+	ddp->ulimit = ddp->llimit + snic->lldi.vr->iscsi.size - 1;
+	ddp->nppods = ppmax;
+	ddp->idx_last = ppmax;
+	ddp->idx_bits = bits;
+	ddp->idx_mask = (1 << bits) - 1;
+	ddp->rsvd_tag_mask = (1 << (bits + PPOD_IDX_SHIFT)) - 1;
+	tagmask = ddp->idx_mask << PPOD_IDX_SHIFT;
+	for (i = 0; i < DDP_PGIDX_MAX; i++)
+		pgsz_factor[i] = ddp_page_order[i];
+	cxgb4_iscsi_init(snic->lldi.ports[0], tagmask, pgsz_factor);
+	snic->ddp = ddp;
+	cdev->tag_format.rsvd_bits = ddp->idx_bits;
+	cdev->tag_format.rsvd_shift = PPOD_IDX_SHIFT;
+	cdev->tag_format.rsvd_mask =
+		((1 << cdev->tag_format.rsvd_bits) - 1);
+	cxgbi_log_info("tag format: sw %u, rsvd %u,%u, mask 0x%x.\n",
+			cdev->tag_format.sw_bits,
+			cdev->tag_format.rsvd_bits,
+			cdev->tag_format.rsvd_shift,
+			cdev->tag_format.rsvd_mask);
+	cdev->tx_max_size = min_t(unsigned int, ULP2_MAX_PDU_PAYLOAD,
+				ddp->max_txsz - ISCSI_PDU_NONPAYLOAD_LEN);
+	cdev->rx_max_size = min_t(unsigned int, ULP2_MAX_PDU_PAYLOAD,
+				ddp->max_rxsz - ISCSI_PDU_NONPAYLOAD_LEN);
+	cxgbi_log_info("max payload size: %u/%u, %u/%u.\n",
+			cdev->tx_max_size, ddp->max_txsz,
+			cdev->rx_max_size, ddp->max_rxsz);
+	cxgbi_log_info("snic 0x%p, nppods %u, bits %u, mask 0x%x,0x%x "
+			"pkt %u/%u, %u/%u\n",
+			snic, ppmax, ddp->idx_bits, ddp->idx_mask,
+			ddp->rsvd_tag_mask, ddp->max_txsz,
+			snic->lldi.iscsi_iolen,
+			ddp->max_rxsz, snic->lldi.iscsi_iolen);
+	return 0;
+}
+
+int cxgb4i_ddp_init(struct cxgbi_device *cdev)
+{
+	int rc = 0;
+
+	if (page_idx == DDP_PGIDX_MAX) {
+		page_idx = cxgb4i_ddp_find_page_index(PAGE_SIZE);
+
+		if (page_idx == DDP_PGIDX_MAX) {
+			cxgbi_log_info("system PAGE_SIZE %lu, update hw\n",
+					PAGE_SIZE);
+			rc = cxgb4i_ddp_adjust_page_table();
+			if (rc) {
+				cxgbi_log_info("PAGE_SIZE %lu, ddp disabled\n",
+						PAGE_SIZE);
+				return rc;
+			}
+			page_idx = cxgb4i_ddp_find_page_index(PAGE_SIZE);
+		}
+		cxgbi_log_info("system PAGE_SIZE %lu, ddp idx %u\n",
+				PAGE_SIZE, page_idx);
+	}
+	rc = __cxgb4i_ddp_init(cdev);
+	if (rc)
+		return rc;
+
+	cdev->ddp_make_gl = cxgb4i_ddp_make_gl;
+	cdev->ddp_release_gl = cxgb4i_ddp_release_gl;
+	cdev->ddp_tag_reserve = cxgb4i_ddp_tag_reserve;
+	cdev->ddp_tag_release = cxgb4i_ddp_tag_release;
+	cdev->ddp_setup_conn_digest = cxgb4i_ddp_setup_conn_digest;
+	cdev->ddp_setup_conn_host_pgsz = cxgb4i_ddp_setup_conn_host_pagesize;
+	return 0;
+}
+
+void cxgb4i_ddp_cleanup(struct cxgbi_device *cdev)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(cdev);
+	struct cxgb4i_ddp_info *ddp = snic->ddp;
+
+	cxgbi_log_info("snic 0x%p, release ddp 0x%p\n", cdev, ddp);
+	if (ddp)
+		kref_put(&ddp->refcnt, __cxgb4i_ddp_cleanup);
+}
+
diff --git a/drivers/scsi/cxgbi/cxgb4i_init.c b/drivers/scsi/cxgbi/cxgb4i_init.c
new file mode 100644
index 0000000..31da683
--- /dev/null
+++ b/drivers/scsi/cxgbi/cxgb4i_init.c
@@ -0,0 +1,317 @@
+/*
+ * cxgb4i_init.c: Chelsio T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#include <net/route.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi.h>
+#include <scsi/iscsi_proto.h>
+#include <scsi/libiscsi.h>
+#include <scsi/scsi_transport_iscsi.h>
+
+#include "cxgb4i.h"
+
+#define	DRV_MODULE_NAME		"cxgb4i"
+#define	DRV_MODULE_VERSION	"0.90"
+#define	DRV_MODULE_RELDATE	"05/04/2010"
+
+static char version[] =
+	"Chelsio T4 iSCSI driver " DRV_MODULE_NAME
+	" v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
+
+MODULE_AUTHOR("Chelsio Communications");
+MODULE_DESCRIPTION("Chelsio T4 iSCSI driver");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(DRV_MODULE_VERSION);
+
+#define RX_PULL_LEN	128
+
+static struct scsi_transport_template *cxgb4i_scsi_transport;
+static struct scsi_host_template cxgb4i_host_template;
+static struct iscsi_transport cxgb4i_iscsi_transport;
+static struct cxgbi_device *cxgb4i_cdev;
+
+static void *cxgb4i_uld_add(const struct cxgb4_lld_info *linfo);
+static int cxgb4i_uld_rx_handler(void *handle, const __be64 *rsp,
+				const struct pkt_gl *pgl);
+static int cxgb4i_uld_state_change(void *handle, enum cxgb4_state state);
+static int cxgb4i_iscsi_init(struct cxgbi_device *);
+static void cxgb4i_iscsi_cleanup(struct cxgbi_device *);
+
+const static struct cxgb4_uld_info cxgb4i_uld_info = {
+	.name = "cxgb4i",
+	.add = cxgb4i_uld_add,
+	.rx_handler = cxgb4i_uld_rx_handler,
+	.state_change = cxgb4i_uld_state_change,
+};
+
+static struct cxgbi_device *cxgb4i_uld_init(const struct cxgb4_lld_info *linfo)
+{
+	struct cxgbi_device *cdev;
+	struct cxgb4i_snic *snic;
+	struct port_info *pi;
+	int i, offs, rc;
+
+	cdev = cxgbi_device_register(sizeof(*snic), linfo->nports);
+	if (!cdev) {
+		cxgbi_log_debug("error cxgbi_device_alloc\n");
+		return NULL;
+	}
+	cxgb4i_cdev = cdev;
+	snic = cxgbi_cdev_priv(cxgb4i_cdev);
+	snic->lldi = *linfo;
+	cdev->ports = snic->lldi.ports;
+	cdev->nports = snic->lldi.nports;
+	cdev->pdev = snic->lldi.pdev;
+	cdev->mtus = snic->lldi.mtus;
+	cdev->nmtus = NMTUS;
+	cdev->skb_tx_headroom = SKB_MAX_HEAD(CXGB4I_TX_HEADER_LEN);
+	rc = cxgb4i_iscsi_init(cdev);
+	if (rc) {
+		cxgbi_log_info("cxgb4i_iscsi_init failed\n");
+		goto free_cdev;
+	}
+	rc = cxgbi_pdu_init(cdev);
+	if (rc) {
+		cxgbi_log_info("cxgbi_pdu_init failed\n");
+		goto clean_iscsi;
+	}
+	rc = cxgb4i_ddp_init(cdev);
+	if (rc) {
+		cxgbi_log_info("cxgb4i_ddp_init failed\n");
+		goto clean_pdu;
+	}
+	rc = cxgb4i_ofld_init(cdev);
+	if (rc) {
+		cxgbi_log_info("cxgb4i_ofld_init failed\n");
+		goto clean_ddp;
+	}
+
+	for (i = 0; i < cdev->nports; i++) {
+		cdev->hbas[i] = cxgbi_hba_add(cdev,
+						CXGB4I_MAX_LUN,
+						CXGB4I_MAX_CONN,
+						cxgb4i_scsi_transport,
+						&cxgb4i_host_template,
+						snic->lldi.ports[i]);
+		if (!cdev->hbas[i])
+			goto clean_iscsi;
+
+		pi = netdev_priv(snic->lldi.ports[i]);
+		offs = snic->lldi.ntxq / snic->lldi.nchan;
+		cdev->hbas[i]->txq_idx = pi->port_id * offs;
+		cdev->hbas[i]->port_id = pi->port_id;
+	}
+	return cdev;
+clean_iscsi:
+	cxgb4i_iscsi_cleanup(cdev);
+clean_pdu:
+	cxgbi_pdu_cleanup(cdev);
+clean_ddp:
+	cxgb4i_ddp_cleanup(cdev);
+free_cdev:
+	cxgbi_device_unregister(cdev);
+	cdev = ERR_PTR(-ENOMEM);
+	return cdev;
+}
+
+static void cxgb4i_uld_cleanup(void *handle)
+{
+	struct cxgbi_device *cdev = handle;
+	int i;
+
+	if (!cdev)
+		return;
+
+	for (i = 0; i < cdev->nports; i++) {
+		if (cdev->hbas[i]) {
+			cxgbi_hba_remove(cdev->hbas[i]);
+			cdev->hbas[i] = NULL;
+		}
+	}
+	cxgb4i_ofld_cleanup(cdev);
+	cxgb4i_ddp_cleanup(cdev);
+	cxgbi_pdu_cleanup(cdev);
+	cxgbi_log_info("snic 0x%p, %u scsi hosts removed.\n",
+			cdev, cdev->nports);
+	cxgb4i_iscsi_cleanup(cdev);
+	cxgbi_device_unregister(cdev);
+}
+
+static void *cxgb4i_uld_add(const struct cxgb4_lld_info *linfo)
+{
+	cxgbi_log_info("%s", version);
+	return cxgb4i_uld_init(linfo);
+}
+
+static int cxgb4i_uld_rx_handler(void *handle, const __be64 *rsp,
+				const struct pkt_gl *pgl)
+{
+	const struct cpl_act_establish *rpl;
+	struct sk_buff *skb;
+	unsigned int opc;
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(handle);
+
+	if (pgl == NULL) {
+		unsigned int len = 64 - sizeof(struct rsp_ctrl) - 8;
+
+		skb = alloc_skb(len, GFP_ATOMIC);
+		if (!skb)
+			goto nomem;
+		__skb_put(skb, len);
+		skb_copy_to_linear_data(skb, &rsp[1], len);
+	} else {
+		skb = cxgb4_pktgl_to_skb(pgl, RX_PULL_LEN, RX_PULL_LEN);
+		if (unlikely(!skb))
+			goto nomem;
+	}
+
+	rpl = (struct cpl_act_establish *)skb->data;
+	opc = rpl->ot.opcode;
+	cxgbi_log_debug("snic %p, opcode 0x%x, skb %p\n", snic, opc, skb);
+	BUG_ON(!snic->handlers[opc]);
+
+	if (snic->handlers[opc])
+		snic->handlers[opc](snic, skb);
+	else
+		cxgbi_log_error("No handler for opcode 0x%x\n", opc);
+	return 0;
+nomem:
+	cxgbi_api_debug("OOM bailing out\n");
+	return 1;
+}
+
+static int cxgb4i_uld_state_change(void *handle, enum cxgb4_state state)
+{
+	return 0;
+}
+
+static struct scsi_host_template cxgb4i_host_template = {
+	.module				= THIS_MODULE,
+	.name				= "Chelsio T4 iSCSI initiator",
+	.proc_name			= "cxgb4i",
+	.queuecommand			= iscsi_queuecommand,
+	.change_queue_depth		= iscsi_change_queue_depth,
+	.can_queue			= CXGB4I_SCSI_HOST_QDEPTH,
+	.sg_tablesize			= SG_ALL,
+	.max_sectors			= 0xFFFF,
+	.cmd_per_lun			= ISCSI_DEF_CMD_PER_LUN,
+	.eh_abort_handler		= iscsi_eh_abort,
+	.eh_device_reset_handler	= iscsi_eh_device_reset,
+	.eh_target_reset_handler	= iscsi_eh_recover_target,
+	.target_alloc			= iscsi_target_alloc,
+	.use_clustering			= DISABLE_CLUSTERING,
+	.this_id			= -1,
+};
+
+#define	CXGB4I_CAPS	(CAP_RECOVERY_L0 | CAP_MULTI_R2T |	\
+			CAP_HDRDGST | CAP_DATADGST |		\
+			CAP_DIGEST_OFFLOAD | CAP_PADDING_OFFLOAD)
+#define	CXGB4I_PMASK	(ISCSI_MAX_RECV_DLENGTH | ISCSI_MAX_XMIT_DLENGTH | \
+			ISCSI_HDRDGST_EN | ISCSI_DATADGST_EN | \
+			ISCSI_INITIAL_R2T_EN | ISCSI_MAX_R2T | \
+			ISCSI_IMM_DATA_EN | ISCSI_FIRST_BURST | \
+			ISCSI_MAX_BURST | ISCSI_PDU_INORDER_EN | \
+			ISCSI_DATASEQ_INORDER_EN | ISCSI_ERL | \
+			ISCSI_CONN_PORT | ISCSI_CONN_ADDRESS | \
+			ISCSI_EXP_STATSN | ISCSI_PERSISTENT_PORT | \
+			ISCSI_PERSISTENT_ADDRESS | ISCSI_TARGET_NAME | \
+			ISCSI_TPGT | ISCSI_USERNAME | \
+			ISCSI_PASSWORD | ISCSI_USERNAME_IN | \
+			ISCSI_PASSWORD_IN | ISCSI_FAST_ABORT | \
+			ISCSI_ABORT_TMO | ISCSI_LU_RESET_TMO | \
+			ISCSI_TGT_RESET_TMO | ISCSI_PING_TMO | \
+			ISCSI_RECV_TMO | ISCSI_IFACE_NAME | \
+			ISCSI_INITIATOR_NAME)
+#define	CXGB4I_HPMASK	(ISCSI_HOST_HWADDRESS | ISCSI_HOST_IPADDRESS | \
+			ISCSI_HOST_INITIATOR_NAME | ISCSI_HOST_INITIATOR_NAME)
+
+static struct iscsi_transport cxgb4i_iscsi_transport = {
+	.owner				= THIS_MODULE,
+	.name				= "cxgb4i",
+	.caps				= CXGB4I_CAPS,
+	.param_mask			= CXGB4I_PMASK,
+	.host_param_mask		= CXGB4I_HPMASK,
+	.get_host_param			= cxgbi_get_host_param,
+	.set_host_param			= cxgbi_set_host_param,
+
+	.create_session			= cxgbi_create_session,
+	.destroy_session		= cxgbi_destroy_session,
+	.get_session_param		= iscsi_session_get_param,
+
+	.create_conn			= cxgbi_create_conn,
+	.bind_conn			= cxgbi_bind_conn,
+	.destroy_conn			= iscsi_tcp_conn_teardown,
+	.start_conn			= iscsi_conn_start,
+	.stop_conn			= iscsi_conn_stop,
+	.get_conn_param			= cxgbi_get_conn_param,
+	.set_param			= cxgbi_set_conn_param,
+	.get_stats			= cxgbi_get_conn_stats,
+
+	.send_pdu			= iscsi_conn_send_pdu,
+
+	.init_task			= iscsi_tcp_task_init,
+	.xmit_task			= iscsi_tcp_task_xmit,
+	.cleanup_task			= cxgbi_cleanup_task,
+
+	.alloc_pdu			= cxgbi_conn_alloc_pdu,
+	.init_pdu			= cxgbi_conn_init_pdu,
+	.xmit_pdu			= cxgbi_conn_xmit_pdu,
+	.parse_pdu_itt			= cxgbi_parse_pdu_itt,
+
+	.ep_connect			= cxgbi_ep_connect,
+	.ep_poll			= cxgbi_ep_poll,
+	.ep_disconnect			= cxgbi_ep_disconnect,
+
+	.session_recovery_timedout	= iscsi_session_recovery_timedout,
+};
+
+static int cxgb4i_iscsi_init(struct cxgbi_device *cdev)
+{
+	cxgb4i_scsi_transport = iscsi_register_transport(
+					&cxgb4i_iscsi_transport);
+	if (!cxgb4i_scsi_transport) {
+		cxgbi_log_error("Could not register cxgb4i transport\n");
+		return -ENODATA;
+	}
+
+	cdev->itp = &cxgb4i_iscsi_transport;
+	return 0;
+}
+
+static void cxgb4i_iscsi_cleanup(struct cxgbi_device *cdev)
+{
+	if (cxgb4i_scsi_transport) {
+		cxgbi_api_debug("cxgb4i transport 0x%p removed\n",
+				cxgb4i_scsi_transport);
+		iscsi_unregister_transport(&cxgb4i_iscsi_transport);
+	}
+}
+
+
+static int __init cxgb4i_init_module(void)
+{
+	cxgb4_register_uld(CXGB4_ULD_ISCSI, &cxgb4i_uld_info);
+	return 0;
+}
+
+static void __exit cxgb4i_exit_module(void)
+{
+	cxgb4_unregister_uld(CXGB4_ULD_ISCSI);
+	cxgb4i_uld_cleanup(cxgb4i_cdev);
+}
+
+module_init(cxgb4i_init_module);
+module_exit(cxgb4i_exit_module);
diff --git a/drivers/scsi/cxgbi/cxgb4i_offload.c b/drivers/scsi/cxgbi/cxgb4i_offload.c
new file mode 100644
index 0000000..bc7296c
--- /dev/null
+++ b/drivers/scsi/cxgbi/cxgb4i_offload.c
@@ -0,0 +1,1409 @@
+/*
+ * cxgb4i_offload.c: Chelsio T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#include <linux/if_vlan.h>
+#include <net/dst.h>
+#include <net/route.h>
+#include <net/tcp.h>
+
+#include "libcxgbi.h"
+#include "cxgb4i.h"
+
+static int cxgb4i_rcv_win = 256 * 1024;
+module_param(cxgb4i_rcv_win, int, 0644);
+MODULE_PARM_DESC(cxgb4i_rcv_win, "TCP reveive window in bytes");
+
+static int cxgb4i_snd_win = 128 * 1024;
+module_param(cxgb4i_snd_win, int, 0644);
+MODULE_PARM_DESC(cxgb4i_snd_win, "TCP send window in bytes");
+
+static int cxgb4i_rx_credit_thres = 10 * 1024;
+module_param(cxgb4i_rx_credit_thres, int, 0644);
+MODULE_PARM_DESC(cxgb4i_rx_credit_thres,
+		"RX credits return threshold in bytes (default=10KB)");
+
+static unsigned int cxgb4i_max_connect = (8 * 1024);
+module_param(cxgb4i_max_connect, uint, 0644);
+MODULE_PARM_DESC(cxgb4i_max_connect, "Maximum number of connections");
+
+static unsigned short cxgb4i_sport_base = 20000;
+module_param(cxgb4i_sport_base, ushort, 0644);
+MODULE_PARM_DESC(cxgb4i_sport_base, "Starting port number (default 20000)");
+
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+#define RCV_BUFSIZ_MASK	0x3FFU
+#define MAX_IMM_TX_PKT_LEN 128
+
+static int cxgb4i_sock_push_tx_frames(struct cxgbi_sock *, int);
+
+/*
+ * is_ofld_imm - check whether a packet can be sent as immediate data
+ * @skb: the packet
+ *
+ * Returns true if a packet can be sent as an offload WR with immediate
+ * data.  We currently use the same limit as for Ethernet packets.
+ */
+static inline int is_ofld_imm(const struct sk_buff *skb)
+{
+	return skb->len <= (MAX_IMM_TX_PKT_LEN -
+			sizeof(struct fw_ofld_tx_data_wr));
+}
+
+static void cxgb4i_sock_make_act_open_req(struct cxgbi_sock *csk,
+					   struct sk_buff *skb,
+					   unsigned int qid_atid,
+					   struct l2t_entry *e)
+{
+	struct cpl_act_open_req *req;
+	unsigned long long opt0;
+	unsigned int opt2;
+	int wscale;
+
+	cxgbi_conn_debug("csk 0x%p, atid 0x%x\n", csk, qid_atid);
+
+	wscale = cxgbi_sock_compute_wscale(csk->mss_idx);
+	opt0 = KEEP_ALIVE(1) |
+		WND_SCALE(wscale) |
+		MSS_IDX(csk->mss_idx) |
+		L2T_IDX(((struct l2t_entry *)csk->l2t)->idx) |
+		TX_CHAN(csk->tx_chan) |
+		SMAC_SEL(csk->smac_idx) |
+		RCV_BUFSIZ(cxgb4i_rcv_win >> 10);
+	opt2 = RX_CHANNEL(0) |
+		RSS_QUEUE_VALID |
+		RSS_QUEUE(csk->rss_qid);
+	set_wr_txq(skb, CPL_PRIORITY_SETUP, csk->txq_idx);
+	req = (struct cpl_act_open_req *)__skb_put(skb, sizeof(*req));
+	INIT_TP_WR(req, 0);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_ACT_OPEN_REQ,
+					qid_atid));
+	req->local_port = csk->saddr.sin_port;
+	req->peer_port = csk->daddr.sin_port;
+	req->local_ip = csk->saddr.sin_addr.s_addr;
+	req->peer_ip = csk->daddr.sin_addr.s_addr;
+	req->opt0 = cpu_to_be64(opt0);
+	req->params = 0;
+	req->opt2 = cpu_to_be32(opt2);
+}
+
+static void cxgb4i_fail_act_open(struct cxgbi_sock *csk, int errno)
+{
+	cxgbi_conn_debug("csk 0%p, state %u, flag 0x%lx\n", csk,
+			csk->state, csk->flags);
+	csk->err = errno;
+	cxgbi_sock_closed(csk);
+}
+
+static void cxgb4i_act_open_req_arp_failure(void *handle, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk = (struct cxgbi_sock *)skb->sk;
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	if (csk->state == CTP_CONNECTING)
+		cxgb4i_fail_act_open(csk, -EHOSTUNREACH);
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	__kfree_skb(skb);
+}
+
+static void cxgb4i_sock_skb_entail(struct cxgbi_sock *csk,
+				   struct sk_buff *skb,
+				   int flags)
+{
+	cxgb4i_skb_tcp_seq(skb) = csk->write_seq;
+	cxgb4i_skb_flags(skb) = flags;
+	__skb_queue_tail(&csk->write_queue, skb);
+}
+
+static void cxgb4i_sock_send_close_req(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb = csk->cpl_close;
+	struct cpl_close_con_req *req = (struct cpl_close_con_req *)skb->head;
+	unsigned int tid = csk->hwtid;
+
+	csk->cpl_close = NULL;
+	set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+	INIT_TP_WR(req, tid);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, tid));
+	req->rsvd = 0;
+	cxgb4i_sock_skb_entail(csk, skb, CTP_SKCBF_NO_APPEND);
+	if (csk->state != CTP_CONNECTING)
+		cxgb4i_sock_push_tx_frames(csk, 1);
+}
+
+static void cxgb4i_sock_abort_arp_failure(void *handle, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk = (struct cxgbi_sock *)handle;
+	struct cpl_abort_req *req;
+
+	req = (struct cpl_abort_req *)skb->data;
+	req->cmd = CPL_ABORT_NO_RST;
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+}
+
+static void cxgb4i_sock_send_abort_req(struct cxgbi_sock *csk)
+{
+	struct cpl_abort_req *req;
+	struct sk_buff *skb = csk->cpl_abort_req;
+
+	if (unlikely(csk->state == CTP_ABORTING) || !skb || !csk->cdev)
+		return;
+	cxgbi_sock_set_state(csk, CTP_ABORTING);
+	cxgbi_conn_debug("csk 0x%p, flag ABORT_RPL + ABORT_SHUT\n", csk);
+	cxgbi_sock_set_state(csk, CTPF_ABORT_RPL_PENDING);
+	cxgbi_sock_purge_write_queue(csk);
+	csk->cpl_abort_req = NULL;
+	req = (struct cpl_abort_req *)skb->head;
+	set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+	t4_set_arp_err_handler(skb, csk, cxgb4i_sock_abort_arp_failure);
+	INIT_TP_WR(req, csk->hwtid);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_ABORT_REQ, csk->hwtid));
+	req->rsvd0 = htonl(csk->snd_nxt);
+	req->rsvd1 = !cxgbi_sock_flag(csk, CTPF_TX_DATA_SENT);
+	req->cmd = CPL_ABORT_SEND_RST;
+	cxgb4_l2t_send(csk->cdev->ports[csk->port_id], skb, csk->l2t);
+}
+
+static void cxgb4i_sock_send_abort_rpl(struct cxgbi_sock *csk, int rst_status)
+{
+	struct sk_buff *skb = csk->cpl_abort_rpl;
+	struct cpl_abort_rpl *rpl = (struct cpl_abort_rpl *)skb->head;
+
+	csk->cpl_abort_rpl = NULL;
+	set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+	INIT_TP_WR(rpl, csk->hwtid);
+	OPCODE_TID(rpl) = cpu_to_be32(MK_OPCODE_TID(CPL_ABORT_RPL, csk->hwtid));
+	rpl->cmd = rst_status;
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+}
+
+static u32 cxgb4i_csk_send_rx_credits(struct cxgbi_sock *csk, u32 credits)
+{
+	struct sk_buff *skb;
+	struct cpl_rx_data_ack *req;
+	int wrlen = roundup(sizeof(*req), 16);
+
+	skb = alloc_skb(wrlen, GFP_ATOMIC);
+	if (!skb)
+		return 0;
+	req = (struct cpl_rx_data_ack *)__skb_put(skb, wrlen);
+	memset(req, 0, wrlen);
+	set_wr_txq(skb, CPL_PRIORITY_ACK, csk->txq_idx);
+	INIT_TP_WR(req, csk->hwtid);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_RX_DATA_ACK,
+				      csk->hwtid));
+	req->credit_dack = cpu_to_be32(RX_CREDITS(credits) | RX_FORCE_ACK(1));
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+	return credits;
+}
+
+#define SKB_WR_LIST_SIZE	(MAX_SKB_FRAGS + 2)
+static const unsigned int cxgb4i_ulp_extra_len[] = { 0, 4, 4, 8 };
+
+static inline unsigned int ulp_extra_len(const struct sk_buff *skb)
+{
+	return cxgb4i_ulp_extra_len[cxgb4i_skb_ulp_mode(skb) & 3];
+}
+
+static inline void cxgb4i_sock_reset_wr_list(struct cxgbi_sock *csk)
+{
+	csk->wr_pending_head = csk->wr_pending_tail = NULL;
+}
+
+static inline void cxgb4i_sock_enqueue_wr(struct cxgbi_sock *csk,
+					  struct sk_buff *skb)
+{
+	cxgb4i_skb_tx_wr_next(skb) = NULL;
+	/*
+	 * We want to take an extra reference since both us and the driver
+	 * need to free the packet before it's really freed. We know there's
+	 * just one user currently so we use atomic_set rather than skb_get
+	 * to avoid the atomic op.
+	 */
+	atomic_set(&skb->users, 2);
+
+	if (!csk->wr_pending_head)
+		csk->wr_pending_head = skb;
+	else
+		cxgb4i_skb_tx_wr_next(csk->wr_pending_tail) = skb;
+	csk->wr_pending_tail = skb;
+}
+
+static int cxgb4i_sock_count_pending_wrs(const struct cxgbi_sock *csk)
+{
+	int n = 0;
+	const struct sk_buff *skb = csk->wr_pending_head;
+
+	while (skb) {
+		n += skb->csum;
+		skb = cxgb4i_skb_tx_wr_next(skb);
+	}
+	return n;
+}
+
+static inline struct sk_buff *cxgb4i_sock_peek_wr(const struct cxgbi_sock *csk)
+{
+	return csk->wr_pending_head;
+}
+
+static inline struct sk_buff *cxgb4i_sock_dequeue_wr(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb = csk->wr_pending_head;
+
+	if (likely(skb)) {
+		csk->wr_pending_head = cxgb4i_skb_tx_wr_next(skb);
+		cxgb4i_skb_tx_wr_next(skb) = NULL;
+	}
+	return skb;
+}
+
+static void cxgb4i_sock_purge_wr_queue(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb;
+
+	while ((skb = cxgb4i_sock_dequeue_wr(csk)) != NULL)
+		kfree_skb(skb);
+}
+
+/*
+ * sgl_len - calculates the size of an SGL of the given capacity
+ * @n: the number of SGL entries
+ * Calculates the number of flits needed for a scatter/gather list that
+ * can hold the given number of entries.
+ */
+static inline unsigned int sgl_len(unsigned int n)
+{
+	n--;
+	return (3 * n) / 2 + (n & 1) + 2;
+}
+
+/*
+ * calc_tx_flits_ofld - calculate # of flits for an offload packet
+ * @skb: the packet
+ *
+ * Returns the number of flits needed for the given offload packet.
+ * These packets are already fully constructed and no additional headers
+ * will be added.
+ */
+static inline unsigned int calc_tx_flits_ofld(const struct sk_buff *skb)
+{
+	unsigned int flits, cnt;
+
+	if (is_ofld_imm(skb))
+		return DIV_ROUND_UP(skb->len, 8);
+	flits = skb_transport_offset(skb) / 8;
+	cnt = skb_shinfo(skb)->nr_frags;
+	if (skb->tail != skb->transport_header)
+		cnt++;
+	return flits + sgl_len(cnt);
+}
+
+static inline void cxgb4i_sock_send_tx_flowc_wr(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb;
+	struct fw_flowc_wr *flowc;
+	int flowclen, i;
+
+	flowclen = 80;
+	skb = alloc_skb(flowclen, GFP_ATOMIC);
+	flowc = (struct fw_flowc_wr *)__skb_put(skb, flowclen);
+	flowc->op_to_nparams =
+		htonl(FW_WR_OP(FW_FLOWC_WR) | FW_FLOWC_WR_NPARAMS(8));
+	flowc->flowid_len16 =
+		htonl(FW_WR_LEN16(DIV_ROUND_UP(72, 16)) |
+				FW_WR_FLOWID(csk->hwtid));
+	flowc->mnemval[0].mnemonic = FW_FLOWC_MNEM_PFNVFN;
+	flowc->mnemval[0].val = htonl(0);
+	flowc->mnemval[1].mnemonic = FW_FLOWC_MNEM_CH;
+	flowc->mnemval[1].val = htonl(csk->tx_chan);
+	flowc->mnemval[2].mnemonic = FW_FLOWC_MNEM_PORT;
+	flowc->mnemval[2].val = htonl(csk->tx_chan);
+	flowc->mnemval[3].mnemonic = FW_FLOWC_MNEM_IQID;
+	flowc->mnemval[3].val = htonl(csk->rss_qid);
+	flowc->mnemval[4].mnemonic = FW_FLOWC_MNEM_SNDNXT;
+	flowc->mnemval[4].val = htonl(csk->snd_nxt);
+	flowc->mnemval[5].mnemonic = FW_FLOWC_MNEM_RCVNXT;
+	flowc->mnemval[5].val = htonl(csk->rcv_nxt);
+	flowc->mnemval[6].mnemonic = FW_FLOWC_MNEM_SNDBUF;
+	flowc->mnemval[6].val = htonl(cxgb4i_snd_win);
+	flowc->mnemval[7].mnemonic = FW_FLOWC_MNEM_MSS;
+	flowc->mnemval[7].val = htonl(csk->mss_idx);
+	flowc->mnemval[8].mnemonic = 0;
+	flowc->mnemval[8].val = 0;
+	for (i = 0; i < 9; i++) {
+		flowc->mnemval[i].r4[0] = 0;
+		flowc->mnemval[i].r4[1] = 0;
+		flowc->mnemval[i].r4[2] = 0;
+	}
+	set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+}
+
+static inline void cxgb4i_sock_make_tx_data_wr(struct cxgbi_sock *csk,
+						struct sk_buff *skb, int dlen,
+						int len, u32 credits,
+						int req_completion)
+{
+	struct fw_ofld_tx_data_wr *req;
+	unsigned int wr_ulp_mode;
+
+	if (is_ofld_imm(skb)) {
+			req = (struct fw_ofld_tx_data_wr *)
+				__skb_push(skb, sizeof(*req));
+			req->op_to_immdlen =
+				cpu_to_be32(FW_WR_OP(FW_OFLD_TX_DATA_WR) |
+					FW_WR_COMPL(req_completion) |
+					FW_WR_IMMDLEN(dlen));
+			req->flowid_len16 =
+				cpu_to_be32(FW_WR_FLOWID(csk->hwtid) |
+						FW_WR_LEN16(credits));
+	} else {
+		req = (struct fw_ofld_tx_data_wr *)
+			__skb_push(skb, sizeof(*req));
+		req->op_to_immdlen =
+			cpu_to_be32(FW_WR_OP(FW_OFLD_TX_DATA_WR) |
+					FW_WR_COMPL(req_completion) |
+					FW_WR_IMMDLEN(0));
+		req->flowid_len16 =
+			cpu_to_be32(FW_WR_FLOWID(csk->hwtid) |
+					FW_WR_LEN16(credits));
+	}
+	wr_ulp_mode =
+		FW_OFLD_TX_DATA_WR_ULPMODE(cxgb4i_skb_ulp_mode(skb) >> 4) |
+		FW_OFLD_TX_DATA_WR_ULPSUBMODE(cxgb4i_skb_ulp_mode(skb) & 3);
+	req->tunnel_to_proxy = cpu_to_be32(wr_ulp_mode) |
+		FW_OFLD_TX_DATA_WR_SHOVE(skb_peek(&csk->write_queue) ? 0 : 1);
+	req->plen = cpu_to_be32(len);
+	if (!cxgbi_sock_flag(csk, CTPF_TX_DATA_SENT))
+		cxgbi_sock_set_flag(csk, CTPF_TX_DATA_SENT);
+}
+
+static void cxgb4i_sock_arp_failure_discard(void *handle, struct sk_buff *skb)
+{
+	kfree_skb(skb);
+}
+
+static int cxgb4i_sock_push_tx_frames(struct cxgbi_sock *csk,
+						int req_completion)
+{
+	int total_size = 0;
+	struct sk_buff *skb;
+	struct cxgb4i_snic *snic;
+
+	if (unlikely(csk->state == CTP_CONNECTING ||
+			csk->state == CTP_CLOSE_WAIT_1 ||
+			csk->state >= CTP_ABORTING)) {
+		cxgbi_tx_debug("csk 0x%p, in closing state %u.\n",
+				csk, csk->state);
+		return 0;
+	}
+
+	snic = cxgbi_cdev_priv(csk->cdev);
+	while (csk->wr_cred && (skb = skb_peek(&csk->write_queue)) != NULL) {
+		int dlen;
+		int len;
+		unsigned int credits_needed;
+
+		dlen = len = skb->len;
+		skb_reset_transport_header(skb);
+		if (is_ofld_imm(skb))
+			credits_needed = DIV_ROUND_UP(dlen +
+					sizeof(struct fw_ofld_tx_data_wr), 16);
+		else
+			credits_needed = DIV_ROUND_UP(8 *
+					calc_tx_flits_ofld(skb)+
+					sizeof(struct fw_ofld_tx_data_wr), 16);
+		if (csk->wr_cred < credits_needed) {
+			cxgbi_tx_debug("csk 0x%p, skb len %u/%u, "
+					"wr %d < %u.\n",
+					csk, skb->len, skb->data_len,
+					credits_needed, csk->wr_cred);
+			break;
+		}
+		__skb_unlink(skb, &csk->write_queue);
+		set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+		skb->csum = credits_needed;
+		csk->wr_cred -= credits_needed;
+		csk->wr_una_cred += credits_needed;
+		cxgb4i_sock_enqueue_wr(csk, skb);
+		cxgbi_tx_debug("csk 0x%p, enqueue, skb len %u/%u, "
+				"wr %d, left %u, unack %u.\n",
+				csk, skb->len, skb->data_len,
+				credits_needed, csk->wr_cred,
+				csk->wr_una_cred);
+		if (likely(cxgb4i_skb_flags(skb) & CTP_SKCBF_NEED_HDR)) {
+			len += ulp_extra_len(skb);
+			if (!cxgbi_sock_flag(csk, CTPF_TX_DATA_SENT)) {
+				cxgb4i_sock_send_tx_flowc_wr(csk);
+				skb->csum += 5;
+				csk->wr_cred -= 5;
+				csk->wr_una_cred += 5;
+			}
+			if ((req_completion &&
+				csk->wr_una_cred == credits_needed) ||
+				(cxgb4i_skb_flags(skb) & CTP_SKCBF_COMPL) ||
+				csk->wr_una_cred >= csk->wr_max_cred >> 1) {
+				req_completion = 1;
+				csk->wr_una_cred = 0;
+			}
+			cxgb4i_sock_make_tx_data_wr(csk, skb, dlen, len,
+						    credits_needed,
+						    req_completion);
+			csk->snd_nxt += len;
+			if (req_completion)
+				cxgb4i_skb_flags(skb) &= ~CTP_SKCBF_NEED_HDR;
+		}
+		total_size += skb->truesize;
+		t4_set_arp_err_handler(skb, csk,
+					cxgb4i_sock_arp_failure_discard);
+		cxgb4_l2t_send(csk->cdev->ports[csk->port_id], skb, csk->l2t);
+	}
+	return total_size;
+}
+
+static inline void cxgb4i_sock_free_atid(struct cxgbi_sock *csk)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(csk->cdev);
+
+	cxgb4_free_atid(snic->lldi.tids, csk->atid);
+	cxgbi_sock_put(csk);
+}
+
+static void cxgb4i_sock_established(struct cxgbi_sock *csk, u32 snd_isn,
+					unsigned int opt)
+{
+	cxgbi_conn_debug("csk 0x%p, state %u.\n", csk, csk->state);
+
+	csk->write_seq = csk->snd_nxt = csk->snd_una = snd_isn;
+	/*
+	 * Causes the first RX_DATA_ACK to supply any Rx credits we couldn't
+	 * pass through opt0.
+	 */
+	if (cxgb4i_rcv_win > (RCV_BUFSIZ_MASK << 10))
+		csk->rcv_wup -= cxgb4i_rcv_win - (RCV_BUFSIZ_MASK << 10);
+	dst_confirm(csk->dst);
+	smp_mb();
+	cxgbi_sock_set_state(csk, CTP_ESTABLISHED);
+}
+
+static int cxgb4i_cpl_act_establish(struct cxgb4i_snic *snic,
+				    struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_act_establish *req = (struct cpl_act_establish *)skb->data;
+	unsigned int hwtid = GET_TID(req);
+	unsigned int atid = GET_TID_TID(ntohl(req->tos_atid));
+	struct tid_info *t = snic->lldi.tids;
+	u32 rcv_isn = be32_to_cpu(req->rcv_isn);
+
+	csk = lookup_atid(t, atid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+	cxgbi_conn_debug("csk 0x%p, state %u, flag 0x%lx\n",
+				csk, csk->state, csk->flags);
+	csk->hwtid = hwtid;
+	cxgbi_sock_hold(csk);
+	cxgb4_insert_tid(snic->lldi.tids, csk, hwtid);
+	cxgb4i_sock_free_atid(csk);
+	spin_lock_bh(&csk->lock);
+
+	if (unlikely(csk->state != CTP_CONNECTING))
+		cxgbi_log_error("TID %u expected SYN_SENT, got EST., s %u\n",
+				csk->hwtid, csk->state);
+	csk->copied_seq = csk->rcv_wup = csk->rcv_nxt = rcv_isn;
+	cxgb4i_sock_established(csk, ntohl(req->snd_isn), ntohs(req->tcp_opt));
+	__kfree_skb(skb);
+
+	if (unlikely(cxgbi_sock_flag(csk, CTPF_ACTIVE_CLOSE_NEEDED)))
+		cxgb4i_sock_send_abort_req(csk);
+	else {
+		if (skb_queue_len(&csk->write_queue))
+			cxgb4i_sock_push_tx_frames(csk, 1);
+		cxgbi_conn_tx_open(csk);
+	}
+
+	if (csk->retry_timer.function) {
+		del_timer(&csk->retry_timer);
+		csk->retry_timer.function = NULL;
+	}
+	spin_unlock_bh(&csk->lock);
+	return 0;
+}
+
+static int act_open_rpl_status_to_errno(int status)
+{
+	switch (status) {
+	case CPL_ERR_CONN_RESET:
+		return -ECONNREFUSED;
+	case CPL_ERR_ARP_MISS:
+		return -EHOSTUNREACH;
+	case CPL_ERR_CONN_TIMEDOUT:
+		return -ETIMEDOUT;
+	case CPL_ERR_TCAM_FULL:
+		return -ENOMEM;
+	case CPL_ERR_CONN_EXIST:
+		cxgbi_log_error("ACTIVE_OPEN_RPL: 4-tuple in use\n");
+		return -EADDRINUSE;
+	default:
+		return -EIO;
+	}
+}
+
+/*
+ * Return whether a failed active open has allocated a TID
+ */
+static inline int act_open_has_tid(int status)
+{
+	return status != CPL_ERR_TCAM_FULL && status != CPL_ERR_CONN_EXIST &&
+		status != CPL_ERR_ARP_MISS;
+}
+
+static void cxgb4i_sock_act_open_retry_timer(unsigned long data)
+{
+	struct sk_buff *skb;
+	struct cxgbi_sock *csk = (struct cxgbi_sock *)data;
+
+	cxgbi_conn_debug("csk 0x%p, state %u.\n", csk, csk->state);
+	spin_lock_bh(&csk->lock);
+	skb = alloc_skb(sizeof(struct cpl_act_open_req), GFP_ATOMIC);
+	if (!skb)
+		cxgb4i_fail_act_open(csk, -ENOMEM);
+	else {
+		unsigned int qid_atid  = csk->rss_qid << 14;
+		qid_atid |= (unsigned int)csk->atid;
+		skb->sk = (struct sock *)csk;
+		t4_set_arp_err_handler(skb, csk,
+					cxgb4i_act_open_req_arp_failure);
+		cxgb4i_sock_make_act_open_req(csk, skb, qid_atid, csk->l2t);
+		cxgb4_l2t_send(csk->cdev->ports[csk->port_id], skb, csk->l2t);
+	}
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+}
+
+static int cxgb4i_cpl_act_open_rpl(struct cxgb4i_snic *snic,
+				   struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_act_open_rpl *rpl = (struct cpl_act_open_rpl *)skb->data;
+	unsigned int atid =
+		GET_TID_TID(GET_AOPEN_ATID(be32_to_cpu(rpl->atid_status)));
+	struct tid_info *t = snic->lldi.tids;
+	unsigned int status = GET_AOPEN_STATUS(be32_to_cpu(rpl->atid_status));
+
+	csk = lookup_atid(t, atid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", atid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	cxgbi_conn_debug("rcv, status 0x%x, csk 0x%p, csk->state %u, "
+			"csk->flag 0x%lx, csk->atid %u.\n",
+			status, csk, csk->state, csk->flags, csk->hwtid);
+
+	if (status & act_open_has_tid(status))
+		cxgb4_remove_tid(snic->lldi.tids, csk->port_id, GET_TID(rpl));
+
+	if (status == CPL_ERR_CONN_EXIST &&
+	    csk->retry_timer.function != cxgb4i_sock_act_open_retry_timer) {
+		csk->retry_timer.function = cxgb4i_sock_act_open_retry_timer;
+		if (!mod_timer(&csk->retry_timer, jiffies + HZ / 2))
+			cxgbi_sock_hold(csk);
+	} else
+		cxgb4i_fail_act_open(csk,
+				     act_open_rpl_status_to_errno(status));
+
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	return 0;
+}
+
+static int cxgb4i_cpl_peer_close(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_peer_close *req = (struct cpl_peer_close *)skb->data;
+	unsigned int hwtid = GET_TID(req);
+	struct tid_info *t = snic->lldi.tids;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+
+	if (cxgbi_sock_flag(csk, CTPF_ABORT_RPL_PENDING))
+		goto out;
+
+	switch (csk->state) {
+	case CTP_ESTABLISHED:
+		cxgbi_sock_set_state(csk, CTP_PASSIVE_CLOSE);
+		break;
+	case CTP_ACTIVE_CLOSE:
+		cxgbi_sock_set_state(csk, CTP_CLOSE_WAIT_2);
+		break;
+	case CTP_CLOSE_WAIT_1:
+		cxgbi_sock_closed(csk);
+		break;
+	case CTP_ABORTING:
+		break;
+	default:
+		cxgbi_log_error("peer close, TID %u in bad state %u\n",
+				csk->hwtid, csk->state);
+	}
+
+	cxgbi_sock_conn_closing(csk);
+out:
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	return 0;
+}
+
+static int cxgb4i_cpl_close_con_rpl(struct cxgb4i_snic *snic,
+				    struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_close_con_rpl *rpl = (struct cpl_close_con_rpl *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	cxgbi_conn_debug("csk 0x%p, state %u, flag 0x%lx.\n",
+			csk, csk->state, csk->flags);
+	csk->snd_una = ntohl(rpl->snd_nxt) - 1;
+
+	if (cxgbi_sock_flag(csk, CTPF_ABORT_RPL_PENDING))
+		goto out;
+
+	switch (csk->state) {
+	case CTP_ACTIVE_CLOSE:
+		cxgbi_sock_set_state(csk, CTP_CLOSE_WAIT_1);
+		break;
+	case CTP_CLOSE_WAIT_1:
+	case CTP_CLOSE_WAIT_2:
+		cxgbi_sock_closed(csk);
+		break;
+	case CTP_ABORTING:
+		break;
+	default:
+		cxgbi_log_error("close_rpl, TID %u in bad state %u\n",
+				csk->hwtid, csk->state);
+	}
+out:
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	kfree_skb(skb);
+	return 0;
+}
+
+static int abort_status_to_errno(struct cxgbi_sock *csk, int abort_reason,
+								int *need_rst)
+{
+	switch (abort_reason) {
+	case CPL_ERR_BAD_SYN: /* fall through */
+	case CPL_ERR_CONN_RESET:
+		return csk->state > CTP_ESTABLISHED ?
+			-EPIPE : -ECONNRESET;
+	case CPL_ERR_XMIT_TIMEDOUT:
+	case CPL_ERR_PERSIST_TIMEDOUT:
+	case CPL_ERR_FINWAIT2_TIMEDOUT:
+	case CPL_ERR_KEEPALIVE_TIMEDOUT:
+		return -ETIMEDOUT;
+	default:
+		return -EIO;
+	}
+}
+
+/*
+ * Returns whether an ABORT_REQ_RSS message is a negative advice.
+ */
+static inline int is_neg_adv_abort(unsigned int status)
+{
+	return status == CPL_ERR_RTX_NEG_ADVICE ||
+		status == CPL_ERR_PERSIST_NEG_ADVICE;
+}
+
+static int cxgb4i_cpl_abort_req_rss(struct cxgb4i_snic *snic,
+				    struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_abort_req_rss *req = (struct cpl_abort_req_rss *)skb->data;
+	unsigned int hwtid = GET_TID(req);
+	struct tid_info *t = snic->lldi.tids;
+	int rst_status = CPL_ABORT_NO_RST;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	if (is_neg_adv_abort(req->status)) {
+		__kfree_skb(skb);
+		return 0;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+
+	if (!cxgbi_sock_flag(csk, CTPF_ABORT_REQ_RCVD)) {
+		cxgbi_sock_set_flag(csk, CTPF_ABORT_REQ_RCVD);
+		cxgbi_sock_set_state(csk, CTP_ABORTING);
+		__kfree_skb(skb);
+		goto out;
+	}
+
+	cxgbi_sock_clear_flag(csk, CTPF_ABORT_REQ_RCVD);
+	cxgb4i_sock_send_abort_rpl(csk, rst_status);
+
+	if (!cxgbi_sock_flag(csk, CTPF_ABORT_RPL_PENDING)) {
+		csk->err = abort_status_to_errno(csk, req->status,
+						 &rst_status);
+		cxgbi_sock_closed(csk);
+	}
+out:
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	return 0;
+}
+
+static int cxgb4i_cpl_abort_rpl_rss(struct cxgb4i_snic *snic,
+				    struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_abort_rpl_rss *rpl = (struct cpl_abort_rpl_rss *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+
+	if (rpl->status == CPL_ERR_ABORT_FAILED)
+		goto out;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		goto out;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+
+	if (cxgbi_sock_flag(csk, CTPF_ABORT_RPL_PENDING)) {
+		if (!cxgbi_sock_flag(csk, CTPF_ABORT_RPL_RCVD))
+			cxgbi_sock_set_flag(csk, CTPF_ABORT_RPL_RCVD);
+		else {
+			cxgbi_sock_clear_flag(csk, CTPF_ABORT_RPL_RCVD);
+			cxgbi_sock_clear_flag(csk, CTPF_ABORT_RPL_PENDING);
+
+			if (cxgbi_sock_flag(csk, CTPF_ABORT_REQ_RCVD))
+				cxgbi_log_error("tid %u, ABORT_RPL_RSS\n",
+						csk->hwtid);
+
+			cxgbi_sock_closed(csk);
+		}
+	}
+
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+out:
+	__kfree_skb(skb);
+	return 0;
+}
+
+static int cxgb4i_cpl_iscsi_hdr(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_iscsi_hdr *cpl = (struct cpl_iscsi_hdr *)skb->data;
+	unsigned int hwtid = GET_TID(cpl);
+	struct tid_info *t = snic->lldi.tids;
+	struct sk_buff *lskb;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	spin_lock_bh(&csk->lock);
+
+	if (unlikely(csk->state >= CTP_PASSIVE_CLOSE)) {
+		if (csk->state != CTP_ABORTING)
+			goto abort_conn;
+	}
+
+	cxgb4i_skb_tcp_seq(skb) = ntohl(cpl->seq);
+	skb_reset_transport_header(skb);
+	__skb_pull(skb, sizeof(*cpl));
+	__pskb_trim(skb, ntohs(cpl->len));
+
+	if (!csk->skb_ulp_lhdr) {
+		unsigned char *bhs;
+		unsigned int hlen, dlen;
+
+		csk->skb_ulp_lhdr = skb;
+		lskb = csk->skb_ulp_lhdr;
+		cxgb4i_skb_flags(lskb) = CTP_SKCBF_HDR_RCVD;
+
+		if (cxgb4i_skb_tcp_seq(lskb) != csk->rcv_nxt) {
+			cxgbi_log_error("tid 0x%x, CPL_ISCSI_HDR, bad seq got "
+					"0x%x, exp 0x%x\n",
+					csk->hwtid,
+					cxgb4i_skb_tcp_seq(lskb),
+					csk->rcv_nxt);
+			goto abort_conn;
+		}
+
+		bhs = lskb->data;
+		hlen = ntohs(cpl->len);
+		dlen = ntohl(*(unsigned int *)(bhs + 4)) & 0xFFFFFF;
+
+		if ((hlen + dlen) != ntohs(cpl->pdu_len_ddp) - 40) {
+			cxgbi_log_error("tid 0x%x, CPL_ISCSI_HDR, pdu len "
+					"mismatch %u != %u + %u, seq 0x%x\n",
+					csk->hwtid,
+					ntohs(cpl->pdu_len_ddp) - 40,
+					hlen, dlen, cxgb4i_skb_tcp_seq(skb));
+		}
+		cxgb4i_skb_rx_pdulen(skb) = hlen + dlen;
+		if (dlen)
+			cxgb4i_skb_rx_pdulen(skb) += csk->dcrc_len;
+		cxgb4i_skb_rx_pdulen(skb) =
+			((cxgb4i_skb_rx_pdulen(skb) + 3) & (~3));
+		csk->rcv_nxt += cxgb4i_skb_rx_pdulen(skb);
+	} else {
+		lskb = csk->skb_ulp_lhdr;
+		cxgb4i_skb_flags(lskb) |= CTP_SKCBF_DATA_RCVD;
+		cxgb4i_skb_flags(skb) = CTP_SKCBF_DATA_RCVD;
+		cxgbi_log_debug("csk 0x%p, tid 0x%x skb 0x%p, pdu data, "
+				" header 0x%p.\n",
+				csk, csk->hwtid, skb, lskb);
+	}
+
+	__skb_queue_tail(&csk->receive_queue, skb);
+	spin_unlock_bh(&csk->lock);
+	return 0;
+abort_conn:
+	cxgb4i_sock_send_abort_req(csk);
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	return -EINVAL;
+}
+
+static int cxgb4i_cpl_rx_data_ddp(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct sk_buff *lskb;
+	struct cpl_rx_data_ddp *rpl = (struct cpl_rx_data_ddp *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+	unsigned int status;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	spin_lock_bh(&csk->lock);
+
+	if (unlikely(csk->state >= CTP_PASSIVE_CLOSE)) {
+		if (csk->state != CTP_ABORTING)
+			goto abort_conn;
+	}
+
+	if (!csk->skb_ulp_lhdr) {
+		cxgbi_log_error("tid 0x%x, rcv RX_DATA_DDP w/o pdu header\n",
+				csk->hwtid);
+		goto abort_conn;
+	}
+
+	lskb = csk->skb_ulp_lhdr;
+	cxgb4i_skb_flags(lskb) |= CTP_SKCBF_STATUS_RCVD;
+
+	if (ntohs(rpl->len) != cxgb4i_skb_rx_pdulen(lskb)) {
+		cxgbi_log_error("tid 0x%x, RX_DATA_DDP pdulen %u != %u.\n",
+				csk->hwtid, ntohs(rpl->len),
+				cxgb4i_skb_rx_pdulen(lskb));
+	}
+
+	cxgb4i_skb_rx_ddigest(lskb) = ntohl(rpl->ulp_crc);
+	status = ntohl(rpl->ddpvld);
+
+	if (status & (1 << RX_DDP_STATUS_HCRC_SHIFT)) {
+		cxgbi_log_info("ULP2_FLAG_HCRC_ERROR set\n");
+		cxgb4i_skb_ulp_mode(skb) |= ULP2_FLAG_HCRC_ERROR;
+	}
+	if (status & (1 << RX_DDP_STATUS_DCRC_SHIFT)) {
+		cxgbi_log_info("ULP2_FLAG_DCRC_ERROR set\n");
+		cxgb4i_skb_ulp_mode(skb) |= ULP2_FLAG_DCRC_ERROR;
+	}
+	if (status & (1 << RX_DDP_STATUS_PAD_SHIFT)) {
+		cxgbi_log_info("ULP2_FLAG_PAD_ERROR set\n");
+		cxgb4i_skb_ulp_mode(skb) |= ULP2_FLAG_PAD_ERROR;
+	}
+	if ((cxgb4i_skb_flags(lskb) & ULP2_FLAG_DATA_READY)) {
+		cxgbi_log_info("ULP2_FLAG_DATA_DDPED set\n");
+		cxgb4i_skb_ulp_mode(skb) |= ULP2_FLAG_DATA_DDPED;
+	}
+
+	csk->skb_ulp_lhdr = NULL;
+	__kfree_skb(skb);
+	cxgbi_conn_pdu_ready(csk);
+	spin_unlock_bh(&csk->lock);
+	return 0;
+abort_conn:
+	cxgb4i_sock_send_abort_req(csk);
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	return -EINVAL;
+}
+
+static void check_wr_invariants(const struct cxgbi_sock *csk)
+{
+	int pending = cxgb4i_sock_count_pending_wrs(csk);
+
+	if (unlikely(csk->wr_cred + pending != csk->wr_max_cred))
+		printk(KERN_ERR "TID %u: credit imbalance: avail %u, "
+				"pending %u, total should be %u\n",
+				csk->hwtid,
+				csk->wr_cred,
+				pending,
+				csk->wr_max_cred);
+}
+
+static int cxgb4i_cpl_fw4_ack(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_fw4_ack *rpl = (struct cpl_fw4_ack *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+	unsigned char credits;
+	unsigned int snd_una;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		kfree_skb(skb);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	credits = rpl->credits;
+	snd_una = be32_to_cpu(rpl->snd_una);
+	cxgbi_tx_debug("%u WR credits, avail %u, unack %u, TID %u, state %u\n",
+				credits, csk->wr_cred, csk->wr_una_cred,
+						csk->hwtid, csk->state);
+	csk->wr_cred += credits;
+
+	if (csk->wr_una_cred > csk->wr_max_cred - csk->wr_cred)
+		csk->wr_una_cred = csk->wr_max_cred - csk->wr_cred;
+
+	while (credits) {
+		struct sk_buff *p = cxgb4i_sock_peek_wr(csk);
+
+		if (unlikely(!p)) {
+			cxgbi_log_error("%u WR_ACK credits for TID %u with "
+					"nothing pending, state %u\n",
+					credits, csk->hwtid, csk->state);
+			break;
+		}
+
+		if (unlikely(credits < p->csum))
+			p->csum -= credits;
+		else {
+			cxgb4i_sock_dequeue_wr(csk);
+			credits -= p->csum;
+			kfree_skb(p);
+		}
+	}
+
+	check_wr_invariants(csk);
+
+	if (rpl->seq_vld) {
+		if (unlikely(before(snd_una, csk->snd_una))) {
+			cxgbi_log_error("TID %u, unexpected sequence # %u "
+					"in WR_ACK snd_una %u\n",
+					csk->hwtid, snd_una, csk->snd_una);
+			goto out_free;
+		}
+	}
+
+	if (csk->snd_una != snd_una) {
+		csk->snd_una = snd_una;
+		dst_confirm(csk->dst);
+	}
+
+	if (skb_queue_len(&csk->write_queue)) {
+		if (cxgb4i_sock_push_tx_frames(csk, 0))
+			cxgbi_conn_tx_open(csk);
+	} else
+		cxgbi_conn_tx_open(csk);
+
+	goto out;
+out_free:
+	__kfree_skb(skb);
+out:
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	return 0;
+}
+
+static int cxgb4i_cpl_set_tcb_rpl(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cpl_set_tcb_rpl *rpl = (struct cpl_set_tcb_rpl *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+	struct cxgbi_sock *csk;
+
+	csk = lookup_tid(t, hwtid);
+	if (!csk) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		__kfree_skb(skb);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	spin_lock_bh(&csk->lock);
+
+	if (rpl->status != CPL_ERR_NONE) {
+		cxgbi_log_error("Unexpected SET_TCB_RPL status %u "
+				 "for tid %u\n", rpl->status, GET_TID(rpl));
+	}
+
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	return 0;
+}
+
+static void cxgb4i_sock_free_cpl_skbs(struct cxgbi_sock *csk)
+{
+	if (csk->cpl_close)
+		kfree_skb(csk->cpl_close);
+	if (csk->cpl_abort_req)
+		kfree_skb(csk->cpl_abort_req);
+	if (csk->cpl_abort_rpl)
+		kfree_skb(csk->cpl_abort_rpl);
+}
+
+static int cxgb4i_alloc_cpl_skbs(struct cxgbi_sock *csk)
+{
+	int wrlen;
+
+	wrlen = roundup(sizeof(struct cpl_close_con_req), 16);
+	csk->cpl_close = alloc_skb(wrlen, GFP_NOIO);
+	if (!csk->cpl_close)
+		return -ENOMEM;
+	skb_put(csk->cpl_close, wrlen);
+
+	wrlen = roundup(sizeof(struct cpl_abort_req), 16);
+	csk->cpl_abort_req = alloc_skb(wrlen, GFP_NOIO);
+	if (!csk->cpl_abort_req)
+		goto free_cpl_skbs;
+	skb_put(csk->cpl_abort_req, wrlen);
+
+	wrlen = roundup(sizeof(struct cpl_abort_rpl), 16);
+	csk->cpl_abort_rpl = alloc_skb(wrlen, GFP_NOIO);
+	if (!csk->cpl_abort_rpl)
+		goto free_cpl_skbs;
+	skb_put(csk->cpl_abort_rpl, wrlen);
+	return 0;
+free_cpl_skbs:
+	cxgb4i_sock_free_cpl_skbs(csk);
+	return -ENOMEM;
+}
+
+static void cxgb4i_sock_release_offload_resources(struct cxgbi_sock *csk)
+{
+
+	cxgb4i_sock_free_cpl_skbs(csk);
+
+	if (csk->wr_cred != csk->wr_max_cred) {
+		cxgb4i_sock_purge_wr_queue(csk);
+		cxgb4i_sock_reset_wr_list(csk);
+	}
+
+	if (csk->l2t) {
+		cxgb4_l2t_release(csk->l2t);
+		csk->l2t = NULL;
+	}
+
+	if (csk->state == CTP_CONNECTING)
+		cxgb4i_sock_free_atid(csk);
+	else {
+		struct cxgb4i_snic *snic = cxgbi_cdev_priv(csk->cdev);
+		cxgb4_remove_tid(snic->lldi.tids, 0, csk->hwtid);
+		cxgbi_sock_put(csk);
+	}
+
+	csk->dst = NULL;
+	csk->cdev = NULL;
+}
+
+static int cxgb4i_init_act_open(struct cxgbi_sock *csk,
+				struct net_device *dev)
+{
+	struct dst_entry *dst = csk->dst;
+	struct sk_buff *skb;
+	struct port_info *pi = netdev_priv(dev);
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(csk->cdev);
+	int offs;
+
+	cxgbi_conn_debug("csk 0x%p, state %u, flags 0x%lx\n",
+			csk, csk->state, csk->flags);
+
+	csk->atid = cxgb4_alloc_atid(snic->lldi.tids, csk);
+	if (csk->atid == -1) {
+		cxgbi_log_error("cannot alloc atid\n");
+		goto out_err;
+	}
+
+	csk->l2t = cxgb4_l2t_get(snic->lldi.l2t, csk->dst->neighbour, dev, 0);
+	if (!csk->l2t) {
+		cxgbi_log_error("cannot alloc l2t\n");
+		goto free_atid;
+	}
+
+	skb = alloc_skb(sizeof(struct cpl_act_open_req), GFP_NOIO);
+	if (!skb)
+		goto free_l2t;
+
+	skb->sk = (struct sock *)csk;
+	t4_set_arp_err_handler(skb, csk, cxgb4i_act_open_req_arp_failure);
+	cxgbi_sock_hold(csk);
+	offs = snic->lldi.ntxq / snic->lldi.nchan;
+	csk->txq_idx = pi->port_id * offs;
+	cxgbi_log_debug("csk->txq_idx : %d\n", csk->txq_idx);
+	offs = snic->lldi.nrxq / snic->lldi.nchan;
+	csk->rss_qid = snic->lldi.rxq_ids[pi->port_id * offs];
+	cxgbi_log_debug("csk->rss_qid : %d\n", csk->rss_qid);
+	csk->wr_max_cred = csk->wr_cred = snic->lldi.wr_cred;
+	csk->port_id = pi->port_id;
+	csk->tx_chan = cxgb4_port_chan(dev);
+	csk->smac_idx = csk->tx_chan << 1;
+	csk->wr_una_cred = 0;
+	csk->mss_idx = cxgbi_sock_select_mss(csk, dst_mtu(dst));
+	csk->err = 0;
+	cxgb4i_sock_reset_wr_list(csk);
+	cxgb4i_sock_make_act_open_req(csk, skb,
+					((csk->rss_qid << 14) |
+					 (csk->atid)), csk->l2t);
+	cxgb4_l2t_send(csk->cdev->ports[csk->port_id], skb, csk->l2t);
+	return 0;
+free_l2t:
+	cxgb4_l2t_release(csk->l2t);
+free_atid:
+	cxgb4i_sock_free_atid(csk);
+out_err:
+	return -EINVAL;;
+}
+
+static void cxgb4i_sock_rx_credits(struct cxgbi_sock *csk, int copied)
+{
+	int must_send;
+	u32 credits;
+
+	if (csk->state != CTP_ESTABLISHED)
+		return;
+
+	credits = csk->copied_seq - csk->rcv_wup;
+	if (unlikely(!credits))
+		return;
+
+	if (unlikely(cxgb4i_rx_credit_thres == 0))
+		return;
+
+	must_send = credits + 16384 >= cxgb4i_rcv_win;
+
+	if (must_send || credits >= cxgb4i_rx_credit_thres)
+		csk->rcv_wup += cxgb4i_csk_send_rx_credits(csk, credits);
+}
+
+static int cxgb4i_sock_send_pdus(struct cxgbi_sock *csk, struct sk_buff *skb)
+{
+	struct sk_buff *next;
+	int err, copied = 0;
+
+	spin_lock_bh(&csk->lock);
+
+	if (csk->state != CTP_ESTABLISHED) {
+		cxgbi_tx_debug("csk 0x%p, not in est. state %u.\n",
+			      csk, csk->state);
+		err = -EAGAIN;
+		goto out_err;
+	}
+
+	if (csk->err) {
+		cxgbi_tx_debug("csk 0x%p, err %d.\n", csk, csk->err);
+		err = -EPIPE;
+		goto out_err;
+	}
+
+	if (csk->write_seq - csk->snd_una >= cxgb4i_snd_win) {
+		cxgbi_tx_debug("csk 0x%p, snd %u - %u > %u.\n",
+				csk, csk->write_seq, csk->snd_una,
+				cxgb4i_snd_win);
+		err = -ENOBUFS;
+		goto out_err;
+	}
+
+	while (skb) {
+		int frags = skb_shinfo(skb)->nr_frags +
+				(skb->len != skb->data_len);
+
+		if (unlikely(skb_headroom(skb) < CXGB4I_TX_HEADER_LEN)) {
+			cxgbi_tx_debug("csk 0x%p, skb head.\n", csk);
+			err = -EINVAL;
+			goto out_err;
+		}
+
+		if (frags >= SKB_WR_LIST_SIZE) {
+			cxgbi_log_error("csk 0x%p, tx frags %d, len %u,%u.\n",
+					 csk, skb_shinfo(skb)->nr_frags,
+					 skb->len, skb->data_len);
+			err = -EINVAL;
+			goto out_err;
+		}
+
+		next = skb->next;
+		skb->next = NULL;
+		cxgb4i_sock_skb_entail(csk, skb, CTP_SKCBF_NO_APPEND |
+					CTP_SKCBF_NEED_HDR);
+		copied += skb->len;
+		csk->write_seq += skb->len + ulp_extra_len(skb);
+		skb = next;
+	}
+done:
+	if (likely(skb_queue_len(&csk->write_queue)))
+		cxgb4i_sock_push_tx_frames(csk, 1);
+	spin_unlock_bh(&csk->lock);
+	return copied;
+out_err:
+	if (copied == 0 && err == -EPIPE)
+		copied = csk->err ? csk->err : -EPIPE;
+	else
+		copied = err;
+	goto done;
+}
+
+static void tx_skb_setmode(struct sk_buff *skb, int hcrc, int dcrc)
+{
+	u8 submode = 0;
+
+	if (hcrc)
+		submode |= 1;
+	if (dcrc)
+		submode |= 2;
+	cxgb4i_skb_ulp_mode(skb) = (ULP_MODE_ISCSI << 4) | submode;
+}
+
+static inline __u16 get_skb_ulp_mode(struct sk_buff *skb)
+{
+	return cxgb4i_skb_ulp_mode(skb);
+}
+
+static inline __u16 get_skb_flags(struct sk_buff *skb)
+{
+	return cxgb4i_skb_flags(skb);
+}
+
+static inline __u32 get_skb_tcp_seq(struct sk_buff *skb)
+{
+	return cxgb4i_skb_tcp_seq(skb);
+}
+
+static inline __u32 get_skb_rx_pdulen(struct sk_buff *skb)
+{
+	return cxgb4i_skb_rx_pdulen(skb);
+}
+
+static cxgb4i_cplhandler_func cxgb4i_cplhandlers[NUM_CPL_CMDS] = {
+	[CPL_ACT_ESTABLISH] = cxgb4i_cpl_act_establish,
+	[CPL_ACT_OPEN_RPL] = cxgb4i_cpl_act_open_rpl,
+	[CPL_PEER_CLOSE] = cxgb4i_cpl_peer_close,
+	[CPL_ABORT_REQ_RSS] = cxgb4i_cpl_abort_req_rss,
+	[CPL_ABORT_RPL_RSS] = cxgb4i_cpl_abort_rpl_rss,
+	[CPL_CLOSE_CON_RPL] = cxgb4i_cpl_close_con_rpl,
+	[CPL_FW4_ACK] = cxgb4i_cpl_fw4_ack,
+	[CPL_ISCSI_HDR] = cxgb4i_cpl_iscsi_hdr,
+	[CPL_SET_TCB_RPL] = cxgb4i_cpl_set_tcb_rpl,
+	[CPL_RX_DATA_DDP] = cxgb4i_cpl_rx_data_ddp
+};
+
+int cxgb4i_ofld_init(struct cxgbi_device *cdev)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(cdev);
+	struct cxgbi_ports_map *ports;
+	int mapsize;
+
+	if (cxgb4i_max_connect > CXGB4I_MAX_CONN)
+		cxgb4i_max_connect = CXGB4I_MAX_CONN;
+
+	mapsize = (cxgb4i_max_connect * sizeof(struct cxgbi_sock));
+	ports = cxgbi_alloc_big_mem(sizeof(*ports) + mapsize, GFP_KERNEL);
+	if (!ports)
+		return -ENOMEM;
+
+	spin_lock_init(&ports->lock);
+	cdev->pmap = ports;
+	cdev->pmap->max_connect = cxgb4i_max_connect;
+	cdev->pmap->sport_base = cxgb4i_sport_base;
+	cdev->set_skb_txmode = tx_skb_setmode;
+	cdev->get_skb_ulp_mode = get_skb_ulp_mode;
+	cdev->get_skb_flags = get_skb_flags;
+	cdev->get_skb_tcp_seq = get_skb_tcp_seq;
+	cdev->get_skb_rx_pdulen = get_skb_rx_pdulen;
+	cdev->release_offload_resources = cxgb4i_sock_release_offload_resources;
+	cdev->sock_send_pdus = cxgb4i_sock_send_pdus;
+	cdev->send_abort_req = cxgb4i_sock_send_abort_req;
+	cdev->send_close_req = cxgb4i_sock_send_close_req;
+	cdev->sock_rx_credits = cxgb4i_sock_rx_credits;
+	cdev->alloc_cpl_skbs = cxgb4i_alloc_cpl_skbs;
+	cdev->init_act_open = cxgb4i_init_act_open;
+	snic->handlers = cxgb4i_cplhandlers;
+	return 0;
+}
+
+void cxgb4i_ofld_cleanup(struct cxgbi_device *cdev)
+{
+	struct cxgbi_sock *csk;
+	int i;
+
+	for (i = 0; i < cdev->pmap->max_connect; i++) {
+		if (cdev->pmap->port_csk[i]) {
+			csk = cdev->pmap->port_csk[i];
+			cdev->pmap->port_csk[i] = NULL;
+
+			cxgbi_sock_hold(csk);
+			spin_lock_bh(&csk->lock);
+			cxgbi_sock_closed(csk);
+			spin_unlock_bh(&csk->lock);
+			cxgbi_sock_put(csk);
+		}
+	}
+	cxgbi_free_big_mem(cdev->pmap);
+}
-- 
1.6.6.1

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply related

* Re: [PATCH net-2.6] pkt_sched: gen_estimator: add a new lock
From: Changli Gao @ 2010-06-08  5:20 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, netdev, Stephen Hemminger, Jarek Poplawski,
	Patrick McHardy
In-Reply-To: <1275973091.2775.51.camel@edumazet-laptop>

On Tue, Jun 8, 2010 at 12:58 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le mardi 08 juin 2010 à 09:00 +0800, Changli Gao a écrit :
>
>> and I think gen_replace_estimator is expected to be an atomic operation.
>>
>> And gen_estimator_active() is also assumed to be called with RTNL locked.
>>
>
> My patch fixes a bug of new/kill operators, regardless of RTNL being
> held or not. Its should be small enough to be included in linux-2.6.35.
>
> If what you say is right, all gen_replace_estimator() /
> gen_estimator_active() callers should still holds RTNL.
> I didnt change this part.
> If you believe one caller doesnt hold RTNL, please submit another patch.
>
> Then, in net-next-2.6, we can probably cleanup this to remove RTNL
> requirement if possible for gen_replace_estimator() /
> gen_estimator_active()
>
> Yes, it sounds a bit difficult (three patches instead of a single one),
> but this is the how things should be done, step by step.
>

IMO, this bug should be fixed by adding rtnl_lock to xt_RATEEST.c.
Killing rtnl should be done in separated patches. They are different
things. Your patch introduces another locks, and it is extra overhead
for other users.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)

^ permalink raw reply

* [RFC PATCH 0/5] netdev: show a process of packets
From: Koki Sanagi @ 2010-06-08  5:25 UTC (permalink / raw)
  To: netdev; +Cc: davem, kaneshige.kenji, izumi.taku

These patch-set adds tracepoints to show us a process of packets.
Using these tracepoints and existing points, we can get the time when
packet passes through some points in transmit or receive sequence.
For example, this is an output of perf script which is attached by patch 5/5.

79074.756672832sec cpu=1
irq_entry(+0.000000msec,irq=77:eth3)
         |------------softirq_raise(+0.001277msec)
irq_exit (+0.002278msec)     |
                             |
                      softirq_entry(+0.003562msec)
                             |
                             |---netif_receive_skb(+0.006279msec,len=100)
                             |            |
                             |   skb_copy_datagram_iovec(+0.038778msec, 2285:sshd)
                             |
                      napi_poll_exit(+0.017160msec, eth3)
                             |
                      softirq_exit(+0.018248msec)

The above is a receive side. Like this, it can show receive sequence from
interrupt(irq_entry) to application(skb_copy_datagram_iovec). There are 8
tracepoints in this side. All events except for skb_copy_datagram_iovec can be
associated with each other by CPU number. skb_copy_datagram_iovec can be
associated with netif_receive_skb by skbaddr.
This script shows one NET_RX softirq and events related to it. All relative
time bases on first irq_entry which raise NET_RX softirq.

   dev    len      dev_queue_xmit|----------|dev_hard_start_xmit|-----|free_skb
                         |             |                           |
   eth3   114  79044.417123332sec     0.005242msec          0.103843msec
   eth3   114  79044.580090422sec     0.002306msec          0.103632msec
   eth3   114  79044.719078251sec     0.002288msec          0.104093msec

The above is a transmit side. There are three tracepoints in this side.
Point1 is before putting a packet to Qdisc. point2 is after ndo_start_xmit in
dev_hard_start_xmit. It indicates finishing putting a packet to driver.
point3 is in consume_skb and dev_kfree_skb_irq. It indicates freeing a
transmitted packet.
Values of this script are, from left, device name, length of a packet, a time of
point1, an interval time between point1 and point2 and an interval time between
point2 and point3.

These times are useful to analyze a performance or to detect a point where
packet delays. For example,
- NET_RX softirq calling is late.
- Application is late to take a packet.
- It takes much time to put a transmitting packet to driver
  (It may be caused by packed queue)

And also, these tracepoint help us to investigate a network driver's trouble
from memory dump because ftrace records it to memory. And ftrace is so light
even if always trace on. So, in a case investigating a problem which doesn't
reproduce, it is useful.

Thanks,
Koki Sanagi.


^ permalink raw reply

* [RFC PATCH 1/5] irq: add tracepoint to softirq_raise
From: Koki Sanagi @ 2010-06-08  5:27 UTC (permalink / raw)
  To: netdev; +Cc: davem, kaneshige.kenji, izumi.taku, laijs
In-Reply-To: <4C0DD43F.9090902@jp.fujitsu.com>

This patch adds tracepoint to softirq_raise.
This is a same patch Lai Jiangshan submitted.
http://marc.info/?l=linux-kernel&m=126026122728732&w=2

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/linux/interrupt.h  |    8 +++++++-
 include/trace/events/irq.h |   34 +++++++++++++++++++++++++++++++---
 2 files changed, 38 insertions(+), 4 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index c233113..1cb5726 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -18,6 +18,7 @@
 #include <asm/atomic.h>
 #include <asm/ptrace.h>
 #include <asm/system.h>
+#include <trace/events/irq.h>
 
 /*
  * These correspond to the IORESOURCE_IRQ_* defines in
@@ -402,7 +403,12 @@ asmlinkage void do_softirq(void);
 asmlinkage void __do_softirq(void);
 extern void open_softirq(int nr, void (*action)(struct softirq_action *));
 extern void softirq_init(void);
-#define __raise_softirq_irqoff(nr) do { or_softirq_pending(1UL << (nr)); } while (0)
+static inline void __raise_softirq_irqoff(unsigned int nr)
+{
+	trace_softirq_raise(nr);
+	or_softirq_pending(1UL << nr);
+}
+
 extern void raise_softirq_irqoff(unsigned int nr);
 extern void raise_softirq(unsigned int nr);
 extern void wakeup_softirqd(void);
diff --git a/include/trace/events/irq.h b/include/trace/events/irq.h
index 0e4cfb6..7cb7435 100644
--- a/include/trace/events/irq.h
+++ b/include/trace/events/irq.h
@@ -5,7 +5,9 @@
 #define _TRACE_IRQ_H
 
 #include <linux/tracepoint.h>
-#include <linux/interrupt.h>
+
+struct irqaction;
+struct softirq_action;
 
 #define softirq_name(sirq) { sirq##_SOFTIRQ, #sirq }
 #define show_softirq_name(val)				\
@@ -82,6 +84,32 @@ TRACE_EVENT(irq_handler_exit,
 		  __entry->irq, __entry->ret ? "handled" : "unhandled")
 );
 
+/**
+ * softirq_raise - called immediately when a softirq is raised
+ * @nr: softirq vector number
+ *
+ * Tracepoint for tracing when softirq action is raised.
+ * Also, when used in combination with the softirq_entry tracepoint
+ * we can determine the softirq raise latency.
+ */
+TRACE_EVENT(softirq_raise,
+
+	TP_PROTO(unsigned int nr),
+
+	TP_ARGS(nr),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vec	)
+	),
+
+	TP_fast_assign(
+		__entry->vec	= nr;
+	),
+
+	TP_printk("vec=%d [action=%s]", __entry->vec,
+		show_softirq_name(__entry->vec))
+);
+
 DECLARE_EVENT_CLASS(softirq,
 
 	TP_PROTO(struct softirq_action *h, struct softirq_action *vec),
@@ -89,11 +117,11 @@ DECLARE_EVENT_CLASS(softirq,
 	TP_ARGS(h, vec),
 
 	TP_STRUCT__entry(
-		__field(	int,	vec			)
+		__field(	unsigned int,	vec	)
 	),
 
 	TP_fast_assign(
-		__entry->vec = (int)(h - vec);
+		__entry->vec = (unsigned int)(h - vec);
 	),
 
 	TP_printk("vec=%d [action=%s]", __entry->vec,


^ permalink raw reply related

* Re: [RFC PATCH v7 01/19] Add a new structure for skb buffer from external.
From: Herbert Xu @ 2010-06-08  5:27 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: xiaohui.xin, netdev, kvm, linux-kernel, mst, mingo, davem, jdike
In-Reply-To: <20100606161348.427822fb@nehalam>

On Sun, Jun 06, 2010 at 04:13:48PM -0700, Stephen Hemminger wrote:
> Still not sure this is a good idea for a couple of reasons:
> 
> 1. We already have lots of special cases with skb's (frags and fraglist),
>    and skb's travel through a lot of different parts of the kernel.  So any
>    new change like this creates lots of exposed points for new bugs. Look
>    at cases like MD5 TCP and netfilter, and forwarding these SKB's to ipsec
>    and ppp and ...
> 
> 2. SKB's can have infinite lifetime in the kernel. If these buffers come from
>    a fixed size pool in an external device, they can easily all get tied up
>    if you have a slow listener. What happens then?

I agree with Stephen on this.

FWIW I don't think we even need the external pages concept in
order to implement zero-copy receive (which I gather is the intent
here).

Here is one way to do it, simply construct a completely non-linear
packet in the driver, as you would if you were using the GRO frags
interface (grep for napi_gro_frags under drivers/net for examples).

This way you can transfer the entire contents of the packet without
copying through to the other side, provided that the host stack does
not modify the packet.

If the host side did modify the packet then we have to incur the
memory cost anyway.

IOW I think the only feature provided by the external pages
construct is allowing the skb->head area to be shared without
copying.  I'm claiming that this can be done by simply making
skb->head empty.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [RFC PATCH 2/5] napi: convert trace_napi_poll to TRACE_EVENT
From: Koki Sanagi @ 2010-06-08  5:28 UTC (permalink / raw)
  To: netdev; +Cc: davem, kaneshige.kenji, izumi.taku, nhorman
In-Reply-To: <4C0DD43F.9090902@jp.fujitsu.com>

This patch converts trace_napi_poll from DECLARE_EVENT to TRACE_EVENT.
This is a same patch Neil Horman submitted.
http://marc.info/?l=linux-kernel&m=125978157926853&w=2

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/trace/events/napi.h |   23 +++++++++++++++++++++--
 1 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/trace/events/napi.h b/include/trace/events/napi.h
index 188deca..512a057 100644
--- a/include/trace/events/napi.h
+++ b/include/trace/events/napi.h
@@ -6,10 +6,29 @@
 
 #include <linux/netdevice.h>
 #include <linux/tracepoint.h>
+#include <linux/ftrace.h>
+
+#define NO_DEV "(no_device)"
+
+TRACE_EVENT(napi_poll,
 
-DECLARE_TRACE(napi_poll,
 	TP_PROTO(struct napi_struct *napi),
-	TP_ARGS(napi));
+
+	TP_ARGS(napi),
+
+	TP_STRUCT__entry(
+		__field(	struct napi_struct *,	napi)
+		__string(	dev_name, napi->dev ? napi->dev->name : NO_DEV)
+	),
+
+	TP_fast_assign(
+		__entry->napi = napi;
+		__assign_str(dev_name, napi->dev ? napi->dev->name : NO_DEV);
+	),
+
+	TP_printk("napi poll on napi struct %p for device %s",
+		__entry->napi, __get_str(dev_name))
+);
 
 #endif /* _TRACE_NAPI_H_ */
 



^ permalink raw reply related

* [RFC PATCH 3/5] netdev: add tracepoints to netdev layer
From: Koki Sanagi @ 2010-06-08  5:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, kaneshige.kenji, izumi.taku
In-Reply-To: <4C0DD43F.9090902@jp.fujitsu.com>

This patch adds tracepoint to dev_queue_xmit, dev_hard_start_xmit and
netif_receive_skb.

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/trace/events/net.h |   84 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/dev.c             |    5 +++
 net/core/net-traces.c      |    1 +
 3 files changed, 90 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/net.h b/include/trace/events/net.h
new file mode 100644
index 0000000..4f82fb5
--- /dev/null
+++ b/include/trace/events/net.h
@@ -0,0 +1,84 @@
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM net
+
+#if !defined(_TRACE_NET_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _TRACE_NET_H
+
+#include <linux/skbuff.h>
+#include <linux/netdevice.h>
+#include <linux/tracepoint.h>
+
+#define NO_DEV "(no_device)"
+
+TRACE_EVENT(net_dev_xmit,
+
+	TP_PROTO(struct sk_buff *skb,
+		 int rc),
+
+	TP_ARGS(skb, rc),
+
+	TP_STRUCT__entry(
+		__field(	void *,		skbaddr		)
+		__field(	unsigned int,	len		)
+		__field(	int,		rc		)
+		__string(	name,		skb->dev->name	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+		__entry->len = skb->len;
+		__entry->rc = rc;
+		__assign_str(name, skb->dev->name);
+	),
+
+	TP_printk("dev=%s skbaddr=%p len=%u rc=%d",
+		__get_str(name), __entry->skbaddr, __entry->len, __entry->rc)
+);
+
+TRACE_EVENT(net_dev_queue,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(	void *,		skbaddr		)
+		__field(	unsigned int,	len		)
+		__string(	name,		skb->dev->name	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+		__entry->len = skb->len;
+		__assign_str(name, skb->dev->name);
+	),
+
+	TP_printk("dev=%s skbaddr=%p len=%u",
+		__get_str(name), __entry->skbaddr, __entry->len)
+);
+
+TRACE_EVENT(net_dev_receive,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(	void *,		skbaddr		)
+		__field(	unsigned int,	len		)
+		__string(	name,		skb->dev->name	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+		__entry->len = skb->len;
+		__assign_str(name, skb->dev->name);
+	),
+
+	TP_printk("dev=%s skbaddr=%p len=%u",
+		__get_str(name), __entry->skbaddr, __entry->len)
+);
+#endif /* _TRACE_NET_H */
+
+/* This part must be outside protection */
+#include <trace/define_trace.h>
diff --git a/net/core/dev.c b/net/core/dev.c
index ec01a59..f7c731b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -130,6 +130,7 @@
 #include <linux/jhash.h>
 #include <linux/random.h>
 #include <trace/events/napi.h>
+#include <trace/events/net.h>
 #include <linux/pci.h>
 
 #include "net-sysfs.h"
@@ -1926,6 +1927,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
 		}
 
 		rc = ops->ndo_start_xmit(skb, dev);
+		trace_net_dev_xmit(skb, rc);
 		if (rc == NETDEV_TX_OK)
 			txq_trans_update(txq);
 		return rc;
@@ -1946,6 +1948,7 @@ gso:
 			skb_dst_drop(nskb);
 
 		rc = ops->ndo_start_xmit(nskb, dev);
+		trace_net_dev_xmit(nskb, rc);
 		if (unlikely(rc != NETDEV_TX_OK)) {
 			if (rc & ~NETDEV_TX_MASK)
 				goto out_kfree_gso_skb;
@@ -2159,6 +2162,7 @@ int dev_queue_xmit(struct sk_buff *skb)
 	}
 
 gso:
+	trace_net_dev_queue(skb);
 	/* Disable soft irqs for various locks below. Also
 	 * stops preemption for RCU.
 	 */
@@ -2934,6 +2938,7 @@ int netif_receive_skb(struct sk_buff *skb)
 	if (netdev_tstamp_prequeue)
 		net_timestamp_check(skb);
 
+	trace_net_dev_receive(skb);
 #ifdef CONFIG_RPS
 	{
 		struct rps_dev_flow voidflow, *rflow = &voidflow;
diff --git a/net/core/net-traces.c b/net/core/net-traces.c
index afa6380..7f1bb2a 100644
--- a/net/core/net-traces.c
+++ b/net/core/net-traces.c
@@ -26,6 +26,7 @@
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/skb.h>
+#include <trace/events/net.h>
 #include <trace/events/napi.h>
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kfree_skb);


^ permalink raw reply related

* [RFC PATCH 4/5] skb: add tracepoints to freeing skb
From: Koki Sanagi @ 2010-06-08  5:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, kaneshige.kenji, izumi.taku
In-Reply-To: <4C0DD43F.9090902@jp.fujitsu.com>

This patch adds tracepoint to consume_skb and dev_kfree_skb_irq.

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 include/trace/events/skb.h |   36 ++++++++++++++++++++++++++++++++++++
 net/core/dev.c             |    2 ++
 net/core/skbuff.c          |    1 +
 3 files changed, 39 insertions(+), 0 deletions(-)

diff --git a/include/trace/events/skb.h b/include/trace/events/skb.h
index 4b2be6d..6ab5b34 100644
--- a/include/trace/events/skb.h
+++ b/include/trace/events/skb.h
@@ -35,6 +35,42 @@ TRACE_EVENT(kfree_skb,
 		__entry->skbaddr, __entry->protocol, __entry->location)
 );
 
+TRACE_EVENT(consume_skb,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(	void *,	skbaddr	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+	),
+
+	TP_printk("skbaddr=%p",
+		__entry->skbaddr)
+);
+
+TRACE_EVENT(dev_kfree_skb_irq,
+
+	TP_PROTO(struct sk_buff *skb),
+
+	TP_ARGS(skb),
+
+	TP_STRUCT__entry(
+		__field(	void *,	skbaddr	)
+	),
+
+	TP_fast_assign(
+		__entry->skbaddr = skb;
+	),
+
+	TP_printk("skbaddr=%p",
+		__entry->skbaddr)
+);
+
 TRACE_EVENT(skb_copy_datagram_iovec,
 
 	TP_PROTO(const struct sk_buff *skb, int len),
diff --git a/net/core/dev.c b/net/core/dev.c
index f7c731b..e0093c4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -131,6 +131,7 @@
 #include <linux/random.h>
 #include <trace/events/napi.h>
 #include <trace/events/net.h>
+#include <trace/events/skb.h>
 #include <linux/pci.h>
 
 #include "net-sysfs.h"
@@ -1584,6 +1585,7 @@ void dev_kfree_skb_irq(struct sk_buff *skb)
 		struct softnet_data *sd;
 		unsigned long flags;
 
+		trace_dev_kfree_skb_irq(skb);
 		local_irq_save(flags);
 		sd = &__get_cpu_var(softnet_data);
 		skb->next = sd->completion_queue;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4e7ac09..008c019 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -466,6 +466,7 @@ void consume_skb(struct sk_buff *skb)
 		smp_rmb();
 	else if (likely(!atomic_dec_and_test(&skb->users)))
 		return;
+	trace_consume_skb(skb);
 	__kfree_skb(skb);
 }
 EXPORT_SYMBOL(consume_skb);




^ permalink raw reply related

* [RFC PATCH 5/5] perf:add a script shows a process of packet
From: Koki Sanagi @ 2010-06-08  5:31 UTC (permalink / raw)
  To: netdev; +Cc: davem, kaneshige.kenji, izumi.taku
In-Reply-To: <4C0DD43F.9090902@jp.fujitsu.com>

This perf script show a time-chart of process of packet.
If you want to use it, install perf and

#perf trace record netdev-times

And if you want a result,

#perf trace report netdev-times

Signed-off-by: Koki Sanagi <sanagi.koki@jp.fujitsu.com>
---
 tools/perf/scripts/python/bin/netdev-times-record |    7 +
 tools/perf/scripts/python/bin/netdev-times-report |   10 +
 tools/perf/scripts/python/netdev-times.py         |  451 +++++++++++++++++++++
 3 files changed, 468 insertions(+), 0 deletions(-)

diff --git a/tools/perf/scripts/python/bin/netdev-times-record b/tools/perf/scripts/python/bin/netdev-times-record
new file mode 100644
index 0000000..c6a54fd
--- /dev/null
+++ b/tools/perf/scripts/python/bin/netdev-times-record
@@ -0,0 +1,7 @@
+#!/bin/bash
+perf record -c 1 -f -R -a -e net:net_dev_xmit -e net:net_dev_queue	\
+		-e net:net_dev_receive -e skb:consume_skb		\
+		-e skb:dev_kfree_skb_irq -e napi:napi_poll		\
+		-e irq:irq_handler_entry -e irq:irq_handler_exit	\
+		-e irq:softirq_entry -e irq:softirq_exit		\
+		-e irq:softirq_raise -e skb:skb_copy_datagram_iovec
diff --git a/tools/perf/scripts/python/bin/netdev-times-report b/tools/perf/scripts/python/bin/netdev-times-report
new file mode 100644
index 0000000..5d24c3d
--- /dev/null
+++ b/tools/perf/scripts/python/bin/netdev-times-report
@@ -0,0 +1,10 @@
+#!/bin/bash
+# description: displayi a process of packet and processing time
+# args: [comm]
+if [ $# -gt 0 ] ; then
+    if ! expr match "$1" "-" > /dev/null ; then
+	comm=$1
+	shift
+    fi
+fi
+perf trace $@ -s ~/libexec/perf-core/scripts/python/netdev-times.py $comm
diff --git a/tools/perf/scripts/python/netdev-times.py b/tools/perf/scripts/python/netdev-times.py
new file mode 100644
index 0000000..e7b47c7
--- /dev/null
+++ b/tools/perf/scripts/python/netdev-times.py
@@ -0,0 +1,451 @@
+# Display process of packets and processed time.
+# It helps you to investigate networking.
+
+import os
+import sys
+
+sys.path.append(os.environ['PERF_EXEC_PATH'] + \
+	'/scripts/python/Perf-Trace-Util/lib/Perf/Trace')
+
+from perf_trace_context import *
+from Core import *
+from Util import *
+
+all_event_list = []; # insert all tracepoint event related with this script
+irq_dic = {}; # key is cpu and value is a list which stacks irqs
+              # which raise NET_RX softirq
+net_rx_dic = {}; # key is cpu and value include time of NET_RX softirq-entry
+		 # and a list which stacks receive
+receive_hunk_list = []; # a list which include a sequence of receive events
+receive_skb_list = []; # received packet list for matching
+		       # skb_copy_datagram_iovec
+
+queue_list = []; # list of packets which pass through dev_queue_xmit
+xmit_list = [];  # list of packets which pass through dev_hard_start_xmit
+free_list = [];  # list of packets which is freed
+# Calculate a time interval(msec) from src(nsec) to dst(nsec)
+def diff_msec(src, dst):
+	return (dst - src) / 1000000.0
+
+# Display a process of transmitting a packet
+def print_transmit(hunk):
+	print "%7s %5d %6d.%09dsec %12.6fmsec      %12.6fmsec" % \
+		(hunk['dev'], hunk['len'],
+		nsecs_secs(hunk['queue_t']),
+		nsecs_nsecs(hunk['queue_t']),
+		diff_msec(hunk['queue_t'], hunk['xmit_t']),
+		diff_msec(hunk['xmit_t'], hunk['free_t']))
+
+# Display a process of received packets and interrputs associated with
+# a NET_RX softirq
+def print_receive(hunk):
+	if 'irq_list' not in hunk.keys() \
+	or len(hunk['irq_list']) == 0:
+		return
+	irq_list = hunk['irq_list']
+	cpu = irq_list[0]['cpu']
+	base_t = irq_list[0]['irq_ent_t']
+	print "%d.%09dsec cpu=%d" % \
+		(nsecs_secs(base_t), nsecs_nsecs(base_t), cpu)
+	for i in range(len(irq_list)):
+		print "irq_entry(+%fmsec,irq=%d:%s)" % \
+			(diff_msec(base_t, irq_list[i]['irq_ent_t']),
+			irq_list[i]['irq'], irq_list[i]['name'])
+
+		print "         |------------" \
+		      "softirq_raise(+%fmsec)" % \
+			diff_msec(base_t, irq_list[i]['sirq_raise_t'])
+
+		print "irq_exit (+%fmsec)     |" % \
+			diff_msec(base_t, irq_list[i]['irq_ext_t'])
+
+		print "                             |"
+
+	if 'sirq_ent_t' not in hunk.keys():
+		print 'maybe softirq_entry is dropped'
+		return
+	print "                      " \
+		"softirq_entry(+%fmsec)\n" \
+		"                      " \
+		"       |" % \
+		diff_msec(base_t, hunk['sirq_ent_t'])
+	event_list = hunk['event_list']
+	for i in range(len(event_list)):
+		event = event_list[i]
+		if event['event_name'] == 'napi_poll':
+			print "                      " \
+			      "napi_poll_exit(+%fmsec, %s)" % \
+			(diff_msec(base_t, event['event_t']), event['dev'])
+			print "                      " \
+			      "       |"
+		elif 'comm' in event.keys():
+			print "                      " \
+				"       |---netif_receive_skb" \
+				"(+%fmsec,len=%d)\n" \
+				"                      " \
+				"       |            |\n" \
+				"                      " \
+				"       |   skb_copy_datagram_iovec" \
+				"(+%fmsec, %d:%s)\n" \
+				"                      " \
+				"       |" % \
+			(diff_msec(base_t, event['event_t']),
+			event['len'],
+			diff_msec(base_t, event['comm_t']),
+			event['pid'], event['comm'])
+		else:
+			print "                      " \
+				"       |---netif_receive_skb" \
+				"(+%fmsec,len=%d)\n" \
+				"                      " \
+				"       |" % \
+				(diff_msec(base_t, event['event_t']),
+					event['len'])
+
+	print "                      " \
+	      "softirq_exit(+%fmsec)\n" % \
+		 diff_msec(base_t, hunk['sirq_ext_t'])
+
+def trace_end():
+	# order all events in time
+	all_event_list.sort(lambda a,b :cmp(a['time'], b['time']))
+	# process all events
+	for i in range(len(all_event_list)):
+		event = all_event_list[i]
+		event_name = event['event_name']
+		if event_name == 'irq__softirq_exit':
+			handle_irq_softirq_exit(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['vec'])
+		elif event_name == 'irq__softirq_entry':
+			handle_irq_softirq_entry(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'],event['vec'])
+		elif event_name == 'irq__softirq_raise':
+			handle_irq_softirq_raise(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['vec'])
+		elif event_name == 'irq__irq_handler_entry':
+			handle_irq_handler_entry(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['irq'], event['name'])
+		elif event_name == 'irq__irq_handler_exit':
+			handle_irq_handler_exit(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['irq'], event['ret'])
+		elif event_name == 'napi__napi_poll':
+			handle_napi_poll(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['napi'],
+				event['dev_name'])
+		elif event_name == 'net__net_dev_receive':
+			handle_net_dev_receive(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['skbaddr'],
+				event['skblen'], event['name'])
+		elif event_name == 'skb__skb_copy_datagram_iovec':
+			handle_skb_copy_datagram_iovec(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['skbaddr'],
+				event['skblen'])
+		elif event_name == 'net__net_dev_queue':
+			handle_net_dev_queue(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['skbaddr'],
+				event['skblen'], event['name'])
+		elif event_name == 'net__net_dev_xmit':
+			handle_net_dev_xmit(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['skbaddr'],
+				event['skblen'], event['rc'], event['name'])
+		elif event_name == 'skb__dev_kfree_skb_irq':
+			handle_dev_kfree_skb_irq(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['skbaddr'])
+		elif event_name == 'skb__consume_skb':
+			handle_consume_skb(event['event_name'],
+				event['context'], event['common_cpu'],
+				event['common_pid'], event['common_comm'],
+				event['time'], event['skbaddr'])
+	# display receive hunks
+	for i in range(len(receive_hunk_list)):
+		print_receive(receive_hunk_list[i])
+	# display transmit hunks
+	print "   dev    len      dev_queue_xmit|----------|" \
+		"dev_hard_start_xmit|-----|free_skb"
+	print "                         |             |" \
+		"                           |"
+	for i in range(len(free_list)):
+		print_transmit(free_list[i])
+
+def irq__softirq_exit(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	vec):
+	if symbol_str("irq__softirq_entry", "vec", vec) != "NET_RX":
+		return
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'vec':vec}
+	all_event_list.append(event_data)
+
+def handle_irq_softirq_exit(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	vec):
+	rec_data = {'sirq_ext_t':time}
+	if common_cpu in irq_dic.keys():
+		rec_data.update({'irq_list':irq_dic[common_cpu]})
+		del irq_dic[common_cpu]
+	if common_cpu in net_rx_dic.keys():
+		rec_data.update({
+		    'event_list':net_rx_dic[common_cpu]['event_list'],
+		    'sirq_ent_t':net_rx_dic[common_cpu]['sirq_ent_t']})
+		del net_rx_dic[common_cpu]
+	# merge information realted to a NET_RX softirq
+	receive_hunk_list.append(rec_data)
+
+def irq__softirq_entry(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	vec):
+	if symbol_str("irq__softirq_entry", "vec", vec) != "NET_RX":
+		return
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'vec':vec}
+	all_event_list.append(event_data)
+
+def handle_irq_softirq_entry(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	vec):
+		net_rx_dic[common_cpu] = {'event_list':[],
+					  'sirq_ent_t':time}
+
+def irq__softirq_raise(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	vec):
+	if symbol_str("irq__softirq_entry", "vec", vec) != "NET_RX":
+		return
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'vec':vec}
+	all_event_list.append(event_data)
+
+def handle_irq_softirq_raise(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	vec):
+	if common_cpu not in irq_dic.keys() \
+	or len(irq_dic[common_cpu]) == 0:
+		return
+	irq = irq_dic[common_cpu].pop()
+	# put a time to prev irq on the same cpu
+	irq.update({'sirq_raise_t':time})
+	irq_dic[common_cpu].append(irq)
+
+def irq__irq_handler_entry(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	irq, name):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'irq':irq, 'name':name}
+	all_event_list.append(event_data)
+
+def handle_irq_handler_entry(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	irq, name):
+	if common_cpu not in irq_dic.keys():
+		irq_dic[common_cpu] = []
+	irq_record = {'irq':irq,
+		      'name':name,
+		      'cpu':common_cpu,
+		      'irq_ent_t':time}
+	irq_dic[common_cpu].append(irq_record)
+
+def irq__irq_handler_exit(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	irq, ret):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'irq':irq, 'ret':ret}
+	all_event_list.append(event_data)
+
+def handle_irq_handler_exit(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	irq, ret):
+	if common_cpu not in irq_dic.keys():
+		return
+	irq_record = irq_dic[common_cpu].pop()
+	irq_record.update({'irq_ext_t':time})
+	# if an irq doesn't include NET_RX softirq, drop.
+	if 'sirq_raise_t' in irq_record.keys():
+		irq_dic[common_cpu].append(irq_record)
+
+def napi__napi_poll(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	napi, dev_name):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'napi':napi, 'dev_name':dev_name}
+	all_event_list.append(event_data)
+
+def handle_napi_poll(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	napi, dev_name):
+	if common_cpu in net_rx_dic.keys():
+		event_list = net_rx_dic[common_cpu]['event_list']
+		rec_data = {'event_name':'napi_poll',
+			    'dev':dev_name,
+			    'event_t':time}
+		event_list.append(rec_data)
+
+def net__net_dev_receive(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	skbaddr,skblen, name):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'skbaddr':skbaddr, 'skblen':skblen, 'name':name}
+	all_event_list.append(event_data)
+
+def handle_net_dev_receive(event_name, context, common_cpu,
+	ccommon_pid, common_comm, time,
+	skbaddr, skblen, name):
+	if common_cpu in net_rx_dic.keys():
+		rec_data = {'event_name':'netif_receive_skb',
+			    'event_t':time,
+			    'skbaddr':skbaddr,
+			    'len':skblen}
+		event_list = net_rx_dic[common_cpu]['event_list']
+		event_list.append(rec_data)
+		receive_skb_list.insert(0, rec_data)
+
+def skb__skb_copy_datagram_iovec(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	skbaddr, skblen):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'skbaddr':skbaddr, 'skblen':skblen}
+	all_event_list.append(event_data)
+
+def handle_skb_copy_datagram_iovec(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	skbaddr, skblen):
+	for i in range(len(receive_skb_list)):
+		rec_data = receive_skb_list[i]
+		if skbaddr == rec_data['skbaddr'] and \
+			'comm' not in rec_data.keys():
+			rec_data.update({'comm':common_comm,
+					 'pid':common_pid,
+					 'comm_t':time})
+			del receive_skb_list[i]
+			break
+
+def net__net_dev_queue(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	skbaddr, skblen, name):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'skbaddr':skbaddr, 'skblen':skblen, 'name':name}
+	all_event_list.append(event_data)
+
+def handle_net_dev_queue(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	skbaddr, skblen, name):
+	skb = {'dev':name,
+	       'skbaddr':skbaddr,
+	       'len':skblen,
+	       'queue_t':time}
+	xmit_list.insert(0, skb)
+
+def net__net_dev_xmit(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	skbaddr, skblen, rc, name):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'skbaddr':skbaddr, 'skblen':skblen, 'rc':rc, 'name':name}
+	all_event_list.append(event_data)
+
+def handle_net_dev_xmit(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	skbaddr, skblen, rc, name):
+	if rc == 0: # NETDEV_TX_OK
+		for i in range(len(xmit_list)):
+			skb = xmit_list[i]
+			if skb['skbaddr'] == skbaddr:
+				skb['xmit_t'] = time
+				queue_list.insert(0, skb)
+				del xmit_list[i]
+				break
+
+def free_skb(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	skbaddr):
+	for i in range(len(queue_list)):
+		skb = queue_list[i]
+		if skb['skbaddr'] ==skbaddr:
+			skb['free_t'] = time
+			free_list.append(skb)
+			del queue_list[i]
+			break
+
+def skb__dev_kfree_skb_irq(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	skbaddr):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'skbaddr':skbaddr}
+	all_event_list.append(event_data)
+
+def handle_dev_kfree_skb_irq(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	skbaddr):
+	free_skb(event_name, context, common_cpu,
+		common_pid, common_comm, time,
+		skbaddr)
+
+def skb__consume_skb(event_name, context, common_cpu,
+	common_secs, common_nsecs, common_pid, common_comm,
+	skbaddr):
+	event_data = {'event_name':event_name, 'context':context,
+		'common_cpu':common_cpu, 'common_pid':common_pid,
+		'common_comm':common_comm,'time':nsecs(common_secs,
+							common_nsecs),
+		'skbaddr':skbaddr}
+	all_event_list.append(event_data)
+
+def handle_consume_skb(event_name, context, common_cpu,
+	common_pid, common_comm, time,
+	skbaddr):
+	free_skb(event_name, context, common_cpu,
+		common_pid, common_comm, time,
+		skbaddr)


^ permalink raw reply related

* Re: [PATCH net-2.6] pkt_sched: gen_estimator: add a new lock
From: Eric Dumazet @ 2010-06-08  5:39 UTC (permalink / raw)
  To: Changli Gao
  Cc: David Miller, netdev, Stephen Hemminger, Jarek Poplawski,
	Patrick McHardy
In-Reply-To: <AANLkTik6xx0cxZKezU31K4vYg4Rz0o7ndmN4D9vYSCKX@mail.gmail.com>

Le mardi 08 juin 2010 à 13:20 +0800, Changli Gao a écrit :

> IMO, this bug should be fixed by adding rtnl_lock to xt_RATEEST.c.
> Killing rtnl should be done in separated patches. They are different
> things. Your patch introduces another locks, and it is extra overhead
> for other users.
> 

extra overhead, in new/kill estimators ? 

Are you kidding ?

RTNL is taken, taking an extra-uncontended spinlock is free.

Nope, I wont add rtnl lock to xt_RATEEST.c

I believe you dont really understood what I patiently explained to you.

Thats becoming rediculous.




^ permalink raw reply

* Re: [PATCH net-next-2.6] anycast: Some RCU conversions
From: David Miller @ 2010-06-08  5:57 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, yoshfuji
In-Reply-To: <1275946933.2775.16.camel@edumazet-laptop>

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 07 Jun 2010 23:42:13 +0200

> - dev_get_by_flags() changed to dev_get_by_flags_rcu()
> 
> - ipv6_sock_ac_join() dont touch dev & idev refcounts
> - ipv6_sock_ac_drop() dont touch dev & idev refcounts
> - ipv6_sock_ac_close() dont touch dev & idev refcounts
> - ipv6_dev_ac_dec() dount touch idev refcount
> - ipv6_chk_acast_addr() dont touch idev refcount
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>

Looks great, applied, thanks Eric!

^ permalink raw reply

* Re: [PATCH] r8169: fix random mdio_write failures
From: Timo Teräs @ 2010-06-08  6:06 UTC (permalink / raw)
  To: Francois Romieu; +Cc: hayeswang, netdev, davem
In-Reply-To: <20100607215115.GA6583@electric-eye.fr.zoreil.com>

On 06/08/2010 12:51 AM, Francois Romieu wrote:
> hayeswang <hayeswang@realtek.com> :
>> Our hardware engineer suggests that check the completed indication
>> per 100 micro seconds. And it needs 20 micro seconds delay after the 
>> completed indication for the next command.
> 
> Should we do the same for mdio_read as well (100 us per iteration + 
> an extra 20 us) ?

Well, doing 100us per iteration will increase the latency that the code
notices "write complete" which slows down things. It'll also slightly
decrease bus traffic which is good. But I'd be just fine with 25us per
iteration. It sounds unlikely that polling the status register would
slow down the actual write operation (if that is the case then 100us
would be desirable).

Changing my 25us to 20us would good. The original 25us was just a guess.
The comment should be probably also updated that those delays are from
realtek hw specs then.

Would you like me to send a patch?

- Timo



^ permalink raw reply

* [PATCH] net/irda/sh_irda: Modify clk_get lookups
From: Kuninori Morimoto @ 2010-06-08  6:25 UTC (permalink / raw)
  To: Paul Mundt, David S. Miller; +Cc: Magnus, Linux-SH, Linux-Net

Signed-off-by: Kuninori Morimoto <kuninori.morimoto.gx@renesas.com>
---
This patch is v2 of
ARM: mach-shmobile: clock-sh7367: modify IrDA clock

 arch/arm/mach-shmobile/board-g3evm.c |    1 +
 drivers/net/irda/sh_irda.c           |    6 ++----
 2 files changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/arm/mach-shmobile/board-g3evm.c b/arch/arm/mach-shmobile/board-g3evm.c
index 95ccb94..a552590 100644
--- a/arch/arm/mach-shmobile/board-g3evm.c
+++ b/arch/arm/mach-shmobile/board-g3evm.c
@@ -232,6 +232,7 @@ static struct resource irda_resources[] = {
 
 static struct platform_device irda_device = {
 	.name		= "sh_irda",
+	.id		= -1,
 	.resource	= irda_resources,
 	.num_resources	= ARRAY_SIZE(irda_resources),
 };
diff --git a/drivers/net/irda/sh_irda.c b/drivers/net/irda/sh_irda.c
index 9a828b0..9db7084 100644
--- a/drivers/net/irda/sh_irda.c
+++ b/drivers/net/irda/sh_irda.c
@@ -748,7 +748,6 @@ static int __devinit sh_irda_probe(struct platform_device *pdev)
 	struct net_device *ndev;
 	struct sh_irda_self *self;
 	struct resource *res;
-	char clk_name[8];
 	unsigned int irq;
 	int err = -ENOMEM;
 
@@ -775,10 +774,9 @@ static int __devinit sh_irda_probe(struct platform_device *pdev)
 	if (err)
 		goto err_mem_2;
 
-	snprintf(clk_name, sizeof(clk_name), "irda%d", pdev->id);
-	self->clk = clk_get(&pdev->dev, clk_name);
+	self->clk = clk_get(&pdev->dev, NULL);
 	if (IS_ERR(self->clk)) {
-		dev_err(&pdev->dev, "cannot get clock \"%s\"\n", clk_name);
+		dev_err(&pdev->dev, "cannot get irda clock\n");
 		goto err_mem_3;
 	}
 
-- 
1.7.0.4


^ permalink raw reply related

* Re: [PATCH] r8169: fix random mdio_write failures
From: Francois Romieu @ 2010-06-08  6:26 UTC (permalink / raw)
  To: Timo Teräs; +Cc: hayeswang, netdev, davem
In-Reply-To: <4C0DDDCC.6010500@iki.fi>

Timo Teräs <timo.teras@iki.fi> :
[ok]
> iteration. It sounds unlikely that polling the status register would
> slow down the actual write operation (if that is the case then 100us
> would be desirable).

I would not be that surprized.

> Changing my 25us to 20us would good. The original 25us was just a guess.
> The comment should be probably also updated that those delays are from
> realtek hw specs then.

Yes.

> Would you like me to send a patch?

Of course. Some comment from Hayes regarding mdio_read would be
welcome beforehand though.

-- 
Ueimor

^ permalink raw reply

* [PATCH] ipvs: Add missing locking during connection table hashing and unhashing
From: Sven Wegener @ 2010-06-08  6:29 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Julian Anastasov, Simon Horman, Wensong Zhang, netdev, lvs-devel
In-Reply-To: <20100524235637.GG4794@verge.net.au>

The code that hashes and unhashes connections from the connection table
is missing locking of the connection being modified, which opens up a
race condition and results in memory corruption when this race condition
is hit.

Here is what happens in pretty verbose form:

CPU 0					CPU 1
------------				------------
An active connection is terminated and
we schedule ip_vs_conn_expire() on this
CPU to expire this connection.

					IRQ assignment is changed to this CPU,
					but the expire timer stays scheduled on
					the other CPU.

					New connection from same ip:port comes
					in right before the timer expires, we
					find the inactive connection in our
					connection table and get a reference to
					it. We proper lock the connection in
					tcp_state_transition() and read the
					connection flags in set_tcp_state().

ip_vs_conn_expire() gets called, we
unhash the connection from our
connection table and remove the hashed
flag in ip_vs_conn_unhash(), without
proper locking!

					While still holding proper locks we
					write the connection flags in
					set_tcp_state() and this sets the hashed
					flag again.

ip_vs_conn_expire() fails to expire the
connection, because the other CPU has
incremented the reference count. We try
to re-insert the connection into our
connection table, but this fails in
ip_vs_conn_hash(), because the hashed
flag has been set by the other CPU. We
re-schedule execution of
ip_vs_conn_expire(). Now this connection
has the hashed flag set, but isn't
actually hashed in our connection table
and has a dangling list_head.

					We drop the reference we held on the
					connection and schedule the expire timer
					for timeouting the connection on this
					CPU. Further packets won't be able to
					find this connection in our connection
					table.

					ip_vs_conn_expire() gets called again,
					we think it's already hashed, but the
					list_head is dangling and while removing
					the connection from our connection table
					we write to the memory location where
					this list_head points to.

The result will probably be a kernel oops at some other point in time.

Signed-off-by: Sven Wegener <sven.wegener@stealer.net>
Cc: stable@kernel.org
Acked-by: Simon Horman <horms@verge.net.au>
---
 net/netfilter/ipvs/ip_vs_conn.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

This race condition is pretty subtle, but it can be triggered remotely.
It needs the IRQ assignment change or another circumstance where packets
coming from the same ip:port for the same service are being processed on
different CPUs. And it involves hitting the exact time at which
ip_vs_conn_expire() gets called. It can be avoided by making sure that
all packets from one connection are always processed on the same CPU and
can be made harder to exploit by changing the connection timeouts to
some custom values.

diff --git a/net/netfilter/ipvs/ip_vs_conn.c b/net/netfilter/ipvs/ip_vs_conn.c
index d8f7e8e..ff04e9e 100644
--- a/net/netfilter/ipvs/ip_vs_conn.c
+++ b/net/netfilter/ipvs/ip_vs_conn.c
@@ -162,6 +162,7 @@ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
 	hash = ip_vs_conn_hashkey(cp->af, cp->protocol, &cp->caddr, cp->cport);
 
 	ct_write_lock(hash);
+	spin_lock(&cp->lock);
 
 	if (!(cp->flags & IP_VS_CONN_F_HASHED)) {
 		list_add(&cp->c_list, &ip_vs_conn_tab[hash]);
@@ -174,6 +175,7 @@ static inline int ip_vs_conn_hash(struct ip_vs_conn *cp)
 		ret = 0;
 	}
 
+	spin_unlock(&cp->lock);
 	ct_write_unlock(hash);
 
 	return ret;
@@ -193,6 +195,7 @@ static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp)
 	hash = ip_vs_conn_hashkey(cp->af, cp->protocol, &cp->caddr, cp->cport);
 
 	ct_write_lock(hash);
+	spin_lock(&cp->lock);
 
 	if (cp->flags & IP_VS_CONN_F_HASHED) {
 		list_del(&cp->c_list);
@@ -202,6 +205,7 @@ static inline int ip_vs_conn_unhash(struct ip_vs_conn *cp)
 	} else
 		ret = 0;
 
+	spin_unlock(&cp->lock);
 	ct_write_unlock(hash);
 
 	return ret;

^ permalink raw reply related

* [PATCH] fix a race at the end of NAPI
From: Figo.zhang @ 2010-06-08  6:48 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev


fix a race at the end of NAPI complete processing. it had better do __napi_complete() 
first before re-enable interrupt.

Signed-off-by: Figo.zhang <figo1802@gmail.com>
--- 
 drivers/net/8139cp.c  |    2 +-
 drivers/net/8139too.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c
old mode 100644
new mode 100755
index 9c14975..284a5f4
--- a/drivers/net/8139cp.c
+++ b/drivers/net/8139cp.c
@@ -598,8 +598,8 @@ rx_next:
 			goto rx_status_loop;
 
 		spin_lock_irqsave(&cp->lock, flags);
-		cpw16_f(IntrMask, cp_intr_mask);
 		__napi_complete(napi);
+		cpw16_f(IntrMask, cp_intr_mask);
 		spin_unlock_irqrestore(&cp->lock, flags);
 	}
 
diff --git a/drivers/net/8139too.c b/drivers/net/8139too.c
old mode 100644
new mode 100755
index 4ba7293..a7bca8c
--- a/drivers/net/8139too.c
+++ b/drivers/net/8139too.c
@@ -2088,8 +2088,8 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
 		 * again when we think we are done.
 		 */
 		spin_lock_irqsave(&tp->lock, flags);
-		RTL_W16_F(IntrMask, rtl8139_intr_mask);
 		__napi_complete(napi);
+		RTL_W16_F(IntrMask, rtl8139_intr_mask);
 		spin_unlock_irqrestore(&tp->lock, flags);
 	}
 	spin_unlock(&tp->rx_lock);



^ permalink raw reply related

* [PATCH net-next-2.6] ipv6: mcast: RCU conversions
From: Eric Dumazet @ 2010-06-08  7:05 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Hideaki YOSHIFUJI

- ipv6_sock_mc_join() : doesnt touch dev refcount

- ipv6_sock_mc_drop() : doesnt touch dev/idev refcounts

- ip6_mc_find_dev() becomes ip6_mc_find_dev_rcu() (called from rcu),
                    and doesnt touch dev/idev refcounts

- ipv6_sock_mc_close() : doesnt touch dev/idev refcounts

- ip6_mc_source() uses ip6_mc_find_dev_rcu()

- ip6_mc_msfilter() uses ip6_mc_find_dev_rcu()

- ip6_mc_msfget() uses ip6_mc_find_dev_rcu()

- ipv6_dev_mc_dec(), ipv6_chk_mcast_addr(),
  igmp6_event_query(), igmp6_event_report(),
  mld_sendpack(), igmp6_send() dont touch idev refcount

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv6/mcast.c |  183 +++++++++++++++++++++------------------------
 1 file changed, 87 insertions(+), 96 deletions(-)

diff --git a/net/ipv6/mcast.c b/net/ipv6/mcast.c
index 8752e80..3e36d15 100644
--- a/net/ipv6/mcast.c
+++ b/net/ipv6/mcast.c
@@ -152,18 +152,19 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
 	mc_lst->next = NULL;
 	ipv6_addr_copy(&mc_lst->addr, addr);
 
+	rcu_read_lock();
 	if (ifindex == 0) {
 		struct rt6_info *rt;
 		rt = rt6_lookup(net, addr, NULL, 0, 0);
 		if (rt) {
 			dev = rt->rt6i_dev;
-			dev_hold(dev);
 			dst_release(&rt->u.dst);
 		}
 	} else
-		dev = dev_get_by_index(net, ifindex);
+		dev = dev_get_by_index_rcu(net, ifindex);
 
 	if (dev == NULL) {
+		rcu_read_unlock();
 		sock_kfree_s(sk, mc_lst, sizeof(*mc_lst));
 		return -ENODEV;
 	}
@@ -180,8 +181,8 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
 	err = ipv6_dev_mc_inc(dev, addr);
 
 	if (err) {
+		rcu_read_unlock();
 		sock_kfree_s(sk, mc_lst, sizeof(*mc_lst));
-		dev_put(dev);
 		return err;
 	}
 
@@ -190,7 +191,7 @@ int ipv6_sock_mc_join(struct sock *sk, int ifindex, const struct in6_addr *addr)
 	np->ipv6_mc_list = mc_lst;
 	write_unlock_bh(&ipv6_sk_mc_lock);
 
-	dev_put(dev);
+	rcu_read_unlock();
 
 	return 0;
 }
@@ -213,18 +214,17 @@ int ipv6_sock_mc_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
 			*lnk = mc_lst->next;
 			write_unlock_bh(&ipv6_sk_mc_lock);
 
-			dev = dev_get_by_index(net, mc_lst->ifindex);
+			rcu_read_lock();
+			dev = dev_get_by_index_rcu(net, mc_lst->ifindex);
 			if (dev != NULL) {
-				struct inet6_dev *idev = in6_dev_get(dev);
+				struct inet6_dev *idev = __in6_dev_get(dev);
 
 				(void) ip6_mc_leave_src(sk, mc_lst, idev);
-				if (idev) {
+				if (idev)
 					__ipv6_dev_mc_dec(idev, &mc_lst->addr);
-					in6_dev_put(idev);
-				}
-				dev_put(dev);
 			} else
 				(void) ip6_mc_leave_src(sk, mc_lst, NULL);
+			rcu_read_unlock();
 			sock_kfree_s(sk, mc_lst, sizeof(*mc_lst));
 			return 0;
 		}
@@ -234,43 +234,36 @@ int ipv6_sock_mc_drop(struct sock *sk, int ifindex, const struct in6_addr *addr)
 	return -EADDRNOTAVAIL;
 }
 
-static struct inet6_dev *ip6_mc_find_dev(struct net *net,
-					 struct in6_addr *group,
-					 int ifindex)
+/* called with rcu_read_lock() */
+static struct inet6_dev *ip6_mc_find_dev_rcu(struct net *net,
+					     struct in6_addr *group,
+					     int ifindex)
 {
 	struct net_device *dev = NULL;
 	struct inet6_dev *idev = NULL;
 
 	if (ifindex == 0) {
-		struct rt6_info *rt;
+		struct rt6_info *rt = rt6_lookup(net, group, NULL, 0, 0);
 
-		rt = rt6_lookup(net, group, NULL, 0, 0);
 		if (rt) {
 			dev = rt->rt6i_dev;
 			dev_hold(dev);
 			dst_release(&rt->u.dst);
 		}
 	} else
-		dev = dev_get_by_index(net, ifindex);
+		dev = dev_get_by_index_rcu(net, ifindex);
 
 	if (!dev)
-		goto nodev;
-	idev = in6_dev_get(dev);
+		return NULL;
+	idev = __in6_dev_get(dev);
 	if (!idev)
-		goto release;
+		return NULL;;
 	read_lock_bh(&idev->lock);
-	if (idev->dead)
-		goto unlock_release;
-
+	if (idev->dead) {
+		read_unlock_bh(&idev->lock);
+		return NULL;
+	}
 	return idev;
-
-unlock_release:
-	read_unlock_bh(&idev->lock);
-	in6_dev_put(idev);
-release:
-	dev_put(dev);
-nodev:
-	return NULL;
 }
 
 void ipv6_sock_mc_close(struct sock *sk)
@@ -286,19 +279,17 @@ void ipv6_sock_mc_close(struct sock *sk)
 		np->ipv6_mc_list = mc_lst->next;
 		write_unlock_bh(&ipv6_sk_mc_lock);
 
-		dev = dev_get_by_index(net, mc_lst->ifindex);
+		rcu_read_lock();
+		dev = dev_get_by_index_rcu(net, mc_lst->ifindex);
 		if (dev) {
-			struct inet6_dev *idev = in6_dev_get(dev);
+			struct inet6_dev *idev = __in6_dev_get(dev);
 
 			(void) ip6_mc_leave_src(sk, mc_lst, idev);
-			if (idev) {
+			if (idev)
 				__ipv6_dev_mc_dec(idev, &mc_lst->addr);
-				in6_dev_put(idev);
-			}
-			dev_put(dev);
 		} else
 			(void) ip6_mc_leave_src(sk, mc_lst, NULL);
-
+		rcu_read_unlock();
 		sock_kfree_s(sk, mc_lst, sizeof(*mc_lst));
 
 		write_lock_bh(&ipv6_sk_mc_lock);
@@ -327,14 +318,17 @@ int ip6_mc_source(int add, int omode, struct sock *sk,
 	if (!ipv6_addr_is_multicast(group))
 		return -EINVAL;
 
-	idev = ip6_mc_find_dev(net, group, pgsr->gsr_interface);
-	if (!idev)
+	rcu_read_lock();
+	idev = ip6_mc_find_dev_rcu(net, group, pgsr->gsr_interface);
+	if (!idev) {
+		rcu_read_unlock();
 		return -ENODEV;
+	}
 	dev = idev->dev;
 
 	err = -EADDRNOTAVAIL;
 
-	read_lock_bh(&ipv6_sk_mc_lock);
+	read_lock(&ipv6_sk_mc_lock);
 	for (pmc=inet6->ipv6_mc_list; pmc; pmc=pmc->next) {
 		if (pgsr->gsr_interface && pmc->ifindex != pgsr->gsr_interface)
 			continue;
@@ -358,7 +352,7 @@ int ip6_mc_source(int add, int omode, struct sock *sk,
 		pmc->sfmode = omode;
 	}
 
-	write_lock_bh(&pmc->sflock);
+	write_lock(&pmc->sflock);
 	pmclocked = 1;
 
 	psl = pmc->sflist;
@@ -433,11 +427,10 @@ int ip6_mc_source(int add, int omode, struct sock *sk,
 	ip6_mc_add_src(idev, group, omode, 1, source, 1);
 done:
 	if (pmclocked)
-		write_unlock_bh(&pmc->sflock);
-	read_unlock_bh(&ipv6_sk_mc_lock);
+		write_unlock(&pmc->sflock);
+	read_unlock(&ipv6_sk_mc_lock);
 	read_unlock_bh(&idev->lock);
-	in6_dev_put(idev);
-	dev_put(dev);
+	rcu_read_unlock();
 	if (leavegroup)
 		return ipv6_sock_mc_drop(sk, pgsr->gsr_interface, group);
 	return err;
@@ -463,14 +456,17 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf)
 	    gsf->gf_fmode != MCAST_EXCLUDE)
 		return -EINVAL;
 
-	idev = ip6_mc_find_dev(net, group, gsf->gf_interface);
+	rcu_read_lock();
+	idev = ip6_mc_find_dev_rcu(net, group, gsf->gf_interface);
 
-	if (!idev)
+	if (!idev) {
+		rcu_read_unlock();
 		return -ENODEV;
+	}
 	dev = idev->dev;
 
 	err = 0;
-	read_lock_bh(&ipv6_sk_mc_lock);
+	read_lock(&ipv6_sk_mc_lock);
 
 	if (gsf->gf_fmode == MCAST_INCLUDE && gsf->gf_numsrc == 0) {
 		leavegroup = 1;
@@ -512,7 +508,7 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf)
 		(void) ip6_mc_add_src(idev, group, gsf->gf_fmode, 0, NULL, 0);
 	}
 
-	write_lock_bh(&pmc->sflock);
+	write_lock(&pmc->sflock);
 	psl = pmc->sflist;
 	if (psl) {
 		(void) ip6_mc_del_src(idev, group, pmc->sfmode,
@@ -522,13 +518,12 @@ int ip6_mc_msfilter(struct sock *sk, struct group_filter *gsf)
 		(void) ip6_mc_del_src(idev, group, pmc->sfmode, 0, NULL, 0);
 	pmc->sflist = newpsl;
 	pmc->sfmode = gsf->gf_fmode;
-	write_unlock_bh(&pmc->sflock);
+	write_unlock(&pmc->sflock);
 	err = 0;
 done:
-	read_unlock_bh(&ipv6_sk_mc_lock);
+	read_unlock(&ipv6_sk_mc_lock);
 	read_unlock_bh(&idev->lock);
-	in6_dev_put(idev);
-	dev_put(dev);
+	rcu_read_unlock();
 	if (leavegroup)
 		err = ipv6_sock_mc_drop(sk, gsf->gf_interface, group);
 	return err;
@@ -551,11 +546,13 @@ int ip6_mc_msfget(struct sock *sk, struct group_filter *gsf,
 	if (!ipv6_addr_is_multicast(group))
 		return -EINVAL;
 
-	idev = ip6_mc_find_dev(net, group, gsf->gf_interface);
+	rcu_read_lock();
+	idev = ip6_mc_find_dev_rcu(net, group, gsf->gf_interface);
 
-	if (!idev)
+	if (!idev) {
+		rcu_read_unlock();
 		return -ENODEV;
-
+	}
 	dev = idev->dev;
 
 	err = -EADDRNOTAVAIL;
@@ -577,8 +574,7 @@ int ip6_mc_msfget(struct sock *sk, struct group_filter *gsf,
 	psl = pmc->sflist;
 	count = psl ? psl->sl_count : 0;
 	read_unlock_bh(&idev->lock);
-	in6_dev_put(idev);
-	dev_put(dev);
+	rcu_read_unlock();
 
 	copycount = count < gsf->gf_numsrc ? count : gsf->gf_numsrc;
 	gsf->gf_numsrc = count;
@@ -604,8 +600,7 @@ int ip6_mc_msfget(struct sock *sk, struct group_filter *gsf,
 	return 0;
 done:
 	read_unlock_bh(&idev->lock);
-	in6_dev_put(idev);
-	dev_put(dev);
+	rcu_read_unlock();
 	return err;
 }
 
@@ -822,6 +817,7 @@ int ipv6_dev_mc_inc(struct net_device *dev, const struct in6_addr *addr)
 	struct ifmcaddr6 *mc;
 	struct inet6_dev *idev;
 
+	/* we need to take a reference on idev */
 	idev = in6_dev_get(dev);
 
 	if (idev == NULL)
@@ -860,7 +856,7 @@ int ipv6_dev_mc_inc(struct net_device *dev, const struct in6_addr *addr)
 	setup_timer(&mc->mca_timer, igmp6_timer_handler, (unsigned long)mc);
 
 	ipv6_addr_copy(&mc->mca_addr, addr);
-	mc->idev = idev;
+	mc->idev = idev; /* (reference taken) */
 	mc->mca_users = 1;
 	/* mca_stamp should be updated upon changes */
 	mc->mca_cstamp = mc->mca_tstamp = jiffies;
@@ -915,16 +911,18 @@ int __ipv6_dev_mc_dec(struct inet6_dev *idev, const struct in6_addr *addr)
 
 int ipv6_dev_mc_dec(struct net_device *dev, const struct in6_addr *addr)
 {
-	struct inet6_dev *idev = in6_dev_get(dev);
+	struct inet6_dev *idev;
 	int err;
 
-	if (!idev)
-		return -ENODEV;
-
-	err = __ipv6_dev_mc_dec(idev, addr);
+	rcu_read_lock();
 
-	in6_dev_put(idev);
+	idev = __in6_dev_get(dev);
+	if (!idev)
+		err = -ENODEV;
+	else
+		err = __ipv6_dev_mc_dec(idev, addr);
 
+	rcu_read_unlock();
 	return err;
 }
 
@@ -965,7 +963,8 @@ int ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group,
 	struct ifmcaddr6 *mc;
 	int rv = 0;
 
-	idev = in6_dev_get(dev);
+	rcu_read_lock();
+	idev = __in6_dev_get(dev);
 	if (idev) {
 		read_lock_bh(&idev->lock);
 		for (mc = idev->mc_list; mc; mc=mc->next) {
@@ -992,8 +991,8 @@ int ipv6_chk_mcast_addr(struct net_device *dev, const struct in6_addr *group,
 				rv = 1; /* don't filter unspecified source */
 		}
 		read_unlock_bh(&idev->lock);
-		in6_dev_put(idev);
 	}
+	rcu_read_unlock();
 	return rv;
 }
 
@@ -1104,6 +1103,7 @@ static int mld_marksources(struct ifmcaddr6 *pmc, int nsrcs,
 	return 1;
 }
 
+/* called with rcu_read_lock() */
 int igmp6_event_query(struct sk_buff *skb)
 {
 	struct mld2_query *mlh2 = NULL;
@@ -1127,7 +1127,7 @@ int igmp6_event_query(struct sk_buff *skb)
 	if (!(ipv6_addr_type(&ipv6_hdr(skb)->saddr) & IPV6_ADDR_LINKLOCAL))
 		return -EINVAL;
 
-	idev = in6_dev_get(skb->dev);
+	idev = __in6_dev_get(skb->dev);
 
 	if (idev == NULL)
 		return 0;
@@ -1137,10 +1137,8 @@ int igmp6_event_query(struct sk_buff *skb)
 	group_type = ipv6_addr_type(group);
 
 	if (group_type != IPV6_ADDR_ANY &&
-	    !(group_type&IPV6_ADDR_MULTICAST)) {
-		in6_dev_put(idev);
+	    !(group_type&IPV6_ADDR_MULTICAST))
 		return -EINVAL;
-	}
 
 	if (len == 24) {
 		int switchback;
@@ -1161,10 +1159,9 @@ int igmp6_event_query(struct sk_buff *skb)
 	} else if (len >= 28) {
 		int srcs_offset = sizeof(struct mld2_query) -
 				  sizeof(struct icmp6hdr);
-		if (!pskb_may_pull(skb, srcs_offset)) {
-			in6_dev_put(idev);
+		if (!pskb_may_pull(skb, srcs_offset))
 			return -EINVAL;
-		}
+
 		mlh2 = (struct mld2_query *)skb_transport_header(skb);
 		max_delay = (MLDV2_MRC(ntohs(mlh2->mld2q_mrc))*HZ)/1000;
 		if (!max_delay)
@@ -1173,28 +1170,23 @@ int igmp6_event_query(struct sk_buff *skb)
 		if (mlh2->mld2q_qrv)
 			idev->mc_qrv = mlh2->mld2q_qrv;
 		if (group_type == IPV6_ADDR_ANY) { /* general query */
-			if (mlh2->mld2q_nsrcs) {
-				in6_dev_put(idev);
+			if (mlh2->mld2q_nsrcs)
 				return -EINVAL; /* no sources allowed */
-			}
+
 			mld_gq_start_timer(idev);
-			in6_dev_put(idev);
 			return 0;
 		}
 		/* mark sources to include, if group & source-specific */
 		if (mlh2->mld2q_nsrcs != 0) {
 			if (!pskb_may_pull(skb, srcs_offset +
-			    ntohs(mlh2->mld2q_nsrcs) * sizeof(struct in6_addr))) {
-				in6_dev_put(idev);
+			    ntohs(mlh2->mld2q_nsrcs) * sizeof(struct in6_addr)))
 				return -EINVAL;
-			}
+
 			mlh2 = (struct mld2_query *)skb_transport_header(skb);
 			mark = 1;
 		}
-	} else {
-		in6_dev_put(idev);
+	} else
 		return -EINVAL;
-	}
 
 	read_lock_bh(&idev->lock);
 	if (group_type == IPV6_ADDR_ANY) {
@@ -1227,12 +1219,11 @@ int igmp6_event_query(struct sk_buff *skb)
 		}
 	}
 	read_unlock_bh(&idev->lock);
-	in6_dev_put(idev);
 
 	return 0;
 }
 
-
+/* called with rcu_read_lock() */
 int igmp6_event_report(struct sk_buff *skb)
 {
 	struct ifmcaddr6 *ma;
@@ -1260,7 +1251,7 @@ int igmp6_event_report(struct sk_buff *skb)
 	    !(addr_type&IPV6_ADDR_LINKLOCAL))
 		return -EINVAL;
 
-	idev = in6_dev_get(skb->dev);
+	idev = __in6_dev_get(skb->dev);
 	if (idev == NULL)
 		return -ENODEV;
 
@@ -1280,7 +1271,6 @@ int igmp6_event_report(struct sk_buff *skb)
 		}
 	}
 	read_unlock_bh(&idev->lock);
-	in6_dev_put(idev);
 	return 0;
 }
 
@@ -1396,12 +1386,14 @@ static void mld_sendpack(struct sk_buff *skb)
 	struct mld2_report *pmr =
 			      (struct mld2_report *)skb_transport_header(skb);
 	int payload_len, mldlen;
-	struct inet6_dev *idev = in6_dev_get(skb->dev);
+	struct inet6_dev *idev;
 	struct net *net = dev_net(skb->dev);
 	int err;
 	struct flowi fl;
 	struct dst_entry *dst;
 
+	rcu_read_lock();
+	idev = __in6_dev_get(skb->dev);
 	IP6_UPD_PO_STATS(net, idev, IPSTATS_MIB_OUT, skb->len);
 
 	payload_len = (skb->tail - skb->network_header) - sizeof(*pip6);
@@ -1441,8 +1433,7 @@ out:
 	} else
 		IP6_INC_STATS_BH(net, idev, IPSTATS_MIB_OUTDISCARDS);
 
-	if (likely(idev != NULL))
-		in6_dev_put(idev);
+	rcu_read_unlock();
 	return;
 
 err_out:
@@ -1779,7 +1770,8 @@ static void igmp6_send(struct in6_addr *addr, struct net_device *dev, int type)
 					 IPPROTO_ICMPV6,
 					 csum_partial(hdr, len, 0));
 
-	idev = in6_dev_get(skb->dev);
+	rcu_read_lock();
+	idev = __in6_dev_get(skb->dev);
 
 	dst = icmp6_dst_alloc(skb->dev, NULL, &ipv6_hdr(skb)->daddr);
 	if (!dst) {
@@ -1806,8 +1798,7 @@ out:
 	} else
 		IP6_INC_STATS(net, idev, IPSTATS_MIB_OUTDISCARDS);
 
-	if (likely(idev != NULL))
-		in6_dev_put(idev);
+	rcu_read_unlock();
 	return;
 
 err_out:



^ permalink raw reply related

* Re: [PATCH] fix a race at the end of NAPI
From: Eric Dumazet @ 2010-06-08  7:07 UTC (permalink / raw)
  To: Figo.zhang; +Cc: David S. Miller, netdev
In-Reply-To: <1275979719.1927.4.camel@myhost>

Le mardi 08 juin 2010 à 14:48 +0800, Figo.zhang a écrit :
> fix a race at the end of NAPI complete processing. it had better do __napi_complete() 
> first before re-enable interrupt.
> 
> Signed-off-by: Figo.zhang <figo1802@gmail.com>

Patch title is misleading

> --- 
>  drivers/net/8139cp.c  |    2 +-
>  drivers/net/8139too.c |    2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c
> old mode 100644
> new mode 100755

Why do you change file modes ?

> index 9c14975..284a5f4
> --- a/drivers/net/8139cp.c
> +++ b/drivers/net/8139cp.c
> @@ -598,8 +598,8 @@ rx_next:
>  			goto rx_status_loop;
>  
>  		spin_lock_irqsave(&cp->lock, flags);
> -		cpw16_f(IntrMask, cp_intr_mask);
>  		__napi_complete(napi);
> +		cpw16_f(IntrMask, cp_intr_mask);
>  		spin_unlock_irqrestore(&cp->lock, flags);
>  	}
>  
> diff --git a/drivers/net/8139too.c b/drivers/net/8139too.c
> old mode 100644
> new mode 100755
> index 4ba7293..a7bca8c
> --- a/drivers/net/8139too.c
> +++ b/drivers/net/8139too.c
> @@ -2088,8 +2088,8 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
>  		 * again when we think we are done.
>  		 */
>  		spin_lock_irqsave(&tp->lock, flags);
> -		RTL_W16_F(IntrMask, rtl8139_intr_mask);
>  		__napi_complete(napi);
> +		RTL_W16_F(IntrMask, rtl8139_intr_mask);
>  		spin_unlock_irqrestore(&tp->lock, flags);
>  	}
>  	spin_unlock(&tp->rx_lock);
> 
> 



^ permalink raw reply

* [PATCH v2] net8139: fix a race at the end of NAPI
From: Figo.zhang @ 2010-06-08  7:13 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev
In-Reply-To: <1275979719.1927.4.camel@myhost>

 
fix a race at the end of NAPI complete processing, it had
better do __napi_complete() first before re-enable interrupt.

in v2, i motify it using vim.

Signed-off-by:Figo.zhang <figo1802@gmail.com>
---
 drivers/net/8139cp.c  |    2 +-
 drivers/net/8139too.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/net/8139cp.c b/drivers/net/8139cp.c
index 9c14975..284a5f4 100644
--- a/drivers/net/8139cp.c
+++ b/drivers/net/8139cp.c
@@ -598,8 +598,8 @@ rx_next:
 			goto rx_status_loop;
 
 		spin_lock_irqsave(&cp->lock, flags);
-		cpw16_f(IntrMask, cp_intr_mask);
 		__napi_complete(napi);
+		cpw16_f(IntrMask, cp_intr_mask);
 		spin_unlock_irqrestore(&cp->lock, flags);
 	}
 
diff --git a/drivers/net/8139too.c b/drivers/net/8139too.c
index 4ba7293..a7bca8c 100644
--- a/drivers/net/8139too.c
+++ b/drivers/net/8139too.c
@@ -2088,8 +2088,8 @@ static int rtl8139_poll(struct napi_struct *napi, int budget)
 		 * again when we think we are done.
 		 */
 		spin_lock_irqsave(&tp->lock, flags);
-		RTL_W16_F(IntrMask, rtl8139_intr_mask);
 		__napi_complete(napi);
+		RTL_W16_F(IntrMask, rtl8139_intr_mask);
 		spin_unlock_irqrestore(&tp->lock, flags);
 	}
 	spin_unlock(&tp->rx_lock);



^ permalink raw reply related

* Re: 2.6.35-rc2-git1 - net/mac80211/sta_info.c:125 invoked rcu_dereference_check() without protection!
From: Johannes Berg @ 2010-06-08  7:26 UTC (permalink / raw)
  To: paulmck
  Cc: Miles Lane, Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan
In-Reply-To: <20100607235941.GD2387@linux.vnet.ibm.com>

On Mon, 2010-06-07 at 16:59 -0700, Paul E. McKenney wrote:
> On Mon, Jun 07, 2010 at 02:25:44PM -0400, Miles Lane wrote:
> > [   43.478812] [ INFO: suspicious rcu_dereference_check() usage. ]
> > [   43.478815] ---------------------------------------------------
> > [   43.478820] net/mac80211/sta_info.c:125 invoked
> > rcu_dereference_check() without protection!
> > [   43.478824]
> > [   43.478824] other info that might help us debug this:
> > [   43.478826]
> > [   43.478829]
> > [   43.478830] rcu_scheduler_active = 1, debug_locks = 1
> > [   43.478834] no locks held by NetworkManager/4017.
> 
> Hmmm...  Johannes's update has been merged, and it requires that callers
> either be in an RCU read-side critical section or hold either the
> ->sta_lock or the ->sta_mtx, and this thread does none of this.
> 
> Johannes, any thoughts?
> 
> 							Thanx, Paul
> 
> > [   43.478837] stack backtrace:
> > [   43.478842] Pid: 4017, comm: NetworkManager Not tainted 2.6.35-rc2-git1 #8
> > [   43.478846] Call Trace:
> > [   43.478849]  <IRQ>  [<ffffffff81064e9c>] lockdep_rcu_dereference+0x9d/0xa5
> > [   43.478876]  [<ffffffffa010cb3c>] sta_info_get_bss+0x71/0x12d [mac80211]
> > [   43.478889]  [<ffffffffa010cc0d>] ieee80211_find_sta+0x15/0x2f [mac80211]
> > [   43.478902]  [<ffffffffa019ae16>] iwlagn_tx_queue_reclaim+0xe7/0x1bb [iwlagn]

iwlwifi wasn't using rcu protection here -- already sent a patch fixing
it. My mistake, I  think. Thanks for checking :)

johannes

^ permalink raw reply

* [PATCH net-2.6] ipv6: fix ICMP6_MIB_OUTERRORS
From: Eric Dumazet @ 2010-06-08  8:24 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

In commit 1f8438a85366 (icmp: Account for ICMP out errors), I did a typo
on IPV6 side, using ICMP6_MIB_OUTMSGS instead of ICMP6_MIB_OUTERRORS

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index ce79929..03e62f9 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -483,7 +483,7 @@ route_done:
 			      np->tclass, NULL, &fl, (struct rt6_info*)dst,
 			      MSG_DONTWAIT, np->dontfrag);
 	if (err) {
-		ICMP6_INC_STATS_BH(net, idev, ICMP6_MIB_OUTMSGS);
+		ICMP6_INC_STATS_BH(net, idev, ICMP6_MIB_OUTERRORS);
 		ip6_flush_pending_frames(sk);
 		goto out_put;
 	}
@@ -565,7 +565,7 @@ static void icmpv6_echo_reply(struct sk_buff *skb)
 				np->dontfrag);
 
 	if (err) {
-		ICMP6_INC_STATS_BH(net, idev, ICMP6_MIB_OUTMSGS);
+		ICMP6_INC_STATS_BH(net, idev, ICMP6_MIB_OUTERRORS);
 		ip6_flush_pending_frames(sk);
 		goto out_put;
 	}



^ permalink raw reply related

* Re: [v5 Patch 1/3] netpoll: add generic support for bridge and bonding devices
From: Cong Wang @ 2010-06-08  8:36 UTC (permalink / raw)
  To: David Miller
  Cc: andy, fubar, fbl, linux-kernel, mpm, netdev, bridge, gospo,
	nhorman, jmoyer, shemminger, bonding-devel
In-Reply-To: <20100607.030108.235696592.davem@davemloft.net>

On 06/07/10 18:01, David Miller wrote:
> From: Cong Wang<amwang@redhat.com>
> Date: Mon, 07 Jun 2010 17:57:49 +0800
>
>> Hmm, I still feel like this way is ugly, although it may work.
>> I guess David doesn't like it either.
>
> Of course I don't like it. :-)
>
> I suspect the locking scheme will need to be changed.
>
> Besides, if we're going to hack this up and do write lock attempts in
> the read locking paths, there is no point in using a rwlock any more.
> And I'm personally in disfavor of all rwlock usage anyways (it dirties
> the cacheline for readers just as equally for writers, and if the
> critically protected code path is short enough, that shared cache
> line atomic operation will be the predominant cost).
>
> So I'd say, 1) make this a spinlock and 2) try to use RCU for the
> read path.
>
> That would fix everything.

Yeah, agreed. Even not talking about netconsole, bonding code
does have locking problems, netconsole just makes this problem
clear.

I will try your suggestions above.

Thanks!

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox