Netdev List
 help / color / mirror / Atom feed
* cxgb4i_v4.2 submission
From: Rakesh Ranjan @ 2010-06-07 20:04 UTC (permalink / raw)
  To: LK-NetDev, LK-SCSIDev, LK-iSCSIDev
  Cc: LKML, Karen Xie, David Miller, James Bottomley, Mike Christie,
	Anish Bhatt

The following 3 patches add a new iscsi LLD driver cxgb4i to enable iscsi offload
support on Chelsio's new 1G and 10G cards. This is updated version of previous cxgb4i
patch. Please share you commnets after review.

Changes since cxgb4i_v4:
1. Convert some more functions to be static.
2. Fixed some issues in cleanup path.

[PATCH 1/3] cxgb4i_v4.2 : add build support
[PATCH 2/3] cxgb4i_v4.2 : libcxgbi common library part
[PATCH 3/3] cxgb4i_v4.2 : main driver files

Regards
Rakesh Ranjan

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply

* [PATCH 1/3] cxgb4i_v4.2 : add build support
From: Rakesh Ranjan @ 2010-06-07 20:04 UTC (permalink / raw)
  To: LK-NetDev, LK-SCSIDev, LK-iSCSIDev
  Cc: LKML, Karen Xie, David Miller, James Bottomley, Mike Christie,
	Anish Bhatt, Rakesh Ranjan
In-Reply-To: <1275941078-1779-1-git-send-email-rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

From: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>


Signed-off-by: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
 drivers/scsi/Kconfig       |    1 +
 drivers/scsi/Makefile      |    1 +
 drivers/scsi/cxgbi/Kbuild  |    5 +++++
 drivers/scsi/cxgbi/Kconfig |    7 +++++++
 4 files changed, 14 insertions(+), 0 deletions(-)
 create mode 100644 drivers/scsi/cxgbi/Kbuild
 create mode 100644 drivers/scsi/cxgbi/Kconfig

diff --git a/drivers/scsi/Kconfig b/drivers/scsi/Kconfig
index 75f2336..70983a8 100644
--- a/drivers/scsi/Kconfig
+++ b/drivers/scsi/Kconfig
@@ -371,6 +371,7 @@ config ISCSI_TCP
 	 http://open-iscsi.org
 
 source "drivers/scsi/cxgb3i/Kconfig"
+source "drivers/scsi/cxgbi/Kconfig"
 source "drivers/scsi/bnx2i/Kconfig"
 source "drivers/scsi/be2iscsi/Kconfig"
 
diff --git a/drivers/scsi/Makefile b/drivers/scsi/Makefile
index 1c7ac49..b0873aa 100644
--- a/drivers/scsi/Makefile
+++ b/drivers/scsi/Makefile
@@ -133,6 +133,7 @@ obj-$(CONFIG_SCSI_STEX)		+= stex.o
 obj-$(CONFIG_SCSI_MVSAS)	+= mvsas/
 obj-$(CONFIG_PS3_ROM)		+= ps3rom.o
 obj-$(CONFIG_SCSI_CXGB3_ISCSI)	+= libiscsi.o libiscsi_tcp.o cxgb3i/
+obj-$(CONFIG_SCSI_CXGB4_ISCSI)	+= libiscsi.o libiscsi_tcp.o cxgbi/
 obj-$(CONFIG_SCSI_BNX2_ISCSI)	+= libiscsi.o bnx2i/
 obj-$(CONFIG_BE2ISCSI)		+= libiscsi.o be2iscsi/
 obj-$(CONFIG_SCSI_PMCRAID)	+= pmcraid.o
diff --git a/drivers/scsi/cxgbi/Kbuild b/drivers/scsi/cxgbi/Kbuild
new file mode 100644
index 0000000..03291c5
--- /dev/null
+++ b/drivers/scsi/cxgbi/Kbuild
@@ -0,0 +1,5 @@
+EXTRA_CFLAGS += -I$(srctree)/drivers/net/cxgb4
+
+obj-$(CONFIG_SCSI_CXGB4_ISCSI) += cxgb4i.o libcxgbi.o
+cxgb4i-y := cxgb4i_init.o cxgb4i_offload.o cxgb4i_ddp.o
+
diff --git a/drivers/scsi/cxgbi/Kconfig b/drivers/scsi/cxgbi/Kconfig
new file mode 100644
index 0000000..3f33dc2
--- /dev/null
+++ b/drivers/scsi/cxgbi/Kconfig
@@ -0,0 +1,7 @@
+config SCSI_CXGB4_ISCSI
+	tristate "Chelsio T4 iSCSI support"
+	depends on CHELSIO_T4_DEPENDS
+	select CHELSIO_T4
+	select SCSI_ISCSI_ATTRS
+	---help---
+	This driver supports iSCSI offload for the Chelsio T4 series devices.
-- 
1.6.6.1

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply related

* [PATCH 2/3] cxgb4i_v4.2 : libcxgbi common library part
From: Rakesh Ranjan @ 2010-06-07 20:04 UTC (permalink / raw)
  To: LK-NetDev, LK-SCSIDev, LK-iSCSIDev
  Cc: LKML, Karen Xie, David Miller, James Bottomley, Mike Christie,
	Anish Bhatt, Rakesh Ranjan
In-Reply-To: <1275941078-1779-2-git-send-email-rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

From: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>


Signed-off-by: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
 drivers/scsi/cxgbi/libcxgbi.c | 1521 +++++++++++++++++++++++++++++++++++++++++
 drivers/scsi/cxgbi/libcxgbi.h |  556 +++++++++++++++
 2 files changed, 2077 insertions(+), 0 deletions(-)
 create mode 100644 drivers/scsi/cxgbi/libcxgbi.c
 create mode 100644 drivers/scsi/cxgbi/libcxgbi.h

diff --git a/drivers/scsi/cxgbi/libcxgbi.c b/drivers/scsi/cxgbi/libcxgbi.c
new file mode 100644
index 0000000..d4ca4b4
--- /dev/null
+++ b/drivers/scsi/cxgbi/libcxgbi.c
@@ -0,0 +1,1521 @@
+/*
+ * libcxgbi.c: Chelsio common library for T3/T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#include <linux/skbuff.h>
+#include <linux/crypto.h>
+#include <linux/scatterlist.h>
+#include <linux/pci.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_host.h>
+#include <linux/if_vlan.h>
+#include <net/dst.h>
+#include <net/route.h>
+#include <net/tcp.h>
+
+#include "libcxgbi.h"
+
+MODULE_AUTHOR("Chelsio Communications");
+MODULE_DESCRIPTION("Chelsio libcxgbi common library");
+MODULE_LICENSE("GPL");
+
+static LIST_HEAD(cdev_list);
+static DEFINE_MUTEX(cdev_rwlock);
+
+static void cxgbi_release_itt(struct iscsi_task *, itt_t);
+static int cxgbi_reserve_itt(struct iscsi_task *, itt_t *);
+
+struct cxgbi_device *cxgbi_device_register(unsigned int dd_size,
+					unsigned int nports)
+{
+	struct cxgbi_device *cdev;
+
+	cdev = kzalloc(sizeof(*cdev) + dd_size, GFP_KERNEL);
+	if (!cdev)
+		return NULL;
+
+	cdev->hbas = kzalloc(sizeof(struct cxgbi_hba **) *  nports, GFP_KERNEL);
+	if (!cdev->hbas) {
+		kfree(cdev);
+		return NULL;
+	}
+
+	mutex_lock(&cdev_rwlock);
+	list_add_tail(&cdev->list_head, &cdev_list);
+	mutex_unlock(&cdev_rwlock);
+	return cdev;
+}
+EXPORT_SYMBOL_GPL(cxgbi_device_register);
+
+void cxgbi_device_unregister(struct cxgbi_device *cdev)
+{
+	mutex_lock(&cdev_rwlock);
+	list_del(&cdev->list_head);
+	mutex_unlock(&cdev_rwlock);
+
+	kfree(cdev->hbas);
+	kfree(cdev);
+}
+EXPORT_SYMBOL_GPL(cxgbi_device_unregister);
+
+static struct cxgbi_hba *cxgbi_hba_find_by_netdev(struct net_device *dev,
+						struct cxgbi_device *cdev)
+{
+	int i;
+
+	if (dev->priv_flags & IFF_802_1Q_VLAN)
+		dev = vlan_dev_real_dev(dev);
+
+	for (i = 0; i < cdev->nports; i++) {
+		if (cdev->hbas[i]->ndev == dev)
+			return cdev->hbas[i];
+	}
+
+	return NULL;
+}
+
+static struct rtable *find_route(struct net_device *dev,
+				__be32 saddr, __be32 daddr,
+				__be16 sport, __be16 dport,
+				u8 tos)
+{
+	struct rtable *rt;
+	struct flowi fl = {
+		.oif = dev ? dev->ifindex : 0,
+		.nl_u = {
+			.ip4_u = {
+				.daddr = daddr,
+				.saddr = saddr,
+				.tos = tos }
+			},
+		.proto = IPPROTO_TCP,
+		.uli_u = {
+			.ports = {
+				.sport = sport,
+				.dport = dport }
+			}
+	};
+
+	if (ip_route_output_flow(dev ? dev_net(dev) : &init_net,
+					&rt, &fl, NULL, 0))
+		return NULL;
+
+	return rt;
+}
+
+static struct net_device *cxgbi_find_dev(struct net_device *dev,
+					__be32 ipaddr)
+{
+	struct flowi fl;
+	struct rtable *rt;
+	int err;
+
+	memset(&fl, 0, sizeof(fl));
+	fl.nl_u.ip4_u.daddr = ipaddr;
+
+	err = ip_route_output_key(dev ? dev_net(dev) : &init_net, &rt, &fl);
+	if (!err)
+		return (&rt->u.dst)->dev;
+
+	return NULL;
+}
+
+static int is_cxgbi_dev(struct net_device *dev, struct cxgbi_device *cdev)
+{
+	struct net_device *ndev = dev;
+	int i;
+
+	if (dev->priv_flags & IFF_802_1Q_VLAN)
+		ndev = vlan_dev_real_dev(dev);
+
+	for (i = 0; i < cdev->nports; i++) {
+		if (ndev == cdev->ports[i])
+			return 1;
+	}
+	return 0;
+}
+
+static struct net_device *cxgbi_find_egress_dev(struct net_device *root_dev,
+						struct cxgbi_device *cdev)
+{
+	while (root_dev) {
+		if (root_dev->priv_flags & IFF_802_1Q_VLAN)
+			root_dev = vlan_dev_real_dev(root_dev);
+		else if (is_cxgbi_dev(root_dev, cdev))
+			return root_dev;
+		else
+			return NULL;
+	}
+
+	return NULL;
+}
+
+static struct cxgbi_device *cxgbi_find_cdev(struct net_device *dev,
+					    __be32 ipaddr)
+{
+	struct flowi fl;
+	struct rtable *rt;
+	struct net_device *sdev = NULL;
+	struct cxgbi_device *cdev = NULL, *tmp;
+	int err, i;
+
+	memset(&fl, 0, sizeof(fl));
+	fl.nl_u.ip4_u.daddr = ipaddr;
+
+	err = ip_route_output_key(dev ? dev_net(dev) : &init_net, &rt, &fl);
+	if (err)
+		goto out;
+
+	sdev = (&rt->u.dst)->dev;
+	mutex_lock(&cdev_rwlock);
+	list_for_each_entry_safe(cdev, tmp, &cdev_list, list_head) {
+		if (cdev) {
+			for (i = 0; i < cdev->nports; i++) {
+				if (sdev == cdev->ports[i]) {
+					mutex_unlock(&cdev_rwlock);
+					return cdev;
+				}
+			}
+		}
+	}
+	mutex_unlock(&cdev_rwlock);
+out:	return cdev;
+}
+
+/*
+ * pdu receive, interact with libiscsi_tcp
+ */
+static inline int read_pdu_skb(struct iscsi_conn *conn,
+			       struct sk_buff *skb,
+			       unsigned int offset,
+			       int offloaded)
+{
+	int status = 0;
+	int bytes_read;
+
+	bytes_read = iscsi_tcp_recv_skb(conn, skb, offset, offloaded, &status);
+	switch (status) {
+	case ISCSI_TCP_CONN_ERR:
+		return -EIO;
+	case ISCSI_TCP_SUSPENDED:
+		/* no transfer - just have caller flush queue */
+		return bytes_read;
+	case ISCSI_TCP_SKB_DONE:
+		/*
+		 * pdus should always fit in the skb and we should get
+		 * segment done notifcation.
+		 */
+		iscsi_conn_printk(KERN_ERR, conn, "Invalid pdu or skb.");
+		return -EFAULT;
+	case ISCSI_TCP_SEGMENT_DONE:
+		return bytes_read;
+	default:
+		iscsi_conn_printk(KERN_ERR, conn, "Invalid iscsi_tcp_recv_skb "
+				  "status %d\n", status);
+		return -EINVAL;
+	}
+}
+
+static int cxgbi_conn_read_bhs_pdu_skb(struct iscsi_conn *conn,
+				       struct sk_buff *skb)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	int rc;
+
+	cxgbi_rx_debug("conn 0x%p, skb 0x%p, len %u, flag 0x%x.\n",
+			conn, skb, skb->len, cdev->get_skb_ulp_mode(skb));
+
+	if (!iscsi_tcp_recv_segment_is_hdr(tcp_conn)) {
+		iscsi_conn_failure(conn, ISCSI_ERR_PROTO);
+		return -EIO;
+	}
+
+	if (conn->hdrdgst_en && (cdev->get_skb_ulp_mode(skb)
+				& ULP2_FLAG_HCRC_ERROR)) {
+		iscsi_conn_failure(conn, ISCSI_ERR_HDR_DGST);
+		return -EIO;
+	}
+
+	rc = read_pdu_skb(conn, skb, 0, 0);
+	if (rc <= 0)
+		return rc;
+
+	return 0;
+}
+
+static int cxgbi_conn_read_data_pdu_skb(struct iscsi_conn *conn,
+					struct sk_buff *skb)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	bool offloaded = 0;
+	unsigned int offset = 0;
+	int rc;
+
+	cxgbi_rx_debug("conn 0x%p, skb 0x%p, len %u, flag 0x%x.\n",
+			conn, skb, skb->len, cdev->get_skb_ulp_mode(skb));
+
+	if (conn->datadgst_en &&
+		(cdev->get_skb_ulp_mode(skb) & ULP2_FLAG_DCRC_ERROR)) {
+		iscsi_conn_failure(conn, ISCSI_ERR_DATA_DGST);
+		return -EIO;
+	}
+
+	if (iscsi_tcp_recv_segment_is_hdr(tcp_conn))
+		return 0;
+
+	if (conn->hdrdgst_en)
+		offset = ISCSI_DIGEST_SIZE;
+
+	if (cdev->get_skb_ulp_mode(skb) & ULP2_FLAG_DATA_DDPED) {
+		cxgbi_rx_debug("skb 0x%p, opcode 0x%x, data %u, ddp'ed, "
+				"itt 0x%x.\n",
+				skb,
+				tcp_conn->in.hdr->opcode & ISCSI_OPCODE_MASK,
+				tcp_conn->in.datalen,
+				ntohl(tcp_conn->in.hdr->itt));
+		offloaded = 1;
+	} else {
+		cxgbi_rx_debug("skb 0x%p, opcode 0x%x, data %u, NOT ddp'ed, "
+				"itt 0x%x.\n",
+				skb,
+				tcp_conn->in.hdr->opcode & ISCSI_OPCODE_MASK,
+				tcp_conn->in.datalen,
+				ntohl(tcp_conn->in.hdr->itt));
+	}
+
+	rc = read_pdu_skb(conn, skb, 0, offloaded);
+	if (rc < 0)
+		return rc;
+
+	return 0;
+}
+
+void cxgbi_conn_pdu_ready(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb;
+	unsigned int read = 0;
+	struct iscsi_conn *conn = csk->user_data;
+	int err = 0;
+
+	cxgbi_rx_debug("csk 0x%p.\n", csk);
+
+	read_lock(&csk->callback_lock);
+	if (unlikely(!conn || conn->suspend_rx)) {
+		cxgbi_rx_debug("conn 0x%p, id %d, suspend_rx %lu!\n",
+				conn, conn ? conn->id : 0xFF,
+				conn ? conn->suspend_rx : 0xFF);
+		read_unlock(&csk->callback_lock);
+		return;
+	}
+
+	skb = skb_peek(&csk->receive_queue);
+	while (!err && skb) {
+		__skb_unlink(skb, &csk->receive_queue);
+		read += csk->cdev->get_skb_rx_pdulen(skb);
+		cxgbi_rx_debug("conn 0x%p, csk 0x%p, rx skb 0x%p, pdulen %u\n",
+				conn, csk, skb,
+				csk->cdev->get_skb_rx_pdulen(skb));
+		if (csk->flags & CTPF_MSG_COALESCED) {
+			err = cxgbi_conn_read_bhs_pdu_skb(conn, skb);
+			err = cxgbi_conn_read_data_pdu_skb(conn, skb);
+		} else {
+			if (csk->cdev->get_skb_flags(skb) &
+			    CTP_SKCBF_HDR_RCVD)
+				err = cxgbi_conn_read_bhs_pdu_skb(conn, skb);
+			else if (csk->cdev->get_skb_flags(skb) ==
+				CTP_SKCBF_DATA_RCVD)
+				err = cxgbi_conn_read_data_pdu_skb(conn, skb);
+		}
+		__kfree_skb(skb);
+		skb = skb_peek(&csk->receive_queue);
+	}
+	cxgbi_log_debug("read %d\n", read);
+	read_unlock(&csk->callback_lock);
+	csk->copied_seq += read;
+	csk->cdev->sock_rx_credits(csk, read);
+	conn->rxdata_octets += read;
+
+	if (err) {
+		cxgbi_log_info("conn 0x%p rx failed err %d.\n", conn, err);
+		iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED);
+	}
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_pdu_ready);
+
+static int sgl_seek_offset(struct scatterlist *sgl, unsigned int sgcnt,
+				unsigned int offset, unsigned int *off,
+				struct scatterlist **sgp)
+{
+	int i;
+	struct scatterlist *sg;
+
+	for_each_sg(sgl, sg, sgcnt, i) {
+		if (offset < sg->length) {
+			*off = offset;
+			*sgp = sg;
+			return 0;
+		}
+		offset -= sg->length;
+	}
+	return -EFAULT;
+}
+
+static int sgl_read_to_frags(struct scatterlist *sg, unsigned int sgoffset,
+				unsigned int dlen, skb_frag_t *frags,
+				int frag_max)
+{
+	unsigned int datalen = dlen;
+	unsigned int sglen = sg->length - sgoffset;
+	struct page *page = sg_page(sg);
+	int i;
+
+	i = 0;
+	do {
+		unsigned int copy;
+
+		if (!sglen) {
+			sg = sg_next(sg);
+			if (!sg) {
+				cxgbi_log_error("sg NULL, len %u/%u.\n",
+								datalen, dlen);
+				return -EINVAL;
+			}
+			sgoffset = 0;
+			sglen = sg->length;
+			page = sg_page(sg);
+
+		}
+		copy = min(datalen, sglen);
+		if (i && page == frags[i - 1].page &&
+		    sgoffset + sg->offset ==
+			frags[i - 1].page_offset + frags[i - 1].size) {
+			frags[i - 1].size += copy;
+		} else {
+			if (i >= frag_max) {
+				cxgbi_log_error("too many pages %u, "
+						 "dlen %u.\n", frag_max, dlen);
+				return -EINVAL;
+			}
+
+			frags[i].page = page;
+			frags[i].page_offset = sg->offset + sgoffset;
+			frags[i].size = copy;
+			i++;
+		}
+		datalen -= copy;
+		sgoffset += copy;
+		sglen -= copy;
+	} while (datalen);
+
+	return i;
+}
+
+int cxgbi_conn_alloc_pdu(struct iscsi_task *task, u8 opcode)
+{
+	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	struct iscsi_conn *conn = task->conn;
+	struct iscsi_tcp_task *tcp_task = task->dd_data;
+	struct cxgbi_task_data *tdata = task->dd_data + sizeof(*tcp_task);
+	struct scsi_cmnd *sc = task->sc;
+	int headroom = SKB_TX_PDU_HEADER_LEN;
+
+	tcp_task->dd_data = tdata;
+	task->hdr = NULL;
+
+	/* write command, need to send data pdus */
+	if (cdev->skb_extra_headroom && (opcode == ISCSI_OP_SCSI_DATA_OUT ||
+	    (opcode == ISCSI_OP_SCSI_CMD &&
+	    (scsi_bidi_cmnd(sc) || sc->sc_data_direction == DMA_TO_DEVICE))))
+		headroom += min(cdev->skb_extra_headroom,
+					conn->max_xmit_dlength);
+
+	tdata->skb = alloc_skb(cdev->skb_tx_headroom + headroom, GFP_ATOMIC);
+	if (!tdata->skb)
+		return -ENOMEM;
+
+	skb_reserve(tdata->skb, cdev->skb_tx_headroom);
+	cxgbi_tx_debug("task 0x%p, opcode 0x%x, skb 0x%p.\n",
+			task, opcode, tdata->skb);
+	task->hdr = (struct iscsi_hdr *)tdata->skb->data;
+	task->hdr_max = SKB_TX_PDU_HEADER_LEN;
+
+	/* data_out uses scsi_cmd's itt */
+	if (opcode != ISCSI_OP_SCSI_DATA_OUT)
+		cxgbi_reserve_itt(task, &task->hdr->itt);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_alloc_pdu);
+
+int cxgbi_conn_init_pdu(struct iscsi_task *task, unsigned int offset,
+			      unsigned int count)
+{
+	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	struct iscsi_conn *conn = task->conn;
+	struct iscsi_tcp_task *tcp_task = task->dd_data;
+	struct cxgbi_task_data *tdata = tcp_task->dd_data;
+	struct sk_buff *skb = tdata->skb;
+	unsigned int datalen = count;
+	int i, padlen = iscsi_padding(count);
+	struct page *pg;
+
+	cxgbi_tx_debug("task 0x%p,0x%p, offset %u, count %u, skb 0x%p.\n",
+			task, task->sc, offset, count, skb);
+
+	skb_put(skb, task->hdr_len);
+	cdev->set_skb_txmode(skb, conn->hdrdgst_en,
+			     datalen ? conn->datadgst_en : 0);
+	if (!count)
+		return 0;
+
+	if (task->sc) {
+		struct scsi_data_buffer *sdb = scsi_out(task->sc);
+		struct scatterlist *sg = NULL;
+		int err;
+
+		tdata->offset = offset;
+		tdata->count = count;
+		err = sgl_seek_offset(sdb->table.sgl, sdb->table.nents,
+					tdata->offset, &tdata->sgoffset, &sg);
+		if (err < 0) {
+			cxgbi_log_warn("tpdu, sgl %u, bad offset %u/%u.\n",
+					sdb->table.nents, tdata->offset,
+					sdb->length);
+			return err;
+		}
+		err = sgl_read_to_frags(sg, tdata->sgoffset, tdata->count,
+					tdata->frags, MAX_PDU_FRAGS);
+		if (err < 0) {
+			cxgbi_log_warn("tpdu, sgl %u, bad offset %u + %u.\n",
+					sdb->table.nents, tdata->offset,
+					tdata->count);
+			return err;
+		}
+		tdata->nr_frags = err;
+
+		if (tdata->nr_frags > MAX_SKB_FRAGS ||
+		    (padlen && tdata->nr_frags == MAX_SKB_FRAGS)) {
+			char *dst = skb->data + task->hdr_len;
+			skb_frag_t *frag = tdata->frags;
+
+			/* data fits in the skb's headroom */
+			for (i = 0; i < tdata->nr_frags; i++, frag++) {
+				char *src = kmap_atomic(frag->page,
+							KM_SOFTIRQ0);
+
+				memcpy(dst, src+frag->page_offset, frag->size);
+				dst += frag->size;
+				kunmap_atomic(src, KM_SOFTIRQ0);
+			}
+			if (padlen) {
+				memset(dst, 0, padlen);
+				padlen = 0;
+			}
+			skb_put(skb, count + padlen);
+		} else {
+			/* data fit into frag_list */
+			for (i = 0; i < tdata->nr_frags; i++)
+				get_page(tdata->frags[i].page);
+
+			memcpy(skb_shinfo(skb)->frags, tdata->frags,
+				sizeof(skb_frag_t) * tdata->nr_frags);
+			skb_shinfo(skb)->nr_frags = tdata->nr_frags;
+			skb->len += count;
+			skb->data_len += count;
+			skb->truesize += count;
+		}
+
+	} else {
+		pg = virt_to_page(task->data);
+
+		get_page(pg);
+		skb_fill_page_desc(skb, 0, pg, offset_in_page(task->data),
+					count);
+		skb->len += count;
+		skb->data_len += count;
+		skb->truesize += count;
+	}
+
+	if (padlen) {
+		i = skb_shinfo(skb)->nr_frags;
+		get_page(cdev->pad_page);
+		skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags,
+					cdev->pad_page, 0, padlen);
+
+		skb->data_len += padlen;
+		skb->truesize += padlen;
+		skb->len += padlen;
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_init_pdu);
+
+int cxgbi_conn_xmit_pdu(struct iscsi_task *task)
+{
+	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	struct iscsi_tcp_task *tcp_task = task->dd_data;
+	struct cxgbi_task_data *tdata = tcp_task->dd_data;
+	struct sk_buff *skb = tdata->skb;
+	unsigned int datalen;
+	int err;
+
+	if (!skb)
+		return 0;
+
+	datalen = skb->data_len;
+	tdata->skb = NULL;
+	err = cdev->sock_send_pdus(cconn->cep->csk, skb);
+	if (err > 0) {
+		int pdulen = err;
+
+		cxgbi_tx_debug("task 0x%p, skb 0x%p, len %u/%u, rv %d.\n",
+				task, skb, skb->len, skb->data_len, err);
+
+		if (task->conn->hdrdgst_en)
+			pdulen += ISCSI_DIGEST_SIZE;
+
+		if (datalen && task->conn->datadgst_en)
+			pdulen += ISCSI_DIGEST_SIZE;
+
+		task->conn->txdata_octets += pdulen;
+		return 0;
+	}
+
+	if (err == -EAGAIN || err == -ENOBUFS) {
+		/* reset skb to send when we are called again */
+		tdata->skb = skb;
+		return err;
+	}
+
+	kfree_skb(skb);
+	cxgbi_tx_debug("itt 0x%x, skb 0x%p, len %u/%u, xmit err %d.\n",
+			task->itt, skb, skb->len, skb->data_len, err);
+	iscsi_conn_printk(KERN_ERR, task->conn, "xmit err %d.\n", err);
+	iscsi_conn_failure(task->conn, ISCSI_ERR_XMIT_FAILED);
+	return err;
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_xmit_pdu);
+
+int cxgbi_pdu_init(struct cxgbi_device *cdev)
+{
+	cdev->pad_page = alloc_page(GFP_KERNEL);
+	if (!cdev->pad_page)
+		return -ENOMEM;
+
+	memset(page_address(cdev->pad_page), 0, PAGE_SIZE);
+
+	if (cdev->skb_tx_headroom > (512 * MAX_SKB_FRAGS))
+		cdev->skb_extra_headroom = cdev->skb_tx_headroom;
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxgbi_pdu_init);
+
+void cxgbi_pdu_cleanup(struct cxgbi_device *cdev)
+{
+	if (cdev->pad_page) {
+		__free_page(cdev->pad_page);
+		cdev->pad_page = NULL;
+	}
+}
+EXPORT_SYMBOL_GPL(cxgbi_pdu_cleanup);
+
+void cxgbi_conn_tx_open(struct cxgbi_sock *csk)
+{
+	struct iscsi_conn *conn = csk->user_data;
+
+	if (conn) {
+		cxgbi_tx_debug("cn 0x%p, cid %d.\n", csk, conn->id);
+		iscsi_conn_queue_work(conn);
+	}
+}
+EXPORT_SYMBOL_GPL(cxgbi_conn_tx_open);
+
+static int cxgbi_sock_get_port(struct cxgbi_sock *csk)
+{
+	struct cxgbi_device *cdev = csk->cdev;
+	unsigned int start;
+	int idx;
+
+	if (!cdev->pmap)
+		goto error_out;
+
+	if (csk->saddr.sin_port) {
+		cxgbi_log_error("connect, sin_port none ZERO %u\n",
+				ntohs(csk->saddr.sin_port));
+		return -EADDRINUSE;
+	}
+
+	spin_lock_bh(&cdev->pmap->lock);
+	start = idx = cdev->pmap->next;
+
+	do {
+		if (++idx >= cdev->pmap->max_connect)
+			idx = 0;
+		if (!cdev->pmap->port_csk[idx]) {
+			csk->saddr.sin_port =
+				htons(cdev->pmap->sport_base + idx);
+			cdev->pmap->next = idx;
+			cdev->pmap->port_csk[idx] = csk;
+			spin_unlock_bh(&cdev->pmap->lock);
+			cxgbi_conn_debug("reserved port %u\n",
+					cdev->pmap->sport_base + idx);
+			return 0;
+		}
+	} while (idx != start);
+	spin_unlock_bh(&cdev->pmap->lock);
+error_out:
+	return -EADDRNOTAVAIL;
+}
+
+static void cxgbi_sock_put_port(struct cxgbi_sock *csk)
+{
+	struct cxgbi_device *cdev = csk->cdev;
+
+	if (csk->saddr.sin_port) {
+		int idx = ntohs(csk->saddr.sin_port) - cdev->pmap->sport_base;
+
+		csk->saddr.sin_port = 0;
+		if (idx < 0 || idx >= cdev->pmap->max_connect)
+			return;
+
+		spin_lock_bh(&cdev->pmap->lock);
+		cdev->pmap->port_csk[idx] = NULL;
+		spin_unlock_bh(&cdev->pmap->lock);
+		cxgbi_conn_debug("released port %u\n",
+				cdev->pmap->sport_base + idx);
+	}
+}
+
+static struct cxgbi_sock *cxgbi_sock_create(struct cxgbi_device *cdev)
+{
+	struct cxgbi_sock *csk = NULL;
+
+	csk = kzalloc(sizeof(*csk), GFP_NOIO);
+	if (!csk)
+		return NULL;
+
+	if (cdev->alloc_cpl_skbs(csk) < 0)
+		goto free_csk;
+
+	cxgbi_conn_debug("alloc csk: 0x%p\n", csk);
+
+	csk->flags = 0;
+	spin_lock_init(&csk->lock);
+	kref_init(&csk->refcnt);
+	skb_queue_head_init(&csk->receive_queue);
+	skb_queue_head_init(&csk->write_queue);
+	setup_timer(&csk->retry_timer, NULL, (unsigned long)csk);
+	rwlock_init(&csk->callback_lock);
+	csk->cdev = cdev;
+	return csk;
+free_csk:
+	cxgbi_api_debug("csk alloc failed %p, baling out\n", csk);
+	kfree(csk);
+	return NULL;
+}
+
+static int cxgbi_sock_connect(struct net_device *dev, struct cxgbi_sock *csk,
+			      struct sockaddr_in *sin)
+{
+	struct rtable *rt;
+	__be32 sipv4 = 0;
+	struct net_device *dstdev;
+	struct cxgbi_hba *chba = NULL;
+	int err;
+
+	cxgbi_conn_debug("csk 0x%p, dev 0x%p\n", csk, dev);
+
+	if (sin->sin_family != AF_INET)
+		return -EAFNOSUPPORT;
+
+	csk->daddr.sin_port = sin->sin_port;
+	csk->daddr.sin_addr.s_addr = sin->sin_addr.s_addr;
+
+	dstdev = cxgbi_find_dev(dev, sin->sin_addr.s_addr);
+	if (!dstdev || !is_cxgbi_dev(dstdev, csk->cdev))
+		return -ENETUNREACH;
+
+	if (dstdev->priv_flags & IFF_802_1Q_VLAN)
+		dev = dstdev;
+
+	rt = find_route(dev, csk->saddr.sin_addr.s_addr,
+			csk->daddr.sin_addr.s_addr,
+			csk->saddr.sin_port,
+			csk->daddr.sin_port,
+			0);
+	if (rt == NULL) {
+		cxgbi_conn_debug("no route to %pI4, port %u, dev %s, "
+					"snic 0x%p\n",
+					&csk->daddr.sin_addr.s_addr,
+					ntohs(csk->daddr.sin_port),
+					dev ? dev->name : "any",
+					csk->dd_data);
+		return -ENETUNREACH;
+	}
+
+	if (rt->rt_flags & (RTCF_MULTICAST | RTCF_BROADCAST)) {
+		cxgbi_conn_debug("multi-cast route to %pI4, port %u, "
+					"dev %s, snic 0x%p\n",
+					&csk->daddr.sin_addr.s_addr,
+					ntohs(csk->daddr.sin_port),
+					dev ? dev->name : "any",
+					csk->dd_data);
+		ip_rt_put(rt);
+		return -ENETUNREACH;
+	}
+
+	if (!csk->saddr.sin_addr.s_addr)
+		csk->saddr.sin_addr.s_addr = rt->rt_src;
+
+	csk->dst = &rt->u.dst;
+
+	dev = cxgbi_find_egress_dev(csk->dst->dev, csk->cdev);
+	if (dev == NULL) {
+		cxgbi_conn_debug("csk: 0x%p, egress dev NULL\n", csk);
+		return -ENETUNREACH;
+	}
+
+	err = cxgbi_sock_get_port(csk);
+	if (err)
+		return err;
+
+	cxgbi_conn_debug("csk: 0x%p get port: %u\n",
+			csk, ntohs(csk->saddr.sin_port));
+
+	chba = cxgbi_hba_find_by_netdev(csk->dst->dev, csk->cdev);
+
+	sipv4 = cxgbi_get_iscsi_ipv4(chba);
+	if (!sipv4) {
+		cxgbi_conn_debug("csk: 0x%p, iscsi is not configured\n", csk);
+		sipv4 = csk->saddr.sin_addr.s_addr;
+		cxgbi_set_iscsi_ipv4(chba, sipv4);
+	} else
+		csk->saddr.sin_addr.s_addr = sipv4;
+
+	cxgbi_conn_debug("csk: 0x%p, %pI4:[%u], %pI4:[%u] SYN_SENT\n",
+				csk,
+				&csk->saddr.sin_addr.s_addr,
+				ntohs(csk->saddr.sin_port),
+				&csk->daddr.sin_addr.s_addr,
+				ntohs(csk->daddr.sin_port));
+
+	cxgbi_sock_set_state(csk, CTP_CONNECTING);
+
+	if (!csk->cdev->init_act_open(csk, dev))
+		return 0;
+
+	err = -ENOTSUPP;
+	cxgbi_conn_debug("csk 0x%p -> closed\n", csk);
+	cxgbi_sock_set_state(csk, CTP_CLOSED);
+	ip_rt_put(rt);
+	cxgbi_sock_put_port(csk);
+	return err;
+}
+
+void cxgbi_sock_conn_closing(struct cxgbi_sock *csk)
+{
+	struct iscsi_conn *conn = csk->user_data;
+
+	read_lock(&csk->callback_lock);
+	if (conn && csk->state != CTP_ESTABLISHED)
+		iscsi_conn_failure(conn, ISCSI_ERR_CONN_FAILED);
+	read_unlock(&csk->callback_lock);
+}
+EXPORT_SYMBOL_GPL(cxgbi_sock_conn_closing);
+
+void cxgbi_sock_closed(struct cxgbi_sock *csk)
+{
+	cxgbi_conn_debug("csk 0x%p, state %u, flags 0x%lx\n",
+			csk, csk->state, csk->flags);
+
+	cxgbi_sock_put_port(csk);
+	csk->cdev->release_offload_resources(csk);
+	cxgbi_sock_set_state(csk, CTP_CLOSED);
+	cxgbi_sock_conn_closing(csk);
+}
+EXPORT_SYMBOL_GPL(cxgbi_sock_closed);
+
+static void cxgbi_sock_active_close(struct cxgbi_sock *csk)
+{
+	int data_lost;
+	int close_req = 0;
+
+	cxgbi_conn_debug("csk 0x%p, state %u, flags %lu\n",
+			csk, csk->state, csk->flags);
+
+	return;
+
+	dst_confirm(csk->dst);
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	data_lost = skb_queue_len(&csk->receive_queue);
+	__skb_queue_purge(&csk->receive_queue);
+
+	switch (csk->state) {
+	case CTP_CLOSED:
+	case CTP_ACTIVE_CLOSE:
+	case CTP_CLOSE_WAIT_1:
+	case CTP_CLOSE_WAIT_2:
+	case CTP_ABORTING:
+		break;
+	case CTP_CONNECTING:
+		cxgbi_sock_set_flag(csk, CTPF_ACTIVE_CLOSE_NEEDED);
+		break;
+	case CTP_ESTABLISHED:
+		close_req = 1;
+		cxgbi_sock_set_flag(csk, CTP_ACTIVE_CLOSE);
+		break;
+	case CTP_PASSIVE_CLOSE:
+		close_req = 1;
+		cxgbi_sock_set_flag(csk, CTP_CLOSE_WAIT_2);
+		break;
+	}
+
+	if (close_req) {
+		if (data_lost)
+			csk->cdev->send_abort_req(csk);
+		else
+			csk->cdev->send_close_req(csk);
+	}
+
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+}
+
+static void cxgbi_sock_release(struct cxgbi_sock *csk)
+{
+	cxgbi_conn_debug("csk 0x%p, state %u, flags %lu\n",
+			csk, csk->state, csk->flags);
+	if (unlikely(csk->state == CTP_CONNECTING))
+		cxgbi_sock_set_state(csk, CTPF_ACTIVE_CLOSE_NEEDED);
+	else if (likely(csk->state != CTP_CLOSED))
+		cxgbi_sock_active_close(csk);
+	cxgbi_sock_put(csk);
+}
+
+static unsigned int cxgbi_sock_find_best_mtu(struct cxgbi_sock *csk,
+					     unsigned short mtu)
+{
+	int i = 0;
+
+	while (i < csk->cdev->nmtus - 1 && csk->cdev->mtus[i + 1] <= mtu)
+		++i;
+
+	return i;
+}
+
+unsigned int cxgbi_sock_select_mss(struct cxgbi_sock *csk, unsigned int pmtu)
+{
+	unsigned int idx;
+	struct dst_entry *dst = csk->dst;
+	u16 advmss = dst_metric(dst, RTAX_ADVMSS);
+
+	if (advmss > pmtu - 40)
+		advmss = pmtu - 40;
+	if (advmss < csk->cdev->mtus[0] - 40)
+		advmss = csk->cdev->mtus[0] - 40;
+	idx = cxgbi_sock_find_best_mtu(csk, advmss + 40);
+
+	return idx;
+}
+EXPORT_SYMBOL_GPL(cxgbi_sock_select_mss);
+
+static void cxgbi_release_itt(struct iscsi_task *task, itt_t hdr_itt)
+{
+	struct scsi_cmnd *sc = task->sc;
+	struct iscsi_tcp_conn *tcp_conn = task->conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_hba *chba = cconn->chba;
+	struct cxgbi_tag_format *tformat = &chba->cdev->tag_format;
+	u32 tag = ntohl((__force u32)hdr_itt);
+
+	cxgbi_tag_debug("release tag 0x%x.\n", tag);
+	if (sc && (scsi_bidi_cmnd(sc) ||
+	    sc->sc_data_direction == DMA_FROM_DEVICE) &&
+	    cxgbi_is_ddp_tag(tformat, tag))
+		chba->cdev->ddp_tag_release(chba, tag);
+}
+
+static int cxgbi_reserve_itt(struct iscsi_task *task, itt_t *hdr_itt)
+{
+	struct scsi_cmnd *sc = task->sc;
+	struct iscsi_conn *conn = task->conn;
+	struct iscsi_session *sess = conn->session;
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_hba *chba = cconn->chba;
+	struct cxgbi_tag_format *tformat = &chba->cdev->tag_format;
+	u32 sw_tag = (sess->age << cconn->task_idx_bits) | task->itt;
+	u32 tag;
+	int err = -EINVAL;
+
+	if (sc && (scsi_bidi_cmnd(sc) ||
+	    sc->sc_data_direction == DMA_FROM_DEVICE) &&
+			cxgbi_sw_tag_usable(tformat, sw_tag)) {
+		struct cxgbi_sock *csk = cconn->cep->csk;
+		struct cxgbi_gather_list *gl;
+
+		gl = chba->cdev->ddp_make_gl(scsi_in(sc)->length,
+					     scsi_in(sc)->table.sgl,
+					     scsi_in(sc)->table.nents,
+					     chba->cdev->pdev, GFP_ATOMIC);
+		if (gl) {
+			tag = sw_tag;
+			err = chba->cdev->ddp_tag_reserve(chba, csk->hwtid,
+							  tformat, &tag,
+							  gl, GFP_ATOMIC);
+			if (err < 0)
+				chba->cdev->ddp_release_gl(gl,
+							   chba->cdev->pdev);
+		}
+	}
+	if (err < 0)
+		tag = cxgbi_set_non_ddp_tag(tformat, sw_tag);
+	/*  the itt need to sent in big-endian order */
+	*hdr_itt = (__force itt_t)htonl(tag);
+
+	cxgbi_tag_debug("new sc 0x%p tag 0x%x/0x%x (itt 0x%x, age 0x%x).\n",
+			sc, tag, *hdr_itt, task->itt, sess->age);
+	return 0;
+}
+
+void cxgbi_parse_pdu_itt(struct iscsi_conn *conn, itt_t itt,
+				int *idx, int *age)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	u32 tag = ntohl((__force u32) itt);
+	u32 sw_bits;
+
+	sw_bits = cxgbi_tag_nonrsvd_bits(&cdev->tag_format, tag);
+	if (idx)
+		*idx = sw_bits & ((1 << cconn->task_idx_bits) - 1);
+	if (age)
+		*age = (sw_bits >> cconn->task_idx_bits) & ISCSI_AGE_MASK;
+
+	cxgbi_tag_debug("parse tag 0x%x/0x%x, sw 0x%x, itt 0x%x, age 0x%x.\n",
+			tag, itt, sw_bits, idx ? *idx : 0xFFFFF,
+			age ? *age : 0xFF);
+}
+EXPORT_SYMBOL_GPL(cxgbi_parse_pdu_itt);
+
+void cxgbi_cleanup_task(struct iscsi_task *task)
+{
+	struct cxgbi_task_data *tdata = task->dd_data +
+				sizeof(struct iscsi_tcp_task);
+
+	/*  never reached the xmit task callout */
+	if (tdata->skb)
+		__kfree_skb(tdata->skb);
+	memset(tdata, 0, sizeof(*tdata));
+
+	cxgbi_release_itt(task, task->hdr_itt);
+	iscsi_tcp_cleanup_task(task);
+}
+EXPORT_SYMBOL_GPL(cxgbi_cleanup_task);
+
+void cxgbi_get_conn_stats(struct iscsi_cls_conn *cls_conn,
+				struct iscsi_stats *stats)
+{
+	struct iscsi_conn *conn = cls_conn->dd_data;
+
+	stats->txdata_octets = conn->txdata_octets;
+	stats->rxdata_octets = conn->rxdata_octets;
+	stats->scsicmd_pdus = conn->scsicmd_pdus_cnt;
+	stats->dataout_pdus = conn->dataout_pdus_cnt;
+	stats->scsirsp_pdus = conn->scsirsp_pdus_cnt;
+	stats->datain_pdus = conn->datain_pdus_cnt;
+	stats->r2t_pdus = conn->r2t_pdus_cnt;
+	stats->tmfcmd_pdus = conn->tmfcmd_pdus_cnt;
+	stats->tmfrsp_pdus = conn->tmfrsp_pdus_cnt;
+	stats->digest_err = 0;
+	stats->timeout_err = 0;
+	stats->custom_length = 1;
+	strcpy(stats->custom[0].desc, "eh_abort_cnt");
+	stats->custom[0].value = conn->eh_abort_cnt;
+}
+EXPORT_SYMBOL_GPL(cxgbi_get_conn_stats);
+
+static int cxgbi_conn_max_xmit_dlength(struct iscsi_conn *conn)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_device *cdev = cconn->chba->cdev;
+	unsigned int skb_tx_headroom = cdev->skb_tx_headroom;
+	unsigned int max_def = 512 * MAX_SKB_FRAGS;
+	unsigned int max = max(max_def, skb_tx_headroom);
+
+	max = min(cconn->chba->cdev->tx_max_size, max);
+	if (conn->max_xmit_dlength)
+		conn->max_xmit_dlength = min(conn->max_xmit_dlength, max);
+	else
+		conn->max_xmit_dlength = max;
+	cxgbi_align_pdu_size(conn->max_xmit_dlength);
+	return 0;
+}
+
+static int cxgbi_conn_max_recv_dlength(struct iscsi_conn *conn)
+{
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	unsigned int max = cconn->chba->cdev->rx_max_size;
+
+	cxgbi_align_pdu_size(max);
+
+	if (conn->max_recv_dlength) {
+		if (conn->max_recv_dlength > max) {
+			cxgbi_log_error("MaxRecvDataSegmentLength %u too big."
+					" Need to be <= %u.\n",
+					conn->max_recv_dlength, max);
+			return -EINVAL;
+		}
+		conn->max_recv_dlength = min(conn->max_recv_dlength, max);
+		cxgbi_align_pdu_size(conn->max_recv_dlength);
+	} else
+		conn->max_recv_dlength = max;
+
+	return 0;
+}
+
+int cxgbi_set_conn_param(struct iscsi_cls_conn *cls_conn,
+			enum iscsi_param param, char *buf, int buflen)
+{
+	struct iscsi_conn *conn = cls_conn->dd_data;
+	struct iscsi_session *session = conn->session;
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct cxgbi_sock *csk = cconn->cep->csk;
+	int value, err = 0;
+
+	switch (param) {
+	case ISCSI_PARAM_HDRDGST_EN:
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err && conn->hdrdgst_en)
+			err = csk->cdev->ddp_setup_conn_digest(csk, csk->hwtid,
+							conn->hdrdgst_en,
+							conn->datadgst_en, 0);
+		break;
+	case ISCSI_PARAM_DATADGST_EN:
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err && conn->datadgst_en)
+			err = csk->cdev->ddp_setup_conn_digest(csk, csk->hwtid,
+							conn->hdrdgst_en,
+							conn->datadgst_en, 0);
+		break;
+	case ISCSI_PARAM_MAX_R2T:
+		sscanf(buf, "%d", &value);
+		if (value <= 0 || !is_power_of_2(value))
+			return -EINVAL;
+		if (session->max_r2t == value)
+			break;
+		iscsi_tcp_r2tpool_free(session);
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err && iscsi_tcp_r2tpool_alloc(session))
+			return -ENOMEM;
+	case ISCSI_PARAM_MAX_RECV_DLENGTH:
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err)
+			err = cxgbi_conn_max_recv_dlength(conn);
+		break;
+	case ISCSI_PARAM_MAX_XMIT_DLENGTH:
+		err = iscsi_set_param(cls_conn, param, buf, buflen);
+		if (!err)
+			err = cxgbi_conn_max_xmit_dlength(conn);
+		break;
+	default:
+		return iscsi_set_param(cls_conn, param, buf, buflen);
+	}
+	return err;
+}
+EXPORT_SYMBOL_GPL(cxgbi_set_conn_param);
+
+int cxgbi_get_conn_param(struct iscsi_cls_conn *cls_conn,
+			enum iscsi_param param, char *buff)
+{
+	struct iscsi_conn *iconn = cls_conn->dd_data;
+	int len;
+
+	switch (param) {
+	case ISCSI_PARAM_CONN_PORT:
+		spin_lock_bh(&iconn->session->lock);
+		len = sprintf(buff, "%hu\n", iconn->portal_port);
+		spin_unlock_bh(&iconn->session->lock);
+		break;
+	case ISCSI_PARAM_CONN_ADDRESS:
+		spin_lock_bh(&iconn->session->lock);
+		len = sprintf(buff, "%s\n", iconn->portal_address);
+		spin_unlock_bh(&iconn->session->lock);
+		break;
+	default:
+		return iscsi_conn_get_param(cls_conn, param, buff);
+	}
+	return len;
+}
+EXPORT_SYMBOL_GPL(cxgbi_get_conn_param);
+
+struct iscsi_cls_conn *
+cxgbi_create_conn(struct iscsi_cls_session *cls_session, u32 cid)
+{
+	struct iscsi_cls_conn *cls_conn;
+	struct iscsi_conn *conn;
+	struct iscsi_tcp_conn *tcp_conn;
+	struct cxgbi_conn *cconn;
+
+	cls_conn = iscsi_tcp_conn_setup(cls_session, sizeof(*cconn), cid);
+	if (!cls_conn)
+		return NULL;
+
+	conn = cls_conn->dd_data;
+	tcp_conn = conn->dd_data;
+	cconn = tcp_conn->dd_data;
+	cconn->iconn = conn;
+	return cls_conn;
+}
+EXPORT_SYMBOL_GPL(cxgbi_create_conn);
+
+int cxgbi_bind_conn(struct iscsi_cls_session *cls_session,
+				struct iscsi_cls_conn *cls_conn,
+				u64 transport_eph, int is_leading)
+{
+	struct iscsi_conn *conn = cls_conn->dd_data;
+	struct iscsi_tcp_conn *tcp_conn = conn->dd_data;
+	struct cxgbi_conn *cconn = tcp_conn->dd_data;
+	struct iscsi_endpoint *ep;
+	struct cxgbi_endpoint *cep;
+	struct cxgbi_sock *csk;
+	int err;
+
+	ep = iscsi_lookup_endpoint(transport_eph);
+	if (!ep)
+		return -EINVAL;
+
+	/*  setup ddp pagesize */
+	cep = ep->dd_data;
+	csk = cep->csk;
+	err = csk->cdev->ddp_setup_conn_host_pgsz(csk, csk->hwtid, 0);
+	if (err < 0)
+		return err;
+
+	err = iscsi_conn_bind(cls_session, cls_conn, is_leading);
+	if (err)
+		return -EINVAL;
+
+	/*  calculate the tag idx bits needed for this conn based on cmds_max */
+	cconn->task_idx_bits = (__ilog2_u32(conn->session->cmds_max - 1)) + 1;
+
+	read_lock(&csk->callback_lock);
+	csk->user_data = conn;
+	cconn->chba = cep->chba;
+	cconn->cep = cep;
+	cep->cconn = cconn;
+	read_unlock(&csk->callback_lock);
+
+	cxgbi_conn_max_xmit_dlength(conn);
+	cxgbi_conn_max_recv_dlength(conn);
+
+	spin_lock_bh(&conn->session->lock);
+	sprintf(conn->portal_address, "%pI4", &csk->daddr.sin_addr.s_addr);
+	conn->portal_port = ntohs(csk->daddr.sin_port);
+	spin_unlock_bh(&conn->session->lock);
+
+	/*  init recv engine */
+	iscsi_tcp_hdr_recv_prep(tcp_conn);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(cxgbi_bind_conn);
+
+struct iscsi_cls_session *
+cxgbi_create_session(struct iscsi_endpoint *ep, u16 cmds_max, u16 qdepth,
+							u32 initial_cmdsn)
+{
+	struct cxgbi_endpoint *cep;
+	struct cxgbi_hba *chba;
+	struct Scsi_Host *shost;
+	struct iscsi_cls_session *cls_session;
+	struct iscsi_session *session;
+
+	if (!ep) {
+		cxgbi_log_error("missing endpoint\n");
+		return NULL;
+	}
+
+	cep = ep->dd_data;
+	chba = cep->chba;
+	shost = chba->shost;
+
+	BUG_ON(chba != iscsi_host_priv(shost));
+
+	cls_session = iscsi_session_setup(chba->cdev->itp, shost,
+					cmds_max, 0,
+					sizeof(struct iscsi_tcp_task) +
+					sizeof(struct cxgbi_task_data),
+					initial_cmdsn, ISCSI_MAX_TARGET);
+	if (!cls_session)
+		return NULL;
+
+	session = cls_session->dd_data;
+	if (iscsi_tcp_r2tpool_alloc(session))
+		goto remove_session;
+
+	return cls_session;
+
+remove_session:
+	iscsi_session_teardown(cls_session);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(cxgbi_create_session);
+
+void cxgbi_destroy_session(struct iscsi_cls_session *cls_session)
+{
+	iscsi_tcp_r2tpool_free(cls_session->dd_data);
+	iscsi_session_teardown(cls_session);
+}
+EXPORT_SYMBOL_GPL(cxgbi_destroy_session);
+
+int cxgbi_set_host_param(struct Scsi_Host *shost,
+			enum iscsi_host_param param, char *buff, int buflen)
+{
+	struct cxgbi_hba *chba = iscsi_host_priv(shost);
+
+	if (!chba->ndev) {
+		shost_printk(KERN_ERR, shost, "Could not set host param. "
+				"Netdev for host not set\n");
+		return -ENODEV;
+	}
+
+	cxgbi_api_debug("param %d, buff %s\n", param, buff);
+
+	switch (param) {
+	case ISCSI_HOST_PARAM_IPADDRESS:
+	{
+		__be32 addr = in_aton(buff);
+		cxgbi_set_iscsi_ipv4(chba, addr);
+		return 0;
+	}
+	case ISCSI_HOST_PARAM_HWADDRESS:
+	case ISCSI_HOST_PARAM_NETDEV_NAME:
+		return 0;
+	default:
+		return iscsi_host_set_param(shost, param, buff, buflen);
+	}
+}
+EXPORT_SYMBOL_GPL(cxgbi_set_host_param);
+
+int cxgbi_get_host_param(struct Scsi_Host *shost,
+			enum iscsi_host_param param, char *buff)
+{
+	struct cxgbi_hba *chba = iscsi_host_priv(shost);
+	int len = 0;
+
+	if (!chba->ndev) {
+		shost_printk(KERN_ERR, shost, "Could not set host param. "
+				"Netdev for host not set\n");
+		return -ENODEV;
+	}
+
+	cxgbi_api_debug("hba %s, param %d\n", chba->ndev->name, param);
+
+	switch (param) {
+	case ISCSI_HOST_PARAM_HWADDRESS:
+		len = sysfs_format_mac(buff, chba->ndev->dev_addr, 6);
+		break;
+	case ISCSI_HOST_PARAM_NETDEV_NAME:
+		len = sprintf(buff, "%s\n", chba->ndev->name);
+		break;
+	case ISCSI_HOST_PARAM_IPADDRESS:
+	{
+		__be32 addr;
+
+		addr = cxgbi_get_iscsi_ipv4(chba);
+		len = sprintf(buff, "%pI4", &addr);
+		break;
+	}
+	default:
+		return iscsi_host_get_param(shost, param, buff);
+	}
+
+	return len;
+}
+EXPORT_SYMBOL_GPL(cxgbi_get_host_param);
+
+struct iscsi_endpoint *cxgbi_ep_connect(struct Scsi_Host *shost,
+					struct sockaddr *dst_addr,
+					int non_blocking)
+{
+	struct iscsi_endpoint *iep;
+	struct cxgbi_endpoint *cep;
+	struct cxgbi_hba *hba = NULL;
+	struct cxgbi_sock *csk = NULL;
+	struct sockaddr_in *sin = (struct sockaddr_in *)dst_addr;
+	struct cxgbi_device *cdev;
+	int err = 0;
+
+	if (shost)
+		hba = iscsi_host_priv(shost);
+
+	cdev = cxgbi_find_cdev(hba ? hba->ndev : NULL,
+			((struct sockaddr_in *)dst_addr)->sin_addr.s_addr);
+	if (!cdev) {
+		cxgbi_log_info("ep connect no cdev\n");
+		err = -ENOSPC;
+		goto release_conn;
+	}
+
+	csk = cxgbi_sock_create(cdev);
+	if (!csk) {
+		cxgbi_log_info("ep connect OOM\n");
+		err = -ENOMEM;
+		goto release_conn;
+	}
+
+	err = cxgbi_sock_connect(hba ? hba->ndev : NULL, csk, sin);
+	if (err < 0) {
+		cxgbi_log_info("ep connect failed\n");
+		goto release_conn;
+	}
+
+	hba = cxgbi_hba_find_by_netdev(csk->dst->dev, cdev);
+	if (!hba) {
+		err = -ENOSPC;
+		cxgbi_log_info("Not going through cxgb4i device\n");
+		goto release_conn;
+	}
+
+	if (shost && hba != iscsi_host_priv(shost)) {
+		err = -ENOSPC;
+		cxgbi_log_info("Could not connect through request host %u\n",
+				shost->host_no);
+		goto release_conn;
+	}
+
+	if (cxgbi_sock_is_closing(csk)) {
+		err = -ENOSPC;
+		cxgbi_log_info("ep connect unable to connect\n");
+		goto release_conn;
+	}
+
+	iep = iscsi_create_endpoint(sizeof(*cep));
+	if (!iep) {
+		err = -ENOMEM;
+		cxgbi_log_info("iscsi alloc ep, OOM\n");
+		goto release_conn;
+	}
+
+	cep = iep->dd_data;
+	cep->csk = csk;
+	cep->chba = hba;
+	cxgbi_api_debug("iep 0x%p, cep 0x%p, csk 0x%p, hba 0x%p\n",
+			iep, cep, csk, hba);
+	return iep;
+release_conn:
+	cxgbi_api_debug("conn 0x%p failed, release\n", csk);
+	if (csk)
+		cxgbi_sock_release(csk);
+
+	return ERR_PTR(err);
+}
+EXPORT_SYMBOL_GPL(cxgbi_ep_connect);
+
+int cxgbi_ep_poll(struct iscsi_endpoint *ep, int timeout_ms)
+{
+	struct cxgbi_endpoint *cep = ep->dd_data;
+	struct cxgbi_sock *csk = cep->csk;
+
+	if (!cxgbi_sock_is_established(csk))
+		return 0;
+
+	return 1;
+}
+EXPORT_SYMBOL_GPL(cxgbi_ep_poll);
+
+void cxgbi_ep_disconnect(struct iscsi_endpoint *ep)
+{
+	struct cxgbi_endpoint *cep = ep->dd_data;
+	struct cxgbi_conn *cconn = cep->cconn;
+
+	if (cconn && cconn->iconn) {
+		iscsi_suspend_tx(cconn->iconn);
+		write_lock_bh(&cep->csk->callback_lock);
+		cep->csk->user_data = NULL;
+		cconn->cep = NULL;
+		write_unlock_bh(&cep->csk->callback_lock);
+	}
+
+	cxgbi_sock_release(cep->csk);
+	iscsi_destroy_endpoint(ep);
+}
+EXPORT_SYMBOL_GPL(cxgbi_ep_disconnect);
+
+struct cxgbi_hba *cxgbi_hba_add(struct cxgbi_device *cdev,
+				unsigned int max_lun,
+				unsigned int max_id,
+				struct scsi_transport_template *stt,
+				struct scsi_host_template *sht,
+				struct net_device *dev)
+{
+	struct cxgbi_hba *chba;
+	struct Scsi_Host *shost;
+	int err;
+
+	shost = iscsi_host_alloc(sht, sizeof(*chba), 1);
+
+	if (!shost) {
+		cxgbi_log_info("cdev 0x%p, ndev 0x%p, host alloc failed\n",
+				cdev, dev);
+		return NULL;
+	}
+
+	shost->transportt = stt;
+	shost->max_lun = max_lun;
+	shost->max_id = max_id;
+	shost->max_channel = 0;
+	shost->max_cmd_len = 16;
+	chba = iscsi_host_priv(shost);
+	cxgbi_log_debug("cdev %p\n", cdev);
+	chba->cdev = cdev;
+	chba->ndev = dev;
+	chba->shost = shost;
+	pci_dev_get(cdev->pdev);
+	err = iscsi_host_add(shost, &cdev->pdev->dev);
+	if (err) {
+		cxgbi_log_info("cdev 0x%p, dev 0x%p, host add failed\n",
+				cdev, dev);
+		goto pci_dev_put;
+	}
+
+	return chba;
+pci_dev_put:
+	pci_dev_put(cdev->pdev);
+	scsi_host_put(shost);
+	return NULL;
+}
+EXPORT_SYMBOL_GPL(cxgbi_hba_add);
+
+void cxgbi_hba_remove(struct cxgbi_hba *chba)
+{
+	iscsi_host_remove(chba->shost);
+	pci_dev_put(chba->cdev->pdev);
+	iscsi_host_free(chba->shost);
+}
+EXPORT_SYMBOL_GPL(cxgbi_hba_remove);
diff --git a/drivers/scsi/cxgbi/libcxgbi.h b/drivers/scsi/cxgbi/libcxgbi.h
new file mode 100644
index 0000000..4e1aa61
--- /dev/null
+++ b/drivers/scsi/cxgbi/libcxgbi.h
@@ -0,0 +1,556 @@
+/*
+ * libcxgbi.h: Chelsio common library for T3/T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#ifndef	__LIBCXGBI_H__
+#define	__LIBCXGBI_H__
+
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <linux/scatterlist.h>
+#include <linux/skbuff.h>
+#include <scsi/libiscsi_tcp.h>
+
+
+#define	cxgbi_log_error(fmt...)	printk(KERN_ERR "cxgbi: ERR! " fmt)
+#define cxgbi_log_warn(fmt...)	printk(KERN_WARNING "cxgbi: WARN! " fmt)
+#define cxgbi_log_info(fmt...)	printk(KERN_INFO "cxgbi: " fmt)
+#define cxgbi_debug_log(fmt, args...) \
+	printk(KERN_INFO "cxgbi: %s - " fmt, __func__ , ## args)
+
+#ifdef	__DEBUG_CXGBI__
+#define	cxgbi_log_debug	cxgbi_debug_log
+#else
+#define cxgbi_log_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_TAG__
+#define cxgbi_tag_debug        cxgbi_log_debug
+#else
+#define cxgbi_tag_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_API__
+#define cxgbi_api_debug        cxgbi_log_debug
+#else
+#define cxgbi_api_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_CONN__
+#define cxgbi_conn_debug         cxgbi_log_debug
+#else
+#define cxgbi_conn_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_TX__
+#define cxgbi_tx_debug           cxgbi_log_debug
+#else
+#define cxgbi_tx_debug(fmt...)
+#endif
+
+#ifdef __DEBUG_CXGBI_RX__
+#define cxgbi_rx_debug           cxgbi_log_debug
+#else
+#define cxgbi_rx_debug(fmt...)
+#endif
+
+/* always allocate rooms for AHS */
+#define SKB_TX_PDU_HEADER_LEN	\
+	(sizeof(struct iscsi_hdr) + ISCSI_MAX_AHS_SIZE)
+
+#define	ISCSI_PDU_NONPAYLOAD_LEN	312 /* bhs(48) + ahs(256) + digest(8)*/
+#define ULP2_MAX_PKT_SIZE		16224
+#define ULP2_MAX_PDU_PAYLOAD	\
+	(ULP2_MAX_PKT_SIZE - ISCSI_PDU_NONPAYLOAD_LEN)
+
+#define PPOD_PAGES_MAX			4
+#define PPOD_PAGES_SHIFT		2       /*  4 pages per pod */
+
+/*
+ * align pdu size to multiple of 512 for better performance
+ */
+#define cxgbi_align_pdu_size(n) do { n = (n) & (~511); } while (0)
+
+/*
+ * struct pagepod_hdr, pagepod - pagepod format
+ */
+struct pagepod_hdr {
+	unsigned int vld_tid;
+	unsigned int pgsz_tag_clr;
+	unsigned int max_offset;
+	unsigned int page_offset;
+	unsigned long long rsvd;
+};
+
+struct pagepod {
+	struct pagepod_hdr hdr;
+	unsigned long long addr[PPOD_PAGES_MAX + 1];
+};
+
+struct cxgbi_tag_format {
+	unsigned char sw_bits;
+	unsigned char rsvd_bits;
+	unsigned char rsvd_shift;
+	unsigned char filler[1];
+	unsigned int rsvd_mask;
+};
+
+struct cxgbi_gather_list {
+	unsigned int tag;
+	unsigned int length;
+	unsigned int offset;
+	unsigned int nelem;
+	struct page **pages;
+	dma_addr_t phys_addr[0];
+};
+
+/*
+ * sge_opaque_hdr -
+ * Opaque version of structure the SGE stores at skb->head of TX_DATA packets
+ * and for which we must reserve space.
+ */
+struct sge_opaque_hdr {
+	void *dev;
+	dma_addr_t addr[MAX_SKB_FRAGS + 1];
+};
+
+struct cxgbi_sock {
+	struct cxgbi_device *cdev;
+
+	unsigned long flags;
+	unsigned short rss_qid;
+	unsigned short txq_idx;
+	unsigned int hwtid;
+	unsigned int atid;
+	unsigned int tx_chan;
+	unsigned int rx_chan;
+	unsigned int mss_idx;
+	unsigned int smac_idx;
+	unsigned char port_id;
+	int wr_max_cred;
+	int wr_cred;
+	int wr_una_cred;
+	unsigned char hcrc_len;
+	unsigned char dcrc_len;
+
+	void *l2t;
+	struct sk_buff *wr_pending_head;
+	struct sk_buff *wr_pending_tail;
+	struct sk_buff *cpl_close;
+	struct sk_buff *cpl_abort_req;
+	struct sk_buff *cpl_abort_rpl;
+	struct sk_buff *skb_ulp_lhdr;
+	spinlock_t lock;
+	struct kref refcnt;
+	unsigned int state;
+	struct sockaddr_in saddr;
+	struct sockaddr_in daddr;
+	struct dst_entry *dst;
+	struct sk_buff_head receive_queue;
+	struct sk_buff_head write_queue;
+	struct timer_list retry_timer;
+	int err;
+	rwlock_t callback_lock;
+	void *user_data;
+
+	u32 rcv_nxt;
+	u32 copied_seq;
+	u32 rcv_wup;
+	u32 snd_nxt;
+	u32 snd_una;
+	u32 write_seq;
+};
+
+enum cxgbi_sock_states{
+	CTP_CONNECTING = 1,
+	CTP_ESTABLISHED,
+	CTP_ACTIVE_CLOSE,
+	CTP_PASSIVE_CLOSE,
+	CTP_CLOSE_WAIT_1,
+	CTP_CLOSE_WAIT_2,
+	CTP_ABORTING,
+	CTP_CLOSED,
+};
+
+enum cxgbi_sock_flags {
+	CTPF_ABORT_RPL_RCVD = 1,/*received one ABORT_RPL_RSS message */
+	CTPF_ABORT_REQ_RCVD,	/*received one ABORT_REQ_RSS message */
+	CTPF_ABORT_RPL_PENDING,	/* expecting an abort reply */
+	CTPF_TX_DATA_SENT,	/* already sent a TX_DATA WR */
+	CTPF_ACTIVE_CLOSE_NEEDED,	/* need to be closed */
+	CTPF_MSG_COALESCED,
+	CTPF_OFFLOAD_DOWN,		/* offload function off */
+};
+
+enum cxgbi_skcb_flags {
+	CTP_SKCBF_NEED_HDR = 1 << 0,	/* packet needs a header */
+	CTP_SKCBF_NO_APPEND = 1 << 1,	/* don't grow this skb */
+	CTP_SKCBF_COMPL = 1 << 2,	/* request WR completion */
+	CTP_SKCBF_HDR_RCVD = 1 << 3,	/* recieved header pdu */
+	CTP_SKCBF_DATA_RCVD = 1 << 4,	/*  recieved data pdu */
+	CTP_SKCBF_STATUS_RCVD = 1 << 5,	/* recieved ddp status */
+};
+
+static inline void cxgbi_sock_set_flag(struct cxgbi_sock *csk,
+					enum cxgbi_sock_flags flag)
+{
+	__set_bit(flag, &csk->flags);
+	cxgbi_conn_debug("csk 0x%p, set %d, state %u, flags 0x%lu\n",
+			csk, flag, csk->state, csk->flags);
+}
+
+static inline void cxgbi_sock_clear_flag(struct cxgbi_sock *csk,
+					enum cxgbi_sock_flags flag)
+{
+	__clear_bit(flag, &csk->flags);
+	cxgbi_conn_debug("csk 0x%p, clear %d, state %u, flags 0x%lu\n",
+			csk, flag, csk->state, csk->flags);
+}
+
+static inline int cxgbi_sock_flag(struct cxgbi_sock *csk,
+				enum cxgbi_sock_flags flag)
+{
+	if (csk == NULL)
+		return 0;
+
+	return test_bit(flag, &csk->flags);
+}
+
+static inline void cxgbi_sock_set_state(struct cxgbi_sock *csk, int state)
+{
+	csk->state = state;
+}
+
+static inline void cxgbi_sock_hold(struct cxgbi_sock *csk)
+{
+	kref_get(&csk->refcnt);
+}
+
+static inline void cxgbi_clean_sock(struct kref *kref)
+{
+	struct cxgbi_sock *csk = container_of(kref,
+						struct cxgbi_sock,
+						refcnt);
+	if (csk) {
+		cxgbi_log_debug("free csk 0x%p, state %u, flags 0x%lx\n",
+						csk, csk->state, csk->flags);
+		kfree(csk);
+	}
+}
+
+static inline void cxgbi_sock_put(struct cxgbi_sock *csk)
+{
+	if (csk)
+		kref_put(&csk->refcnt, cxgbi_clean_sock);
+}
+
+static inline unsigned int cxgbi_sock_is_closing(const struct cxgbi_sock *csk)
+{
+	return csk->state >= CTP_ACTIVE_CLOSE;
+}
+
+static inline unsigned int cxgbi_sock_is_established(
+						const struct cxgbi_sock *csk)
+{
+	return csk->state == CTP_ESTABLISHED;
+}
+
+static inline void cxgbi_sock_purge_write_queue(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb;
+
+	while ((skb = __skb_dequeue(&csk->write_queue)))
+		__kfree_skb(skb);
+}
+
+static inline int cxgbi_sock_compute_wscale(int win)
+{
+	int wscale = 0;
+	while (wscale < 14 && (65535 << wscale) < win)
+		wscale++;
+	return wscale;
+}
+
+void cxgbi_sock_conn_closing(struct cxgbi_sock *);
+void cxgbi_sock_closed(struct cxgbi_sock *);
+unsigned int cxgbi_sock_select_mss(struct cxgbi_sock *, unsigned int);
+
+struct cxgbi_hba {
+	struct net_device *ndev;
+	struct Scsi_Host *shost;
+	struct cxgbi_device *cdev;
+	__be32 ipv4addr;
+	unsigned short txq_idx;
+	unsigned char port_id;
+};
+
+struct cxgbi_ports_map {
+	unsigned int max_connect;
+	unsigned short sport_base;
+	spinlock_t lock;
+	unsigned int next;
+	struct cxgbi_sock *port_csk[0];
+};
+
+struct cxgbi_device {
+	struct list_head list_head;
+	char *name;
+	struct net_device **ports;
+	struct cxgbi_hba **hbas;
+	const unsigned short *mtus;
+	unsigned char nmtus;
+	unsigned char nports;
+	struct pci_dev *pdev;
+
+	unsigned int skb_tx_headroom;
+	unsigned int skb_extra_headroom;
+	unsigned int tx_max_size;
+	unsigned int rx_max_size;
+	struct page *pad_page;
+	struct cxgbi_ports_map *pmap;
+	struct iscsi_transport *itp;
+	struct cxgbi_tag_format tag_format;
+
+	int (*ddp_tag_reserve)(struct cxgbi_hba *, unsigned int,
+				struct cxgbi_tag_format *, u32 *,
+				struct cxgbi_gather_list *, gfp_t);
+	void (*ddp_tag_release)(struct cxgbi_hba *, u32);
+	struct cxgbi_gather_list* (*ddp_make_gl)(unsigned int,
+						struct scatterlist *,
+						unsigned int,
+						struct pci_dev *,
+						gfp_t);
+	void (*ddp_release_gl)(struct cxgbi_gather_list *, struct pci_dev *);
+	int (*ddp_setup_conn_digest)(struct cxgbi_sock *,
+					unsigned int, int, int, int);
+	int (*ddp_setup_conn_host_pgsz)(struct cxgbi_sock *,
+					unsigned int, int);
+	__u16 (*get_skb_ulp_mode)(struct sk_buff *);
+	__u16 (*get_skb_flags)(struct sk_buff *);
+	__u32 (*get_skb_tcp_seq)(struct sk_buff *);
+	__u32 (*get_skb_rx_pdulen)(struct sk_buff *);
+	void (*set_skb_txmode)(struct sk_buff *, int, int);
+
+	void (*release_offload_resources)(struct cxgbi_sock *);
+	int (*sock_send_pdus)(struct cxgbi_sock *, struct sk_buff *);
+	void (*sock_rx_credits)(struct cxgbi_sock *, int);
+	void (*send_abort_req)(struct cxgbi_sock *);
+	void (*send_close_req)(struct cxgbi_sock *);
+	int (*alloc_cpl_skbs)(struct cxgbi_sock *);
+	int (*init_act_open)(struct cxgbi_sock *, struct net_device *);
+
+	unsigned long dd_data[0];
+};
+
+static inline void *cxgbi_cdev_priv(struct cxgbi_device *cdev)
+{
+	return (void *)cdev->dd_data;
+}
+
+struct cxgbi_device *cxgbi_device_register(unsigned int, unsigned int);
+void cxgbi_device_unregister(struct cxgbi_device *);
+
+struct cxgbi_conn {
+	struct cxgbi_endpoint *cep;
+	struct iscsi_conn *iconn;
+	struct cxgbi_hba *chba;
+	u32 task_idx_bits;
+};
+
+struct cxgbi_endpoint {
+	struct cxgbi_conn *cconn;
+	struct cxgbi_hba *chba;
+	struct cxgbi_sock *csk;
+};
+
+#define MAX_PDU_FRAGS	((ULP2_MAX_PDU_PAYLOAD + 512 - 1) / 512)
+struct cxgbi_task_data {
+	unsigned short nr_frags;
+	skb_frag_t frags[MAX_PDU_FRAGS];
+	struct sk_buff *skb;
+	unsigned int offset;
+	unsigned int count;
+	unsigned int sgoffset;
+};
+
+static inline int cxgbi_is_ddp_tag(struct cxgbi_tag_format *tformat, u32 tag)
+{
+	return !(tag & (1 << (tformat->rsvd_bits + tformat->rsvd_shift - 1)));
+}
+
+static inline int cxgbi_sw_tag_usable(struct cxgbi_tag_format *tformat,
+					u32 sw_tag)
+{
+	sw_tag >>= (32 - tformat->rsvd_bits);
+	return !sw_tag;
+}
+
+static inline u32 cxgbi_set_non_ddp_tag(struct cxgbi_tag_format *tformat,
+					u32 sw_tag)
+{
+	unsigned char shift = tformat->rsvd_bits + tformat->rsvd_shift - 1;
+
+	u32 mask = (1 << shift) - 1;
+
+	if (sw_tag && (sw_tag & ~mask)) {
+		u32 v1 = sw_tag & ((1 << shift) - 1);
+		u32 v2 = (sw_tag >> (shift - 1)) << shift;
+
+		return v2 | v1 | 1 << shift;
+	}
+
+	return sw_tag | 1 << shift;
+}
+
+static inline u32 cxgbi_ddp_tag_base(struct cxgbi_tag_format *tformat,
+					u32 sw_tag)
+{
+	u32 mask = (1 << tformat->rsvd_shift) - 1;
+
+	if (sw_tag && (sw_tag & ~mask)) {
+		u32 v1 = sw_tag & mask;
+		u32 v2 = sw_tag >> tformat->rsvd_shift;
+
+		v2 <<= tformat->rsvd_bits + tformat->rsvd_shift;
+
+		return v2 | v1;
+	}
+
+	return sw_tag;
+}
+
+static inline u32 cxgbi_tag_rsvd_bits(struct cxgbi_tag_format *tformat,
+					u32 tag)
+{
+	if (cxgbi_is_ddp_tag(tformat, tag))
+		return (tag >> tformat->rsvd_shift) & tformat->rsvd_mask;
+
+	return 0;
+}
+
+static inline u32 cxgbi_tag_nonrsvd_bits(struct cxgbi_tag_format *tformat,
+					u32 tag)
+{
+	unsigned char shift = tformat->rsvd_bits + tformat->rsvd_shift - 1;
+	u32 v1, v2;
+
+	if (cxgbi_is_ddp_tag(tformat, tag)) {
+		v1 = tag & ((1 << tformat->rsvd_shift) - 1);
+		v2 = (tag >> (shift + 1)) << tformat->rsvd_shift;
+	} else {
+		u32 mask = (1 << shift) - 1;
+		tag &= ~(1 << shift);
+		v1 = tag & mask;
+		v2 = (tag >> 1) & ~mask;
+	}
+	return v1 | v2;
+}
+
+static inline void *cxgbi_alloc_big_mem(unsigned int size,
+					gfp_t gfp)
+{
+	void *p = kmalloc(size, gfp);
+	if (!p)
+		p = vmalloc(size);
+	if (p)
+		memset(p, 0, size);
+	return p;
+}
+
+static inline void cxgbi_free_big_mem(void *addr)
+{
+	if (is_vmalloc_addr(addr))
+		vfree(addr);
+	else
+		kfree(addr);
+}
+
+#define RX_DDP_STATUS_IPP_SHIFT		27      /* invalid pagepod */
+#define RX_DDP_STATUS_TID_SHIFT		26      /* tid mismatch */
+#define RX_DDP_STATUS_COLOR_SHIFT	25      /* color mismatch */
+#define RX_DDP_STATUS_OFFSET_SHIFT	24      /* offset mismatch */
+#define RX_DDP_STATUS_ULIMIT_SHIFT	23      /* ulimit error */
+#define RX_DDP_STATUS_TAG_SHIFT		22      /* tag mismatch */
+#define RX_DDP_STATUS_DCRC_SHIFT	21      /* dcrc error */
+#define RX_DDP_STATUS_HCRC_SHIFT	20      /* hcrc error */
+#define RX_DDP_STATUS_PAD_SHIFT		19      /* pad error */
+#define RX_DDP_STATUS_PPP_SHIFT		18      /* pagepod parity error */
+#define RX_DDP_STATUS_LLIMIT_SHIFT	17      /* llimit error */
+#define RX_DDP_STATUS_DDP_SHIFT		16      /* ddp'able */
+#define RX_DDP_STATUS_PMM_SHIFT		15      /* pagepod mismatch */
+
+
+#define ULP2_FLAG_DATA_READY		0x1
+#define ULP2_FLAG_DATA_DDPED		0x2
+#define ULP2_FLAG_HCRC_ERROR		0x4
+#define ULP2_FLAG_DCRC_ERROR		0x8
+#define ULP2_FLAG_PAD_ERROR		0x10
+
+struct cxgbi_hba *cxgbi_hba_add(struct cxgbi_device *,
+				unsigned int, unsigned int,
+				struct scsi_transport_template *,
+				struct scsi_host_template *,
+				struct net_device *);
+void cxgbi_hba_remove(struct cxgbi_hba *);
+
+
+void cxgbi_parse_pdu_itt(struct iscsi_conn *conn, itt_t itt,
+				int *idx, int *age);
+void cxgbi_cleanup_task(struct iscsi_task *task);
+
+void cxgbi_conn_pdu_ready(struct cxgbi_sock *);
+void cxgbi_conn_tx_open(struct cxgbi_sock *);
+int cxgbi_conn_init_pdu(struct iscsi_task *, unsigned int , unsigned int);
+int cxgbi_conn_alloc_pdu(struct iscsi_task *, u8);
+int cxgbi_conn_xmit_pdu(struct iscsi_task *);
+void cxgbi_get_conn_stats(struct iscsi_cls_conn *, struct iscsi_stats *);
+int cxgbi_set_conn_param(struct iscsi_cls_conn *,
+			enum iscsi_param, char *, int);
+int cxgbi_get_conn_param(struct iscsi_cls_conn *, enum iscsi_param, char *);
+struct iscsi_cls_conn *cxgbi_create_conn(struct iscsi_cls_session *, u32);
+int cxgbi_bind_conn(struct iscsi_cls_session *,
+			struct iscsi_cls_conn *, u64, int);
+void cxgbi_destroy_session(struct iscsi_cls_session *);
+struct iscsi_cls_session *cxgbi_create_session(struct iscsi_endpoint *,
+						u16, u16, u32);
+int cxgbi_set_host_param(struct Scsi_Host *,
+				enum iscsi_host_param, char *, int);
+int cxgbi_get_host_param(struct Scsi_Host *, enum iscsi_host_param, char *);
+
+
+int cxgbi_pdu_init(struct cxgbi_device *);
+void cxgbi_pdu_cleanup(struct cxgbi_device *);
+
+struct iscsi_endpoint *cxgbi_ep_connect(struct Scsi_Host *,
+					struct sockaddr *, int);
+int cxgbi_ep_poll(struct iscsi_endpoint *, int);
+void cxgbi_ep_disconnect(struct iscsi_endpoint *);
+
+static inline void cxgbi_set_iscsi_ipv4(struct cxgbi_hba *chba, __be32 ipaddr)
+{
+	chba->ipv4addr = ipaddr;
+}
+
+static inline __be32 cxgbi_get_iscsi_ipv4(struct cxgbi_hba *chba)
+{
+	return chba->ipv4addr;
+}
+
+struct cxgbi_device *cxgbi_device_alloc(unsigned int dd_size);
+void cxgbi_device_free(struct cxgbi_device *cdev);
+void cxgbi_device_add(struct list_head *list_head);
+void cxgbi_device_remove(struct cxgbi_device *cdev);
+
+#endif	/*__LIBCXGBI_H__*/
-- 
1.6.6.1

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply related

* [PATCH 3/3] cxgb4i_v4.2 : main driver files
From: Rakesh Ranjan @ 2010-06-07 20:04 UTC (permalink / raw)
  To: LK-NetDev, LK-SCSIDev, LK-iSCSIDev
  Cc: LKML, Karen Xie, David Miller, James Bottomley, Mike Christie,
	Anish Bhatt, Rakesh Ranjan
In-Reply-To: <1275941078-1779-3-git-send-email-rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

From: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>


Signed-off-by: Rakesh Ranjan <rakesh-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
---
 drivers/scsi/cxgbi/cxgb4i.h         |  175 +++++
 drivers/scsi/cxgbi/cxgb4i_ddp.c     |  653 ++++++++++++++++
 drivers/scsi/cxgbi/cxgb4i_init.c    |  317 ++++++++
 drivers/scsi/cxgbi/cxgb4i_offload.c | 1413 +++++++++++++++++++++++++++++++++++
 4 files changed, 2558 insertions(+), 0 deletions(-)
 create mode 100644 drivers/scsi/cxgbi/cxgb4i.h
 create mode 100644 drivers/scsi/cxgbi/cxgb4i_ddp.c
 create mode 100644 drivers/scsi/cxgbi/cxgb4i_init.c
 create mode 100644 drivers/scsi/cxgbi/cxgb4i_offload.c

diff --git a/drivers/scsi/cxgbi/cxgb4i.h b/drivers/scsi/cxgbi/cxgb4i.h
new file mode 100644
index 0000000..41b0a25
--- /dev/null
+++ b/drivers/scsi/cxgbi/cxgb4i.h
@@ -0,0 +1,175 @@
+/*
+ * cxgb4i.h: Chelsio T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#ifndef	__CXGB4I_H__
+#define	__CXGB4I_H__
+
+#include <linux/module.h>
+#include <linux/moduleparam.h>
+#include <linux/errno.h>
+#include <linux/types.h>
+#include <linux/list.h>
+#include <linux/netdevice.h>
+#include <linux/if_vlan.h>
+#include <linux/scatterlist.h>
+#include <linux/skbuff.h>
+#include <scsi/libiscsi_tcp.h>
+
+#include "t4fw_api.h"
+#include "t4_msg.h"
+#include "l2t.h"
+#include "cxgb4.h"
+#include "cxgb4_uld.h"
+#include "libcxgbi.h"
+
+#define	CXGB4I_MAX_CONN		16384
+#define	CXGB4I_SCSI_HOST_QDEPTH	1024
+#define	CXGB4I_MAX_TARGET	CXGB4I_MAX_CONN
+#define	CXGB4I_MAX_LUN		0x1000
+
+struct cxgb4i_snic;
+typedef int (*cxgb4i_cplhandler_func)(struct cxgb4i_snic *, struct sk_buff *);
+
+struct cxgb4i_snic {
+	struct cxgb4_lld_info lldi;
+	struct cxgb4i_ddp_info *ddp;
+	cxgb4i_cplhandler_func *handlers;
+};
+
+enum {
+	CPL_RET_BUF_DONE = 1,
+	CPL_RET_BAD_MSG = 2,
+	CPL_RET_UNKNOWN_TID = 4
+};
+
+struct cxgb4i_skb_rx_cb {
+	__u32 ddigest;
+	__u32 pdulen;
+};
+
+struct cxgb4i_skb_tx_cb {
+	struct l2t_skb_cb l2t;
+	struct sk_buff *wr_next;
+};
+
+struct cxgb4i_skb_cb {
+	__u16 flags;
+	__u16 ulp_mode;
+	__u32 seq;
+
+	union {
+		struct cxgb4i_skb_rx_cb rx;
+		struct cxgb4i_skb_tx_cb tx;
+	};
+};
+
+#define CXGB4I_SKB_CB(skb)	((struct cxgb4i_skb_cb *)&((skb)->cb[0]))
+#define cxgb4i_skb_flags(skb)	(CXGB4I_SKB_CB(skb)->flags)
+#define cxgb4i_skb_ulp_mode(skb)	(CXGB4I_SKB_CB(skb)->ulp_mode)
+#define cxgb4i_skb_tcp_seq(skb)		(CXGB4I_SKB_CB(skb)->seq)
+#define cxgb4i_skb_rx_ddigest(skb)	(CXGB4I_SKB_CB(skb)->rx.ddigest)
+#define cxgb4i_skb_rx_pdulen(skb)	(CXGB4I_SKB_CB(skb)->rx.pdulen)
+#define cxgb4i_skb_tx_wr_next(skb)	(CXGB4I_SKB_CB(skb)->tx.wr_next)
+
+/* for TX: a skb must have a headroom of at least TX_HEADER_LEN bytes */
+#define CXGB4I_TX_HEADER_LEN \
+	(sizeof(struct fw_ofld_tx_data_wr) + sizeof(struct sge_opaque_hdr))
+
+int cxgb4i_ofld_init(struct cxgbi_device *);
+void cxgb4i_ofld_cleanup(struct cxgbi_device *);
+
+struct cxgb4i_ddp_info {
+	struct kref refcnt;
+	struct cxgb4i_snic *snic;
+	struct pci_dev *pdev;
+	unsigned int max_txsz;
+	unsigned int max_rxsz;
+	unsigned int llimit;
+	unsigned int ulimit;
+	unsigned int nppods;
+	unsigned int idx_last;
+	unsigned char idx_bits;
+	unsigned char filler[3];
+	unsigned int idx_mask;
+	unsigned int rsvd_tag_mask;
+	spinlock_t map_lock;
+	struct cxgbi_gather_list **gl_map;
+};
+
+struct cpl_rx_data_ddp {
+	union opcode_tid ot;
+	__be16 urg;
+	__be16 len;
+	__be32 seq;
+	union {
+		__be32 nxt_seq;
+		__be32 ddp_report;
+	};
+	__be32 ulp_crc;
+	__be32 ddpvld;
+};
+
+#define PPOD_SIZE               sizeof(struct pagepod)  /*  64 */
+#define PPOD_SIZE_SHIFT         6
+
+#define ULPMEM_DSGL_MAX_NPPODS	16	/*  1024/PPOD_SIZE */
+#define ULPMEM_IDATA_MAX_NPPODS	4	/*  256/PPOD_SIZE */
+#define PCIE_MEMWIN_MAX_NPPODS	16	/*  1024/PPOD_SIZE */
+
+#define PPOD_COLOR_SHIFT	0
+#define PPOD_COLOR_MASK		0x3F
+#define PPOD_COLOR_SIZE         6
+#define PPOD_COLOR(x)		((x) << PPOD_COLOR_SHIFT)
+
+#define PPOD_TAG_SHIFT	6
+#define PPOD_TAG_MASK	0xFFFFFF
+#define PPOD_TAG(x)	((x) << PPOD_TAG_SHIFT)
+
+#define PPOD_PGSZ_SHIFT	30
+#define PPOD_PGSZ_MASK	0x3
+#define PPOD_PGSZ(x)	((x) << PPOD_PGSZ_SHIFT)
+
+#define PPOD_TID_SHIFT	32
+#define PPOD_TID_MASK	0xFFFFFF
+#define PPOD_TID(x)	((__u64)(x) << PPOD_TID_SHIFT)
+
+#define PPOD_VALID_SHIFT	56
+#define PPOD_VALID(x)	((__u64)(x) << PPOD_VALID_SHIFT)
+#define PPOD_VALID_FLAG	PPOD_VALID(1ULL)
+
+#define PPOD_LEN_SHIFT	32
+#define PPOD_LEN_MASK	0xFFFFFFFF
+#define PPOD_LEN(x)	((__u64)(x) << PPOD_LEN_SHIFT)
+
+#define PPOD_OFST_SHIFT	0
+#define PPOD_OFST_MASK	0xFFFFFFFF
+#define PPOD_OFST(x)	((x) << PPOD_OFST_SHIFT)
+
+#define PPOD_IDX_SHIFT          PPOD_COLOR_SIZE
+#define PPOD_IDX_MAX_SIZE       24
+
+#define W_TCB_ULP_TYPE          0
+#define TCB_ULP_TYPE_SHIFT      0
+#define TCB_ULP_TYPE_MASK       0xfULL
+#define TCB_ULP_TYPE(x)         ((x) << TCB_ULP_TYPE_SHIFT)
+
+#define W_TCB_ULP_RAW           0
+#define TCB_ULP_RAW_SHIFT       4
+#define TCB_ULP_RAW_MASK        0xffULL
+#define TCB_ULP_RAW(x)          ((x) << TCB_ULP_RAW_SHIFT)
+
+int cxgb4i_ddp_init(struct cxgbi_device *);
+void cxgb4i_ddp_cleanup(struct cxgbi_device *);
+
+#endif	/* __CXGB4I_H__ */
+
diff --git a/drivers/scsi/cxgbi/cxgb4i_ddp.c b/drivers/scsi/cxgbi/cxgb4i_ddp.c
new file mode 100644
index 0000000..24debcf
--- /dev/null
+++ b/drivers/scsi/cxgbi/cxgb4i_ddp.c
@@ -0,0 +1,653 @@
+/*
+ * cxgb4i_ddp.c: Chelsio T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#include <linux/skbuff.h>
+#include <linux/scatterlist.h>
+
+#include "libcxgbi.h"
+#include "cxgb4i.h"
+
+#define DDP_PGIDX_MAX	4
+#define DDP_THRESHOLD	2048
+
+static unsigned char ddp_page_order[DDP_PGIDX_MAX] = {0, 1, 2, 4};
+static unsigned char ddp_page_shift[DDP_PGIDX_MAX] = {12, 13, 14, 16};
+static unsigned char page_idx = DDP_PGIDX_MAX;
+static unsigned char sw_tag_idx_bits;
+static unsigned char sw_tag_age_bits;
+
+static inline void cxgb4i_ddp_ppod_set(struct pagepod *ppod,
+				       struct pagepod_hdr *hdr,
+				       struct cxgbi_gather_list *gl,
+				       unsigned int pidx)
+{
+	int i;
+
+	memcpy(ppod, hdr, sizeof(*hdr));
+	for (i = 0; i < (PPOD_PAGES_MAX + 1); i++, pidx++) {
+		ppod->addr[i] = pidx < gl->nelem ?
+			cpu_to_be64(gl->phys_addr[pidx]) : 0ULL;
+	}
+}
+
+static inline void cxgb4i_ddp_ppod_clear(struct pagepod *ppod)
+{
+	memset(ppod, 0, sizeof(*ppod));
+}
+
+static inline void cxgb4i_ddp_ulp_mem_io_set_hdr(struct ulp_mem_io *req,
+						unsigned int wr_len,
+						unsigned int dlen,
+						unsigned int pm_addr)
+{
+	struct ulptx_sgl *sgl;
+
+	INIT_ULPTX_WR(req, wr_len, 0, 0);
+	req->cmd = htonl(ULPTX_CMD(ULP_TX_MEM_WRITE));
+	req->dlen = htonl(ULP_MEMIO_DATA_LEN(dlen >> 5));
+	req->len16 = htonl(DIV_ROUND_UP(wr_len - sizeof(req->wr), 16));
+	req->lock_addr = htonl(ULP_MEMIO_ADDR(pm_addr >> 5));
+	sgl = (struct ulptx_sgl *)(req + 1);
+	sgl->cmd_nsge = htonl(ULPTX_CMD(ULP_TX_SC_DSGL) | ULPTX_NSGE(1));
+	sgl->len0 = htonl(dlen);
+}
+
+static int cxgb4i_ddp_ppod_write_sgl(struct cxgbi_hba *chba,
+				     struct cxgb4i_ddp_info *ddp,
+				     struct pagepod_hdr *hdr,
+				     unsigned int idx,
+				     unsigned int npods,
+				     struct cxgbi_gather_list *gl,
+				     unsigned int gl_pidx)
+{
+	unsigned int dlen, pm_addr, wr_len;
+	struct sk_buff *skb;
+	struct ulp_mem_io *req;
+	struct ulptx_sgl *sgl;
+	struct pagepod *ppod;
+	unsigned int i;
+
+	dlen = PPOD_SIZE * npods;
+	pm_addr = idx * PPOD_SIZE + ddp->llimit;
+	wr_len = roundup(sizeof(struct ulp_mem_io) +
+			sizeof(struct ulptx_sgl), 16);
+
+	skb = alloc_skb(wr_len + dlen, GFP_ATOMIC);
+	if (!skb) {
+		cxgbi_log_error("snic 0x%p, idx %u, npods %u, OOM\n",
+				ddp->snic, idx, npods);
+		return -ENOMEM;
+	}
+
+	memset(skb->data, 0, wr_len + dlen);
+	set_wr_txq(skb, CPL_PRIORITY_CONTROL, chba->txq_idx);
+	req = (struct ulp_mem_io *)__skb_put(skb, wr_len);
+	cxgb4i_ddp_ulp_mem_io_set_hdr(req, wr_len, dlen, pm_addr);
+	sgl = (struct ulptx_sgl *)(req + 1);
+	ppod = (struct pagepod *)(sgl + 1);
+	sgl->addr0 = cpu_to_be64(virt_to_phys(ppod));
+
+	for (i = 0; i < npods; i++, ppod++, gl_pidx += PPOD_PAGES_MAX) {
+		if (!hdr && !gl)
+			cxgb4i_ddp_ppod_clear(ppod);
+		else
+			cxgb4i_ddp_ppod_set(ppod, hdr, gl, gl_pidx);
+
+	}
+
+	cxgb4_ofld_send(chba->cdev->ports[chba->port_id], skb);
+	return 0;
+}
+
+static int cxgb4i_ddp_set_map(struct cxgbi_hba *chba,
+				struct cxgb4i_ddp_info *ddp,
+				struct pagepod_hdr *hdr,
+				unsigned int idx,
+				unsigned int npods,
+				struct cxgbi_gather_list *gl)
+{
+	unsigned int pidx, w_npods, cnt;
+	int err = 0;
+
+	for (w_npods = 0, pidx = 0; w_npods < npods;
+		idx += cnt, w_npods += cnt, pidx += PPOD_PAGES_MAX) {
+		cnt = npods - w_npods;
+		if (cnt > ULPMEM_DSGL_MAX_NPPODS)
+			cnt = ULPMEM_DSGL_MAX_NPPODS;
+		err = cxgb4i_ddp_ppod_write_sgl(chba, ddp, hdr, idx, cnt, gl,
+						pidx);
+		if (err < 0)
+			break;
+	}
+	return err;
+}
+
+static void cxgb4i_ddp_clear_map(struct cxgbi_hba *chba,
+				struct cxgb4i_ddp_info *ddp,
+				unsigned int tag,
+				unsigned int idx,
+				unsigned int npods)
+{
+	int err;
+	unsigned int w_npods, cnt;
+
+	for (w_npods = 0; w_npods < npods; idx += cnt, w_npods += cnt) {
+		cnt = npods - w_npods;
+
+		if (cnt > ULPMEM_DSGL_MAX_NPPODS)
+			cnt = ULPMEM_DSGL_MAX_NPPODS;
+		err = cxgb4i_ddp_ppod_write_sgl(chba, ddp, NULL, idx, cnt,
+						NULL, 0);
+		if (err < 0)
+			break;
+	}
+}
+
+static inline int cxgb4i_ddp_find_unused_entries(struct cxgb4i_ddp_info *ddp,
+					unsigned int start, unsigned int max,
+					unsigned int count,
+					struct cxgbi_gather_list *gl)
+{
+	unsigned int i, j, k;
+
+	/*  not enough entries */
+	if ((max - start) < count)
+		return -EBUSY;
+
+	max -= count;
+	spin_lock(&ddp->map_lock);
+	for (i = start; i < max;) {
+		for (j = 0, k = i; j < count; j++, k++) {
+			if (ddp->gl_map[k])
+				break;
+		}
+		if (j == count) {
+			for (j = 0, k = i; j < count; j++, k++)
+				ddp->gl_map[k] = gl;
+			spin_unlock(&ddp->map_lock);
+			return i;
+		}
+		i += j + 1;
+	}
+	spin_unlock(&ddp->map_lock);
+	return -EBUSY;
+}
+
+static inline void cxgb4i_ddp_unmark_entries(struct cxgb4i_ddp_info *ddp,
+						int start, int count)
+{
+	spin_lock(&ddp->map_lock);
+	memset(&ddp->gl_map[start], 0,
+		count * sizeof(struct cxgbi_gather_list *));
+	spin_unlock(&ddp->map_lock);
+}
+
+static int cxgb4i_ddp_find_page_index(unsigned long pgsz)
+{
+	int i;
+
+	for (i = 0; i < DDP_PGIDX_MAX; i++) {
+		if (pgsz == (1UL << ddp_page_shift[i]))
+			return i;
+	}
+	cxgbi_log_debug("ddp page size 0x%lx not supported\n", pgsz);
+	return DDP_PGIDX_MAX;
+}
+
+static int cxgb4i_ddp_adjust_page_table(void)
+{
+	int i;
+	unsigned int base_order, order;
+
+	if (PAGE_SIZE < (1UL << ddp_page_shift[0])) {
+		cxgbi_log_info("PAGE_SIZE 0x%lx too small, min 0x%lx\n",
+				PAGE_SIZE, 1UL << ddp_page_shift[0]);
+		return -EINVAL;
+	}
+
+	base_order = get_order(1UL << ddp_page_shift[0]);
+	order = get_order(1UL << PAGE_SHIFT);
+
+	for (i = 0; i < DDP_PGIDX_MAX; i++) {
+		/* first is the kernel page size,
+		 * then just doubling the size */
+		ddp_page_order[i] = order - base_order + i;
+		ddp_page_shift[i] = PAGE_SHIFT + i;
+	}
+	return 0;
+}
+
+static inline void cxgb4i_ddp_gl_unmap(struct pci_dev *pdev,
+					struct cxgbi_gather_list *gl)
+{
+	int i;
+
+	for (i = 0; i < gl->nelem; i++)
+		dma_unmap_page(&pdev->dev, gl->phys_addr[i], PAGE_SIZE,
+				PCI_DMA_FROMDEVICE);
+}
+
+static inline int cxgb4i_ddp_gl_map(struct pci_dev *pdev,
+				    struct cxgbi_gather_list *gl)
+{
+	int i;
+
+	for (i = 0; i < gl->nelem; i++) {
+		gl->phys_addr[i] = dma_map_page(&pdev->dev, gl->pages[i], 0,
+						PAGE_SIZE,
+						PCI_DMA_FROMDEVICE);
+		if (unlikely(dma_mapping_error(&pdev->dev, gl->phys_addr[i])))
+			goto unmap;
+	}
+	return i;
+unmap:
+	if (i) {
+		unsigned int nelem = gl->nelem;
+
+		gl->nelem = i;
+		cxgb4i_ddp_gl_unmap(pdev, gl);
+		gl->nelem = nelem;
+	}
+	return -ENOMEM;
+}
+
+static void cxgb4i_ddp_release_gl(struct cxgbi_gather_list *gl,
+				  struct pci_dev *pdev)
+{
+	cxgb4i_ddp_gl_unmap(pdev, gl);
+	kfree(gl);
+}
+
+static struct cxgbi_gather_list *cxgb4i_ddp_make_gl(unsigned int xferlen,
+						    struct scatterlist *sgl,
+						    unsigned int sgcnt,
+						    struct pci_dev *pdev,
+						    gfp_t gfp)
+{
+	struct cxgbi_gather_list *gl;
+	struct scatterlist *sg = sgl;
+	struct page *sgpage = sg_page(sg);
+	unsigned int sglen = sg->length;
+	unsigned int sgoffset = sg->offset;
+	unsigned int npages = (xferlen + sgoffset + PAGE_SIZE - 1) >>
+				PAGE_SHIFT;
+	int i = 1, j = 0;
+
+	if (xferlen < DDP_THRESHOLD) {
+		cxgbi_log_debug("xfer %u < threshold %u, no ddp.\n",
+				xferlen, DDP_THRESHOLD);
+		return NULL;
+	}
+
+	gl = kzalloc(sizeof(struct cxgbi_gather_list) +
+		     npages * (sizeof(dma_addr_t) +
+		     sizeof(struct page *)), gfp);
+	if (!gl)
+		return NULL;
+
+	gl->pages = (struct page **)&gl->phys_addr[npages];
+	gl->length = xferlen;
+	gl->offset = sgoffset;
+	gl->pages[0] = sgpage;
+	sg = sg_next(sg);
+
+	while (sg) {
+		struct page *page = sg_page(sg);
+
+		if (sgpage == page && sg->offset == sgoffset + sglen)
+			sglen += sg->length;
+		else {
+			/*  make sure the sgl is fit for ddp:
+			 *  each has the same page size, and
+			 *  all of the middle pages are used completely
+			 */
+			if ((j && sgoffset) || ((i != sgcnt - 1) &&
+					 ((sglen + sgoffset) & ~PAGE_MASK)))
+				goto error_out;
+
+			j++;
+			if (j == gl->nelem || sg->offset)
+				goto error_out;
+			gl->pages[j] = page;
+			sglen = sg->length;
+			sgoffset = sg->offset;
+			sgpage = page;
+		}
+		i++;
+		sg = sg_next(sg);
+	}
+	gl->nelem = ++j;
+
+	if (cxgb4i_ddp_gl_map(pdev, gl) < 0)
+		goto error_out;
+	return gl;
+error_out:
+	kfree(gl);
+	return NULL;
+}
+
+static void cxgb4i_ddp_tag_release(struct cxgbi_hba *chba, u32 tag)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(chba->cdev);
+	struct cxgb4i_ddp_info *ddp = snic->ddp;
+	u32 idx;
+
+	if (!ddp) {
+		cxgbi_log_error("release ddp tag 0x%x, ddp NULL.\n", tag);
+		return;
+	}
+
+	idx = (tag >> PPOD_IDX_SHIFT) & ddp->idx_mask;
+	if (idx < ddp->nppods) {
+		struct cxgbi_gather_list *gl = ddp->gl_map[idx];
+		unsigned int npods;
+
+		if (!gl || !gl->nelem) {
+			cxgbi_log_error("rel 0x%x, idx 0x%x, gl 0x%p, %u\n",
+					tag, idx, gl, gl ? gl->nelem : 0);
+			return;
+		}
+		npods = (gl->nelem + PPOD_PAGES_MAX - 1) >> PPOD_PAGES_SHIFT;
+		cxgbi_log_debug("ddp tag 0x%x, release idx 0x%x, npods %u.\n",
+				tag, idx, npods);
+		cxgb4i_ddp_clear_map(chba, ddp, tag, idx, npods);
+		cxgb4i_ddp_unmark_entries(ddp, idx, npods);
+		cxgb4i_ddp_release_gl(gl, ddp->pdev);
+	} else
+		cxgbi_log_error("ddp tag 0x%x, idx 0x%x > max 0x%x.\n",
+				tag, idx, ddp->nppods);
+}
+
+static int cxgb4i_ddp_tag_reserve(struct cxgbi_hba *chba,
+				  unsigned int tid,
+				  struct cxgbi_tag_format *tformat,
+				  u32 *tagp,
+				  struct cxgbi_gather_list *gl,
+				  gfp_t gfp)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(chba->cdev);
+	struct cxgb4i_ddp_info *ddp = snic->ddp;
+	struct pagepod_hdr hdr;
+	unsigned int npods;
+	int idx = -1;
+	int err = -ENOMEM;
+	u32 sw_tag = *tagp;
+	u32 tag;
+
+	if (page_idx >= DDP_PGIDX_MAX || !ddp || !gl || !gl->nelem ||
+			gl->length < DDP_THRESHOLD) {
+		cxgbi_log_debug("pgidx %u, xfer %u/%u, NO ddp.\n",
+				page_idx, gl->length, DDP_THRESHOLD);
+		return -EINVAL;
+	}
+
+	npods = (gl->nelem + PPOD_PAGES_MAX - 1) >> PPOD_PAGES_SHIFT;
+
+	if (ddp->idx_last == ddp->nppods)
+		idx = cxgb4i_ddp_find_unused_entries(ddp, 0, ddp->nppods,
+							npods, gl);
+	else {
+		idx = cxgb4i_ddp_find_unused_entries(ddp, ddp->idx_last + 1,
+							ddp->nppods, npods,
+							gl);
+		if (idx < 0 && ddp->idx_last >= npods) {
+			idx = cxgb4i_ddp_find_unused_entries(ddp, 0,
+				min(ddp->idx_last + npods, ddp->nppods),
+							npods, gl);
+		}
+	}
+	if (idx < 0) {
+		cxgbi_log_debug("xferlen %u, gl %u, npods %u NO DDP.\n",
+				gl->length, gl->nelem, npods);
+		return idx;
+	}
+
+	tag = cxgbi_ddp_tag_base(tformat, sw_tag);
+	tag |= idx << PPOD_IDX_SHIFT;
+
+	hdr.rsvd = 0;
+	hdr.vld_tid = htonl(PPOD_VALID_FLAG | PPOD_TID(tid));
+	hdr.pgsz_tag_clr = htonl(tag & ddp->rsvd_tag_mask);
+	hdr.max_offset = htonl(gl->length);
+	hdr.page_offset = htonl(gl->offset);
+
+	err = cxgb4i_ddp_set_map(chba, ddp, &hdr, idx, npods, gl);
+	if (err < 0)
+		goto unmark_entries;
+
+	ddp->idx_last = idx;
+	cxgbi_log_debug("xfer %u, gl %u,%u, tid 0x%x, 0x%x -> 0x%x(%u,%u).\n",
+			gl->length, gl->nelem, gl->offset, tid, sw_tag, tag,
+			idx, npods);
+	*tagp = tag;
+	return 0;
+unmark_entries:
+	cxgb4i_ddp_unmark_entries(ddp, idx, npods);
+	return err;
+}
+
+static int cxgb4i_ddp_setup_conn_pgidx(struct cxgbi_sock *csk,
+					unsigned int tid,
+					int pg_idx,
+					bool reply)
+{
+	struct sk_buff *skb;
+	struct cpl_set_tcb_field *req;
+	u64 val = pg_idx < DDP_PGIDX_MAX ? pg_idx : 0;
+
+	skb = alloc_skb(sizeof(*req), GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	/*  set up ulp submode and page size */
+	val = (val & 0x03) << 2;
+	val |= TCB_ULP_TYPE(ULP_MODE_ISCSI);
+	req = (struct cpl_set_tcb_field *)skb_put(skb, sizeof(*req));
+	INIT_TP_WR(req, tid);
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, csk->hwtid));
+	req->reply_ctrl = htons(NO_REPLY(reply) | QUEUENO(csk->rss_qid));
+	req->word_cookie = htons(TCB_WORD(W_TCB_ULP_RAW));
+	req->mask = cpu_to_be64(TCB_ULP_TYPE(TCB_ULP_TYPE_MASK));
+	req->val = cpu_to_be64(val);
+	set_wr_txq(skb, CPL_PRIORITY_CONTROL, csk->txq_idx);
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+	return 0;
+}
+
+static int cxgb4i_ddp_setup_conn_host_pagesize(struct cxgbi_sock *csk,
+						unsigned int tid, int reply)
+{
+	return cxgb4i_ddp_setup_conn_pgidx(csk, tid, page_idx, reply);
+}
+
+static int cxgb4i_ddp_setup_conn_digest(struct cxgbi_sock *csk,
+					unsigned int tid, int hcrc,
+					int dcrc, int reply)
+{
+	struct sk_buff *skb;
+	struct cpl_set_tcb_field *req;
+	u64 val = (hcrc ? ULP_CRC_HEADER : 0) | (dcrc ? ULP_CRC_DATA : 0);
+
+	val = TCB_ULP_RAW(val);
+	val |= TCB_ULP_TYPE(ULP_MODE_ISCSI);
+
+	skb = alloc_skb(sizeof(*req), GFP_KERNEL);
+	if (!skb)
+		return -ENOMEM;
+
+	csk->hcrc_len = (hcrc ? 4 : 0);
+	csk->dcrc_len = (dcrc ? 4 : 0);
+	/*  set up ulp submode and page size */
+	req = (struct cpl_set_tcb_field *)skb_put(skb, sizeof(*req));
+	INIT_TP_WR(req, tid);
+	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SET_TCB_FIELD, tid));
+	req->reply_ctrl = htons(NO_REPLY(reply) | QUEUENO(csk->rss_qid));
+	req->word_cookie = htons(TCB_WORD(W_TCB_ULP_RAW));
+	req->mask = cpu_to_be64(TCB_ULP_RAW(TCB_ULP_RAW_MASK));
+	req->val = cpu_to_be64(val);
+	set_wr_txq(skb, CPL_PRIORITY_CONTROL, csk->txq_idx);
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+	return 0;
+}
+
+static void __cxgb4i_ddp_cleanup(struct kref *kref)
+{
+	int i = 0;
+	struct cxgb4i_ddp_info *ddp = container_of(kref,
+						struct cxgb4i_ddp_info,
+						refcnt);
+
+	cxgbi_log_info("kref release ddp 0x%p, snic 0x%p\n", ddp, ddp->snic);
+	ddp->snic->ddp = NULL;
+
+	while (i < ddp->nppods) {
+		struct cxgbi_gather_list *gl = ddp->gl_map[i];
+
+		if (gl) {
+			int npods = (gl->nelem + PPOD_PAGES_MAX - 1) >>
+							PPOD_PAGES_SHIFT;
+			cxgbi_log_info("snic 0x%p, ddp %d + %d\n",
+						ddp->snic, i, npods);
+			kfree(gl);
+			i += npods;
+		} else
+			i++;
+	}
+	cxgbi_free_big_mem(ddp);
+}
+
+static int __cxgb4i_ddp_init(struct cxgbi_device *cdev)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(cdev);
+	struct cxgb4i_ddp_info *ddp = snic->ddp;
+	unsigned int ppmax, bits, tagmask, pgsz_factor[4];
+	int i;
+
+	if (ddp) {
+		kref_get(&ddp->refcnt);
+		cxgbi_log_warn("snic 0x%p, ddp 0x%p already set up\n",
+				snic, snic->ddp);
+		return -EALREADY;
+	}
+	sw_tag_idx_bits = (__ilog2_u32(ISCSI_ITT_MASK)) + 1;
+	sw_tag_age_bits = (__ilog2_u32(ISCSI_AGE_MASK)) + 1;
+	cdev->tag_format.sw_bits = sw_tag_idx_bits + sw_tag_age_bits;
+	cxgbi_log_info("tag itt 0x%x, %u bits, age 0x%x, %u bits\n",
+			ISCSI_ITT_MASK, sw_tag_idx_bits,
+			ISCSI_AGE_MASK, sw_tag_age_bits);
+	ppmax = (snic->lldi.vr->iscsi.size >> PPOD_SIZE_SHIFT);
+	bits = __ilog2_u32(ppmax) + 1;
+	if (bits > PPOD_IDX_MAX_SIZE)
+		bits = PPOD_IDX_MAX_SIZE;
+	ppmax = (1 << (bits - 1)) - 1;
+	ddp = cxgbi_alloc_big_mem(sizeof(struct cxgb4i_ddp_info) +
+			ppmax * (sizeof(struct cxgbi_gather_list *) +
+				sizeof(struct sk_buff *)),
+				GFP_KERNEL);
+	if (!ddp) {
+		cxgbi_log_warn("snic 0x%p unable to alloc ddp 0x%d, "
+			       "ddp disabled\n", snic, ppmax);
+		return -ENOMEM;
+	}
+	ddp->gl_map = (struct cxgbi_gather_list **)(ddp + 1);
+	spin_lock_init(&ddp->map_lock);
+	kref_init(&ddp->refcnt);
+	ddp->snic = snic;
+	ddp->pdev = snic->lldi.pdev;
+	ddp->max_txsz = min_t(unsigned int,
+				snic->lldi.iscsi_iolen,
+				ULP2_MAX_PKT_SIZE);
+	ddp->max_rxsz = min_t(unsigned int,
+				snic->lldi.iscsi_iolen,
+				ULP2_MAX_PKT_SIZE);
+	ddp->llimit = snic->lldi.vr->iscsi.start;
+	ddp->ulimit = ddp->llimit + snic->lldi.vr->iscsi.size - 1;
+	ddp->nppods = ppmax;
+	ddp->idx_last = ppmax;
+	ddp->idx_bits = bits;
+	ddp->idx_mask = (1 << bits) - 1;
+	ddp->rsvd_tag_mask = (1 << (bits + PPOD_IDX_SHIFT)) - 1;
+	tagmask = ddp->idx_mask << PPOD_IDX_SHIFT;
+	for (i = 0; i < DDP_PGIDX_MAX; i++)
+		pgsz_factor[i] = ddp_page_order[i];
+	cxgb4_iscsi_init(snic->lldi.ports[0], tagmask, pgsz_factor);
+	snic->ddp = ddp;
+	cdev->tag_format.rsvd_bits = ddp->idx_bits;
+	cdev->tag_format.rsvd_shift = PPOD_IDX_SHIFT;
+	cdev->tag_format.rsvd_mask =
+		((1 << cdev->tag_format.rsvd_bits) - 1);
+	cxgbi_log_info("tag format: sw %u, rsvd %u,%u, mask 0x%x.\n",
+			cdev->tag_format.sw_bits,
+			cdev->tag_format.rsvd_bits,
+			cdev->tag_format.rsvd_shift,
+			cdev->tag_format.rsvd_mask);
+	cdev->tx_max_size = min_t(unsigned int, ULP2_MAX_PDU_PAYLOAD,
+				ddp->max_txsz - ISCSI_PDU_NONPAYLOAD_LEN);
+	cdev->rx_max_size = min_t(unsigned int, ULP2_MAX_PDU_PAYLOAD,
+				ddp->max_rxsz - ISCSI_PDU_NONPAYLOAD_LEN);
+	cxgbi_log_info("max payload size: %u/%u, %u/%u.\n",
+			cdev->tx_max_size, ddp->max_txsz,
+			cdev->rx_max_size, ddp->max_rxsz);
+	cxgbi_log_info("snic 0x%p, nppods %u, bits %u, mask 0x%x,0x%x "
+			"pkt %u/%u, %u/%u\n",
+			snic, ppmax, ddp->idx_bits, ddp->idx_mask,
+			ddp->rsvd_tag_mask, ddp->max_txsz,
+			snic->lldi.iscsi_iolen,
+			ddp->max_rxsz, snic->lldi.iscsi_iolen);
+	return 0;
+}
+
+int cxgb4i_ddp_init(struct cxgbi_device *cdev)
+{
+	int rc = 0;
+
+	if (page_idx == DDP_PGIDX_MAX) {
+		page_idx = cxgb4i_ddp_find_page_index(PAGE_SIZE);
+
+		if (page_idx == DDP_PGIDX_MAX) {
+			cxgbi_log_info("system PAGE_SIZE %lu, update hw\n",
+					PAGE_SIZE);
+			rc = cxgb4i_ddp_adjust_page_table();
+			if (rc) {
+				cxgbi_log_info("PAGE_SIZE %lu, ddp disabled\n",
+						PAGE_SIZE);
+				return rc;
+			}
+			page_idx = cxgb4i_ddp_find_page_index(PAGE_SIZE);
+		}
+		cxgbi_log_info("system PAGE_SIZE %lu, ddp idx %u\n",
+				PAGE_SIZE, page_idx);
+	}
+	rc = __cxgb4i_ddp_init(cdev);
+	if (rc)
+		return rc;
+
+	cdev->ddp_make_gl = cxgb4i_ddp_make_gl;
+	cdev->ddp_release_gl = cxgb4i_ddp_release_gl;
+	cdev->ddp_tag_reserve = cxgb4i_ddp_tag_reserve;
+	cdev->ddp_tag_release = cxgb4i_ddp_tag_release;
+	cdev->ddp_setup_conn_digest = cxgb4i_ddp_setup_conn_digest;
+	cdev->ddp_setup_conn_host_pgsz = cxgb4i_ddp_setup_conn_host_pagesize;
+	return 0;
+}
+
+void cxgb4i_ddp_cleanup(struct cxgbi_device *cdev)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(cdev);
+	struct cxgb4i_ddp_info *ddp = snic->ddp;
+
+	cxgbi_log_info("snic 0x%p, release ddp 0x%p\n", cdev, ddp);
+	if (ddp)
+		kref_put(&ddp->refcnt, __cxgb4i_ddp_cleanup);
+}
+
diff --git a/drivers/scsi/cxgbi/cxgb4i_init.c b/drivers/scsi/cxgbi/cxgb4i_init.c
new file mode 100644
index 0000000..31da683
--- /dev/null
+++ b/drivers/scsi/cxgbi/cxgb4i_init.c
@@ -0,0 +1,317 @@
+/*
+ * cxgb4i_init.c: Chelsio T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#include <net/route.h>
+#include <scsi/scsi_cmnd.h>
+#include <scsi/scsi_device.h>
+#include <scsi/scsi_eh.h>
+#include <scsi/scsi_host.h>
+#include <scsi/scsi.h>
+#include <scsi/iscsi_proto.h>
+#include <scsi/libiscsi.h>
+#include <scsi/scsi_transport_iscsi.h>
+
+#include "cxgb4i.h"
+
+#define	DRV_MODULE_NAME		"cxgb4i"
+#define	DRV_MODULE_VERSION	"0.90"
+#define	DRV_MODULE_RELDATE	"05/04/2010"
+
+static char version[] =
+	"Chelsio T4 iSCSI driver " DRV_MODULE_NAME
+	" v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
+
+MODULE_AUTHOR("Chelsio Communications");
+MODULE_DESCRIPTION("Chelsio T4 iSCSI driver");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(DRV_MODULE_VERSION);
+
+#define RX_PULL_LEN	128
+
+static struct scsi_transport_template *cxgb4i_scsi_transport;
+static struct scsi_host_template cxgb4i_host_template;
+static struct iscsi_transport cxgb4i_iscsi_transport;
+static struct cxgbi_device *cxgb4i_cdev;
+
+static void *cxgb4i_uld_add(const struct cxgb4_lld_info *linfo);
+static int cxgb4i_uld_rx_handler(void *handle, const __be64 *rsp,
+				const struct pkt_gl *pgl);
+static int cxgb4i_uld_state_change(void *handle, enum cxgb4_state state);
+static int cxgb4i_iscsi_init(struct cxgbi_device *);
+static void cxgb4i_iscsi_cleanup(struct cxgbi_device *);
+
+const static struct cxgb4_uld_info cxgb4i_uld_info = {
+	.name = "cxgb4i",
+	.add = cxgb4i_uld_add,
+	.rx_handler = cxgb4i_uld_rx_handler,
+	.state_change = cxgb4i_uld_state_change,
+};
+
+static struct cxgbi_device *cxgb4i_uld_init(const struct cxgb4_lld_info *linfo)
+{
+	struct cxgbi_device *cdev;
+	struct cxgb4i_snic *snic;
+	struct port_info *pi;
+	int i, offs, rc;
+
+	cdev = cxgbi_device_register(sizeof(*snic), linfo->nports);
+	if (!cdev) {
+		cxgbi_log_debug("error cxgbi_device_alloc\n");
+		return NULL;
+	}
+	cxgb4i_cdev = cdev;
+	snic = cxgbi_cdev_priv(cxgb4i_cdev);
+	snic->lldi = *linfo;
+	cdev->ports = snic->lldi.ports;
+	cdev->nports = snic->lldi.nports;
+	cdev->pdev = snic->lldi.pdev;
+	cdev->mtus = snic->lldi.mtus;
+	cdev->nmtus = NMTUS;
+	cdev->skb_tx_headroom = SKB_MAX_HEAD(CXGB4I_TX_HEADER_LEN);
+	rc = cxgb4i_iscsi_init(cdev);
+	if (rc) {
+		cxgbi_log_info("cxgb4i_iscsi_init failed\n");
+		goto free_cdev;
+	}
+	rc = cxgbi_pdu_init(cdev);
+	if (rc) {
+		cxgbi_log_info("cxgbi_pdu_init failed\n");
+		goto clean_iscsi;
+	}
+	rc = cxgb4i_ddp_init(cdev);
+	if (rc) {
+		cxgbi_log_info("cxgb4i_ddp_init failed\n");
+		goto clean_pdu;
+	}
+	rc = cxgb4i_ofld_init(cdev);
+	if (rc) {
+		cxgbi_log_info("cxgb4i_ofld_init failed\n");
+		goto clean_ddp;
+	}
+
+	for (i = 0; i < cdev->nports; i++) {
+		cdev->hbas[i] = cxgbi_hba_add(cdev,
+						CXGB4I_MAX_LUN,
+						CXGB4I_MAX_CONN,
+						cxgb4i_scsi_transport,
+						&cxgb4i_host_template,
+						snic->lldi.ports[i]);
+		if (!cdev->hbas[i])
+			goto clean_iscsi;
+
+		pi = netdev_priv(snic->lldi.ports[i]);
+		offs = snic->lldi.ntxq / snic->lldi.nchan;
+		cdev->hbas[i]->txq_idx = pi->port_id * offs;
+		cdev->hbas[i]->port_id = pi->port_id;
+	}
+	return cdev;
+clean_iscsi:
+	cxgb4i_iscsi_cleanup(cdev);
+clean_pdu:
+	cxgbi_pdu_cleanup(cdev);
+clean_ddp:
+	cxgb4i_ddp_cleanup(cdev);
+free_cdev:
+	cxgbi_device_unregister(cdev);
+	cdev = ERR_PTR(-ENOMEM);
+	return cdev;
+}
+
+static void cxgb4i_uld_cleanup(void *handle)
+{
+	struct cxgbi_device *cdev = handle;
+	int i;
+
+	if (!cdev)
+		return;
+
+	for (i = 0; i < cdev->nports; i++) {
+		if (cdev->hbas[i]) {
+			cxgbi_hba_remove(cdev->hbas[i]);
+			cdev->hbas[i] = NULL;
+		}
+	}
+	cxgb4i_ofld_cleanup(cdev);
+	cxgb4i_ddp_cleanup(cdev);
+	cxgbi_pdu_cleanup(cdev);
+	cxgbi_log_info("snic 0x%p, %u scsi hosts removed.\n",
+			cdev, cdev->nports);
+	cxgb4i_iscsi_cleanup(cdev);
+	cxgbi_device_unregister(cdev);
+}
+
+static void *cxgb4i_uld_add(const struct cxgb4_lld_info *linfo)
+{
+	cxgbi_log_info("%s", version);
+	return cxgb4i_uld_init(linfo);
+}
+
+static int cxgb4i_uld_rx_handler(void *handle, const __be64 *rsp,
+				const struct pkt_gl *pgl)
+{
+	const struct cpl_act_establish *rpl;
+	struct sk_buff *skb;
+	unsigned int opc;
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(handle);
+
+	if (pgl == NULL) {
+		unsigned int len = 64 - sizeof(struct rsp_ctrl) - 8;
+
+		skb = alloc_skb(len, GFP_ATOMIC);
+		if (!skb)
+			goto nomem;
+		__skb_put(skb, len);
+		skb_copy_to_linear_data(skb, &rsp[1], len);
+	} else {
+		skb = cxgb4_pktgl_to_skb(pgl, RX_PULL_LEN, RX_PULL_LEN);
+		if (unlikely(!skb))
+			goto nomem;
+	}
+
+	rpl = (struct cpl_act_establish *)skb->data;
+	opc = rpl->ot.opcode;
+	cxgbi_log_debug("snic %p, opcode 0x%x, skb %p\n", snic, opc, skb);
+	BUG_ON(!snic->handlers[opc]);
+
+	if (snic->handlers[opc])
+		snic->handlers[opc](snic, skb);
+	else
+		cxgbi_log_error("No handler for opcode 0x%x\n", opc);
+	return 0;
+nomem:
+	cxgbi_api_debug("OOM bailing out\n");
+	return 1;
+}
+
+static int cxgb4i_uld_state_change(void *handle, enum cxgb4_state state)
+{
+	return 0;
+}
+
+static struct scsi_host_template cxgb4i_host_template = {
+	.module				= THIS_MODULE,
+	.name				= "Chelsio T4 iSCSI initiator",
+	.proc_name			= "cxgb4i",
+	.queuecommand			= iscsi_queuecommand,
+	.change_queue_depth		= iscsi_change_queue_depth,
+	.can_queue			= CXGB4I_SCSI_HOST_QDEPTH,
+	.sg_tablesize			= SG_ALL,
+	.max_sectors			= 0xFFFF,
+	.cmd_per_lun			= ISCSI_DEF_CMD_PER_LUN,
+	.eh_abort_handler		= iscsi_eh_abort,
+	.eh_device_reset_handler	= iscsi_eh_device_reset,
+	.eh_target_reset_handler	= iscsi_eh_recover_target,
+	.target_alloc			= iscsi_target_alloc,
+	.use_clustering			= DISABLE_CLUSTERING,
+	.this_id			= -1,
+};
+
+#define	CXGB4I_CAPS	(CAP_RECOVERY_L0 | CAP_MULTI_R2T |	\
+			CAP_HDRDGST | CAP_DATADGST |		\
+			CAP_DIGEST_OFFLOAD | CAP_PADDING_OFFLOAD)
+#define	CXGB4I_PMASK	(ISCSI_MAX_RECV_DLENGTH | ISCSI_MAX_XMIT_DLENGTH | \
+			ISCSI_HDRDGST_EN | ISCSI_DATADGST_EN | \
+			ISCSI_INITIAL_R2T_EN | ISCSI_MAX_R2T | \
+			ISCSI_IMM_DATA_EN | ISCSI_FIRST_BURST | \
+			ISCSI_MAX_BURST | ISCSI_PDU_INORDER_EN | \
+			ISCSI_DATASEQ_INORDER_EN | ISCSI_ERL | \
+			ISCSI_CONN_PORT | ISCSI_CONN_ADDRESS | \
+			ISCSI_EXP_STATSN | ISCSI_PERSISTENT_PORT | \
+			ISCSI_PERSISTENT_ADDRESS | ISCSI_TARGET_NAME | \
+			ISCSI_TPGT | ISCSI_USERNAME | \
+			ISCSI_PASSWORD | ISCSI_USERNAME_IN | \
+			ISCSI_PASSWORD_IN | ISCSI_FAST_ABORT | \
+			ISCSI_ABORT_TMO | ISCSI_LU_RESET_TMO | \
+			ISCSI_TGT_RESET_TMO | ISCSI_PING_TMO | \
+			ISCSI_RECV_TMO | ISCSI_IFACE_NAME | \
+			ISCSI_INITIATOR_NAME)
+#define	CXGB4I_HPMASK	(ISCSI_HOST_HWADDRESS | ISCSI_HOST_IPADDRESS | \
+			ISCSI_HOST_INITIATOR_NAME | ISCSI_HOST_INITIATOR_NAME)
+
+static struct iscsi_transport cxgb4i_iscsi_transport = {
+	.owner				= THIS_MODULE,
+	.name				= "cxgb4i",
+	.caps				= CXGB4I_CAPS,
+	.param_mask			= CXGB4I_PMASK,
+	.host_param_mask		= CXGB4I_HPMASK,
+	.get_host_param			= cxgbi_get_host_param,
+	.set_host_param			= cxgbi_set_host_param,
+
+	.create_session			= cxgbi_create_session,
+	.destroy_session		= cxgbi_destroy_session,
+	.get_session_param		= iscsi_session_get_param,
+
+	.create_conn			= cxgbi_create_conn,
+	.bind_conn			= cxgbi_bind_conn,
+	.destroy_conn			= iscsi_tcp_conn_teardown,
+	.start_conn			= iscsi_conn_start,
+	.stop_conn			= iscsi_conn_stop,
+	.get_conn_param			= cxgbi_get_conn_param,
+	.set_param			= cxgbi_set_conn_param,
+	.get_stats			= cxgbi_get_conn_stats,
+
+	.send_pdu			= iscsi_conn_send_pdu,
+
+	.init_task			= iscsi_tcp_task_init,
+	.xmit_task			= iscsi_tcp_task_xmit,
+	.cleanup_task			= cxgbi_cleanup_task,
+
+	.alloc_pdu			= cxgbi_conn_alloc_pdu,
+	.init_pdu			= cxgbi_conn_init_pdu,
+	.xmit_pdu			= cxgbi_conn_xmit_pdu,
+	.parse_pdu_itt			= cxgbi_parse_pdu_itt,
+
+	.ep_connect			= cxgbi_ep_connect,
+	.ep_poll			= cxgbi_ep_poll,
+	.ep_disconnect			= cxgbi_ep_disconnect,
+
+	.session_recovery_timedout	= iscsi_session_recovery_timedout,
+};
+
+static int cxgb4i_iscsi_init(struct cxgbi_device *cdev)
+{
+	cxgb4i_scsi_transport = iscsi_register_transport(
+					&cxgb4i_iscsi_transport);
+	if (!cxgb4i_scsi_transport) {
+		cxgbi_log_error("Could not register cxgb4i transport\n");
+		return -ENODATA;
+	}
+
+	cdev->itp = &cxgb4i_iscsi_transport;
+	return 0;
+}
+
+static void cxgb4i_iscsi_cleanup(struct cxgbi_device *cdev)
+{
+	if (cxgb4i_scsi_transport) {
+		cxgbi_api_debug("cxgb4i transport 0x%p removed\n",
+				cxgb4i_scsi_transport);
+		iscsi_unregister_transport(&cxgb4i_iscsi_transport);
+	}
+}
+
+
+static int __init cxgb4i_init_module(void)
+{
+	cxgb4_register_uld(CXGB4_ULD_ISCSI, &cxgb4i_uld_info);
+	return 0;
+}
+
+static void __exit cxgb4i_exit_module(void)
+{
+	cxgb4_unregister_uld(CXGB4_ULD_ISCSI);
+	cxgb4i_uld_cleanup(cxgb4i_cdev);
+}
+
+module_init(cxgb4i_init_module);
+module_exit(cxgb4i_exit_module);
diff --git a/drivers/scsi/cxgbi/cxgb4i_offload.c b/drivers/scsi/cxgbi/cxgb4i_offload.c
new file mode 100644
index 0000000..eff1b11
--- /dev/null
+++ b/drivers/scsi/cxgbi/cxgb4i_offload.c
@@ -0,0 +1,1413 @@
+/*
+ * cxgb4i_offload.c: Chelsio T4 iSCSI driver.
+ *
+ * Copyright (c) 2010 Chelsio Communications, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation.
+ *
+ * Written by: Karen Xie (kxie-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ * Written by: Rakesh Ranjan (rranjan-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org)
+ */
+
+#include <linux/if_vlan.h>
+#include <net/dst.h>
+#include <net/route.h>
+#include <net/tcp.h>
+
+#include "libcxgbi.h"
+#include "cxgb4i.h"
+
+static int cxgb4i_rcv_win = 256 * 1024;
+module_param(cxgb4i_rcv_win, int, 0644);
+MODULE_PARM_DESC(cxgb4i_rcv_win, "TCP reveive window in bytes");
+
+static int cxgb4i_snd_win = 128 * 1024;
+module_param(cxgb4i_snd_win, int, 0644);
+MODULE_PARM_DESC(cxgb4i_snd_win, "TCP send window in bytes");
+
+static int cxgb4i_rx_credit_thres = 10 * 1024;
+module_param(cxgb4i_rx_credit_thres, int, 0644);
+MODULE_PARM_DESC(cxgb4i_rx_credit_thres,
+		"RX credits return threshold in bytes (default=10KB)");
+
+static unsigned int cxgb4i_max_connect = (8 * 1024);
+module_param(cxgb4i_max_connect, uint, 0644);
+MODULE_PARM_DESC(cxgb4i_max_connect, "Maximum number of connections");
+
+static unsigned short cxgb4i_sport_base = 20000;
+module_param(cxgb4i_sport_base, ushort, 0644);
+MODULE_PARM_DESC(cxgb4i_sport_base, "Starting port number (default 20000)");
+
+#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))
+#define RCV_BUFSIZ_MASK	0x3FFU
+#define MAX_IMM_TX_PKT_LEN 128
+
+static int cxgb4i_sock_push_tx_frames(struct cxgbi_sock *, int);
+
+/*
+ * is_ofld_imm - check whether a packet can be sent as immediate data
+ * @skb: the packet
+ *
+ * Returns true if a packet can be sent as an offload WR with immediate
+ * data.  We currently use the same limit as for Ethernet packets.
+ */
+static inline int is_ofld_imm(const struct sk_buff *skb)
+{
+	return skb->len <= (MAX_IMM_TX_PKT_LEN -
+			sizeof(struct fw_ofld_tx_data_wr));
+}
+
+static void cxgb4i_sock_make_act_open_req(struct cxgbi_sock *csk,
+					   struct sk_buff *skb,
+					   unsigned int qid_atid,
+					   struct l2t_entry *e)
+{
+	struct cpl_act_open_req *req;
+	unsigned long long opt0;
+	unsigned int opt2;
+	int wscale;
+
+	cxgbi_conn_debug("csk 0x%p, atid 0x%x\n", csk, qid_atid);
+
+	wscale = cxgbi_sock_compute_wscale(csk->mss_idx);
+	opt0 = KEEP_ALIVE(1) |
+		WND_SCALE(wscale) |
+		MSS_IDX(csk->mss_idx) |
+		L2T_IDX(((struct l2t_entry *)csk->l2t)->idx) |
+		TX_CHAN(csk->tx_chan) |
+		SMAC_SEL(csk->smac_idx) |
+		RCV_BUFSIZ(cxgb4i_rcv_win >> 10);
+	opt2 = RX_CHANNEL(0) |
+		RSS_QUEUE_VALID |
+		RSS_QUEUE(csk->rss_qid);
+	set_wr_txq(skb, CPL_PRIORITY_SETUP, csk->txq_idx);
+	req = (struct cpl_act_open_req *)__skb_put(skb, sizeof(*req));
+	INIT_TP_WR(req, 0);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_ACT_OPEN_REQ,
+					qid_atid));
+	req->local_port = csk->saddr.sin_port;
+	req->peer_port = csk->daddr.sin_port;
+	req->local_ip = csk->saddr.sin_addr.s_addr;
+	req->peer_ip = csk->daddr.sin_addr.s_addr;
+	req->opt0 = cpu_to_be64(opt0);
+	req->params = 0;
+	req->opt2 = cpu_to_be32(opt2);
+}
+
+static void cxgb4i_fail_act_open(struct cxgbi_sock *csk, int errno)
+{
+	cxgbi_conn_debug("csk 0%p, state %u, flag 0x%lx\n", csk,
+			csk->state, csk->flags);
+	csk->err = errno;
+	cxgbi_sock_closed(csk);
+}
+
+static void cxgb4i_act_open_req_arp_failure(void *handle, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk = (struct cxgbi_sock *)skb->sk;
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	if (csk->state == CTP_CONNECTING)
+		cxgb4i_fail_act_open(csk, -EHOSTUNREACH);
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	__kfree_skb(skb);
+}
+
+static void cxgb4i_sock_skb_entail(struct cxgbi_sock *csk,
+				   struct sk_buff *skb,
+				   int flags)
+{
+	cxgb4i_skb_tcp_seq(skb) = csk->write_seq;
+	cxgb4i_skb_flags(skb) = flags;
+	__skb_queue_tail(&csk->write_queue, skb);
+}
+
+static void cxgb4i_sock_send_close_req(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb = csk->cpl_close;
+	struct cpl_close_con_req *req = (struct cpl_close_con_req *)skb->head;
+	unsigned int tid = csk->hwtid;
+
+	cxgbi_log_debug("cxgb4i_sock_send_close_req called\n");
+	return;
+	csk->cpl_close = NULL;
+	set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+	INIT_TP_WR(req, tid);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_CLOSE_CON_REQ, tid));
+	req->rsvd = 0;
+	cxgb4i_sock_skb_entail(csk, skb, CTP_SKCBF_NO_APPEND);
+	if (csk->state != CTP_CONNECTING)
+		cxgb4i_sock_push_tx_frames(csk, 1);
+}
+
+static void cxgb4i_sock_abort_arp_failure(void *handle, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk = (struct cxgbi_sock *)handle;
+	struct cpl_abort_req *req;
+
+	req = (struct cpl_abort_req *)skb->data;
+	req->cmd = CPL_ABORT_NO_RST;
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+}
+
+static void cxgb4i_sock_send_abort_req(struct cxgbi_sock *csk)
+{
+	struct cpl_abort_req *req;
+	struct sk_buff *skb = csk->cpl_abort_req;
+
+	if (unlikely(csk->state == CTP_ABORTING) || !skb || !csk->cdev)
+		return;
+	cxgbi_log_debug("cxgb4i_sock_send_abort_req called\n");
+	return;
+	cxgbi_sock_set_state(csk, CTP_ABORTING);
+	cxgbi_conn_debug("csk 0x%p, flag ABORT_RPL + ABORT_SHUT\n", csk);
+	cxgbi_sock_set_state(csk, CTPF_ABORT_RPL_PENDING);
+	cxgbi_sock_purge_write_queue(csk);
+	csk->cpl_abort_req = NULL;
+	req = (struct cpl_abort_req *)skb->head;
+	set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+	t4_set_arp_err_handler(skb, csk, cxgb4i_sock_abort_arp_failure);
+	INIT_TP_WR(req, csk->hwtid);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_ABORT_REQ, csk->hwtid));
+	req->rsvd0 = htonl(csk->snd_nxt);
+	req->rsvd1 = !cxgbi_sock_flag(csk, CTPF_TX_DATA_SENT);
+	req->cmd = CPL_ABORT_SEND_RST;
+	cxgb4_l2t_send(csk->cdev->ports[csk->port_id], skb, csk->l2t);
+}
+
+static void cxgb4i_sock_send_abort_rpl(struct cxgbi_sock *csk, int rst_status)
+{
+	struct sk_buff *skb = csk->cpl_abort_rpl;
+	struct cpl_abort_rpl *rpl = (struct cpl_abort_rpl *)skb->head;
+
+	csk->cpl_abort_rpl = NULL;
+	set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+	INIT_TP_WR(rpl, csk->hwtid);
+	OPCODE_TID(rpl) = cpu_to_be32(MK_OPCODE_TID(CPL_ABORT_RPL, csk->hwtid));
+	rpl->cmd = rst_status;
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+}
+
+static u32 cxgb4i_csk_send_rx_credits(struct cxgbi_sock *csk, u32 credits)
+{
+	struct sk_buff *skb;
+	struct cpl_rx_data_ack *req;
+	int wrlen = roundup(sizeof(*req), 16);
+
+	skb = alloc_skb(wrlen, GFP_ATOMIC);
+	if (!skb)
+		return 0;
+	req = (struct cpl_rx_data_ack *)__skb_put(skb, wrlen);
+	memset(req, 0, wrlen);
+	set_wr_txq(skb, CPL_PRIORITY_ACK, csk->txq_idx);
+	INIT_TP_WR(req, csk->hwtid);
+	OPCODE_TID(req) = cpu_to_be32(MK_OPCODE_TID(CPL_RX_DATA_ACK,
+				      csk->hwtid));
+	req->credit_dack = cpu_to_be32(RX_CREDITS(credits) | RX_FORCE_ACK(1));
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+	return credits;
+}
+
+#define SKB_WR_LIST_SIZE	(MAX_SKB_FRAGS + 2)
+static const unsigned int cxgb4i_ulp_extra_len[] = { 0, 4, 4, 8 };
+
+static inline unsigned int ulp_extra_len(const struct sk_buff *skb)
+{
+	return cxgb4i_ulp_extra_len[cxgb4i_skb_ulp_mode(skb) & 3];
+}
+
+static inline void cxgb4i_sock_reset_wr_list(struct cxgbi_sock *csk)
+{
+	csk->wr_pending_head = csk->wr_pending_tail = NULL;
+}
+
+static inline void cxgb4i_sock_enqueue_wr(struct cxgbi_sock *csk,
+					  struct sk_buff *skb)
+{
+	cxgb4i_skb_tx_wr_next(skb) = NULL;
+	/*
+	 * We want to take an extra reference since both us and the driver
+	 * need to free the packet before it's really freed. We know there's
+	 * just one user currently so we use atomic_set rather than skb_get
+	 * to avoid the atomic op.
+	 */
+	atomic_set(&skb->users, 2);
+
+	if (!csk->wr_pending_head)
+		csk->wr_pending_head = skb;
+	else
+		cxgb4i_skb_tx_wr_next(csk->wr_pending_tail) = skb;
+	csk->wr_pending_tail = skb;
+}
+
+static int cxgb4i_sock_count_pending_wrs(const struct cxgbi_sock *csk)
+{
+	int n = 0;
+	const struct sk_buff *skb = csk->wr_pending_head;
+
+	while (skb) {
+		n += skb->csum;
+		skb = cxgb4i_skb_tx_wr_next(skb);
+	}
+	return n;
+}
+
+static inline struct sk_buff *cxgb4i_sock_peek_wr(const struct cxgbi_sock *csk)
+{
+	return csk->wr_pending_head;
+}
+
+static inline struct sk_buff *cxgb4i_sock_dequeue_wr(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb = csk->wr_pending_head;
+
+	if (likely(skb)) {
+		csk->wr_pending_head = cxgb4i_skb_tx_wr_next(skb);
+		cxgb4i_skb_tx_wr_next(skb) = NULL;
+	}
+	return skb;
+}
+
+static void cxgb4i_sock_purge_wr_queue(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb;
+
+	while ((skb = cxgb4i_sock_dequeue_wr(csk)) != NULL)
+		kfree_skb(skb);
+}
+
+/*
+ * sgl_len - calculates the size of an SGL of the given capacity
+ * @n: the number of SGL entries
+ * Calculates the number of flits needed for a scatter/gather list that
+ * can hold the given number of entries.
+ */
+static inline unsigned int sgl_len(unsigned int n)
+{
+	n--;
+	return (3 * n) / 2 + (n & 1) + 2;
+}
+
+/*
+ * calc_tx_flits_ofld - calculate # of flits for an offload packet
+ * @skb: the packet
+ *
+ * Returns the number of flits needed for the given offload packet.
+ * These packets are already fully constructed and no additional headers
+ * will be added.
+ */
+static inline unsigned int calc_tx_flits_ofld(const struct sk_buff *skb)
+{
+	unsigned int flits, cnt;
+
+	if (is_ofld_imm(skb))
+		return DIV_ROUND_UP(skb->len, 8);
+	flits = skb_transport_offset(skb) / 8;
+	cnt = skb_shinfo(skb)->nr_frags;
+	if (skb->tail != skb->transport_header)
+		cnt++;
+	return flits + sgl_len(cnt);
+}
+
+static inline void cxgb4i_sock_send_tx_flowc_wr(struct cxgbi_sock *csk)
+{
+	struct sk_buff *skb;
+	struct fw_flowc_wr *flowc;
+	int flowclen, i;
+
+	flowclen = 80;
+	skb = alloc_skb(flowclen, GFP_ATOMIC);
+	flowc = (struct fw_flowc_wr *)__skb_put(skb, flowclen);
+	flowc->op_to_nparams =
+		htonl(FW_WR_OP(FW_FLOWC_WR) | FW_FLOWC_WR_NPARAMS(8));
+	flowc->flowid_len16 =
+		htonl(FW_WR_LEN16(DIV_ROUND_UP(72, 16)) |
+				FW_WR_FLOWID(csk->hwtid));
+	flowc->mnemval[0].mnemonic = FW_FLOWC_MNEM_PFNVFN;
+	flowc->mnemval[0].val = htonl(0);
+	flowc->mnemval[1].mnemonic = FW_FLOWC_MNEM_CH;
+	flowc->mnemval[1].val = htonl(csk->tx_chan);
+	flowc->mnemval[2].mnemonic = FW_FLOWC_MNEM_PORT;
+	flowc->mnemval[2].val = htonl(csk->tx_chan);
+	flowc->mnemval[3].mnemonic = FW_FLOWC_MNEM_IQID;
+	flowc->mnemval[3].val = htonl(csk->rss_qid);
+	flowc->mnemval[4].mnemonic = FW_FLOWC_MNEM_SNDNXT;
+	flowc->mnemval[4].val = htonl(csk->snd_nxt);
+	flowc->mnemval[5].mnemonic = FW_FLOWC_MNEM_RCVNXT;
+	flowc->mnemval[5].val = htonl(csk->rcv_nxt);
+	flowc->mnemval[6].mnemonic = FW_FLOWC_MNEM_SNDBUF;
+	flowc->mnemval[6].val = htonl(cxgb4i_snd_win);
+	flowc->mnemval[7].mnemonic = FW_FLOWC_MNEM_MSS;
+	flowc->mnemval[7].val = htonl(csk->mss_idx);
+	flowc->mnemval[8].mnemonic = 0;
+	flowc->mnemval[8].val = 0;
+	for (i = 0; i < 9; i++) {
+		flowc->mnemval[i].r4[0] = 0;
+		flowc->mnemval[i].r4[1] = 0;
+		flowc->mnemval[i].r4[2] = 0;
+	}
+	set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+	cxgb4_ofld_send(csk->cdev->ports[csk->port_id], skb);
+}
+
+static inline void cxgb4i_sock_make_tx_data_wr(struct cxgbi_sock *csk,
+						struct sk_buff *skb, int dlen,
+						int len, u32 credits,
+						int req_completion)
+{
+	struct fw_ofld_tx_data_wr *req;
+	unsigned int wr_ulp_mode;
+
+	if (is_ofld_imm(skb)) {
+			req = (struct fw_ofld_tx_data_wr *)
+				__skb_push(skb, sizeof(*req));
+			req->op_to_immdlen =
+				cpu_to_be32(FW_WR_OP(FW_OFLD_TX_DATA_WR) |
+					FW_WR_COMPL(req_completion) |
+					FW_WR_IMMDLEN(dlen));
+			req->flowid_len16 =
+				cpu_to_be32(FW_WR_FLOWID(csk->hwtid) |
+						FW_WR_LEN16(credits));
+	} else {
+		req = (struct fw_ofld_tx_data_wr *)
+			__skb_push(skb, sizeof(*req));
+		req->op_to_immdlen =
+			cpu_to_be32(FW_WR_OP(FW_OFLD_TX_DATA_WR) |
+					FW_WR_COMPL(req_completion) |
+					FW_WR_IMMDLEN(0));
+		req->flowid_len16 =
+			cpu_to_be32(FW_WR_FLOWID(csk->hwtid) |
+					FW_WR_LEN16(credits));
+	}
+	wr_ulp_mode =
+		FW_OFLD_TX_DATA_WR_ULPMODE(cxgb4i_skb_ulp_mode(skb) >> 4) |
+		FW_OFLD_TX_DATA_WR_ULPSUBMODE(cxgb4i_skb_ulp_mode(skb) & 3);
+	req->tunnel_to_proxy = cpu_to_be32(wr_ulp_mode) |
+		FW_OFLD_TX_DATA_WR_SHOVE(skb_peek(&csk->write_queue) ? 0 : 1);
+	req->plen = cpu_to_be32(len);
+	if (!cxgbi_sock_flag(csk, CTPF_TX_DATA_SENT))
+		cxgbi_sock_set_flag(csk, CTPF_TX_DATA_SENT);
+}
+
+static void cxgb4i_sock_arp_failure_discard(void *handle, struct sk_buff *skb)
+{
+	kfree_skb(skb);
+}
+
+static int cxgb4i_sock_push_tx_frames(struct cxgbi_sock *csk,
+						int req_completion)
+{
+	int total_size = 0;
+	struct sk_buff *skb;
+	struct cxgb4i_snic *snic;
+
+	if (unlikely(csk->state == CTP_CONNECTING ||
+			csk->state == CTP_CLOSE_WAIT_1 ||
+			csk->state >= CTP_ABORTING)) {
+		cxgbi_tx_debug("csk 0x%p, in closing state %u.\n",
+				csk, csk->state);
+		return 0;
+	}
+
+	snic = cxgbi_cdev_priv(csk->cdev);
+	while (csk->wr_cred && (skb = skb_peek(&csk->write_queue)) != NULL) {
+		int dlen;
+		int len;
+		unsigned int credits_needed;
+
+		dlen = len = skb->len;
+		skb_reset_transport_header(skb);
+		if (is_ofld_imm(skb))
+			credits_needed = DIV_ROUND_UP(dlen +
+					sizeof(struct fw_ofld_tx_data_wr), 16);
+		else
+			credits_needed = DIV_ROUND_UP(8 *
+					calc_tx_flits_ofld(skb)+
+					sizeof(struct fw_ofld_tx_data_wr), 16);
+		if (csk->wr_cred < credits_needed) {
+			cxgbi_tx_debug("csk 0x%p, skb len %u/%u, "
+					"wr %d < %u.\n",
+					csk, skb->len, skb->data_len,
+					credits_needed, csk->wr_cred);
+			break;
+		}
+		__skb_unlink(skb, &csk->write_queue);
+		set_wr_txq(skb, CPL_PRIORITY_DATA, csk->txq_idx);
+		skb->csum = credits_needed;
+		csk->wr_cred -= credits_needed;
+		csk->wr_una_cred += credits_needed;
+		cxgb4i_sock_enqueue_wr(csk, skb);
+		cxgbi_tx_debug("csk 0x%p, enqueue, skb len %u/%u, "
+				"wr %d, left %u, unack %u.\n",
+				csk, skb->len, skb->data_len,
+				credits_needed, csk->wr_cred,
+				csk->wr_una_cred);
+		if (likely(cxgb4i_skb_flags(skb) & CTP_SKCBF_NEED_HDR)) {
+			len += ulp_extra_len(skb);
+			if (!cxgbi_sock_flag(csk, CTPF_TX_DATA_SENT)) {
+				cxgb4i_sock_send_tx_flowc_wr(csk);
+				skb->csum += 5;
+				csk->wr_cred -= 5;
+				csk->wr_una_cred += 5;
+			}
+			if ((req_completion &&
+				csk->wr_una_cred == credits_needed) ||
+				(cxgb4i_skb_flags(skb) & CTP_SKCBF_COMPL) ||
+				csk->wr_una_cred >= csk->wr_max_cred >> 1) {
+				req_completion = 1;
+				csk->wr_una_cred = 0;
+			}
+			cxgb4i_sock_make_tx_data_wr(csk, skb, dlen, len,
+						    credits_needed,
+						    req_completion);
+			csk->snd_nxt += len;
+			if (req_completion)
+				cxgb4i_skb_flags(skb) &= ~CTP_SKCBF_NEED_HDR;
+		}
+		total_size += skb->truesize;
+		t4_set_arp_err_handler(skb, csk,
+					cxgb4i_sock_arp_failure_discard);
+		cxgb4_l2t_send(csk->cdev->ports[csk->port_id], skb, csk->l2t);
+	}
+	return total_size;
+}
+
+static inline void cxgb4i_sock_free_atid(struct cxgbi_sock *csk)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(csk->cdev);
+
+	cxgb4_free_atid(snic->lldi.tids, csk->atid);
+	cxgbi_sock_put(csk);
+}
+
+static void cxgb4i_sock_established(struct cxgbi_sock *csk, u32 snd_isn,
+					unsigned int opt)
+{
+	cxgbi_conn_debug("csk 0x%p, state %u.\n", csk, csk->state);
+
+	csk->write_seq = csk->snd_nxt = csk->snd_una = snd_isn;
+	/*
+	 * Causes the first RX_DATA_ACK to supply any Rx credits we couldn't
+	 * pass through opt0.
+	 */
+	if (cxgb4i_rcv_win > (RCV_BUFSIZ_MASK << 10))
+		csk->rcv_wup -= cxgb4i_rcv_win - (RCV_BUFSIZ_MASK << 10);
+	dst_confirm(csk->dst);
+	smp_mb();
+	cxgbi_sock_set_state(csk, CTP_ESTABLISHED);
+}
+
+static int cxgb4i_cpl_act_establish(struct cxgb4i_snic *snic,
+				    struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_act_establish *req = (struct cpl_act_establish *)skb->data;
+	unsigned int hwtid = GET_TID(req);
+	unsigned int atid = GET_TID_TID(ntohl(req->tos_atid));
+	struct tid_info *t = snic->lldi.tids;
+	u32 rcv_isn = be32_to_cpu(req->rcv_isn);
+
+	csk = lookup_atid(t, atid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+	cxgbi_conn_debug("csk 0x%p, state %u, flag 0x%lx\n",
+				csk, csk->state, csk->flags);
+	csk->hwtid = hwtid;
+	cxgbi_sock_hold(csk);
+	cxgb4_insert_tid(snic->lldi.tids, csk, hwtid);
+	cxgb4i_sock_free_atid(csk);
+	spin_lock_bh(&csk->lock);
+
+	if (unlikely(csk->state != CTP_CONNECTING))
+		cxgbi_log_error("TID %u expected SYN_SENT, got EST., s %u\n",
+				csk->hwtid, csk->state);
+	csk->copied_seq = csk->rcv_wup = csk->rcv_nxt = rcv_isn;
+	cxgb4i_sock_established(csk, ntohl(req->snd_isn), ntohs(req->tcp_opt));
+	__kfree_skb(skb);
+
+	if (unlikely(cxgbi_sock_flag(csk, CTPF_ACTIVE_CLOSE_NEEDED)))
+		cxgb4i_sock_send_abort_req(csk);
+	else {
+		if (skb_queue_len(&csk->write_queue))
+			cxgb4i_sock_push_tx_frames(csk, 1);
+		cxgbi_conn_tx_open(csk);
+	}
+
+	if (csk->retry_timer.function) {
+		del_timer(&csk->retry_timer);
+		csk->retry_timer.function = NULL;
+	}
+	spin_unlock_bh(&csk->lock);
+	return 0;
+}
+
+static int act_open_rpl_status_to_errno(int status)
+{
+	switch (status) {
+	case CPL_ERR_CONN_RESET:
+		return -ECONNREFUSED;
+	case CPL_ERR_ARP_MISS:
+		return -EHOSTUNREACH;
+	case CPL_ERR_CONN_TIMEDOUT:
+		return -ETIMEDOUT;
+	case CPL_ERR_TCAM_FULL:
+		return -ENOMEM;
+	case CPL_ERR_CONN_EXIST:
+		cxgbi_log_error("ACTIVE_OPEN_RPL: 4-tuple in use\n");
+		return -EADDRINUSE;
+	default:
+		return -EIO;
+	}
+}
+
+/*
+ * Return whether a failed active open has allocated a TID
+ */
+static inline int act_open_has_tid(int status)
+{
+	return status != CPL_ERR_TCAM_FULL && status != CPL_ERR_CONN_EXIST &&
+		status != CPL_ERR_ARP_MISS;
+}
+
+static void cxgb4i_sock_act_open_retry_timer(unsigned long data)
+{
+	struct sk_buff *skb;
+	struct cxgbi_sock *csk = (struct cxgbi_sock *)data;
+
+	cxgbi_conn_debug("csk 0x%p, state %u.\n", csk, csk->state);
+	spin_lock_bh(&csk->lock);
+	skb = alloc_skb(sizeof(struct cpl_act_open_req), GFP_ATOMIC);
+	if (!skb)
+		cxgb4i_fail_act_open(csk, -ENOMEM);
+	else {
+		unsigned int qid_atid  = csk->rss_qid << 14;
+		qid_atid |= (unsigned int)csk->atid;
+		skb->sk = (struct sock *)csk;
+		t4_set_arp_err_handler(skb, csk,
+					cxgb4i_act_open_req_arp_failure);
+		cxgb4i_sock_make_act_open_req(csk, skb, qid_atid, csk->l2t);
+		cxgb4_l2t_send(csk->cdev->ports[csk->port_id], skb, csk->l2t);
+	}
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+}
+
+static int cxgb4i_cpl_act_open_rpl(struct cxgb4i_snic *snic,
+				   struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_act_open_rpl *rpl = (struct cpl_act_open_rpl *)skb->data;
+	unsigned int atid =
+		GET_TID_TID(GET_AOPEN_ATID(be32_to_cpu(rpl->atid_status)));
+	struct tid_info *t = snic->lldi.tids;
+	unsigned int status = GET_AOPEN_STATUS(be32_to_cpu(rpl->atid_status));
+
+	csk = lookup_atid(t, atid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", atid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	cxgbi_conn_debug("rcv, status 0x%x, csk 0x%p, csk->state %u, "
+			"csk->flag 0x%lx, csk->atid %u.\n",
+			status, csk, csk->state, csk->flags, csk->hwtid);
+
+	if (status & act_open_has_tid(status))
+		cxgb4_remove_tid(snic->lldi.tids, csk->port_id, GET_TID(rpl));
+
+	if (status == CPL_ERR_CONN_EXIST &&
+	    csk->retry_timer.function != cxgb4i_sock_act_open_retry_timer) {
+		csk->retry_timer.function = cxgb4i_sock_act_open_retry_timer;
+		if (!mod_timer(&csk->retry_timer, jiffies + HZ / 2))
+			cxgbi_sock_hold(csk);
+	} else
+		cxgb4i_fail_act_open(csk,
+				     act_open_rpl_status_to_errno(status));
+
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	return 0;
+}
+
+static int cxgb4i_cpl_peer_close(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_peer_close *req = (struct cpl_peer_close *)skb->data;
+	unsigned int hwtid = GET_TID(req);
+	struct tid_info *t = snic->lldi.tids;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+
+	if (cxgbi_sock_flag(csk, CTPF_ABORT_RPL_PENDING))
+		goto out;
+
+	switch (csk->state) {
+	case CTP_ESTABLISHED:
+		cxgbi_sock_set_state(csk, CTP_PASSIVE_CLOSE);
+		break;
+	case CTP_ACTIVE_CLOSE:
+		cxgbi_sock_set_state(csk, CTP_CLOSE_WAIT_2);
+		break;
+	case CTP_CLOSE_WAIT_1:
+		cxgbi_sock_closed(csk);
+		break;
+	case CTP_ABORTING:
+		break;
+	default:
+		cxgbi_log_error("peer close, TID %u in bad state %u\n",
+				csk->hwtid, csk->state);
+	}
+
+	cxgbi_sock_conn_closing(csk);
+out:
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	return 0;
+}
+
+static int cxgb4i_cpl_close_con_rpl(struct cxgb4i_snic *snic,
+				    struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_close_con_rpl *rpl = (struct cpl_close_con_rpl *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	cxgbi_conn_debug("csk 0x%p, state %u, flag 0x%lx.\n",
+			csk, csk->state, csk->flags);
+	csk->snd_una = ntohl(rpl->snd_nxt) - 1;
+
+	if (cxgbi_sock_flag(csk, CTPF_ABORT_RPL_PENDING))
+		goto out;
+
+	switch (csk->state) {
+	case CTP_ACTIVE_CLOSE:
+		cxgbi_sock_set_state(csk, CTP_CLOSE_WAIT_1);
+		break;
+	case CTP_CLOSE_WAIT_1:
+	case CTP_CLOSE_WAIT_2:
+		cxgbi_sock_closed(csk);
+		break;
+	case CTP_ABORTING:
+		break;
+	default:
+		cxgbi_log_error("close_rpl, TID %u in bad state %u\n",
+				csk->hwtid, csk->state);
+	}
+out:
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	kfree_skb(skb);
+	return 0;
+}
+
+static int abort_status_to_errno(struct cxgbi_sock *csk, int abort_reason,
+								int *need_rst)
+{
+	switch (abort_reason) {
+	case CPL_ERR_BAD_SYN: /* fall through */
+	case CPL_ERR_CONN_RESET:
+		return csk->state > CTP_ESTABLISHED ?
+			-EPIPE : -ECONNRESET;
+	case CPL_ERR_XMIT_TIMEDOUT:
+	case CPL_ERR_PERSIST_TIMEDOUT:
+	case CPL_ERR_FINWAIT2_TIMEDOUT:
+	case CPL_ERR_KEEPALIVE_TIMEDOUT:
+		return -ETIMEDOUT;
+	default:
+		return -EIO;
+	}
+}
+
+/*
+ * Returns whether an ABORT_REQ_RSS message is a negative advice.
+ */
+static inline int is_neg_adv_abort(unsigned int status)
+{
+	return status == CPL_ERR_RTX_NEG_ADVICE ||
+		status == CPL_ERR_PERSIST_NEG_ADVICE;
+}
+
+static int cxgb4i_cpl_abort_req_rss(struct cxgb4i_snic *snic,
+				    struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_abort_req_rss *req = (struct cpl_abort_req_rss *)skb->data;
+	unsigned int hwtid = GET_TID(req);
+	struct tid_info *t = snic->lldi.tids;
+	int rst_status = CPL_ABORT_NO_RST;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	if (is_neg_adv_abort(req->status)) {
+		__kfree_skb(skb);
+		return 0;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+
+	if (!cxgbi_sock_flag(csk, CTPF_ABORT_REQ_RCVD)) {
+		cxgbi_sock_set_flag(csk, CTPF_ABORT_REQ_RCVD);
+		cxgbi_sock_set_state(csk, CTP_ABORTING);
+		__kfree_skb(skb);
+		goto out;
+	}
+
+	cxgbi_sock_clear_flag(csk, CTPF_ABORT_REQ_RCVD);
+	cxgb4i_sock_send_abort_rpl(csk, rst_status);
+
+	if (!cxgbi_sock_flag(csk, CTPF_ABORT_RPL_PENDING)) {
+		csk->err = abort_status_to_errno(csk, req->status,
+						 &rst_status);
+		cxgbi_sock_closed(csk);
+	}
+out:
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	return 0;
+}
+
+static int cxgb4i_cpl_abort_rpl_rss(struct cxgb4i_snic *snic,
+				    struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_abort_rpl_rss *rpl = (struct cpl_abort_rpl_rss *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+
+	if (rpl->status == CPL_ERR_ABORT_FAILED)
+		goto out;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		goto out;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+
+	if (cxgbi_sock_flag(csk, CTPF_ABORT_RPL_PENDING)) {
+		if (!cxgbi_sock_flag(csk, CTPF_ABORT_RPL_RCVD))
+			cxgbi_sock_set_flag(csk, CTPF_ABORT_RPL_RCVD);
+		else {
+			cxgbi_sock_clear_flag(csk, CTPF_ABORT_RPL_RCVD);
+			cxgbi_sock_clear_flag(csk, CTPF_ABORT_RPL_PENDING);
+
+			if (cxgbi_sock_flag(csk, CTPF_ABORT_REQ_RCVD))
+				cxgbi_log_error("tid %u, ABORT_RPL_RSS\n",
+						csk->hwtid);
+
+			cxgbi_sock_closed(csk);
+		}
+	}
+
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+out:
+	__kfree_skb(skb);
+	return 0;
+}
+
+static int cxgb4i_cpl_iscsi_hdr(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_iscsi_hdr *cpl = (struct cpl_iscsi_hdr *)skb->data;
+	unsigned int hwtid = GET_TID(cpl);
+	struct tid_info *t = snic->lldi.tids;
+	struct sk_buff *lskb;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	spin_lock_bh(&csk->lock);
+
+	if (unlikely(csk->state >= CTP_PASSIVE_CLOSE)) {
+		if (csk->state != CTP_ABORTING)
+			goto abort_conn;
+	}
+
+	cxgb4i_skb_tcp_seq(skb) = ntohl(cpl->seq);
+	skb_reset_transport_header(skb);
+	__skb_pull(skb, sizeof(*cpl));
+	__pskb_trim(skb, ntohs(cpl->len));
+
+	if (!csk->skb_ulp_lhdr) {
+		unsigned char *bhs;
+		unsigned int hlen, dlen;
+
+		csk->skb_ulp_lhdr = skb;
+		lskb = csk->skb_ulp_lhdr;
+		cxgb4i_skb_flags(lskb) = CTP_SKCBF_HDR_RCVD;
+
+		if (cxgb4i_skb_tcp_seq(lskb) != csk->rcv_nxt) {
+			cxgbi_log_error("tid 0x%x, CPL_ISCSI_HDR, bad seq got "
+					"0x%x, exp 0x%x\n",
+					csk->hwtid,
+					cxgb4i_skb_tcp_seq(lskb),
+					csk->rcv_nxt);
+			goto abort_conn;
+		}
+
+		bhs = lskb->data;
+		hlen = ntohs(cpl->len);
+		dlen = ntohl(*(unsigned int *)(bhs + 4)) & 0xFFFFFF;
+
+		if ((hlen + dlen) != ntohs(cpl->pdu_len_ddp) - 40) {
+			cxgbi_log_error("tid 0x%x, CPL_ISCSI_HDR, pdu len "
+					"mismatch %u != %u + %u, seq 0x%x\n",
+					csk->hwtid,
+					ntohs(cpl->pdu_len_ddp) - 40,
+					hlen, dlen, cxgb4i_skb_tcp_seq(skb));
+		}
+		cxgb4i_skb_rx_pdulen(skb) = hlen + dlen;
+		if (dlen)
+			cxgb4i_skb_rx_pdulen(skb) += csk->dcrc_len;
+		cxgb4i_skb_rx_pdulen(skb) =
+			((cxgb4i_skb_rx_pdulen(skb) + 3) & (~3));
+		csk->rcv_nxt += cxgb4i_skb_rx_pdulen(skb);
+	} else {
+		lskb = csk->skb_ulp_lhdr;
+		cxgb4i_skb_flags(lskb) |= CTP_SKCBF_DATA_RCVD;
+		cxgb4i_skb_flags(skb) = CTP_SKCBF_DATA_RCVD;
+		cxgbi_log_debug("csk 0x%p, tid 0x%x skb 0x%p, pdu data, "
+				" header 0x%p.\n",
+				csk, csk->hwtid, skb, lskb);
+	}
+
+	__skb_queue_tail(&csk->receive_queue, skb);
+	spin_unlock_bh(&csk->lock);
+	return 0;
+abort_conn:
+	cxgb4i_sock_send_abort_req(csk);
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	return -EINVAL;
+}
+
+static int cxgb4i_cpl_rx_data_ddp(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct sk_buff *lskb;
+	struct cpl_rx_data_ddp *rpl = (struct cpl_rx_data_ddp *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+	unsigned int status;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	spin_lock_bh(&csk->lock);
+
+	if (unlikely(csk->state >= CTP_PASSIVE_CLOSE)) {
+		if (csk->state != CTP_ABORTING)
+			goto abort_conn;
+	}
+
+	if (!csk->skb_ulp_lhdr) {
+		cxgbi_log_error("tid 0x%x, rcv RX_DATA_DDP w/o pdu header\n",
+				csk->hwtid);
+		goto abort_conn;
+	}
+
+	lskb = csk->skb_ulp_lhdr;
+	cxgb4i_skb_flags(lskb) |= CTP_SKCBF_STATUS_RCVD;
+
+	if (ntohs(rpl->len) != cxgb4i_skb_rx_pdulen(lskb)) {
+		cxgbi_log_error("tid 0x%x, RX_DATA_DDP pdulen %u != %u.\n",
+				csk->hwtid, ntohs(rpl->len),
+				cxgb4i_skb_rx_pdulen(lskb));
+	}
+
+	cxgb4i_skb_rx_ddigest(lskb) = ntohl(rpl->ulp_crc);
+	status = ntohl(rpl->ddpvld);
+
+	if (status & (1 << RX_DDP_STATUS_HCRC_SHIFT)) {
+		cxgbi_log_info("ULP2_FLAG_HCRC_ERROR set\n");
+		cxgb4i_skb_ulp_mode(skb) |= ULP2_FLAG_HCRC_ERROR;
+	}
+	if (status & (1 << RX_DDP_STATUS_DCRC_SHIFT)) {
+		cxgbi_log_info("ULP2_FLAG_DCRC_ERROR set\n");
+		cxgb4i_skb_ulp_mode(skb) |= ULP2_FLAG_DCRC_ERROR;
+	}
+	if (status & (1 << RX_DDP_STATUS_PAD_SHIFT)) {
+		cxgbi_log_info("ULP2_FLAG_PAD_ERROR set\n");
+		cxgb4i_skb_ulp_mode(skb) |= ULP2_FLAG_PAD_ERROR;
+	}
+	if ((cxgb4i_skb_flags(lskb) & ULP2_FLAG_DATA_READY)) {
+		cxgbi_log_info("ULP2_FLAG_DATA_DDPED set\n");
+		cxgb4i_skb_ulp_mode(skb) |= ULP2_FLAG_DATA_DDPED;
+	}
+
+	csk->skb_ulp_lhdr = NULL;
+	__kfree_skb(skb);
+	cxgbi_conn_pdu_ready(csk);
+	spin_unlock_bh(&csk->lock);
+	return 0;
+abort_conn:
+	cxgb4i_sock_send_abort_req(csk);
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	return -EINVAL;
+}
+
+static void check_wr_invariants(const struct cxgbi_sock *csk)
+{
+	int pending = cxgb4i_sock_count_pending_wrs(csk);
+
+	if (unlikely(csk->wr_cred + pending != csk->wr_max_cred))
+		printk(KERN_ERR "TID %u: credit imbalance: avail %u, "
+				"pending %u, total should be %u\n",
+				csk->hwtid,
+				csk->wr_cred,
+				pending,
+				csk->wr_max_cred);
+}
+
+static int cxgb4i_cpl_fw4_ack(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cxgbi_sock *csk;
+	struct cpl_fw4_ack *rpl = (struct cpl_fw4_ack *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+	unsigned char credits;
+	unsigned int snd_una;
+
+	csk = lookup_tid(t, hwtid);
+	if (unlikely(!csk)) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		kfree_skb(skb);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	cxgbi_sock_hold(csk);
+	spin_lock_bh(&csk->lock);
+	credits = rpl->credits;
+	snd_una = be32_to_cpu(rpl->snd_una);
+	cxgbi_tx_debug("%u WR credits, avail %u, unack %u, TID %u, state %u\n",
+				credits, csk->wr_cred, csk->wr_una_cred,
+						csk->hwtid, csk->state);
+	csk->wr_cred += credits;
+
+	if (csk->wr_una_cred > csk->wr_max_cred - csk->wr_cred)
+		csk->wr_una_cred = csk->wr_max_cred - csk->wr_cred;
+
+	while (credits) {
+		struct sk_buff *p = cxgb4i_sock_peek_wr(csk);
+
+		if (unlikely(!p)) {
+			cxgbi_log_error("%u WR_ACK credits for TID %u with "
+					"nothing pending, state %u\n",
+					credits, csk->hwtid, csk->state);
+			break;
+		}
+
+		if (unlikely(credits < p->csum))
+			p->csum -= credits;
+		else {
+			cxgb4i_sock_dequeue_wr(csk);
+			credits -= p->csum;
+			kfree_skb(p);
+		}
+	}
+
+	check_wr_invariants(csk);
+
+	if (rpl->seq_vld) {
+		if (unlikely(before(snd_una, csk->snd_una))) {
+			cxgbi_log_error("TID %u, unexpected sequence # %u "
+					"in WR_ACK snd_una %u\n",
+					csk->hwtid, snd_una, csk->snd_una);
+			goto out_free;
+		}
+	}
+
+	if (csk->snd_una != snd_una) {
+		csk->snd_una = snd_una;
+		dst_confirm(csk->dst);
+	}
+
+	if (skb_queue_len(&csk->write_queue)) {
+		if (cxgb4i_sock_push_tx_frames(csk, 0))
+			cxgbi_conn_tx_open(csk);
+	} else
+		cxgbi_conn_tx_open(csk);
+
+	goto out;
+out_free:
+	__kfree_skb(skb);
+out:
+	spin_unlock_bh(&csk->lock);
+	cxgbi_sock_put(csk);
+	return 0;
+}
+
+static int cxgb4i_cpl_set_tcb_rpl(struct cxgb4i_snic *snic, struct sk_buff *skb)
+{
+	struct cpl_set_tcb_rpl *rpl = (struct cpl_set_tcb_rpl *)skb->data;
+	unsigned int hwtid = GET_TID(rpl);
+	struct tid_info *t = snic->lldi.tids;
+	struct cxgbi_sock *csk;
+
+	csk = lookup_tid(t, hwtid);
+	if (!csk) {
+		cxgbi_log_error("can't find connection for tid %u\n", hwtid);
+		__kfree_skb(skb);
+		return CPL_RET_UNKNOWN_TID;
+	}
+
+	spin_lock_bh(&csk->lock);
+
+	if (rpl->status != CPL_ERR_NONE) {
+		cxgbi_log_error("Unexpected SET_TCB_RPL status %u "
+				 "for tid %u\n", rpl->status, GET_TID(rpl));
+	}
+
+	__kfree_skb(skb);
+	spin_unlock_bh(&csk->lock);
+	return 0;
+}
+
+static void cxgb4i_sock_free_cpl_skbs(struct cxgbi_sock *csk)
+{
+	if (csk->cpl_close)
+		kfree_skb(csk->cpl_close);
+	if (csk->cpl_abort_req)
+		kfree_skb(csk->cpl_abort_req);
+	if (csk->cpl_abort_rpl)
+		kfree_skb(csk->cpl_abort_rpl);
+}
+
+static int cxgb4i_alloc_cpl_skbs(struct cxgbi_sock *csk)
+{
+	int wrlen;
+
+	wrlen = roundup(sizeof(struct cpl_close_con_req), 16);
+	csk->cpl_close = alloc_skb(wrlen, GFP_NOIO);
+	if (!csk->cpl_close)
+		return -ENOMEM;
+	skb_put(csk->cpl_close, wrlen);
+
+	wrlen = roundup(sizeof(struct cpl_abort_req), 16);
+	csk->cpl_abort_req = alloc_skb(wrlen, GFP_NOIO);
+	if (!csk->cpl_abort_req)
+		goto free_cpl_skbs;
+	skb_put(csk->cpl_abort_req, wrlen);
+
+	wrlen = roundup(sizeof(struct cpl_abort_rpl), 16);
+	csk->cpl_abort_rpl = alloc_skb(wrlen, GFP_NOIO);
+	if (!csk->cpl_abort_rpl)
+		goto free_cpl_skbs;
+	skb_put(csk->cpl_abort_rpl, wrlen);
+	return 0;
+free_cpl_skbs:
+	cxgb4i_sock_free_cpl_skbs(csk);
+	return -ENOMEM;
+}
+
+static void cxgb4i_sock_release_offload_resources(struct cxgbi_sock *csk)
+{
+
+	cxgb4i_sock_free_cpl_skbs(csk);
+
+	if (csk->wr_cred != csk->wr_max_cred) {
+		cxgb4i_sock_purge_wr_queue(csk);
+		cxgb4i_sock_reset_wr_list(csk);
+	}
+
+	if (csk->l2t) {
+		cxgb4_l2t_release(csk->l2t);
+		csk->l2t = NULL;
+	}
+
+	if (csk->state == CTP_CONNECTING)
+		cxgb4i_sock_free_atid(csk);
+	else {
+		struct cxgb4i_snic *snic = cxgbi_cdev_priv(csk->cdev);
+		cxgb4_remove_tid(snic->lldi.tids, 0, csk->hwtid);
+		cxgbi_sock_put(csk);
+	}
+
+	csk->dst = NULL;
+	csk->cdev = NULL;
+}
+
+static int cxgb4i_init_act_open(struct cxgbi_sock *csk,
+				struct net_device *dev)
+{
+	struct dst_entry *dst = csk->dst;
+	struct sk_buff *skb;
+	struct port_info *pi = netdev_priv(dev);
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(csk->cdev);
+	int offs;
+
+	cxgbi_conn_debug("csk 0x%p, state %u, flags 0x%lx\n",
+			csk, csk->state, csk->flags);
+
+	csk->atid = cxgb4_alloc_atid(snic->lldi.tids, csk);
+	if (csk->atid == -1) {
+		cxgbi_log_error("cannot alloc atid\n");
+		goto out_err;
+	}
+
+	csk->l2t = cxgb4_l2t_get(snic->lldi.l2t, csk->dst->neighbour, dev, 0);
+	if (!csk->l2t) {
+		cxgbi_log_error("cannot alloc l2t\n");
+		goto free_atid;
+	}
+
+	skb = alloc_skb(sizeof(struct cpl_act_open_req), GFP_NOIO);
+	if (!skb)
+		goto free_l2t;
+
+	skb->sk = (struct sock *)csk;
+	t4_set_arp_err_handler(skb, csk, cxgb4i_act_open_req_arp_failure);
+	cxgbi_sock_hold(csk);
+	offs = snic->lldi.ntxq / snic->lldi.nchan;
+	csk->txq_idx = pi->port_id * offs;
+	cxgbi_log_debug("csk->txq_idx : %d\n", csk->txq_idx);
+	offs = snic->lldi.nrxq / snic->lldi.nchan;
+	csk->rss_qid = snic->lldi.rxq_ids[pi->port_id * offs];
+	cxgbi_log_debug("csk->rss_qid : %d\n", csk->rss_qid);
+	csk->wr_max_cred = csk->wr_cred = snic->lldi.wr_cred;
+	csk->port_id = pi->port_id;
+	csk->tx_chan = cxgb4_port_chan(dev);
+	csk->smac_idx = csk->tx_chan << 1;
+	csk->wr_una_cred = 0;
+	csk->mss_idx = cxgbi_sock_select_mss(csk, dst_mtu(dst));
+	csk->err = 0;
+	cxgb4i_sock_reset_wr_list(csk);
+	cxgb4i_sock_make_act_open_req(csk, skb,
+					((csk->rss_qid << 14) |
+					 (csk->atid)), csk->l2t);
+	cxgb4_l2t_send(csk->cdev->ports[csk->port_id], skb, csk->l2t);
+	return 0;
+free_l2t:
+	cxgb4_l2t_release(csk->l2t);
+free_atid:
+	cxgb4i_sock_free_atid(csk);
+out_err:
+	return -EINVAL;;
+}
+
+static void cxgb4i_sock_rx_credits(struct cxgbi_sock *csk, int copied)
+{
+	int must_send;
+	u32 credits;
+
+	if (csk->state != CTP_ESTABLISHED)
+		return;
+
+	credits = csk->copied_seq - csk->rcv_wup;
+	if (unlikely(!credits))
+		return;
+
+	if (unlikely(cxgb4i_rx_credit_thres == 0))
+		return;
+
+	must_send = credits + 16384 >= cxgb4i_rcv_win;
+
+	if (must_send || credits >= cxgb4i_rx_credit_thres)
+		csk->rcv_wup += cxgb4i_csk_send_rx_credits(csk, credits);
+}
+
+static int cxgb4i_sock_send_pdus(struct cxgbi_sock *csk, struct sk_buff *skb)
+{
+	struct sk_buff *next;
+	int err, copied = 0;
+
+	spin_lock_bh(&csk->lock);
+
+	if (csk->state != CTP_ESTABLISHED) {
+		cxgbi_tx_debug("csk 0x%p, not in est. state %u.\n",
+			      csk, csk->state);
+		err = -EAGAIN;
+		goto out_err;
+	}
+
+	if (csk->err) {
+		cxgbi_tx_debug("csk 0x%p, err %d.\n", csk, csk->err);
+		err = -EPIPE;
+		goto out_err;
+	}
+
+	if (csk->write_seq - csk->snd_una >= cxgb4i_snd_win) {
+		cxgbi_tx_debug("csk 0x%p, snd %u - %u > %u.\n",
+				csk, csk->write_seq, csk->snd_una,
+				cxgb4i_snd_win);
+		err = -ENOBUFS;
+		goto out_err;
+	}
+
+	while (skb) {
+		int frags = skb_shinfo(skb)->nr_frags +
+				(skb->len != skb->data_len);
+
+		if (unlikely(skb_headroom(skb) < CXGB4I_TX_HEADER_LEN)) {
+			cxgbi_tx_debug("csk 0x%p, skb head.\n", csk);
+			err = -EINVAL;
+			goto out_err;
+		}
+
+		if (frags >= SKB_WR_LIST_SIZE) {
+			cxgbi_log_error("csk 0x%p, tx frags %d, len %u,%u.\n",
+					 csk, skb_shinfo(skb)->nr_frags,
+					 skb->len, skb->data_len);
+			err = -EINVAL;
+			goto out_err;
+		}
+
+		next = skb->next;
+		skb->next = NULL;
+		cxgb4i_sock_skb_entail(csk, skb, CTP_SKCBF_NO_APPEND |
+					CTP_SKCBF_NEED_HDR);
+		copied += skb->len;
+		csk->write_seq += skb->len + ulp_extra_len(skb);
+		skb = next;
+	}
+done:
+	if (likely(skb_queue_len(&csk->write_queue)))
+		cxgb4i_sock_push_tx_frames(csk, 1);
+	spin_unlock_bh(&csk->lock);
+	return copied;
+out_err:
+	if (copied == 0 && err == -EPIPE)
+		copied = csk->err ? csk->err : -EPIPE;
+	else
+		copied = err;
+	goto done;
+}
+
+static void tx_skb_setmode(struct sk_buff *skb, int hcrc, int dcrc)
+{
+	u8 submode = 0;
+
+	if (hcrc)
+		submode |= 1;
+	if (dcrc)
+		submode |= 2;
+	cxgb4i_skb_ulp_mode(skb) = (ULP_MODE_ISCSI << 4) | submode;
+}
+
+static inline __u16 get_skb_ulp_mode(struct sk_buff *skb)
+{
+	return cxgb4i_skb_ulp_mode(skb);
+}
+
+static inline __u16 get_skb_flags(struct sk_buff *skb)
+{
+	return cxgb4i_skb_flags(skb);
+}
+
+static inline __u32 get_skb_tcp_seq(struct sk_buff *skb)
+{
+	return cxgb4i_skb_tcp_seq(skb);
+}
+
+static inline __u32 get_skb_rx_pdulen(struct sk_buff *skb)
+{
+	return cxgb4i_skb_rx_pdulen(skb);
+}
+
+static cxgb4i_cplhandler_func cxgb4i_cplhandlers[NUM_CPL_CMDS] = {
+	[CPL_ACT_ESTABLISH] = cxgb4i_cpl_act_establish,
+	[CPL_ACT_OPEN_RPL] = cxgb4i_cpl_act_open_rpl,
+	[CPL_PEER_CLOSE] = cxgb4i_cpl_peer_close,
+	[CPL_ABORT_REQ_RSS] = cxgb4i_cpl_abort_req_rss,
+	[CPL_ABORT_RPL_RSS] = cxgb4i_cpl_abort_rpl_rss,
+	[CPL_CLOSE_CON_RPL] = cxgb4i_cpl_close_con_rpl,
+	[CPL_FW4_ACK] = cxgb4i_cpl_fw4_ack,
+	[CPL_ISCSI_HDR] = cxgb4i_cpl_iscsi_hdr,
+	[CPL_SET_TCB_RPL] = cxgb4i_cpl_set_tcb_rpl,
+	[CPL_RX_DATA_DDP] = cxgb4i_cpl_rx_data_ddp
+};
+
+int cxgb4i_ofld_init(struct cxgbi_device *cdev)
+{
+	struct cxgb4i_snic *snic = cxgbi_cdev_priv(cdev);
+	struct cxgbi_ports_map *ports;
+	int mapsize;
+
+	if (cxgb4i_max_connect > CXGB4I_MAX_CONN)
+		cxgb4i_max_connect = CXGB4I_MAX_CONN;
+
+	mapsize = (cxgb4i_max_connect * sizeof(struct cxgbi_sock));
+	ports = cxgbi_alloc_big_mem(sizeof(*ports) + mapsize, GFP_KERNEL);
+	if (!ports)
+		return -ENOMEM;
+
+	spin_lock_init(&ports->lock);
+	cdev->pmap = ports;
+	cdev->pmap->max_connect = cxgb4i_max_connect;
+	cdev->pmap->sport_base = cxgb4i_sport_base;
+	cdev->set_skb_txmode = tx_skb_setmode;
+	cdev->get_skb_ulp_mode = get_skb_ulp_mode;
+	cdev->get_skb_flags = get_skb_flags;
+	cdev->get_skb_tcp_seq = get_skb_tcp_seq;
+	cdev->get_skb_rx_pdulen = get_skb_rx_pdulen;
+	cdev->release_offload_resources = cxgb4i_sock_release_offload_resources;
+	cdev->sock_send_pdus = cxgb4i_sock_send_pdus;
+	cdev->send_abort_req = cxgb4i_sock_send_abort_req;
+	cdev->send_close_req = cxgb4i_sock_send_close_req;
+	cdev->sock_rx_credits = cxgb4i_sock_rx_credits;
+	cdev->alloc_cpl_skbs = cxgb4i_alloc_cpl_skbs;
+	cdev->init_act_open = cxgb4i_init_act_open;
+	snic->handlers = cxgb4i_cplhandlers;
+	return 0;
+}
+
+void cxgb4i_ofld_cleanup(struct cxgbi_device *cdev)
+{
+	struct cxgbi_sock *csk;
+	int i;
+
+	for (i = 0; i < cdev->pmap->max_connect; i++) {
+		if (cdev->pmap->port_csk[i]) {
+			csk = cdev->pmap->port_csk[i];
+			cdev->pmap->port_csk[i] = NULL;
+
+			cxgbi_sock_hold(csk);
+			spin_lock_bh(&csk->lock);
+			cxgbi_sock_closed(csk);
+			spin_unlock_bh(&csk->lock);
+			cxgbi_sock_put(csk);
+		}
+	}
+	cxgbi_free_big_mem(cdev->pmap);
+}
-- 
1.6.6.1

-- 
You received this message because you are subscribed to the Google Groups "open-iscsi" group.
To post to this group, send email to open-iscsi-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
To unsubscribe from this group, send email to open-iscsi+unsubscribe-/JYPxA39Uh5TLH3MbocFF+G/Ez6ZCGd0@public.gmane.org
For more options, visit this group at http://groups.google.com/group/open-iscsi?hl=en.

^ permalink raw reply related

* [PATCH nf-next-2.6] xt_rateest: Better struct xt_rateest layout
From: Eric Dumazet @ 2010-06-07 20:17 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: netdev, Netfilter Development Mailinglist

We currently dirty two cache lines in struct xt_rateest, this hurts SMP
performance.

This patch moves lock/bstats/rstats at beginning of structure so that
they share a single cache line.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/netfilter/xt_rateest.h |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/include/net/netfilter/xt_rateest.h b/include/net/netfilter/xt_rateest.h
index ddbf37e..b1d780e 100644
--- a/include/net/netfilter/xt_rateest.h
+++ b/include/net/netfilter/xt_rateest.h
@@ -2,13 +2,17 @@
 #define _XT_RATEEST_H
 
 struct xt_rateest {
+	/* keep lock and bstats on same cache line to speedup xt_rateest_tg() */
+	struct gnet_stats_basic_packed	bstats;
+	spinlock_t			lock;
+	/* keep rstats and lock on same cache line to speedup xt_rateest_mt() */
+	struct gnet_stats_rate_est	rstats;
+
+	/* following fields not accessed in hot path */
 	struct hlist_node		list;
 	char				name[IFNAMSIZ];
 	unsigned int			refcnt;
-	spinlock_t			lock;
 	struct gnet_estimator		params;
-	struct gnet_stats_rate_est	rstats;
-	struct gnet_stats_basic_packed	bstats;
 };
 
 extern struct xt_rateest *xt_rateest_lookup(const char *name);



^ permalink raw reply related

* Re: [PATCH] netconsole: queue console messages to send later
From: Matt Mackall @ 2010-06-07 20:21 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Flavio Leitner, netdev, David Miller, Cong Wang, Jay Vosburgh,
	Flavio Leitner, Andy Gospodarek, Neil Horman, Jeff Moyer, lkml,
	bridge, bonding-devel
In-Reply-To: <20100607130015.15555744@nehalam>

On Mon, 2010-06-07 at 13:00 -0700, Stephen Hemminger wrote:
> On Mon, 07 Jun 2010 14:50:48 -0500
> Matt Mackall <mpm@selenic.com> wrote:
> 
> > On Mon, 2010-06-07 at 16:24 -0300, Flavio Leitner wrote:
> > > There are some networking drivers that hold a lock in the
> > > transmit path. Therefore, if a console message is printed
> > > after that, netconsole will push it through the transmit path,
> > > resulting in a deadlock.
> > 
> > This is an ongoing pain we've known about since before introducing the
> > netpoll code to the tree.
> > 
> > My take has always been that any form of queueing is contrary to the
> > goal of netpoll: timely delivery of messages even during machine-killing
> > situations like oopses. There may never be a second chance to deliver
> > the message as the machine may be locked solid. And there may be no
> > other way to get the message out of the box in such situations. Adding
> > queueing is a throwing-the-baby-out-with-the-bathwater fix.
> > 
> > I think Dave agrees with me here, and I believe he's said in the past
> > that drivers trying to print messages in such contexts should be
> > considered buggy.
> > 
> 
> Because it to hard to fix all possible device configurations.
> There should be any way to detect recursion and just drop the message to
> avoid deadlock.

Open to suggestions. The locks in question are driver-internal. There
also may not be any actual recursion taking place:

driver path a takes private lock x
driver path a attempts printk
printk calls into netconsole
netconsole calls into driver path b
driver path b attempts to take lock x -> deadlock

So we can't even try to walk back the stack looking for such nonsense.
Though we could perhaps force queuing of all messages -from- the driver
bound to netconsole. Tricky, and not quite foolproof.

-- 
Mathematics is the supreme nostalgia of our time.

^ permalink raw reply

* Re: [Linux-ATM-General] RX/close vcc race with solos/atmtcp/usbatm/he
From: David Woodhouse @ 2010-06-07 20:49 UTC (permalink / raw)
  To: chas3; +Cc: linux-atm-general, netdev
In-Reply-To: <201006071637.o57GbqWd002514@thirdoffive.cmf.nrl.navy.mil>

On Mon, 2010-06-07 at 12:37 -0400, Chas Williams (CONTRACTOR) wrote:
> i dont understand.  if you do a sock_hold() in find_vcc(), and then call
> vcc->push() you should be able to call vcc->push() and then sock_put(). 

Holding the reference doesn't stop the problem. The problem is

 vcc_release()
 --> vcc_destroy_socket()
   --> br2684_push(vcc, NULL)
         sets vcc->user_back = NULL
         (which it what causes the oops when try try to feed it any
          subsequent packets).

 Only _later_ does vcc_release() call sock_put().

It doesn't _matter_ that the tasklet is holding a reference on the
socket, because it's not the sk_free() which is causing the problem. 

Just making dev->ops->close() wait for the tasklet is perfectly
sufficient. That call happens from vcc_destroy_socket() before the call
to br2684_push(), and all is well.

-- 
dwmw2


^ permalink raw reply

* RFS seems to have incompatiblities with bridged vlans
From: Peter Lieven @ 2010-06-07 20:36 UTC (permalink / raw)
  To: davem, netdev

Hi,

i today tried out 2.6.32-rc2 and I see a lot of warning messages like this:

Jun  7 22:33:15 172.21.55.20 kernel: [ 3012.575884] br141 received packet on queue 4, but number of RX queues is 1

The host is a SMP system with 8 cores, so I think there is expected to be one rx queue per CPU, but it seems
the bridge iface has only one.

The br141 config looks like this:

#brctl show br141
bridge name	bridge id		STP enabled	interfaces
br141		8000.001018593cdc	no		eth2.141
							tap0
							tap2

The bridge is basically a bridge between a 802.1q vlan at eth2 and some tap Interfaces bound to
virtual machines on the system.

Peter.

^ permalink raw reply

* [PATCH net-next-2.6] anycast: Some RCU conversions
From: Eric Dumazet @ 2010-06-07 21:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Hideaki YOSHIFUJI

- dev_get_by_flags() changed to dev_get_by_flags_rcu()

- ipv6_sock_ac_join() dont touch dev & idev refcounts
- ipv6_sock_ac_drop() dont touch dev & idev refcounts
- ipv6_sock_ac_close() dont touch dev & idev refcounts
- ipv6_dev_ac_dec() dount touch idev refcount
- ipv6_chk_acast_addr() dont touch idev refcount

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
---
 include/linux/netdevice.h |    4 -
 net/core/dev.c            |   14 ++---
 net/ipv6/anycast.c        |   90 ++++++++++++++++--------------------
 3 files changed, 49 insertions(+), 59 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5156b80..c319f28 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1271,8 +1271,8 @@ extern void		dev_add_pack(struct packet_type *pt);
 extern void		dev_remove_pack(struct packet_type *pt);
 extern void		__dev_remove_pack(struct packet_type *pt);
 
-extern struct net_device	*dev_get_by_flags(struct net *net, unsigned short flags,
-						  unsigned short mask);
+extern struct net_device	*dev_get_by_flags_rcu(struct net *net, unsigned short flags,
+						      unsigned short mask);
 extern struct net_device	*dev_get_by_name(struct net *net, const char *name);
 extern struct net_device	*dev_get_by_name_rcu(struct net *net, const char *name);
 extern struct net_device	*__dev_get_by_name(struct net *net, const char *name);
diff --git a/net/core/dev.c b/net/core/dev.c
index b65347c..d41054d 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -803,35 +803,31 @@ struct net_device *dev_getfirstbyhwtype(struct net *net, unsigned short type)
 EXPORT_SYMBOL(dev_getfirstbyhwtype);
 
 /**
- *	dev_get_by_flags - find any device with given flags
+ *	dev_get_by_flags_rcu - find any device with given flags
  *	@net: the applicable net namespace
  *	@if_flags: IFF_* values
  *	@mask: bitmask of bits in if_flags to check
  *
  *	Search for any interface with the given flags. Returns NULL if a device
- *	is not found or a pointer to the device. The device returned has
- *	had a reference added and the pointer is safe until the user calls
- *	dev_put to indicate they have finished with it.
+ *	is not found or a pointer to the device. Must be called inside
+ *	rcu_read_lock(), and result refcount is unchanged.
  */
 
-struct net_device *dev_get_by_flags(struct net *net, unsigned short if_flags,
+struct net_device *dev_get_by_flags_rcu(struct net *net, unsigned short if_flags,
 				    unsigned short mask)
 {
 	struct net_device *dev, *ret;
 
 	ret = NULL;
-	rcu_read_lock();
 	for_each_netdev_rcu(net, dev) {
 		if (((dev->flags ^ if_flags) & mask) == 0) {
-			dev_hold(dev);
 			ret = dev;
 			break;
 		}
 	}
-	rcu_read_unlock();
 	return ret;
 }
-EXPORT_SYMBOL(dev_get_by_flags);
+EXPORT_SYMBOL(dev_get_by_flags_rcu);
 
 /**
  *	dev_valid_name - check if name is okay for network device
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index b5b0705..f058fbd 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -77,41 +77,40 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex, struct in6_addr *addr)
 	pac->acl_next = NULL;
 	ipv6_addr_copy(&pac->acl_addr, addr);
 
+	rcu_read_lock();
 	if (ifindex == 0) {
 		struct rt6_info *rt;
 
 		rt = rt6_lookup(net, addr, NULL, 0, 0);
 		if (rt) {
 			dev = rt->rt6i_dev;
-			dev_hold(dev);
 			dst_release(&rt->u.dst);
 		} else if (ishost) {
 			err = -EADDRNOTAVAIL;
-			goto out_free_pac;
+			goto error;
 		} else {
 			/* router, no matching interface: just pick one */
-
-			dev = dev_get_by_flags(net, IFF_UP, IFF_UP|IFF_LOOPBACK);
+			dev = dev_get_by_flags_rcu(net, IFF_UP,
+						   IFF_UP | IFF_LOOPBACK);
 		}
 	} else
-		dev = dev_get_by_index(net, ifindex);
+		dev = dev_get_by_index_rcu(net, ifindex);
 
 	if (dev == NULL) {
 		err = -ENODEV;
-		goto out_free_pac;
+		goto error;
 	}
 
-	idev = in6_dev_get(dev);
+	idev = __in6_dev_get(dev);
 	if (!idev) {
 		if (ifindex)
 			err = -ENODEV;
 		else
 			err = -EADDRNOTAVAIL;
-		goto out_dev_put;
+		goto error;
 	}
 	/* reset ishost, now that we have a specific device */
 	ishost = !idev->cnf.forwarding;
-	in6_dev_put(idev);
 
 	pac->acl_ifindex = dev->ifindex;
 
@@ -124,26 +123,22 @@ int ipv6_sock_ac_join(struct sock *sk, int ifindex, struct in6_addr *addr)
 		if (ishost)
 			err = -EADDRNOTAVAIL;
 		if (err)
-			goto out_dev_put;
+			goto error;
 	}
 
 	err = ipv6_dev_ac_inc(dev, addr);
-	if (err)
-		goto out_dev_put;
-
-	write_lock_bh(&ipv6_sk_ac_lock);
-	pac->acl_next = np->ipv6_ac_list;
-	np->ipv6_ac_list = pac;
-	write_unlock_bh(&ipv6_sk_ac_lock);
-
-	dev_put(dev);
-
-	return 0;
+	if (!err) {
+		write_lock_bh(&ipv6_sk_ac_lock);
+		pac->acl_next = np->ipv6_ac_list;
+		np->ipv6_ac_list = pac;
+		write_unlock_bh(&ipv6_sk_ac_lock);
+		pac = NULL;
+	}
 
-out_dev_put:
-	dev_put(dev);
-out_free_pac:
-	sock_kfree_s(sk, pac, sizeof(*pac));
+error:
+	rcu_read_unlock();
+	if (pac)
+		sock_kfree_s(sk, pac, sizeof(*pac));
 	return err;
 }
 
@@ -176,11 +171,12 @@ int ipv6_sock_ac_drop(struct sock *sk, int ifindex, struct in6_addr *addr)
 
 	write_unlock_bh(&ipv6_sk_ac_lock);
 
-	dev = dev_get_by_index(net, pac->acl_ifindex);
-	if (dev) {
+	rcu_read_lock();
+	dev = dev_get_by_index_rcu(net, pac->acl_ifindex);
+	if (dev)
 		ipv6_dev_ac_dec(dev, &pac->acl_addr);
-		dev_put(dev);
-	}
+	rcu_read_unlock();
+
 	sock_kfree_s(sk, pac, sizeof(*pac));
 	return 0;
 }
@@ -199,13 +195,12 @@ void ipv6_sock_ac_close(struct sock *sk)
 	write_unlock_bh(&ipv6_sk_ac_lock);
 
 	prev_index = 0;
+	rcu_read_lock();
 	while (pac) {
 		struct ipv6_ac_socklist *next = pac->acl_next;
 
 		if (pac->acl_ifindex != prev_index) {
-			if (dev)
-				dev_put(dev);
-			dev = dev_get_by_index(net, pac->acl_ifindex);
+			dev = dev_get_by_index_rcu(net, pac->acl_ifindex);
 			prev_index = pac->acl_ifindex;
 		}
 		if (dev)
@@ -213,8 +208,7 @@ void ipv6_sock_ac_close(struct sock *sk)
 		sock_kfree_s(sk, pac, sizeof(*pac));
 		pac = next;
 	}
-	if (dev)
-		dev_put(dev);
+	rcu_read_unlock();
 }
 
 #if 0
@@ -363,33 +357,32 @@ int __ipv6_dev_ac_dec(struct inet6_dev *idev, struct in6_addr *addr)
 	return 0;
 }
 
+/* called with rcu_read_lock() */
 static int ipv6_dev_ac_dec(struct net_device *dev, struct in6_addr *addr)
 {
-	int ret;
-	struct inet6_dev *idev = in6_dev_get(dev);
+	struct inet6_dev *idev = __in6_dev_get(dev);
+
 	if (idev == NULL)
 		return -ENODEV;
-	ret = __ipv6_dev_ac_dec(idev, addr);
-	in6_dev_put(idev);
-	return ret;
+	return __ipv6_dev_ac_dec(idev, addr);
 }
 
 /*
  *	check if the interface has this anycast address
+ *	called with rcu_read_lock()
  */
 static int ipv6_chk_acast_dev(struct net_device *dev, struct in6_addr *addr)
 {
 	struct inet6_dev *idev;
 	struct ifacaddr6 *aca;
 
-	idev = in6_dev_get(dev);
+	idev = __in6_dev_get(dev);
 	if (idev) {
 		read_lock_bh(&idev->lock);
 		for (aca = idev->ac_list; aca; aca = aca->aca_next)
 			if (ipv6_addr_equal(&aca->aca_addr, addr))
 				break;
 		read_unlock_bh(&idev->lock);
-		in6_dev_put(idev);
 		return aca != NULL;
 	}
 	return 0;
@@ -403,14 +396,15 @@ int ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 {
 	int found = 0;
 
-	if (dev)
-		return ipv6_chk_acast_dev(dev, addr);
 	rcu_read_lock();
-	for_each_netdev_rcu(net, dev)
-		if (ipv6_chk_acast_dev(dev, addr)) {
-			found = 1;
-			break;
-		}
+	if (dev)
+		found = ipv6_chk_acast_dev(dev, addr);
+	else
+		for_each_netdev_rcu(net, dev)
+			if (ipv6_chk_acast_dev(dev, addr)) {
+				found = 1;
+				break;
+			}
 	rcu_read_unlock();
 	return found;
 }



^ permalink raw reply related

* Re: [PATCH] r8169: fix random mdio_write failures
From: Francois Romieu @ 2010-06-07 21:51 UTC (permalink / raw)
  To: hayeswang; +Cc: 'Timo Teräs', netdev, davem
In-Reply-To: <4C31B2E3BE2B481DB749C669DFE23645@realtek.com.tw>

hayeswang <hayeswang@realtek.com> :
> Our hardware engineer suggests that check the completed indication
> per 100 micro seconds. And it needs 20 micro seconds delay after the 
> completed indication for the next command.

Should we do the same for mdio_read as well (100 us per iteration + 
an extra 20 us) ?

-- 
Ueimor

^ permalink raw reply

* Re: RFS seems to have incompatiblities with bridged vlans
From: Stephen Hemminger @ 2010-06-07 21:59 UTC (permalink / raw)
  To: Peter Lieven; +Cc: davem, netdev
In-Reply-To: <62C40338-045A-417E-9B90-E59A320E1343@dlh.net>

On Mon, 7 Jun 2010 22:36:11 +0200
Peter Lieven <pl@dlh.net> wrote:

> Hi,
> 
> i today tried out 2.6.32-rc2 and I see a lot of warning messages like this:
> 
> Jun  7 22:33:15 172.21.55.20 kernel: [ 3012.575884] br141 received packet on queue 4, but number of RX queues is 1
> 
> The host is a SMP system with 8 cores, so I think there is expected to be one rx queue per CPU, but it seems
> the bridge iface has only one.
> 

The bridge interface has no queues. It doesn't queue any packets. 
The test in receive packet path is not appropriate in this case.
Not sure what the right fix is. Pretending the bridge device has
multiple queues (num_queues == NUM_CPUS) is a possibility but
seems like overhead without real increase in parallelism.




^ permalink raw reply

* Re: RFS seems to have incompatiblities with bridged vlans
From: Eric Dumazet @ 2010-06-07 22:19 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Peter Lieven, davem, netdev
In-Reply-To: <20100607145910.2458ac87@nehalam>

Le lundi 07 juin 2010 à 14:59 -0700, Stephen Hemminger a écrit :
> On Mon, 7 Jun 2010 22:36:11 +0200
> Peter Lieven <pl@dlh.net> wrote:
> 
> > Hi,
> > 
> > i today tried out 2.6.32-rc2 and I see a lot of warning messages like this:
> > 
> > Jun  7 22:33:15 172.21.55.20 kernel: [ 3012.575884] br141 received packet on queue 4, but number of RX queues is 1
> > 
> > The host is a SMP system with 8 cores, so I think there is expected to be one rx queue per CPU, but it seems
> > the bridge iface has only one.
> > 
> 
> The bridge interface has no queues. It doesn't queue any packets. 
> The test in receive packet path is not appropriate in this case.
> Not sure what the right fix is. Pretending the bridge device has
> multiple queues (num_queues == NUM_CPUS) is a possibility but
> seems like overhead without real increase in parallelism.

We have same problem with bonding and multiqueue devices...

It does make sense to pretend we have several queues, in case we have
another stage like

eth0	+
  	+-> bond0  -> vlan.1000
eth1	+

because in this case, vlan.1000 would also be multiqueue, so tx will be
parallelized on several queues.




^ permalink raw reply

* Re: RFS seems to have incompatiblities with bridged vlans
From: John Fastabend @ 2010-06-07 22:30 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Peter Lieven, davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <20100607145910.2458ac87@nehalam>

Stephen Hemminger wrote:
> On Mon, 7 Jun 2010 22:36:11 +0200
> Peter Lieven <pl@dlh.net> wrote:
> 
>> Hi,
>>
>> i today tried out 2.6.32-rc2 and I see a lot of warning messages like this:
>>
>> Jun  7 22:33:15 172.21.55.20 kernel: [ 3012.575884] br141 received packet on queue 4, but number of RX queues is 1
>>
>> The host is a SMP system with 8 cores, so I think there is expected to be one rx queue per CPU, but it seems
>> the bridge iface has only one.
>>
> 
> The bridge interface has no queues. It doesn't queue any packets. 
> The test in receive packet path is not appropriate in this case.
> Not sure what the right fix is. Pretending the bridge device has
> multiple queues (num_queues == NUM_CPUS) is a possibility but
> seems like overhead without real increase in parallelism.
> 
> 
> 

There is always a possibility that the underlying device sets the 
queue_mapping to be greater then num_cpus.  Also I suspect the same 
issue exists with bonding devices.  Maybe something like the following 
is worth while? compile tested only,

[PATCH] 8021q: vlan reassigns dev without check queue_mapping

recv path reassigns skb->dev without sanity checking the
queue_mapping field.  This can result in the queue_mapping
field being set incorrectly if the new dev supports less
queues then the underlying device.

This patch just resets the queue_mapping to 0 which should
resolve this issue?  Any thoughts?

The same issue could happen on bonding devices as well.

Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
---

  net/8021q/vlan_core.c |    6 ++++++
  1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index bd537fc..ad309f8 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -21,6 +21,9 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct 
vlan_group *grp,
  	if (!skb->dev)
  		goto drop;

+	if (unlikely(skb->queue_mapping >= skb->dev->real_num_tx_queues))
+		skb_set_queue_mapping(skb, 0);
+
  	return (polling ? netif_receive_skb(skb) : netif_rx(skb));

  drop:
@@ -93,6 +96,9 @@ vlan_gro_common(struct napi_struct *napi, struct 
vlan_group *grp,
  	if (!skb->dev)
  		goto drop;

+	if (unlikely(skb->queue_mapping >= skb->dev->real_num_tx_queues))
+		skb_set_queue_mapping(skb, 0);
+
  	for (p = napi->gro_list; p; p = p->next) {
  		NAPI_GRO_CB(p)->same_flow =
  			p->dev == skb->dev && !compare_ether_header(

^ permalink raw reply related

* Re: RFS seems to have incompatiblities with bridged vlans
From: John Fastabend @ 2010-06-07 23:13 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Peter Lieven, davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <4C0D7312.1020300@intel.com>

John Fastabend wrote:
> Stephen Hemminger wrote:
>> On Mon, 7 Jun 2010 22:36:11 +0200
>> Peter Lieven <pl@dlh.net> wrote:
>>
>>> Hi,
>>>
>>> i today tried out 2.6.32-rc2 and I see a lot of warning messages like this:
>>>
>>> Jun  7 22:33:15 172.21.55.20 kernel: [ 3012.575884] br141 received packet on queue 4, but number of RX queues is 1
>>>
>>> The host is a SMP system with 8 cores, so I think there is expected to be one rx queue per CPU, but it seems
>>> the bridge iface has only one.
>>>
>> The bridge interface has no queues. It doesn't queue any packets. 
>> The test in receive packet path is not appropriate in this case.
>> Not sure what the right fix is. Pretending the bridge device has
>> multiple queues (num_queues == NUM_CPUS) is a possibility but
>> seems like overhead without real increase in parallelism.
>>
>>
>>
> 
> There is always a possibility that the underlying device sets the 
> queue_mapping to be greater then num_cpus.  Also I suspect the same 
> issue exists with bonding devices.  Maybe something like the following 
> is worth while? compile tested only,
> 
> [PATCH] 8021q: vlan reassigns dev without check queue_mapping
> 
> recv path reassigns skb->dev without sanity checking the
> queue_mapping field.  This can result in the queue_mapping
> field being set incorrectly if the new dev supports less
> queues then the underlying device.
> 
> This patch just resets the queue_mapping to 0 which should
> resolve this issue?  Any thoughts?
> 
> The same issue could happen on bonding devices as well.
> 
> Signed-off-by: John Fastabend <john.r.fastabend@intel.com>
> ---
> 
>   net/8021q/vlan_core.c |    6 ++++++
>   1 files changed, 6 insertions(+), 0 deletions(-)
> 
> diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
> index bd537fc..ad309f8 100644
> --- a/net/8021q/vlan_core.c
> +++ b/net/8021q/vlan_core.c
> @@ -21,6 +21,9 @@ int __vlan_hwaccel_rx(struct sk_buff *skb, struct 
> vlan_group *grp,
>   	if (!skb->dev)
>   		goto drop;
> 
> +	if (unlikely(skb->queue_mapping >= skb->dev->real_num_tx_queues))
> +		skb_set_queue_mapping(skb, 0);
> +

Actually this should be dev->num_rx_queues not real_num_tx_queues.

>   	return (polling ? netif_receive_skb(skb) : netif_rx(skb));
> 
>   drop:
> @@ -93,6 +96,9 @@ vlan_gro_common(struct napi_struct *napi, struct 
> vlan_group *grp,
>   	if (!skb->dev)
>   		goto drop;
> 
> +	if (unlikely(skb->queue_mapping >= skb->dev->real_num_tx_queues))
> +		skb_set_queue_mapping(skb, 0);
> +

same here.

>   	for (p = napi->gro_list; p; p = p->next) {
>   		NAPI_GRO_CB(p)->same_flow =
>   			p->dev == skb->dev && !compare_ether_header(
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* Re: [PATCH net-next 00/11] tg3: Bugfixes and 5719 support
From: David Miller @ 2010-06-07 23:44 UTC (permalink / raw)
  To: mcarlson; +Cc: netdev, andy
In-Reply-To: <20100607184038.GA12316@mcarlson.broadcom.com>

From: "Matt Carlson" <mcarlson@broadcom.com>
Date: Mon, 7 Jun 2010 11:40:38 -0700

> Our hardware specs are available.  You can find them at:
> 
> http://www.broadcom.com/support/ethernet_nic/open_source.php?source=top
> 
> The specs for the latest asic revs won't be there yet, but I'm sure
> they'll get there.
> 
> In light of this, do you still feel we need to take this stance with
> the tg3 driver?

Nope, not with those documents available, that changes everything.

Thanks!

^ permalink raw reply

* Re: [PATCH] netconsole: queue console messages to send later
From: David Miller @ 2010-06-07 23:50 UTC (permalink / raw)
  To: fleitner
  Cc: netdev, amwang, fubar, fbl, mpm, gospo, nhorman, jmoyer,
	shemminger, linux-kernel, bridge, bonding-devel
In-Reply-To: <1275938692-26997-1-git-send-email-fleitner@redhat.com>

From: Flavio Leitner <fleitner@redhat.com>
Date: Mon,  7 Jun 2010 16:24:52 -0300

> There are some networking drivers that hold a lock in the
> transmit path. Therefore, if a console message is printed
> after that, netconsole will push it through the transmit path,
> resulting in a deadlock.
> 
> This patch fixes the re-injection problem by queuing the console
> messages in a preallocated circular buffer and then scheduling a
> workqueue to send them later with another context.
> 
> Signed-off-by: Flavio Leitner <fleitner@redhat.com>

You absolutely and positively MUST NOT do this.  Otherwise netconsole
becomes completely useless.  Your idea has been proposed several times
as far back as 6 years ago, it was unacceptable then and it's
unacceptable now.

The whole point of netconsole is that we may be deep in an interrupt
or other atomic context, the machine is about to hard hang, and it's
absolutely essential that we get out any and all kernel logging
messages that we can, immediately.

There may not be another timer or workqueue able to execute after the
printk() we're trying to emit.  We may never get to that point.

So if we defer messages, that means we won't get the message and we
won't be able to debug the problem.

Fix the locking in the drivers or layers that cause the issue instead
of breaking netconsole.

^ permalink raw reply

* Re: [PATCH] netconsole: queue console messages to send later
From: David Miller @ 2010-06-07 23:52 UTC (permalink / raw)
  To: mpm
  Cc: shemminger, fleitner, netdev, amwang, fubar, fbl, gospo, nhorman,
	jmoyer, linux-kernel, bridge, bonding-devel
In-Reply-To: <1275942091.26597.85.camel@calx>

From: Matt Mackall <mpm@selenic.com>
Date: Mon, 07 Jun 2010 15:21:31 -0500

> Open to suggestions. The locks in question are driver-internal. There
> also may not be any actual recursion taking place:
> 
> driver path a takes private lock x
> driver path a attempts printk
> printk calls into netconsole
> netconsole calls into driver path b
> driver path b attempts to take lock x -> deadlock
> 
> So we can't even try to walk back the stack looking for such nonsense.
> Though we could perhaps force queuing of all messages -from- the driver
> bound to netconsole. Tricky, and not quite foolproof.

Look, this is all nonsense talk.

This is only coming about because of the recent discussions about
bonding, so let's fix bonding's locking.  I've made concrete
suggestions on converting it's rwlocks over to spinlocks and RCU to
fix the specific problem bonding has.

Every time we hit some new locking issue the knee jerk reaction is
to do something stupid to the generic netconsole code instead of
fixing the real source of the problem.

^ permalink raw reply

* Re: 2.6.35-rc2-git1 - net/mac80211/sta_info.c:125 invoked rcu_dereference_check() without protection!
From: Paul E. McKenney @ 2010-06-07 23:59 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <AANLkTintKf_9C1sBsUac6Prvb1nWn7nWszatJ8yUb-VM@mail.gmail.com>

On Mon, Jun 07, 2010 at 02:25:44PM -0400, Miles Lane wrote:
> [   43.478812] [ INFO: suspicious rcu_dereference_check() usage. ]
> [   43.478815] ---------------------------------------------------
> [   43.478820] net/mac80211/sta_info.c:125 invoked
> rcu_dereference_check() without protection!
> [   43.478824]
> [   43.478824] other info that might help us debug this:
> [   43.478826]
> [   43.478829]
> [   43.478830] rcu_scheduler_active = 1, debug_locks = 1
> [   43.478834] no locks held by NetworkManager/4017.

Hmmm...  Johannes's update has been merged, and it requires that callers
either be in an RCU read-side critical section or hold either the
->sta_lock or the ->sta_mtx, and this thread does none of this.

Johannes, any thoughts?

							Thanx, Paul

> [   43.478837] stack backtrace:
> [   43.478842] Pid: 4017, comm: NetworkManager Not tainted 2.6.35-rc2-git1 #8
> [   43.478846] Call Trace:
> [   43.478849]  <IRQ>  [<ffffffff81064e9c>] lockdep_rcu_dereference+0x9d/0xa5
> [   43.478876]  [<ffffffffa010cb3c>] sta_info_get_bss+0x71/0x12d [mac80211]
> [   43.478889]  [<ffffffffa010cc0d>] ieee80211_find_sta+0x15/0x2f [mac80211]
> [   43.478902]  [<ffffffffa019ae16>] iwlagn_tx_queue_reclaim+0xe7/0x1bb [iwlagn]
> [   43.478909]  [<ffffffff8106530b>] ? mark_lock+0x2d/0x262
> [   43.478920]  [<ffffffffa019e270>] iwlagn_rx_reply_tx+0x4cd/0x58a [iwlagn]
> [   43.478928]  [<ffffffff811ca313>] ? is_swiotlb_buffer+0x2e/0x3b
> [   43.478937]  [<ffffffffa0192ec2>] iwl_rx_handle+0x161/0x2bf [iwlagn]
> [   43.478946]  [<ffffffffa019370a>] iwl_irq_tasklet+0x2eb/0x408 [iwlagn]
> [   43.478953]  [<ffffffff8103f892>] tasklet_action+0xa7/0x10f
> [   43.478960]  [<ffffffff810404cf>] __do_softirq+0x148/0x25a
> [   43.478966]  [<ffffffff8100314c>] call_softirq+0x1c/0x28
> [   43.478972]  [<ffffffff81004d20>] do_softirq+0x38/0x80
> [   43.478977]  [<ffffffff8103ffd0>] irq_exit+0x45/0x94
> [   43.478983]  [<ffffffff81004465>] do_IRQ+0xad/0xc4
> [   43.478989]  [<ffffffff813ca3d3>] ret_from_intr+0x0/0xf
> [   43.478993]  <EOI>

^ permalink raw reply

* Re: 2.6.35-rc2-git1 - lib/idr.c:605 invoked rcu_dereference_check() without protection!
From: Paul E. McKenney @ 2010-06-08  0:12 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, David Woodhouse, Lai Jiangshan,
	Ingo Molnar, Peter Zijlstra, LKML, nauman, eric.dumazet, netdev,
	Jens Axboe, Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <AANLkTim06fRB-Kvm12v2umq8MxRxbcl3gdbNsiRTDmia@mail.gmail.com>

On Mon, Jun 07, 2010 at 02:23:17PM -0400, Miles Lane wrote:
> [    2.677955] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    2.679089] ---------------------------------------------------
> [    2.680276] lib/idr.c:605 invoked rcu_dereference_check() without protection!
> [    2.681499]
> [    2.681500] other info that might help us debug this:
> [    2.681501]
> [    2.685509]
> [    2.685510] rcu_scheduler_active = 1, debug_locks = 1
> [    2.688221] 1 lock held by swapper/1:
> [    2.689587]  #0:  (mtd_table_mutex){+.+...}, at:
> [<ffffffff812bea45>] register_mtd_user+0x1a/0x69
> [    2.691096]
> [    2.691098] stack backtrace:
> [    2.694059] Pid: 1, comm: swapper Not tainted 2.6.35-rc2-git1 #8
> [    2.695601] Call Trace:
> [    2.697243]  [<ffffffff81064e9c>] lockdep_rcu_dereference+0x9d/0xa5
> [    2.698868]  [<ffffffff811b9c86>] idr_get_next+0x60/0x124
> [    2.700556]  [<ffffffff812be779>] __mtd_next_device+0x1b/0x1d
> [    2.702238]  [<ffffffff812bea7c>] register_mtd_user+0x51/0x69
> [    2.703964]  [<ffffffff816cca45>] init_mtdchar+0xb3/0xd3
> [    2.705686]  [<ffffffff816cc992>] ? init_mtdchar+0x0/0xd3
> [    2.707470]  [<ffffffff810001ef>] do_one_initcall+0x59/0x14e
> [    2.709255]  [<ffffffff816a768a>] kernel_init+0x144/0x1ce
> [    2.711082]  [<ffffffff81003054>] kernel_thread_helper+0x4/0x10
> [    2.712862]  [<ffffffff813ca480>] ? restore_args+0x0/0x30
> [    2.714647]  [<ffffffff816a7546>] ? kernel_init+0x0/0x1ce
> [    2.716415]  [<ffffffff81003050>] ? kernel_thread_helper+0x0/0x10

This looks like a new one!  Does the following patch take care of it?

							Thanx, Paul

------------------------------------------------------------------------

commit 2d54a6c31b72c902b09d365e9c66205a5c07e549
Author: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Date:   Mon Jun 7 17:09:45 2010 -0700

    idr: fix RCU lockdep splat in idr_get_next()
    
    Convert to rcu_dereference_raw() given that many callers may have many
    different locking models.
    
    Located-by: Miles Lane <miles.lane@gmail.com>
    Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>

diff --git a/lib/idr.c b/lib/idr.c
index 2eb1dca..f099f25 100644
--- a/lib/idr.c
+++ b/lib/idr.c
@@ -599,7 +599,7 @@ void *idr_get_next(struct idr *idp, int *nextidp)
 	/* find first ent */
 	n = idp->layers * IDR_BITS;
 	max = 1 << n;
-	p = rcu_dereference(idp->top);
+	p = rcu_dereference_raw(idp->top);
 	if (!p)
 		return NULL;
 
@@ -607,7 +607,7 @@ void *idr_get_next(struct idr *idp, int *nextidp)
 		while (n > 0 && p) {
 			n -= IDR_BITS;
 			*paa++ = p;
-			p = rcu_dereference(p->ary[(id >> n) & IDR_MASK]);
+			p = rcu_dereference_raw(p->ary[(id >> n) & IDR_MASK]);
 		}
 
 		if (p) {

^ permalink raw reply related

* Re: 2.6.35-rc2-git1 - include/linux/cgroup.h:534 invoked rcu_dereference_check() without protection!
From: Paul E. McKenney @ 2010-06-08  0:19 UTC (permalink / raw)
  To: Miles Lane
  Cc: Vivek Goyal, Eric Paris, Lai Jiangshan, Ingo Molnar,
	Peter Zijlstra, LKML, nauman, eric.dumazet, netdev, Jens Axboe,
	Gui Jianfeng, Li Zefan, Johannes Berg
In-Reply-To: <AANLkTin2pPqOUx--9fIX3BH3e-cU6oCRufijcx_4ozx5@mail.gmail.com>

On Mon, Jun 07, 2010 at 02:14:25PM -0400, Miles Lane wrote:
> Hi All,
> 
> I just reproduced a warning I reported quite a while ago.  Is a patch
> for this in the pipeline?

I proposed a patch, thinking that it was a false positive.  Peter Zijlstra
pointed out that there was a real race, and proposed an alternative patch,
which may be found at http://lkml.org/lkml/2010/4/22/603.

Could you please test Peter's patch and let us know if it cures the problem?

							Thanx, Paul

> [    0.167267] [ INFO: suspicious rcu_dereference_check() usage. ]
> [    0.167396] ---------------------------------------------------
> [    0.167526] include/linux/cgroup.h:534 invoked
> rcu_dereference_check() without protection!
> [    0.167728]
> [    0.167729] other info that might help us debug this:
> [    0.167731]
> [    0.168092]
> [    0.168093] rcu_scheduler_active = 1, debug_locks = 1
> [    0.168337] no locks held by watchdog/0/5.
> [    0.168462]
> [    0.168463] stack backtrace:
> [    0.168704] Pid: 5, comm: watchdog/0 Not tainted 2.6.35-rc2-git1 #8
> [    0.168834] Call Trace:
> [    0.168965]  [<ffffffff81064e9c>] lockdep_rcu_dereference+0x9d/0xa5
> [    0.169100]  [<ffffffff8102c1ce>] task_subsys_state+0x59/0x70
> [    0.169232]  [<ffffffff8103189b>] __sched_setscheduler+0x19d/0x2f8
> [    0.169365]  [<ffffffff8102a5ef>] ? need_resched+0x1e/0x28
> [    0.169497]  [<ffffffff813c7d01>] ? schedule+0x586/0x619
> [    0.169628]  [<ffffffff81081c33>] ? watchdog+0x0/0x8c
> [    0.169758]  [<ffffffff81031a11>] sched_setscheduler+0xe/0x10
> [    0.169889]  [<ffffffff81081c5d>] watchdog+0x2a/0x8c
> [    0.170010]  [<ffffffff81081c33>] ? watchdog+0x0/0x8c
> [    0.170141]  [<ffffffff81054a82>] kthread+0x89/0x91
> [    0.170274]  [<ffffffff81003054>] kernel_thread_helper+0x4/0x10
> [    0.170405]  [<ffffffff813ca480>] ? restore_args+0x0/0x30
> [    0.170536]  [<ffffffff810549f9>] ? kthread+0x0/0x91
> [    0.170667]  [<ffffffff81003050>] ? kernel_thread_helper+0x0/0x10
> [    0.176751] lockdep: fixing up alternatives.

^ permalink raw reply

* [PATCH][RFC] Introduction to sk_buff state checking
From: David VomLehn @ 2010-06-08  0:30 UTC (permalink / raw)
  To: to, "netdev, netdev

Add state checking for sk_buffs

Several common errors in sk_buff usage can be caught by maintaining and
checking the sk_buff state. States are: free, allocated, and queued. Simply
reporting an error is not sufficient; it may be the previous state change
that was erroneous. Thus, the ability to record the location of the previous
state change is essential.

This patch was previously submitted and David Miller objected to to the
overhead of recording the previous state change. Specifically, he wrote:

	The specifics are that if you showed me something that
	added a "u16", at most, I'd go to bat for you and support
	your work going upstream.

With only three states, only two bits are required to record the current
state but recording the location of the call that changed the state in
no more than 14 bits posed an interesting challenge. This patchset contains
the infrastructure required to map between a small value, the "callsite ID",
and the location of the call. 

There are four patches in this patchset:
1.	This introduction
2.	A patch allowing the site of a call to be passed over multiple calls
	without modifying the intervening functions.
3.	A patch providing infrastructure that allows the location of a call,
	the "callsite ID" to be encoded in a small integer.
4.	Modifications to the sk_buff functions to add the two-bit state
	and a 14-bit callsite ID, check the state, and report failures.
	
Restrictions
o	Call site IDs are never reused, so it is possible to exceed the
	maximum number of IDs by having a large number of call locations.
	In addition, it does not recognize that the same module has been
	unloaded and reloaded, so calls from the reloaded module will be
	assigned new IDs. Detection of incorrect operations on an sk_buff
	is not affected by exhaustion of call site IDs, but it may not be
	possible to determine the location of the last operation.
	CONFIG_DEBUG_SKB_ID_SIZE is set to reduce the sk_buff growth to 16
	bits and should handle most cases. It could be made larger to allow
	more call site IDs, if necessary.
o	The callsite structures for a module will be freed when that module
	is unloaded, even though sk_buffs may be using IDs corresponding to
	those call sites. To allow useful error reporting, the call site
	information in a module being unloaded is copied. If a module is
	unloaded and CONFIG_CALLSITE_TERSE is enabled, the address of
	the call site is not longer valid, so only the function name and
	offset are printed. If CONFIG_CALLSITE_TERSE is enabled and the
	is loaded, the address is also reported.

Signed-off-by: David VomLehn <dvomlehn@cisco.com>

^ permalink raw reply

* [PATCH][RFC] Infrastructure for out-of-band parameter passing
From: David VomLehn @ 2010-06-08  0:30 UTC (permalink / raw)
  To: to, "netdev, netdev

Infrastructure to support out of band/indirect passing of data to functions.

It is useful at times to be able to pass data from one function to another
nested many stack frames more deeply than the passing function. Doing so
allows the interfaces in the intervening functions to be simpler, though
this "hidden" information passing risks increased complexity. In cases
where this is justified, this simple infrastructure provides the
functionality.

Out of band data passing is implemented by adding, for each instance,
an element to the task_struct that serves as the pointer to the top
of a OOB parameter stack. Data is made available by being pushed
on the OOB parameter stack by a function, and accessed via the top
element of the OOB parameter stack.

Signed-off-by: David VomLehn (dvomlehn@cisco.com)
---
 include/linux/oobparam.h |   89 ++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 89 insertions(+), 0 deletions(-)

diff --git a/include/linux/oobparam.h b/include/linux/oobparam.h
new file mode 100644
index 0000000..6eaa04c
--- /dev/null
+++ b/include/linux/oobparam.h
@@ -0,0 +1,89 @@
+/*
+ * Definitions used to pass information across multiple stack frames
+ * within a single task.
+ *
+ * Copyright (C) 2010  Cisco Systems, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ * Author: David VomLehn
+ *
+ * These definitions allow one function to establish a context for another
+ * function without adding parameters to the intervening functions. It can,
+ * for example, supply state information from one function to be used in the
+ * case that some function nested several levels more deeply wants to report
+ * the context from the original function. Since it works in a way invisible
+ * to the intervening functions, passing information this way is less
+ * obvious than simply supplying parameters and it should be used sparingly.
+ *
+ * To use this:
+ * 1.	Define a structure containing a struct oobparam:
+ *		struct ctx {
+ *			const char *name;
+ *			OOBPARAM_FRAME(frame);
+ *			...
+ *		};
+ * 2.	Create an element in the task_struct to serve as the top of the
+ *	oobparam stack:
+ *		OOBPARAM_TOP(my_top);
+ * 3.	Populate the structure in the function that has context to be passed.
+ *		struct ctx ctx;
+ *		ctx.name = __func__;
+ * 4.	Surround the call to the next function with oobparam_push() and
+ *	oobparam_pop():
+ *		oobparam_push(&ctx.frame);
+ *		myfunc();
+ *		oobparam_pop(&ctx.frame);
+ * 5.	In function that wants to use the parameter value, use OOBPARAM_CUR
+ *	to retrieve the value:
+ *		struct ctx *ctx;
+ *		ctx = OOBPARAM_CUR(&my_top, struct ctx, frame);
+ *		pr_info("Indirectly called by %s\n", ctx->name);
+ */
+
+#ifndef	_LINUX_OOBPARAM_H_
+#define _LINUX_OOBPARAM_H_
+
+struct oobparam {
+	struct oobparam *next;
+};
+
+#define OOBPARAM_TOP(name)	struct oobparam name;
+#define OOBPARAM_FRAME(name)	struct oobparam name;
+
+/**
+ * oobparam_push - push an out of band parameter frame on the OOB param stack
+ * @head	Pointer to the OOB parameter stack top, which must be in the
+ *		task structure.
+ * @frame	Pointer to the OOB parameter frame, generally embedded in
+ *		another structure
+ */
+static inline void oobparam_push(struct oobparam *top, struct oobparam *frame)
+{
+	frame->next = top;
+	/* We need to ensure that the pointer in the frame is set prior to
+	 * the pointer to the top in case we handle an interrupt in between
+	 * the two stores. */
+	wmb();
+	top->next = frame;
+}
+
+static inline void oobparam_pop(struct oobparam *top)
+{
+	top->next = top->next->next;
+}
+
+#define OOBPARAM_CUR(top, type, frame) container_of((top)->next, type, frame)
+#endif

^ permalink raw reply related

* [PATCH][RFC] Infrastructure for compact call location representation
From: David VomLehn @ 2010-06-08  0:30 UTC (permalink / raw)
  To: to, "netdev, netdev

This patch allows the location of a call to be recorded as a small integer,
with each call location ("callsite") assigned a new value the first time
the code in that location is executed. Locations can be recorded as a
an address or as a __FILE__/__LINE__ pair. The later is easier to read, but
requires significantly more space.

The goal here was to record the location in very few bits but, at the same
time, to have minimal overhead.  The key observation towards achieving this
goals is to note that there are are far fewer locations where calls of
interest are made than there are addresses. If the site of each call of
interest is assigned a unique ID, and there are fewer than n of them,
only log2(n) bits are required to identify the call site. If the IDs
are assigned dynamically and most call sites aren't reached, you can get
by with even fewer bits.

This is debugging code and callsite IDs are generally only used when failures
are detected, so though the mapping from a callsite location to a callsite ID
must be fast, speed is not as critical for the reverse mapping. Also, an ID
is assigned to a callsite just once, so it is acceptable to take a while to
assign an ID, but things should run with minimal delay if an ID is already
assigned.

The major implementation pieces are:
1.	Two macros are provided for use in wrapping functions that are to
	be instrumented. CALLSITE_CALL is for functions that return values,
	CALLSITE_CALL_VOID is used for functions that do not.
2.	The call site infrastructure consists of three data structures:
	a.	A statically allocated struct callsite_id holds the ID for
		the call site.
	b.	A statically allocated struct callsite_static holds
		information which is constant for each callsite. The call site
		ID could be combined with this, but by separating them I hope
		to avoid polluting the cache with this very cold information.
	c.	A struct callsite_frame builds on the oobparam infrastructure
		and holds the call site ID. This is assigned at this time
		if this had not previously been done. This will be pushed on
		the OOB parameter stack before calling the skb_* function
		and popped after it returns.
3.	A callsite_top structure is added to task_struct. When a call site
	is entered, its callsite_frame is pushed on the call site stack.
4.	When a function needs to know the call site ID so it can be stored,
	it gets it from the callsite_frame at the top of the call site
	stack.

Notes
o	Under simple test conditions, the number of call site IDs allocated
	can be quite small, small enough to fit in 6 bits. That would reduce
	the sk_buff growth to one byte. This is *not* a recommended
	configuration.
o	This is placed in net/core and linux/net since those are the only
	current users, but there is nothing about this that is networking-
	specific.

Restrictions
o	Call site IDs are never reused, so it is possible to exceed the
	maximum number of IDs by having a large number of call locations.
	In addition, it does not recognize that the same module has been
	unloaded and reloaded, so calls from the reloaded module will be
	assigned new IDs. Detection of incorrect operations on an sk_buff
	is not affected by exhaustion of call site IDs, but it may not be
	possible to determine the location of the last operation.
	CONFIG_DEBUG_SKB_ID_SIZE is set to reduce the sk_buff growth to 16
	bits and should handle most cases. It could be made larger to allow
	more call site IDs, if necessary.
o	The callsite structures for a module will be freed when that module
	is unloaded, even though sk_buffs may be using IDs corresponding to
	those call sites. To allow useful error reporting, the call site
	information in a module being unloaded is copied. If
	CONFIG_CALLSITE_TERSE is not enabled and the module that last changed
	the sk_buff is no longer loaded, the address of the call site
	is no longer valid, so only the function name and offset are printed
	if the module is unloaded. If it is loaded, the address is also
	reported.

History
v2	Support small callsite IDs and split out out-of-band parameter
	parsing.
V1	Initial release

Signed-off-by: David VomLehn <dvomlehn@cisco.com>
---
 include/net/callsite-types.h |  160 +++++++++++++++++++
 include/net/callsite.h       |  208 +++++++++++++++++++++++++
 net/core/callsite.c          |  354 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 722 insertions(+), 0 deletions(-)

diff --git a/include/net/callsite-types.h b/include/net/callsite-types.h
new file mode 100644
index 0000000..796cfb1
--- /dev/null
+++ b/include/net/callsite-types.h
@@ -0,0 +1,160 @@
+/*
+ *				callsite-types.h
+ *
+ * Definitions for tracking sites at which functions are called with low
+ * overhead.
+ *
+ * Copyright (C) 2009  Scientific-Atlanta, Inc.
+ * Copyright (C) 2010  Cisco Systems, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ * Author: David VomLehn
+ */
+
+#ifndef	_LINUX_NET_CALLSITE_TYPES_H_
+#define _LINUX_NET_CALLSITE_TYPES_H_
+#include <linux/oobparam.h>
+
+/* Pre-defined call site IDs */
+#define	CALLSITE_UNSET		0	/* Never set (default) */
+#define	CALLSITE_UNKNOWN	1	/* Tried to set, but couldn't */
+#define CALLSITE_START		2	/* First valid call site ID */
+
+#define	CALLSITE_MAX_ID_SIZE	16	/* Max # bits in a call site ID */
+
+/**
+ * struct callsite_id - callsite identifier
+ * @id:		Unique value assigned to callsite
+ *
+ * This includes the unique ID assigned to the call site and the information
+ * that defines the location of the call site.
+ */
+struct callsite_id {
+	unsigned		id:CALLSITE_MAX_ID_SIZE;
+};
+
+/* The id value must be set to CALLSITE_UNSET. This is conveniently defined
+ * to have the value zero, so we don't need to explicitly set it */
+#define CALLSITE_ID_INIT() {						\
+	}
+
+/**
+ * struct callsite_where - Location information for loaded callsites
+ * @here:	Address of code doing the calling (if terse reporting)
+ * @file:	Pointer to file name (if not using terse reporting)
+ * @lineno:	Line number in file (if not using terse reporting)
+ */
+struct callsite_where {
+#ifdef CONFIG_CALLSITE_TERSE
+	void			*here;	/* Address */
+#else
+	const char		*file;	/* Call location */
+	unsigned short		lineno;
+#endif
+};
+
+#ifdef CONFIG_CALLSITE_TERSE
+#define CALLSITE_WHERE_INIT()			{		\
+			.here =		NULL,			\
+		}
+#else
+#define CALLSITE_WHERE_INIT()			{		\
+			.file =		__FILE__,		\
+			.lineno =	__LINE__,		\
+		}
+#endif
+
+/**
+ * struct callsite_const - constant per-callsite information
+ * @id:		Pointer to the location of the callsite ID
+ * @where:	Location information
+ * @module:	Pointer to the module om which the callsite exists
+ * @set:	Pointer to information that allies to all callsites of this
+ *		particular set.
+ */
+struct callsite_static {
+	struct callsite_id	*id;
+	struct callsite_where	where;
+	struct module		*module;
+	struct callsite_set	*set;
+};
+
+#define	CALLSITE_STATIC_INIT(_id, _set)	{			\
+			.id =		_id,			\
+			.where =	CALLSITE_WHERE_INIT(),	\
+			.module =	THIS_MODULE,		\
+			.set =		_set,			\
+		}
+
+/*
+ * callsite_set - information about a set of callsite IDs
+ * @name:		Callsite_set name
+ * @width:		Number of bits available for callsite ID
+ * Private members:
+ * @warned:		Has a warning been printed that no call site ID could
+ *			be assigned for this callsite set?
+ * @max_id:		The maximim value of a callsite ID. This must fit in
+ *			the number of bits allocated to the callsite_id and
+ *			must be at least CALLSITE_START.
+ * @next_id:		Value of the next callsite ID to give out. Will never
+ *			be more than @max_id.
+ * @info:		Pointer to callsite_info array.
+ * @lock:		Lock that protects the @callsites structure member
+ * @callsite_id_sets:	Link to the next callsite_id_set
+ */
+struct callsite_set {
+	const char		*name;
+	unsigned int		width;
+	/* private */
+	bool			warned:1;
+	unsigned int		max_id;
+	unsigned int		next_id;
+	struct callsite_info	*info;
+	spinlock_t		lock;
+	struct list_head	callsite_id_sets;
+};
+
+#define CALLSITE_SET_INIT(str_name, _varname, _width)	{		\
+	.name =			str_name,				\
+	.width =		_width,					\
+	.lock =			__SPIN_LOCK_UNLOCKED((_varname).lock),	\
+	.callsite_id_sets =	LIST_HEAD_INIT((_varname).callsite_id_sets), \
+}
+
+/**
+ * struct callsite_frame - data in each "frame" of the callsite stack
+ * @id:			Callsite ID
+ * @callsite_oobparam:	Data for passing out of band parameters
+ *
+ * Data that is stored on the stack each time a call is made. A linked list
+ * of these is constructed on the stack for each task. In effect, these
+ * are "frames" for the stack of call sites
+ */
+struct callsite_frame {
+	struct callsite_id	id;
+	OOBPARAM_FRAME(frame);
+};
+#define CALLSITE_FRAME(name)	struct callsite_frame name;
+
+/**
+ * struct callsite_top - pointer to the top of the callsite stack
+ * @callsite_top	Pointer to the top of the callsite stack
+ */
+struct callsite_top {
+	OOBPARAM_TOP(top);
+};
+#define CALLSITE_TOP(name)	struct callsite_top name;
+#endif
diff --git a/include/net/callsite.h b/include/net/callsite.h
new file mode 100644
index 0000000..a355a23
--- /dev/null
+++ b/include/net/callsite.h
@@ -0,0 +1,208 @@
+/*
+ *			callsite.h
+ *
+ * Definitions for tracking callers to functions with very low storage
+ * overhead.
+ *
+ * Copyright (C) 2009  Scientific-Atlanta, Inc.
+ * Copyright (C) 2010  Cisco Systems, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ * Author: David VomLehn
+ */
+
+#ifndef	_LINUX_NET_CALLSITE_H_
+#define _LINUX_NET_CALLSITE_H_
+#include <linux/stringify.h>
+#include <linux/sched.h>
+#include <net/callsite-types.h>
+
+#ifdef CONFIG_CALLSITE
+/* CALLSITE_VARS - macro to define all variables local to a call site
+ * @cs_top:	Name of a variable in which to store the value of the top
+ *		of the stack
+ * @cs_id:	Name of the statically allocated variable in which the call
+ *		site ID is stored
+ * @cs_static:	Name of the statically allocated structure in which constant
+ *		data about the call site is stored
+ * @cs_sf:	Name of the &struct callsite_frame variable (allocated
+ *		on the stack)
+ * @set:	Pointer to the &struct callsite_set for this set of call sites
+ * @top:	Pointer to the &struct callsite_top for this thread
+ */
+#define CALLSITE_VARS(cs_top, cs_id, cs_static, cs_sf, set, top)	\
+		struct callsite_top *cs_top = (top);			\
+		static struct callsite_id cs_id;			\
+		static struct callsite_static cs_static =		\
+			CALLSITE_STATIC_INIT(&cs_id, (set));		\
+		struct callsite_frame cs_sf
+
+/* Define a macro for declaring the variables */
+#define CALLSITE_DECL(cs_top, cs_id, cs_static, cs_sf, set, top) \
+	CALLSITE_VARS(cs_top, cs_id, cs_static, cs_sf, (set), (top))
+
+/**
+ * CALLSITE_CALL - Push a callsite stack "frame" and call a function
+ * @top:	Pointer to a pointer to the first in the list of
+ *		&callsite_top "frames
+ * @set:	Pointer to a &struct callsite_set
+ * @fn:		Function returning a non-void value
+ * @...:	Arguments to fn()
+ *
+ * Push a callsite stack "frame" on the stack, call the given function,
+ * and pop the callsite stack frame. Evaluates to the value returned by
+ * the function.
+ */
+#define CALLSITE_CALL(top, set, fn, arg1, ...)	({			\
+		CALLSITE_DECL(_cs_top, _cs_id, _cs_static,		\
+			_cs_stackframe, (set), (top));			\
+		typeof((fn)(arg1, ##__VA_ARGS__)) _cs_result;		\
+		callsite_push(_cs_top, &_cs_stackframe, &_cs_id,	\
+			&_cs_static);					\
+		_cs_result = (fn)(arg1, ##__VA_ARGS__);			\
+		callsite_pop(_cs_top);					\
+		_cs_result;						\
+	})
+
+/**
+ * CALLSITE_CALL_VOID - Push a callsite stack "frame" and call a void function
+ * @top:	Pointer to a pointer to the first in the list of
+ *		callsite_top "frames
+ * @fn:		Function of type void
+ * @...:	Arguments to fn()
+ *
+ * Push a callsite stack "frame" on the stack, call the given function,
+ * and pop the callsite stack frame.
+ */
+#define CALLSITE_CALL_VOID(top, set, fn, arg1, ...)	do {		\
+		CALLSITE_DECL(_cs_top, _cs_id, _cs_static,		\
+			_cs_stackframe, (set), (top));			\
+		callsite_push(_cs_top, &_cs_stackframe, &_cs_id,	\
+			&_cs_static);					\
+		(fn)(arg1, ##__VA_ARGS__);				\
+		callsite_pop(_cs_top);					\
+	} while (0)
+
+#define CALLSITE_CUR(top) \
+	OOBPARAM_CUR(&top->top, struct callsite_frame, frame)
+
+extern void callsite_print_where_by_id(struct callsite_set *cs_set,
+	unsigned int id);
+extern void callsite_assign_id(struct callsite_static *cs_static);
+extern void callsite_remove_module(struct module *module);
+extern int callsite_set_register(struct callsite_set *cs_set);
+
+/**
+ * callsite_set_id - Set the callsite ID if it isn't already set
+ * @id:		Pointer to &callsite_id to check and set
+ * @cs_static:	Pointer to &struct callsite_static data for this callsite
+ */
+static inline void callsite_set_id(struct callsite_id *id,
+	struct callsite_static *cs_static)
+{
+	if (unlikely(id->id == CALLSITE_UNSET))
+		callsite_assign_id(cs_static);
+}
+
+/*
+ * callsite_push - Push the current callsite on the callsite stack
+ * @top:	Pointer to the stack information
+ * @s:		Pointer to stack "frame" on stack.
+ * @cs_where:	Pointer to statically allocated per-callsite location
+ *		information
+ */
+static inline void callsite_push(struct callsite_top *top,
+	struct callsite_frame *s, struct callsite_id *id,
+	struct callsite_static *cs_static)
+{
+	callsite_set_id(id, cs_static);
+	s->id = *id;
+	oobparam_push(&top->top, &s->frame);
+}
+
+/*
+ * callsite_pop - Pop the current callsite from the callsite stack
+ * @top:	Pointer to a pointer to the top of the callsite stack
+ *
+ * It is possible that the memory pointed to by top will be reused once it
+ * goes out of scope, and the storage now used by the next element of
+ * the top callsite_top structure modified. If top has not been changed
+ * by then, the linked list will be thoroughly confused. We use barrier() to
+ * ensure that top is changed before the callsite_top structure goes out
+ * of scope.
+ */
+static inline void callsite_pop(struct callsite_top *top)
+{
+	oobparam_pop(&top->top);
+}
+
+/**
+ * callsite_top_id - Get the &callsite_id for the topmost &callsite_frame
+ * @top:	Pointer to the &struct callsite_top
+ */
+static inline int callsite_get_id(struct callsite_top *top)
+{
+	return CALLSITE_CUR(top)->id.id;
+}
+
+/**
+ * callsite_task_init - initialize a callsite member of the task structure
+ * @p	Pointer to the member to initialize
+ */
+static inline void callsite_top_init(struct callsite_top *p)
+{
+}
+#else
+#define CALLSITE_CALL(top, set, fn, arg1, ...)	({			\
+		(fn)(arg1, ##__VA_ARGS__);				\
+	})
+
+/* Macro to use to call functions that do not return values */
+#define CALLSITE_CALL_VOID(top, set, fn, arg1, ...)	do {		\
+		(fn)(arg1, ##__VA_ARGS__);				\
+	} while (0)
+
+static inline void callsite_remove_module(struct module *module)
+{
+}
+
+static inline void callsite_push(struct callsite_top *top,
+	struct callsite_frame *s, struct callsite_id *id,
+	const struct callsite_static *cs_static)
+{
+}
+
+static inline void callsite_pop(struct callsite_top **top)
+{
+}
+
+static inline void callsite_top_init(struct callsite_top *p)
+{
+}
+
+static inline int callsite_set_register(struct callsite_set *cs_set)
+{
+	return 0;
+}
+
+static inline int callsite_top_id(const struct callsite_frame *top)
+{
+	return 0;
+}
+#endif
+
+extern void *here(void);
+#endif
diff --git a/net/core/callsite.c b/net/core/callsite.c
new file mode 100644
index 0000000..e77d44b
--- /dev/null
+++ b/net/core/callsite.c
@@ -0,0 +1,354 @@
+/*
+ *			callsite.c
+ *
+ * Support for assigning IDs to addresses.
+ *
+ * For debugging, it may be desirable to store information about where a
+ * structure was, for example used or modified. This location would generically
+ * be an address, but since there are generally only a small number of
+ * addresses that would actually be used, a much smaller tag value can be
+ * stored instead. This code takes care of dynamically generating IDs and
+ * converting them to addresses, when necessary. It is written with the
+ * assumption that generating IDs must be very fast, but converting them
+ * to addresses does not need to be particularly quick.
+ *
+ * Copyright (C) 2009  Scientific-Atlanta, Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+ *
+ * Author: David VomLehn
+ */
+
+#include <linux/kernel.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/slab.h>
+#include <linux/kallsyms.h>
+#include <linux/module.h>
+#include <net/callsite.h>
+
+/**
+ * struct callsite_info - Per-call site information
+ * @unloaded:	Is this in a module that has been unloaded?
+ * @where:	Union of location information
+ *   @loaded	Location information if callsite is loaded
+ *   @unloaded	Location information if callsite is in an unloaded module
+ * @lock:	Lock for updating information for this call site
+ */
+struct callsite_info {
+	bool					unloaded:1;
+	union {
+		const struct callsite_static	*loaded;
+		const char			*unloaded;
+	} where;
+};
+
+/* List of callsite_id_sets and the synchronization primitive that protects
+ * the list */
+static DEFINE_MUTEX(callsite_id_sets_lock);
+static LIST_HEAD(callsite_id_sets);
+
+static const char unknown_location[] = "<unknown>";
+
+/**
+ * callsite_print_location - Print location from &struct callsite_where
+ * @where:	Pointer to &struct callsite_where to print
+ *
+ * The caller of this function should have already printed the printk()
+ * priority so that these pr_cont()s will use the same priority.
+ */
+#ifdef CONFIG_CALLSITE_TERSE
+void callsite_print_where(const struct callsite_where *where)
+{
+	print_ip_sym((unsigned long) where->here);
+}
+#else
+void callsite_print_where(const struct callsite_where *where)
+{
+	pr_cont("%s:%u\n", where->file, where->lineno);
+}
+#endif
+
+/**
+ * callsite_location_alloc - Store location in allocated string
+ * @cs:	Pointer to the &struct callsite
+ *
+ * Returns a pointer to an allocated string, or %NULL.
+ */
+#ifdef CONFIG_CALLSITE_TERSE
+static char *callsite_location_alloc(const struct callsite_where *where)
+{
+	char	*location;
+
+	location = kmalloc(KSYM_SYMBOL_LEN, GFP_KERNEL);
+
+	/* The only interesting information is really the function and offset
+	 * information. The actual location is irrelevant because the module
+	 * is going away. */
+	if (location != NULL) {
+		size_t	size;
+		snprintf(location, KSYM_SYMBOL_LEN, "%pS", where->here);
+		size = strlen(location) + 1;
+		location = krealloc(location, size, GFP_KERNEL);
+	}
+
+	return location;
+}
+#else
+static char *callsite_location_alloc(const struct callsite_where *where)
+{
+	char	*location;
+	size_t	size;
+	static const char colon[] = ":";
+	static const char lineno[] = "99999";
+
+	size = strlen(where->file) + sizeof(colon) + sizeof(lineno) + 1;
+	location = kmalloc(size, GFP_KERNEL);
+
+	if (location != NULL)
+		snprintf(location, size, "%s:%u", where->file, where->lineno);
+
+	return location;
+}
+#endif
+
+/**
+ * callsite_location - Get a string for the location
+ * @cs:	Pointer to call site information
+ */
+static const char *callsite_location(const struct callsite_where *where)
+{
+	const char	*p;
+
+	p = callsite_location_alloc(where);
+	if (p == NULL)
+		p = unknown_location;
+	return p;
+}
+
+/**
+ * callsite_print_location_by_id - Symbolically print the caller location
+ * @cs_set:	Pointer to information about this set of caller IDs
+ * @id:		Call site ID to search for
+ *
+ * The caller of this function should have already printed the printk()
+ * priority so that these pr_cont()s will use the same priority.
+ */
+void callsite_print_where_by_id(struct callsite_set *cs_set,
+	unsigned int id)
+{
+	struct callsite_info	*p;
+	unsigned long		flags;
+
+	switch (id) {
+	case CALLSITE_UNSET:
+		pr_cont("<unset>\n");
+		break;
+	case CALLSITE_UNKNOWN:
+		pr_cont("%s\n", unknown_location);
+		break;
+	default:
+		spin_lock_irqsave(&cs_set->lock, flags);
+
+		/* If the ID is not valid, we can't say where what the callsite
+		 * might be. */
+		if (id >= cs_set->next_id || cs_set->info == NULL)
+			pr_cont("<unknown ID %u>\n", id); /* Couldn't find ID */
+
+		else {
+			p = &cs_set->info[id - CALLSITE_START];
+
+			if (p->unloaded)
+				pr_cont("%s (module unloaded)\n",
+					p->where.unloaded);
+			else
+				callsite_print_where(&p->where.loaded->where);
+		}
+
+		spin_unlock_irqrestore(&cs_set->lock, flags);
+		break;
+	}
+}
+
+/**
+ * callsite_assign_id - Assign a call site ID
+ * @cs_static:	Pointer to static information about the callsite
+ *
+ * If the ID is @CALLSITE_UNSET in a given &struct callsite, this
+ * function is called to assign a call site ID. The value assigned will
+ * normally * be @CALLSITE_START or above, but if we exceed the maximum
+ * size of an ID, * we assign @CALLSITE_UNKNOWN.
+ */
+extern void callsite_assign_id(struct callsite_static *cs_static)
+{
+	unsigned int		id;
+	unsigned long		flags;
+	struct callsite_set	*set;
+
+	/* Record the caller's location. */
+	cs_static->where.here = here();
+
+	/* Lock the call site set. The first time we check, we do so on
+	 * the optimistic assumption that it has already been set. It may
+	 * have been since checking, though, which is why we need to lock
+	 * the callsite_set and check again. */
+	set = cs_static->set;
+	spin_lock_irqsave(&set->lock, flags);
+
+	/* If the callsite_info array wasn't allocated, we can't assign an
+	 * ID and the callsite is unknown. Since the value returned is not
+	 * @CALLSITE_UNKNOWN, we won't try again to assign a callsite ID for
+	 * this site, */
+	if (set->info == NULL)
+		cs_static->id->id = CALLSITE_UNKNOWN;
+
+	else if (cs_static->id->id == CALLSITE_UNSET) {
+
+		/* If we are out of tags, just indicate that it's unknown */
+		if (set->next_id > set->max_id) {
+			id = CALLSITE_UNKNOWN;
+			if (!set->warned) {
+				pr_warning("Exhausted IDs for callsite_set "
+					"%s\n", set->name);
+				set->warned = true;
+			}
+		}
+
+		else {
+			struct callsite_info *p;
+			id = set->next_id;
+			set->next_id++;
+
+			p = &set->info[id - CALLSITE_START];
+			p->unloaded = false;
+			p->where.loaded = cs_static;
+		}
+
+		cs_static->id->id = id;
+	}
+
+	spin_unlock_irqrestore(&set->lock, flags);
+#ifdef DEBUG
+	if (cs_static->id->id != CALLSITE_UNKNOWN) {
+		pr_debug("%s: assigned ID %u to call at ",
+			__func__, cs_static->id->id);
+		callsite_print_where_by_id(set, cs_static->id->id);
+	}
+#endif
+}
+EXPORT_SYMBOL(callsite_assign_id);
+
+/*
+ * callsite_unload_id - Preserve call site ID info on module unload
+ * @p:	Pointer to &struct callsite_info to preserve
+ *
+ * We assume that we are not in atomic mode, so we can sleep waiting for
+ * memory. Must be called with the list lock held.
+ */
+static void callsite_unload_id(struct callsite_info *p)
+{
+	p->unloaded = true;
+	p->where.unloaded = callsite_location(&p->where.loaded->where);
+}
+
+/*
+ * callsite_unload_module - Preserve callsite ID info when unloading a module
+ * @module:	Pointer to &struct module for module being unloaded
+ *
+ * Call site IDs are assigned dynamically as the need arises, which works well
+ * much of the time. There is an issue, though, with call site ID information
+ * stored in modules, because the callsite associated with an ID
+ * goes away when that module is removed. To handle that, we copy all of the
+ * callsite information for a module when it is removed, including
+ * generating the location string.
+ */
+void callsite_remove_module(struct module *module)
+{
+	unsigned long		flags;
+	struct callsite_set *cs;
+
+	mutex_lock(&callsite_id_sets_lock);
+	list_for_each_entry(cs, &callsite_id_sets, callsite_id_sets) {
+		int	i;
+
+		if (cs->info == NULL)
+			continue;
+
+		spin_lock_irqsave(&cs->lock, flags);
+
+		for (i = CALLSITE_START; i < cs->next_id; i++) {
+			struct callsite_info *p;
+
+			p = &cs->info[i - CALLSITE_START];
+
+			if (!p->unloaded && p->where.loaded->module == module)
+				callsite_unload_id(p);
+		}
+
+		spin_unlock_irqrestore(&cs->lock, flags);
+	}
+	mutex_unlock(&callsite_id_sets_lock);
+}
+
+/**
+ * callsite_id_set_register - add information for a set of callsite IDs
+ * @cs_set:	Pointer to the &struct callsite_id_set to add
+ *
+ * Returns zero on success, or a negative errno value.
+ */
+int callsite_set_register(struct callsite_set *cs_set)
+{
+	size_t	n;
+	size_t	size;
+	struct callsite_info *info;
+
+	BUG_ON(cs_set->max_id > (1 << CALLSITE_MAX_ID_SIZE));
+	cs_set->max_id = (1 << (cs_set->width)) - 1;
+
+	if (cs_set->max_id < CALLSITE_START)
+		return -EINVAL;
+
+	n = cs_set->max_id - CALLSITE_START;
+
+	/* Allocate memory for the maximum number of callsites. We take
+	 * advantage of the fact that the value of CALLSITE_UNSET is zero */
+	size = n * sizeof(struct callsite_info);
+	BUG_ON(CALLSITE_UNSET != 0);
+	info = kzalloc(size, GFP_KERNEL);
+
+	if (info == NULL)
+		return -ENOMEM;
+
+	cs_set->info = info;
+	cs_set->next_id = CALLSITE_START;
+
+	mutex_lock(&callsite_id_sets_lock);
+	list_add(&cs_set->callsite_id_sets, &callsite_id_sets);
+	mutex_unlock(&callsite_id_sets_lock);
+
+	return 0;
+}
+EXPORT_SYMBOL(callsite_set_register);
+
+/**
+ * here - address in calling function
+ *
+ * This needs the caller to create a stackframe, so it can't be inlined.
+ */
+noinline void *here()
+{
+	return __builtin_return_address(0);
+}
+EXPORT_SYMBOL(here);

^ permalink raw reply related

* [PATCH][RFC] Check sk_buff states
From: David VomLehn @ 2010-06-08  0:30 UTC (permalink / raw)
  To: to, "netdev, netdev

This uses the oobparam and callsite infrastructure to implement sk_buff
state checking and error reporting. Possible states of an sk_buff are:
	SKB_STATE_FREE - Is not currently in use
	SKB_STATE_ALLOCATED - Has been allocated, but is not on a queue
	SKB_STATE_QUEUED - Is allocated and on a queue
Since there are only three states, two bits suffice to record the state of
an sk_buff structure, so checking for consistent state is easy. (For you
weenies, the fourth possible state *is* flagged as an error).

Some of the errors that can be detected are:
o	Double-freeing an skb_buff
o	Using kfree() instead of kfree_skb()
o	Queuing an sk_buffer that is already on a queue

Notes
o	Under relatively simple test conditions, .i.e. one Ethernet interface
	using an NFS filesystem, the number of call site IDs allocated
	can be quite small, small enough to fit in 6 bits. That would reduce
	the sk_buff growth to one byte. This is *not* a recommended
	configuration.

Restrictions
o	This code relies on fact that sk_buff memory is allocated from
	dedicated kmem_caches. Specifically:
	1.	It uses kmem_cache constructors to initialize fields used for
		checking state.
	2.	It assumes that the fields used to check state will either:
		a.	Not be modified between calling kmem_cache_free and
			kmem_cache_alloc_node, or
		b.	The kmem_cache constructor, will be called.
o	Since the state and callsite ID must not be zeroed by __alloc_skb(),
	it may not be possible to squeeze them into existing padding in
	the sk_buff structure. If so, the addition size added might be larger
	than the 12 bits presently configured.
o	There is some chance that use of kfree() instead of skb_free() will
	not be detected. Normally, when the skbuff freed with kfree() is
	reallocated, its state will still be SKB_STATE_ALLOCATED. Since
	alloc_skb() expects the state to be SKB_STATE_FREE, the error will
	be caught.  However, if the slab used for the skbuff gets returned
	to the buddy allocator and then reallocated to the kmem_cache,
	it will get re-initialized. This will set its state to SKB_STATE_FREE.
	Returning a slab to the buddy allocator is done much less frequently
	than allocation and freeing, so the bulk of these errors will be
	found.
o	Though this code should not have much of a performance or size
	impact, careful consideration should be given before enabling it
	in a production version of the kernel.

History
v2	Reduce the per-sk_buff overhead by shrinking the state and using
	call site IDs. Instead of storing the last allocation and free, only
	store the last state-changing operation.
v1	Original version

Signed-off-by: David VomLehn <dvomlehn@cisco.com>
---
 include/linux/sched.h  |    5 +
 include/linux/skbuff.h |  279 +++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/fork.c          |    3 +
 kernel/module.c        |    2 +
 lib/Kconfig.debug      |   53 +++++++++
 net/core/Makefile      |    2 +-
 net/core/skbuff.c      |  188 ++++++++++++++++++++++++++++++++-
 7 files changed, 523 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f118809..e6a39e1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
 #include <linux/kobject.h>
 #include <linux/latencytop.h>
 #include <linux/cred.h>
+#include <net/callsite-types.h>
 
 #include <asm/processor.h>
 
@@ -1389,6 +1390,10 @@ struct task_struct {
 	int softirqs_enabled;
 	int softirq_context;
 #endif
+#ifdef CONFIG_DEBUG_SKB_STATE_CHANGES
+	/* Stack of calls using callsite ID logging */
+	CALLSITE_TOP(skb_callsite_top);
+#endif
 #ifdef CONFIG_LOCKDEP
 # define MAX_LOCK_DEPTH 48UL
 	u64 curr_chain_key;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bf243fc..0bd52ea 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -29,6 +29,7 @@
 #include <linux/rcupdate.h>
 #include <linux/dmaengine.h>
 #include <linux/hrtimer.h>
+#include <net/callsite.h>
 
 /* Don't change this without changing skb_csum_unnecessary! */
 #define CHECKSUM_NONE 0
@@ -88,7 +89,7 @@
  *			  about CHECKSUM_UNNECESSARY. 8)
  *	NETIF_F_IPV6_CSUM about as dumb as the last one but does IPv6 instead.
  *
- *	Any questions? No questions, good. 		--ANK
+ *	Any questions? No questions, good.		--ANK
  */
 
 struct net_device;
@@ -254,6 +255,32 @@ typedef unsigned int sk_buff_data_t;
 typedef unsigned char *sk_buff_data_t;
 #endif
 
+/* States for a struct sk_buff */
+enum skb_check_state {
+	SKB_STATE_INVALID,	/* Zero is a common corruption value, so make
+				 * an invalid value */
+	SKB_STATE_FREE,		/* sk_buff is not allocated */
+	SKB_STATE_ALLOCATED,	/* In used, but not queued */
+	SKB_STATE_QUEUED,	/* On a queue */
+	SKB_STATE_NUM		/* Number of states */
+};
+
+#define SKB_VALIDATE_STATE_SIZE			2 /* ilog2(SKB_STATE_NUM) */
+
+/*
+ * Convert an skb_check to the corresponding mask
+ * @state:	The skb_check value to convert
+ * Returns the mask
+ */
+static inline unsigned skb_check2mask(enum skb_check_state state)
+{
+	return 1u << state;
+}
+
+#define SKB_STATE_FREE_MASK		(1 << SKB_STATE_FREE)
+#define SKB_STATE_ALLOCATED_MASK	(1 << SKB_STATE_ALLOCATED)
+#define SKB_STATE_QUEUED_MASK		(1 << SKB_STATE_QUEUED)
+
 /** 
  *	struct sk_buff - socket buffer
  *	@next: Next buffer in list
@@ -308,6 +335,8 @@ typedef unsigned char *sk_buff_data_t;
  *		done by skb DMA functions
  *	@secmark: security marking
  *	@vlan_tci: vlan tag control information
+ *	@state: State for consistency checking
+ *	@last_op: Address of caller who last changed the sk_buff state
  */
 
 struct sk_buff {
@@ -409,6 +438,14 @@ struct sk_buff {
 				*data;
 	unsigned int		truesize;
 	atomic_t		users;
+#ifdef CONFIG_DEBUG_SKB_STATE_CHANGES
+	/* private: */
+	/* These are set by the constructor called by the slab cache allocator
+	 * and should not be zeroed with the rest of the sk_buff. */
+	unsigned int	state:SKB_VALIDATE_STATE_SIZE;
+	unsigned int	last_op:CONFIG_DEBUG_SKB_ID_WIDTH;
+	/* public: */
+#endif
 };
 
 #ifdef __KERNEL__
@@ -484,6 +521,130 @@ static inline struct rtable *skb_rtable(const struct sk_buff *skb)
 	return (struct rtable *)skb_dst(skb);
 }
 
+#ifdef CONFIG_DEBUG_SKB_STATE_CHANGES
+extern struct callsite_set skb_callsite_set;
+
+extern void skb_report_invalid(const struct sk_buff *skb, unsigned expected);
+
+static inline void skb_callsite_task_init(struct task_struct *p)
+{
+	callsite_top_init(&p->skb_callsite_top);
+}
+
+/**
+* skb_check - verify that the state of the sk_buff is as expected
+* @skb:		Pointer to the &struct sk_buff to validate
+* @expected:	Mask of valid states
+*/
+static inline void skb_check(const struct sk_buff *skb, unsigned expected)
+{
+	if (unlikely((skb_check2mask(skb->state) & expected) == 0))
+		skb_report_invalid(skb, expected);
+}
+
+/**
+* skb_check_and_set - validate the current sk_buff state and set to a new value
+* @skb:	Pointer to the &sk_buff whose state is to be validated and set
+* @expected:	Expected state of the &sk_buff
+* @new_state:	New state of the &sk_buff
+*/
+static inline void skb_check_and_set(struct sk_buff *skb, unsigned expected,
+	enum skb_check_state new_state)
+{
+	skb_check(skb, expected);
+	skb->state = new_state;
+}
+
+/**
+ * skb_check_and_set_state - Check and set the state for specific function
+ * @skb:	Pointer to the &sk_buff whose state is to be validated and set
+ * @expected:	Expected state mask for the &sk_buff
+ * @state:	New state for the sk_buff
+ */
+static inline void skb_check_and_set_state(struct sk_buff *skb,
+	unsigned expected, enum skb_check_state state)
+{
+	skb_check_and_set(skb, expected, state);
+	skb->last_op = callsite_get_id(&current->skb_callsite_top);
+}
+
+/**
+ * skb_check_and_set_alloc - Check and set the state for allocation function
+ * @skb:	Pointer to the &sk_buff whose state is to be validated and set
+ * @expected:	Expected state mask for the &sk_buff
+ *
+ * We allow %NULL values for skb, in which case we do nothing.
+ */
+static inline void skb_check_and_set_alloc(struct sk_buff *skb,
+	unsigned expected)
+{
+	if (likely(skb))
+		skb_check_and_set_state(skb, expected, SKB_STATE_ALLOCATED);
+}
+
+/**
+ * skb_check_and_set_free - Check and set the state for free function
+ * @skb:	Pointer to the &sk_buff whose state is to be validated and set
+ * @expected:	Expected state mask for the &sk_buff
+ *
+ * %NULL values for skb are valid in this case, which causes the check to be
+ * skipped.
+ */
+static inline void skb_check_and_set_free(struct sk_buff *skb,
+	unsigned expected)
+{
+	if (likely(skb))
+		skb_check_and_set_state(skb, expected, SKB_STATE_FREE);
+}
+
+/**
+ * skb_check_and_set_queue - Check and set the state for queuing functions
+ * @skb:	Pointer to the &sk_buff whose state is to be validated and set
+ * @expected:	Expected state mask for the &sk_buff
+ *
+ * %NULL values for skb are valid in this case, which causes the check to be
+ * skipped.
+ */
+static inline void skb_check_and_set_queued(struct sk_buff *skb,
+	unsigned expected)
+{
+	skb_check_and_set_state(skb, expected, SKB_STATE_QUEUED);
+}
+#else
+static inline void skb_callsite_task_init(struct task_struct *p)
+{
+}
+
+static inline void skb_check(const struct sk_buff *skb, unsigned expected)
+{
+}
+
+static inline void skb_check_and_set(struct sk_buff *skb, unsigned expected,
+	const enum skb_check_state new_state)
+{
+}
+
+static inline void skb_check_and_set_state(struct sk_buff *skb,
+	unsigned expected, enum skb_check_state state)
+{
+}
+
+static inline void skb_check_and_set_alloc(struct sk_buff *skb,
+	unsigned expected)
+{
+}
+
+static inline void skb_check_and_set_free(struct sk_buff *skb,
+	unsigned expected)
+{
+}
+
+static inline void skb_check_and_set_queued(struct sk_buff *skb,
+	unsigned expected)
+{
+}
+#endif
+
 extern void kfree_skb(struct sk_buff *skb);
 extern void consume_skb(struct sk_buff *skb);
 extern void	       __kfree_skb(struct sk_buff *skb);
@@ -886,6 +1047,7 @@ static inline void __skb_insert(struct sk_buff *newsk,
 				struct sk_buff *prev, struct sk_buff *next,
 				struct sk_buff_head *list)
 {
+	skb_check_and_set_queued(newsk, SKB_STATE_ALLOCATED_MASK);
 	newsk->next = next;
 	newsk->prev = prev;
 	next->prev  = prev->next = newsk;
@@ -1040,6 +1202,8 @@ static inline void __skb_unlink(struct sk_buff *skb, struct sk_buff_head *list)
 {
 	struct sk_buff *next, *prev;
 
+	skb_check_and_set_state(skb, SKB_STATE_QUEUED_MASK,
+		SKB_STATE_ALLOCATED);
 	list->qlen--;
 	next	   = skb->next;
 	prev	   = skb->prev;
@@ -1116,8 +1280,8 @@ static inline void skb_fill_page_desc(struct sk_buff *skb, int i,
 extern void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page,
 			    int off, int size);
 
-#define SKB_PAGE_ASSERT(skb) 	BUG_ON(skb_shinfo(skb)->nr_frags)
-#define SKB_FRAG_ASSERT(skb) 	BUG_ON(skb_has_frags(skb))
+#define SKB_PAGE_ASSERT(skb)	BUG_ON(skb_shinfo(skb)->nr_frags)
+#define SKB_FRAG_ASSERT(skb)	BUG_ON(skb_has_frags(skb))
 #define SKB_LINEAR_ASSERT(skb)  BUG_ON(skb_is_nonlinear(skb))
 
 #ifdef NET_SKBUFF_DATA_USES_OFFSET
@@ -1554,9 +1718,9 @@ extern struct page *__netdev_alloc_page(struct net_device *dev, gfp_t gfp_mask);
  *	netdev_alloc_page - allocate a page for ps-rx on a specific device
  *	@dev: network device to receive on
  *
- * 	Allocate a new page node local to the specified device.
+ *	Allocate a new page node local to the specified device.
  *
- * 	%NULL is returned if there is no free memory.
+ *	%NULL is returned if there is no free memory.
  */
 static inline struct page *netdev_alloc_page(struct net_device *dev)
 {
@@ -2144,5 +2308,110 @@ static inline void skb_forward_csum(struct sk_buff *skb)
 }
 
 bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
+
+#ifdef CONFIG_DEBUG_SKB_STATE_CHANGES
+/*
+ * Macros used to call functions that record the location of the last operation
+ * performed on an sk_buff. We use these instead of using the CALLSITE_CALL*
+ * macros directly so that we can redefine them in skbuff.c
+ */
+#define	SKB_CALL(fn, arg1, ...) ({					\
+		CALLSITE_CALL(&current->skb_callsite_top,		\
+			&skb_callsite_set, fn, arg1, ##__VA_ARGS__);	\
+	})
+
+#define	SKB_CALL_VOID(fn, arg1, ...) do {				\
+		CALLSITE_CALL_VOID(&current->skb_callsite_top,	\
+			&skb_callsite_set, fn, arg1, ##__VA_ARGS__);	\
+	} while (0)
+/*
+ * Macros that cover functions that call callsite_get_id() directly or
+ * indirectly. The macros must be defined after all of the functions are
+ * defined so that we don't have a covered function calling a covered function.
+ */
+#define kfree_skb(skb)							\
+	SKB_CALL_VOID(kfree_skb, skb)
+#define consume_skb(skb)						\
+	SKB_CALL_VOID(consume_skb, skb)
+#define __kfree_skb(skb)						\
+	SKB_CALL_VOID(__kfree_skb, skb)
+#define __alloc_skb(size, priority, fclone, node)			\
+	SKB_CALL(__alloc_skb, size, priority, fclone, node)
+#define alloc_skb(size, priority)					\
+	SKB_CALL(alloc_skb, size, priority)
+#define skb_recycle_check(skb, skb_size)				\
+	SKB_CALL(skb_recycle_check, skb, skb_size)
+#define alloc_skb_fclone(size, priority)				\
+	SKB_CALL(alloc_skb_fclone, size, priority)
+#define skb_morph(dst, src)						\
+	SKB_CALL(skb_morph, dst, src)
+#define skb_clone(skb, priority)					\
+	SKB_CALL(skb_clone, skb, priority)
+#define skb_copy(skb, priority)						\
+	SKB_CALL(skb_copy, skb, priority)
+#define pskb_copy(skb, gfp_mask)					\
+	SKB_CALL(pskb_copy, skb, gfp_mask)
+#define skb_realloc_headroom(skb, headroom)				\
+	SKB_CALL(skb_realloc_headroom, skb, headroom)
+#define skb_copy_expand(skb, newheadroom, newtailroom, priority)	\
+	SKB_CALL(skb_copy_expand, skb, newheadroom, newtailroom, priority)
+#define skb_cow_data(skb, tailbits, trailer)				\
+	SKB_CALL(skb_cow_data, skb, tailbits, trailer)
+#define skb_share_check(skb, pri)					\
+	SKB_CALL(skb_share_check, skb, pri)
+#define skb_unshare(skb, pri)						\
+	SKB_CALL(skb_unshare, skb, pri)
+#define __pskb_pull_tail(skb, delta)					\
+	SKB_CALL(__pskb_pull_tail, skb, delta)
+#define ___pskb_trim(skb, len)						\
+	SKB_CALL(___pskb_trim, skb, len)
+#define __pskb_trim(skb, len)						\
+	SKB_CALL(__pskb_trim, skb, len)
+#define pskb_trim(skb, len)						\
+	SKB_CALL(pskb_trim, skb, len)
+#define skb_queue_purge(list)						\
+	SKB_CALL_VOID(skb_queue_purge, list)
+#define __skb_queue_purge(list)						\
+	SKB_CALL_VOID(__skb_queue_purge, list)
+#define __dev_alloc_skb(length, gfp_mask)				\
+	SKB_CALL(__dev_alloc_skb, length, gfp_mask)
+#define dev_alloc_skb(length)						\
+	SKB_CALL(dev_alloc_skb, length)
+#define __netdev_alloc_skb(dev, length, gfp_mask)			\
+	SKB_CALL(__netdev_alloc_skb, dev, length, gfp_mask)
+#define netdev_alloc_skb(dev, length)					\
+	SKB_CALL(netdev_alloc_skb, dev, length)
+#define skb_segment(skb, features)					\
+	SKB_CALL(skb_segment, skb, features)
+#define nf_conntrack_put_reasm(skb)					\
+	SKB_CALL(nf_conntrack_put_reasm, skb)
+#define __skb_insert(newsk, prev, next, list)				\
+	SKB_CALL_VOID(__skb_insert, newsk, prev, next, list)
+#define __skb_queue_after(list, prev, newsk)				\
+	SKB_CALL_VOID(__skb_queue_after, list, prev, newsk)
+#define __skb_queue_before(list, prev, newsk)				\
+	SKB_CALL_VOID(__skb_queue_before, list, prev, newsk)
+#define skb_queue_head(list, newsk)					\
+	SKB_CALL_VOID(skb_queue_head, list, newsk)
+#define __skb_queue_head(list, newsk)					\
+	SKB_CALL_VOID(__skb_queue_head, list, newsk)
+#define skb_queue_tail(list, newsk)					\
+	SKB_CALL_VOID(skb_queue_tail, list, newsk)
+#define __skb_queue_tail(list, newsk)					\
+	SKB_CALL_VOID(__skb_queue_tail, list, newsk)
+#define skb_unlink(skb, list)						\
+	SKB_CALL_VOID(skb_unlink, skb, list)
+#define __skb_unlink(skb, list)						\
+	SKB_CALL_VOID(__skb_unlink, skb, list)
+#define skb_dequeue(list)						\
+	SKB_CALL(skb_dequeue, list)
+#define __skb_dequeue(list)						\
+	SKB_CALL(__skb_dequeue, list)
+#define skb_dequeue_tail(list)						\
+	SKB_CALL(skb_dequeue_tail, list)
+#define __skb_dequeue_tail(list)					\
+	SKB_CALL(__skb_dequeue_tail, list)
+
+#endif	/* CONFIG_DEBUG_SKB_STATE_CHANGES */
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index b6cce14..13b90a5 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1093,6 +1093,9 @@ static struct task_struct *copy_process(unsigned long clone_flags,
 #else
 	p->hardirqs_enabled = 0;
 #endif
+#ifdef CONFIG_DEBUG_SKB_STATE_CHANGES
+	skb_callsite_task_init(p);
+#endif
 	p->hardirq_enable_ip = 0;
 	p->hardirq_enable_event = 0;
 	p->hardirq_disable_ip = _THIS_IP_;
diff --git a/kernel/module.c b/kernel/module.c
index 8c6b428..7e4d615 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -55,6 +55,7 @@
 #include <linux/async.h>
 #include <linux/percpu.h>
 #include <linux/kmemleak.h>
+#include <net/callsite.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/module.h>
@@ -788,6 +789,7 @@ SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
 	/* Store the name of the last unloaded module for diagnostic purposes */
 	strlcpy(last_unloaded_module, mod->name, sizeof(last_unloaded_module));
 	ddebug_remove_module(mod->name);
+	callsite_remove_module(mod);
 
 	free_module(mod);
 	return 0;
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index e722e9d..27a9a3e 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -315,6 +315,59 @@ config DEBUG_OBJECTS_ENABLE_DEFAULT
         help
           Debug objects boot parameter default value
 
+config DEBUG_SKB_STATE_CHANGES
+	bool "Validate state of networking buffers (struct sk_buff)"
+	depends on DEBUG_KERNEL && NET
+	default n
+	select CALLSITE
+	help
+	  Simple and quick checks to validate that sk_buffs are in the right
+	  state. Intended to catch errors, including, but not limited to,
+	  the following:
+	  - Double-freeing sk_buffs
+	  - Freeing sk_buffs that are still on a queue
+	  - Using kfree instead of kfree_skb (might miss some instances)
+
+config DEBUG_SKB_ID_WIDTH
+	int "Maximum number of bits used to track sk_buff call sites"
+	depends on DEBUG_SKB_STATE_CHANGES
+	range 6 16
+	default 10
+	help
+	  The sk_buff state validation code tracks the location of the last
+	  state-changing operation within each sk_buff. To minimize memory
+	  usage, it stores IDs instead of the address. This value specifies
+	  the maximum number of bits allocated for the ID. If this value is
+	  too small, sk_buff state validation is still done, but it will not
+	  be possible to determine the location where the state was last
+	  changed.
+
+	  If in doubt, leave this at its default setting.
+
+config CALLSITE
+	bool
+	help
+	  Enable code that dynamically generates small tags from code
+	  addresses, which allows reduced overhead in structures that need
+	  to store call location information. This is typically used for
+	  track sites at which calls are made for debugging.
+
+config CALLSITE_TERSE
+	bool "Use terse reporting of call sites"
+	default "n"
+	depends on DEBUG_KERNEL
+	help
+	  When call site IDs are being used to store references to locations
+	  where specific functions are called, the call sites can be reported
+	  either as a file name/line number pair or using the "%pS" format.
+	  Although using file name and line numbers is more readable, it takes
+	  significantly more space. The "%pS" format will report the location
+	  corresponding to the start of the line where the call to the skb_*
+	  function is located. There are generally multiple instructions that
+	  must be skipped to find the actual location of the call, which may
+	  actually have been expanded inline. Thus, using this option is only
+	  for those comfortable with looking at disassembled C code.
+
 config DEBUG_SLAB
 	bool "Debug slab memory allocations"
 	depends on DEBUG_KERNEL && SLAB && !KMEMCHECK
diff --git a/net/core/Makefile b/net/core/Makefile
index 51c3eec..8cbb6f8 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -18,4 +18,4 @@ obj-$(CONFIG_NET_DMA) += user_dma.o
 obj-$(CONFIG_FIB_RULES) += fib_rules.o
 obj-$(CONFIG_TRACEPOINTS) += net-traces.o
 obj-$(CONFIG_NET_DROP_MONITOR) += drop_monitor.o
-
+obj-$(CONFIG_CALLSITE) += callsite.o
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 9f07e74..f320ea9 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -57,12 +57,14 @@
 #include <linux/init.h>
 #include <linux/scatterlist.h>
 #include <linux/errqueue.h>
+#include <linux/kallsyms.h>
 
 #include <net/protocol.h>
 #include <net/dst.h>
 #include <net/sock.h>
 #include <net/checksum.h>
 #include <net/xfrm.h>
+#include <net/callsite.h>
 
 #include <asm/uaccess.h>
 #include <asm/system.h>
@@ -70,6 +72,21 @@
 
 #include "kmap_skb.h"
 
+#ifdef CONFIG_DEBUG_SKB_STATE_CHANGES
+/*
+ * In this file, we never want to use the macros that use the CALLSITE_CALL*
+ * macros to call functions. Doing so would interpose a callsite_stackframe
+ * between the user's call and the call in this file, which would cause
+ * an uninteresting callsite to be recorded. Instead, we define macros that
+ * just call the functions.
+ */
+#undef SKB_CALL
+#define	SKB_CALL(fn, arg1, ...) fn(arg1, ##__VA_ARGS__)
+
+#undef SKB_CALL_VOID
+#define	SKB_CALL_VOID(fn, arg1, ...) fn(arg1, ##__VA_ARGS__)
+#endif
+
 static struct kmem_cache *skbuff_head_cache __read_mostly;
 static struct kmem_cache *skbuff_fclone_cache __read_mostly;
 
@@ -193,7 +210,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	/*
 	 * Only clear those fields we need to clear, not those that we will
 	 * actually initialise below. Hence, don't put any more fields after
-	 * the tail pointer in struct sk_buff!
+	 * the tail pointer in struct sk_buff! (Unless you don't *want* them
+	 * to be cleared, such as with the sk_buff validation fields)
 	 */
 	memset(skb, 0, offsetof(struct sk_buff, tail));
 	skb->truesize = size + sizeof(struct sk_buff);
@@ -224,6 +242,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 
 		child->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
+	skb_check_and_set_alloc(skb, SKB_STATE_FREE_MASK);
 out:
 	return skb;
 nodata:
@@ -312,6 +331,7 @@ static void skb_drop_list(struct sk_buff **listp)
 	do {
 		struct sk_buff *this = list;
 		list = list->next;
+		skb_check_and_set_queued(this, SKB_STATE_QUEUED_MASK);
 		kfree_skb(this);
 	} while (list);
 }
@@ -357,11 +377,14 @@ static void kfree_skbmem(struct sk_buff *skb)
 
 	switch (skb->fclone) {
 	case SKB_FCLONE_UNAVAILABLE:
+		skb_check_and_set_free(skb, SKB_STATE_ALLOCATED_MASK);
 		kmem_cache_free(skbuff_head_cache, skb);
 		break;
 
 	case SKB_FCLONE_ORIG:
 		fclone_ref = (atomic_t *) (skb + 2);
+		skb_check_and_set_free(skb, SKB_STATE_ALLOCATED_MASK);
+
 		if (atomic_dec_and_test(fclone_ref))
 			kmem_cache_free(skbuff_fclone_cache, skb);
 		break;
@@ -374,6 +397,7 @@ static void kfree_skbmem(struct sk_buff *skb)
 		 * fast-cloning again.
 		 */
 		skb->fclone = SKB_FCLONE_UNAVAILABLE;
+		skb_check_and_set_free(skb, SKB_STATE_ALLOCATED_MASK);
 
 		if (atomic_dec_and_test(fclone_ref))
 			kmem_cache_free(skbuff_fclone_cache, other);
@@ -639,6 +663,7 @@ struct sk_buff *skb_clone(struct sk_buff *skb, gfp_t gfp_mask)
 		n->fclone = SKB_FCLONE_UNAVAILABLE;
 	}
 
+	skb_check_and_set_alloc(n, SKB_STATE_FREE_MASK);
 	return __skb_clone(n, skb);
 }
 EXPORT_SYMBOL(skb_clone);
@@ -1248,6 +1273,7 @@ unsigned char *__pskb_pull_tail(struct sk_buff *skb, int delta)
 		/* Free pulled out fragments. */
 		while ((list = skb_shinfo(skb)->frag_list) != insp) {
 			skb_shinfo(skb)->frag_list = list->next;
+			skb_check_and_set_queued(list, SKB_STATE_QUEUED_MASK);
 			kfree_skb(list);
 		}
 		/* And insert new clone at head. */
@@ -2646,6 +2672,7 @@ skip_fraglist:
 err:
 	while ((skb = segs)) {
 		segs = skb->next;
+		skb_check_and_set_queued(skb, SKB_STATE_QUEUED_MASK);
 		kfree_skb(skb);
 	}
 	return ERR_PTR(err);
@@ -2760,19 +2787,69 @@ done:
 }
 EXPORT_SYMBOL_GPL(skb_gro_receive);
 
+#ifdef	CONFIG_DEBUG_SKB_STATE_CHANGES
+/*
+ * skb_add_callsite_id_set - Register the sk_buff callsite ID set information
+ */
+static void skb_add_callsite_id_set(void)
+{
+	callsite_set_register(&skb_callsite_set);
+}
+
+/*
+ * Initialize an sk_buff for state validation
+ * @skb:	Pointer to the sk_buff to initialize
+ */
+static void skb_kmem_cache_ctor(struct sk_buff *skb)
+{
+	skb->state = SKB_STATE_FREE;
+	skb->last_op = CALLSITE_UNSET;
+}
+
+/*
+ * Initialize an sk_buff from skb_head_cache for state validation
+ * @c:	Pointer to the associated kmem_cache structure
+ * @p:	Pointer to the particular object in the cache to be initialized
+ */
+static void skb_head_cache_ctor(void *p)
+{
+	skb_kmem_cache_ctor((struct sk_buff *) p);
+}
+
+/*
+ * Initialize an sk_buff from skb_clone_cache for state validation
+ * @c:	Pointer to the associated kmem_cache structure
+ * @p:	Pointer to the particular object in the cache to be initialized
+ */
+static void skb_fclone_cache_ctor(void *p)
+{
+	struct sk_buff *skb = p;
+	skb_kmem_cache_ctor(skb);
+	skb_kmem_cache_ctor(skb + 1);
+}
+#else
+static void skb_add_callsite_id_set(void)
+{
+}
+
+#define	skb_head_cache_ctor	NULL
+#define	skb_fclone_cache_ctor	NULL
+#endif
+
 void __init skb_init(void)
 {
 	skbuff_head_cache = kmem_cache_create("skbuff_head_cache",
 					      sizeof(struct sk_buff),
 					      0,
 					      SLAB_HWCACHE_ALIGN|SLAB_PANIC,
-					      NULL);
+					      skb_head_cache_ctor);
 	skbuff_fclone_cache = kmem_cache_create("skbuff_fclone_cache",
 						(2*sizeof(struct sk_buff)) +
 						sizeof(atomic_t),
 						0,
 						SLAB_HWCACHE_ALIGN|SLAB_PANIC,
-						NULL);
+						skb_fclone_cache_ctor);
+	skb_add_callsite_id_set();
 }
 
 /**
@@ -3069,3 +3146,108 @@ void __skb_warn_lro_forwarding(const struct sk_buff *skb)
 			   " while LRO is enabled\n", skb->dev->name);
 }
 EXPORT_SYMBOL(__skb_warn_lro_forwarding);
+
+#ifdef	CONFIG_DEBUG_SKB_STATE_CHANGES
+
+struct callsite_set skb_callsite_set = CALLSITE_SET_INIT("skb_callset_set",
+	skb_callsite_set, CONFIG_DEBUG_SKB_ID_WIDTH);
+EXPORT_SYMBOL(skb_callsite_set);
+
+static const char *skb_check_names[] = {
+	[SKB_STATE_INVALID] =	"Invalid",
+	[SKB_STATE_FREE] =	"Free",
+	[SKB_STATE_ALLOCATED] =	"Allocated",
+	[SKB_STATE_QUEUED] =	"Queued",
+};
+
+/*
+* Returns a string corresponding to the current sk_buff check state.
+* @state:	State whose name is to be determined
+* Returns a pointer to the corresponding string. If the state isn't
+* valid, it will return a pointer to a string indicating an invalid
+* state.
+*/
+static const char *skb_check_name(enum skb_check_state state)
+{
+	const char	*result;
+
+	if (state >= ARRAY_SIZE(skb_check_names))
+		result = skb_check_names[SKB_STATE_INVALID];
+
+	else {
+		result = skb_check_names[state];
+
+		if (!result)
+			result = skb_check_names[SKB_STATE_INVALID];
+	}
+
+	return result;
+}
+
+/*
+* Reports an invalid state
+* @skb:		Pointer to the &struct sk_buff with an invalid state
+* @expected:	Mask of valid states
+*/
+void skb_report_invalid(const struct sk_buff *skb, unsigned expected)
+{
+#define	INDENT "     "
+	unsigned		nbytes;
+	enum skb_check_state	actual = skb->state;
+	enum skb_check_state	i;
+
+	pr_err("Invalid sk_buff detected at 0x%p state %d (%s)\n",
+		skb, actual, skb_check_name(actual));
+
+	pr_err("Valid states:");
+	for (i = 0; i < SKB_STATE_NUM; i++) {
+		if (skb_check2mask(i) & expected)
+			pr_cont(" %s", skb_check_name(i));
+	}
+	pr_cont("\n");
+
+	/* Print the last allocation and free. The specific address is that
+	 * of the call to the inline *_trace function and so could be slightly
+	 * different than the call to the *_log function, but it will
+	 * be very close. */
+	pr_err(INDENT "last operation: ");
+	callsite_print_where_by_id(&skb_callsite_set, skb->last_op);
+
+	/* If we seem to be in reasonable shape, try to dump a bit of the
+	 * buffer so that we have more debugging information. Note that we
+	 * only try this if the state of the buffer indicates that we should
+	 * be able to trust the sk_buff buffer pointers */
+	switch (skb->state) {
+	case SKB_STATE_ALLOCATED:
+	case SKB_STATE_QUEUED:
+		/* Only try to do the dump if we don't have fragments because
+		 * the system might be in too bad a shape to do the mapping
+		 * into the kernel virtual address space. */
+		if (skb_is_nonlinear(skb))
+			pr_err("sk_buff is non-linear; skipping print\n");
+
+		else {
+			nbytes = min(96u, (size_t) (skb->end - skb->head));
+			pr_err("sk_buff contents (data offset from head: 0x%x "
+				"bytes)\n", skb->data - skb->head);
+			print_hex_dump(KERN_ERR, "", DUMP_PREFIX_OFFSET, 16, 1,
+				skb->head, nbytes, 0);
+		}
+		break;
+	default:
+		pr_err("sk_buff is neither allocated nor queued; "
+			"skipping print\n");
+		break;
+	}
+
+	/*
+	 * There is a good chance that the system is compromised when this is
+	 * called, but it is possible the problem is limited to a single
+	 * driver and may not recur very frequently. Since we are dealing with
+	 * networking code, minor errors may be corrected, so BUG() instead
+	 * of panicing.
+	 */
+	BUG();
+}
+EXPORT_SYMBOL(skb_report_invalid);
+#endif

^ permalink raw reply related

* Re: [PATCH] netconsole: queue console messages to send later
From: Flavio Leitner @ 2010-06-08  0:37 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, amwang, fubar, mpm, gospo, nhorman, jmoyer, shemminger,
	linux-kernel, bridge, bonding-devel
In-Reply-To: <20100607.165024.135517125.davem@davemloft.net>

On Mon, Jun 07, 2010 at 04:50:24PM -0700, David Miller wrote:
> From: Flavio Leitner <fleitner@redhat.com>
> Date: Mon,  7 Jun 2010 16:24:52 -0300
> 
> > There are some networking drivers that hold a lock in the
> > transmit path. Therefore, if a console message is printed
> > after that, netconsole will push it through the transmit path,
> > resulting in a deadlock.
> > 
> > This patch fixes the re-injection problem by queuing the console
> > messages in a preallocated circular buffer and then scheduling a
> > workqueue to send them later with another context.
> > 
> > Signed-off-by: Flavio Leitner <fleitner@redhat.com>
> 
> You absolutely and positively MUST NOT do this.  Otherwise netconsole
> becomes completely useless.  Your idea has been proposed several times
> as far back as 6 years ago, it was unacceptable then and it's
> unacceptable now.
> 
> The whole point of netconsole is that we may be deep in an interrupt
> or other atomic context, the machine is about to hard hang, and it's
> absolutely essential that we get out any and all kernel logging
> messages that we can, immediately.

Got it. I've never assumed that netconsole would work reliable on 
such situations, so I thought as we have better ways now it would
be helpful. See another idea below.

> There may not be another timer or workqueue able to execute after the
> printk() we're trying to emit.  We may never get to that point.

What if in the netpoll, before we push the skb to the driver, we check
for a bit saying that it's already pushing another skb. In this case,
queue the new skb inside of netpoll and soon as the first call returns
and try to clear the bit, it will send the next skb?

printk("message 1")
...
netconsole called
netpoll sets the flag bit
pushes to the bonding driver which does another printk("message 2")
netconsole called again
netpoll checks for the flag, queue the message, returns.
so, bonding can finish up to send the first message
netpoll is about to return, checks for new queued messages, and pushes them.
bonding finishes up to send the second message
....

No deadlocks, skbs are ordered and still under the same opportunity
to send something. Does it sound acceptable?
It's off the top of my head, so probably this idea has some problems.


> Fix the locking in the drivers or layers that cause the issue instead
> of breaking netconsole.

Someday, somewhere, I know because I did this before, someone will
use a debugging printk() and will see the entire box hanging with
absolutely no message in any console because of this problem. 
I'm not saying that fixing driver isn't the right way to go but
it seems not enough to me.

-- 
Flavio

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox