Netdev List


* RE: [PATCH net-next] iwlwifi: dont pull too much payload in skb head
From: Eric Dumazet @ 2012-05-18 15:21 UTC (permalink / raw)
  To: Berg, Johannes; +Cc: David Miller, netdev, Guy, Wey-Yi W
In-Reply-To: <1DC40B07CD6EC041A66726C271A73AE61955AE5D@IRSMSX102.ger.corp.intel.com>

On Fri, 2012-05-18 at 14:59 +0000, Berg, Johannes wrote:
> > Since merge window is now pretty close, I would prefer David applies this
> > directly in net-next, if you dont mind, as this patch is more a core network issue
> > than an iwlwifi one.
> > 
> > Thanks !
> 
> Sure, good with me, I don't think we have colliding patches.
> 
> Reviewed-by: Johannes Berg <johannes.berg@intel.com>

Thanks

> We may want to move this code into mac80211 later though since it also
> has an if (pull in everything, even reallocating if necessary, if it's
> a management frame), but that can wait, I think we're the only driver
> using paged RX.

This is OK; these frames won't be injected into the Linux IP/TCP stack.

Or maybe you would like an optimized version of skb_header_pointer()
that avoids the copy if the whole blob fits in _one_ fragment?
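
The idea of returning a direct pointer when the requested bytes sit in a single fragment, and copying only when they straddle fragments, can be sketched in userspace C. This is a minimal illustration under stated assumptions, not the kernel helper: struct sketch_frag and sketch_header_pointer() are hypothetical names, and the fragment list is a plain array rather than an sk_buff.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical segmented buffer: payload split across several fragments. */
struct sketch_frag {
	const unsigned char *data;
	size_t len;
};

/*
 * Return a pointer to len bytes starting at offset. If the range lies inside
 * one fragment, point straight into it (no copy); otherwise gather the bytes
 * into the caller-supplied scratch buffer. Returns NULL if out of range.
 */
static const void *sketch_header_pointer(const struct sketch_frag *frags,
					 size_t nfrags, size_t offset,
					 size_t len, void *scratch)
{
	unsigned char *out = scratch;
	size_t copied = 0;
	size_t i;

	assert(frags != NULL && scratch != NULL);

	for (i = 0; i < nfrags; i++) {
		const struct sketch_frag *f = &frags[i];
		size_t chunk;

		if (offset >= f->len) {		/* range starts further on */
			offset -= f->len;
			continue;
		}
		if (copied == 0 && len <= f->len - offset)
			return f->data + offset;	/* fast path: no copy */

		chunk = f->len - offset;
		if (chunk > len - copied)
			chunk = len - copied;
		memcpy(out + copied, f->data + offset, chunk);
		copied += chunk;
		offset = 0;
		if (copied == len)
			return scratch;
	}
	return NULL;
}
```

A caller would pass a scratch buffer at least len bytes long and treat the result as read-only, since it may alias either the fragment or the scratch buffer.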


* [PATCH net-next] net: introduce netdev_alloc_frag()
From: Eric Dumazet @ 2012-05-18 15:12 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

Fix two issues introduced in commit a1c7fff7e18f5
(net: netdev_alloc_skb() use build_skb())

- Must be IRQ safe (non NAPI drivers can use it)
- Must not leak the frag if build_skb() fails to allocate sk_buff

This patch introduces netdev_alloc_frag() for drivers willing to
use build_skb() instead of __netdev_alloc_skb() variants.

Factor the code so that __dev_alloc_skb() is a wrapper around
__netdev_alloc_skb(), and dev_alloc_skb() is a wrapper around
netdev_alloc_skb().

Use __GFP_COLD flag.

Almost all network drivers now benefit from skb->head_frag
infrastructure.
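
The refcounting scheme behind the fragment allocator can be sketched in userspace C. This is an illustration under assumptions, not the kernel code: all sketch_* names are hypothetical, a single cache stands in for the per-CPU one, there is no IRQ masking, and the owning page is handed back explicitly instead of being recovered with virt_to_head_page().

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define SKETCH_PAGE_SIZE 4096u	/* assumed page size for the sketch */

/* Hypothetical userspace stand-in for a refcounted page. */
struct sketch_page {
	unsigned int refcnt;
	unsigned char mem[SKETCH_PAGE_SIZE];
};

/* The patch keeps one cache per CPU; the sketch uses a single one. */
struct sketch_cache {
	struct sketch_page *page;
	unsigned int offset;
};

/* Drop one reference; the page is freed when the last one goes away. */
static void sketch_put_page(struct sketch_page *p)
{
	assert(p->refcnt > 0);
	if (--p->refcnt == 0)
		free(p);
}

/* Carve fragsz bytes from the cached page, refilling once it is exhausted. */
static void *sketch_alloc_frag(struct sketch_cache *nc, unsigned int fragsz,
			       struct sketch_page **owner)
{
	void *data;

	assert(fragsz <= SKETCH_PAGE_SIZE);

	if (nc->page && nc->offset + fragsz > SKETCH_PAGE_SIZE) {
		sketch_put_page(nc->page);	/* drop the cache's reference */
		nc->page = NULL;
	}
	if (!nc->page) {
		nc->page = calloc(1, sizeof(*nc->page));
		if (!nc->page)
			return NULL;
		nc->page->refcnt = 1;		/* reference held by the cache */
		nc->offset = 0;
	}
	data = nc->page->mem + nc->offset;
	nc->offset += fragsz;
	nc->page->refcnt++;			/* reference held by the fragment */
	*owner = nc->page;
	return data;
}
```

A caller that fails to wrap the fragment (the build_skb() failure case in this patch) would call sketch_put_page() on the owner, so the fragment's reference is not leaked.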

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/skbuff.h |   42 ++++++++-----------
 net/core/skbuff.c      |   82 +++++++++++++++++++--------------------
 2 files changed, 59 insertions(+), 65 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index bb47314..fe37c21 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1680,31 +1680,11 @@ static inline void __skb_queue_purge(struct sk_buff_head *list)
 		kfree_skb(skb);
 }
 
-/**
- *	__dev_alloc_skb - allocate an skbuff for receiving
- *	@length: length to allocate
- *	@gfp_mask: get_free_pages mask, passed to alloc_skb
- *
- *	Allocate a new &sk_buff and assign it a usage count of one. The
- *	buffer has unspecified headroom built in. Users should allocate
- *	the headroom they think they need without accounting for the
- *	built in space. The built in space is used for optimisations.
- *
- *	%NULL is returned if there is no free memory.
- */
-static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
-					      gfp_t gfp_mask)
-{
-	struct sk_buff *skb = alloc_skb(length + NET_SKB_PAD, gfp_mask);
-	if (likely(skb))
-		skb_reserve(skb, NET_SKB_PAD);
-	return skb;
-}
-
-extern struct sk_buff *dev_alloc_skb(unsigned int length);
+extern void *netdev_alloc_frag(unsigned int fragsz);
 
 extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
-		unsigned int length, gfp_t gfp_mask);
+					  unsigned int length,
+					  gfp_t gfp_mask);
 
 /**
  *	netdev_alloc_skb - allocate an skbuff for rx on a specific device
@@ -1720,11 +1700,25 @@ extern struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
  *	allocates memory it can be called from an interrupt.
  */
 static inline struct sk_buff *netdev_alloc_skb(struct net_device *dev,
-		unsigned int length)
+					       unsigned int length)
 {
 	return __netdev_alloc_skb(dev, length, GFP_ATOMIC);
 }
 
+/* legacy helper around __netdev_alloc_skb() */
+static inline struct sk_buff *__dev_alloc_skb(unsigned int length,
+					      gfp_t gfp_mask)
+{
+	return __netdev_alloc_skb(NULL, length, gfp_mask);
+}
+
+/* legacy helper around netdev_alloc_skb() */
+static inline struct sk_buff *dev_alloc_skb(unsigned int length)
+{
+	return netdev_alloc_skb(NULL, length);
+}
+
+
 static inline struct sk_buff *__netdev_alloc_skb_ip_align(struct net_device *dev,
 		unsigned int length, gfp_t gfp)
 {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 7645df1..f0bcbe6 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -300,6 +300,40 @@ struct netdev_alloc_cache {
 static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
 
 /**
+ * netdev_alloc_frag - allocate a page fragment
+ * @fragsz: fragment size
+ *
+ * Allocates a frag from a page for receive buffer.
+ * Uses GFP_ATOMIC allocations.
+ */
+void *netdev_alloc_frag(unsigned int fragsz)
+{
+	struct netdev_alloc_cache *nc;
+	void *data = NULL;
+	unsigned long flags;
+
+	local_irq_save(flags);
+	nc = &__get_cpu_var(netdev_alloc_cache);
+	if (unlikely(!nc->page)) {
+refill:
+		nc->page = alloc_page(GFP_ATOMIC | __GFP_COLD);
+		nc->offset = 0;
+	}
+	if (likely(nc->page)) {
+		if (nc->offset + fragsz > PAGE_SIZE) {
+			put_page(nc->page);
+			goto refill;
+		}
+		data = page_address(nc->page) + nc->offset;
+		nc->offset += fragsz;
+		get_page(nc->page);
+	}
+	local_irq_restore(flags);
+	return data;
+}
+EXPORT_SYMBOL(netdev_alloc_frag);
+
+/**
  *	__netdev_alloc_skb - allocate an skbuff for rx on a specific device
  *	@dev: network device to receive on
  *	@length: length to allocate
@@ -313,32 +347,20 @@ static DEFINE_PER_CPU(struct netdev_alloc_cache, netdev_alloc_cache);
  *	%NULL is returned if there is no free memory.
  */
 struct sk_buff *__netdev_alloc_skb(struct net_device *dev,
-		unsigned int length, gfp_t gfp_mask)
+				   unsigned int length, gfp_t gfp_mask)
 {
-	struct sk_buff *skb;
+	struct sk_buff *skb = NULL;
 	unsigned int fragsz = SKB_DATA_ALIGN(length + NET_SKB_PAD) +
 			      SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
 
 	if (fragsz <= PAGE_SIZE && !(gfp_mask & __GFP_WAIT)) {
-		struct netdev_alloc_cache *nc;
-		void *data = NULL;
+		void *data = netdev_alloc_frag(fragsz);
 
-		nc = &get_cpu_var(netdev_alloc_cache);
-		if (!nc->page) {
-refill:			nc->page = alloc_page(gfp_mask);
-			nc->offset = 0;
-		}
-		if (likely(nc->page)) {
-			if (nc->offset + fragsz > PAGE_SIZE) {
-				put_page(nc->page);
-				goto refill;
-			}
-			data = page_address(nc->page) + nc->offset;
-			nc->offset += fragsz;
-			get_page(nc->page);
+		if (likely(data)) {
+			skb = build_skb(data, fragsz);
+			if (unlikely(!skb))
+				put_page(virt_to_head_page(data));
 		}
-		put_cpu_var(netdev_alloc_cache);
-		skb = data ? build_skb(data, fragsz) : NULL;
 	} else {
 		skb = __alloc_skb(length + NET_SKB_PAD, gfp_mask, 0, NUMA_NO_NODE);
 	}
@@ -360,28 +382,6 @@ void skb_add_rx_frag(struct sk_buff *skb, int i, struct page *page, int off,
 }
 EXPORT_SYMBOL(skb_add_rx_frag);
 
-/**
- *	dev_alloc_skb - allocate an skbuff for receiving
- *	@length: length to allocate
- *
- *	Allocate a new &sk_buff and assign it a usage count of one. The
- *	buffer has unspecified headroom built in. Users should allocate
- *	the headroom they think they need without accounting for the
- *	built in space. The built in space is used for optimisations.
- *
- *	%NULL is returned if there is no free memory. Although this function
- *	allocates memory it can be called from an interrupt.
- */
-struct sk_buff *dev_alloc_skb(unsigned int length)
-{
-	/*
-	 * There is more code here than it seems:
-	 * __dev_alloc_skb is an inline
-	 */
-	return __dev_alloc_skb(length, GFP_ATOMIC);
-}
-EXPORT_SYMBOL(dev_alloc_skb);
-
 static void skb_drop_list(struct sk_buff **listp)
 {
 	struct sk_buff *list = *listp;


* RE: [PATCH net-next] iwlwifi: dont pull too much payload in skb head
From: Berg, Johannes @ 2012-05-18 14:59 UTC (permalink / raw)
  To: Eric Dumazet, David Miller; +Cc: netdev, Guy, Wey-Yi W
In-Reply-To: <1337352513.7029.18.camel@edumazet-glaptop>

> Since merge window is now pretty close, I would prefer David applies this
> directly in net-next, if you dont mind, as this patch is more a core network issue
> than an iwlwifi one.
> 
> Thanks !

Sure, good with me, I don't think we have colliding patches.

Reviewed-by: Johannes Berg <johannes.berg@intel.com>

> As iwlwifi use fat skbs, it should not pull too much data in skb->head, and
> particularly no tcp data payload, or splice() is slower, and TCP coalescing is
> disabled. Copying payload to userland also involves at least two copies (part
> from header, part from fragment)
> 
> Each layer will pull its header from the fragment as needed.
> 
> (on 64bit arches, skb_tailroom(skb) at this point is 192 bytes)
> 
> With this patch applied, I have a major reduction of collapsed/pruned TCP
> packets, a nice increase of TCPRcvCoalesce counter, and overall better Internet
> User experience.
> 
> Small packets are still using a fragless skb, so that page can be reused by the
> driver.

We may want to move this code into mac80211 later though since it also has an if (pull in everything, even reallocating if necessary, if it's a management frame), but that can wait, I think we're the only driver using paged RX.

johannes



* [PATCH 3/3] ethtool: Addition of -m option to dump module eeprom
From: Stuart Hodgson @ 2012-05-18 14:58 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, David Miller, Yaniv Rosner, Eilon Greenstein

The -m option now allows retrieval of EEPROM
information from a plug-in module such as SFP+. This
shows specific information about the type and
capabilities of the module in use. The format can be
easily extended to support other module types such as
QSFP in future. Raw data dump is also supported.

Signed-off-by: Stuart Hodgson <smhodgson@solarflare.com>
---
 Makefile.am    |    2 +-
 ethtool.8.in   |   10 ++
 ethtool.c      |   87 +++++++++++++
 internal.h     |    3 +
 sfpid.c        |  387 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 test-cmdline.c |   10 ++
 6 files changed, 498 insertions(+), 1 deletions(-)
 create mode 100644 sfpid.c

diff --git a/Makefile.am b/Makefile.am
index 4b0eb17..e40fc99 100644
--- a/Makefile.am
+++ b/Makefile.am
@@ -9,7 +9,7 @@ ethtool_SOURCES = ethtool.c ethtool-copy.h internal.h \
 		  fec_8xx.c ibm_emac.c ixgb.c ixgbe.c natsemi.c	\
 		  pcnet32.c realtek.c tg3.c marvell.c vioc.c	\
 		  smsc911x.c at76c50x-usb.c sfc.c stmmac.c	\
-		  rxclass.c
+		  rxclass.c sfpid.c
 
 TESTS = test-cmdline
 check_PROGRAMS = test-cmdline test-one-cmdline
diff --git a/ethtool.8.in b/ethtool.8.in
index 63d5d48..cba86ff 100644
--- a/ethtool.8.in
+++ b/ethtool.8.in
@@ -318,6 +318,13 @@ ethtool \- query or control network driver and hardware settings
 .BN other
 .BN combined
 .HP
+.B ethtool \-m|\-\-dump\-module\-eeprom
+.I devname
+.B2 raw on off
+.B2 hex on off
+.BN offset
+.BN length
+.HP
 .B ethtool \-\-show\-priv\-flags
 .I devname
 .HP
@@ -789,6 +796,9 @@ Changes the number of channels used only for other purposes e.g. link interrupts
 .BI combined \ N
 Changes the number of multi-purpose channels.
 .TP
+.B \-m \-\-dump\-module\-eeprom
+Retrieves and, if possible, decodes the EEPROM from plug-in modules, e.g. SFP+ or QSFP.
+.TP
 .B \-\-show\-priv\-flags
 Queries the specified network device for its private flags.  The
 names and meanings of private flags (if any) are defined by each
diff --git a/ethtool.c b/ethtool.c
index fdc21de..d9f1462 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -3078,6 +3078,87 @@ static int do_sprivflags(struct cmd_context *ctx)
 	return 0;
 }
 
+static int do_getmodule(struct cmd_context *ctx)
+{
+	struct ethtool_modinfo modinfo;
+	struct ethtool_eeprom *eeprom;
+	u32 geeprom_offset = 0;
+	u32 geeprom_length = -1;
+	int geeprom_changed = 0;
+	int geeprom_dump_raw = 0;
+	int geeprom_dump_hex = 0;
+	int err;
+
+	struct cmdline_info cmdline_geeprom[] = {
+		{ "offset", CMDL_U32, &geeprom_offset, NULL },
+		{ "length", CMDL_U32, &geeprom_length, NULL },
+		{ "raw", CMDL_BOOL, &geeprom_dump_raw, NULL },
+		{ "hex", CMDL_BOOL, &geeprom_dump_hex, NULL },
+	};
+
+	parse_generic_cmdline(ctx, &geeprom_changed,
+			      cmdline_geeprom, ARRAY_SIZE(cmdline_geeprom));
+
+	if (geeprom_dump_raw && geeprom_dump_hex) {
+		printf("Hex and raw dump cannot be specified together\n");
+		return 1;
+	}
+
+	modinfo.cmd = ETHTOOL_GMODULEINFO;
+	err = send_ioctl(ctx, &modinfo);
+	if (err < 0) {
+		perror("Cannot get module EEPROM information");
+		return 1;
+	}
+
+	if (geeprom_length == -1)
+		geeprom_length = modinfo.eeprom_len;
+
+	if (modinfo.eeprom_len < geeprom_offset + geeprom_length)
+		geeprom_length = modinfo.eeprom_len - geeprom_offset;
+
+	eeprom = calloc(1, sizeof(*eeprom)+geeprom_length);
+	if (!eeprom) {
+		perror("Cannot allocate memory for Module EEPROM data");
+		return 1;
+	}
+
+	eeprom->cmd = ETHTOOL_GMODULEEEPROM;
+	eeprom->len = geeprom_length;
+	eeprom->offset = geeprom_offset;
+	err = send_ioctl(ctx, eeprom);
+	if (err < 0) {
+		perror("Cannot get Module EEPROM data");
+		free(eeprom);
+		return 1;
+	}
+
+	if (geeprom_dump_raw) {
+		fwrite(eeprom->data, 1, eeprom->len, stdout);
+	} else {
+		if (eeprom->offset != 0 ||
+		    (eeprom->len != modinfo.eeprom_len)) {
+			geeprom_dump_hex = 1;
+		} else if (!geeprom_dump_hex) {
+			switch (modinfo.type) {
+			case ETH_MODULE_SFF_8079:
+			case ETH_MODULE_SFF_8472:
+				sff8079_show_all(eeprom->data);
+				break;
+			default:
+				geeprom_dump_hex = 1;
+				break;
+			}
+		}
+		if (geeprom_dump_hex)
+			dump_hex(eeprom->data, eeprom->len, eeprom->offset);
+	}
+
+	free(eeprom);
+
+	return 0;
+}
+
 int send_ioctl(struct cmd_context *ctx, void *cmd)
 {
 #ifndef TEST_ETHTOOL
@@ -3232,6 +3313,12 @@ static const struct option {
 	{ "--show-priv-flags" , 1, do_gprivflags, "Query private flags" },
 	{ "--set-priv-flags", 1, do_sprivflags, "Set private flags",
 	  "		FLAG on|off ...\n" },
+	{ "-m|--dump-module-eeprom", 1, do_getmodule,
+	  "Query/Decode Module EEPROM information",
+	  "		[ raw on|off ]\n"
+	  "		[ hex on|off ]\n"
+	  "		[ offset N ]\n"
+	  "		[ length N ]\n" },
 	{ "-h|--help", 0, show_usage, "Show this help" },
 	{ "--version", 0, do_version, "Show version number" },
 	{}
diff --git a/internal.h b/internal.h
index 867c0ea..576b79b 100644
--- a/internal.h
+++ b/internal.h
@@ -174,4 +174,7 @@ int rxclass_rule_ins(struct cmd_context *ctx,
 		     struct ethtool_rx_flow_spec *fsp);
 int rxclass_rule_del(struct cmd_context *ctx, __u32 loc);
 
+/* Module EEPROM parsing code */
+void sff8079_show_all(const __u8 *id);
+
 #endif /* ETHTOOL_INTERNAL_H__ */
diff --git a/sfpid.c b/sfpid.c
new file mode 100644
index 0000000..a4a671d
--- /dev/null
+++ b/sfpid.c
@@ -0,0 +1,387 @@
+/****************************************************************************
+ * Support for Solarflare Solarstorm network controllers and boards
+ * Copyright 2010 Solarflare Communications Inc.
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License version 2 as published
+ * by the Free Software Foundation, incorporated herein by reference.
+ */
+
+#include <stdio.h>
+#include "internal.h"
+
+static void sff8079_show_identifier(const __u8 *id)
+{
+	printf("\tIdentifier          : 0x%02x", id[0]);
+	switch (id[0]) {
+	case 0x00:
+		printf(" (no module present, unknown, or unspecified)\n");
+		break;
+	case 0x01:
+		printf(" (GBIC)\n");
+		break;
+	case 0x02:
+		printf(" (module soldered to motherboard)\n");
+		break;
+	case 0x03:
+		printf(" (SFP)\n");
+		break;
+	default:
+		printf(" (reserved or unknown)\n");
+		break;
+	}
+}
+
+static void sff8079_show_ext_identifier(const __u8 *id)
+{
+	printf("\tExtended identifier : 0x%02x", id[1]);
+	if (id[1] == 0x00)
+		printf(" (GBIC not specified / not MOD_DEF compliant)\n");
+	else if (id[1] == 0x04)
+		printf(" (GBIC/SFP defined by 2-wire interface ID)\n");
+	else if (id[1] <= 0x07)
+		printf(" (GBIC compliant with MOD_DEF %u)\n", id[1]);
+	else
+		printf(" (unknown)\n");
+}
+
+static void sff8079_show_connector(const __u8 *id)
+{
+	printf("\tConnector           : 0x%02x", id[2]);
+	switch (id[2]) {
+	case 0x00:
+		printf(" (unknown or unspecified)\n");
+		break;
+	case 0x01:
+		printf(" (SC)\n");
+		break;
+	case 0x02:
+		printf(" (Fibre Channel Style 1 copper)\n");
+		break;
+	case 0x03:
+		printf(" (Fibre Channel Style 2 copper)\n");
+		break;
+	case 0x04:
+		printf(" (BNC/TNC)\n");
+		break;
+	case 0x05:
+		printf(" (Fibre Channel coaxial headers)\n");
+		break;
+	case 0x06:
+		printf(" (FibreJack)\n");
+		break;
+	case 0x07:
+		printf(" (LC)\n");
+		break;
+	case 0x08:
+		printf(" (MT-RJ)\n");
+		break;
+	case 0x09:
+		printf(" (MU)\n");
+		break;
+	case 0x0a:
+		printf(" (SG)\n");
+		break;
+	case 0x0b:
+		printf(" (Optical pigtail)\n");
+		break;
+	case 0x0c:
+		printf(" (MPO Parallel Optic)\n");
+		break;
+	case 0x20:
+		printf(" (HSSDC II)\n");
+		break;
+	case 0x21:
+		printf(" (Copper pigtail)\n");
+		break;
+	case 0x22:
+		printf(" (RJ45)\n");
+		break;
+	default:
+		printf(" (reserved or unknown)\n");
+		break;
+	}
+}
+
+static void sff8079_show_transceiver(const __u8 *id)
+{
+	static const char *pfx = "\t                    :  =>";
+
+	printf("\tTransceiver codes   : 0x%02x 0x%02x 0x%02x" \
+	       " 0x%02x 0x%02x 0x%02x 0x%02x 0x%02x\n",
+	       id[3], id[4], id[5], id[6],
+	       id[7], id[8], id[9], id[10]);
+	/* 10G Ethernet Compliance Codes */
+	if (id[3] & (1 << 7))
+		printf("%s 10G Ethernet: 10G Base-ER" \
+		       " [SFF-8472 rev10.4 only]\n", pfx);
+	if (id[3] & (1 << 6))
+		printf("%s 10G Ethernet: 10G Base-LRM\n", pfx);
+	if (id[3] & (1 << 5))
+		printf("%s 10G Ethernet: 10G Base-LR\n", pfx);
+	if (id[3] & (1 << 4))
+		printf("%s 10G Ethernet: 10G Base-SR\n", pfx);
+	/* Infiniband Compliance Codes */
+	if (id[3] & (1 << 3))
+		printf("%s Infiniband: 1X SX\n", pfx);
+	if (id[3] & (1 << 2))
+		printf("%s Infiniband: 1X LX\n", pfx);
+	if (id[3] & (1 << 1))
+		printf("%s Infiniband: 1X Copper Active\n", pfx);
+	if (id[3] & (1 << 0))
+		printf("%s Infiniband: 1X Copper Passive\n", pfx);
+	/* ESCON Compliance Codes */
+	if (id[4] & (1 << 7))
+		printf("%s ESCON: ESCON MMF, 1310nm LED\n", pfx);
+	if (id[4] & (1 << 6))
+		printf("%s ESCON: ESCON SMF, 1310nm Laser\n", pfx);
+	/* SONET Compliance Codes */
+	if (id[4] & (1 << 5))
+		printf("%s SONET: OC-192, short reach\n", pfx);
+	if (id[4] & (1 << 4))
+		printf("%s SONET: SONET reach specifier bit 1\n", pfx);
+	if (id[4] & (1 << 3))
+		printf("%s SONET: SONET reach specifier bit 2\n", pfx);
+	if (id[4] & (1 << 2))
+		printf("%s SONET: OC-48, long reach\n", pfx);
+	if (id[4] & (1 << 1))
+		printf("%s SONET: OC-48, intermediate reach\n", pfx);
+	if (id[4] & (1 << 0))
+		printf("%s SONET: OC-48, short reach\n", pfx);
+	if (id[5] & (1 << 6))
+		printf("%s SONET: OC-12, single mode, long reach\n", pfx);
+	if (id[5] & (1 << 5))
+		printf("%s SONET: OC-12, single mode, inter. reach\n", pfx);
+	if (id[5] & (1 << 4))
+		printf("%s SONET: OC-12, short reach\n", pfx);
+	if (id[5] & (1 << 2))
+		printf("%s SONET: OC-3, single mode, long reach\n", pfx);
+	if (id[5] & (1 << 1))
+		printf("%s SONET: OC-3, single mode, inter. reach\n", pfx);
+	if (id[5] & (1 << 0))
+		printf("%s SONET: OC-3, short reach\n", pfx);
+	/* Ethernet Compliance Codes */
+	if (id[6] & (1 << 7))
+		printf("%s Ethernet: BASE-PX\n", pfx);
+	if (id[6] & (1 << 6))
+		printf("%s Ethernet: BASE-BX10\n", pfx);
+	if (id[6] & (1 << 5))
+		printf("%s Ethernet: 100BASE-FX\n", pfx);
+	if (id[6] & (1 << 4))
+		printf("%s Ethernet: 100BASE-LX/LX10\n", pfx);
+	if (id[6] & (1 << 3))
+		printf("%s Ethernet: 1000BASE-T\n", pfx);
+	if (id[6] & (1 << 2))
+		printf("%s Ethernet: 1000BASE-CX\n", pfx);
+	if (id[6] & (1 << 1))
+		printf("%s Ethernet: 1000BASE-LX\n", pfx);
+	if (id[6] & (1 << 0))
+		printf("%s Ethernet: 1000BASE-SX\n", pfx);
+	/* Fibre Channel link length */
+	if (id[7] & (1 << 7))
+		printf("%s FC: very long distance (V)\n", pfx);
+	if (id[7] & (1 << 6))
+		printf("%s FC: short distance (S)\n", pfx);
+	if (id[7] & (1 << 5))
+		printf("%s FC: intermediate distance (I)\n", pfx);
+	if (id[7] & (1 << 4))
+		printf("%s FC: long distance (L)\n", pfx);
+	if (id[7] & (1 << 3))
+		printf("%s FC: medium distance (M)\n", pfx);
+	/* Fibre Channel transmitter technology */
+	if (id[7] & (1 << 2))
+		printf("%s FC: Shortwave laser, linear Rx (SA)\n", pfx);
+	if (id[7] & (1 << 1))
+		printf("%s FC: Longwave laser (LC)\n", pfx);
+	if (id[7] & (1 << 0))
+		printf("%s FC: Electrical inter-enclosure (EL)\n", pfx);
+	if (id[8] & (1 << 7))
+		printf("%s FC: Electrical intra-enclosure (EL)\n", pfx);
+	if (id[8] & (1 << 6))
+		printf("%s FC: Shortwave laser w/o OFC (SN)\n", pfx);
+	if (id[8] & (1 << 5))
+		printf("%s FC: Shortwave laser with OFC (SL)\n", pfx);
+	if (id[8] & (1 << 4))
+		printf("%s FC: Longwave laser (LL)\n", pfx);
+	if (id[8] & (1 << 3))
+		printf("%s FC: Copper Active\n", pfx);
+	if (id[8] & (1 << 2))
+		printf("%s FC: Copper Passive\n", pfx);
+	if (id[8] & (1 << 1))
+		printf("%s FC: Copper FC-BaseT\n", pfx);
+	/* Fibre Channel transmission media */
+	if (id[9] & (1 << 7))
+		printf("%s FC: Twin Axial Pair (TW)\n", pfx);
+	if (id[9] & (1 << 6))
+		printf("%s FC: Twisted Pair (TP)\n", pfx);
+	if (id[9] & (1 << 5))
+		printf("%s FC: Miniature Coax (MI)\n", pfx);
+	if (id[9] & (1 << 4))
+		printf("%s FC: Video Coax (TV)\n", pfx);
+	if (id[9] & (1 << 3))
+		printf("%s FC: Multimode, 62.5um (M6)\n", pfx);
+	if (id[9] & (1 << 2))
+		printf("%s FC: Multimode, 50um (M5)\n", pfx);
+	if (id[9] & (1 << 0))
+		printf("%s FC: Single Mode (SM)\n", pfx);
+	/* Fibre Channel speed */
+	if (id[10] & (1 << 7))
+		printf("%s FC: 1200 MBytes/sec\n", pfx);
+	if (id[10] & (1 << 6))
+		printf("%s FC: 800 MBytes/sec\n", pfx);
+	if (id[10] & (1 << 4))
+		printf("%s FC: 400 MBytes/sec\n", pfx);
+	if (id[10] & (1 << 2))
+		printf("%s FC: 200 MBytes/sec\n", pfx);
+	if (id[10] & (1 << 0))
+		printf("%s FC: 100 MBytes/sec\n", pfx);
+}
+
+static void sff8079_show_encoding(const __u8 *id)
+{
+	printf("\tEncoding            : 0x%02x", id[11]);
+	switch (id[11]) {
+	case 0x00:
+		printf(" (unspecified)\n");
+		break;
+	case 0x01:
+		printf(" (8B/10B)\n");
+		break;
+	case 0x02:
+		printf(" (4B/5B)\n");
+		break;
+	case 0x03:
+		printf(" (NRZ)\n");
+		break;
+	case 0x04:
+		printf(" (Manchester)\n");
+		break;
+	case 0x05:
+		printf(" (SONET Scrambled)\n");
+		break;
+	case 0x06:
+		printf(" (64B/66B)\n");
+		break;
+	default:
+		printf(" (reserved or unknown)\n");
+		break;
+	}
+}
+
+static void sff8079_show_rate_identifier(const __u8 *id)
+{
+	printf("\tRate identifier     : 0x%02x", id[13]);
+	switch (id[13]) {
+	case 0x00:
+		printf(" (unspecified)\n");
+		break;
+	case 0x01:
+		printf(" (4/2/1G Rate_Select & AS0/AS1)\n");
+		break;
+	case 0x02:
+		printf(" (8/4/2G Rx Rate_Select only)\n");
+		break;
+	case 0x03:
+		printf(" (8/4/2G Independent Rx & Tx Rate_Select)\n");
+		break;
+	case 0x04:
+		printf(" (8/4/2G Tx Rate_Select only)\n");
+		break;
+	default:
+		printf(" (reserved or unknown)\n");
+		break;
+	}
+}
+
+static void sff8079_show_oui(const __u8 *id)
+{
+	printf("\tVendor OUI          : %02x:%02x:%02x\n",
+	       id[37], id[38], id[39]);
+}
+
+static void sff8079_show_wavelength_or_copper_compliance(const __u8 *id)
+{
+	if (id[8] & (1 << 2)) {
+		printf("\tPassive Cu cmplnce. : 0x%02x", id[60]);
+		switch (id[60]) {
+		case 0x00:
+			printf(" (unspecified)");
+			break;
+		case 0x01:
+			printf(" (SFF-8431 appendix E)");
+			break;
+		default:
+			printf(" (unknown)");
+			break;
+		}
+		printf(" [SFF-8472 rev10.4 only]\n");
+	} else if (id[8] & (1 << 3)) {
+		printf("\tActive Cu cmplnce.  : 0x%02x", id[60]);
+		switch (id[60]) {
+		case 0x00:
+			printf(" (unspecified)");
+			break;
+		case 0x01:
+			printf(" (SFF-8431 appendix E)");
+			break;
+		case 0x04:
+			printf(" (SFF-8431 limiting)");
+			break;
+		default:
+			printf(" (unknown)");
+			break;
+		}
+		printf(" [SFF-8472 rev10.4 only]\n");
+	} else {
+		printf("\tLaser wavelength    : %unm\n",
+		       (id[60] << 8) | id[61]);
+	}
+}
+
+static void sff8079_show_value_with_unit(const __u8 *id, unsigned int reg,
+					 const char *name, unsigned int mult,
+					 const char *unit)
+{
+	unsigned int val = id[reg];
+
+	printf("\t%-20s: %u%s\n", name, val * mult, unit);
+}
+
+static void sff8079_show_ascii(const __u8 *id, unsigned int first_reg,
+			       unsigned int last_reg, const char *name)
+{
+	unsigned int reg, val;
+
+	printf("\t%-20s: ", name);
+	for (reg = first_reg; reg <= last_reg; reg++) {
+		val = id[reg];
+		putchar(((val >= 32) && (val <= 126)) ? val : '_');
+	}
+	printf("\n");
+}
+
+void sff8079_show_all(const __u8 *id)
+{
+	sff8079_show_identifier(id);
+	if ((id[0] == 0x03) && (id[1] == 0x04)) {
+		sff8079_show_ext_identifier(id);
+		sff8079_show_connector(id);
+		sff8079_show_transceiver(id);
+		sff8079_show_encoding(id);
+		sff8079_show_value_with_unit(id, 12, "BR, Nominal", 100, "MBd");
+		sff8079_show_rate_identifier(id);
+		sff8079_show_value_with_unit(id, 14,
+					     "Length (SMF,km)", 1, "km");
+		sff8079_show_value_with_unit(id, 15, "Length (SMF)", 100, "m");
+		sff8079_show_value_with_unit(id, 16, "Length (50um)", 10, "m");
+		sff8079_show_value_with_unit(id, 17,
+					     "Length (62.5um)", 10, "m");
+		sff8079_show_value_with_unit(id, 18, "Length (Copper)", 1, "m");
+		sff8079_show_value_with_unit(id, 19, "Length (OM3)", 10, "m");
+		sff8079_show_wavelength_or_copper_compliance(id);
+		sff8079_show_ascii(id, 20, 35, "Vendor name");
+		sff8079_show_oui(id);
+		sff8079_show_ascii(id, 40, 55, "Vendor PN");
+		sff8079_show_ascii(id, 56, 59, "Vendor rev");
+	}
+}
diff --git a/test-cmdline.c b/test-cmdline.c
index 4718842..f830d54 100644
--- a/test-cmdline.c
+++ b/test-cmdline.c
@@ -210,6 +210,16 @@ static struct test_case {
 	{ 0, "--show-priv-flags devname" },
 	{ 1, "--show-priv-flags devname foo" },
 	{ 1, "--show-priv-flags" },
+	{ 1, "-m" },
+	{ 0, "-m devname" },
+	{ 1, "--dump-module-eeprom" },
+	{ 0, "--dump-module-eeprom devname" },
+	{ 0, "-m devname raw on" },
+	{ 0, "-m devname raw off" },
+	{ 0, "-m devname hex on" },
+	{ 0, "-m devname hex off" },
+	{ 1, "-m devname hex on raw on" },
+	{ 0, "-m devname offset 4 length 6" },
 	/* can't test --set-priv-flags yet */
 	{ 0, "-h" },
 	{ 0, "--help" },
-- 
1.7.7.6


* [PATCH 2/3] ethtool: Update ethtool-copy.h to support module eeprom retrieval
From: Stuart Hodgson @ 2012-05-18 14:58 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Yaniv Rosner, David Miller, netdev, Eilon Greenstein

Signed-off-by: Stuart Hodgson <smhodgson@solarflare.com>
---
 ethtool-copy.h |   72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 files changed, 70 insertions(+), 2 deletions(-)

diff --git a/ethtool-copy.h b/ethtool-copy.h
index d904c1a..9e26a76 100644
--- a/ethtool-copy.h
+++ b/ethtool-copy.h
@@ -27,10 +27,15 @@ struct ethtool_cmd {
 				 * access it */
 	__u8	duplex;		/* Duplex, half or full */
 	__u8	port;		/* Which connector port */
-	__u8	phy_address;
+	__u8	phy_address;	/* MDIO PHY address (PRTAD for clause 45).
+				 * May be read-only or read-write
+				 * depending on the driver.
+				 */
 	__u8	transceiver;	/* Which transceiver to use */
 	__u8	autoneg;	/* Enable or disable autonegotiation */
-	__u8	mdio_support;
+	__u8	mdio_support;	/* MDIO protocols supported.  Read-only.
+				 * Not set by all drivers.
+				 */
 	__u32	maxtxpkt;	/* Tx pkts before generating tx int */
 	__u32	maxrxpkt;	/* Rx pkts before generating rx int */
 	__u16	speed_hi;       /* The forced speed (upper
@@ -56,6 +61,20 @@ static __inline__ __u32 ethtool_cmd_speed(const struct ethtool_cmd *ep)
 	return (ep->speed_hi << 16) | ep->speed;
 }
 
+/* Device supports clause 22 register access to PHY or peripherals
+ * using the interface defined in <linux/mii.h>.  This should not be
+ * set if there are known to be no such peripherals present or if
+ * the driver only emulates clause 22 registers for compatibility.
+ */
+#define ETH_MDIO_SUPPORTS_C22	1
+
+/* Device supports clause 45 register access to PHY or peripherals
+ * using the interface defined in <linux/mii.h> and <linux/mdio.h>.
+ * This should not be set if there are known to be no such peripherals
+ * present.
+ */
+#define ETH_MDIO_SUPPORTS_C45	2
+
 #define ETHTOOL_FWVERS_LEN	32
 #define ETHTOOL_BUSINFO_LEN	32
 /* these strings are set to whatever the driver author decides... */
@@ -115,6 +134,23 @@ struct ethtool_eeprom {
 };
 
 /**
+ * struct ethtool_modinfo - plugin module eeprom information
+ * @cmd: %ETHTOOL_GMODULEINFO
+ * @type: Standard the module information conforms to %ETH_MODULE_SFF_xxxx
+ * @eeprom_len: Length of the eeprom
+ *
+ * This structure is used to return the information to
+ * properly size memory for a subsequent call to %ETHTOOL_GMODULEEEPROM.
+ * The type code indicates the eeprom data format
+ */
+struct ethtool_modinfo {
+	__u32   cmd;
+	__u32   type;
+	__u32   eeprom_len;
+	__u32   reserved[8];
+};
+
+/**
  * struct ethtool_coalesce - coalescing parameters for IRQs and stats updates
  * @cmd: ETHTOOL_{G,S}COALESCE
  * @rx_coalesce_usecs: How many usecs to delay an RX interrupt after
@@ -680,6 +716,29 @@ struct ethtool_sfeatures {
 	struct ethtool_set_features_block features[0];
 };
 
+/**
+ * struct ethtool_ts_info - holds a device's timestamping and PHC association
+ * @cmd: command number = %ETHTOOL_GET_TS_INFO
+ * @so_timestamping: bit mask of the sum of the supported SO_TIMESTAMPING flags
+ * @phc_index: device index of the associated PHC, or -1 if there is none
+ * @tx_types: bit mask of the supported hwtstamp_tx_types enumeration values
+ * @rx_filters: bit mask of the supported hwtstamp_rx_filters enumeration values
+ *
+ * The bits in the 'tx_types' and 'rx_filters' fields correspond to
+ * the 'hwtstamp_tx_types' and 'hwtstamp_rx_filters' enumeration values,
+ * respectively.  For example, if the device supports HWTSTAMP_TX_ON,
+ * then (1 << HWTSTAMP_TX_ON) in 'tx_types' will be set.
+ */
+struct ethtool_ts_info {
+	__u32	cmd;
+	__u32	so_timestamping;
+	__s32	phc_index;
+	__u32	tx_types;
+	__u32	tx_reserved[3];
+	__u32	rx_filters;
+	__u32	rx_reserved[3];
+};
+
 /*
  * %ETHTOOL_SFEATURES changes features present in features[].valid to the
  * values of corresponding bits in features[].requested. Bits in .requested
@@ -786,6 +845,9 @@ enum ethtool_sfeatures_retval_bits {
 #define ETHTOOL_SET_DUMP	0x0000003e /* Set dump settings */
 #define ETHTOOL_GET_DUMP_FLAG	0x0000003f /* Get dump settings */
 #define ETHTOOL_GET_DUMP_DATA	0x00000040 /* Get dump data */
+#define ETHTOOL_GET_TS_INFO	0x00000041 /* Get time stamping and PHC info */
+#define ETHTOOL_GMODULEINFO	0x00000042 /* Get plug-in module information */
+#define ETHTOOL_GMODULEEEPROM	0x00000043 /* Get plug-in module eeprom */
 
 /* compatibility with older code */
 #define SPARC_ETH_GSET		ETHTOOL_GSET
@@ -935,6 +997,12 @@ enum ethtool_sfeatures_retval_bits {
 #define RX_CLS_LOC_FIRST	0xfffffffe
 #define RX_CLS_LOC_LAST		0xfffffffd
 
+/* EEPROM Standards for plug in modules */
+#define ETH_MODULE_SFF_8079		0x1
+#define ETH_MODULE_SFF_8079_LEN		256
+#define ETH_MODULE_SFF_8472		0x2
+#define ETH_MODULE_SFF_8472_LEN		512
+
 /* Reset flags */
 /* The reset() operation must clear the flags for the components which
  * were actually reset.  On successful return, the flags indicate the
-- 
1.7.7.6

^ permalink raw reply related
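
The bit-mask convention described in the ethtool_ts_info comment above
can be sketched in a few lines of C. This is only an illustration, not
code from the patch: the enum and the helper (demo_tx_types,
ts_mask_supports) are names made up here, and the enum values are
assumed to mirror the first two entries of hwtstamp_tx_types.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins assumed to mirror HWTSTAMP_TX_OFF / HWTSTAMP_TX_ON;
 * redefined here only so the sketch compiles on its own. */
enum demo_tx_types {
	DEMO_TX_OFF = 0,
	DEMO_TX_ON  = 1
};

/* Hypothetical helper: returns true if enumeration value 'type' is
 * advertised in 'mask', i.e. if bit (1 << type) is set. */
static bool ts_mask_supports(uint32_t mask, unsigned int type)
{
	return type < 32 && (mask & (UINT32_C(1) << type)) != 0;
}
```

A caller would pass the tx_types or rx_filters field returned by
ETHTOOL_GET_TS_INFO as 'mask' and the enumeration value it cares about
as 'type'.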

* [PATCH 1/3] ethtool: Split out printing of hex data
From: Stuart Hodgson @ 2012-05-18 14:58 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, Yaniv Rosner, David Miller, Eilon Greenstein

Split the hex-printing code shared by dump_regs and dump_eeprom out
into a common function, dump_hex, ready for use by module EEPROM
dumping.

Signed-off-by: Stuart Hodgson <smhodgson@solarflare.com>
---
 ethtool.c |   35 ++++++++++++++++++-----------------
 1 files changed, 18 insertions(+), 17 deletions(-)

diff --git a/ethtool.c b/ethtool.c
index e80b38b..fdc21de 100644
--- a/ethtool.c
+++ b/ethtool.c
@@ -787,6 +787,20 @@ static const struct {
 	{ "st_gmac", st_gmac_dump_regs },
 };
 
+static void dump_hex(__u8 *data, int len, int offset)
+{
+	int i;
+
+	fprintf(stdout, "Offset\tValues\n");
+	fprintf(stdout, "--------\t-----");
+	for (i = 0; i < len; i++) {
+		if (i%16 == 0)
+			fprintf(stdout, "\n0x%04x:\t", i + offset);
+		fprintf(stdout, " %02x", data[i]);
+	}
+	fprintf(stdout, "\n\n");
+}
+
 static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
 		     const char *gregs_dump_file,
 		     struct ethtool_drvinfo *info, struct ethtool_regs *regs)
@@ -820,22 +834,14 @@ static int dump_regs(int gregs_dump_raw, int gregs_dump_hex,
 				     ETHTOOL_BUSINFO_LEN))
 				return driver_list[i].func(info, regs);
 
-	fprintf(stdout, "Offset\tValues\n");
-	fprintf(stdout, "--------\t-----");
-	for (i = 0; i < regs->len; i++) {
-		if (i%16 == 0)
-			fprintf(stdout, "\n%03x:\t", i);
-		fprintf(stdout, " %02x", regs->data[i]);
-	}
-	fprintf(stdout, "\n\n");
+	dump_hex(regs->data, regs->len, 0);
+
 	return 0;
 }
 
 static int dump_eeprom(int geeprom_dump_raw, struct ethtool_drvinfo *info,
 		       struct ethtool_eeprom *ee)
 {
-	int i;
-
 	if (geeprom_dump_raw) {
 		fwrite(ee->data, 1, ee->len, stdout);
 		return 0;
@@ -847,13 +853,8 @@ static int dump_eeprom(int geeprom_dump_raw, struct ethtool_drvinfo *info,
 		return tg3_dump_eeprom(info, ee);
 	}
 
-	fprintf(stdout, "Offset\t\tValues\n");
-	fprintf(stdout, "------\t\t------");
-	for (i = 0; i < ee->len; i++) {
-		if(!(i%16)) fprintf(stdout, "\n0x%04x\t\t", i + ee->offset);
-		fprintf(stdout, "%02x ", ee->data[i]);
-	}
-	fprintf(stdout, "\n");
+	dump_hex(ee->data, ee->len, ee->offset);
+
 	return 0;
 }
 
-- 
1.7.7.6

^ permalink raw reply related
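
For anyone who wants to try the factored-out helper outside ethtool, a
standalone sketch follows. It keeps the format strings of dump_hex()
from the patch above, but takes the output stream as a parameter; that
parameter and the name dump_hex_to are assumptions made here so the
function can be exercised without writing to stdout.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical variant of dump_hex(): same layout, 16 bytes per row,
 * each row prefixed with its offset, but writing to 'out'. */
static void dump_hex_to(FILE *out, const unsigned char *data,
			size_t len, unsigned int offset)
{
	size_t i;

	fprintf(out, "Offset\tValues\n");
	fprintf(out, "--------\t-----");
	for (i = 0; i < len; i++) {
		/* Start a new row, labelled with its offset, every 16 bytes */
		if (i % 16 == 0)
			fprintf(out, "\n0x%04x:\t", (unsigned int)i + offset);
		fprintf(out, " %02x", data[i]);
	}
	fprintf(out, "\n\n");
}
```

Judging from the diff alone, routing both callers through one helper
also unifies their output: dump_regs previously printed row offsets as
"%03x:" and dump_eeprom printed bytes as "%02x " with a trailing space.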

* [PATCH 0/3] ethtool: Add command to dump and parse module EEPROM
From: Stuart Hodgson @ 2012-05-18 14:57 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: Yaniv Rosner, David Miller, netdev, Eilon Greenstein

These three patches add support for the new ETHTOOL_GMODULEINFO and 
ETHTOOL_GMODULEEEPROM commands in net-next.

The first patch splits the duplicated code used to print hex data into a function.
The second patch updates ethtool-copy.h from net-next.
The third implements the new commands. Support for parsing SFF-8079 compatible
EEPROMs has also been incorporated.

Stu

^ permalink raw reply

* [PATCH net-next] iwlwifi: dont pull too much payload in skb head
From: Eric Dumazet @ 2012-05-18 14:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Johannes Berg, Wey-Yi Guy

From: Eric Dumazet <edumazet@google.com>

Since merge window is now pretty close, I would prefer David applies
this directly in net-next, if you dont mind, as this patch is more a
core network issue than an iwlwifi one.

Thanks !

[PATCH net-next] iwlwifi: dont pull too much payload in skb head

As iwlwifi uses fat skbs, it should not pull too much data into skb->head,
and in particular no TCP data payload; otherwise splice() is slower and TCP
coalescing is disabled. Copying the payload to userland also involves at
least two copies (part from the header, part from the fragment).

Each layer will pull its header from the fragment as needed.

(on 64bit arches, skb_tailroom(skb) at this point is 192 bytes)

With this patch applied, I have a major reduction of collapsed/pruned
TCP packets, a nice increase of TCPRcvCoalesce counter, and overall
better Internet User experience.

Small packets are still using a fragless skb, so that page can be reused
by the driver.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Wey-Yi Guy <wey-yi.w.guy@intel.com>
---
 drivers/net/wireless/iwlwifi/iwl-agn-rx.c |    7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/net/wireless/iwlwifi/iwl-agn-rx.c b/drivers/net/wireless/iwlwifi/iwl-agn-rx.c
index 18a3837..403de96 100644
--- a/drivers/net/wireless/iwlwifi/iwl-agn-rx.c
+++ b/drivers/net/wireless/iwlwifi/iwl-agn-rx.c
@@ -759,7 +759,12 @@ static void iwlagn_pass_packet_to_mac80211(struct iwl_priv *priv,
 		IWL_ERR(priv, "alloc_skb failed\n");
 		return;
 	}
-	hdrlen = min_t(unsigned int, len, skb_tailroom(skb));
+	/* If frame is small enough to fit in skb->head, pull it completely.
+	 * If not, only pull ieee80211_hdr so that splice() or TCP coalesce
+	 * are more efficient.
+	 */
+	hdrlen = (len <= skb_tailroom(skb)) ? len : sizeof(*hdr);
+
 	memcpy(skb_put(skb, hdrlen), hdr, hdrlen);
 	fraglen = len - hdrlen;

^ permalink raw reply related
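
The one-line change in the patch above is easier to see in isolation.
The sketch below contrasts the old and new choice of hdrlen; both
function names are hypothetical, and hdr_size stands in for
sizeof(*hdr) in the driver.

```c
#include <stddef.h>

/* Old behaviour: copy as much of the frame as fits in skb->head,
 * which drags TCP payload into the linear area for large frames. */
static size_t pull_len_old(size_t len, size_t tailroom)
{
	return len < tailroom ? len : tailroom;
}

/* New behaviour: small frames are still copied whole; larger frames
 * only have the 802.11 header pulled, leaving the rest in the fragment. */
static size_t pull_len_new(size_t len, size_t tailroom, size_t hdr_size)
{
	return len <= tailroom ? len : hdr_size;
}
```

With the 192 bytes of tailroom quoted in the commit message and an
illustrative 30-byte header, a 1500-byte frame would have had 192 bytes
copied into the head before the patch and 30 bytes after it, while a
100-byte frame is copied whole in both cases.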

* Re: Strange latency spikes/TX network stalls on Sun Fire X4150(x86) and e1000e
From: Denys Fedoryshchenko @ 2012-05-18 14:04 UTC (permalink / raw)
  To: netdev, e1000-devel, jeffrey.t.kirsher, jesse.brandeburg,
	therbert, eric.dumazet, davem
In-Reply-To: <df211722d18136698aeeda2f17b876e2@visp.net.lb>

It seems the logic in BQL has serious issues. The worst part is that if
someone doesn't want limits (especially ones as low as this),
there is no way to disable BQL in the kernel configuration; the only
option is tuning each interface through its sysfs values.

I just did short debug:
if (limit != dql->limit) {
+                printk("New limit %d\n", dql->limit);
                 dql->limit = limit;
                 ovlimit = 0;
}

And got these numbers:
[   18.696839] New limit 0
[   19.622967] New limit 42
[   20.037810] New limit 165
[   35.473666] New limit 386
[   37.418591] New limit 1374
[   37.420064] New limit 6432
[   39.209480] New limit 16548
[   39.214773] New limit 1704
[   40.696065] New limit 6762
[   40.696390] New limit 15564
[   41.921120] New limit 25788
[   41.921165] New limit 388
[   42.696286] New limit 534
[   42.696539] New limit 1096
[   42.696719] New limit 2304
[   53.360394] New limit 24334
[   54.696072] New limit 484
[   54.696135] New limit 934

This means the limit sometimes goes below the MTU, and until the queue
limit is increased, I will see traffic "stalled" if there is a large
packet in the queue. BQL probably miscalculates the queue as full
because of some specific handling of sent packets in e1000e on this
particular hardware. It should not be full: it is a 1Gbps wire, and it
is empty. So as a result, instead of eliminating latency, BQL is
adding it.

I can make a patch that keeps the minimum BQL value at no less than
MTU + overhead; is that OK?
It will probably solve the issue, but it is more of a workaround and
safety fuse than a solution.
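
A minimal sketch of the clamp proposed above, assuming it is applied to
the freshly computed limit before it is stored. This is not the dql
code itself: bql_clamp_limit and its parameters are hypothetical names,
and the 14-byte overhead used in the example is only an illustrative
Ethernet header size.

```c
/* Hypothetical clamp: never let the byte limit drop below one full
 * frame (mtu + overhead), and never let it exceed limit_max. */
static unsigned int bql_clamp_limit(unsigned int computed,
				    unsigned int mtu,
				    unsigned int overhead,
				    unsigned int limit_max)
{
	unsigned int min_limit = mtu + overhead;

	/* The floor itself must respect the configured maximum */
	if (min_limit > limit_max)
		min_limit = limit_max;

	if (computed < min_limit)
		return min_limit;
	if (computed > limit_max)
		return limit_max;
	return computed;
}
```

With mtu = 1500 and overhead = 14, the logged limits of 0, 42, 165 and
386 would all be raised to 1514, while values such as 16548 pass through
unchanged. The limit_min file in the sysfs listing quoted below may
allow the same floor to be set per queue without a patch.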

On 2012-05-17 19:54, Denys Fedoryshchenko wrote:
> Also i notice, limit constantly changing over time (even i am not
> touching it).
>
> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:0
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
> centaur ~ # grep "" /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/hold_time:1000
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/inflight:4542
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit:13018
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max:1879048192
> /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_min:0
>
> Is it supposed to be like this?
>
> On 2012-05-17 16:42, Denys Fedoryshchenko wrote:
>> Found commit that cause problem:
>>
>> author	Tom Herbert <therbert@google.com>
>> Mon, 28 Nov 2011 16:33:16 +0000 (16:33 +0000)
>> committer	David S. Miller <davem@davemloft.net>
>> Tue, 29 Nov 2011 17:46:19 +0000 (12:46 -0500)
>> commit	3f0cfa3bc11e7f00c9994e0f469cbc0e7da7b00c
>> tree	d6670a4f94b2b9dedacc38edb6f0e1306b889f6b	tree | snapshot
>> parent	114cf5802165ee93e3ab461c9c505cd94a08b800	commit | diff
>> e1000e: Support for byte queue limits
>>
>> Changes to e1000e to use byte queue limits.
>>
>> Signed-off-by: Tom Herbert <therbert@google.com>
>> Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
>> Signed-off-by: David S. Miller <davem@davemloft.net>
>>
>> If i reverse it, problem disappearing.
>>
>> How i reproduce it:
>> In two consoles do "fast" ping to nearby host
>> ping 194.146.XXX.XXX -s1472 -i0.0001
>> ping 194.146.XXX.XXX -s1472 -i0.1
>>
>> For third open ssh to host with "problem", open mcedit, and just
>> scroll down large text file.
>> After few seconds some "stalls" will occur, and in ping history i 
>> can see:
>> 1480 bytes from 194.146.153.7: icmp_req=1797 ttl=64 time=0.161 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1798 ttl=64 time=0.198 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1799 ttl=64 time=0.340 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1800 ttl=64 time=0.381 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1801 ttl=64 time=914 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1802 ttl=64 time=804 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1803 ttl=64 time=704 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1804 ttl=64 time=594 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1805 ttl=64 time=0.287 ms
>> 1480 bytes from 194.146.153.7: icmp_req=1806 ttl=64 time=0.226 ms
>>
>>
>> If i apply small patch - problem will disappear. Sure it is not a
>> solution, but
>> let me know how i can help to debug problem more.
>>
>> --- netdev.c    2012-05-12 20:08:37.000000000 +0300
>> +++ netdev.c.patched    2012-05-17 16:32:28.895760472 +0300
>> @@ -1135,7 +1135,7 @@
>>
>>         tx_ring->next_to_clean = i;
>>
>> -       netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>> +//     netdev_completed_queue(netdev, pkts_compl, bytes_compl);
>>
>>  #define TX_WAKE_THRESHOLD 32
>>         if (count && netif_carrier_ok(netdev) &&
>> @@ -2263,7 +2263,7 @@
>>                 e1000_put_txbuf(adapter, buffer_info);
>>         }
>>
>> -       netdev_reset_queue(adapter->netdev);
>> +//     netdev_reset_queue(adapter->netdev);
>>         size = sizeof(struct e1000_buffer) * tx_ring->count;
>>         memset(tx_ring->buffer_info, 0, size);
>>
>> @@ -5056,7 +5056,7 @@
>>         /* if count is 0 then mapping error has occurred */
>>         count = e1000_tx_map(adapter, skb, first, max_per_txd,
>> nr_frags, mss);
>>         if (count) {
>> -               netdev_sent_queue(netdev, skb->len);
>> +//             netdev_sent_queue(netdev, skb->len);
>>                 e1000_tx_queue(adapter, tx_flags, count);
>>                 /* Make sure there is space in the ring for the next 
>> send. */
>>                 e1000_maybe_stop_tx(netdev, MAX_SKB_FRAGS + 2);
>>
>>
>>
>> On 2012-05-15 17:15, Denys Fedoryshchenko wrote:
>>> Hi
>>>
>>> I have two identical servers, Sun Fire X4150, both has different
>>> flavors of Linux, x86_64 and i386.
>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller (Copper) (rev 01)
>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller (Copper) (rev 01)
>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (rev 06)
>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (rev 06)
>>> I am using now interface:
>>> #ethtool -i eth0
>>> driver: e1000e
>>> version: 1.9.5-k
>>> firmware-version: 2.1-11
>>> bus-info: 0000:04:00.0
>>> There is 2 CPU , Intel(R) Xeon(R) CPU           E5440  @ 2.83GHz .
>>>
>>> i386 was acting as NAT and shaper, and as soon as i removed shaper
>>> from it, i started to experience strange lockups, e.g. traffic is
>>> normal for 5-30 seconds, then short lockup for 500-3000ms (usually
>>> around 1000ms) with dropped packets counter increasing. I was
>>> suspecting it is due load, but it seems was wrong.
>>> Recently, on another server, x86_64 i am using as development, i
>>> upgrade kernel (it was old, from 2.6 series) and on completely idle
>>> machine started to experience same latency spikes, while i am just
>>> running mc and for example typing in text editor - i notice 
>>> "stalls".
>>> After i investigate it a little more, i notice also small amount of
>>> drops on interface. No tcpdump running. Also this machine is idle, 
>>> and
>>> the only traffic there - some small broadcasts from network, my 
>>> ssh,
>>> and ping.
>>>
>>> Dropped packets in ifconfig
>>>           RX packets:3752868 errors:0 dropped:5350 overruns:0 
>>> frame:0
>>> Counter is increasing sometimes, when this stall happening.
>>>
>>> ethtool -S is clean, there is no dropped packets.
>>>
>>> I did tried to check load (mpstat and perf), there is nothing
>>> suspicious, latencytop also doesn't show anything suspicious.
>>> dropwatch report a lot of drops, but mostly because there is some
>>> broadcasts and etc. tcpdump at the moment of such drops doesn't 
>>> show
>>> anything suspicious.
>>> Changed qdisc from default fifo_fast to bfifo, without any result.
>>> Tried:  ethtool -K eth0 tso off gso off gro off sg off , no result
>>> Problem occured at 3.3.6 - 3.4.0-rc7, most probably 3.3.0 also, but 
>>> i
>>> don't remember for sure. I thik on some kernels like 3.1 probably 
>>> it
>>> doesn't occur, i will check it soon, because it is not always 
>>> reliable
>>> to reproduce it. All tests i did on 3.4.0-rc7.
>>>
>>> I did run also in background tcpdump, additionally iptables with
>>> timestamps, and at time when stall occured, seems i am still 
>>> receiving
>>> packets properly, also on iperf udp  (from some host to this 
>>> SunFire)
>>> at this moments no packets missing. But i am sure RX interface 
>>> errors
>>> are increasing.
>>> If i do iperf from SunFire to test host - there is packetloss at
>>> moments when stall occured.
>>>
>>> I suspect that by some reason network card stop to transmit, but
>>> unable to pinpoint issue. All other hosts in this network are fine 
>>> and
>>> don't have such problems.
>>> Can you help me with that please? Maybe i can provide more debug
>>> information, compile with patches and etc. Also i will try to 
>>> fallback
>>> to 3.1 and 3.0 kernels.
>>>
>>> Here it is how it occurs and i am reproducing it:
>>> I'm just opening file, and start to scroll it in mc, then in 
>>> another
>>> console i run ping
>>> [1337089061.844167] 1480 bytes from 194.146.153.20: icmp_req=162
>>> ttl=64 time=0.485 ms
>>> [1337089061.944138] 1480 bytes from 194.146.153.20: icmp_req=163
>>> ttl=64 time=0.470 ms
>>> [1337089062.467759] 1480 bytes from 194.146.153.20: icmp_req=164
>>> ttl=64 time=424 ms
>>> [1337089062.467899] 1480 bytes from 194.146.153.20: icmp_req=165
>>> ttl=64 time=324 ms
>>> [1337089062.468058] 1480 bytes from 194.146.153.20: icmp_req=166
>>> ttl=64 time=214 ms
>>> [1337089062.468161] 1480 bytes from 194.146.153.20: icmp_req=167
>>> ttl=64 time=104 ms
>>> [1337089062.468958] 1480 bytes from 194.146.153.20: icmp_req=168
>>> ttl=64 time=1.15 ms
>>> [1337089062.568604] 1480 bytes from 194.146.153.20: icmp_req=169
>>> ttl=64 time=0.477 ms
>>> [1337089062.668909] 1480 bytes from 194.146.153.20: icmp_req=170
>>> ttl=64 time=0.667 ms
>>>
>>> Remote host tcpdump:
>>> 1337089061.934737 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 163, length 1480
>>> 1337089062.458360 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 164, length 1480
>>> 1337089062.458380 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 164, length 1480
>>> 1337089062.458481 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 165, length 1480
>>> 1337089062.458502 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 165, length 1480
>>> 1337089062.458606 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 166, length 1480
>>> 1337089062.458623 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 166, length 1480
>>> 1337089062.458729 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 167, length 1480
>>> 1337089062.458745 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 167, length 1480
>>> 1337089062.459537 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 168, length 1480
>>> 1337089062.459545 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 168, length 1480
>>>
>>> Local host(SunFire) tcpdump:
>>> 1337089061.844140 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 162, length 1480
>>> 1337089061.943661 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 163, length 1480
>>> 1337089061.944124 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 163, length 1480
>>> 1337089062.465622 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 164, length 1480
>>> 1337089062.465630 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 165, length 1480
>>> 1337089062.465632 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 166, length 1480
>>> 1337089062.465634 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 167, length 1480
>>> 1337089062.467730 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 164, length 1480
>>> 1337089062.467785 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 168, length 1480
>>> 1337089062.467884 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 165, length 1480
>>> 1337089062.468035 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 166, length 1480
>>> 1337089062.468129 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 167, length 1480
>>> 1337089062.468928 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 168, length 1480
>>> 1337089062.568112 IP 194.146.153.22 > 194.146.153.20: ICMP echo
>>> request, id 3486, seq 169, length 1480
>>> 1337089062.568578 IP 194.146.153.20 > 194.146.153.22: ICMP echo
>>> reply, id 3486, seq 169, length 1480
>>>
>>> lspci -t
>>> centaur src # lspci -t
>>> -[0000:00]-+-00.0
>>>            +-02.0-[01-05]--+-00.0-[02-04]--+-00.0-[03]--
>>>            |               |               \-02.0-[04]--+-00.0
>>>            |               |                            \-00.1
>>>            |               \-00.3-[05]--
>>>            +-03.0-[06]--
>>>            +-04.0-[07]----00.0
>>>            +-05.0-[08]--
>>>            +-06.0-[09]--
>>>            +-07.0-[0a]--
>>>            +-08.0
>>>            +-10.0
>>>            +-10.1
>>>            +-10.2
>>>            +-11.0
>>>            +-13.0
>>>            +-15.0
>>>            +-16.0
>>>            +-1c.0-[0b]--+-00.0
>>>            |            \-00.1
>>>            +-1d.0
>>>            +-1d.1
>>>            +-1d.2
>>>            +-1d.3
>>>            +-1d.7
>>>            +-1e.0-[0c]----05.0
>>>            +-1f.0
>>>            +-1f.1
>>>            +-1f.2
>>>            \-1f.3
>>> lspci
>>> 00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory
>>> Controller Hub (rev b1)
>>> 00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>> Express
>>> x4 Port 2 (rev b1)
>>> 00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>> Express
>>> x4 Port 3 (rev b1)
>>> 00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>> Express
>>> x8 Port 4-5 (rev b1)
>>> 00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>> Express
>>> x4 Port 5 (rev b1)
>>> 00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>> Express
>>> x8 Port 6-7 (rev b1)
>>> 00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI 
>>> Express
>>> x4 Port 7 (rev b1)
>>> 00:08.0 System peripheral: Intel Corporation 5000 Series Chipset 
>>> DMA
>>> Engine (rev b1)
>>> 00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>> Registers (rev b1)
>>> 00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>> Registers (rev b1)
>>> 00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB
>>> Registers (rev b1)
>>> 00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
>>> Registers (rev b1)
>>> 00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved
>>> Registers (rev b1)
>>> 00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
>>> Registers (rev b1)
>>> 00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD
>>> Registers (rev b1)
>>> 00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
>>> PCI Express Root Port 1 (rev 09)
>>> 00:1d.0 USB controller: Intel Corporation 631xESB/632xESB/3100
>>> Chipset UHCI USB Controller #1 (rev 09)
>>> 00:1d.1 USB controller: Intel Corporation 631xESB/632xESB/3100
>>> Chipset UHCI USB Controller #2 (rev 09)
>>> 00:1d.2 USB controller: Intel Corporation 631xESB/632xESB/3100
>>> Chipset UHCI USB Controller #3 (rev 09)
>>> 00:1d.3 USB controller: Intel Corporation 631xESB/632xESB/3100
>>> Chipset UHCI USB Controller #4 (rev 09)
>>> 00:1d.7 USB controller: Intel Corporation 631xESB/632xESB/3100
>>> Chipset EHCI USB2 Controller (rev 09)
>>> 00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
>>> 00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset
>>> LPC Interface Controller (rev 09)
>>> 00:1f.1 IDE interface: Intel Corporation 631xESB/632xESB IDE
>>> Controller (rev 09)
>>> 00:1f.2 SATA controller: Intel Corporation 631xESB/632xESB SATA 
>>> AHCI
>>> Controller (rev 09)
>>> 00:1f.3 SMBus: Intel Corporation 631xESB/632xESB/3100 Chipset SMBus
>>> Controller (rev 09)
>>> 01:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>> Upstream Port (rev 01)
>>> 01:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express 
>>> to
>>> PCI-X Bridge (rev 01)
>>> 02:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>> Downstream Port E1 (rev 01)
>>> 02:02.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express
>>> Downstream Port E3 (rev 01)
>>> 04:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller (Copper) (rev 01)
>>> 04:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
>>> Ethernet Controller (Copper) (rev 01)
>>> 07:00.0 RAID bus controller: Adaptec AAC-RAID (rev 09)
>>> 0b:00.0 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (rev 06)
>>> 0b:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit
>>> Ethernet Controller (rev 06)
>>> 0c:05.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED
>>> Graphics Family
>>>
>>>
>>> dmesg:
>>> [    4.936885] e1000: Intel(R) PRO/1000 Network Driver - version
>>> 7.3.21-k8-NAPI
>>> [    4.936887] e1000: Copyright (c) 1999-2006 Intel Corporation.
>>> [    4.936966] e1000e: Intel(R) PRO/1000 Network Driver - 1.9.5-k
>>> [    4.936967] e1000e: Copyright(c) 1999 - 2012 Intel Corporation.
>>> [    4.938529] e1000e 0000:04:00.0: (unregistered net_device):
>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative 
>>> mode
>>> [    4.939598] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>> [    4.992246] e1000e 0000:04:00.0: eth0: (PCI 
>>> Express:2.5GT/s:Width
>>> x4) 00:1e:68:04:99:f8
>>> [    4.992657] e1000e 0000:04:00.0: eth0: Intel(R) PRO/1000 Network
>>> Connection
>>> [    4.992964] e1000e 0000:04:00.0: eth0: MAC: 5, PHY: 5, PBA No: 
>>> FFFFFF-0FF
>>> [    4.994745] e1000e 0000:04:00.1: (unregistered net_device):
>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative 
>>> mode
>>> [    4.996233] e1000e 0000:04:00.1: irq 66 for MSI/MSI-X
>>> [    5.050901] e1000e 0000:04:00.1: eth1: (PCI 
>>> Express:2.5GT/s:Width
>>> x4) 00:1e:68:04:99:f9
>>> [    5.051317] e1000e 0000:04:00.1: eth1: Intel(R) PRO/1000 Network
>>> Connection
>>> [    5.051623] e1000e 0000:04:00.1: eth1: MAC: 5, PHY: 5, PBA No: 
>>> FFFFFF-0FF
>>> [    5.051857] e1000e 0000:0b:00.0: Disabling ASPM  L1
>>> [    5.052168] e1000e 0000:0b:00.0: (unregistered net_device):
>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative 
>>> mode
>>> [    5.052611] e1000e 0000:0b:00.0: irq 67 for MSI/MSI-X
>>> [    5.223454] e1000e 0000:0b:00.0: eth2: (PCI 
>>> Express:2.5GT/s:Width
>>> x4) 00:1e:68:04:99:fa
>>> [    5.223864] e1000e 0000:0b:00.0: eth2: Intel(R) PRO/1000 Network
>>> Connection
>>> [    5.224178] e1000e 0000:0b:00.0: eth2: MAC: 0, PHY: 4, PBA No: 
>>> C83246-002
>>> [    5.224412] e1000e 0000:0b:00.1: Disabling ASPM  L1
>>> [    5.224709] e1000e 0000:0b:00.1: (unregistered net_device):
>>> Interrupt Throttling Rate (ints/sec) set to dynamic conservative 
>>> mode
>>> [    5.225168] e1000e 0000:0b:00.1: irq 68 for MSI/MSI-X
>>> [    5.397603] e1000e 0000:0b:00.1: eth3: (PCI 
>>> Express:2.5GT/s:Width
>>> x4) 00:1e:68:04:99:fb
>>> [    5.398021] e1000e 0000:0b:00.1: eth3: Intel(R) PRO/1000 Network
>>> Connection
>>> [    5.398336] e1000e 0000:0b:00.1: eth3: MAC: 0, PHY: 4, PBA No: 
>>> C83246-002
>>> [   13.859817] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>> [   13.962309] e1000e 0000:04:00.0: irq 65 for MSI/MSI-X
>>> [   17.150392] e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex,
>>> Flow Control: None
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe netdev" 
>>> in
>>> the body of a message to majordomo@vger.kernel.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>> ---
>> Network engineer
>> Denys Fedoryshchenko
>>
>> Dora Highway - Center Cebaco - 2nd Floor
>> Beirut, Lebanon
>> Tel:	+961 1 247373
>> E-Mail: denys@visp.net.lb
>
> ---
> Network engineer
> Denys Fedoryshchenko
>
> Dora Highway - Center Cebaco - 2nd Floor
> Beirut, Lebanon
> Tel:	+961 1 247373
> E-Mail: denys@visp.net.lb

---
Network engineer
Denys Fedoryshchenko

Dora Highway - Center Cebaco - 2nd Floor
Beirut, Lebanon
Tel:	+961 1 247373
E-Mail: denys@visp.net.lb

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired

^ permalink raw reply

* e1000 rx emulation bug (Was: [PATCH] e1000: Reset rx ring index on receive overrun)
From: Samuel Thibault @ 2012-05-18 13:51 UTC (permalink / raw)
  To: Brandeburg, Jesse, qemu-devel, dlaor
  Cc: Jiri Pirko, e1000-devel@lists.sourceforge.net, Dean Nelson,
	Allan, Bruce W, linux-kernel@vger.kernel.org, David S. Miller,
	Ronciak, John, netdev@vger.kernel.org
In-Reply-To: <alpine.WNT.2.00.1205171718160.7900@jbrandeb-mobl2.amr.corp.intel.com>

Hello,

There seems to be a bug in qemu in the e1000 emulation, which triggers
an issue with the Linux driver.

What happens in Linux is the following:

- e1000_open 
  - e1000_configure
    - e1000_setup_rctl
      - enables E1000_RCTL_EN
    - e1000_configure_rx
      - sets RDT/RDH to 0
    - alloc_rx_buf
      - pushes buffers to the ring

With bad luck, or under heavy traffic of small packets, what is observed
is that between setting RDT/RDH and pushing buffers, the ring fills up in
qemu. Here is what happens on the qemu side:

- e1000_receive
  - e1000_has_rxbufs
    - total_size <= s->rxbuf_size (because it's small)
      return s->mac_reg[RDH] != s->mac_reg[RDT] || !s->check_rxov;

Although RDH == RDT == 0, it returns 1: since RDT/RDH have
just been set to 0, set_rdt has cleared check_rxov. e1000_receive
thus believes there is room, and proceeds with filling the ring.
Unfortunately, since no buffer was pushed, desc.buffer_addr is NULL, and
thus the do loop skips all these null rx descriptors of the ring, while
marking each of them with E1000_RXD_STAT_DD, and eventually wraps
around.  From then on, since check_rxov has been set by the do loop,
nothing more is pushed until the Linux driver pushes buffers to the
ring. qemu can then fill some descriptors, and Linux reads them, but
since the whole ring was marked with E1000_RXD_STAT_DD, Linux goes on
reading, and thus gets completely desynchronized with the device.

That raises two questions:

- what is the role of the check_rxov flag?  Is hardware really allowed
  to push in some cases, even when RDH==RDT?  Removing it makes things
  work just fine.
- BTW, when skipping a descriptor because of NULL address, does
  E1000_RXD_STAT_DD have to be set?

Samuel
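
The condition discussed above can be modelled in a few lines. This is a
toy model, not QEMU source: the struct and function names are
hypothetical, and only the boolean expression in the first function is
taken from the message.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for the three pieces of device state involved */
struct rx_ring_model {
	uint32_t rdh;        /* receive descriptor head */
	uint32_t rdt;        /* receive descriptor tail */
	bool     check_rxov; /* cleared by set_rdt, set by the receive loop */
};

/* Small-packet path as quoted: room is reported when head != tail,
 * or whenever check_rxov is clear. */
static bool model_has_rxbufs(const struct rx_ring_model *s)
{
	return s->rdh != s->rdt || !s->check_rxov;
}

/* Variant with check_rxov removed, the change reported to work */
static bool model_has_rxbufs_strict(const struct rx_ring_model *s)
{
	return s->rdh != s->rdt;
}
```

With rdh == rdt == 0 and check_rxov just cleared, the first function
reports room in an empty ring while the strict variant does not, which
matches the window between e1000_configure_rx and alloc_rx_buf
described in the message.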


^ permalink raw reply

* Regarding: Bug 43242 - Kconfig: Unable to disable Broadcom and Chelsio devices
From: Paul Menzel @ 2012-05-18 11:37 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 250 bytes --]

Dear netdev folks,


I created the ticket for bug #43242 [1]. Is that a Kconfig problem or a
problem with the Kconfig files in the Broadcom and Chelsio directories?


Thanks,

Paul


[1] https://bugzilla.kernel.org/show_bug.cgi?id=43242

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* Re: Kernel consistently panicing on br_parse_ip_options
From: Massimo Cetra @ 2012-05-18 10:40 UTC (permalink / raw)
  To: linux-kernel, netdev; +Cc: Eric Dumazet, David Miller
In-Reply-To: <4FB0FC82.70003@navynet.it>

[-- Attachment #1: Type: text/plain, Size: 939 bytes --]

On 14/05/2012 14:37, Massimo Cetra wrote:
> Hello,
>
> I had already filed similar panics a month ago.
> Today i upgraded to 3.2.16 and nothing seems to be changed (and i don't
> see anything related in .17).
>
> The server (a Dell R410 with a couple of bnx2 ethernet cards) has two
> bridges onboard.
>
> Each bridge is connected to a different switch and has 2 uses:
> - one bridge is connecting an internal network and the KVM hosts that
> run on the same machine
> - one bridge connects the server to the public network along with
> another bunch of kvm servers whose interfaces bridges
>
> The bug can be easily triggered adding or removing (with heartbeat) a
> virtual address (br0:1, for example) .
>
> Is there any known fix or patch ?

Attached is another panic that seems consistent with the previous one.
The crash seems consistent and reproducible. Is anyone looking into
this?

Thanks
  Massimo


[-- Attachment #2: panic2.txt --]
[-- Type: text/plain, Size: 14705 bytes --]

May 18 11:24:05 172.30.1.2 [335035.342152] BUG: unable to handle kernel 
May 18 11:24:05 172.30.1.2 [335035.357985] IP:
May 18 11:24:05 172.30.1.2 [335035.372242] PGD 0 
May 18 11:24:05 172.30.1.2 
May 18 11:24:05 172.30.1.2 [335035.376433] Oops: 0000 [#1] 
May 18 11:24:05 172.30.1.2 SMP
May 18 11:24:05 172.30.1.2 
May 18 11:24:05 172.30.1.2 [335035.383065] CPU 0 
May 18 11:24:05 172.30.1.2 
May 18 11:24:05 172.30.1.2 [335035.386893] Modules linked in:
May 18 11:24:05 172.30.1.2 
May 18 11:24:05 172.30.1.2 [335035.526015] 
May 18 11:24:05 172.30.1.2 [335035.529151] Pid: 4321, comm: kvm Not tainted 3.2.0-2-amd64 #1
May 18 11:24:05 172.30.1.2 /0N051F
May 18 11:24:05 172.30.1.2 
May 18 11:24:05 172.30.1.2 [335035.546372] RIP: 0010:[<ffffffffa0294336>] 
May 18 11:24:05 172.30.1.2 [335035.565482] RSP: 0018:ffff88042fc03b18  EFLAGS: 00010293
May 18 11:24:05 172.30.1.2 [335035.576265] RAX: 0000000000000000 RBX: ffff8803b5395dc0 RCX: 0000000000000007
May 18 11:24:05 172.30.1.2 [335035.590705] RDX: ffffffffa0294308 RSI: 0000000104fdde80 RDI: ffff8803b5395dc0
May 18 11:24:05 172.30.1.2 [335035.605145] RBP: ffff8804261ba000 R08: 0000000000000000 R09: ffff88042fc03ad0
May 18 11:24:05 172.30.1.2 [335035.619585] R10: ffffffff8165aac0 R11: ffffffff8165aac0 R12: 0000000000000000
May 18 11:24:05 172.30.1.2 [335035.634022] R13: ffff880425158002 R14: ffff880425c21f00 R15: ffff880425158000
May 18 11:24:05 172.30.1.2 [335035.648462] FS:  00007f6793d51900(0000) GS:ffff88042fc00000(0000) knlGS:0000000000000000
May 18 11:24:05 172.30.1.2 [335035.664808] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
May 18 11:24:05 172.30.1.2 [335035.676457] CR2: 0000000000000018 CR3: 0000000211fcf000 CR4: 00000000000026e0
May 18 11:24:05 172.30.1.2 [335035.690897] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
May 18 11:24:05 172.30.1.2 [335035.705336] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
May 18 11:24:05 172.30.1.2 [335035.719776] Process kvm (pid: 4321, threadinfo ffff880211b98000, task ffff88022594c870)
May 18 11:24:05 172.30.1.2 [335035.735948] Stack:
May 18 11:24:05 172.30.1.2 [335035.740140] ffffffff80000000
May 18 11:24:05 172.30.1.2 
May 18 11:24:05 172.30.1.2 [335035.755147] ffff8802279f7000
May 18 11:24:06 172.30.1.2 
May 18 11:24:06 172.30.1.2 [335035.770154] ffff8803b5395dc0
May 18 11:24:06 172.30.1.2 
May 18 11:24:06 172.30.1.2 [335035.785170] Call Trace:
May 18 11:24:06 172.30.1.2 [335035.790229] <IRQ> 
May 18 11:24:06 172.30.1.2 
May 18 11:24:06 172.30.1.2 [335035.794611] [<ffffffffa02946db>] ? br_parse_ip_options+0x3d/0x19a [bridge]
May 18 11:24:06 172.30.1.2 [335035.808693] [<ffffffffa0294a67>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
May 18 11:24:06 172.30.1.2 [335035.822428] [<ffffffff812ac039>] ? nf_iterate+0x41/0x77
May 18 11:24:06 172.30.1.2 [335035.833213] [<ffffffffa028f918>] ? __br_deliver+0xa0/0xa0 [bridge]
May 18 11:24:06 172.30.1.2 [335035.845906] [<ffffffffa028f918>] ? __br_deliver+0xa0/0xa0 [bridge]
May 18 11:24:06 172.30.1.2 [335035.858597] [<ffffffff812ac0d7>] ? nf_hook_slow+0x68/0x101
May 18 11:24:06 172.30.1.2 [335035.869902] [<ffffffffa028f918>] ? __br_deliver+0xa0/0xa0 [bridge]
May 18 11:24:06 172.30.1.2 [335035.882594] [<ffffffffa029037a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]
May 18 11:24:06 172.30.1.2 [335035.896500] [<ffffffffa028f918>] ? __br_deliver+0xa0/0xa0 [bridge]
May 18 11:24:06 172.30.1.2 [335035.909193] [<ffffffffa028f85e>] ? NF_HOOK.constprop.8+0x3c/0x56 [bridge]
May 18 11:24:06 172.30.1.2 [335035.923099] [<ffffffffa028f9f2>] ? br_forward+0x16/0x5a [bridge]
May 18 11:24:06 172.30.1.2 [335035.935446] [<ffffffffa029051b>] ? br_handle_frame_finish+0x1a1/0x20f [bridge]
May 18 11:24:06 172.30.1.2 [335035.950235] [<ffffffffa02945ff>] ? br_nf_pre_routing_finish+0x1d0/0x1dd [bridge]
May 18 11:24:06 172.30.1.2 [335035.965371] [<ffffffffa0293ff0>] ? NF_HOOK_THRESH+0x3b/0x55 [bridge]
May 18 11:24:06 172.30.1.2 [335035.978410] [<ffffffffa0294f58>] ? br_nf_pre_routing+0x3e8/0x3f5 [bridge]
May 18 11:24:06 172.30.1.2 [335035.992315] [<ffffffff812ac039>] ? nf_iterate+0x41/0x77
May 18 11:24:06 172.30.1.2 [335036.003100] [<ffffffffa029037a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]
May 18 11:24:06 172.30.1.2 [335036.017004] [<ffffffff812ac0d7>] ? nf_hook_slow+0x68/0x101
May 18 11:24:06 172.30.1.2 [335036.028309] [<ffffffffa029037a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]
May 18 11:24:06 172.30.1.2 [335036.042216] [<ffffffffa029037a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]
May 18 11:24:06 172.30.1.2 [335036.056123] [<ffffffffa0290360>] ? NF_HOOK.constprop.4+0x3c/0x56 [bridge]
May 18 11:24:06 172.30.1.2 [335036.070031] [<ffffffffa029073c>] ? br_handle_frame+0x1b3/0x1cb [bridge]
May 18 11:24:06 172.30.1.2 [335036.083591] [<ffffffffa0290589>] ? br_handle_frame_finish+0x20f/0x20f [bridge]
May 18 11:24:06 172.30.1.2 [335036.098382] [<ffffffff812892c0>] ? __netif_receive_skb+0x324/0x41f
May 18 11:24:06 172.30.1.2 [335036.111071] [<ffffffff81289427>] ? process_backlog+0x6c/0x123
May 18 11:24:06 172.30.1.2 [335036.122896] [<ffffffff8128b30d>] ? net_rx_action+0xa1/0x1af
May 18 11:24:06 172.30.1.2 [335036.134374] [<ffffffff8104b298>] ? __local_bh_enable+0x40/0x77
May 18 11:24:06 172.30.1.2 [335036.146371] [<ffffffff8104be30>] ? __do_softirq+0xb9/0x177
May 18 11:24:06 172.30.1.2 [335036.157678] [<ffffffff8135046c>] ? call_softirq+0x1c/0x30
May 18 11:24:06 172.30.1.2 [335036.168805] <EOI> 
May 18 11:24:06 172.30.1.2 
May 18 11:24:06 172.30.1.2 [335036.173187] [<ffffffff8100f8e5>] ? do_softirq+0x3c/0x7b
May 18 11:24:06 172.30.1.2 [335036.183969] [<ffffffff8128b5fd>] ? netif_rx_ni+0x1e/0x27
May 18 11:24:06 172.30.1.2 [335036.194927] [<ffffffffa02a1721>] ? tun_get_user+0x39a/0x3c2 [tun]
May 18 11:24:06 172.30.1.2 [335036.207445] [<ffffffffa02a1a66>] ? tun_chr_poll+0xcd/0xcd [tun]
May 18 11:24:06 172.30.1.2 [335036.219615] [<ffffffffa02a1ac4>] ? tun_chr_aio_write+0x5e/0x79 [tun]
May 18 11:24:06 172.30.1.2 [335036.232656] [<ffffffff810f9594>] ? do_sync_readv_writev+0x9a/0xd7
May 18 11:24:06 172.30.1.2 [335036.245174] [<ffffffff810363c7>] ? should_resched+0x5/0x23
May 18 11:24:06 172.30.1.2 [335036.256478] [<ffffffff810f8c16>] ? do_sync_read+0xab/0xe3
May 18 11:24:06 172.30.1.2 [335036.267608] [<ffffffff810363c7>] ? should_resched+0x5/0x23
May 18 11:24:06 172.30.1.2 [335036.278915] [<ffffffff811626a1>] ? security_file_permission+0x16/0x2d
May 18 11:24:06 172.30.1.2 [335036.292124] [<ffffffff810f97f8>] ? do_readv_writev+0xaf/0x11c
May 18 11:24:06 172.30.1.2 [335036.303950] [<ffffffff8112ab7e>] ? eventfd_ctx_read+0x162/0x174
May 18 11:24:06 172.30.1.2 [335036.316124] [<ffffffff8103f3ff>] ? try_to_wake_up+0x197/0x197
May 18 11:24:06 172.30.1.2 [335036.327949] [<ffffffff810f99cd>] ? sys_writev+0x45/0x90
May 18 11:24:06 172.30.1.2 [335036.338733] [<ffffffff8134e212>] ? system_call_fastpath+0x16/0x1b
May 18 11:24:06 172.30.1.2 [335036.351249] Code: 53 48 89 fb 48 83 ec 10 66 81 7f 7e 08 06 4c 8b a7 98 00 00 00 74 3d e8 07 fe ff ff 66 3d 08 06 75 09 83 3d 98 6a 00 00 00 75 29
May 18 11:24:06 172.30.1.2 
May 18 11:24:06 172.30.1.2 f6 44 24 18 01 49 8b 6c 24 08 74 12 8a 43 7d 83 e0 f8 83 c8
May 18 11:24:06 172.30.1.2 [335036.390118] RIP 
May 18 11:24:06 172.30.1.2 [335036.404573] RSP <ffff88042fc03b18>
May 18 11:24:06 172.30.1.2 [335036.411711] CR2: 0000000000000018
May 18 11:24:06 172.30.1.2 [335036.418997] ---[ end trace 3272a02392487fe9 ]---
May 18 11:24:06 172.30.1.2 [335036.428496] Kernel panic - not syncing: Fatal exception in interrupt
May 18 11:24:06 172.30.1.2 [335036.441519] Pid: 4321, comm: kvm Tainted: G      D      3.2.0-2-amd64 #1
May 18 11:24:06 172.30.1.2 [335036.455226] Call Trace:
May 18 11:24:06 172.30.1.2 [335036.460431] <IRQ> 
May 18 11:24:06 172.30.1.2 [335036.472087] [<ffffffff8134a086>] ? oops_end+0xa9/0xb6
May 18 11:24:06 172.30.1.2 [335036.482697] [<ffffffff81342487>] ? no_context+0x1ff/0x20e
May 18 11:24:06 172.30.1.2 [335036.494029] [<ffffffff810e9c30>] ? virt_to_slab+0x6/0x16
May 18 11:24:06 172.30.1.2 [335036.505173] [<ffffffff8134c099>] ? do_page_fault+0x1a8/0x337
May 18 11:24:06 172.30.1.2 [335036.517001] [<ffffffffa0398f06>] ? ip_vs_conn_put+0x28/0x32 [ip_vs]
May 18 11:24:06 172.30.1.2 [335036.530050] [<ffffffffa039b0e0>] ? ip_vs_out+0x2bd/0x432 [ip_vs]
May 18 11:24:06 172.30.1.2 [335036.542621] [<ffffffff812ac0d7>] ? nf_hook_slow+0x68/0x101
May 18 11:24:06 172.30.1.2 [335036.554228] [<ffffffff813497f5>] ? page_fault+0x25/0x30
May 18 11:24:06 172.30.1.2 [335036.565318] [<ffffffffa0294308>] ? nf_bridge_update_protocol+0x20/0x20 [bridge]
May 18 11:24:06 172.30.1.2 [335036.580577] [<ffffffffa0294336>] ? br_nf_forward_finish+0x2e/0x95 [bridge]
May 18 11:24:06 172.30.1.2 [335036.594754] [<ffffffffa0294327>] ? br_nf_forward_finish+0x1f/0x95 [bridge]
May 18 11:24:06 172.30.1.2 [335036.608979] [<ffffffffa02946db>] ? br_parse_ip_options+0x3d/0x19a [bridge]
May 18 11:24:06 172.30.1.2 [335036.623296] [<ffffffffa0294a67>] ? br_nf_forward_ip+0x1c0/0x1d4 [bridge]
May 18 11:24:06 172.30.1.2 [335036.637308] [<ffffffff812ac039>] ? nf_iterate+0x41/0x77
May 18 11:24:06 172.30.1.2 [335036.648326] [<ffffffffa028f918>] ? __br_deliver+0xa0/0xa0 [bridge]
May 18 11:24:06 172.30.1.2 [335036.661291] [<ffffffffa028f918>] ? __br_deliver+0xa0/0xa0 [bridge]
May 18 11:24:06 172.30.1.2 [335036.674283] [<ffffffff812ac0d7>] ? nf_hook_slow+0x68/0x101
May 18 11:24:06 172.30.1.2 [335036.685892] [<ffffffffa028f918>] ? __br_deliver+0xa0/0xa0 [bridge]
May 18 11:24:06 172.30.1.2 [335036.698883] [<ffffffffa029037a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]
May 18 11:24:06 172.30.1.2 [335036.713091] [<ffffffffa028f918>] ? __br_deliver+0xa0/0xa0 [bridge]
May 18 11:24:06 172.30.1.2 [335036.726088] [<ffffffffa028f85e>] ? NF_HOOK.constprop.8+0x3c/0x56 [bridge]
May 18 11:24:06 172.30.1.2 [335036.740298] [<ffffffffa028f9f2>] ? br_forward+0x16/0x5a [bridge]
May 18 11:24:07 172.30.1.2 [335036.752942] [<ffffffffa029051b>] ? br_handle_frame_finish+0x1a1/0x20f [bridge]
May 18 11:24:07 172.30.1.2 [335036.767996] [<ffffffffa02945ff>] ? br_nf_pre_routing_finish+0x1d0/0x1dd [bridge]
May 18 11:24:07 172.30.1.2 [335036.783410] [<ffffffffa0293ff0>] ? NF_HOOK_THRESH+0x3b/0x55 [bridge]
May 18 11:24:07 172.30.1.2 [335036.796757] [<ffffffffa0294f58>] ? br_nf_pre_routing+0x3e8/0x3f5 [bridge]
May 18 11:24:07 172.30.1.2 [335036.810959] [<ffffffff812ac039>] ? nf_iterate+0x41/0x77
May 18 11:24:07 172.30.1.2 [335036.822043] [<ffffffffa029037a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]
May 18 11:24:07 172.30.1.2 [335036.836248] [<ffffffff812ac0d7>] ? nf_hook_slow+0x68/0x101
May 18 11:24:07 172.30.1.2 [335036.847849] [<ffffffffa029037a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]
May 18 11:24:07 172.30.1.2 [335036.862053] [<ffffffffa029037a>] ? NF_HOOK.constprop.4+0x56/0x56 [bridge]
May 18 11:24:07 172.30.1.2 [335036.876264] [<ffffffffa0290360>] ? NF_HOOK.constprop.4+0x3c/0x56 [bridge]
May 18 11:24:07 172.30.1.2 [335036.890457] [<ffffffffa029073c>] ? br_handle_frame+0x1b3/0x1cb [bridge]
May 18 11:24:07 172.30.1.2 [335036.904319] [<ffffffffa0290589>] ? br_handle_frame_finish+0x20f/0x20f [bridge]
May 18 11:24:07 172.30.1.2 [335036.919409] [<ffffffff812892c0>] ? __netif_receive_skb+0x324/0x41f
May 18 11:24:07 172.30.1.2 [335036.932395] [<ffffffff81289427>] ? process_backlog+0x6c/0x123
May 18 11:24:07 172.30.1.2 [335036.944520] [<ffffffff8128b30d>] ? net_rx_action+0xa1/0x1af
May 18 11:24:07 172.30.1.2 [335036.956298] [<ffffffff8104b298>] ? __local_bh_enable+0x40/0x77
May 18 11:24:07 172.30.1.2 [335036.968594] [<ffffffff8104be30>] ? __do_softirq+0xb9/0x177
May 18 11:24:07 172.30.1.2 [335036.980197] [<ffffffff8135046c>] ? call_softirq+0x1c/0x30
May 18 11:24:07 172.30.1.2 [335036.991624] <EOI> 
May 18 11:24:07 172.30.1.2 [335037.004236] [<ffffffff8128b5fd>] ? netif_rx_ni+0x1e/0x27
May 18 11:24:07 172.30.1.2 [335037.015493] [<ffffffffa02a1721>] ? tun_get_user+0x39a/0x3c2 [tun]
May 18 11:24:07 172.30.1.2 [335037.028308] [<ffffffffa02a1a66>] ? tun_chr_poll+0xcd/0xcd [tun]
May 18 11:24:07 172.30.1.2 [335037.040765] [<ffffffffa02a1ac4>] ? tun_chr_aio_write+0x5e/0x79 [tun]
May 18 11:24:07 172.30.1.2 [335037.054111] [<ffffffff810f9594>] ? do_sync_readv_writev+0x9a/0xd7
May 18 11:24:07 172.30.1.2 [335037.066925] [<ffffffff810363c7>] ? should_resched+0x5/0x23
May 18 11:24:07 172.30.1.2 [335037.078534] [<ffffffff810f8c16>] ? do_sync_read+0xab/0xe3
May 18 11:24:07 172.30.1.2 [335037.089957] [<ffffffff810363c7>] ? should_resched+0x5/0x23
May 18 11:24:07 172.30.1.2 [335037.101560] [<ffffffff811626a1>] ? security_file_permission+0x16/0x2d
May 18 11:24:07 172.30.1.2 [335037.115072] [<ffffffff810f97f8>] ? do_readv_writev+0xaf/0x11c
May 18 11:24:07 172.30.1.2 [335037.127196] [<ffffffff8112ab7e>] ? eventfd_ctx_read+0x162/0x174
May 18 11:24:07 172.30.1.2 [335037.139673] [<ffffffff8103f3ff>] ? try_to_wake_up+0x197/0x197
May 18 11:24:07 172.30.1.2 [335037.151782] [<ffffffff810f99cd>] ? sys_writev+0x45/0x90
May 18 11:24:07 172.30.1.2 [335037.162870] [<ffffffff8134e212>] ? system_call_fastpath+0x16/0x1b

^ permalink raw reply

* [PATCH 07/10] RDMA/cxgb4: DB Drop Recovery for RDMA and LLD queues.
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma, netdev; +Cc: roland, davem, divy, dm, kumaras, swise, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul@chelsio.com>

Add module option db_fc_threshold, which is the count of active QPs
that triggers automatic db flow control mode.

Automatically transition to/from flow control mode when the active QP
count crosses db_fc_threshold.

Add more db debugfs stats.

On a DB DROP event from the LLD, recover all the iWARP queues.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/hw/cxgb4/device.c   |  176 ++++++++++++++++++++++++++++++-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   24 ++++-
 drivers/infiniband/hw/cxgb4/qp.c       |   47 ++++++++-
 drivers/infiniband/hw/cxgb4/t4.h       |   24 +++++
 4 files changed, 259 insertions(+), 12 deletions(-)
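
The state transitions described above can be sketched as a small
userspace model. This is only an illustration, not driver code: the
model_* names and struct model_dev are hypothetical, locking and the
per-QP doorbell enable/disable calls are left out, and the recovery
step shows only the success path.

```c
#include <assert.h>

/* Userspace model of the doorbell (db) state machine; not driver code. */
enum db_state { NORMAL, FLOW_CONTROL, RECOVERY };

struct model_dev {
	enum db_state db_state;
	int qpcnt;                      /* active QPs */
	int db_fc_threshold;            /* module option in the patch */
	unsigned long long transitions; /* db_state_transitions counter */
};

/* Mirrors the check added to c4iw_create_qp(). */
static void model_qp_create(struct model_dev *d)
{
	if (++d->qpcnt > d->db_fc_threshold && d->db_state == NORMAL) {
		d->transitions++;
		d->db_state = FLOW_CONTROL;
	}
}

/* Mirrors the check added to c4iw_destroy_qp(). */
static void model_qp_destroy(struct model_dev *d)
{
	d->qpcnt--;
	assert(d->qpcnt >= 0); /* stands in for BUG_ON() */
	if (d->qpcnt <= d->db_fc_threshold && d->db_state == FLOW_CONTROL) {
		d->transitions++;
		d->db_state = NORMAL;
	}
}

/* Mirrors stop_queues(): only NORMAL moves to FLOW_CONTROL. */
static void model_stop_queues(struct model_dev *d)
{
	if (d->db_state == NORMAL) {
		d->transitions++;
		d->db_state = FLOW_CONTROL;
	}
}

/* Mirrors resume_queues(): leave FLOW_CONTROL only at or below threshold. */
static void model_resume_queues(struct model_dev *d)
{
	if (d->qpcnt <= d->db_fc_threshold && d->db_state == FLOW_CONTROL) {
		d->db_state = NORMAL;
		d->transitions++;
	}
}

/* Mirrors the success path of recover_queues(): two counted transitions. */
static void model_recover_queues(struct model_dev *d)
{
	d->db_state = RECOVERY;
	d->transitions++;
	/* the driver drains the dbfifo and resyncs queue indices here */
	d->db_state = (d->qpcnt > d->db_fc_threshold) ? FLOW_CONTROL : NORMAL;
	d->transitions++;
}
```

Note that in this model, as in the patch, a resume request is ignored
while the QP count is still above db_fc_threshold, so the device stays
in flow control until enough QPs are destroyed.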

diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 9062ed9..bdb398f 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -246,6 +246,8 @@ static const struct file_operations stag_debugfs_fops = {
 	.llseek  = default_llseek,
 };
 
+static char *db_state_str[] = {"NORMAL", "FLOW_CONTROL", "RECOVERY"};
+
 static int stats_show(struct seq_file *seq, void *v)
 {
 	struct c4iw_dev *dev = seq->private;
@@ -272,6 +274,9 @@ static int stats_show(struct seq_file *seq, void *v)
 	seq_printf(seq, "  DB FULL: %10llu\n", dev->rdev.stats.db_full);
 	seq_printf(seq, " DB EMPTY: %10llu\n", dev->rdev.stats.db_empty);
 	seq_printf(seq, "  DB DROP: %10llu\n", dev->rdev.stats.db_drop);
+	seq_printf(seq, " DB State: %s Transitions %llu\n",
+		   db_state_str[dev->db_state],
+		   dev->rdev.stats.db_state_transitions);
 	return 0;
 }
 
@@ -295,6 +300,7 @@ static ssize_t stats_clear(struct file *file, const char __user *buf,
 	dev->rdev.stats.db_full = 0;
 	dev->rdev.stats.db_empty = 0;
 	dev->rdev.stats.db_drop = 0;
+	dev->rdev.stats.db_state_transitions = 0;
 	mutex_unlock(&dev->rdev.stats.lock);
 	return count;
 }
@@ -677,8 +683,11 @@ static int disable_qp_db(int id, void *p, void *data)
 static void stop_queues(struct uld_ctx *ctx)
 {
 	spin_lock_irq(&ctx->dev->lock);
-	ctx->dev->db_state = FLOW_CONTROL;
-	idr_for_each(&ctx->dev->qpidr, disable_qp_db, NULL);
+	if (ctx->dev->db_state == NORMAL) {
+		ctx->dev->rdev.stats.db_state_transitions++;
+		ctx->dev->db_state = FLOW_CONTROL;
+		idr_for_each(&ctx->dev->qpidr, disable_qp_db, NULL);
+	}
 	spin_unlock_irq(&ctx->dev->lock);
 }
 
@@ -693,9 +702,165 @@ static int enable_qp_db(int id, void *p, void *data)
 static void resume_queues(struct uld_ctx *ctx)
 {
 	spin_lock_irq(&ctx->dev->lock);
-	ctx->dev->db_state = NORMAL;
-	idr_for_each(&ctx->dev->qpidr, enable_qp_db, NULL);
+	if (ctx->dev->qpcnt <= db_fc_threshold &&
+	    ctx->dev->db_state == FLOW_CONTROL) {
+		ctx->dev->db_state = NORMAL;
+		ctx->dev->rdev.stats.db_state_transitions++;
+		idr_for_each(&ctx->dev->qpidr, enable_qp_db, NULL);
+	}
+	spin_unlock_irq(&ctx->dev->lock);
+}
+
+struct qp_list {
+	unsigned idx;
+	struct c4iw_qp **qps;
+};
+
+static int add_and_ref_qp(int id, void *p, void *data)
+{
+	struct qp_list *qp_listp = data;
+	struct c4iw_qp *qp = p;
+
+	c4iw_qp_add_ref(&qp->ibqp);
+	qp_listp->qps[qp_listp->idx++] = qp;
+	return 0;
+}
+
+static int count_qps(int id, void *p, void *data)
+{
+	unsigned *countp = data;
+	(*countp)++;
+	return 0;
+}
+
+static void deref_qps(struct qp_list qp_list)
+{
+	int idx;
+
+	for (idx = 0; idx < qp_list.idx; idx++)
+		c4iw_qp_rem_ref(&qp_list.qps[idx]->ibqp);
+}
+
+static void recover_lost_dbs(struct uld_ctx *ctx, struct qp_list *qp_list)
+{
+	int idx;
+	int ret;
+
+	for (idx = 0; idx < qp_list->idx; idx++) {
+		struct c4iw_qp *qp = qp_list->qps[idx];
+
+		ret = cxgb4_sync_txq_pidx(qp->rhp->rdev.lldi.ports[0],
+					  qp->wq.sq.qid,
+					  t4_sq_host_wq_pidx(&qp->wq),
+					  t4_sq_wq_size(&qp->wq));
+		if (ret) {
+			printk(KERN_ERR MOD "%s: Fatal error - "
+			       "DB overflow recovery failed - "
+			       "error syncing SQ qid %u\n",
+			       pci_name(ctx->lldi.pdev), qp->wq.sq.qid);
+			return;
+		}
+
+		ret = cxgb4_sync_txq_pidx(qp->rhp->rdev.lldi.ports[0],
+					  qp->wq.rq.qid,
+					  t4_rq_host_wq_pidx(&qp->wq),
+					  t4_rq_wq_size(&qp->wq));
+
+		if (ret) {
+			printk(KERN_ERR MOD "%s: Fatal error - "
+			       "DB overflow recovery failed - "
+			       "error syncing RQ qid %u\n",
+			       pci_name(ctx->lldi.pdev), qp->wq.rq.qid);
+			return;
+		}
+
+		/* Wait for the dbfifo to drain */
+		while (cxgb4_dbfifo_count(qp->rhp->rdev.lldi.ports[0], 1) > 0) {
+			set_current_state(TASK_UNINTERRUPTIBLE);
+			schedule_timeout(usecs_to_jiffies(10));
+		}
+	}
+}
+
+static void recover_queues(struct uld_ctx *ctx)
+{
+	int count = 0;
+	struct qp_list qp_list;
+	int ret;
+
+	/* lock out kernel db ringers */
+	mutex_lock(&ctx->dev->db_mutex);
+
+	/* put all queues in to recovery mode */
+	spin_lock_irq(&ctx->dev->lock);
+	ctx->dev->db_state = RECOVERY;
+	ctx->dev->rdev.stats.db_state_transitions++;
+	idr_for_each(&ctx->dev->qpidr, disable_qp_db, NULL);
+	spin_unlock_irq(&ctx->dev->lock);
+
+	/* slow everybody down */
+	set_current_state(TASK_UNINTERRUPTIBLE);
+	schedule_timeout(usecs_to_jiffies(1000));
+
+	/* Wait for the dbfifo to completely drain. */
+	while (cxgb4_dbfifo_count(ctx->dev->rdev.lldi.ports[0], 1) > 0) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(usecs_to_jiffies(10));
+	}
+
+	/* flush the SGE contexts */
+	ret = cxgb4_flush_eq_cache(ctx->dev->rdev.lldi.ports[0]);
+	if (ret) {
+		printk(KERN_ERR MOD "%s: Fatal error - DB overflow recovery failed\n",
+		       pci_name(ctx->lldi.pdev));
+		goto out;
+	}
+
+	/* Count active queues so we can build a list of queues to recover */
+	spin_lock_irq(&ctx->dev->lock);
+	idr_for_each(&ctx->dev->qpidr, count_qps, &count);
+
+	qp_list.qps = kzalloc(count * sizeof *qp_list.qps, GFP_ATOMIC);
+	if (!qp_list.qps) {
+		printk(KERN_ERR MOD "%s: Fatal error - DB overflow recovery failed\n",
+		       pci_name(ctx->lldi.pdev));
+		spin_unlock_irq(&ctx->dev->lock);
+		goto out;
+	}
+	qp_list.idx = 0;
+
+	/* add and ref each qp so it doesn't get freed */
+	idr_for_each(&ctx->dev->qpidr, add_and_ref_qp, &qp_list);
+
 	spin_unlock_irq(&ctx->dev->lock);
+
+	/* now traverse the list in a safe context to recover the db state*/
+	recover_lost_dbs(ctx, &qp_list);
+
+	/* we're almost done!  deref the qps and clean up */
+	deref_qps(qp_list);
+	kfree(qp_list.qps);
+
+	/* Wait for the dbfifo to completely drain again */
+	while (cxgb4_dbfifo_count(ctx->dev->rdev.lldi.ports[0], 1) > 0) {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(usecs_to_jiffies(10));
+	}
+
+	/* resume the queues */
+	spin_lock_irq(&ctx->dev->lock);
+	if (ctx->dev->qpcnt > db_fc_threshold)
+		ctx->dev->db_state = FLOW_CONTROL;
+	else {
+		ctx->dev->db_state = NORMAL;
+		idr_for_each(&ctx->dev->qpidr, enable_qp_db, NULL);
+	}
+	ctx->dev->rdev.stats.db_state_transitions++;
+	spin_unlock_irq(&ctx->dev->lock);
+
+out:
+	/* start up kernel db ringers again */
+	mutex_unlock(&ctx->dev->db_mutex);
 }
 
 static int c4iw_uld_control(void *handle, enum cxgb4_control control, ...)
@@ -716,8 +881,7 @@ static int c4iw_uld_control(void *handle, enum cxgb4_control control, ...)
 		mutex_unlock(&ctx->dev->rdev.stats.lock);
 		break;
 	case CXGB4_CONTROL_DB_DROP:
-		printk(KERN_WARNING MOD "%s: Fatal DB DROP\n",
-		       pci_name(ctx->lldi.pdev));
+		recover_queues(ctx);
 		mutex_lock(&ctx->dev->rdev.stats.lock);
 		ctx->dev->rdev.stats.db_drop++;
 		mutex_unlock(&ctx->dev->rdev.stats.lock);
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index e8b88a0..6818659 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -120,6 +120,7 @@ struct c4iw_stats {
 	u64  db_full;
 	u64  db_empty;
 	u64  db_drop;
+	u64  db_state_transitions;
 };
 
 struct c4iw_rdev {
@@ -212,6 +213,7 @@ struct c4iw_dev {
 	struct mutex db_mutex;
 	struct dentry *debugfs_root;
 	enum db_state db_state;
+	int qpcnt;
 };
 
 static inline struct c4iw_dev *to_c4iw_dev(struct ib_device *ibdev)
@@ -271,11 +273,25 @@ static inline int insert_handle_nolock(struct c4iw_dev *rhp, struct idr *idr,
 	return _insert_handle(rhp, idr, handle, id, 0);
 }
 
-static inline void remove_handle(struct c4iw_dev *rhp, struct idr *idr, u32 id)
+static inline void _remove_handle(struct c4iw_dev *rhp, struct idr *idr,
+				   u32 id, int lock)
 {
-	spin_lock_irq(&rhp->lock);
+	if (lock)
+		spin_lock_irq(&rhp->lock);
 	idr_remove(idr, id);
-	spin_unlock_irq(&rhp->lock);
+	if (lock)
+		spin_unlock_irq(&rhp->lock);
+}
+
+static inline void remove_handle(struct c4iw_dev *rhp, struct idr *idr, u32 id)
+{
+	_remove_handle(rhp, idr, id, 1);
+}
+
+static inline void remove_handle_nolock(struct c4iw_dev *rhp,
+					 struct idr *idr, u32 id)
+{
+	_remove_handle(rhp, idr, id, 0);
 }
 
 struct c4iw_pd {
@@ -843,5 +859,7 @@ void c4iw_ev_dispatch(struct c4iw_dev *dev, struct t4_cqe *err_cqe);
 extern struct cxgb4_client t4c_client;
 extern c4iw_handler_func c4iw_handlers[NUM_CPL_CMDS];
 extern int c4iw_max_read_depth;
+extern int db_fc_threshold;
+
 
 #endif
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index beec667..ba1343e 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -42,6 +42,11 @@ static int ocqp_support = 1;
 module_param(ocqp_support, int, 0644);
 MODULE_PARM_DESC(ocqp_support, "Support on-chip SQs (default=1)");
 
+int db_fc_threshold = 2000;
+module_param(db_fc_threshold, int, 0644);
+MODULE_PARM_DESC(db_fc_threshold, "QP count/threshold that triggers automatic "
+		 "db flow control mode (default = 2000)");
+
 static void set_state(struct c4iw_qp *qhp, enum c4iw_qp_state state)
 {
 	unsigned long flag;
@@ -1143,13 +1148,19 @@ static int ring_kernel_db(struct c4iw_qp *qhp, u32 qid, u16 inc)
 
 	mutex_lock(&qhp->rhp->db_mutex);
 	do {
-		if (cxgb4_dbfifo_count(qhp->rhp->rdev.lldi.ports[0], 1) < 768) {
+
+		/*
+		 * The interrupt threshold is dbfifo_int_thresh << 6. So
+		 * make sure we don't cross that and generate an interrupt.
+		 */
+		if (cxgb4_dbfifo_count(qhp->rhp->rdev.lldi.ports[0], 1) <
+		    (qhp->rhp->rdev.lldi.dbfifo_int_thresh << 5)) {
 			writel(V_QID(qid) | V_PIDX(inc), qhp->wq.db);
 			break;
 		}
 		set_current_state(TASK_UNINTERRUPTIBLE);
 		schedule_timeout(usecs_to_jiffies(delay));
-		delay = min(delay << 1, 200000);
+		delay = min(delay << 1, 2000);
 	} while (1);
 	mutex_unlock(&qhp->rhp->db_mutex);
 	return 0;
@@ -1388,6 +1399,14 @@ out:
 	return ret;
 }
 
+static int enable_qp_db(int id, void *p, void *data)
+{
+	struct c4iw_qp *qp = p;
+
+	t4_enable_wq_db(&qp->wq);
+	return 0;
+}
+
 int c4iw_destroy_qp(struct ib_qp *ib_qp)
 {
 	struct c4iw_dev *rhp;
@@ -1405,7 +1424,16 @@ int c4iw_destroy_qp(struct ib_qp *ib_qp)
 		c4iw_modify_qp(rhp, qhp, C4IW_QP_ATTR_NEXT_STATE, &attrs, 0);
 	wait_event(qhp->wait, !qhp->ep);
 
-	remove_handle(rhp, &rhp->qpidr, qhp->wq.sq.qid);
+	spin_lock_irq(&rhp->lock);
+	remove_handle_nolock(rhp, &rhp->qpidr, qhp->wq.sq.qid);
+	rhp->qpcnt--;
+	BUG_ON(rhp->qpcnt < 0);
+	if (rhp->qpcnt <= db_fc_threshold && rhp->db_state == FLOW_CONTROL) {
+		rhp->rdev.stats.db_state_transitions++;
+		rhp->db_state = NORMAL;
+		idr_for_each(&rhp->qpidr, enable_qp_db, NULL);
+	}
+	spin_unlock_irq(&rhp->lock);
 	atomic_dec(&qhp->refcnt);
 	wait_event(qhp->wait, !atomic_read(&qhp->refcnt));
 
@@ -1419,6 +1447,14 @@ int c4iw_destroy_qp(struct ib_qp *ib_qp)
 	return 0;
 }
 
+static int disable_qp_db(int id, void *p, void *data)
+{
+	struct c4iw_qp *qp = p;
+
+	t4_disable_wq_db(&qp->wq);
+	return 0;
+}
+
 struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 			     struct ib_udata *udata)
 {
@@ -1508,6 +1544,11 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 	spin_lock_irq(&rhp->lock);
 	if (rhp->db_state != NORMAL)
 		t4_disable_wq_db(&qhp->wq);
+	if (++rhp->qpcnt > db_fc_threshold && rhp->db_state == NORMAL) {
+		rhp->rdev.stats.db_state_transitions++;
+		rhp->db_state = FLOW_CONTROL;
+		idr_for_each(&rhp->qpidr, disable_qp_db, NULL);
+	}
 	ret = insert_handle_nolock(rhp, &rhp->qpidr, qhp, qhp->wq.sq.qid);
 	spin_unlock_irq(&rhp->lock);
 	if (ret)
diff --git a/drivers/infiniband/hw/cxgb4/t4.h b/drivers/infiniband/hw/cxgb4/t4.h
index c0221ee..16f26ab 100644
--- a/drivers/infiniband/hw/cxgb4/t4.h
+++ b/drivers/infiniband/hw/cxgb4/t4.h
@@ -62,6 +62,10 @@ struct t4_status_page {
 	__be16 pidx;
 	u8 qp_err;	/* flit 1 - sw owns */
 	u8 db_off;
+	u8 pad;
+	u16 host_wq_pidx;
+	u16 host_cidx;
+	u16 host_pidx;
 };
 
 #define T4_EQ_ENTRY_SIZE 64
@@ -375,6 +379,16 @@ static inline void t4_rq_consume(struct t4_wq *wq)
 		wq->rq.cidx = 0;
 }
 
+static inline u16 t4_rq_host_wq_pidx(struct t4_wq *wq)
+{
+	return wq->rq.queue[wq->rq.size].status.host_wq_pidx;
+}
+
+static inline u16 t4_rq_wq_size(struct t4_wq *wq)
+{
+		return wq->rq.size * T4_RQ_NUM_SLOTS;
+}
+
 static inline int t4_sq_onchip(struct t4_sq *sq)
 {
 	return sq->flags & T4_SQ_ONCHIP;
@@ -412,6 +426,16 @@ static inline void t4_sq_consume(struct t4_wq *wq)
 		wq->sq.cidx = 0;
 }
 
+static inline u16 t4_sq_host_wq_pidx(struct t4_wq *wq)
+{
+	return wq->sq.queue[wq->sq.size].status.host_wq_pidx;
+}
+
+static inline u16 t4_sq_wq_size(struct t4_wq *wq)
+{
+		return wq->sq.size * T4_SQ_NUM_SLOTS;
+}
+
 static inline void t4_ring_sq_db(struct t4_wq *wq, u16 inc)
 {
 	wmb();
-- 
1.7.1
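
For readers following the ring_kernel_db() change, here is a minimal
userspace sketch of the wait policy, assuming the fifo occupancy can
be sampled as a plain integer. The model_* helpers and the sample
array are hypothetical; the real code calls cxgb4_dbfifo_count() and
sleeps with schedule_timeout(). The starting delay is a parameter
here because it is not visible in this hunk.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * The patch comment says the interrupt threshold is
 * dbfifo_int_thresh << 6. The doorbell is rung only while the fifo
 * count is below dbfifo_int_thresh << 5, i.e. half of that.
 */
static bool model_can_ring_db(unsigned int fifo_count,
			      unsigned int dbfifo_int_thresh)
{
	return fifo_count < (dbfifo_int_thresh << 5);
}

/* Double the delay, capped at 2000 usecs (the old cap was 200000). */
static int model_next_delay(int delay_usecs)
{
	int doubled = delay_usecs << 1;

	return doubled < 2000 ? doubled : 2000;
}

/*
 * Walk a trace of fifo samples and return the total time slept before
 * the doorbell could be rung, or -1 if the trace ends first.
 */
static int model_total_wait(const unsigned int *samples, int n,
			    unsigned int dbfifo_int_thresh, int delay_usecs)
{
	int waited = 0;
	int i;

	assert(delay_usecs > 0);
	for (i = 0; i < n; i++) {
		if (model_can_ring_db(samples[i], dbfifo_int_thresh))
			return waited;
		waited += delay_usecs;
		delay_usecs = model_next_delay(delay_usecs);
	}
	return -1;
}
```

With the lower cap, a kernel doorbell ringer in this model never
sleeps more than 2 ms between checks of the fifo.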

^ permalink raw reply related

* [PATCH 04/10] RDMA/cxgb4: Add debugfs rdma memory stats
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma, netdev; +Cc: roland, davem, divy, dm, kumaras, swise, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul@chelsio.com>

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/hw/cxgb4/device.c   |   78 +++++++++++++++++++++++++++++++-
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   17 +++++++
 drivers/infiniband/hw/cxgb4/mem.c      |   11 ++++-
 drivers/infiniband/hw/cxgb4/provider.c |    8 +++
 drivers/infiniband/hw/cxgb4/resource.c |   44 ++++++++++++++++++
 5 files changed, 155 insertions(+), 3 deletions(-)
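
A minimal userspace sketch of how a total/cur/max counter like struct
c4iw_stat might be maintained. The model_* names are hypothetical,
the stats mutex is omitted, and the update rule in model_stat_add()
is an assumption, since the allocation paths are not part of this
excerpt. Only the clear behaviour (max reset to 0) and the row format
are taken from the hunks below.

```c
#include <stdio.h>

/* Same three fields as struct c4iw_stat, using plain C types. */
struct model_stat {
	unsigned long long total; /* capacity, set once at open time */
	unsigned long long cur;   /* currently allocated */
	unsigned long long max;   /* high-water mark of cur */
};

/* Assumed update rule: bump cur, then raise the high-water mark. */
static void model_stat_add(struct model_stat *s, unsigned long long n)
{
	s->cur += n;
	if (s->cur > s->max)
		s->max = s->cur;
}

/* Releasing a resource only lowers cur. */
static void model_stat_sub(struct model_stat *s, unsigned long long n)
{
	s->cur -= n;
}

/* Mirrors stats_clear(): a write to the debugfs file resets max. */
static void model_stat_clear(struct model_stat *s)
{
	s->max = 0;
}

/* Format one row the way stats_show() does, e.g. "     PDID: ...". */
static int model_stat_row(char *buf, size_t len, const char *name,
			  const struct model_stat *s)
{
	return snprintf(buf, len, "%9s: %10llu %10llu %10llu\n",
			name, s->total, s->cur, s->max);
}
```

In this model, clearing leaves max at 0 even when cur is nonzero, so
max is only meaningful again after the next allocation.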

diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 6d0df6e..8483111 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -240,6 +240,62 @@ static const struct file_operations stag_debugfs_fops = {
 	.llseek  = default_llseek,
 };
 
+static int stats_show(struct seq_file *seq, void *v)
+{
+	struct c4iw_dev *dev = seq->private;
+
+	seq_printf(seq, " Object: %10s %10s %10s\n", "Total", "Current", "Max");
+	seq_printf(seq, "     PDID: %10llu %10llu %10llu\n",
+			dev->rdev.stats.pd.total, dev->rdev.stats.pd.cur,
+			dev->rdev.stats.pd.max);
+	seq_printf(seq, "      QID: %10llu %10llu %10llu\n",
+			dev->rdev.stats.qid.total, dev->rdev.stats.qid.cur,
+			dev->rdev.stats.qid.max);
+	seq_printf(seq, "   TPTMEM: %10llu %10llu %10llu\n",
+			dev->rdev.stats.stag.total, dev->rdev.stats.stag.cur,
+			dev->rdev.stats.stag.max);
+	seq_printf(seq, "   PBLMEM: %10llu %10llu %10llu\n",
+			dev->rdev.stats.pbl.total, dev->rdev.stats.pbl.cur,
+			dev->rdev.stats.pbl.max);
+	seq_printf(seq, "   RQTMEM: %10llu %10llu %10llu\n",
+			dev->rdev.stats.rqt.total, dev->rdev.stats.rqt.cur,
+			dev->rdev.stats.rqt.max);
+	seq_printf(seq, "  OCQPMEM: %10llu %10llu %10llu\n",
+			dev->rdev.stats.ocqp.total, dev->rdev.stats.ocqp.cur,
+			dev->rdev.stats.ocqp.max);
+	return 0;
+}
+
+static int stats_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, stats_show, inode->i_private);
+}
+
+static ssize_t stats_clear(struct file *file, const char __user *buf,
+		size_t count, loff_t *pos)
+{
+	struct c4iw_dev *dev = ((struct seq_file *)file->private_data)->private;
+
+	mutex_lock(&dev->rdev.stats.lock);
+	dev->rdev.stats.pd.max = 0;
+	dev->rdev.stats.qid.max = 0;
+	dev->rdev.stats.stag.max = 0;
+	dev->rdev.stats.pbl.max = 0;
+	dev->rdev.stats.rqt.max = 0;
+	dev->rdev.stats.ocqp.max = 0;
+	mutex_unlock(&dev->rdev.stats.lock);
+	return count;
+}
+
+static const struct file_operations stats_debugfs_fops = {
+	.owner   = THIS_MODULE,
+	.open    = stats_open,
+	.release = single_release,
+	.read    = seq_read,
+	.llseek  = seq_lseek,
+	.write   = stats_clear,
+};
+
 static int setup_debugfs(struct c4iw_dev *devp)
 {
 	struct dentry *de;
@@ -256,6 +312,12 @@ static int setup_debugfs(struct c4iw_dev *devp)
 				 (void *)devp, &stag_debugfs_fops);
 	if (de && de->d_inode)
 		de->d_inode->i_size = 4096;
+
+	de = debugfs_create_file("stats", S_IWUSR, devp->debugfs_root,
+			(void *)devp, &stats_debugfs_fops);
+	if (de && de->d_inode)
+		de->d_inode->i_size = 4096;
+
 	return 0;
 }
 
@@ -269,9 +331,13 @@ void c4iw_release_dev_ucontext(struct c4iw_rdev *rdev,
 	list_for_each_safe(pos, nxt, &uctx->qpids) {
 		entry = list_entry(pos, struct c4iw_qid_list, entry);
 		list_del_init(&entry->entry);
-		if (!(entry->qid & rdev->qpmask))
+		if (!(entry->qid & rdev->qpmask)) {
 			c4iw_put_resource(&rdev->resource.qid_fifo, entry->qid,
-					  &rdev->resource.qid_fifo_lock);
+					&rdev->resource.qid_fifo_lock);
+			mutex_lock(&rdev->stats.lock);
+			rdev->stats.qid.cur -= rdev->qpmask + 1;
+			mutex_unlock(&rdev->stats.lock);
+		}
 		kfree(entry);
 	}
 
@@ -332,6 +398,13 @@ static int c4iw_rdev_open(struct c4iw_rdev *rdev)
 		goto err1;
 	}
 
+	rdev->stats.pd.total = T4_MAX_NUM_PD;
+	rdev->stats.stag.total = rdev->lldi.vr->stag.size;
+	rdev->stats.pbl.total = rdev->lldi.vr->pbl.size;
+	rdev->stats.rqt.total = rdev->lldi.vr->rq.size;
+	rdev->stats.ocqp.total = rdev->lldi.vr->ocq.size;
+	rdev->stats.qid.total = rdev->lldi.vr->qp.size;
+
 	err = c4iw_init_resource(rdev, c4iw_num_stags(rdev), T4_MAX_NUM_PD);
 	if (err) {
 		printk(KERN_ERR MOD "error %d initializing resources\n", err);
@@ -440,6 +513,7 @@ static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
 	idr_init(&devp->qpidr);
 	idr_init(&devp->mmidr);
 	spin_lock_init(&devp->lock);
+	mutex_init(&devp->rdev.stats.lock);
 
 	if (c4iw_debugfs_root) {
 		devp->debugfs_root = debugfs_create_dir(
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 1357c5b..a849074 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -103,6 +103,22 @@ enum c4iw_rdev_flags {
 	T4_FATAL_ERROR = (1<<0),
 };
 
+struct c4iw_stat {
+	u64 total;
+	u64 cur;
+	u64 max;
+};
+
+struct c4iw_stats {
+	struct mutex lock;
+	struct c4iw_stat qid;
+	struct c4iw_stat pd;
+	struct c4iw_stat stag;
+	struct c4iw_stat pbl;
+	struct c4iw_stat rqt;
+	struct c4iw_stat ocqp;
+};
+
 struct c4iw_rdev {
 	struct c4iw_resource resource;
 	unsigned long qpshift;
@@ -117,6 +133,7 @@ struct c4iw_rdev {
 	struct cxgb4_lld_info lldi;
 	unsigned long oc_mw_pa;
 	void __iomem *oc_mw_kva;
+	struct c4iw_stats stats;
 };
 
 static inline int c4iw_fatal_error(struct c4iw_rdev *rdev)
diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index 40c8353..2a87379 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -135,6 +135,11 @@ static int write_tpt_entry(struct c4iw_rdev *rdev, u32 reset_tpt_entry,
 					     &rdev->resource.tpt_fifo_lock);
 		if (!stag_idx)
 			return -ENOMEM;
+		mutex_lock(&rdev->stats.lock);
+		rdev->stats.stag.cur += 32;
+		if (rdev->stats.stag.cur > rdev->stats.stag.max)
+			rdev->stats.stag.max = rdev->stats.stag.cur;
+		mutex_unlock(&rdev->stats.lock);
 		*stag = (stag_idx << 8) | (atomic_inc_return(&key) & 0xff);
 	}
 	PDBG("%s stag_state 0x%0x type 0x%0x pdid 0x%0x, stag_idx 0x%x\n",
@@ -165,9 +170,13 @@ static int write_tpt_entry(struct c4iw_rdev *rdev, u32 reset_tpt_entry,
 				(rdev->lldi.vr->stag.start >> 5),
 				sizeof(tpt), &tpt);
 
-	if (reset_tpt_entry)
+	if (reset_tpt_entry) {
 		c4iw_put_resource(&rdev->resource.tpt_fifo, stag_idx,
 				  &rdev->resource.tpt_fifo_lock);
+		mutex_lock(&rdev->stats.lock);
+		rdev->stats.stag.cur -= 32;
+		mutex_unlock(&rdev->stats.lock);
+	}
 	return err;
 }
 
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index be1c18f..8d58736 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -190,6 +190,9 @@ static int c4iw_deallocate_pd(struct ib_pd *pd)
 	PDBG("%s ibpd %p pdid 0x%x\n", __func__, pd, php->pdid);
 	c4iw_put_resource(&rhp->rdev.resource.pdid_fifo, php->pdid,
 			  &rhp->rdev.resource.pdid_fifo_lock);
+	mutex_lock(&rhp->rdev.stats.lock);
+	rhp->rdev.stats.pd.cur--;
+	mutex_unlock(&rhp->rdev.stats.lock);
 	kfree(php);
 	return 0;
 }
@@ -222,6 +225,11 @@ static struct ib_pd *c4iw_allocate_pd(struct ib_device *ibdev,
 			return ERR_PTR(-EFAULT);
 		}
 	}
+	mutex_lock(&rhp->rdev.stats.lock);
+	rhp->rdev.stats.pd.cur++;
+	if (rhp->rdev.stats.pd.cur > rhp->rdev.stats.pd.max)
+		rhp->rdev.stats.pd.max = rhp->rdev.stats.pd.cur;
+	mutex_unlock(&rhp->rdev.stats.lock);
 	PDBG("%s pdid 0x%0x ptr 0x%p\n", __func__, pdid, php);
 	return &php->ibpd;
 }
diff --git a/drivers/infiniband/hw/cxgb4/resource.c b/drivers/infiniband/hw/cxgb4/resource.c
index 407ff39..1b948d1 100644
--- a/drivers/infiniband/hw/cxgb4/resource.c
+++ b/drivers/infiniband/hw/cxgb4/resource.c
@@ -185,6 +185,9 @@ u32 c4iw_get_cqid(struct c4iw_rdev *rdev, struct c4iw_dev_ucontext *uctx)
 					&rdev->resource.qid_fifo_lock);
 		if (!qid)
 			goto out;
+		mutex_lock(&rdev->stats.lock);
+		rdev->stats.qid.cur += rdev->qpmask + 1;
+		mutex_unlock(&rdev->stats.lock);
 		for (i = qid+1; i & rdev->qpmask; i++) {
 			entry = kmalloc(sizeof *entry, GFP_KERNEL);
 			if (!entry)
@@ -213,6 +216,10 @@ u32 c4iw_get_cqid(struct c4iw_rdev *rdev, struct c4iw_dev_ucontext *uctx)
 out:
 	mutex_unlock(&uctx->lock);
 	PDBG("%s qid 0x%x\n", __func__, qid);
+	mutex_lock(&rdev->stats.lock);
+	if (rdev->stats.qid.cur > rdev->stats.qid.max)
+		rdev->stats.qid.max = rdev->stats.qid.cur;
+	mutex_unlock(&rdev->stats.lock);
 	return qid;
 }
 
@@ -249,6 +256,9 @@ u32 c4iw_get_qpid(struct c4iw_rdev *rdev, struct c4iw_dev_ucontext *uctx)
 					&rdev->resource.qid_fifo_lock);
 		if (!qid)
 			goto out;
+		mutex_lock(&rdev->stats.lock);
+		rdev->stats.qid.cur += rdev->qpmask + 1;
+		mutex_unlock(&rdev->stats.lock);
 		for (i = qid+1; i & rdev->qpmask; i++) {
 			entry = kmalloc(sizeof *entry, GFP_KERNEL);
 			if (!entry)
@@ -277,6 +287,10 @@ u32 c4iw_get_qpid(struct c4iw_rdev *rdev, struct c4iw_dev_ucontext *uctx)
 out:
 	mutex_unlock(&uctx->lock);
 	PDBG("%s qid 0x%x\n", __func__, qid);
+	mutex_lock(&rdev->stats.lock);
+	if (rdev->stats.qid.cur > rdev->stats.qid.max)
+		rdev->stats.qid.max = rdev->stats.qid.cur;
+	mutex_unlock(&rdev->stats.lock);
 	return qid;
 }
 
@@ -315,12 +329,22 @@ u32 c4iw_pblpool_alloc(struct c4iw_rdev *rdev, int size)
 	if (!addr)
 		printk_ratelimited(KERN_WARNING MOD "%s: Out of PBL memory\n",
 		       pci_name(rdev->lldi.pdev));
+	if (addr) {
+		mutex_lock(&rdev->stats.lock);
+		rdev->stats.pbl.cur += roundup(size, 1 << MIN_PBL_SHIFT);
+		if (rdev->stats.pbl.cur > rdev->stats.pbl.max)
+			rdev->stats.pbl.max = rdev->stats.pbl.cur;
+		mutex_unlock(&rdev->stats.lock);
+	}
 	return (u32)addr;
 }
 
 void c4iw_pblpool_free(struct c4iw_rdev *rdev, u32 addr, int size)
 {
 	PDBG("%s addr 0x%x size %d\n", __func__, addr, size);
+	mutex_lock(&rdev->stats.lock);
+	rdev->stats.pbl.cur -= roundup(size, 1 << MIN_PBL_SHIFT);
+	mutex_unlock(&rdev->stats.lock);
 	gen_pool_free(rdev->pbl_pool, (unsigned long)addr, size);
 }
 
@@ -377,12 +401,22 @@ u32 c4iw_rqtpool_alloc(struct c4iw_rdev *rdev, int size)
 	if (!addr)
 		printk_ratelimited(KERN_WARNING MOD "%s: Out of RQT memory\n",
 		       pci_name(rdev->lldi.pdev));
+	if (addr) {
+		mutex_lock(&rdev->stats.lock);
+		rdev->stats.rqt.cur += roundup(size << 6, 1 << MIN_RQT_SHIFT);
+		if (rdev->stats.rqt.cur > rdev->stats.rqt.max)
+			rdev->stats.rqt.max = rdev->stats.rqt.cur;
+		mutex_unlock(&rdev->stats.lock);
+	}
 	return (u32)addr;
 }
 
 void c4iw_rqtpool_free(struct c4iw_rdev *rdev, u32 addr, int size)
 {
 	PDBG("%s addr 0x%x size %d\n", __func__, addr, size << 6);
+	mutex_lock(&rdev->stats.lock);
+	rdev->stats.rqt.cur -= roundup(size << 6, 1 << MIN_RQT_SHIFT);
+	mutex_unlock(&rdev->stats.lock);
 	gen_pool_free(rdev->rqt_pool, (unsigned long)addr, size << 6);
 }
 
@@ -433,12 +467,22 @@ u32 c4iw_ocqp_pool_alloc(struct c4iw_rdev *rdev, int size)
 {
 	unsigned long addr = gen_pool_alloc(rdev->ocqp_pool, size);
 	PDBG("%s addr 0x%x size %d\n", __func__, (u32)addr, size);
+	if (addr) {
+		mutex_lock(&rdev->stats.lock);
+		rdev->stats.ocqp.cur += roundup(size, 1 << MIN_OCQP_SHIFT);
+		if (rdev->stats.ocqp.cur > rdev->stats.ocqp.max)
+			rdev->stats.ocqp.max = rdev->stats.ocqp.cur;
+		mutex_unlock(&rdev->stats.lock);
+	}
 	return (u32)addr;
 }
 
 void c4iw_ocqp_pool_free(struct c4iw_rdev *rdev, u32 addr, int size)
 {
 	PDBG("%s addr 0x%x size %d\n", __func__, addr, size);
+	mutex_lock(&rdev->stats.lock);
+	rdev->stats.ocqp.cur -= roundup(size, 1 << MIN_OCQP_SHIFT);
+	mutex_unlock(&rdev->stats.lock);
 	gen_pool_free(rdev->ocqp_pool, (unsigned long)addr, size);
 }
 
-- 
1.7.1

^ permalink raw reply related

* [PATCH 05/10] RDMA/cxgb4: Add DB Overflow Avoidance.
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma, netdev; +Cc: roland, davem, divy, dm, kumaras, swise, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul@chelsio.com>

Get FULL/EMPTY/DROP events from the LLD.

On a FULL event, disable normal user mode DB rings.

Add modify_qp semantics to allow user processes to call into
the kernel to ring doorbells without overflowing.

Add DB Full/Empty/Drop stats.

Mark queues when created indicating the doorbell state.

If we're in the middle of db overflow avoidance, then newly created
queues should start out in this mode.

Bump the C4IW_UVERBS_ABI_VERSION to 2 so the user mode library can
know if the driver supports the kernel mode db ringing.
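
The kernel-side doorbell ring described above waits for the DB FIFO to
drain, doubling its sleep after every failed check. A minimal userspace
sketch of that backoff follows, assuming the 768-entry threshold and the
200000 usec cap used by ring_kernel_db() in this patch. The names
next_db_delay() and simulate_ring() are hypothetical, and the array of
samples merely stands in for repeated cxgb4_dbfifo_count() reads.

```c
#include <assert.h>

#define DB_FIFO_THRESHOLD  768     /* ring only below this occupancy */
#define DB_MAX_DELAY_USECS 200000  /* upper bound on a single sleep */

/* Double the delay, capped at DB_MAX_DELAY_USECS. */
static int next_db_delay(int delay_usecs)
{
	int doubled = delay_usecs << 1;

	return doubled < DB_MAX_DELAY_USECS ? doubled : DB_MAX_DELAY_USECS;
}

/*
 * Walk a list of sampled FIFO occupancies. Returns the total number of
 * microseconds that would have been slept before the doorbell could be
 * rung, or -1 if every sample was at or above the threshold.
 */
static long simulate_ring(const int *samples, int nsamples, int start_delay)
{
	long waited = 0;
	int delay = start_delay;
	int i;

	assert(start_delay > 0);
	for (i = 0; i < nsamples; i++) {
		if (samples[i] < DB_FIFO_THRESHOLD)
			return waited;          /* room in the FIFO: ring now */
		waited += delay;                /* sleep, then back off */
		delay = next_db_delay(delay);
	}
	return -1;
}
```

With db_delay_usecs left at its default of 1, three consecutive full
samples would cost 1 + 2 + 4 usecs of sleep in this model before the
fourth check succeeds. The real function sleeps via schedule_timeout(),
so actual wait times are rounded to jiffies.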

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/hw/cxgb4/device.c   |   84 +++++++++++++++++++++++++++++--
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   37 ++++++++++++--
 drivers/infiniband/hw/cxgb4/qp.c       |   51 +++++++++++++++++++-
 drivers/infiniband/hw/cxgb4/user.h     |    2 +-
 4 files changed, 162 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 8483111..9062ed9 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -44,6 +44,12 @@ MODULE_DESCRIPTION("Chelsio T4 RDMA Driver");
 MODULE_LICENSE("Dual BSD/GPL");
 MODULE_VERSION(DRV_VERSION);
 
+struct uld_ctx {
+	struct list_head entry;
+	struct cxgb4_lld_info lldi;
+	struct c4iw_dev *dev;
+};
+
 static LIST_HEAD(uld_ctx_list);
 static DEFINE_MUTEX(dev_mutex);
 
@@ -263,6 +269,9 @@ static int stats_show(struct seq_file *seq, void *v)
 	seq_printf(seq, "  OCQPMEM: %10llu %10llu %10llu\n",
 			dev->rdev.stats.ocqp.total, dev->rdev.stats.ocqp.cur,
 			dev->rdev.stats.ocqp.max);
+	seq_printf(seq, "  DB FULL: %10llu\n", dev->rdev.stats.db_full);
+	seq_printf(seq, " DB EMPTY: %10llu\n", dev->rdev.stats.db_empty);
+	seq_printf(seq, "  DB DROP: %10llu\n", dev->rdev.stats.db_drop);
 	return 0;
 }
 
@@ -283,6 +292,9 @@ static ssize_t stats_clear(struct file *file, const char __user *buf,
 	dev->rdev.stats.pbl.max = 0;
 	dev->rdev.stats.rqt.max = 0;
 	dev->rdev.stats.ocqp.max = 0;
+	dev->rdev.stats.db_full = 0;
+	dev->rdev.stats.db_empty = 0;
+	dev->rdev.stats.db_drop = 0;
 	mutex_unlock(&dev->rdev.stats.lock);
 	return count;
 }
@@ -443,12 +455,6 @@ static void c4iw_rdev_close(struct c4iw_rdev *rdev)
 	c4iw_destroy_resource(&rdev->resource);
 }
 
-struct uld_ctx {
-	struct list_head entry;
-	struct cxgb4_lld_info lldi;
-	struct c4iw_dev *dev;
-};
-
 static void c4iw_dealloc(struct uld_ctx *ctx)
 {
 	c4iw_rdev_close(&ctx->dev->rdev);
@@ -514,6 +520,7 @@ static struct c4iw_dev *c4iw_alloc(const struct cxgb4_lld_info *infop)
 	idr_init(&devp->mmidr);
 	spin_lock_init(&devp->lock);
 	mutex_init(&devp->rdev.stats.lock);
+	mutex_init(&devp->db_mutex);
 
 	if (c4iw_debugfs_root) {
 		devp->debugfs_root = debugfs_create_dir(
@@ -659,11 +666,76 @@ static int c4iw_uld_state_change(void *handle, enum cxgb4_state new_state)
 	return 0;
 }
 
+static int disable_qp_db(int id, void *p, void *data)
+{
+	struct c4iw_qp *qp = p;
+
+	t4_disable_wq_db(&qp->wq);
+	return 0;
+}
+
+static void stop_queues(struct uld_ctx *ctx)
+{
+	spin_lock_irq(&ctx->dev->lock);
+	ctx->dev->db_state = FLOW_CONTROL;
+	idr_for_each(&ctx->dev->qpidr, disable_qp_db, NULL);
+	spin_unlock_irq(&ctx->dev->lock);
+}
+
+static int enable_qp_db(int id, void *p, void *data)
+{
+	struct c4iw_qp *qp = p;
+
+	t4_enable_wq_db(&qp->wq);
+	return 0;
+}
+
+static void resume_queues(struct uld_ctx *ctx)
+{
+	spin_lock_irq(&ctx->dev->lock);
+	ctx->dev->db_state = NORMAL;
+	idr_for_each(&ctx->dev->qpidr, enable_qp_db, NULL);
+	spin_unlock_irq(&ctx->dev->lock);
+}
+
+static int c4iw_uld_control(void *handle, enum cxgb4_control control, ...)
+{
+	struct uld_ctx *ctx = handle;
+
+	switch (control) {
+	case CXGB4_CONTROL_DB_FULL:
+		stop_queues(ctx);
+		mutex_lock(&ctx->dev->rdev.stats.lock);
+		ctx->dev->rdev.stats.db_full++;
+		mutex_unlock(&ctx->dev->rdev.stats.lock);
+		break;
+	case CXGB4_CONTROL_DB_EMPTY:
+		resume_queues(ctx);
+		mutex_lock(&ctx->dev->rdev.stats.lock);
+		ctx->dev->rdev.stats.db_empty++;
+		mutex_unlock(&ctx->dev->rdev.stats.lock);
+		break;
+	case CXGB4_CONTROL_DB_DROP:
+		printk(KERN_WARNING MOD "%s: Fatal DB DROP\n",
+		       pci_name(ctx->lldi.pdev));
+		mutex_lock(&ctx->dev->rdev.stats.lock);
+		ctx->dev->rdev.stats.db_drop++;
+		mutex_unlock(&ctx->dev->rdev.stats.lock);
+		break;
+	default:
+		printk(KERN_WARNING MOD "%s: unknown control cmd %u\n",
+		       pci_name(ctx->lldi.pdev), control);
+		break;
+	}
+	return 0;
+}
+
 static struct cxgb4_uld_info c4iw_uld_info = {
 	.name = DRV_NAME,
 	.add = c4iw_uld_add,
 	.rx_handler = c4iw_uld_rx_handler,
 	.state_change = c4iw_uld_state_change,
+	.control = c4iw_uld_control,
 };
 
 static int __init c4iw_init_module(void)
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index a849074..a11ed5c 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -117,6 +117,9 @@ struct c4iw_stats {
 	struct c4iw_stat pbl;
 	struct c4iw_stat rqt;
 	struct c4iw_stat ocqp;
+	u64  db_full;
+	u64  db_empty;
+	u64  db_drop;
 };
 
 struct c4iw_rdev {
@@ -192,6 +195,12 @@ static inline int c4iw_wait_for_reply(struct c4iw_rdev *rdev,
 	return wr_waitp->ret;
 }
 
+enum db_state {
+	NORMAL = 0,
+	FLOW_CONTROL = 1,
+	RECOVERY = 2
+};
+
 struct c4iw_dev {
 	struct ib_device ibdev;
 	struct c4iw_rdev rdev;
@@ -200,7 +209,9 @@ struct c4iw_dev {
 	struct idr qpidr;
 	struct idr mmidr;
 	spinlock_t lock;
+	struct mutex db_mutex;
 	struct dentry *debugfs_root;
+	enum db_state db_state;
 };
 
 static inline struct c4iw_dev *to_c4iw_dev(struct ib_device *ibdev)
@@ -228,8 +239,8 @@ static inline struct c4iw_mr *get_mhp(struct c4iw_dev *rhp, u32 mmid)
 	return idr_find(&rhp->mmidr, mmid);
 }
 
-static inline int insert_handle(struct c4iw_dev *rhp, struct idr *idr,
-				void *handle, u32 id)
+static inline int _insert_handle(struct c4iw_dev *rhp, struct idr *idr,
+				 void *handle, u32 id, int lock)
 {
 	int ret;
 	int newid;
@@ -237,15 +248,29 @@ static inline int insert_handle(struct c4iw_dev *rhp, struct idr *idr,
 	do {
 		if (!idr_pre_get(idr, GFP_KERNEL))
 			return -ENOMEM;
-		spin_lock_irq(&rhp->lock);
+		if (lock)
+			spin_lock_irq(&rhp->lock);
 		ret = idr_get_new_above(idr, handle, id, &newid);
 		BUG_ON(newid != id);
-		spin_unlock_irq(&rhp->lock);
+		if (lock)
+			spin_unlock_irq(&rhp->lock);
 	} while (ret == -EAGAIN);
 
 	return ret;
 }
 
+static inline int insert_handle(struct c4iw_dev *rhp, struct idr *idr,
+				void *handle, u32 id)
+{
+	return _insert_handle(rhp, idr, handle, id, 1);
+}
+
+static inline int insert_handle_nolock(struct c4iw_dev *rhp, struct idr *idr,
+				       void *handle, u32 id)
+{
+	return _insert_handle(rhp, idr, handle, id, 0);
+}
+
 static inline void remove_handle(struct c4iw_dev *rhp, struct idr *idr, u32 id)
 {
 	spin_lock_irq(&rhp->lock);
@@ -370,6 +395,8 @@ struct c4iw_qp_attributes {
 	struct c4iw_ep *llp_stream_handle;
 	u8 layer_etype;
 	u8 ecode;
+	u16 sq_db_inc;
+	u16 rq_db_inc;
 };
 
 struct c4iw_qp {
@@ -444,6 +471,8 @@ static inline void insert_mmap(struct c4iw_ucontext *ucontext,
 
 enum c4iw_qp_attr_mask {
 	C4IW_QP_ATTR_NEXT_STATE = 1 << 0,
+	C4IW_QP_ATTR_SQ_DB = 1<<1,
+	C4IW_QP_ATTR_RQ_DB = 1<<2,
 	C4IW_QP_ATTR_ENABLE_RDMA_READ = 1 << 7,
 	C4IW_QP_ATTR_ENABLE_RDMA_WRITE = 1 << 8,
 	C4IW_QP_ATTR_ENABLE_RDMA_BIND = 1 << 9,
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index 5f940ae..beec667 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -34,6 +34,10 @@
 
 #include "iw_cxgb4.h"
 
+static int db_delay_usecs = 1;
+module_param(db_delay_usecs, int, 0644);
+MODULE_PARM_DESC(db_delay_usecs, "Usecs to delay awaiting db fifo to drain");
+
 static int ocqp_support = 1;
 module_param(ocqp_support, int, 0644);
 MODULE_PARM_DESC(ocqp_support, "Support on-chip SQs (default=1)");
@@ -1128,6 +1132,29 @@ out:
 	return ret;
 }
 
+/*
+ * Called by the library when the qp has user dbs disabled due to
+ * a DB_FULL condition.  This function will single-thread all user
+ * DB rings to avoid overflowing the hw db-fifo.
+ */
+static int ring_kernel_db(struct c4iw_qp *qhp, u32 qid, u16 inc)
+{
+	int delay = db_delay_usecs;
+
+	mutex_lock(&qhp->rhp->db_mutex);
+	do {
+		if (cxgb4_dbfifo_count(qhp->rhp->rdev.lldi.ports[0], 1) < 768) {
+			writel(V_QID(qid) | V_PIDX(inc), qhp->wq.db);
+			break;
+		}
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(usecs_to_jiffies(delay));
+		delay = min(delay << 1, 200000);
+	} while (1);
+	mutex_unlock(&qhp->rhp->db_mutex);
+	return 0;
+}
+
 int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp *qhp,
 		   enum c4iw_qp_attr_mask mask,
 		   struct c4iw_qp_attributes *attrs,
@@ -1176,6 +1203,15 @@ int c4iw_modify_qp(struct c4iw_dev *rhp, struct c4iw_qp *qhp,
 		qhp->attr = newattr;
 	}
 
+	if (mask & C4IW_QP_ATTR_SQ_DB) {
+		ret = ring_kernel_db(qhp, qhp->wq.sq.qid, attrs->sq_db_inc);
+		goto out;
+	}
+	if (mask & C4IW_QP_ATTR_RQ_DB) {
+		ret = ring_kernel_db(qhp, qhp->wq.rq.qid, attrs->rq_db_inc);
+		goto out;
+	}
+
 	if (!(mask & C4IW_QP_ATTR_NEXT_STATE))
 		goto out;
 	if (qhp->attr.state == attrs->next_state)
@@ -1469,7 +1505,11 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd, struct ib_qp_init_attr *attrs,
 	init_waitqueue_head(&qhp->wait);
 	atomic_set(&qhp->refcnt, 1);
 
-	ret = insert_handle(rhp, &rhp->qpidr, qhp, qhp->wq.sq.qid);
+	spin_lock_irq(&rhp->lock);
+	if (rhp->db_state != NORMAL)
+		t4_disable_wq_db(&qhp->wq);
+	ret = insert_handle_nolock(rhp, &rhp->qpidr, qhp, qhp->wq.sq.qid);
+	spin_unlock_irq(&rhp->lock);
 	if (ret)
 		goto err2;
 
@@ -1613,6 +1653,15 @@ int c4iw_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 			 C4IW_QP_ATTR_ENABLE_RDMA_WRITE |
 			 C4IW_QP_ATTR_ENABLE_RDMA_BIND) : 0;
 
+	/*
+	 * Use SQ_PSN and RQ_PSN to pass in IDX_INC values for
+	 * ringing the queue db when we're in DB_FULL mode.
+	 */
+	attrs.sq_db_inc = attr->sq_psn;
+	attrs.rq_db_inc = attr->rq_psn;
+	mask |= (attr_mask & IB_QP_SQ_PSN) ? C4IW_QP_ATTR_SQ_DB : 0;
+	mask |= (attr_mask & IB_QP_RQ_PSN) ? C4IW_QP_ATTR_RQ_DB : 0;
+
 	return c4iw_modify_qp(rhp, qhp, mask, &attrs, 0);
 }
 
diff --git a/drivers/infiniband/hw/cxgb4/user.h b/drivers/infiniband/hw/cxgb4/user.h
index e6669d5..32b754c 100644
--- a/drivers/infiniband/hw/cxgb4/user.h
+++ b/drivers/infiniband/hw/cxgb4/user.h
@@ -32,7 +32,7 @@
 #ifndef __C4IW_USER_H__
 #define __C4IW_USER_H__
 
-#define C4IW_UVERBS_ABI_VERSION	1
+#define C4IW_UVERBS_ABI_VERSION	2
 
 /*
  * Make sure that all structs defined in this file remain laid out so
-- 
1.7.1

^ permalink raw reply related

* [PATCH 10/10] RDMA/cxgb4: Add query_qp support in driver to query the qp state before flushing.
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma, netdev; +Cc: roland, davem, divy, dm, kumaras, swise, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul@chelsio.com>

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   19 +++++++++++++++++++
 drivers/infiniband/hw/cxgb4/provider.c |    2 ++
 drivers/infiniband/hw/cxgb4/qp.c       |   11 +++++++++++
 3 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 2d5b06b..9beb3a9 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -551,6 +551,23 @@ static inline int c4iw_convert_state(enum ib_qp_state ib_state)
 	}
 }
 
+static inline int to_ib_qp_state(int c4iw_qp_state)
+{
+	switch (c4iw_qp_state) {
+	case C4IW_QP_STATE_IDLE:
+		return IB_QPS_INIT;
+	case C4IW_QP_STATE_RTS:
+		return IB_QPS_RTS;
+	case C4IW_QP_STATE_CLOSING:
+		return IB_QPS_SQD;
+	case C4IW_QP_STATE_TERMINATE:
+		return IB_QPS_SQE;
+	case C4IW_QP_STATE_ERROR:
+		return IB_QPS_ERR;
+	}
+	return IB_QPS_ERR;
+}
+
 static inline u32 c4iw_ib_to_tpt_access(int a)
 {
 	return (a & IB_ACCESS_REMOTE_WRITE ? FW_RI_MEM_ACCESS_REM_WRITE : 0) |
@@ -846,6 +863,8 @@ struct ib_qp *c4iw_create_qp(struct ib_pd *pd,
 			     struct ib_udata *udata);
 int c4iw_ib_modify_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
 				 int attr_mask, struct ib_udata *udata);
+int c4iw_ib_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		     int attr_mask, struct ib_qp_init_attr *init_attr);
 struct ib_qp *c4iw_get_qp(struct ib_device *dev, int qpn);
 u32 c4iw_rqtpool_alloc(struct c4iw_rdev *rdev, int size);
 void c4iw_rqtpool_free(struct c4iw_rdev *rdev, u32 addr, int size);
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index fe98a0a..e084fdc 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -443,6 +443,7 @@ int c4iw_register_device(struct c4iw_dev *dev)
 	    (1ull << IB_USER_VERBS_CMD_REQ_NOTIFY_CQ) |
 	    (1ull << IB_USER_VERBS_CMD_CREATE_QP) |
 	    (1ull << IB_USER_VERBS_CMD_MODIFY_QP) |
+	    (1ull << IB_USER_VERBS_CMD_QUERY_QP) |
 	    (1ull << IB_USER_VERBS_CMD_POLL_CQ) |
 	    (1ull << IB_USER_VERBS_CMD_DESTROY_QP) |
 	    (1ull << IB_USER_VERBS_CMD_POST_SEND) |
@@ -465,6 +466,7 @@ int c4iw_register_device(struct c4iw_dev *dev)
 	dev->ibdev.destroy_ah = c4iw_ah_destroy;
 	dev->ibdev.create_qp = c4iw_create_qp;
 	dev->ibdev.modify_qp = c4iw_ib_modify_qp;
+	dev->ibdev.query_qp = c4iw_ib_query_qp;
 	dev->ibdev.destroy_qp = c4iw_destroy_qp;
 	dev->ibdev.create_cq = c4iw_create_cq;
 	dev->ibdev.destroy_cq = c4iw_destroy_cq;
diff --git a/drivers/infiniband/hw/cxgb4/qp.c b/drivers/infiniband/hw/cxgb4/qp.c
index ba1343e..45aedf1 100644
--- a/drivers/infiniband/hw/cxgb4/qp.c
+++ b/drivers/infiniband/hw/cxgb4/qp.c
@@ -1711,3 +1711,14 @@ struct ib_qp *c4iw_get_qp(struct ib_device *dev, int qpn)
 	PDBG("%s ib_dev %p qpn 0x%x\n", __func__, dev, qpn);
 	return (struct ib_qp *)get_qhp(to_c4iw_dev(dev), qpn);
 }
+
+int c4iw_ib_query_qp(struct ib_qp *ibqp, struct ib_qp_attr *attr,
+		     int attr_mask, struct ib_qp_init_attr *init_attr)
+{
+	struct c4iw_qp *qhp = to_c4iw_qp(ibqp);
+
+	memset(attr, 0, sizeof *attr);
+	memset(init_attr, 0, sizeof *init_attr);
+	attr->qp_state = to_ib_qp_state(qhp->attr.state);
+	return 0;
+}
-- 
1.7.1

^ permalink raw reply related

* [PATCH 00/10] Doorbell drop recovery for Chelsio T4 iWARP
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma, netdev; +Cc: roland, davem, divy, dm, kumaras, swise, Vipul Pandya

This patch series implements doorbell drop recovery for the Chelsio T4 iWARP
driver.

When doorbells (DBs) are dropped, an application can stall for one or more
reasons, so we recover the RDMA and LLD queues in that event.
We also handle DB overflow events.

The patch series also includes some bug fixes, adds RDMA debugfs stats, and
removes kfifo usage for ID management.

The patch series is based on Roland's infiniband tree (for-next branch)
and involves changes to drivers/net/ethernet/chelsio/cxgb4 and
drivers/infiniband/hw/cxgb4.

The changes to drivers/infiniband/hw/cxgb4 depend on the changes to
drivers/net/ethernet/chelsio/cxgb4 for the T4 iWARP driver to build correctly,
so we request that the entire patch series be merged through Roland's tree.

Both linux-rdma and netdev are included in this post for review.

The earlier posting of this series was reviewed and can be found at the link below.
http://www.mail-archive.com/linux-rdma@vger.kernel.org/msg09606.html

Roland's advice to re-post the series can be found at the link below.
http://www.spinics.net/lists/netdev/msg187997.html

Vipul Pandya (10):
  cxgb4: Detect DB FULL events and notify RDMA ULD.
  cxgb4: Common platform specific changes for DB Drop Recovery
  cxgb4: DB Drop Recovery for RDMA and LLD queues.
  RDMA/cxgb4: Add debugfs rdma memory stats
  RDMA/cxgb4: Add DB Overflow Avoidance.
  RDMA/cxgb4: disable interrupts in c4iw_ev_dispatch().
  RDMA/cxgb4: DB Drop Recovery for RDMA and LLD queues.
  RDMA/cxgb4: Use vmalloc for debugfs qp dump. Allows dumping thousands
    of qps.
  RDMA/cxgb4: remove kfifo usage
  RDMA/cxgb4: Add query_qp support in driver to query the qp state
    before flushing.

 drivers/infiniband/hw/cxgb4/Makefile            |    2 +-
 drivers/infiniband/hw/cxgb4/cm.c                |   23 ++-
 drivers/infiniband/hw/cxgb4/device.c            |  339 ++++++++++++++++++++++-
 drivers/infiniband/hw/cxgb4/ev.c                |    8 +-
 drivers/infiniband/hw/cxgb4/id_table.c          |  112 ++++++++
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h          |  134 ++++++++--
 drivers/infiniband/hw/cxgb4/mem.c               |   21 +-
 drivers/infiniband/hw/cxgb4/provider.c          |   19 +-
 drivers/infiniband/hw/cxgb4/qp.c                |  105 +++++++-
 drivers/infiniband/hw/cxgb4/resource.c          |  180 +++++-------
 drivers/infiniband/hw/cxgb4/t4.h                |   24 ++
 drivers/infiniband/hw/cxgb4/user.h              |    2 +-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   23 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  235 +++++++++++++++-
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |   11 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c        |   22 ++-
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |   62 ++++-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   53 ++++
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |   15 +
 19 files changed, 1218 insertions(+), 172 deletions(-)
 create mode 100644 drivers/infiniband/hw/cxgb4/id_table.c

^ permalink raw reply

* [PATCH 08/10] RDMA/cxgb4: Use vmalloc for debugfs qp dump. Allows dumping thousands of qps.
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma, netdev; +Cc: roland, davem, divy, dm, kumaras, swise, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul@chelsio.com>

Log active open failures of interest.

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/hw/cxgb4/cm.c     |   18 ++++++++++++++++++
 drivers/infiniband/hw/cxgb4/device.c |    4 ++--
 2 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 6ce401a..55ab284 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1413,6 +1413,24 @@ static int act_open_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
 		return 0;
 	}
 
+	/*
+	 * Log interesting failures.
+	 */
+	switch (status) {
+	case CPL_ERR_CONN_RESET:
+	case CPL_ERR_CONN_TIMEDOUT:
+		break;
+	default:
+		printk(KERN_INFO MOD "Active open failure - "
+		       "atid %u status %u errno %d %pI4:%u->%pI4:%u\n",
+		       atid, status, status2errno(status),
+		       &ep->com.local_addr.sin_addr.s_addr,
+		       ntohs(ep->com.local_addr.sin_port),
+		       &ep->com.remote_addr.sin_addr.s_addr,
+		       ntohs(ep->com.remote_addr.sin_port));
+		break;
+	}
+
 	connect_reply_upcall(ep, status2errno(status));
 	state_set(&ep->com, DEAD);
 
diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index bdb398f..8545629 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -121,7 +121,7 @@ static int qp_release(struct inode *inode, struct file *file)
 		printk(KERN_INFO "%s null qpd?\n", __func__);
 		return 0;
 	}
-	kfree(qpd->buf);
+	vfree(qpd->buf);
 	kfree(qpd);
 	return 0;
 }
@@ -145,7 +145,7 @@ static int qp_open(struct inode *inode, struct file *file)
 	spin_unlock_irq(&qpd->devp->lock);
 
 	qpd->bufsize = count * 128;
-	qpd->buf = kmalloc(qpd->bufsize, GFP_KERNEL);
+	qpd->buf = vmalloc(qpd->bufsize);
 	if (!qpd->buf) {
 		ret = -ENOMEM;
 		goto err1;
-- 
1.7.1

^ permalink raw reply related

* Re: [V2 PATCH 2/9] macvtap: zerocopy: fix truesize underestimation
From: Jason Wang @ 2012-05-18 10:10 UTC (permalink / raw)
  To: Shirley Ma; +Cc: eric.dumazet, mst, netdev, linux-kernel, ebiederm, davem
In-Reply-To: <1337268512.10741.53.camel@oc3660625478.ibm.com>

On 05/17/2012 11:28 PM, Shirley Ma wrote:
> On Thu, 2012-05-17 at 10:59 +0800, Jason Wang wrote:
>> I don't see how this affects skb->len. As for truesize, I think they are
>> different: when the offset is not zero, the data in this vector is divided
>> into two parts. The first part is copied into the skb directly, and the
>> second is pinned from a whole userspace page by get_user_pages_fast(), so
>> we need to count the whole page against the socket limit to guard against
>> an evil application.
> What I meant is that the code for skb->truesize double-counts the first
> offset, if any is left from that vector (partially copied into the skb
> directly, and then counted again in the page size, which includes the
> offset (truesize += PAGE_SIZE)).

Yes, I see what you mean. There's no difference between the first frag and
the others: the other frags may also not occupy a whole page. Since we pin
the whole user page, it is better to count the whole page size to guard
against an evil application.
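
To make the accounting concrete, here is a rough userspace sketch of
charging whole pages for a pinned user buffer. It assumes 4 KiB pages, and
the helper names pages_spanned() and pinned_truesize() are hypothetical;
this is not the macvtap code, only the arithmetic being argued for.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096UL  /* assumed page size for this sketch */

/* Number of pages touched by the user buffer [addr, addr + len). */
static unsigned long pages_spanned(unsigned long addr, size_t len)
{
	unsigned long first, last;

	if (len == 0)
		return 0;
	assert(addr + len > addr);      /* reject address wrap-around */
	first = addr / SKETCH_PAGE_SIZE;
	last = (addr + len - 1) / SKETCH_PAGE_SIZE;
	return last - first + 1;
}

/*
 * Bytes to charge against the socket limit for a pinned region: every
 * touched page counts in full, regardless of the offset into it.
 */
static unsigned long pinned_truesize(unsigned long addr, size_t len)
{
	return pages_spanned(addr, len) * SKETCH_PAGE_SIZE;
}
```

Under these assumptions, a 200-byte region that starts 100 bytes into a
page is charged 4096 bytes, and the same 200 bytes straddling a page
boundary are charged 8192, since both pages stay pinned.
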
> Thanks
> Shirley

^ permalink raw reply

* [PATCH 09/10] RDMA/cxgb4: remove kfifo usage
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma, netdev; +Cc: roland, davem, divy, dm, kumaras, swise, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul@chelsio.com>

Using kfifos for ID management was limiting the number of QPs and
preventing NP384 MPI jobs.  So replace it with a simple bitmap
allocator.

Remove IDs from the IDR tables before deallocating them. This bug was
causing the BUG_ON() in insert_handle() to fire because the ID was
getting reused before being removed from the IDR table.
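
For readers unfamiliar with the approach, a simple bitmap allocator can be
sketched in userspace C as below. This is a generic first-fit illustration
only: the id_table.c added by this patch is not shown in this excerpt, and
the names struct id_bitmap, id_bitmap_init(), id_bitmap_alloc(),
id_bitmap_free() and id_bitmap_destroy() are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

struct id_bitmap {
	unsigned long *words;   /* one bit per ID, set while allocated */
	unsigned int max;       /* IDs managed are 0 .. max - 1 */
};

/* Returns 0 on success, -1 if the bitmap could not be allocated. */
static int id_bitmap_init(struct id_bitmap *b, unsigned int max)
{
	size_t nwords = (max + BITS_PER_WORD - 1) / BITS_PER_WORD;

	assert(max > 0);
	b->words = calloc(nwords, sizeof(unsigned long));
	if (!b->words)
		return -1;
	b->max = max;
	return 0;
}

/* Returns the lowest free ID and marks it used, or -1 if none is free. */
static long id_bitmap_alloc(struct id_bitmap *b)
{
	unsigned int id;

	for (id = 0; id < b->max; id++) {
		unsigned long mask = 1UL << (id % BITS_PER_WORD);
		unsigned long *w = &b->words[id / BITS_PER_WORD];

		if (!(*w & mask)) {
			*w |= mask;
			return id;
		}
	}
	return -1;
}

/* Marks an ID free so that a later allocation may hand it out again. */
static void id_bitmap_free(struct id_bitmap *b, unsigned int id)
{
	assert(id < b->max);
	b->words[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
}

static void id_bitmap_destroy(struct id_bitmap *b)
{
	free(b->words);
	b->words = NULL;
	b->max = 0;
}
```

Because a freed ID can be handed out again immediately, any lookup table
keyed by that ID has to drop its entry before id_bitmap_free() is called,
which is the ordering the second paragraph above describes.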

Signed-off-by: Vipul Pandya <vipul@chelsio.com>
Signed-off-by: Steve Wise <swise@opengridcomputing.com>
---
 drivers/infiniband/hw/cxgb4/Makefile   |    2 +-
 drivers/infiniband/hw/cxgb4/device.c   |   37 +++++---
 drivers/infiniband/hw/cxgb4/id_table.c |  112 ++++++++++++++++++++++++
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |   35 ++++++--
 drivers/infiniband/hw/cxgb4/mem.c      |   10 +--
 drivers/infiniband/hw/cxgb4/provider.c |    9 +--
 drivers/infiniband/hw/cxgb4/resource.c |  148 ++++++++------------------------
 7 files changed, 203 insertions(+), 150 deletions(-)
 create mode 100644 drivers/infiniband/hw/cxgb4/id_table.c

diff --git a/drivers/infiniband/hw/cxgb4/Makefile b/drivers/infiniband/hw/cxgb4/Makefile
index 46b878c..e11cf72 100644
--- a/drivers/infiniband/hw/cxgb4/Makefile
+++ b/drivers/infiniband/hw/cxgb4/Makefile
@@ -2,4 +2,4 @@ ccflags-y := -Idrivers/net/ethernet/chelsio/cxgb4
 
 obj-$(CONFIG_INFINIBAND_CXGB4) += iw_cxgb4.o
 
-iw_cxgb4-y :=  device.o cm.o provider.o mem.o cq.o qp.o resource.o ev.o
+iw_cxgb4-y :=  device.o cm.o provider.o mem.o cq.o qp.o resource.o ev.o id_table.o
diff --git a/drivers/infiniband/hw/cxgb4/device.c b/drivers/infiniband/hw/cxgb4/device.c
index 8545629..c8fd1d8 100644
--- a/drivers/infiniband/hw/cxgb4/device.c
+++ b/drivers/infiniband/hw/cxgb4/device.c
@@ -252,25 +252,26 @@ static int stats_show(struct seq_file *seq, void *v)
 {
 	struct c4iw_dev *dev = seq->private;
 
-	seq_printf(seq, " Object: %10s %10s %10s\n", "Total", "Current", "Max");
-	seq_printf(seq, "     PDID: %10llu %10llu %10llu\n",
+	seq_printf(seq, "   Object: %10s %10s %10s %10s\n", "Total", "Current",
+		   "Max", "Fail");
+	seq_printf(seq, "     PDID: %10llu %10llu %10llu %10llu\n",
 			dev->rdev.stats.pd.total, dev->rdev.stats.pd.cur,
-			dev->rdev.stats.pd.max);
-	seq_printf(seq, "      QID: %10llu %10llu %10llu\n",
+			dev->rdev.stats.pd.max, dev->rdev.stats.pd.fail);
+	seq_printf(seq, "      QID: %10llu %10llu %10llu %10llu\n",
 			dev->rdev.stats.qid.total, dev->rdev.stats.qid.cur,
-			dev->rdev.stats.qid.max);
-	seq_printf(seq, "   TPTMEM: %10llu %10llu %10llu\n",
+			dev->rdev.stats.qid.max, dev->rdev.stats.qid.fail);
+	seq_printf(seq, "   TPTMEM: %10llu %10llu %10llu %10llu\n",
 			dev->rdev.stats.stag.total, dev->rdev.stats.stag.cur,
-			dev->rdev.stats.stag.max);
-	seq_printf(seq, "   PBLMEM: %10llu %10llu %10llu\n",
+			dev->rdev.stats.stag.max, dev->rdev.stats.stag.fail);
+	seq_printf(seq, "   PBLMEM: %10llu %10llu %10llu %10llu\n",
 			dev->rdev.stats.pbl.total, dev->rdev.stats.pbl.cur,
-			dev->rdev.stats.pbl.max);
-	seq_printf(seq, "   RQTMEM: %10llu %10llu %10llu\n",
+			dev->rdev.stats.pbl.max, dev->rdev.stats.pbl.fail);
+	seq_printf(seq, "   RQTMEM: %10llu %10llu %10llu %10llu\n",
 			dev->rdev.stats.rqt.total, dev->rdev.stats.rqt.cur,
-			dev->rdev.stats.rqt.max);
-	seq_printf(seq, "  OCQPMEM: %10llu %10llu %10llu\n",
+			dev->rdev.stats.rqt.max, dev->rdev.stats.rqt.fail);
+	seq_printf(seq, "  OCQPMEM: %10llu %10llu %10llu %10llu\n",
 			dev->rdev.stats.ocqp.total, dev->rdev.stats.ocqp.cur,
-			dev->rdev.stats.ocqp.max);
+			dev->rdev.stats.ocqp.max, dev->rdev.stats.ocqp.fail);
 	seq_printf(seq, "  DB FULL: %10llu\n", dev->rdev.stats.db_full);
 	seq_printf(seq, " DB EMPTY: %10llu\n", dev->rdev.stats.db_empty);
 	seq_printf(seq, "  DB DROP: %10llu\n", dev->rdev.stats.db_drop);
@@ -292,11 +293,17 @@ static ssize_t stats_clear(struct file *file, const char __user *buf,
 
 	mutex_lock(&dev->rdev.stats.lock);
 	dev->rdev.stats.pd.max = 0;
+	dev->rdev.stats.pd.fail = 0;
 	dev->rdev.stats.qid.max = 0;
+	dev->rdev.stats.qid.fail = 0;
 	dev->rdev.stats.stag.max = 0;
+	dev->rdev.stats.stag.fail = 0;
 	dev->rdev.stats.pbl.max = 0;
+	dev->rdev.stats.pbl.fail = 0;
 	dev->rdev.stats.rqt.max = 0;
+	dev->rdev.stats.rqt.fail = 0;
 	dev->rdev.stats.ocqp.max = 0;
+	dev->rdev.stats.ocqp.fail = 0;
 	dev->rdev.stats.db_full = 0;
 	dev->rdev.stats.db_empty = 0;
 	dev->rdev.stats.db_drop = 0;
@@ -350,8 +357,8 @@ void c4iw_release_dev_ucontext(struct c4iw_rdev *rdev,
 		entry = list_entry(pos, struct c4iw_qid_list, entry);
 		list_del_init(&entry->entry);
 		if (!(entry->qid & rdev->qpmask)) {
-			c4iw_put_resource(&rdev->resource.qid_fifo, entry->qid,
-					&rdev->resource.qid_fifo_lock);
+			c4iw_put_resource(&rdev->resource.qid_table,
+					  entry->qid);
 			mutex_lock(&rdev->stats.lock);
 			rdev->stats.qid.cur -= rdev->qpmask + 1;
 			mutex_unlock(&rdev->stats.lock);
diff --git a/drivers/infiniband/hw/cxgb4/id_table.c b/drivers/infiniband/hw/cxgb4/id_table.c
new file mode 100644
index 0000000..f95e5df
--- /dev/null
+++ b/drivers/infiniband/hw/cxgb4/id_table.c
@@ -0,0 +1,112 @@
+/*
+ * Copyright (c) 2011 Chelsio Communications.  All rights reserved.
+ *
+ * This software is available to you under a choice of one of two
+ * licenses.  You may choose to be licensed under the terms of the GNU
+ * General Public License (GPL) Version 2, available from the file
+ * COPYING in the main directory of this source tree, or the
+ * OpenIB.org BSD license below:
+ *
+ *     Redistribution and use in source and binary forms, with or
+ *     without modification, are permitted provided that the following
+ *     conditions are met:
+ *
+ *      - Redistributions of source code must retain the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer.
+ *
+ *      - Redistributions in binary form must reproduce the above
+ *        copyright notice, this list of conditions and the following
+ *        disclaimer in the documentation and/or other materials
+ *        provided with the distribution.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
+ * EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
+ * MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
+ * NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS
+ * BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ * ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
+ * CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+ * SOFTWARE.
+ */
+#include <linux/kernel.h>
+#include <linux/random.h>
+#include "iw_cxgb4.h"
+
+#define RANDOM_SKIP 16
+
+/*
+ * Trivial bitmap-based allocator. If the random flag is set, the
+ * allocator is designed to:
+ * - pseudo-randomize the id returned such that it is not trivially predictable.
+ * - avoid reuse of recently used id (at the expense of predictability)
+ */
+u32 c4iw_id_alloc(struct c4iw_id_table *alloc)
+{
+	unsigned long flags;
+	u32 obj;
+
+	spin_lock_irqsave(&alloc->lock, flags);
+
+	obj = find_next_zero_bit(alloc->table, alloc->max, alloc->last);
+	if (obj >= alloc->max)
+		obj = find_first_zero_bit(alloc->table, alloc->max);
+
+	if (obj < alloc->max) {
+		if (alloc->flags & C4IW_ID_TABLE_F_RANDOM)
+			alloc->last += random32() % RANDOM_SKIP;
+		else
+			alloc->last = obj + 1;
+		if (alloc->last >= alloc->max)
+			alloc->last = 0;
+		set_bit(obj, alloc->table);
+		obj += alloc->start;
+	} else
+		obj = -1;
+
+	spin_unlock_irqrestore(&alloc->lock, flags);
+	return obj;
+}
+
+void c4iw_id_free(struct c4iw_id_table *alloc, u32 obj)
+{
+	unsigned long flags;
+
+	obj -= alloc->start;
+	BUG_ON((int)obj < 0);
+
+	spin_lock_irqsave(&alloc->lock, flags);
+	clear_bit(obj, alloc->table);
+	spin_unlock_irqrestore(&alloc->lock, flags);
+}
+
+int c4iw_id_table_alloc(struct c4iw_id_table *alloc, u32 start, u32 num,
+			u32 reserved, u32 flags)
+{
+	int i;
+
+	alloc->start = start;
+	alloc->flags = flags;
+	if (flags & C4IW_ID_TABLE_F_RANDOM)
+		alloc->last = random32() % RANDOM_SKIP;
+	else
+		alloc->last = 0;
+	alloc->max  = num;
+	spin_lock_init(&alloc->lock);
+	alloc->table = kmalloc(BITS_TO_LONGS(num) * sizeof(long),
+				GFP_KERNEL);
+	if (!alloc->table)
+		return -ENOMEM;
+
+	bitmap_zero(alloc->table, num);
+	if (!(alloc->flags & C4IW_ID_TABLE_F_EMPTY))
+		for (i = 0; i < reserved; ++i)
+			set_bit(i, alloc->table);
+
+	return 0;
+}
+
+void c4iw_id_table_free(struct c4iw_id_table *alloc)
+{
+	kfree(alloc->table);
+}
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index 6818659..2d5b06b 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -45,7 +45,6 @@
 #include <linux/kref.h>
 #include <linux/timer.h>
 #include <linux/io.h>
-#include <linux/kfifo.h>
 
 #include <asm/byteorder.h>
 
@@ -79,13 +78,22 @@ static inline void *cplhdr(struct sk_buff *skb)
 	return skb->data;
 }
 
+#define C4IW_ID_TABLE_F_RANDOM 1       /* Pseudo-randomize the id's returned */
+#define C4IW_ID_TABLE_F_EMPTY  2       /* Table is initially empty */
+
+struct c4iw_id_table {
+	u32 flags;
+	u32 start;              /* logical minimal id */
+	u32 last;               /* hint for find */
+	u32 max;
+	spinlock_t lock;
+	unsigned long *table;
+};
+
 struct c4iw_resource {
-	struct kfifo tpt_fifo;
-	spinlock_t tpt_fifo_lock;
-	struct kfifo qid_fifo;
-	spinlock_t qid_fifo_lock;
-	struct kfifo pdid_fifo;
-	spinlock_t pdid_fifo_lock;
+	struct c4iw_id_table tpt_table;
+	struct c4iw_id_table qid_table;
+	struct c4iw_id_table pdid_table;
 };
 
 struct c4iw_qid_list {
@@ -107,6 +115,7 @@ struct c4iw_stat {
 	u64 total;
 	u64 cur;
 	u64 max;
+	u64 fail;
 };
 
 struct c4iw_stats {
@@ -253,7 +262,7 @@ static inline int _insert_handle(struct c4iw_dev *rhp, struct idr *idr,
 		if (lock)
 			spin_lock_irq(&rhp->lock);
 		ret = idr_get_new_above(idr, handle, id, &newid);
-		BUG_ON(newid != id);
+		BUG_ON(!ret && newid != id);
 		if (lock)
 			spin_unlock_irq(&rhp->lock);
 	} while (ret == -EAGAIN);
@@ -755,14 +764,20 @@ static inline int compute_wscale(int win)
 	return wscale;
 }
 
+u32 c4iw_id_alloc(struct c4iw_id_table *alloc);
+void c4iw_id_free(struct c4iw_id_table *alloc, u32 obj);
+int c4iw_id_table_alloc(struct c4iw_id_table *alloc, u32 start, u32 num,
+			u32 reserved, u32 flags);
+void c4iw_id_table_free(struct c4iw_id_table *alloc);
+
 typedef int (*c4iw_handler_func)(struct c4iw_dev *dev, struct sk_buff *skb);
 
 int c4iw_ep_redirect(void *ctx, struct dst_entry *old, struct dst_entry *new,
 		     struct l2t_entry *l2t);
 void c4iw_put_qpid(struct c4iw_rdev *rdev, u32 qpid,
 		   struct c4iw_dev_ucontext *uctx);
-u32 c4iw_get_resource(struct kfifo *fifo, spinlock_t *lock);
-void c4iw_put_resource(struct kfifo *fifo, u32 entry, spinlock_t *lock);
+u32 c4iw_get_resource(struct c4iw_id_table *id_table);
+void c4iw_put_resource(struct c4iw_id_table *id_table, u32 entry);
 int c4iw_init_resource(struct c4iw_rdev *rdev, u32 nr_tpt, u32 nr_pdid);
 int c4iw_init_ctrl_qp(struct c4iw_rdev *rdev);
 int c4iw_pblpool_create(struct c4iw_rdev *rdev);
diff --git a/drivers/infiniband/hw/cxgb4/mem.c b/drivers/infiniband/hw/cxgb4/mem.c
index 2a87379..57e07c6 100644
--- a/drivers/infiniband/hw/cxgb4/mem.c
+++ b/drivers/infiniband/hw/cxgb4/mem.c
@@ -131,8 +131,7 @@ static int write_tpt_entry(struct c4iw_rdev *rdev, u32 reset_tpt_entry,
 	stag_idx = (*stag) >> 8;
 
 	if ((!reset_tpt_entry) && (*stag == T4_STAG_UNSET)) {
-		stag_idx = c4iw_get_resource(&rdev->resource.tpt_fifo,
-					     &rdev->resource.tpt_fifo_lock);
+		stag_idx = c4iw_get_resource(&rdev->resource.tpt_table);
 		if (!stag_idx)
 			return -ENOMEM;
 		mutex_lock(&rdev->stats.lock);
@@ -171,8 +170,7 @@ static int write_tpt_entry(struct c4iw_rdev *rdev, u32 reset_tpt_entry,
 				sizeof(tpt), &tpt);
 
 	if (reset_tpt_entry) {
-		c4iw_put_resource(&rdev->resource.tpt_fifo, stag_idx,
-				  &rdev->resource.tpt_fifo_lock);
+		c4iw_put_resource(&rdev->resource.tpt_table, stag_idx);
 		mutex_lock(&rdev->stats.lock);
 		rdev->stats.stag.cur -= 32;
 		mutex_unlock(&rdev->stats.lock);
@@ -695,8 +693,8 @@ int c4iw_dealloc_mw(struct ib_mw *mw)
 	mhp = to_c4iw_mw(mw);
 	rhp = mhp->rhp;
 	mmid = (mw->rkey) >> 8;
-	deallocate_window(&rhp->rdev, mhp->attr.stag);
 	remove_handle(rhp, &rhp->mmidr, mmid);
+	deallocate_window(&rhp->rdev, mhp->attr.stag);
 	kfree(mhp);
 	PDBG("%s ib_mw %p mmid 0x%x ptr %p\n", __func__, mw, mmid, mhp);
 	return 0;
@@ -798,12 +796,12 @@ int c4iw_dereg_mr(struct ib_mr *ib_mr)
 	mhp = to_c4iw_mr(ib_mr);
 	rhp = mhp->rhp;
 	mmid = mhp->attr.stag >> 8;
+	remove_handle(rhp, &rhp->mmidr, mmid);
 	dereg_mem(&rhp->rdev, mhp->attr.stag, mhp->attr.pbl_size,
 		       mhp->attr.pbl_addr);
 	if (mhp->attr.pbl_size)
 		c4iw_pblpool_free(&mhp->rhp->rdev, mhp->attr.pbl_addr,
 				  mhp->attr.pbl_size << 3);
-	remove_handle(rhp, &rhp->mmidr, mmid);
 	if (mhp->kva)
 		kfree((void *) (unsigned long) mhp->kva);
 	if (mhp->umem)
diff --git a/drivers/infiniband/hw/cxgb4/provider.c b/drivers/infiniband/hw/cxgb4/provider.c
index 8d58736..fe98a0a 100644
--- a/drivers/infiniband/hw/cxgb4/provider.c
+++ b/drivers/infiniband/hw/cxgb4/provider.c
@@ -188,8 +188,7 @@ static int c4iw_deallocate_pd(struct ib_pd *pd)
 	php = to_c4iw_pd(pd);
 	rhp = php->rhp;
 	PDBG("%s ibpd %p pdid 0x%x\n", __func__, pd, php->pdid);
-	c4iw_put_resource(&rhp->rdev.resource.pdid_fifo, php->pdid,
-			  &rhp->rdev.resource.pdid_fifo_lock);
+	c4iw_put_resource(&rhp->rdev.resource.pdid_table, php->pdid);
 	mutex_lock(&rhp->rdev.stats.lock);
 	rhp->rdev.stats.pd.cur--;
 	mutex_unlock(&rhp->rdev.stats.lock);
@@ -207,14 +206,12 @@ static struct ib_pd *c4iw_allocate_pd(struct ib_device *ibdev,
 
 	PDBG("%s ibdev %p\n", __func__, ibdev);
 	rhp = (struct c4iw_dev *) ibdev;
-	pdid =  c4iw_get_resource(&rhp->rdev.resource.pdid_fifo,
-				  &rhp->rdev.resource.pdid_fifo_lock);
+	pdid =  c4iw_get_resource(&rhp->rdev.resource.pdid_table);
 	if (!pdid)
 		return ERR_PTR(-EINVAL);
 	php = kzalloc(sizeof(*php), GFP_KERNEL);
 	if (!php) {
-		c4iw_put_resource(&rhp->rdev.resource.pdid_fifo, pdid,
-				  &rhp->rdev.resource.pdid_fifo_lock);
+		c4iw_put_resource(&rhp->rdev.resource.pdid_table, pdid);
 		return ERR_PTR(-ENOMEM);
 	}
 	php->pdid = pdid;
diff --git a/drivers/infiniband/hw/cxgb4/resource.c b/drivers/infiniband/hw/cxgb4/resource.c
index 1b948d1..cdef4d7 100644
--- a/drivers/infiniband/hw/cxgb4/resource.c
+++ b/drivers/infiniband/hw/cxgb4/resource.c
@@ -30,96 +30,25 @@
  * SOFTWARE.
  */
 /* Crude resource management */
-#include <linux/kernel.h>
-#include <linux/random.h>
-#include <linux/slab.h>
-#include <linux/kfifo.h>
 #include <linux/spinlock.h>
-#include <linux/errno.h>
 #include <linux/genalloc.h>
 #include <linux/ratelimit.h>
 #include "iw_cxgb4.h"
 
-#define RANDOM_SIZE 16
-
-static int __c4iw_init_resource_fifo(struct kfifo *fifo,
-				   spinlock_t *fifo_lock,
-				   u32 nr, u32 skip_low,
-				   u32 skip_high,
-				   int random)
-{
-	u32 i, j, entry = 0, idx;
-	u32 random_bytes;
-	u32 rarray[16];
-	spin_lock_init(fifo_lock);
-
-	if (kfifo_alloc(fifo, nr * sizeof(u32), GFP_KERNEL))
-		return -ENOMEM;
-
-	for (i = 0; i < skip_low + skip_high; i++)
-		kfifo_in(fifo, (unsigned char *) &entry, sizeof(u32));
-	if (random) {
-		j = 0;
-		random_bytes = random32();
-		for (i = 0; i < RANDOM_SIZE; i++)
-			rarray[i] = i + skip_low;
-		for (i = skip_low + RANDOM_SIZE; i < nr - skip_high; i++) {
-			if (j >= RANDOM_SIZE) {
-				j = 0;
-				random_bytes = random32();
-			}
-			idx = (random_bytes >> (j * 2)) & 0xF;
-			kfifo_in(fifo,
-				(unsigned char *) &rarray[idx],
-				sizeof(u32));
-			rarray[idx] = i;
-			j++;
-		}
-		for (i = 0; i < RANDOM_SIZE; i++)
-			kfifo_in(fifo,
-				(unsigned char *) &rarray[i],
-				sizeof(u32));
-	} else
-		for (i = skip_low; i < nr - skip_high; i++)
-			kfifo_in(fifo, (unsigned char *) &i, sizeof(u32));
-
-	for (i = 0; i < skip_low + skip_high; i++)
-		if (kfifo_out_locked(fifo, (unsigned char *) &entry,
-				     sizeof(u32), fifo_lock))
-			break;
-	return 0;
-}
-
-static int c4iw_init_resource_fifo(struct kfifo *fifo, spinlock_t * fifo_lock,
-				   u32 nr, u32 skip_low, u32 skip_high)
-{
-	return __c4iw_init_resource_fifo(fifo, fifo_lock, nr, skip_low,
-					  skip_high, 0);
-}
-
-static int c4iw_init_resource_fifo_random(struct kfifo *fifo,
-				   spinlock_t *fifo_lock,
-				   u32 nr, u32 skip_low, u32 skip_high)
-{
-	return __c4iw_init_resource_fifo(fifo, fifo_lock, nr, skip_low,
-					  skip_high, 1);
-}
-
-static int c4iw_init_qid_fifo(struct c4iw_rdev *rdev)
+static int c4iw_init_qid_table(struct c4iw_rdev *rdev)
 {
 	u32 i;
 
-	spin_lock_init(&rdev->resource.qid_fifo_lock);
-
-	if (kfifo_alloc(&rdev->resource.qid_fifo, rdev->lldi.vr->qp.size *
-			sizeof(u32), GFP_KERNEL))
+	if (c4iw_id_table_alloc(&rdev->resource.qid_table,
+				rdev->lldi.vr->qp.start,
+				rdev->lldi.vr->qp.size,
+				rdev->lldi.vr->qp.size, 0))
 		return -ENOMEM;
 
 	for (i = rdev->lldi.vr->qp.start;
-	     i < rdev->lldi.vr->qp.start + rdev->lldi.vr->qp.size; i++)
+		i < rdev->lldi.vr->qp.start + rdev->lldi.vr->qp.size; i++)
 		if (!(i & rdev->qpmask))
-			kfifo_in(&rdev->resource.qid_fifo,
-				    (unsigned char *) &i, sizeof(u32));
+			c4iw_id_free(&rdev->resource.qid_table, i);
 	return 0;
 }
 
@@ -127,44 +56,42 @@ static int c4iw_init_qid_fifo(struct c4iw_rdev *rdev)
 int c4iw_init_resource(struct c4iw_rdev *rdev, u32 nr_tpt, u32 nr_pdid)
 {
 	int err = 0;
-	err = c4iw_init_resource_fifo_random(&rdev->resource.tpt_fifo,
-					     &rdev->resource.tpt_fifo_lock,
-					     nr_tpt, 1, 0);
+	err = c4iw_id_table_alloc(&rdev->resource.tpt_table, 0, nr_tpt, 1,
+					C4IW_ID_TABLE_F_RANDOM);
 	if (err)
 		goto tpt_err;
-	err = c4iw_init_qid_fifo(rdev);
+	err = c4iw_init_qid_table(rdev);
 	if (err)
 		goto qid_err;
-	err = c4iw_init_resource_fifo(&rdev->resource.pdid_fifo,
-				      &rdev->resource.pdid_fifo_lock,
-				      nr_pdid, 1, 0);
+	err = c4iw_id_table_alloc(&rdev->resource.pdid_table, 0,
+					nr_pdid, 1, 0);
 	if (err)
 		goto pdid_err;
 	return 0;
-pdid_err:
-	kfifo_free(&rdev->resource.qid_fifo);
-qid_err:
-	kfifo_free(&rdev->resource.tpt_fifo);
-tpt_err:
+ pdid_err:
+	c4iw_id_table_free(&rdev->resource.qid_table);
+ qid_err:
+	c4iw_id_table_free(&rdev->resource.tpt_table);
+ tpt_err:
 	return -ENOMEM;
 }
 
 /*
  * returns 0 if no resource available
  */
-u32 c4iw_get_resource(struct kfifo *fifo, spinlock_t *lock)
+u32 c4iw_get_resource(struct c4iw_id_table *id_table)
 {
 	u32 entry;
-	if (kfifo_out_locked(fifo, (unsigned char *) &entry, sizeof(u32), lock))
-		return entry;
-	else
+	entry = c4iw_id_alloc(id_table);
+	if (entry == (u32)(-1))
 		return 0;
+	return entry;
 }
 
-void c4iw_put_resource(struct kfifo *fifo, u32 entry, spinlock_t *lock)
+void c4iw_put_resource(struct c4iw_id_table *id_table, u32 entry)
 {
 	PDBG("%s entry 0x%x\n", __func__, entry);
-	kfifo_in_locked(fifo, (unsigned char *) &entry, sizeof(u32), lock);
+	c4iw_id_free(id_table, entry);
 }
 
 u32 c4iw_get_cqid(struct c4iw_rdev *rdev, struct c4iw_dev_ucontext *uctx)
@@ -181,8 +108,7 @@ u32 c4iw_get_cqid(struct c4iw_rdev *rdev, struct c4iw_dev_ucontext *uctx)
 		qid = entry->qid;
 		kfree(entry);
 	} else {
-		qid = c4iw_get_resource(&rdev->resource.qid_fifo,
-					&rdev->resource.qid_fifo_lock);
+		qid = c4iw_get_resource(&rdev->resource.qid_table);
 		if (!qid)
 			goto out;
 		mutex_lock(&rdev->stats.lock);
@@ -252,8 +178,7 @@ u32 c4iw_get_qpid(struct c4iw_rdev *rdev, struct c4iw_dev_ucontext *uctx)
 		qid = entry->qid;
 		kfree(entry);
 	} else {
-		qid = c4iw_get_resource(&rdev->resource.qid_fifo,
-					&rdev->resource.qid_fifo_lock);
+		qid = c4iw_get_resource(&rdev->resource.qid_table);
 		if (!qid)
 			goto out;
 		mutex_lock(&rdev->stats.lock);
@@ -311,9 +236,9 @@ void c4iw_put_qpid(struct c4iw_rdev *rdev, u32 qid,
 
 void c4iw_destroy_resource(struct c4iw_resource *rscp)
 {
-	kfifo_free(&rscp->tpt_fifo);
-	kfifo_free(&rscp->qid_fifo);
-	kfifo_free(&rscp->pdid_fifo);
+	c4iw_id_table_free(&rscp->tpt_table);
+	c4iw_id_table_free(&rscp->qid_table);
+	c4iw_id_table_free(&rscp->pdid_table);
 }
 
 /*
@@ -326,16 +251,14 @@ u32 c4iw_pblpool_alloc(struct c4iw_rdev *rdev, int size)
 {
 	unsigned long addr = gen_pool_alloc(rdev->pbl_pool, size);
 	PDBG("%s addr 0x%x size %d\n", __func__, (u32)addr, size);
-	if (!addr)
-		printk_ratelimited(KERN_WARNING MOD "%s: Out of PBL memory\n",
-		       pci_name(rdev->lldi.pdev));
+	mutex_lock(&rdev->stats.lock);
 	if (addr) {
-		mutex_lock(&rdev->stats.lock);
 		rdev->stats.pbl.cur += roundup(size, 1 << MIN_PBL_SHIFT);
 		if (rdev->stats.pbl.cur > rdev->stats.pbl.max)
 			rdev->stats.pbl.max = rdev->stats.pbl.cur;
-		mutex_unlock(&rdev->stats.lock);
-	}
+	} else
+		rdev->stats.pbl.fail++;
+	mutex_unlock(&rdev->stats.lock);
 	return (u32)addr;
 }
 
@@ -401,13 +324,14 @@ u32 c4iw_rqtpool_alloc(struct c4iw_rdev *rdev, int size)
 	if (!addr)
 		printk_ratelimited(KERN_WARNING MOD "%s: Out of RQT memory\n",
 		       pci_name(rdev->lldi.pdev));
+	mutex_lock(&rdev->stats.lock);
 	if (addr) {
-		mutex_lock(&rdev->stats.lock);
 		rdev->stats.rqt.cur += roundup(size << 6, 1 << MIN_RQT_SHIFT);
 		if (rdev->stats.rqt.cur > rdev->stats.rqt.max)
 			rdev->stats.rqt.max = rdev->stats.rqt.cur;
-		mutex_unlock(&rdev->stats.lock);
-	}
+	} else
+		rdev->stats.rqt.fail++;
+	mutex_unlock(&rdev->stats.lock);
 	return (u32)addr;
 }
 
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply related

* [PATCH 06/10] RDMA/cxgb4: disable interrupts in c4iw_ev_dispatch().
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Use GFP_ATOMIC in _insert_handle() if interrupts are disabled.

Don't panic if we get an abort with no endpoint found. Just log a
warning.
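
As a rough userspace illustration of the "warn instead of panic"
behaviour, the sketch below handles a reply for a tid that no longer
maps to an endpoint.  All names (struct endpoint, tid_table,
register_ep(), lookup_ep(), handle_abort_reply()) are hypothetical and
only mirror the shape of the handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define MAX_TID 8

struct endpoint {
	unsigned int hwtid;
	int aborted;
};

/* Hypothetical tid -> endpoint map. */
static struct endpoint *tid_table[MAX_TID];

static void register_ep(struct endpoint *ep)
{
	assert(ep->hwtid < MAX_TID);
	tid_table[ep->hwtid] = ep;
}

static struct endpoint *lookup_ep(unsigned int tid)
{
	return tid < MAX_TID ? tid_table[tid] : NULL;
}

/* Returns 0 in both cases: a stale reply is logged and otherwise ignored. */
static int handle_abort_reply(unsigned int tid)
{
	struct endpoint *ep = lookup_ep(tid);

	if (!ep) {
		fprintf(stderr, "abort reply for freed endpoint, tid %u\n",
			tid);
		return 0;
	}
	/* Only touch endpoint fields after the NULL check. */
	ep->aborted = 1;
	return 0;
}
```

In this sketch nothing dereferences the endpoint until the NULL check
has passed, which is what makes the early return safe.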

Signed-off-by: Vipul Pandya <vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/infiniband/hw/cxgb4/cm.c       |    5 ++++-
 drivers/infiniband/hw/cxgb4/ev.c       |    8 ++++----
 drivers/infiniband/hw/cxgb4/iw_cxgb4.h |    2 +-
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/hw/cxgb4/cm.c b/drivers/infiniband/hw/cxgb4/cm.c
index 4c7c62f..6ce401a 100644
--- a/drivers/infiniband/hw/cxgb4/cm.c
+++ b/drivers/infiniband/hw/cxgb4/cm.c
@@ -1362,7 +1362,10 @@ static int abort_rpl(struct c4iw_dev *dev, struct sk_buff *skb)
 
 	ep = lookup_tid(t, tid);
 	PDBG("%s ep %p tid %u\n", __func__, ep, ep->hwtid);
-	BUG_ON(!ep);
+	if (!ep) {
+		printk(KERN_WARNING MOD "Abort rpl to freed endpoint\n");
+		return 0;
+	}
 	mutex_lock(&ep->com.mutex);
 	switch (ep->com.state) {
 	case ABORTING:
diff --git a/drivers/infiniband/hw/cxgb4/ev.c b/drivers/infiniband/hw/cxgb4/ev.c
index 397cb36..cf2f6b4 100644
--- a/drivers/infiniband/hw/cxgb4/ev.c
+++ b/drivers/infiniband/hw/cxgb4/ev.c
@@ -84,7 +84,7 @@ void c4iw_ev_dispatch(struct c4iw_dev *dev, struct t4_cqe *err_cqe)
 	struct c4iw_qp *qhp;
 	u32 cqid;
 
-	spin_lock(&dev->lock);
+	spin_lock_irq(&dev->lock);
 	qhp = get_qhp(dev, CQE_QPID(err_cqe));
 	if (!qhp) {
 		printk(KERN_ERR MOD "BAD AE qpid 0x%x opcode %d "
@@ -93,7 +93,7 @@ void c4iw_ev_dispatch(struct c4iw_dev *dev, struct t4_cqe *err_cqe)
 		       CQE_OPCODE(err_cqe), CQE_STATUS(err_cqe),
 		       CQE_TYPE(err_cqe), CQE_WRID_HI(err_cqe),
 		       CQE_WRID_LOW(err_cqe));
-		spin_unlock(&dev->lock);
+		spin_unlock_irq(&dev->lock);
 		goto out;
 	}
 
@@ -109,13 +109,13 @@ void c4iw_ev_dispatch(struct c4iw_dev *dev, struct t4_cqe *err_cqe)
 		       CQE_OPCODE(err_cqe), CQE_STATUS(err_cqe),
 		       CQE_TYPE(err_cqe), CQE_WRID_HI(err_cqe),
 		       CQE_WRID_LOW(err_cqe));
-		spin_unlock(&dev->lock);
+		spin_unlock_irq(&dev->lock);
 		goto out;
 	}
 
 	c4iw_qp_add_ref(&qhp->ibqp);
 	atomic_inc(&chp->refcnt);
-	spin_unlock(&dev->lock);
+	spin_unlock_irq(&dev->lock);
 
 	/* Bad incoming write */
 	if (RQ_TYPE(err_cqe) &&
diff --git a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
index a11ed5c..e8b88a0 100644
--- a/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
+++ b/drivers/infiniband/hw/cxgb4/iw_cxgb4.h
@@ -246,7 +246,7 @@ static inline int _insert_handle(struct c4iw_dev *rhp, struct idr *idr,
 	int newid;
 
 	do {
-		if (!idr_pre_get(idr, GFP_KERNEL))
+		if (!idr_pre_get(idr, lock ? GFP_KERNEL : GFP_ATOMIC))
 			return -ENOMEM;
 		if (lock)
 			spin_lock_irq(&rhp->lock);
-- 
1.7.1

^ permalink raw reply related

* [PATCH 03/10] cxgb4: DB Drop Recovery for RDMA and LLD queues.
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

recover LLD EQs for DB drop interrupts.  This includes adding a new
db_lock, a spin lock disabling BH too, used by the recovery thread and
the ring_tx_db() paths to allow db drop recovery.

cleaned up initial db avoidance code.

add read_eq_indices() - allows the LLD to use the pcie mw to efficiently
read hw eq contexts.
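
The index extraction can be tried out in userspace with the sketch
below.  The shift values (25 for cidx, 9 for pidx) and the 16-bit masks
are taken from read_eq_indices() in this patch; decode_eq_indices() and
its byte-array input are hypothetical and only model the 8 big-endian
bytes read through the memory window.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decode producer/consumer indices from 8 big-endian bytes, using the
 * same shifts and masks as read_eq_indices().
 */
static void decode_eq_indices(const uint8_t raw[8], uint16_t *pidx,
			      uint16_t *cidx)
{
	uint64_t v = 0;
	int i;

	assert(raw && pidx && cidx);
	/* Assemble byte by byte so host endianness does not matter. */
	for (i = 0; i < 8; i++)
		v = (v << 8) | raw[i];

	*cidx = (uint16_t)((v >> 25) & 0xffff);
	*pidx = (uint16_t)((v >> 9) & 0xffff);
}
```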

add cxgb4_sync_txq_pidx() - called by iw_cxgb4 to sync up the sw/hw pidx
value.
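
The amount written to the doorbell is the distance from the hardware
pidx to the software pidx on a ring of "size" entries.  A standalone
sketch of that arithmetic follows; pidx_delta() is a hypothetical helper
name, and the sketch assumes both indices are already smaller than size.

```c
#include <assert.h>
#include <stdint.h>

/* Number of ring entries the hardware has not yet been told about. */
static uint16_t pidx_delta(uint16_t sw_pidx, uint16_t hw_pidx, uint16_t size)
{
	assert(sw_pidx < size && hw_pidx < size);
	if (sw_pidx >= hw_pidx)
		return (uint16_t)(sw_pidx - hw_pidx);
	/* The software index has wrapped past the end of the ring. */
	return (uint16_t)(size - hw_pidx + sw_pidx);
}
```

For example, on a 64-entry ring a software pidx of 2 and a hardware pidx
of 60 give a delta of 6, the same as a software pidx of 10 and a
hardware pidx of 4.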

add flush_eq_cache() and cxgb4_flush_eq_cache().  This allows iw_cxgb4
to flush the sge eq context cache before beginning db drop recovery.

add module parameter, dbfifo_int_thresh, to allow tuning the db
interrupt threshold value.

add dbfifo_int_thresh to cxgb4_lld_info so iw_cxgb4 knows the threshold.

add module parameter, dbfifo_drain_delay, to allow tuning the amount
of time delay between DB FULL and EMPTY upcalls to iw_cxgb4.

Signed-off-by: Vipul Pandya <vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |   16 ++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |  214 +++++++++++++++++++----
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |    4 +
 drivers/net/ethernet/chelsio/cxgb4/sge.c        |   20 ++-
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h    |   53 ++++++
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h   |   15 ++
 6 files changed, 280 insertions(+), 42 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 5f3c0a7..ec2dafe 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -51,6 +51,8 @@
 #define FW_VERSION_MINOR 1
 #define FW_VERSION_MICRO 0
 
+#define CH_WARN(adap, fmt, ...) dev_warn(adap->pdev_dev, fmt, ## __VA_ARGS__)
+
 enum {
 	MAX_NPORTS = 4,     /* max # of ports */
 	SERNUM_LEN = 24,    /* Serial # length */
@@ -64,6 +66,15 @@ enum {
 	MEM_MC
 };
 
+enum {
+	MEMWIN0_APERTURE = 65536,
+	MEMWIN0_BASE     = 0x30000,
+	MEMWIN1_APERTURE = 32768,
+	MEMWIN1_BASE     = 0x28000,
+	MEMWIN2_APERTURE = 2048,
+	MEMWIN2_BASE     = 0x1b800,
+};
+
 enum dev_master {
 	MASTER_CANT,
 	MASTER_MAY,
@@ -403,6 +414,9 @@ struct sge_txq {
 	struct tx_sw_desc *sdesc;   /* address of SW Tx descriptor ring */
 	struct sge_qstat *stat;     /* queue status entry */
 	dma_addr_t    phys_addr;    /* physical address of the ring */
+	spinlock_t db_lock;
+	int db_disabled;
+	unsigned short db_pidx;
 };
 
 struct sge_eth_txq {                /* state for an SGE Ethernet Tx queue */
@@ -475,6 +489,7 @@ struct adapter {
 	void __iomem *regs;
 	struct pci_dev *pdev;
 	struct device *pdev_dev;
+	unsigned int mbox;
 	unsigned int fn;
 	unsigned int flags;
 
@@ -607,6 +622,7 @@ irqreturn_t t4_sge_intr_msix(int irq, void *cookie);
 void t4_sge_init(struct adapter *adap);
 void t4_sge_start(struct adapter *adap);
 void t4_sge_stop(struct adapter *adap);
+extern int dbfifo_int_thresh;
 
 #define for_each_port(adapter, iter) \
 	for (iter = 0; iter < (adapter)->params.nports; ++iter)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index c243f93..e1f96fb 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -149,15 +149,6 @@ static unsigned int pfvfres_pmask(struct adapter *adapter,
 #endif
 
 enum {
-	MEMWIN0_APERTURE = 65536,
-	MEMWIN0_BASE     = 0x30000,
-	MEMWIN1_APERTURE = 32768,
-	MEMWIN1_BASE     = 0x28000,
-	MEMWIN2_APERTURE = 2048,
-	MEMWIN2_BASE     = 0x1b800,
-};
-
-enum {
 	MAX_TXQ_ENTRIES      = 16384,
 	MAX_CTRL_TXQ_ENTRIES = 1024,
 	MAX_RSPQ_ENTRIES     = 16384,
@@ -371,6 +362,15 @@ static int set_addr_filters(const struct net_device *dev, bool sleep)
 				uhash | mhash, sleep);
 }
 
+int dbfifo_int_thresh = 10; /* 10 == 640 entry threshold */
+module_param(dbfifo_int_thresh, int, 0644);
+MODULE_PARM_DESC(dbfifo_int_thresh, "doorbell fifo interrupt threshold");
+
+int dbfifo_drain_delay = 1000; /* usecs to sleep while draining the dbfifo */
+module_param(dbfifo_drain_delay, int, 0644);
+MODULE_PARM_DESC(dbfifo_drain_delay,
+		 "usecs to sleep while draining the dbfifo");
+
 /*
  * Set Rx properties of a port, such as promiscruity, address filters, and MTU.
  * If @mtu is -1 it is left unchanged.
@@ -389,6 +389,8 @@ static int set_rxmode(struct net_device *dev, int mtu, bool sleep_ok)
 	return ret;
 }
 
+static struct workqueue_struct *workq;
+
 /**
  *	link_start - enable a port
  *	@dev: the port to enable
@@ -2196,7 +2198,7 @@ static void cxgb4_queue_tid_release(struct tid_info *t, unsigned int chan,
 	adap->tid_release_head = (void **)((uintptr_t)p | chan);
 	if (!adap->tid_release_task_busy) {
 		adap->tid_release_task_busy = true;
-		schedule_work(&adap->tid_release_task);
+		queue_work(workq, &adap->tid_release_task);
 	}
 	spin_unlock_bh(&adap->tid_release_lock);
 }
@@ -2423,6 +2425,59 @@ void cxgb4_iscsi_init(struct net_device *dev, unsigned int tag_mask,
 }
 EXPORT_SYMBOL(cxgb4_iscsi_init);
 
+int cxgb4_flush_eq_cache(struct net_device *dev)
+{
+	struct adapter *adap = netdev2adap(dev);
+	int ret;
+
+	ret = t4_fwaddrspace_write(adap, adap->mbox,
+				   0xe1000000 + A_SGE_CTXT_CMD, 0x20000000);
+	return ret;
+}
+EXPORT_SYMBOL(cxgb4_flush_eq_cache);
+
+static int read_eq_indices(struct adapter *adap, u16 qid, u16 *pidx, u16 *cidx)
+{
+	u32 addr = t4_read_reg(adap, A_SGE_DBQ_CTXT_BADDR) + 24 * qid + 8;
+	__be64 indices;
+	int ret;
+
+	ret = t4_mem_win_read_len(adap, addr, (__be32 *)&indices, 8);
+	if (!ret) {
+		indices = be64_to_cpu(indices);
+		*cidx = (indices >> 25) & 0xffff;
+		*pidx = (indices >> 9) & 0xffff;
+	}
+	return ret;
+}
+
+int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, u16 pidx,
+			u16 size)
+{
+	struct adapter *adap = netdev2adap(dev);
+	u16 hw_pidx, hw_cidx;
+	int ret;
+
+	ret = read_eq_indices(adap, qid, &hw_pidx, &hw_cidx);
+	if (ret)
+		goto out;
+
+	if (pidx != hw_pidx) {
+		u16 delta;
+
+		if (pidx >= hw_pidx)
+			delta = pidx - hw_pidx;
+		else
+			delta = size - hw_pidx + pidx;
+		wmb();
+		t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL),
+			     V_QID(qid) | V_PIDX(delta));
+	}
+out:
+	return ret;
+}
+EXPORT_SYMBOL(cxgb4_sync_txq_pidx);
+
 static struct pci_driver cxgb4_driver;
 
 static void check_neigh_update(struct neighbour *neigh)
@@ -2456,6 +2511,95 @@ static struct notifier_block cxgb4_netevent_nb = {
 	.notifier_call = netevent_cb
 };
 
+static void drain_db_fifo(struct adapter *adap, int usecs)
+{
+	u32 v;
+
+	do {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(usecs_to_jiffies(usecs));
+		v = t4_read_reg(adap, A_SGE_DBFIFO_STATUS);
+		if (G_LP_COUNT(v) == 0 && G_HP_COUNT(v) == 0)
+			break;
+	} while (1);
+}
+
+static void disable_txq_db(struct sge_txq *q)
+{
+	spin_lock_irq(&q->db_lock);
+	q->db_disabled = 1;
+	spin_unlock_irq(&q->db_lock);
+}
+
+static void enable_txq_db(struct sge_txq *q)
+{
+	spin_lock_irq(&q->db_lock);
+	q->db_disabled = 0;
+	spin_unlock_irq(&q->db_lock);
+}
+
+static void disable_dbs(struct adapter *adap)
+{
+	int i;
+
+	for_each_ethrxq(&adap->sge, i)
+		disable_txq_db(&adap->sge.ethtxq[i].q);
+	for_each_ofldrxq(&adap->sge, i)
+		disable_txq_db(&adap->sge.ofldtxq[i].q);
+	for_each_port(adap, i)
+		disable_txq_db(&adap->sge.ctrlq[i].q);
+}
+
+static void enable_dbs(struct adapter *adap)
+{
+	int i;
+
+	for_each_ethrxq(&adap->sge, i)
+		enable_txq_db(&adap->sge.ethtxq[i].q);
+	for_each_ofldrxq(&adap->sge, i)
+		enable_txq_db(&adap->sge.ofldtxq[i].q);
+	for_each_port(adap, i)
+		enable_txq_db(&adap->sge.ctrlq[i].q);
+}
+
+static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q)
+{
+	u16 hw_pidx, hw_cidx;
+	int ret;
+
+	spin_lock_bh(&q->db_lock);
+	ret = read_eq_indices(adap, (u16)q->cntxt_id, &hw_pidx, &hw_cidx);
+	if (ret)
+		goto out;
+	if (q->db_pidx != hw_pidx) {
+		u16 delta;
+
+		if (q->db_pidx >= hw_pidx)
+			delta = q->db_pidx - hw_pidx;
+		else
+			delta = q->size - hw_pidx + q->db_pidx;
+		wmb();
+		t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL),
+				V_QID(q->cntxt_id) | V_PIDX(delta));
+	}
+out:
+	q->db_disabled = 0;
+	spin_unlock_bh(&q->db_lock);
+	if (ret)
+		CH_WARN(adap, "DB drop recovery failed.\n");
+}
+static void recover_all_queues(struct adapter *adap)
+{
+	int i;
+
+	for_each_ethrxq(&adap->sge, i)
+		sync_txq_pidx(adap, &adap->sge.ethtxq[i].q);
+	for_each_ofldrxq(&adap->sge, i)
+		sync_txq_pidx(adap, &adap->sge.ofldtxq[i].q);
+	for_each_port(adap, i)
+		sync_txq_pidx(adap, &adap->sge.ctrlq[i].q);
+}
+
 static void notify_rdma_uld(struct adapter *adap, enum cxgb4_control cmd)
 {
 	mutex_lock(&uld_mutex);
@@ -2468,55 +2612,41 @@ static void notify_rdma_uld(struct adapter *adap, enum cxgb4_control cmd)
 static void process_db_full(struct work_struct *work)
 {
 	struct adapter *adap;
-	static int delay = 1000;
-	u32 v;
 
 	adap = container_of(work, struct adapter, db_full_task);
 
-
-	/* stop LLD queues */
-
 	notify_rdma_uld(adap, CXGB4_CONTROL_DB_FULL);
-	do {
-		set_current_state(TASK_UNINTERRUPTIBLE);
-		schedule_timeout(usecs_to_jiffies(delay));
-		v = t4_read_reg(adap, A_SGE_DBFIFO_STATUS);
-		if (G_LP_COUNT(v) == 0 && G_HP_COUNT(v) == 0)
-			break;
-	} while (1);
+	drain_db_fifo(adap, dbfifo_drain_delay);
+	t4_set_reg_field(adap, A_SGE_INT_ENABLE3,
+			F_DBFIFO_HP_INT | F_DBFIFO_LP_INT,
+			F_DBFIFO_HP_INT | F_DBFIFO_LP_INT);
 	notify_rdma_uld(adap, CXGB4_CONTROL_DB_EMPTY);
-
-
-	/*
-	 * The more we get db full interrupts, the more we'll delay
-	 * in re-enabling db rings on queues, capped off at 200ms.
-	 */
-	delay = min(delay << 1, 200000);
-
-	/* resume LLD queues */
 }
 
 static void process_db_drop(struct work_struct *work)
 {
 	struct adapter *adap;
-	adap = container_of(work, struct adapter, db_drop_task);
 
+	adap = container_of(work, struct adapter, db_drop_task);
 
-	/*
-	 * sync the PIDX values in HW and SW for LLD queues.
-	 */
-
+	t4_set_reg_field(adap, A_SGE_DOORBELL_CONTROL, F_DROPPED_DB, 0);
+	disable_dbs(adap);
 	notify_rdma_uld(adap, CXGB4_CONTROL_DB_DROP);
+	drain_db_fifo(adap, 1);
+	recover_all_queues(adap);
+	enable_dbs(adap);
 }
 
 void t4_db_full(struct adapter *adap)
 {
-	schedule_work(&adap->db_full_task);
+	t4_set_reg_field(adap, A_SGE_INT_ENABLE3,
+			F_DBFIFO_HP_INT | F_DBFIFO_LP_INT, 0);
+	queue_work(workq, &adap->db_full_task);
 }
 
 void t4_db_dropped(struct adapter *adap)
 {
-	schedule_work(&adap->db_drop_task);
+	queue_work(workq, &adap->db_drop_task);
 }
 
 static void uld_attach(struct adapter *adap, unsigned int uld)
@@ -2552,6 +2682,7 @@ static void uld_attach(struct adapter *adap, unsigned int uld)
 	lli.gts_reg = adap->regs + MYPF_REG(SGE_PF_GTS);
 	lli.db_reg = adap->regs + MYPF_REG(SGE_PF_KDOORBELL);
 	lli.fw_vers = adap->params.fw_vers;
+	lli.dbfifo_int_thresh = dbfifo_int_thresh;
 
 	handle = ulds[uld].add(&lli);
 	if (IS_ERR(handle)) {
@@ -3668,6 +3799,7 @@ static int __devinit init_one(struct pci_dev *pdev,
 
 	adapter->pdev = pdev;
 	adapter->pdev_dev = &pdev->dev;
+	adapter->mbox = func;
 	adapter->fn = func;
 	adapter->msg_enable = dflt_msg_enable;
 	memset(adapter->chan_map, 0xff, sizeof(adapter->chan_map));
@@ -3865,6 +3997,10 @@ static int __init cxgb4_init_module(void)
 {
 	int ret;
 
+	workq = create_singlethread_workqueue("cxgb4");
+	if (!workq)
+		return -ENOMEM;
+
 	/* Debugfs support is optional, just warn if this fails */
 	cxgb4_debugfs_root = debugfs_create_dir(KBUILD_MODNAME, NULL);
 	if (!cxgb4_debugfs_root)
@@ -3880,6 +4016,8 @@ static void __exit cxgb4_cleanup_module(void)
 {
 	pci_unregister_driver(&cxgb4_driver);
 	debugfs_remove(cxgb4_debugfs_root);  /* NULL ok */
+	flush_workqueue(workq);
+	destroy_workqueue(workq);
 }
 
 module_init(cxgb4_init_module);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
index 5cc2f27..d79980c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
@@ -218,6 +218,7 @@ struct cxgb4_lld_info {
 	unsigned short ucq_density;          /* # of user CQs/page */
 	void __iomem *gts_reg;               /* address of GTS register */
 	void __iomem *db_reg;                /* address of kernel doorbell */
+	int dbfifo_int_thresh;		     /* doorbell fifo int threshold */
 };
 
 struct cxgb4_uld_info {
@@ -226,6 +227,7 @@ struct cxgb4_uld_info {
 	int (*rx_handler)(void *handle, const __be64 *rsp,
 			  const struct pkt_gl *gl);
 	int (*state_change)(void *handle, enum cxgb4_state new_state);
+	int (*control)(void *handle, enum cxgb4_control control, ...);
 };
 
 int cxgb4_register_uld(enum cxgb4_uld type, const struct cxgb4_uld_info *p);
@@ -243,4 +245,6 @@ void cxgb4_iscsi_init(struct net_device *dev, unsigned int tag_mask,
 		      const unsigned int *pgsz_order);
 struct sk_buff *cxgb4_pktgl_to_skb(const struct pkt_gl *gl,
 				   unsigned int skb_len, unsigned int pull_len);
+int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, u16 pidx, u16 size);
+int cxgb4_flush_eq_cache(struct net_device *dev);
 #endif  /* !__CXGB4_OFLD_H */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 234c157..e111d97 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -767,8 +767,13 @@ static void write_sgl(const struct sk_buff *skb, struct sge_txq *q,
 static inline void ring_tx_db(struct adapter *adap, struct sge_txq *q, int n)
 {
 	wmb();            /* write descriptors before telling HW */
-	t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
-		     QID(q->cntxt_id) | PIDX(n));
+	spin_lock(&q->db_lock);
+	if (!q->db_disabled) {
+		t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL),
+			     V_QID(q->cntxt_id) | V_PIDX(n));
+	}
+	q->db_pidx = q->pidx;
+	spin_unlock(&q->db_lock);
 }
 
 /**
@@ -2081,6 +2086,7 @@ static void init_txq(struct adapter *adap, struct sge_txq *q, unsigned int id)
 	q->stops = q->restarts = 0;
 	q->stat = (void *)&q->desc[q->size];
 	q->cntxt_id = id;
+	spin_lock_init(&q->db_lock);
 	adap->sge.egr_map[id - adap->sge.egr_start] = q;
 }
 
@@ -2415,9 +2421,15 @@ void t4_sge_init(struct adapter *adap)
 			 RXPKTCPLMODE |
 			 (STAT_LEN == 128 ? EGRSTATUSPAGESIZE : 0));
 
+	/*
+	 * Set up to drop DOORBELL writes when the DOORBELL FIFO overflows
+	 * and generate an interrupt when this occurs so we can recover.
+	 */
 	t4_set_reg_field(adap, A_SGE_DBFIFO_STATUS,
-			V_HP_INT_THRESH(5) | V_LP_INT_THRESH(5),
-			V_HP_INT_THRESH(5) | V_LP_INT_THRESH(5));
+			V_HP_INT_THRESH(M_HP_INT_THRESH) |
+			V_LP_INT_THRESH(M_LP_INT_THRESH),
+			V_HP_INT_THRESH(dbfifo_int_thresh) |
+			V_LP_INT_THRESH(dbfifo_int_thresh));
 	t4_set_reg_field(adap, A_SGE_DOORBELL_CONTROL, F_ENABLE_DROP,
 			F_ENABLE_DROP);
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
index 0adc5bc..111fc32 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_regs.h
@@ -190,6 +190,59 @@
 #define SGE_DEBUG_DATA_LOW 0x10d4
 #define SGE_INGRESS_QUEUES_PER_PAGE_PF 0x10f4
 
+#define S_LP_INT_THRESH    12
+#define V_LP_INT_THRESH(x) ((x) << S_LP_INT_THRESH)
+#define S_HP_INT_THRESH    28
+#define V_HP_INT_THRESH(x) ((x) << S_HP_INT_THRESH)
+#define A_SGE_DBFIFO_STATUS 0x10a4
+
+#define S_ENABLE_DROP    13
+#define V_ENABLE_DROP(x) ((x) << S_ENABLE_DROP)
+#define F_ENABLE_DROP    V_ENABLE_DROP(1U)
+#define A_SGE_DOORBELL_CONTROL 0x10a8
+
+#define A_SGE_CTXT_CMD 0x11fc
+#define A_SGE_DBQ_CTXT_BADDR 0x1084
+
+#define A_SGE_PF_KDOORBELL 0x0
+
+#define S_QID 15
+#define V_QID(x) ((x) << S_QID)
+
+#define S_PIDX 0
+#define V_PIDX(x) ((x) << S_PIDX)
+
+#define M_LP_COUNT 0x7ffU
+#define S_LP_COUNT 0
+#define G_LP_COUNT(x) (((x) >> S_LP_COUNT) & M_LP_COUNT)
+
+#define M_HP_COUNT 0x7ffU
+#define S_HP_COUNT 16
+#define G_HP_COUNT(x) (((x) >> S_HP_COUNT) & M_HP_COUNT)
+
+#define A_SGE_INT_ENABLE3 0x1040
+
+#define S_DBFIFO_HP_INT 8
+#define V_DBFIFO_HP_INT(x) ((x) << S_DBFIFO_HP_INT)
+#define F_DBFIFO_HP_INT V_DBFIFO_HP_INT(1U)
+
+#define S_DBFIFO_LP_INT 7
+#define V_DBFIFO_LP_INT(x) ((x) << S_DBFIFO_LP_INT)
+#define F_DBFIFO_LP_INT V_DBFIFO_LP_INT(1U)
+
+#define S_DROPPED_DB 0
+#define V_DROPPED_DB(x) ((x) << S_DROPPED_DB)
+#define F_DROPPED_DB V_DROPPED_DB(1U)
+
+#define S_ERR_DROPPED_DB 18
+#define V_ERR_DROPPED_DB(x) ((x) << S_ERR_DROPPED_DB)
+#define F_ERR_DROPPED_DB V_ERR_DROPPED_DB(1U)
+
+#define A_PCIE_MEM_ACCESS_OFFSET 0x306c
+
+#define M_HP_INT_THRESH 0xfU
+#define M_LP_INT_THRESH 0xfU
+
 #define PCIE_PF_CLI 0x44
 #define PCIE_INT_CAUSE 0x3004
 #define  UNXSPLCPLERR  0x20000000U
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
index edcfd7e..ad53f79 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
@@ -1620,4 +1620,19 @@ struct fw_hdr {
 #define FW_HDR_FW_VER_MINOR_GET(x) (((x) >> 16) & 0xff)
 #define FW_HDR_FW_VER_MICRO_GET(x) (((x) >> 8) & 0xff)
 #define FW_HDR_FW_VER_BUILD_GET(x) (((x) >> 0) & 0xff)
+
+#define S_FW_CMD_OP 24
+#define V_FW_CMD_OP(x) ((x) << S_FW_CMD_OP)
+
+#define S_FW_CMD_REQUEST 23
+#define V_FW_CMD_REQUEST(x) ((x) << S_FW_CMD_REQUEST)
+#define F_FW_CMD_REQUEST V_FW_CMD_REQUEST(1U)
+
+#define S_FW_CMD_WRITE 21
+#define V_FW_CMD_WRITE(x) ((x) << S_FW_CMD_WRITE)
+#define F_FW_CMD_WRITE V_FW_CMD_WRITE(1U)
+
+#define S_FW_LDST_CMD_ADDRSPACE 0
+#define V_FW_LDST_CMD_ADDRSPACE(x) ((x) << S_FW_LDST_CMD_ADDRSPACE)
+
 #endif /* _T4FW_INTERFACE_H_ */
-- 
1.7.1

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
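
A note on the index arithmetic in sync_txq_pidx() and
cxgb4_sync_txq_pidx() above: as the code reads, the doorbell write
carries an increment rather than an absolute index, so recovery writes
how far the software producer index is ahead of the hardware one,
modulo the ring size. Below is a minimal standalone sketch of that
computation. The shift values are taken from the S_QID and S_PIDX
definitions in this patch; the function names pidx_delta() and
doorbell_value() are invented for illustration and are not part of the
driver.

```c
#include <assert.h>
#include <stdint.h>

#define S_QID  15   /* from the t4_regs.h hunk in this patch */
#define S_PIDX 0    /* from the t4_regs.h hunk in this patch */

/*
 * Hypothetical helper: number of ring entries by which the software
 * producer index is ahead of the hardware producer index, taking
 * wrap-around at 'size' into account.
 */
static uint16_t pidx_delta(uint16_t sw_pidx, uint16_t hw_pidx, uint16_t size)
{
	assert(size > 0 && sw_pidx < size && hw_pidx < size);

	if (sw_pidx >= hw_pidx)
		return (uint16_t)(sw_pidx - hw_pidx);
	/* software index has wrapped past the end of the ring */
	return (uint16_t)(size - hw_pidx + sw_pidx);
}

/*
 * Hypothetical helper: compose the value written to the doorbell
 * register, mirroring V_QID(qid) | V_PIDX(delta).
 */
static uint32_t doorbell_value(uint32_t qid, uint16_t delta)
{
	return (qid << S_QID) | ((uint32_t)delta << S_PIDX);
}
```

For a 64-entry ring with the hardware index at 60 and the software
index at 2, pidx_delta() yields 6, the same result the else branch in
the patch computes.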

^ permalink raw reply related

* [PATCH 02/10] cxgb4: Common platform specific changes for DB Drop Recovery
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Add platform-specific callback functions for interrupts.  This is
needed so that the CAUSE register is read and cleared only once, after
which the platform-specific functions for DB threshold interrupts and
DB drop interrupts are called.

Add t4_mem_win_read_len(), which does memory-window reads of arbitrary
length.  It is used to read the CIDX/PIDX values from EQ contexts
during DB drop recovery.

Add t4_fwaddrspace_write(), which sends address-space write commands to
the firmware.  It is needed to flush the SGE EQ context cache.

Signed-off-by: Vipul Pandya <vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h |    3 +
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c |   69 +++++++++++++++++++++++----
 2 files changed, 61 insertions(+), 11 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index f91b259..5f3c0a7 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -723,4 +723,7 @@ int t4_ofld_eq_free(struct adapter *adap, unsigned int mbox, unsigned int pf,
 int t4_handle_fw_rpl(struct adapter *adap, const __be64 *rpl);
 void t4_db_full(struct adapter *adapter);
 void t4_db_dropped(struct adapter *adapter);
+int t4_mem_win_read_len(struct adapter *adap, u32 addr, __be32 *data, int len);
+int t4_fwaddrspace_write(struct adapter *adap, unsigned int mbox,
+			 u32 addr, u32 val);
 #endif /* __CXGB4_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index 13609bf..32e1dd5 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -868,11 +868,14 @@ int t4_restart_aneg(struct adapter *adap, unsigned int mbox, unsigned int port)
 	return t4_wr_mbox(adap, mbox, &c, sizeof(c), NULL);
 }
 
+typedef void (*int_handler_t)(struct adapter *adap);
+
 struct intr_info {
 	unsigned int mask;       /* bits to check in interrupt status */
 	const char *msg;         /* message to print or NULL */
 	short stat_idx;          /* stat counter to increment or -1 */
 	unsigned short fatal;    /* whether the condition reported is fatal */
+	int_handler_t int_handler; /* platform-specific int handler */
 };
 
 /**
@@ -905,6 +908,8 @@ static int t4_handle_intr_status(struct adapter *adapter, unsigned int reg,
 		} else if (acts->msg && printk_ratelimit())
 			dev_warn(adapter->pdev_dev, "%s (0x%x)\n", acts->msg,
 				 status & acts->mask);
+		if (acts->int_handler)
+			acts->int_handler(adapter);
 		mask |= acts->mask;
 	}
 	status &= mask;
@@ -1013,9 +1018,9 @@ static void sge_intr_handler(struct adapter *adapter)
 		{ ERR_INVALID_CIDX_INC,
 		  "SGE GTS CIDX increment too large", -1, 0 },
 		{ ERR_CPL_OPCODE_0, "SGE received 0-length CPL", -1, 0 },
-		{ F_DBFIFO_LP_INT, NULL, -1, 0 },
-		{ F_DBFIFO_HP_INT, NULL, -1, 0 },
-		{ ERR_DROPPED_DB, "SGE doorbell dropped", -1, 0 },
+		{ F_DBFIFO_LP_INT, NULL, -1, 0, t4_db_full },
+		{ F_DBFIFO_HP_INT, NULL, -1, 0, t4_db_full },
+		{ F_ERR_DROPPED_DB, NULL, -1, 0, t4_db_dropped },
 		{ ERR_DATA_CPL_ON_HIGH_QID1 | ERR_DATA_CPL_ON_HIGH_QID0,
 		  "SGE IQID > 1023 received CPL for FL", -1, 0 },
 		{ ERR_BAD_DB_PIDX3, "SGE DBP 3 pidx increment too large", -1,
@@ -1036,20 +1041,14 @@ static void sge_intr_handler(struct adapter *adapter)
 	};
 
 	v = (u64)t4_read_reg(adapter, SGE_INT_CAUSE1) |
-	    ((u64)t4_read_reg(adapter, SGE_INT_CAUSE2) << 32);
+		((u64)t4_read_reg(adapter, SGE_INT_CAUSE2) << 32);
 	if (v) {
 		dev_alert(adapter->pdev_dev, "SGE parity error (%#llx)\n",
-			 (unsigned long long)v);
+				(unsigned long long)v);
 		t4_write_reg(adapter, SGE_INT_CAUSE1, v);
 		t4_write_reg(adapter, SGE_INT_CAUSE2, v >> 32);
 	}
 
-	err = t4_read_reg(adapter, A_SGE_INT_CAUSE3);
-	if (err & (F_DBFIFO_HP_INT|F_DBFIFO_LP_INT))
-		t4_db_full(adapter);
-	if (err & F_ERR_DROPPED_DB)
-		t4_db_dropped(adapter);
-
 	if (t4_handle_intr_status(adapter, SGE_INT_CAUSE3, sge_intr_info) ||
 	    v != 0)
 		t4_fatal_err(adapter);
@@ -1995,6 +1994,54 @@ int t4_wol_pat_enable(struct adapter *adap, unsigned int port, unsigned int map,
 	(var).retval_len16 = htonl(FW_LEN16(var)); \
 } while (0)
 
+int t4_fwaddrspace_write(struct adapter *adap, unsigned int mbox,
+			  u32 addr, u32 val)
+{
+	struct fw_ldst_cmd c;
+
+	memset(&c, 0, sizeof(c));
+	c.op_to_addrspace = htonl(V_FW_CMD_OP(FW_LDST_CMD) | F_FW_CMD_REQUEST |
+			    F_FW_CMD_WRITE |
+			    V_FW_LDST_CMD_ADDRSPACE(FW_LDST_ADDRSPC_FIRMWARE));
+	c.cycles_to_len16 = htonl(FW_LEN16(c));
+	c.u.addrval.addr = htonl(addr);
+	c.u.addrval.val = htonl(val);
+
+	return t4_wr_mbox(adap, mbox, &c, sizeof(c), NULL);
+}
+
+/*
+ *     t4_mem_win_read_len - read memory through PCIE memory window
+ *     @adap: the adapter
+ *     @addr: address of first byte requested aligned on 32b.
+ *     @data: len bytes to hold the data read
+ *     @len: amount of data to read from window.  Must be <=
+ *            MEMWIN0_APERTURE after adjusting for 16B alignment
+ *            requirements of the memory window.
+ *
+ *     Read len bytes of data from MC starting at @addr.
+ */
+int t4_mem_win_read_len(struct adapter *adap, u32 addr, __be32 *data, int len)
+{
+	int i;
+	int off;
+
+	/*
+	 * Align on a 16B boundary.
+	 */
+	off = addr & 15;
+	if ((addr & 3) || (len + off) > MEMWIN0_APERTURE)
+		return -EINVAL;
+
+	t4_write_reg(adap, A_PCIE_MEM_ACCESS_OFFSET, addr & ~15);
+	t4_read_reg(adap, A_PCIE_MEM_ACCESS_OFFSET);
+
+	for (i = 0; i < len; i += 4)
+		*data++ = t4_read_reg(adap, (MEMWIN0_BASE + off + i));
+
+	return 0;
+}
+
 /**
  *	t4_mdio_rd - read a PHY register through MDIO
  *	@adap: the adapter
-- 
1.7.1
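
For readers following the int_handler change above, here is a minimal
standalone sketch of the table-driven dispatch pattern that the patch
extends. It is not the driver code: struct dev_state, struct
intr_action, handle_cause() and the two callbacks are hypothetical
stand-ins for struct adapter, struct intr_info,
t4_handle_intr_status(), t4_db_full() and t4_db_dropped(). The bit
positions are assumed from the S_DBFIFO_LP_INT, S_DBFIFO_HP_INT and
S_ERR_DROPPED_DB definitions quoted earlier in this series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct adapter: just counts events. */
struct dev_state {
	unsigned int db_full_events;
	unsigned int db_drop_events;
};

typedef void (*int_handler_t)(struct dev_state *dev);

/* Hypothetical, trimmed-down version of struct intr_info. */
struct intr_action {
	uint32_t mask;             /* bits to check in the cause value */
	int fatal;                 /* whether the condition is fatal */
	int_handler_t int_handler; /* optional callback, may be NULL */
};

#define DBFIFO_LP_INT  (1U << 7)
#define DBFIFO_HP_INT  (1U << 8)
#define ERR_DROPPED_DB (1U << 18)

static void on_db_full(struct dev_state *dev) { dev->db_full_events++; }
static void on_db_drop(struct dev_state *dev) { dev->db_drop_events++; }

static const struct intr_action sge_actions[] = {
	{ DBFIFO_LP_INT,  0, on_db_full },
	{ DBFIFO_HP_INT,  0, on_db_full },
	{ ERR_DROPPED_DB, 0, on_db_drop },
	{ 0, 0, NULL }             /* terminator: mask == 0 */
};

/*
 * Walk the table once over a single snapshot of the cause value.
 * Returns the number of fatal conditions seen and stores in *handled
 * the subset of cause bits that the table knows about (the bits a
 * real driver would write back to clear).
 */
static int handle_cause(struct dev_state *dev, uint32_t cause,
			const struct intr_action *acts, uint32_t *handled)
{
	uint32_t mask = 0;
	int fatal = 0;

	assert(dev && acts && handled);

	for (; acts->mask; acts++) {
		if (!(cause & acts->mask))
			continue;
		if (acts->fatal)
			fatal++;
		if (acts->int_handler)
			acts->int_handler(dev);
		mask |= acts->mask;
	}
	*handled = cause & mask;
	return fatal;
}
```

The point of the design is that the cause value is sampled once and
every action, including the callbacks, is driven from that one
snapshot. In this sketch a cause value with both FIFO bits set runs
on_db_full() twice, once per table entry.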


^ permalink raw reply related

* [PATCH 01/10] cxgb4: Detect DB FULL events and notify RDMA ULD.
From: Vipul Pandya @ 2012-05-18  9:59 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: roland-BHEL68pLQRGGvPXPguhicg, davem-fT/PcQaiUtIeIZ0/mPfg9Q,
	divy-ut6Up61K2wZBDgjK7y7TUQ, dm-ut6Up61K2wZBDgjK7y7TUQ,
	kumaras-ut6Up61K2wZBDgjK7y7TUQ,
	swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW, Vipul Pandya
In-Reply-To: <1337335173-3226-1-git-send-email-vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>

Signed-off-by: Vipul Pandya <vipul-ut6Up61K2wZBDgjK7y7TUQ@public.gmane.org>
Signed-off-by: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
---
 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h      |    4 +
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   77 +++++++++++++++++++++++
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h  |    7 ++
 drivers/net/ethernet/chelsio/cxgb4/sge.c        |    6 ++
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c      |    9 +++
 5 files changed, 103 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 0fe1885..f91b259 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -504,6 +504,8 @@ struct adapter {
 	void **tid_release_head;
 	spinlock_t tid_release_lock;
 	struct work_struct tid_release_task;
+	struct work_struct db_full_task;
+	struct work_struct db_drop_task;
 	bool tid_release_task_busy;
 
 	struct dentry *debugfs_root;
@@ -719,4 +721,6 @@ int t4_ctrl_eq_free(struct adapter *adap, unsigned int mbox, unsigned int pf,
 int t4_ofld_eq_free(struct adapter *adap, unsigned int mbox, unsigned int pf,
 		    unsigned int vf, unsigned int eqid);
 int t4_handle_fw_rpl(struct adapter *adap, const __be64 *rpl);
+void t4_db_full(struct adapter *adapter);
+void t4_db_dropped(struct adapter *adapter);
 #endif /* __CXGB4_H__ */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index b126b98..c243f93 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2366,6 +2366,16 @@ unsigned int cxgb4_port_chan(const struct net_device *dev)
 }
 EXPORT_SYMBOL(cxgb4_port_chan);
 
+unsigned int cxgb4_dbfifo_count(const struct net_device *dev, int lpfifo)
+{
+	struct adapter *adap = netdev2adap(dev);
+	u32 v;
+
+	v = t4_read_reg(adap, A_SGE_DBFIFO_STATUS);
+	return lpfifo ? G_LP_COUNT(v) : G_HP_COUNT(v);
+}
+EXPORT_SYMBOL(cxgb4_dbfifo_count);
+
 /**
  *	cxgb4_port_viid - get the VI id of a port
  *	@dev: the net device for the port
@@ -2446,6 +2456,69 @@ static struct notifier_block cxgb4_netevent_nb = {
 	.notifier_call = netevent_cb
 };
 
+static void notify_rdma_uld(struct adapter *adap, enum cxgb4_control cmd)
+{
+	mutex_lock(&uld_mutex);
+	if (adap->uld_handle[CXGB4_ULD_RDMA])
+		ulds[CXGB4_ULD_RDMA].control(adap->uld_handle[CXGB4_ULD_RDMA],
+				cmd);
+	mutex_unlock(&uld_mutex);
+}
+
+static void process_db_full(struct work_struct *work)
+{
+	struct adapter *adap;
+	static int delay = 1000;
+	u32 v;
+
+	adap = container_of(work, struct adapter, db_full_task);
+
+
+	/* stop LLD queues */
+
+	notify_rdma_uld(adap, CXGB4_CONTROL_DB_FULL);
+	do {
+		set_current_state(TASK_UNINTERRUPTIBLE);
+		schedule_timeout(usecs_to_jiffies(delay));
+		v = t4_read_reg(adap, A_SGE_DBFIFO_STATUS);
+		if (G_LP_COUNT(v) == 0 && G_HP_COUNT(v) == 0)
+			break;
+	} while (1);
+	notify_rdma_uld(adap, CXGB4_CONTROL_DB_EMPTY);
+
+
+	/*
+	 * The more we get db full interrupts, the more we'll delay
+	 * in re-enabling db rings on queues, capped off at 200ms.
+	 */
+	delay = min(delay << 1, 200000);
+
+	/* resume LLD queues */
+}
+
+static void process_db_drop(struct work_struct *work)
+{
+	struct adapter *adap;
+	adap = container_of(work, struct adapter, db_drop_task);
+
+
+	/*
+	 * sync the PIDX values in HW and SW for LLD queues.
+	 */
+
+	notify_rdma_uld(adap, CXGB4_CONTROL_DB_DROP);
+}
+
+void t4_db_full(struct adapter *adap)
+{
+	schedule_work(&adap->db_full_task);
+}
+
+void t4_db_dropped(struct adapter *adap)
+{
+	schedule_work(&adap->db_drop_task);
+}
+
 static void uld_attach(struct adapter *adap, unsigned int uld)
 {
 	void *handle;
@@ -2649,6 +2722,8 @@ static void cxgb_down(struct adapter *adapter)
 {
 	t4_intr_disable(adapter);
 	cancel_work_sync(&adapter->tid_release_task);
+	cancel_work_sync(&adapter->db_full_task);
+	cancel_work_sync(&adapter->db_drop_task);
 	adapter->tid_release_task_busy = false;
 	adapter->tid_release_head = NULL;
 
@@ -3601,6 +3676,8 @@ static int __devinit init_one(struct pci_dev *pdev,
 	spin_lock_init(&adapter->tid_release_lock);
 
 	INIT_WORK(&adapter->tid_release_task, process_tid_release_list);
+	INIT_WORK(&adapter->db_full_task, process_db_full);
+	INIT_WORK(&adapter->db_drop_task, process_db_drop);
 
 	err = t4_prep_adapter(adapter);
 	if (err)
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
index b1d39b8..5cc2f27 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_uld.h
@@ -163,6 +163,12 @@ enum cxgb4_state {
 	CXGB4_STATE_DETACH
 };
 
+enum cxgb4_control {
+	CXGB4_CONTROL_DB_FULL,
+	CXGB4_CONTROL_DB_EMPTY,
+	CXGB4_CONTROL_DB_DROP,
+};
+
 struct pci_dev;
 struct l2t_data;
 struct net_device;
@@ -225,6 +231,7 @@ struct cxgb4_uld_info {
 int cxgb4_register_uld(enum cxgb4_uld type, const struct cxgb4_uld_info *p);
 int cxgb4_unregister_uld(enum cxgb4_uld type);
 int cxgb4_ofld_send(struct net_device *dev, struct sk_buff *skb);
+unsigned int cxgb4_dbfifo_count(const struct net_device *dev, int lpfifo);
 unsigned int cxgb4_port_chan(const struct net_device *dev);
 unsigned int cxgb4_port_viid(const struct net_device *dev);
 unsigned int cxgb4_port_idx(const struct net_device *dev);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index 2dae795..234c157 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -2415,6 +2415,12 @@ void t4_sge_init(struct adapter *adap)
 			 RXPKTCPLMODE |
 			 (STAT_LEN == 128 ? EGRSTATUSPAGESIZE : 0));
 
+	t4_set_reg_field(adap, A_SGE_DBFIFO_STATUS,
+			V_HP_INT_THRESH(5) | V_LP_INT_THRESH(5),
+			V_HP_INT_THRESH(5) | V_LP_INT_THRESH(5));
+	t4_set_reg_field(adap, A_SGE_DOORBELL_CONTROL, F_ENABLE_DROP,
+			F_ENABLE_DROP);
+
 	for (i = v = 0; i < 32; i += 4)
 		v |= (PAGE_SHIFT - 10) << i;
 	t4_write_reg(adap, SGE_HOST_PAGE_SIZE, v);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index d1ec111..13609bf 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -1013,6 +1013,8 @@ static void sge_intr_handler(struct adapter *adapter)
 		{ ERR_INVALID_CIDX_INC,
 		  "SGE GTS CIDX increment too large", -1, 0 },
 		{ ERR_CPL_OPCODE_0, "SGE received 0-length CPL", -1, 0 },
+		{ F_DBFIFO_LP_INT, NULL, -1, 0 },
+		{ F_DBFIFO_HP_INT, NULL, -1, 0 },
 		{ ERR_DROPPED_DB, "SGE doorbell dropped", -1, 0 },
 		{ ERR_DATA_CPL_ON_HIGH_QID1 | ERR_DATA_CPL_ON_HIGH_QID0,
 		  "SGE IQID > 1023 received CPL for FL", -1, 0 },
@@ -1042,6 +1044,12 @@ static void sge_intr_handler(struct adapter *adapter)
 		t4_write_reg(adapter, SGE_INT_CAUSE2, v >> 32);
 	}
 
+	err = t4_read_reg(adapter, A_SGE_INT_CAUSE3);
+	if (err & (F_DBFIFO_HP_INT|F_DBFIFO_LP_INT))
+		t4_db_full(adapter);
+	if (err & F_ERR_DROPPED_DB)
+		t4_db_dropped(adapter);
+
 	if (t4_handle_intr_status(adapter, SGE_INT_CAUSE3, sge_intr_info) ||
 	    v != 0)
 		t4_fatal_err(adapter);
@@ -1513,6 +1521,7 @@ void t4_intr_enable(struct adapter *adapter)
 		     ERR_BAD_DB_PIDX2 | ERR_BAD_DB_PIDX1 |
 		     ERR_BAD_DB_PIDX0 | ERR_ING_CTXT_PRIO |
 		     ERR_EGR_CTXT_PRIO | INGRESS_SIZE_ERR |
+		     F_DBFIFO_HP_INT | F_DBFIFO_LP_INT |
 		     EGRESS_SIZE_ERR);
 	t4_write_reg(adapter, MYPF_REG(PL_PF_INT_ENABLE), PF_INTR_MASK);
 	t4_set_reg_field(adapter, PL_INT_MAP0, 0, 1 << pf);
-- 
1.7.1
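
The polling loop in process_db_full() above relies on two small pieces
of arithmetic: extracting the low- and high-priority FIFO counts from
A_SGE_DBFIFO_STATUS, and doubling the polling delay up to a 200ms cap.
A minimal standalone sketch of both follows. The masks and shifts are
copied from the M_/S_ LP_COUNT and HP_COUNT definitions quoted earlier
in this series; the function names are invented for illustration and
do not exist in the driver.

```c
#include <assert.h>
#include <stdint.h>

#define M_LP_COUNT 0x7ffU
#define S_LP_COUNT 0
#define M_HP_COUNT 0x7ffU
#define S_HP_COUNT 16

#define MAX_DELAY_USECS 200000  /* 200ms cap used in process_db_full() */

/* Hypothetical equivalents of G_LP_COUNT() and G_HP_COUNT(). */
static unsigned int lp_count(uint32_t v)
{
	return (v >> S_LP_COUNT) & M_LP_COUNT;
}

static unsigned int hp_count(uint32_t v)
{
	return (v >> S_HP_COUNT) & M_HP_COUNT;
}

/* The FIFO counts as drained only when both counts are zero. */
static int dbfifo_drained(uint32_t v)
{
	return lp_count(v) == 0 && hp_count(v) == 0;
}

/* Double the delay, saturating at the cap: min(delay << 1, 200000). */
static int next_delay_usecs(int delay)
{
	int doubled;

	assert(delay > 0 && delay <= MAX_DELAY_USECS);

	doubled = delay << 1;
	return doubled < MAX_DELAY_USECS ? doubled : MAX_DELAY_USECS;
}
```

Starting from the initial 1000us, eight doublings reach 256000us, so
from that point on the delay stays at the 200000us cap.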


^ permalink raw reply related

* Re: [V2 PATCH 9/9] vhost: zerocopy: poll vq in zerocopy callback
From: Jason Wang @ 2012-05-18  9:58 UTC (permalink / raw)
  To: Shirley Ma
  Cc: Michael S. Tsirkin, eric.dumazet, netdev, linux-kernel, ebiederm,
	davem
In-Reply-To: <1337268862.10741.58.camel@oc3660625478.ibm.com>

On 05/17/2012 11:34 PM, Shirley Ma wrote:
> On Thu, 2012-05-17 at 10:50 +0800, Jason Wang wrote:
>> The problem is that we may stop the tx queue when there is not
>> enough capacity to place packets; at that moment we depend on the tx
>> interrupt to re-enable the tx queue. So if we didn't poll the vhost
>> during the callback, the guest may lose the tx interrupt that
>> re-enables the tx queue, which could stall the whole tx queue.
> VHOST_MAX_PEND should handle the capacity.
>
> Hasn't the above situation been handled in handle_tx() code?:
> ...
>         if (unlikely(num_pends > VHOST_MAX_PEND)) {
>                 tx_poll_start(net, sock);
>                 set_bit(SOCK_ASYNC_NOSPACE, &sock->flags);
>                 break;
>         }
> ...
>
> Thanks
> Shirley

It may not help, because:

- tx polling depends on skb_orphan(), which device drivers often call
when they place the packet into the device's queue rather than when
the packet is actually sent. So vhost would be notified too early.

- it only works when the number of pending DMAs exceeds VHOST_MAX_PEND,
and it is quite possible that the guest needs to be notified when
there are fewer pending packets than that.

So this piece of code may not help and could be removed, and we need to
poll the virtqueue during the zerocopy callback (though it could be
further optimized, that may not be easy).
^ permalink raw reply
