[RFC] p54pci: skb_over_panic, soft lockup, stall under flood

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Quintin Pitts <geek4linux@gmail.com>
To: John Linville <linville@tuxdriver.com>
Cc: linux-wireless <linux-wireless@vger.kernel.org>,
	Christian Lamparter <chunkeey@web.de>
Subject: [RFC] p54pci: skb_over_panic, soft lockup, stall under flood
Date: Sun, 11 Oct 2009 09:28:55 -0500	[thread overview]
Message-ID: <4AD1EBA7.904@gmail.com> (raw)

Hi,

Sorry for my lack of experience in all aspects - first time
submitting!!!

In trying to get p54pci driver to be stable on my platform and hardware
- here is a generic patch that seems to accomplish that.  Since the
ViewSonic V210 uses the IT8152 pci bridge - some attention was needed to
get dma related allocation in the first physical 64M.  I have verified
that the dma related allocation is in the first 64M and dmabounce is not
being used - just for those wondering if that was part of the problems.

Platform: ViewSonic V210 arm pxa255
Kernel 2.6.30.5 eabi
Wireless Drivers from compat-wireless-2009-09-30 and what I applied the below patch to.
Firmware used: FW rev 2.13.12.0 - Softmac protocol 5.9

Wireless card: GemTek WL-850FJB minipci card.

phy0: p54 detected a LM86 firmware
p54: rx_mtu reduced from 3240 to 2376
phy0: FW rev 2.13.12.0 - Softmac protocol 5.9
phy0: cryptographic accelerator WEP:YES, TKIP:YES, CCMP:YES
phy0: hwaddr 00:90:4b:c1:06:bc, MAC:isl3890 RF:Frisbee
phy0: Selected rate control algorithm 'minstrel'

device pci info (lspci -v):

00:06.0 Network controller: Intersil Corporation ISL3886 [Prism Javelin/Prism Xbow] (rev 01)
Subsystem: Intersil Corporation Device 0000
Flags: bus master, medium devsel, latency 56, IRQ 217
Memory at 11000000 (32-bit, non-prefetchable) [size=8K]
Capabilities: [dc] Power Management version 1
Kernel driver in use: p54pci
Kernel modules: prism54, p54pci

Reasons for patch was to solve the below problems.

1.  p54p_check_rx_ring - skb_over_panic: Under a ping flood or just left
running for a bit would panic with a skb_over_panic. Investigation
showed for some odd reason the device/firmware instead of writing a
length in the data rx_ring (desc->len) had instead written the whole dma
address (host->host_addr) into location of the len/flag (host->len and
host->flags) spot and the same dma address that was in the ring.  Added
the following condition in p54p_check_rx_ring to trap that condition and
trim the skb reset the len and flags only.  By the way - I used haret to
see if it I could prove it happening under wince - located the dma
memory that was being used for rings - and also happening under windows
ce with the  len/flag being set to the same as the host dma.  Scanning
the ring at 1000 times per second (I think)  In a flood or iperf.  Would
see an occasional len/flag location get set to the same host address in
that ring - may only happen a few times every minute.  Under normal
operation maybe a few times a day.

   if(unlikely(len == (desc->host_addr & 0xffff)
   && (desc->flags == ((desc->host_addr & 0xffff0000) >> 16))) )

2.  p54p_refill_rx_ring - eventual stall: Has the potential in very busy
(flood) to over run the last rx data processed ring index corrupting the
next rings - causing some havoc of getting some 13 indexes difference
between priv->rx_idx_data and ring_control host_idx on a 8 index ring.
This appears to eventually fill up the TX queue - returning a -ENOSPC in
p54_assign_address (txrx.c) because of ring corruption missing some TX
releases.  Changed p54p_refill_rx_ring to take a index parm and use that
as the last processed ring index - instead of the using the ring_control
device_idx.

3.  p54p_check_rx_ring - eventual stall: On ping flood - Control
P54_CONTROL_TYPE_TXDONE rx packets that are skb reused - seem to cause a
problem on the next time around with the same index.   Even though the
length was not the same was still being seen as a
P54_CONTROL_TYPE_TXDONE packet again. Side affects varied - one being
the main end result same as the #2 listed above TX not being released
and returning a -ENOSPC in p54_assign_address (txrx.c) - stall.
Problem went away if did not reuse the skb but unmap it and
dev_kfree_skb if return was zero from p54_rx. Still unclear why this
would be - but had no problems with patch afterwards.

4.  p54p_check_rx_ring - soft lockup in p54p_refill_rx_ring.  This only
occurred when 5 minute iperf on a fast wireless network - Or 1 to 2 days
of unit left up.  Discovered that the device had lost it's mind and set
the ring_control->device_index[ring_index] exactly 0xFF or 255 less than
it should be (ram issue??) don't know.  Happens on three of my devices
the same way.  If left to continue - the p54p_refill_rx_ring while loop
goes negative and soft lockup.  Trap and return if device_idx - (*index)
greater than ring_index.  Error is only tripped the one time - meaning
the next time p54p_check_rx_ring is called the device index is back to
what it should have been.

5.  p54p_open   - 1 out of 10 boots will produce device does not
respond! or Cannot boot firmware!.    Minor - but frustrating all the
same.
Always rmmod p54pci and then modprobe p54pci works.  It seems if get a
error on p54p_open trying again works.  And if p54_read_eeprom fails -
trying again works.

The below was applied to compat-wireless-2009-09-30:

Thanks,

Quintin.

Signed-off-by: Quintin Pitts <geek4linux@gmail.com>

--- 

--- a/drivers/net/wireless/p54/p54pci.c	2009-09-29 23:13:58.000000000 -0500
+++ b/drivers/net/wireless/p54/p54pci.c	2009-10-09 08:15:58.000000000 -0500
@@ -131,7 +131,7 @@ static int p54p_upload_firmware(struct i

 static void p54p_refill_rx_ring(struct ieee80211_hw *dev,
 	int ring_index, struct p54p_desc *ring, u32 ring_limit,
-	struct sk_buff **rx_buf)
+	struct sk_buff **rx_buf, u32 index)
 {
 	struct p54p_priv *priv = dev->priv;
 	struct p54p_ring_control *ring_control = priv->ring_control;
@@ -139,7 +139,11 @@ static void p54p_refill_rx_ring(struct i

 	idx = le32_to_cpu(ring_control->host_idx[ring_index]);
 	limit = idx;
-	limit -= le32_to_cpu(ring_control->device_idx[ring_index]);
+/*
+ *           Use last processed index instead of device_idx
+ *           so we don't corrupt our ring 
+ */
+	limit -= le32_to_cpu(index);
 	limit = ring_limit - limit;

 	i = idx % ring_limit;
@@ -181,9 +185,26 @@ static void p54p_check_rx_ring(struct ie
 	struct p54p_ring_control *ring_control = priv->ring_control;
 	struct p54p_desc *desc;
 	u32 idx, i;
+	int ret;

+	idx = le32_to_cpu(ring_control->device_idx[ring_index]);
 	i = (*index) % ring_limit;
-	(*index) = idx = le32_to_cpu(ring_control->device_idx[ring_index]);
+	if(unlikely((idx - (*index)) > ring_limit || 
+ (le32_to_cpu(ring_control->host_idx[ring_index]) - (*index)) > ring_limit)) { 
+  	printk(KERN_DEBUG "%s: devidx jumped *index=%d devidx=%d hostidx=%d ring_limit=%d\n",
+	__func__,(*index),idx,ring_control->host_idx[ring_index],ring_limit);
+/* 
+ * Do nothing things are really wrong - device index has jumped got corrupted
+ *  - wait for it to stabilize 
+ * So far device idx exactly 0xFF (255) bytes less than what it should be. 
+ * only seen to happen on very fast wireless and packet floods and/or iperf test
+ * In testing this error only encountered once - so next time around the 
+ * device index is correct.
+ * if to continue would soft lockup/hang in while loop in p54p_refill_rx_ring
+ */
+		return;
+		}
+	(*index) = idx;
 	idx %= ring_limit;
 	while (i != idx) {
 		u16 len;
@@ -197,25 +218,40 @@ static void p54p_check_rx_ring(struct ie
 			i %= ring_limit;
 			continue;
 		}
+		if(unlikely(len == (desc->host_addr & 0xffff) 
+	&& (desc->flags == ((desc->host_addr & 0xffff0000) >> 16))) ) {
+/* device has put device dma in desc len/flag location - will crash in skb_put
+ * desc->len and desc->flags contain the host_addr -
+ * trap before skb_put and discard
+ * ViewSonic V210 and wireless card GENTEK WL-850 , IT8152 PCI bridge 
+ * happens occasionally - no clear reason or frequency.
+ *  
+ */ 
+		printk(KERN_DEBUG "%s: rx_ring len/flags has address - skipping!\n",__func__); 
+                  skb_trim(skb,0);
+		  desc->len = cpu_to_le16(priv->common.rx_mtu + 32);
+		  desc->flags=0;
+                 
+		} else {
+
 		skb_put(skb, len);

-		if (p54_rx(dev, skb)) {
-			pci_unmap_single(priv->pdev,
+		ret=p54_rx(dev,skb);
+		pci_unmap_single(priv->pdev,
 					 le32_to_cpu(desc->host_addr),
 					 priv->common.rx_mtu + 32,
 					 PCI_DMA_FROMDEVICE);
-			rx_buf[i] = NULL;
-			desc->host_addr = 0;
-		} else {
-			skb_trim(skb, 0);
-			desc->len = cpu_to_le16(priv->common.rx_mtu + 32);
-		}
+		if(ret==0)
+			dev_kfree_skb(skb);
+		rx_buf[i] = NULL;
+		desc->host_addr = 0;
+		} /* end of desc->len skb corrupt crash test */

 		i++;
 		i %= ring_limit;
 	}

-	p54p_refill_rx_ring(dev, ring_index, ring, ring_limit, rx_buf);
+	p54p_refill_rx_ring(dev, ring_index, ring, ring_limit, rx_buf, (*index));
 }

 /* caller must hold priv->lock */
@@ -428,10 +464,10 @@ static int p54p_open(struct ieee80211_hw
 	priv->rx_idx_mgmt = priv->tx_idx_mgmt = 0;

 	p54p_refill_rx_ring(dev, 0, priv->ring_control->rx_data,
-		ARRAY_SIZE(priv->ring_control->rx_data), priv->rx_buf_data);
+		ARRAY_SIZE(priv->ring_control->rx_data), priv->rx_buf_data, 0);

 	p54p_refill_rx_ring(dev, 2, priv->ring_control->rx_mgmt,
-		ARRAY_SIZE(priv->ring_control->rx_mgmt), priv->rx_buf_mgmt);
+		ARRAY_SIZE(priv->ring_control->rx_mgmt), priv->rx_buf_mgmt, 0);

 	P54P_WRITE(ring_control_base, cpu_to_le32(priv->ring_control_dma));
 	P54P_READ(ring_control_base);
@@ -550,9 +586,26 @@ static int __devinit p54p_probe(struct p
 	}

 	err = p54p_open(dev);
-	if (err)
-		goto err_free_common;
+	if (err) {
+                
+		printk(KERN_DEBUG "%s: p54p_open failed - trying again\n",__func__);
+                msleep(10);
+		err = p54p_open(dev);
+		if (err)
+			goto err_free_common;
+        }
 	err = p54_read_eeprom(dev);
+	if (err)
+	{
+                printk(KERN_DEBUG "%s: p54_read_eeprom failed - trying again\n",__func__);
+		p54p_stop(dev);
+		err = p54p_open(dev);
+                if (err)
+			goto err_free_common;
+		msleep(10);
+		err = p54_read_eeprom(dev);
+             
+	}
 	p54p_stop(dev);
 	if (err)
 		goto err_free_common;

next             reply	other threads:[~2009-10-11 14:29 UTC|newest]

Thread overview: 7+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-10-11 14:28 Quintin Pitts [this message]
2009-10-11 15:31 ` [RFC] p54pci: skb_over_panic, soft lockup, stall under flood Larry Finger
2009-10-11 19:41   ` Christian Lamparter
2009-10-12  0:26     ` Quintin Pitts
2009-10-12  0:09   ` Quintin Pitts
2009-10-12  8:57     ` Christian Lamparter
2009-10-12 13:34       ` Quintin Pitts

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4AD1EBA7.904@gmail.com \
    --to=geek4linux@gmail.com \
    --cc=chunkeey@web.de \
    --cc=linux-wireless@vger.kernel.org \
    --cc=linville@tuxdriver.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.