[cc += lkml]

On Sun, 2010-11-14 at 22:33 +0100, Krzysztof Halasa wrote:
> Well... could be a hardware problem. I'd make sure the card sits well in
> the PCI slot.

Nope, the card works perfectly in the old box and the bug triggers in
any PCI slot of the new box (running the new kernel with the pc300too
driver).


> Also... it's rather improbable, but I'd look at the SCA-II chip. There
> were certain chips with a hardware bug which could cause such problems.
> Chips with Hitachi logo and "R" letter after the lot code were ok, and
> all later chips made by Renesas (either missing any logo or with
> Renesas' - no "R" letter there) were ok.
> 
> The faulty chips were marked with Hitachi logo and were missing the "R"
> letter after the lot code. I think Hitachi fixed it in 1999 or so.
> I'm not sure if this bug could manifest itself when only one SCA channel
> was in use. The app note doesn't say a word about it, but I think I only
> experienced the problem (with an older card, not PC300) when both
> channels were simultaneously in use.

Looks like we've hit this bug! Here's a photo of the board to confirm
it's the bogus chip:

 http://people.sugarlabs.org/bernie/pc300too-photo.jpg


> It could also be some PLX PCI9050 oddity. Can you show me output of
> "lspci -vv" command (may be limited to the PC300 device)?

Attached as pc300too-lspci.out

> Also, please say something about the machine (CPU, motherboard). What
> speed are you trying to use it with?

Attached as pc300too-dmidecode.out

The most interesting thing is probably the dmesg output:


[   59.175900] bernie: stat=0x80, desc_address=ffffc900111003a8, port->chan=0
[   59.176639] bernie: cp=3b4, bp=1ef18, len=56, unused=12
[   67.159314] bernie: stat=0x80, desc_address=ffffc90011100390, port->chan=0
[   67.163214] bernie: cp=39c, bp=1e298, len=56, unused=12
[   68.425601] bernie: stat=0x80, desc_address=ffffc90011100390, port->chan=0
[   68.426123] bernie: cp=39c, bp=1e298, len=77, unused=12
[   70.312068] bernie: stat=0x80, desc_address=ffffc900111003b4, port->chan=0
[   70.314393] bernie: cp=3c0, bp=1f558, len=1504, unused=12

So it seems that the controller sometimes fails to clear the EOM
(0x80) status bit after transmitting a frame. The size and contents of
the packet don't seem to matter. We're using a single T1 channel.

To obtain this debug output, I modified the driver as follows:

--- linux-2.6.36.orig/drivers/net/wan/hd64572.c	2010-10-20 16:30:22.000000000 -0400
+++ linux-2.6.36/drivers/net/wan/hd64572.c	2010-11-12 20:48:03.000000000 -0500
@@ -567,11 +567,20 @@ static netdev_tx_t sca_xmit(struct sk_bu
 	card_t *card = port->card;
 	pkt_desc __iomem *desc;
 	u32 buff, len;
+	uint8_t stat;
 
 	spin_lock_irq(&port->lock);
 
 	desc = desc_address(port, port->txin + 1, 1);
-	BUG_ON(readb(&desc->stat)); /* previous xmit should stop queue */
+
+	//BUG_ON(readb(&desc->stat)); /* previous xmit should stop queue */
+	stat = readb(&desc->stat); /* previous xmit should stop queue */
+	if (stat) {
+		printk(KERN_EMERG "bernie: stat=0x%02x, desc_address=%p, port->chan=%d\n", stat, desc, port->chan);
+		printk(KERN_EMERG "bernie: cp=%x, bp=%x, len=%d, unused=%x\n", readw(&desc->cp), readl(&desc->bp), readw(&desc->len), readb(&desc->unused));
+		printk(KERN_EMERG "bernie: %s TX(%i):", dev->name, skb->len);
+		debug_frame(skb);
+	}
 
 #ifdef DEBUG_PKT
 	printk(KERN_DEBUG "%s TX(%i):", dev->name, skb->len);


With this patch applied, our system doesn't crash any more and
communication works both ways with negligible packet loss.

Shall we submit a patch replacing the BUG_ON() with a KERN_ERR message
that reports the problem only once?
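
For illustration, the report-once behaviour could look roughly like the
user-space sketch below. The names tx_desc_stale_once() and
stale_reports are made up for this example and are not part of
hd64572.c; the 0x80 value is the EOM bit seen in the dmesg output
above. In the driver itself, printk_once() or WARN_ON_ONCE() would
presumably do the same job.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* EOM status bit, value taken from the dmesg output above. */
#define ST_TX_EOM 0x80

/* Hypothetical counter: how many diagnostics were actually printed. */
static unsigned int stale_reports;

/*
 * Hypothetical stand-in for the BUG_ON() check in sca_xmit().
 * Returns true whenever the descriptor status is non-zero, but prints
 * the diagnostic only the first time, so the log is not flooded.
 */
static bool tx_desc_stale_once(uint8_t stat, int chan)
{
	static bool reported;

	if (!stat)
		return false;	/* descriptor was cleared as expected */

	if (!reported) {
		reported = true;
		stale_reports++;
		fprintf(stderr,
			"tx descriptor not cleared: stat=0x%02x%s, chan=%d\n",
			stat, (stat & ST_TX_EOM) ? " (EOM)" : "", chan);
	}
	return true;
}
```

The caller keeps transmitting regardless of the return value, which
matches what the debug patch above does after printing.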

-- 
   // Bernie Innocenti - http://codewiz.org/
 \X/  Sugar Labs       - http://sugarlabs.org/