public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Steven Rostedt <rostedt@goodmis.org>
To: Andi Kleen <ak@suse.de>
Cc: "Ingo Molnar" <mingo@elte.hu>,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	"John Bäckstrand" <sandos@home.se>
Subject: Re: lockups with netconsole on e1000 on media insertion
Date: Fri, 05 Aug 2005 09:49:03 -0400	[thread overview]
Message-ID: <1123249743.18332.16.camel@localhost.localdomain> (raw)
In-Reply-To: <p73ek987gjw.fsf@bragg.suse.de>

On Fri, 2005-08-05 at 13:45 +0200, Andi Kleen wrote:
> John Bäckstrand <sandos@home.se> writes:
> 
> > I've been trying to hunt down a hard lockup issue with some hardware
> > of mine, but I've possibly hit a kernel bug instead. When using
> > netconsole on my e1000, if I unplug the cable and then re-plug it, the
> > machine locks up hard. It manages to print the "link up" message on
> > the screen, but nothing after that. Now, I wonder if this is supposed
> > to be so? I tried this on 4 different configurations, 2.6.13-rc5 and
> > 2.6.12 with and without "noapic acpi=off", same result on all of
> > them. I've tried with 1 and 3 other NICs in the machine at the same
> > time.
> 
> I ran into the same problem some time ago on e1000. The problem was
> that if the link doesn't come up netconsole ends up waiting forever
> for it.
> 
> The patch was for 2.6.12, did a quick untested port to 2.6.13rc5.
> 
> -Andi
> 
> Only try a limited number to send packets in netpoll
> 
> Avoids hangs on e1000 when link is not up.
> 
> Signed-off-by: Andi Kleen <ak@suse.de>
> 
> Index: linux/net/core/netpoll.c
> ===================================================================
> --- linux.orig/net/core/netpoll.c
> +++ linux/net/core/netpoll.c
> @@ -247,9 +247,11 @@ static void netpoll_send_skb(struct netp
>  {
>  	int status;
>  	struct netpoll_info *npinfo;
> +	/* Only try 5 times in case the link is down etc. */
> +	int try = 5;
>  
>  repeat:
> -	if(!np || !np->dev || !netif_running(np->dev)) {
> +	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
>  		__kfree_skb(skb);
>  		return;
>  	}
> @@ -286,6 +288,9 @@ repeat:
>  
>  	/* transmit busy */
>  	if(status) {
> +		/* Don't count spinlock as try */
> +		if (status == NETDEV_TX_LOCKED)
> +			try++; 
>  		netpoll_poll(np);
>  		goto repeat;
>  	}
> -

This is fixing the symptom and is not the cure.  Unfortunately I don't
have a e1000 card so I can't try a fix. But I did have a e100 card that
would lock up the same way.  The problem was that netpoll_poll calls the
cards netpoll routine (in e1000_main.c e1000_netpoll).  In the e100
case, when the transmit buffer would fill up, the queue would go down.
But the netpoll routine in the e100 code never put it back up after it
was all transfered. So this would lock up the kernel when that happened.

I believe that the e1000 is suffering the same problem, but I can't fix
it since I don't have an e1000 to test, but what probably needs to be
done is to check to see if the transmit buffer can be cleaned and the
queue go back up.

e1000_netpoll calls e1000_intr which looks like this:

static irqreturn_t
e1000_intr(int irq, void *data, struct pt_regs *regs)
{
	struct net_device *netdev = data;
	struct e1000_adapter *adapter = netdev_priv(netdev);
	struct e1000_hw *hw = &adapter->hw;
	uint32_t icr = E1000_READ_REG(hw, ICR);
#ifndef CONFIG_E1000_NAPI
	unsigned int i;
#endif

	if(unlikely(!icr))
		return IRQ_NONE;  /* Not our interrupt */

^^^^^^^^
---- Here I'm wondering if the netpoll case this is returned?


	if(unlikely(icr & (E1000_ICR_RXSEQ | E1000_ICR_LSC))) {
		hw->get_link_status = 1;
		mod_timer(&adapter->watchdog_timer, jiffies);
	}

#ifdef CONFIG_E1000_NAPI
	if(likely(netif_rx_schedule_prep(netdev))) {

		/* Disable interrupts and register for poll. The flush 
		  of the posted write is intentionally left out.
		*/

		atomic_inc(&adapter->irq_sem);
		E1000_WRITE_REG(hw, IMC, ~0);
		__netif_rx_schedule(netdev);
	}
#else
	/* Writing IMC and IMS is needed for 82547.
	   Due to Hub Link bus being occupied, an interrupt
	   de-assertion message is not able to be sent.
	   When an interrupt assertion message is generated later,
	   two messages are re-ordered and sent out.
	   That causes APIC to think 82547 is in de-assertion
	   state, while 82547 is in assertion state, resulting
	   in dead lock. Writing IMC forces 82547 into
	   de-assertion state.
	*/
	if(hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2){
		atomic_inc(&adapter->irq_sem);
		E1000_WRITE_REG(hw, IMC, ~0);
	}

	for(i = 0; i < E1000_MAX_INTR; i++)
		if(unlikely(!adapter->clean_rx(adapter) &
		   !e1000_clean_tx_irq(adapter)))
^^^^^
----  This should clean the transmit buffer, but it may not get here.

			break;

	if(hw->mac_type == e1000_82547 || hw->mac_type == e1000_82547_rev_2)
		e1000_irq_enable(adapter);
#endif

	return IRQ_HANDLED;
}



So maybe the patch should be something like:

--- linux-2.6.13-rc3/drivers/net/e1000/e1000_main.c.orig	2005-08-05 09:32:01.000000000 -0400
+++ linux-2.6.13-rc3/drivers/net/e1000/e1000_main.c	2005-08-05 09:33:56.000000000 -0400
@@ -3816,6 +3816,7 @@ e1000_netpoll(struct net_device *netdev)
 	struct e1000_adapter *adapter = netdev_priv(netdev);
 	disable_irq(adapter->pdev->irq);
 	e1000_intr(adapter->pdev->irq, netdev, NULL);
+	e1000_clean_tx_irq(adapter);
 	enable_irq(adapter->pdev->irq);
 }
 #endif


I don't have the card, so I can't test it. But if this works (after
removing the previous patch) then this is the better solution.  If this
does work, then we should probably add the timeout in netpoll with a
warning that the netpoll of the driver is broken:

Here's a modified version of the other patch: So we know where the
problem is.

#### John, Delete this part if you apply the above. ####

--- linux-2.6.13-rc3/net/core/netpoll.c.orig	2005-08-05 09:37:00.000000000 -0400
+++ linux-2.6.13-rc3/net/core/netpoll.c	2005-08-05 09:44:19.000000000 -0400
@@ -247,9 +247,14 @@ static void netpoll_send_skb(struct netp
 {
 	int status;
 	struct netpoll_info *npinfo;
+	/* only try five times incase link is down */
+	int try=5;
 
 repeat:
-	if(!np || !np->dev || !netif_running(np->dev)) {
+	if(try-- == 0 || !np || !np->dev || !netif_running(np->dev)) {
+		if (!try)
+			printk(KERN_WARNING "net driver is stuck down, maybe a"
+					" problem with the driver's netpoll\n");
 		__kfree_skb(skb);
 		return;
 	}
@@ -286,6 +291,9 @@ repeat:
 
 	/* transmit busy */
 	if(status) {
+		/* Don't count spinlock as try */
+		if (status == NETDEV_TX_LOCKED)
+			try++;
 		netpoll_poll(np);
 		goto repeat;
 	}


-- Steve



  parent reply	other threads:[~2005-08-05 13:49 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
     [not found] <42F347D2.7000207@home.se.suse.lists.linux.kernel>
2005-08-05 11:45 ` lockups with netconsole on e1000 on media insertion Andi Kleen
2005-08-05 12:44   ` John Bäckstrand
2005-08-05 13:49   ` Steven Rostedt [this message]
2005-08-05 13:55     ` Andi Kleen
2005-08-05 14:10       ` Steven Rostedt
2005-08-05 14:14         ` Andi Kleen
2005-08-05 14:27           ` Steven Rostedt
2005-08-05 14:36             ` David S. Miller
2005-08-05 15:02               ` Steven Rostedt
2005-08-05 14:36           ` [PATCH] netpoll can lock up on low memory Steven Rostedt
2005-08-05 20:01             ` Matt Mackall
2005-08-05 20:57               ` Steven Rostedt
2005-08-05 21:28                 ` Matt Mackall
2005-08-06  0:23                   ` Steven Rostedt
2005-08-06  1:53                     ` Matt Mackall
2005-08-06  2:32                       ` Steven Rostedt
2005-08-06  7:30                         ` Daniel Phillips
2005-08-06  7:58                         ` Ingo Molnar
2005-08-06 23:10                           ` Matt Mackall
2005-08-06  9:46                         ` David S. Miller
2005-08-06  9:57                           ` Steven Rostedt
2005-08-06 12:09                             ` John Bäckstrand
2005-08-07  5:40                             ` Matt Mackall
2005-08-05 21:26               ` Andi Kleen
2005-08-05 21:42                 ` Matt Mackall
2005-08-05 21:51                   ` Andi Kleen
2005-08-06  1:16                     ` Matt Mackall
2005-08-06  0:30                 ` Steven Rostedt
2005-08-06  7:45                 ` Ingo Molnar
2005-08-06 11:29                   ` Andi Kleen
2005-08-07 21:12     ` lockups with netconsole on e1000 on media insertion John Bäckstrand
2005-08-08  2:29       ` Steven Rostedt
2005-08-05 20:12   ` Matt Mackall
2005-08-05 21:56     ` Andi Kleen
2005-08-05 23:20       ` Matt Mackall
2005-08-05 23:51         ` Andi Kleen
2005-08-06  1:22           ` Matt Mackall
2005-08-06  1:37             ` Daniel Phillips
2005-08-05 11:04 John Bäckstrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1123249743.18332.16.camel@localhost.localdomain \
    --to=rostedt@goodmis.org \
    --cc=ak@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=netdev@vger.kernel.org \
    --cc=sandos@home.se \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox