From mboxrd@z Thu Jan 1 00:00:00 1970 From: Colin Leroy Subject: Re: [PATCH] Prevent netpoll hanging when link is down Date: Thu, 7 Oct 2004 16:05:32 +0200 Sender: netdev-bounce@oss.sgi.com Message-ID: <20041007160532.60c3f26b@pirandello> References: <20041006232544.53615761@jack.colino.net> <20041006214322.GG31237@waste.org> <20041007075319.6b31430d@jack.colino.net> <20041006234912.66bfbdcc.davem@davemloft.net> Mime-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Content-Transfer-Encoding: 7bit Cc: mpm@selenic.com, akpm@osdl.org, netdev@oss.sgi.com Return-path: To: "David S. Miller" In-Reply-To: <20041006234912.66bfbdcc.davem@davemloft.net> Errors-to: netdev-bounce@oss.sgi.com List-Id: netdev.vger.kernel.org On 06 Oct 2004 at 23h10, David S. Miller wrote: Hi again, > Folks debugging this should verify that gp->hw_running is non-zero > when the problematic case runs. I looked a bit more at the code and found out a possible problem. However it doesn't fix the hang, so either it's not what I found, or there's something else. First, my newbie question: is it possible to deadlock a spinlock on a Uniprocessor kernel ? For example, there's something I find suspect in netpoll/sungem interaction: it starts in netpoll_send_skb(), where xmit_lock is acquired; then, gem_start_xmit() is called. In gem_start_xmit(), if the TX ring is full, we return 1, but previously log the error via printk(). In this condition, isn't netpoll_send_skb() called again (via the whole console stuff), where it retries to get the lock on xmit_lock? Could that cause a deadlock? Also, at the end of netpoll_send_skb() is a check for the return status of hard_start_xmit. If I understand correctly, hard_start_xmit functions return 0 for success, -1 for "busy please retry" and 1 for "hard error". Shouldn't netpoll free the skb and forget about it in case status is 1 (instead of goto'ing repeat) ? Based on this, I've a new patch that I'm attaching. As said it doesn't fix the problem, but if my I understood is valid, it may still be useful ? Any insight will be greatly appreciated :) Patch following: --- a/net/core/netpoll.c 2004-10-05 21:09:49.000000000 +0200 +++ b/net/core/netpoll.c 2004-10-07 14:11:00.000000000 +0200 @@ -188,7 +188,10 @@ return; } - spin_lock(&np->dev->xmit_lock); + if (!spin_trylock(&np->dev->xmit_lock)) { + /* looks like np->dev->hard_start_xmit did a printk */ + goto bail_free; + } np->dev->xmit_lock_owner = smp_processor_id(); /* @@ -204,13 +207,18 @@ } status = np->dev->hard_start_xmit(skb, np->dev); + np->dev->xmit_lock_owner = -1; spin_unlock(&np->dev->xmit_lock); - /* transmit busy */ - if(status) { + if(status == -1) { + /* transmit busy */ netpoll_poll(np); goto repeat; + } else if (status == 1) { + /* hard error, drop this skb */ +bail_free: + __kfree_skb(skb); } }