[cc'ing netdev]

On Fri, 2005-01-07 at 01:20 +0100, Francois Romieu wrote:
> Mark Broadbent :
> [...]
> > I had a think about the netpoll deadlock following the initial detective
> > work that you and Francois did. I came up with a sample fix, given below,
> > that defers the packet transmission within netpoll until the xmit_lock
> > can be grabbed. An advantage of this is that it maintains the packet
> > ordering. I have attached the patch.
> 
> Nice.
> 
> Comments below. Off to bed now.
> 
> > It is possible for netpoll to deadlock by attempting to take the network
> > device xmit_lock twice on different processors. This patch resolves this
> > by deferring the transmission of the second and subsequent skbs.
> > 
> > Signed-off-by: Mark Broadbent
> > 
> > --- linux-2.6.10-org/net/core/netpoll.c	2005-01-05 22:30:53.000000000 +0000
> > +++ linux-2.6.10/net/core/netpoll.c	2005-01-06 21:00:24.000000000 +0000
> [...]
> > @@ -178,17 +201,14 @@ repeat:
> >  	return skb;
> >  }
> >  
> > -void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
> > +/* This must be called with np->dev->xmit_lock spinlock held - this
> > + * function will drop the lock before returning, indicating if it
> > + * needs to be retried
> > + */
> > +static int transmit_skb(struct netpoll *np, struct sk_buff *skb)
> 
> It may seem cosmetic but I do not like functions which unlock a lock
> they have not acquired themselves (and I am probably not alone). Would
> you consider pushing the trylock() into transmit_skb() itself?
> 
> All the callers of this function have to issue the trylock anyway so it
> should not matter from a performance POV.

Good point, I was having doubts about that.

> [...]
> > +
> > +	spin_lock(&tx_list_lock);
>        	^^^^^^^^^
> kernel/workqueue.c::run_workqueue() could reenable interrupts before
> it calls tx_retry_wq(). An irq-safe version of the underlined lock
> seems required to avoid deadlocking with printks potentially issued
> from interrupt context.

Fixed.

> > +
> > +	list_for_each_safe(tx_ctr, tx_tmp, &tx_queue.link) {
> > +		tx_el = list_entry(tx_ctr, struct tx_queue_list, link);
> > +
> > +		while (tx_el->np && tx_el->np->dev &&
> > +		       netif_running(tx_el->np->dev)) {
> 
> netif_running() without xmit_lock held? Uh oh...

This is done in the original code (and still is in other parts of the
code).

> Imho you should try to keep transmit_skb() closer to the former
> netpoll_send_skb() (modulo the trylock change of course).
> 
> Btw I do not see what will prevent cleanup_netconsole() from succeeding
> when tx_retry_wq() is scheduled for later execution. Add some module
> refcounting?

No need: if netpoll_cleanup is called whilst a retry is pending, the tx
list lock is taken. All the references to the netpoll pointer are
deleted from the pending list before the lock is released.
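To be concrete, the purge in netpoll_cleanup() looks something like this
(sketch only, not the exact hunk - names as in the patch above):

	struct list_head *tx_ctr, *tx_tmp;
	struct tx_queue_list *tx_el;
	unsigned long flags;

	/* drop every queued skb that still references this netpoll */
	spin_lock_irqsave(&tx_list_lock, flags);
	list_for_each_safe(tx_ctr, tx_tmp, &tx_queue.link) {
		tx_el = list_entry(tx_ctr, struct tx_queue_list, link);
		if (tx_el->np == np) {
			list_del(&tx_el->link);
			__kfree_skb(tx_el->skb);
			kfree(tx_el);
		}
	}
	spin_unlock_irqrestore(&tx_list_lock, flags);

so tx_retry_wq() can never see a netpoll that has been cleaned up.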
> > +void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
> > +{
> > +repeat:
> > +	if(!np || !np->dev || !netif_running(np->dev))
> > +		goto out_free;
> 
> I am not sure that the first two tests above really made sense in
> the former version of the code:
> 
> -> kernel/printk.c::vprintk() (downs console_sem)
>    -> kernel/printk.c::release_console_sem
>       -> kernel/printk.c::call_console_drivers
>          [blah blah]
>          -> drivers/net/netconsole.c::netconsole.write_msg
>             -> drivers/net/netconsole.c::netpoll_send_udp
>                -> drivers/net/netconsole.c::netpoll_send_skb
> 
> -> drivers/net/netconsole.c::cleanup_netconsole
>    -> kernel/printk.c::unregister_console
>       -> kernel/printk.c::acquire_console_sem
>          -> try to down(&console_sem)
> 
> dev_put() has not been issued, np->dev is still among us and != NULL.
> So far, so good.
> 
> However netpoll_send_skb can be called through:
> netpoll_rx -> arp_reply -> netpoll_send_skb
> Here I fail to see why netpoll_cleanup could not happen at the same time
> and issue np->dev = NULL;

Could this be solved by having a per-netpoll-structure rwlock to protect
against changes to elements of the structure? E.g.

	struct netpoll {
		...
		rwlock_t netpoll_lock;
	};

	write_lock_irqsave(&np->netpoll_lock, flags);
	np->dev = NULL;
	write_unlock_irqrestore(&np->netpoll_lock, flags);

> > +
> > +	/* If the list is non-empty then pkts are pending tx,
> > +	 * append to the end of list to stop reordering of the pkts
> > +	 */
> > +	if (!list_empty(&tx_queue.link) || !spin_trylock(&np->dev->xmit_lock)) {
> > +		struct tx_queue_list *txel;
> > +
> > +		/* If we failed to get the xmit_lock then netpoll probably
> > +		 * already holds this lock but will deadlock here if we relock,
> 
> Ok ("probably"?). :)

Fixed.

> > +		 * queue the skb for transmission later
> > +		 */
> > +		spin_lock(&tx_list_lock);
> 
> No spin_lock_irqsave? Hmmm...
> The only use of netpoll_send_skb() goes through netpoll_send_udp() which
> is issued in drivers/net/netconsole.c::write_msg() with local_irq
> disabled. You are right but an extra comment would not hurt imho.

Is it safe to issue spin_lock_irqsave/spin_unlock_irqrestore in this
context anyway? Reason: what if it wasn't netconsole calling
netpoll_send_udp, and the caller did not have local interrupts disabled?

> > +
> > +		txel = kmalloc(sizeof(struct tx_queue_list), GFP_ATOMIC);
> > +		if (!txel)
> > +			goto out_free;
> > +
> > +		txel->skb = skb_get(skb);
> > +		txel->np = np;
> > +
> > +		list_add_tail(&txel->link, &tx_queue.link);
> > +
> > +		queue_delayed_work(tx_wq, &tx_wq_obj, HZ/100);
> > +
> > +		spin_unlock(&tx_list_lock);
> > +	} else if (transmit_skb(np, skb))
> > +		goto repeat;
> 
> As I read it, the previous code could loop as long as netif_queue_stopped()
> is true and it can acquire the lock. Any reason not to defer the skb for
> transmission in tx_retry_wq()?

Nope, no good reason. If transmit_skb fails it'll now defer.
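The tail of netpoll_send_skb() then becomes roughly this (untested
sketch, with the trylock pushed into transmit_skb() as you suggested, so
it returns non-zero when the skb could not be sent):

	if (!list_empty(&tx_queue.link) || transmit_skb(np, skb)) {
		struct tx_queue_list *txel;
		unsigned long flags;

		/* list busy, queue stopped or lock contended:
		 * queue the skb and let tx_retry_wq() retry it */
		txel = kmalloc(sizeof(*txel), GFP_ATOMIC);
		if (!txel)
			goto out_free;

		txel->skb = skb_get(skb);
		txel->np = np;

		spin_lock_irqsave(&tx_list_lock, flags);
		list_add_tail(&txel->link, &tx_queue.link);
		spin_unlock_irqrestore(&tx_list_lock, flags);

		queue_delayed_work(tx_wq, &tx_wq_obj, HZ/100);
	}
	return;

out_free:
	__kfree_skb(skb);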
> > +
> > +	return;
> > +
> > +out_free:
> > +	__kfree_skb(skb);
> >  }
> >  
> >  void netpoll_send_udp(struct netpoll *np, const char *msg, int len)
> > @@ -636,6 +729,15 @@ int netpoll_setup(struct netpoll *np)
> >  		spin_unlock_irqrestore(&rx_list_lock, flags);
> >  	}
> >  
> > +	/* Workqueue to schedule the tx retries */
> > +	if (!tx_wq)
> > +		tx_wq = create_workqueue("netpoll");
> > +
> > +	if (!tx_wq) {
> > +		printk(KERN_ERR "Failed to create netpoll workqueue\n");
> > +		goto release;
> > +	}
> > +
> 
> This queue is created but never removed. Maybe it could/should go in a
> separate function and be anchored to net_dev_init(). The network
> maintainers will tell if they care or not.
> 
> Pure curiosity: why a specific queue instead of schedule_delayed_work()?

Pure ignorance I'm afraid ;). That's more like what I was looking for
originally, so I've changed it.

I've attached a revised version.

Thanks for your comments (and Matt's).

Mark

-- 
Mark Broadbent