netdev.vger.kernel.org archive mirror
* Races in net_rx_action vs netpoll?
@ 2007-07-04 12:16 Olaf Kirch
  2007-07-09 22:27 ` David Miller
From: Olaf Kirch @ 2007-07-04 12:16 UTC (permalink / raw)
  To: netdev

Hi,

as a caveat, let me say that the whole thing discussed below started
as an investigation of an oops on RHEL4. You may or may not be
interested in 2.6.9 kernels (depending on your employer :-) but
looking at mainline it seems the issue is still present.

It appears there is a race condition between net_rx_action and
poll_napi. In the particular scenario I've been debugging, the
machine BUGs in netif_rx_complete because the device is no longer
on the poll list.

Here's a snippet from net_rx_action:

        while (!list_empty(&queue->poll_list)) {
                struct net_device *dev;

                if (budget <= 0 || jiffies - start_time > 1)
                        goto softnet_break;

                local_irq_enable();

                dev = list_entry(queue->poll_list.next,
                                 struct net_device, poll_list);
                have = netpoll_poll_lock(dev);

And here's poll_napi:

        if (test_bit(__LINK_STATE_RX_SCHED, &np->dev->state) &&
            npinfo->poll_owner != smp_processor_id() &&
            spin_trylock(&npinfo->poll_lock)) {
                [...]
                np->dev->poll(np->dev, &budget);

Here's the race:

 CPU0	interrupt arrives, puts interface on local poll_list
	net_rx_action runs, finds the list is not empty.

 CPU1	Something causes a kernel printk, which is sent via
	netconsole. Via netpoll_send_skb we arrive in
	poll_napi. RX_SCHED is set, poll_owner is -1,
	and we're able to grab the poll_lock.

 CPU0	Calls netpoll_poll_lock and spins on the poll_lock

 CPU1	Calls dev->poll, which does its work and calls
	netif_rx_complete, returns, and releases the spinlock.

 CPU0	resumes, calls dev->poll(), which calls netif_rx_complete
	again, and BUGs as the device is no longer on the poll_list.

Does this make sense so far?
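
For reference, netpoll_poll_lock in current mainline looks roughly like
this (paraphrased from memory, so take the details with a grain of salt):

        static inline void *netpoll_poll_lock(struct net_device *dev)
        {
                struct netpoll_info *npinfo = dev->npinfo;

                if (npinfo) {
                        spin_lock(&npinfo->poll_lock);
                        npinfo->poll_owner = smp_processor_id();
                        return npinfo;
                }
                return NULL;
        }

In other words, net_rx_action blocks on the very poll_lock that poll_napi
is holding, and once it finally gets it, it calls dev->poll() again even
though the netpoll path already completed the device.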

Of course, the race can happen the other way round as well -
poll_napi tests for the RX_SCHED bit before taking the spin_lock, so
net_rx_action running on a different CPU may call netif_rx_complete
and release the spinlock in between poll_napi's test_bit() and
spin_trylock(). Not very likely to happen, but possible.

This is just one failure mode. A worse case would be if
net_rx_action finds the poll_list is non-empty, but before it
executes dev = list_entry(queue->poll_list.next, ...), poll_napi
runs on another CPU and removes the device from the list. In the
worst case, dev would now point into softnet_data.

I think the only real fix for this is to restrict who is allowed
to remove the interface from the poll_list. Only net_rx_action
should be allowed to do so. A possible patch is given below
(beware, it's untested so far)

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir@lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax
-----------------
From: Olaf Kirch <olaf.kirch@oracle.com>

Keep netpoll/poll_napi from messing with the poll_list.
Only net_rx_action is allowed to manipulate the list.

This patch is a draft only.

Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
---
 include/linux/netdevice.h |   32 +++++++++++++++++++++++++-------
 net/core/dev.c            |   13 ++++++++++++-
 2 files changed, 37 insertions(+), 8 deletions(-)

Index: iscsi-2.6/include/linux/netdevice.h
===================================================================
--- iscsi-2.6.orig/include/linux/netdevice.h
+++ iscsi-2.6/include/linux/netdevice.h
@@ -248,6 +248,7 @@ enum netdev_state_t
 	__LINK_STATE_LINKWATCH_PENDING,
 	__LINK_STATE_DORMANT,
 	__LINK_STATE_QDISC_RUNNING,
+	__LINK_STATE_RX_COMPLETE,
 };
 
 
@@ -868,6 +869,12 @@ static inline u32 netif_msg_init(int deb
 /* Test if receive needs to be scheduled */
 static inline int __netif_rx_schedule_prep(struct net_device *dev)
 {
+	/* The driver may have decided that there's no more work
+	 * to be done - but now another interrupt arrives, and
+	 * we changed our mind. */
+	smp_mb__before_clear_bit();
+	clear_bit(__LINK_STATE_RX_COMPLETE, &dev->state);
+
 	return !test_and_set_bit(__LINK_STATE_RX_SCHED, &dev->state);
 }
 
@@ -910,21 +917,35 @@ static inline int netif_rx_reschedule(st
 	return 0;
 }
 
-/* Remove interface from poll list: it must be in the poll list
- * on current cpu. This primitive is called by dev->poll(), when
+/* Tell net_rx_action that we think there's no more work to be
+ * done. */
+static inline void netif_rx_complete(struct net_device *dev)
+{
+	set_bit(__LINK_STATE_RX_COMPLETE, &dev->state);
+}
+
+/* Remove interface from poll list if RX_COMPLETE is set.
+ * The device must be in the poll list on the current cpu.
+ * This primitive is called by net_rx_action when
  * it completes the work. The device cannot be out of poll list at this
  * moment, it is BUG().
  */
-static inline void netif_rx_complete(struct net_device *dev)
+static inline int netif_rx_continue(struct net_device *dev)
 {
 	unsigned long flags;
 
 	local_irq_save(flags);
+	if (!test_bit(__LINK_STATE_RX_COMPLETE, &dev->state)) {
+		local_irq_restore(flags);
+		return 1;
+	}
+
 	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
 	list_del(&dev->poll_list);
 	smp_mb__before_clear_bit();
 	clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
 	local_irq_restore(flags);
+	return 0;
 }
 
 static inline void netif_poll_disable(struct net_device *dev)
@@ -945,10 +966,7 @@ static inline void netif_poll_enable(str
  */
 static inline void __netif_rx_complete(struct net_device *dev)
 {
-	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
-	list_del(&dev->poll_list);
-	smp_mb__before_clear_bit();
-	clear_bit(__LINK_STATE_RX_SCHED, &dev->state);
+	set_bit(__LINK_STATE_RX_COMPLETE, &dev->state);
 }
 
 static inline void netif_tx_lock(struct net_device *dev)
Index: iscsi-2.6/net/core/dev.c
===================================================================
--- iscsi-2.6.orig/net/core/dev.c
+++ iscsi-2.6/net/core/dev.c
@@ -1994,7 +1994,16 @@ static void net_rx_action(struct softirq
 				 struct net_device, poll_list);
 		have = netpoll_poll_lock(dev);
 
-		if (dev->quota <= 0 || dev->poll(dev, &budget)) {
+		/* We keep polling iff
+		 *  -	the device is over quota, OR
+		 *  -	the driver's poll() routine says there's more work
+		 *	(in which case it should set RX_COMPLETE, too), OR
+		 *  -	the RX_COMPLETE flag was cleared in the meantime
+		 *	(eg by an incoming interrupt).
+		 */
+		if (dev->quota <= 0 || dev->poll(dev, &budget) ||
+		    netif_rx_continue(dev)) {
+			clear_bit(__LINK_STATE_RX_COMPLETE, &dev->state);
 			netpoll_poll_unlock(have);
 			local_irq_disable();
 			list_move_tail(&dev->poll_list, &queue->poll_list);
@@ -2003,6 +2012,8 @@ static void net_rx_action(struct softirq
 			else
 				dev->quota = dev->weight;
 		} else {
+			/* If we get here, netif_rx_continue removed the
+			 * device from the poll_list. */
 			netpoll_poll_unlock(have);
 			dev_put(dev);
 			local_irq_disable();

* Re: Races in net_rx_action vs netpoll?
  2007-07-04 12:16 Races in net_rx_action vs netpoll? Olaf Kirch
@ 2007-07-09 22:27 ` David Miller
  2007-07-10 10:44   ` Olaf Kirch
From: David Miller @ 2007-07-09 22:27 UTC (permalink / raw)
  To: okir; +Cc: netdev

From: Olaf Kirch <okir@lst.de>
Date: Wed, 4 Jul 2007 14:16:32 +0200

Another locking bug in netpoll, why am I not surprised? :-/

Thanks for reporting this Olaf.

> I think the only real fix for this is to restrict who is allowed
> to remove the interface from the poll_list. Only net_rx_action
> should be allowed to do so. A possible patch is given below
> (beware, it's untested so far)

I'm happy to entertain this kind of solution, but we really
need to first have an interface to change multiple bits
at a time in one atomic operation, because by itself this
patch doubles the number of atomics we do when starting
a NAPI poll.

>  static inline int __netif_rx_schedule_prep(struct net_device *dev)
>  {
> +	/* The driver may have decided that there's no more work
> +	 * to be done - but now another interrupt arrives, and
> +	 * we changed our mind. */
> +	smp_mb__before_clear_bit();
> +	clear_bit(__LINK_STATE_RX_COMPLETE, &dev->state);
> +
>  	return !test_and_set_bit(__LINK_STATE_RX_SCHED, &dev->state);
>  }
>  

Because of that change.
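
Something along these lines is what I mean by a single atomic operation.
Names made up, just to illustrate, not a real proposal:

        /* purely illustrative: set RX_SCHED and clear RX_COMPLETE in a
         * single atomic operation on dev->state
         */
        static inline int netif_rx_schedule_prep_atomic(struct net_device *dev)
        {
                unsigned long old, new;

                do {
                        old = dev->state;
                        if (old & (1UL << __LINK_STATE_RX_SCHED))
                                return 0;       /* already scheduled */
                        new = (old | (1UL << __LINK_STATE_RX_SCHED)) &
                              ~(1UL << __LINK_STATE_RX_COMPLETE);
                } while (cmpxchg(&dev->state, old, new) != old);

                return 1;
        }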

* Re: Races in net_rx_action vs netpoll?
  2007-07-09 22:27 ` David Miller
@ 2007-07-10 10:44   ` Olaf Kirch
  2007-07-11  5:44     ` David Miller
  2007-07-12 12:59     ` Jarek Poplawski
From: Olaf Kirch @ 2007-07-10 10:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Tuesday 10 July 2007 00:27, David Miller wrote:
> I'm happy to entertain this kind of solution, but we really
> need to first have an interface to change multiple bits
> at a time in one atomic operation, because by itself this
> patch doubles the number of atomics we do when starting
> a NAPI poll.

Understood. How about the patch below? It takes a similar
approach, but it puts the onus on the netpoll code
path rather than the general NAPI case.

Olaf
--------------
From: Olaf Kirch <olaf.kirch@oracle.com>

Keep netpoll/poll_napi from messing with the poll_list.
Only net_rx_action is allowed to manipulate the list.

Signed-off-by: Olaf Kirch <olaf.kirch@oracle.com>
---
 include/linux/netdevice.h |   10 ++++++++++
 net/core/netpoll.c        |    8 ++++++++
 2 files changed, 18 insertions(+)

Index: iscsi-2.6/include/linux/netdevice.h
===================================================================
--- iscsi-2.6.orig/include/linux/netdevice.h
+++ iscsi-2.6/include/linux/netdevice.h
@@ -248,6 +248,8 @@ enum netdev_state_t
 	__LINK_STATE_LINKWATCH_PENDING,
 	__LINK_STATE_DORMANT,
 	__LINK_STATE_QDISC_RUNNING,
+	/* Set by the netpoll NAPI code */
+	__LINK_STATE_POLL_LIST_FROZEN,
 };
 
 
@@ -919,6 +921,14 @@ static inline void netif_rx_complete(str
 {
 	unsigned long flags;
 
+#ifdef CONFIG_NETPOLL
+	/* Prevent race with netpoll - yes, this is a kludge.
+	 * But at least it doesn't penalize the non-netpoll
+	 * code path. */
+	if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state))
+		return;
+#endif
+
 	local_irq_save(flags);
 	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
 	list_del(&dev->poll_list);
Index: iscsi-2.6/net/core/netpoll.c
===================================================================
--- iscsi-2.6.orig/net/core/netpoll.c
+++ iscsi-2.6/net/core/netpoll.c
@@ -123,6 +123,13 @@ static void poll_napi(struct netpoll *np
 	if (test_bit(__LINK_STATE_RX_SCHED, &np->dev->state) &&
 	    npinfo->poll_owner != smp_processor_id() &&
 	    spin_trylock(&npinfo->poll_lock)) {
+		/* When calling dev->poll from poll_napi, we may end up in
+		 * netif_rx_complete. However, only the CPU to which the
+		 * device was queued is allowed to remove it from poll_list.
+		 * Setting POLL_LIST_FROZEN tells netif_rx_complete
+		 * to leave the NAPI state alone.
+		 */
+		set_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state);
 		npinfo->rx_flags |= NETPOLL_RX_DROP;
 		atomic_inc(&trapped);
 
@@ -130,6 +137,7 @@ static void poll_napi(struct netpoll *np
 
 		atomic_dec(&trapped);
 		npinfo->rx_flags &= ~NETPOLL_RX_DROP;
+		clear_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state);
 		spin_unlock(&npinfo->poll_lock);
 	}
 }

-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir@lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax

* Re: Races in net_rx_action vs netpoll?
  2007-07-10 10:44   ` Olaf Kirch
@ 2007-07-11  5:44     ` David Miller
  2007-07-11  7:41       ` Olaf Kirch
  2007-07-12 12:59     ` Jarek Poplawski
From: David Miller @ 2007-07-11  5:44 UTC (permalink / raw)
  To: okir; +Cc: netdev

From: Olaf Kirch <okir@lst.de>
Date: Tue, 10 Jul 2007 12:44:31 +0200

> On Tuesday 10 July 2007 00:27, David Miller wrote:
> > I'm happy to entertain this kind of solution, but we really
> > need to first have an interface to change multiple bits
> > at a time in one atomic operation, because by itself this
> > patch doubles the number of atomics we do when starting
> > a NAPI poll.
> 
> Understood. How about the patch below? It takes a similar
> approach, but it puts the onus on the netpoll code
> path rather than the general NAPI case.

Definitely looks more palatable.

> @@ -919,6 +921,14 @@ static inline void netif_rx_complete(str
>  {
>  	unsigned long flags;
>  
> +#ifdef CONFIG_NETPOLL
> +	/* Prevent race with netpoll - yes, this is a kludge.
> +	 * But at least it doesn't penalize the non-netpoll
> +	 * code path. */
> +	if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state))
> +		return;
> +#endif
> +
>  	local_irq_save(flags);
>  	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
>  	list_del(&dev->poll_list);

That new bit can be set in interrupt context can't it?

* Re: Races in net_rx_action vs netpoll?
  2007-07-11  5:44     ` David Miller
@ 2007-07-11  7:41       ` Olaf Kirch
  2007-07-12  2:33         ` David Miller
From: Olaf Kirch @ 2007-07-11  7:41 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Wednesday 11 July 2007 07:44, David Miller wrote:
> > +#ifdef CONFIG_NETPOLL
> > +	/* Prevent race with netpoll - yes, this is a kludge.
> > +	 * But at least it doesn't penalize the non-netpoll
> > +	 * code path. */
> > +	if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state))
> > +		return;
> > +#endif
> > +
> >  	local_irq_save(flags);
> >  	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
> >  	list_del(&dev->poll_list);
> 
> That new bit can be set in interrupt context can't it?

It's set and cleared in poll_napi only, and as far as I can tell 
poll_napi will only ever be called via a softirq, but never
from an interrupt handler directly.

I also don't think the test_bit() needs to lock out interrupts.
The only reason we do it for the RX_SCHED bit is that the RX_SCHED
bit and the poll_list change must happen atomically wrt interrupts
from the NIC, right?

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir@lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax

* Re: Races in net_rx_action vs netpoll?
  2007-07-11  7:41       ` Olaf Kirch
@ 2007-07-12  2:33         ` David Miller
  2007-07-19 15:19           ` Olaf Kirch
From: David Miller @ 2007-07-12  2:33 UTC (permalink / raw)
  To: okir; +Cc: netdev

From: Olaf Kirch <okir@lst.de>
Date: Wed, 11 Jul 2007 09:41:37 +0200

> On Wednesday 11 July 2007 07:44, David Miller wrote:
> > > +#ifdef CONFIG_NETPOLL
> > > +	/* Prevent race with netpoll - yes, this is a kludge.
> > > +	 * But at least it doesn't penalize the non-netpoll
> > > +	 * code path. */
> > > +	if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state))
> > > +		return;
> > > +#endif
> > > +
> > >  	local_irq_save(flags);
> > >  	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
> > >  	list_del(&dev->poll_list);
> > 
> > That new bit can be set in interrupt context can't it?
> 
> It's set and cleared in poll_napi only, and as far as I can tell 
> poll_napi will only ever be called via a softirq, but never
> from an interrupt handler directly.
> 
> I also don't think the test_bit() needs to lock out interrupts.
> The only reason we do it for the RX_SCHED bit is that the RX_SCHED
> bit and the poll_list change must happen atomically wrt interrupts
> from the NIC, right?

Ok, sounds good.

I'll merge your patch with a target of 2.6.23.

If you really want, after this patch has sat in 2.6.23 for a while
and got some good testing, we can consider a submission for -stable.

* Re: Races in net_rx_action vs netpoll?
  2007-07-10 10:44   ` Olaf Kirch
  2007-07-11  5:44     ` David Miller
@ 2007-07-12 12:59     ` Jarek Poplawski
  2007-07-12 13:54       ` Olaf Kirch
From: Jarek Poplawski @ 2007-07-12 12:59 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: David Miller, netdev

Hi!

I'm really sorry I couldn't write this sooner.
Below are a few of my doubts:


On 10-07-2007 12:44, Olaf Kirch wrote:
> On Tuesday 10 July 2007 00:27, David Miller wrote:
>> I'm happy to entertain this kind of solution, but we really
>> need to first have an interface to change multiple bits
>> at a time in one atomic operation, because by itself this
>> patch doubles the number of atomics we do when starting
>> a NAPI poll.
...
> --- iscsi-2.6.orig/include/linux/netdevice.h
> +++ iscsi-2.6/include/linux/netdevice.h
> @@ -248,6 +248,8 @@ enum netdev_state_t
>  	__LINK_STATE_LINKWATCH_PENDING,
>  	__LINK_STATE_DORMANT,
>  	__LINK_STATE_QDISC_RUNNING,
> +	/* Set by the netpoll NAPI code */
> +	__LINK_STATE_POLL_LIST_FROZEN,
>  };
>  
>  
> @@ -919,6 +921,14 @@ static inline void netif_rx_complete(str
>  {
>  	unsigned long flags;
>  
> +#ifdef CONFIG_NETPOLL
> +	/* Prevent race with netpoll - yes, this is a kludge.
> +	 * But at least it doesn't penalize the non-netpoll
> +	 * code path. */

Alas, this can penalize those who have it enabled (e.g. by distro),
but don't use it.

And it looks like __netif_rx_complete would be a better place,
at least considering such cards as: 8139too, skge, sungem and
maybe more (according to 2.6.22).

> +	if (test_bit(__LINK_STATE_POLL_LIST_FROZEN, &dev->state))
> +		return;
> +#endif
> +
>  	local_irq_save(flags);
>  	BUG_ON(!test_bit(__LINK_STATE_RX_SCHED, &dev->state));
>  	list_del(&dev->poll_list);
> Index: iscsi-2.6/net/core/netpoll.c
> ===================================================================
> --- iscsi-2.6.orig/net/core/netpoll.c
> +++ iscsi-2.6/net/core/netpoll.c
> @@ -123,6 +123,13 @@ static void poll_napi(struct netpoll *np
>  	if (test_bit(__LINK_STATE_RX_SCHED, &np->dev->state) &&
>  	    npinfo->poll_owner != smp_processor_id() &&
>  	    spin_trylock(&npinfo->poll_lock)) {
> +		/* When calling dev->poll from poll_napi, we may end up in
> +		 * netif_rx_complete. However, only the CPU to which the
> +		 * device was queued is allowed to remove it from poll_list.
> +		 * Setting POLL_LIST_FROZEN tells netif_rx_complete
> +		 * to leave the NAPI state alone.
> +		 */
> +		set_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state);
>  		npinfo->rx_flags |= NETPOLL_RX_DROP;

I wonder, why this flag cannot be used for this check?

>  		atomic_inc(&trapped);
>  
> @@ -130,6 +137,7 @@ static void poll_napi(struct netpoll *np
>  
>  		atomic_dec(&trapped);
>  		npinfo->rx_flags &= ~NETPOLL_RX_DROP;
> +		clear_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state);
>  		spin_unlock(&npinfo->poll_lock);
>  	}
>  }
> 

BTW, I'd be very glad if somebody could give me a hint about the main
reason for such a troublesome function as poll_napi: if it's about
performance, isn't it enough to increase the budget for netpoll in
net_rx_action?

Best regards,
Jarek P.

* Re: Races in net_rx_action vs netpoll?
  2007-07-12 12:59     ` Jarek Poplawski
@ 2007-07-12 13:54       ` Olaf Kirch
  2007-07-13  8:55         ` Jarek Poplawski
From: Olaf Kirch @ 2007-07-12 13:54 UTC (permalink / raw)
  To: Jarek Poplawski; +Cc: David Miller, netdev

Hi Jarek,

On Thursday 12 July 2007 14:59, Jarek Poplawski wrote:

> > +#ifdef CONFIG_NETPOLL
> > +	/* Prevent race with netpoll - yes, this is a kludge.
> > +	 * But at least it doesn't penalize the non-netpoll
> > +	 * code path. */
> 
> Alas, this can penalize those who have it enabled (e.g. by distro),
> but don't use it.

Well, the test_bit is actually cheap; it's not atomic, and has no memory
ordering requirements as far as I know. The costly thing is the
set_bit/clear_bit in poll_napi, and you only ever get there when you
*use* netpoll.
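
The generic non-atomic implementation is basically just a load, roughly:

        static inline int test_bit(int nr, const volatile unsigned long *addr)
        {
                return 1UL & (addr[nr / BITS_PER_LONG] >> (nr % BITS_PER_LONG));
        }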

> And it looks like _netif_rx_complete should be a better place,
> at least considering such cards as: 8139too, skge, sungem and
> maybe more (according to 2.6.22).

Why?

> > +		set_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state);
> >  		npinfo->rx_flags |= NETPOLL_RX_DROP;
> 
> I wonder, why this flag cannot be used for this check?

I tried, but it made the patch rather icky. netpoll_info is defined
in netpoll.h, which includes netdevice.h. So you cannot inline the
check, and have to use an out-of-line function instead, along the
lines of

extern int am_i_being_called_by_poll_napi(struct net_device *);

netif_rx_complete(struct net_device *dev)
{
#ifdef CONFIG_NETPOLL
	if (unlikely(dev->npinfo && am_i_being_called_by_poll_napi(dev)))
		return;
#endif
	...
}

If you don't mind that, yes - this flag (or better, a newly introduced
NETPOLL_RX_NAPI) may work as well.
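
The out-of-line helper itself would be trivial - something like this
(untested, and NETPOLL_RX_NAPI is made up, as said):

        /* net/core/netpoll.c */
        int am_i_being_called_by_poll_napi(struct net_device *dev)
        {
                struct netpoll_info *npinfo = dev->npinfo;

                return npinfo && (npinfo->rx_flags & NETPOLL_RX_NAPI);
        }

with poll_napi setting and clearing NETPOLL_RX_NAPI in npinfo->rx_flags in
the same places where the patch above fiddles with POLL_LIST_FROZEN.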

One thing I was a little worried about was whether dev->npinfo can
go away all of a sudden. It's really just protected by an rcu_readlock...

> BTW, I'd be very glad if somebody could give me a hint about the main
> reason for such a troublesome function as poll_napi: if it's about
> performance, isn't it enough to increase the budget for netpoll in
> net_rx_action?

I think one reason is that you want to get the kernel oops out even
when the machine is so hosed that it doesn't even service softirqs
anymore.

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir@lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax

* Re: Races in net_rx_action vs netpoll?
  2007-07-12 13:54       ` Olaf Kirch
@ 2007-07-13  8:55         ` Jarek Poplawski
  2007-07-16  8:06           ` Jarek Poplawski
From: Jarek Poplawski @ 2007-07-13  8:55 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: David Miller, netdev

On Thu, Jul 12, 2007 at 03:54:32PM +0200, Olaf Kirch wrote:
> Hi Jarek,
> 
> On Thursday 12 July 2007 14:59, Jarek Poplawski wrote:
> 
> > > +#ifdef CONFIG_NETPOLL
> > > +	/* Prevent race with netpoll - yes, this is a kludge.
> > > +	 * But at least it doesn't penalize the non-netpoll
> > > +	 * code path. */
> > 
> > Alas, this can penalize those who have it enabled (e.g. by distro),
> > but don't use it.
> 
> Well, the test_bit is actually cheap; it's not atomic, and has no memory
> ordering requirements by all I know. The costly thing is set_bit/clear_bit
> in poll_napi; and you only ever get there when you *use* netpoll.

I only meant that "it doesn't penalize" isn't too precise here,
at least if you take it "mathematically". But, e.g. "politically",
it's 110% right, of course.

> 
> > And it looks like __netif_rx_complete would be a better place,
> > at least considering such cards as: 8139too, skge, sungem and
> > maybe more (according to 2.6.22).
> 
> Why?

It seems I'm missing something, but if it's to be called from dev->poll,
these drivers use __netif_rx_complete instead of netif_rx_complete.
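
E.g. the completion path in those drivers looks schematically like this
(from memory; priv and lock names made up):

        /* at the end of the driver's dev->poll() */
        spin_lock_irqsave(&priv->lock, flags);
        __netif_rx_complete(dev);       /* no POLL_LIST_FROZEN check here */
        /* re-enable the NIC's rx interrupts */
        spin_unlock_irqrestore(&priv->lock, flags);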

> 
> > > +		set_bit(__LINK_STATE_POLL_LIST_FROZEN, &np->dev->state);
> > >  		npinfo->rx_flags |= NETPOLL_RX_DROP;
> > 
> > I wonder, why this flag cannot be used for this check?
> 
> I tried, but it made the patch rather icky. netpoll_info is defined
> in netpoll.h, which includes netdevice.h. So you cannot inline the
> check, and have to use an out-of-line function instead, along the
> lines of
> 
> extern int am_i_being_called_by_poll_napi(struct net_device *);
> 
> netif_rx_complete(struct net_device *dev)
> {
> #ifdef CONFIG_NETPOLL
> 	if (unlikely(dev->npinfo && am_i_being_called_by_poll_napi(dev)))
> 		return;
> #endif
> 	...
> }
> 
> If you don't mind that, yes - this flag (or better, a newly introduced
> NETPOLL_RX_NAPI) may work as well.
> 
> One thing I was a little worried about was whether dev->npinfo can
> go away all of a sudden. It's really just protected by an rcu_readlock...

I didn't think about this struct netpoll_info vs. inlining issue,
but I'm glad you did, so adding a new flag looks more reasonable
if we don't want to mess too much with netdevice.h.
BTW, I don't think there could be any problem with rcu (if it's
all about calling dev->poll from poll_napi) because then poll_napi
should have the same problem.

> 
> > BTW, I'd be very glad if somebody could give me a hint about the main
> > reason for such a troublesome function as poll_napi: if it's about
> > performance, isn't it enough to increase the budget for netpoll in
> > net_rx_action?
> 
> I think one reason is that you want to get the kernel oops out even
> when the machine is so hosed that it doesn't even service softirqs
> anymore.

Thanks! So, I have to think about this more. Of course, such an idea
is fine if it doesn't collide with normal service, which I'm not
sure is true now. I mean this problem here, and e.g. the needless
servicing of non-netconsole packets by different cpus, but also
some things that are unclear to me, like calling this from find_skb
when there is a problem with alloc_skb (I wonder how a card driver
manages to get these skbs for receiving?).

Cheers,
Jarek P.

* Re: Races in net_rx_action vs netpoll?
  2007-07-13  8:55         ` Jarek Poplawski
@ 2007-07-16  8:06           ` Jarek Poplawski
From: Jarek Poplawski @ 2007-07-16  8:06 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: David Miller, netdev

On Fri, Jul 13, 2007 at 10:55:08AM +0200, Jarek Poplawski wrote:
> On Thu, Jul 12, 2007 at 03:54:32PM +0200, Olaf Kirch wrote:
...
> > One thing I was a little worried about was whether dev->npinfo can
> > go away all of a sudden. It's really just protected by an rcu_readlock...
...
> BTW, I don't think there could be any problem with rcu (if it's
> all about calling dev->poll from poll_napi) because then poll_napi
> should have the same problem.

To do it justice: I missed your point, but of course you are
right that the validity of dev->npinfo should be analyzed with such
a solution too.

Regards,
Jarek P.

* Re: Races in net_rx_action vs netpoll?
  2007-07-12  2:33         ` David Miller
@ 2007-07-19 15:19           ` Olaf Kirch
  2007-07-19 16:27             ` Stephen Hemminger
From: Olaf Kirch @ 2007-07-19 15:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

On Thursday 12 July 2007 04:33, David Miller wrote:
> I'll merge your patch with a target of 2.6.23.
> 
> If you really want, after this patch has sat in 2.6.23 for a while
> and got some good testing, we can consider a submission for -stable.

Okay, those of you who followed the discussion on lkml will have
read why this patch breaks on e1000.

Short summary: some NIC drivers expect that there is a one-to-one
relation between calls to netif_rx_schedule (where we put the device
on the poll list) and netif_rx_complete (where it's supposed to be
taken off the list). The e1000 is such a beast. Not sure if other
drivers make the same assumption re NAPI.

So: should a driver be allowed to rely on this behavior? Or should
I go and look for another fix to the poll_napi issue?

I keep coming back to the question Jarek asked - why does netpoll
want to call dev->poll() anyway? I dug around a little and it
seems the original idea was to do this only if netpoll_poll was
running on the CPU the netdevice was scheduled to.

So one way to fix the problem is to add a dev->poll_cpu field
that tells us on which CPU's poll list it has been added - and
check for this in poll_napi.
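
Roughly like this (untested sketch, field name made up):

        /* __netif_rx_schedule(): remember which cpu's list we are on */
        dev->poll_cpu = smp_processor_id();
        list_add_tail(&dev->poll_list,
                      &__get_cpu_var(softnet_data).poll_list);

        /* poll_napi(): only call dev->poll from that cpu */
        if (test_bit(__LINK_STATE_RX_SCHED, &np->dev->state) &&
            np->dev->poll_cpu == smp_processor_id() &&
            npinfo->poll_owner != smp_processor_id() &&
            spin_trylock(&npinfo->poll_lock)) {
                ...
        }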

Comments?

David, should I submit an updated patch for 2.6.23, or do you
prefer to yank the patch now and try again for 2.6.24?

Olaf
-- 
Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
okir@lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax

* Re: Races in net_rx_action vs netpoll?
  2007-07-19 15:19           ` Olaf Kirch
@ 2007-07-19 16:27             ` Stephen Hemminger
  2007-07-22  7:05               ` David Miller
From: Stephen Hemminger @ 2007-07-19 16:27 UTC (permalink / raw)
  To: Olaf Kirch; +Cc: David Miller, netdev

On Thu, 19 Jul 2007 17:19:19 +0200
Olaf Kirch <okir@lst.de> wrote:

> On Thursday 12 July 2007 04:33, David Miller wrote:
> > I'll merge your patch with a target of 2.6.23.
> > 
> > If you really want, after this patch has sat in 2.6.23 for a while
> > and got some good testing, we can consider a submission for -stable.
> 
> Okay, those of you who followed the discussion on lkml will have
> read why this patch breaks on e1000.
> 
> Short summary: some NIC drivers expect that there is a one-to-one
> relation between calls to netif_rx_schedule (where we put the device
> on the poll list) and netif_rx_complete (where it's supposed to be
> taken off the list). The e1000 is such a beast. Not sure if other
> drivers make the same assumption re NAPI.
> 
> So: should a driver be allowed to rely on this behavior? Or should
> I go and look for another fix to the poll_napi issue?
> 
> I keep coming back to the question Jarek asked - why does netpoll
> want to call dev->poll() anyway? I dug around a little and it
> seems the original idea was to do this only if netpoll_poll was
> running on the CPU the netdevice was scheduled to.
> 
> So one way to fix the problem is to add a dev->poll_cpu field
> that tells us on which CPU's poll list it has been added - and
> check for this in poll_napi.
> 
> Comments?

Please revisit the requirements that netconsole needs and redesign it
from scratch.  The existing code  is causing too much breakage. 

Can it be done without breaking the semantics of network
devices, or should we rewrite the driver interface to have a different
interface like netdev_sync_send_skb() that is slow, synchronous and
non-interrupt (ie polls for completion)?  Of course, then people will complain
that netconsole traffic slows the machine down.
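
Purely hypothetical, but the kind of thing I mean:

        /* slow, synchronous transmit: queue the skb and spin until the
         * hardware reports completion, never relying on interrupts */
        int netdev_sync_send_skb(struct net_device *dev, struct sk_buff *skb);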

> David, should I submit an updated patch for 2.6.23, or do you
> prefer to yank the patch now and try again for 2.6.24?
> 
> Olaf
> -- 
> Olaf Kirch  |  --- o --- Nous sommes du soleil we love when we play
> okir@lst.de |    / | \   sol.dhoop.naytheet.ah kin.ir.samse.qurax

* Re: Races in net_rx_action vs netpoll?
  2007-07-19 16:27             ` Stephen Hemminger
@ 2007-07-22  7:05               ` David Miller
  2007-07-24 10:26                 ` Jarek Poplawski
From: David Miller @ 2007-07-22  7:05 UTC (permalink / raw)
  To: shemminger; +Cc: okir, netdev

From: Stephen Hemminger <shemminger@linux-foundation.org>
Date: Thu, 19 Jul 2007 17:27:47 +0100

> Please revisit the requirements that netconsole needs and redesign
> it from scratch.  The existing code is causing too much breakage.
>
> Can it be done without breaking the semantics of network devices, or
> should we rewrite the driver interface to have a different
> interface like netdev_sync_send_skb() that is slow, synchronous and
> non-interrupt (ie polls for completion)?  Of course, then people
> will complain that netconsole traffic slows the machine down.

I couldn't agree more.

Since netpoll runs outside of all of the normal netdevice locking
rules, only the people using netpoll hit all the bugs.  That means
most of us do not test out these code paths, which is bad.

So, if anything, a more integrated implementation is essential.


* Re: Races in net_rx_action vs netpoll?
  2007-07-22  7:05               ` David Miller
@ 2007-07-24 10:26                 ` Jarek Poplawski
From: Jarek Poplawski @ 2007-07-24 10:26 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, okir, netdev

On 22-07-2007 09:05, David Miller wrote:
> From: Stephen Hemminger <shemminger@linux-foundation.org>
> Date: Thu, 19 Jul 2007 17:27:47 +0100
> 
>> Please revisit the requirements that netconsole needs and redesign
>> it from scratch.  The existing code is causing too much breakage.
>>
>> Can it be done without breaking the semantics of network devices, or
>> should we rewrite the driver interface to have a different
>> interface like netdev_sync_send_skb() that is slow, synchronous and
>> non-interrupt (ie polls for completion)?  Of course, then people
>> will complain that netconsole traffic slows the machine down.
> 
> I couldn't agree more.
> 
> Since netpoll runs outside of all of the normal netdevice locking
> rules, only the people using netpoll hit all the bugs.  That means
> most of us do not test out these code paths, which is bad.
> 
> So, if anything, a more integrated implementation is essential.
> 

But, IMHO, until this is done, some simpler measures could
be taken to make poll_napi less dangerous (as a matter of fact
I wonder why oopses like the one observed & diagnosed by Olaf are so rare).

The current locking with netpoll_poll_lock is mainly misleading.
It seems whoever designed NAPI didn't even consider that helpers
such as poll_napi could run on other cpus, and whoever wrote netpoll
didn't want to admit that all this per-cpu data needs full locking anyway.
But since this is where we are, there is no reason to reinvent the
wheel, and "normal" locking should be done: a global (not per-device)
spin_lock (#ifdef CONFIG_NETPOLL only) held during all of net_rx_action,
and a matching spin_trylock in poll_napi (with __LINK_STATE_RX_SCHED
re-checking), should be the minimum needed here.
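
Roughly (just to show the idea, completely untested):

        /* net/core/dev.c - shared with net/core/netpoll.c;
         * would be #ifdef CONFIG_NETPOLL in a real patch */
        DEFINE_SPINLOCK(netpoll_poll_list_lock);

        static void net_rx_action(struct softirq_action *h)
        {
                ...
                spin_lock(&netpoll_poll_list_lock);
                while (!list_empty(&queue->poll_list)) {
                        ...
                }
                spin_unlock(&netpoll_poll_list_lock);
                ...
        }

        /* net/core/netpoll.c, poll_napi() */
        if (spin_trylock(&netpoll_poll_list_lock)) {
                if (test_bit(__LINK_STATE_RX_SCHED, &np->dev->state) &&
                    npinfo->poll_owner != smp_processor_id())
                        np->dev->poll(np->dev, &budget);
                spin_unlock(&netpoll_poll_list_lock);
        }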

Regards,
Jarek P.
