netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* new NAPI interface broken
@ 2007-09-07  9:37 Jan-Bernd Themann
  2007-09-12 12:50 ` David Miller
  2007-09-14 22:12 ` David Miller
  0 siblings, 2 replies; 19+ messages in thread
From: Jan-Bernd Themann @ 2007-09-07  9:37 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, themann, Christoph Raisch

Hi Stephen,

I saw that you developed most of the new NAPI interface.
I already addressed this issue a while ago. Please correct me if I got
it wrong. I think there is still a serious problem with the NAPI
changes to make NAPI polling independent of struct net_device objects.
Its about the question who inserts and removes devices from the poll list.

netif_rx_schedule: sets NAPI_STATE_SCHED flag, insert device in poll list.
netif_rx_complete: clears NAPI_STATE_SCHED
netif_rx_reschedule: sets NAPI_STATE_SCHED, insert device in poll list.
net_rx_action: 
 -removes dev from poll list
 -calls poll function
 -adds dev to poll list if NAPI_STATE_SCHED still set

1) netif_rx_complete and netif_rx_reschedule don't work together
2) On SMP systems: after netif_rx_complete has been called on CPU1
   (+interruts enabled), netif_rx_schedule could be called on CPU2 
   (irq handler) before net_rx_action on CPU1 has checked NAPI_STATE_SCHED. 
   In that case the device would be added to poll lists of CPU1 and CPU2
   as net_rx_action would see NAPI_STATE_SCHED set.
   This must not happen. It will be caught when netif_rx_complete is
   called the second time (BUG() called)

This would mean we have a problem on all SMP machines right now.

If I got all this right then we probably need a further flag to tell
net_rx_action whether to poll again or to stop (with the possibility
that the device has been scheduled on a different CPU in between).
The "old" NAPI interface uses the return value of poll to determine
if the device has to be polled again or not. 
We can either switch back or in case we want to stick to
the new return value, we might have to add something similar to 
the NAPI_STATE_SCHED flag or a new parameter...

Regards,
Jan-Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-09-07  9:37 new NAPI interface broken Jan-Bernd Themann
@ 2007-09-12 12:50 ` David Miller
  2007-09-12 13:10   ` new NAPI interface broken for POWER architecture? Christoph Raisch
  2007-10-16  7:29   ` new NAPI interface broken Benjamin Herrenschmidt
  2007-09-14 22:12 ` David Miller
  1 sibling, 2 replies; 19+ messages in thread
From: David Miller @ 2007-09-12 12:50 UTC (permalink / raw)
  To: ossthema; +Cc: shemminger, netdev, themann, raisch

From: Jan-Bernd Themann <ossthema@de.ibm.com>
Date: Fri, 7 Sep 2007 11:37:02 +0200

> 2) On SMP systems: after netif_rx_complete has been called on CPU1
>    (+interruts enabled), netif_rx_schedule could be called on CPU2 
>    (irq handler) before net_rx_action on CPU1 has checked NAPI_STATE_SCHED. 
>    In that case the device would be added to poll lists of CPU1 and CPU2
>    as net_rx_action would see NAPI_STATE_SCHED set.
>    This must not happen. It will be caught when netif_rx_complete is
>    called the second time (BUG() called)
> 
> This would mean we have a problem on all SMP machines right now.

This is not a correct statement.

Only on your platform do network device interrupts get moved
around, no other platform does this.

Sparc64 doesn't, all interrupts stay in one location after
the cpu is initially choosen.

x86 and x86_64 specifically do not move around network
device interrupts, even though other device types do
get dynamic IRQ cpu distribution.

That's why you are the only person seeing this problem.

I agree that it should be fixed, but we should also fix the IRQ
distribution scheme used on powerpc platforms which is totally
broken in these cases.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken for POWER architecture?
  2007-09-12 12:50 ` David Miller
@ 2007-09-12 13:10   ` Christoph Raisch
  2007-09-12 13:27     ` David Miller
  2007-09-12 15:18     ` Arnd Bergmann
  2007-10-16  7:29   ` new NAPI interface broken Benjamin Herrenschmidt
  1 sibling, 2 replies; 19+ messages in thread
From: Christoph Raisch @ 2007-09-12 13:10 UTC (permalink / raw)
  To: David Miller
  Cc: Jan-Bernd Themann, netdev, ossthema, shemminger, Arnd Bergmann,
	Paul Mackerras, Michael Ellerman, linuxppc-dev



David Miller <davem@davemloft.net> wrote on 12.09.2007 14:50:04:

> From: Jan-Bernd Themann <ossthema@de.ibm.com>
> Date: Fri, 7 Sep 2007 11:37:02 +0200
>
> > 2) On SMP systems: after netif_rx_complete has been called on CPU1
> >    (+interruts enabled), netif_rx_schedule could be called on CPU2
> >    (irq handler) before net_rx_action on CPU1 has checked
NAPI_STATE_SCHED.
> >    In that case the device would be added to poll lists of CPU1 and
CPU2
> >    as net_rx_action would see NAPI_STATE_SCHED set.
> >    This must not happen. It will be caught when netif_rx_complete is
> >    called the second time (BUG() called)
> >
> > This would mean we have a problem on all SMP machines right now.
>
> This is not a correct statement.
>
> Only on your platform do network device interrupts get moved
> around, no other platform does this.
>
> Sparc64 doesn't, all interrupts stay in one location after
> the cpu is initially choosen.
>
> x86 and x86_64 specifically do not move around network
> device interrupts, even though other device types do
> get dynamic IRQ cpu distribution.
>
> That's why you are the only person seeing this problem.
>
> I agree that it should be fixed, but we should also fix the IRQ
> distribution scheme used on powerpc platforms which is totally
> broken in these cases.

This is definitely not something we can change in the HEA device driver
alone.
It could also affect any other networking cards on POWER (e1000,s2io...).

Paul, Michael, Arndt, what is your opinion here?

Gruss / Regards
Christoph Raisch


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken for POWER architecture?
  2007-09-12 13:10   ` new NAPI interface broken for POWER architecture? Christoph Raisch
@ 2007-09-12 13:27     ` David Miller
  2007-09-12 15:18     ` Arnd Bergmann
  1 sibling, 0 replies; 19+ messages in thread
From: David Miller @ 2007-09-12 13:27 UTC (permalink / raw)
  To: RAISCH
  Cc: THEMANN, netdev, ossthema, shemminger, ARNDB, pmac, ellerman,
	linuxppc-dev

From: Christoph Raisch <RAISCH@de.ibm.com>
Date: Wed, 12 Sep 2007 15:10:08 +0200

> This is definitely not something we can change in the HEA device driver
> alone.

And it shouldn't be, x86 implements the policy in irq balance
daemon, powerpc should do it wherever it would be appropriate
there.

> Paul, Michael, Arndt, what is your opinion here?

I'm all ears too :)

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken for POWER architecture?
  2007-09-12 13:10   ` new NAPI interface broken for POWER architecture? Christoph Raisch
  2007-09-12 13:27     ` David Miller
@ 2007-09-12 15:18     ` Arnd Bergmann
  1 sibling, 0 replies; 19+ messages in thread
From: Arnd Bergmann @ 2007-09-12 15:18 UTC (permalink / raw)
  To: Christoph Raisch
  Cc: David Miller, Jan-Bernd Themann, netdev, ossthema, shemminger,
	Paul Mackerras, Michael Ellerman, linuxppc-dev

On Wednesday 12 September 2007, Christoph Raisch wrote:
> David Miller <davem@davemloft.net> wrote on 12.09.2007 14:50:04:
>
> > I agree that it should be fixed, but we should also fix the IRQ
> > distribution scheme used on powerpc platforms which is totally
> > broken in these cases.
> 
> This is definitely not something we can change in the HEA device driver
> alone.
> It could also affect any other networking cards on POWER (e1000,s2io...).
> 
> Paul, Michael, Arndt, what is your opinion here?

The situation on Cell with the existing south bridge chips is that
interrupts _never_ get moved around, but are routed to specific
SMT threads by the firmware, while Linux does not interfere with
this.

We have been thinking about changing this so we can distribute
interrupts over all SMT threads in a given NUMA node, or even
over all logical CPUs in the system by reprogramming the
interrupt controller after each interrupt, but the current Axon bridge
chip will always have all devices routed to the same target CPU,
so it's unclear whether that is even an advantage.

	Arnd <><

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-09-07  9:37 new NAPI interface broken Jan-Bernd Themann
  2007-09-12 12:50 ` David Miller
@ 2007-09-14 22:12 ` David Miller
  2007-09-18 16:15   ` Jan-Bernd Themann
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2007-09-14 22:12 UTC (permalink / raw)
  To: ossthema; +Cc: shemminger, netdev, themann, raisch

From: Jan-Bernd Themann <ossthema@de.ibm.com>
Date: Fri, 7 Sep 2007 11:37:02 +0200

> Its about the question who inserts and removes devices from the poll list.
> 
> netif_rx_schedule: sets NAPI_STATE_SCHED flag, insert device in poll list.
> netif_rx_complete: clears NAPI_STATE_SCHED
> netif_rx_reschedule: sets NAPI_STATE_SCHED, insert device in poll list.
> net_rx_action: 
>  -removes dev from poll list
>  -calls poll function
>  -adds dev to poll list if NAPI_STATE_SCHED still set

Ok, for now I'm going to try and deal with this by reverting
the list handling to something approximating the old NAPI
code, as per the patch below.

I've only quickly test booted into this kernel on my workstation.
So take care when trying it out.

I took the liberty to add some assertions and comments all over
wrt. list and quota handling.  One thing to note that (and this
was true previously too) that if ->poll() uses the whole quota
it _MUST_ _NOT_ modify the NAPI state by doing a complete or
reschedule.  The net_rx_action code still owns the NAPI state in
that case, and therefore is allowed to make modifications to the
->poll_list.

I realize this adds a lot of IRQ enable/disable overhead back
into the code, but we can't get rid of that cleanly without
solving this list ownership and modification issue properly.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0106fa6..9837ed3 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -284,7 +284,14 @@ extern int __init netdev_boot_setup(char *str);
  * Structure for NAPI scheduling similar to tasklet but with weighting
  */
 struct napi_struct {
+	/* The poll_list must only be managed by the entity which
+	 * changes the state of the NAPI_STATE_SCHED bit.  This means
+	 * whoever atomically sets that bit can add this napi_struct
+	 * to the per-cpu poll_list, and whoever clears that bit
+	 * can remove from the list right before clearing the bit.
+	 */
 	struct list_head	poll_list;
+
 	unsigned long		state;
 	int			weight;
 	int			quota;
@@ -336,13 +343,21 @@ static inline void napi_schedule(struct napi_struct *n)
  *
  * Mark NAPI processing as complete.
  */
-static inline void napi_complete(struct napi_struct *n)
+static inline void __napi_complete(struct napi_struct *n)
 {
 	BUG_ON(!test_bit(NAPI_STATE_SCHED, &n->state));
+	list_del(&n->poll_list);
 	smp_mb__before_clear_bit();
 	clear_bit(NAPI_STATE_SCHED, &n->state);
 }
 
+static inline void napi_complete(struct napi_struct *n)
+{
+	local_irq_disable();
+	__napi_complete(n);
+	local_irq_enable();
+}
+
 /**
  *	napi_disable - prevent NAPI from scheduling
  *	@n: napi context
@@ -1184,19 +1199,11 @@ static inline u32 netif_msg_init(int debug_value, int default_msg_enable_bits)
 	return (1 << debug_value) - 1;
 }
 
-/* Try to reschedule poll. Called by dev->poll() after netif_rx_complete().
- * Do not inline this?
- */
+/* Try to reschedule poll. Called by dev->poll() after netif_rx_complete().  */
 static inline int netif_rx_reschedule(struct napi_struct *n)
 {
 	if (napi_schedule_prep(n)) {
-		unsigned long flags;
-
-		local_irq_save(flags);
-		list_add_tail(&n->poll_list,
-			      &__get_cpu_var(softnet_data).poll_list);
-		__raise_softirq_irqoff(NET_RX_SOFTIRQ);
-		local_irq_restore(flags);
+		__napi_schedule(n);
 		return 1;
 	}
 	return 0;
@@ -1234,7 +1241,7 @@ static inline void netif_rx_schedule(struct net_device *dev,
 static inline void __netif_rx_complete(struct net_device *dev,
 				       struct napi_struct *napi)
 {
-	napi_complete(napi);
+	__napi_complete(napi);
 	dev_put(dev);
 }
 
diff --git a/net/core/dev.c b/net/core/dev.c
index f119dc0..2897352 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2120,6 +2120,13 @@ static int process_backlog(struct napi_struct *napi, int quota)
 	return work;
 }
 
+static void napi_check_quota_bug(struct napi_struct *n)
+{
+	/* It is illegal to consume more than the given quota.  */
+	if (WARN_ON_ONCE(n->quota < 0))
+		n->quota = 0;
+}
+
 /**
  * __napi_schedule - schedule for receive
  * @napi: entry to schedule
@@ -2130,10 +2137,8 @@ void fastcall __napi_schedule(struct napi_struct *n)
 {
 	unsigned long flags;
 
-	if (n->quota < 0)
-		n->quota += n->weight;
-	else
-		n->quota = n->weight;
+	napi_check_quota_bug(n);
+	n->quota = n->weight;
 
 	local_irq_save(flags);
 	list_add_tail(&n->poll_list, &__get_cpu_var(softnet_data).poll_list);
@@ -2145,32 +2150,36 @@ EXPORT_SYMBOL(__napi_schedule);
 
 static void net_rx_action(struct softirq_action *h)
 {
-	struct list_head list;
+	struct list_head *list = &__get_cpu_var(softnet_data).poll_list;
 	unsigned long start_time = jiffies;
 	int budget = netdev_budget;
 	void *have;
 
 	local_irq_disable();
-	list_replace_init(&__get_cpu_var(softnet_data).poll_list, &list);
-	local_irq_enable();
 
-	while (!list_empty(&list)) {
+	while (!list_empty(list)) {
 		struct napi_struct *n;
 
-		/* if softirq window is exhuasted then punt */
-		if (unlikely(budget <= 0 || jiffies != start_time)) {
-			local_irq_disable();
-			list_splice(&list, &__get_cpu_var(softnet_data).poll_list);
-			__raise_softirq_irqoff(NET_RX_SOFTIRQ);
-			local_irq_enable();
-			break;
-		}
+		/* If softirq window is exhuasted then punt.
+		 *
+		 * Note that this is a slight policy change from the
+		 * previous NAPI code, which would allow up to 2
+		 * jiffies to pass before breaking out.  The test
+		 * used to be "jiffies - start_time > 1".
+		 */
+		if (unlikely(budget <= 0 || jiffies != start_time))
+			goto softnet_break;
 
-		n = list_entry(list.next, struct napi_struct, poll_list);
+		local_irq_enable();
 
-		have = netpoll_poll_lock(n);
+		/* Even though interrupts have been re-enabled, this
+		 * access is safe because interrupts can only add new
+		 * entries to the tail of this list, and only ->poll()
+		 * calls can remove this head entry from the list.
+		 */
+		n = list_entry(list->next, struct napi_struct, poll_list);
 
-		list_del(&n->poll_list);
+		have = netpoll_poll_lock(n);
 
 		/* if quota not exhausted process work */
 		if (likely(n->quota > 0)) {
@@ -2180,12 +2189,25 @@ static void net_rx_action(struct softirq_action *h)
 			n->quota -= work;
 		}
 
-		/* if napi_complete not called, reschedule */
-		if (test_bit(NAPI_STATE_SCHED, &n->state))
-			__napi_schedule(n);
+		local_irq_disable();
+
+		napi_check_quota_bug(n);
+
+		/* Drivers must not modify the NAPI state if they
+		 * consume the entire quota.  In such cases this code
+		 * still "owns" the NAPI instance and therefore can
+		 * move the instance around on the list at-will.
+		 */
+		if (unlikely(!n->quota)) {
+			n->quota = n->weight;
+			list_move_tail(&n->poll_list, list);
+		}
 
 		netpoll_poll_unlock(have);
 	}
+out:
+	local_irq_enable();
+
 #ifdef CONFIG_NET_DMA
 	/*
 	 * There may not be any more sk_buffs coming right now, so push
@@ -2200,6 +2222,13 @@ static void net_rx_action(struct softirq_action *h)
 		}
 	}
 #endif
+
+	return;
+
+softnet_break:
+	__get_cpu_var(netdev_rx_stat).time_squeeze++;
+	__raise_softirq_irqoff(NET_RX_SOFTIRQ);
+	goto out;
 }
 
 static gifconf_func_t * gifconf_list [NPROTO];

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-09-14 22:12 ` David Miller
@ 2007-09-18 16:15   ` Jan-Bernd Themann
  2007-09-18 19:08     ` David Miller
  2007-09-19 15:33     ` Roland Dreier
  0 siblings, 2 replies; 19+ messages in thread
From: Jan-Bernd Themann @ 2007-09-18 16:15 UTC (permalink / raw)
  To: David Miller; +Cc: shemminger, netdev, themann, raisch

Hi,

On Saturday 15 September 2007 00:12, David Miller wrote:
> Ok, for now I'm going to try and deal with this by reverting
> the list handling to something approximating the old NAPI
> code, as per the patch below.
> 
> I've only quickly test booted into this kernel on my workstation.
> So take care when trying it out.
> 
> I took the liberty to add some assertions and comments all over
> wrt. list and quota handling.  One thing to note that (and this
> was true previously too) that if ->poll() uses the whole quota
> it _MUST_ _NOT_ modify the NAPI state by doing a complete or
> reschedule.  The net_rx_action code still owns the NAPI state in
> that case, and therefore is allowed to make modifications to the
> ->poll_list.

I'm currently updating the eHEA polling function to work with this
scheme. Up to now we had sort of a quota for or TX refill part as well,
which was also done in the poll function. This was possible as the device
could be scheduled again even if the quota has not been used
completely. If I got it right this is not longer possible. 
The idea was to benefit from the same "fairness" scheme of NAPI as it is
done for the RX side.

One other thing I observed is that I can not unload the module as the
ref counter of the eth device is too low. I haven't tracked down the 
source of this problem yet.

Regards,
Jan-Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-09-18 16:15   ` Jan-Bernd Themann
@ 2007-09-18 19:08     ` David Miller
  2007-09-19 15:33     ` Roland Dreier
  1 sibling, 0 replies; 19+ messages in thread
From: David Miller @ 2007-09-18 19:08 UTC (permalink / raw)
  To: ossthema; +Cc: shemminger, netdev, themann, raisch

From: Jan-Bernd Themann <ossthema@de.ibm.com>
Date: Tue, 18 Sep 2007 18:15:45 +0200

> One other thing I observed is that I can not unload the module as the
> ref counter of the eth device is too low. I haven't tracked down the 
> source of this problem yet.

This is probably because of the resched device refcounting
bug that Roland Dreier just posted a fix for.

Look for subject "Fix refcounting problem with netif_rx_reschedule()"

I merge that in shortly to net-2.6.24 as well.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-09-18 16:15   ` Jan-Bernd Themann
  2007-09-18 19:08     ` David Miller
@ 2007-09-19 15:33     ` Roland Dreier
  2007-09-19 15:43       ` Jan-Bernd Themann
  1 sibling, 1 reply; 19+ messages in thread
From: Roland Dreier @ 2007-09-19 15:33 UTC (permalink / raw)
  To: Jan-Bernd Themann; +Cc: David Miller, shemminger, netdev, themann, raisch

 > One other thing I observed is that I can not unload the module as the
 > ref counter of the eth device is too low. I haven't tracked down the 
 > source of this problem yet.

I suspect that this is because netif_rx_reschedule() was missing a
dev_hold() to match the dev_put() in netif_rx_complete().  Dave merged
a fix for that problem yesterday, so current net-2.6.24 should be OK.
Let us know if it's not OK, because then there's another bug...

 - R.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-09-19 15:33     ` Roland Dreier
@ 2007-09-19 15:43       ` Jan-Bernd Themann
  0 siblings, 0 replies; 19+ messages in thread
From: Jan-Bernd Themann @ 2007-09-19 15:43 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David Miller, shemminger, netdev, themann, raisch

On Wednesday 19 September 2007 17:33, Roland Dreier wrote:
>  > One other thing I observed is that I can not unload the module as the
>  > ref counter of the eth device is too low. I haven't tracked down the 
>  > source of this problem yet.
> 
> I suspect that this is because netif_rx_reschedule() was missing a
> dev_hold() to match the dev_put() in netif_rx_complete().  Dave merged
> a fix for that problem yesterday, so current net-2.6.24 should be OK.
> Let us know if it's not OK, because then there's another bug...
> 
>  - R.
> 

When I applied this netif_rx_reschedule fix this problem did not occur
anymore (module could be unloaded). So I guess I hit the problem you
described.

Thanks,
Jan-Bernd

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-09-12 12:50 ` David Miller
  2007-09-12 13:10   ` new NAPI interface broken for POWER architecture? Christoph Raisch
@ 2007-10-16  7:29   ` Benjamin Herrenschmidt
  2007-10-16  7:42     ` Benjamin Herrenschmidt
  2007-10-16  7:44     ` David Miller
  1 sibling, 2 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-16  7:29 UTC (permalink / raw)
  To: David Miller; +Cc: ossthema, shemminger, netdev, themann, raisch

Jumping on an old train ...

On Wed, 2007-09-12 at 05:50 -0700, David Miller wrote:
> 
> > This would mean we have a problem on all SMP machines right now.
> 
> This is not a correct statement.
> 
> Only on your platform do network device interrupts get moved
> around, no other platform does this.
> 
> Sparc64 doesn't, all interrupts stay in one location after
> the cpu is initially choosen.
> 
> x86 and x86_64 specifically do not move around network
> device interrupts, even though other device types do
> get dynamic IRQ cpu distribution.
> 
> That's why you are the only person seeing this problem.
> 
> I agree that it should be fixed, but we should also fix the IRQ
> distribution scheme used on powerpc platforms which is totally
> broken in these cases.

So the powerpc platform just honors the affinity mask, and depending on
the PIC does things that range from nothing to spreading interrupts to
CPUs in the affinity mask.

All interrupts by defaults are spread to all CPUs (full balancing).

At this stage, it's afaik userland business to enforce different
policies by changing the affinities via /proc/irq/*.

Do you have any pointer to how that is done on x86 or sparc64 ? On my
x86 laptop using ubuntu gutsy, I definitely see the IRQ on which the
network card is connected (e1000) happily spread between the 2 cores
just like powerpc would do.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-10-16  7:29   ` new NAPI interface broken Benjamin Herrenschmidt
@ 2007-10-16  7:42     ` Benjamin Herrenschmidt
  2007-10-16  7:44     ` David Miller
  1 sibling, 0 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-16  7:42 UTC (permalink / raw)
  To: David Miller; +Cc: ossthema, shemminger, netdev, themann, raisch

> So the powerpc platform just honors the affinity mask, and depending on
> the PIC does things that range from nothing to spreading interrupts to
> CPUs in the affinity mask.
> 
> All interrupts by defaults are spread to all CPUs (full balancing).
> 
> At this stage, it's afaik userland business to enforce different
> policies by changing the affinities via /proc/irq/*.
> 
> Do you have any pointer to how that is done on x86 or sparc64 ? On my
> x86 laptop using ubuntu gutsy, I definitely see the IRQ on which the
> network card is connected (e1000) happily spread between the 2 cores
> just like powerpc would do.

More specifically, IRQF_NOBALANCING doesn't seem to be set anywhere
except a few arch specific timer interrupts etc... nowhere I can see in
network drivers or the network stack (the stack wouldn't know what IRQ
anyway since not all drivers set netdev->irq).

We currently don't have a balance kthread like x86 has, though I wonder
if we should move this one out of x86 and make it generic (hell, it's
even hidden in the IO_APIC code :-) But at this stage, HW balancing by
the PIC is the norm and seems to be happening.

Ben.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-10-16  7:29   ` new NAPI interface broken Benjamin Herrenschmidt
  2007-10-16  7:42     ` Benjamin Herrenschmidt
@ 2007-10-16  7:44     ` David Miller
  2007-10-16  8:28       ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 19+ messages in thread
From: David Miller @ 2007-10-16  7:44 UTC (permalink / raw)
  To: benh; +Cc: ossthema, shemminger, netdev, themann, raisch

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 16 Oct 2007 17:29:47 +1000

> Do you have any pointer to how that is done on x86 or sparc64 ?

Sparc64 does it statically in the kernel.

For x86, see http://irqbalance.org/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-10-16  7:44     ` David Miller
@ 2007-10-16  8:28       ` Benjamin Herrenschmidt
  2007-10-16  8:31         ` David Miller
  0 siblings, 1 reply; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-16  8:28 UTC (permalink / raw)
  To: David Miller; +Cc: ossthema, shemminger, netdev, themann, raisch


On Tue, 2007-10-16 at 00:44 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 16 Oct 2007 17:29:47 +1000
> 
> > Do you have any pointer to how that is done on x86 or sparc64 ?
> 
> Sparc64 does it statically in the kernel.
> 
> For x86, see http://irqbalance.org/

Allright, so that's an out of tree userland thingy... (which may well
work on ppc too I suppose). Definitely not installed by default by my
distro so IRQs from the network cards on all x86's using ubuntu gutsy at
least are spread to all CPUs :-)

There's also a balance kernel thread in x86 with has a notion of irq
that isn't moveable but that flag isn't set by any driver.

Cheers,
Ben.




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-10-16  8:28       ` Benjamin Herrenschmidt
@ 2007-10-16  8:31         ` David Miller
  2007-10-16  9:01           ` Benjamin Herrenschmidt
  2007-10-16 15:56           ` Arjan van de Ven
  0 siblings, 2 replies; 19+ messages in thread
From: David Miller @ 2007-10-16  8:31 UTC (permalink / raw)
  To: benh; +Cc: ossthema, shemminger, netdev, themann, raisch, arjan

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Date: Tue, 16 Oct 2007 18:28:56 +1000

> Allright, so that's an out of tree userland thingy... (which may well
> work on ppc too I suppose). Definitely not installed by default by my
> distro so IRQs from the network cards on all x86's using ubuntu gutsy at
> least are spread to all CPUs :-)

But the thing does treat network interfaces differently from
other devices.

Arjan ran over to my table at kernel summit when I presented this
topic to the audience, to remind me of all of this and he should be
included in on any discussions about this topic. :-)

I've CC:'d him.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-10-16  8:31         ` David Miller
@ 2007-10-16  9:01           ` Benjamin Herrenschmidt
  2007-10-16 21:01             ` Anton Blanchard
  2007-10-17 15:26             ` Christoph Raisch
  2007-10-16 15:56           ` Arjan van de Ven
  1 sibling, 2 replies; 19+ messages in thread
From: Benjamin Herrenschmidt @ 2007-10-16  9:01 UTC (permalink / raw)
  To: David Miller; +Cc: ossthema, shemminger, netdev, themann, raisch, arjan


On Tue, 2007-10-16 at 01:31 -0700, David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 16 Oct 2007 18:28:56 +1000
> 
> > Allright, so that's an out of tree userland thingy... (which may well
> > work on ppc too I suppose). Definitely not installed by default by my
> > distro so IRQs from the network cards on all x86's using ubuntu gutsy at
> > least are spread to all CPUs :-)
> 
> But the thing does treat network interfaces differently from
> other devices.
> 
> Arjan ran over to my table at kernel summit when I presented this
> topic to the audience, to remind me of all of this and he should be
> included in on any discussions about this topic. :-)
> 
> I've CC:'d him.

As far as I know, the x86 in-kernel thingy doesn't but yeah, the
userland one seems much more evolved and does things based on the
"class" of the device.

Christoph, have any of you tried it on powerpc ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-10-16  8:31         ` David Miller
  2007-10-16  9:01           ` Benjamin Herrenschmidt
@ 2007-10-16 15:56           ` Arjan van de Ven
  1 sibling, 0 replies; 19+ messages in thread
From: Arjan van de Ven @ 2007-10-16 15:56 UTC (permalink / raw)
  To: David Miller; +Cc: benh, ossthema, shemminger, netdev, themann, raisch

David Miller wrote:
> From: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Date: Tue, 16 Oct 2007 18:28:56 +1000
> 
>> Allright, so that's an out of tree userland thingy... (which may well
>> work on ppc too I suppose). Definitely not installed by default by my
>> distro so IRQs from the network cards on all x86's using ubuntu gutsy at
>> least are spread to all CPUs :-)
> 
> But the thing does treat network interfaces differently from
> other devices.
> 
and it works on various architectures....
The in-kernel x86 thing is going away (it's actually highly 
suboptimal)....

Yes it's done in userland, this is the right place so far.... out of 
tree... well... there's no good place for such userspace tools in the 
kernel tree currently otherwise I'd love to have it there.
If your distro doesn't install this by default, please file a bug 
against the distro; I know we (Intel) worked with Fedora, RHEL, SuSE 
and Ubuntu to get it included.... maybe others don't?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-10-16  9:01           ` Benjamin Herrenschmidt
@ 2007-10-16 21:01             ` Anton Blanchard
  2007-10-17 15:26             ` Christoph Raisch
  1 sibling, 0 replies; 19+ messages in thread
From: Anton Blanchard @ 2007-10-16 21:01 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: David Miller, ossthema, shemminger, netdev, themann, raisch,
	arjan

 
Hi,

> As far as I know, the x86 in-kernel thingy doesn't but yeah, the
> userland one seems much more evolved and does things based on the
> "class" of the device.
> 
> Christoph, have any of you tried it on powerpc ?

FYI It works fine on PowerPC and its installed by default on some
distros (eg RHEL5).

Anton

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: new NAPI interface broken
  2007-10-16  9:01           ` Benjamin Herrenschmidt
  2007-10-16 21:01             ` Anton Blanchard
@ 2007-10-17 15:26             ` Christoph Raisch
  1 sibling, 0 replies; 19+ messages in thread
From: Christoph Raisch @ 2007-10-17 15:26 UTC (permalink / raw)
  To: benh; +Cc: arjan, David Miller, Jan-Bernd Themann, netdev, ossthema,
	shemminger


Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 16.10.2007
11:01:49:

>
>
> Christoph, have any of you tried it on powerpc ?

No we didn't try this (yet).
This approach makes a lot of sense.

Why is this not installed by both large distros on PPC by default?
how mature is this for larger SMPs on PPC?

>
> Cheers,
> Ben.
>
>
Gruss / Regards
Christoph R.



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2007-10-17 15:26 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-09-07  9:37 new NAPI interface broken Jan-Bernd Themann
2007-09-12 12:50 ` David Miller
2007-09-12 13:10   ` new NAPI interface broken for POWER architecture? Christoph Raisch
2007-09-12 13:27     ` David Miller
2007-09-12 15:18     ` Arnd Bergmann
2007-10-16  7:29   ` new NAPI interface broken Benjamin Herrenschmidt
2007-10-16  7:42     ` Benjamin Herrenschmidt
2007-10-16  7:44     ` David Miller
2007-10-16  8:28       ` Benjamin Herrenschmidt
2007-10-16  8:31         ` David Miller
2007-10-16  9:01           ` Benjamin Herrenschmidt
2007-10-16 21:01             ` Anton Blanchard
2007-10-17 15:26             ` Christoph Raisch
2007-10-16 15:56           ` Arjan van de Ven
2007-09-14 22:12 ` David Miller
2007-09-18 16:15   ` Jan-Bernd Themann
2007-09-18 19:08     ` David Miller
2007-09-19 15:33     ` Roland Dreier
2007-09-19 15:43       ` Jan-Bernd Themann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).