* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 14:16 RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates James Chapman
@ 2007-09-06 14:37 ` Stephen Hemminger
2007-09-06 15:30 ` James Chapman
2007-09-06 23:06 ` jamal
` (3 subsequent siblings)
4 siblings, 1 reply; 29+ messages in thread
From: Stephen Hemminger @ 2007-09-06 14:37 UTC (permalink / raw)
To: James Chapman; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
On Thu, 6 Sep 2007 15:16:00 +0100
James Chapman <jchapman@katalix.com> wrote:
> This RFC suggests some possible improvements to NAPI in the area of minimizing interrupt rates. A possible scheme to reduce interrupt rate for the low packet rate / fast CPU case is described.
>
> First, do we need to encourage consistency in NAPI poll drivers? A survey of current NAPI drivers shows different strategies being used in their poll(). Some such as r8169 do the napi_complete() if poll() does less work than their allowed budget. Others such as e100 and tg3 do napi_complete() only if they do no work at all. And some drivers use NAPI only for receive handling, perhaps setting txdone interrupts for 1 in N transmitted packets, while others do all "interrupt" processing in their poll(). Should we encourage more consistency? Should we encourage more NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx processing in the poll(), and do napi_complete() only when the poll does _no_ work?
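>
> For illustration only (this is not from any existing driver; the foo_* names are invented and it is written against the reworked napi_struct API), such a poll() might look roughly like this:
>
>         static int foo_poll(struct napi_struct *napi, int budget)
>         {
>                 struct foo_adapter *ap = container_of(napi, struct foo_adapter, napi);
>                 int tx_done, rx_done;
>
>                 tx_done = foo_clean_tx_ring(ap);         /* reclaim tx descriptors here, not in a txdone interrupt */
>                 rx_done = foo_clean_rx_ring(ap, budget); /* receive at most budget packets */
>
>                 if (tx_done == 0 && rx_done == 0) {
>                         napi_complete(napi);             /* no work at all: leave polled mode... */
>                         foo_irq_enable(ap);              /* ...and re-enable the device interrupt */
>                 }
>                 return rx_done;                          /* only rx work counts against the budget */
>         }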
>
> One well known issue with NAPI is that it is possible with certain traffic patterns for NAPI drivers to schedule in and out of polled mode very quickly. Worst case, a NAPI driver might get 1 interrupt per packet. With fast CPUs and interfaces, this can happen at high rates, causing high CPU loads and poor packet processing performance. Some drivers avoid this by using hardware interrupt mitigation features of the network device in tandem with NAPI to throttle the max interrupt rate per device. But this adds latency. Jamal's paper http://kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf discusses this problem in some detail.
>
> By making some small changes to the NAPI core, I think it is possible to prevent high interrupt rates with NAPI, regardless of traffic patterns and without using per-device hardware interrupt mitigation. The basic idea is that instead of immediately exiting polled mode when it finds no work to do, the driver's poll() keeps itself in active polled mode for 1-2 jiffies and only does napi_complete() when it does no work in that time period. When it does no work in its poll(), the driver can return 0 while leaving itself in the NAPI poll list. This means it is possible for the softirq processing to spin around its active device list, doing no work, since no quota is consumed. A change is therefore also needed in the NAPI core to detect the case when the only devices that are being actively polled in softirq processing are doing no work on each poll and to exit the softirq loop rather than wasting CPU cycles.
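>
> As a sketch only (foo_* names invented again; ap->last_work would be a jiffies timestamp added to the hypothetical driver's private struct), the idle branch of such a poll() becomes something like:
>
>                 if (tx_done == 0 && rx_done == 0) {
>                         /* stay in the poll list until idle for 1-2 jiffies */
>                         if (time_after(jiffies, ap->last_work + 2)) {
>                                 napi_complete(napi);
>                                 foo_irq_enable(ap);
>                         }
>                         /* else: return 0 but remain in the NAPI poll list */
>                 } else
>                         ap->last_work = jiffies;
>                 return rx_done;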
>
> The code changes are shown in the patch below. The patch is against the latest NAPI rework posted by DaveM http://marc.info/?l=linux-netdev&m=118829721407289&w=2. I used e100 and tg3 drivers to test. Since a driver that returns 0 from its poll() while leaving itself in polled mode would now be used by the NAPI core as a condition for exiting the softirq poll loop, all existing NAPI drivers would need to conform to this new invariant. Some drivers, e.g. e100, can return 0 even if they do tx work in their poll().
>
> Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise have gone idle means that it might be called many times by the NAPI softirq while it has no work to do. This wastes CPU cycles. It would be important therefore to implement the driver's poll() to make this case as efficient as possible, perhaps testing for it early.
>
> When a device is in polled mode while idle, there are 2 scheduling cases to consider:-
>
> 1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx softirq will loop until the next jiffy tick or when quota is exceeded, calling each device in its polled list. Since the idle device is still in the poll list, it will be polled very rapidly.
>
> 2. No other active device is in the poll list. The net_rx softirq will poll the idle device twice and then exit the softirq processing loop as if quota is exceeded. See the net_rx_action() changes in the patch which force the loop to exit if no work is being done by any device in the poll list.
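>
> Roughly, and as a sketch of the idea only rather than the actual patch (pass_work, polled and npolled are invented names), the extra check in the net_rx_action() loop treats a complete pass over the poll list that did zero work as if quota were exceeded:
>
>                 work = n->poll(n, weight);
>                 budget -= work;
>                 pass_work += work;
>                 if (++polled >= npolled) {               /* finished one pass over the poll list */
>                         if (pass_work == 0)
>                                 goto softnet_break;      /* every device returned 0: stop spinning */
>                         pass_work = 0;
>                         polled = 0;
>                 }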
>
> In both cases described above, the scheduler will continue NAPI processing from ksoftirqd. This might be very soon, especially if the system is otherwise idle. But if the system is idle, do we really care that idle network devices will be polled for 1-2 jiffies? If the system is otherwise busy, ksoftirqd will share the CPU with other threads/processes which will reduce the poll rate anyway.
>
> In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping, for example, keeps the device in polled mode, generating no interrupts. In a test, 8510 packets are sent/received versus 6200 previously; CPU load is 100% versus 62% previously; and 1 netdev interrupt occurs versus 12400 previously. Performance and CPU load under extreme network load (using pktgen) is unchanged, as expected. Most importantly though, it is no longer possible to find a combination of CPU performance and traffic pattern that induce high interrupt rates. And because hardware interrupt mitigation isn't used, packet latency is minimized.
>
> The increase in CPU load isn't surprising for a flood ping test since the CPU is working to bounce packets as fast as it can. The increase in packet rate is a good indicator of how much the interrupt and NAPI scheduling overhead is. The CPU load shows 100% because ksoftirqd is always wanting the CPU for the duration of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive _better_ system response (smoother X cursor movement etc) during the flood ping test, despite the CPU load being increased. For a system whose main job is processing network traffic quickly, like an embedded router or a network server, this approach might be very beneficial. For a desktop, I'm less sure,
> although as I said above, I've noticed no performance issues in my setups to date.
>
>
> Is this worth pursuing further? I'm considering doing more work to measure the effects at various relatively low packet rates. I also want to investigate using High Res Timers rather than jiffy sampling to reduce the idle poll time. Perhaps it is also worth trying HRT in the net_rx softirq too. I thought it would be worth throwing the ideas out there first to get early feedback.
>
What about the latency that NAPI imposes? Right now there are certain applications that
don't like NAPI because it adds several more microseconds, and this may make it worse.
Maybe a per-device flag or tuning parameters (like the weight sysfs value)? Or some other
way to set low-latency values.
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 14:37 ` Stephen Hemminger
@ 2007-09-06 15:30 ` James Chapman
2007-09-06 15:37 ` Stephen Hemminger
0 siblings, 1 reply; 29+ messages in thread
From: James Chapman @ 2007-09-06 15:30 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
Stephen Hemminger wrote:
> What about the latency that NAPI imposes? Right now there are certain applications that
> don't like NAPI because it add several more microseconds, and this may make it worse.
Latency is something that I think this approach will actually improve,
at the expense of additional polling. Or is it the ksoftirqd scheduling
latency that you are referring to?
> Maybe a per-device flag or tuning parameters (like weight sysfs value)? or some other
> way to set low-latency values.
Yes. I'd like to think good defaults could be derived though, perhaps
based on settings like CONFIG_PREEMPT, CONFIG_HIGH_RES_TIMER, CONFIG_HZ
and maybe even bogomips / nr_cpus.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 15:30 ` James Chapman
@ 2007-09-06 15:37 ` Stephen Hemminger
2007-09-06 16:07 ` James Chapman
0 siblings, 1 reply; 29+ messages in thread
From: Stephen Hemminger @ 2007-09-06 15:37 UTC (permalink / raw)
To: James Chapman; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
On Thu, 06 Sep 2007 16:30:30 +0100
James Chapman <jchapman@katalix.com> wrote:
> Stephen Hemminger wrote:
>
> > What about the latency that NAPI imposes? Right now there are certain applications that
> > don't like NAPI because it add several more microseconds, and this may make it worse.
>
> Latency is something that I think this approach will actually improve,
> at the expense of additional polling. Or is it the ksoftirqd scheduling
> latency that you are referring to?
The problem is that you leave interrupts disabled, right? Also, you are busy during
idle, which kills power saving and the no-HZ (tickless) clock.
> > Maybe a per-device flag or tuning parameters (like weight sysfs value)? or some other
> > way to set low-latency values.
>
> Yes. I'd like to think good defaults could be derived though, perhaps
> based on settings like CONFIG_PREEMPT, CONFIG_HIGH_RES_TIMER, CONFIG_HZ
> and maybe even bogomips / nr_cpus.
>
> --
> James Chapman
> Katalix Systems Ltd
> http://www.katalix.com
> Catalysts for your Embedded Linux software development
>
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 15:37 ` Stephen Hemminger
@ 2007-09-06 16:07 ` James Chapman
0 siblings, 0 replies; 29+ messages in thread
From: James Chapman @ 2007-09-06 16:07 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
Stephen Hemminger wrote:
> On Thu, 06 Sep 2007 16:30:30 +0100
> James Chapman <jchapman@katalix.com> wrote:
>
>> Stephen Hemminger wrote:
>>
>>> What about the latency that NAPI imposes? Right now there are certain applications that
>>> don't like NAPI because it add several more microseconds, and this may make it worse.
>>
>> Latency is something that I think this approach will actually improve,
>> at the expense of additional polling. Or is it the ksoftirqd scheduling
>> latency that you are referring to?
>
> The problem is that you leave interrupts disabled, right.
Are you saying NAPI drivers should avoid keeping interrupts disabled?
> Also you are busy during idle which kills powersaving and no hz clock.
But perhaps some environments don't care about powersave because they
are always busy? Embedded routers or network servers, for example.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 14:16 RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates James Chapman
2007-09-06 14:37 ` Stephen Hemminger
@ 2007-09-06 23:06 ` jamal
2007-09-07 9:31 ` James Chapman
2007-09-07 3:55 ` Mandeep Singh Baines
` (2 subsequent siblings)
4 siblings, 1 reply; 29+ messages in thread
From: jamal @ 2007-09-06 23:06 UTC (permalink / raw)
To: James Chapman; +Cc: netdev, davem, jeff, mandeep.baines, ossthema
On Thu, 2007-06-09 at 15:16 +0100, James Chapman wrote:
> First, do we need to encourage consistency in NAPI poll drivers? A
> survey of current NAPI drivers shows different strategies being used
> in their poll(). Some such as r8169 do the napi_complete() if poll()
> does less work than their allowed budget. Others such as e100 and tg3
> do napi_complete() only if they do no work at all. And some drivers use
> NAPI only for receive handling, perhaps setting txdone interrupts for
> 1 in N transmitted packets, while others do all "interrupt" processing in
> their poll(). Should we encourage more consistency? Should we encourage more
> NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx
> processing in the poll(), and do napi_complete() only when the poll does _no_ work?
Not to stifle the discussion, but Stephen Hemminger is planning to
write a new howto; that would be a good time to bring up the topic. The
challenge is that there may be hardware issues that will result in small
deviations.
> Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise
> have gone idle means
> that it might be called many times by the NAPI softirq while it has no work to do.
> This wastes CPU cycles. It would be important therefore to implement the driver's
> poll() to make this case as efficient as possible, perhaps testing for it early.
> When a device is in polled mode while idle, there are 2 scheduling cases to consider:-
>
> 1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx
> softirq
> will loop until the next jiffy tick or when quota is exceeded, calling each device
> in its polled
> list. Since the idle device is still in the poll list, it will be polled very rapidly.
One suggestion on limiting the number of polls is to actually have the
driver chew something off the quota even on empty polls - easier by just
changing the driver. A simple case would be to charge, say, 1 packet (more may
make more sense, machine dependent) every time poll is invoked by the core.
This way the core algorithm continues to be fair and when the jiffies
are exceeded you bail out from the driver.
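
Something like this in the driver (foo_* names are made up, just to show
the shape against the reworked napi_struct API):

        static int foo_poll(struct napi_struct *napi, int budget)
        {
                struct foo_adapter *ap = container_of(napi, struct foo_adapter, napi);
                int work = foo_clean_rx_ring(ap, budget);

                foo_clean_tx_ring(ap);

                if (work == 0) {
                        if (foo_idle_too_long(ap)) {    /* e.g. nothing to do for 1-2 jiffies */
                                napi_complete(napi);
                                foo_irq_enable(ap);
                                return 0;
                        }
                        work = 1;       /* an empty poll still chews one unit of quota */
                }
                return work;
        }
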
> 2. No other active device is in the poll list. The net_rx softirq will poll
> the idle device twice
> and then exit the softirq processing loop as if quota is exceeded. See the
> net_rx_action() changes
> in the patch which force the loop to exit if no work is being done by any
> device in the poll list.
>
> In both cases described above, the scheduler will continue NAPI processing
> from ksoftirqd. This
> might be very soon, especially if the system is otherwise idle. But if the
> system is idle, do we
> really care that idle network devices will be polled for 1-2 jiffies?
Unfortunately the folks who have brought this up as an issue would
answer affirmatively.
OTOH, if you can demonstrate that you spend fewer cycles polling rather
than letting NAPI do its thing, you will be able to make a compelling
case.
> If the system is otherwise
> busy, ksoftirqd will share the CPU with other threads/processes which will reduce the poll rate
> anyway.
>
> In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping,
> for example, keeps the device in polled mode, generating no interrupts.
Must be a fast machine.
> In a test, 8510 packets are sent/received versus 6200 previously;
The other packets are dropped? What are the rtt numbers like?
> CPU load is 100% versus 62% previously;
not good.
> and 1 netdev interrupt occurs versus 12400 previously.
good - maybe ;->
> Performance and CPU load under extreme
> network load (using pktgen) is unchanged, as expected.
> Most importantly though, it is no longer possible to find a combination
> of CPU performance and traffic pattern that induce high interrupt rates.
> And because hardware interrupt mitigation isn't used, packet latency is minimized.
I dont think youd find much win against NAPI in this case;
> The increase in CPU load isn't surprising for a flood ping test since the CPU
> is working to bounce packets as fast as it can. The increase in packet rate
> is a good indicator of how much the interrupt and NAPI scheduling overhead is.
Your results above showed decreased tput and increased cpu - did you
mistype that?
> The CPU load shows 100% because ksoftirqd is always wanting the CPU for the duration
> of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread
> gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive
> _better_ system response (smoother X cursor movement etc) during the flood ping test,
interesting - i think i did notice something similar on my laptop
but i couldnt quantify it and it didnt seem to make sense.
> despite the CPU load being increased. For a system whose main job is processing network
> traffic quickly, like an embedded router or a network server, this approach might be very
> beneficial.
I am not sure i buy that James;-> The router types really don't have much
of a challenge in this area.
> For a desktop, I'm less sure, although as I said above, I've noticed no performance
> issues in my setups to date.
> Is this worth pursuing further? I'm considering doing more work to measure the effects at
> various relatively low packet rates.
The standard litmus test applies since this is about performance.
Ignoring memory, the three standard net resources to worry about are
cpu, throughput and latency. If you can show one or more of those
resources got better consistently without affecting the others across
different scenarios - you have a case to make.
For example in my experiments:
At high traffic rates, i didnt affect any of those axes.
At low rates, I was able to reduce cpu abuse, make throughput consistent
but make latency a lot worse. So this meant it was not fit to push
forward.
> I also want to investigate using High Res Timers rather
> than jiffy sampling to reduce the idle poll time.
Mandeep also mentioned tickless - it would be interesting to see both.
> Perhaps it is also worth trying HRT in the
> net_rx softirq too.
You may wanna also try the approach i did with hrt+/tickless by changing
only the driver and not the core.
> I thought it would be worth throwing the ideas out there
> first to get early feedback.
You are doing the right thing by following the path of performance
analysis. I hope you dont get discouraged because the return on
investment may be very low in such work - the majority of the work is in
the testing and analysis (not in puking code endlessly).
cheers,
jamal
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 23:06 ` jamal
@ 2007-09-07 9:31 ` James Chapman
2007-09-07 13:22 ` jamal
2007-09-07 21:20 ` Jason Lunz
0 siblings, 2 replies; 29+ messages in thread
From: James Chapman @ 2007-09-07 9:31 UTC (permalink / raw)
To: hadi; +Cc: netdev, davem, jeff, mandeep.baines, ossthema, Stephen Hemminger
jamal wrote:
> On Thu, 2007-06-09 at 15:16 +0100, James Chapman wrote:
>
>> First, do we need to encourage consistency in NAPI poll drivers?
> not to stiffle the discussion, but Stephen Hemminger is planning to
> write a new howto; that would be a good time to bring up the topic. The
> challenge is that there may be hardware issues that will result in small
> deviations.
Ok.
>> When a device is in polled mode while idle, there are 2 scheduling cases to consider:-
>>
>> 1. One or more other netdevs is not idle and is consuming quota on each poll. The net_rx
>> softirq
>> will loop until the next jiffy tick or when quota is exceeded, calling each device
>> in its polled
>> list. Since the idle device is still in the poll list, it will be polled very rapidly.
>
> One suggestion on limiting the amount of polls is to actually have the
> driver chew something off the quota even on empty polls - easier by just
> changing the driver. A simple case will be say 1 packet (more may make
> more sense, machine dependent) every time poll is invoked by the core.
I wanted to minimize the impact on devices that do have work to do. But
it's worth investigating. Thanks for the suggestion.
>> In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping,
>> for example, keeps the device in polled mode, generating no interrupts.
>
> Must be a fast machine.
Not really. I used 3-year-old, single CPU x86 boxes with e100
interfaces. The idle poll change keeps them in polled mode. Without idle
poll, I get twice as many interrupts as packets, one for txdone and one
for rx. NAPI is continuously scheduled in/out.
>> In a test, 8510 packets are sent/received versus 6200 previously;
>
> The other packets are dropped?
No. Since I did a flood ping from the machine under test, the improved
latency meant that the ping response was handled more quickly, causing
the next packet to be sent sooner. So more packets were transmitted in
the allotted time (10 seconds).
> What are the rtt numbers like?
With current NAPI:
rtt min/avg/max/mdev = 0.902/1.843/101.727/4.659 ms, pipe 9, ipg/ewma
1.611/1.421 ms
With idle poll changes:
rtt min/avg/max/mdev = 0.898/1.117/28.371/0.689 ms, pipe 3, ipg/ewma
1.175/1.236 ms
>> CPU load is 100% versus 62% previously;
>
> not good.
But the CPU has done more work. The flood ping will always show
increased CPU with these changes because the driver always stays in the
NAPI poll list. For typical LAN traffic, the average CPU usage doesn't
increase as much, though more measurements would be useful.
> Your results above showed decreased tput and increased cpu - did you
> mistype that?
I didn't use clear English. :) I'm seeing increased throughput, mostly
because latency is improved. The increased cpu is partly because of the
increased throughput, and partly because ksoftirqd stays busy longer.
>> despite the CPU load being increased. For a system whose main job is processing network
>> traffic quickly, like an embedded router or a network server, this approach might be very
>> beneficial.
>
> I am not sure i buy that James;-> The router types really have not much
> of a challenge in this area.
The problem I started thinking about was the one where NAPI thrashes
in/out of polled mode at higher and higher rates as network interface
speeds and CPU speeds increase. A flood ping demonstrates this even on
100M links on my boxes. Networking boxes want consistent
performance/latency for all traffic patterns and they need to avoid
interrupt livelock. Current practice seems to be to use hardware
interrupt mitigation or timers to limit interrupt rate but this just
hurts latency, as you noted. So I'm trying to find a way to limit the
NAPI interrupt rate without increasing latency. My comment about this
approach being suitable for routers and networked servers is that these
boxes care more about minimizing packet latency than they do about
wasting CPU cycles by polling idle devices.
> You are doing the right thing by following the path on perfomance
> analysis. I hope you dont get discouraged because the return on
> investment may be very low in such work - the majority of the work is in
> the testing and analysis (not in puking code endlessly).
Thanks for your feedback. The challenge will be finding the time to do
this work. :)
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-07 9:31 ` James Chapman
@ 2007-09-07 13:22 ` jamal
2007-09-10 9:20 ` James Chapman
2007-09-12 7:04 ` Bill Fink
2007-09-07 21:20 ` Jason Lunz
1 sibling, 2 replies; 29+ messages in thread
From: jamal @ 2007-09-07 13:22 UTC (permalink / raw)
To: James Chapman
Cc: netdev, davem, jeff, mandeep.baines, ossthema, Stephen Hemminger
On Fri, 2007-07-09 at 10:31 +0100, James Chapman wrote:
> Not really. I used 3-year-old, single CPU x86 boxes with e100
> interfaces.
> The idle poll change keeps them in polled mode. Without idle
> poll, I get twice as many interrupts as packets, one for txdone and one
> for rx. NAPI is continuously scheduled in/out.
Certainly faster than the machine in the paper (which was about 2 years
old in 2005).
I could never get ping -f to do that for me - so things must be getting
worse with newer machines then.
> No. Since I did a flood ping from the machine under test, the improved
> latency meant that the ping response was handled more quickly, causing
> the next packet to be sent sooner. So more packets were transmitted in
> the allotted time (10 seconds).
ok.
> With current NAPI:
> rtt min/avg/max/mdev = 0.902/1.843/101.727/4.659 ms, pipe 9, ipg/ewma
> 1.611/1.421 ms
>
> With idle poll changes:
> rtt min/avg/max/mdev = 0.898/1.117/28.371/0.689 ms, pipe 3, ipg/ewma
> 1.175/1.236 ms
Not bad in terms of latency. The deviation certainly looks better.
> But the CPU has done more work.
I am going to be the devil's advocate[1]:
If the problem i am trying to solve is "reduce cpu use at lower rate",
then this is not the right answer because your cpu use has gone up.
Your latency numbers have not improved that much (looking at the avg)
and your throughput is not that much higher. Will i be willing to pay
more cpu (of an already piggish cpu use by NAPI at that rate with 2
interrupts per packet)?
Another test: try a simple ping and compare the rtts.
> The problem I started thinking about was the one where NAPI thrashes
> in/out of polled mode at higher and higher rates as network interface
> speeds and CPU speeds increase. A flood ping demonstrates this even on
> 100M links on my boxes.
things must be getting worse in the state of average hardware out there.
It would be a worthwhile exercise to compare on an even faster machine
and see what transpires there.
> Networking boxes want consistent
> performance/latency for all traffic patterns and they need to avoid
> interrupt livelock. Current practice seems to be to use hardware
> interrupt mitigation or timers to limit interrupt rate but this just
> hurts latency, as you noted. So I'm trying to find a way to limit the
> NAPI interrupt rate without increasing latency. My comment about this
> approach being suitable for routers and networked servers is that these
> boxes care more about minimizing packet latency than they do about
> wasting CPU cycles by polling idle devices.
I think the argument of "who cares about a little more cpu" is valid
for the case of routers. It is a double edged sword, because it applies
to the case of "who cares if NAPI uses a little more cpu at low rates"
and "who cares if James turns on polling and abuses a little more-more
cpu". Since NAPI is the incumbent, the onus(sp?) is to do better. You
must do better sir!
Look at the timers, she said - that way you may be able to cut the cpu
abuse.
cheers,
jamal
[1] historically the devil's advocate was a farce really ;->
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-07 13:22 ` jamal
@ 2007-09-10 9:20 ` James Chapman
2007-09-10 12:27 ` jamal
2007-09-12 7:04 ` Bill Fink
1 sibling, 1 reply; 29+ messages in thread
From: James Chapman @ 2007-09-10 9:20 UTC (permalink / raw)
To: hadi; +Cc: netdev, davem, jeff, mandeep.baines, ossthema, Stephen Hemminger
jamal wrote:
> If the problem i am trying to solve is "reduce cpu use at lower rate",
> then this is not the right answer because your cpu use has gone up.
The problem I'm trying to solve is "reduce the max interrupt rate from
NAPI drivers while minimizing latency". In modern systems, the interrupt
rate can be so high that the CPU spends too much time processing
interrupts, degrading the system's behavior as seen by the user.
Having the poll() called when idle will always increase CPU usage. But
the feedback you and others are giving encourages me to find a better
compromise. At the end of the day, it's going to be a trade-off of cpu
and latency. The trade-off will be different for each hardware system
and application environment so parameters need to be tunable.
I'll go away and do some tests. I'll post results here for discussion later.
Thanks for your feedback.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-10 9:20 ` James Chapman
@ 2007-09-10 12:27 ` jamal
0 siblings, 0 replies; 29+ messages in thread
From: jamal @ 2007-09-10 12:27 UTC (permalink / raw)
To: James Chapman
Cc: netdev, davem, jeff, mandeep.baines, ossthema, Stephen Hemminger
On Mon, 2007-10-09 at 10:20 +0100, James Chapman wrote:
> jamal wrote:
>
> > If the problem i am trying to solve is "reduce cpu use at lower rate",
> > then this is not the right answer because your cpu use has gone up.
>
> The problem I'm trying to solve is "reduce the max interrupt rate from
> NAPI drivers while minimizing latency".
As long as what you are saying above translates to "there is one
interrupt per packet per napi poll" then we are saying the same thing.
> In modern systems, the interrupt
> rate can be so high that the CPU spends too much time processing
> interrupts, resulting in the system's behavior seen by the user being
> degraded.
modern systems also can handle interrupts a lot better.
If you can amortize two packets per interrupt per napi poll then you
have done better than the breakeven point; however, I think it is fair
to also disprove that in modern hardware the breakeven point is met with
amortizing two packets.
> Having the poll() called when idle will always increase CPU usage. But
> the feedback you and others are giving encourages me to find a better
> compromise.
i dont mean in any way to discourage you - just making you work
better ;-> It is very refreshing to see that you understand the scope is
performance not vomiting endless versions of code - and for this i feel
obligated to help.
> I'll go away and do some tests. I'll post results here for discussion later.
way to go.
cheers,
jamal
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-07 13:22 ` jamal
2007-09-10 9:20 ` James Chapman
@ 2007-09-12 7:04 ` Bill Fink
2007-09-12 12:12 ` jamal
1 sibling, 1 reply; 29+ messages in thread
From: Bill Fink @ 2007-09-12 7:04 UTC (permalink / raw)
To: hadi
Cc: James Chapman, netdev, davem, jeff, mandeep.baines, ossthema,
Stephen Hemminger
On Fri, 07 Sep 2007, jamal wrote:
> On Fri, 2007-07-09 at 10:31 +0100, James Chapman wrote:
> > Not really. I used 3-year-old, single CPU x86 boxes with e100
> > interfaces.
> > The idle poll change keeps them in polled mode. Without idle
> > poll, I get twice as many interrupts as packets, one for txdone and one
> > for rx. NAPI is continuously scheduled in/out.
>
> Certainly faster than the machine in the paper (which was about 2 years
> old in 2005).
> I could never get ping -f to do that for me - so things must be getting
> worse with newer machines then.
>
> > No. Since I did a flood ping from the machine under test, the improved
> > latency meant that the ping response was handled more quickly, causing
> > the next packet to be sent sooner. So more packets were transmitted in
> > the allotted time (10 seconds).
>
> ok.
>
> > With current NAPI:
> > rtt min/avg/max/mdev = 0.902/1.843/101.727/4.659 ms, pipe 9, ipg/ewma
> > 1.611/1.421 ms
> >
> > With idle poll changes:
> > rtt min/avg/max/mdev = 0.898/1.117/28.371/0.689 ms, pipe 3, ipg/ewma
> > 1.175/1.236 ms
>
> Not bad in terms of latency. The deviation certainly looks better.
>
> > But the CPU has done more work.
>
> I am going to be the devil's advocate[1]:
So let me be the angel's advocate. :-)
> If the problem i am trying to solve is "reduce cpu use at lower rate",
> then this is not the right answer because your cpu use has gone up.
> Your latency numbers have not improved that much (looking at the avg)
> and your throughput is not that much higher. Will i be willing to pay
> more cpu (of an already piggish cpu use by NAPI at that rate with 2
> interupts per packet)?
I view his results much more favorably. With current NAPI, the average
RTT is 104% higher than the minimum, the deviation is 4.659 ms, and the
maximum RTT is 101.727 ms. With his patch, the average RTT is only 24%
higher than the minimum, the deviation is only 0.689 ms, and the maximum
RTT is 28.371 ms. The average RTT improved by 39%, the deviation was
6.8 times smaller, and the maximum RTT was 3.6 times smaller. So in
every respect the latency was significantly better.
The throughput increased from 6200 packets to 8510 packets or an increase
of 37%. The only negative is that the CPU utilization increased from
62% to 100% or an increase of 61%, so the CPU increase was greater than
the increase in the amount of work performed (17.6% greater than what
one would expect purely from the increased amount of work).
You can't always improve on all metrics of a workload. Sometimes there
are tradeoffs to be made, decided by the user based on what's most
important to that user and his specific workload. And the suggested
ethtool option (defaulting to current behavior) would enable the user
to make that decision.
-Bill
P.S. I agree that some tests run in parallel with some CPU hogs also
running might be beneficial and enlightening.
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-12 7:04 ` Bill Fink
@ 2007-09-12 12:12 ` jamal
2007-09-12 13:50 ` James Chapman
0 siblings, 1 reply; 29+ messages in thread
From: jamal @ 2007-09-12 12:12 UTC (permalink / raw)
To: Bill Fink
Cc: James Chapman, netdev, davem, jeff, mandeep.baines, ossthema,
Stephen Hemminger
On Wed, 2007-12-09 at 03:04 -0400, Bill Fink wrote:
> On Fri, 07 Sep 2007, jamal wrote:
> > I am going to be the devil's advocate[1]:
>
> So let me be the angel's advocate. :-)
I think this would make you God's advocate ;->
(http://en.wikipedia.org/wiki/God%27s_advocate)
> I view his results much more favorably.
The challenge is, under _low traffic_: bad bad CPU use.
Thats what is at stake, correct?
Lets bury the stats for a sec ...
1) Has that CPU situation improved? No, it has gotten worse.
2) Was there a throughput problem? No.
Remember, this is _low traffic_ and the complaint is not that NAPI doesnt do
high throughput. I am not willing to spend 34% more cpu to get a few
hundred pps (under low traffic!).
3) Latency improvement is good. But is a 34% cpu cost worthwhile for the corner
case of low traffic?
Heres an analogy:
I went to buy bread and complained that 66 cents was too much for such
a tiny sliced loaf.
You tell me you have solved my problem: asking me to pay a dollar
because you made the bread slices crispier. I was complaining about the _66
cents price_, not about the crispiness of the slices ;-> Crispier slices are
good - but am i, the person who was complaining about price, willing to
pay 40-50% more? People are bitching about NAPI abusing CPU, is the
answer to abuse more CPU than NAPI?;->
The answer could be "I am not solving that problem anymore" - at least
thats what James is saying;->
Note: I am not saying theres no problem - just saying the result is not
addressing the problem.
> You can't always improve on all metrics of a workload.
But you gotta try to be consistent.
If, for example, one packet size/rate got negative results but the next
got positive results - thats lacking consistency.
> Sometimes there
> are tradeoffs to be made to be decided by the user based on what's most
> important to that user and his specific workload. And the suggested
> ethtool option (defaulting to current behavior) would enable the user
> to make that decision.
And the challenge is:
What workload is willing to invest that much cpu for low traffic?
Can you name one? One that may come close is database benchmarks for
latency - but those folks wouldnt touch this with a mile-long pole if
you told them their cpu use is going to get worse than what NAPI (that
big bad CPU hog under low traffic) is giving them.
>
> P.S. I agree that some tests run in parallel with some CPU hogs also
> running might be beneficial and enlightening.
indeed.
cheers,
jamal
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-12 12:12 ` jamal
@ 2007-09-12 13:50 ` James Chapman
2007-09-12 14:02 ` Stephen Hemminger
2007-09-14 13:14 ` jamal
0 siblings, 2 replies; 29+ messages in thread
From: James Chapman @ 2007-09-12 13:50 UTC (permalink / raw)
To: hadi, Bill Fink
Cc: netdev, davem, jeff, mandeep.baines, ossthema, Stephen Hemminger
jamal wrote:
> On Wed, 2007-12-09 at 03:04 -0400, Bill Fink wrote:
>> On Fri, 07 Sep 2007, jamal wrote:
>
>>> I am going to be the devil's advocate[1]:
>> So let me be the angel's advocate. :-)
>
> I think this would make you God's advocate ;->
> (http://en.wikipedia.org/wiki/God%27s_advocate)
>
>> I view his results much more favorably.
>
> The challenge is, under _low traffic_: bad bad CPU use.
> Thats what is at stake, correct?
By low traffic, I assume you mean a rate at which the NAPI driver
doesn't stay in polled mode. The problem is that that rate is getting
higher all the time, as interface and CPU speeds increase. This results
in too many interrupts and NAPI thrashing in/out of polled mode very
quickly.
> Lets bury the stats for a sec ...
Yes please. We need an analysis of what happens to cpu usage, latency,
pps etc when various factors are changed, e.g. input pps, NAPI busy-idle
delay etc. The main purpose of my RFC wasn't to push a patch into the
kernel right now, it was to highlight the issue and to find out if
others were already working on it. The feedback has been good so far. I
just need to find some time to do some testing. :)
> People are bitching about NAPI abusing CPU, is the
> answer to abuse more CPU than NAPI?;->
Jamal, do you have more details? Are people saying NAPI gets too much of
the CPU pie because they profiled it? Are they complaining that system
behavior degrades too much under certain network traffic conditions?
Mouse cursor movement jittery? Real-time apps such as music/video
players starved of CPU? Is it possible they blame NAPI because they see
tangible effects on their system, not because measured CPU usage is
high? I say this because my music/video player and mouse cursor behave
_much_ better with my NAPI changes during general use, despite the
increase in measured cpu load. Even ftp can make my system's mouse
cursor jitter...
> The answer could be "I am not solving that problem anymore" - at least
> thats what James is saying;->
I'm investigating whether the symptoms I describe above can be reduced
or eliminated without resorting to hardware interrupt mitigation.
Specifically, I want to do more testing on the idle polling scheme which
seems to improve system behavior in my tests. This will involve more
than doing a flood ping or two. :)
>> Sometimes there
>> are tradeoffs to be made to be decided by the user based on what's most
>> important to that user and his specific workload. And the suggested
>> ethtool option (defaulting to current behavior) would enable the user
>> to make that decision.
>
> And the challenge is:
> What workload is willing to invest that much cpu for low traffic?
> Can you name one? One that may come close is database benchmarks for
> latency - but those folks wouldnt touch this with a mile-long pole if
> you told them their cpu use is going to get worse than what NAPI (that
> big bad CPU hog under low traffic) is giving them.
I agree with both of you. But we need more test results first to know
whether it will be useful to offer NAPI idle polling as an _option_.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-12 13:50 ` James Chapman
@ 2007-09-12 14:02 ` Stephen Hemminger
2007-09-12 16:26 ` James Chapman
2007-09-12 16:47 ` Mandeep Baines
2007-09-14 13:14 ` jamal
1 sibling, 2 replies; 29+ messages in thread
From: Stephen Hemminger @ 2007-09-12 14:02 UTC (permalink / raw)
To: James Chapman
Cc: hadi, Bill Fink, netdev, davem, jeff, mandeep.baines, ossthema
On Wed, 12 Sep 2007 14:50:01 +0100
James Chapman <jchapman@katalix.com> wrote:
> jamal wrote:
> > On Wed, 2007-12-09 at 03:04 -0400, Bill Fink wrote:
> >> On Fri, 07 Sep 2007, jamal wrote:
> >
> >>> I am going to be the devil's advocate[1]:
> >> So let me be the angel's advocate. :-)
> >
> > I think this would make you God's advocate ;->
> > (http://en.wikipedia.org/wiki/God%27s_advocate)
> >
> >> I view his results much more favorably.
> >
> > The challenge is, under _low traffic_: bad bad CPU use.
> > Thats what is at stake, correct?
>
> By low traffic, I assume you mean a rate at which the NAPI driver
> doesn't stay in polled mode. The problem is that that rate is getting
> higher all the time, as interface and CPU speeds increase. This results
> in too many interrupts and NAPI thrashing in/out of polled mode very
> quickly.
But if you compare this to a non-NAPI driver, the same softirq
overhead happens. The problem is that for many older devices, disabling IRQs
requires an expensive non-cached PCI access. Smarter, newer devices
all use MSI, which is purely edge-triggered, and with proper register
usage NAPI should be no worse than non-NAPI.
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-12 14:02 ` Stephen Hemminger
@ 2007-09-12 16:26 ` James Chapman
2007-09-12 16:47 ` Mandeep Baines
1 sibling, 0 replies; 29+ messages in thread
From: James Chapman @ 2007-09-12 16:26 UTC (permalink / raw)
To: Stephen Hemminger
Cc: hadi, Bill Fink, netdev, davem, jeff, mandeep.baines, ossthema
Stephen Hemminger wrote:
> On Wed, 12 Sep 2007 14:50:01 +0100
> James Chapman <jchapman@katalix.com> wrote:
>> By low traffic, I assume you mean a rate at which the NAPI driver
>> doesn't stay in polled mode. The problem is that that rate is getting
>> higher all the time, as interface and CPU speeds increase. This results
>> in too many interrupts and NAPI thrashing in/out of polled mode very
>> quickly.
>
> But if you compare this to non-NAPI driver the same softirq
> overhead happens. The problem is that for many older devices disabling IRQ's
> require an expensive non-cached PCI access. Smarter, newer devices
> all use MSI which is pure edge triggered and with proper register
> usage, NAPI should be no worse than non-NAPI.
While MSI is good, the CPU interrupt overhead (saving/restoring CPU
registers) can hurt bad, especially for RISC CPUs. When packet
processing is interrupt-driven, the kernel's scheduler plays second
fiddle to hardware interrupt and softirq scheduling. Even super-priority
real-time threads don't get a look in.
When traffic rates cause 1 interrupt per tx/rx packet event, NAPI will
use more CPU and have higher latency than non-NAPI because of the extra
work done to enter and leave polled mode. At higher packet rates, NAPI
works very well, unlike non-NAPI which usually needs hardware interrupt
mitigation to avoid interrupt live-lock.
I think NAPI should be a _requirement_ for new net drivers. But I
recognize that it has some issues, hence this thread.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-12 14:02 ` Stephen Hemminger
2007-09-12 16:26 ` James Chapman
@ 2007-09-12 16:47 ` Mandeep Baines
2007-09-13 6:57 ` David Miller
1 sibling, 1 reply; 29+ messages in thread
From: Mandeep Baines @ 2007-09-12 16:47 UTC (permalink / raw)
To: Stephen Hemminger
Cc: James Chapman, hadi, Bill Fink, netdev, davem, jeff, ossthema
On 9/12/07, Stephen Hemminger <shemminger@linux-foundation.org> wrote:
> But if you compare this to non-NAPI driver the same softirq
> overhead happens. The problem is that for many older devices disabling IRQ's
> require an expensive non-cached PCI access. Smarter, newer devices
> all use MSI which is pure edge triggered and with proper register
> usage, NAPI should be no worse than non-NAPI.
Why would disabling IRQs be expensive on non-MSI PCI devices?
Wouldn't it just require a single MMIO write to clear the interrupt
mask of the device? These are write-buffered so the latency should be
minimal. As mentioned in Jamal's UKUUG paper, any MMIO reading could
be avoided by caching the interrupt mask.
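
For example (hypothetical foo_* names; the mask shadow would live in the
driver's private struct):

        static inline void foo_irq_disable(struct foo_adapter *ap)
        {
                ap->irq_mask = 0;                       /* shadow copy, so no MMIO read is needed */
                writel(0, ap->ioaddr + FOO_INTR_MASK);  /* a single posted MMIO write */
        }

        static inline void foo_irq_enable(struct foo_adapter *ap)
        {
                ap->irq_mask = FOO_INTR_DEFAULT;
                writel(ap->irq_mask, ap->ioaddr + FOO_INTR_MASK);
                /* drivers that read the register back here to flush the posted
                 * write pay exactly the expensive access being discussed */
        }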
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-12 16:47 ` Mandeep Baines
@ 2007-09-13 6:57 ` David Miller
0 siblings, 0 replies; 29+ messages in thread
From: David Miller @ 2007-09-13 6:57 UTC (permalink / raw)
To: mandeep.baines
Cc: shemminger, jchapman, hadi, billfink, netdev, jeff, ossthema
From: "Mandeep Baines" <mandeep.baines@gmail.com>
Date: Wed, 12 Sep 2007 09:47:46 -0700
> Why would disabling IRQ's be expensive on non-MSI PCI devices?
> Wouldn't it just require a single MMIO write to clear the interrupt
> mask of the device.
MMIOs are the most expensive part of the whole interrupt
servicing routine, and minimizing them is absolutely
crucial.
This is why many devices do things like report status purely
in memory data structures, automatically disable interrupts
on either MSI delivery or status register read, etc.
Often you will see the first MMIO access in the interrupt
handler at the top of the profiles.
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-12 13:50 ` James Chapman
2007-09-12 14:02 ` Stephen Hemminger
@ 2007-09-14 13:14 ` jamal
1 sibling, 0 replies; 29+ messages in thread
From: jamal @ 2007-09-14 13:14 UTC (permalink / raw)
To: James Chapman
Cc: Bill Fink, netdev, davem, jeff, mandeep.baines, ossthema,
Stephen Hemminger, Rick Jones
On Wed, 2007-12-09 at 14:50 +0100, James Chapman wrote:
> By low traffic, I assume you mean a rate at which the NAPI driver
> doesn't stay in polled mode.
i.e:
"one interupt per packet per napi poll" which cause about 1-2 more IOs
in comparison to the case where you didnt do NAPI.
> The problem is that that rate is getting
> higher all the time, as interface and CPU speeds increase.
indeed;
While i dont want to throw more work at you, with some of the things
that improve the IO cost like PCI express, MSI, and some of the
intelligent things the tg3 does, is this problem still rampant etc? I
think if you can find (seems you have) one "modern" machine (with MSI
and a tg3 etc) that has this problem circa 2007 that will be a good
start.
> This results
> in too many interrupts and NAPI thrashing in/out of polled mode very
> quickly.
indeed.
> Yes please. We need an analysis of what happens to cpu usage, latency,
> pps etc when various factors are changed, e.g. input pps, NAPI busy-idle
> delay etc. The main purpose of my RFC wasn't to push a patch into the
> kernel right now, it was to highlight the issue and to find out if
> others were already working on it. The feedback has been good so far. I
> just need to find some time to do some testing. :)
I love your message. From a blackbox perspective, yes we have some
challenges for NAPI below certain thresholds of traffic.
My claim (in the paper) was that the discrepancy between the cost of IO
access vs cost of RAM vs cost of caches vs CPU speeds has gotten too
high.
CPU vendors have been paying close attention to most of these, but not IO. So avoiding
IO when you can is a good thing.
> Jamal, do you have more details? Are people saying NAPI gets too much of
> the CPU pie because they profiled it?
In the old days Manfred Spraul actually did profile.
Most of the other folks were running benchmarks which account for cpu
use in addition to resources like bandwidth and latency. And so while
bandwidth and latency didnt affect them that much, they observed their
benchmarks didnt look good at low rates (even when they looked excellent
at high rates) because of CPU.
> Are they complaining that system
> behavior degrades too much under certain network traffic conditions?
yes - under low traffic with a high speed cpu youd notice a slightly higher
cpu use.
>
> Mouse cursor movement jittery? Real-time apps such as music/video
> players starved of CPU? Is it possible they blame NAPI because they see
> tangible effects on their system, not because measured CPU usage is
> high?
If i recall correctly transactional type benchmarks is where this was
observed. Some IBM and Intel people bring it up every few months and
maybe Rick Jones once in a while.
Rick, care to comment on the benchmarks?
> I say this because my music/video player and mouse cursor behave
> _much_ better with my NAPI changes during general use, despite the
> increase in measured cpu load. Even ftp can make my system's mouse
> cursor jitter...
Like i told you in my other email - i did notice something similar, i
just couldnt put my finger to it and at some points thought i was
imagining it.
cheers,
jamal
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-07 9:31 ` James Chapman
2007-09-07 13:22 ` jamal
@ 2007-09-07 21:20 ` Jason Lunz
2007-09-10 9:25 ` James Chapman
1 sibling, 1 reply; 29+ messages in thread
From: Jason Lunz @ 2007-09-07 21:20 UTC (permalink / raw)
To: James Chapman
Cc: netdev, davem, jeff, mandeep.baines, ossthema, hadi,
Stephen Hemminger
In gmane.linux.network, you wrote:
> But the CPU has done more work. The flood ping will always show
> increased CPU with these changes because the driver always stays in the
> NAPI poll list. For typical LAN traffic, the average CPU usage doesn't
> increase as much, though more measurements would be useful.
I'd be particularly interested to see what happens to your latency when
other apps are hogging the cpu. I assume from your description that your
cpu is mostly free to schedule the niced softirqd for the device polling
duration, but this won't always be the case. If other tasks are running
at high priority, it could be nearly a full jiffy before softirqd gets
to check the poll list again and the latency introduced could be much
higher than you've yet measured.
Jason
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-07 21:20 ` Jason Lunz
@ 2007-09-10 9:25 ` James Chapman
0 siblings, 0 replies; 29+ messages in thread
From: James Chapman @ 2007-09-10 9:25 UTC (permalink / raw)
To: Jason Lunz
Cc: netdev, davem, jeff, mandeep.baines, ossthema, hadi,
Stephen Hemminger
Jason Lunz wrote:
> I'd be particularly interested to see what happens to your latency when
> other apps are hogging the cpu. I assume from your description that your
> cpu is mostly free to schedule the niced softirqd for the device polling
> duration, but this won't always be the case. If other tasks are running
> at high priority, it could be nearly a full jiffy before softirqd gets
> to check the poll list again and the latency introduced could be much
> higher than you've yet measured.
Indeed. The effect of cpu load on all of this is important to consider.
The challenge will be how to test it fairly on different test runs.
One thing to bear in mind is that interrupts are processed at highest
priority, above any scheduled work. Reducing interrupt rate gives the
scheduler more chance to run what it thinks is the next highest priority
work. This might be at the expense of network processing. Is it better
to give other runnable tasks a fair chunk of the cpu pie? I think so.
I'll try to incorporate application cpu load into my tests. Thanks for
your feedback.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 14:16 RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates James Chapman
2007-09-06 14:37 ` Stephen Hemminger
2007-09-06 23:06 ` jamal
@ 2007-09-07 3:55 ` Mandeep Singh Baines
2007-09-07 9:38 ` James Chapman
2007-09-08 16:32 ` Andi Kleen
2007-09-12 15:12 ` David Miller
4 siblings, 1 reply; 29+ messages in thread
From: Mandeep Singh Baines @ 2007-09-07 3:55 UTC (permalink / raw)
To: James Chapman; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
Hi James,
I like the idea of staying in poll longer.
My comments are similar to what Jamal and Stephen have already said.
A tunable (via sysfs) would be nice.
A timer might be preferred to jiffy polling. Jiffy polling will not increase
latency the way a timer would. However, jiffy polling will consume a lot more
CPU than a timer would. Hence more power. For jiffy polling, you could have
thousands of calls to poll() for a single packet received, while in a timer
approach the number of polls per packet is bounded by 2.
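
A rough, untested sketch of the timer variant (the foo_* names, the 500us
period and the deferred_check flag are all made up for illustration,
against the reworked napi_struct API):

        static enum hrtimer_restart foo_poll_timer(struct hrtimer *t)
        {
                struct foo_adapter *ap = container_of(t, struct foo_adapter, poll_timer);

                napi_schedule(&ap->napi);               /* at most one deferred re-poll */
                return HRTIMER_NORESTART;
        }

        static int foo_poll(struct napi_struct *napi, int budget)
        {
                struct foo_adapter *ap = container_of(napi, struct foo_adapter, napi);
                int work = foo_clean_rx_ring(ap, budget);

                if (work) {
                        ap->deferred_check = 0;
                } else {
                        napi_complete(napi);
                        if (ap->deferred_check) {
                                ap->deferred_check = 0;
                                foo_irq_enable(ap);     /* two idle polls in a row: give up */
                        } else {
                                ap->deferred_check = 1; /* check once more in ~500us */
                                hrtimer_start(&ap->poll_timer, ktime_set(0, 500000),
                                              HRTIMER_MODE_REL);
                        }
                }
                return work;
        }
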
I think it may be difficult to make poll efficient for the no packet case because,
at a minimum, you have to poll the device state via the has_work method.
If you go to a timer implementation then having a tunable will be important.
Different applications will have different requirements on delay and jitter.
Some applications may want to trade delay/jitter for less CPU/power
consumption and some may not.
imho, the work should definitely be pursued further :)
Regards,
Mandeep
James Chapman (jchapman@katalix.com) wrote:
> This RFC suggests some possible improvements to NAPI in the area of minimizing interrupt rates. A possible scheme to reduce interrupt rate for the low packet rate / fast CPU case is described.
>
> First, do we need to encourage consistency in NAPI poll drivers? A survey of current NAPI drivers shows different strategies being used in their poll(). Some such as r8169 do the napi_complete() if poll() does less work than their allowed budget. Others such as e100 and tg3 do napi_complete() only if they do no work at all. And some drivers use NAPI only for receive handling, perhaps setting txdone interrupts for 1 in N transmitted packets, while others do all "interrupt" processing in their poll(). Should we encourage more consistency? Should we encourage more NAPI driver maintainers to minimize interrupts by doing all rx _and_ tx processing in the poll(), and do napi_complete() only when the poll does _no_ work?
>
> One well known issue with NAPI is that it is possible with certain traffic patterns for NAPI drivers to schedule in and out of polled mode very quickly. Worst case, a NAPI driver might get 1 interrupt per packet. With fast CPUs and interfaces, this can happen at high rates, causing high CPU loads and poor packet processing performance. Some drivers avoid this by using hardware interrupt mitigation features of the network device in tandem with NAPI to throttle the max interrupt rate per device. But this adds latency. Jamal's paper http://kernel.org/pub/linux/kernel/people/hadi/docs/UKUUG2005.pdf discusses this problem in some detail.
>
> By making some small changes to the NAPI core, I think it is possible to prevent high interrupt rates with NAPI, regardless of traffic patterns and without using per-device hardware interrupt mitigation. The basic idea is that instead of immediately exiting polled mode when it finds no work to do, the driver's poll() keeps itself in active polled mode for 1-2 jiffies and only does napi_complete() when it does no work in that time period. When it does no work in its poll(), the driver can return 0 while leaving itself in the NAPI poll list. This means it is possible for the softirq processing to spin around its active device list, doing no work, since no quota is consumed. A change is therefore also needed in the NAPI core to detect the case when the only devices that are being actively polled in softirq processing are doing no work on each poll and to exit the softirq loop rather than wasting CPU cycles.
>
> The code changes are shown in the patch below. The patch is against the latest NAPI rework posted by DaveM http://marc.info/?l=linux-netdev&m=118829721407289&w=2. I used e100 and tg3 drivers to test. Since a driver that returns 0 from its poll() while leaving itself in polled mode would now be used by the NAPI core as a condition for exiting the softirq poll loop, all existing NAPI drivers would need to conform to this new invariant. Some drivers, e.g. e100, can return 0 even if they do tx work in their poll().
>
> Clearly, keeping a device in polled mode for 1-2 jiffies after it would otherwise have gone idle means that it might be called many times by the NAPI softirq while it has no work to do. This wastes CPU cycles. It would be important therefore to implement the driver's poll() to make this case as efficient as possible, perhaps testing for it early.
>
> When a device is in polled mode while idle, there are 2 scheduling cases to consider:-
>
> 1. One or more other netdevs are not idle and are consuming quota on each poll. The net_rx softirq will loop until the next jiffy tick or until quota is exceeded, calling each device in its polled list. Since the idle device is still in the poll list, it will be polled very rapidly.
>
> 2. No other active device is in the poll list. The net_rx softirq will poll the idle device twice and then exit the softirq processing loop as if quota is exceeded. See the net_rx_action() changes in the patch which force the loop to exit if no work is being done by any device in the poll list.
>
> In both cases described above, the scheduler will continue NAPI processing from ksoftirqd. This might be very soon, especially if the system is otherwise idle. But if the system is idle, do we really care that idle network devices will be polled for 1-2 jiffies? If the system is otherwise busy, ksoftirqd will share the CPU with other threads/processes which will reduce the poll rate anyway.
>
> In testing, I see significant reduction in interrupt rate for typical traffic patterns. A flood ping, for example, keeps the device in polled mode, generating no interrupts. In a test, 8510 packets are sent/received versus 6200 previously; CPU load is 100% versus 62% previously; and 1 netdev interrupt occurs versus 12400 previously. Performance and CPU load under extreme network load (using pktgen) are unchanged, as expected. Most importantly though, it is no longer possible to find a combination of CPU performance and traffic pattern that induces high interrupt rates. And because hardware interrupt mitigation isn't used, packet latency is minimized.
>
> The increase in CPU load isn't surprising for a flood ping test since the CPU is working to bounce packets as fast as it can. The increase in packet rate is a good indicator of how much the interrupt and NAPI scheduling overhead is. The CPU load shows 100% because ksoftirqd is always wanting the CPU for the duration of the flood ping. The beauty of NAPI is that the scheduler gets to decide which thread gets the CPU, not hardware CPU interrupt priorities. On my desktop system, I perceive _better_ system response (smoother X cursor movement etc) during the flood ping test, despite the CPU load being increased. For a system whose main job is processing network traffic quickly, like an embedded router or a network server, this approach might be very beneficial. For a desktop, I'm less sure,
although as I said above, I've noticed no performance issues in my setups to date.
>
>
> Is this worth pursuing further? I'm considering doing more work to measure the effects at various relatively low packet rates. I also want to investigate using High Res Timers rather than jiffy sampling to reduce the idle poll time. Perhaps it is also worth trying HRT in the net_rx softirq too. I thought it would be worth throwing the ideas out there first to get early feedback.
>
> Here's the patch.
>
> Index: linux-2.6/drivers/net/e100.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/e100.c
> +++ linux-2.6/drivers/net/e100.c
> @@ -544,6 +544,7 @@ struct nic {
> struct cb *cb_to_use;
> struct cb *cb_to_send;
> struct cb *cb_to_clean;
> + unsigned long exit_poll_time;
> u16 tx_command;
> /* End: frequently used values: keep adjacent for cache effect */
>
> @@ -1993,12 +1994,35 @@ static int e100_poll(struct napi_struct
> e100_rx_clean(nic, &work_done, budget);
> tx_cleaned = e100_tx_clean(nic);
>
> - /* If no Rx and Tx cleanup work was done, exit polling mode. */
> - if((!tx_cleaned && (work_done == 0)) || !netif_running(netdev)) {
> + if (!netif_running(netdev)) {
> netif_rx_complete(netdev, napi);
> e100_enable_irq(nic);
> + return 0;
> }
>
> + /* Stay in polled mode if we do any tx cleanup */
> + if (tx_cleaned)
> + work_done++;
> +
> + /* If no Rx and Tx cleanup work was done, exit polling mode if
> + * we've seen no work for 1-2 jiffies.
> + */
> + if (work_done == 0) {
> + if (nic->exit_poll_time) {
> + if (time_after(jiffies, nic->exit_poll_time)) {
> + nic->exit_poll_time = 0;
> + netif_rx_complete(netdev, napi);
> + e100_enable_irq(nic);
> + }
> + } else {
> + nic->exit_poll_time = jiffies + 2;
> + }
> + return 0;
> + }
> +
> + /* Otherwise, reset poll exit time and stay in poll list */
> + nic->exit_poll_time = 0;
> +
> return work_done;
> }
>
> Index: linux-2.6/drivers/net/tg3.c
> ===================================================================
> --- linux-2.6.orig/drivers/net/tg3.c
> +++ linux-2.6/drivers/net/tg3.c
> @@ -3473,6 +3473,24 @@ static int tg3_poll(struct napi_struct *
> struct tg3_hw_status *sblk = tp->hw_status;
> int work_done = 0;
>
> + /* fastpath having no work while we're holding ourself in
> + * polled mode
> + */
> + if ((tp->exit_poll_time) && (!tg3_has_work(tp))) {
> + if (time_after(jiffies, tp->exit_poll_time)) {
> + tp->exit_poll_time = 0;
> + /* tell net stack and NIC we're done */
> + netif_rx_complete(netdev, napi);
> + tg3_restart_ints(tp);
> + }
> + return 0;
> + }
> +
> + /* if we get here, there might be work to do so disable the
> + * poll hold fastpath above
> + */
> + tp->exit_poll_time = 0;
> +
> /* handle link change and other phy events */
> if (!(tp->tg3_flags &
> (TG3_FLAG_USE_LINKCHG_REG |
> @@ -3511,11 +3529,11 @@ static int tg3_poll(struct napi_struct *
> } else
> sblk->status &= ~SD_STATUS_UPDATED;
>
> - /* if no more work, tell net stack and NIC we're done */
> - if (!tg3_has_work(tp)) {
> - netif_rx_complete(netdev, napi);
> - tg3_restart_ints(tp);
> - }
> + /* if no more work, set the time in jiffies when we should
> + * exit polled mode
> + */
> + if (!tg3_has_work(tp))
> + tp->exit_poll_time = jiffies + 2;
>
> return work_done;
> }
> Index: linux-2.6/drivers/net/tg3.h
> ===================================================================
> --- linux-2.6.orig/drivers/net/tg3.h
> +++ linux-2.6/drivers/net/tg3.h
> @@ -2163,6 +2163,7 @@ struct tg3 {
> u32 last_tag;
>
> u32 msg_enable;
> + unsigned long exit_poll_time;
>
> /* begin "tx thread" cacheline section */
> void (*write32_tx_mbox) (struct tg3 *, u32,
> Index: linux-2.6/net/core/dev.c
> ===================================================================
> --- linux-2.6.orig/net/core/dev.c
> +++ linux-2.6/net/core/dev.c
> @@ -2073,6 +2073,8 @@ static void net_rx_action(struct softirq
> unsigned long start_time = jiffies;
> int budget = netdev_budget;
> void *have;
> + struct napi_struct *last_hold = NULL;
> + int done = 0;
>
> local_irq_disable();
> list_replace_init(&__get_cpu_var(softnet_data).poll_list, &list);
> @@ -2082,7 +2084,7 @@ static void net_rx_action(struct softirq
> struct napi_struct *n;
>
> /* if softirq window is exhuasted then punt */
> - if (unlikely(budget <= 0 || jiffies != start_time)) {
> + if (unlikely(budget <= 0 || jiffies != start_time || done)) {
> local_irq_disable();
> list_splice(&list, &__get_cpu_var(softnet_data).poll_list);
> __raise_softirq_irqoff(NET_RX_SOFTIRQ);
> @@ -2096,12 +2098,28 @@ static void net_rx_action(struct softirq
>
> list_del(&n->poll_list);
>
> - /* if quota not exhausted process work */
> + /* if quota not exhausted process work. We special
> + * case on n->poll() returning 0 here when the driver
> + * doesn't do a napi_complete(), which indicates that
> + * the device wants to stay on the poll list although
> + * it did no work. We remember the first device to go
> + * into this state in order to terminate this loop if
> + * we see the same device again without doing any
> + * other work.
> + */
> if (likely(n->quota > 0)) {
> int work = n->poll(n, min(budget, n->quota));
>
> - budget -= work;
> - n->quota -= work;
> + if (likely(work)) {
> + budget -= work;
> + n->quota -= work;
> + last_hold = NULL;
> + } else if (test_bit(NAPI_STATE_SCHED, &n->state)) {
> + if (unlikely(n == last_hold))
> + done = 1;
> + if (likely(!last_hold))
> + last_hold = n;
> + }
> }
>
> /* if napi_complete not called, reschedule */
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-07 3:55 ` Mandeep Singh Baines
@ 2007-09-07 9:38 ` James Chapman
2007-09-08 16:42 ` Mandeep Singh Baines
0 siblings, 1 reply; 29+ messages in thread
From: James Chapman @ 2007-09-07 9:38 UTC (permalink / raw)
To: Mandeep Singh Baines
Cc: netdev, hadi, davem, jeff, ossthema, Stephen Hemminger
Hi Mandeep,
Mandeep Singh Baines wrote:
> Hi James,
>
> I like the idea of staying in poll longer.
>
> My comments are similar to what Jamal and Stephen have already said.
>
> A tunable (via sysfs) would be nice.
>
> A timer might be preferred to jiffy polling. Jiffy polling will not increase
> latency the way a timer would. However, jiffy polling will consume a lot more
> CPU than a timer would. Hence more power. For jiffy polling, you could have
> thousands of calls to poll for a single packet received. While in a timer
> approach the number of polls per packet is bounded above by two.
Why would using a timer to hold off the napi_complete() rather than
jiffy count limit the polls per packet to 2?
> I think it may be difficult to make poll() efficient for the no-packet case because,
> at a minimum, you have to poll the device state via the has_work method.
Why wouldn't it be efficient? It would usually be done by reading an
"interrupt pending" register.
> If you go to a timer implementation then having a tunable will be important.
> Different applications will have different requirements on delay and jitter.
> Some applications may want to trade delay/jitter for less CPU/power
> consumption and some may not.
I agree. I'm leaning towards a new ethtool parameter to control this to
be consistent with other per-device tunables.
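For illustration only, the knob could even piggyback on the existing
coalescing callbacks rather than a brand-new interface. Whether to reuse
rx_coalesce_usecs or add a dedicated field is an open question, and the
mynic_* names and poll_hold_usecs field below are made up:

struct mynic {
        /* ... existing driver state ... */
        u32 poll_hold_usecs;    /* how long poll() holds on with no work */
};

static int mynic_get_coalesce(struct net_device *dev,
                              struct ethtool_coalesce *ec)
{
        struct mynic *nic = netdev_priv(dev);

        ec->rx_coalesce_usecs = nic->poll_hold_usecs;
        return 0;
}

static int mynic_set_coalesce(struct net_device *dev,
                              struct ethtool_coalesce *ec)
{
        struct mynic *nic = netdev_priv(dev);

        nic->poll_hold_usecs = ec->rx_coalesce_usecs;
        return 0;
}

static const struct ethtool_ops mynic_ethtool_ops = {
        .get_coalesce   = mynic_get_coalesce,
        .set_coalesce   = mynic_set_coalesce,
};

Then something like "ethtool -C ethX rx-usecs N" would tune the hold time
without needing any new userspace.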
> imho, the work should definitely be pursued further. :)
Thanks Mandeep. I'll try. :)
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-07 9:38 ` James Chapman
@ 2007-09-08 16:42 ` Mandeep Singh Baines
2007-09-10 9:33 ` James Chapman
2007-09-10 12:12 ` jamal
0 siblings, 2 replies; 29+ messages in thread
From: Mandeep Singh Baines @ 2007-09-08 16:42 UTC (permalink / raw)
To: James Chapman
Cc: Mandeep Singh Baines, netdev, hadi, davem, jeff, ossthema,
Stephen Hemminger
James Chapman (jchapman@katalix.com) wrote:
> Hi Mandeep,
>
> Mandeep Singh Baines wrote:
>> Hi James,
>> I like the idea of staying in poll longer.
>> My comments are similar to what Jamal and Stephen have already
>> said.
>> A tunable (via sysfs) would be nice.
>> A timer might be preferred to jiffy polling. Jiffy polling will not
>> increase latency the way a timer would. However, jiffy polling will
>> consume a lot more CPU than a timer would. Hence more power. For
>> jiffy polling, you could have thousands of calls to poll for a single
>> packet received. While in a timer approach the number of polls per
>> packet is bounded above by two.
>
> Why would using a timer to hold off the napi_complete() rather than
> jiffy count limit the polls per packet to 2?
>
I was thinking a timer could be used in the way suggested in Jamal's
paper. The driver would do nothing (park) until the timer expires. So
there would be no calls to poll for the duration of the timer. Hence,
this approach would add extra latency not present in a jiffy polling
approach.
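To make that concrete, a rough, untested sketch of the park variant could
look like the code below. The mynic_* helpers, the park_timer field and
park_jiffies are all invented, and the timer would be set up once in probe
with setup_timer(&nic->park_timer, mynic_park_timer, (unsigned long)nic):

static void mynic_park_timer(unsigned long data)
{
        struct mynic *nic = (struct mynic *)data;

        /* Park period over: resume polling if the ring has filled up,
         * otherwise drop back to interrupt mode.
         */
        if (mynic_has_work(nic))
                netif_rx_schedule(nic->netdev, &nic->napi);
        else
                mynic_enable_irq(nic);
}

static int mynic_poll(struct napi_struct *napi, int budget)
{
        struct mynic *nic = container_of(napi, struct mynic, napi);
        int work_done = mynic_rx_clean(nic, budget);

        if (work_done == 0) {
                /* Park: leave NIC interrupts masked and stop polling;
                 * nothing happens until the timer fires.
                 */
                netif_rx_complete(nic->netdev, napi);
                mod_timer(&nic->park_timer, jiffies + nic->park_jiffies);
        }
        return work_done;
}

The extra latency is exactly the park window: packets arriving while parked
sit in the ring until the timer fires, but there are at most a couple of
polls per packet and no interrupts.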
>> I think it may be difficult to make poll() efficient for the no-packet
>> case because, at a minimum, you have to poll the device state via the
>> has_work method.
>
> Why wouldn't it be efficient? It would usually be done by reading an
> "interrupt pending" register.
>
Reading the "interrupt pending" register would require an MMIO read.
MMIO reads are very expensive. In some systems the latency of an MMIO
read can be 1000x that of an L1 cache access.
You can use mmio_test to measure MMIO read latency on your system:
http://svn.gnumonks.org/trunk/mmio_test/
However, has_work() doesn't have to be inefficient. For newer
devices you can implement has_work() without an MMIO read by polling
the next ring entry status in memory or some other mechanism. Where
PCI is cache-coherent, accesses to this memory location could be cached
after the first miss. For architectures where PCI is not coherent you'd
have to go to memory for every poll. So for these architectures has_work()
will be moderately expensive (a memory access) even when it does
not require an MMIO read. This might affect home routers: not sure if MIPS
or ARM have coherent PCI.
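For example (descriptor layout, field and flag names are all invented
here), has_work() could boil down to a couple of reads of driver state and
of descriptor memory the NIC writes back to:

static inline int mynic_has_work(struct mynic *nic)
{
        struct mynic_rx_desc *rxd = &nic->rx_ring[nic->rx_next_to_clean];

        /* The NIC DMAs completion status into host memory. On
         * cache-coherent PCI this read usually hits the cache and is
         * only refetched after the device writes new status.
         */
        if (le32_to_cpu(rxd->status) & MYNIC_RXD_DONE)
                return 1;

        /* Outstanding tx completions? Driver state plus one more
         * descriptor read, still no MMIO.
         */
        return nic->tx_next_to_clean != nic->tx_next_to_use &&
               (le32_to_cpu(nic->tx_ring[nic->tx_next_to_clean].status) &
                MYNIC_TXD_DONE);
}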
>> If you go to a timer implementation then having a tunable will be
>> important.
>> Different applications will have different requirements on delay and
>> jitter.
>> Some applications may want to trade delay/jitter for less CPU/power
>> consumption and some may not.
>
> I agree. I'm leaning towards a new ethtool parameter to control this
> to be consistent with other per-device tunables.
>
>> imho, the work should definitely be pursued further. :)
>
> Thanks Mandeep. I'll try. :)
>
> --
> James Chapman
> Katalix Systems Ltd
> http://www.katalix.com
> Catalysts for your Embedded Linux software development
>
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-08 16:42 ` Mandeep Singh Baines
@ 2007-09-10 9:33 ` James Chapman
2007-09-10 12:12 ` jamal
1 sibling, 0 replies; 29+ messages in thread
From: James Chapman @ 2007-09-10 9:33 UTC (permalink / raw)
To: Mandeep Singh Baines
Cc: netdev, hadi, davem, jeff, ossthema, Stephen Hemminger
Mandeep Singh Baines wrote:
>> Why would using a timer to hold off the napi_complete() rather than
>> jiffy count limit the polls per packet to 2?
>>
> I was thinking a timer could be used in the way suggested in Jamal's
> paper. The driver would do nothing (park) until the timer expires. So
> there would be no calls to poll for the duration of the timer. Hence,
> this approach would add extra latency not present in a jiffy polling
> approach.
Ah, ok. I wasn't planning to test timer-driven polling. :)
>> Why wouldn't it be efficient? It would usually be done by reading an
>> "interrupt pending" register.
>>
> Reading the "interrupt pending" register would require an MMIO read.
> MMIO reads are very expensive. In some systems the latency of an MMIO
> read can be 1000x that of an L1 cache access.
Agreed. Testing for any work being available should be as efficient as
possible and would be driver specific.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-08 16:42 ` Mandeep Singh Baines
2007-09-10 9:33 ` James Chapman
@ 2007-09-10 12:12 ` jamal
1 sibling, 0 replies; 29+ messages in thread
From: jamal @ 2007-09-10 12:12 UTC (permalink / raw)
To: Mandeep Singh Baines
Cc: James Chapman, netdev, davem, jeff, ossthema, Stephen Hemminger
On Sat, 2007-08-09 at 09:42 -0700, Mandeep Singh Baines wrote:
> Reading the "interrupt pending" register would require an MMIO read.
> MMIO reads are very expensive. In some systems the latency of an MMIO
> read can be 1000x that of an L1 cache access.
Indeed.
> However, has_work() doesn't have to be inefficient. For newer
> devices you can implement has_work() without an MMIO read by polling
> the next ring entry status in memory or some other mechanism. Where
> PCI is cache-coherent, accesses to this memory location could be cached
> after the first miss. For architectures where PCI is not coherent you'd
> have to go to memory for every poll. So for these architectures has_work()
> will be moderately expensive (a memory access) even when it does
> not require an MMIO read. This might affect home routers: not sure if MIPS
> or ARM have coherent PCI.
I think the effect would be clearly observable experimentally on smaller
devices, e.g. the Geode you seem to be experimenting on.
One other suggestion I made in the paper is to do something along the lines
of the cached_irq_mask used for the i8259.
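i.e. keep a software shadow of the NIC's interrupt mask so the hot path
never has to read the mask register back over the bus; a rough sketch,
with an invented MYNIC_IMR register and made-up field names, would be:

static void mynic_irq_mask(struct mynic *nic, u32 bits)
{
        unsigned long flags;

        spin_lock_irqsave(&nic->irq_lock, flags);
        nic->irq_mask_shadow &= ~bits;          /* cached copy, no readl() */
        writel(nic->irq_mask_shadow, nic->regs + MYNIC_IMR);
        spin_unlock_irqrestore(&nic->irq_lock, flags);
}

static void mynic_irq_unmask(struct mynic *nic, u32 bits)
{
        unsigned long flags;

        spin_lock_irqsave(&nic->irq_lock, flags);
        nic->irq_mask_shadow |= bits;
        writel(nic->irq_mask_shadow, nic->regs + MYNIC_IMR);
        spin_unlock_irqrestore(&nic->irq_lock, flags);
}

Only posted writes hit the device; the current mask state is always
available from nic->irq_mask_shadow without an MMIO read.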
cheers,
jamal
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 14:16 RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates James Chapman
` (2 preceding siblings ...)
2007-09-07 3:55 ` Mandeep Singh Baines
@ 2007-09-08 16:32 ` Andi Kleen
2007-09-10 9:25 ` James Chapman
2007-09-12 15:12 ` David Miller
4 siblings, 1 reply; 29+ messages in thread
From: Andi Kleen @ 2007-09-08 16:32 UTC (permalink / raw)
To: James Chapman; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
James Chapman <jchapman@katalix.com> writes:
>
> Clearly, keeping a device in polled mode for 1-2 jiffies
1-2 jiffies can be a long time on a HZ=100 kernel (20ms). A fast CPU
could do a lot of loops in this time, which would be a waste of power
and CPU time.
On some platforms the precise timers (like ktime_get()) can be slow,
but often they are fast. It might make sense to use a shorter,
constant-time wait on those with fast timers at least. Right now this
cannot be known by portable code, but there was a proposal some time
ago to export some global estimate of how fast ktime_get() et al. are.
That could be revisited.
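For illustration, slotting a ktime_get()-based bound into the
work_done == 0 branch of the e100 hunk above might look roughly like this;
hold_until and holding would be new fields in struct nic, and the 500us
constant is an arbitrary placeholder:

#define POLL_HOLD_NS    (500 * NSEC_PER_USEC)   /* example value only */

        if (work_done == 0) {
                ktime_t now = ktime_get();

                if (!nic->holding) {
                        /* start the hold window */
                        nic->holding = 1;
                        nic->hold_until = ktime_add_ns(now, POLL_HOLD_NS);
                } else if (ktime_to_ns(ktime_sub(now, nic->hold_until)) > 0) {
                        /* no work for the whole window: really complete */
                        nic->holding = 0;
                        netif_rx_complete(netdev, napi);
                        e100_enable_irq(nic);
                }
                return 0;
        }
        nic->holding = 0;       /* did some work: restart the window later */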
-Andi
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-08 16:32 ` Andi Kleen
@ 2007-09-10 9:25 ` James Chapman
0 siblings, 0 replies; 29+ messages in thread
From: James Chapman @ 2007-09-10 9:25 UTC (permalink / raw)
To: Andi Kleen; +Cc: netdev, hadi, davem, jeff, mandeep.baines, ossthema
Andi Kleen wrote:
> James Chapman <jchapman@katalix.com> writes:
> On some platforms the precise timers (like ktime_get()) can be slow,
> but often they are fast. It might make sense to use a shorter
> constant time wait on those with fast timers at least. Right now this
> cannot be known by portable code, but there was a proposal some time
> ago to export some global estimate to tell how fast
> ktime_get().et.al. are. That could be reviewed.
Interesting. Is ktime_get() fast enough on P4 systems? I'll be using
those to test with.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-06 14:16 RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates James Chapman
` (3 preceding siblings ...)
2007-09-08 16:32 ` Andi Kleen
@ 2007-09-12 15:12 ` David Miller
2007-09-12 16:39 ` James Chapman
4 siblings, 1 reply; 29+ messages in thread
From: David Miller @ 2007-09-12 15:12 UTC (permalink / raw)
To: jchapman; +Cc: netdev, hadi, jeff, mandeep.baines, ossthema
From: James Chapman <jchapman@katalix.com>
Date: Thu, 6 Sep 2007 15:16:00 +0100
> First, do we need to encourage consistency in NAPI poll drivers? A
> survey of current NAPI drivers shows different strategies being used
> in their poll(). Some such as r8169 do the napi_complete() if poll()
> does less work than their allowed budget. Others such as e100 and
> tg3 do napi_complete() only if they do no work at all.
Actually, I want to clarify this situation. In reality these
drivers are more consistent than different.
For some chips the cheapest way to figure out if there is more
RX work is simply to see if the amount of work processed is
less than "budget". It's too expensive to recheck the hardware.
On some chips like tg3, it's extremely cheap to see if new work
arrived between the completion of processing the RX queue and
the NAPI completion check, so they do it.
But logically they are testing exactly the same high-level thing.
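In sketch form (the mynic_* helpers are invented), the two shapes are:

/* (a) budget-based: rechecking the hardware is expensive, so treat
 * "cleaned less than budget" as "ring is empty".
 */
static int mynic_poll_a(struct napi_struct *napi, int budget)
{
        struct mynic *nic = container_of(napi, struct mynic, napi);
        int work_done = mynic_rx_clean(nic, budget);

        if (work_done < budget) {
                netif_rx_complete(nic->netdev, napi);
                mynic_enable_irq(nic);
        }
        return work_done;
}

/* (b) recheck-based: a fresh has_work() test is nearly free (tg3-style
 * status block in host memory), so consult it directly.
 */
static int mynic_poll_b(struct napi_struct *napi, int budget)
{
        struct mynic *nic = container_of(napi, struct mynic, napi);
        int work_done = mynic_rx_clean(nic, budget);

        if (!mynic_has_work(nic)) {
                netif_rx_complete(nic->netdev, napi);
                mynic_enable_irq(nic);
        }
        return work_done;
}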
* Re: RFC: possible NAPI improvements to reduce interrupt rates for low traffic rates
2007-09-12 15:12 ` David Miller
@ 2007-09-12 16:39 ` James Chapman
0 siblings, 0 replies; 29+ messages in thread
From: James Chapman @ 2007-09-12 16:39 UTC (permalink / raw)
To: David Miller; +Cc: netdev, hadi, jeff, mandeep.baines, ossthema
David Miller wrote:
> From: James Chapman <jchapman@katalix.com>
> Date: Thu, 6 Sep 2007 15:16:00 +0100
>
>> First, do we need to encourage consistency in NAPI poll drivers? A
>> survey of current NAPI drivers shows different strategies being used
>> in their poll(). Some such as r8169 do the napi_complete() if poll()
>> does less work than their allowed budget. Others such as e100 and
>> tg3 do napi_complete() only if they do no work at all.
>
> Actually, I want to clarify this situation. In reality these
> drivers are more consistent than different.
>
> For some chips the cheapest way to figure out if there is more
> RX work is simply to see if the amount of work processed is
> less than "budget". It's too expensive to recheck the hardware.
>
> On some chips like tg3, it's extremely cheap to see if new work
> arrived between the completion of processing the RX queue and
> the NAPI completion check, so they do it.
The inconsistencies I see are in the conditions under which the driver
chooses to exit polled mode, i.e. doing no work in the poll() versus
doing less than budget, and whether txdone processing is done in the
poll() or in the interrupt handler. I didn't mean to suggest that
rechecking for more work just before doing the napi_complete() was an
example of inconsistency.
The rest of the RFC talks about polling the device while it might be
idle. The overhead of checking for work varies for each system / device
as you say. Where it is expensive, the driver could optimize that case.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development