* NAPI poll behavior in various Intel drivers
@ 2008-01-04 11:40 David Miller
2008-01-04 20:10 ` James Chapman
2008-01-07 8:24 ` Jarek Poplawski
0 siblings, 2 replies; 8+ messages in thread
From: David Miller @ 2008-01-04 11:40 UTC (permalink / raw)
To: netdev; +Cc: auke-jan.h.kok
Several Intel networking drivers such as e1000, e1000e
and e100 all do this to exit NAPI polling:
if ((!tx_cleaned && (work_done == 0)) ||
    !netif_running(poll_dev)) {
I tried to make this use in the NAPI rework:
if ((!tx_cleaned && (work_done < budget)) ||
    !netif_running(poll_dev)) {
But that got reverted by:
commit f7bbb9098315d712351aba7861a8c9fcf6bf0213
e1000: Fix NAPI state bug when Rx complete
Don't exit polling when we have not yet used our budget, this causes
the NAPI system to end up with a messed up poll list.
Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
Signed-off-by: Jeff Garzik <jeff@garzik.org>
I definitely would not have signed off on that :-)
That "tx_cleaned" thing clouds the logic in all of these drivers'
poll routines.
The one necessary precondition is that when work_done < budget
we exit polling and return a value less than budget.
If the ->poll() returns a value less than budget, net_rx_action()
assumes that the device has been removed from the poll() list.
/* Drivers must not modify the NAPI state if they
 * consume the entire weight. In such cases this code
 * still "owns" the NAPI instance and therefore can
 * move the instance around on the list at-will.
 */
if (unlikely(work == weight))
        list_move_tail(&n->poll_list, list);
This "work_done == 0" test in these drivers is thus wrong. It
should be "work_done < budget", and the whole tx_cleaned thing needs
to be removed.
It happens to work only because we loop around and process the
same NAPI struct again.
As a result, E1000 devices get polled TWICE every time they
process at least one RX packet, but do not consume the whole
quota.
I smell a performance hack, and if so this is wrong and against
all of the principles of NAPI. Either that or it's a workaround
for the "!netif_running()" case.
I noticed this while trying to work on a generic fix for the
"->poll() does not exit when device is brought down while being
bombed with packets" bug.
* Re: NAPI poll behavior in various Intel drivers
2008-01-04 11:40 NAPI poll behavior in various Intel drivers David Miller
@ 2008-01-04 20:10 ` James Chapman
2008-01-04 21:24 ` David Miller
2008-01-07 8:24 ` Jarek Poplawski
1 sibling, 1 reply; 8+ messages in thread
From: James Chapman @ 2008-01-04 20:10 UTC (permalink / raw)
To: David Miller; +Cc: netdev, auke-jan.h.kok
David Miller wrote:
> Several Intel networking drivers such as e1000, e1000e
> and e100 all do this to exit NAPI polling:
>
> if ((!tx_cleaned && (work_done == 0)) ||
> !netif_running(poll_dev)) {
>
> I tried to make this use in the NAPI rework:
>
> if ((!tx_cleaned && (work_done < budget)) ||
> !netif_running(poll_dev)) {
>
> But that got reverted by:
>
> commit f7bbb9098315d712351aba7861a8c9fcf6bf0213
>
> e1000: Fix NAPI state bug when Rx complete
>
> Don't exit polling when we have not yet used our budget, this causes
> the NAPI system to end up with a messed up poll list.
>
> Signed-off-by: Auke Kok <auke-jan.h.kok@intel.com>
> Signed-off-by: Jeff Garzik <jeff@garzik.org>
>
> I definitely would not have signed off on that :-)
>
> That "tx_cleaned" thing clouds the logic in all of these driver's
> poll routines.
>
> The one necessary precondition is that when work_done < budget
> we exit polling and return a value less than budget.
>
> If the ->poll() returns a value less than budget, net_rx_action()
> assumes that the device has been removed from the poll() list.
>
> /* Drivers must not modify the NAPI state if they
> * consume the entire weight. In such cases this code
> * still "owns" the NAPI instance and therefore can
> * move the instance around on the list at-will.
> */
> if (unlikely(work == weight))
> list_move_tail(&n->poll_list, list);
>
> This "work_done == 0" test in these drivers, is thus, wrong. It
> should be "work_done < budget" and the whole tx_cleaned thing needs to
> be removed.
>
> It happens to work, because what happens is that we loop again and
> process the same NAPI struct again.
>
> As a result, E1000 devices get polled TWICE every time they
> process at least one RX packet, but do not consume the whole
> quota.
>
> I smell a performance hack, and if so this is wrong and against
> all of the principles of NAPI. Either that or it's a workaround
> for the "!netif_running()" case.
You have a good nose, Dave. :) And you can probably partly blame me for
this scheme. I worked with the e100 driver author (Scott Feldman) 5
years ago to performance tune and stress test the driver. I found much
better packets/sec forwarding performance (packets in one e100, out
another) by staying in polled mode until no receive or transmit was done.
With the latest NAPI, this code has to change. But rather than remove
the tx_cleaned logic completely, shouldn't transmit processing be
included in the work_done accounting when a driver does transmit cleanup
processing in the poll? If not, then when an interface is transmitting
far more than receiving, it will exit polled mode on every poll.
> I noticed this while trying to work on a generic fix for the
> "->poll() does not exit when device is brought down while being
> bombed with packets" bug.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: NAPI poll behavior in various Intel drivers
2008-01-04 20:10 ` James Chapman
@ 2008-01-04 21:24 ` David Miller
2008-01-05 0:18 ` James Chapman
0 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2008-01-04 21:24 UTC (permalink / raw)
To: jchapman; +Cc: netdev, auke-jan.h.kok
From: James Chapman <jchapman@katalix.com>
Date: Fri, 04 Jan 2008 20:10:30 +0000
> With the latest NAPI, this code has to change. But rather than remove
> the tx_cleaned logic completely, shouldn't transmit processing be
> included in the work_done accounting when a driver does transmit cleanup
> processing in the poll?
Most other NAPI drivers don't do this; they just process all the
pending TX work unconditionally and do not account it in the NAPI
poll work.
The logic is that, like link state handling, TX work is very cheap and
not the cpu cycle eater that RX packet processing is.
* Re: NAPI poll behavior in various Intel drivers
2008-01-04 21:24 ` David Miller
@ 2008-01-05 0:18 ` James Chapman
2008-01-05 7:25 ` David Miller
0 siblings, 1 reply; 8+ messages in thread
From: James Chapman @ 2008-01-05 0:18 UTC (permalink / raw)
To: David Miller; +Cc: netdev, auke-jan.h.kok
David Miller wrote:
> From: James Chapman <jchapman@katalix.com>
> Date: Fri, 04 Jan 2008 20:10:30 +0000
>
>> With the latest NAPI, this code has to change. But rather than remove
>> the tx_cleaned logic completely, shouldn't transmit processing be
>> included in the work_done accounting when a driver does transmit cleanup
>> processing in the poll?
>
> Most other NAPI drivers don't do this, they just process all the
> pending TX work unconditionally and do not account it into the NAPI
> poll work.
This will cause the interface to thrash in/out of polled mode very
quickly when it is doing almost all transmit work. That's something to
avoid, no?
> The logic is that, like link state handling, TX work is very cheap and
> not the cpu cycle eater that RX packet processing is.
The processing is cheap, but it is done in the poll, so it is
scheduled by NAPI. The event rate for TX can be just as high as for
RX. For link state handling, the event rate is always low, so it won't
cause NAPI to schedule rapidly.
--
James Chapman
Katalix Systems Ltd
http://www.katalix.com
Catalysts for your Embedded Linux software development
* Re: NAPI poll behavior in various Intel drivers
2008-01-05 0:18 ` James Chapman
@ 2008-01-05 7:25 ` David Miller
2008-01-05 13:29 ` Andi Kleen
0 siblings, 1 reply; 8+ messages in thread
From: David Miller @ 2008-01-05 7:25 UTC (permalink / raw)
To: jchapman; +Cc: netdev, auke-jan.h.kok
From: James Chapman <jchapman@katalix.com>
Date: Sat, 05 Jan 2008 00:18:31 +0000
> David Miller wrote:
> > From: James Chapman <jchapman@katalix.com>
> > Date: Fri, 04 Jan 2008 20:10:30 +0000
> >
> >> With the latest NAPI, this code has to change. But rather than remove
> >> the tx_cleaned logic completely, shouldn't transmit processing be
> >> included in the work_done accounting when a driver does transmit cleanup
> >> processing in the poll?
> >
> > Most other NAPI drivers don't do this, they just process all the
> > pending TX work unconditionally and do not account it into the NAPI
> > poll work.
>
> This will cause the interface to thrash in/out of polled mode very
> quickly when it is doing almost all transmit work. That's something to
> avoid, no?
I see your point although I've never seen this in practice
with tg3 or niu.
For a heavy transmit load with TCP, the cpu killer is the ACK receive
processing.
I guess for a datagram send situation it might start to edge up.
However, you're likely deferring TX completion interrupts (say, to
every 1/4 of the TX ring or similar), which should make the effect
much less pronounced.
But anyway, it's someone else's driver, so if they want to take
TX work into account, sure. :-) I'll keep this in mind in my NAPI
changes.
Thanks.
* Re: NAPI poll behavior in various Intel drivers
2008-01-05 7:25 ` David Miller
@ 2008-01-05 13:29 ` Andi Kleen
2008-01-06 4:15 ` David Miller
0 siblings, 1 reply; 8+ messages in thread
From: Andi Kleen @ 2008-01-05 13:29 UTC (permalink / raw)
To: David Miller; +Cc: jchapman, netdev, auke-jan.h.kok
David Miller <davem@davemloft.net> writes:
> From: James Chapman <jchapman@katalix.com>
> Date: Sat, 05 Jan 2008 00:18:31 +0000
>
>> David Miller wrote:
>> > From: James Chapman <jchapman@katalix.com>
>> > Date: Fri, 04 Jan 2008 20:10:30 +0000
>> >
>> >> With the latest NAPI, this code has to change. But rather than remove
>> >> the tx_cleaned logic completely, shouldn't transmit processing be
>> >> included in the work_done accounting when a driver does transmit cleanup
>> >> processing in the poll?
>> >
>> > Most other NAPI drivers don't do this, they just process all the
>> > pending TX work unconditionally and do not account it into the NAPI
>> > poll work.
>>
>> This will cause the interface to thrash in/out of polled mode very
>> quickly when it is doing almost all transmit work. That's something to
>> avoid, no?
>
> I see your point although I've never seen this in practice
> with tg3 or niu.
In 2.4 we used to have (I haven't checked recently) performance
regressions with NAPI versus non-NAPI (or versus the old BCM vendor
driver) on tg3 for some workloads that didn't fully fill the link. The
theory was always that the cause was this kind of regular switching in
and out of polled mode. So I think we saw that problem on tg3 too.
-Andi
* Re: NAPI poll behavior in various Intel drivers
2008-01-05 13:29 ` Andi Kleen
@ 2008-01-06 4:15 ` David Miller
0 siblings, 0 replies; 8+ messages in thread
From: David Miller @ 2008-01-06 4:15 UTC (permalink / raw)
To: andi; +Cc: jchapman, netdev, auke-jan.h.kok
From: Andi Kleen <andi@firstfloor.org>
Date: Sat, 05 Jan 2008 14:29:05 +0100
> In 2.4 we used to have (haven't checked recently) performance regressions
> with NAPI vs non NAPI (or versus the old BCM vendor driver) on tg3 for
> some workloads that didn't fully fill the link. The theory was always
> that the reason for that was something like the regular switching in
> and out. So I think we saw that problem on tg3 too.
It was because we originally didn't program the HW interrupt
mitigation settings at all when using NAPI, now we do and the problem
is long gone.
* Re: NAPI poll behavior in various Intel drivers
2008-01-04 11:40 NAPI poll behavior in various Intel drivers David Miller
2008-01-04 20:10 ` James Chapman
@ 2008-01-07 8:24 ` Jarek Poplawski
1 sibling, 0 replies; 8+ messages in thread
From: Jarek Poplawski @ 2008-01-07 8:24 UTC (permalink / raw)
To: David Miller; +Cc: netdev, auke-jan.h.kok
On 04-01-2008 12:40, David Miller wrote:
...
> That "tx_cleaned" thing clouds the logic in all of these driver's
> poll routines.
>
> The one necessary precondition is that when work_done < budget
> we exit polling and return a value less than budget.
>
> If the ->poll() returns a value less than budget, net_rx_action()
> assumes that the device has been removed from the poll() list.
>
> /* Drivers must not modify the NAPI state if they
> * consume the entire weight. In such cases this code
> * still "owns" the NAPI instance and therefore can
> * move the instance around on the list at-will.
> */
> if (unlikely(work == weight))
> list_move_tail(&n->poll_list, list);
>
Probably it's because of this clouded logic, but IMHO "Drivers must
not modify the NAPI state if..." doesn't imply that drivers must
modify the NAPI state otherwise. (But it seems I must have missed the
real reason this is quoted here.)
Regards,
Jarek P.