* [WIP][DOC] Net tx batching
@ 2007-06-11 13:52 jamal
  2007-06-11 23:09 ` jamal
  0 siblings, 1 reply; 4+ messages in thread
From: jamal @ 2007-06-11 13:52 UTC (permalink / raw)
  To: NetDev
  Cc: Krishna Kumar2, Evgeniy Polyakov, Gagan Arneja, Leonid Grossman,
	Sridhar Samudrala, Rick Jones, Robert Olsson, David Miller

[-- Attachment #1: Type: text/plain, Size: 425 bytes --]


I have started writing a small howto for drivers, hoping to get
wider testing with more drivers.
So far I have changed the e1000 and tun drivers, as well as modified
the pktgen tool to do batching.

I will update this document as needed if something is unclear.
Please contribute by asking questions, changing a driver, and testing
widely. I may target tg3 next and write a tool to do testing from the
UDP level.

cheers,
jamal


[-- Attachment #2: batch-driver-howto.txt --]
[-- Type: text/plain, Size: 4715 bytes --]


Here's the beginning of a howto for driver authors.
The current working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

The intended audience for this howto is people already
familiar with netdevices.

0) Hardware Pre-requisites:
---------------------------

You must have hardware that is at least capable of doing
DMA with many descriptors; i.e. hardware with a queue
length of 3 (as in some fscked ethernet hardware) is not
very useful in this case.

1) What is new in the driver API:
---------------------------------

a) A new method, invoked on the driver by the net tx core, to
batch packets. This method, dev->hard_batch_xmit(dev),
is no different from dev->hard_start_xmit(dev) in terms of the
arguments it takes. You just have to handle it differently
(more below).

b) A new method, dev->hard_prep_xmit(), invoked on the driver to
massage the packet before it gets transmitted.
This method is optional, i.e. if you don't specify it, it will
not be invoked (more below).

c) A new variable, dev->xmit_win, which gives the core calling into
the driver a rough estimate of how many packets can be batched onto
the driver. A sketch of the combined additions follows.
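
Putting a), b) and c) together, here is a sketch of what this adds to
struct net_device. The layout is illustrative and the method
signatures are assumptions inferred from the examples below, so treat
the tree as authoritative:

---
struct net_device {
	...
	/* NETIF_F_BTX in dev->features marks the device batching-capable */
	struct sk_buff_head	*blist;		/* batch list the core fills */
	int			xmit_win;	/* driver's advertised tx window */
	int	(*hard_start_xmit)(struct sk_buff *skb, struct net_device *dev);
	int	(*hard_batch_xmit)(struct net_device *dev);	/* new */
	int	(*hard_prep_xmit)(struct sk_buff *skb,
				  struct net_device *dev);	/* new, optional */
	...
};
---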

2) Driver pre-requisite
------------------------

The typical driver tx state machine is:

----
--> +Core sends packets
    +--> Driver puts packet onto hardware queue
    +    if hardware queue is full, netif_stop_queue(dev)
    +
--> +core stops sending because of netif_stop_queue(dev)
..
.. time passes
..
..
--> +---> driver has transmitted packets, opens up tx path by
          invoking netif_wake_queue(dev)
--> +Core sends packets, and the cycle repeats.
----

The pre-requisite for the batching changes is that the driver should
provide a low threshold at which to re-open the tx path.
This is a very important requirement in making batching useful.
Drivers such as tg3 and e1000 already do this.
So in the above annotation, as a driver author, before you
invoke netif_wake_queue(dev), you check whether there are enough
entries left.

Here's an example of how I added it to the tun driver:
---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it (ignore the setting of
netdev->xmit_win; more on this later):

--
	if (netif_queue_stopped(netdev)) {
		int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);

		netdev->xmit_win = rspace;
		netif_wake_queue(netdev);
	}
---

In tg3, the code looks like:

-----
	if (netif_queue_stopped(tp->dev) &&
		(tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
			netif_wake_queue(tp->dev);
---

3) Driver Setup:
----------------

a) On initialization (before netdev registration):
 i) set NETIF_F_BTX in dev->features,
  i.e. dev->features |= NETIF_F_BTX.
  This makes the core do the proper initialization.

 ii) set dev->xmit_win to something reasonable, like
  maybe half the tx DMA ring size.
  This is later used by the core to guess how many packets to send
  in one batch.

b) Set proper pointers to the two new methods described above (a setup
sketch follows).
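
A minimal sketch of such an initialization; the probe function and
XXX_TX_RING_SIZE are hypothetical driver names, and only NETIF_F_BTX,
xmit_win and the two method names come from this patchset:

---
static int xxx_probe(struct net_device *dev)
{
	/* mark the device batching-capable so the core sets up dev->blist */
	dev->features |= NETIF_F_BTX;

	/* start with a conservative window, e.g. half the tx DMA ring */
	dev->xmit_win = XXX_TX_RING_SIZE / 2;

	/* hook up the two new methods */
	dev->hard_batch_xmit = xxx_net_bxmit;
	dev->hard_prep_xmit = xxx_prep_xmit;	/* optional */

	return register_netdev(dev);
}
---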


4) The new methods
------------------

  a) The batching method

Here's an example of a batch tx routine similar to the one I added
to the tun driver:

----
static int xxx_net_bxmit(struct net_device *dev)
{
	struct sk_buff *skb;
	int enqueued = 0;
	....

	/* xxx_ring_full(), xxx_queue_on_ring() and xxx_kick_dma() are
	 * placeholders for your hardware-specific helpers */
	while (!xxx_ring_full(dev) &&
	       (skb = __skb_dequeue(dev->blist)) != NULL) {
		xxx_queue_on_ring(dev, skb);
		enqueued++;
	}

	if (xxx_ring_full(dev)) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
	}

	/* if we queued on hardware, tell it to chew */
	if (enqueued)
		xxx_kick_dma(dev);
	....
	return NETDEV_TX_OK;
}
------

All return codes like NETDEV_TX_OK etc. still apply.
In this method, if there are any IO operations that apply to a
set of packets (such as kicking the DMA), leave them to the end and
apply them once if you have successfully enqueued. For an example of
this, look at the e1000 driver's e1000_kick_DMA() function.

b) The dev->hard_prep_xmit() method

Use this method to do only the pre-processing of the skb passed, i.e.
whatever your current dev->hard_start_xmit() does to a packet before
any locks are held (e.g. formatting it to be put in a descriptor,
etc.).
Look at e1000_prep_queue_frame() for an example.
You may use skb->cb to store any state that you need to know
later when batching.

5) setting the dev->xmit_win
-----------------------------

As mentioned earlier, this variable provides hints on how much
data to send from the core to the driver. Some suggestions:
a) on doing a netif_stop_queue, set it to 1
b) on netif_wake_queue, set it to the max available space

cheers,
jamal


* Re: [WIP][DOC] Net tx batching
  2007-06-11 13:52 [WIP][DOC] Net tx batching jamal
@ 2007-06-11 23:09 ` jamal
  2007-08-08 12:02   ` [DOC] Net tx batching driver howto jamal
  0 siblings, 1 reply; 4+ messages in thread
From: jamal @ 2007-06-11 23:09 UTC (permalink / raw)
  To: NetDev
  Cc: Krishna Kumar2, Evgeniy Polyakov, Jeff Garzik, Gagan Arneja,
	Leonid Grossman, Sridhar Samudrala, Rick Jones, Robert Olsson,
	David Miller

[-- Attachment #1: Type: text/plain, Size: 532 bytes --]

A small update on the e1000 ....

On Mon, 2007-06-11 at 09:52 -0400, jamal wrote:
> I have started writing a small howto for drivers, hoping to get
> wider testing with more drivers.
> So far I have changed the e1000 and tun drivers, as well as modified
> the pktgen tool to do batching.
> 
> I will update this document as needed if something is unclear.
> Please contribute by asking questions, changing a driver, and testing
> widely. I may target tg3 next and write a tool to do testing from the
> UDP level.
> 
> cheers,
> jamal
> 

[-- Attachment #2: batch-driver-howto.txt --]
[-- Type: text/plain, Size: 5005 bytes --]


Here's the beginning of a howto for driver authors.
The current working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

The intended audience for this howto is people already
familiar with netdevices.

0) Hardware Pre-requisites:
---------------------------

You must have hardware that is at least capable of doing
DMA with many descriptors; i.e. hardware with a queue
length of 3 (as in some fscked ethernet hardware) is not
very useful in this case.

1) What is new in the driver API:
---------------------------------

a) A new method, invoked on the driver by the net tx core, to
batch packets. This method, dev->hard_batch_xmit(dev),
is no different from dev->hard_start_xmit(dev) in terms of the
arguments it takes. You just have to handle it differently
(more below).

b) A new method, dev->hard_prep_xmit(), invoked on the driver to
massage the packet before it gets transmitted.
This method is optional, i.e. if you don't specify it, it will
not be invoked (more below).

c) A new variable, dev->xmit_win, which gives the core calling into
the driver a rough estimate of how many packets can be batched onto
the driver.

2) Driver pre-requisite
------------------------

The typical driver tx state machine is:

----
--> +Core sends packets
    +--> Driver puts packet onto hardware queue
    +    if hardware queue is full, netif_stop_queue(dev)
    +
--> +core stops sending because of netif_stop_queue(dev)
..
.. time passes
..
..
--> +---> driver has transmitted packets, opens up tx path by
          invoking netif_wake_queue(dev)
--> +Core sends packets, and the cycle repeats.
----

The pre-requisite for the batching changes is that the driver should
provide a low threshold at which to re-open the tx path.
This is a very important requirement in making batching useful.
Drivers such as tg3 and e1000 already do this.
So in the above annotation, as a driver author, before you
invoke netif_wake_queue(dev), you check whether there are enough
entries left.

Here's an example of how I added it to the tun driver:
---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it (ignore the setting of
netdev->xmit_win; more on this later):

--
	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

		if (netif_queue_stopped(netdev)) {
			int rspace = E1000_DESC_UNUSED(tx_ring) -
				     (MAX_SKB_FRAGS + 2);

			netdev->xmit_win = rspace;
			netif_wake_queue(netdev);
		}
	}
---

In tg3, the code looks like:

-----
	if (netif_queue_stopped(tp->dev) &&
		(tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
			netif_wake_queue(tp->dev);
---

3) Driver Setup:
-------------------

a) On initialization (before netdev registration):
 i) set NETIF_F_BTX in dev->features,
  i.e. dev->features |= NETIF_F_BTX.
  This makes the core do the proper initialization.

 ii) set dev->xmit_win to something reasonable, like
  maybe half the tx DMA ring size.
  This is later used by the core to guess how many packets to send
  in one batch.

b) Set proper pointers to the two new methods described above.


4) The new methods
--------------------

  a) The batching method
  
Here's an example of a batch tx routine similar to the one I added
to the tun driver:

----
static int xxx_net_bxmit(struct net_device *dev)
{
	struct sk_buff *skb;
	int enqueued = 0;
	....

	/* xxx_ring_full(), xxx_queue_on_ring() and xxx_kick_dma() are
	 * placeholders for your hardware-specific helpers */
	while (!xxx_ring_full(dev) &&
	       (skb = __skb_dequeue(dev->blist)) != NULL) {
		xxx_queue_on_ring(dev, skb);
		enqueued++;
	}

	if (xxx_ring_full(dev)) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
	}

	/* if we queued on hardware, tell it to chew */
	if (enqueued)
		xxx_kick_dma(dev);
	....
	return NETDEV_TX_OK;
}
------

All return codes like NETDEV_TX_OK etc. still apply.
In this method, if there are any IO operations that apply to a
set of packets (such as kicking the DMA), leave them to the end and
apply them once if you have successfully enqueued. For an example of
this, look at the e1000 driver's e1000_kick_DMA() function.

b) The dev->hard_prep_xmit() method

Use this method to do only the pre-processing of the skb passed, i.e.
whatever your current dev->hard_start_xmit() does to a packet before
any locks are held (e.g. formatting it to be put in a descriptor,
etc.).
Look at e1000_prep_queue_frame() for an example.
You may use skb->cb to store any state that you need to know
later when batching.

5) setting the dev->xmit_win 
-----------------------------

As mentioned earlier, this variable provides hints on how much
data to send from the core to the driver. Some suggestions:
a) on doing a netif_stop_queue, set it to 1
b) on netif_wake_queue, set it to the max available space


Appendix 1: History
-------------------
June 11: Initial revision
June 11: Fixed typo on e1000 netif_wake description ..



* [DOC] Net tx batching driver howto
  2007-06-11 23:09 ` jamal
@ 2007-08-08 12:02   ` jamal
  2007-08-08 13:03     ` [DOC] Net tx batching core evolution jamal
  0 siblings, 1 reply; 4+ messages in thread
From: jamal @ 2007-08-08 12:02 UTC (permalink / raw)
  To: netdev
  Cc: Krishna Kumar2, Evgeniy Polyakov, Jeff Garzik, Gagan Arneja,
	Leonid Grossman, Sridhar Samudrala, Rick Jones, Robert Olsson,
	David Miller, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 163 bytes --]


This is a small update to the howto.

The current working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

cheers,
jamal

[-- Attachment #2: batch-driver-howto.txt --]
[-- Type: text/plain, Size: 5671 bytes --]


Here's the beginning of a howto for driver authors.
The current working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

The intended audience for this howto is people already
familiar with netdevices.

0) Hardware Pre-requisites:
---------------------------

You must have hardware that is at least capable of doing
DMA with many descriptors; i.e. hardware with a queue
length of 3 (as in some fscked ethernet hardware) is not
very useful in this case.

1) What is new in the driver API:
---------------------------------

a) A new method, invoked on the driver by the net tx core, to
batch packets. This method, dev->hard_batch_xmit(dev),
is no different from dev->hard_start_xmit(dev) in terms of the
arguments it takes. You just have to handle it differently
(more below).

b) A new method, dev->hard_prep_xmit(), invoked on the driver to
massage the packet before it gets transmitted.
This method is optional, i.e. if you don't specify it, it will
not be invoked (more below).

c) A new variable, dev->xmit_win, which gives the core calling into
the driver a rough estimate of how many packets can be batched onto
the driver.

2) Driver pre-requisite
------------------------

The typical driver tx state machine is:

----
--> +Core sends packets
    +--> Driver puts packet onto hardware queue
    +    if hardware queue is full, netif_stop_queue(dev)
    +
--> +core stops sending because of netif_stop_queue(dev)
..
.. time passes
..
..
--> +---> driver has transmitted packets, opens up tx path by
          invoking netif_wake_queue(dev)
--> +Core sends packets, and the cycle repeats.
----

The pre-requisite for the batching changes is that the driver should
provide a low threshold at which to re-open the tx path.
This is a very important requirement in making batching useful.
Drivers such as tg3 and e1000 already do this.
So in the above annotation, as a driver author, before you
invoke netif_wake_queue(dev), you check whether there are enough
entries left.

Here's an example of how I added it to the tun driver:
---
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
..
..
	u32 t = skb_queue_len(&tun->readq);
	if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
		tun->dev->xmit_win = tun->dev->tx_queue_len;
		netif_wake_queue(tun->dev);
	}
---

Here's how the batching e1000 driver does it (ignore the setting of
netdev->xmit_win; more on this later):

--
	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {

		if (netif_queue_stopped(netdev)) {
			int rspace = E1000_DESC_UNUSED(tx_ring) -
				     (MAX_SKB_FRAGS + 2);

			netdev->xmit_win = rspace;
			netif_wake_queue(netdev);
		}
	}
---

In tg3, the code looks like:

-----
	if (netif_queue_stopped(tp->dev) &&
		(tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
			netif_wake_queue(tp->dev);
---

3) Driver Setup:
-------------------

a) On initialization (before netdev registration):
 i) set NETIF_F_BTX in dev->features,
  i.e. dev->features |= NETIF_F_BTX.
  This makes the core do the proper initialization.

 ii) set dev->xmit_win to something reasonable, like
  maybe half the tx DMA ring size.
  This is later used by the core to guess how many packets to send
  in one batch.

b) Set proper pointers to the two new methods described above.


4) The new methods
--------------------

  a) The batching method
  
Here's an example of a batch tx routine similar to the one I added
to the tun driver:

----
static int xxx_net_bxmit(struct net_device *dev)
{
	struct sk_buff *skb;
	int enqueued = 0;
	....

	/* xxx_ring_full(), xxx_queue_on_ring() and xxx_kick_dma() are
	 * placeholders for your hardware-specific helpers */
	while (!xxx_ring_full(dev) &&
	       (skb = __skb_dequeue(dev->blist)) != NULL) {
		xxx_queue_on_ring(dev, skb);
		enqueued++;
	}

	if (xxx_ring_full(dev)) {
		netif_stop_queue(dev);
		dev->xmit_win = 1;
	}

	/* if we queued on hardware, tell it to chew */
	if (enqueued)
		xxx_kick_dma(dev);
	....
	return NETDEV_TX_OK;
}
------

All return codes like NETDEV_TX_OK etc. still apply.
In this method, if there are any IO operations that apply to a
set of packets (such as kicking the DMA), leave them to the end and
apply them once if you have successfully enqueued. For an example of
this, look at the e1000 driver's e1000_kick_DMA() function.

b) The dev->hard_prep_xmit() method

The benefits of this method are described in a separate document.

Use this method to do only the pre-processing of the skb passed, i.e.
whatever your current dev->hard_start_xmit() does to a packet before
any locks are held (e.g. formatting it to be put in a descriptor,
etc.).
Look at e1000_prep_queue_frame() for an example.
You may use skb->cb to store any state that you need to know
later when batching.
PS: I have found, while discussing with Michael Chan and Matt Carlson,
that skb->cb[0] is used by the VLAN code to pass VLAN info to the
driver. I think this is a violation of the usage of the cb scratch
pad. To work around this, you could start at skb->cb[8], or do what
the Broadcom tg3 batching driver does, which is to glean the VLAN
info first and then re-use skb->cb. A sketch of the workaround follows.
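
As an illustration, here is a minimal sketch of a hard_prep_xmit()
method that follows the cb[8] workaround above; the state struct, the
helper macro and the (skb, dev) signature are assumptions of this
sketch, not APIs from the patchset:

---
/* hypothetical per-packet tx state, stashed past the VLAN-used bytes */
struct xxx_tx_cbstate {
	u16	vlan_tag;
	u8	desc_count;
};
#define XXX_TX_CB(skb) ((struct xxx_tx_cbstate *)&((skb)->cb[8]))

static int xxx_prep_xmit(struct sk_buff *skb, struct net_device *dev)
{
	struct xxx_tx_cbstate *cb = XXX_TX_CB(skb);

	/* do the lock-free massaging once, outside the tx lock */
	cb->desc_count = skb_shinfo(skb)->nr_frags + 1;
	cb->vlan_tag = 0;	/* glean VLAN info here if present */

	return 0;
}
---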

5) setting the dev->xmit_win 
-----------------------------

As mentioned earlier, this variable provides hints on how much
data to send from the core to the driver. Some suggestions:
a) on doing a netif_stop_queue, set it to 1
b) on netif_wake_queue, set it to the max available space

The variable is important because it keeps the core from sending
more than the driver can handle, therefore avoiding any need to
muck with packet scheduling mechanisms. A sketch of the idea follows.
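
As a sketch of the idea (assumed and simplified, not the actual core
code): the core clamps how much it dequeues for one batch to the
driver's advertised window:

---
	/* dequeue at most dev->xmit_win packets for this batch */
	int room = dev->xmit_win;

	while (room-- > 0 && (skb = q->dequeue(q)) != NULL)
		__skb_queue_tail(dev->blist, skb);
---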

Appendix 1: History
-------------------
June 11: Initial revision
June 11: Fixed typo on e1000 netif_wake description ..
Aug  08: Added info on VLAN and the skb->cb[] danger ..



* [DOC] Net tx batching core evolution
  2007-08-08 12:02   ` [DOC] Net tx batching driver howto jamal
@ 2007-08-08 13:03     ` jamal
  0 siblings, 0 replies; 4+ messages in thread
From: jamal @ 2007-08-08 13:03 UTC (permalink / raw)
  To: netdev
  Cc: Krishna Kumar2, Evgeniy Polyakov, Jeff Garzik, Gagan Arneja,
	Leonid Grossman, Sridhar Samudrala, Rick Jones, Robert Olsson,
	David Miller, shemminger, kaber

[-- Attachment #1: Type: text/plain, Size: 260 bytes --]


The attached doc describes the evolution of the batching work that
leads towards the dev->hard_prep_xmit() API being introduced.

As usual the updated working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

cheers,
jamal

[-- Attachment #2: evolution-of-batch.txt --]
[-- Type: text/plain, Size: 5518 bytes --]


The code for this work can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

The purpose of this doc is not to describe the batching work, although
the benefits of that approach can be gleaned from it (for example,
sections 1.x should give the implicit value proposition).
There are more details on the rest of the batching work in the
driver howto posted on netdev, as well as in a couple of presentations
I have given in the past (refer to the last few slides of my netconf
2006 slides, for example).

The purpose of this doc is to describe the evolution of the work,
which leads to two important APIs for the driver writer.
The first is the introduction of a method called
dev->hard_prep_xmit(), and the second is the introduction of the
variable dev->xmit_win.

For the sake of clarity, I will be using non-LLTX because it is easier
to explain.
Note: the e1000, although LLTX, has shown considerable improvement
with this approach, as verified by experiments. (For a lot of
other reasons, e1000 needs to be converted to be non-LLTX.)

1.0 Classical approach 
------------------------

Let's start with the classical approach of what a typical driver does:

loop:
  1--core spin for qlock
  1a---dequeue packet
  1b---release qlock

  2--core spins for xmit_lock

    3--enter hardware xmit routine
    3a----format packet:
    --------e.g. vlan, mss, shinfo, csum, descriptor count, etc.
    -----------if there is something wrong free skb and return OK
    3b----do per chip specific checks
    -----------if there is something wrong free skb and return OK
    3c----all is good, now stash a packet into DMA ring.

  4-- release tx lock
  5-- if all ok (there are still packets, netif not stopped etc)
      continue, else break
end_loop:

In between #1 and #1b, another CPU contends for qlock
In between #3 and #4, another CPU contends for txlock
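
A minimal C sketch of this classical sequence (assumed and simplified;
2.6-era lock and qdisc names, one packet at a time):

	spin_lock(&dev->queue_lock);		/* #1: core spins for qlock */
	skb = q->dequeue(q);			/* 1a: dequeue packet */
	spin_unlock(&dev->queue_lock);		/* 1b: release qlock */

	netif_tx_lock(dev);			/* #2: core spins for xmit_lock */
	ret = dev->hard_start_xmit(skb, dev);	/* #3: format/check/stash */
	netif_tx_unlock(dev);			/* #4: release tx lock */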

1.1 Challenge to classical approach:
-------------------------------------

The cost of grabbing/setting up a lock is not cheap.
Another observation: spinning CPUs is expensive, because the
utilization of their compute cycles goes down.
We assume the cost of dequeueing from the qdisc is not as expensive
as enqueueing to the DMA ring.

1.2 Addressing the challenge to classical approach:
---------------------------------------------------

So we start with a simple premise to resolve the challenge above.
We try to amortize the cost by:
a) grabbing "as many" packets as we can between #1 and #1b, with very
little processing, so we don't hold that lock for long;
b) then sending "as many" as we can between #3 and #4.

Let's start with a simple approach (which is what the batch code did
in its earlier versions):

loop:
  1--core spin for qlock
    loop1: // "as many packets"
       1a---dequeue and enqueue on dev->blist
    end loop1:
  1b---release qlock

  2--core spins for xmit_lock

    3--enter hardware xmit routine
    loop2 ("for as many packets" or "no more ring space"):
       3a----format packet
       --------e.g. vlan, mss, shinfo, csum, descriptor count, etc.
       -----------if there is something wrong free skb and return OK (**)
       3b----do per chip specific checks
       -----------if there is something wrong free skb and return OK
       3c----all is good, now stash a packet into DMA ring.
    end_loop2:
       3d -- if you enqueued packets tell tx DMA to chew on them ...

  4-- release tx lock
  5-- if all ok (there are still packets, netif not stopped etc)
      continue;
end_loop

loop2 has the added side benefit of improving instruction cache
warmth, i.e. if we can do things in a loop, we keep the instruction
cache warm for more packets (and hence improve the hit rate and
CPU utilization). However, the length of this loop may affect the
hit rate (very clear to see on machines with bigger caches, with
qdisc_restart showing up in profiles).
Note also: we have amortized the cost of bus IO in step 3d,
because we make only one call after exiting the loop.

2.0 New challenge
-----------------

A new challenge is introduced:
we could hold the tx lock for a "long period", depending on how
unclean the driver path is; imagine holding this lock for 100
packets. So we need to find a balance.

We observe that, within loop2, #3a does not need to be done while
holding the xmit lock. All it does is muck with an skb, and it may
in fact end up dropping it.

2.1 Addressing new challenge
-----------------------------

To address the new challenge, I introduced the dev->hard_prep_xmit()
API. Essentially, dev->hard_prep_xmit() moves the packet formatting
into #1. So now we change the code to be something along the lines
of the following (a C sketch of this final form follows the loop):

loop:
  1--core spin for qlock
    loop1: // "as many packets"
       1a---dequeue and enqueue on dev->blist
       1b--if the driver has a hard_prep_xmit, format the packet.
    end loop1:
  1c---release qlock

  2--core spins for xmit_lock

    3--enter hardware xmit routine
    loop2: ("for as many packets" or "no more ring space"):
       3a----do per chip specific checks
       -----------if there is something wrong free skb and return OK
       3b----all is good, now stash a packet into DMA ring.
    end_loop2:
       3c -- if you enqueued packets tell tx DMA to chew on them ...

  4-- release tx lock
  5-- if all ok (there are still packets, netif not stopped etc)
      continue;
end_loop
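
A minimal C sketch of that final form. The qdisc and locking calls
(dev->queue_lock, netif_tx_lock(), q->dequeue()) are 2.6-era kernel
APIs; dev->blist, dev->xmit_win, dev->hard_prep_xmit() and
dev->hard_batch_xmit() are from this patchset; the function name and
the (skb, dev) signature for hard_prep_xmit() are assumptions:

static int batch_qdisc_restart(struct net_device *dev)
{
	struct Qdisc *q = dev->qdisc;
	struct sk_buff *skb;
	int count = 0;

	spin_lock(&dev->queue_lock);			/* #1: qlock */
	while (count < dev->xmit_win &&
	       (skb = q->dequeue(q)) != NULL) {		/* loop1 */
		if (dev->hard_prep_xmit)
			dev->hard_prep_xmit(skb, dev);	/* 1b: format */
		__skb_queue_tail(dev->blist, skb);	/* 1a */
		count++;
	}
	spin_unlock(&dev->queue_lock);			/* 1c */

	netif_tx_lock(dev);				/* #2: xmit lock */
	dev->hard_batch_xmit(dev);			/* #3: loop2 + DMA kick */
	netif_tx_unlock(dev);				/* #4 */

	return count;
}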

3.0 TBA:
--------
talk here about further optimizations added starting with xmit_win..

Appendix 1: HISTORY
---------------------
Aug 08/2007 - initial revision


