From: jamal <hadi@cyberus.ca>
To: netdev@vger.kernel.org
Cc: Krishna Kumar2 <krkumar2@in.ibm.com>,
Evgeniy Polyakov <johnpol@2ka.mipt.ru>,
Jeff Garzik <jeff@garzik.org>, Gagan Arneja <gaagaan@gmail.com>,
Leonid Grossman <Leonid.Grossman@neterion.com>,
Sridhar Samudrala <sri@us.ibm.com>,
Rick Jones <rick.jones2@hp.com>,
Robert Olsson <Robert.Olsson@data.slu.se>,
David Miller <davem@davemloft.net>,
shemminger@linux-foundation.org, kaber@trash.net
Subject: [DOC] Net tx batching core evolution
Date: Wed, 08 Aug 2007 09:03:54 -0400 [thread overview]
Message-ID: <1186578234.5171.31.camel@localhost> (raw)
In-Reply-To: <1186574575.5171.17.camel@localhost>
[-- Attachment #1: Type: text/plain, Size: 260 bytes --]
The attached doc describes the evolution of the batching work that
leads to the introduction of the dev->hard_prep_xmit() api.
As usual the updated working tree can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git
cheers,
jamal
[-- Attachment #2: evolution-of-batch.txt --]
[-- Type: text/plain, Size: 5518 bytes --]
The code for this work can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git
The purpose of this doc is not to describe the batching work itself,
although the benefits of that approach can be gleaned from it (for
example, section 1.x gives the implicit value proposition).
There are more details on the rest of the batching work in the driver
howto posted on netdev, as well as in a couple of presentations I have
given in the past (refer to the last few slides of my netconf 2006
slides, for example).
The purpose of this doc is to describe the evolution of the work,
which leads to two important apis for the driver writer.
The first is the introduction of a method called
dev->hard_prep_xmit(), and the second is the introduction of the
variable dev->xmit_win.
For the sake of clarity I will be using the non-LLTX case because it
is easier to explain.
Note: The e1000, although LLTX, has shown considerable improvement
with this approach as well, as verified by experiments. (For a lot of
other reasons the e1000 needs to be converted to be non-LLTX.)
1.0 Classical approach
------------------------
Let's start with the classical approach of what a typical driver does:
loop:
1--core spin for qlock
1a---dequeue packet
1b---release qlock
2--core spins for xmit_lock
3--enter hardware xmit routine
3a----format packet:
--------e.g vlan, mss, shinfo, csum, descriptor count, etc
-----------if there is something wrong free skb and return OK
3b----do per chip specific checks
-----------if there is something wrong free skb and return OK
3c----all is good, now stash a packet into DMA ring.
4-- release tx lock
5-- if all ok (there are still packets, netif not stopped etc)
continue, else break
end_loop:
In between #1 and #1b, another CPU contends for qlock
In between #3 and #4, another CPU contends for txlock
1.1 Challenge to classical approach:
-------------------------------------
The cost of grabbing and setting up a lock is not cheap.
Another observation: spinning CPUs are expensive, because utilization
of the compute cycles goes down while they spin.
We assume the cost of dequeueing from the qdisc is not as expensive
as enqueueing to the DMA ring.
1.2 Addressing the challenge to classical approach:
---------------------------------------------------
So we start with a simple premise to resolve the challenge above.
We try to amortize the cost by:
a) grabbing "as many" packets as we can between #1 and #1b, with very
little processing, so we don't hold that lock for long;
b) then sending "as many" as we can between #3 and #4.
Let's start with a simple approach (which is what the batch code did
in its earlier versions):
loop:
1--core spin for qlock
loop1: // "as many packets"
1a---dequeue and enqueue on dev->blist
end loop1:
1b---release qlock
2--core spins for xmit_lock
3--enter hardware xmit routine
loop2 ("for as many packets" or "no more ring space"):
3a----format packet
--------e.g vlan, mss, shinfo, csum, descriptor count, etc
-----------if there is something wrong free skb and return OK (**)
3b----do per chip specific checks
-----------if there is something wrong free skb and return OK
3c----all is good, now stash a packet into DMA ring.
end_loop2:
3d -- if you enqueued packets tell tx DMA to chew on them ...
4-- release tx lock
5-- if all ok (there are still packets, netif not stopped etc)
continue;
end_loop
loop2 has the added side-benefit of improving instruction cache
warmth: if we can do things in a loop, we keep the instruction cache
warm across more packets (hence improving the hit rate and CPU
utilization).
However, the length of this loop may affect the hit rate (this is very
clear to see on machines with bigger caches, with qdisc_restart
showing up in profiles).
Note also: We have also amortized the cost of bus IO in step 3d
because we make only one call after exiting the loop.
2.0 New challenge
-----------------
A new challenge is introduced:
we could hold the tx lock for a long period, depending on how unclean
the driver path is; imagine holding this lock across 100 packets.
So we need to find a balance.
We observe that within loop2, #3a doesn't need to be held within the
xmit lock. All it does is muck with an skb, and may in fact end up
dropping it.
2.1 Addressing new challenge
-----------------------------
To address the new challenge, I introduced the dev->hard_prep_xmit() api.
Essentially, dev->hard_prep_xmit() moves the packet formatting into #1
So now we change the code to be something along the lines of:
loop:
1--core spin for qlock
loop1: // "as many packets"
1a---dequeue and enqueue on dev->blist
1b--if driver has hard_prep_xmit, then format packet.
end loop1:
1c---release qlock
2--core spins for xmit_lock
3--enter hardware xmit routine
loop2: ("for as many packets" or "no more ring space"):
3a----do per chip specific checks
-----------if there is something wrong free skb and return OK
3b----all is good, now stash a packet into DMA ring.
end_loop2:
3c -- if you enqueued packets tell tx DMA to chew on them ...
4-- release tx lock
5-- if all ok (there are still packets, netif not stopped etc)
continue;
end_loop
3.0 TBA:
--------
Talk here about further optimizations added, starting with dev->xmit_win.
Appendix 1: HISTORY
---------------------
Aug 08/2007 - initial revision