netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [WIP][PATCHES] Network xmit batching
@ 2007-06-06 13:49 jamal
  2007-06-07  6:16 ` Krishna Kumar2
                   ` (2 more replies)
  0 siblings, 3 replies; 57+ messages in thread
From: jamal @ 2007-06-06 13:49 UTC (permalink / raw)
  To: netdev
  Cc: Krishna Kumar2, Evgeniy Polyakov, Gagan Arneja, Sridhar Samudrala,
	Rick Jones

[-- Attachment #1: Type: text/plain, Size: 2180 bytes --]

Folks,

While Krishna and I have been attempting this on the side, progress has
been rather slow - so this is to solicit more participation so we can
get this done faster. Success (myself being conservative when it comes
to performance) requires testing on a wide variety of hardware.

The results look promising - certainly from a pktgen perspective where
performance has been known in some cases to go up over 50%.
Tests by Sridhar on a low number of TCP flows also indicate improved
performance as well as lowered CPU use.
 
I have setup the current state of my patches against Linus tree at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git

This is also clean against 2.6.22-rc4. So if you want just a diff that
will work against 2.6.22-rc4 - i can send it to you.
I also have a tree against Dave's net-2.6 at
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-net26.git
but i am abandoning that effort until we get this stable, due to the
occasional bug that cropped up (like e1000).

I am attaching a pktgen script. There is one experimental parameter
called "batch_low" - for starters just leave it at 0 in order to reduce
experimental variance. If you have solid results you can muck around
with it.
KK has a netperf script he has been using - if you know netperf your
help will really be appreciated in testing it on your hardware.
KK, can you please post your script?
Testing with forwarding and bridging will also be appreciated.
Beyond that, suggestions for changes are welcome, as long as they are
based on verifiable results or are glaringly obvious. My preference at
the moment is to flesh out the patch as is and then improve on it later
if it shows value on a wide variety of apps. As the subject indicates,
this is a WIP and, as all EULAs say, "subject to change without notice".
If you help out, when you post your results can you please say what
your hardware and setup were?

The only real driver that has been changed is e1000 for now. KK is
working on something infiniband-related and i plan (if no one beats me
to it) to get tg3 working. It would be nice if someone converted a 10G
ethernet driver.

cheers,
jamal

[-- Attachment #2: pktgen.batch-1-1 --]
[-- Type: application/x-shellscript, Size: 1389 bytes --]

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [WIP][PATCHES] Network xmit batching
  2007-06-06 13:49 [WIP][PATCHES] Network xmit batching jamal
@ 2007-06-07  6:16 ` Krishna Kumar2
  2007-06-07 11:43   ` jamal
  2007-06-07  8:42 ` Krishna Kumar2
  2007-06-07 22:42 ` jamal
  2 siblings, 1 reply; 57+ messages in thread
From: Krishna Kumar2 @ 2007-06-07  6:16 UTC (permalink / raw)
  To: hadi; +Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Sridhar Samudrala

[-- Attachment #1: Type: text/plain, Size: 2674 bytes --]

My somewhat confusing netperf script (to run on the client) is attached
below. The server just needs to run netserver. The client numbers are
not completely accurate since I am not using netperf4 (I am moving to
that after some initial hiccups).

thanks,

- KK

(See attached file: netperf.scp)



[-- Attachment #2: netperf.scp --]
[-- Type: application/octet-stream, Size: 1661 bytes --]

#!/bin/bash

typeset -i i
typeset -i p

#server=192.168.1.2
echo "Set the server IP address here and remove the next line"
exit

if [ $# -ne 1 ]
then
	echo "Require a unique filename to output results"
	exit
fi

mkdir -p results

res=results/$1
if [ -f "$res" -o -d "$res" ]
then
	echo "$res already exists, provide a unique name"
	exit
fi

mkdir $res > /dev/null 2>&1

# This particular run combination (of time/tests/procs/buffer sizes) takes
# about 1 hour 34 mins to finish.

# Time to run
TIME=180

# Various buffer sizes
S="8 32 128 512 4096"

# Set the number of parallel netperfs
P="1 32"

# Should set the send/receive socket sizes so as to have uniform environment
# for every run. This can be easily done by modifying the netperf command

echo "Starting at : `date`" > $res/DATE
for size in $S
do
	for p in $P
	do
	(
		echo "E1000 Running $p TCP Test(s) size:$size delay:"
		sync
		i=0
		while [ $i -lt $p ]
		do
			netperf -c -C -v 1 -t TCP_STREAM \
				-I99,5 -l $TIME -H $server -- -m $size \
				-M $size &
			i=$i+1
		done
		wait
	) >> $res/tcp.$p 2>&1
	done

	for p in $P
	do
	(
		echo "E1000 Running $p TCP(s) Test size:$size delay:-D"
		sync
		i=0
		while [ $i -lt $p ]
		do
			netperf -c -C -v 1 -t TCP_STREAM \
				-I99,5 -l $TIME -H $server -- -m $size \
				-M $size -D &
			i=$i+1
		done
		wait
	) >> $res/tcp_d.$p 2>&1
	done

	for p in $P
	do
	(
		echo "E1000 Running $p UDP(s) Test size:$size"
		sync
		i=0
		while [ $i -lt $p ]
		do
			netperf -c -C -v 1 -t UDP_STREAM \
				-I99,5 -l $TIME -H $server -- -m $size \
				-M $size &
			i=$i+1
		done
		wait
	) >> $res/udp.$p 2>&1
	done
done
echo "Finished at : `date`" >> $res/DATE


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-06 13:49 [WIP][PATCHES] Network xmit batching jamal
  2007-06-07  6:16 ` Krishna Kumar2
@ 2007-06-07  8:42 ` Krishna Kumar2
  2007-06-07 12:16   ` jamal
  2007-06-07 22:42 ` jamal
  2 siblings, 1 reply; 57+ messages in thread
From: Krishna Kumar2 @ 2007-06-07  8:42 UTC (permalink / raw)
  To: hadi; +Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Sridhar Samudrala

Hi Jamal,

I ran these bits today and the results are included. For comparison, I am
running the original 2.6.22-rc3 bits. The systems are both 2.8 GHz, 2-CPU
P4 boxes with 2GB RAM and one E1000 82547GI card, connected using a
crossover cable. The test runs for 3 minutes for each case. I have run
only once instead of taking averages, so there could be some spurts/drops.

These results are based on the test script that I sent earlier today. I
removed the results for the UDP 32-process, 512 and 4096 buffer cases,
since the BW came out above line speed (in fact it showed 1500Mb/s and
4900Mb/s respectively, for both the ORG and these bits). I am not sure
how it comes out this high, but netperf4 is the only way to correctly
measure the combined BW of multiple processes. Another thing to do is to
disable pure performance fixes in E1000 (e.g. changing THRESHOLD to 128,
and some other changes like the Erratum workaround or MSI, etc.) which
are independent of this functionality. Then a more accurate performance
result is possible when comparing org vs batch code, without mixing in
unrelated performance fixes which skew the results (either positively
or negatively :).

Each iteration consists of running buffer sizes 8, 32, 128, 512, 4096.

------------------------------------------------------------------------------
                     Org                      New                  % Change
 Size (B)    BW (Mb/s)  Svc (us/KB)   BW (Mb/s)  Svc (us/KB)     BW      Svc
------------------------------------------------------------------------------

                  TCP 1 process
     8         68.50      119.94        67.70      121.34       -1.16     1.16
    32        234.68       35.02       234.42       35.02        -.11     0
   128        768.91       10.68       850.38        9.65       10.59    -9.64
   512        941.16        2.92       941.15        2.80        0       -4.10
  4096        939.78        1.90       939.81        1.87        0       -1.57

                  TCP 32 processes
     8         93.80   185714.97        91.02   190822.43       -2.96     2.75
    32        324.76    53909.46       315.69    54528.68       -2.79     1.14
   128        944.24    13035.72       958.14    12946.88        1.47     -.68
   512        939.95     4508.47       941.35     4545.90         .14      .83
  4096        941.88     3334.41       941.88     3134.97        0       -5.98

                  TCP 1 process, No Delay
     8         18.35      447.47        17.75      462.79       -3.26     3.42
    32         73.64      111.53        71.31      115.20       -3.16     3.29
   128        275.33       29.83       272.35       30.16       -1.08     1.10
   512        940.41        3.99       941.12        2.83         .07   -29.07
  4096        941.06        3.00       941.12        1.87        0      -37.66

                  TCP 32 processes, No Delay
     8         40.59   454802.47        36.80   525062.48       -9.33    15.44
    32         93.34   191264.12        89.41   220342.26       -4.21    15.20
   128        940.99    12663.67       942.11    13143.16         .11     3.78
   512        941.81     4659.62       942.24     4435.86         .04    -4.80
  4096        941.80     3384.20       941.77     3163.40        0       -6.52

                  UDP 1 process
     8         20.2       407.20        20.2       406.45        0        -.18
    32         80.1       102.63        80.6       102.01         .62     -.60
   128        317.5        25.71       319.1        25.58         .50     -.50
   512        885.4         7.24       885.3         5.15        -.01   -28.86
  4096        957.1         2.96       957.1         2.72        0       -8.10

                  UDP 32 processes (only 3 buffer sizes)
     8         21.1    850934.50        21.5    823970.70        1.89    -3.16
    32         83.2    211132.86        85.0    209824.30        2.16     -.61
   128        337.6     73860.56       353.7    242295.07        4.76   228.04
------------------------------------------------------------------------------
Avg:        14107.18  2064517.05     14200.02  2309541.53         .65    11.86

Summary : Average BW (whatever meaning that has) improved 0.65%, while
                 Service Demand deteriorated 11.86%

Regards,

- KK




* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07  6:16 ` Krishna Kumar2
@ 2007-06-07 11:43   ` jamal
  2007-06-07 16:13     ` Evgeniy Polyakov
  2007-06-19 13:21     ` Evgeniy Polyakov
  0 siblings, 2 replies; 57+ messages in thread
From: jamal @ 2007-06-07 11:43 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Thu, 2007-07-06 at 11:46 +0530, Krishna Kumar2 wrote:
> My somewhat confusing netperf script (to run on client) is attached below.
> Server just requires
> to run netserver. Client is not completely accurate since I am not using
> netperf4 (moving to
> that after some initial hiccups).

Thanks KK.
Folks, we need help. Please run this on different hardware. Evgeniy, i
thought this kind of stuff excites you, no? ;-> (wink, wink).
Only the sender needs the patch, but the receiver must be a more
powerful machine (so that it is not the bottleneck).
A very interesting test would be, say, 10K flows serving different
packet sizes to simulate a busy server.

I realized i can't really run the sort of tests KK is running because i
don't have a second machine. With pktgen i have an easy way out, because
i can drop all the packets on the receiver and just count them (whereas
netperf requires end-to-end semantics).

cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07  8:42 ` Krishna Kumar2
@ 2007-06-07 12:16   ` jamal
  2007-06-08  5:06     ` Krishna Kumar2
  2007-06-08 17:27     ` Rick Jones
  0 siblings, 2 replies; 57+ messages in thread
From: jamal @ 2007-06-07 12:16 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

KK,

On Thu, 2007-07-06 at 14:12 +0530, Krishna Kumar2 wrote:
> I have run only once instead of
> taking any averages, so there could be some spurts/drops.

Would be nice to run three sets - but i think even one would be
sufficiently revealing.
 
> These results are based on the test script that I sent earlier today. I
> removed the results for UDP 32 procs 512 and 4096 buffer cases since
> the BW was coming >line speed (infact it was showing 1500Mb/s and
> 4900Mb/s respectively for both the ORG and these bits). 

I expect UDP to overwhelm the receiver. So the receiver needs a lot more
tuning (like increased rcv socket buffer sizes to keep up, IMO).

But yes, the above is an odd result - Rick any insight into this?

> I am not sure
> how it is coming this high, but netperf4 is the only way to correctly
> measure multiple process combined BW. Another thing to do is to disable
> pure performance fixes in E1000 (eg changing THRESHOLD to 128 and
> some other changes like Erratum workaround or MSI, etc) which are
> independent of this functionality. Then a more accurate performance
> result is possible when comparing org vs batch code, without mixing in
> unrelated performance fixes which skews the results (either positively
> or negatively :).
> 

I agree that the THRESHOLD change needs to be the same for a fair
comparison. Note however that it is definitely a tuning parameter which
is a fundamental aspect of this batching exercise (historically it was
added to e1000 because i found it useful in my 2006 batch experiments).
When all the dust settles we should be able to pick a value that is
optimal.
Would it be useful if i made this a boot/module parameter? It probably
should have been one already.

The erratum changes - I am not so sure; the ->prep_xmit() is a
fundamental aspect and it needs to run lockless; the erratum forces us
to run with a lock. In any case, I don't think that affects your chip.

> Each iteration consists of running buffer sizes 8, 32, 128, 512, 4096.

It seems to me any runs with buffers less than 512B are unable to fill
the pipe - so they will not really benefit (and will probably get by
with Nagling). However, the <512B runs should show equivalent results
before and after the changes.
You can try turning off the _BTX feature in the driver and seeing if
they are the same. If they are not, the suspect change will be easy to
find. When i turned off the _BTX changes i saw no difference with
pktgen - but that is a different code path.

> Summary : Average BW (whatever meaning that has) improved 0.65%, while
>                  Service Demand deteriorated 11.86%

Sorry, it's been many moons since i last played with netperf; what does
"service demand" mean?

cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07 11:43   ` jamal
@ 2007-06-07 16:13     ` Evgeniy Polyakov
  2007-06-07 22:23       ` jamal
  2007-06-08  5:05       ` Krishna Kumar2
  2007-06-19 13:21     ` Evgeniy Polyakov
  1 sibling, 2 replies; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-07 16:13 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Thu, Jun 07, 2007 at 07:43:49AM -0400, jamal (hadi@cyberus.ca) wrote:
> Folks, we need help. Please run this on different hardware. Evgeniy, i
> thought this kind of stuff excites you, no? ;-> (wink, wink).
> Only the sender needs the patch but the receiver must be a more powerful
> machine (so that it is not the bottleneck).
> A very interesting test will be say 10K flows serving different packet
> sizes to simulate a busy server.

Actually I wonder where the devil lives, as I do not see how that
patchset can improve the sending situation.
Let me clarify: there are two possibilities to send data:
1. via batched sending, which runs over a queue of packets and performs
a prepare call (which only sets up some private flags, no work with
hardware) and then the sending call.
2. the old xmit function (which seems to be unused by the kernel now?)

Btw, prep_queue_frame seems to always be called under tx_lock, but the
old e1000 xmit function calls it without the lock. The locked case is
correct, since it accesses private registers via
e1000_transfer_dhcp_info() for some adapters.

So, essentially batched sending is

	lock
	while ((skb = dequeue()))
		send
	unlock

where the queue of skbs is prepared by the stack using the same
transmit lock.

Where is the gain?

Btw, this one forces a smile:
	if (unlikely(ret != NETDEV_TX_OK))
			return NETDEV_TX_OK;

P.S. I do not have e1000 hardware to test, the only testing machine has
r8169 driver.

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07 16:13     ` Evgeniy Polyakov
@ 2007-06-07 22:23       ` jamal
  2007-06-08  8:38         ` Evgeniy Polyakov
  2007-06-08  5:05       ` Krishna Kumar2
  1 sibling, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-07 22:23 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Thu, 2007-07-06 at 20:13 +0400, Evgeniy Polyakov wrote:

> Actually I wonder where the devil lives, but I do not see how that
> patchset can improve sending situation.
> Let me clarify: there are two possibilities to send data:
> 1. via batched sending, which runs via queue of packets and performs
> prepare call (which only setups some private flags, no work with 
> hardware) and then sending call.

I believe both are called with no lock. The idea is to avoid the lock
entirely when it is unneeded. That code may end up finding that the
packet is bogus and throwing it out when it deems it useless.
If you followed the discussions on multi-ring, this call is also where
i suggested selecting the tx ring.

> 2. old xmit function (which seems to be unused by kernel now?)
> 

You can change that by turning off _BTX feature in the driver.
For WIP reasons it is on at the moment.

> Btw, prep_queue_frame seems to be always called under tx_lock, but it
> old e1000 xmit function calls it without lock. 

I think both call it without lock.

> Locked case is correct,
> since it accesses private registers via e1000_transfer_dhcp_info() for
> some adapters.

I am unsure about the value of that lock (refer to email to Auke). There
is only one CPU that can enter the tx path and the contention is
minimal.

> So, essentially batched sending is 
> lock
> while ((skb = dequue))
>   send
> unlock
> 
> where queue of skbs are prepared by stack using the same transmit lock.
> 
> Where is a gain?

The amortizing of the tx lock is where the value is.
Did you see the numbers, Evgeniy? ;->
Here's one i can vouch for, on a dual-processor 2GHz box that i tested
with pktgen:

----
    1) Original e1000 driver (no batching):
    a) We got an xmit throughput of 362Kpackets/second with
    the default setup (everything falls on cpu#0).
    b) With tying to CPU#1, i saw 401Kpps.

    2) Repeated the tests with batching patches (as in this commit)
    and got an outstanding 694Kpps throughput.

    5) Repeated #4 with binding to cpu #1.
    Throughput didn't improve much - was hitting 697Kpps.
    I think we are pretty much hitting upper limits here
    ...
----

I am actually testing as we speak on faster hardware - I will post
results shortly.

> Btw, this one forces a smile:
> 	if (unlikely(ret != NETDEV_TX_OK))
> 			return NETDEV_TX_OK;
> 

Don't wanna change the way e1000 behaves. It returns NETDEV_TX_OK even
when it netif_stops; this allows the top layer to exit.

> P.S. I do not have e1000 hardware to test, the only testing machine has
> r8169 driver.

Send me your shipping address privately and i can send you some.


cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-06 13:49 [WIP][PATCHES] Network xmit batching jamal
  2007-06-07  6:16 ` Krishna Kumar2
  2007-06-07  8:42 ` Krishna Kumar2
@ 2007-06-07 22:42 ` jamal
  2 siblings, 0 replies; 57+ messages in thread
From: jamal @ 2007-06-07 22:42 UTC (permalink / raw)
  To: netdev
  Cc: Krishna Kumar2, Evgeniy Polyakov, Gagan Arneja, Sridhar Samudrala,
	Rick Jones, Robert Olsson, David Miller

On Wed, 2007-06-06 at 09:49 -0400, jamal wrote:

> If you help out, when you post your results, can you please say what
> hardware and setup was?

Ok, i have tested on new hardware with pktgen:

Machine: Dual Xeon processor with EMT64 - kernel is 32 bit.
Chipset: E7520 Memory Controller Hub (MCH). The block diagrams claim
direct connection between the MCH and PCI Express ...
Ethernet: 6 ethernet ports - e1000 82571EB.

The results are just too good, i am getting worried ;->
Pktgen with plain vanilla 2.6.22-rc4 - small packets (64B on the wire),
i.e. zero changes: avg throughput is 747Kpps.
2.6.22-rc4 with the batching changes as in the git tree: avg throughput
is 1087Kpps.

That is about a 45% improvement.

Note, in theory - i didn't try this - you could run one ethernet port
per CPU and get an equivalent improvement per CPU.

I would really really like other people to test ;->

cheers,
jamal




* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07 16:13     ` Evgeniy Polyakov
  2007-06-07 22:23       ` jamal
@ 2007-06-08  5:05       ` Krishna Kumar2
  1 sibling, 0 replies; 57+ messages in thread
From: Krishna Kumar2 @ 2007-06-08  5:05 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Gagan Arneja, jamal, netdev, Rick Jones,
	Robert Olsson, Sridhar Samudrala

Hi Evgeniy,

> Let me clarify: there are two possibilities to send data:
> 1. via batched sending, which runs via queue of packets and performs
> prepare call (which only setups some private flags, no work with
> hardware) and then sending call.
> 2. old xmit function (which seems to be unused by kernel now?)

The old xmit is used by all drivers. Only drivers providing this new
API will go through the other path. I am planning to also submit some
changes on top of jamal's patch, which will provide a configuration
parameter to enable/disable this API for a particular device.

> Btw, prep_queue_frame seems to be always called under tx_lock, but it
> old e1000 xmit function calls it without lock. Locked case is correct,
> since it accesses private registers via e1000_transfer_dhcp_info() for
> some adapters.

Prep is called after holding the dev->queue lock (not tx_lock). The old
E1000 xmit also calls prep holding just the dev->queue lock. tx_lock is
held after that.

> So, essentially batched sending is
> lock
> while ((skb = dequue))
>   send
> unlock
>
> where queue of skbs are prepared by stack using the same transmit lock.
>
> Where is a gain?

(Not the transmit lock.) But in any case, we are amortizing the cost of
the lock in the driver by starting multiple I/Os and doing a single DMA
operation. The prep part is done in both cases holding the dev->queue
lock. Similarly, IPoIB could see performance gains, as it requires a
completion notification for every skb; this could be changed to request
a notification for only the last skb (avoiding completion notifications
for skbs 1 to n-1).

> Btw, this one forces a smile:
>    if (unlikely(ret != NETDEV_TX_OK))
>          return NETDEV_TX_OK;

Actually that is how the original code is handling error in prep since
those skbs are freed up anyway and should not be retried or requeue'd :-)

Thanks,

- KK



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07 12:16   ` jamal
@ 2007-06-08  5:06     ` Krishna Kumar2
  2007-06-08 11:14       ` jamal
  2007-06-08 17:27     ` Rick Jones
  1 sibling, 1 reply; 57+ messages in thread
From: Krishna Kumar2 @ 2007-06-08  5:06 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Robert Olsson, Sridhar Samudrala

Hi Jamal,

> Would be nice to run three sets - but i think even one would be
> sufficiently revealing.

I will try multiple runs over the weekend. During the week, the systems
are used for other purposes too.

> I expect UDP to overwhelm the receiver. So the receiver needs a lot more
> tuning (like increased rcv socket buffer sizes to keep up, IMO).

I will try that. Also on the receiver, I am using unmodified 2.6.21 bits.

> It seems to me any runs with buffer less than 512B are unable to fill
> the pipe - so will not really benefit (will probably do with nagling).
> However, the < 512 B should show equivalent results before and after the
> changes.

My earlier experiments showed that even small buffers were filling the
E1000 slots and resulting in stop-queue very often. In any case, I will
also add 1 or 2 larger packet sizes (1K and 16K, in addition to the 4K
already there).

> You can try to turn off _BTX feature in the driver and see if they are
> the same. If they are not, then the suspect change will be easy to find.

I was planning to submit my changes on top of this patch, and since it
includes a configuration option per device, it will be easy to test with
and without this API. When I ran after setting this config option to 0,
the results were almost identical to the original code. I will try to
post that today for your review/comments.

> Sorry, been many moons since i last played with netperf; what does
> "service demand" mean?

It gives an indication of the number of CPU cycles needed to send out a
particular amount of data. Netperf reports it as us/KB. I don't know the
internals of netperf well enough to say how this is calculated.

thanks,

- KK



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07 22:23       ` jamal
@ 2007-06-08  8:38         ` Evgeniy Polyakov
  2007-06-08 11:31           ` jamal
  0 siblings, 1 reply; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-08  8:38 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Thu, Jun 07, 2007 at 06:23:16PM -0400, jamal (hadi@cyberus.ca) wrote:
> On Thu, 2007-07-06 at 20:13 +0400, Evgeniy Polyakov wrote:
> 
> > Actually I wonder where the devil lives, but I do not see how that
> > patchset can improve sending situation.
> > Let me clarify: there are two possibilities to send data:
> > 1. via batched sending, which runs via queue of packets and performs
> > prepare call (which only setups some private flags, no work with 
> > hardware) and then sending call.
> 
> I believe both are called with no lock. The idea is to avoid the lock
> entirely when unneeded. That code may end up finding that the packet
> is bogus and throw it out when it deems it useless.
> If you followed the discussions on multi-ring, this call is where
> i suggested to select the tx ring as well.

Hmm...

+	netif_tx_lock_bh(odev);
+	if (!netif_queue_stopped(odev)) {
+
+		idle_start = getCurUs();
+		pkt_dev->tx_entered++;
+		ret = odev->hard_batch_xmit(&odev->blist, odev);


+	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+		/* Collision - tell upper layer to requeue */
+		return NETDEV_TX_LOCKED;
+	}
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+#ifdef coredoesnoprep
+		ret = netdev->hard_prep_xmit(skb, netdev);
+		if (ret != NETDEV_TX_OK)
+			continue;
+#endif
+
+		/*XXX: This may be an opportunity to not give nit
+		 * the packet if the dev ix TX BUSY ;-> */
+		dev_do_xmit_nit(skb, netdev);
+		ret = e1000_queue_frame(skb, netdev);

The same applies to *_gso case.

> > 2. old xmit function (which seems to be unused by kernel now?)
> > 
> 
> You can change that by turning off _BTX feature in the driver.
> For WIP reasons it is on at the moment.
> 
> > Btw, prep_queue_frame seems to be always called under tx_lock, but it
> > old e1000 xmit function calls it without lock. 
> 
> I think both call it without lock.

Without lock that would be wrong - it accesses hardware.

> > Locked case is correct,
> > since it accesses private registers via e1000_transfer_dhcp_info() for
> > some adapters.
> 
> I am unsure about the value of that lock (refer to email to Auke). There
> is only one CPU that can enter the tx path and the contention is
> minimal.
> 
> > So, essentially batched sending is 
> > lock
> > while ((skb = dequeue))
> >   send
> > unlock
> > 
> > where queue of skbs are prepared by stack using the same transmit lock.
> > 
> > Where is a gain?
> 
> The amortizing of the lock on tx is where the value is.
> Did you see the numbers Evgeniy? ;->
> Heres one i can vouch on a dual processor 2GHz that i tested with
> pktgen;

I only saw the results Krishna posted, and I also do not know what
service demand is :)

> ----
>     1) Original e1000 driver (no batching):
>     a) We got a xmit throughput of 362Kpackets/second with
>     the default setup (everything falls on cpu#0).
>     b) With tying to CPU#1, i saw 401Kpps.
> 
>     2) Repeated the tests with batching patches (as in this commit)
>     And got an outstanding 694Kpps throughput.
> 
>     5) Repeated #4 with binding to cpu #1.
>     And throughput didn't improve that much - was hitting 697Kpps
>     I think we are pretty much hitting upper limits here
>     ...
> ----
> 
> I am actually testing as we speak on faster hardware - I will post
> results shortly.

The result looks good, but I still do not understand how it appeared;
that is why I'm not that excited about the idea - I just do not know it
in detail.

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08  5:06     ` Krishna Kumar2
@ 2007-06-08 11:14       ` jamal
  2007-06-08 11:31         ` Krishna Kumar2
  0 siblings, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-08 11:14 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Robert Olsson, Sridhar Samudrala

KK,
On Fri, 2007-08-06 at 10:36 +0530, Krishna Kumar2 wrote:

> I will try that. Also on the receiver, I am using unmodified 2.6.21 bits.

That should be fine as long as the sender is running the patched
2.6.22-rc4

> My earlier experiments showed that even small buffers were filling the
> E1000
> slots and resulting in stop queue very often. In any case, I will also
> add 1 or 2 larger packet sizes (1K, 16K in addition to the 4K already
> there).

That's interesting - it is possible there is transient burstiness which
fills up the ring.
My observation of your results (hence my comments): for example with
buffer size = 8B, TCP 1 process, you achieve less than 70M.  That is less
than 100Kpps on average being sent out. Very, very tiny - so it is
interesting that it is causing a shutdown.
Also note something strange: it is odd that something like UDP, which
doesn't back off, will send out fewer packets/second ;->

I could put a little hack in the e1000 driver to find the exact number
of times per run it was shut down.

BTW, another interesting thing to do is to ensure that several netperfs
are running on different CPUs.


> I was planning to submit my changes on top of this patch, and since it
> includes a configuration option per device, it will be easy to test
> with and without this API.

fantastic.

> When I ran after setting this config option to 0, the results
> were almost identical to the original code. I will try to post that today for
> your review/comments.

no problem.

> > Sorry, been many moons since i last played with netperf; what does
> "service
> > demand" mean?
> 
> It gives an indication of the amount of CPU cycles to send out a particular
> amount of data. Netperf provides it as us/KB. I don't know the internals of
> netperf enough to say how this is calculated.

I am hoping Rick would comment.

cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08  8:38         ` Evgeniy Polyakov
@ 2007-06-08 11:31           ` jamal
  2007-06-08 12:09             ` Evgeniy Polyakov
  0 siblings, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-08 11:31 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Fri, 2007-08-06 at 12:38 +0400, Evgeniy Polyakov wrote:
> On Thu, Jun 07, 2007 at 06:23:16PM -0400, jamal (hadi@cyberus.ca) wrote:

> > I believe both are called with no lock. The idea is to avoid the lock
> > entirely when unneeded. That code may end up finding that the packet
[..]
> +	netif_tx_lock_bh(odev);
> +	if (!netif_queue_stopped(odev)) {
> +
> +		idle_start = getCurUs();
> +		pkt_dev->tx_entered++;
> +		ret = odev->hard_batch_xmit(&odev->blist, odev);

[..]
> The same applies to *_gso case.
> 

You missed an important piece which is grabbing of
__LINK_STATE_QDISC_RUNNING


> Without lock that would be wrong - it accesses hardware.

We are achieving the goal of only a single CPU entering that path. Are
you saying that is not good enough?

> I only saw results Krishna posted, 

Ok, sorry - i thought you saw the git log or earlier results where
other things are captured.

> and i also do not know, what service demand is :)

From the explanation it seems to be how much CPU was used while sending.
Do you have any suggestions for computing CPU use?
In pktgen I added code to count how many microsecs were used in
transmitting.

> Result looks good, but I still do not understand how it appeared, that
> is why I'm not that excited about idea - I just do not know it in
> details.

To add to KKs explanation on other email:
Essentially the value is in amortizing the cost of barriers and IO per
packet. For example the queue lock is held/released only once per X
packets. DMA kicking which includes both a PCI IO write and mbs is done
only once per X packets. There is still a lot of room for improvement
in such IO.
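To make the amortization concrete, here is a hypothetical cost model (a
sketch only, not the actual driver code; the counters and function names
are invented for illustration): for a burst of n packets, the per-packet
path pays the lock round-trip and the DMA doorbell n times, while the
batched path pays each once.

```c
#include <assert.h>

/* Hypothetical cost model: counts how many lock round-trips and DMA
 * doorbell kicks (PCI IO write + memory barrier) are performed to
 * transmit n packets. */
struct xmit_cost { int lock_ops; int doorbell_ops; };

/* Per-packet path: one lock/unlock and one doorbell per packet. */
static struct xmit_cost xmit_per_packet(int n)
{
	struct xmit_cost c = { 0, 0 };
	int i;

	for (i = 0; i < n; i++) {
		c.lock_ops++;      /* netif_tx_lock / unlock        */
		/* ... queue one descriptor ... */
		c.doorbell_ops++;  /* wmb() + write to tail register */
	}
	return c;
}

/* Batched path: lock once, queue all n descriptors, kick DMA once. */
static struct xmit_cost xmit_batched(int n)
{
	struct xmit_cost c = { 0, 0 };
	int i;

	c.lock_ops++;              /* one lock for the whole burst   */
	for (i = 0; i < n; i++)
		;                  /* ... queue descriptor i ...     */
	c.doorbell_ops++;          /* one wmb() + IO write at the end */
	return c;
}
```

With a batch of 64 packets the fixed costs drop from 64 lock/doorbell
pairs to one of each, which is where the pps win comes from.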

cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08 11:14       ` jamal
@ 2007-06-08 11:31         ` Krishna Kumar2
  2007-06-08 11:43           ` jamal
  2007-06-08 18:00           ` Rick Jones
  0 siblings, 2 replies; 57+ messages in thread
From: Krishna Kumar2 @ 2007-06-08 11:31 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Robert Olsson, Sridhar Samudrala

Hi Jamal,

J Hadi Salim <j.hadi123@gmail.com> wrote on 06/08/2007 04:44:06 PM:

> That should be fine as long as the sender is running the patched
> 2.6.22-rc4

Definitely :)

> That's interesting - it is possible there is transient burstiness which
> fills up the ring.
> My observation of your results (hence my comments): for example with
> buffer size = 8B, TCP 1 process, you achieve less than 70M.  That is less
> than 100Kpps on average being sent out. Very, very tiny - so it is
> interesting that it is causing a shutdown.

I thought it comes to 1.147Mpps, or did I calculate wrong
(70*1024*1024/8/8) ?

> Also note something strange: it is odd that something like UDP, which
> doesn't back off, will send out fewer packets/second ;->

Cannot explain that either :)

> BTW, another interesting thing to do is to ensure that several netperfs
> are running on different CPUs.

My script was doing that earlier, I trimmed all that to make it easier
to understand. Will post the larger version later.

> no problem.

Thanks, please let me know what you think of the patch I sent earlier.

I am running a larger 5 iteration run with buffer sizes: 8, 32, 128,
512, 1K, 4K, 16K.
It is going to run for around 12 hours and since I am moving house
during the weekend, I will be able to look at the results only on
Monday.

Regards,

- KK



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08 11:31         ` Krishna Kumar2
@ 2007-06-08 11:43           ` jamal
  2007-06-08 18:00           ` Rick Jones
  1 sibling, 0 replies; 57+ messages in thread
From: jamal @ 2007-06-08 11:43 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Robert Olsson, Sridhar Samudrala

KK,

On Fri, 2007-08-06 at 17:01 +0530, Krishna Kumar2 wrote:


> I thought it comes to 1.147Mpps, or did I calculate wrong
> (70*1024*1024/8/8) ?

I assumed 8B to mean data that is on top of TCP/UDP?
If so then in the case of UDP we have 8B UDP header, 20B IP and 14B
ethernet < 64B minimal allowed Ethernet packet; so it gets padded and
goes out as 64B.
There are, as you state above, 1.147M (or is it 1.48M?) such packets/sec
at 1Gbps.
So (70Mbps/1000Mbps)*1.147M is the rough number I was referring to.
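The wire-rate arithmetic can be pinned down (a sketch; the figure counts
the 8B preamble/SFD and the 12B inter-frame gap, so a minimum 64B frame
occupies 84 bytes on the wire, giving ~1.488Mpps at 1Gbps - i.e. 1.48
rather than 1.147, which is what dividing by payload alone gives):

```c
#include <assert.h>

/* Minimum-size Ethernet frame as it appears on the wire: 64 bytes of
 * frame plus 8 bytes preamble/SFD and 12 bytes inter-frame gap. */
enum { MIN_FRAME = 64, PREAMBLE_SFD = 8, IFG = 12 };

/* Theoretical packets/second for minimum-size frames at a given link
 * rate in bits/second. */
static long min_frame_pps(long link_bps)
{
	return link_bps / ((MIN_FRAME + PREAMBLE_SFD + IFG) * 8);
}
```

At 1Gbps this evaluates to 1488095 pps, so 70Mbps of 64B frames is
roughly (70/1000)*1.488M, or about 104Kpps.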


> My script was doing that earlier, I trimmed all that to make it easier
> to understand. Will post the larger version later.

That will be nice because remember we can have multiple CPU packet
producers but only one CPU consumer.

> > no problem.
> 
> Thanks, please let me know what you think of the patch I sent earlier.

I haven't seen a patch. Can you resend it?

> I am running a larger 5 iteration run with buffer sizes: 8, 32, 128,
> 512, 1K, 4K, 16K.
> It is going to run for around 12 hours and since I am moving house
> during the weekend, I will be able to look at the results only on
> Monday.
> 

sounds good.

cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08 11:31           ` jamal
@ 2007-06-08 12:09             ` Evgeniy Polyakov
  2007-06-08 13:07               ` jamal
  0 siblings, 1 reply; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-08 12:09 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Fri, Jun 08, 2007 at 07:31:07AM -0400, jamal (hadi@cyberus.ca) wrote:
> On Fri, 2007-08-06 at 12:38 +0400, Evgeniy Polyakov wrote:
> > On Thu, Jun 07, 2007 at 06:23:16PM -0400, jamal (hadi@cyberus.ca) wrote:
> 
> > > I believe both are called with no lock. The idea is to avoid the lock
> > > entirely when unneeded. That code may end up finding that the packet
> [..]
> > +	netif_tx_lock_bh(odev);
> > +	if (!netif_queue_stopped(odev)) {
> > +
> > +		idle_start = getCurUs();
> > +		pkt_dev->tx_entered++;
> > +		ret = odev->hard_batch_xmit(&odev->blist, odev);
> 
> [..]
> > The same applies to *_gso case.
> > 
> 
> You missed an important piece which is grabbing of
> __LINK_STATE_QDISC_RUNNING

But the lock is still being held - or was there no intention to reduce
lock usage? As far as I read Krishna's mail, lock usage was not an issue,
so that hunk probably should be dropped from the analysis.
 
> > Without lock that would be wrong - it accesses hardware.
> 
> We are achieving the goal of only a single CPU entering that path. Are
> you saying that is not good enough?

Then why essentially the same code (current batch_xmit callback)
previously was always called with disabled interrupts? Aren't there
some watchdog/link/poll/whatever issues present?

> > and i also do not know, what service demand is :)
> 
> From the explanation seems to be how much cpu was used while sending. Do
> you have any suggestions for computing cpu use?
> in pktgen i added code to count how many microsecs were used in
> transmitting.

Something that anyone can understand :)
For example /proc stats; although not very accurate, it is a really
usable parameter from a userspace point of view.

> > Result looks good, but I still do not understand how it appeared, that
> > is why I'm not that excited about idea - I just do not know it in
> > details.
> 
> To add to KKs explanation on other email:
> Essentially the value is in amortizing the cost of barriers and IO per
> packet. For example the queue lock is held/released only once per X
> packets. DMA kicking which includes both a PCI IO write and mbs is done
> only once per X packets. There are still a lot of room for improvement
> of such IO;

Btw, what is the size of the packet in pktgen in your tests? Likely it
is small, since the result is that good. That can explain a lot.

> cheers,
> jamal

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08 12:09             ` Evgeniy Polyakov
@ 2007-06-08 13:07               ` jamal
  2007-06-08 21:02                 ` Evgeniy Polyakov
  0 siblings, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-08 13:07 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Fri, 2007-08-06 at 16:09 +0400, Evgeniy Polyakov wrote:
> On Fri, Jun 08, 2007 at 07:31:07AM -0400, jamal (hadi@cyberus.ca) wrote:


> But the lock is still being held - or was there no intention to reduce
> lock usage? As far as I read Krishna's mail, lock usage was not an issue,
> so that hunk probably should be dropped from the analysis.

With post 2.6.18 that atomic bit guarantees only one CPU will enter tx
path. The lock is only necessary to protect shared resources between tx
and rx (which could simultaneously be entered by two CPUs) such as the tx
ring. Refer to some other thread talking about a possible bug with e1000
in this area. So maybe e1000 is not a good example in this sense. But
look at tg3.
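A simplified, single-threaded model of the serialization being described
(the real kernel uses test_and_set_bit() on __LINK_STATE_QDISC_RUNNING
in dev->state; this sketch ignores SMP atomicity and is illustration
only):

```c
#include <assert.h>

/* Stand-in for dev->state and the __LINK_STATE_QDISC_RUNNING bit. */
static unsigned long dev_state;
#define QDISC_RUNNING_BIT 1UL

/* Models test_and_set_bit(): returns 1 if this caller won entry to
 * the tx path, 0 if some other CPU is already in it. */
static int try_enter_tx(void)
{
	if (dev_state & QDISC_RUNNING_BIT)
		return 0;	/* someone else is draining the queue */
	dev_state |= QDISC_RUNNING_BIT;
	return 1;
}

/* Models clear_bit() when the tx run completes. */
static void leave_tx(void)
{
	dev_state &= ~QDISC_RUNNING_BIT;
}
```

Only one caller at a time gets past try_enter_tx(), so the tx lock is
left guarding only state shared with the rx/clean path.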

> > > Without lock that would be wrong - it accesses hardware.
> > 
> > We are achieving the goal of only a single CPU entering that path. Are
> > you saying that is not good enough?
> 
> Then why essentially the same code (current batch_xmit callback)
> previously was always called with disabled interrupts? Aren't there
> some watchdog/link/poll/whatever issues present?

not in the e1000 as it stands today.


> Something, that anyone can understand :)
> For example /proc stats, although it is not very accurate, but it is
> really usable parameter from userspace point ov view.

which /proc stats?


> Btw, what is the size of the packet in pktgen in your tests? Likely it
> is small, since the result is that good. That can explain a lot.

There is a per-packet cost involved in that code path. So the more
packets/second you can generate the more intensely you can test that
path. I believe you will achieve overall better results with large
packets.

cheers,
jamal




* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07 12:16   ` jamal
  2007-06-08  5:06     ` Krishna Kumar2
@ 2007-06-08 17:27     ` Rick Jones
  2007-06-09  0:17       ` jamal
  1 sibling, 1 reply; 57+ messages in thread
From: Rick Jones @ 2007-06-08 17:27 UTC (permalink / raw)
  To: hadi
  Cc: Krishna Kumar2, Gagan Arneja, Evgeniy Polyakov, netdev,
	Sridhar Samudrala, David Miller, Robert Olsson

>>These results are based on the test script that I sent earlier today. I
>>removed the results for UDP 32 procs 512 and 4096 buffer cases since
>>the BW was coming >line speed (in fact it was showing 1500Mb/s and
>>4900Mb/s respectively for both the ORG and these bits). 
> 
> 
> I expect UDP to overwhelm the receiver. So the receiver needs a lot more
> tuning (like increased rcv socket buffer sizes to keep up, IMO).
> 
> But yes, the above is an odd result - Rick any insight into this?

Indeed, there is no flow control provided by netperf for the UDP_STREAM 
test and so it is quite common for a receiver to be overwhelmed.  One 
can tweak the SO_RCVBUF size a bit to try to help with transients, but 
if the sender is sustainably faster than the receiver, you have to 
configure netperf with --enable-intervals  and then provide a send burst 
(number of sends) size and an inter burst interval (constrained by "HZ" 
on the platform) to pace the netperf UDP sender.  You can get finer 
grained control with --enable-spin, but that shoots your netperf-sided 
CPU util to hell.

And with UDP datagram sizes > MTU there is (in the abstract, not sure 
about current Linux code) the concern about filling a transmit queue 
with some but not all of the fragments of a datagram and the others 
being tossed, so one ends-up sending unreassemblable datagram fragments.


>>Summary : Average BW (whatever meaning that has) improved 0.65%, while
>>                 Service Demand deteriorated 11.86%
> 
> 
> Sorry, been many moons since i last played with netperf; what does "service
> demand" mean?

Service demand is a measure of efficiency.  It is a 
normalization/reconciliation of the "throughput" and the CPU utilization 
to arrive at a CPU consumed per unit of work figure.  Lower is better.

Now, when running aggregate tests with netperf2 using the "launch a 
bunch in the background with confidence intervals enabled to get 
iterations to minimize skew error" :)

<http://www.netperf.org/svn/netperf2/tags/netperf-2.4.3/doc/netperf.html#Using-Netperf-to-Measure-Aggregate-Performance>

you cannot take the netperf service demand directly - each netperf is 
calculating assuming that it is the only thing running on the system. 
It then ass-u-me-s that the CPU util it measured was all for its work. 
This means the service demand figure will be quite higher than it really is.

So, for aggregate tests using netperf2, one has to calculate service 
demand by hand.  Sum the throughput as KB/s, convert the CPU util and 
number of CPUs to a microseconds of CPU consumed per second and divide 
to get microseconds per KB for the aggregate.
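As I read that procedure, the by-hand aggregate calculation can be
sketched as follows (the function name and parameter shapes are my
assumptions, not netperf code):

```c
#include <assert.h>

/* Aggregate service demand, per Rick's description: convert the
 * measured CPU utilization (percent, across num_cpus CPUs) into
 * CPU-microseconds consumed per second, then divide by the summed
 * throughput in KB/s to get usec of CPU per KB transferred.
 * Lower is better. */
static double aggregate_service_demand(double cpu_util_pct, int num_cpus,
				       double sum_throughput_KBps)
{
	double cpu_us_per_sec = (cpu_util_pct / 100.0) * num_cpus * 1e6;

	return cpu_us_per_sec / sum_throughput_KBps;
}
```

For example, 50% utilization on a 2-CPU box moving an aggregate
100000 KB/s comes out to 10 usec/KB.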

rick jones


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08 11:31         ` Krishna Kumar2
  2007-06-08 11:43           ` jamal
@ 2007-06-08 18:00           ` Rick Jones
  1 sibling, 0 replies; 57+ messages in thread
From: Rick Jones @ 2007-06-08 18:00 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: hadi, David Miller, Gagan Arneja, Evgeniy Polyakov, netdev,
	Robert Olsson, Sridhar Samudrala

>>Also note something strange: it is odd that something like UDP, which
>>doesn't back off, will send out fewer packets/second ;->
> 
> 
> Cannot explain that either :)

Perhaps delays in restarting after the intra-stack flow control is 
asserted.  One possible thing to do to try to deal with that a little 
would be to increase SO_SNDBUF in netperf with the -s option.  That at 
least is something I did back in the 2.4 days.

rick jones


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08 13:07               ` jamal
@ 2007-06-08 21:02                 ` Evgeniy Polyakov
  0 siblings, 0 replies; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-08 21:02 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Fri, Jun 08, 2007 at 09:07:47AM -0400, jamal (hadi@cyberus.ca) wrote:
> > Something, that anyone can understand :)
> > For example /proc stats, although it is not very accurate, but it is
> > really usable parameter from userspace point ov view.
> 
> which /proc stats?

/proc/$pid/stat; for pktgen it is likely not that interesting, but for
a usual userspace application it is quite an interesting parameter.
At least that is what 'top' shows.
 
-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-08 17:27     ` Rick Jones
@ 2007-06-09  0:17       ` jamal
  2007-06-09  0:40         ` Rick Jones
  0 siblings, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-09  0:17 UTC (permalink / raw)
  To: Rick Jones
  Cc: Krishna Kumar2, Gagan Arneja, Evgeniy Polyakov, netdev,
	Sridhar Samudrala, David Miller, Robert Olsson

On Fri, 2007-08-06 at 10:27 -0700, Rick Jones wrote:

[..]

> you cannot take the netperf service demand directly - each netperf is 
> calculating assuming that it is the only thing running on the system. 
> It then ass-u-me-s that the CPU util it measured was all for its work. 
> This means the service demand figure will be quite higher than it really is.
> 
> So, for aggregate tests using netperf2, one has to calculate service 
> demand by hand.  Sum the throughput as KB/s, convert the CPU util and 
> number of CPUs to a microseconds of CPU consumed per second and divide 
> to get microseconds per KB for the aggregate.

From what you are saying above, it seems to me that for more than one
proc it is safe to just run netperf4 instead of netperf2?

It also seems reasonable to set up large socket buffers on the receiver.

cheers,
jamal





* Re: [WIP][PATCHES] Network xmit batching
  2007-06-09  0:17       ` jamal
@ 2007-06-09  0:40         ` Rick Jones
  0 siblings, 0 replies; 57+ messages in thread
From: Rick Jones @ 2007-06-09  0:40 UTC (permalink / raw)
  To: hadi
  Cc: Krishna Kumar2, Gagan Arneja, Evgeniy Polyakov, netdev,
	Sridhar Samudrala, David Miller, Robert Olsson

jamal wrote:
> On Fri, 2007-08-06 at 10:27 -0700, Rick Jones wrote:
> 
> [..]
> 
> 
>>you cannot take the netperf service demand directly - each netperf is 
>>calculating assuming that it is the only thing running on the system. 
>>It then ass-u-me-s that the CPU util it measured was all for its work. 
>>This means the service demand figure will be quite higher than it really is.
>>
>>So, for aggregate tests using netperf2, one has to calculate service 
>>demand by hand.  Sum the throughput as KB/s, convert the CPU util and 
>>number of CPUs to a microseconds of CPU consumed per second and divide 
>>to get microseconds per KB for the aggregate.
> 
> 
> From what you are saying above, it seems to me that for more than one
> proc it is safe to just run netperf4 instead of netperf2?

Well, it is easier to be safe on aggregates with netperf4 than netperf2,
although at present it is more difficult to run netperf4 than netperf2.


> It also seems reasonable to set up large socket buffers on the receiver.

For bulk transfers I often do.

rick


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-07 11:43   ` jamal
  2007-06-07 16:13     ` Evgeniy Polyakov
@ 2007-06-19 13:21     ` Evgeniy Polyakov
  2007-06-19 13:33       ` Evgeniy Polyakov
                         ` (2 more replies)
  1 sibling, 3 replies; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-19 13:21 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

Hi.

On Thu, Jun 07, 2007 at 07:43:49AM -0400, jamal (hadi@cyberus.ca) wrote:
> Folks, we need help. Please run this on different hardware. Evgeniy, i
> thought this kind of stuff excites you, no? ;-> (wink, wink).
> Only the sender needs the patch but the receiver must be a more powerful
> machine (so that it is not the bottleneck).

I've run several simple tests with a desktop e1000 adapter I managed to
find.

The test machine is an AMD Athlon64 3500+ with 1 GB of RAM.
The other endpoint is a desktop Core Duo 3.4 GHz with 2 GB of RAM and
the sky2 driver.

Simple test included test -> desktop and vice versa traffic with 128 and
4096 block size in netperf-2.4.3 setup.

The test machine runs 2.6.22-rc5-batch and the mainline tree (there is a
test with 2.6.22-rc4, and there is a noticeable performance win in the
latest git compared to that tree; likely the tcp congestion changes
resulted in better utilisation). Batched xmit has better numbers.

Results:

2.6.20-ubuntu -> test machine

2.6.22-rc4-batch:
128: 725.43 724.14
4096: 698.63 712.77

2.6.22-rc5-mainline:
128: 760.91 762.04
4096: 784.32 788.53

2.6.22-rc5-batch:
128: 766.70
4096: 788.24


test machine -> 2.6.20-ubuntu

2.6.22-rc5-mainline:
128: 558.16 (Desired confidence was not achieved within the specified iterations.)
4096: 814.01

2.6.22-rc5-batch:
128: 569.14 (Desired confidence was not achieved within the specified iterations.)
4096: 822.72


-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 13:21     ` Evgeniy Polyakov
@ 2007-06-19 13:33       ` Evgeniy Polyakov
  2007-06-19 14:00         ` Evgeniy Polyakov
  2007-06-19 16:24       ` jamal
  2007-06-21 21:00       ` Rick Jones
  2 siblings, 1 reply; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-19 13:33 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, Jun 19, 2007 at 05:21:48PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> Simple test included test -> desktop and vice versa traffic with 128 and
> 4096 block size in netperf-2.4.3 setup.

I.e. it is not pktgen, but a usual userspace application, which should
not use the new batch methods.

I will try to find the pktgen scripts posted in this thread.

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 13:33       ` Evgeniy Polyakov
@ 2007-06-19 14:00         ` Evgeniy Polyakov
  2007-06-19 14:09           ` Evgeniy Polyakov
  2007-06-19 16:28           ` jamal
  0 siblings, 2 replies; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-19 14:00 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, Jun 19, 2007 at 05:33:33PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Tue, Jun 19, 2007 at 05:21:48PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > Simple test included test -> desktop and vice versa traffic with 128 and
> > 4096 block size in netperf-2.4.3 setup.
> 
> I.e. it is not pktgen, but a usual userspace application, which should
> not use the new batch methods.
> 
> I will try to find the pktgen scripts posted in this thread.

pktgen results are quite poor:
batch (changed from default script: count reduced, clone increased to 10k)
241319pps 115Mb/sec

mainline (the same script; on start it wrote about an unsupported
batch_low parameter):
497520pps 238Mb/sec

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 14:00         ` Evgeniy Polyakov
@ 2007-06-19 14:09           ` Evgeniy Polyakov
  2007-06-19 16:32             ` jamal
  2007-06-19 16:28           ` jamal
  1 sibling, 1 reply; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-19 14:09 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, Jun 19, 2007 at 06:00:39PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Tue, Jun 19, 2007 at 05:33:33PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > On Tue, Jun 19, 2007 at 05:21:48PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> > > Simple test included test -> desktop and vice versa traffic with 128 and
> > > 4096 block size in netperf-2.4.3 setup.
> > 
> > I.e. it is not pktgen, but a usual userspace application, which should
> > not use the new batch methods.
> > 
> > I will try to find the pktgen scripts posted in this thread.
> 
> pktgen results are quite poor:
> batch (changed from default script: count reduced, clone increased to 10k)
> 241319pps 115Mb/sec
> 
> mainline (the same script; on start it wrote about an unsupported
> batch_low parameter):
> 497520pps 238Mb/sec


And here is with batch_low=400:

182607pps 87Mb/sec

Here is output for batch_low=0 batch tree (batch_low=400 is similar,
with min_batch parameter set to 400):

Params: count 1000000  min_pkt_size: 60  max_pkt_size: 60 min_batch 0
     frags: 0  delay: 0  clone_skb: 10000  ifname: eth1
     flows: 0 flowlen: 0
     dst_min: 192.168.4.81  dst_max: 
     src_min:   src_max: 
     src_mac: 00:0E:0C:B8:63:0A  dst_mac: 00:17:31:9A:E5:BE
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags: 
Current:
     pkts-sofar: 1000000  errors: 0
     started: 1182274184549351us  stopped: 1182274188693235us idle: 20us alloc 4007752us txt 127073us
     seq_num: 1000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x3000a8c0  cur_daddr: 0x5104a8c0
     cur_udp_dst: 9  cur_udp_src: 9
     flows: 0
Result: OK: T4143884(U4143864+I20+A4007752+T127073) usec, P1000000 TE8511TS1(B60,-1frags)
  241319pps 115Mb/sec (115833120bps) errors: 0



-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 13:21     ` Evgeniy Polyakov
  2007-06-19 13:33       ` Evgeniy Polyakov
@ 2007-06-19 16:24       ` jamal
  2007-06-21 21:00       ` Rick Jones
  2 siblings, 0 replies; 57+ messages in thread
From: jamal @ 2007-06-19 16:24 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, 2007-19-06 at 17:21 +0400, Evgeniy Polyakov wrote:

> I've run several simple tests with a desktop e1000 adapter I managed to
> find.

Mucho gracias Evgeniy. 

> The test machine is an AMD Athlon64 3500+ with 1 GB of RAM.
> The other endpoint is a desktop Core Duo 3.4 GHz with 2 GB of RAM and
> the sky2 driver.
> 
> Simple test included test -> desktop and vice versa traffic with 128 and
> 4096 block size in netperf-2.4.3 setup.
> 

Sounds good enough for starters. 

> Test machine runs 2.6.22-rc5-batch and mainline tree (there is a test
> with 2.6.22-rc4 and there is a noticeable performance win compared to
> that tree in the latest git, likely tcp congestion changes resulted in
> better utilisation). 

Do you have tso on or off? I suspect you'd see better numbers with tso off.

> Batched xmit has better numbers.

Much appreciated. It would be useful to sort of measure CPU utilization
as well (my understanding is netperf4 is more capable of that).
I think the benefit would be a lot more visible as the number of flows
goes up (simply because there will be a lot more packets/sec going into
the driver).

cheers,
jamal




* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 14:00         ` Evgeniy Polyakov
  2007-06-19 14:09           ` Evgeniy Polyakov
@ 2007-06-19 16:28           ` jamal
  2007-06-19 16:35             ` Evgeniy Polyakov
                               ` (2 more replies)
  1 sibling, 3 replies; 57+ messages in thread
From: jamal @ 2007-06-19 16:28 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, 2007-19-06 at 18:00 +0400, Evgeniy Polyakov wrote:

> pktgen reults are quite poor:
> batch (changed from default script: count reduced, clone increased to 10k)
> 241319pps 115Mb/sec
> 
> mainline (the same script, on start it wrote about unsupported batch_low
> parameter:
> 497520pps 238Mb/sec

What is your kernel config in regards to HRES timers? Robert mentioned
to me that the clock source may be causing issues with pktgen (maybe
even qos).
Robert, insights?

In my case, I have:
# CONFIG_TICK_ONESHOT is not set
# CONFIG_NO_HZ is not set
# CONFIG_HIGH_RES_TIMERS is not set

I will try to turn those on in a few days and see if I notice a
difference.

Again, thanks Evgeniy.

cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 14:09           ` Evgeniy Polyakov
@ 2007-06-19 16:32             ` jamal
  2007-06-19 16:44               ` Evgeniy Polyakov
  0 siblings, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-19 16:32 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, 2007-19-06 at 18:09 +0400, Evgeniy Polyakov wrote:
> On Tue, Jun 19, 2007 at 06:00:39PM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:

> > 
> > pktgen results are quite poor:
> > batch (changed from default script: count reduced, clone increased to 10k)
> > 241319pps 115Mb/sec

BTW, don't turn on cloning - leave it at 1 so that every
packet is allocated, giving us the worst-case scenario.
 
> > mainline (the same script, on start it wrote about unsupported batch_low
> > parameter):
> > 497520pps 238Mb/sec
> 
> 
> And here is with batch_low=400:

I am going to get rid of that batch param - it doesn't seem to matter
what that value is in the long run.



> Result: OK: T4143884(U4143864+I20+A4007752+T127073) usec, P1000000 TE8511TS1(B60,-1frags)

So on average you have been sending about 120 packets per batch - which
should give you decent numbers. Let's see how Robert responds, and if you
have time, turn off the kernel configs like I did.

cheers,
jamal





* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 16:28           ` jamal
@ 2007-06-19 16:35             ` Evgeniy Polyakov
  2007-06-19 16:45             ` Evgeniy Polyakov
  2007-06-19 17:35             ` Robert Olsson
  2 siblings, 0 replies; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-19 16:35 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, Jun 19, 2007 at 12:28:49PM -0400, jamal (hadi@cyberus.ca) wrote:
> On Tue, 2007-19-06 at 18:00 +0400, Evgeniy Polyakov wrote:
> 
> > pktgen results are quite poor:
> > batch (changed from default script: count reduced, clone increased to 10k)
> > 241319pps 115Mb/sec
> > 
> > mainline (the same script, on start it wrote about unsupported batch_low
> > parameter):
> > 497520pps 238Mb/sec
> 
> What is your kernel config in regards to HRES timers? Robert mentioned
> to me that the clock source may be causing issues with pktgen (maybe even
> qos).
> Robert, insights?
> 
> In my case, I have:
> # CONFIG_TICK_ONESHOT is not set
> # CONFIG_NO_HZ is not set
> # CONFIG_HIGH_RES_TIMERS is not set
> 
> I will try turning those on in a few days and see if I notice a
> difference.

Essentially the same:

$ cat ./.config | egrep "CONFIG_TICK_ONESHOT|CONFIG_NO_HZ|CONFIG_HIGH_RES_TIMERS"
$ head Makefile 
VERSION = 2
PATCHLEVEL = 6
SUBLEVEL = 22
EXTRAVERSION = -rc5
NAME = Holy Dancing Manatees, Batman!

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 16:32             ` jamal
@ 2007-06-19 16:44               ` Evgeniy Polyakov
  0 siblings, 0 replies; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-19 16:44 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, Jun 19, 2007 at 12:32:48PM -0400, jamal (hadi@cyberus.ca) wrote:
> > > pktgen results are quite poor:
> > > batch (changed from default script: count reduced, clone increased to 10k)
> > > 241319pps 115Mb/sec
> 
> BTW, don't turn on cloning - leave it at 1 so that every
> packet is allocated, giving us the worst-case scenario.

Here is with batch_min 400 and clone 1:
184251pps 88Mb/sec (88440480bps) errors: 0
A bit better than without cloning, but that might just be noise.

batch_min 0 and clone 1:
258564pps 124Mb/sec (124110720bps) errors: 0
Noticeably better than with cloning, but still much worse than mainline.

> > > mainline (the same script, on start it wrote about unsupported batch_low
> > > parameter):
> > > 497520pps 238Mb/sec
> > 
> > 
> > And here is with batch_low=400:
> 
> I am going to get rid of that batch param - it doesn't seem to matter
> what that value is in the long run.

Hmm, with that parameter the result is 87Mb/sec, while with batch_min equal
to zero it is 115Mb/sec. Without batching the result is even better -
238Mb/sec - so it does seem to affect performance.

> > Result: OK: T4143884(U4143864+I20+A4007752+T127073) usec, P1000000 TE8511TS1(B60,-1frags)
> 
> So on average you have been sending about 120 packets per batch - which
> should give you decent numbers. Lets see how Robert responds and if you
> have time, turn off the kernel configs like i did.

It is not turned on in my config.

> cheers,
> jamal

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 16:28           ` jamal
  2007-06-19 16:35             ` Evgeniy Polyakov
@ 2007-06-19 16:45             ` Evgeniy Polyakov
  2007-06-19 17:35             ` Robert Olsson
  2 siblings, 0 replies; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-19 16:45 UTC (permalink / raw)
  To: jamal
  Cc: Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, Jun 19, 2007 at 12:28:49PM -0400, jamal (hadi@cyberus.ca) wrote:
> In my case, I have:
> # CONFIG_TICK_ONESHOT is not set
> # CONFIG_NO_HZ is not set
> # CONFIG_HIGH_RES_TIMERS is not set

$ cat ./.config | egrep "CONFIG_TICK_ONESHOT|CONFIG_NO_HZ|CONFIG_HIGH_RES_TIMERS"
$

> I will try to turn on those in a few days and see if notice a
> difference.

It is turned off.

> Again, thanks Evgeniy.

No problem.

> cheers,
> jamal

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 16:28           ` jamal
  2007-06-19 16:35             ` Evgeniy Polyakov
  2007-06-19 16:45             ` Evgeniy Polyakov
@ 2007-06-19 17:35             ` Robert Olsson
  2007-06-19 17:48               ` jamal
  2007-06-19 22:28               ` [WIP][PATCHES] Network xmit batching David Miller
  2 siblings, 2 replies; 57+ messages in thread
From: Robert Olsson @ 2007-06-19 17:35 UTC (permalink / raw)
  To: hadi
  Cc: Evgeniy Polyakov, Krishna Kumar2, Gagan Arneja, netdev,
	Rick Jones, Sridhar Samudrala, David Miller, Robert Olsson


jamal writes:

 > What is your kernel config in regards to HRES timers? Robert mentioned
 > to me that the clock source maybe causing issues with pktgen (maybe even
 > qos). Robert, insights?

 pktgen heavily uses gettimeofday. I was using tsc as clock source with 
 our opterons in the lab. In late 2.6.20 gettimeofday was changed so tsc 
 couldn't be used on opterons (pktgen at least). 

 To give you an example, with Intel's 82571EB we could send 1.488 Mpps using tsc 
 but with hpet only 400 kpps.

 But Evgeniy is most likely using the same clocksource both for the mainline 
 and batch tests so there must be something different...


 Cheers.
					--ro


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 17:35             ` Robert Olsson
@ 2007-06-19 17:48               ` jamal
  2007-06-19 17:55                 ` Evgeniy Polyakov
  2007-06-28  0:05                 ` [WIP][PATCHES] Network xmit batching - tg3 support jamal
  2007-06-19 22:28               ` [WIP][PATCHES] Network xmit batching David Miller
  1 sibling, 2 replies; 57+ messages in thread
From: jamal @ 2007-06-19 17:48 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Evgeniy Polyakov, Krishna Kumar2, Gagan Arneja, netdev,
	Rick Jones, Sridhar Samudrala, David Miller

On Tue, 2007-19-06 at 19:35 +0200, Robert Olsson wrote:

>  But Evgeniy is most likely using the same clocksource both for the mainline 
>  and batch tests so there must be something different...

I am curious though which one; from my notes (on my laptop) for that test
machine:

# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
acpi_pm jiffies tsc

# cat /sys/devices/system/clocksource/clocksource0/current_clocksource
acpi_pm

OTOH, on my laptop:
-------
hadi@lilsol:~$ sudo cat /sys/devices/system/clocksource/clocksource0/
available_clocksource  current_clocksource
hadi@lilsol:~$ sudo cat /sys/devices/system/clocksource/clocksource0/current_clocksource
hpet
---------------

cheers,
jamal

PS: I won't be able to respond for another 24 hrs or so.



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 17:48               ` jamal
@ 2007-06-19 17:55                 ` Evgeniy Polyakov
  2007-06-28  0:05                 ` [WIP][PATCHES] Network xmit batching - tg3 support jamal
  1 sibling, 0 replies; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-19 17:55 UTC (permalink / raw)
  To: jamal
  Cc: Robert Olsson, Krishna Kumar2, Gagan Arneja, netdev, Rick Jones,
	Sridhar Samudrala, David Miller

On Tue, Jun 19, 2007 at 01:48:20PM -0400, jamal (hadi@cyberus.ca) wrote:
> On Tue, 2007-19-06 at 19:35 +0200, Robert Olsson wrote:
> 
> >  But Evgeniy is most likely using the same clocksource both for the mainline 
> >  and batch tests so there must be something different...
> 
> I am curious though which one; from my notes (on my laptop) for that test
> machine:

Batch:
$ sudo cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc acpi_pm jiffies

Mainline:
$ sudo cat /sys/devices/system/clocksource/clocksource0/available_clocksource
tsc acpi_pm jiffies

The config is exactly the same, so all the other parameters should be the same too.

-- 
	Evgeniy Polyakov


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 17:35             ` Robert Olsson
  2007-06-19 17:48               ` jamal
@ 2007-06-19 22:28               ` David Miller
  2007-06-21 15:54                 ` FSCKED clock sources WAS(Re: " jamal
  1 sibling, 1 reply; 57+ messages in thread
From: David Miller @ 2007-06-19 22:28 UTC (permalink / raw)
  To: Robert.Olsson; +Cc: hadi, johnpol, krkumar2, gaagaan, netdev, rick.jones2, sri

From: Robert Olsson <Robert.Olsson@data.slu.se>
Date: Tue, 19 Jun 2007 19:35:45 +0200

>  pktgen heavily uses gettimeofday. I was using tsc as clock source with 
>  our opterons in the lab. In late 2.6.20 gettimeofday was changed so tsc 
>  couldn't be used on opterons (pktgen at least). 
> 
>  To give you an example w.intel's 82571EB we could send 1.488 Mpps using tsc 
>  but with hpet only 400 kpps.

Converting pktgen over to ktime_t might be a nice cleanup.


* FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 22:28               ` [WIP][PATCHES] Network xmit batching David Miller
@ 2007-06-21 15:54                 ` jamal
  2007-06-21 16:08                   ` jamal
  2007-06-21 16:45                   ` Evgeniy Polyakov
  0 siblings, 2 replies; 57+ messages in thread
From: jamal @ 2007-06-21 15:54 UTC (permalink / raw)
  To: David Miller
  Cc: Robert.Olsson, johnpol, krkumar2, gaagaan, netdev, rick.jones2,
	sri

[-- Attachment #1: Type: text/plain, Size: 617 bytes --]

On Tue, 2007-19-06 at 15:28 -0700, David Miller wrote:

> Converting pktgen over to ktime_t might be a nice cleanup.

Would that really solve it? I.e., doesn't it still depend on what the clock
source is?

I had a friend of mine run the tests (Robert, you know Jeremy) and the
results are slightly different from what Evgeniy found.

The summary is: batching is always better, and jiffies is always the better
clock source (and who would have thunk, eh? The Opteron kicks a Xeon's ass).
Attached results.

Evgeniy, did you sync on the batching case with the git tree?
Can you describe your hardware in /proc/cpuinfo and /proc/interrupts?

cheers,
jamal

[-- Attachment #2: batch-clock-res --]
[-- Type: text/plain, Size: 2630 bytes --]

The test variables are:
----------------------

1) An Intel Xeon[1] machine vs an AMD Opteron[2].
2) A plain 2.6.22-rc4 kernel vs 2.6.22-rc4 with batching
(from git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git)
3) Different clock sources acpi-pm, jiffies and tsc

Test setup
-----------

pktgen was used to send from the system under test (where
test variables #2-#3 were adjusted) to a second box. 
CPU affinity was tied to cpu2 in all cases to reduce variance across
test runs...

Test validation
---------------

Throughput results were confirmed to match on receiver
and sender (as reported by pktgen)

Results
-------
The AMD Opteron always had better results.
The batching kernels were always better than non-batching ones.
The jiffies clock was always the most consistent and gave the
best performance.

Kernel-type | acpi-pm clock | jiffies clock | tsc clock |
+h/ware     |               |               |           |
------------+---------------+---------------+-----------+
2622-rc4    | 347Kpps       | 1.40 Mpps     | 1.36Mpps  |
plain       |               |               |           |
Intel Xeon  |               |               |           |
------------+---------------+---------------+-----------+
2622-rc4    | 342Kpps       | 853 kpps      | 821kpps   |
plain       |               |               |           |
AMD opteron |               |               |           |
------------+---------------+---------------+-----------+
2622-rc4    | 615Kpps       | 1.46 Mpps     | 1.46Mpps  |
batch       |               |               |           |
Intel Xeon  |               |               |           |
------------+---------------+---------------+-----------+
2622-rc4    | 633Kpps       | 1.18 Mpps     | 1.17Mpps  |
batch       |               |               |           |
AMD opteron |               |               |           |
------------+---------------+---------------+-----------+

The two systems under test 
---------------------------

[1]-------------
vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 2.80GHz
stepping        : 1
cpu MHz         : 2793.329
cache size      : 1024 KB
physical id     : 3
siblings        : 2
core id         : 0
cpu cores       : 1
-------------

[2]-------------
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 33
model name      : Dual Core AMD Opteron(tm) Processor 275
stepping        : 2
cpu MHz         : 2194.778
cache size      : 1024 KB
physical id     : 1
siblings        : 2
core id         : 1
cpu cores       : 2
---------------------------------------------



* Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
  2007-06-21 15:54                 ` FSCKED clock sources WAS(Re: " jamal
@ 2007-06-21 16:08                   ` jamal
  2007-06-21 16:55                     ` Benjamin LaHaise
  2007-06-21 16:45                   ` Evgeniy Polyakov
  1 sibling, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-21 16:08 UTC (permalink / raw)
  To: David Miller
  Cc: Robert.Olsson, johnpol, krkumar2, gaagaan, netdev, rick.jones2,
	sri

On Thu, 2007-21-06 at 11:54 -0400, jamal wrote:

> The summary is: batching is always better, and jiffies is always the better
> clock source (and who would have thunk, eh? The Opteron kicks a Xeon's ass).

The results in the table for the Opteron and Xeon got swapped when
cut-and-pasting from a larger test result. So the Opteron is the one with
the better results.
In any case - off for the day over here.

cheers,
jamal




* Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
  2007-06-21 15:54                 ` FSCKED clock sources WAS(Re: " jamal
  2007-06-21 16:08                   ` jamal
@ 2007-06-21 16:45                   ` Evgeniy Polyakov
  2007-06-25 16:58                     ` jamal
  1 sibling, 1 reply; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-21 16:45 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, Robert.Olsson, krkumar2, gaagaan, netdev,
	rick.jones2, sri

On Thu, Jun 21, 2007 at 11:54:17AM -0400, jamal (hadi@cyberus.ca) wrote:
> Evgeniy, did you sync on the batching case with the git tree?

My tree contains the following commits:

Latest mainline commit: fa490cfd15d7ce0900097cc4e60cfd7a76381138
Latest batch commit: 9b8cc32088abfda8be7f394cfd5ee6ac694da39c

> Can you describe your hardware in /proc/cpuinfo and /proc/interrupts?

Sure.
cpuinfo:
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 15
model name      : AMD Athlon(tm) 64 Processor 3500+
stepping        : 0
cpu MHz         : 2210.092
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge
mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm
3dnowext 3dnow up
bogomips        : 4423.20
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

interrupts:
CPU0       
0:    1668864   IO-APIC-edge      timer
1:         78   IO-APIC-edge      i8042
8:          0   IO-APIC-edge      rtc
9:          0   IO-APIC-fasteoi   acpi
12:        102   IO-APIC-edge      i8042
14:        465   IO-APIC-edge      ide0
18:     774515   IO-APIC-fasteoi   eth1
22:          0   IO-APIC-fasteoi   sata_nv
23:       5068   IO-APIC-fasteoi   sata_nv
NMI:          0 
LOC:    1668914 
ERR:          0

I pulled the latest version recently and started a netperf test - both
netperf on the sending (batching) machine and netserver on the receiver take
about 16-25% of CPU time, which is likely a bug.
With a 4096-byte block it is 819 Mbit/sec, which is slightly more than the
mainline result, but I cannot say that it is noticeably above the noise.

I did not check the CPU usage of the previous releases, but the receiving
netserver was always around 15-16%.

Here is pktgen result:

Params: count 1000000  min_pkt_size: 60  max_pkt_size: 60 min_batch 0
     frags: 0  delay: 0  clone_skb: 1  ifname: eth1
     flows: 0 flowlen: 0
     dst_min: 192.168.4.81  dst_max: 
     src_min:   src_max: 
     src_mac: 00:0E:0C:B8:63:0A  dst_mac: 00:17:31:9A:E5:BE
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags: 
Current:
     pkts-sofar: 1000000  errors: 0
     started: 1182456838614560us  stopped: 1182456842533487us idle: 15us alloc 3780137us txt 130388us
     seq_num: 1000001  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x3000a8c0  cur_daddr: 0x5104a8c0
     cur_udp_dst: 9  cur_udp_src: 9
     flows: 0
Result: OK: T3918927(U3918912+I15+A3780137+T130388) usec, P1000000 TE8511TS1(B60,-1frags)
  255171pps 122Mb/sec (122482080bps) errors: 0

There is no cloning.
When there is no cloning, mainline shows 112 Mb/sec, which is less, but
when there are 10k clones the results are:
mainline: 	469857pps 225Mb/sec
latest batch:	246089pps 118Mb/sec

So that is definitely a sign that batching has some issues with skb
reuse.

> cheers,
> jamal


-- 
	Evgeniy Polyakov


* Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
  2007-06-21 16:08                   ` jamal
@ 2007-06-21 16:55                     ` Benjamin LaHaise
  2007-06-25 16:59                       ` jamal
  0 siblings, 1 reply; 57+ messages in thread
From: Benjamin LaHaise @ 2007-06-21 16:55 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, Robert.Olsson, johnpol, krkumar2, gaagaan, netdev,
	rick.jones2, sri

On Thu, Jun 21, 2007 at 12:08:19PM -0400, jamal wrote:
> The results in the table for opteron and xeon are swapped when
> cutnpasting from a larger test result. So Opteron is the one with better
> results.
> In any case - off for the day over here.

You should qualify that as 'Old P4 Xeon', as the Core 2 Xeons are leagues 
better.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <zyntrop@kvack.org>.


* Re: [WIP][PATCHES] Network xmit batching
  2007-06-19 13:21     ` Evgeniy Polyakov
  2007-06-19 13:33       ` Evgeniy Polyakov
  2007-06-19 16:24       ` jamal
@ 2007-06-21 21:00       ` Rick Jones
  2007-06-22  9:59         ` Evgeniy Polyakov
  2 siblings, 1 reply; 57+ messages in thread
From: Rick Jones @ 2007-06-21 21:00 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: rick.jones2, jamal, Krishna Kumar2, Gagan Arneja, netdev,
	Sridhar Samudrala, David Miller, Robert Olsson

On Tue, 2007-06-19 at 17:21 +0400, Evgeniy Polyakov wrote:
> Hi.
> 
> On Thu, Jun 07, 2007 at 07:43:49AM -0400, jamal (hadi@cyberus.ca) wrote:
> > Folks, we need help. Please run this on different hardware. Evgeniy, i
> > thought this kind of stuff excites you, no? ;-> (wink, wink).
> > Only the sender needs the patch but the receiver must be a more powerful
> > machine (so that it is not the bottleneck).
> 
> I've run several simple tests with a desktop e1000 adapter I managed to
> find.
> 
> Test machine is amd athlon64 3500+ with 1gb of ram.
> Another point is a desktop core duo 3.4 ghz with 2 gb of ram and sky2
> driver.
> 
> Simple test included test -> desktop and vice versa traffic with 128 and
> 4096 block size in netperf-2.4.3 setup.

Is that in conjunction with setting the test-specific -D to set
TCP_NODELAY, or was Nagle left on?  If the latter, perhaps timing issues
could be why the confidence intervals weren't hit, since the relative
batching of 128-byte sends into larger segments is something of a race.

rick jones



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-21 21:00       ` Rick Jones
@ 2007-06-22  9:59         ` Evgeniy Polyakov
  2007-06-25 17:35           ` Rick Jones
  0 siblings, 1 reply; 57+ messages in thread
From: Evgeniy Polyakov @ 2007-06-22  9:59 UTC (permalink / raw)
  To: Rick Jones
  Cc: jamal, Krishna Kumar2, Gagan Arneja, netdev, Sridhar Samudrala,
	David Miller, Robert Olsson

On Thu, Jun 21, 2007 at 02:00:07PM -0700, Rick Jones (rick.jones2@hp.com) wrote:
> > Simple test included test -> desktop and vice versa traffic with 128 and
> > 4096 block size in netperf-2.4.3 setup.
> 
> Is that in conjunction with setting the test-specific -D to set
> TCP_NODELAY, or was Nagle left on?  If the latter, perhaps timing issues
> could be why the confidence intervals weren't hit, since the relative
> batching of 128-byte sends into larger segments is something of a race.

I used these parameters:
netperf -l 60 -H kano -t TCP_STREAM -i 10,2 -I 99,5 -- -m 128 -s 128K
-S 128K

so without nodelay.

With nodelay I've gotten:
batch-128: 128.91 mbit/sec
mainline-128: 140.57 mbit/sec

which is about 5 times less than without nodelay (~760 Mbit/s),
although the nodelay results look more realistic.


> rick jones

-- 
	Evgeniy Polyakov


* Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
  2007-06-21 16:45                   ` Evgeniy Polyakov
@ 2007-06-25 16:58                     ` jamal
  0 siblings, 0 replies; 57+ messages in thread
From: jamal @ 2007-06-25 16:58 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: David Miller, Robert.Olsson, krkumar2, gaagaan, netdev,
	rick.jones2, sri

On Thu, 2007-21-06 at 20:45 +0400, Evgeniy Polyakov wrote:
> On Thu, Jun 21, 2007 at 11:54:17AM -0400, jamal (hadi@cyberus.ca) wrote:
> > Evgeniy, did you sync on the batching case with the git tree?
> 
> My tree contains following commits:
> 
> Latest mainline commit: fa490cfd15d7ce0900097cc4e60cfd7a76381138
> Latest batch commit: 9b8cc32088abfda8be7f394cfd5ee6ac694da39c

That looks right. There have been a few fixes since, but they are mostly
cosmetic and shouldn't affect the results.
Strange - so far your results contradict both mine on pktgen (since you
sometimes show poor performance) and what Krishna posted on netperf
(since you show improvement).

BTW, on pktgen and batching - don't use the clone parameter for 
anything > 1. I have not tested that code path at all and it is highly
likely buggy.

Thanks a lot for taking the time Evgeniy.

cheers,
jamal



* Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
  2007-06-21 16:55                     ` Benjamin LaHaise
@ 2007-06-25 16:59                       ` jamal
  2007-06-25 17:08                         ` Benjamin LaHaise
  0 siblings, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-25 16:59 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: David Miller, Robert.Olsson, johnpol, krkumar2, gaagaan, netdev,
	rick.jones2, sri

On Thu, 2007-21-06 at 12:55 -0400, Benjamin LaHaise wrote:

> You should qualify that as 'Old P4 Xeon', as the Core 2 Xeons are leagues 
> better.

The Xeon hardware is not that old - about a year or so (and so is the
Opteron). 
BTW, how could you tell this was an old Xeon?

cheers,
jamal 



* Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
  2007-06-25 16:59                       ` jamal
@ 2007-06-25 17:08                         ` Benjamin LaHaise
  2007-06-25 17:16                           ` jamal
  0 siblings, 1 reply; 57+ messages in thread
From: Benjamin LaHaise @ 2007-06-25 17:08 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, Robert.Olsson, johnpol, krkumar2, gaagaan, netdev,
	rick.jones2, sri

On Mon, Jun 25, 2007 at 12:59:54PM -0400, jamal wrote:
> On Thu, 2007-21-06 at 12:55 -0400, Benjamin LaHaise wrote:
> 
> > You should qualify that as 'Old P4 Xeon', as the Core 2 Xeons are leagues 
> > better.
> 
> The Xeon hardware is not that old - about a year or so (and so is the
> opteron). 
> BTW, how could you tell this was old Xeon?

CPUID:

vendor_id       : GenuineIntel
cpu family      : 15
model           : 4
model name      : Intel(R) Xeon(TM) CPU 2.80GHz

shows that it is a P4 Xeon, which sucks compared to:

vendor_id       : GenuineIntel
cpu family      : 6
model           : 15
model name      : Genuine Intel(R) CPU                  @ 2.66GHz

which is a Core 2 based Xeon.  The tuning required by the P4 is quite 
different than the Core 2, and it generally performs more poorly due to 
the length of the pipeline and the expense of pipeline flushes.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <zyntrop@kvack.org>.


* Re: FSCKED clock sources WAS(Re: [WIP][PATCHES] Network xmit batching
  2007-06-25 17:08                         ` Benjamin LaHaise
@ 2007-06-25 17:16                           ` jamal
  0 siblings, 0 replies; 57+ messages in thread
From: jamal @ 2007-06-25 17:16 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: David Miller, Robert.Olsson, johnpol, krkumar2, gaagaan, netdev,
	rick.jones2, sri

On Mon, 2007-25-06 at 13:08 -0400, Benjamin LaHaise wrote:

> CPUID:
> 
> vendor_id       : GenuineIntel
> cpu family      : 15
> model           : 4
> model name      : Intel(R) Xeon(TM) CPU 2.80GHz
> 
> shows that it is a P4 Xeon, which sucks compared to:
> 
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 15
> model name      : Genuine Intel(R) CPU                  @ 2.66GHz
> 
> which is a Core 2 based Xeon. 

OK. Shouldn't the model name at least reflect one being a P4 Xeon and the
other a Core 2 Xeon? ;->
 
>  The tuning required by the P4 is quite 
> different than the Core 2, and it generally performs more poorly due to 
> the length of the pipeline and the expense of pipeline flushes.

I would be very interested to see some numbers on a proper Core 2 based
Xeon (unfortunately I can't afford one). You have the hardware - what say
you? ;->
AFAICT, the main reason the Opteron gets the numbers it does is the
integrated on-chip memory controller. I don't see Intel closing that gap
any time soon (for economic reasons more than technical ones).
In any case, I see batching as being a lot more cache-friendly too.

cheers,
jamal 



* Re: [WIP][PATCHES] Network xmit batching
  2007-06-22  9:59         ` Evgeniy Polyakov
@ 2007-06-25 17:35           ` Rick Jones
  0 siblings, 0 replies; 57+ messages in thread
From: Rick Jones @ 2007-06-25 17:35 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: jamal, Krishna Kumar2, Gagan Arneja, netdev, Sridhar Samudrala,
	David Miller, Robert Olsson

Evgeniy Polyakov wrote:
> On Thu, Jun 21, 2007 at 02:00:07PM -0700, Rick Jones (rick.jones2@hp.com) wrote:
> 
>>>Simple test included test -> desktop and vice versa traffic with 128 and
>>>4096 block size in netperf-2.4.3 setup.
>>
>>Is that in conjunction with setting the test-specific -D to set
>>TCP_NODELAY, or was Nagle left on?  If the latter, perhaps timing issues
>>could be why the confidence intervals weren't hit, since the relative
>>batching of 128-byte sends into larger segments is something of a race.
> 
> 
> I used these parameters:
> netperf -l 60 -H kano -t TCP_STREAM -i 10,2 -I 99,5 -- -m 128 -s 128K
> -S 128K

You can take -i up to 30 for the max count if you want to try to hit the 
levels.

> 
> so without nodelay.
> 
> With nodelay I've gotten:
> batch-128: 128.91 mbit/sec
> mainline-128: 140.57 mbit/sec
> 
> which is about 5 times less than withouth nodelay (~760 mbit/s)
> Although nodelay results look more realistic.

all that fun send batching that happens without nodelay :)

rick jones

> 
> 
> 
>>rick jones
> 
> 



* [WIP][PATCHES] Network xmit batching - tg3 support
  2007-06-19 17:48               ` jamal
  2007-06-19 17:55                 ` Evgeniy Polyakov
@ 2007-06-28  0:05                 ` jamal
  2007-07-02 21:20                   ` Matt Carlson
  1 sibling, 1 reply; 57+ messages in thread
From: jamal @ 2007-06-28  0:05 UTC (permalink / raw)
  To: Robert Olsson
  Cc: Evgeniy Polyakov, Krishna Kumar2, Gagan Arneja, netdev,
	Rick Jones, Sridhar Samudrala, David Miller, Jeff Garzik,
	Michael Chan

[-- Attachment #1: Type: text/plain, Size: 812 bytes --]

peoplez,

I have added batching support to tg3. I see an equivalent performance
improvement for pktgen as I did with e1000 when using GigE.
I have only tested on two machines (one being a laptop which does
10/100Mbps). Unfortunately, in both cases these are considered to be in
the class of "buggy" tg3s (which take a longer code path).

To the tg3 folks - can you double-check whether I am off on something?
I have split out a few things that you may like as well.
I haven't updated the tree - it is still circa 2.6.22-rc4 based; at some
point I will sync with Dave's net-2.6.

Anyone who has tg3-based hardware: I would appreciate any testing and
results ...

The git tree is at:
git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git
but i have attached the patch in case you just wanna stare.

cheers,
jamal

[-- Attachment #2: tg3-batchp --]
[-- Type: text/plain, Size: 16930 bytes --]

commit 91859b60521653a2f72ac70dfe9bfada4fdb28cb
Author: Jamal Hadi Salim <hadi@cyberus.ca>
Date:   Wed Jun 27 19:50:35 2007 -0400

    [NET_BATCH] Add tg3 batch support
    Make tg3 use the batch API.
    
    I have tested on my old laptop and another server-class
    machine; both seem to work - unfortunately they are
    both considered old-class tg3.
    
    I am sure there are improvements to be made, but this is
    a good functional start.
    
    Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

diff --git a/drivers/net/tg3.c b/drivers/net/tg3.c
index 2f31841..be03cbd 100644
--- a/drivers/net/tg3.c
+++ b/drivers/net/tg3.c
@@ -581,6 +581,7 @@ static inline void tg3_netif_stop(struct tg3 *tp)
 static inline void tg3_netif_start(struct tg3 *tp)
 {
 	netif_wake_queue(tp->dev);
+	tp->dev->xmit_win = TG3_TX_RING_SIZE >> 2;
 	/* NOTE: unconditional netif_wake_queue is only appropriate
 	 * so long as all callers are assured to have free tx slots
 	 * (such as after tg3_init_hw)
@@ -3066,6 +3067,7 @@ static inline u32 tg3_tx_avail(struct tg3 *tp)
  */
 static void tg3_tx(struct tg3 *tp)
 {
+	int dcount;
 	u32 hw_idx = tp->hw_status->idx[0].tx_consumer;
 	u32 sw_idx = tp->tx_cons;
 
@@ -3118,12 +3120,16 @@ static void tg3_tx(struct tg3 *tp)
 	 */
 	smp_mb();
 
+	dcount = tg3_tx_avail(tp);
 	if (unlikely(netif_queue_stopped(tp->dev) &&
-		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))) {
+		     (dcount > TG3_TX_WAKEUP_THRESH(tp)))) {
 		netif_tx_lock(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (netif_queue_stopped(tp->dev) &&
-		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
+		    (dcount > TG3_TX_WAKEUP_THRESH(tp))) {
 			netif_wake_queue(tp->dev);
+			tp->dev->xmit_win = dcount;
+		}
 		netif_tx_unlock(tp->dev);
 	}
 }
@@ -3877,47 +3883,56 @@ static void tg3_set_txd(struct tg3 *tp, int entry,
 	txd->vlan_tag = vlan_tag << TXD_VLAN_TAG_SHIFT;
 }
 
-/* hard_start_xmit for devices that don't have any bugs and
- * support TG3_FLG2_HW_TSO_2 only.
- */
-static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
-{
-	struct tg3 *tp = netdev_priv(dev);
-	dma_addr_t mapping;
-	u32 len, entry, base_flags, mss;
-
-	len = skb_headlen(skb);
+struct tg3_tx_cbdata {
+	u32 base_flags;
+	int count;
+	unsigned int max_per_txd;
+	unsigned int nr_frags;
+	unsigned int mss;
+};
+#define TG3_SKB_CB(__skb)       ((struct tg3_tx_cbdata *)&((__skb)->cb[0]))
+#define NETDEV_TX_DROPPED       -5
 
-	/* We are running in BH disabled context with netif_tx_lock
-	 * and TX reclaim runs via tp->poll inside of a software
-	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
-	 * no IRQ context deadlocks to worry about either.  Rejoice!
-	 */
-	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
-		if (!netif_queue_stopped(dev)) {
-			netif_stop_queue(dev);
+static int tg3_prep_bug_frame(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
 
-			/* This is a hard error, log it. */
-			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
-			       "queue awake!\n", dev->name);
+	cb->base_flags = 0;
+	cb->mss = skb_shinfo(skb)->gso_size;
+	if (cb->mss != 0) {
+		if (skb_header_cloned(skb) &&
+		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
+			dev_kfree_skb(skb);
+			return NETDEV_TX_DROPPED;
 		}
-		return NETDEV_TX_BUSY;
+
+		cb->base_flags |= (TXD_FLAG_CPU_PRE_DMA |
+			       TXD_FLAG_CPU_POST_DMA);
 	}
 
-	entry = tp->tx_prod;
-	base_flags = 0;
-	mss = 0;
-	if ((mss = skb_shinfo(skb)->gso_size) != 0) {
+	if (skb->ip_summed == CHECKSUM_PARTIAL)
+		cb->base_flags |= TXD_FLAG_TCPUDP_CSUM;
+
+	return NETDEV_TX_OK;
+}
+
+static int tg3_prep_frame(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
+
+	cb->base_flags = 0;
+	cb->mss = skb_shinfo(skb)->gso_size;
+	if (cb->mss != 0) {
 		int tcp_opt_len, ip_tcp_len;
 
 		if (skb_header_cloned(skb) &&
 		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
 			dev_kfree_skb(skb);
-			goto out_unlock;
+			return NETDEV_TX_DROPPED;
 		}
 
 		if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6)
-			mss |= (skb_headlen(skb) - ETH_HLEN) << 9;
+			cb->mss |= (skb_headlen(skb) - ETH_HLEN) << 9;
 		else {
 			struct iphdr *iph = ip_hdr(skb);
 
@@ -3925,32 +3940,68 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
 
 			iph->check = 0;
-			iph->tot_len = htons(mss + ip_tcp_len + tcp_opt_len);
-			mss |= (ip_tcp_len + tcp_opt_len) << 9;
+			iph->tot_len = htons(cb->mss + ip_tcp_len
+					     + tcp_opt_len);
+			cb->mss |= (ip_tcp_len + tcp_opt_len) << 9;
 		}
 
-		base_flags |= (TXD_FLAG_CPU_PRE_DMA |
+		cb->base_flags |= (TXD_FLAG_CPU_PRE_DMA |
 			       TXD_FLAG_CPU_POST_DMA);
 
 		tcp_hdr(skb)->check = 0;
 
 	}
 	else if (skb->ip_summed == CHECKSUM_PARTIAL)
-		base_flags |= TXD_FLAG_TCPUDP_CSUM;
+		cb->base_flags |= TXD_FLAG_TCPUDP_CSUM;
+
+	return NETDEV_TX_OK;
+}
+
+void tg3_kick_DMA(struct tg3 *tp)
+{
+	u32 entry = tp->tx_prod;
+	u32 count = tg3_tx_avail(tp);
+	/* Packets are ready, update Tx producer idx local and on card. */
+	tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW), entry);
+
+	if (unlikely(count <= (MAX_SKB_FRAGS + 1))) {
+		netif_stop_queue(tp->dev);
+		tp->dev->xmit_win = 1;
+		if (count > TG3_TX_WAKEUP_THRESH(tp)) {
+			netif_wake_queue(tp->dev);
+			tp->dev->xmit_win = count;
+		}
+	}
+
+	mmiowb();
+	tp->dev->trans_start = jiffies;
+}
+
+
+static int tg3_enqueue(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+	dma_addr_t mapping;
+	u32 len, entry;
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
+
+
 #if TG3_VLAN_TAG_USED
 	if (tp->vlgrp != NULL && vlan_tx_tag_present(skb))
-		base_flags |= (TXD_FLAG_VLAN |
+		cb->base_flags |= (TXD_FLAG_VLAN |
 			       (vlan_tx_tag_get(skb) << 16));
 #endif
 
+	entry = tp->tx_prod;
+	len = skb_headlen(skb);
 	/* Queue skb data, a.k.a. the main skb fragment. */
 	mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
 
 	tp->tx_buffers[entry].skb = skb;
 	pci_unmap_addr_set(&tp->tx_buffers[entry], mapping, mapping);
 
-	tg3_set_txd(tp, entry, mapping, len, base_flags,
-		    (skb_shinfo(skb)->nr_frags == 0) | (mss << 1));
+	tg3_set_txd(tp, entry, mapping, len, cb->base_flags,
+		    (skb_shinfo(skb)->nr_frags == 0) | (cb->mss << 1));
 
 	entry = NEXT_TX(entry);
 
@@ -3972,30 +4023,79 @@ static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			pci_unmap_addr_set(&tp->tx_buffers[entry], mapping, mapping);
 
 			tg3_set_txd(tp, entry, mapping, len,
-				    base_flags, (i == last) | (mss << 1));
+				    cb->base_flags,
+				    (i == last) | (cb->mss << 1));
 
 			entry = NEXT_TX(entry);
 		}
 	}
+  
+	tp->tx_prod = entry;
+	return NETDEV_TX_OK;
+}
 
-	/* Packets are ready, update Tx producer idx local and on card. */
-	tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW), entry);
+/* hard_start_xmit for devices that don't have any bugs and
+ * support TG3_FLG2_HW_TSO_2 only.
+ */
+static int tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+	int ret = tg3_prep_frame(skb, dev);
 
-	tp->tx_prod = entry;
-	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
-		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))
-			netif_wake_queue(tp->dev);
-	}
+	/* XXX: original code did mmiowb(); on failure,
+	* I dont think thats necessary
+	*/
+	if (unlikely(ret != NETDEV_TX_OK))
+	       return NETDEV_TX_OK;
+
+	/* We are running in BH disabled context with netif_tx_lock
+	 * and TX reclaim runs via tp->poll inside of a software
+	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
+	 * no IRQ context deadlocks to worry about either.  Rejoice!
+	 */
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
 
-out_unlock:
-    	mmiowb();
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
+	}
 
-	dev->trans_start = jiffies;
+	ret = tg3_enqueue(skb, dev);
+	if (ret == NETDEV_TX_OK)
+		tg3_kick_DMA(tp);
 
-	return NETDEV_TX_OK;
+	return ret;
 }
 
+static int tg3_start_bxmit(struct net_device *dev)
+{
+	struct sk_buff *skb = NULL;
+	int didq = 0, ret = NETDEV_TX_OK;
+	struct tg3 *tp = netdev_priv(dev);
+
+	while ((skb = __skb_dequeue(dev->blist)) != NULL) {
+		if (unlikely(tg3_tx_avail(tp) <=
+			    (skb_shinfo(skb)->nr_frags + 1))) {
+			netif_stop_queue(dev);
+			__skb_queue_head(dev->blist, skb);
+			ret = NETDEV_TX_OK;
+			break;
+		}
+
+		ret = tg3_enqueue(skb, dev);
+		if (ret == NETDEV_TX_OK)
+			didq++;
+	}
+
+	if (didq)
+		tg3_kick_DMA(tp);
+
+	return ret;
+}
 static int tg3_start_xmit_dma_bug(struct sk_buff *, struct net_device *);
 
 /* Use GSO to workaround a rare TSO bug that may be triggered when the
@@ -4008,9 +4108,11 @@ static int tg3_tso_bug(struct tg3 *tp, struct sk_buff *skb)
 	/* Estimate the number of fragments in the worst case */
 	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))) {
 		netif_stop_queue(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (tg3_tx_avail(tp) <= (skb_shinfo(skb)->gso_segs * 3))
 			return NETDEV_TX_BUSY;
 
+		tp->dev->xmit_win = tg3_tx_avail(tp);
 		netif_wake_queue(tp->dev);
 	}
 
@@ -4034,46 +4136,19 @@ tg3_tso_bug_end:
 /* hard_start_xmit for devices that have the 4G bug and/or 40-bit bug and
  * support TG3_FLG2_HW_TSO_1 or firmware TSO only.
  */
-static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+static int tg3_enqueue_buggy(struct sk_buff *skb, struct net_device *dev)
 {
 	struct tg3 *tp = netdev_priv(dev);
 	dma_addr_t mapping;
-	u32 len, entry, base_flags, mss;
+	u32 len, entry;
 	int would_hit_hwbug;
+	struct tg3_tx_cbdata *cb = TG3_SKB_CB(skb);
 
-	len = skb_headlen(skb);
 
-	/* We are running in BH disabled context with netif_tx_lock
-	 * and TX reclaim runs via tp->poll inside of a software
-	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
-	 * no IRQ context deadlocks to worry about either.  Rejoice!
-	 */
-	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
-		if (!netif_queue_stopped(dev)) {
-			netif_stop_queue(dev);
-
-			/* This is a hard error, log it. */
-			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
-			       "queue awake!\n", dev->name);
-		}
-		return NETDEV_TX_BUSY;
-	}
-
-	entry = tp->tx_prod;
-	base_flags = 0;
-	if (skb->ip_summed == CHECKSUM_PARTIAL)
-		base_flags |= TXD_FLAG_TCPUDP_CSUM;
-	mss = 0;
-	if ((mss = skb_shinfo(skb)->gso_size) != 0) {
+	if (cb->mss != 0) {
 		struct iphdr *iph;
 		int tcp_opt_len, ip_tcp_len, hdr_len;
 
-		if (skb_header_cloned(skb) &&
-		    pskb_expand_head(skb, 0, 0, GFP_ATOMIC)) {
-			dev_kfree_skb(skb);
-			goto out_unlock;
-		}
-
 		tcp_opt_len = tcp_optlen(skb);
 		ip_tcp_len = ip_hdrlen(skb) + sizeof(struct tcphdr);
 
@@ -4082,15 +4157,13 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 			     (tp->tg3_flags2 & TG3_FLG2_TSO_BUG))
 			return (tg3_tso_bug(tp, skb));
 
-		base_flags |= (TXD_FLAG_CPU_PRE_DMA |
-			       TXD_FLAG_CPU_POST_DMA);
 
 		iph = ip_hdr(skb);
 		iph->check = 0;
-		iph->tot_len = htons(mss + hdr_len);
+		iph->tot_len = htons(cb->mss + hdr_len);
 		if (tp->tg3_flags2 & TG3_FLG2_HW_TSO) {
 			tcp_hdr(skb)->check = 0;
-			base_flags &= ~TXD_FLAG_TCPUDP_CSUM;
+			cb->base_flags &= ~TXD_FLAG_TCPUDP_CSUM;
 		} else
 			tcp_hdr(skb)->check = ~csum_tcpudp_magic(iph->saddr,
 								 iph->daddr, 0,
@@ -4103,22 +4176,24 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 				int tsflags;
 
 				tsflags = (iph->ihl - 5) + (tcp_opt_len >> 2);
-				mss |= (tsflags << 11);
+				cb->mss |= (tsflags << 11);
 			}
 		} else {
 			if (tcp_opt_len || iph->ihl > 5) {
 				int tsflags;
 
 				tsflags = (iph->ihl - 5) + (tcp_opt_len >> 2);
-				base_flags |= tsflags << 12;
+				cb->base_flags |= tsflags << 12;
 			}
 		}
 	}
 #if TG3_VLAN_TAG_USED
 	if (tp->vlgrp != NULL && vlan_tx_tag_present(skb))
-		base_flags |= (TXD_FLAG_VLAN |
+		cb->base_flags |= (TXD_FLAG_VLAN |
 			       (vlan_tx_tag_get(skb) << 16));
 #endif
+	len = skb_headlen(skb);
+	entry = tp->tx_prod;
 
 	/* Queue skb data, a.k.a. the main skb fragment. */
 	mapping = pci_map_single(tp->pdev, skb->data, len, PCI_DMA_TODEVICE);
@@ -4131,8 +4206,8 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 	if (tg3_4g_overflow_test(mapping, len))
 		would_hit_hwbug = 1;
 
-	tg3_set_txd(tp, entry, mapping, len, base_flags,
-		    (skb_shinfo(skb)->nr_frags == 0) | (mss << 1));
+	tg3_set_txd(tp, entry, mapping, len, cb->base_flags,
+		    (skb_shinfo(skb)->nr_frags == 0) | (cb->mss << 1));
 
 	entry = NEXT_TX(entry);
 
@@ -4161,10 +4236,11 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 
 			if (tp->tg3_flags2 & TG3_FLG2_HW_TSO)
 				tg3_set_txd(tp, entry, mapping, len,
-					    base_flags, (i == last)|(mss << 1));
+					    cb->base_flags,
+					    (i == last)|(cb->mss << 1));
 			else
 				tg3_set_txd(tp, entry, mapping, len,
-					    base_flags, (i == last));
+					    cb->base_flags, (i == last));
 
 			entry = NEXT_TX(entry);
 		}
@@ -4181,28 +4257,78 @@ static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
 		 * failure, silently drop this packet.
 		 */
 		if (tigon3_dma_hwbug_workaround(tp, skb, last_plus_one,
-						&start, base_flags, mss))
-			goto out_unlock;
+						&start, cb->base_flags,
+						cb->mss)) {
+			mmiowb();
+			return NETDEV_TX_OK;
+		}
 
 		entry = start;
 	}
 
-	/* Packets are ready, update Tx producer idx local and on card. */
-	tw32_tx_mbox((MAILBOX_SNDHOST_PROD_IDX_0 + TG3_64BIT_REG_LOW), entry);
-
 	tp->tx_prod = entry;
-	if (unlikely(tg3_tx_avail(tp) <= (MAX_SKB_FRAGS + 1))) {
-		netif_stop_queue(dev);
-		if (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp))
-			netif_wake_queue(tp->dev);
+	return NETDEV_TX_OK;
+}
+
+static int tg3_start_xmit_dma_bug(struct sk_buff *skb, struct net_device *dev)
+{
+	struct tg3 *tp = netdev_priv(dev);
+	int ret = tg3_prep_bug_frame(skb, dev);
+
+	if (unlikely(ret != NETDEV_TX_OK))
+	       return NETDEV_TX_OK;
+
+	/* We are running in BH disabled context with netif_tx_lock
+	 * and TX reclaim runs via tp->poll inside of a software
+	 * interrupt.  Furthermore, IRQ processing runs lockless so we have
+	 * no IRQ context deadlocks to worry about either.  Rejoice!
+	 */
+	if (unlikely(tg3_tx_avail(tp) <= (skb_shinfo(skb)->nr_frags + 1))) {
+		if (!netif_queue_stopped(dev)) {
+			netif_stop_queue(dev);
+
+			/* This is a hard error, log it. */
+			printk(KERN_ERR PFX "%s: BUG! Tx Ring full when "
+			       "queue awake!\n", dev->name);
+		}
+		return NETDEV_TX_BUSY;
 	}
 
-out_unlock:
-    	mmiowb();
+	ret = tg3_enqueue_buggy(skb, dev);
+	if (ret == NETDEV_TX_OK)
+		tg3_kick_DMA(tp);
 
-	dev->trans_start = jiffies;
+	return ret;
+}
 
-	return NETDEV_TX_OK;
+static int tg3_start_bxmit_buggy(struct net_device *dev)
+{
+	int ret = NETDEV_TX_OK, didq = 0;
+	struct sk_buff *skb = NULL;
+	struct tg3 *tp = netdev_priv(dev);
+
+	while ((skb = __skb_dequeue(dev->blist)) != NULL) {
+		/*XXX: inline this and optimize this check
+		 *eventually to not keep checking unless
+		 *necessary
+		 **/
+		if (unlikely(tg3_tx_avail(tp) <=
+			    (skb_shinfo(skb)->nr_frags + 1))) {
+			netif_stop_queue(dev);
+			__skb_queue_head(dev->blist, skb);
+			ret = NETDEV_TX_OK;
+			break;
+		}
+
+		ret = tg3_enqueue_buggy(skb, dev);
+		if (ret == NETDEV_TX_OK)
+			didq++;
+	}
+
+	if (didq)
+		tg3_kick_DMA(tp);
+
+	return ret;
 }
 
 static inline void tg3_set_mtu(struct net_device *dev, struct tg3 *tp,
@@ -10978,10 +11104,15 @@ static int __devinit tg3_get_invariants(struct tg3 *tp)
 	 */
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5755 ||
 	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5787 ||
-	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906)
+	    GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5906) {
 		tp->dev->hard_start_xmit = tg3_start_xmit;
-	else
+		tp->dev->hard_prep_xmit = tg3_prep_frame;
+		tp->dev->hard_batch_xmit = tg3_start_bxmit;
+	} else {
 		tp->dev->hard_start_xmit = tg3_start_xmit_dma_bug;
+		tp->dev->hard_prep_xmit = tg3_prep_bug_frame;
+		tp->dev->hard_batch_xmit = tg3_start_bxmit_buggy;
+	}
 
 	tp->rx_offset = 2;
 	if (GET_ASIC_REV(tp->pci_chip_rev_id) == ASIC_REV_5701 &&
@@ -11831,6 +11962,9 @@ static int __devinit tg3_init_one(struct pci_dev *pdev,
 	dev->watchdog_timeo = TG3_TX_TIMEOUT;
 	dev->change_mtu = tg3_change_mtu;
 	dev->irq = pdev->irq;
+	dev->features |= NETIF_F_BTX;
+	dev->xmit_win = TG3_TX_RING_SIZE >> 2;
+
 #ifdef CONFIG_NET_POLL_CONTROLLER
 	dev->poll_controller = tg3_poll_controller;
 #endif

^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-06-28  0:05                 ` [WIP][PATCHES] Network xmit batching - tg3 support jamal
@ 2007-07-02 21:20                   ` Matt Carlson
  2007-07-03  0:21                     ` Michael Chan
  2007-07-03 13:09                     ` jamal
  0 siblings, 2 replies; 57+ messages in thread
From: Matt Carlson @ 2007-07-02 21:20 UTC (permalink / raw)
  To: hadi
  Cc: Robert Olsson, Evgeniy Polyakov, Krishna Kumar2, Gagan Arneja,
	netdev, Rick Jones, Sridhar Samudrala, David Miller, Jeff Garzik,
	Michael Chan

On Wed, 2007-06-27 at 20:05 -0400, jamal wrote:
> peoplez,
> 
> I have added support for tg3 on batching. I see equivalent performance
> improvement for pktgen as i did with e1000 when using gige. 
> I have only tested on two machines (one being a laptop which does
> 10/100Mbps). Unfortunately in both cases these are considered to be in
> the class of  "buggy" tg3s (which take a longer code path).
> 
> To the tg3 folks - can you double check if am off on something?
> I have split a few things that you may like as well.
> I havent upgraded the tree - it is still circa 2.6.22-rc4 based; at some
> point i will sync with Daves net-26
> 
> Anyone who has tg3 based hardware: I would appreciate any testing and
> results ...
> 
> The git tree is at:
> git://git.kernel.org/pub/scm/linux/kernel/git/hadi/batch-lin26.git
> but i have attached the patch in case you just wanna stare.
> 
> cheers,
> jamal

Hi Jamal.  I'll be testing your patch soon, but I wanted to point out a
bug in the patch.  The patch defines TG3_SKB_CB() as follows :

#define TG3_SKB_CB(__skb) ((struct tg3_tx_cbdata *)&((__skb)->cb[0]))

This definition will collide with the VLAN macros if TG3_VLAN_TAG_USED
is set.  vlan_tx_tag_get() is defined as :

#define vlan_tx_tag_get(__skb)  (VLAN_TX_SKB_CB(__skb)->vlan_tag)

VLAN_TX_SKB_CB is defined as :

#define VLAN_TX_SKB_CB(__skb) \
        ((struct vlan_skb_tx_cookie *)&((__skb)->cb[0]))

Also, I think the count, max_per_txd, and nr_frags fields of the
tg3_tx_cbdata struct are not needed.



* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-07-02 21:20                   ` Matt Carlson
@ 2007-07-03  0:21                     ` Michael Chan
  2007-07-03 13:26                       ` jamal
  2007-07-03 13:09                     ` jamal
  1 sibling, 1 reply; 57+ messages in thread
From: Michael Chan @ 2007-07-03  0:21 UTC (permalink / raw)
  To: Matt Carlson
  Cc: hadi, Robert Olsson, Evgeniy Polyakov, Krishna Kumar2,
	Gagan Arneja, netdev, Rick Jones, Sridhar Samudrala, David Miller,
	Jeff Garzik

On Mon, 2007-07-02 at 14:20 -0700, Matt Carlson wrote:

> 
> Also, I think the count, max_per_txd, and nr_frags fields of the
> tg3_tx_cbdata struct are not needed.

The count field is not needed also.

+struct tg3_tx_cbdata {
+	u32 base_flags;
+	int count;
+	unsigned int max_per_txd;
+	unsigned int nr_frags;
+	unsigned int mss;
+};
+#define TG3_SKB_CB(__skb)       ((struct tg3_tx_cbdata *)&((__skb)->cb[0]))

Only base_flags and mss are needed and these can be determined right
before sending the frame.  So is it better not to store these in the
skb->cb at all?

@@ -3118,12 +3120,16 @@ static void tg3_tx(struct tg3 *tp)
 	 */
 	smp_mb();
 
+	dcount = tg3_tx_avail(tp);
 	if (unlikely(netif_queue_stopped(tp->dev) &&
-		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))) {
+		     (dcount > TG3_TX_WAKEUP_THRESH(tp)))) {
 		netif_tx_lock(tp->dev);
+		tp->dev->xmit_win = 1;
 		if (netif_queue_stopped(tp->dev) &&
-		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
+		    (dcount > TG3_TX_WAKEUP_THRESH(tp))) {
 			netif_wake_queue(tp->dev);
+			tp->dev->xmit_win = dcount;
+		}
 		netif_tx_unlock(tp->dev);
 	}
 }

This is also not right.  tg3_tx() runs without netif_tx_lock().
tg3_tx_avail() can change after you get the netif_tx_lock() and we must
get the updated value again.  If we just rely on dcount, we can call
wake_queue() when the ring is full, or there may be no wakeup when the
ring is empty.





* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-07-02 21:20                   ` Matt Carlson
  2007-07-03  0:21                     ` Michael Chan
@ 2007-07-03 13:09                     ` jamal
  2007-07-03 19:31                       ` Matt Carlson
  2007-07-03 21:30                       ` David Miller
  1 sibling, 2 replies; 57+ messages in thread
From: jamal @ 2007-07-03 13:09 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Robert Olsson, Evgeniy Polyakov, Krishna Kumar2, Gagan Arneja,
	netdev, Rick Jones, Sridhar Samudrala, David Miller, Jeff Garzik,
	Michael Chan

On Mon, 2007-02-07 at 14:20 -0700, Matt Carlson wrote:


> 
> Hi Jamal.  I'll be testing your patch soon,

much thanks. Please let me know if you need help while doing this.
What tools are you planning to test with? I have tested this patch
with pktgen on a dual opteron/tg3-buggy. 
There is an outstanding issue with regard to the clocksource - however
the batch does well regardless of the clock source.

>  but I wanted to point out a
> bug in the patch.  The patch defines TG3_SKB_CB() as follows :
> 
> #define TG3_SKB_CB(__skb) ((struct tg3_tx_cbdata *)&((__skb)->cb[0]))
> 
> This definition will collide with the VLAN macros if TG3_VLAN_TAG_USED
> is set.  vlan_tx_tag_get() is defined as :
> 
> #define vlan_tx_tag_get(__skb)  (VLAN_TX_SKB_CB(__skb)->vlan_tag)
> 
> VLAN_TX_SKB_CB is defined as :
> 
> #define VLAN_TX_SKB_CB(__skb) \
>         ((struct vlan_skb_tx_cookie *)&((__skb)->cb[0]))
> 

yikes. Thanks for catching that - I thought i had this pretty much
covered after scanning the source. So that bug exists on the e1000 as
well.
[It sounds very dangerous to me the way skb->cb is being used by the
vlan code (i.e. it requires human intervention/knowledge to catch it as
an issue). I had no freaking idea the vlan code was using it. Maybe a
huge comment somewhere on how these cbs are being used by drivers would
help, or even a registration on startup to just make sure there is no
conflict at a layer (I have been meaning to do the latter for years
now). In any case this is a digression].

In the meantime, changing it to use byte 8 and above should do it? i.e.
#define TG3_SKB_CB(__skb) ((struct tg3_tx_cbdata *)&((__skb)->cb[8]))

There are 48 bytes in the skb->cb, so there should be plenty of room.

> Also, I think the count, max_per_txd, and nr_frags fields of the
> tg3_tx_cbdata struct are not needed.

Yes, you are right. That was a result of the LinuxWay(tm) (aka cutnpaste
from the e1000, which needs them). Can you send me a patch for that and
for TG3_SKB_CB?

cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-07-03  0:21                     ` Michael Chan
@ 2007-07-03 13:26                       ` jamal
  2007-07-04  4:19                         ` Krishna Kumar2
  0 siblings, 1 reply; 57+ messages in thread
From: jamal @ 2007-07-03 13:26 UTC (permalink / raw)
  To: Michael Chan
  Cc: Matt Carlson, Robert Olsson, Evgeniy Polyakov, Krishna Kumar2,
	Gagan Arneja, netdev, Rick Jones, Sridhar Samudrala, David Miller,
	Jeff Garzik

On Mon, 2007-02-07 at 17:21 -0700, Michael Chan wrote:

[Matt, please include the count in the fix per previous email]

> Only base_flags and mss are needed and these can be determined right
> before sending the frame.  So is it better not to store these in the
> skb->cb at all?

long answer:
My goal with storing these values and computing them was to do certain
things that don't require holding the netif_tx_lock within a batch as
well. Evaluating the packet metadata and formatting the packet to be
ready for stashing into DMA was one thing I could do outside of holding
the lock easily - and running that in a loop of 100 packets amortizes
the instruction cache and allows me (when I get to it) to hold the lock
for a lot less computation.
In the e1000, for example, I was able to go as far as computing how many
descriptors are needed per skb and stashing those in the skb->cb; the
way the tg3 is structured, doing all that would need some big changes.
So to answer your question: it would be nice if I could keep the skb->cb
for now, and we can get rid of it later if deemed a nuisance for
batching.

>  
> +	dcount = tg3_tx_avail(tp);
>  	if (unlikely(netif_queue_stopped(tp->dev) &&
> -		     (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))) {
> +		     (dcount > TG3_TX_WAKEUP_THRESH(tp)))) {
>  		netif_tx_lock(tp->dev);
> +		tp->dev->xmit_win = 1;
>  		if (netif_queue_stopped(tp->dev) &&
> -		    (tg3_tx_avail(tp) > TG3_TX_WAKEUP_THRESH(tp)))
> +		    (dcount > TG3_TX_WAKEUP_THRESH(tp))) {
>  			netif_wake_queue(tp->dev);
> +			tp->dev->xmit_win = dcount;
> +		}
>  		netif_tx_unlock(tp->dev);
>  	}
>  }
> 
> This is also not right.  tg3_tx() runs without netif_tx_lock().
> tg3_tx_avail() can change after you get the netif_tx_lock() and we must
> get the updated value again.  If we just rely on dcount, we can call
> wake_queue() when the ring is full, or there may be no wakeup when the
> ring is empty.

You are right; I was trying to be a smart-ass there so I didn't have to
fold the lines (for readability). Can you or Matt restore it to the way
it was originally, while keeping the xmit_win update, and just send me a
patch?

Thanks a lot Michael and Matt for looking at this and improving it.
Maybe i should switch all my nics to tg3 from now on ;->

cheers,
jamal



* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-07-03 13:09                     ` jamal
@ 2007-07-03 19:31                       ` Matt Carlson
  2007-07-04  1:59                         ` jamal
  2007-07-03 21:30                       ` David Miller
  1 sibling, 1 reply; 57+ messages in thread
From: Matt Carlson @ 2007-07-03 19:31 UTC (permalink / raw)
  To: hadi
  Cc: Robert Olsson, Evgeniy Polyakov, Krishna Kumar2, Gagan Arneja,
	netdev, Rick Jones, Sridhar Samudrala, David Miller, Jeff Garzik,
	Michael Chan

On Tue, 2007-07-03 at 09:09 -0400, jamal wrote:
> On Mon, 2007-02-07 at 14:20 -0700, Matt Carlson wrote:
> 
> 
> > 
> > Hi Jamal.  I'll be testing your patch soon,
> 
> much thanks. Please let me know if you need help while doing this.
> What tools are you planning to test with? I have tested this patch
> with pktgen on a dual opteron/tg3-buggy. 
> There is an outstanding issue in regards to clocksource - however the
> batch does well regardless of the clock source.

I had planned on using netperf, but pktgen sounds like a more controlled
environment.  Thanks for the tip.

> >  but I wanted to point out a
> > bug in the patch.  The patch defines TG3_SKB_CB() as follows :
> > 
> > #define TG3_SKB_CB(__skb) ((struct tg3_tx_cbdata *)&((__skb)->cb[0]))
> > 
> > This definition will collide with the VLAN macros if TG3_VLAN_TAG_USED
> > is set.  vlan_tx_tag_get() is defined as :
> > 
> > #define vlan_tx_tag_get(__skb)  (VLAN_TX_SKB_CB(__skb)->vlan_tag)
> > 
> > VLAN_TX_SKB_CB is defined as :
> > 
> > #define VLAN_TX_SKB_CB(__skb) \
> >         ((struct vlan_skb_tx_cookie *)&((__skb)->cb[0]))
> > 
> 
> yikes. Thanks for catching that - I thought i had this pretty much
> covered after scanning the source. So that bug exists on the e1000 as
> well.
> [It sounds very dangerous to me the way skb->cb is being used by the
> vlan code (i.e requires human intervention/knowledge to catch it as an
> issue). I had no freaking idea the vlan code was using it. Maybe a huge
> comment somewhere on how these cbs are being used by drivers would help
> or even a registration on startup to just make sure there is no conflict
> at a layer (i have been meaning to do the latter for years now). In any
> case this is a digression]. 
> 
> In the meantime, changing it to use byte 8 and above should do it? i.e.
> #define TG3_SKB_CB(__skb) ((struct tg3_tx_cbdata *)&((__skb)->cb[8]))
> 
> There are 48 bytes there on the skb-cb, so there should be plenty.

Do you see any reason why we couldn't just add the VLAN code to the prep
stage and simply overwrite that portion of the skb-cb?  Our driver would
just store its value in the base_flags member of tg3_tx_cbdata.

> > Also, I think the count, max_per_txd, and nr_frags fields of the
> > tg3_tx_cbdata struct are not needed.
> 
> Yes, you are right. That was a result of the LinuxWay(tm) (aka cutnpaste
> from the e1000 which needs them), Can you send me a patch for that and
> TG3_SKB_CB? 

Once we iron out the skb-cb issue, sure.

> 
> cheers,
> jamal
> 
> 



* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-07-03 13:09                     ` jamal
  2007-07-03 19:31                       ` Matt Carlson
@ 2007-07-03 21:30                       ` David Miller
  1 sibling, 0 replies; 57+ messages in thread
From: David Miller @ 2007-07-03 21:30 UTC (permalink / raw)
  To: hadi
  Cc: mcarlson, Robert.Olsson, johnpol, krkumar2, gaagaan, netdev,
	rick.jones2, sri, jeff, mchan

From: jamal <hadi@cyberus.ca>
Date: Tue, 03 Jul 2007 09:09:48 -0400

> [It sounds very dangerous to me the way skb->cb is being used by the
> vlan code (i.e requires human intervention/knowledge to catch it as an
> issue). I had no freaking idea the vlan code was using it. Maybe a huge
> comment somewhere on how these cbs are being used by drivers would help
> or even a registration on startup to just make sure there is no conflict
> at a layer (i have been meaning to do the later for years now). In any
> case this is is a digression]. 

This is how the VLAN layer passes in the tag to the transmit
function.

The CB is owned by the caller until you spam it with your local
data at your level, so if you need any information the previous
layer provides, you have to sample it before using the ->cb[].

So you can simply fix this by reading the VLAN tag, then using
the ->cb[] however you like.

It's always been like this, don't be surprised :)


* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-07-03 19:31                       ` Matt Carlson
@ 2007-07-04  1:59                         ` jamal
  0 siblings, 0 replies; 57+ messages in thread
From: jamal @ 2007-07-04  1:59 UTC (permalink / raw)
  To: Matt Carlson
  Cc: Robert Olsson, Evgeniy Polyakov, Krishna Kumar2, Gagan Arneja,
	netdev, Rick Jones, Sridhar Samudrala, David Miller, Jeff Garzik,
	Michael Chan

On Tue, 2007-03-07 at 12:31 -0700, Matt Carlson wrote:

> I had planned on using netperf, but pktgen sounds like a more controlled
> environment.  Thanks for the tip.

I can help more if you use pktgen - netperf is more involved. Also,
pktgen is much closer to the driver, so it would let you see clearly
whether any improvements show up.

> Do you see any reason why we couldn't just add the VLAN code to the prep
> stage and simply overwrite that portion of the skb-cb?  Our driver would
> just store its value in the base_flags member of tg3_tx_cbdata.

That would be a much better approach;-> Go for it.

cheers,
jamal




* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-07-03 13:26                       ` jamal
@ 2007-07-04  4:19                         ` Krishna Kumar2
  2007-07-04 13:22                           ` jamal
  0 siblings, 1 reply; 57+ messages in thread
From: Krishna Kumar2 @ 2007-07-04  4:19 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, Gagan Arneja, Jeff Garzik, Evgeniy Polyakov,
	Matt Carlson, Michael Chan, netdev, Rick Jones, Robert Olsson,
	Sridhar Samudrala

Hi Jamal,

J Hadi Salim <j.hadi123@gmail.com> wrote on 07/03/2007 06:56:20 PM:

> On Mon, 2007-02-07 at 17:21 -0700, Michael Chan wrote:
>
> [Matt, please include the count in the fix per previous email]
> Long answer:
> My goal with storing these values and computing them was to do certain
> things that don't require holding the netif_tx_lock within a batch as
> well. Evaluating the packet metadata and formatting the packet to be
> ready for stashing into DMA was one thing I could do outside of holding
> the lock easily - and running that in a loop of 100 packets amortizes
> the instruction cache cost and allows me (when I get to it) to hold the
> lock for a lot less computation.

Do you see any contention for tx_lock which can justify having a prep
handler? As I understand it, no other CPU can be in the xmit code at the
same time since the RUNNING bit is held. Hence getting this lock early or
late should not matter for the xmit side (and you are also holding
dev->queue_lock while running prep, so no new skbs can be added to
the dev during this time). And I couldn't find any driver that uses the
same tx_lock on rx, so where is the saving from doing prep?

- KK


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [WIP][PATCHES] Network xmit batching - tg3 support
  2007-07-04  4:19                         ` Krishna Kumar2
@ 2007-07-04 13:22                           ` jamal
  0 siblings, 0 replies; 57+ messages in thread
From: jamal @ 2007-07-04 13:22 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: David Miller, Gagan Arneja, Jeff Garzik, Evgeniy Polyakov,
	Matt Carlson, Michael Chan, netdev, Rick Jones, Robert Olsson,
	Sridhar Samudrala

On Wed, 2007-04-07 at 09:49 +0530, Krishna Kumar2 wrote:

> Do you see any contention for tx_lock which can justify having a prep
> handler? As I understand it, no other CPU can be in the xmit code at the
> same time since the RUNNING bit is held. Hence getting this lock early or
> late should not matter for the xmit side (and you are also holding
> dev->queue_lock while running prep, so no new skbs can be added to
> the dev during this time). And I couldn't find any driver that uses the
> same tx_lock on rx,

On any non-LLTX driver:
netif_tx_lock() is grabbed on both the tx and receive paths.

I have to admit, the e1000 is too clever for its own good:
there is no documentation/description (and Herbert never explained), but
it seems to be able to touch only certain parts of the ring on the tx
side and others on the rx side, so that a lock becomes unnecessary. So even
for non-LLTX, some lock needs to be held on rx just to be clean (and
there's a lot of grumbling against LLTX drivers out there).

> so where is the saving from doing prep?

You would agree, I think, that if there is contention between two CPUs for
the same lock, gut feeling says there is benefit to doing things outside the
lock when possible. This becomes more important with batching, when
amortizing the cost of a lock.
There's a little more to it than that, and I could write up a longer
description if needed.

Also note: not all drivers may be able to move things outside of the
lock - in fact the tun driver I converted is hard to do anything with,
since it doesn't have any hardware features that make it easy to move
things outside. For this reason this api is optional (and tun in fact
doesn't use it); and as I have said a few times now, I will be more than
happy to get rid of it if experiments show it is unnecessary. For now my
take is someone needs to disprove that it is valuable (preferably with
experiments).

cheers,
jamal


^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2007-07-04 13:22 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-06-06 13:49 [WIP][PATCHES] Network xmit batching jamal
2007-06-07  6:16 ` Krishna Kumar2
2007-06-07 11:43   ` jamal
2007-06-07 16:13     ` Evgeniy Polyakov
2007-06-07 22:23       ` jamal
2007-06-08  8:38         ` Evgeniy Polyakov
2007-06-08 11:31           ` jamal
2007-06-08 12:09             ` Evgeniy Polyakov
2007-06-08 13:07               ` jamal
2007-06-08 21:02                 ` Evgeniy Polyakov
2007-06-08  5:05       ` Krishna Kumar2
2007-06-19 13:21     ` Evgeniy Polyakov
2007-06-19 13:33       ` Evgeniy Polyakov
2007-06-19 14:00         ` Evgeniy Polyakov
2007-06-19 14:09           ` Evgeniy Polyakov
2007-06-19 16:32             ` jamal
2007-06-19 16:44               ` Evgeniy Polyakov
2007-06-19 16:28           ` jamal
2007-06-19 16:35             ` Evgeniy Polyakov
2007-06-19 16:45             ` Evgeniy Polyakov
2007-06-19 17:35             ` Robert Olsson
2007-06-19 17:48               ` jamal
2007-06-19 17:55                 ` Evgeniy Polyakov
2007-06-28  0:05                 ` [WIP][PATCHES] Network xmit batching - tg3 support jamal
2007-07-02 21:20                   ` Matt Carlson
2007-07-03  0:21                     ` Michael Chan
2007-07-03 13:26                       ` jamal
2007-07-04  4:19                         ` Krishna Kumar2
2007-07-04 13:22                           ` jamal
2007-07-03 13:09                     ` jamal
2007-07-03 19:31                       ` Matt Carlson
2007-07-04  1:59                         ` jamal
2007-07-03 21:30                       ` David Miller
2007-06-19 22:28               ` [WIP][PATCHES] Network xmit batching David Miller
2007-06-21 15:54                 ` FSCKED clock sources WAS(Re: " jamal
2007-06-21 16:08                   ` jamal
2007-06-21 16:55                     ` Benjamin LaHaise
2007-06-25 16:59                       ` jamal
2007-06-25 17:08                         ` Benjamin LaHaise
2007-06-25 17:16                           ` jamal
2007-06-21 16:45                   ` Evgeniy Polyakov
2007-06-25 16:58                     ` jamal
2007-06-19 16:24       ` jamal
2007-06-21 21:00       ` Rick Jones
2007-06-22  9:59         ` Evgeniy Polyakov
2007-06-25 17:35           ` Rick Jones
2007-06-07  8:42 ` Krishna Kumar2
2007-06-07 12:16   ` jamal
2007-06-08  5:06     ` Krishna Kumar2
2007-06-08 11:14       ` jamal
2007-06-08 11:31         ` Krishna Kumar2
2007-06-08 11:43           ` jamal
2007-06-08 18:00           ` Rick Jones
2007-06-08 17:27     ` Rick Jones
2007-06-09  0:17       ` jamal
2007-06-09  0:40         ` Rick Jones
2007-06-07 22:42 ` jamal
