* [RFC] New driver API to speed up small packets xmits
@ 2007-05-10 14:53 Krishna Kumar
  2007-05-10 15:08 ` Evgeniy Polyakov
                   ` (2 more replies)
  0 siblings, 3 replies; 96+ messages in thread
From: Krishna Kumar @ 2007-05-10 14:53 UTC (permalink / raw)
  To: netdev; +Cc: krkumar2, Krishna Kumar

Hi all,

While looking at common packet sizes on xmits, I found that most of the
packets are small. On my personal system, the statistics of transmitted
packets after about 6 hours of use (browsing, mail, ftp'ing two Linux
kernels from www.kernel.org) are:

-----------------------------------------------------------
	Packet Size	#packets (Total:60720)	Percentage
-----------------------------------------------------------
	32 		0 			0
	64 		7716 			12.70
	80 		40193 			66.19
sub-total:		47909			78.90 %

	96 		2007 			3.30
	128 		1917 			3.15
sub-total:		3924			6.46 %

	256 		1822 			3.00
	384 		863 			1.42
	512 		459 			.75
sub-total:		3144			5.18 %

	640 		763 			1.25
	768 		2329 			3.83
	1024 		1700 			2.79
	1500 		461 			.75
sub-total:		5253			8.65 %

	2048 		312 			.51
	4096 		53 			.08
	8192 		84 			.13
	16384 		41 			.06
	32768+ 		0 			0
sub-total:		490			0.81 %
-----------------------------------------------------------

Doing some measurements, I found that for small packets like 128 bytes,
the bandwidth is approximately 60% of the line speed. To possibly speed
up the performance of small packet xmits, a method of "linking" skbs was
thought of, where two pointers (skb_flink/skb_blink) are added to the
skb. It is felt (no data yet) that drivers will get better results when
more "linked" skbs are sent to them in one shot, rather than sending
each skb independently (where for each skb an extra call to the driver
is made and the driver also needs to get/drop its lock, etc.). The
method is to send as many packets as possible from the qdisc (e.g.,
multiple packets can accumulate if the driver is stopped or a trylock
failed) if the device supports the new API. The steps for enabling the
API in a driver are:

	- the driver sets NETIF_F_LINKED_SKB before calling register_netdev
	- register_netdev sets a new tx_link_skbs tunable parameter in
	  dev to 1, indicating that the driver supports linked skbs.
	- the driver implements the new API - hard_start_xmit_link - to
	  handle linked skbs, which is mostly a simple task. E.g., support
	  for the e1000 driver can be added without duplicating existing
	  code:

	e1000_xmit_frame_link()
	{
	top:
		next_skb = skb->skb_flink;	/* save the link before xmit */
		(original driver code here)
		skb = next_skb;
		if (skb)
			goto top;
		...
	}

	e1000_xmit_frame()
	{
		return e1000_xmit_frame_link(skb, NULL, dev);
	}

	Drivers can take other approaches, e.g., take the lock once at the
	top and handle all the packets in one shot, or get/drop locks for
	each skb; but those are internal to the driver. In any case, the
	driver changes needed to support this (optional) API are minimal.
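
For illustration only, the pieces named in these steps might amount to
roughly the following; the field placement, flag value and hook
signature are assumptions, not the actual patch:

	/* Sketch of the additions described above - not the actual patch.
	 * Names follow the RFC text; values and placement are guesses. */

	/* include/linux/skbuff.h, inside struct sk_buff: */
	struct sk_buff	*skb_flink;	/* next skb in the linked chain */
	struct sk_buff	*skb_blink;	/* previous skb in the linked chain */

	/* include/linux/netdevice.h: */
	#define NETIF_F_LINKED_SKB	0x100000	/* illustrative bit value */

	/* inside struct net_device: */
	int	tx_link_skbs;		/* 1 if linked xmits are enabled */
	int	(*hard_start_xmit_link)(struct sk_buff *skb,
					struct net_device *dev);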

The main change is in the core/sched code. The qdisc links packets if
the device supports it and multiple skbs are present, and calls
dev_hard_start_xmit, which calls one of the two APIs depending on
whether the passed skb is linked or not. A sys interface can set or
reset the tx_link_skbs parameter for the device to use the old or the
new driver API.
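
To make the core-side description concrete, a rough sketch follows; the
helper names and the linking loop are assumptions based on the text
above (in the RFC this logic would sit in qdisc_run/dev_hard_start_xmit),
not the actual patch:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

/* Sketch only: chain whatever has accumulated in the qdisc and hand the
 * whole chain to the driver in one call when it advertises linked-skb
 * support.  skb_flink/skb_blink, tx_link_skbs and hard_start_xmit_link
 * are the hypothetical additions described above. */
static int xmit_one_or_linked(struct sk_buff *skb, struct net_device *dev)
{
	if (dev->tx_link_skbs && skb->skb_flink)
		return dev->hard_start_xmit_link(skb, dev);
	return dev->hard_start_xmit(skb, dev);	/* 2007-era hook */
}

static void qdisc_run_linked(struct net_device *dev, struct Qdisc *q)
{
	struct sk_buff *head, *tail, *skb;

	head = tail = q->dequeue(q);
	if (!head)
		return;
	head->skb_blink = NULL;

	if (dev->tx_link_skbs) {
		/* Link any further queued skbs behind the first one. */
		while ((skb = q->dequeue(q)) != NULL) {
			tail->skb_flink = skb;
			skb->skb_blink = tail;
			tail = skb;
		}
	}
	tail->skb_flink = NULL;

	xmit_one_or_linked(head, dev);	/* requeue on failure not shown */
}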

The reason for implementing this was to speed up the IPoIB driver. But
before doing that, a proof of concept for the E1000/AMSO drivers was
considered (as most of the code is generic). I do not have test results
at this time but I am working on it.

Please let me know if this approach is acceptable, or if you have any
suggestions.

Thanks,

- KK

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 14:53 Krishna Kumar
@ 2007-05-10 15:08 ` Evgeniy Polyakov
  2007-05-10 15:22   ` Krishna Kumar2
  2007-05-10 20:21 ` Roland Dreier
  2007-05-11  9:05 ` Andi Kleen
  2 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-10 15:08 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: netdev

On Thu, May 10, 2007 at 08:23:51PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> The reason for implementing this was to speed up the IPoIB driver. But
> before doing that, a proof of concept for the E1000/AMSO drivers was
> considered (as most of the code is generic). I do not have test results
> at this time but I am working on it.
> 
> Please let me know if this approach is acceptable, or if you have any
> suggestions.

Doesn't it look exactly like the GSO/TSO/LRO stuff implemented already?

Btw, the main CPU-limiting factor here is syscall overhead (userspace
protocol processing with a 1500 MTU allows CPU usage to be reduced and
performance to be increased by a total of 10 times for sending/receiving
128-byte packets).

> Thanks,
> 
> - KK

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 15:08 ` Evgeniy Polyakov
@ 2007-05-10 15:22   ` Krishna Kumar2
  2007-05-10 15:48     ` Evgeniy Polyakov
  2007-05-10 17:19     ` Rick Jones
  0 siblings, 2 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-10 15:22 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: netdev

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/10/2007 08:38:33 PM:

> On Thu, May 10, 2007 at 08:23:51PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> > The reason for implementing this was to speed up the IPoIB driver. But
> > before doing that, a proof of concept for the E1000/AMSO drivers was
> > considered (as most of the code is generic). I do not have test results
> > at this time but I am working on it.
> >
> > Please let me know if this approach is acceptable, or if you have any
> > suggestions.
>
> Doesn't it look exactly like the GSO/TSO/LRO stuff implemented already?

It is the reverse - GSO will segment one super-packet just before calling
the driver so that the stack is traversed only once. In my case, I am
trying to send out multiple skbs, possibly small packets, in one shot.
GSO will not help for small packets.

> Btw, the main CPU-limiting factor here is syscall overhead (userspace
> protocol processing with a 1500 MTU allows CPU usage to be reduced and
> performance to be increased by a total of 10 times for sending/receiving
> 128-byte packets).

I will test this also. But I was curious to see if, without any changes to
applications, I can get better performance by linking packets and sending
them to the driver in one shot.

What is your opinion?

thanks,

- KK

>
> > Thanks,
> >
> > - KK
>
> --
>    Evgeniy Polyakov


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 15:22   ` Krishna Kumar2
@ 2007-05-10 15:48     ` Evgeniy Polyakov
  2007-05-10 16:08       ` jamal
  2007-05-10 17:19     ` Rick Jones
  1 sibling, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-10 15:48 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: netdev

On Thu, May 10, 2007 at 08:52:12PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > > The reason for implementing this was to speed up the IPoIB driver. But
> > > before doing that, a proof of concept for the E1000/AMSO drivers was
> > > considered (as most of the code is generic). I do not have test results
> > > at this time but I am working on it.
> > >
> > > Please let me know if this approach is acceptable, or if you have any
> > > suggestions.
> >
> > Doesn't it look exactly like the GSO/TSO/LRO stuff implemented already?
> 
> It is the reverse - GSO will segment one super-packet just before calling
> the driver so that the stack is traversed only once. In my case, I am
> trying to send out multiple skbs, possibly small packets, in one shot.
> GSO will not help for small packets.

And you can create that packet yourself, if it is not enough to combine
into an MSS-sized chunk.

> > Btw, the main CPU-limiting factor here is syscall overhead (userspace
> > protocol processing with a 1500 MTU allows CPU usage to be reduced and
> > performance to be increased by a total of 10 times for sending/receiving
> > 128-byte packets).
> 
> I will test this also. But I was curious to see if, without any changes to
> applications, I can get better performance by linking packets and sending
> them to the driver in one shot.

Without such changes it is unlikely to be visible.

> What is your opinion ?

IMHO, if you do not see anything related to the driver's xmit function
in a profile, it does not need to be fixed. But as a starting point, why
not just increase the skb allocation size (say, double it) in
tcp_sendmsg() and see the results? It can of course end up with worse
behaviour (if the MTU is small), but it is noticeably simpler to
implement and test first.

> thanks,
> 
> - KK
> 
> >
> > > Thanks,
> > >
> > > - KK
> >
> > --
> >    Evgeniy Polyakov

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 15:48     ` Evgeniy Polyakov
@ 2007-05-10 16:08       ` jamal
  0 siblings, 0 replies; 96+ messages in thread
From: jamal @ 2007-05-10 16:08 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Krishna Kumar2, netdev

On Thu, 2007-05-10 at 19:48 +0400, Evgeniy Polyakov wrote:

> IMHO, if you do not see anything related to the driver's xmit function
> in a profile, it does not need to be fixed.

True, but i think there may be value in amortizing the cost towards
the driver. 
i.e If you grab a lock and send X packets vs a single packet, depending
on the cost of the lock you may observe a reduction in CPU.

I have tried similar experiments in the past and observed zero
improvements in sane machines, but huge improvements on insane machines.
Slides on my work here (just jump to about slide 93):
http://vger.kernel.org/jamal_netconf2006.sxi

One thought i had at the time was perhaps the e1000 DMA setup/teardown
was very inefficient.

As the slides indicate i was trying to get throughput improvements but
failed to see any on sane machines.
A suggestion made to me was to look at CPU utilization as a metric.
Unfortunately i haven't had a chance yet.

cheers,
jamal


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 15:22   ` Krishna Kumar2
  2007-05-10 15:48     ` Evgeniy Polyakov
@ 2007-05-10 17:19     ` Rick Jones
  2007-05-10 18:07       ` Sridhar Samudrala
                         ` (2 more replies)
  1 sibling, 3 replies; 96+ messages in thread
From: Rick Jones @ 2007-05-10 17:19 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev

> It is the reverse - GSO will segment one super-packet just before calling
> the driver so that the stack is traversed only once. In my case, I am
> trying to send out multiple skbs, possibly small packets, in one shot.
> GSO will not help for small packets.

If there are small packets that implies small sends, which suggests that they 
would be coalesced either implicitly by the Nagle algorithm or explicitly with 
TCP_CORK no?
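
For concreteness, the explicit variant is just a socket option around a
burst of small writes; a minimal userspace illustration using the
standard sockets API, not part of the proposal:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Illustration of the point above: small application writes on a TCP
 * socket are normally coalesced by Nagle; corking makes it explicit,
 * so the data reaches the driver as fewer, larger packets. */
static void send_two_corked(int fd, const void *a, size_t alen,
			    const void *b, size_t blen)
{
	int on = 1, off = 0;

	setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
	send(fd, a, alen, 0);
	send(fd, b, blen, 0);
	/* Uncorking flushes the accumulated data as full segments. */
	setsockopt(fd, IPPROTO_TCP, TCP_CORK, &off, sizeof(off));
}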

rick jones

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 17:19     ` Rick Jones
@ 2007-05-10 18:07       ` Sridhar Samudrala
  2007-05-10 19:43         ` Gagan Arneja
  2007-05-10 18:13       ` Vlad Yasevich
  2007-05-10 21:27       ` David Stevens
  2 siblings, 1 reply; 96+ messages in thread
From: Sridhar Samudrala @ 2007-05-10 18:07 UTC (permalink / raw)
  To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev

On Thu, 2007-05-10 at 10:19 -0700, Rick Jones wrote:
> > It is the reverse - GSO will segment one super-packet just before calling
> > the driver so that the stack is traversed only once. In my case, I am
> > trying to send out multiple skbs, possibly small packets, in one shot.
> > GSO will not help for small packets.
> 
> If there are small packets that implies small sends, which suggests that they 
> would be coalesced either implicitly by the Nagle algorithm or explicitly with 
> TCP_CORK no?

small packets belonging to the same connection could be coalesced by
TCP, but this may help the case where multiple parallel connections are
sending small packets.

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 17:19     ` Rick Jones
  2007-05-10 18:07       ` Sridhar Samudrala
@ 2007-05-10 18:13       ` Vlad Yasevich
  2007-05-10 18:20         ` Rick Jones
  2007-05-10 21:27       ` David Stevens
  2 siblings, 1 reply; 96+ messages in thread
From: Vlad Yasevich @ 2007-05-10 18:13 UTC (permalink / raw)
  To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev

Rick Jones wrote:
>> It is the reverse - GSO will segment one super-packet just before calling
>> the driver so that the stack is traversed only once. In my case, I am
>> trying to send out multiple skbs, possibly small packets, in one shot.
>> GSO will not help for small packets.
> 
> If there are small packets that implies small sends, which suggests that
> they would be coalesced either implicitly by the Nagle algorithm or
> explicitly with TCP_CORK no?
> 
> rick jones
> -

May be for TCP?  What about other protocols?

-vlad

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 18:13       ` Vlad Yasevich
@ 2007-05-10 18:20         ` Rick Jones
  2007-05-10 18:32           ` Vlad Yasevich
  0 siblings, 1 reply; 96+ messages in thread
From: Rick Jones @ 2007-05-10 18:20 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev

Vlad Yasevich wrote:
> Rick Jones wrote:
> 
>>>It is the reverse - GSO will segment one super-packet just before calling
>>>the driver so that the stack is traversed only once. In my case, I am
>>>trying to send out multiple skbs, possibly small packets, in one shot.
>>>GSO will not help for small packets.
>>
>>If there are small packets that implies small sends, which suggests that
>>they would be coalesced either implicitly by the Nagle algorithm or
>>explicitly with TCP_CORK no?
>>
>>rick jones
>>-
> 
> 
> May be for TCP?  What about other protocols?

There are other protocols?-)  True, UDP, and I suppose certain modes of SCTP 
might be sending streams of small packets, as might TCP with TCP_NODELAY set.

Do they often queue-up outside the driver?

rick jones

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 18:20         ` Rick Jones
@ 2007-05-10 18:32           ` Vlad Yasevich
  2007-05-10 18:40             ` Rick Jones
  2007-05-10 18:59             ` Ian McDonald
  0 siblings, 2 replies; 96+ messages in thread
From: Vlad Yasevich @ 2007-05-10 18:32 UTC (permalink / raw)
  To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev

Rick Jones wrote:
> Vlad Yasevich wrote:
>> Rick Jones wrote:
>>
>>>> It is the reverse - GSO will segment one super-packet just before
>>>> calling
>>>> the driver so that the stack is traversed only once. In my case, I am
>>>> trying to send out multiple skbs, possibly small packets, in one shot.
>>>> GSO will not help for small packets.
>>>
>>> If there are small packets that implies small sends, which suggests that
>>> they would be coalesced either implicitly by the Nagle algorithm or
>>> explicitly with TCP_CORK no?
>>>
>>> rick jones
>>> -
>>
>>
>> May be for TCP?  What about other protocols?
> 
> There are other protocols?-)  True, UDP, and I suppose certain modes of
> SCTP might be sending streams of small packets, as might TCP with
> TCP_NODELAY set.
> 
> Do they often queue-up outside the driver?

Not sure if DCCP might fall into this category as well...

I think the idea of this patch is to gather some number of these small packets
and shove them at the driver in one go instead of each small packet at a time.

It might be helpful, but I reserve judgment till I see more numbers.

-vlad

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 18:32           ` Vlad Yasevich
@ 2007-05-10 18:40             ` Rick Jones
  2007-05-10 18:59             ` Ian McDonald
  1 sibling, 0 replies; 96+ messages in thread
From: Rick Jones @ 2007-05-10 18:40 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev

> 
> Not sure if DCCP might fall into this category as well...
> 
> I think the idea of this patch is gather some number of these small packets and
> shove them at the driver in one go instead of each small packet at a time.

This reminds me... (rick starts waxing rhapshodic about old HP-UX behviour :)

The HP-UX drivers have a "packet train" mode whereby IP will give them all the 
fragments of an IP datagram in one shot.  The idea was that the driver would 
take all of them, or none of them.  This was to deal with issues around filling 
a driver's transmit queue with some of the fragments and then dropping the 
others, thus transmitting IP datagrams which had no hope of being reassembled 
into anything other than frankengrams.

rick jones

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 18:32           ` Vlad Yasevich
  2007-05-10 18:40             ` Rick Jones
@ 2007-05-10 18:59             ` Ian McDonald
  2007-05-10 19:21               ` Vlad Yasevich
  2007-05-11  5:04               ` Krishna Kumar2
  1 sibling, 2 replies; 96+ messages in thread
From: Ian McDonald @ 2007-05-10 18:59 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev

On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote:
> >> May be for TCP?  What about other protocols?
> >
> > There are other protocols?-)  True, UDP, and I suppose certain modes of
> > SCTP might be sending streams of small packets, as might TCP with
> > TCP_NODELAY set.
> >
> > Do they often queue-up outside the driver?
>
> Not sure if DCCP might fall into this category as well...
>
Yes DCCP definitely can queue outside the driver.

> I think the idea of this patch is to gather some number of these small packets
> and shove them at the driver in one go instead of each small packet at a time.
>
> It might be helpful, but I reserve judgment till I see more numbers.
>
> -vlad

As I see it, this proposed patch is about reducing the number of "task
switches" between the driver and the protocol. I use task switch in
speech marks as it isn't really one as far as the kernel is concerned.
So in other words, we are hoping that spending more time in each area
will keep the cache hot and get more work done while locks are held.
This of course requires that the added complexity doesn't outweigh the
gains - otherwise you could end up in a worse scenario where the driver
doesn't send packets because the protocol is busy linking them.

As far as I can tell you're not combining packets?? This would
definitely break UDP/DCCP which are datagram based.

Will be interesting to see results but I'm a little sceptical.

-- 
Web: http://wand.net.nz/~iam4/
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 18:59             ` Ian McDonald
@ 2007-05-10 19:21               ` Vlad Yasevich
  2007-05-10 19:26                 ` Ian McDonald
  2007-05-10 20:32                 ` David Miller
  2007-05-11  5:04               ` Krishna Kumar2
  1 sibling, 2 replies; 96+ messages in thread
From: Vlad Yasevich @ 2007-05-10 19:21 UTC (permalink / raw)
  To: Ian McDonald; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev

Ian McDonald wrote:
> On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote:
>> >> May be for TCP?  What about other protocols?
>> >
>> > There are other protocols?-)  True, UDP, and I suppose certain modes of
>> > SCTP might be sending streams of small packets, as might TCP with
>> > TCP_NODELAY set.
>> >
>> > Do they often queue-up outside the driver?
>>
>> Not sure if DCCP might fall into this category as well...
>>
> Yes DCCP definitely can queue outside the driver.
> 
>> I think the idea of this patch is to gather some number of these small
>> packets and shove them at the driver in one go instead of each small
>> packet at a time.
>>
>> It might be helpful, but I reserve judgment till I see more numbers.
>>
>> -vlad
> 
> As I see this proposed patch it is about reducing the number of "task
> switches" between the driver and the protocol. I use task switch in
> speech marks as it isn't really as is in the kernel. So in other words
> we are hoping that spending more time in each area would keep the
> cache hot and work to be done if locks held. This of course requires
> that the complexity added doesn't outweigh the gains - otherwise you
> could end up in a worse scenario where the driver doesn't send packets
> because the protocol is busy linking them.
> 
> As far as I can tell you're not combining packets?? This would
> definitely break UDP/DCCP which are datagram based.

If it's combining packets, it would break a lot of other things as well.

The proposed usage seems to be to link packets together into a chain
and give the whole chain to the driver, instead of handing down one packet
at a time.

The win might be biggest on a system where a lot of applications send a lot of
small packets.  Some number will aggregate in the prio queue and then get shoved
into a driver in one go.

But...  this is all conjecture until we see the code.

-vlad

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 19:21               ` Vlad Yasevich
@ 2007-05-10 19:26                 ` Ian McDonald
  2007-05-10 20:32                 ` David Miller
  1 sibling, 0 replies; 96+ messages in thread
From: Ian McDonald @ 2007-05-10 19:26 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev

On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote:
> The win might be biggest on a system where a lot of applications send a lot of
> small packets.  Some number will aggregate in the prio queue and then get shoved
> into a driver in one go.

That's assuming that the device doesn't run out of things to send first....
>
> But...  this is all conjecture until we see the code.
>
Agree

-- 
Web: http://wand.net.nz/~iam4/
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 18:07       ` Sridhar Samudrala
@ 2007-05-10 19:43         ` Gagan Arneja
  2007-05-10 20:11           ` jamal
                             ` (2 more replies)
  0 siblings, 3 replies; 96+ messages in thread
From: Gagan Arneja @ 2007-05-10 19:43 UTC (permalink / raw)
  To: Sridhar Samudrala; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev

> small packets belonging to the same connection could be coalesced by
> TCP, but this may help the case where multiple parallel connections are
> sending small packets.

It's not just small packets. The cost of calling hard_start_xmit/byte 
was rather high on your particular device. I've seen  PCI read 
transaction in hard_start_xmit taking ~10,000 cycles on one particular 
device. Count the cycles your brand of NIC is taking in it's 
xmit_routine. The worse it is, the stronger your case for cluster 
transmits.

Also, I think, you don't have to chain skbs, they're already chained in 
  Qdisc->q. All you have to do is take the whole q and try to shove it 
at the device hoping for better results. But then, if you have rather 
big backlog, you run the risk of reordering packets if you have to requeue.
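
A sketch of that observation - the backlog is already a linked list, so
it can be moved out wholesale.  skb_queue_splice_init() is a later
helper used here only for brevity; a 2007 kernel would open-code the
splice, and the requeue path (with its reordering risk) is not shown:

#include <linux/skbuff.h>

/* Sketch: packets sitting in an sk_buff_head are already chained via
 * skb->next, so the whole backlog can be moved to a private list under
 * one lock round-trip and then walked/transmitted without the lock. */
static void grab_backlog(struct sk_buff_head *qdisc_q,
			 struct sk_buff_head *local)
{
	skb_queue_head_init(local);

	spin_lock_bh(&qdisc_q->lock);
	skb_queue_splice_init(qdisc_q, local);	/* O(1) list move */
	spin_unlock_bh(&qdisc_q->lock);
}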

> 
> Thanks
> Sridhar

--
Gagan

> 
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 19:43         ` Gagan Arneja
@ 2007-05-10 20:11           ` jamal
  2007-05-10 20:14             ` Rick Jones
                               ` (2 more replies)
  2007-05-10 20:37           ` David Miller
  2007-05-11  5:20           ` Krishna Kumar2
  2 siblings, 3 replies; 96+ messages in thread
From: jamal @ 2007-05-10 20:11 UTC (permalink / raw)
  To: Gagan Arneja
  Cc: Sridhar Samudrala, Rick Jones, Krishna Kumar2, Evgeniy Polyakov,
	netdev


The discussion seems to have steered into protocol coalescing.
My tests for example were related to forwarding and not specific
to any protocol.

On Thu, 2007-05-10 at 12:43 -0700, Gagan Arneja wrote:

> It's not just small packets. The cost of calling hard_start_xmit/byte 
> was rather high on your particular device. I've seen  PCI read 
> transaction in hard_start_xmit taking ~10,000 cycles on one particular 
> device. Count the cycles your brand of NIC is taking in it's 
> xmit_routine. The worse it is, the stronger your case for cluster 
> transmits.

You would need to almost re-write the driver to make sure it does IO
which is taking advantage of the batching. 

> Also, I think, you don't have to chain skbs, they're already chained in 
>   Qdisc->q. 
> All you have to do is take the whole q and try to shove it 
> at the device hoping for better results. 

You are onto something with the desire to use the qdisc; however, note you
need to leave the qdisc above so other CPUs can continue enqueueing.
In my approach i used the qdisc to populate a list that is attached to
the dev.

> But then, if you have rather 
> big backlog, you run the risk of reordering packets if you have to requeue.

You could avoid requeueing totally by asking the driver how much space it
has. Once you shove a number of packets down to it, you are then
guaranteed they will never be requeued (refer to the slides i pointed to
earlier).
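
A sketch of that idea; the amount of free space would come from a driver
query hook that does not exist, so here it is simply passed in as 'room':

#include <linux/netdevice.h>
#include <net/sch_generic.h>

/* Sketch: dequeue at most as many packets as the driver said it has
 * room for, so none of them can bounce back and need requeueing.
 * 'room' would come from a hypothetical driver query; hard_start_xmit
 * is the 2007-era net_device hook. */
static void xmit_up_to_room(struct net_device *dev, struct Qdisc *q, int room)
{
	struct sk_buff *skb;

	while (room-- > 0 && (skb = q->dequeue(q)) != NULL)
		dev->hard_start_xmit(skb, dev);	/* guaranteed to fit */
}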

cheers,
jamal



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:11           ` jamal
@ 2007-05-10 20:14             ` Rick Jones
  2007-05-10 20:15               ` jamal
  2007-05-10 20:15             ` Gagan Arneja
  2007-05-11  5:22             ` Krishna Kumar2
  2 siblings, 1 reply; 96+ messages in thread
From: Rick Jones @ 2007-05-10 20:14 UTC (permalink / raw)
  To: hadi
  Cc: Gagan Arneja, Sridhar Samudrala, Krishna Kumar2, Evgeniy Polyakov,
	netdev

jamal wrote:
> The discussion seems to have steered into protocol coalescing.
> My tests for example were related to forwarding and not specific
> to any protocol.

Just the natural tendency of end-system types to think of end-system things 
rather than router things.

rick jones

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:11           ` jamal
  2007-05-10 20:14             ` Rick Jones
@ 2007-05-10 20:15             ` Gagan Arneja
  2007-05-10 20:21               ` jamal
  2007-05-11  5:22             ` Krishna Kumar2
  2 siblings, 1 reply; 96+ messages in thread
From: Gagan Arneja @ 2007-05-10 20:15 UTC (permalink / raw)
  To: hadi
  Cc: Gagan Arneja, Sridhar Samudrala, Rick Jones, Krishna Kumar2,
	Evgeniy Polyakov, netdev

jamal wrote:

> You would need to almost re-write the driver to make sure it does IO
> which is taking advantage of the batching.

Really! It's just the transmit routine. How radical can that be?

--
Gagan

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:14             ` Rick Jones
@ 2007-05-10 20:15               ` jamal
  0 siblings, 0 replies; 96+ messages in thread
From: jamal @ 2007-05-10 20:15 UTC (permalink / raw)
  To: Rick Jones
  Cc: Gagan Arneja, Sridhar Samudrala, Krishna Kumar2, Evgeniy Polyakov,
	netdev

On Thu, 2007-05-10 at 13:14 -0700, Rick Jones wrote:

> Just the natural tendency of end-system types to think of end-system things 
> rather than router things.

Well router types felt they were being left out ;->

cheers,
jamal


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:15             ` Gagan Arneja
@ 2007-05-10 20:21               ` jamal
  2007-05-10 20:25                 ` Gagan Arneja
  0 siblings, 1 reply; 96+ messages in thread
From: jamal @ 2007-05-10 20:21 UTC (permalink / raw)
  To: Gagan Arneja
  Cc: Sridhar Samudrala, Rick Jones, Krishna Kumar2, Evgeniy Polyakov,
	netdev

On Thu, 2007-05-10 at 13:15 -0700, Gagan Arneja wrote:

> Really! It's just the transmit routine. How radical can that be?
> 

Ok, you have a point there, but it could be challenging with many
tunables:
For example:
my biggest challenge with the e1000 was just hacking up the DMA setup
path - i seem to get better numbers if i don't kick the DMA until i stash
all the packets on the ring first etc. It seemed counter-intuitive.

Now if you can amortize PCI operations as well (the 10K cycles you
pointed out), i think that's where you get the best bang for the buck.

cheers,
jamal



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 14:53 Krishna Kumar
  2007-05-10 15:08 ` Evgeniy Polyakov
@ 2007-05-10 20:21 ` Roland Dreier
  2007-05-11  7:30   ` Krishna Kumar2
  2007-05-11  9:05 ` Andi Kleen
  2 siblings, 1 reply; 96+ messages in thread
From: Roland Dreier @ 2007-05-10 20:21 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: netdev, Krishna Kumar

This is pretty interesting to me for IP-over-InfiniBand, for a couple
of reasons.  First of all, I can push multiple send requests to the
underlying adapter in one go, which saves taking and dropping the same
lock and also probably allows fewer MMIO writes for doorbells.

However the second reason this is useful is actually more
interesting.  Right now we request a send completion interrupt for
every packet we send because most current IB adapters don't really
have very good interrupt mitigation features.  If we don't signal a
send completion for every packet, then we might run into a situation
where we don't get a completion for the last packet that gets sent and
end up stalling the interface because the stack never gives us another
packet to send.  However, if I have a whole queue of packets passed to
my xmit routine, then I only have to request a completion for the last
packet in the list, which could potentially cut down on interrupts and
completion processing a lot.
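
Roughly what that could look like on the verbs side - a sketch, not
actual IPoIB code - where only the last work request in a chain carries
IB_SEND_SIGNALED, so one completion covers the whole batch:

#include <rdma/ib_verbs.h>

/* Sketch of the idea above: when a chain of packets is posted in one
 * go, request a completion only for the final send WR, so the whole
 * batch is covered by a single completion (and interrupt). */
static int post_send_chain(struct ib_qp *qp, struct ib_send_wr *first)
{
	struct ib_send_wr *wr, *bad_wr;

	for (wr = first; wr; wr = wr->next)
		wr->send_flags = wr->next ? 0 : IB_SEND_SIGNALED;

	return ib_post_send(qp, first, &bad_wr);
}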

 - R.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:21               ` jamal
@ 2007-05-10 20:25                 ` Gagan Arneja
  0 siblings, 0 replies; 96+ messages in thread
From: Gagan Arneja @ 2007-05-10 20:25 UTC (permalink / raw)
  To: hadi
  Cc: Gagan Arneja, Sridhar Samudrala, Rick Jones, Krishna Kumar2,
	Evgeniy Polyakov, netdev


> For example:
> my biggest challenge with the e1000 was just hacking up the DMA setup
> path - i seem to get better numbers if i don't kick the DMA until i stash
> all the packets on the ring first etc. It seemed counter-intuitive.

That seems to make sense. The rings are(?) in system memory and you can 
write to them faster. Kicking the DMA needs a transaction on the PCI 
bus, which can be painfully slow...

--
Gagan

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 19:21               ` Vlad Yasevich
  2007-05-10 19:26                 ` Ian McDonald
@ 2007-05-10 20:32                 ` David Miller
  2007-05-10 20:49                   ` Rick Jones
  1 sibling, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-10 20:32 UTC (permalink / raw)
  To: vladislav.yasevich; +Cc: ian.mcdonald, rick.jones2, krkumar2, johnpol, netdev

From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Thu, 10 May 2007 15:21:30 -0400

> The win might be biggest on a system where a lot of applications send
> a lot of small packets.  Some number will aggregate in the prio
> queue and then get shoved into a driver in one go.
>
> But...  this is all conjecture until we see the code.

Also, whatever you gain in cpu usage you'll lose in latency.

And things sending tiny frames are exactly the ones that care about
latency.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 19:43         ` Gagan Arneja
  2007-05-10 20:11           ` jamal
@ 2007-05-10 20:37           ` David Miller
  2007-05-10 20:40             ` Gagan Arneja
  2007-05-11  5:21             ` Krishna Kumar2
  2007-05-11  5:20           ` Krishna Kumar2
  2 siblings, 2 replies; 96+ messages in thread
From: David Miller @ 2007-05-10 20:37 UTC (permalink / raw)
  To: gaagaan; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev

From: Gagan Arneja <gaagaan@gmail.com>
Date: Thu, 10 May 2007 12:43:53 -0700

> It's not just small packets. The cost of calling hard_start_xmit/byte 
> was rather high on your particular device. I've seen  PCI read 
> transaction in hard_start_xmit taking ~10,000 cycles on one particular 
> device. Count the cycles your brand of NIC is taking in it's 
> xmit_routine. The worse it is, the stronger your case for cluster 
> transmits.
> 
> Also, I think, you don't have to chain skbs, they're already chained in 
>   Qdisc->q. All you have to do is take the whole q and try to shove it 
> at the device hoping for better results. But then, if you have rather 
> big backlog, you run the risk of reordering packets if you have to requeue.

If the qdisc is packed with packets and we would just loop sending
them to the device, yes it might make sense.

But if that isn't the case, which frankly is the usual case, you add a
non-trivial amount of latency by batching and that's bad exactly for
the kind of applications that send small frames.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:37           ` David Miller
@ 2007-05-10 20:40             ` Gagan Arneja
  2007-05-10 20:57               ` David Miller
  2007-05-11  5:21             ` Krishna Kumar2
  1 sibling, 1 reply; 96+ messages in thread
From: Gagan Arneja @ 2007-05-10 20:40 UTC (permalink / raw)
  To: David Miller; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev

David Miller wrote:
> 
> If the qdisc is packed with packets and we would just loop sending
> them to the device, yes it might make sense.
> 
> But if that isn't the case, which frankly is the usual case, you add a
> non-trivial amount of latency by batching and that's bad exactly for
> the kind of applications that send small frames.
> 

I don't understand how transmitting already batched-up packets in one go
introduces latency.

--
Gagan

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:32                 ` David Miller
@ 2007-05-10 20:49                   ` Rick Jones
  2007-05-10 21:02                     ` David Miller
  0 siblings, 1 reply; 96+ messages in thread
From: Rick Jones @ 2007-05-10 20:49 UTC (permalink / raw)
  To: David Miller; +Cc: vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev

David Miller wrote:
> From: Vlad Yasevich <vladislav.yasevich@hp.com>
> Date: Thu, 10 May 2007 15:21:30 -0400
> 
> 
>>The win might be biggest on a system where a lot of applications send
>>a lot of small packets.  Some number will aggregate in the prio
>>queue and then get shoved into a driver in one go.
>>
>>But...  this is all conjecture until we see the code.
> 
> 
> Also, whatever you gain in cpu usage you'll lose in latency.
> 
> And things sending tiny frames are exactly the ones that care about
> latency.

I'd think one would only do this in those situations/places where a natural "out 
of driver" queue develops in the first place wouldn't one?

rick jones

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:40             ` Gagan Arneja
@ 2007-05-10 20:57               ` David Miller
  2007-05-11  6:07                 ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-10 20:57 UTC (permalink / raw)
  To: gaagaan; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev

From: Gagan Arneja <gaagaan@gmail.com>
Date: Thu, 10 May 2007 13:40:22 -0700

> David Miller wrote:
> > 
> > If the qdisc is packed with packets and we would just loop sending
> > them to the device, yes it might make sense.
> > 
> > But if that isn't the case, which frankly is the usual case, you add a
> > non-trivial amount of latency by batching and that's bad exactly for
> > the kind of applications that send small frames.
> > 
> 
> I don't understand how transmitting already batched-up packets in one go
> introduces latency.

Keep thinking :-)

The only case where these ideas can be seriously considered is during
netif_wake_queue().  In all other cases we go straight to the device
from sendmsg(), since there is space in the TX ring, and we should
not try to batch as that will add latency.

So maybe find a solution in that area, some new driver interface that
can suck N packets out of a qdisc when netif_wake_queue() occurs.

If you look at the contexts in which netif_wake_queue() is called,
it's perfect, it already has the TX state of the driver locked, it has
the TX descriptor head/tail pointers handy, etc. and it's already
taken all the cache misses needed in order to work with this state.
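
A sketch of that interface as described; fill_tx_ring() and the batch
pull are hypothetical, while dev->qdisc and netif_wake_queue() are the
2007-era pieces:

#include <linux/netdevice.h>
#include <net/sch_generic.h>

static void fill_tx_ring(struct net_device *dev, struct sk_buff *skb); /* hypothetical */

/* Sketch: in the driver's TX-completion path, where the TX state is
 * already locked and cache-hot, pull up to 'free_slots' packets
 * straight out of the qdisc before waking the queue. */
static void tx_clean_and_refill(struct net_device *dev, int free_slots)
{
	struct Qdisc *q = dev->qdisc;
	struct sk_buff *skb;

	while (free_slots-- > 0 && (skb = q->dequeue(q)) != NULL)
		fill_tx_ring(dev, skb);		/* post directly to the ring */

	netif_wake_queue(dev);			/* resume the normal path */
}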

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:49                   ` Rick Jones
@ 2007-05-10 21:02                     ` David Miller
  2007-05-10 21:14                       ` Gagan Arneja
  0 siblings, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-10 21:02 UTC (permalink / raw)
  To: rick.jones2; +Cc: vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev

From: Rick Jones <rick.jones2@hp.com>
Date: Thu, 10 May 2007 13:49:44 -0700

> I'd think one would only do this in those situations/places where a
> natural "out of driver" queue develops in the first place wouldn't
> one?

Indeed.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:02                     ` David Miller
@ 2007-05-10 21:14                       ` Gagan Arneja
  2007-05-11  2:28                         ` Stephen Hemminger
  0 siblings, 1 reply; 96+ messages in thread
From: Gagan Arneja @ 2007-05-10 21:14 UTC (permalink / raw)
  To: David Miller
  Cc: rick.jones2, vladislav.yasevich, ian.mcdonald, krkumar2, johnpol,
	netdev

David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Thu, 10 May 2007 13:49:44 -0700
> 
>> I'd think one would only do this in those situations/places where a
>> natural "out of driver" queue develops in the first place wouldn't
>> one?
> 
> Indeed.

And a queue builds in the qdisc because your device sink is slow. There's
nothing inherent in the kernel encouraging Qdisc->q to grow. It's
precisely these broken devices that can take advantage of cluster
transmits.

--
Gagan

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 17:19     ` Rick Jones
  2007-05-10 18:07       ` Sridhar Samudrala
  2007-05-10 18:13       ` Vlad Yasevich
@ 2007-05-10 21:27       ` David Stevens
  2007-05-10 21:40         ` David Miller
                           ` (2 more replies)
  2 siblings, 3 replies; 96+ messages in thread
From: David Stevens @ 2007-05-10 21:27 UTC (permalink / raw)
  To: Rick Jones; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner

The word "small" is coming up a lot in this discussion, and
I think packet size really has nothing to do with it. Multiple
streams generating packets of any size would benefit; the
key ingredient is a queue length greater than 1.

I think the intent is to remove queue lock cycles by taking
the whole list (at least up to the count of free ring buffers)
when the queue is greater than one packet, thus effectively
removing the lock expense for n-1 packets.

                                        +-DLS


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:27       ` David Stevens
@ 2007-05-10 21:40         ` David Miller
  2007-05-10 21:50           ` Gagan Arneja
  2007-05-10 21:41         ` Eric Dumazet
  2007-05-10 21:45         ` Rick Jones
  2 siblings, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-10 21:40 UTC (permalink / raw)
  To: dlstevens; +Cc: rick.jones2, johnpol, krkumar2, netdev, netdev-owner

From: David Stevens <dlstevens@us.ibm.com>
Date: Thu, 10 May 2007 14:27:56 -0700

> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
> 
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.

Right.

But I think it's critical to do two things:

1) Do this when netif_wake_queue() triggers and thus the
   TX is locked already.

2) Have some way for the driver to say how many free TX slots
   there are in order to minimize if not eliminate requeueing
   during this batching thing.

If you drop the TX lock, the number of free slots can change
as another cpu gets in there queuing packets.

I know there are some hardware workarounds that require using
more TX ring buffer slots and are usually necessary, which
makes a 100% accurate indication of free slots impossible.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:27       ` David Stevens
  2007-05-10 21:40         ` David Miller
@ 2007-05-10 21:41         ` Eric Dumazet
  2007-05-10 22:09           ` Rick Jones
  2007-05-10 21:45         ` Rick Jones
  2 siblings, 1 reply; 96+ messages in thread
From: Eric Dumazet @ 2007-05-10 21:41 UTC (permalink / raw)
  To: David Stevens
  Cc: Rick Jones, Evgeniy Polyakov, Krishna Kumar2, netdev,
	netdev-owner

David Stevens wrote:
> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
> 
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.
> 

Yes, but on modern cpus, locked operations are basically free once the CPU 
already has the cache line in exclusive access in its L1 cache.

I am not sure adding yet another driver API will help very much.
It will for sure add some bugs and pain.

A less expensive (and less prone to bugs) optimization would be to prefetch
one cache line for the next qdisc skb, as a cache line miss is far more
expensive than a locked operation (if the lock is already in the L1 cache,
of course).
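
A minimal sketch of that suggestion; locking around the peek is elided,
and skb_dequeue/skb_peek/prefetch are existing helpers:

#include <linux/prefetch.h>
#include <linux/skbuff.h>

/* Sketch: while the current skb goes to the driver, start pulling the
 * next one's first cache line in, so the following dequeue pays for a
 * prefetch instead of a full cache miss.  Peeking without the queue
 * lock is racy and only tolerable because the result is just a hint. */
static struct sk_buff *dequeue_with_prefetch(struct sk_buff_head *q)
{
	struct sk_buff *skb = skb_dequeue(q);
	struct sk_buff *next = skb_peek(q);

	if (next)
		prefetch(next);
	return skb;
}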



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:27       ` David Stevens
  2007-05-10 21:40         ` David Miller
  2007-05-10 21:41         ` Eric Dumazet
@ 2007-05-10 21:45         ` Rick Jones
  2007-05-10 21:53           ` David Stevens
  2 siblings, 1 reply; 96+ messages in thread
From: Rick Jones @ 2007-05-10 21:45 UTC (permalink / raw)
  To: David Stevens; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner

David Stevens wrote:
> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
> 
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.

Which worked _very_ well (the whole list) going in the other direction for the 
netisr queue(s) in HP-UX 10.20.  OK, I promise no more old HP-UX stories for the 
balance of the week :)

Taking some count of the list might be a trifle too complicated.

rick jones

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:40         ` David Miller
@ 2007-05-10 21:50           ` Gagan Arneja
  2007-05-10 22:06             ` David Miller
  0 siblings, 1 reply; 96+ messages in thread
From: Gagan Arneja @ 2007-05-10 21:50 UTC (permalink / raw)
  To: David Miller
  Cc: dlstevens, rick.jones2, johnpol, krkumar2, netdev, netdev-owner

David Miller wrote:

> Right.
> 
> But I think it's critical to do two things:
> 
> 1) Do this when netif_wake_queue() triggers and thus the
>    TX is locked already.
> 
> 2) Have some way for the driver to say how many free TX slots
>    there are in order to minimize if not eliminate requeueing
>    during this batching thing.

I don't think you can reliably do this. Jumbograms may end up taking 
more than one slot.

> 
> If you drop the TX lock, the number of free slots can change
> as another cpu gets in there queuing packets.

Can you ever have more than one thread inside the driver? Isn't 
xmit_lock held while we're in there?

--
Gagan


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:45         ` Rick Jones
@ 2007-05-10 21:53           ` David Stevens
  0 siblings, 0 replies; 96+ messages in thread
From: David Stevens @ 2007-05-10 21:53 UTC (permalink / raw)
  To: Rick Jones; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner

> Which worked _very_ well (the whole list) going in the other direction
> for the netisr queue(s) in HP-UX 10.20.  OK, I promise no more old HP-UX
> stories for the balance of the week :)

        Yes, OSes I worked on in other lives usually took the whole queue
and then took responsibility for handling them outside the lock.
        A dequeue_n() under 1 lock should still be better than n locks,
I'd think, since you may bounce to different CPUs during the n individual
cycles.
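
A sketch of a dequeue_n() along those lines; the name follows the
wording above and it is not an existing helper:

#include <linux/skbuff.h>
#include <linux/spinlock.h>

/* Sketch: move up to n packets to a caller-initialised private list
 * under a single lock round-trip, instead of n lock/unlock cycles. */
static int dequeue_n(struct sk_buff_head *q, struct sk_buff_head *out, int n)
{
	struct sk_buff *skb;
	unsigned long flags;
	int moved = 0;

	spin_lock_irqsave(&q->lock, flags);
	while (moved < n && (skb = __skb_dequeue(q)) != NULL) {
		__skb_queue_tail(out, skb);
		moved++;
	}
	spin_unlock_irqrestore(&q->lock, flags);

	return moved;
}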

                                                        +-DLS


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:50           ` Gagan Arneja
@ 2007-05-10 22:06             ` David Miller
  2007-05-11  9:46               ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-10 22:06 UTC (permalink / raw)
  To: gaagaan; +Cc: dlstevens, rick.jones2, johnpol, krkumar2, netdev, netdev-owner

From: Gagan Arneja <gaagaan@gmail.com>
Date: Thu, 10 May 2007 14:50:19 -0700

> David Miller wrote:
> 
> > If you drop the TX lock, the number of free slots can change
> > as another cpu gets in there queuing packets.
> 
> Can you ever have more than one thread inside the driver? Isn't 
> xmit_lock held while we're in there?

There are restrictions wrt. when the xmit_lock and the
queue lock can be held at the same time.

The devil is definitely in the details if you try to
implement this.  It definitely lends support to Eric D.'s
assertion that this change will only add bugs, and doing
something simple like prefetches is probably a safer
route to go down.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:41         ` Eric Dumazet
@ 2007-05-10 22:09           ` Rick Jones
  0 siblings, 0 replies; 96+ messages in thread
From: Rick Jones @ 2007-05-10 22:09 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Stevens, Evgeniy Polyakov, Krishna Kumar2, netdev,
	netdev-owner

Eric Dumazet wrote:
David Stevens wrote:
> 
>> The word "small" is coming up a lot in this discussion, and
>> I think packet size really has nothing to do with it. Multiple
>> streams generating packets of any size would benefit; the
>> key ingredient is a queue length greater than 1.
>>
>> I think the intent is to remove queue lock cycles by taking
>> the whole list (at least up to the count of free ring buffers)
>> when the queue is greater than one packet, thus effectively
>> removing the lock expense for n-1 packets.
>>
> 
> Yes, but on modern cpus, locked operations are basically free once the 
> CPU already has the cache line in exclusive access in its L1 cache.

But will it here?  Any of the CPUs may be trying to add things to the qdisc, but 
only one CPU is pulling from it, right?  Even if the "pulling from it" is 
happening in a loop, there can be scores or more other cores trying to add 
things to the queue, which would cause that cache line to migrate.

> I am not sure adding yet another driver API will help very much.
> It will for sure add some bugs and pain.

That could very well be.

> A less expensive (and less prone to bugs) optimization would be to prefetch
> one cache line for the next qdisc skb, as a cache line miss is far more
> expensive than a locked operation (if the lock is already in the L1 cache,
> of course).

Might they not build one on top of the other?

rick jones



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 21:14                       ` Gagan Arneja
@ 2007-05-11  2:28                         ` Stephen Hemminger
  2007-05-11  5:01                           ` Gagan Arneja
  0 siblings, 1 reply; 96+ messages in thread
From: Stephen Hemminger @ 2007-05-11  2:28 UTC (permalink / raw)
  To: Gagan Arneja
  Cc: David Miller, rick.jones2, vladislav.yasevich, ian.mcdonald,
	krkumar2, johnpol, netdev

On Thu, 10 May 2007 14:14:05 -0700
Gagan Arneja <gaagaan@gmail.com> wrote:

> David Miller wrote:
> > From: Rick Jones <rick.jones2@hp.com>
> > Date: Thu, 10 May 2007 13:49:44 -0700
> > 
> >> I'd think one would only do this in those situations/places where a
> >> natural "out of driver" queue develops in the first place wouldn't
> >> one?
> > 
> > Indeed.
> 
> And a queue builds in the qdisc because your device sink is slow. There's
> nothing inherent in the kernel encouraging Qdisc->q to grow. It's
> precisely these broken devices that can take advantage of cluster
> transmits.

If you have braindead slow hardware, 
there is nothing that says your start_xmit routine can't do its own
coalescing.  The cost of calling the transmit routine is the responsibility
of the driver, not the core network code.
-- 
Stephen Hemminger <shemminger@linux-foundation.org>

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  2:28                         ` Stephen Hemminger
@ 2007-05-11  5:01                           ` Gagan Arneja
  0 siblings, 0 replies; 96+ messages in thread
From: Gagan Arneja @ 2007-05-11  5:01 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Gagan Arneja, David Miller, rick.jones2, vladislav.yasevich,
	ian.mcdonald, krkumar2, johnpol, netdev

 > If you have braindead slow hardware,
 > there is nothing that says your start_xmit routine can't do its own
 > coalescing.  The cost of calling the transmit routine is the
 > responsibility of the driver, not the core network code.

Yes, except you very likely run the risk of the driver introducing 
latency. The problem is prevalent enough that one can argue for a fix (a 
better one at that) higher up. And it's not all that high up that we 
need to get completely paranoid about it.

 > --
 > Stephen Hemminger <shemminger@linux-foundation.org>
 >

--
Gagan


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 18:59             ` Ian McDonald
  2007-05-10 19:21               ` Vlad Yasevich
@ 2007-05-11  5:04               ` Krishna Kumar2
  2007-05-11  9:01                 ` Evgeniy Polyakov
  1 sibling, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  5:04 UTC (permalink / raw)
  To: Ian McDonald; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Vlad Yasevich

"Ian McDonald" <ian.mcdonald@jandi.co.nz> wrote on 05/11/2007 12:29:08 AM:

>
> As I see it, this proposed patch is about reducing the number of "task
> switches" between the driver and the protocol. I use task switch in
> speech marks as it isn't really one as far as the kernel is concerned.
> So in other words, we are hoping that spending more time in each area
> will keep the cache hot and get more work done while locks are held.
> This of course requires that the added complexity doesn't outweigh the
> gains - otherwise you could end up in a worse scenario where the driver
> doesn't send packets because the protocol is busy linking them.

This is a correct summary.

> As far as I can tell you're not combining packets?? This would
> definitely break UDP/DCCP which are datagram based.

Not combining packets - I am sending them out in the same sequence in
which they were queued. If the xmit fails, the driver's new API returns
the skb which failed to be sent. This skb and all the remaining linked
skbs are requeue'd in the reverse order (fofi?) until the next time they
are tried again. I see that sometimes I can send tx_queue_len packets in
one shot and all succeed. But the downside is that in the failure case
the packets have to be requeue'd.
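
A sketch of that reverse requeue, assuming the hypothetical
skb_flink/skb_blink chain from the RFC and the 2007-era
q->ops->requeue(), which inserts at the head of the queue:

#include <linux/skbuff.h>
#include <net/sch_generic.h>

/* Sketch: when 'failed' is returned by the driver, walk to the tail of
 * the chain and requeue back-to-front.  Since requeue() inserts at the
 * head of the queue, walking in reverse restores the original order. */
static void requeue_linked(struct sk_buff *failed, struct Qdisc *q)
{
	struct sk_buff *skb = failed, *prev;

	while (skb->skb_flink)			/* find the tail of the chain */
		skb = skb->skb_flink;

	while (skb) {
		prev = skb->skb_blink;
		skb->skb_flink = skb->skb_blink = NULL;
		q->ops->requeue(skb, q);	/* goes to the head of the queue */
		if (skb == failed)
			break;
		skb = prev;
	}
}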

> Will be interesting to see results but I'm a little sceptical.

I was just going to test, and Rick helpfully pointed out netperf4.

Thanks,

- KK

>
> --
> Web: http://wand.net.nz/~iam4/
> Blog: http://iansblog.jandi.co.nz
> WAND Network Research Group


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 19:43         ` Gagan Arneja
  2007-05-10 20:11           ` jamal
  2007-05-10 20:37           ` David Miller
@ 2007-05-11  5:20           ` Krishna Kumar2
  2007-05-11  5:35             ` Gagan Arneja
  2 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  5:20 UTC (permalink / raw)
  To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala

Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 01:13:53 AM:

> Also, I think, you don't have to chain skbs, they're already chained in
>   Qdisc->q. All you have to do is take the whole q and try to shove it
> at the device hoping for better results. But then, if you have rather
> big backlog, you run the risk of reordering packets if you have to requeue.

I haven't seen packet reordering (I did once when I had a bug in the
requeue code - some TCP messages on the receiver indicated packets out of
order). When a send fails, the packets are requeued in reverse (go to the
end of the chain and traverse back to the failed skb, requeueing each skb).
Since new inserts go to the end, the queue is guaranteed to be in order.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:37           ` David Miller
  2007-05-10 20:40             ` Gagan Arneja
@ 2007-05-11  5:21             ` Krishna Kumar2
  1 sibling, 0 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  5:21 UTC (permalink / raw)
  To: David Miller; +Cc: gaagaan, johnpol, netdev, rick.jones2, sri

David Miller <davem@davemloft.net> wrote on 05/11/2007 02:07:10 AM:

> From: Gagan Arneja <gaagaan@gmail.com>
> Date: Thu, 10 May 2007 12:43:53 -0700
> >
> > Also, I think, you don't have to chain skbs, they're already chained in
> >   Qdisc->q. All you have to do is take the whole q and try to shove it
> > at the device hoping for better results. But then, if you have rather
> > big backlog, you run the risk of reordering packets if you have to requeue.
>
> If the qdisc is packed with packets and we would just loop sending
> them to the device, yes it might make sense.
>
> But if that isn't the case, which frankly is the usual case, you add a
> non-trivial amount of latency by batching and that's bad exactly for
> the kind of applications that send small frames.

Without batching, shouldn't the latency be the same, since qdisc_run would
send each packet in a tight loop anyway?

I will also run TCP/UDP RR tests.

thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:11           ` jamal
  2007-05-10 20:14             ` Rick Jones
  2007-05-10 20:15             ` Gagan Arneja
@ 2007-05-11  5:22             ` Krishna Kumar2
  2007-05-11 11:27               ` jamal
  2 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  5:22 UTC (permalink / raw)
  To: hadi; +Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Sridhar Samudrala

J Hadi Salim <j.hadi123@gmail.com> wrote on 05/11/2007 01:41:27 AM:

> > It's not just small packets. The cost of calling hard_start_xmit/byte
> > was rather high on your particular device. I've seen  PCI read
> > transaction in hard_start_xmit taking ~10,000 cycles on one particular
> > device. Count the cycles your brand of NIC is taking in its
> > xmit_routine. The worse it is, the stronger your case for cluster
> > transmits.
>
> You would need to almost re-write the driver to make sure it does IO
> which is taking advantage of the batching.

I didn't try to optimize the driver to take any real advantage, I coded it
as simply as :

top:
      next = skb->skb_flink;
      Original driver code here, or another option is to remove the locking
            and put it before the "top:" above, no other changes.
      skb = next;
      if (skb)
            goto top;

This way, the diffs are minimal. However, driver experts might know of some
tricks to optimize this, since most drivers have a hundred checks before
sending one skb, and some of those could perhaps be avoided for each
additional skb.

> You could avoid totaly requeueing by asking the driver how much space it
> has. Once you shove down to it a number of packets, you are then
> guaranteed they will never be requeued (refer to the slides i pointed to
> earlier).

This is a good idea, I will look at your slides to see what this exactly
is.
However, even with this check, drivers will return errors, eg trylock failed
or, like the e1000 has, fifo_workaround failures. But I still think it
will significantly reduce the failures. Right now, I get a lot of requeues.

thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  5:20           ` Krishna Kumar2
@ 2007-05-11  5:35             ` Gagan Arneja
  2007-05-11  5:43               ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: Gagan Arneja @ 2007-05-11  5:35 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala

Krishna Kumar2 wrote:
> I haven't seen packet reordering (I did once when I had a bug in the
> requeue code; some TCP messages on the receiver indicated packets out of
> order). When a send fails, the packets are requeued in reverse (go to the
> end of the linked list, then traverse back to the failed skb, requeueing
> each one at the head). Since new inserts go to the end, the queue is
> guaranteed to stay in order.

queue_lock is dropped when you're in xmit. There's no guarantee packets 
won't be queued up while you're trying a transmit.

> 
> Thanks,
> 
> - KK
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  5:35             ` Gagan Arneja
@ 2007-05-11  5:43               ` Krishna Kumar2
  2007-05-11  5:57                 ` Gagan Arneja
  0 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  5:43 UTC (permalink / raw)
  To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala

Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 11:05:47 AM:

> Krishna Kumar2 wrote:
> > I haven't seen packet reordering (I did once when I had a bug in the
> > requeue code; some TCP messages on the receiver indicated packets out
> > of order). When a send fails, the packets are requeued in reverse (go
> > to the end of the linked list, then traverse back to the failed skb,
> > requeueing each one at the head). Since new inserts go to the end, the
> > queue is guaranteed to stay in order.
>
> queue_lock is dropped when you're in xmit. There's no guarantee packets
> won't be queued up while you're trying a transmit.

Right, but I am the sole dequeue'r, and on failure, I requeue those packets
to the beginning of the queue (just as it would happen in the regular case
of one packet xmit/failure/requeue).

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  5:43               ` Krishna Kumar2
@ 2007-05-11  5:57                 ` Gagan Arneja
  2007-05-11  6:06                   ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: Gagan Arneja @ 2007-05-11  5:57 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala

> 
> Right, but I am the sole dequeue'r, and on failure, I requeue those packets
> to the beginning of the queue (just as it would happen in the regular case
> of one packet xmit/failure/requeue).

What about a race between trying to reacquire queue_lock and another 
failed transmit?

--
Gagan

> 
> - KK
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  5:57                 ` Gagan Arneja
@ 2007-05-11  6:06                   ` Krishna Kumar2
  2007-05-11  6:29                     ` Gagan Arneja
  0 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  6:06 UTC (permalink / raw)
  To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala

Hi Gagan,

Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 11:27:54 AM:

> >
> > Right, but I am the sole dequeue'r, and on failure, I requeue those
> > packets to the beginning of the queue (just as it would happen in the
> > regular case of one packet xmit/failure/requeue).
>
> What about a race between trying to reacquire queue_lock and another
> failed transmit?

That is not possible either. I hold the QDISC_RUNNING bit in dev->state and
am the only sender for this device, so there is no other failed transmit.
Also, on failure of dev_hard_start_xmit, qdisc_restart will spin for the
queue lock (which could be held quickly by enqueuers), and q->requeue()
the skbs to the head of the list.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:57               ` David Miller
@ 2007-05-11  6:07                 ` Krishna Kumar2
  0 siblings, 0 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  6:07 UTC (permalink / raw)
  To: David Miller; +Cc: gaagaan, johnpol, netdev, rick.jones2, sri

Hi Dave,

David Miller <davem@davemloft.net> wrote on 05/11/2007 02:27:07 AM:

> > I don't understand how transmitting already batched up packets in one
> > go
> > introduce latency.
>
> Keep thinking :-)
>
> The only case where these ideas can be seriously considered is during
> netif_wake_queue().  In all other cases we go straight to the device
> from sendmsg(), since there is space in the TX ring, and we should
> not try to batch as that will add latency.

I am not very clear about this: are you saying latency will increase
because the old packets are no longer cache-warm?

But the existing code will do exactly the same thing as what I am
proposing, except that it iterates over all the packets one by one.
So will batching them make latency worse?

In the sendmsg case you wrote about above (is it tcp_transmit_skb?), the
packet will still come via IP (through dst_output)?

I would like to try your idea if you are sure it will reduce latency.

Thanks,

- KK

> So maybe find a solution in that area, some new driver interface that
> can suck N packets out of a qdisc when netif_wake_queue() occurs.
>
> If you look at the contexts in which netif_wake_queue() is called,
> it's perfect, it already has the TX state of the driver locked, it has
> the TX descriptor head/tail pointers handy, etc. and it's already
> taken all the cache misses needed in order to work with this state.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  6:06                   ` Krishna Kumar2
@ 2007-05-11  6:29                     ` Gagan Arneja
  2007-05-11  6:52                       ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: Gagan Arneja @ 2007-05-11  6:29 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala

Krishna Kumar2 wrote:
>> What about a race between trying to reacquire queue_lock and another
>> failed transmit?
> 
> That is not possible either. I hold the QDISC_RUNNING bit in dev->state and
> am the only sender for this device, so there is no other failed transmit.
> Also, on failure of dev_hard_start_xmit, qdisc_restart will spin for the
> queue lock (which could be held quickly by enqueuers), and q->requeue()
> the skbs to the head of the list.

I have to claim incomplete familiarity with the code. But still, if 
you're out there running with no locks for a period, there's no 
assumption you can make. The "lock could be held quickly" assertion is a 
fallacy.

> 
> Thanks,
> 
> - KK
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  6:29                     ` Gagan Arneja
@ 2007-05-11  6:52                       ` Krishna Kumar2
  0 siblings, 0 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  6:52 UTC (permalink / raw)
  To: Gagan Arneja
  Cc: Evgeniy Polyakov, netdev, netdev-owner, Rick Jones,
	Sridhar Samudrala

Hi Gagan,

> I have to claim incomplete familiarity for the code. But still, if
> you're out there running with no locks for a period, there's no
> assumption you can make. The "lock could be held quickly" assertion is a
> fallacy.

I will try to explain, since the code is pretty complicated. Packets coming
down from IP (into dev_queue_xmit) are first added to a FIFO queue associated
with the device (queue_lock is held first). Then qdisc_run is called and, if
it can set the QDISC_RUNNING bit (making it the "sole dequeue'r"), it calls
qdisc_restart (which *could* get the tx lock) and drops the queue_lock to
call the actual driver xmit routine. Meanwhile 'n' cpus can race and add
more skbs to the same queue, each getting this lock for a short time and
dropping it, and all these skbs are added to the end of the queue. These are
"pure enqueuers" (since their attempt to set the same bit fails and they
return from qdisc_run). In the original thread, the dequeue'r gets an error
from the driver xmit. It first re-acquires the queue_lock and calls requeue,
which adds the skb to the head of the queue. Since no one else has been able
to dequeue packets while the RUNNING bit is set, packet re-ordering cannot
happen. After returning to __qdisc_run, the RUNNING bit is cleared.
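
In rough pseudo-C, the flow looks like this (simplified from the 2.6.21-era
dev_queue_xmit()/qdisc_run() path; details such as qdisc_restart() dropping
and re-taking queue_lock around the driver call are elided):

/* Simplified sketch only -- the real code also handles GSO, LLTX,
 * stopped queues, etc.  It shows why only one CPU dequeues at a time. */
static int dev_queue_xmit_sketch(struct sk_buff *skb, struct net_device *dev)
{
        struct Qdisc *q = dev->qdisc;

        spin_lock_bh(&dev->queue_lock);
        q->enqueue(skb, q);                     /* any CPU may enqueue */

        /* only the CPU that wins this bit becomes the dequeue'r */
        if (!test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state)) {
                /* qdisc_restart() dequeues one skb, drops queue_lock,
                 * calls the driver, re-takes queue_lock and, on error,
                 * requeues the skb at the head of the queue */
                while (qdisc_restart(dev) < 0 && !netif_queue_stopped(dev))
                        ;
                clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
        }

        spin_unlock_bh(&dev->queue_lock);
        return 0;
}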

Hope that clarifies the code.

thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
@ 2007-05-11  7:14 Krishna Kumar2
  0 siblings, 0 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  7:14 UTC (permalink / raw)
  To: David Stevens
  Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner,
	Rick Jones


(Mistakenly didn't reply-all the previous time)

Hi Dave,

David Stevens <dlstevens@us.ibm.com> wrote on 05/11/2007 02:57:56 AM:

> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.

Correct. Rethinking this: for larger packets the bandwidth may not improve,
as it already reaches line speed, but I guess CPU utilization could still
drop (which could happen in the small-packet case too). Using "small"
did not cover this case.

Thanks,

- KK

> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.



^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 20:21 ` Roland Dreier
@ 2007-05-11  7:30   ` Krishna Kumar2
  2007-05-11 11:21     ` Roland Dreier
  0 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  7:30 UTC (permalink / raw)
  To: Roland Dreier; +Cc: netdev

Hi Roland,

Roland Dreier <rdreier@cisco.com> wrote on 05/11/2007 01:51:50 AM:

> This is pretty interesting to me for IP-over-InfiniBand, for a couple
> of reasons.  First of all, I can push multiple send requests to the
> underlying adapter in one go, which saves taking and dropping the same
> lock and also probably allows fewer MMIO writes for doorbells.
>
> However the second reason this is useful is actually more
> interesting.  Right now we request a send completion interrupt for
> every packet we send because most current IB adapters don't really
> have very good interrupt mitigation features.  If we don't signal a
> send completion for every packet, then we might run into a situation
> where we don't get a completion for the last packet that gets sent and
> end up stalling the interface because the stack never gives us another
> packet to send.  However, if I have a whole queue of packets passed to
> my xmit routine, then I only have to request a completion for the last
> packet in the list, which could potentially cut down on interrupts and
> completion processing a lot.

Sounds like a good idea. I had a question on error handling. What happens if
the driver asynchronously returns an error for this WR (a single WR
containing multiple skbs)? Does it mean all the skbs failed to be sent?
Requeuing all of them is a bad idea since it leads to infinitely doing the
same thing, so should every skb be freed? Isn't that harsh (if the first skb
is bad and the others are fine), and how do we avoid that? In fact, we don't
know which skb was transmitted and which failed. If we requeue skbs,
there will also be out-of-order skbs, as the callback is asynchronous.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  9:05 ` Andi Kleen
@ 2007-05-11  8:32   ` Krishna Kumar2
  2007-05-11  9:37     ` Andi Kleen
  0 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  8:32 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, netdev

Hi Andi,

ak@suse.de wrote on 05/11/2007 02:35:05 PM:

> You don't need that. You can just use the normal next/prev pointers.
> In general it's a good idea to lower lock overhead etc., the VM has
> used similar tricks very successfully in the past.

Does this mean each skb should be for the same connection if next/prev
is used ?

> Another problem is that this setup typically requires the aggregate
> packets to be from the same connection. Otherwise you will only
> save a short trip into the stack until the linked packet would need
> to be split again to pass to multiple sockets. With that the scheme
> probably helps much less.

I guess you meant this for receives only. On the send side, packets
for different sockets can be linked and sent together, right ?

> Or you could do this only if multiple packets belong to the same
> single connection (basically with a one hit cache); but then it would

But for sends, why does same or different connection matter ? There is
no aggregating of skbs.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  9:37     ` Andi Kleen
@ 2007-05-11  8:50       ` Krishna Kumar2
  2007-05-11 11:16       ` Roland Dreier
  1 sibling, 0 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  8:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: ak, Andi Kleen, netdev

Hi Andi,

Andi Kleen <ak@suse.de> wrote on 05/11/2007 03:07:14 PM:

> But without it aggregation  on RX is much less useful because the packets
> cannot be kept together after socket demux which happens relatively early
> in the packet processing path.

Then I misunderstood you; my proposal is only for TX. GSO will not help, as
it is about large packets getting segmented at the latest point in time to
avoid going through the stack multiple times. My proposal is to link
multiple packets and send them to the device in one shot rather than
calling qdisc_restart multiple times.

> > But for sends, why does same or different connection matter ? There is
> > no aggregating of skbs.
>
> I wasn't talking about sending.
>
> But there actually is :- TSO/GSO.

Sorry, I mis-wrote what I meant: in my proposal there is no aggregation
of skbs, so single vs. multiple connections should not matter.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  5:04               ` Krishna Kumar2
@ 2007-05-11  9:01                 ` Evgeniy Polyakov
  2007-05-11  9:18                   ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11  9:01 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, May 11, 2007 at 10:34:22AM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> Not combining packets; I am sending them out in the same sequence in which
> they were queued. If the xmit fails, the driver's new API returns the skb
> which failed to be sent. This skb and all the other linked skbs are requeued
> in reverse order (reverse FIFO) until the next time they are tried. I see
> that sometimes I can send tx_queue_len packets in one shot and all
> succeed. But the downside is that in the failure case, the packets have to
> be requeued.

And what if you have thousands of packets queued and the first one has
failed? Requeueing all the rest one-by-one is not a solution. If it is
being done under a heavy lock (especially with irqs disabled) it becomes a
disaster.

I thought of a slightly different approach: the driver maintains its own
queue (or has access to the stack's one) and only uses a lock to dequeue a
packet. If a transmit fails, nothing is requeued; the same packet is tried
until the transmit completes. If a number of transmits fail in a row, the
driver declares itself broken and/or stops the queue. Thus one can set up
several descriptors in one go and do it without any locks. The stack calls
the driver's xmit function, which essentially sets a bit indicating that
new packets are available, but the driver must have access to the qdisc.

If the e1000 driver were not so... so 'uncommon' compared to other small
GigE drivers, I would try to cook up a patch, but I will not with
e1000 :)

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 14:53 Krishna Kumar
  2007-05-10 15:08 ` Evgeniy Polyakov
  2007-05-10 20:21 ` Roland Dreier
@ 2007-05-11  9:05 ` Andi Kleen
  2007-05-11  8:32   ` Krishna Kumar2
  2 siblings, 1 reply; 96+ messages in thread
From: Andi Kleen @ 2007-05-11  9:05 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: netdev, Krishna Kumar

Krishna Kumar <krkumar2@in.ibm.com> writes:

> Doing some measurements, I found that for small packets like 128 bytes,
> the bandwidth is approximately 60% of the line speed. To possibly speed
> up performance of small packet xmits, a method of "linking" skbs was
> thought of - where two pointers (skb_flink/blink) is added to the skb.

You don't need that. You can just use the normal next/prev pointers.
In general it's a good idea to lower lock overhead etc., the VM has
used similar tricks very successfully in the past.

There were some thoughts about this earlier, but in highend
NICs the direction instead seems to go towards LRO (large receive offloading). 
 
LRO is basically like TSO, just for receiving. The NIC aggregates
multiple packets into a single larger one that is then processed by
the stack as one skb. This typically doesn't use linked lists, but an
array of pages.

Your scheme would help old NICs that don't have this optimization.
Might be a worthwhile goal, although people often seem to be more interested
in modern hardware.

Another problem is that this setup typically requires the aggregate
packets to be from the same connection. Otherwise you will only
save a short trip into the stack until the linked packet would need
to be split again to pass to multiple sockets. With that the scheme
probably helps much less.

The hardware schemes typically use at least some kind of hash to
aggregate connections. You might need to implement something similar
too if it doesn't save enough time.  Don't know if it would be very
efficient in software.

Or you could do this only if multiple packets belong to the same
single connection (basically with a one hit cache); but then it would
smell a bit like a benchmark hack.

-Andi

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  9:01                 ` Evgeniy Polyakov
@ 2007-05-11  9:18                   ` Krishna Kumar2
  2007-05-11  9:32                     ` Evgeniy Polyakov
  0 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  9:18 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

Hi Evgeniy,

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/11/2007 02:31:38 PM:

> On Fri, May 11, 2007 at 10:34:22AM +0530, Krishna Kumar2
> (krkumar2@in.ibm.com) wrote:
> > Not combining packets; I am sending them out in the same sequence in
> > which they were queued. If the xmit fails, the driver's new API returns
> > the skb which failed to be sent. This skb and all the other linked skbs
> > are requeued in reverse order (reverse FIFO) until the next time they
> > are tried. I see that sometimes I can send tx_queue_len packets in one
> > shot and all succeed. But the downside is that in the failure case, the
> > packets have to be requeued.
>
> And what if you have thousands of packets queued and the first one has
> failed? Requeueing all the rest one-by-one is not a solution. If it is
> being done under a heavy lock (especially with irqs disabled) it becomes
> a disaster.

If the first one failed for reasons other than those described below, it is
freed up and the next one is attempted. There are three cases where we
cannot continue: no slots, device blocked, no lock.

Jamal had suggested getting information on the number of available slots
from the driver. queue_stopped is checked before linking packets, so the
only other error case is not getting a lock. And this is true only in the
~LLTX case, which optionally could be a check to enable this linking. Also,
there could be limits on how many packets are queued. I tried tx_queue_len
as well as tx_queue_len/2, but there could be other options.

> I thought of a slightly different approach: the driver maintains its own
> queue (or has access to the stack's one) and only uses a lock to dequeue a
> packet. If a transmit fails, nothing is requeued; the same packet is tried
> until the transmit completes. If a number of transmits fail in a row, the
> driver declares itself broken and/or stops the queue. Thus one can set up
> several descriptors in one go and do it without any locks. The stack calls
> the driver's xmit function, which essentially sets a bit indicating that
> new packets are available, but the driver must have access to the qdisc.

Bugs will increase if drivers are also allowed to access the qdisc, since it
is difficult code as it is. Right now it is easy to see how the qdisc is
manipulated.

thanks,

- KK

> If e1000 driver would not be so... so 'uncommon' compared to another
> small gige drivers I would try to cook up a patch, but I will not with
> e1000 :)
>
> --
>    Evgeniy Polyakov


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  9:18                   ` Krishna Kumar2
@ 2007-05-11  9:32                     ` Evgeniy Polyakov
  2007-05-11  9:52                       ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11  9:32 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, May 11, 2007 at 02:48:14PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > And what if you have thousands of packets queued and the first one has
> > failed? Requeueing all the rest one-by-one is not a solution. If it is
> > being done under a heavy lock (especially with irqs disabled) it becomes
> > a disaster.
> 
> If the first one failed for reasons other than those described below, it
> is freed up and the next one is attempted. There are three cases where we
> cannot continue: no slots, device blocked, no lock.
> 
> Jamal had suggested getting information on the number of available slots
> from the driver. queue_stopped is checked before linking packets, so the
> only other error case is not getting a lock. And this is true only in the
> ~LLTX case, which optionally could be a check to enable this linking.
> Also, there could be limits on how many packets are queued. I tried
> tx_queue_len as well as tx_queue_len/2, but there could be other options.

Why should the stack ever know what hardware is being used?
If you are going to break things, break them with a scary sound :)

Design a new interface where the driver has access to the queue (say, only
through one helper, dequeue_first_packet()), the stack can only queue
packets to the tail, and the driver can also stop that by setting a bit
(which was there likely before we even knew the word Linux).

That is all: the driver does not access the qdisc, it only gets the first
skb. The transmit function is completely lockless (unless something is
shared with the receive path/interrupts, but that is the driver's private
part), and the stack will not be changed at all (except for a new helper
to dequeue the first packet from the qdisc; actually it is already there,
it just needs to be exported). The driver's xmit function sets a flag that
there are packets (or it can even be empty, since the presence of packets
can be detected via the same qdisc interface). You read the qdisc length,
set up several descriptors in one go, dequeue a set of skbs and commit
them. If something fails, it is not requeued but resent (or even dropped
after a fair number of attempts).
No locks, no requeues? Seems simple imho.
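
To make the shape concrete, a purely illustrative sketch (every helper and
field here -- dequeue_first_packet(), tx_ring_has_room(), post_descriptor(),
kick_hw(), struct drv_priv -- is hypothetical, not an existing kernel API):

static void drv_tx_pull(struct net_device *dev, struct drv_priv *priv)
{
        struct sk_buff *skb;

        while (tx_ring_has_room(priv)) {
                /* retry a previously failed skb first, else pop a new one;
                 * popping is the only step that takes the queue lock */
                skb = priv->pending ? priv->pending : dequeue_first_packet(dev);
                if (!skb)
                        break;

                if (post_descriptor(priv, skb) < 0) {
                        /* never requeue: keep this skb and retry it later;
                         * stop the queue after repeated failures */
                        priv->pending = skb;
                        if (++priv->fail_count > DRV_MAX_TX_RETRIES)
                                netif_stop_queue(dev);
                        goto kick;
                }
                priv->pending = NULL;
                priv->fail_count = 0;
        }
kick:
        kick_hw(priv);          /* one doorbell/PCI write for the whole batch */
}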

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  8:32   ` Krishna Kumar2
@ 2007-05-11  9:37     ` Andi Kleen
  2007-05-11  8:50       ` Krishna Kumar2
  2007-05-11 11:16       ` Roland Dreier
  0 siblings, 2 replies; 96+ messages in thread
From: Andi Kleen @ 2007-05-11  9:37 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Andi Kleen, ak, netdev

On Fri, May 11, 2007 at 02:02:55PM +0530, Krishna Kumar2 wrote:
> Hi Andy,
> 
> ak@suse.de wrote on 05/11/2007 02:35:05 PM:
> 
> > You don't need that. You can just use the normal next/prev pointers.
> > In general it's a good idea to lower lock overhead etc., the VM has
> > used similar tricks very successfully in the past.
> 
> Does this mean each skb should be for the same connection if next/prev
> is used ?

No; but see next paragraph.

But without it aggregation  on RX is much less useful because the packets
cannot be kept together after socket demux which happens relatively early
in the packet processing path.

> 
> > Another problem is that this setup typically requires the aggregate
> > packets to be from the same connection. Otherwise you will only
> > save a short trip into the stack until the linked packet would need
> > to be split again to pass to multiple sockets. With that the scheme
> > probably helps much less.
> 
> I guess you meant this for receives only. On the send side, packets

Yes.

> for different sockets can be linked and sent together, right ?

Not implemented, and as DaveM pointed out such batching
has some problems.  There is just TSO/GSO for single sockets.

> 
> > Or you could do this only if multiple packets belong to the same
> > single connection (basically with a one hit cache); but then it would
> 
> But for sends, why does same or different connection matter ? There is
> no aggregating of skbs.

I wasn't talking about sending.


But there actually is :- TSO/GSO.


-Andi

> 
> Thanks,
> 
> - KK
> 

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-10 22:06             ` David Miller
@ 2007-05-11  9:46               ` Krishna Kumar2
  0 siblings, 0 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  9:46 UTC (permalink / raw)
  To: David Miller
  Cc: dlstevens, gaagaan, johnpol, netdev, netdev-owner, rick.jones2

Hi all,

Very preliminary testing with 20 procs on the E1000 driver gives me the
following results:

skbsz    Org BW    New BW    %        Org demand    New Demand    %
32       315.98    347.48    9.97%    21090         20958         0.62%
96       833.67    882.92    5.91%    7939          9107          -14.71

But this is a test run of just 30 secs (too short) with netperf2 (not
netperf4, which I am going to use later), using a single E1000 card
cross-cabled between two 2-CPU 2.8GHz xSeries systems running a 2.6.21.1
kernel.

I will have a more detailed report next week, especially once I run
netperf4. I am taking off for the next two days, so will reply later on
this thread.

Thanks,

- KK

David Miller <davem@davemloft.net> wrote on 05/11/2007 03:36:05 AM:

> From: Gagan Arneja <gaagaan@gmail.com>
> Date: Thu, 10 May 2007 14:50:19 -0700
>
> > David Miller wrote:
> >
> > > If you drop the TX lock, the number of free slots can change
> > > as another cpu gets in there queuing packets.
> >
> > Can you ever have more than one thread inside the driver? Isn't
> > xmit_lock held while we're in there?
>
> There are restrictions wrt. when the xmit_lock and the
> queue lock can be held at the same time.
>
> The devil is definitely in the details if you try to
> implemen this.  It definitely lends support for Eric D.'s
> assertion that this change will only add bugs and doing
> something simple like prefetches is proabably a safer
> route to go down.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  9:32                     ` Evgeniy Polyakov
@ 2007-05-11  9:52                       ` Krishna Kumar2
  2007-05-11  9:56                         ` Evgeniy Polyakov
  0 siblings, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-11  9:52 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/11/2007 03:02:02 PM:

> On Fri, May 11, 2007 at 02:48:14PM +0530, Krishna Kumar2
> (krkumar2@in.ibm.com) wrote:
> > > And what if you have thousands of packets queued and the first one has
> > > failed? Requeueing all the rest one-by-one is not a solution. If it is
> > > being done under a heavy lock (especially with irqs disabled) it
> > > becomes a disaster.
> >
> > If the first one failed for reasons other than those described below, it
> > is freed up and the next one is attempted. There are three cases where
> > we cannot continue: no slots, device blocked, no lock.
> >
> Why should the stack ever know what hardware is being used?
> If you are going to break things, break them with a scary sound :)

Actually, it could just be an API exported by those drivers which implement
this new xmit API.

> Design a new interface where the driver has access to the queue (say, only
> through one helper, dequeue_first_packet()), the stack can only queue
> packets to the tail, and the driver can also stop that by setting a bit
> (which was there likely before we even knew the word Linux).
>
> That is all: the driver does not access the qdisc, it only gets the first
> skb. The transmit function is completely lockless (unless something is
> shared with the receive path/interrupts, but that is the driver's private
> part), and the stack will not be changed at all (except for a new helper
> to dequeue the first packet from the qdisc; actually it is already there,
> it just needs to be exported). The driver's xmit function sets a flag that
> there are packets (or it can even be empty, since the presence of packets
> can be detected via the same qdisc interface). You read the qdisc length,
> set up several descriptors in one go, dequeue a set of skbs and commit
> them. If something fails, it is not requeued but resent (or even dropped
> after a fair number of attempts).
> No locks, no requeues? Seems simple imho.

I will analyze this in more detail when I return (leaving just now, so got
really no time). The only issue that I see quickly is "No locks", since to
get things off the queue you need to hold the queue_lock.

Thanks,

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  9:52                       ` Krishna Kumar2
@ 2007-05-11  9:56                         ` Evgeniy Polyakov
  2007-05-11 11:30                           ` jamal
  0 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11  9:56 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, May 11, 2007 at 03:22:13PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > No locks, no requeues? Seems simple imho.
> 
> I will analyze this in more detail when I return (leaving just now, so got
> really no time). The only issue that I see quickly is "No locks", since to
> get things off the queue you need to hold the queue_lock.

I meant no locks during processing of the packets (pci read/write, dma
setup and so on), of course it is needed to dequeue a packet, but only
for that operation.

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  9:37     ` Andi Kleen
  2007-05-11  8:50       ` Krishna Kumar2
@ 2007-05-11 11:16       ` Roland Dreier
  2007-05-13  6:00         ` Andi Kleen
  1 sibling, 1 reply; 96+ messages in thread
From: Roland Dreier @ 2007-05-11 11:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Krishna Kumar2, Andi Kleen, netdev

 > I wasn't talking about sending.
 > 
 > But there actually is :- TSO/GSO.

As I said before, getting multiple packets in one call to xmit would
be nice for amortizing per-xmit overhead in IPoIB.  So it would be
nice if the cases where the stack does GSO ended up passing all the
segments into the driver in one go.

 - R.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  7:30   ` Krishna Kumar2
@ 2007-05-11 11:21     ` Roland Dreier
  0 siblings, 0 replies; 96+ messages in thread
From: Roland Dreier @ 2007-05-11 11:21 UTC (permalink / raw)
  To: Krishna Kumar2; +Cc: netdev

 > Sounds like a good idea. I had a question on error handling. What happens if
 > the driver asynchronously returns an error for this WR (a single WR
 > containing multiple skbs)? Does it mean all the skbs failed to be sent?
 > Requeuing all of them is a bad idea since it leads to infinitely doing the
 > same thing, so should every skb be freed? Isn't that harsh (if the first skb
 > is bad and the others are fine), and how do we avoid that? In fact, we don't
 > know which skb was transmitted and which failed. If we requeue skbs,
 > there will also be out-of-order skbs, as the callback is asynchronous.

Each packet will still get a separate work request.  However if one
work request fails, then the QP will transition to the error state and
all later work requests will fail too.

But the completion handling doesn't change at all -- IPoIB already
queues work requests with the hardware and then collects completions
asynchronously.  If we get a failed completion then we free the skb
and bump tx errors -- there's nothing else really sensible to do.

However errors on UD QPs can never really happen except for driver
bugs or hardware failures, and for connected QPs (used in connected
mode), errors probably mean that a connection broke, so dropping
packets is quite sensible.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  5:22             ` Krishna Kumar2
@ 2007-05-11 11:27               ` jamal
  0 siblings, 0 replies; 96+ messages in thread
From: jamal @ 2007-05-11 11:27 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
	Sridhar Samudrala

On Fri, 2007-11-05 at 10:52 +0530, Krishna Kumar2 wrote:

> I didn't try to optimize the driver to take any real advantage, I coded it
> as simply as :
> 
> top:
>       next = skb->skb_flink;
>       Original driver code here, or another option is to remove the locking
>             and put it before the "top:" above, no other changes.
>       skb = next;
>       if (skb)
>             goto top;
> This way, the diffs are minimal. However, driver experts might know of some
> tricks to optimize this, since most drivers have a hundred checks before
> sending one skb, and some of those could perhaps be avoided for each
> additional skb.
> 

You will need to at some point; no big deal, the idea is to get things in
place first. The DMA kick i talked about yesterday is a simple example:
you avoid kicking the DMA until you have queued everything in the ring
(as Gagan pointed out, kicking the DMA is a PCI IO write).
I am sure there are gazillions of others.
 

> This is a good idea, I will look at your slides to see what this exactly
> is.

I will try to post my patches updated to the latest net-2.6 kernel
before you come back on monday.

> However, even with this check, drivers will return errors, eg trylock failed
> or, like the e1000 has, fifo_workaround failures. But I still think it
> will significantly reduce the failures. 

Yes, this is what i referred to as the "many tunables", and it led to the
claim in my posting that you may have to re-write the whole tx path.

> Right now, I get a lot of requeue's.

requeues are a good sign ;->

cheers,
jamal


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11  9:56                         ` Evgeniy Polyakov
@ 2007-05-11 11:30                           ` jamal
  2007-05-11 11:53                             ` Evgeniy Polyakov
  0 siblings, 1 reply; 96+ messages in thread
From: jamal @ 2007-05-11 11:30 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, 2007-11-05 at 13:56 +0400, Evgeniy Polyakov wrote:

> I meant no locks during processing of the packets (pci read/write, dma
> setup and so on), of course it is needed to dequeue a packet, but only
> for that operation.

I don't think you can avoid the lock, Evgeniy. You need to protect against
the tx ring having some other things happening to it from the napi poll
or netdev interrupts (most of the drivers touch the tx ring on the napi
poll).

cheers,
jamal


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 11:30                           ` jamal
@ 2007-05-11 11:53                             ` Evgeniy Polyakov
  2007-05-11 12:15                               ` jamal
  0 siblings, 1 reply; 96+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 11:53 UTC (permalink / raw)
  To: jamal; +Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, May 11, 2007 at 07:30:02AM -0400, jamal (hadi@cyberus.ca) wrote:
> > I meant no locks during processing of the packets (pci read/write, dma
> > setup and so on), of course it is needed to dequeue a packet, but only
> > for that operation.
> 
> I don't think you can avoid the lock, Evgeniy. You need to protect against
> the tx ring having some other things happening to it from the napi poll
> or netdev interrupts (most of the drivers touch the tx ring on the napi
> poll).

As I said, there might be another lock if the interrupt handler is shared
or registers are accessed, but that is the driver's private business, which
has nothing in common with the stack itself. The stack just queues an skb
which, after being detached from the tx queue by the driver, simply does not
exist for the stack anymore. It can even be dequeued with RCU protection.

> cheers,
> jamal

-- 
	Evgeniy Polyakov

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 11:53                             ` Evgeniy Polyakov
@ 2007-05-11 12:15                               ` jamal
  0 siblings, 0 replies; 96+ messages in thread
From: jamal @ 2007-05-11 12:15 UTC (permalink / raw)
  To: Evgeniy Polyakov
  Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich

On Fri, 2007-11-05 at 15:53 +0400, Evgeniy Polyakov wrote:

> As I said, there might be another lock if the interrupt handler is shared
> or registers are accessed, but that is the driver's private business, which
> has nothing in common with the stack itself. 

Ok, we are saying the same thing then; e.g. in e1000 that would be
tx_ring->lock or something along those lines.

> The stack just queues an skb
> which, after being detached from the tx queue by the driver, simply does
> not exist for the stack anymore. It can even be dequeued with RCU
> protection.

in my case i had a dev->blist where the stack queued. You really want to
avoid requeueing altogether.
The trick i used is to have the driver tell you how many packets you can
pull off the qdisc.

cheers,
jamal




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-11 11:16       ` Roland Dreier
@ 2007-05-13  6:00         ` Andi Kleen
  2007-05-15 16:25           ` Roland Dreier
  0 siblings, 1 reply; 96+ messages in thread
From: Andi Kleen @ 2007-05-13  6:00 UTC (permalink / raw)
  To: Roland Dreier; +Cc: Krishna Kumar2, netdev

On Friday 11 May 2007 13:16:44 Roland Dreier wrote:
>  > I wasn't talking about sending.
>  > 
>  > But there actually is :- TSO/GSO.
> 
> As I said before, getting multiple packets in one call to xmit would
> be nice for amortizing per-xmit overhead in IPoIB.  So it would be
> nice if the cases where the stack does GSO ended up passing all the
> segments into the driver in one go.

Well, TCP does up to 64k -- that is what GSO is about.

If you want it for unrelated connections, I'm not sure it would actually
work, because if you have a deep TX queue already it is unlikely
the qdisc will buffer that much. Most packet injection should
come from the individual sockets, which is obviously only a single connection.

-Andi

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-13  6:00         ` Andi Kleen
@ 2007-05-15 16:25           ` Roland Dreier
  2007-05-15 20:18             ` David Miller
  0 siblings, 1 reply; 96+ messages in thread
From: Roland Dreier @ 2007-05-15 16:25 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Krishna Kumar2, netdev

 > > As I said before, getting multiple packets in one call to xmit would
 > > be nice for amortizing per-xmit overhead in IPoIB.  So it would be
 > > nice if the cases where the stack does GSO ended up passing all the
 > > segments into the driver in one go.
 > 
 > Well TCP does upto 64k -- that is what GSO is about.

I see... the plan would be to add NETIF_F_GSO_SOFTWARE to the device
features and use skb_gso_segment() in the netdevice driver?  (I just
studied GSO more carefully -- I hadn't realized that was possible)

I'll have to think about implementing that for IPoIB.  One issue I see
is if I have, say, 4 free entries in my send queue and skb_gso_segment()
gives me back 5 packets to send.  It's not clear I can recover at that
point -- I guess I have to check against gso_segs in the xmit routine
before actually doing the segmentation.
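
Something like this check at the top of the xmit routine, say (a rough
sketch only: 'struct my_priv', 'tx_free' and TX_GSO_MARGIN are made-up names
for illustration, while gso_segs and NETDEV_TX_BUSY are the real 2.6.21 bits):

static int my_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct my_priv *priv = netdev_priv(dev);

        if (skb_is_gso(skb) &&
            skb_shinfo(skb)->gso_segs + TX_GSO_MARGIN > priv->tx_free) {
                /* not enough send queue entries for the worst case:
                 * punt before segmenting; the stack will requeue */
                netif_stop_queue(dev);
                return NETDEV_TX_BUSY;
        }

        /* ... otherwise segment with skb_gso_segment() and post each
         * segment as its own send work request ... */
        return NETDEV_TX_OK;
}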

 - R.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 16:25           ` Roland Dreier
@ 2007-05-15 20:18             ` David Miller
  2007-05-15 20:52               ` Roland Dreier
  0 siblings, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-15 20:18 UTC (permalink / raw)
  To: rdreier; +Cc: ak, krkumar2, netdev

From: Roland Dreier <rdreier@cisco.com>
Date: Tue, 15 May 2007 09:25:28 -0700

> I'll have to think about implementing that for IPoIB.  One issue I see
> is if I have, say, 4 free entries in my send queue and skb_gso_segment()
> gives me back 5 packets to send.  It's not clear I can recover at that
> point -- I guess I have to check against gso_segs in the xmit routine
> before actually doing the segmentation.

I'd suggest adding a fudge factor to your free TX space, which
is advisable anyways so that when TX is woken up, more of
the transfer from queue to device can happen in a batch-like
fashion.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 20:18             ` David Miller
@ 2007-05-15 20:52               ` Roland Dreier
  2007-05-15 21:48                 ` Michael Chan
  0 siblings, 1 reply; 96+ messages in thread
From: Roland Dreier @ 2007-05-15 20:52 UTC (permalink / raw)
  To: David Miller; +Cc: ak, krkumar2, netdev

 > > I'll have to think about implementing that for IPoIB.  One issue I see
 > > is if I have, say, 4 free entries in my send queue and skb_gso_segment()
 > > gives me back 5 packets to send.  It's not clear I can recover at that
 > > point -- I guess I have to check against gso_segs in the xmit routine
 > > before actually doing the segmentation.
 > 
 > I'd suggest adding a fudge factor to your free TX space, which
 > is advisable anyways so that when TX is woken up, more of
 > the transfer from queue to device can happen in a batch-like
 > fashion.

Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
queue is free, so we should get batching.  However, I'm not sure that
I can count on a fudge factor ensuring that there's enough space to
handle everything skb_gso_segment() gives me -- is there any reliable
way to get an upper bound on how many segments a given gso skb will
use when it's segmented?

 - R.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 21:48                 ` Michael Chan
@ 2007-05-15 21:08                   ` Roland Dreier
  2007-05-15 22:05                     ` Michael Chan
  0 siblings, 1 reply; 96+ messages in thread
From: Roland Dreier @ 2007-05-15 21:08 UTC (permalink / raw)
  To: Michael Chan; +Cc: David Miller, ak, krkumar2, netdev

 > > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
 > > queue is free, so we should get batching.  However, I'm not sure that
 > > I can count on a fudge factor ensuring that there's enough space to
 > > handle everything skb_gso_segment() gives me -- is there any reliable
 > > way to get an upper bound on how many segments a given gso skb will
 > > use when it's segmented?
 > 
 > Take a look at tg3.c.  I use (gso_segs * 3) as the upper bound.

Thanks for the pointer... I noticed that code, but could you tell me
where the "* 3" comes from?

 - R.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found] <OF0CAD6D87.DBE62968-ON872572DC.0073646A-882572DC.0073BEC2@us.ibm.com>
@ 2007-05-15 21:17 ` Roland Dreier
       [not found]   ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
  0 siblings, 1 reply; 96+ messages in thread
From: Roland Dreier @ 2007-05-15 21:17 UTC (permalink / raw)
  To: Shirley Ma; +Cc: David Miller, ak, krkumar2, netdev, netdev-owner

 >       I thought that to enable GSO, the device driver actually does nothing
 > other than enabling the flag. GSO moved TCP offloading to the interface
 > layer before the device xmit. It's a different idea from multiple packets
 > per xmit. GSO still queues the packets one by one in the QDISC and xmits
 > them one by one. The multiple-packets-per-xmit approach will xmit N packets
 > when there are N packets in the QDISC queue. Please correct me if wrong.

Current use of GSO does segmentation just above the netdevice driver.
However there's nothing that prevents it from being used to push
segmentation into the driver -- as I described, just set NETIF_F_GSO_SOFTWARE
and do skb_gso_segment() within the driver.

As Michael Chan pointed out, tg3.c already does this to work around a
rare bug in the HW TSO implementation.  For IPoIB we could do it all
the time to get multiple send work requests from one call to the xmit
method.
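
For IPoIB the shape would be roughly the following (a sketch only:
post_one_send() is a stand-in for the real IPoIB send path and error
handling is trimmed; the skb_gso_segment() usage mirrors the tg3 workaround):

static int my_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct sk_buff *segs, *next;

        if (!skb_is_gso(skb)) {
                post_one_send(dev, skb, 1);     /* 1 => signal completion */
                return NETDEV_TX_OK;
        }

        segs = skb_gso_segment(skb, 0);         /* segment fully in software */
        if (!IS_ERR(segs)) {
                while (segs) {
                        next = segs->next;
                        segs->next = NULL;
                        /* request a completion only for the last segment */
                        post_one_send(dev, segs, next == NULL);
                        segs = next;
                }
        }
        dev_kfree_skb_any(skb);                 /* original gso skb, as tg3 does */
        return NETDEV_TX_OK;
}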

 - R.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]   ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
@ 2007-05-15 21:25     ` Roland Dreier
       [not found]       ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
  2007-05-15 21:32     ` David Miller
  1 sibling, 1 reply; 96+ messages in thread
From: Roland Dreier @ 2007-05-15 21:25 UTC (permalink / raw)
  To: Shirley Ma; +Cc: ak, David Miller, krkumar2, netdev, netdev-owner

    Shirley>       I just wonder, without TSO support in HW, how much
    Shirley> benefit we can get by pushing GSO from the interface layer to
    Shirley> the device layer, besides being able to do multiple packets in
    Shirley> IPoIB.

The entire benefit comes from having multiple packets to queue in one
call to the xmit method.

 - R.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 22:05                     ` Michael Chan
@ 2007-05-15 21:28                       ` David Miller
  2007-05-18  7:04                         ` Michael Chan
  0 siblings, 1 reply; 96+ messages in thread
From: David Miller @ 2007-05-15 21:28 UTC (permalink / raw)
  To: mchan; +Cc: rdreier, ak, krkumar2, netdev

From: "Michael Chan" <mchan@broadcom.com>
Date: Tue, 15 May 2007 15:05:28 -0700

> On Tue, 2007-05-15 at 14:08 -0700, Roland Dreier wrote:
> >  > > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
> >  > > queue is free, so we should get batching.  However, I'm not sure that
> >  > > I can count on a fudge factor ensuring that there's enough space to
> >  > > handle everything skb_gso_segment() gives me -- is there any reliable
> >  > > way to get an upper bound on how many segments a given gso skb will
> >  > > use when it's segmented?
> >  > 
> >  > Take a look at tg3.c.  I use (gso_segs * 3) as the upper bound.
> > 
> > Thanks for the pointer... I noticed that code, but could you tell me
> > where the "* 3" comes from?
> > 
> For each gso_seg, there will be a header and the payload may span 2
> pages for 1500-byte packets.  We always assume 1500-byte packets because
> the buggy chips do not support jumbo frames.

Correct.

I think there may be a case where you could see up to 4 segments.
If the user corks the TCP socket, does a sendmsg() (which puts
the data in the per-socket page) then does a sendfile() you'll
see something like:

skb->data	IP, TCP, ethernet headers, etc.
page0		sendmsg() data
page1		sendfile
page2		sendfile

Ie. this can happen if the sendfile() part starts near the
end of a page, so it would get split even for a 1500 MTU
frame.

Even more complex variants are possible if the user does
tiny sendfile() requests to different pages within the file.

So in fact it can span up to N pages.

But there is an upper limit defined by the original GSO
frame, and that is controlled by MAX_SKB_FRAGS, so at most
you would see MAX_SKB_FRAGS plus some small constant.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]   ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
  2007-05-15 21:25     ` Roland Dreier
@ 2007-05-15 21:32     ` David Miller
  2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
                         ` (2 more replies)
  1 sibling, 3 replies; 96+ messages in thread
From: David Miller @ 2007-05-15 21:32 UTC (permalink / raw)
  To: xma; +Cc: rdreier, ak, krkumar2, netdev, netdev-owner

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 14:22:57 -0700

>       I just wonder, without TSO support in HW, how much benefit we
> can get by pushing GSO from the interface layer to the device layer,
> besides being able to do multiple packets in IPoIB.

I bet the gain is non-trivial.

I'd say about half of the gain from TSO comes from only calling down
into the driver from TCP one time as opposed to N times.  That's
the majority of the "CPU work" involved in TCP sending.

The rest of the gain comes from only transmitting the packet headers
once rather than N times, which conserves I/O bus bandwidth.

GSO will not help the case of lots of UDP applications sending
small packets, or something like that.  An efficient qdisc-->driver
transfer during netif_wake_queue() could help solve some of that,
as is being discussed here.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]       ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
@ 2007-05-15 21:38         ` David Miller
  0 siblings, 0 replies; 96+ messages in thread
From: David Miller @ 2007-05-15 21:38 UTC (permalink / raw)
  To: xma; +Cc: rdreier, ak, krkumar2, netdev, netdev-owner

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 14:35:13 -0700

>       I would think it might be better to implement dequeueN in the
> interface xmit, which will benefit multiple streams as well, not just
> large packets.

I think it is beneficial to have both, each of them helps with
different aspects of packet sending overhead.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 20:52               ` Roland Dreier
@ 2007-05-15 21:48                 ` Michael Chan
  2007-05-15 21:08                   ` Roland Dreier
  0 siblings, 1 reply; 96+ messages in thread
From: Michael Chan @ 2007-05-15 21:48 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David Miller, ak, krkumar2, netdev

On Tue, 2007-05-15 at 13:52 -0700, Roland Dreier wrote:

> Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
> queue is free, so we should get batching.  However, I'm not sure that
> I can count on a fudge factor ensuring that there's enough space to
> handle everything skb_gso_segment() gives me -- is there any reliable
> way to get an upper bound on how many segments a given gso skb will
> use when it's segmented?

Take a look at tg3.c.  I use (gso_segs * 3) as the upper bound.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 21:08                   ` Roland Dreier
@ 2007-05-15 22:05                     ` Michael Chan
  2007-05-15 21:28                       ` David Miller
  0 siblings, 1 reply; 96+ messages in thread
From: Michael Chan @ 2007-05-15 22:05 UTC (permalink / raw)
  To: Roland Dreier; +Cc: David Miller, ak, krkumar2, netdev

On Tue, 2007-05-15 at 14:08 -0700, Roland Dreier wrote:
>  > > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
>  > > queue is free, so we should get batching.  However, I'm not sure that
>  > > I can count on a fudge factor ensuring that there's enough space to
>  > > handle everything skb_gso_segment() gives me -- is there any reliable
>  > > way to get an upper bound on how many segments a given gso skb will
>  > > use when it's segmented?
>  > 
>  > Take a look at tg3.c.  I use (gso_segs * 3) as the upper bound.
> 
> Thanks for the pointer... I noticed that code, but could you tell me
> where the "* 3" comes from?
> 
For each gso_seg, there will be a header and the payload may span 2
pages for 1500-byte packets.  We always assume 1500-byte packets because
the buggy chips do not support jumbo frames.


^ permalink raw reply	[flat|nested] 96+ messages in thread

* [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 21:32     ` David Miller
@ 2007-05-15 22:17       ` jamal
  2007-05-15 22:48         ` jamal
  2007-05-16 22:12         ` Sridhar Samudrala
       [not found]       ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
  2007-05-21  7:56       ` Herbert Xu
  2 siblings, 2 replies; 96+ messages in thread
From: jamal @ 2007-05-15 22:17 UTC (permalink / raw)
  To: David Miller
  Cc: xma, rdreier, ak, krkumar2, netdev, netdev-owner, ashwin.chaugule,
	Evgeniy Polyakov, Gagan Arneja

[-- Attachment #1: Type: text/plain, Size: 2196 bytes --]

On Tue, 2007-15-05 at 14:32 -0700, David Miller wrote:
>  An efficient qdisc-->driver
> transfer during netif_wake_queue() could help solve some of that,
> as is being discussed here.

Ok, here's the approach i discussed at netconf.
It needs net-2.6 and the patch i posted earlier to clean up
qdisc_restart() [1].
I haven't ported over all the bits from 2.6.18, but this works.
Krishna and i have colluded privately on working together. I just need
to reproduce the patches, so here is the core.
A lot of the code in the core could be aggregated later - right now i am
worried about correctness.
I will post a patch for the tun device in a few minutes
that i use to test on my laptop (i need to remove some debugs) to show
an example.
I also plan to post a patch for e1000 - but that will take more
than a few minutes.
The e1000 driver has changed quite a bit since 2.6.18, so it is
time-consuming.

What does a driver need to do to get batched-to?

1) On initialization (probably in probe):
 a) set NETIF_F_BTX in its dev->features at startup,
    i.e. dev->features |= NETIF_F_BTX
 b) initialize the batch queue, i.e. something like
    skb_queue_head_init(&dev->blist);
 c) set dev->xmit_win to something reasonable, like
    maybe half the DMA ring size or dev->tx_queue_len

2) Create a new method for batch xmit.
   This loops on dev->blist and stashes onto hardware.
   All return codes like NETDEV_TX_OK etc still apply.

3) Set dev->xmit_win, which provides hints on how much
   data to send from the core to the driver. Some suggestions:
 a) on doing a netif_stop, set it to 1
 b) on netif_wake_queue, set it to the max available space

Of course, for this to work, the driver needs to have a threshold for
waking up the tx path, like drivers such as e1000 or tg3 do in order
to invoke netif_wake_queue (for example, look at TX_WAKE_THRESHOLD
usage in e1000).  A condensed sketch of these hooks is below.
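
This is illustrative only - it assumes the fields this patch adds
(NETIF_F_BTX, dev->blist, dev->xmit_win, dev->hard_batch_xmit); the
foo_* names and FOO_TX_WAKE_THRESHOLD are hypothetical driver
internals, stubbed here just so the sketch is self-contained:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

#define FOO_TX_WAKE_THRESHOLD	32	/* made-up wake threshold */

/* Hypothetical driver internals, stubbed for the sketch. */
static int foo_queue_to_ring(struct net_device *dev, struct sk_buff *skb)
{
	/* map the skb, fill descriptors, kick DMA ... */
	return NETDEV_TX_OK;
}

static int foo_ring_space(struct net_device *dev)
{
	return 64;	/* stand-in for an E1000_DESC_UNUSED()-style count */
}

/* Step 2: the batch xmit method.  Loops on the batch list and stashes
 * packets onto hardware; NETDEV_TX_* return codes still apply.
 */
static int foo_batch_xmit(struct sk_buff_head *list, struct net_device *dev)
{
	struct sk_buff *skb;

	while ((skb = __skb_dequeue(list)) != NULL) {
		if (foo_queue_to_ring(dev, skb) == NETDEV_TX_BUSY) {
			__skb_queue_head(list, skb);	/* put it back */
			break;
		}
	}

	/* Step 3: keep the xmit_win hint up to date */
	if (foo_ring_space(dev) < FOO_TX_WAKE_THRESHOLD) {
		dev->xmit_win = 1;
		netif_stop_queue(dev);
		return NETDEV_TX_BUSY;
	}
	dev->xmit_win = foo_ring_space(dev);
	return NETDEV_TX_OK;
}

/* Step 1: advertise and initialize at probe time */
static void foo_init_batching(struct net_device *dev)
{
	dev->features |= NETIF_F_BTX;
	skb_queue_head_init(&dev->blist);
	dev->xmit_win = dev->tx_queue_len >> 1;
	dev->hard_batch_xmit = foo_batch_xmit;
}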


feedback welcome (preferably in the form of patches).

Anyone with a really nice tool to measure the CPU improvement would help
a great deal in quantifying things. As i have said earlier, I never saw
any throughput improvement; but like TSO/GSO it may just be CPU savings
(as was suggested at netconf).

cheers,
jamal
[1] http://marc.info/?l=linux-netdev&m=117914954911959&w=2


[-- Attachment #2: batch0 --]
[-- Type: text/x-patch, Size: 4512 bytes --]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f671cd2..7205748 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -325,6 +325,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BTX		8192	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -450,6 +451,11 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_batch_xmit) (struct sk_buff_head *list,
+						    struct net_device *dev);
+	int			(*hard_prep_xmit) (struct sk_buff *skb,
+						    struct net_device *dev);
+	int			xmit_win;
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -466,6 +472,10 @@ struct net_device
 	struct list_head	todo_list;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
+	/*XXX: Fix eventually to not allocate if device not
+	 *batch capable
+	*/
+	struct sk_buff_head	blist;
 
 	struct net_device	*link_watch_next;
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ed80054..61fa301 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -85,10 +85,12 @@ static inline int
 do_dev_requeue(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q)
 {
 
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (skb) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 	/* XXX: Could netif_schedule fail? Or is that fact we are
 	 * requeueing imply the hardware path is closed
 	 * and even if we fail, some interupt will wake us
@@ -116,7 +118,10 @@ tx_islocked(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q)
 	int ret = handle_dev_cpu_collision(dev);
 
 	if (ret == SCHED_TX_DROP) {
-		kfree_skb(skb);
+		if (skb) /* we are not batching */
+			kfree_skb(skb);
+		else if (!skb_queue_empty(&dev->blist))
+			skb_queue_purge(&dev->blist);
 		return qdisc_qlen(q);
 	}
 
@@ -195,10 +200,99 @@ static inline int qdisc_restart(struct net_device *dev)
 	return do_dev_requeue(skb, dev, q);
 }
 
+static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *skbs = &dev->blist;
+	int tdq = count;
+
+	/* 
+	 * very unlikely, but who knows ..
+	 * If this happens we dont try to grab more pkts
+	 */
+	if (!skb_queue_empty(&dev->blist))
+		return skb_queue_len(&dev->blist);
+
+	if (dev->gso_skb) {
+		count--;
+		__skb_queue_head(skbs, dev->gso_skb);
+		dev->gso_skb = NULL;
+	}
+
+	while (count) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+		count--;
+		__skb_queue_head(skbs, skb);
+	}
+
+	return tdq - count;
+}
+
+static inline int try_tx_pkts(struct net_device *dev)
+{
+
+	return dev->hard_batch_xmit(&dev->blist, dev);
+
+}
+
+/* same comments as in qdisc_restart apply;
+ * at some point use shared code with qdisc_restart*/
+int batch_qdisc_restart(struct net_device *dev)
+{
+	struct Qdisc *q = dev->qdisc;
+	unsigned lockless = (dev->features & NETIF_F_LLTX);
+	int count = dev->xmit_win;
+	int ret = 0;
+
+	ret = try_get_tx_pkts(dev, q, count);
+
+	if (ret == 0)
+		return qdisc_qlen(q);
+
+	/* we have packets to send! */
+	if (!lockless) {
+		if (!netif_tx_trylock(dev))
+			return tx_islocked(NULL, dev, q);
+	}
+
+	/* all clear .. */
+	spin_unlock(&dev->queue_lock);
+
+	ret = NETDEV_TX_BUSY;
+	if (!netif_queue_stopped(dev))
+		ret = try_tx_pkts(dev);
+
+	if (!lockless)
+		netif_tx_unlock(dev);
+
+	spin_lock(&dev->queue_lock);
+
+	q = dev->qdisc;
+
+	/* most likely result, packet went ok */
+	if (ret == NETDEV_TX_OK)
+		return qdisc_qlen(q);
+	/* only for lockless drivers .. */
+	if (ret == NETDEV_TX_LOCKED && lockless)
+		return tx_islocked(NULL, dev, q);
+
+	if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
+		printk(KERN_WARNING " BUG %s code %d qlen %d\n",
+		       dev->name, ret, q->q.qlen);
+
+	return do_dev_requeue(NULL, dev, q);
+}
+
 void __qdisc_run(struct net_device *dev)
 {
+	unsigned batching = (dev->features & NETIF_F_BTX);
+
 	do {
-		if (!qdisc_restart(dev))
+		if (!batching && !qdisc_restart(dev))
+			break;
+		else if (!batch_qdisc_restart(dev))
 			break;
 	} while (!netif_queue_stopped(dev));
 

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
@ 2007-05-15 22:48         ` jamal
  2007-05-16  0:50           ` jamal
  2007-05-16 22:12         ` Sridhar Samudrala
  1 sibling, 1 reply; 96+ messages in thread
From: jamal @ 2007-05-15 22:48 UTC (permalink / raw)
  To: David Miller
  Cc: xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule,
	Evgeniy Polyakov, Max Krasnyansky, Gagan Arneja

[-- Attachment #1: Type: text/plain, Size: 681 bytes --]

On Tue, 2007-15-05 at 18:17 -0400, jamal wrote:

> I will post a patch for tun device in a few minutes
> that i use to test on my laptop (i need to remove some debugs) to show
> an example.

Ok, here it is. 
The way i test is to point packets at a tun device. [One way i do it
is attach an ingress qdisc on lo; attach a u32 filter to match all;
on match redirect to the tun device].
The user space program reading sleeps for about a second every 20
packets or so. This forces things to accumulate in the driver's queue.
Backpressure builds up and the throttling effect is really nice to see
working.

I will try to post the e1000 patch tonight or tomorrow morning.

cheers,
jamal



[-- Attachment #2: tun-b --]
[-- Type: text/x-patch, Size: 3734 bytes --]

diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a2c6caa..076f794 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -70,6 +70,7 @@
 static int debug;
 #endif
 
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
 /* Network device part of the driver */
 
 static LIST_HEAD(tun_dev_list);
@@ -86,9 +87,53 @@ static int tun_net_open(struct net_device *dev)
 static int tun_net_close(struct net_device *dev)
 {
 	netif_stop_queue(dev);
+	//skb_queue_purge(&dev->blist);
 	return 0;
 }
 
+/* Batch Net device start xmit
+ * combine with non-batching version
+ * */
+static int tun_net_bxmit(struct sk_buff_head *skbs, struct net_device *dev)
+{
+	struct sk_buff *skb;
+	struct tun_struct *tun = netdev_priv(dev);
+	u32 qlen = skb_queue_len(&tun->readq);
+
+	/* Drop packet if interface is not attached */
+	if (!tun->attached) {
+		tun->stats.tx_dropped+=skb_queue_len(&dev->blist);
+		skb_queue_purge(&dev->blist);
+		return NETDEV_TX_OK;
+	}
+
+	while (skb_queue_len(&dev->blist)) {
+		skb = __skb_dequeue(skbs);
+		if (!skb)
+			break;
+		skb_queue_tail(&tun->readq, skb);
+	}
+
+	qlen = skb_queue_len(&tun->readq);
+	if (qlen >= dev->tx_queue_len) {
+		netif_stop_queue(dev);
+		tun->stats.tx_fifo_errors++;
+		dev->xmit_win = 1;
+	} else {
+		dev->xmit_win = dev->tx_queue_len - qlen;
+	}
+
+	/* Queue packet */
+	dev->trans_start = jiffies;
+
+	/* Notify and wake up reader process */
+	if (tun->flags & TUN_FASYNC)
+		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
+	wake_up_interruptible(&tun->read_wait);
+
+	return NETDEV_TX_OK;
+}
+
 /* Net device start xmit */
 static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 {
@@ -207,6 +252,7 @@ static void tun_net_init(struct net_device *dev)
 		dev->tx_queue_len = TUN_READQ_SIZE;  /* We prefer our own queue length */
 		break;
 	}
+	dev->xmit_win = dev->tx_queue_len>>1; /* handwave, handwave */
 }
 
 /* Character device part */
@@ -382,7 +428,13 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
 			schedule();
 			continue;
 		}
-		netif_wake_queue(tun->dev);
+		{
+			u32 t = skb_queue_len(&tun->readq);
+			if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
+				tun->dev->xmit_win = tun->dev->tx_queue_len;
+				netif_wake_queue(tun->dev);
+			}
+		}
 
 		/** Decide whether to accept this packet. This code is designed to
 		 * behave identically to an Ethernet interface. Accept the packet if
@@ -429,6 +481,7 @@ static void tun_setup(struct net_device *dev)
 	struct tun_struct *tun = netdev_priv(dev);
 
 	skb_queue_head_init(&tun->readq);
+	skb_queue_head_init(&dev->blist);
 	init_waitqueue_head(&tun->read_wait);
 
 	tun->owner = -1;
@@ -436,6 +489,8 @@ static void tun_setup(struct net_device *dev)
 	SET_MODULE_OWNER(dev);
 	dev->open = tun_net_open;
 	dev->hard_start_xmit = tun_net_xmit;
+	dev->hard_prep_xmit = NULL;
+	dev->hard_batch_xmit = tun_net_bxmit;
 	dev->stop = tun_net_close;
 	dev->get_stats = tun_net_stats;
 	dev->ethtool_ops = &tun_ethtool_ops;
@@ -458,7 +513,7 @@ static struct tun_struct *tun_get_by_name(const char *name)
 static int tun_set_iff(struct file *file, struct ifreq *ifr)
 {
 	struct tun_struct *tun;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	int err;
 
 	tun = tun_get_by_name(ifr->ifr_name);
@@ -528,12 +583,15 @@ static int tun_set_iff(struct file *file, struct ifreq *ifr)
 	}
 
 	DBG(KERN_INFO "%s: tun_set_iff\n", tun->dev->name);
+	dev->features |= NETIF_F_BTX;
 
 	if (ifr->ifr_flags & IFF_NO_PI)
 		tun->flags |= TUN_NO_PI;
 
-	if (ifr->ifr_flags & IFF_ONE_QUEUE)
+	if (ifr->ifr_flags & IFF_ONE_QUEUE) {
 		tun->flags |= TUN_ONE_QUEUE;
+		dev->features &= ~NETIF_F_BTX;
+	}
 
 	file->private_data = tun;
 	tun->attached = 1;

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]       ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
@ 2007-05-15 23:36         ` David Miller
  0 siblings, 0 replies; 96+ messages in thread
From: David Miller @ 2007-05-15 23:36 UTC (permalink / raw)
  To: xma; +Cc: ak, krkumar2, netdev, netdev-owner, rdreier

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 16:33:22 -0700

>       That's interesting. So a generic LRO in the interface layer will benefit
> the performance more, right? Receiving path TCP N times is more expensive
> than sending, I think.

If you look at some of the drivers doing LRO, the bulk of the
implementation is in software, so yes :-)

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 22:48         ` jamal
@ 2007-05-16  0:50           ` jamal
  0 siblings, 0 replies; 96+ messages in thread
From: jamal @ 2007-05-16  0:50 UTC (permalink / raw)
  To: David Miller
  Cc: xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule,
	Evgeniy Polyakov, Max Krasnyansky, Gagan Arneja

On Tue, 2007-15-05 at 18:48 -0400, jamal wrote:

> I will try to post the e1000 patch tonight or tommorow morning.

I have the e1000 patch done; a few features from the 2.6.18 version are
missing (mainly the one mucking with tx ring pruning on the tx path).
While it compiles and looks right - i haven't tested it and won't have
time for another day or so. However, if anyone wants it in its current
form - let me know and i will email you privately.

cheers,
jamal 


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-17  4:03           ` Krishna Kumar2
@ 2007-05-16 21:44             ` Sridhar Samudrala
  2007-05-17  5:01               ` Krishna Kumar2
  0 siblings, 1 reply; 96+ messages in thread
From: Sridhar Samudrala @ 2007-05-16 21:44 UTC (permalink / raw)
  To: Krishna Kumar2
  Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi,
	Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma

Krishna Kumar2 wrote:
> Hi Sridhar,
> 
> Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM:
> 
>> AFAIK, gso_skb can be a list of skb's. Can we add a list
>> to another list using __skb_queue_head()?
>> Also, if gso_skb is a list of multiple skb's, i think the
>> count needs to be decremented by the number of segments in
>> gso_skb.
> 
> gso_skb is the last GSO skb that failed to be sent. This already
> segmented skb is kept "cached" and whenever the next xmit happens,
> this skb is first sent before any packets from the queue are taken
> out (otherwise out-of-order packets). So there can at most be one
> gso_skb per device.

Yes. There can be only one gso_skb per device. But it can have more
than one segment linked together via skb->next.

Thanks
Sridhar


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
  2007-05-15 22:48         ` jamal
@ 2007-05-16 22:12         ` Sridhar Samudrala
  2007-05-16 22:52           ` jamal
  2007-05-17  4:03           ` Krishna Kumar2
  1 sibling, 2 replies; 96+ messages in thread
From: Sridhar Samudrala @ 2007-05-16 22:12 UTC (permalink / raw)
  To: hadi
  Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, netdev-owner,
	ashwin.chaugule, Evgeniy Polyakov, Gagan Arneja

Jamal,

Here are some comments i have on your patch.
See them inline.

Thanks
Sridhar

+static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count)
+{
+       struct sk_buff *skb;
+       struct sk_buff_head *skbs = &dev->blist;
+       int tdq = count;
+
+       /* 
+        * very unlikely, but who knows ..
+        * If this happens we dont try to grab more pkts
+        */
+       if (!skb_queue_empty(&dev->blist))
+               return skb_queue_len(&dev->blist);
+
+       if (dev->gso_skb) {
+               count--;
+               __skb_queue_head(skbs, dev->gso_skb);
+               dev->gso_skb = NULL;
+       }

AFAIK, gso_skb can be a list of skb's. Can we add a list
to another list using __skb_queue_head()?
Also, if gso_skb is a list of multiple skb's, i think the
count needs to be decremented by the number of segments in
gso_skb.

+
+       while (count) {
+               skb = q->dequeue(q);
+               if (!skb)
+                       break;
+               count--;
+               __skb_queue_head(skbs, skb);
+       }
+
+       return tdq - count;
+}
+
+static inline int try_tx_pkts(struct net_device *dev)
+{
+
+       return dev->hard_batch_xmit(&dev->blist, dev);
+
+}
+
+/* same comments as in qdisc_restart apply;
+ * at some point use shared code with qdisc_restart*/
+int batch_qdisc_restart(struct net_device *dev)
+{
+       struct Qdisc *q = dev->qdisc;
+       unsigned lockless = (dev->features & NETIF_F_LLTX);
+       int count = dev->xmit_win;
+       int ret = 0;
+
+       ret = try_get_tx_pkts(dev, q, count);
+
+       if (ret == 0)
+               return qdisc_qlen(q);
+
+       /* we have packets to send! */
+       if (!lockless) {
+               if (!netif_tx_trylock(dev))
+                       return tx_islocked(NULL, dev, q);
+       }
+
+       /* all clear .. */
+       spin_unlock(&dev->queue_lock);
+
+       ret = NETDEV_TX_BUSY;
+       if (!netif_queue_stopped(dev))
+               ret = try_tx_pkts(dev);

try_tx_pkts() is directly calling the device's batch xmit routine.
Don't we need to call dev_hard_start_xmit() to handle dev_queue_xmit_nit
and GSO segmentation?

+
+       if (!lockless)
+               netif_tx_unlock(dev);
+
+       spin_lock(&dev->queue_lock);
+
+       q = dev->qdisc;
+
+       /* most likely result, packet went ok */
+       if (ret == NETDEV_TX_OK)
+               return qdisc_qlen(q);
+       /* only for lockless drivers .. */
+       if (ret == NETDEV_TX_LOCKED && lockless)
+               return tx_islocked(NULL, dev, q);
+
+       if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
+               printk(KERN_WARNING " BUG %s code %d qlen %d\n",
+                      dev->name, ret, q->q.qlen);
+
+       return do_dev_requeue(NULL, dev, q);
+}




^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-16 22:12         ` Sridhar Samudrala
@ 2007-05-16 22:52           ` jamal
  2007-05-17  3:25             ` jamal
  2007-05-17  4:03           ` Krishna Kumar2
  1 sibling, 1 reply; 96+ messages in thread
From: jamal @ 2007-05-16 22:52 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, netdev-owner,
	ashwin.chaugule, Evgeniy Polyakov, Gagan Arneja

On Wed, 2007-16-05 at 15:12 -0700, Sridhar Samudrala wrote:
> Jamal,
> 
> Here are some comments i have on your patch.
> See them inline.
> 

Thanks for taking the time Sridhar.

> try_tx_pkts() is directly calling the device's batch xmit routine.
> Don't we need to call dev_hard_start_xmit() to handle dev_queue_xmit_nit
> and GSO segmentation?
> 

I think this is the core of all your other comments.
Good catch - that piece of code has changed since 2.6.18, so the patch
i posted won't work with GSO.

You actually caught a bug in my other patch as well that nobody else
did ;->

I will have to think a bit about this; i may end up coalescing when
grabbing the packets but calling the nit from the driver using a helper.

cheers,
jamal


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-16 22:52           ` jamal
@ 2007-05-17  3:25             ` jamal
  2007-05-18 12:07               ` jamal
  0 siblings, 1 reply; 96+ messages in thread
From: jamal @ 2007-05-17  3:25 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule,
	Evgeniy Polyakov, Auke, Gagan Arneja

[-- Attachment #1: Type: text/plain, Size: 808 bytes --]

On Wed, 2007-16-05 at 18:52 -0400, jamal wrote:
> On Wed, 2007-16-05 at 15:12 -0700, Sridhar Samudrala wrote:

> 
> I will have to think a bit about this; i may end up coalescing when
> grabbing the packets but call the nit from the driver using a helper.
> 

That's what i did. This would hopefully work with GSO now (in fact nit
now works with GSO when it didn't before).

This patch now includes two changed drivers (tun and e1000). I have
tested tun with this patch. I tested e1000 earlier and i couldn't find
any issues - although, as the title says, it's a WIP.

As before you need net-2.6. You also need the qdisc restart cleanup
patch.

Please comment.

If all is good, I think my next efforts will be to convert pktgen to be
aware of the api so we can have some serious traffic generation.

cheers,
jamal

[-- Attachment #2: combined-batch-p --]
[-- Type: text/x-patch, Size: 25942 bytes --]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 637ae8f..b4c900e 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -154,6 +154,8 @@ static void e1000_update_phy_info(unsigned long data);
 static void e1000_watchdog(unsigned long data);
 static void e1000_82547_tx_fifo_stall(unsigned long data);
 static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev);
+static int e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *dev);
+static int e1000_xmit_frames(struct sk_buff_head *list, struct net_device *dev);
 static struct net_device_stats * e1000_get_stats(struct net_device *netdev);
 static int e1000_change_mtu(struct net_device *netdev, int new_mtu);
 static int e1000_set_mac(struct net_device *netdev, void *p);
@@ -932,6 +934,8 @@ e1000_probe(struct pci_dev *pdev,
 	netdev->open = &e1000_open;
 	netdev->stop = &e1000_close;
 	netdev->hard_start_xmit = &e1000_xmit_frame;
+	netdev->hard_prep_xmit = &e1000_prep_queue_frame;
+	netdev->hard_batch_xmit = &e1000_xmit_frames;
 	netdev->get_stats = &e1000_get_stats;
 	netdev->set_multicast_list = &e1000_set_multi;
 	netdev->set_mac_address = &e1000_set_mac;
@@ -940,6 +944,7 @@ e1000_probe(struct pci_dev *pdev,
 	e1000_set_ethtool_ops(netdev);
 	netdev->tx_timeout = &e1000_tx_timeout;
 	netdev->watchdog_timeo = 5 * HZ;
+	skb_queue_head_init(&netdev->blist);
 #ifdef CONFIG_E1000_NAPI
 	netdev->poll = &e1000_clean;
 	netdev->weight = 64;
@@ -998,6 +1003,7 @@ e1000_probe(struct pci_dev *pdev,
 		netdev->features |= NETIF_F_HIGHDMA;
 
 	netdev->features |= NETIF_F_LLTX;
+	netdev->features |= NETIF_F_BTX;
 
 	adapter->en_mng_pt = e1000_enable_mng_pass_thru(&adapter->hw);
 
@@ -1155,6 +1161,7 @@ e1000_probe(struct pci_dev *pdev,
 	if ((err = register_netdev(netdev)))
 		goto err_register;
 
+	netdev->xmit_win = adapter->tx_ring->count>>1;
 	/* tell the stack to leave us alone until e1000_open() is called */
 	netif_carrier_off(netdev);
 	netif_stop_queue(netdev);
@@ -1449,6 +1456,7 @@ e1000_open(struct net_device *netdev)
 	/* fire a link status change interrupt to start the watchdog */
 	E1000_WRITE_REG(&adapter->hw, ICS, E1000_ICS_LSC);
 
+	printk("%s Batch window is %d\n",netdev->name, netdev->xmit_win);
 	return E1000_SUCCESS;
 
 err_req_irq:
@@ -1503,6 +1511,7 @@ e1000_close(struct net_device *netdev)
 	    e1000_check_mng_mode(&adapter->hw))
 		e1000_release_hw_control(adapter);
 
+	skb_queue_purge(&netdev->blist);
 	return 0;
 }
 
@@ -3098,6 +3107,18 @@ e1000_tx_map(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
 }
 
 static void
+e1000_kick_DMA(struct e1000_adapter *adapter,
+               struct e1000_tx_ring *tx_ring, int i)
+{
+	wmb();
+	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
+	/* we need this if more than one processor can write to our tail
+	** at a time, it syncronizes IO on IA64/Altix systems */
+	mmiowb();
+}
+
+
+static void
 e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
                int tx_flags, int count)
 {
@@ -3139,17 +3160,7 @@ e1000_tx_queue(struct e1000_adapter *adapter, struct e1000_tx_ring *tx_ring,
 
 	tx_desc->lower.data |= cpu_to_le32(adapter->txd_cmd);
 
-	/* Force memory writes to complete before letting h/w
-	 * know there are new descriptors to fetch.  (Only
-	 * applicable for weak-ordered memory model archs,
-	 * such as IA-64). */
-	wmb();
-
 	tx_ring->next_to_use = i;
-	writel(i, adapter->hw.hw_addr + tx_ring->tdt);
-	/* we need this if more than one processor can write to our tail
-	 * at a time, it syncronizes IO on IA64/Altix systems */
-	mmiowb();
 }
 
 /**
@@ -3256,54 +3267,60 @@ static int e1000_maybe_stop_tx(struct net_device *netdev,
 }
 
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
+struct e1000_tx_cbdata {
+	int count;
+	unsigned int max_per_txd;
+	unsigned int nr_frags;
+	unsigned int mss;
+};
+
+#define E1000_SKB_CB(__skb)       ((struct e1000_tx_cbdata *)&((__skb)->cb[0]))
+#define NETDEV_TX_DROPPED	-5
+
 static int
-e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+e1000_prep_queue_frame(struct sk_buff *skb, struct net_device *netdev)
 {
-	struct e1000_adapter *adapter = netdev_priv(netdev);
 	struct e1000_tx_ring *tx_ring;
-	unsigned int first, max_per_txd = E1000_MAX_DATA_PER_TXD;
+	unsigned int f;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
 	unsigned int max_txd_pwr = E1000_MAX_TXD_PWR;
-	unsigned int tx_flags = 0;
 	unsigned int len = skb->len;
-	unsigned long flags;
-	unsigned int nr_frags = 0;
-	unsigned int mss = 0;
-	int count = 0;
-	int tso;
-	unsigned int f;
+
+	struct e1000_tx_cbdata *cb = E1000_SKB_CB(skb);
+	cb->mss = 0;
+	cb->nr_frags = 0;
+	cb->max_per_txd = E1000_MAX_DATA_PER_TXD;
+	cb->count = 0;
+
 	len -= skb->data_len;
 
-	/* This goes back to the question of how to logically map a tx queue
-	 * to a flow.  Right now, performance is impacted slightly negatively
-	 * if using multiple tx queues.  If the stack breaks away from a
-	 * single qdisc implementation, we can look at this again. */
 	tx_ring = adapter->tx_ring;
 
 	if (unlikely(skb->len <= 0)) {
 		dev_kfree_skb_any(skb);
-		return NETDEV_TX_OK;
+		return NETDEV_TX_DROPPED;
 	}
 
-	/* 82571 and newer doesn't need the workaround that limited descriptor
-	 * length to 4kB */
+	/* 82571 and newer doesn't need the workaround that limited 
+	   descriptor length to 4kB */
 	if (adapter->hw.mac_type >= e1000_82571)
-		max_per_txd = 8192;
+		cb->max_per_txd = 8192;
 
-	mss = skb_shinfo(skb)->gso_size;
+	cb->mss = skb_shinfo(skb)->gso_size;
 	/* The controller does a simple calculation to
 	 * make sure there is enough room in the FIFO before
 	 * initiating the DMA for each buffer.  The calc is:
 	 * 4 = ceil(buffer len/mss).  To make sure we don't
 	 * overrun the FIFO, adjust the max buffer len if mss
 	 * drops. */
-	if (mss) {
+	if (cb->mss) {
 		uint8_t hdr_len;
-		max_per_txd = min(mss << 2, max_per_txd);
-		max_txd_pwr = fls(max_per_txd) - 1;
+		cb->max_per_txd = min(cb->mss << 2, cb->max_per_txd);
+		max_txd_pwr = fls(cb->max_per_txd) - 1;
 
 		/* TSO Workaround for 82571/2/3 Controllers -- if skb->data
-		* points to just header, pull a few bytes of payload from
-		* frags into skb->data */
+		 * points to just header, pull a few bytes of payload from
+		 * frags into skb->data */
 		hdr_len = skb_transport_offset(skb) + tcp_hdrlen(skb);
 		if (skb->data_len && (hdr_len == (skb->len - skb->data_len))) {
 			switch (adapter->hw.mac_type) {
@@ -3315,7 +3332,8 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 				 * NOTE: this is a TSO only workaround
 				 * if end byte alignment not correct move us
 				 * into the next dword */
-				if ((unsigned long)(skb_tail_pointer(skb) - 1) & 4)
+				if ((unsigned long)(skb_tail_pointer(skb) -
+						    1) & 4)
 					break;
 				/* fall through */
 			case e1000_82571:
@@ -3327,7 +3345,7 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 					DPRINTK(DRV, ERR,
 						"__pskb_pull_tail failed.\n");
 					dev_kfree_skb_any(skb);
-					return NETDEV_TX_OK;
+					return NETDEV_TX_DROPPED;
 				}
 				len = skb->len - skb->data_len;
 				break;
@@ -3339,46 +3357,56 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	}
 
 	/* reserve a descriptor for the offload context */
-	if ((mss) || (skb->ip_summed == CHECKSUM_PARTIAL))
-		count++;
-	count++;
+	if ((cb->mss) || (skb->ip_summed == CHECKSUM_PARTIAL))
+		cb->count++;
+	cb->count++;
 
 	/* Controller Erratum workaround */
 	if (!skb->data_len && tx_ring->last_tx_tso && !skb_is_gso(skb))
-		count++;
+		cb->count++;
 
-	count += TXD_USE_COUNT(len, max_txd_pwr);
+	cb->count += TXD_USE_COUNT(len, max_txd_pwr);
 
 	if (adapter->pcix_82544)
-		count++;
+		cb->count++;
 
 	/* work-around for errata 10 and it applies to all controllers
 	 * in PCI-X mode, so add one more descriptor to the count
 	 */
 	if (unlikely((adapter->hw.bus_type == e1000_bus_type_pcix) &&
-			(len > 2015)))
-		count++;
+		     (len > 2015)))
+		cb->count++;
 
-	nr_frags = skb_shinfo(skb)->nr_frags;
-	for (f = 0; f < nr_frags; f++)
-		count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size,
-				       max_txd_pwr);
+	cb->nr_frags = skb_shinfo(skb)->nr_frags;
+	for (f = 0; f < cb->nr_frags; f++)
+		cb->count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size,
+					   max_txd_pwr);
 	if (adapter->pcix_82544)
-		count += nr_frags;
-
+		cb->count += cb->nr_frags;
 
 	if (adapter->hw.tx_pkt_filtering &&
 	    (adapter->hw.mac_type == e1000_82573))
 		e1000_transfer_dhcp_info(adapter, skb);
 
-	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags))
-		/* Collision - tell upper layer to requeue */
-		return NETDEV_TX_LOCKED;
+	return NETDEV_TX_OK;
+}
+
+/* invoked under tx_ring->lock */
+static int e1000_queue_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+	struct e1000_tx_ring *tx_ring;
+	int tso;
+	unsigned int first;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	unsigned int tx_flags = 0;
+
+	struct e1000_tx_cbdata *cb = E1000_SKB_CB(skb);
+	tx_ring = adapter->tx_ring;
 
 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
-	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
-		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, cb->count + 2))) {
+		netif_stop_queue(netdev);
 		return NETDEV_TX_BUSY;
 	}
 
@@ -3386,7 +3414,6 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) {
 			netif_stop_queue(netdev);
 			mod_timer(&adapter->tx_fifo_stall_timer, jiffies + 1);
-			spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 			return NETDEV_TX_BUSY;
 		}
 	}
@@ -3401,8 +3428,7 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 	tso = e1000_tso(adapter, tx_ring, skb);
 	if (tso < 0) {
 		dev_kfree_skb_any(skb);
-		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
-		return NETDEV_TX_OK;
+		return NETDEV_TX_DROPPED;
 	}
 
 	if (likely(tso)) {
@@ -3418,16 +3444,157 @@ e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		tx_flags |= E1000_TX_FLAGS_IPV4;
 
 	e1000_tx_queue(adapter, tx_ring, tx_flags,
-	               e1000_tx_map(adapter, tx_ring, skb, first,
-	                            max_per_txd, nr_frags, mss));
+		       e1000_tx_map(adapter, tx_ring, skb, first,
+				    cb->max_per_txd, cb->nr_frags, cb->mss));
 
-	netdev->trans_start = jiffies;
+	return NETDEV_TX_OK;
+}
+
+/* called with tx_ring->lock held */
+static int real_e1000_xmit_frame(struct sk_buff *skb, struct net_device *dev)
+{
+	struct e1000_adapter *adapter = netdev_priv(dev);
+	int ret = NETDEV_TX_OK;
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+
+	ret = e1000_queue_frame(skb, dev);
+
+	if (ret == NETDEV_TX_OK) {
+		e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use);
+		dev->trans_start = jiffies;
+	}
+	if (ret == NETDEV_TX_DROPPED)
+		ret = NETDEV_TX_OK;
 
-	/* Make sure there is space in the ring for the next send. */
-	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
+	/* XXX: This seems so unnecessary, because if we are 
+	 * NETDEV_TX_BUSY already, we are already 
+	 * netif_queue_stopped(dev)
+	 * but its what the driver does, resolve later */
 
+	if (unlikely(e1000_maybe_stop_tx(dev, tx_ring, MAX_SKB_FRAGS + 2))) {
+		dev->xmit_win = 1;
+		netif_stop_queue(dev);
+		ret = NETDEV_TX_BUSY;
+	} else {
+		int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
+		dev->xmit_win = rspace;
+	}
+
+	if (ret == NETDEV_TX_BUSY)
+		printk("Single: %s stopped with win of %d\n",
+			dev->name,dev->xmit_win);
+	return ret;
+}
+
+/* single frame transmitter */
+static int e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
+{
+	int ret = NETDEV_TX_OK;
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+	unsigned long flags;
+	struct e1000_tx_cbdata *cb; 
+
+	/* hopefully we will never cb data > 48 bytes .. */
+	memset(skb->cb, 0, sizeof(skb->cb));
+	ret = netdev->hard_prep_xmit(skb, netdev);
+	if (ret != NETDEV_TX_OK)
+		return NETDEV_TX_OK;
+
+	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+		/* Collision - tell upper layer to requeue */
+		return NETDEV_TX_LOCKED;
+	}
+
+	cb = E1000_SKB_CB(skb);
+	/* need: count + 2 desc gap to keep tail from touching
+	 * head, otherwise try next time */
+	/* XXX: This seems so unnecessary, because if we are 
+	 * NETDEV_TX_BUSY already, we are already 
+	 * netif_queue_stopped(dev)
+	 * but its what the driver does, resolve later */
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, cb->count + 2))) {
+		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+		return NETDEV_TX_BUSY;
+	}
+
+	ret = real_e1000_xmit_frame(skb, netdev);
 	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
-	return NETDEV_TX_OK;
+	return ret;
+}
+
+/*
+ * Batch transmit
+*/
+static int
+e1000_xmit_frames(struct sk_buff_head *list, struct net_device *netdev)
+{
+	struct e1000_adapter *adapter = netdev->priv;
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+	int ret = NETDEV_TX_OK;
+	int didq = 0;
+	struct sk_buff *skb = NULL;
+	unsigned long flags;
+
+	/* 
+	 * we should probably wait for this lock!
+	 */
+	if (!spin_trylock_irqsave(&tx_ring->tx_lock, flags)) {
+		/* Collision - tell upper layer to requeue */
+		return NETDEV_TX_LOCKED;
+	}
+
+	while ((skb = __skb_dequeue(list)) != NULL) {
+		memset(skb->cb, 0, sizeof(skb->cb)); /* remove? */
+		ret = netdev->hard_prep_xmit(skb, netdev);
+		if (ret != NETDEV_TX_OK)
+			continue;
+
+		/*XXX: This may be an opportunity to not give nit
+		 * the packet if the dev ix TX BUSY ;-> */
+		dev_do_xmit_nit(skb, netdev);
+		ret = e1000_queue_frame(skb, netdev);
+		if (ret == NETDEV_TX_OK) {
+			didq++;
+		} else {
+			/* should never happen, but murphy is around */
+			if (ret == NETDEV_TX_BUSY) {
+				__skb_queue_head(list, skb);
+				    break;
+			}
+		}
+	}
+
+	/* we tried to send as many as we could .. */
+	if (didq) {
+		e1000_kick_DMA(adapter, tx_ring, adapter->tx_ring->next_to_use);
+		netdev->trans_start = jiffies;
+	}
+
+	if (ret == NETDEV_TX_DROPPED)
+		ret = NETDEV_TX_OK;
+
+	/* XXX: This seems so unnecessary, because if we are 
+	 * NETDEV_TX_BUSY already, we are already 
+	 * netif_queue_stopped(dev)
+	 * but its what the driver does, resolve later */
+	/* need: MAX_SKB_FRAGS + 2 desc gap to keep tail from touching
+	 * head, otherwise try next time */
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2))) {
+		netdev->xmit_win = 1;
+		netif_stop_queue(netdev);
+		ret = NETDEV_TX_BUSY;
+	} else {
+		int rspace = E1000_DESC_UNUSED(tx_ring) - (MAX_SKB_FRAGS + 2);
+		netdev->xmit_win = rspace;
+		printk("batch %s still awake with win of %d\n",
+			netdev->name, netdev->xmit_win);
+	}
+	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
+	if (ret == NETDEV_TX_BUSY)
+		printk("Batch: %s stopped with win of %d\n",
+			netdev->name, netdev->xmit_win);
+	return ret;
 }
 
 /**
@@ -4032,7 +4199,10 @@ e1000_clean_tx_irq(struct e1000_adapter *adapter,
 		 */
 		smp_mb();
 		if (netif_queue_stopped(netdev)) {
+			netdev->xmit_win = E1000_DESC_UNUSED(tx_ring);
 			netif_wake_queue(netdev);
+			printk(" %s woken with win of %d\n",
+				netdev->name,netdev->xmit_win);
 			++adapter->restart_queue;
 		}
 	}
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index a2c6caa..e128ae3 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -70,6 +70,7 @@
 static int debug;
 #endif
 
+#define NETDEV_LTT 4 /* the low threshold to open up the tx path */
 /* Network device part of the driver */
 
 static LIST_HEAD(tun_dev_list);
@@ -86,9 +87,56 @@ static int tun_net_open(struct net_device *dev)
 static int tun_net_close(struct net_device *dev)
 {
 	netif_stop_queue(dev);
+	skb_queue_purge(&dev->blist);
 	return 0;
 }
 
+/* Batch Net device start xmit
+ * combine with non-batching version
+ * */
+static int tun_net_bxmit(struct sk_buff_head *skbs, struct net_device *dev)
+{
+	struct sk_buff *skb;
+	int didq = 0;
+	struct tun_struct *tun = netdev_priv(dev);
+	u32 qlen = skb_queue_len(&tun->readq);
+
+	/* Drop packet if interface is not attached */
+	if (!tun->attached) {
+		tun->stats.tx_dropped+=skb_queue_len(&dev->blist);
+		skb_queue_purge(&dev->blist);
+		return NETDEV_TX_OK;
+	}
+
+	while (skb_queue_len(&dev->blist)) {
+		skb = __skb_dequeue(skbs);
+		if (!skb)
+			break;
+		dev_do_xmit_nit(skb, dev);
+		skb_queue_tail(&tun->readq, skb);
+		didq++;
+	}
+
+	qlen = skb_queue_len(&tun->readq);
+	if (qlen >= dev->tx_queue_len) {
+		netif_stop_queue(dev);
+		tun->stats.tx_fifo_errors++;
+		dev->xmit_win = 1;
+	} else {
+		dev->xmit_win = dev->tx_queue_len - qlen;
+	}
+
+	if (didq)
+		dev->trans_start = jiffies;
+
+	/* Notify and wake up reader process */
+	if (tun->flags & TUN_FASYNC)
+		kill_fasync(&tun->fasync, SIGIO, POLL_IN);
+	wake_up_interruptible(&tun->read_wait);
+
+	return NETDEV_TX_OK;
+}
+
 /* Net device start xmit */
 static int tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 {
@@ -207,6 +255,7 @@ static void tun_net_init(struct net_device *dev)
 		dev->tx_queue_len = TUN_READQ_SIZE;  /* We prefer our own queue length */
 		break;
 	}
+	dev->xmit_win = dev->tx_queue_len>>1; /* handwave, handwave */
 }
 
 /* Character device part */
@@ -382,7 +431,13 @@ static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,
 			schedule();
 			continue;
 		}
-		netif_wake_queue(tun->dev);
+		{
+			u32 t = skb_queue_len(&tun->readq);
+			if (netif_queue_stopped(tun->dev) && t < NETDEV_LTT) {
+				tun->dev->xmit_win = tun->dev->tx_queue_len;
+				netif_wake_queue(tun->dev);
+			}
+		}
 
 		/** Decide whether to accept this packet. This code is designed to
 		 * behave identically to an Ethernet interface. Accept the packet if
@@ -429,6 +484,7 @@ static void tun_setup(struct net_device *dev)
 	struct tun_struct *tun = netdev_priv(dev);
 
 	skb_queue_head_init(&tun->readq);
+	skb_queue_head_init(&dev->blist);
 	init_waitqueue_head(&tun->read_wait);
 
 	tun->owner = -1;
@@ -436,6 +492,8 @@ static void tun_setup(struct net_device *dev)
 	SET_MODULE_OWNER(dev);
 	dev->open = tun_net_open;
 	dev->hard_start_xmit = tun_net_xmit;
+	dev->hard_prep_xmit = NULL;
+	dev->hard_batch_xmit = tun_net_bxmit;
 	dev->stop = tun_net_close;
 	dev->get_stats = tun_net_stats;
 	dev->ethtool_ops = &tun_ethtool_ops;
@@ -458,7 +516,7 @@ static struct tun_struct *tun_get_by_name(const char *name)
 static int tun_set_iff(struct file *file, struct ifreq *ifr)
 {
 	struct tun_struct *tun;
-	struct net_device *dev;
+	struct net_device *dev = NULL;
 	int err;
 
 	tun = tun_get_by_name(ifr->ifr_name);
@@ -528,12 +586,15 @@ static int tun_set_iff(struct file *file, struct ifreq *ifr)
 	}
 
 	DBG(KERN_INFO "%s: tun_set_iff\n", tun->dev->name);
+	dev->features |= NETIF_F_BTX;
 
 	if (ifr->ifr_flags & IFF_NO_PI)
 		tun->flags |= TUN_NO_PI;
 
-	if (ifr->ifr_flags & IFF_ONE_QUEUE)
+	if (ifr->ifr_flags & IFF_ONE_QUEUE) {
 		tun->flags |= TUN_ONE_QUEUE;
+		dev->features &= ~NETIF_F_BTX;
+	}
 
 	file->private_data = tun;
 	tun->attached = 1;
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index f671cd2..85a1baf 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -325,6 +325,7 @@ struct net_device
 #define NETIF_F_VLAN_CHALLENGED	1024	/* Device cannot handle VLAN packets */
 #define NETIF_F_GSO		2048	/* Enable software GSO. */
 #define NETIF_F_LLTX		4096	/* LockLess TX */
+#define NETIF_F_BTX		8192	/* Capable of batch tx */
 
 	/* Segmentation offload features */
 #define NETIF_F_GSO_SHIFT	16
@@ -450,6 +451,11 @@ struct net_device
 	void			*priv;	/* pointer to private data	*/
 	int			(*hard_start_xmit) (struct sk_buff *skb,
 						    struct net_device *dev);
+	int			(*hard_batch_xmit) (struct sk_buff_head *list,
+						    struct net_device *dev);
+	int			(*hard_prep_xmit) (struct sk_buff *skb,
+						    struct net_device *dev);
+	int			xmit_win;
 	/* These may be needed for future network-power-down code. */
 	unsigned long		trans_start;	/* Time (in jiffies) of last Tx	*/
 
@@ -466,6 +472,10 @@ struct net_device
 	struct list_head	todo_list;
 	/* device index hash chain */
 	struct hlist_node	index_hlist;
+	/*XXX: Fix eventually to not allocate if device not
+	 *batch capable
+	*/
+	struct sk_buff_head	blist;
 
 	struct net_device	*link_watch_next;
 
@@ -742,7 +752,12 @@ extern int		dev_set_mac_address(struct net_device *,
 					    struct sockaddr *);
 extern int		dev_hard_start_xmit(struct sk_buff *skb,
 					    struct net_device *dev);
-
+extern int		do_gso_skb(struct sk_buff *skb,
+			           struct sk_buff_head *skbs);
+extern int		do_possible_gso_skb(struct sk_buff *skb,
+					    struct net_device *dev);
+extern void 		dev_do_xmit_nit(struct sk_buff *skb, 
+					struct net_device *dev);
 extern void		dev_init(void);
 
 extern int		netdev_budget;
diff --git a/net/core/dev.c b/net/core/dev.c
index 8301e2a..0d728cd 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1372,6 +1372,47 @@ out_kfree_skb:
 	return 0;
 }
 
+int do_gso_skb(struct sk_buff *skb, struct sk_buff_head *skbs)
+{
+	int tdq = 0;
+	do {
+		struct sk_buff *nskb = skb->next;
+
+		skb->next = nskb->next;
+		nskb->next = NULL;
+		tdq++;
+		__skb_queue_head(skbs, skb);
+	} while (skb->next);
+	skb->destructor = DEV_GSO_CB(skb)->destructor;
+
+	return tdq;
+}
+
+int do_possible_gso_skb(struct sk_buff *skb, struct net_device *dev)
+{
+	struct sk_buff_head *skbs = &dev->blist;
+
+	if (netif_needs_gso(dev, skb)) {
+		if (unlikely(dev_gso_segment(skb))) {
+			kfree_skb(skb);
+			return 0;
+		}
+		if (skb->next)
+			return do_gso_skb(skb, skbs);
+	}
+
+	__skb_queue_head(skbs, skb);
+	return 1;
+}
+
+/* invoked by driver when batching once figured skb is sane*/
+void  dev_do_xmit_nit(struct sk_buff *skb, struct net_device *dev)
+{
+	if (!list_empty(&ptype_all))
+		dev_queue_xmit_nit(skb, dev);
+}
+
+
 #define HARD_TX_LOCK(dev, cpu) {			\
 	if ((dev->features & NETIF_F_LLTX) == 0) {	\
 		netif_tx_lock(dev);			\
diff --git a/net/core/pktgen.c b/net/core/pktgen.c
index 9cd3a1c..530de14 100644
--- a/net/core/pktgen.c
+++ b/net/core/pktgen.c
@@ -3217,9 +3217,7 @@ static __inline__ void pktgen_xmit(struct pktgen_dev *pkt_dev)
 			pkt_dev->next_tx_us++;
 			pkt_dev->next_tx_ns -= 1000;
 		}
-	}
-
-	else {			/* Retry it next time */
+	} else { /* netif_queue_stopped -- Retry it next time */
 		pkt_dev->last_ok = 0;
 		pkt_dev->next_tx_us = getCurUs();	/* TODO */
 		pkt_dev->next_tx_ns = 0;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index ed80054..4fe5a9b 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -85,10 +85,12 @@ static inline int
 do_dev_requeue(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q)
 {
 
-	if (unlikely(skb->next))
-		dev->gso_skb = skb;
-	else
-		q->ops->requeue(skb, q);
+	if (skb) {
+		if (unlikely(skb->next))
+			dev->gso_skb = skb;
+		else
+			q->ops->requeue(skb, q);
+	}
 	/* XXX: Could netif_schedule fail? Or is that fact we are
 	 * requeueing imply the hardware path is closed
 	 * and even if we fail, some interupt will wake us
@@ -116,7 +118,10 @@ tx_islocked(struct sk_buff *skb, struct net_device *dev, struct Qdisc *q)
 	int ret = handle_dev_cpu_collision(dev);
 
 	if (ret == SCHED_TX_DROP) {
-		kfree_skb(skb);
+		if (skb) /* we are not batching */
+			kfree_skb(skb);
+		else if (!skb_queue_empty(&dev->blist))
+			skb_queue_purge(&dev->blist);
 		return qdisc_qlen(q);
 	}
 
@@ -195,10 +200,104 @@ static inline int qdisc_restart(struct net_device *dev)
 	return do_dev_requeue(skb, dev, q);
 }
 
+static int try_get_tx_pkts(struct net_device *dev, struct Qdisc *q, int count)
+{
+	struct sk_buff *skb;
+	struct sk_buff_head *skbs = &dev->blist;
+	int tdq = 0;
+
+	/* 
+	 * very unlikely, but who knows ..
+	 * If this happens we dont try to grab more pkts
+	 */
+	if (!skb_queue_empty(&dev->blist))
+		return skb_queue_len(&dev->blist);
+
+	if (unlikely(dev->gso_skb)) {
+		skb = dev->gso_skb;
+		dev->gso_skb = NULL;
+		tdq = do_gso_skb(skb, skbs);
+	}
+
+	if (tdq > count)
+		return tdq; /*we will stop here */
+
+	count -= tdq;
+	while (count > 0) {
+		skb = q->dequeue(q);
+		if (!skb)
+			break;
+		
+		tdq += do_possible_gso_skb(skb, dev);
+		count -= tdq;
+	}
+
+	return tdq;
+}
+
+static inline int try_tx_pkts(struct net_device *dev)
+{
+
+	return dev->hard_batch_xmit(&dev->blist, dev);
+
+}
+
+/* same comments as in qdisc_restart apply;
+ * at some point use shared code with qdisc_restart*/
+int batch_qdisc_restart(struct net_device *dev)
+{
+	struct Qdisc *q = dev->qdisc;
+	unsigned lockless = (dev->features & NETIF_F_LLTX);
+	int count = dev->xmit_win;
+	int ret = 0;
+
+	ret = try_get_tx_pkts(dev, q, count);
+
+	if (ret == 0)
+		return qdisc_qlen(q);
+
+	/* we have packets to send! */
+	if (!lockless) {
+		if (!netif_tx_trylock(dev))
+			return tx_islocked(NULL, dev, q);
+	}
+
+	/* all clear .. */
+	spin_unlock(&dev->queue_lock);
+
+	ret = NETDEV_TX_BUSY;
+	if (!netif_queue_stopped(dev))
+		ret = try_tx_pkts(dev);
+
+	if (!lockless)
+		netif_tx_unlock(dev);
+
+	spin_lock(&dev->queue_lock);
+
+	q = dev->qdisc;
+
+	/* most likely result, packet went ok */
+	if (ret == NETDEV_TX_OK)
+		return qdisc_qlen(q);
+	/* only for lockless drivers .. */
+	if (ret == NETDEV_TX_LOCKED && lockless)
+		return tx_islocked(NULL, dev, q);
+
+	if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
+		printk(KERN_WARNING " BUG %s code %d qlen %d\n",
+		       dev->name, ret, q->q.qlen);
+
+	return do_dev_requeue(NULL, dev, q);
+}
+
 void __qdisc_run(struct net_device *dev)
 {
+	unsigned batching = (dev->features & NETIF_F_BTX);
+
 	do {
-		if (!qdisc_restart(dev))
+		if (!batching && !qdisc_restart(dev))
+			break;
+		else if (!batch_qdisc_restart(dev))
 			break;
 	} while (!netif_queue_stopped(dev));
 

^ permalink raw reply related	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-16 22:12         ` Sridhar Samudrala
  2007-05-16 22:52           ` jamal
@ 2007-05-17  4:03           ` Krishna Kumar2
  2007-05-16 21:44             ` Sridhar Samudrala
  1 sibling, 1 reply; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-17  4:03 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi,
	Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma

Hi Sridhar,

Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM:

> AFAIK, gso_skb can be a list of skb's. Can we add a list
> to another list using __skb_queue_head()?
> Also, if gso_skb is a list of multiple skb's, i think the
> count needs to be decremented by the number of segments in
> gso_skb.

gso_skb is the last GSO skb that failed to be sent. This already
segmented skb is kept "cached" and whenever the next xmit happens,
this skb is first sent before any packets from the queue are taken
out (otherwise out-of-order packets). So there can at most be one
gso_skb per device.

- KK


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-16 21:44             ` Sridhar Samudrala
@ 2007-05-17  5:01               ` Krishna Kumar2
  0 siblings, 0 replies; 96+ messages in thread
From: Krishna Kumar2 @ 2007-05-17  5:01 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: ak, ashwin.chaugule, David Miller, Gagan Arneja, hadi,
	Evgeniy Polyakov, netdev, netdev-owner, rdreier, xma

Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:14:41 AM:

> Krishna Kumar2 wrote:
> > Hi Sridhar,
> >
> > Sridhar Samudrala <sri@us.ibm.com> wrote on 05/17/2007 03:42:03 AM:
> >
> >> AFAIK, gso_skb can be a list of skb's. Can we add a list
> >> to another list using __skb_queue_head()?
> >> Also, if gso_skb is a list of multiple skb's, i think the
> >> count needs to be decremented by the number of segments in
> >> gso_skb.
> >
> > gso_skb is the last GSO skb that failed to be sent. This already
> > segmented skb is kept "cached" and whenever the next xmit happens,
> > this skb is first sent before any packets from the queue are taken
> > out (otherwise out-of-order packets). So there can atmost be one
> > gso_skb per device.
>
> Yes. There can be only one gso_skb per device. But it can have more
> than one segment linked together via skb->next.

It will definitely have more than 1 segment (that is the criterion for
being the gso_skb). But I think you are right, the segmented skb will
"lose" its 1-n segments (newsk->next = next) when it is added to the list.

Thanks,

- KK
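
To make the issue concrete, here is a minimal sketch of splicing the
cached gso_skb chain into a batch list without dropping segments 1-n.
It is illustrative only, not the posted fix, and it ignores the
DEV_GSO_CB destructor bookkeeping:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/* Walk the segment chain of the cached gso_skb and queue each segment
 * individually, so segments linked via skb->next are not lost and the
 * caller's budget is decremented per segment.
 */
static int splice_gso_skb(struct net_device *dev, struct sk_buff_head *skbs,
			  int *count)
{
	struct sk_buff *seg = dev->gso_skb;
	int queued = 0;

	dev->gso_skb = NULL;
	while (seg) {
		struct sk_buff *next = seg->next;

		seg->next = NULL;
		__skb_queue_tail(skbs, seg);	/* preserve segment order */
		seg = next;
		queued++;
		(*count)--;
	}
	return queued;
}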


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 21:28                       ` David Miller
@ 2007-05-18  7:04                         ` Michael Chan
  0 siblings, 0 replies; 96+ messages in thread
From: Michael Chan @ 2007-05-18  7:04 UTC (permalink / raw)
  To: David Miller; +Cc: rdreier, ak, krkumar2, netdev

David Miller wrote:

> From: "Michael Chan" <mchan@broadcom.com>
> Date: Tue, 15 May 2007 15:05:28 -0700
> 
> > For each gso_seg, there will be a header and the payload may span 2
> > pages for 1500-byte packets.  We always assume 1500-byte packets
> > because the buggy chips do not support jumbo frames.
> 
> Correct.
> 
> I think there may be a case where you could see up to 4 segments.
> If the user corks the TCP socket, does a sendmsg() (which puts
> the data in the per-socket page) then does a sendfile() you'll
> see something like:
> 
> skb->data	IP, TCP, ethernet headers, etc.
> page0		sendmsg() data
> page1		sendfile
> page2		sendfile
> 
> Ie. this can happen if the sendfile() part starts near the
> end of a page, so it would get split even for a 1500 MTU
> frame.
> 
> Even more complex variants are possible if the user does
> tiny sendfile() requests to different pages within the file.
> 
> So in fact it can span up to N pages.
> 
> But there is an upper limit defined by the original GSO
> frame, and that is controlled by MAX_SKB_FRAGS, so at most
> you would see MAX_SKB_FRAGS plus some small constant.
> 
> 

I see.  Is the following a better estimate?

(gso_segs * 2) + skb_shinfo(skb)->nr_frags

For each gso_seg, you get at least a header segment and a
segment from a page for 1500-byte packets.  You can get more
segments if packets cross page boundaries or if there are tiny
sendfile pages or some sendmsg pages mixed in.  But these
additional segments will never be more than nr_frags.
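
Expressed as code, the revised estimate would be something like the
sketch below (illustrative only):

#include <linux/skbuff.h>

/* One header descriptor plus at least one page descriptor per
 * segment, with the extra page crossings bounded by nr_frags.
 */
static unsigned int gso_desc_estimate(struct sk_buff *skb)
{
	return skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
}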


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [WIP] [PATCH] WAS Re: [RFC] New driver API to speed up small packets xmits
  2007-05-17  3:25             ` jamal
@ 2007-05-18 12:07               ` jamal
  0 siblings, 0 replies; 96+ messages in thread
From: jamal @ 2007-05-18 12:07 UTC (permalink / raw)
  To: Sridhar Samudrala
  Cc: David Miller, xma, rdreier, ak, krkumar2, netdev, ashwin.chaugule,
	Evgeniy Polyakov, Auke, Gagan Arneja

On Wed, 2007-16-05 at 23:25 -0400, jamal wrote:

> 
> This patch now includes two changed drivers (tun and e1000). I have
> tested tun with this patch. I tested e1000 earlier and i couldnt find
> any issues - although as the tittle says its a WIP.
> 
> As before you need net-2.6. You also need the qdisc restart cleanup
> patch.
> 

I have found a bug in e1000 which manifests under a lot of pounding with
traffic.
I am not gonna be posting the patch anymore - but will continue to fix
and add things. If you want the patches, email me privately.

cheers,
jamal


^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-15 21:32     ` David Miller
  2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
       [not found]       ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
@ 2007-05-21  7:56       ` Herbert Xu
       [not found]         ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
  2 siblings, 1 reply; 96+ messages in thread
From: Herbert Xu @ 2007-05-21  7:56 UTC (permalink / raw)
  To: David Miller; +Cc: xma, rdreier, ak, krkumar2, netdev, netdev-owner

David Miller <davem@davemloft.net> wrote:
> From: Shirley Ma <xma@us.ibm.com>
> Date: Tue, 15 May 2007 14:22:57 -0700
> 
>>       I just wonder without TSO support in HW, how much benefit we
>> can get by pushing GSO from interface layer to device layer besides
>> we can do multiple packets in IPoIB.
> 
> I bet the gain is non-trivial.

Yep, for any NIC that supports SG but not TSO then software GSO will
be a big win.  When the NIC doesn't support SG then the win is mostly
offset by the need to copy the packet again.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]         ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
@ 2007-05-22 22:36           ` David Miller
       [not found]             ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
  2007-05-22 23:12             ` Herbert Xu
  0 siblings, 2 replies; 96+ messages in thread
From: David Miller @ 2007-05-22 22:36 UTC (permalink / raw)
  To: xma; +Cc: herbert, ak, krkumar2, netdev, netdev-owner, rdreier

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 22 May 2007 15:22:35 -0700

> > Yep, for any NIC that supports SG but not TSO then software GSO will
> > be a big win.  When the NIC doesn't support SG then the win is mostly
> > offset by the need to copy the packet again.
> >
> > Cheers,
> > --
> 
>       We could avoid the packet copy by allocating a number of 64K/MTU skb
> buffers instead of one big 64K buffer. Is that a possible approach?

I think you misunderstand the problem.

If the device does not support scatter gather, it is impossible to
avoid the copy.

SKB's from TSO are composed of discontiguous page chunks represented
by the skb_shared_info() array.  These can come either from sendmsg()
user data (which the kernel fills into a per-socket data page which is
reallocated once filled), or from sendfile() page cache pages.

Therefore, there has to be a copy somewhere to be able to give
that packet to a non-scatter-gather device.

^ permalink raw reply	[flat|nested] 96+ messages in thread

* Re: [RFC] New driver API to speed up small packets xmits
       [not found]             ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
@ 2007-05-22 23:04               ` David Miller
  0 siblings, 0 replies; 96+ messages in thread
From: David Miller @ 2007-05-22 23:04 UTC (permalink / raw)
  To: xma; +Cc: ak, herbert, krkumar2, netdev, netdev-owner, rdreier

From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 22 May 2007 15:58:05 -0700

> Sorry for the confusion. I am thinking of avoiding the copy in
> skb_segment() for GSO. One way could be for tcp_sendmsg() to allocate
> small discontiguous buffers (each equal to the MTU) instead of
> allocating pages.

The SKB splitting algorithm in TCP's transmit engine depends upon the
skb_shared_info() array being splittable at arbitrary points with only
page counts to manage.  This is the only way I found to make SKB
splitting at transmit time extremely inexpensive.

SACK block processing needs to perform these kinds of splits
as well, so it really has to be cheap.

The invariant is that every TCP TSO packet must have its header
at skb->data and all of its data in the paged skb_shared_info().
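
A rough userspace model of why this page-based layout makes an
arbitrary split cheap is sketched below.  The names are invented; the
real kernel path is skb_split()/tcp_fragment() adjusting skb_frag_t
entries and taking page references via get_page():

	/* Userspace model of splitting a paged packet at an arbitrary byte
	 * offset without copying payload: only offsets, lengths and a page
	 * reference count are touched.  Invented names; not kernel code. */
	#include <stdio.h>

	struct page_model { int refcount; };

	struct frag_model {
		struct page_model	*page;
		unsigned int		offset;
		unsigned int		len;
	};

	/* Split frags[0..nr) at byte 'split' into lo[] and hi[].  The frag
	 * that straddles the boundary lands in both halves with adjusted
	 * offset/len, and its page refcount is bumped instead of its data
	 * being copied (models get_page()). */
	static void split_frags(struct frag_model *frags, unsigned int nr,
				unsigned int split,
				struct frag_model *lo, unsigned int *nlo,
				struct frag_model *hi, unsigned int *nhi)
	{
		unsigned int i, pos = 0;

		*nlo = *nhi = 0;
		for (i = 0; i < nr; i++) {
			struct frag_model f = frags[i];

			if (pos + f.len <= split) {		/* entirely below the split */
				lo[(*nlo)++] = f;
			} else if (pos >= split) {		/* entirely above the split */
				hi[(*nhi)++] = f;
			} else {				/* straddles: share the page */
				unsigned int head = split - pos;

				f.page->refcount++;
				lo[*nlo] = f;
				lo[(*nlo)++].len = head;
				hi[*nhi] = f;
				hi[*nhi].offset += head;
				hi[(*nhi)++].len = f.len - head;
			}
			pos += f.len;
		}
	}

	int main(void)
	{
		struct page_model pg = { .refcount = 1 };
		struct frag_model frags[2] = {
			{ &pg, 0,    3000 },
			{ &pg, 3000, 3000 },
		};
		struct frag_model lo[3], hi[3];
		unsigned int nlo, nhi;

		split_frags(frags, 2, 1448, lo, &nlo, hi, &nhi);
		printf("lo frags: %u, hi frags: %u, page refcount: %d\n",
		       nlo, nhi, pg.refcount);
		return 0;
	}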


* Re: [RFC] New driver API to speed up small packets xmits
  2007-05-22 22:36           ` David Miller
       [not found]             ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
@ 2007-05-22 23:12             ` Herbert Xu
  1 sibling, 0 replies; 96+ messages in thread
From: Herbert Xu @ 2007-05-22 23:12 UTC (permalink / raw)
  To: David Miller; +Cc: xma, ak, krkumar2, netdev, netdev-owner, rdreier

On Tue, May 22, 2007 at 03:36:36PM -0700, David Miller wrote:
> 
> > > Yep, for any NIC that supports SG but not TSO then software GSO will
> > > be a big win.  When the NIC doesn't support SG then the win is mostly
> > > offset by the need to copy the packet again.

...
 
> SKB's from TSO are composed of discontiguous page chunks represented
> by the skb_shared_info() array.  These can come either from sendmsg()
> user data (which the kernel fills into a per-socket data page which is
> reallocated once filled), or from sendfile() page cache pages.

Yes, sendmsg() is the case where it's almost even whether GSO is
turned on or off, because GSO does an extra copy which offsets the
win from the reduced per-packet cost.

For sendfile() it's still a win, though, since we have to copy the
data anyway and doing that copy in GSO avoids the per-packet cost.
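
One way to read that accounting, for a NIC with neither SG nor TSO
sending a single 64KB write, is sketched below.  It only counts
full-payload copies and trips through the TCP/IP stack; the numbers are
illustrative, not measurements:

	/* One reading of the accounting above, for a NIC with neither SG nor
	 * TSO sending a single 64KB write.  In every case the driver's
	 * hard_start_xmit() still runs once per wire packet; what GSO saves
	 * is the per-packet trip through the TCP/IP stack. */
	#include <stdio.h>

	int main(void)
	{
		const int pkts = 45;	/* ~64KB split into ~1448-byte segments */

		/* sendmsg(), GSO off: user data is copied straight into each
		 * linear skb -- one payload copy, one stack trip per packet. */
		printf("sendmsg  GSO off: payload copies=1 stack trips=%d\n", pkts);

		/* sendmsg(), GSO on: copy into socket pages, one stack trip for
		 * the super-packet, then a second copy per segment when the
		 * data is linearized for the non-SG device -- roughly even. */
		printf("sendmsg  GSO on : payload copies=2 stack trips=1\n");

		/* sendfile(), GSO off: page-cache data has to be copied into a
		 * linear skb anyway -- one copy, one stack trip per packet. */
		printf("sendfile GSO off: payload copies=1 stack trips=%d\n", pkts);

		/* sendfile(), GSO on: the same single copy happens during
		 * segmentation, but only one stack trip is paid -- a win. */
		printf("sendfile GSO on : payload copies=1 stack trips=1\n");
		return 0;
	}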

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


end of thread

Thread overview: 96+ messages
     [not found] <OF0CAD6D87.DBE62968-ON872572DC.0073646A-882572DC.0073BEC2@us.ibm.com>
2007-05-15 21:17 ` [RFC] New driver API to speed up small packets xmits Roland Dreier
     [not found]   ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
2007-05-15 21:25     ` Roland Dreier
     [not found]       ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
2007-05-15 21:38         ` David Miller
2007-05-15 21:32     ` David Miller
2007-05-15 22:17       ` [WIP] [PATCH] WAS " jamal
2007-05-15 22:48         ` jamal
2007-05-16  0:50           ` jamal
2007-05-16 22:12         ` Sridhar Samudrala
2007-05-16 22:52           ` jamal
2007-05-17  3:25             ` jamal
2007-05-18 12:07               ` jamal
2007-05-17  4:03           ` Krishna Kumar2
2007-05-16 21:44             ` Sridhar Samudrala
2007-05-17  5:01               ` Krishna Kumar2
     [not found]       ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
2007-05-15 23:36         ` David Miller
2007-05-21  7:56       ` Herbert Xu
     [not found]         ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
2007-05-22 22:36           ` David Miller
     [not found]             ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
2007-05-22 23:04               ` David Miller
2007-05-22 23:12             ` Herbert Xu
2007-05-11  7:14 Krishna Kumar2
  -- strict thread matches above, loose matches on Subject: below --
2007-05-10 14:53 Krishna Kumar
2007-05-10 15:08 ` Evgeniy Polyakov
2007-05-10 15:22   ` Krishna Kumar2
2007-05-10 15:48     ` Evgeniy Polyakov
2007-05-10 16:08       ` jamal
2007-05-10 17:19     ` Rick Jones
2007-05-10 18:07       ` Sridhar Samudrala
2007-05-10 19:43         ` Gagan Arneja
2007-05-10 20:11           ` jamal
2007-05-10 20:14             ` Rick Jones
2007-05-10 20:15               ` jamal
2007-05-10 20:15             ` Gagan Arneja
2007-05-10 20:21               ` jamal
2007-05-10 20:25                 ` Gagan Arneja
2007-05-11  5:22             ` Krishna Kumar2
2007-05-11 11:27               ` jamal
2007-05-10 20:37           ` David Miller
2007-05-10 20:40             ` Gagan Arneja
2007-05-10 20:57               ` David Miller
2007-05-11  6:07                 ` Krishna Kumar2
2007-05-11  5:21             ` Krishna Kumar2
2007-05-11  5:20           ` Krishna Kumar2
2007-05-11  5:35             ` Gagan Arneja
2007-05-11  5:43               ` Krishna Kumar2
2007-05-11  5:57                 ` Gagan Arneja
2007-05-11  6:06                   ` Krishna Kumar2
2007-05-11  6:29                     ` Gagan Arneja
2007-05-11  6:52                       ` Krishna Kumar2
2007-05-10 18:13       ` Vlad Yasevich
2007-05-10 18:20         ` Rick Jones
2007-05-10 18:32           ` Vlad Yasevich
2007-05-10 18:40             ` Rick Jones
2007-05-10 18:59             ` Ian McDonald
2007-05-10 19:21               ` Vlad Yasevich
2007-05-10 19:26                 ` Ian McDonald
2007-05-10 20:32                 ` David Miller
2007-05-10 20:49                   ` Rick Jones
2007-05-10 21:02                     ` David Miller
2007-05-10 21:14                       ` Gagan Arneja
2007-05-11  2:28                         ` Stephen Hemminger
2007-05-11  5:01                           ` Gagan Arneja
2007-05-11  5:04               ` Krishna Kumar2
2007-05-11  9:01                 ` Evgeniy Polyakov
2007-05-11  9:18                   ` Krishna Kumar2
2007-05-11  9:32                     ` Evgeniy Polyakov
2007-05-11  9:52                       ` Krishna Kumar2
2007-05-11  9:56                         ` Evgeniy Polyakov
2007-05-11 11:30                           ` jamal
2007-05-11 11:53                             ` Evgeniy Polyakov
2007-05-11 12:15                               ` jamal
2007-05-10 21:27       ` David Stevens
2007-05-10 21:40         ` David Miller
2007-05-10 21:50           ` Gagan Arneja
2007-05-10 22:06             ` David Miller
2007-05-11  9:46               ` Krishna Kumar2
2007-05-10 21:41         ` Eric Dumazet
2007-05-10 22:09           ` Rick Jones
2007-05-10 21:45         ` Rick Jones
2007-05-10 21:53           ` David Stevens
2007-05-10 20:21 ` Roland Dreier
2007-05-11  7:30   ` Krishna Kumar2
2007-05-11 11:21     ` Roland Dreier
2007-05-11  9:05 ` Andi Kleen
2007-05-11  8:32   ` Krishna Kumar2
2007-05-11  9:37     ` Andi Kleen
2007-05-11  8:50       ` Krishna Kumar2
2007-05-11 11:16       ` Roland Dreier
2007-05-13  6:00         ` Andi Kleen
2007-05-15 16:25           ` Roland Dreier
2007-05-15 20:18             ` David Miller
2007-05-15 20:52               ` Roland Dreier
2007-05-15 21:48                 ` Michael Chan
2007-05-15 21:08                   ` Roland Dreier
2007-05-15 22:05                     ` Michael Chan
2007-05-15 21:28                       ` David Miller
2007-05-18  7:04                         ` Michael Chan
