* [RFC] New driver API to speed up small packets xmits
@ 2007-05-10 14:53 Krishna Kumar
2007-05-10 15:08 ` Evgeniy Polyakov
` (2 more replies)
0 siblings, 3 replies; 86+ messages in thread
From: Krishna Kumar @ 2007-05-10 14:53 UTC (permalink / raw)
To: netdev; +Cc: krkumar2, Krishna Kumar
Hi all,
While looking at common packet sizes on xmits, I found that most
packets are small. On my personal system, the packet-size statistics
after about 6 hours of use (browsing, mail, ftp'ing two Linux kernels
from www.kernel.org) are:
-----------------------------------------------------------
Packet Size      #packets (Total: 60720)      Percentage
-----------------------------------------------------------
      32                    0                    0
      64                 7716                   12.70
      80                40193                   66.19
                  sub-total:  47909             78.90 %
      96                 2007                    3.30
     128                 1917                    3.15
                  sub-total:   3924              6.46 %
     256                 1822                    3.00
     384                  863                    1.42
     512                  459                     .75
                  sub-total:   3144              5.18 %
     640                  763                    1.25
     768                 2329                    3.83
    1024                 1700                    2.79
    1500                  461                     .75
                  sub-total:   5253              8.65 %
    2048                  312                     .51
    4096                   53                     .08
    8192                   84                     .13
   16384                   41                     .06
  32768+                    0                    0
                  sub-total:    490              0.81 %
-----------------------------------------------------------
Doing some measurements, I found that for small packets like 128 bytes,
the bandwidth is approximately 60% of the line speed. To possibly speed
up small-packet xmits, a method of "linking" skbs was thought of, where
two pointers (skb_flink/blink) are added to the skb. It is felt (no data
yet) that drivers will get better results when a larger number of
"linked" skbs is handed to them in one shot, rather than sending each
skb independently (where each skb costs an extra call into the driver,
and the driver has to get/drop its lock, etc). The method is to send as
many packets as possible from the qdisc (e.g. multiple packets can
accumulate while the driver is stopped or a trylock failed) if the
device supports the new API. The steps for enabling the API in a driver
are:
- driver needs to set NETIF_F_LINKED_SKB before register_netdev
- register_netdev sets a new tx_link_skbs tunable parameter in
  dev to 1, indicating that the driver supports linked skbs.
- driver implements the new API - hard_start_xmit_link - to
  handle linked skbs, which is mostly a simple task. E.g.,
  support can be added to the e1000 driver, without duplicating
  existing code, as:
e1000_xmit_frame_link()
{
top:
        next_skb = skb->skb_flink;      /* remember the next linked skb */

        /* (original e1000_xmit_frame code here, unchanged) */

        skb = next_skb;
        if (skb)
                goto top;
        ...
}

e1000_xmit_frame()
{
        return e1000_xmit_frame_link(skb, NULL, dev);
}
Drivers can take other approaches, e.g. get the lock at the top and
handle all the packets in one shot, or get/drop the lock for each
skb; but those are internal to the driver. In any case, the driver
changes needed to support this (optional) API are minimal.
The main change is in the core/sched code. The qdisc links packets if
the device supports it and multiple skbs are present, and calls
dev_hard_start_xmit, which calls one of the two APIs depending on
whether the passed skb is linked or not. A sysfs interface can set or
reset the tx_link_skbs parameter for the device, to select the old or
the new driver API.
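To illustrate, the dequeue path could link packets roughly as below (a
sketch only, not the actual patch; skb_flink/skb_blink, tx_link_skbs and
hard_start_xmit_link are the proposed additions, and error handling is
omitted):

        /* Sketch: called with dev->queue_lock held, like today's dequeue */
        static struct sk_buff *dequeue_linked_skbs(struct Qdisc *q,
                                                   struct net_device *dev,
                                                   int max)
        {
                struct sk_buff *head, *tail, *skb;
                int n = 1;

                head = tail = q->dequeue(q);
                if (!head)
                        return NULL;
                head->skb_blink = NULL;

                /* pull more packets only if the driver advertised support */
                while (dev->tx_link_skbs && n < max &&
                       (skb = q->dequeue(q)) != NULL) {
                        tail->skb_flink = skb;          /* proposed new fields */
                        skb->skb_blink = tail;
                        tail = skb;
                        n++;
                }
                tail->skb_flink = NULL;
                return head;
        }

dev_hard_start_xmit would then call dev->hard_start_xmit_link(head, dev)
when head->skb_flink is set, and fall back to the existing
hard_start_xmit otherwise.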
The reason to implement the same was to speed up IPoIB driver. But
before doing that, a proof of concept for E1000/AMSO drivers was
considered (as most of the code is generic) before implementing for
IPoIB. I do not have test results at this time but I am working on it.
Please let me know if this approach is acceptable, or any suggestions.
Thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 14:53 [RFC] New driver API to speed up small packets xmits Krishna Kumar
@ 2007-05-10 15:08 ` Evgeniy Polyakov
2007-05-10 15:22 ` Krishna Kumar2
2007-05-10 20:21 ` Roland Dreier
2007-05-11 9:05 ` Andi Kleen
2 siblings, 1 reply; 86+ messages in thread
From: Evgeniy Polyakov @ 2007-05-10 15:08 UTC (permalink / raw)
To: Krishna Kumar; +Cc: netdev
On Thu, May 10, 2007 at 08:23:51PM +0530, Krishna Kumar (krkumar2@in.ibm.com) wrote:
> The reason to implement the same was to speed up IPoIB driver. But
> before doing that, a proof of concept for E1000/AMSO drivers was
> considered (as most of the code is generic) before implementing for
> IPoIB. I do not have test results at this time but I am working on it.
>
> Please let me know if this approach is acceptable, or any suggestions.
Doesn't it looks exactly like GSO/TSO/LRO stuff implemented already?
Btw, main CPU limiting factor here is syscall overhead (userspace protocol
processing with 1500 MTU allows to reduce CPU usage and increase
performance for 128 bytes packets sending/receiving total of 10 times).
> Thanks,
>
> - KK
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 15:08 ` Evgeniy Polyakov
@ 2007-05-10 15:22 ` Krishna Kumar2
2007-05-10 15:48 ` Evgeniy Polyakov
2007-05-10 17:19 ` Rick Jones
0 siblings, 2 replies; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-10 15:22 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev
Hi Evgeniy,
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/10/2007 08:38:33 PM:
> On Thu, May 10, 2007 at 08:23:51PM +0530, Krishna Kumar
> (krkumar2@in.ibm.com) wrote:
> > The reason to implement the same was to speed up IPoIB driver. But
> > before doing that, a proof of concept for E1000/AMSO drivers was
> > considered (as most of the code is generic) before implementing for
> > IPoIB. I do not have test results at this time but I am working on it.
> >
> > Please let me know if this approach is acceptable, or any suggestions.
>
> Doesn't it looks exactly like GSO/TSO/LRO stuff implemented already?
It is the reverse - GSO will segment one super-packet just before calling
the driver so that the stack is traversed only once. In my case, I am
trying to send out multiple skbs, possibly small packets, in one shot.
GSO will not help for small packets.
> Btw, main CPU limiting factor here is syscall overhead (userspace
> protocol
> processing with 1500 MTU allows to reduce CPU usage and increase
> performance for 128 bytes packets sending/receiving total of 10 times).
I will test this also. But I was curious to see if without any changes to
applications, I can get better performance by linking packets and sending
it once to the driver.
What is your opinion ?
thanks,
- KK
>
> > Thanks,
> >
> > - KK
>
> --
> Evgeniy Polyakov
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 15:22 ` Krishna Kumar2
@ 2007-05-10 15:48 ` Evgeniy Polyakov
2007-05-10 16:08 ` jamal
2007-05-10 17:19 ` Rick Jones
1 sibling, 1 reply; 86+ messages in thread
From: Evgeniy Polyakov @ 2007-05-10 15:48 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: netdev
On Thu, May 10, 2007 at 08:52:12PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > > The reason to implement the same was to speed up IPoIB driver. But
> > > before doing that, a proof of concept for E1000/AMSO drivers was
> > > considered (as most of the code is generic) before implementing for
> > > IPoIB. I do not have test results at this time but I am working on it.
> > >
> > > Please let me know if this approach is acceptable, or any suggestions.
> >
> > Doesn't it looks exactly like GSO/TSO/LRO stuff implemented already?
>
> It is the reverse - GSO will segment one super-packet just before calling
> the driver so that the stack is traversed only once. In my case, I am
> trying to send out multiple skbs, possibly small packets, in one shot.
> GSO will not help for small packets.
And you can create that packet yourself, if it is not enough to combine
into an MSS-sized chunk.
> > Btw, main CPU limiting factor here is syscall overhead (userspace
> > protocol
> > processing with 1500 MTU allows to reduce CPU usage and increase
> > performance for 128 bytes packets sending/receiving total of 10 times).
>
> I will test this also. But I was curious to see if without any changes to
> applications, I can get better performance by linking packets and sending
> it once to the driver.
Without changes it will unlikely to be visible.
> What is your opinion ?
IMHO if you do not see in profile anything related to driver's xmit
function, it does not require to be fixed. But as a starting point, why
not just increase the skb allocation size (say twice) in tcp_sendmsg()
and see the results? It can of course end up with worse behaviour (if
the mtu is small), but it is noticeably simpler to implement and test first.
> thanks,
>
> - KK
>
> >
> > > Thanks,
> > >
> > > - KK
> >
> > --
> > Evgeniy Polyakov
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 15:48 ` Evgeniy Polyakov
@ 2007-05-10 16:08 ` jamal
0 siblings, 0 replies; 86+ messages in thread
From: jamal @ 2007-05-10 16:08 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Krishna Kumar2, netdev
On Thu, 2007-10-05 at 19:48 +0400, Evgeniy Polyakov wrote:
> IMHO if you do not see in profile anything related to driver's xmit
> function, it does not require to be fixed.
True, but i think there may be value in amortizing the cost towards
the driver.
i.e. if you grab a lock and send X packets vs a single packet, depending
on the cost of the lock you may observe a reduction in CPU.
I have tried similar experiments in the past and observed zero
improvements on sane machines, but huge improvements on insane machines.
Slides on my work here (just jump to about slide 93):
http://vger.kernel.org/jamal_netconf2006.sxi
One thought i had at the time was perhaps the e1000 DMA setup/teardown
was very inefficient.
As the slides indicate, i was trying to get throughput improvements but
failed to see any on sane machines.
A suggestion made to me was to look at CPU utilization as a metric.
Unfortunately i haven't had any chance yet.
cheers,
jamal
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 15:22 ` Krishna Kumar2
2007-05-10 15:48 ` Evgeniy Polyakov
@ 2007-05-10 17:19 ` Rick Jones
2007-05-10 18:07 ` Sridhar Samudrala
` (2 more replies)
1 sibling, 3 replies; 86+ messages in thread
From: Rick Jones @ 2007-05-10 17:19 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev
> It is the reverse - GSO will segment one super-packet just before calling
> the driver so that the stack is traversed only once. In my case, I am
> trying to send out multiple skbs, possibly small packets, in one shot.
> GSO will not help for small packets.
If there are small packets that implies small sends, which suggests that they
would be coalesced either implicitly by the Nagle algorithm or explicitly with
TCP_CORK no?
rick jones
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 17:19 ` Rick Jones
@ 2007-05-10 18:07 ` Sridhar Samudrala
2007-05-10 19:43 ` Gagan Arneja
2007-05-10 18:13 ` Vlad Yasevich
2007-05-10 21:27 ` David Stevens
2 siblings, 1 reply; 86+ messages in thread
From: Sridhar Samudrala @ 2007-05-10 18:07 UTC (permalink / raw)
To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev
On Thu, 2007-05-10 at 10:19 -0700, Rick Jones wrote:
> > It is the reverse - GSO will segment one super-packet just before calling
> > the driver so that the stack is traversed only once. In my case, I am
> > trying to send out multiple skbs, possibly small packets, in one shot.
> > GSO will not help for small packets.
>
> If there are small packets that implies small sends, which suggests that they
> would be coalesced either implicitly by the Nagle algorithm or explicitly with
> TCP_CORK no?
small packets belonging to the same connection could be coalesced by
TCP, but this may help the case where multiple parallel connections are
sending small packets.
Thanks
Sridhar
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 17:19 ` Rick Jones
2007-05-10 18:07 ` Sridhar Samudrala
@ 2007-05-10 18:13 ` Vlad Yasevich
2007-05-10 18:20 ` Rick Jones
2007-05-10 21:27 ` David Stevens
2 siblings, 1 reply; 86+ messages in thread
From: Vlad Yasevich @ 2007-05-10 18:13 UTC (permalink / raw)
To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev
Rick Jones wrote:
>> It is the reverse - GSO will segment one super-packet just before calling
>> the driver so that the stack is traversed only once. In my case, I am
>> trying to send out multiple skbs, possibly small packets, in one shot.
>> GSO will not help for small packets.
>
> If there are small packets that implies small sends, which suggests that
> they would be coalesced either implicitly by the Nagle algorithm or
> explicitly with TCP_CORK no?
>
> rick jones
> -
May be for TCP? What about other protocols?
-vlad
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 18:13 ` Vlad Yasevich
@ 2007-05-10 18:20 ` Rick Jones
2007-05-10 18:32 ` Vlad Yasevich
0 siblings, 1 reply; 86+ messages in thread
From: Rick Jones @ 2007-05-10 18:20 UTC (permalink / raw)
To: Vlad Yasevich; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev
Vlad Yasevich wrote:
> Rick Jones wrote:
>
>>>It is the reverse - GSO will segment one super-packet just before calling
>>>the driver so that the stack is traversed only once. In my case, I am
>>>trying to send out multiple skbs, possibly small packets, in one shot.
>>>GSO will not help for small packets.
>>
>>If there are small packets that implies small sends, which suggests that
>>they would be coalesced either implicitly by the Nagle algorithm or
>>explicitly with TCP_CORK no?
>>
>>rick jones
>>-
>
>
> May be for TCP? What about other protocols?
There are other protocols?-) True, UDP, and I suppose certain modes of SCTP
might be sending streams of small packets, as might TCP with TCP_NODELAY set.
Do they often queue-up outside the driver?
rick jones
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 18:20 ` Rick Jones
@ 2007-05-10 18:32 ` Vlad Yasevich
2007-05-10 18:40 ` Rick Jones
2007-05-10 18:59 ` Ian McDonald
0 siblings, 2 replies; 86+ messages in thread
From: Vlad Yasevich @ 2007-05-10 18:32 UTC (permalink / raw)
To: Rick Jones; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev
Rick Jones wrote:
> Vlad Yasevich wrote:
>> Rick Jones wrote:
>>
>>>> It is the reverse - GSO will segment one super-packet just before
>>>> calling
>>>> the driver so that the stack is traversed only once. In my case, I am
>>>> trying to send out multiple skbs, possibly small packets, in one shot.
>>>> GSO will not help for small packets.
>>>
>>> If there are small packets that implies small sends, which suggests that
>>> they would be coalesced either implicitly by the Nagle algorithm or
>>> explicitly with TCP_CORK no?
>>>
>>> rick jones
>>> -
>>
>>
>> May be for TCP? What about other protocols?
>
> There are other protocols?-) True, UDP, and I suppose certain modes of
> SCTP might be sending streams of small packets, as might TCP with
> TCP_NODELAY set.
>
> Do they often queue-up outside the driver?
Not sure if DCCP might fall into this category as well...
I think the idea of this patch is gather some number of these small packets and
shove them at the driver in one go instead of each small packet at a time.
I might be helpful, but reserve judgment till I see more numbers.
-vlad
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 18:32 ` Vlad Yasevich
@ 2007-05-10 18:40 ` Rick Jones
2007-05-10 18:59 ` Ian McDonald
1 sibling, 0 replies; 86+ messages in thread
From: Rick Jones @ 2007-05-10 18:40 UTC (permalink / raw)
To: Vlad Yasevich; +Cc: Krishna Kumar2, Evgeniy Polyakov, netdev
>
> Not sure if DCCP might fall into this category as well...
>
> I think the idea of this patch is gather some number of these small packets and
> shove them at the driver in one go instead of each small packet at a time.
This reminds me... (rick starts waxing rhapsodic about old HP-UX behaviour :)
The HP-UX drivers have a "packet train" mode whereby IP will give them all the
fragments of an IP datagram in one shot. The idea was that the driver would
take all of them, or none of them. This was to deal with issues around filling
a driver's transmit queue with some of the fragments and then dropping the
others, thus transmitting IP datagrams which had no hope of being reassembled
into anything other than frankengrams.
rick jones
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 18:32 ` Vlad Yasevich
2007-05-10 18:40 ` Rick Jones
@ 2007-05-10 18:59 ` Ian McDonald
2007-05-10 19:21 ` Vlad Yasevich
2007-05-11 5:04 ` Krishna Kumar2
1 sibling, 2 replies; 86+ messages in thread
From: Ian McDonald @ 2007-05-10 18:59 UTC (permalink / raw)
To: Vlad Yasevich; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev
On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote:
> >> May be for TCP? What about other protocols?
> >
> > There are other protocols?-) True, UDP, and I suppose certain modes of
> > SCTP might be sending streams of small packets, as might TCP with
> > TCP_NODELAY set.
> >
> > Do they often queue-up outside the driver?
>
> Not sure if DCCP might fall into this category as well...
>
Yes DCCP definitely can queue outside the driver.
> I think the idea of this patch is gather some number of these small packets and
> shove them at the driver in one go instead of each small packet at a time.
>
> I might be helpful, but reserve judgment till I see more numbers.
>
> -vlad
As I see this proposed patch it is about reducing the number of "task
switches" between the driver and the protocol. I use task switch in
speech marks as it isn't really as is in the kernel. So in other words
we are hoping that spending more time in each area would keep the
cache hot and work to be done if locks held. This of course requires
that the complexity added doesn't outweigh the gains - otherwise you
could end up in a worse scenario where the driver doesn't send packets
because the protocol is busy linking them.
As far as I can tell you're not combining packets?? This would
definitely break UDP/DCCP which are datagram based.
Will be interesting to see results but I'm a little sceptical.
--
Web: http://wand.net.nz/~iam4/
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 18:59 ` Ian McDonald
@ 2007-05-10 19:21 ` Vlad Yasevich
2007-05-10 19:26 ` Ian McDonald
2007-05-10 20:32 ` David Miller
2007-05-11 5:04 ` Krishna Kumar2
1 sibling, 2 replies; 86+ messages in thread
From: Vlad Yasevich @ 2007-05-10 19:21 UTC (permalink / raw)
To: Ian McDonald; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev
Ian McDonald wrote:
> On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote:
>> >> May be for TCP? What about other protocols?
>> >
>> > There are other protocols?-) True, UDP, and I suppose certain modes of
>> > SCTP might be sending streams of small packets, as might TCP with
>> > TCP_NODELAY set.
>> >
>> > Do they often queue-up outside the driver?
>>
>> Not sure if DCCP might fall into this category as well...
>>
> Yes DCCP definitely can queue outside the driver.
>
>> I think the idea of this patch is gather some number of these small
>> packets and
>> shove them at the driver in one go instead of each small packet at a
>> time.
>>
>> I might be helpful, but reserve judgment till I see more numbers.
>>
>> -vlad
>
> As I see this proposed patch it is about reducing the number of "task
> switches" between the driver and the protocol. I use task switch in
> speech marks as it isn't really as is in the kernel. So in other words
> we are hoping that spending more time in each area would keep the
> cache hot and work to be done if locks held. This of course requires
> that the complexity added doesn't outweigh the gains - otherwise you
> could end up in a worse scenario where the driver doesn't send packets
> because the protocol is busy linking them.
>
> As far as I can tell you're not combining packets?? This would
> definitely break UDP/DCCP which are datagram based.
If it's combining packets, it would break a lot of other things as well.
The proposed usage seems to be to link packets together into a chain
and give the whole chain to the driver, instead of handing down one packet
at a time.
The win might be biggest on a system where a lot of applications send a lot of
small packets. Some number will aggregate in the prio queue and then get shoved
into a driver in one go.
But... this is all conjecture until we see the code.
-vlad
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 19:21 ` Vlad Yasevich
@ 2007-05-10 19:26 ` Ian McDonald
2007-05-10 20:32 ` David Miller
1 sibling, 0 replies; 86+ messages in thread
From: Ian McDonald @ 2007-05-10 19:26 UTC (permalink / raw)
To: Vlad Yasevich; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev
On 5/11/07, Vlad Yasevich <vladislav.yasevich@hp.com> wrote:
> The win might be biggest on a system where a lot of applications send a lot of
> small packets. Some number will aggregate in the prio queue and then get shoved
> into a driver in one go.
That's assuming that the device doesn't run out of things to send first....
>
> But... this is all conjecture until we see the code.
>
Agree
--
Web: http://wand.net.nz/~iam4/
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 18:07 ` Sridhar Samudrala
@ 2007-05-10 19:43 ` Gagan Arneja
2007-05-10 20:11 ` jamal
` (2 more replies)
0 siblings, 3 replies; 86+ messages in thread
From: Gagan Arneja @ 2007-05-10 19:43 UTC (permalink / raw)
To: Sridhar Samudrala; +Cc: Rick Jones, Krishna Kumar2, Evgeniy Polyakov, netdev
> small packets belonging to the same connection could be coalesced by
> TCP, but this may help the case where multiple parallel connections are
> sending small packets.
It's not just small packets. The cost of calling hard_start_xmit/byte
was rather high on your particular device. I've seen PCI read
transaction in hard_start_xmit taking ~10,000 cycles on one particular
device. Count the cycles your brand of NIC is taking in its
xmit_routine. The worse it is, the stronger your case for cluster
transmits.
Also, I think, you don't have to chain skbs, they're already chained in
Qdisc->q. All you have to do is take the whole q and try to shove it
at the device hoping for better results. But then, if you have rather
big backlog, you run the risk of reordering packets if you have to requeue.
>
> Thanks
> Sridhar
--
Gagan
>
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 19:43 ` Gagan Arneja
@ 2007-05-10 20:11 ` jamal
2007-05-10 20:14 ` Rick Jones
` (2 more replies)
2007-05-10 20:37 ` David Miller
2007-05-11 5:20 ` Krishna Kumar2
2 siblings, 3 replies; 86+ messages in thread
From: jamal @ 2007-05-10 20:11 UTC (permalink / raw)
To: Gagan Arneja
Cc: Sridhar Samudrala, Rick Jones, Krishna Kumar2, Evgeniy Polyakov,
netdev
The discussion seems to have steered into protocol coalescing.
My tests for example were related to forwarding and not specific
to any protocol.
On Thu, 2007-10-05 at 12:43 -0700, Gagan Arneja wrote:
> It's not just small packets. The cost of calling hard_start_xmit/byte
> was rather high on your particular device. I've seen PCI read
> transaction in hard_start_xmit taking ~10,000 cycles on one particular
> device. Count the cycles your brand of NIC is taking in its
> xmit_routine. The worse it is, the stronger your case for cluster
> transmits.
You would need to almost re-write the driver to make sure it does IO
which is taking advantage of the batching.
> Also, I think, you don't have to chain skbs, they're already chained in
> Qdisc->q.
> All you have to do is take the whole q and try to shove it
> at the device hoping for better results.
You are onto something with the desire to use the qdisc; however, note you
need to leave the qdisc above so other CPUs can continue enqueueing.
In my approach i used the qdisc to populate a list that is attached to
the dev.
> But then, if you have rather
> big backlog, you run the risk of reordering packets if you have to requeue.
You could avoid requeueing totally by asking the driver how much space it
has. Once you shove down to it a number of packets, you are then
guaranteed they will never be requeued (refer to the slides i pointed to
earlier).
cheers,
jamal
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:11 ` jamal
@ 2007-05-10 20:14 ` Rick Jones
2007-05-10 20:15 ` jamal
2007-05-10 20:15 ` Gagan Arneja
2007-05-11 5:22 ` Krishna Kumar2
2 siblings, 1 reply; 86+ messages in thread
From: Rick Jones @ 2007-05-10 20:14 UTC (permalink / raw)
To: hadi
Cc: Gagan Arneja, Sridhar Samudrala, Krishna Kumar2, Evgeniy Polyakov,
netdev
jamal wrote:
> The discussion seems to have steered into protocol coalescing.
> My tests for example were related to forwarding and not specific
> to any protocol.
Just the natural tendency of end-system types to think of end-system things
rather than router things.
rick jones
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:11 ` jamal
2007-05-10 20:14 ` Rick Jones
@ 2007-05-10 20:15 ` Gagan Arneja
2007-05-10 20:21 ` jamal
2007-05-11 5:22 ` Krishna Kumar2
2 siblings, 1 reply; 86+ messages in thread
From: Gagan Arneja @ 2007-05-10 20:15 UTC (permalink / raw)
To: hadi
Cc: Gagan Arneja, Sridhar Samudrala, Rick Jones, Krishna Kumar2,
Evgeniy Polyakov, netdev
jamal wrote:
> You would need to almost re-write the driver to make sure it does IO
> which is taking advantage of the batching.
Really! It's just the transmit routine. How radical can that be?
--
Gagan
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:14 ` Rick Jones
@ 2007-05-10 20:15 ` jamal
0 siblings, 0 replies; 86+ messages in thread
From: jamal @ 2007-05-10 20:15 UTC (permalink / raw)
To: Rick Jones
Cc: Gagan Arneja, Sridhar Samudrala, Krishna Kumar2, Evgeniy Polyakov,
netdev
On Thu, 2007-10-05 at 13:14 -0700, Rick Jones wrote:
> Just the natural tendency of end-system types to think of end-system things
> rather than router things.
Well router types felt they were being left out ;->
cheers,
jamal
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:15 ` Gagan Arneja
@ 2007-05-10 20:21 ` jamal
2007-05-10 20:25 ` Gagan Arneja
0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2007-05-10 20:21 UTC (permalink / raw)
To: Gagan Arneja
Cc: Sridhar Samudrala, Rick Jones, Krishna Kumar2, Evgeniy Polyakov,
netdev
On Thu, 2007-10-05 at 13:15 -0700, Gagan Arneja wrote:
> Really! It's just the transmit routine. How radical can that be?
>
Ok, you have a point there, but it could be challenging with many
tunables:
For example:
my biggest challenge with the e1000 was just hacking up the DMA setup
path - i seem to get better numbers if i don't kick the DMA until i stash
all the packets on the ring first etc. It seemed counter-intuitive.
Now if you can amortize the PCI operations as well (the 10K cycles you
pointed out), i think that's where you get the best bang for the buck.
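Roughly what i mean, as a sketch only (e1000_tx_queue_one() is a made-up
stand-in for the driver's e1000_tx_map()/e1000_tx_queue() pair, and head
is the linked chain handed to the driver):

        struct sk_buff *skb;
        int count = 0;

        /* stash all the descriptors on the ring first ... */
        for (skb = head; skb; skb = skb->skb_flink)
                count += e1000_tx_queue_one(adapter, tx_ring, skb);

        /* ... then kick the DMA once, with a single tail write */
        if (count)
                writel(tx_ring->next_to_use,
                       adapter->hw.hw_addr + tx_ring->tdt);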
cheers,
jamal
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 14:53 [RFC] New driver API to speed up small packets xmits Krishna Kumar
2007-05-10 15:08 ` Evgeniy Polyakov
@ 2007-05-10 20:21 ` Roland Dreier
2007-05-11 7:30 ` Krishna Kumar2
2007-05-11 9:05 ` Andi Kleen
2 siblings, 1 reply; 86+ messages in thread
From: Roland Dreier @ 2007-05-10 20:21 UTC (permalink / raw)
To: Krishna Kumar; +Cc: netdev, Krishna Kumar
This is pretty interesting to me for IP-over-InfiniBand, for a couple
of reasons. First of all, I can push multiple send requests to the
underlying adapter in one go, which saves taking and dropping the same
lock and also probably allows fewer MMIO writes for doorbells.
However the second reason this is useful is actually more
interesting. Right now we request a send completion interrupt for
every packet we send because most current IB adapters don't really
have very good interrupt mitigation features. If we don't signal a
send completion for every packet, then we might run into a situation
where we don't get a completion for the last packet that gets sent and
end up stalling the interface because the stack never gives us another
packet to send. However, if I have a whole queue of packets passed to
my xmit routine, then I only have to request a completion for the last
packet in the list, which could potentially cut down on interrupts and
completion processing a lot.
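Roughly, the post loop would just mark the tail of the chain, something
like this sketch (the real ipoib_send() builds its work requests quite
differently; this only shows the signal-last idea):

        /* sketch: post n linked sends, request a completion only for
         * the last one */
        static int ipoib_post_send_list(struct ipoib_dev_priv *priv,
                                        struct ib_send_wr *wr, int n)
        {
                struct ib_send_wr *bad_wr;
                int i;

                for (i = 0; i < n; i++) {
                        wr[i].next       = (i == n - 1) ? NULL : &wr[i + 1];
                        wr[i].send_flags = (i == n - 1) ? IB_SEND_SIGNALED : 0;
                }
                return ib_post_send(priv->qp, wr, &bad_wr);
        }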
- R.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:21 ` jamal
@ 2007-05-10 20:25 ` Gagan Arneja
0 siblings, 0 replies; 86+ messages in thread
From: Gagan Arneja @ 2007-05-10 20:25 UTC (permalink / raw)
To: hadi
Cc: Gagan Arneja, Sridhar Samudrala, Rick Jones, Krishna Kumar2,
Evgeniy Polyakov, netdev
> For example:
> my biggest challenge with the e1000 was just hacking up the DMA setup
> path - i seem to get better numbers if i don't kick the DMA until i stash
> all the packets on the ring first etc. It seemed counter-intuitive.
That seems to make sense. The rings are(?) in system memory and you can
write to them faster. Kicking the DMA needs a transaction on the PCI
bus, which can be painfully slow...
--
Gagan
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 19:21 ` Vlad Yasevich
2007-05-10 19:26 ` Ian McDonald
@ 2007-05-10 20:32 ` David Miller
2007-05-10 20:49 ` Rick Jones
1 sibling, 1 reply; 86+ messages in thread
From: David Miller @ 2007-05-10 20:32 UTC (permalink / raw)
To: vladislav.yasevich; +Cc: ian.mcdonald, rick.jones2, krkumar2, johnpol, netdev
From: Vlad Yasevich <vladislav.yasevich@hp.com>
Date: Thu, 10 May 2007 15:21:30 -0400
> The win might be biggest on a system where a lot of applications send
> a lot of small packets. Some number will aggregate in the prio
> queue and then get shoved into a driver in one go.
>
> But... this is all conjecture until we see the code.
Also, whatever you gain in cpu usage you'll lose in latency.
And things sending tiny frames are exactly the ones that care about
latency.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 19:43 ` Gagan Arneja
2007-05-10 20:11 ` jamal
@ 2007-05-10 20:37 ` David Miller
2007-05-10 20:40 ` Gagan Arneja
2007-05-11 5:21 ` Krishna Kumar2
2007-05-11 5:20 ` Krishna Kumar2
2 siblings, 2 replies; 86+ messages in thread
From: David Miller @ 2007-05-10 20:37 UTC (permalink / raw)
To: gaagaan; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev
From: Gagan Arneja <gaagaan@gmail.com>
Date: Thu, 10 May 2007 12:43:53 -0700
> It's not just small packets. The cost of calling hard_start_xmit/byte
> was rather high on your particular device. I've seen PCI read
> transaction in hard_start_xmit taking ~10,000 cycles on one particular
> device. Count the cycles your brand of NIC is taking in its
> xmit_routine. The worse it is, the stronger your case for cluster
> transmits.
>
> Also, I think, you don't have to chain skbs, they're already chained in
> Qdisc->q. All you have to do is take the whole q and try to shove it
> at the device hoping for better results. But then, if you have rather
> big backlog, you run the risk of reordering packets if you have to requeue.
If the qdisc is packed with packets and we would just loop sending
them to the device, yes it might make sense.
But if that isn't the case, which frankly is the usual case, you add a
non-trivial amount of latency by batching and that's bad exactly for
the kind of applications that send small frames.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:37 ` David Miller
@ 2007-05-10 20:40 ` Gagan Arneja
2007-05-10 20:57 ` David Miller
2007-05-11 5:21 ` Krishna Kumar2
1 sibling, 1 reply; 86+ messages in thread
From: Gagan Arneja @ 2007-05-10 20:40 UTC (permalink / raw)
To: David Miller; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev
David Miller wrote:
>
> If the qdisc is packed with packets and we would just loop sending
> them to the device, yes it might make sense.
>
> But if that isn't the case, which frankly is the usual case, you add a
> non-trivial amount of latency by batching and that's bad exactly for
> the kind of applications that send small frames.
>
I don't understand how transmitting already batched up packets in one go
introduces latency.
--
Gagan
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:32 ` David Miller
@ 2007-05-10 20:49 ` Rick Jones
2007-05-10 21:02 ` David Miller
0 siblings, 1 reply; 86+ messages in thread
From: Rick Jones @ 2007-05-10 20:49 UTC (permalink / raw)
To: David Miller; +Cc: vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev
David Miller wrote:
> From: Vlad Yasevich <vladislav.yasevich@hp.com>
> Date: Thu, 10 May 2007 15:21:30 -0400
>
>
>>The win might be biggest on a system where a lot of applications send
>>a lot of small packets. Some number will aggregate in the prio
>>queue and then get shoved into a driver in one go.
>>
>>But... this is all conjecture until we see the code.
>
>
> Also, whatever you gain in cpu usage you'll lose in latency.
>
> And things sending tiny frames are exactly the ones that care about
> latency.
I'd think one would only do this in those situations/places where a natural "out
of driver" queue develops in the first place wouldn't one?
rick jones
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:40 ` Gagan Arneja
@ 2007-05-10 20:57 ` David Miller
2007-05-11 6:07 ` Krishna Kumar2
0 siblings, 1 reply; 86+ messages in thread
From: David Miller @ 2007-05-10 20:57 UTC (permalink / raw)
To: gaagaan; +Cc: sri, rick.jones2, krkumar2, johnpol, netdev
From: Gagan Arneja <gaagaan@gmail.com>
Date: Thu, 10 May 2007 13:40:22 -0700
> David Miller wrote:
> >
> > If the qdisc is packed with packets and we would just loop sending
> > them to the device, yes it might make sense.
> >
> > But if that isn't the case, which frankly is the usual case, you add a
> > non-trivial amount of latency by batching and that's bad exactly for
> > the kind of applications that send small frames.
> >
>
> I don't understand how transmitting already batched up packets in one go
> introduces latency.
Keep thinking :-)
The only case where these ideas can be seriously considered is during
netif_wake_queue(). In all other cases we go straight to the device
from sendmsg(), since there is space in the TX ring, and we should
not try to batch as that will add latency.
So maybe find a solution in that area, some new driver interface that
can suck N packets out of a qdisc when netif_wake_queue() occurs.
If you look at the contexts in which netif_wake_queue() is called,
it's perfect, it already has the TX state of the driver locked, it has
the TX descriptor head/tail pointers handy, etc. and it's already
taken all the cache misses needed in order to work with this state.
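i.e. something along these lines from the driver's TX-clean path,
instead of a bare netif_wake_queue() (hypothetical names, and the
qdisc/queue_lock handling is glossed over):

        /* hypothetical helper, called with the driver's TX state locked
         * and the number of newly freed TX descriptors known */
        static void netif_wake_queue_batch(struct net_device *dev,
                                           int free_slots)
        {
                struct sk_buff *skb;

                netif_wake_queue(dev);
                while (free_slots-- > 0 &&
                       (skb = dev->qdisc->dequeue(dev->qdisc)) != NULL)
                        dev->hard_start_xmit(skb, dev);  /* fits, no requeue */
        }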
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:49 ` Rick Jones
@ 2007-05-10 21:02 ` David Miller
2007-05-10 21:14 ` Gagan Arneja
0 siblings, 1 reply; 86+ messages in thread
From: David Miller @ 2007-05-10 21:02 UTC (permalink / raw)
To: rick.jones2; +Cc: vladislav.yasevich, ian.mcdonald, krkumar2, johnpol, netdev
From: Rick Jones <rick.jones2@hp.com>
Date: Thu, 10 May 2007 13:49:44 -0700
> I'd think one would only do this in those situations/places where a
> natural "out of driver" queue develops in the first place wouldn't
> one?
Indeed.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:02 ` David Miller
@ 2007-05-10 21:14 ` Gagan Arneja
2007-05-11 2:28 ` Stephen Hemminger
0 siblings, 1 reply; 86+ messages in thread
From: Gagan Arneja @ 2007-05-10 21:14 UTC (permalink / raw)
To: David Miller
Cc: rick.jones2, vladislav.yasevich, ian.mcdonald, krkumar2, johnpol,
netdev
David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Thu, 10 May 2007 13:49:44 -0700
>
>> I'd think one would only do this in those situations/places where a
>> natural "out of driver" queue develops in the first place wouldn't
>> one?
>
> Indeed.
And one builds up in the qdisc because your device sink is slow. There's
nothing inherently there in the kernel encouraging Qdisc->q to grow.
It's precisely these broken devices that can take advantage of cluster
transmits.
--
Gagan
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 17:19 ` Rick Jones
2007-05-10 18:07 ` Sridhar Samudrala
2007-05-10 18:13 ` Vlad Yasevich
@ 2007-05-10 21:27 ` David Stevens
2007-05-10 21:40 ` David Miller
` (2 more replies)
2 siblings, 3 replies; 86+ messages in thread
From: David Stevens @ 2007-05-10 21:27 UTC (permalink / raw)
To: Rick Jones; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner
The word "small" is coming up a lot in this discussion, and
I think packet size really has nothing to do with it. Multiple
streams generating packets of any size would benefit; the
key ingredient is a queue length greater than 1.
I think the intent is to remove queue lock cycles by taking
the whole list (at least up to the count of free ring buffers)
when the queue is greater than one packet, thus effectively
removing the lock expense for n-1 packets.
+-DLS
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:27 ` David Stevens
@ 2007-05-10 21:40 ` David Miller
2007-05-10 21:50 ` Gagan Arneja
2007-05-10 21:41 ` Eric Dumazet
2007-05-10 21:45 ` Rick Jones
2 siblings, 1 reply; 86+ messages in thread
From: David Miller @ 2007-05-10 21:40 UTC (permalink / raw)
To: dlstevens; +Cc: rick.jones2, johnpol, krkumar2, netdev, netdev-owner
From: David Stevens <dlstevens@us.ibm.com>
Date: Thu, 10 May 2007 14:27:56 -0700
> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
>
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.
Right.
But I think it's critical to do two things:
1) Do this when netif_wake_queue() is triggered and thus the
TX is locked already.
2) Have some way for the driver to say how many free TX slots
there are in order to minimize if not eliminate requeueing
during this batching thing.
If you drop the TX lock, the number of free slots can change
as another cpu gets in there queuing packets.
I know there are some hardware workarounds that require using
more TX ring buffer slots and are usually necessary, which
makes a 100% accurate indication of free slots impossible.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:27 ` David Stevens
2007-05-10 21:40 ` David Miller
@ 2007-05-10 21:41 ` Eric Dumazet
2007-05-10 22:09 ` Rick Jones
2007-05-10 21:45 ` Rick Jones
2 siblings, 1 reply; 86+ messages in thread
From: Eric Dumazet @ 2007-05-10 21:41 UTC (permalink / raw)
To: David Stevens
Cc: Rick Jones, Evgeniy Polyakov, Krishna Kumar2, netdev,
netdev-owner
David Stevens wrote:
> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
>
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.
>
Yes, but on modern cpus, locked operations are basically free once the CPU
already has the cache line in exclusive access in its L1 cache.
I am not sure adding yet another driver API will help very much.
It will for sure add some bugs and pain.
A less expensive (and less prone to bugs) optimization would be to prefetch
one cache line for next qdisc skb, as a cache line miss is far more expensive
than a locked operation (if lock already in L1 cache of course)
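For instance, something as small as this in qdisc_restart() (a sketch;
it assumes a qdisc that keeps its packets on q->q, which the default
pfifo_fast does not):

        skb = q->dequeue(q);
        if (skb != NULL) {
                /* warm up the next packet's cache line while this one
                 * is handed to the driver */
                struct sk_buff *next = skb_peek(&q->q);

                if (next)
                        prefetch(next);
        }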
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:27 ` David Stevens
2007-05-10 21:40 ` David Miller
2007-05-10 21:41 ` Eric Dumazet
@ 2007-05-10 21:45 ` Rick Jones
2007-05-10 21:53 ` David Stevens
2 siblings, 1 reply; 86+ messages in thread
From: Rick Jones @ 2007-05-10 21:45 UTC (permalink / raw)
To: David Stevens; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner
David Stevens wrote:
> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
>
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.
Which worked _very_ well (the whole list) going in the other direction for the
netisr queue(s) in HP-UX 10.20. OK, I promise no more old HP-UX stories for the
balance of the week :)
Taking some count of the list might be a trifle too complicated.
rick jones
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:40 ` David Miller
@ 2007-05-10 21:50 ` Gagan Arneja
2007-05-10 22:06 ` David Miller
0 siblings, 1 reply; 86+ messages in thread
From: Gagan Arneja @ 2007-05-10 21:50 UTC (permalink / raw)
To: David Miller
Cc: dlstevens, rick.jones2, johnpol, krkumar2, netdev, netdev-owner
David Miller wrote:
> Right.
>
> But I think it's critical to do two things:
>
> 1) Do this when netif_wake_queue() is triggered and thus the
> TX is locked already.
>
> 2) Have some way for the driver to say how many free TX slots
> there are in order to minimize if not eliminate requeueing
> during this batching thing.
I don't think you can reliably do this. Jumbograms may end up taking
more than one slot.
>
> If you drop the TX lock, the number of free slots can change
> as another cpu gets in there queuing packets.
Can you ever have more than one thread inside the driver? Isn't
xmit_lock held while we're in there?
--
Gagan
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:45 ` Rick Jones
@ 2007-05-10 21:53 ` David Stevens
0 siblings, 0 replies; 86+ messages in thread
From: David Stevens @ 2007-05-10 21:53 UTC (permalink / raw)
To: Rick Jones; +Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner
> Which worked _very_ well (the whole list) going in the other direction for the
> netisr queue(s) in HP-UX 10.20. OK, I promise no more old HP-UX stories for the
> balance of the week :)
Yes, OSes I worked on in other lives usually took the whole queue
and then took responsibility for handling them outside the lock.
A dequeue_n() under 1 lock should still be better than n locks,
I'd think, since you may bounce to different CPUs during the n individual
cycles.
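i.e. a dequeue_n() along these lines (a sketch; the caller would hold
dev->queue_lock once around the whole call):

        /* sketch: pull up to n packets with a single lock acquisition */
        static int dequeue_n(struct Qdisc *q, struct sk_buff_head *batch, int n)
        {
                struct sk_buff *skb;
                int i = 0;

                while (i < n && (skb = q->dequeue(q)) != NULL) {
                        __skb_queue_tail(batch, skb);
                        i++;
                }
                return i;
        }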
+-DLS
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:50 ` Gagan Arneja
@ 2007-05-10 22:06 ` David Miller
2007-05-11 9:46 ` Krishna Kumar2
0 siblings, 1 reply; 86+ messages in thread
From: David Miller @ 2007-05-10 22:06 UTC (permalink / raw)
To: gaagaan; +Cc: dlstevens, rick.jones2, johnpol, krkumar2, netdev, netdev-owner
From: Gagan Arneja <gaagaan@gmail.com>
Date: Thu, 10 May 2007 14:50:19 -0700
> David Miller wrote:
>
> > If you drop the TX lock, the number of free slots can change
> > as another cpu gets in there queuing packets.
>
> Can you ever have more than one thread inside the driver? Isn't
> xmit_lock held while we're in there?
There are restrictions wrt. when the xmit_lock and the
queue lock can be held at the same time.
The devil is definitely in the details if you try to
implement this. It definitely lends support to Eric D.'s
assertion that this change will only add bugs, and that doing
something simple like prefetches is probably a safer
route to go down.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:41 ` Eric Dumazet
@ 2007-05-10 22:09 ` Rick Jones
0 siblings, 0 replies; 86+ messages in thread
From: Rick Jones @ 2007-05-10 22:09 UTC (permalink / raw)
To: Eric Dumazet
Cc: David Stevens, Evgeniy Polyakov, Krishna Kumar2, netdev,
netdev-owner
Eric Dumazet wrote:
David Stevens wrote:
>
>> The word "small" is coming up a lot in this discussion, and
>> I think packet size really has nothing to do with it. Multiple
>> streams generating packets of any size would benefit; the
>> key ingredient is a queue length greater than 1.
>>
>> I think the intent is to remove queue lock cycles by taking
>> the whole list (at least up to the count of free ring buffers)
>> when the queue is greater than one packet, thus effectively
>> removing the lock expense for n-1 packets.
>>
>
> Yes, but on modern cpus, locked operations are basically free once the
> CPU already has the cache line in exclusive access in its L1 cache.
But will it here? Any of the CPUs can be trying to add things to the qdisc, but
only one CPU is pulling from it, right? Even if the "pulling from it" is
happening in a loop, there can be scores or more other cores trying to add
things to the queue, which would cause that cache line to migrate.
> I am not sure adding yet another driver API will help very much.
> It will for sure add some bugs and pain.
That could very well be.
> A less expensive (and less prone to bugs) optimization would be to
> prefetch one cache line for next qdisc skb, as a cache line miss is far
> more expensive than a locked operation (if lock already in L1 cache of
> course)
Might they not build one on top of the other?
rick jones
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 21:14 ` Gagan Arneja
@ 2007-05-11 2:28 ` Stephen Hemminger
2007-05-11 5:01 ` Gagan Arneja
0 siblings, 1 reply; 86+ messages in thread
From: Stephen Hemminger @ 2007-05-11 2:28 UTC (permalink / raw)
To: Gagan Arneja
Cc: David Miller, rick.jones2, vladislav.yasevich, ian.mcdonald,
krkumar2, johnpol, netdev
On Thu, 10 May 2007 14:14:05 -0700
Gagan Arneja <gaagaan@gmail.com> wrote:
> David Miller wrote:
> > From: Rick Jones <rick.jones2@hp.com>
> > Date: Thu, 10 May 2007 13:49:44 -0700
> >
> >> I'd think one would only do this in those situations/places where a
> >> natural "out of driver" queue develops in the first place wouldn't
> >> one?
> >
> > Indeed.
>
> And one builds up in the qdisc because your device sink is slow. There's
> nothing inherently there in the kernel encouraging Qdisc->q to grow.
> It's precisely these broken devices that can take advantage of cluster
> transmits.
If you have braindead slow hardware,
there is nothing that says your start_xmit routine can't do its own
coalescing. The cost of calling the transmit routine is the responsibility
of the driver, not the core network code.
--
Stephen Hemminger <shemminger@linux-foundation.org>
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 2:28 ` Stephen Hemminger
@ 2007-05-11 5:01 ` Gagan Arneja
0 siblings, 0 replies; 86+ messages in thread
From: Gagan Arneja @ 2007-05-11 5:01 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Gagan Arneja, David Miller, rick.jones2, vladislav.yasevich,
ian.mcdonald, krkumar2, johnpol, netdev
> If you have braindead slow hardware,
> there is nothing that says your start_xmit routine can't do its own
> coalescing. The cost of calling the transmit routine is the
> responsibility
> of the driver, not the core network code.
Yes, except you very likely run the risk of the driver introducing
latency. The problem is prevalent enough that one can argue for a fix (a
better one at that) higher up. And it's not all that high up that we
need to get completely paranoid about it.
> --
> Stephen Hemminger <shemminger@linux-foundation.org>
>
--
Gagan
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 18:59 ` Ian McDonald
2007-05-10 19:21 ` Vlad Yasevich
@ 2007-05-11 5:04 ` Krishna Kumar2
2007-05-11 9:01 ` Evgeniy Polyakov
1 sibling, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 5:04 UTC (permalink / raw)
To: Ian McDonald; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Vlad Yasevich
"Ian McDonald" <ian.mcdonald@jandi.co.nz> wrote on 05/11/2007 12:29:08 AM:
>
> As I see this proposed patch it is about reducing the number of "task
> switches" between the driver and the protocol. I use task switch in
> speech marks as it isn't really as is in the kernel. So in other words
> we are hoping that spending more time in each area would keep the
> cache hot and work to be done if locks held. This of course requires
> that the complexity added doesn't outweigh the gains - otherwise you
> could end up in a worse scenario where the driver doesn't send packets
> because the protocol is busy linking them.
This is the right summary.
> As far as I can tell you're not combining packets?? This would
> definitely break UDP/DCCP which are datagram based.
I am not combining packets; I am sending them out in the same sequence in
which they were queued. If the xmit fails, the driver's new API returns
the skb which failed to be sent. This skb and all the remaining linked
skbs are requeue'd in reverse order (fofi?) until the next time sending
is tried again. I see that sometimes I can send tx_queue_len packets in
one shot and all succeed. But the downside is that in the failure case,
the packets have to be requeue'd.
> Will be interesting to see results but I'm a little sceptical.
I was just going to test, and Rick helpfully pointed out netperf4.
Thanks,
- KK
>
> --
> Web: http://wand.net.nz/~iam4/
> Blog: http://iansblog.jandi.co.nz
> WAND Network Research Group
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 19:43 ` Gagan Arneja
2007-05-10 20:11 ` jamal
2007-05-10 20:37 ` David Miller
@ 2007-05-11 5:20 ` Krishna Kumar2
2007-05-11 5:35 ` Gagan Arneja
2 siblings, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 5:20 UTC (permalink / raw)
To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala
Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 01:13:53 AM:
> Also, I think, you don't have to chain skbs, they're already chained in
> Qdisc->q. All you have to do is take the whole q and try to shove it
> at the device hoping for better results. But then, if you have rather
> big backlog, you run the risk of reordering packets if you have to
> requeue.
I haven't seen reordering packets (I did once when I was having a bug in
the requeue code, some TCP messages on receiver indicating packets out of
order). When a send fails, the packets are requeued in reverse (go to the end
of the failed skb and traverse back to the failed skb and requeue each skb).
Since new inserts go to the end, the queue is guaranteed to be in order.
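The requeue path looks roughly like this (a sketch; failed_skb is the skb
returned by the driver, skb_flink/skb_blink are the proposed fields, and
q->ops->requeue puts an skb back at the head of the queue):

        /* walk to the end of the failed chain, then requeue back-to-front
         * so the original head ends up at the head of the queue again */
        struct sk_buff *last = failed_skb, *prev;

        while (last->skb_flink)
                last = last->skb_flink;
        for (;;) {
                prev = (last == failed_skb) ? NULL : last->skb_blink;
                q->ops->requeue(last, q);
                if (!prev)
                        break;
                last = prev;
        }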
Thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:37 ` David Miller
2007-05-10 20:40 ` Gagan Arneja
@ 2007-05-11 5:21 ` Krishna Kumar2
1 sibling, 0 replies; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 5:21 UTC (permalink / raw)
To: David Miller; +Cc: gaagaan, johnpol, netdev, rick.jones2, sri
David Miller <davem@davemloft.net> wrote on 05/11/2007 02:07:10 AM:
> From: Gagan Arneja <gaagaan@gmail.com>
> Date: Thu, 10 May 2007 12:43:53 -0700
> >
> > Also, I think, you don't have to chain skbs, they're already chained in
> > Qdisc->q. All you have to do is take the whole q and try to shove it
> > at the device hoping for better results. But then, if you have rather
> > big backlog, you run the risk of reordering packets if you have to
> > requeue.
>
> If the qdisc is packed with packets and we would just loop sending
> them to the device, yes it might make sense.
>
> But if that isn't the case, which frankly is the usual case, you add a
> non-trivial amount of latency by batching and that's bad exactly for
> the kind of applications that send small frames.
Without batching, shouldn't the latency be the same, since qdisc_run would
send each packet in a tight loop anyway?
I will also run TCP/UDP RR tests.
thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:11 ` jamal
2007-05-10 20:14 ` Rick Jones
2007-05-10 20:15 ` Gagan Arneja
@ 2007-05-11 5:22 ` Krishna Kumar2
2007-05-11 11:27 ` jamal
2 siblings, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 5:22 UTC (permalink / raw)
To: hadi; +Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
Sridhar Samudrala
J Hadi Salim <j.hadi123@gmail.com> wrote on 05/11/2007 01:41:27 AM:
> > It's not just small packets. The cost of calling hard_start_xmit/byte
> > was rather high on your particular device. I've seen PCI read
> > transaction in hard_start_xmit taking ~10,000 cycles on one particular
> > device. Count the cycles your brand of NIC is taking in it's
> > xmit_routine. The worse it is, the stronger your case for cluster
> > transmits.
>
> You would need to almost re-write the driver to make sure it does IO
> which is taking advantage of the batching.
I didn't try to optimize the driver to take any real advantage; I coded it
as simply as:

        top:
                next = skb->skb_flink;

                /* original driver code here; another option is to move the
                 * locking to before the "top:" label above, no other changes */

                skb = next;
                if (skb)
                        goto top;

This way, the diffs are minimal. However, driver experts might know of some
tricks to optimize this, since most drivers make a hundred checks before
sending one skb, and some of that work could perhaps be avoided for each
linked skb.
> You could avoid requeueing totally by asking the driver how much space it
> has. Once you shove down to it a number of packets, you are then
> guaranteed they will never be requeued (refer to the slides i pointed to
> earlier).
This is a good idea; I will look at your slides to see what exactly this
is.
However, even with this check, drivers will return errors, e.g. a failed
trylock or, as the e1000 has, fifo_workaround failures. But I still think
it will significantly reduce the failures. Right now, I get a lot of
requeues.
thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 5:20 ` Krishna Kumar2
@ 2007-05-11 5:35 ` Gagan Arneja
2007-05-11 5:43 ` Krishna Kumar2
0 siblings, 1 reply; 86+ messages in thread
From: Gagan Arneja @ 2007-05-11 5:35 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala
Krishna Kumar2 wrote:
> I haven't seen reordering packets (I did once when I was having a bug in
> the requeue code, some TCP messages on receiver indicating packets out of
> order). When a send fails, the packets are requeued in reverse (go to the end
> of the failed skb and traverse back to the failed skb and requeue each skb).
> Since new inserts go to the end, the queue is guaranteed to be in order.
queue_lock is dropped when you're in xmit. There's no guarantee packets
won't be queued up while you're trying a transmit.
>
> Thanks,
>
> - KK
>
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 5:35 ` Gagan Arneja
@ 2007-05-11 5:43 ` Krishna Kumar2
2007-05-11 5:57 ` Gagan Arneja
0 siblings, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 5:43 UTC (permalink / raw)
To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala
Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 11:05:47 AM:
> Krishna Kumar2 wrote:
> > I haven't seen packet reordering (I did once when I had a bug in the
> > requeue code - some TCP messages on the receiver indicated packets out
> > of order). When a send fails, the packets are requeued in reverse (go to
> > the end of the chain starting at the failed skb and traverse back to the
> > failed skb, requeuing each skb). Since new inserts go to the end, the
> > queue is guaranteed to stay in order.
>
> queue_lock is dropped when you're in xmit. There's no guarantee packets
> won't be queued up while you're trying a transmit.
Right, but I am the sole dequeuer, and on failure, I requeue those packets
to the beginning of the queue (just as would happen in the regular case of
a single packet xmit/failure/requeue).
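As an illustration, a rough sketch of that reverse-order requeue, assuming
the RFC's proposed skb_flink/skb_blink fields and the qdisc requeue op
(a sketch only, not the posted patch):

#include <linux/skbuff.h>
#include <net/sch_generic.h>

/*
 * Sketch: requeue a failed chain of linked skbs so the qdisc ends up
 * in the original order.  skb_flink/skb_blink are the RFC's proposed
 * fields (not upstream); q->ops->requeue() inserts at the head of the
 * queue, so we walk to the tail first and requeue back-to-front.
 * Caller holds dev->queue_lock.
 */
static void requeue_linked_skbs(struct sk_buff *failed, struct Qdisc *q)
{
	struct sk_buff *skb = failed;

	/* find the tail of the failed chain */
	while (skb->skb_flink)
		skb = skb->skb_flink;

	/* requeue tail-first so the head ends up first in line */
	while (skb) {
		struct sk_buff *prev = skb->skb_blink;

		skb->skb_flink = skb->skb_blink = NULL;
		q->ops->requeue(skb, q);
		skb = prev;
	}
}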
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 5:43 ` Krishna Kumar2
@ 2007-05-11 5:57 ` Gagan Arneja
2007-05-11 6:06 ` Krishna Kumar2
0 siblings, 1 reply; 86+ messages in thread
From: Gagan Arneja @ 2007-05-11 5:57 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala
>
> Right, but I am the sole dequeuer, and on failure, I requeue those packets
> to the beginning of the queue (just as would happen in the regular case of
> a single packet xmit/failure/requeue).
What about a race between trying to reacquire queue_lock and another
failed transmit?
--
Gagan
>
> - KK
>
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 5:57 ` Gagan Arneja
@ 2007-05-11 6:06 ` Krishna Kumar2
2007-05-11 6:29 ` Gagan Arneja
0 siblings, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 6:06 UTC (permalink / raw)
To: Gagan Arneja; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala
Hi Gagan,
Gagan Arneja <gaagaan@gmail.com> wrote on 05/11/2007 11:27:54 AM:
> >
> > Right, but I am the sole dequeuer, and on failure, I requeue those
> > packets to the beginning of the queue (just as would happen in the
> > regular case of a single packet xmit/failure/requeue).
>
> What about a race between trying to reacquire queue_lock and another
> failed transmit?
That is not possible either. I hold the QDISC_RUNNING bit in dev->state and
am the only sender for this device, so there is no other failed transmit.
Also, on failure of dev_hard_start_xmit, qdisc_restart will spin for the
queue lock (which could be held quickly by enqueuers), and q->requeue() the
skbs to the head of the list.
Thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:57 ` David Miller
@ 2007-05-11 6:07 ` Krishna Kumar2
0 siblings, 0 replies; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 6:07 UTC (permalink / raw)
To: David Miller; +Cc: gaagaan, johnpol, netdev, rick.jones2, sri
Hi Dave,
David Miller <davem@davemloft.net> wrote on 05/11/2007 02:27:07 AM:
> > I don't understand how transmitting already batched up packets in one
> > go introduces latency.
>
> Keep thinking :-)
>
> The only case where these ideas can be seriously considered is during
> netif_wake_queue(). In all other cases we go straight to the device
> from sendmsg(), since there is space in the TX ring, and we should
> not try to batch as that will add latency.
I am not very clear about this, you are saying latency will increase
due to old packets not being cache-warm ?
But the existing code will do exactly the same thing as what I am
proposing, except it will iterate over all the packets one by one.
So will batching them make latency worse ?
In the sendmsg case you wrote above (is it tcp_transmit_skb), it will
still come via IP (through dst_output) ?
I would like to try your idea if you are sure it will reduce latency.
Thanks,
- KK
> So maybe find a solution in that area, some new driver interface that
> can suck N packets out of a qdisc when netif_wake_queue() occurs.
>
> If you look at the contexts in which netif_wake_queue() is called,
> it's perfect, it already has the TX state of the driver locked, it has
> the TX descriptor head/tail pointers handy, etc. and it's already
> taken all the cache misses needed in order to work with this state.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 6:06 ` Krishna Kumar2
@ 2007-05-11 6:29 ` Gagan Arneja
2007-05-11 6:52 ` Krishna Kumar2
0 siblings, 1 reply; 86+ messages in thread
From: Gagan Arneja @ 2007-05-11 6:29 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Evgeniy Polyakov, netdev, Rick Jones, Sridhar Samudrala
Krishna Kumar2 wrote:
>> What about a race between trying to reacquire queue_lock and another
>> failed transmit?
>
> That is not possible too. I hold the QDISC_RUNNING bit in dev->state and
> am the only sender for this device, so there is no other failed transmit.
> Also, on failure of dev_hard_start_xmit, qdisc_restart will spin for the
> queue lock (which could be held quickly by enqueuer's), and q->requeue()
> skbs to the head of the list.
I have to claim incomplete familiarity with the code. But still, if
you're out there running with no locks for a period, there's no
assumption you can make. The "lock could be held quickly" assertion is a
fallacy.
>
> Thanks,
>
> - KK
>
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 6:29 ` Gagan Arneja
@ 2007-05-11 6:52 ` Krishna Kumar2
0 siblings, 0 replies; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 6:52 UTC (permalink / raw)
To: Gagan Arneja
Cc: Evgeniy Polyakov, netdev, netdev-owner, Rick Jones,
Sridhar Samudrala
Hi Gagan,
> I have to claim incomplete familiarity with the code. But still, if
> you're out there running with no locks for a period, there's no
> assumption you can make. The "lock could be held quickly" assertion is a
> fallacy.
I will try to explain since the code is pretty complicated. Packets coming
down from IP (in dev_queue_xmit) are first added to a FIFO queue associated
with the device (queue_lock is held first). Then qdisc_run is called, and if
it can set the QDISC_RUNNING bit (making it the "sole dequeuer"), it calls
qdisc_restart (which *could* get the tx lock) and drops the queue_lock to
call the actual driver xmit routine. Meanwhile 'n' CPUs can race and add
more skbs to the same queue, each getting this lock for a short time and
dropping it, and all these skbs are added to the end of the queue. These are
"pure enqueuers" (since their attempt to set the same bit fails and they
return from qdisc_run). In the original thread, the dequeuer gets an error
from the driver xmit. It first re-acquires the queue_lock and calls requeue,
which adds the skb to the head of the queue. Since no one else has been able
to dequeue packets while the RUNNING bit is set, packet re-ordering cannot
happen. After returning to __qdisc_run, the RUNNING bit is cleared.
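In code form, roughly (a simplified sketch, not the actual sources; the
driver's tx lock and several details are omitted):

#include <linux/netdevice.h>
#include <net/sch_generic.h>

/*
 * Simplified sketch of the flow described above: enqueue always happens
 * under dev->queue_lock, only the CPU that wins the QDISC_RUNNING bit
 * dequeues, and a failed transmit is put back at the head of the queue
 * before the bit is cleared, so ordering is preserved.
 */
static void sketch_qdisc_run(struct net_device *dev)
{
	struct Qdisc *q = dev->qdisc;
	struct sk_buff *skb;

	if (test_and_set_bit(__LINK_STATE_QDISC_RUNNING, &dev->state))
		return;		/* somebody else is the sole dequeuer */

	spin_lock(&dev->queue_lock);
	while ((skb = q->dequeue(q)) != NULL) {
		spin_unlock(&dev->queue_lock);

		/* driver xmit runs without queue_lock held */
		if (dev->hard_start_xmit(skb, dev) != 0) {
			spin_lock(&dev->queue_lock);
			q->ops->requeue(skb, q);	/* back to the head */
			break;
		}
		spin_lock(&dev->queue_lock);
	}
	spin_unlock(&dev->queue_lock);

	clear_bit(__LINK_STATE_QDISC_RUNNING, &dev->state);
}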
Hope that clarifies the code.
thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
@ 2007-05-11 7:14 Krishna Kumar2
0 siblings, 0 replies; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 7:14 UTC (permalink / raw)
To: David Stevens
Cc: Evgeniy Polyakov, Krishna Kumar2, netdev, netdev-owner,
Rick Jones
(Mistakenly didn't reply-all the previous time)
Hi Dave,
David Stevens <dlstevens@us.ibm.com> wrote on 05/11/2007 02:57:56 AM:
> The word "small" is coming up a lot in this discussion, and
> I think packet size really has nothing to do with it. Multiple
> streams generating packets of any size would benefit; the
> key ingredient is a queue length greater than 1.
Correct. Re-thinking about this - for larger packets the bandwidth may not
improve since it is already at line speed, but I guess CPU utilization could
drop (which could drop in the small-packet case too). Using "small" did not
cover this case.
Thanks,
- KK
> I think the intent is to remove queue lock cycles by taking
> the whole list (at least up to the count of free ring buffers)
> when the queue is greater than one packet, thus effectively
> removing the lock expense for n-1 packets.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 20:21 ` Roland Dreier
@ 2007-05-11 7:30 ` Krishna Kumar2
2007-05-11 11:21 ` Roland Dreier
0 siblings, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 7:30 UTC (permalink / raw)
To: Roland Dreier; +Cc: netdev
Hi Roland,
Roland Dreier <rdreier@cisco.com> wrote on 05/11/2007 01:51:50 AM:
> This is pretty interesting to me for IP-over-InfiniBand, for a couple
> of reasons. First of all, I can push multiple send requests to the
> underlying adapter in one go, which saves taking and dropping the same
> lock and also probably allows fewer MMIO writes for doorbells.
>
> However the second reason this is useful is actually more
> interesting. Right now we request a send completion interrupt for
> every packet we send because most current IB adapters don't really
> have very good interrupt mitigation features. If we don't signal a
> send completion for every packet, then we might run into a situation
> where we don't get a completion for the last packet that gets sent and
> end up stalling the interface because the stack never gives us another
> packet to send. However, if I have a whole queue of packets passed to
> my xmit routine, then I only have to request a completion for the last
> packet in the list, which could potentially cut down on interrupts and
> completion processing a lot.
Sounds like a good idea. I had a question on error handling. What happens if
the driver asynchronously returns an error for this WR (single WR
containing multiple skbs)? Does it mean all the skbs failed to be sent?
Requeuing all of them is a bad idea since it leads to infinitely doing the
same thing, so should every skb be freed? Isn't that harsh (if the first skb
is bad and the others are fine), and how do we avoid that? In fact we don't
know which skb was transmitted and which failed. If we requeue skbs,
there will also be out-of-order skbs as the callback is asynchronous.
Thanks,
- KK
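For reference, a minimal sketch of the "signal only the last work request"
idea quoted above, using the verbs API; this is illustrative only, not the
IPoIB code, and assumes the work requests have already been built:

#include <rdma/ib_verbs.h>

/*
 * Illustrative: chain a batch of pre-built send work requests together,
 * request a completion only for the last one, and post the whole chain
 * with a single ib_post_send() call.
 */
static int post_wr_batch(struct ib_qp *qp, struct ib_send_wr *wrs, int n)
{
	struct ib_send_wr *bad_wr;
	int i;

	for (i = 0; i < n; i++) {
		wrs[i].next = (i == n - 1) ? NULL : &wrs[i + 1];
		wrs[i].send_flags = (i == n - 1) ? IB_SEND_SIGNALED : 0;
	}

	return ib_post_send(qp, &wrs[0], &bad_wr);
}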
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 9:05 ` Andi Kleen
@ 2007-05-11 8:32 ` Krishna Kumar2
2007-05-11 9:37 ` Andi Kleen
0 siblings, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 8:32 UTC (permalink / raw)
To: Andi Kleen; +Cc: ak, netdev
Hi Andy,
ak@suse.de wrote on 05/11/2007 02:35:05 PM:
> You don't need that. You can just use the normal next/prev pointers.
> In general it's a good idea to lower lock overhead etc., the VM has
> used similar tricks very successfully in the past.
Does this mean each skb should be for the same connection if next/prev
is used ?
> Another problem is that this setup typically requires the aggregate
> packets to be from the same connection. Otherwise you will only
> save a short trip into the stack until the linked packet would need
> to be split again to pass to multiple sockets. With that the scheme
> probably helps much less.
I guess you meant this for receives only. On the send side, packets
for different sockets can be linked and sent together, right ?
> Or you could do this only if multiple packets belong to the same
> single connection (basically with a one hit cache); but then it would
But for sends, why does same or different connection matter ? There is
no aggregating of skbs.
Thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 9:37 ` Andi Kleen
@ 2007-05-11 8:50 ` Krishna Kumar2
2007-05-11 11:16 ` Roland Dreier
1 sibling, 0 replies; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 8:50 UTC (permalink / raw)
To: Andi Kleen; +Cc: ak, Andi Kleen, netdev
Hi Andy,
Andi Kleen <ak@suse.de> wrote on 05/11/2007 03:07:14 PM:
> But without it aggregation on RX is much less useful because the packets
> cannot be kept together after socket demux which happens relatively early
> in the packet processing path.
Then I misunderstood you; my proposal is only for TX. GSO will not help, as
it is for large packets getting segmented at the latest point in time to
avoid going through the stack multiple times. My proposal is to link
multiple packets to send to the device in one shot rather than calling
qdisc_restart multiple times.
> > But for sends, why does same or different connection matter ? There is
> > no aggregating of skbs.
>
> I wasn't talking about sending.
>
> But there actually is :- TSO/GSO.
Sorry, I mis-wrote what I meant: in my proposal there is no aggregating
of skbs, so single vs. multiple connections should not matter.
Thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 5:04 ` Krishna Kumar2
@ 2007-05-11 9:01 ` Evgeniy Polyakov
2007-05-11 9:18 ` Krishna Kumar2
0 siblings, 1 reply; 86+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 9:01 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich
On Fri, May 11, 2007 at 10:34:22AM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> Not combining packets, I am sending them out in the same sequence it was
> queued. If the xmit failed, the driver's new API returns the skb which
> failed to be sent. This skb and all other linked skbs are requeue'd in
> the reverse order (fofi?) till the next time it is tried again. I see
> that sometimes I can send tx_queue_len packets in one shot and all
> succeeds. But the downside is that in failure case, the packets have to
> be requeue'd.
And what if you have thousand(s) of packets queued and the first one has
failed? Requeueing all the rest one-by-one is not a solution. If it is
being done under a heavy lock (with disabled irqs especially) it becomes a
disaster.
I thought of a bit different approach: the driver maintains its own queue
(or has access to the stack's one) and only uses the lock to dequeue a
packet. If a transmit fails, nothing is requeued, but the same packet is
tried until the transmit completes. If a number of transmits fail in a row,
the driver says it is broken and/or stops the queue. Thus one can set up
several descriptors in one go and do it without any locks. The stack calls
the driver's xmit function, which essentially sets a bit that new packets
are available, but the driver must have access to the qdisc.
If the e1000 driver were not so... so 'uncommon' compared to other
small gige drivers I would try to cook up a patch, but I will not with
the e1000 :)
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 14:53 [RFC] New driver API to speed up small packets xmits Krishna Kumar
2007-05-10 15:08 ` Evgeniy Polyakov
2007-05-10 20:21 ` Roland Dreier
@ 2007-05-11 9:05 ` Andi Kleen
2007-05-11 8:32 ` Krishna Kumar2
2 siblings, 1 reply; 86+ messages in thread
From: Andi Kleen @ 2007-05-11 9:05 UTC (permalink / raw)
To: Krishna Kumar; +Cc: netdev, Krishna Kumar
Krishna Kumar <krkumar2@in.ibm.com> writes:
> Doing some measurements, I found that for small packets like 128 bytes,
> the bandwidth is approximately 60% of the line speed. To possibly speed
> up performance of small packet xmits, a method of "linking" skbs was
> thought of - where two pointers (skb_flink/blink) is added to the skb.
You don't need that. You can just use the normal next/prev pointers.
In general it's a good idea to lower lock overhead etc., the VM has
used similar tricks very successfully in the past.
There were some thoughts about this earlier, but in highend
NICs the direction instead seems to go towards LRO (large receive offloading).
LRO is basically like TSO, just for receiving. The NIC aggregates
multiple packets into a single larger one that is then processed by
the stack as one skb. This typically doesn't use linked lists, but an
array of pages.
Your scheme would help old NICs that don't have this optimization.
Might be a worthwhile goal, although people often seem to be more interested
in modern hardware.
Another problem is that this setup typically requires the aggregate
packets to be from the same connection. Otherwise you will only
save a short trip into the stack until the linked packet would need
to be split again to pass to multiple sockets. With that the scheme
probably helps much less.
The hardware schemes typically use at least some kind of hash to
aggregate connections. You might need to implement something similar
too if it doesn't save enough time. Don't know if it would be very
efficient in software.
Or you could do this only if multiple packets belong to the same
single connection (basically with a one hit cache); but then it would
smell a bit like a benchmark hack.
-Andi
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 9:01 ` Evgeniy Polyakov
@ 2007-05-11 9:18 ` Krishna Kumar2
2007-05-11 9:32 ` Evgeniy Polyakov
0 siblings, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 9:18 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich
Hi Evgeniy,
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/11/2007 02:31:38 PM:
> On Fri, May 11, 2007 at 10:34:22AM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > Not combining packets, I am sending them out in the same sequence it was
> > queued. If the xmit failed, the driver's new API returns the skb which
> > failed to be sent. This skb and all other linked skbs are requeue'd in
> > the reverse order (fofi?) till the next time it is tried again. I see
> > that sometimes I can send tx_queue_len packets in one shot and all
> > succeeds. But the downside is that in failure case, the packets have to
> > be requeue'd.
>
> And what if you have thousand(s) of packets queued and first one has
> failed, requeing all the rest one-by-one is not a solution. If it is
> being done under heavy lock (with disabled irqs especially) it becomes a
> disaster.
If the first one failed for reasons other than those described below, it is
freed up and the next one attempted. There are three cases where we cannot
continue: no slots, device blocked, no lock.
Jamal had suggested getting information on the #available slots from the
driver. queue_stopped is checked before linking packets, so the only other
error case is not getting a lock, and this too is true only in the ~LLTX
case, which optionally could be a check to enable this linking. Also, there
could be limits to how many packets are queued. I tried tx_queue_len as well
as tx_queue_len/2, but there could be other options.
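For illustration, a sketch of how the core-side linking could look, using
the RFC's proposed skb_flink/skb_blink fields and a caller-supplied budget
(a hypothetical helper, not the posted patch):

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <net/sch_generic.h>

/*
 * Hypothetical core-side helper in the spirit of the RFC: pull up to
 * 'budget' packets off the qdisc and chain them through the proposed
 * skb_flink/skb_blink fields before one call into the driver.  The
 * field names follow the RFC and are not upstream; caller holds
 * dev->queue_lock.
 */
static struct sk_buff *link_skbs_for_xmit(struct net_device *dev, int budget)
{
	struct Qdisc *q = dev->qdisc;
	struct sk_buff *head = NULL, *tail = NULL, *skb;

	while (budget-- > 0 && !netif_queue_stopped(dev)) {
		skb = q->dequeue(q);
		if (!skb)
			break;

		skb->skb_flink = NULL;
		skb->skb_blink = tail;
		if (tail)
			tail->skb_flink = skb;
		else
			head = skb;
		tail = skb;
	}

	return head;	/* NULL if the queue was empty or stopped */
}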
> I thought of a bit different approach: driver maintains own queue (or
> has access to stack's one) and only uses lock to dequeue a packet. If
> transmit fails, nothing is requeued, but the same packet is tried until
> transmit is completed. If number of transmits failed in order, driver
> says it is broken and/or stops queue. Thus one can setup several
> descriptors in one go and do it without any locks. Stack calls driver's
> xmit function which essentially sets bit that new packets are available,
> but driver must have access to qdisk.
Bugs will increase if drivers are also allowed to access the qdisc, since
it is difficult code as it is. Right now it is easy to see how the qdisc is
manipulated.
thanks,
- KK
> If e1000 driver would not be so... so 'uncommon' compared to another
> small gige drivers I would try to cook up a patch, but I will not with
> e1000 :)
>
> --
> Evgeniy Polyakov
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 9:18 ` Krishna Kumar2
@ 2007-05-11 9:32 ` Evgeniy Polyakov
2007-05-11 9:52 ` Krishna Kumar2
0 siblings, 1 reply; 86+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 9:32 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich
On Fri, May 11, 2007 at 02:48:14PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > And what if you have thousand(s) of packets queued and first one has
> > failed, requeing all the rest one-by-one is not a solution. If it is
> > being done under heavy lock (with disabled irqs especially) it becomes a
> > disaster.
>
> > If the first one failed for reasons other than those described below, it
> > is freed up and the next one attempted. There are three cases where we
> > cannot continue: no slots, device blocked, no lock.
> >
> > Jamal had suggested getting information on the #available slots from the
> > driver. queue_stopped is checked before linking packets, so the only
> > other error case is not getting a lock, and this too is true only in the
> > ~LLTX case, which optionally could be a check to enable this linking.
> > Also, there could be limits to how many packets are queued. I tried
> > tx_queue_len as well as tx_queue_len/2, but there could be other options.
Why should the stack ever know what hardware is being used?
If you are going to break things, break it with a scary sound :)
Design a new interface, where the driver has access to the queue (say only
using one helper dequeue_first_packet()), the stack can only queue packets
to the tail, and the driver can also stop that by setting a bit (which has
likely been there since before we knew the word Linux).
That is all; the driver does not access the qdisc, it only gets the first
skb. The transmit function is completely lockless (until something is shared
with the receive path/interrupts, but that is the driver's private part),
the stack will not be changed at all (except for a new helper to dequeue the
first packet from the qdisc - actually it is already there, it just needs to
be exported), and the driver's xmit function sets a flag that there are
packets (or it can even be empty, since packets' presence can be detected
via the same qdisc interface). You read the qdisc len, set up several
descriptors in one go, dequeue the set of skbs and commit them. If something
has failed, it is not requeued, but resent (or even dropped after a fair
number of attempts).
No locks, no requeues? Seems simple imho.
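A rough sketch of that scheme, with a hypothetical dequeue_first_packet()
helper (it does not exist upstream) and made-up my_* ring helpers:

#include <linux/netdevice.h>
#include <linux/skbuff.h>

/*
 * Sketch only: the driver pulls packets itself.  The qdisc lock is held
 * only inside the dequeue helper; descriptor setup and the single
 * doorbell write run without any stack lock.
 */
static void my_pull_and_xmit(struct net_device *dev, struct my_tx_ring *ring)
{
	struct sk_buff *skb;

	while (my_tx_ring_space(ring) > 0) {
		skb = dequeue_first_packet(dev);  /* lock taken/dropped inside */
		if (!skb)
			break;

		/* no stack locks held here: map buffers, fill descriptors */
		my_fill_descriptor(ring, skb);
	}

	my_kick_dma(ring);	/* one PCI write for the whole batch */
}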
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 8:32 ` Krishna Kumar2
@ 2007-05-11 9:37 ` Andi Kleen
2007-05-11 8:50 ` Krishna Kumar2
2007-05-11 11:16 ` Roland Dreier
0 siblings, 2 replies; 86+ messages in thread
From: Andi Kleen @ 2007-05-11 9:37 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Andi Kleen, ak, netdev
On Fri, May 11, 2007 at 02:02:55PM +0530, Krishna Kumar2 wrote:
> Hi Andy,
>
> ak@suse.de wrote on 05/11/2007 02:35:05 PM:
>
> > You don't need that. You can just use the normal next/prev pointers.
> > In general it's a good idea to lower lock overhead etc., the VM has
> > used similar tricks very successfully in the past.
>
> Does this mean each skb should be for the same connection if next/prev
> is used ?
No; but see next paragraph.
But without it aggregation on RX is much less useful because the packets
cannot be kept together after socket demux which happens relatively early
in the packet processing path.
>
> > Another problem is that this setup typically requires the aggregate
> > packets to be from the same connection. Otherwise you will only
> > save a short trip into the stack until the linked packet would need
> > to be split again to pass to multiple sockets. With that the scheme
> > probably helps much less.
>
> I guess you meant this for receives only. On the send side, packets
Yes.
> for different sockets can be linked and sent together, right ?
Not implemented, and as DaveM pointed out such batching
has some problems. There is just TSO/GSO for single sockets.
>
> > Or you could do this only if multiple packets belong to the same
> > single connection (basically with a one hit cache); but then it would
>
> But for sends, why does same or different connection matter ? There is
> no aggregating of skbs.
I wasn't talking about sending.
But there actually is :- TSO/GSO.
-Andi
>
> Thanks,
>
> - KK
>
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-10 22:06 ` David Miller
@ 2007-05-11 9:46 ` Krishna Kumar2
0 siblings, 0 replies; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 9:46 UTC (permalink / raw)
To: David Miller
Cc: dlstevens, gaagaan, johnpol, netdev, netdev-owner, rick.jones2
Hi all,
Very preliminary testing with 20 procs on the E1000 driver gives me the
following result:
skbsz    Org BW     New BW     %        Org Demand   New Demand   %
32       315.98     347.48     9.97%    21090        20958        0.62%
96       833.67     882.92     5.91%    7939         9107         -14.71
But this is a test running for just 30 secs (just too short) and with
netperf2 (not netperf4, which I am going to use later). Using a single E1000
card cross-cabled on a 2.6.21.1 kernel on 2-CPU 2.8GHz xSeries systems.
I will have a more detailed report next week, especially once I run
netperf4. I am taking off for the next two days, so will reply later on this
thread.
Thanks,
- KK
David Miller <davem@davemloft.net> wrote on 05/11/2007 03:36:05 AM:
> From: Gagan Arneja <gaagaan@gmail.com>
> Date: Thu, 10 May 2007 14:50:19 -0700
>
> > David Miller wrote:
> >
> > > If you drop the TX lock, the number of free slots can change
> > > as another cpu gets in there queuing packets.
> >
> > Can you ever have more than one thread inside the driver? Isn't
> > xmit_lock held while we're in there?
>
> There are restrictions wrt. when the xmit_lock and the
> queue lock can be held at the same time.
>
> The devil is definitely in the details if you try to
> implement this. It definitely lends support to Eric D.'s
> assertion that this change will only add bugs, and doing
> something simple like prefetches is probably a safer
> route to go down.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 9:32 ` Evgeniy Polyakov
@ 2007-05-11 9:52 ` Krishna Kumar2
2007-05-11 9:56 ` Evgeniy Polyakov
0 siblings, 1 reply; 86+ messages in thread
From: Krishna Kumar2 @ 2007-05-11 9:52 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich
Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote on 05/11/2007 03:02:02 PM:
> On Fri, May 11, 2007 at 02:48:14PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > > And what if you have thousand(s) of packets queued and the first one
> > > has failed? Requeueing all the rest one-by-one is not a solution. If
> > > it is being done under a heavy lock (with disabled irqs especially) it
> > > becomes a disaster.
> >
> > If the first one failed for reasons other than those described below, it
> > is freed up and the next one attempted. There are three cases where we
> > cannot continue: no slots, device blocked, no lock.
> >
> Why stack should ever know what hardware is being used?
> If you are going to break things, break it with scary sound :)
Actually it could just be an API exported by drivers which implement this
new xmit API.
> Design new interface, where driver has access to the queue (say only
> using one helper dequeue_first_packet()), stack can only queue packets
> to the tail, driver can also stop that by setting a bit (which was there
> likely before we had knew word Linux).
>
> That is all, driver does not access qdisk, it only gets the first skb.
> Transmit function is completely lockless (until something is shared with
> receive path/interrupts, but it is private driver's part), stack will
> not be changed at all (except new helper to dequeue first packet from
> the qdisk, actually it is already there, it just needs to be exported),
> driver's xit function sets a flag that there are packets (or it can be
> even empty, since packet's presense can be detected via the same qdisk
> interface). You read qdisk len, setup several descriptors in one go,
> dequeue set of skbs and commit them. If something has failed, it is not
> requeued, but resent (or even dropped after fair amount of attempts).
> No locks, no requeues? Seems simple imho.
I will analyze this in more detail when I return (leaving just now, so got
really no time). The only issue that I see quickly is "No locks", since to
get things off the queue you need to hold the queue_lock.
Thanks,
- KK
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 9:52 ` Krishna Kumar2
@ 2007-05-11 9:56 ` Evgeniy Polyakov
2007-05-11 11:30 ` jamal
0 siblings, 1 reply; 86+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 9:56 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: Ian McDonald, netdev, Rick Jones, Vlad Yasevich
On Fri, May 11, 2007 at 03:22:13PM +0530, Krishna Kumar2 (krkumar2@in.ibm.com) wrote:
> > No locks, no requeues? Seems simple imho.
>
> I will analyze this in more detail when I return (leaving just now, so got
> really no time). The only issue that I see quickly is "No locks", since to
> get things off the queue you need to hold the queue_lock.
I meant no locks during processing of the packets (pci read/write, dma
setup and so on); of course the lock is needed to dequeue a packet, but only
for that operation.
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 9:37 ` Andi Kleen
2007-05-11 8:50 ` Krishna Kumar2
@ 2007-05-11 11:16 ` Roland Dreier
2007-05-13 6:00 ` Andi Kleen
1 sibling, 1 reply; 86+ messages in thread
From: Roland Dreier @ 2007-05-11 11:16 UTC (permalink / raw)
To: Andi Kleen; +Cc: Krishna Kumar2, Andi Kleen, netdev
> I wasn't talking about sending.
>
> But there actually is :- TSO/GSO.
As I said before, getting multiple packets in one call to xmit would
be nice for amortizing per-xmit overhead in IPoIB. So it would be
nice if the cases where the stack does GSO ended up passing all the
segments into the driver in one go.
- R.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 7:30 ` Krishna Kumar2
@ 2007-05-11 11:21 ` Roland Dreier
0 siblings, 0 replies; 86+ messages in thread
From: Roland Dreier @ 2007-05-11 11:21 UTC (permalink / raw)
To: Krishna Kumar2; +Cc: netdev
> Sounds a good idea. I had a question on error handling. What happens if
> the driver asynchronously returns an error for this WR (single WR
> containing multiple skbs) ? Does it mean all the skbs failed to be sent ?
> Requeuing all of them is a bad idea since it leads to infinitely doing the
> same thing, so should every skb be free'd ? Isn't that harsh (if first skb
> is bad and others are fine) and how do we avoid that ? Infact we don't
> know which skb was transmitted and which failed. If we requeue skbs,
> there will also be out-of-order skbs as the callback is asynchronous.
Each packet will still get a separate work request. However if one
work request fails, then the QP will transition to the error state and
all later work requests will fail too.
But the completion handling doesn't change at all -- IPoIB already
queues work requests with the hardware and then collects completions
asynchronously. If we get a failed completion then we free the skb
and bump tx errors -- there's nothing else really sensible to do.
However errors on UD QPs can never really happen except for driver
bugs or hardware failures, and for connected QPs (used in connected
mode), errors probably mean that a connection broke, so dropping
packets is quite sensible.
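An illustrative sketch of that completion handling (not the actual ipoib
code; tx_skbs[] and the stats pointer are hypothetical):

#include <rdma/ib_verbs.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>

/*
 * Sketch: poll the send CQ, free each skb, and bump a tx_errors counter
 * on a failed completion.  tx_skbs[] is a hypothetical ring mapping
 * wr_id back to the posted skb.
 */
static void drain_send_cq(struct ib_cq *cq, struct sk_buff **tx_skbs,
			  struct net_device_stats *stats)
{
	struct ib_wc wc;

	while (ib_poll_cq(cq, 1, &wc) > 0) {
		struct sk_buff *skb = tx_skbs[wc.wr_id];

		if (wc.status != IB_WC_SUCCESS)
			stats->tx_errors++;

		dev_kfree_skb_any(skb);
	}
}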
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 5:22 ` Krishna Kumar2
@ 2007-05-11 11:27 ` jamal
0 siblings, 0 replies; 86+ messages in thread
From: jamal @ 2007-05-11 11:27 UTC (permalink / raw)
To: Krishna Kumar2
Cc: Gagan Arneja, Evgeniy Polyakov, netdev, Rick Jones,
Sridhar Samudrala
On Fri, 2007-11-05 at 10:52 +0530, Krishna Kumar2 wrote:
> I didn't try to optimize the driver to take any real advantage, I coded it
> as simply as :
>
> top:
> next = skb->skb_flink;
> Original driver code here, or another option is to remove the locking
> and put it before the "top:" above, no other changes.
> skb = next;
> if (skb)
> goto top;
> This way, the diffs are minimal. However, driver experts might know of some
> tricks to optimize this, since most drivers have a hundred checks before
> sending one skb, and whether some optimization to avoid that for each skb
> is possible.
>
You will need to at some point; no big deal, in order to get things in
place first - the DMA case i talked about yesterday is a simple example,
i.e. you avoid kicking the DMA until you have queued everything in the ring
(as Gagan pointed out, kicking the DMA is a PCI IO write).
I am sure there are a gazillion others.
> This is a good idea, I will look at your slides to see what this exactly
> is.
I will try to post my patches updated to the latest net-2.6 kernel
before you come back on monday.
> However, even with this check, drivers will return error, eg trylock failed
> or, like the e1000 has - fifo_workaround failures. But I still think it
> will significantly reduce the failures.
Yes, this is what i referred to as the "many tunables", and it led to the
claim in my posting that you may have to re-write the whole tx path.
> Right now, I get a lot of requeue's.
requeues are a good sign ;->
cheers,
jamal
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 9:56 ` Evgeniy Polyakov
@ 2007-05-11 11:30 ` jamal
2007-05-11 11:53 ` Evgeniy Polyakov
0 siblings, 1 reply; 86+ messages in thread
From: jamal @ 2007-05-11 11:30 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich
On Fri, 2007-11-05 at 13:56 +0400, Evgeniy Polyakov wrote:
> I meant no locks during processing of the packets (pci read/write, dma
> setup and so on), of course it is needed to dequeue a packet, but only
> for that operation.
I dont think you can avoid the lock Evgeniy. You need to protect against
the tx ring having some other things happening to it from the napi poll
or netdev interupts (most of the drivers touch the tx ring on the napi
poll).
cheers,
jamal
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 11:30 ` jamal
@ 2007-05-11 11:53 ` Evgeniy Polyakov
2007-05-11 12:15 ` jamal
0 siblings, 1 reply; 86+ messages in thread
From: Evgeniy Polyakov @ 2007-05-11 11:53 UTC (permalink / raw)
To: jamal; +Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich
On Fri, May 11, 2007 at 07:30:02AM -0400, jamal (hadi@cyberus.ca) wrote:
> > I meant no locks during processing of the packets (pci read/write, dma
> > setup and so on), of course it is needed to dequeue a packet, but only
> > for that operation.
>
> I dont think you can avoid the lock Evgeniy. You need to protect against
> the tx ring having some other things happening to it from the napi poll
> or netdev interupts (most of the drivers touch the tx ring on the napi
> poll).
As I said, there might be another lock, if the interrupt handler is shared
or registers are accessed, but that is the driver's private business, which
has nothing in common with the stack itself. The stack just queues an skb,
which, after being detached from the tx queue by the driver, just does not
exist for the stack anymore. It can even be dequeued with rcu protection.
> cheers,
> jamal
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 11:53 ` Evgeniy Polyakov
@ 2007-05-11 12:15 ` jamal
0 siblings, 0 replies; 86+ messages in thread
From: jamal @ 2007-05-11 12:15 UTC (permalink / raw)
To: Evgeniy Polyakov
Cc: Krishna Kumar2, Ian McDonald, netdev, Rick Jones, Vlad Yasevich
On Fri, 2007-11-05 at 15:53 +0400, Evgeniy Polyakov wrote:
> As I said, there might be another lock, if the interrupt handler is shared
> or registers are accessed, but that is the driver's private business, which
> has nothing in common with the stack itself.
Ok, we are saying the same thing then. eg in e1000 that would be
tx_ring->lock or something along those lines.
> Stack just queues an skb,
> which, after detached from the tx queue by driver, just does not exist
> for stack anymore. It can be dequeed with rcu protection even.
in my case i had a dev->blist where the stack queued. You really want to
avoid requeueing altogether.
The trick i used is to have the driver tell you how many packets you can
pull off the qdisc.
cheers,
jamal
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-11 11:16 ` Roland Dreier
@ 2007-05-13 6:00 ` Andi Kleen
2007-05-15 16:25 ` Roland Dreier
0 siblings, 1 reply; 86+ messages in thread
From: Andi Kleen @ 2007-05-13 6:00 UTC (permalink / raw)
To: Roland Dreier; +Cc: Krishna Kumar2, netdev
On Friday 11 May 2007 13:16:44 Roland Dreier wrote:
> > I wasn't talking about sending.
> >
> > But there actually is :- TSO/GSO.
>
> As I said before, getting multiple packets in one call to xmit would
> be nice for amortizing per-xmit overhead in IPoIB. So it would be
> nice if the cases where the stack does GSO ended up passing all the
> segments into the driver in one go.
Well TCP does up to 64k -- that is what GSO is about.
If you want it for unrelated connections I'm not sure it would actually
work because if you have a deep TX queue already it is unlikely
the qdisc will buffer that much. Most packet injection should
come from the individual sockets which is obviously only a single connection.
-Andi
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-13 6:00 ` Andi Kleen
@ 2007-05-15 16:25 ` Roland Dreier
2007-05-15 20:18 ` David Miller
0 siblings, 1 reply; 86+ messages in thread
From: Roland Dreier @ 2007-05-15 16:25 UTC (permalink / raw)
To: Andi Kleen; +Cc: Krishna Kumar2, netdev
> > As I said before, getting multiple packets in one call to xmit would
> > be nice for amortizing per-xmit overhead in IPoIB. So it would be
> > nice if the cases where the stack does GSO ended up passing all the
> > segments into the driver in one go.
>
> Well TCP does upto 64k -- that is what GSO is about.
I see... the plan would be to add NETIF_F_GSO_SOFTWARE to the device
features and use skb_gso_segment() in the netdevice driver? (I just
studied GSO more carefully -- I hadn't realized that was possible)
I'll have to think about implementing that for IPoIB. One issue I see
is if I have, say, 4 free entries in my send queue and skb_gso_segment()
gives me back 5 packets to send. It's not clear I can recover at that
point -- I guess I have to check against gso_segs in the xmit routine
before actually doing the segmentation.
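A rough sketch of what driver-level segmentation could look like;
my_tx_room() and my_post_one() are hypothetical helpers, and the per-segment
descriptor headroom factor is just a conservative guess:

#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/err.h>

/*
 * Sketch: segment a GSO skb inside a driver's xmit routine and post
 * each resulting segment, checking ring space against gso_segs first.
 */
static int my_hard_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
	if (skb_is_gso(skb)) {
		struct sk_buff *segs, *next;

		/* make sure the whole burst fits before segmenting */
		if (my_tx_room(dev) < skb_shinfo(skb)->gso_segs * 3)
			return NETDEV_TX_BUSY;

		segs = skb_gso_segment(skb, 0);
		if (IS_ERR(segs)) {
			dev_kfree_skb_any(skb);
			return NETDEV_TX_OK;
		}

		do {
			next = segs->next;
			segs->next = NULL;
			my_post_one(dev, segs);	/* one send WR per segment */
			segs = next;
		} while (segs);

		dev_kfree_skb_any(skb);		/* original gso skb is done */
		return NETDEV_TX_OK;
	}

	return my_post_one(dev, skb);
}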
- R.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-15 16:25 ` Roland Dreier
@ 2007-05-15 20:18 ` David Miller
2007-05-15 20:52 ` Roland Dreier
0 siblings, 1 reply; 86+ messages in thread
From: David Miller @ 2007-05-15 20:18 UTC (permalink / raw)
To: rdreier; +Cc: ak, krkumar2, netdev
From: Roland Dreier <rdreier@cisco.com>
Date: Tue, 15 May 2007 09:25:28 -0700
> I'll have to think about implementing that for IPoIB. One issue I see
> is if I have, say, 4 free entries in my send queue and skb_gso_segment()
> gives me back 5 packets to send. It's not clear I can recover at that
> point -- I guess I have to check against gso_segs in the xmit routine
> before actually doing the segmentation.
I'd suggest adding a fudge factor to your free TX space, which
is advisable anyways so that when TX is woken up, more of
the transfer from queue to device can happen in a batch-like
fashion.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-15 20:18 ` David Miller
@ 2007-05-15 20:52 ` Roland Dreier
2007-05-15 21:48 ` Michael Chan
0 siblings, 1 reply; 86+ messages in thread
From: Roland Dreier @ 2007-05-15 20:52 UTC (permalink / raw)
To: David Miller; +Cc: ak, krkumar2, netdev
> > I'll have to think about implementing that for IPoIB. One issue I see
> > is if I have, say, 4 free entries in my send queue and skb_gso_segment()
> > gives me back 5 packets to send. It's not clear I can recover at that
> > point -- I guess I have to check against gso_segs in the xmit routine
> > before actually doing the segmentation.
>
> I'd suggest adding a fudge factor to your free TX space, which
> is advisable anyways so that when TX is woken up, more of
> the transfer from queue to device can happen in a batch-like
> fashion.
Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
queue is free, so we should get batching. However, I'm not sure that
I can count on a fudge factor ensuring that there's enough space to
handle everything skb_gso_segment() gives me -- is there any reliable
way to get an upper bound on how many segments a given gso skb will
use when it's segmented?
- R.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-15 21:48 ` Michael Chan
@ 2007-05-15 21:08 ` Roland Dreier
2007-05-15 22:05 ` Michael Chan
0 siblings, 1 reply; 86+ messages in thread
From: Roland Dreier @ 2007-05-15 21:08 UTC (permalink / raw)
To: Michael Chan; +Cc: David Miller, ak, krkumar2, netdev
> > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
> > queue is free, so we should get batching. However, I'm not sure that
> > I can count on a fudge factor ensuring that there's enough space to
> > handle everything skb_gso_segment() gives me -- is there any reliable
> > way to get an upper bound on how many segments a given gso skb will
> > use when it's segmented?
>
> Take a look at tg3.c. I use (gso_segs * 3) as the upper bound.
Thanks for the pointer... I noticed that code, but could you tell me
where the "* 3" comes from?
- R.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
[not found] <OF0CAD6D87.DBE62968-ON872572DC.0073646A-882572DC.0073BEC2@us.ibm.com>
@ 2007-05-15 21:17 ` Roland Dreier
[not found] ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
0 siblings, 1 reply; 86+ messages in thread
From: Roland Dreier @ 2007-05-15 21:17 UTC (permalink / raw)
To: Shirley Ma; +Cc: David Miller, ak, krkumar2, netdev, netdev-owner
> I thought that to enable GSO, the device driver actually does nothing
> other than enabling the flag. GSO moved TCP offloading to the interface
> layer before device xmit. It's a different idea from multiple packets per
> xmit. GSO still queues the packets one by one in the QDISC and xmits them
> one by one. The multiple-packets-per-xmit approach will xmit N packets
> when there are N packets in the QDISC queue. Please correct me if I am
> wrong.
Current use of GSO does segmentation just above the netdevice driver.
However there's nothing that prevents it from being used to push
segmentation into the driver -- as I described, just set NETIF_F_GSO_SOFTWARE
and do skb_gso_segment() within the driver.
As Michael Chan pointed out, tg3.c already does this to work around a
rare bug in the HW TSO implementation. For IPoIB we could do it all
the time to get multiple send work requests from one call to the xmit
method.
- R.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
[not found] ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
@ 2007-05-15 21:25 ` Roland Dreier
[not found] ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
2007-05-15 21:32 ` David Miller
1 sibling, 1 reply; 86+ messages in thread
From: Roland Dreier @ 2007-05-15 21:25 UTC (permalink / raw)
To: Shirley Ma; +Cc: ak, David Miller, krkumar2, netdev, netdev-owner
Shirley> I just wonder without TSO support in HW, how much
Shirley> benefit we can get by pushing GSO from interface layer to
Shirley> device layer besides we can do multiple packets in IPoIB.
The entire benefit comes from having multiple packets to queue in one
call to the xmit method.
- R.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-15 22:05 ` Michael Chan
@ 2007-05-15 21:28 ` David Miller
2007-05-18 7:04 ` Michael Chan
0 siblings, 1 reply; 86+ messages in thread
From: David Miller @ 2007-05-15 21:28 UTC (permalink / raw)
To: mchan; +Cc: rdreier, ak, krkumar2, netdev
From: "Michael Chan" <mchan@broadcom.com>
Date: Tue, 15 May 2007 15:05:28 -0700
> On Tue, 2007-05-15 at 14:08 -0700, Roland Dreier wrote:
> > > > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
> > > > queue is free, so we should get batching. However, I'm not sure that
> > > > I can count on a fudge factor ensuring that there's enough space to
> > > > handle everything skb_gso_segment() gives me -- is there any reliable
> > > > way to get an upper bound on how many segments a given gso skb will
> > > > use when it's segmented?
> > >
> > > Take a look at tg3.c. I use (gso_segs * 3) as the upper bound.
> >
> > Thanks for the pointer... I noticed that code, but could you tell me
> > where the "* 3" comes from?
> >
> For each gso_seg, there will be a header and the payload may span 2
> pages for 1500-byte packets. We always assume 1500-byte packets because
> the buggy chips do not support jumbo frames.
Correct.
I think there may be a case where you could see up to 4 segments.
If the user corks the TCP socket, does a sendmsg() (which puts
the data in the per-socket page) then does a sendfile() you'll
see something like:
skb->data IP, TCP, ethernet headers, etc.
page0 sendmsg() data
page1 sendfile
page2 sendfile
Ie. this can happen if the sendfile() part starts near the
end of a page, so it would get split even for a 1500 MTU
frame.
Even more complex variants are possible if the user does
tiny sendfile() requests to different pages within the file.
So in fact it can span up to N pages.
But there is an upper limit defined by the original GSO
frame, and that is controlled by MAX_SKB_FRAGS, so at most
you would see MAX_SKB_FRAGS plus some small constant.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
[not found] ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
2007-05-15 21:25 ` Roland Dreier
@ 2007-05-15 21:32 ` David Miller
[not found] ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
2007-05-21 7:56 ` Herbert Xu
1 sibling, 2 replies; 86+ messages in thread
From: David Miller @ 2007-05-15 21:32 UTC (permalink / raw)
To: xma; +Cc: rdreier, ak, krkumar2, netdev, netdev-owner
From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 14:22:57 -0700
> I just wonder without TSO support in HW, how much benefit we
> can get by pushing GSO from interface layer to device layer besides
> we can do multiple packets in IPoIB.
I bet the gain is non-trivial.
I'd say about half of the gain from TSO comes from only calling down
into the driver from TCP one time as opposed to N times. That's
the majority of the "CPU work" involved in TCP sending.
The rest of the gain comes from only transmitting the packet headers
once rather than N times, which conserves I/O bus bandwidth.
GSO will not help the case of lots of UDP applications sending
small packets, or something like that. An efficient qdisc-->driver
transfer during netif_wake_queue() could help solve some of that,
as is being discussed here.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
[not found] ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
@ 2007-05-15 21:38 ` David Miller
0 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2007-05-15 21:38 UTC (permalink / raw)
To: xma; +Cc: rdreier, ak, krkumar2, netdev, netdev-owner
From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 14:35:13 -0700
> I would think it might be better to implement dequeueN in the
> interface xmit, which will benefit multiple streams as well, not just
> large packets.
I think it is beneficial to have both, each of them helps with
different aspects of packet sending overhead.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-15 20:52 ` Roland Dreier
@ 2007-05-15 21:48 ` Michael Chan
2007-05-15 21:08 ` Roland Dreier
0 siblings, 1 reply; 86+ messages in thread
From: Michael Chan @ 2007-05-15 21:48 UTC (permalink / raw)
To: Roland Dreier; +Cc: David Miller, ak, krkumar2, netdev
On Tue, 2007-05-15 at 13:52 -0700, Roland Dreier wrote:
> Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
> queue is free, so we should get batching. However, I'm not sure that
> I can count on a fudge factor ensuring that there's enough space to
> handle everything skb_gso_segment() gives me -- is there any reliable
> way to get an upper bound on how many segments a given gso skb will
> use when it's segmented?
Take a look at tg3.c. I use (gso_segs * 3) as the upper bound.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-15 21:08 ` Roland Dreier
@ 2007-05-15 22:05 ` Michael Chan
2007-05-15 21:28 ` David Miller
0 siblings, 1 reply; 86+ messages in thread
From: Michael Chan @ 2007-05-15 22:05 UTC (permalink / raw)
To: Roland Dreier; +Cc: David Miller, ak, krkumar2, netdev
On Tue, 2007-05-15 at 14:08 -0700, Roland Dreier wrote:
> > > Well, IPoIB doesn't do netif_wake_queue() until half the device's TX
> > > queue is free, so we should get batching. However, I'm not sure that
> > > I can count on a fudge factor ensuring that there's enough space to
> > > handle everything skb_gso_segment() gives me -- is there any reliable
> > > way to get an upper bound on how many segments a given gso skb will
> > > use when it's segmented?
> >
> > Take a look at tg3.c. I use (gso_segs * 3) as the upper bound.
>
> Thanks for the pointer... I noticed that code, but could you tell me
> where the "* 3" comes from?
>
For each gso_seg, there will be a header and the payload may span 2
pages for 1500-byte packets. We always assume 1500-byte packets because
the buggy chips do not support jumbo frames.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
[not found] ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
@ 2007-05-15 23:36 ` David Miller
0 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2007-05-15 23:36 UTC (permalink / raw)
To: xma; +Cc: ak, krkumar2, netdev, netdev-owner, rdreier
From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 15 May 2007 16:33:22 -0700
> That's interesting. So a generic LRO in the interface layer will benefit
> performance more, right? The receive path going through TCP N times is
> more expensive than sending, I think.
If you look at some of the drivers doing LRO, the bulk of the
implementation is in software, so yes :-)
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-15 21:28 ` David Miller
@ 2007-05-18 7:04 ` Michael Chan
0 siblings, 0 replies; 86+ messages in thread
From: Michael Chan @ 2007-05-18 7:04 UTC (permalink / raw)
To: David Miller; +Cc: rdreier, ak, krkumar2, netdev
David Miller wrote:
> From: "Michael Chan" <mchan@broadcom.com>
> Date: Tue, 15 May 2007 15:05:28 -0700
>
> > For each gso_seg, there will be a header and the payload may span 2
> > pages for 1500-byte packets. We always assume 1500-byte packets
> > because the buggy chips do not support jumbo frames.
>
> Correct.
>
> I think there may be a case where you could see up to 4 segments.
> If the user corks the TCP socket, does a sendmsg() (which puts
> the data in the per-socket page) then does a sendfile() you'll
> see something like:
>
> skb->data IP, TCP, ethernet headers, etc.
> page0 sendmsg() data
> page1 sendfile
> page2 sendfile
>
> Ie. this can happen if the sendfile() part starts near the
> end of a page, so it would get split even for a 1500 MTU
> frame.
>
> Even more complex variants are possible if the user does
> tiny sendfile() requests to different pages within the file.
>
> So in fact it can span up to N pages.
>
> But there is an upper limit defined by the original GSO
> frame, and that is controlled by MAX_SKB_FRAGS, so at most
> you would see MAX_SKB_FRAGS plus some small constant.
>
>
I see. Is the following a better estimate?
(gso_segs * 2) + skb_shinfo(skb)->nr_frags
For each gso_seg, you get at least a header segment and a
segment from a page for 1500-byte packets. You can get more
segments if packets cross page boundaries or if there are tiny
sendfile pages or some sendmsg pages mixed in. But these
additional segments will never be more than nr_frags.
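As a small illustration (not from any posted patch), that estimate could be
coded as:

#include <linux/skbuff.h>

/*
 * Illustrative helper: upper-bound estimate of TX descriptors needed
 * for a GSO skb, per the (gso_segs * 2) + nr_frags reasoning above.
 */
static inline unsigned int gso_desc_estimate(struct sk_buff *skb)
{
	return skb_shinfo(skb)->gso_segs * 2 + skb_shinfo(skb)->nr_frags;
}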
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-15 21:32 ` David Miller
[not found] ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
@ 2007-05-21 7:56 ` Herbert Xu
[not found] ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
1 sibling, 1 reply; 86+ messages in thread
From: Herbert Xu @ 2007-05-21 7:56 UTC (permalink / raw)
To: David Miller; +Cc: xma, rdreier, ak, krkumar2, netdev, netdev-owner
David Miller <davem@davemloft.net> wrote:
> From: Shirley Ma <xma@us.ibm.com>
> Date: Tue, 15 May 2007 14:22:57 -0700
>
>> I just wonder without TSO support in HW, how much benefit we
>> can get by pushing GSO from interface layer to device layer besides
>> we can do multiple packets in IPoIB.
>
> I bet the gain is non-trivial.
Yep, for any NIC that supports SG but not TSO then software GSO will
be a big win. When the NIC doesn't support SG then the win is mostly
offset by the need to copy the packet again.
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
[not found] ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
@ 2007-05-22 22:36 ` David Miller
[not found] ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
2007-05-22 23:12 ` Herbert Xu
0 siblings, 2 replies; 86+ messages in thread
From: David Miller @ 2007-05-22 22:36 UTC (permalink / raw)
To: xma; +Cc: herbert, ak, krkumar2, netdev, netdev-owner, rdreier
From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 22 May 2007 15:22:35 -0700
> > Yep, for any NIC that supports SG but not TSO then software GSO will
> > be a big win. When the NIC doesn't support SG then the win is mostly
> > offset by the need to copy the packet again.
> >
> > Cheers,
> > --
>
> We could avoid the packet copy by allocating a number of 64K/MTU skb
> buffers instead of one big 64K buffer. Is that a possible approach?
I think you misunderstand the problem.
If the device does not support scatter gather, it is impossible to
avoid the copy.
SKB's from TSO are composed of discontiguous page chunks represented
by the skb_shared_info() array. These can come either from sendmsg()
user data (which the kernel fills into a per-socket data page which is
reallocated once filled), or from sendfile() page cache pages.
Therefore, there has to be a copy somewhere to be able to give
that packet to a non-scatter-gather device.
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
[not found] ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
@ 2007-05-22 23:04 ` David Miller
0 siblings, 0 replies; 86+ messages in thread
From: David Miller @ 2007-05-22 23:04 UTC (permalink / raw)
To: xma; +Cc: ak, herbert, krkumar2, netdev, netdev-owner, rdreier
From: Shirley Ma <xma@us.ibm.com>
Date: Tue, 22 May 2007 15:58:05 -0700
> Sorry for the confusion. I am thinking of avoiding the copy in skb_segment()
> for GSO. One way could be for tcp_sendmsg() to allocate small discontiguous
> buffers (each equal to the MTU) instead of allocating pages.
The SKB splitting algorithm in TCP's transmit engine depends upon the
skb_shared_info() array being splittable at arbitrary points with only
page counts to manage. This is the only way I found to make SKB
splitting at transmit time extremely inexpensive.
SACK block processing needs to perform these kinds of splits
as well, so it really has to be cheap.
The invariant is that every TCP TSO packet must have its header
at skb->data and all of its data in the paged skb_shared_info().
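A stand-alone sketch of that invariant, with made-up types: because the
payload lives entirely in a fragment array, a split at any byte offset only
moves descriptors and takes one extra page reference, with no payload copy:

    #include <stddef.h>

    /* Made-up stand-ins: a refcounted page and a fragment descriptor. */
    struct page_stub { int refcount; };

    struct frag {
            struct page_stub *page;
            size_t            offset;
            size_t            len;
    };

    /*
     * Split a fragment array so the first 'split' payload bytes stay in
     * 'head' and the rest go to 'tail'.  Only descriptors move; a fragment
     * straddling the split becomes two descriptors sharing the same page,
     * so the only bookkeeping is one extra page reference.
     * Sets *head_n and returns the number of tail fragments.
     */
    int split_frags(const struct frag *src, int src_n, size_t split,
                    struct frag *head, int *head_n, struct frag *tail)
    {
            int i, hn = 0, tn = 0;

            for (i = 0; i < src_n; i++) {
                    if (split >= src[i].len) {
                            head[hn++] = src[i];      /* wholly before the split */
                            split -= src[i].len;
                    } else if (split == 0) {
                            tail[tn++] = src[i];      /* wholly after the split */
                    } else {
                            src[i].page->refcount++;  /* both halves share the page */
                            head[hn] = src[i];
                            head[hn].len = split;
                            hn++;
                            tail[tn] = src[i];
                            tail[tn].offset += split;
                            tail[tn].len -= split;
                            tn++;
                            split = 0;
                    }
            }
            *head_n = hn;
            return tn;
    }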
^ permalink raw reply [flat|nested] 86+ messages in thread
* Re: [RFC] New driver API to speed up small packets xmits
2007-05-22 22:36 ` David Miller
[not found] ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
@ 2007-05-22 23:12 ` Herbert Xu
1 sibling, 0 replies; 86+ messages in thread
From: Herbert Xu @ 2007-05-22 23:12 UTC (permalink / raw)
To: David Miller; +Cc: xma, ak, krkumar2, netdev, netdev-owner, rdreier
On Tue, May 22, 2007 at 03:36:36PM -0700, David Miller wrote:
>
> > > Yep, for any NIC that supports SG but not TSO then software GSO will
> > > be a big win. When the NIC doesn't support SG then the win is mostly
> > > offset by the need to copy the packet again.
...
> SKB's from TSO are composed of discontiguous page chunks represented
> by the skb_shared_info() array. These can come either from sendmsg()
> user data (which the kernel fills into a per-socket data page which is
> reallocated once filled), or from sendfile() page cache pages.
Yes, sendmsg() is the case where it's almost even whether GSO is
turned on or off, because GSO does an extra copy which offsets the
win from the reduced per-packet cost.
For sendfile() it's still a win, though, since we have to copy the
data anyway and doing that copy in GSO avoids the per-packet cost.
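A rough tally of that reasoning, assuming a NIC with neither SG nor TSO and
N MTU-sized packets per GSO superpacket (a back-of-the-envelope sketch, not
measured data):

    sendmsg(), GSO off : data copied once (user -> skbs), N stack traversals
    sendmsg(), GSO on  : data copied twice (user -> socket pages, then again
                         in skb_segment()), 1 traversal -> roughly a wash
    sendfile(), GSO off: data copied once (page cache -> skbs), N traversals
    sendfile(), GSO on : data copied once (in skb_segment()), 1 traversal
                         -> the copy was needed anyway, so the saved
                            traversals are a clear win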
Cheers,
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 86+ messages in thread
Thread overview: 86+ messages
2007-05-10 14:53 [RFC] New driver API to speed up small packets xmits Krishna Kumar
2007-05-10 15:08 ` Evgeniy Polyakov
2007-05-10 15:22 ` Krishna Kumar2
2007-05-10 15:48 ` Evgeniy Polyakov
2007-05-10 16:08 ` jamal
2007-05-10 17:19 ` Rick Jones
2007-05-10 18:07 ` Sridhar Samudrala
2007-05-10 19:43 ` Gagan Arneja
2007-05-10 20:11 ` jamal
2007-05-10 20:14 ` Rick Jones
2007-05-10 20:15 ` jamal
2007-05-10 20:15 ` Gagan Arneja
2007-05-10 20:21 ` jamal
2007-05-10 20:25 ` Gagan Arneja
2007-05-11 5:22 ` Krishna Kumar2
2007-05-11 11:27 ` jamal
2007-05-10 20:37 ` David Miller
2007-05-10 20:40 ` Gagan Arneja
2007-05-10 20:57 ` David Miller
2007-05-11 6:07 ` Krishna Kumar2
2007-05-11 5:21 ` Krishna Kumar2
2007-05-11 5:20 ` Krishna Kumar2
2007-05-11 5:35 ` Gagan Arneja
2007-05-11 5:43 ` Krishna Kumar2
2007-05-11 5:57 ` Gagan Arneja
2007-05-11 6:06 ` Krishna Kumar2
2007-05-11 6:29 ` Gagan Arneja
2007-05-11 6:52 ` Krishna Kumar2
2007-05-10 18:13 ` Vlad Yasevich
2007-05-10 18:20 ` Rick Jones
2007-05-10 18:32 ` Vlad Yasevich
2007-05-10 18:40 ` Rick Jones
2007-05-10 18:59 ` Ian McDonald
2007-05-10 19:21 ` Vlad Yasevich
2007-05-10 19:26 ` Ian McDonald
2007-05-10 20:32 ` David Miller
2007-05-10 20:49 ` Rick Jones
2007-05-10 21:02 ` David Miller
2007-05-10 21:14 ` Gagan Arneja
2007-05-11 2:28 ` Stephen Hemminger
2007-05-11 5:01 ` Gagan Arneja
2007-05-11 5:04 ` Krishna Kumar2
2007-05-11 9:01 ` Evgeniy Polyakov
2007-05-11 9:18 ` Krishna Kumar2
2007-05-11 9:32 ` Evgeniy Polyakov
2007-05-11 9:52 ` Krishna Kumar2
2007-05-11 9:56 ` Evgeniy Polyakov
2007-05-11 11:30 ` jamal
2007-05-11 11:53 ` Evgeniy Polyakov
2007-05-11 12:15 ` jamal
2007-05-10 21:27 ` David Stevens
2007-05-10 21:40 ` David Miller
2007-05-10 21:50 ` Gagan Arneja
2007-05-10 22:06 ` David Miller
2007-05-11 9:46 ` Krishna Kumar2
2007-05-10 21:41 ` Eric Dumazet
2007-05-10 22:09 ` Rick Jones
2007-05-10 21:45 ` Rick Jones
2007-05-10 21:53 ` David Stevens
2007-05-10 20:21 ` Roland Dreier
2007-05-11 7:30 ` Krishna Kumar2
2007-05-11 11:21 ` Roland Dreier
2007-05-11 9:05 ` Andi Kleen
2007-05-11 8:32 ` Krishna Kumar2
2007-05-11 9:37 ` Andi Kleen
2007-05-11 8:50 ` Krishna Kumar2
2007-05-11 11:16 ` Roland Dreier
2007-05-13 6:00 ` Andi Kleen
2007-05-15 16:25 ` Roland Dreier
2007-05-15 20:18 ` David Miller
2007-05-15 20:52 ` Roland Dreier
2007-05-15 21:48 ` Michael Chan
2007-05-15 21:08 ` Roland Dreier
2007-05-15 22:05 ` Michael Chan
2007-05-15 21:28 ` David Miller
2007-05-18 7:04 ` Michael Chan
-- strict thread matches above, loose matches on Subject: below --
2007-05-11 7:14 Krishna Kumar2
[not found] <OF0CAD6D87.DBE62968-ON872572DC.0073646A-882572DC.0073BEC2@us.ibm.com>
2007-05-15 21:17 ` Roland Dreier
[not found] ` <OFF5654BB8.74EC8DCB-ON872572DC.00752079-882572DC.00756B23@us.ibm.com>
2007-05-15 21:25 ` Roland Dreier
[not found] ` <OF21D475A2.5E5C88DE-ON872572DC.00763DE4-882572DC.00768A7E@us.ibm.com>
2007-05-15 21:38 ` David Miller
2007-05-15 21:32 ` David Miller
[not found] ` <OF6757F56D.EE5984FD-ON872572DC.0081026C-882572DC.00814B8F@us.ibm.com>
2007-05-15 23:36 ` David Miller
2007-05-21 7:56 ` Herbert Xu
[not found] ` <OF9ABCD08D.2CD1B193-ON872572E3.007A6FC1-882572E3.007ACE1A@us.ibm.com>
2007-05-22 22:36 ` David Miller
[not found] ` <OFCF3EB7F8.9740C0C7-ON872572E3.007DADF6-882572E3.007E0E7B@us.ibm.com>
2007-05-22 23:04 ` David Miller
2007-05-22 23:12 ` Herbert Xu