* Zero copy transmit
@ 2003-04-29 18:44 Steve Modica
2003-04-29 19:20 ` Andi Kleen
0 siblings, 1 reply; 12+ messages in thread
From: Steve Modica @ 2003-04-29 18:44 UTC (permalink / raw)
To: netdev
Hi All,
We are doing some experimenting with Altix systems (Itanium II with
NUMA) and we're taking a big hit from __copy_user traffic. We would
like to modify the write, writev, send and sendto interfaces such that
we can avoid the __copy_user call by marking pages copy-on-write (COW)
and handing them off to be transmitted. Since this requires TLB
updates, we would only implement this code on platforms that defined
themselves as capable of fast TLB updates.
There was a lot of concern expressed on the l-k alias about COW being
difficult to support because of the TLB update issues, but NUMA systems
have to be especially quick at TLB updates, so it's something we want to
take advantage of.
I'm looking for comments and suggestions as to how we could do this
without impacting other system types.
Best Regards!
Steve
--
Steve Modica
Manager - Networking Drivers Group
"Give a man a fish, and he will eat for a day, hit him with a fish and
he leaves you alone" - me
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 18:44 Zero copy transmit Steve Modica
@ 2003-04-29 19:20 ` Andi Kleen
2003-04-29 19:33 ` Robin Holt
2003-04-29 19:41 ` Steve Modica
0 siblings, 2 replies; 12+ messages in thread
From: Andi Kleen @ 2003-04-29 19:20 UTC (permalink / raw)
To: Steve Modica; +Cc: netdev
On Tue, Apr 29, 2003 at 01:44:15PM -0500, Steve Modica wrote:
> We are doing some experimenting with Altix systems (Itanium II with
> NUMA) and we're taking a big hit from __copy_user traffic. We would
> like to modify the write, writev, send and sendto interfaces such that
> we can avoid the __copy_user call by marking pages copy-on-write (COW)
> and handing them off to be transmitted. Since this requires TLB
> updates, we would only implement this code on platforms that defined
> themselves as capable of fast TLB updates.
A much better way would be to use the POSIX aio interfaces. They support
zero copy transmit, but don't require COW. Instead they just tell
the user process when it is safe to touch the buffer again.
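A minimal sketch of the interface contract described above, assuming the standard POSIX calls aio_write(), aio_suspend(), aio_error() and aio_return(). The helper name aio_write_and_wait is hypothetical and not from this thread. It shows only how a caller learns when the buffer may be touched again; whether any copy is actually avoided depends on the implementation underneath (glibc, for example, services these calls with user-space threads).

```c
#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical helper: queue one asynchronous write, then block until it
 * completes. The caller must not modify `buf` between aio_write() and the
 * point where aio_error() stops returning EINPROGRESS.
 * Returns the number of bytes written, or -1 with errno set. */
ssize_t aio_write_and_wait(int fd, const void *buf, size_t len, off_t offset)
{
    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)buf;   /* the API takes a non-const pointer */
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    if (aio_write(&cb) != 0)
        return -1;              /* request could not be queued */

    /* A real application would do other work here and poll or take a
     * completion notification; this sketch simply sleeps until done. */
    const struct aiocb *const list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);   /* may return early on a signal */

    int err = aio_error(&cb);
    ssize_t n = aio_return(&cb);      /* always collect the final status */
    if (err != 0) {
        errno = err;
        return -1;
    }
    return n;                         /* buffer is safe to reuse from here */
}
```

Waiting immediately defeats the point of asynchronous I/O; it is done here only to make the "safe to touch the buffer" moment explicit. On older glibc versions this needs to be linked with -lrt.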
There was already some code to do aio TCP sending, but it didn't
do zero copy and was not merged for some reason.
Also you can already do zero copy transmit using sendfile()
Linux basically has all the infrastructure you need for this already;
just the high level interface to the AIO system calls is still missing.
-Andi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 19:20 ` Andi Kleen
@ 2003-04-29 19:33 ` Robin Holt
2003-04-29 19:41 ` Andi Kleen
2003-04-29 19:41 ` Steve Modica
1 sibling, 1 reply; 12+ messages in thread
From: Robin Holt @ 2003-04-29 19:33 UTC (permalink / raw)
To: Andi Kleen; +Cc: Steve Modica, netdev
On Tue, Apr 29, 2003 at 09:20:41PM +0200, Andi Kleen wrote:
> A much better way would be to use the POSIX aio interfaces. They support
> zero copy transmit, but don't require COW. Instead they just tell
> the user process when it is safe to touch the buffer again.
>
> There was already some code to do aio TCP sending, but it didn't
> do zero copy and was not merged for some reason.
>
> Also you can already do zero copy transmit using sendfile()
Users would need to rewrite all their apps to use either the async or
sendfile method. That assumption seems a little broad.
I don't disagree that implementing the remainder of the AIO system
calls would also be good, but is there something wrong with getting
write et al. to work with zero copy?
Robin Holt
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 19:33 ` Robin Holt
@ 2003-04-29 19:41 ` Andi Kleen
0 siblings, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2003-04-29 19:41 UTC (permalink / raw)
To: Robin Holt; +Cc: Andi Kleen, Steve Modica, netdev
On Tue, Apr 29, 2003 at 02:33:36PM -0500, Robin Holt wrote:
> On Tue, Apr 29, 2003 at 09:20:41PM +0200, Andi Kleen wrote:
> > A much better way would be to use the POSIX aio interfaces. They support
> > zero copy transmit, but don't require COW. Instead they just tell
> > the user process when it is safe to touch the buffer again.
> >
> > There was already some code to do aio TCP sending, but it didn't
> > do zero copy and was not merged for some reason.
> >
> > Also you can already do zero copy transmit using sendfile()
>
> Users would need to rewrite all their apps to use either the async or
> sendfile method. That assumption seems a little broad.
In my experience only a few programs are performance critical in this
way; and their developers/users usually do not mind changing their programs
a bit to get the best performance. In fact they are always happy when
they get such knobs from you ;)
>
> I don't disagree that implementing the remainder of the AIO system
> calls would also be good, but is there something wrong with getting
> write et al. to work with zero copy?
You have to ask DaveM/Alexey - they had it, but rejected it, apparently
also based on some bad experiences on other operating systems.
I can see their point - e.g. in the worst case each write could
trigger two TLB flush IPIs to all CPUs in the system (one to COW it
and another to un COW it). You can copy a lot of data in the time
it takes to process all of them, especially on a big machine.
-Andi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 19:20 ` Andi Kleen
2003-04-29 19:33 ` Robin Holt
@ 2003-04-29 19:41 ` Steve Modica
2003-04-29 19:59 ` Andi Kleen
2003-04-29 20:17 ` Christoph Hellwig
1 sibling, 2 replies; 12+ messages in thread
From: Steve Modica @ 2003-04-29 19:41 UTC (permalink / raw)
To: netdev
Andi Kleen wrote:
> On Tue, Apr 29, 2003 at 01:44:15PM -0500, Steve Modica wrote:
>
>>We are doing some experimenting with Altix systems (Itanium II with
>>NUMA) and we're taking a big hit from __copy_user traffic. We would
>>like to modify the write, writev, send and sendto interfaces such that
>>we can avoid the __copy_user call by marking pages copy-on-write (COW)
>>and handing them off to be transmitted. Since this requires TLB
>>updates, we would only implement this code on platforms that defined
>>themselves as capable of fast TLB updates.
>
>
> A much better way would be to use the POSIX aio interfaces. They support
> zero copy transmit, but don't require COW. Instead they just tell
> the user process when it is safe to touch the buffer again.
>
> There was already some code to do aio TCP sending, but it didn't
> do zero copy and was not merged for some reason.
>
> Also you can already do zero copy transmit using sendfile()
>
> Linux basically has all the infrastructure you need for this already;
> just the high level interface to the AIO system calls is still missing.
>
> -Andi
Hi Andi,
We are aware of sendfile() and used it for the purposes of proving that
zero copy would make a big difference for us.
At issue is really application capture and customer adoption. There are
tons of apps and lots of engineers that know socket operations and
write/writev. Asking all ISVs to recode for linux would leave them with
two separate APIs to deal with. They would have send/sendto or
write/writev on Solaris, HPUX and whatever else, and linux would have
sendfile.
We really want to do this in such a way that it doesn't create a huge
footprint (and we think we can) and we want to make sure we don't impact
systems that can't take advantage of fast TLB updates.
Steve
--
Steve Modica
work: 651-683-3224
mobile: 651-261-3201
Manager - Networking Drivers Group
"Give a man a fish, and he will eat for a day, hit him with a fish and
he leaves you alone" - me
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 19:41 ` Steve Modica
@ 2003-04-29 19:59 ` Andi Kleen
2003-04-29 20:09 ` Steve Modica
2003-04-29 20:17 ` Christoph Hellwig
1 sibling, 1 reply; 12+ messages in thread
From: Andi Kleen @ 2003-04-29 19:59 UTC (permalink / raw)
To: Steve Modica; +Cc: netdev
> At issue is really application capture and customer adoption. There are
> tons of apps and lots of engineers that know socket operations and
> write/writev. Asking all ISVs to recode for linux would leave them with
> two separate APIs to deal with. They would have send/sendto or
> write/writev on Solaris, HPUX and whatever else, and linux would have
> sendfile.
aio_write / lio_listio exists on Solaris and HP/UX too.
(and even Windows; their completion port interfaces are very similar)
>
> We really want to do this in such a way that it doesn't create a huge
> footprint (and we think we can) and we want to make sure we don't impact
> systems that can't take advantage of fast TLB updates.
So how do you avoid the two TLB flush IPIs to all CPUs that have the current mm
mapped ?
-Andi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 19:59 ` Andi Kleen
@ 2003-04-29 20:09 ` Steve Modica
2003-04-29 20:39 ` Andi Kleen
0 siblings, 1 reply; 12+ messages in thread
From: Steve Modica @ 2003-04-29 20:09 UTC (permalink / raw)
To: netdev
Andi Kleen wrote:
>>At issue is really application capture and customer adoption. There are
>>tons of apps and lots of engineers that know socket operations and
>>write/writev. Asking all ISVs to recode for linux would leave them with
>>two separate APIs to deal with. They would have send/sendto or
>>write/writev on Solaris, HPUX and whatever else, and linux would have
>>sendfile.
>
>
> aio_write / lio_listio exists on Solaris and HP/UX too.
>
> (and even Windows; their completion port interfaces are very similar)
Right.. although some might say that aio_write is used a lot less often
than write or send.
It's hard to convince thousands of application writers to revisit stuff
like this. It's a lot easier to bring the hardware feature in to the
APIs that people just commonly use.
>
>
>>We really want to do this in such a way that it doesn't create a huge
>>footprint (and we think we can) and we want to make sure we don't impact
>>systems that can't take advantage of fast TLB updates.
>
>
> So how do you avoid the two TLB flush IPIs to all CPUs that have the current mm
> mapped ?
>
> -Andi
I could speculate about how we might avoid that, or I could speculate
that even with those operations, it would still be faster, but I'd
rather just demonstrate it with a patch.
Don't get me wrong, we would certainly drop any notions of this if we
found that it was slower and I will be glad to post any results. The
goal is to take advantage of the hardware to make things faster.
Going back to your example above, don't solaris and hpux also do COW for
write and send? (I don't have their sources) If so, why would they do
it if it's slower?
Steve
--
Steve Modica
work: 651-683-3224
mobile: 651-261-3201
Manager - Networking Drivers Group
"Give a man a fish, and he will eat for a day, hit him with a fish and
he leaves you alone" - me
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 20:09 ` Steve Modica
@ 2003-04-29 20:39 ` Andi Kleen
2003-04-30 1:41 ` Michael Richardson
2003-04-30 15:05 ` Robin Holt
0 siblings, 2 replies; 12+ messages in thread
From: Andi Kleen @ 2003-04-29 20:39 UTC (permalink / raw)
To: netdev; +Cc: modica
> Don't get me wrong, we would certainly drop any notions of this if we
> found that it was slower and I will be glad to post any results. The
> goal is to take advantage of the hardware to make things faster.
You have no hardware to make the remote TLB flushes fast ;)
I'm sure you can show it being an advantage with a single threaded process.
But when you run it on a multithreaded application just with two threads
it may look very different.
> Going back to your example above, don't solaris and hpux also do COW for
> write and send? (I don't have their sources) If so, why would they do
> it if it's slower?
I don't know if they do. The only Unix I'm aware of that has zero copy
sendmsg() is NetBSD and their focus does not seem to be SMP scalability.
I observed the problem recently just with swapping a big (10GB) process
whose working set slightly exceeded the available memory.
kswapd was running on one CPU; the process on another. kswapd
was aging the pages of the memory hog all the time, which requires an unmapping
and a remote TLB flush in the process' page tables. The result
was that two CPUs were 100% tied up in the kernel, just spinning on the
page_table_lock of the mm and processing TLB IPIs (spinlock was ~50%; IPI
overhead 40% or so). I predict that your proposed TLB flushing write will
cause the same problem with lots of writes. It's more or less the same thing,
except that kswapd has a builtin rate limit and runs only on a single CPU
and write() has not.
Also last time I checked most Linux ports still used a single global
spinlock for the TLB flush IPI. You would add a nice new hot lock
to the network path.
-Andi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 20:39 ` Andi Kleen
@ 2003-04-30 1:41 ` Michael Richardson
2003-04-30 15:05 ` Robin Holt
1 sibling, 0 replies; 12+ messages in thread
From: Michael Richardson @ 2003-04-30 1:41 UTC (permalink / raw)
To: netdev
>>>>> "Andi" == Andi Kleen <ak@suse.de> writes:
Andi> I don't know if they do. The only Unix I'm aware of that has zero copy
Andi> sendmsg() is NetBSD and their focus does not seem to be SMP scalability.
NetBSD 1.6 is the first release with both zero-copy and SMP.
(1.6.1 came out two weeks ago)
So, the focus was on making it work.
I would not read any more intent to it yet. Jason Thorpe is pretty smart.
] ON HUMILITY: to err is human. To moo, bovine. | firewalls [
] Michael Richardson, Sandelman Software Works, Ottawa, ON |net architect[
] mcr@sandelman.ottawa.on.ca http://www.sandelman.ottawa.on.ca/ |device driver[
] panic("Just another Debian GNU/Linux using, kernel hacking, security guy"); [
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 20:39 ` Andi Kleen
2003-04-30 1:41 ` Michael Richardson
@ 2003-04-30 15:05 ` Robin Holt
2003-04-30 15:29 ` Andi Kleen
1 sibling, 1 reply; 12+ messages in thread
From: Robin Holt @ 2003-04-30 15:05 UTC (permalink / raw)
To: Andi Kleen; +Cc: netdev, modica
On Tue, Apr 29, 2003 at 10:39:46PM +0200, Andi Kleen wrote:
> > Don't get me wrong, we would certainly drop any notions of this if we
> > found that it was slower and I will be glad to post any results. The
> > goal is to take advantage of the hardware to make things faster.
>
> You have no hardware to make the remote TLB flushes fast ;)
>
> I'm sure you can show it being an advantage with a single threaded process.
> But when you run it on a multithreaded application just with two threads
> it may look very different.
>
Last time I checked, the IA64 processor provides a ptc.g instruction for
exactly this. The only hit we take from using it is Intel limits it to
a single outstanding ptc.g pending machine wide. This is accomplished with
a global spinlock. I would love to convince Intel to change this instruction,
but that probably will not happen any time soon.
I will concede that the ptc.g instruction takes a considerable period of
time on our 64 processor machines, but that comes out to a lot of local
TLB coherence domains that need to be updated.
I believe there is a similar instruction for x86. Could someone verify
this?
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-30 15:05 ` Robin Holt
@ 2003-04-30 15:29 ` Andi Kleen
0 siblings, 0 replies; 12+ messages in thread
From: Andi Kleen @ 2003-04-30 15:29 UTC (permalink / raw)
To: Robin Holt; +Cc: Andi Kleen, netdev, modica
> Last time I checked, the IA64 processor provides a ptc.g instruction for
> exactly this. The only hit we take from using it is Intel limits it to
> a single outstanding ptc.g pending machine wide. This is accomplished with
> a global spinlock. I would love to convince Intel to change this instruction,
> but that probably will not happen any time soon.
IA64 Linux doesn't use it at least. The 2.5 flush_tlb_mm calls smp_flush_tlb_mm
which ends up doing IPIs. Same for flush_tlb_page - calls flush_tlb_range
which calls sn2_global_tlb_purge, which does something complicated
that also looks like a global IPI. It also takes a global spinlock.
>
> I will concede that the ptc.g instruction takes a considerable period of
> time on our 64 processor machines, but that comes out to a lot of local
> TLB coherence domains that need to be updated.
>
> I believe there is a similar instruction for x86. Could someone verify
> this?
Nope. x86 has to IPI for remote TLB flushes.
-Andi
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: Zero copy transmit
2003-04-29 19:41 ` Steve Modica
2003-04-29 19:59 ` Andi Kleen
@ 2003-04-29 20:17 ` Christoph Hellwig
1 sibling, 0 replies; 12+ messages in thread
From: Christoph Hellwig @ 2003-04-29 20:17 UTC (permalink / raw)
To: Steve Modica; +Cc: netdev
On Tue, Apr 29, 2003 at 02:41:27PM -0500, Steve Modica wrote:
> At issue is really application capture and customer adoption. There are
> tons of apps and lots of engineers that know socket operations and
> write/writev. Asking all ISVs to recode for linux would leave them with
> two separate APIs to deal with. They would have send/sendto or
> write/writev on Solaris, HPUX and whatever else, and linux would have
> sendfile.
Solaris and HPUX have sendfile, the HPUX one has a slightly different
API, the Solaris one is modelled after Linux.
^ permalink raw reply [flat|nested] 12+ messages in thread