* Status of AIO
@ 2006-03-06 6:24 Dan Aloni
From: Dan Aloni @ 2006-03-06 6:24 UTC
To: Linux Kernel List
Hello,
I'm trying to assert the status of AIO under the current version
of Linux 2.6. However by searching I wasn't able to find any
indication about it's current state. Is there anyone using it
under a production environment?
I'd like to know how complete it is and whether socket AIO is
adaquately supported.
Thanks,
--
Dan Aloni
da-x@monatomic.org, da-x@colinux.org, da-x@gmx.net, dan@xiv.co.il
* Re: Status of AIO
From: Phillip Susi @ 2006-03-06 15:05 UTC
To: Dan Aloni; +Cc: Linux Kernel List

Dan Aloni wrote:
> Hello,
>
> I'm trying to assert the status of AIO under the current version

I think you mean ascertain.

> of Linux 2.6. However by searching I wasn't able to find any
> indication about it's current state. Is there anyone using it
> under a production environment?
>
> I'd like to know how complete it is and whether socket AIO is
> adaquately supported.

AFAIK, it is not yet supported by the sockets layer, and the glibc
POSIX aio APIs do NOT use the kernel aio support.  I have done some
experimentation with it by hacking dd, but from what I can tell, it is
not used in any sort of production capacity.
* Re: Status of AIO
From: Benjamin LaHaise @ 2006-03-06 21:18 UTC
To: Dan Aloni; +Cc: Linux Kernel List

On Mon, Mar 06, 2006 at 08:24:03AM +0200, Dan Aloni wrote:
> Hello,
>
> I'm trying to assert the status of AIO under the current version
> of Linux 2.6. However by searching I wasn't able to find any
> indication about it's current state. Is there anyone using it
> under a production environment?

For O_DIRECT aio things are pretty stable (barring a patch to improve
-EIO handling).  The functionality is used by the various databases, so
it gets a fair amount of exercise.

> I'd like to know how complete it is and whether socket AIO is
> adaquately supported.

Socket AIO is not supported yet, but it is useful to get user requests
to know there is demand for it.

		-ben
--
"Time is of no importance, Mr. President, only life is important."
* Re: Status of AIO
From: Ulrich Drepper @ 2006-03-06 22:53 UTC
To: Benjamin LaHaise; +Cc: Dan Aloni, Linux Kernel List

On 3/6/06, Benjamin LaHaise <bcrl@kvack.org> wrote:
> Socket AIO is not supported yet, but it is useful to get user requests
> to know there is demand for it.

I don't think the POSIX AIO nor the kernel AIO interfaces are suitable
for sockets, at least the way we can expect network traffic to be
handled in the near future.  Some more radical approaches are needed.
I'll have some proposals which will be part of the talk I have at OLS.
* Re: Status of AIO
From: Phillip Susi @ 2006-03-06 23:15 UTC
To: Ulrich Drepper; +Cc: Benjamin LaHaise, Dan Aloni, Linux Kernel List

Ulrich Drepper wrote:
> I don't think the POSIX AIO nor the kernel AIO interfaces are suitable
> for sockets, at least the way we can expect network traffic to be
> handled in the near future.  Some more radical approaches are needed.
> I'll have some proposals which will be part of the talk I have at OLS.

Why do you say it is not suitable?  The kernel aio interfaces should
work very well, especially when combined with O_DIRECT.
* Re: Status of AIO
From: Ulrich Drepper @ 2006-03-08 7:09 UTC
To: Phillip Susi; +Cc: Benjamin LaHaise, Dan Aloni, Linux Kernel List

On 3/6/06, Phillip Susi <psusi@cfl.rr.com> wrote:
> Why do you say it is not suitable?  The kernel aio interfaces should
> work very well, especially when combined with O_DIRECT.

What has network I/O to do with O_DIRECT?  I'm talking about async
network I/O.
* Re: Status of AIO
From: Phillip Susi @ 2006-03-08 15:58 UTC
To: Ulrich Drepper; +Cc: Benjamin LaHaise, Dan Aloni, Linux Kernel List

Ulrich Drepper wrote:
> What has network I/O to do with O_DIRECT?  I'm talking about async
> network I/O.

O_DIRECT allows for zero copy IO, which saves a boatload of cpu cycles.
For disk IO it is possible to use O_DIRECT without aio, but there is
generally a loss of efficiency doing so.  For network IO, O_DIRECT is
not even possible without aio.  By using aio and O_DIRECT for network
IO, you can achieve massive performance and scalability gains.

You said before that the kernel aio interface is not suitable for
sockets.  Why not?
* Re: Status of AIO
From: Benjamin LaHaise @ 2006-03-06 23:33 UTC
To: Ulrich Drepper; +Cc: Dan Aloni, Linux Kernel List

On Mon, Mar 06, 2006 at 02:53:07PM -0800, Ulrich Drepper wrote:
> I don't think the POSIX AIO nor the kernel AIO interfaces are suitable
> for sockets, at least the way we can expect network traffic to be
> handled in the near future.  Some more radical approaches are needed.
> I'll have some proposals which will be part of the talk I have at OLS.

Oh?  I've always envisioned that network AIO would be able to use
O_DIRECT style zero copy transmit, and something like I/OAT on the
receive side.  The in kernel API provides a lightweight event mechanism
that should work ideally for this purpose.

		-ben
* Re: Status of AIO
From: David S. Miller @ 2006-03-07 0:24 UTC
To: bcrl; +Cc: drepper, da-x, linux-kernel

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Mon, 6 Mar 2006 18:33:00 -0500

> Oh?  I've always envisioned that network AIO would be able to use
> O_DIRECT style zero copy transmit, and something like I/OAT on the
> receive side.  The in kernel API provides a lightweight event mechanism
> that should work ideally for this purpose.

I think something like net channels will be more effective on receive.

We shouldn't be designing things for the old and inefficient world
where the work is done in software and hardware interrupt context; it
should be moved as close as possible to the compute entities, and that
means putting the work all the way into the app itself, if not very
close.

To me, it is not a matter of if we put the networking stack at least
partially into some userland library, but when.

Everyone has their brains wrapped around how OS support for networking
has always been done, and assuming that particular model is erroneous
(and net channels show good hard evidence that it is) this continued
thought process merely continues the error.

I really dislike it when non-networking people work on these
interfaces.  They've all frankly stunk, and they've had several
opportunities to try and get it right.

I want a bona fide networking person to work on any high performance
networking API we ever decide to actually use.

This is why I am going to sit and wait patiently for Van Jacobson's
work to get published and mature, because it's the only light in the
tunnel since Multics.

Yes, since Multics, that's how bad the existing models for doing these
things are.
* Re: Status of AIO
From: Benjamin LaHaise @ 2006-03-07 0:42 UTC
To: David S. Miller; +Cc: drepper, da-x, linux-kernel

On Mon, Mar 06, 2006 at 04:24:44PM -0800, David S. Miller wrote:
> I think something like net channels will be more effective on receive.

Perhaps, but we don't necessarily have to go to that extreme to get the
value of the approach.  One way of doing network receive that would let
us keep the same userland API is to have the kernel perform the receive
portion of TCP processing in userspace as a vsyscall.  The whole
channel would then be a concept internal to the kernel.  Once that
works and the internals have settled down, it might make sense to
export an API that allows us to expose parts of the channel to the
user.  Unfortunately, I think that the problem of getting the packets
delivered to the right user is Hard (especially with incoming filters
and all the other features of the stack).

> I want a bona fide networking person to work on any high performance
> networking API we ever decide to actually use.

I'm open to suggestions. =-)  So far my thoughts have mostly been
limited to how to make tx faster, at which point you have to go into
the kernel somehow to deal with the virtual => physical address
translation (be it with a locked buffer or whatever) and kicking the
hardware.  Rx has been much less interesting simply because the
hardware side doesn't offer as much.

		-ben
* Re: Status of AIO
From: David S. Miller @ 2006-03-07 0:51 UTC
To: bcrl; +Cc: drepper, da-x, linux-kernel

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Mon, 6 Mar 2006 19:42:37 -0500

> I'm open to suggestions. =-)  So far my thoughts have mostly been
> limited to how to make tx faster, at which point you have to go into
> the kernel somehow to deal with the virtual => physical address
> translation (be it with a locked buffer or whatever) and kicking the
> hardware.  Rx has been much less interesting simply because the
> hardware side doesn't offer as much.

I think any such VM tricks need serious thought.  It has serious
consequences as far as cost, especially on SMP.  Evgeniy has some data
that shows this, and chapter 5 of Network Algorithmics has a lot of
good analysis and paper references on this topic.
* Re: Status of AIO
From: Benjamin LaHaise @ 2006-03-07 1:39 UTC
To: David S. Miller; +Cc: drepper, da-x, linux-kernel

On Mon, Mar 06, 2006 at 04:51:29PM -0800, David S. Miller wrote:
> I think any such VM tricks need serious thought.  It has serious
> consequences as far as cost, especially on SMP.  Evgeniy has some data
> that shows this, and chapter 5 of Network Algorithmics has a lot of
> good analysis and paper references on this topic.

VM tricks do suck, so you just have to use the tricks that nobody else
is...  My thinking is to do something like the following: have a
structure to reference a set of pages.  When it is first created, it
takes a reference on the pages in question, and it is added to the
vm_area_struct of the user so that the vm can poke it for freeing when
memory pressure occurs.  The sk_buff dataref also has to have a pointer
to the pageref added.  Now, the trick to making it useful is as follows:

	struct pageref {
		atomic_t	free_count;
		int		use_count;	/* protected by socket lock */
		...
		unsigned long	user_address;
		unsigned long	length;
		struct socket	*sock;		/* backref for VM */
		struct page	*pages[];
	};

The fast path in network transmit becomes:

	if (sock->pageref->... overlaps buf) {
		for each packet built {
			use_count++;
			<add pageref to skb's dataref happily without
			 atomics or memory copying>
		}
	}

Then the kfree_skb() path does an atomic_dec() on pageref->free_count
instead of the page.  (Or get rid of the atomic using knowledge about
the fact that a given pageref could only be freed by the network driver
it was given to.)  That would make the transmit path bloody cheap, and
the tx irq context no more expensive than it already is.

It's probably easier to show this tx path with code that gets the
details right.

		-ben
* Re: Status of AIO
From: Dan Aloni @ 2006-03-07 2:04 UTC
To: Benjamin LaHaise; +Cc: David S. Miller, drepper, linux-kernel

On Mon, Mar 06, 2006 at 08:39:15PM -0500, Benjamin LaHaise wrote:
> VM tricks do suck, so you just have to use the tricks that nobody else
> is...  My thinking is to do something like the following: have a
> structure to reference a set of pages.  When it is first created, it
> takes a reference on the pages in question, and it is added to the
> vm_area_struct of the user so that the vm can poke it for freeing when
> memory pressure occurs.  The sk_buff dataref also has to have a pointer
> to the pageref added.  Now, the trick to making it useful is as follows:
>
> 	struct pageref {
> 		atomic_t	free_count;
> 		int		use_count;	/* protected by socket lock */
> 		...
> 		unsigned long	user_address;
> 		unsigned long	length;
> 		struct socket	*sock;		/* backref for VM */
> 		struct page	*pages[];
> 	};
[...]
> It's probably easier to show this tx path with code that gets the
> details right.

This somehow resembles the scatter-gather lists already used in some
subsystems such as the SCSI sg driver.

BTW you have to make these pages Copy-On-Write before this procedure
starts because you wouldn't want it to accidentally fill the zero page,
i.e. the VM will have to supply a unique set of pages otherwise it
messes up.

--
Dan Aloni
da-x@monatomic.org, da-x@colinux.org, da-x@gmx.net, dan@xiv.co.il
* Re: Status of AIO
From: Benjamin LaHaise @ 2006-03-07 2:07 UTC
To: Dan Aloni; +Cc: David S. Miller, drepper, linux-kernel

On Tue, Mar 07, 2006 at 04:04:11AM +0200, Dan Aloni wrote:
> This somehow resembles the scatter-gather lists already used in some
> subsystems such as the SCSI sg driver.

None of the iovecs are particularly special.  What's special here is
that particulars of the container make the fast path *cheap*.

> BTW you have to make these pages Copy-On-Write before this procedure
> starts because you wouldn't want it to accidentally fill the zero page,
> i.e. the VM will have to supply a unique set of pages otherwise it
> messes up.

No, that would be insanely expensive.  There's no way this would be
done transparently to the user unless we know that we're blocking until
the transmit is complete.

		-ben
* Re: Status of AIO
From: David S. Miller @ 2006-03-07 3:11 UTC
To: bcrl; +Cc: da-x, drepper, linux-kernel

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Mon, 6 Mar 2006 21:07:36 -0500

> On Tue, Mar 07, 2006 at 04:04:11AM +0200, Dan Aloni wrote:
> > This somehow resembles the scatter-gather lists already used in some
> > subsystems such as the SCSI sg driver.
>
> None of the iovecs are particularly special.  What's special here is
> that particulars of the container make the fast path *cheap*.

Please read Druschel and Peterson's paper on fbufs and any follow-on
work before going down this path.  Fbufs are exactly what you are
proposing as a workaround for the VM cost of page flipping, and the
idea has been around since 1993. :-)

As I mentioned, chapter 5 of Network Algorithmics discusses this in
detail, and also covers many related attempts such as IO-Lite.
* Re: Status of AIO
From: Dan Aloni @ 2006-03-07 7:33 UTC
To: Benjamin LaHaise; +Cc: David S. Miller, drepper, linux-kernel

On Mon, Mar 06, 2006 at 09:07:36PM -0500, Benjamin LaHaise wrote:
> On Tue, Mar 07, 2006 at 04:04:11AM +0200, Dan Aloni wrote:
> > This somehow resembles the scatter-gather lists already used in some
> > subsystems such as the SCSI sg driver.
>
> None of the iovecs are particularly special.  What's special here is
> that particulars of the container make the fast path *cheap*.
>
> > BTW you have to make these pages Copy-On-Write before this procedure
> > starts because you wouldn't want it to accidentally fill the zero page,
> > i.e. the VM will have to supply a unique set of pages otherwise it
> > messes up.
>
> No, that would be insanely expensive.  There's no way this would be
> done transparently to the user unless we know that we're blocking until
> the transmit is complete.

Sure it can't be transparent to the user, but you can just require the
user to perform mlock on the VMA and you get around this problem.

--
Dan Aloni
da-x@monatomic.org, da-x@colinux.org, da-x@gmx.net, dan@xiv.co.il
* Re: Status of AIO
From: David S. Miller @ 2006-03-07 3:06 UTC
To: bcrl; +Cc: drepper, da-x, linux-kernel

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Mon, 6 Mar 2006 20:39:15 -0500

> VM tricks do suck, so you just have to use the tricks that nobody else
> is...  My thinking is to do something like the following: have a
> structure to reference a set of pages.  When it is first created, it
> takes a reference on the pages in question, and it is added to the
> vm_area_struct of the user so that the vm can poke it for freeing when
> memory pressure occurs.  The sk_buff dataref also has to have a pointer
> to the pageref added.

You've just reinvented fbufs, and they have their own known set of
issues.

Please read chapter 5 of Network Algorithmics or ask someone to
paraphrase the content for you.  It really covers this completely, and
once you read it you will be able to avoid reinventing the wheel and
falling under the false notion of having invented something :-)
* Re: Status of AIO
From: Benjamin LaHaise @ 2006-03-07 16:35 UTC
To: David S. Miller; +Cc: drepper, da-x, linux-kernel

On Mon, Mar 06, 2006 at 07:06:33PM -0800, David S. Miller wrote:
> You've just reinvented fbufs, and they have their own known set of
> issues.
>
> Please read chapter 5 of Network Algorithmics or ask someone to
> paraphrase the content for you.  It really covers this completely, and
> once you read it you will be able to avoid reinventing the wheel and
> falling under the false notion of having invented something :-)

Nothing in software is particularly unique given the same set of
requirements.  Unfortunately, none of the local book stores have a copy
of Network Algorithmics in stock, so it will be a few days before it
arrives.  What problems does this approach have?  Aside from the fact
that it's useless unless implemented on top of AIO type semantics, it
looks like a good way to improve performance.

		-ben
* Re: Status of AIO
From: Phillip Susi @ 2006-03-07 1:34 UTC
To: David S. Miller; +Cc: bcrl, drepper, da-x, linux-kernel

David S. Miller wrote:
> I think something like net channels will be more effective on receive.

What is this "net channels"?  I'll do some googling but if you have a
direct reference it would be helpful.

> We shouldn't be designing things for the old and inefficient world
> where the work is done in software and hardware interrupt context; it
> should be moved as close as possible to the compute entities, and that
> means putting the work all the way into the app itself, if not very
> close.
>
> To me, it is not a matter of if we put the networking stack at least
> partially into some userland library, but when.

Maybe you should try using a microkernel then, like Mach?  The Linux
way of doing things is to leave critical services that most user mode
code depends on, such as filesystems and the network stack, in the
kernel.  I don't think that's going to change.

> Everyone has their brains wrapped around how OS support for networking
> has always been done, and assuming that particular model is erroneous
> (and net channels show good hard evidence that it is) this continued
> thought process merely continues the error.

Have you taken a look at BSD's kqueue and NT's IO completion port
approach?  They allow virtually all of the IO to be offloaded to
hardware DMA, and there's no reason Linux can't do the same with aio
and O_DIRECT.  There's no need to completely throw out the stack and
start over, let alone in user mode, to get there.

> I really dislike it when non-networking people work on these
> interfaces.  They've all frankly stunk, and they've had several
> opportunities to try and get it right.

I agree, the old (non) blocking IO style interfaces have all sucked,
which is why it's time to move on to aio.  NT has been demonstrating
for 10 years now (that's how long ago I wrote an FTPd using IOCPs on
NT) the benefits of async IO.  It has been a long time coming, but once
the Linux kernel is capable of zero copy aio, I will be quite happy.

> I want a bona fide networking person to work on any high performance
> networking API we ever decide to actually use.
>
> This is why I am going to sit and wait patiently for Van Jacobson's
> work to get published and mature, because it's the only light in the
> tunnel since Multics.
>
> Yes, since Multics, that's how bad the existing models for doing these
> things are.
* Re: Status of AIO
From: David S. Miller @ 2006-03-07 3:04 UTC
To: psusi; +Cc: bcrl, drepper, da-x, linux-kernel

From: Phillip Susi <psusi@cfl.rr.com>
Date: Mon, 06 Mar 2006 20:34:46 -0500

> What is this "net channels"?  I'll do some googling but if you have a
> direct reference it would be helpful.

You didn't google hard enough, my blog entry on the topic comes up as
the first entry when you google for "Van Jacobson net channels".

> Maybe you should try using a microkernel then, like Mach?  The Linux
> way of doing things is to leave critical services that most user mode
> code depends on, such as filesystems and the network stack, in the
> kernel.  I don't think that's going to change.

Oh ye of little faith, and we don't need to go to a microkernel
architecture to move things like parts of the TCP stack into user
space.
* Re: Status of AIO
From: Phillip Susi @ 2006-03-07 4:07 UTC
To: David S. Miller; +Cc: bcrl, drepper, da-x, linux-kernel

David S. Miller wrote:
> You didn't google hard enough, my blog entry on the topic comes up as
> the first entry when you google for "Van Jacobson net channels".

Thanks, I read the page...  I find it to be a little extreme, and zero
copy aio can get the same benefits without all that hassle.  Let me
write this as a reply to the article itself:

> With SMP systems this "end host" concept really should be extended to
> the computing entities within the system, that being cpus and threads
> within the box.

I agree; all threads and cpus should be able to concurrently process
network IO, and without wasting cpu cycles copying the data around 6
times.  That does not, and should not, mean moving the TCP/IP protocol
to user space.

> So, given all that, how do you implement network packet receive
> properly?  Well, first of all, you stop doing so much work in interrupt
> (both hard and soft) context.  Jamal Hadi Salim and others understood
> this quite well, and NAPI is a direct consequence of that understanding.
> But what Van is trying to show in his presentation is that you can take
> this further, in fact a _lot_ further.

I agree; a minimum of work should be done in interrupt context.
Specifically, the interrupt handler should simply insert and remove
packets from the queue and program the hardware registers for DMA
access to the packet buffer memory.  If the hardware supports
scatter/gather DMA, then the upper layers can enqueue packet buffers to
send/receive into/from, and the interrupt handler just pulls packets
off this queue when the hardware raises an interrupt to indicate it has
completed the DMA transfer.  This is how NT and I believe BSD have been
doing things for some time now, and the direction the Linux kernel is
moving in.

> A Van Jacobson channel is a path for network packets.  It is
> implemented as an array'd queue of packets.  There is state for the
> producer and the consumer, and it all sits in different cache lines so
> that it is never the case that both the consumer and producer write to
> shared cache lines.  Network cards want to know purely about packets,
> yet for years we've been enforcing an OS determined model and
> abstraction for network packets upon the drivers for such cards.  This
> has come in the form of "mbufs" in BSD and "SKBs" under Linux, but the
> channels are designed so that this is totally unnecessary.  Drivers no
> longer need to know about what the OS packet buffers look like,
> channels just contain pointers to packet data.

I must admit, I am a bit confused by this.  It sounds a lot like the
pot calling the kettle black to me.  Aren't SKBs and mbufs already just
a form of the very queue of packets being advocated here?  Don't they
simply list memory ranges for the driver to transfer to the nic as a
packet?

> The next step is to build channels to sockets.  We need some
> intelligence in order to map packets to channels, and this comes in
> the form of a tiny packet classifier the drivers use on input.  It
> reads the protocol, ports, and addresses to determine the flow ID and
> uses this to find a channel.  If no matching flow is found, we fall
> back to the basic channel we created in the first step.  As sockets
> are created, channel mappings are installed and thus the driver
> classifier can find them later.  The socket wakes up, and does
> protocol input processing and copying into userspace directly out of
> the channel.

How is this any different from what we have now, other than bypassing
the kernel buffer?  The tcp/ip layer looks at the incoming packet to
decide what socket it goes with, and copies it to the waiting buffer.
Right now that waiting buffer is a kernel buffer, because at the time
the packet arrives, the kernel does not have any user buffers.  If the
user process uses aio though, it can hand the kernel a few buffers to
receive into ahead of time, so when the packets have been classified,
they can be copied directly to the user buffer.

> And in the next step you can have the socket ask for a channel ID
> (with a getsockopt or something like that), have it mmap() a receive
> ring buffer into user space, and the mapped channel just tosses the
> packet data into that mmap()'d area and wakes up the process.  The
> process has a mini TCP receive engine in user space.

There is no need to use mmap() and burden the user code with
implementing TCP itself (which is quite a lot of work).  It can hand
the kernel buffers by queuing multiple O_DIRECT aio requests and the
kernel can directly dump the data stream there after stripping off the
headers.  When sending, it can program the hardware to directly
scatter/gather DMA from the user buffer attached to the aio request.

> And you can take this even further than that (think, remote systems).
> At each stage Van presents a table of profiled measurements for a
> normal bulk TCP data transfer.  The final stage of channeling all the
> way to userspace is some 6 times faster than what we're doing today,
> yes I said 6 times faster, that isn't a typo.

Yes, we can and should have a 6 times speed up, but as I've explained
above, NT has had that for 10 years without having to push TCP into
user space.
* Re: Status of AIO
From: David S. Miller @ 2006-03-07 6:02 UTC
To: psusi; +Cc: bcrl, drepper, da-x, linux-kernel

From: Phillip Susi <psusi@cfl.rr.com>
Date: Mon, 06 Mar 2006 23:07:05 -0500

> How is this any different from what we have now, other than bypassing
> the kernel buffer?  The tcp/ip layer looks at the incoming packet to
> decide what socket it goes with, and copies it to the waiting buffer.
> Right now that waiting buffer is a kernel buffer, because at the time
> the packet arrives, the kernel does not have any user buffers.

The whole idea is to figure out what socket gets the packet without
going through the IP and TCP stack at all, in the hardware interrupt
handler, using a tiny classifier that's very fast and can be
implemented in hardware.

Please wrap your brain around the idea a little longer than the 15 or
so minutes you have thus far... thanks.

> Yes, we can and should have a 6 times speed up, but as I've explained
> above, NT has had that for 10 years without having to push TCP into
> user space.

That's complete BS.
* Re: Status of AIO
  2006-03-07  6:02             ` David S. Miller
@ 2006-03-07 16:06               ` Phillip Susi
  0 siblings, 0 replies; 28+ messages in thread
From: Phillip Susi @ 2006-03-07 16:06 UTC (permalink / raw)
  To: David S. Miller; +Cc: bcrl, drepper, da-x, linux-kernel

David S. Miller wrote:
> The whole idea is to figure out what socket gets the packet
> without going through the IP and TCP stack at all, in the
> hardware interrupt handler, using a tiny classifier that's
> very fast and can be implemented in hardware.

AFAIK, "going through the IP and TCP stack" just means passing the packet
through a quick classifier to locate the corresponding socket. It would be
nice to be able to offload that to the hardware, but you don't need to
throw out the baby (the tcp/ip stack) with the bathwater to get there.
Maybe some sort of interface could be constructed to allow the higher
layers to pass down an ASL-type byte-code classifier to the NIC driver,
which could either call it via a kernel software interpreter, or convert
it to firmware code to load into the hardware.

> Please wrap your brain around the idea a little longer than
> the 15 or so minutes you have thus far... thanks.

I've had my brain wrapped around these sorts of networking problems for
over 10 years now, so I think I have a fair handle on things. Certainly
enough to carry on a discussion about it.

>> Yes, we can and should have a 6 times speed up, but as I've explained
>> above, NT has had that for 10 years without having to push TCP into user
>> space.
>
> That's complete BS.

Error, does not compute. Your holier-than-thou attitude does not a healthy
discussion make. I explained the methods that have been in use on NT to
achieve a 6-fold decrease in cpu utilization for bulk network IO, and how
they can be applied to the Linux kernel without radical changes. If you
don't understand it, then ask sensible questions instead of just crying
"That's complete BS!"
We already have O_DIRECT aio for disk drives that can do zero copy; there's
no reason not to apply that to the network stack as well.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Status of AIO
  2006-03-06 21:18 ` Benjamin LaHaise
  2006-03-06 22:53   ` Ulrich Drepper
@ 2006-03-07  1:30   ` Dan Aloni
  2006-03-07  1:37     ` Nicholas Miell
                       ` (2 more replies)
  1 sibling, 3 replies; 28+ messages in thread
From: Dan Aloni @ 2006-03-07  1:30 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Linux Kernel List

On Mon, Mar 06, 2006 at 04:18:54PM -0500, Benjamin LaHaise wrote:
> On Mon, Mar 06, 2006 at 08:24:03AM +0200, Dan Aloni wrote:
> > Hello,
> >
> > I'm trying to assert the status of AIO under the current version
> > of Linux 2.6. However by searching I wasn't able to find any
> > indication about it's current state. Is there anyone using it
> > under a production environment?
>
> For O_DIRECT aio things are pretty stable (barring a patch to improve -EIO
> handling). The functionality is used by the various databases, so it gets
> a fair amount of exercise.
>
> > I'd like to know how complete it is and whether socket AIO is
> > adaquately supported.
>
> Socket AIO is not supported yet, but it is useful to get user requests to
> know there is demand for it.

Well, I've written a small test app to see if it works with network
sockets, and apparently it did for that small test case (connect() with
aio_read(), loop with aio_error(), and aio_return()). I thought perhaps
the glibc implementation was running behind the scenes, so I checked to
see whether a thread was created in the background, and there wasn't any
thread.

--
Dan Aloni
da-x@monatomic.org, da-x@colinux.org, da-x@gmx.net, dan@xiv.co.il

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Status of AIO
  2006-03-07  1:30       ` Dan Aloni
@ 2006-03-07  1:37         ` Nicholas Miell
  2006-03-07  1:37         ` Phillip Susi
  2006-03-07  1:40         ` Benjamin LaHaise
  2 siblings, 0 replies; 28+ messages in thread
From: Nicholas Miell @ 2006-03-07  1:37 UTC (permalink / raw)
  To: Dan Aloni; +Cc: Benjamin LaHaise, Linux Kernel List

On Tue, 2006-03-07 at 03:30 +0200, Dan Aloni wrote:
> On Mon, Mar 06, 2006 at 04:18:54PM -0500, Benjamin LaHaise wrote:
> > On Mon, Mar 06, 2006 at 08:24:03AM +0200, Dan Aloni wrote:
> > > Hello,
> > >
> > > I'm trying to assert the status of AIO under the current version
> > > of Linux 2.6. However by searching I wasn't able to find any
> > > indication about it's current state. Is there anyone using it
> > > under a production environment?
> >
> > For O_DIRECT aio things are pretty stable (barring a patch to improve -EIO
> > handling). The functionality is used by the various databases, so it gets
> > a fair amount of exercise.
> >
> > > I'd like to know how complete it is and whether socket AIO is
> > > adaquately supported.
> >
> > Socket AIO is not supported yet, but it is useful to get user requests to
> > know there is demand for it.
>
> Well, I've written a small test app to see if it works with network
> sockets and apparently it did for that small test case (connect()
> with aio_read(), loop with aio_error(), and aio_return()). I thought
> perhaps the glibc implementation was running behind the scene, so I've
> checked to see if it a thread was created in the background and I
> there wasn't any thread.

None of the aio_* functions use the kernel's AIO interface. They're
implemented entirely in userspace using a thread pool.

--
Nicholas Miell <nmiell@comcast.net>

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Status of AIO
  2006-03-07  1:30       ` Dan Aloni
  2006-03-07  1:37         ` Nicholas Miell
@ 2006-03-07  1:37         ` Phillip Susi
  2006-03-07  1:40         ` Benjamin LaHaise
  2 siblings, 0 replies; 28+ messages in thread
From: Phillip Susi @ 2006-03-07  1:37 UTC (permalink / raw)
  To: Dan Aloni; +Cc: Benjamin LaHaise, Linux Kernel List

The aio_* functions are library routines in glibc that are implemented by
spawning threads to issue the normal blocking io syscalls; they don't use
the kernel's real async IO. I'm not sure why you didn't see the thread,
but if you look at the glibc sources you will see how it works. To use the
kernel aio you make calls to io_submit().

Dan Aloni wrote:
> Well, I've written a small test app to see if it works with network
> sockets and apparently it did for that small test case (connect()
> with aio_read(), loop with aio_error(), and aio_return()). I thought
> perhaps the glibc implementation was running behind the scene, so I've
> checked to see if it a thread was created in the background and I
> there wasn't any thread.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Status of AIO
  2006-03-07  1:30       ` Dan Aloni
  2006-03-07  1:37         ` Nicholas Miell
  2006-03-07  1:37         ` Phillip Susi
@ 2006-03-07  1:40         ` Benjamin LaHaise
  2 siblings, 0 replies; 28+ messages in thread
From: Benjamin LaHaise @ 2006-03-07  1:40 UTC (permalink / raw)
  To: Dan Aloni; +Cc: Linux Kernel List

On Tue, Mar 07, 2006 at 03:30:50AM +0200, Dan Aloni wrote:
> Well, I've written a small test app to see if it works with network
> sockets and apparently it did for that small test case (connect()
> with aio_read(), loop with aio_error(), and aio_return()). I thought
> perhaps the glibc implementation was running behind the scene, so I've
> checked to see if it a thread was created in the background and I
> there wasn't any thread.

Unfortunately, it will block in io_submit when it shouldn't.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 28+ messages in thread
* Re: Status of AIO
  2006-03-06  6:24 Status of AIO Dan Aloni
  2006-03-06 15:05 ` Phillip Susi
  2006-03-06 21:18 ` Benjamin LaHaise
@ 2006-03-06 23:18 ` Phillip Susi
  2 siblings, 0 replies; 28+ messages in thread
From: Phillip Susi @ 2006-03-06 23:18 UTC (permalink / raw)
  To: Dan Aloni; +Cc: Linux Kernel List

I'm sending this again because it looks like the original got lost. At
least, I've not seen it show up on the mailing list yet, and I sent it 8
hours ago.

Dan Aloni wrote:
> Hello,
>
> I'm trying to assert the status of AIO under the current version

I think you mean ascertain.

> of Linux 2.6. However by searching I wasn't able to find any
> indication about it's current state. Is there anyone using it
> under a production environment?
>
> I'd like to know how complete it is and whether socket AIO is
> adaquately supported.
>
> Thanks,

AFAIK, it is not yet supported by the sockets layer, and the glibc posix
aio apis do NOT use the kernel aio support. I have done some
experimentation with it by hacking dd, but from what I can tell, it is
not used in any sort of production capacity.

^ permalink raw reply	[flat|nested] 28+ messages in thread
end of thread, other threads:[~2006-03-08 15:59 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2006-03-06  6:24 Status of AIO Dan Aloni
2006-03-06 15:05 ` Phillip Susi
2006-03-06 21:18 ` Benjamin LaHaise
2006-03-06 22:53 ` Ulrich Drepper
2006-03-06 23:15 ` Phillip Susi
2006-03-08  7:09 ` Ulrich Drepper
2006-03-08 15:58 ` Phillip Susi
2006-03-06 23:33 ` Benjamin LaHaise
2006-03-07  0:24 ` David S. Miller
2006-03-07  0:42 ` Benjamin LaHaise
2006-03-07  0:51 ` David S. Miller
2006-03-07  1:39 ` Benjamin LaHaise
2006-03-07  2:04 ` Dan Aloni
2006-03-07  2:07 ` Benjamin LaHaise
2006-03-07  3:11 ` David S. Miller
2006-03-07  7:33 ` Dan Aloni
2006-03-07  3:06 ` David S. Miller
2006-03-07 16:35 ` Benjamin LaHaise
2006-03-07  1:34 ` Phillip Susi
2006-03-07  3:04 ` David S. Miller
2006-03-07  4:07 ` Phillip Susi
2006-03-07  6:02 ` David S. Miller
2006-03-07 16:06 ` Phillip Susi
2006-03-07  1:30 ` Dan Aloni
2006-03-07  1:37 ` Nicholas Miell
2006-03-07  1:37 ` Phillip Susi
2006-03-07  1:40 ` Benjamin LaHaise
2006-03-06 23:18 ` Phillip Susi