* Questions regarding network drivers
@ 2006-11-10  3:06 Jonathan Day
  2006-11-10 10:00 ` Evgeniy Polyakov
  0 siblings, 1 reply; 7+ messages in thread

From: Jonathan Day @ 2006-11-10 3:06 UTC (permalink / raw)
To: netdev

Hi,

I've got an interesting problem to contend with and need some advice
from the great wise ones here.

First of all, is it possible (and/or "reasonable practice") when
developing a network driver to do zero-copy transfers between main
memory and the network device?

Secondly, the network device is only designed to work with short
packets and I really want to keep the throughput up. My thought was
that if I fired off an interrupt and then transferred a page of data
into an area I know is safe, the kernel would have enough time to find
a new safe area and post its address before the next page is ready to
send.

Can anyone suggest why this wouldn't work or, assuming it can work,
why this would be a Bad Idea?

Lastly, assuming my sanity lasts that long, would I be correct in
assuming that the first step in the process of getting the driver
peer-reviewed and accepted would be to post the patches here?

Thanks for any help,

Jonathan Day

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Questions regarding network drivers
  2006-11-10  3:06 Questions regarding network drivers Jonathan Day
@ 2006-11-10 10:00 ` Evgeniy Polyakov
  2006-11-10 17:34   ` Jonathan Day
  0 siblings, 1 reply; 7+ messages in thread

From: Evgeniy Polyakov @ 2006-11-10 10:00 UTC (permalink / raw)
To: Jonathan Day; +Cc: netdev

On Thu, Nov 09, 2006 at 07:06:00PM -0800, Jonathan Day (imipak@yahoo.com) wrote:
> Hi,
>
> I've got an interesting problem to contend with and need some advice
> from the great wise ones here.
>
> First of all, is it possible (and/or "reasonable practice") when
> developing a network driver to do zero-copy transfers between main
> memory and the network device?

What do you mean?
DMA from NIC memory into CPU memory?

> Secondly, the network device is only designed to work with short
> packets and I really want to keep the throughput up. My thought was
> that if I fired off an interrupt and then transferred a page of data
> into an area I know is safe, the kernel would have enough time to
> find a new safe area and post its address before the next page is
> ready to send.
>
> Can anyone suggest why this wouldn't work or, assuming it can work,
> why this would be a Bad Idea?

There should not be any kind of "the kernel will have enough time to
do something"; instead, you must guarantee that there will not be any
kind of races. You can either preallocate several buffers or allocate
them on demand in interrupts.

> Lastly, assuming my sanity lasts that long, would I be correct in
> assuming that the first step in the process of getting the driver
> peer-reviewed and accepted would be to post the patches here?

Actually not, the first step in that process is learning the jig dance
and of course providing enough beer and other goodies to the network
maintainers.

> Thanks for any help,

No problem, but to answer at least one of your questions, more
information should be provided.

> Jonathan Day

-- 
	Evgeniy Polyakov
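The "preallocate several buffers" option Evgeniy mentions can be
sketched in user-space C. This is only an illustrative model, not
kernel code; all of the names (`buf_pool`, `pool_refill`, `pool_pop`)
are invented for the example. The point it demonstrates is that the
interrupt-path operation never allocates: it only pops from a pool
that process context keeps topped up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE 8
#define BUF_BYTES 2048

/* A trivial preallocated buffer pool: the "interrupt" path only pops,
 * so it never has to allocate; refilling happens in process context. */
struct buf_pool {
    void *bufs[POOL_SIZE];
    int count;
};

/* Process-context refill: may sleep/allocate; runs outside the hot path. */
static int pool_refill(struct buf_pool *p)
{
    while (p->count < POOL_SIZE) {
        void *b = malloc(BUF_BYTES);
        if (!b)
            return -1;          /* a partial refill is still usable */
        p->bufs[p->count++] = b;
    }
    return 0;
}

/* Called from the (simulated) interrupt handler: O(1), no allocation. */
static void *pool_pop(struct buf_pool *p)
{
    return p->count ? p->bufs[--p->count] : NULL;
}
```

A real driver would use DMA-mapped buffers and guard the pool with the
appropriate locking, but the race Evgeniy warns about is avoided the
same way: the device is only ever told about buffers that already
exist.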
* Re: Questions regarding network drivers
  2006-11-10 10:00 ` Evgeniy Polyakov
@ 2006-11-10 17:34   ` Jonathan Day
  2006-11-10 17:55     ` Stephen Hemminger
  2006-11-11 11:54     ` Evgeniy Polyakov
  0 siblings, 2 replies; 7+ messages in thread

From: Jonathan Day @ 2006-11-10 17:34 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev

--- Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> On Thu, Nov 09, 2006 at 07:06:00PM -0800, Jonathan Day (imipak@yahoo.com) wrote:
> > First of all, is it possible (and/or "reasonable practice") when
> > developing a network driver to do zero-copy transfers between main
> > memory and the network device?
>
> What do you mean?
> DMA from NIC memory into CPU memory?

Yes. I want to bypass the kernel altogether and think there may be a
way to do this, but I want to make very certain that I'm not going
down the wrong track.

The underlying problem is this. The group I'm working with is messing
about with building their own networking device that will run at a
speed equal to that of the bus leading to the host (2.5
gigabits/second). The device has its own DMA controller and can
operate as bus master.

It's my task to figure out how to get the data into the host at
near-100% bandwidth without dropping anything, with minimal latency
and real-time characteristics. (I talked them out of making me do this
blindfolded, but on further consideration, I'm not sure if this was a
good idea.)

> > Secondly, the network device is only designed to work with short
> > packets and I really want to keep the throughput up. My thought
> > was that if I fired off an interrupt and then transferred a page
> > of data into an area I know is safe, the kernel would have enough
> > time to find a new safe area and post its address before the next
> > page is ready to send.
> >
> > Can anyone suggest why this wouldn't work or, assuming it can
> > work, why this would be a Bad Idea?
>
> There should not be any kind of "the kernel will have enough time to
> do something"; instead, you must guarantee that there will not be
> any kind of races. You can either preallocate several buffers or
> allocate them on demand in interrupts.

The exact process I was thinking of is as follows:

1. The driver sets up a full page and pins it.
2. The driver obtains the physical address and places that address
   into a known, fixed location.
3. The driver sends an interrupt to the network device to say that
   everything is ready.
4. On receiving the interrupt, a bit is set to true on the network
   device, to say that the host is ready to receive.
5. The network device has a counter for the number of packets that
   can be put in one page. If this number is zero and the bit is set,
   then:
   5.1. The counter is set to the maximum number of packets storable
        in the page.
   5.2. The page address is DMAed out of the known location in host
        memory and placed in network device memory.
   5.3. The bit is cleared.
   5.4. The driver is given an interrupt to tell it to prepare the
        next page.
   5.5. If the sender had previously been told to stop transmitting,
        it is now told to continue.
6. Every received packet is placed in a ring buffer on the network
   device.
7. The counter is reduced by 1.
8. If the counter reaches zero, OR the packet is followed by an
   end-of-transmission notification:
   8.1. The ring buffer is DMAed en masse into host memory at the
        location given in the previously cached page address.
   8.2. If the new-page bit has not been set to true, the network
        device notifies the sender to pause.

The idea is that we're DMAing to a page that we know is safe, because
the driver has ensured that before telling the network device where it
is. (i.e., the driver ensures that the page actually does exist, has
been fully set up in the VMM, and isn't going to move in physical
memory or into swap.)
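The handshake in the steps above can be modeled as a small state
machine. This is a user-space sketch under my own assumptions, not the
actual hardware interface: the struct and function names are invented,
the "DMA" is just a pointer copy, and step 5.4 (the
prepare-the-next-page interrupt) is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>

#define PKTS_PER_PAGE 4

/* Device-side state for the ready-bit/counter protocol (steps 4-8). */
struct device {
    bool page_ready;      /* the bit set by the driver's interrupt   */
    int  counter;         /* packets still to store in current page  */
    unsigned long page;   /* physical address latched from the host  */
    bool sender_paused;
};

/* Host-side state: the known, fixed location of step 2. */
struct host {
    unsigned long posted_page;
};

/* Steps 2-4: driver posts a pinned page and raises the ready bit. */
static void driver_post_page(struct host *h, struct device *d,
                             unsigned long phys)
{
    h->posted_page = phys;
    d->page_ready = true;
}

/* Steps 5-8 for one received packet; returns true when a full page
 * has been "DMAed" to the host, i.e. the driver now owes a new page. */
static bool device_rx_packet(struct host *h, struct device *d)
{
    if (d->counter == 0) {
        if (!d->page_ready) {        /* step 8.2: nowhere to put data */
            d->sender_paused = true;
            return false;
        }
        d->counter = PKTS_PER_PAGE;  /* step 5.1 */
        d->page = h->posted_page;    /* step 5.2: latch page address  */
        d->page_ready = false;       /* step 5.3 */
        d->sender_paused = false;    /* step 5.5 */
    }
    if (--d->counter == 0)
        return true;                 /* step 8.1: flush ring to page  */
    return false;
}
```

What the model makes visible is the failure mode Evgeniy warned about:
if the driver has not reposted a page by the time the counter hits
zero, the only safe behaviour is to pause the sender, not to hope the
kernel "had enough time".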
The above description could be "enhanced" by having two or more
available pages, so that it could rapidly switch. The only requirement
for a sustainable system would be that I could allocate and make safe
pages at an average rate equal to or faster than the rate at which
they are consumed. The xon/xoff-type control would then simply be a
way of guaranteeing that the network device didn't run out of places
to put things.

The people working on the hardware have said that they can handle the
hardware side of my description, but I want to make sure that (a) this
will actually work the way I expect, and (b) there isn't something
staring me in the face that's a billion times easier and a trillion
times more efficient.

What I'm looking for is every argument that can possibly be thrown
against this method - latency, throughput, accepted standards for
Linux drivers, excessive weirdness, whatever.

> > Lastly, assuming my sanity lasts that long, would I be correct in
> > assuming that the first step in the process of getting the driver
> > peer-reviewed and accepted would be to post the patches here?
>
> Actually not, the first step in that process is learning the jig
> dance and of course providing enough beer and other goodies to the
> network maintainers.

I tried doing a jig dance once, but the saw cut my shoes in half.

I can try getting beer. Oregon has some acceptable microbreweries, but
I miss having a pint of Hatters in England. Mead is easier. I brew
mead. Strong, dry, rocket-fuel mead.

> > Thanks for any help,
>
> No problem, but to answer at least one of your questions, more
> information should be provided.

Hopefully what's there now is sufficient, though I'd be happy to add
more if need be.
* Re: Questions regarding network drivers
  2006-11-10 17:34 ` Jonathan Day
@ 2006-11-10 17:55   ` Stephen Hemminger
  2006-11-11 11:54   ` Evgeniy Polyakov
  1 sibling, 0 replies; 7+ messages in thread

From: Stephen Hemminger @ 2006-11-10 17:55 UTC (permalink / raw)
To: Jonathan Day; +Cc: Evgeniy Polyakov, netdev

On Fri, 10 Nov 2006 09:34:33 -0800 (PST)
Jonathan Day <imipak@yahoo.com> wrote:

> --- Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > What do you mean?
> > DMA from NIC memory into CPU memory?
>
> Yes. I want to bypass the kernel altogether and think there may be a
> way to do this, but I want to make very certain that I'm not going
> down the wrong track.

This is not normally possible, because network data could be intended
for multiple sockets. It is not possible to know where to DMA the data
until you have done all the necessary protocol demultiplexing and
filtering. There has been talk of having really smart hardware to
help.

> The underlying problem is this. The group I'm working with is
> messing about with building their own networking device that will
> run at a speed equal to that of the bus leading to the host (2.5
> gigabits/second). The device has its own DMA controller and can
> operate as bus master.
>
> It's my task to figure out how to get the data into the host at
> near-100% bandwidth without dropping anything, with minimal latency
> and real-time characteristics.

You might want to look at the RDMA/InfiniBand model.
* Re: Questions regarding network drivers
  2006-11-10 17:34 ` Jonathan Day
  2006-11-10 17:55   ` Stephen Hemminger
@ 2006-11-11 11:54   ` Evgeniy Polyakov
  2006-11-11 23:21     ` Jonathan Day
  1 sibling, 1 reply; 7+ messages in thread

From: Evgeniy Polyakov @ 2006-11-11 11:54 UTC (permalink / raw)
To: Jonathan Day; +Cc: netdev

On Fri, Nov 10, 2006 at 09:34:33AM -0800, Jonathan Day (imipak@yahoo.com) wrote:
> > > First of all, is it possible (and/or "reasonable practice") when
> > > developing a network driver to do zero-copy transfers between
> > > main memory and the network device?
> >
> > What do you mean?
> > DMA from NIC memory into CPU memory?
>
> Yes. I want to bypass the kernel altogether and think there may be a
> way to do this, but I want to make very certain that I'm not going
> down the wrong track.
>
> The underlying problem is this. The group I'm working with is
> messing about with building their own networking device that will
> run at a speed equal to that of the bus leading to the host (2.5
> gigabits/second). The device has its own DMA controller and can
> operate as bus master.
>
> It's my task to figure out how to get the data into the host at
> near-100% bandwidth without dropping anything, with minimal latency
> and real-time characteristics.

<advertisement>
You can use netchannels, which were designed for exactly that kind of
load.
</advertisement>

You need to process some headers anyway - to select the appropriate
socket or netchannel or whatever.

... driver's internals were skipped ...

> The idea is that we're DMAing to a page that we know is safe,
> because the driver has ensured that before telling the network
> device where it is. (i.e., the driver ensures that the page actually
> does exist, has been fully set up in the VMM, and isn't going to
> move in physical memory or into swap.)

It does not differ from the usual behaviour, actually, except that you
need to set up a different skb for each packet.

So you can preallocate the required number of skbs, allocate a special
page and set up its pointers appropriately - this will require you to
change some logic behind frag_list. For example, you can increase the
page's reference counter to the number of packets in that page and set
up each skb's frag_list to point to a different packet inside that
page. Each skb->data must contain some headers, though, so you will
need to copy parts of the packets.

> The above description could be "enhanced" by having two or more
> available pages, so that it could rapidly switch. The only
> requirement for a sustainable system would be that I could allocate
> and make safe pages at an average rate equal to or faster than the
> rate at which they are consumed. The xon/xoff-type control would
> then simply be a way of guaranteeing that the network device didn't
> run out of places to put things.
>
> The people working on the hardware have said that they can handle
> the hardware side of my description, but I want to make sure that
> (a) this will actually work the way I expect, and (b) there isn't
> something staring me in the face that's a billion times easier and a
> trillion times more efficient.
>
> What I'm looking for is every argument that can possibly be thrown
> against this method - latency, throughput, accepted standards for
> Linux drivers, excessive weirdness, whatever.

Excessive weirdness is indeed the case. If I understood you correctly,
the hardware is capable of DMAing several packets one-by-one into a
given buffer, and you want to take advantage of that?

> > > Lastly, assuming my sanity lasts that long, would I be correct
> > > in assuming that the first step in the process of getting the
> > > driver peer-reviewed and accepted would be to post the patches
> > > here?
> >
> > Actually not, the first step in that process is learning the jig
> > dance and of course providing enough beer and other goodies to the
> > network maintainers.
> I tried doing a jig dance once, but the saw cut my shoes in half.
>
> I can try getting beer. Oregon has some acceptable microbreweries,
> but I miss having a pint of Hatters in England. Mead is easier. I
> brew mead. Strong, dry, rocket-fuel mead.

Then you definitely just need to send your driver to netdev@.

-- 
	Evgeniy Polyakov
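The page-sharing scheme Evgeniy describes - one DMA page, a reference
count equal to the number of packets in it, and one descriptor per
packet pointing into the page - can be modeled in user space. This is
a hedged sketch with invented names (`shared_page`, `pkt_free`); a
real driver would use `get_page()`/`put_page()` and skb page
fragments rather than raw malloc/free.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_BYTES    4096
#define PKT_BYTES     256
#define PKTS_PER_PAGE (PAGE_BYTES / PKT_BYTES)

/* One DMA page shared by many packet descriptors; the page is freed
 * only when the last descriptor referencing it is released. */
struct shared_page {
    char *mem;
    int refcnt;
};

struct pkt {
    struct shared_page *page;
    char *data;                 /* points into page->mem */
};

static struct shared_page *page_alloc_shared(void)
{
    struct shared_page *p = malloc(sizeof(*p));
    p->mem = malloc(PAGE_BYTES);
    p->refcnt = PKTS_PER_PAGE;  /* one reference per packet slot */
    return p;
}

/* Attach a descriptor to packet slot `slot` of the shared page. */
static void pkt_init(struct pkt *k, struct shared_page *p, int slot)
{
    k->page = p;
    k->data = p->mem + slot * PKT_BYTES;
}

/* Release one packet; returns 1 if this freed the underlying page. */
static int pkt_free(struct pkt *k)
{
    if (--k->page->refcnt == 0) {
        free(k->page->mem);
        free(k->page);
        return 1;
    }
    return 0;
}
```

The design choice this illustrates: no matter in what order the stack
consumes the individual packets, the page's lifetime is tied to the
slowest one, so the single bulk DMA stays safe.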
* Re: Questions regarding network drivers
  2006-11-11 11:54 ` Evgeniy Polyakov
@ 2006-11-11 23:21   ` Jonathan Day
  2006-11-13  9:05     ` Evgeniy Polyakov
  0 siblings, 1 reply; 7+ messages in thread

From: Jonathan Day @ 2006-11-11 23:21 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev

--- Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> You can use netchannels, which were designed for exactly that kind
> of load.

Actually, netchannels are a mechanism I've been looking at intensely,
as a way to simplify this and keep it sane, without losing
performance.

> You need to process some headers anyway - to select the appropriate
> socket or netchannel or whatever.

I was planning on cheating there. I've revised the method slightly,
based on previous comments, so that I'm now doing the following:

The master table has a set of pointers to tables. One group, one
table - for groups in use. Unused groups have no associated table and
the pointer is null. Pointers are always at (group number * pointer
size) from the start of the table. There's a maximum of 4096 groups in
this system, or the hardware guys start shooting.

Each group has a table that contains a set of pointers to buffers.
The offset is always (node number * pointer size). Nodes that are
active members have pointers to pinned, allocated and cleaned buffers.
Nodes that aren't members get null pointers. There are 32 nodes in the
current system.

The packet header contains the group address at a fixed offset and the
source node number at another fixed offset. The hardware can then grab
these and do the necessary calculations. The hardware has a DMA
controller which can get the group pointer from the master table, then
get the buffer pointer from the group table.

(Essentially, I'm mimicking a primitive virtual memory system here.)

The hardware then stores enough packet payloads to fill one page of
memory (unless it gets an end-of-transmit signal before then) and then
DMAs the page of data into the physical page in one shot.
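The two-level lookup described above - master table indexed by group,
group table indexed by node - can be sketched as follows. This is an
illustrative user-space model; the names (`group_table`, `lookup_buf`)
are mine, and in the real design both tables would live in pinned host
memory so the device's DMA engine can walk them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_GROUPS 4096
#define MAX_NODES  32

/* One table per in-use group: a pinned buffer pointer per node.
 * A NULL entry means the node is not a member of the group. */
struct group_table {
    void *node_buf[MAX_NODES];
};

/* Master table: one pointer per group; NULL means group unused. */
static struct group_table *master[MAX_GROUPS];

/* What the device's DMA engine would do with the group address and
 * source node number pulled from the packet header: two dependent
 * pointer fetches, yielding the destination buffer (or NULL). */
static void *lookup_buf(unsigned group, unsigned node)
{
    if (group >= MAX_GROUPS || node >= MAX_NODES)
        return NULL;
    struct group_table *g = master[group];
    return g ? g->node_buf[node] : NULL;
}
```

Like the page tables it mimics, the scheme trades one extra dependent
memory fetch for the ability to keep the master table small and the
per-group tables sparse.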
(Since I did the original design, I worked out that there's too much
overhead in setting up a transmit to just do a single packet at a
time.) The hardware is apparently smart enough to re-order packets if
they come in out of sequence and to NACK anything that's missing.

Once a page has been filled, the driver simply pushes the pointer onto
a queue and then grabs a pointer to a fresh page.

Based on what I'm hearing, I'll probably finish this driver simply to
see what it does and post it up - it'll have educational value,
whatever happens. I'll also work on a netchannels version which is
less hardware-specific and more "normal".

> > I can try getting beer. Oregon has some acceptable microbreweries,
> > but I miss having a pint of Hatters in England. Mead is easier. I
> > brew mead. Strong, dry, rocket-fuel mead.
>
> Then you definitely just need to send your driver to netdev@.

Sounds good to me. Liquid driver review aids will be prepared.
* Re: Questions regarding network drivers
  2006-11-11 23:21 ` Jonathan Day
@ 2006-11-13  9:05   ` Evgeniy Polyakov
  0 siblings, 0 replies; 7+ messages in thread

From: Evgeniy Polyakov @ 2006-11-13 9:05 UTC (permalink / raw)
To: Jonathan Day; +Cc: netdev

On Sat, Nov 11, 2006 at 03:21:17PM -0800, Jonathan Day (imipak@yahoo.com) wrote:
> Actually, netchannels are a mechanism I've been looking at
> intensely, as a way to simplify this and keep it sane, without
> losing performance.
>
> I was planning on cheating there. I've revised the method slightly,
> based on previous comments, so that I'm now doing the following:
>
> The master table has a set of pointers to tables. One group, one
> table - for groups in use. Unused groups have no associated table
> and the pointer is null. Pointers are always at (group number *
> pointer size) from the start of the table. There's a maximum of 4096
> groups in this system, or the hardware guys start shooting.
>
> Each group has a table that contains a set of pointers to buffers.
> The offset is always (node number * pointer size). Nodes that are
> active members have pointers to pinned, allocated and cleaned
> buffers. Nodes that aren't members get null pointers. There are 32
> nodes in the current system.
>
> The packet header contains the group address at a fixed offset and
> the source node number at another fixed offset. The hardware can
> then grab these and do the necessary calculations. The hardware has
> a DMA controller which can get the group pointer from the master
> table, then get the buffer pointer from the group table.
>
> (Essentially, I'm mimicking a primitive virtual memory system here.)

Do you have any benchmarks of that system? It looks quite complex as a
way of avoiding some problems...

> The hardware then stores enough packet payloads to fill one page of
> memory (unless it gets an end-of-transmit signal before then) and
> then DMAs the page of data into the physical page in one shot.
> (Since I did the original design, I worked out that there's too much
> overhead in setting up a transmit to just do a single packet at a
> time.) The hardware is apparently smart enough to re-order packets
> if they come in out of sequence and to NACK anything that's missing.

I.e., it supports TCP sequence number checks? In that case you store a
set of flows in hardware; what happens when it overflows?

> Once a page has been filled, the driver simply pushes the pointer
> onto a queue and then grabs a pointer to a fresh page.
>
> Based on what I'm hearing, I'll probably finish this driver simply
> to see what it does and post it up - it'll have educational value,
> whatever happens. I'll also work on a netchannels version which is
> less hardware-specific and more "normal".

That would be great.

> > > I can try getting beer. Oregon has some acceptable
> > > microbreweries, but I miss having a pint of Hatters in England.
> > > Mead is easier. I brew mead. Strong, dry, rocket-fuel mead.
> >
> > Then you definitely just need to send your driver to netdev@.
>
> Sounds good to me. Liquid driver review aids will be prepared.

That's good - send your driver and some kind of benchmarks for
throughput and latency tests.

-- 
	Evgeniy Polyakov
end of thread, other threads: [~2006-11-13 9:05 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-10  3:06 Questions regarding network drivers Jonathan Day
2006-11-10 10:00 ` Evgeniy Polyakov
2006-11-10 17:34   ` Jonathan Day
2006-11-10 17:55     ` Stephen Hemminger
2006-11-11 11:54     ` Evgeniy Polyakov
2006-11-11 23:21       ` Jonathan Day
2006-11-13  9:05         ` Evgeniy Polyakov