* Questions regarding network drivers
@ 2006-11-10  3:06 Jonathan Day
  2006-11-10 10:00 ` Evgeniy Polyakov
  0 siblings, 1 reply; 7+ messages in thread

From: Jonathan Day @ 2006-11-10 3:06 UTC (permalink / raw)
To: netdev

Hi,

I've got an interesting problem to contend with and need some advice
from the great wise ones here.

First of all, is it possible (and/or "reasonable practice") when
developing a network driver to do zero-copy transfers between main
memory and the network device?

Secondly, the network device is only designed to work with short
packets and I really want to keep the throughput up. My thought was
that if I fired off an interrupt and then transferred a page of data
into an area I know is safe, the kernel would have enough time to find
a new safe area and post its address before the next page is ready to
send.

Can anyone suggest why this wouldn't work or, assuming it can work,
why this would be a Bad Idea?

Lastly, assuming my sanity lasts that long, would I be correct in
assuming that the first step in the process of getting the driver
peer-reviewed and accepted would be to post the patches here?

Thanks for any help,

Jonathan Day

^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: Questions regarding network drivers
  2006-11-10  3:06 Questions regarding network drivers Jonathan Day
@ 2006-11-10 10:00 ` Evgeniy Polyakov
  2006-11-10 17:34   ` Jonathan Day
  0 siblings, 1 reply; 7+ messages in thread

From: Evgeniy Polyakov @ 2006-11-10 10:00 UTC (permalink / raw)
To: Jonathan Day; +Cc: netdev

On Thu, Nov 09, 2006 at 07:06:00PM -0800, Jonathan Day (imipak@yahoo.com) wrote:
> Hi,
>
> I've got an interesting problem to contend with and need some advice
> from the great wise ones here.
>
> First of all, is it possible (and/or "reasonable practice") when
> developing a network driver to do zero-copy transfers between main
> memory and the network device?

What do you mean?
DMA from NIC memory into CPU memory?

> Secondly, the network device is only designed to work with short
> packets and I really want to keep the throughput up. My thought was
> that if I fired off an interrupt and then transferred a page of data
> into an area I know is safe, the kernel would have enough time to
> find a new safe area and post its address before the next page is
> ready to send.
>
> Can anyone suggest why this wouldn't work or, assuming it can work,
> why this would be a Bad Idea?

There should not be any kind of "the kernel will have enough time to
do something"; instead, you must guarantee that there will not be any
kind of races. You can either preallocate several buffers or allocate
them on demand in interrupts.

> Lastly, assuming my sanity lasts that long, would I be correct in
> assuming that the first step in the process of getting the driver
> peer-reviewed and accepted would be to post the patches here?

Actually not, the first step in that process is learning the jig dance
and of course providing enough beer and other goodies to the network
maintainers.

> Thanks for any help,

No problem, but to answer at least one of your questions, more
information should be provided.

> Jonathan Day

-- 
	Evgeniy Polyakov
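The "preallocate several buffers" option Evgeniy mentions can be
sketched in user-space C. This is only an illustrative model, not
kernel code; all of the names (`buf_pool`, `pool_refill`, `pool_pop`)
are invented for the example. The point it demonstrates is that the
interrupt-path operation never allocates: it only pops from a pool
that process context keeps topped up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define POOL_SIZE 8
#define BUF_BYTES 2048

/* A trivial preallocated buffer pool: the "interrupt" path only pops,
 * so it never has to allocate; refilling happens in process context. */
struct buf_pool {
    void *bufs[POOL_SIZE];
    int count;
};

/* Process-context refill: may sleep/allocate; runs outside the hot path. */
static int pool_refill(struct buf_pool *p)
{
    while (p->count < POOL_SIZE) {
        void *b = malloc(BUF_BYTES);
        if (!b)
            return -1;          /* a partial refill is still usable */
        p->bufs[p->count++] = b;
    }
    return 0;
}

/* Called from the (simulated) interrupt handler: O(1), no allocation. */
static void *pool_pop(struct buf_pool *p)
{
    return p->count ? p->bufs[--p->count] : NULL;
}
```

A real driver would use DMA-mapped buffers and guard the pool with the
appropriate locking, but the race Evgeniy warns about is avoided the
same way: the device is only ever told about buffers that already
exist.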
* Re: Questions regarding network drivers
  2006-11-10 10:00 ` Evgeniy Polyakov
@ 2006-11-10 17:34   ` Jonathan Day
  2006-11-10 17:55     ` Stephen Hemminger
  2006-11-11 11:54     ` Evgeniy Polyakov
  0 siblings, 2 replies; 7+ messages in thread

From: Jonathan Day @ 2006-11-10 17:34 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev

--- Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> On Thu, Nov 09, 2006 at 07:06:00PM -0800, Jonathan Day (imipak@yahoo.com) wrote:
> > First of all, is it possible (and/or "reasonable practice") when
> > developing a network driver to do zero-copy transfers between main
> > memory and the network device?
>
> What do you mean?
> DMA from NIC memory into CPU memory?

Yes. I want to bypass the kernel altogether and think there may be a
way to do this, but I want to make very certain that I'm not going
down the wrong track.

The underlying problem is this. The group I'm working with is messing
about with building their own networking device that will run at a
speed equal to that of the bus leading to the host (2.5
gigabits/second). The device has its own DMA controller and can
operate as bus master.

It's my task to figure out how to get the data into the host at
near-100% bandwidth without dropping anything, with minimal latency
and real-time characteristics. (I talked them out of making me do this
blindfolded, but on further consideration, I'm not sure if this was a
good idea.)

> > Secondly, the network device is only designed to work with short
> > packets and I really want to keep the throughput up. My thought
> > was that if I fired off an interrupt and then transferred a page
> > of data into an area I know is safe, the kernel would have enough
> > time to find a new safe area and post its address before the next
> > page is ready to send.
> >
> > Can anyone suggest why this wouldn't work or, assuming it can
> > work, why this would be a Bad Idea?
>
> There should not be any kind of "the kernel will have enough time to
> do something"; instead, you must guarantee that there will not be
> any kind of races. You can either preallocate several buffers or
> allocate them on demand in interrupts.

The exact process I was thinking of is as follows:

1. The driver sets up a full page and pins it.
2. The driver obtains the physical address and places that address
   into a known, fixed location.
3. The driver sends an interrupt to the network device to say that
   everything is ready.
4. On receiving the interrupt, a bit is set to true on the network
   device, to say that the host is ready to receive.
5. The network device has a counter for the number of packets that
   can be put in one page. If this number is zero and the bit is set,
   then:
   5.1. The counter is set to the maximum number of packets storable
        in the page.
   5.2. The page address is DMAed out of the known location in host
        memory and placed in network device memory.
   5.3. The bit is cleared.
   5.4. The driver is given an interrupt to tell it to prepare the
        next page.
   5.5. If the sender had previously been told to stop transmitting,
        it is now told to continue.
6. Every received packet is placed in a ring buffer on the network
   device.
7. The counter is reduced by 1.
8. If the counter reaches zero, OR the packet is followed by an
   end-of-transmission notification:
   8.1. The ring buffer is DMAed en masse into host memory at the
        location given in the previously cached page address.
   8.2. If the new-page bit has not been set to true, the network
        device notifies the sender to pause.

The idea is that we're DMAing to a page that we know is safe, because
the driver has ensured that before telling the network device where it
is. (i.e., the driver ensures that the page actually does exist, has
been fully set up in the VMM, and isn't going to move in physical
memory or into swap.)
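The handshake in the steps above can be modeled as a small state
machine. This is a user-space sketch under my own assumptions, not the
actual hardware interface: the struct and function names are invented,
the "DMA" is just a pointer copy, and step 5.4 (the
prepare-the-next-page interrupt) is omitted for brevity.

```c
#include <assert.h>
#include <stdbool.h>

#define PKTS_PER_PAGE 4

/* Device-side state for the ready-bit/counter protocol (steps 4-8). */
struct device {
    bool page_ready;      /* the bit set by the driver's interrupt   */
    int  counter;         /* packets still to store in current page  */
    unsigned long page;   /* physical address latched from the host  */
    bool sender_paused;
};

/* Host-side state: the known, fixed location of step 2. */
struct host {
    unsigned long posted_page;
};

/* Steps 2-4: driver posts a pinned page and raises the ready bit. */
static void driver_post_page(struct host *h, struct device *d,
                             unsigned long phys)
{
    h->posted_page = phys;
    d->page_ready = true;
}

/* Steps 5-8 for one received packet; returns true when a full page
 * has been "DMAed" to the host, i.e. the driver now owes a new page. */
static bool device_rx_packet(struct host *h, struct device *d)
{
    if (d->counter == 0) {
        if (!d->page_ready) {        /* step 8.2: nowhere to put data */
            d->sender_paused = true;
            return false;
        }
        d->counter = PKTS_PER_PAGE;  /* step 5.1 */
        d->page = h->posted_page;    /* step 5.2: latch page address  */
        d->page_ready = false;       /* step 5.3 */
        d->sender_paused = false;    /* step 5.5 */
    }
    if (--d->counter == 0)
        return true;                 /* step 8.1: flush ring to page  */
    return false;
}
```

What the model makes visible is the failure mode Evgeniy warned about:
if the driver has not reposted a page by the time the counter hits
zero, the only safe behaviour is to pause the sender, not to hope the
kernel "had enough time".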
The above description could be "enhanced" by having two or more
available pages, so that it could rapidly switch. The only requirement
for a sustainable system would be that I could allocate and make safe
pages at an average rate equal to or faster than the rate at which
they are consumed. The xon/xoff-type control would then simply be a
way of guaranteeing that the network device didn't run out of places
to put things.

The people working on the hardware have said that they can handle the
hardware side of my description, but I want to make sure that (a) this
will actually work the way I expect, and (b) there isn't something
staring me in the face that's a billion times easier and a trillion
times more efficient.

What I'm looking for is every argument that can possibly be thrown
against this method - latency, throughput, accepted standards for
Linux drivers, excessive weirdness, whatever.

> > Lastly, assuming my sanity lasts that long, would I be correct in
> > assuming that the first step in the process of getting the driver
> > peer-reviewed and accepted would be to post the patches here?
>
> Actually not, the first step in that process is learning the jig
> dance and of course providing enough beer and other goodies to the
> network maintainers.

I tried doing a jig dance once, but the saw cut my shoes in half.

I can try getting beer. Oregon has some acceptable microbreweries, but
I miss having a pint of Hatters in England. Mead is easier. I brew
mead. Strong, dry, rocket-fuel mead.

> > Thanks for any help,
>
> No problem, but to answer at least one of your questions, more
> information should be provided.

Hopefully what's there now is sufficient, though I'd be happy to add
more if need be.
* Re: Questions regarding network drivers
  2006-11-10 17:34 ` Jonathan Day
@ 2006-11-10 17:55   ` Stephen Hemminger
  2006-11-11 11:54   ` Evgeniy Polyakov
  1 sibling, 0 replies; 7+ messages in thread

From: Stephen Hemminger @ 2006-11-10 17:55 UTC (permalink / raw)
To: Jonathan Day; +Cc: Evgeniy Polyakov, netdev

On Fri, 10 Nov 2006 09:34:33 -0800 (PST)
Jonathan Day <imipak@yahoo.com> wrote:

> --- Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> > What do you mean?
> > DMA from NIC memory into CPU memory?
>
> Yes. I want to bypass the kernel altogether and think there may be a
> way to do this, but I want to make very certain that I'm not going
> down the wrong track.

This is not normally possible, because network data could be intended
for multiple sockets. It is not possible to know where to DMA the data
until you have done all the necessary protocol demultiplexing and
filtering. There has been talk of having really smart hardware to
help.

> The underlying problem is this. The group I'm working with is
> messing about with building their own networking device that will
> run at a speed equal to that of the bus leading to the host (2.5
> gigabits/second). The device has its own DMA controller and can
> operate as bus master.
>
> It's my task to figure out how to get the data into the host at
> near-100% bandwidth without dropping anything, with minimal latency
> and real-time characteristics.

You might want to look at the RDMA/InfiniBand model.
* Re: Questions regarding network drivers
  2006-11-10 17:34 ` Jonathan Day
  2006-11-10 17:55   ` Stephen Hemminger
@ 2006-11-11 11:54   ` Evgeniy Polyakov
  2006-11-11 23:21     ` Jonathan Day
  1 sibling, 1 reply; 7+ messages in thread

From: Evgeniy Polyakov @ 2006-11-11 11:54 UTC (permalink / raw)
To: Jonathan Day; +Cc: netdev

On Fri, Nov 10, 2006 at 09:34:33AM -0800, Jonathan Day (imipak@yahoo.com) wrote:
> > > First of all, is it possible (and/or "reasonable practice") when
> > > developing a network driver to do zero-copy transfers between
> > > main memory and the network device?
> >
> > What do you mean?
> > DMA from NIC memory into CPU memory?
>
> Yes. I want to bypass the kernel altogether and think there may be a
> way to do this, but I want to make very certain that I'm not going
> down the wrong track.
>
> The underlying problem is this. The group I'm working with is
> messing about with building their own networking device that will
> run at a speed equal to that of the bus leading to the host (2.5
> gigabits/second). The device has its own DMA controller and can
> operate as bus master.
>
> It's my task to figure out how to get the data into the host at
> near-100% bandwidth without dropping anything, with minimal latency
> and real-time characteristics.

<advertisement>
You can use netchannels, which were designed for exactly that kind of
load.
</advertisement>

You need to process some headers anyway - to select the appropriate
socket or netchannel or whatever.

... driver's internals were skipped ...

> The idea is that we're DMAing to a page that we know is safe,
> because the driver has ensured that before telling the network
> device where it is. (i.e., the driver ensures that the page actually
> does exist, has been fully set up in the VMM, and isn't going to
> move in physical memory or into swap.)

It does not differ from the usual behaviour, actually, except that you
need to set up a different skb for each packet.

So you can preallocate the required number of skbs, allocate a special
page and set up its pointers appropriately - this will require you to
change some logic behind frag_list. For example, you can increase the
page's reference counter to the number of packets in that page and set
up each skb's frag_list to point to a different packet inside that
page. Each skb->data must contain some headers, though, so you will
need to copy parts of the packets.

> The above description could be "enhanced" by having two or more
> available pages, so that it could rapidly switch. The only
> requirement for a sustainable system would be that I could allocate
> and make safe pages at an average rate equal to or faster than the
> rate at which they are consumed. The xon/xoff-type control would
> then simply be a way of guaranteeing that the network device didn't
> run out of places to put things.
>
> The people working on the hardware have said that they can handle
> the hardware side of my description, but I want to make sure that
> (a) this will actually work the way I expect, and (b) there isn't
> something staring me in the face that's a billion times easier and a
> trillion times more efficient.
>
> What I'm looking for is every argument that can possibly be thrown
> against this method - latency, throughput, accepted standards for
> Linux drivers, excessive weirdness, whatever.

Excessive weirdness is indeed the case. If I understood you correctly,
the hardware is capable of DMAing several packets one-by-one into a
given buffer, and you want to take advantage of that?

> > > Lastly, assuming my sanity lasts that long, would I be correct
> > > in assuming that the first step in the process of getting the
> > > driver peer-reviewed and accepted would be to post the patches
> > > here?
> >
> > Actually not, the first step in that process is learning the jig
> > dance and of course providing enough beer and other goodies to the
> > network maintainers.
> I tried doing a jig dance once, but the saw cut my shoes in half.
>
> I can try getting beer. Oregon has some acceptable microbreweries,
> but I miss having a pint of Hatters in England. Mead is easier. I
> brew mead. Strong, dry, rocket-fuel mead.

Then you definitely just need to send your driver to netdev@.

-- 
	Evgeniy Polyakov
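The page-sharing scheme Evgeniy describes - one DMA page, a reference
count equal to the number of packets in it, and one descriptor per
packet pointing into the page - can be modeled in user space. This is
a hedged sketch with invented names (`shared_page`, `pkt_free`); a
real driver would use `get_page()`/`put_page()` and skb page
fragments rather than raw malloc/free.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_BYTES    4096
#define PKT_BYTES     256
#define PKTS_PER_PAGE (PAGE_BYTES / PKT_BYTES)

/* One DMA page shared by many packet descriptors; the page is freed
 * only when the last descriptor referencing it is released. */
struct shared_page {
    char *mem;
    int refcnt;
};

struct pkt {
    struct shared_page *page;
    char *data;                 /* points into page->mem */
};

static struct shared_page *page_alloc_shared(void)
{
    struct shared_page *p = malloc(sizeof(*p));
    p->mem = malloc(PAGE_BYTES);
    p->refcnt = PKTS_PER_PAGE;  /* one reference per packet slot */
    return p;
}

/* Attach a descriptor to packet slot `slot` of the shared page. */
static void pkt_init(struct pkt *k, struct shared_page *p, int slot)
{
    k->page = p;
    k->data = p->mem + slot * PKT_BYTES;
}

/* Release one packet; returns 1 if this freed the underlying page. */
static int pkt_free(struct pkt *k)
{
    if (--k->page->refcnt == 0) {
        free(k->page->mem);
        free(k->page);
        return 1;
    }
    return 0;
}
```

The design choice this illustrates: no matter in what order the stack
consumes the individual packets, the page's lifetime is tied to the
slowest one, so the single bulk DMA stays safe.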
* Re: Questions regarding network drivers
  2006-11-11 11:54 ` Evgeniy Polyakov
@ 2006-11-11 23:21   ` Jonathan Day
  2006-11-13  9:05     ` Evgeniy Polyakov
  0 siblings, 1 reply; 7+ messages in thread

From: Jonathan Day @ 2006-11-11 23:21 UTC (permalink / raw)
To: Evgeniy Polyakov; +Cc: netdev

--- Evgeniy Polyakov <johnpol@2ka.mipt.ru> wrote:
> You can use netchannels, which were designed for exactly that kind
> of load.

Actually, netchannels are a mechanism I've been looking at intensely,
as a way to simplify this and keep it sane, without losing
performance.

> You need to process some headers anyway - to select the appropriate
> socket or netchannel or whatever.

I was planning on cheating there. I've revised the method slightly,
based on previous comments, so that I'm now doing the following:

The master table has a set of pointers to tables. One group, one
table - for groups in use. Unused groups have no associated table and
the pointer is null. Pointers are always at (group number * pointer
size) from the start of the table. There's a maximum of 4096 groups in
this system, or the hardware guys start shooting.

Each group has a table that contains a set of pointers to buffers.
The offset is always (node number * pointer size). Nodes that are
active members have pointers to pinned, allocated and cleaned buffers.
Nodes that aren't members get null pointers. There are 32 nodes in the
current system.

The packet header contains the group address at a fixed offset and the
source node number at another fixed offset. The hardware can then grab
these and do the necessary calculations. The hardware has a DMA
controller which can get the group pointer from the master table, then
get the buffer pointer from the group table.

(Essentially, I'm mimicking a primitive virtual memory system here.)

The hardware then stores enough packet payloads to fill one page of
memory (unless it gets an end-of-transmit signal before then) and then
DMAs the page of data into the physical page in one shot.
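The two-level lookup described above - master table indexed by group,
group table indexed by node - can be sketched as follows. This is an
illustrative user-space model; the names (`group_table`, `lookup_buf`)
are mine, and in the real design both tables would live in pinned host
memory so the device's DMA engine can walk them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_GROUPS 4096
#define MAX_NODES  32

/* One table per in-use group: a pinned buffer pointer per node.
 * A NULL entry means the node is not a member of the group. */
struct group_table {
    void *node_buf[MAX_NODES];
};

/* Master table: one pointer per group; NULL means group unused. */
static struct group_table *master[MAX_GROUPS];

/* What the device's DMA engine would do with the group address and
 * source node number pulled from the packet header: two dependent
 * pointer fetches, yielding the destination buffer (or NULL). */
static void *lookup_buf(unsigned group, unsigned node)
{
    if (group >= MAX_GROUPS || node >= MAX_NODES)
        return NULL;
    struct group_table *g = master[group];
    return g ? g->node_buf[node] : NULL;
}
```

Like the page tables it mimics, the scheme trades one extra dependent
memory fetch for the ability to keep the master table small and the
per-group tables sparse.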
(Since I did the original design, I worked out that there's too much
overhead in setting up a transmit to just do a single packet at a
time.) The hardware is apparently smart enough to re-order packets if
they come in out of sequence and to NACK anything that's missing.

Once a page has been filled, the driver simply pushes the pointer onto
a queue and then grabs a pointer to a fresh page.

Based on what I'm hearing, I'll probably finish this driver simply to
see what it does and post it up - it'll have educational value,
whatever happens. I'll also work on a netchannels version which is
less hardware-specific and more "normal".

> > I can try getting beer. Oregon has some acceptable microbreweries,
> > but I miss having a pint of Hatters in England. Mead is easier. I
> > brew mead. Strong, dry, rocket-fuel mead.
>
> Then you definitely just need to send your driver to netdev@.

Sounds good to me. Liquid driver review aids will be prepared.
* Re: Questions regarding network drivers
  2006-11-11 23:21 ` Jonathan Day
@ 2006-11-13  9:05   ` Evgeniy Polyakov
  0 siblings, 0 replies; 7+ messages in thread

From: Evgeniy Polyakov @ 2006-11-13 9:05 UTC (permalink / raw)
To: Jonathan Day; +Cc: netdev

On Sat, Nov 11, 2006 at 03:21:17PM -0800, Jonathan Day (imipak@yahoo.com) wrote:
> Actually, netchannels are a mechanism I've been looking at
> intensely, as a way to simplify this and keep it sane, without
> losing performance.
>
> I was planning on cheating there. I've revised the method slightly,
> based on previous comments, so that I'm now doing the following:
>
> The master table has a set of pointers to tables. One group, one
> table - for groups in use. Unused groups have no associated table
> and the pointer is null. Pointers are always at (group number *
> pointer size) from the start of the table. There's a maximum of 4096
> groups in this system, or the hardware guys start shooting.
>
> Each group has a table that contains a set of pointers to buffers.
> The offset is always (node number * pointer size). Nodes that are
> active members have pointers to pinned, allocated and cleaned
> buffers. Nodes that aren't members get null pointers. There are 32
> nodes in the current system.
>
> The packet header contains the group address at a fixed offset and
> the source node number at another fixed offset. The hardware can
> then grab these and do the necessary calculations. The hardware has
> a DMA controller which can get the group pointer from the master
> table, then get the buffer pointer from the group table.
>
> (Essentially, I'm mimicking a primitive virtual memory system here.)

Do you have any benchmarks of that system? It looks quite complex as a
way of avoiding some problems...

> The hardware then stores enough packet payloads to fill one page of
> memory (unless it gets an end-of-transmit signal before then) and
> then DMAs the page of data into the physical page in one shot.
> (Since I did the original design, I worked out that there's too much
> overhead in setting up a transmit to just do a single packet at a
> time.) The hardware is apparently smart enough to re-order packets
> if they come in out of sequence and to NACK anything that's missing.

I.e., it supports TCP sequence number checks? In that case you store a
set of flows in hardware; what happens when it overflows?

> Once a page has been filled, the driver simply pushes the pointer
> onto a queue and then grabs a pointer to a fresh page.
>
> Based on what I'm hearing, I'll probably finish this driver simply
> to see what it does and post it up - it'll have educational value,
> whatever happens. I'll also work on a netchannels version which is
> less hardware-specific and more "normal".

That would be great.

> > > I can try getting beer. Oregon has some acceptable
> > > microbreweries, but I miss having a pint of Hatters in England.
> > > Mead is easier. I brew mead. Strong, dry, rocket-fuel mead.
> >
> > Then you definitely just need to send your driver to netdev@.
>
> Sounds good to me. Liquid driver review aids will be prepared.

That's good - send your driver and some kind of benchmarks for
throughput and latency tests.

-- 
	Evgeniy Polyakov
end of thread, other threads: [~2006-11-13 9:05 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-11-10  3:06 Questions regarding network drivers Jonathan Day
2006-11-10 10:00 ` Evgeniy Polyakov
2006-11-10 17:34   ` Jonathan Day
2006-11-10 17:55     ` Stephen Hemminger
2006-11-11 11:54     ` Evgeniy Polyakov
2006-11-11 23:21       ` Jonathan Day
2006-11-13  9:05         ` Evgeniy Polyakov