* Van Jacobson's net channels and real-time
From: Esben Nielsen @ 2006-04-20 16:29 UTC
To: linux-kernel; +Cc: Ingo Molnar

Before I start: where is VJ's code? I have not been able to find it anywhere.

With the preempt-realtime branch maturing and finding its way into the mainline kernel, using Linux (without sub-kernels) for real-time applications is becoming a realistic option without having to do a lot of kernel hacks of your own. But the network stack could be improved, and some of the ideas in Van Jacobson's net channels could be useful when receiving network packets with real-time latencies.

Finding the end point in the receive interrupt and sending the packet to the receiving process directly is a good idea, if it is fast enough to do so in interrupt context (and I think it can be done very fast). One problem in the current setup is that everything has to go through the soft interrupt. Even if you make a completely new, non-IP protocol, the latency for delivering the frame to your application is still limited by the latency of the IP stack, because the frame still has to go through the soft irq, which might be busy working on IP packets. Even if you open a raw socket, the latency is limited by the latency of the soft irq. At work we use a widely used commercial RTOS. It has exactly the same problem: every network packet is handled by the same thread.

Buffer management is another issue. On the RTOS above you make a buffer pool per network device for receiving packets. On Linux, received packets are taken from the global memory pool with GFP_ATOMIC. On both systems you can easily run out of buffers if they are not freed back to the pool fast enough. In that case you will just have to drop packets as they are received. Without having the code to VJ's net channels, it looks like they solve the problem: each end receiver provides its own receive resources. If a receiver can't cope with all the traffic, it will lose packets; the others won't. That makes it safe to run important real-time traffic alongside some unpredictable, low-priority TCP/IP traffic. If the TCP/IP receivers do not run fast enough, their packets will be dropped, but the driver will not drop the real-time packets. The nice thing about a real-time task is that you know its latency and therefore know how many receive buffers it needs to avoid losing packets in a worst-case scenario.

Implementing new protocols in user space is a good idea, too. The developer - who doesn't need to be a hard-core kernel hacker - can pick whatever language he wants and has far easier access to debugging tools than in the kernel. Unfortunately, it does not perform very well today. Using raw sockets is a way to do protocol stacks in user space now, but you can only listen for packets with a specific protocol id. Therefore you either have to make one thread or process in user space receive everything and forward it to the end receivers, or let all threads receive everything and throw away the packets not meant for them. Apparently the filter mechanism for VJ's net channels (if it is made general enough) would solve that problem, too.

Many real-time applications are time-triggered, i.e. they wake up every, say, 5 ms, poll their environment for new inputs, do their calculations, and then send out the results again. For such an application it would be very efficient to have the driver put the network packets in an mmap'ed area, but not try to wake up the application. The application simply polls the mmap'ed channel every 5 ms. Once this is set up, no system calls are issued for receiving packets at all! On Linux today - for packet-oriented protocols at least - the application has to issue a system call for every packet received, plus a system call at the end to check that there are no more packets to be read.

Esben
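[ As a concrete illustration of the time-triggered consumer described above, here is a minimal user-space sketch. It assumes a hypothetical driver that exports its receive ring via mmap(); struct chan_ring, its layout, and process() are invented for illustration and are not an existing kernel ABI. The task drains the ring and then sleeps for its 5 ms period, so no per-packet system call is made: ]

#include <stdint.h>
#include <time.h>

struct chan_ring {
	volatile uint32_t head;		/* written only by the driver */
	char pad[60];			/* keep the two indices on separate cachelines */
	volatile uint32_t tail;		/* written only by this consumer */
	struct { uint16_t len; char data[2046]; } slot[256];
};

void process(const void *buf, unsigned int len);	/* provided by the application */

static void rt_loop(struct chan_ring *ring)	/* ring obtained via mmap() */
{
	struct timespec period = { .tv_sec = 0, .tv_nsec = 5 * 1000 * 1000 };

	for (;;) {
		while (ring->tail != ring->head) {	/* drain new frames */
			uint32_t i = ring->tail % 256;

			process(ring->slot[i].data, ring->slot[i].len);
			ring->tail++;
		}
		/* the only system call left: once per period, not per packet */
		clock_nanosleep(CLOCK_MONOTONIC, 0, &period, NULL);
	}
}

[ A real implementation would additionally need memory barriers between reading the index and the slot contents, and an absolute-time sleep (TIMER_ABSTIME) to avoid period drift. ]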
* Re: Van Jacobson's net channels and real-time
From: David S. Miller @ 2006-04-20 19:09 UTC
To: simlo; +Cc: linux-kernel, mingo, netdev

[ Maybe ask questions like this on "netdev" where the networking
  developers hang out? Added to CC: ]

Van fell off the face of the planet after giving his presentation and never published his code, only his slides.

I've started to make a slow attempt at implementing his ideas - nothing but pure infrastructure so far - but you can look at what I have here:

	kernel.org:/pub/scm/linux/kernel/git/davem/vj-2.6.git

Don't expect major progress, and don't expect anything beyond a simple channel to softint packet processing on receive any time soon. Going all the way to the socket is a large endeavor and will require a lot of restructuring to do it right, so expect this to take on the order of months.
* Re: Van Jacobson's net channels and real-time
From: Ingo Oeser @ 2006-04-21 16:52 UTC
To: David S. Miller; +Cc: simlo, linux-kernel, mingo, netdev, Ingo Oeser

Hi David,

nice to see you getting started with it. I'm not sure about the queue logic there.

1867 /* Caller must have exclusive producer access to the netchannel. */
1868 int netchannel_enqueue(struct netchannel *np, struct netchannel_buftrailer *bp)
1869 {
1870 	unsigned long tail;
1871
1872 	tail = np->netchan_tail;
1873 	if (tail == np->netchan_head)
1874 		return -ENOMEM;

This looks wrong, since empty and full are the same condition in your case.

1891 struct netchannel_buftrailer *__netchannel_dequeue(struct netchannel *np)
1892 {
1893 	unsigned long head = np->netchan_head;
1894 	struct netchannel_buftrailer *bp = np->netchan_queue[head];
1895
1896 	BUG_ON(np->netchan_tail == head);

See? What about something like:

struct netchannel {
	/* This is only read/written by the writer (producer) */
	unsigned long write_ptr;
	struct netchannel_buftrailer *netchan_queue[NET_CHANNEL_ENTRIES];

	/* This is modified by both */
	atomic_t filled_entries; /* cache_line_align this? */

	/* This is only read/written by the reader (consumer) */
	unsigned long read_ptr;
};

This would prevent this bug from the beginning and still let us use the full queue size.

If cacheline bouncing because of the shared filled_entries becomes an issue, you are receiving or sending a lot. Then you can enqueue and dequeue multiple entries and commit the counts later, with an atomic_read, atomic_add and atomic_sub on filled_entries. Maybe even cheaper with local_t instead of atomic_t later on.

But I guess the cacheline bouncing will be a non-issue, since the whole point of netchannels was to keep traffic as local to a CPU as possible, right?

Would you like to see a sample patch relative to your tree, to show you what I mean?

Regards

Ingo Oeser
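[ For reference, a minimal sketch of the shape the fix can take: free-running head/tail indices with a power-of-two ring size, so that empty (tail == head) and full (tail - head == NET_CHANNEL_ENTRIES) become distinct conditions while each index is still written by exactly one side. The names follow the snippet quoted above, but the bodies are illustrative, not the actual fix that went into the tree: ]

/* Caller must have exclusive producer access to the netchannel. */
int netchannel_enqueue(struct netchannel *np, struct netchannel_buftrailer *bp)
{
	unsigned long tail = np->netchan_tail;

	if (tail - np->netchan_head == NET_CHANNEL_ENTRIES)
		return -ENOMEM;				/* full */
	np->netchan_queue[tail & (NET_CHANNEL_ENTRIES - 1)] = bp;
	smp_wmb();					/* publish buffer before index */
	np->netchan_tail = tail + 1;
	return 0;
}

struct netchannel_buftrailer *__netchannel_dequeue(struct netchannel *np)
{
	unsigned long head = np->netchan_head;
	struct netchannel_buftrailer *bp;

	BUG_ON(np->netchan_tail == head);		/* caller checked non-empty */
	smp_rmb();					/* pairs with the smp_wmb() above */
	bp = np->netchan_queue[head & (NET_CHANNEL_ENTRIES - 1)];
	np->netchan_head = head + 1;
	return bp;
}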
* Re: Van Jacobson's net channels and real-time
From: Jörn Engel @ 2006-04-22 11:48 UTC
To: Ingo Oeser; +Cc: David S. Miller, simlo, linux-kernel, mingo, netdev, Ingo Oeser

On Fri, 21 April 2006 18:52:47 +0200, Ingo Oeser wrote:
> What about something like
>
> struct netchannel {
> 	/* This is only read/written by the writer (producer) */
> 	unsigned long write_ptr;
> 	struct netchannel_buftrailer *netchan_queue[NET_CHANNEL_ENTRIES];
>
> 	/* This is modified by both */
> 	atomic_t filled_entries; /* cache_line_align this? */
>
> 	/* This is only read/written by the reader (consumer) */
> 	unsigned long read_ptr;
> };
>
> This would prevent this bug from the beginning and still let us use the
> full queue size.
>
> If cacheline bouncing because of the shared filled_entries becomes an
> issue, you are receiving or sending a lot.

Unless I completely misunderstand something, one of the main points of the netchannels is to have *zero* fields written to by both producer and consumer. Receiving and sending a lot can be expected to be the common case, so taking a performance hit in this case is hardly a good idea.

I haven't looked at Davem's implementation at all, but Van simply separated fields into consumer-written and producer-written, with proper alignment between them. Some consumer-written fields are also read by the producer and vice versa, but none of this results in cacheline ping-pong.

If your description of the problem is correct, it should only mean that the implementation has a problem, not the concept.

Jörn

--
Time? What's that? Time is only worth what you do with it.
-- Theo de Raadt
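[ A sketch of the field layout Jörn describes from Van's slides: each side owns its own cacheline and no field is written by both. The field names and the use of ____cacheline_aligned_in_smp are illustrative: ]

struct netchannel {
	/* written only by the producer (e.g. the driver's RX path) */
	unsigned long netchan_tail ____cacheline_aligned_in_smp;

	/* written only by the consumer (the socket end) */
	unsigned long netchan_head ____cacheline_aligned_in_smp;

	/* the slot array itself: each slot is written by the producer
	 * and read by the consumer, but never concurrently for the same
	 * index, so it causes no ping-pong */
	struct netchannel_buftrailer *netchan_queue[NET_CHANNEL_ENTRIES]
		____cacheline_aligned_in_smp;
};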
* Re: Van Jacobson's net channels and real-time
From: Ingo Oeser @ 2006-04-22 13:29 UTC
To: Jörn Engel; +Cc: Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

Hi Jörn,

On Saturday, 22. April 2006 13:48, Jörn Engel wrote:
> Unless I completely misunderstand something, one of the main points of
> the netchannels is to have *zero* fields written to by both producer
> and consumer.

Hmm, for me the main point was to keep the complete processing of a single packet within one CPU/core, where this is a non-issue.

> Receiving and sending a lot can be expected to be the
> common case, so taking a performance hit in this case is hardly a good
> idea.

There is no hit. If you receive/send in bursts, you can simply aggregate them up to a certain queueing threshold. The queue design outlined can split the queueing into reserve and commit stages, where the producer is told how much it can produce and the consumer is told how much it can consume. Within their areas the producer and consumer can move around freely. So this is not exactly a queue, but a dynamic double buffer :-)

So maybe doing the queueing with the classic head/tail variant is better here, but the other variant might replace it without problems and allows for some nice improvements.

Regards

Ingo Oeser
* Re: Van Jacobson's net channels and real-time
From: Jörn Engel @ 2006-04-22 13:49 UTC
To: Ingo Oeser; +Cc: Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

On Sat, 22 April 2006 15:29:58 +0200, Ingo Oeser wrote:
> On Saturday, 22. April 2006 13:48, Jörn Engel wrote:
> > Unless I completely misunderstand something, one of the main points of
> > the netchannels is to have *zero* fields written to by both producer
> > and consumer.
>
> Hmm, for me the main point was to keep the complete processing
> of a single packet within one CPU/core, where this is a non-issue.

That was another main point, yes. And the endpoints should be as little burden on the bottlenecks as possible. One bottleneck is the receive interrupt, which shouldn't wait for cachelines from other CPUs too much.

Jörn

--
Why do musicians compose symphonies and poets write poems?
They do it because life wouldn't have any meaning for them if they didn't.
That's why I draw cartoons. It's my life.
-- Charles Shultz
* Re: Van Jacobson's net channels and real-time
From: Ingo Oeser @ 2006-04-23 0:05 UTC
To: Jörn Engel; +Cc: Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

On Saturday, 22. April 2006 15:49, Jörn Engel wrote:
> That was another main point, yes. And the endpoints should be as
> little burden on the bottlenecks as possible. One bottleneck is the
> receive interrupt, which shouldn't wait for cachelines from other CPUs
> too much.

That's right. This will become a non-issue with early demuxing on the NIC and MSI (or was it MSI-X?), which will select the right CPU based on hardware channels.

In the meantime I would reduce the effects by only committing on a full buffer or on leaving the interrupt handler. This would be OK, because here you have to wake up the process anyway on a full buffer, and if it slept because of an empty buffer. You lose only if your application didn't sleep yet and you need to leave the interrupt handler because there is no more work. In this case the atomic_add would be significant.

All this is quite similar to how we do the pagevec stuff in mm/ already.

Regards

Ingo Oeser
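[ What Ingo sketches here - committing once per interrupt rather than once per packet - could look roughly like the following; nic_fetch_rx_buffer(), nic_drop_rx_buffer(), and the wakeup helpers are hypothetical stand-ins: ]

static void nic_rx_irq(struct netchannel *np)
{
	unsigned long tail = np->netchan_tail;	/* private working copy */
	struct netchannel_buftrailer *bp;

	while ((bp = nic_fetch_rx_buffer()) != NULL) {
		if (tail - np->netchan_head == NET_CHANNEL_ENTRIES) {
			nic_drop_rx_buffer(bp);	/* ring full: drop */
			break;
		}
		np->netchan_queue[tail & (NET_CHANNEL_ENTRIES - 1)] = bp;
		tail++;
	}
	smp_wmb();
	np->netchan_tail = tail;	/* one store commits the whole batch */
	if (netchannel_consumer_asleep(np))	/* hypothetical wakeup path */
		netchannel_wakeup(np);
}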
* Re: Van Jacobson's net channels and real-time
From: David S. Miller @ 2006-04-23 5:50 UTC
To: ioe-lkml; +Cc: joern, netdev, simlo, linux-kernel, mingo, netdev

From: Ingo Oeser <ioe-lkml@rameria.de>
Date: Sun, 23 Apr 2006 02:05:32 +0200

> On Saturday, 22. April 2006 15:49, Jörn Engel wrote:
> > That was another main point, yes. And the endpoints should be as
> > little burden on the bottlenecks as possible. One bottleneck is the
> > receive interrupt, which shouldn't wait for cachelines from other CPUs
> > too much.
>
> That's right. This will become a non-issue with early demuxing
> on the NIC and MSI (or was it MSI-X?), which will select
> the right CPU based on hardware channels.

It is not clear that MSI'ing the RX interrupt to multiple cpus is the answer. Consider the fact that by doing so you're reducing the amount of batch work each interrupt does by a factor of N.

One of the biggest gains of NAPI, btw, is that it batches packet receive. If you don't believe the benefits of this, put a simple cycle counter sample around netif_receive_skb() calls and note the difference between the first packet processed and subsequent ones; it's several orders of magnitude faster to process subsequent packets within a batch. I've done this before on tg3 with sparc64 and posted the numbers on netdev about a year or so ago.

If you are doing something like netchannels, it helps to batch so that the demuxing table stays hot in the cpu cache.

There is even talk of dedicating a thread on enormously multi-threaded cpus just to the NIC hardware interrupt, so it could net-channel to the socket processes running on the other strands.
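[ The measurement Dave describes is easy to reproduce; a rough sketch - get_cycles() exists on most architectures, the sample buffer and helper name are illustrative: ]

#include <linux/kernel.h>
#include <linux/netdevice.h>
#include <asm/timex.h>

static cycles_t rx_samples[64];
static int rx_nsamples;

/* drop-in replacement for netif_receive_skb() in a driver's poll loop */
static inline void timed_receive_skb(struct sk_buff *skb)
{
	cycles_t t0 = get_cycles();

	netif_receive_skb(skb);
	if (rx_nsamples < ARRAY_SIZE(rx_samples))
		rx_samples[rx_nsamples++] = get_cycles() - t0;
	/* rx_samples[0] (cold caches, first packet of the batch) should
	 * come out far above the rest when dumped later */
}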
* Re: Van Jacobson's net channels and real-time
From: Auke Kok @ 2006-04-24 16:42 UTC
To: Ingo Oeser; +Cc: Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

Ingo Oeser wrote:
> On Saturday, 22. April 2006 15:49, Jörn Engel wrote:
>> That was another main point, yes. And the endpoints should be as
>> little burden on the bottlenecks as possible. One bottleneck is the
>> receive interrupt, which shouldn't wait for cachelines from other CPUs
>> too much.
>
> That's right. This will become a non-issue with early demuxing
> on the NIC and MSI (or was it MSI-X?), which will select
> the right CPU based on hardware channels.

MSI-X. With MSI you still have only one CPU handling all MSI interrupts, which doesn't look any different from ordinary interrupts. MSI-X will allow much better interrupt handling across several CPUs.

Auke
* Re: Van Jacobson's net channels and real-time
From: linux-os (Dick Johnson) @ 2006-04-24 16:59 UTC
To: Auke Kok; +Cc: Ingo Oeser, Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

On Mon, 24 Apr 2006, Auke Kok wrote:

> Ingo Oeser wrote:
>> On Saturday, 22. April 2006 15:49, Jörn Engel wrote:
>>> That was another main point, yes. And the endpoints should be as
>>> little burden on the bottlenecks as possible. One bottleneck is the
>>> receive interrupt, which shouldn't wait for cachelines from other CPUs
>>> too much.
>>
>> That's right. This will become a non-issue with early demuxing
>> on the NIC and MSI (or was it MSI-X?), which will select
>> the right CPU based on hardware channels.
>
> MSI-X. With MSI you still have only one CPU handling all MSI interrupts,
> which doesn't look any different from ordinary interrupts. MSI-X will allow
> much better interrupt handling across several CPUs.
>
> Auke

Message signaled interrupts are just a kludge to save a trace on a PC board (read: make junk cheaper still). They are not faster and may even be slower. They will not be the salvation of any interrupt latency problems. The solution for increasing networking speed, where the bit-rate on the wire gets close to the bit-rate on the bus, is to put more and more of the networking code inside the network board. The CPU gets interrupted after most things (like network handshakes) are complete.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.4 on an i686 machine (5592.89 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
* Re: Van Jacobson's net channels and real-time
From: Rick Jones @ 2006-04-24 17:19 UTC
To: linux-os (Dick Johnson); +Cc: Auke Kok, Ingo Oeser, Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

>>> That's right. This will become a non-issue with early demuxing
>>> on the NIC and MSI (or was it MSI-X?), which will select
>>> the right CPU based on hardware channels.
>>
>> MSI-X. With MSI you still have only one CPU handling all MSI interrupts,
>> which doesn't look any different from ordinary interrupts. MSI-X will allow
>> much better interrupt handling across several CPUs.
>>
>> Auke
>
> Message signaled interrupts are just a kludge to save a trace on a
> PC board (read: make junk cheaper still). They are not faster and
> may even be slower. They will not be the salvation of any interrupt
> latency problems. The solution for increasing networking speed,
> where the bit-rate on the wire gets close to the bit-rate on the
> bus, is to put more and more of the networking code inside the
> network board. The CPU gets interrupted after most things (like
> network handshakes) are complete.

if the issue is bus vs network bitrates, would offloading really buy that much? i suppose that for minimum sized packets not DMA'ing the headers across the bus would be a decent win, but down at small packet sizes, where headers would be 1/3 to 1/2 the stuff DMA'd around, I would think one is talking more about CPU path lengths than bus bitrates.

and up at "full size" segments, since everyone is so fond of bulk transfer tests, the transfer saved by not shoving headers across the bus is what, 54/1448, or ~3.7%.

spreading interrupts via MSI-X seems nice and all, but i keep wondering if the header-field-based distribution that is (will be) done by the NICs is putting the cart before the horse - should the NIC essentially be telling the system the CPU on which to run the application, or should the CPU on which the application runs be telling "networking" where it should be happening?

rick jones
* Re: Van Jacobson's net channels and real-time
From: linux-os (Dick Johnson) @ 2006-04-24 18:12 UTC
To: Rick Jones; +Cc: Auke Kok, Ingo Oeser, Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

On Mon, 24 Apr 2006, Rick Jones wrote:

> if the issue is bus vs network bitrates, would offloading really buy that
> much? i suppose that for minimum sized packets not DMA'ing the headers
> across the bus would be a decent win, but down at small packet sizes,
> where headers would be 1/3 to 1/2 the stuff DMA'd around, I would think
> one is talking more about CPU path lengths than bus bitrates.
>
> and up at "full size" segments, since everyone is so fond of bulk
> transfer tests, the transfer saved by not shoving headers across the bus
> is what, 54/1448, or ~3.7%.
>
> spreading interrupts via MSI-X seems nice and all, but i keep wondering
> if the header-field-based distribution that is (will be) done by the
> NICs is putting the cart before the horse - should the NIC essentially
> be telling the system the CPU on which to run the application, or should
> the CPU on which the application runs be telling "networking" where it
> should be happening?
>
> rick jones

Ideally, TCP/IP is so mature that one should be able to tell some hardware state machine "Connect with 123.555.44.333, port 23" and have it signal via interrupt when that happens. Then one should be able to say "send these data to that address" or "fill this buffer with data from that address". All the networking could be done on the board, perhaps with a dedicated CPU (as is now done) or all in silicon. The driver end of the networking software would then just handle buffers, with interrupts that show status such as completions or time-outs - trivial stuff.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.4 on an i686 machine (5592.89 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
* Re: Van Jacobson's net channels and real-time
From: Michael Chan @ 2006-04-24 23:17 UTC
To: linux-os (Dick Johnson); +Cc: Auke Kok, Ingo Oeser, Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

On Mon, 2006-04-24 at 12:59 -0400, linux-os (Dick Johnson) wrote:

> Message signaled interrupts are just a kludge to save a trace on a
> PC board (read: make junk cheaper still). They are not faster and
> may even be slower. They will not be the salvation of any interrupt
> latency problems.

MSI has 2 very nice properties: MSI is never shared, and MSI guarantees that all DMA activities before the MSI have completed. When you take advantage of these guarantees in your MSI handler, there can be noticeable improvements compared to using INTA.
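[ A sketch (2.6-era handler signature) of what taking advantage of those two guarantees can look like: no shared-line "is it ours?" status read, and no MMIO read to flush posted DMA writes before trusting the in-memory status block. The device plumbing is invented for illustration: ]

#include <linux/interrupt.h>
#include <linux/netdevice.h>

static irqreturn_t nic_msi(int irq, void *dev_id, struct pt_regs *regs)
{
	struct net_device *dev = dev_id;

	/* MSI is never shared, and all DMA posted before the MSI has
	 * completed, so the status block in memory is already valid:
	 * go straight to scheduling the poll, no MMIO read needed */
	netif_rx_schedule(dev);
	return IRQ_HANDLED;
}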
* Re: Van Jacobson's net channels and real-time
From: Auke Kok @ 2006-04-25 1:49 UTC
To: linux-os (Dick Johnson); +Cc: Auke Kok, Ingo Oeser, Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

linux-os (Dick Johnson) wrote:
> On Mon, 24 Apr 2006, Auke Kok wrote:
>
>> MSI-X. With MSI you still have only one CPU handling all MSI interrupts,
>> which doesn't look any different from ordinary interrupts. MSI-X will allow
>> much better interrupt handling across several CPUs.
>>
>> Auke
>
> Message signaled interrupts are just a kludge to save a trace on a
> PC board (read: make junk cheaper still).

yes. Also, in PCI-Express there is no physical interrupt line anymore due to the architecture, so even classical interrupts are sent as a "message" over the bus.

> They are not faster and may even be slower.

Thus in the case of PCI-Express, MSI interrupts are just as fast as the ordinary ones. I have no numbers on whether MSI is faster or not than e.g. interrupts on PCI-X, but generally speaking, the PCI-Express bus is not designed to be "low latency" at all; at best it gives you X latency, where X is something like microseconds. The MSI message itself only takes 10-20 nanoseconds, but all the handling probably adds a large factor to that (1000 or so). No clue on classical interrupt line latency - anyone?

> They will not be the salvation of any interrupt latency problems.

This is also not the problem - we really don't care that our 100,000 packets arrive 20 usec slower per packet, as long as the bus is not idle for those intervals. We would care a lot if 25,000 of those arrive directly at the proper CPU, without the need for one of the CPUs to arbitrate on every interrupt. That's the idea anyway.

Nowadays, with irq throttling, we introduce a lot of designed latency anyway, especially with network devices.

> The solution for increasing networking speed,
> where the bit-rate on the wire gets close to the bit-rate on the
> bus, is to put more and more of the networking code inside the
> network board. The CPU gets interrupted after most things (like
> network handshakes) are complete.

That is a limited vision of the situation. You could argue that current CPUs have so much power that they can easily do a lot of the processing instead of the hardware, and thus warm the caches for userspace, set up sockets, etc. This is the whole idea of Van Jacobson's net channels. Putting more offloading into the hardware brings so many problems with it that they are far easier solved in the OS.

Cheers,

Auke
* Re: Van Jacobson's net channels and real-time
From: linux-os (Dick Johnson) @ 2006-04-25 11:29 UTC
To: Auke Kok; +Cc: Auke Kok, Ingo Oeser, Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

On Mon, 24 Apr 2006, Auke Kok wrote:

> linux-os (Dick Johnson) wrote:
>> Message signaled interrupts are just a kludge to save a trace on a
>> PC board (read: make junk cheaper still).
>
> yes. Also, in PCI-Express there is no physical interrupt line anymore due to
> the architecture, so even classical interrupts are sent as a "message" over
> the bus.
>
>> They are not faster and may even be slower.
>
> Thus in the case of PCI-Express, MSI interrupts are just as fast as the
> ordinary ones. I have no numbers on whether MSI is faster or not than e.g.
> interrupts on PCI-X, but generally speaking, the PCI-Express bus is not
> designed to be "low latency" at all; at best it gives you X latency, where X
> is something like microseconds. The MSI message itself only takes 10-20
> nanoseconds, but all the handling probably adds a large factor to that
> (1000 or so). No clue on classical interrupt line latency - anyone?

About 9 nanoseconds per foot of FR-4 (G10) trace, plus the access time through the gate arrays (about 20 ns); so, from the time a device needs the CPU until it hits the interrupt pin, you typically have 30 to 50 nanoseconds. Of course the CPU is __much__ slower. However, these physical latencies are in series and cannot be compensated for, because the CPU can't see into the future.

>> They will not be the salvation of any interrupt latency problems.
>
> This is also not the problem - we really don't care that our 100,000 packets
> arrive 20 usec slower per packet, as long as the bus is not idle for those
> intervals. We would care a lot if 25,000 of those arrive directly at the
> proper CPU, without the need for one of the CPUs to arbitrate on every
> interrupt. That's the idea anyway.

It forces driver writers to loop in ISRs to handle new status changes that happened before an asserted interrupt even got to the CPU. This is bad. You end up polling in the ISR with interrupts off. Turning on interrupts exacerbates the problem: you may never leave the ISR! It becomes the new "idle task". To properly use interrupts, the hardware latency must be less than the CPU's response to the hardware stimulus.

> Nowadays, with irq throttling, we introduce a lot of designed latency anyway,
> especially with network devices.
>
>> The solution for increasing networking speed,
>> where the bit-rate on the wire gets close to the bit-rate on the
>> bus, is to put more and more of the networking code inside the
>> network board. The CPU gets interrupted after most things (like
>> network handshakes) are complete.
>
> That is a limited vision of the situation. You could argue that current
> CPUs have so much power that they can easily do a lot of the processing
> instead of the hardware, and thus warm the caches for userspace, set up
> sockets, etc. This is the whole idea of Van Jacobson's net channels.
> Putting more offloading into the hardware brings so many problems with it
> that they are far easier solved in the OS.
>
> Cheers,
>
> Auke

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.4 on an i686 machine (5592.89 BogoMips).
Warning : 98.36% of all statistics are fiction, book release in April.
* Re: Van Jacobson's net channels and real-time
From: Vojtech Pavlik @ 2006-05-02 12:41 UTC
To: linux-os (Dick Johnson); +Cc: Auke Kok, Auke Kok, Ingo Oeser, Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

On Tue, Apr 25, 2006 at 07:29:40AM -0400, linux-os (Dick Johnson) wrote:

> >> Message signaled interrupts are just a kludge to save a trace on a
> >> PC board (read: make junk cheaper still).
> >
> > yes. Also, in PCI-Express there is no physical interrupt line anymore due to
> > the architecture, so even classical interrupts are sent as a "message" over
> > the bus.
> >
> >> They are not faster and may even be slower.
> >
> > Thus in the case of PCI-Express, MSI interrupts are just as fast as the
> > ordinary ones. I have no numbers on whether MSI is faster or not than e.g.
> > interrupts on PCI-X, but generally speaking, the PCI-Express bus is not
> > designed to be "low latency" at all; at best it gives you X latency, where X
> > is something like microseconds. The MSI message itself only takes 10-20
> > nanoseconds, but all the handling probably adds a large factor to that
> > (1000 or so). No clue on classical interrupt line latency - anyone?
>
> About 9 nanoseconds per foot of FR-4 (G10) trace, plus the access time
> through the gate arrays (about 20 ns); so, from the time a device needs
> the CPU until it hits the interrupt pin, you typically have 30 to
> 50 nanoseconds. Of course the CPU is __much__ slower. However, these
> physical latencies are in series and cannot be compensated for, because
> the CPU can't see into the future.

You seem to be missing the fact that most of today's interrupts are delivered through the APIC bus, which isn't fast at all.

--
Vojtech Pavlik
Director SuSE Labs
* Re: Van Jacobson's net channels and real-time
From: Andi Kleen @ 2006-05-02 15:58 UTC
To: Vojtech Pavlik; +Cc: linux-os (Dick Johnson), Auke Kok, Auke Kok, Ingo Oeser, Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

On Tuesday 02 May 2006 14:41, Vojtech Pavlik wrote:

> You seem to be missing the fact that most of today's interrupts are
> delivered through the APIC bus, which isn't fast at all.

You mean slow, right? Modern x86s (anything newer than a P3) generally don't have a separate APIC bus anymore, but just send messages over their main processor connection.

-Andi
* Re: Van Jacobson's net channels and real-time
From: David S. Miller @ 2006-04-23 5:52 UTC
To: ioe-lkml; +Cc: joern, netdev, simlo, linux-kernel, mingo, netdev

From: Ingo Oeser <ioe-lkml@rameria.de>
Date: Sat, 22 Apr 2006 15:29:58 +0200

> On Saturday, 22. April 2006 13:48, Jörn Engel wrote:
> > Unless I completely misunderstand something, one of the main points of
> > the netchannels is to have *zero* fields written to by both producer
> > and consumer.
>
> Hmm, for me the main point was to keep the complete processing
> of a single packet within one CPU/core, where this is a non-issue.

Both are the important issues.

You move the bulk of the packet processing work to the end cores of the system, yes. But you do so with an enormously SMP-friendly queue data structure, so that it does not matter at all that the packet is received on one cpu yet processed in socket context on another.

If you elide either part of the implementation, you miss the entire point of net channels.
* Re: Van Jacobson's net channels and real-time
From: Avi Kivity @ 2006-04-23 9:23 UTC
To: Ingo Oeser; +Cc: Jörn Engel, Ingo Oeser, David S. Miller, simlo, linux-kernel, mingo, netdev

Ingo Oeser wrote:
> Hi Jörn,
>
> On Saturday, 22. April 2006 13:48, Jörn Engel wrote:
>> Unless I completely misunderstand something, one of the main points of
>> the netchannels is to have *zero* fields written to by both producer
>> and consumer.
>
> Hmm, for me the main point was to keep the complete processing
> of a single packet within one CPU/core, where this is a non-issue.

But the interrupt for a packet can be received by cpu 0 whereas the rest of the processing proceeds on cpu 1, so it still helps to keep the producer index and consumer index on separate cachelines.

--
error compiling committee.c: too many arguments to function
* Re: Van Jacobson's net channels and real-time
From: David S. Miller @ 2006-04-23 5:51 UTC
To: joern; +Cc: netdev, simlo, linux-kernel, mingo, netdev, ioe-lkml

From: Jörn Engel <joern@wohnheim.fh-wedel.de>
Date: Sat, 22 Apr 2006 13:48:46 +0200

> Unless I completely misunderstand something, one of the main points of
> the netchannels is to have *zero* fields written to by both producer
> and consumer. Receiving and sending a lot can be expected to be the
> common case, so taking a performance hit in this case is hardly a good
> idea.

That's absolutely correct; this is absolutely critical to the implementation. If you're doing any atomic operations, or any write operations by both consumer and producer to the same cacheline, you've broken things :-)
* Re: Van Jacobson's net channels and real-time
From: David S. Miller @ 2006-04-23 5:56 UTC
To: netdev; +Cc: simlo, linux-kernel, mingo, netdev, ioe-lkml

From: Ingo Oeser <netdev@axxeo.de>
Date: Fri, 21 Apr 2006 18:52:47 +0200

> nice to see you getting started with it.

Thanks for reviewing.

> I'm not sure about the queue logic there.
>
> 1867 /* Caller must have exclusive producer access to the netchannel. */
> 1868 int netchannel_enqueue(struct netchannel *np, struct netchannel_buftrailer *bp)
> 1869 {
> 1870 	unsigned long tail;
> 1871
> 1872 	tail = np->netchan_tail;
> 1873 	if (tail == np->netchan_head)
> 1874 		return -ENOMEM;
>
> This looks wrong, since empty and full are the same condition in your
> case.

Thanks, that's obviously wrong. I'll try to fix this up.

> What about something like
>
> struct netchannel {
> 	/* This is only read/written by the writer (producer) */
> 	unsigned long write_ptr;
> 	struct netchannel_buftrailer *netchan_queue[NET_CHANNEL_ENTRIES];
>
> 	/* This is modified by both */
> 	atomic_t filled_entries; /* cache_line_align this? */
>
> 	/* This is only read/written by the reader (consumer) */
> 	unsigned long read_ptr;
> };

As stated elsewhere, if you add atomic operations you break the entire idea of net channels. They are meant to be SMP-efficient data structures, where the producer has one cache line that only it dirties and the consumer likewise has one cache line that only it dirties.

> If cacheline bouncing because of the shared filled_entries becomes an issue,
> you are receiving or sending a lot.

Cacheline bouncing is the core issue being addressed by this data structure, so we really can't consider your idea seriously. I've just got an off-by-one error; no need to wreck the entire data structure just to solve that :-)
* Re: Van Jacobson's net channels and real-time
From: Ingo Oeser @ 2006-04-23 14:15 UTC
To: David S. Miller; +Cc: netdev, simlo, linux-kernel, mingo, netdev

Hi Dave,

On Sunday, 23. April 2006 07:56, David S. Miller wrote:
> > If cacheline bouncing because of the shared filled_entries becomes an issue,
> > you are receiving or sending a lot.
>
> Cacheline bouncing is the core issue being addressed by this
> data structure, so we really can't consider your idea seriously.

OK, I can see it more clearly now. Many thanks for clearing that up in the other replies. I had a major misunderstanding there.

> I've just got an off-by-one error, no need to wreck the entire
> data structure just to solve that :-)

Yes, you are right. But even then I can still implement the reserve/commit scheme once you provide helpers for producer_space and consumer_space.

Regards

Ingo Oeser
* Re: Van Jacobson's net channels and real-time
From: bert hubert @ 2006-04-22 19:30 UTC
To: David S. Miller; +Cc: simlo, linux-kernel, mingo, netdev

On Thu, Apr 20, 2006 at 12:09:55PM -0700, David S. Miller wrote:
> Going all the way to the socket is a large endeavor and will require a
> lot of restructuring to do it right, so expect this to take on the
> order of months.

That's what you said about Niagara too :-)

Good luck!

--
http://www.PowerDNS.com      Open source, database driven DNS Software
http://netherlabs.nl      Open and Closed source services
* Re: Van Jacobson's net channels and real-time
From: David S. Miller @ 2006-04-23 5:53 UTC
To: bert.hubert; +Cc: simlo, linux-kernel, mingo, netdev

From: bert hubert <bert.hubert@netherlabs.nl>
Date: Sat, 22 Apr 2006 21:30:24 +0200

> On Thu, Apr 20, 2006 at 12:09:55PM -0700, David S. Miller wrote:
> > Going all the way to the socket is a large endeavor and will require a
> > lot of restructuring to do it right, so expect this to take on the
> > order of months.
>
> That's what you said about Niagara too :-)

I'm just trying to keep the expectations low so it's easier to exceed them :-)
* Re: Van Jacobson's net channels and real-time
From: Jan Kiszka @ 2006-04-21 8:53 UTC
To: Esben Nielsen; +Cc: linux-kernel, Ingo Molnar, David S. Miller

2006/4/20, Esben Nielsen <simlo@phys.au.dk>:
> Before I start: where is VJ's code? I have not been able to find it anywhere.
>
> With the preempt-realtime branch maturing and finding its way into the
> mainline kernel, using Linux (without sub-kernels) for real-time applications
> is becoming a realistic option without having to do a lot of kernel hacks
> of your own.

Well, commenting on this statement would likely create a thread of its own...

> But the network stack could be improved, and some of the
> ideas in Van Jacobson's net channels could be useful when receiving network
> packets with real-time latencies.

... so it's better to focus on a fruitful discussion of these interesting ideas, which may lay the ground for a future coexistence of both hard-RT and throughput-optimised networking stacks in the mainline kernel. I'm slightly sceptical, but maybe I'll be proven wrong.

My following remarks are biased toward hard-RT. What may appear problematic in this context could be a non-issue for scenarios where the overall throughput counts, not individual packet latencies.

> Finding the end point in the receive interrupt and sending the packet to
> the receiving process directly is a good idea, if it is fast enough to do
> so in interrupt context (and I think it can be done very fast).

This heavily depends on the protocol to parse. Single-packet messages based on TCP, UDP, or whatever are still easy to demux: some table for the frame type, some for the IP protocol, and another for the port (or an overall hash for a single table) -> here's the receiver.

But now think of fragmented IP packets. The first piece can be assigned normally, but the succeeding fragments require a dynamically added detection in that critical demux path (IP fragments are identified by src+dest IP, protocol, and an arbitrary ID). Each pending chain of fragments for a netchannel would create yet another demux rule. But I'm also curious to see the code Van Jacobson used for this.

BTW, that's the issue we also face in RTnet when handling UDP/IP under hard-RT constraints. We avoid unbounded demux complexity by setting a hard limit on the number of open chains. If you want to have a look at the code: www.rtnet.org

> One problem in the current setup is that everything has to go through the
> soft interrupt. Even if you make a completely new, non-IP
> protocol, the latency for delivering the frame to your application is
> still limited by the latency of the IP stack, because the frame still has
> to go through the soft irq, which might be busy working on IP packets.
> Even if you open a raw socket, the latency is limited by the latency of
> the soft irq. At work we use a widely used commercial RTOS. It has exactly
> the same problem: every network packet is handled by the same thread.

The question of _where_ to do that demultiplexing is actually not as critical as _how_ to do it - and with which complexity. For hard-RT, it should be O(1) or, if not feasible, O(n), where n is only influenced by the RT applications and their traffic footprints, but not by an unknown set of non-RT applications and communication links. [Even with O(1) demux, the pure number of incoming non-RT packets can still cause QoS crosstalk - a different issue.]

> Buffer management is another issue. On the RTOS above you make a buffer pool
> per network device for receiving packets. On Linux, received packets are
> taken from the global memory pool with GFP_ATOMIC. On both systems you can
> easily run out of buffers if they are not freed back to the pool fast
> enough. In that case you will just have to drop packets as they are
> received. Without having the code to VJ's net channels, it looks like they
> solve the problem: each end receiver provides its own receive resources.
> If a receiver can't cope with all the traffic, it will lose packets; the
> others won't. That makes it safe to run important real-time traffic
> alongside some unpredictable, low-priority TCP/IP traffic. If the TCP/IP
> receivers do not run fast enough, their packets will be dropped, but the
> driver will not drop the real-time packets. The nice thing about a
> real-time task is that you know its latency and therefore know how many
> receive buffers it needs to avoid losing packets in a worst-case scenario.

Yep, this is a core feature for RT networking. And this is essentially the way we have handled it in RTnet for quite some time: "Here is a filled buffer for you. Give me an empty one from your pool, and it will be yours. If you can't, I'll drop it!" The existing concept works quite well for single consumers. But it's still a tricky thing when considering multiple consumers sharing a physical buffer. RTnet doesn't handle this so far (except for packet capturing). I have some rough ideas for a generic solution in my mind, but RTnet users haven't asked for this loudly so far, thus no further effort was spent on it.

Actually the pre-allocation issue is not limited to skb-based networking. It's one reason why we have separate RT FireWire and RT USB projects. The restrictions and modifications they require make them unhandy for standard use but a perfect fit for deterministic applications.

Ok, to sum up what I see as the core topics for the first steps: we need A) a mechanism to use pre-allocated buffers for certain communication links and B) a smart early-demux algorithm of manageable complexity which decides which receiver an incoming packet has to be accounted to.

The former is largely a question of restructuring the existing code, but the latter is still unclear to me. Let me sketch my first idea:

struct pattern {
	unsigned long offset;
	unsigned long mask;
	unsigned long value;	/* buffer[offset] & mask == value? */
};

struct rule {
	struct list_head rules;
	int pattern_count;
	struct pattern pattern[MAX_PATTERNS_PER_RULE];
	struct netchannel *destination;
};

For each incoming packet, the NIC or a demux thread would then walk its list of rules, apply all patterns, and push the packet into the channel on a match. Kind of generic and protocol-agnostic, but it will surely not scale very well, especially when rules for fragmented messages start popping up. An optimisation might be to use hard-coded pattern checks for the well-known protocols (UDP, TCP, IP fragments, etc.). But maybe I'm just overlooking THE simple solution to the problem, am I?

Once we had those two core features in the kernel, it would start making sense to think about how to gracefully manage the other modifications to NIC drivers, protocols, and APIs that are required or desirable for hard-RT networking.

Looking forward to further discussions!

Jan
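[ Walking Jan's rule list for an incoming frame could look something like this; struct pattern and struct rule are as he defines them above, the list iteration is a sketch: ]

struct netchannel *demux(struct list_head *rule_list,
			 const unsigned char *buf, unsigned long len)
{
	struct rule *r;

	list_for_each_entry(r, rule_list, rules) {
		int i, match = 1;

		for (i = 0; i < r->pattern_count; i++) {
			struct pattern *p = &r->pattern[i];

			if (p->offset >= len ||
			    (buf[p->offset] & p->mask) != p->value) {
				match = 0;
				break;
			}
		}
		if (match)
			return r->destination;
	}
	return NULL;	/* no match: hand the frame to the regular stack */
}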
* Re: Van Jacobson's net channels and real-time 2006-04-21 8:53 ` Jan Kiszka @ 2006-04-24 14:22 ` Esben Nielsen 2006-04-27 8:09 ` Jan Kiszka 0 siblings, 1 reply; 31+ messages in thread From: Esben Nielsen @ 2006-04-24 14:22 UTC (permalink / raw) To: Jan Kiszka; +Cc: Esben Nielsen, linux-kernel, Ingo Molnar, David S. Miller On Fri, 21 Apr 2006, Jan Kiszka wrote: > 2006/4/20, Esben Nielsen <simlo@phys.au.dk>: >> Before I start, where is VJ's code? I have not been able to find it anywhere. >> >> With the preempt-realtime branch maturing and finding it's way into the >> mainline kernel, using Linux (without sub-kernels) for real-time applications >> is becomming an realistic option without having to do a lot of hacks in the >> kernel on your own. > > Well, commenting on this statement would likely create a thread of its own... We have had that last year I think... > >> But the network stack could be improved and some of the >> ideas in Van Jacobson's net channels could be usefull when receiving network >> packages with real-time latencies. > > ... so it's better to focus on a fruitful discussion on these > interesting ideas which may lay the ground for a future coexistence of > both hard-RT and throughput optimised networking stacks in the > mainline kernel. I'm slightly sceptical, but maybe I'll be proven > wrong. Scaling over many CPUs and RT share some common techniques. The two goals are not the same but they are not completely opposit, nor orthorgonal goals, but rather like two arrows pointing in the same general direction. > > My following remarks are biased toward hart-RT. What may appear > problematic in this context could be a non-issue for scenarios where > the overall throughput counts, not individual packet latencies. > >> >> Finding the end point in the receive interrupt and send of the packet to >> the receiving process directly is a good idea if it is fast enough to do >> so in the interrupt context (and I think it can be done very fast). One > > This heavily depends on the protocol to parse. Single-packet messages > based on TCP, UDP, or whatever, are yet easy to demux: some table for > the frame type, some for the IP protocol, and another for the port (or > an overall hash for a single table) -> here's the receiver. > > But now think of fragmented IP packets. The first piece can be > assigned normally, but the succeeding fragments require a dynamically > added detection in that critical demux path (IP fragments are > identified by src+dest IP, protocol, and an arbitrary ID). Each > pending chain of fragments for a netchannel would create yet another > demux rule. But I'm also curious to see the code used for this by Van > Jacobson. Turn off fragmentation :-) Web servers often do that (giving a lot of trouble to pppoe users). IPv6 is also defined without fragmentation at this level, right? A good first solution would be to send framented IP through the usual IP stack. > > BTW, that's the issue we also face in RTnet when handling UDP/IP under > hart-RT constraints. We avoid unbounded demux complexity by setting a > hard limit on the number of open chains. If you want to have a look at > the code: www.rtnet.org I am only on the net about 30 min every 2nd day. I write mails offline and send them later - that is why I am so late at answering. > >> problem in the current setup, is that everything has to go through the >> soft interrupt. 
That is even if you make a completely new, non-IP >> protocol, the latency for delivering the frame to your application is >> still limited by the latency of the IP-stack because it still have to go >> through soft irq which might be busy working on IP packages. Even if you >> open a raw socket, the latency is limited to the latency of the soft irq. >> At work we use a widely used commercial RTOS. It got exactly the same >> problem of having every network packet being handled by the same thread. > > The question of _where_ to do that demultiplexing is actually not that > critical as _how_ to do it - and with which complexity. For hard-RT, > it should to be O(1) or, if not feasible, O(n), where n is only > influenced by the RT applications and their traffic footprints, but > not by an unknown set of non-RT applications and communication links. > [Even with O(1) demux, the pure numbers of incoming non-RT packets can > still cause QoS crosstalk - a different issue.] Yep, ofcourse. But not obviouse to people in not familiar with deterministic RT. I assume that you mean the same by "hard RT" as I mean by "deterministic RT". Old discussions on lkml has shown that there is a lot of disagreement about what is meant :-) > >> >> Buffer management is another issue. On the RTOS above you make a buffer pool >> per network device for receiving packages. On Linux received packages are taken >> from the global memory pool with GFP_ATOMIC. On both systems you can easily run >> out of buffers if they are not freed back to the pool fast enough. In that >> case you will just have to drop packages as they are received. Without >> having the code to VJ's net channels, it looks like they solve the problem: >> Each end receiver provides his own receive resources. If a receiver can't cope >> with all the traffic, it will loose packages, the others wont. That makes it >> safe to run important real-time traffic along with some unpredictable, low >> priority TCP/IP traffic. If the TCP/IP receivers does not run fast enough, >> their packages will be dropped, but the driver will not drop the real-time >> packages. The nice thing about a real-time task is that you know it's latency >> and therefore know how many receive buffers it needs to avoid loosing >> packages in a worst case scenario. > > Yep, this is a core feature for RT networking. And this is essentially > the way we handle it in RTnet for quite some time: "Here is a filled > buffer for you. Give me an empty one from your pool, and it will be > yours. If you can't, I'll drop it!" The existing concept works quite > well for single consumers. But it's still a tricky thing when > considering multiple consumers sharing a physical buffer. RTnet > doesn't handle this so far (except for packet capturing). I have some > rough ideas for a generic solution in my mind, but RTnet users didn't > ask for this so far loudly, thus no further effort was spent on it. > Exchanging skbs instead of simply handing over skbs is ofcourse a good idea, but it will slow down the stack slightly. VJ _might_ have made stuff more effective. > Actually the pre-allocation issue is not only limited to skb-based > networking. It's one reason why we have separate RT Firewire and RT USB > projects. The restrictions and modifications they require make them > unhandy for standard use but perfectly fitting for deterministic > applications. 
> Actually the pre-allocation issue is not only limited to skb-based
> networking. It's one reason why we have separate RT Firewire and RT USB
> projects. The restrictions and modifications they require make them
> unhandy for standard use but perfectly fitting for deterministic
> applications.
>
> Ok, to sum up what I see as the core topics for the first steps: we
> need A) a mechanism to use pre-allocated buffers for certain
> communication links and B) a smart early-demux algorithm of manageable
> complexity which decides what receiver has to be accounted for an
> incoming packet.
>
> The former is widely a question of restructuring the existing code,
> but the latter is still unclear to me. Let me sketch my first idea:
>
> struct pattern {
>     unsigned long offset;
>     unsigned long mask;
>     unsigned long value;    /* buffer[offset] & mask == value? */
> };
>
> struct rule {
>     struct list_head rules;
>     int pattern_count;
>     struct pattern pattern[MAX_PATTERNS_PER_RULE];
>     struct netchannel *destination;
> };
>
> For each incoming packet, the NIC or a demux thread would then walk
> its list of rules, apply all patterns, and push the packet into the
> channel on match. Kind of generic and protocol-agnostic, but it will
> surely not scale very well, specifically when allowing rules for
> fragmented messages popping up. An optimisation might be to use
> hard-coded pattern checks for the well-known protocols (UDP, TCP, IP
> fragment, etc.). But maybe I'm just overlooking THE simple solution of
> the problem now, am I?
>

I came up with a simple, quite general idea - but not general enough to
include fragmentation. See below.

> Once we had those two core features in the kernel, it would start
> making sense to think about how to manage other modifications to NIC
> drivers, protocols, and APIs gracefully that are required or desirable
> for hard-RT networking.
>
> Looking forward to further discussions!
>

You will have it :-)

> Jan
>

Here is a simple filter idea. The kernel has to enforce a maximum filter
length to make the filtering deterministic.

filter.h:
----------------------------------------------------------------------
/*
 * Copyright (c) 2006 Esben Nielsen
 *
 * Released under the terms of the GNU GPL v2.0.
 */

#ifndef FILTER_H
#define FILTER_H

struct netchannel;

enum rx_action_type {
    LOOK_AT_BYTE,
    GIVE_TO_CHANNEL
};

struct rx_action {
    int usage;
    enum rx_action_type type;
    union {
        struct {
            unsigned long offset;
            struct rx_action *actions[256];
            struct rx_action *not_that_long;
        } look_at_byte;
        struct {
            struct netchannel *channel;
            struct rx_action *cont;
        } give_to_channel;
    } args;
};

#endif
----------------------------------------------------------------------
filter.c:
----------------------------------------------------------------------
/*
 * Copyright (c) 2006 Esben Nielsen
 *
 * Released under the terms of the GNU GPL v2.0.
 *
 * Note: a sketch - it mixes user-space malloc/free with kernel-style
 * BUG_ON/mutex/RCU, so it is not a finished kernel patch.
 */
#include <stdlib.h>
#include <string.h>
#include <stdint.h>
#include <errno.h>

#include "filter.h"

extern void netchannel_receive(struct netchannel *channel,
                               const unsigned char *data,
                               unsigned long length);

/* Walk the action tree for one received frame. The walk is bounded by
 * the depth of the tree, i.e. by the maximum filter length. */
void do_rx_actions(struct rx_action *act, const unsigned char *data,
                   unsigned long length)
{
 start:
    switch (act->type) {
    case LOOK_AT_BYTE:
        if (act->args.look_at_byte.offset >= length)
            act = act->args.look_at_byte.not_that_long;
        else
            act = act->args.look_at_byte.actions
                      [data[act->args.look_at_byte.offset]];
        goto start;
    case GIVE_TO_CHANNEL:
        netchannel_receive(act->args.give_to_channel.channel,
                           data, length);
        act = act->args.give_to_channel.cont;
        if (act)
            goto start;
        break;
    default:
        BUG_ON(1);
    }
}

extern struct netchannel * const default_netchannel;

struct rx_action default_action;

struct rx_action *get_action(struct rx_action *act)
{
    act->usage++;
    return act;
}

struct rx_action *alloc_rx_action(enum rx_action_type type)
{
    struct rx_action *res = malloc(sizeof(struct rx_action));
    if (res) {
        int i;
        res->usage = 1;
        res->type = type;
        switch (type) {
        case LOOK_AT_BYTE:
            for (i = 0; i < 256; i++)
                res->args.look_at_byte.actions[i] =
                    get_action(&default_action);
            res->args.look_at_byte.not_that_long =
                get_action(&default_action);
            break;
        case GIVE_TO_CHANNEL:
            res->args.give_to_channel.channel = default_netchannel;
            res->args.give_to_channel.cont = NULL;
            break;
        default:
            BUG_ON(1);
        }
    }
    return res;
}

void free_rx_action(struct rx_action **a_ref)
{
    int i;
    struct rx_action *a = *a_ref;
    *a_ref = NULL;
    if (!a)
        return;
    a->usage--;
    if (a->usage)
        return;
    switch (a->type) {
    case LOOK_AT_BYTE:
        for (i = 0; i < 256; i++)
            free_rx_action(&a->args.look_at_byte.actions[i]);
        free_rx_action(&a->args.look_at_byte.not_that_long);
        break;
    case GIVE_TO_CHANNEL:
        free_rx_action(&a->args.give_to_channel.cont);
        break;
    default:
        BUG_ON(1);
    }
    free(a);
}

struct rx_action *make_look_at_byte(unsigned long offset, unsigned char val,
                                    struct rx_action *todo)
{
    struct rx_action *act;
    if (!todo)
        return NULL;
    act = alloc_rx_action(LOOK_AT_BYTE);
    if (!act) {
        free_rx_action(&todo);
        return NULL;
    }
    act->args.look_at_byte.offset = offset;
    free_rx_action(&act->args.look_at_byte.actions[val]);
    act->args.look_at_byte.actions[val] = todo;
    return act;
}

/* Ethernet frame type 0x0800 -> IPv4 */
struct rx_action *ethernet_to_ip(struct rx_action *todo)
{
    return make_look_at_byte(12, 0x08, make_look_at_byte(13, 0, todo));
}

struct rx_action *ethernet_to_udp(struct rx_action *todo)
{
    return ethernet_to_ip(make_look_at_byte(23, 17 /* IPPROTO_UDP */,
                                            todo));
}

/* Assumes a 20-byte IP header, i.e. no IP options. */
struct rx_action *ethernet_to_udp_port(struct rx_action *todo, uint16_t port)
{
    return ethernet_to_udp(make_look_at_byte(36, port >> 8,
                           make_look_at_byte(37, port & 0xFF, todo)));
}

struct rx_action *merge_rx_actions(struct rx_action *act1,
                                   struct rx_action *act2);

struct rx_action *merge_give_to_channel(struct rx_action *give,
                                        struct rx_action *other)
{
    int was_not_zero;
    struct rx_action *res = alloc_rx_action(GIVE_TO_CHANNEL);
    if (!res)
        return NULL;
    BUG_ON(give->type != GIVE_TO_CHANNEL);
    res->args.give_to_channel.channel = give->args.give_to_channel.channel;
    /* Test the source's cont: the freshly allocated res->cont is
     * always NULL at this point. */
    was_not_zero = (give->args.give_to_channel.cont != NULL);
    res->args.give_to_channel.cont =
        merge_rx_actions(give->args.give_to_channel.cont, other);
    if (was_not_zero && !res->args.give_to_channel.cont) {
        free_rx_action(&res);
        return NULL;
    }
    return res;
}
struct rx_action *merge_rx_actions(struct rx_action *act1,
                                   struct rx_action *act2)
{
    if (!act1 || act1 == &default_action)
        return act2 ? get_action(act2) : NULL;
    if (!act2 || act2 == &default_action)
        return get_action(act1);

    switch (act1->type) {
    case LOOK_AT_BYTE:
        switch (act2->type) {
        case LOOK_AT_BYTE:
            if (act1->args.look_at_byte.offset ==
                act2->args.look_at_byte.offset) {
                int i;
                struct rx_action *res = alloc_rx_action(LOOK_AT_BYTE);
                if (!res)
                    return NULL;
                res->args.look_at_byte.offset =
                    act1->args.look_at_byte.offset;
                for (i = 0; i < 256; i++) {
                    free_rx_action(&res->args.look_at_byte.actions[i]);
                    res->args.look_at_byte.actions[i] =
                        merge_rx_actions(
                            act1->args.look_at_byte.actions[i],
                            act2->args.look_at_byte.actions[i]);
                    if (!res->args.look_at_byte.actions[i]) {
                        free_rx_action(&res);
                        return NULL;
                    }
                }
                res->args.look_at_byte.not_that_long =
                    merge_rx_actions(
                        act1->args.look_at_byte.not_that_long,
                        act2->args.look_at_byte.not_that_long);
                if (!res->args.look_at_byte.not_that_long) {
                    free_rx_action(&res);
                    return NULL;
                }
                return res;
            }
            /* Make act1 the action with the lower offset. */
            if (act2->args.look_at_byte.offset <
                act1->args.look_at_byte.offset) {
                struct rx_action *tmp;
                tmp = act1;
                act1 = act2;
                act2 = tmp;
            }
            if (act1->args.look_at_byte.offset <
                act2->args.look_at_byte.offset) {
                int i;
                struct rx_action *res = alloc_rx_action(LOOK_AT_BYTE);
                if (!res)
                    return NULL;
                res->args.look_at_byte.offset =
                    act1->args.look_at_byte.offset;
                for (i = 0; i < 256; i++) {
                    free_rx_action(&res->args.look_at_byte.actions[i]);
                    res->args.look_at_byte.actions[i] =
                        merge_rx_actions(
                            act1->args.look_at_byte.actions[i], act2);
                    if (!res->args.look_at_byte.actions[i]) {
                        free_rx_action(&res);
                        return NULL;
                    }
                }
                res->args.look_at_byte.not_that_long =
                    merge_rx_actions(
                        act1->args.look_at_byte.not_that_long, act2);
                if (!res->args.look_at_byte.not_that_long) {
                    free_rx_action(&res);
                    return NULL;
                }
                return res;
            } else
                BUG_ON(1);
        case GIVE_TO_CHANNEL:
            return merge_give_to_channel(act2, act1);
        }
        BUG_ON(1);
        break;
    case GIVE_TO_CHANNEL:
        return merge_give_to_channel(act1, act2);
    }
    BUG_ON(1);
    return NULL;
}

void init_rx_actions(void)
{
    default_action.usage = 1;
    default_action.type = GIVE_TO_CHANNEL;
    default_action.args.give_to_channel.channel = default_netchannel;
    default_action.args.give_to_channel.cont = NULL;
}

struct netdevice {
    struct rx_action *action;
    struct mutex change_action_lock;
};

int add_action_to_device(struct netdevice *dev, struct rx_action *act)
{
    struct rx_action *new, *old;

    mutex_lock(&dev->change_action_lock);
    new = merge_rx_actions(dev->action, act);
    if (!new) {
        mutex_unlock(&dev->change_action_lock);
        return -ENOMEM;
    }
    old = dev->action;
    dev->action = new;
    mutex_unlock(&dev->change_action_lock);
    synchronize_rcu();    /* wait for readers of the old tree */
    free_rx_action(&old);
    return 0;
}
----------------------------------------------------------------------
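
To illustrate how the pieces above fit together, here is a hypothetical
setup (my_rt_channel and the eth0 device are assumed to exist; refcount
handling of the merged rule is elided):

----------------------------------------------------------------------
/* Hypothetical usage: steer UDP destination port 5001 into an RT
 * channel; everything else keeps hitting default_netchannel. */
#include <errno.h>
#include "filter.h"

/* struct netdevice and add_action_to_device() as in filter.c above. */
extern struct netchannel *my_rt_channel;    /* created elsewhere */
extern struct netdevice *eth0;              /* likewise */

int setup_rt_filter(void)
{
    struct rx_action *deliver, *rule;

    init_rx_actions();

    deliver = alloc_rx_action(GIVE_TO_CHANNEL);
    if (!deliver)
        return -ENOMEM;
    deliver->args.give_to_channel.channel = my_rt_channel;

    /* Ethernet -> IPv4 -> UDP -> port 5001, one byte at a time;
     * on failure the builder chain frees 'deliver' itself. */
    rule = ethernet_to_udp_port(deliver, 5001);
    if (!rule)
        return -ENOMEM;

    return add_action_to_device(eth0, rule);
}

/* In the driver's RX path (readers of dev->action would run under
 * rcu_read_lock() to pair with the synchronize_rcu() above): */
void rx_one_frame(struct netdevice *dev,
                  const unsigned char *data, unsigned long len)
{
    do_rx_actions(dev->action, data, len);
}
----------------------------------------------------------------------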
^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Van Jacobson's net channels and real-time
  2006-04-24 14:22 ` Esben Nielsen
@ 2006-04-27  8:09   ` Jan Kiszka
  2006-04-27  8:16     ` David S. Miller
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kiszka @ 2006-04-27 8:09 UTC (permalink / raw)
  To: Esben Nielsen; +Cc: linux-kernel, Ingo Molnar, David S. Miller

2006/4/24, Esben Nielsen <simlo@phys.au.dk>:
> On Fri, 21 Apr 2006, Jan Kiszka wrote:
>
> > 2006/4/20, Esben Nielsen <simlo@phys.au.dk>:
> >>
> >> Finding the end point in the receive interrupt and send of the packet to
> >> the receiving process directly is a good idea if it is fast enough to do
> >> so in the interrupt context (and I think it can be done very fast). One
> >
> > This heavily depends on the protocol to parse. Single-packet messages
> > based on TCP, UDP, or whatever, are yet easy to demux: some table for
> > the frame type, some for the IP protocol, and another for the port (or
> > an overall hash for a single table) -> here's the receiver.
> >
> > But now think of fragmented IP packets. The first piece can be
> > assigned normally, but the succeeding fragments require a dynamically
> > added detection in that critical demux path (IP fragments are
> > identified by src+dest IP, protocol, and an arbitrary ID). Each
> > pending chain of fragments for a netchannel would create yet another
> > demux rule. But I'm also curious to see the code used for this by Van
> > Jacobson.
>
> Turn off fragmentation :-) Web servers often do that (giving a lot of
> trouble to pppoe users). IPv6 is also defined without fragmentation at
> this level, right?

As far as I see it, the demux situation is not that different with IPv6 -
as long as you do not prepare the fragments specifically. I'm thinking of
IP options carrying the destination port, which is so far only contained
in the first fragment. But such tweaks only work if all participants
follow the rules. Anyway, worth keeping in mind.

> A good first solution would be to send fragmented IP through the usual IP
> stack.
>

Although this excludes protocols which exploit this feature. But you are
right, one problem after the other.

>
> >
> > BTW, that's the issue we also face in RTnet when handling UDP/IP under
> > hard-RT constraints. We avoid unbounded demux complexity by setting a
> > hard limit on the number of open chains. If you want to have a look at
> > the code: www.rtnet.org
>
> I am only on the net about 30 min every 2nd day. I write mails offline and
> send them later - that is why I am so late at answering.
>

Different situation on my side - same effect. :-/

>
> >
> > >> problem in the current setup, is that everything has to go through the
> > >> soft interrupt. That is even if you make a completely new, non-IP
> > >> protocol, the latency for delivering the frame to your application is
> > >> still limited by the latency of the IP-stack because it still have to go
> > >> through soft irq which might be busy working on IP packages. Even if you
> > >> open a raw socket, the latency is limited to the latency of the soft irq.
> > >> At work we use a widely used commercial RTOS. It got exactly the same
> > >> problem of having every network packet being handled by the same thread.
> > >
> > > The question of _where_ to do that demultiplexing is actually not that
> > > critical as _how_ to do it - and with which complexity. For hard-RT,
> > > it should be O(1) or, if not feasible, O(n), where n is only
> > > influenced by the RT applications and their traffic footprints, but
> > > not by an unknown set of non-RT applications and communication links.
> > > [Even with O(1) demux, the pure numbers of incoming non-RT packets can
> > > still cause QoS crosstalk - a different issue.]
>
> Yep, of course. But it is not obvious to people not familiar with
> deterministic RT. I assume that you mean the same by "hard RT" as I mean
> by "deterministic RT". Old discussions on lkml have shown that there is a
> lot of disagreement about what is meant :-)

I tend to be sluggish, I know. Actually, when being pedantic, soft RT
can also be deterministic in failing to meet the specified deadline once
in a while :). But I'm sure we mean the same: the required logical and
temporal properties must always be fulfilled, i.e. without even rare
exceptions.

>
> >
> >>
> >> Buffer management is another issue. On the RTOS above you make a buffer pool
> >> per network device for receiving packages. On Linux received packages are taken
> >> from the global memory pool with GFP_ATOMIC. On both systems you can easily run
> >> out of buffers if they are not freed back to the pool fast enough. In that
> >> case you will just have to drop packages as they are received. Without
> >> having the code to VJ's net channels, it looks like they solve the problem:
> >> Each end receiver provides his own receive resources. If a receiver can't cope
> >> with all the traffic, it will loose packages, the others wont. That makes it
> >> safe to run important real-time traffic along with some unpredictable, low
> >> priority TCP/IP traffic. If the TCP/IP receivers does not run fast enough,
> >> their packages will be dropped, but the driver will not drop the real-time
> >> packages. The nice thing about a real-time task is that you know it's latency
> >> and therefore know how many receive buffers it needs to avoid loosing
> >> packages in a worst case scenario.
> >
> > Yep, this is a core feature for RT networking. And this is essentially
> > the way we handle it in RTnet for quite some time: "Here is a filled
> > buffer for you. Give me an empty one from your pool, and it will be
> > yours. If you can't, I'll drop it!" The existing concept works quite
> > well for single consumers. But it's still a tricky thing when
> > considering multiple consumers sharing a physical buffer. RTnet
> > doesn't handle this so far (except for packet capturing). I have some
> > rough ideas for a generic solution in my mind, but RTnet users didn't
> > ask for this so far loudly, thus no further effort was spent on it.
> >
>
> Exchanging skbs instead of simply handing over skbs is of course a good
> idea, but it will slow down the stack slightly. VJ _might_ have made
> stuff more effective.

I'm scratching my head, digging for the reasons why I once considered
and then dropped the idea of accounting, i.e. maintaining counters
instead of passing empty buffers. One reason likely was that RTnet is
not built on top of strict one-way channels. This means you either have
to maintain a central pool for free buffers (not that scalable) or will
run into trouble having enough real buffers in the local pools of NICs,
sockets, etc. when they are actually needed. Might be feasible, though,
under the constraint of single producer / single consumer.
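
For the record, such an accounting scheme could be sketched like this
(C11 atomics; hypothetical names, single producer / single consumer
assumed):

----------------------------------------------------------------------
/* Counter-based accounting instead of buffer exchange: buffers come
 * from a shared pool, but each channel may only consume what it has
 * pre-paid as credits. */
#include <stdatomic.h>

struct channel_account {
    atomic_int credits;     /* buffers this receiver may still take */
};

/* RX path: charge the receiving channel before allocating. */
static int charge_channel(struct channel_account *acc)
{
    int c = atomic_load(&acc->credits);

    while (c > 0)
        if (atomic_compare_exchange_weak(&acc->credits, &c, c - 1))
            return 0;       /* charged - allocate and deliver */
    return -1;              /* out of credits - drop the packet */
}

/* Receiver frees a buffer: the credit goes back. */
static void refund_channel(struct channel_account *acc)
{
    atomic_fetch_add(&acc->credits, 1);
}
----------------------------------------------------------------------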
> I came up with a simple, quite general idea - but not general enough
> to include fragmentation. See below.
>

I like it! It's a bit memory-hungry, but I guess this can be improved.
See the data structure suggestions below.

> > Once we had those two core features in the kernel, it would start
> > making sense to think about how to manage other modifications to NIC
> > drivers, protocols, and APIs gracefully that are required or desirable
> > for hard-RT networking.
> >
> > Looking forward to further discussions!
> >
> You will have it :-)

Oh, yeah, I'm afraid. ;)

>
> > Jan
> >
>
> Here is a simple filter idea. The kernel has to enforce a maximum
> filter length to make the filtering deterministic.
>
> filter.h:
> ----------------------------------------------------------------------
> /*
>  * Copyright (c) 2006 Esben Nielsen
>  *
>  * Released under the terms of the GNU GPL v2.0.
>  */
>
> #ifndef FILTER_H
> #define FILTER_H
>
> struct netchannel;
>
> enum rx_action_type {
>     LOOK_AT_BYTE,
>     GIVE_TO_CHANNEL
> };

I would additionally introduce COMPARE_BYTE in order to replace the
table-based lookup where the number of byte values to distinguish is
simply too small. Look e.g. at the high byte of the Ethernet frame type
considering ETH_P_ARP and ETH_P_IP (the common case) - they are
identical.

> struct rx_action {
>     int usage;
>     enum rx_action_type type;
>     union {
>         struct {
>             unsigned long offset;
>             struct rx_action *actions[256];
>             struct rx_action *not_that_long;
>         } look_at_byte;
>         struct {
>             struct netchannel *channel;
>             struct rx_action *cont;
>         } give_to_channel;
>     } args;
> };

What about this:

struct rx_action_hdr {
    int usage;
    enum rx_action_type type;
};

struct rx_demux_byte {
    struct rx_action_hdr hdr;
    unsigned long offset;
    struct rx_action *actions[256];
    struct rx_action *not_that_long;
};

struct rx_compare_byte {
    struct rx_action_hdr hdr;
    unsigned long offset;
    unsigned char value;
    struct rx_action *action;
    struct rx_action *not_that_long;
};

struct rx_give_to_channel {
    struct rx_action_hdr hdr;
    struct netchannel *channel;
    struct rx_action *cont;
};
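
A walk step for such a COMPARE_BYTE node might look like this (a sketch
reusing the structs above; the 'mismatch' branch is an added assumption -
the layout does not say where a non-matching byte should go, e.g. to the
default action):

----------------------------------------------------------------------
/* Sketch: one comparison instead of a 256-entry table when only a
 * single byte value matters at this offset. */
struct rx_compare_byte_ext {
    struct rx_action_hdr hdr;
    unsigned long offset;
    unsigned char value;
    struct rx_action *action;           /* byte matches 'value' */
    struct rx_action *mismatch;         /* byte differs (assumed field) */
    struct rx_action *not_that_long;
};

static struct rx_action *step_compare_byte(struct rx_compare_byte_ext *cb,
                                           const unsigned char *data,
                                           unsigned long length)
{
    if (cb->offset >= length)
        return cb->not_that_long;       /* packet too short */
    if (data[cb->offset] == cb->value)
        return cb->action;              /* match: continue the chain */
    return cb->mismatch;
}
----------------------------------------------------------------------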
Sorry, I haven't looked at further details of your implementation yet.

Jan

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Van Jacobson's net channels and real-time
  2006-04-27  8:09 ` Jan Kiszka
@ 2006-04-27  8:16   ` David S. Miller
  2006-04-27 10:00     ` Jan Kiszka
  0 siblings, 1 reply; 31+ messages in thread
From: David S. Miller @ 2006-04-27 8:16 UTC (permalink / raw)
  To: jan.kiszka; +Cc: simlo, linux-kernel, mingo

From: "Jan Kiszka" <jan.kiszka@googlemail.com>
Date: Thu, 27 Apr 2006 10:09:06 +0200

> What about this:

Can I recommend a trip to the local university engineering library for
a quick read-up on the current state of the art wrt. packet
classification algorithms?

Barring that, a read of chapter 12, "Packet Classification", from
Networking Algorithmics will give you a great primer.

I'm suggesting this because all I see is fishing around with painfully
inefficient algorithms.

In any event, the initial net channel implementation will likely just
do straight socket hash lookups, identical to how TCP does socket
lookups in the current stack: a full match on established sockets and,
failing that, a fall back to the listening socket lookup, which allows
some forms of wildcarding.
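
In outline, that two-step lookup is (hypothetical helper names, heavily
simplified from the real TCP hash code):

----------------------------------------------------------------------
#include <stdint.h>

struct sock;                            /* opaque here */

/* Stand-ins, not real kernel symbols: */
struct sock *ehash_lookup(uint32_t saddr, uint16_t sport,
                          uint32_t daddr, uint16_t dport);
struct sock *lhash_lookup(uint32_t daddr, uint16_t dport);

struct sock *channel_sock_lookup(uint32_t saddr, uint16_t sport,
                                 uint32_t daddr, uint16_t dport)
{
    struct sock *sk;

    /* 1. Established sockets: exact (saddr, sport, daddr, dport). */
    sk = ehash_lookup(saddr, sport, daddr, dport);
    if (sk)
        return sk;

    /* 2. Listening sockets: (daddr, dport), where the local address
     *    may be wildcarded (INADDR_ANY). */
    return lhash_lookup(daddr, dport);
}
----------------------------------------------------------------------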
Thanks.

^ permalink raw reply	[flat|nested] 31+ messages in thread

* Re: Van Jacobson's net channels and real-time
  2006-04-27  8:16 ` David S. Miller
@ 2006-04-27 10:00   ` Jan Kiszka
  2006-04-27 19:50     ` David S. Miller
  0 siblings, 1 reply; 31+ messages in thread
From: Jan Kiszka @ 2006-04-27 10:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: simlo, linux-kernel, mingo

2006/4/27, David S. Miller <davem@davemloft.net>:
>
> Can I recommend a trip to the local university engineering library for
> a quick read-up on the current state of the art wrt. packet
> classification algorithms?
>
> Barring that, a read of chapter 12, "Packet Classification", from
> Networking Algorithmics will give you a great primer.
>
> I'm suggesting this because all I see is fishing around with painfully
> inefficient algorithms.
>
> In any event, the initial net channel implementation will likely just
> do straight socket hash lookups, identical to how TCP does socket
> lookups in the current stack: a full match on established sockets and,
> failing that, a fall back to the listening socket lookup, which allows
> some forms of wildcarding.
>

Sorry that you had to remind us of the different primary goals. I think
we should look for something pluggable that supports both large-scale
rule tables and the small ones of embedded RT systems.

Jan

^ permalink raw reply	[flat|nested] 31+ messages in thread
* Re: Van Jacobson's net channels and real-time
  2006-04-27 10:00 ` Jan Kiszka
@ 2006-04-27 19:50   ` David S. Miller
  0 siblings, 0 replies; 31+ messages in thread
From: David S. Miller @ 2006-04-27 19:50 UTC (permalink / raw)
  To: jan.kiszka; +Cc: simlo, linux-kernel, mingo

From: "Jan Kiszka" <jan.kiszka@googlemail.com>
Date: Thu, 27 Apr 2006 12:00:53 +0200

> Sorry that you had to remind us of the different primary goals. I think
> we should look for something pluggable that supports both large-scale
> rule tables and the small ones of embedded RT systems.

Even for such objectives, very specific understanding exists in the
algorithmic community about what is known to work best for various
kinds of packet classification.

^ permalink raw reply	[flat|nested] 31+ messages in thread
[parent not found: <63KcN-6lD-25@gated-at.bofh.it>]
[parent not found: <64wrg-2cg-41@gated-at.bofh.it>]
[parent not found: <64wAE-2Cs-9@gated-at.bofh.it>]
[parent not found: <64AkV-8cG-7@gated-at.bofh.it>]
[parent not found: <65cqo-5tR-33@gated-at.bofh.it>]
[parent not found: <65cJF-66i-11@gated-at.bofh.it>]
* Re: Van Jacobson's net channels and real-time
  [not found] ` <65cJF-66i-11@gated-at.bofh.it>
@ 2006-04-24 23:48   ` Robert Hancock
  0 siblings, 0 replies; 31+ messages in thread
From: Robert Hancock @ 2006-04-24 23:48 UTC (permalink / raw)
  To: linux-kernel

linux-os (Dick Johnson) wrote:
> Message signaled interrupts are just a kludge to save a trace on a
> PC board (read make junk cheaper still). They are not faster and
> may even be slower.

Save a trace on the PC board? How about no, since the devices still need
to support INTx interrupts anyway. And yes, they can be faster, mainly
because, being an in-band signal, MSI simplifies some PCI-posting-related
issues, and there is no need to worry about sharing.

> They will not be the salvation of any interrupt
> latency problems. The solution for increasing networking speed,
> where the bit-rate on the wire gets close to the bit-rate on the
> bus, is to put more and more of the networking code inside the
> network board. The CPU gets interrupted after most things (like
> network handshakes) are complete.

You mean like these? http://linux-net.osdl.org/index.php/TOE

--
Robert Hancock      Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

^ permalink raw reply	[flat|nested] 31+ messages in thread
Thread overview: 31+ messages (newest: 2006-05-02 16:03 UTC)
2006-04-20 16:29 Van Jacobson's net channels and real-time Esben Nielsen
2006-04-20 19:09 ` David S. Miller
2006-04-21 16:52 ` Ingo Oeser
2006-04-22 11:48 ` Jörn Engel
2006-04-22 13:29 ` Ingo Oeser
2006-04-22 13:49 ` Jörn Engel
2006-04-23 0:05 ` Ingo Oeser
2006-04-23 5:50 ` David S. Miller
2006-04-24 16:42 ` Auke Kok
2006-04-24 16:59 ` linux-os (Dick Johnson)
2006-04-24 17:19 ` Rick Jones
2006-04-24 18:12 ` linux-os (Dick Johnson)
2006-04-24 23:17 ` Michael Chan
2006-04-25 1:49 ` Auke Kok
2006-04-25 11:29 ` linux-os (Dick Johnson)
2006-05-02 12:41 ` Vojtech Pavlik
2006-05-02 15:58 ` Andi Kleen
2006-04-23 5:52 ` David S. Miller
2006-04-23 9:23 ` Avi Kivity
2006-04-23 5:51 ` David S. Miller
2006-04-23 5:56 ` David S. Miller
2006-04-23 14:15 ` Ingo Oeser
2006-04-22 19:30 ` bert hubert
2006-04-23 5:53 ` David S. Miller
2006-04-21 8:53 ` Jan Kiszka
2006-04-24 14:22 ` Esben Nielsen
2006-04-27 8:09 ` Jan Kiszka
2006-04-27 8:16 ` David S. Miller
2006-04-27 10:00 ` Jan Kiszka
2006-04-27 19:50 ` David S. Miller
[not found] <63KcN-6lD-25@gated-at.bofh.it>
[not found] ` <64wrg-2cg-41@gated-at.bofh.it>
[not found] ` <64wAE-2Cs-9@gated-at.bofh.it>
[not found] ` <64AkV-8cG-7@gated-at.bofh.it>
[not found] ` <65cqo-5tR-33@gated-at.bofh.it>
[not found] ` <65cJF-66i-11@gated-at.bofh.it>
2006-04-24 23:48 ` Robert Hancock