netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 19:30 Caitlin Bestler
  2006-04-26 19:46 ` Jeff Garzik
  2006-04-27  3:40 ` Rusty Russell
  0 siblings, 2 replies; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-26 19:30 UTC (permalink / raw)
  To: David S. Miller; +Cc: kelly, netdev, rusty

David S. Miller wrote:

> 
> I personally think allowing sockets to trump firewall rules
> is an acceptable relaxation of the rules in order to simplify
> the implementation.

I agree.  I have never seen a set of netfilter rules that
would block arbitrary packets *within* an established connection.

Technically you can create such rules, but every single set
of rules actually deployed that I have ever seen started with
a rule to pass all packets for established connections, and
then proceeded to control which connections could be initiated
or accepted.




^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-28 23:45 Caitlin Bestler
  0 siblings, 0 replies; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-28 23:45 UTC (permalink / raw)
  To: David S. Miller, rusty; +Cc: johnpol, kelly, netdev

David S. Miller wrote:
> From: Rusty Russell <rusty@rustcorp.com.au>
> Date: Sat, 29 Apr 2006 08:04:04 +1000
> 
>> You're still thinking you can bypass classifiers for established
>> sockets, but I really don't think you can.  I think the simplest
>> solution is to effectively remove from (or flag) the established &
>> listening hashes anything which could be effected by classifiers, so
>> those packets get send through the default channel.
> 
> OK, when rules are installed, the socket channel mappings are
> flushed.  This is your idea right?

You mean when new rules are installed that would conflict with
an existing mapping, right?

Bumping every connection out of vj-channel mode whenever any new
rule was installed would be very counter-productive.

Ultimately, you only want a direct-to-user vj-channel when all
packets assigned to it would be passed by netchannels, and maybe
increment a single packet counter. Checking a single QoS rate
limiter may be possible too, but if there are more complex
rules then the channel has to be kept in kernel because it
wouldn't make sense to trust user-mode code to apply the
netchannel rules reliably.


^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-28 17:55 Caitlin Bestler
  2006-04-28 22:17 ` Rusty Russell
  0 siblings, 1 reply; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-28 17:55 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David S. Miller, kelly, rusty, netdev

Evgeniy Polyakov wrote:

> 
> I see your point, and respectfully disagree.
> The more complex userspace interface we create the less users
> it will have. It is completely unconvenient to read 100 bytes
> and receive only 80, since 20 were eaten by header. And what
> if we need only 20, but packet contains 100, introduce per packet
> head pointer? For purpose of benchmarking it works perfectly - read
> the whole packet, one can event touch that data to emulate real
> work, but for the real world it becomes practically unusabl.
> 

In a straight-forward user-mode library using existing interfaces the
message would be interleaved with the headers in the inbound ring.
While the inbound ring is part of user memory, it is not what the
user would process from, that would be the buffer they supplied 
in a call to read() or recvmsg(), that buffer would have to make
no allowances for interleaved headers.

Enabling zero-copy when a buffer is pre-posted is possible, but
modestly complex. Research on MPI and SDP have generally
shown that the unless the pinning overhead is eliminated somehow
that the buffers have to be quite large before zero-copy reception
becomes a benefit. vj_netchannels represent a strategy of minimizing
registration/pinning costs even if it means paying for an extra copy.
Because the extra copy is closely tied to the activation of the data
sink consumer the cost of that extra copy is greatly reduced because
it places the data in the cache immediately before the application
will in fact use the received data.

Also keep in mind that once the issues are resolved to allow the
netchannel rings to be directly visible to a user-mode client that
enhanced/specialized interfaces can easily be added in user-mode
libraries. So focusing on supporting existing conventional interfaces
is probably the best approach for the initial efforts.


^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-28 17:02 Caitlin Bestler
  2006-04-28 17:18 ` Stephen Hemminger
  2006-04-28 17:25 ` Evgeniy Polyakov
  0 siblings, 2 replies; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-28 17:02 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David S. Miller, kelly, rusty, netdev

Evgeniy Polyakov wrote:
> On Fri, Apr 28, 2006 at 08:59:19AM -0700, Caitlin Bestler
> (caitlinb@broadcom.com) wrote:
>>> Btw, how is it supposed to work without header split capabale
>>> hardware?
>> 
>> Hardware that can classify packets is obviously capable of doing
>> header data separation, but that does not mean that it has to do so.
>> 
>> If the host wants header data separation it's real value is that when
>> packets arrive in order that fewer distinct copies are required to
>> move the data to the user buffer (because separated data can be
>> placed back-to-back in a data-only ring). But that's an
>> optimization, it's not needed to make the idea worth doing, or even
>> necessarily in the first implementation.
> 
> If there is dataflow, not flow of packets or flow of data
> with holes, it could be possible to modify recv() to just
> return the right pointer, so in theory userspace
> modifications would be minimal.
> With copy in place it completely does not differ from current
> design with copy_to_user() being used since memcpy() is just
> slightly faster than copy*user().

If the app is really ready to use a modified interface we might as well
just give them a QP/CQ interface. But I suppose "receive by pointer"
interfaces don't really stretch the sockets interface all that badly.
The key is that you have to decide how the buffer is released,
is it the next call? Or a separate call? Does releasing buffer
N+2 release buffers N and N+1? What you want to avoid 
is having to keep a scoreboard of which buffers have been
released.

But in context, header/data separation would allow in order
packets to have the data be placed back to back, which 
could allow a single recv to report the payload of multiple
successive TCP segments. So the benefit of header/data
separation remains the same, and I still say it's a optimization
that should not be made a requirement. The benefits of vj_channels
exist even without them. When the packet classifier runs on the
host, header/data separation would not be free. I want to enable
hardware offloads, not make the kernel bend over backwards
to emulate how hardware would work. I'm just hoping that we
can agree to let hardware do its work without being forced to
work the same way the kernel does (i.e., running down a long
list of arbitrary packet filter rules on a per packet basis).



^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-28 15:59 Caitlin Bestler
  2006-04-28 16:12 ` Evgeniy Polyakov
  0 siblings, 1 reply; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-28 15:59 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David S. Miller, kelly, rusty, netdev

Evgeniy Polyakov wrote:
> On Thu, Apr 27, 2006 at 02:12:09PM -0700, Caitlin Bestler
> (caitlinb@broadcom.com) wrote:
>> So the real issue is when there is an intelligent device that uses
>> hardware packet classification to place the packet in the correct
>> ring. We don't want to bypass packet filtering, but it would be
>> terribly wasteful to reclassify the packet.
>> Intelligent NICs will have packet classification capabilities to
>> support RDMA and iSCSI. Those capabilities should be available to
>> benefit SOCK_STREAM and SOCK_DGRAM users as well without it being a
>> choice of either turning all stack control over to the NIC or
>> ignorign all NIC capabilities beyound pretending to be a dumb
>> Ethernet NIC. 
> 
> Btw, how is it supposed to work without header split capabale
> hardware?

Hardware that can classify packets is obviously capable of doing
header data separation, but that does not mean that it has to do so.

If the host wants header data separation it's real value is that when
packets arrive in order that fewer distinct copies are required to
move the data to the user buffer (because separated data can
be placed back-to-back in a data-only ring). But that's an optimization,
it's not needed to make the idea worth doing, or even necessarily
in the first implementation.


^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-27 21:12 Caitlin Bestler
  2006-04-28  6:10 ` Evgeniy Polyakov
  2006-04-28  8:24 ` Rusty Russell
  0 siblings, 2 replies; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-27 21:12 UTC (permalink / raw)
  To: David S. Miller, johnpol; +Cc: kelly, rusty, netdev

netdev-owner@vger.kernel.org wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Thu, 27 Apr 2006 15:51:26 +0400
> 
>> There are some caveats here found while developing zero-copy sniffer
>> [1]. Project's goal was to remap skbs into userspace in real-time.
>> While absolute numbers (posted to netdev@) were really high, it is
>> only applicable to read-only application. As was shown in IOAT
>> thread, data must be warmed in caches, so reading from mapped area
>> will be as fast as memcpy() (read+write), and copy_to_user()
>> actually almost equal to memcpy() (benchmarks were posted to
>> netdev@). And we must add remapping overhead.
> 
> Yes, all of these issues are related quite strongly.  Thanks
> for making the connection explicit.
> 
> But, the mapping overhead is zero for this net channel stuff,
> at least as it is implemented and designed by Kelly.  Ring
> buffer is setup ahead of time into the user's address space,
> and a ring of buffers into that area are given to the networking card.
> 
> We remember the translations here, so no get_user_pages() on
> each transfer and garbage like that.  And yes this all harks
> back to the issues that are discussed in Chapter 5 of
> Networking Algorithmics.
> But the core thing to understand is that by defining a new
> API and setting up the buffer pool ahead of time, we avoid all of the
> get_user_pages() overhead while retaining full kernel/user protection.
> 
> Evgeniy, the difference between this and your work is that
> you did not have an intelligent piece of hardware that could
> be told to recognize flows, and only put packets for a
> specific flow into that's flow's buffer pool.
> 
>> If we want to dma data from nic into premapped userspace area, this
>> will strike with message sizes/misalignment/slow read and so on, so
>> preallocation has even more problems.
> 
> I do not really think this is an issue, we put the full
> packet into user space and teach it where the offset is to
> the actual data.
> We'll do the same things we do today to try and get the data
> area aligned.  User can do whatever is logical and relevant
> on his end to deal with strange cases.
> 
> In fact we can specify that card has to take some care to get
> data area of packet aligned on say an 8 byte boundary or
> something like that.  When we don't have hardware assist, we
> are going to be doing copies.
> 
>> This change also requires significant changes in application, at
>> least until recv/send are changed, which is not the best thing to do.
> 
> This is exactly the point, we can only do a good job and
> receive zero copy if we can change the interfaces, and that's
> exactly what we're doing here.
> 
>> I do think that significant win in VJ's tests belongs not to
>> remapping and cache-oriented changes, but to move all protocol
>> processing into process' context.
> 
> I partly disagree.  The biggest win is eliminating all of the
> control overhead (all of "softint RX + protocol demux + IP
> route lookup + socket lookup" is turned into single flow
> demux), and the SMP safe data structure which makes it
> realistic enough to always move the bulk of the packet work
> to the socket's home cpu.
> 
> I do not think userspace protocol implementation buys enough
> to justify it.  We have to do the protection switch in and
> out of kernel space anyways, so why not still do the
> protected protocol processing work in the kernel?  It is
> still being done on the user's behalf, contributes to his
> time slice, and avoids all of the terrible issues of
> userspace protocol implementations.
> 
> So in my mind, the optimal situation from both a protection
> preservation and also a performance perspective is net
> channels to kernel socket protocol processing, buffers DMA'd
> directly into userspace if hardware assist is present.
> 

Having a ring that is already flow qualified is indeed the
most important savings, and worth pursuing even if reaching
consensus on how to safely enable user-mode L4 processing.
The latter *can* be a big advantage when the L4 processing
can be done based on a user-mode call from an already
scheduled process. But the benefit is not there for a process
that needs to be woken up each time it receives a short request.

So the real issue is when there is an intelligent device that
uses hardware packet classification to place the packet in
the correct ring. We don't want to bypass packet filtering,
but it would be terribly wasteful to reclassify the packet.
Intelligent NICs will have packet classification capabilities
to support RDMA and iSCSI. Those capabilities should be available
to benefit SOCK_STREAM and SOCK_DGRAM users as well without it
being a choice of either turning all stack control over to
the NIC or ignorign all NIC capabilities beyound pretending
to be a dumb Ethernet NIC.

For example, counting packets within an approved connection
is a valid goal that the final solution should support. But
would a simple count be sufficient, or do we truly need the
full flexibility currently found in netfilter?

Obviously all of this does not need to be resolved in full
detail, but there should be some sense of the direction so
that data structures can be designed properly. My assumption
is that each input ring has a matching output ring, and that
the output ring cannot be used to send packets that would
not be matched by the reverse rule for the paired input ring.
So the information that supports enforcing that rule needs
to be stored somewhere other than the ring itself.


^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-27  1:02 Caitlin Bestler
  2006-04-27  6:08 ` David S. Miller
  0 siblings, 1 reply; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-27  1:02 UTC (permalink / raw)
  To: David S. Miller; +Cc: jeff, kelly, netdev, rusty

netdev-owner@vger.kernel.org wrote:
> From: "Caitlin Bestler" <caitlinb@broadcom.com>
> Date: Wed, 26 Apr 2006 15:53:44 -0700
> 
>> The netchannel qualifiers should only deal with TCP packets for
>> established connections. Listens would continue to be dealt with by
>> the existing stack logic, vj_channelizing only occurring when the the
>> connection was accepted.
> 
> I consider netchannel support for listening TCP sockets to be
> absolutely essential. -

Meaning that inbound SYNs would be placed in a net channel
for processing by a Consumer at the other end of the ring?

If so the rules filtering SYNs would have to be applied either
before it went into the ring, or when the consumer end takes
them out. The latter makes more sense to me, because the rules
about what remote hosts can initiate a connection request to
a given TCP port can be fairly complex for a variety of
legitimate reasons.

Would it be reasonable to state that a net channel carrying
SYNs should not be set up when the consumer is a user mode
process?




^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 22:53 Caitlin Bestler
  2006-04-26 22:59 ` David S. Miller
  0 siblings, 1 reply; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-26 22:53 UTC (permalink / raw)
  To: David S. Miller, jeff; +Cc: kelly, netdev, rusty

David S. Miller wrote:
> From: Jeff Garzik <jeff@garzik.org>
> Date: Wed, 26 Apr 2006 15:46:58 -0400
> 
>> Oh, there are plenty of examples of filtering within an established
>> connection:  input rules.  I've seen "drop all packets from <these>
>> IPs" type rules frequently.  Victims of DoS use those kinds of rules
>> to stop packets as early as possible.
> 
> Yes, good point, but this applies to listening connections.
> 
> We'll need to figure out a way to deal with this.
> 
> It occurs to me that for established connections, netfilter
> can simply remove all matching entries from the netchannel lookup
> tables. 
> 
> But that still leaves the thorny listening socket issue.
> This may by itself make netfilter netchannel support
> important and that brings up a lot of issues about classifier
> algorithms. 
> 
> All of this I wanted to avoid as we start this work :-)
> 
> We can think about how to approach these other problems and
> start with something simple meanwhile.  That seems to me to
> be the best approach moving forward.
> 
> It's important to start really simple else we'll just keep
> getting bogged down in complexity and details and never
> implement anything.

How does this sound?

The netchannel qualifiers should only deal with TCP packets
for established connections. Listens would continue to be 
dealt with by the existing stack logic, vj_channelizing
only occurring when the the connection was accepted.

The vj_netchannel qualifiers would conceptually take place
before the netfilter rules (to avoid making deployment
of netchannels dependent on netfilter) but their creation
would have to be approved by netfilter (if netfiler was
active). Netfilter could also revoke vj_channel qualifiers.

If the rule is that "if a vj_netchannel rule exists then it
must be ok with netfilter" is actually very easy to implement.
During early development you simply tell the testers "hey,
don't set up any netchannels that netfilter would reject"
and defer implementing enforcement until after the netchannels
code actually works. After all, if it is isn't actually successfully
transmitting or receiving packets yet it can't really be acting
contrary to netfilter policy.




^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 20:20 Caitlin Bestler
  2006-04-26 22:35 ` David S. Miller
  0 siblings, 1 reply; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-26 20:20 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: David S. Miller, kelly, netdev, rusty

Jeff Garzik wrote:
> Caitlin Bestler wrote:
>> David S. Miller wrote:
>> 
>>> I personally think allowing sockets to trump firewall rules is an
>>> acceptable relaxation of the rules in order to simplify the
>>> implementation.
>> 
>> I agree.  I have never seen a set of netfilter rules that would block
>> arbitrary packets *within* an established connection.
>> 
>> Technically you can create such rules, but every single set of rules
>> actually deployed that I have ever seen started with a rule to pass
>> all packets for established connections, and then proceeded to
>> control which connections could be initiated or accepted.
> 
> Oh, there are plenty of examples of filtering within an established
> connection:  input rules.  I've seen "drop all packets from <these>
> IPs" type rules frequently.  Victims of DoS use those kinds of
> rules to stop packets as early as possible.
> 
> 	Jeff

If you are dropping all packets from IP X, then how was the connection
established? Obviously we are only dealing with connections that
were established before the rule to drop all packets from IP X
was created.

That calls for an ability to revoke the assignment of any flow to
a vj_netchannel when a new rule is created that would filter any
packet that would be classified by the flow.

Basically the rule is that a delegation to a vj_netchannel is
only allowed for flows where *all* packets assigned to that flow
(input or output) would receive a 'pass' from netchannels.

That makes sense.  What I don't see a need for is examing *each*
delegated packet against the entire set of existing rules. Basically,
a flow should either be rule-compliant or not. If it is not, then
the delegation of the flow should be abandoned. If that requires
re-importing TCP state, then perhaps the TCP connection needs to
be aborted.

In any event, if netfilter is selectively rejecting packets in the
middle
of a connection then the connection is going to fail anyway. 





^ permalink raw reply	[flat|nested] 78+ messages in thread
* RE: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 16:57 Caitlin Bestler
  2006-04-26 19:23 ` David S. Miller
  0 siblings, 1 reply; 78+ messages in thread
From: Caitlin Bestler @ 2006-04-26 16:57 UTC (permalink / raw)
  To: David S. Miller, kelly; +Cc: netdev, rusty

netdev-owner@vger.kernel.org wrote:
> Ok I have comments already just glancing at the initial patch.
> 
> With the 32-bit descriptors in the channel, you indeed end up
> with a fixed sized pool with a lot of hard-to-finesse sizing
> and lookup problems to solve.
> 
> So what I wanted to do was finesse the entire issue by simply
> side-stepping it initially.  Use a normal buffer with a tail
> descriptor, when you enqueue you give a tail descriptor pointer.
> 
> Yes, it's weirder to handle this in hardware, but it's not
> impossible and using real pointers means two things:
> 
> 1) You can design a simple netif_receive_skb() channel that works
>    today, encapsulation of channel buffers into an SKB is like
>    15 lines of code and no funny lookups.
> 
> 2) People can start porting the input path of drivers right now and
>    retain full functionality and test anything they want.  This is
>    important for getting the drivers stable as fast as possible.
> 
> And it also means we can tackle the buffer pool issue of the
> 32-bit descriptors later, if we actually want to do things
> that way, I think we probably don't.
> 
> To be honest, I don't think using a 32-bit descriptor is so
> critical even from a hardware implementation perspective.
> Yes, on 64-bit you're dealing with a 64-bit quantity so the
> number of entries in the channel are halfed from what a 32-bit arch
> uses. 
> 
> Yes I say this for 2 reasons:
> 
> 1) We have no idea whether it's critical to have "~512" entries
>    in the channel which is about what a u32 queue entry type
>    affords you on x86 with 4096 byte page size.
> 
> 2) Furthermore, it is sized by page size, and most 64-bit platforms
>    use an 8K base page size anyways, so the number of queue entries
>    ends of being the same.  Yes, I know some 64-bit platforms use
>    a 4K page size, please see #1 :-)
> 
> I really dislike the pools of buffers, partly because they
> are fixed size (or dynamically sized and even more expensive
> to implement), but moreso because there is all of this
> absolutely stupid state management you eat just to get at the
> real data.  That's pointless, we're trying to make this as
> light as possible.  Just use real pointers and describe the
> packet with a tail descriptor.
> 
> We can use a u64 or whatever in a hardware implementation.
> 
> Next, you can't even begin to work on the protocol channels
> before you do one very important piece of work.  Integration
> of all of the ipv4 and ipv6 protocol hash tables into a
> central code, it's a total prerequisite.  Then you modify
> things to use a generic
> inet_{,listen_}lookup() or inet6_{,listen_}lookup() that
> takes a protocol number as well as saddr/daddr/sport/dport
> and searches from a central table.
> 
> So I think I'll continue working on my implementation, it's
> more transitional and that's how we have to do this kind of work. -

The major element I liked about Kelly's approach is that the ring
is clearly designed to allow a NIC to place packets directly into
a ring that is directly accessible by the user. Evolutionary steps
are good, but isn't direct placement into a user-accessible simple
ring buffer the ultimate justification of netchannels?

But that doesn't mean that we have to have a very artificial definition
of the ring based on presumptions that hardware only understands 512<<n
sized buffers. Hardware today is typically just as smart as the
processors
that IP networks were first designed on, if not more so.

Central integration also will need to be integrated with packet
filtering.
In particular, once a flow has been assigned to a netchannel ring, who
is
responsible for doing the packet filtering? Or is it enough to check the
packet filter when the net channel flow is created?




^ permalink raw reply	[flat|nested] 78+ messages in thread
* [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 11:47 Kelly Daly
  2006-04-26  7:33 ` David S. Miller
  2006-04-26  7:59 ` David S. Miller
  0 siblings, 2 replies; 78+ messages in thread
From: Kelly Daly @ 2006-04-26 11:47 UTC (permalink / raw)
  To: netdev; +Cc: rusty, davem

Hey guys...  I've been working with Rusty on a VJ Channel implementation.  
Noting Dave's recent release of his implementation, we thought we'd better 
get this "out there" so we can do some early comparison/combining and 
come up with the best possible implementation.

There are three patches in total:
1) vj_core.patch - core files for VJ to userspace
2) vj_udp.patch  - badly hacked up UDP receive implementation - basically just to test what logic may be like!
3) vj_ne2k.patch - modified NE2K and 8390 used for testing on QEMU

Notes:
* channels can have global or local buffers (local for userspace.  Could be used directly by intelligent NIC)
* UDP receive breaks real UDP - doesn't talk anything except VJ Channels anymore.  Needs integration with normal sources.
* Userspace test app (below) uses VJ protocol family to mmap space for local buffers, if it receives buffers in kernel space sends a request for that buffer to be copied to local buffer.
* Default channel converts to skb and feeds through normal receive path.

TODO:
* send not yet implemented
* integrate non vj
* LOTS of fixmes

Cheers,
Kelly



Test userspace app:
/*  Van Jacobson net channels implementation for Linux
    Copyright (C) 2006  Kelly Daly <kdaly@au.ibm.com>  IBM Corporation

    This program is free software; you can redistribute it and/or modify
    it under the terms of the GNU General Public License as published by
    the Free Software Foundation; either version 2 of the License, or
    (at your option) any later version.

    This program is distributed in the hope that it will be useful,
    but WITHOUT ANY WARRANTY; without even the implied warranty of
    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
    GNU General Public License for more details.

    You should have received a copy of the GNU General Public License
    along with this program; if not, write to the Free Software
    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <sys/poll.h>
#include <netinet/in.h>
#include "linux-2.6.16/include/linux/types.h"
#include "linux-2.6.16/include/linux/vjchan.h"

//flowid
#define SADDR 0
#define DADDR 0
#define SPORT 0
#define DPORT 60000
#define IFINDEX 0

#define PF_VJCHAN 27

static struct vj_buffer *get_buffer(struct vj_channel_ring *ring, int desc_num)
{
        printf("desc_num %i\n", desc_num);
        return (void *)ring + (desc_num + 1) * getpagesize();
}
/* return the next buffer, but do not move on */
static struct vj_buffer *vj_peek_next_buffer(struct vj_channel_ring *ring)
{
        if (ring->c.head == ring->p.tail)
                return NULL;
        return get_buffer(ring, ring->q[ring->c.head]);
}

/* move on to next buffer */
static void vj_done_with_buffer(struct vj_channel_ring *ring)
{
        ring->c.head = (ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;

        printf("done_with_buffer\n\n");
}

int main(int argc, char *argv[])
{
        int sk, cls, bnd, pll;
        void * mmapped;
        struct vj_flowid flowid;
        struct vj_channel_ring *ring;
        struct vj_buffer *buf;
        struct pollfd pfd;

        printf("\nstart of vjchannel socket test app\n");
        sk = socket(PF_VJCHAN, SOCK_DGRAM, IPPROTO_UDP);
        if (sk == -1) {
                perror("Unable to open socket!");
                return -1;
        }
        printf("socket open with ret code %i\n\n", sk);

//create flowid!!!
        flowid.saddr = SADDR;
        flowid.daddr = DADDR;
        flowid.sport = SPORT;
        flowid.dport = htons(DPORT);
        flowid.ifindex = IFINDEX;
        flowid.proto = IPPROTO_UDP;

        printf("flowid created\n");

        bnd = bind(sk, (struct sockaddr *)&flowid, sizeof(struct vj_flowid));
        if (bnd == -1) {
                perror("Unable to bind socket!");
                return -1;
        }
        printf("socket bound with ret code %i\n\n", bnd);

        ring = mmap(0, (getpagesize() * (VJ_NET_CHANNEL_ENTRIES+1)), PROT_READ|PROT_WRITE, MAP_SHARED, sk, 0);
        if (ring == MAP_FAILED) {
                perror ("Unable to mmap socket!");
                return -1;
        }
        printf("socket mmapped to address %lu\n\n", (unsigned long)mmapped);
        
        pfd.fd = sk;
        pfd.events = POLLIN;

        for (;;) {
                pll = poll(&pfd, 1, -1);
                
                if (pll < 0) {
                        perror("polling failed!");
                        return -1;
                }

//consume
                buf = vj_peek_next_buffer(ring);

                printf("buf %p\n", buf);

//print data, not headers
                printf("   Buffer Length = %i\n", buf->data_len);
                printf("   Header Length = %i\n", buf->header_len);
                printf("   Buffer Data: '%.*s'\n", buf->data_len - 28, buf->data + buf->header_len + 28);
                vj_done_with_buffer(ring);
        }

        cls = close(sk);
        if (cls != 0) {
                perror("Unable to close socket!");
                return -2;
        }
        printf("socket closed with ret code %i\n\n", cls);
        return 0;
}




-------------------------
Signed-off-by: Kelly Daly <kelly@au.ibm.com>

Basic infrastructure for Van Jacobson net channels: lockless ringbuffer for buffer transport.  Entries in ring buffer are descriptors for global or local buffers: ring and local buffers are mmapped into userspace.
Channels are registered with the core by flowid, and a thread services the default channel for any non-matching packets.  Drivers get (global) buffers from vj_get_buffer, and dispatch them through vj_netif_rx.
As userspace mmap cannot reach global buffers, select() copies global buffers into local buffers if required.


diff -r 47031a1f466c linux-2.6.16/include/linux/socket.h
--- linux-2.6.16/include/linux/socket.h Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/include/linux/socket.h Mon Apr 24 19:50:46 2006
@@ -186,6 +187,7 @@
 #define AF_PPPOX       24      /* PPPoX sockets                */
 #define AF_WANPIPE     25      /* Wanpipe API Sockets */
 #define AF_LLC         26      /* Linux LLC                    */
+#define AF_VJCHAN      27      /* VJ Channel */
 #define AF_TIPC                30      /* TIPC sockets                 */
 #define AF_BLUETOOTH   31      /* Bluetooth sockets            */
 #define AF_MAX         32      /* For now.. */
@@ -219,7 +221,8 @@
 #define PF_PPPOX       AF_PPPOX
 #define PF_WANPIPE     AF_WANPIPE
 #define PF_LLC         AF_LLC
+#define PF_VJCHAN      AF_VJCHAN
 #define PF_TIPC                AF_TIPC
 #define PF_BLUETOOTH   AF_BLUETOOTH
 #define PF_MAX         AF_MAX

diff -r 47031a1f466c linux-2.6.16/net/Kconfig
--- linux-2.6.16/net/Kconfig    Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/Kconfig    Mon Apr 24 19:50:46 2006
@@ -65,6 +65,12 @@
 source "net/ipv6/Kconfig"
 
 endif # if INET
+
+config VJCHAN
+       bool "Van Jacobson Net Channel Support (EXPERIMENTAL)"
+       depends on EXPERIMENTAL
+       ---help---
+         This adds a userspace-accessible packet receive interface.  Say N.
 
 menuconfig NETFILTER
        bool "Network packet filtering (replaces ipchains)"
diff -r 47031a1f466c linux-2.6.16/net/Makefile
--- linux-2.6.16/net/Makefile   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/Makefile   Mon Apr 24 19:50:46 2006
@@ -46,6 +46,7 @@
 obj-$(CONFIG_IP_SCTP)          += sctp/
 obj-$(CONFIG_IEEE80211)                += ieee80211/
 obj-$(CONFIG_TIPC)             += tipc/
+obj-$(CONFIG_VJCHAN)           += vjchan/
 
 ifeq ($(CONFIG_NET),y)
 obj-$(CONFIG_SYSCTL)           += sysctl_net.o
diff -r 47031a1f466c linux-2.6.16/include/linux/vjchan.h
--- /dev/null   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/include/linux/vjchan.h Mon Apr 24 19:50:46 2006
@@ -0,0 +1,79 @@
+#ifndef _LINUX_VJCHAN_H
+#define _LINUX_VJCHAN_H
+
+/* num entries in channel q: set so consumer is at offset 1024. */
+#define VJ_NET_CHANNEL_ENTRIES 254
+/* identifies non-local buffers (ie. need kernel to copy to a local) */
+#define VJ_HIGH_BIT 0x80000000
+
+struct vj_producer {
+       __u16 tail;                     /* next element to add */
+       __u8 wakecnt;                   /* do wakeup if != consumer wakecnt */
+       __u8 pad;
+       __u16 old_head;                 /* last cleared buffer posn +1 */
+       __u16 pad2;
+};
+
+struct vj_consumer {
+       __u16 head;                     /* next element to remove */
+       __u8 wakecnt;                   /* increment to request wakeup */
+};
+
+/* mmap returns one of these, followed by 254 pages with a buffer each */
+struct vj_channel_ring {
+       struct vj_producer p;           /* producer's header */
+       __u32 q[VJ_NET_CHANNEL_ENTRIES];
+       struct vj_consumer c;           /* consumer's header */
+};
+
+struct vj_buffer {
+       __u32 data_len;         /* length of actual data in buffer */
+       __u32 header_len;       /* offset eth + ip header (true for now) */
+       __u32 ifindex;          /* interface the packet came in on. */
+       char data[0];
+};
+
+/* Currently assumed IPv4 */
+struct vj_flowid
+{
+       __u32 saddr, daddr;
+       __u16 sport, dport;
+       __u32 ifindex;
+       __u16 proto;
+};
+
+#ifdef __KERNEL__
+struct net_device;
+struct sk_buff;
+
+struct vj_descriptor {
+       unsigned long address;          /* address of net_channel_buffer */
+       unsigned long buffer_len;       /* max length including header */
+};
+
+/* Everything about a vj_channel */
+struct vj_channel
+{
+       struct vj_channel_ring *ring;
+       wait_queue_head_t wq;
+       struct list_head list;
+       struct vj_flowid flowid;
+       int num_local_buffers;
+       struct vj_descriptor *descs;
+        unsigned long * used_descs;
+};
+
+void vj_inc_wakecnt(struct vj_channel *chan);
+struct vj_buffer *vj_get_buffer(int *desc_num);
+void vj_netif_rx(struct vj_buffer *buffer, int desc_num, unsigned short proto);
+int vj_xmit(struct sk_buff *skb, struct net_device *dev);
+struct vj_channel *vj_alloc_chan(int num_buffers);
+void vj_register_chan(struct vj_channel *chan, const struct vj_flowid *flowid);
+void vj_unregister_chan(struct vj_channel *chan);
+void vj_free_chan(struct vj_channel *chan);
+struct vj_buffer *vj_peek_next_buffer(struct vj_channel *chan);
+void vj_done_with_buffer(struct vj_channel *chan);
+unsigned short eth_vj_type_trans(struct vj_buffer *buffer);
+int vj_need_local_buffer(struct vj_channel *chan);
+#endif
+#endif /* _LINUX_VJCHAN_H */
diff -r 47031a1f466c linux-2.6.16/net/vjchan/Makefile
--- /dev/null   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/Makefile    Mon Apr 24 19:50:46 2006
@@ -0,0 +1,3 @@
+#obj-m += vjtest.o
+obj-y += vjnet.o
+obj-y += af_vjchan.o
diff -r 47031a1f466c linux-2.6.16/net/vjchan/af_vjchan.c
--- /dev/null   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/af_vjchan.c Mon Apr 24 19:50:46 2006
@@ -0,0 +1,198 @@
+/*  Van Jacobson net channels implementation for Linux
+    Copyright (C) 2006  Kelly Daly <kdaly@au.ibm.com>  IBM Corporation
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+*/
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/socket.h>
+#include <linux/vjchan.h>
+#include <net/sock.h>
+
+struct vjchan_sock
+{
+       struct sock sk;
+       struct vj_channel *chan;
+       int vj_reg_flag;
+};
+
+static inline struct vjchan_sock *vj_sk(struct sock *sk)
+{
+       return (struct vjchan_sock *)sk;
+}
+
+static struct proto vjchan_proto = {
+       .name = "VJCHAN",
+       .owner = THIS_MODULE,
+       .obj_size = sizeof(struct vjchan_sock),
+};
+
+int vjchan_release(struct socket *sock)
+{
+       struct sock *sk = sock->sk;
+
+       sock_orphan(sk);
+       sock->sk = NULL;
+       sock_put(sk);
+       return 0;
+}
+
+int vjchan_bind(struct socket *sock, struct sockaddr *addr, int sockaddr_len)
+{
+       struct sock *sk = sock->sk;
+       struct vjchan_sock *vjsk;
+       struct vj_flowid *flowid = (struct vj_flowid *)addr;
+
+       /* FIXME: avoid clashing with normal sockets, replace zeroes. */
+       vjsk = vj_sk(sk);
+       vj_register_chan(vjsk->chan, flowid);
+       vjsk->vj_reg_flag = 1;
+
+       return 0;
+}
+
+int vjchan_getname(struct socket *sock, struct sockaddr *addr,
+                  int *sockaddr_len, int peer)
+{
+       /* FIXME: Implement */
+       return 0;
+}
+
+unsigned int vjchan_poll(struct file *file, struct socket *sock,
+                        struct poll_table_struct *wait)
+{
+       struct sock *sk = sock->sk;
+       struct vj_channel *chan = vj_sk(sk)->chan;
+
+       poll_wait(file, &chan->wq, wait);
+       vj_inc_wakecnt(chan);
+
+       if (vj_peek_next_buffer(chan) && vj_need_local_buffer(chan) == 0)
+               return POLLIN | POLLRDNORM;
+
+       return 0;
+}
+
+/* We map the ring first, then one page per buffer. */
+int vjchan_mmap(struct file *file, struct socket *sock,
+               struct vm_area_struct *vma)
+{
+       struct sock *sk = sock->sk;
+       struct vj_channel *chan = vj_sk(sk)->chan;
+       int i, vip;
+       unsigned long pos;
+
+       if (vma->vm_end - vma->vm_start !=
+           (1 + chan->num_local_buffers)*PAGE_SIZE)
+               return -EINVAL;
+
+       pos = vma->vm_start;
+       vip = vm_insert_page(vma, pos, virt_to_page(chan->ring));
+       pos += PAGE_SIZE;
+       for (i = 0; i < chan->num_local_buffers; i++) {
+               vip = vm_insert_page(vma, pos, virt_to_page(chan->descs[i].address));
+               pos += PAGE_SIZE;
+       }
+       return 0;
+}
+
+const struct proto_ops vjchan_ops = {
+       .family = PF_VJCHAN,
+       .owner = THIS_MODULE,
+       .release = vjchan_release,
+       .bind = vjchan_bind,
+       .socketpair = sock_no_socketpair,
+       .accept = sock_no_accept,
+       .getname = vjchan_getname,
+       .poll = vjchan_poll,
+       .ioctl = sock_no_ioctl,
+       .shutdown = sock_no_shutdown,
+       .setsockopt = sock_common_setsockopt,
+       .getsockopt = sock_common_getsockopt,
+       .sendmsg = sock_no_sendmsg,
+       .recvmsg = sock_no_recvmsg,
+       .mmap = vjchan_mmap,
+       .sendpage = sock_no_sendpage
+};
+
+static void vjchan_destruct(struct sock *sk)
+{
+       struct vjchan_sock *vjsk;
+
+       vjsk = vj_sk(sk);
+       if (vjsk->vj_reg_flag) {
+               vj_unregister_chan(vjsk->chan);
+               vjsk->vj_reg_flag = 0;
+       }
+       vj_free_chan(vjsk->chan);
+
+}
+
+static int vjchan_create(struct socket *sock, int protocol)
+{
+       struct sock *sk;
+       struct vjchan_sock *vjsk;
+       int err;
+
+       if (!capable(CAP_NET_RAW))
+               return -EPERM;
+       if (sock->type != SOCK_DGRAM
+           && sock->type != SOCK_RAW
+           && sock->type != SOCK_PACKET)
+               return -ESOCKTNOSUPPORT;
+
+       sock->state = SS_UNCONNECTED;
+
+       err = -ENOBUFS;
+       sk = sk_alloc(PF_VJCHAN, GFP_KERNEL, &vjchan_proto, 1);
+       if (sk == NULL)
+               goto out;
+
+       sock->ops = &vjchan_ops;
+
+       sock_init_data(sock, sk);
+       sk->sk_family = PF_VJCHAN;
+       sk->sk_destruct = vjchan_destruct;
+
+       vjsk = vj_sk(sk);
+       vjsk->chan = vj_alloc_chan(VJ_NET_CHANNEL_ENTRIES);
+       vjsk->vj_reg_flag = 0;
+       if (!vjsk->chan)
+               return -ENOMEM;
+       return 0;
+out:
+       return err;
+}
+
+static struct net_proto_family vjchan_family_ops = {
+       .family =       PF_VJCHAN,
+       .create =       vjchan_create,
+       .owner  =       THIS_MODULE,
+};
+
+static void __exit vjchan_exit(void)
+{
+       sock_unregister(PF_VJCHAN);
+}
+
+static int __init vjchan_init(void)
+{
+       return sock_register(&vjchan_family_ops);
+}
+
+module_init(vjchan_init);
+module_exit(vjchan_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_VJCHAN);
diff -r 47031a1f466c linux-2.6.16/net/vjchan/vjnet.c
--- /dev/null   Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/vjnet.c     Mon Apr 24 19:50:46 2006
@@ -0,0 +1,550 @@
+/*  Van Jacobson net channels implementation for Linux
+    Copyright (C) 2006  Kelly Daly <kdaly@au.ibm.com>  IBM Corporation
+
+    This program is free software; you can redistribute it and/or modify
+    it under the terms of the GNU General Public License as published by
+    the Free Software Foundation; either version 2 of the License, or
+    (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU General Public License for more details.
+
+    You should have received a copy of the GNU General Public License
+    along with this program; if not, write to the Free Software
+    Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
+*/
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kthread.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/etherdevice.h>
+#include <linux/spinlock.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/vjchan.h>
+
+#define BUFFER_DATA_LEN 2048
+#define NUM_GLOBAL_DESCRIPTORS 1024
+
+/* All our channels.  FIXME: Lockless funky hash structure please... */
+static LIST_HEAD(channels);
+static spinlock_t chan_lock = SPIN_LOCK_UNLOCKED;
+
+/* Default channel, also holds global buffers (userspace-mapped
+ * channels have local buffers, which they prefer to use). */
+static struct vj_channel *default_chan;
+
+/* need to increment for wake in udp.c wait_for_vj_buffer */
+void vj_inc_wakecnt(struct vj_channel *chan)
+{
+       chan->ring->c.wakecnt++;
+       pr_debug("*** incremented wakecnt - should allow wake up\n");
+}
+EXPORT_SYMBOL(vj_inc_wakecnt);
+
+static int is_empty(struct vj_channel_ring *ring)
+{
+       if (ring->c.head == ring->p.tail)
+               return 1;
+       return 0;
+}
+
+static struct vj_buffer *get_buffer(unsigned int desc_num,
+                                   struct vj_channel *chan)
+{
+       struct vj_buffer *buf;
+
+       if ((desc_num & VJ_HIGH_BIT) || (chan->num_local_buffers == 0)) {
+               desc_num &= ~VJ_HIGH_BIT;
+               BUG_ON(desc_num >= default_chan->num_local_buffers);
+               buf = (struct vj_buffer*)default_chan->descs[desc_num].address;
+       } else {
+               BUG_ON(desc_num >= chan->num_local_buffers);
+               buf = (struct vj_buffer *)chan->descs[desc_num].address;
+       }
+       
+       pr_debug("       received desc_num is %i\n", desc_num);
+       pr_debug("get_buffer %p (%s) %i: %p (len=%li ifind=%i hlen=%li) %#02X %#02X %#02X %#02X %#02X %#02X %#02X %#02X\n",
+                current, current->comm, desc_num, buf, buf->data_len, buf->ifindex, buf->header_len + (sizeof(struct iphdr *) * 4),
+                buf->data[0], buf->data[1], buf->data[2], buf->data[3], buf->data[4], buf->data[5], buf->data[6], buf->data[7]);
+
+       return buf;
+}
+
+static void release_buffer(struct vj_channel *chan, unsigned int descnum)
+{
+       if (descnum & VJ_HIGH_BIT) {
+               BUG_ON(test_bit(descnum & ~VJ_HIGH_BIT,
+                               default_chan->used_descs) == 0);
+               clear_bit(descnum & ~VJ_HIGH_BIT, default_chan->used_descs);
+       } else {
+               BUG_ON(test_bit(descnum, chan->used_descs) == 0);
+               clear_bit(descnum, chan->used_descs);
+       }
+}
+
+/* Free all descriptors for the current channel between where we last
+ * freed to and where the consumer has not yet consumed. chan->c.head
+ * is not cleared because it may not have been consumed, therefore
+ * chan->p.old_head is not cleared.  If chan->p.old_head ==
+ * chan->c.head then nothing more has been consumed since we last
+ * freed the descriptors. 
+ *
+ * Because we're using local and global channels we need to select the
+ * bitmap according to the channel.  Local channels may be pointing to
+ * local or global buffers, so we need to select the bitmap according
+ * to the buffer type */
+
+/* Free descriptors consumer has consumed since last free */
+static void free_descs_for_channel(struct vj_channel *chan)
+{
+       struct vj_channel_ring *ring = chan->ring;
+       int desc_num;
+
+       while (ring->p.old_head != ring->c.head) {
+               printk("ring->p.old_head %i, ring->c.head %i\n", ring->p.old_head, ring->c.head);
+               desc_num = ring->q[ring->p.old_head];
+
+               printk("desc_num %i\n", desc_num);
+
+               /* FIXME: Security concerns: make sure this descriptor
+                * really used by this vjchannel.  Userspace could
+                * have changed it. */
+               release_buffer(chan, desc_num);
+               ring->p.old_head = (ring->p.old_head + 1) % VJ_NET_CHANNEL_ENTRIES;
+               printk("ring->p.old_head %i, ring->c.head %i\n\n", ring->p.old_head, ring->c.head);
+       }
+}
+
+/* return -1 if no descriptor found and none can be freed */
+static int get_free_descriptor(struct vj_channel *chan)
+{
+       int free_desc, bitval;
+
+       BUG_ON(chan->num_local_buffers == 0);
+       do {
+               free_desc = find_first_zero_bit(chan->used_descs,
+                                               chan->num_local_buffers);
+               pr_debug("free_desc = %i\n", free_desc);
+               if (free_desc >= chan->num_local_buffers) {
+                       /* no descriptors, refresh bitmap and try again! */
+                       free_descs_for_channel(chan);
+                       free_desc = find_first_zero_bit(chan->used_descs,
+                                               chan->num_local_buffers);
+                       if (free_desc >= chan->num_local_buffers)
+                               /* still no descriptors */
+                               return -1;
+               }
+               bitval = test_and_set_bit(free_desc, chan->used_descs);
+               pr_debug("bitval = %i\n", bitval);
+       } while (bitval == 1);  //keep going until we get a FREE free bit!
+
+       /* We set high bit to indicate a global channel. */
+       if (chan == default_chan)
+               free_desc |= VJ_HIGH_BIT;
+       return free_desc;
+}
+
+/* This function puts a buffer into a local address space for a
+ * channel that is unable to use a kernel address space.  If address
+ * high bit is set then the buffer is in kernel space - get a free
+ * local buffer and copy it across.  Set local buf to used (done when
+ * finding free buffer), kernel buf to unused. */
+/* FIXME: Loop, do as many as possible at once. */
+int vj_need_local_buffer(struct vj_channel *chan)
+{
+       struct vj_channel_ring *ring = chan->ring;
+       u32 new_desc, k_desc;
+
+       k_desc = ring->q[ring->c.head];
+
+       if (ring->q[ring->c.head] & VJ_HIGH_BIT) {
+               struct vj_buffer *buf, *kbuf;
+
+               kbuf = get_buffer(k_desc, chan);
+               new_desc = get_free_descriptor(chan);
+               if (new_desc == -1)
+                       return -ENOBUFS;
+               buf = get_buffer(new_desc, chan);       
+               memcpy (buf, kbuf, sizeof(struct vj_buffer)
+                       + kbuf->data_len + kbuf->header_len);
+/* clear the old descriptor and set q to new one */
+               k_desc &= ~VJ_HIGH_BIT;
+               clear_bit(k_desc, default_chan->used_descs);    
+               ring->q[ring->c.head] = new_desc;
+       }
+       return 0;
+}
+EXPORT_SYMBOL(vj_need_local_buffer);
+
+struct vj_buffer *vj_get_buffer(int *desc_num)
+{
+       *desc_num = get_free_descriptor(default_chan);
+
+       if (*desc_num == -1) {
+               printk("no free bits!\n");
+               return NULL;  
+       }
+
+       return get_buffer(*desc_num, default_chan);
+}
+EXPORT_SYMBOL(vj_get_buffer);
+
+static void enqueue_buffer(struct vj_channel *chan, struct vj_buffer *buffer, int desc_num)
+{
+       u16 tail, nxt;
+       int i;
+
+       pr_debug("*** in enqueue buffer\n");
+       pr_debug("   desc_num = %i\n", desc_num);
+       pr_debug("   Buffer Data Length = %lu\n", buffer->data_len);
+       pr_debug("   Buffer Header Length = %lu\n", buffer->header_len);
+       pr_debug("   Buffer Data:\n");
+       for (i = 0; i < buffer->data_len; i++) {
+               pr_debug("%i ", buffer->data[i]);
+               if (i % 20 == 0)
+                       pr_debug("\n");
+       }
+       pr_debug("\n");
+
+       tail = chan->ring->p.tail;
+       nxt = (tail + 1) % VJ_NET_CHANNEL_ENTRIES;
+               
+       pr_debug("nxt = %i and chan->c.head = %i\n", nxt, chan->ring->c.head);
+       if (nxt != chan->ring->c.head) {
+               chan->ring->q[tail] = desc_num;
+
+               smp_wmb();
+               chan->ring->p.tail=nxt;
+               pr_debug("chan->p.wakecnt = %i and chan->c.wakecnt = %i\n", chan->ring->p.wakecnt, chan->ring->c.wakecnt);
+               free_descs_for_channel(chan);
+               if (chan->ring->p.wakecnt != chan->ring->c.wakecnt) {
+                       ++chan->ring->p.wakecnt;
+                       /* consume whatever is available */
+                       pr_debug("WAKE UP, CONSUMER!!!\n\n");
+                       wake_up(&chan->wq);
+               }
+       } else //if can't add it to chan, may as well allow it to be reused
+               release_buffer(chan, desc_num);
+}
+
+/* FIXME: If we're going to do wildcards here, we need to do ordering between different partial matches... */
+static struct vj_channel *find_channel(u32 saddr, u32 daddr, u16 proto, u16 sport, u16 dport, u32 ifindex)
+{
+       struct vj_channel *i;
+
+       pr_debug("args saddr %u, daddr %u, sport %u, dport %u, ifindex %u, proto %u\n", saddr, daddr, sport, dport, ifindex, proto);
+
+       list_for_each_entry(i, &channels, list) {
+               pr_debug("saddr %u, daddr %u, sport %u, dport %u, ifindex %u, proto %u\n", i->flowid.saddr, i->flowid.daddr, i->flowid.sport, i->flowid.dport, i->flowid.ifindex, i->flowid.proto);
+       
+               if ((!i->flowid.saddr || i->flowid.saddr == saddr) &&
+                   (!i->flowid.daddr || i->flowid.daddr == daddr) &&
+                   (!i->flowid.proto || i->flowid.proto == proto) &&
+                   (!i->flowid.sport || i->flowid.sport == sport) &&
+                   (!i->flowid.dport || i->flowid.dport == dport) &&
+                   (!i->flowid.ifindex || i->flowid.ifindex == ifindex)) {
+                       pr_debug("Found channel %p\n", i);
+                       return i;
+               }
+       }
+       pr_debug("using default channel %p\n", default_chan);
+       return default_chan;
+}
+
+void vj_netif_rx(struct vj_buffer *buffer, int desc_num, 
+                unsigned short proto)
+{
+       struct vj_channel *chan;
+       struct iphdr *ip;
+       int iphl, offset, real_data_len;
+       u16 *ports;
+       unsigned long flags;
+
+       offset = sizeof(struct iphdr) + sizeof(struct udphdr);
+       real_data_len = buffer->data_len - offset;
+
+
+       pr_debug("data_len = %lu, offset = %i, real data? = %i\n\n\n", buffer->data_len, offset, real_data_len);
+       /* this is always 18 when there's 18 or less characters in buffer->data */
+
+       pr_debug("rx) desc_num = %i\n\n", desc_num);
+
+       spin_lock_irqsave(&chan_lock, flags);
+       if (proto == __constant_htons(ETH_P_IP)) {
+
+               ip = (struct iphdr *)(buffer->data + buffer->header_len);
+               ports = (u16 *)(ip + 1);
+               iphl = ip->ihl * 4;
+               
+               if ((buffer->data_len < (iphl + 4)) || 
+                   (iphl != sizeof(struct iphdr))) {
+                       pr_debug("Bad data, default chan\n");
+                       pr_debug("buffer data_len = %li, header len = %li, ip->ihl = %i\n", buffer->data_len, buffer->header_len, ip->ihl);
+                       chan = default_chan;
+               } else {
+                       chan = find_channel(ip->saddr, ip->daddr, 
+                                           ip->protocol, ports[0], 
+                                           ports[1], buffer->ifindex);
+                       
+               }
+       } else
+               chan = default_chan;
+       enqueue_buffer(chan, buffer, desc_num);
+
+       spin_unlock_irqrestore(&chan_lock, flags);
+}
+EXPORT_SYMBOL(vj_netif_rx);
+
+/*
+ *     Determine the packet's protocol ID. The rule here is that we 
+ *     assume 802.3 if the type field is short enough to be a length.
+ *     This is normal practice and works for any 'now in use' protocol.
+ */
+ 
+unsigned short eth_vj_type_trans(struct vj_buffer *buffer)
+{
+       struct ethhdr *eth;
+       unsigned char *rawp;
+
+       eth = (struct ethhdr *)buffer->data;
+       buffer->header_len = ETH_HLEN;
+
+       BUG_ON(buffer->header_len > buffer->data_len);  
+
+       buffer->data_len -= buffer->header_len;
+       if (ntohs(eth->h_proto) >= 1536)
+               return eth->h_proto;
+               
+       rawp = buffer->data;
+       
+       /*
+        *      This is a magic hack to spot IPX packets. Older Novell breaks
+        *      the protocol design and runs IPX over 802.3 without an 802.2 LLC
+        *      layer. We look for FFFF which isn't a used 802.2 SSAP/DSAP. This
+        *      won't work for fault tolerant netware but does for the rest.
+        */
+       if (*(unsigned short *)rawp == 0xFFFF)
+               return htons(ETH_P_802_3);
+               
+       /*
+        *      Real 802.2 LLC
+        */
+       return htons(ETH_P_802_2);
+}
+EXPORT_SYMBOL(eth_vj_type_trans);
+
+static void send_to_netif_rx(struct vj_buffer *buffer)
+{
+       struct sk_buff *skb;
+       struct net_device *dev;
+       int i;
+
+       dev = dev_get_by_index(buffer->ifindex);
+       if (!dev)
+               return;
+       skb = dev_alloc_skb(buffer->data_len + 2);
+       if (skb == NULL) {
+               dev_put(dev);
+               return;
+       }
+
+       skb_reserve(skb, 2);
+       skb->dev = dev;
+
+       skb_put(skb, buffer->data_len);
+       memcpy(skb->data, buffer->data, buffer->data_len);
+
+       pr_debug(" *** C buffer data_len = %lu and skb->len = %i\n", buffer->data_len, skb->len);
+       for (i = 0; i < 10; i++)
+               pr_debug("%i\n", skb->data[i]);
+
+       skb->protocol = eth_type_trans(skb, skb->dev);
+
+       netif_receive_skb(skb);
+}
+
+/* handles default_chan (buffers that nobody else wants) */
+static int default_thread(void *unused)
+{
+       int consumed = 0;
+       int woken = 0;
+       struct vj_buffer *buffer;
+       wait_queue_t wait;
+
+       /* When we get woken up, don't want to be removed from waitqueue! */
+//no more wait.task    struct task_struct * task is now void *private
+       wait.private = current;
+       wait.func = default_wake_function;
+       INIT_LIST_HEAD(&wait.task_list);
+
+       add_wait_queue(&default_chan->wq, &wait);
+       set_current_state(TASK_UNINTERRUPTIBLE);
+       while (!kthread_should_stop()) {
+               /* FIXME: if we do this before prepare_to_wait, avoids wmb */
+               default_chan->ring->c.wakecnt++;
+               smp_wmb();
+
+               while (!is_empty(default_chan->ring)) {
+                       smp_read_barrier_depends();
+                       buffer = get_buffer(default_chan->ring->q[default_chan->ring->c.head], default_chan);
+                       pr_debug("calling send_to_netif_rx\n");
+                       send_to_netif_rx(buffer);
+                       smp_rmb();
+                       default_chan->ring->c.head = (default_chan->ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;
+                       consumed++;
+               }
+
+               schedule();
+               woken++;
+               set_current_state(TASK_INTERRUPTIBLE);
+       }
+       remove_wait_queue(&default_chan->wq, &wait);
+
+       __set_current_state(TASK_RUNNING);
+
+       pr_debug("consumer finished! consumed %i and woke %i\n", consumed, woken);
+       return 0;
+}
+
+/* return the next buffer, but do not move on */
+struct vj_buffer *vj_peek_next_buffer(struct vj_channel *chan)
+{
+       struct vj_channel_ring *ring = chan->ring;
+
+       if (is_empty(ring))
+               return NULL;
+       return get_buffer(ring->q[ring->c.head], chan);
+}
+EXPORT_SYMBOL(vj_peek_next_buffer);
+
+/* move on to next buffer */
+void vj_done_with_buffer(struct vj_channel *chan)
+{
+       struct vj_channel_ring *ring = chan->ring;
+
+       ring->c.head = (ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;
+
+       pr_debug("done_with_buffer\n\n");
+}
+EXPORT_SYMBOL(vj_done_with_buffer);
+
+struct vj_channel *vj_alloc_chan(int num_buffers)
+{
+       int i;
+       struct vj_channel *chan = kmalloc(sizeof(*chan), GFP_KERNEL);
+
+       if (!chan)
+               return NULL;
+
+       chan->ring = (void *)get_zeroed_page(GFP_KERNEL);
+       if (chan->ring == NULL)
+               goto free_chan;
+
+       init_waitqueue_head(&chan->wq);
+       chan->ring->p.tail = chan->ring->p.wakecnt = chan->ring->p.old_head = chan->ring->c.head = chan->ring->c.wakecnt = 0;
+
+       chan->num_local_buffers = num_buffers;
+       if (chan->num_local_buffers == 0)
+               return chan;
+
+       chan->used_descs = kzalloc(BITS_TO_LONGS(chan->num_local_buffers)
+                                  * sizeof(long), GFP_KERNEL);
+       if (chan->used_descs == NULL)
+               goto free_ring;
+       chan->descs = kmalloc(sizeof(*chan->descs)*num_buffers, GFP_KERNEL);
+       if (chan->descs == NULL)
+               goto free_used_descs;
+       for (i = 0; i < chan->num_local_buffers; i++) {
+               chan->descs[i].buffer_len = PAGE_SIZE;
+               chan->descs[i].address = get_zeroed_page(GFP_KERNEL);
+               if (chan->descs[i].address == 0)
+                       goto free_descs;
+       }
+
+       return chan;
+
+free_descs:
+       for (--i; i >= 0; i--)
+               free_page(chan->descs[i].address);
+       kfree(chan->descs);
+free_used_descs:
+       kfree(chan->used_descs);
+free_ring:
+       free_page((unsigned long)chan->ring);
+free_chan:
+       kfree(chan);
+       return NULL;
+}
+EXPORT_SYMBOL(vj_alloc_chan);
+
+void vj_register_chan(struct vj_channel *chan, const struct vj_flowid *flowid)
+{
+       pr_debug("%p %s: registering channel %p\n",
+              current, current->comm, chan);
+       chan->flowid = *flowid;
+       spin_lock_irq(&chan_lock);
+       list_add(&chan->list, &channels);
+       spin_unlock_irq(&chan_lock);
+}
+EXPORT_SYMBOL(vj_register_chan);
+
+void vj_unregister_chan(struct vj_channel *chan)
+{
+       pr_debug("%p %s: unregistering channel %p\n",
+              current, current->comm, chan);
+       spin_lock_irq(&chan_lock);
+       list_del(&chan->list);
+       spin_unlock_irq(&chan_lock);
+}
+EXPORT_SYMBOL(vj_unregister_chan);
+
+void vj_free_chan(struct vj_channel *chan)
+{
+       pr_debug("%p %s: freeing channel %p\n",
+              current, current->comm, chan);
+       /* FIXME: Mark any buffer still in channel as free! */
+       kfree(chan);
+}
+EXPORT_SYMBOL(vj_free_chan);
+
+
+
+/* not using at the mo - working on rx, not tx */
+int vj_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+       struct vj_buffer *buffer;
+       /* first element in dev priv data must be addr of net_channel */
+//     struct net_channel *chan = *(struct net_channel **) netdev_priv(dev) + 1;
+       int desc_num;
+
+       buffer = vj_get_buffer(&desc_num);
+       buffer->data_len = skb->len;
+       memcpy(buffer->data, skb->data, buffer->data_len);
+//     enqueue_buffer(chan, buffer, desc_num);
+
+       kfree(skb);
+       return 0;
+}
+EXPORT_SYMBOL(vj_xmit);
+
+static int __init init(void)
+{
+       default_chan = vj_alloc_chan(NUM_GLOBAL_DESCRIPTORS);
+       if (!default_chan)
+               return -ENOMEM;
+
+       kthread_run(default_thread, NULL, "kvj_net");
+       return 0;
+}
+
+module_init(init);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("VJ Channel Networking Module.");
+MODULE_AUTHOR("Kelly Daly <kelly@au1.ibm.com>");

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2006-07-08  0:04 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-04-26 19:30 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Caitlin Bestler
2006-04-26 19:46 ` Jeff Garzik
2006-04-26 22:40   ` David S. Miller
2006-04-27  3:40 ` Rusty Russell
2006-04-27  4:58   ` James Morris
2006-04-27  6:16     ` David S. Miller
2006-04-27  6:17   ` David S. Miller
  -- strict thread matches above, loose matches on Subject: below --
2006-04-28 23:45 Caitlin Bestler
2006-04-28 17:55 Caitlin Bestler
2006-04-28 22:17 ` Rusty Russell
2006-04-28 22:40   ` David S. Miller
2006-04-29  0:22     ` Rusty Russell
2006-04-29  6:46       ` David S. Miller
2006-04-28 17:02 Caitlin Bestler
2006-04-28 17:18 ` Stephen Hemminger
2006-04-28 17:29   ` Evgeniy Polyakov
2006-04-28 17:41     ` Stephen Hemminger
2006-04-28 17:55       ` Evgeniy Polyakov
2006-04-28 19:16         ` David S. Miller
2006-04-28 19:49           ` Stephen Hemminger
2006-04-28 19:59             ` Evgeniy Polyakov
2006-04-28 22:00               ` David S. Miller
2006-04-29 13:54                 ` Evgeniy Polyakov
     [not found]                 ` <20060429124451.GA19810@2ka.mipt.ru>
2006-05-01 21:32                   ` David S. Miller
2006-05-02  7:08                     ` Evgeniy Polyakov
2006-04-28 19:52           ` Evgeniy Polyakov
2006-04-28 19:10   ` David S. Miller
2006-04-28 20:46     ` Brent Cook
2006-04-28 17:25 ` Evgeniy Polyakov
2006-04-28 19:14   ` David S. Miller
2006-04-28 15:59 Caitlin Bestler
2006-04-28 16:12 ` Evgeniy Polyakov
2006-04-28 19:09   ` David S. Miller
2006-04-27 21:12 Caitlin Bestler
2006-04-28  6:10 ` Evgeniy Polyakov
2006-04-28  7:20   ` David S. Miller
2006-04-28  7:32     ` Evgeniy Polyakov
2006-04-28 18:20       ` David S. Miller
2006-04-28  8:24 ` Rusty Russell
2006-04-28 19:21   ` David S. Miller
2006-04-28 22:04     ` Rusty Russell
2006-04-28 22:38       ` David S. Miller
2006-04-29  0:10         ` Rusty Russell
2006-04-27  1:02 Caitlin Bestler
2006-04-27  6:08 ` David S. Miller
2006-04-27  6:17   ` Andi Kleen
2006-04-27  6:27     ` David S. Miller
2006-04-27  6:41       ` Andi Kleen
2006-04-27  7:52         ` David S. Miller
2006-04-26 22:53 Caitlin Bestler
2006-04-26 22:59 ` David S. Miller
2006-04-26 20:20 Caitlin Bestler
2006-04-26 22:35 ` David S. Miller
2006-04-26 16:57 Caitlin Bestler
2006-04-26 19:23 ` David S. Miller
2006-04-26 11:47 Kelly Daly
2006-04-26  7:33 ` David S. Miller
2006-04-27  3:31   ` Kelly Daly
2006-04-27  6:25     ` David S. Miller
2006-04-27 11:51       ` Evgeniy Polyakov
2006-04-27 20:09         ` David S. Miller
2006-04-28  6:05           ` Evgeniy Polyakov
2006-05-04  2:59       ` Kelly Daly
2006-05-04 23:22         ` David S. Miller
2006-05-05  1:31           ` Rusty Russell
2006-04-26  7:59 ` David S. Miller
2006-05-04  7:28   ` Kelly Daly
2006-05-04 23:11     ` David S. Miller
2006-05-05  2:48       ` Kelly Daly
2006-05-16  1:02         ` Kelly Daly
2006-05-16  1:05           ` David S. Miller
2006-05-16  1:15             ` Kelly Daly
2006-05-16  5:16           ` David S. Miller
2006-06-22  2:05             ` Kelly Daly
2006-06-22  3:58               ` James Morris
2006-06-22  4:31                 ` Arnaldo Carvalho de Melo
2006-06-22  4:36                 ` YOSHIFUJI Hideaki / 吉藤英明
2006-07-08  0:05               ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).