* [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
@ 2006-04-26 11:47 Kelly Daly
2006-04-26 7:33 ` David S. Miller
2006-04-26 7:59 ` David S. Miller
0 siblings, 2 replies; 35+ messages in thread
From: Kelly Daly @ 2006-04-26 11:47 UTC (permalink / raw)
To: netdev; +Cc: rusty, davem
Hey guys... I've been working with Rusty on a VJ Channel implementation.
Noting Dave's recent release of his implementation, we thought we'd better
get this "out there" so we can do some early comparison/combining and
come up with the best possible implementation.
There are three patches in total:
1) vj_core.patch - core files for VJ to userspace
2) vj_udp.patch - badly hacked up UDP receive implementation - basically just to test what logic may be like!
3) vj_ne2k.patch - modified NE2K and 8390 used for testing on QEMU
Notes:
* channels can have global or local buffers (local buffers are for userspace, and could also be used directly by an intelligent NIC)
* UDP receive breaks real UDP - doesn't talk anything except VJ Channels anymore. Needs integration with normal sources.
* Userspace test app (below) uses the VJ protocol family to mmap space for local buffers; if it receives a buffer that is still in kernel space, it sends a request for that buffer to be copied to a local buffer.
* Default channel converts to skb and feeds through normal receive path.
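For context, the intended driver-side receive flow against the API declared in vjchan.h looks roughly like the sketch below. This is illustrative kernel-style code only (drv_rx and its arguments are hypothetical); the real wiring lives in vj_ne2k.patch.

```c
/* Hypothetical driver receive hook: drv_rx(), dev, pkt and len are
 * made up for illustration; the vj_* calls are from vjchan.h. */
static void drv_rx(struct net_device *dev, const void *pkt, unsigned int len)
{
	int desc_num;
	struct vj_buffer *buf = vj_get_buffer(&desc_num);

	if (!buf)
		return;		/* no free global descriptor: drop the packet */

	memcpy(buf->data, pkt, len);
	buf->data_len = len;
	buf->ifindex = dev->ifindex;

	/* eth_vj_type_trans() strips the ethernet header and returns the
	 * protocol; vj_netif_rx() steers the buffer to a matching channel
	 * or to the default channel. */
	vj_netif_rx(buf, desc_num, eth_vj_type_trans(buf));
}
```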
TODO:
* send not yet implemented
* integrate non vj
* LOTS of fixmes
Cheers,
Kelly
Test userspace app:
/* Van Jacobson net channels implementation for Linux
Copyright (C) 2006 Kelly Daly <kdaly@au.ibm.com> IBM Corporation
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
*/
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/mman.h>
#include <sys/poll.h>
#include <netinet/in.h>
#include "linux-2.6.16/include/linux/types.h"
#include "linux-2.6.16/include/linux/vjchan.h"
//flowid
#define SADDR 0
#define DADDR 0
#define SPORT 0
#define DPORT 60000
#define IFINDEX 0
#define PF_VJCHAN 27
static struct vj_buffer *get_buffer(struct vj_channel_ring *ring, int desc_num)
{
printf("desc_num %i\n", desc_num);
return (void *)ring + (desc_num + 1) * getpagesize();
}
/* return the next buffer, but do not move on */
static struct vj_buffer *vj_peek_next_buffer(struct vj_channel_ring *ring)
{
if (ring->c.head == ring->p.tail)
return NULL;
return get_buffer(ring, ring->q[ring->c.head]);
}
/* move on to next buffer */
static void vj_done_with_buffer(struct vj_channel_ring *ring)
{
ring->c.head = (ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;
printf("done_with_buffer\n\n");
}
int main(int argc, char *argv[])
{
int sk, cls, bnd, pll;
void * mmapped;
struct vj_flowid flowid;
struct vj_channel_ring *ring;
struct vj_buffer *buf;
struct pollfd pfd;
printf("\nstart of vjchannel socket test app\n");
sk = socket(PF_VJCHAN, SOCK_DGRAM, IPPROTO_UDP);
if (sk == -1) {
perror("Unable to open socket!");
return -1;
}
printf("socket open with ret code %i\n\n", sk);
//create flowid!!!
flowid.saddr = SADDR;
flowid.daddr = DADDR;
flowid.sport = SPORT;
flowid.dport = htons(DPORT);
flowid.ifindex = IFINDEX;
flowid.proto = IPPROTO_UDP;
printf("flowid created\n");
bnd = bind(sk, (struct sockaddr *)&flowid, sizeof(struct vj_flowid));
if (bnd == -1) {
perror("Unable to bind socket!");
return -1;
}
printf("socket bound with ret code %i\n\n", bnd);
ring = mmap(0, (getpagesize() * (VJ_NET_CHANNEL_ENTRIES+1)), PROT_READ|PROT_WRITE, MAP_SHARED, sk, 0);
if (ring == MAP_FAILED) {
perror ("Unable to mmap socket!");
return -1;
}
printf("socket mmapped to address %p\n\n", (void *)ring);
pfd.fd = sk;
pfd.events = POLLIN;
for (;;) {
pll = poll(&pfd, 1, -1);
if (pll < 0) {
perror("polling failed!");
return -1;
}
//consume next buffer (peek may return NULL on a spurious wakeup)
buf = vj_peek_next_buffer(ring);
if (!buf)
	continue;
printf("buf %p\n", buf);
//print payload only: skip the 20-byte IP + 8-byte UDP headers
printf(" Buffer Length = %u\n", buf->data_len);
printf(" Header Length = %u\n", buf->header_len);
printf(" Buffer Data: '%.*s'\n", buf->data_len - 28, buf->data + buf->header_len + 28);
vj_done_with_buffer(ring);
}
cls = close(sk);
if (cls != 0) {
perror("Unable to close socket!");
return -2;
}
printf("socket closed with ret code %i\n\n", cls);
return 0;
}
-------------------------
Signed-off-by: Kelly Daly <kelly@au.ibm.com>
Basic infrastructure for Van Jacobson net channels: a lockless ring buffer for buffer transport. Entries in the ring are descriptors for global or local buffers; the ring and the local buffers are mmapped into userspace.
Channels are registered with the core by flowid, and a thread services the default channel for any non-matching packets. Drivers get (global) buffers from vj_get_buffer, and dispatch them through vj_netif_rx.
As userspace mmap cannot reach global buffers, select() copies global buffers into local buffers if required.
diff -r 47031a1f466c linux-2.6.16/include/linux/socket.h
--- linux-2.6.16/include/linux/socket.h Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/include/linux/socket.h Mon Apr 24 19:50:46 2006
@@ -186,6 +187,7 @@
#define AF_PPPOX 24 /* PPPoX sockets */
#define AF_WANPIPE 25 /* Wanpipe API Sockets */
#define AF_LLC 26 /* Linux LLC */
+#define AF_VJCHAN 27 /* VJ Channel */
#define AF_TIPC 30 /* TIPC sockets */
#define AF_BLUETOOTH 31 /* Bluetooth sockets */
#define AF_MAX 32 /* For now.. */
@@ -219,7 +221,8 @@
#define PF_PPPOX AF_PPPOX
#define PF_WANPIPE AF_WANPIPE
#define PF_LLC AF_LLC
+#define PF_VJCHAN AF_VJCHAN
#define PF_TIPC AF_TIPC
#define PF_BLUETOOTH AF_BLUETOOTH
#define PF_MAX AF_MAX
diff -r 47031a1f466c linux-2.6.16/net/Kconfig
--- linux-2.6.16/net/Kconfig Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/Kconfig Mon Apr 24 19:50:46 2006
@@ -65,6 +65,12 @@
source "net/ipv6/Kconfig"
endif # if INET
+
+config VJCHAN
+ bool "Van Jacobson Net Channel Support (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ ---help---
+ This adds a userspace-accessible packet receive interface. Say N.
menuconfig NETFILTER
bool "Network packet filtering (replaces ipchains)"
diff -r 47031a1f466c linux-2.6.16/net/Makefile
--- linux-2.6.16/net/Makefile Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/Makefile Mon Apr 24 19:50:46 2006
@@ -46,6 +46,7 @@
obj-$(CONFIG_IP_SCTP) += sctp/
obj-$(CONFIG_IEEE80211) += ieee80211/
obj-$(CONFIG_TIPC) += tipc/
+obj-$(CONFIG_VJCHAN) += vjchan/
ifeq ($(CONFIG_NET),y)
obj-$(CONFIG_SYSCTL) += sysctl_net.o
diff -r 47031a1f466c linux-2.6.16/include/linux/vjchan.h
--- /dev/null Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/include/linux/vjchan.h Mon Apr 24 19:50:46 2006
@@ -0,0 +1,79 @@
+#ifndef _LINUX_VJCHAN_H
+#define _LINUX_VJCHAN_H
+
+/* num entries in channel q: set so consumer is at offset 1024. */
+#define VJ_NET_CHANNEL_ENTRIES 254
+/* identifies non-local buffers (ie. need kernel to copy to a local) */
+#define VJ_HIGH_BIT 0x80000000
+
+struct vj_producer {
+ __u16 tail; /* next element to add */
+ __u8 wakecnt; /* do wakeup if != consumer wakecnt */
+ __u8 pad;
+ __u16 old_head; /* last cleared buffer posn +1 */
+ __u16 pad2;
+};
+
+struct vj_consumer {
+ __u16 head; /* next element to remove */
+ __u8 wakecnt; /* increment to request wakeup */
+};
+
+/* mmap returns one of these, followed by 254 pages with a buffer each */
+struct vj_channel_ring {
+ struct vj_producer p; /* producer's header */
+ __u32 q[VJ_NET_CHANNEL_ENTRIES];
+ struct vj_consumer c; /* consumer's header */
+};
+
+struct vj_buffer {
+ __u32 data_len; /* length of actual data in buffer */
+ __u32 header_len; /* offset eth + ip header (true for now) */
+ __u32 ifindex; /* interface the packet came in on. */
+ char data[0];
+};
+
+/* Currently assumed IPv4 */
+struct vj_flowid
+{
+ __u32 saddr, daddr;
+ __u16 sport, dport;
+ __u32 ifindex;
+ __u16 proto;
+};
+
+#ifdef __KERNEL__
+struct net_device;
+struct sk_buff;
+
+struct vj_descriptor {
+ unsigned long address; /* address of net_channel_buffer */
+ unsigned long buffer_len; /* max length including header */
+};
+
+/* Everything about a vj_channel */
+struct vj_channel
+{
+ struct vj_channel_ring *ring;
+ wait_queue_head_t wq;
+ struct list_head list;
+ struct vj_flowid flowid;
+ int num_local_buffers;
+ struct vj_descriptor *descs;
+ unsigned long * used_descs;
+};
+
+void vj_inc_wakecnt(struct vj_channel *chan);
+struct vj_buffer *vj_get_buffer(int *desc_num);
+void vj_netif_rx(struct vj_buffer *buffer, int desc_num, unsigned short proto);
+int vj_xmit(struct sk_buff *skb, struct net_device *dev);
+struct vj_channel *vj_alloc_chan(int num_buffers);
+void vj_register_chan(struct vj_channel *chan, const struct vj_flowid *flowid);
+void vj_unregister_chan(struct vj_channel *chan);
+void vj_free_chan(struct vj_channel *chan);
+struct vj_buffer *vj_peek_next_buffer(struct vj_channel *chan);
+void vj_done_with_buffer(struct vj_channel *chan);
+unsigned short eth_vj_type_trans(struct vj_buffer *buffer);
+int vj_need_local_buffer(struct vj_channel *chan);
+#endif
+#endif /* _LINUX_VJCHAN_H */
diff -r 47031a1f466c linux-2.6.16/net/vjchan/Makefile
--- /dev/null Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/Makefile Mon Apr 24 19:50:46 2006
@@ -0,0 +1,3 @@
+#obj-m += vjtest.o
+obj-y += vjnet.o
+obj-y += af_vjchan.o
diff -r 47031a1f466c linux-2.6.16/net/vjchan/af_vjchan.c
--- /dev/null Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/af_vjchan.c Mon Apr 24 19:50:46 2006
@@ -0,0 +1,198 @@
+/* Van Jacobson net channels implementation for Linux
+ Copyright (C) 2006 Kelly Daly <kdaly@au.ibm.com> IBM Corporation
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+*/
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/socket.h>
+#include <linux/vjchan.h>
+#include <net/sock.h>
+
+struct vjchan_sock
+{
+ struct sock sk;
+ struct vj_channel *chan;
+ int vj_reg_flag;
+};
+
+static inline struct vjchan_sock *vj_sk(struct sock *sk)
+{
+ return (struct vjchan_sock *)sk;
+}
+
+static struct proto vjchan_proto = {
+ .name = "VJCHAN",
+ .owner = THIS_MODULE,
+ .obj_size = sizeof(struct vjchan_sock),
+};
+
+int vjchan_release(struct socket *sock)
+{
+ struct sock *sk = sock->sk;
+
+ sock_orphan(sk);
+ sock->sk = NULL;
+ sock_put(sk);
+ return 0;
+}
+
+int vjchan_bind(struct socket *sock, struct sockaddr *addr, int sockaddr_len)
+{
+ struct sock *sk = sock->sk;
+ struct vjchan_sock *vjsk;
+ struct vj_flowid *flowid = (struct vj_flowid *)addr;
+
+ /* FIXME: avoid clashing with normal sockets, replace zeroes. */
+ vjsk = vj_sk(sk);
+ vj_register_chan(vjsk->chan, flowid);
+ vjsk->vj_reg_flag = 1;
+
+ return 0;
+}
+
+int vjchan_getname(struct socket *sock, struct sockaddr *addr,
+ int *sockaddr_len, int peer)
+{
+ /* FIXME: Implement */
+ return 0;
+}
+
+unsigned int vjchan_poll(struct file *file, struct socket *sock,
+ struct poll_table_struct *wait)
+{
+ struct sock *sk = sock->sk;
+ struct vj_channel *chan = vj_sk(sk)->chan;
+
+ poll_wait(file, &chan->wq, wait);
+ vj_inc_wakecnt(chan);
+
+ if (vj_peek_next_buffer(chan) && vj_need_local_buffer(chan) == 0)
+ return POLLIN | POLLRDNORM;
+
+ return 0;
+}
+
+/* We map the ring first, then one page per buffer. */
+int vjchan_mmap(struct file *file, struct socket *sock,
+ struct vm_area_struct *vma)
+{
+ struct sock *sk = sock->sk;
+ struct vj_channel *chan = vj_sk(sk)->chan;
+ int i, vip;
+ unsigned long pos;
+
+ if (vma->vm_end - vma->vm_start !=
+ (1 + chan->num_local_buffers)*PAGE_SIZE)
+ return -EINVAL;
+
+ pos = vma->vm_start;
+ vip = vm_insert_page(vma, pos, virt_to_page(chan->ring));
+ pos += PAGE_SIZE;
+ for (i = 0; i < chan->num_local_buffers; i++) {
+ vip = vm_insert_page(vma, pos, virt_to_page(chan->descs[i].address));
+ pos += PAGE_SIZE;
+ }
+ return 0;
+}
+
+const struct proto_ops vjchan_ops = {
+ .family = PF_VJCHAN,
+ .owner = THIS_MODULE,
+ .release = vjchan_release,
+ .bind = vjchan_bind,
+ .socketpair = sock_no_socketpair,
+ .accept = sock_no_accept,
+ .getname = vjchan_getname,
+ .poll = vjchan_poll,
+ .ioctl = sock_no_ioctl,
+ .shutdown = sock_no_shutdown,
+ .setsockopt = sock_common_setsockopt,
+ .getsockopt = sock_common_getsockopt,
+ .sendmsg = sock_no_sendmsg,
+ .recvmsg = sock_no_recvmsg,
+ .mmap = vjchan_mmap,
+ .sendpage = sock_no_sendpage
+};
+
+static void vjchan_destruct(struct sock *sk)
+{
+ struct vjchan_sock *vjsk;
+
+ vjsk = vj_sk(sk);
+ if (vjsk->vj_reg_flag) {
+ vj_unregister_chan(vjsk->chan);
+ vjsk->vj_reg_flag = 0;
+ }
+ vj_free_chan(vjsk->chan);
+
+}
+
+static int vjchan_create(struct socket *sock, int protocol)
+{
+ struct sock *sk;
+ struct vjchan_sock *vjsk;
+ int err;
+
+ if (!capable(CAP_NET_RAW))
+ return -EPERM;
+ if (sock->type != SOCK_DGRAM
+ && sock->type != SOCK_RAW
+ && sock->type != SOCK_PACKET)
+ return -ESOCKTNOSUPPORT;
+
+ sock->state = SS_UNCONNECTED;
+
+ err = -ENOBUFS;
+ sk = sk_alloc(PF_VJCHAN, GFP_KERNEL, &vjchan_proto, 1);
+ if (sk == NULL)
+ goto out;
+
+ sock->ops = &vjchan_ops;
+
+ sock_init_data(sock, sk);
+ sk->sk_family = PF_VJCHAN;
+ sk->sk_destruct = vjchan_destruct;
+
+ vjsk = vj_sk(sk);
+ vjsk->chan = vj_alloc_chan(VJ_NET_CHANNEL_ENTRIES);
+ vjsk->vj_reg_flag = 0;
+ if (!vjsk->chan)
+ return -ENOMEM;
+ return 0;
+out:
+ return err;
+}
+
+static struct net_proto_family vjchan_family_ops = {
+ .family = PF_VJCHAN,
+ .create = vjchan_create,
+ .owner = THIS_MODULE,
+};
+
+static void __exit vjchan_exit(void)
+{
+ sock_unregister(PF_VJCHAN);
+}
+
+static int __init vjchan_init(void)
+{
+ return sock_register(&vjchan_family_ops);
+}
+
+module_init(vjchan_init);
+module_exit(vjchan_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NETPROTO(PF_VJCHAN);
diff -r 47031a1f466c linux-2.6.16/net/vjchan/vjnet.c
--- /dev/null Thu Mar 23 06:32:12 2006
+++ linux-2.6.16/net/vjchan/vjnet.c Mon Apr 24 19:50:46 2006
@@ -0,0 +1,550 @@
+/* Van Jacobson net channels implementation for Linux
+ Copyright (C) 2006 Kelly Daly <kdaly@au.ibm.com> IBM Corporation
+
+ This program is free software; you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation; either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program; if not, write to the Free Software
+ Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA
+*/
+#include <linux/init.h>
+#include <linux/module.h>
+#include <linux/errno.h>
+#include <linux/slab.h>
+#include <linux/kthread.h>
+#include <linux/netdevice.h>
+#include <linux/skbuff.h>
+#include <linux/etherdevice.h>
+#include <linux/spinlock.h>
+#include <linux/ip.h>
+#include <linux/udp.h>
+#include <linux/vjchan.h>
+
+#define BUFFER_DATA_LEN 2048
+#define NUM_GLOBAL_DESCRIPTORS 1024
+
+/* All our channels. FIXME: Lockless funky hash structure please... */
+static LIST_HEAD(channels);
+static spinlock_t chan_lock = SPIN_LOCK_UNLOCKED;
+
+/* Default channel, also holds global buffers (userspace-mapped
+ * channels have local buffers, which they prefer to use). */
+static struct vj_channel *default_chan;
+
+/* need to increment for wake in udp.c wait_for_vj_buffer */
+void vj_inc_wakecnt(struct vj_channel *chan)
+{
+ chan->ring->c.wakecnt++;
+ pr_debug("*** incremented wakecnt - should allow wake up\n");
+}
+EXPORT_SYMBOL(vj_inc_wakecnt);
+
+static int is_empty(struct vj_channel_ring *ring)
+{
+ if (ring->c.head == ring->p.tail)
+ return 1;
+ return 0;
+}
+
+static struct vj_buffer *get_buffer(unsigned int desc_num,
+ struct vj_channel *chan)
+{
+ struct vj_buffer *buf;
+
+ if ((desc_num & VJ_HIGH_BIT) || (chan->num_local_buffers == 0)) {
+ desc_num &= ~VJ_HIGH_BIT;
+ BUG_ON(desc_num >= default_chan->num_local_buffers);
+ buf = (struct vj_buffer*)default_chan->descs[desc_num].address;
+ } else {
+ BUG_ON(desc_num >= chan->num_local_buffers);
+ buf = (struct vj_buffer *)chan->descs[desc_num].address;
+ }
+
+ pr_debug(" received desc_num is %i\n", desc_num);
+ pr_debug("get_buffer %p (%s) %i: %p (len=%li ifind=%i hlen=%li) %#02X %#02X %#02X %#02X %#02X %#02X %#02X %#02X\n",
+ current, current->comm, desc_num, buf, buf->data_len, buf->ifindex, buf->header_len + (sizeof(struct iphdr *) * 4),
+ buf->data[0], buf->data[1], buf->data[2], buf->data[3], buf->data[4], buf->data[5], buf->data[6], buf->data[7]);
+
+ return buf;
+}
+
+static void release_buffer(struct vj_channel *chan, unsigned int descnum)
+{
+ if (descnum & VJ_HIGH_BIT) {
+ BUG_ON(test_bit(descnum & ~VJ_HIGH_BIT,
+ default_chan->used_descs) == 0);
+ clear_bit(descnum & ~VJ_HIGH_BIT, default_chan->used_descs);
+ } else {
+ BUG_ON(test_bit(descnum, chan->used_descs) == 0);
+ clear_bit(descnum, chan->used_descs);
+ }
+}
+
+/* Free all descriptors for the current channel between where we last
+ * freed to and where the consumer has not yet consumed. chan->c.head
+ * is not cleared because it may not have been consumed, therefore
+ * chan->p.old_head is not cleared. If chan->p.old_head ==
+ * chan->c.head then nothing more has been consumed since we last
+ * freed the descriptors.
+ *
+ * Because we're using local and global channels we need to select the
+ * bitmap according to the channel. Local channels may be pointing to
+ * local or global buffers, so we need to select the bitmap according
+ * to the buffer type */
+
+/* Free descriptors consumer has consumed since last free */
+static void free_descs_for_channel(struct vj_channel *chan)
+{
+ struct vj_channel_ring *ring = chan->ring;
+ int desc_num;
+
+ while (ring->p.old_head != ring->c.head) {
+ printk("ring->p.old_head %i, ring->c.head %i\n", ring->p.old_head, ring->c.head);
+ desc_num = ring->q[ring->p.old_head];
+
+ printk("desc_num %i\n", desc_num);
+
+ /* FIXME: Security concerns: make sure this descriptor
+ * really used by this vjchannel. Userspace could
+ * have changed it. */
+ release_buffer(chan, desc_num);
+ ring->p.old_head = (ring->p.old_head + 1) % VJ_NET_CHANNEL_ENTRIES;
+ printk("ring->p.old_head %i, ring->c.head %i\n\n", ring->p.old_head, ring->c.head);
+ }
+}
+
+/* return -1 if no descriptor found and none can be freed */
+static int get_free_descriptor(struct vj_channel *chan)
+{
+ int free_desc, bitval;
+
+ BUG_ON(chan->num_local_buffers == 0);
+ do {
+ free_desc = find_first_zero_bit(chan->used_descs,
+ chan->num_local_buffers);
+ pr_debug("free_desc = %i\n", free_desc);
+ if (free_desc >= chan->num_local_buffers) {
+ /* no descriptors, refresh bitmap and try again! */
+ free_descs_for_channel(chan);
+ free_desc = find_first_zero_bit(chan->used_descs,
+ chan->num_local_buffers);
+ if (free_desc >= chan->num_local_buffers)
+ /* still no descriptors */
+ return -1;
+ }
+ bitval = test_and_set_bit(free_desc, chan->used_descs);
+ pr_debug("bitval = %i\n", bitval);
+ } while (bitval == 1); //keep going until we get a FREE free bit!
+
+ /* We set high bit to indicate a global channel. */
+ if (chan == default_chan)
+ free_desc |= VJ_HIGH_BIT;
+ return free_desc;
+}
+
+/* This function puts a buffer into a local address space for a
+ * channel that is unable to use a kernel address space. If address
+ * high bit is set then the buffer is in kernel space - get a free
+ * local buffer and copy it across. Set local buf to used (done when
+ * finding free buffer), kernel buf to unused. */
+/* FIXME: Loop, do as many as possible at once. */
+int vj_need_local_buffer(struct vj_channel *chan)
+{
+ struct vj_channel_ring *ring = chan->ring;
+ u32 new_desc, k_desc;
+
+ k_desc = ring->q[ring->c.head];
+
+ if (ring->q[ring->c.head] & VJ_HIGH_BIT) {
+ struct vj_buffer *buf, *kbuf;
+
+ kbuf = get_buffer(k_desc, chan);
+ new_desc = get_free_descriptor(chan);
+ if (new_desc == -1)
+ return -ENOBUFS;
+ buf = get_buffer(new_desc, chan);
+ memcpy (buf, kbuf, sizeof(struct vj_buffer)
+ + kbuf->data_len + kbuf->header_len);
+/* clear the old descriptor and set q to new one */
+ k_desc &= ~VJ_HIGH_BIT;
+ clear_bit(k_desc, default_chan->used_descs);
+ ring->q[ring->c.head] = new_desc;
+ }
+ return 0;
+}
+EXPORT_SYMBOL(vj_need_local_buffer);
+
+struct vj_buffer *vj_get_buffer(int *desc_num)
+{
+ *desc_num = get_free_descriptor(default_chan);
+
+ if (*desc_num == -1) {
+ printk("no free bits!\n");
+ return NULL;
+ }
+
+ return get_buffer(*desc_num, default_chan);
+}
+EXPORT_SYMBOL(vj_get_buffer);
+
+static void enqueue_buffer(struct vj_channel *chan, struct vj_buffer *buffer, int desc_num)
+{
+ u16 tail, nxt;
+ int i;
+
+ pr_debug("*** in enqueue buffer\n");
+ pr_debug(" desc_num = %i\n", desc_num);
+ pr_debug(" Buffer Data Length = %lu\n", buffer->data_len);
+ pr_debug(" Buffer Header Length = %lu\n", buffer->header_len);
+ pr_debug(" Buffer Data:\n");
+ for (i = 0; i < buffer->data_len; i++) {
+ pr_debug("%i ", buffer->data[i]);
+ if (i % 20 == 0)
+ pr_debug("\n");
+ }
+ pr_debug("\n");
+
+ tail = chan->ring->p.tail;
+ nxt = (tail + 1) % VJ_NET_CHANNEL_ENTRIES;
+
+ pr_debug("nxt = %i and chan->c.head = %i\n", nxt, chan->ring->c.head);
+ if (nxt != chan->ring->c.head) {
+ chan->ring->q[tail] = desc_num;
+
+ smp_wmb();
+ chan->ring->p.tail=nxt;
+ pr_debug("chan->p.wakecnt = %i and chan->c.wakecnt = %i\n", chan->ring->p.wakecnt, chan->ring->c.wakecnt);
+ free_descs_for_channel(chan);
+ if (chan->ring->p.wakecnt != chan->ring->c.wakecnt) {
+ ++chan->ring->p.wakecnt;
+ /* consume whatever is available */
+ pr_debug("WAKE UP, CONSUMER!!!\n\n");
+ wake_up(&chan->wq);
+ }
+ } else //if can't add it to chan, may as well allow it to be reused
+ release_buffer(chan, desc_num);
+}
+
+/* FIXME: If we're going to do wildcards here, we need to do ordering between different partial matches... */
+static struct vj_channel *find_channel(u32 saddr, u32 daddr, u16 proto, u16 sport, u16 dport, u32 ifindex)
+{
+ struct vj_channel *i;
+
+ pr_debug("args saddr %u, daddr %u, sport %u, dport %u, ifindex %u, proto %u\n", saddr, daddr, sport, dport, ifindex, proto);
+
+ list_for_each_entry(i, &channels, list) {
+ pr_debug("saddr %u, daddr %u, sport %u, dport %u, ifindex %u, proto %u\n", i->flowid.saddr, i->flowid.daddr, i->flowid.sport, i->flowid.dport, i->flowid.ifindex, i->flowid.proto);
+
+ if ((!i->flowid.saddr || i->flowid.saddr == saddr) &&
+ (!i->flowid.daddr || i->flowid.daddr == daddr) &&
+ (!i->flowid.proto || i->flowid.proto == proto) &&
+ (!i->flowid.sport || i->flowid.sport == sport) &&
+ (!i->flowid.dport || i->flowid.dport == dport) &&
+ (!i->flowid.ifindex || i->flowid.ifindex == ifindex)) {
+ pr_debug("Found channel %p\n", i);
+ return i;
+ }
+ }
+ pr_debug("using default channel %p\n", default_chan);
+ return default_chan;
+}
+
+void vj_netif_rx(struct vj_buffer *buffer, int desc_num,
+ unsigned short proto)
+{
+ struct vj_channel *chan;
+ struct iphdr *ip;
+ int iphl, offset, real_data_len;
+ u16 *ports;
+ unsigned long flags;
+
+ offset = sizeof(struct iphdr) + sizeof(struct udphdr);
+ real_data_len = buffer->data_len - offset;
+
+
+ pr_debug("data_len = %lu, offset = %i, real data? = %i\n\n\n", buffer->data_len, offset, real_data_len);
+ /* this is always 18 when there's 18 or less characters in buffer->data */
+
+ pr_debug("rx) desc_num = %i\n\n", desc_num);
+
+ spin_lock_irqsave(&chan_lock, flags);
+ if (proto == __constant_htons(ETH_P_IP)) {
+
+ ip = (struct iphdr *)(buffer->data + buffer->header_len);
+ ports = (u16 *)(ip + 1);
+ iphl = ip->ihl * 4;
+
+ if ((buffer->data_len < (iphl + 4)) ||
+ (iphl != sizeof(struct iphdr))) {
+ pr_debug("Bad data, default chan\n");
+ pr_debug("buffer data_len = %li, header len = %li, ip->ihl = %i\n", buffer->data_len, buffer->header_len, ip->ihl);
+ chan = default_chan;
+ } else {
+ chan = find_channel(ip->saddr, ip->daddr,
+ ip->protocol, ports[0],
+ ports[1], buffer->ifindex);
+
+ }
+ } else
+ chan = default_chan;
+ enqueue_buffer(chan, buffer, desc_num);
+
+ spin_unlock_irqrestore(&chan_lock, flags);
+}
+EXPORT_SYMBOL(vj_netif_rx);
+
+/*
+ * Determine the packet's protocol ID. The rule here is that we
+ * assume 802.3 if the type field is short enough to be a length.
+ * This is normal practice and works for any 'now in use' protocol.
+ */
+
+unsigned short eth_vj_type_trans(struct vj_buffer *buffer)
+{
+ struct ethhdr *eth;
+ unsigned char *rawp;
+
+ eth = (struct ethhdr *)buffer->data;
+ buffer->header_len = ETH_HLEN;
+
+ BUG_ON(buffer->header_len > buffer->data_len);
+
+ buffer->data_len -= buffer->header_len;
+ if (ntohs(eth->h_proto) >= 1536)
+ return eth->h_proto;
+
+ rawp = buffer->data;
+
+ /*
+ * This is a magic hack to spot IPX packets. Older Novell breaks
+ * the protocol design and runs IPX over 802.3 without an 802.2 LLC
+ * layer. We look for FFFF which isn't a used 802.2 SSAP/DSAP. This
+ * won't work for fault tolerant netware but does for the rest.
+ */
+ if (*(unsigned short *)rawp == 0xFFFF)
+ return htons(ETH_P_802_3);
+
+ /*
+ * Real 802.2 LLC
+ */
+ return htons(ETH_P_802_2);
+}
+EXPORT_SYMBOL(eth_vj_type_trans);
+
+static void send_to_netif_rx(struct vj_buffer *buffer)
+{
+ struct sk_buff *skb;
+ struct net_device *dev;
+ int i;
+
+ dev = dev_get_by_index(buffer->ifindex);
+ if (!dev)
+ return;
+ skb = dev_alloc_skb(buffer->data_len + 2);
+ if (skb == NULL) {
+ dev_put(dev);
+ return;
+ }
+
+ skb_reserve(skb, 2);
+ skb->dev = dev;
+
+ skb_put(skb, buffer->data_len);
+ memcpy(skb->data, buffer->data, buffer->data_len);
+
+ pr_debug(" *** C buffer data_len = %lu and skb->len = %i\n", buffer->data_len, skb->len);
+ for (i = 0; i < 10; i++)
+ pr_debug("%i\n", skb->data[i]);
+
+ skb->protocol = eth_type_trans(skb, skb->dev);
+
+ netif_receive_skb(skb);
+}
+
+/* handles default_chan (buffers that nobody else wants) */
+static int default_thread(void *unused)
+{
+ int consumed = 0;
+ int woken = 0;
+ struct vj_buffer *buffer;
+ wait_queue_t wait;
+
+ /* When we get woken up, don't want to be removed from waitqueue! */
+	/* no more wait.task: struct task_struct *task is now void *private */
+ wait.private = current;
+ wait.func = default_wake_function;
+ INIT_LIST_HEAD(&wait.task_list);
+
+ add_wait_queue(&default_chan->wq, &wait);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ while (!kthread_should_stop()) {
+ /* FIXME: if we do this before prepare_to_wait, avoids wmb */
+ default_chan->ring->c.wakecnt++;
+ smp_wmb();
+
+ while (!is_empty(default_chan->ring)) {
+ smp_read_barrier_depends();
+ buffer = get_buffer(default_chan->ring->q[default_chan->ring->c.head], default_chan);
+ pr_debug("calling send_to_netif_rx\n");
+ send_to_netif_rx(buffer);
+ smp_rmb();
+ default_chan->ring->c.head = (default_chan->ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;
+ consumed++;
+ }
+
+ schedule();
+ woken++;
+ set_current_state(TASK_INTERRUPTIBLE);
+ }
+ remove_wait_queue(&default_chan->wq, &wait);
+
+ __set_current_state(TASK_RUNNING);
+
+ pr_debug("consumer finished! consumed %i and woke %i\n", consumed, woken);
+ return 0;
+}
+
+/* return the next buffer, but do not move on */
+struct vj_buffer *vj_peek_next_buffer(struct vj_channel *chan)
+{
+ struct vj_channel_ring *ring = chan->ring;
+
+ if (is_empty(ring))
+ return NULL;
+ return get_buffer(ring->q[ring->c.head], chan);
+}
+EXPORT_SYMBOL(vj_peek_next_buffer);
+
+/* move on to next buffer */
+void vj_done_with_buffer(struct vj_channel *chan)
+{
+ struct vj_channel_ring *ring = chan->ring;
+
+ ring->c.head = (ring->c.head+1)%VJ_NET_CHANNEL_ENTRIES;
+
+ pr_debug("done_with_buffer\n\n");
+}
+EXPORT_SYMBOL(vj_done_with_buffer);
+
+struct vj_channel *vj_alloc_chan(int num_buffers)
+{
+ int i;
+ struct vj_channel *chan = kmalloc(sizeof(*chan), GFP_KERNEL);
+
+ if (!chan)
+ return NULL;
+
+ chan->ring = (void *)get_zeroed_page(GFP_KERNEL);
+ if (chan->ring == NULL)
+ goto free_chan;
+
+ init_waitqueue_head(&chan->wq);
+ chan->ring->p.tail = chan->ring->p.wakecnt = chan->ring->p.old_head = chan->ring->c.head = chan->ring->c.wakecnt = 0;
+
+ chan->num_local_buffers = num_buffers;
+ if (chan->num_local_buffers == 0)
+ return chan;
+
+ chan->used_descs = kzalloc(BITS_TO_LONGS(chan->num_local_buffers)
+ * sizeof(long), GFP_KERNEL);
+ if (chan->used_descs == NULL)
+ goto free_ring;
+ chan->descs = kmalloc(sizeof(*chan->descs)*num_buffers, GFP_KERNEL);
+ if (chan->descs == NULL)
+ goto free_used_descs;
+ for (i = 0; i < chan->num_local_buffers; i++) {
+ chan->descs[i].buffer_len = PAGE_SIZE;
+ chan->descs[i].address = get_zeroed_page(GFP_KERNEL);
+ if (chan->descs[i].address == 0)
+ goto free_descs;
+ }
+
+ return chan;
+
+free_descs:
+ for (--i; i >= 0; i--)
+ free_page(chan->descs[i].address);
+ kfree(chan->descs);
+free_used_descs:
+ kfree(chan->used_descs);
+free_ring:
+ free_page((unsigned long)chan->ring);
+free_chan:
+ kfree(chan);
+ return NULL;
+}
+EXPORT_SYMBOL(vj_alloc_chan);
+
+void vj_register_chan(struct vj_channel *chan, const struct vj_flowid *flowid)
+{
+ pr_debug("%p %s: registering channel %p\n",
+ current, current->comm, chan);
+ chan->flowid = *flowid;
+ spin_lock_irq(&chan_lock);
+ list_add(&chan->list, &channels);
+ spin_unlock_irq(&chan_lock);
+}
+EXPORT_SYMBOL(vj_register_chan);
+
+void vj_unregister_chan(struct vj_channel *chan)
+{
+ pr_debug("%p %s: unregistering channel %p\n",
+ current, current->comm, chan);
+ spin_lock_irq(&chan_lock);
+ list_del(&chan->list);
+ spin_unlock_irq(&chan_lock);
+}
+EXPORT_SYMBOL(vj_unregister_chan);
+
+void vj_free_chan(struct vj_channel *chan)
+{
+ pr_debug("%p %s: freeing channel %p\n",
+ current, current->comm, chan);
+ /* FIXME: Mark any buffer still in channel as free! */
+ kfree(chan);
+}
+EXPORT_SYMBOL(vj_free_chan);
+
+
+
+/* not using at the mo - working on rx, not tx */
+int vj_xmit(struct sk_buff *skb, struct net_device *dev)
+{
+ struct vj_buffer *buffer;
+ /* first element in dev priv data must be addr of net_channel */
+// struct net_channel *chan = *(struct net_channel **) netdev_priv(dev) + 1;
+ int desc_num;
+
+ buffer = vj_get_buffer(&desc_num);
+ buffer->data_len = skb->len;
+ memcpy(buffer->data, skb->data, buffer->data_len);
+// enqueue_buffer(chan, buffer, desc_num);
+
+ kfree_skb(skb);
+ return 0;
+}
+EXPORT_SYMBOL(vj_xmit);
+
+static int __init init(void)
+{
+ default_chan = vj_alloc_chan(NUM_GLOBAL_DESCRIPTORS);
+ if (!default_chan)
+ return -ENOMEM;
+
+ kthread_run(default_thread, NULL, "kvj_net");
+ return 0;
+}
+
+module_init(init);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("VJ Channel Networking Module.");
+MODULE_AUTHOR("Kelly Daly <kelly@au1.ibm.com>");
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-26 11:47 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Kelly Daly
@ 2006-04-26 7:33 ` David S. Miller
2006-04-27 3:31 ` Kelly Daly
2006-04-26 7:59 ` David S. Miller
1 sibling, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-04-26 7:33 UTC (permalink / raw)
To: kelly; +Cc: netdev, rusty
From: Kelly Daly <kelly@au1.ibm.com>
Date: Wed, 26 Apr 2006 11:47:34 +0000
> Noting Dave's recent release of his implementation, we thought we'd
> better get this "out there" so we can do some early
> comparison/combining and come up with the best possible
> implementation.
Thanks for publishing your work.
I'm actually not that upset that I duplicated the work a little
bit because trying to start implementing things forced me to
think in a more focused way about this stuff.
I'll look over your patches, thanks.
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-26 7:33 ` David S. Miller
@ 2006-04-27 3:31 ` Kelly Daly
2006-04-27 6:25 ` David S. Miller
0 siblings, 1 reply; 35+ messages in thread
From: Kelly Daly @ 2006-04-27 3:31 UTC (permalink / raw)
To: David S. Miller; +Cc: rusty, netdev
Hi Dave,
Thanks for your response. =)
On Wednesday 26 April 2006 17:59, you wrote:
> Ok I have comments already just glancing at the initial patch.
>
> With the 32-bit descriptors in the channel, you indeed end up
> with a fixed sized pool with a lot of hard-to-finesse sizing
> and lookup problems to solve.
It should be quite trivial to resize this pool using RCU.
>
> So what I wanted to do was finesse the entire issue by simply
> side-stepping it initially. Use a normal buffer with a tail
> descriptor, when you enqueue you give a tail descriptor pointer.
The tail pointers are an excellent idea - and they certainly fix a lot of
compatibility issues that we side-stepped (we were going for the "make it
work" approach rather than the "make it right" - figured we could get to that
bit later =P ).
> I really dislike the pools of buffers, partly because they are fixed
> size (or dynamically sized and even more expensive to implement), but
> more so because there is all of this absolutely stupid state management
> you eat just to get at the real data. That's pointless, we're trying
> to make this as light as possible. Just use real pointers and
> describe the packet with a tail descriptor.
We approached this from the understanding that an intelligent NIC will be able
to transition directly to userspace, which is a major win. 0 copies to
userspace would be sweet. I think we can still achieve this using your
scheme without *too* much pain.
> Next, you can't even begin to work on the protocol channels before you
> do one very important piece of work. Integration of all of the ipv4
> and ipv6 protocol hash tables into a central code, it's a total
> prerequisite. Then you modify things to use a generic
> inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a
> protocol number as well as saddr/daddr/sport/dport and searches
> from a central table.
Understood. And agreed. Once again was side-stepped just to try to get a
"working model". Will look into this immediately.
> So I think I'll continue working on my implementation, it's more
> transitional and that's how we have to do this kind of work.
Thanks again for your comments =) (and thanks to everyone else who took the
time to respond to this)
Kelly
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-27 3:31 ` Kelly Daly
@ 2006-04-27 6:25 ` David S. Miller
2006-04-27 11:51 ` Evgeniy Polyakov
2006-05-04 2:59 ` Kelly Daly
0 siblings, 2 replies; 35+ messages in thread
From: David S. Miller @ 2006-04-27 6:25 UTC (permalink / raw)
To: kelly; +Cc: rusty, netdev
From: Kelly Daly <kelly@au1.ibm.com>
Date: Thu, 27 Apr 2006 13:31:37 +1000
> It should be quite trivial to resize this pool using RCU.
Yes, a lot of this stuff can use RCU, in particular the channel
demux is a prime candidate.
There are some non-trivial issues wrt. synchronizing the net
channel lookup state with socket state changes (socket moves to
close or whatever). This reminds me that we had some nice TCP
hash table RCU patches that Ben LaHaise posted at one point and
that slipped through the cracks. That took care of all the event
ordering issues, it seemed at the time, and is something we need
to get back on track with.
> The tail pointers are an excellent idea - and they certainly fix a
> lot of compatibility issues that we side-stepped (we were going for
> the "make it work" approach rather than the "make it right" -
> figured we could get to that bit later =P ).
Start simple, we can keep mucking with the interfaces over and over
again as we move from simple netif_receive_skb() channels out to the
more complex socket demux style channel.
This is a big and long project, there are no style points for trying
to go all the way in the first pass :-)
> We approached this from the understanding that an intelligent NIC
> will be able to transition directly to userspace, which is a major
> win. 0 copies to userspace would be sweet. I think we can still
> achieve this using your scheme without *too* much pain.
Understood. What's your basic idea? Just make the buffers in the
pool large enough to fit the SKB encapsulation at the end?
Note that this will change a lot of the assumptions currently in
your buffer handling code about buffer reuse and such.
So the idea in your scheme is to give the buffer pools to the NIC
in a per-channel way via a simple descriptor table? And the u32's
are arbitrary keys that index into this descriptor table, right?
I would suggest just sticking to the simple global input queue.
Solve the easy problems and the buffering model first. Then we
can port drivers and people can bang on the basic infrastructure.
Take my SKB encapsulator in my vj-2.6 tree once you've transformed
your buffer pools to accommodate.
I'll actually sit back and let you do that, I'm actually coming around
more to your scheme in some regards :-) I'll sit and think about some
of the heavier issues we'll hit in the next phase and once you have
a cut at the current phase I'll work on a tg3 driver port.
Thanks!
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-27 6:25 ` David S. Miller
@ 2006-04-27 11:51 ` Evgeniy Polyakov
2006-04-27 20:09 ` David S. Miller
2006-05-04 2:59 ` Kelly Daly
1 sibling, 1 reply; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-04-27 11:51 UTC (permalink / raw)
To: David S. Miller; +Cc: kelly, rusty, netdev
On Wed, Apr 26, 2006 at 11:25:01PM -0700, David S. Miller (davem@davemloft.net) wrote:
> > We approached this from the understanding that an intelligent NIC
> > will be able to transition directly to userspace, which is a major
> > win. 0 copies to userspace would be sweet. I think we can still
> > achieve this using your scheme without *too* much pain.
>
> Understood. What's your basic idea? Just make the buffers in the
> pool large enough to fit the SKB encapsulation at the end?
There are some caveats here, found while developing the zero-copy sniffer
[1]. The project's goal was to remap skbs into userspace in real time.
While the absolute numbers (posted to netdev@) were really high, the
approach is only applicable to read-only applications. As was shown in the
IOAT thread, data must be warmed in the caches, so reading from a mapped
area is only as fast as memcpy() (read+write), and copy_to_user() is
actually almost equal to memcpy() (benchmarks were posted to netdev@). And
we must add the remapping overhead on top.
If we want to DMA data from the NIC into a premapped userspace area, we
will run into problems with message sizes, misalignment, slow reads and so
on, so preallocation has even more problems.
This change also requires significant changes in applications, at least
until recv/send are changed, which is not the best thing to do.
So I think that the mapping itself could be offered as an additional socket
option or something similar, not turned on by default.
I do think that the significant win in VJ's tests belongs not to remapping
and cache-oriented changes, but to moving all protocol processing into the
process's context.
I fully agree with Dave that it must be implemented step-by-step, and
the most significant step, IMHO, is moving protocol processing into the
socket's "place". This will force netfilter changes, but I do think that
for the proof-of-concept code we can turn it off.
I will start to work in this direction next week after aio_sendfile() is
completed.
So, we will have three attempts to write incompatible stacks - and that is good :)
No one needs an excuse to rewrite something, as I read in Rusty's blog...
Thanks.
[1]. http://tservice.net.ru/~s0mbre/old/?section=projects&item=af_tlb
--
Evgeniy Polyakov
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-27 11:51 ` Evgeniy Polyakov
@ 2006-04-27 20:09 ` David S. Miller
2006-04-28 6:05 ` Evgeniy Polyakov
0 siblings, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-04-27 20:09 UTC (permalink / raw)
To: johnpol; +Cc: kelly, rusty, netdev
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Thu, 27 Apr 2006 15:51:26 +0400
> There are some caveats here found while developing zero-copy sniffer
> [1]. Project's goal was to remap skbs into userspace in real-time.
> While absolute numbers (posted to netdev@) were really high, it is only
> applicable to read-only application. As was shown in IOAT thread,
> data must be warmed in caches, so reading from mapped area will be as
> fast as memcpy() (read+write), and copy_to_user() actually almost equal
> to memcpy() (benchmarks were posted to netdev@). And we must add
> remapping overhead.
Yes, all of these issues are related quite strongly. Thanks for
making the connection explicit.
But, the mapping overhead is zero for this net channel stuff, at
least as it is implemented and designed by Kelly. Ring buffer is
setup ahead of time into the user's address space, and a ring of
buffers into that area are given to the networking card.
We remember the translations here, so no get_user_pages() on each
transfer and garbage like that. And yes this all harks back to the
issues that are discussed in Chapter 5 of Networking Algorithmics.
But the core thing to understand is that by defining a new API and
setting up the buffer pool ahead of time, we avoid all of the
get_user_pages() overhead while retaining full kernel/user protection.
Evgeniy, the difference between this and your work is that you did not
have an intelligent piece of hardware that could be told to recognize
flows, and only put packets for a specific flow into that flow's
buffer pool.
> If we want to dma data from nic into premapped userspace area, this will
> strike with message sizes/misalignment/slow read and so on, so
> preallocation has even more problems.
I do not really think this is an issue, we put the full packet into
user space and teach it where the offset is to the actual data.
We'll do the same things we do today to try and get the data area
aligned. User can do whatever is logical and relevant on his end
to deal with strange cases.
In fact we can specify that card has to take some care to get data
area of packet aligned on say an 8 byte boundary or something like
that. When we don't have hardware assist, we are going to be doing
copies.
> This change also requires significant changes in application, at least
> until recv/send are changed, which is not the best thing to do.
This is exactly the point, we can only do a good job and receive zero
copy if we can change the interfaces, and that's exactly what we're
doing here.
> I do think that significant win in VJ's tests belongs not to remapping
> and cache-oriented changes, but to move all protocol processing into
> process' context.
I partly disagree. The biggest win is eliminating all of the control
overhead (all of "softint RX + protocol demux + IP route lookup +
socket lookup" is turned into single flow demux), and the SMP safe
data structure which makes it realistic enough to always move the bulk
of the packet work to the socket's home cpu.
I do not think userspace protocol implementation buys enough to
justify it. We have to do the protection switch in and out of kernel
space anyways, so why not still do the protected protocol processing
work in the kernel? It is still being done on the user's behalf,
contributes to his time slice, and avoids all of the terrible issues
of userspace protocol implementations.
So in my mind, the optimal situation from both a protection preservation
and also a performance perspective is net channels to kernel socket
protocol processing, buffers DMA'd directly into userspace if hardware
assist is present.
> I fully agree with Dave that it must be implemented step-by-step, and
> the most significant, IMHO, is moving protocol processing into socket's
> "place". This will force to netfilter changes, but I do think that for
> the proof-of-concept code we can turn it off.
And I also want to note that even if the whole idea explodes and
cannot be made to work, there are good arguments for transitioning
to SKB'less drivers for their own sake. So work will really not
be lost.
Let's have 100 different implementations of net channels! :-)
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-27 20:09 ` David S. Miller
@ 2006-04-28 6:05 ` Evgeniy Polyakov
0 siblings, 0 replies; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-04-28 6:05 UTC (permalink / raw)
To: David S. Miller; +Cc: kelly, rusty, netdev
On Thu, Apr 27, 2006 at 01:09:18PM -0700, David S. Miller (davem@davemloft.net) wrote:
> Evgeniy, the difference between this and your work is that you did not
> have an intelligent piece of hardware that could be told to recognize
> flows, and only put packets for a specific flow into that flow's
> buffer pool.
There are the most "intelligent" NICs which use MMIO copy, like the Realtek
8139 :), which were used in the receiving zero-copy [1] project.
A special algorithm was researched for receiving zero-copy [1] to allow
placing non-page-aligned TCP frames into pages, but there was another
problem when a page was committed, since no byte-level commit is allowed in
VFS.
In this case we do not have that problem, but instead we must force
userspace to be very smart when dealing with mapped buffers, instead of
using a simple recv(). And for sending it must be even smarter, since data
must be properly aligned. And what about crappy hardware which can DMA only
into a limited memory area, or a NIC that cannot do SG? Or do we need
remapping for a NIC that cannot do checksum calculation?
> > If we want to dma data from nic into premapped userspace area, this will
> > strike with message sizes/misalignment/slow read and so on, so
> > preallocation has even more problems.
>
> I do not really think this is an issue, we put the full packet into
> user space and teach it where the offset is to the actual data.
> We'll do the same things we do today to try and get the data area
> aligned. User can do whatever is logical and relevant on his end
> to deal with strange cases.
>
> In fact we can specify that card has to take some care to get data
> area of packet aligned on say an 8 byte boundary or something like
> that. When we don't have hardware assist, we are going to be doing
> copies.
Userspace would have to be very smart, and as we saw with various Java
tests, it cannot manage that even now.
And what if pages are shared and several threads are trying to write
into the same remapped area? Will we use COW and be blamed like Mach
and FreeBSD developers? :)
> > I do think that significant win in VJ's tests belongs not to remapping
> > and cache-oriented changes, but to move all protocol processing into
> > process' context.
>
> I partly disagree. The biggest win is eliminating all of the control
> overhead (all of "softint RX + protocol demux + IP route lookup +
> socket lookup" is turned into single flow demux), and the SMP safe
> data structure which makes it realistic enough to always move the bulk
> of the packet work to the socket's home cpu.
>
> I do not think userspace protocol implementation buys enough to
> justify it. We have to do the protection switch in and out of kernel
> space anyways, so why not still do the protected protocol processing
> work in the kernel? It is still being done on the user's behalf,
> contributes to his time slice, and avoids all of the terrible issues
> of userspace protocol implementations.
After the hard IRQ a softirq is scheduled, then later userspace is
scheduled - at least two context switches just to move a packet, and the
"slow" userspace code is interrupted by both IRQs again...
I ran some tests on ppc32 embedded boards which showed that rescheduling
latency can sometimes reach millisecond delays (with about 4 running
processes on a 200MHz CPU); although we do not have real-time requirements
here, it is not a good sign...
> And I also want to note that even if the whole idea explodes and
> cannot be made to work, there are good arguments for transitioning
> to SKB'less drivers for their own sake. So work will really not
> be lost.
>
> Let's have 100 different implementations of net channels! :-)
:)
--
Evgeniy Polyakov
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-27 6:25 ` David S. Miller
2006-04-27 11:51 ` Evgeniy Polyakov
@ 2006-05-04 2:59 ` Kelly Daly
2006-05-04 23:22 ` David S. Miller
1 sibling, 1 reply; 35+ messages in thread
From: Kelly Daly @ 2006-05-04 2:59 UTC (permalink / raw)
Cc: David S. Miller, rusty, netdev
On Thursday 27 April 2006 16:25, you wrote:
> So the idea in your scheme is to give the buffer pools to the NIC
> in a per-channel way via a simple descriptor table? And the u32's
> are arbitrary keys that index into this descriptor table, right?
>
yeah - it _was_... although after having a play at coding it into your
implementation, we've come up with the following:
Using the descriptor table adds excess complexity for kernel buffers, and is
really only useful for userspace. So instead of using descriptor tables for
everything, we've come up with a dynamic descriptor table scheme where they
are used only for userspace.
The move to skb-ising the buffers has made it more difficult to keep track of
buffer lifetimes. Previously we were leaving the buffers in the ring until
completely finished with them. The producer could reuse the buffer once the
consumer head had moved on. With the graft to skb we can no longer do this
unless the packets are processed serially (which is ok for socket channels,
but not realistic for the default).
We DID write an infrastructure to resolve this issue, although it is more
complex than the dynamic descriptor scheme for userspace. And we want to
keep this simple - right?
Cheers,
K
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-04 2:59 ` Kelly Daly
@ 2006-05-04 23:22 ` David S. Miller
2006-05-05 1:31 ` Rusty Russell
0 siblings, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-05-04 23:22 UTC (permalink / raw)
To: kelly; +Cc: rusty, netdev
From: Kelly Daly <kelly@au1.ibm.com>
Date: Thu, 4 May 2006 12:59:23 +1000
> We DID write an infrastructure to resolve this issue, although it is more
> complex than the dynamic descriptor scheme for userspace. And we want to
> keep this simple - right?
Yes.
I wonder if it is possible to manage the buffer pool just like a SLAB
cache to deal with the variable lifetimes. The system has a natural
"working set" size of networking buffers at a given point in time and
even the default net channel can grow to accommodate that with some
kind of limit.
This is kind of what I was alluding to in the past, in that we now
have global limits on system TCP socket memory when really what we
want to do is have a set of global generic system packet memory
limits.
These two things can tie in together.
Note that this means we need a callback in the SKB to free the memory
up. For direct net channels to a socket, you don't need any callbacks
of course because as you mentioned you know the buffer lifetimes.
People want such a callback anyways in order to experiment with SKB
recycling in drivers.
Note that some kind of "shrink" callback would need to be implemented.
It would only be needed for the default channel. We need to seriously
avoid needing something like this over the socket net channels because
that is serious complexity.
Finally... if we go the global packet memory route, we will need hard
and soft limits. There is a danger in such a scheme of not being able
to get critical control packets out (ACKs, etc.). Also, there are all
kinds of classification and drop algorithms (see RED) which could be
used to handle overload situations gracefully.
So, are you still sure you want to do away with the descriptors for
the default channel? Is the scheme I have outlined above doable or
is there some critical barrier or some complexity issue which makes
it undesirable?
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-04 23:22 ` David S. Miller
@ 2006-05-05 1:31 ` Rusty Russell
0 siblings, 0 replies; 35+ messages in thread
From: Rusty Russell @ 2006-05-05 1:31 UTC (permalink / raw)
To: David S. Miller; +Cc: kelly, netdev
On Thu, 2006-05-04 at 16:22 -0700, David S. Miller wrote:
> From: Kelly Daly <kelly@au1.ibm.com>
> Date: Thu, 4 May 2006 12:59:23 +1000
>
> > We DID write an infrastructure to resolve this issue, although it is more
> > complex than the dynamic descriptor scheme for userspace. And we want to
> > keep this simple - right?
>
> Yes.
>
> I wonder if it is possible to manage the buffer pool just like a SLAB
> cache to deal with the variable lifetimes. The system has a natural
> "working set" size of networking buffers at a given point in time and
> even the default net channel can grow to accommodate that with some
> kind of limit.
>
> This is kind of what I was alluding to in the past, in that we now
> have global limits on system TCP socket memory when really what we
> want to do is have a set of global generic system packet memory
> limits.
>
> These two things can tie in together.
Hi Dave,
We kept a simple "used" bitmap, but to avoid the consumer touching it,
also put an "I am masquerading as an SKB" bit in the trailer, like so:
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .16405-linux-2.6.17-rc3-git7/include/linux/skbuff.h .16405-linux-2.6.17-rc3-git7.updated/include/linux/skbuff.h
--- .16405-linux-2.6.17-rc3-git7/include/linux/skbuff.h 2006-05-03 22:07:14.000000000 +1000
+++ .16405-linux-2.6.17-rc3-git7.updated/include/linux/skbuff.h 2006-05-03 22:07:15.000000000 +1000
@@ -133,7 +133,8 @@ struct skb_frag_struct {
*/
struct skb_shared_info {
atomic_t dataref;
- unsigned short nr_frags;
+ unsigned short nr_frags : 15;
+ unsigned int chan_as_skb : 1;
unsigned short tso_size;
unsigned short tso_segs;
unsigned short ufo_size;
diff -urpN --exclude TAGS -X /home/rusty/devel/kernel/kernel-patches/current-dontdiff --minimal .16405-linux-2.6.17-rc3-git7/net/core/skbuff.c .16405-linux-2.6.17-rc3-git7.updated/net/core/skbuff.c
--- .16405-linux-2.6.17-rc3-git7/net/core/skbuff.c 2006-05-03 22:07:14.000000000 +1000
+++ .16405-linux-2.6.17-rc3-git7.updated/net/core/skbuff.c 2006-05-03 22:07:15.000000000 +1000
@@ -289,6 +289,7 @@ struct sk_buff *skb_netchan_graft(struct
skb_shinfo(skb)->ufo_size = 0;
skb_shinfo(skb)->ip6_frag_id = 0;
skb_shinfo(skb)->frag_list = NULL;
+ skb_shinfo(skb)->chan_as_skb = 1;
return skb;
}
@@ -328,7 +329,10 @@ void skb_release_data(struct sk_buff *sk
if (skb_shinfo(skb)->frag_list)
skb_drop_fraglist(skb);
- kfree(skb->head);
+ if (skb_shinfo(skb)->chan_as_skb)
+ skb_shinfo(skb)->chan_as_skb = 0;
+ else
+ kfree(skb->head);
}
}
Buffer allocation would be: find_first_bit, check that it's not actually
inside an skb, or otherwise find_next_bit. Assuming most buffers do not
go down default channel, this is efficient.
Problems:
1) it's still not cache-friendly with producers on multiple CPUs. We
could divide up the bitmap into per-cpu regions to try first to improve
cache behaviour.
2) In addition, we had every buffer one page large. This isn't
sufficient for jumbo frames, and wasteful for ethernet. So if we
statically assign descriptors -> buffers, we need to have multiple
sizes.
3) OTOH, if descriptor table is dynamic, we have cache issues again as
multiple people are writing to it, and it's not clear what we really
gain over direct pointers.
4) Grow/shrink can be done, but needs stop_machine, or maybe tricky RCU.
5) The killer for me: we can't use our scheme straight-to-userspace
anyway, since we can't trust the (user-writable) ringbuffer in deciding
what buffers to release. Since we need to store this somewhere, we need
a test in netchannel_enqueue. At which point, we might as well
translate to "descriptors" at that point, anyway (since descriptors are
only really needed for userspace). Something like:
tail = np->netchan_tail;
if (tail == np->netchan_head)
return -ENOMEM;
+ /* Write to userspace? They can't deref ptr anyway. */
+ if (np->shadow_ring && !netchan_local_buf(bp)) {
+ np->shadow_ring[tail] = bp;
+ bp = (void *)-1;
+ }
np->netchan_queue[tail++] = bp;
if (tail == NET_CHANNEL_ENTRIES)
(We don't have local buffers yet, but I'm assuming we'll use v. low
pointers for them). Userspace goes "desc number is in range, we can
access directly" or "desc number isn't, call into kernel to copy them
for us".
> So, are you still sure you want to do away with the descriptors for
> the default channel? Is the scheme I have outlined above doable or
> is there some critical barrier or some complexity issue which makes
> it undesirable?
I think it's simpler to build global alloc limiters on what we have.
The slab already has the nice lifetime and cache-friendly properties we
want, so we just have to write the limiting code. There's enough work
there to keep us busy 8)
Cheers,
Rusty.
--
ccontrol: http://ozlabs.org/~rusty/ccontrol
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-26 11:47 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Kelly Daly
2006-04-26 7:33 ` David S. Miller
@ 2006-04-26 7:59 ` David S. Miller
2006-05-04 7:28 ` Kelly Daly
1 sibling, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-04-26 7:59 UTC (permalink / raw)
To: kelly; +Cc: netdev, rusty
Ok I have comments already just glancing at the initial patch.
With the 32-bit descriptors in the channel, you indeed end up
with a fixed sized pool with a lot of hard-to-finesse sizing
and lookup problems to solve.
So what I wanted to do was finesse the entire issue by simply
side-stepping it initially. Use a normal buffer with a tail
descriptor, when you enqueue you give a tail descriptor pointer.
Yes, it's weirder to handle this in hardware, but it's not
impossible and using real pointers means two things:
1) You can design a simple netif_receive_skb() channel that works
today, encapsulation of channel buffers into an SKB is like
15 lines of code and no funny lookups.
2) People can start porting the input path of drivers right now and
retain full functionality and test anything they want. This is
important for getting the drivers stable as fast as possible.
And it also means we can tackle the buffer pool issue of the 32-bit
descriptors later, if we actually want to do things that way, I
think we probably don't.
To be honest, I don't think using a 32-bit descriptor is so critical
even from a hardware implementation perspective. Yes, on 64-bit
you're dealing with a 64-bit quantity, so the number of entries in the
channel is halved from what a 32-bit arch uses.
I say this for two reasons:
1) We have no idea whether it's critical to have "~512" entries
in the channel which is about what a u32 queue entry type
affords you on x86 with 4096 byte page size.
2) Furthermore, it is sized by page size, and most 64-bit platforms
use an 8K base page size anyways, so the number of queue entries
ends up being the same. Yes, I know some 64-bit platforms use
a 4K page size, please see #1 :-)
I really dislike the pools of buffers, partly because they are fixed
size (or dynamically sized and even more expensive to implement), but
more so because there is all of this absolutely stupid state management
you eat just to get at the real data. That's pointless, we're trying
to make this as light as possible. Just use real pointers and
describe the packet with a tail descriptor.
We can use a u64 or whatever in a hardware implementation.
Next, you can't even begin to work on the protocol channels before you
do one very important piece of work. Integration of all of the ipv4
and ipv6 protocol hash tables into a central code, it's a total
prerequisite. Then you modify things to use a generic
inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a
protocol number as well as saddr/daddr/sport/dport and searches
from a central table.
So I think I'll continue working on my implementation, it's more
transitional and that's how we have to do this kind of work.
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-04-26 7:59 ` David S. Miller
@ 2006-05-04 7:28 ` Kelly Daly
2006-05-04 23:11 ` David S. Miller
0 siblings, 1 reply; 35+ messages in thread
From: Kelly Daly @ 2006-05-04 7:28 UTC (permalink / raw)
To: David S. Miller; +Cc: kelly, netdev, rusty
On Wednesday 26 April 2006 17:59, David S. Miller wrote:
> Next, you can't even begin to work on the protocol channels before you
> do one very important piece of work. Integration of all of the ipv4
> and ipv6 protocol hash tables into a central code, it's a total
> prerequisite. Then you modify things to use a generic
> inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a
> protocol number as well as saddr/daddr/sport/dport and searches
> from a central table.
Back here again ;)
Is this on the right track (see patch below)?
K
_____________________
diff -urp davem_orig/include/net/inet_hashtables.h kelly/include/net/inet_hashtables.h
--- davem_orig/include/net/inet_hashtables.h 2006-04-27 00:08:32.000000000 +1000
+++ kelly/include/net/inet_hashtables.h 2006-05-04 14:28:59.000000000 +1000
@@ -418,4 +418,6 @@ static inline struct sock *inet_lookup(s
extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk);
+
+extern struct inet_hashinfo *inet_hashes[256];
#endif /* _INET_HASHTABLES_H */
diff -urp davem_orig/include/net/sock.h kelly/include/net/sock.h
--- davem_orig/include/net/sock.h 2006-05-02 13:42:10.000000000 +1000
+++ kelly/include/net/sock.h 2006-05-04 14:28:59.000000000 +1000
@@ -196,6 +196,7 @@ struct sock {
unsigned short sk_type;
int sk_rcvbuf;
socket_lock_t sk_lock;
+ struct netchannel *sk_channel;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
struct xfrm_policy *sk_policy[2];
diff -urp davem_orig/net/core/dev.c kelly/net/core/dev.c
--- davem_orig/net/core/dev.c 2006-04-27 15:49:27.000000000 +1000
+++ kelly/net/core/dev.c 2006-05-04 16:58:49.000000000 +1000
@@ -116,6 +116,7 @@
#include <net/iw_handler.h>
#include <asm/current.h>
#include <linux/audit.h>
+#include <net/inet_hashtables.h>
/*
* The list of packet types we will receive (as opposed to discard)
@@ -190,6 +191,8 @@ static inline struct hlist_head *dev_ind
return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)];
}
+static struct netchannel default_netchannel;
+
/*
* Our notifier list
*/
@@ -1907,6 +1910,37 @@ struct netchannel_buftrailer *__netchann
}
EXPORT_SYMBOL_GPL(__netchannel_dequeue);
+
+/* Find the channel for a packet, or return default channel. */
+struct netchannel *find_netchannel(const struct netchannel_buftrailer *np)
+{
+ struct sock *sk = NULL;
+ unsigned long dlen = np->netchan_buf_len - np->netchan_buf_offset;
+ void *data = (void *)np - dlen;
+
+ switch (np->netchan_buf_proto) {
+ case __constant_htons(ETH_P_IP): {
+ struct iphdr *ip = data;
+ int iphl = ip->ihl * 4;
+
+ if (dlen >= (iphl + 4) && iphl == sizeof(struct iphdr)) {
+ u16 *ports = (u16 *)(ip + 1);
+
+ if (inet_hashes[ip->protocol]) {
+ sk = inet_lookup(inet_hashes[ip->protocol],
+ ip->saddr, ports[0],
+ ip->daddr, ports[1],
+ np->netchan_buf_dev->ifindex);
+ }
+ break;
+ }
+ }
+ }
+ if (sk && sk->sk_channel)
+ return sk->sk_channel;
+ return &default_netchannel;
+}
+
static gifconf_func_t * gifconf_list [NPROTO];
/**
@@ -3421,6 +3455,9 @@ static int __init net_dev_init(void)
hotcpu_notifier(dev_cpu_callback, 0);
dst_init();
dev_mcast_init();
+
+ /* FIXME: This should be attached to thread/threads. */
+ netchannel_init(&default_netchannel, NULL, NULL);
rc = 0;
out:
return rc;
diff -urp davem_orig/net/ipv4/inet_hashtables.c kelly/net/ipv4/inet_hashtables.c
--- davem_orig/net/ipv4/inet_hashtables.c 2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/inet_hashtables.c 2006-05-04 14:28:59.000000000 +1000
@@ -337,3 +337,5 @@ out:
}
EXPORT_SYMBOL_GPL(inet_hash_connect);
+
+struct inet_hashinfo *inet_hashes[256];
diff -urp davem_orig/net/ipv4/tcp.c kelly/net/ipv4/tcp.c
--- davem_orig/net/ipv4/tcp.c 2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/tcp.c 2006-05-04 14:28:59.000000000 +1000
@@ -2173,6 +2173,7 @@ void __init tcp_init(void)
tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);
tcp_register_congestion_control(&tcp_reno);
+ inet_hashes[IPPROTO_TCP] = &tcp_hashinfo;
}
EXPORT_SYMBOL(tcp_close);
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-04 7:28 ` Kelly Daly
@ 2006-05-04 23:11 ` David S. Miller
2006-05-05 2:48 ` Kelly Daly
0 siblings, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-05-04 23:11 UTC (permalink / raw)
To: kelly; +Cc: netdev, rusty
From: Kelly Daly <kelly@au1.ibm.com>
Date: Thu, 4 May 2006 17:28:27 +1000
> On Wednesday 26 April 2006 17:59, David S. Miller wrote:
> > Next, you can't even begin to work on the protocol channels before you
> > do one very important piece of work. Integration of all of the ipv4
> > and ipv6 protocol hash tables into a central code, it's a total
> > prerequisite. Then you modify things to use a generic
> > inet_{,listen_}lookup() or inet6_{,listen_}lookup() that takes a
> > protocol number as well as saddr/daddr/sport/dport and searches
> > from a central table.
>
> Back here again ;)
>
> Is this on the right track (see patch below)?
It is on the right track.
I very much fear abuse of the inet_hashes[] array. So I'd rather
hide it behind a programmatic interface, something like:
extern struct sock *inet_lookup_proto(u16 protocol, u32 saddr, u16 sport,
u32 daddr, u16 dport, int ifindex);
and export that from inet_hashtables.c
Then you have registry and unregistry functions in inet_hashtables.c
that setup the static inet_hashes[] array. So TCP would go:
inet_hash_register(IPPROTO_TCP, &tcp_hashinfo);
instead of the direct assignment to inet_hashes[] it makes right
now in your patch.
Thanks!
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-04 23:11 ` David S. Miller
@ 2006-05-05 2:48 ` Kelly Daly
2006-05-16 1:02 ` Kelly Daly
0 siblings, 1 reply; 35+ messages in thread
From: Kelly Daly @ 2006-05-05 2:48 UTC (permalink / raw)
To: David S. Miller; +Cc: rusty, netdev
On Friday 05 May 2006 09:11, David S. Miller wrote:
> I very much fear abuse of the inet_hashes[] array. So I'd rather
> hide it behind a programmatic interface, something like:
done! I will continue with implementation of default netchannel for now.
> Thanks!
anytime =)
Cheers,
K
______________________
diff -urp davem_orig/include/net/inet_hashtables.h kelly/include/net/inet_hashtables.h
--- davem_orig/include/net/inet_hashtables.h 2006-04-27 00:08:32.000000000 +1000
+++ kelly/include/net/inet_hashtables.h 2006-05-05 12:05:33.000000000 +1000
@@ -418,4 +418,7 @@ static inline struct sock *inet_lookup(s
extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk);
+extern void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo);
+extern struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex);
+
#endif /* _INET_HASHTABLES_H */
diff -urp davem_orig/include/net/sock.h kelly/include/net/sock.h
--- davem_orig/include/net/sock.h 2006-05-02 13:42:10.000000000 +1000
+++ kelly/include/net/sock.h 2006-05-04 14:28:59.000000000 +1000
@@ -196,6 +196,7 @@ struct sock {
unsigned short sk_type;
int sk_rcvbuf;
socket_lock_t sk_lock;
+ struct netchannel *sk_channel;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
struct xfrm_policy *sk_policy[2];
diff -urp davem_orig/net/core/dev.c kelly/net/core/dev.c
--- davem_orig/net/core/dev.c 2006-04-27 15:49:27.000000000 +1000
+++ kelly/net/core/dev.c 2006-05-05 10:39:22.000000000 +1000
@@ -116,6 +116,7 @@
#include <net/iw_handler.h>
#include <asm/current.h>
#include <linux/audit.h>
+#include <net/inet_hashtables.h>
/*
* The list of packet types we will receive (as opposed to discard)
@@ -190,6 +191,8 @@ static inline struct hlist_head *dev_ind
return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)];
}
+static struct netchannel default_netchannel;
+
/*
* Our notifier list
*/
@@ -1907,6 +1910,34 @@ struct netchannel_buftrailer *__netchann
}
EXPORT_SYMBOL_GPL(__netchannel_dequeue);
+
+/* Find the channel for a packet, or return default channel. */
+struct netchannel *find_netchannel(const struct netchannel_buftrailer *np)
+{
+ struct sock *sk = NULL;
+ unsigned long dlen = np->netchan_buf_len - np->netchan_buf_offset;
+ void *data = (void *)np - dlen;
+
+ switch (np->netchan_buf_proto) {
+ case __constant_htons(ETH_P_IP): {
+ struct iphdr *ip = data;
+ int iphl = ip->ihl * 4;
+
+ if (dlen >= (iphl + 4) && iphl == sizeof(struct iphdr)) {
+ u16 *ports = (u16 *)(ip + 1);
+ sk = inet_lookup_proto(ip->protocol,
+ ip->saddr, ports[0],
+ ip->daddr, ports[1],
+ np->netchan_buf_dev->ifindex);
+ break;
+ }
+ }
+ }
+ if (sk && sk->sk_channel)
+ return sk->sk_channel;
+ return &default_netchannel;
+}
+
static gifconf_func_t * gifconf_list [NPROTO];
/**
@@ -3421,6 +3452,9 @@ static int __init net_dev_init(void)
hotcpu_notifier(dev_cpu_callback, 0);
dst_init();
dev_mcast_init();
+
+ /* FIXME: This should be attached to thread/threads. */
+ netchannel_init(&default_netchannel, NULL, NULL);
rc = 0;
out:
return rc;
diff -urp davem_orig/net/ipv4/inet_hashtables.c kelly/net/ipv4/inet_hashtables.c
--- davem_orig/net/ipv4/inet_hashtables.c 2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/inet_hashtables.c 2006-05-05 12:05:33.000000000 +1000
@@ -337,3 +337,25 @@ out:
}
EXPORT_SYMBOL_GPL(inet_hash_connect);
+
+static struct inet_hashinfo *inet_hashes[256];
+
+void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo)
+{
+ BUG_ON(inet_hashes[proto]);
+ inet_hashes[proto] = hashinfo;
+}
+EXPORT_SYMBOL(inet_hash_register);
+
+struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex)
+{
+ struct sock *sk = NULL;
+ if (inet_hashes[protocol]) {
+ sk = inet_lookup(inet_hashes[protocol],
+ saddr, sport,
+ daddr, dport,
+ ifindex);
+ }
+ return sk;
+}
+EXPORT_SYMBOL(inet_lookup_proto);
diff -urp davem_orig/net/ipv4/tcp.c kelly/net/ipv4/tcp.c
--- davem_orig/net/ipv4/tcp.c 2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/tcp.c 2006-05-05 11:29:18.000000000 +1000
@@ -2173,6 +2173,7 @@ void __init tcp_init(void)
tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);
tcp_register_congestion_control(&tcp_reno);
+ inet_hash_register(IPPROTO_TCP, &tcp_hashinfo);
}
EXPORT_SYMBOL(tcp_close);
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-05 2:48 ` Kelly Daly
@ 2006-05-16 1:02 ` Kelly Daly
2006-05-16 1:05 ` David S. Miller
` (2 more replies)
0 siblings, 3 replies; 35+ messages in thread
From: Kelly Daly @ 2006-05-16 1:02 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, rusty
On Friday 05 May 2006 12:48, Kelly Daly wrote:
> done! I will continue with implementation of default netchannel for now.
___________________
diff -urp davem_orig/include/net/inet_hashtables.h kelly/include/net/inet_hashtables.h
--- davem_orig/include/net/inet_hashtables.h 2006-04-27 00:08:32.000000000 +1000
+++ kelly/include/net/inet_hashtables.h 2006-05-05 12:45:44.000000000 +1000
@@ -418,4 +418,7 @@ static inline struct sock *inet_lookup(s
extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk);
+extern void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo);
+extern struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex);
+
#endif /* _INET_HASHTABLES_H */
diff -urp davem_orig/include/net/sock.h kelly/include/net/sock.h
--- davem_orig/include/net/sock.h 2006-05-02 13:42:10.000000000 +1000
+++ kelly/include/net/sock.h 2006-05-04 14:28:59.000000000 +1000
@@ -196,6 +196,7 @@ struct sock {
unsigned short sk_type;
int sk_rcvbuf;
socket_lock_t sk_lock;
+ struct netchannel *sk_channel;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
struct xfrm_policy *sk_policy[2];
diff -urp davem_orig/net/core/dev.c kelly/net/core/dev.c
--- davem_orig/net/core/dev.c 2006-04-27 15:49:27.000000000 +1000
+++ kelly/net/core/dev.c 2006-05-15 12:21:41.000000000 +1000
@@ -113,9 +113,11 @@
#include <linux/delay.h>
#include <linux/wireless.h>
#include <linux/netchannel.h>
+#include <linux/kthread.h>
#include <net/iw_handler.h>
#include <asm/current.h>
#include <linux/audit.h>
+#include <net/inet_hashtables.h>
/*
* The list of packet types we will receive (as opposed to discard)
@@ -190,6 +192,10 @@ static inline struct hlist_head *dev_ind
return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)];
}
+/* default netchannel stuff */
+static struct netchannel default_netchannel;
+static wait_queue_head_t default_netchannel_wq;
+
/*
* Our notifier list
*/
@@ -1854,6 +1860,35 @@ softnet_break:
goto out;
}
+static void default_netchannel_wake(struct netchannel *np)
+{
+ wake_up(&default_netchannel_wq);
+}
+
+/* handles default chan buffers that nobody else wants */
+static int default_netchannel_thread(void *unused)
+{
+ wait_queue_t wait;
+ struct netchannel_buftrailer *bp;
+ struct sk_buff *skbp;
+
+ wait.private = current;
+ wait.func = default_wake_function;
+ INIT_LIST_HEAD(&wait.task_list);
+
+ add_wait_queue(&default_netchannel_wq, &wait);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+ while (!kthread_should_stop()) {
+ bp = __netchannel_dequeue(&default_netchannel);
+ skbp = skb_netchan_graft(bp, GFP_ATOMIC);
+ netif_receive_skb(skbp);
+ }
+ remove_wait_queue(&default_netchannel_wq, &wait);
+ __set_current_state(TASK_RUNNING);
+ return 0;
+}
+
void netchannel_init(struct netchannel *np,
void (*callb)(struct netchannel *), void *callb_data)
{
@@ -1907,6 +1942,34 @@ struct netchannel_buftrailer *__netchann
}
EXPORT_SYMBOL_GPL(__netchannel_dequeue);
+
+/* Find the channel for a packet, or return default channel. */
+struct netchannel *find_netchannel(const struct netchannel_buftrailer *np)
+{
+ struct sock *sk = NULL;
+ unsigned long dlen = np->netchan_buf_len - np->netchan_buf_offset;
+ void *data = (void *)np - dlen;
+
+ switch (np->netchan_buf_proto) {
+ case __constant_htons(ETH_P_IP): {
+ struct iphdr *ip = data;
+ int iphl = ip->ihl * 4;
+
+ if (dlen >= (iphl + 4) && iphl == sizeof(struct iphdr)) {
+ u16 *ports = (u16 *)(ip + 1);
+ sk = inet_lookup_proto(ip->protocol,
+ ip->saddr, ports[0],
+ ip->daddr, ports[1],
+ np->netchan_buf_dev->ifindex);
+ break;
+ }
+ }
+ }
+ if (sk && sk->sk_channel)
+ return sk->sk_channel;
+ return &default_netchannel;
+}
+
static gifconf_func_t * gifconf_list [NPROTO];
/**
@@ -3375,6 +3438,7 @@ static int dev_cpu_callback(struct notif
static int __init net_dev_init(void)
{
int i, rc = -ENOMEM;
+ struct task_struct *netchan_thread;
BUG_ON(!dev_boot_phase);
@@ -3421,7 +3485,12 @@ static int __init net_dev_init(void)
hotcpu_notifier(dev_cpu_callback, 0);
dst_init();
dev_mcast_init();
- rc = 0;
+
+ netchannel_init(&default_netchannel, default_netchannel_wake, NULL);
+ netchan_thread = kthread_run(default_netchannel_thread, NULL, "kvj_def");
+
+ if (!IS_ERR(netchan_thread)) /* kthread_run returned thread */
+ rc = 0;
out:
return rc;
}
diff -urp davem_orig/net/ipv4/inet_hashtables.c kelly/net/ipv4/inet_hashtables.c
--- davem_orig/net/ipv4/inet_hashtables.c 2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/inet_hashtables.c 2006-05-05 12:45:44.000000000 +1000
@@ -337,3 +337,25 @@ out:
}
EXPORT_SYMBOL_GPL(inet_hash_connect);
+
+static struct inet_hashinfo *inet_hashes[256];
+
+void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo)
+{
+ BUG_ON(inet_hashes[proto]);
+ inet_hashes[proto] = hashinfo;
+}
+EXPORT_SYMBOL(inet_hash_register);
+
+struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex)
+{
+ struct sock *sk = NULL;
+ if (inet_hashes[protocol]) {
+ sk = inet_lookup(inet_hashes[protocol],
+ saddr, sport,
+ daddr, dport,
+ ifindex);
+ }
+ return sk;
+}
+EXPORT_SYMBOL(inet_lookup_proto);
diff -urp davem_orig/net/ipv4/tcp.c kelly/net/ipv4/tcp.c
--- davem_orig/net/ipv4/tcp.c 2006-04-27 00:08:33.000000000 +1000
+++ kelly/net/ipv4/tcp.c 2006-05-05 11:29:18.000000000 +1000
@@ -2173,6 +2173,7 @@ void __init tcp_init(void)
tcp_hashinfo.ehash_size << 1, tcp_hashinfo.bhash_size);
tcp_register_congestion_control(&tcp_reno);
+ inet_hash_register(IPPROTO_TCP, &tcp_hashinfo);
}
EXPORT_SYMBOL(tcp_close);
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-16 1:02 ` Kelly Daly
@ 2006-05-16 1:05 ` David S. Miller
2006-05-16 1:15 ` Kelly Daly
2006-05-16 5:16 ` David S. Miller
2006-05-16 6:19 ` [1/1] netchannel subsystem Evgeniy Polyakov
2 siblings, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-05-16 1:05 UTC (permalink / raw)
To: kelly; +Cc: netdev, rusty
From: Kelly Daly <kelly@au1.ibm.com>
Date: Tue, 16 May 2006 11:02:29 +1000
> On Friday 05 May 2006 12:48, Kelly Daly wrote:
> > done! I will continue with implementation of default netchannel for now.
Some context? It's been a week since we were discussing this,
so I'd like to know what we're looking at here in this patch :)
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-16 1:05 ` David S. Miller
@ 2006-05-16 1:15 ` Kelly Daly
0 siblings, 0 replies; 35+ messages in thread
From: Kelly Daly @ 2006-05-16 1:15 UTC (permalink / raw)
To: David S. Miller; +Cc: kelly, netdev, rusty
On Tuesday 16 May 2006 11:05, David S. Miller wrote:
> From: Kelly Daly <kelly@au1.ibm.com>
> Date: Tue, 16 May 2006 11:02:29 +1000
>
> > On Friday 05 May 2006 12:48, Kelly Daly wrote:
> > > done! I will continue with implementation of default netchannel for
> > > now.
>
> Some context? It's been a week since we were discussing this,
> so I'd like to know what we're looking at here in this patch :)
the implementation of the default netchannel =)
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-16 1:02 ` Kelly Daly
2006-05-16 1:05 ` David S. Miller
@ 2006-05-16 5:16 ` David S. Miller
2006-06-22 2:05 ` Kelly Daly
2006-05-16 6:19 ` [1/1] netchannel subsystem Evgeniy Polyakov
2 siblings, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-05-16 5:16 UTC (permalink / raw)
To: kelly; +Cc: netdev, rusty
From: Kelly Daly <kelly@au1.ibm.com>
Date: Tue, 16 May 2006 11:02:29 +1000
> +/* handles default chan buffers that nobody else wants */
> +static int default_netchannel_thread(void *unused)
> +{
> + wait_queue_t wait;
> + struct netchannel_buftrailer *bp;
> + struct sk_buff *skbp;
> +
> + wait.private = current;
> + wait.func = default_wake_function;
> + INIT_LIST_HEAD(&wait.task_list);
> +
> + add_wait_queue(&default_netchannel_wq, &wait);
> + set_current_state(TASK_UNINTERRUPTIBLE);
> + while (!kthread_should_stop()) {
> + bp = __netchannel_dequeue(&default_netchannel);
> + skbp = skb_netchan_graft(bp, GFP_ATOMIC);
> + netif_receive_skb(skbp);
> + }
> + remove_wait_queue(&default_netchannel_wq, &wait);
> + __set_current_state(TASK_RUNNING);
> + return 0;
> +}
> +
When does this thread ever go to sleep? Seems like it will loop
forever and not block when the default_netchannel queue is empty.
:-)
> + unsigned long dlen = np->netchan_buf_len - np->netchan_buf_offset;
Probably deserves a "netchan_buf_len(bp)" inline in linux/netchannel.h
> diff -urp davem_orig/net/ipv4/inet_hashtables.c kelly/net/ipv4/inet_hashtables.c
> --- davem_orig/net/ipv4/inet_hashtables.c 2006-04-27 00:08:33.000000000 +1000
> +++ kelly/net/ipv4/inet_hashtables.c 2006-05-05 12:45:44.000000000 +1000
The hash table bits look good, just as they did last time :-)
So I'll put this part into my vj-2.6 tree now, thanks.
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-05-16 5:16 ` David S. Miller
@ 2006-06-22 2:05 ` Kelly Daly
2006-06-22 3:58 ` James Morris
2006-07-08 0:05 ` David Miller
0 siblings, 2 replies; 35+ messages in thread
From: Kelly Daly @ 2006-06-22 2:05 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, rusty
> The hash table bits look good, just as they did last time :-)
> So I'll put this part into my vj-2.6 tree now, thanks.
Rockin' - thanks...
Sorry for the massive delay - here's the next attempt.
-------
diff -urp davem/include/linux/netchannel.h kelly_new/include/linux/netchannel.h
--- davem/include/linux/netchannel.h 2006-06-16 15:14:15.000000000 +1000
+++ kelly_new/include/linux/netchannel.h 2006-06-22 11:47:04.000000000 +1000
@@ -19,6 +19,7 @@ struct netchannel {
void (*netchan_callb)(struct netchannel *);
void *netchan_callb_data;
unsigned long netchan_head;
+ wait_queue_head_t wq;
};
extern void netchannel_init(struct netchannel *,
@@ -56,6 +57,11 @@ static inline unsigned char *netchan_buf
return netchan_buf_base(bp) + bp->netchan_buf_offset;
}
+static inline int netchan_data_len(const struct netchannel_buftrailer *bp)
+{
+ return bp->netchan_buf_len - bp->netchan_buf_offset;
+}
+
extern int netchannel_enqueue(struct netchannel *, struct netchannel_buftrailer *);
extern struct netchannel_buftrailer *__netchannel_dequeue(struct netchannel *);
static inline struct netchannel_buftrailer *netchannel_dequeue(struct netchannel *np)
@@ -65,6 +71,7 @@ static inline struct netchannel_buftrail
return __netchannel_dequeue(np);
}
+extern struct netchannel *find_netchannel(const struct netchannel_buftrailer *bp);
extern struct sk_buff *skb_netchan_graft(struct netchannel_buftrailer *, gfp_t);
#endif /* _LINUX_NETCHANNEL_H */
diff -urp davem/include/net/inet_hashtables.h kelly_new/include/net/inet_hashtables.h
--- davem/include/net/inet_hashtables.h 2006-06-16 14:34:20.000000000 +1000
+++ kelly_new/include/net/inet_hashtables.h 2006-06-19 10:42:45.000000000 +1000
@@ -418,4 +418,7 @@ static inline struct sock *inet_lookup(s
extern int inet_hash_connect(struct inet_timewait_death_row *death_row,
struct sock *sk);
+extern void inet_hash_register(u8 proto, struct inet_hashinfo *hashinfo);
+extern struct sock *inet_lookup_proto(u8 protocol, u32 saddr, u16 sport, u32 daddr, u16 dport, int ifindex);
+
#endif /* _INET_HASHTABLES_H */
diff -urp davem/include/net/sock.h kelly_new/include/net/sock.h
--- davem/include/net/sock.h 2006-06-16 15:14:16.000000000 +1000
+++ kelly_new/include/net/sock.h 2006-06-19 10:42:45.000000000 +1000
@@ -196,6 +196,7 @@ struct sock {
unsigned short sk_type;
int sk_rcvbuf;
socket_lock_t sk_lock;
+ struct netchannel *sk_channel;
wait_queue_head_t *sk_sleep;
struct dst_entry *sk_dst_cache;
struct xfrm_policy *sk_policy[2];
diff -urp davem/net/core/dev.c kelly_new/net/core/dev.c
--- davem/net/core/dev.c 2006-06-16 15:14:16.000000000 +1000
+++ kelly_new/net/core/dev.c 2006-06-22 11:45:55.000000000 +1000
@@ -113,9 +113,12 @@
#include <linux/delay.h>
#include <linux/wireless.h>
#include <linux/netchannel.h>
+#include <linux/kthread.h>
+#include <linux/wait.h>
#include <net/iw_handler.h>
#include <asm/current.h>
#include <linux/audit.h>
+#include <net/inet_hashtables.h>
/*
* The list of packet types we will receive (as opposed to discard)
@@ -190,6 +193,8 @@ static inline struct hlist_head *dev_ind
return &dev_index_head[ifindex & ((1<<NETDEV_HASHBITS)-1)];
}
+static struct netchannel default_netchannel;
+
/*
* Our notifier list
*/
@@ -1854,11 +1859,18 @@ softnet_break:
goto out;
}
+void netchannel_wake(struct netchannel *np)
+{
+ wake_up(&np->wq);
+}
+
void netchannel_init(struct netchannel *np,
void (*callb)(struct netchannel *), void *callb_data)
{
memset(np, 0, sizeof(*np));
+ init_waitqueue_head(&np->wq);
+
np->netchan_callb = callb;
np->netchan_callb_data = callb_data;
}
@@ -1912,6 +1924,76 @@ struct netchannel_buftrailer *__netchann
}
EXPORT_SYMBOL_GPL(__netchannel_dequeue);
+/* Find the channel for a packet, or return default channel. */
+struct netchannel *find_netchannel(const struct netchannel_buftrailer *bp)
+{
+ struct sock *sk = NULL;
+ int datalen = netchan_data_len(bp);
+
+ switch (bp->netchan_buf_proto) {
+ case __constant_htons(ETH_P_IP): {
+ struct iphdr *ip = (void *)bp - datalen;
+ int iphl = ip->ihl * 4;
+
+ /* FIXME: Do sanity checks, parse packet. */
+
+ if (datalen >= (iphl + 4) && iphl == sizeof(struct iphdr)) {
+ u16 *ports = (u16 *)(ip + 1);
+ sk = inet_lookup_proto(ip->protocol,
+ ip->saddr, ports[0],
+ ip->daddr, ports[1],
+ bp->netchan_buf_dev->ifindex);
+ }
+ break;
+ }
+ }
+
+ if (sk && sk->sk_channel)
+ return sk->sk_channel;
+ return &default_netchannel;
+}
+EXPORT_SYMBOL_GPL(find_netchannel);
+
+static int sock_add_netchannel(struct sock *sk)
+{
+ struct netchannel *np;
+
+ np = kmalloc(sizeof(struct netchannel), GFP_KERNEL);
+ if (!np)
+ return -ENOMEM;
+ netchannel_init(np, netchannel_wake, (void *)np);
+ sk->sk_channel = np;
+
+ return 0;
+}
+
+/* deal with packets coming to default thread */
+static int netchannel_default_thread(void *unused)
+{
+ struct netchannel *np = &default_netchannel;
+ struct netchannel_buftrailer *nbp;
+ struct sk_buff *skbp;
+ DECLARE_WAITQUEUE(wait, current);
+
+ add_wait_queue(&np->wq, &wait);
+ set_current_state(TASK_UNINTERRUPTIBLE);
+
+ while (!kthread_should_stop()) {
+ while (np->netchan_tail != np->netchan_head) {
+ nbp = netchannel_dequeue(np);
+ skbp = skb_netchan_graft(nbp, GFP_KERNEL);
+ netif_receive_skb(skbp);
+ }
+ schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ }
+
+ remove_wait_queue(&np->wq, &wait);
+ __set_current_state(TASK_RUNNING);
+
+ return 0;
+}
+
static gifconf_func_t * gifconf_list [NPROTO];
/**
@@ -3426,6 +3508,10 @@ static int __init net_dev_init(void)
hotcpu_notifier(dev_cpu_callback, 0);
dst_init();
dev_mcast_init();
+
+ netchannel_init(&default_netchannel, netchannel_wake, (void *)&default_netchannel);
+ kthread_run(netchannel_default_thread, NULL, "nc_def");
+
rc = 0;
out:
return rc;
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-06-22 2:05 ` Kelly Daly
@ 2006-06-22 3:58 ` James Morris
2006-06-22 4:31 ` Arnaldo Carvalho de Melo
2006-06-22 4:36 ` YOSHIFUJI Hideaki / 吉藤英明
2006-07-08 0:05 ` David Miller
1 sibling, 2 replies; 35+ messages in thread
From: James Morris @ 2006-06-22 3:58 UTC (permalink / raw)
To: Kelly Daly; +Cc: David S. Miller, netdev, rusty
On Thu, 22 Jun 2006, Kelly Daly wrote:
> + switch (bp->netchan_buf_proto) {
> + case __constant_htons(ETH_P_IP): {
__constant_htons and friends should not be used in runtime code, only for
data being initialized at compile time.
- James
--
James Morris
<jmorris@namei.org>
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-06-22 3:58 ` James Morris
@ 2006-06-22 4:31 ` Arnaldo Carvalho de Melo
2006-06-22 4:36 ` YOSHIFUJI Hideaki / 吉藤英明
1 sibling, 0 replies; 35+ messages in thread
From: Arnaldo Carvalho de Melo @ 2006-06-22 4:31 UTC (permalink / raw)
To: James Morris; +Cc: Kelly Daly, David S. Miller, netdev, rusty
On 6/22/06, James Morris <jmorris@namei.org> wrote:
> On Thu, 22 Jun 2006, Kelly Daly wrote:
>
> > + switch (bp->netchan_buf_proto) {
> > + case __constant_htons(ETH_P_IP): {
>
> __constant_htons and friends should not be used in runtime code, only for
> data being initialized at compile time.
... because they generate the same code, so, to make source code less
cluttered ... :-)
- Arnaldo
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-06-22 3:58 ` James Morris
2006-06-22 4:31 ` Arnaldo Carvalho de Melo
@ 2006-06-22 4:36 ` YOSHIFUJI Hideaki / 吉藤英明
1 sibling, 0 replies; 35+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-06-22 4:36 UTC (permalink / raw)
To: jmorris; +Cc: kelly, davem, netdev, rusty, yoshfuji
In article <Pine.LNX.4.64.0606212354420.15426@d.namei> (at Wed, 21 Jun 2006 23:58:56 -0400 (EDT)), James Morris <jmorris@namei.org> says:
> On Thu, 22 Jun 2006, Kelly Daly wrote:
>
> > + switch (bp->netchan_buf_proto) {
> > + case __constant_htons(ETH_P_IP): {
>
> __constant_htons and friends should not be used in runtime code, only for
> data being initialized at compile time.
I disagree. For "case," use __constant_{hton,ntoh}{s,l}(), please.
--yoshfuji
* Re: [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch
2006-06-22 2:05 ` Kelly Daly
2006-06-22 3:58 ` James Morris
@ 2006-07-08 0:05 ` David Miller
1 sibling, 0 replies; 35+ messages in thread
From: David Miller @ 2006-07-08 0:05 UTC (permalink / raw)
To: kelly; +Cc: netdev, rusty
From: Kelly Daly <kelly@au1.ibm.com>
Date: Thu, 22 Jun 2006 12:05:35 +1000
> > The hash table bits look good, just as they did last time :-)
> > So I'll put this part into my vj-2.6 tree now, thanks.
> Rockin' - thanks...
>
> Sorry for the massive delay - here's the next attempt.
My review delay was just as bad if not worse :-)
> +static int sock_add_netchannel(struct sock *sk)
> +{
> + struct netchannel *np;
> +
> + np = kmalloc(sizeof(struct netchannel), GFP_KERNEL);
> + if (!np)
> + return -ENOMEM;
> + netchannel_init(np, netchannel_wake, (void *)np);
> + sk->sk_channel = np;
> +
> + return 0;
> +}
This function is unreferenced entirely? It's marked static,
so don't bother including it unless it is being used.
Fix this, give me a good changelog and signed-off-by line
and I'll stick this into the vj-2.6 tree
Thanks!
* [1/1] netchannel subsystem.
2006-05-16 1:02 ` Kelly Daly
2006-05-16 1:05 ` David S. Miller
2006-05-16 5:16 ` David S. Miller
@ 2006-05-16 6:19 ` Evgeniy Polyakov
2006-05-16 6:57 ` David S. Miller
2 siblings, 1 reply; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-16 6:19 UTC (permalink / raw)
To: netdev; +Cc: David S. Miller, netdev, Kelly Daly, rusty
Let me also bring attention to another netchannel implementation.
Some design notes [blog copy-pastes, sorry if they are out of sync
sometimes].
First of all, do not use sockets. Just forget that such an interface
exists.
A new receive-channel abstraction will be created by a special syscall,
which allows a source IP range to be specified at creation time, or it
can be wildcarded to INADDR_ANY.
The interrupt handler (or netif_receive_skb()) will check the skb and
try to find its netchannel; the skb is then linked into the netchannel
receive queue, if memory limits allow it, and that is all. No protocol
processing.
If the hardware can split the header from the data, it will be possible
to create a zero-copy technique: when a netchannel is selected in
netif_receive_skb(), its ->alloc_data() callback would allow data to be
allocated, for example, from a userspace-mapped area.
When the receiver is awakened after data has been added to the
netchannel, it will call the netchannel's ->get_data() callback, which
can remap data pages from the skb into the process's VMA.
A unified network cache will hold information about source and
destination IP addresses (or their hashes for IPv6), about ports, and
about the well-defined IP protocols.
It is built as a two-dimensional array of order 8 by default (i.e. 64k
entries) with an XOR hash. A comparison of the Jenkins hash with the
XOR hash used in the TCP socket selection code can be found at [1].
The main problem here is the routing cache. Linux is great at routing -
it supports several cache algorithms, which are extremely fast in the
appropriate environments - but the main question is "do we need routing
for netchannels?". So it looks like a netchannel either must include
that route interface or, if it is designed as a userspace-only
communication channel, implement its own trivial pseudo-route
algorithm.
Most drivers do not set skb->pkt_type, so it stays at the default
PACKET_HOST, which means that every packet received by an interface
must be checked to be valid for the system. That is a point in favour
of using the standard Linux routing; but from the other point of view,
it is quite easy to create a callback for IP address setup which would
update a netchannel routing table.
No one needs an excuse to rewrite anything, so I will create my own
cache for netchannels, which will be updated when IP addresses change.
Ok, the basic interfaces have been created:
* a unified cache to store netchannels (IPv4, and a stub for IPv6
hashes, TCP and UDP)
* an skb queueing mechanism
* netchannel creation/removal commands
* a netchannel callback to allocate/free pages (for example, to get
data from a mapped area) not only from the SLAB cache
There are only a couple of things left (maybe done today):
* create the netchannel callback to move/copy data to userspace
* create the read-data command
* test the new netchannel interface :) (it boots and works ok with the
netchannel interface turned on for usual sockets)
If you have read all of the above, here is the link to the hash
comparison [1].
[1]. Comparison of the Jenkins hash with the XOR hash used in the TCP
socket selection code:
http://tservice.net.ru/~s0mbre/blog/2006/05/14#2006_05_14
And proof-of-concept patch itself.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index f48bef1..b69d7d3 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -315,3 +315,4 @@ ENTRY(sys_call_table)
.long sys_splice
.long sys_sync_file_range
.long sys_tee /* 315 */
+ .long sys_netchannel_control
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5a92fed..fdfb997 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -696,4 +696,5 @@ ia32_sys_call_table:
.quad sys_sync_file_range
.quad sys_tee
.quad compat_sys_vmsplice
+ .quad sys_netchannel_control
ia32_syscall_end:
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index feb77cb..08c230e 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -617,8 +617,10 @@ __SYSCALL(__NR_tee, sys_tee)
__SYSCALL(__NR_sync_file_range, sys_sync_file_range)
#define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
+#define __NR_netchannel_control 279
+__SYSCALL(__NR_netchannel_control, sys_netchannel_control)
-#define __NR_syscall_max __NR_vmsplice
+#define __NR_syscall_max __NR_netchannel_control
#ifndef __NO_STUBS
diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h
new file mode 100644
index 0000000..4e65c9b
--- /dev/null
+++ b/include/linux/netchannel.h
@@ -0,0 +1,73 @@
+/*
+ * netchannel.h
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __NETCHANNEL_H
+#define __NETCHANNEL_H
+
+#include <linux/types.h>
+
+enum netchannel_commands {
+ NETCHANNEL_CREATE = 0,
+ NETCHANNEL_REMOVE,
+ NETCHANNEL_BIND,
+ NETCHANNEL_READ,
+};
+
+struct unetchannel
+{
+ __u32 src, dst; /* source/destination hashes */
+ __u16 sport, dport; /* source/destination ports */
+ __u8 proto; /* IP protocol number */
+ __u8 listen;
+ __u8 reserved[2];
+};
+
+struct unetchannel_control
+{
+ struct unetchannel unc;
+ __u32 cmd;
+ __u32 len;
+};
+
+#ifdef __KERNEL__
+
+struct netchannel
+{
+ struct hlist_node node;
+ atomic_t refcnt;
+ struct rcu_head rcu_head;
+ struct unetchannel unc;
+ unsigned long hit;
+
+ struct page * (*nc_alloc_page)(unsigned int size);
+ void (*nc_free_page)(struct page *page);
+
+ struct sk_buff_head list;
+};
+
+struct netchannel_cache_head
+{
+ struct hlist_head head;
+ struct mutex mutex;
+};
+
+#endif /* __KERNEL__ */
+#endif /* __NETCHANNEL_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a461b51..9924911 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -684,6 +684,15 @@ extern void dev_queue_xmit_nit(struct s
extern void dev_init(void);
+#ifdef CONFIG_NETCHANNEL
+extern int netchannel_recv(struct sk_buff *skb);
+#else
+static inline int netchannel_recv(struct sk_buff *skb)
+{
+ return -1;
+}
+#endif
+
extern int netdev_nit;
extern int netdev_budget;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f8f2347..accd00b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -301,6 +301,7 @@ struct sk_buff {
* Handling routines are only of interest to the kernel
*/
#include <linux/slab.h>
+#include <linux/netchannel.h>
#include <asm/system.h>
@@ -314,6 +315,17 @@ static inline struct sk_buff *alloc_skb(
return __alloc_skb(size, priority, 0);
}
+#ifdef CONFIG_NETCHANNEL
+extern struct sk_buff *netchannel_alloc(struct unetchannel *unc, unsigned int header_size,
+ unsigned int total_size, gfp_t gfp_mask);
+#else
+static inline struct sk_buff *netchannel_alloc(struct unetchannel *unc, unsigned int header_size,
+ unsigned int total_size, gfp_t gfp_mask)
+{
+ return NULL;
+}
+#endif
+
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3996960..296929d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -582,4 +582,6 @@ asmlinkage long sys_tee(int fdin, int fd
asmlinkage long sys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
unsigned int flags);
+asmlinkage int sys_netchannel_control(void __user *arg);
+
#endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5433195..1747fc3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -132,3 +132,5 @@ cond_syscall(sys_mincore);
cond_syscall(sys_madvise);
cond_syscall(sys_mremap);
cond_syscall(sys_remap_file_pages);
+
+cond_syscall(sys_netchannel_control);
diff --git a/net/Kconfig b/net/Kconfig
index 4193cdc..465e37b 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -66,6 +66,14 @@ source "net/ipv6/Kconfig"
endif # if INET
+config NETCHANNEL
+ bool "Network channels"
+ ---help---
+ Network channels are a peer-to-peer abstraction which allows the
+ creation of high-performance communication paths.
+ The main advantages are a unified address cache, protocol processing
+ moved to userspace, zero-copy receive support and other interesting
+ features.
+
menuconfig NETFILTER
bool "Network packet filtering (replaces ipchains)"
---help---
diff --git a/net/core/Makefile b/net/core/Makefile
index 79fe12c..7119812 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_NET_DIVERT) += dv.o
obj-$(CONFIG_NET_PKTGEN) += pktgen.o
obj-$(CONFIG_WIRELESS_EXT) += wireless.o
obj-$(CONFIG_NETPOLL) += netpoll.o
+obj-$(CONFIG_NETCHANNEL) += netchannel.o
diff --git a/net/core/dev.c b/net/core/dev.c
index 9ab3cfa..2721111 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1712,6 +1712,10 @@ int netif_receive_skb(struct sk_buff *sk
}
}
+ ret = netchannel_recv(skb);
+ if (!ret)
+ goto out;
+
#ifdef CONFIG_NET_CLS_ACT
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
diff --git a/net/core/netchannel.c b/net/core/netchannel.c
new file mode 100644
index 0000000..5b3b2bb
--- /dev/null
+++ b/net/core/netchannel.c
@@ -0,0 +1,583 @@
+/*
+ * netchannel.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <linux/linkage.h>
+#include <linux/notifier.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/errno.h>
+
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <net/addrconf.h>
+
+#include <asm/uaccess.h>
+
+static unsigned int netchannel_hash_order = 8;
+static struct netchannel_cache_head ***netchannel_hash_table;
+static kmem_cache_t *netchannel_cache;
+
+static int netchannel_inetaddr_notifier_call(struct notifier_block *, unsigned long, void *);
+static struct notifier_block netchannel_inetaddr_notifier = {
+ .notifier_call = &netchannel_inetaddr_notifier_call
+};
+
+#ifdef CONFIG_IPV6
+static int netchannel_inet6addr_notifier_call(struct notifier_block *, unsigned long, void *);
+static struct notifier_block netchannel_inet6addr_notifier = {
+ .notifier_call = &netchannel_inet6addr_notifier_call
+};
+#endif
+
+static inline unsigned int netchannel_hash(struct unetchannel *unc)
+{
+ unsigned int h = (unc->dst ^ unc->dport) ^ (unc->src ^ unc->sport);
+ h ^= h >> 16;
+ h ^= h >> 8;
+ h ^= unc->proto;
+ return h & ((1 << 2*netchannel_hash_order) - 1);
+}
+
+static inline void netchannel_convert_hash(unsigned int hash, unsigned int *col, unsigned int *row)
+{
+ *row = hash & ((1 << netchannel_hash_order) - 1);
+ *col = (hash >> netchannel_hash_order) & ((1 << netchannel_hash_order) - 1);
+}
+
+static struct netchannel_cache_head *netchannel_bucket(struct unetchannel *unc)
+{
+ unsigned int hash = netchannel_hash(unc);
+ unsigned int col, row;
+
+ netchannel_convert_hash(hash, &col, &row);
+ return netchannel_hash_table[col][row];
+}
+
+static inline int netchannel_hash_equal_full(struct unetchannel *unc1, struct unetchannel *unc2)
+{
+ return (unc1->dport == unc2->dport) && (unc1->dst == unc2->dst) &&
+ (unc1->sport == unc2->sport) && (unc1->src == unc2->src) &&
+ (unc1->proto == unc2->proto);
+}
+
+static inline int netchannel_hash_equal_dest(struct unetchannel *unc1, struct unetchannel *unc2)
+{
+ return ((unc1->dport == unc2->dport) && (unc1->dst == unc2->dst) && (unc1->proto == unc2->proto));
+}
+
+static struct netchannel *netchannel_check_dest(struct unetchannel *unc, struct netchannel_cache_head *bucket)
+{
+ struct netchannel *nc;
+ struct hlist_node *node;
+ int found = 0;
+
+ hlist_for_each_entry_rcu(nc, node, &bucket->head, node) {
+ if (netchannel_hash_equal_dest(&nc->unc, unc)) {
+ found = 1;
+ break;
+ }
+ }
+
+ return (found)?nc:NULL;
+}
+
+static struct netchannel *netchannel_check_full(struct unetchannel *unc, struct netchannel_cache_head *bucket)
+{
+ struct netchannel *nc;
+ struct hlist_node *node;
+ int found = 0;
+
+ hlist_for_each_entry_rcu(nc, node, &bucket->head, node) {
+ if (netchannel_hash_equal_full(&nc->unc, unc)) {
+ found = 1;
+ break;
+ }
+ }
+
+ return (found)?nc:NULL;
+}
+
+static void netchannel_free_rcu(struct rcu_head *rcu)
+{
+ struct netchannel *nc = container_of(rcu, struct netchannel, rcu_head);
+
+ kmem_cache_free(netchannel_cache, nc);
+}
+
+static inline void netchannel_get(struct netchannel *nc)
+{
+ atomic_inc(&nc->refcnt);
+}
+
+static inline void netchannel_put(struct netchannel *nc)
+{
+ if (atomic_dec_and_test(&nc->refcnt))
+ call_rcu(&nc->rcu_head, &netchannel_free_rcu);
+}
+
+
+static int netchannel_convert_skb_ipv6(struct sk_buff *skb, struct unetchannel *unc)
+{
+ /*
+ * Hash IP addresses into src/dst. Setup TCP/UDP ports.
+ * Not supported yet.
+ */
+ return -1;
+}
+
+static int netchannel_convert_skb_ipv4(struct sk_buff *skb, struct unetchannel *unc)
+{
+ struct iphdr *iph;
+ u32 len;
+ struct tcphdr *th;
+ struct udphdr *uh;
+
+ if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+ goto inhdr_error;
+
+ iph = skb->nh.iph;
+
+ if (iph->ihl < 5 || iph->version != 4)
+ goto inhdr_error;
+
+ if (!pskb_may_pull(skb, iph->ihl*4))
+ goto inhdr_error;
+
+ iph = skb->nh.iph;
+
+ if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
+ goto inhdr_error;
+
+ len = ntohs(iph->tot_len);
+ if (skb->len < len || len < (iph->ihl*4))
+ goto inhdr_error;
+
+ unc->dst = iph->daddr;
+ unc->src = iph->saddr;
+ unc->proto = iph->protocol;
+
+ len = skb->len;
+
+ switch (unc->proto) {
+ case IPPROTO_TCP:
+ if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
+ goto inhdr_error;
+ th = skb->h.th;
+
+ if (th->doff < sizeof(struct tcphdr) / 4)
+ goto inhdr_error;
+
+ unc->dport = th->dest;
+ unc->sport = th->source;
+ break;
+ case IPPROTO_UDP:
+ if (!pskb_may_pull(skb, sizeof(struct udphdr)))
+ goto inhdr_error;
+ uh = skb->h.uh;
+
+ if (ntohs(uh->len) > len || ntohs(uh->len) < sizeof(*uh))
+ goto inhdr_error;
+
+ unc->dport = uh->dest;
+ unc->sport = uh->source;
+ break;
+ default:
+ goto inhdr_error;
+ }
+
+ return 0;
+
+inhdr_error:
+ return -1;
+}
+
+static int netchannel_convert_skb(struct sk_buff *skb, struct unetchannel *unc)
+{
+ if (skb->pkt_type == PACKET_OTHERHOST)
+ return -1;
+
+ switch (skb->protocol) {
+ case __constant_htons(ETH_P_IP):
+ return netchannel_convert_skb_ipv4(skb, unc);
+ case __constant_htons(ETH_P_IPV6):
+ return netchannel_convert_skb_ipv6(skb, unc);
+ default:
+ return -1;
+ }
+}
+
+/*
+ * By design netchannels allow data to be "allocated"
+ * not only from the SLAB cache, but also from a mapped area
+ * or from the VFS cache (requires process context or preallocation).
+ */
+struct sk_buff *netchannel_alloc(struct unetchannel *unc, unsigned int header_size,
+ unsigned int total_size, gfp_t gfp_mask)
+{
+ struct netchannel *nc;
+ struct netchannel_cache_head *bucket;
+ int err;
+ struct sk_buff *skb = NULL;
+ unsigned int size, pnum, i;
+
+ skb = alloc_skb(header_size, gfp_mask);
+ if (!skb)
+ return NULL;
+
+ rcu_read_lock();
+ bucket = netchannel_bucket(unc);
+ nc = netchannel_check_full(unc, bucket);
+ if (!nc) {
+ err = -ENODEV;
+ goto err_out_free_skb;
+ }
+
+ if (!nc->nc_alloc_page || !nc->nc_free_page) {
+ err = -EINVAL;
+ goto err_out_free_skb;
+ }
+
+ netchannel_get(nc);
+
+ size = total_size - header_size;
+ pnum = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+ for (i=0; i<pnum; ++i) {
+ unsigned int cs = min_t(unsigned int, PAGE_SIZE, size);
+ struct page *page;
+
+ page = nc->nc_alloc_page(cs);
+ if (!page)
+ break;
+
+ skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags, page, 0, cs);
+
+ skb->len += cs;
+ skb->data_len += cs;
+ skb->truesize += cs;
+
+ size -= cs;
+ }
+
+ if (i < pnum) {
+ pnum = i;
+ err = -ENOMEM;
+ goto err_out_free_frags;
+ }
+
+ rcu_read_unlock();
+
+ return skb;
+
+err_out_free_frags:
+ for (i=0; i<pnum; ++i) {
+ unsigned int cs = skb_shinfo(skb)->frags[i].size;
+ struct page *page = skb_shinfo(skb)->frags[i].page;
+
+ nc->nc_free_page(page);
+
+ skb->len -= cs;
+ skb->data_len -= cs;
+ skb->truesize -= cs;
+ }
+
+err_out_free_skb:
+ kfree_skb(skb);
+ return NULL;
+}
+
+int netchannel_recv(struct sk_buff *skb)
+{
+ struct netchannel *nc;
+ struct unetchannel unc;
+ struct netchannel_cache_head *bucket;
+ int err;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
+ rcu_read_lock();
+
+ err = netchannel_convert_skb(skb, &unc);
+ if (err)
+ goto unlock;
+
+ bucket = netchannel_bucket(&unc);
+ nc = netchannel_check_full(&unc, bucket);
+ if (!nc) {
+ err = -ENODEV;
+ goto unlock;
+ }
+
+ nc->hit++;
+
+ skb_queue_tail(&nc->list, skb);
+
+unlock:
+ rcu_read_unlock();
+ return err;
+}
+
+static int netchannel_create(struct unetchannel *unc)
+{
+ struct netchannel *nc;
+ int err = -ENOMEM;
+ struct netchannel_cache_head *bucket;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
+ bucket = netchannel_bucket(unc);
+
+ mutex_lock(&bucket->mutex);
+
+ if (netchannel_check_full(unc, bucket)) {
+ err = -EEXIST;
+ goto out_unlock;
+ }
+
+ if (unc->listen && netchannel_check_dest(unc, bucket)) {
+ err = -EEXIST;
+ goto out_unlock;
+ }
+
+ nc = kmem_cache_alloc(netchannel_cache, GFP_KERNEL);
+ if (!nc)
+ goto out_exit;
+
+ memset(nc, 0, sizeof(struct netchannel));
+
+ nc->hit = 0;
+ skb_queue_head_init(&nc->list);
+ atomic_set(&nc->refcnt, 1);
+ memcpy(&nc->unc, unc, sizeof(struct unetchannel));
+
+ hlist_add_head_rcu(&nc->node, &bucket->head);
+
+out_unlock:
+ mutex_unlock(&bucket->mutex);
+out_exit:
+ printk("netchannel: create %u.%u.%u.%u:%d -> %u.%u.%u.%u:%d, proto: %d, err: %d.\n",
+ NIPQUAD(unc->src), unc->sport, NIPQUAD(unc->dst), unc->dport, unc->proto, err);
+
+ return err;
+}
+
+static int netchannel_remove(struct unetchannel *unc)
+{
+ struct netchannel *nc;
+ int err = -ENODEV;
+ struct netchannel_cache_head *bucket;
+ unsigned long hit = 0;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
+ bucket = netchannel_bucket(unc);
+
+ mutex_lock(&bucket->mutex);
+
+ nc = netchannel_check_full(unc, bucket);
+ if (!nc)
+ nc = netchannel_check_dest(unc, bucket);
+
+ if (!nc)
+ goto out_unlock;
+
+ hlist_del_rcu(&nc->node);
+ hit = nc->hit;
+
+ netchannel_put(nc);
+ err = 0;
+
+out_unlock:
+ mutex_unlock(&bucket->mutex);
+ printk("netchannel: remove %u.%u.%u.%u:%d -> %u.%u.%u.%u:%d, proto: %d, err: %d, hit: %lu.\n",
+ NIPQUAD(unc->src), unc->sport, NIPQUAD(unc->dst), unc->dport, unc->proto, err,
+ hit);
+ return err;
+}
+
+asmlinkage int sys_netchannel_control(void __user *arg)
+{
+ struct unetchannel_control ctl;
+ int ret;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
+ if (copy_from_user(&ctl, arg, sizeof(struct unetchannel_control)))
+ return -EFAULT;
+
+ switch (ctl.cmd) {
+ case NETCHANNEL_CREATE:
+ case NETCHANNEL_BIND:
+ ret = netchannel_create(&ctl.unc);
+ break;
+ case NETCHANNEL_REMOVE:
+ ret = netchannel_remove(&ctl.unc);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ return ret;
+}
+
+static inline void netchannel_dump_addr(struct in_ifaddr *ifa, char *str)
+{
+ printk("netchannel: %s %u.%u.%u.%u/%u.%u.%u.%u\n", str, NIPQUAD(ifa->ifa_local), NIPQUAD(ifa->ifa_mask));
+}
+
+static int netchannel_inetaddr_notifier_call(struct notifier_block *this, unsigned long event, void *ptr)
+{
+ struct in_ifaddr *ifa = ptr;
+
+ printk("netchannel: inet event=%lx, ifa=%p.\n", event, ifa);
+
+ switch (event) {
+ case NETDEV_UP:
+ netchannel_dump_addr(ifa, "add");
+ break;
+ case NETDEV_DOWN:
+ netchannel_dump_addr(ifa, "del");
+ break;
+ default:
+ netchannel_dump_addr(ifa, "unk");
+ break;
+ }
+
+ return NOTIFY_DONE;
+}
+
+#ifdef CONFIG_IPV6
+static int netchannel_inet6addr_notifier_call(struct notifier_block *this, unsigned long event, void *ptr)
+{
+ struct inet6_ifaddr *ifa = ptr;
+
+ printk("netchannel: inet6 event=%lx, ifa=%p.\n", event, ifa);
+ return NOTIFY_DONE;
+}
+#endif
+
+static int __init netchannel_init(void)
+{
+ unsigned int i, j, size;
+ int err = -ENOMEM;
+
+ size = (1 << netchannel_hash_order);
+
+ netchannel_hash_table = kzalloc(size * sizeof(void *), GFP_KERNEL);
+ if (!netchannel_hash_table)
+ goto err_out_exit;
+
+ for (i=0; i<size; ++i) {
+ struct netchannel_cache_head **col;
+
+ col = kzalloc(size * sizeof(void *), GFP_KERNEL);
+ if (!col)
+ break;
+
+ for (j=0; j<size; ++j) {
+ struct netchannel_cache_head *head;
+
+ head = kzalloc(sizeof(struct netchannel_cache_head), GFP_KERNEL);
+ if (!head)
+ break;
+
+ INIT_HLIST_HEAD(&head->head);
+ mutex_init(&head->mutex);
+
+ col[j] = head;
+ }
+
+ if (j < size) {
+ while (j > 0)
+ kfree(col[--j]);
+ kfree(col);
+ break;
+ }
+
+ netchannel_hash_table[i] = col;
+ }
+
+ if (i<size) {
+ size = i;
+ goto err_out_free;
+ }
+
+ netchannel_cache = kmem_cache_create("netchannel", sizeof(struct netchannel), 0, 0,
+ NULL, NULL);
+ if (!netchannel_cache)
+ goto err_out_free;
+
+ register_inetaddr_notifier(&netchannel_inetaddr_notifier);
+#ifdef CONFIG_IPV6
+ register_inet6addr_notifier(&netchannel_inet6addr_notifier);
+#endif
+
+ printk("netchannel: Created %u order two-dimensional hash table.\n",
+ netchannel_hash_order);
+
+ return 0;
+
+err_out_free:
+ for (i=0; i<size; ++i) {
+ for (j=0; j<(1 << netchannel_hash_order); ++j)
+ kfree(netchannel_hash_table[i][j]);
+ kfree(netchannel_hash_table[i]);
+ }
+ kfree(netchannel_hash_table);
+err_out_exit:
+
+ printk("netchannel: Failed to create %u order two-dimensional hash table.\n",
+ netchannel_hash_order);
+ return err;
+}
+
+static void __exit netchannel_exit(void)
+{
+ unsigned int i, j;
+
+ unregister_inetaddr_notifier(&netchannel_inetaddr_notifier);
+#ifdef CONFIG_IPV6
+ unregister_inet6addr_notifier(&netchannel_inet6addr_notifier);
+#endif
+ kmem_cache_destroy(netchannel_cache);
+
+ for (i=0; i<(1 << netchannel_hash_order); ++i) {
+ for (j=0; j<(1 << netchannel_hash_order); ++j)
+ kfree(netchannel_hash_table[i][j]);
+ kfree(netchannel_hash_table[i]);
+ }
+ kfree(netchannel_hash_table);
+}
+
+late_initcall(netchannel_init);
--
Evgeniy Polyakov
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: [1/1] netchannel subsystem.
2006-05-16 6:19 ` [1/1] netchannel subsystem Evgeniy Polyakov
@ 2006-05-16 6:57 ` David S. Miller
2006-05-16 6:59 ` Evgeniy Polyakov
2006-05-16 17:34 ` [1/1] Netchannel subsystem Evgeniy Polyakov
0 siblings, 2 replies; 35+ messages in thread
From: David S. Miller @ 2006-05-16 6:57 UTC (permalink / raw)
To: johnpol; +Cc: netdev, kelly, rusty
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Tue, 16 May 2006 10:19:09 +0400
> +static int netchannel_convert_skb_ipv4(struct sk_buff *skb, struct unetchannel *unc)
> +{
...
> + switch (unc->proto) {
> + case IPPROTO_TCP:
...
> + case IPPROTO_UDP:
...
Why do people write code like this?
Port location is protocol agnostic, there are always 2
16-bit ports at beginning of header without exception.
Without this, ICMP would be useless :-)
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [1/1] netchannel subsystem.
2006-05-16 6:57 ` David S. Miller
@ 2006-05-16 6:59 ` Evgeniy Polyakov
2006-05-16 7:06 ` David S. Miller
2006-05-16 7:07 ` Evgeniy Polyakov
2006-05-16 17:34 ` [1/1] Netchannel subsystem Evgeniy Polyakov
1 sibling, 2 replies; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-16 6:59 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, kelly, rusty
On Mon, May 15, 2006 at 11:57:12PM -0700, David S. Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Tue, 16 May 2006 10:19:09 +0400
>
> > +static int netchannel_convert_skb_ipv4(struct sk_buff *skb, struct unetchannel *unc)
> > +{
> ...
> > + switch (unc->proto) {
> > + case IPPROTO_TCP:
> ...
> > + case IPPROTO_UDP:
> ...
>
> Why do people write code like this?
>
> Port location is protocol agnostic, there are always 2
> 16-bit ports at beginning of header without exception.
>
> Without this, ICMP would be useless :-)
And what if we use ESP, which would place its hashed sequence number as
the port?
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [1/1] netchannel subsystem.
2006-05-16 6:59 ` Evgeniy Polyakov
@ 2006-05-16 7:06 ` David S. Miller
2006-05-16 7:15 ` Evgeniy Polyakov
2006-05-16 7:07 ` Evgeniy Polyakov
1 sibling, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-05-16 7:06 UTC (permalink / raw)
To: johnpol; +Cc: netdev, kelly, rusty
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Tue, 16 May 2006 10:59:23 +0400
> And what if we use ESP which would place it's hashed sequence number as
> port?
If it makes you happy put something like:
case TCP:
case UDP:
case SCTP:
case DCCP:
...
default:
...
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [1/1] netchannel subsystem.
2006-05-16 7:06 ` David S. Miller
@ 2006-05-16 7:15 ` Evgeniy Polyakov
0 siblings, 0 replies; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-16 7:15 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, kelly, rusty
On Tue, May 16, 2006 at 12:06:31AM -0700, David S. Miller (davem@davemloft.net) wrote:
> From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> Date: Tue, 16 May 2006 10:59:23 +0400
>
> > And what if we use ESP which would place it's hashed sequence number as
> > port?
>
> If it makes you happy put something like:
>
> case TCP:
> case UDP:
> case SCTP:
> case DCCP:
> ...
>
> default:
> ...
That makes sense, but on the other hand we lose the ability to drop a
packet early while checking its header for a correct checksum, sizes and
so on...
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: [1/1] netchannel subsystem.
2006-05-16 6:59 ` Evgeniy Polyakov
2006-05-16 7:06 ` David S. Miller
@ 2006-05-16 7:07 ` Evgeniy Polyakov
1 sibling, 0 replies; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-16 7:07 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, kelly, rusty
On Tue, May 16, 2006 at 10:59:23AM +0400, Evgeniy Polyakov (johnpol@2ka.mipt.ru) wrote:
> On Mon, May 15, 2006 at 11:57:12PM -0700, David S. Miller (davem@davemloft.net) wrote:
> > From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
> > Date: Tue, 16 May 2006 10:19:09 +0400
> >
> > > +static int netchannel_convert_skb_ipv4(struct sk_buff *skb, struct unetchannel *unc)
> > > +{
> > ...
> > > + switch (unc->proto) {
> > > + case IPPROTO_TCP:
> > ...
> > > + case IPPROTO_UDP:
> > ...
> >
> > Why do people write code like this?
> >
> > Port location is protocol agnostic, there are always 2
> > 16-bit ports at beginning of header without exception.
> >
> > Without this, ICMP would be useless :-)
>
> And what if we use ESP which would place it's hashed sequence number as
> port?
Actually it should be one big hash, no matter whether it is ipv4/tcp or
ipv6/esp; src/dst/sport/dport/proto were introduced just to allow easier
debugging in an ipv4 environment.
Attached is a patch for the userspace copy:
--- /tmp/netchannel.1 2006-05-16 10:33:17.000000000 +0400
+++ /tmp/netchannel.2 2006-05-16 11:23:44.000000000 +0400
@@ -35,10 +35,10 @@
diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h
@@ -100,6 +100,7 @@
+
+ struct page * (*nc_alloc_page)(unsigned int size);
+ void (*nc_free_page)(struct page *page);
++ int (*nc_read_data)(struct netchannel *, unsigned int *len, void __user *arg);
+
+ struct sk_buff_head list;
+};
@@ -228,10 +229,10 @@
ret = deliver_skb(skb, pt_prev, orig_dev);
diff --git a/net/core/netchannel.c b/net/core/netchannel.c
--- /dev/null
+++ b/net/core/netchannel.c
-@@ -0,0 +1,583 @@
+@@ -0,0 +1,649 @@
+/*
+ * netchannel.c
+ *
@@ -578,6 +579,40 @@
+ return err;
+}
+
++/*
++ * Actually it should be something like recvmsg().
++ */
++static int netchannel_copy_to_user(struct netchannel *nc, unsigned int *len, void __user *arg)
++{
++ unsigned int copied;
++ struct sk_buff *skb;
++ struct iovec to;
++ int err = -EINVAL;
++
++ to.iov_base = arg;
++ to.iov_len = *len;
++
++ skb = skb_dequeue(&nc->list);
++ if (!skb)
++ return -EAGAIN;
++
++ copied = skb->len;
++ if (copied > *len)
++ copied = *len;
++
++ if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
++ err = skb_copy_datagram_iovec(skb, 0, &to, copied);
++ } else {
++ err = skb_copy_and_csum_datagram_iovec(skb,0, &to);
++ }
++
++ *len = (err == 0)?copied:0;
++
++ kfree_skb(skb);
++
++ return err;
++}
++
+static int netchannel_create(struct unetchannel *unc)
+{
+ struct netchannel *nc;
@@ -612,6 +647,8 @@
+ atomic_set(&nc->refcnt, 1);
+ memcpy(&nc->unc, unc, sizeof(struct unetchannel));
+
++ nc->nc_read_data = &netchannel_copy_to_user;
++
+ hlist_add_head_rcu(&nc->node, &bucket->head);
+
+out_unlock:
@@ -658,6 +695,30 @@
+ return 0;
+}
+
++static int netchannel_recv_data(struct unetchannel_control *ctl, void __user *data)
++{
++ int ret = -ENODEV;
++ struct netchannel_cache_head *bucket;
++ struct netchannel *nc;
++
++ bucket = netchannel_bucket(&ctl->unc);
++
++ mutex_lock(&bucket->mutex);
++
++ nc = netchannel_check_full(&ctl->unc, bucket);
++ if (!nc)
++ nc = netchannel_check_dest(&ctl->unc, bucket);
++
++ if (!nc)
++ goto out_unlock;
++
++ ret = nc->nc_read_data(nc, &ctl->len, data);
++
++out_unlock:
++ mutex_unlock(&bucket->mutex);
++ return ret;
++}
++
+asmlinkage int sys_netchannel_control(void __user *arg)
+{
+ struct unetchannel_control ctl;
@@ -677,10 +738,16 @@
+ case NETCHANNEL_REMOVE:
+ ret = netchannel_remove(&ctl.unc);
+ break;
++ case NETCHANNEL_READ:
++ ret = netchannel_recv_data(&ctl, arg + sizeof(struct unetchannel_control));
++ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
++
++ if (copy_to_user(arg, &ctl, sizeof(struct unetchannel_control)))
++ return -EFAULT;
+
+ return ret;
+}
--
Evgeniy Polyakov
^ permalink raw reply [flat|nested] 35+ messages in thread
* [1/1] Netchannel subsystem.
2006-05-16 6:57 ` David S. Miller
2006-05-16 6:59 ` Evgeniy Polyakov
@ 2006-05-16 17:34 ` Evgeniy Polyakov
2006-05-18 10:34 ` Netchannel subsystem update Evgeniy Polyakov
1 sibling, 1 reply; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-16 17:34 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, kelly, rusty
Receiving support.
As proof-of-concept code I created a simple copy_to_user()-based
data-read callback. The next step is to implement netchannel data
allocation callbacks that get data from a mapped userspace area, and to
make the read callback similar to ->recvmsg() so that all protocol
processing happens in userspace (TCP should then start working; only the
UDP case works now), if there is some interest in it. Patch attached.
There is a brief description of the netchannel design and implementation,
along with patches and a userspace utility, at the project's homepage [1].
Thank you.
1. Netchannel project homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=netchannel
Receiving netchannel implementation.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
diff --git a/arch/i386/kernel/syscall_table.S b/arch/i386/kernel/syscall_table.S
index f48bef1..7a4a758 100644
--- a/arch/i386/kernel/syscall_table.S
+++ b/arch/i386/kernel/syscall_table.S
@@ -315,3 +315,5 @@ ENTRY(sys_call_table)
.long sys_splice
.long sys_sync_file_range
.long sys_tee /* 315 */
+ .long sys_vmsplice
+ .long sys_netchannel_control
diff --git a/arch/x86_64/ia32/ia32entry.S b/arch/x86_64/ia32/ia32entry.S
index 5a92fed..fdfb997 100644
--- a/arch/x86_64/ia32/ia32entry.S
+++ b/arch/x86_64/ia32/ia32entry.S
@@ -696,4 +696,5 @@ ia32_sys_call_table:
.quad sys_sync_file_range
.quad sys_tee
.quad compat_sys_vmsplice
+ .quad sys_netchannel_control
ia32_syscall_end:
diff --git a/include/asm-i386/unistd.h b/include/asm-i386/unistd.h
index eb4b152..777cd85 100644
--- a/include/asm-i386/unistd.h
+++ b/include/asm-i386/unistd.h
@@ -322,8 +322,9 @@
#define __NR_sync_file_range 314
#define __NR_tee 315
#define __NR_vmsplice 316
+#define __NR_netchannel_control 317
-#define NR_syscalls 317
+#define NR_syscalls 318
/*
* user-visible error numbers are in the range -1 - -128: see
diff --git a/include/asm-x86_64/unistd.h b/include/asm-x86_64/unistd.h
index feb77cb..08c230e 100644
--- a/include/asm-x86_64/unistd.h
+++ b/include/asm-x86_64/unistd.h
@@ -617,8 +617,10 @@ __SYSCALL(__NR_tee, sys_tee)
__SYSCALL(__NR_sync_file_range, sys_sync_file_range)
#define __NR_vmsplice 278
__SYSCALL(__NR_vmsplice, sys_vmsplice)
+#define __NR_netchannel_control 279
+__SYSCALL(__NR_netchannel_control, sys_netchannel_control)
-#define __NR_syscall_max __NR_vmsplice
+#define __NR_syscall_max __NR_netchannel_control
#ifndef __NO_STUBS
diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h
new file mode 100644
index 0000000..e87a148
--- /dev/null
+++ b/include/linux/netchannel.h
@@ -0,0 +1,75 @@
+/*
+ * netchannel.h
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#ifndef __NETCHANNEL_H
+#define __NETCHANNEL_H
+
+#include <linux/types.h>
+
+enum netchannel_commands {
+ NETCHANNEL_CREATE = 0,
+ NETCHANNEL_REMOVE,
+ NETCHANNEL_BIND,
+ NETCHANNEL_READ,
+ NETCHANNEL_DUMP,
+};
+
+struct unetchannel
+{
+ __u32 src, dst; /* source/destination hashes */
+ __u16 sport, dport; /* source/destination ports */
+ __u8 proto; /* IP protocol number */
+ __u8 listen;
+ __u8 reserved[2];
+};
+
+struct unetchannel_control
+{
+ struct unetchannel unc;
+ __u32 cmd;
+ __u32 len;
+};
+
+#ifdef __KERNEL__
+
+struct netchannel
+{
+ struct hlist_node node;
+ atomic_t refcnt;
+ struct rcu_head rcu_head;
+ struct unetchannel unc;
+ unsigned long hit;
+
+ struct page * (*nc_alloc_page)(unsigned int size);
+ void (*nc_free_page)(struct page *page);
+ int (*nc_read_data)(struct netchannel *, unsigned int *len, void __user *arg);
+
+ struct sk_buff_head list;
+};
+
+struct netchannel_cache_head
+{
+ struct hlist_head head;
+ struct mutex mutex;
+};
+
+#endif /* __KERNEL__ */
+#endif /* __NETCHANNEL_H */
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a461b51..9924911 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -684,6 +684,15 @@ extern void dev_queue_xmit_nit(struct s
extern void dev_init(void);
+#ifdef CONFIG_NETCHANNEL
+extern int netchannel_recv(struct sk_buff *skb);
+#else
+static inline int netchannel_recv(struct sk_buff *skb)
+{
+ return -1;
+}
+#endif
+
extern int netdev_nit;
extern int netdev_budget;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index f8f2347..accd00b 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -301,6 +301,7 @@ struct sk_buff {
* Handling routines are only of interest to the kernel
*/
#include <linux/slab.h>
+#include <linux/netchannel.h>
#include <asm/system.h>
@@ -314,6 +315,17 @@ static inline struct sk_buff *alloc_skb(
return __alloc_skb(size, priority, 0);
}
+#ifdef CONFIG_NETCHANNEL
+extern struct sk_buff *netchannel_alloc(struct unetchannel *unc, unsigned int header_size,
+ unsigned int total_size, gfp_t gfp_mask);
+#else
+static struct sk_buff *netchannel_alloc(struct unetchannel *unc, unsigned int header_size,
+ unsigned int total_size, gfp_t gfp_mask)
+{
+ return NULL;
+}
+#endif
+
static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
gfp_t priority)
{
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 3996960..8c22875 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -582,4 +582,6 @@ asmlinkage long sys_tee(int fdin, int fd
asmlinkage long sys_sync_file_range(int fd, loff_t offset, loff_t nbytes,
unsigned int flags);
+asmlinkage long sys_netchannel_control(void __user *arg);
+
#endif
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 5433195..1747fc3 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -132,3 +132,5 @@ cond_syscall(sys_mincore);
cond_syscall(sys_madvise);
cond_syscall(sys_mremap);
cond_syscall(sys_remap_file_pages);
+
+cond_syscall(sys_netchannel_control);
diff --git a/net/Kconfig b/net/Kconfig
index 4193cdc..465e37b 100644
--- a/net/Kconfig
+++ b/net/Kconfig
@@ -66,6 +66,14 @@ source "net/ipv6/Kconfig"
endif # if INET
+config NETCHANNEL
+ bool "Network channels"
+ ---help---
+ Network channels are a peer-to-peer abstraction which allows the creation
+ of high-performance communication paths.
+ Main advantages are a unified address cache, protocol processing moved
+ to userspace, zero-copy receive support and other interesting features.
+
menuconfig NETFILTER
bool "Network packet filtering (replaces ipchains)"
---help---
diff --git a/net/core/Makefile b/net/core/Makefile
index 79fe12c..7119812 100644
--- a/net/core/Makefile
+++ b/net/core/Makefile
@@ -16,3 +16,4 @@ obj-$(CONFIG_NET_DIVERT) += dv.o
obj-$(CONFIG_NET_PKTGEN) += pktgen.o
obj-$(CONFIG_WIRELESS_EXT) += wireless.o
obj-$(CONFIG_NETPOLL) += netpoll.o
+obj-$(CONFIG_NETCHANNEL) += netchannel.o
diff --git a/net/core/dev.c b/net/core/dev.c
index 9ab3cfa..2721111 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1712,6 +1712,10 @@ int netif_receive_skb(struct sk_buff *sk
}
}
+ ret = netchannel_recv(skb);
+ if (!ret)
+ goto out;
+
#ifdef CONFIG_NET_CLS_ACT
if (pt_prev) {
ret = deliver_skb(skb, pt_prev, orig_dev);
diff --git a/net/core/netchannel.c b/net/core/netchannel.c
new file mode 100644
index 0000000..169a764
--- /dev/null
+++ b/net/core/netchannel.c
@@ -0,0 +1,691 @@
+/*
+ * netchannel.c
+ *
+ * 2006 Copyright (c) Evgeniy Polyakov <johnpol@2ka.mipt.ru>
+ * All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ */
+
+#include <linux/types.h>
+#include <linux/unistd.h>
+#include <linux/linkage.h>
+#include <linux/notifier.h>
+#include <linux/list.h>
+#include <linux/slab.h>
+#include <linux/skbuff.h>
+#include <linux/errno.h>
+
+#include <linux/in.h>
+#include <linux/ip.h>
+#include <linux/tcp.h>
+#include <linux/udp.h>
+
+#include <linux/netdevice.h>
+#include <linux/inetdevice.h>
+#include <net/addrconf.h>
+
+#include <asm/uaccess.h>
+
+static unsigned int netchannel_hash_order = 8;
+static struct netchannel_cache_head ***netchannel_hash_table;
+static kmem_cache_t *netchannel_cache;
+
+static int netchannel_inetaddr_notifier_call(struct notifier_block *, unsigned long, void *);
+static struct notifier_block netchannel_inetaddr_notifier = {
+ .notifier_call = &netchannel_inetaddr_notifier_call
+};
+
+#ifdef CONFIG_IPV6
+static int netchannel_inet6addr_notifier_call(struct notifier_block *, unsigned long, void *);
+static struct notifier_block netchannel_inet6addr_notifier = {
+ .notifier_call = &netchannel_inet6addr_notifier_call
+};
+#endif
+
+static inline unsigned int netchannel_hash(struct unetchannel *unc)
+{
+ unsigned int h = (unc->dst ^ unc->dport) ^ (unc->src ^ unc->sport);
+ h ^= h >> 16;
+ h ^= h >> 8;
+ h ^= unc->proto;
+ return h & ((1 << 2*netchannel_hash_order) - 1);
+}
+
+static inline void netchannel_convert_hash(unsigned int hash, unsigned int *col, unsigned int *row)
+{
+ *row = hash & ((1 << netchannel_hash_order) - 1);
+ *col = (hash >> netchannel_hash_order) & ((1 << netchannel_hash_order) - 1);
+}
+
+static struct netchannel_cache_head *netchannel_bucket(struct unetchannel *unc)
+{
+ unsigned int hash = netchannel_hash(unc);
+ unsigned int col, row;
+
+ netchannel_convert_hash(hash, &col, &row);
+ return netchannel_hash_table[col][row];
+}
+
+static inline int netchannel_hash_equal_full(struct unetchannel *unc1, struct unetchannel *unc2)
+{
+ return (unc1->dport == unc2->dport) && (unc1->dst == unc2->dst) &&
+ (unc1->sport == unc2->sport) && (unc1->src == unc2->src) &&
+ (unc1->proto == unc2->proto);
+}
+
+static inline int netchannel_hash_equal_dest(struct unetchannel *unc1, struct unetchannel *unc2)
+{
+ return ((unc1->dport == unc2->dport) && (unc1->dst == unc2->dst) && (unc1->proto == unc2->proto));
+}
+
+static struct netchannel *netchannel_check_dest(struct unetchannel *unc, struct netchannel_cache_head *bucket)
+{
+ struct netchannel *nc;
+ struct hlist_node *node;
+ int found = 0;
+
+ hlist_for_each_entry_rcu(nc, node, &bucket->head, node) {
+ if (netchannel_hash_equal_dest(&nc->unc, unc)) {
+ found = 1;
+ break;
+ }
+ }
+
+ return (found)?nc:NULL;
+}
+
+static struct netchannel *netchannel_check_full(struct unetchannel *unc, struct netchannel_cache_head *bucket)
+{
+ struct netchannel *nc;
+ struct hlist_node *node;
+ int found = 0;
+
+ hlist_for_each_entry_rcu(nc, node, &bucket->head, node) {
+ if (netchannel_hash_equal_full(&nc->unc, unc)) {
+ found = 1;
+ break;
+ }
+ }
+
+ return (found)?nc:NULL;
+}
+
+static void netchannel_free_rcu(struct rcu_head *rcu)
+{
+ struct netchannel *nc = container_of(rcu, struct netchannel, rcu_head);
+
+ kmem_cache_free(netchannel_cache, nc);
+}
+
+static inline void netchannel_get(struct netchannel *nc)
+{
+ atomic_inc(&nc->refcnt);
+}
+
+static inline void netchannel_put(struct netchannel *nc)
+{
+ if (atomic_dec_and_test(&nc->refcnt))
+ call_rcu(&nc->rcu_head, &netchannel_free_rcu);
+}
+
+static inline void netchannel_dump_info_unc(struct unetchannel *unc, char *prefix, unsigned long hit, int err)
+{
+ u32 src, dst;
+ u16 sport, dport;
+
+ dst = unc->dst;
+ src = unc->src;
+ dport = ntohs(unc->dport);
+ sport = ntohs(unc->sport);
+
+ printk(KERN_INFO "netchannel: %s %u.%u.%u.%u:%u -> %u.%u.%u.%u:%u, proto: %u, hit: %lu, err: %d.\n",
+ prefix, NIPQUAD(src), sport, NIPQUAD(dst), dport, unc->proto, hit, err);
+}
+
+static int netchannel_convert_skb_ipv6(struct sk_buff *skb, struct unetchannel *unc)
+{
+ /*
+ * Hash IP addresses into src/dst. Setup TCP/UDP ports.
+ * Not supported yet.
+ */
+ return -1;
+}
+
+static int netchannel_convert_skb_ipv4(struct sk_buff *skb, struct unetchannel *unc)
+{
+ struct iphdr *iph;
+ u32 len;
+ struct tcphdr *th;
+ struct udphdr *uh;
+
+ if (!pskb_may_pull(skb, sizeof(struct iphdr)))
+ goto inhdr_error;
+
+ iph = skb->nh.iph;
+
+ if (iph->ihl < 5 || iph->version != 4)
+ goto inhdr_error;
+
+ if (!pskb_may_pull(skb, iph->ihl*4))
+ goto inhdr_error;
+
+ iph = skb->nh.iph;
+
+ if (unlikely(ip_fast_csum((u8 *)iph, iph->ihl)))
+ goto inhdr_error;
+
+ len = ntohs(iph->tot_len);
+ if (skb->len < len || len < (iph->ihl*4))
+ goto inhdr_error;
+
+ unc->dst = iph->daddr;
+ unc->src = iph->saddr;
+ unc->proto = iph->protocol;
+
+ len = skb->len;
+
+ skb->h.raw = skb->nh.iph + iph->ihl*4;
+
+ switch (unc->proto) {
+ case IPPROTO_TCP:
+ if (!pskb_may_pull(skb, sizeof(struct tcphdr)))
+ goto inhdr_error;
+ th = skb->h.th;
+
+ if (th->doff < sizeof(struct tcphdr) / 4)
+ goto inhdr_error;
+
+ unc->dport = th->dest;
+ unc->sport = th->source;
+ break;
+ case IPPROTO_UDP:
+ if (!pskb_may_pull(skb, sizeof(struct udphdr)))
+ goto inhdr_error;
+ uh = skb->h.uh;
+
+ if (ntohs(uh->len) < sizeof(struct udphdr))
+ goto inhdr_error;
+
+ unc->dport = uh->dest;
+ unc->sport = uh->source;
+ break;
+ default:
+ goto inhdr_error;
+ }
+
+ return 0;
+
+inhdr_error:
+ return -1;
+}
+
+static int netchannel_convert_skb(struct sk_buff *skb, struct unetchannel *unc)
+{
+ if (skb->pkt_type == PACKET_OTHERHOST)
+ return -1;
+
+ switch (ntohs(skb->protocol)) {
+ case ETH_P_IP:
+ return netchannel_convert_skb_ipv4(skb, unc);
+ case ETH_P_IPV6:
+ return netchannel_convert_skb_ipv6(skb, unc);
+ default:
+ return -1;
+ }
+}
+
+/*
+ * By design netchannels allow data to be "allocated" not only from the
+ * SLAB cache, but also taken from a mapped area or from the VFS cache
+ * (which requires process context or preallocation).
+ */
+struct sk_buff *netchannel_alloc(struct unetchannel *unc, unsigned int header_size,
+ unsigned int total_size, gfp_t gfp_mask)
+{
+ struct netchannel *nc;
+ struct netchannel_cache_head *bucket;
+ int err;
+ struct sk_buff *skb = NULL;
+ unsigned int size, pnum, i;
+
+ skb = alloc_skb(header_size, gfp_mask);
+ if (!skb)
+ return NULL;
+
+ rcu_read_lock();
+ bucket = netchannel_bucket(unc);
+ nc = netchannel_check_full(unc, bucket);
+ if (!nc) {
+ err = -ENODEV;
+ goto err_out_free_skb;
+ }
+
+ if (!nc->nc_alloc_page || !nc->nc_free_page) {
+ err = -EINVAL;
+ goto err_out_free_skb;
+ }
+
+ netchannel_get(nc);
+
+ size = total_size - header_size;
+ pnum = PAGE_ALIGN(size) >> PAGE_SHIFT;
+
+ for (i=0; i<pnum; ++i) {
+ unsigned int cs = min_t(unsigned int, PAGE_SIZE, size);
+ struct page *page;
+
+ page = nc->nc_alloc_page(cs);
+ if (!page)
+ break;
+
+ skb_fill_page_desc(skb, skb_shinfo(skb)->nr_frags, page, 0, cs);
+
+ skb->len += cs;
+ skb->data_len += cs;
+ skb->truesize += cs;
+
+ size -= cs;
+ }
+
+ if (i < pnum) {
+ pnum = i;
+ err = -ENOMEM;
+ goto err_out_free_frags;
+ }
+
+ rcu_read_unlock();
+
+ return skb;
+
+err_out_free_frags:
+ for (i=0; i<pnum; ++i) {
+ unsigned int cs = skb_shinfo(skb)->frags[i].size;
+ struct page *page = skb_shinfo(skb)->frags[i].page;
+
+ nc->nc_free_page(page);
+
+ skb->len -= cs;
+ skb->data_len -= cs;
+ skb->truesize -= cs;
+ }
+
+err_out_free_skb:
+ kfree_skb(skb);
+ return NULL;
+}
+
+int netchannel_recv(struct sk_buff *skb)
+{
+ struct netchannel *nc;
+ struct unetchannel unc;
+ struct netchannel_cache_head *bucket;
+ int err;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
+ rcu_read_lock();
+
+ err = netchannel_convert_skb(skb, &unc);
+ if (err)
+ goto unlock;
+
+ bucket = netchannel_bucket(&unc);
+ nc = netchannel_check_full(&unc, bucket);
+ if (!nc) {
+ err = -ENODEV;
+ goto unlock;
+ }
+
+ nc->hit++;
+
+ skb_queue_tail(&nc->list, skb);
+
+unlock:
+ rcu_read_unlock();
+ return err;
+}
+
+/*
+ * Actually it should be something like recvmsg().
+ */
+static int netchannel_copy_to_user(struct netchannel *nc, unsigned int *len, void __user *arg)
+{
+ unsigned int copied;
+ struct sk_buff *skb;
+ struct iovec to;
+ int err = -EINVAL;
+
+ to.iov_base = arg;
+ to.iov_len = *len;
+
+ skb = skb_dequeue(&nc->list);
+ if (!skb)
+ return -EAGAIN;
+
+ copied = skb->len;
+ if (copied > *len)
+ copied = *len;
+
+ if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
+ err = skb_copy_datagram_iovec(skb, 0, &to, copied);
+ } else {
+ err = skb_copy_and_csum_datagram_iovec(skb,0, &to);
+ }
+
+ *len = (err == 0)?copied:0;
+
+ kfree_skb(skb);
+
+ return err;
+}
+
+static int netchannel_create(struct unetchannel *unc)
+{
+ struct netchannel *nc;
+ int err = -ENOMEM;
+ struct netchannel_cache_head *bucket;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
+ bucket = netchannel_bucket(unc);
+
+ mutex_lock(&bucket->mutex);
+
+ if (netchannel_check_full(unc, bucket)) {
+ err = -EEXIST;
+ goto out_unlock;
+ }
+
+ if (unc->listen && netchannel_check_dest(unc, bucket)) {
+ err = -EEXIST;
+ goto out_unlock;
+ }
+
+ nc = kmem_cache_alloc(netchannel_cache, GFP_KERNEL);
+ if (!nc)
+ goto out_exit;
+
+ memset(nc, 0, sizeof(struct netchannel));
+
+ nc->hit = 0;
+ skb_queue_head_init(&nc->list);
+ atomic_set(&nc->refcnt, 1);
+ memcpy(&nc->unc, unc, sizeof(struct unetchannel));
+
+ nc->nc_read_data = &netchannel_copy_to_user;
+
+ hlist_add_head_rcu(&nc->node, &bucket->head);
+ err = 0;
+
+out_unlock:
+ mutex_unlock(&bucket->mutex);
+out_exit:
+ netchannel_dump_info_unc(unc, "create", 0, err);
+
+ return err;
+}
+
+static int netchannel_remove(struct unetchannel *unc)
+{
+ struct netchannel *nc;
+ int err = -ENODEV;
+ struct netchannel_cache_head *bucket;
+ unsigned long hit = 0;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
+ bucket = netchannel_bucket(unc);
+
+ mutex_lock(&bucket->mutex);
+
+ nc = netchannel_check_full(unc, bucket);
+ if (!nc)
+ nc = netchannel_check_dest(unc, bucket);
+
+ if (!nc)
+ goto out_unlock;
+
+ hlist_del_rcu(&nc->node);
+ hit = nc->hit;
+
+ netchannel_put(nc);
+ err = 0;
+
+out_unlock:
+ mutex_unlock(&bucket->mutex);
+ netchannel_dump_info_unc(unc, "remove", hit, err);
+ return err;
+}
+
+static int netchannel_recv_data(struct unetchannel_control *ctl, void __user *data)
+{
+ int ret = -ENODEV;
+ struct netchannel_cache_head *bucket;
+ struct netchannel *nc;
+
+ bucket = netchannel_bucket(&ctl->unc);
+
+ mutex_lock(&bucket->mutex);
+
+ nc = netchannel_check_full(&ctl->unc, bucket);
+ if (!nc)
+ nc = netchannel_check_dest(&ctl->unc, bucket);
+
+ if (!nc)
+ goto out_unlock;
+
+ ret = nc->nc_read_data(nc, &ctl->len, data);
+
+out_unlock:
+ mutex_unlock(&bucket->mutex);
+ return ret;
+}
+
+static int netchannel_dump_info(struct unetchannel *unc)
+{
+ struct netchannel_cache_head *bucket;
+ struct netchannel *nc;
+ char *ncs = "none";
+ unsigned long hit = 0;
+ int err;
+
+ bucket = netchannel_bucket(unc);
+
+ mutex_lock(&bucket->mutex);
+ nc = netchannel_check_full(unc, bucket);
+ if (!nc) {
+ nc = netchannel_check_dest(unc, bucket);
+ if (nc)
+ ncs = "dest";
+ } else
+ ncs = "full";
+ if (nc)
+ hit = nc->hit;
+ mutex_unlock(&bucket->mutex);
+ err = (nc)?0:-ENODEV;
+
+ netchannel_dump_info_unc(unc, ncs, hit, err);
+
+ return err;
+}
+
+asmlinkage long sys_netchannel_control(void __user *arg)
+{
+ struct unetchannel_control ctl;
+ int ret;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
+ if (copy_from_user(&ctl, arg, sizeof(struct unetchannel_control)))
+ return -ERESTARTSYS;
+
+ switch (ctl.cmd) {
+ case NETCHANNEL_CREATE:
+ case NETCHANNEL_BIND:
+ ret = netchannel_create(&ctl.unc);
+ break;
+ case NETCHANNEL_REMOVE:
+ ret = netchannel_remove(&ctl.unc);
+ break;
+ case NETCHANNEL_READ:
+ ret = netchannel_recv_data(&ctl, arg + sizeof(struct unetchannel_control));
+ break;
+ case NETCHANNEL_DUMP:
+ ret = netchannel_dump_info(&ctl.unc);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+ if (copy_to_user(arg, &ctl, sizeof(struct unetchannel_control)))
+ return -ERESTARTSYS;
+
+ return ret;
+}
+
+static inline void netchannel_dump_addr(struct in_ifaddr *ifa, char *str)
+{
+ printk("netchannel: %s %u.%u.%u.%u/%u.%u.%u.%u\n", str, NIPQUAD(ifa->ifa_local), NIPQUAD(ifa->ifa_mask));
+}
+
+static int netchannel_inetaddr_notifier_call(struct notifier_block *this, unsigned long event, void *ptr)
+{
+ struct in_ifaddr *ifa = ptr;
+
+ switch (event) {
+ case NETDEV_UP:
+ netchannel_dump_addr(ifa, "add");
+ break;
+ case NETDEV_DOWN:
+ netchannel_dump_addr(ifa, "del");
+ break;
+ default:
+ netchannel_dump_addr(ifa, "unk");
+ break;
+ }
+
+ return NOTIFY_DONE;
+}
+
+#ifdef CONFIG_IPV6
+static int netchannel_inet6addr_notifier_call(struct notifier_block *this, unsigned long event, void *ptr)
+{
+ struct inet6_ifaddr *ifa = ptr;
+
+ printk("netchannel: inet6 event=%lx, ifa=%p.\n", event, ifa);
+ return NOTIFY_DONE;
+}
+#endif
+
+static int __init netchannel_init(void)
+{
+ unsigned int i, j, size;
+ int err = -ENOMEM;
+
+ size = (1 << netchannel_hash_order);
+
+ netchannel_hash_table = kzalloc(size * sizeof(void *), GFP_KERNEL);
+ if (!netchannel_hash_table)
+ goto err_out_exit;
+
+ for (i=0; i<size; ++i) {
+ struct netchannel_cache_head **col;
+
+ col = kzalloc(size * sizeof(void *), GFP_KERNEL);
+ if (!col)
+ break;
+
+ for (j=0; j<size; ++j) {
+ struct netchannel_cache_head *head;
+
+ head = kzalloc(sizeof(struct netchannel_cache_head), GFP_KERNEL);
+ if (!head)
+ break;
+
+ INIT_HLIST_HEAD(&head->head);
+ mutex_init(&head->mutex);
+
+ col[j] = head;
+ }
+
+ if (j<size && j>0) {
+ while (j >= 0)
+ kfree(col[j--]);
+ kfree(col);
+ break;
+ }
+
+ netchannel_hash_table[i] = col;
+ }
+
+ if (i<size) {
+ size = i;
+ goto err_out_free;
+ }
+
+ netchannel_cache = kmem_cache_create("netchannel", sizeof(struct netchannel), 0, 0,
+ NULL, NULL);
+ if (!netchannel_cache)
+ goto err_out_free;
+
+ register_inetaddr_notifier(&netchannel_inetaddr_notifier);
+#ifdef CONFIG_IPV6
+ register_inet6addr_notifier(&netchannel_inet6addr_notifier);
+#endif
+
+ printk("netchannel: Created %u order two-dimensional hash table.\n",
+ netchannel_hash_order);
+
+ return 0;
+
+err_out_free:
+ for (i=0; i<size; ++i) {
+ for (j=0; j<(1 << netchannel_hash_order); ++j)
+ kfree(netchannel_hash_table[i][j]);
+ kfree(netchannel_hash_table[i]);
+ }
+ kfree(netchannel_hash_table);
+err_out_exit:
+
+ printk("netchannel: Failed to create %u order two-dimensional hash table.\n",
+ netchannel_hash_order);
+ return err;
+}
+
+static void __exit netchannel_exit(void)
+{
+ unsigned int i, j;
+
+ unregister_inetaddr_notifier(&netchannel_inetaddr_notifier);
+#ifdef CONFIG_IPV6
+ unregister_inet6addr_notifier(&netchannel_inet6addr_notifier);
+#endif
+ kmem_cache_destroy(netchannel_cache);
+
+ for (i=0; i<(1 << netchannel_hash_order); ++i) {
+ for (j=0; j<(1 << netchannel_hash_order); ++j)
+ kfree(netchannel_hash_table[i][j]);
+ kfree(netchannel_hash_table[i]);
+ }
+ kfree(netchannel_hash_table);
+}
+
+late_initcall(netchannel_init);
--
Evgeniy Polyakov
* Netchannel subsystem update.
2006-05-16 17:34 ` [1/1] Netchannel subsyste Evgeniy Polyakov
@ 2006-05-18 10:34 ` Evgeniy Polyakov
2006-05-20 15:52 ` Evgeniy Polyakov
0 siblings, 1 reply; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-18 10:34 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, kelly, rusty
This update brings new features. The following are supported:
* unified cache to store netchannels (IPv4, with a stub for IPv6 hashes; TCP and UDP)
* skb queueing mechanism
* netchannel creation/removal/reading commands
* netchannel callback to allocate/free pages not only from the SLAB cache
(for example, to take data from a mapped area)
* netchannel callback to move/copy data to userspace
Added:
* memory limits (soft limits, since the queue-length update is not protected by a lock).
* blocking reads.
* two data-reading backends: copy_to_user(), and copying into a (potentially
mapped) memory area.
A patch against the previous release is attached.
The userspace application, design and implementation notes, and full patchsets
can be found at the project homepage [1].
1. Network channel homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=netchannel
I would like to raise a question about how a netchannel object should be
handled by the system in general, i.e. should netchannels be associated with
a process, or should they live by themselves, like routes?
My implementation allows netchannels to be set up permanently, so a process
can exit and a new one can bind to the existing netchannel and read its
data, but this requires some tricks to map its pages into the new
process's context...
Also, if a netchannel is created but no process is associated with it, who
will run the protocol state machine?
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h
index e87a148..7ab2fa0 100644
--- a/include/linux/netchannel.h
+++ b/include/linux/netchannel.h
@@ -32,13 +32,20 @@ enum netchannel_commands {
NETCHANNEL_DUMP,
};
+enum netchannel_type {
+ NETCHANNEL_COPY_USER = 0,
+ NETCHANNEL_MMAP,
+ NETCHANEL_VM_HACK,
+};
+
struct unetchannel
{
__u32 src, dst; /* source/destination hashes */
__u16 sport, dport; /* source/destination ports */
__u8 proto; /* IP protocol number */
- __u8 listen;
- __u8 reserved[2];
+ __u8 type; /* Netchannel type */
+ __u8 memory_limit_order; /* Memor limit order */
+ __u8 reserved;
};
struct unetchannel_control
@@ -46,6 +53,8 @@ struct unetchannel_control
struct unetchannel unc;
__u32 cmd;
__u32 len;
+ __u32 flags;
+ __u32 timeout;
};
#ifdef __KERNEL__
@@ -60,9 +69,14 @@ struct netchannel
struct page * (*nc_alloc_page)(unsigned int size);
void (*nc_free_page)(struct page *page);
- int (*nc_read_data)(struct netchannel *, unsigned int *len, void __user *arg);
+ int (*nc_read_data)(struct netchannel *, unsigned int *timeout, unsigned int *len, void *arg);
+
+ struct sk_buff_head recv_queue;
+ wait_queue_head_t wait;
+
+ unsigned int qlen;
- struct sk_buff_head list;
+ void *priv;
};
struct netchannel_cache_head
@@ -71,5 +85,15 @@ struct netchannel_cache_head
struct mutex mutex;
};
+#define NETCHANNEL_MAX_ORDER 32
+#define NETCHANNEL_MIN_ORDER PAGE_SHIFT
+
+struct netchannel_mmap
+{
+ struct page **page;
+ unsigned int pnum;
+ unsigned int poff;
+};
+
#endif /* __KERNEL__ */
#endif /* __NETCHANNEL_H */
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index accd00b..ba82aa2 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -301,7 +301,6 @@ struct sk_buff {
* Handling routines are only of interest to the kernel
*/
#include <linux/slab.h>
-#include <linux/netchannel.h>
#include <asm/system.h>
@@ -316,10 +315,11 @@ static inline struct sk_buff *alloc_skb(
}
#ifdef CONFIG_NETCHANNEL
+struct unetchannel;
extern struct sk_buff *netchannel_alloc(struct unetchannel *unc, unsigned int header_size,
unsigned int total_size, gfp_t gfp_mask);
#else
-static struct sk_buff *netchannel_alloc(struct unetchannel *unc, unsigned int header_size,
+static struct sk_buff *netchannel_alloc(void *unc, unsigned int header_size,
unsigned int total_size, gfp_t gfp_mask)
{
return NULL;
diff --git a/net/core/netchannel.c b/net/core/netchannel.c
index 169a764..96e5e5b 100644
--- a/net/core/netchannel.c
+++ b/net/core/netchannel.c
@@ -27,6 +27,8 @@
#include <linux/slab.h>
#include <linux/skbuff.h>
#include <linux/errno.h>
+#include <linux/highmem.h>
+#include <linux/netchannel.h>
#include <linux/in.h>
#include <linux/ip.h>
@@ -127,6 +129,7 @@ static void netchannel_free_rcu(struct r
{
struct netchannel *nc = container_of(rcu, struct netchannel, rcu_head);
+ netchannel_cleanup(nc);
kmem_cache_free(netchannel_cache, nc);
}
@@ -151,8 +154,10 @@ static inline void netchannel_dump_info_
dport = ntohs(unc->dport);
sport = ntohs(unc->sport);
- printk(KERN_INFO "netchannel: %s %u.%u.%u.%u:%u -> %u.%u.%u.%u:%u, proto: %u, hit: %lu, err: %d.\n",
- prefix, NIPQUAD(src), sport, NIPQUAD(dst), dport, unc->proto, hit, err);
+ printk(KERN_NOTICE "netchannel: %s %u.%u.%u.%u:%u -> %u.%u.%u.%u:%u, "
+ "proto: %u, type: %u, order: %u, hit: %lu, err: %d.\n",
+ prefix, NIPQUAD(src), sport, NIPQUAD(dst), dport,
+ unc->proto, unc->type, unc->memory_limit_order, hit, err);
}
static int netchannel_convert_skb_ipv6(struct sk_buff *skb, struct unetchannel *unc)
@@ -197,7 +202,7 @@ static int netchannel_convert_skb_ipv4(s
len = skb->len;
- skb->h.raw = skb->nh.iph + iph->ihl*4;
+ skb->h.raw = skb->nh.raw + iph->ihl*4;
switch (unc->proto) {
case IPPROTO_TCP:
@@ -352,35 +357,91 @@ int netchannel_recv(struct sk_buff *skb)
nc->hit++;
- skb_queue_tail(&nc->list, skb);
+ if (nc->qlen + skb->len > (1 << nc->unc.memory_limit_order)) {
+ kfree_skb(skb);
+ err = 0;
+ goto unlock;
+ }
+
+ skb_queue_tail(&nc->recv_queue, skb);
+ nc->qlen += skb->len;
unlock:
rcu_read_unlock();
return err;
}
+static int netchannel_wait_for_packet(struct netchannel *nc, long *timeo_p)
+{
+ int error = 0;
+ DEFINE_WAIT(wait);
+
+ prepare_to_wait_exclusive(&nc->wait, &wait, TASK_INTERRUPTIBLE);
+
+ if (skb_queue_empty(&nc->recv_queue)) {
+ if (signal_pending(current))
+ goto interrupted;
+
+ *timeo_p = schedule_timeout(*timeo_p);
+ }
+out:
+ finish_wait(&nc->wait, &wait);
+ return error;
+interrupted:
+ error = (*timeo_p == MAX_SCHEDULE_TIMEOUT) ? -ERESTARTSYS : -EINTR;
+ goto out;
+}
+
+static struct sk_buff *netchannel_get_skb(struct netchannel *nc, unsigned int *timeout, int *error)
+{
+ struct sk_buff *skb = NULL;
+ long tm = *timeout;
+
+ *error = 0;
+
+ while (1) {
+ skb = skb_dequeue(&nc->recv_queue);
+ if (skb)
+ break;
+
+ if (*timeout) {
+ *error = netchannel_wait_for_packet(nc, &tm);
+ if (*error) {
+ *timeout = tm;
+ break;
+ }
+ tm = *timeout;
+ } else {
+ *error = -EAGAIN;
+ break;
+ }
+ }
+
+ return skb;
+}
+
/*
* Actually it should be something like recvmsg().
*/
-static int netchannel_copy_to_user(struct netchannel *nc, unsigned int *len, void __user *arg)
+static int netchannel_copy_to_user(struct netchannel *nc, unsigned int *timeout, unsigned int *len, void *arg)
{
unsigned int copied;
struct sk_buff *skb;
struct iovec to;
- int err = -EINVAL;
-
- to.iov_base = arg;
- to.iov_len = *len;
+ int err;
- skb = skb_dequeue(&nc->list);
+ skb = netchannel_get_skb(nc, timeout, &err);
if (!skb)
- return -EAGAIN;
+ return err;
+
+ to.iov_base = arg;
+ to.iov_len = *len;
copied = skb->len;
if (copied > *len)
copied = *len;
-
- if (skb->ip_summed==CHECKSUM_UNNECESSARY) {
+
+ if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
err = skb_copy_datagram_iovec(skb, 0, &to, copied);
} else {
err = skb_copy_and_csum_datagram_iovec(skb,0, &to);
@@ -388,56 +449,290 @@ static int netchannel_copy_to_user(struc
*len = (err == 0)?copied:0;
+ nc->qlen -= skb->len;
kfree_skb(skb);
return err;
}
-static int netchannel_create(struct unetchannel *unc)
+int netchannel_skb_copy_datagram(const struct sk_buff *skb, int offset,
+ void *to, int len)
{
- struct netchannel *nc;
- int err = -ENOMEM;
- struct netchannel_cache_head *bucket;
+ int start = skb_headlen(skb);
+ int i, copy = start - offset;
+
+ /* Copy header. */
+ if (copy > 0) {
+ if (copy > len)
+ copy = len;
+ memcpy(to, skb->data + offset, copy);
+
+ if ((len -= copy) == 0)
+ return 0;
+ offset += copy;
+ to += copy;
+ }
+
+ /* Copy paged appendix. Hmm... why does this look so complicated? */
+ for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
+ int end;
+
+ BUG_TRAP(start <= offset + len);
+
+ end = start + skb_shinfo(skb)->frags[i].size;
+ if ((copy = end - offset) > 0) {
+ u8 *vaddr;
+ skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
+ struct page *page = frag->page;
+
+ if (copy > len)
+ copy = len;
+ vaddr = kmap(page);
+ memcpy(to, vaddr + frag->page_offset +
+ offset - start, copy);
+ kunmap(page);
+ if (!(len -= copy))
+ return 0;
+ offset += copy;
+ to += copy;
+ }
+ start = end;
+ }
+
+ if (skb_shinfo(skb)->frag_list) {
+ struct sk_buff *list = skb_shinfo(skb)->frag_list;
+
+ for (; list; list = list->next) {
+ int end;
+
+ BUG_TRAP(start <= offset + len);
+
+ end = start + list->len;
+ if ((copy = end - offset) > 0) {
+ if (copy > len)
+ copy = len;
+ if (netchannel_skb_copy_datagram(list,
+ offset - start,
+ to, copy))
+ goto fault;
+ if ((len -= copy) == 0)
+ return 0;
+ offset += copy;
+ to += copy;
+ }
+ start = end;
+ }
+ }
+ if (!len)
+ return 0;
+
+fault:
+ return -EFAULT;
+}
+
+static int netchannel_copy_to_mem(struct netchannel *nc, unsigned int *timeout, unsigned int *len, void *arg)
+{
+ struct netchannel_mmap *m = nc->priv;
+ unsigned int copied, skb_offset = 0;
+ struct sk_buff *skb;
+ int err;
+
+ skb = netchannel_get_skb(nc, timeout, &err);
+ if (!skb)
+ return err;
+
+ copied = skb->len;
+
+ while (copied) {
+ int pnum = ((m->poff % PAGE_SIZE) % m->pnum);
+ struct page *page = m->page[pnum];
+ void *page_map, *ptr;
+ unsigned int sz, left;
+
+ left = PAGE_SIZE - (m->poff % (PAGE_SIZE - 1));
+ sz = min_t(unsigned int, left, copied);
+
+ if (!sz) {
+ err = -ENOSPC;
+ goto err_out;
+ }
+
+ page_map = kmap_atomic(page, KM_USER0);
+ if (!page_map) {
+ err = -ENOMEM;
+ goto err_out;
+ }
+ ptr = page_map + (m->poff % (PAGE_SIZE - 1));
+
+ err = netchannel_skb_copy_datagram(skb, skb_offset, ptr, sz);
+ if (err) {
+ kunmap_atomic(page_map, KM_USER0);
+ goto err_out;
+ }
+ kunmap_atomic(page_map, KM_USER0);
+
+ copied -= sz;
+ m->poff += sz;
+ skb_offset += sz;
+#if 1
+ if (m->poff >= PAGE_SIZE * m->pnum) {
+ //netchannel_dump_info_unc(&nc->unc, "rewind", nc->hit, 0);
+ m->poff = 0;
+ }
+#endif
+ }
+ *len = skb->len;
+
+ err = 0;
+
+err_out:
+ nc->qlen -= skb->len;
+ kfree_skb(skb);
+
+ return err;
+}
+
+static int netchannel_mmap_setup(struct netchannel *nc)
+{
+ struct netchannel_mmap *m;
+ unsigned int i, pnum;
+
+ pnum = (1 << (nc->unc.memory_limit_order - NETCHANNEL_MIN_ORDER));
+
+ m = kzalloc(sizeof(struct netchannel_mmap) + sizeof(struct page *) * pnum, GFP_KERNEL);
+ if (!m)
+ return -ENOMEM;
+
+ m->page = (struct page **)(m + 1);
+ m->pnum = pnum;
+
+ for (i=0; i<pnum; ++i) {
+ m->page[i] = alloc_page(GFP_KERNEL);
+ if (!m->page[i])
+ break;
+ }
+
+ if (i < pnum) {
+ pnum = i;
+ goto err_out_free;
+ }
+
+ nc->priv = m;
+ nc->nc_read_data = &netchannel_copy_to_mem;
+
+ return 0;
+
+err_out_free:
+ for (i=0; i<pnum; ++i)
+ __free_page(m->page[i]);
+
+ kfree(m);
+
+ return -ENOMEM;
- if (!netchannel_hash_table)
- return -ENODEV;
+}
- bucket = netchannel_bucket(unc);
+static void netchannel_mmap_cleanup(struct netchannel *nc)
+{
+ unsigned int i;
+ struct netchannel_mmap *m = nc->priv;
- mutex_lock(&bucket->mutex);
+ for (i=0; i<m->pnum; ++i)
+ __free_page(m->page[i]);
- if (netchannel_check_full(unc, bucket)) {
- err = -EEXIST;
- goto out_unlock;
+ kfree(m);
+}
+
+static void netchannel_cleanup(struct netchannel *nc)
+{
+ switch (nc->unc.type) {
+ case NETCHANNEL_COPY_USER:
+ break;
+ case NETCHANNEL_MMAP:
+ netchannel_mmap_cleanup(nc);
+ break;
+ default:
+ break;
}
+}
- if (unc->listen && netchannel_check_dest(unc, bucket)) {
- err = -EEXIST;
- goto out_unlock;
+static int netchannel_setup(struct netchannel *nc)
+{
+ int ret = 0;
+
+ if (nc->unc.memory_limit_order > NETCHANNEL_MAX_ORDER)
+ return -E2BIG;
+
+ if (nc->unc.memory_limit_order < NETCHANNEL_MIN_ORDER)
+ nc->unc.memory_limit_order = NETCHANNEL_MIN_ORDER;
+
+ switch (nc->unc.type) {
+ case NETCHANNEL_COPY_USER:
+ nc->nc_read_data = &netchannel_copy_to_user;
+ break;
+ case NETCHANNEL_MMAP:
+ ret = netchannel_mmap_setup(nc);
+ break;
+ default:
+ ret = -EINVAL;
+ break;
}
+ return ret;
+}
+
+static int netchannel_create(struct unetchannel *unc)
+{
+ struct netchannel *nc;
+ int err = -ENOMEM;
+ struct netchannel_cache_head *bucket;
+
+ if (!netchannel_hash_table)
+ return -ENODEV;
+
nc = kmem_cache_alloc(netchannel_cache, GFP_KERNEL);
if (!nc)
- goto out_exit;
+ return -ENOMEM;
memset(nc, 0, sizeof(struct netchannel));
nc->hit = 0;
- skb_queue_head_init(&nc->list);
+ skb_queue_head_init(&nc->recv_queue);
+ init_waitqueue_head(&nc->wait);
atomic_set(&nc->refcnt, 1);
memcpy(&nc->unc, unc, sizeof(struct unetchannel));
- nc->nc_read_data = &netchannel_copy_to_user;
+ err = netchannel_setup(nc);
+ if (err)
+ goto err_out_free;
+
+ bucket = netchannel_bucket(unc);
+
+ mutex_lock(&bucket->mutex);
+
+ if (netchannel_check_full(unc, bucket)) {
+ err = -EEXIST;
+ goto err_out_unlock;
+ }
hlist_add_head_rcu(&nc->node, &bucket->head);
err = 0;
-out_unlock:
mutex_unlock(&bucket->mutex);
-out_exit:
+
netchannel_dump_info_unc(unc, "create", 0, err);
return err;
+
+err_out_unlock:
+ mutex_unlock(&bucket->mutex);
+
+ netchannel_cleanup(nc);
+
+err_out_free:
+ kmem_cache_free(netchannel_cache, nc);
+
+ return err;
}
static int netchannel_remove(struct unetchannel *unc)
@@ -488,11 +783,17 @@ static int netchannel_recv_data(struct u
nc = netchannel_check_dest(&ctl->unc, bucket);
if (!nc)
- goto out_unlock;
+ goto err_out_unlock;
+
+ netchannel_get(nc);
+ mutex_unlock(&bucket->mutex);
- ret = nc->nc_read_data(nc, &ctl->len, data);
+ ret = nc->nc_read_data(nc, &ctl->timeout, &ctl->len, data);
+
+ netchannel_put(nc);
+ return ret;
-out_unlock:
+err_out_unlock:
mutex_unlock(&bucket->mutex);
return ret;
}
--
Evgeniy Polyakov
^ permalink raw reply related [flat|nested] 35+ messages in thread
* Re: Netchannel subsystem update.
2006-05-18 10:34 ` Netchannel subsystem update Evgeniy Polyakov
@ 2006-05-20 15:52 ` Evgeniy Polyakov
2006-05-22 6:06 ` David S. Miller
0 siblings, 1 reply; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-20 15:52 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, kelly, rusty
The more I think about TCP processing in netchannels, the closer I get
to the following ideas:
* map the netchannel to a socket.
* implement our own TCP (receiving only, for now) state machine.
So I would like to ask: what do we want for netchannels?
* the existing Linux TCP stack
* a fairly simple, minimalistic, RFC-compliant stack
While developing the first approach I found that input TCP processing
sometimes refers to a dst_entry which can only be obtained through the input
routing code. You can find the appropriate changes in the attached incremental patch.
The full netchannel patch can be found on the homepage [1].
The implementation is fairly proof-of-concept, since I do not like the
idea of binding a netchannel to a socket.
The entire TCP state machine is handled inside the socket code, so userspace
must create a listening socket, wait until a new connection arrives,
accept it, and then bind a netchannel to the newly created socket for the
established connection. All further data flow is handled inside
netchannels, but it is not yet working as expected.
So the question is how to handle the TCP state machine for netchannels: bind
them to a socket and use the existing code, or create a small
netchannel-specific TCP state machine?
1. Netchannel homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=netchannel
Initial TCP support for netchannels. Incremental patch.
Proof-of-concept only.
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
diff --git a/include/linux/netchannel.h b/include/linux/netchannel.h
index 7ab2fa0..c161809 100644
--- a/include/linux/netchannel.h
+++ b/include/linux/netchannel.h
@@ -55,6 +55,7 @@ struct unetchannel_control
__u32 len;
__u32 flags;
__u32 timeout;
+ unsigned int fd;
};
#ifdef __KERNEL__
@@ -77,6 +78,8 @@ struct netchannel
unsigned int qlen;
void *priv;
+
+ struct inode *inode;
};
struct netchannel_cache_head
diff --git a/net/core/netchannel.c b/net/core/netchannel.c
index 96e5e5b..a33ed60 100644
--- a/net/core/netchannel.c
+++ b/net/core/netchannel.c
@@ -25,6 +25,7 @@
#include <linux/notifier.h>
#include <linux/list.h>
#include <linux/slab.h>
+#include <linux/file.h>
#include <linux/skbuff.h>
#include <linux/errno.h>
#include <linux/highmem.h>
@@ -114,7 +115,7 @@ static struct netchannel *netchannel_che
struct netchannel *nc;
struct hlist_node *node;
int found = 0;
-
+
hlist_for_each_entry_rcu(nc, node, &bucket->head, node) {
if (netchannel_hash_equal_full(&nc->unc, unc)) {
found = 1;
@@ -125,6 +126,30 @@ static struct netchannel *netchannel_che
return (found)?nc:NULL;
}
+static void netchannel_mmap_cleanup(struct netchannel *nc)
+{
+ unsigned int i;
+ struct netchannel_mmap *m = nc->priv;
+
+ for (i=0; i<m->pnum; ++i)
+ __free_page(m->page[i]);
+
+ kfree(m);
+}
+
+static void netchannel_cleanup(struct netchannel *nc)
+{
+ switch (nc->unc.type) {
+ case NETCHANNEL_COPY_USER:
+ break;
+ case NETCHANNEL_MMAP:
+ netchannel_mmap_cleanup(nc);
+ break;
+ default:
+ break;
+ }
+}
+
static void netchannel_free_rcu(struct rcu_head *rcu)
{
struct netchannel *nc = container_of(rcu, struct netchannel, rcu_head);
@@ -365,9 +390,11 @@ int netchannel_recv(struct sk_buff *skb)
skb_queue_tail(&nc->recv_queue, skb);
nc->qlen += skb->len;
+ wake_up(&nc->wait);
unlock:
rcu_read_unlock();
+
return err;
}
@@ -420,9 +447,68 @@ static struct sk_buff *netchannel_get_sk
return skb;
}
-/*
- * Actually it should be something like recvmsg().
- */
+static int netchannel_copy_to_user_tcp(struct netchannel *nc, unsigned int *timeout, unsigned int *len, void *arg)
+{
+ struct tcphdr *th;
+ int err = -ENODEV;
+ struct socket *sock;
+ struct sock *sk;
+ struct sk_buff *skb;
+
+ skb = netchannel_get_skb(nc, timeout, &err);
+ if (!skb)
+ return err;
+
+ if (!nc->inode)
+ goto err_out_free;
+ sock = SOCKET_I(nc->inode);
+ if (!sock || !sock->sk)
+ goto err_out_free;
+
+ sk = sock->sk;
+
+ __skb_pull(skb, skb->nh.iph->ihl*4);
+
+ skb->h.raw = skb->data;
+
+ th = skb->h.th;
+
+ printk("netchannel: TCP: syn: %u, fin: %u, rst: %u, psh: %u, ack: %u, urg: %u, ece: %u, cwr: %u, res1: %u, doff: %u.\n",
+ th->syn, th->fin, th->rst, th->psh, th->ack, th->urg, th->ece, th->cwr, th->res1, th->doff);
+
+ if (sk->sk_state == TCP_ESTABLISHED) {
+ struct iovec to;
+ unsigned int copied;
+
+ to.iov_base = arg;
+ to.iov_len = *len;
+
+ copied = skb->len;
+ if (copied > *len)
+ copied = *len;
+
+ if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
+ err = skb_copy_datagram_iovec(skb, 0, &to, copied);
+ } else {
+ err = skb_copy_and_csum_datagram_iovec(skb,0, &to);
+ }
+
+ *len = (err == 0)?copied:0;
+ }
+
+ nc->qlen -= skb->len;
+
+ err = sk->sk_backlog_rcv(sk, skb);
+ printk("netchannel: TCP: sk_backlog_rcv() ret: %d.\n", err);
+ return err;
+
+err_out_free:
+ nc->qlen -= skb->len;
+ kfree_skb(skb);
+
+ return err;
+}
+
static int netchannel_copy_to_user(struct netchannel *nc, unsigned int *timeout, unsigned int *len, void *arg)
{
unsigned int copied;
@@ -632,30 +718,6 @@ err_out_free:
}
-static void netchannel_mmap_cleanup(struct netchannel *nc)
-{
- unsigned int i;
- struct netchannel_mmap *m = nc->priv;
-
- for (i=0; i<m->pnum; ++i)
- __free_page(m->page[i]);
-
- kfree(m);
-}
-
-static void netchannel_cleanup(struct netchannel *nc)
-{
- switch (nc->unc.type) {
- case NETCHANNEL_COPY_USER:
- break;
- case NETCHANNEL_MMAP:
- netchannel_mmap_cleanup(nc);
- break;
- default:
- break;
- }
-}
-
static int netchannel_setup(struct netchannel *nc)
{
int ret = 0;
@@ -668,7 +730,17 @@ static int netchannel_setup(struct netch
switch (nc->unc.type) {
case NETCHANNEL_COPY_USER:
- nc->nc_read_data = &netchannel_copy_to_user;
+ switch (nc->unc.proto) {
+ case IPPROTO_UDP:
+ nc->nc_read_data = &netchannel_copy_to_user;
+ break;
+ case IPPROTO_TCP:
+ nc->nc_read_data = &netchannel_copy_to_user_tcp;
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
break;
case NETCHANNEL_MMAP:
ret = netchannel_mmap_setup(nc);
@@ -681,15 +753,53 @@ static int netchannel_setup(struct netch
return ret;
}
+static int netchannel_bind(struct unetchannel_control *ctl)
+{
+ struct netchannel *nc;
+ int err = -EINVAL, fput_needed;
+ struct netchannel_cache_head *bucket;
+ struct file *file;
+ struct inode *inode;
+
+ file = fget_light(ctl->fd, &fput_needed);
+ if (!file)
+ goto err_out_exit;
+
+ inode = igrab(file->f_dentry->d_inode);
+ if (!inode)
+ goto err_out_fput;
+
+ bucket = netchannel_bucket(&ctl->unc);
+
+ mutex_lock(&bucket->mutex);
+
+ nc = netchannel_check_full(&ctl->unc, bucket);
+ if (!nc) {
+ err = -ENODEV;
+ goto err_out_unlock;
+ }
+
+ nc->inode = inode;
+
+ fput_light(file, fput_needed);
+ mutex_unlock(&bucket->mutex);
+
+ return 0;
+
+err_out_unlock:
+ mutex_unlock(&bucket->mutex);
+err_out_fput:
+ fput_light(file, fput_needed);
+err_out_exit:
+ return err;
+}
+
static int netchannel_create(struct unetchannel *unc)
{
struct netchannel *nc;
int err = -ENOMEM;
struct netchannel_cache_head *bucket;
- if (!netchannel_hash_table)
- return -ENODEV;
-
nc = kmem_cache_alloc(netchannel_cache, GFP_KERNEL);
if (!nc)
return -ENOMEM;
@@ -759,6 +869,11 @@ static int netchannel_remove(struct unet
hlist_del_rcu(&nc->node);
hit = nc->hit;
+ if (nc->inode) {
+ iput(nc->inode);
+ nc->inode = NULL;
+ }
+
netchannel_put(nc);
err = 0;
@@ -839,9 +954,11 @@ asmlinkage long sys_netchannel_control(v
switch (ctl.cmd) {
case NETCHANNEL_CREATE:
- case NETCHANNEL_BIND:
ret = netchannel_create(&ctl.unc);
break;
+ case NETCHANNEL_BIND:
+ ret = netchannel_bind(&ctl);
+ break;
case NETCHANNEL_REMOVE:
ret = netchannel_remove(&ctl.unc);
break;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 672950e..eb2dc12 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -727,7 +727,10 @@ int tcp_v4_conn_request(struct sock *sk,
#endif
/* Never answer to SYNs send to broadcast or multicast */
- if (((struct rtable *)skb->dst)->rt_flags &
+ if (!skb->dst) {
+ if (MULTICAST(daddr))
+ goto drop;
+ } else if (((struct rtable *)skb->dst)->rt_flags &
(RTCF_BROADCAST | RTCF_MULTICAST))
goto drop;
@@ -924,15 +927,21 @@ static struct sock *tcp_v4_hnd_req(struc
struct iphdr *iph = skb->nh.iph;
struct sock *nsk;
struct request_sock **prev;
+ int iif;
/* Find possible connection requests. */
struct request_sock *req = inet_csk_search_req(sk, &prev, th->source,
iph->saddr, iph->daddr);
if (req)
return tcp_check_req(sk, skb, req, prev);
+ if (!skb->dst)
+ iif = 0;
+ else
+ iif = inet_iif(skb);
+
nsk = __inet_lookup_established(&tcp_hashinfo, skb->nh.iph->saddr,
th->source, skb->nh.iph->daddr,
- ntohs(th->dest), inet_iif(skb));
+ ntohs(th->dest), iif);
if (nsk) {
if (nsk->sk_state != TCP_TIME_WAIT) {
--
Evgeniy Polyakov
* Re: Netchannel subsystem update.
2006-05-20 15:52 ` Evgeniy Polyakov
@ 2006-05-22 6:06 ` David S. Miller
2006-05-22 16:34 ` [Netchannel] Full TCP receiving support Evgeniy Polyakov
0 siblings, 1 reply; 35+ messages in thread
From: David S. Miller @ 2006-05-22 6:06 UTC (permalink / raw)
To: johnpol; +Cc: netdev, kelly, rusty
From: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
Date: Sat, 20 May 2006 19:52:02 +0400
> While developing the first approach I found that input TCP processing
> sometimes refers to a dst_entry which can only be obtained through the input
> routing code. You can find the appropriate changes in the attached incremental patch.
It would be no trouble to cache the input route in the socket
when netchannel is enabled.
* [Netchannel] Full TCP receiving support.
2006-05-22 6:06 ` David S. Miller
@ 2006-05-22 16:34 ` Evgeniy Polyakov
2006-05-24 9:38 ` Evgeniy Polyakov
0 siblings, 1 reply; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-22 16:34 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, kelly, rusty
Hello, developers.
The attached patch implements full TCP input processing for netchannels [1].
It is based on the socket processing code and is fairly hairy for now.
The main idea is to queue skbs into the netchannel's private queue at
interrupt time and then dequeue and process them in process context.
To make TCP work, the userspace processing code only has to perform several
simple steps, similar to how the backlog is processed in the socket code.
The attached patch is against the previously posted netchannel patches; it
mostly implements the netchannel_copy_to_user_tcp() function, which performs
TCP processing and copies data to userspace. As you can see, it is quite
trivial.
The current state is proof-of-concept: there is some ugliness in the code
and various uninteresting debug output, so I plan to clean this up
and run some tests to show whether this approach works or not.
The full patch and userspace application are available from the netchannel homepage [1].
Thank you.
1. Netchannel homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=netchannel
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
diff --git a/net/core/netchannel.c b/net/core/netchannel.c
index a33ed60..7239a49 100644
--- a/net/core/netchannel.c
+++ b/net/core/netchannel.c
@@ -34,6 +34,7 @@
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
+#include <net/tcp.h>
#include <linux/udp.h>
#include <linux/netdevice.h>
@@ -221,6 +222,13 @@ static int netchannel_convert_skb_ipv4(s
if (skb->len < len || len < (iph->ihl*4))
goto inhdr_error;
+ if (pskb_trim_rcsum(skb, len))
+ goto inhdr_error;
+
+ if (iph->ihl > 5)
+ printk("netchannel: IP options: %u.%u.%u.%u -> %u.%u.%u.%u, ihl: %u.\n",
+ NIPQUAD(iph->saddr), NIPQUAD(iph->daddr), iph->ihl);
+
unc->dst = iph->daddr;
unc->src = iph->saddr;
unc->proto = iph->protocol;
@@ -388,9 +396,12 @@ int netchannel_recv(struct sk_buff *skb)
goto unlock;
}
- skb_queue_tail(&nc->recv_queue, skb);
nc->qlen += skb->len;
+ skb_queue_tail(&nc->recv_queue, skb);
wake_up(&nc->wait);
+
+ if (nc->inode && SOCKET_I(nc->inode)->sk)
+ wake_up(SOCKET_I(nc->inode)->sk->sk_sleep);
unlock:
rcu_read_unlock();
@@ -454,58 +465,75 @@ static int netchannel_copy_to_user_tcp(s
struct socket *sock;
struct sock *sk;
struct sk_buff *skb;
-
- skb = netchannel_get_skb(nc, timeout, &err);
- if (!skb)
- return err;
+ struct iovec iov;
+ struct msghdr msg;
+ unsigned flags = MSG_DONTWAIT;
if (!nc->inode)
- goto err_out_free;
+ goto err_out;
sock = SOCKET_I(nc->inode);
if (!sock || !sock->sk)
- goto err_out_free;
+ goto err_out;
sk = sock->sk;
- __skb_pull(skb, skb->nh.iph->ihl*4);
+ do {
+ msg.msg_control=NULL;
+ msg.msg_controllen=0;
+ msg.msg_iovlen=1;
+ msg.msg_iov=&iov;
+ msg.msg_name=NULL;
+ msg.msg_namelen=0;
+ msg.msg_flags = flags;
+ iov.iov_len=*len;
+ iov.iov_base=arg;
- skb->h.raw = skb->data;
+ err = sock_recvmsg(sock, &msg, iov.iov_len, flags);
- th = skb->h.th;
+ printk("netchannel: TCP: len: %u, err: %d.\n", *len, err);
- printk("netchannel: TCP: syn: %u, fin: %u, rst: %u, psh: %u, ack: %u, urg: %u, ece: %u, cwr: %u, res1: %u, doff: %u.\n",
- th->syn, th->fin, th->rst, th->psh, th->ack, th->urg, th->ece, th->cwr, th->res1, th->doff);
-
- if (sk->sk_state == TCP_ESTABLISHED) {
- struct iovec to;
- unsigned int copied;
-
- to.iov_base = arg;
- to.iov_len = *len;
+ if (err > 0) {
+ *len = err;
+ return 0;
+ } else if (err && err != -EAGAIN)
+ return err;
- copied = skb->len;
- if (copied > *len)
- copied = *len;
+ err = 0;
- if (skb->ip_summed == CHECKSUM_UNNECESSARY) {
- err = skb_copy_datagram_iovec(skb, 0, &to, copied);
- } else {
- err = skb_copy_and_csum_datagram_iovec(skb,0, &to);
- }
+ skb = netchannel_get_skb(nc, timeout, &err);
+ if (!skb)
+ return err;
+
+ __skb_pull(skb, skb->nh.iph->ihl*4);
+
+ skb->h.raw = skb->data;
+
+ th = skb->h.th;
+ TCP_SKB_CB(skb)->seq = ntohl(th->seq);
+ TCP_SKB_CB(skb)->end_seq = (TCP_SKB_CB(skb)->seq + th->syn + th->fin +
+ skb->len - th->doff * 4);
+ TCP_SKB_CB(skb)->ack_seq = ntohl(th->ack_seq);
+ TCP_SKB_CB(skb)->when = 0;
+ TCP_SKB_CB(skb)->flags = skb->nh.iph->tos;
+ TCP_SKB_CB(skb)->sacked = 0;
- *len = (err == 0)?copied:0;
- }
-
- nc->qlen -= skb->len;
+ printk("netchannel: TCP: syn: %u, fin: %u, rst: %u, psh: %u, ack: %u, urg: %u, ece: %u, cwr: %u, res1: %u, doff: %u.\n",
+ th->syn, th->fin, th->rst, th->psh, th->ack, th->urg, th->ece, th->cwr, th->res1, th->doff);
+
+ nc->qlen -= skb->len;
- err = sk->sk_backlog_rcv(sk, skb);
- printk("netchannel: TCP: sk_backlog_rcv() ret: %d.\n", err);
- return err;
+ err = sk->sk_backlog_rcv(sk, skb);
+
+ printk("netchannel: TCP: seq=%u, ack=%u, sk_state=%u, backlog_err: %d, sock_qlen: %u.\n",
+ th->seq, th->ack_seq, sk->sk_state, err, skb_queue_len(&sk->sk_receive_queue));
+
+ if (err)
+ return err;
+ } while (!err);
-err_out_free:
- nc->qlen -= skb->len;
- kfree_skb(skb);
+ return 0;
+err_out:
return err;
}
--
Evgeniy Polyakov
* Re: [Netchannel] Full TCP receiving support.
2006-05-22 16:34 ` [Netchannel] Full TCP receiving support Evgeniy Polyakov
@ 2006-05-24 9:38 ` Evgeniy Polyakov
0 siblings, 0 replies; 35+ messages in thread
From: Evgeniy Polyakov @ 2006-05-24 9:38 UTC (permalink / raw)
To: David S. Miller; +Cc: netdev, kelly, rusty
[-- Attachment #1: Type: text/plain, Size: 1493 bytes --]
Hello, developers.
Initial TCP benchmark.
After tweaking some stuff I ran a netchannel vs. socket TCP benchmark.
Unfortunately I'm currently unable to run a 1Gbit test, since my test
machine is slightly broken... So only 100Mbit for now.
Performance graph attached.
Speed is the same, and CPU usage is the same
(socket CPU usage is 1% (or 20% :) higher than the netchannel one, but let's
write this off as experimental error).
The netchannel was set up in copy_to_user mode, so it was not expected to be
faster or eat less CPU; the main purpose of this test was to show that
netchannels, with _all_ protocol processing moved to process context, can
be at least as fast as when the processing is split into pieces.
I've found some links on the web and in blogs where some developers completely
disagree with VJ's idea of moving stuff into process context...
Well, this should break their mostly theoretical arguments.
It is clear that with the memcpy setup the CPU usage numbers (at least) for
netchannels are noticeably better than for sockets (see the previously posted benchmarks).
I will start changing the core networking code to accept different copying
"actor" methods, which will allow using the netchannel's preallocated mapped
area instead of copy_to_user().
Full patch and userspace application are available from netchannel homepage [1].
Thank you.
1. Netchannel homepage.
http://tservice.net.ru/~s0mbre/old/?section=projects&item=netchannel
Signed-off-by: Evgeniy Polyakov <johnpol@2ka.mipt.ru>
--
Evgeniy Polyakov
[-- Attachment #2: netchannel_speed.png --]
[-- Type: image/png, Size: 7126 bytes --]
Thread overview: 35+ messages
-- links below jump to the message on this page --
2006-04-26 11:47 [PATCH 1/3] Rough VJ Channel Implementation - vj_core.patch Kelly Daly
2006-04-26 7:33 ` David S. Miller
2006-04-27 3:31 ` Kelly Daly
2006-04-27 6:25 ` David S. Miller
2006-04-27 11:51 ` Evgeniy Polyakov
2006-04-27 20:09 ` David S. Miller
2006-04-28 6:05 ` Evgeniy Polyakov
2006-05-04 2:59 ` Kelly Daly
2006-05-04 23:22 ` David S. Miller
2006-05-05 1:31 ` Rusty Russell
2006-04-26 7:59 ` David S. Miller
2006-05-04 7:28 ` Kelly Daly
2006-05-04 23:11 ` David S. Miller
2006-05-05 2:48 ` Kelly Daly
2006-05-16 1:02 ` Kelly Daly
2006-05-16 1:05 ` David S. Miller
2006-05-16 1:15 ` Kelly Daly
2006-05-16 5:16 ` David S. Miller
2006-06-22 2:05 ` Kelly Daly
2006-06-22 3:58 ` James Morris
2006-06-22 4:31 ` Arnaldo Carvalho de Melo
2006-06-22 4:36 ` YOSHIFUJI Hideaki / 吉藤英明
2006-07-08 0:05 ` David Miller
2006-05-16 6:19 ` [1/1] netchannel subsystem Evgeniy Polyakov
2006-05-16 6:57 ` David S. Miller
2006-05-16 6:59 ` Evgeniy Polyakov
2006-05-16 7:06 ` David S. Miller
2006-05-16 7:15 ` Evgeniy Polyakov
2006-05-16 7:07 ` Evgeniy Polyakov
2006-05-16 17:34 ` [1/1] Netchannel subsyste Evgeniy Polyakov
2006-05-18 10:34 ` Netchannel subsystem update Evgeniy Polyakov
2006-05-20 15:52 ` Evgeniy Polyakov
2006-05-22 6:06 ` David S. Miller
2006-05-22 16:34 ` [Netchannel] Full TCP receiving support Evgeniy Polyakov
2006-05-24 9:38 ` Evgeniy Polyakov