Netdev List
 help / color / mirror / Atom feed
* RE: [PATCH 0/3] UCC TDM driver for MPC83xx platforms
From: Kalra Ashish @ 2008-01-16 11:47 UTC (permalink / raw)
  To: Kumar Gala, Andrew Morton
  Cc: Phillips Kim, Aggrwal Poonam, sfr, rubini, linux-ppcdev, netdev,
	linux-kernel, Barkowski Michael, Cutler Richard, Kalra Ashish,
	Suresh PV
In-Reply-To: <5927F7A4-D42A-40F6-AE9C-EDA34738A752@kernel.crashing.org>

Hello All,
 
I am sure that the TDM bus driver model/framework will make us put a lot
more programming effort without
any assurance of the code being accepted by the Linux community,
especially as there are many
Telephony/VoIP stack implementations in Linux such as the Sangoma
WANPIPE Kernel suite which
have their own Zaptel TDM (channelized zero-copy) interface layer. There
are other High Speed serial (HSS)
API interfaces, again supporting channelized and/or prioritized API
interfaces. All these implementations 
are proprietary and have their own tightly coupled upper layers and
hardware abstraction layers. It is
difficult to predict that these stacks will move towards a generic TDM
bus driver interface. Therefore, i think
we can have our own tightly coupled interface with our VoIP framework
and let us the keep the driver as it is,
i.e., as a misc driver.

Ashish

-----Original Message-----
From: Kumar Gala [mailto:galak@kernel.crashing.org] 
Sent: Tuesday, January 15, 2008 9:01 AM
To: Andrew Morton
Cc: Phillips Kim; Aggrwal Poonam; sfr@canb.auug.org.au;
rubini@vision.unipv.it; linux-ppcdev@ozlabs.kernel.org;
netdev@vger.kernel.org; linux-kernel@vger.kernel.org; Barkowski Michael;
Kalra Ashish; Cutler Richard
Subject: Re: [PATCH 0/3] UCC TDM driver for MPC83xx platforms


On Jan 14, 2008, at 3:15 PM, Andrew Morton wrote:

> On Mon, 14 Jan 2008 12:00:51 -0600
> Kim Phillips <kim.phillips@freescale.com> wrote:
>
>> On Thu, 10 Jan 2008 21:41:20 -0700
>> "Aggrwal Poonam" <Poonam.Aggrwal@freescale.com> wrote:
>>
>>> Hello  All
>>>
>>> I am waiting for more feedback on the patches.
>>>
>>> If there are no objections please consider them for 2.6.25.
>>>
>> if this isn't going to go through Alessandro Rubini/misc drivers, can

>> it go through the akpm/mm tree?
>>
>
> That would work.  But it might be more appropriate to go Kumar-
> >paulus->Linus.

I'm ok w/taking the arch/powerpc bits, but I"m a bit concerned about  
the driver itself.  I'm wondering if we need a TDM framework in the  
kernel.

I guess if Poonam could possibly describe how this driver is actually  
used that would be helpful.  I see we have 8315 with a discrete TDM  
block and I'm guessing 82xx/85xx based CPM parts of some form of TDM  
as well.

- k

^ permalink raw reply

* Re: [PATCH] ICMP: ICMP_MIB_OUTMSGS increment duplicated
From: Herbert Xu @ 2008-01-16 11:49 UTC (permalink / raw)
  To: Wang Chen; +Cc: davem, dlstevens, netdev
In-Reply-To: <478DD57B.6020503@cn.fujitsu.com>

Wang Chen <wangchen@cn.fujitsu.com> wrote:
> In tree net-2.6.25, commit "96793b482540f3a26e2188eaf75cb56b7829d3e3"
> made a mistake.
> 
> In that patch, David L added a icmp_out_count() in ip_push_pending_frames(),
> remove icmp_out_count() from icmp_reply(). But he forgot to remove 
> icmp_out_count() from icmp_send() too.
> Since icmp_send and icmp_reply will call icmp_push_reply, which will call
> ip_push_pending_frames, a duplicated increment happened in icmp_send.

Actually having the icmp_out_count call in ip_push_pending_frames seems
inconsistent.  Having it there means that we count raw socket ICMP packets
too.  But we don't do that for any other protocol, e.g., raw UDP packets
don't get counted.

So was the inclusion of raw ICMP packets intentional?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH] TUN/TAP GSO/partial csum support
From: Rusty Russell @ 2008-01-16 12:06 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Max Krasnyansky

[-- Attachment #1: Type: text/plain, Size: 9879 bytes --]

OK, revised with help from Herbert.  Also, I have attached a test program and
a script to run it (it short-circuits two tun devices, so you can run it with
the patch applied and see big packets flowing).

This implements partial checksum and GSO support for tun/tap.

We use the virtio_net_hdr: it is an ABI already and designed to
encapsulate such metadata as GSO and partial checksums.

lguest performance (160MB sendfile, worst/best/avg, 20 runs):
	Before: 5.06/3.39/3.82
	After:  4.69/0.84/2.84

Note that there is no easy way to detect if GSO is supported: see next
patch.

Questions:
1) Should we rename/move virtio_net_hdr to something more generic?
2) Is this the right way to build a paged skb from user pages?

Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
---
 drivers/net/tun.c      |  250 +++++++++++++++++++++++++++++++++++++++++++------
 include/linux/if_tun.h |    2 
 2 files changed, 225 insertions(+), 27 deletions(-)

diff -r ba3c0eb8741a drivers/net/tun.c
--- a/drivers/net/tun.c	Wed Jan 16 17:35:25 2008 +1100
+++ b/drivers/net/tun.c	Wed Jan 16 22:11:11 2008 +1100
@@ -62,6 +62,7 @@
 #include <linux/if_ether.h>
 #include <linux/if_tun.h>
 #include <linux/crc32.h>
+#include <linux/virtio_net.h>
 #include <net/net_namespace.h>
 
 #include <asm/system.h>
@@ -238,35 +239,189 @@ static unsigned int tun_chr_poll(struct 
 	return mask;
 }
 
+static struct sk_buff *copy_user_skb(size_t align, struct iovec *iv, size_t len)
+{
+	struct sk_buff *skb;
+
+	if (!(skb = alloc_skb(len + align, GFP_KERNEL)))
+		return ERR_PTR(-ENOMEM);
+
+	if (align)
+		skb_reserve(skb, align);
+
+	if (memcpy_fromiovec(skb_put(skb, len), iv, len)) {
+		kfree_skb(skb);
+		return ERR_PTR(-EFAULT);
+	}
+	return skb;
+}
+
+/* This will fail if they give us a crazy iovec, but that's their own fault. */
+static int get_user_skb_frags(const struct iovec *iv, size_t count,
+			      struct skb_frag_struct *f)
+{
+	unsigned int i, j, num_pg = 0;
+	int err;
+	struct page *pages[MAX_SKB_FRAGS];
+
+	down_read(&current->mm->mmap_sem);
+	for (i = 0; i < count; i++) {
+		int n, npages;
+		unsigned long base, len;
+		base = (unsigned long)iv[i].iov_base;
+		len = (unsigned long)iv[i].iov_len;
+
+		if (len == 0)
+			continue;
+
+		/* How many pages will this take? */
+		npages = 1 + (base + len - 1)/PAGE_SIZE - base/PAGE_SIZE;
+		if (unlikely(num_pg + npages > MAX_SKB_FRAGS)) {
+			err = -ENOSPC;
+			goto fail;
+		}
+		n = get_user_pages(current, current->mm, base, npages,
+				   0, 0, pages, NULL);
+		if (unlikely(n < 0)) {
+			err = n;
+			goto fail;
+		}
+
+		/* Transfer pages to the frag array */
+		for (j = 0; j < n; j++) {
+			f[num_pg].page = pages[j];
+			if (j == 0) {
+				f[num_pg].page_offset = offset_in_page(base);
+				f[num_pg].size = min(len, PAGE_SIZE -
+						     f[num_pg].page_offset);
+			} else {
+				f[num_pg].page_offset = 0;
+				f[num_pg].size = min(len, PAGE_SIZE);
+			}
+			len -= f[num_pg].size;
+			base += f[num_pg].size;
+			num_pg++;
+		}
+
+		if (unlikely(n != npages)) {
+			err = -EFAULT;
+			goto fail;
+		}
+	}
+	up_read(&current->mm->mmap_sem);
+	return num_pg;
+
+fail:
+	for (i = 0; i < num_pg; i++)
+		put_page(f[i].page);
+	up_read(&current->mm->mmap_sem);
+	return err;
+}
+
+
+static struct sk_buff *map_user_skb(const struct virtio_net_hdr *gso,
+				    size_t align, struct iovec *iv,
+				    size_t count, size_t len)
+{
+	struct sk_buff *skb;
+	struct skb_shared_info *sinfo;
+	int err;
+
+	if (!(skb = alloc_skb(gso->gso_hdr_len + align, GFP_KERNEL)))
+		return ERR_PTR(-ENOMEM);
+
+	if (align)
+		skb_reserve(skb, align);
+
+	sinfo = skb_shinfo(skb);
+	sinfo->gso_size = gso->gso_size;
+	sinfo->gso_type = SKB_GSO_DODGY;
+	switch (gso->gso_type) {
+	case VIRTIO_NET_HDR_GSO_TCPV4_ECN:
+		sinfo->gso_type |= SKB_GSO_TCP_ECN;
+		/* fall through */
+	case VIRTIO_NET_HDR_GSO_TCPV4:
+		sinfo->gso_type |= SKB_GSO_TCPV4;
+		break;
+	case VIRTIO_NET_HDR_GSO_TCPV6:
+		sinfo->gso_type |= SKB_GSO_TCPV6;
+		break;
+	case VIRTIO_NET_HDR_GSO_UDP:
+		sinfo->gso_type |= SKB_GSO_UDP;
+		break;
+	default:
+		err = -EINVAL;
+		goto fail;
+	}
+
+	/* Copy in the header. */
+	if (memcpy_fromiovec(skb_put(skb, gso->gso_hdr_len), iv,
+			     gso->gso_hdr_len)) {
+		err = -EFAULT;
+		goto fail;
+	}
+
+	err = get_user_skb_frags(iv, count, sinfo->frags);
+	if (err < 0)
+		goto fail;
+
+	sinfo->nr_frags = err;
+	skb->len += len;
+	skb->data_len += len;
+	
+	return skb;
+
+fail:
+	kfree_skb(skb);
+	return ERR_PTR(err);
+}
+
+static inline size_t iov_total(const struct iovec *iv, unsigned long count)
+{
+	unsigned long i;
+	size_t len;
+
+	for (i = 0, len = 0; i < count; i++)
+		len += iv[i].iov_len;
+
+	return len;
+}
+
 /* Get packet from user space buffer */
-static __inline__ ssize_t tun_get_user(struct tun_struct *tun, struct iovec *iv, size_t count)
+static __inline__ ssize_t tun_get_user(struct tun_struct *tun, struct iovec *iv, size_t num)
 {
 	struct tun_pi pi = { 0, __constant_htons(ETH_P_IP) };
+	struct virtio_net_hdr gso = { 0, VIRTIO_NET_HDR_GSO_NONE };
 	struct sk_buff *skb;
-	size_t len = count, align = 0;
+	size_t tot_len = iov_total(iv, num);
+	size_t len = tot_len, align = 0;
 
 	if (!(tun->flags & TUN_NO_PI)) {
-		if ((len -= sizeof(pi)) > count)
+		if ((len -= sizeof(pi)) > tot_len)
 			return -EINVAL;
 
 		if(memcpy_fromiovec((void *)&pi, iv, sizeof(pi)))
+			return -EFAULT;
+	}
+	if (tun->flags & TUN_GSO_HDR) {
+		if ((len -= sizeof(gso)) > tot_len)
+			return -EINVAL;
+
+		if (memcpy_fromiovec((void *)&gso, iv, sizeof(gso)))
 			return -EFAULT;
 	}
 
 	if ((tun->flags & TUN_TYPE_MASK) == TUN_TAP_DEV)
 		align = NET_IP_ALIGN;
 
-	if (!(skb = alloc_skb(len + align, GFP_KERNEL))) {
+	if (gso.gso_type != VIRTIO_NET_HDR_GSO_NONE)
+		skb = map_user_skb(&gso, align, iv, num, len);
+	else
+		skb = copy_user_skb(align, iv, len);
+
+	if (IS_ERR(skb)) {
 		tun->dev->stats.rx_dropped++;
-		return -ENOMEM;
-	}
-
-	if (align)
-		skb_reserve(skb, align);
-	if (memcpy_fromiovec(skb_put(skb, len), iv, len)) {
-		tun->dev->stats.rx_dropped++;
-		kfree_skb(skb);
-		return -EFAULT;
+		return PTR_ERR(skb);
 	}
 
 	switch (tun->flags & TUN_TYPE_MASK) {
@@ -280,7 +435,13 @@ static __inline__ ssize_t tun_get_user(s
 		break;
 	};
 
-	if (tun->flags & TUN_NOCHECKSUM)
+	if (gso.flags & (1 << VIRTIO_NET_F_NO_CSUM)) {
+		if (!skb_partial_csum_set(skb,gso.csum_start,gso.csum_offset)) {
+			tun->dev->stats.rx_dropped++;
+			kfree_skb(skb);
+			return -EINVAL;
+		}
+	} else if (tun->flags & TUN_NOCHECKSUM)
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 
 	netif_rx_ni(skb);
@@ -289,18 +450,7 @@ static __inline__ ssize_t tun_get_user(s
 	tun->dev->stats.rx_packets++;
 	tun->dev->stats.rx_bytes += len;
 
-	return count;
-}
-
-static inline size_t iov_total(const struct iovec *iv, unsigned long count)
-{
-	unsigned long i;
-	size_t len;
-
-	for (i = 0, len = 0; i < count; i++)
-		len += iv[i].iov_len;
-
-	return len;
+	return tot_len;
 }
 
 static ssize_t tun_chr_aio_write(struct kiocb *iocb, const struct iovec *iv,
@@ -313,7 +463,7 @@ static ssize_t tun_chr_aio_write(struct 
 
 	DBG(KERN_INFO "%s: tun_chr_write %ld\n", tun->dev->name, count);
 
-	return tun_get_user(tun, (struct iovec *) iv, iov_total(iv, count));
+	return tun_get_user(tun, (struct iovec *) iv, count);
 }
 
 /* Put packet to the user space buffer */
@@ -336,6 +486,42 @@ static __inline__ ssize_t tun_put_user(s
 		if (memcpy_toiovec(iv, (void *) &pi, sizeof(pi)))
 			return -EFAULT;
 		total += sizeof(pi);
+	}
+	if (tun->flags & TUN_GSO_HDR) {
+		struct virtio_net_hdr gso;
+		struct skb_shared_info *sinfo = skb_shinfo(skb);
+
+		if (skb_is_gso(skb)) {
+			gso.gso_hdr_len = skb_transport_header(skb) - skb->data;
+			gso.gso_size = sinfo->gso_size;
+			if (sinfo->gso_type & SKB_GSO_TCP_ECN)
+				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4_ECN;
+			else if (sinfo->gso_type & SKB_GSO_TCPV4)
+				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
+			else if (sinfo->gso_type & SKB_GSO_TCPV6)
+				gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
+			else if (sinfo->gso_type & SKB_GSO_UDP)
+				gso.gso_type = VIRTIO_NET_HDR_GSO_UDP;
+			else
+				BUG();
+		} else
+			gso.gso_type = VIRTIO_NET_HDR_GSO_NONE;
+		
+		if (skb->ip_summed == CHECKSUM_PARTIAL) {
+			gso.flags = VIRTIO_NET_HDR_F_NEEDS_CSUM;
+			gso.csum_start = skb->csum_start - skb_headroom(skb);
+			gso.csum_offset = skb->csum_offset;
+		} else {
+			gso.flags = 0;
+			gso.csum_offset = gso.csum_start = 0;
+		}
+
+		if ((len -= sizeof(gso)) < 0)
+			return -EINVAL;
+
+		if (memcpy_toiovec(iv, (void *)&gso, sizeof(gso)))
+			return -EFAULT;
+		total += sizeof(gso);
 	}
 
 	len = min_t(int, skb->len, len);
@@ -523,6 +709,13 @@ static int tun_set_iff(struct file *file
 
 		tun_net_init(dev);
 
+		/* GSO?  One of everything, please. */
+		if (ifr->ifr_flags & IFF_GSO_HDR)
+			dev->features = (NETIF_F_SG | NETIF_F_HW_CSUM
+					 | NETIF_F_HIGHDMA | NETIF_F_FRAGLIST
+					 | NETIF_F_TSO | NETIF_F_UFO
+					 | NETIF_F_TSO_ECN | NETIF_F_TSO6);
+
 		if (strchr(dev->name, '%')) {
 			err = dev_alloc_name(dev, dev->name);
 			if (err < 0)
@@ -543,6 +736,9 @@ static int tun_set_iff(struct file *file
 
 	if (ifr->ifr_flags & IFF_ONE_QUEUE)
 		tun->flags |= TUN_ONE_QUEUE;
+
+	if (ifr->ifr_flags & IFF_GSO_HDR)
+		tun->flags |= TUN_GSO_HDR;
 
 	file->private_data = tun;
 	tun->attached = 1;
diff -r ba3c0eb8741a include/linux/if_tun.h
--- a/include/linux/if_tun.h	Wed Jan 16 17:35:25 2008 +1100
+++ b/include/linux/if_tun.h	Wed Jan 16 22:11:11 2008 +1100
@@ -70,6 +70,7 @@ struct tun_struct {
 #define TUN_NO_PI	0x0040
 #define TUN_ONE_QUEUE	0x0080
 #define TUN_PERSIST 	0x0100	
+#define TUN_GSO_HDR 	0x0200	
 
 /* Ioctl defines */
 #define TUNSETNOCSUM  _IOW('T', 200, int) 
@@ -79,6 +80,7 @@ struct tun_struct {
 #define IFF_TAP		0x0002
 #define IFF_NO_PI	0x1000
 #define IFF_ONE_QUEUE	0x2000
+#define IFF_GSO_HDR	0x4000
 
 struct tun_pi {
 	unsigned short flags;

[-- Attachment #2: tun_gso_pipe.c --]
[-- Type: text/x-csrc, Size: 8976 bytes --]

#include <signal.h>
#include <stddef.h>
#include <errno.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <netinet/ip.h>
#include <netinet/ip_icmp.h>
#include <netinet/udp.h>
#include <netinet/tcp.h>
#include <net/if.h>
#include <net/ethernet.h>
#include <stdio.h>
#include <string.h>
#include <err.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <linux/sockios.h>
#include <linux/if_tun.h>
#include <stdbool.h>
#include <stdint.h>
#include <stddef.h>

typedef uint32_t u32;
typedef uint16_t u16;
typedef uint8_t u8;

#ifndef TUNGETFEATURES
#define TUNGETFEATURES  _IOR('T', 207, unsigned int)
#endif
#ifndef IFF_GSO_HDR
#define IFF_GSO_HDR	0x4000
#endif

static bool use_gso = true;

static bool write_all(int fd, const void *data, unsigned long size)
{
	while (size) {
		int done;

		done = write(fd, data, size);
		if (done < 0 && errno == EINTR)
			continue;
		if (done <= 0)
			return false;
		data += done;
		size -= done;
	}

	return true;
}

static bool read_all(int fd, void *data, unsigned long size)
{
	while (size) {
		int done;

		done = read(fd, data, size);
		if (done < 0 && errno == EINTR)
			continue;
		if (done <= 0)
			return false;
		data += done;
		size -= done;
	}

	return true;
}

static uint32_t str2ip(const char *ipaddr)
{
	unsigned int byte[4];

	sscanf(ipaddr, "%u.%u.%u.%u", &byte[0], &byte[1], &byte[2], &byte[3]);
	return (byte[0] << 24) | (byte[1] << 16) | (byte[2] << 8) | byte[3];
}

static void configure_device(int fd, const char *devname, uint32_t ipaddr)
{
	struct ifreq ifr;
	struct sockaddr_in *sin = (struct sockaddr_in *)&ifr.ifr_addr;

	/* Don't read these incantations.  Just cut & paste them like I did! */
	memset(&ifr, 0, sizeof(ifr));
	strcpy(ifr.ifr_name, devname);
	sin->sin_family = AF_INET;
	sin->sin_addr.s_addr = htonl(ipaddr);
	if (ioctl(fd, SIOCSIFADDR, &ifr) != 0)
		err(1, "Setting %s interface address", devname);
	ifr.ifr_flags = IFF_UP;
	if (ioctl(fd, SIOCSIFFLAGS, &ifr) != 0)
		err(1, "Bringing interface %s up", devname);
}

static int setup_tun_net(uint32_t ip)
{
	struct ifreq ifr;
	int netfd, ipfd;
	unsigned int features;

	/* We open the /dev/net/tun device and tell it we want a tap device.  A
	 * tap device is like a tun device, only somehow different.  To tell
	 * the truth, I completely blundered my way through this code, but it
	 * works now! */
	netfd = open("/dev/net/tun", O_RDWR);
	if (netfd < 0)
		err(1, "Opening /dev/net/tun");

	if (use_gso &&
	    (ioctl(netfd, TUNGETFEATURES, &features) != 0
	     || !(features & IFF_GSO_HDR))) {
		fprintf(stderr, "No GSO support!\n");
		use_gso = false;
	}

	memset(&ifr, 0, sizeof(ifr));
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | (use_gso ? IFF_GSO_HDR : 0);
	strcpy(ifr.ifr_name, "tap%d");
	if (ioctl(netfd, TUNSETIFF, &ifr) != 0)
		err(1, "configuring /dev/net/tun");

	/* We need a socket to perform the magic network ioctls to bring up the
	 * tap interface, connect to the bridge etc.  Any socket will do! */
	ipfd = socket(PF_INET, SOCK_DGRAM, IPPROTO_IP);
	if (ipfd < 0)
		err(1, "opening IP socket");

	/* We are peer 0, ie. first slot, so we hand dev->mem to this routine
	 * to write the MAC address at the start of the device memory.  */
	configure_device(ipfd, ifr.ifr_name, ip);
	close(ipfd);

	return netfd;
}

static void two_way_popen(char *const argv[])
{
	int pid;
	int pipe1[2], pipe2[2];

	if (pipe(pipe1) != 0 || pipe(pipe2) != 0)
		err(1, "creating pipe");

	pid = fork();
	if (pid == -1)
		err(1, "forking");

	if (pid == 0) {
		/* We are the child. */
		close(pipe1[1]);
		close(pipe2[0]);
		dup2(pipe1[0], STDIN_FILENO);
		dup2(pipe2[1], STDOUT_FILENO);

		execvp(argv[0], argv);
		fprintf(stderr, "Failed to exec '%s': %m\n", argv[0]);
		kill(getppid(), SIGKILL);
	}

	/* We are parent. */
	close(pipe1[0]);
	close(pipe2[1]);
	dup2(pipe1[1], STDOUT_FILENO);
	dup2(pipe2[0], STDIN_FILENO);
}

struct virtio_net_hdr
{
#define VIRTIO_NET_HDR_F_NEEDS_CSUM	1	// Use csum_start, csum_offset
	__u8 flags;
#define VIRTIO_NET_HDR_GSO_NONE		0	// Not a GSO frame
#define VIRTIO_NET_HDR_GSO_TCPV4	1	// GSO frame, IPv4 TCP (TSO)
/* FIXME: Do we need this?  If they said they can handle ECN, do they care? */
#define VIRTIO_NET_HDR_GSO_TCPV4_ECN	2	// GSO frame, IPv4 TCP w/ ECN
#define VIRTIO_NET_HDR_GSO_UDP		3	// GSO frame, IPv4 UDP (UFO)
#define VIRTIO_NET_HDR_GSO_TCPV6	4	// GSO frame, IPv6 TCP
	__u8 gso_type;
	__u16 gso_hdr_len;	/* Ethernet + IP + tcp/udp hdrs */
	__u16 gso_size;		/* Bytes to append to gso_hdr_len per frame */
	__u16 csum_start;	/* Position to start checksumming from */
	__u16 csum_offset;	/* Offset after that to place checksum */
};

struct packet
{
	struct virtio_net_hdr gso;
	struct ether_header mac;
	struct iphdr ip;
	union {
		struct icmphdr icmp;
		struct tcphdr tcp;
		struct udphdr udp;
		char pad[65535 - 34];
	};
} __attribute__((packed));

static inline unsigned short from32to16(unsigned long x)
{
	/* add up 16-bit and 16-bit for 16+c bit */
	x = (x & 0xffff) + (x >> 16);
	/* add up carry.. */
	x = (x & 0xffff) + (x >> 16);
	return x;
}

static unsigned int csum_fold(unsigned int sum)
{
	return ~from32to16(sum);
}

static unsigned long do_csum(const unsigned char * buff, int len)
{
	int odd, count;
	unsigned long result = 0;

	if (len <= 0)
		return 0;

	odd = 1 & (unsigned long) buff;
	if (odd) {
		result = *buff;
		len--;
		buff++;
	}
	count = len >> 1;		/* nr of 16-bit words.. */
	if (count) {
		if (2 & (unsigned long) buff) {
			result += *(unsigned short *) buff;
			count--;
			len -= 2;
			buff += 2;
		}
		count >>= 1;		/* nr of 32-bit words.. */
		if (count) {
		        unsigned long carry = 0;
			do {
				unsigned int w = *(unsigned int *) buff;
				count--;
				buff += 4;
				result += carry;
				result += w;
				carry = (w > result);
			} while (count);
			result += carry;
			result = (result & 0xffff) + (result >> 16);
		}
		if (len & 2) {
			result += *(unsigned short *) buff;
			buff += 2;
		}
	}
	if (len & 1)
		result += (*buff << 8);
	result = from32to16(result);
	if (odd)
		result = ((result >> 8) & 0xff) | ((result & 0xff) << 8);

	return result;
}

static unsigned int csum_partial(const void * buff, int len, unsigned int sum)
{
	unsigned int result = do_csum(buff, len);

	/* add in old sum, and carry.. */
	result += sum;
	if (sum > result)
		result += 1;
	return result;
}

static void csum_replace(__u16 *sum, u32 from, u32 to)
{
	u32 diff[] = { ~from, to };
	*sum = csum_fold(csum_partial(diff, sizeof(diff), *sum ^ 0xFFFF));
}

#define NIPQUAD(addr) \
	((unsigned char *)&addr)[0], \
	((unsigned char *)&addr)[1], \
	((unsigned char *)&addr)[2], \
	((unsigned char *)&addr)[3]

/* Change destination IP address */
static void nat_packet(struct packet *packet, u32 src, u32 dst)
{
	u32 oldsrc, olddst;

	if (packet->mac.ether_type != htons(ETHERTYPE_IP))
		return;

	oldsrc = packet->ip.saddr;
	olddst = packet->ip.daddr;
	packet->ip.saddr = src;
	packet->ip.daddr = dst;
	csum_replace(&packet->ip.check, oldsrc, src);
	csum_replace(&packet->ip.check, olddst, dst);

	switch (packet->ip.protocol) {
	case IPPROTO_TCP:
		csum_replace(&packet->tcp.check, oldsrc, src);
		csum_replace(&packet->tcp.check, olddst, dst);
		break;
	case IPPROTO_UDP:
		csum_replace(&packet->udp.check, oldsrc, src);
		csum_replace(&packet->udp.check, olddst, dst);
		break;
	}
}

int main(int argc, char *argv[])
{
	int netfd;
	__u32 natdst, natsrc;
	int size;
	struct packet packet;
	void *buf;

	if (argv[1] && strcmp(argv[1], "--no-gso") == 0) {
		argv++;
		argc--;
		use_gso = false;
	}

	if (argc < 4)
		errx(1, "Usage: %s [--no-gso] ip-addr src-nat-addr dst-nat-addr [command-to-open...]", argv[0]);

	netfd = setup_tun_net(str2ip(argv[1]));
	natsrc = htonl(str2ip(argv[2]));
	natdst = htonl(str2ip(argv[3]));

	/* Eg. ssh othermachine /root/tun_gso_pipe 192.168.1.2 192.168.5.2 192.158.5.1 */
	if (argc > 4)
		two_way_popen(argv+4);

	if (use_gso)
		buf = &packet;
	else
		buf = &packet.mac;

	for (;;) {
		fd_set fds;

		FD_ZERO(&fds);
		FD_SET(netfd, &fds);
		FD_SET(STDIN_FILENO, &fds);
		select(netfd+1, &fds, NULL, NULL, NULL);
		if (FD_ISSET(netfd, &fds)) {
			size = read(netfd, buf, sizeof(packet));
			if (size <= 0)
				err(1, "Reading netfd");
			if (use_gso)
				fprintf(stderr, "Read %u, gso = %u/%u\n", size,
					packet.gso.gso_type,
					packet.gso.gso_size);
			nat_packet(&packet, natsrc, natdst);
			if (!write_all(STDOUT_FILENO, &size, sizeof(size))
			    || !write_all(STDOUT_FILENO, buf, size))
				err(1, "Writing data to stdout");
		}
		if (FD_ISSET(STDIN_FILENO, &fds)) {
			int ret;
			if (!read_all(STDIN_FILENO, &size, sizeof(size)))
				err(1, "Reading stdin");
			if (!read_all(STDIN_FILENO, buf, size))
				err(1, "Reading %u byte packet", size);
			fprintf(stderr, "Writing %u, gso = %u/%u\n", size,
				packet.gso.gso_type,
				packet.gso.gso_size);
			ret = write(netfd, buf, size);
			if (ret != size)
				err(1, "Writing data to netfd gave %i/%i",
				    ret, size);
		}
	}
}

[-- Attachment #3: tun_gso_pipe-setup.sh --]
[-- Type: application/x-shellscript, Size: 794 bytes --]

^ permalink raw reply

* [PATCH] Interface to query tun/tap features.
From: Rusty Russell @ 2008-01-16 12:07 UTC (permalink / raw)
  To: netdev; +Cc: Herbert Xu, Max Krasnyansky
In-Reply-To: <200801162306.27767.rusty@rustcorp.com.au>

The problem with introducing IFF_GSO_HDR is that it needs to set dev->features
(to enable GSO, checksumming, etc), which is supposed to be done before
register_netdevice(), ie. as part of TUNSETIFF.

Unfortunately, TUNSETIFF has always just ignored flags it doesn't understand,
so there's no good way of detecting whether the kernel supports IFF_GSO_HDR.

This patch implements a TUNGETFEATURES ioctl which returns all the valid IFF
flags.  It could be extended later to include other features.

Here's an example program which uses it:

#include <linux/if_tun.h>
#include <sys/types.h>
#include <sys/ioctl.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <err.h>
#include <stdio.h>

static struct {
	unsigned int flag;
	const char *name;
} known_flags[] = {
	{ IFF_TUN, "TUN" },
	{ IFF_TAP, "TAP" },
	{ IFF_NO_PI, "NO_PI" },
	{ IFF_ONE_QUEUE, "ONE_QUEUE" },
	{ IFF_GSO_HDR, "GSO_HDR" },
};

int main()
{
	unsigned int features, i;

	int netfd = open("/dev/net/tun", O_RDWR);
	if (netfd < 0)
		err(1, "Opening /dev/net/tun");

	if (ioctl(netfd, TUNGETFEATURES, &features) != 0) {
		printf("Kernel does not support TUNGETFEATURES, guessing\n");
		features = (IFF_TUN|IFF_TAP|IFF_NO_PI|IFF_ONE_QUEUE);
	}
	printf("Available features are: ");
	for (i = 0; i < sizeof(known_flags)/sizeof(known_flags[0]); i++) {
		if (features & known_flags[i].flag) {
			features &= ~known_flags[i].flag;
			printf("%s ", known_flags[i].name);
		}
	}
	if (features)
		printf("(UNKNOWN %#x)", features);
	printf("\n");
	return 0;
}
---
 drivers/net/tun.c      |    9 +++++++++
 include/linux/if_tun.h |    2 ++
 2 files changed, 11 insertions(+)

diff -r ba3c0eb8741a drivers/net/tun.c
--- a/drivers/net/tun.c	Wed Jan 16 17:35:25 2008 +1100
+++ b/drivers/net/tun.c	Wed Jan 16 22:11:11 2008 +1100
@@ -583,6 +779,15 @@ static int tun_chr_ioctl(struct inode *i
 		if (copy_to_user(argp, &ifr, sizeof(ifr)))
 			return -EFAULT;
 		return 0;
+	}
+
+	if (cmd == TUNGETFEATURES) {
+		/* Currently this just means: "what IFF flags are valid?".
+		 * This is needed because we never checked for invalid flags on
+		 * TUNSETIFF.  This was introduced with IFF_GSO_HDR, so if a
+		 * kernel doesn't have this ioctl, it doesn't have GSO header
+		 * support. */
+		return put_user(IFF_ALL_FLAGS, (unsigned int __user*)argp);
 	}
 
 	if (!tun)
diff -r ba3c0eb8741a include/linux/if_tun.h
--- a/include/linux/if_tun.h	Wed Jan 16 17:35:25 2008 +1100
+++ b/include/linux/if_tun.h	Wed Jan 16 22:11:11 2008 +1100
@@ -79,13 +80,15 @@ struct tun_struct {
 #define TUNSETOWNER   _IOW('T', 204, int)
 #define TUNSETLINK    _IOW('T', 205, int)
 #define TUNSETGROUP   _IOW('T', 206, int)
+#define TUNGETFEATURES _IOR('T', 207, unsigned int)
 
 /* TUNSETIFF ifr flags */
 #define IFF_TUN		0x0001
 #define IFF_TAP		0x0002
 #define IFF_NO_PI	0x1000
 #define IFF_ONE_QUEUE	0x2000
 #define IFF_GSO_HDR	0x4000
+#define IFF_ALL_FLAGS (IFF_TUN|IFF_TAP|IFF_NO_PI|IFF_ONE_QUEUE|IFF_GSO_HDR)
 
 struct tun_pi {
 	unsigned short flags;

^ permalink raw reply

* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: David Miller @ 2008-01-16 12:25 UTC (permalink / raw)
  To: slavon; +Cc: elendil, netdev, linux-kernel
In-Reply-To: <478DC824.4030600@bigtelecom.ru>

From: Badalian Vyacheslav <slavon@bigtelecom.ru>
Date: Wed, 16 Jan 2008 12:02:28 +0300

> applied to 2.6.24-rc7-git2
> Have messages
> Also have regression after apply patch.
> System may do above 800mbs traffic before patch. After its "exit polling 
> mode?" (4 CPU, 1 cpu get 100% si (process ksoftirqd/0), 3 CPU is IDLE)
> After patch system was go to "exit polling mode" at above 600mbs.

What do you mean by 'system was go to "exit polling mode"'?

Please be more clear about your situation, in particular
provide every detail about what happens so that we can
properly debug this.

THanks.

^ permalink raw reply

* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: David Miller @ 2008-01-16 12:28 UTC (permalink / raw)
  To: slavon; +Cc: elendil, netdev, linux-kernel
In-Reply-To: <478DC824.4030600@bigtelecom.ru>

From: Badalian Vyacheslav <slavon@bigtelecom.ru>
Date: Wed, 16 Jan 2008 12:02:28 +0300

> Also have regression after apply patch.

BTW, if you are using the e1000e driver then this initial
patch will not work.

My more recent patch posting for this problem, will.

I include it again below for you:

[NET]: Fix TX timeout regression in Intel drivers.

This fixes a regression added by changeset
53e52c729cc169db82a6105fac7a166e10c2ec36 ("[NET]: Make ->poll()
breakout consistent in Intel ethernet drivers.")

As pointed out by Jesse Brandeburg, for three of the drivers edited
above there is breakout logic in the *_clean_tx_irq() code to prevent
running TX reclaim forever.  If this occurs, we have to elide NAPI
poll completion or else those TX events will never be serviced.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 13d57b0..0c9a6f7 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3919,7 +3919,7 @@ e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -3929,14 +3929,17 @@ e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring[0] is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter,
-				   &adapter->tx_ring[0]);
+		tx_cleaned = e1000_clean_tx_irq(adapter,
+						&adapter->tx_ring[0]);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &adapter->rx_ring[0],
 	                  &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (likely(adapter->itr_setting & 3))
diff --git a/drivers/net/e1000e/netdev.c b/drivers/net/e1000e/netdev.c
index 4a6fc74..2ab3bfb 100644
--- a/drivers/net/e1000e/netdev.c
+++ b/drivers/net/e1000e/netdev.c
@@ -1384,7 +1384,7 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 {
 	struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter, napi);
 	struct net_device *poll_dev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* Must NOT use netdev_priv macro here. */
 	adapter = poll_dev->priv;
@@ -1394,12 +1394,15 @@ static int e1000_clean(struct napi_struct *napi, int budget)
 	 * simultaneously.  A failure obtaining the lock means
 	 * tx_ring is currently being cleaned anyway. */
 	if (spin_trylock(&adapter->tx_queue_lock)) {
-		e1000_clean_tx_irq(adapter);
+		tx_cleaned = e1000_clean_tx_irq(adapter);
 		spin_unlock(&adapter->tx_queue_lock);
 	}
 
 	adapter->clean_rx(adapter, &work_done, budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		if (adapter->itr_setting & 3)
diff --git a/drivers/net/ixgbe/ixgbe_main.c b/drivers/net/ixgbe/ixgbe_main.c
index a564916..de3f45e 100644
--- a/drivers/net/ixgbe/ixgbe_main.c
+++ b/drivers/net/ixgbe/ixgbe_main.c
@@ -1468,13 +1468,16 @@ static int ixgbe_clean(struct napi_struct *napi, int budget)
 	struct ixgbe_adapter *adapter = container_of(napi,
 					struct ixgbe_adapter, napi);
 	struct net_device *netdev = adapter->netdev;
-	int work_done = 0;
+	int tx_cleaned = 0, work_done = 0;
 
 	/* In non-MSIX case, there is no multi-Tx/Rx queue */
-	ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
+	tx_cleaned = ixgbe_clean_tx_irq(adapter, adapter->tx_ring);
 	ixgbe_clean_rx_irq(adapter, &adapter->rx_ring[0], &work_done,
 			   budget);
 
+	if (tx_cleaned)
+		work_done = budget;
+
 	/* If budget not fully consumed, exit the polling mode */
 	if (work_done < budget) {
 		netif_rx_complete(netdev, napi);

^ permalink raw reply related

* Re: memory leakage in bridge(kernel-2.6.23.14)
From: David Miller @ 2008-01-16 12:31 UTC (permalink / raw)
  To: wyb; +Cc: linux-net, netdev
In-Reply-To: <200801161006.AZZ53966@topsec.com.cn>

From: <wyb@topsec.com.cn>
Date: Wed, 16 Jan 2008 18:04:53 +0800

> In SMP, if a bridge fdb is being created when another CPU at the same time
> delete the bridge, this newly created fdb may incur a leakage:

netdev@vger.kernel.org (CC:'d) is the proper place to report
things like this.

'linux-net' is only for general user questions about the networking,
not for bug reports, patch postings, or developer discussion.  The
'netdev' list is for that.

Thank you.

> 
> CPU0:
> 
> static void del_nbp(struct net_bridge_port *p)
> {
> 	/*
> 	 * CPU1 enter br_fdb_update(), bridge port is still valid.
> 	 */
> 	......
> 	spin_lock_bh(&br->lock);
> 	br_stp_disable_port(p);
> 	spin_unlock_bh(&br->lock);
> 
> 	br_ifinfo_notify(RTM_DELLINK, p);
> 
> 	br_fdb_delete_by_port(br, p, 1);
> 
> 	/*
> 	 * CPU1 call fdb_create() for the being deleted bridge,
> 	 * a fdb would be add to bridge's fdb hash table, and will never
> 	 * be freed. because when deleting a bridge, linux flush fdb for
> each
> 	 * bridge port, but this newly created fdb belong to no bridge port
> 	 */
> 	......
> }
> 
> To fix this, fdb_create() should be changed to:
> {
> 	struct net_bridge_fdb_entry *fdb;
> 
> 	/*
> 	 * if the bridge port is deleted, then return.
> 	 */
> 	if (!(source->state == BR_STATE_LEARNING ||
> 	      source->state == BR_STATE_FORWARDING))
> 		return;
> 
> 	fdb = kmem_cache_alloc(br_fdb_cache, GFP_ATOMIC);
> 	if (fdb) {
> 		memcpy(fdb->addr.addr, addr, ETH_ALEN);
> 		atomic_set(&fdb->use_count, 1);
> 		hlist_add_head_rcu(&fdb->hlist, head);
> 
> 		fdb->dst = source;
> 		fdb->is_local = is_local;
> 		fdb->is_static = is_local;
> 		fdb->ageing_timer = jiffies;
> 	}
> 	return fdb;
> }
> 
> 
> 
> -
> To unsubscribe from this list: send the line "unsubscribe linux-net" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH 4/4] bonding: Fix some RTNL taking
From: Jarek Poplawski @ 2008-01-16 12:44 UTC (permalink / raw)
  To: Makito SHIOKAWA; +Cc: netdev
In-Reply-To: <20080115063650.236883000@miraclelinux.com>

On 15-01-2008 07:36, Makito SHIOKAWA wrote:
> Fix some RTNL lock taking:
> 
> * RTNL (mutex; may sleep) must not be taken under read_lock (spinlock; must be
> atomic). However, RTNL is taken under read_lock in bond_loadbalance_arp_mon()
> and bond_activebackup_arp_mon(). So change code to take RTNL outside of read_lock.
> 
> * rtnl_unlock() calls netdev_run_todo() which takes net_todo_run_mutex, and
> rtnl_unlock() is called under read_lock in bond_mii_monitor(). So for the same
> reason as above, change code to call rtnl_unlock() outside of read_lock.
> 
> Signed-off-by: Makito SHIOKAWA <mshiokawa@miraclelinux.com>
> ---
>  drivers/net/bonding/bond_main.c |   24 ++++++++++--------------
>  1 file changed, 10 insertions(+), 14 deletions(-)
> 
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -2372,6 +2372,7 @@ void bond_mii_monitor(struct work_struct
>  	struct bonding *bond = container_of(work, struct bonding,
>  					    mii_work.work);
>  	unsigned long delay;
> +	int need_unlock = 0;
>  
>  	read_lock(&bond->lock);
>  	if (bond->kill_timers) {
> @@ -2383,13 +2384,16 @@ void bond_mii_monitor(struct work_struct
>  		rtnl_lock();
>  		read_lock(&bond->lock);
>  		__bond_mii_monitor(bond, 1);
> -		rtnl_unlock();
> +		need_unlock = 1;

Maybe I'm wrong, but since this read_lock() is given and taken anyway,
it seems this looks a bit better to me (why hold this rtnl longer
than needed?):
		read_unlock(&bond->lock);
		rtnl_unlock();
		read_lock(&bond->lock);

On the other hand, probably 'if (bond->kill_timers)' could be repeated
after this read_lock() retaking.

>  	}
>  
>  	delay = ((bond->params.miimon * HZ) / 1000) ? : 1;
> -	read_unlock(&bond->lock);
>  	if (bond->params.miimon)
>  		queue_delayed_work(bond->wq, &bond->mii_work, delay);

If this if () is really necessary here, then this should be better
before "delay = ..." with a block.

> +	read_unlock(&bond->lock);
> +	/* rtnl_unlock() may sleep, so call it after read_unlock() */
> +	if (need_unlock)
> +		rtnl_unlock();
>  }

Regards,
Jarek P.

^ permalink raw reply

* Re: [PATCH 2.6.23+] ingress classify to [nf]mark
From: jamal @ 2008-01-16 12:45 UTC (permalink / raw)
  To: mahatma; +Cc: netdev
In-Reply-To: <478BE021.1070408@bspu.unibel.by>

On Mon, 2008-14-01 at 20:20 -0200, Dzianis Kahanovich wrote:
> jamal wrote:
[..] 

> > Did that make sense?
>
> After current "#endif" - may be.

I am afraid that would be counter to expected behavior. 
Default is meant to apply when no value has been defined. Mark of 0 for
example doesnt mean "default". Let me demonstrate with the ifdefs again
with some arbitrary example:

-----------------
#ifdef CONFIG_NET_CLS_ACT
..classify ...
    .. action 1 sets mark to 0x11111
    .. action 2 checks some state and conditionally let action 3 execute
    .. action 3 sets mark to 0

if OK is returned set tc_index based on classid

#else // no actions compiled
..classify
.... jamal suggests: set default mark and tc_index for ingress here
#endif

// mahatma wants to set default for mark and tcindex here 
// so it works for both actions and none-action code
------------------------

Lets look at the case of actions compiled in:
I have defined my policies (in user space) so that the mark can be set
to either 0 or 0x1111 depending on some runtime state. 
Your default (kernel) code is now going to overide my policy - which is
bad. Even in the case of OK being returned, it is wrong to set tc_index;
unfortunately, we dont have an action that can set tc_index today; if we
did, we would need to remove that setting.

You other intent was to set the value of mark based on the value of
classid. You _can do that today already_ with no changes via a policy in
user space. You suggested to do an ifdef so you wont have to type in the
line which says how to mark, and i said that was a bad idea (we need
less ifdefs not more). 

For the case of no actions compiled in:
nothing can write into the values of either tcindex or mark after
classification (on ingress), so it is ok to override. If you did this
for egress as well, that would be wrong because it is expected that some
qdiscs may set or utilize these metadatum.

I am not sure if it made more sense this time?

> What "result" are with:
> 1) no filters?
> 2) 1 filter only, with "action continue"?

Please refer to above verbosity and see if it all makes better sense.

cheers,
jamal


^ permalink raw reply

* Re: [PATCH 3/4] bonding: Fix work rearming
From: Jarek Poplawski @ 2008-01-16 13:36 UTC (permalink / raw)
  To: Makito SHIOKAWA; +Cc: netdev
In-Reply-To: <478D77D7.60703@miraclelinux.com>

On Wed, Jan 16, 2008 at 12:19:51PM +0900, Makito SHIOKAWA wrote:
> This patch is supposing a case that bond_mii_monitor() is invoked in 
> bond_open(), and after that, 0 is set to miimon via sysfs (see same place 
> on other monitors).
> Though message in bonding_store_miimon() says miimon value 1-INT_MAX 
> rejected, but it looks like 0 can be accepted and monitor must be stopped 
> in that case.

But, since during this change from sysfs cancel_delayed_work_sync()
could be probably used, and it's rather efficient with killing
rearming works, it seems this check could be unnecessary yet.

Jarek P.

^ permalink raw reply

* Compiling as a module using the 2.6.25 git tree
From: Mark Ryden @ 2008-01-16 13:32 UTC (permalink / raw)
  To: netdev

Hello,

A question about compiling as module using the 2.6.25 git tree:
I had git cloned the 2.6.25 DaveM tree.
I ran "make menuconfig".
In many cases I see in the help:
"To compile this code as a module, choose M here: the
 module will be called ..."
For example, in "Packet Generator" or in 802.1d Etherent Bridging
(CONFIG_BRIDGE).

However, I cannot choose 'M' here ; the toggle is only between
building into the kernel ('*')
or not.

BRIDGE depends (according to the help) only on CONFIG_NET and I have
CONFIG_NET selected.
more .config | grep  CONFIG_INET
CONFIG_INET=y

Also llc is configured: (though in fact bridge selects llc).
more .config | grep LLC
CONFIG_LLC=y
CONFIG_LLC2=y

And in the Kconfig file we have:
config BRIDGE
        tristate "802.1d Ethernet Bridging"


When I try "make menuconfig" on the kernel of a distro (like FC8) I
**can** choose
bridge as a module. (I see it as 'M').

Any ideas why? Is is something due to using the git tree?

Mark

^ permalink raw reply

* Re: [RFC][PATCH] Fixing SA/SP dumps on netlink/af_key
From: jamal @ 2008-01-16 13:52 UTC (permalink / raw)
  To: Timo Teräs; +Cc: netdev
In-Reply-To: <478A038B.4090900@iki.fi>


Timo,

Thanks for the effort you are putting on this.
I would just speak to the concepts instead of the code - hopefully that
would simulate a discussion (and shorten my email)

On Sun, 2008-13-01 at 14:26 +0200, Timo Teräs wrote:
> Hi all,
> 
> The problem with IPsec SP/SA dumping is that they don't work well with large SPD/SADB:s. 
> In netlink the dumping code is O(n^2) and can miss some entries/dump duplicates if the DB 
> is changed between the recv() calls to read the dump entries. This is due to the entry 
> index counting done in xfrm_user code. With af_key the dump entries are just dropped 
> when the socket receive queue is filled.

nod to the above

> I tried to address all these problems, and added the xfrm_policy and xfrm_state structure 
> to a new linked list that is used solely for dumping. This way the dumps can be done in O(n) 
> and have an iterator point to them. I also modified to af_key to stop dumping when receive 
> queue is filling up and continue it from a callback at recvfrom() when there's more room.
> 
> This patch is against 2.6.23 vanilla. I wanted to get some feedback before I make it against 
> the git tree.
> 
> But I'd like to hear about:
> - if the approach is ok (adding the entries in one more list)?
> - if the new/modified variables, structures and function names are ok?
> - if I should port these against net-2.6 or net-2.6.25 git tree?
> - if I should split this to more than one commit? (maybe xfrm core, xfrm user and af_key parts?)
> 

To answer your last 2 questions:
There are two issues that are inter-mingled in there. The most important
is pf_key not being robust on dump. The other being the accurracy of
netlink dumping - this has much broader implications. I think you need
to separate the patches into those two at least and prioritize on the
pf_key as a patch that you push in. I would also try to make the pf_key
dumping approach to mirror what netlink does today - in the worst case
it would be as innacurate as netlink; which is not bad. I think youll
find no resistance getting such a patch in.

To answer your first 2 questions:

My main dilema with your approach is the policy of maintaining such a
list in the kernel and memory consumption needed vs the innacuracy of
netlink dumps. I have run a system with >400K SPD entries and >100K SAD
entries on a running system and i needed a few Gig of RAM to just make
sure my other apps dont get hit by oom at some point or other.  With
your approach of maintaining extra SA/P D, i would need double the RAM
amount. RAM is relatively cheap these days, but latency involved is not
(and one could argue that CPU cycles are much much cheaper).

Netlink dumping gives up accuracy[1], uses 50% of the memory you are
proposing but abuses more cpu cycles. User space could maintain the list
you are proposing instead - and compensate for the innaccuracies by
watching events[2]
Of course that shifts the accuracy to events which could be lost on
their way to user space. This issue is alleviated a little more with
your approach of keeping the list in the kernel and adding new updates
to the tail of the dump list (which, btw, your patch didnt seem to do);
however, even if you solve that: 
** you are still will be faced with challenges of not being atomic on
updates; example an entry already on your dump list could be deleted
before being read by user space. I cant see you solving that without
abusing a _lot_ more cpu (trying to find the entry on the dump list that
may have gotten updated or deleted). Theres also the issue of how long
to keep the dump list before aging it out and how to deal with multiple
users.

Ok, lets say you throw in the computation in there to walk the dump list
and find the proper entry and find out it only consumes 62% cpu of what
current netlink dump approach does, you are still using twice the RAM
needs. And hence it becomes a policy choice and what would be needed is 
some knob for me to select "i dont care about abusing more memory; i
want more accuracy dammit". I can tell you based on my current
experience i would rather use less RAM - but thats a matter of choice at
the moment because i havent come across scenarios of the conflicts where
user space wasnt able to resolve. 

I apologize for the long email, it would have been longer if i commented
on the code.

I hope you find my comments useful and dont see them as rocks being
thrown at you - rather look at them as worth at least 2-cents Canadian
and will stimulate other people speak out.

cheers,
jamal

[1] In cases where a dump happens to not complete while SAD/SPD are
getting concurently updated, it is feasible that a dump list will have
innacurate details.

[2] Racoon for example doesnt do this, hence it needs to dump from the
kernel when purging SAs.  


^ permalink raw reply

* Re: [RFC][PATCH] Fixing SA/SP dumps on netlink/af_key
From: Timo Teräs @ 2008-01-16 14:28 UTC (permalink / raw)
  To: hadi; +Cc: netdev
In-Reply-To: <1200491531.4457.91.camel@localhost>

jamal wrote:
> On Sun, 2008-13-01 at 14:26 +0200, Timo Teräs wrote:
>> I tried to address all these problems, and added the xfrm_policy
>> and xfrm_state structure to a new linked list that is used solely
>> for dumping. This way the dumps can be done in O(n) and have an
>> iterator point to them. I also modified to af_key to stop dumping
>> when receive queue is filling up and continue it from a callback at
>> recvfrom() when there's more room.
>>
>> But I'd like to hear about:
>> - if the approach is ok (adding the entries in one more list)?
>> - if the new/modified variables, structures and function names are ok?
>> - if I should port these against net-2.6 or net-2.6.25 git tree?
>> - if I should split this to more than one commit? (maybe xfrm core,
>> xfrm user and af_key parts?)
> 
> To answer your last 2 questions:
> There are two issues that are inter-mingled in there. The most
> important is pf_key not being robust on dump. The other being the
> accurracy of netlink dumping - this has much broader implications. I
> think you need to separate the patches into those two at least and
> prioritize on the pf_key as a patch that you push in. I would also
> try to make the pf_key dumping approach to mirror what netlink does
> today - in the worst case it would be as innacurate as netlink; which
> is not bad. I think youll find no resistance getting such a patch in.

Creating a separate af_key patch would not be a big problem. I was
just hoping avoiding it as the xfrm_state / xfrm_policy changes
modify the API and requires changing af_key also.

> To answer your first 2 questions:
> 
> My main dilema with your approach is the policy of maintaining such a
>  list in the kernel and memory consumption needed vs the innacuracy 
> of netlink dumps.With your approach of maintaining extra SA/P D, i
> would need double the RAM amount.

No. I'm not creating second copies of the SADB/SPD entries. The entries
are just added to one more list. Memory requirements go up on per entry
(figures based on 2.6.22.9 kernel):
sizeof(struct xfrm_policy) 636 -> 644 bytes
sizeof(struct xfrm_state)  452 -> 460 bytes

For 400K entries it would take about 3 megs of more memory. Where the
total database size is around 240-250 megs.

> Netlink dumping gives up accuracy[1], uses 50% of the memory you are 
> proposing but abuses more cpu cycles. User space could maintain the
> list you are proposing instead - and compensate for the innaccuracies
> by watching events[2] Of course that shifts the accuracy to events
> which could be lost on their way to user space. This issue is
> alleviated a little more with your approach of keeping the list in
> the kernel and adding new updates to the tail of the dump list
> (which, btw, your patch didnt seem to do); however, even if you solve
> that: ** you are still will be faced with challenges of not being
> atomic on updates; example an entry already on your dump list could
> be deleted before being read by user space. I cant see you solving
> that without abusing a _lot_ more cpu (trying to find the entry on
> the dump list that may have gotten updated or deleted). Theres also
> the issue of how long to keep the dump list before aging it out and
> how to deal with multiple users.

If more entries are added, you can get notifications of them.
 
Cheers,
  Timo


^ permalink raw reply

* Re: [PATCH] net: NEWEMAC: Remove "rgmii-interface" from rgmii matching table
From: Josh Boyer @ 2008-01-16 15:01 UTC (permalink / raw)
  To: benh; +Cc: Stefan Roese, linuxppc-dev, netdev
In-Reply-To: <1200477239.6755.21.camel@pasglop>

On Wed, 16 Jan 2008 20:53:59 +1100
Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:

> 
> On Wed, 2008-01-16 at 10:37 +0100, Stefan Roese wrote:
> > With the removal the the "rgmii-interface" device_type property from the
> > dts files, the newemac driver needs an update to only rely on compatible
> > property.
> > 
> > Signed-off-by: Stefan Roese <sr@denx.de>
> 
> I need to test if it works on CAB, can't change the DT on those. I'll
> let you know tomorrow.

This should be fine on CAB.  The rgmii node has:

compatible = "ibm,rgmii-axon", "ibm,rgmii"

so the match should still catch on the latter.

josh

^ permalink raw reply

* [patch 0/2][NETNS][DST] pass dst_ops to gc functions and add a netns pointer in it
From: Daniel Lezcano @ 2008-01-16 14:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, den, benjamin.thery

The two following patches do trivial changes to facilitate the
implementation of the network namespace in some protocols.

The first one pass the dst_ops as parameter to the gc functions.
The second patch just adds a netns pointer field into the dst_ops
structure.

-- 

^ permalink raw reply

* [patch 1/2][NETNS][DST] dst: pass the dst_ops as parameter to the gc functions
From: Daniel Lezcano @ 2008-01-16 14:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, den, benjamin.thery
In-Reply-To: <20080116145416.844293640@localhost.localdomain>

[-- Attachment #1: pass-dst-gc-ops-parameter.patch --]
[-- Type: text/plain, Size: 5523 bytes --]

The garbage collection function receive the dst_ops structure as
parameter. This is useful for the next incoming patchset because
it will need the dst_ops (there will be several instances) and 
the network namespace pointer (contained in the dst_ops).

The protocols which do not take care of the namespaces will not
be impacted by this change (expect for the function signature),
they do just ignore the parameter.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 include/net/dst.h       |    2 +-
 net/core/dst.c          |    2 +-
 net/decnet/dn_route.c   |    4 ++--
 net/ipv4/route.c        |    6 +++---
 net/ipv4/xfrm4_policy.c |    2 +-
 net/ipv6/route.c        |    4 ++--
 net/ipv6/xfrm6_policy.c |    2 +-
 7 files changed, 11 insertions(+), 11 deletions(-)

Index: net-2.6.25-misc/include/net/dst.h
===================================================================
--- net-2.6.25-misc.orig/include/net/dst.h
+++ net-2.6.25-misc/include/net/dst.h
@@ -89,7 +89,7 @@ struct dst_ops
 	__be16			protocol;
 	unsigned		gc_thresh;
 
-	int			(*gc)(void);
+	int			(*gc)(struct dst_ops *ops);
 	struct dst_entry *	(*check)(struct dst_entry *, __u32 cookie);
 	void			(*destroy)(struct dst_entry *);
 	void			(*ifdown)(struct dst_entry *,
Index: net-2.6.25-misc/net/core/dst.c
===================================================================
--- net-2.6.25-misc.orig/net/core/dst.c
+++ net-2.6.25-misc/net/core/dst.c
@@ -165,7 +165,7 @@ void * dst_alloc(struct dst_ops * ops)
 	struct dst_entry * dst;
 
 	if (ops->gc && atomic_read(&ops->entries) > ops->gc_thresh) {
-		if (ops->gc())
+		if (ops->gc(ops))
 			return NULL;
 	}
 	dst = kmem_cache_zalloc(ops->kmem_cachep, GFP_ATOMIC);
Index: net-2.6.25-misc/net/decnet/dn_route.c
===================================================================
--- net-2.6.25-misc.orig/net/decnet/dn_route.c
+++ net-2.6.25-misc/net/decnet/dn_route.c
@@ -107,7 +107,7 @@ static const int dn_rt_mtu_expires = 10 
 
 static unsigned long dn_rt_deadline;
 
-static int dn_dst_gc(void);
+static int dn_dst_gc(struct dst_ops *ops);
 static struct dst_entry *dn_dst_check(struct dst_entry *, __u32);
 static struct dst_entry *dn_dst_negative_advice(struct dst_entry *);
 static void dn_dst_link_failure(struct sk_buff *);
@@ -185,7 +185,7 @@ static void dn_dst_check_expire(unsigned
 	mod_timer(&dn_route_timer, now + decnet_dst_gc_interval * HZ);
 }
 
-static int dn_dst_gc(void)
+static int dn_dst_gc(struct dst_ops *ops)
 {
 	struct dn_route *rt, **rtp;
 	int i;
Index: net-2.6.25-misc/net/ipv4/route.c
===================================================================
--- net-2.6.25-misc.orig/net/ipv4/route.c
+++ net-2.6.25-misc/net/ipv4/route.c
@@ -154,7 +154,7 @@ static void		 ipv4_dst_ifdown(struct dst
 static struct dst_entry *ipv4_negative_advice(struct dst_entry *dst);
 static void		 ipv4_link_failure(struct sk_buff *skb);
 static void		 ip_rt_update_pmtu(struct dst_entry *dst, u32 mtu);
-static int rt_garbage_collect(void);
+static int rt_garbage_collect(struct dst_ops *ops);
 
 
 static struct dst_ops ipv4_dst_ops = {
@@ -820,7 +820,7 @@ static void rt_secret_rebuild(unsigned l
    and when load increases it reduces to limit cache size.
  */
 
-static int rt_garbage_collect(void)
+static int rt_garbage_collect(struct dst_ops *ops)
 {
 	static unsigned long expire = RT_GC_TIMEOUT;
 	static unsigned long last_gc;
@@ -1035,7 +1035,7 @@ restart:
 				int saved_int = ip_rt_gc_min_interval;
 				ip_rt_gc_elasticity	= 1;
 				ip_rt_gc_min_interval	= 0;
-				rt_garbage_collect();
+				rt_garbage_collect(&ipv4_dst_ops);
 				ip_rt_gc_min_interval	= saved_int;
 				ip_rt_gc_elasticity	= saved_elasticity;
 				goto restart;
Index: net-2.6.25-misc/net/ipv4/xfrm4_policy.c
===================================================================
--- net-2.6.25-misc.orig/net/ipv4/xfrm4_policy.c
+++ net-2.6.25-misc/net/ipv4/xfrm4_policy.c
@@ -185,7 +185,7 @@ _decode_session4(struct sk_buff *skb, st
 	fl->fl4_tos = iph->tos;
 }
 
-static inline int xfrm4_garbage_collect(void)
+static inline int xfrm4_garbage_collect(struct dst_ops *ops)
 {
 	xfrm4_policy_afinfo.garbage_collect();
 	return (atomic_read(&xfrm4_dst_ops.entries) > xfrm4_dst_ops.gc_thresh*2);
Index: net-2.6.25-misc/net/ipv6/route.c
===================================================================
--- net-2.6.25-misc.orig/net/ipv6/route.c
+++ net-2.6.25-misc/net/ipv6/route.c
@@ -79,7 +79,7 @@ static struct dst_entry *ip6_negative_ad
 static void		ip6_dst_destroy(struct dst_entry *);
 static void		ip6_dst_ifdown(struct dst_entry *,
 				       struct net_device *dev, int how);
-static int		 ip6_dst_gc(void);
+static int		 ip6_dst_gc(struct dst_ops *ops);
 
 static int		ip6_pkt_discard(struct sk_buff *skb);
 static int		ip6_pkt_discard_out(struct sk_buff *skb);
@@ -978,7 +978,7 @@ int ndisc_dst_gc(int *more)
 	return freed;
 }
 
-static int ip6_dst_gc(void)
+static int ip6_dst_gc(struct dst_ops *ops)
 {
 	static unsigned expire = 30*HZ;
 	static unsigned long last_gc;
Index: net-2.6.25-misc/net/ipv6/xfrm6_policy.c
===================================================================
--- net-2.6.25-misc.orig/net/ipv6/xfrm6_policy.c
+++ net-2.6.25-misc/net/ipv6/xfrm6_policy.c
@@ -212,7 +212,7 @@ _decode_session6(struct sk_buff *skb, st
 	}
 }
 
-static inline int xfrm6_garbage_collect(void)
+static inline int xfrm6_garbage_collect(struct dst_ops *ops)
 {
 	xfrm6_policy_afinfo.garbage_collect();
 	return (atomic_read(&xfrm6_dst_ops.entries) > xfrm6_dst_ops.gc_thresh*2);

-- 

^ permalink raw reply

* [patch 2/2][NETNS][DST] add the network namespace pointer in dst_ops
From: Daniel Lezcano @ 2008-01-16 14:54 UTC (permalink / raw)
  To: davem; +Cc: netdev, den, benjamin.thery
In-Reply-To: <20080116145416.844293640@localhost.localdomain>

[-- Attachment #1: add-net-ns-to-dst-ops.patch --]
[-- Type: text/plain, Size: 788 bytes --]

The network namespace pointer can be stored into the dst_ops structure.
This is usefull when there are multiple instances of the dst_ops for a
protocol. When there are no several instances, this field will be never
used in the protocol. So there is no impact for the protocols which do
implement the network namespaces.

Signed-off-by: Daniel Lezcano <dlezcano@fr.ibm.com>
---
 include/net/dst.h |    1 +
 1 file changed, 1 insertion(+)

Index: net-2.6.25-misc/include/net/dst.h
===================================================================
--- net-2.6.25-misc.orig/include/net/dst.h
+++ net-2.6.25-misc/include/net/dst.h
@@ -102,6 +102,7 @@ struct dst_ops
 
 	atomic_t		entries;
 	struct kmem_cache 		*kmem_cachep;
+	struct net              *dst_net;
 };
 
 #ifdef __KERNEL__

-- 

^ permalink raw reply

* [patch] add net_device_stats support to ethtool
From: Dan Nicolaescu @ 2008-01-16 15:29 UTC (permalink / raw)
  To: netdev


Hi,

I have posted this patch in the past with absolutely no reply.
I would appreciate some sort of feedback of the form 
interested/not interested.  Should I just drop it?

"ethtool -S" only supports devices that have custom code written to
print the stats. 

A lot of drivers use "struct net_device_stats", so adding code to
ethtool would make it very easy for such drivers to add support for
"ethtool -S". The drivers would just need to add this:

	.get_strings	   = ethtool_op_net_device_stats_get_strings,
	.get_stats_count   = ethtool_op_net_device_stats_get_stats_count,
	.get_ethtool_stats = ethtool_op_net_device_get_ethtool_stats,

to their struct ethtool_ops. I found this very useful when debugging a
new driver.

(The function names are not the best, suggestions for better names are
welcome).

The code added is very small, and it gives easy access to stats that the
drivers are computing anyway.

This should save code in case the drivers the use net_device_stats
decide they want ethtool stats support.

The patch only adds, it does not modify any existing line of code.

Thanks
        --Dan

--- ethtool.c~	2004-12-24 13:35:50.000000000 -0800
+++ ethtool.c	2006-12-01 08:55:26.000000000 -0800
@@ -809,6 +809,64 @@
 	return -EOPNOTSUPP;
 }
 
+
+#define NET_DEVICE_NUM_STATS (sizeof(struct net_device_stats) / sizeof(unsigned long))
+
+static struct {
+	const char string[ETH_GSTRING_LEN];
+} ethtool_net_device_stats_keys[NET_DEVICE_NUM_STATS] = {
+	{ "rx_packets"},	 
+	{ "tx_packets"},	 
+	{ "rx_bytes"},	 
+	{ "tx_bytes"},	 
+	{ "rx_errors"},	 
+	{ "tx_errors"},	 
+	{ "rx_dropped"},	 
+	{ "tx_dropped"},	 
+	{ "multicast"},	 
+	{ "collisions"},
+	{ "rx_length_errors"},
+	{ "rx_over_errors"},	 
+	{ "rx_crc_errors"},	 
+	{ "rx_frame_errors"}, 
+	{ "rx_fifo_errors"},	 
+	{ "rx_missed_errors"},
+	{ "tx_aborted_errors"},
+	{ "tx_carrier_errors"},
+	{ "tx_fifo_errors"},
+	{ "tx_heartbeat_errors"},
+	{ "tx_window_errors"},
+	{ "rx_compressed"},
+	{ "tx_compressed"}
+};
+
+int ethtool_op_net_device_stats_get_stats_count(struct net_device *dev)
+{
+	return NET_DEVICE_NUM_STATS;
+}
+
+void ethtool_op_net_device_stats_get_strings(struct net_device *dev, u32 stringset, u8 *buf)
+{
+	switch (stringset) {
+	case ETH_SS_STATS:
+		memcpy(buf, &ethtool_net_device_stats_keys, sizeof(ethtool_net_device_stats_keys));
+		break;
+	default:
+		WARN_ON(1);	/* we need a WARN() */
+		break;
+	}
+}
+
+void ethtool_op_net_device_get_ethtool_stats(struct net_device *dev,
+					      struct ethtool_stats *estats, u64 *tmp_stats)
+{
+	u32 i;
+	u64 *dest = tmp_stats;
+	unsigned long *src = (unsigned long*)dev->get_stats(dev);
+	for (i = 0; i < estats->n_stats; i++)
+	  *dest++ = *src++;
+}
+
 EXPORT_SYMBOL(dev_ethtool);
 EXPORT_SYMBOL(ethtool_op_get_link);
 EXPORT_SYMBOL(ethtool_op_get_sg);
@@ -817,3 +875,6 @@
 EXPORT_SYMBOL(ethtool_op_set_sg);
 EXPORT_SYMBOL(ethtool_op_set_tso);
 EXPORT_SYMBOL(ethtool_op_set_tx_csum);
+EXPORT_SYMBOL(ethtool_op_net_device_stats_get_stats_count);
+EXPORT_SYMBOL(ethtool_op_net_device_stats_get_strings);
+EXPORT_SYMBOL(ethtool_op_net_device_get_ethtool_stats);
--- ethtool.h~	2006-09-05 12:29:45.000000000 -0700
+++ ethtool.h	2006-12-01 08:51:46.000000000 -0800
@@ -260,6 +260,12 @@
 int ethtool_op_set_sg(struct net_device *dev, u32 data);
 u32 ethtool_op_get_tso(struct net_device *dev);
 int ethtool_op_set_tso(struct net_device *dev, u32 data);
+int ethtool_op_net_device_stats_get_stats_count(struct net_device *dev);
+void ethtool_op_net_device_stats_get_strings(struct net_device *dev,
+					     u32 stringset, u8 *buf);
+void ethtool_op_net_device_get_ethtool_stats(struct net_device *dev,
+					     struct ethtool_stats *estats,
+					     u64 *tmp_stats);
 
 /**
  * &ethtool_ops - Alter and report network device settings

^ permalink raw reply

* Re: [PATCH] ICMP: ICMP_MIB_OUTMSGS increment duplicated
From: David Stevens @ 2008-01-16 16:23 UTC (permalink / raw)
  To: Herbert Xu; +Cc: davem, netdev, Wang Chen
In-Reply-To: <E1JF6lR-0007sF-00@gondolin.me.apana.org.au>

Herbert Xu <herbert@gondor.apana.org.au> wrote on 01/16/2008 03:49:01 AM:

> Actually having the icmp_out_count call in ip_push_pending_frames seems
> inconsistent.  Having it there means that we count raw socket ICMP 
packets
> too.  But we don't do that for any other protocol, e.g., raw UDP packets
> don't get counted.

Herbert,
        The patch was to support the ICMPMsgStats table. Since none of 
certain
types of output ICMP messages are generated by the kernel, but are 
required
by the RFC, counting raw sockets is intentional (and the only way those 
ICMP
types can be counted at all).
        Raw UDP packets would not be counted either before or after the 
patch,
but aren't part of the ICMPMsgStats table. Adding those might be 
worthwhile,
but it isn't quite the hole that the ICMP out stats were, since there is a
cooked interface for UDP output that counts the common use, at least.

Wang,
        I think your patch is correct; did you test the same case for 
IPv6?

 +-DLS


^ permalink raw reply

* A reliable way to improve your life!
From: santos @ 2008-01-16 16:47 UTC (permalink / raw)
  To: netdev

Say good bye to ED dysfunction! http://rhwe.imagineoh.com


^ permalink raw reply

* Re: [REGRESSION] 2.6.24-rc7: e1000: Detected Tx Unit Hang
From: Robert Olsson @ 2008-01-16 17:07 UTC (permalink / raw)
  To: David Miller; +Cc: elendil, jesse.brandeburg, slavon, netdev, linux-kernel
In-Reply-To: <20080116.022953.184520911.davem@davemloft.net>


David Miller writes:
 > > On Wednesday 16 January 2008, David Miller wrote:
 > > > Ok, here is the patch I'll propose to fix this.  The goal is to make
 > > > it as simple as possible without regressing the thing we were trying
 > > > to fix.
 > > 
 > > Looks good to me. Tested with -rc8.
 > 
 > Thanks for testing.

 Yes that code looks nice. I'm using the patch but I've noticed another 
 phenomena with the current e1000 driver. There is a race when taking a 
 device down at high traffic loads. I've tracked and instrumented and it 
 seems like occasionly irq_sem can get bump up so interrupts can't be 
 enabled again.


eth0 e1000_irq_enable sem = 1    <- High netload
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1
eth0 e1000_irq_enable sem = 1    <- ifconfig eth0 down
eth0 e1000_irq_disable sem = 2

**e1000_open                     <- ifconfig eth0 up
eth0 e1000_irq_disable sem = 3      Dead. irq's can't be enabled
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 2
e1000_irq_enable miss
eth0 e1000_irq_enable sem = 1
ADDRCONF(NETDEV_UP): eth0: link is not ready


Cheers
					--ro

static void
e1000_irq_disable(struct e1000_adapter *adapter)
{
        atomic_inc(&adapter->irq_sem);
        E1000_WRITE_REG(&adapter->hw, IMC, ~0);
        E1000_WRITE_FLUSH(&adapter->hw);
        synchronize_irq(adapter->pdev->irq);

        if(adapter->netdev->ifindex == 3)
        printk("%s e1000_irq_disable sem = %d\n",  adapter->netdev->name,
               atomic_read(&adapter->irq_sem));
}

static void
e1000_irq_enable(struct e1000_adapter *adapter)
{
        if (likely(atomic_dec_and_test(&adapter->irq_sem))) {
                E1000_WRITE_REG(&adapter->hw, IMS, IMS_ENABLE_MASK);
                E1000_WRITE_FLUSH(&adapter->hw);
                }
        else
                printk("e1000_irq_enable miss\n");

        if(adapter->netdev->ifindex == 3)
          printk("%s e1000_irq_enable sem = %d\n",  adapter->netdev->name,
                 atomic_read(&adapter->irq_sem));
}

^ permalink raw reply

* [IPV4] FIB_HASH : Avoid unecessary loop in fn_hash_dump_zone()
From: Eric Dumazet @ 2008-01-16 17:19 UTC (permalink / raw)
  To: David Miller; +Cc: netdev@vger.kernel.org, Robert Olsson

I noticed "ip route list" was slower than "cat /proc/net/route" on a machine with a full Internet
routing table (214392 entries : Special thanks to Robert ;) )

This is similar to problem reported in commit d8c9283089287341c85a0a69de32c2287a990e71

Fix is to avoid scanning the begining of fz_hash table, but directly seek to the right offset.

Before patch :

time ip route >/tmp/ROUTE

real    0m1.285s
user    0m0.712s
sys     0m0.436s

After patch

# time ip route >/tmp/ROUTE

real    0m0.835s
user    0m0.692s
sys     0m0.124s

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

 net/ipv4/fib_hash.c |   20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/net/ipv4/fib_hash.c b/net/ipv4/fib_hash.c
index 527a6e0..4156988 100644
--- a/net/ipv4/fib_hash.c
+++ b/net/ipv4/fib_hash.c
@@ -718,19 +718,18 @@ fn_hash_dump_zone(struct sk_buff *skb, struct netlink_callback *cb,
 {
 	int h, s_h;
 
+	if (fz->fz_hash == NULL)
+		return skb->len;
 	s_h = cb->args[3];
-	for (h=0; h < fz->fz_divisor; h++) {
-		if (h < s_h) continue;
-		if (h > s_h)
-			memset(&cb->args[4], 0,
-			       sizeof(cb->args) - 4*sizeof(cb->args[0]));
-		if (fz->fz_hash == NULL ||
-		    hlist_empty(&fz->fz_hash[h]))
+	for (h = s_h; h < fz->fz_divisor; h++) {
+		if (hlist_empty(&fz->fz_hash[h]))
 			continue;
-		if (fn_hash_dump_bucket(skb, cb, tb, fz, &fz->fz_hash[h])<0) {
+		if (fn_hash_dump_bucket(skb, cb, tb, fz, &fz->fz_hash[h]) < 0) {
 			cb->args[3] = h;
 			return -1;
 		}
+		memset(&cb->args[4], 0,
+		       sizeof(cb->args) - 4*sizeof(cb->args[0]));
 	}
 	cb->args[3] = h;
 	return skb->len;
@@ -746,14 +745,13 @@ static int fn_hash_dump(struct fib_table *tb, struct sk_buff *skb, struct netlin
 	read_lock(&fib_hash_lock);
 	for (fz = table->fn_zone_list, m=0; fz; fz = fz->fz_next, m++) {
 		if (m < s_m) continue;
-		if (m > s_m)
-			memset(&cb->args[3], 0,
-			       sizeof(cb->args) - 3*sizeof(cb->args[0]));
 		if (fn_hash_dump_zone(skb, cb, tb, fz) < 0) {
 			cb->args[2] = m;
 			read_unlock(&fib_hash_lock);
 			return -1;
 		}
+		memset(&cb->args[3], 0,
+		       sizeof(cb->args) - 3*sizeof(cb->args[0]));
 	}
 	read_unlock(&fib_hash_lock);
 	cb->args[2] = m;

^ permalink raw reply related

* Re: [patch] add net_device_stats support to ethtool
From: Ben Greear @ 2008-01-16 18:03 UTC (permalink / raw)
  To: Dan Nicolaescu; +Cc: netdev
In-Reply-To: <200801161529.m0GFTfYR005096@sallyv1.ics.uci.edu>

Dan Nicolaescu wrote:
> Hi,
>
> I have posted this patch in the past with absolutely no reply.
> I would appreciate some sort of feedback of the form 
> interested/not interested.  Should I just drop it?
>
>   
I like it, but why not offer this for all devices since they all have 
these stats.
Could add new handlers called something like .get_strings_generic, or
just add this to the higher-level ethtool handling before it looks for 
handlers.

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply

* Re: [patch] add net_device_stats support to ethtool
From: Dan Nicolaescu @ 2008-01-16 18:34 UTC (permalink / raw)
  To: Ben Greear; +Cc: netdev
In-Reply-To: <478E46F6.4070901@candelatech.com>

Ben Greear <greearb@candelatech.com> writes:

  > Dan Nicolaescu wrote:
  > > Hi,
  > >
  > > I have posted this patch in the past with absolutely no reply.
  > > I would appreciate some sort of feedback of the form interested/not
  > > interested.  Should I just drop it?
  > >
  > >   
  > I like it, but why not offer this for all devices since they all have
  > these stats.
  >
  > Could add new handlers called something like .get_strings_generic, or
  > just add this to the higher-level ethtool handling before it looks for
  > handlers.

If I get your point, then the difference would be that drivers would add
to the initialization of the ethtool structure something like:

         .get_strings_generic = 1;
          
instead of what I originally proposed:

	.get_strings	   = ethtool_op_net_device_stats_get_strings,
	.get_stats_count   = ethtool_op_net_device_stats_get_stats_count,
	.get_ethtool_stats = ethtool_op_net_device_get_ethtool_stats,

Sure that could work, but it would require a few lines of changes in
ethtool.

I can submit a patch that does things that way, if that is considered
better.  But I would like to hear that this code is wanted before
putting any effort in it. I has been ignored for so long... 

Thanks

        --dan

^ permalink raw reply

* Re: [patch] add net_device_stats support to ethtool
From: Ben Greear @ 2008-01-16 18:46 UTC (permalink / raw)
  To: Dan Nicolaescu; +Cc: netdev
In-Reply-To: <200801161834.m0GIYCIr027613@sallyv1.ics.uci.edu>

Dan Nicolaescu wrote:
> Ben Greear <greearb@candelatech.com> writes:
>
>   > Dan Nicolaescu wrote:
>   > > Hi,
>   > >
>   > > I have posted this patch in the past with absolutely no reply.
>   > > I would appreciate some sort of feedback of the form interested/not
>   > > interested.  Should I just drop it?
>   > >
>   > >   
>   > I like it, but why not offer this for all devices since they all have
>   > these stats.
>   >
>   > Could add new handlers called something like .get_strings_generic, or
>   > just add this to the higher-level ethtool handling before it looks for
>   > handlers.
>
> If I get your point, then the difference would be that drivers would add
> to the initialization of the ethtool structure something like:
>   
I meant something more like this (this will not apply..I hand-edited it 
to remove
some extraneous crap from my patch...and I'm sure it's white-space damaged).

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index c5e0593..095d1eb 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c

+/* Handle some generic ethtool commands here */
+static int ethtool_get_netdev_stats(struct net_device *dev, void 
*useraddr) {
+   
+    struct ethtool_ndstats* nds = (struct ethtool_ndstats*)(useraddr);
+     
+    struct net_device_stats *stats = dev->get_stats(dev);
+    if (stats) {
+        if (copy_to_user(nds->data, stats, sizeof(*stats))) {
+            return -EFAULT;
+        }
+    }
+    else {
+        return -EOPNOTSUPP;
+    }
+    return 0;
+}
+
+
 /* The main entry point in this file.  Called from net/core/dev.c */
 
 int dev_ethtool(struct ifreq *ifr)
@@ -796,9 +884,6 @@ int dev_ethtool(struct ifreq *ifr)
     if (!dev || !netif_device_present(dev))
         return -ENODEV;
 
-    if (!dev->ethtool_ops)
-        return -EOPNOTSUPP;
-
     if (copy_from_user(&ethcmd, useraddr, sizeof (ethcmd)))
         return -EFAULT;
 
@@ -823,12 +908,25 @@ int dev_ethtool(struct ifreq *ifr)
             return -EPERM;
     }
 
-    if (dev->ethtool_ops->begin)
+    if (dev->ethtool_ops && dev->ethtool_ops->begin)
         if ((rc = dev->ethtool_ops->begin(dev)) < 0)
             return rc;
 
     old_features = dev->features;
 
+    /* Handle some generic operations that do not require specific
+     * ethtool handlers.
+     */
+    switch (ethcmd) {
+    case ETHTOOL_GNDSTATS:
+        return ethtool_get_netdev_stats(dev, useraddr);
+    default:
+        break;
+    }
+
+    if (!dev->ethtool_ops)
+        return -EOPNOTSUPP;
+
     switch (ethcmd) {
     case ETHTOOL_GSET:
         rc = ethtool_get_settings(dev, useraddr);


-- 
Ben Greear <greearb@candelatech.com> 
Candela Technologies Inc  http://www.candelatech.com



^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox