Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Eric Dumazet @ 2010-05-05 18:53 UTC (permalink / raw)
  To: Bhaskar Dutta; +Cc: Stephen Hemminger, Ben Hutchings, netdev
In-Reply-To: <g2p571fb4001005051103w67e1b9ddn3e8f7feb84d0559@mail.gmail.com>

Le mercredi 05 mai 2010 à 23:33 +0530, Bhaskar Dutta a écrit :

> Hi,
> 
> TSO, GSO and SG are already turned off.
> rx/tx checksumming is on, but that shouldn't matter, right?
> 
> # ethtool -k eth0
> Offload parameters for eth0:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: off
> tcp segmentation offload: off
> udp fragmentation offload: off
> generic segmentation offload: off
> 
> The bad packets are very small in size, most have no data at all (<300 bytes).
> 
> After adding some logs to kernel 2.6.31-12, it seems that
> tcp_v4_md5_hash_skb (function that calculates the md5 hash) is
> (might?) getting corrupt.
> 
> The tcp4_pseudohdr (bp = &hp->md5_blk.ip4) structure's saddr, daddr
> and len fields get modified to different values towards the end of the
> tcp_v4_md5_hash_skb function whenever there is a checksum error.
> 
> The tcp4_pseudohdr (bp) is within the tcp_md5sig_pool (hp), which is
> filled up by tcp_get_md5sig_pool (which calls per_cpu_ptr).
> 
> Using a local copy of the tcp4_pseudohdr in the same function
> tcp_v4_md5_hash_skb (copied all fields from the original
> tcp4_pseudohdr within the tcp_md5sig_pool) and calculating the md5
> checksum with the local  tcp4_pseudohdr seems to solve the issue
> (don't see bad packets for a hours in load tests, and without the
> change I can see them instantaneously in the load tests).
> 
> I am still unable to figure out how this is happening. Please let me
> know if you have any pointers.

I am not familiar with this code, but I suspect same per_cpu data can be
used at both time by a sender (process context) and by a receiver
(softirq context).

To trigger this, you need at least two active md5 sockets.

tcp_get_md5sig_pool() should probably disable bh to make sure current
cpu wont be preempted by softirq processing


Something like :

diff --git a/include/net/tcp.h b/include/net/tcp.h
index fb5c66b..e232123 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -1221,12 +1221,15 @@ struct tcp_md5sig_pool          *tcp_get_md5sig_pool(void)
 	struct tcp_md5sig_pool *ret = __tcp_get_md5sig_pool(cpu);
 	if (!ret)
 		put_cpu();
+	else
+		local_bh_disable();
 	return ret;
 }
 
 static inline void             tcp_put_md5sig_pool(void)
 {
 	__tcp_put_md5sig_pool();
+	local_bh_enable();
 	put_cpu();
 }



^ permalink raw reply related

* [PATCH 4/4 v2] ks8851: read/write MAC address on companion eeprom through debugfs
From: Sebastien Jan @ 2010-05-05 18:45 UTC (permalink / raw)
  To: netdev; +Cc: linux-omap, Abraham Arce, Ben Dooks, Tristram.Ha, Sebastien Jan
In-Reply-To: <1273085155-1260-1-git-send-email-s-jan@ti.com>

A more elegant alternative to ethtool for updating the ks8851
MAC address stored on its companion eeprom.
Using this debugfs interface does not require any knowledge on the
ks8851 companion eeprom organization to update the MAC address.

Example to write 01:23:45:67:89:AB MAC address to the companion
eeprom (assuming debugfs is mounted in /sys/kernel/debug):
$ echo "01:23:45:67:89:AB" > /sys/kernel/debug/ks8851/mac_eeprom

Signed-off-by: Sebastien Jan <s-jan@ti.com>
---
 drivers/net/ks8851.c |  216 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 216 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ks8851.c b/drivers/net/ks8851.c
index 8e38c36..1b62ec5 100644
--- a/drivers/net/ks8851.c
+++ b/drivers/net/ks8851.c
@@ -22,6 +22,11 @@
 
 #include <linux/spi/spi.h>
 
+#ifdef CONFIG_DEBUG_FS
+#include <linux/debugfs.h>
+#include <linux/ctype.h>
+#endif
+
 #include "ks8851.h"
 
 /**
@@ -1550,6 +1555,214 @@ static int ks8851_read_selftest(struct ks8851_net *ks)
 	return 0;
 }
 
+#ifdef CONFIG_DEBUG_FS
+static struct dentry *ks8851_dir;
+
+static int ks8851_mac_eeprom_open(struct inode *inode, struct file *file)
+{
+	file->private_data = inode->i_private;
+	return 0;
+}
+
+static int ks8851_mac_eeprom_release(struct inode *inode, struct file *file)
+{
+	return 0;
+}
+
+static loff_t ks8851_mac_eeprom_seek(struct file *file, loff_t off, int whence)
+{
+	return 0;
+}
+
+/**
+ * ks8851_mac_eeprom_read - Read a MAC address in ks8851 companion EEPROM
+ *
+ * Warning: The READ feature is not supported on ks8851 revision 0.
+ */
+static ssize_t ks8851_mac_eeprom_read(struct file *filep, char __user *buff,
+						size_t count, loff_t *offp)
+{
+	ssize_t ret;
+	struct net_device *dev = filep->private_data;
+	char str[50];
+	unsigned int data;
+	unsigned char mac_addr[6];
+
+	if (*offp > 0) {
+		ret = 0;
+		goto ks8851_cnt_rd_bk;
+	}
+
+	data = ks8851_eeprom_read(dev, 1);
+	mac_addr[5] = data & 0xFF;
+	mac_addr[4] = (data >> 8) & 0xFF;
+	data = ks8851_eeprom_read(dev, 2);
+	mac_addr[3] = data & 0xFF;
+	mac_addr[2] = (data >> 8) & 0xFF;
+	data = ks8851_eeprom_read(dev, 3);
+	mac_addr[1] = data & 0xFF;
+	mac_addr[0] = (data >> 8) & 0xFF;
+
+	sprintf(str, "%02x:%02x:%02x:%02x:%02x:%02x\n", mac_addr[0],
+			mac_addr[1], mac_addr[2], mac_addr[3], mac_addr[4],
+								mac_addr[5]);
+
+	ret = strlen(str);
+
+	if (copy_to_user((void __user *)buff, str, ret)) {
+		dev_err(&dev->dev, "ks8851 - copy_to_user failed\n");
+		ret = 0;
+	} else {
+		*offp = ret;
+	}
+
+ks8851_cnt_rd_bk:
+	return ret;
+}
+
+/*
+ * Split the buffer `buf' into ':'-separated words.
+ * Return the number of words or <0 on error.
+ */
+#define isdelimiter(c)	((c) == ':')
+static int ks8851_debug_tokenize(char *buf, char *words[], int maxwords)
+{
+	int nwords = 0;
+
+	while (*buf) {
+		char *end;
+
+		/* Skip leading whitespace */
+		while (*buf && isspace(*buf))
+			buf++;
+		if (!*buf)
+			break;	/* oh, it was trailing whitespace */
+
+		/* Run `end' over a word */
+		for (end = buf ; *end && !isdelimiter(*end) ; end++)
+			;
+		/* `buf' is the start of the word, `end' is one past the end */
+
+		if (nwords == maxwords)
+			return -EINVAL;	/* ran out of words[] before bytes */
+		if (*end)
+			*end++ = '\0';	/* terminate the word */
+		words[nwords++] = buf;
+		buf = end;
+	}
+	return nwords;
+}
+
+/**
+ * ks8851_mac_eeprom_write - Write a MAC address in ks8851 companion EEPROM
+ *
+ */
+static ssize_t ks8851_mac_eeprom_write(struct file *filep,
+			const char __user *buff, size_t count, loff_t *offp)
+{
+	struct net_device *dev = filep->private_data;
+	ssize_t ret;
+#define MAXWORDS 6
+	int nwords, i;
+	char *words[MAXWORDS];
+	char tmpbuf[256];
+	unsigned long mac_addr[6];
+
+	if (count == 0)
+		return 0;
+	if (count > sizeof(tmpbuf)-1)
+		return -E2BIG;
+	if (copy_from_user(tmpbuf, buff, count))
+		return -EFAULT;
+	tmpbuf[count] = '\0';
+	dev_dbg(&dev->dev, "%s: read %d bytes from userspace\n",
+			__func__, (int)count);
+
+	nwords = ks8851_debug_tokenize(tmpbuf, words, MAXWORDS);
+	if (nwords != 6) {
+		dev_warn(&dev->dev,
+		"ks8851 MAC address write to EEPROM requires a MAC address " \
+						"like 01:23:45:67:89:AB\n");
+		return -EINVAL;
+	}
+
+	for (i = 0; i < 6; i++)
+		strict_strtoul(words[i], 16, &mac_addr[i]);
+
+	ks8851_eeprom_write(dev, EEPROM_OP_EWEN, 0, 0);
+
+	ks8851_eeprom_write(dev, EEPROM_OP_WRITE, 1,
+						mac_addr[4] << 8 | mac_addr[5]);
+	mdelay(EEPROM_WRITE_TIME);
+	ks8851_eeprom_write(dev, EEPROM_OP_WRITE, 2,
+						mac_addr[2] << 8 | mac_addr[3]);
+	mdelay(EEPROM_WRITE_TIME);
+	ks8851_eeprom_write(dev, EEPROM_OP_WRITE, 3,
+						mac_addr[0] << 8 | mac_addr[1]);
+	mdelay(EEPROM_WRITE_TIME);
+
+	ks8851_eeprom_write(dev, EEPROM_OP_EWDS, 0, 0);
+
+	dev_dbg(&dev->dev, "MAC address %02lx.%02lx.%02lx.%02lx.%02lx.%02lx "\
+			"written to EEPROM\n", mac_addr[0], mac_addr[1],
+			mac_addr[2], mac_addr[3], mac_addr[4], mac_addr[5]);
+
+	ret = count;
+	*offp += count;
+	return ret;
+}
+
+static const struct file_operations ks8851_mac_eeprom_fops = {
+	.open		= ks8851_mac_eeprom_open,
+	.read		= ks8851_mac_eeprom_read,
+	.write		= ks8851_mac_eeprom_write,
+	.llseek		= ks8851_mac_eeprom_seek,
+	.release	= ks8851_mac_eeprom_release,
+};
+
+int __init ks8851_debug_init(struct net_device *dev)
+{
+	struct ks8851_net *ks = netdev_priv(dev);
+	mode_t mac_access = S_IWUGO;
+
+	if (ks->rc_ccr & CCR_EEPROM) {
+		ks8851_dir = debugfs_create_dir("ks8851", NULL);
+		if (IS_ERR(ks8851_dir))
+			return PTR_ERR(ks8851_dir);
+
+		/* Check ks8851 version and features */
+		mutex_lock(&ks->lock);
+		if (CIDER_REV_GET(ks8851_rdreg16(ks, KS_CIDER)) > 0)
+			mac_access |= S_IRUGO;
+		mutex_unlock(&ks->lock);
+
+		debugfs_create_file("mac_eeprom", mac_access, ks8851_dir, dev,
+						&ks8851_mac_eeprom_fops);
+		debugfs_create_u32("eeprom_size", S_IRUGO | S_IWUGO,
+					ks8851_dir, (u32 *) &(ks->eeprom_size));
+	} else {
+		ks8851_dir = NULL;
+	}
+	return 0;
+}
+
+void ks8851_debug_exit(void)
+{
+	if (ks8851_dir)
+		debugfs_remove_recursive(ks8851_dir);
+}
+#else
+int __init ks8851_debug_init(struct net_device *dev)
+{
+	return 0;
+};
+
+void ks8851_debug_exit(void)
+{
+};
+#endif /* CONFIG_DEBUG_FS */
+
+
 /* driver bus management functions */
 
 static int __devinit ks8851_probe(struct spi_device *spi)
@@ -1649,6 +1862,8 @@ static int __devinit ks8851_probe(struct spi_device *spi)
 		goto err_netdev;
 	}
 
+	ks8851_debug_init(ndev);
+
 	dev_info(&spi->dev, "revision %d, MAC %pM, IRQ %d\n",
 		 CIDER_REV_GET(ks8851_rdreg16(ks, KS_CIDER)),
 		 ndev->dev_addr, ndev->irq);
@@ -1672,6 +1887,7 @@ static int __devexit ks8851_remove(struct spi_device *spi)
 	if (netif_msg_drv(priv))
 		dev_info(&spi->dev, "remove");
 
+	ks8851_debug_exit();
 	unregister_netdev(priv->netdev);
 	free_irq(spi->irq, priv);
 	free_netdev(priv->netdev);
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH 2/4 v2] ks8851: Low level functions for read/write to companion eeprom
From: Sebastien Jan @ 2010-05-05 18:45 UTC (permalink / raw)
  To: netdev; +Cc: linux-omap, Abraham Arce, Ben Dooks, Tristram.Ha, Sebastien Jan
In-Reply-To: <1273085155-1260-1-git-send-email-s-jan@ti.com>

Low-level functions provide 16bits words read and write capability
to ks8851 companion eeprom.

Signed-off-by: Sebastien Jan <s-jan@ti.com>
---
 drivers/net/ks8851.c |  228 ++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/ks8851.h |   14 +++-
 2 files changed, 241 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ks8851.c b/drivers/net/ks8851.c
index a84e500..787f9df 100644
--- a/drivers/net/ks8851.c
+++ b/drivers/net/ks8851.c
@@ -1044,6 +1044,234 @@ static const struct net_device_ops ks8851_netdev_ops = {
 	.ndo_validate_addr	= eth_validate_addr,
 };
 
+/* Companion eeprom access */
+
+enum {	/* EEPROM programming states */
+	EEPROM_CONTROL,
+	EEPROM_ADDRESS,
+	EEPROM_DATA,
+	EEPROM_COMPLETE
+};
+
+/**
+ * ks8851_eeprom_read - read a 16bits word in ks8851 companion EEPROM
+ * @dev: The network device the PHY is on.
+ * @addr: EEPROM address to read
+ *
+ * eeprom_size: used to define the data coding length. Can be changed
+ * through debug-fs.
+ *
+ * Programs a read on the EEPROM using ks8851 EEPROM SW access feature.
+ * Warning: The READ feature is not supported on ks8851 revision 0.
+ *
+ * Rough programming model:
+ *  - on period start: set clock high and read value on bus
+ *  - on period / 2: set clock low and program value on bus
+ *  - start on period / 2
+ */
+unsigned int ks8851_eeprom_read(struct net_device *dev, unsigned int addr)
+{
+	struct ks8851_net *ks = netdev_priv(dev);
+	int eepcr;
+	int ctrl = EEPROM_OP_READ;
+	int state = EEPROM_CONTROL;
+	int bit_count = EEPROM_OP_LEN - 1;
+	unsigned int data = 0;
+	int dummy;
+	unsigned int addr_len;
+
+	addr_len = (ks->eeprom_size == 128) ? 6 : 8;
+
+	/* start transaction: chip select high, authorize write */
+	mutex_lock(&ks->lock);
+	eepcr = EEPCR_EESA | EEPCR_EESRWA;
+	ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+	eepcr |= EEPCR_EECS;
+	ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+	mutex_unlock(&ks->lock);
+
+	while (state != EEPROM_COMPLETE) {
+		/* falling clock period starts... */
+		/* set EED_IO pin for control and address */
+		eepcr &= ~EEPCR_EEDO;
+		switch (state) {
+		case EEPROM_CONTROL:
+			eepcr |= ((ctrl >> bit_count) & 1) << 2;
+			if (bit_count-- <= 0) {
+				bit_count = addr_len - 1;
+				state = EEPROM_ADDRESS;
+			}
+			break;
+		case EEPROM_ADDRESS:
+			eepcr |= ((addr >> bit_count) & 1) << 2;
+			bit_count--;
+			break;
+		case EEPROM_DATA:
+			/* Change to receive mode */
+			eepcr &= ~EEPCR_EESRWA;
+			break;
+		}
+
+		/* lower clock  */
+		eepcr &= ~EEPCR_EESCK;
+
+		mutex_lock(&ks->lock);
+		ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+		mutex_unlock(&ks->lock);
+
+		/* waitread period / 2 */
+		udelay(EEPROM_SK_PERIOD / 2);
+
+		/* rising clock period starts... */
+
+		/* raise clock */
+		mutex_lock(&ks->lock);
+		eepcr |= EEPCR_EESCK;
+		ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+		mutex_unlock(&ks->lock);
+
+		/* Manage read */
+		switch (state) {
+		case EEPROM_ADDRESS:
+			if (bit_count < 0) {
+				bit_count = EEPROM_DATA_LEN - 1;
+				state = EEPROM_DATA;
+			}
+			break;
+		case EEPROM_DATA:
+			mutex_lock(&ks->lock);
+			dummy = ks8851_rdreg16(ks, KS_EEPCR);
+			mutex_unlock(&ks->lock);
+			data |= ((dummy >> EEPCR_EESB_OFFSET) & 1) << bit_count;
+			if (bit_count-- <= 0)
+				state = EEPROM_COMPLETE;
+			break;
+		}
+
+		/* wait period / 2 */
+		udelay(EEPROM_SK_PERIOD / 2);
+	}
+
+	/* close transaction */
+	mutex_lock(&ks->lock);
+	eepcr &= ~EEPCR_EECS;
+	ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+	eepcr = 0;
+	ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+	mutex_unlock(&ks->lock);
+
+	return data;
+}
+
+/**
+ * ks8851_eeprom_write - write a 16bits word in ks8851 companion EEPROM
+ * @dev: The network device the PHY is on.
+ * @op: operand (can be WRITE, EWEN, EWDS)
+ * @addr: EEPROM address to write
+ * @data: data to write
+ *
+ * eeprom_size: used to define the data coding length. Can be changed
+ * through debug-fs.
+ *
+ * Programs a write on the EEPROM using ks8851 EEPROM SW access feature.
+ *
+ * Note that a write enable is required before writing data.
+ *
+ * Rough programming model:
+ *  - on period start: set clock high
+ *  - on period / 2: set clock low and program value on bus
+ *  - start on period / 2
+ */
+void ks8851_eeprom_write(struct net_device *dev, unsigned int op,
+					unsigned int addr, unsigned int data)
+{
+	struct ks8851_net *ks = netdev_priv(dev);
+	int eepcr;
+	int state = EEPROM_CONTROL;
+	int bit_count = EEPROM_OP_LEN - 1;
+	unsigned int addr_len;
+
+	addr_len = (ks->eeprom_size == 128) ? 6 : 8;
+
+	switch (op) {
+	case EEPROM_OP_EWEN:
+		addr = 0x30;
+	break;
+	case EEPROM_OP_EWDS:
+		addr = 0;
+		break;
+	}
+
+	/* start transaction: chip select high, authorize write */
+	mutex_lock(&ks->lock);
+	eepcr = EEPCR_EESA | EEPCR_EESRWA;
+	ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+	eepcr |= EEPCR_EECS;
+	ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+	mutex_unlock(&ks->lock);
+
+	while (state != EEPROM_COMPLETE) {
+		/* falling clock period starts... */
+		/* set EED_IO pin for control and address */
+		eepcr &= ~EEPCR_EEDO;
+		switch (state) {
+		case EEPROM_CONTROL:
+			eepcr |= ((op >> bit_count) & 1) << 2;
+			if (bit_count-- <= 0) {
+				bit_count = addr_len - 1;
+				state = EEPROM_ADDRESS;
+			}
+			break;
+		case EEPROM_ADDRESS:
+			eepcr |= ((addr >> bit_count) & 1) << 2;
+			if (bit_count-- <= 0) {
+				if (op == EEPROM_OP_WRITE) {
+					bit_count = EEPROM_DATA_LEN - 1;
+					state = EEPROM_DATA;
+				} else {
+					state = EEPROM_COMPLETE;
+				}
+			}
+			break;
+		case EEPROM_DATA:
+			eepcr |= ((data >> bit_count) & 1) << 2;
+			if (bit_count-- <= 0)
+				state = EEPROM_COMPLETE;
+			break;
+		}
+
+		/* lower clock  */
+		eepcr &= ~EEPCR_EESCK;
+
+		mutex_lock(&ks->lock);
+		ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+		mutex_unlock(&ks->lock);
+
+		/* wait period / 2 */
+		udelay(EEPROM_SK_PERIOD / 2);
+
+		/* rising clock period starts... */
+
+		/* raise clock */
+		eepcr |= EEPCR_EESCK;
+		mutex_lock(&ks->lock);
+		ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+		mutex_unlock(&ks->lock);
+
+		/* wait period / 2 */
+		udelay(EEPROM_SK_PERIOD / 2);
+	}
+
+	/* close transaction */
+	mutex_lock(&ks->lock);
+	eepcr &= ~EEPCR_EECS;
+	ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+	eepcr = 0;
+	ks8851_wrreg16(ks, KS_EEPCR, eepcr);
+	mutex_unlock(&ks->lock);
+
+}
+
 /* ethtool support */
 
 static void ks8851_get_drvinfo(struct net_device *dev,
diff --git a/drivers/net/ks8851.h b/drivers/net/ks8851.h
index f52c312..537fb06 100644
--- a/drivers/net/ks8851.h
+++ b/drivers/net/ks8851.h
@@ -25,12 +25,24 @@
 #define OBCR_ODS_16mA				(1 << 6)
 
 #define KS_EEPCR				0x22
+#define EEPCR_EESRWA				(1 << 5)
 #define EEPCR_EESA				(1 << 4)
-#define EEPCR_EESB				(1 << 3)
+#define EEPCR_EESB_OFFSET			3
+#define EEPCR_EESB				(1 << EEPCR_EESB_OFFSET)
 #define EEPCR_EEDO				(1 << 2)
 #define EEPCR_EESCK				(1 << 1)
 #define EEPCR_EECS				(1 << 0)
 
+#define EEPROM_OP_LEN				3	/* bits:*/
+#define EEPROM_OP_READ				0x06
+#define EEPROM_OP_EWEN				0x04
+#define EEPROM_OP_WRITE				0x05
+#define EEPROM_OP_EWDS				0x14
+
+#define EEPROM_DATA_LEN				16	/* 16 bits EEPROM */
+#define EEPROM_WRITE_TIME			4	/* wrt ack time in ms */
+#define EEPROM_SK_PERIOD			400	/* in us */
+
 #define KS_MBIR					0x24
 #define MBIR_TXMBF				(1 << 12)
 #define MBIR_TXMBFA				(1 << 11)
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH 1/4 v2] ks8851: Add caching of CCR register
From: Sebastien Jan @ 2010-05-05 18:45 UTC (permalink / raw)
  To: netdev; +Cc: linux-omap, Abraham Arce, Ben Dooks, Tristram.Ha, Sebastien Jan
In-Reply-To: <1273085155-1260-1-git-send-email-s-jan@ti.com>

CCR register contains information on companion eeprom availability.

Signed-off-by: Sebastien Jan <s-jan@ti.com>
---
 drivers/net/ks8851.c |   12 ++++++++++++
 1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ks8851.c b/drivers/net/ks8851.c
index 9e9f9b3..a84e500 100644
--- a/drivers/net/ks8851.c
+++ b/drivers/net/ks8851.c
@@ -76,7 +76,9 @@ union ks8851_tx_hdr {
  * @msg_enable: The message flags controlling driver output (see ethtool).
  * @fid: Incrementing frame id tag.
  * @rc_ier: Cached copy of KS_IER.
+ * @rc_ccr: Cached copy of KS_CCR.
  * @rc_rxqcr: Cached copy of KS_RXQCR.
+ * @eeprom_size: Companion eeprom size in Bytes, 0 if no eeprom
  *
  * The @lock ensures that the chip is protected when certain operations are
  * in progress. When the read or write packet transfer is in progress, most
@@ -107,6 +109,8 @@ struct ks8851_net {
 
 	u16			rc_ier;
 	u16			rc_rxqcr;
+	u16			rc_ccr;
+	u16			eeprom_size;
 
 	struct mii_if_info	mii;
 	struct ks8851_rxctrl	rxctrl;
@@ -1279,6 +1283,14 @@ static int __devinit ks8851_probe(struct spi_device *spi)
 		goto err_id;
 	}
 
+	/* cache the contents of the CCR register for EEPROM, etc. */
+	ks->rc_ccr = ks8851_rdreg16(ks, KS_CCR);
+
+	if (ks->rc_ccr & CCR_EEPROM)
+		ks->eeprom_size = 128;
+	else
+		ks->eeprom_size = 0;
+
 	ks8851_read_selftest(ks);
 	ks8851_init_mac(ks);
 
-- 
1.6.3.3


^ permalink raw reply related

* [PATCH 0/4 v2] ks8851: support for read/write MAC address from eeprom
From: Sebastien Jan @ 2010-05-05 18:45 UTC (permalink / raw)
  To: netdev; +Cc: linux-omap, Abraham Arce, Ben Dooks, Tristram.Ha, Sebastien Jan

Provides the ability to update the ks8851 companion eeprom content (including the
MAC address) through ethtool or debugfs (for MAC address only).

Differences with v1:
 - added the support of eeprom access through ethtool
 - different patches split

Patch set tested on OMAP4-based test board. Build tested on net-2.6 head.

Applies on current net-2.6 head.

Sebastien Jan (4):
  ks8851: Add caching of CCR register
  ks8851: Low level functions for read/write to companion eeprom
  ks8851: companion eeprom access through ethtool
  ks8851: read/write MAC address on companion eeprom through debugfs

 drivers/net/ks8851.c |  570 ++++++++++++++++++++++++++++++++++++++++++++++++++
 drivers/net/ks8851.h |   14 ++-
 2 files changed, 583 insertions(+), 1 deletions(-)


^ permalink raw reply

* [PATCH 3/4 v2] ks8851: companion eeprom access through ethtool
From: Sebastien Jan @ 2010-05-05 18:45 UTC (permalink / raw)
  To: netdev; +Cc: linux-omap, Abraham Arce, Ben Dooks, Tristram.Ha, Sebastien Jan
In-Reply-To: <1273085155-1260-1-git-send-email-s-jan@ti.com>

Accessing ks8851 companion eeprom permits modifying the ks8851 stored
MAC address.

Example how to change the MAC address using ethtool, to set the
01:23:45:67:89:AB MAC address:
$ echo "0:AB8976452301" | xxd -r > mac.bin
$ sudo ethtool -E eth0 magic 0x8870 offset 2 < mac.bin

Signed-off-by: Sebastien Jan <s-jan@ti.com>
---
 drivers/net/ks8851.c |  114 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 114 insertions(+), 0 deletions(-)

diff --git a/drivers/net/ks8851.c b/drivers/net/ks8851.c
index 787f9df..8e38c36 100644
--- a/drivers/net/ks8851.c
+++ b/drivers/net/ks8851.c
@@ -1318,6 +1318,117 @@ static int ks8851_nway_reset(struct net_device *dev)
 	return mii_nway_restart(&ks->mii);
 }
 
+static int ks8851_get_eeprom_len(struct net_device *dev)
+{
+	struct ks8851_net *ks = netdev_priv(dev);
+	return ks->eeprom_size;
+}
+
+static int ks8851_get_eeprom(struct net_device *dev,
+			    struct ethtool_eeprom *eeprom, u8 *bytes)
+{
+	struct ks8851_net *ks = netdev_priv(dev);
+	u16 *eeprom_buff;
+	int first_word;
+	int last_word;
+	int ret_val = 0;
+	u16 i;
+
+	if (eeprom->len == 0)
+		return -EINVAL;
+
+	if (eeprom->len > ks->eeprom_size)
+		return -EINVAL;
+
+	eeprom->magic = ks8851_rdreg16(ks, KS_CIDER);
+
+	first_word = eeprom->offset >> 1;
+	last_word = (eeprom->offset + eeprom->len - 1) >> 1;
+
+	eeprom_buff = kmalloc(sizeof(u16) *
+			(last_word - first_word + 1), GFP_KERNEL);
+	if (!eeprom_buff)
+		return -ENOMEM;
+
+	for (i = 0; i < last_word - first_word + 1; i++)
+		eeprom_buff[i] = ks8851_eeprom_read(dev, first_word + 1);
+
+	/* Device's eeprom is little-endian, word addressable */
+	for (i = 0; i < last_word - first_word + 1; i++)
+		le16_to_cpus(&eeprom_buff[i]);
+
+	memcpy(bytes, (u8 *)eeprom_buff + (eeprom->offset & 1), eeprom->len);
+	kfree(eeprom_buff);
+
+	return ret_val;
+}
+
+static int ks8851_set_eeprom(struct net_device *dev,
+			    struct ethtool_eeprom *eeprom, u8 *bytes)
+{
+	struct ks8851_net *ks = netdev_priv(dev);
+	u16 *eeprom_buff;
+	void *ptr;
+	int max_len;
+	int first_word;
+	int last_word;
+	int ret_val = 0;
+	u16 i;
+
+	if (eeprom->len == 0)
+		return -EOPNOTSUPP;
+
+	if (eeprom->len > ks->eeprom_size)
+		return -EINVAL;
+
+	if (eeprom->magic != ks8851_rdreg16(ks, KS_CIDER))
+		return -EFAULT;
+
+	first_word = eeprom->offset >> 1;
+	last_word = (eeprom->offset + eeprom->len - 1) >> 1;
+	max_len = (last_word - first_word + 1) * 2;
+	eeprom_buff = kmalloc(max_len, GFP_KERNEL);
+	if (!eeprom_buff)
+		return -ENOMEM;
+
+	ptr = (void *)eeprom_buff;
+
+	if (eeprom->offset & 1) {
+		/* need read/modify/write of first changed EEPROM word */
+		/* only the second byte of the word is being modified */
+		eeprom_buff[0] = ks8851_eeprom_read(dev, first_word);
+		ptr++;
+	}
+	if ((eeprom->offset + eeprom->len) & 1)
+		/* need read/modify/write of last changed EEPROM word */
+		/* only the first byte of the word is being modified */
+		eeprom_buff[last_word - first_word] =
+					ks8851_eeprom_read(dev, last_word);
+
+
+	/* Device's eeprom is little-endian, word addressable */
+	le16_to_cpus(&eeprom_buff[0]);
+	le16_to_cpus(&eeprom_buff[last_word - first_word]);
+
+	memcpy(ptr, bytes, eeprom->len);
+
+	for (i = 0; i < last_word - first_word + 1; i++)
+		eeprom_buff[i] = cpu_to_le16(eeprom_buff[i]);
+
+	ks8851_eeprom_write(dev, EEPROM_OP_EWEN, 0, 0);
+
+	for (i = 0; i < last_word - first_word + 1; i++) {
+		ks8851_eeprom_write(dev, EEPROM_OP_WRITE, first_word + i,
+							eeprom_buff[i]);
+		mdelay(EEPROM_WRITE_TIME);
+	}
+
+	ks8851_eeprom_write(dev, EEPROM_OP_EWDS, 0, 0);
+
+	kfree(eeprom_buff);
+	return ret_val;
+}
+
 static const struct ethtool_ops ks8851_ethtool_ops = {
 	.get_drvinfo	= ks8851_get_drvinfo,
 	.get_msglevel	= ks8851_get_msglevel,
@@ -1326,6 +1437,9 @@ static const struct ethtool_ops ks8851_ethtool_ops = {
 	.set_settings	= ks8851_set_settings,
 	.get_link	= ks8851_get_link,
 	.nway_reset	= ks8851_nway_reset,
+	.get_eeprom_len	= ks8851_get_eeprom_len,
+	.get_eeprom	= ks8851_get_eeprom,
+	.set_eeprom	= ks8851_set_eeprom,
 };
 
 /* MII interface controls */
-- 
1.6.3.3


^ permalink raw reply related

* Re: TCP-MD5 checksum failure on x86_64 SMP
From: Bhaskar Dutta @ 2010-05-05 18:03 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Ben Hutchings, netdev
In-Reply-To: <20100504101301.5f4dd9c2@nehalam>

On Tue, May 4, 2010 at 10:43 PM, Stephen Hemminger
<shemminger@vyatta.com> wrote:
>
> On Tue, 4 May 2010 22:38:49 +0530
> Bhaskar Dutta <bhaskie@gmail.com> wrote:
>
> > On Tue, May 4, 2010 at 9:42 PM, Stephen Hemminger <shemminger@vyatta.com> wrote:
> > > On Tue, 4 May 2010 19:58:32 +0530
> > > Bhaskar Dutta <bhaskie@gmail.com> wrote:
> > >
> > >> On Tue, May 4, 2010 at 5:02 PM, Ben Hutchings <bhutchings@solarflare.com> wrote:
> > >> > On Tue, 2010-05-04 at 09:00 +0530, Bhaskar Dutta wrote:
> > >> >> Hi,
> > >> >>
> > >> >> I am observing intermittent TCP-MD5 checksum failures
> > >> >> (CONFIG_TCP_MD5SIG)  on kernel 2.6.31 while talking to a BGP router.
> > >> >>
> > >> >> The problem is only seen in multi-core 64 bit machines.
> > >> >> Is there any known bug in the per_cpu_ptr implementation (I am aware
> > >> >> that the percpu allocator has been re-implemented in 2.6.33) that
> > >> >> might cause a corruption in 64 bit SMP machines?
> > >> >>
> > >> >> Any pointers would be appreciated.
> > >> >
> > >> > There was another recent report of incorrect MD5 signatures in
> > >> > <http://thread.gmane.org/gmane.linux.network/159556>, but without any
> > >> > response.
> > >> >
> > >> > Ben.
> > >> >
> > >>
> > >> I found another thread posted back in Jan 2007 with a similar bug
> > >> (x86_64 on 2.6.20) but no replies to that as well.
> > >> http://lkml.org/lkml/2007/1/20/56
> > >
> > > 2.6.20 had lots of other MD5 bugs. Your problem might be related to
> > > GRO.  MD5 may not handle multi-fragment packets.
> > > --
> >
> > I am getting the issue on 2.6.31 and 2.6.28 (gro infrastructure was
> > added in 2.6.29).
> > Also, both segmentation offloading as well as receive offloading
> > (gso/gro) are turned off.
> >
> > Moreover outgoing TCP packets are the ones with the corrupt checksums.
> > Both tcpdump on my local machine and the BGP router on the other side
> > complain of the bad checksums with the same packet.
> >
> > I am trying to figure out if there is something in the per-cpu
> > implementation that might be causing a corruption (SMP and x86_64) but
> > I am not really getting anywhere.
>
> I seriously doubt the per-cpu stuff is the issue.
>
> > I am trying to reproduce the bad checksums with the latest kernel
> > sources since it has a new implementation of the percpu allocator.
>
> First turn off all offload settings on the device (TSO,GSO,SG,CSUM)
> then check that size of the bad packets. Are they fragmented or
> just simple linear packets?
>
> --

Hi,

TSO, GSO and SG are already turned off.
rx/tx checksumming is on, but that shouldn't matter, right?

# ethtool -k eth0
Offload parameters for eth0:
rx-checksumming: on
tx-checksumming: on
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

The bad packets are very small in size, most have no data at all (<300 bytes).

After adding some logs to kernel 2.6.31-12, it seems that
tcp_v4_md5_hash_skb (function that calculates the md5 hash) is
(might?) getting corrupt.

The tcp4_pseudohdr (bp = &hp->md5_blk.ip4) structure's saddr, daddr
and len fields get modified to different values towards the end of the
tcp_v4_md5_hash_skb function whenever there is a checksum error.

The tcp4_pseudohdr (bp) is within the tcp_md5sig_pool (hp), which is
filled up by tcp_get_md5sig_pool (which calls per_cpu_ptr).

Using a local copy of the tcp4_pseudohdr in the same function
tcp_v4_md5_hash_skb (copied all fields from the original
tcp4_pseudohdr within the tcp_md5sig_pool) and calculating the md5
checksum with the local  tcp4_pseudohdr seems to solve the issue
(don't see bad packets for a hours in load tests, and without the
change I can see them instantaneously in the load tests).

I am still unable to figure out how this is happening. Please let me
know if you have any pointers.

Thanks a lot!
Bhaskar

^ permalink raw reply

* Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Avi Kivity @ 2010-05-05 17:59 UTC (permalink / raw)
  To: Pankaj Thakkar
  Cc: linux-kernel, netdev, virtualization, pv-drivers, sbhatewara
In-Reply-To: <20100504230225.GP8323@vmware.com>

On 05/05/2010 02:02 AM, Pankaj Thakkar wrote:
> 2. Hypervisor control: All control operations from the guest such as programming
> MAC address go through the hypervisor layer and hence can be subjected to
> hypervisor policies. The PF driver can be further used to put policy decisions
> like which VLAN the guest should be on.
>    

Is this enforced?  Since you pass the hardware through, you can't rely 
on the guest actually doing this, yes?

> The plugin image is provided by the IHVs along with the PF driver and is
> packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
> either into a Linux VM or a Windows VM. The plugin is written against the Shell
> API interface which the shell is responsible for implementing. The API
> interface allows the plugin to do TX and RX only by programming the hardware
> rings (along with things like buffer allocation and basic initialization). The
> virtual machine comes up in paravirtualized/emulated mode when it is booted.
> The hypervisor allocates the VF and other resources and notifies the shell of
> the availability of the VF. The hypervisor injects the plugin into memory
> location specified by the shell. The shell initializes the plugin by calling
> into a known entry point and the plugin initializes the data path. The control
> path is already initialized by the PF driver when the VF is allocated. At this
> point the shell switches to using the loaded plugin to do all further TX and RX
> operations. The guest networking stack does not participate in these operations
> and continues to function normally. All the control operations continue being
> trapped by the hypervisor and are directed to the PF driver as needed. For
> example, if the MAC address changes the hypervisor updates its internal state
> and changes the state of the embedded switch as well through the PF control
> API.
>    

This is essentially a miniature network stack with a its own mini 
bonding layer, mini hotplug, and mini API, except s/API/ABI/.  Is this a 
correct view?

If so, the Linuxy approach would be to use the ordinary drivers and the 
Linux networking API, and hide the bond setup using namespaces.  The 
bond driver, or perhaps a new, similar, driver can be enhanced to 
propagate ethtool commands to its (hidden) components, and to have a 
control channel with the hypervisor.

This would make the approach hypervisor agnostic, you're just pairing 
two devices and presenting them to the rest of the stack as a single device.

> We have reworked our existing Linux vmxnet3 driver to accomodate NPA by
> splitting the driver into two parts: Shell and Plugin. The new split driver is
>    

So the Shell would be the reworked or new bond driver, and Plugins would 
be ordinary Linux network drivers.


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Stephen Hemminger @ 2010-05-05 17:52 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Christoph Hellwig, pv-drivers@vmware.com, Pankaj Thakkar,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20100505173951.GA8388@infradead.org>

On Wed, 5 May 2010 13:39:51 -0400
Christoph Hellwig <hch@infradead.org> wrote:

> On Wed, May 05, 2010 at 10:35:28AM -0700, Dmitry Torokhov wrote:
> > Yes, with the exception that the only body of code that will be
> > accepted by the shell should be GPL-licensed and thus open and available
> > for examining. This is not different from having a standard kernel
> > module that is loaded normally and plugs into a certain subsystem.
> > The difference is that the binary resides not on guest filesystem
> > but elsewhere.
> 
> Forget about the licensing.  Loading binary blobs written to a shim
> layer is a complete pain in the ass and totally unsupportable, and
> also uninteresting because of the overhead.
> 
> If you have any interesting in developing this further, do:
> 
>  (1) move the limited VF drivers directly into the kernel tree,
>      talk to them through a normal ops vector
>  (2) get rid of the whole shim crap and instead integrate the limited
>      VF driver with the full VF driver we already have, instead of
>      duplicating the code
>  (3) don't make the PV to VF integration VMware-specific but also
>      provide an open reference implementation like virtio.  We're not
>      going to add massive amount of infrastructure that is not actually
>      useable in a free software stack.

Let me put it bluntly. Any design that allows external code to run
in the kernel is not going to be accepted.  Out of tree kernel modules are enough
of a pain already, why do you expect the developers to add another
interface.

^ permalink raw reply

* RE: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Pankaj Thakkar @ 2010-05-05 17:47 UTC (permalink / raw)
  To: Christoph Hellwig, Dmitry Torokhov
  Cc: pv-drivers@vmware.com, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20100505173951.GA8388@infradead.org>



> -----Original Message-----
> From: Christoph Hellwig [mailto:hch@infradead.org]
> Sent: Wednesday, May 05, 2010 10:40 AM
> To: Dmitry Torokhov
> Cc: Christoph Hellwig; pv-drivers@vmware.com; Pankaj Thakkar;
> netdev@vger.kernel.org; linux-kernel@vger.kernel.org;
> virtualization@lists.linux-foundation.org
> Subject: Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for
> vmxnet3
> 
> On Wed, May 05, 2010 at 10:35:28AM -0700, Dmitry Torokhov wrote:
> > Yes, with the exception that the only body of code that will be
> > accepted by the shell should be GPL-licensed and thus open and
> available
> > for examining. This is not different from having a standard kernel
> > module that is loaded normally and plugs into a certain subsystem.
> > The difference is that the binary resides not on guest filesystem
> > but elsewhere.
> 
> Forget about the licensing.  Loading binary blobs written to a shim
> layer is a complete pain in the ass and totally unsupportable, and
> also uninteresting because of the overhead.

[PT] Why do you think it is unsupportable? How different is it from any module written against a well maintained interface? What overhead are you talking about?

> 
> If you have any interesting in developing this further, do:
> 
>  (1) move the limited VF drivers directly into the kernel tree,
>      talk to them through a normal ops vector
[PT] This assumes that all the VF drivers would always be available. Also we have to support windows and our current design supports it nicely in an OS agnostic manner.

>  (2) get rid of the whole shim crap and instead integrate the limited
>      VF driver with the full VF driver we already have, instead of
>      duplicating the code
[PT] Having a full VF driver adds a lot of dependency on the guest VM and this is what NPA tries to avoid.

>  (3) don't make the PV to VF integration VMware-specific but also
>      provide an open reference implementation like virtio.  We're not
>      going to add massive amount of infrastructure that is not actually
>      useable in a free software stack.
[PT] Today this is tied to vmxnet3 device and is intended to work on ESX hypervisor only (vmxnet3 works on VMware hypervisor only). All the loading support is inside the ESX hypervisor. I am going to post the interface between the shell and the plugin soon and you can see that there is not a whole lot of dependency or infrastructure requirements from the Linux kernel. Please keep in mind that we don't use Linux as a hypervisor but as a guest VM.


^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Christoph Hellwig @ 2010-05-05 17:39 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: Christoph Hellwig, pv-drivers@vmware.com, Pankaj Thakkar,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <201005051035.29831.dtor@vmware.com>

On Wed, May 05, 2010 at 10:35:28AM -0700, Dmitry Torokhov wrote:
> Yes, with the exception that the only body of code that will be
> accepted by the shell should be GPL-licensed and thus open and available
> for examining. This is not different from having a standard kernel
> module that is loaded normally and plugs into a certain subsystem.
> The difference is that the binary resides not on guest filesystem
> but elsewhere.

Forget about the licensing.  Loading binary blobs written to a shim
layer is a complete pain in the ass and totally unsupportable, and
also uninteresting because of the overhead.

If you have any interesting in developing this further, do:

 (1) move the limited VF drivers directly into the kernel tree,
     talk to them through a normal ops vector
 (2) get rid of the whole shim crap and instead integrate the limited
     VF driver with the full VF driver we already have, instead of
     duplicating the code
 (3) don't make the PV to VF integration VMware-specific but also
     provide an open reference implementation like virtio.  We're not
     going to add massive amount of infrastructure that is not actually
     useable in a free software stack.

^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Dmitry Torokhov @ 2010-05-05 17:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: pv-drivers@vmware.com, Pankaj Thakkar, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20100505173120.GA1752@infradead.org>

On Wednesday 05 May 2010 10:31:20 am Christoph Hellwig wrote:
> On Wed, May 05, 2010 at 10:29:40AM -0700, Dmitry Torokhov wrote:
> > > We're not going to add any kind of loader for binry blobs into kernel
> > > space, sorry.  Don't even bother wasting your time on this.
> > 
> > It would not be a binary blob but software properly released under GPL.
> > The current plan is for the shell to enforce GPL requirement on the
> > plugin code, similar to what module loaded does for regular kernel
> > modules.
> 
> The mechanism described in the document is loading a binary blob
> coded to an abstract API.

Yes, with the exception that the only body of code that will be
accepted by the shell should be GPL-licensed and thus open and available
for examining. This is not different from having a standard kernel
module that is loaded normally and plugs into a certain subsystem.
The difference is that the binary resides not on guest filesystem
but elsewhere.

> 
> That's something entirely different from having normal modules for
> the Virtual Functions, which we already have for various pieces of
> hardware anyway.

-- 
Dmitry

^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Christoph Hellwig @ 2010-05-05 17:31 UTC (permalink / raw)
  To: Dmitry Torokhov
  Cc: pv-drivers, Christoph Hellwig, Pankaj Thakkar,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <201005051029.42052.dtor@vmware.com>

On Wed, May 05, 2010 at 10:29:40AM -0700, Dmitry Torokhov wrote:
> > We're not going to add any kind of loader for binry blobs into kernel
> > space, sorry.  Don't even bother wasting your time on this.
> > 
> 
> It would not be a binary blob but software properly released under GPL.
> The current plan is for the shell to enforce GPL requirement on the
> plugin code, similar to what module loaded does for regular kernel
> modules.

The mechanism described in the document is loading a binary blob
coded to an abstract API.

That's something entirely different from having normal modules for
the Virtual Functions, which we already have for various pieces of
hardware anyway.

^ permalink raw reply

* Re: [Pv-drivers] RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Dmitry Torokhov @ 2010-05-05 17:29 UTC (permalink / raw)
  To: pv-drivers
  Cc: Christoph Hellwig, Pankaj Thakkar, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org
In-Reply-To: <20100505172316.GA25338@infradead.org>

On Wednesday 05 May 2010 10:23:16 am Christoph Hellwig wrote:
> On Tue, May 04, 2010 at 04:02:25PM -0700, Pankaj Thakkar wrote:
> > The plugin image is provided by the IHVs along with the PF driver and is
> > packaged in the hypervisor. The plugin image is OS agnostic and can be
> > loaded either into a Linux VM or a Windows VM. The plugin is written
> > against the Shell API interface which the shell is responsible for
> > implementing. The API
> 
> We're not going to add any kind of loader for binry blobs into kernel
> space, sorry.  Don't even bother wasting your time on this.
> 

It would not be a binary blob but software properly released under GPL.
The current plan is for the shell to enforce GPL requirement on the
plugin code, similar to what module loaded does for regular kernel
modules.

-- 
Dmitry

^ permalink raw reply

* Re: RFC: Network Plugin Architecture (NPA) for vmxnet3
From: Christoph Hellwig @ 2010-05-05 17:23 UTC (permalink / raw)
  To: Pankaj Thakkar
  Cc: linux-kernel, netdev, virtualization, pv-drivers, sbhatewara
In-Reply-To: <20100504230225.GP8323@vmware.com>

On Tue, May 04, 2010 at 04:02:25PM -0700, Pankaj Thakkar wrote:
> The plugin image is provided by the IHVs along with the PF driver and is
> packaged in the hypervisor. The plugin image is OS agnostic and can be loaded
> either into a Linux VM or a Windows VM. The plugin is written against the Shell
> API interface which the shell is responsible for implementing. The API

We're not going to add any kind of loader for binry blobs into kernel
space, sorry.  Don't even bother wasting your time on this.


^ permalink raw reply

* Re: Fw: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join on same multicast-group dont return EINVAL
From: David Stevens @ 2010-05-05 17:09 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Herbert Xu, netdev, netdev-owner
In-Reply-To: <20100505082138.6cb93aa1@nehalam>

This particular failure  may be  just a matter of translating the
EADDRINUSE check in IP_ADD_SOURCE_MEMBERSHIP to
return EINVAL rather than ignoring it. The more general change of
segregating SSM and ASM should go further than that (e.g., a boolean
to tell you which way the membership was added and checking in
all the operations).

The code predates this informational RFC and allows you to
change back and forth between ASM and SSM where it makes
sense (a more liberal interpretation).

Of course, if existing applications are mixing them already, they
would break, and I'm not sure I agree it's a good thing to have
to destroy an existing membership and recreate it if you want to
switch from ASM to SSM.

I can look at this, but not for a few days at least; can review if
someone else does before I do.

                                        +-DLS


netdev-owner@vger.kernel.org wrote on 05/05/2010 08:21:38 AM:

> 
> 
> Begin forwarded message:
> 
> Date: Wed, 5 May 2010 09:48:40 GMT
> From: bugzilla-daemon@bugzilla.kernel.org
> To: shemminger@linux-foundation.org
> Subject: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after 
IP_ADD_MEMBERSHIP 
> join on same multicast-group dont return EINVAL
> 
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=15907
> 
>            Summary: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP 
join
>                     on same multicast-group dont return EINVAL
>            Product: Networking
>            Version: 2.5
>     Kernel Version: 2.6.34-rc6
>           Platform: All
>         OS/Version: Linux
>               Tree: Mainline
>             Status: NEW
>           Severity: normal
>           Priority: P1
>          Component: IPV4
>         AssignedTo: shemminger@linux-foundation.org
>         ReportedBy: mail@fholler.de
>         Regression: No
> 
> 
> Created an attachment (id=26225)
>  --> (https://bugzilla.kernel.org/attachment.cgi?id=26225)
> asm+ssm join test program
> 
> When an SSM IP_ADD_SOURCE_MEMBERSHIP is done after an ASM 
IP_ADD_MEMBERSHIP
> join on the same group(& same interface) the setsockopt operation should 
return
> EINVAL.
> 
> The linux implementation returns successfull
> 
> 
> 
> https://www3.tools.ietf.org/html/rfc3678#section-4.1.3
> 
> I attached an simple C test program.
> 
> -- 
> Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
> ------- You are receiving this mail because: -------
> You are the assignee for the bug.
> 
> 
> -- 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


^ permalink raw reply

* RE: [RFC PATCH net-next] drivers/net/ks*: Use netdev_<level>, netif_<level> and pr_<level>
From: Ha, Tristram @ 2010-05-05 17:06 UTC (permalink / raw)
  To: David Miller; +Cc: ben, Choi, David, richard.rojfors.ext, netdev, joe
In-Reply-To: <20100316.213133.102673436.davem@davemloft.net>

David Miller wrote:
> From: Joe Perches <joe@perches.com>
> Date: Sat, 27 Feb 2010 16:43:51 -0800
> 
>> I'm not sure this is correct.
>> 
>> It changes logging macros from:
>> 	dev_<level>(&ks->spidev->dev,
>> to
>> 	netdev_<level>(ks->netdev,
> 
> I'm just applying this, more than a week to get a review is more than
enough.
> 
> Thanks Joe.

I have no objection about the patch as it just uses the latest kernel
debug call.  Is this new code available in current 2.6.34 release?  Or
will it be in the next version?

I generally download the latest 2.6.34-rc release to generate patches
for submission.

^ permalink raw reply

* RTL-8110SC lockup with r8169
From: Pádraig Brady @ 2010-05-05 16:05 UTC (permalink / raw)
  To: netdev; +Cc: Francois Romieu, Glen Gray

[-- Attachment #1: Type: text/plain, Size: 1812 bytes --]

Hi,

We're having an issue with the r8169 driver, where very often
(1 in 10 boots) it will lockup and our netboot system will hang.
On this hardware previously, we used FC5 with the r1000 driver without issue.

The interesting (problematic) thing is that the system needs to
be _power cycled_ to get the link detected again.
A reboot does not suffice. The symptoms sound like:

http://kerneltrap.org/mailarchive/linux-netdev/2009/3/6/5110184
http://git.kernel.org/?p=linux/kernel/git/stable/linux-2.6.32.y.git;a=commitdiff;h=ea8dbdd17099a9a5864ebd4c87e01e657b19c7ab

However the above code wasn't in the 2.6.32.10-90.fc12 driver we used.
Also I've back-ported the latest r8169 driver from git to our kernel
and it still has the same issue. The inconsequential diff
between our driver and the latest in git are attached just in case.

In the following dmesg you can see a successful boot.
Why does the link go down twice even in that case?
On lockup one does not see the "link up" message and
a power cycle is required, as the link is not detected
by the netboot ROM on a reboot.

r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
r8169 0000:01:04.0: PCI INT A -> Link[LNKA] -> GSI 11 (level, low) -> IRQ 11
r8169 0000:01:04.0: no PCI Express capability
eth0: RTL8169sc/8110sc at 0xde87c000, 00:60:ef:08:49:f5, XID 98000000 IRQ 11
r8169: eth0: link down
r8169: eth0: link down
r8169: eth0: link up

I'm in a good position to test any patches/ideas you may have.

cheers,
Pádraig.

Ancillary version details are...

# dmesg | grep 8169
# lspci -n | grep -v 8086:
01:04.0 0200: 10ec:8167 (rev 10)

# ethtool -i eth0
driver: r8169
version: 2.3LK-NAPI
firmware-version:
bus-info: 0000:01:04.0

# uname -a
Linux Unit-00:60:ef:08:49:f5 2.6.32.10-90.fc12.i686 #1 SMP Tue Mar 23 10:21:29 UTC 2010 i686 i686 i386 GNU/Linux

[-- Attachment #2: r8169-unimportant.diff --]
[-- Type: text/x-patch, Size: 11726 bytes --]

--- /home/padraig/kernel/r8169.c.latest	2010-05-05 11:42:54.345452634 +0100
+++ /home/padraig/kernel/r8169.c.new	2010-05-05 16:36:36.805627050 +0100
@@ -168,7 +168,7 @@
 static void rtl_hw_start_8168(struct net_device *);
 static void rtl_hw_start_8101(struct net_device *);
 
-static DEFINE_PCI_DEVICE_TABLE(rtl8169_pci_tbl) = {
+static struct pci_device_id rtl8169_pci_tbl[] = {
 	{ PCI_DEVICE(PCI_VENDOR_ID_REALTEK,	0x8129), 0, 0, RTL_CFG_0 },
 	{ PCI_DEVICE(PCI_VENDOR_ID_REALTEK,	0x8136), 0, 0, RTL_CFG_2 },
 	{ PCI_DEVICE(PCI_VENDOR_ID_REALTEK,	0x8167), 0, 0, RTL_CFG_0 },
@@ -749,10 +749,12 @@
 	spin_lock_irqsave(&tp->lock, flags);
 	if (tp->link_ok(ioaddr)) {
 		netif_carrier_on(dev);
-		netif_info(tp, ifup, dev, "link up\n");
+		if (netif_msg_ifup(tp))
+			printk(KERN_INFO PFX "%s: link up\n", dev->name);
 	} else {
+		if (netif_msg_ifdown(tp))
+			printk(KERN_INFO PFX "%s: link down\n", dev->name);
 		netif_carrier_off(dev);
-		netif_info(tp, ifdown, dev, "link down\n");
 	}
 	spin_unlock_irqrestore(&tp->lock, flags);
 }
@@ -865,8 +867,11 @@
 	} else if (autoneg == AUTONEG_ENABLE)
 		RTL_W32(TBICSR, reg | TBINwEnable | TBINwRestart);
 	else {
-		netif_warn(tp, link, dev,
-			   "incorrect speed setting refused in TBI mode\n");
+		if (netif_msg_link(tp)) {
+			printk(KERN_WARNING "%s: "
+			       "incorrect speed setting refused in TBI mode\n",
+			       dev->name);
+		}
 		ret = -EOPNOTSUPP;
 	}
 
@@ -901,9 +906,9 @@
 		    (tp->mac_version != RTL_GIGA_MAC_VER_15) &&
 		    (tp->mac_version != RTL_GIGA_MAC_VER_16)) {
 			giga_ctrl |= ADVERTISE_1000FULL | ADVERTISE_1000HALF;
-		} else {
-			netif_info(tp, link, dev,
-				   "PHY does not support 1000Mbps\n");
+		} else if (netif_msg_link(tp)) {
+			printk(KERN_INFO "%s: PHY does not support 1000Mbps.\n",
+			       dev->name);
 		}
 
 		bmcr = BMCR_ANENABLE | BMCR_ANRESTART;
@@ -2705,7 +2710,8 @@
 	if (tp->link_ok(ioaddr))
 		goto out_unlock;
 
-	netif_warn(tp, link, dev, "PHY reset until link up\n");
+	if (netif_msg_link(tp))
+		printk(KERN_WARNING "%s: PHY reset until link up\n", dev->name);
 
 	tp->phy_reset_enable(ioaddr);
 
@@ -2776,7 +2782,8 @@
 			return;
 		msleep(1);
 	}
-	netif_err(tp, link, dev, "PHY reset failed\n");
+	if (netif_msg_link(tp))
+		printk(KERN_ERR "%s: PHY reset failed.\n", dev->name);
 }
 
 static void rtl8169_init_phy(struct net_device *dev, struct rtl8169_private *tp)
@@ -2810,8 +2817,8 @@
 	 */
 	rtl8169_set_speed(dev, AUTONEG_ENABLE, SPEED_1000, DUPLEX_FULL);
 
-	if (RTL_R8(PHYstatus) & TBI_Enable)
-		netif_info(tp, link, dev, "TBI auto-negotiating\n");
+	if ((RTL_R8(PHYstatus) & TBI_Enable) && netif_msg_link(tp))
+		printk(KERN_INFO PFX "%s: TBI auto-negotiating\n", dev->name);
 }
 
 static void rtl_rar_set(struct rtl8169_private *tp, u8 *addr)
@@ -3016,33 +3023,41 @@
 	/* enable device (incl. PCI PM wakeup and hotplug setup) */
 	rc = pci_enable_device(pdev);
 	if (rc < 0) {
-		netif_err(tp, probe, dev, "enable failure\n");
+		if (netif_msg_probe(tp))
+			dev_err(&pdev->dev, "enable failure\n");
 		goto err_out_free_dev_1;
 	}
 
 	if (pci_set_mwi(pdev) < 0)
-		netif_info(tp, probe, dev, "Mem-Wr-Inval unavailable\n");
+		if (netif_msg_probe(tp)) {
+			dev_err(&pdev->dev, "Mem-Wr-Inval unavailable\n");
+		}
 
 	/* make sure PCI base addr 1 is MMIO */
 	if (!(pci_resource_flags(pdev, region) & IORESOURCE_MEM)) {
-		netif_err(tp, probe, dev,
-			  "region #%d not an MMIO resource, aborting\n",
-			  region);
+		if (netif_msg_probe(tp)) {
+			dev_err(&pdev->dev,
+				"region #%d not an MMIO resource, aborting\n",
+				region);
+		}
 		rc = -ENODEV;
 		goto err_out_mwi_2;
 	}
 
 	/* check for weird/broken PCI region reporting */
 	if (pci_resource_len(pdev, region) < R8169_REGS_SIZE) {
-		netif_err(tp, probe, dev,
-			  "Invalid PCI region size(s), aborting\n");
+		if (netif_msg_probe(tp)) {
+			dev_err(&pdev->dev,
+				"Invalid PCI region size(s), aborting\n");
+		}
 		rc = -ENODEV;
 		goto err_out_mwi_2;
 	}
 
 	rc = pci_request_regions(pdev, MODULENAME);
 	if (rc < 0) {
-		netif_err(tp, probe, dev, "could not request regions\n");
+		if (netif_msg_probe(tp))
+			dev_err(&pdev->dev, "could not request regions.\n");
 		goto err_out_mwi_2;
 	}
 
@@ -3055,7 +3070,10 @@
 	} else {
 		rc = pci_set_dma_mask(pdev, DMA_BIT_MASK(32));
 		if (rc < 0) {
-			netif_err(tp, probe, dev, "DMA configuration failed\n");
+			if (netif_msg_probe(tp)) {
+				dev_err(&pdev->dev,
+					"DMA configuration failed.\n");
+			}
 			goto err_out_free_res_3;
 		}
 	}
@@ -3063,14 +3081,15 @@
 	/* ioremap MMIO region */
 	ioaddr = ioremap(pci_resource_start(pdev, region), R8169_REGS_SIZE);
 	if (!ioaddr) {
-		netif_err(tp, probe, dev, "cannot remap MMIO, aborting\n");
+		if (netif_msg_probe(tp))
+			dev_err(&pdev->dev, "cannot remap MMIO, aborting\n");
 		rc = -EIO;
 		goto err_out_free_res_3;
 	}
 
 	tp->pcie_cap = pci_find_capability(pdev, PCI_CAP_ID_EXP);
-	if (!tp->pcie_cap)
-		netif_info(tp, probe, dev, "no PCI Express capability\n");
+	if (!tp->pcie_cap && netif_msg_probe(tp))
+		dev_info(&pdev->dev, "no PCI Express capability\n");
 
 	RTL_W16(IntrMask, 0x0000);
 
@@ -3093,8 +3112,10 @@
 
 	/* Use appropriate default if unknown */
 	if (tp->mac_version == RTL_GIGA_MAC_NONE) {
-		netif_notice(tp, probe, dev,
-			     "unknown MAC, using family default\n");
+		if (netif_msg_probe(tp)) {
+			dev_notice(&pdev->dev,
+				   "unknown MAC, using family default\n");
+		}
 		tp->mac_version = cfg->default_ver;
 	}
 
@@ -3176,10 +3197,19 @@
 
 	pci_set_drvdata(pdev, dev);
 
-	netif_info(tp, probe, dev, "%s at 0x%lx, %pM, XID %08x IRQ %d\n",
-		   rtl_chip_info[tp->chipset].name,
-		   dev->base_addr, dev->dev_addr,
-		   (u32)(RTL_R32(TxConfig) & 0x9cf0f8ff), dev->irq);
+	if (netif_msg_probe(tp)) {
+		u32 xid = RTL_R32(TxConfig) & 0x9cf0f8ff;
+
+		printk(KERN_INFO "%s: %s at 0x%lx, "
+		       "%2.2x:%2.2x:%2.2x:%2.2x:%2.2x:%2.2x, "
+		       "XID %08x IRQ %d\n",
+		       dev->name,
+		       rtl_chip_info[tp->chipset].name,
+		       dev->base_addr,
+		       dev->dev_addr[0], dev->dev_addr[1],
+		       dev->dev_addr[2], dev->dev_addr[3],
+		       dev->dev_addr[4], dev->dev_addr[5], xid, dev->irq);
+	}
 
 	rtl8169_init_phy(dev, tp);
 
@@ -3231,8 +3261,8 @@
 	unsigned int max_frame = mtu + VLAN_ETH_HLEN + ETH_FCS_LEN;
 
 	if (max_frame != 16383)
-		printk(KERN_WARNING PFX "WARNING! Changing of MTU on this "
-			"NIC may lead to frame reception errors!\n");
+		printk(KERN_WARNING "WARNING! Changing of MTU on this NIC "
+			"May lead to frame reception errors!\n");
 
 	tp->rx_buf_sz = (max_frame > RX_BUF_SIZE) ? max_frame : RX_BUF_SIZE;
 }
@@ -4131,10 +4161,10 @@
 
 	ret = rtl8169_open(dev);
 	if (unlikely(ret < 0)) {
-		if (net_ratelimit())
-			netif_err(tp, drv, dev,
-				  "reinit failure (status = %d). Rescheduling\n",
-				  ret);
+		if (net_ratelimit() && netif_msg_drv(tp)) {
+			printk(KERN_ERR PFX "%s: reinit failure (status = %d)."
+			       " Rescheduling.\n", dev->name, ret);
+		}
 		rtl8169_schedule_work(dev, rtl8169_reinit_task);
 	}
 
@@ -4164,8 +4194,10 @@
 		netif_wake_queue(dev);
 		rtl8169_check_link_status(dev, tp, tp->mmio_addr);
 	} else {
-		if (net_ratelimit())
-			netif_emerg(tp, intr, dev, "Rx buffers shortage\n");
+		if (net_ratelimit() && netif_msg_intr(tp)) {
+			printk(KERN_EMERG PFX "%s: Rx buffers shortage\n",
+			       dev->name);
+		}
 		rtl8169_schedule_work(dev, rtl8169_reset_task);
 	}
 
@@ -4253,7 +4285,11 @@
 	u32 opts1;
 
 	if (unlikely(TX_BUFFS_AVAIL(tp) < skb_shinfo(skb)->nr_frags)) {
-		netif_err(tp, drv, dev, "BUG! Tx Ring full when queue awake!\n");
+		if (netif_msg_drv(tp)) {
+			printk(KERN_ERR
+			       "%s: BUG! Tx Ring full when queue awake!\n",
+			       dev->name);
+		}
 		goto err_stop;
 	}
 
@@ -4315,8 +4351,11 @@
 	pci_read_config_word(pdev, PCI_COMMAND, &pci_cmd);
 	pci_read_config_word(pdev, PCI_STATUS, &pci_status);
 
-	netif_err(tp, intr, dev, "PCI error (cmd = 0x%04x, status = 0x%04x)\n",
-		  pci_cmd, pci_status);
+	if (netif_msg_intr(tp)) {
+		printk(KERN_ERR
+		       "%s: PCI error (cmd = 0x%04x, status = 0x%04x).\n",
+		       dev->name, pci_cmd, pci_status);
+	}
 
 	/*
 	 * The recovery sequence below admits a very elaborated explanation:
@@ -4340,7 +4379,8 @@
 
 	/* The infamous DAC f*ckup only happens at boot time */
 	if ((tp->cp_cmd & PCIDAC) && !tp->dirty_rx && !tp->cur_rx) {
-		netif_info(tp, intr, dev, "disabling PCI DAC\n");
+		if (netif_msg_intr(tp))
+			printk(KERN_INFO "%s: disabling PCI DAC.\n", dev->name);
 		tp->cp_cmd &= ~PCIDAC;
 		RTL_W16(CPlusCmd, tp->cp_cmd);
 		dev->features &= ~NETIF_F_HIGHDMA;
@@ -4432,12 +4472,13 @@
 	if (pkt_size >= rx_copybreak)
 		goto out;
 
-	skb = netdev_alloc_skb_ip_align(tp->dev, pkt_size);
+	skb = netdev_alloc_skb(tp->dev, pkt_size + NET_IP_ALIGN);
 	if (!skb)
 		goto out;
 
 	pci_dma_sync_single_for_cpu(tp->pci_dev, addr, pkt_size,
 				    PCI_DMA_FROMDEVICE);
+	skb_reserve(skb, NET_IP_ALIGN);
 	skb_copy_from_linear_data(*sk_buff, skb->data, pkt_size);
 	*sk_buff = skb;
 	done = true;
@@ -4475,8 +4516,11 @@
 		if (status & DescOwn)
 			break;
 		if (unlikely(status & RxRES)) {
-			netif_info(tp, rx_err, dev, "Rx ERROR. status = %08x\n",
-				   status);
+			if (netif_msg_rx_err(tp)) {
+				printk(KERN_INFO
+				       "%s: Rx ERROR. status = %08x\n",
+				       dev->name, status);
+			}
 			dev->stats.rx_errors++;
 			if (status & (RxRWT | RxRUNT))
 				dev->stats.rx_length_errors++;
@@ -4543,8 +4587,8 @@
 	tp->cur_rx = cur_rx;
 
 	delta = rtl8169_rx_fill(tp, dev, tp->dirty_rx, tp->cur_rx);
-	if (!delta && count)
-		netif_info(tp, intr, dev, "no Rx buffer allocated\n");
+	if (!delta && count && netif_msg_intr(tp))
+		printk(KERN_INFO "%s: no Rx buffer allocated\n", dev->name);
 	tp->dirty_rx += delta;
 
 	/*
@@ -4554,8 +4598,8 @@
 	 *   after refill ?
 	 * - how do others driver handle this condition (Uh oh...).
 	 */
-	if (tp->dirty_rx + NUM_RX_DESC == tp->cur_rx)
-		netif_emerg(tp, intr, dev, "Rx buffers exhausted\n");
+	if ((tp->dirty_rx + NUM_RX_DESC == tp->cur_rx) && netif_msg_intr(tp))
+		printk(KERN_EMERG "%s: Rx buffers exhausted\n", dev->name);
 
 	return count;
 }
@@ -4610,9 +4654,10 @@
 
 			if (likely(napi_schedule_prep(&tp->napi)))
 				__napi_schedule(&tp->napi);
-			else
-				netif_info(tp, intr, dev,
-					   "interrupt %04x in poll\n", status);
+			else if (netif_msg_intr(tp)) {
+				printk(KERN_INFO "%s: interrupt %04x in poll\n",
+				dev->name, status);
+			}
 		}
 
 		/* We only get a new MSI interrupt when all active irq
@@ -4748,22 +4793,27 @@
 
 	if (dev->flags & IFF_PROMISC) {
 		/* Unconditionally log net taps. */
-		netif_notice(tp, link, dev, "Promiscuous mode enabled\n");
+		if (netif_msg_link(tp)) {
+			printk(KERN_NOTICE "%s: Promiscuous mode enabled.\n",
+			       dev->name);
+		}
 		rx_mode =
 		    AcceptBroadcast | AcceptMulticast | AcceptMyPhys |
 		    AcceptAllPhys;
 		mc_filter[1] = mc_filter[0] = 0xffffffff;
-	} else if ((netdev_mc_count(dev) > multicast_filter_limit) ||
-		   (dev->flags & IFF_ALLMULTI)) {
+	} else if ((dev->mc_count > multicast_filter_limit)
+		   || (dev->flags & IFF_ALLMULTI)) {
 		/* Too many to filter perfectly -- accept all multicasts. */
 		rx_mode = AcceptBroadcast | AcceptMulticast | AcceptMyPhys;
 		mc_filter[1] = mc_filter[0] = 0xffffffff;
 	} else {
 		struct dev_mc_list *mclist;
+		unsigned int i;
 
 		rx_mode = AcceptBroadcast | AcceptMyPhys;
 		mc_filter[1] = mc_filter[0] = 0;
-		netdev_for_each_mc_addr(mclist, dev) {
+		for (i = 0, mclist = dev->mc_list; mclist && i < dev->mc_count;
+		     i++, mclist = mclist->next) {
 			int bit_nr = ether_crc(ETH_ALEN, mclist->dmi_addr) >> 26;
 			mc_filter[bit_nr >> 5] |= 1 << (bit_nr & 31);
 			rx_mode |= AcceptMulticast;

^ permalink raw reply

* Re: linux kernel's IPV6_MULTICAST_HOPS default is 64; should be 1?
From: Brian Haley @ 2010-05-05 15:36 UTC (permalink / raw)
  To: David Miller; +Cc: dlstevens, enh, netdev, netdev-owner
In-Reply-To: <20100504.144647.157477097.davem@davemloft.net>

David Miller wrote:
> From: Brian Haley <brian.haley@hp.com>
> Date: Tue, 04 May 2010 10:40:58 -0400
> 
>> Specifying -1 for setsockopt(IPV6_MULTICAST_HOPS) should set the socket
>> value back to the system default value of IPV6_DEFAULT_MCASTHOPS (1).
>>
>> Signed-off-by: Brian Haley <brian.haley@hp.com>
> 
> In cast it wasn't clear from my other reply, I'm not applying this
> patch because I intentionally left this behavior there based upon
> some comments from Elliot in that this lets developers get the
> old default by asking for "-1" explicitly with a setsockopt.

I now see that in Elliot's email, but I think it's incorrect.  The RFC
says that setting it to -1 should get you the kernel default, which is
now 1.  Without this change, setting it to -1 will get you 64, the
old behavior.  If the user wants to, they can always just set it to
64 themselves, that's better than assuming when you set it to -1
you're going to get 64.

I'm just trying to make this follow the RFC and behave like other OSes
for consistency.

-Brian

^ permalink raw reply

* Fw: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join on same multicast-group dont return EINVAL
From: Stephen Hemminger @ 2010-05-05 15:21 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev



Begin forwarded message:

Date: Wed, 5 May 2010 09:48:40 GMT
From: bugzilla-daemon@bugzilla.kernel.org
To: shemminger@linux-foundation.org
Subject: [Bug 15907] New: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join on same multicast-group dont return EINVAL


https://bugzilla.kernel.org/show_bug.cgi?id=15907

           Summary: IP_ADD_SOURCE_MEMBERSHIP after IP_ADD_MEMBERSHIP join
                    on same multicast-group dont return EINVAL
           Product: Networking
           Version: 2.5
    Kernel Version: 2.6.34-rc6
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: IPV4
        AssignedTo: shemminger@linux-foundation.org
        ReportedBy: mail@fholler.de
        Regression: No


Created an attachment (id=26225)
 --> (https://bugzilla.kernel.org/attachment.cgi?id=26225)
asm+ssm join test program

When an SSM IP_ADD_SOURCE_MEMBERSHIP is done after an ASM IP_ADD_MEMBERSHIP
join on the same group(& same interface) the setsockopt operation should return
EINVAL.

The linux implementation returns successfull



https://www3.tools.ietf.org/html/rfc3678#section-4.1.3

I attached an simple C test program.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
------- You are receiving this mail because: -------
You are the assignee for the bug.


-- 

^ permalink raw reply

* Re: [PATCH v2] net/gianfar: drop recycled skbs on MTU change
From: Andy Fleming @ 2010-05-05 15:18 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior; +Cc: David Miller, afleming, netdev
In-Reply-To: <20100505083047.GA4398@Chamillionaire.breakpoint.cc>


On May 5, 2010, at 3:30 AM, Sebastian Andrzej Siewior wrote:

> From: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> 
> The size for skbs which is added to the recycled list is using the
> current descriptor size which is current MTU. gfar_new_skb() is also
> using this size. So after changing or alteast increasing the MTU all
> recycled skbs should be dropped.
> 
> Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> ---
>>> I think we should probably do this in free_skb_resources.  And remove
>>> the call from gfar_close().
>> 
>> Ok, Sebastian please rework your patch as requested by Andy.
> 
> This has the side effect of dropping them on reset which is not
> required.

Sure, but that's true of the buffers in the BD ring, too.  My theory, here, is just that it's best to treat it the same as the other "skb resources".  If we want to avoid reallocation during a reset, that's a separate patch.  :)

Acked-by: Andy Fleming <afleming@freescale.com>


^ permalink raw reply

* Re: [PATCH 2/6] netns: Teach network device kobjects which namespace they are in.
From: Serge E. Hallyn @ 2010-05-05 15:17 UTC (permalink / raw)
  To: Eric W. Biederman
  Cc: Greg Kroah-Hartman, Kay Sievers, linux-kernel, Tejun Heo,
	Cornelia Huck, Eric Dumazet, Benjamin LaHaise, netdev,
	David Miller
In-Reply-To: <1273019809-16472-2-git-send-email-ebiederm@xmission.com>

Quoting Eric W. Biederman (ebiederm@xmission.com):
> diff --git a/net/Kconfig b/net/Kconfig
> index 041c35e..265e33b 100644
> --- a/net/Kconfig
> +++ b/net/Kconfig
> @@ -45,6 +45,14 @@ config COMPAT_NETLINK_MESSAGES
> 
>  menu "Networking options"
> 
> +config NET_NS
> +	bool "Network namespace support"
> +	default n
> +	depends on EXPERIMENTAL && NAMESPACES
> +	help
> +	  Allow user space to create what appear to be multiple instances
> +	  of the network stack.
> +

Hi Eric,

I'm confused - NET_NS is defined in init/Kconfig right now.  Is the tree
you're working from very different from mine, or is this the unfortunate
rekult of the patches sitting so long?

>  source "net/packet/Kconfig"
>  source "net/unix/Kconfig"
>  source "net/xfrm/Kconfig"
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index 099c753..1b98e36 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -13,7 +13,9 @@
>  #include <linux/kernel.h>
>  #include <linux/netdevice.h>
>  #include <linux/if_arp.h>
> +#include <linux/nsproxy.h>
>  #include <net/sock.h>
> +#include <net/net_namespace.h>
>  #include <linux/rtnetlink.h>
>  #include <linux/wireless.h>
>  #include <net/wext.h>
> @@ -466,6 +468,37 @@ static struct attribute_group wireless_group = {
>  };
>  #endif
> 
> +static const void *net_current_ns(void)
> +{
> +	return current->nsproxy->net_ns;
> +}
> +
> +static const void *net_initial_ns(void)
> +{
> +	return &init_net;
> +}
> +
> +static const void *net_netlink_ns(struct sock *sk)
> +{
> +	return sock_net(sk);
> +}
> +
> +static struct kobj_ns_type_operations net_ns_type_operations = {
> +	.type = KOBJ_NS_TYPE_NET,
> +	.current_ns = net_current_ns,
> +	.netlink_ns = net_netlink_ns,
> +	.initial_ns = net_initial_ns,
> +};
> +
> +static void net_kobj_ns_exit(struct net *net)
> +{
> +	kobj_ns_exit(KOBJ_NS_TYPE_NET, net);
> +}
> +
> +static struct pernet_operations sysfs_net_ops = {
> +	.exit = net_kobj_ns_exit,
> +};
> +
>  #endif /* CONFIG_SYSFS */

...

>  int netdev_kobject_init(void)
>  {
> +	kobj_ns_type_register(&net_ns_type_operations);
> +#ifdef CONFIG_SYSFS
> +	register_pernet_subsys(&sysfs_net_ops);
> +#endif
>  	return class_register(&net_class);

I think the kobj_ns_type_register() needs to be under
ifdef CONFIG_SYSFS as well, bc net_ns_type_operations is defined
under ifdef CONFIG_SYSFS.

-serge

^ permalink raw reply

* [PATCH] nf_conntrack_core.c: fix for dead connection after flushing conntrack cache
From: Joerg Marx @ 2010-05-05 14:46 UTC (permalink / raw)
  To: netdev; +Cc: kaber

Hi,

we encountered a weird problem of a 'stalled' connection when using
'conntrack -F' on a box with heavy network load. 'conntrack -L' gave us
sometimes a [UNREPLIED] entry for the traffic in question, but no traffic
flow, no matching of packets in or out, only the timer went down from 600
to 0 (it was ESP traffic - the default generic timeout = 600 seconds).
After the entry vanished by timeout (or doing a 'conntrack -F' once more),
all worked normal again.

The reason we finally found, is a race window in 'nf_conntrack_confirm'
when calling '__nf_conntrack_confirm':

In 'nf_conntrack_confirm' is checked (without holding a lock), if the entry
to be confirmed is possibly dying: !nf_ct_is_dying(ct).
If not, then __nf_conntrack_confirm will do some sanity checking, grab a
spin_lock_bh and insert the 'ct' into the lookup cache.

Now consider the following scenario:

1. a connection has seen the first packet already -> state is UNREPLIED
2. now the answer is to be sent, conntrack wants to confirm the connection
3. the !nf_ct_is_dying(ct) check is passed, __nf_conntrack_confirm is just
started
4. in a user context a 'conntrack -F' command is running right now e.g. on
another CPU
5. this will flag all unconfirmed connections as 'dying' in
get_next_corpse(...), including the entry going to be confirmed!
6. now the already 'dying' entry is included into the hash cache in
__nf_conntrack_confirm - BOOM!

After this step the connection in question is dead, because no packets are
forwarded until the entry is purged from hash cache. This was a big blocker
for us, because each dead IPsec tunnel is a dead branch network for 10
minutes...

For every packet from now on 'nf_conntrack_find_get' will ignore the entry,
because it is dying and because __nf_conntrack_confirm finds the hash in
the cache already, it will NF_DROP the packet.

The key for finding this was 'NF_CT_STAT_INC(net, insert_failed)' in
__nf_conntrack_confirm.

The suggested solution is to check for '!nf_ct_is_dying(ct)' again, _after_
the spin_lock_bh is grabbed in __nf_conntrack_confirm. So it is clear, that
no other softirq or user context can set that 'evil' dying flag ;-)
The return value in this case should be NF_ACCEPT, so we loose no packets
then, this is also important for us.

---
 net/netfilter/nf_conntrack_core.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/net/netfilter/nf_conntrack_core.c
b/net/netfilter/nf_conntrack_core.c
index 1374179..e2c8bfe 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -413,6 +413,11 @@ __nf_conntrack_confirm(struct sk_buff *skb)

 	spin_lock_bh(&nf_conntrack_lock);

+	if (unlikely(nf_ct_is_dying(ct))) {
+		spin_unlock_bh(&nf_conntrack_lock);
+		return NF_ACCEPT;
+	}
+
 	/* See if there's one in the list already, including reverse:
 	   NAT could have grabbed it without realizing, since we're
 	   not in the hash.  If there is, we lost race. */
-- 1.5.6.5

Best regards
Joerg.

-- 

^ permalink raw reply related

* Kernel 2.6.33-2 - ip -6 addr list does not show any addresses
From: Samuel Suter @ 2010-05-05 14:43 UTC (permalink / raw)
  To: netdev

Hello all

I am repeatedly hitting a problem on two of my servers running 
kernel.org 2.6.33-2 on Debian 5.0 x86_64 using the Debian 5.0 stable 
iproute package (iproute2-ss080725).

There is a clear discrepancy between the information returned by 
'ifconfig' and 'ip addr' (or 'ip -6 addr'). This server has 26 physical 
interfaces and ifconfig lists all of them with all their assigned 
addresses, but 'ip addr' only lists my loopback and one of the interfaces.

The actual problem emerges when I try and flush all addresses and add 
them again:

# ip -6 addr flush dev eth24
Nothing to flush.
# ip -6 addr add 2001:470:921b:7845::3d/64 dev eth24
RTNETLINK answers: File exists
# ifconfig eth24
eth24     Link encap:Ethernet  HWaddr 00:22:19:c4:7a:1d 
          inet addr:10.7.24.61  Bcast:0.0.0.0  Mask:255.255.255.128
          inet6 addr: 2001:470:921b:7845::77/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::66/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::55/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::44/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::76/64 Scope:Global
          inet6 addr: 2001:470:921b:7845::67/64 Scope:Global
          .
          .
          .
# ip -6 addr list dev eth24
#

This shows that 'ifconfig' knows about these addresses, but 'ip' does 
not. All the IPv6 addresses show up in /proc/net/if_inet6.
# grep eth24 /proc/net/if_inet6 | wc -l
57
# ifconfig eth24 | grep inet6 | wc -l
57

This leaves me in a situation where I an unable to flush the IP 
addresses on a device.

This problem seems to happen more often on a server running kernel.org 
2.6.33-2 kernel as opposed to the Debian patched 2.6.26, though it did 
happen with the debian kernel.

Regards

Samuel Suter

^ permalink raw reply

* Re: [PATCH -next 2/3] bnx2: Add prefetches to rx path.
From: Michael Chan @ 2010-05-05 15:01 UTC (permalink / raw)
  To: 'Eric Dumazet'; +Cc: davem@davemloft.net, netdev@vger.kernel.org
In-Reply-To: <1273038330.2304.10.camel@edumazet-laptop>

Eric Dumazet wrote:

> > @@ -3097,7 +3099,11 @@ bnx2_rx_int(struct bnx2 *bp, struct
> bnx2_napi *bnapi, int budget)
> >
> >             rx_buf = &rxr->rx_buf_ring[sw_ring_cons];
> >             skb = rx_buf->skb;
> > +           prefetch(skb);
>
> why not a prefetchw() ?

Yes, didn't know there was a prefetchw() before you pointed it
out.

>
> >
> > +           next_rx_buf =
> > +
> &rxr->rx_buf_ring[RX_RING_IDX(NEXT_RX_BD(sw_cons))];
> > +           prefetch(next_rx_buf->desc);
>
> So cpu is allowed to start a memory transaction on
> next_skb->data, while
> not yes DMA unmapped ?

Very good point.  The prefetch() will not work and will be
wasted on systems that have pci_dma_sync_...() defined.  I
think we can skip the prefetch if pci_dma_sync_...() is
defined.  The logic to determine if the next descriptor is
ready, dma_sync it, and prefetch it will be complicated and
we may end up not gaining any performance in the end.

What do you think?  Thanks.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox