Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [PATCH] cxgb{3,4}: streamline Kconfig options
From: David Miller @ 2011-02-23 20:27 UTC (permalink / raw)
  To: JBeulich; +Cc: divy, dm, linux-kbuild, netdev
In-Reply-To: <4D64E5720200007800033420@vpn.id2.novell.com>

From: "Jan Beulich" <JBeulich@novell.com>
Date: Wed, 23 Feb 2011 09:46:10 +0000

>>>> On 22.02.11 at 19:14, David Miller <davem@davemloft.net> wrote:
>> From: "Jan Beulich" <JBeulich@novell.com>
>> Date: Thu, 17 Feb 2011 13:29:30 +0000
>> 
>>> The CHELSIO_T{3,4}_DEPENDS options are really awkward, and can be
>>> easily dropped if the reverse dependencies of SCSI_CXGB{3,4}_ISCSI on
>>> the former get converted to normal (forward) ones referring to
>>> CHELSIO_T{3,4}.
>>> 
>>> Signed-off-by: Jan Beulich <jbeulich@novell.com>
>> 
>> I think the goal of these strange rules is not to be complicated
>> on purpose, but rather to cause the iSCSI drivers to appear without
>> the user having to know that he needs to enable the networking
>> driver in order for that to happen.
> 
> While I realize that this might have been the reason, it's completely
> contrary to how everyone else writes dependencies, and hence I
> think these should be removed.

If you knew you were changing the behavior of the config option in
this way, you sure didn't think it was worth mentioning in your commit
message.

I definitely would never expect to have to enable a scsi option to get
some network driver visible to enable in the config, and therefore I
could see the opposite being insanely frustrating too.

You can't ignore these issues and just say "that's not the normal way
so I'm going to change it anyways."



^ permalink raw reply

* Re: [PATCH] r8169: incorrect args to oob notify
From: Francois Romieu @ 2011-02-23 20:00 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: David Miller, netdev, Hayes Wang
In-Reply-To: <20110223095237.0ee9c130@nehalam>

Stephen Hemminger <shemminger@vyatta.com> :
[...]
> Sparse detected this bug. The function oob_notify was being
> passed the private ptr, but expected to get the ioaddr.
> Compile checked only; not tested on real hardware.
> 
> Patch against net-next-2.6 but should be applied to net-2.6.
> Bug not present in 2.6.37 and earlier.

Hayes has sent this fix and a few others. I am giving them a short
testing right now.

-- 
Ueimor

^ permalink raw reply

* [PATCH 2/5] atl1[ce]: fix sparse warnings
From: Stephen Hemminger @ 2011-02-23 19:06 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110223190647.482444598@vyatta.com>

[-- Attachment #1: atl1c-sparse.patch --]
[-- Type: text/plain, Size: 1751 bytes --]

The dmaw_block is an enum and max_pay_load is u32. Therefore
sparse gives warning about comparison of unsigned and signed value.
Resolve by using min_t to force cast.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/atl1c/atl1c_main.c	2011-02-23 09:58:11.282457630 -0800
+++ b/drivers/net/atl1c/atl1c_main.c	2011-02-23 09:58:42.626823758 -0800
@@ -1102,10 +1102,10 @@ static void atl1c_configure_tx(struct at
 	AT_READ_REG(hw, REG_DEVICE_CTRL, &dev_ctrl_data);
 	max_pay_load  = (dev_ctrl_data >> DEVICE_CTRL_MAX_PAYLOAD_SHIFT) &
 			DEVICE_CTRL_MAX_PAYLOAD_MASK;
-	hw->dmaw_block = min(max_pay_load, hw->dmaw_block);
+	hw->dmaw_block = min_t(u32, max_pay_load, hw->dmaw_block);
 	max_pay_load  = (dev_ctrl_data >> DEVICE_CTRL_MAX_RREQ_SZ_SHIFT) &
 			DEVICE_CTRL_MAX_RREQ_SZ_MASK;
-	hw->dmar_block = min(max_pay_load, hw->dmar_block);
+	hw->dmar_block = min_t(u32, max_pay_load, hw->dmar_block);
 
 	txq_ctrl_data = (hw->tpd_burst & TXQ_NUM_TPD_BURST_MASK) <<
 			TXQ_NUM_TPD_BURST_SHIFT;
--- a/drivers/net/atl1e/atl1e_main.c	2011-02-23 09:58:52.594940247 -0800
+++ b/drivers/net/atl1e/atl1e_main.c	2011-02-23 09:59:03.859071913 -0800
@@ -932,11 +932,11 @@ static inline void atl1e_configure_tx(st
 	max_pay_load  = ((dev_ctrl_data >> DEVICE_CTRL_MAX_PAYLOAD_SHIFT)) &
 			DEVICE_CTRL_MAX_PAYLOAD_MASK;
 
-	hw->dmaw_block = min(max_pay_load, hw->dmaw_block);
+	hw->dmaw_block = min_t(u32, max_pay_load, hw->dmaw_block);
 
 	max_pay_load  = ((dev_ctrl_data >> DEVICE_CTRL_MAX_RREQ_SZ_SHIFT)) &
 			DEVICE_CTRL_MAX_RREQ_SZ_MASK;
-	hw->dmar_block = min(max_pay_load, hw->dmar_block);
+	hw->dmar_block = min_t(u32, max_pay_load, hw->dmar_block);
 
 	if (hw->nic_type != athr_l2e_revB)
 		AT_WRITE_REGW(hw, REG_TXQ_CTRL + 2,



^ permalink raw reply

* [PATCH 4/5] mqprio: cleanups
From: Stephen Hemminger @ 2011-02-23 19:06 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110223190647.482444598@vyatta.com>

[-- Attachment #1: mqprio-sparse.patch --]
[-- Type: text/plain, Size: 914 bytes --]

* make qdisc_ops local 
* add sparse annotation about expected unlock/unlock in dump_class_stats
* fix indentation

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/sched/sch_mqprio.c	2011-02-23 10:22:17.171259588 -0800
+++ b/net/sched/sch_mqprio.c	2011-02-23 10:28:51.955835730 -0800
@@ -311,7 +311,9 @@ static int mqprio_dump_class(struct Qdis
 }
 
 static int mqprio_dump_class_stats(struct Qdisc *sch, unsigned long cl,
-			       struct gnet_dump *d)
+				   struct gnet_dump *d)
+	__releases(d->lock)
+	__acquires(d->lock)
 {
 	struct net_device *dev = qdisc_dev(sch);
 
@@ -389,7 +391,7 @@ static const struct Qdisc_class_ops mqpr
 	.dump_stats	= mqprio_dump_class_stats,
 };
 
-struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
+static struct Qdisc_ops mqprio_qdisc_ops __read_mostly = {
 	.cl_ops		= &mqprio_class_ops,
 	.id		= "mqprio",
 	.priv_size	= sizeof(struct mqprio_sched),



^ permalink raw reply

* [PATCH 1/5] socket: suppress sparse warnings
From: Stephen Hemminger @ 2011-02-23 19:06 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110223190647.482444598@vyatta.com>

[-- Attachment #1: socket-sparse --]
[-- Type: text/plain, Size: 1122 bytes --]

Use __force to quiet sparse warnings for cases where the code
is simulating user space pointers.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/socket.c	2011-02-23 08:26:51.905069267 -0800
+++ b/net/socket.c	2011-02-23 08:26:59.005160474 -0800
@@ -2648,7 +2648,8 @@ static int bond_ioctl(struct net *net, u
 
 		old_fs = get_fs();
 		set_fs(KERNEL_DS);
-		err = dev_ioctl(net, cmd, &kifr);
+		err = dev_ioctl(net, cmd,
+				(struct ifreq __user __force *) &kifr);
 		set_fs(old_fs);
 
 		return err;
@@ -2757,7 +2758,7 @@ static int compat_sioc_ifmap(struct net
 
 	old_fs = get_fs();
 	set_fs(KERNEL_DS);
-	err = dev_ioctl(net, cmd, (void __user *)&ifr);
+	err = dev_ioctl(net, cmd, (void  __user __force *)&ifr);
 	set_fs(old_fs);
 
 	if (cmd == SIOCGIFMAP && !err) {
@@ -2862,7 +2863,8 @@ static int routing_ioctl(struct net *net
 		ret |= __get_user(rtdev, &(ur4->rt_dev));
 		if (rtdev) {
 			ret |= copy_from_user(devname, compat_ptr(rtdev), 15);
-			r4.rt_dev = devname; devname[15] = 0;
+			r4.rt_dev = (char __user __force *)devname;
+			devname[15] = 0;
 		} else
 			r4.rt_dev = NULL;
 



^ permalink raw reply

* [PATCH 3/5] afkey: add sparse annotation about rcu
From: Stephen Hemminger @ 2011-02-23 19:06 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110223190647.482444598@vyatta.com>

[-- Attachment #1: afkey-sparse --]
[-- Type: text/plain, Size: 604 bytes --]

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/key/af_key.c	2011-02-23 10:20:23.181943425 -0800
+++ b/net/key/af_key.c	2011-02-23 10:20:47.290221581 -0800
@@ -3655,6 +3655,7 @@ static int pfkey_seq_show(struct seq_fil
 }
 
 static void *pfkey_seq_start(struct seq_file *f, loff_t *ppos)
+	__acquires(rcu)
 {
 	struct net *net = seq_file_net(f);
 	struct netns_pfkey *net_pfkey = net_generic(net, pfkey_net_id);
@@ -3672,6 +3673,7 @@ static void *pfkey_seq_next(struct seq_f
 }
 
 static void pfkey_seq_stop(struct seq_file *f, void *v)
+	__releases(rcu)
 {
 	rcu_read_unlock();
 }



^ permalink raw reply

* [PATCH 0/5] sparse related patches
From: Stephen Hemminger @ 2011-02-23 19:06 UTC (permalink / raw)
  To: davem; +Cc: netdev



^ permalink raw reply

* [PATCH 5/5] em_meta: fix sparse warning
From: Stephen Hemminger @ 2011-02-23 19:06 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110223190647.482444598@vyatta.com>

[-- Attachment #1: em-meta-sparse --]
[-- Type: text/plain, Size: 457 bytes --]

gfp_t needs to be cast to integer.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/net/sched/em_meta.c	2011-02-23 10:26:17.534042646 -0800
+++ b/net/sched/em_meta.c	2011-02-23 10:30:37.533063770 -0800
@@ -401,7 +401,7 @@ META_COLLECTOR(int_sk_sndbuf)
 META_COLLECTOR(int_sk_alloc)
 {
 	SKIP_NONLOCAL(skb);
-	dst->value = skb->sk->sk_allocation;
+	dst->value = (__force int) skb->sk->sk_allocation;
 }
 
 META_COLLECTOR(int_sk_route_caps)



^ permalink raw reply

* [PATCH 1/2] bonding: fix incorrect transmit queue offset
From: Andy Gospodarek @ 2011-02-23 19:42 UTC (permalink / raw)
  To: netdev, Phil Oester; +Cc: Ben Hutchings, Jay Vosburgh

Users noticed the following messages:

kernel: bond0 selects TX queue 16, but real number of TX queues is 16

were filling their logs when frames received from a device that
contained 16 input queues were being forwarded out of a bonded device.
Ben pointed out that the contents of skb->queue_mapping cannot be
directly mapped to a transmit queue, so I went with his suggestion for a
solution to the offset problem.

The function now also warns the user to increase the default value for
tx_queues if needed and returns a valid tx queue to dev_pick_tx.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>
Reported-by: Phil Oester <kernel@linuxace.com>
CC: Ben Hutchings <bhutchings@solarflare.com>
---
 drivers/net/bonding/bond_main.c |   24 +++++++++++++++++-------
 1 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77e3c6a..1512c83 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -4533,15 +4533,25 @@ out:
 	return res;
 }
 
+/*
+ * This helper function exists to help dev_pick_tx get the correct
+ * destination queue.  Using a helper function skips the a call to
+ * skb_tx_hash and will put the skbs in the queue we expect on their
+ * way down to the bonding driver.
+ */
 static u16 bond_select_queue(struct net_device *dev, struct sk_buff *skb)
 {
-	/*
-	 * This helper function exists to help dev_pick_tx get the correct
-	 * destination queue.  Using a helper function skips the a call to
-	 * skb_tx_hash and will put the skbs in the queue we expect on their
-	 * way down to the bonding driver.
-	 */
-	return skb->queue_mapping;
+	u16 txq = skb_rx_queue_recorded(skb) ? skb_get_rx_queue(skb) : 0;
+
+	while (txq >= dev->real_num_tx_queues) {
+		/* let the user know if we do not have enough tx queues */
+		if (net_ratelimit())
+			pr_warning("%s selects invalid tx queue %d.  Consider"
+				   " setting module option tx_queues > %d.",
+				   dev->name, txq, dev->real_num_tx_queues);
+		txq -= dev->real_num_tx_queues;
+	}
+	return txq;
 }
 
 static netdev_tx_t bond_start_xmit(struct sk_buff *skb, struct net_device *dev)
-- 
1.7.4


^ permalink raw reply related

* [PATCH 2/2] bonding: cleanup transmit queue storage and configuration
From: Andy Gospodarek @ 2011-02-23 19:43 UTC (permalink / raw)
  To: netdev

Current bonding code presumes queues 1-16 were usable and queue 0 was
reserved, but parts of the code didn't account for this correctly.  This
update now allows 16 queues (numbered 0-15) to be used (in the nominal
case) and queue awareness can be completely and clearly disabled rather
than by simply setting the queue for a device to 0.

Configuration syntax has changed a little, but is similar to the
existing method.  To associate a queue with an output device, you can do
this:

# echo "+eth1:2" > /sys/class/net/bond0/bonding/queue_id

and to clear it this:

# echo "-eth1:2" > /sys/class/net/bond0/bonding/queue_id

Contents of /proc/net/bonding/bond0 are also a bit different as the
output queue will only be printed if set.

Signed-off-by: Andy Gospodarek <andy@greyhouse.net>

---
 Documentation/networking/bonding.txt |   20 ++++++----
 drivers/net/bonding/bond_main.c      |   14 ++++---
 drivers/net/bonding/bond_sysfs.c     |   71 +++++++++++++++++++++------------
 drivers/net/bonding/bonding.h        |   20 +++++++++
 4 files changed, 85 insertions(+), 40 deletions(-)

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 25d2f41..8a27f0f 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -1404,10 +1404,11 @@ By default the bonding driver is multiqueue aware and 16 queues are created
 when the driver initializes (see Documentation/networking/multiqueue.txt
 for details).  If more or less queues are desired the module parameter
 tx_queues can be used to change this value.  There is no sysfs parameter
-available as the allocation is done at module init time.
+available as the allocation is done at module init time.  Currently only one
+queue per device can be set.
 
 The output of the file /proc/net/bonding/bondX has changed so the output Queue
-ID is now printed for each slave:
+ID is now printed for each slave that has a queue configured:
 
 Bonding Mode: fault-tolerance (active-backup)
 Primary Slave: None
@@ -1421,7 +1422,6 @@ Slave Interface: eth0
 MII Status: up
 Link Failure Count: 0
 Permanent HW addr: 00:1a:a0:12:8f:cb
-Slave queue ID: 0
 
 Slave Interface: eth1
 MII Status: up
@@ -1431,12 +1431,16 @@ Slave queue ID: 2
 
 The queue_id for a slave can be set using the command:
 
-# echo "eth1:2" > /sys/class/net/bond0/bonding/queue_id
+# echo "+eth1:2" > /sys/class/net/bond0/bonding/queue_id
 
-Any interface that needs a queue_id set should set it with multiple calls
-like the one above until proper priorities are set for all interfaces.  On
-distributions that allow configuration via initscripts, multiple 'queue_id'
-arguments can be added to BONDING_OPTS to set all needed slave queues.
+and removed using the command:
+
+# echo "-eth1:2" > /sys/class/net/bond0/bonding/queue_id
+
+Any interface that needs a queue_id set should set it with multiple calls until
+proper priorities are set or cleared for all interfaces.  On distributions that
+allow configuration via initscripts, multiple 'queue_id' arguments can be added
+to BONDING_OPTS to set all needed slave queues.
 
 These queue id's can be used in conjunction with the tc utility to configure
 a multiqueue qdisc and filters to bias certain traffic to transmit on certain
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 1512c83..3e8fbfe 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1560,8 +1560,8 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 	}
 
 	/*
-	 * Set the new_slave's queue_id to be zero.  Queue ID mapping
-	 * is set via sysfs or module option if desired.
+	 * Set the new_slave's queue_id to be zero to disable.  Queue ID
+	 * mapping is set via sysfs or module option if desired.
 	 */
 	new_slave->queue_id = 0;
 
@@ -3355,7 +3355,8 @@ static void bond_info_show_slave(struct seq_file *seq,
 		else
 			seq_puts(seq, "Aggregator ID: N/A\n");
 	}
-	seq_printf(seq, "Slave queue ID: %d\n", slave->queue_id);
+	if (slave_tx_queue_recorded(slave))
+		seq_printf(seq, "Slave queue ID: %d\n", slave_get_tx_queue(slave));
 }
 
 static int bond_info_seq_show(struct seq_file *seq, void *v)
@@ -4516,15 +4517,16 @@ static inline int bond_slave_override(struct bonding *bond,
 
 	/* Find out if any slaves have the same mapping as this skb. */
 	bond_for_each_slave(bond, check_slave, i) {
-		if (check_slave->queue_id == skb->queue_mapping) {
+		if (slave_tx_queue_recorded(check_slave) &&
+		    slave_get_tx_queue(check_slave) == skb->queue_mapping) {
 			slave = check_slave;
 			break;
 		}
 	}
 
 	/* If the slave isn't UP, use default transmit policy. */
-	if (slave && slave->queue_id && IS_UP(slave->dev) &&
-	    (slave->link == BOND_LINK_UP)) {
+	if (slave && slave_tx_queue_recorded(slave) &&
+	    IS_UP(slave->dev) && (slave->link == BOND_LINK_UP)) {
 		res = bond_dev_queue_xmit(bond, skb, slave->dev);
 	}
 
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 72bb0f6..d09701f 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1448,15 +1448,18 @@ static ssize_t bonding_show_queue_id(struct device *d,
 
 	read_lock(&bond->lock);
 	bond_for_each_slave(bond, slave, i) {
-		if (res > (PAGE_SIZE - IFNAMSIZ - 6)) {
-			/* not enough space for another interface_name:queue_id pair */
-			if ((PAGE_SIZE - res) > 10)
-				res = PAGE_SIZE - 10;
-			res += sprintf(buf + res, "++more++ ");
-			break;
+		if (slave_tx_queue_recorded(slave)) {
+			if (res > (PAGE_SIZE - IFNAMSIZ - 6)) {
+				/* Not enough space for another
+				 * interface_name:queue_id pair. */
+				if ((PAGE_SIZE - res) > 10)
+					res = PAGE_SIZE - 10;
+				res += sprintf(buf + res, "++more++ ");
+				break;
+			}
+			res += sprintf(buf + res, "%s:%d ", slave->dev->name,
+				       slave_get_tx_queue(slave));
 		}
-		res += sprintf(buf + res, "%s:%d ",
-			       slave->dev->name, slave->queue_id);
 	}
 	read_unlock(&bond->lock);
 	if (res)
@@ -1479,55 +1482,71 @@ static ssize_t bonding_store_queue_id(struct device *d,
 	int i, ret = count;
 	char *delim;
 	struct net_device *sdev = NULL;
+	char command[IFNAMSIZ + 7] = {0, };
+	char *ifname;
 
 	if (!rtnl_trylock())
 		return restart_syscall();
 
+	sscanf(buffer, "%22s", command); /* IFNAMSIZ*/
+
 	/* delim will point to queue id if successful */
-	delim = strchr(buffer, ':');
+	delim = strchr(command + 1, ':');
 	if (!delim)
 		goto err_no_cmd;
-
 	/*
 	 * Terminate string that points to device name and bump it
 	 * up one, so we can read the queue id there.
 	 */
 	*delim = '\0';
-	if (sscanf(++delim, "%hd\n", &qid) != 1)
+	if (sscanf(++delim, "%hd\n", &qid) != 1) {
 		goto err_no_cmd;
+	}
 
-	/* Check buffer length, valid ifname and queue id */
-	if (strlen(buffer) > IFNAMSIZ ||
-	    !dev_valid_name(buffer) ||
-	    qid > bond->params.tx_queues)
+	/* ifname will now be command + 1 */
+	ifname = command + 1;
+	if ((strlen(command) <= 1) ||
+	    !dev_valid_name(ifname) ||
+	    qid >= bond->params.tx_queues)
 		goto err_no_cmd;
 
 	/* Get the pointer to that interface if it exists */
-	sdev = __dev_get_by_name(dev_net(bond->dev), buffer);
+	sdev = __dev_get_by_name(dev_net(bond->dev), ifname);
 	if (!sdev)
 		goto err_no_cmd;
 
 	read_lock(&bond->lock);
 
-	/* Search for thes slave and check for duplicate qids */
+	/* Search for the slave needing the change */
 	update_slave = NULL;
 	bond_for_each_slave(bond, slave, i) {
 		if (sdev == slave->dev)
-			/*
-			 * We don't need to check the matching
-			 * slave for dups, since we're overwriting it
-			 */
 			update_slave = slave;
-		else if (qid && qid == slave->queue_id) {
+		else if (slave_tx_queue_recorded(slave) &&
+			 qid == slave_get_tx_queue(slave)) {
 			goto err_no_cmd_unlock;
 		}
 	}
-
 	if (!update_slave)
 		goto err_no_cmd_unlock;
 
-	/* Actually set the qids for the slave */
-	update_slave->queue_id = qid;
+	/* at this point we have a valid slave */
+	switch (command[0]) {
+	case '+':
+		if (slave_tx_queue_recorded(update_slave))
+			goto err_no_cmd_unlock;
+		slave_record_tx_queue(update_slave, qid);
+		break;
+	case '-':
+		if (!slave_tx_queue_recorded(update_slave) ||
+		    slave_get_tx_queue(update_slave) != qid)
+			goto err_no_cmd_unlock;
+		slave_clear_tx_queue(update_slave);
+		break;
+
+	default:
+		goto err_no_cmd;
+	}
 
 	read_unlock(&bond->lock);
 out:
@@ -1537,7 +1556,7 @@ out:
 err_no_cmd_unlock:
 	read_unlock(&bond->lock);
 err_no_cmd:
-	pr_info("invalid input for queue_id set for %s.\n",
+	pr_info("cannot modify queue_id for %s.\n",
 		bond->dev->name);
 	ret = -EPERM;
 	goto out;
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 31fe980..c9059f4 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -422,4 +422,24 @@ static inline void bond_unregister_ipv6_notifier(void)
 }
 #endif
 
+static inline void slave_record_tx_queue(struct slave *slave, u16 tx_queue)
+{
+	slave->queue_id = tx_queue + 1;
+}
+
+static inline u16 slave_get_tx_queue(const struct slave *slave)
+{
+	return slave->queue_id - 1;
+}
+
+static inline bool slave_tx_queue_recorded(const struct slave *slave)
+{
+	return slave->queue_id != 0;
+}
+
+static inline void slave_clear_tx_queue(struct slave *slave)
+{
+	slave->queue_id = 0;
+}
+
 #endif /* _LINUX_BONDING_H */
-- 
1.7.4


^ permalink raw reply related

* Re: [PATCH V11 2/4] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Scott Wood @ 2011-02-23 19:24 UTC (permalink / raw)
  To: Grant Likely
  Cc: Richard Cochran, Thomas Gleixner, Rodolfo Giometti, Arnd Bergmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Russell King, Paul Mackerras,
	John Stultz, Alan Cox, netdev-u79uwXL29TY76Z2rM5mHXA,
	Mike Frysinger, Christoph Lameter,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ, David Miller,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Krzysztof Halasa
In-Reply-To: <20110223175459.GH14597-MrY2KI0G/OVr83L8+7iqerDks+cytr/Z@public.gmane.org>

On Wed, 23 Feb 2011 10:54:59 -0700
Grant Likely <grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org> wrote:

> On Wed, Feb 23, 2011 at 11:26:12AM -0600, Scott Wood wrote:
> > eTSEC is versioned, that's more reliable than the chip name since chips
> > have revisions (rev 2.1 of mpc8313 has eTSEC 1.6, not sure about previous
> > revs of mpc8313).  Logic blocks can be and have been uprevved between one
> > revision of a chip to the next.  I think "fsl,mpc8313rev2.1-etsec-ptp"
> > would be taking things a bit too far (and there could be board-level bugs
> > too...).
> > 
> > If you really need to know the exact SoC you're on, look in SVR (which
> > will provide revision info as well).  Isn't the device tree for things that
> > can't be probed?
> 
> This is far more about the binding than it is about the chip revision.
> When documenting a binding it makes far more sense to anchor it to a
> specific implementation than to try and come up with a 'generic'
> catchall.

Whatever string is used should be written into a binding document.

fsl,etsec-v1.6-ptp seems like it would be just as good for that purpose.

Even just fsl,etsec-ptp will identify the binding, though it's lacking in
identifying the hardware (in the absence of access to the eTSEC ID
registers).

If somehow Freescale makes something completely different in the future
called "etsec-ptp", then we'll just have to pick a different name for
*that* compatible (after we smack whoever was responsible for reusing the
name).  The point of the vendor namespacing is to constrain this problem to
a manageable scope.

> > The eTSEC revision is probeable as well, but due the way PTP is described as
> > a separate node, the driver doesn't have straightforward access to those
> > registers.
> 
> Ignorant question: Should the ptp be described as a separate node?

Probably not.

> > Insisting on an explicit chip also encourages people to claim compatibility
> > with that chip without ensuring that it really is fully compatible.
> 
> In practise, I've not seen this to be an issue.

I see it often enough in our BSPs (though the BSP device trees tend to be
problematic in a variety of ways), especially on things like guts and
pmc.

It's a question of how strong a statement people are asked to make -- in a
place where fixing errors is somewhat painful, and we don't really need that
strong statement of compatibility to be made, as long as there's another
way to fully identify the hardware (e.g. SVR, top-level board compatible) if
some strange workaround needs to be made[1].

To turn things around, in practice, I've not seen compatibles that don't
encode a specific chip name to be a problem, as long as it's well
documented what it means.

-Scott

[1] IIRC, this was the original reason for citing a specific chip, but it
doesn't hold up if that chip gets cited by other chips as compatible.  If
compatibliity includes having all the same bugs, then very little will be
able to claim compatibility, and we'll be back to long lists of device
IDs in the driver.  Not to mention errata that are discovered after a
device tree claiming compatibility is released...

^ permalink raw reply

* [patch net-next-2.6 V3] net: convert bonding to use rx_handler
From: Jiri Pirko @ 2011-02-23 19:05 UTC (permalink / raw)
  To: David Miller
  Cc: kaber, eric.dumazet, netdev, shemminger, fubar, nicolas.2p.debian,
	andy
In-Reply-To: <20110218205832.GE2602@psychotron.redhat.com>

This patch converts bonding to use rx_handler. Results in cleaner
__netif_receive_skb() with much less exceptions needed. Also
bond-specific work is moved into bond code.

Did performance test using pktgen and counting incoming packets by
iptables. No regression noted.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>

v1->v2:
        using skb_iif instead of new input_dev to remember original
	device

v2->v3:
	do another loop in case skb->dev is changed. That way orig_dev
	core can be left untouched.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
---
 drivers/net/bonding/bond_main.c |   74 ++++++++++++++++++++++++-
 net/core/dev.c                  |  119 ++++++++++-----------------------------
 2 files changed, 104 insertions(+), 89 deletions(-)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 77e3c6a..3772b61 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -1423,6 +1423,67 @@ static void bond_setup_by_slave(struct net_device *bond_dev,
 	bond->setup_by_slave = 1;
 }
 
+/* On bonding slaves other than the currently active slave, suppress
+ * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
+ * ARP on active-backup slaves with arp_validate enabled.
+ */
+static bool bond_should_deliver_exact_match(struct sk_buff *skb,
+					    struct net_device *slave_dev,
+					    struct net_device *bond_dev)
+{
+	if (slave_dev->priv_flags & IFF_SLAVE_INACTIVE) {
+		if (slave_dev->priv_flags & IFF_SLAVE_NEEDARP &&
+		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
+			return false;
+
+		if (bond_dev->priv_flags & IFF_MASTER_ALB &&
+		    skb->pkt_type != PACKET_BROADCAST &&
+		    skb->pkt_type != PACKET_MULTICAST)
+				return false;
+
+		if (bond_dev->priv_flags & IFF_MASTER_8023AD &&
+		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
+			return false;
+
+		return true;
+	}
+	return false;
+}
+
+static struct sk_buff *bond_handle_frame(struct sk_buff *skb)
+{
+	struct net_device *slave_dev;
+	struct net_device *bond_dev;
+
+	skb = skb_share_check(skb, GFP_ATOMIC);
+	if (unlikely(!skb))
+		return NULL;
+	slave_dev = skb->dev;
+	bond_dev = ACCESS_ONCE(slave_dev->master);
+	if (unlikely(!bond_dev))
+		return skb;
+
+	if (bond_dev->priv_flags & IFF_MASTER_ARPMON)
+		slave_dev->last_rx = jiffies;
+
+	if (bond_should_deliver_exact_match(skb, slave_dev, bond_dev)) {
+		skb->deliver_no_wcard = 1;
+		return skb;
+	}
+
+	skb->dev = bond_dev;
+
+	if (bond_dev->priv_flags & IFF_MASTER_ALB &&
+	    bond_dev->priv_flags & IFF_BRIDGE_PORT &&
+	    skb->pkt_type == PACKET_HOST) {
+		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
+
+		memcpy(dest, bond_dev->dev_addr, ETH_ALEN);
+	}
+
+	return skb;
+}
+
 /* enslave device <slave> to bond device <master> */
 int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 {
@@ -1599,11 +1660,17 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		pr_debug("Error %d calling netdev_set_bond_master\n", res);
 		goto err_restore_mac;
 	}
+	res = netdev_rx_handler_register(slave_dev, bond_handle_frame, NULL);
+	if (res) {
+		pr_debug("Error %d calling netdev_rx_handler_register\n", res);
+		goto err_unset_master;
+	}
+
 	/* open the slave since the application closed it */
 	res = dev_open(slave_dev);
 	if (res) {
 		pr_debug("Opening slave %s failed\n", slave_dev->name);
-		goto err_unset_master;
+		goto err_unreg_rxhandler;
 	}
 
 	new_slave->dev = slave_dev;
@@ -1811,6 +1878,9 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 err_close:
 	dev_close(slave_dev);
 
+err_unreg_rxhandler:
+	netdev_rx_handler_unregister(slave_dev);
+
 err_unset_master:
 	netdev_set_bond_master(slave_dev, NULL);
 
@@ -1992,6 +2062,7 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
 		netif_addr_unlock_bh(bond_dev);
 	}
 
+	netdev_rx_handler_unregister(slave_dev);
 	netdev_set_bond_master(slave_dev, NULL);
 
 #ifdef CONFIG_NET_POLL_CONTROLLER
@@ -2114,6 +2185,7 @@ static int bond_release_all(struct net_device *bond_dev)
 			netif_addr_unlock_bh(bond_dev);
 		}
 
+		netdev_rx_handler_unregister(slave_dev);
 		netdev_set_bond_master(slave_dev, NULL);
 
 		/* close slave before restoring its mac address */
diff --git a/net/core/dev.c b/net/core/dev.c
index 578415c..e06a834 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -3096,63 +3096,31 @@ void netdev_rx_handler_unregister(struct net_device *dev)
 }
 EXPORT_SYMBOL_GPL(netdev_rx_handler_unregister);
 
-static inline void skb_bond_set_mac_by_master(struct sk_buff *skb,
-					      struct net_device *master)
+static void vlan_on_bond_hook(struct sk_buff *skb)
 {
-	if (skb->pkt_type == PACKET_HOST) {
-		u16 *dest = (u16 *) eth_hdr(skb)->h_dest;
+	/*
+	 * Make sure ARP frames received on VLAN interfaces stacked on
+	 * bonding interfaces still make their way to any base bonding
+	 * device that may have registered for a specific ptype.
+	 */
+	if (skb->dev->priv_flags & IFF_802_1Q_VLAN &&
+	    vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING &&
+	    skb->protocol == htons(ETH_P_ARP)) {
+		struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
 
-		memcpy(dest, master->dev_addr, ETH_ALEN);
+		if (!skb2)
+			return;
+		skb2->dev = vlan_dev_real_dev(skb->dev);
+		netif_rx(skb2);
 	}
 }
 
-/* On bonding slaves other than the currently active slave, suppress
- * duplicates except for 802.3ad ETH_P_SLOW, alb non-mcast/bcast, and
- * ARP on active-backup slaves with arp_validate enabled.
- */
-static int __skb_bond_should_drop(struct sk_buff *skb,
-				  struct net_device *master)
-{
-	struct net_device *dev = skb->dev;
-
-	if (master->priv_flags & IFF_MASTER_ARPMON)
-		dev->last_rx = jiffies;
-
-	if ((master->priv_flags & IFF_MASTER_ALB) &&
-	    (master->priv_flags & IFF_BRIDGE_PORT)) {
-		/* Do address unmangle. The local destination address
-		 * will be always the one master has. Provides the right
-		 * functionality in a bridge.
-		 */
-		skb_bond_set_mac_by_master(skb, master);
-	}
-
-	if (dev->priv_flags & IFF_SLAVE_INACTIVE) {
-		if ((dev->priv_flags & IFF_SLAVE_NEEDARP) &&
-		    skb->protocol == __cpu_to_be16(ETH_P_ARP))
-			return 0;
-
-		if (master->priv_flags & IFF_MASTER_ALB) {
-			if (skb->pkt_type != PACKET_BROADCAST &&
-			    skb->pkt_type != PACKET_MULTICAST)
-				return 0;
-		}
-		if (master->priv_flags & IFF_MASTER_8023AD &&
-		    skb->protocol == __cpu_to_be16(ETH_P_SLOW))
-			return 0;
-
-		return 1;
-	}
-	return 0;
-}
-
 static int __netif_receive_skb(struct sk_buff *skb)
 {
 	struct packet_type *ptype, *pt_prev;
 	rx_handler_func_t *rx_handler;
 	struct net_device *orig_dev;
-	struct net_device *null_or_orig;
-	struct net_device *orig_or_bond;
+	struct net_device *null_or_dev;
 	int ret = NET_RX_DROP;
 	__be16 type;
 
@@ -3167,32 +3135,8 @@ static int __netif_receive_skb(struct sk_buff *skb)
 
 	if (!skb->skb_iif)
 		skb->skb_iif = skb->dev->ifindex;
-
-	/*
-	 * bonding note: skbs received on inactive slaves should only
-	 * be delivered to pkt handlers that are exact matches.  Also
-	 * the deliver_no_wcard flag will be set.  If packet handlers
-	 * are sensitive to duplicate packets these skbs will need to
-	 * be dropped at the handler.
-	 */
-	null_or_orig = NULL;
 	orig_dev = skb->dev;
-	if (skb->deliver_no_wcard)
-		null_or_orig = orig_dev;
-	else if (netif_is_bond_slave(orig_dev)) {
-		struct net_device *bond_master = ACCESS_ONCE(orig_dev->master);
-
-		if (likely(bond_master)) {
-			if (__skb_bond_should_drop(skb, bond_master)) {
-				skb->deliver_no_wcard = 1;
-				/* deliver only exact match */
-				null_or_orig = orig_dev;
-			} else
-				skb->dev = bond_master;
-		}
-	}
 
-	__this_cpu_inc(softnet_data.processed);
 	skb_reset_network_header(skb);
 	skb_reset_transport_header(skb);
 	skb->mac_len = skb->network_header - skb->mac_header;
@@ -3201,6 +3145,10 @@ static int __netif_receive_skb(struct sk_buff *skb)
 
 	rcu_read_lock();
 
+another_round:
+
+	__this_cpu_inc(softnet_data.processed);
+
 #ifdef CONFIG_NET_CLS_ACT
 	if (skb->tc_verd & TC_NCLS) {
 		skb->tc_verd = CLR_TC_NCLS(skb->tc_verd);
@@ -3209,8 +3157,7 @@ static int __netif_receive_skb(struct sk_buff *skb)
 #endif
 
 	list_for_each_entry_rcu(ptype, &ptype_all, list) {
-		if (ptype->dev == null_or_orig || ptype->dev == skb->dev ||
-		    ptype->dev == orig_dev) {
+		if (!ptype->dev || ptype->dev == skb->dev) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;
@@ -3224,16 +3171,20 @@ static int __netif_receive_skb(struct sk_buff *skb)
 ncls:
 #endif
 
-	/* Handle special case of bridge or macvlan */
 	rx_handler = rcu_dereference(skb->dev->rx_handler);
 	if (rx_handler) {
+		struct net_device *prev_dev;
+
 		if (pt_prev) {
 			ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = NULL;
 		}
+		prev_dev = skb->dev;
 		skb = rx_handler(skb);
 		if (!skb)
 			goto out;
+		if (skb->dev != prev_dev)
+			goto another_round;
 	}
 
 	if (vlan_tx_tag_present(skb)) {
@@ -3248,24 +3199,16 @@ ncls:
 			goto out;
 	}
 
-	/*
-	 * Make sure frames received on VLAN interfaces stacked on
-	 * bonding interfaces still make their way to any base bonding
-	 * device that may have registered for a specific ptype.  The
-	 * handler may have to adjust skb->dev and orig_dev.
-	 */
-	orig_or_bond = orig_dev;
-	if ((skb->dev->priv_flags & IFF_802_1Q_VLAN) &&
-	    (vlan_dev_real_dev(skb->dev)->priv_flags & IFF_BONDING)) {
-		orig_or_bond = vlan_dev_real_dev(skb->dev);
-	}
+	vlan_on_bond_hook(skb);
+
+	/* deliver only exact match when indicated */
+	null_or_dev = skb->deliver_no_wcard ? skb->dev : NULL;
 
 	type = skb->protocol;
 	list_for_each_entry_rcu(ptype,
 			&ptype_base[ntohs(type) & PTYPE_HASH_MASK], list) {
-		if (ptype->type == type && (ptype->dev == null_or_orig ||
-		     ptype->dev == skb->dev || ptype->dev == orig_dev ||
-		     ptype->dev == orig_or_bond)) {
+		if (ptype->type == type &&
+		    (ptype->dev == null_or_dev || ptype->dev == skb->dev)) {
 			if (pt_prev)
 				ret = deliver_skb(skb, pt_prev, orig_dev);
 			pt_prev = ptype;
-- 
1.7.3.4


^ permalink raw reply related

* Re: [ISC-Bugs #22806] [PATCH] lfp: fix AF_INET checksum with csum offloading
From: Michael S. Tsirkin @ 2011-02-23 18:46 UTC (permalink / raw)
  To: dhcp-bugs, dhcp-hackers; +Cc: kvm, herbert, davem, netdev
In-Reply-To: <20101227171632.GA14920@redhat.com>

No answer from isc so far.
I think it's a problem that the popular dhclient
has had this bug for so long.
Anyone knows what else can be done?


On Mon, Dec 27, 2010 at 07:16:32PM +0200, Michael S. Tsirkin wrote:
> When an AF_INET socket is used, linux would sometimes
> return a packet without the checksum.  This happens when a packet
> originates on the same machine, which is most common with virtual
> machines but might be possible even without with a software-based dhcp
> server such as dnsmasq.
> This appears to be a performance optimization to avoid calculating
> the checksum and there's no way to disable this. Users are required to
> detect such packets by passing a msg_control parameter to the recvmsg
> call. This was added in linux kernel 2.6.21.
> 
> dhclient from isc.org as of 4.2.0-P2 fails to do this and
> concequently discards such packets, so such setups fail to get an IP.
> This was reported in the past:
> https://lists.isc.org/mailman/htdig/dhcp-hackers/2010-April/001825.html
> 
> The attached patch fixes this by passing msg_control and looking
> at the tp_status field in the result. If the TP_STATUS_CSUMNOTREADY
> bit is set, the packet checksum is missing because the packet
> is local.
> 
> The patch below is currently successfully used at least on Fedora and
> some other Red Hat based distributions. Applying this to the ISC sources
> would help make the problem go away for everyone.
> ---
> 
> Notes:
> - The patch is to be applied with -p1.
> - The patch below applies to dhcp 4.2.0-P2 and 4.1.2 (the last with an offset).
> - Please Cc me on comments directly, I am not subscribed to the dhcp-hackers list.
> 
>  common/bpf.c     |    2 +-
>  common/dlpi.c    |    2 +-
>  common/lpf.c     |   81 ++++++++++++++++++++++++++++++++++++++++++------------
>  common/nit.c     |    2 +-
>  common/packet.c  |    4 +-
>  common/upf.c     |    2 +-
>  includes/dhcpd.h |    2 +-
>  7 files changed, 70 insertions(+), 25 deletions(-)
> 
> diff --git a/common/bpf.c b/common/bpf.c
> index b0ef657..8bd5727 100644
> --- a/common/bpf.c
> +++ b/common/bpf.c
> @@ -485,7 +485,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
>  		offset = decode_udp_ip_header (interface,
>  					       interface -> rbuf,
>  					       interface -> rbuf_offset,
> -  					       from, hdr.bh_caplen, &paylen);
> +  					       from, hdr.bh_caplen, &paylen, 0);
>  
>  		/* If the IP or UDP checksum was bad, skip the packet... */
>  		if (offset < 0) {
> diff --git a/common/dlpi.c b/common/dlpi.c
> index eb64342..d4a8bb9 100644
> --- a/common/dlpi.c
> +++ b/common/dlpi.c
> @@ -694,7 +694,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
>  	length -= offset;
>  #endif
>  	offset = decode_udp_ip_header (interface, dbuf, bufix,
> -				       from, length, &paylen);
> +				       from, length, &paylen, 0);
>  
>  	/*
>  	 * If the IP or UDP checksum was bad, skip the packet...
> diff --git a/common/lpf.c b/common/lpf.c
> index f727b7c..4bdb0f1 100644
> --- a/common/lpf.c
> +++ b/common/lpf.c
> @@ -29,18 +29,33 @@
>  #include "dhcpd.h"
>  #if defined (USE_LPF_SEND) || defined (USE_LPF_RECEIVE)
>  #include <sys/ioctl.h>
> +#include <sys/socket.h>
>  #include <sys/uio.h>
>  #include <errno.h>
>  
>  #include <asm/types.h>
>  #include <linux/filter.h>
>  #include <linux/if_ether.h>
> +#include <linux/if_packet.h>
>  #include <netinet/in_systm.h>
>  #include "includes/netinet/ip.h"
>  #include "includes/netinet/udp.h"
>  #include "includes/netinet/if_ether.h"
>  #include <net/if.h>
>  
> +#ifndef PACKET_AUXDATA
> +#define PACKET_AUXDATA 8
> +
> +struct tpacket_auxdata
> +{
> +	__u32		tp_status;
> +	__u32		tp_len;
> +	__u32		tp_snaplen;
> +	__u16		tp_mac;
> +	__u16		tp_net;
> +};
> +#endif
> +
>  /* Reinitializes the specified interface after an address change.   This
>     is not required for packet-filter APIs. */
>  
> @@ -66,10 +81,14 @@ int if_register_lpf (info)
>  	struct interface_info *info;
>  {
>  	int sock;
> -	struct sockaddr sa;
> +	union {
> +		struct sockaddr_ll ll;
> +		struct sockaddr common;
> +	} sa;
> +	struct ifreq ifr;
>  
>  	/* Make an LPF socket. */
> -	if ((sock = socket(PF_PACKET, SOCK_PACKET,
> +	if ((sock = socket(PF_PACKET, SOCK_RAW,
>  			   htons((short)ETH_P_ALL))) < 0) {
>  		if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT ||
>  		    errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT ||
> @@ -84,11 +103,16 @@ int if_register_lpf (info)
>  		log_fatal ("Open a socket for LPF: %m");
>  	}
>  
> +	memset (&ifr, 0, sizeof ifr);
> +	strncpy (ifr.ifr_name, (const char *)info -> ifp, sizeof ifr.ifr_name);
> +	if (ioctl (sock, SIOCGIFINDEX, &ifr))
> +		log_fatal ("Failed to get interface index: %m");
> +
>  	/* Bind to the interface name */
>  	memset (&sa, 0, sizeof sa);
> -	sa.sa_family = AF_PACKET;
> -	strncpy (sa.sa_data, (const char *)info -> ifp, sizeof sa.sa_data);
> -	if (bind (sock, &sa, sizeof sa)) {
> +	sa.ll.sll_family = AF_PACKET;
> +	sa.ll.sll_ifindex = ifr.ifr_ifindex;
> +	if (bind (sock, &sa.common, sizeof sa)) {
>  		if (errno == ENOPROTOOPT || errno == EPROTONOSUPPORT ||
>  		    errno == ESOCKTNOSUPPORT || errno == EPFNOSUPPORT ||
>  		    errno == EAFNOSUPPORT || errno == EINVAL) {
> @@ -170,9 +194,18 @@ static void lpf_gen_filter_setup (struct interface_info *);
>  void if_register_receive (info)
>  	struct interface_info *info;
>  {
> +	int val;
> +
>  	/* Open a LPF device and hang it on this interface... */
>  	info -> rfdesc = if_register_lpf (info);
>  
> +	val = 1;
> +	if (setsockopt (info -> rfdesc, SOL_PACKET, PACKET_AUXDATA, &val,
> +			sizeof val) < 0) {
> +		if (errno != ENOPROTOOPT)
> +			log_fatal ("Failed to set auxiliary packet data: %m");
> +	}
> +
>  #if defined (HAVE_TR_SUPPORT)
>  	if (info -> hw_address.hbuf [0] == HTYPE_IEEE802)
>  		lpf_tr_filter_setup (info);
> @@ -294,7 +327,6 @@ ssize_t send_packet (interface, packet, raw, len, from, to, hto)
>  	double hh [16];
>  	double ih [1536 / sizeof (double)];
>  	unsigned char *buf = (unsigned char *)ih;
> -	struct sockaddr sa;
>  	int result;
>  	int fudge;
>  
> @@ -315,15 +347,7 @@ ssize_t send_packet (interface, packet, raw, len, from, to, hto)
>  				(unsigned char *)raw, len);
>  	memcpy (buf + ibufp, raw, len);
>  
> -	/* For some reason, SOCK_PACKET sockets can't be connected,
> -	   so we have to do a sentdo every time. */
> -	memset (&sa, 0, sizeof sa);
> -	sa.sa_family = AF_PACKET;
> -	strncpy (sa.sa_data,
> -		 (const char *)interface -> ifp, sizeof sa.sa_data);
> -
> -	result = sendto (interface -> wfdesc,
> -			 buf + fudge, ibufp + len - fudge, 0, &sa, sizeof sa);
> +	result = write (interface -> wfdesc, buf + fudge, ibufp + len - fudge);
>  	if (result < 0)
>  		log_error ("send_packet: %m");
>  	return result;
> @@ -340,14 +364,35 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
>  {
>  	int length = 0;
>  	int offset = 0;
> +	int nocsum = 0;
>  	unsigned char ibuf [1536];
>  	unsigned bufix = 0;
>  	unsigned paylen;
> -
> -	length = read (interface -> rfdesc, ibuf, sizeof ibuf);
> +	unsigned char cmsgbuf[CMSG_LEN(sizeof(struct tpacket_auxdata))];
> +	struct iovec iov = {
> +		.iov_base = ibuf,
> +		.iov_len = sizeof ibuf,
> +	};
> +	struct msghdr msg = {
> +		.msg_iov = &iov,
> +		.msg_iovlen = 1,
> +		.msg_control = cmsgbuf,
> +		.msg_controllen = sizeof(cmsgbuf),
> +	};
> +	struct cmsghdr *cmsg;
> +
> +	length = recvmsg (interface -> rfdesc, &msg, 0);
>  	if (length <= 0)
>  		return length;
>  
> +	for (cmsg = CMSG_FIRSTHDR(&msg); cmsg; cmsg = CMSG_NXTHDR(&msg, cmsg)) {
> +		if (cmsg->cmsg_level == SOL_PACKET &&
> +		    cmsg->cmsg_type == PACKET_AUXDATA) {
> +			struct tpacket_auxdata *aux = (void *)CMSG_DATA(cmsg);
> +			nocsum = aux->tp_status & TP_STATUS_CSUMNOTREADY;
> +		}
> +	}
> +
>  	bufix = 0;
>  	/* Decode the physical header... */
>  	offset = decode_hw_header (interface, ibuf, bufix, hfrom);
> @@ -364,7 +409,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
>  
>  	/* Decode the IP and UDP headers... */
>  	offset = decode_udp_ip_header (interface, ibuf, bufix, from,
> -				       (unsigned)length, &paylen);
> +				       (unsigned)length, &paylen, nocsum);
>  
>  	/* If the IP or UDP checksum was bad, skip the packet... */
>  	if (offset < 0)
> diff --git a/common/nit.c b/common/nit.c
> index 3822206..0da9c36 100644
> --- a/common/nit.c
> +++ b/common/nit.c
> @@ -369,7 +369,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
>  
>  	/* Decode the IP and UDP headers... */
>  	offset = decode_udp_ip_header (interface, ibuf, bufix,
> -				       from, length, &paylen);
> +				       from, length, &paylen, 0);
>  
>  	/* If the IP or UDP checksum was bad, skip the packet... */
>  	if (offset < 0)
> diff --git a/common/packet.c b/common/packet.c
> index 42bca69..fd2d975 100644
> --- a/common/packet.c
> +++ b/common/packet.c
> @@ -211,7 +211,7 @@ ssize_t
>  decode_udp_ip_header(struct interface_info *interface,
>  		     unsigned char *buf, unsigned bufix,
>  		     struct sockaddr_in *from, unsigned buflen,
> -		     unsigned *rbuflen)
> +		     unsigned *rbuflen, int nocsum)
>  {
>    unsigned char *data;
>    struct ip ip;
> @@ -322,7 +322,7 @@ decode_udp_ip_header(struct interface_info *interface,
>  					   8, IPPROTO_UDP + ulen))));
>  
>    udp_packets_seen++;
> -  if (usum && usum != sum) {
> +  if (!nocsum && usum && usum != sum) {
>  	  udp_packets_bad_checksum++;
>  	  if (udp_packets_seen > 4 &&
>  	      (udp_packets_seen / udp_packets_bad_checksum) < 2) {
> diff --git a/common/upf.c b/common/upf.c
> index feb82a2..fff3949 100644
> --- a/common/upf.c
> +++ b/common/upf.c
> @@ -320,7 +320,7 @@ ssize_t receive_packet (interface, buf, len, from, hfrom)
>  
>  	/* Decode the IP and UDP headers... */
>  	offset = decode_udp_ip_header (interface, ibuf, bufix,
> -				       from, length, &paylen);
> +				       from, length, &paylen, 0);
>  
>  	/* If the IP or UDP checksum was bad, skip the packet... */
>  	if (offset < 0)
> diff --git a/includes/dhcpd.h b/includes/dhcpd.h
> index cd7d962..0835d98 100644
> --- a/includes/dhcpd.h
> +++ b/includes/dhcpd.h
> @@ -2769,7 +2769,7 @@ ssize_t decode_hw_header PROTO ((struct interface_info *, unsigned char *,
>  				 unsigned, struct hardware *));
>  ssize_t decode_udp_ip_header PROTO ((struct interface_info *, unsigned char *,
>  				     unsigned, struct sockaddr_in *,
> -				     unsigned, unsigned *));
> +				     unsigned, unsigned *, int));
>  
>  /* ethernet.c */
>  void assemble_ethernet_header PROTO ((struct interface_info *, unsigned char *,
> -- 
> 1.7.3.2.91.g446ac

^ permalink raw reply

* Re: Mass udp flow reboot linux with RealTek RTL-8169 Gigabit
From: Hans Nieser @ 2011-02-23 18:31 UTC (permalink / raw)
  To: Francois Romieu; +Cc: netdev, linux-kernel
In-Reply-To: <1298463704.31256.29.camel@krikkit>

On Wed, 2011-02-23 at 13:21 +0100, Hans Nieser wrote:
> On Wed, 2011-02-23 at 10:55 +0100, Francois Romieu wrote:
> > Hans Nieser <hnsr@xs4all.nl> :
> > [...]
> > > With your patches applied to 2.6.38-rc6, I have gathered some of the
> > > info you requested from Seblu as well, I hope it's helpful:
> > > 
> > > 1: see attachment
> > 
> > Ok.
> > 
> > The chipset requires no trivial last minute regression fix (yet).
> > 
> > > 2: I'm not sure how to check the size of the packets, but I'm just
> > > fetching a (large) file over http/tcp, so I guess they are mostly of the
> > > size of my MTU which is 1500 looking at ifconfig output
> > 
> > Fine.
> > 
> > Your testcases are always based on a real download, whence including some
> > disk activity, as opposed to a pure network test, right ?
> 
> Yeah, I just had a little script that wgetted a file from a webserver in
> my LAN and saved it to separate (non-root) fs, then removed it - in a
> loop. When testing on the 2.6.35 and 2.6.35.9 kernels it did max out at
> about 107MiB/s, sometimes falling down a little presumably when disk was
> being touched.
> 
> > > For the other vmstat/ethtool/interrupts output, I started the following
> > > commands remotely via ssh a second or two before starting the download,
> > > and the machine locked up a few seconds later:
> > 
> > SysRq is enabled (/etc/sysctl.conf::kernel.sysrq = 1), the computer was
> > switched back on a no-X console before the test. Then the keyboard leds
> > ignore keypresses and the sysrq keys don't display anything in the
> > console, right ?
> 
> Yep I had X shutdown and switched to VT1, after lock up the LEDs can't
> be toggled anymore and sysrq key combo was nonresponsive (it works if I
> do it before it locks up)
> 
> > You may enable PCIEASPM_DEBUG, force 'pcie_aspm=off' and switch from
> > SLUB to SLAB but it's a bit cargo-cultish.
> 
> I'll give that a try this evening
> 
> > A bisection could help. Bisecting 2.6.35 .. 2.6.35.9 may be enough if
> > 2.6.35.9 works well.
> 
> Hmm did you mean bisecting 2.6.36 - 2.6.35.9 ? Since with 2.6.36 and
> above I can get the machine to hang within seconds and performance is
> really bad (10-20MiB/s with wget), while with 2.6.35.9 and 2.6.35
> performance was really good (reaching 107MiB/s most of the time) and
> lock up took 5-10 minutes instead of seconds (I guess I didn't mention
> this in my last e-mail but I managed to get both 2.6.35 and 2.6.35.9 to
> lock up eventually) - but I guess something changed between .35 and .36
> that made the issue easier to trigger.
> 
> I can also try even older kernels to see if there is one that doesn't
> lock up at all
> 

Ok, I just tried 2.6.34, and after over 5 hours of running my script,
the system is still up and running, with only 24 'link up' messages on
dmesg, and having transferred 2.1TiB of data (1428042421 rx_packets, 45
rx_missed). So I'm going to assume the problem isn't present with this
kernel and try a bisect between it and 2.6.35

^ permalink raw reply

* Re: [PATCH] e1000: fix sparse warning
From: Brandeburg, Jesse @ 2011-02-23 18:24 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Kirsher, Jeffrey T, Jesse Brandeburg,
	e1000-devel@lists.sourceforge.net, netdev@vger.kernel.org
In-Reply-To: <20110223101203.0894c474@nehalam>



On Wed, 23 Feb 2011, Stephen Hemminger wrote:

> Sparse complains because the e1000 driver is calling ioread on a pointer
> not tagged as __iomem.
> 
> Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

Seems fine, thanks Stephen.

Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>

 

^ permalink raw reply

* Re: [PATCH] cls_u32: fix sparse warnings
From: Stephen Hemminger @ 2011-02-23 18:16 UTC (permalink / raw)
  To: hadi; +Cc: David Miller, netdev
In-Reply-To: <1298464542.2271.1.camel@mojatatu>

On Wed, 23 Feb 2011 07:35:42 -0500
jamal <hadi@cyberus.ca> wrote:

> On Sun, 2011-02-20 at 18:14 -0800, Stephen Hemminger wrote:
> > The variable _data is used in asm-generic to define sections
> > which causes sparse warnings, so just rename the variable.
> > 
> > Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
> 
> Acked-by: Jamal Hadi Salim <hadi@cyberus.ca>
> 
> Hey, while you are sparsing away you may wanna fix 
> br_fdb_cleanup() as well ;-> /me runs
> 

With current sparse and net-next it comes up clean.

^ permalink raw reply

* [PATCH] e1000: fix sparse warning
From: Stephen Hemminger @ 2011-02-23 18:12 UTC (permalink / raw)
  To: Jeff Kirsher, Jesse Brandeburg; +Cc: e1000-devel, netdev

Sparse complains because the e1000 driver is calling ioread on a pointer
not tagged as __iomem.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/e1000/e1000_osdep.h	2011-02-23 10:00:31.496097384 -0800
+++ b/drivers/net/e1000/e1000_osdep.h	2011-02-23 10:00:47.740287665 -0800
@@ -42,7 +42,8 @@
 #define GBE_CONFIG_RAM_BASE \
 	((unsigned int)(CONFIG_RAM_BASE + GBE_CONFIG_OFFSET))
 
-#define GBE_CONFIG_BASE_VIRT    phys_to_virt(GBE_CONFIG_RAM_BASE)
+#define GBE_CONFIG_BASE_VIRT \
+	((void __iomem *)phys_to_virt(GBE_CONFIG_RAM_BASE))
 
 #define GBE_CONFIG_FLASH_WRITE(base, offset, count, data) \
 	(iowrite16_rep(base + offset, data, count))

^ permalink raw reply

* Re: [PATCH V11 2/4] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Grant Likely @ 2011-02-23 17:54 UTC (permalink / raw)
  To: Scott Wood
  Cc: Richard Cochran, Thomas Gleixner, Rodolfo Giometti, Arnd Bergmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Russell King, Paul Mackerras,
	John Stultz, Alan Cox, netdev-u79uwXL29TY76Z2rM5mHXA,
	Mike Frysinger, Christoph Lameter,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ, David Miller,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Krzysztof Halasa
In-Reply-To: <20110223112612.30071995@schlenkerla>

On Wed, Feb 23, 2011 at 11:26:12AM -0600, Scott Wood wrote:
> On Wed, 23 Feb 2011 09:50:58 -0700
> Grant Likely <grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org> wrote:
> 
> > On Wed, Feb 23, 2011 at 11:38:17AM +0100, Richard Cochran wrote:
> > > +
> > > +* Gianfar PTP clock nodes
> > > +
> > > +General Properties:
> > > +
> > > +  - compatible   Should be "fsl,etsec-ptp"
> > 
> > Should specify an *exact* part; ie: "fsl,mpc8313-etsec-ptp" instead of
> > trying to define a generic catchall.  The reason is that the same
> > marketing name can end up getting applied to a wide range of parts.
> > 
> > Instead, choose one specific device to stand in as the 'common'
> > implementation and get all parts with the same core to claim
> > compatibility with it.  ie: a p2020 might have:
> > 
> > 	compatible = "fsl,mpc2020-etsec-ptp", "fsl,mpc8313-etsec-ptp";
> 
> eTSEC is versioned, that's more reliable than the chip name since chips
> have revisions (rev 2.1 of mpc8313 has eTSEC 1.6, not sure about previous
> revs of mpc8313).  Logic blocks can be and have been uprevved between one
> revision of a chip to the next.  I think "fsl,mpc8313rev2.1-etsec-ptp"
> would be taking things a bit too far (and there could be board-level bugs
> too...).
> 
> If you really need to know the exact SoC you're on, look in SVR (which
> will provide revision info as well).  Isn't the device tree for things that
> can't be probed?

This is far more about the binding than it is about the chip revision.
When documenting a binding it makes far more sense to anchor it to a
specific implementation than to try and come up with a 'generic'
catchall.  A new binding means new compatible value and dropping any
claims of being compatible with the old.

It is not about the logic block version, particularly when it is
detectable by the driver as you say.

> 
> The eTSEC revision is probeable as well, but due the way PTP is described as
> a separate node, the driver doesn't have straightforward access to those
> registers.

Ignorant question: Should the ptp be described as a separate node?

> 
> Insisting on an explicit chip also encourages people to claim compatibility
> with that chip without ensuring that it really is fully compatible.

In practise, I've not seen this to be an issue.

g.

^ permalink raw reply

* [PATCH] qla3xxx: add missing __iomem annotation
From: Stephen Hemminger @ 2011-02-23 17:54 UTC (permalink / raw)
  To: Ron Mercer, David Miller; +Cc: netdev

Add necessary annotations about pointer to io memory space
that is checked by sparse.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/qla3xxx.c	2011-02-23 09:42:44.363626866 -0800
+++ b/drivers/net/qla3xxx.c	2011-02-23 09:43:32.564194771 -0800
@@ -379,7 +379,7 @@ static void fm93c56a_select(struct ql3_a
 {
 	struct ql3xxx_port_registers __iomem *port_regs =
 			qdev->mem_map_registers;
-	u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
+	__iomem u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
 
 	qdev->eeprom_cmd_data = AUBURN_EEPROM_CS_1;
 	ql_write_nvram_reg(qdev, spir, ISP_NVRAM_MASK | qdev->eeprom_cmd_data);
@@ -398,7 +398,7 @@ static void fm93c56a_cmd(struct ql3_adap
 	u32 previousBit;
 	struct ql3xxx_port_registers __iomem *port_regs =
 			qdev->mem_map_registers;
-	u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
+	__iomem u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
 
 	/* Clock in a zero, then do the start bit */
 	ql_write_nvram_reg(qdev, spir,
@@ -467,7 +467,7 @@ static void fm93c56a_deselect(struct ql3
 {
 	struct ql3xxx_port_registers __iomem *port_regs =
 			qdev->mem_map_registers;
-	u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
+	__iomem u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
 
 	qdev->eeprom_cmd_data = AUBURN_EEPROM_CS_0;
 	ql_write_nvram_reg(qdev, spir, ISP_NVRAM_MASK | qdev->eeprom_cmd_data);
@@ -483,7 +483,7 @@ static void fm93c56a_datain(struct ql3_a
 	u32 dataBit;
 	struct ql3xxx_port_registers __iomem *port_regs =
 			qdev->mem_map_registers;
-	u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
+	__iomem u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
 
 	/* Read the data bits */
 	/* The first bit is a dummy.  Clock right over it. */
@@ -3011,7 +3011,7 @@ static int ql_adapter_initialize(struct
 	u32 value;
 	struct ql3xxx_port_registers __iomem *port_regs =
 		qdev->mem_map_registers;
-	u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
+	__iomem u32 *spir = &port_regs->CommonRegs.serialPortInterfaceReg;
 	struct ql3xxx_host_memory_registers __iomem *hmem_regs =
 		(void __iomem *)port_regs;
 	u32 delay = 10;

^ permalink raw reply

* [PATCH] r8169: incorrect args to oob notify
From: Stephen Hemminger @ 2011-02-23 17:52 UTC (permalink / raw)
  To: Francois Romieu, David Miller; +Cc: netdev


Sparse detected this bug. The function oob_notify was being
passed the private ptr, but expected to get the ioaddr.
Compile checked only; not tested on real hardware.

Patch against net-next-2.6 but should be applied to net-2.6.
Bug not present in 2.6.37 and earlier.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>


--- a/drivers/net/r8169.c	2011-02-23 09:46:30.042289776 -0800
+++ b/drivers/net/r8169.c	2011-02-23 09:48:00.167355773 -0800
@@ -617,8 +617,9 @@ static void ocp_write(struct rtl8169_pri
 	}
 }
 
-static void rtl8168_oob_notify(void __iomem *ioaddr, u8 cmd)
+static void rtl8168_oob_notify(struct rtl8169_private *tp, u8 cmd)
 {
+	void __iomem *ioaddr = tp->mmio_addr;
 	int i;
 
 	RTL_W8(ERIDR, cmd);
@@ -630,7 +631,7 @@ static void rtl8168_oob_notify(void __io
 			break;
 	}
 
-	ocp_write(ioaddr, 0x1, 0x30, 0x00000001);
+	ocp_write(tp, 0x1, 0x30, 0x00000001);
 }
 
 #define OOB_CMD_RESET		0x00

^ permalink raw reply

* Stale entries in RT_TABLE_LOCAL
From: Alex Sidorenko @ 2011-02-23 17:43 UTC (permalink / raw)
  To: netdev@vger.kernel.org

[-- Attachment #1: Type: text/plain, Size: 6094 bytes --]

Hello,

I have found several scenarios when after deleting IP-address from an 
interface there is a stale entry left in RT_TABLE_LOCAL.

All these scenarios use the fact that it is possible to add the same address 
multiple times to the same interface using different masks.

Let us do the following using dummy0 interface:

ifconfig dummy0 192.168.140.31 netmask 255.255.252.0
ip addr add 192.168.142.109/23 dev dummy0
ip addr add 192.168.142.109/22 dev dummy0
ip addr del 192.168.142.109/22 dev dummy0
ip addr del 192.168.142.109/23 dev dummy0

We add 192.168.142.109/23 and 192.168.142.109/22, then delete them (order is 
important). After that, 192.168.142.109 is not in 'ip addr ls' but there are 
entries using this addr in RT_TABLE_LOCAL.

An attached script demonstrates the problem:

{asid 14:00:57} sudo sh iptest.sh
Tables before the test
13: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 5e:1a:fa:44:90:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.140.31/22 brd 192.168.143.255 scope global dummy0
    inet6 fe80::5c1a:faff:fe44:90f6/64 scope link 
       valid_lft forever preferred_lft forever

local 192.168.140.31 dev dummy0  proto kernel  scope host  src 192.168.140.31 
broadcast 192.168.140.0 dev dummy0  proto kernel  scope link  src 
192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.140.31 
----------------------
Tables after the test
13: dummy0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN 
    link/ether 5e:1a:fa:44:90:f6 brd ff:ff:ff:ff:ff:ff
    inet 192.168.140.31/22 brd 192.168.143.255 scope global dummy0
    inet6 fe80::5c1a:faff:fe44:90f6/64 scope link 
       valid_lft forever preferred_lft forever

local 192.168.140.31 dev dummy0  proto kernel  scope host  src 192.168.140.31 
local 192.168.142.109 dev dummy0  proto kernel  scope host  src 192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.140.31 
broadcast 192.168.143.255 dev dummy0  proto kernel  scope link  src
192.168.142.109 

As you see, even though there is no 192.168.142.109 on dummy0 address list, 
the entries referring to this addr are still present in RT_TABLE_LOCAL.

Another scenario (adding/deleting two addresses, each one twice with different 
mask) can lead to stale entries cross-referencing each other, like

local 192.168.5.8  proto kernel  scope host  src 192.168.5.9 
local 192.168.5.9  proto kernel  scope host  src 192.168.5.8 

Analysis
--------

Both scenarios use the fact that we can add the same address multiple times to 
the same interface, using different masks.

1. When we delete an IP addr, we remove it from the interface addr list and 
send a notifier to routing code (fib_del_ifaddr) asking to delete the 
associated routes.

2. When we enter fib_del_ifaddr(struct in_ifaddr *ifa), the address is already
deleted. But if we add the same IP twice (with different masks), the same
address (even though with different prefix) is present two times. So after the
first deletion we still have its 2nd instance on the list.

3. We do the following in fib_del_ifaddr():

         for (ifa1 = in_dev->ifa_list; ifa1; ifa1 = ifa1->ifa_next) {
                 if (ifa->ifa_local == ifa1->ifa_local)
                         ok |= LOCAL_OK;
                 if (ifa->ifa_broadcast == ifa1->ifa_broadcast)
                         ok |= BRD_OK;
                 if (brd == ifa1->ifa_broadcast)
                         ok |= BRD1_OK;
                 if (any == ifa1->ifa_broadcast)
                         ok |= BRD0_OK;
         }

That is, we loop on all addrs of the interface (in_dev->ifa_list) and compare
address we have just deleted (passed in 'ifa') with addresses on the list. 

As we compare them without taking prefix (mask) into account, the following 
will be true:

ifa->ifa_local == ifa1->ifa_local
ifa->ifa_broadcast == ifa1->ifa_broadcast

4. As a result, after deleting the first instance of IP (192.168.142.109/22) 
we still have  192.168.142.109/23 on the list. The routing code will find that 
this addr (and broadcast) are still present on the list and will not delete 
the routes.

5. When we delete the second time (192.168.142.109/23), there will be no
192.168.142.109 on the list anymore and the routing code will delete the route 
- but only one out of two entries.

How this can be fixed
---------------------

I am not sure what is the best way to fix this, I can think of several 
approaches:

  (a) change the sources so that it would be impossible to add the same IP 
      multiple times, even with different masks. I cannot think of any
      situation where adding the same IP (but with different mask) to the same
      interface could be useful. But maybe I am wrong?

  (b) improve the deletion algorithm in fib_del_ifaddr()

  (c) add a periodic cleanup that will purge all entries from 'local' table if
      there are no corresponding IPs on the interface list

Impact
------

Stale entries in RT_TABLE_LOCAL make ARP reply to requests for that IPs, even 
though these IPs do not belong to any interface.

These scenarios might seem a bit pathological, but in reality they are 
possible on clusters with multiple addresses on several interfaces, where 
addresses are added/deleted for service migration. Address migration can be 
done both by software and by system administrators and if by mistake a wrong 
mask is used, we can get this situation.

And yes, one of HP customers met exactly this problem. They saw a 'duplicate 
IP' issue after migrating some services and found that the host replies to 
ARP-request even though 'ip addr ls' did not show this address. It is not 
common knowledge that ARP implementation uses RT_TABLE_LOCAL to decide whether 
IP is local, so they were unable to understand what is wrong.

Regards,
Alex

------------------------------------------------------------------
Alexandre Sidorenko             email: asid@hp.com
WTEC Linux                      Hewlett-Packard (Canada)
------------------------------------------------------------------

[-- Attachment #2: iptest.sh --]
[-- Type: application/x-shellscript, Size: 597 bytes --]

^ permalink raw reply

* [PATCH] bonding: fix sparse warning
From: Stephen Hemminger @ 2011-02-23 17:40 UTC (permalink / raw)
  To: Jay Vosburgh, David Miller; +Cc: netdev

Fix use of zero where NULL expected. And wrap long line.

Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>

--- a/drivers/net/bonding/bonding.h	2011-02-23 09:33:39.493248155 -0800
+++ b/drivers/net/bonding/bonding.h	2011-02-23 09:33:48.825356637 -0800
@@ -265,7 +265,8 @@ struct bonding {
  *
  * Caller must hold bond lock for read
  */
-static inline struct slave *bond_get_slave_by_dev(struct bonding *bond, struct net_device *slave_dev)
+static inline struct slave *bond_get_slave_by_dev(struct bonding *bond,
+						  struct net_device *slave_dev)
 {
 	struct slave *slave = NULL;
 	int i;
@@ -276,7 +277,7 @@ static inline struct slave *bond_get_sla
 		}
 	}
 
-	return 0;
+	return NULL;
 }
 
 static inline struct bonding *bond_get_bond_by_slave(struct slave *slave)

^ permalink raw reply

* Re: [PATCH] net_sched: long word align struct qdisc_skb_cb data
From: Stephen Hemminger @ 2011-02-23 17:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Patrick McHardy, David Miller, Juliusz Chroboczek,
	John W. Linville, netdev, Andi Kleen
In-Reply-To: <1298480707.3301.386.camel@edumazet-laptop>

On Wed, 23 Feb 2011 18:05:07 +0100
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le mercredi 23 février 2011 à 17:24 +0100, Patrick McHardy a écrit :
> > Am 23.02.2011 16:14, schrieb Eric Dumazet:
> > > diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> > > index 16626a0..f40d32e 100644
> > > --- a/include/net/sch_generic.h
> > > +++ b/include/net/sch_generic.h
> > > @@ -218,6 +218,7 @@ struct tcf_proto {
> > >  
> > >  struct qdisc_skb_cb {
> > >  	unsigned int		pkt_len;
> > > +	unsigned int		sfb_classid;
> > >  	char			data[];
> > >  };
> > 
> > This could be moved into a SFB specific cb, similar to what netem
> > does.
> 
> Hmm... well... I want to be sure no other sch will destroy my values.
> 
> netem seems buggy then.
> 
> Probably following patch is needed ?

Yes, it was long word aligned when netem was written but
we seem to have bit creep.



-- 

^ permalink raw reply

* Re: [PATCH V11 2/4] ptp: Added a clock that uses the eTSEC found on the MPC85xx.
From: Scott Wood @ 2011-02-23 17:26 UTC (permalink / raw)
  To: Grant Likely
  Cc: Richard Cochran, Thomas Gleixner, Rodolfo Giometti, Arnd Bergmann,
	Peter Zijlstra, linux-api-u79uwXL29TY76Z2rM5mHXA,
	devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Russell King, Paul Mackerras,
	John Stultz, Alan Cox, netdev-u79uwXL29TY76Z2rM5mHXA,
	Mike Frysinger, Christoph Lameter,
	linuxppc-dev-uLR06cmDAlY/bJ5BZ2RsiQ, David Miller,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	Krzysztof Halasa
In-Reply-To: <20110223165058.GE14597-MrY2KI0G/OVr83L8+7iqerDks+cytr/Z@public.gmane.org>

On Wed, 23 Feb 2011 09:50:58 -0700
Grant Likely <grant.likely-s3s/WqlpOiPyB63q8FvJNQ@public.gmane.org> wrote:

> On Wed, Feb 23, 2011 at 11:38:17AM +0100, Richard Cochran wrote:
> > +
> > +* Gianfar PTP clock nodes
> > +
> > +General Properties:
> > +
> > +  - compatible   Should be "fsl,etsec-ptp"
> 
> Should specify an *exact* part; ie: "fsl,mpc8313-etsec-ptp" instead of
> trying to define a generic catchall.  The reason is that the same
> marketing name can end up getting applied to a wide range of parts.
> 
> Instead, choose one specific device to stand in as the 'common'
> implementation and get all parts with the same core to claim
> compatibility with it.  ie: a p2020 might have:
> 
> 	compatible = "fsl,mpc2020-etsec-ptp", "fsl,mpc8313-etsec-ptp";

eTSEC is versioned, that's more reliable than the chip name since chips
have revisions (rev 2.1 of mpc8313 has eTSEC 1.6, not sure about previous
revs of mpc8313).  Logic blocks can be and have been uprevved between one
revision of a chip to the next.  I think "fsl,mpc8313rev2.1-etsec-ptp"
would be taking things a bit too far (and there could be board-level bugs
too...).

If you really need to know the exact SoC you're on, look in SVR (which
will provide revision info as well).  Isn't the device tree for things that
can't be probed?

The eTSEC revision is probeable as well, but due the way PTP is described as
a separate node, the driver doesn't have straightforward access to those
registers.

Insisting on an explicit chip also encourages people to claim compatibility
with that chip without ensuring that it really is fully compatible.

-Scott

^ permalink raw reply

* Re: [PATCH net-next-2.6 v3] net_sched: SFB flow scheduler
From: Eric Dumazet @ 2011-02-23 17:16 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: David Miller, Juliusz Chroboczek, John W. Linville,
	Stephen Hemminger, netdev, Andi Kleen
In-Reply-To: <4D653C9C.8010205@trash.net>

Le mercredi 23 février 2011 à 17:58 +0100, Patrick McHardy a écrit :
> Am 23.02.2011 17:48, schrieb Eric Dumazet:
> > Le mercredi 23 février 2011 à 17:24 +0100, Patrick McHardy a écrit :
> > 
> >> This needs to be a per-skb property, otherwise you could have the
> >> situation:
> >>
> >> - enqueue skb, double_buffering=0, increment buffer 0
> >> - enable double buffering
> >> - swap buffers
> >> - dequeue same skb, decrement buffer 1
> >>
> >> after which the qlen values of buffer 1 will be incorrect.
> >>
> > 
> > Normally its OK, because we bzero() the zone, and the "decrement" is
> > 0-bounded.
> 
> Yeah, but we might decrement buckets of different flows which
> are non-zero. Probably not too bad, but still not correct.
> 
> > I had this idea (of storing two bits per skb), but :
> > 
> > - It means that swap_buffer() should not touch (bzero) the 'old' bins
> 
> Yes, it means we have to properly decrement the old buffer
> until all bins reached zero.
> 
> > - Since hash perturbator is changed, we have to store the two hash
> > values per skb (instead of one u32 / classid).
> 
> Indeed.

BTQ, I had this idea of storing the double_buffer per skb reading SFB
paper, because paper says the double buffering is really needed only for
unelastic flows, not for all packets.

paper quote :

As one set of bins is being used
for queue management, a second set of bins using the next set of hash
functions can be warmed up. In this
case, any time a flow is classified as non-responsive, it is hashed
using the second set of hash functions and
the marking probabilities of the corresponding bins in the warmup set
are updated.

So using two 'hash' values per skb is the way to go, with special 0
value meanings : skb was not 'inserted' into virtual queues.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox