Netdev List

* Re: Getting IP Address for netdev
From: Liu Jiusheng @ 2009-09-25 13:43 UTC (permalink / raw)
  To: Pawel Pastuszak; +Cc: netdev
In-Reply-To: <8ff29df80909241953j6c6ce8ebpaa5760504158062e@mail.gmail.com>

On Thu, Sep 24, 2009 at 10:53:41PM -0400, Pawel Pastuszak wrote:
> Hello everyone,
> 
> I am looking for some help. I am writing a netdev driver and I want
> to know how to get the current IP address that is set on the device.
> 

Use in_dev_get() to get the device's in_device, then use the for_ifa
macro to walk the IPv4 addresses set on the device. Release the
reference with in_dev_put() when you are done.

-- 
Liu Jiusheng

* Re: [PATCH] ax25: Fix ax25_cb refcounting in ax25_ctl_ioctl
From: Ralf Baechle DL5RB @ 2009-09-25 13:40 UTC (permalink / raw)
  To: Jarek Poplawski
  Cc: David Miller, Bernard Pidoux F6BVP, Bernard Pidoux,
	Linux Netdev List, linux-hams
In-Reply-To: <20090925131038.GA14778@ff.dom.local>

On Fri, Sep 25, 2009 at 01:10:38PM +0000, Jarek Poplawski wrote:

> This bug isn't responsible for these oopses here, but looks quite
> obvious. (I'm not sure if it's easy to test/hit with the common
> tools.)

The issue your patch fixes is obvious enough.

> Jarek P.
> ------------>
> [PATCH] ax25: Fix ax25_cb refcounting in ax25_ctl_ioctl
> 
> Use ax25_cb_put after ax25_find_cb in ax25_ctl_ioctl.
> 
> Reported-by: Bernard Pidoux F6BVP <f6bvp@free.fr>
> Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>

Reviewed-by: Ralf Baechle <ralf@linux-mips.org>

  Ralf

* Re: [PATCH net-next-2.6 v2] bonding: introduce primary_reselect option
From: Jiri Pirko @ 2009-09-25 13:28 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: netdev, davem, bonding-devel, nicolas.2p.debian
In-Reply-To: <21492.1253838848@death.nxdomain.ibm.com>

Subject: [PATCH net-2.6 v3] bonding: introduce primary_reselect option

In some cases it is not desirable to switch back to the primary interface when
its link recovers; it is better to stay with the currently active one. We need
to avoid packet loss as much as we can in such cases. This is solved by
introducing the primary_reselect option. Note that a newly enslaved primary
slave is set as the current active slave no matter what.

Patch modified by Jay Vosburgh as follows: fixed bug in action
after change of option setting via sysfs, revised the documentation
update, and bumped the bonding version number.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---

	Note that this patch depends on the "make ab_arp select active
slaves as other modes" patch recently approved, but not yet appearing in
net-next-2.6 as I write this.  http://patchwork.ozlabs.org/patch/32684/

Dave, this patch is diffed against net-2.6 but should apply cleanly on
net-next-2.6 once that tree includes the mentioned patch.

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index d5181ce..61f516b 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -1,7 +1,7 @@
 
 		Linux Ethernet Bonding Driver HOWTO
 
-		Latest update: 12 November 2007
+		Latest update: 23 September 2009
 
 Initial release : Thomas Davis <tadavis at lbl.gov>
 Corrections, HA extensions : 2000/10/03-15 :
@@ -614,6 +614,46 @@ primary
 
 	The primary option is only valid for active-backup mode.
 
+primary_reselect
+
+	Specifies the reselection policy for the primary slave.  This
+	affects how the primary slave is chosen to become the active slave
+	when failure of the active slave or recovery of the primary slave
+	occurs.  This option is designed to prevent flip-flopping between
+	the primary slave and other slaves.  Possible values are:
+
+	always or 0 (default)
+
+		The primary slave becomes the active slave whenever it
+		comes back up.
+
+	better or 1
+
+		The primary slave becomes the active slave when it comes
+		back up, if the speed and duplex of the primary slave is
+		better than the speed and duplex of the current active
+		slave.
+
+	failure or 2
+
+		The primary slave becomes the active slave only if the
+		current active slave fails and the primary slave is up.
+
+	The primary_reselect setting is ignored in two cases:
+
+		If no slaves are active, the first slave to recover is
+		made the active slave.
+
+		When initially enslaved, the primary slave is always made
+		the active slave.
+
+	Changing the primary_reselect policy via sysfs will cause an
+	immediate selection of the best active slave according to the new
+	policy.  This may or may not result in a change of the active
+	slave, depending upon the circumstances.
+
+	This option was added for bonding version 3.6.0.
+
 updelay
 
 	Specifies the time, in milliseconds, to wait before enabling a
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 69c5b15..19d57d5 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -94,6 +94,7 @@ static int downdelay;
 static int use_carrier	= 1;
 static char *mode;
 static char *primary;
+static char *primary_reselect;
 static char *lacp_rate;
 static char *ad_select;
 static char *xmit_hash_policy;
@@ -126,6 +127,14 @@ MODULE_PARM_DESC(mode, "Mode of operation : 0 for balance-rr, "
 		       "6 for balance-alb");
 module_param(primary, charp, 0);
 MODULE_PARM_DESC(primary, "Primary network device to use");
+module_param(primary_reselect, charp, 0);
+MODULE_PARM_DESC(primary_reselect, "Reselect primary slave "
+				   "once it comes up; "
+				   "0 for always (default), "
+				   "1 for only if speed of primary is "
+				   "better, "
+				   "2 for only on active slave "
+				   "failure");
 module_param(lacp_rate, charp, 0);
 MODULE_PARM_DESC(lacp_rate, "LACPDU tx rate to request from 802.3ad partner "
 			    "(slow/fast)");
@@ -200,6 +209,13 @@ const struct bond_parm_tbl fail_over_mac_tbl[] = {
 {	NULL,			-1},
 };
 
+const struct bond_parm_tbl pri_reselect_tbl[] = {
+{	"always",		BOND_PRI_RESELECT_ALWAYS},
+{	"better",		BOND_PRI_RESELECT_BETTER},
+{	"failure",		BOND_PRI_RESELECT_FAILURE},
+{	NULL,			-1},
+};
+
 struct bond_parm_tbl ad_select_tbl[] = {
 {	"stable",	BOND_AD_STABLE},
 {	"bandwidth",	BOND_AD_BANDWIDTH},
@@ -1070,6 +1086,25 @@ out:
 
 }
 
+static bool bond_should_change_active(struct bonding *bond)
+{
+	struct slave *prim = bond->primary_slave;
+	struct slave *curr = bond->curr_active_slave;
+
+	if (!prim || !curr || curr->link != BOND_LINK_UP)
+		return true;
+	if (bond->force_primary) {
+		bond->force_primary = false;
+		return true;
+	}
+	if (bond->params.primary_reselect == BOND_PRI_RESELECT_BETTER &&
+	    (prim->speed < curr->speed ||
+	     (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
+		return false;
+	if (bond->params.primary_reselect == BOND_PRI_RESELECT_FAILURE)
+		return false;
+	return true;
+}
 
 /**
  * find_best_interface - select the best available slave to be the active one
@@ -1094,7 +1129,8 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
 	}
 
 	if ((bond->primary_slave) &&
-	    bond->primary_slave->link == BOND_LINK_UP) {
+	    bond->primary_slave->link == BOND_LINK_UP &&
+	    bond_should_change_active(bond)) {
 		new_active = bond->primary_slave;
 	}
 
@@ -1678,8 +1714,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 
 	if (USES_PRIMARY(bond->params.mode) && bond->params.primary[0]) {
 		/* if there is a primary slave, remember it */
-		if (strcmp(bond->params.primary, new_slave->dev->name) == 0)
+		if (strcmp(bond->params.primary, new_slave->dev->name) == 0) {
 			bond->primary_slave = new_slave;
+			bond->force_primary = true;
+		}
 	}
 
 	write_lock_bh(&bond->curr_slave_lock);
@@ -3201,11 +3239,14 @@ static void bond_info_show_master(struct seq_file *seq)
 	}
 
 	if (USES_PRIMARY(bond->params.mode)) {
-		seq_printf(seq, "Primary Slave: %s\n",
+		seq_printf(seq, "Primary Slave: %s",
 			   (bond->primary_slave) ?
 			   bond->primary_slave->dev->name : "None");
+		if (bond->primary_slave)
+			seq_printf(seq, " (primary_reselect %s)",
+		   pri_reselect_tbl[bond->params.primary_reselect].modename);
 
-		seq_printf(seq, "Currently Active Slave: %s\n",
+		seq_printf(seq, "\nCurrently Active Slave: %s\n",
 			   (curr) ? curr->dev->name : "None");
 	}
 
@@ -4646,7 +4687,7 @@ int bond_parse_parm(const char *buf, const struct bond_parm_tbl *tbl)
 
 static int bond_check_params(struct bond_params *params)
 {
-	int arp_validate_value, fail_over_mac_value;
+	int arp_validate_value, fail_over_mac_value, primary_reselect_value;
 
 	/*
 	 * Convert string parameters.
@@ -4945,6 +4986,20 @@ static int bond_check_params(struct bond_params *params)
 		primary = NULL;
 	}
 
+	if (primary && primary_reselect) {
+		primary_reselect_value = bond_parse_parm(primary_reselect,
+							 pri_reselect_tbl);
+		if (primary_reselect_value == -1) {
+			pr_err(DRV_NAME
+			       ": Error: Invalid primary_reselect \"%s\"\n",
+			       primary_reselect ==
+					NULL ? "NULL" : primary_reselect);
+			return -EINVAL;
+		}
+	} else {
+		primary_reselect_value = BOND_PRI_RESELECT_ALWAYS;
+	}
+
 	if (fail_over_mac) {
 		fail_over_mac_value = bond_parse_parm(fail_over_mac,
 						      fail_over_mac_tbl);
@@ -4976,6 +5031,7 @@ static int bond_check_params(struct bond_params *params)
 	params->use_carrier = use_carrier;
 	params->lacp_fast = lacp_fast;
 	params->primary[0] = 0;
+	params->primary_reselect = primary_reselect_value;
 	params->fail_over_mac = fail_over_mac_value;
 
 	if (primary) {
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 6044e12..8ee6164 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1212,6 +1212,58 @@ static DEVICE_ATTR(primary, S_IRUGO | S_IWUSR,
 		   bonding_show_primary, bonding_store_primary);
 
 /*
+ * Show and set the primary_reselect flag.
+ */
+static ssize_t bonding_show_primary_reselect(struct device *d,
+					     struct device_attribute *attr,
+					     char *buf)
+{
+	struct bonding *bond = to_bond(d);
+
+	return sprintf(buf, "%s %d\n",
+		       pri_reselect_tbl[bond->params.primary_reselect].modename,
+		       bond->params.primary_reselect);
+}
+
+static ssize_t bonding_store_primary_reselect(struct device *d,
+					      struct device_attribute *attr,
+					      const char *buf, size_t count)
+{
+	int new_value, ret = count;
+	struct bonding *bond = to_bond(d);
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	new_value = bond_parse_parm(buf, pri_reselect_tbl);
+	if (new_value < 0)  {
+		pr_err(DRV_NAME
+		       ": %s: Ignoring invalid primary_reselect value %.*s.\n",
+		       bond->dev->name,
+		       (int) strlen(buf) - 1, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	bond->params.primary_reselect = new_value;
+	pr_info(DRV_NAME ": %s: setting primary_reselect to %s (%d).\n",
+		bond->dev->name, pri_reselect_tbl[new_value].modename,
+		new_value);
+
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+	bond_select_active_slave(bond);
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
+out:
+	rtnl_unlock();
+	return ret;
+}
+static DEVICE_ATTR(primary_reselect, S_IRUGO | S_IWUSR,
+		   bonding_show_primary_reselect,
+		   bonding_store_primary_reselect);
+
+/*
  * Show and set the use_carrier flag.
  */
 static ssize_t bonding_show_carrier(struct device *d,
@@ -1500,6 +1552,7 @@ static struct attribute *per_bond_attrs[] = {
 	&dev_attr_num_unsol_na.attr,
 	&dev_attr_miimon.attr,
 	&dev_attr_primary.attr,
+	&dev_attr_primary_reselect.attr,
 	&dev_attr_use_carrier.attr,
 	&dev_attr_active_slave.attr,
 	&dev_attr_mii_status.attr,
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 6824771..9c03c2e 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -23,8 +23,8 @@
 #include "bond_3ad.h"
 #include "bond_alb.h"
 
-#define DRV_VERSION	"3.5.0"
-#define DRV_RELDATE	"November 4, 2008"
+#define DRV_VERSION	"3.6.0"
+#define DRV_RELDATE	"September 26, 2009"
 #define DRV_NAME	"bonding"
 #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
 
@@ -131,6 +131,7 @@ struct bond_params {
 	int lacp_fast;
 	int ad_select;
 	char primary[IFNAMSIZ];
+	int primary_reselect;
 	__be32 arp_targets[BOND_MAX_ARP_TARGETS];
 };
 
@@ -190,6 +191,7 @@ struct bonding {
 	struct   slave *curr_active_slave;
 	struct   slave *current_arp_slave;
 	struct   slave *primary_slave;
+	bool     force_primary;
 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
 	rwlock_t lock;
 	rwlock_t curr_slave_lock;
@@ -258,6 +260,10 @@ static inline bool bond_is_lb(const struct bonding *bond)
 		|| bond->params.mode == BOND_MODE_ALB;
 }
 
+#define BOND_PRI_RESELECT_ALWAYS	0
+#define BOND_PRI_RESELECT_BETTER	1
+#define BOND_PRI_RESELECT_FAILURE	2
+
 #define BOND_FOM_NONE			0
 #define BOND_FOM_ACTIVE			1
 #define BOND_FOM_FOLLOW			2
@@ -348,6 +354,7 @@ extern const struct bond_parm_tbl bond_mode_tbl[];
 extern const struct bond_parm_tbl xmit_hashtype_tbl[];
 extern const struct bond_parm_tbl arp_validate_tbl[];
 extern const struct bond_parm_tbl fail_over_mac_tbl[];
+extern const struct bond_parm_tbl pri_reselect_tbl[];
 extern struct bond_parm_tbl ad_select_tbl[];
 
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
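
The reselection policy implemented by bond_should_change_active() above can be
tried outside the kernel with a small userspace model. This is only a sketch:
the names model_slave, model_bond and should_change_active are hypothetical
stand-ins rather than the driver's own types, and the speed and duplex values
are arbitrary numbers where a larger value means "better".

```c
#include <stdbool.h>
#include <stddef.h>

/* Policy values mirror the BOND_PRI_RESELECT_* defines in the patch. */
enum pri_reselect { PRI_ALWAYS = 0, PRI_BETTER = 1, PRI_FAILURE = 2 };

/* Hypothetical, simplified stand-ins for struct slave / struct bonding. */
struct model_slave {
	bool link_up;
	int speed;	/* e.g. Mbit/s */
	int duplex;	/* 0 = half, 1 = full */
};

struct model_bond {
	struct model_slave *primary;
	struct model_slave *curr_active;
	bool force_primary;	/* set when the primary was just enslaved */
	enum pri_reselect primary_reselect;
};

/* Returns true if the primary (assumed up) should become the active slave. */
static bool should_change_active(struct model_bond *bond)
{
	struct model_slave *prim = bond->primary;
	struct model_slave *curr = bond->curr_active;

	/* No usable active slave: policy is ignored. */
	if (!prim || !curr || !curr->link_up)
		return true;
	/* Freshly enslaved primary wins once, then the flag is cleared. */
	if (bond->force_primary) {
		bond->force_primary = false;
		return true;
	}
	/* "better": stay put unless the primary is strictly better. */
	if (bond->primary_reselect == PRI_BETTER &&
	    (prim->speed < curr->speed ||
	     (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
		return false;
	/* "failure": never take over from a working active slave. */
	if (bond->primary_reselect == PRI_FAILURE)
		return false;
	return true;
}
```

With a 100 Mbit/s primary and a working 1000 Mbit/s active slave, the model
returns true for "always" and false for both "better" and "failure".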

* Re: [PATCH net-next-2.6 v2] bonding: introduce primary_reselect option
From: Jiri Pirko @ 2009-09-25 13:25 UTC (permalink / raw)
  To: Bill Fink; +Cc: Jay Vosburgh, netdev, davem, bonding-devel, nicolas.2p.debian
In-Reply-To: <20090925034725.60bcd85b.billfink@mindspring.com>

Fri, Sep 25, 2009 at 09:47:25AM CEST, billfink@mindspring.com wrote:
>On Thu, 24 Sep 2009, Jay Vosburgh wrote:
>
>> 
>> From: Jiri Pirko <jpirko@redhat.com>
>> 
>> In some cases it is not desirable to switch back to the primary interface when
>> its link recovers; it is better to stay with the currently active one. We need
>> to avoid packet loss as much as we can in such cases. This is solved by
>> introducing the primary_reselect option. Note that a newly enslaved primary
>> slave is set as the current active slave no matter what.
>> 
>> Patch modified by Jay Vosburgh as follows: fixed bug in action
>> after change of option setting via sysfs, revised the documentation
>> update, and bumped the bonding version number.
>> 
>> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
>> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
>> ---
>> 
>> 	Note that this patch depends on the "make ab_arp select active
>> slaves as other modes" patch recently approved, but not yet appearing in
>> net-next-2.6 as I write this.  http://patchwork.ozlabs.org/patch/32684/
>> 
>>  Documentation/networking/bonding.txt |   42 +++++++++++++++++++++-
>>  drivers/net/bonding/bond_main.c      |   66 +++++++++++++++++++++++++++++++---
>>  drivers/net/bonding/bond_main.c.rej  |   18 +++++++++
>>  drivers/net/bonding/bond_sysfs.c     |   53 +++++++++++++++++++++++++++
>>  drivers/net/bonding/bonding.h        |   11 +++++-
>>  5 files changed, 182 insertions(+), 8 deletions(-)
>>  create mode 100644 drivers/net/bonding/bond_main.c.rej
>
>I doubt you intended to include a patch reject file in your patch.

Noticed - I'm about to resend...
>
>						-Bill

* [PATCH] ax25: Fix ax25_cb refcounting in ax25_ctl_ioctl
From: Jarek Poplawski @ 2009-09-25 13:10 UTC (permalink / raw)
  To: David Miller
  Cc: Bernard Pidoux F6BVP, Bernard Pidoux, Ralf Baechle DL5RB,
	Linux Netdev List, linux-hams
In-Reply-To: <4ABA9058.3010605@free.fr>

This bug isn't responsible for these oopses here, but looks quite
obvious. (I'm not sure if it's easy to test/hit with the common
tools.)

Jarek P.
------------>
[PATCH] ax25: Fix ax25_cb refcounting in ax25_ctl_ioctl

Use ax25_cb_put after ax25_find_cb in ax25_ctl_ioctl.

Reported-by: Bernard Pidoux F6BVP <f6bvp@free.fr>
Signed-off-by: Jarek Poplawski <jarkao2@gmail.com>
---

 net/ax25/af_ax25.c |   27 +++++++++++++++++----------
 1 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/net/ax25/af_ax25.c b/net/ax25/af_ax25.c
index d6b1b05..fbcac76 100644
--- a/net/ax25/af_ax25.c
+++ b/net/ax25/af_ax25.c
@@ -358,6 +358,7 @@ static int ax25_ctl_ioctl(const unsigned int cmd, void __user *arg)
 	ax25_dev *ax25_dev;
 	ax25_cb *ax25;
 	unsigned int k;
+	int ret = 0;
 
 	if (copy_from_user(&ax25_ctl, arg, sizeof(ax25_ctl)))
 		return -EFAULT;
@@ -388,57 +389,63 @@ static int ax25_ctl_ioctl(const unsigned int cmd, void __user *arg)
 	case AX25_WINDOW:
 		if (ax25->modulus == AX25_MODULUS) {
 			if (ax25_ctl.arg < 1 || ax25_ctl.arg > 7)
-				return -EINVAL;
+				goto einval_put;
 		} else {
 			if (ax25_ctl.arg < 1 || ax25_ctl.arg > 63)
-				return -EINVAL;
+				goto einval_put;
 		}
 		ax25->window = ax25_ctl.arg;
 		break;
 
 	case AX25_T1:
 		if (ax25_ctl.arg < 1)
-			return -EINVAL;
+			goto einval_put;
 		ax25->rtt = (ax25_ctl.arg * HZ) / 2;
 		ax25->t1  = ax25_ctl.arg * HZ;
 		break;
 
 	case AX25_T2:
 		if (ax25_ctl.arg < 1)
-			return -EINVAL;
+			goto einval_put;
 		ax25->t2 = ax25_ctl.arg * HZ;
 		break;
 
 	case AX25_N2:
 		if (ax25_ctl.arg < 1 || ax25_ctl.arg > 31)
-			return -EINVAL;
+			goto einval_put;
 		ax25->n2count = 0;
 		ax25->n2 = ax25_ctl.arg;
 		break;
 
 	case AX25_T3:
 		if (ax25_ctl.arg < 0)
-			return -EINVAL;
+			goto einval_put;
 		ax25->t3 = ax25_ctl.arg * HZ;
 		break;
 
 	case AX25_IDLE:
 		if (ax25_ctl.arg < 0)
-			return -EINVAL;
+			goto einval_put;
 		ax25->idle = ax25_ctl.arg * 60 * HZ;
 		break;
 
 	case AX25_PACLEN:
 		if (ax25_ctl.arg < 16 || ax25_ctl.arg > 65535)
-			return -EINVAL;
+			goto einval_put;
 		ax25->paclen = ax25_ctl.arg;
 		break;
 
 	default:
-		return -EINVAL;
+		goto einval_put;
 	  }
 
-	return 0;
+out_put:
+	ax25_cb_put(ax25);
+	return ret;
+
+einval_put:
+	ret = -EINVAL;
+	goto out_put;
 }
 
 static void ax25_fillin_cb_from_dev(ax25_cb *ax25, ax25_dev *ax25_dev)
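
The shape of the fix (a lookup that returns with a reference held, followed by
a single exit path that drops it) can be shown in a standalone userspace
sketch. Everything named model_* below is hypothetical; the refcount is a
plain int rather than the kernel's atomic counter, and only the AX25_WINDOW
range check for the standard modulus is modelled.

```c
#include <errno.h>

/* Hypothetical stand-in for ax25_cb: just a refcount and one setting. */
struct model_cb {
	int refcount;
	int window;
};

/* Lookup that returns with a reference held, as the patch assumes. */
static struct model_cb *model_find_cb(struct model_cb *cb)
{
	cb->refcount++;
	return cb;
}

static void model_cb_put(struct model_cb *cb)
{
	cb->refcount--;
}

/* Every path after the lookup leaves through out_put. */
static int model_set_window(struct model_cb *entry, int arg)
{
	struct model_cb *cb = model_find_cb(entry);
	int ret = 0;

	if (arg < 1 || arg > 7)
		goto einval_put;
	cb->window = arg;

out_put:
	model_cb_put(cb);
	return ret;

einval_put:
	ret = -EINVAL;
	goto out_put;
}
```

Whether the argument is accepted or rejected, the reference count is back at
its starting value when the function returns, which is the property the bare
"return -EINVAL" paths were missing.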

* Re: TCP stack bug related to F-RTO?
From: Ilpo Järvinen @ 2009-09-25 13:09 UTC (permalink / raw)
  To: Ray Lee; +Cc: Joe Cao, Netdev, LKML, jcaoco2002
In-Reply-To: <2c0942db0909241639vafd348eia19436d18b182d60@mail.gmail.com>

On Thu, 24 Sep 2009, Ray Lee wrote:

> [adding netdev cc:]
> 
> On Thu, Sep 24, 2009 at 10:43 AM, Joe Cao <caoco2002@yahoo.com> wrote:
> >
> > Hello,
> >
> > I have found the following behavior with different versions of linux 
> > kernel. The attached pcap trace is collected with server 
> > (192.168.0.13) running 2.6.24 and shows the problem. Basically the 
> > behavior is like this: 
> >
> > 1. The client opens up a big window,
> > 2. the server sends 19 packets in a row (pkt #14- #32 in the trace), but all of them are dropped due to some congestion.
> > 3. The server hits RTO and retransmits pkt #14 in #33
> > 4. The client immediately acks #33 (=#14), and the server (seems to enter F-RTO) expands the window and sends *NEW* pkt #35 & #36. Timeout is doubled to 2*RTO; the client immediately sends two dup-acks in response to #35 and #36.
> > 5. after 2*RTO, pkt #15 is retransmitted in #39.
> > 6. The client immediately acks #39 (=#15) in #40, and the server continues to expand the window and sends two *NEW* pkt #41 & #42. Now the timeout is doubled to 4*RTO.
> > 8. After 4*RTO timeout, #16 is retransmitted.
> > 9....
> > 10. The above steps repeat for retransmitting pkt #16-#32, and each time the timeout is doubled.
> > 11. It takes a long long time to retransmit all the lost packets and before that is done, the client sends a RST because of timeout.
> >
> > The above behavior looks like F-RTO is in effect.  And there seems to 
> > be a bug in the TCP's congestion control and retransmission algorithm. 
> > Why doesn't the TCP on server (running 2.6.24) enter the slow start? 
> > Why should the server take that long to recover from a short period 
> > of packet loss?
> >
> > Has anyone else noticed a similar problem before?  If my analysis is
> > wrong, can anyone give me some pointers to what's really wrong and
> > how to fix it?

Yes, 2.6.24 is an obsolete version with known bugs in its F-RTO
implementation. The fixes never went into the 2.6.24 stable series because it
was _already_ obsolete when the problems were reported and found. The
correct fixes can be found in 2.6.25.7 (.7 iirc) and are included from
2.6.26 onward too.

Just in case you happen to run an Ubuntu-based kernel from that era (of
course you should be reporting the bug here then...), a word of warning:
it seemed nearly impossible for them to get a simple thing like that
fixed. I haven't checked whether they eventually came to some sensible
conclusion on the matter or whether it is still unresolved (or, e.g.,
closed without a real resolution).

-- 
 i.

* Re: [linux-pm] [PATCH] 3c59x: Get rid of "Trying to free already-free IRQ"
From: Rafael J. Wysocki @ 2009-09-25 13:02 UTC (permalink / raw)
  To: avorontsov; +Cc: Alan Stern, linux-pm, netdev, David Miller
In-Reply-To: <20090925125616.GA20863@oksana.dev.rtsoft.ru>

On Friday 25 September 2009, Anton Vorontsov wrote:
> On Fri, Sep 25, 2009 at 02:32:30PM +0200, Rafael J. Wysocki wrote:
> > On Friday 25 September 2009, Alan Stern wrote:
> > > On Thu, 24 Sep 2009, Anton Vorontsov wrote:
> > > 
> > > > Though, there are a few other issues with suspend/resume in this driver.
> > > > The intention of calling free_irq() in suspend() was to avoid any
> > > > possible spurious interrupts (see commit 5b039e681b8c5f30aac9cc04385
> > > > "3c59x PM fixes"). But,
> > > > 
> > > > - On resume, the driver was requesting IRQ just after pci_set_master(),
> > > >   but before vortex_up() (which actually resets 3c59x chips).
> > > 
> > > Shouldn't it be possible to reset the chip (or at least prevent it from 
> > > generating spurious IRQs) during the early-resume phase?
> > > 
> > > > - Issuing free_irq() on a shared IRQ doesn't guarantee that a buggy
> > > >   HW won't trigger spurious interrupts in another driver that
> > > >   requested the same interrupt. So, if we want to protect from
> > > >   unexpected interrupts, then on suspend we should issue disable_irq(),
> > > >   not free_irq().
> > > 
> > > What if some other device shares the IRQ and still relies on receiving
> > > interrupts when this code runs?  Won't disable_irq() mess up the other
> > > device?
> > 
> > Ah, I overlooked the disable_irq()/enable_irq() part, which is not really
> > necessary anyway.
> 
> Well, it is necessary if 3c59x really throws spurious interrupts
> upon suspend (i.e. after pci_disable_device(pdev)). My first thought
> was to just remove free/request_irq stuff, but then I could introduce
> a regression if 3c59x really throws unexpected IRQs and 3c59x was
> the only user of a PCI IRQ (in that case free_irq() would actually
> help).
> 
> > Anton, have you tried without that?
> 
> Yes, and there weren't any issues with the 3c59x I have. Alan raised a very
> good point, and converting to dev_pm_ops as you've suggested would
> solve it in a nice way, since if we use the dev_pm_ops, PCI core
> will disable the device in _noirq suspend, after we quiesced the
> chip itself.

That's exactly why I suggested to do that. :-)

> I'll send another patch that reworks PM stuff in the driver soon.

Thanks a lot for taking care of this!

Best,
Rafael

* Re: [linux-pm] [PATCH] 3c59x: Get rid of "Trying to free already-free IRQ"
From: Anton Vorontsov @ 2009-09-25 12:56 UTC (permalink / raw)
  To: Rafael J. Wysocki; +Cc: Alan Stern, linux-pm, netdev, David Miller
In-Reply-To: <200909251432.30738.rjw@sisk.pl>

On Fri, Sep 25, 2009 at 02:32:30PM +0200, Rafael J. Wysocki wrote:
> On Friday 25 September 2009, Alan Stern wrote:
> > On Thu, 24 Sep 2009, Anton Vorontsov wrote:
> > 
> > > Though, there are a few other issues with suspend/resume in this driver.
> > > The intention of calling free_irq() in suspend() was to avoid any
> > > possible spurious interrupts (see commit 5b039e681b8c5f30aac9cc04385
> > > "3c59x PM fixes"). But,
> > > 
> > > - On resume, the driver was requesting IRQ just after pci_set_master(),
> > >   but before vortex_up() (which actually resets 3c59x chips).
> > 
> > Shouldn't it be possible to reset the chip (or at least prevent it from 
> > generating spurious IRQs) during the early-resume phase?
> > 
> > > - Issuing free_irq() on a shared IRQ doesn't guarantee that a buggy
> > >   HW won't trigger spurious interrupts in another driver that
> > >   requested the same interrupt. So, if we want to protect from
> > >   unexpected interrupts, then on suspend we should issue disable_irq(),
> > >   not free_irq().
> > 
> > What if some other device shares the IRQ and still relies on receiving
> > interrupts when this code runs?  Won't disable_irq() mess up the other
> > device?
> 
> Ah, I overlooked the disable_irq()/enable_irq() part, which is not really
> necessary anyway.

Well, it is necessary if 3c59x really throws spurious interrupts
upon suspend (i.e. after pci_disable_device(pdev)). My first thought
was to just remove free/request_irq stuff, but then I could introduce
a regression if 3c59x really throws unexpected IRQs and 3c59x was
the only user of a PCI IRQ (in that case free_irq() would actually
help).

> Anton, have you tried without that?

Yes, and there weren't any issues with the 3c59x I have. Alan raised a very
good point, and converting to dev_pm_ops as you've suggested would
solve it in a nice way, since if we use the dev_pm_ops, PCI core
will disable the device in _noirq suspend, after we quiesced the
chip itself.
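
Alan's point about shared lines can be illustrated with a toy userspace model
of one interrupt line with several sharers. This is only a sketch under loud
assumptions: model_irq_line and the model_* helpers are hypothetical, masked
interrupts are simply dropped rather than held pending, and nothing here is
the kernel's real IRQ bookkeeping.

```c
#include <stdbool.h>

#define MAX_HANDLERS 4

/* Toy model of one shared interrupt line. */
struct model_irq_line {
	bool registered[MAX_HANDLERS];	/* which devices have a handler */
	int disable_depth;		/* > 0 means the whole line is masked */
	int delivered[MAX_HANDLERS];	/* interrupts seen per device */
};

/* free_irq()-like: removes only this device's handler. */
static void model_free_irq(struct model_irq_line *l, int dev)
{
	l->registered[dev] = false;
}

/* disable_irq()-like: masks the line for every sharer. */
static void model_disable_irq(struct model_irq_line *l)
{
	l->disable_depth++;
}

static void model_enable_irq(struct model_irq_line *l)
{
	if (l->disable_depth > 0)
		l->disable_depth--;
}

/* The line fires: each registered handler runs unless the line is masked. */
static void model_raise(struct model_irq_line *l)
{
	int i;

	if (l->disable_depth > 0)
		return;
	for (i = 0; i < MAX_HANDLERS; i++)
		if (l->registered[i])
			l->delivered[i]++;
}
```

In this model, disabling the line stops delivery to the other sharer as well,
while freeing one handler leaves the other sharer's delivery untouched.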

I'll send another patch that reworks PM stuff in the driver soon.

Thanks,

-- 
Anton Vorontsov
email: cbouatmailru@gmail.com
irc://irc.freenode.net/bd2

* Re: [PATCH] 3c59x: Get rid of "Trying to free already-free IRQ"
From: Rafael J. Wysocki @ 2009-09-25 12:35 UTC (permalink / raw)
  To: David Miller; +Cc: avorontsov, linux-pm, netdev
In-Reply-To: <20090924.152619.260814858.davem@davemloft.net>

On Friday 25 September 2009, David Miller wrote:
> From: Anton Vorontsov <avorontsov@ru.mvista.com>
> Date: Fri, 25 Sep 2009 01:30:39 +0400
> 
> > On Thu, Sep 24, 2009 at 10:30:33PM +0200, Rafael J. Wysocki wrote:
> >> On Thursday 24 September 2009, Anton Vorontsov wrote:
> >> > Following trace pops up if we try to suspend with 3c59x ethernet NIC
> >> > brought down:
> >> 
> >> Patch looks good, but IMO it'd be a little effort to convert the driver to
> >> dev_pm_ops while you're at it (please see r8169 for a working example).
> > 
> > I'd like to avoid putting irrelevant stuff into bugfixes.
> 
> Agreed.

Well, the point is that all of the PCI core stuff the driver does is not
necessary and should better be dropped along with the IRQ thing.

Best,
Rafael

* Re: [linux-pm] [PATCH] 3c59x: Get rid of "Trying to free already-free IRQ"
From: Rafael J. Wysocki @ 2009-09-25 12:32 UTC (permalink / raw)
  To: Alan Stern; +Cc: linux-pm, Anton Vorontsov, netdev, David Miller
In-Reply-To: <Pine.LNX.4.44L0.0909250038300.14463-100000@netrider.rowland.org>

On Friday 25 September 2009, Alan Stern wrote:
> On Thu, 24 Sep 2009, Anton Vorontsov wrote:
> 
> > Though, there are a few other issues with suspend/resume in this driver.
> > The intention of calling free_irq() in suspend() was to avoid any
> > possible spurious interrupts (see commit 5b039e681b8c5f30aac9cc04385
> > "3c59x PM fixes"). But,
> > 
> > - On resume, the driver was requesting IRQ just after pci_set_master(),
> >   but before vortex_up() (which actually resets 3c59x chips).
> 
> Shouldn't it be possible to reset the chip (or at least prevent it from 
> generating spurious IRQs) during the early-resume phase?
> 
> > - Issuing free_irq() on a shared IRQ doesn't guarantee that a buggy
> >   HW won't trigger spurious interrupts in another driver that
> >   requested the same interrupt. So, if we want to protect from
> >   unexpected interrupts, then on suspend we should issue disable_irq(),
> >   not free_irq().
> 
> What if some other device shares the IRQ and still relies on receiving
> interrupts when this code runs?  Won't disable_irq() mess up the other
> device?

Ah, I overlooked the disable_irq()/enable_irq() part, which is not really
necessary anyway.

Anton, have you tried without that?

Rafael

* Re: TCP stack bug related to F-RTO?
From: zhigang gong @ 2009-09-25  8:55 UTC (permalink / raw)
  To: Joe Cao; +Cc: linux-kernel, jcaoco2002, netdev
In-Reply-To: <511432.48405.qm@web63401.mail.re1.yahoo.com>

Oh, I see, so I spoke too quickly in my last mail. You just left some packets
out of the trace. I have analysed the traffic flow and have some findings below;
I hope they are helpful.

>> > 1. The client opens up a big window,
>> > 2. the server sends 19 packets in a row (pkt #14- #32
>> in the trace), but all of them are dropped due to some
>> congestion.
>> > 3. The server hits RTO and retransmits pkt #14 in #33
This retransmission timer expiry causes the server's TCP/IP
stack to enter slow start mode; as a result, the server's
sending window is reduced to one.

>> > 4. The client immediately acks #33 (=#14), and the
>> server (seems to enter F-RTO) expands the window and
>> sends *NEW* pkt #35 & #36. Timeout is doubled to
>> 2*RTO; the client immediately sends two dup-acks in
>> response to #35 and #36.

The server is still in slow start mode and extends the window to 2.

>> > 5. after 2*RTO, pkt #15 is retransmitted in #39.

Here the second retransmission timer expiry occurs. The server's sending
window is reduced to one again and it continues in slow start mode.

>> > 6. The client immediately acks #39 (=#15) in #40, and
>> the server continues to expand the window and sends two
>> *NEW* pkt #41 & #42. Now the timeout is doubled to 4
>> *RTO.
Here you ignore the two duplicate acks #37 and #38 sent by the client. As far
as I know, the server must receive three or more duplicate acks before it
enters fast retransmit mode; otherwise it stays in slow start mode and waits
for the next retransmission timer expiry before retransmitting the lost
packets. And this is actually what you got.

I'm not a kernel expert; I'm just analysing this from the TCP protocol
standard. In my view there is no problem in the server's network stack, but
there may be some problem on the client side (or in some intermediate network
appliance), as it always sends just two duplicate acks at the same time and
never sends a third one, no matter how long the interval is. In my opinion, if
the client sent the third duplicate ack, the server would enter fast
retransmit mode and then fast recovery, and everything would be OK.
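
The duplicate-ack argument above can be sketched as a tiny counter. This is a
minimal userspace model, not the kernel's logic: model_sender and model_on_ack
are hypothetical names, the threshold of three is the conventional
fast-retransmit value, and sequence-number wraparound is ignored.

```c
#include <stdbool.h>
#include <stdint.h>

#define DUPACK_THRESHOLD 3	/* conventional fast-retransmit threshold */

struct model_sender {
	uint32_t last_ack;	/* highest cumulative ACK seen so far */
	int dupacks;		/* consecutive duplicates of last_ack */
};

/* Feed one incoming ACK; returns true when fast retransmit would fire. */
static bool model_on_ack(struct model_sender *s, uint32_t ack)
{
	if (ack == s->last_ack) {
		s->dupacks++;
		return s->dupacks == DUPACK_THRESHOLD;
	}
	/* New data acknowledged: the duplicate count starts over. */
	s->last_ack = ack;
	s->dupacks = 0;
	return false;
}
```

Two duplicates followed by an ACK for new data reset the counter, so a client
that only ever produces two duplicates per round never reaches the threshold
in this model.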

>> > 8. After 4*RTO timeout, #16 is retransmitted.
>> > 9....
>> > 10. The above steps repeat for retransmitting pkt
>> #16-#32 and each time the timeout is doubled.
>> > 11. It takes a long long time to retransmit all the
>> lost packets and before that is done, the client sends a RST
>> because of timeout.
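
To get a feel for why recovering one segment per doubled timeout takes so
long, the waiting time can be added up in a few lines. This is a rough sketch
of the behaviour described in the steps above, not of any real stack: the
function name is hypothetical, and the initial timeout and the cap are
parameters chosen by the caller, not values taken from the trace.

```c
#include <stdint.h>

/*
 * Total time (ms) spent waiting if each of `segments` lost segments needs
 * its own timeout and the timeout doubles after every expiry, up to cap_ms.
 */
static uint64_t model_backoff_total_ms(uint64_t rto_ms, uint64_t cap_ms,
				       unsigned int segments)
{
	uint64_t total = 0;
	unsigned int i;

	for (i = 0; i < segments; i++) {
		total += rto_ms;	/* wait for this expiry */
		rto_ms *= 2;		/* exponential backoff */
		if (rto_ms > cap_ms)
			rto_ms = cap_ms;
	}
	return total;
}
```

As an illustration only, an initial timeout of 200 ms, a cap of 120000 ms and
19 segments add up to 1284600 ms, i.e. over 21 minutes of waiting.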

On Fri, Sep 25, 2009 at 2:42 PM, Joe Cao <caoco2002@yahoo.com> wrote:
> Hi,
>
> On the wrong tcp checksum, that's because of hardware checksum offload.
>
> As for the seq/ack number, because the trace is long, I deliberately removed those irrelevant packets between after the three-way handshake and when the problem happens.  That can be seen from the timestamps.
>
> Please also note that I intentionally replaced the IP addresses and mac addresses in the trace to hide proprietary information in the trace.
>
> Anyway, the problem is not related to the checksum, or seq/ack number, otherwise, you won't see the behavior shown in the trace.
>
> Thanks,
> Joe
>
> --- On Thu, 9/24/09, zhigang gong <zhigang.gong@gmail.com> wrote:
>

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-25  8:22 UTC (permalink / raw)
  To: Gregory Haskins
  Cc: Ira W. Snyder, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <4ABBB46D.2000102@gmail.com>

On 09/24/2009 09:03 PM, Gregory Haskins wrote:
>
>> I don't really see how vhost and vbus are different here.  vhost expects
>> signalling to happen through a couple of eventfds and requires someone
>> to supply them and implement kernel support (if needed).  vbus requires
>> someone to write a connector to provide the signalling implementation.
>> Neither will work out-of-the-box when implementing virtio-net over
>> falling dominos, for example.
>>      
> I realize in retrospect that my choice of words above implies vbus _is_
> complete, but this is not what I was saying.  What I was trying to
> convey is that vbus is _more_ complete.  Yes, in either case some kind
> of glue needs to be written.  The difference is that vbus implements
> more of the glue generally, and leaves less required to be customized
> for each iteration.
>    


No argument there.  Since you care about non-virt scenarios and virtio 
doesn't, naturally vbus is a better fit for them as the code stands.  
But that's not a strong argument for vbus; instead of adding vbus you 
could make virtio more friendly to non-virt (there's a limit how far you 
can take this, not imposed by the code, but by virtio's charter as a 
virtual device driver framework).

> Going back to our stack diagrams, you could think of a vhost solution
> like this:
>
> --------------------------
> | virtio-net
> --------------------------
> | virtio-ring
> --------------------------
> | virtio-bus
> --------------------------
> | ? undefined-1 ?
> --------------------------
> | vhost
> --------------------------
>
> and you could think of a vbus solution like this
>
> --------------------------
> | virtio-net
> --------------------------
> | virtio-ring
> --------------------------
> | virtio-bus
> --------------------------
> | bus-interface
> --------------------------
> | ? undefined-2 ?
> --------------------------
> | bus-model
> --------------------------
> | virtio-net-device (vhost ported to vbus model? :)
> --------------------------
>
>
> So the difference between vhost and vbus in this particular context is
> that you need to have "undefined-1" do device discovery/hotswap,
> config-space, address-decode/isolation, signal-path routing, memory-path
> routing, etc.  Today this function is filled by things like virtio-pci,
> pci-bus, KVM/ioeventfd, and QEMU for x86.  I am not as familiar with
> lguest, but presumably it is filled there by components like
> virtio-lguest, lguest-bus, lguest.ko, and lguest-launcher.  And to use
> more contemporary examples, we might have virtio-domino, domino-bus,
> domino.ko, and domino-launcher as well as virtio-ira, ira-bus, ira.ko,
> and ira-launcher.
>
> Contrast this to the vbus stack:  The bus-X components (when optionally
> employed by the connector designer) do device-discovery, hotswap,
> config-space, address-decode/isolation, signal-path and memory-path
> routing, etc in a general (and pv-centric) way. The "undefined-2"
> portion is the "connector", and just needs to convey messages like
> "DEVCALL" and "SHMSIGNAL".  The rest is handled in other parts of the stack.
>
>    

Right.  virtio assumes that it's in a virt scenario and that the guest 
architecture already has enumeration and hotplug mechanisms which it 
would prefer to use.  That happens to be the case for kvm/x86.

> So to answer your question, the difference is that the part that has to
> be customized in vbus should be a fraction of what needs to be
> customized with vhost because it defines more of the stack.

But if you want to use the native mechanisms, vbus doesn't have any 
added value.

> And, as
> alluded to in my diagram, both virtio-net and vhost (with some
> modifications to fit into the vbus framework) are potentially
> complementary, not competitors.
>    

Only theoretically.  The existing installed base would have to be thrown 
away, or we'd need to support both.

  


>> Without a vbus-connector-falling-dominos, vbus-venet can't do anything
>> either.
>>      
> Mostly covered above...
>
> However, I was addressing your assertion that vhost somehow magically
> accomplishes this "container/addressing" function without any specific
> kernel support.  This is incorrect.  I contend that this kernel support
> is required and present.  The difference is that it's defined elsewhere
> (and typically in a transport/arch specific way).
>
> IOW: You can basically think of the programmed PIO addresses as forming
> its "container".  Only addresses explicitly added are visible, and
> everything else is inaccessible.  This whole discussion is merely a
> question of what's been generalized versus what needs to be
> re-implemented each time.
>    

Sorry, this is too abstract for me.



>> vbus doesn't do kvm guest address decoding for the fast path.  It's
>> still done by ioeventfd.
>>      
> That is not correct.  vbus does its own native address decoding in the
> fast path, such as here:
>
> http://git.kernel.org/?p=linux/kernel/git/ghaskins/alacrityvm/linux-2.6.git;a=blob;f=kernel/vbus/client.c;h=e85b2d92d629734866496b67455dd307486e394a;hb=e6cbd4d1decca8e829db3b2b9b6ec65330b379e9#l331
>
>    

All this is after kvm has decoded that vbus is addressed.  It can't work 
without someone outside vbus deciding that.

> In fact, it's actually a simpler design to unify things this way because
> you avoid splitting the device model up. Consider how painful the vhost
> implementation would be if it didn't already have the userspace
> virtio-net to fall-back on.  This is effectively what we face for new
> devices going forward if that model is to persist.
>    


It doesn't have just virtio-net, it has userspace-based hotplug and a 
bunch of other devices implemented in userspace.  Currently qemu has 
virtio bindings for pci and syborg (whatever that is), and device models 
for balloon, block, net, and console, so it seems implementing device 
support in userspace is not as disastrous as you make it out to be.

>> Invariably?
>>      
> As in "always"
>    

Refactor instead of duplicating.

>    
>>   Use libraries (virtio-shmem.ko, libvhost.so).
>>      
> What do you suppose vbus is?  vbus-proxy.ko = virtio-shmem.ko, and you
> dont need libvhost.so per se since you can just use standard kernel
> interfaces (like configfs/sysfs).  I could create an .so going forward
> for the new ioctl-based interface, I suppose.
>    

Refactor instead of rewriting.



>> For kvm/x86 pci definitely remains king.
>>      
> For full virtualization, sure.  I agree.  However, we are talking about
> PV here.  For PV, PCI is not a requirement and is a technical dead-end IMO.
>
> KVM seems to be the only virt solution that thinks otherwise (*), but I
> believe that is primarily a condition of its maturity.  I aim to help
> advance things here.
>
> (*) citation: xen has xenbus, lguest has lguest-bus, vmware has some
> vmi-esque thing (I forget what it's called) to name a few.  Love 'em or
> hate 'em, most other hypervisors do something along these lines.  I'd
> like to try to create one for KVM, but to unify them all (at least for
> the Linux-based host designs).
>    

VMware are throwing VMI away (it won't be supported in their new product, 
and they've sent a patch to rip it out of Linux); Xen has to tunnel 
xenbus in pci for full virtualization (which is where Windows is, and 
where Linux will be too once people realize it's faster).  lguest is 
meant as an example hypervisor, not an attempt to take over the world.

"PCI is a dead end" could not be more wrong, it's what guests support.  
And right now you can have a guest using pci to access a mix of 
userspace-emulated devices, userspace-emulated-but-kernel-accelerated 
virtio devices, and real host devices.  All on one dead-end bus.  Try 
that with vbus.


>>> I digress.  My point here isn't PCI.  The point here is the missing
>>> component for when PCI is not present.  The component that is partially
>>> satisfied by vbus's devid addressing scheme.  If you are going to use
>>> vhost, and you don't have PCI, you've gotta build something to replace
>>> it.
>>>
>>>        
>> Yes, that's why people have keyboards.  They'll write that glue code if
>> they need it.  If it turns out to be a hit and people start having virtio
>> transport module writing parties, they'll figure out a way to share code.
>>      
> Sigh...  The party has already started.  I tried to invite you months ago...
>    

I've been voting virtio since 2007.

>> On the guest side, virtio-shmem.ko can unify the ring access.  It
>> probably makes sense even today.  On the host side eventfd is the
>> kernel interface and libvhostconfig.so can provide the configuration
>> when an existing ABI is not imposed.
>>      
> That won't cut it.  For one, creating an eventfd is only part of the
> equation.  I.e. you need to have originate/terminate somewhere
> interesting (and in-kernel, otherwise use tuntap).
>    

vbus needs the same thing so it cancels out.

>> Look at the virtio-net feature negotiation.  There's a lot more there
>> than the MAC address, and it's going to grow.
>>      
> Agreed, but note that makes my point.  That feature negotiation almost
> invariably influences the device-model, not some config-space shim.
> IOW: terminating config-space at some userspace shim is pointless.  The
> model ultimately needs the result of whatever transpires during that
> negotiation anyway.
>    

Well, let's see.  Can vbus today:

- let userspace know which features are available (so it can decide if 
live migration is possible)
- let userspace limit which features are exposed to the guest (so it can 
make live migration possible among hosts of different capabilities)
- let userspace know which features were negotiated (so it can transfer 
them to the other host during live migration)
- let userspace tell the kernel which features were negotiated (when 
live migration completes, to avoid requiring the guest to re-negotiate)
- do all that from an unprivileged process
- securely wrt other unprivileged processes

?

What are your plans here?
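
To spell out what I mean by the list above, here is a rough sketch in C of the
feature-mask bookkeeping live migration needs. This is not vhost or vbus code;
the struct, the function names and the 64-bit mask width are assumptions made
for illustration only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-device feature state, for illustration only. */
struct feature_state {
    uint64_t host_supported;  /* what this host's backend can do */
    uint64_t user_allowed;    /* limit imposed by userspace, e.g. the
                                 common subset of a migration pool */
    uint64_t negotiated;      /* what the guest actually acked */
};

/* Features exposed to the guest: host capabilities limited by userspace. */
static uint64_t features_offered(const struct feature_state *f)
{
    assert(f != NULL);
    return f->host_supported & f->user_allowed;
}

/* Guest acknowledges a feature set. Returns 0 on success, -1 if the
 * guest acked a bit that was never offered. */
static int features_ack(struct feature_state *f, uint64_t guest_acked)
{
    if (guest_acked & ~features_offered(f))
        return -1;
    f->negotiated = guest_acked;
    return 0;
}

/* Destination side of a migration: install the set negotiated on the
 * source so the guest need not re-negotiate. Fails with -1 if this
 * host cannot honour every negotiated feature. */
static int features_restore(struct feature_state *f, uint64_t negotiated)
{
    if (negotiated & ~features_offered(f))
        return -1;
    f->negotiated = negotiated;
    return 0;
}
```

Userspace has to be able to read host_supported, write user_allowed, read
negotiated on the source and write it on the destination; those are the first
four bullets.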



-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] bonding: introduce primary_reselect option
From: Bill Fink @ 2009-09-25  7:47 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Jiri Pirko, netdev, davem, bonding-devel, nicolas.2p.debian
In-Reply-To: <21492.1253838848@death.nxdomain.ibm.com>

On Thu, 24 Sep 2009, Jay Vosburgh wrote:

> 
> From: Jiri Pirko <jpirko@redhat.com>
> 
> In some cases there is not desirable to switch back to primary interface when
> it's link recovers and rather stay with currently active one. We need to avoid
> packetloss as much as we can in some cases. This is solved by introducing
> primary_reselect option. Note that enslaved primary slave is set as current
> active no matter what.
> 
> Patch modified by Jay Vosburgh as follows: fixed bug in action
> after change of option setting via sysfs, revised the documentation
> update, and bumped the bonding version number.
> 
> Signed-off-by: Jiri Pirko <jpirko@redhat.com>
> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
> ---
> 
> 	Note that this patch depends on the "make ab_arp select active
> slaves as other modes" patch recently approved, but not yet appearing in
> net-next-2.6 as I write this.  http://patchwork.ozlabs.org/patch/32684/
> 
>  Documentation/networking/bonding.txt |   42 +++++++++++++++++++++-
>  drivers/net/bonding/bond_main.c      |   66 +++++++++++++++++++++++++++++++---
>  drivers/net/bonding/bond_main.c.rej  |   18 +++++++++
>  drivers/net/bonding/bond_sysfs.c     |   53 +++++++++++++++++++++++++++
>  drivers/net/bonding/bonding.h        |   11 +++++-
>  5 files changed, 182 insertions(+), 8 deletions(-)
>  create mode 100644 drivers/net/bonding/bond_main.c.rej

I doubt you intended to include a patch reject file in your patch.

						-Bill

^ permalink raw reply

* Re: [PATCHv5 3/3] vhost_net: a kernel-level virtio server
From: Avi Kivity @ 2009-09-25  7:43 UTC (permalink / raw)
  To: Ira W. Snyder
  Cc: Gregory Haskins, Michael S. Tsirkin, netdev, virtualization, kvm,
	linux-kernel, mingo, linux-mm, akpm, hpa, Rusty Russell, s.hetze,
	alacrityvm-devel
In-Reply-To: <20090924192754.GA14341@ovro.caltech.edu>

On 09/24/2009 10:27 PM, Ira W. Snyder wrote:
>>> Ira can make ira-bus, and ira-eventfd, etc, etc.
>>>
>>> Each iteration will invariably introduce duplicated parts of the stack.
>>>
>>>        
>> Invariably?  Use libraries (virtio-shmem.ko, libvhost.so).
>>
>>      
> Referencing libraries that don't yet exist doesn't seem like a good
> argument against vbus from my point of view. I'm not specifically
> advocating for vbus; I'm just letting you know how it looks to another
> developer in the trenches.
>    

My argument is that we shouldn't write a new framework instead of fixing 
or extending an existing one.

> If you'd like to see the amount of duplication present, look at the code
> I'm currently working on.

Yes, virtio-phys-guest looks pretty much duplicated.  Looks like it 
should be pretty easy to deduplicate.

>   It mostly works at this point, though I
> haven't finished my userspace, nor figured out how to actually transfer
> data.
>
> The current question I have (just to let you know where I am in
> development) is:
>
> I have the physical address of the remote data, but how do I get it into
> a userspace buffer, so I can pass it to tun?
>    

vhost translates guest physical addresses to host userspace addresses (in 
your scenario, remote physical to local virtual) using a table of memory 
slots; there's an ioctl that allows userspace to initialize that table.
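
As a rough sketch of the idea in C (not the actual vhost structures; the slot
layout and the function name are assumptions made for illustration), the
lookup is a linear scan over the slots:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory slot, for illustration only. */
struct mem_slot {
    uint64_t phys_start;   /* first remote/guest physical address covered */
    uint64_t size;         /* length of the region in bytes */
    uint64_t user_start;   /* local userspace address it is mapped at */
};

/* Translate [phys, phys + len) to a userspace address.
 * Returns 0 if no single slot covers the whole range; this assumes
 * 0 is never a valid userspace mapping address. */
static uint64_t slot_translate(const struct mem_slot *slots, size_t n,
                               uint64_t phys, uint64_t len)
{
    assert(slots != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        const struct mem_slot *s = &slots[i];

        /* Overflow-safe containment check. */
        if (phys >= s->phys_start && len <= s->size &&
            phys - s->phys_start <= s->size - len)
            return s->user_start + (phys - s->phys_start);
    }
    return 0;
}
```

Userspace fills in the table once; after that, every descriptor address read
from the ring goes through a lookup like this before the data is copied.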

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.


^ permalink raw reply

* [GIT]: Networking
From: David Miller @ 2009-09-25  7:29 UTC (permalink / raw)
  To: torvalds; +Cc: akpm, netdev, linux-kernel


1) Fix rt2x00 build failures, from Andrew Price.

2) 3c59x, fix double IRQ free.  From Anton Vorontsov.

3) Fix CONFIG_NET=n build on some platforms.  From akpm

4) staging cpc-usb CAN driver upgraded to a bona fide driver
   under drivers/net/can/usb

5) Fix smsc95xx zero-length-packet scenarios and add some USB
   product IDs.

6) Fix packet generator scheduler unfriendliness, from Stephen
   Hemminger.

7) Handle bogus optlen for IP_MULTICAST_IF socket option.
   From Shan Wei.

8) phonet stack bug fixes from Rémi Denis-Courmont

9) Make sure SKY2_HW_RAM_BUFFER gets initialized properly in
   sky2 driver, from Mike McCormack.

10) If neither IFF_TUN nor IFF_TAP is set we must return -EINVAL
    in tap driver, from Kusanagi Kouichi.

11) 8139cp specifies KERN_* log level twice in printks.  Fix from
    Alan Jenkins.

12) b43 wireless bug fixes from Michael Buesch, Larry Finger, and
    Albert Herranz.

13) netxen bug fixes from Dhananjay Phadke.

14) More generic netlink locking fixes from Johannes Berg.

15) AX25 was busted by previous socket buffer allocation changes,
    fix from Eric Dumazet.

16) hostap netdev_ops conversion was buggy, fix from Martin Decky.

17) cpmac stopped building when BUS_ID_SIZE was removed, fix from
    Florian Fainelli.

18) kaweth memory leak fix from Kevin Cernekee.

19) New at91 CAN driver from Marc Kleine-Budde.

Please pull, thanks a lot!

The following changes since commit 94e0fb086fc5663c38bbc0fe86d698be8314f82f:
  Linus Torvalds (1):
        Merge branch 'drm-intel-next' of git://git.kernel.org/.../anholt/drm-intel

are available in the git repository at:

  master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6.git master

Alan Jenkins (1):
      8139cp: fix duplicate loglevel in module load message

Albert Herranz (2):
      b43: Add Soft-MAC SDIO device support
      b43: fix build error if !CONFIG_B43_LEDS

Alexander Duyck (1):
      igb: resolve namespacecheck warning for igb_hash_mc_addr

Andrew Morton (1):
      net: fix CONFIG_NET=n build on sparc64

Andrew Price (1):
      rt2x00: fix the definition of rt2x00crypto_rx_insert_iv

Anton Vorontsov (1):
      3c59x: Get rid of "Trying to free already-free IRQ"

Christian Lamparter (2):
      p54usb: add Zcomax XG-705A usbid
      ar9170usb: add usbid for TP-Link TL-WN821N v2

Daniel C Halperin (1):
      iwlwifi: fix HT operation in 2.4 GHz band

David S. Miller (3):
      Merge branch 'master' of git://git.infradead.org/users/dwmw2/solos-2.6
      Merge branch 'master' of git://git.kernel.org/.../linville/wireless-next-2.6
      Merge branch 'master' of /home/davem/src/GIT/linux-2.6/

David Woodhouse (1):
      solos: Add some margin-related parameters

Dhananjay Phadke (2):
      netxen: fix minor tx timeout bug
      netxen: fix firmware init after resume

Don Skidmore (1):
      ixgbe: fix sfp_timer clean up in ixgbe_down

Eric Dumazet (2):
      ax25: Fix SIOCAX25GETINFO ioctl
      tunnel: eliminate recursion field

Fabio Estevam (1):
      fec: Add FEC support for MX25 processor

Florian Fainelli (1):
      cpmac: fix compilation errors against undeclared BUS_ID_SIZE

Holger Schurig (2):
      cfg80211: use cfg80211_wext_freq() for freq conversion
      cfg80211: minimal error handling for wext-compat freq scanning

Jaswinder Singh Rajput (2):
      pcmcia: pcnet_cs.c removing useless condition
      net: fix htmldocs sunrpc, clnt.c

Joe Perches (1):
      lib/vsprintf.c: Avoid possible unaligned accesses in %pI6c

Johannes Berg (5):
      iwlwifi: disable powersave for 4965
      cfg80211: fix SME connect
      mac80211: fix DTIM setting
      cfg80211: don't overwrite privacy setting
      genetlink: fix netns vs. netlink table locking (2)

Julia Lawall (2):
      drivers/net: remove duplicate structure field initialization
      drivers/net/wireless: Use usb_endpoint_dir_out

Kevin Cernekee (1):
      kaweth: Fix memory leak in kaweth_control()

Kusanagi Kouichi (1):
      tun: Return -EINVAL if neither IFF_TUN nor IFF_TAP is set.

Larry Finger (2):
      ssb: Fix error when V1 SPROM extraction is forced
      b43: Implement RFKILL status for LP PHY

Luis R. Rodriguez (1):
      wireless: default CONFIG_WLAN to y

Marc Kleine-Budde (3):
      at91sam9263: add at91_can device to generic device definition
      at91sam9263ek: activate at91 CAN controller
      at91_can: add driver for Atmel's CAN controller on AT91SAM9263

Martin Decky (1):
      hostap: Revert a toxic part of the conversion to net_device_ops

Michael Buesch (11):
      b43: Force-wake queues on init
      ssb: Disable verbose SDIO coreswitch
      b43: Fix resume failure
      b43: Rewrite suspend/resume code
      b43: Do not use _irqsafe callbacks
      b43: Fix SDIO interrupt handler deadlock
      b43: Fix IRQ sync for SDIO
      b43: Add optional verbose runtime statistics
      b43: Disable PMQ mechanism
      b43: Don't abuse wl->current_dev in the led work
      b43: Remove BROKEN attribute from SDIO

Michael Chan (1):
      cnic: Shutdown iSCSI ring during uio_close.

Michal Simek (1):
      net: xilinx_emaclite: Fix problem with first incoming packet

Mike McCormack (1):
      sky2: Set SKY2_HW_RAM_BUFFER in sky2_init

Nathan Williams (2):
      solos: support new FPGA RAM layout
      solos: Check for rogue received packets

Nelson, Shannon (2):
      ixgbe: Allow tx itr specific settings
      ixgbe: move rx queue RSC configuration to a separate function

Pavel Roskin (1):
      rc80211_minstrel: fix contention window calculation

Randy Dunlap (2):
      ssb/sdio: fix printk format warnings
      wl12xx: fix kconfig/link errors

Reinette Chatre (3):
      iwlwifi: fix potential rx buffer loss
      iwlwifi: do not send sync command while holding spinlock
      iwlwifi: reduce noise when skb allocation fails

Rémi Denis-Courmont (2):
      Phonet: fix race for port number in concurrent bind()
      Phonet: error on broadcast sending (unimplemented)

Sebastian Haas (3):
      cpc-usb: Removed driver from staging tree
      ems_usb: Added support for EMS CPC-USB/ARM7 CAN/USB interface
      ems_pci: fix size of CAN controllers BAR mapping for CPC-PCI v2

Senthil Balasubramanian (2):
      ath9k: Adjust the chainmasks properly
      ath9k: Fix bug in chain handling

Shan Wei (1):
      ipv4: check optlen for IP_MULTICAST_IF option

Simon Farnsworth (1):
      solos: Show Interleaving details for ADSL2 and 2+

Stanislaw Gruszka (1):
      iwlagn: fix panic in iwl{5000,4965}_rx_reply_tx

Stephen Hemminger (2):
      pktgen: T_TERMINATE flag is unused
      pktgen: better scheduler friendliness

Steve Glendinning (2):
      smsc95xx: add additional USB product IDs
      smsc95xx: fix transmission where ZLP is expected

Sujith (5):
      ath9k: Fix bug in ANI channel handling
      ath9k: Restore TSF after RESET
      ath9k: Fix chip wakeup issue
      ath9k: Fix regression in PA calibration
      ath9k: Fix RFKILL bugs

Thomas Ilnseher (1):
      b43: Add LP PHY Analog Switch Support

Vasanthakumar Thiagarajan (3):
      ath9k: Fix rx data corruption
      ath9k: Don't read NF when chip has gone through full sleep mode
      ath9k: Do a full reset for AR9280

Vivek Natarajan (5):
      ath9k: Set default noise floor value for AR9287
      ath9k: Revamp PCIE workarounds
      ath9k: Fix AHB reset for AR9280
      ath9k: Disable autosleep feature by default.
      ath9k: Initialize txgain and rxgain for newer AR9287 chipsets.

Wey-Yi Guy (1):
      iwlwifi: find the correct first antenna

jie.yang@atheros.com (1):
      atl1c:remove compiling warning

roel kluin (1):
      atm: dereference of he_dev->rbps_virt in he_init_group()

 arch/arm/mach-at91/at91sam9263_devices.c    |   36 +
 arch/arm/mach-at91/board-sam9263ek.c        |   19 +
 arch/arm/mach-at91/include/mach/board.h     |    6 +
 drivers/atm/he.c                            |   59 ++-
 drivers/atm/solos-attrlist.c                |   11 +
 drivers/atm/solos-pci.c                     |   75 ++-
 drivers/net/3c59x.c                         |   12 +-
 drivers/net/8139cp.c                        |    2 +-
 drivers/net/Kconfig                         |    2 +-
 drivers/net/atl1c/atl1c_main.c              |    2 +-
 drivers/net/can/Kconfig                     |   13 +
 drivers/net/can/Makefile                    |    3 +
 drivers/net/can/sja1000/ems_pci.c           |   16 +-
 drivers/net/can/usb/Makefile                |    5 +
 drivers/net/can/usb/ems_usb.c               | 1155 ++++++++++++++++++++++++++
 drivers/net/cnic.c                          |    4 +-
 drivers/net/cpmac.c                         |    8 +-
 drivers/net/ehea/ehea_main.c                |    1 -
 drivers/net/igb/e1000_mac.c                 |   72 +-
 drivers/net/igb/e1000_mac.h                 |    1 -
 drivers/net/ixgbe/ixgbe.h                   |    6 +-
 drivers/net/ixgbe/ixgbe_ethtool.c           |   75 ++-
 drivers/net/ixgbe/ixgbe_main.c              |  111 ++-
 drivers/net/netxen/netxen_nic_main.c        |    8 +-
 drivers/net/pcmcia/pcnet_cs.c               |   11 +-
 drivers/net/sky2.c                          |    4 +-
 drivers/net/sunvnet.c                       |    1 -
 drivers/net/tun.c                           |    4 +-
 drivers/net/usb/kaweth.c                    |   18 +-
 drivers/net/usb/smsc95xx.c                  |   67 ++-
 drivers/net/usb/usbnet.c                    |    2 +-
 drivers/net/wireless/ath/ar9170/usb.c       |    2 +
 drivers/net/wireless/ath/ath9k/calib.c      |   23 +-
 drivers/net/wireless/ath/ath9k/calib.h      |    1 +
 drivers/net/wireless/ath/ath9k/eeprom_def.c |    4 +-
 drivers/net/wireless/ath/ath9k/hw.c         |  202 +++--
 drivers/net/wireless/ath/ath9k/hw.h         |    4 +-
 drivers/net/wireless/ath/ath9k/main.c       |   16 +-
 drivers/net/wireless/ath/ath9k/reg.h        |    3 +-
 drivers/net/wireless/b43/Kconfig            |   21 +-
 drivers/net/wireless/b43/Makefile           |    1 +
 drivers/net/wireless/b43/b43.h              |   23 +-
 drivers/net/wireless/b43/debugfs.c          |    1 +
 drivers/net/wireless/b43/debugfs.h          |    1 +
 drivers/net/wireless/b43/dma.c              |    4 +-
 drivers/net/wireless/b43/leds.c             |  266 +++++--
 drivers/net/wireless/b43/leds.h             |   33 +-
 drivers/net/wireless/b43/main.c             |  224 +++---
 drivers/net/wireless/b43/phy_lp.c           |   12 +-
 drivers/net/wireless/b43/pio.c              |    2 +-
 drivers/net/wireless/b43/rfkill.c           |    2 +-
 drivers/net/wireless/b43/sdio.c             |  202 +++++
 drivers/net/wireless/b43/sdio.h             |   45 +
 drivers/net/wireless/b43/xmit.c             |    5 +-
 drivers/net/wireless/iwlwifi/iwl-4965.c     |    6 +
 drivers/net/wireless/iwlwifi/iwl-5000.c     |    6 +
 drivers/net/wireless/iwlwifi/iwl-rx.c       |   10 +-
 drivers/net/wireless/iwlwifi/iwl-sta.c      |    2 +-
 drivers/net/wireless/iwlwifi/iwl3945-base.c |    9 +-
 drivers/net/wireless/rt2x00/rt2x00lib.h     |    2 +-
 drivers/net/wireless/wl12xx/Kconfig         |    2 +-
 drivers/net/wireless/zd1211rw/zd_usb.c      |    2 +-
 drivers/net/xilinx_emaclite.c               |    7 +-
 drivers/staging/Kconfig                     |    2 -
 drivers/staging/Makefile                    |    1 -
 drivers/staging/cpc-usb/Kconfig             |    4 -
 drivers/staging/cpc-usb/Makefile            |    3 -
 drivers/staging/cpc-usb/TODO                |   10 -
 drivers/staging/cpc-usb/cpc-usb_drv.c       | 1184 ---------------------------
 drivers/staging/cpc-usb/cpc.h               |  417 ----------
 drivers/staging/cpc-usb/cpc_int.h           |   83 --
 drivers/staging/cpc-usb/cpcusb.h            |   86 --
 drivers/staging/cpc-usb/sja2m16c.h          |   41 -
 drivers/staging/cpc-usb/sja2m16c_2.c        |  452 ----------
 include/linux/netlink.h                     |    1 +
 include/linux/phonet.h                      |    1 +
 include/linux/usb/usbnet.h                  |    1 +
 include/net/ipip.h                          |    1 -
 kernel/sys_ni.c                             |    1 +
 lib/vsprintf.c                              |   25 +-
 net/ax25/af_ax25.c                          |    4 +-
 net/core/pktgen.c                           |  160 ++--
 net/ipv4/ip_gre.c                           |   13 +-
 net/ipv4/ip_sockglue.c                      |    3 +
 net/ipv4/ipip.c                             |    8 -
 net/ipv6/ip6_tunnel.c                       |    7 -
 net/ipv6/sit.c                              |    8 -
 net/mac80211/scan.c                         |    4 +-
 net/netlink/af_netlink.c                    |   19 +-
 net/netlink/genetlink.c                     |    4 +-
 net/phonet/af_phonet.c                      |    6 +
 net/phonet/socket.c                         |   16 +-
 net/sunrpc/clnt.c                           |    5 +-
 net/wireless/wext-sme.c                     |    2 +-
 94 files changed, 2613 insertions(+), 2911 deletions(-)
 create mode 100644 drivers/net/can/usb/Makefile
 create mode 100644 drivers/net/can/usb/ems_usb.c
 create mode 100644 drivers/net/wireless/b43/sdio.c
 create mode 100644 drivers/net/wireless/b43/sdio.h
 delete mode 100644 drivers/staging/cpc-usb/Kconfig
 delete mode 100644 drivers/staging/cpc-usb/Makefile
 delete mode 100644 drivers/staging/cpc-usb/TODO
 delete mode 100644 drivers/staging/cpc-usb/cpc-usb_drv.c
 delete mode 100644 drivers/staging/cpc-usb/cpc.h
 delete mode 100644 drivers/staging/cpc-usb/cpc_int.h
 delete mode 100644 drivers/staging/cpc-usb/cpcusb.h
 delete mode 100644 drivers/staging/cpc-usb/sja2m16c.h
 delete mode 100644 drivers/staging/cpc-usb/sja2m16c_2.c

^ permalink raw reply

* Re: TCP stack bug related to F-RTO?
From: Joe Cao @ 2009-09-25  6:42 UTC (permalink / raw)
  To: zhigang gong; +Cc: linux-kernel, jcaoco2002, netdev
In-Reply-To: <40c9f5b20909241932k5e1f1d74kf8065e2e06aa4d09@mail.gmail.com>

Hi,

On the wrong tcp checksum, that's because of hardware checksum offload.

As for the seq/ack numbers, because the trace is long, I deliberately removed the irrelevant packets between the three-way handshake and the point where the problem happens.  That can be seen from the timestamps.

Please also note that I intentionally replaced the IP addresses and mac addresses in the trace to hide proprietary information in the trace.

Anyway, the problem is not related to the checksum or the seq/ack numbers; otherwise, you wouldn't see the behavior shown in the trace.

Thanks,
Joe

--- On Thu, 9/24/09, zhigang gong <zhigang.gong@gmail.com> wrote:

> From: zhigang gong <zhigang.gong@gmail.com>
> Subject: Re: TCP stack bug related to F-RTO?
> To: "Joe Cao" <caoco2002@yahoo.com>
> Cc: linux-kernel@vger.kernel.org, jcaoco2002@yahoo.com, netdev@vger.kernel.org
> Date: Thursday, September 24, 2009, 7:32 PM
> On Fri, Sep 25, 2009 at 1:43 AM, Joe
> Cao <caoco2002@yahoo.com>
> wrote:
> > Hello,
> >
> > I have found the following behavior with different
> versions of linux kernel. The attached pcap trace is
> collected with server (192.168.0.13) running 2.6.24 and
> shows the problem. Basically the behavior is like this:
> >
> > 1. The client opens up a big window,
> > 2. the server sends 19 packets in a row (pkt #14- #32
> in the trace), but all of them are dropped due to some
> congestion.
> > 3. The server hits RTO and retransmits pkt #14 in #33
> > 4. The client immediately acks #33 (=#14), and the
> server (seems like to enter F-RTO) expends the window and
> sends *NEW* pkt #35 & #36.  Timeoute is doubled to
> 2*RTO; The client immediately sends two Dup-ack to #35 and
> #36.
> > 5. after 2*RTO, pkt #15 is retransmitted in #39.
> > 6. The client immediately acks #39 (=#15) in #40, and
> the server continues to expand the window and sends two
> *NEW* pkt #41 & #42. Now the timeoute is doubled to 4
> *RTO.
> > 8. After 4*RTO timeout, #16 is retransmitted.
> > 9....
> > 10. The above steps repeats for retransmitting pkt
> #16-#32 and each time the timeout is doubled.
> > 11. It takes a long long time to retransmit all the
> lost packets and before that is done, the client sends a RST
> because of timeout.
> >
> > The above behavior looks like F-RTO is in effect.
>  And there seems to be a bug in the TCP's congestion
> control and
> > retransmission algorithm. Why doesn't the TCP on
> server (running 2.6.24) enter the slow start?
> As I know, the early implementation hasn't enter slow start
> if the
> remote end is in the same network.  I'm not sure that
> of the version
> 2.6.24. But after I have a look at your trace, I think this
> is not the
> point of your problem. The behaviour of your client
> 192.168.0.82 is
> very strange. The client always send a packet with error
> TCP checksum
> and the 4# to 13# packets sent by the
> client   totally don't conform
> to  the TCP protocol, not only with wrong TCP checksum
> but also with
> incorrect seq and ack number.
> 
> My suggestion is that before you start to investigate the
> server
> side's behaviour, you need to correct your client side's
> TCP/IP stack
> implementation first.
> 
> >Why should the server take that long to recover from a
> short period of packet loss?
> 
> >
> > Has anyone else noticed similar problem before?  If
> my analysis was wrong, can anyone gives me some pointers to
> what's really wrong and how to fix it?
> >
> > Thanks a lot,
> > Joe
> >
> > PS. Please cc me when this message is replied.
> >
> >
> >
> 


      

^ permalink raw reply

* [PATCH] TI Davinci EMAC: Fix in vector definition for EMAC_VERSION_2
From: Sriram @ 2009-09-25  5:15 UTC (permalink / raw)
  To: netdev; +Cc: davinci-linux-open-source, Sriram

In the emac_poll function, when checking the interrupt status, the
correct mask definition must be chosen based on the EMAC version (the
bit mask has changed from version 1 to version 2).

Signed-off-by: Sriram <srk@ti.com>
Acked-by: Chaithrika U S <chaithrika@ti.com>
---
 drivers/net/davinci_emac.c |    9 ++++++++-
 1 files changed, 8 insertions(+), 1 deletions(-)

diff --git a/drivers/net/davinci_emac.c b/drivers/net/davinci_emac.c
index 12fd446..376f527 100644
--- a/drivers/net/davinci_emac.c
+++ b/drivers/net/davinci_emac.c
@@ -200,6 +200,9 @@ static const char emac_version_string[] = "TI DaVinci EMAC Linux v6.1";
 /** NOTE:: For DM646x the IN_VECTOR has changed */
 #define EMAC_DM646X_MAC_IN_VECTOR_RX_INT_VEC	BIT(EMAC_DEF_RX_CH)
 #define EMAC_DM646X_MAC_IN_VECTOR_TX_INT_VEC	BIT(16 + EMAC_DEF_TX_CH)
+#define EMAC_DM646X_MAC_IN_VECTOR_HOST_INT	BIT(26)
+#define EMAC_DM646X_MAC_IN_VECTOR_STATPEND_INT	BIT(27)
+
 
 /* CPPI bit positions */
 #define EMAC_CPPI_SOP_BIT		BIT(31)
@@ -2168,7 +2171,11 @@ static int emac_poll(struct napi_struct *napi, int budget)
 		emac_int_enable(priv);
 	}
 
-	if (unlikely(status & EMAC_DM644X_MAC_IN_VECTOR_HOST_INT)) {
+	mask = EMAC_DM644X_MAC_IN_VECTOR_HOST_INT;
+	if (priv->version == EMAC_VERSION_2)
+		mask = EMAC_DM646X_MAC_IN_VECTOR_HOST_INT;
+
+	if (unlikely(status & mask)) {
 		u32 ch, cause;
 		dev_err(emac_dev, "DaVinci EMAC: Fatal Hardware Error\n");
 		netif_stop_queue(ndev);
-- 
1.6.2.4
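
The version-dependent mask selection described in the patch above can be modelled in user space roughly as follows. This is only a sketch: the helper names emac_host_int_mask() and emac_host_error_pending() are hypothetical, the enum values for the two EMAC versions are assumed, and the DM644x bit position (17) is an assumption for illustration. Only the DM646x value BIT(26) is taken from the patch itself.

```c
#include <stdint.h>

#define BIT(n) (UINT32_C(1) << (n))

/* Assumed version tags; the real driver defines its own values. */
enum emac_version { EMAC_VERSION_1 = 1, EMAC_VERSION_2 = 2 };

/* DM644x host-error bit: assumed here for illustration only. */
#define EMAC_DM644X_MAC_IN_VECTOR_HOST_INT	BIT(17)
/* DM646x host-error bit, as added by the patch. */
#define EMAC_DM646X_MAC_IN_VECTOR_HOST_INT	BIT(26)

/* Pick the host-error mask that matches the EMAC version. */
static uint32_t emac_host_int_mask(enum emac_version version)
{
	if (version == EMAC_VERSION_2)
		return EMAC_DM646X_MAC_IN_VECTOR_HOST_INT;
	return EMAC_DM644X_MAC_IN_VECTOR_HOST_INT;
}

/* Return 1 if the interrupt status reports a host error for this version. */
static int emac_host_error_pending(uint32_t status, enum emac_version version)
{
	return (status & emac_host_int_mask(version)) != 0;
}
```

With these assumed values, a status word that has only bit 26 set is reported as a host error on version 2 but not on version 1, which is the kind of mismatch the patch is meant to avoid.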


^ permalink raw reply related

* Re: PATCH 0/1: rt2x00dev.c / rt2x00lib.h fixes build breakage
From: John W. Linville @ 2009-09-25  4:53 UTC (permalink / raw)
  To: Ken Lewis; +Cc: linux-next, LKML, netdev
In-Reply-To: <5a44caba0909231358y23f21c0drb2a3451084028a6f@mail.gmail.com>

On Wed, Sep 23, 2009 at 09:58:35PM +0100, Ken Lewis wrote:
> The headers in drivers/net/wireless/rt2x00/rt2x00lib.h don't match the
> use of the function in rt2x00dev.c  The build fails as a result.
> 
> This has been a problem in linux-next since early September.  I've
> e-mailed a patch to linux-next and to linux-net, but the 2.6.32 merge
> window has brought the problem to the mainline and so I'm re-sending
> my patch.  I've opened a bug on bugzilla:
> http://bugzilla.kernel.org/show_bug.cgi?id=14217

Always make sure to send wireless LAN patches to
linux-wireless@vger.kernel.org.  Anyway, the following patch is in
the pull request I sent to Dave yesterday (and which I believe he
has already pulled):

commit fe2475633676b0a976400dfc53f8d7006f56543e
Author: Andrew Price <andy@andrewprice.me.uk>
Date:   Thu Sep 17 21:15:48 2009 +0100

    rt2x00: fix the definition of rt2x00crypto_rx_insert_iv
    
    Remove the redundant l2pad parameter from the definition of
    rt2x00crypto_rx_insert_iv which is used when only CONFIG_RT2500PCI but
    none of the other rt2x00 family drivers is configured.
    
    Signed-off-by: Andrew Price <andy@andrewprice.me.uk>
    Signed-off-by: John W. Linville <linville@tuxdriver.com>

Hth!

John
-- 
John W. Linville		Someday the world will need a hero, and you
linville@tuxdriver.com			might be all we have.  Be ready.

^ permalink raw reply

* Re: [linux-pm] [PATCH] 3c59x: Get rid of "Trying to free already-free IRQ"
From: Alan Stern @ 2009-09-25  4:43 UTC (permalink / raw)
  To: Anton Vorontsov; +Cc: David Miller, netdev, linux-pm
In-Reply-To: <20090924183152.GA30254@oksana.dev.rtsoft.ru>

On Thu, 24 Sep 2009, Anton Vorontsov wrote:

> Though, there are few other issues with suspend/resume in this driver.
> The intention of calling free_irq() in suspend() was to avoid any
> possible spurious interrupts (see commit 5b039e681b8c5f30aac9cc04385
> "3c59x PM fixes"). But,
> 
> - On resume, the driver was requesting IRQ just after pci_set_master(),
>   but before vortex_up() (which actually resets 3c59x chips).

Shouldn't it be possible to reset the chip (or at least prevent it from 
generating spurious IRQs) during the early-resume phase?

> - Issuing free_irq() on a shared IRQ doesn't guarantee that a buggy
>   HW won't trigger spurious interrupts in another driver that
>   requested the same interrupt. So, if we want to protect from
>   unexpected interrupts, then on suspend we should issue disable_irq(),
>   not free_irq().

What if some other device shares the IRQ and still relies on receiving
interrupts when this code runs?  Won't disable_irq() mess up the other
device?

Alan Stern


^ permalink raw reply

* Getting IP Address for netdev
From: Pawel Pastuszak @ 2009-09-25  2:53 UTC (permalink / raw)
  To: netdev

Hello everyone,

I am looking for some help, I am writing an netdev driver and i want
to known how to i get the current ip  that is set to the device?


Thanks,
Pawel

^ permalink raw reply

* Re: TCP stack bug related to F-RTO?
From: zhigang gong @ 2009-09-25  2:32 UTC (permalink / raw)
  To: Joe Cao; +Cc: linux-kernel, jcaoco2002, netdev
In-Reply-To: <427999.33681.qm@web63406.mail.re1.yahoo.com>

On Fri, Sep 25, 2009 at 1:43 AM, Joe Cao <caoco2002@yahoo.com> wrote:
> Hello,
>
> I have found the following behavior with different versions of linux kernel. The attached pcap trace is collected with server (192.168.0.13) running 2.6.24 and shows the problem. Basically the behavior is like this:
>
> 1. The client opens up a big window,
> 2. the server sends 19 packets in a row (pkt #14- #32 in the trace), but all of them are dropped due to some congestion.
> 3. The server hits RTO and retransmits pkt #14 in #33
> 4. The client immediately acks #33 (=#14), and the server (seems like to enter F-RTO) expands the window and sends *NEW* pkt #35 & #36. Timeout is doubled to 2*RTO; The client immediately sends two Dup-ack to #35 and #36.
> 5. after 2*RTO, pkt #15 is retransmitted in #39.
> 6. The client immediately acks #39 (=#15) in #40, and the server continues to expand the window and sends two *NEW* pkt #41 & #42. Now the timeout is doubled to 4*RTO.
> 8. After 4*RTO timeout, #16 is retransmitted.
> 9....
> 10. The above steps repeat for retransmitting pkt #16-#32 and each time the timeout is doubled.
> 11. It takes a long long time to retransmit all the lost packets and before that is done, the client sends a RST because of timeout.
>
> The above behavior looks like F-RTO is in effect.  And there seems to be a bug in the TCP's congestion control and
> retransmission algorithm. Why doesn't the TCP on server (running 2.6.24) enter the slow start?
As far as I know, early implementations did not enter slow start if the
remote end was on the same network.  I'm not sure whether that applies
to 2.6.24. But after looking at your trace, I don't think this is the
root of your problem. The behaviour of your client 192.168.0.82 is
very strange. The client always sends packets with a bad TCP checksum,
and packets #4 to #13 sent by the client don't conform to the TCP
protocol at all: not only is the TCP checksum wrong, the seq and ack
numbers are incorrect as well.

My suggestion is that before you start to investigate the server
side's behaviour, you need to correct your client side's TCP/IP stack
implementation first.

>Why should the server take that long to recover from a short period of packet loss?

>
> Has anyone else noticed a similar problem before?  If my analysis was wrong, can anyone give me some pointers to what's really wrong and how to fix it?
>
> Thanks a lot,
> Joe
>
> PS. Please cc me when this message is replied.
>
>
>
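
To put a number on the recovery time described in the quoted steps, here is a rough user-space sketch. It assumes the pattern reported in the trace: one lost packet is recovered per timeout and the timeout doubles each time, up to a cap. The function name backoff_recovery_ms() is hypothetical, and the 200 ms base and 120 s cap in the note below are assumed values chosen to resemble common Linux defaults. The real RTO depends on the measured RTT.

```c
#include <stdint.h>

/*
 * Total time spent waiting when `lost` packets are recovered one per
 * timeout and the timeout doubles after every retransmission, capped
 * at max_rto_ms. All parameters are illustrative assumptions.
 */
static uint64_t backoff_recovery_ms(uint64_t base_rto_ms, unsigned int lost,
				    uint64_t max_rto_ms)
{
	uint64_t total = 0;
	uint64_t rto = base_rto_ms;
	unsigned int i;

	if (rto > max_rto_ms)
		rto = max_rto_ms;

	for (i = 0; i < lost; i++) {
		total += rto;		/* wait for this timeout to fire */
		if (rto < max_rto_ms) {
			rto *= 2;	/* exponential backoff */
			if (rto > max_rto_ms)
				rto = max_rto_ms;
		}
	}
	return total;
}
```

Under these assumptions, backoff_recovery_ms(200, 19, 120000) evaluates to 1284600 ms, i.e. more than 21 minutes to recover the 19 lost packets, which would be consistent with the client giving up and sending a RST first.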

^ permalink raw reply

* [PATCH net-next-2.6] cxgb3: Added private MAC address and provisioning packet handler for iSCSI
From: kxie @ 2009-09-25  1:33 UTC (permalink / raw)
  To: davem
  Cc: swise, divy, rranjan, kxie, James.Bottomley, michaelc,
	linux-kernel, netdev

00c487ed661c0904757a21b7c958eba59e68482a
[PATCH net-next-2.6] cxgb3: Added private MAC address and provisioning packet handler for iSCSI

This patch added support of private MAC address per port and provisioning
packet handler for iSCSI traffic only.

Acked-by: Karen Xie <kxie@chelsio.com>
Acked-by: Divy Le Ray <divy@chelsio.com>
Signed-off-by: Rakesh Ranjan <rranjan@chelsio.com>
---
 drivers/net/cxgb3/adapter.h       |   18 +++++++++++++++++-
 drivers/net/cxgb3/cxgb3_main.c    |   22 ++++++++++++++++++----
 drivers/net/cxgb3/cxgb3_offload.c |    2 +-
 drivers/net/cxgb3/sge.c           |   30 ++++++++++++++++++++----------
 4 files changed, 56 insertions(+), 16 deletions(-)

diff --git a/drivers/net/cxgb3/adapter.h b/drivers/net/cxgb3/adapter.h
index 2b1aea6..3f3083a 100644
--- a/drivers/net/cxgb3/adapter.h
+++ b/drivers/net/cxgb3/adapter.h
@@ -48,12 +48,28 @@
 struct vlan_group;
 struct adapter;
 struct sge_qset;
+struct port_info;
 
 enum {			/* rx_offload flags */
 	T3_RX_CSUM	= 1 << 0,
 	T3_LRO		= 1 << 1,
 };
 
+enum {
+	LAN_MAC_IDX	= 0,
+	SAN_MAC_IDX,
+
+	MAX_MAC_IDX
+};
+
+struct iscsi_config {
+	__be32	ipv4_addr;
+	__u8	mac_addr[ETH_ALEN];
+	__u32	flags;
+	int (*send)(struct port_info *pi, struct sk_buff **skb);
+	int (*recv)(struct port_info *pi, struct sk_buff *skb);
+};
+
 struct port_info {
 	struct adapter *adapter;
 	struct vlan_group *vlan_grp;
@@ -67,7 +83,7 @@ struct port_info {
 	struct link_config link_config;
 	struct net_device_stats netstats;
 	int activity;
-	__be32 iscsi_ipv4addr;
+	struct iscsi_config iscsic;
 
 	int link_fault; /* link fault was detected */
 };
diff --git a/drivers/net/cxgb3/cxgb3_main.c b/drivers/net/cxgb3/cxgb3_main.c
index 34e776c..c9113d3 100644
--- a/drivers/net/cxgb3/cxgb3_main.c
+++ b/drivers/net/cxgb3/cxgb3_main.c
@@ -344,8 +344,10 @@ static void link_start(struct net_device *dev)
 
 	init_rx_mode(&rm, dev, dev->mc_list);
 	t3_mac_reset(mac);
+	t3_mac_set_num_ucast(mac, MAX_MAC_IDX);
 	t3_mac_set_mtu(mac, dev->mtu);
-	t3_mac_set_address(mac, 0, dev->dev_addr);
+	t3_mac_set_address(mac, LAN_MAC_IDX, dev->dev_addr);
+	t3_mac_set_address(mac, SAN_MAC_IDX, pi->iscsic.mac_addr);
 	t3_mac_set_rx_mode(mac, &rm);
 	t3_link_start(&pi->phy, mac, &pi->link_config);
 	t3_mac_enable(mac, MAC_DIRECTION_RX | MAC_DIRECTION_TX);
@@ -903,6 +905,7 @@ static inline int offload_tx(struct t3cdev *tdev, struct sk_buff *skb)
 static int write_smt_entry(struct adapter *adapter, int idx)
 {
 	struct cpl_smt_write_req *req;
+	struct port_info *pi = netdev_priv(adapter->port[idx]);
 	struct sk_buff *skb = alloc_skb(sizeof(*req), GFP_KERNEL);
 
 	if (!skb)
@@ -913,8 +916,8 @@ static int write_smt_entry(struct adapter *adapter, int idx)
 	OPCODE_TID(req) = htonl(MK_OPCODE_TID(CPL_SMT_WRITE_REQ, idx));
 	req->mtu_idx = NMTUS - 1;	/* should be 0 but there's a T3 bug */
 	req->iff = idx;
-	memset(req->src_mac1, 0, sizeof(req->src_mac1));
 	memcpy(req->src_mac0, adapter->port[idx]->dev_addr, ETH_ALEN);
+	memcpy(req->src_mac1, pi->iscsic.mac_addr, ETH_ALEN);
 	skb->priority = 1;
 	offload_tx(&adapter->tdev, skb);
 	return 0;
@@ -2516,7 +2519,7 @@ static int cxgb_set_mac_addr(struct net_device *dev, void *p)
 		return -EINVAL;
 
 	memcpy(dev->dev_addr, addr->sa_data, dev->addr_len);
-	t3_mac_set_address(&pi->mac, 0, dev->dev_addr);
+	t3_mac_set_address(&pi->mac, LAN_MAC_IDX, dev->dev_addr);
 	if (offload_running(adapter))
 		write_smt_entry(adapter, pi->port_id);
 	return 0;
@@ -2654,7 +2657,7 @@ static void check_t3b2_mac(struct adapter *adapter)
 			struct cmac *mac = &p->mac;
 
 			t3_mac_set_mtu(mac, dev->mtu);
-			t3_mac_set_address(mac, 0, dev->dev_addr);
+			t3_mac_set_address(mac, LAN_MAC_IDX, dev->dev_addr);
 			cxgb_set_rxmode(dev);
 			t3_link_start(&p->phy, mac, &p->link_config);
 			t3_mac_enable(mac, MAC_DIRECTION_RX | MAC_DIRECTION_TX);
@@ -3112,6 +3115,14 @@ static const struct net_device_ops cxgb_netdev_ops = {
 #endif
 };
 
+static void __devinit cxgb3_init_iscsi_mac(struct net_device *dev)
+{
+	struct port_info *pi = netdev_priv(dev);
+
+	memcpy(pi->iscsic.mac_addr, dev->dev_addr, ETH_ALEN);
+	pi->iscsic.mac_addr[3] |= 0x80;
+}
+
 static int __devinit init_one(struct pci_dev *pdev,
 			      const struct pci_device_id *ent)
 {
@@ -3270,6 +3281,9 @@ static int __devinit init_one(struct pci_dev *pdev,
 		goto out_free_dev;
 	}
 
+	for_each_port(adapter, i)
+		cxgb3_init_iscsi_mac(adapter->port[i]);
+
 	/* Driver's ready. Reflect it on LEDs */
 	t3_led_ready(adapter);
 
diff --git a/drivers/net/cxgb3/cxgb3_offload.c b/drivers/net/cxgb3/cxgb3_offload.c
index 75064ee..7f314c3 100644
--- a/drivers/net/cxgb3/cxgb3_offload.c
+++ b/drivers/net/cxgb3/cxgb3_offload.c
@@ -447,7 +447,7 @@ static int cxgb_offload_ctl(struct t3cdev *tdev, unsigned int req, void *data)
 	case GET_ISCSI_IPV4ADDR: {
 		struct iscsi_ipv4addr *p = data;
 		struct port_info *pi = netdev_priv(p->dev);
-		p->ipv4addr = pi->iscsi_ipv4addr;
+		p->ipv4addr = pi->iscsic.ipv4_addr;
 		break;
 	}
 	case GET_EMBEDDED_INFO: {
diff --git a/drivers/net/cxgb3/sge.c b/drivers/net/cxgb3/sge.c
index f866128..a911363 100644
--- a/drivers/net/cxgb3/sge.c
+++ b/drivers/net/cxgb3/sge.c
@@ -1946,10 +1946,9 @@ static void restart_tx(struct sge_qset *qs)
  *	Check if the ARP request is probing the private IP address
  *	dedicated to iSCSI, generate an ARP reply if so.
  */
-static void cxgb3_arp_process(struct adapter *adapter, struct sk_buff *skb)
+static void cxgb3_arp_process(struct port_info *pi, struct sk_buff *skb)
 {
 	struct net_device *dev = skb->dev;
-	struct port_info *pi;
 	struct arphdr *arp;
 	unsigned char *arp_ptr;
 	unsigned char *sha;
@@ -1972,12 +1971,11 @@ static void cxgb3_arp_process(struct adapter *adapter, struct sk_buff *skb)
 	arp_ptr += dev->addr_len;
 	memcpy(&tip, arp_ptr, sizeof(tip));
 
-	pi = netdev_priv(dev);
-	if (tip != pi->iscsi_ipv4addr)
+	if (tip != pi->iscsic.ipv4_addr)
 		return;
 
 	arp_send(ARPOP_REPLY, ETH_P_ARP, sip, dev, tip, sha,
-		 dev->dev_addr, sha);
+		 pi->iscsic.mac_addr, sha);
 
 }
 
@@ -1986,6 +1984,19 @@ static inline int is_arp(struct sk_buff *skb)
 	return skb->protocol == htons(ETH_P_ARP);
 }
 
+static void cxgb3_process_iscsi_prov_pack(struct port_info *pi,
+					struct sk_buff *skb)
+{
+	if (is_arp(skb)) {
+		cxgb3_arp_process(pi, skb);
+		return;
+	}
+
+	if (pi->iscsic.recv)
+		pi->iscsic.recv(pi, skb);
+
+}
+
 /**
  *	rx_eth - process an ingress ethernet packet
  *	@adap: the adapter
@@ -2024,13 +2035,12 @@ static void rx_eth(struct adapter *adap, struct sge_rspq *rq,
 				vlan_gro_receive(&qs->napi, grp,
 						 ntohs(p->vlan), skb);
 			else {
-				if (unlikely(pi->iscsi_ipv4addr &&
-				    is_arp(skb))) {
+				if (unlikely(pi->iscsic.flags)) {
 					unsigned short vtag = ntohs(p->vlan) &
 								VLAN_VID_MASK;
 					skb->dev = vlan_group_get_device(grp,
 									 vtag);
-					cxgb3_arp_process(adap, skb);
+					cxgb3_process_iscsi_prov_pack(pi, skb);
 				}
 				__vlan_hwaccel_rx(skb, grp, ntohs(p->vlan),
 					  	  rq->polling);
@@ -2041,8 +2051,8 @@ static void rx_eth(struct adapter *adap, struct sge_rspq *rq,
 		if (lro)
 			napi_gro_receive(&qs->napi, skb);
 		else {
-			if (unlikely(pi->iscsi_ipv4addr && is_arp(skb)))
-				cxgb3_arp_process(adap, skb);
+			if (unlikely(pi->iscsic.flags))
+				cxgb3_process_iscsi_prov_pack(pi, skb);
 			netif_receive_skb(skb);
 		}
 	} else
-- 
1.6.0.6
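
The private iSCSI MAC address in the patch above is derived from the port's LAN address by setting the top bit of the fourth byte (see cxgb3_init_iscsi_mac). A minimal user-space sketch of that derivation follows. The function name derive_iscsi_mac() and the sample address in the note are hypothetical and only serve as illustration.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/*
 * Copy the LAN MAC address and set the high bit of byte 3, mirroring
 * what the patch does when it initialises pi->iscsic.mac_addr.
 */
static void derive_iscsi_mac(uint8_t out[ETH_ALEN],
			     const uint8_t dev_addr[ETH_ALEN])
{
	memcpy(out, dev_addr, ETH_ALEN);
	out[3] |= 0x80;
}
```

For example, a LAN address of 00:07:43:05:aa:bb would yield 00:07:43:85:aa:bb. Note that if byte 3 of the LAN address already has its high bit set, the derived address is identical to the LAN address.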

^ permalink raw reply related

* Re: [PATCH net-next-2.6 v2] bonding: introduce primary_reselect option
From: Jay Vosburgh @ 2009-09-25  0:34 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, davem, bonding-devel, nicolas.2p.debian
In-Reply-To: <20090918195358.GB32154@psychotron.redhat.com>


From: Jiri Pirko <jpirko@redhat.com>

In some cases it is not desirable to switch back to the primary interface when
its link recovers; it is better to stay with the currently active one. We need
to avoid packet loss as much as we can in some cases. This is solved by
introducing the primary_reselect option. Note that a newly enslaved primary
slave is set as the current active slave no matter what.

Patch modified by Jay Vosburgh as follows: fixed bug in action
after change of option setting via sysfs, revised the documentation
update, and bumped the bonding version number.

Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---

	Note that this patch depends on the "make ab_arp select active
slaves as other modes" patch recently approved, but not yet appearing in
net-next-2.6 as I write this.  http://patchwork.ozlabs.org/patch/32684/

 Documentation/networking/bonding.txt |   42 +++++++++++++++++++++-
 drivers/net/bonding/bond_main.c      |   66 +++++++++++++++++++++++++++++++---
 drivers/net/bonding/bond_main.c.rej  |   18 +++++++++
 drivers/net/bonding/bond_sysfs.c     |   53 +++++++++++++++++++++++++++
 drivers/net/bonding/bonding.h        |   11 +++++-
 5 files changed, 182 insertions(+), 8 deletions(-)
 create mode 100644 drivers/net/bonding/bond_main.c.rej

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index d5181ce..61f516b 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -1,7 +1,7 @@
 
 		Linux Ethernet Bonding Driver HOWTO
 
-		Latest update: 12 November 2007
+		Latest update: 23 September 2009
 
 Initial release : Thomas Davis <tadavis at lbl.gov>
 Corrections, HA extensions : 2000/10/03-15 :
@@ -614,6 +614,46 @@ primary
 
 	The primary option is only valid for active-backup mode.
 
+primary_reselect
+
+	Specifies the reselection policy for the primary slave.  This
+	affects how the primary slave is chosen to become the active slave
+	when failure of the active slave or recovery of the primary slave
+	occurs.  This option is designed to prevent flip-flopping between
+	the primary slave and other slaves.  Possible values are:
+
+	always or 0 (default)
+
+		The primary slave becomes the active slave whenever it
+		comes back up.
+
+	better or 1
+
+		The primary slave becomes the active slave when it comes
+		back up, if the speed and duplex of the primary slave is
+		better than the speed and duplex of the current active
+		slave.
+
+	failure or 2
+
+		The primary slave becomes the active slave only if the
+		current active slave fails and the primary slave is up.
+
+	The primary_reselect setting is ignored in two cases:
+
+		If no slaves are active, the first slave to recover is
+		made the active slave.
+
+		When initially enslaved, the primary slave is always made
+		the active slave.
+
+	Changing the primary_reselect policy via sysfs will cause an
+	immediate selection of the best active slave according to the new
+	policy.  This may or may not result in a change of the active
+	slave, depending upon the circumstances.
+
+	This option was added for bonding version 3.6.0.
+
 updelay
 
 	Specifies the time, in milliseconds, to wait before enabling a
diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 699bfdd..ba78baa 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -94,6 +94,7 @@ static int downdelay;
 static int use_carrier	= 1;
 static char *mode;
 static char *primary;
+static char *primary_reselect;
 static char *lacp_rate;
 static char *ad_select;
 static char *xmit_hash_policy;
@@ -126,6 +127,14 @@ MODULE_PARM_DESC(mode, "Mode of operation : 0 for balance-rr, "
 		       "6 for balance-alb");
 module_param(primary, charp, 0);
 MODULE_PARM_DESC(primary, "Primary network device to use");
+module_param(primary_reselect, charp, 0);
+MODULE_PARM_DESC(primary_reselect, "Reselect primary slave "
+				   "once it comes up; "
+				   "0 for always (default), "
+				   "1 for only if speed of primary is "
+				   "better, "
+				   "2 for only on active slave "
+				   "failure");
 module_param(lacp_rate, charp, 0);
 MODULE_PARM_DESC(lacp_rate, "LACPDU tx rate to request from 802.3ad partner "
 			    "(slow/fast)");
@@ -200,6 +209,13 @@ const struct bond_parm_tbl fail_over_mac_tbl[] = {
 {	NULL,			-1},
 };
 
+const struct bond_parm_tbl pri_reselect_tbl[] = {
+{	"always",		BOND_PRI_RESELECT_ALWAYS},
+{	"better",		BOND_PRI_RESELECT_BETTER},
+{	"failure",		BOND_PRI_RESELECT_FAILURE},
+{	NULL,			-1},
+};
+
 struct bond_parm_tbl ad_select_tbl[] = {
 {	"stable",	BOND_AD_STABLE},
 {	"bandwidth",	BOND_AD_BANDWIDTH},
@@ -1070,6 +1086,25 @@ out:
 
 }
 
+static bool bond_should_change_active(struct bonding *bond)
+{
+	struct slave *prim = bond->primary_slave;
+	struct slave *curr = bond->curr_active_slave;
+
+	if (!prim || !curr || curr->link != BOND_LINK_UP)
+		return true;
+	if (bond->force_primary) {
+		bond->force_primary = false;
+		return true;
+	}
+	if (bond->params.primary_reselect == BOND_PRI_RESELECT_BETTER &&
+	    (prim->speed < curr->speed ||
+	     (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
+		return false;
+	if (bond->params.primary_reselect == BOND_PRI_RESELECT_FAILURE)
+		return false;
+	return true;
+}
 
 /**
  * find_best_interface - select the best available slave to be the active one
@@ -1094,7 +1129,8 @@ static struct slave *bond_find_best_slave(struct bonding *bond)
 	}
 
 	if ((bond->primary_slave) &&
-	    bond->primary_slave->link == BOND_LINK_UP) {
+	    bond->primary_slave->link == BOND_LINK_UP &&
+	    bond_should_change_active(bond)) {
 		new_active = bond->primary_slave;
 	}
 
@@ -1675,8 +1711,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 
 	if (USES_PRIMARY(bond->params.mode) && bond->params.primary[0]) {
 		/* if there is a primary slave, remember it */
-		if (strcmp(bond->params.primary, new_slave->dev->name) == 0)
+		if (strcmp(bond->params.primary, new_slave->dev->name) == 0) {
 			bond->primary_slave = new_slave;
+			bond->force_primary = true;
+		}
 	}
 
 	write_lock_bh(&bond->curr_slave_lock);
@@ -3198,11 +3236,14 @@ static void bond_info_show_master(struct seq_file *seq)
 	}
 
 	if (USES_PRIMARY(bond->params.mode)) {
-		seq_printf(seq, "Primary Slave: %s\n",
+		seq_printf(seq, "Primary Slave: %s",
 			   (bond->primary_slave) ?
 			   bond->primary_slave->dev->name : "None");
+		if (bond->primary_slave)
+			seq_printf(seq, " (primary_reselect %s)",
+		   pri_reselect_tbl[bond->params.primary_reselect].modename);
 
-		seq_printf(seq, "Currently Active Slave: %s\n",
+		seq_printf(seq, "\nCurrently Active Slave: %s\n",
 			   (curr) ? curr->dev->name : "None");
 	}
 
@@ -4643,7 +4684,7 @@ int bond_parse_parm(const char *buf, const struct bond_parm_tbl *tbl)
 
 static int bond_check_params(struct bond_params *params)
 {
-	int arp_validate_value, fail_over_mac_value;
+	int arp_validate_value, fail_over_mac_value, primary_reselect_value;
 
 	/*
 	 * Convert string parameters.
@@ -4942,6 +4983,20 @@ static int bond_check_params(struct bond_params *params)
 		primary = NULL;
 	}
 
+	if (primary && primary_reselect) {
+		primary_reselect_value = bond_parse_parm(primary_reselect,
+							 pri_reselect_tbl);
+		if (primary_reselect_value == -1) {
+			pr_err(DRV_NAME
+			       ": Error: Invalid primary_reselect \"%s\"\n",
+			       primary_reselect ==
+					NULL ? "NULL" : primary_reselect);
+			return -EINVAL;
+		}
+	} else {
+		primary_reselect_value = BOND_PRI_RESELECT_ALWAYS;
+	}
+
 	if (fail_over_mac) {
 		fail_over_mac_value = bond_parse_parm(fail_over_mac,
 						      fail_over_mac_tbl);
@@ -4973,6 +5028,7 @@ static int bond_check_params(struct bond_params *params)
 	params->use_carrier = use_carrier;
 	params->lacp_fast = lacp_fast;
 	params->primary[0] = 0;
+	params->primary_reselect = primary_reselect_value;
 	params->fail_over_mac = fail_over_mac_value;
 
 	if (primary) {
diff --git a/drivers/net/bonding/bond_main.c.rej b/drivers/net/bonding/bond_main.c.rej
new file mode 100644
index 0000000..6854718
--- /dev/null
+++ b/drivers/net/bonding/bond_main.c.rej
@@ -0,0 +1,18 @@
+*************** static struct slave *bond_find_best_slave(struct bonding *bond)
+*** 1094,1100 ****
+  	}
+  
+  	if ((bond->primary_slave) &&
+- 	    bond->primary_slave->link == BOND_LINK_UP) {
+  		new_active = bond->primary_slave;
+  	}
+  
+--- 1129,1136 ----
+  	}
+  
+  	if ((bond->primary_slave) &&
++ 	    bond->primary_slave->link == BOND_LINK_UP &&
++ 	    bond_should_change_active(bond)) {
+  		new_active = bond->primary_slave;
+  	}
+  
diff --git a/drivers/net/bonding/bond_sysfs.c b/drivers/net/bonding/bond_sysfs.c
index 6044e12..8ee6164 100644
--- a/drivers/net/bonding/bond_sysfs.c
+++ b/drivers/net/bonding/bond_sysfs.c
@@ -1212,6 +1212,58 @@ static DEVICE_ATTR(primary, S_IRUGO | S_IWUSR,
 		   bonding_show_primary, bonding_store_primary);
 
 /*
+ * Show and set the primary_reselect flag.
+ */
+static ssize_t bonding_show_primary_reselect(struct device *d,
+					     struct device_attribute *attr,
+					     char *buf)
+{
+	struct bonding *bond = to_bond(d);
+
+	return sprintf(buf, "%s %d\n",
+		       pri_reselect_tbl[bond->params.primary_reselect].modename,
+		       bond->params.primary_reselect);
+}
+
+static ssize_t bonding_store_primary_reselect(struct device *d,
+					      struct device_attribute *attr,
+					      const char *buf, size_t count)
+{
+	int new_value, ret = count;
+	struct bonding *bond = to_bond(d);
+
+	if (!rtnl_trylock())
+		return restart_syscall();
+
+	new_value = bond_parse_parm(buf, pri_reselect_tbl);
+	if (new_value < 0)  {
+		pr_err(DRV_NAME
+		       ": %s: Ignoring invalid primary_reselect value %.*s.\n",
+		       bond->dev->name,
+		       (int) strlen(buf) - 1, buf);
+		ret = -EINVAL;
+		goto out;
+	}
+
+	bond->params.primary_reselect = new_value;
+	pr_info(DRV_NAME ": %s: setting primary_reselect to %s (%d).\n",
+		bond->dev->name, pri_reselect_tbl[new_value].modename,
+		new_value);
+
+	read_lock(&bond->lock);
+	write_lock_bh(&bond->curr_slave_lock);
+	bond_select_active_slave(bond);
+	write_unlock_bh(&bond->curr_slave_lock);
+	read_unlock(&bond->lock);
+out:
+	rtnl_unlock();
+	return ret;
+}
+static DEVICE_ATTR(primary_reselect, S_IRUGO | S_IWUSR,
+		   bonding_show_primary_reselect,
+		   bonding_store_primary_reselect);
+
+/*
  * Show and set the use_carrier flag.
  */
 static ssize_t bonding_show_carrier(struct device *d,
@@ -1500,6 +1552,7 @@ static struct attribute *per_bond_attrs[] = {
 	&dev_attr_num_unsol_na.attr,
 	&dev_attr_miimon.attr,
 	&dev_attr_primary.attr,
+	&dev_attr_primary_reselect.attr,
 	&dev_attr_use_carrier.attr,
 	&dev_attr_active_slave.attr,
 	&dev_attr_mii_status.attr,
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 6824771..9c03c2e 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -23,8 +23,8 @@
 #include "bond_3ad.h"
 #include "bond_alb.h"
 
-#define DRV_VERSION	"3.5.0"
-#define DRV_RELDATE	"November 4, 2008"
+#define DRV_VERSION	"3.6.0"
+#define DRV_RELDATE	"September 26, 2009"
 #define DRV_NAME	"bonding"
 #define DRV_DESCRIPTION	"Ethernet Channel Bonding Driver"
 
@@ -131,6 +131,7 @@ struct bond_params {
 	int lacp_fast;
 	int ad_select;
 	char primary[IFNAMSIZ];
+	int primary_reselect;
 	__be32 arp_targets[BOND_MAX_ARP_TARGETS];
 };
 
@@ -190,6 +191,7 @@ struct bonding {
 	struct   slave *curr_active_slave;
 	struct   slave *current_arp_slave;
 	struct   slave *primary_slave;
+	bool     force_primary;
 	s32      slave_cnt; /* never change this value outside the attach/detach wrappers */
 	rwlock_t lock;
 	rwlock_t curr_slave_lock;
@@ -258,6 +260,10 @@ static inline bool bond_is_lb(const struct bonding *bond)
 		|| bond->params.mode == BOND_MODE_ALB;
 }
 
+#define BOND_PRI_RESELECT_ALWAYS	0
+#define BOND_PRI_RESELECT_BETTER	1
+#define BOND_PRI_RESELECT_FAILURE	2
+
 #define BOND_FOM_NONE			0
 #define BOND_FOM_ACTIVE			1
 #define BOND_FOM_FOLLOW			2
@@ -348,6 +354,7 @@ extern const struct bond_parm_tbl bond_mode_tbl[];
 extern const struct bond_parm_tbl xmit_hashtype_tbl[];
 extern const struct bond_parm_tbl arp_validate_tbl[];
 extern const struct bond_parm_tbl fail_over_mac_tbl[];
+extern const struct bond_parm_tbl pri_reselect_tbl[];
 extern struct bond_parm_tbl ad_select_tbl[];
 
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
-- 
1.6.0.2
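
The reselection policy introduced by the patch above can be exercised outside the kernel with a small model. This is only a sketch: struct slave_model, struct bond_model and should_change_active() are hypothetical stand-ins for the driver's structures, and the decision logic is copied from bond_should_change_active() as posted, including the "<=" comparison on duplex.

```c
#include <stdbool.h>
#include <stddef.h>

/* Policy values as defined by the patch. */
enum {
	PRI_RESELECT_ALWAYS  = 0,
	PRI_RESELECT_BETTER  = 1,
	PRI_RESELECT_FAILURE = 2,
};

/* Simplified stand-ins for struct slave and struct bonding. */
struct slave_model {
	bool link_up;
	int speed;
	int duplex;
};

struct bond_model {
	const struct slave_model *primary;
	const struct slave_model *curr_active;
	int primary_reselect;
	bool force_primary;
};

/* Should a recovered primary take over from the current active slave? */
static bool should_change_active(struct bond_model *bond)
{
	const struct slave_model *prim = bond->primary;
	const struct slave_model *curr = bond->curr_active;

	if (!prim || !curr || !curr->link_up)
		return true;
	if (bond->force_primary) {
		/* One-shot flag set when the primary is first enslaved. */
		bond->force_primary = false;
		return true;
	}
	if (bond->primary_reselect == PRI_RESELECT_BETTER &&
	    (prim->speed < curr->speed ||
	     (prim->speed == curr->speed && prim->duplex <= curr->duplex)))
		return false;
	if (bond->primary_reselect == PRI_RESELECT_FAILURE)
		return false;
	return true;
}
```

In this model, a primary with the same speed and duplex as the current active slave does not take over under the "better" policy, while a missing or down active slave always allows the change regardless of policy.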


^ permalink raw reply related

* net-2.6 merged with upstream
From: David Miller @ 2009-09-25  0:34 UTC (permalink / raw)
  To: netdev

I merged Linus's tree into net-2.6 in order to deal with
some conflicts.

I intend to make a push to Linus later tonight or early Friday
morning.

Patch acceptance will get progressively stricter, and you should
be looking at fixing bugs being reported at this point anyways.
:-)

^ permalink raw reply

* Re: TCP stack bug related to F-RTO?
From: Ray Lee @ 2009-09-24 23:39 UTC (permalink / raw)
  To: Joe Cao, Netdev; +Cc: linux-kernel, jcaoco2002
In-Reply-To: <427999.33681.qm@web63406.mail.re1.yahoo.com>

[-- Attachment #1: Type: text/plain, Size: 1850 bytes --]

[adding netdev cc:]

On Thu, Sep 24, 2009 at 10:43 AM, Joe Cao <caoco2002@yahoo.com> wrote:
>
> Hello,
>
> I have found the following behavior with different versions of linux kernel. The attached pcap trace is collected with server (192.168.0.13) running 2.6.24 and shows the problem. Basically the behavior is like this:
>
> 1. The client opens up a big window,
> 2. the server sends 19 packets in a row (pkt #14- #32 in the trace), but all of them are dropped due to some congestion.
> 3. The server hits RTO and retransmits pkt #14 in #33
> 4. The client immediately acks #33 (=#14), and the server (seems like to enter F-RTO) expands the window and sends *NEW* pkt #35 & #36. Timeout is doubled to 2*RTO; The client immediately sends two Dup-ack to #35 and #36.
> 5. after 2*RTO, pkt #15 is retransmitted in #39.
> 6. The client immediately acks #39 (=#15) in #40, and the server continues to expand the window and sends two *NEW* pkt #41 & #42. Now the timeout is doubled to 4*RTO.
> 8. After 4*RTO timeout, #16 is retransmitted.
> 9....
> 10. The above steps repeat for retransmitting pkt #16-#32 and each time the timeout is doubled.
> 11. It takes a long long time to retransmit all the lost packets and before that is done, the client sends a RST because of timeout.
>
> The above behavior looks like F-RTO is in effect.  And there seems to be a bug in the TCP's congestion control and retransmission algorithm. Why doesn't the TCP on server (running 2.6.24) enter the slow start? Why should the server take that long to recover from a short period of packet loss?
>
> Has anyone else noticed a similar problem before?  If my analysis was wrong, can anyone give me some pointers to what's really wrong and how to fix it?
>
> Thanks a lot,
> Joe
>
> PS. Please cc me when this message is replied.
>
>
>

[-- Attachment #2: frto.pcap.7 --]
[-- Type: application/octet-stream, Size: 73622 bytes --]

^ permalink raw reply
