bonding: time limits too tight in bond_ab_arp

netdev.vger.kernel.org archive mirror

* bonding: time limits too tight in bond_ab_arp_inspect
@ 2012-08-22 17:45 Jiri Bohac
  2012-08-22 17:54 ` Chris Friesen
  0 siblings, 1 reply; 7+ messages in thread
From: Jiri Bohac @ 2012-08-22 17:45 UTC
  To: Jay Vosburgh, Andy Gospodarek, netdev; +Cc: Petr Tesarik

Hi,

a customer reported that a bonding slave did not come back up
after setting their link down and then up again. ARP monitoring +
arp_validate were used.

Petr has tracked the problem down to the time comparisons in
bond_ab_arp_inspect().

                if (slave->link != BOND_LINK_UP) {
                        if (time_in_range(jiffies,
                                slave_last_rx(bond, slave) - delta_in_ticks,
                                slave_last_rx(bond, slave) + delta_in_ticks)) {

                                slave->new_link = BOND_LINK_UP;
                                commit++;
                        }

                        continue;
                }

This code is run from bond_activebackup_arp_mon() about
delta_in_ticks jiffies after the previous ARP probe has been
sent. If the delayed work gets executed exactly in delta_in_ticks
jiffies, there is a chance the slave will be brought up.  If the
delayed work runs one jiffy later, the slave will stay down.
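
To make the boundary concrete, here is a minimal userspace sketch of the
comparison quoted above. The macros are stand-ins modeled on the kernel's
jiffies helpers, and slave_comes_up(), demo() and the numbers (reply seen in
the same jiffy as the probe, delta_in_ticks = 250) are hypothetical, chosen
only for illustration.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Userspace stand-ins modeled on the kernel's wraparound-safe jiffies
 * comparisons; these are not the kernel headers themselves. */
#define time_after_eq(a, b)    ((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)   time_after_eq(b, a)
#define time_in_range(a, b, c) (time_after_eq(a, b) && time_before_eq(a, c))

/* Hypothetical helper: the check applied to a slave that is not up. */
static bool slave_comes_up(unsigned long now, unsigned long last_rx,
			   unsigned long delta_in_ticks)
{
	return time_in_range(now, last_rx - delta_in_ticks,
			     last_rx + delta_in_ticks);
}

/* Illustrative numbers only. */
static void demo(void)
{
	unsigned long last_rx = 1000, delta = 250;

	/* Monitor runs exactly delta jiffies later: still inside the window. */
	assert(slave_comes_up(last_rx + delta, last_rx, delta));
	/* Monitor runs one jiffy late: outside the window, slave stays down. */
	assert(!slave_comes_up(last_rx + delta + 1, last_rx, delta));

	/* The same boundary holds when the jiffies counter wraps around. */
	last_rx = ULONG_MAX - 10;
	assert(slave_comes_up(last_rx + delta, last_rx, delta));
	assert(!slave_comes_up(last_rx + delta + 1, last_rx, delta));
}
```

Calling demo() from a small main() exercises both sides of the boundary; the
upper bound is inclusive, so the window is missed by exactly one jiffy.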

With arp_validate this is more noticeable, since traffic other than the
bonding-generated ARP probes does not update the slave_last_rx timestamp.

A simple patch will fix this case.

--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -3001,7 +3001,7 @@ static int bond_ab_arp_inspect(struct bo
 		if (slave->link != BOND_LINK_UP) {
 			if (time_in_range(jiffies,
 				slave_last_rx(bond, slave) - delta_in_ticks,
-				slave_last_rx(bond, slave) + delta_in_ticks)) {
+				slave_last_rx(bond, slave) + 2 * delta_in_ticks)) {

 				slave->new_link = BOND_LINK_UP;
 				commit++;

The remaining time comparisons inside bond_ab_arp_inspect() have larger
tolerances (3*delta_in_ticks or 2*delta_in_ticks), but it still seems strange
that the precision of delayed work scheduling should steal a full
arp_interval from the time limits.

What is the intention of e.g. the "3*delta since last receive" limit? 
Was this really meant to be "as little as 2*delta + 1 jiffy"?

Should they perhaps all be increased by, say, delta_in_ticks/2, to make this
less dependent on the current scheduling latencies?

Thoughts?

-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ


* Re: bonding: time limits too tight in bond_ab_arp_inspect
  2012-08-22 17:45 bonding: time limits too tight in bond_ab_arp_inspect Jiri Bohac
@ 2012-08-22 17:54 ` Chris Friesen
  2012-08-22 18:42   ` Jay Vosburgh
  0 siblings, 1 reply; 7+ messages in thread
From: Chris Friesen @ 2012-08-22 17:54 UTC
  To: Jiri Bohac; +Cc: Jay Vosburgh, Andy Gospodarek, netdev, Petr Tesarik

On 08/22/2012 11:45 AM, Jiri Bohac wrote:

> This code is run from bond_activebackup_arp_mon() about
> delta_in_ticks jiffies after the previous ARP probe has been
> sent. If the delayed work gets executed exactly in delta_in_ticks
> jiffies, there is a chance the slave will be brought up.  If the
> delayed work runs one jiffy later, the slave will stay down.

<snip>

> Should they perhaps all be increased by, say, delta_in_ticks/2, to make this
> less dependent on the current scheduling latencies?

We have been using a patch that tracks the arpmon requested sleep time 
vs the actual sleep time and adds any scheduling latency to the allowed 
delta.  That way if we sleep too long due to scheduling latency it 
doesn't affect the calculation.
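
The patch itself is not posted in this thread, so the following is only a
rough userspace sketch of the idea as described: measure how late the monitor
ran and widen the allowed window by that amount. The struct, the function
names and the numbers are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins modeled on the kernel's jiffies comparisons. */
#define time_after_eq(a, b)    ((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)   time_after_eq(b, a)
#define time_in_range(a, b, c) (time_after_eq(a, b) && time_before_eq(a, c))

/* Hypothetical bookkeeping, not taken from any posted patch. */
struct arpmon_timing {
	unsigned long expected_run; /* jiffies value the work was armed for */
};

/* How many jiffies late the work ran; 0 if it ran on time or early. */
static unsigned long sched_latency(const struct arpmon_timing *t,
				   unsigned long now)
{
	return time_after_eq(now, t->expected_run) ? now - t->expected_run : 0;
}

/* Receive-time check with the measured latency added to the upper bound. */
static bool rx_recent(unsigned long now, unsigned long last_rx,
		      unsigned long delta_in_ticks, unsigned long latency)
{
	return time_in_range(now, last_rx - delta_in_ticks,
			     last_rx + delta_in_ticks + latency);
}

/* Illustrative numbers only. */
static void demo(void)
{
	struct arpmon_timing t = { .expected_run = 1250 };
	unsigned long now = 1260; /* the work ran 10 jiffies late */
	unsigned long lat = sched_latency(&t, now);

	assert(lat == 10);
	assert(!rx_recent(now, 1000, 250, 0));   /* uncompensated check fails */
	assert(rx_recent(now, 1000, 250, lat));  /* compensated check passes */
}
```

Compared with a fixed slack, this approach adapts to the latency actually
observed, at the cost of extra state per monitor.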

Chris


* Re: bonding: time limits too tight in bond_ab_arp_inspect
  2012-08-22 17:54 ` Chris Friesen
@ 2012-08-22 18:42   ` Jay Vosburgh
  2012-08-22 18:58     ` Chris Friesen
                       ` (2 more replies)
  0 siblings, 3 replies; 7+ messages in thread
From: Jay Vosburgh @ 2012-08-22 18:42 UTC
  To: Chris Friesen; +Cc: Jiri Bohac, Andy Gospodarek, netdev, Petr Tesarik

Chris Friesen <chris.friesen@genband.com> wrote:

>On 08/22/2012 11:45 AM, Jiri Bohac wrote:
>
>> This code is run from bond_activebackup_arp_mon() about
>> delta_in_ticks jiffies after the previous ARP probe has been
>> sent. If the delayed work gets executed exactly in delta_in_ticks
>> jiffies, there is a chance the slave will be brought up.  If the
>> delayed work runs one jiffy later, the slave will stay down.

	Presumably the ARP reply is coming back in less than one jiffy,
then, so the slave_last_rx() value is the same jiffy as when the
_inspect was previously called?

><snip>
>
>> Should they perhaps all be increased by, say, delta_in_ticks/2, to make this
>> less dependent on the current scheduling latencies?
>
>We have been using a patch that tracks the arpmon requested sleep time vs
>the actual sleep time and adds any scheduling latency to the allowed
>delta.  That way if we sleep too long due to scheduling latency it doesn't
>affect the calculation.

	How much scheduling latency do you see?

	Is that really better than just permitting a bit more slack in
the timing window?

	As to the 2 * delta and 3 * delta calculations, these values
predate my involvement with bonding, so I'm not entirely sure why those
specific values were chosen (there are no log messages from that era
that I'm aware of).  My presumption has been that this part:

                /*
                 * Active slave is down if:
                 * - more than 2*delta since transmitting OR
                 * - (more than 2*delta since receive AND
                 *    the bond has an IP address)
                 */
                trans_start = dev_trans_start(slave->dev);
                if (bond_is_active_slave(slave) &&
                    (!time_in_range(jiffies,
                        trans_start - delta_in_ticks,
                        trans_start + 2 * delta_in_ticks) ||
                     !time_in_range(jiffies,
                        slave_last_rx(bond, slave) - delta_in_ticks,
                        slave_last_rx(bond, slave) + 2 * delta_in_ticks))) {

                        slave->new_link = BOND_LINK_DOWN;
                        commit++;
                }

	was structured this way (allowing 2 * delta) to permit the loss
of a single ARP on an otherwise idle interface without triggering a link
down.
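
	That presumption can be checked with a small userspace model.
This is only a sketch under idealised assumptions (no scheduling latency,
the last reply received in the same jiffy as its probe, every later reply
lost); lost_replies_tolerated() is a hypothetical name and the macros are
stand-ins for the kernel's jiffies helpers.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins modeled on the kernel's jiffies comparisons. */
#define time_after_eq(a, b)    ((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)   time_after_eq(b, a)
#define time_in_range(a, b, c) (time_after_eq(a, b) && time_before_eq(a, c))

/*
 * Idealised model, not the driver's code: the last reply arrived at
 * last_rx, the monitor then runs every delta jiffies with no latency,
 * and every reply after last_rx is lost.  Returns how many consecutive
 * lost replies are survived before the check against
 * last_rx + mult * delta fails.  delta must be non-zero.
 */
static int lost_replies_tolerated(unsigned long last_rx, unsigned long delta,
				  unsigned int mult)
{
	/* First run after the last reply; its probe is the first one lost. */
	unsigned long now = last_rx + delta;
	int lost = 0;

	assert(delta > 0);

	for (;;) {
		now += delta; /* next run, one more reply missing */
		if (!time_in_range(now, last_rx - delta,
				   last_rx + mult * delta))
			return lost;
		lost++;
	}
}
```

	In this model a 2 * delta limit survives exactly one lost reply
and the 3 * delta limit mentioned earlier in the thread survives two, but
only because each check lands exactly on the inclusive upper bound.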

	My guess, though, is that until relatively recently the timing
window was not too tight, and there was effectively some slack in the
calculation, because the slave_last_rx() would be set to some small
number of jiffies after the last execution of the monitor, and so the
"slave_last_rx() + delta_in_ticks" wasn't as narrow a window as it
appears to be now.

	So, without having tested this myself, based on the above, I
don't see that adding some slack would be a problem.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com


* Re: bonding: time limits too tight in bond_ab_arp_inspect
  2012-08-22 18:42   ` Jay Vosburgh
@ 2012-08-22 18:58     ` Chris Friesen
  2012-08-23  7:34     ` Petr Tesarik
  2012-08-30 22:02     ` [PATCH] bonding: add some slack to arp monitoring time limits Jiri Bohac
  2 siblings, 0 replies; 7+ messages in thread
From: Chris Friesen @ 2012-08-22 18:58 UTC
  To: Jay Vosburgh; +Cc: Jiri Bohac, Andy Gospodarek, netdev, Petr Tesarik

On 08/22/2012 12:42 PM, Jay Vosburgh wrote:
> Chris Friesen<chris.friesen@genband.com>  wrote:
>
>> On 08/22/2012 11:45 AM, Jiri Bohac wrote:
>>
>>> This code is run from bond_activebackup_arp_mon() about
>>> delta_in_ticks jiffies after the previous ARP probe has been
>>> sent. If the delayed work gets executed exactly in delta_in_ticks
>>> jiffies, there is a chance the slave will be brought up.  If the
>>> delayed work runs one jiffy later, the slave will stay down.
>
> 	Presumably the ARP reply is coming back in less than one jiffy,
> then, so the slave_last_rx() value is the same jiffy as when the
> _inspect was previously called?
>
>> <snip>
>>
>>> Should they perhaps all be increased by, say, delta_in_ticks/2, to make this
>>> less dependent on the current scheduling latencies?
>>
>> We have been using a patch that tracks the arpmon requested sleep time vs
>> the actual sleep time and adds any scheduling latency to the allowed
>> delta.  That way if we sleep too long due to scheduling latency it doesn't
>> affect the calculation.
>
> 	How much scheduling latency do you see?
>
> 	Is that really better than just permitting a bit more slack in
> the timing window?

We hit enough latency that it triggered arpmon to falsely mark multiple 
links as lost.  This triggered our system maintenance code to go into a 
"oh no we can't talk to the outside world" scenario, which does fairly
intrusive things to try and bring connectivity back up.  Basically a bad 
thing to happen just because of a random scheduler latency spike.

I should note that we added this some time back and are still running 
older kernels so I have no idea what latency on modern kernels is like.

Chris


* Re: bonding: time limits too tight in bond_ab_arp_inspect
  2012-08-22 18:42   ` Jay Vosburgh
  2012-08-22 18:58     ` Chris Friesen
@ 2012-08-23  7:34     ` Petr Tesarik
  2012-08-30 22:02     ` [PATCH] bonding: add some slack to arp monitoring time limits Jiri Bohac
  2 siblings, 0 replies; 7+ messages in thread
From: Petr Tesarik @ 2012-08-23  7:34 UTC
  To: Jay Vosburgh; +Cc: Chris Friesen, Jiri Bohac, Andy Gospodarek, netdev

On Wed, 22 August 2012 20:42:02 Jay Vosburgh wrote:
> Chris Friesen <chris.friesen@genband.com> wrote:
> >On 08/22/2012 11:45 AM, Jiri Bohac wrote:
> >> This code is run from bond_activebackup_arp_mon() about
> >> delta_in_ticks jiffies after the previous ARP probe has been
> >> sent. If the delayed work gets executed exactly in delta_in_ticks
> >> jiffies, there is a chance the slave will be brought up.  If the
> >> delayed work runs one jiffy later, the slave will stay down.
> 
> 	Presumably the ARP reply is coming back in less than one jiffy,
> then, so the slave_last_rx() value is the same jiffy as when the
> _inspect was previously called?

Yes, that's what happens. Keep in mind that the backup slave validates the 
original ARP query, so on a fast network, you get it more or less immediately 
(for my case, I can see a delay of ~70us).

Anyway, why do we have to wait until the next ARP send? Couldn't we simply 
kick the work queue when we receive a valid packet on a down interface?

Petr Tesarik
SUSE Linux


* [PATCH] bonding: add some slack to arp monitoring time limits
  2012-08-22 18:42   ` Jay Vosburgh
  2012-08-22 18:58     ` Chris Friesen
  2012-08-23  7:34     ` Petr Tesarik
@ 2012-08-30 22:02     ` Jiri Bohac
  2012-08-31 20:37       ` David Miller
  2 siblings, 1 reply; 7+ messages in thread
From: Jiri Bohac @ 2012-08-30 22:02 UTC
  To: Jay Vosburgh
  Cc: Chris Friesen, Jiri Bohac, Andy Gospodarek, netdev, Petr Tesarik,
	davem

Currently, all the time limits in the bonding ARP monitor are in
multiples of arp_interval -- the time interval at which the ARP
monitor is periodically scheduled.

With a fast network round-trip and a little scheduling latency
of the ARP monitor work, a limit of n*delta_in_ticks may
effectively mean (n-1)*delta_in_ticks.

This is fatal in case of n==1  (the link will stay down
forever) and makes the behaviour non-deterministic in all the
other cases.

Add a delta_in_ticks/2 time slack to all the time limits.

Signed-off-by: Jiri Bohac <jbohac@suse.cz>

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index 6fae5f3..0f04115 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -2813,12 +2813,13 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
 					    arp_work.work);
 	struct slave *slave, *oldcurrent;
 	int do_failover = 0;
-	int delta_in_ticks;
+	int delta_in_ticks, extra_ticks;
 	int i;
 
 	read_lock(&bond->lock);
 
 	delta_in_ticks = msecs_to_jiffies(bond->params.arp_interval);
+	extra_ticks = delta_in_ticks / 2;
 
 	if (bond->slave_cnt == 0)
 		goto re_arm;
@@ -2841,10 +2842,10 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
 		if (slave->link != BOND_LINK_UP) {
 			if (time_in_range(jiffies,
 				trans_start - delta_in_ticks,
-				trans_start + delta_in_ticks) &&
+				trans_start + delta_in_ticks + extra_ticks) &&
 			    time_in_range(jiffies,
 				slave->dev->last_rx - delta_in_ticks,
-				slave->dev->last_rx + delta_in_ticks)) {
+				slave->dev->last_rx + delta_in_ticks + extra_ticks)) {
 
 				slave->link  = BOND_LINK_UP;
 				bond_set_active_slave(slave);
@@ -2874,10 +2875,10 @@ void bond_loadbalance_arp_mon(struct work_struct *work)
 			 */
 			if (!time_in_range(jiffies,
 				trans_start - delta_in_ticks,
-				trans_start + 2 * delta_in_ticks) ||
+				trans_start + 2 * delta_in_ticks + extra_ticks) ||
 			    !time_in_range(jiffies,
 				slave->dev->last_rx - delta_in_ticks,
-				slave->dev->last_rx + 2 * delta_in_ticks)) {
+				slave->dev->last_rx + 2 * delta_in_ticks + extra_ticks)) {
 
 				slave->link  = BOND_LINK_DOWN;
 				bond_set_backup_slave(slave);
@@ -2935,6 +2936,14 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 	struct slave *slave;
 	int i, commit = 0;
 	unsigned long trans_start;
+	int extra_ticks;
+
+	/* All the time comparisons below need some extra time. Otherwise, on
+	 * fast networks the ARP probe/reply may arrive within the same jiffy
+	 * as it was sent.  Then, the next time the ARP monitor is run, one
+	 * arp_interval will already have passed in the comparisons.
+	 */
+	extra_ticks = delta_in_ticks / 2;
 
 	bond_for_each_slave(bond, slave, i) {
 		slave->new_link = BOND_LINK_NOCHANGE;
@@ -2942,7 +2951,7 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 		if (slave->link != BOND_LINK_UP) {
 			if (time_in_range(jiffies,
 				slave_last_rx(bond, slave) - delta_in_ticks,
-				slave_last_rx(bond, slave) + delta_in_ticks)) {
+				slave_last_rx(bond, slave) + delta_in_ticks + extra_ticks)) {
 
 				slave->new_link = BOND_LINK_UP;
 				commit++;
@@ -2958,7 +2967,7 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 		 */
 		if (time_in_range(jiffies,
 				  slave->jiffies - delta_in_ticks,
-				  slave->jiffies + 2 * delta_in_ticks))
+				  slave->jiffies + 2 * delta_in_ticks + extra_ticks))
 			continue;
 
 		/*
@@ -2978,7 +2987,7 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 		    !bond->current_arp_slave &&
 		    !time_in_range(jiffies,
 			slave_last_rx(bond, slave) - delta_in_ticks,
-			slave_last_rx(bond, slave) + 3 * delta_in_ticks)) {
+			slave_last_rx(bond, slave) + 3 * delta_in_ticks + extra_ticks)) {
 
 			slave->new_link = BOND_LINK_DOWN;
 			commit++;
@@ -2994,10 +3003,10 @@ static int bond_ab_arp_inspect(struct bonding *bond, int delta_in_ticks)
 		if (bond_is_active_slave(slave) &&
 		    (!time_in_range(jiffies,
 			trans_start - delta_in_ticks,
-			trans_start + 2 * delta_in_ticks) ||
+			trans_start + 2 * delta_in_ticks + extra_ticks) ||
 		     !time_in_range(jiffies,
 			slave_last_rx(bond, slave) - delta_in_ticks,
-			slave_last_rx(bond, slave) + 2 * delta_in_ticks))) {
+			slave_last_rx(bond, slave) + 2 * delta_in_ticks + extra_ticks))) {
 
 			slave->new_link = BOND_LINK_DOWN;
 			commit++;
@@ -3029,7 +3038,7 @@ static void bond_ab_arp_commit(struct bonding *bond, int delta_in_ticks)
 			if ((!bond->curr_active_slave &&
 			     time_in_range(jiffies,
 					   trans_start - delta_in_ticks,
-					   trans_start + delta_in_ticks)) ||
+					   trans_start + delta_in_ticks + delta_in_ticks / 2)) ||
 			    bond->curr_active_slave != slave) {
 				slave->link = BOND_LINK_UP;
 				if (bond->current_arp_slave) {
-- 
Jiri Bohac <jbohac@suse.cz>
SUSE Labs, SUSE CZ
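
The effect of the delta_in_ticks/2 slack in the patch above can be sketched
in userspace as follows. within_limit() is a hypothetical helper that only
mirrors the shape of the patched comparisons, the macros are stand-ins for
the kernel's jiffies helpers, and the numbers are illustrative.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace stand-ins modeled on the kernel's jiffies comparisons. */
#define time_after_eq(a, b)    ((long)((a) - (b)) >= 0)
#define time_before_eq(a, b)   time_after_eq(b, a)
#define time_in_range(a, b, c) (time_after_eq(a, b) && time_before_eq(a, c))

/* Hypothetical helper: is `now` within n * delta_in_ticks (plus
 * extra_ticks of slack) of the timestamp `stamp`? */
static bool within_limit(unsigned long now, unsigned long stamp,
			 unsigned int n, unsigned long delta_in_ticks,
			 unsigned long extra_ticks)
{
	return time_in_range(now, stamp - delta_in_ticks,
			     stamp + n * delta_in_ticks + extra_ticks);
}

/* Illustrative numbers only: reply at jiffy 1000, delta_in_ticks = 250. */
static void demo(void)
{
	unsigned long stamp = 1000, delta = 250;
	unsigned long extra = delta / 2; /* 125, as in the patch */

	/* n == 1 without slack: one jiffy of latency already fails. */
	assert(!within_limit(stamp + delta + 1, stamp, 1, delta, 0));

	/* n == 1 with slack: latency up to delta / 2 is tolerated. */
	assert(within_limit(stamp + delta + 1, stamp, 1, delta, extra));
	assert(within_limit(stamp + delta + extra, stamp, 1, delta, extra));
	assert(!within_limit(stamp + delta + extra + 1, stamp, 1, delta, extra));
}
```

With these numbers the monitor may run up to 125 jiffies late before the
n == 1 check fails, instead of failing after a single jiffy.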


* Re: [PATCH] bonding: add some slack to arp monitoring time limits
  2012-08-30 22:02     ` [PATCH] bonding: add some slack to arp monitoring time limits Jiri Bohac
@ 2012-08-31 20:37       ` David Miller
  0 siblings, 0 replies; 7+ messages in thread
From: David Miller @ 2012-08-31 20:37 UTC
  To: jbohac; +Cc: fubar, chris.friesen, andy, netdev, ptesarik

From: Jiri Bohac <jbohac@suse.cz>
Date: Fri, 31 Aug 2012 00:02:47 +0200

> Currently, all the time limits in the bonding ARP monitor are in
> multiples of arp_interval -- the time interval at which the ARP
> monitor is periodically scheduled.
> 
> With a fast network round-trip and a little scheduling latency
> of the ARP monitor work, a limit of n*delta_in_ticks may
> effectively mean (n-1)*delta_in_ticks.
> 
> This is fatal in case of n==1  (the link will stay down
> forever) and makes the behaviour non-deterministic in all the
> other cases.
> 
> Add a delta_in_ticks/2 time slack to all the time limits.
> 
> Signed-off-by: Jiri Bohac <jbohac@suse.cz>

Applied to net-next, thanks.
