Subject: Re: [PATCH] net/bonding: send arp in interval if no active slave
To: Andy Gospodarek
References: <1439828583-27325-1-git-send-email-jarod@redhat.com>
 <20150817165500.GA21512@vps.falico.eu> <55D215F7.3080905@redhat.com>
 <55D22E64.6020807@redknee.com> <2649.1439838866@famine>
 <55D2494F.3020800@redknee.com>
 <20150901154157.GY504@gospo.home.greyhouse.net>
CC: Jay Vosburgh, Jarod Wilson, Veaceslav Falico
From: Uwe Koziolek
Message-ID: <55E63065.8070906@redknee.com>
Date: Wed, 2 Sep 2015 01:10:29 +0200
In-Reply-To: <20150901154157.GY504@gospo.home.greyhouse.net>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Sep 01, 2015 at 05:41 PM +0200, Andy Gospodarek wrote:
> On Mon, Aug 17, 2015 at 10:51:27PM +0200, Uwe Koziolek wrote:
>> On Mon, Aug 17, 2015 at 09:14PM +0200, Jay Vosburgh wrote:
>>> Uwe Koziolek wrote:
>>>
>>>> On 2015-08-17 07:12 PM, Jarod Wilson wrote:
>>>>> On 2015-08-17 12:55 PM, Veaceslav Falico wrote:
>>>>>> On Mon, Aug 17, 2015 at 12:23:03PM -0400, Jarod Wilson wrote:
>>>>>>> From: Uwe Koziolek
>>>>>>>
>>>>>>> With some very finicky switch hardware, active backup bonding can
>>>>>>> get into a situation where we play ping-pong between interfaces,
>>>>>>> trying to get one to come up as the active slave. There seems to be
>>>>>>> an issue with the switch's arp replies either taking too long, or
>>>>>>> simply getting lost, so we wind up unable to get any interface up
>>>>>>> and active. Sometimes, the issue sorts itself out after a while,
>>>>>>> sometimes it doesn't.
>>>>>>>
>>>>>>> Testing with num_grat_arp has proven fruitless, but sending an
>>>>>>> additional arp on curr_arp_slave if we're still in the arp_interval
>>>>>>> timeslice in bond_ab_arp_probe(), has shown to produce 100%
>>>>>>> reliability in testing with this hardware combination.
>>>>>> Sorry, I don't understand the logic of why it works, and what exactly
>>>>>> are we fixing here.
>>>>>>
>>>>>> It also breaks completely the logic for link state management in case
>>>>>> of no current active slave for 2*arp_interval.
>>>>>>
>>>>>> Could you please elaborate what exactly is fixed here, and how it
>>>>>> works? :)
>>>>> I can either duplicate some information from the bug, or Uwe can, to
>>>>> illustrate the exact nature of the problem.
>>>>>
>>>>>> p.s. num_grat_arp maybe could help?
>>>>> That was my thought as well, but as I understand it, that route was
>>>>> explored, and it didn't help any. I don't actually have a reproducer
>>>>> setup of my own, unfortunately, so I'm kind of caught in the middle
>>>>> here...
>>>>>
>>>>> Uwe, can you perhaps further enlighten us as to what num_grat_arp
>>>>> settings were tried that didn't help? I'm still of the mind that if
>>>>> num_grat_arp *didn't* help, we probably need to do something keyed off
>>>>> num_grat_arp.
>>>> The bonding slaves are connected to high availability switches; each of
>>>> the slaves is connected to a different switch. If the bond is starting,
>>>> only the selected slave sends one arp request. If a matching arp
>>>> response was received, this slave and the bond go into state up,
>>>> sending the gratuitous arps...
>>>> But if you got no arp reply, the next slave was selected.
>>>> With most of the newer switches, not overloaded, or with other software
>>>> bugs, or with a single switch configuration, you would get an arp
>>>> response on the first arp request.
>>>> But in case of a high availability configuration with non-perfect
>>>> switches like HP ProCurve 54xx, also with some Cisco models, you may
>>>> not get a response on the first arp request.
>>>>
>>>> I have seen network snoops where the switches are not responding to the
>>>> first arp request on slave 1, the second arp request was sent on slave 2
>>>> but the response was received on slave 1, and all following arp
>>>> requests are answered on the wrong slave for a longer time.
>>> Could you elaborate on the exact "high availability
>>> configuration" here, including the model(s) of switch(es) involved?
>>>
>>> Is this some kind of race between the switch or switches
>>> updating the forwarding tables and the bond flip flopping between the
>>> slaves? E.g., source MAC from ARP sent on slave 1 is used to populate
>>> the forwarding table, but (for whatever reason) there is no reply. ARP
>>> on slave 2 is sent (using the same source MAC, unless you set
>>> fail_over_mac), but forwarding tables still send that MAC to slave 1, so
>>> reply is sent there.
>> High availability:
>> 2 managed switches with routing capabilities have an interconnect.
>> One slave of a bonding interface is connected to the first switch, the
>> second slave is connected to the other switch.
>> The switch models are HP ProCurve 5406 and HP ProCurve 5412.
>> As far as I remember, HP E 3500 and E 3800 are also affected; for the
>> affected Cisco models I can't answer today.
>> Affected single switch configurations were not seen.
>>
>> Yes, race conditions with delayed updates of the forwarding tables are a
>> well matching explanation for the problem.
>>
>>>> The proposed change sends up to 3 arp requests on a down bond using the
>>>> same slave, delayed by arp_interval.
>>>> Using problematic switches I have seen the arp response on the right
>>>> slave at the latest on the second arp request. So the bond is going
>>>> into state up.
>>>>
>>>> How does it work:
>>>> The bonds in up state are handled at the beginning of the
>>>> bond_ab_arp_probe procedure; the other part of this procedure is
>>>> handling the slave change.
>>>> The proposed change is bypassing the slave change for 2 additional
>>>> calls of bond_ab_arp_probe.
>>>> Now the retries are not only available for an up bond, they are also
>>>> implemented for a down bond.
>>> Does this delay failover or bringup on switches that are not
>>> "problematic"? I.e., if arp_interval is, say, 1000 (1 second), will
>>> this impact failover / recovery times?
>>>
>>> -J
>> It depends.
>> Failover times are not impacted; this is handled differently.
>> Only the transition from a down bonding interface (bond and all slaves
>> are down) to the state up can be increased by up to 2 times arp_interval,
>> if the selected interface did not come up. If well working switches are
>> used, and everything else is also ok, there are no impacts.
> So I'm not a huge fan of workarounds like these, but I also understand
> from a practical standpoint that this is useful. My only issue with the
> patch would be to please include a small comment (1-2 lines) in the code
> that describes the behavior.
> I know we have the changelog entries for this, but I would feel better
> about having an exception like this in the code for those reading it and
> wondering:
>
> "Why would we wait 2 intervals before failing over to the next interface
> when there are no active interfaces?"
>

diff -up a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
--- a/drivers/net/bonding/bond_main.c	2015-08-30 20:34:09.000000000 +0200
+++ b/drivers/net/bonding/bond_main.c	2015-09-02 00:39:10.000298202 +0200
@@ -2795,6 +2795,16 @@ static bool bond_ab_arp_probe(struct bon
 		return should_notify_rtnl;
 	}
 
+	/* sometimes the forwarding tables of the switches are not updated fast enough
+	 * the first arp response after a slave change is received on the wrong slave.
+	 * the arp requests will be retried 2 times on the same slave
+	 */
+
+	if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
+		bond_arp_send_all(bond, curr_arp_slave);
+		return should_notify_rtnl;
+	}
+
 	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
 
 	bond_for_each_slave_rcu(bond, slave, iter) {

>>>> The num_grat_arp has no chance to solve the problem. The num_grat_arp is
>>>> only used, if a different slave is going active.
>>>> But in our case, the bonding slaves are not going into the state active
>>>> for a longer time.
>>>>>>> [jarod: manufacturing of changelog]
>>>>>>> CC: Jay Vosburgh
>>>>>>> CC: Veaceslav Falico
>>>>>>> CC: Andy Gospodarek
>>>>>>> CC: netdev@vger.kernel.org
>>>>>>> Signed-off-by: Uwe Koziolek
>>>>>>> Signed-off-by: Jarod Wilson
>>>>>>> ---
>>>>>>>  drivers/net/bonding/bond_main.c | 5 +++++
>>>>>>>  1 file changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>>>>>> index 0c627b4..60b9483 100644
>>>>>>> --- a/drivers/net/bonding/bond_main.c
>>>>>>> +++ b/drivers/net/bonding/bond_main.c
>>>>>>> @@ -2794,6 +2794,11 @@ static bool bond_ab_arp_probe(struct bonding *bond)
>>>>>>>  		return should_notify_rtnl;
>>>>>>>  	}
>>>>>>>
>>>>>>> +	if (bond_time_in_interval(bond, curr_arp_slave->last_link_up, 2)) {
>>>>>>> +		bond_arp_send_all(bond, curr_arp_slave);
>>>>>>> +		return should_notify_rtnl;
>>>>>>> +	}
>>>>>>> +
>>>>>>>  	bond_set_slave_inactive_flags(curr_arp_slave, BOND_SLAVE_NOTIFY_LATER);
>>>>>>>
>>>>>>>  	bond_for_each_slave_rcu(bond, slave, iter) {
>>>>>>> --
>>>>>>> 1.8.3.1
>>> ---
>>> -Jay Vosburgh, jay.vosburgh@canonical.com
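
To make the timing of the retry window concrete, here is a minimal userspace
sketch. It is not kernel code. It assumes that bond_time_in_interval() accepts
a timestamp up to mod * arp_interval plus half an interval after
last_link_up, that last_link_up is stamped when the slave is selected, and
that bond_ab_arp_probe() runs once per arp_interval. The helper names
in_interval() and retries_before_switch() are made up for illustration, and
times are plain milliseconds instead of jiffies.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model of the window check (assumed semantics, see above):
 * "now" is inside the window if it lies between one interval before
 * last_ms and (mod + 1/2) intervals after last_ms.
 */
static bool in_interval(long now_ms, long last_ms, long interval_ms, int mod)
{
	assert(interval_ms > 0);
	return now_ms >= last_ms - interval_ms &&
	       now_ms <= last_ms + mod * interval_ms + interval_ms / 2;
}

/*
 * Count how many probe ticks re-send the arp on the same slave before
 * the slave change is allowed. Tick 0 is the moment the slave was
 * selected (last_link_up = 0); the probe then fires every interval_ms.
 */
static int retries_before_switch(long interval_ms, int mod)
{
	int retries = 0;
	long now;

	for (now = interval_ms; in_interval(now, 0, interval_ms, mod);
	     now += interval_ms)
		retries++;

	return retries;
}
```

Under these assumptions, with arp_interval = 1000 and mod = 2, the ticks at
1000 ms and 2000 ms still fall inside the window and re-send the arp on the
same slave, while the tick at 3000 ms falls outside it, so the slave change
happens there. Together with the initial request that gives the "up to 3 arp
requests" described in the quoted text.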