From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jay Vosburgh Subject: Re: [PATCH RFC] net/bonding: announce fail-over for the active-backup mode Date: Wed, 28 May 2008 15:19:08 -0700 Message-ID: <12513.1212013148@death> References: Cc: Jeff Garzik , netdev@vger.kernel.org To: Or Gerlitz Return-path: Received: from e6.ny.us.ibm.com ([32.97.182.146]:33227 "EHLO e6.ny.us.ibm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755540AbYE1WTc (ORCPT ); Wed, 28 May 2008 18:19:32 -0400 Received: from d01relay02.pok.ibm.com (d01relay02.pok.ibm.com [9.56.227.234]) by e6.ny.us.ibm.com (8.13.8/8.13.8) with ESMTP id m4SMLY2T024828 for ; Wed, 28 May 2008 18:21:34 -0400 Received: from d01av04.pok.ibm.com (d01av04.pok.ibm.com [9.56.224.64]) by d01relay02.pok.ibm.com (8.13.8/8.13.8/NCO v8.7) with ESMTP id m4SMJBU2161690 for ; Wed, 28 May 2008 18:19:11 -0400 Received: from d01av04.pok.ibm.com (loopback [127.0.0.1]) by d01av04.pok.ibm.com (8.12.11.20060308/8.13.3) with ESMTP id m4SMJAnS010477 for ; Wed, 28 May 2008 18:19:10 -0400 In-reply-to: Sender: netdev-owner@vger.kernel.org List-ID: Or Gerlitz wrote: >> Enhance bonding to announce fail-over for the active-backup mode through the >> netdev events notifier chain mechanism. Such an event can be of use for the >> RDMA CM (communication manager) to let native RDMA ULPs (eg NFS-RDMA, iSER) >> always be aligned with the IP stack, in the sense that they use the same >> ports/links as the stack does. More usages can be done to allow monitoring >> tools based on netlink events be aware to bonding failover. >> >> Signed-off-by: Or Gerlitz >[...] >> --- linux-2.6.26-rc2.orig/drivers/net/bonding/bond_main.c 2008-05-13 10:02:22.000000000 +0300 >> +++ linux-2.6.26-rc2/drivers/net/bonding/bond_main.c 2008-05-15 12:29:44.000000000 +0300 >> @@ -1117,6 +1117,7 @@ void bond_change_active_slave(struct bon >> bond->send_grat_arp = 1; >> } else >> bond_send_gratuitous_arp(bond); >> + netdev_bonding_change(bond->dev); >> } >> } >[...] >> --- linux-2.6.26-rc2.orig/net/core/dev.c 2008-05-13 10:02:31.000000000 +0300 >> +++ linux-2.6.26-rc2/net/core/dev.c 2008-05-13 11:50:49.000000000 +0300 >> @@ -956,6 +956,12 @@ void netdev_state_change(struct net_devi >> } >> } >> >> +void netdev_bonding_change(struct net_device *dev) >> +{ >> + call_netdevice_notifiers(NETDEV_BONDING_FAILOVER, dev); >> +} >> +EXPORT_SYMBOL(netdev_bonding_change); > >Hi Jay, > >First, I'll be happy to get a comment from you on the patch. I've been away from the office, so I'm just now looking at this. Philosophically speaking, I don't see a problem with adding a notifier like this, but others higher in the food chain may have thoughts. From a technical point of view, however, you'll need a much more extensive change to deal with the locking requirements, which I'll describe below. >Second, bond_change_active_slave() is called on few occasions under write_lock_bh() >which means it runs in atomic context. I was thinking that it might be better to >invoke call_netdevice_notifiers() from thread context, since some components may >need to allocate memory in order to deliver event (eg netlink) and may assume the >context they are being called is not atomic. Yes, this is one of the locking cases that has been slowly changing. To make the change you're requesting, I suspect that all of the remaining callers of bond_change_active_slave will have to be converted as follows. The "proper" locking for calls to bond_change_active_slave is RTNL, read_lock of bond->lock and write_lock_bh of curr_slave_lock. Some places in the code still have other locking, and get away with it due to special knowledge (e.g., bond_release sets the active slave to NULL, which won't go down any paths that need the special locking). The reason for the locking requirements is that there are places that must release some of these locks to get to the correct locking context for the notifier calls (which is RTNL and nothing else). The main case for this is in bond_alb_handle_active_change, wherein the MAC addresses of the slaves are moved around. The dev_set_mac_address function needs RTNL and nothing else, because it will issue a notifier call. Your case is similar: you want to issue a notifier call during an active-backup failover, so that notifier call will have to be made holding RTNL and no other locks. I think the most maintainable way to do that is to convert the remaining callers of bond_change_active_slave to hold the correct set of locks, and then have bond_change_active_slave drop down to just RTNL at the appropriate place to make the notifier call. That may not be as simple as it sounds, as it may open race windows. > >For example I sometimes get the below "scheduling while atomic" warning which points >on ipoib code, but the stack trace has some netlink calls which allocate skb. I am >not sure yet what triggers this, however I wanted to check with you and Jeff if/what >there are some convensions for the context of call_netdevice_notifiers(), thanks. >bonding: bond0: link status definitely down for interface ib0, disabling it >bonding: bond0: making interface ib1 the new active one. >BUG: scheduling while atomic: bond0/14237/0x10000100 >Pid: 14237, comm: bond0 Not tainted 2.6.26-rc3 #4 > >Call Trace: > [] schedule+0x98/0x57b > [] dbg_redzone1+0x16/0x1f > [] :ib_ipoib:ipoib_start_xmit+0x445/0x459 > [] kmem_cache_alloc_node+0x147/0x177 > [] __alloc_skb+0x35/0x12b > [] __cond_resched+0x1c/0x43 > [] _cond_resched+0x2d/0x38 > [] kmem_cache_alloc_node+0x25/0x177 > [] __alloc_skb+0x35/0x12b > [] rtmsg_ifinfo+0x3a/0xd4 > [] rtnetlink_event+0x3d/0x41 > [] notifier_call_chain+0x30/0x54 > [] :bonding:bond_select_active_slave+0xb9/0xe8 > [] :bonding:__bond_mii_monitor+0x43a/0x464 > [] :bonding:bond_mii_monitor+0x5e/0xaa > [] :bonding:bond_mii_monitor+0x0/0xaa > [] run_workqueue+0x7f/0x107 > [] worker_thread+0x0/0xef > [] worker_thread+0xe5/0xef > [] autoremove_wake_function+0x0/0x2e > [] autoremove_wake_function+0x0/0x2e > [] kthread+0x3d/0x63 > [] child_rip+0xa/0x12 > [] kthread+0x0/0x63 > [] child_rip+0x0/0x12 It's from the call to nlmsg_new (an inline that calls alloc_skb) in rtmsg_ifinfo, which allocates at GFP_KERNEL. As I recall, there are other similar cases, so it's not simply a matter of changing rtmsg_ifinfo. The notifier calls have to happen with RTNL and no other locks. -J --- -Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com