From mboxrd@z Thu Jan 1 00:00:00 1970
From: Jason Gunthorpe
Subject: Re: if/how to dictate IB device name per PCI BDF
Date: Fri, 12 Oct 2012 17:38:02 -0600
Message-ID: <20121012233802.GD25541@obsidianresearch.com>
References: <5076A755.30605@mellanox.com> <20121012224332.GB25541@obsidianresearch.com> <20121012160418.385ace32ddde5379381d5889@llnl.gov>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Return-path:
Content-Disposition: inline
In-Reply-To: <20121012160418.385ace32ddde5379381d5889-i2BcT+NCU+M@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Ira Weiny
Cc: Or Gerlitz, Roland Dreier, "linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)", Shlomo Pongratz, Eli Cohen
List-Id: linux-rdma@vger.kernel.org

On Fri, Oct 12, 2012 at 04:04:18PM -0700, Ira Weiny wrote:

> > FWIW, I reflected on this once (I believe it was for the netlink
> > discussion).. Allowing for a device to be renamed creates a serious
> > problem for RDMA since there isn't a stable 'if-index' like way for
> > software to refer to it.
> >
> > The udev rules all rely on a means to rename the device once the
> > kernel has created it.
> >
> > For this reason, all the software I've built uses the port GID to ID
> > the resource for bind/etc and strongly discourages use of the RDMA
> > device name.
>
> IMHO, from a user perspective using GID's or GUIDs is just too hard.
> While the port GID is unique across reboots in practice with 1000's
> of nodes in a cluster names are much more useful and, at least here
> at LLNL, nobody refers to HCA's by GUID. They refer to them by
> hostname/card name. This is especially true when HCA's get swapped
> out for maintenance.

Textual names (i.e. a name service) for GIDs alleviate a lot of that
pain, but really, how often do you need to use the RDMA device name??
Assuming GID port selection is available, the only case that requires
the full GID is when you have multiple HCAs and ports are active on
the *same subnet*. Otherwise you can refer to the port by port number,
by GID prefix, or by 'first found active port', without ever involving
the RDMA device name.

I never did it, but it did occur to me to support the IPoIB device
name and IP addresses as a port specifier..

> Furthermore, how would you propose MPI selects a card when the job
> is launched on different nodes, chosen by a scheduler, from run to
> run?

Three things spring to mind:
 - Choose an active port at random that matches a set of subnet
   prefixes
 - Choose an active port at random from a globally known list of
   job-acceptable GIDs
 - Choose an active port #X that matches a set of subnet prefixes,
   from a random device

Relying on the RDMA device name for MPI wouldn't work in a
heterogeneous environment where different nodes have a different
'correct' device name.

> Unfortunately, I don't know the details of the "serious problem" you
> describe above but it seems that a udev solution to renaming things
> would be worth trying to solve from a user's perspective.

Renaming the device changes the names of all its sysfs directories, so
any 'handle' that software holds is invalidated at rename time.

All udev can do for you is trigger the rename call; you still have to
enable renaming in the kernel in some sane way.

And it is not entirely theoretical. With threaded startup and things
like systemd, it is very plausible that the rename could hit in the
middle of something like opensm starting up and create a flaky
system. Particularly since we'd want to see opensm's startup triggered
by directory creation in sysfs.

I think it would be great to be able to do the rename, but to do it
properly things need to be more like the net stack - netlink for
configuration/discovery, and a stable ifindex for use in all netlink
messages.
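
For what it's worth, the first selection option above (a random active
port matching a set of subnet prefixes) could be sketched roughly like
this in Python. It is only an illustration, not code from any existing
tool: the Port record and the function names are made up for the
example, and scan_ports() assumes the usual sysfs layout of
<root>/<dev>/ports/<n>/state and <root>/<dev>/ports/<n>/gids/0. Note
that the device name is carried along for information only and is
never used to pick the port.

```python
import ipaddress
import random
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Port:
    """Hypothetical record for one RDMA port (illustration only)."""
    device: str   # RDMA device name; informational, never used for selection
    number: int   # port number on that device
    active: bool  # True if the port state is ACTIVE
    gid: str      # port GID at index 0, in IPv6 notation


def subnet_prefix(gid: str) -> int:
    """Return the upper 64 bits of a GID, i.e. its subnet prefix."""
    return int(ipaddress.IPv6Address(gid)) >> 64


def scan_ports(root: str = "/sys/class/infiniband") -> list[Port]:
    """Enumerate ports from sysfs; the directory layout is an assumption."""
    ports = []
    for port_dir in sorted(Path(root).glob("*/ports/*")):
        try:
            state = (port_dir / "state").read_text()
            gid = (port_dir / "gids" / "0").read_text().strip()
        except OSError:
            # A directory can disappear mid-scan, e.g. if the device is renamed.
            continue
        ports.append(Port(port_dir.parent.parent.name, int(port_dir.name),
                          state.startswith("4:"), gid))
    return ports


def choose_port(ports, prefixes, rng=random):
    """Pick a random active port whose GID falls in one of the subnet prefixes.

    Returns None when no active port matches.
    """
    wanted = {subnet_prefix(p) for p in prefixes}
    candidates = [p for p in ports
                  if p.active and subnet_prefix(p.gid) in wanted]
    return rng.choice(candidates) if candidates else None
```

Something like choose_port(scan_ports(), ["fe80::"]) would then give
the same kind of answer on every node of a job, whatever the local
device happens to be called. The other two options are small
variations on the candidate filter.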

Jason