* if/how to dictate IB device name per PCI BDF
@ 2012-10-11 11:02 Or Gerlitz
[not found] ` <5076A755.30605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
0 siblings, 1 reply; 4+ messages in thread
From: Or Gerlitz @ 2012-10-11 11:02 UTC (permalink / raw)
To: Roland Dreier
Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Shlomo Pongratz, Eli Cohen
Hi Roland,
We got a report that on a system with multiple (say two) ConnectX HCAs,
it's possible for the order of device probing to differ across simple
reboots. That is, sometimes the device with PCI BDF X is probed first
and becomes IB device mlx4_0, and other times the device with BDF Y
becomes mlx4_0 and X becomes mlx4_1, and so on, which for some reason
creates a hassle for them.
I don't fully understand how the PCI scan order can change between
reboots, but I have the feeling it's possible (treating the system as a
black box, that is what the report claims).
Thinking about this a bit, would it be possible to provide a udev rule
that can dictate to the IB core what name / suffix digit to assign to a
device with a certain BDF?
Or.
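
To see which BDF ended up with which IB device name after a given boot, one option is to read the mapping back from sysfs. The sketch below is only an illustration: it assumes the usual layout in which /sys/class/infiniband/<name>/device is a symlink to the PCI device directory, and the function name ib_devices_by_bdf is made up for this example. It reports the names the kernel assigned; it does not rename anything.

```python
import os


def ib_devices_by_bdf(sysfs_root="/sys/class/infiniband"):
    """Return {pci_bdf: ib_device_name} for devices found under sysfs_root.

    Assumes each <sysfs_root>/<name>/device entry is a symlink to the PCI
    device directory, whose last path component is the BDF.
    """
    mapping = {}
    if not os.path.isdir(sysfs_root):
        return mapping
    for name in sorted(os.listdir(sysfs_root)):
        device_link = os.path.join(sysfs_root, name, "device")
        if not os.path.exists(device_link):
            continue
        # Resolve the symlink chain and keep only the final component,
        # e.g. "0000:04:00.0".
        bdf = os.path.basename(os.path.realpath(device_link))
        mapping[bdf] = name
    return mapping
```

Running ib_devices_by_bdf() after each reboot and comparing the results would show whether X and Y really swap between mlx4_0 and mlx4_1.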
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[parent not found: <5076A755.30605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: if/how to dictate IB device name per PCI BDF
[not found] ` <5076A755.30605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2012-10-12 22:43 ` Jason Gunthorpe
[not found] ` <20121012224332.GB25541-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
0 siblings, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2012-10-12 22:43 UTC (permalink / raw)
To: Or Gerlitz
Cc: Roland Dreier,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Shlomo Pongratz, Eli Cohen

On Thu, Oct 11, 2012 at 01:02:45PM +0200, Or Gerlitz wrote:

> Thinking about this a bit, will it be possible to provide a udev
> rule that can dictate to the IB core what name / suffix digit to
> assign for a device with certain BDF?

FWIW, I reflected on this once (I believe it was for the netlink
discussion).. Allowing for a device to be renamed creates a serious
problem for RDMA since there isn't a stable 'if-index' like way for
software to refer to it.

The udev rules all rely on a means to rename the device once the
kernel has created it.

For this reason, all the software I've built uses the port GID to ID
the resource for bind/etc and strongly discourages use of the RDMA
device name.

The ipoib devices should already be renamable via udev rules, though
has anyone checked that udev doesn't have a problem with the long 'MAC'?

Jason
[parent not found: <20121012224332.GB25541-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>]
* Re: if/how to dictate IB device name per PCI BDF
[not found] ` <20121012224332.GB25541-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2012-10-12 23:04 ` Ira Weiny
[not found] ` <20121012160418.385ace32ddde5379381d5889-i2BcT+NCU+M@public.gmane.org>
0 siblings, 1 reply; 4+ messages in thread
From: Ira Weiny @ 2012-10-12 23:04 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Or Gerlitz, Roland Dreier,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Shlomo Pongratz, Eli Cohen

On Fri, 12 Oct 2012 16:43:32 -0600
Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> On Thu, Oct 11, 2012 at 01:02:45PM +0200, Or Gerlitz wrote:
>
> > Thinking about this a bit, will it be possible to provide a udev
> > rule that can dictate to the IB core what name / suffix digit to
> > assign for a device with certain BDF?
>
> FWIW, I reflected on this once (I believe it was for the netlink
> discussion).. Allowing for a device to be renamed creates a serious
> problem for RDMA since there isn't a stable 'if-index' like way for
> software to refer to it.
>
> The udev rules all rely on a means to rename the device once the
> kernel has created it.
>
> For this reason, all the software I've built uses the port GID to ID
> the resource for bind/etc and strongly discourages use of the RDMA
> device name.

IMHO, from a user perspective, using GIDs or GUIDs is just too hard.
While the port GID is unique across reboots, in practice with 1000s
of nodes in a cluster names are much more useful and, at least here
at LLNL, nobody refers to HCAs by GUID. They refer to them by
hostname/card name. This is especially true when HCAs get swapped
out for maintenance.

Furthermore, how would you propose MPI selects a card when the job
is launched on different nodes, chosen by a scheduler, from run to
run?

Unfortunately, I don't know the details of the "serious problem" you
describe above, but it seems that a udev solution to renaming things
would be worth trying to solve from a user's perspective.

Ira

> The ipoib devices should already be renamable via udev rules, though
> has anyone checked that udev doesn't have a problem with the long 'MAC'?
>
> Jason

--
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org
[parent not found: <20121012160418.385ace32ddde5379381d5889-i2BcT+NCU+M@public.gmane.org>]
* Re: if/how to dictate IB device name per PCI BDF
[not found] ` <20121012160418.385ace32ddde5379381d5889-i2BcT+NCU+M@public.gmane.org>
@ 2012-10-12 23:38 ` Jason Gunthorpe
0 siblings, 0 replies; 4+ messages in thread
From: Jason Gunthorpe @ 2012-10-12 23:38 UTC (permalink / raw)
To: Ira Weiny
Cc: Or Gerlitz, Roland Dreier,
linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
Shlomo Pongratz, Eli Cohen

On Fri, Oct 12, 2012 at 04:04:18PM -0700, Ira Weiny wrote:

> > FWIW, I reflected on this once (I believe it was for the netlink
> > discussion).. Allowing for a device to be renamed creates a serious
> > problem for RDMA since there isn't a stable 'if-index' like way for
> > software to refer to it.
> >
> > The udev rules all rely on a means to rename the device once the
> > kernel has created it.
> >
> > For this reason, all the software I've built uses the port GID to ID
> > the resource for bind/etc and strongly discourages use of the RDMA
> > device name.
>
> IMHO, from a user perspective, using GIDs or GUIDs is just too hard.
> While the port GID is unique across reboots, in practice with 1000s
> of nodes in a cluster names are much more useful and, at least here
> at LLNL, nobody refers to HCAs by GUID. They refer to them by
> hostname/card name. This is especially true when HCAs get swapped
> out for maintenance.

Textual names (i.e. a name service) for GIDs alleviate a lot of that
pain, but really, how often do you need to use the RDMA device name?

Assuming GID port selection is available, the only case that requires
the full GID is when you have multiple HCAs and ports are active on the
*same subnet*. Otherwise you can refer to the port by port number, by
GID prefix, or by 'first found active port' without ever involving the
RDMA device name.

I never did it, but it did occur to me to support the IPoIB device
name and IP addresses as a port specifier..

> Furthermore, how would you propose MPI selects a card when the job
> is launched on different nodes, chosen by a scheduler, from run to
> run?

Three things spring to mind:
 - Choose an active port at random that matches a set of subnet
   prefixes
 - Choose an active port at random from a globally known list of
   job-acceptable GIDs
 - Choose an active port #X that matches a set of subnet prefixes,
   from a random device

Relying on the RDMA device name for MPI wouldn't work in a
heterogeneous environment where different nodes have a different
'correct' device name.

> Unfortunately, I don't know the details of the "serious problem" you
> describe above, but it seems that a udev solution to renaming things
> would be worth trying to solve from a user's perspective.

Renaming the device changes the name of all the sysfs directories, so
any 'handle' is invalidated at rename time. All udev can do for you is
trigger the rename call; you still have to somewhat sanely enable
renaming in the kernel.

And it is not entirely theoretical: with threaded startup and things
like systemd, it is very plausible that the rename could hit in the
middle of something like opensm starting up and create a flaky
system. Particularly since we'd want to see opensm's startup triggered
by directory creation in sysfs.

I think it would be great to be able to do the rename, but to do it
properly things need to be more like the net stack - netlink for
configuration/discovery, and a stable ifindex for use in all netlink
messages.

Jason
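
The first selection strategy listed above (an active port chosen at random from those matching a set of subnet prefixes) could look roughly like the sketch below. It is an illustration under stated assumptions, not code from any of the software mentioned in this thread: it assumes the usual sysfs layout with ports/<n>/state reading like "4: ACTIVE" and ports/<n>/gids/0 holding the port GID as eight colon-separated groups, and the names active_ports, subnet_prefix and pick_port are hypothetical.

```python
import os
import random


def active_ports(sysfs_root="/sys/class/infiniband"):
    """Yield (device, port, gid) for every port whose state is ACTIVE."""
    if not os.path.isdir(sysfs_root):
        return
    for dev in sorted(os.listdir(sysfs_root)):
        ports_dir = os.path.join(sysfs_root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        ports = sorted((p for p in os.listdir(ports_dir) if p.isdigit()), key=int)
        for port in ports:
            port_dir = os.path.join(ports_dir, port)
            try:
                with open(os.path.join(port_dir, "state")) as f:
                    state = f.read().strip()  # assumed form: "4: ACTIVE"
                with open(os.path.join(port_dir, "gids", "0")) as f:
                    gid = f.read().strip().lower()
            except OSError:
                continue
            if state.endswith("ACTIVE"):
                yield dev, int(port), gid


def subnet_prefix(gid):
    """Upper 64 bits of a GID written as eight colon-separated groups."""
    return ":".join(gid.split(":")[:4])


def pick_port(prefixes, sysfs_root="/sys/class/infiniband", rng=random):
    """Pick one active port at random whose GID matches a wanted prefix."""
    wanted = {p.lower() for p in prefixes}
    candidates = [
        entry for entry in active_ports(sysfs_root)
        if subnet_prefix(entry[2]) in wanted
    ]
    return rng.choice(candidates) if candidates else None
```

The device name in the returned tuple is whatever sysfs currently shows; the selection itself is keyed only on the GID prefix, so it gives the same answer whichever HCA happened to be probed first.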
end of thread, other threads:[~2012-10-12 23:38 UTC | newest]
Thread overview: 4+ messages
2012-10-11 11:02 if/how to dictate IB device name per PCI BDF Or Gerlitz
[not found] ` <5076A755.30605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2012-10-12 22:43 ` Jason Gunthorpe
[not found] ` <20121012224332.GB25541-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
2012-10-12 23:04 ` Ira Weiny
[not found] ` <20121012160418.385ace32ddde5379381d5889-i2BcT+NCU+M@public.gmane.org>
2012-10-12 23:38 ` Jason Gunthorpe