public inbox for linux-rdma@vger.kernel.org
* if/how to dictate IB device name per PCI BDF
@ 2012-10-11 11:02 Or Gerlitz
       [not found] ` <5076A755.30605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Or Gerlitz @ 2012-10-11 11:02 UTC (permalink / raw)
  To: Roland Dreier
  Cc: linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Shlomo Pongratz, Eli Cohen

Hi Roland,

We got a report that on a system with multiple (say two) ConnectX HCAs,
it's possible for the order of device probing to differ across simple
reboots: sometimes the device with PCI BDF X is probed first and becomes
IB device mlx4_0, and other times the device with BDF Y becomes mlx4_0
and X becomes mlx4_1, and so on, which for some reason creates a hassle
for them.
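
To see which BDF ended up behind which name on a given boot, the sysfs
tree can be walked. A minimal sketch, assuming the usual Linux layout
where /sys/class/infiniband/<name>/device is a symlink to the PCI device
directory; the function name ib_devices_by_bdf and its sysfs_root
parameter are made up for illustration:

```python
import os


def ib_devices_by_bdf(sysfs_root="/sys/class/infiniband"):
    """Return {PCI BDF: IB device name} for the devices under sysfs_root."""
    mapping = {}
    if not os.path.isdir(sysfs_root):
        return mapping  # no RDMA devices present, or not a Linux sysfs tree
    for name in sorted(os.listdir(sysfs_root)):
        device_link = os.path.join(sysfs_root, name, "device")
        if not os.path.exists(device_link):
            continue  # assumption: devices without a parent link are skipped
        # The symlink target is the parent device directory; for a PCI HCA
        # its last path component is the BDF, e.g. 0000:05:00.0.
        bdf = os.path.basename(os.path.realpath(device_link))
        mapping[bdf] = name
    return mapping
```

Comparing the returned dictionary across two boots would show whether
mlx4_0 moved from one BDF to the other. This only observes the naming;
it does not change it.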

I don't fully understand how the PCI scan order can change between
reboots, but I have the feeling it's possible (treating the system as a
black box, the report claims so).

Thinking about this a bit, would it be possible to provide a udev rule
that can dictate to the IB core what name / suffix digit to assign to a
device with a certain BDF?

Or.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: if/how to dictate IB device name per PCI BDF
       [not found] ` <5076A755.30605-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2012-10-12 22:43   ` Jason Gunthorpe
       [not found]     ` <20121012224332.GB25541-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2012-10-12 22:43 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Roland Dreier,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Shlomo Pongratz, Eli Cohen

On Thu, Oct 11, 2012 at 01:02:45PM +0200, Or Gerlitz wrote:

> Thinking about this a bit, will it be possible to provide a udev
> rule that can dictate to the IB core what name / suffix digit to
> assign for a device with certain BDF?

FWIW, I reflected on this once (I believe it was for the netlink
discussion).. Allowing for a device to be renamed creates a serious
problem for RDMA since there isn't a stable 'if-index' like way for
software to refer to it.

The udev rules all rely on a means to rename the device once the
kernel has created it.

For this reason, all the software I've built uses the port GID to ID
the resource for bind/etc and strongly discourages use of the RDMA
device name.
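
A minimal sketch of that GID-based lookup, assuming the Linux sysfs
layout /sys/class/infiniband/<dev>/ports/<n>/gids/<index> with GIDs
printed as eight colon-separated 16-bit groups. The helper name
find_port_by_gid and the sysfs_root parameter are hypothetical, and this
is not the code referred to above:

```python
import os


def find_port_by_gid(gid, sysfs_root="/sys/class/infiniband"):
    """Return (device_name, port_number) whose GID table holds gid, else None."""
    wanted = gid.strip().lower()
    if not os.path.isdir(sysfs_root):
        return None
    for dev in sorted(os.listdir(sysfs_root)):
        ports_dir = os.path.join(sysfs_root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        ports = [p for p in os.listdir(ports_dir) if p.isdigit()]
        for port in sorted(ports, key=int):
            gids_dir = os.path.join(ports_dir, port, "gids")
            if not os.path.isdir(gids_dir):
                continue
            for entry in os.listdir(gids_dir):
                try:
                    with open(os.path.join(gids_dir, entry)) as f:
                        value = f.read().strip().lower()
                except OSError:
                    continue  # unreadable table slot, skip it
                if value == wanted:
                    return dev, int(port)
    return None
```

The device name comes out of the lookup as a result, so nothing has to
be configured in terms of mlx4_0 or mlx4_1.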

The ipoib devices should already be renamable via udev rules, though
has anyone checked that udev doesn't have a problem with the long 'MAC'?

Jason

* Re: if/how to dictate IB device name per PCI BDF
       [not found]     ` <20121012224332.GB25541-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
@ 2012-10-12 23:04       ` Ira Weiny
       [not found]         ` <20121012160418.385ace32ddde5379381d5889-i2BcT+NCU+M@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Ira Weiny @ 2012-10-12 23:04 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Or Gerlitz, Roland Dreier,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Shlomo Pongratz, Eli Cohen

On Fri, 12 Oct 2012 16:43:32 -0600
Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org> wrote:

> On Thu, Oct 11, 2012 at 01:02:45PM +0200, Or Gerlitz wrote:
> 
> > Thinking about this a bit, will it be possible to provide a udev
> > rule that can dictate to the IB core what name / suffix digit to
> > assign for a device with certain BDF?
> 
> FWIW, I reflected on this once (I believe it was for the netlink
> discussion).. Allowing for a device to be renamed creates a serious
> problem for RDMA since there isn't a stable 'if-index' like way for
> software to refer to it.
> 
> The udev rules all rely on a means to rename the device once the
> kernel has created it.
> 
> For this reason, all the software I've built uses the port GID to ID
> the resource for bind/etc and strongly discourages use of the RDMA
> device name.

IMHO, from a user perspective, using GIDs or GUIDs is just too hard.
While the port GID is unique across reboots, in practice, with 1000s of
nodes in a cluster, names are much more useful and, at least here at
LLNL, nobody refers to HCAs by GUID.  They refer to them by
hostname/card name.  This is especially true when HCAs get swapped out
for maintenance.

Furthermore, how would you propose MPI selects a card when the job is
launched on different nodes, chosen by a scheduler, from run to run?

Unfortunately, I don't know the details of the "serious problem" you
describe above, but it seems that a udev solution for renaming things
would be worth pursuing from a user's perspective.

Ira

> 
> The ipoib devices should already be renamable via udev rules, though
> has anyone checked that udev doesn't have a problem with the long 'MAC'?
> 
> Jason


-- 
Ira Weiny
Member of Technical Staff
Lawrence Livermore National Lab
925-423-8008
weiny2-i2BcT+NCU+M@public.gmane.org

* Re: if/how to dictate IB device name per PCI BDF
       [not found]         ` <20121012160418.385ace32ddde5379381d5889-i2BcT+NCU+M@public.gmane.org>
@ 2012-10-12 23:38           ` Jason Gunthorpe
  0 siblings, 0 replies; 4+ messages in thread
From: Jason Gunthorpe @ 2012-10-12 23:38 UTC (permalink / raw)
  To: Ira Weiny
  Cc: Or Gerlitz, Roland Dreier,
	linux-rdma (linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org),
	Shlomo Pongratz, Eli Cohen

On Fri, Oct 12, 2012 at 04:04:18PM -0700, Ira Weiny wrote:

> > FWIW, I reflected on this once (I believe it was for the netlink
> > discussion).. Allowing for a device to be renamed creates a serious
> > problem for RDMA since there isn't a stable 'if-index' like way for
> > software to refer to it.
> > 
> > The udev rules all rely on a means to rename the device once the
> > kernel has created it.
> > 
> > For this reason, all the software I've built uses the port GID to ID
> > the resource for bind/etc and strongly discourages use of the RDMA
> > device name.
> 
> IMHO, from a user perspective using GID's or GUIDs is just too hard.
> While the port GID is unique across reboots in practice with 1000's
> of nodes in a cluster names are much more useful and, at least here
> at LLNL, nobody refers to HCA's by GUID.  They refer to them by
> hostname/card name.  This is especially true when HCA's get swapped
> out for maintenance.

Textual names (i.e. a name service) for GIDs alleviate a lot of that
pain, but really, how often do you need to use the RDMA device name?

Assuming GID port selection is available, the only case that requires
the full GID is when you have multiple HCAs and ports are active on
the *same subnet*.

Otherwise you can refer to the port by: port number, by GID prefix or
by 'first found active port' without ever involving the rdma device
name.

I never did it, but it did occur to me to support the IPoIB device
name and IP addresses as a port specifier.

> Furthermore, how would you propose MPI selects a card when the job
> is launched on different nodes, chosen by a scheduler, from run to
> run?

Three things spring to mind:
 - Choose an active port at random that matches a set of subnet prefixes
 - Choose an active port at random from a globally known list of
   job-acceptable GIDs
 - Choose an active port #X that matches a set of subnet prefixes,
   from a random device
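
The three options above can be sketched over an in-memory port list. This
is only an illustration: the Port record, the helper names and the rng
parameter are hypothetical, and GIDs are assumed to be written as eight
colon-separated 16-bit groups so the first four groups form the subnet
prefix.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    device: str   # RDMA device name, carried along for reporting only
    number: int   # port number on that device
    gid: str      # full port GID, eight colon-separated 16-bit groups
    active: bool


def subnet_prefix(gid):
    """Upper 64 bits of a GID, i.e. its first four 16-bit groups."""
    return ":".join(gid.lower().split(":")[:4])


def pick_by_prefix(ports, prefixes, port_number=None, rng=random):
    """Random active port on one of the subnets, optionally only port #X.

    With port_number set, every candidate sits on a different device, so
    the random choice is effectively a choice of device (third option).
    """
    wanted = {p.lower() for p in prefixes}
    candidates = [
        p for p in ports
        if p.active
        and subnet_prefix(p.gid) in wanted
        and (port_number is None or p.number == port_number)
    ]
    return rng.choice(candidates) if candidates else None


def pick_by_gid_list(ports, acceptable_gids, rng=random):
    """Random active port whose GID is on the job-acceptable list."""
    wanted = {g.lower() for g in acceptable_gids}
    candidates = [p for p in ports if p.active and p.gid.lower() in wanted]
    return rng.choice(candidates) if candidates else None
```

None of the selectors takes a device name as input, which is the point:
the same job description works on nodes whose devices are named
differently.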

Relying on rdma device name for MPI wouldn't work in a heterogeneous
environment where different nodes have a different 'correct' device
name.

> Unfortunately, I don't know the details of the "serious problem" you
> describe above but it seems that a udev solution to renaming things
> would be worth trying to solve from a users perspective.

Renaming the device changes the name of all the sysfs directories, so
any 'handle' is invalidated at rename time. All udev can do for you is
trigger the rename call; you still have to somewhat sanely enable
renaming in the kernel.

And it is not entirely theoretical: with threaded startup and things
like systemd, it is very plausible that the rename could hit in the
middle of something like opensm starting up and create a flaky
system, particularly since we'd want to see opensm's startup triggered
by directory creation in sysfs.

I think it would be great to be able to do the rename, but to do it
properly things need to be more like the net stack - netlink for
configuration/discovery, and a stable ifindex for use in all netlink
messages.
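
For comparison, this is how user space can hold on to a net device
today. A small sketch using Python's socket module, which wraps
if_nameindex(3) and if_indextoname(3); the two helper names are made up
for illustration, and no equivalent index is assumed to exist for RDMA
devices:

```python
import socket


def interface_indexes():
    """Return {ifindex: name} for the network interfaces visible right now."""
    return {index: name for index, name in socket.if_nameindex()}


def current_name(ifindex):
    """Resolve an ifindex to its current name, or None if it is gone."""
    try:
        return socket.if_indextoname(ifindex)
    except OSError:
        return None
```

A program that stores the index rather than the name can call
current_name() after a rename and still find the same interface.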

Jason