From: Jason Gunthorpe <jgunthorpe-ePGOBjL8dl3ta4EC/59zMFaTQe2KTcn/@public.gmane.org>
To: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
Cc: Or Gerlitz <ogerlitz-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>,
"linux-rdma
(linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org)"
<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
Shlomo Pongratz <shlomop-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
Eli Cohen <eli-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
Subject: Re: if/how to dictate IB device name per PCI BDF
Date: Fri, 12 Oct 2012 17:38:02 -0600
Message-ID: <20121012233802.GD25541@obsidianresearch.com>
In-Reply-To: <20121012160418.385ace32ddde5379381d5889-i2BcT+NCU+M@public.gmane.org>

On Fri, Oct 12, 2012 at 04:04:18PM -0700, Ira Weiny wrote:
> > FWIW, I reflected on this once (I believe it was for the netlink
> > discussion).. Allowing for a device to be renamed creates a serious
> > problem for RDMA since there isn't a stable 'if-index' like way for
> > software to refer to it.
> >
> > The udev rules all rely on a means to rename the device once the
> > kernel has created it.
> >
> > For this reason, all the software I've built uses the port GID to ID
> > the resource for bind/etc and strongly discourages use of the RDMA
> > device name.
>
> IMHO, from a user perspective using GIDs or GUIDs is just too hard.
> While the port GID is unique across reboots, in practice, with 1000s
> of nodes in a cluster, names are much more useful and, at least here
> at LLNL, nobody refers to HCAs by GUID. They refer to them by
> hostname/card name. This is especially true when HCAs get swapped
> out for maintenance.

Textual names (i.e. a name service) for GIDs alleviate a lot of that pain,
but really, how often do you need to use the RDMA device name?

Assuming GID port selection is available, the only case that requires
the full GID is when you have multiple HCAs and ports are active on
the *same subnet*.

Otherwise you can refer to the port by port number, by GID prefix, or
by 'first found active port' without ever involving the RDMA device
name.

I never did it, but it did occur to me to support the IPoIB device
name and IP addresses as a port specifier.

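A minimal sketch of that kind of device-name-free port selection, assuming the usual Linux sysfs layout (/sys/class/infiniband/<dev>/ports/<n>/state and .../gids/0). The names Port, list_ports and find_port are hypothetical, and the GID prefix is assumed to be given in the same zero-padded form that sysfs prints:

```python
import os
from typing import Iterator, NamedTuple, Optional

SYSFS_IB = "/sys/class/infiniband"  # Linux sysfs directory for RDMA devices


class Port(NamedTuple):
    device: str   # RDMA device name; only reported, never used to select
    port: int     # port number on that device
    gid: str      # GID index 0, as printed by sysfs (lowercased)
    active: bool  # True when the port state is ACTIVE


def _read(path: str) -> str:
    with open(path) as f:
        return f.read().strip()


def list_ports(root: str = SYSFS_IB) -> Iterator[Port]:
    """Yield every port found under root, skipping unreadable entries."""
    if not os.path.isdir(root):
        return
    for dev in sorted(os.listdir(root)):
        ports_dir = os.path.join(root, dev, "ports")
        if not os.path.isdir(ports_dir):
            continue
        numbers = sorted(int(n) for n in os.listdir(ports_dir) if n.isdigit())
        for num in numbers:
            base = os.path.join(ports_dir, str(num))
            try:
                state = _read(os.path.join(base, "state"))  # e.g. "4: ACTIVE"
                gid = _read(os.path.join(base, "gids", "0"))
            except OSError:
                continue
            yield Port(dev, num, gid.lower(), state.endswith(": ACTIVE"))


def find_port(gid_prefix: Optional[str] = None,
              port_num: Optional[int] = None,
              root: str = SYSFS_IB) -> Optional[Port]:
    """Return the first active port matching the given specifiers, or None."""
    for p in list_ports(root):
        if not p.active:
            continue
        if port_num is not None and p.port != port_num:
            continue
        if gid_prefix is not None and not p.gid.startswith(gid_prefix.lower()):
            continue
        return p
    return None
```

With no arguments, find_port() is the 'first found active port' case; passing gid_prefix or port_num narrows it. The device name only appears in the result, so the caller never has to know it in advance.
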
> Furthermore, how would you propose MPI selects a card when the job
> is launched on different nodes, chosen by a scheduler, from run to
> run?

Three things spring to mind:
 - Choose an active port at random that matches a set of subnet prefixes
 - Choose an active port at random from a globally known list of
   job-acceptable GIDs
 - Choose an active port #X that matches a set of subnet prefixes,
   from a random device

Relying on the RDMA device name for MPI wouldn't work in a heterogeneous
environment where different nodes have a different 'correct' device
name.

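The three selection policies listed above could be sketched roughly as follows. This is only an illustration over an in-memory list of ports: the Port record and the pick_* function names are hypothetical, and the subnet prefix is assumed to be the first 64 bits of the GID written as four zero-padded hex groups:

```python
import random
from dataclasses import dataclass
from typing import Collection, Optional, Sequence


@dataclass(frozen=True)
class Port:
    device: str
    port: int
    gid: str      # "xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx:xxxx"
    active: bool


def subnet_prefix(gid: str) -> str:
    """First 64 bits of the GID: four 4-digit groups plus three colons."""
    return gid.lower()[:19]


def _pick(candidates: Sequence[Port],
          rng: Optional[random.Random]) -> Optional[Port]:
    if rng is None:
        rng = random.Random()
    return rng.choice(candidates) if candidates else None


def pick_by_subnet(ports: Sequence[Port], prefixes: Collection[str],
                   rng: Optional[random.Random] = None) -> Optional[Port]:
    """Random active port whose subnet prefix is in the allowed set."""
    wanted = {p.lower() for p in prefixes}
    return _pick([p for p in ports
                  if p.active and subnet_prefix(p.gid) in wanted], rng)


def pick_by_gid_list(ports: Sequence[Port], gids: Collection[str],
                     rng: Optional[random.Random] = None) -> Optional[Port]:
    """Random active port whose full GID is on the job-acceptable list."""
    wanted = {g.lower() for g in gids}
    return _pick([p for p in ports if p.active and p.gid.lower() in wanted], rng)


def pick_by_port_number(ports: Sequence[Port], prefixes: Collection[str],
                        port_num: int,
                        rng: Optional[random.Random] = None) -> Optional[Port]:
    """Active port #port_num on a random device, restricted by subnet prefix.

    Each device has at most one port with a given number, so choosing among
    the matching ports is the same as choosing a random device.
    """
    wanted = {p.lower() for p in prefixes}
    return _pick([p for p in ports
                  if p.active and p.port == port_num
                  and subnet_prefix(p.gid) in wanted], rng)
```

None of the three policies takes a device name as input, so the same job description works on nodes whose devices are named differently.
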
> Unfortunately, I don't know the details of the "serious problem" you
> describe above but it seems that a udev solution to renaming things
> would be worth trying to solve from a user's perspective.

Renaming the device changes the name of all the sysfs directories, so
any 'handle' is invalidated at rename time. All udev can do for you is
trigger the rename call; you still have to somewhat sanely enable
renaming in the kernel.

And it is not entirely theoretical: with threaded startup and things
like systemd, it is very plausible that the rename could hit in the
middle of something like opensm starting up and create a flaky
system, particularly since we'd want to see opensm's startup triggered
by directory creation in sysfs.

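To make the invalidation concrete, here is a small simulation. It uses an ordinary temporary directory laid out like a sysfs tree, not the kernel itself; the device names, the GID value and the helper names are all made up for illustration. A path cached before the rename stops resolving afterwards, while a lookup keyed on the port GID still finds the device under its new name:

```python
import os
import tempfile
from typing import Optional, Tuple


def make_fake_device(root: str, name: str, gid: str) -> None:
    """Create <root>/<name>/ports/1/gids/0 containing the given GID."""
    gids = os.path.join(root, name, "ports", "1", "gids")
    os.makedirs(gids)
    with open(os.path.join(gids, "0"), "w") as f:
        f.write(gid + "\n")


def find_device_by_gid(root: str, gid: str) -> Optional[str]:
    """Scan the tree and return the current name of the device owning gid."""
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name, "ports", "1", "gids", "0")
        try:
            with open(path) as f:
                if f.read().strip() == gid:
                    return name
        except OSError:
            continue
    return None


def demo() -> Tuple[bool, bool, Optional[str]]:
    gid = "fe80:0000:0000:0000:0002:c903:0000:0001"  # made-up example GID
    with tempfile.TemporaryDirectory() as root:
        make_fake_device(root, "mlx4_0", gid)
        # A 'handle' in the sense used above: a path built from the name.
        cached = os.path.join(root, "mlx4_0", "ports", "1", "gids", "0")
        valid_before = os.path.exists(cached)
        # Stand-in for the rename that a udev rule would trigger.
        os.rename(os.path.join(root, "mlx4_0"), os.path.join(root, "ib_slot3"))
        valid_after = os.path.exists(cached)
        found = find_device_by_gid(root, gid)
    return valid_before, valid_after, found


if __name__ == "__main__":
    print(demo())
```

A program that built the cached path just before the rename and used it just after is in the position described above for opensm at startup.
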
I think it would be great to be able to do the rename, but to do it
properly things need to be more like the net stack: netlink for
configuration/discovery, and a stable ifindex for use in all netlink
messages.

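For comparison, this is roughly what the net stack's stable handle looks like from user space. The sketch uses Python's standard socket module (if_nameindex and if_indextoname); the helper names are hypothetical. A program records the ifindex once and resolves the current name only when it needs one:

```python
import socket
from typing import Dict, Optional


def snapshot_interfaces() -> Dict[int, str]:
    """Map ifindex -> interface name as of the time of the call."""
    return {index: name for index, name in socket.if_nameindex()}


def current_name(ifindex: int) -> Optional[str]:
    """Resolve an ifindex to the interface's present name, if it exists."""
    try:
        return socket.if_indextoname(ifindex)
    except OSError:
        return None


if __name__ == "__main__":
    for index, name in sorted(snapshot_interfaces().items()):
        # The index is the handle worth storing; the name is looked up late.
        print(index, name, current_name(index))
```

An equivalent index for RDMA devices is what the paragraph above asks for; nothing in this sketch applies to RDMA devices as they stand.
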
Jason