public inbox for linux-rdma@vger.kernel.org
* Node Description mismatch between saquery & smpquery
@ 2013-06-17 21:38 Albert Chu
       [not found] ` <1371505093.19017.76.camel-akkeaxHeDKRliZ7u+bvwcg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Albert Chu @ 2013-06-17 21:38 UTC (permalink / raw)
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA

We've recently noticed that the Node Description for a node can
mismatch between the output of smpquery and saquery.  For example:

# smpquery NodeDesc 427
Node Description:.................sierra1932 qib0

# saquery NodeRecord 427 | grep NodeDesc
                NodeDescription.........QLogic Infiniband HCA

A restart of OpenSM is the current solution to resolve this.
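
A minimal sketch of how the mismatch shown above could be detected
automatically. It only parses text shaped like the two output lines quoted
in this message; the function names are hypothetical, and real tool output
may differ between versions.

```python
import re

# Both tools pad the field label with dots before the value, e.g.
#   "Node Description:.................sierra1932 qib0"   (smpquery)
#   "NodeDescription.........QLogic Infiniband HCA"       (saquery)
_DESC_RE = re.compile(r"Node\s?Description:?\.+(?P<desc>.*)$")


def parse_node_desc(output):
    """Return the node description found in the tool output, or None."""
    for line in output.splitlines():
        match = _DESC_RE.search(line.strip())
        if match:
            return match.group("desc").strip()
    return None


def descriptions_match(smp_output, sa_output):
    """Compare the port's own description with the SA's cached copy."""
    smp_desc = parse_node_desc(smp_output)
    sa_desc = parse_node_desc(sa_output)
    if smp_desc is None or sa_desc is None:
        raise ValueError("no node description found in tool output")
    return smp_desc == sa_desc
```

Feeding it the captured output of "smpquery NodeDesc <lid>" and
"saquery NodeRecord <lid>" for each LID would flag nodes such as 427 above.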

We've noticed it occurring more often on our larger clusters than our
smaller clusters, leading to a speculation about why it is happening.

The speculation is that when a node comes up, there is a window of time in
which the HCA is up and can be scanned by OpenSM, but does not yet have its
node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
During this window, OpenSM reads/stores the non-desired node descriptor
(in the above case the non-desired "QLogic Infiniband HCA").

When the node descriptor is changed, a trap should be sent to opensm
indicating the change.  Normally OpenSM gets the trap and reads the new
node descriptor.

On our large clusters all nodes are typically brought up at the same
time, so there are probably a ton of node descriptor change traps
happening at the exact same time.  We speculate a number of these are
dropped/lost, and subsequently OpenSM never realizes that the node
descriptor has changed.
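
A toy model of this speculation, not a description of how OpenSM or the MAD
layer actually queues traps: it assumes a bounded receive queue that is
drained at a fixed rate, and that a trap arriving at a full queue is lost
for good. The queue size and drain rate are hypothetical parameters.

```python
def count_lost_traps(arrivals_per_tick, queue_capacity, drained_per_tick):
    """Count traps lost when bursts hit a bounded receive queue.

    arrivals_per_tick: number of traps arriving in each time step.
    queue_capacity:    how many traps the receiver can hold at once.
    drained_per_tick:  how many queued traps are processed per time step.
    """
    queued = 0
    lost = 0
    for arriving in arrivals_per_tick:
        space = queue_capacity - queued
        accepted = min(arriving, space)
        lost += arriving - accepted          # overflow is never retried here
        queued += accepted
        queued -= min(queued, drained_per_tick)
    return lost
```

With a hypothetical capacity of 512 and a drain rate of 100 per tick, 1000
traps arriving in one tick lose 488, while the same 1000 traps spread over
ten ticks lose none. That would be consistent with the problem showing up
mostly when a large cluster boots all at once.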

I don't know if the speculation sounds reasonable or not.  Regardless,
we're not sure of the best fix.

A trivial fix would be to just make OpenSM re-scan the node descriptor
of an HCA, perhaps during a heavy sweep.  But I don't know if this is
optimal.  It'll introduce more MADs on the wire.  However if the present
solution is to restart OpenSM, we figure this can't be any worse.
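
A sketch of what such a re-scan amounts to, under the assumption that the
SM keeps a per-LID cache of descriptions. This is not OpenSM code; the
function and its query_desc callback are hypothetical stand-ins for one
NodeDescription query per node.

```python
def rescan_descriptions(cached, query_desc):
    """Re-read every node's description and report which entries were stale.

    cached:     dict mapping LID -> description currently held by the SM.
    query_desc: callable taking a LID and returning the node's current
                description (one query on the wire per call).
    Returns (refreshed cache, list of LIDs that changed, number of queries).
    """
    refreshed = {}
    changed = []
    queries = 0
    for lid, old_desc in cached.items():
        new_desc = query_desc(lid)
        queries += 1                      # cost grows linearly with node count
        refreshed[lid] = new_desc
        if new_desc != old_desc:
            changed.append(lid)
    return refreshed, changed, queries
```

The query count always equals the number of cached nodes, which is the
extra per-sweep cost being weighed against an OpenSM restart.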

Just wondering what people's thoughts are, or if there's another obvious
solution we're not seeing.

Al

-- 
Albert Chu
chu11-i2BcT+NCU+M@public.gmane.org
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory


--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* RE: Node Description mismatch between saquery & smpquery
       [not found] ` <1371505093.19017.76.camel-akkeaxHeDKRliZ7u+bvwcg@public.gmane.org>
@ 2013-06-17 22:00   ` Weiny, Ira
       [not found]     ` <2807E5FD2F6FDA4886F6618EAC48510E020A19F2-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
  2013-06-18 11:13   ` Hal Rosenstock
  1 sibling, 1 reply; 6+ messages in thread
From: Weiny, Ira @ 2013-06-17 22:00 UTC (permalink / raw)
  To: Albert Chu, linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

Does running "update_desc" in the console fix this?

Ira


* RE: Node Description mismatch between saquery & smpquery
       [not found]     ` <2807E5FD2F6FDA4886F6618EAC48510E020A19F2-8k97q/ur5Z2krb+BlOpmy7fspsVTdybXVpNB7YpNyf8@public.gmane.org>
@ 2013-06-17 23:58       ` Albert Chu
  0 siblings, 0 replies; 6+ messages in thread
From: Albert Chu @ 2013-06-17 23:58 UTC (permalink / raw)
  To: Weiny, Ira; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

On Mon, 2013-06-17 at 22:00 +0000, Weiny, Ira wrote:
> Does running "update_desc" in the console fix this?

This worked as a short term solution.  But we're still thinking about a
longer term one that requires less interaction.

Al

-- 
Albert Chu
chu11-i2BcT+NCU+M@public.gmane.org
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



* Re: Node Description mismatch between saquery & smpquery
       [not found] ` <1371505093.19017.76.camel-akkeaxHeDKRliZ7u+bvwcg@public.gmane.org>
  2013-06-17 22:00   ` Weiny, Ira
@ 2013-06-18 11:13   ` Hal Rosenstock
       [not found]     ` <51C040C7.9070109-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
  1 sibling, 1 reply; 6+ messages in thread
From: Hal Rosenstock @ 2013-06-18 11:13 UTC (permalink / raw)
  To: Albert Chu; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 6/17/2013 5:38 PM, Albert Chu wrote:
> We've recently noticed that the Node Description for a node can
> mismatch between the output of smpquery and saquery.  For example:
> 
> # smpquery NodeDesc 427
> Node Description:.................sierra1932 qib0
> 
> # saquery NodeRecord 427 | grep NodeDesc
>                 NodeDescription.........QLogic Infiniband HCA
> 
> A restart of OpenSM is the current solution to resolve this.
> 
> We've noticed it occurring more often on our larger clusters than our
> smaller clusters, leading to a speculation about why it is happening.
> 
> The speculation is that when a node comes up, there is a window of time in
> which the HCA is up and can be scanned by OpenSM, but does not yet have its
> node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
> During this window, OpenSM reads/stores the non-desired node descriptor
> (in the above case the non-desired "QLogic Infiniband HCA").
> 
> When the node descriptor is changed, a trap should be sent to opensm
> indicating the change.  Normally OpenSM gets the trap and reads the new
> node descriptor.

Are you sure the trap is being issued by those devices when the
NodeDescription is changed locally?

Also, if so, do these devices implement timeout/retry on sending the
trap (e.g. trying to make sure that they receive a TrapRepress before
giving up on the trap)?

> On our large clusters all nodes are typically brought up at the same
> time, so there are probably a ton of node descriptor change traps
> happening at the exact same time.  We speculate a number of these are
> dropped/lost, and subsequently OpenSM never realizes that the node
> descriptor has changed.

Do you see any evidence that traps are being dropped? Have you
correlated any VL15Dropped counters in the subnet with this? Also,
there is a module parameter in the MAD kernel module that might help with
any unsolicited MAD bursts. You might try increasing that on your SM
node(s).

> I don't know if the speculation sounds reasonable or not.  Regardless,
> we're not sure of the best fix.
> 
> A trivial fix would be to just make OpenSM re-scan the node descriptor
> of an HCA, perhaps during a heavy sweep.  But I don't know if this is
> optimal.  It'll introduce more MADs on the wire.  However if the present
> solution is to restart OpenSM, we figure this can't be any worse.

Yes, but adding the additional queries there is O(n) and has been
resisted in the past.

> Just wondering what people's thoughts are, or if there's another obvious
> solution we're not seeing.

I think this issue needs better understanding first.

-- Hal



* Re: Node Description mismatch between saquery & smpquery
       [not found]     ` <51C040C7.9070109-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
@ 2013-06-18 18:14       ` Albert Chu
       [not found]         ` <1371579281.19017.86.camel-akkeaxHeDKRliZ7u+bvwcg@public.gmane.org>
  0 siblings, 1 reply; 6+ messages in thread
From: Albert Chu @ 2013-06-18 18:14 UTC (permalink / raw)
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On Tue, 2013-06-18 at 07:13 -0400, Hal Rosenstock wrote:
> On 6/17/2013 5:38 PM, Albert Chu wrote:
> > We've recently noticed that the Node Description for a node can
> > mismatch between the output of smpquery and saquery.  For example:
> > 
> > # smpquery NodeDesc 427
> > Node Description:.................sierra1932 qib0
> > 
> > # saquery NodeRecord 427 | grep NodeDesc
> >                 NodeDescription.........QLogic Infiniband HCA
> > 
> > A restart of OpenSM is the current solution to resolve this.
> > 
> > We've noticed it occurring more often on our larger clusters than our
> > smaller clusters, leading to a speculation about why it is happening.
> > 
> > The speculation is that when a node comes up, there is a window of time in
> > which the HCA is up and can be scanned by OpenSM, but does not yet have its
> > node descriptor set (in RHEL it appears to be set via /etc/init.d/rdma).
> > During this window, OpenSM reads/stores the non-desired node descriptor
> > (in the above case the non-desired "QLogic Infiniband HCA").
> > 
> > When the node descriptor is changed, a trap should be sent to opensm
> > indicating the change.  Normally OpenSM gets the trap and reads the new
> > node descriptor.
> 
> Are you sure the trap is being issued by those devices when the
> NodeDescription is changed locally?

These particular devices do support the trap and tests show they do send
traps on changes (i.e. manually
changing /sys/class/infiniband/qib0/node_desc).
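
A minimal sketch of that kind of manual change, assuming the description
follows the "<hostname> <device>" pattern seen in the smpquery output
earlier in the thread. The helper names are hypothetical; the 64-byte limit
is the size of the NodeDescription field. Writing to the real sysfs file
needs root.

```python
import os

NODE_DESC_MAX_BYTES = 64  # NodeDescription is a 64-byte field


def format_node_desc(hostname, device):
    """Build "<hostname> <device>", truncated to fit the 64-byte field."""
    desc = "{} {}".format(hostname, device)
    return desc.encode("utf-8")[:NODE_DESC_MAX_BYTES].decode("utf-8", "ignore")


def write_node_desc(hostname, device, sysfs_root="/sys/class/infiniband"):
    """Write the description to <sysfs_root>/<device>/node_desc."""
    path = os.path.join(sysfs_root, device, "node_desc")
    desc = format_node_desc(hostname, device)
    with open(path, "w") as handle:
        # No trailing newline, so it cannot become part of the description.
        handle.write(desc)
    return desc
```

Calling write_node_desc("sierra1932", "qib0") on the node would set the
description shown by smpquery above; the sysfs_root argument exists only so
the function can be exercised against a scratch directory.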

> Also, if so, do these devices implement timeout/retry on sending the
> trap (e.g. trying to make sure that they receive a TrapRepress before
> giving up on the trap)?

This I don't know.  I've been trying to figure out whether they do and, if
they do, how it might be configurable.  Is there a way to figure this
out?

> > On our large clusters all nodes are typically brought up at the same
> > time, so there are probably a ton of node descriptor change traps
> > happening at the exact same time.  We speculate a number of these are
> > dropped/lost, and subsequently OpenSM never realizes that the node
> > descriptor has changed.
> 
> Do you see any evidence that traps are being dropped? Have you
> correlated any VL15Dropped counters in the subnet with this? Also,
> there is a module parameter in the MAD kernel module that might help with
> any unsolicited MAD bursts. You might try increasing that on your SM
> node(s).

On our largest clusters we always see a nice chunk of VL15 drops,
however we haven't correlated them specifically to a trap.
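
One way to start that correlation is to sample the counter on the SM port
before and after a cluster boot and look at the difference. The sketch
below assumes the counter dump contains a dotted line of the form
"VL15Dropped:.....N"; that layout and the function names are assumptions,
and a counter that was reset or saturated between samples would make the
delta meaningless.

```python
import re

# Matches a counter line such as "VL15Dropped:.....................3".
_VL15_RE = re.compile(r"^VL15Dropped:?\.*\s*(\d+)\s*$", re.MULTILINE)


def parse_vl15_dropped(counter_text):
    """Extract the VL15Dropped value from a port counter dump."""
    match = _VL15_RE.search(counter_text)
    if match is None:
        raise ValueError("VL15Dropped counter not found")
    return int(match.group(1))


def drops_during(before_text, after_text):
    """Number of VL15 drops recorded between two counter samples."""
    return parse_vl15_dropped(after_text) - parse_vl15_dropped(before_text)
```

A non-zero delta across the boot window would not prove that traps were
among the dropped packets, but a zero delta would argue against the
dropped-trap theory.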

> > I don't know if the speculation sounds reasonable or not.  Regardless,
> > we're not sure of the best fix.
> > 
> > A trivial fix would be to just make OpenSM re-scan the node descriptor
> > of an HCA, perhaps during a heavy sweep.  But I don't know if this is
> > optimal.  It'll introduce more MADs on the wire.  However if the present
> > solution is to restart OpenSM, we figure this can't be any worse.
> 
> Yes, but adding the additional queries there is O(n) and has been
> resisted in the past.
> 
> > Just wondering what people's thoughts are, or if there's another obvious
> > solution we're not seeing.
> 
> I think this issue needs better understanding first.

Yeah, just looking for hints/pointers for the time being.

Thanks,

Al

-- 
Albert Chu
chu11-i2BcT+NCU+M@public.gmane.org
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory



* RE: Node Description mismatch between saquery & smpquery
       [not found]         ` <1371579281.19017.86.camel-akkeaxHeDKRliZ7u+bvwcg@public.gmane.org>
@ 2013-06-18 22:07           ` Weiny, Ira
  0 siblings, 0 replies; 6+ messages in thread
From: Weiny, Ira @ 2013-06-18 22:07 UTC (permalink / raw)
  To: Albert Chu, Hal Rosenstock
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org

> -----Original Message-----
> From: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org [mailto:linux-rdma-
> Subject: Re: Node Description mismatch between saquery & smpquery
> 
> On Tue, 2013-06-18 at 07:13 -0400, Hal Rosenstock wrote:
> > On 6/17/2013 5:38 PM, Albert Chu wrote:
> > > We've recently noticed that the Node Description for a node can
> > > mismatch between the output of smpquery and saquery.  For example:
> > >
> > > # smpquery NodeDesc 427
> > > Node Description:.................sierra1932 qib0
> > >
> > > # saquery NodeRecord 427 | grep NodeDesc
> > >                 NodeDescription.........QLogic Infiniband HCA
> > >
> > > A restart of OpenSM is the current solution to resolve this.

[snip]

> > >
> > > When the node descriptor is changed, a trap should be sent to opensm
> > > indicating the change.  Normally OpenSM gets the trap and reads the
> > > new node descriptor.
> >
> > Are you sure the trap is being issued by those devices when the
> > NodeDescription is changed locally?
> 
> These particular devices do support the trap and tests show they do send
> traps on changes (i.e. manually changing
> /sys/class/infiniband/qib0/node_desc).
> 
> > Also, if so, do these devices implement timeout/retry on sending the
> > trap (e.g. trying to make sure that they receive a TrapRepress before
> > giving up on the trap)?
> 
> This I don't know.  I've been trying to figure out whether they do and, if
> they do, how it might be configurable.  Is there a way to figure this out?
> 

Looking quickly at the driver I don't think it does resend the trap.  However, Mike might know better: CC'ed.

Ira
