public inbox for linux-rdma@vger.kernel.org
* ibnetdiscover issue with multiported CA (or router) with multiple ports on same subnet
@ 2010-08-25 17:27 Hal Rosenstock
       [not found] ` <AANLkTi=54yw-YcpXNvWmsCFybZhTZX6Hp6=KW6Gz9KLH-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Hal Rosenstock @ 2010-08-25 17:27 UTC
  To: Sasha Khapyorsky; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Sasha,

I'm seeing an issue with ibnetdiscover from a CA port where it appears
to extend a path at a "remote" CA port (it's actually another port on
the same CA) to query NodeInfo of the next hop beyond it. I get the
following error message:

src/query_smp.c:188; umad (DR path slid 0; dlid 0; 0,1,20,2 Attr 0x11:0) bad status 110; Connection timed out

where smpquery -D nodeinfo of 0,1,20 is a CA which can also be seen
from the topology.
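
For readers less familiar with the notation: a directed-route (DR) path is a list of exit ports, starting with 0 for the local node. So 0,1,20 means "leave the local CA via port 1, then leave the switch via port 20", and the failing path 0,1,20,2 asks the CA reached there to forward the NodeInfo query (Attr 0x11) out of its port 2, which a CA will not do. The following is only a minimal sketch of that path extension; `struct dr_path`, `dr_path_extend`, `dr_path_str` and `DR_MAX_HOPS` are hypothetical names for illustration and are not the libibmad/libibnetdisc types or functions.

```c
#include <stdint.h>
#include <stdio.h>

#define DR_MAX_HOPS 63	/* hypothetical bound chosen for this sketch */

/* Hypothetical stand-in for a directed-route path (not ib_portid_t). */
struct dr_path {
	int hops;			/* number of hops after the leading 0 */
	uint8_t port[DR_MAX_HOPS + 1];	/* port[0] is always 0 (local node) */
};

/* Append one exit port; returns the new hop count, or -1 if the path is full. */
static int dr_path_extend(struct dr_path *p, uint8_t out_port)
{
	if (p->hops >= DR_MAX_HOPS)
		return -1;
	p->port[++p->hops] = out_port;
	return p->hops;
}

/* Format the path as a comma-separated list such as "0,1,20"; returns buf. */
static const char *dr_path_str(const struct dr_path *p, char *buf, size_t len)
{
	size_t off = 0;
	int i;

	if (len == 0)
		return buf;
	buf[0] = '\0';
	for (i = 0; i <= p->hops && off < len; i++) {
		int n = snprintf(buf + off, len - off, i ? ",%u" : "%u",
				 (unsigned)p->port[i]);
		if (n < 0)
			break;
		off += (size_t)n;
	}
	return buf;
}
```

Starting from an empty path and extending by 1, 20 and then 2 reproduces the path shown in the error message.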

It appears to stem from the following code snippet from
libibnetdisc/src/ibnetdisc.c:recv_port_info

        if (port_num && mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F)
            == IB_PORT_PHYS_STATE_LINKUP
            && ((node->type == IB_NODE_SWITCH && port_num != local_port) ||
                (node == fabric->from_node && port_num == local_port))) {
                ib_portid_t path = smp->path;
                if (extend_dpath(engine, &path, port_num) > 0)
                        query_node_info(engine, &path, node);
        }

that was introduced by:
commit fcb8d5e7588e38508a8e354c37009d73c0a3889f
Author: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
Date:   Sat Apr 10 02:43:24 2010 +0300

    libibnetdisc: no backward NodeInfo queries

    Then switch is reached via port N we don't need to query back via this
    port - source node is discovered already. Finally this saves some amount
    of unnecessary MADs.

    Signed-off-by: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>

and subsequently modified by:
commit 49d149c63a44d99259f516a15af53d8cf3f0e7c9
Author: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
Date:   Tue Apr 13 19:54:45 2010 +0300

    libibnetdisc: don't try to cross discovery over CA

    When discovery is running from CA node it shouldn't try to cross over
    all ports, but only via local one (send over non-local ports will fail
    since CA doesn't route MADs).

    Signed-off-by: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>

due to the (node == fabric->from_node && port_num == local_port)
clause being TRUE.

ibnetdiscover
src/query_smp.c:188; umad (DR path slid 0; dlid 0; 0,1,20,2 Attr 0x11:0) bad status 110; Connection timed out
#
# Topology file: generated on Wed Aug 25 18:52:16 2010
#
# Initiated from node 0002c9020020ee0c port 0002c9020020ee0d

vendid=0x2c9
devid=0xb924
sysimgguid=0xb8cffff00438b
switchguid=0xb8cffff00438b(b8cffff00438b)
Switch  24 "S-000b8cffff00438b"         # "MT47396 Infiniscale-III Mellanox Technologies" base port 0 lid 4 lmc 0
[5]     "H-0002c903000010e0"[1](2c903000010e1)          # "sw124 HCA-1" lid 5 4xDDR
[6]     "H-0002c9030000d1c8"[1](2c9030000d1c9)          # "sw123 HCA-1" lid 0 4xDDR
[7]     "H-0002c9020020ee0c"[1](2c9020020ee0d)          # "sw075 HCA-1" lid 2 4xDDR
[20]    "H-0002c9020020ee0c"[2](2c9020020ee0e)          # "sw075 HCA-1" lid 3 4xDDR

...

vendid=0x2c9
devid=0x6278
sysimgguid=0x2c9020020ee0f
caguid=0x2c9020020ee0c
Ca      2 "H-0002c9020020ee0c"          # "sw075 HCA-1"
[1](2c9020020ee0d)      "S-000b8cffff00438b"[7]         # lid 2 lmc 0 "MT47396 Infiniscale-III Mellanox Technologies" lid 4 4xDDR
[2](2c9020020ee0e)      "S-000b8cffff00438b"[20]                # lid 3 lmc 0 "MT47396 Infiniscale-III Mellanox Technologies" lid 4 4xDDR


smpquery -D nodeinfo 0,1,20
# Node info: DR path slid 65535; dlid 65535; 0,1,20
BaseVers:........................1
ClassVers:.......................1
NodeType:........................Channel Adapter
NumPorts:........................2
SystemGuid:......................0x0002c9020020ee0f
Guid:............................0x0002c9020020ee0c
PortGuid:........................0x0002c9020020ee0e
PartCap:.........................64
DevId:...........................0x6278
Revision:........................0x000000a0
LocalPort:.......................2
VendorId:........................0x0002c9

I don't think the local port part of the test above (node ==
fabric->from_node && port_num == local_port)  is correct where:

        local_port = (uint8_t) mad_get_field(port_info, 0,
                                             IB_PORT_LOCAL_PORT_F);

Instead, shouldn't port_num be checked against the local port that
initiated the ibnetdiscover run (which in this case is port 1)? If so, a
"from_portnum" could be saved in the fabric structure and used
for this check. Do you concur with this approach?
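
A rough sketch of what the proposed check might look like follows. It is not the actual patch: the structs are hypothetical stand-ins for the libibnetdisc ones, `from_portnum` is the proposed (not yet existing) field, and the PortPhysicalState test is reduced to a `link_up` flag.

```c
#include <stdint.h>

/* Hypothetical minimal types for illustration; not the libibnetdisc structs. */
enum node_type { NODE_CA = 1, NODE_SWITCH = 2 };

struct node {
	enum node_type type;
};

struct fabric {
	struct node *from_node;	/* node the discovery was started from */
	uint8_t from_portnum;	/* proposed: port the discovery was started from */
};

/*
 * Decide whether the DR path should be extended through port_num.
 * Same shape as the quoted condition, except that the originating node
 * is compared against the saved originating port rather than against
 * the local_port reported in the PortInfo response.
 */
static int should_extend_proposed(const struct fabric *fabric,
				  const struct node *node,
				  uint8_t port_num, uint8_t local_port,
				  int link_up)
{
	if (!port_num || !link_up)
		return 0;
	return (node->type == NODE_SWITCH && port_num != local_port) ||
	       (node == fabric->from_node &&
		port_num == fabric->from_portnum);
}
```

With from_portnum set to 1, port 1 of the originating CA is still followed, while port 2 of the same CA, reached via 0,1,20 and reporting local_port 2, is not, so 0,1,20,2 would never be attempted.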

-- Hal
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: ibnetdiscover issue with multiported CA (or router) with multiple ports on same subnet
       [not found] ` <AANLkTi=54yw-YcpXNvWmsCFybZhTZX6Hp6=KW6Gz9KLH-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-09-01 13:43   ` Sasha Khapyorsky
  2010-09-01 13:47     ` Hal Rosenstock
  0 siblings, 1 reply; 4+ messages in thread
From: Sasha Khapyorsky @ 2010-09-01 13:43 UTC
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Hal,

On 13:27 Wed 25 Aug     , Hal Rosenstock wrote:
> 
> I'm seeing an issue with ibnetdiscover from a CA port where it appears
> to extend a path at a "remote" CA port (it's actually another port on
> the same CA) to query NodeInfo of the next hop beyond it. I get the
> following error message:
> 
> src/query_smp.c:188; umad (DR path slid 0; dlid 0; 0,1,20,2 Attr
> 0x11:0) bad status 110; Connection timed out
> 
> where smpquery -D nodeinfo of 0,1,20 is a CA which can also be seen
> from the topology.
> 
> It appears to stem from the following code snippet from
> libibnetdisc/src/ibnetdisc.c:recv_port_info
> 
>         if (port_num && mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F)
>             == IB_PORT_PHYS_STATE_LINKUP
>             && ((node->type == IB_NODE_SWITCH && port_num != local_port) ||
>                 (node == fabric->from_node && port_num == local_port))) {
>                 ib_portid_t path = smp->path;
>                 if (extend_dpath(engine, &path, port_num) > 0)
>                         query_node_info(engine, &path, node);
>         }

This makes sense to me.

> 
> that was introduced by:
> commit fcb8d5e7588e38508a8e354c37009d73c0a3889f
> Author: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
> Date:   Sat Apr 10 02:43:24 2010 +0300
> 
>     libibnetdisc: no backward NodeInfo queries
> 
>     Then switch is reached via port N we don't need to query back via this
>     port - source node is discovered already. Finally this saves some amount
>     of unnecessary MADs.
> 
>     Signed-off-by: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
> 
> and subsequently modified by:
> commit 49d149c63a44d99259f516a15af53d8cf3f0e7c9
> Author: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
> Date:   Tue Apr 13 19:54:45 2010 +0300
> 
>     libibnetdisc: don't try to cross discovery over CA
> 
>     When discovery is running from CA node it shouldn't try to cross over
>     all ports, but only via local one (send over non-local ports will fail
>     since CA doesn't route MADs).
> 
>     Signed-off-by: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
> 
> due to the (node == fabric->from_node && port_num == local_port)
> clause being TRUE.

But I don't see how those patches are actually related to the story. An
original (before patches) condition was:

	if (port_num && mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F)
	    == IB_PORT_PHYS_STATE_LINKUP
	    && (node->type == IB_NODE_SWITCH || node == fabric->from_node))

, which has the described bug as far as I can understand.
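
To make the comparison concrete, here is a small sketch of both conditions as plain predicates. The types and function names are hypothetical, the PortPhysicalState test is reduced to a `link_up` flag, and it assumes that the second port of the originating CA, reached via 0,1,20, is matched to fabric->from_node and reports local_port 2.

```c
#include <stdint.h>

/* Hypothetical minimal types for illustration; not the libibnetdisc structs. */
enum sketch_node_type { SKETCH_CA = 1, SKETCH_SWITCH = 2 };

struct sketch_node {
	enum sketch_node_type type;
};

/* Condition before both commits: any LinkUp port of a switch or of from_node. */
static int extend_before_patches(const struct sketch_node *node,
				 const struct sketch_node *from_node,
				 uint8_t port_num, int link_up)
{
	return port_num && link_up &&
	       (node->type == SKETCH_SWITCH || node == from_node);
}

/* Condition after both commits, as quoted at the top of the thread. */
static int extend_after_patches(const struct sketch_node *node,
				const struct sketch_node *from_node,
				uint8_t port_num, uint8_t local_port,
				int link_up)
{
	return port_num && link_up &&
	       ((node->type == SKETCH_SWITCH && port_num != local_port) ||
		(node == from_node && port_num == local_port));
}
```

Under that assumption both predicates are true for port_num 2 on the originating CA, so either version would try to extend the path to 0,1,20,2.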

Sasha

* Re: ibnetdiscover issue with multiported CA (or router) with multiple ports on same subnet
  2010-09-01 13:43   ` Sasha Khapyorsky
@ 2010-09-01 13:47     ` Hal Rosenstock
       [not found]       ` <AANLkTi=RuZukS=kNkZVxax3CP8oZROqYxe71GQfuV2Rx-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 4+ messages in thread
From: Hal Rosenstock @ 2010-09-01 13:47 UTC
  To: Sasha Khapyorsky; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

Hi Sasha,

On Wed, Sep 1, 2010 at 9:43 AM, Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org> wrote:
> Hi Hal,
>
> On 13:27 Wed 25 Aug     , Hal Rosenstock wrote:
>>
>> I'm seeing an issue with ibnetdiscover from a CA port where it appears
>> to extend a path at a "remote" CA port (it's actually another port on
>> the same CA) to query NodeInfo of the next hop beyond it. I get the
>> following error message:
>>
>> src/query_smp.c:188; umad (DR path slid 0; dlid 0; 0,1,20,2 Attr
>> 0x11:0) bad status 110; Connection timed out
>>
>> where smpquery -D nodeinfo of 0,1,20 is a CA which can also be seen
>> from the topology.
>>
>> It appears to stem from the following code snippet from
>> libibnetdisc/src/ibnetdisc.c:recv_port_info
>>
>>         if (port_num && mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F)
>>             == IB_PORT_PHYS_STATE_LINKUP
>>             && ((node->type == IB_NODE_SWITCH && port_num != local_port) ||
>>                 (node == fabric->from_node && port_num == local_port))) {
>>                 ib_portid_t path = smp->path;
>>                 if (extend_dpath(engine, &path, port_num) > 0)
>>                         query_node_info(engine, &path, node);
>>         }
>
> This makes sense to me.
>
>>
>> that was introduced by:
>> commit fcb8d5e7588e38508a8e354c37009d73c0a3889f
>> Author: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
>> Date:   Sat Apr 10 02:43:24 2010 +0300
>>
>>     libibnetdisc: no backward NodeInfo queries
>>
>>     Then switch is reached via port N we don't need to query back via this
>>     port - source node is discovered already. Finally this saves some amount
>>     of unnecessary MADs.
>>
>>     Signed-off-by: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
>>
>> and subsequently modified by:
>> commit 49d149c63a44d99259f516a15af53d8cf3f0e7c9
>> Author: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
>> Date:   Tue Apr 13 19:54:45 2010 +0300
>>
>>     libibnetdisc: don't try to cross discovery over CA
>>
>>     When discovery is running from CA node it shouldn't try to cross over
>>     all ports, but only via local one (send over non-local ports will fail
>>     since CA doesn't route MADs).
>>
>>     Signed-off-by: Sasha Khapyorsky <sashak-smomgflXvOZWk0Htik3J/w@public.gmane.org>
>>
>> due to the (node == fabric->from_node && port_num == local_port)
>> clause being TRUE.
>
> But I don't see how those patches are actually related to the story. An
> original (before patches) condition was:
>
>        if (port_num && mad_get_field(port->info, 0, IB_PORT_PHYS_STATE_F)
>            == IB_PORT_PHYS_STATE_LINKUP
>            && (node->type == IB_NODE_SWITCH || node == fabric->from_node))
>
> , which has the described bug as far as I can understand.

I thought this used to work and those changes looked related to me.
Maybe the fix is right but that part of the problem description isn't.
Do you want a revised patch without that part of the description ?

-- Hal

>
> Sasha

* Re: ibnetdiscover issue with multiported CA (or router) with multiple ports on same subnet
       [not found]       ` <AANLkTi=RuZukS=kNkZVxax3CP8oZROqYxe71GQfuV2Rx-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2010-09-01 17:06         ` Sasha Khapyorsky
  0 siblings, 0 replies; 4+ messages in thread
From: Sasha Khapyorsky @ 2010-09-01 17:06 UTC
  To: Hal Rosenstock; +Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA

On 09:47 Wed 01 Sep     , Hal Rosenstock wrote:
> 
> I thought this used to work and those changes looked related to me.
> Maybe the fix is right but that part of the problem description isn't.
> Do you want a revised patch without that part of the description ?

No need - I applied this already. Thanks.

Sasha