public inbox for linux-rdma@vger.kernel.org
 help / color / mirror / Atom feed
From: Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
To: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
Cc: Alex Netes <alexne-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>,
	"linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org"
	<linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
Subject: Re: [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD
Date: Thu, 15 Dec 2011 17:20:21 -0500	[thread overview]
Message-ID: <4EEA72A5.8080602@dev.mellanox.co.il> (raw)
In-Reply-To: <20111215094006.8aa8f198.weiny2-i2BcT+NCU+M@public.gmane.org>

On 12/15/2011 12:40 PM, Ira Weiny wrote:
> On Thu, 15 Dec 2011 06:15:09 -0800
> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
> 
>> On 12/14/2011 10:18 PM, Ira Weiny wrote:
>>> On Wed, 14 Dec 2011 18:21:28 -0800
>>> Hal Rosenstock <hal-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org> wrote:
>>>
>>>> On 12/14/2011 12:09 AM, Ira Weiny wrote:
>>>>>
>>>>> In addition print transaction ID of all DR PATH dumps to make sure we know
>>>>> which MAD's they refer to.
>>>>>
>>>>> Signed-off-by: Ira Weiny <weiny2-i2BcT+NCU+M@public.gmane.org>
>>>>> ---
>>>>>  opensm/osm_helper.c      |    5 +++--
>>>>>  opensm/osm_sm_mad_ctrl.c |    7 +++++++
>>>>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/opensm/osm_helper.c b/opensm/osm_helper.c
>>>>> index f9f3d9d..b6591c4 100644
>>>>> --- a/opensm/osm_helper.c
>>>>> +++ b/opensm/osm_helper.c
>>>>> @@ -2059,8 +2059,9 @@ void osm_dump_smp_dr_path(IN osm_log_t * p_log, IN const ib_smp_t * p_smp,
>>>>>  		char buf[BUF_SIZE];
>>>>>  		unsigned n;
>>>>>  
>>>>> -		n = sprintf(buf, "Received SMP on a %u hop path: "
>>>>> -			    "Initial path = ", p_smp->hop_count);
>>>>> +		n = sprintf(buf, "Received SMP (TID 0x%" PRIx64 ") on a %u hop path: "
>>>>> +			    "Initial path = ",
>>>>> +			    cl_ntoh64(p_smp->trans_id), p_smp->hop_count);
>>>>>  		n += sprint_uint8_arr(buf + n, sizeof(buf) - n,
>>>>>  				      p_smp->initial_path,
>>>>>  				      p_smp->hop_count + 1);
>>>>> diff --git a/opensm/osm_sm_mad_ctrl.c b/opensm/osm_sm_mad_ctrl.c
>>>>> index ee92c66..6abf8b8 100644
>>>>> --- a/opensm/osm_sm_mad_ctrl.c
>>>>> +++ b/opensm/osm_sm_mad_ctrl.c
>>>>> @@ -721,6 +721,13 @@ static void sm_mad_ctrl_send_err_cb(IN void *context, IN osm_madw_t * p_madw)
>>>>>  		ib_get_sm_attr_str(p_smp->attr_id), cl_ntoh32(p_smp->attr_mod),
>>>>>  		cl_ntoh64(p_smp->trans_id));
>>>>>  
>>>>> +	if (p_smp->mgmt_class == IB_MCLASS_SUBN_DIR) {
>>>>> +		osm_dump_smp_dr_path(p_ctrl->p_log, p_smp, OSM_LOG_ERROR);
>>>>
>>>> Rather than here, should this be in osm_vendor_ibumad.c ? There's
>>>> already one similar log there but looks like evicted entry logging was
>>>> not done. If not, then do any logs there need to be removed as redundant ?
>>>
>>> Yes looking a bit closer I see that is redundant with the current umad_status
>>> implementation.  
>>
>> Well, not quite. That logs on any send error and osm_dump_smp_dr_path is
>> only currently called for NodeInfo. That's one reason why it's down at
>> that layer right now. I can see your v2 patch addresses this.
>>
>>> IE the message you get is:
>>>
>>> 	Dec 14 18:31:54 137584 [AEB0C700] 0x01 -> Received SMP on a 4 hop path: Initial path = 0,0,0,0,0, Return path  = 0,0,0,0,0
>>>
>>> That is useless.  I can alter the patch to remove that as well.
>>
>> Another alternative is that if it's a bug, why not just fix that ?
> 
> I was not sure if it was a bug (but I did not dig through the umad and kernel
> layers to verify that).  It looks to me like umad_status only returns TIMEOUT
> as an error.  Furthermore, I think the "response" MAD returned (and printed in
> the above logs) does not include a DR path since it is not really a response
> but just a place holder.

I thought that the sent MAD is returned in the timeout case. That would
include the DR paths in the SMP case. It seems to work only some of the
time though and I'm not sure why that is.

> Does the kernel/umad layer copy the request MAD into the "response" MAD on
> timeout?[*]

See above and below under [*]

>  (It looks to me like only some of the data is copied into the
> "response" MAD; method, attr_id, and transaction ID).  

That's certainly the observation in some but not all cases...

> If the entire request
> is to be copied then the above might be bug.  

I thought the request MAD is supposed to be returned on a send error
(including timeout). Maybe some shortcut/optimization was taken here
in some cases.

> If not then the print statement
> I removed is useless but the other one at least confirms the "fake" MAD is
> equivalent to the one printed in the error call back.

Yes, but why have the extra log message just for that ?

> If you want to keep the error printing at the vendor layer then I guess the
> appropriate place to print the error would be on the request mad like this...
> 
> 			} else {
> 				p_req_madw->status = IB_TIMEOUT;
> 				/* cb frees req_madw */
> 				pthread_mutex_lock(&p_vend->cb_mutex);
> 				pthread_cleanup_push(unlock_mutex,
> 						     &p_vend->cb_mutex);
> // dump p_req_madw here before callback
> 				(*p_bind->send_err_callback) (p_bind->
> 							      client_context,
> 							      p_req_madw);
> 				pthread_cleanup_pop(1);
> 			}

There's also the cancelled case too that calls send_err_callback.

> Or you are correct we could just remove all the logging in the vendor layer
> here.  
>
> Alex, any opinions?
> 
> Ira
> 
> [*] I don't think the request MAD should be copied except for the TID as is
> documented so you can find the request MAD.

What documentation are you referring to ?

-- Hal

>>
>>>>
>>>> Also, does this log every timeout (at error level) ? If so, that might
>>>> not be a good thing in all subnets as timeouts are common.
>>>
>>> Why would you say that?  I think it is very valid to know what nodes are
>>> timeing out.  When would you not want to know the destination of what is
>>> timing out?
>>
>> Yes, but I'm concerned about spamming the log. Timeouts are "normal" in
>> many subnets; maybe not yours.
>>
>> I was just saying level might be reduced but I can see 5411 is error
>> level too.
>>
>>>>
>>>>> +	} else {
>>>>> +		OSM_LOG(p_ctrl->p_log, OSM_LOG_ERROR, "LID %u\n",
>>>>> +			cl_ntoh16(p_madw->mad_addr.dest_lid));
>>>>
>>>> Log the TID here too ?
>>>
>>> Actually I think moving that into the error print above is better.
>>
>> Sure; that's another way of accomplishing the same thing.
>>
>> -- Hal
>>
>>> Sending V2 now,
>>> Ira
>>>
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
>> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2011-12-15 22:20 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2011-12-14  5:09 [PATCH] opensm: on MAD error call back print DR PATH or LID of request MAD Ira Weiny
     [not found] ` <20111213210918.9cf79850.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-15  2:21   ` Hal Rosenstock
     [not found]     ` <4EE959A8.3000409-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-15  3:18       ` Ira Weiny
     [not found]         ` <20111214191840.a3a80cf3.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-15 14:15           ` Hal Rosenstock
     [not found]             ` <4EEA00ED.2080609-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-15 17:40               ` Ira Weiny
     [not found]                 ` <20111215094006.8aa8f198.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-15 22:20                   ` Hal Rosenstock [this message]
     [not found]                     ` <4EEA72A5.8080602-LDSdmyG8hGV8YrgS2mwiifqBs+8SCbDb@public.gmane.org>
2011-12-15 22:54                       ` Ira Weiny
     [not found]                         ` <20111215145430.20f6adee.weiny2-i2BcT+NCU+M@public.gmane.org>
2011-12-16 13:03                           ` Hal Rosenstock

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4EEA72A5.8080602@dev.mellanox.co.il \
    --to=hal-ldsdmyg8hgv8yrgs2mwiifqbs+8scbdb@public.gmane.org \
    --cc=alexne-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org \
    --cc=linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org \
    --cc=weiny2-i2BcT+NCU+M@public.gmane.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox