Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection

From: sagi grimberg <sagig@mellanox.com>
To: Chuck Lever <chuck.lever@oracle.com>,
	Senn Klemens <klemens.senn@ims.co.at>
Cc: <linux-rdma@vger.kernel.org>,
	Linux NFS Mailing List <linux-nfs@vger.kernel.org>
Subject: Re: Kernel oops/panic with NFS over RDMA mount after disrupted Infiniband connection
Date: Sat, 29 Mar 2014 02:06:47 +0300
Message-ID: <53360087.9060902@mellanox.com>
In-Reply-To: <3FF5D87A-8199-4CE1-BF97-82DC61E4F480@oracle.com>

On 3/29/2014 1:30 AM, Chuck Lever wrote:
> On Mar 28, 2014, at 2:42 AM, Senn Klemens <klemens.senn@ims.co.at> wrote:
>
>> Hi Chuck,
>>
>> On 03/27/2014 04:59 PM, Chuck Lever wrote:
>>> Hi-
>>>
>>>
>>> On Mar 27, 2014, at 12:53 AM, Reiter Rafael <rafael.reiter@ims.co.at> wrote:
>>>
>>>> On 03/26/2014 07:15 PM, Chuck Lever wrote:
>>>>> Hi Rafael-
>>>>>
>>>>> I’ll take a look. Can you report your HCA and how you reproduce this issue?
>>>> The HCA is Mellanox Technologies MT26428.
>>>>
>>>> Reproduction:
>>>> 1) Mount a directory via NFS/RDMA
>>>> mount -t nfs -o port=20049,rdma,vers=4.0,timeo=900 172.16.100.2:/ /mnt/
>> An additional "ls /mnt" is needed here (between step 1 and 2)
>>
>>>> 2) Pull the Infiniband cable or use ibportstate to disrupt the Infiniband connection
>>>> 3) ls /mnt
>>>> 4) wait 5-30 seconds
>>> Thanks for the information.
>>>
>>> I have that HCA, but I won’t have access to my test systems for a week (traveling). So can you try this:
>>>
>>> # rpcdebug -m rpc -s trans
>>>
>>> then reproduce (starting with step 1 above). Some debugging output will appear at the tail of /var/log/messages. Copy it to this thread.
>>>
>> The output of /var/log/messages is:
>>
>> [  143.233701] RPC:  1688 xprt_rdma_allocate: size 1112 too large for
>> buffer[1024]: prog 100003 vers 4 proc 1
>> [  143.233708] RPC:  1688 xprt_rdma_allocate: size 1112, request
>> 0xffff88105894c000
>> [  143.233715] RPC:  1688 rpcrdma_inline_pullup: pad 0 destp
>> 0xffff88105894d7dc len 124 hdrlen 124
>> [  143.233718] RPC:       rpcrdma_register_frmr_external: Using frmr
>> ffff88084e589260 to map 1 segments
>> [  143.233722] RPC:  1688 rpcrdma_create_chunks: reply chunk elem
>> 652@0x105894d92c:0xced01 (last)
>> [  143.233725] RPC:  1688 rpcrdma_marshal_req: reply chunk: hdrlen 48
>> rpclen 124 padlen 0 headerp 0xffff88105894d100 base 0xffff88105894d760
>> lkey 0x8000
>> [  143.233785] RPC:       rpcrdma_event_process: event rep
>> ffff88084e589260 status 0 opcode 8 length 0
>> [  177.272397] RPC:       rpcrdma_event_process: event rep
>> (null) status C opcode FFFF8808 length 4294967295
>> [  177.272649] RPC:       rpcrdma_event_process: event rep
>> ffff880848ed0000 status 5 opcode FFFF8808 length 4294936584
> The mlx4 provider is returning a WC completion status of
> IB_WC_WR_FLUSH_ERR.
>
>> [  177.272651] RPC:       rpcrdma_event_process: WC opcode -30712 status
>> 5, connection lost
> -30712 is a bogus WC opcode. So the mlx4 provider is not filling in the
> WC opcode. rpcrdma_event_process() thus can’t depend on the contents of
> the ib_wc.opcode field when the WC completion status != IB_WC_SUCCESS.

Hey Chuck,

That is correct, the opcode field in the wc is not reliable in FLUSH errors.
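As a side note for anyone decoding the log above: the bogus opcode (-30712, printed as FFFF8808) and the odd length in the 177.272649 entry (4294936584) look like the same 32-bit pattern, which fits fields that were never filled in on the error path. A minimal check in C, using only the values printed in the log. The helper name as_signed32 is made up for illustration and is not part of any kernel API:

```c
#include <stdint.h>

/* Hypothetical helper: decode a 32-bit pattern as a two's-complement
 * signed value, which is what a "%d" format would print for it.
 * Done with arithmetic so it does not rely on implementation-defined
 * unsigned-to-signed casts. */
int64_t as_signed32(uint32_t bits)
{
    int64_t v = (int64_t)bits;
    return (bits & 0x80000000u) ? v - 4294967296LL : v;
}
```

Feeding it 0xFFFF8808 or 4294936584 yields -30712 in both cases, and the 4294967295 length from the 177.272397 entry decodes to -1.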

>
> A copy of the opcode reachable from the incoming rpcrdma_rep could be
> added, initialized in the forward paths. rpcrdma_event_process() could
> use the copy in the error case.

How about suppressing completions altogether for fast_reg and local_inv
work requests?
If one of these fails, you will get an error completion and the QP will
transition to the error state, generating FLUSH_ERR completions for all
pending WRs. In that case, you can just ignore flush errors for
fast_reg and local_inv.

see http://marc.info/?l=linux-rdma&m=139047309831997&w=2
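
The two ideas in this message can be combined in a completion handler that checks the status first and never reads the opcode on the error path. What follows is a minimal userspace sketch, not the xprtrdma code: the types mock_wc and wr_ctx, the enum names, and classify_completion are all hypothetical stand-ins. It assumes wr_id carries a pointer to a small context whose operation type was recorded when the WR was posted, which is one way to realize the "copy of the opcode" mentioned above. In real verbs code, suppressing success completions would presumably mean posting those WRs without IB_SEND_SIGNALED, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-ins for the kernel's verbs types; only what the sketch needs. */
enum wc_status { WC_SUCCESS = 0, WC_WR_FLUSH_ERR = 5 }; /* 5 as in the log */
enum posted_op { OP_SEND, OP_RECV, OP_FAST_REG, OP_LOCAL_INV };

/* Hypothetical per-WR context that wr_id points at. */
struct wr_ctx {
    enum posted_op op;  /* recorded at post time, so always trustworthy */
};

struct mock_wc {
    uint64_t wr_id;
    enum wc_status status;
    int opcode;         /* may be garbage unless status == WC_SUCCESS */
};

enum action { HANDLE_SUCCESS, IGNORE_FLUSH, CONNECTION_LOST };

enum action classify_completion(const struct mock_wc *wc)
{
    assert(wc != NULL);
    const struct wr_ctx *ctx = (const struct wr_ctx *)(uintptr_t)wc->wr_id;

    if (wc->status == WC_SUCCESS)
        return HANDLE_SUCCESS;

    /* Error path: wc->opcode is deliberately not consulted. */
    if (wc->status == WC_WR_FLUSH_ERR && ctx != NULL &&
        (ctx->op == OP_FAST_REG || ctx->op == OP_LOCAL_INV))
        return IGNORE_FLUSH;

    /* Any other error, or an error with no context, tears down the
     * connection, as the "connection lost" log line suggests. */
    return CONNECTION_LOST;
}
```

With this shape, a flushed fast_reg completion carrying opcode -30712 is ignored because the decision comes from the recorded wr_ctx.op, while a flushed receive, or an error completion with a null wr_id such as the "(null)" rep in the log, is still treated as a lost connection.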

Sagi.
