From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-nfs-owner@vger.kernel.org>
Received: from smtp.opengridcomputing.com ([72.48.136.20]:53681 "EHLO
	smtp.opengridcomputing.com" rhost-flags-OK-OK-OK-OK)
	by vger.kernel.org with ESMTP id S1752574AbcCJQVG (ORCPT
	<rfc822;linux-nfs@vger.kernel.org>); Thu, 10 Mar 2016 11:21:06 -0500
From: "Steve Wise" <swise@opengridcomputing.com>
To: "'Chuck Lever'" <chuck.lever@oracle.com>
Cc: "'Sagi Grimberg'" <sagig@dev.mellanox.co.il>, <anna.schumaker@netapp.com>,
        "'Linux RDMA Mailing List'" <linux-rdma@vger.kernel.org>,
        "'Linux NFS Mailing List'" <linux-nfs@vger.kernel.org>
References: <20160304162447.13590.9524.stgit@oracle120-ib.cthon.org> <20160304162801.13590.89343.stgit@oracle120-ib.cthon.org> <56DF1186.3030303@dev.mellanox.co.il> <8696EFBA-B7DB-42AC-AB57-C656070F4ED3@oracle.com> <56E00483.2060304@dev.mellanox.co.il> <6B59B087-9CFA-458B-8848-B08B8E14E2C7@oracle.com> <56E14BA2.2050504@dev.mellanox.co.il> <7abb01d17ade$1faf0ff0$5f0d2fd0$@opengridcomputing.com> <AC62FAB3-5569-4FA3-93AF-35CD2A1869EF@oracle.com> <7b2101d17ae1$f88597b0$e990c710$@opengridcomputing.com> <BB3E1E71-E3B0-48D2-BADE-120152BE42D3@oracle.com> <7b3901d17ae5$18fbf540$4af3dfc0$@opengridcomputing.com> <BE799F1D-970E-49F8-8C96-FFDF4E6E9A9C@oracle.com> <7b6b01d17ae7$506c6490$f1452db0$@opengridcomputing.com> <B32CA8B9-3EB7-4DC3-A945-5C9F05D5F984@oracle.com>
In-Reply-To: <B32CA8B9-3EB7-4DC3-A945-5C9F05D5F984@oracle.com>
Subject: RE: [PATCH v3 05/11] xprtrdma: Do not wait if ib_post_send() fails
Date: Thu, 10 Mar 2016 10:21:28 -0600
Message-ID: <7b6d01d17ae8$e68f7e20$b3ae7a60$@opengridcomputing.com>
MIME-Version: 1.0
Content-Type: text/plain;
	charset="us-ascii"
Sender: linux-nfs-owner@vger.kernel.org
List-ID: <linux-nfs.vger.kernel.org>

> >>>>>>>>>> Moving the QP into error state right after with rdma_disconnect
> >>>>>>>>>> you are not sure that none of the subset of the invalidations
> >>>>>>>>>> that _were_ posted completed and you get the corresponding MRs
> >>>>>>>>>> in a bogus state...
> >>>>>>>>>
> >>>>>>>>> Moving the QP to error state and then draining the CQs means
> >>>>>>>>> that all LOCAL_INV WRs that managed to get posted will get
> >>>>>>>>> completed or flushed. That's already handled today.
> >>>>>>>>>
> >>>>>>>>> It's the WRs that didn't get posted that I'm worried about
> >>>>>>>>> in this patch.
> >>>>>>>>>
> >>>>>>>>> Are there RDMA consumers in the kernel that use that third
> >>>>>>>>> argument to recover when LOCAL_INV WRs cannot be posted?
> >>>>>>>>
> >>>>>>>> None :)
> >>>>>>>>
> >>>>>>>>>>> I suppose I could reset these MRs instead (that is,
> >>>>>>>>>>> pass them to ib_dereg_mr).
> >>>>>>>>>>
> >>>>>>>>>> Or, just wait for a completion for those that were posted
> >>>>>>>>>> and then all the MRs are in a consistent state.
> >>>>>>>>>
> >>>>>>>>> When a LOCAL_INV completes with IB_WC_SUCCESS, the associated
> >>>>>>>>> MR is in a known state (ie, invalid).
> >>>>>>>>>
> >>>>>>>>> The WRs that flush mean the associated MRs are not in a known
> >>>>>>>>> state. Sometimes the MR state is different than the hardware
> >>>>>>>>> state, for example. Trying to do anything with one of these
> >>>>>>>>> inconsistent MRs results in IB_WC_MW_BIND_ERR until the thing
> >>>>>>>>> is deregistered.
> >>>>>>>>
> >>>>>>>> Correct.
> >>>>>>>>
> >>>>>>>
> >>>>>>> It is legal to invalidate an MR that is not in the valid state.
> >>>>>>> So you don't have to deregister it, you can assume it is valid
> >>>>>>> and post another LINV WR.
> >>>>>>
> >>>>>> I've tried that. Once the MR is inconsistent, even LOCAL_INV
> >>>>>> does not work.
> >>>>>>
> >>>>>
> >>>>> Maybe IB Verbs don't mandate that invalidating an invalid MR must
> >>>>> be allowed?  (looking at the verbs spec now).
> >>>>
> >>>
> >>> IB Verbs doesn't specify this requirement.  iW verbs does.  So
> >>> transport-independent applications cannot rely on it.  So
> >>> ib_dereg_mr() seems to be the only thing you can do.
> >>>
> >>>> If the MR is truly invalid, then there is no issue, and
> >>>> the second LOCAL_INV completes successfully.
> >>>>
> >>>> The problem is after a flushed LOCAL_INV, the MR state
> >>>> sometimes does not match the hardware state. The MR is
> >>>> neither registered nor invalid.
> >>>>
> >>>
> >>> There is a difference, at least with iWARP devices, between the MR
> >>> state: VALID vs INVALID, and if the MR is allocated or not.
> >>>
> >>>> A flushed LOCAL_INV tells you nothing more than that the
> >>>> LOCAL_INV didn't complete. The MR state at that point is
> >>>> unknown.
> >>>>
> >>>
> >>> With respect to iWARP and cxgb4: when you allocate a fastreg MR, HW
> >>> has an entry for that MR and it is marked "allocated".  The MR record
> >>> in HW also has a state: VALID or INVALID.  While the MR is "allocated"
> >>> you can post WRs to invalidate it which changes the state to INVALID,
> >>> or fast-register memory which makes it VALID.  Regardless of what
> >>> happens on any given QP, the MR remains "allocated" until you call
> >>> ib_dereg_mr().  So at least for cxgb4, you could in fact just post
> >>> another LINV to get it back to a known state that allows subsequent
> >>> fast-reg WRs.
> >>>
> >>> Perhaps IB devices don't work this way.
> >>>
> >>> What error did you get when you tried just doing an LINV after a flush?
> >>
> >> With CX-2 and CX-3, after a flushed LOCAL_INV, trying either
> >> a FASTREG or LOCAL_INV on that MR can sometimes complete with
> >> IB_WC_MW_BIND_ERR.
> >
> >
> > I wonder if you post a FASTREG+LINV+LINV if you'd get the same failure?  IE
> > invalidate the same rkey twice.  Just as an experiment...
> 
> Once the MR is in this state, FASTREG does not work either.
> All FASTREG and LINV flush with IB_WC_MW_BIND_ERR until
> the MR is deregistered.

Mellanox can probably tell us why. 

I was just wondering if posting a double LINV on a valid working FRMR would fail
with these devices.  But it's moot.  As you've concluded, it looks like the only
safe way to handle this is to dereg them and reallocate...
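
To make the conclusion concrete, here is a rough userspace toy model of the
MR behavior described in this thread.  It is not kernel code and not the
xprtrdma implementation: every name in it (toy_mr, toy_linv, and so on) is
made up for illustration, and it pessimistically assumes a flushed LOCAL_INV
always leaves the MR unusable, whereas the CX-2/CX-3 reports above say this
only happens sometimes.  The two stubs marked below stand in for
ib_alloc_mr() and ib_dereg_mr().

```c
#include <assert.h>
#include <stdbool.h>

/* Toy userspace model, not kernel code. All names here are hypothetical. */
enum mr_state { MR_INVALID, MR_VALID, MR_UNKNOWN };
enum wc_status { WC_SUCCESS, WC_MW_BIND_ERR };

struct toy_mr {
	bool allocated;
	enum mr_state state;
	unsigned int generation;	/* bumped on every (re)allocation */
};

/* Stands in for ib_alloc_mr(): a fresh MR starts out INVALID. */
void toy_alloc_mr(struct toy_mr *mr)
{
	mr->allocated = true;
	mr->state = MR_INVALID;
	mr->generation++;
}

/* Stands in for ib_dereg_mr(): the MR is gone until reallocated. */
void toy_dereg_mr(struct toy_mr *mr)
{
	mr->allocated = false;
	mr->state = MR_UNKNOWN;
}

/* A flushed LOCAL_INV: assume nothing is known about the MR afterwards. */
void toy_linv_flushed(struct toy_mr *mr)
{
	mr->state = MR_UNKNOWN;
}

/* FASTREG succeeds only on an allocated MR in the INVALID state. */
enum wc_status toy_fastreg(struct toy_mr *mr)
{
	if (!mr->allocated || mr->state != MR_INVALID)
		return WC_MW_BIND_ERR;
	mr->state = MR_VALID;
	return WC_SUCCESS;
}

/* LOCAL_INV works on VALID or INVALID MRs, but not on an UNKNOWN one. */
enum wc_status toy_linv(struct toy_mr *mr)
{
	if (!mr->allocated || mr->state == MR_UNKNOWN)
		return WC_MW_BIND_ERR;
	mr->state = MR_INVALID;
	return WC_SUCCESS;
}

/* Recovery: do not retry the WR; dereg and reallocate instead. */
void toy_recover_mr(struct toy_mr *mr)
{
	if (mr->state != MR_UNKNOWN)
		return;
	toy_dereg_mr(mr);
	toy_alloc_mr(mr);
	assert(mr->allocated && mr->state == MR_INVALID);
}
```

In this model, once toy_linv_flushed() has run, both toy_fastreg() and
toy_linv() keep failing until toy_recover_mr() replaces the MR, which
mirrors the "dereg and reallocate" approach rather than posting another
LINV.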

