From mboxrd@z Thu Jan 1 00:00:00 1970
From: Leon Romanovsky
Subject: Re: "memory management error" with NFS/RDMA on RoCE
Date: Wed, 5 Jul 2017 18:29:27 +0300
Message-ID: <20170705152927.GM1528@mtr-leonro.local>
References: <7F0FCF80-DB7B-46F1-BB9A-0B070603DE61@oracle.com>
 <797a43c4-f30d-9deb-a332-c62cbd01be7b@grimberg.me>
 <2FEEE227-9BCF-4454-A056-3997C1E54686@oracle.com>
 <20170627173620.GT1248@mtr-leonro.local>
 <06510488-DB16-4781-8E5A-FDFFDDD00B4F@oracle.com>
Mime-Version: 1.0
Content-Type: multipart/signed; micalg=pgp-sha256;
 protocol="application/pgp-signature"; boundary="uAgJxtfIS94j9H4T"
Return-path:
Content-Disposition: inline
In-Reply-To: <06510488-DB16-4781-8E5A-FDFFDDD00B4F-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Sender: linux-rdma-owner-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
To: Chuck Lever
Cc: Sagi Grimberg, linux-rdma
List-Id: linux-rdma@vger.kernel.org

--uAgJxtfIS94j9H4T
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Wed, Jul 05, 2017 at 10:40:41AM -0400, Chuck Lever wrote:
>
> > On Jun 27, 2017, at 1:36 PM, Leon Romanovsky wrote:
> >
> > On Tue, Jun 27, 2017 at 10:56:29AM -0400, Chuck Lever wrote:
> >> Hi Sagi-
> >>
> >>> On Jun 27, 2017, at 5:28 AM, Sagi Grimberg wrote:
> >>>
> >>>
> >>>> While running xfstests on an NFS/RDMA mount, I see this in
> >>>> the client's /var/log/messages multiple times:
> >>>> Jun 22 14:13:45 manet kernel: mlx5_0:dump_cqe:275:(pid 0): dump error cqe
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 00000000 00000000 00000000
> >>>> Jun 22 14:13:45 manet kernel: 00000000 08007806 250000cd 024027d3
> >>>> Jun 22 14:13:45 manet kernel: rpcrdma: fastreg: memory management operation error (6/0x78)
> >>>> As far as I can tell the client is able to recover and continue
> >>>> the test. However, this error is not supposed to happen in normal
> >>>> operation.
> >>>> This is with a Mellanox CX4 in RoCEv1 mode, v4.12-rc2.
> >>>
> >>> Is this a regression?
> >>
> >> I can't answer that question with authority, because I just
> >> started trying out NFS/RDMA on RoCE with mlx5. But Robert has
> >> reported very similar symptoms with iSER on v4.9. It appears
> >> to have been around for a while, if these are the same.
> >>
> >>
> >>> What kernel version are you running?
> >>
> >> v4.12-rc2.
> >>
> >>
> >>> FW revision?
> >>
> >> 12.18.2000
> >>
> >>
> >>> Is the below commit applied?
> >>
> >> This commit does not appear to be applied to my kernel.
> >>
> >>
> >>> commit 6e8484c5cf07c7ee632587e98c1a12d319dacb7c
> >>> Author: Max Gurtovoy
> >>> Date: Sun May 28 10:53:11 2017 +0300
> >>>
> >>> RDMA/mlx5: set UMR wqe fence according to HCA cap
> >>>
> >>> Cache the needed umr_fence and set the wqe ctrl segmennt
> >>> accordingly.
> >>>
> >>> Signed-off-by: Max Gurtovoy
> >>> Acked-by: Leon Romanovsky
> >>> Reviewed-by: Sagi Grimberg
> >>> Signed-off-by: Doug Ledford
> >>>
> >>> This is the only thing that changed in that area
> >>> lately...
> >>>
> >>> Can you try without it?
> >>
> >> I haven't tried with it. I can pull it and see if it helps.
> >>
> >> I have tried:
> >>
> >> - with and without IOMMU enabled
> >> - with RoCE v1 and v2
> >> - with instrumentation:
> >>
> >> This can happen to any MR at any time after any number of
> >> uses. It does not appear to be "sticky" (ie, xprtrdma
> >> recovery from a memory management error clears the problem
> >> successfully by releasing the MR and allocating a new one).
> >>
> >> So it feels like a f/w or driver problem to me, at this
> >> point.
> >
> > Jack and me discussed your issue tomorrow morning and we have strong
> > feeling that it is FW.
>
> Hi Leon-
>
> Who is going to drive this issue to resolution? Do you need me
> to do something?

I don't think so; Jack was supposed to do it.
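
For anyone following along, the error CQE dumped at the top of this thread can be decoded by hand. Below is a minimal Python sketch of that decoding. It assumes the struct mlx5_err_cqe layout from include/linux/mlx5/device.h in kernels of this era (vendor_err_synd at byte 54, syndrome at byte 55, s_wqe_opcode_qpn at bytes 56..59, wqe_counter at 60..61, op_own at 63) and that dump_cqe prints each 32-bit word via be32_to_cpu(). The helper name decode_err_cqe_tail and the small name tables are hypothetical and only cover the values seen in this dump.

```python
import struct

# Last line of the "dump error cqe" output quoted in this thread: CQE bytes
# 48..63. The first three lines (bytes 0..47) were all zero.
LAST_LINE = "00000000 08007806 250000cd 024027d3"

# Partial name tables, assumed from include/linux/mlx5/device.h (v4.12 era).
# Only the values relevant to this dump are listed.
SYNDROMES = {0x06: "MW_BIND_ERR"}
WQE_OPCODES = {0x25: "UMR"}
CQE_OPCODES = {0xD: "REQ_ERR", 0xE: "RESP_ERR"}


def decode_err_cqe_tail(line: str) -> dict:
    """Decode CQE bytes 48..63, assuming the struct mlx5_err_cqe layout:
    vendor_err_synd at 54, syndrome at 55, s_wqe_opcode_qpn at 56..59,
    wqe_counter at 60..61, signature at 62, op_own at 63."""
    words = [int(w, 16) for w in line.split()]
    if len(words) != 4:
        raise ValueError("expected four 32-bit hex words")
    # dump_cqe prints be32_to_cpu() of each word, so re-packing the words
    # big-endian recovers the in-memory byte order of the CQE.
    raw = struct.pack(">4I", *words)
    # Offsets below are relative to CQE byte 48.
    (opcode_qpn,) = struct.unpack_from(">I", raw, 8)
    (wqe_counter,) = struct.unpack_from(">H", raw, 12)
    op_own = raw[15]
    return {
        "vendor_err_synd": raw[6],
        "syndrome": raw[7],
        "syndrome_name": SYNDROMES.get(raw[7], "unknown"),
        "wqe_opcode": opcode_qpn >> 24,
        "wqe_opcode_name": WQE_OPCODES.get(opcode_qpn >> 24, "unknown"),
        "qpn": opcode_qpn & 0xFFFFFF,
        "wqe_counter": wqe_counter,
        "cqe_opcode": op_own >> 4,
        "cqe_opcode_name": CQE_OPCODES.get(op_own >> 4, "unknown"),
    }


if __name__ == "__main__":
    for field, value in decode_err_cqe_tail(LAST_LINE).items():
        print(f"{field}: {value:#x}" if isinstance(value, int) else f"{field}: {value}")
```

Under those assumptions the last dump line decodes to syndrome 0x06 (MW bind error, which mlx5 appears to report as IB_WC_MW_BIND_ERR, status 6, whose ib_wc_status_msg() text is "memory management operation error"), vendor syndrome 0x78, WQE opcode 0x25 (UMR) on QPN 0xcd, wqe_counter 0x240, and CQE opcode 0xd (requester error). That lines up with the "(6/0x78)" in the rpcrdma message and would point at the UMR WQE that mlx5 posts for a fastreg, though this is an inference from the dump, not a diagnosis.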
>
> >
> > Thanks
> >
> >>
> >> --
> >> Chuck Lever
> >>
> >>
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
> >> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Chuck Lever
>
>
>

--uAgJxtfIS94j9H4T--