From: Leon Romanovsky <leon@kernel.org>
To: Chuck Lever III <chuck.lever@oracle.com>
Cc: Timo Rothenpieler <timo@rothenpieler.org>,
Linux NFS Mailing List <linux-nfs@vger.kernel.org>,
linux-rdma <linux-rdma@vger.kernel.org>
Subject: Re: Spurious instability with NFSoRDMA under moderate load
Date: Wed, 19 May 2021 18:20:40 +0300
Message-ID: <YKUsyKUFdL9IfLRp@unreal>
In-Reply-To: <72ECF9E1-1F6E-44AF-850C-536BED898DDD@oracle.com>

On Mon, May 17, 2021 at 04:27:29PM +0000, Chuck Lever III wrote:
> Hello Timo-
>
> > On May 16, 2021, at 1:29 PM, Timo Rothenpieler <timo@rothenpieler.org> wrote:
> >
> > This has happened 3 times so far over the last couple months, and I do not have a clear way to reproduce it.
> > It happens under moderate load, when lots of nodes read and write from the server. Though not in any super intense way. Just normal program execution, writing of light logs, and other standard tasks.
> >
> > The issues on the clients manifest in a multitude of ways. Most of the time, random IO operations just fail; rarely, they hang indefinitely and make the process unkillable.
> > Another example would be: "Failed to remove '.../.nfs00000000007b03af00000001': Device or resource busy"
> >
> > Once a client is in that state, the only way to get it back into order is a reboot.
> >
> > On the server side, a single error cqe is dumped each time this problem happened. So far, I always rebooted the server as well, to make sure everything is back in order. Not sure if that is strictly necessary.
> >
> >> [561889.198889] infiniband mlx5_0: dump_cqe:272:(pid 709): dump error cqe
> >> [561889.198945] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.198984] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.199023] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [561889.199061] 00000030: 00 00 00 00 00 00 88 13 08 00 01 13 07 47 67 d2
> >
> >> [985074.602880] infiniband mlx5_0: dump_cqe:272:(pid 599): dump error cqe
> >> [985074.602921] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602946] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602970] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [985074.602994] 00000030: 00 00 00 00 00 00 88 13 08 00 01 46 f2 93 0b d3
> >
> >> [1648894.168819] infiniband ibp1s0: dump_cqe:272:(pid 696): dump error cqe
> >> [1648894.168853] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168878] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168903] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> [1648894.168928] 00000030: 00 00 00 00 00 00 88 13 08 00 01 08 6b d2 b9 d3
>
> I'm hoping Leon can get out his magic decoder ring and tell us if
> these CQE dumps contain a useful WC status code.

Unfortunately, no. I failed to parse it. If I read the dump correctly,
it is not marked as an error and the opcode is 0.

Thanks