Linux IOMMU Development
From: Max Gurtovoy via iommu <iommu@lists.linux-foundation.org>
To: Mark Ruijter <mruijter@primelogic.nl>,
	Robin Murphy <robin.murphy@arm.com>,
	Martin Oliveira <Martin.Oliveira@eideticom.com>,
	Chaitanya Kulkarni <chaitanyak@nvidia.com>
Cc: Kelly Ursenbach <Kelly.Ursenbach@eideticom.com>,
	"linux-rdma@vger.kernel.org" <linux-rdma@vger.kernel.org>,
	"Lee, Jason" <jasonlee@lanl.gov>,
	"linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>,
	"iommu@lists.linux-foundation.org"
	<iommu@lists.linux-foundation.org>,
	Logan Gunthorpe <Logan.Gunthorpe@eideticom.com>
Subject: Re: Error when running fio against nvme-of rdma target (mlx5 driver)
Date: Tue, 17 May 2022 14:16:35 +0300	[thread overview]
Message-ID: <920e58ac-6a57-fdb1-a2c7-b6fef388917e@nvidia.com> (raw)
In-Reply-To: <3F2D3249-79E4-4CE1-940F-E1E0719EFAF0@primelogic.nl>

Hi,

Can you please send the original scenario, setup details and dumps? I can't find them in my mailbox.

You can send them directly to me to avoid spam.

-Max.

On 5/17/2022 11:26 AM, Mark Ruijter wrote:
> Hi Robin,
>
> I ran into the exact same problem while testing with four ConnectX-6 cards on kernel 5.18-rc6.
>
> [ 4878.273016] nvme nvme0: Successfully reconnected (3 attempts)
> [ 4879.122015] nvme nvme0: starting error recovery
> [ 4879.122028] infiniband mlx5_4: mlx5_handle_error_cqe:332:(pid 0): WC error: 4, Message: local protection error
> [ 4879.122035] infiniband mlx5_4: dump_cqe:272:(pid 0): dump error cqe
> [ 4879.122037] 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 4879.122039] 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 4879.122040] 00000020: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> [ 4879.122040] 00000030: 00 00 00 00 a9 00 56 04 00 00 00 ed 0d da ff e2
> [ 4881.085547] nvme nvme3: Reconnecting in 10 seconds...
>
> I assume this means that the problem has still not been resolved?
> If so, I'll try to diagnose the problem.
>
> Thanks,
>
> --Mark
>
> On 11/02/2022, 12:35, "Linux-nvme on behalf of Robin Murphy" <linux-nvme-bounces@lists.infradead.org on behalf of robin.murphy@arm.com> wrote:
>
>      On 2022-02-10 23:58, Martin Oliveira wrote:
>      > On 2/9/22 1:41 AM, Chaitanya Kulkarni wrote:
>      >> On 2/8/22 6:50 PM, Martin Oliveira wrote:
>      >>> Hello,
>      >>>
>      >>> We have been hitting an error when running IO over our nvme-of setup using the mlx5 driver, and we are wondering if anyone has seen anything similar or has any suggestions.
>      >>>
>      >>> Both initiator and target are AMD EPYC 7502 machines connected over RDMA using a Mellanox MT28908. Target has 12 NVMe SSDs which are exposed as a single NVMe fabrics device, one physical SSD per namespace.
>      >>>
>      >>
>      >> Thanks for reporting this; if you can bisect the problem on your
>      >> setup, it will help others help you.
>      >>
>      >> -ck
>      >
>      > Hi Chaitanya,
>      >
>      > I went back to a kernel as old as 4.15 and the problem was still there, so I don't know of a good commit to start from.
>      >
>      > I also learned that I can reproduce this with as little as 3 cards and I updated the firmware on the Mellanox cards to the latest version.
>      >
>      > I'd be happy to try any tests if someone has any suggestions.
>
>      The IOMMU is probably your friend here - one thing that might be worth
>      trying is capturing the iommu:map and iommu:unmap tracepoints to see if
>      the address reported in subsequent IOMMU faults was previously mapped as
>      a valid DMA address (be warned that there will likely be a *lot* of
>      trace generated). With 5.13 or newer, booting with "iommu.forcedac=1"
>      should also make it easier to tell real DMA IOVAs from rogue physical
>      addresses or other nonsense, as real DMA addresses should then look more
>      like 0xffff24d08000.
>
>      That could at least help narrow down whether it's some kind of
>      use-after-free race or a completely bogus address creeping in somehow.
>
>      Robin.
>
>
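Robin's tracepoint suggestion can be sketched as a shell session. This is a hypothetical sketch only: it assumes root access, tracefs mounted at /sys/kernel/tracing, and a trace line format containing "iova=0x"; the iommu:map/iommu:unmap tracepoint names, the example IOVA, and the iommu.forcedac=1 boot parameter come from the message above.

```shell
# Enable the iommu:map / iommu:unmap tracepoints, reproduce the
# workload, then check whether the faulting address was ever mapped
# as a valid DMA address.
TRACEFS=/sys/kernel/tracing
if [ -w "$TRACEFS/tracing_on" ]; then
    echo 1 > "$TRACEFS/events/iommu/map/enable"
    echo 1 > "$TRACEFS/events/iommu/unmap/enable"
    echo 1 > "$TRACEFS/tracing_on"
    # ... reproduce the fio workload against the nvme-of target here ...
    echo 0 > "$TRACEFS/tracing_on"
    # Search the (potentially very large) trace for the address from
    # the IOMMU fault report:
    grep "iova=0x" "$TRACEFS/trace" | tail -n 20 || true
else
    echo "tracefs not writable; re-run as root" >&2
fi
# With 5.13 or newer, booting with iommu.forcedac=1 makes real DMA
# IOVAs (e.g. 0xffff24d08000) easier to tell apart from rogue
# physical addresses.
```

The exact trace line format varies between kernel versions, so the grep pattern may need adjusting; the key check is whether an iommu:unmap for the faulting IOVA appears shortly before the fault (use-after-free race) or whether the address never appears at all (bogus address).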

Thread overview: 8+ messages
2022-02-09  2:50 Error when running fio against nvme-of rdma target (mlx5 driver) Martin Oliveira
2022-02-09  8:41 ` Chaitanya Kulkarni via iommu
2022-02-10 23:58   ` Martin Oliveira
2022-02-11 11:35     ` Robin Murphy
2022-05-17  8:26       ` Mark Ruijter
2022-05-17 11:16         ` Max Gurtovoy via iommu [this message]
2022-02-09 12:48 ` Robin Murphy
2024-01-31  9:18 ` Arthur Muller
