linux-kernel.vger.kernel.org archive mirror
From: Jakub Kicinski <kuba@kernel.org>
To: Aron Silverton <aron.silverton@oracle.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Saeed Mahameed <saeed@kernel.org>,
	Jason Gunthorpe <jgg@nvidia.com>,
	David Ahern <dsahern@kernel.org>, Arnd Bergmann <arnd@arndb.de>,
	Leon Romanovsky <leonro@nvidia.com>, Jiri Pirko <jiri@nvidia.com>,
	Leonid Bloch <lbloch@nvidia.com>,
	Itay Avraham <itayavr@nvidia.com>,
	linux-kernel@vger.kernel.org, Saeed Mahameed <saeedm@nvidia.com>
Subject: Re: [PATCH V3 2/5] misc: mlx5ctl: Add mlx5ctl misc driver
Date: Tue, 5 Dec 2023 20:48:55 -0800
Message-ID: <20231205204855.52fa5cc1@kernel.org>
In-Reply-To: <fgalnohzpiox7rvsf3wsurkf2x3rdtyhwqq5tk43gesvjlw6yl@i7colkh2sx5h>

On Tue, 5 Dec 2023 11:11:00 -0600 Aron Silverton wrote:
> 1. As mentioned already, we recently faced a complex problem with RDMA
> in KVM and were getting nowhere trying to debug using the usual methods.
> Mellanox support was able to use this debug interface to see what was
> happening on the PCI bus and prove that the issue was caused by
> corrupted PCIe transactions. This finally put the investigation on the
> correct path. The debug interface was used consistently and extensively
> to test theories about what was happening in the system and, ultimately,
> allowed the problem to be solved.

You hit on an important point, one that matches my experience working
at Meta. I may have even mentioned it in this thread already.
If there is a serious issue with a complex device, there are two ways
you can get support - dump all you can and send the dump to the vendor
or get on a live debugging session with their engineers. Users' ability
to debug those devices is practically non-existent. The idea that we
need access to FW internals is predicated on the assumption that we
have an ability to make sense of those internals.

Once you're on a support call with the vendor - just load a custom
kernel, module, whatever, it's already extremely expensive manual labor.

> 2. We've faced RDMA issues related to lost EQ doorbells, requiring
> complex debug, and ultimately root-caused as a defective CPU. Without
> interactive access to the device allowing us to test theories like,
> "what if we manually restart the EQ", we could not have proven this
> definitively.

I'm not familiar with the RDMA debugging capabilities. Perhaps there
are some gaps there. The more proprietary the implementation the harder
it is to debug. An answer to that would be "try to keep as much as
possible open", and interfaces which let closed user space talk to
closed FW take us in the opposite direction.

FWIW good netdevice drivers have a selftest which tests IRQ generation
and EQ handling. I think that'd cover the case you're describing?
IDK if mlx5 has them, but if it doesn't, they're definitely worth adding.
And I recommend running those on suspicious machines (ethtool -t, devlink
has some selftests, too).

> Firstly, We believe in working upstream and all of the advantages that
> that brings to all the distros as well as to us and our customers.
> 
> Secondly, Our cloud business offers many types of machine instances,
> some with bare metal/vfio mlx5 devices, that require customer driven
> debug and we want our customers to have the freedom to choose which OS
> they want to use.

I understand that having everything packaged and shipped together makes
life easier.

If the point of the kernel at this stage of its evolution is to collect
incompatible bits of vendor software, make sure they build cleanly and
ship them to distros - someone should tell me, and I will relent.
