Archive-only list for patches
 help / color / mirror / Atom feed
From: Daniel Vetter <daniel@ffwll.ch>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: Dan Williams <dan.j.williams@intel.com>,
	Jakub Kicinski <kuba@kernel.org>,
	David Ahern <dsahern@kernel.org>,
	Jonathan Corbet <corbet@lwn.net>,
	Itay Avraham <itayavr@nvidia.com>,
	Leon Romanovsky <leon@kernel.org>,
	linux-doc@vger.kernel.org, linux-rdma@vger.kernel.org,
	netdev@vger.kernel.org, Paolo Abeni <pabeni@redhat.com>,
	Saeed Mahameed <saeedm@nvidia.com>,
	Tariq Toukan <tariqt@nvidia.com>,
	Andy Gospodarek <andrew.gospodarek@broadcom.com>,
	Aron Silverton <aron.silverton@oracle.com>,
	Christoph Hellwig <hch@infradead.org>,
	Jiri Pirko <jiri@nvidia.com>, Leonid Bloch <lbloch@nvidia.com>,
	Leon Romanovsky <leonro@nvidia.com>,
	linux-cxl@vger.kernel.org, patches@lists.linux.dev
Subject: Re: [PATCH 0/8] Introduce fwctl subystem
Date: Tue, 11 Jun 2024 17:36:17 +0200	[thread overview]
Message-ID: <Zmhu8egti-URPFoB@phenom.ffwll.local> (raw)
In-Reply-To: <20240605135911.GT19897@nvidia.com>

On Wed, Jun 05, 2024 at 10:59:11AM -0300, Jason Gunthorpe wrote:
> On Tue, Jun 04, 2024 at 04:56:57PM -0700, Dan Williams wrote:
> > * Introspection / validation: Subsystem community needs to be able to
> >   audit behavior after the fact.
> > 
> >   To me this means even if the kernel is letting a command through based
> >   on the stated Command Effect of "Configuration Change after Cold Reset"
> >   upstream community has a need to be able to read the vendor
> >   specification for that command. I.e. commands might be vendor-specific,
> >   but never vendor-private. I see this as similar to the requirement for
> >   open source userspace for sophisticated accelerators.
> 
> I'm less hard on this. As long as reasonable open userspace exists I
> think it is fine to let other stuff through too. I can appreciate the
> DRM stance on this, but IMHO, there is meaningfully more value for open
> source in trying get an open Vulkan implementation vs blocking users
> from reading their vendor'd diagnostic SI values.
> 
> I don't think we should get into some kind of extremism and insist
> that every single bit must be documented/standardized or Linux won't
> support it.

I figured it might be useful to paint what we do in DRM with a bit more
nuance. In the principles, we're indeed fairly radical in what we require,
but in practice we aim for a much more pragmatic approach in what we
merge. There's two major axis here:

1. One is ecosystem maturity. One end is 3d, with vulkan as the clear
industry standard, and an upstream full-featured userspace driver in
mesa3d is the only technically reasonable choice. And all gpu vendors
agree and by this year even nvidia started hiring an upstream team. But
this didn't happen magically overnight, it took 1-2 decades of background
discussions and tactical push&pulling to get there.

The other end is currently AI accelerators. It's a complete mess, where
across the platform (client, edge, cloud), customer and vendor dimension
every point has a different stack. And the problem is so obvious that
everyone is working to fix this, which means currently
https://xkcd.com/927/ is happening in parallel. Just to get things going
we're accepting pretty much anything that's a notch above total garbage
for userspace and for merging into the kernel.

2. The other part is how much it impacts applications. If you can't run
the same application across different vendors, the case for an upstream
stack becomes a lot weaker. At the other end is infrastructure enabling
like device configuration, error handling and recovery, hw debugging and
reliablity/health reporting. That's a lot more vendor specific in nature
and needs to be customized anyway per deployement. And only much higher in
the stack, maybe in k8s, can a technically reasonable unification even
happen.  So again we're much more lenient about infrastructure enabling
and uapi than stuff applications will use directly.

Currently that's enough of a mess in drm that I feel like enforcing
something like fwctl is still too much. But maybe once fwctl is
established with other subsystems/devices we can start the conversations
with vendors to get this going a few years down the road.

Both together mean we land a lot of code that's questionable at best,
clear garbage at worst. But since we've been in the merging garbage
business just to get things going for decades, we've become pretty good at
dealing with the kernel internal and uapi fallout, some say too good. But
personally I don't think there's a path to where we are with 3d/vulkan
that doesn't go through years of this kind of suck, and very much merged
into upstream kind of suck.

For all the concerns about trusting vendors/devices to not abuse very broad
uapi interfaces: Modern accelerator command submission boils down to "run
this context at this $addr", and the kernel never ever directly sees
anything more fly by. That's the same interface you need for a no-op job
as a full blown AI workload, so in theory maximal abuse potential.

In practice, it doesn't seem to be an issue, at least not beyond the
intentionally pragmatic choices where we merge kernel code with known
sub-par/incomplete userspace. I'm not sure why, but to my knowledge all
attempts to break the spirit of our userspace rules while following the
letter die in vendor-internal discussions, at least for all the
established upstream driver teams.

And for new ones it takes years of private chats to get them going and
fully established in upstream anyway.

Maybe one reason we have a bit an extremist reputation is that all the
public takes are the radical principled requirements, while the actual
pragmatic discussions mostly happen in private.

tldr; fwctl as I understand it feels like a bridge to far for drm today,
but I'd very much like someone else to make this happen so we could
eventually push towards adoption too.

Cheers, Sima
-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch

  parent reply	other threads:[~2024-06-11 15:36 UTC|newest]

Thread overview: 73+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-06-03 15:53 [PATCH 0/8] Introduce fwctl subystem Jason Gunthorpe
2024-06-03 15:53 ` [PATCH 1/8] fwctl: Add basic structure for a class subsystem with a cdev Jason Gunthorpe
2024-06-04  9:32   ` Leon Romanovsky
2024-06-04 15:50     ` Jason Gunthorpe
2024-06-04 17:05       ` Jonathan Cameron
2024-06-04 18:52         ` Jason Gunthorpe
2024-06-05 11:08           ` Jonathan Cameron
2024-06-04 16:42   ` Randy Dunlap
2024-06-04 16:44     ` Jason Gunthorpe
2024-06-03 15:53 ` [PATCH 2/8] fwctl: Basic ioctl dispatch for the character device Jason Gunthorpe
2024-06-04 12:16   ` Zhu Yanjun
2024-06-04 12:22     ` Leon Romanovsky
2024-06-04 16:50       ` Jonathan Cameron
2024-06-04 16:58         ` Jason Gunthorpe
2024-06-05 11:07           ` Jonathan Cameron
2024-06-05 18:27             ` Jason Gunthorpe
2024-06-06 13:34               ` Jonathan Cameron
2024-06-06 15:37                 ` Randy Dunlap
2024-06-05 15:42   ` Przemek Kitszel
2024-06-05 15:49     ` Jason Gunthorpe
2024-06-03 15:53 ` [PATCH 3/8] fwctl: FWCTL_INFO to return basic information about the device Jason Gunthorpe
2024-06-13 23:32   ` Dave Jiang
2024-06-13 23:40     ` Jason Gunthorpe
2024-06-14 16:37       ` Dave Jiang
2024-06-03 15:53 ` [PATCH 4/8] taint: Add TAINT_FWCTL Jason Gunthorpe
2024-06-03 15:53 ` [PATCH 5/8] fwctl: FWCTL_RPC to execute a Remote Procedure Call to device firmware Jason Gunthorpe
2024-06-03 15:53 ` [PATCH 6/8] fwctl: Add documentation Jason Gunthorpe
2024-06-05  2:31   ` Randy Dunlap
2024-06-05 16:03     ` Jason Gunthorpe
2024-06-05 20:14       ` Randy Dunlap
2024-06-03 15:53 ` [PATCH 7/8] fwctl/mlx5: Support for communicating with mlx5 fw Jason Gunthorpe
2024-06-03 15:53 ` [PATCH 8/8] mlx5: Create an auxiliary device for fwctl_mlx5 Jason Gunthorpe
2024-06-03 18:42 ` [PATCH 0/8] Introduce fwctl subystem Jakub Kicinski
2024-06-04  3:01   ` David Ahern
2024-06-04 14:04     ` Jakub Kicinski
2024-06-04 21:28       ` Saeed Mahameed
2024-06-04 22:32         ` Jakub Kicinski
2024-06-05 14:50           ` Jason Gunthorpe
2024-06-05 15:41             ` Jakub Kicinski
2024-06-04 23:56       ` Dan Williams
2024-06-05  3:05         ` Jakub Kicinski
2024-06-05 11:19         ` Jonathan Cameron
2024-06-05 13:59         ` Jason Gunthorpe
2024-06-06  2:35           ` David Ahern
2024-06-06 14:18             ` Jakub Kicinski
2024-06-06 14:48               ` Jason Gunthorpe
2024-06-06 15:05                 ` Jakub Kicinski
2024-06-06 17:47                   ` David Ahern
2024-06-07  6:48                     ` Jiri Pirko
2024-06-07 14:50                       ` David Ahern
2024-06-07 15:14                         ` Jason Gunthorpe
2024-06-07 15:50                           ` Jiri Pirko
2024-06-07 17:24                             ` Jason Gunthorpe
2024-06-07  7:34               ` Jiri Pirko
2024-06-07 12:49                 ` Andrew Lunn
2024-06-07 13:34                   ` Jiri Pirko
2024-06-08  1:43                     ` Jakub Kicinski
2024-06-06  4:56           ` Dan Williams
2024-06-06  8:50             ` Leon Romanovsky
2024-06-06 22:11               ` Dan Williams
2024-06-07  0:02                 ` Jason Gunthorpe
2024-06-07 13:12                 ` Leon Romanovsky
2024-06-06 14:41             ` Jason Gunthorpe
2024-06-06 14:58               ` Jakub Kicinski
2024-06-06 17:24               ` Dan Williams
2024-06-07  0:25                 ` Jason Gunthorpe
2024-06-07 10:47                   ` Przemek Kitszel
2024-06-11 15:36           ` Daniel Vetter [this message]
2024-06-11 16:17             ` Jason Gunthorpe
2024-06-11 16:54               ` Daniel Vetter
2024-06-06  1:58       ` David Ahern
2024-06-05  3:11 ` Jakub Kicinski
2024-06-05 12:06   ` Jason Gunthorpe

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zmhu8egti-URPFoB@phenom.ffwll.local \
    --to=daniel@ffwll.ch \
    --cc=andrew.gospodarek@broadcom.com \
    --cc=aron.silverton@oracle.com \
    --cc=corbet@lwn.net \
    --cc=dan.j.williams@intel.com \
    --cc=dsahern@kernel.org \
    --cc=hch@infradead.org \
    --cc=itayavr@nvidia.com \
    --cc=jgg@nvidia.com \
    --cc=jiri@nvidia.com \
    --cc=kuba@kernel.org \
    --cc=lbloch@nvidia.com \
    --cc=leon@kernel.org \
    --cc=leonro@nvidia.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-rdma@vger.kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    --cc=patches@lists.linux.dev \
    --cc=saeedm@nvidia.com \
    --cc=tariqt@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox