From: Jason Gunthorpe <jgg@nvidia.com>
To: Robin Murphy <robin.murphy@arm.com>
Cc: joro@8bytes.org, will@kernel.org, iommu@lists.linux.dev,
linux-arm-kernel@lists.infradead.org, nicolinc@nvidia.com
Subject: Re: [PATCH] iommu/arm-smmu-v3: Fail aliasing StreamIDs more gracefully
Date: Fri, 11 Apr 2025 15:46:46 -0300 [thread overview]
Message-ID: <20250411184646.GB252886@nvidia.com> (raw)
In-Reply-To: <9ed7aba6-4d0d-4cfb-a85b-ca943c7e031a@arm.com>
On Fri, Apr 11, 2025 at 07:16:57PM +0100, Robin Murphy wrote:
> > That is different from the iommu instance owning the device but being
> > unable to probe it for some reason. The failure recovery in the core
> > code should be different.
>
> Perhaps, but right now this is a trivial, obvious fix using the mechanisms
> that we already have and that other drivers are already using similarly,
> that can effectively restore the exact same overall behaviour as 6.14 with
> zero risk.
Yes, but so would ignoring the error within bus_iommu_probe, which is
close to what used to be happening. I'm a little worried this will
turn into a wack a mole..
If we keep it this way it further muddies what the driver is supposed
to be doing with it's error codes. How should a driver author decide
which one to use?
> Nothing in this two-line patch precludes coming back and doing more later.
Yes
> triggering those failures? :) Annoyingly I chose to kick the can down the
> road because I know I'm deliberately using a quirky unsupported hardware
> setup...
Oh interesting, I'm actually quite surprised that very new chips are
still pretending to be PCI bridges. :\
> Errors which occur during .probe_device clearly don't have to be specific to
> the device - e.g. OOM - but even device-adjacent conditions may be
> sufficiently catastrophic to indicate that it's probably not worth
> continuing - for instance if the arm_smmu_sid_in_range() check were to fail
> then it's clear firmware is giving us bad data, at which point we can't
> necessarily trust *anything* it's said about the SMMU at all.
Maybe, but also, maybe not. The more optimistic view is that the FW is
generally correct and tested by someone and mistakes are narrow to a
single device. That is my general experiance with server FW at least..
Plug in some weird thing, the FW freaks out and does something wrong,
for that thing, but everything else remains fine.
> > It would be nice if the dev->iommu would record that the struct device
> > is inoperable and then we can fail
> > iommu_device_use_default_domain()/etc in a contained way.
>
> Maybe, but what would the nature of that containment be?
Just block the end driver from binding so we don't compound the
problems. Not much you can do on the iommu side. For PCI devices
without a driver bind they won't enable bus master so they should be
contained at that point.
> other than leave the user to pick up the pieces, and often the best chance
> for that *is* going to be to give up entirely, fail and disable the IOMMU,
> because otherwise blocking I/O might prevent the user seeing anything at
> all.
Maybe, or the untested unregister iommu driver path will crash :)
Or the fault is localized to the device and we will get a better
outcome by just not loading that driver.
In the end we are guessing, I'm just looking at this from the POV that
for a long time we ignored errors per-device and now we are failing
the whole iommu startup. That is a big change in v6.15, and it doesn't
seem really necessary.
> And for any remaining niche cases of "expected" errors like this one where
> the driver could in principle support the thing but doesn't due to its own
> limitation, honestly don't spend effort on gold-plating that limitation in
> core code, just spend it on making the driver support the thing instead.
Yes, Nicolin will try a v2
So, go ahead with this if you think it is best, but I do prefer
ignoring all errors. If we get a 2nd similar issue lets agree to
change bus_iommu_probe()/etc to ignore probe error codes rather than
wack-a-mole ENODEV across many drivers.
Jason
next prev parent reply other threads:[~2025-04-11 18:49 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-11 14:09 [PATCH] iommu/arm-smmu-v3: Fail aliasing StreamIDs more gracefully Robin Murphy
2025-04-11 15:21 ` Jason Gunthorpe
2025-04-11 15:50 ` Pranjal Shrivastava
2025-04-11 15:58 ` Jason Gunthorpe
2025-04-11 18:16 ` Robin Murphy
2025-04-11 18:46 ` Jason Gunthorpe [this message]
2025-04-17 14:27 ` Will Deacon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250411184646.GB252886@nvidia.com \
--to=jgg@nvidia.com \
--cc=iommu@lists.linux.dev \
--cc=joro@8bytes.org \
--cc=linux-arm-kernel@lists.infradead.org \
--cc=nicolinc@nvidia.com \
--cc=robin.murphy@arm.com \
--cc=will@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.