Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Jason Gunthorpe <jgg@ziepe.ca>
To: Lu Baolu <baolu.lu@linux.intel.com>
Cc: Joerg Roedel <joro@8bytes.org>, Will Deacon <will@kernel.org>,
	Robin Murphy <robin.murphy@arm.com>,
	Kevin Tian <kevin.tian@intel.com>,
	Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Nicolin Chen <nicolinc@nvidia.com>, Yi Liu <yi.l.liu@intel.com>,
	Jacob Pan <jacob.jun.pan@linux.intel.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	iommu@lists.linux.dev, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev()
Date: Fri, 1 Dec 2023 16:35:36 -0400	[thread overview]
Message-ID: <20231201203536.GG1489931@ziepe.ca> (raw)
In-Reply-To: <20231115030226.16700-13-baolu.lu@linux.intel.com>

On Wed, Nov 15, 2023 at 11:02:26AM +0800, Lu Baolu wrote:
> The iopf_queue_flush_dev() is called by the iommu driver before releasing
> a PASID. It ensures that all pending faults for this PASID have been
> handled or cancelled, and won't hit the address space that reuses this
> PASID. The driver must make sure that no new fault is added to the queue.

This needs more explanation, why should anyone care?

More importantly, why is *discarding* the right thing to do?
Especially why would we discard a partial page request group?

After we change a translation we may have PRI requests in a
queue. They need to be acknowledged, not discarded. The DMA in the
device should be restarted and the device should observe the new
translation - if it is blocking then it should take a DMA error.

More broadly, we should just let things run their normal course. The
domain to deliver the fault to should be determined very early. If we
get a fault and there is no fault domain currently assigned then just
restart it.

The main reason to fence would be to allow the domain to become freed
as the faults should be holding pointers to it. But I feel there are
simpler options for that then this..

> The SMMUv3 driver doesn't use it because it only implements the
> Arm-specific stall fault model where DMA transactions are held in the SMMU
> while waiting for the OS to handle iopf's. Since a device driver must
> complete all DMA transactions before detaching domain, there are no
> pending iopf's with the stall model. PRI support requires adding a call to
> iopf_queue_flush_dev() after flushing the hardware page fault queue.

This explanation doesn't make much sense, from a device driver
perspective both PRI and stall cause the device to not complete DMAs.

The difference between stall and PRI is fairly small, stall causes an
internal bus to lock up while PRI does not.

> -int iopf_queue_flush_dev(struct device *dev)
> +int iopf_queue_discard_dev_pasid(struct device *dev, ioasid_t pasid)
>  {
>  	struct iommu_fault_param *iopf_param = iopf_get_dev_fault_param(dev);
> +	const struct iommu_ops *ops = dev_iommu_ops(dev);
> +	struct iommu_page_response resp;
> +	struct iopf_fault *iopf, *next;
> +	int ret = 0;
>  
>  	if (!iopf_param)
>  		return -ENODEV;
>  
>  	flush_workqueue(iopf_param->queue->wq);
> +

A naked flush_workqueue like this is really suspicious, it needs a
comment explaining why the queue can't get more work queued at this
point. 

I suppose the driver is expected to stop calling
iommu_report_device_fault() before calling this function, but that
doesn't seem like it is going to be possible. Drivers should be
implementing atomic replace for the PASID updates and in that case
there is no momement when it can say the HW will stop generating PRI.

I'm looking at this code after these patches are applied and it still
seems quite bonkers to me :(

Why do we allocate two copies of the memory on all fault paths?

Why do we have fault->type still that only has one value?

What is serializing iommu_get_domain_for_dev_pasid() in the fault
path? It looks sort of like the plan is to use iopf_param->lock and
ensure domain removal grabs that lock at least after the xarray is
changed - but does that actually happen?

I would suggest, broadly, a flow for iommu_report_device_fault() sort
of:

1) Allocate memory for the evt. Every path except errors needs this,
   so just do it
2) iopf_get_dev_fault_param() should not have locks in it! This is
   fast path now. Use a refcount, atomic compare exchange to allocate,
   and RCU free.
3) Everything runs under the fault_param->lock
4) Check if !IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE, set it aside and then
   exit! This logic is really tortured and confusing
5) Allocate memory and assemble the group
6) Obtain the domain for this group and incr a per-domain counter that a
   fault is pending on that domain
7) Put the *group* into the WQ. Put the *group* on a list in fault_param
   instead of the individual faults
8) Don't linear search a linked list in iommu_page_response()! Pass
   the group in that we got from the WQ that we *know* is still
   active. Ack that passed group.

When freeing a domain wait for the per-domain counter to go to
zero. This ensures that the WQ is flushed out and all the outside
domain references are gone.

When wanting to turn off PRI make sure a non-PRI domain is
attached to everything. Fence against the HW's event queue. No new
iommu_report_device_fault() is possible.

Lock the fault_param->lock and go through every pending group and
respond it. Mark the group memory as invalid so iommu_page_response()
NOP's it. Unlock, fence the HW against queued responses, and turn off
PRI.

An *optimization* would be to lightly flush the domain when changing
the translation. Lock the fault_param->lock and look for groups in the
list with old_domain.  Do the same as for PRI-off: respond to the
group, mark it as NOP. The WQ may still be chewing on something so the
domain free still has to check and wait.

Did I get it right??

Jason

next prev parent reply	other threads:[~2023-12-01 20:35 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-15  3:02 [PATCH v7 00/12] iommu: Prepare to deliver page faults to user space Lu Baolu
2023-11-15  3:02 ` [PATCH v7 01/12] iommu: Move iommu fault data to linux/iommu.h Lu Baolu
2023-12-04 10:52   ` Yi Liu
2023-11-15  3:02 ` [PATCH v7 02/12] iommu/arm-smmu-v3: Remove unrecoverable faults reporting Lu Baolu
2023-12-01 15:42   ` Jason Gunthorpe
2023-12-04 10:54   ` Yi Liu
2023-12-05 11:48     ` Baolu Lu
2023-11-15  3:02 ` [PATCH v7 03/12] iommu: Remove unrecoverable fault data Lu Baolu
2023-12-04 10:58   ` Yi Liu
2023-12-05 11:55     ` Baolu Lu
2023-11-15  3:02 ` [PATCH v7 04/12] iommu: Cleanup iopf data structure definitions Lu Baolu
2023-12-04 11:03   ` Yi Liu
2023-11-15  3:02 ` [PATCH v7 05/12] iommu: Merge iopf_device_param into iommu_fault_param Lu Baolu
2023-12-04 12:32   ` Yi Liu
2023-12-05 12:01     ` Baolu Lu
2023-11-15  3:02 ` [PATCH v7 06/12] iommu: Remove iommu_[un]register_device_fault_handler() Lu Baolu
2023-12-04 12:36   ` Yi Liu
2023-12-05 12:09     ` Baolu Lu
2023-11-15  3:02 ` [PATCH v7 07/12] iommu: Merge iommu_fault_event and iopf_fault Lu Baolu
2023-12-01 19:09   ` Jason Gunthorpe
2023-12-04 12:40   ` Yi Liu
2023-11-15  3:02 ` [PATCH v7 08/12] iommu: Prepare for separating SVA and IOPF Lu Baolu
2023-12-05  7:10   ` Yi Liu
2023-11-15  3:02 ` [PATCH v7 09/12] iommu: Make iommu_queue_iopf() more generic Lu Baolu
2023-12-01 19:14   ` Jason Gunthorpe
2023-12-05  7:13   ` Yi Liu
2023-12-05 12:13     ` Baolu Lu
2023-11-15  3:02 ` [PATCH v7 10/12] iommu: Separate SVA and IOPF Lu Baolu
2023-11-15  3:02 ` [PATCH v7 11/12] iommu: Consolidate per-device fault data management Lu Baolu
2023-12-01 19:46   ` Jason Gunthorpe
2023-12-04  0:58     ` Baolu Lu
2023-11-15  3:02 ` [PATCH v7 12/12] iommu: Improve iopf_queue_flush_dev() Lu Baolu
2023-12-01 20:35   ` Jason Gunthorpe [this message]
2023-12-03  8:53     ` Baolu Lu
2023-12-03 14:14       ` Jason Gunthorpe
2023-12-04  1:32         ` Baolu Lu
2023-12-04  5:37           ` Tian, Kevin
2023-12-04 13:25             ` Jason Gunthorpe
2023-12-05  1:32               ` Tian, Kevin
2023-12-05  1:53                 ` Jason Gunthorpe
2023-12-05  3:23                   ` Tian, Kevin
2023-12-05 15:52                     ` Jason Gunthorpe
2023-12-04 13:12           ` Jason Gunthorpe
2023-12-04  3:46     ` Baolu Lu
2023-12-04 13:27       ` Jason Gunthorpe
2023-12-05  1:13         ` Baolu Lu
2023-11-24  6:30 ` [PATCH v7 00/12] iommu: Prepare to deliver page faults to user space liulongfang
2023-11-24 12:01   ` Baolu Lu
2023-11-25  4:05     ` liulongfang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231201203536.GG1489931@ziepe.ca \
    --to=jgg@ziepe.ca \
    --cc=baolu.lu@linux.intel.com \
    --cc=iommu@lists.linux.dev \
    --cc=jacob.jun.pan@linux.intel.com \
    --cc=jean-philippe@linaro.org \
    --cc=joro@8bytes.org \
    --cc=kevin.tian@intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=nicolinc@nvidia.com \
    --cc=robin.murphy@arm.com \
    --cc=will@kernel.org \
    --cc=yan.y.zhao@intel.com \
    --cc=yi.l.liu@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox