public inbox for linux-kernel@vger.kernel.org
From: Nicolin Chen <nicolinc@nvidia.com>
To: "jgg@nvidia.com" <jgg@nvidia.com>, "Tian, Kevin" <kevin.tian@intel.com>
Cc: "joro@8bytes.org" <joro@8bytes.org>,
	"will@kernel.org" <will@kernel.org>,
	"robin.murphy@arm.com" <robin.murphy@arm.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	"shuah@kernel.org" <shuah@kernel.org>,
	"Liu, Yi L" <yi.l.liu@intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"iommu@lists.linux.dev" <iommu@lists.linux.dev>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>,
	"linux-kselftest@vger.kernel.org"
	<linux-kselftest@vger.kernel.org>,
	"baolu.lu@linux.intel.com" <baolu.lu@linux.intel.com>
Subject: Re: [PATCH v2 08/10] iommufd/device: Use iommu_group_replace_domain()
Date: Fri, 10 Feb 2023 16:10:24 -0800	[thread overview]
Message-ID: <Y+bc8OrWfw3Fq57n@Asurada-Nvidia> (raw)
In-Reply-To: <BN9PR11MB5276268D3ED0360913A05C368CDE9@BN9PR11MB5276.namprd11.prod.outlook.com>

On Fri, Feb 10, 2023 at 02:11:23AM +0000, Tian, Kevin wrote:

> > > > +     if (!iommufd_hw_pagetable_has_group(hwpt, idev->group)) {
> > > > +             if (list_empty(&hwpt->devices)) {
> > > > +                     iopt_table_remove_domain(&hwpt->ioas->iopt,
> > > > +                                              hwpt->domain);
> > > > +                     list_del(&hwpt->hwpt_item);
> > > > +             }
> > >
> > > I'm not sure how this can be fully shared between detach and replace.
> > > Here some work e.g. above needs to be done before calling
> > > iommu_group_replace_domain() while others can be done afterwards.
> >
> > This iopt_table_remove_domain/list_del is supposed to be done in
> > the hwpt's destroy() actually. We couldn't move it because it'd
> > need the new domain_alloc_user op and its implementation in ARM
> > driver. Overall, I think it should be safe to put it behind the
> > iommu_group_replace_domain().
> >
> 
> My confusion is that we have different flows between detach/attach
> and replace.
> 
> today with separate detach+attach we have following flow:
> 
>         Remove device from current hwpt;
>         if (last_device in hwpt) {
>                 Remove hwpt domain from current iopt;
>                 if (last_device in group)
>                         detach group from hwpt domain;
>         }
> 
>         if (first device in group) {
>                 attach group to new hwpt domain;
>                 if (first_device in hwpt)
>                         Add hwpt domain to new iopt;
>         }
>         Add device to new hwpt;
> 
> but replace flow is different on the detach part:
> 
>         if (first device in group) {
>                 replace group's domain from current hwpt to new hwpt;
>                 if (first_device in hwpt)
>                         Add hwpt domain to new iopt;
>         }
> 
>         Remove device from old hwpt;
>         if (last_device in old hwpt)
>                 Remove hwpt domain from old iopt;
> 
>         Add device to new hwpt;

Oh... thinking about it carefully, I see the flow does look a bit
off. Perhaps it's better to have a similar flow for replace.

However, I think something would still be different, due to the
tricky nature of replace, especially for a multi-device
iommu_group.

An iommu_group_detach() happens only when a device is the last one
in its group to go through the routine via a DETACH ioctl, while
an iommu_group_replace_domain() happens only when a device is the
first one in its group to go through the routine via an ATTACH
ioctl. However, when the first device does a replace, the cleanup
routine of the old hwpt is a NOP, since other devices of the same
group are still in the old hwpt. There are two implications here:
1) Any other device in the same group has to forcibly switch to
   the new domain, when the first device does a replace.
2) The actual hwpt cleanup can only happen at the last device's
   replace call.

This also means that the kernel has to rely on user space behaving
correctly, i.e. it must replace all active devices in the group:

For a three-device iommu_group,
[scenario 1]
	a. ATTACH dev1 to hwpt1;
	b. ATTACH dev2 to hwpt1;
	c. ATTACH dev3 to hwpt1;
	d. ATTACH (REPLACE) dev1 to hwpt2;
	   (no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
	e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do
	   (no hwpt1 cleanup; no dev2 replace; no hwpt2 init)
	f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do
	   (do hwpt1 cleanup; no dev3 replace; no hwpt2 init)

[scenario 2]
	a. ATTACH dev1 to hwpt1;
	b. ATTACH dev2 to hwpt1;
	c. ATTACH dev3 to hwpt1;
	d. DETACH dev3 from hwpt1;
	   (detach dev3; no hwpt1 cleanup)
	e. ATTACH (REPLACE) dev1 to hwpt2;
	   (no hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
	f. ATTACH (REPLACE) dev2 to hwpt2;	// user space must do
	   (do hwpt1 cleanup; no dev2 replace; no hwpt2 init)
	g. (optional) ATTACH dev3 to hwpt2;	// clean ATTACH, not a REPLACE
	   (no hwpt1 cleanup; no dev3 replace; no hwpt2 init)

[scenario 3]
	a. ATTACH dev1 to hwpt1;
	b. ATTACH dev2 to hwpt1;
	c. ATTACH dev3 to hwpt1;
	d. DETACH dev3 from hwpt1;
	   (detach dev3; no hwpt1 cleanup)
	e. DETACH dev2 from hwpt1;
	   (detach dev2; no hwpt1 cleanup)
	f. ATTACH (REPLACE) dev1 to hwpt2;
	   (do hwpt1 cleanup; replace dev2&3 too; do hwpt2 init)
	g. (optional) ATTACH dev2 to hwpt2;	// clean ATTACH, not a REPLACE
	   (no hwpt1 cleanup; no dev2 replace; no hwpt2 init)
	h. (optional) ATTACH dev3 to hwpt2;	// clean ATTACH, not a REPLACE
	   (no hwpt1 cleanup; no dev3 replace; no hwpt2 init)
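The cleanup/init points in the three scenarios above can be
double-checked with a tiny user-space model. This is plain C just
for illustration; the struct fields and helper names are hypothetical
simplifications of the iommufd objects, not the real kernel API:

```c
#include <assert.h>

/* Hypothetical model: one iommu_group, devices moving between hwpts */
struct hwpt {
	int ndevs;    /* devices logged in this hwpt's device list */
	int attached; /* the group's domain currently points at this hwpt */
	int in_iopt;  /* iopt_table_add_domain() has been done */
};

/* plain ATTACH: the first device of the group attaches the domain */
static void do_attach(struct hwpt *h)
{
	if (!h->attached) {
		h->attached = 1;
		if (h->ndevs == 0)
			h->in_iopt = 1; /* hwpt init */
	}
	h->ndevs++;
}

/* plain DETACH: cleanup happens only when the last device leaves */
static void do_detach(struct hwpt *h)
{
	h->ndevs--;
	if (h->ndevs == 0) {
		h->in_iopt = 0; /* hwpt cleanup */
		h->attached = 0;
	}
}

/* REPLACE: the domain switches on the first call; the old hwpt is
 * cleaned up only once its device list drains (implication 2 above) */
static void do_replace(struct hwpt *old, struct hwpt *new)
{
	old->ndevs--;
	if (!new->attached) { /* first device of the group to replace */
		old->attached = 0;
		new->attached = 1; /* iommu_group_replace_domain() */
		if (new->ndevs == 0)
			new->in_iopt = 1; /* hwpt2 init */
	}
	if (old->ndevs == 0 && !old->attached)
		old->in_iopt = 0; /* deferred hwpt1 cleanup */
	new->ndevs++;
}
```

Running scenario 1 through this model shows hwpt1's iopt link being
dropped only on the third REPLACE call, matching step 1.f above.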

Following the narratives above,

[current detach+attach flow]
	// DETACH dev1 from hwpt1;
	Log dev1 out of the hwpt1's device list;
	NOP; // hwpt1 has its group;
	iopt_remove_reserved_iova(hwpt1->iopt, dev1);
	idev1->hwpt = NULL;
	refcount_dec();
	// DETACH dev2 from hwpt1;
	Log dev2 out of the hwpt1's device list;
	if (hwpt1 does not have its group) { // last device to detach
		if (hwpt1's device list is empty)
			iopt_table_remove_domain/list_del(hwpt1);
		iommu_detach_group();
	}
	iopt_remove_reserved_iova(hwpt1->iopt, dev2);
	idev2->hwpt = NULL;
	refcount_dec();
	...
	// ATTACH dev1 to hwpt2;
	iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1);
	if (hwpt2 does not have its group) { // first device to attach
		iommu_attach_group();
		if (hwpt2's device list is empty)
			iopt_table_add_domain/list_add(hwpt2);
	}
	idev1->hwpt = hwpt2;
	refcount_inc();
	Log dev1 in the hwpt2's device list;
	// ATTACH dev2 to hwpt2;
	iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2);
	NOP; // hwpt2 has its group;
	idev2->hwpt = hwpt2;
	refcount_inc();
	Log dev2 in the hwpt2's device list;


[correct (?) replace flow - scenario 1 above]

	// 1.d Switch (REPLACE) dev1 from hwpt1 to hwpt2;
	partial detach (dev1) {
		Log dev1 out of the hwpt1's device list;
		NOP // hwpt1 has its group, and hwpt1's device list isn't empty
		iopt_remove_reserved_iova(hwpt1->iopt, dev1);
		refcount_dec();
	}
	iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev1);
	if (hwpt2 does not have its group) { // first device to replace
		iommu_group_replace_domain();
		if (hwpt2's device list is empty)
			iopt_table_add_domain/list_add(hwpt2);
	}
	idev1->hwpt = hwpt2;
	refcount_inc();
	Log dev1 in the hwpt2's device list;

	// 1.e Switch (REPLACE) dev2 from hwpt1 to hwpt2;
	partial detach (dev2) {
		Log dev2 out of the hwpt1's device list;
		NOP // hwpt1 has its group, and hwpt1's device list isn't empty
		iopt_remove_reserved_iova(hwpt1->iopt, dev2);
		refcount_dec();
	}
	iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev2);
	NOP; // hwpt2 has its group, and hwpt2's device list isn't empty
	idev2->hwpt = hwpt2;
	refcount_inc();
	Log dev2 in the hwpt2's device list;

	// 1.f Switch (REPLACE) dev3 from hwpt1 to hwpt2;
	partial detach (dev3) {
		Log dev3 out of the hwpt1's device list;
		if (hwpt1 does not have its group) { // last device to detach
			if (hwpt1's device list is empty)
				iopt_table_remove_domain/list_del(hwpt1);
		}
		iopt_remove_reserved_iova(hwpt1->iopt, dev3);
		refcount_dec();
	}
	iopt_table_enforce_group_resv_regions(hwpt2->iopt, dev3);
	NOP; // hwpt2 has its group, and hwpt2's device list isn't empty
	idev3->hwpt = hwpt2;
	refcount_inc();
	Log dev3 in the hwpt2's device list;

And this would also complicate the error-out routines...

> I'm yet to figure out whether we have sufficient lock protection to
> prevent other paths from using old iopt/hwpt to find the device
> which is already attached to a different domain.

With the correct (?) flow, I think it'd be safer for a one-device
group. But it gets trickier in the multi-device case above: dev2
and dev3 are already in the new domain, yet all their iommufd
objects (idev->hwpt and iopt) are lagging behind. Do we need a
lock across the three ioctls?

One way to avoid that is to force-update idev2 and idev3 too when
idev1 does a replace -- by iterating over all same-group devices
in the old hwpt. This, however, feels like a violation of the
device-centric model...

i.e. scenario 1:
	a. ATTACH dev1 to hwpt1;
	b. ATTACH dev2 to hwpt1;
	c. ATTACH dev3 to hwpt1;
	d. ATTACH (REPLACE) dev1 to hwpt2;
	   (do hwpt1 cleanup; replace dev2&3 and update idev2&3 too; do hwpt2 init)
	e. ATTACH (REPLACE) dev2 to hwpt2; // user space must do, or be aware of 1.d
	   (kernel does dummy)
	f. ATTACH (REPLACE) dev3 to hwpt2; // user space must do, or be aware of 1.d
	   (kernel does dummy)
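The force-update idea can be sketched the same way, with a single
hypothetical counter model (again illustrative only, not the real
API): the first REPLACE call migrates every group member, so the
follower calls become dummies.

```c
#include <assert.h>

/* Hypothetical model of the force-update alternative: the first
 * REPLACE migrates every same-group device out of the old hwpt */
struct group_state {
	int hwpt1_devs; /* devices whose idev->hwpt still points at hwpt1 */
	int hwpt2_devs; /* devices whose idev->hwpt was updated to hwpt2 */
	int domain;     /* which hwpt the group's domain points at: 1 or 2 */
};

static void force_replace(struct group_state *s)
{
	if (s->domain == 1) {
		s->domain = 2;                  /* iommu_group_replace_domain() */
		s->hwpt2_devs += s->hwpt1_devs; /* force-update idev2 & idev3 */
		s->hwpt1_devs = 0;              /* hwpt1 cleanup can run now */
	}
	/* follower calls: kernel does dummy */
}
```

The upside is that no state lags between the ioctls; the downside is
exactly the device-centric violation noted above.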

Thanks
Nic
