From: Jason Gunthorpe <jgg@nvidia.com>
To: Lu Baolu <baolu.lu@linux.intel.com>,
	bpf@vger.kernel.org, Jonathan Corbet <corbet@lwn.net>,
	David Woodhouse <dwmw2@infradead.org>,
	iommu@lists.linux.dev, Joerg Roedel <joro@8bytes.org>,
	Kevin Tian <kevin.tian@intel.com>,
	linux-doc@vger.kernel.org, linux-kselftest@vger.kernel.org,
	llvm@lists.linux.dev, Nathan Chancellor <nathan@kernel.org>,
	Nick Desaulniers <ndesaulniers@google.com>,
	Miguel Ojeda <ojeda@kernel.org>,
	Robin Murphy <robin.murphy@arm.com>,
	Shuah Khan <shuah@kernel.org>,
	Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
	Tom Rix <trix@redhat.com>, Will Deacon <will@kernel.org>
Cc: Alex Williamson <alex.williamson@redhat.com>,
	Chaitanya Kulkarni <chaitanyak@nvidia.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	David Gibson <david@gibson.dropbear.id.au>,
	Eric Auger <eric.auger@redhat.com>,
	Eric Farman <farman@linux.ibm.com>,
	Jason Wang <jasowang@redhat.com>,
	Jean-Philippe Brucker <jean-philippe@linaro.org>,
	Joao Martins <joao.m.martins@oracle.com>,
	kvm@vger.kernel.org, Matthew Rosato <mjrosato@linux.ibm.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	Nicolin Chen <nicolinc@nvidia.com>,
	Niklas Schnelle <schnelle@linux.ibm.com>,
	Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
	Yi Liu <yi.l.liu@intel.com>, Keqian Zhu <zhukeqian1@huawei.com>
Subject: Re: [PATCH v3 8/15] iommufd: Algorithms for PFN storage
Date: Thu, 3 Nov 2022 17:08:08 -0300
Message-ID: <Y2QfqAWxqT5cCfmN@nvidia.com>
In-Reply-To: <8-v3-402a7d6459de+24b-iommufd_jgg@nvidia.com>

On Tue, Oct 25, 2022 at 03:12:17PM -0300, Jason Gunthorpe wrote:
> +
> +/**
> + * iopt_area_fill_domains() - Install PFNs into the area's domains
> + * @area: The area to act on
> + * @pages: The pages associated with the area (area->pages is NULL)
> + *
> + * Called during area creation. The area is freshly created and not inserted in
> + * the domains_itree yet. PFNs are read and loaded into every domain held in the
> + * area's io_pagetable and the area is installed in the domains_itree.
> + *
> + * On failure all domains are left unchanged.
> + */
> +int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
> +{
> +	struct pfn_reader pfns;
> +	struct iommu_domain *domain;
> +	unsigned long unmap_index;
> +	unsigned long index;
> +	int rc;
> +
> +	lockdep_assert_held(&area->iopt->domains_rwsem);
> +
> +	if (xa_empty(&area->iopt->domains))
> +		return 0;
> +
> +	mutex_lock(&pages->mutex);
> +	rc = pfn_reader_first(&pfns, pages, iopt_area_index(area),
> +			      iopt_area_last_index(area));
> +	if (rc)
> +		goto out_unlock;
> +
> +	while (!pfn_reader_done(&pfns)) {
> +		xa_for_each(&area->iopt->domains, index, domain) {
> +			rc = batch_to_domain(&pfns.batch, domain, area,
> +					     pfns.batch_start_index);
> +			if (rc)
> +				goto out_unmap;
> +		}
> +
> +		rc = pfn_reader_next(&pfns);
> +		if (rc)
> +			goto out_unmap;
> +	}
> +	rc = pfn_reader_update_pinned(&pfns);
> +	if (rc)
> +		goto out_unmap;
> +
> +	area->storage_domain = xa_load(&area->iopt->domains, 0);
> +	interval_tree_insert(&area->pages_node, &pages->domains_itree);
> +	goto out_destroy;
> +
> +out_unmap:
> +	xa_for_each(&area->iopt->domains, unmap_index, domain) {
> +		unsigned long end_index = pfns.batch_start_index;
> +
> +		if (unmap_index <= index)
> +			end_index = pfns.batch_end_index;

syzkaller found that there is a typo here: the comparison should be <,
not <=.

However, I wasn't able to quickly reproduce what should have been a
very reliable failure path using nth fault injection. This led to a
great big adventure where I discovered that fault injection and xarray
do not play nicely together:

https://lore.kernel.org/r/Y2QR0EDvq7p9i1xw@nvidia.com

That ended up taking a whole bunch of time: adding an nth failure
study to the test suite, understanding what is going on, and making it
work better. The suite now covers this scenario deterministically.
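
For reference, the study is just an exhaustive sweep of the injection
point using the kernel's per-task fail-nth interface (see
Documentation/fault-injection/fault-injection.rst). A minimal userspace
sketch, assuming fault injection is enabled in the kernel config;
run_op() here is a placeholder for whatever ioctl sequence is under
test, not an iommufd API:

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static int run_op(void);	/* placeholder: the operation under test */

int main(void)
{
	for (unsigned int nth = 1;; nth++) {
		char buf[32];
		ssize_t got;
		int fd, rc;

		fd = open("/proc/self/fail-nth", O_RDWR);
		if (fd < 0)
			return 1;

		/* Arm: the nth failable call in this task will fail */
		got = snprintf(buf, sizeof(buf), "%u", nth);
		pwrite(fd, buf, got, 0);

		rc = run_op();

		/* Reading back 0 means the nth failure really was injected */
		got = pread(fd, buf, sizeof(buf) - 1, 0);
		close(fd);
		buf[got > 0 ? got : 0] = 0;
		if (atoi(buf) != 0) {
			/* nth exceeded the number of failable calls: done */
			printf("exhausted after %u points, last rc=%d\n",
			       nth - 1, rc);
			return 0;
		}
		/* A failure was injected; rc should reflect a clean unwind */
	}
}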

The exhaustive nth failure study also shows this error handling has
another, more serious, problem. We keep track of how many pages have
been pinned inside the iopt_pages, and we also keep track of the last
charge made to the rlimit/etc. At the end of an operation these two are
reconciled. There are lots of assertions checking that this is tracked
properly so that we don't lose track of a pinned page in the very
complicated logic.
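
The accounting model is roughly the following sketch. The field names
follow drivers/iommu/iommufd/pages.c, but the bodies and the
charge_rlimit()/uncharge_rlimit() helpers are illustrative stand-ins,
not the real implementation:

/* Hypothetical helpers standing in for the real rlimit accounting */
static void charge_rlimit(unsigned long npages);
static void uncharge_rlimit(unsigned long npages);

struct pages_accounting {
	unsigned long npinned;		/* pages currently pinned */
	unsigned long last_npinned;	/* value last charged to the rlimit */
};

static void add_npinned(struct pages_accounting *p, unsigned long n)
{
	p->npinned += n;		/* pages were just pinned */
}

static void sub_npinned(struct pages_accounting *p, unsigned long n)
{
	p->npinned -= n;		/* pages were just unpinned */
}

/* At the end of an operation the delta is reconciled with the rlimit */
static void update_pinned(struct pages_accounting *p)
{
	if (p->npinned >= p->last_npinned)
		charge_rlimit(p->npinned - p->last_npinned);
	else
		uncharge_rlimit(p->last_npinned - p->npinned);
	p->last_npinned = p->npinned;
}

/* ... and pfn_reader_destroy() asserts npinned == last_npinned */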

The new test suite discovered that this was missing:

 		/* Any pages not transferred to the batch are just unpinned */
 		unpin_user_pages(pfns->user.upages + (pfns->batch_end_index -
 						      pfns->user.upages_start),
 				 npages);
+		iopt_pages_sub_npinned(pages, npages);

We need to charge back the as-yet-unprocessed pages we are unpinning
when destroying the batch.

Then we get into trouble because things do not happen in the order the
assertions expect, since this:

> +			iopt_area_unfill_partial_domain(area, pages, domain,
> +							end_index);

Demands that the pinned page charge during unfilling only ever
decreases, never increases. However, we can still be holding pinned
pages in the batch at this instant, and those don't get cleaned up
until:

> +		}
> +	}
> +out_destroy:
> +	pfn_reader_destroy(&pfns);

Here

Thus the assertions get unhappy.

Introducing a pfn_reader_release_pins(), called before unfilling, gets
everything into the right order, and the testing of these two functions
now passes, though I still have to insert a few more error injection
points to get full coverage.

Syzkaller has found another four things I still have to look at, and
coverage is now sitting at 65% (72%). So, steadily progressing...

Jason

diff --git a/drivers/iommu/iommufd/pages.c b/drivers/iommu/iommufd/pages.c
index 245e7b96902107..ce707d6f5ee959 100644
--- a/drivers/iommu/iommufd/pages.c
+++ b/drivers/iommu/iommufd/pages.c
@@ -994,7 +994,15 @@ static int pfn_reader_init(struct pfn_reader *pfns, struct iopt_pages *pages,
 	return 0;
 }
 
-static void pfn_reader_destroy(struct pfn_reader *pfns)
+/*
+ * There are many assertions regarding the state of pages->npinned vs
+ * pages->last_pinned, for instance something like unmapping a domain must only
+ * decrement the npinned, and pfn_reader_destroy() must be called only after all
+ * the pins are updated. This is fine for success flows, but error flows
+ * sometimes need to release the pins held inside the pfn_reader before going on
+ * to complete unmapping and releasing pins held in domains.
+ */
+static void pfn_reader_release_pins(struct pfn_reader *pfns)
 {
 	struct iopt_pages *pages = pfns->pages;
 
@@ -1005,12 +1013,20 @@ static void pfn_reader_destroy(struct pfn_reader *pfns)
 		unpin_user_pages(pfns->user.upages + (pfns->batch_end_index -
 						      pfns->user.upages_start),
 				 npages);
+		iopt_pages_sub_npinned(pages, npages);
+		pfns->user.upages_end = pfns->batch_end_index;
 	}
-
-	pfn_reader_user_destroy(&pfns->user, pfns->pages);
-
 	if (pfns->batch_start_index != pfns->batch_end_index)
 		pfn_reader_unpin(pfns);
+	pfns->batch_start_index = pfns->batch_end_index;
+}
+
+static void pfn_reader_destroy(struct pfn_reader *pfns)
+{
+	struct iopt_pages *pages = pfns->pages;
+
+	pfn_reader_release_pins(pfns);
+	pfn_reader_user_destroy(&pfns->user, pfns->pages);
 	batch_destroy(&pfns->batch, NULL);
 	WARN_ON(pages->last_npinned != pages->npinned);
 }
@@ -1223,6 +1239,7 @@ void iopt_area_unfill_domain(struct iopt_area *area, struct iopt_pages *pages,
  */
 int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 {
+	unsigned long done_end_index;
 	struct pfn_reader pfns;
 	int rc;
 
@@ -1234,10 +1251,12 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 		return rc;
 
 	while (!pfn_reader_done(&pfns)) {
+		done_end_index = pfns.batch_start_index;
 		rc = batch_to_domain(&pfns.batch, domain, area,
 				     pfns.batch_start_index);
 		if (rc)
 			goto out_unmap;
+		done_end_index = pfns.batch_end_index;
 
 		rc = pfn_reader_next(&pfns);
 		if (rc)
@@ -1250,8 +1269,9 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
 	goto out_destroy;
 
 out_unmap:
+	pfn_reader_release_pins(&pfns);
 	iopt_area_unfill_partial_domain(area, area->pages, domain,
-					pfns.batch_start_index);
+					done_end_index);
 out_destroy:
 	pfn_reader_destroy(&pfns);
 	return rc;
@@ -1270,9 +1290,11 @@ int iopt_area_fill_domain(struct iopt_area *area, struct iommu_domain *domain)
  */
 int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 {
-	struct pfn_reader pfns;
+	unsigned long done_first_end_index;
+	unsigned long done_all_end_index;
 	struct iommu_domain *domain;
 	unsigned long unmap_index;
+	struct pfn_reader pfns;
 	unsigned long index;
 	int rc;
 
@@ -1288,12 +1310,15 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 		goto out_unlock;
 
 	while (!pfn_reader_done(&pfns)) {
+		done_first_end_index = pfns.batch_end_index;
+		done_all_end_index = pfns.batch_start_index;
 		xa_for_each(&area->iopt->domains, index, domain) {
 			rc = batch_to_domain(&pfns.batch, domain, area,
 					     pfns.batch_start_index);
 			if (rc)
 				goto out_unmap;
 		}
+		done_all_end_index = done_first_end_index;
 
 		rc = pfn_reader_next(&pfns);
 		if (rc)
@@ -1308,11 +1333,14 @@ int iopt_area_fill_domains(struct iopt_area *area, struct iopt_pages *pages)
 	goto out_destroy;
 
 out_unmap:
+	pfn_reader_release_pins(&pfns);
 	xa_for_each(&area->iopt->domains, unmap_index, domain) {
-		unsigned long end_index = pfns.batch_start_index;
+		unsigned long end_index;
 
-		if (unmap_index <= index)
-			end_index = pfns.batch_end_index;
+		if (unmap_index < index)
+			end_index = done_first_end_index;
+		else
+			end_index = done_all_end_index;
 
 		/*
 		 * The area is not yet part of the domains_itree so we have to
