From: Joao Martins <joao.m.martins@oracle.com>
To: Baolu Lu <baolu.lu@linux.intel.com>
Cc: iommu@lists.linux.dev, Kevin Tian <kevin.tian@intel.com>,
Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>,
Yi Liu <yi.l.liu@intel.com>, Yi Y Sun <yi.y.sun@intel.com>,
Nicolin Chen <nicolinc@nvidia.com>,
Joerg Roedel <joro@8bytes.org>,
Suravee Suthikulpanit <suravee.suthikulpanit@amd.com>,
Will Deacon <will@kernel.org>,
Robin Murphy <robin.murphy@arm.com>,
Zhenzhong Duan <zhenzhong.duan@intel.com>,
Alex Williamson <alex.williamson@redhat.com>,
kvm@vger.kernel.org, Jason Gunthorpe <jgg@nvidia.com>
Subject: Re: [PATCH v4 11/18] iommu/amd: Access/Dirty bit support in IOPTEs
Date: Fri, 20 Oct 2023 12:20:44 +0100 [thread overview]
Message-ID: <395aa01c-b982-4ff3-aa05-9fa0ea50bfee@oracle.com> (raw)
In-Reply-To: <b5d304b9-d54f-4abd-bfeb-de853458d2af@oracle.com>
On 20/10/2023 10:34, Joao Martins wrote:
> On 20/10/2023 03:21, Baolu Lu wrote:
>> On 10/19/23 7:58 PM, Joao Martins wrote:
>>> On 19/10/2023 01:17, Joao Martins wrote:
>>>> On 19/10/2023 00:11, Jason Gunthorpe wrote:
>>>>> On Wed, Oct 18, 2023 at 09:27:08PM +0100, Joao Martins wrote:
>>>>>> +static int iommu_v1_read_and_clear_dirty(struct io_pgtable_ops *ops,
>>>>>> +                                         unsigned long iova, size_t size,
>>>>>> +                                         unsigned long flags,
>>>>>> +                                         struct iommu_dirty_bitmap *dirty)
>>>>>> +{
>>>>>> +        struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops);
>>>>>> +        unsigned long end = iova + size - 1;
>>>>>> +
>>>>>> +        do {
>>>>>> +                unsigned long pgsize = 0;
>>>>>> +                u64 *ptep, pte;
>>>>>> +
>>>>>> +                ptep = fetch_pte(pgtable, iova, &pgsize);
>>>>>> +                if (ptep)
>>>>>> +                        pte = READ_ONCE(*ptep);
>>>>> It is fine for now, but this is so slow for something that is such a
>>>>> fast path. We are optimizing away a TLB invalidation but leaving
>>>>> this???
>>>>>
>>>> The more obvious reason is that I'm still working towards the 'faster' page
>>>> table walker. The map/unmap code needs to do similar lookups, so I thought of
>>>> reusing the same functions as map/unmap initially, and improving it afterwards
>>>> or when introducing the splitting.
>>>>
>>>>> It is a radix tree, you walk trees by retaining your position at each
>>>>> level as you go (eg in a function per-level call chain or something)
>>>>> then ++ is cheap. Re-searching the entire tree every time is madness.
>>>> I'm aware -- I have an improved page-table walker for AMD[0] (not yet for Intel;
>>>> still in the works),
>>> Sigh, I realized that Intel's pfn_to_dma_pte() (the main lookup function for
>>> map/unmap/iova_to_phys) does something a little off when it finds a non-present
>>> PTE: it allocates a page table for it, which is not OK in this specific case (I
>>> would argue it isn't OK for iova_to_phys either, but maybe I misunderstand the
>>> expectation of that API).
>>
>> pfn_to_dma_pte() doesn't allocate a page for a non-present PTE if the
>> target_level parameter is set to 0. See below line 932.
>>
>> 913 static struct dma_pte *pfn_to_dma_pte(struct dmar_domain *domain,
>> 914                                       unsigned long pfn, int *target_level,
>> 915                                       gfp_t gfp)
>> 916 {
>>
>> [...]
>>
>> 927         while (1) {
>> 928                 void *tmp_page;
>> 929
>> 930                 offset = pfn_level_offset(pfn, level);
>> 931                 pte = &parent[offset];
>> 932                 if (!*target_level && (dma_pte_superpage(pte) || !dma_pte_present(pte)))
>> 933                         break;
>>
>> So both iova_to_phys() and read_and_clear_dirty() are doing things
>> right:
>>
>>         struct dma_pte *pte;
>>         int level = 0;
>>
>>         pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT,
>>                              &level, GFP_KERNEL);
>>         if (pte && dma_pte_present(pte)) {
>>                 /* The PTE is valid, check anything you want! */
>>                 ... ...
>>         }
>>
>> Or, I am overlooking something else?
>
> You're right, thanks for keeping me straight -- I was already doing the
> right thing. I'd forgotten about it in the midst of the other code -- probably
> worth a comment in the caller to make it obvious.
For what it's worth, this is the improved page-table walker I have in staging (as
a separate patch) alongside the AMD one. It is quite similar, except that the AMD
IOMMU supports a bigger set of page sizes in its PTEs; the crux of the walk is the
same, modulo the different coding styles of the two IOMMU drivers.
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 97558b420e35..f6990962af2a 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4889,14 +4889,52 @@ static int intel_iommu_set_dirty_tracking(struct iommu_domain *domain,
 	return ret;
 }
 
+static int walk_dirty_dma_pte_level(struct dmar_domain *domain, int level,
+				    struct dma_pte *pte, unsigned long start_pfn,
+				    unsigned long last_pfn, unsigned long flags,
+				    struct iommu_dirty_bitmap *dirty)
+{
+	unsigned long pfn, page_size;
+
+	pfn = start_pfn;
+	pte = &pte[pfn_level_offset(pfn, level)];
+
+	do {
+		unsigned long level_pfn = pfn & level_mask(level);
+		unsigned long level_last;
+
+		if (!dma_pte_present(pte))
+			goto next;
+
+		if (level > 1 && !dma_pte_superpage(pte)) {
+			level_last = level_pfn + level_size(level) - 1;
+			level_last = min(level_last, last_pfn);
+			walk_dirty_dma_pte_level(domain, level - 1,
+						 phys_to_virt(dma_pte_addr(pte)),
+						 pfn, level_last,
+						 flags, dirty);
+		} else {
+			page_size = level_size(level) << VTD_PAGE_SHIFT;
+
+			if (dma_sl_pte_test_and_clear_dirty(pte, flags))
+				iommu_dirty_bitmap_record(dirty,
+							  pfn << VTD_PAGE_SHIFT,
+							  page_size);
+		}
+next:
+		pfn = level_pfn + level_size(level);
+	} while (!first_pte_in_page(++pte) && pfn <= last_pfn);
+
+	return 0;
+}
+
 static int intel_iommu_read_and_clear_dirty(struct iommu_domain *domain,
 					    unsigned long iova, size_t size,
 					    unsigned long flags,
 					    struct iommu_dirty_bitmap *dirty)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
-	unsigned long end = iova + size - 1;
-	unsigned long pgsize;
+	unsigned long start_pfn, last_pfn;
 
 	/*
 	 * IOMMUFD core calls into a dirty tracking disabled domain without an
@@ -4907,24 +4945,14 @@ static int intel_iommu_read_and_clear_dirty(struct
iommu_domain *domain,
if (!dmar_domain->dirty_tracking && dirty->bitmap)
return -EINVAL;
- do {
- struct dma_pte *pte;
- int lvl = 0;
-
- pte = pfn_to_dma_pte(dmar_domain, iova >> VTD_PAGE_SHIFT, &lvl,
- GFP_ATOMIC);
- pgsize = level_size(lvl) << VTD_PAGE_SHIFT;
- if (!pte || !dma_pte_present(pte)) {
- iova += pgsize;
- continue;
- }
- if (dma_sl_pte_test_and_clear_dirty(pte, flags))
- iommu_dirty_bitmap_record(dirty, iova, pgsize);
- iova += pgsize;
- } while (iova < end);
+ start_pfn = iova >> VTD_PAGE_SHIFT;
+ last_pfn = (iova + size - 1) >> VTD_PAGE_SHIFT;
- return 0;
+ return walk_dirty_dma_pte_level(dmar_domain,
+ agaw_to_level(dmar_domain->agaw),
+ dmar_domain->pgd, start_pfn, last_pfn,
+ flags, dirty);
}
const struct iommu_dirty_ops intel_dirty_ops = {