* Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-03 0:33 UTC (permalink / raw)
To: Robin Murphy, jgg, kevin.tian
Cc: yi.l.liu, eric.auger, baolu.lu, shameerali.kolothum.thodi,
jean-philippe, iommu
Hi all,
Per the discussion in the nested SMMUv3 series[1], we have come up
with a couple of ideas to accelerate the invalidation uAPI.

I have drafted two solutions and would like to collect your inputs,
so we can decide which approach to take in the next version.
The first solution simply forwards each command in its entirety.
This saves a few CPU cycles by skipping the packing/unpacking of
invalidation fields through a data structure, vs. the structure
in v1[2].
[User Data]
struct iommu_hwpt_invalidate_arm_smmuv3 {
	__u64 cmd[2];
	__u32 error;
	__u32 __reserved;
};
[Driver Handler]
// This draft relies on the set_rid/unset_rid ioctls from the 2nd
// draft to sanity-check commands and report errors.
static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
					   void *user_data)
{
	struct iommu_hwpt_invalidate_arm_smmuv3 *inv_info = user_data;
	struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
	struct arm_smmu_device *smmu = smmu_domain->smmu;
	u64 cmd[CMDQ_ENT_DWORDS];

	if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
		return;

	cmd[0] = inv_info->cmd[0];
	cmd[1] = inv_info->cmd[1];

	switch (cmd[0] & CMDQ_0_OP) {
	case CMDQ_OP_TLBI_NSNH_ALL:
		cmd[0] &= ~CMDQ_0_OP;
		cmd[0] |= CMDQ_OP_TLBI_NH_ALL;
		fallthrough;
	case CMDQ_OP_TLBI_NH_VA:
	case CMDQ_OP_TLBI_NH_VAA:
	case CMDQ_OP_TLBI_NH_ALL:
	case CMDQ_OP_TLBI_NH_ASID:
		cmd[0] &= ~CMDQ_TLBI_0_VMID;
		cmd[0] |= FIELD_PREP(CMDQ_TLBI_0_VMID,
				     smmu_domain->s2->s2_cfg.vmid);
		arm_smmu_cmdq_issue_cmdlist(smmu, cmd, 1, true);
		break;
	case CMDQ_OP_CFGI_CD:
	case CMDQ_OP_CFGI_CD_ALL:
		arm_smmu_sync_cd(smmu_domain,
				 FIELD_GET(CMDQ_CFGI_0_SSID, cmd[0]), false);
		break;
	default:
		return;
	}
}
The second solution goes a step further to support batching, by
sharing a kernel page via mmap().
First, I drafted a pair of new ioctls and iommu_ops callbacks to
link a vSID with a pSID. This seems to be useful for Intel VT-d also.
https://github.com/nicolinc/iommufd/commit/cbec01245edc70e5de88fbc477d8f0af50b3b0ed
Then I added a new mmap interface to share kernel page(s) from the
driver, allowing QEMU to write all TLBI commands as a single batch
and then initiate the batch invalidation via another synchronous
hypercall.
https://github.com/nicolinc/iommufd/commit/ee717eb6d46b5285db1aae9172ecdfc70b9cd9ca
The two TMP changes are available in this repo:
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04022023
This solution certainly needs some rework before it can be posted
in the next version for review. Still, I'd like to hear whether it
is a good direction for us to adopt in v2.
The new set_rid/unset_rid ioctls and the mmap interface would be
essential for the VCMDQ support that we'd like to achieve at the
end of this journey. So, personally, I'd like to see them used at
this stage by the generic SMMUv3 (and potentially VT-d) too.
Thanks
Nicolin
[1] https://lore.kernel.org/linux-iommu/ZBe3kxRXf+VbKy+m@Asurada-Nvidia/
[2] https://lore.kernel.org/linux-iommu/364cfbe5b228ab178093db2de13fa3accf7a6120.1678348754.git.nicolinc@nvidia.com/
^ permalink raw reply	[flat|nested] 35+ messages in thread

* RE: Cache Invalidation Solution for Nested IOMMU
From: Liu, Yi L @ 2023-04-03  7:26 UTC (permalink / raw)
To: Nicolin Chen, Robin Murphy, jgg@nvidia.com, Tian, Kevin
Cc: eric.auger@redhat.com, baolu.lu@linux.intel.com,
	shameerali.kolothum.thodi@huawei.com, jean-philippe@linaro.org,
	iommu@lists.linux.dev, Jason Gunthorpe, peterx@redhat.com

Hi Nic,

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, April 3, 2023 8:34 AM
>
> [...]
>
> The new set_rid/unset_rid ioctls and the mmap interface would be
> essential for VCMDQ support that we'd like to achieve at the end
> of this journey. So, personally I'd like to see it can be used at
> this stage, by the generic SMMUv3 (and potentially VT-d) too.

Good try. Just curious: is the vRID->pRID conversion the only
conversion required in the cache invalidation path for ARM SMMU?

For VT-d, except for the device-TLB invalidation, there is no RID
information in the invalidation commands submitted by the iommu
driver. Device-TLB invalidation is effectively an included operation
of IOTLB invalidation when the page table is used by devices that
have enabled ATS, so the guest VT-d iommu driver may not issue
device-TLB invalidations. Hence the vRID->pRID conversion may not
happen on the VT-d side.

The VT-d side requires vPASID->pPASID and vDomain_id->pDomain_id
conversion. The vPASID conversion may only be needed later, as we may
disable guest PASID support at this moment. For the vDomain_id
conversion, there is a gap: domain_ids are not global today. So if
there are multiple vIOMMU instances exposed to the guest, the
vDomain_ids in invalidation commands submitted via different vIOMMU
instances could conflict. Even if we make the allocation global, I'm
also curious when and how we should build up the
vDomain_id->pDomain_id relationship in the kernel. Maybe a new
iommufd uAPI just like the vRID one? Or it may be part of the device
driver's attach_hwpt uAPI (e.g. VFIO), as this relationship should
only exist while the hwpt and device are attached.

Regards,
Yi Liu

^ permalink raw reply	[flat|nested] 35+ messages in thread
* RE: Cache Invalidation Solution for Nested IOMMU
From: Tian, Kevin @ 2023-04-03  8:39 UTC (permalink / raw)
To: Liu, Yi L, Nicolin Chen, Robin Murphy, jgg@nvidia.com
Cc: eric.auger@redhat.com, baolu.lu@linux.intel.com,
	shameerali.kolothum.thodi@huawei.com, jean-philippe@linaro.org,
	iommu@lists.linux.dev, Jason Gunthorpe, peterx@redhat.com

> From: Liu, Yi L <yi.l.liu@intel.com>
> Sent: Monday, April 3, 2023 3:26 PM
>
> For VT-d, except for the device-TLB invalidation, there is no RID
> information in the invalidation command submitted by iommu driver.
> Device-TLB invalidation would be somehow an included operation in
> IOTLB invalidation if the page table is used by devices that has
> enabled ATS. So guest VT-d iommu driver may not use the Device-TLB
> invalidation. So vRID->pRID conversion may not happen for VT-d side.

The guest can still submit a device-TLB invalidation descriptor. It
could just be skipped as a nop here if the host always invalidates the
device-TLB when handling the IOTLB invalidation request.

> VT-d side requires vPASID->pPASID and vDomain_id->pDomain_id
> conversion. vPASID conversion may be needed later as we may disable
> guest PASID support at this moment.

vPASID conversion is mandatory when we enable vSVA on a SIOV device.

> For vDomain_id conversion, there is a gap. [...] Maybe a new iommufd
> uapi just like the vRID set? or it may be part of the device
> driver's attach_hwpt uapi (e.g. VFIO) as this relationship should
> just exist when the hwpt and device are attached.

IMHO we need a vDID->hwpt conversion, i.e. the kernel wants to know
which s1 hwpt is covered by a vDID. From this angle, adding the vDID
at attach_hwpt time makes more sense.

But honestly speaking, I'm hesitant to introduce the native format and
those assistant APIs for VT-d at this point. Supporting the in-kernel
short path won't happen in the short term, and what we define now may
not fit the requirement when it comes.

With that, let's continue to define a customized simple format for
VT-d IOTLB invalidation, plus allow the user to batch requests. The
extra packing/unpacking overhead is negligible compared to the long
invalidation path at this moment. Then we can consider the native
format as a 2nd supported format later, when in-kernel acceleration
is being worked on.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-03 15:24 UTC (permalink / raw)
To: Tian, Kevin
Cc: Liu, Yi L, Robin Murphy, jgg@nvidia.com, eric.auger@redhat.com,
	baolu.lu@linux.intel.com, shameerali.kolothum.thodi@huawei.com,
	jean-philippe@linaro.org, iommu@lists.linux.dev, peterx@redhat.com

On Mon, Apr 03, 2023 at 08:39:02AM +0000, Tian, Kevin wrote:

> > From: Liu, Yi L <yi.l.liu@intel.com>
> > Sent: Monday, April 3, 2023 3:26 PM
> >
> > For VT-d, except for the device-TLB invalidation, there is no RID
> > information in the invalidation command submitted by iommu driver.
> > [...]
>
> Guest can still submit a device-tlb invalidation descriptor. Just that it
> could be skipped as a nop here if the host always invalidates device-tlb
> when handling the iotlb invalidation request.

In either case, it sounds like the RID isn't needed in the VT-d case?
SMMU needs the vSID<->pSID info to verify an ATS invalidation.

> > VT-d side requires vPASID->pPASID and vDomain_id->pDomain_id
> > conversion. vPASID conversion may be needed later as we may disable
> > guest PASID
>
> vPASID conversion is mandatory when we enable vSVA on SIOV device.

vPASID is allocated at runtime, so the hypercall timing is a bit
different than SMMU's vSID. But I think it could go with this uAPI
too? We'd just need to turn the uAPI into a shareable one.

> IMHO we need a vDID->hwpt conversion i.e. the kernel wants to know
> which s1 hwpt is covered by a vDID. From this angle adding vDID at
> attach_hwpt makes more sense.

I share the same view. A "vDID" is HWPT oriented, while the
set/unset_rid_user uAPI is for passthrough devices. So it'd probably
be better to go with a different uAPI at least.

> But honestly speaking I'm hesitating to introduce native format and those
> assistant APIs for VT-d at this point. [...] Then we can consider
> native format as a 2nd supported format later when in-kernel
> acceleration is being worked on.

It'd be okay to do it later for VT-d, so long as the uAPI we add for
SMMUv3 would potentially fit VT-d too :)

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 35+ messages in thread
* RE: Cache Invalidation Solution for Nested IOMMU
From: Tian, Kevin @ 2023-04-04  2:42 UTC (permalink / raw)
To: Nicolin Chen
Cc: Liu, Yi L, Robin Murphy, jgg@nvidia.com, eric.auger@redhat.com,
	baolu.lu@linux.intel.com, shameerali.kolothum.thodi@huawei.com,
	jean-philippe@linaro.org, iommu@lists.linux.dev, peterx@redhat.com

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, April 3, 2023 11:24 PM
>
> vPASID is allocated at runtime, so the hypercall timing is a
> bit different than SMMU's vSID. But I think it could go with
> this uAPI too? We'd just need to turn the uAPI to a shareable
> one.

Not necessarily. It's clearer as a separate cmd and format.

> It'd be okay to do it later for VT-d, so long as the uAPI we
> add for SMMUv3 would potentially fit VT-d too :)

Yes. Btw, you need to decide which usages are comprehended in this
design:

1) vSMMU reads cmds from the guest TLBI queue when the tail register
   is written, and then submits the cmds in a user-provided buffer to
   the kernel.

   This is the basic path.

2) vSMMU reads the base addr of the guest TLBI queue when the start
   register is written, and registers the guest queue with the kernel.
   In the meantime, a protocol is established between kvm and the smmu
   driver so that when kvm traps a guest write to the tail register,
   it directly notifies the smmu driver and skips userspace. The smmu
   driver then reads cmds directly from the guest queue to handle
   them.

   This is the in-kernel short path.

3) With VCMDQ, vSMMU needs to mmap the start/head/tail/... registers
   of VCMDQ and allow the guest to access them directly. No host
   intervention when the guest submits cmds to VCMDQ.

   This is the hw acceleration path.

I'm a bit confused in some discussions whether what you implement for
1) must be forward compatible with 2) and 3). This is difficult to
judge before we actually start working on them. Given an iommu driver
will support multiple formats (e.g. when vhost-iommu comes), probably
we should focus more on the minimum necessary for what 1) actually
requires now?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-04  3:12 UTC (permalink / raw)
To: Tian, Kevin
Cc: Liu, Yi L, Robin Murphy, jgg@nvidia.com, eric.auger@redhat.com,
	baolu.lu@linux.intel.com, shameerali.kolothum.thodi@huawei.com,
	jean-philippe@linaro.org, iommu@lists.linux.dev, peterx@redhat.com

On Tue, Apr 04, 2023 at 02:42:49AM +0000, Tian, Kevin wrote:

> > vPASID is allocated at runtime, so the hypercall timing is a
> > bit different than SMMU's vSID. But I think it could go with
> > this uAPI too? We'd just need to turn the uAPI to a shareable
> > one.
>
> Not necessarily. It's clearer to be a separate cmd and format.

OK. Then set/unset_rid_user can be standalone, though it likely needs
a better name or so.

> Yes. btw you need decide which usage is comprehended in this design:
>
> 1) vSMMU reads cmd from guest TLBI queue when the tail register is
>    written and then submits the cmd in a user-provided buffer to the
>    kernel.
>
> [...]
>
> I'm a bit confused in some discussions whether what you implemented
> for 1) must be forward compatible with 2) and 3). This is difficult before
> we actually start working on them. Given an iommu driver will support
> multiple formats (e.g. when vhost-iommu comes) probably we should
> more focus on the minimal necessary for what 1) actually requires now?

My draft is more of an enhanced (1) with batching, while using an mmap
interface with (3) in mind. Compared to (2), it simplifies the host
kernel, as QEMU can load every TLBI command into an mmap'd buffer
whenever it traps one.

Maybe (2) would be a cleaner implementation. I could try it too.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU
From: Jason Gunthorpe @ 2023-04-03 12:23 UTC (permalink / raw)
To: Liu, Yi L
Cc: Nicolin Chen, Robin Murphy, Tian, Kevin, eric.auger@redhat.com,
	baolu.lu@linux.intel.com, shameerali.kolothum.thodi@huawei.com,
	jean-philippe@linaro.org, iommu@lists.linux.dev, peterx@redhat.com

On Mon, Apr 03, 2023 at 07:26:28AM +0000, Liu, Yi L wrote:

> For VT-d, except for the device-TLB invalidation, there is no RID
> information in the invalidation command submitted by iommu driver.
> Device-TLB invalidation would be somehow an included operation in
> IOTLB invalidation if the page table is used by devices that has
> enabled ATS. So guest VT-d iommu driver may not use the Device-TLB
> invalidation. So vRID->pRID conversion may not happen for VT-d side.

It is the same as ARM; the RID translation is only needed to support
the ATS case.

> Domain_id is not global today. So if there are multiple vIOMMU instance
> exposed to guest, the vDomain_id would have conflict between the
> invalidation commands submitted via different vIOMMU instances.

You'd need to expose multiple physical instances as multiple virtual
instances.

> we may make its allocation global, I'm also curious when and how should
> we build up the vDomain_id->pDomain_id relationship in kernel. Maybe a
> new iommufd uapi just like the vRID set?

I'm not so sure it even works sensibly for VT-d. Don't you also have
to convert the type of invalidation as well? Eg the nested mode uses
the DID quite differently than a guest that is not nesting. IIRC you
end up converting vDIDs into pPASID/pDIDs and doing a PASID
invalidation.

The point of doing all this stuff is to avoid invalidation
multiplication: one guest invalidation should translate into one host
invalidation. When properly set up, guest invalidations should not
iterate over host objects.

Jason

^ permalink raw reply	[flat|nested] 35+ messages in thread
* RE: Cache Invalidation Solution for Nested IOMMU
From: Tian, Kevin @ 2023-04-03  8:00 UTC (permalink / raw)
To: Nicolin Chen, Robin Murphy, jgg@nvidia.com
Cc: Liu, Yi L, eric.auger@redhat.com, baolu.lu@linux.intel.com,
	shameerali.kolothum.thodi@huawei.com, jean-philippe@linaro.org,
	iommu@lists.linux.dev

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, April 3, 2023 8:34 AM
>
> The new set_rid/unset_rid ioctls and the mmap interface would be
> essential for VCMDQ support that we'd like to achieve at the end
> of this journey. So, personally I'd like to see it can be used at
> this stage, by the generic SMMUv3 (and potentially VT-d) too.

We talked earlier about there being multiple VCMDQs when the guest
is assigned multiple devices behind different SMMUs. How does the
per-iommufd mmap interface work in that scenario?

Also, this looks different from the requirement of having a software
short path in the kernel to reduce the invalidation overhead for
emulated vIOMMUs. In that case the invalidation queue is in guest
memory, so instead we'd want a registration cmd here.

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-03 14:29 UTC (permalink / raw)
To: Tian, Kevin
Cc: Robin Murphy, jgg@nvidia.com, Liu, Yi L, eric.auger@redhat.com,
	baolu.lu@linux.intel.com, shameerali.kolothum.thodi@huawei.com,
	jean-philippe@linaro.org, iommu@lists.linux.dev

On Mon, Apr 03, 2023 at 08:00:12AM +0000, Tian, Kevin wrote:

> We talked earlier that there could be multiple VCMDQ's when the
> guest is assigned multiple devices behind different SMMU's. How
> does the mmap interface per iommufd work in that scenario?

I am trying to document that each IOMMUFD object can possibly have a
shared page; the mmap interface takes the ID of an IOMMUFD object as
an index. So either a pt_id (S1) or a dev_id should be able to
identify which physical SMMU, I think.

> and looks this is different from the requirement of having a
> software short path in kernel to reduce the invalidation overhead
> for emulated vIOMMUs. In this case the invalidation queue is
> in guest memory then instead we want a registration cmd here.

Yes, for the first part. There are certain difficulties in doing a
short path, such as the host kernel replacing the host queue that the
actual HW runs on with a guest TLBI queue. So my draft is more about
batching.

For the last part, what is that "registration cmd" for? In my draft,
the hypervisor dispatches all invalidation commands to the guest TLBI
queue (or call it a user queue), which is transparent to the guest OS.

Thanks
Nicolin

^ permalink raw reply	[flat|nested] 35+ messages in thread
* RE: Cache Invalidation Solution for Nested IOMMU
From: Tian, Kevin @ 2023-04-04  2:15 UTC (permalink / raw)
To: Nicolin Chen
Cc: Robin Murphy, jgg@nvidia.com, Liu, Yi L, eric.auger@redhat.com,
	baolu.lu@linux.intel.com, shameerali.kolothum.thodi@huawei.com,
	jean-philippe@linaro.org, iommu@lists.linux.dev

> From: Nicolin Chen <nicolinc@nvidia.com>
> Sent: Monday, April 3, 2023 10:30 PM
>
> Trying to documenting that each IOMMUFD object can possibly have
> a shared page, the mmap interface takes the index of an IOMMUFD
> object ID. So, either a pt_id(S1) or a dev_id should be able to
> identify which physical SMMU, I think.

Are all the cmds allowed in VCMDQ per-hwpt? If not, then building the
mmap interface per hwpt object is not correct. We may want an explicit
VCMDQ object in that case.

Also, devices behind different SMMUs may be attached to the same hwpt.
In that case the number of VCMDQs associated with a hwpt is also
dynamic.

But if we're just talking about batching for an emulated smmu, then
having the user pass a big buffer makes more sense.

> > and looks this is different from the requirement of having a
> > software short path in kernel to reduce the invalidation overhead
> > for emulated vIOMMUs. In this case the invalidation queue is
> > in guest memory then instead we want a registration cmd here.
>
> For the last part, what's that "registration cmd" to do? In my
> draft, the hypervisor dispatch all invalidation commands to the
> guest TLBI queue (or call it user queue), which is transparent
> to the guest OS.

Registration means the user passes the buffer to the kernel. If we
want to support the kernel short path, then we want the host smmu
driver to read cmds directly out of the guest TLBI queue.

^ permalink raw reply	[flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-04 2:15 ` Tian, Kevin @ 2023-04-04 2:47 ` Nicolin Chen 0 siblings, 0 replies; 35+ messages in thread From: Nicolin Chen @ 2023-04-04 2:47 UTC (permalink / raw) To: Tian, Kevin Cc: Robin Murphy, jgg@nvidia.com, Liu, Yi L, eric.auger@redhat.com, baolu.lu@linux.intel.com, shameerali.kolothum.thodi@huawei.com, jean-philippe@linaro.org, iommu@lists.linux.dev On Tue, Apr 04, 2023 at 02:15:38AM +0000, Tian, Kevin wrote: > External email: Use caution opening links or attachments > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > Sent: Monday, April 3, 2023 10:30 PM > > > > On Mon, Apr 03, 2023 at 08:00:12AM +0000, Tian, Kevin wrote: > > > External email: Use caution opening links or attachments > > > > > > > > > > From: Nicolin Chen <nicolinc@nvidia.com> > > > > Sent: Monday, April 3, 2023 8:34 AM > > > > > > > > The new set_rid/unset_rid ioctls and the mmap interface would be > > > > essential for VCMDQ support that we'd like to achieve at the end > > > > of this journey. So, personally I'd like to see it can be used at > > > > this stage, by the generic SMMUv3 (and potentially VT-d) too. > > > > > > > > > > We talked earlier that there could be multiple VCMDQ's when the > > > guest is assigned multiple devices behind different SMMU's. How > > > does the mmap interface per iommufd work in that scenario? > > > > Trying to documenting that each IOMMUFD object can possibly have > > a shared page, the mmap interface takes the index of an IOMMUFD > > object ID. So, either a pt_id(S1) or a dev_id should be able to > > identify which physical SMMU, I think. > > Are all allowed cmds in VCMDQ per hwpt? If not then building the > mmap interface per hwpt object is not correct. We may want explicit > VCMDQ object in that case. One VCMDQ HW per SMMU instance. So all HWPTs that are created by devices behind the same SMMU instance share the same VCMDQ HW. 
Each VCMDQ HW can also allocate multiple queues that don't
necessarily tie to any HWPT either.

> and devices behind different SMMU's may be attached to a same
> hwpt. In that case the number of VCMDQ associated to a hwpt is
> also dynamic.

Unless two HWPTs share the same S1 Context Table, how can two
devices behind different SMMUs attach to the same HWPT? And it
doesn't sound very plausible to share the same S1 Context Table
between two devices either.

> But if just talking about batching for emulated smmu then having
> the user to pass a big buffer makes more sense.

OK. That's in line with Jason's suggestion: passing a queue
buffer via the ioctl.

> > > and looks this is different from the requirement of having a
> > > software short path in kernel to reduce the invalidation overhead
> > > for emulated vIOMMUs. In this case the invalidation queue is
> > > in guest memory then instead we want a registration cmd here.
> >
> > Yes for the first part. There are certain difficulties of doing
> > a short path, such as host kernel replacing the host queue that
> > the actual HW ran on, with a guest TLBI queue. So, my draft is
> > more about batching.
> >
> > For the last part, what's that "registration cmd" to do? In my
> > draft, the hypervisor dispatch all invalidation commands to the
> > guest TLBI queue (or call it user queue), which is transparent
> > to the guest OS.
>
> registration means the user to pass the buffer to the kernel.
>
> If we want to support kernel short path, then we want the host
> smmu driver to directly read cmd out of guest TLBI queue.

I expect the mmap approach can go a bit further for an SMMU that
has ECMDQ capability (multiple CMDQs): it could load the guest
TLBI queue without copying that buffer and inserting it into the
host queue, by allocating a separate cmdq object in the SMMU
driver and mmap'ing its cmdq->q.base to QEMU.

Thanks
Nicolin

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-03  0:33 Cache Invalidation Solution for Nested IOMMU Nicolin Chen
  2023-04-03  7:26 ` Liu, Yi L
  2023-04-03  8:00 ` Tian, Kevin
@ 2023-04-03 14:08 ` Jason Gunthorpe
  2023-04-03 14:51   ` Nicolin Chen
  2 siblings, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-03 14:08 UTC (permalink / raw)
To: Nicolin Chen
Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
    shameerali.kolothum.thodi, jean-philippe, iommu

On Sun, Apr 02, 2023 at 05:33:35PM -0700, Nicolin Chen wrote:

> The first version is simply to individually forward the entire
> command. This can save a few CPU cycles from packing/unpacking
> invalidation fields of the commands via a data structure, v.s.
> the structure in v1[2].

The kernel must validate the SID for the ATS invalidations; we can't
just blindly pass it through.

And this simple path needs an explanation of how errors are properly
handled, eg by making execution synchronous, or someone guaranteeing
that errors are impossible.

> Then I added a new mmap interface to share kernel page(s) from the
> Driver, to allow QEMU to write all TLBI commands as a single batch.
> Then it can initiate the batch invalidation via another synchronous
> hypercall.

I don't think an mmap is really needed for simple batching; just
passing a larger buffer to the ioctl is probably good enough.

If a SW side is built it should mirror the HW vCMDQ path, not be
different.

Jason

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-03 14:08 ` Jason Gunthorpe @ 2023-04-03 14:51 ` Nicolin Chen 2023-04-03 19:15 ` Robin Murphy 0 siblings, 1 reply; 35+ messages in thread From: Nicolin Chen @ 2023-04-03 14:51 UTC (permalink / raw) To: Jason Gunthorpe Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu, shameerali.kolothum.thodi, jean-philippe, iommu On Mon, Apr 03, 2023 at 11:08:23AM -0300, Jason Gunthorpe wrote: > On Sun, Apr 02, 2023 at 05:33:35PM -0700, Nicolin Chen wrote: > > The first version is simply to individually forward the entire > > command. This can save a few CPU cycles from packing/unpacking > > invalidation fields of the commands via a data structure, v.s. > > the structure in v1[2]. > > The kernel must validate the SID for the ATS invalidations, we can't > just blindly pass it through. Yes. I didn't go further with the first version, yet leaving a line of comments in the handler: we'd need set/unset_rid_user, to validate the SID field of INV_ATC commands, as we discussed. > And this simple path needs an explanation how errors are properly > handled, eg by making execution synchronous, or someone guaranteeing > that errors are impossible. Yes. Both versions here execute in a synchronous fashion. The error code will be returned in the cache_invalidate_user data structure. > > Then I added a new mmap interface to share kernel page(s) from the > > Driver, to allow QEMU to write all TLBI commands as a single batch. > > Then it can initiate the batch invalidation via another synchronous > > hypercall. > > I don't think a mmap is really needed for simple batching, just > passing a larger buffer to ioctl is probably good enough It wouldn't be a must, yet can omit a copy_from_user() at each hypercall? And it also eases VCMDQ a bit. > If a SW side is built it should mirror the HW vCMDQ path, not be > different. The host kernel has the host queue, while the hypervisor fills in a guest TLBI queue. 
Switching between two queues at one SMMU CMDQ (HW) requires a
very complicated locking mechanism, versus inserting the batch
into the existing host queue. And it probably wouldn't yield a
big perf improvement either.

If the SMMU has ECMDQ, the driver could allocate a free CMDQ upon
availability, calling arm_smmu_init_one_queue() and mmap'ing
q->base; then it can execute the guest TLBI queue directly,
passing that q pointer.

Thanks
Nicolin

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-03 14:51 ` Nicolin Chen @ 2023-04-03 19:15 ` Robin Murphy 2023-04-04 0:02 ` Nicolin Chen 0 siblings, 1 reply; 35+ messages in thread From: Robin Murphy @ 2023-04-03 19:15 UTC (permalink / raw) To: Nicolin Chen, Jason Gunthorpe Cc: kevin.tian, yi.l.liu, eric.auger, baolu.lu, shameerali.kolothum.thodi, jean-philippe, iommu On 2023-04-03 15:51, Nicolin Chen wrote: > On Mon, Apr 03, 2023 at 11:08:23AM -0300, Jason Gunthorpe wrote: >> On Sun, Apr 02, 2023 at 05:33:35PM -0700, Nicolin Chen wrote: >>> The first version is simply to individually forward the entire >>> command. This can save a few CPU cycles from packing/unpacking >>> invalidation fields of the commands via a data structure, v.s. >>> the structure in v1[2]. >> >> The kernel must validate the SID for the ATS invalidations, we can't >> just blindly pass it through. > > Yes. I didn't go further with the first version, yet leaving a > line of comments in the handler: we'd need set/unset_rid_user, > to validate the SID field of INV_ATC commands, as we discussed. > >> And this simple path needs an explanation how errors are properly >> handled, eg by making execution synchronous, or someone guaranteeing >> that errors are impossible. > > Yes. Both versions here execute in a synchronous fashion. The > error code will be returned in the cache_invalidate_user data > structure. > >>> Then I added a new mmap interface to share kernel page(s) from the >>> Driver, to allow QEMU to write all TLBI commands as a single batch. >>> Then it can initiate the batch invalidation via another synchronous >>> hypercall. >> >> I don't think a mmap is really needed for simple batching, just >> passing a larger buffer to ioctl is probably good enough > > It wouldn't be a must, yet can omit a copy_from_user() at each > hypercall? And it also eases VCMDQ a bit. > >> If a SW side is built it should mirror the HW vCMDQ path, not be >> different. 
> The host kernel has the host queue, while the hypervisor fills
> in a guest TLBI queue. Switching between two queues at one SMMU
> CMDQ (HW) requires a very complicated locking mechanism, v.s.
> inserting the batch to the existing host queue. And it probably
> doesn't have a big perf improvement by doing that?
>
> If SMMU has ECMDQ, it'd allocate a free CMDQ upon availability,
> calling arm_smmu_init_one_queue() and mmapping q->base, then it
> can execute the guest TLBI queue directly, passing that q ptr.

FWIW I don't think that should be visible in a userspace interface. When
the VMM is just requesting some invalidations in order to emulate some
commands, it's up to the SMMU driver, or at best between the driver and
IOMMUFD, to decide exactly how those requests get executed as physical
commands - that should not make any difference to the requester other
than how quickly the requests are processed.

AFAICS this interface can't look like the proper hardware vCMDQ path,
because the whole point of that will be to configure it in advance, map
the queue controls directly into the guest, and avoid trapping
invalidations to the VMM at all. This invalidation request interface is
a large part of precisely what that path is intended to bypass. I don't
see much benefit in supporting an additional slightly-accelerated slow
path where the host avoids a tiny bit of housekeeping by maintaining a
real vCMDQ *on behalf of* the guest and forwarding trapped commands into
it instead of just processing them normally as host commands. Or,
conversely, emulating a vCMDQ *in* the host kernel, in a way which still
requires traps to bounce through the VMM and back - that just seems
objectively worse than keeping all the emulation together in one place
(however, I would concur that a "vhost-style" emulation, using all the
same interfaces for configuration and error/irq/etc. reporting as the
real hardware would, might be viable if performance really demands it).

Thanks,
Robin.
^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-03 19:15 ` Robin Murphy @ 2023-04-04 0:02 ` Nicolin Chen 2023-04-04 16:20 ` Jason Gunthorpe 0 siblings, 1 reply; 35+ messages in thread From: Nicolin Chen @ 2023-04-04 0:02 UTC (permalink / raw) To: Robin Murphy Cc: Jason Gunthorpe, kevin.tian, yi.l.liu, eric.auger, baolu.lu, shameerali.kolothum.thodi, jean-philippe, iommu Hi Robin, On Mon, Apr 03, 2023 at 08:15:03PM +0100, Robin Murphy wrote: > External email: Use caution opening links or attachments > > > On 2023-04-03 15:51, Nicolin Chen wrote: > > On Mon, Apr 03, 2023 at 11:08:23AM -0300, Jason Gunthorpe wrote: > > > On Sun, Apr 02, 2023 at 05:33:35PM -0700, Nicolin Chen wrote: > > > > The first version is simply to individually forward the entire > > > > command. This can save a few CPU cycles from packing/unpacking > > > > invalidation fields of the commands via a data structure, v.s. > > > > the structure in v1[2]. > > > > > > The kernel must validate the SID for the ATS invalidations, we can't > > > just blindly pass it through. > > > > Yes. I didn't go further with the first version, yet leaving a > > line of comments in the handler: we'd need set/unset_rid_user, > > to validate the SID field of INV_ATC commands, as we discussed. > > > > > And this simple path needs an explanation how errors are properly > > > handled, eg by making execution synchronous, or someone guaranteeing > > > that errors are impossible. > > > > Yes. Both versions here execute in a synchronous fashion. The > > error code will be returned in the cache_invalidate_user data > > structure. > > > > > > Then I added a new mmap interface to share kernel page(s) from the > > > > Driver, to allow QEMU to write all TLBI commands as a single batch. > > > > Then it can initiate the batch invalidation via another synchronous > > > > hypercall. 
> > > > > > I don't think a mmap is really needed for simple batching, just > > > passing a larger buffer to ioctl is probably good enough > > > > It wouldn't be a must, yet can omit a copy_from_user() at each > > hypercall? And it also eases VCMDQ a bit. > > > > > If a SW side is built it should mirror the HW vCMDQ path, not be > > > different. > > > > The host kernel has the host queue, while the hypervisor fills > > in a guest TLBI queue. Switching between two queues at one SMMU > > CMDQ (HW) requires a very complicated locking mechanism, v.s. > > inserting the batch to the existing host queue. And it probably > > doesn't have a big perf improvement by doing that? > > > > If SMMU has ECMDQ, it'd allocate a free CMDQ upon availability, > > calling arm_smmu_init_one_queue() and mmapping q->base, then it > > can execute the guest TLBI queue directly, passing that q ptr. > > FWIW I don't think that should be visible in a userspace interface. When > the VMM is just requesting some invalidations in order to emulate some > commands, it's up to the SMMU driver, or at best between the driver and > and IOMMUFD, to decide exactly how those requests get executed as > physical commands - that should not make any difference to the requester > other than how quickly the requests are processed. > > AFAICS this interface can't look like the proper hardware vCMDQ path, > because the whole point of that will be to configure it in advance, map > the queue controls directly into the guest, and avoid trapping > invalidations to the VMM at all. This invalidation request interface is > a large part of precisely what that path is intended to bypass. I don't > see much benefit in supporting an additional slightly-accelerated slow > path where the host avoids a tiny bit of housekeeping by maintaining a > real vCMDQ *on behalf of* the guest and forwarding trapped commands into > it instead of just processing them normally as host commands. Or, I tend to agree with most of your point. 
The implementation of a SW-emulated VCMDQ might get overcomplicated,
since it has to cooperate with both the kernel driver and the QEMU
vSMMU code. If the SMMU HW (in most cases) only has one CMDQ, it is
hard to switch between host and guest commands to gain performance.
VCMDQ can simply do that because it has multiple CMDQs by nature.

What is the normal processing approach that you'd suggest? Do you
agree that having batch invalidation would be nicer? We could go
for the mmap'd page approach in my draft, or for the ioctl that
Jason pointed out.

My preference is to have an mmap'd page, so the interface can be
reused later by VCMDQ too. Performance-wise, it should be good
enough, since it does batching, IMHO.

> conversely, emulating a vCMDQ *in* the host kernel, in a way which still
> requires traps to bounce through the VMM and back - that just seems
> objectively worse than keeping all the emulation together in one place
> (however, I would concur that a "vhost-style" emulation, using all the
> same interfaces for configuration and error/irq/etc. reporting as the
> real hardware would, might be viable if performance really demands it).

I am not very familiar with the vhost style. At a glance at its
doc, it seems to be one interface for all hypercalls? That would
be very different from what we design here.

Thanks
Nicolin

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-04  0:02 ` Nicolin Chen
@ 2023-04-04 16:20 ` Jason Gunthorpe
  2023-04-04 16:50   ` Shameerali Kolothum Thodi
  2023-04-05  5:45   ` Nicolin Chen
  0 siblings, 2 replies; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-04 16:20 UTC (permalink / raw)
To: Nicolin Chen
Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
    shameerali.kolothum.thodi, jean-philippe, iommu

On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:

> My preference is to have a mmap'd page, so the interface can
> be reused later by VCMDQ too. Performance-wise, it should be
> good enough, since it does batching, IMHO.

You can't reuse mmapping the queue page with VCMDQ, so it doesn't seem
meaningful to me.

There should be no mmap on the SW path. If you need a half step
between an ioctl as a batch and a full vhost-like queue scheme, then
using io_uring with pre-registered memory would be appropriate.

I feel like this is a topic Shameerali should share some insight on,
since the Huawei implementation will rely on this SW path.

Jason

^ permalink raw reply [flat|nested] 35+ messages in thread
* RE: Cache Invalidation Solution for Nested IOMMU 2023-04-04 16:20 ` Jason Gunthorpe @ 2023-04-04 16:50 ` Shameerali Kolothum Thodi 2023-04-05 11:57 ` Jason Gunthorpe 2023-04-06 6:23 ` Zhangfei Gao 2023-04-05 5:45 ` Nicolin Chen 1 sibling, 2 replies; 35+ messages in thread From: Shameerali Kolothum Thodi @ 2023-04-04 16:50 UTC (permalink / raw) To: Jason Gunthorpe, Nicolin Chen Cc: Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com, baolu.lu@linux.intel.com, jean-philippe@linaro.org, iommu@lists.linux.dev, Zhangfei Gao > -----Original Message----- > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > Sent: 04 April 2023 17:20 > To: Nicolin Chen <nicolinc@nvidia.com> > Cc: Robin Murphy <robin.murphy@arm.com>; kevin.tian@intel.com; > yi.l.liu@intel.com; eric.auger@redhat.com; baolu.lu@linux.intel.com; > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; > jean-philippe@linaro.org; iommu@lists.linux.dev > Subject: Re: Cache Invalidation Solution for Nested IOMMU > > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote: > > > My preference is to have a mmap'd page, so the interface can > > be reused later by VCMDQ too. Performance-wise, it should be > > good enough, since it does batching, IMHO. > > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem > meaningful to me. > > There should be no mmap on the SW path. If you need a half step > between an ioctl as a batch and a full vhost-like queue scheme then > using iouring with pre-registered memory would be appropriate. > > I feel like this is a topic Shameerali should share some insight with > since the Huawei implementation will rely on this SW path. So far the tests we have done are mainly using IOCTL method without batching and I don't have any numbers to compare against the batched method yet. [+Zhangfei] Zhangfei, do you think we can run some UADK/vSVA tests with different Invalidation solutions discussed here and compare? 
Thanks, Shameer ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-04 16:50 ` Shameerali Kolothum Thodi @ 2023-04-05 11:57 ` Jason Gunthorpe 2023-04-06 6:23 ` Zhangfei Gao 1 sibling, 0 replies; 35+ messages in thread From: Jason Gunthorpe @ 2023-04-05 11:57 UTC (permalink / raw) To: Shameerali Kolothum Thodi Cc: Nicolin Chen, Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com, baolu.lu@linux.intel.com, jean-philippe@linaro.org, iommu@lists.linux.dev, Zhangfei Gao On Tue, Apr 04, 2023 at 04:50:14PM +0000, Shameerali Kolothum Thodi wrote: > > > > -----Original Message----- > > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > > Sent: 04 April 2023 17:20 > > To: Nicolin Chen <nicolinc@nvidia.com> > > Cc: Robin Murphy <robin.murphy@arm.com>; kevin.tian@intel.com; > > yi.l.liu@intel.com; eric.auger@redhat.com; baolu.lu@linux.intel.com; > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; > > jean-philippe@linaro.org; iommu@lists.linux.dev > > Subject: Re: Cache Invalidation Solution for Nested IOMMU > > > > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote: > > > > > My preference is to have a mmap'd page, so the interface can > > > be reused later by VCMDQ too. Performance-wise, it should be > > > good enough, since it does batching, IMHO. > > > > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem > > meaningful to me. > > > > There should be no mmap on the SW path. If you need a half step > > between an ioctl as a batch and a full vhost-like queue scheme then > > using iouring with pre-registered memory would be appropriate. > > > > I feel like this is a topic Shameerali should share some insight with > > since the Huawei implementation will rely on this SW path. > > So far the tests we have done are mainly using IOCTL method without > batching and I don't have any numbers to compare against the batched > method yet. 
The question is more whether you are happy with the performance of
a single ioctl, or whether this needs optimization. It sits in a
hot path of the MM, so it broadly impacts certain workloads once
SVA is enabled.

Jason

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-04 16:50 ` Shameerali Kolothum Thodi 2023-04-05 11:57 ` Jason Gunthorpe @ 2023-04-06 6:23 ` Zhangfei Gao 2023-04-06 6:39 ` Nicolin Chen 2023-04-06 11:40 ` Jason Gunthorpe 1 sibling, 2 replies; 35+ messages in thread From: Zhangfei Gao @ 2023-04-06 6:23 UTC (permalink / raw) To: Shameerali Kolothum Thodi Cc: Jason Gunthorpe, Nicolin Chen, Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com, baolu.lu@linux.intel.com, jean-philippe@linaro.org, iommu@lists.linux.dev, qianweili On Wed, 5 Apr 2023 at 00:50, Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> wrote: > > > > > -----Original Message----- > > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > > Sent: 04 April 2023 17:20 > > To: Nicolin Chen <nicolinc@nvidia.com> > > Cc: Robin Murphy <robin.murphy@arm.com>; kevin.tian@intel.com; > > yi.l.liu@intel.com; eric.auger@redhat.com; baolu.lu@linux.intel.com; > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; > > jean-philippe@linaro.org; iommu@lists.linux.dev > > Subject: Re: Cache Invalidation Solution for Nested IOMMU > > > > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote: > > > > > My preference is to have a mmap'd page, so the interface can > > > be reused later by VCMDQ too. Performance-wise, it should be > > > good enough, since it does batching, IMHO. > > > > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem > > meaningful to me. > > > > There should be no mmap on the SW path. If you need a half step > > between an ioctl as a batch and a full vhost-like queue scheme then > > using iouring with pre-registered memory would be appropriate. > > > > I feel like this is a topic Shameerali should share some insight with > > since the Huawei implementation will rely on this SW path. 
> So far the tests we have done are mainly using IOCTL method without
> batching and I don't have any numbers to compare against the batched
> method yet.
>
> [+Zhangfei]
>
> Zhangfei, do you think we can run some UADK/vSVA tests with different
> Invalidation solutions discussed here and compare?

Hi, Nicolin

Do you have an mmap branch for the kernel and QEMU?

We are using the ioctl method now. From our testing, TLB misses
impact performance a lot, so we use the huge page method. With
huge pages, the guest can achieve performance comparable to the
host.

More details: https://docs.qq.com/doc/DRXlpQmpTSlBZTGZZ

Thanks

^ permalink raw reply [flat|nested] 35+ messages in thread
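For reference, one common way to apply the "huge page method" mentioned above
is to back the guest RAM with hugetlbfs when launching QEMU. The pool size,
mount point, and machine options below are illustrative, not taken from the
test setup in this thread:

```sh
# Reserve a pool of 2MB huge pages and expose it via hugetlbfs.
echo 2048 > /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
mount -t hugetlbfs none /dev/hugepages

# Back all guest memory with the hugetlbfs file so stage-2 mappings
# use large blocks, cutting down IOTLB misses.
qemu-system-aarch64 \
    -m 4G \
    -object memory-backend-file,id=mem0,size=4G,mem-path=/dev/hugepages,share=on,prealloc=on \
    -machine virt,memory-backend=mem0 \
    ...
```

Larger stage-2 block mappings mean fewer TLB entries per gigabyte of guest
memory, which is why this narrows the guest/host gap in DMA bandwidth tests.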
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-06 6:23 ` Zhangfei Gao @ 2023-04-06 6:39 ` Nicolin Chen 2023-04-06 11:40 ` Jason Gunthorpe 1 sibling, 0 replies; 35+ messages in thread From: Nicolin Chen @ 2023-04-06 6:39 UTC (permalink / raw) To: Zhangfei Gao Cc: Shameerali Kolothum Thodi, Jason Gunthorpe, Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com, baolu.lu@linux.intel.com, jean-philippe@linaro.org, iommu@lists.linux.dev, qianweili On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote: > External email: Use caution opening links or attachments > > > On Wed, 5 Apr 2023 at 00:50, Shameerali Kolothum Thodi > <shameerali.kolothum.thodi@huawei.com> wrote: > > > > > > > > > -----Original Message----- > > > From: Jason Gunthorpe [mailto:jgg@nvidia.com] > > > Sent: 04 April 2023 17:20 > > > To: Nicolin Chen <nicolinc@nvidia.com> > > > Cc: Robin Murphy <robin.murphy@arm.com>; kevin.tian@intel.com; > > > yi.l.liu@intel.com; eric.auger@redhat.com; baolu.lu@linux.intel.com; > > > Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com>; > > > jean-philippe@linaro.org; iommu@lists.linux.dev > > > Subject: Re: Cache Invalidation Solution for Nested IOMMU > > > > > > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote: > > > > > > > My preference is to have a mmap'd page, so the interface can > > > > be reused later by VCMDQ too. Performance-wise, it should be > > > > good enough, since it does batching, IMHO. > > > > > > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem > > > meaningful to me. > > > > > > There should be no mmap on the SW path. If you need a half step > > > between an ioctl as a batch and a full vhost-like queue scheme then > > > using iouring with pre-registered memory would be appropriate. > > > > > > I feel like this is a topic Shameerali should share some insight with > > > since the Huawei implementation will rely on this SW path. 
> > So far the tests we have done are mainly using IOCTL method without
> > batching and I don't have any numbers to compare against the batched
> > method yet.
> >
> > [+Zhangfei]
> >
> > Zhangfei, do you think we can run some UADK/vSVA tests with different
> > Invalidation solutions discussed here and compare?
>
> Hi, Nicolin
>
> Do you have mmap branch for kernel and Qemu?
>
> We are using ioctl method now.
> From the testing, the TLB miss impacts performance a lot, so we use
> huge page method.
> After using huge page method, guest can achieve comparable performance
> with host.

Looks like the test has been running on the previous VFIO solution?
Have you tried a sanity test using the latest IOMMUFD pair? I recall
that Shameer has a set of branches for Linux and QEMU.

Meanwhile, I'll prepare a pair of branches for you to test the mmap
solution. Let me get back to you by the end of the week.

Thanks
Nicolin

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU
  2023-04-06  6:23 ` Zhangfei Gao
@ 2023-04-06 11:40 ` Jason Gunthorpe
  2023-04-10  1:08   ` Nicolin Chen
  1 sibling, 1 reply; 35+ messages in thread
From: Jason Gunthorpe @ 2023-04-06 11:40 UTC (permalink / raw)
To: Zhangfei Gao
Cc: Shameerali Kolothum Thodi, Nicolin Chen, Robin Murphy,
    kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com,
    baolu.lu@linux.intel.com, jean-philippe@linaro.org,
    iommu@lists.linux.dev, qianweili

On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:

> We are using ioctl method now.
> From the testing, the TLB miss impacts performance a lot, so we use
> huge page method.
> After using huge page method, guest can achieve comparable performance
> with host.

Looks like these tests are not stressing the MM, just measuring pure
BW of the DMA, so they don't get into the invalidation regime.

You need to measure a more realistic application that actually uses
the MM (eg alloc/free memory, fork, etc) while it operates, with SVA
turned on.

Jason

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-06 11:40 ` Jason Gunthorpe @ 2023-04-10 1:08 ` Nicolin Chen 2023-04-11 9:07 ` Jean-Philippe Brucker 2023-04-12 2:47 ` Zhangfei Gao 0 siblings, 2 replies; 35+ messages in thread From: Nicolin Chen @ 2023-04-10 1:08 UTC (permalink / raw) To: Zhangfei Gao, Jason Gunthorpe Cc: Shameerali Kolothum Thodi, Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com, baolu.lu@linux.intel.com, jean-philippe@linaro.org, iommu@lists.linux.dev, qianweili On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote: > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote: > > > We are using ioctl method now. > > From the testing, the TLB miss impacts performance a lot, so we use > > huge page method. > > After using huge page method, guest can achieve comparable performance > > with host. > > Looks like these tests are not stressing the MM, just measuring pure > BW of the DMA, so they don't get into the invalidation regime.. > > You need to measure a more real application that is actually using the > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA. Would an iommu map/unmap benchmark test be useful here? 
I added a test program to measure map/unmap times with a set of
different sized buffers:
https://github.com/nicolinc/iommufd/commit/3eb417f2cae0234cc801c6ad74de2afb0ddbdf84
(Also thinking about sending this with an RFC series.)

@Zhangfei,

In case this could be useful, you can pull these two branches for
perf measurement with and without mmap:
# Kernel
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04082023
# QEMU
https://github.com/nicolinc/qemu/commits/wip/iommufd_nesting-mmap-04072023

To test without mmap, simply revert the following commits:
# Kernel
git revert ecf602a3c8480ba7ce2c7e77c2d15ca873dbf2e4
# QEMU
git revert c726c014de70998f14b8741a6a96e18a2a7bcd0f

And you can get the test script in the branch:
tools/testing/selftests/iommu/iommu_benchmark.sh

On my emulation environment (very slow), I see improvements with
mmap. I'll also try setting up a test suite on proper HW this week.

Thanks
Nicolin

^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-10 1:08 ` Nicolin Chen @ 2023-04-11 9:07 ` Jean-Philippe Brucker 2023-04-11 11:57 ` Jason Gunthorpe 2023-04-11 18:43 ` Nicolin Chen 2023-04-12 2:47 ` Zhangfei Gao 1 sibling, 2 replies; 35+ messages in thread From: Jean-Philippe Brucker @ 2023-04-11 9:07 UTC (permalink / raw) To: Nicolin Chen Cc: Zhangfei Gao, Jason Gunthorpe, Shameerali Kolothum Thodi, Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com, baolu.lu@linux.intel.com, iommu@lists.linux.dev, qianweili On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote: > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote: > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote: > > > > > We are using ioctl method now. > > > From the testing, the TLB miss impacts performance a lot, so we use > > > huge page method. > > > After using huge page method, guest can achieve comparable performance > > > with host. > > > > Looks like these tests are not stressing the MM, just measuring pure > > BW of the DMA, so they don't get into the invalidation regime.. > > > > You need to measure a more real application that is actually using the > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA. > > Would an iommu map/unmap benchmark test be useful here? You can reuse and improve dma_map_benchmark, already upstream. It supports multi-threads and reports stddev. For some comparisons the report resolution (.1 us) is insufficient and needs to be increased, but for comparing guest performance it might be alright. https://lore.kernel.org/linux-iommu/20201102080646.2180-2-song.bao.hua@hisilicon.com/ Thanks, Jean ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU 2023-04-11 9:07 ` Jean-Philippe Brucker @ 2023-04-11 11:57 ` Jason Gunthorpe 2023-04-11 18:39 ` Nicolin Chen 2023-04-11 18:43 ` Nicolin Chen 1 sibling, 1 reply; 35+ messages in thread From: Jason Gunthorpe @ 2023-04-11 11:57 UTC (permalink / raw) To: Jean-Philippe Brucker Cc: Nicolin Chen, Zhangfei Gao, Shameerali Kolothum Thodi, Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com, baolu.lu@linux.intel.com, iommu@lists.linux.dev, qianweili On Tue, Apr 11, 2023 at 10:07:40AM +0100, Jean-Philippe Brucker wrote: > On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote: > > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote: > > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote: > > > > > > > We are using ioctl method now. > > > > From the testing, the TLB miss impacts performance a lot, so we use > > > > huge page method. > > > > After using huge page method, guest can achieve comparable performance > > > > with host. > > > > > > Looks like these tests are not stressing the MM, just measuring pure > > > BW of the DMA, so they don't get into the invalidation regime.. > > > > > > You need to measure a more real application that is actually using the > > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA. > > > > Would an iommu map/unmap benchmark test be useful here? > > You can reuse and improve dma_map_benchmark, already upstream. It supports > multi-threads and reports stddev. For some comparisons the report > resolution (.1 us) is insufficient and needs to be increased, but for > comparing guest performance it might be alright. > > https://lore.kernel.org/linux-iommu/20201102080646.2180-2-song.bao.hua@hisilicon.com/ I'm not talking about kernel dma_map/unmap, I'm talking about MM activities like mmap and munmap that become slower once SVA is enabled Jason ^ permalink raw reply [flat|nested] 35+ messages in thread
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-11 18:39 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jean-Philippe Brucker, Zhangfei Gao, Shameerali Kolothum Thodi,
    Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com,
    eric.auger@redhat.com, baolu.lu@linux.intel.com,
    iommu@lists.linux.dev, qianweili

On Tue, Apr 11, 2023 at 08:57:35AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 11, 2023 at 10:07:40AM +0100, Jean-Philippe Brucker wrote:
> > On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote:
> > > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > > >
> > > > > We are using ioctl method now.
> > > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > > huge page method.
> > > > > After using huge page method, guest can achieve comparable performance
> > > > > with host.
> > > >
> > > > Looks like these tests are not stressing the MM, just measuring pure
> > > > BW of the DMA, so they don't get into the invalidation regime..
> > > >
> > > > You need to measure a more real application that is actually using the
> > > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> > >
> > > Would an iommu map/unmap benchmark test be useful here?
> >
> > You can reuse and improve dma_map_benchmark, already upstream. It supports
> > multi-threads and reports stddev. For some comparisons the report
> > resolution (.1 us) is insufficient and needs to be increased, but for
> > comparing guest performance it might be alright.
> >
> > https://lore.kernel.org/linux-iommu/20201102080646.2180-2-song.bao.hua@hisilicon.com/
>
> I'm not talking about kernel dma_map/unmap, I'm talking about MM
> activities like mmap and munmap that become slower once SVA is enabled

Is it about mmap/munmap() calls or accessing mmap'd memory?

If it's about accessing the memory, the test can cover that via the
invalidation pathway. Otherwise, mmap/munmap() are called in
->domain_alloc/free() respectively -- mind elaborating why it
should be a concern?

Thanks
Nic
* Re: Cache Invalidation Solution for Nested IOMMU
From: Jason Gunthorpe @ 2023-04-11 18:41 UTC (permalink / raw)
To: Nicolin Chen
Cc: Jean-Philippe Brucker, Zhangfei Gao, Shameerali Kolothum Thodi,
    Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com,
    eric.auger@redhat.com, baolu.lu@linux.intel.com,
    iommu@lists.linux.dev, qianweili

On Tue, Apr 11, 2023 at 11:39:26AM -0700, Nicolin Chen wrote:
> On Tue, Apr 11, 2023 at 08:57:35AM -0300, Jason Gunthorpe wrote:
> > On Tue, Apr 11, 2023 at 10:07:40AM +0100, Jean-Philippe Brucker wrote:
> > > On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote:
> > > > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > > > >
> > > > > > We are using ioctl method now.
> > > > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > > > huge page method.
> > > > > > After using huge page method, guest can achieve comparable performance
> > > > > > with host.
> > > > >
> > > > > Looks like these tests are not stressing the MM, just measuring pure
> > > > > BW of the DMA, so they don't get into the invalidation regime..
> > > > >
> > > > > You need to measure a more real application that is actually using the
> > > > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> > > >
> > > > Would an iommu map/unmap benchmark test be useful here?
> > >
> > > You can reuse and improve dma_map_benchmark, already upstream. It supports
> > > multi-threads and reports stddev. For some comparisons the report
> > > resolution (.1 us) is insufficient and needs to be increased, but for
> > > comparing guest performance it might be alright.
> > >
> > > https://lore.kernel.org/linux-iommu/20201102080646.2180-2-song.bao.hua@hisilicon.com/
> >
> > I'm not talking about kernel dma_map/unmap, I'm talking about MM
> > activities like mmap and munmap that become slower once SVA is enabled
>
> Is it about mmap/munmap() calls or accessing mmap'd memory?

I mean literally userspace mmap() calls to change the mm_struct.

Jason
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-11 19:02 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Jean-Philippe Brucker, Zhangfei Gao, Shameerali Kolothum Thodi,
    Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com,
    eric.auger@redhat.com, baolu.lu@linux.intel.com,
    iommu@lists.linux.dev, qianweili

On Tue, Apr 11, 2023 at 03:41:07PM -0300, Jason Gunthorpe wrote:
> > > > > > > We are using ioctl method now.
> > > > > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > > > > huge page method.
> > > > > > > After using huge page method, guest can achieve comparable performance
> > > > > > > with host.
> > > > > >
> > > > > > Looks like these tests are not stressing the MM, just measuring pure
> > > > > > BW of the DMA, so they don't get into the invalidation regime..
> > > > > >
> > > > > > You need to measure a more real application that is actually using the
> > > > > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> > >
> > > I'm not talking about kernel dma_map/unmap, I'm talking about MM
> > > activities like mmap and munmap that become slower once SVA is enabled
> >
> > Is it about mmap/munmap() calls or accessing mmap'd memory?
>
> I mean literally userspace mmap() calls to change the mm_struct.

Hmm, it sounds like a different perf factor? Otherwise, I still
don't understand how it impacts our decision of choosing between
mmap'd queue page and doing copy_from_user().

Thanks
Nicolin
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-11 18:43 UTC (permalink / raw)
To: Jean-Philippe Brucker
Cc: Zhangfei Gao, Jason Gunthorpe, Shameerali Kolothum Thodi,
    Robin Murphy, kevin.tian@intel.com, yi.l.liu@intel.com,
    eric.auger@redhat.com, baolu.lu@linux.intel.com,
    iommu@lists.linux.dev, qianweili

On Tue, Apr 11, 2023 at 10:07:40AM +0100, Jean-Philippe Brucker wrote:
> On Sun, Apr 09, 2023 at 06:08:25PM -0700, Nicolin Chen wrote:
> > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > >
> > > > We are using ioctl method now.
> > > > From the testing, the TLB miss impacts performance a lot, so we use
> > > > huge page method.
> > > > After using huge page method, guest can achieve comparable performance
> > > > with host.
> > >
> > > Looks like these tests are not stressing the MM, just measuring pure
> > > BW of the DMA, so they don't get into the invalidation regime..
> > >
> > > You need to measure a more real application that is actually using the
> > > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
> >
> > Would an iommu map/unmap benchmark test be useful here?
>
> You can reuse and improve dma_map_benchmark, already upstream. It supports
> multi-threads and reports stddev. For some comparisons the report
> resolution (.1 us) is insufficient and needs to be increased, but for
> comparing guest performance it might be alright.

Oh, didn't know about that. Thanks for pointing out the test :)

Nicolin
* Re: Cache Invalidation Solution for Nested IOMMU
From: Zhangfei Gao @ 2023-04-12 2:47 UTC (permalink / raw)
To: Nicolin Chen
Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Robin Murphy,
    kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com,
    baolu.lu@linux.intel.com, jean-philippe@linaro.org,
    iommu@lists.linux.dev, qianweili

Hi, Nicolin

On Mon, 10 Apr 2023 at 09:08, Nicolin Chen <nicolinc@nvidia.com> wrote:
>
> On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> >
> > > We are using ioctl method now.
> > > From the testing, the TLB miss impacts performance a lot, so we use
> > > huge page method.
> > > After using huge page method, guest can achieve comparable performance
> > > with host.
> >
> > Looks like these tests are not stressing the MM, just measuring pure
> > BW of the DMA, so they don't get into the invalidation regime..
> >
> > You need to measure a more real application that is actually using the
> > MM (eg alloc/free memory, fork, etc) while it operates and turn on SVA.
>
> Would an iommu map/unmap benchmark test be useful here?
>
> I added a test program to measure map/unmap times with a
> set of different sized buffers:
> https://github.com/nicolinc/iommufd/commit/3eb417f2cae0234cc801c6ad74de2afb0ddbdf84
> (Also thinking about sending this with a RFC series)
>
> @Zhangfei,
> In case that this could be useful, you can pull these two
> branches for perf measurement with and without mmap:
> # Kernel
> https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04082023
> # QEMU
> https://github.com/nicolinc/qemu/commits/wip/iommufd_nesting-mmap-04072023
>
> To test without mmap, simply revert the following commits:
> # Kernel
> git revert ecf602a3c8480ba7ce2c7e77c2d15ca873dbf2e4
> # QEMU
> git revert c726c014de70998f14b8741a6a96e18a2a7bcd0f
>
> And, you can get the test script in the branch:
> tools/testing/selftests/iommu/iommu_benchmark.sh
>
> On my emulation environment (very slow), with mmap, I see
> improvements. I'll also try setting up a test suite on a
> proper HW this week.

Still debugging the mmap method here,

Status 4.12

1. ioctl method works, based on iommufd
performance same as before, i.e. 6.2
Using huge page method for guest, guest app can get comparable
performance as host
(The branches are provided by Shameer)
https://github.com/gaozhangfei/linux-kernel-uadk/tree/private-iommufd_nesting-03272023-6.3-rc4-arm
https://github.com/gaozhangfei/qemu/tree/private-v7.2.0-iommufd-nesting-arm

2. mmap method still in debug,
(patch porting may have issues, need to port patches to the working
version, in check)

qemu:
https://github.com/gaozhangfei/linux-kernel-uadk/tree/iommufd_nesting-mmap-04082023
https://github.com/gaozhangfei/qemu/tree/private-v7.2.0-iommufd-nesting-arm-mmap

log:

guest:
hisi sec init Kunpeng920!
qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to invalidate TLB: prod=15fa vs cons=1000709
qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to invalidate TLB: prod=9bb vs cons=1000709
qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: Connection timed out
qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to invalidate TLB: prod=4c0 vs cons=4c0
qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to invalidate TLB: prod=f7a vs cons=1000709
qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to invalidate TLB: prod=81d vs cons=1000709

host:
[218.986319] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error reported (0x00000001), this could be serious
[218.996464] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons 0x01004805): Illegal command
[219.004606] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in error state:
[219.011622] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0060000202110687
[219.017515] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000
[219.023422] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error reported (0x00000001), this could be serious
[219.033550] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons 0x01004806): Illegal command
[219.041692] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in error state:
[219.048707] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000
[219.054600] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-12 5:47 UTC (permalink / raw)
To: Zhangfei Gao
Cc: Jason Gunthorpe, Shameerali Kolothum Thodi, Robin Murphy,
    kevin.tian@intel.com, yi.l.liu@intel.com, eric.auger@redhat.com,
    baolu.lu@linux.intel.com, jean-philippe@linaro.org,
    iommu@lists.linux.dev, qianweili

Hi Zhangfei,

On Wed, Apr 12, 2023 at 10:47:56AM +0800, Zhangfei Gao wrote:

> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device

Only ATC_INV checks the vSID field and returns -ENODEV if the vSID is
not linked to any pSID (i.e. set_rid_user has not been called correctly
for that device.) Maybe you can try some prints in set_rid_user on both
kernel and qemu sides.

> host:
> [218.986319] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
> reported (0x00000001), this could be serious
> [218.996464] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
> 0x01004805): Illegal command
> [219.004606] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in error state:
> [219.011622] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0060000202110687
> [219.017515] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000
> [219.023422] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
> reported (0x00000001), this could be serious
> [219.033550] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
> 0x01004806): Illegal command
> [219.041692] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in error state:
> [219.048707] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000
> [219.054600] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000

These two commands look very wrong... wondering why the host could
encounter them. There's a pr_debug in arm_smmu_fix_user_cmd() that can
be changed to pr_alert for guest queue debugging.

Sorry, the mmap solution might be buggy, as it's a proof of concept.

Thanks
Nicolin
* RE: Cache Invalidation Solution for Nested IOMMU
From: Shameerali Kolothum Thodi @ 2023-05-03 15:14 UTC (permalink / raw)
To: Zhangfei Gao, Nicolin Chen
Cc: Jason Gunthorpe, Robin Murphy, kevin.tian@intel.com,
    yi.l.liu@intel.com, eric.auger@redhat.com,
    baolu.lu@linux.intel.com, jean-philippe@linaro.org,
    iommu@lists.linux.dev, qianweili

> -----Original Message-----
> From: Zhangfei Gao [mailto:zhangfei.gao@linaro.org]
> Sent: 12 April 2023 03:48
> To: Nicolin Chen <nicolinc@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>; Shameerali Kolothum Thodi
> <shameerali.kolothum.thodi@huawei.com>; Robin Murphy
> <robin.murphy@arm.com>; kevin.tian@intel.com; yi.l.liu@intel.com;
> eric.auger@redhat.com; baolu.lu@linux.intel.com; jean-philippe@linaro.org;
> iommu@lists.linux.dev; qianweili <qianweili@huawei.com>
> Subject: Re: Cache Invalidation Solution for Nested IOMMU
>
> Hi, Nicolin
>
> On Mon, 10 Apr 2023 at 09:08, Nicolin Chen <nicolinc@nvidia.com> wrote:
> >
> > On Thu, Apr 06, 2023 at 08:40:04AM -0300, Jason Gunthorpe wrote:
> > > On Thu, Apr 06, 2023 at 02:23:17PM +0800, Zhangfei Gao wrote:
> > >
> > > > We are using ioctl method now.
> > > > From the testing, the TLB miss impacts performance a lot, so we
> > > > use huge page method.
> > > > After using huge page method, guest can achieve comparable
> > > > performance with host.
> > >
> > > Looks like these tests are not stressing the MM, just measuring pure
> > > BW of the DMA, so they don't get into the invalidation regime..
> > >
> > > You need to measure a more real application that is actually using
> > > the MM (eg alloc/free memory, fork, etc) while it operates and turn on
> > > SVA.
> >
> > Would an iommu map/unmap benchmark test be useful here?
> >
> > I added a test program to measure map/unmap times with a set of
> > different sized buffers:
> > https://github.com/nicolinc/iommufd/commit/3eb417f2cae0234cc801c6ad74de2afb0ddbdf84
> > (Also thinking about sending this with a RFC series)
> >
> > @Zhangfei,
> > In case that this could be useful, you can pull these two branches for
> > perf measurement with and without mmap:
> > # Kernel
> > https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-mmap-04082023
> > # QEMU
> > https://github.com/nicolinc/qemu/commits/wip/iommufd_nesting-mmap-04072023
> >
> > To test without mmap, simply revert the following commits:
> > # Kernel
> > git revert ecf602a3c8480ba7ce2c7e77c2d15ca873dbf2e4
> > # QEMU
> > git revert c726c014de70998f14b8741a6a96e18a2a7bcd0f
> >
> > And, you can get the test script in the branch:
> > tools/testing/selftests/iommu/iommu_benchmark.sh
> >
> > On my emulation environment (very slow), with mmap, I see
> > improvements. I'll also try setting up a test suite on a proper HW
> > this week.
>
> Still debugging the mmap method here,
>
> Status 4.12
>
> 1. ioctl method works, based on iommufd
>
> performance same as before, ie. 6.2
>
> Using huge page method for guest, guest app can get comparable
> performance as host
>
> (The branches are provided by Shameer)
> https://github.com/gaozhangfei/linux-kernel-uadk/tree/private-iommufd_nesting-03272023-6.3-rc4-arm
> https://github.com/gaozhangfei/qemu/tree/private-v7.2.0-iommufd-nesting-arm
>
> 2. mmap method still in debug,
> (patch porting may have issues, need to port patches to the working version,
> in check)
>
> qemu:
> https://github.com/gaozhangfei/linux-kernel-uadk/tree/iommufd_nesting-mmap-04082023
> https://github.com/gaozhangfei/qemu/tree/private-v7.2.0-iommufd-nesting-arm-mmap

Hi Zhangfei,

I had a go with your above branches. The issue below seems to happen only
when you assign multiple devices to Guest. It looks like when you have
multiple devices attached to Guest, the host kernel receives an invalid
Guest cmd (0x0) and ends up issuing that to HW, triggering the below errors.

For now, could you please try with just one device? At least in my setup
it didn't trigger the below with just one dev.

Also, as Jason pointed out, I think we might need to run an application
that actually hits the mm invalidation path in the Guest kernel
(arm_smmu_mm_invalidate_range()). With test_sva_perf I couldn't see much
of that happening during the test run. Also another point is once we have
BTM enabled we might not have these mm invalidations reaching the host
kernel.

Thanks,
Shameer

> hisi sec init Kunpeng920!
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=15fa vs cons=1000709
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=9bb vs cons=1000709
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: Connection timed out
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=4c0 vs cons=4c0
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=f7a vs cons=1000709
> qemu-system-aarch64: IOMMU_HWPT_INVALIDATE failed: No such device
> qemu-system-aarch64: --------smmuv3_invalidate_cache_mmap: failed to
> invalidate TLB: prod=81d vs cons=1000709
>
> host:
> [218.986319] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
> reported (0x00000001), this could be serious
> [218.996464] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
> 0x01004805): Illegal command
> [219.004606] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in
> error state:
> [219.011622] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0060000202110687
> [219.017515] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000
> [219.023422] arm-smmu-v3 arm-smmu-v3.3.auto: unexpected global error
> reported (0x00000001), this could be serious
> [219.033550] arm-smmu-v3 arm-smmu-v3.3.auto: CMDQ error (cons
> 0x01004806): Illegal command
> [219.041692] arm-smmu-v3 arm-smmu-v3.3.auto: skipping command in
> error state:
> [219.048707] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000
> [219.054600] arm-smmu-v3 arm-smmu-v3.3.auto: 0x0000000000000000
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-05-03 23:44 UTC (permalink / raw)
To: Zhangfei Gao, Shameerali Kolothum Thodi
Cc: Jason Gunthorpe, Robin Murphy, kevin.tian@intel.com,
    yi.l.liu@intel.com, eric.auger@redhat.com,
    baolu.lu@linux.intel.com, jean-philippe@linaro.org,
    iommu@lists.linux.dev, qianweili

On Wed, May 03, 2023 at 03:14:28PM +0000, Shameerali Kolothum Thodi wrote:
> > > On my emulation environment (very slow), with mmap, I see
> > > improvements. I'll also try setting up a test suite on a proper HW
> > > this week.
> >
> > Still debugging the mmap method here,
> >
> > Status 4.12
> >
> > 1. ioctl method works, based on iommufd
>
> I had a go with your above branches. The issue below seems to happen only
> when you assign multiple devices to Guest. It looks like when you have
> multiple devices attached to Guest, the host kernel receives invalid
> Guest cmd(0x0) and ends up issuing that to HW triggering the below errors.

I actually found a bug in my previous wip branch (non-mmap), resulting
in a cons/prod index overflow. But the mmap branch seems to be okay,
since my QEMU code copies all TLBI commands starting from index=0x0.

You may compare the git-diff, just in case that it helps:
https://github.com/nicolinc/iommufd/commits/wip/iommufd_nesting-04262023-Nic
https://github.com/nicolinc/qemu/commits/wip/iommufd_nesting-04222023

Btw, I ran a comparison on real hardware. But it does not show
significant improvements between the ioctl solution and the mmap
solution. So I kept the ioctl one in v2.

Thanks
Nic
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-05 5:45 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
    shameerali.kolothum.thodi, jean-philippe, iommu

On Tue, Apr 04, 2023 at 01:20:01PM -0300, Jason Gunthorpe wrote:
> On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
>
> > My preference is to have a mmap'd page, so the interface can
> > be reused later by VCMDQ too. Performance-wise, it should be
> > good enough, since it does batching, IMHO.
>
> You can't reuse mmaping the queue page with vcmdq, so it doesn't seem
> meaningful to me.
>
> There should be no mmap on the SW path. If you need a half step
> between an ioctl as a batch and a full vhost-like queue scheme then
> using iouring with pre-registered memory would be appropriate.

I've changed to a non-mmap approach that the host kernel reads
the guest queue directly and inserts all invalidation commands
into the host queue.

The qsz could be as large as 128 x 64K pages. So, there has to
be a big array of pages getting pinned in the handler.

(The handler still needs a pathway to report errors. I will add
tomorrow.)

Does the implementation below look fine in general?

Thanks
Nicolin

[User Data]

/**
 * struct iommu_hwpt_invalidate_arm_smmuv3 - ARM SMMUv3 cache invalidation info
 * @cmdq_base: User space base virtual address of user command queue
 * @cmdq_entry_size: Entry size of user command queue
 * @cmdq_log2size: User command queue size as log 2 (entries)
 *                 Refer to LOG2SIZE field of SMMU_CMDQ_BASE register
 * @cmdq_prod: Producer index of user command queue
 * @cmdq_cons: Consumer index of user command queue
 */
struct iommu_hwpt_invalidate_arm_smmuv3 {
        __u64 cmdq_base;
        __u32 cmdq_entry_size;
        __u32 cmdq_log2size;
        __u32 cmdq_prod;
        __u32 cmdq_cons;
};

[Host Handler]

static int arm_smmu_fix_user_cmd(struct arm_smmu_domain *smmu_domain, u64 *cmd)
{
        struct arm_smmu_stream *stream;

        switch (*cmd & CMDQ_0_OP) {
        case CMDQ_OP_TLBI_NSNH_ALL:
                *cmd &= ~CMDQ_0_OP;
                *cmd |= CMDQ_OP_TLBI_NH_ALL;
                fallthrough;
        case CMDQ_OP_TLBI_NH_VA:
        case CMDQ_OP_TLBI_NH_VAA:
        case CMDQ_OP_TLBI_NH_ALL:
        case CMDQ_OP_TLBI_NH_ASID:
                *cmd &= ~CMDQ_TLBI_0_VMID;
                *cmd |= FIELD_PREP(CMDQ_TLBI_0_VMID,
                                   smmu_domain->s2->s2_cfg.vmid);
                break;
        case CMDQ_OP_ATC_INV:
        case CMDQ_OP_CFGI_CD:
        case CMDQ_OP_CFGI_CD_ALL:
                xa_lock(&smmu_domain->smmu->user_streams);
                stream = xa_load(&smmu_domain->smmu->user_streams,
                                 FIELD_GET(CMDQ_CFGI_0_SID, *cmd));
                xa_unlock(&smmu_domain->smmu->user_streams);
                if (!stream)
                        return -ENODEV;
                *cmd &= ~CMDQ_CFGI_0_SID;
                *cmd |= FIELD_PREP(CMDQ_CFGI_0_SID, stream->id);
                break;
        default:
                return -EOPNOTSUPP;
        }
        pr_debug("Fixed user CMD: %016llx : %016llx\n", cmd[1], cmd[0]);
        return 0;
}

static void arm_smmu_cache_invalidate_user(struct iommu_domain *domain,
                                           void *user_data)
{
        const u32 cons_err = FIELD_PREP(CMDQ_CONS_ERR, CMDQ_ERR_CERROR_ILL_IDX);
        struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);
        struct iommu_hwpt_invalidate_arm_smmuv3 *inv = user_data;
        struct arm_smmu_device *smmu = smmu_domain->smmu;
        struct arm_smmu_queue q = {
                .llq = {
                        .prod = inv->cmdq_prod,
                        .cons = inv->cmdq_cons,
                        .max_n_shift = inv->cmdq_log2size,
                },
                .ent_dwords = inv->cmdq_entry_size / sizeof(u64),
        };
        int ncmds = inv->cmdq_prod - inv->cmdq_cons;
        unsigned int nents = 1 << q.llq.max_n_shift;
        size_t qsz = nents * inv->cmdq_entry_size;
        unsigned long npages = qsz >> PAGE_SHIFT;
        struct page **pages;
        long pinned;
        u64 *cmds;
        int i = 0;
        int ret;

        if (!smmu || !smmu_domain->s2 || domain->type != IOMMU_DOMAIN_NESTED)
                return;
        if (WARN_ON(q.ent_dwords != CMDQ_ENT_DWORDS))
                return;
        if (WARN_ON(queue_empty(&q.llq)))
                return;
        WARN_ON(q.llq.max_n_shift > smmu->cmdq.q.llq.max_n_shift);

        pages = kcalloc(npages, sizeof(*pages), GFP_KERNEL);
        if (!pages)
                return;

        if (ncmds <= 0)
                ncmds += nents;
        cmds = kcalloc(ncmds, inv->cmdq_entry_size, GFP_KERNEL);
        if (!cmds)
                goto out_free_pages;

        pinned = get_user_pages(inv->cmdq_base, npages, FOLL_GET, pages, NULL);
        if (pinned != npages)
                goto out_put_page;
        q.base = page_to_virt(pages[0]) + (inv->cmdq_base & (PAGE_SIZE - 1));

        do {
                u64 *cmd = &cmds[i * CMDQ_ENT_DWORDS];

                queue_read(cmd, Q_ENT(&q, q.llq.cons), q.ent_dwords);
                ret = arm_smmu_fix_user_cmd(smmu_domain, cmd);
                if (ret && ret != -EOPNOTSUPP) {
                        q.llq.cons |= cons_err;
                        goto out_put_page;
                }
                if (!ret)
                        i++;
                queue_inc_cons(&q.llq);
        } while (!queue_empty(&q.llq));

        ret = arm_smmu_cmdq_issue_cmdlist(smmu, cmds, i, true);
        /* FIXME return CMD_SYNC timeout */
out_put_page:
        for (i = 0; i < pinned; i++)
                put_page(pages[i]);
        kfree(cmds);
out_free_pages:
        kfree(pages);
        inv->cmdq_cons = q.llq.cons;
}
* Re: Cache Invalidation Solution for Nested IOMMU
From: Jason Gunthorpe @ 2023-04-05 11:37 UTC (permalink / raw)
To: Nicolin Chen
Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
    shameerali.kolothum.thodi, jean-philippe, iommu

On Tue, Apr 04, 2023 at 10:45:56PM -0700, Nicolin Chen wrote:
> On Tue, Apr 04, 2023 at 01:20:01PM -0300, Jason Gunthorpe wrote:
> > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> >
> > > My preference is to have a mmap'd page, so the interface can
> > > be reused later by VCMDQ too. Performance-wise, it should be
> > > good enough, since it does batching, IMHO.
> >
> > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem
> > meaningful to me.
> >
> > There should be no mmap on the SW path. If you need a half step
> > between an ioctl as a batch and a full vhost-like queue scheme then
> > using iouring with pre-registered memory would be appropriate.
>
> I've changed to a non-mmap approach that the host kernel reads
> the guest queue directly and inserts all invalidation commands
> into the host queue.
>
> The qsz could be as large as 128 x 64K pages. So, there has to
> be a big array of pages getting pinned in the handler.
>
> (The handler still needs a pathway to report errors. I will add
> tomorrow.)
>
> Does the implementation below look fine in general?

In general get_user_pages() is really slow compared to copy_from_user

Jason
* Re: Cache Invalidation Solution for Nested IOMMU
From: Nicolin Chen @ 2023-04-05 15:34 UTC (permalink / raw)
To: Jason Gunthorpe
Cc: Robin Murphy, kevin.tian, yi.l.liu, eric.auger, baolu.lu,
    shameerali.kolothum.thodi, jean-philippe, iommu

On Wed, Apr 05, 2023 at 08:37:56AM -0300, Jason Gunthorpe wrote:
> On Tue, Apr 04, 2023 at 10:45:56PM -0700, Nicolin Chen wrote:
> > On Tue, Apr 04, 2023 at 01:20:01PM -0300, Jason Gunthorpe wrote:
> > > On Mon, Apr 03, 2023 at 05:02:09PM -0700, Nicolin Chen wrote:
> > >
> > > > My preference is to have a mmap'd page, so the interface can
> > > > be reused later by VCMDQ too. Performance-wise, it should be
> > > > good enough, since it does batching, IMHO.
> > >
> > > You can't reuse mmaping the queue page with vcmdq, so it doesn't seem
> > > meaningful to me.
> > >
> > > There should be no mmap on the SW path. If you need a half step
> > > between an ioctl as a batch and a full vhost-like queue scheme then
> > > using iouring with pre-registered memory would be appropriate.
> >
> > I've changed to a non-mmap approach that the host kernel reads
> > the guest queue directly and inserts all invalidation commands
> > into the host queue.
> >
> > The qsz could be as large as 128 x 64K pages. So, there has to
> > be a big array of pages getting pinned in the handler.
> >
> > (The handler still needs a pathway to report errors. I will add
> > tomorrow.)
> >
> > Does the implementation below look fine in general?
>
> In general get_user_pages() is really slow compared to copy_from_user

Will change to using copy_from_user(). Thanks!

Nic
Thread overview: 35+ messages
2023-04-03  0:33 Cache Invalidation Solution for Nested IOMMU Nicolin Chen
2023-04-03  7:26 ` Liu, Yi L
2023-04-03  8:39 ` Tian, Kevin
2023-04-03 15:24 ` Nicolin Chen
2023-04-04  2:42 ` Tian, Kevin
2023-04-04  3:12 ` Nicolin Chen
2023-04-03 12:23 ` Jason Gunthorpe
2023-04-03  8:00 ` Tian, Kevin
2023-04-03 14:29 ` Nicolin Chen
2023-04-04  2:15 ` Tian, Kevin
2023-04-04  2:47 ` Nicolin Chen
2023-04-03 14:08 ` Jason Gunthorpe
2023-04-03 14:51 ` Nicolin Chen
2023-04-03 19:15 ` Robin Murphy
2023-04-04  0:02 ` Nicolin Chen
2023-04-04 16:20 ` Jason Gunthorpe
2023-04-04 16:50 ` Shameerali Kolothum Thodi
2023-04-05 11:57 ` Jason Gunthorpe
2023-04-06  6:23 ` Zhangfei Gao
2023-04-06  6:39 ` Nicolin Chen
2023-04-06 11:40 ` Jason Gunthorpe
2023-04-10  1:08 ` Nicolin Chen
2023-04-11  9:07 ` Jean-Philippe Brucker
2023-04-11 11:57 ` Jason Gunthorpe
2023-04-11 18:39 ` Nicolin Chen
2023-04-11 18:41 ` Jason Gunthorpe
2023-04-11 19:02 ` Nicolin Chen
2023-04-11 18:43 ` Nicolin Chen
2023-04-12  2:47 ` Zhangfei Gao
2023-04-12  5:47 ` Nicolin Chen
2023-05-03 15:14 ` Shameerali Kolothum Thodi
2023-05-03 23:44 ` Nicolin Chen
2023-04-05  5:45 ` Nicolin Chen
2023-04-05 11:37 ` Jason Gunthorpe
2023-04-05 15:34 ` Nicolin Chen