* About unmap pages and set dirty tracking on nested parent domain
@ 2024-01-25 13:55 Yi Liu
2024-01-25 14:03 ` Jason Gunthorpe
0 siblings, 1 reply; 5+ messages in thread
From: Yi Liu @ 2024-01-25 13:55 UTC (permalink / raw)
To: Jason Gunthorpe, Tian, Kevin
Cc: Lu Baolu, Nicolin Chen, alex.williamson@redhat.com, Robin Murphy,
Joerg Roedel, iommu@lists.linux.dev
Hi Jason, Kevin,
Today, Intel iommu driver only tracks attached devices/iommus in the nested
domain. While the nested parent domain does not. This makes cache flush on
nested parent domain be a nop if it's only used as parent. It is an issue
if there is unmap on nested parent domain. Also, the dirty page tracking
would not take effect as well for such a nested parent domain. I plan to
fix this gap by tracking stage-1 domains in the nested parent domain within
intel iommu driver. So that iommu driver can loop the nested domains and
the devices attached on it to flush caches. However, I'd like to check with
you first before taking action.
1) Do we want to allow unmap pages on nested parent domain? Today there is
no PRQ support on the nested parent domain (stage-2). That's why both
VFIO and IOMMUFD pins the page in the DMA_MAP. As a result, VMMs like
Qemu cannot not unamp pages in nested parent domain after VM is running.
Otherwise, there will be DMA translation faults. However, there is no
architecture limitation to support PRQ for stage-2. If PRQ is supported,
there may be unmap on nested parent.
2) If answer of 1) is yes. Should the owner of stage-1 be notified about
the unmap event on its nested parent, hence owner is able to flush the
corresponding stage-1 cache explicitly? For VT-d, no such requirement as
intel iommu driver can flush the related stage-1 cache as well when
flushing the stage-2 cache in nested translation mode. Not sure about
other vendors.
3) Is it enough to fix this gap within iommu driver? or need to be handled
in the generic layer? e.g. let the iommufd layer to track stage-1 hwpts
in stage-2 hwpt. In this way, iommufd can flush stage-1 cache when
unmapping pages on stage-2.
Regards,
Yi Liu
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: About unmap pages and set dirty tracking on nested parent domain
2024-01-25 13:55 About unmap pages and set dirty tracking on nested parent domain Yi Liu
@ 2024-01-25 14:03 ` Jason Gunthorpe
2024-01-26 0:19 ` Jason Gunthorpe
2024-01-26 9:30 ` Yi Liu
0 siblings, 2 replies; 5+ messages in thread
From: Jason Gunthorpe @ 2024-01-25 14:03 UTC (permalink / raw)
To: Yi Liu, Suravee Suthikulpanit
Cc: Tian, Kevin, Lu Baolu, Nicolin Chen, alex.williamson@redhat.com,
Robin Murphy, Joerg Roedel, iommu@lists.linux.dev
On Thu, Jan 25, 2024 at 09:55:46PM +0800, Yi Liu wrote:
> Hi Jason, Kevin,
>
> Today, Intel iommu driver only tracks attached devices/iommus in the nested
> domain. While the nested parent domain does not.
Heh, I was just looking at this bug on my ARM implemention too :)
> This makes cache flush on nested parent domain be a nop if it's only
> used as parent.
Yep, this is wrong.
> 1) Do we want to allow unmap pages on nested parent domain?
Yes. It is needed for memory unplug.
> Today there is
> no PRQ support on the nested parent domain (stage-2). That's why both
> VFIO and IOMMUFD pins the page in the DMA_MAP. As a result, VMMs like
> Qemu cannot not unamp pages in nested parent domain after VM is
> running.
No, qemu can do an explicit unmap command to iommufd. PRQ is not relavent
> 2) If answer of 1) is yes. Should the owner of stage-1 be notified about
> the unmap event on its nested parent, hence owner is able to flush the
> corresponding stage-1 cache explicitly?
No. We don't support "mdev" "access" operations on nests so there is
no reason to notify anyone. If qemu hot unplugs memory from a VM then
it should already have some idea that the VM is not doing DMA to that
memory.
> 3) Is it enough to fix this gap within iommu driver? or need to be handled
> in the generic layer? e.g. let the iommufd layer to track stage-1 hwpts
> in stage-2 hwpt. In this way, iommufd can flush stage-1 cache when
> unmapping pages on stage-2.
I think the iommu driver should fix it
Notice there is also an ATC requirement here, when the nesting parent
changes it needs to issue a full ATC flush on PASID 0, not a range
flush.
Also notice the iommu probably has to zap the entire IOTLB for any
nesting child if the parent changes, unless it has amazing HW :)
It would be really awkward to try to lift this detail out of the
driver.
Lets add Suravee to be sure the AMD driver is aware of this detail
too.
My plan is to have the nesting attach add the device to the parent
domain's invalidation list and have a flag in the master_domain to
indicate this attachment has the special ATC invalidation.
This will allow the S2 to be used normally as well, eg for the
identity map.
Jason
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: About unmap pages and set dirty tracking on nested parent domain
2024-01-25 14:03 ` Jason Gunthorpe
@ 2024-01-26 0:19 ` Jason Gunthorpe
2024-01-26 9:36 ` Yi Liu
2024-01-26 9:30 ` Yi Liu
1 sibling, 1 reply; 5+ messages in thread
From: Jason Gunthorpe @ 2024-01-26 0:19 UTC (permalink / raw)
To: Yi Liu, Suravee Suthikulpanit
Cc: Tian, Kevin, Lu Baolu, Nicolin Chen, alex.williamson@redhat.com,
Robin Murphy, Joerg Roedel, iommu@lists.linux.dev
On Thu, Jan 25, 2024 at 10:03:31AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2024 at 09:55:46PM +0800, Yi Liu wrote:
> > Hi Jason, Kevin,
> >
> > Today, Intel iommu driver only tracks attached devices/iommus in the nested
> > domain. While the nested parent domain does not.
>
> Heh, I was just looking at this bug on my ARM implemention too :)
What I did for SMMU is on my github how:
https://github.com/jgunthorpe/linux/commits/smmuv3_newapi/
See iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
The list that tracks where the iommu domain sends its invalidations
gets a flag indicating that ATC has to be a full flush
The domain gets a flag that says the IOTLB has to be a full flush (to
wipe the nest child entries too)
Also note that if the VM is being relied on to generate ATC
invalidations then the hypervisor must also track the VM's ATS
state. If the VM thinks ATS is off then it will not issue ATC flushes.
ARM has a convenient STE bit that makes this simple, otherwise the VMM
will have to trap the PCI-E ATS config write and forward it through
iommufd somehow.
Tracking all the ATS stuff was a big PITA, I think the approach I got
for SMMU is pretty good though.
Jason
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: About unmap pages and set dirty tracking on nested parent domain
2024-01-25 14:03 ` Jason Gunthorpe
2024-01-26 0:19 ` Jason Gunthorpe
@ 2024-01-26 9:30 ` Yi Liu
1 sibling, 0 replies; 5+ messages in thread
From: Yi Liu @ 2024-01-26 9:30 UTC (permalink / raw)
To: Jason Gunthorpe, Suravee Suthikulpanit
Cc: Tian, Kevin, Lu Baolu, Nicolin Chen, alex.williamson@redhat.com,
Robin Murphy, Joerg Roedel, iommu@lists.linux.dev
On 2024/1/25 22:03, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2024 at 09:55:46PM +0800, Yi Liu wrote:
>> Hi Jason, Kevin,
>>
>> Today, Intel iommu driver only tracks attached devices/iommus in the nested
>> domain. While the nested parent domain does not.
>
> Heh, I was just looking at this bug on my ARM implemention too :)
>
>> This makes cache flush on nested parent domain be a nop if it's only
>> used as parent.
>
> Yep, this is wrong.
>
>> 1) Do we want to allow unmap pages on nested parent domain?
>
> Yes. It is needed for memory unplug.
>
>> Today there is
>> no PRQ support on the nested parent domain (stage-2). That's why both
>> VFIO and IOMMUFD pins the page in the DMA_MAP. As a result, VMMs like
>> Qemu cannot not unamp pages in nested parent domain after VM is
>> running.
>
> No, qemu can do an explicit unmap command to iommufd. PRQ is not relavent
Indeed!!!
>
>> 2) If answer of 1) is yes. Should the owner of stage-1 be notified about
>> the unmap event on its nested parent, hence owner is able to flush the
>> corresponding stage-1 cache explicitly?
>
> No. We don't support "mdev" "access" operations on nests so there is
> no reason to notify anyone. If qemu hot unplugs memory from a VM then
> it should already have some idea that the VM is not doing DMA to that
> memory.
makes sense.
>
>> 3) Is it enough to fix this gap within iommu driver? or need to be handled
>> in the generic layer? e.g. let the iommufd layer to track stage-1 hwpts
>> in stage-2 hwpt. In this way, iommufd can flush stage-1 cache when
>> unmapping pages on stage-2.
>
> I think the iommu driver should fix it
okay.
> Notice there is also an ATC requirement here, when the nesting parent
> changes it needs to issue a full ATC flush on PASID 0, not a range
> flush.
right. I missed this part. Thanks for pointing it out.
> Also notice the iommu probably has to zap the entire IOTLB for any
> nesting child if the parent changes, unless it has amazing HW :)
VT-d seems to be the amazing HW :) It can flush the stage-1 caches that
refers the stage-2 mapping during stage-2 cache invalidation.
> It would be really awkward to try to lift this detail out of the
> driver.
ok. let's do it in iommu driver.
> Lets add Suravee to be sure the AMD driver is aware of this detail
> too.
>
> My plan is to have the nesting attach add the device to the parent
> domain's invalidation list and have a flag in the master_domain to
> indicate this attachment has the special ATC invalidation.
>
> This will allow the S2 to be used normally as well, eg for the
> identity map.
I've considered this way as well. However, this means a single device (say
device_domain_info) at least needs two list_head to link into the
stage-1 and stage-2 domain. It may result in some inconvenience in
the existing single stage (e.g. stage-2) code. Also, this means we
need to track the iommu in the stage-2 domain as well. This means
in nested attach, there will be two domain/iommu association. This
will get some extra complexity in the domain ID determination. Need
to ensure the two association uses the same domain ID. To be simple,
I planned to have a list tracking stage-1 domains in its parent domain.
When flushing cache, set dirty tracking, the helper should loop the
stage-1 domain list and loop the devices/iommus accordingly. While for
the device tlb flush, just flush the entire addr range (0 - MAX). My
code is in the below branch with some other fixes. Need more tests.. :)
a3b8450962b6 iommu/vt-d: Set up dirty tracking for nested parent domain
85879c330286 iommu/vt-d: Add missing cache flush on nested parent domain
ac22f14b2612 iommu/vt-d: Wrap page selective iotlb flush for domain
7311a31614dc iommu/vt-d: Track nested domains on the same parent
https://github.com/yiliu1765/iommufd/tree/wip/iommufd_nesting_fixes
--
Regards,
Yi Liu
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: About unmap pages and set dirty tracking on nested parent domain
2024-01-26 0:19 ` Jason Gunthorpe
@ 2024-01-26 9:36 ` Yi Liu
0 siblings, 0 replies; 5+ messages in thread
From: Yi Liu @ 2024-01-26 9:36 UTC (permalink / raw)
To: Jason Gunthorpe, Suravee Suthikulpanit
Cc: Tian, Kevin, Lu Baolu, Nicolin Chen, alex.williamson@redhat.com,
Robin Murphy, Joerg Roedel, iommu@lists.linux.dev
On 2024/1/26 08:19, Jason Gunthorpe wrote:
> On Thu, Jan 25, 2024 at 10:03:31AM -0400, Jason Gunthorpe wrote:
>> On Thu, Jan 25, 2024 at 09:55:46PM +0800, Yi Liu wrote:
>>> Hi Jason, Kevin,
>>>
>>> Today, Intel iommu driver only tracks attached devices/iommus in the nested
>>> domain. While the nested parent domain does not.
>>
>> Heh, I was just looking at this bug on my ARM implemention too :)
>
> What I did for SMMU is on my github how:
>
> https://github.com/jgunthorpe/linux/commits/smmuv3_newapi/
>
> See iommu/arm-smmu-v3: Support IOMMU_DOMAIN_NESTED
thanks for the sharing.
> The list that tracks where the iommu domain sends its invalidations
> gets a flag indicating that ATC has to be a full flush
yeah, I pass a 0 - MAX to do it.
> The domain gets a flag that says the IOTLB has to be a full flush (to
> wipe the nest child entries too)
yes
> Also note that if the VM is being relied on to generate ATC
> invalidations then the hypervisor must also track the VM's ATS
> state. If the VM thinks ATS is off then it will not issue ATC flushes.
for now, we don't rely on VM to issue ATC invalidations as we don't have
plan to forward the ATC invalidation timeout error to VM. As one of your
prior remarks (it should have be handled in a RAS arch). So we don't really
need VM to generate ATC invalidation since intel iommu driver will flush
ATC if host has enabled it. So I think VT-d side is fine on this so far.
> ARM has a convenient STE bit that makes this simple, otherwise the VMM
> will have to trap the PCI-E ATS config write and forward it through
> iommufd somehow.
yes. This may need new API for device driver (VFIO) to pass such info.
> Tracking all the ATS stuff was a big PITA, I think the approach I got
> for SMMU is pretty good though.
--
Regards,
Yi Liu
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-01-26 9:32 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-25 13:55 About unmap pages and set dirty tracking on nested parent domain Yi Liu
2024-01-25 14:03 ` Jason Gunthorpe
2024-01-26 0:19 ` Jason Gunthorpe
2024-01-26 9:36 ` Yi Liu
2024-01-26 9:30 ` Yi Liu
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.