* One question about the hypercall to translate gfn to mfn.
@ 2014-12-09 10:10 Yu, Zhang
2014-12-09 10:19 ` Paul Durrant
` (2 more replies)
0 siblings, 3 replies; 59+ messages in thread
From: Yu, Zhang @ 2014-12-09 10:10 UTC (permalink / raw)
To: Paul.Durrant, keir, tim, JBeulich, kevin.tian, Xen-devel
Hi all,
As you can see, we are pushing our XenGT patches upstream. One
feature we need in Xen is the ability to translate a guest's gfn to
mfn in the XenGT dom0 device model.
Here we may have 2 similar solutions:
1> Paul told me (and thank you, Paul :)) that there used to be a
hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32 because it had no
users at that time. So solution 1 is to revert this commit. However,
since this hypercall was removed ages ago, the revert hits many
conflicts, e.g. gmfn_to_mfn is no longer used on x86.
2> In our project, we defined a new hypercall,
XENMEM_get_mfn_from_pfn, whose implementation is similar to that of
the previous XENMEM_translate_gpfn_list. One of the major differences
is that the new one is x86-only (it is called from arch_memory_op),
so we do not have to worry about the ARM side.
Does anyone have any suggestions about this?
Thanks in advance. :)
B.R.
Yu
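
A minimal sketch of the kind of batch gfn-to-mfn query both options above would provide, written as a user-space toy rather than hypervisor code. Everything here is an assumption made for illustration: the request layout, the `toy_` names, and the array-backed p2m are hypothetical and are not taken from the Xen headers or from the XenGT patches.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_INVALID UINT64_MAX

typedef uint64_t frame_t;

/* Toy stand-in for one domain's p2m: index = gfn, value = mfn. */
struct toy_p2m {
    const frame_t *gfn_to_mfn;
    size_t nr_entries;
};

/* Hypothetical request layout; the field names are illustrative only. */
struct toy_translate_req {
    uint16_t domid;          /* domain whose gfns are being translated */
    size_t nr_gfns;          /* number of entries in both lists */
    const frame_t *gfn_list; /* in: guest frame numbers */
    frame_t *mfn_list;       /* out: machine frame numbers */
};

/*
 * Translate every gfn in the request. Returns 0 on success, or -1 at the
 * first gfn without a valid mapping (earlier entries are already filled in).
 */
int toy_translate_gfn_list(const struct toy_p2m *p2m,
                           struct toy_translate_req *req)
{
    assert(p2m != NULL && req != NULL);

    for (size_t i = 0; i < req->nr_gfns; i++) {
        frame_t gfn = req->gfn_list[i];

        if (gfn >= p2m->nr_entries || p2m->gfn_to_mfn[gfn] == FRAME_INVALID)
            return -1;
        req->mfn_list[i] = p2m->gfn_to_mfn[gfn];
    }
    return 0;
}
```

A real hypercall would additionally have to check that the caller is privileged over `domid` and copy the lists to and from guest memory; the sketch leaves both out.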
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 10:10 One question about the hypercall to translate gfn to mfn Yu, Zhang
@ 2014-12-09 10:19 ` Paul Durrant
2014-12-09 10:37 ` Yu, Zhang
2014-12-09 10:38 ` Jan Beulich
2014-12-09 10:46 ` Tim Deegan
2 siblings, 1 reply; 59+ messages in thread
From: Paul Durrant @ 2014-12-09 10:19 UTC (permalink / raw)
To: Yu, Zhang, Keir (Xen.org), Tim (Xen.org), JBeulich@suse.com, Kevin Tian, Xen-devel@lists.xen.org

> -----Original Message-----
> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
> Sent: 09 December 2014 10:11
> To: Paul Durrant; Keir (Xen.org); Tim (Xen.org); JBeulich@suse.com; Kevin
> Tian; Xen-devel@lists.xen.org
> Subject: One question about the hypercall to translate gfn to mfn.
>
> [...]
>
> Does anyone has any suggestions about this?

IIUC what is needed is a means to IOMMU map a gfn in the service domain
(dom0 for the moment) such that it can be accessed by the GPU. I think
use of a raw mfn value currently works only because dom0 is using a 1:1
IOMMU mapping scheme. Is my understanding correct, or do you really
need raw mfn values?

Paul

> Thanks in advance. :)
>
> B.R.
> Yu

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 10:19 ` Paul Durrant
@ 2014-12-09 10:37 ` Yu, Zhang
2014-12-09 10:50 ` Jan Beulich
2014-12-09 10:51 ` Malcolm Crossley
0 siblings, 2 replies; 59+ messages in thread
From: Yu, Zhang @ 2014-12-09 10:37 UTC (permalink / raw)
To: Paul Durrant, Keir (Xen.org), Tim (Xen.org), JBeulich@suse.com, Kevin Tian, Xen-devel@lists.xen.org

On 12/9/2014 6:19 PM, Paul Durrant wrote:
> I think use of a raw mfn value currently works only because dom0 is
> using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do
> you really need raw mfn values?

Thanks for your quick response, Paul.
Well, not exactly for this case. :)
In XenGT, we need to translate gfn to mfn for the GPU's page table,
which contains the translation between graphics addresses and memory
addresses. This page table is maintained by the GPU drivers, and our
service domain needs a way to translate the guest physical addresses
written by the vGPU into host physical ones.
We do not use the IOMMU in XenGT, and therefore this translation may
not necessarily be a 1:1 mapping.

B.R.
Yu

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 10:37 ` Yu, Zhang
@ 2014-12-09 10:50 ` Jan Beulich
2014-12-10 1:07 ` Tian, Kevin
2014-12-09 10:51 ` Malcolm Crossley
1 sibling, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-09 10:50 UTC (permalink / raw)
To: Zhang Yu
Cc: Tim (Xen.org), Kevin Tian, Paul Durrant, Keir (Xen.org), Xen-devel@lists.xen.org

>>> On 09.12.14 at 11:37, <yu.c.zhang@linux.intel.com> wrote:
> On 12/9/2014 6:19 PM, Paul Durrant wrote:
>> I think use of an raw mfn value currently works only because dom0 is
>> using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do
>> you really need raw mfn values?
> Thanks for your quick response, Paul.
> Well, not exactly for this case. :)
> In XenGT, our need to translate gfn to mfn is for GPU's page table,
> which contains the translation between graphic address and the memory
> address. This page table is maintained by GPU drivers, and our service
> domain need to have a method to translate the guest physical addresses
> written by the vGPU into host physical ones.
> We do not use IOMMU in XenGT and therefore this translation may not
> necessarily be a 1:1 mapping.

Hmm, that suggests you indeed need raw MFNs, which in turn seems
problematic wrt PVH Dom0 (or you'd need a GFN->GMFN translation
layer). But while you don't use the IOMMU yourself, I suppose the GPU
accesses still don't bypass the IOMMU? In which case all you'd need
returned is a frame number that guarantees that after IOMMU
translation it refers to the correct MFN, i.e. still allowing for your
Dom0 driver to simply set aside a part of its PFN space, asking Xen to
(IOMMU-)map the necessary guest frames into there.

Jan

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 10:50 ` Jan Beulich
@ 2014-12-10 1:07 ` Tian, Kevin
2014-12-10 8:39 ` Jan Beulich
0 siblings, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10 1:07 UTC (permalink / raw)
To: Jan Beulich, Zhang Yu
Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org), Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, December 09, 2014 6:50 PM
>
> [...]
>
> Hmm, that suggests you indeed need raw MFNs, which in turn seems
> problematic wrt PVH Dom0 (or you'd need a GFN->GMFN translation
> layer). But while you don't use the IOMMU yourself, I suppose the GPU
> accesses still don't bypass the IOMMU? In which case all you'd need
> returned is a frame number that guarantees that after IOMMU
> translation it refers to the correct MFN, i.e. still allowing for your
> Dom0 driver to simply set aside a part of its PFN space, asking Xen to
> (IOMMU-)map the necessary guest frames into there.
>

No. What we require is the raw MFNs. One IOMMU device entry can't
point to multiple VM's page tables, so that's why XenGT needs to use
software shadow GPU page table to implement the sharing. Note it's
not for dom0 to access the MFN. It's for dom0 to setup the correct
shadow GPU page table, so a VM can access the graphics memory
in a controlled way.

Thanks
Kevin

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-10 1:07 ` Tian, Kevin
@ 2014-12-10 8:39 ` Jan Beulich
2014-12-10 8:47 ` Tian, Kevin
2014-12-10 8:50 ` Tian, Kevin
0 siblings, 2 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-10 8:39 UTC (permalink / raw)
To: Kevin Tian, Zhang Yu
Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org), Xen-devel@lists.xen.org

>>> On 10.12.14 at 02:07, <kevin.tian@intel.com> wrote:
> [...]
>
> No. What we require is the raw MFNs. One IOMMU device entry can't
> point to multiple VM's page tables, so that's why XenGT needs to use
> software shadow GPU page table to implement the sharing. Note it's
> not for dom0 to access the MFN. It's for dom0 to setup the correct
> shadow GPU page table, so a VM can access the graphics memory
> in a controlled way.

So what's the translation flow here: driver -> GPU -> IOMMU ->
hardware or driver -> IOMMU -> GPU -> hardware? Or do things get
set up for the GPU to bypass the IOMMU altogether?

Jan

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-10 8:39 ` Jan Beulich
@ 2014-12-10 8:47 ` Tian, Kevin
2014-12-10 9:16 ` Jan Beulich
2014-12-10 8:50 ` Tian, Kevin
1 sibling, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10 8:47 UTC (permalink / raw)
To: Jan Beulich, Zhang Yu
Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org), Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, December 10, 2014 4:39 PM
>
> [...]
>
> So what's the translation flow here: driver -> GPU -> IOMMU ->
> hardware or driver -> IOMMU -> GPU -> hardware? Or do things get
> set up for the GPU to bypass the IOMMU altogether?
>

two translation paths in assigned case:

1. [direct CPU access from VM], with partitioned PCI aperture
resource, every VM can access a portion of PCI aperture directly.

- CPU page table/EPT: CPU virtual address->PCI aperture
- PCI aperture - bar base = Graphics Memory Address (GMA)
- GPU page table: GMA -> GPA (as programmed by guest)
- IOMMU: GPA -> MPA

2. [GPU access through GPU command operands], with GPU scheduling,
every VM's command buffer will be fetched by GPU in a time-shared
manner.

- GPU page table: GMA->GPA
- IOMMU: GPA->MPA

In our case, IOMMU is setup with 1:1 identity table for dom0. So
when GPU may access GPAs from different VMs, we can't count on
IOMMU which can only serve one mapping for one device (unless
we have SR-IOV).

That's why we need shadow GPU page table in dom0, and need a
p2m query call to translate from GPA -> MPA:

- shadow GPU page table: GMA->MPA
- IOMMU: MPA->MPA (for dom0)

Thanks
Kevin

^ permalink raw reply [flat|nested] 59+ messages in thread
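
A minimal sketch of the shadow GPU page table construction described in the message above: the guest's table maps GMA to GPA, and dom0 fills a shadow table mapping GMA to MPA by running each guest entry through a p2m query. The `toy_p2m_lookup` function, the frame-number layout, and all other names are assumptions made for illustration; they stand in for the hypercall under discussion and are not XenGT code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_INVALID          UINT64_MAX
#define TOY_FRAMES_PER_DOMAIN  0x1000u

/* Hypothetical p2m query, standing in for the hypercall under discussion. */
typedef uint64_t (*p2m_lookup_fn)(uint16_t domid, uint64_t gfn);

/* Toy p2m: domain d owns machine frames [d * 0x1000, (d + 1) * 0x1000). */
uint64_t toy_p2m_lookup(uint16_t domid, uint64_t gfn)
{
    if (gfn >= TOY_FRAMES_PER_DOMAIN)
        return FRAME_INVALID;
    return (uint64_t)domid * TOY_FRAMES_PER_DOMAIN + gfn;
}

/*
 * Build the shadow table for one VM. guest_pt[i] holds the guest frame
 * the guest programmed for graphics page i (GMA -> GPA); shadow_pt[i]
 * receives the machine frame (GMA -> MPA), or FRAME_INVALID if the guest
 * frame cannot be translated. Returns the number of valid shadow entries.
 */
size_t shadow_gpu_pt_sync(uint16_t domid, const uint64_t *guest_pt,
                          uint64_t *shadow_pt, size_t nr_entries,
                          p2m_lookup_fn lookup)
{
    size_t valid = 0;

    assert(guest_pt != NULL && shadow_pt != NULL && lookup != NULL);

    for (size_t i = 0; i < nr_entries; i++) {
        shadow_pt[i] = lookup(domid, guest_pt[i]);
        if (shadow_pt[i] != FRAME_INVALID)
            valid++;
    }
    return valid;
}
```

In this toy, two VMs that program the same GPA into their own GPU page tables end up with different machine frames in their shadow tables, which is the situation a single per-device IOMMU table cannot express.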
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-10 8:47 ` Tian, Kevin
@ 2014-12-10 9:16 ` Jan Beulich
2014-12-10 9:51 ` Tian, Kevin
0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-10 9:16 UTC (permalink / raw)
To: Kevin Tian, Zhang Yu
Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org), Xen-devel@lists.xen.org

>>> On 10.12.14 at 09:47, <kevin.tian@intel.com> wrote:
> two translation paths in assigned case:
>
> 1. [direct CPU access from VM], with partitioned PCI aperture
> resource, every VM can access a portion of PCI aperture directly.
>
> - CPU page table/EPT: CPU virtual address->PCI aperture
> - PCI aperture - bar base = Graphics Memory Address (GMA)
> - GPU page table: GMA -> GPA (as programmed by guest)
> - IOMMU: GPA -> MPA
>
> 2. [GPU access through GPU command operands], with GPU scheduling,
> every VM's command buffer will be fetched by GPU in a time-shared
> manner.
>
> - GPU page table: GMA->GPA
> - IOMMU: GPA->MPA
>
> In our case, IOMMU is setup with 1:1 identity table for dom0. So
> when GPU may access GPAs from different VMs, we can't count on
> IOMMU which can only serve one mapping for one device (unless
> we have SR-IOV).
>
> That's why we need shadow GPU page table in dom0, and need a
> p2m query call to translate from GPA -> MPA:
>
> - shadow GPU page table: GMA->MPA
> - IOMMU: MPA->MPA (for dom0)

I still can't see why the Dom0 translation has to remain 1:1, i.e.
why Xen couldn't return some "arbitrary" GPA for the query in
question here, setting up a suitable GPA->MPA translation. (I put
arbitrary in quotes because this of course must not conflict with
GPAs already or possibly in use by Dom0.) And I can only stress
again that you shouldn't leave out PVH (where the IOMMU already
isn't set up with all 1:1 mappings) from these considerations.

Jan

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-10 9:16 ` Jan Beulich
@ 2014-12-10 9:51 ` Tian, Kevin
2014-12-10 10:07 ` Jan Beulich
2014-12-10 11:04 ` Malcolm Crossley
0 siblings, 2 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10 9:51 UTC (permalink / raw)
To: Jan Beulich, Zhang Yu
Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org), Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, December 10, 2014 5:17 PM
>
> [...]
>
> I still can't see why the Dom0 translation has to remain 1:1, i.e.
> why Xen couldn't return some "arbitrary" GPA for the query in
> question here, setting up a suitable GPA->MPA translation. (I put
> arbitrary in quotes because this of course must not conflict with
> GPAs already or possibly in use by Dom0.) And I can only stress
> again that you shouldn't leave out PVH (where the IOMMU already
> isn't set up with all 1:1 mappings) from these considerations.
>

It's interesting that you think the IOMMU can be used in such a
situation.

What do you mean by "arbitrary" GPA here? And it's not just about
conflicting with Dom0's GPAs, it's about conflicts among all VMs' GPAs
when you host them through one IOMMU page table, and there's no way
to prevent this for sure, since GPAs are picked by the VMs themselves.

I don't think we can support PVH here if the IOMMU is not 1:1 mapped.

Thanks
Kevin

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-10 9:51 ` Tian, Kevin
@ 2014-12-10 10:07 ` Jan Beulich
2014-12-10 11:04 ` Malcolm Crossley
1 sibling, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-10 10:07 UTC (permalink / raw)
To: Kevin Tian, Zhang Yu
Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org), Xen-devel@lists.xen.org

>>> On 10.12.14 at 10:51, <kevin.tian@intel.com> wrote:
> [...]
>
> It's interesting that you think IOMMU can be used in such situation.
>
> what do you mean by "arbitrary" GPA here? and It's not just about
> conflicting with Dom0's GPA, it's about confliction in all VM's GPAs
> when you hosting them through one IOMMU page table, and there's
> no way to prevent this definitely since GPAs are picked by VMs
> themselves.

As long as for the involved DomU-s the physical address comes in
ways similar to PCI device BARs (which they're capable to deal with),
that's not a problem imo. For Dom0, just like BARs may get assigned
while bringing up PCI devices, a "virtual" BAR could be invented here.

> I don't think we can support PVH here if IOMMU is not 1:1 mapping.

That would make XenGT quite a bit less useful going forward. But
otoh don't you only care about certain MMIO regions to be 1:1
mapped? That's the case for PVH Dom0 too, iirc.

Jan

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-10 9:51 ` Tian, Kevin
2014-12-10 10:07 ` Jan Beulich
@ 2014-12-10 11:04 ` Malcolm Crossley
1 sibling, 0 replies; 59+ messages in thread
From: Malcolm Crossley @ 2014-12-10 11:04 UTC (permalink / raw)
To: xen-devel

On 10/12/14 09:51, Tian, Kevin wrote:
> [...]
>
> It's interesting that you think IOMMU can be used in such situation.
>
> what do you mean by "arbitrary" GPA here? and It's not just about
> conflicting with Dom0's GPA, it's about confliction in all VM's GPAs
> when you hosting them through one IOMMU page table, and there's
> no way to prevent this definitely since GPAs are picked by VMs
> themselves.
>
> I don't think we can support PVH here if IOMMU is not 1:1 mapping.
>

I agree with Jan: there doesn't need to be a fixed 1:1 mapping between
IOMMU addresses and MFNs.

I think all that's required is that there is an IOMMU mapping for the
GPU device connected to dom0 (or the driver domain) which allows guest
memory to be accessed by the GPU. This IOMMU address is what is
programmed into the shadow GPU page table. I refer to this address as
the bus frame number (BFN) in the PV IOMMU design document.

- shadow GPU page table: GMA->BFN
- IOMMU: BFN->MPA

IOMMUs can almost always address more than the host physical RAM, so we
can create IOMMU mappings above the top of host physical RAM in order
to have IOMMU mappings of guest RAM.

The PV-IOMMU design allows the guest to have control of the IOMMU
address space. In theory it could be extended to have permission checks
for mapping guest MFNs and to have a mapping interface which takes a
domid and a GMFN. That way the driver domain does not need to know the
actual MFNs being used.

The guest itself (CPU) accesses the GPU via outbound MMIO mappings, so
we don't need to be concerned with address translation in that
direction.

I think getting Xen to allocate IOMMU mappings for a driver domain will
be problematic for PV-based driver domains, because the M2P for PV
domains is not kept strictly up to date with what the guest is using
for its P2M, and so it will be difficult or impossible to determine
which addresses are not in use.

Similarly it may be difficult for HVM guests, because P2M mappings are
outbound (CPU to rest of host), and determining which addresses are
suitable for inbound access (rest of host to memory) may be difficult.
I.e., should the outbound MMIO address space be used for inbound IOMMU
mappings?

I hope I've not caused more confusion.

Malcolm

^ permalink raw reply [flat|nested] 59+ messages in thread
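
A minimal sketch of the BFN idea from the message above: bus frame numbers are handed out from a range starting at the top of host RAM, each backed by an IOMMU entry pointing at a machine frame, so the shadow GPU page table can hold BFNs rather than raw MFNs. The structure, the slot count, and the `toy_iommu_` functions are hypothetical and are not taken from the PV IOMMU design document; a real interface would, as suggested above, take a domid and a GMFN and check permissions rather than accept an MFN directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_INVALID   UINT64_MAX
#define TOY_IOMMU_SLOTS 8

/* Toy per-device IOMMU table covering a small window of bus frames. */
struct toy_iommu {
    uint64_t bfn_base;             /* first BFN above the top of host RAM */
    uint64_t mfn[TOY_IOMMU_SLOTS]; /* mfn[i] backs BFN bfn_base + i */
    size_t used;                   /* number of slots handed out so far */
};

void toy_iommu_init(struct toy_iommu *io, uint64_t top_of_ram_frame)
{
    assert(io != NULL);
    io->bfn_base = top_of_ram_frame;
    io->used = 0;
}

/* Map one machine frame; returns its BFN, or FRAME_INVALID if full. */
uint64_t toy_iommu_map(struct toy_iommu *io, uint64_t mfn)
{
    assert(io != NULL);
    if (io->used == TOY_IOMMU_SLOTS)
        return FRAME_INVALID;
    io->mfn[io->used] = mfn;
    return io->bfn_base + io->used++;
}

/* What the IOMMU would do on a device access: BFN -> MFN. */
uint64_t toy_iommu_translate(const struct toy_iommu *io, uint64_t bfn)
{
    assert(io != NULL);
    if (bfn < io->bfn_base || bfn - io->bfn_base >= io->used)
        return FRAME_INVALID;
    return io->mfn[bfn - io->bfn_base];
}
```

With this arrangement the driver domain only ever sees BFNs, and accesses to bus frames that were never mapped fail translation instead of reaching arbitrary host memory.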
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-10 8:39 ` Jan Beulich
2014-12-10 8:47 ` Tian, Kevin
@ 2014-12-10 8:50 ` Tian, Kevin
1 sibling, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10 8:50 UTC (permalink / raw)
To: Jan Beulich, Zhang Yu
Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org), Xen-devel@lists.xen.org

> From: Tian, Kevin
> Sent: Wednesday, December 10, 2014 4:48 PM
>
> [...]
>
> two translation paths in assigned case:
>
> 1. [direct CPU access from VM], with partitioned PCI aperture
> resource, every VM can access a portion of PCI aperture directly.

Sorry, the above description is for the XenGT shared case, and the
translation below is for the VT-d assigned case. I just put it there to
indicate the necessity of the same translation path in XenGT.

>
> - CPU page table/EPT: CPU virtual address->PCI aperture
> - PCI aperture - bar base = Graphics Memory Address (GMA)
> - GPU page table: GMA -> GPA (as programmed by guest)
> - IOMMU: GPA -> MPA
>
> 2. [GPU access through GPU command operands], with GPU scheduling,
> every VM's command buffer will be fetched by GPU in a time-shared
> manner.
>
> - GPU page table: GMA->GPA
> - IOMMU: GPA->MPA
>
> In our case, IOMMU is setup with 1:1 identity table for dom0. So
> when GPU may access GPAs from different VMs, we can't count on
> IOMMU which can only serve one mapping for one device (unless
> we have SR-IOV).
>
> That's why we need shadow GPU page table in dom0, and need a
> p2m query call to translate from GPA -> MPA:
>
> - shadow GPU page table: GMA->MPA
> - IOMMU: MPA->MPA (for dom0)
>
> Thanks
> Kevin

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 10:37 ` Yu, Zhang
2014-12-09 10:50 ` Jan Beulich
@ 2014-12-09 10:51 ` Malcolm Crossley
2014-12-10 1:22 ` Tian, Kevin
1 sibling, 1 reply; 59+ messages in thread
From: Malcolm Crossley @ 2014-12-09 10:51 UTC (permalink / raw)
To: xen-devel

On 09/12/14 10:37, Yu, Zhang wrote:
> [...]
> In XenGT, our need to translate gfn to mfn is for GPU's page table,
> which contains the translation between graphic address and the memory
> address. This page table is maintained by GPU drivers, and our service
> domain need to have a method to translate the guest physical addresses
> written by the vGPU into host physical ones.
> We do not use IOMMU in XenGT and therefore this translation may not
> necessarily be a 1:1 mapping.

XenGT must use the IOMMU mappings that Xen has set up for the domain
which owns the GPU. Currently Dom0 owns the GPU, and so its IOMMU
mappings match the MFN addresses. I suspect XenGT will not work if Xen
is booted with iommu=dom0-strict.

Malcolm

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 10:51 ` Malcolm Crossley
@ 2014-12-10 1:22 ` Tian, Kevin
0 siblings, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10 1:22 UTC (permalink / raw)
To: Malcolm Crossley, xen-devel@lists.xen.org
> From: Malcolm Crossley
> Sent: Tuesday, December 09, 2014 6:52 PM
>
> On 09/12/14 10:37, Yu, Zhang wrote:
> >
> >
> > On 12/9/2014 6:19 PM, Paul Durrant wrote:
> >> I think use of an raw mfn value currently works only because dom0 is
> >> using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do
> >> you really need raw mfn values?
> > Thanks for your quick response, Paul.
> > Well, not exactly for this case. :)
> > In XenGT, our need to translate gfn to mfn is for GPU's page table,
> > which contains the translation between graphic address and the memory
> > address. This page table is maintained by GPU drivers, and our service
> > domain need to have a method to translate the guest physical addresses
> > written by the vGPU into host physical ones.
> > We do not use IOMMU in XenGT and therefore this translation may not
> > necessarily be a 1:1 mapping.
>
> XenGT must use the IOMMU mappings that Xen has setup for the domain
> which owns the GPU. Currently Dom0 own's the GPU and so it's IOMMU
> mappings match the MFN's addresses. I suspect XenGT will not work if Xen
> is booted with iommu=dom0-strict.
>

This is a good point. So yes, in this case the IOMMU is still active and
contains a 1:1 mapping table, but that is a separate thing from the
interface discussed here, which is about setting up a shadow GPU page
table for other VMs' graphics memory accesses.

Thanks
Kevin
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 10:10 One question about the hypercall to translate gfn to mfn Yu, Zhang
2014-12-09 10:19 ` Paul Durrant
@ 2014-12-09 10:38 ` Jan Beulich
2014-12-09 10:46 ` Tim Deegan
2 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-09 10:38 UTC (permalink / raw)
To: Zhang Yu; +Cc: tim, kevin.tian, Paul.Durrant, keir, Xen-devel
>>> On 09.12.14 at 11:10, <yu.c.zhang@linux.intel.com> wrote:
> As you can see, we are pushing our XenGT patches to the upstream. One
> feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> device model.
>
> Here we may have 2 similar solutions:
> 1> Paul told me(and thank you, Paul :)) that there used to be a
> hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no
> usage at that time. So solution 1 is to revert this commit. However,
> since this hypercall was removed ages ago, the reverting met many
> conflicts, i.e. the gmfn_to_mfn is no longer used in x86, etc.
>
> 2> In our project, we defined a new hypercall
> XENMEM_get_mfn_from_pfn, which has a similar implementation like the
> previous XENMEM_translate_gpfn_list. One of the major differences is
> that this newly defined one is only for x86(called in arch_memory_op),
> so we do not have to worry about the arm side.
>
> Does anyone has any suggestions about this?

Out of the two 1 seems preferable. But without background (see also
Paul's reply) it's hard to tell whether that's what you want/need.

Jan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 10:10 One question about the hypercall to translate gfn to mfn Yu, Zhang
2014-12-09 10:19 ` Paul Durrant
2014-12-09 10:38 ` Jan Beulich
@ 2014-12-09 10:46 ` Tim Deegan
2014-12-09 11:05 ` Paul Durrant
2014-12-10 1:14 ` Tian, Kevin
2 siblings, 2 replies; 59+ messages in thread
From: Tim Deegan @ 2014-12-09 10:46 UTC (permalink / raw)
To: Yu, Zhang; +Cc: kevin.tian, Paul.Durrant, keir, JBeulich, Xen-devel
At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> Hi all,
>
> As you can see, we are pushing our XenGT patches to the upstream. One
> feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> device model.
>
> Here we may have 2 similar solutions:
> 1> Paul told me(and thank you, Paul :)) that there used to be a
> hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no
> usage at that time.

It's been suggested before that we should revive this hypercall, and I
don't think it's a good idea. Whenever a domain needs to know the
actual MFN of another domain's memory it's usually because the
security model is problematic. In particular, finding the MFN is
usually followed by a brute-force mapping from a dom0 process, or by
passing the MFN to a device for unprotected DMA.

These days DMA access should be protected by IOMMUs, or else
the device drivers (and associated tools) are effectively inside the
hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and
presumably present on anything new enough to run XenGT?).

So I think the interface we need here is a please-map-this-gfn one,
like the existing grant-table ops (which already do what you need by
returning an address suitable for DMA). If adding a grant entry for
every frame of the framebuffer within the guest is too much, maybe we
can make a new interface for the guest to grant access to larger areas.

Cheers,

Tim.
^ permalink raw reply [flat|nested] 59+ messages in thread
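One way to picture the "please-map-this-gfn" style of interface suggested above is the following minimal Python sketch. It is a toy model under stated assumptions: ToyHypervisor, map_foreign_gfn, and the "bus frame number" (bfn) naming are hypothetical and do not correspond to any existing Xen hypercall. The point it illustrates is that the caller names a (domain, gfn) pair and gets back an address usable for DMA, while the MFN stays inside the hypervisor.

```python
# Toy model only: not a Xen interface. All names are hypothetical.
class ToyHypervisor:
    def __init__(self, p2m_by_domain):
        self._p2m = p2m_by_domain   # domid -> {gfn: mfn}, never handed out
        self._iommu = {}            # bfn -> mfn for the service domain's device
        self._next_bfn = 0x100000   # arbitrary start of a reserved bus range

    def map_foreign_gfn(self, domid, gfn):
        """Map another domain's gfn for DMA; return a bus frame number."""
        mfn = self._p2m[domid][gfn]  # KeyError if the gfn is not populated
        bfn = self._next_bfn
        self._next_bfn += 1
        self._iommu[bfn] = mfn
        return bfn

    def unmap(self, bfn):
        """Drop a mapping created by map_foreign_gfn."""
        del self._iommu[bfn]

    def device_dma(self, bfn):
        """What the IOMMU does when the device accesses bfn."""
        return self._iommu[bfn]
```

In this model the device model would write the returned bfn into the GPU page table instead of an MFN, so the scheme no longer depends on dom0 having a 1:1 IOMMU table.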
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-09 10:46 ` Tim Deegan @ 2014-12-09 11:05 ` Paul Durrant 2014-12-09 11:11 ` Ian Campbell 2014-12-10 1:14 ` Tian, Kevin 1 sibling, 1 reply; 59+ messages in thread From: Paul Durrant @ 2014-12-09 11:05 UTC (permalink / raw) To: Tim (Xen.org), Yu, Zhang Cc: Kevin Tian, Keir (Xen.org), JBeulich@suse.com, Xen-devel@lists.xen.org > -----Original Message----- > From: Tim Deegan [mailto:tim@xen.org] > Sent: 09 December 2014 10:47 > To: Yu, Zhang > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen- > devel@lists.xen.org > Subject: Re: One question about the hypercall to translate gfn to mfn. > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > Hi all, > > > > As you can see, we are pushing our XenGT patches to the upstream. One > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 > > device model. > > > > Here we may have 2 similar solutions: > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was > no > > usage at that time. > > It's been suggested before that we should revive this hypercall, and I > don't think it's a good idea. Whenever a domain needs to know the > actual MFN of another domain's memory it's usually because the > security model is problematic. In particular, finding the MFN is > usually followed by a brute-force mapping from a dom0 process, or by > passing the MFN to a device for unprotected DMA. > > These days DMA access should be protected by IOMMUs, or else > the device drivers (and associated tools) are effectively inside the > hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and > presumably present on anything new enough to run XenGT?). 
> > So I think the interface we need here is a please-map-this-gfn one, > like the existing grant-table ops (which already do what you need by > returning an address suitable for DMA). If adding a grant entry for > every frame of the framebuffer within the guest is too much, maybe we > can make a new interface for the guest to grant access to larger areas. > IIUC the in-guest driver is Xen-unaware so any grant entry would have to be put in the guests table by the tools, which would entail some form of flexibly sized reserved range of grant entries otherwise any PV driver that are present in the guest would merrily clobber the new grant entries. A domain can already priv map a gfn into the MMU, so I think we just need an equivalent for the IOMMU. Paul > Cheers, > > Tim. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-09 11:05 ` Paul Durrant @ 2014-12-09 11:11 ` Ian Campbell 2014-12-09 11:17 ` Paul Durrant 0 siblings, 1 reply; 59+ messages in thread From: Ian Campbell @ 2014-12-09 11:11 UTC (permalink / raw) To: Paul Durrant Cc: Kevin Tian, Keir (Xen.org), Tim (Xen.org), Xen-devel@lists.xen.org, Yu, Zhang, JBeulich@suse.com On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote: > > -----Original Message----- > > From: Tim Deegan [mailto:tim@xen.org] > > Sent: 09 December 2014 10:47 > > To: Yu, Zhang > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen- > > devel@lists.xen.org > > Subject: Re: One question about the hypercall to translate gfn to mfn. > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > > Hi all, > > > > > > As you can see, we are pushing our XenGT patches to the upstream. One > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 > > > device model. > > > > > > Here we may have 2 similar solutions: > > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was > > no > > > usage at that time. > > > > It's been suggested before that we should revive this hypercall, and I > > don't think it's a good idea. Whenever a domain needs to know the > > actual MFN of another domain's memory it's usually because the > > security model is problematic. In particular, finding the MFN is > > usually followed by a brute-force mapping from a dom0 process, or by > > passing the MFN to a device for unprotected DMA. > > > > These days DMA access should be protected by IOMMUs, or else > > the device drivers (and associated tools) are effectively inside the > > hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and > > presumably present on anything new enough to run XenGT?). 
> > > > So I think the interface we need here is a please-map-this-gfn one, > > like the existing grant-table ops (which already do what you need by > > returning an address suitable for DMA). If adding a grant entry for > > every frame of the framebuffer within the guest is too much, maybe we > > can make a new interface for the guest to grant access to larger areas. > > > > IIUC the in-guest driver is Xen-unaware so any grant entry would have > to be put in the guests table by the tools, which would entail some > form of flexibly sized reserved range of grant entries otherwise any > PV driver that are present in the guest would merrily clobber the new > grant entries. > A domain can already priv map a gfn into the MMU, so I think we just > need an equivalent for the IOMMU. I'm not sure I'm fully understanding what's going on here, but is a variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign which also returns a DMA handle a plausible solution? Ian. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-09 11:11 ` Ian Campbell @ 2014-12-09 11:17 ` Paul Durrant 2014-12-09 11:23 ` Jan Beulich 2014-12-09 11:29 ` Ian Campbell 0 siblings, 2 replies; 59+ messages in thread From: Paul Durrant @ 2014-12-09 11:17 UTC (permalink / raw) To: Ian Campbell Cc: Kevin Tian, Keir (Xen.org), Tim (Xen.org), Xen-devel@lists.xen.org, Yu, Zhang, JBeulich@suse.com > -----Original Message----- > From: Ian Campbell > Sent: 09 December 2014 11:11 > To: Paul Durrant > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); JBeulich@suse.com; > Xen-devel@lists.xen.org > Subject: Re: [Xen-devel] One question about the hypercall to translate gfn to > mfn. > > On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote: > > > -----Original Message----- > > > From: Tim Deegan [mailto:tim@xen.org] > > > Sent: 09 December 2014 10:47 > > > To: Yu, Zhang > > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen- > > > devel@lists.xen.org > > > Subject: Re: One question about the hypercall to translate gfn to mfn. > > > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > > > Hi all, > > > > > > > > As you can see, we are pushing our XenGT patches to the upstream. > One > > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 > > > > device model. > > > > > > > > Here we may have 2 similar solutions: > > > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there > was > > > no > > > > usage at that time. > > > > > > It's been suggested before that we should revive this hypercall, and I > > > don't think it's a good idea. Whenever a domain needs to know the > > > actual MFN of another domain's memory it's usually because the > > > security model is problematic. 
In particular, finding the MFN is > > > usually followed by a brute-force mapping from a dom0 process, or by > > > passing the MFN to a device for unprotected DMA. > > > > > > These days DMA access should be protected by IOMMUs, or else > > > the device drivers (and associated tools) are effectively inside the > > > hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and > > > presumably present on anything new enough to run XenGT?). > > > > > > So I think the interface we need here is a please-map-this-gfn one, > > > like the existing grant-table ops (which already do what you need by > > > returning an address suitable for DMA). If adding a grant entry for > > > every frame of the framebuffer within the guest is too much, maybe we > > > can make a new interface for the guest to grant access to larger areas. > > > > > > > IIUC the in-guest driver is Xen-unaware so any grant entry would have > > to be put in the guests table by the tools, which would entail some > > form of flexibly sized reserved range of grant entries otherwise any > > PV driver that are present in the guest would merrily clobber the new > > grant entries. > > A domain can already priv map a gfn into the MMU, so I think we just > > need an equivalent for the IOMMU. > > I'm not sure I'm fully understanding what's going on here, but is a > variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign which > also > returns a DMA handle a plausible solution? > I think we want be able to avoid setting up a PTE in the MMU since it's not needed in most (or perhaps all?) cases. Paul > Ian. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 11:17 ` Paul Durrant
@ 2014-12-09 11:23 ` Jan Beulich
2014-12-09 11:28 ` Malcolm Crossley
2014-12-09 11:29 ` Ian Campbell
1 sibling, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-09 11:23 UTC (permalink / raw)
To: Paul Durrant
Cc: Kevin Tian, Keir (Xen.org), Ian Campbell, Tim (Xen.org), Xen-devel@lists.xen.org, Zhang Yu
>>> On 09.12.14 at 12:17, <Paul.Durrant@citrix.com> wrote:
> I think we want be able to avoid setting up a PTE in the MMU since it's not
> needed in most (or perhaps all?) cases.

With shared page tables, there's no way to do one without the other.

Jan
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-09 11:23 ` Jan Beulich
@ 2014-12-09 11:28 ` Malcolm Crossley
0 siblings, 0 replies; 59+ messages in thread
From: Malcolm Crossley @ 2014-12-09 11:28 UTC (permalink / raw)
To: xen-devel
On 09/12/14 11:23, Jan Beulich wrote:
>>>> On 09.12.14 at 12:17, <Paul.Durrant@citrix.com> wrote:
>> I think we want be able to avoid setting up a PTE in the MMU since it's not
>> needed in most (or perhaps all?) cases.
>
> With shared page tables, there's no way to do one without the other.
>

Interestingly, the IOMMU in front of the Intel GPU is only capable of
handling 4k pages, so we wouldn't end up with shared page tables
being used. For other PCI devices, shared page tables will be a
problem.

Malcolm

> Jan
>
>
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
>
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-09 11:17 ` Paul Durrant 2014-12-09 11:23 ` Jan Beulich @ 2014-12-09 11:29 ` Ian Campbell 2014-12-09 11:43 ` Paul Durrant 1 sibling, 1 reply; 59+ messages in thread From: Ian Campbell @ 2014-12-09 11:29 UTC (permalink / raw) To: Paul Durrant Cc: Kevin Tian, Keir (Xen.org), Tim (Xen.org), Xen-devel@lists.xen.org, Yu, Zhang, JBeulich@suse.com On Tue, 2014-12-09 at 11:17 +0000, Paul Durrant wrote: > > -----Original Message----- > > From: Ian Campbell > > Sent: 09 December 2014 11:11 > > To: Paul Durrant > > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); JBeulich@suse.com; > > Xen-devel@lists.xen.org > > Subject: Re: [Xen-devel] One question about the hypercall to translate gfn to > > mfn. > > > > On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote: > > > > -----Original Message----- > > > > From: Tim Deegan [mailto:tim@xen.org] > > > > Sent: 09 December 2014 10:47 > > > > To: Yu, Zhang > > > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen- > > > > devel@lists.xen.org > > > > Subject: Re: One question about the hypercall to translate gfn to mfn. > > > > > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > > > > Hi all, > > > > > > > > > > As you can see, we are pushing our XenGT patches to the upstream. > > One > > > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 > > > > > device model. > > > > > > > > > > Here we may have 2 similar solutions: > > > > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in > > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there > > was > > > > no > > > > > usage at that time. > > > > > > > > It's been suggested before that we should revive this hypercall, and I > > > > don't think it's a good idea. 
Whenever a domain needs to know the > > > > actual MFN of another domain's memory it's usually because the > > > > security model is problematic. In particular, finding the MFN is > > > > usually followed by a brute-force mapping from a dom0 process, or by > > > > passing the MFN to a device for unprotected DMA. > > > > > > > > These days DMA access should be protected by IOMMUs, or else > > > > the device drivers (and associated tools) are effectively inside the > > > > hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and > > > > presumably present on anything new enough to run XenGT?). > > > > > > > > So I think the interface we need here is a please-map-this-gfn one, > > > > like the existing grant-table ops (which already do what you need by > > > > returning an address suitable for DMA). If adding a grant entry for > > > > every frame of the framebuffer within the guest is too much, maybe we > > > > can make a new interface for the guest to grant access to larger areas. > > > > > > > > > > IIUC the in-guest driver is Xen-unaware so any grant entry would have > > > to be put in the guests table by the tools, which would entail some > > > form of flexibly sized reserved range of grant entries otherwise any > > > PV driver that are present in the guest would merrily clobber the new > > > grant entries. > > > A domain can already priv map a gfn into the MMU, so I think we just > > > need an equivalent for the IOMMU. > > > > I'm not sure I'm fully understanding what's going on here, but is a > > variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign which > > also > > returns a DMA handle a plausible solution? > > > > I think we want be able to avoid setting up a PTE in the MMU since > it's not needed in most (or perhaps all?) cases. 
Another (wildly under-informed) thought then:

A while back Global logic proposed (for ARM) an infrastructure for
allowing dom0 drivers to maintain a set of iommu like pagetables under
hypervisor supervision (they called these "remoteprocessor iommu").

I didn't fully grok what it was at the time, let alone remember the
details properly now, but AIUI it was essentially a framework for
allowing a simple Xen side driver to provide PV-MMU-like update
operations for a set of PTs which were not the main-processor's PTs,
with validation etc.

See http://thread.gmane.org/gmane.comp.emulators.xen.devel/212945

The introductory email even mentions GPUs...

Ian.
^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-09 11:29 ` Ian Campbell @ 2014-12-09 11:43 ` Paul Durrant 2014-12-10 1:48 ` Tian, Kevin 0 siblings, 1 reply; 59+ messages in thread From: Paul Durrant @ 2014-12-09 11:43 UTC (permalink / raw) To: Ian Campbell Cc: Kevin Tian, Keir (Xen.org), Tim (Xen.org), Xen-devel@lists.xen.org, Yu, Zhang, JBeulich@suse.com > -----Original Message----- > From: Ian Campbell > Sent: 09 December 2014 11:29 > To: Paul Durrant > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); JBeulich@suse.com; > Xen-devel@lists.xen.org > Subject: Re: [Xen-devel] One question about the hypercall to translate gfn to > mfn. > > On Tue, 2014-12-09 at 11:17 +0000, Paul Durrant wrote: > > > -----Original Message----- > > > From: Ian Campbell > > > Sent: 09 December 2014 11:11 > > > To: Paul Durrant > > > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); > JBeulich@suse.com; > > > Xen-devel@lists.xen.org > > > Subject: Re: [Xen-devel] One question about the hypercall to translate > gfn to > > > mfn. > > > > > > On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote: > > > > > -----Original Message----- > > > > > From: Tim Deegan [mailto:tim@xen.org] > > > > > Sent: 09 December 2014 10:47 > > > > > To: Yu, Zhang > > > > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen- > > > > > devel@lists.xen.org > > > > > Subject: Re: One question about the hypercall to translate gfn to mfn. > > > > > > > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > > > > > Hi all, > > > > > > > > > > > > As you can see, we are pushing our XenGT patches to the > upstream. > > > One > > > > > > feature we need in xen is to translate guests' gfn to mfn in XenGT > dom0 > > > > > > device model. 
> > > > > > > > > > > > Here we may have 2 similar solutions: > > > > > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > > > > > hypercall, XENMEM_translate_gpfn_list, which was removed by > Keir in > > > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because > there > > > was > > > > > no > > > > > > usage at that time. > > > > > > > > > > It's been suggested before that we should revive this hypercall, and I > > > > > don't think it's a good idea. Whenever a domain needs to know the > > > > > actual MFN of another domain's memory it's usually because the > > > > > security model is problematic. In particular, finding the MFN is > > > > > usually followed by a brute-force mapping from a dom0 process, or by > > > > > passing the MFN to a device for unprotected DMA. > > > > > > > > > > These days DMA access should be protected by IOMMUs, or else > > > > > the device drivers (and associated tools) are effectively inside the > > > > > hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and > > > > > presumably present on anything new enough to run XenGT?). > > > > > > > > > > So I think the interface we need here is a please-map-this-gfn one, > > > > > like the existing grant-table ops (which already do what you need by > > > > > returning an address suitable for DMA). If adding a grant entry for > > > > > every frame of the framebuffer within the guest is too much, maybe > we > > > > > can make a new interface for the guest to grant access to larger areas. > > > > > > > > > > > > > IIUC the in-guest driver is Xen-unaware so any grant entry would have > > > > to be put in the guests table by the tools, which would entail some > > > > form of flexibly sized reserved range of grant entries otherwise any > > > > PV driver that are present in the guest would merrily clobber the new > > > > grant entries. > > > > A domain can already priv map a gfn into the MMU, so I think we just > > > > need an equivalent for the IOMMU. 
> > > > > > I'm not sure I'm fully understanding what's going on here, but is a > > > variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign > which > > > also > > > returns a DMA handle a plausible solution? > > > > > > > I think we want be able to avoid setting up a PTE in the MMU since > > it's not needed in most (or perhaps all?) cases. > > Another (wildly under-informed) thought then: > > A while back Global logic proposed (for ARM) an infrastructure for > allowing dom0 drivers to maintain a set of iommu like pagetables under > hypervisor supervision (they called these "remoteprocessor iommu"). > > I didn't fully grok what it was at the time, let alone remember the > details properly now, but AIUI it was essentially a framework for > allowing a simple Xen side driver to provide PV-MMU-like update > operations for a set of PTs which were not the main-processor's PTs, > with validation etc. > > See http://thread.gmane.org/gmane.comp.emulators.xen.devel/212945 > > The introductory email even mentions GPUs... > That series does indeed seem to be very relevant. Paul > Ian. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-09 11:43 ` Paul Durrant @ 2014-12-10 1:48 ` Tian, Kevin 2014-12-10 10:11 ` Ian Campbell 0 siblings, 1 reply; 59+ messages in thread From: Tian, Kevin @ 2014-12-10 1:48 UTC (permalink / raw) To: Paul Durrant, Ian Campbell Cc: Yu, Zhang, Keir (Xen.org), Tim (Xen.org), JBeulich@suse.com, Xen-devel@lists.xen.org > From: Paul Durrant [mailto:Paul.Durrant@citrix.com] > Sent: Tuesday, December 09, 2014 7:44 PM > > > -----Original Message----- > > From: Ian Campbell > > Sent: 09 December 2014 11:29 > > To: Paul Durrant > > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); JBeulich@suse.com; > > Xen-devel@lists.xen.org > > Subject: Re: [Xen-devel] One question about the hypercall to translate gfn to > > mfn. > > > > On Tue, 2014-12-09 at 11:17 +0000, Paul Durrant wrote: > > > > -----Original Message----- > > > > From: Ian Campbell > > > > Sent: 09 December 2014 11:11 > > > > To: Paul Durrant > > > > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); > > JBeulich@suse.com; > > > > Xen-devel@lists.xen.org > > > > Subject: Re: [Xen-devel] One question about the hypercall to translate > > gfn to > > > > mfn. > > > > > > > > On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote: > > > > > > -----Original Message----- > > > > > > From: Tim Deegan [mailto:tim@xen.org] > > > > > > Sent: 09 December 2014 10:47 > > > > > > To: Yu, Zhang > > > > > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen- > > > > > > devel@lists.xen.org > > > > > > Subject: Re: One question about the hypercall to translate gfn to mfn. > > > > > > > > > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > > > > > > Hi all, > > > > > > > > > > > > > > As you can see, we are pushing our XenGT patches to the > > upstream. > > > > One > > > > > > > feature we need in xen is to translate guests' gfn to mfn in XenGT > > dom0 > > > > > > > device model. 
> > > > > > > > > > > > > > Here we may have 2 similar solutions: > > > > > > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > > > > > > hypercall, XENMEM_translate_gpfn_list, which was removed by > > Keir in > > > > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because > > there > > > > was > > > > > > no > > > > > > > usage at that time. > > > > > > > > > > > > It's been suggested before that we should revive this hypercall, and I > > > > > > don't think it's a good idea. Whenever a domain needs to know the > > > > > > actual MFN of another domain's memory it's usually because the > > > > > > security model is problematic. In particular, finding the MFN is > > > > > > usually followed by a brute-force mapping from a dom0 process, or > by > > > > > > passing the MFN to a device for unprotected DMA. > > > > > > > > > > > > These days DMA access should be protected by IOMMUs, or else > > > > > > the device drivers (and associated tools) are effectively inside the > > > > > > hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and > > > > > > presumably present on anything new enough to run XenGT?). > > > > > > > > > > > > So I think the interface we need here is a please-map-this-gfn one, > > > > > > like the existing grant-table ops (which already do what you need by > > > > > > returning an address suitable for DMA). If adding a grant entry for > > > > > > every frame of the framebuffer within the guest is too much, maybe > > we > > > > > > can make a new interface for the guest to grant access to larger > areas. > > > > > > > > > > > > > > > > IIUC the in-guest driver is Xen-unaware so any grant entry would have > > > > > to be put in the guests table by the tools, which would entail some > > > > > form of flexibly sized reserved range of grant entries otherwise any > > > > > PV driver that are present in the guest would merrily clobber the new > > > > > grant entries. 
> > > > > A domain can already priv map a gfn into the MMU, so I think we just
> > > > > need an equivalent for the IOMMU.
> > > >
> > > > I'm not sure I'm fully understanding what's going on here, but is a
> > > > variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign which also
> > > > returns a DMA handle a plausible solution?
> > > >
> > >
> > > I think we want be able to avoid setting up a PTE in the MMU since
> > > it's not needed in most (or perhaps all?) cases.
> >
> > Another (wildly under-informed) thought then:
> >
> > A while back Global logic proposed (for ARM) an infrastructure for
> > allowing dom0 drivers to maintain a set of iommu like pagetables under
> > hypervisor supervision (they called these "remoteprocessor iommu").
> >
> > I didn't fully grok what it was at the time, let alone remember the
> > details properly now, but AIUI it was essentially a framework for
> > allowing a simple Xen side driver to provide PV-MMU-like update
> > operations for a set of PTs which were not the main-processor's PTs,
> > with validation etc.
> >
> > See http://thread.gmane.org/gmane.comp.emulators.xen.devel/212945
> >
> > The introductory email even mentions GPUs...
> >
> >
> That series does indeed seem to be very relevant.
>
> Paul

I'm not familiar with the Arm architecture, but based on a brief reading
it's for the assigned case where the MMU is exclusively owned by a VM, so
some type of MMU virtualization is required and it's straightforward.

However, XenGT is a shared GPU usage:

- A global GPU page table is partitioned among VMs. A shared shadow
  global page table is maintained, containing translations for multiple
  VMs simultaneously based on partitioning information.
- Multiple per-process GPU page tables are created by each VM, and
  multiple shadow per-process GPU page tables are created correspondingly.
  The shadow page table is switched when doing a GPU context switch, the
  same as what we did for the CPU shadow page table.

So you can see that the above shared MMU virtualization usage is very GPU
specific. That's why we didn't put it in the Xen hypervisor, and thus an
additional interface is required to get the p2m mapping to assist our
shadow GPU page table usage.

Thanks
Kevin
^ permalink raw reply [flat|nested] 59+ messages in thread
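The two kinds of shadow table described above can be modelled roughly as follows. This is a minimal Python sketch under stated assumptions: the class ToySharedGpuMmu, its method names, and the use of index ranges for partitions are all hypothetical, and the real XenGT device model is not structured this way as far as this thread shows. It only illustrates a partitioned global table shared by all VMs, plus per-process shadow tables that are swapped on a GPU context switch.

```python
# Toy model of the shared-GPU shadowing scheme. All names are hypothetical.
class ToySharedGpuMmu:
    def __init__(self, partitions, p2m_by_vm):
        self.partitions = partitions      # vm -> range of global entry indices
        self.p2m_by_vm = p2m_by_vm        # vm -> {gfn: mfn}
        self.global_shadow = {}           # index -> mfn, shared by all VMs
        self.per_process_shadow = {}      # (vm, ctx) -> {index: mfn}
        self.active = None                # table the GPU is currently using

    def guest_write_global(self, vm, index, gfn):
        """Shadow a guest write to the global table, inside its partition only."""
        if index not in self.partitions[vm]:
            raise PermissionError("entry is outside this VM's partition")
        self.global_shadow[index] = self.p2m_by_vm[vm][gfn]

    def guest_write_per_process(self, vm, ctx, index, gfn):
        """Shadow a guest write to one of its per-process tables."""
        table = self.per_process_shadow.setdefault((vm, ctx), {})
        table[index] = self.p2m_by_vm[vm][gfn]

    def context_switch(self, vm, ctx):
        """Select the shadow per-process table for the next GPU context."""
        self.active = self.per_process_shadow[(vm, ctx)]
```

Every write path goes through p2m_by_vm, which is the lookup the device model cannot do today without help from the hypervisor.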
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-10 1:48 ` Tian, Kevin @ 2014-12-10 10:11 ` Ian Campbell 2014-12-11 1:50 ` Tian, Kevin 0 siblings, 1 reply; 59+ messages in thread From: Ian Campbell @ 2014-12-10 10:11 UTC (permalink / raw) To: Tian, Kevin Cc: Keir (Xen.org), Tim (Xen.org), Xen-devel@lists.xen.org, Paul Durrant, Yu, Zhang, JBeulich@suse.com On Wed, 2014-12-10 at 01:48 +0000, Tian, Kevin wrote: > I'm not familiar with Arm architecture, but based on a brief reading it's > for the assigned case where the MMU is exclusive owned by a VM, so > some type of MMU virtualization is required and it's straightforward. > However XenGT is a shared GPU usage: > > - a global GPU page table is partitioned among VMs. a shared shadow > global page table is maintained, containing translations for multiple > VMs simultaneously based on partitioning information > - multiple per-process GPU page tables are created by each VM, and > multiple shadow per-process GPU page tables are created correspondingly. > shadow page table is switched when doing GPU context switch, same as > what we did for CPU shadow page table. None of that sounds to me to be impossible to do in the remoteproc model, perhaps it needs some extensions from its initial core feature set but I see no reason why it couldn't maintain multiple sets of page tables, each tagged with an owning domain (for validation purposes) and a mechanism to switch between them, or to be able to manage partitioning of the GPU address space. > So you can see above shared MMU virtualization usage is very GPU > specific, AIUI remoteproc is specific to a particular h/w device too, i.e. there is a device specific stub in the hypervisor which essentially knows how to implement set_pte for that bit of h/w, with appropriate safety and validation, as well as a write_cr3 type operation. 
> that's why we didn't put in Xen hypervisor, and thus additional
> interface is required to get p2m mapping to assist our shadow GPU
> page table usage.

There is a great reluctance among several maintainers to expose real
hardware MFNs to VMs (including dom0 and backend driver domains).

I think you need to think very carefully about possible ways of avoiding
the need for this. Yes, this might require some changes to your current
mode/design.

Ian.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-10 10:11 ` Ian Campbell @ 2014-12-11 1:50 ` Tian, Kevin 0 siblings, 0 replies; 59+ messages in thread From: Tian, Kevin @ 2014-12-11 1:50 UTC (permalink / raw) To: Ian Campbell Cc: Keir (Xen.org), Tim (Xen.org), Xen-devel@lists.xen.org, Paul Durrant, Yu, Zhang, JBeulich@suse.com > From: Ian Campbell [mailto:Ian.Campbell@citrix.com] > Sent: Wednesday, December 10, 2014 6:11 PM > > On Wed, 2014-12-10 at 01:48 +0000, Tian, Kevin wrote: > > I'm not familiar with Arm architecture, but based on a brief reading it's > > for the assigned case where the MMU is exclusive owned by a VM, so > > some type of MMU virtualization is required and it's straightforward. > > > However XenGT is a shared GPU usage: > > > > - a global GPU page table is partitioned among VMs. a shared shadow > > global page table is maintained, containing translations for multiple > > VMs simultaneously based on partitioning information > > - multiple per-process GPU page tables are created by each VM, and > > multiple shadow per-process GPU page tables are created correspondingly. > > shadow page table is switched when doing GPU context switch, same as > > what we did for CPU shadow page table. > > None of that sounds to me to be impossible to do in the remoteproc > model, perhaps it needs some extensions from its initial core feature > set but I see no reason why it couldn't maintain multiple sets of page > tables, each tagged with an owning domain (for validation purposes) and > a mechanism to switch between them, or to be able to manage partitioning > of the GPU address space. here we're talking about multiple GPU page tables on top of a IOMMU page table. Instead of one MMU unit concerned here in remoteproc. > > > So you can see above shared MMU virtualization usage is very GPU > > specific, > > AIUI remoteproc is specific to a particular h/w device too, i.e. 
there > is a device specific stub in the hypervisor which essentially knows how > to implement set_pte for that bit of h/w, with appropriate safety and > validation, as well as a write_cr3 type operation. > > > that's why we didn't put in Xen hypervisor, and thus additional > > interface is required to get p2m mapping to assist our shadow GPU > > page table usage. > > There is a great reluctance among several maintainers to expose real > hardware MFNs to VMs (including dom0 and backend driver domains). > > I think you need to think very carefully about possible ways of avoiding > the need for this. Yes, this might require some changes to your current > mode/design. > We're open to changes if necessary. Thanks, Kevin ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-09 10:46 ` Tim Deegan 2014-12-09 11:05 ` Paul Durrant @ 2014-12-10 1:14 ` Tian, Kevin 2014-12-10 10:36 ` Jan Beulich 2014-12-10 10:55 ` Tim Deegan 1 sibling, 2 replies; 59+ messages in thread From: Tian, Kevin @ 2014-12-10 1:14 UTC (permalink / raw) To: Tim Deegan, Yu, Zhang Cc: Paul.Durrant@citrix.com, keir@xen.org, JBeulich@suse.com, Xen-devel@lists.xen.org > From: Tim Deegan [mailto:tim@xen.org] > Sent: Tuesday, December 09, 2014 6:47 PM > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > Hi all, > > > > As you can see, we are pushing our XenGT patches to the upstream. One > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 > > device model. > > > > Here we may have 2 similar solutions: > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was > no > > usage at that time. > > It's been suggested before that we should revive this hypercall, and I > don't think it's a good idea. Whenever a domain needs to know the > actual MFN of another domain's memory it's usually because the > security model is problematic. In particular, finding the MFN is > usually followed by a brute-force mapping from a dom0 process, or by > passing the MFN to a device for unprotected DMA. In our case it's not because the security model is problematic. It's because GPU virtualization is done in Dom0 while the memory virtualization is done in hypervisor. We need a means to query GPFN->MFN so we can setup shadow GPU page table in Dom0 correctly, for a VM. > > These days DMA access should be protected by IOMMUs, or else > the device drivers (and associated tools) are effectively inside the > hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and > presumably present on anything new enough to run XenGT?). 
yes, IOMMU protect DMA accesses in a device-agnostic way. But in our case, IOMMU can't be used because it's only for exclusively assigned case, as I replied in another mail. And to reduce the hypervisor TCB, we put device model in Dom0 which is why a interface is required to connect p2m information. > > So I think the interface we need here is a please-map-this-gfn one, > like the existing grant-table ops (which already do what you need by > returning an address suitable for DMA). If adding a grant entry for > every frame of the framebuffer within the guest is too much, maybe we > can make a new interface for the guest to grant access to larger areas. A please-map-this-gfn interface assumes the logic behind lies in Xen hypervisor, e.g. managing CPU page table or IOMMU entry. However here the management of GPU page table is in Dom0, and what we want is a please-tell-me-mfn-for-a-gpfn interface, so we can translate from gpfn in guest GPU PTE to a mfn in shadow GPU PTE. Hope this makes the requirement clearer. > > Cheers, > > Tim. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-10 1:14 ` Tian, Kevin @ 2014-12-10 10:36 ` Jan Beulich 2014-12-11 1:45 ` Tian, Kevin 2014-12-10 10:55 ` Tim Deegan 1 sibling, 1 reply; 59+ messages in thread From: Jan Beulich @ 2014-12-10 10:36 UTC (permalink / raw) To: Kevin Tian Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org >>> On 10.12.14 at 02:14, <kevin.tian@intel.com> wrote: >> From: Tim Deegan [mailto:tim@xen.org] >> It's been suggested before that we should revive this hypercall, and I >> don't think it's a good idea. Whenever a domain needs to know the >> actual MFN of another domain's memory it's usually because the >> security model is problematic. In particular, finding the MFN is >> usually followed by a brute-force mapping from a dom0 process, or by >> passing the MFN to a device for unprotected DMA. > > In our case it's not because the security model is problematic. It's > because GPU virtualization is done in Dom0 while the memory virtualization > is done in hypervisor. Which by itself is a questionable design decision. > We need a means to query GPFN->MFN so we can > setup shadow GPU page table in Dom0 correctly, for a VM. > >> >> These days DMA access should be protected by IOMMUs, or else >> the device drivers (and associated tools) are effectively inside the >> hypervisor's TCB. Luckily on x86 IOMMUs are widely available (and >> presumably present on anything new enough to run XenGT?). > > yes, IOMMU protect DMA accesses in a device-agnostic way. But in > our case, IOMMU can't be used because it's only for exclusively > assigned case, as I replied in another mail. And to reduce the hypervisor > TCB, we put device model in Dom0 which is why a interface is required > to connect p2m information. > >> >> So I think the interface we need here is a please-map-this-gfn one, >> like the existing grant-table ops (which already do what you need by >> returning an address suitable for DMA). 
If adding a grant entry for >> every frame of the framebuffer within the guest is too much, maybe we >> can make a new interface for the guest to grant access to larger areas. > > A please-map-this-gfn interface assumes the logic behind lies in Xen > hypervisor, e.g. managing CPU page table or IOMMU entry. However > here the management of GPU page table is in Dom0, and what we > want is a please-tell-me-mfn-for-a-gpfn interface, so we can translate > from gpfn in guest GPU PTE to a mfn in shadow GPU PTE. As said before, what needs to be put in the GPU PTE depends on what the subsequent IOMMU translation would do to the address. It's not a hard requirement for the IOMMU to pass through all addresses for Dom0, so we have room to isolate things if possible. Jan ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-10 10:36 ` Jan Beulich
@ 2014-12-11 1:45 ` Tian, Kevin
0 siblings, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-11 1:45 UTC (permalink / raw)
To: Jan Beulich
Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org
> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, December 10, 2014 6:36 PM
>
> >>> On 10.12.14 at 02:14, <kevin.tian@intel.com> wrote:
> >> From: Tim Deegan [mailto:tim@xen.org]
> >> It's been suggested before that we should revive this hypercall, and I
> >> don't think it's a good idea. Whenever a domain needs to know the
> >> actual MFN of another domain's memory it's usually because the
> >> security model is problematic. In particular, finding the MFN is
> >> usually followed by a brute-force mapping from a dom0 process, or by
> >> passing the MFN to a device for unprotected DMA.
> >
> > In our case it's not because the security model is problematic. It's
> > because GPU virtualization is done in Dom0 while the memory virtualization
> > is done in hypervisor.
>
> Which by itself is a questionable design decision.
>

I don't think we want to put a ~20K LOC device model in hypervisor.

Thanks
Kevin

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-10 1:14 ` Tian, Kevin 2014-12-10 10:36 ` Jan Beulich @ 2014-12-10 10:55 ` Tim Deegan 2014-12-11 1:41 ` Tian, Kevin 1 sibling, 1 reply; 59+ messages in thread From: Tim Deegan @ 2014-12-10 10:55 UTC (permalink / raw) To: Tian, Kevin Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com, Xen-devel@lists.xen.org At 01:14 +0000 on 10 Dec (1418170461), Tian, Kevin wrote: > > From: Tim Deegan [mailto:tim@xen.org] > > Sent: Tuesday, December 09, 2014 6:47 PM > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > > Hi all, > > > > > > As you can see, we are pushing our XenGT patches to the upstream. One > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 > > > device model. > > > > > > Here we may have 2 similar solutions: > > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was > > no > > > usage at that time. > > > > It's been suggested before that we should revive this hypercall, and I > > don't think it's a good idea. Whenever a domain needs to know the > > actual MFN of another domain's memory it's usually because the > > security model is problematic. In particular, finding the MFN is > > usually followed by a brute-force mapping from a dom0 process, or by > > passing the MFN to a device for unprotected DMA. > > In our case it's not because the security model is problematic. It's > because GPU virtualization is done in Dom0 while the memory virtualization > is done in hypervisor. We need a means to query GPFN->MFN so we can > setup shadow GPU page table in Dom0 correctly, for a VM. I don't think we understand each other. Let me try to explain what I mean. My apologies if this sounds patronising; I'm just trying to be as clear as I can. It is Xen's job to isolate VMs from each other. 
As part of that, Xen uses the MMU, nested paging, and IOMMUs to control access to RAM. Any software component that can pass a raw MFN to hardware breaks that isolation, because Xen has no way of controlling what that component can do (including taking over the hypervisor). This is why I am afraid when developers ask for GFN->MFN translation functions. So if the XenGT model allowed the backend component to (cause the GPU to) perform arbitrary DMA without IOMMU checks, then that component would have complete access to the system and (from a security pov) might as well be running in the hypervisor. That would be very problematic, but AFAICT that's not what's going on. From your reply on the other thread it seems like the GPU is behind the IOMMU, so that's OK. :) When the backend component gets a GFN from the guest, it wants an address that it can give to the GPU for DMA that will map the right memory. That address must be mapped in the IOMMU tables that the GPU will be using, which means the IOMMU tables of the backend domain, IIUC[1]. So the hypercall it needs is not "give me the MFN that matches this GFN" but "please map this GFN into my IOMMU tables". Asking for the MFN will only work if the backend domain's IOMMU tables have an existing 1:1 r/w mapping of all guest RAM, which happens to be the case if the backend component is in dom0 _and_ dom0 is PV _and_ we're not using strict IOMMU tables. Restricting XenGT to work in only those circumstances would be short-sighted, not only because it would mean XenGT could never work as a driver domain, but also because it seems like PVH dom0 is going to be the default at some point. If the existing hypercalls that make IOMMU mappings are not right for XenGT then we can absolutely consider adding some more. But we need to talk about what policy Xen will enforce on the mapping requests. 
If the shared backend is allowed to map any page of any VM, then it
can easily take control of any VM on the host (even though the IOMMU
will prevent it from taking over the hypervisor itself). The
absolute minimum we should allow here is some toolstack-controlled
list of which VMs the XenGT backend is serving, so that it can refuse
to map other VMs' memory (like an extension of IS_PRIV_FOR, which does
this job for Qemu).

I would also strongly advise using privilege separation in the backend
between the GPUPT shadow code (which needs mapping rights and is
trusted to maintain isolation between the VMs that are sharing the
GPU) and the rest of the XenGT backend (which doesn't/isn't). But
that's outside my remit as a hypervisor maintainer so it goes no
further than an "I told you so". :)

Cheers,

Tim.

[1] That is, AIUI this GPU doesn't context-switch which set of IOMMU
    tables it's using for DMA, SR-IOV-style, and that's why you need a
    software component in the first place.

^ permalink raw reply [flat|nested] 59+ messages in thread
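The "toolstack-controlled list of which VMs the XenGT backend is serving"
suggested in the message above can be pictured with a small model. This is
only a sketch: the class and method names (BackendPolicy, assign, revoke,
may_map) are hypothetical and do not correspond to any existing Xen
interface, and it models the policy decision only, not the hypercall path.

```python
class BackendPolicy:
    """Toy model of a per-backend list of client VMs (hypothetical names).

    The toolstack fills in the list; the hypervisor consults it before
    honouring a mapping request from a backend domain.
    """

    def __init__(self):
        # backend domid -> set of client domids it is allowed to map
        self._served = {}

    def assign(self, backend, client):
        """Toolstack: let `backend` serve `client`."""
        self._served.setdefault(backend, set()).add(client)

    def revoke(self, backend, client):
        """Toolstack: stop `backend` serving `client`."""
        self._served.get(backend, set()).discard(client)

    def may_map(self, backend, client):
        """Hypervisor: should a mapping request be honoured?"""
        return client in self._served.get(backend, set())


policy = BackendPolicy()
policy.assign(backend=0, client=5)
print(policy.may_map(0, 5))   # request for a served VM
print(policy.may_map(0, 6))   # request for a VM the backend does not serve
```

Unlike a single "target" field, a set per backend lets one backend domain
serve several client VMs while still being refused access to all others.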
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-10 10:55 ` Tim Deegan @ 2014-12-11 1:41 ` Tian, Kevin 2014-12-11 16:46 ` Tim Deegan 2014-12-11 21:29 ` Tim Deegan 0 siblings, 2 replies; 59+ messages in thread From: Tian, Kevin @ 2014-12-11 1:41 UTC (permalink / raw) To: Tim Deegan Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com, Xen-devel@lists.xen.org > From: Tim Deegan [mailto:tim@xen.org] > Sent: Wednesday, December 10, 2014 6:55 PM > > At 01:14 +0000 on 10 Dec (1418170461), Tian, Kevin wrote: > > > From: Tim Deegan [mailto:tim@xen.org] > > > Sent: Tuesday, December 09, 2014 6:47 PM > > > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote: > > > > Hi all, > > > > > > > > As you can see, we are pushing our XenGT patches to the upstream. > One > > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 > > > > device model. > > > > > > > > Here we may have 2 similar solutions: > > > > 1> Paul told me(and thank you, Paul :)) that there used to be a > > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there > was > > > no > > > > usage at that time. > > > > > > It's been suggested before that we should revive this hypercall, and I > > > don't think it's a good idea. Whenever a domain needs to know the > > > actual MFN of another domain's memory it's usually because the > > > security model is problematic. In particular, finding the MFN is > > > usually followed by a brute-force mapping from a dom0 process, or by > > > passing the MFN to a device for unprotected DMA. > > > > In our case it's not because the security model is problematic. It's > > because GPU virtualization is done in Dom0 while the memory virtualization > > is done in hypervisor. We need a means to query GPFN->MFN so we can > > setup shadow GPU page table in Dom0 correctly, for a VM. > > I don't think we understand each other. 
Let me try to explain what I > mean. My apologies if this sounds patronising; I'm just trying to be > as clear as I can. Thanks for your explanation. This is a very helpful discussion. :-) > > It is Xen's job to isolate VMs from each other. As part of that, Xen > uses the MMU, nested paging, and IOMMUs to control access to RAM. Any > software component that can pass a raw MFN to hardware breaks that > isolation, because Xen has no way of controlling what that component > can do (including taking over the hypervisor). This is why I am > afraid when developers ask for GFN->MFN translation functions. When I agree Xen's job absolutely, the isolation is also required in different layers, regarding to who controls the resource and where the virtualization happens. For example talking about I/O virtualization, Dom0 or driver domain needs to isolate among backend drivers to avoid one backend interfering with another. Xen doesn't know such violation, since it only knows it's Dom0 wants to access a VM's page. btw curious of how worse exposing GFN->MFN translation compared to allowing mapping other VM's GFN? If exposing GFN->MFN is under the same permission control as mapping, would it avoid your worry here? > > So if the XenGT model allowed the backend component to (cause the GPU > to) perform arbitrary DMA without IOMMU checks, then that component > would have complete access to the system and (from a security pov) > might as well be running in the hypervisor. That would be very > problematic, but AFAICT that's not what's going on. From your reply > on the other thread it seems like the GPU is behind the IOMMU, so > that's OK. :) > > When the backend component gets a GFN from the guest, it wants an > address that it can give to the GPU for DMA that will map the right > memory. That address must be mapped in the IOMMU tables that the GPU > will be using, which means the IOMMU tables of the backend domain, > IIUC[1]. 
So the hypercall it needs is not "give me the MFN that matches > this GFN" but "please map this GFN into my IOMMU tables". Here "please map this GFN into my IOMMU tables" actually breaks the IOMMU isolation. IOMMU is designed for serving DMA requests issued by an exclusive VM, so IOMMU page table can restrict that VM's attempts strictly. To map multiple VM's GFNs into one IOMMU table, the 1st thing is to avoid GFN conflictions to make it functional. We thought about this approach previously, e.g. by reserving highest 3 bits of GFN as VMID, so one IOMMU page table can be used to combine multi-VM's page table together. However doing so have two limitations: a) it still requires write-protect guest GPU page table, and maintain a shadow GPU page table by translate from real GFN to pseudo GFN (plus VMID), which doesn't save any engineering effort in the device model part b) it breaks the designed isolation intrinsic of IOMMU. In such case, IOMMU can't isolate multiple VMs by itself, since a DMA request can target any pseudo GFN if valid in the page table. We have to rely on the audit in the backend component in Dom0 to ensure the isolation. So even by using IOMMU, it loses the isolation intention as you described earlier. c) this introduces tricky logic in IOMMU driver to handle such non-standard multiplexed page table style. w/o a SR-IOV implementation (so each VF has its own IOMMU page table), I don't see using IOMMU can help isolation here. > > Asking for the MFN will only work if the backend domain's IOMMU > tables have an existing 1:1 r/w mapping of all guest RAM, which > happens to be the case if the backend component is in dom0 _and_ dom0 > is PV _and_ we're not using strict IOMMU tables. Restricting XenGT to > work in only those circumstances would be short-sighted, not only > because it would mean XenGT could never work as a driver domain, but > also because it seems like PVH dom0 is going to be the default at some > point. 
yes, this is a good feedback we didn't think about before. So far the reason why XenGT can work is because we use default IOMMU setting which set up a 1:1 r/w mapping for all possible RAM, so when GPU hits a MFN thru shadow GPU page table, IOMMU is essentially bypassed. However like you said, if IOMMU page table is restricted to dom0's memory, or is not 1:1 identity mapping, XenGT will be broken. However I don't see a good solution for this, except using multiplexed IOMMU page table aforementioned, which however doesn't look like a sane design to me. > > If the existing hypercalls that make IOMMU mappings are not right for > XenGT then we can absolutely consider adding some more. But we need > to talk about what policy Xen will enforce on the mapping requests. > If the shared backend is allowed to map any page of any VM, then it > can easily take control of any VM on the host (even though the IOMMU > will prevent it from taking over the hypervisor itself). The > absolute minumum we should allow here is some toolstack-controlled > list of which VMs the XenGT backend is serving, so that it can refuse > to map other VMs' memory (like an extension of IS_PRIV_FOR, which does > this job for Qemu). for mapping and accessing other guest's memory, I don't think we need any new interface atop existing ones. Just similar to other backend drivers, we can leverage the same permission control. please note here the requirement of exposing p2m here, is really to setup GPU page table so a guest GPU workload can be directly executed by the GPU. > > I would also strongly advise using privilege separation in the backend > between the GPUPT shadow code (which needs mapping rights and is > trusted to maintain isolation between the VMs that are sharing the > GPU) and the rest of the XenGT backend (which doesn't/isn't). But > that's outside my remit as a hypervisor maintainer so it goes no > further than an "I told you so". 
:)

We're open to suggestions for making our code better, but could you
elaborate a bit on exactly what privilege separation you mean here? :-)

>
> Cheers,
>
> Tim.
>
> [1] That is, AIUI this GPU doesn't context-switch which set of IOMMU
>     tables it's using for DMA, SR-IOV-style, and that's why you need a
>     software component in the first place.

Yes, there's only one IOMMU dedicated to the GPU, and it's impractical to
switch the IOMMU page table given concurrent access to graphics memory
from different VCPUs and different render engines within the GPU.

Thanks
Kevin

^ permalink raw reply [flat|nested] 59+ messages in thread
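The multiplexed IOMMU table idea mentioned in the message above (reserving
the highest 3 bits of the GFN as a VMID) might look roughly like the sketch
below. The 36-bit GFN width and the function names are assumptions made for
illustration only; the thread does not state a width, and nothing here is
taken from the XenGT code.

```python
# Assumed layout: a 36-bit GFN space whose top 3 bits carry the VMID.
GFN_BITS = 36                      # assumption, not stated in the thread
VMID_BITS = 3                      # "highest 3 bits of GFN as VMID"
REAL_GFN_BITS = GFN_BITS - VMID_BITS
REAL_GFN_MASK = (1 << REAL_GFN_BITS) - 1


def to_pseudo_gfn(vmid, gfn):
    """Combine a VMID and a guest's real GFN into one pseudo GFN."""
    if not 0 <= vmid < (1 << VMID_BITS):
        raise ValueError("vmid does not fit in the reserved bits")
    if not 0 <= gfn <= REAL_GFN_MASK:
        raise ValueError("gfn collides with the reserved VMID bits")
    return (vmid << REAL_GFN_BITS) | gfn


def from_pseudo_gfn(pseudo):
    """Split a pseudo GFN back into (vmid, real gfn)."""
    return pseudo >> REAL_GFN_BITS, pseudo & REAL_GFN_MASK


pseudo = to_pseudo_gfn(vmid=5, gfn=0x1234)
print(hex(pseudo), from_pseudo_gfn(pseudo))
```

The encoding only removes GFN collisions between VMs. Any pseudo GFN that
is valid in the shared table remains reachable by any DMA request, which is
limitation (b) in the message: isolation still depends on the backend
auditing what it writes into the shadow GPU page tables.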
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-11 1:41 ` Tian, Kevin @ 2014-12-11 16:46 ` Tim Deegan 2014-12-12 7:24 ` Tian, Kevin 2014-12-11 21:29 ` Tim Deegan 1 sibling, 1 reply; 59+ messages in thread From: Tim Deegan @ 2014-12-11 16:46 UTC (permalink / raw) To: Tian, Kevin Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com, Xen-devel@lists.xen.org Hi, At 01:41 +0000 on 11 Dec (1418258504), Tian, Kevin wrote: > > From: Tim Deegan [mailto:tim@xen.org] > > It is Xen's job to isolate VMs from each other. As part of that, Xen > > uses the MMU, nested paging, and IOMMUs to control access to RAM. Any > > software component that can pass a raw MFN to hardware breaks that > > isolation, because Xen has no way of controlling what that component > > can do (including taking over the hypervisor). This is why I am > > afraid when developers ask for GFN->MFN translation functions. > > When I agree Xen's job absolutely, the isolation is also required in different > layers, regarding to who controls the resource and where the virtualization > happens. For example talking about I/O virtualization, Dom0 or driver domain > needs to isolate among backend drivers to avoid one backend interfering > with another. Xen doesn't know such violation, since it only knows it's Dom0 > wants to access a VM's page. I'm going to write second reply to this mail in a bit, to talk about this kind of system-level design. In this email I'll just talk about the practical aspects of interfaces and address spaces and IOMMUs. > btw curious of how worse exposing GFN->MFN translation compared to > allowing mapping other VM's GFN? If exposing GFN->MFN is under the > same permission control as mapping, would it avoid your worry here? I'm afraid not. There's nothing worrying per se in a backend knowing the MFNs of the pages -- the worry is that the backend can pass the MFNs to hardware. 
If the check happens only at lookup time, then XenGT
can (either through a bug or a security breach) just pass _any_ MFN to
the GPU for DMA.

But even without considering the security aspects, this model has bugs
that may be impossible for XenGT itself to even detect. E.g.:
 1. Guest asks its virtual GPU to DMA to a frame of memory;
 2. XenGT looks up the GFN->MFN mapping;
 3. Guest balloons out the page;
 4. Xen allocates the page to a different guest;
 5. XenGT passes the MFN to the GPU, which DMAs to it.

Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
underlying memory and make sure it doesn't get reallocated until XenGT
is finished with it.

> > When the backend component gets a GFN from the guest, it wants an
> > address that it can give to the GPU for DMA that will map the right
> > memory. That address must be mapped in the IOMMU tables that the GPU
> > will be using, which means the IOMMU tables of the backend domain,
> > IIUC[1]. So the hypercall it needs is not "give me the MFN that matches
> > this GFN" but "please map this GFN into my IOMMU tables".
>
> Here "please map this GFN into my IOMMU tables" actually breaks the
> IOMMU isolation. IOMMU is designed for serving DMA requests issued
> by an exclusive VM, so IOMMU page table can restrict that VM's attempts
> strictly.
>
> To map multiple VM's GFNs into one IOMMU table, the 1st thing is to
> avoid GFN conflictions to make it functional. We thought about this approach
> previously, e.g. by reserving highest 3 bits of GFN as VMID, so one IOMMU
> page table can be used to combine multi-VM's page table together.
However > doing so have two limitations: > > a) it still requires write-protect guest GPU page table, and maintain a shadow > GPU page table by translate from real GFN to pseudo GFN (plus VMID), which > doesn't save any engineering effort in the device model part Yes -- since there's only one IOMMU context for the whole GPU, the XenGT backend still has to audit all GPU commands to maintain isolation between clients. > b) it breaks the designed isolation intrinsic of IOMMU. In such case, IOMMU > can't isolate multiple VMs by itself, since a DMA request can target any > pseudo GFN if valid in the page table. We have to rely on the audit in the > backend component in Dom0 to ensure the isolation. Yep. > c) this introduces tricky logic in IOMMU driver to handle such non-standard > multiplexed page table style. > > w/o a SR-IOV implementation (so each VF has its own IOMMU page table), > I don't see using IOMMU can help isolation here. If I've understood your argument correctly, it basically comes down to "It would be extra work for no benefit, because XenGT still has to do all the work of isolating GPU clients from each other". It's true that XenGT still has to isolate its clients, but there are other benefits. The main one, from my point of view as a Xen maintainer, is that it allows Xen to constrain XenGT itself, in the case where bugs or security breaches mean that XenGT tries to access memory it shouldn't. More about that in my other reply. I'll talk about the rest below. > yes, this is a good feedback we didn't think about before. So far the reason > why XenGT can work is because we use default IOMMU setting which set > up a 1:1 r/w mapping for all possible RAM, so when GPU hits a MFN thru > shadow GPU page table, IOMMU is essentially bypassed. However like > you said, if IOMMU page table is restricted to dom0's memory, or is not > 1:1 identity mapping, XenGT will be broken. 
>
> However I don't see a good solution for this, except using multiplexed
> IOMMU page table aforementioned, which however doesn't look like
> a sane design to me.

Right. AIUI you're talking about having a component, maybe in Xen,
that automatically makes a merged IOMMU table that contains multiple
VMs' p2m tables all at once. I think that we can do something simpler
than that which will have the same effect and also avoid race
conditions like the one I mentioned at the top of the email.

[First some hopefully-helpful diagrams to explain my thinking. I'll
 borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
 addresses that devices issue their DMAs in:

 Here's how the translations work for a HVM guest using HAP:

   CPU    <- Code supplied by the guest
    |
  (VA)
    |
   MMU    <- Pagetables supplied by the guest
    |
  (GFN)
    |
   HAP    <- Guest's P2M, supplied by Xen
    |
  (MFN)
    |
   RAM

 Here's how it looks for a GPU operation using XenGT:

   GPU       <- Code supplied by Guest, audited by XenGT
    |
  (GPU VA)
    |
  GPU-MMU    <- GTTs supplied by XenGT (by shadowing guest ones)
    |
  (GPU BFN)
    |
  IOMMU      <- XenGT backend dom's P2M (for PVH/HVM) or IOMMU tables (for PV)
    |
  (MFN)
    |
   RAM

 OK, on we go...]

Somewhere in the existing XenGT code, XenGT has a guest GFN in its
hand and makes a lookup hypercall to find the MFN. It puts that MFN
into the GTTs that it passes to the GPU. But an MFN is not actually
what it needs here -- it needs a GPU BFN, which the IOMMU will then
turn into an MFN for it.

If we replace that lookup with a _map_ hypercall, either with Xen
choosing the BFN (as happens in the PV grant map operation) or with
the guest choosing an unused address (as happens in the HVM/PVH
grant map operation), then:
 - the only extra code in XenGT itself is that you need to unmap
   when you change the GTT;
 - Xen can track and control exactly which MFNs XenGT/the GPU can access;
 - running XenGT in a driver domain or PVH dom0 ought to work; and
 - we fix the race condition I described above.
The default policy I'm suggesting is that the XenGT backend domain
should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
which will need a small extension in Xen since at the moment struct
domain has only one "target" field.

BTW, this is the exact analogue of how all other backend and toolstack
operations work -- they request access from Xen to specific pages and
they relinquish it when they are done. In particular:

> for mapping and accessing other guest's memory, I don't think we
> need any new interface atop existing ones. Just similar to other backend
> drivers, we can leverage the same permission control.

I don't think that's right -- other backend drivers use the grant
table mechanism, where the guest explicitly grants access to only the
memory it needs. AIUI you're not suggesting that you'll use that for
XenGT! :)

Right - I hope that made some sense. I'll go get another cup of
coffee and start on that other reply...

Cheers,

Tim.

^ permalink raw reply [flat|nested] 59+ messages in thread
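The difference between a lookup hypercall and a map hypercall, and the
balloon race from the numbered steps in the message above, can be shown with
a toy model. This is a sketch under loud assumptions: ToyXen and all of its
methods are invented for illustration, a single free list stands in for
Xen's allocator, and Xen is assumed to choose the BFN. It is not how any
real Xen interface is implemented.

```python
class ToyXen:
    """Toy hypervisor state; every name here is hypothetical."""

    def __init__(self, free_mfns):
        self.free = list(free_mfns)  # MFNs available for allocation
        self.p2m = {}                # (domid, gfn) -> mfn
        self.owner = {}              # mfn -> domid, None once ballooned out
        self.refs = {}               # mfn -> live backend mappings
        self.iommu = {}              # bfn -> mfn in the backend's IOMMU table
        self._next_bfn = 0x1000

    def populate(self, domid, gfn):
        mfn = self.free.pop(0)
        self.p2m[(domid, gfn)] = mfn
        self.owner[mfn] = domid
        self.refs.setdefault(mfn, 0)
        return mfn

    def lookup_mfn(self, domid, gfn):
        # Lookup style: hands out a raw MFN and takes no reference.
        return self.p2m[(domid, gfn)]

    def map_gfn(self, domid, gfn):
        # Map style: takes a reference and returns a BFN, never the MFN.
        mfn = self.p2m[(domid, gfn)]
        self.refs[mfn] += 1
        bfn = self._next_bfn
        self._next_bfn += 1
        self.iommu[bfn] = mfn
        return bfn

    def unmap_bfn(self, bfn):
        mfn = self.iommu.pop(bfn)
        self.refs[mfn] -= 1
        self._release_if_unused(mfn)

    def balloon_out(self, domid, gfn):
        mfn = self.p2m.pop((domid, gfn))
        self.owner[mfn] = None
        self._release_if_unused(mfn)

    def _release_if_unused(self, mfn):
        # A page returns to the free list only when nobody owns or maps it.
        if self.owner[mfn] is None and self.refs[mfn] == 0:
            self.free.append(mfn)


def lookup_race():
    xen = ToyXen(free_mfns=[100])
    xen.populate(domid=1, gfn=0x10)
    mfn = xen.lookup_mfn(1, 0x10)    # step 2: backend looks up the MFN
    xen.balloon_out(1, 0x10)         # step 3: guest balloons the page out
    xen.populate(domid=2, gfn=0x20)  # step 4: page goes to another guest
    return xen.owner[mfn]            # step 5: whose memory the DMA hits


def map_no_race():
    xen = ToyXen(free_mfns=[100])
    xen.populate(domid=1, gfn=0x10)
    bfn = xen.map_gfn(1, 0x10)       # step 2 as a mapping operation
    xen.balloon_out(1, 0x10)         # page is held, not freed
    held = xen.free == []
    xen.unmap_bfn(bfn)               # backend is done with the GTT entry
    return held, xen.free


print(lookup_race())
print(map_no_race())
```

In the lookup variant the stale MFN ends up pointing at a page owned by a
different guest. In the map variant the page cannot be reallocated until the
backend unmaps it, and the backend only ever holds a BFN.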
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-11 16:46 ` Tim Deegan @ 2014-12-12 7:24 ` Tian, Kevin 2014-12-12 10:54 ` Jan Beulich 2014-12-18 15:46 ` Tim Deegan 0 siblings, 2 replies; 59+ messages in thread From: Tian, Kevin @ 2014-12-12 7:24 UTC (permalink / raw) To: Tim Deegan Cc: Yu, Zhang, Paul.Durrant@citrix.com, keir@xen.org, JBeulich@suse.com, Xen-devel@lists.xen.org > From: Tim Deegan > Sent: Friday, December 12, 2014 12:47 AM > > Hi, > > At 01:41 +0000 on 11 Dec (1418258504), Tian, Kevin wrote: > > > From: Tim Deegan [mailto:tim@xen.org] > > > It is Xen's job to isolate VMs from each other. As part of that, Xen > > > uses the MMU, nested paging, and IOMMUs to control access to RAM. > Any > > > software component that can pass a raw MFN to hardware breaks that > > > isolation, because Xen has no way of controlling what that component > > > can do (including taking over the hypervisor). This is why I am > > > afraid when developers ask for GFN->MFN translation functions. > > > > When I agree Xen's job absolutely, the isolation is also required in different > > layers, regarding to who controls the resource and where the virtualization > > happens. For example talking about I/O virtualization, Dom0 or driver > domain > > needs to isolate among backend drivers to avoid one backend interfering > > with another. Xen doesn't know such violation, since it only knows it's Dom0 > > wants to access a VM's page. > > I'm going to write second reply to this mail in a bit, to talk about > this kind of system-level design. In this email I'll just talk about > the practical aspects of interfaces and address spaces and IOMMUs. sure. I've replied to another design mail before seeing this. my bad outlook rule didn't push this mail to my eye, and fortunately I dig it out when wondering "Hi, again" in your another mail. :-) > > > btw curious of how worse exposing GFN->MFN translation compared to > > allowing mapping other VM's GFN? 
If exposing GFN->MFN is under the > > same permission control as mapping, would it avoid your worry here? > > I'm afraid not. There's nothing worrying per se in a backend knowing > the MFNs of the pages -- the worry is that the backend can pass the > MFNs to hardware. If the check happens only at lookup time, then XenGT > can (either through a bug or a security breach) just pass _any_ MFN to > the GPU for DMA. > > But even without considering the security aspects, this model has bugs > that may be impossible for XenGT itself to even detect. E.g.: > 1. Guest asks its virtual GPU to DMA to a frame of memory; > 2. XenGT looks up the GFN->MFN mapping; > 3. Guest balloons out the page; > 4. Xen allocates the page to a different guest; > 5. XenGT passes the MFN to the GPU, which DMAs to it. > > Whereas if stage 2 is a _mapping_ operation, Xen can refcount the > underlying memory and make sure it doesn't get reallocated until XenGT > is finished with it. yes, I see your point. Now we can't support ballooning in VM given above reason, and refcnt is required to close that gap. but just to confirm one point. from my understanding whether it's a mapping operation doesn't really matter. We can invent an interface to get p2m mapping and then increase refcnt. the key is refcnt here. when XenGT constructs a shadow GPU page table, it creates a reference to guest memory page so the refcnt must be increased. :-) > > > > When the backend component gets a GFN from the guest, it wants an > > > address that it can give to the GPU for DMA that will map the right > > > memory. That address must be mapped in the IOMMU tables that the > GPU > > > will be using, which means the IOMMU tables of the backend domain, > > > IIUC[1]. So the hypercall it needs is not "give me the MFN that matches > > > this GFN" but "please map this GFN into my IOMMU tables". > > > > Here "please map this GFN into my IOMMU tables" actually breaks the > > IOMMU isolation. 
IOMMU is designed for serving DMA requests issued > > by an exclusive VM, so IOMMU page table can restrict that VM's attempts > > strictly. > > > > To map multiple VM's GFNs into one IOMMU table, the 1st thing is to > > avoid GFN conflictions to make it functional. We thought about this approach > > previously, e.g. by reserving highest 3 bits of GFN as VMID, so one IOMMU > > page table can be used to combine multi-VM's page table together. However > > doing so have two limitations: > > > > a) it still requires write-protect guest GPU page table, and maintain a > shadow > > GPU page table by translate from real GFN to pseudo GFN (plus VMID), > which > > doesn't save any engineering effort in the device model part > > Yes -- since there's only one IOMMU context for the whole GPU, the > XenGT backend still has to audit all GPU commands to maintain > isolation between clients. > > > b) it breaks the designed isolation intrinsic of IOMMU. In such case, IOMMU > > can't isolate multiple VMs by itself, since a DMA request can target any > > pseudo GFN if valid in the page table. We have to rely on the audit in the > > backend component in Dom0 to ensure the isolation. > > Yep. > > > c) this introduces tricky logic in IOMMU driver to handle such non-standard > > multiplexed page table style. > > > > w/o a SR-IOV implementation (so each VF has its own IOMMU page table), > > I don't see using IOMMU can help isolation here. > > If I've understood your argument correctly, it basically comes down > to "It would be extra work for no benefit, because XenGT still has to > do all the work of isolating GPU clients from each other". It's true > that XenGT still has to isolate its clients, but there are other > benefits. > > The main one, from my point of view as a Xen maintainer, is that it > allows Xen to constrain XenGT itself, in the case where bugs or > security breaches mean that XenGT tries to access memory it shouldn't. > More about that in my other reply. 
I'll talk about the rest below. > > > yes, this is a good feedback we didn't think about before. So far the reason > > why XenGT can work is because we use default IOMMU setting which set > > up a 1:1 r/w mapping for all possible RAM, so when GPU hits a MFN thru > > shadow GPU page table, IOMMU is essentially bypassed. However like > > you said, if IOMMU page table is restricted to dom0's memory, or is not > > 1:1 identity mapping, XenGT will be broken. > > > > However I don't see a good solution for this, except using multiplexed > > IOMMU page table aforementioned, which however doesn't look like > > a sane design to me. > > Right. AIUI you're talking about having a component, maybe in Xen, > that automatically makes a merged IOMMU table that contains multiple > VMs' p2m tables all at once. I think that we can do something simpler > than that which will have the same effect and also avoid race > conditions like the one I mentioned at the top of the email. > > [First some hopefully-helpful diagrams to explain my thinking. I'll > borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the > addresses that devices issue their DMAs in: what's 'BFN' short for? Bus Frame Number? > > Here's how the translations work for a HVM guest using HAP: > > CPU <- Code supplied by the guest > | > (VA) > | > MMU <- Pagetables supplied by the guest > | > (GFN) > | > HAP <- Guest's P2M, supplied by Xen > | > (MFN) > | > RAM > > Here's how it looks for a GPU operation using XenGT: > > GPU <- Code supplied by Guest, audited by XenGT > | > (GPU VA) > | > GPU-MMU <- GTTs supplied by XenGT (by shadowing guest ones) > | > (GPU BFN) > | > IOMMU <- XenGT backend dom's P2M (for PVH/HVM) or IOMMU > tables (for PV) > | > (MFN) > | > RAM > > OK, on we go...] > > Somewhere in the existing XenGT code, XenGT has a guest GFN in its > hand and makes a lookup hypercall to find the MFN. It puts that MFN > into the GTTs that it passes to the GPU. 
> But an MFN is not actually
> what it needs here -- it needs a GPU BFN, which the IOMMU will then
> turn into an MFN for it.
>
> If we replace that lookup with a _map_ hypercall, either with Xen
> choosing the BFN (as happens in the PV grant map operation) or with
> the guest choosing an unused address (as happens in the HVM/PVH
> grant map operation), then:
> - the only extra code in XenGT itself is that you need to unmap
>   when you change the GTT;
> - Xen can track and control exactly which MFNs XenGT/the GPU can access;
> - running XenGT in a driver domain or PVH dom0 ought to work; and
> - we fix the race condition I described above.

ok, I see your point here. It does sound like a better design to meet
Xen hypervisor's security requirement and can also work with PVH
Dom0 or driver domain. Previously even when we said a MFN is
required, it's actually a BFN due to IOMMU existence, and it works
just because we have a 1:1 identity mapping in-place. And by finding
a BFN

some follow-up thoughts here:

- one extra unmap call will have some performance impact, especially
  for media processing workloads where GPU page table modifications
  are hot. but I suppose this can be optimized with batched requests

- is there an existing _map_ call for this purpose to your knowledge, or
  is a new one required? If the latter, what's the additional logic to be
  implemented there?

- when you say _map_, do you expect this mapped into dom0's virtual
  address space, or just guest physical space?

- how is BFN or unused address (what do you mean by address here?)
  allocated? does it need to be present in guest physical memory at boot
  time, or is it just finding some holes?

- graphics memory size could be large. starting from BDW, there'll
  be a 64bit page table format. Do you see any limitation here on finding
  BFN or address?
> > The default policy I'm suggesting is that the XenGT backend domain > should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs, > which will need a small extension in Xen since at the moment struct > domain has only one "target" field. Is that connection setup by toolstack or by hypervisor today? > > BTW, this is the exact analogue of how all other backend and toolstack > operations work -- they request access from Xen to specific pages and > they relinquish it when they are done. In particular: agree. > > > for mapping and accessing other guest's memory, I don't think we > > need any new interface atop existing ones. Just similar to other backend > > drivers, we can leverage the same permission control. > > I don't think that's right -- other backend drivers use the grant > table mechanism, wher the guest explicitly grants access to only the > memory it needs. AIUI you're not suggesting that you'll use that for > XenGT! :) yes, we're running native graphics driver in VM, not PV driver > > Right - I hope that made some sense. I'll go get another cup of > coffee and start on that other reply... > > Cheers, > Really appreciate your explanation here. It makes lots of sense to me. Thanks Kevin ^ permalink raw reply [flat|nested] 59+ messages in thread
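The lookup-versus-map point discussed in this message (the five-step ballooning race, and the refcount that closes it) can be played out in a small Python model. The class name `ToyXen`, its methods, and the frame numbers are all hypothetical; this is a sketch of the bookkeeping idea only, not of any real hypercall.

```python
class ToyXen:
    """Tiny model of page ownership and mapping refcounts (not Xen code)."""

    def __init__(self, mfns):
        self.free = list(mfns)   # unallocated machine frames
        self.owner = {}          # mfn -> owning domain
        self.p2m = {}            # (domain, gfn) -> mfn
        self.refs = {}           # mfn -> outstanding backend mappings
        self.mapped_by = {}      # backend -> [mfn, ...], kept for cleanup
        self.pending = set()     # ballooned-out frames still referenced

    def allocate(self, dom, gfn):
        if not self.free:
            return None
        mfn = self.free.pop(0)
        self.owner[mfn] = dom
        self.p2m[(dom, gfn)] = mfn
        return mfn

    def lookup(self, dom, gfn):
        # Plain translation: Xen keeps no record that anyone holds the MFN.
        return self.p2m[(dom, gfn)]

    def map(self, backend, dom, gfn):
        # Mapping takes a reference and remembers who holds it.
        mfn = self.p2m[(dom, gfn)]
        self.refs[mfn] = self.refs.get(mfn, 0) + 1
        self.mapped_by.setdefault(backend, []).append(mfn)
        return mfn

    def unmap(self, backend, mfn):
        self.mapped_by[backend].remove(mfn)
        self.refs[mfn] -= 1
        self._maybe_free(mfn)

    def balloon_out(self, dom, gfn):
        mfn = self.p2m.pop((dom, gfn))
        self.pending.add(mfn)
        self._maybe_free(mfn)

    def _maybe_free(self, mfn):
        # A frame returns to the free pool only when nobody references it.
        if mfn in self.pending and self.refs.get(mfn, 0) == 0:
            self.pending.discard(mfn)
            del self.owner[mfn]
            self.free.append(mfn)

    def backend_crashed(self, backend):
        # Possible only because the mappings were recorded per backend.
        for mfn in list(self.mapped_by.get(backend, [])):
            self.unmap(backend, mfn)


# Lookup model: the stale MFN ends up pointing at another guest's page.
xen = ToyXen([100])
xen.allocate("A", 1)
stale = xen.lookup("A", 1)
xen.balloon_out("A", 1)
xen.allocate("B", 7)
dma_victim = xen.owner[stale]

# Map model: the frame cannot be reallocated while the mapping exists.
xen2 = ToyXen([100])
xen2.allocate("A", 1)
held = xen2.map("xengt", "A", 1)
xen2.balloon_out("A", 1)
realloc = xen2.allocate("B", 7)
owner_while_mapped = xen2.owner[held]
xen2.backend_crashed("xengt")
free_after_cleanup = list(xen2.free)
```

In the first scenario the GPU would DMA into a page now owned by guest B. In the second, the allocation for B finds no free frame until the backend's mapping is dropped, either by an explicit unmap or by the crash cleanup.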
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-12 7:24 ` Tian, Kevin @ 2014-12-12 10:54 ` Jan Beulich 2014-12-15 6:25 ` Tian, Kevin 2014-12-18 15:46 ` Tim Deegan 1 sibling, 1 reply; 59+ messages in thread From: Jan Beulich @ 2014-12-12 10:54 UTC (permalink / raw) To: Kevin Tian Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org >>> On 12.12.14 at 08:24, <kevin.tian@intel.com> wrote: > - is there existing _map_ call for this purpose per your knowledge, or > a new one is required? If the latter, what's the additional logic to be > implemented there? I think the answer to this depends on whether you want to use grants. The goal of using the native driver in the guest (mentioned further down) speaks against this, in which case I don't think we have an existing interface. > - when you say _map_, do you expect this mapped into dom0's virtual > address space, or just guest physical space? Iiuc you don't care about the memory to be visible to the CPU, all you need is it being translated by the IOMMU. In which case the input address space for the IOMMU (which is different between PV and PVH) is where this needs to be mapped into. > - how is BFN or unused address (what do you mean by address here?) > allocated? does it need present in guest physical memory at boot time, > or just finding some holes? Fitting this into holes should be fine. > - graphics memory size could be large. starting from BDW, there'll > be 64bit page table format. Do you see any limitation here on finding > BFN or address? I don't think this concern differs much for the different models: As long as you don't want the same underlying memory to be accessible by more than one guest, the address space requirements ought to be the same. Jan ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-12 10:54 ` Jan Beulich @ 2014-12-15 6:25 ` Tian, Kevin 2014-12-15 8:44 ` Jan Beulich 0 siblings, 1 reply; 59+ messages in thread From: Tian, Kevin @ 2014-12-15 6:25 UTC (permalink / raw) To: Jan Beulich Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: Friday, December 12, 2014 6:54 PM > > >>> On 12.12.14 at 08:24, <kevin.tian@intel.com> wrote: > > - is there existing _map_ call for this purpose per your knowledge, or > > a new one is required? If the latter, what's the additional logic to be > > implemented there? > > I think the answer to this depends on whether you want to use > grants. The goal of using the native driver in the guest (mentioned > further down) speaks against this, in which case I don't think we > have an existing interface. yes, grants don't apply here. > > > - when you say _map_, do you expect this mapped into dom0's virtual > > address space, or just guest physical space? > > Iiuc you don't care about the memory to be visible to the CPU, all > you need is it being translated by the IOMMU. In which case the > input address space for the IOMMU (which is different between PV > and PVH) is where this needs to be mapped into. it should be in p2m level, not just in IOMMU. otherwise I'm wondering there'll be tricky issues ahead due to inconsistent mapping between EPT and IOMMU page table (though a specific attributes like r/w may be different from previous split table discussion). another reason here. If we just talk about shadow GPU page table, yes it's used by device only so IOMMU mapping is enough. However we do have several other places where we need to map and access guest memory, e.g. scanning command in a buffer mapped through GPU page table ( currently through remap_domain_mfn_range_in_kernel). > > > - how is BFN or unused address (what do you mean by address here?) 
> > allocated? does it need present in guest physical memory at boot time,
> > or just finding some holes?
>
> Fitting this into holes should be fine.

this is an interesting open question to be discussed further. Here we
need to consider the extreme case, i.e. a 64bit GPU page table can
legitimately use up all the system memory allocated to that VM, and
considering dozens of VMs, it means we need to reserve a very large hole.

I remember some similar cases requiring grabbing some unmapped
pfns (in grant table?). So I wonder whether there's already a clean
interface for such a purpose, or whether we need to add a new one to
allocate unmapped pfns (without conflicting with usages like memory
hotplug)... appreciate any suggestion here.

>
> > - graphics memory size could be large. starting from BDW, there'll
> > be 64bit page table format. Do you see any limitation here on finding
> > BFN or address?
>
> I don't think this concern differs much for the different models: As long
> as you don't want the same underlying memory to be accessible by
> more than one guest, the address space requirements ought to be the
> same.

See above.

Thanks
Kevin
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 6:25 ` Tian, Kevin @ 2014-12-15 8:44 ` Jan Beulich 2014-12-15 9:05 ` Tian, Kevin 0 siblings, 1 reply; 59+ messages in thread From: Jan Beulich @ 2014-12-15 8:44 UTC (permalink / raw) To: Kevin Tian Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org >>> On 15.12.14 at 07:25, <kevin.tian@intel.com> wrote: >> From: Jan Beulich [mailto:JBeulich@suse.com] >> >>> On 12.12.14 at 08:24, <kevin.tian@intel.com> wrote: >> > - how is BFN or unused address (what do you mean by address here?) >> > allocated? does it need present in guest physical memory at boot time, >> > or just finding some holes? >> >> Fitting this into holes should be fine. > > this is an interesting open to be further discussed. Here we need consider > the extreme case, i.e. a 64bit GPU page table can legitimately use up all > the system memory allocates to that VM, and considering dozens of VMs, > it means we need reserve a very large hole. Oh, it's guest RAM you want mapped, not frame buffer space. But still you're never going to have to map more than the total amount of host RAM, and (with Linux) we already assume everything can be mapped through the 1:1 mapping. I.e. the only collision would be with excessive PFN reservations for ballooning purposes. Jan ^ permalink raw reply [flat|nested] 59+ messages in thread
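As a rough worked example of the bound mentioned in this reply (the backend never has to map more than the total host RAM), the arithmetic can be written out in Python. The host size, guest count, and guest sizes are invented numbers, and 4 KiB frames are assumed.

```python
PAGE_SIZE = 4096          # assumed 4 KiB frames
GIB = 1 << 30

def frames_needed(host_ram_bytes, guest_ram_bytes):
    """Upper bound on frames the backend may need to map at once.

    Guests cannot reference more memory than they own, and together
    they cannot own more than the host has, so the bound is the
    smaller of the two totals.
    """
    mappable = min(host_ram_bytes, sum(guest_ram_bytes))
    return mappable // PAGE_SIZE

host = 64 * GIB

# 24 guests of 2 GiB each: 48 GiB of guest RAM in total.
modest = frames_needed(host, [2 * GIB] * 24)

# 24 guests of 4 GiB each would total 96 GiB, so host RAM is the cap.
capped = frames_needed(host, [4 * GIB] * 24)
```

With these assumed numbers the hole would need room for about 12.6 million frames in the first case and about 16.8 million in the second, which is the host-RAM ceiling.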
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 8:44 ` Jan Beulich @ 2014-12-15 9:05 ` Tian, Kevin 2014-12-15 9:22 ` Jan Beulich 0 siblings, 1 reply; 59+ messages in thread From: Tian, Kevin @ 2014-12-15 9:05 UTC (permalink / raw) To: Jan Beulich Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: Monday, December 15, 2014 4:45 PM > > >>> On 15.12.14 at 07:25, <kevin.tian@intel.com> wrote: > >> From: Jan Beulich [mailto:JBeulich@suse.com] > >> >>> On 12.12.14 at 08:24, <kevin.tian@intel.com> wrote: > >> > - how is BFN or unused address (what do you mean by address here?) > >> > allocated? does it need present in guest physical memory at boot time, > >> > or just finding some holes? > >> > >> Fitting this into holes should be fine. > > > > this is an interesting open to be further discussed. Here we need consider > > the extreme case, i.e. a 64bit GPU page table can legitimately use up all > > the system memory allocates to that VM, and considering dozens of VMs, > > it means we need reserve a very large hole. > > Oh, it's guest RAM you want mapped, not frame buffer space. But still > you're never going to have to map more than the total amount of host > RAM, and (with Linux) we already assume everything can be mapped > through the 1:1 mapping. I.e. the only collision would be with excessive > PFN reservations for ballooning purposes. > Intel GPU has graphics memory (or framebuffer) backed through system memory, and we need to walk GPU page table and then map corresponding guest RAM for handling. yes, definitely host RAM is the upper limit, and what I'm concerning here is how to reserve (at boot time) or allocate (on-demand) such large PFN resource, w/o collision with other PFN reservation usage (ballooning should be fine since it's operating existing RAM ranges in dom0 e820 table). 
Maybe we can reserve a big-enough reserved region in dom0's e820 table at boot time, for all PFN reservation usages, and then allocate them on-demand for specific usages? Thanks Kevin ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 9:05 ` Tian, Kevin @ 2014-12-15 9:22 ` Jan Beulich 2014-12-15 11:16 ` Tian, Kevin 2014-12-15 15:22 ` Stefano Stabellini 0 siblings, 2 replies; 59+ messages in thread From: Jan Beulich @ 2014-12-15 9:22 UTC (permalink / raw) To: Kevin Tian Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote: > yes, definitely host RAM is the upper limit, and what I'm concerning here > is how to reserve (at boot time) or allocate (on-demand) such large PFN > resource, w/o collision with other PFN reservation usage (ballooning > should be fine since it's operating existing RAM ranges in dom0 e820 > table). I don't think ballooning is restricted to the regions named RAM in Dom0's E820 table (at least it shouldn't be, and wasn't in the classic Xen kernels). > Maybe we can reserve a big-enough reserved region in dom0's > e820 table at boot time, for all PFN reservation usages, and then allocate > them on-demand for specific usages? What would "big enough" here mean (i.e. how would one determine the needed size up front)? Plus any form of allocation would need a reasonable approach to avoid fragmentation. And anyway I'm not getting what position you're on: Do you expect to be able to fit everything that needs mapping into the available mapping space (as your reply above seems to imply) or do you think there won't be enough mapping space (as earlier replies of yours appeared to indicate)? Jan ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 9:22 ` Jan Beulich @ 2014-12-15 11:16 ` Tian, Kevin 2014-12-15 11:27 ` Jan Beulich 2014-12-15 15:22 ` Stefano Stabellini 1 sibling, 1 reply; 59+ messages in thread From: Tian, Kevin @ 2014-12-15 11:16 UTC (permalink / raw) To: Jan Beulich Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org > From: Jan Beulich [mailto:JBeulich@suse.com] > Sent: Monday, December 15, 2014 5:23 PM > > >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote: > > yes, definitely host RAM is the upper limit, and what I'm concerning here > > is how to reserve (at boot time) or allocate (on-demand) such large PFN > > resource, w/o collision with other PFN reservation usage (ballooning > > should be fine since it's operating existing RAM ranges in dom0 e820 > > table). > > I don't think ballooning is restricted to the regions named RAM in > Dom0's E820 table (at least it shouldn't be, and wasn't in the > classic Xen kernels). well, nice to know that. > > > Maybe we can reserve a big-enough reserved region in dom0's > > e820 table at boot time, for all PFN reservation usages, and then allocate > > them on-demand for specific usages? > > What would "big enough" here mean (i.e. how would one determine > the needed size up front)? Plus any form of allocation would need a > reasonable approach to avoid fragmentation. And anyway I'm not > getting what position you're on: Do you expect to be able to fit > everything that needs mapping into the available mapping space (as > your reply above seems to imply) or do you think there won't be > enough mapping space (as earlier replies of yours appeared to > indicate)? 
>

I expect to have everything mapped into the available mapping space,
and am asking for suggestions on the best way to find and reserve
available PFNs in a way that does not conflict with other usages (either
virtualization features like ballooning that you mentioned, or bare
metal features like PCI hotplug or memory hotplug).

Thanks
Kevin
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 11:16 ` Tian, Kevin @ 2014-12-15 11:27 ` Jan Beulich 0 siblings, 0 replies; 59+ messages in thread From: Jan Beulich @ 2014-12-15 11:27 UTC (permalink / raw) To: Kevin Tian Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu, Xen-devel@lists.xen.org >>> On 15.12.14 at 12:16, <kevin.tian@intel.com> wrote: > I expect to have everything mapped into the available mapping space, > and is asking for suggestions what's the best way to find and reserve > available PFNs in a way not conflicting with other usages (either > virtualization features like ballooning that you mentioned, or bare > metal features like PCI hotplug or memory hotplug). Not conflicting with memory hotplug ought to be technically possible (using SRAT information), but if all physical address space is marked as possibly being used for hotplug memory this wouldn't help your case. PCI hotplug (or even just dynamic resource re-assignment) might be quite a bit more tricky, or would require (as you suggested earlier) to mark certain regions as reserved in the E820 Dom0 receives. Not conflicting with ballooning is - just like memory hotplug - simply dependent on enough space not being used for that purpose. Jan ^ permalink raw reply [flat|nested] 59+ messages in thread
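One way to picture "finding holes" that avoid RAM, reserved regions, and a possible hotplug window is a first-fit search over a sorted list of occupied PFN ranges. The sketch below is Python with a made-up layout; the region list stands in for whatever E820 and SRAT information a real implementation would consult, and the function names are hypothetical.

```python
def find_holes(regions, limit_pfn):
    """Return [start, end) PFN ranges below limit_pfn not covered by any region.

    regions is an iterable of (start_pfn, end_pfn) pairs, end exclusive.
    """
    holes = []
    cursor = 0
    for start, end in sorted(regions):
        if start > cursor:
            holes.append((cursor, min(start, limit_pfn)))
        cursor = max(cursor, end)
        if cursor >= limit_pfn:
            break
    if cursor < limit_pfn:
        holes.append((cursor, limit_pfn))
    return holes

def first_fit(holes, count):
    """Return the first PFN of the first hole with room for count frames."""
    for start, end in holes:
        if end - start >= count:
            return start
    return None

# Invented layout: two RAM ranges, one reserved range, one hotplug window.
occupied = [
    (0x0000, 0x0100),   # RAM
    (0x0100, 0x0180),   # reserved
    (0x0200, 0x0800),   # RAM
    (0x1000, 0x2000),   # window that memory hotplug might use
]
holes = find_holes(occupied, limit_pfn=0x4000)

small = first_fit(holes, 0x400)     # fits in the gap after the second RAM range
large = first_fit(holes, 0x1000)    # only fits above the hotplug window
too_big = first_fit(holes, 0x3000)  # nothing is large enough
```

First fit is chosen here only for brevity. It does nothing about the fragmentation concern raised earlier in the thread, and if the whole physical address space were marked as potential hotplug memory, `find_holes` would return nothing.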
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 9:22 ` Jan Beulich 2014-12-15 11:16 ` Tian, Kevin @ 2014-12-15 15:22 ` Stefano Stabellini 2014-12-15 16:01 ` Jan Beulich 1 sibling, 1 reply; 59+ messages in thread From: Stefano Stabellini @ 2014-12-15 15:22 UTC (permalink / raw) To: Jan Beulich Cc: Kevin Tian, keir@xen.org, Tim Deegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Zhang Yu On Mon, 15 Dec 2014, Jan Beulich wrote: > >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote: > > yes, definitely host RAM is the upper limit, and what I'm concerning here > > is how to reserve (at boot time) or allocate (on-demand) such large PFN > > resource, w/o collision with other PFN reservation usage (ballooning > > should be fine since it's operating existing RAM ranges in dom0 e820 > > table). > > I don't think ballooning is restricted to the regions named RAM in > Dom0's E820 table (at least it shouldn't be, and wasn't in the > classic Xen kernels). Could you please elaborate more on this? It seems counter-intuitive at best. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 15:22 ` Stefano Stabellini @ 2014-12-15 16:01 ` Jan Beulich 2014-12-15 16:15 ` Stefano Stabellini 0 siblings, 1 reply; 59+ messages in thread From: Jan Beulich @ 2014-12-15 16:01 UTC (permalink / raw) To: Stefano Stabellini Cc: Kevin Tian, keir@xen.org, TimDeegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Zhang Yu >>> On 15.12.14 at 16:22, <stefano.stabellini@eu.citrix.com> wrote: > On Mon, 15 Dec 2014, Jan Beulich wrote: >> >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote: >> > yes, definitely host RAM is the upper limit, and what I'm concerning here >> > is how to reserve (at boot time) or allocate (on-demand) such large PFN >> > resource, w/o collision with other PFN reservation usage (ballooning >> > should be fine since it's operating existing RAM ranges in dom0 e820 >> > table). >> >> I don't think ballooning is restricted to the regions named RAM in >> Dom0's E820 table (at least it shouldn't be, and wasn't in the >> classic Xen kernels). > > Could you please elaborate more on this? It seems counter-intuitive at best. I don't see what's counter-intuitive here. How can the hypervisor (Dom0) or tool stack (DomU) know what ballooning intentions a guest kernel may have? It's solely the guest kernel's responsibility to make sure its ballooning activities don't collide with anything else address-wise. Jan ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 16:01 ` Jan Beulich @ 2014-12-15 16:15 ` Stefano Stabellini 2014-12-15 16:28 ` David Vrabel 2014-12-15 16:28 ` Jan Beulich 0 siblings, 2 replies; 59+ messages in thread From: Stefano Stabellini @ 2014-12-15 16:15 UTC (permalink / raw) To: Jan Beulich Cc: Kevin Tian, keir@xen.org, Stefano Stabellini, TimDeegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Zhang Yu On Mon, 15 Dec 2014, Jan Beulich wrote: > >>> On 15.12.14 at 16:22, <stefano.stabellini@eu.citrix.com> wrote: > > On Mon, 15 Dec 2014, Jan Beulich wrote: > >> >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote: > >> > yes, definitely host RAM is the upper limit, and what I'm concerning here > >> > is how to reserve (at boot time) or allocate (on-demand) such large PFN > >> > resource, w/o collision with other PFN reservation usage (ballooning > >> > should be fine since it's operating existing RAM ranges in dom0 e820 > >> > table). > >> > >> I don't think ballooning is restricted to the regions named RAM in > >> Dom0's E820 table (at least it shouldn't be, and wasn't in the > >> classic Xen kernels). > > > > Could you please elaborate more on this? It seems counter-intuitive at best. > > I don't see what's counter-intuitive here. How can the hypervisor > (Dom0) or tool stack (DomU) know what ballooning intentions a > guest kernel may have? The hypervisor checks that the memory the guest is giving back is actually ram, as a consequence the ballooning interface only supports ram. Do you agree? Ballooning is restricted to regions named RAM in the e820 table, because Linux respects e820 in its pfn->mfn mappings. However it is true that respecting the e820 in dom0 is not part of the interface. > It's solely the guest kernel's responsibility > to make sure its ballooning activities don't collide with anything > else address-wise. In the sense that it is in the guest kernel's responsibility to use the interface properly. 
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-15 16:15 ` Stefano Stabellini @ 2014-12-15 16:28 ` David Vrabel 2014-12-15 16:28 ` Jan Beulich 1 sibling, 0 replies; 59+ messages in thread From: David Vrabel @ 2014-12-15 16:28 UTC (permalink / raw) To: Stefano Stabellini, Jan Beulich Cc: Kevin Tian, keir@xen.org, TimDeegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Zhang Yu On 15/12/14 16:15, Stefano Stabellini wrote: > On Mon, 15 Dec 2014, Jan Beulich wrote: >>>>> On 15.12.14 at 16:22, <stefano.stabellini@eu.citrix.com> wrote: >>> On Mon, 15 Dec 2014, Jan Beulich wrote: >>>>>>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote: >>>>> yes, definitely host RAM is the upper limit, and what I'm concerning here >>>>> is how to reserve (at boot time) or allocate (on-demand) such large PFN >>>>> resource, w/o collision with other PFN reservation usage (ballooning >>>>> should be fine since it's operating existing RAM ranges in dom0 e820 >>>>> table). >>>> >>>> I don't think ballooning is restricted to the regions named RAM in >>>> Dom0's E820 table (at least it shouldn't be, and wasn't in the >>>> classic Xen kernels). >>> >>> Could you please elaborate more on this? It seems counter-intuitive at best. >> >> I don't see what's counter-intuitive here. How can the hypervisor >> (Dom0) or tool stack (DomU) know what ballooning intentions a >> guest kernel may have? > > The hypervisor checks that the memory the guest is giving back is > actually ram, as a consequence the ballooning interface only supports > ram. Do you agree? > > Ballooning is restricted to regions named RAM in the e820 table, because > Linux respects e820 in its pfn->mfn mappings. However it is true that > respecting the e820 in dom0 is not part of the interface. Linux will quite happily allow you to add memory outside of the initial e820 RAM regions. The current balloon driver even supports this using the kernel's generic memory hotplug infrastructure. 
David
* Re: One question about the hypercall to translate gfn to mfn.
2014-12-15 16:15 ` Stefano Stabellini
2014-12-15 16:28 ` David Vrabel
@ 2014-12-15 16:28 ` Jan Beulich
1 sibling, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-15 16:28 UTC (permalink / raw)
To: Stefano Stabellini
Cc: Kevin Tian, keir@xen.org, TimDeegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Zhang Yu
>>> On 15.12.14 at 17:15, <stefano.stabellini@eu.citrix.com> wrote:
> On Mon, 15 Dec 2014, Jan Beulich wrote:
>> >>> On 15.12.14 at 16:22, <stefano.stabellini@eu.citrix.com> wrote:
>> > On Mon, 15 Dec 2014, Jan Beulich wrote:
>> >> >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote:
>> >> > yes, definitely host RAM is the upper limit, and what I'm concerning here
>> >> > is how to reserve (at boot time) or allocate (on-demand) such large PFN
>> >> > resource, w/o collision with other PFN reservation usage (ballooning
>> >> > should be fine since it's operating existing RAM ranges in dom0 e820
>> >> > table).
>> >>
>> >> I don't think ballooning is restricted to the regions named RAM in
>> >> Dom0's E820 table (at least it shouldn't be, and wasn't in the
>> >> classic Xen kernels).
>> >
>> > Could you please elaborate more on this? It seems counter-intuitive at best.
>>
>> I don't see what's counter-intuitive here. How can the hypervisor
>> (Dom0) or tool stack (DomU) know what ballooning intentions a
>> guest kernel may have?
>
> The hypervisor checks that the memory the guest is giving back is
> actually ram, as a consequence the ballooning interface only supports
> ram. Do you agree?

Of course.

> Ballooning is restricted to regions named RAM in the e820 table, because
> Linux respects e820 in its pfn->mfn mappings. However it is true that
> respecting the e820 in dom0 is not part of the interface.

Right. Plus the kernel is free to extend the region(s) perceived as
RAM in the E820 it sees (makes up) at boot time.

>> It's solely the guest kernel's responsibility
>> to make sure its ballooning activities don't collide with anything
>> else address-wise.
>
> In the sense that it is in the guest kernel's responsibility to use the
> interface properly.

That's a given for this discussion. The important aspect is that
neither tools nor hypervisor have any influence on how a PV kernel
partitions its PFN space - the only thing they control is the boot
time state thereof.

Jan
* Re: One question about the hypercall to translate gfn to mfn.
From: Tim Deegan @ 2014-12-18 15:46 UTC
To: Tian, Kevin
Cc: keir, Xen-devel, Paul.Durrant, Yu, Zhang, David Vrabel, JBeulich, Malcolm Crossley

Hi,

At 07:24 +0000 on 12 Dec (1418365491), Tian, Kevin wrote:
> > I'm afraid not. There's nothing worrying per se in a backend knowing
> > the MFNs of the pages -- the worry is that the backend can pass the
> > MFNs to hardware. If the check happens only at lookup time, then XenGT
> > can (either through a bug or a security breach) just pass _any_ MFN to
> > the GPU for DMA.
> >
> > But even without considering the security aspects, this model has bugs
> > that may be impossible for XenGT itself to even detect. E.g.:
> > 1. Guest asks its virtual GPU to DMA to a frame of memory;
> > 2. XenGT looks up the GFN->MFN mapping;
> > 3. Guest balloons out the page;
> > 4. Xen allocates the page to a different guest;
> > 5. XenGT passes the MFN to the GPU, which DMAs to it.
> >
> > Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
> > underlying memory and make sure it doesn't get reallocated until XenGT
> > is finished with it.
>
> yes, I see your point. Now we can't support ballooning in VM given above
> reason, and refcnt is required to close that gap.
>
> but just to confirm one point. from my understanding whether it's a
> mapping operation doesn't really matter. We can invent an interface
> to get p2m mapping and then increase refcnt. the key is refcnt here.
> when XenGT constructs a shadow GPU page table, it creates a reference
> to guest memory page so the refcnt must be increased. :-)

True. :) But Xen does need to remember all the refcounts that were
created (so it can tidy up if the domain crashes). If Xen is already
doing that it might as well do it in the IOMMU tables since that
solves other problems.

> > [First some hopefully-helpful diagrams to explain my thinking. I'll
> > borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
> > addresses that devices issue their DMAs in:
>
> what's 'BFN' short for? Bus Frame Number?

Yes, I think so.

> > If we replace that lookup with a _map_ hypercall, either with Xen
> > choosing the BFN (as happens in the PV grant map operation) or with
> > the guest choosing an unused address (as happens in the HVM/PVH
> > grant map operation), then:
> > - the only extra code in XenGT itself is that you need to unmap
> > when you change the GTT;
> > - Xen can track and control exactly which MFNs XenGT/the GPU can access;
> > - running XenGT in a driver domain or PVH dom0 ought to work; and
> > - we fix the race condition I described above.
>
> ok, I see your point here. It does sound like a better design to meet
> Xen hypervisor's security requirement and can also work with PVH
> Dom0 or driver domain. Previously even when we said a MFN is
> required, it's actually a BFN due to IOMMU existence, and it works
> just because we have a 1:1 identity mapping in-place. And by finding
> a BFN
>
> some follow-up think here:
>
> - one extra unmap call will have some performance impact, especially
> for media processing workloads where GPU page table modifications
> are hot. but suppose this can be optimized with batch request

Yep. In general I'd hope that the extra overhead of unmap is small
compared with the trap + emulate + ioreq + schedule that's just
happened. Though I know that IOTLB shootdowns are potentially rather
expensive right now so it might want some measurement.

> - is there existing _map_ call for this purpose per your knowledge, or
> a new one is required? If the latter, what's the additional logic to be
> implemented there?

For PVH, the XENMEM_add_to_physmap (gmfn_foreign) path ought to do
what you need, I think. For PV, I think we probably need a new map
operation with sensible semantics. My inclination would be to have it
follow the grant-map semantics (i.e. caller supplies domid + gfn,
hypervisor supplies BFN and success/failure code).

Malcolm might have opinions about this -- it starts looking like the
sort of PV IOMMU interface he's suggested before.

> - when you say _map_, do you expect this mapped into dom0's virtual
> address space, or just guest physical space?

For PVH, I mean into guest physical address space (and iommu tables,
since those are the same). For PV, I mean just the IOMMU tables --
since the guest controls its own PFN space entirely there's nothing
Xen can do to map things into it.

> - how is BFN or unused address (what do you mean by address here?)
> allocated? does it need present in guest physical memory at boot time,
> or just finding some holes?

That's really a question for the xen maintainers in the linux kernel.
I presume that whatever bookkeeping they currently do for grant-mapped
memory would suffice here just as well.

> - graphics memory size could be large. starting from BDW, there'll
> be 64bit page table format. Do you see any limitation here on finding
> BFN or address?

Not really. The IOMMU tables are also 64-bit so there must be enough
addresses to map all of RAM. There shouldn't be any need for these
mappings to be _contiguous_, btw. You just need to have one free
address for each mapping. Again, following how grant maps work, I'd
imagine that PVH guests will allocate an unused GFN for each mapping
and do enough bookkeeping to make sure they don't clash with other GFN
users (grant mapping, ballooning, &c). PV guests will probably be
given a BFN by the hypervisor at map time (which will be == MFN in
practice) and just needs to pass the same BFN to the unmap call later
(it can store it in the GTT meanwhile).
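
[A rough illustration of the map/unmap idea above, in Python rather
than hypervisor C. It is only a toy model under stated assumptions:
the class ToyXen and its methods map_gfn, unmap_bfn and balloon_out are
made-up names, not existing Xen interfaces, and BFN == MFN is assumed
as in the PV case. The point it shows is that a frame with a live
device mapping is not handed back for reuse until the backend unmaps
it, which closes the lookup/balloon race.]

```python
class ToyXen:
    """Toy model of refcounted device mappings (hypothetical, not Xen code)."""

    def __init__(self):
        self.p2m = {}             # (domid, gfn) -> mfn
        self.refs = {}            # mfn -> number of live device mappings
        self.iommu = set()        # BFNs the device may DMA to (BFN == MFN here)
        self.free_frames = set()  # frames Xen may give to another guest
        self.deferred = set()     # released by the guest but still mapped

    def populate(self, domid, gfn, mfn):
        self.p2m[(domid, gfn)] = mfn

    def map_gfn(self, domid, gfn):
        """Grant-map style: caller supplies domid + gfn, gets a BFN or None."""
        mfn = self.p2m.get((domid, gfn))
        if mfn is None:
            return None
        self.refs[mfn] = self.refs.get(mfn, 0) + 1
        self.iommu.add(mfn)
        return mfn

    def unmap_bfn(self, bfn):
        """Drop one reference; release the frame if the guest already freed it."""
        if self.refs.get(bfn, 0) == 0:
            raise ValueError("BFN is not mapped")
        self.refs[bfn] -= 1
        if self.refs[bfn] == 0:
            del self.refs[bfn]
            self.iommu.discard(bfn)
            if bfn in self.deferred:
                self.deferred.remove(bfn)
                self.free_frames.add(bfn)

    def balloon_out(self, domid, gfn):
        """Guest returns a page; reuse is deferred while a mapping exists."""
        mfn = self.p2m.pop((domid, gfn))
        if mfn in self.refs:
            self.deferred.add(mfn)
        else:
            self.free_frames.add(mfn)
```

[With a plain lookup there would be nothing for balloon_out to check,
so step 4 of the race could hand the frame to another guest while the
GPU still targets it.]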

> > The default policy I'm suggesting is that the XenGT backend domain
> > should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
> > which will need a small extension in Xen since at the moment struct
> > domain has only one "target" field.
>
> Is that connection setup by toolstack or by hypervisor today?

It's set up by the toolstack using XEN_DOMCTL_set_target. Extending
that to something like XEN_DOMCTL_set_target_list would be OK, I
think, along with some sort of lookup call. Or maybe an
add_target/remove_target pair would be easier?

Tim.
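
[A minimal sketch of the add_target/remove_target idea, assuming a
set of targets in place of today's single "target" field. The Domain
class and the three functions are hypothetical stand-ins written in
Python for brevity; they do not describe the real struct domain or
any existing domctl.]

```python
class Domain:
    """Toy stand-in for a domain; not the real struct domain layout."""

    def __init__(self, domid, is_control_domain=False):
        self.domid = domid
        self.is_control_domain = is_control_domain
        self.targets = set()  # domids this domain is privileged over


def add_target(dom, target):
    """Toolstack-side operation: make dom privileged over target."""
    dom.targets.add(target.domid)


def remove_target(dom, target):
    """Toolstack-side operation: revoke that privilege (no-op if absent)."""
    dom.targets.discard(target.domid)


def is_priv_for(dom, target):
    """The lookup call: control domains pass, others need a target entry."""
    return dom.is_control_domain or target.domid in dom.targets
```

[A pair of operations like this avoids passing a whole list on every
change; whether that is actually easier than a set_target_list call
is exactly the open question above.]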
* Re: One question about the hypercall to translate gfn to mfn.
From: Tian, Kevin @ 2015-01-06 8:56 UTC
To: Tim Deegan
Cc: keir@xen.org, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, December 18, 2014 11:47 PM
>
> Hi,
>
> At 07:24 +0000 on 12 Dec (1418365491), Tian, Kevin wrote:
> > > I'm afraid not. There's nothing worrying per se in a backend knowing
> > > the MFNs of the pages -- the worry is that the backend can pass the
> > > MFNs to hardware. If the check happens only at lookup time, then XenGT
> > > can (either through a bug or a security breach) just pass _any_ MFN to
> > > the GPU for DMA.
> > >
> > > But even without considering the security aspects, this model has bugs
> > > that may be impossible for XenGT itself to even detect. E.g.:
> > > 1. Guest asks its virtual GPU to DMA to a frame of memory;
> > > 2. XenGT looks up the GFN->MFN mapping;
> > > 3. Guest balloons out the page;
> > > 4. Xen allocates the page to a different guest;
> > > 5. XenGT passes the MFN to the GPU, which DMAs to it.
> > >
> > > Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
> > > underlying memory and make sure it doesn't get reallocated until XenGT
> > > is finished with it.
> >
> > yes, I see your point. Now we can't support ballooning in VM given above
> > reason, and refcnt is required to close that gap.
> >
> > but just to confirm one point. from my understanding whether it's a
> > mapping operation doesn't really matter. We can invent an interface
> > to get p2m mapping and then increase refcnt. the key is refcnt here.
> > when XenGT constructs a shadow GPU page table, it creates a reference
> > to guest memory page so the refcnt must be increased. :-)
>
> True. :) But Xen does need to remember all the refcounts that were
> created (so it can tidy up if the domain crashes). If Xen is already
> doing that it might as well do it in the IOMMU tables since that
> solves other problems.

would a refcnt in the p2m layer be enough, so we don't need separate
refcnts in both the EPT and IOMMU page tables?

>
> > > [First some hopefully-helpful diagrams to explain my thinking. I'll
> > > borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
> > > addresses that devices issue their DMAs in:
> >
> > what's 'BFN' short for? Bus Frame Number?
>
> Yes, I think so.
>
> > > If we replace that lookup with a _map_ hypercall, either with Xen
> > > choosing the BFN (as happens in the PV grant map operation) or with
> > > the guest choosing an unused address (as happens in the HVM/PVH
> > > grant map operation), then:
> > > - the only extra code in XenGT itself is that you need to unmap
> > > when you change the GTT;
> > > - Xen can track and control exactly which MFNs XenGT/the GPU can
> access;
> > > - running XenGT in a driver domain or PVH dom0 ought to work; and
> > > - we fix the race condition I described above.
> >
> > ok, I see your point here. It does sound like a better design to meet
> > Xen hypervisor's security requirement and can also work with PVH
> > Dom0 or driver domain. Previously even when we said a MFN is
> > required, it's actually a BFN due to IOMMU existence, and it works
> > just because we have a 1:1 identity mapping in-place. And by finding
> > a BFN
> >
> > some follow-up think here:
> >
> > - one extra unmap call will have some performance impact, especially
> > for media processing workloads where GPU page table modifications
> > are hot. but suppose this can be optimized with batch request
>
> Yep. In general I'd hope that the extra overhead of unmap is small
> compared with the trap + emulate + ioreq + schedule that's just
> happened. Though I know that IOTLB shootdowns are potentially rather
> expensive right now so it might want some measurement.

yes, that's the hard part, requiring experiments to find a good balance
between complexity and performance. The IOMMU page table is not
designed for modifications as frequent as those of CPU/GPU page
tables, but following the above approach ties them together. Another
option might be to reserve enough BFNs at boot time to cover all
available guest memory, so as to eliminate run-time modification
overhead.

>
> > - is there existing _map_ call for this purpose per your knowledge, or
> > a new one is required? If the latter, what's the additional logic to be
> > implemented there?
>
> For PVH, the XENMEM_add_to_physmap (gmfn_foreign) path ought to do
> what you need, I think. For PV, I think we probably need a new map
> operation with sensible semantics. My inclination would be to have it
> follow the grant-map semantics (i.e. caller supplies domid + gfn,
> hypervisor supplies BFN and success/failure code).

setting up the mapping is not a big problem. it's more about finding
available BFNs in a way that doesn't conflict with other usages, e.g.
memory hotplug and ballooning (well, for this I'm not sure now whether
it's only for existing gfns, from the other thread...)

>
> Malcolm might have opinions about this -- it starts looking like the
> sort of PV IOMMU interface he's suggested before.

we'd like to hear Malcolm's suggestion here.

>
> > - when you say _map_, do you expect this mapped into dom0's virtual
> > address space, or just guest physical space?
>
> For PVH, I mean into guest physical address space (and iommu tables,
> since those are the same). For PV, I mean just the IOMMU tables --
> since the guest controls its own PFN space entirely there's nothing
> Xen can do to map things into it.
>
> > - how is BFN or unused address (what do you mean by address here?)
> > allocated? does it need present in guest physical memory at boot time,
> > or just finding some holes?
>
> That's really a question for the xen maintainers in the linux kernel.
> I presume that whatever bookkeeping they currently do for grant-mapped
> memory would suffice here just as well.

will study that part.

>
> > - graphics memory size could be large. starting from BDW, there'll
> > be 64bit page table format. Do you see any limitation here on finding
> > BFN or address?
>
> Not really. The IOMMU tables are also 64-bit so there must be enough
> addresses to map all of RAM. There shouldn't be any need for these
> mappings to be _contiguous_, btw. You just need to have one free
> address for each mapping. Again, following how grant maps work, I'd
> imagine that PVH guests will allocate an unused GFN for each mapping
> and do enough bookkeeping to make sure they don't clash with other GFN
> users (grant mapping, ballooning, &c). PV guests will probably be
> given a BFN by the hypervisor at map time (which will be == MFN in
> practice) and just needs to pass the same BFN to the unmap call later
> (it can store it in the GTT meanwhile).

if possible I'd prefer to make both consistent, i.e. always finding an
unused GFN?

>
> > > The default policy I'm suggesting is that the XenGT backend domain
> > > should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
> > > which will need a small extension in Xen since at the moment struct
> > > domain has only one "target" field.
> >
> > Is that connection setup by toolstack or by hypervisor today?
>
> It's set up by the toolstack using XEN_DOMCTL_set_target. Extending
> that to something like XEN_DOMCTL_set_target_list would be OK, I
> think, along with some sort of lookup call. Or maybe an
> add_target/remove_target pair would be easier?
>

Thanks for the suggestions. Yu and I will study this in detail and
work out a proposal. :-)

Thanks
Kevin
* Re: One question about the hypercall to translate gfn to mfn.
From: Tim Deegan @ 2015-01-08 12:43 UTC
To: Tian, Kevin
Cc: keir@xen.org, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

Hi,

At 08:56 +0000 on 06 Jan (1420530995), Tian, Kevin wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > At 07:24 +0000 on 12 Dec (1418365491), Tian, Kevin wrote:
> > > but just to confirm one point. from my understanding whether it's a
> > > mapping operation doesn't really matter. We can invent an interface
> > > to get p2m mapping and then increase refcnt. the key is refcnt here.
> > > when XenGT constructs a shadow GPU page table, it creates a reference
> > > to guest memory page so the refcnt must be increased. :-)
> >
> > True. :) But Xen does need to remember all the refcounts that were
> > created (so it can tidy up if the domain crashes). If Xen is already
> > doing that it might as well do it in the IOMMU tables since that
> > solves other problems.
>
> would a refcnt in p2m layer enough so we don't need separate refcnt in both
> EPT and IOMMU page table?

Yes, that sounds right. The p2m layer is actually the same as the EPT
table, so that is where the refcount should be attached to (and it
shouldn't matter whether the IOMMU page tables are shared or not).

> yes, that's the hard part requiring experiments to find a good balance
> between complexity and performance. IOMMU page table is not designed
> with same frequent modifications as CPU/GPU page tables, but following
> above trend make them connected. Another option might be reserve a big
> enough BFNs to cover all available guest memory at boot time, so to
> eliminate run-time modification overhead.

Sure, or you can map them on demand but keep a cache of maps to avoid
unmapping between uses.
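
[To make the "cache of maps" idea a bit more concrete, here is a small
sketch in Python. Everything in it is an assumption for illustration:
MapCache, map_fn and unmap_fn are hypothetical names standing in for
whatever map/unmap calls end up existing, and least-recently-used
eviction is just one possible policy. A hit returns the cached BFN
without any hypercall; only an eviction or an explicit invalidate
pays for an unmap.]

```python
from collections import OrderedDict


class MapCache:
    """LRU cache of (domid, gfn) -> BFN mappings (illustrative only)."""

    def __init__(self, map_fn, unmap_fn, capacity):
        self._map = map_fn        # hypothetical: (domid, gfn) -> bfn
        self._unmap = unmap_fn    # hypothetical: (bfn) -> None
        self._capacity = capacity
        self._entries = OrderedDict()

    def get(self, domid, gfn):
        key = (domid, gfn)
        if key in self._entries:
            # Hit: reuse the existing mapping and mark it recently used.
            self._entries.move_to_end(key)
            return self._entries[key]
        if len(self._entries) >= self._capacity:
            # Full: unmap the least recently used entry to make room.
            _, old_bfn = self._entries.popitem(last=False)
            self._unmap(old_bfn)
        bfn = self._map(domid, gfn)
        self._entries[key] = bfn
        return bfn

    def invalidate(self, domid, gfn):
        """Drop one mapping, e.g. when the guest rewrites that GTT entry."""
        bfn = self._entries.pop((domid, gfn), None)
        if bfn is not None:
            self._unmap(bfn)
```

[One cost to keep in mind: a cached mapping keeps its frame pinned, so
the cache size bounds how much guest memory can be held this way.]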

> > Not really. The IOMMU tables are also 64-bit so there must be enough
> > addresses to map all of RAM. There shouldn't be any need for these
> > mappings to be _contiguous_, btw. You just need to have one free
> > address for each mapping. Again, following how grant maps work, I'd
> > imagine that PVH guests will allocate an unused GFN for each mapping
> > and do enough bookkeeping to make sure they don't clash with other GFN
> > users (grant mapping, ballooning, &c). PV guests will probably be
> > given a BFN by the hypervisor at map time (which will be == MFN in
> > practice) and just needs to pass the same BFN to the unmap call later
> > (it can store it in the GTT meanwhile).
>
> if possible prefer to make both consistent, i.e. always finding unused GFN?

I don't think it will be possible. PV domains are already using BFNs
supplied by Xen (in fact == MFN) for backend grant mappings, which
would conflict with supplying their own for these mappings. But
again, I think the kernel maintainers for Xen may have a better idea
of how these interfaces are used inside the kernel. For example,
it might be easy enough to wrap the two systems inside a common API
inside linux. Again, following how grant mapping works seems like
the way forward.

Cheers,

Tim.
* Re: One question about the hypercall to translate gfn to mfn.
From: Tian, Kevin @ 2015-01-09 8:02 UTC
To: Tim Deegan
Cc: keir@xen.org, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, January 08, 2015 8:43 PM
>
> Hi,
>
> > > Not really. The IOMMU tables are also 64-bit so there must be enough
> > > addresses to map all of RAM. There shouldn't be any need for these
> > > mappings to be _contiguous_, btw. You just need to have one free
> > > address for each mapping. Again, following how grant maps work, I'd
> > > imagine that PVH guests will allocate an unused GFN for each mapping
> > > and do enough bookkeeping to make sure they don't clash with other GFN
> > > users (grant mapping, ballooning, &c). PV guests will probably be
> > > given a BFN by the hypervisor at map time (which will be == MFN in
> > > practice) and just needs to pass the same BFN to the unmap call later
> > > (it can store it in the GTT meanwhile).
> >
> > if possible prefer to make both consistent, i.e. always finding unused GFN?
>
> I don't think it will be possible. PV domains are already using BFNs
> supplied by Xen (in fact == MFN) for backend grant mappings, which
> would conflict with supplying their own for these mappings. But
> again, I think the kernel maintainers for Xen may have a better idea
> of how these interfaces are used inside the kernel. For example,
> it might be easy enough to wrap the two systems inside a common API
> inside linux. Again, following how grant mapping works seems like
> the way forward.
>

So Konrad, do you have any insight here? :-)

Thanks
Kevin
* Re: One question about the hypercall to translate gfn to mfn.
From: Konrad Rzeszutek Wilk @ 2015-01-09 20:08 UTC
To: Tian, Kevin
Cc: keir@xen.org, Tim Deegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

On Fri, Jan 09, 2015 at 08:02:48AM +0000, Tian, Kevin wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > Sent: Thursday, January 08, 2015 8:43 PM
> >
> > Hi,
> >
> > > > Not really. The IOMMU tables are also 64-bit so there must be enough
> > > > addresses to map all of RAM. There shouldn't be any need for these
> > > > mappings to be _contiguous_, btw. You just need to have one free
> > > > address for each mapping. Again, following how grant maps work, I'd
> > > > imagine that PVH guests will allocate an unused GFN for each mapping
> > > > and do enough bookkeeping to make sure they don't clash with other GFN
> > > > users (grant mapping, ballooning, &c). PV guests will probably be
> > > > given a BFN by the hypervisor at map time (which will be == MFN in
> > > > practice) and just needs to pass the same BFN to the unmap call later
> > > > (it can store it in the GTT meanwhile).
> > >
> > > if possible prefer to make both consistent, i.e. always finding unused GFN?
> >
> > I don't think it will be possible. PV domains are already using BFNs
> > supplied by Xen (in fact == MFN) for backend grant mappings, which
> > would conflict with supplying their own for these mappings. But
> > again, I think the kernel maintainers for Xen may have a better idea
> > of how these interfaces are used inside the kernel. For example,
> > it might be easy enough to wrap the two systems inside a common API
> > inside linux. Again, following how grant mapping works seems like
> > the way forward.
> >
>
> So Konrad, do you have any insight here? :-)

For grants we end up making the 'struct page' for said grant be
visible in our linear space. We stash the original BFNs(MFN) in the
'struct page' and replace the P2M in PV guests with the new BFN(MFN).
David and Jennifer are working on making this more lightweight.

How often do we do these updates?

We could also do it a simpler way - which is what backend drivers
do - which is to get a swath of vmalloc memory and hook the BFNs to
it. That can stay for quite some time.

The neat thing about vmalloc is that it is a sliding-window type
mechanism to deal with memory that is not usually accessed via linear
page tables.

I suppose the complexity behind this is that this 'window' at the GPU
page tables needs to change. As in it moves around as there are
different guests doing things. So the mechanism of swapping this
'window' is going to be expensive to map/unmap (as you have to flush
the TLBs in the initial domain for the page-tables - unless you have
multiple 'windows' and we flush the older ones lazily? But that
sounds complex).

Who is doing the audit/modification? Is it some application in the
initial (backend) domain or some driver in the kernel?

>
> Thanks
> Kevin
* Re: One question about the hypercall to translate gfn to mfn.
From: David Vrabel @ 2015-01-12 11:14 UTC
To: Tian, Kevin, Tim Deegan
Cc: keir@xen.org, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

On 09/01/15 08:02, Tian, Kevin wrote:
>> From: Tim Deegan [mailto:tim@xen.org]
>> Sent: Thursday, January 08, 2015 8:43 PM
>>
>> Hi,
>>
>>>> Not really. The IOMMU tables are also 64-bit so there must be enough
>>>> addresses to map all of RAM. There shouldn't be any need for these
>>>> mappings to be _contiguous_, btw. You just need to have one free
>>>> address for each mapping. Again, following how grant maps work, I'd
>>>> imagine that PVH guests will allocate an unused GFN for each mapping
>>>> and do enough bookkeeping to make sure they don't clash with other GFN
>>>> users (grant mapping, ballooning, &c). PV guests will probably be
>>>> given a BFN by the hypervisor at map time (which will be == MFN in
>>>> practice) and just needs to pass the same BFN to the unmap call later
>>>> (it can store it in the GTT meanwhile).
>>>
>>> if possible prefer to make both consistent, i.e. always finding unused GFN?
>>
>> I don't think it will be possible. PV domains are already using BFNs
>> supplied by Xen (in fact == MFN) for backend grant mappings, which
>> would conflict with supplying their own for these mappings. But
>> again, I think the kernel maintainers for Xen may have a better idea
>> of how these interfaces are used inside the kernel. For example,
>> it might be easy enough to wrap the two systems inside a common API
>> inside linux. Again, following how grant mapping works seems like
>> the way forward.
>>
>
> So Konrad, do you have any insight here? :-)

Malcolm took two pages of this notebook explaining to me how he
thought it should work (in combination with his PV IOMMU work), so
I'll let him explain.

David
* Re: One question about the hypercall to translate gfn to mfn.
From: Tim Deegan @ 2014-12-11 21:29 UTC
To: Tian, Kevin
Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com, Xen-devel@lists.xen.org

Hi, again. :)

As promised, I'm going to talk about more abstract design
considerations. This will be a lot less concrete than in the other
email, and about a larger range of things. Some of them may not be
really desirable - or even possible.

[ TL;DR: read the other reply with the practical suggestions in it :) ]

I'm talking from the point of view of a hypervisor maintainer, looking
at introducing this new XenGT component and thinking about what
security properties we would like the _system_ to have once XenGT is
introduced. I'm going to lay out a series of broadly increasing
levels of security goodness and talk about what we'd need to do to get
there.

For the purposes of this discussion, Xen does not _trust_ XenGT. By
that I mean that Xen can't rely on the correctness/integrity of XenGT
itself to maintain system security. Now, we can decide that for some
properties we _will_ choose to trust XenGT, but the default is to
assume that XenGT could be compromised or buggy. (This is not
intended as a slur on XenGT, btw -- this is how we reason about device
driver domains, qemu-dm and other components. There will be bugs in
any component, and we're designing the system to minimise the effect
of those bugs.)

OK. Properties we would like to have:

LEVEL 0: Protect Xen itself from XenGT
--------------------------------------

Bugs in XenGT should not be able to crash the host, and a compromised
XenGT should not be able to take over the hypervisor.

We're not there in the current design, purely because XenGT has to be
in dom0 (so it can trivially DoS Xen by rebooting the host).

But it doesn't seem too hard: as soon as we can run XenGT in a driver
domain, and with IOMMU tables that restrict the GPU from writing to Xen's
datastructures, we'll have this property.

[BTW, this whole discussion assumes that the GPU has no 'back door'
access to issue DMA that is not translated by the IOMMU. I have heard
rumours in the past that such things exist. :) If the GPU can issue
untranslated DMA, then whatever controls it can take over the entire
system, and so we can't make _any_ security guarantees about it.]

LEVEL 1: Isolate XenGT's clients from other VMs
-----------------------------------------------

In other words we partition the machine into VMs XenGT can touch
(i.e. its clients) and those it can't. Then a malicious client that
compromises XenGT only gains access to other VMs that share a GPU with
it. That means we can deploy XenGT for some VMs without increasing
the risk to other tenants.

Again we're not there yet, but I think the design I was talking about
in my other email would do it: if XenGT must map all the memory it
wants to let the GPU DMA to, and Xen's policy is to deny mappings for
non-client-vm memory, then VMs that aren't using XenGT are protected.

LEVEL 2: Isolate XenGT's clients from each other
------------------------------------------------

This is trickier, as you pointed out. We could:

a) Decide that we will trust XenGT to provide this property. After
   all, that's its main purpose! This is how we treat other shared
   backends: if a NIC device driver domain is compromised, the
   attacker controls the network traffic for all its frontends.
   OTOH, we don't trust qemu in that way -- instead we use stub
   domains and IS_PRIV_FOR to enforce isolation.

b) Move all of XenGT into Xen. This is just defining the problem
   away and would probably do more harm than good - after all,
   keeping it separate has other advantages.

c) Use privilege separation: break XenGT into parts, isolated from
   each other, with the principle of least privilege applied to them.
   E.g.
   - GPU emulation could be in a per-client component that doesn't
     share state with the other clients' emulators;
   - Shadowing GTTs and auditing GPU commands could move into Xen,
     with a clean interface to the emulation parts.
   That way, even if a client VM can exploit a bug in the emulator,
   it can't affect other clients because it can't see their emulator
   state, and it can't bypass the safety rules because they're
   enforced by Xen.

When I talked about privilege separation before I was suggesting
something like this, but without moving anything into Xen -- e.g. the
device-emulation code for each client could be in a per-client,
non-root process. The code that audits and issues commands to the GPU
would be in a separate process, which is allowed to make hypercalls,
and which does not trust the emulator processes.

My apologies if you're already doing this -- I know XenGT has some
components in a kernel driver and some elsewhere but I haven't looked
at the details.

LEVEL 3: Isolate XenGT's clients from XenGT itself
--------------------------------------------------

XenGT should not be able to access parts of its client VMs that they
have not given it permission to. E.g. XenGT should not be able to
read a client VM's crypto keys unless it displays them on the
framebuffer or uses the GPU to accelerate crypto.

Unlike level 2, device driver domains _do_ have this property: this is
what the grant tables are used for. A compromised NIC driver domain
can MITM the frontend guest but it can't read any memory in the guest
other than network buffers.

Again there are a few approaches, like:

a) Declare that we don't care (i.e. that we will trust XenGT for
   this property too). In a way it's no worse than trusting the
   firmware on a dedicated pass-through GPU. But on the other hand
   the client VM is sharing that firmware with some other VMs... :(

b) Make the GPU driver in the client use grant tables for all RAM
   that it gives to the GPU. Probably not practical!

c) Move just the code that builds the GTTs into Xen. That way Xen
   would guarantee that the GPU never accessed memory it wasn't
   allowed to.

I'm sure there are other ideas too.

Conclusion
----------

That's enough rambling from me -- time to come back down to earth.
While I think it's useful to think about all these things, we don't
want to get carried away. :) And as I said, for some things we can
decide to trust XenGT to provide them, as long as we're clear about
what that means.

I think that a reasonable minimum standard to expect is to enforce
levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3. And I
think we can do that without needing any huge engineering effort;
as I said, I think that's covered in my earlier reply.

Cheers,

Tim.
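
[As a hedged illustration of how the level 1 rule could be checked at
map time, here is a toy sketch in Python. The Backend class, the
frame_owner table and allow_map are invented for this example and are
not Xen code; the assumption is only that Xen knows which domain owns
each frame and which domains are clients of a given backend.]

```python
from dataclasses import dataclass, field


@dataclass
class Backend:
    """Toy XenGT backend domain with the set of client domids it serves."""
    domid: int
    clients: set = field(default_factory=set)


def allow_map(backend, frame_owner, mfn):
    """Permit a device mapping only for frames owned by a client VM.

    frame_owner maps mfn -> owning domid; frames that belong to Xen or
    are unallocated have no entry and are always refused.
    """
    owner = frame_owner.get(mfn)
    return owner is not None and owner in backend.clients
```

[Under these assumptions a frame owned by a non-client VM, or by Xen
itself, can never enter the GPU's IOMMU tables through this path,
whatever the backend asks for.]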
* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-11 21:29 ` Tim Deegan
@ 2014-12-12 6:29 ` Tian, Kevin
  2014-12-18 16:08 ` Tim Deegan
  2015-01-05 15:49 ` George Dunlap
  2014-12-12 7:30 ` Tian, Kevin
  1 sibling, 2 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-12 6:29 UTC (permalink / raw)
To: Tim Deegan
Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com,
	Xen-devel@lists.xen.org

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Friday, December 12, 2014 5:29 AM
>
> Hi, again. :)
>
> As promised, I'm going to talk about more abstract design
> considerations.  This will be a lot less concrete than in the other
> email, and about a larger range of things.  Some of them may not be
> really desirable - or even possible.

Thanks for taking the time to share your thoughts on this! I'll give my
comments at the same level and leave the detailed technical discussion
to another thread. :-)

>
> [ TL;DR: read the other reply with the practical suggestions in it :) ]
>
> I'm talking from the point of view of a hypervisor maintainer, looking
> at introducing this new XenGT component and thinking about what
> security properties we would like the _system_ to have once XenGT is
> introduced.  I'm going to lay out a series of broadly increasing
> levels of security goodness and talk about what we'd need to do to get
> there.

That's a good clarification of the levels.

>
> For the purposes of this discussion, Xen does not _trust_ XenGT.  By
> that I mean that Xen can't rely on the correctness/integrity of XenGT
> itself to maintain system security.  Now, we can decide that for some
> properties we _will_ choose to trust XenGT, but the default is to
> assume that XenGT could be compromised or buggy.  (This is not
> intended as a slur on XenGT, btw -- this is how we reason about device
> driver domains, qemu-dm and other components.  There will be bugs in
> any component, and we're designing the system to minimise the effect
> of those bugs.)

Yes, it's a fair concern.

>
> OK.  Properties we would like to have:
>
> LEVEL 0: Protect Xen itself from XenGT
> --------------------------------------
>
> Bugs in XenGT should not be able to crash the host, and a compromised
> XenGT should not be able to take over the hypervisor.
>
> We're not there in the current design, purely because XenGT has to be
> in dom0 (so it can trivially DoS Xen by rebooting the host).

Can we really stop dom0 from being able to DoS Xen? I know there's
ongoing effort like PVH Dom0, however there is a lot of trickiness in
Dom0 which can put the platform into a bad state. One example is ACPI.
All the platform details are encapsulated in AML, and only dom0 knows
how to handle ACPI events. Unless Xen has another parser to guard all
possible resources which might be touched thru ACPI, a tampered dom0 has
many ways to break out. But that'd be very challenging and complex.

If we can't containerize Dom0's behavior completely, I would think dom0
and Xen are actually in the same trust zone, so putting XenGT in Dom0
shouldn't make things worse.

>
> But it doesn't seem too hard: as soon as we can run XenGT in a driver
> domain, and with IOMMU tables that restrict the GPU from writing to Xen's
> datastructures, we'll have this property.
>
> [BTW, this whole discussion assumes that the GPU has no 'back door'
> access to issue DMA that is not translated by the IOMMU.  I have heard
> rumours in the past that such things exist. :)  If the GPU can issue
> untranslated DMA, then whatever controls it can take over the entire
> system, and so we can't make _any_ security guarantees about it.]

I definitely agree with this LEVEL 0 requirement in general, e.g. dom0
can't DMA into Xen's data structures (this is ensured even for the
default 1:1 identity mapping). However I'm not sure whether XenGT must
be put in a driver domain as a hard requirement.
It's nice to have (and there are some open implementation issues; let's
discuss those in another thread).

>
>
> LEVEL 1: Isolate XenGT's clients from other VMs
> -----------------------------------------------
>
> In other words we partition the machine into VMs XenGT can touch
> (i.e. its clients) and those it can't.  Then a malicious client that
> compromises XenGT only gains access to other VMs that share a GPU with
> it.  That means we can deploy XenGT for some VMs without increasing
> the risk to other tenants.
>
> Again we're not there yet, but I think the design I was talking about
> in my other email would do it: if XenGT must map all the memory it
> wants to let the GPU DMA to, and Xen's policy is to deny mappings for
> non-client-vm memory, then VMs that aren't using XenGT are protected.

Fully agree. We have a 'vgt' control option in each VM's config file.
That can be the hint for Xen to decide whether to allow or deny a
mapping from XenGT.

>
>
> LEVEL 2: Isolate XenGT's clients from each other
> ------------------------------------------------
>
> This is trickier, as you pointed out.  We could:
>
> a) Decide that we will trust XenGT to provide this property.  After
>    all, that's its main purpose!  This is how we treat other shared
>    backends: if a NIC device driver domain is compromised, the
>    attacker controls the network traffic for all its frontends.
>    OTOH, we don't trust qemu in that way -- instead we use stub domains
>    and IS_PRIV_FOR to enforce isolation.

Yep. Just curious, I thought stubdomain is not popularly used. The
typical case is to have qemu in dom0. Is this still true? :-)

>
> b) Move all of XenGT into Xen.  This is just defining the problem away
>    and would probably do more harm than good - after all, keeping it
>    separate has other advantages.

I'll explain below why we don't keep XenGT in Xen.

>
> c) Use privilege separation: break XenGT into parts, isolated from each
>    other, with the principle of least privilege applied to them.  E.g.
>    - GPU emulation could be in a per-client component that doesn't
>      share state with the other clients' emulators;

Yes, we're doing it that way now. The emulation is a per-VM kernel
thread. A separate main thread manages the physical GPU to do context
switches.

>    - Shadowing GTTs and auditing GPU commands could move into Xen,
>      with a clean interface to the emulation parts.

I'm afraid there's no such clean interface, given the complexity of the
GPU. Let me give some other background which impacts the XenGT design
(some of it exists today, and some is the plan going forward). Putting
it here is not to say "we don't want to change due to other reasons",
but to show the list of factors we need to balance:

1. The core device model will be merged as part of the Intel graphics
   kernel driver. This avoids duplicated physical GPU management in
   XenGT (that's today's implementation), with benefits in simplicity,
   quality and maintainability.

2. The same device model will then be shared by both XenGT and KVMGT,
   only requiring Xen/KVM to provide a minimal set of emulation
   services, like event forwarding, mapping guest memory, etc.

3. GPU emulation is complex, and there are lots of differences from
   generation to generation. Our customers need a flexible release
   model so we can release new features and bug fixes quickly thru a
   kernel module.

Those are the major reasons we came to the current XenGT architecture.
Now back to your idea of moving shadow GTT and auditing of GPU commands
into Xen. It will cause more complexity in these areas:

- Somehow it means we have two drivers on one device, each responsible
  for some role. Then we likely need to hack the Intel graphics driver's
  GTT management code and scheduling code to cooperate with this
  movement.
  That's unlikely to be acceptable to the driver people.

- Auditing GPU commands needs an understanding of the vGPU context,
  which means a large buffer has to be shared and synchronized between
  Xen and XenGT.

- GTT/command formats are not compatible from generation to generation,
  which means unnecessary maintenance effort in Xen.

- And last but not least, the GPU HW itself is not designed so cleanly
  as to separate the GTT from the remaining parts, which means that even
  if we move GTT mgmt. into the hypervisor, there are many means to
  bypass the control, e.g. changing the root pointer of the GTT (which
  may be in a register, or maybe in a memory structure). And once we
  want to move those parts into Xen, we will dig out more bits and
  finally have to pull the whole driver into Xen (though less complex
  than a real graphics driver).

Sorry for writing such long detail in this high-level discussion. I just
wrote it down while thinking about whether this is practical, and hope
it answers the concern here. :-)

>    That way, even if a client VM can exploit a bug in the emulator,
>    it can't affect other clients because it can't see their emulator
>    state, and it can't bypass the safety rules because they're
>    enforced by Xen.
>
>    When I talked about privilege separation before I was suggesting
>    something like this, but without moving anything into Xen -- e.g.
>    the device-emulation code for each client could be in a per-client,
>    non-root process.  The code that audits and issues commands to the
>    GPU would be in a separate process, which is allowed to make
>    hypercalls, and which does not trust the emulator processes.
>    My apologies if you're already doing this -- I know XenGT has some
>    components in a kernel driver and some elsewhere but I haven't
>    looked at the details.

That's a good comment. We're implementing it that way, but it might not
be so strictly separated. I'll bring this comment back to our
engineering team to have it well considered.

>
>
> LEVEL 3: Isolate XenGT's clients from XenGT itself
> --------------------------------------------------
>
> XenGT should not be able to access parts of its client VMs that they
> have not given it permission to.  E.g. XenGT should not be able to
> read a client VM's crypto keys unless it displays them on the
> framebuffer or uses the GPU to accelerate crypto.
>
> Unlike level 2, device driver domains _do_ have this property: this is
> what the grant tables are used for.  A compromised NIC driver domain
> can MITM the frontend guest but it can't read any memory in the guest
> other than network buffers.
>
> Again there are a few approaches, like:
>
> a) Declare that we don't care (i.e. that we will trust XenGT for this
>    property too).  In a way it's no worse than trusting the firmware
>    on a dedicated pass-through GPU.  But on the other hand the client
>    VM is sharing that firmware with some other VMs... :(
>
> b) Make the GPU driver in the client use grant tables for all RAM that
>    it gives to the GPU.  Probably not practical!

Yes, and that can be a good research topic. :-)

>
> c) Move just the code that builds the GTTs into Xen.  That way
>    Xen would guarantee that the GPU never accessed memory it wasn't
>    allowed to.

As explained above, it's impractical to separate self-contained GTT
logic into Xen. In a GPU, the GTT is somehow an attribute belonging to a
render context, unlike the CPU's CR3, which is very simple.

>
> I'm sure there are other ideas too.
>
>
> Conclusion
> ----------
>
> That's enough rambling from me -- time to come back down to earth.
> While I think it's useful to think about all these things, we don't
> want to get carried away. :)  And as I said, for some things we can
> decide to trust XenGT to provide them, as long as we're clear about
> what that means.
>
> I think that a reasonable minimum standard to expect is to enforce
> levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3.
> And I think we can do that without needing any huge engineering
> effort; as I said, I think that's covered in my earlier reply.
>

I agree with the conclusion that "minimum standard to expect is to
enforce levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3",
except for the concern about whether PVH Dom0 is a hard requirement or
not. Having said that, I'm happy to discuss the technical details of how
to support PVH Dom0 in another thread.

Thanks
Kevin
* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-12 6:29 ` Tian, Kevin
@ 2014-12-18 16:08 ` Tim Deegan
  2014-12-18 17:01 ` Andrew Cooper
  2015-01-05 15:49 ` George Dunlap
  1 sibling, 1 reply; 59+ messages in thread
From: Tim Deegan @ 2014-12-18 16:08 UTC (permalink / raw)
To: Tian, Kevin
Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com,
	Xen-devel@lists.xen.org

Hi,

At 06:29 +0000 on 12 Dec (1418362182), Tian, Kevin wrote:
> If we can't containerize Dom0's behavior completely, I would think dom0
> and Xen are actually in the same trust zone, so putting XenGT in Dom0
> shouldn't make things worse.

Ah, but it does -- it's putting thousands of lines of code into that
trust zone and adding a new attack surface.  So it would be much better
if we could put XenGT in its own domain where it doesn't have full dom0
privileges.

> However I'm not sure whether XenGT must be put in a driver domain as
> a hard requirement. It's nice to have (and there are some open
> implementation issues; let's discuss those in another thread).

Sure -- it would take a lot of toolstack work to actually put XenGT
into a driver domain.  So although I strongly encourage it, I don't
think it's a hard requirement.  But I'd like to make sure that we end
up with a XenGT that _could_ go in a driver domain if that toolstack
plumbing was done.

> yep. Just curious, I thought stubdomain is not popularly used. typical
> case is to have qemu in dom0. is this still true? :-)

Some do and some don't. :)  High-security distros like Qubes and
XenClient do.  You can enable it in xl config files pretty easily.
IIRC the xapi toolstack doesn't use it, but XenServer uses privilege
separation to isolate the qemu processes in dom0.

> sorry write a long detail in this high level discussion. Just write-down
> when thinking whether this is practical, and hope it answers our concern
> here. :-)

Thank you for that, it's helpful to have a clear idea about it.

Cheers,

Tim.
* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-18 16:08 ` Tim Deegan
@ 2014-12-18 17:01 ` Andrew Cooper
  0 siblings, 0 replies; 59+ messages in thread
From: Andrew Cooper @ 2014-12-18 17:01 UTC (permalink / raw)
To: Tim Deegan, Tian, Kevin
Cc: Yu, Zhang, Paul.Durrant@citrix.com, keir@xen.org, JBeulich@suse.com,
	Xen-devel@lists.xen.org

On 18/12/14 16:08, Tim Deegan wrote:
>> yep. Just curious, I thought stubdomain is not popularly used. typical
>> case is to have qemu in dom0. is this still true? :-)
> Some do and some don't. :)  High-security distros like Qubes and
> XenClient do.  You can enable it in xl config files pretty easily.
> IIRC the xapi toolstack doesn't use it, but XenServer uses privilege
> separation to isolate the qemu processes in dom0.
>

We are looking into stubdomains as part of our future architectural
roadmap, but as identified, there is a lot of toolstack plumbing
required before this is feasible to put into XenServer.  Our privilege
separation in qemu is a stopgap measure which we would like to replace
in due course.

~Andrew
* Re: One question about the hypercall to translate gfn to mfn. 2014-12-12 6:29 ` Tian, Kevin 2014-12-18 16:08 ` Tim Deegan @ 2015-01-05 15:49 ` George Dunlap 2015-01-06 8:42 ` Tian, Kevin 1 sibling, 1 reply; 59+ messages in thread From: George Dunlap @ 2015-01-05 15:49 UTC (permalink / raw) To: Tian, Kevin Cc: keir@xen.org, Tim Deegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com On Fri, Dec 12, 2014 at 6:29 AM, Tian, Kevin <kevin.tian@intel.com> wrote: >> We're not there in the current design, purely because XenGT has to be >> in dom0 (so it can trivially DoS Xen by rebooting the host). > > Can we really decouple dom0 from DoS Xen? I know there's on-going effort > like PVH Dom0, however there are lots of trickiness in Dom0 which can > put the platform into a bad state. One example is ACPI. All the platform > details are encapsulated in AML language, and only dom0 knows how to > handle ACPI events. Unless Xen has another parser to guard all possible > resources which might be touched thru ACPI, a tampered dom0 has many > way to break out. But that'd be very challenging and complex. > > If we can't containerize Dom0's behavior completely, I would think dom0 > and Xen actually in the same trust zone, so putting XenGT in Dom0 shouldn't > make things worse. The question here is, "If a malicious guest can manage to break into XenGT, what can they do?" If XenGT is running in dom0, then the answer is, "At very least, they can DoS the host because dom0 is allowed to reboot; they can probably do lots of other nasty things as well." If XenGT is running in its own domain, and can only add IOMMU entries for MFNs belonging to XenGT-only VMs, then the answer is, "They can access other XenGT-enabled VMs, but they cannot shut down the host or access non-XenGT VMs." 
Slides 8-11 of a presentation I gave
(http://www.slideshare.net/xen_com_mgr/a-brief-tutorial-on-xens-advanced-security-features)
can give you a graphical idea of what we're talking about.

 -George
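The LEVEL 1 policy discussed above (XenGT may only get mappings for
frames that belong to VMs which opted in, e.g. via the 'vgt' option in
the VM's config file) could be sketched in C roughly as follows.  This
is only a sketch under stated assumptions: toy_domain, toy_frame,
may_map_for_gpu and the demo_* data are hypothetical names, and none of
this is existing Xen code or a proposed hypercall interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of a domain; NOT Xen's struct domain. */
struct toy_domain {
    uint16_t domid;
    bool     vgt_client;  /* assumed to be set from the VM's 'vgt' option */
};

/* Hypothetical ownership record for one machine frame. */
struct toy_frame {
    uint64_t mfn;
    uint16_t owner;       /* domid that owns this frame */
};

/* Look up a domain by id; returns NULL if it does not exist. */
static const struct toy_domain *find_domain(const struct toy_domain *doms,
                                            size_t ndoms, uint16_t domid)
{
    for (size_t i = 0; i < ndoms; i++)
        if (doms[i].domid == domid)
            return &doms[i];
    return NULL;
}

/*
 * Policy check: the GPU mediator may map a frame for DMA only if the
 * frame's owner exists and is marked as a vGT client.  Everything else
 * is denied by default.
 */
static bool may_map_for_gpu(const struct toy_domain *doms, size_t ndoms,
                            const struct toy_frame *frame)
{
    const struct toy_domain *owner = find_domain(doms, ndoms, frame->owner);

    return owner != NULL && owner->vgt_client;
}

/* Example: domain 1 uses XenGT, domain 2 does not, domain 9 is unknown. */
static const struct toy_domain demo_domains[] = {
    { .domid = 1, .vgt_client = true  },
    { .domid = 2, .vgt_client = false },
};
#define DEMO_NDOMS (sizeof(demo_domains) / sizeof(demo_domains[0]))

static const struct toy_frame demo_client_frame = { .mfn = 0xabc00, .owner = 1 };
static const struct toy_frame demo_other_frame  = { .mfn = 0xdef00, .owner = 2 };
static const struct toy_frame demo_orphan_frame = { .mfn = 0x12300, .owner = 9 };
```

Under this check a compromised XenGT could still ask for frames of other
XenGT-enabled VMs (domain 1 here), but a request for memory of a VM
without the flag, or of an unknown owner, is refused.  The important
design point is that the check runs in the component that installs the
IOMMU entries, not in XenGT itself.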
* Re: One question about the hypercall to translate gfn to mfn. 2015-01-05 15:49 ` George Dunlap @ 2015-01-06 8:42 ` Tian, Kevin 2015-01-06 10:35 ` Ian Campbell 0 siblings, 1 reply; 59+ messages in thread From: Tian, Kevin @ 2015-01-06 8:42 UTC (permalink / raw) To: George Dunlap Cc: keir@xen.org, Tim Deegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com > From: George Dunlap > Sent: Monday, January 05, 2015 11:50 PM > > On Fri, Dec 12, 2014 at 6:29 AM, Tian, Kevin <kevin.tian@intel.com> wrote: > >> We're not there in the current design, purely because XenGT has to be > >> in dom0 (so it can trivially DoS Xen by rebooting the host). > > > > Can we really decouple dom0 from DoS Xen? I know there's on-going effort > > like PVH Dom0, however there are lots of trickiness in Dom0 which can > > put the platform into a bad state. One example is ACPI. All the platform > > details are encapsulated in AML language, and only dom0 knows how to > > handle ACPI events. Unless Xen has another parser to guard all possible > > resources which might be touched thru ACPI, a tampered dom0 has many > > way to break out. But that'd be very challenging and complex. > > > > If we can't containerize Dom0's behavior completely, I would think dom0 > > and Xen actually in the same trust zone, so putting XenGT in Dom0 shouldn't > > make things worse. > > The question here is, "If a malicious guest can manage to break into > XenGT, what can they do?" > > If XenGT is running in dom0, then the answer is, "At very least, they > can DoS the host because dom0 is allowed to reboot; they can probably > do lots of other nasty things as well." > > If XenGT is running in its own domain, and can only add IOMMU entries > for MFNs belonging to XenGT-only VMs, then the answer is, "They can > access other XenGT-enabled VMs, but they cannot shut down the host or > access non-XenGT VMs." 
>
> Slides 8-11 of a presentation I gave
> (http://www.slideshare.net/xen_com_mgr/a-brief-tutorial-on-xens-advanced-security-features)
> can give you a graphical idea of what we're talking about.
>

I agree we need to make XenGT more isolated, following the ongoing
trend from the previous discussion. But regarding whether Dom0 and Xen
are in the same security domain, I don't see that my statement is
changed by the above attempts, which just try to move privileged Xen
stuff away from dom0: all existing Linux vulnerabilities still allow a
tampered Dom0 to do many evil things with root permission, or even with
a tampered kernel, to DoS Xen (e.g. w/ ACPI). PVH dom0 can help
performance... but by itself it doesn't change the fact that Dom0 and
Xen are actually in the same security domain. :-)

Thanks
Kevin
* Re: One question about the hypercall to translate gfn to mfn. 2015-01-06 8:42 ` Tian, Kevin @ 2015-01-06 10:35 ` Ian Campbell 0 siblings, 0 replies; 59+ messages in thread From: Ian Campbell @ 2015-01-06 10:35 UTC (permalink / raw) To: Tian, Kevin Cc: keir@xen.org, George Dunlap, Tim Deegan, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com On Tue, 2015-01-06 at 08:42 +0000, Tian, Kevin wrote: > > From: George Dunlap > > Sent: Monday, January 05, 2015 11:50 PM > > > > On Fri, Dec 12, 2014 at 6:29 AM, Tian, Kevin <kevin.tian@intel.com> wrote: > > >> We're not there in the current design, purely because XenGT has to be > > >> in dom0 (so it can trivially DoS Xen by rebooting the host). > > > > > > Can we really decouple dom0 from DoS Xen? I know there's on-going effort > > > like PVH Dom0, however there are lots of trickiness in Dom0 which can > > > put the platform into a bad state. One example is ACPI. All the platform > > > details are encapsulated in AML language, and only dom0 knows how to > > > handle ACPI events. Unless Xen has another parser to guard all possible > > > resources which might be touched thru ACPI, a tampered dom0 has many > > > way to break out. But that'd be very challenging and complex. > > > > > > If we can't containerize Dom0's behavior completely, I would think dom0 > > > and Xen actually in the same trust zone, so putting XenGT in Dom0 shouldn't > > > make things worse. > > > > The question here is, "If a malicious guest can manage to break into > > XenGT, what can they do?" > > > > If XenGT is running in dom0, then the answer is, "At very least, they > > can DoS the host because dom0 is allowed to reboot; they can probably > > do lots of other nasty things as well." 
> >
> > If XenGT is running in its own domain, and can only add IOMMU entries
> > for MFNs belonging to XenGT-only VMs, then the answer is, "They can
> > access other XenGT-enabled VMs, but they cannot shut down the host or
> > access non-XenGT VMs."
> >
> > Slides 8-11 of a presentation I gave
> > (http://www.slideshare.net/xen_com_mgr/a-brief-tutorial-on-xens-advanced-security-features)
> > can give you a graphical idea of what we're talking about.
> >
>
> I agree we need to make XenGT more isolated following on-going trend from
> previous discussion, but regarding to whether Dom0/Xen are in the same security
> domain, I don't see my statement is changed w/ above attempts which just try to
> move privileged Xen stuff away from dom0, but all existing Linux vulnerabilities
> allow a tampered Dom0 do many evil things with root permission or even tampered
> kernel to DoS Xen (e.g. w/ ACPI). PVH dom0 can help performance... but itself alone
> doesn't change the fact that Dom0/Xen are actually in the same security domain. :-)

Which is a good reason why one would want to remove as much potentially
vulnerable code from dom0 as possible, and then deny it the
corresponding permissions via XSM too.

I also find the argument "dom0 can do some bad things so we should let
it be able to do all bad things" rather specious.

Ian.
* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-11 21:29 ` Tim Deegan
  2014-12-12 6:29 ` Tian, Kevin
@ 2014-12-12 7:30 ` Tian, Kevin
  1 sibling, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-12 7:30 UTC (permalink / raw)
To: Tim Deegan
Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com,
	Xen-devel@lists.xen.org

> From: Tian, Kevin
> Sent: Friday, December 12, 2014 2:30 PM
> >
> > Conclusion
> > ----------
> >
> > That's enough rambling from me -- time to come back down to earth.
> > While I think it's useful to think about all these things, we don't
> > want to get carried away. :)  And as I said, for some things we can
> > decide to trust XenGT to provide them, as long as we're clear about
> > what that means.
> >
> > I think that a reasonable minimum standard to expect is to enforce
> > levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3.  And I
> > think we can do that without needing any huge engineering effort;
> > as I said, I think that's covered in my earlier reply.
> >
>
> I agree the conclusion that "minimum standard to expect is to enforce
> levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3", except the
> concern whether PVH Dom0 is a hard requirement or not. Having
> said that, I'm happy to discuss technical detail in another thread on
> how to support PVH Dom0.
>

So after going through another mail, now I agree both level 0/1 can't
be enforced. :-)

Thanks
Kevin