One question about the hypercall to translate gfn to mfn.

* One question about the hypercall to translate gfn to mfn.
@ 2014-12-09 10:10 Yu, Zhang
  2014-12-09 10:19 ` Paul Durrant
                   ` (2 more replies)
  0 siblings, 3 replies; 59+ messages in thread
From: Yu, Zhang @ 2014-12-09 10:10 UTC (permalink / raw)
  To: Paul.Durrant, keir, tim, JBeulich, kevin.tian, Xen-devel

Hi all,

   As you can see, we are pushing our XenGT patches upstream. One
feature we need in Xen is the ability to translate a guest's gfn to an
mfn in the XenGT dom0 device model.

   We see two similar solutions:
   1> Paul told me (and thank you, Paul :)) that there used to be a
hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32 because it had no
users at that time. So solution 1 is to revert this commit. However,
since this hypercall was removed ages ago, the revert hits many
conflicts, e.g. gmfn_to_mfn is no longer used on x86.

   2> In our project, we defined a new hypercall,
XENMEM_get_mfn_from_pfn, whose implementation is similar to that of the
previous XENMEM_translate_gpfn_list. One of the major differences is
that the new one is x86-only (called from arch_memory_op), so we do
not have to worry about the ARM side.
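
For illustration, the kind of batch lookup both options provide can be
modelled as below. This is a toy Python sketch, not the Xen hypercall ABI:
the function name, the INVALID_MFN sentinel and all frame numbers are
assumptions made up for the example.

```python
# Toy model of a batch gfn -> mfn lookup; not the actual Xen interface.
INVALID_MFN = -1  # hypothetical sentinel meaning "no translation"


def translate_gfn_list(p2m, gfns):
    """Return one mfn per requested gfn, INVALID_MFN where nothing is mapped."""
    return [p2m.get(gfn, INVALID_MFN) for gfn in gfns]


# Hypothetical p2m of one guest: guest frame number -> machine frame number.
guest_p2m = {0x10: 0x8A210, 0x11: 0x8A3F7, 0x12: 0x91C00}

# The device model asks for three frames; 0x99 is not populated in the guest.
mfns = translate_gfn_list(guest_p2m, [0x10, 0x12, 0x99])
```

The point of a list-based call is that the device model can resolve many
guest frames per hypercall instead of trapping once per frame.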

   Does anyone have any suggestions about this?
   Thanks in advance. :)

B.R.
Yu

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:10 One question about the hypercall to translate gfn to mfn Yu, Zhang
@ 2014-12-09 10:19 ` Paul Durrant
  2014-12-09 10:37   ` Yu, Zhang
  2014-12-09 10:38 ` Jan Beulich
  2014-12-09 10:46 ` Tim Deegan
  2 siblings, 1 reply; 59+ messages in thread
From: Paul Durrant @ 2014-12-09 10:19 UTC (permalink / raw)
  To: Yu, Zhang, Keir (Xen.org), Tim (Xen.org), JBeulich@suse.com,
	Kevin Tian, Xen-devel@lists.xen.org

> -----Original Message-----
> From: Yu, Zhang [mailto:yu.c.zhang@linux.intel.com]
> Sent: 09 December 2014 10:11
> To: Paul Durrant; Keir (Xen.org); Tim (Xen.org); JBeulich@suse.com; Kevin
> Tian; Xen-devel@lists.xen.org
> Subject: One question about the hypercall to translate gfn to mfn.
> 
> Hi all,
> 
>    As you can see, we are pushing our XenGT patches to the upstream. One
> feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> device model.
> 
>    Here we may have 2 similar solutions:
>    1> Paul told me(and thank you, Paul :)) that there used to be a
> hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no
> usage at that time. So solution 1 is to revert this commit. However,
> since this hypercall was removed ages ago, the reverting met many
> conflicts, i.e. the gmfn_to_mfn is no longer used in x86, etc.
> 
>    2> In our project, we defined a new hypercall
> XENMEM_get_mfn_from_pfn, which has a similar implementation like the
> previous XENMEM_translate_gpfn_list. One of the major differences is
> that this newly defined one is only for x86(called in arch_memory_op),
> so we do not have to worry about the arm side.
> 
>    Does anyone has any suggestions about this?

IIUC, what is needed is a means to IOMMU map a gfn in the service domain (dom0 for the moment) such that it can be accessed by the GPU. I think use of a raw mfn value currently works only because dom0 is using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do you really need raw mfn values?

  Paul

>    Thanks in advance. :)
> 
> B.R.
> Yu


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:19 ` Paul Durrant
@ 2014-12-09 10:37   ` Yu, Zhang
  2014-12-09 10:50     ` Jan Beulich
  2014-12-09 10:51     ` Malcolm Crossley
  0 siblings, 2 replies; 59+ messages in thread
From: Yu, Zhang @ 2014-12-09 10:37 UTC (permalink / raw)
  To: Paul Durrant, Keir (Xen.org), Tim (Xen.org), JBeulich@suse.com,
	Kevin Tian, Xen-devel@lists.xen.org



On 12/9/2014 6:19 PM, Paul Durrant wrote:
> I think use of an raw mfn value currently works only because dom0 is using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do you really need raw mfn values?
Thanks for your quick response, Paul.
Well, not exactly for this case. :)
In XenGT, we need to translate gfn to mfn for the GPU's page table,
which contains the translation between graphics addresses and memory
addresses. This page table is maintained by the GPU drivers, and our
service domain needs a way to translate the guest physical addresses
written by the vGPU into host physical ones.
We do not use the IOMMU in XenGT, and therefore this translation may not
necessarily be a 1:1 mapping.

B.R.
Yu


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:37   ` Yu, Zhang
@ 2014-12-09 10:50     ` Jan Beulich
  2014-12-10  1:07       ` Tian, Kevin
  2014-12-09 10:51     ` Malcolm Crossley
  1 sibling, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-09 10:50 UTC (permalink / raw)
  To: Zhang Yu
  Cc: Tim (Xen.org), Kevin Tian, Paul Durrant, Keir (Xen.org),
	Xen-devel@lists.xen.org

>>> On 09.12.14 at 11:37, <yu.c.zhang@linux.intel.com> wrote:
> On 12/9/2014 6:19 PM, Paul Durrant wrote:
>> I think use of an raw mfn value currently works only because dom0 is using a
>> 1:1 IOMMU mapping scheme. Is my understanding correct, or do you really need
>> raw mfn values?
> Thanks for your quick response, Paul.
> Well, not exactly for this case. :)
> In XenGT, our need to translate gfn to mfn is for GPU's page table, 
> which contains the translation between graphic address and the memory 
> address. This page table is maintained by GPU drivers, and our service 
> domain need to have a method to translate the guest physical addresses 
> written by the vGPU into host physical ones.
> We do not use IOMMU in XenGT and therefore this translation may not 
> necessarily be a 1:1 mapping.

Hmm, that suggests you indeed need raw MFNs, which in turn seems
problematic wrt PVH Dom0 (or you'd need a GFN->GMFN translation
layer). But while you don't use the IOMMU yourself, I suppose the GPU
accesses still don't bypass the IOMMU? In which case all you'd need
returned is a frame number that guarantees that after IOMMU
translation it refers to the correct MFN, i.e. still allowing for your Dom0
driver to simply set aside a part of its PFN space, asking Xen to
(IOMMU-)map the necessary guest frames into there.

Jan


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:50     ` Jan Beulich
@ 2014-12-10  1:07       ` Tian, Kevin
  2014-12-10  8:39         ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10  1:07 UTC (permalink / raw)
  To: Jan Beulich, Zhang Yu
  Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org),
	Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Tuesday, December 09, 2014 6:50 PM
> 
> >>> On 09.12.14 at 11:37, <yu.c.zhang@linux.intel.com> wrote:
> > On 12/9/2014 6:19 PM, Paul Durrant wrote:
> >> I think use of an raw mfn value currently works only because dom0 is using
> >> a 1:1 IOMMU mapping scheme. Is my understanding correct, or do you really
> >> need raw mfn values?
> > Thanks for your quick response, Paul.
> > Well, not exactly for this case. :)
> > In XenGT, our need to translate gfn to mfn is for GPU's page table,
> > which contains the translation between graphic address and the memory
> > address. This page table is maintained by GPU drivers, and our service
> > domain need to have a method to translate the guest physical addresses
> > written by the vGPU into host physical ones.
> > We do not use IOMMU in XenGT and therefore this translation may not
> > necessarily be a 1:1 mapping.
> 
> Hmm, that suggests you indeed need raw MFNs, which in turn seems
> problematic wrt PVH Dom0 (or you'd need a GFN->GMFN translation
> layer). But while you don't use the IOMMU yourself, I suppose the GPU
> accesses still don't bypass the IOMMU? In which case all you'd need
> returned is a frame number that guarantees that after IOMMU
> translation it refers to the correct MFN, i.e. still allowing for your Dom0
> driver to simply set aside a part of its PFN space, asking Xen to
> (IOMMU-)map the necessary guest frames into there.
> 

No. What we require is the raw MFNs. One IOMMU device entry can't
point to multiple VMs' page tables, which is why XenGT needs to use
a software shadow GPU page table to implement the sharing. Note it's
not for dom0 to access the MFN. It's for dom0 to set up the correct
shadow GPU page table, so a VM can access the graphics memory
in a controlled way.

Thanks
Kevin


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  1:07       ` Tian, Kevin
@ 2014-12-10  8:39         ` Jan Beulich
  2014-12-10  8:47           ` Tian, Kevin
  2014-12-10  8:50           ` Tian, Kevin
  0 siblings, 2 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-10  8:39 UTC (permalink / raw)
  To: Kevin Tian, Zhang Yu
  Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org),
	Xen-devel@lists.xen.org

>>> On 10.12.14 at 02:07, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Tuesday, December 09, 2014 6:50 PM
>> 
>> >>> On 09.12.14 at 11:37, <yu.c.zhang@linux.intel.com> wrote:
>> > On 12/9/2014 6:19 PM, Paul Durrant wrote:
>> >> I think use of an raw mfn value currently works only because dom0 is using
>> >> a 1:1 IOMMU mapping scheme. Is my understanding correct, or do you really
>> >> need raw mfn values?
>> > Thanks for your quick response, Paul.
>> > Well, not exactly for this case. :)
>> > In XenGT, our need to translate gfn to mfn is for GPU's page table,
>> > which contains the translation between graphic address and the memory
>> > address. This page table is maintained by GPU drivers, and our service
>> > domain need to have a method to translate the guest physical addresses
>> > written by the vGPU into host physical ones.
>> > We do not use IOMMU in XenGT and therefore this translation may not
>> > necessarily be a 1:1 mapping.
>> 
>> Hmm, that suggests you indeed need raw MFNs, which in turn seems
>> problematic wrt PVH Dom0 (or you'd need a GFN->GMFN translation
>> layer). But while you don't use the IOMMU yourself, I suppose the GPU
>> accesses still don't bypass the IOMMU? In which case all you'd need
>> returned is a frame number that guarantees that after IOMMU
>> translation it refers to the correct MFN, i.e. still allowing for your Dom0
>> driver to simply set aside a part of its PFN space, asking Xen to
>> (IOMMU-)map the necessary guest frames into there.
>> 
> 
> No. What we require is the raw MFNs. One IOMMU device entry can't
> point to multiple VM's page tables, so that's why XenGT needs to use
> software shadow GPU page table to implement the sharing. Note it's
> not for dom0 to access the MFN. It's for dom0 to setup the correct
> shadow GPU page table, so a VM can access the graphics memory
> in a controlled way.

So what's the translation flow here: driver -> GPU -> IOMMU ->
hardware or driver -> IOMMU -> GPU -> hardware? Or do things get
set up for the GPU to bypass the IOMMU altogether?

Jan


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  8:39         ` Jan Beulich
@ 2014-12-10  8:47           ` Tian, Kevin
  2014-12-10  9:16             ` Jan Beulich
  2014-12-10  8:50           ` Tian, Kevin
  1 sibling, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10  8:47 UTC (permalink / raw)
  To: Jan Beulich, Zhang Yu
  Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org),
	Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, December 10, 2014 4:39 PM
> 
> >>> On 10.12.14 at 02:07, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> Sent: Tuesday, December 09, 2014 6:50 PM
> >>
> >> >>> On 09.12.14 at 11:37, <yu.c.zhang@linux.intel.com> wrote:
> >> > On 12/9/2014 6:19 PM, Paul Durrant wrote:
> >> >> I think use of an raw mfn value currently works only because dom0 is
> >> >> using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do you
> >> >> really need raw mfn values?
> >> > Thanks for your quick response, Paul.
> >> > Well, not exactly for this case. :)
> >> > In XenGT, our need to translate gfn to mfn is for GPU's page table,
> >> > which contains the translation between graphic address and the memory
> >> > address. This page table is maintained by GPU drivers, and our service
> >> > domain need to have a method to translate the guest physical addresses
> >> > written by the vGPU into host physical ones.
> >> > We do not use IOMMU in XenGT and therefore this translation may not
> >> > necessarily be a 1:1 mapping.
> >>
> >> Hmm, that suggests you indeed need raw MFNs, which in turn seems
> >> problematic wrt PVH Dom0 (or you'd need a GFN->GMFN translation
> >> layer). But while you don't use the IOMMU yourself, I suppose the GPU
> >> accesses still don't bypass the IOMMU? In which case all you'd need
> >> returned is a frame number that guarantees that after IOMMU
> >> translation it refers to the correct MFN, i.e. still allowing for your Dom0
> >> driver to simply set aside a part of its PFN space, asking Xen to
> >> (IOMMU-)map the necessary guest frames into there.
> >>
> >
> > No. What we require is the raw MFNs. One IOMMU device entry can't
> > point to multiple VM's page tables, so that's why XenGT needs to use
> > software shadow GPU page table to implement the sharing. Note it's
> > not for dom0 to access the MFN. It's for dom0 to setup the correct
> > shadow GPU page table, so a VM can access the graphics memory
> > in a controlled way.
> 
> So what's the translation flow here: driver -> GPU -> IOMMU ->
> hardware or driver -> IOMMU -> GPU -> hardware? Or do things get
> set up for the GPU to bypass the IOMMU altogether?
> 

Two translation paths in the assigned case:

1. [direct CPU access from VM], with a partitioned PCI aperture
resource, every VM can access a portion of the PCI aperture directly.

- CPU page table/EPT: CPU virtual address->PCI aperture
- PCI aperture - bar base = Graphics Memory Address (GMA)
- GPU page table: GMA -> GPA (as programmed by guest)
- IOMMU: GPA -> MPA

2. [GPU access through GPU command operands], with GPU scheduling,
every VM's command buffer will be fetched by GPU in a time-shared
manner.

- GPU page table: GMA->GPA
- IOMMU: GPA->MPA

In our case, the IOMMU is set up with a 1:1 identity table for dom0. So
when the GPU may access GPAs from different VMs, we can't count on the
IOMMU, which can only serve one mapping for one device (unless
we have SR-IOV).

That's why we need a shadow GPU page table in dom0, and need a
p2m query call to translate from GPA -> MPA:

- shadow GPU page table: GMA->MPA
- IOMMU: MPA->MPA (for dom0)
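
As an illustration of the shadowing described above, the sketch below
composes a guest GPU page table (GMA->GPA) with that guest's p2m (GPA->MPA)
into a shadow table (GMA->MPA), then walks it through an identity IOMMU. It
is a toy Python model: the function names, the 4 KiB page size and all frame
numbers are assumptions, not XenGT code.

```python
PAGE_SHIFT = 12  # assume 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def build_shadow(guest_gpu_pt, p2m):
    """Compose GMA->GPA with GPA->MPA into a shadow GMA->MPA table.

    Entries whose GPA has no p2m translation are left out of the shadow.
    """
    return {gma: p2m[gpa] for gma, gpa in guest_gpu_pt.items() if gpa in p2m}


def gpu_access(shadow, iommu, gma_addr):
    """Translate a graphics memory byte address: shadow table, then IOMMU."""
    frame, offset = gma_addr >> PAGE_SHIFT, gma_addr & PAGE_MASK
    bus_frame = shadow[frame]
    return (iommu(bus_frame) << PAGE_SHIFT) | offset


def identity_iommu(frame):
    """dom0's 1:1 IOMMU table: MPA -> MPA."""
    return frame


# Hypothetical guest state, expressed in frame numbers.
guest_gpu_pt = {0x0: 0x100, 0x1: 0x101, 0x2: 0x555}  # GMA -> GPA
guest_p2m = {0x100: 0x7F000, 0x101: 0x7F123}         # GPA -> MPA

shadow = build_shadow(guest_gpu_pt, guest_p2m)
machine_addr = gpu_access(shadow, identity_iommu, 0x1ABC)
```

The GPA->MPA step inside build_shadow is exactly where dom0 would need the
p2m query call.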

Thanks
Kevin


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  8:47           ` Tian, Kevin
@ 2014-12-10  9:16             ` Jan Beulich
  2014-12-10  9:51               ` Tian, Kevin
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-10  9:16 UTC (permalink / raw)
  To: Kevin Tian, Zhang Yu
  Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org),
	Xen-devel@lists.xen.org

>>> On 10.12.14 at 09:47, <kevin.tian@intel.com> wrote:
> two translation paths in assigned case:
> 
> 1. [direct CPU access from VM], with partitioned PCI aperture
> resource, every VM can access a portion of PCI aperture directly.
> 
> - CPU page table/EPT: CPU virtual address->PCI aperture
> - PCI aperture - bar base = Graphics Memory Address (GMA)
> - GPU page table: GMA -> GPA (as programmed by guest)
> - IOMMU: GPA -> MPA
> 
> 2. [GPU access through GPU command operands], with GPU scheduling,
> every VM's command buffer will be fetched by GPU in a time-shared
> manner.
> 
> - GPU page table: GMA->GPA
> - IOMMU: GPA->MPA
> 
> In our case, IOMMU is setup with 1:1 identity table for dom0. So 
> when GPU may access GPAs from different VMs, we can't count on
> IOMMU which can only serve one mapping for one device (unless 
> we have SR-IOV). 
> 
> That's why we need shadow GPU page table in dom0, and need a
> p2m query call to translate from GPA -> MPA:
> 
> - shadow GPU page table: GMA->MPA
> - IOMMU: MPA->MPA (for dom0)

I still can't see why the Dom0 translation has to remain 1:1, i.e.
why Xen couldn't return some "arbitrary" GPA for the query in
question here, setting up a suitable GPA->MPA translation. (I put
arbitrary in quotes because this of course must not conflict with
GPAs already or possibly in use by Dom0.) And I can only stress
again that you shouldn't leave out PVH (where the IOMMU already
isn't set up with all 1:1 mappings) from these considerations.

Jan


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  9:16             ` Jan Beulich
@ 2014-12-10  9:51               ` Tian, Kevin
  2014-12-10 10:07                 ` Jan Beulich
  2014-12-10 11:04                 ` Malcolm Crossley
  0 siblings, 2 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10  9:51 UTC (permalink / raw)
  To: Jan Beulich, Zhang Yu
  Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org),
	Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, December 10, 2014 5:17 PM
> 
> >>> On 10.12.14 at 09:47, <kevin.tian@intel.com> wrote:
> > two translation paths in assigned case:
> >
> > 1. [direct CPU access from VM], with partitioned PCI aperture
> > resource, every VM can access a portion of PCI aperture directly.
> >
> > - CPU page table/EPT: CPU virtual address->PCI aperture
> > - PCI aperture - bar base = Graphics Memory Address (GMA)
> > - GPU page table: GMA -> GPA (as programmed by guest)
> > - IOMMU: GPA -> MPA
> >
> > 2. [GPU access through GPU command operands], with GPU scheduling,
> > every VM's command buffer will be fetched by GPU in a time-shared
> > manner.
> >
> > - GPU page table: GMA->GPA
> > - IOMMU: GPA->MPA
> >
> > In our case, IOMMU is setup with 1:1 identity table for dom0. So
> > when GPU may access GPAs from different VMs, we can't count on
> > IOMMU which can only serve one mapping for one device (unless
> > we have SR-IOV).
> >
> > That's why we need shadow GPU page table in dom0, and need a
> > p2m query call to translate from GPA -> MPA:
> >
> > - shadow GPU page table: GMA->MPA
> > - IOMMU: MPA->MPA (for dom0)
> 
> I still can't see why the Dom0 translation has to remain 1:1, i.e.
> why Xen couldn't return some "arbitrary" GPA for the query in
> question here, setting up a suitable GPA->MPA translation. (I put
> arbitrary in quotes because this of course must not conflict with
> GPAs already or possibly in use by Dom0.) And I can only stress
> again that you shouldn't leave out PVH (where the IOMMU already
> isn't set up with all 1:1 mappings) from these considerations.
> 

It's interesting that you think the IOMMU can be used in such a situation.

What do you mean by "arbitrary" GPA here? And it's not just about
conflicting with Dom0's GPAs, it's about conflicts among all VMs' GPAs
when you host them through one IOMMU page table, and there's
no way to prevent this for certain, since GPAs are picked by the VMs
themselves.
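
To make the collision concrete, here is a toy Python sketch (the helper name
and the frame numbers are made up for illustration) that tries to place two
VMs' GPA->MPA entries into a single table keyed by GPA, as one IOMMU page
table for one device would require:

```python
def merge_into_one_table(per_vm_p2m):
    """Try to hold every VM's GPA->MPA entries in one table keyed by GPA.

    Returns the merged table and a list of (domid, gpa) entries that could
    not be inserted because another VM already uses that GPA differently.
    """
    table, conflicts = {}, []
    for domid, p2m in sorted(per_vm_p2m.items()):
        for gpa, mpa in sorted(p2m.items()):
            if gpa in table and table[gpa] != mpa:
                conflicts.append((domid, gpa))
            else:
                table[gpa] = mpa
    return table, conflicts


# Both VMs independently chose GPA frame 0x100 for different machine frames.
per_vm_p2m = {
    1: {0x100: 0x7F000, 0x101: 0x7F001},
    2: {0x100: 0x80000, 0x200: 0x80001},
}

table, conflicts = merge_into_one_table(per_vm_p2m)
```

Here VM 2's entry for frame 0x100 cannot coexist with VM 1's, which is the
situation described above.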

I don't think we can support PVH here if the IOMMU is not 1:1 mapped.

Thanks
Kevin


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  9:51               ` Tian, Kevin
@ 2014-12-10 10:07                 ` Jan Beulich
  2014-12-10 11:04                 ` Malcolm Crossley
  1 sibling, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-10 10:07 UTC (permalink / raw)
  To: Kevin Tian, Zhang Yu
  Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org),
	Xen-devel@lists.xen.org

>>> On 10.12.14 at 10:51, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, December 10, 2014 5:17 PM
>> 
>> >>> On 10.12.14 at 09:47, <kevin.tian@intel.com> wrote:
>> > two translation paths in assigned case:
>> >
>> > 1. [direct CPU access from VM], with partitioned PCI aperture
>> > resource, every VM can access a portion of PCI aperture directly.
>> >
>> > - CPU page table/EPT: CPU virtual address->PCI aperture
>> > - PCI aperture - bar base = Graphics Memory Address (GMA)
>> > - GPU page table: GMA -> GPA (as programmed by guest)
>> > - IOMMU: GPA -> MPA
>> >
>> > 2. [GPU access through GPU command operands], with GPU scheduling,
>> > every VM's command buffer will be fetched by GPU in a time-shared
>> > manner.
>> >
>> > - GPU page table: GMA->GPA
>> > - IOMMU: GPA->MPA
>> >
>> > In our case, IOMMU is setup with 1:1 identity table for dom0. So
>> > when GPU may access GPAs from different VMs, we can't count on
>> > IOMMU which can only serve one mapping for one device (unless
>> > we have SR-IOV).
>> >
>> > That's why we need shadow GPU page table in dom0, and need a
>> > p2m query call to translate from GPA -> MPA:
>> >
>> > - shadow GPU page table: GMA->MPA
>> > - IOMMU: MPA->MPA (for dom0)
>> 
>> I still can't see why the Dom0 translation has to remain 1:1, i.e.
>> why Xen couldn't return some "arbitrary" GPA for the query in
>> question here, setting up a suitable GPA->MPA translation. (I put
>> arbitrary in quotes because this of course must not conflict with
>> GPAs already or possibly in use by Dom0.) And I can only stress
>> again that you shouldn't leave out PVH (where the IOMMU already
>> isn't set up with all 1:1 mappings) from these considerations.
>> 
> 
> It's interesting that you think IOMMU can be used in such situation.
> 
> what do you mean by "arbitrary" GPA here? and It's not just about 
> conflicting with Dom0's GPA, it's about confliction in all VM's GPAs 
> when you hosting them through one IOMMU page table, and there's 
> no way to prevent this definitely since GPAs are picked by VMs 
> themselves.

As long as for the involved DomU-s the physical address comes in
ways similar to PCI device BARs (which they're capable of dealing with),
that's not a problem imo. For Dom0, just like BARs may get assigned
while bringing up PCI devices, a "virtual" BAR could be invented here.

> I don't think we can support PVH here if IOMMU is not 1:1 mapping.

That would make XenGT quite a bit less useful going forward. But
otoh don't you only care about certain MMIO regions to be 1:1
mapped? That's the case for PVH Dom0 too, iirc.

Jan


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  9:51               ` Tian, Kevin
  2014-12-10 10:07                 ` Jan Beulich
@ 2014-12-10 11:04                 ` Malcolm Crossley
  1 sibling, 0 replies; 59+ messages in thread
From: Malcolm Crossley @ 2014-12-10 11:04 UTC (permalink / raw)
  To: xen-devel

On 10/12/14 09:51, Tian, Kevin wrote:
>> From: Jan Beulich [mailto:JBeulich@suse.com]
>> Sent: Wednesday, December 10, 2014 5:17 PM
>>
>>>>> On 10.12.14 at 09:47, <kevin.tian@intel.com> wrote:
>>> two translation paths in assigned case:
>>>
>>> 1. [direct CPU access from VM], with partitioned PCI aperture
>>> resource, every VM can access a portion of PCI aperture directly.
>>>
>>> - CPU page table/EPT: CPU virtual address->PCI aperture
>>> - PCI aperture - bar base = Graphics Memory Address (GMA)
>>> - GPU page table: GMA -> GPA (as programmed by guest)
>>> - IOMMU: GPA -> MPA
>>>
>>> 2. [GPU access through GPU command operands], with GPU scheduling,
>>> every VM's command buffer will be fetched by GPU in a time-shared
>>> manner.
>>>
>>> - GPU page table: GMA->GPA
>>> - IOMMU: GPA->MPA
>>>
>>> In our case, IOMMU is setup with 1:1 identity table for dom0. So
>>> when GPU may access GPAs from different VMs, we can't count on
>>> IOMMU which can only serve one mapping for one device (unless
>>> we have SR-IOV).
>>>
>>> That's why we need shadow GPU page table in dom0, and need a
>>> p2m query call to translate from GPA -> MPA:
>>>
>>> - shadow GPU page table: GMA->MPA
>>> - IOMMU: MPA->MPA (for dom0)
>>
>> I still can't see why the Dom0 translation has to remain 1:1, i.e.
>> why Xen couldn't return some "arbitrary" GPA for the query in
>> question here, setting up a suitable GPA->MPA translation. (I put
>> arbitrary in quotes because this of course must not conflict with
>> GPAs already or possibly in use by Dom0.) And I can only stress
>> again that you shouldn't leave out PVH (where the IOMMU already
>> isn't set up with all 1:1 mappings) from these considerations.
>>
> 
> It's interesting that you think IOMMU can be used in such situation.
> 
> what do you mean by "arbitrary" GPA here? and It's not just about 
> conflicting with Dom0's GPA, it's about confliction in all VM's GPAs 
> when you hosting them through one IOMMU page table, and there's 
> no way to prevent this definitely since GPAs are picked by VMs 
> themselves.
> 
> I don't think we can support PVH here if IOMMU is not 1:1 mapping.
> 

I agree with Jan: there doesn't need to be a fixed 1:1 mapping between
IOMMU addresses and MFNs.

I think all that's required is that there is an IOMMU mapping for the
GPU device connected to dom0 (or a driver domain) which allows guest
memory to be accessed by the GPU. This IOMMU address is what is
programmed into the shadow GPU page table; I refer to this address as a
bus frame number (BFN) in the PV IOMMU design document.

- shadow GPU page table: GMA->BFN
- IOMMU: BFN->MPA

IOMMUs can almost always address more than the host physical RAM, so we
can create IOMMU mappings above the top of host physical RAM in order to
have IOMMU mappings of guest RAM.

The PV-IOMMU design allows the guest to have control of the IOMMU
address space. In theory it could be extended to have permission checks
for mapping guest MFNs and to have a mapping interface which takes a domid
and a GMFN. That way the driver domain does not need to know the actual
MFNs being used.
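
A rough sketch of that idea, as a toy Python model rather than the PV-IOMMU
interface itself: the class and method names, the host RAM size and the
frame numbers are all assumptions. BFNs are handed out above the top of host
RAM, and the caller identifies guest memory only by (domid, gfn).

```python
class BfnSpace:
    """Toy bus-frame-number space for the device owned by the driver domain."""

    def __init__(self, max_host_pfn):
        self.next_bfn = max_host_pfn + 1  # allocate above top of host RAM
        self.iommu = {}                   # BFN -> MFN, maintained by "Xen"
        self.by_guest = {}                # (domid, gfn) -> BFN

    def map_guest_frame(self, domid, gfn, p2m):
        """Map one guest frame for device access and return its BFN.

        The gfn -> mfn lookup happens in here, standing in for the
        hypervisor side, so the caller never sees the MFN.
        """
        key = (domid, gfn)
        if key not in self.by_guest:
            bfn = self.next_bfn
            self.next_bfn += 1
            self.iommu[bfn] = p2m[domid][gfn]
            self.by_guest[key] = bfn
        return self.by_guest[key]


# Hypothetical host with frames 0..0xFFFFF and two guests that both use
# guest frame 0x100.
p2m = {1: {0x100: 0x7F000}, 2: {0x100: 0x80000}}
space = BfnSpace(max_host_pfn=0xFFFFF)

bfn_vm1 = space.map_guest_frame(1, 0x100, p2m)
bfn_vm2 = space.map_guest_frame(2, 0x100, p2m)
```

Because each (domid, gfn) pair gets its own BFN, two guests using the same
guest frame number no longer collide in the single IOMMU table, and the
shadow GPU page table would hold GMA->BFN entries.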

The guest itself (CPU) accesses the GPU via outbound MMIO mappings so we
don't need to be concerned with address translation in that direction.

I think getting Xen to allocate IOMMU mappings for a driver domain will
be problematic for PV-based driver domains, because the M2P for PV
domains is not kept strictly up to date with what the guest is using for
its P2M, so it will be difficult or impossible to determine which addresses
are not in use.

Similarly, it may be difficult for HVM guests, because P2M mappings are
outbound (CPU to rest of host) and determining which addresses are
suitable for inbound access (rest of host to memory) may be difficult.
I.e., should the outbound MMIO address space be used for inbound IOMMU mappings?

I hope I've not caused more confusion.

Malcolm

> Thanks
> Kevin
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel
> 


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  8:39         ` Jan Beulich
  2014-12-10  8:47           ` Tian, Kevin
@ 2014-12-10  8:50           ` Tian, Kevin
  1 sibling, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10  8:50 UTC (permalink / raw)
  To: Jan Beulich, Zhang Yu
  Cc: Tim (Xen.org), Paul Durrant, Keir (Xen.org),
	Xen-devel@lists.xen.org

> From: Tian, Kevin
> Sent: Wednesday, December 10, 2014 4:48 PM
> 
> > From: Jan Beulich [mailto:JBeulich@suse.com]
> > Sent: Wednesday, December 10, 2014 4:39 PM
> >
> > >>> On 10.12.14 at 02:07, <kevin.tian@intel.com> wrote:
> > >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> > >> Sent: Tuesday, December 09, 2014 6:50 PM
> > >>
> > >> >>> On 09.12.14 at 11:37, <yu.c.zhang@linux.intel.com> wrote:
> > >> > On 12/9/2014 6:19 PM, Paul Durrant wrote:
> > >> >> I think use of an raw mfn value currently works only because dom0 is
> > >> >> using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do you
> > >> >> really need raw mfn values?
> > >> > Thanks for your quick response, Paul.
> > >> > Well, not exactly for this case. :)
> > >> > In XenGT, our need to translate gfn to mfn is for GPU's page table,
> > >> > which contains the translation between graphic address and the memory
> > >> > address. This page table is maintained by GPU drivers, and our service
> > >> > domain need to have a method to translate the guest physical addresses
> > >> > written by the vGPU into host physical ones.
> > >> > We do not use IOMMU in XenGT and therefore this translation may not
> > >> > necessarily be a 1:1 mapping.
> > >>
> > >> Hmm, that suggests you indeed need raw MFNs, which in turn seems
> > >> problematic wrt PVH Dom0 (or you'd need a GFN->GMFN translation
> > >> layer). But while you don't use the IOMMU yourself, I suppose the GPU
> > >> accesses still don't bypass the IOMMU? In which case all you'd need
> > >> returned is a frame number that guarantees that after IOMMU
> > >> translation it refers to the correct MFN, i.e. still allowing for your Dom0
> > >> driver to simply set aside a part of its PFN space, asking Xen to
> > >> (IOMMU-)map the necessary guest frames into there.
> > >>
> > >
> > > No. What we require is the raw MFNs. One IOMMU device entry can't
> > > point to multiple VM's page tables, so that's why XenGT needs to use
> > > software shadow GPU page table to implement the sharing. Note it's
> > > not for dom0 to access the MFN. It's for dom0 to setup the correct
> > > shadow GPU page table, so a VM can access the graphics memory
> > > in a controlled way.
> >
> > So what's the translation flow here: driver -> GPU -> IOMMU ->
> > hardware or driver -> IOMMU -> GPU -> hardware? Or do things get
> > set up for the GPU to bypass the IOMMU altogether?
> >
> 
> two translation paths in assigned case:
> 
> 1. [direct CPU access from VM], with partitioned PCI aperture
> resource, every VM can access a portion of PCI aperture directly.

Sorry, the above description is for the XenGT shared case, and the
translation below is for the VT-d assigned case. It is included only to
show that XenGT needs the same translation path.

> 
> - CPU page table/EPT: CPU virtual address->PCI aperture
> - PCI aperture - bar base = Graphics Memory Address (GMA)
> - GPU page table: GMA -> GPA (as programmed by guest)
> - IOMMU: GPA -> MPA
> 
> 2. [GPU access through GPU command operands], with GPU scheduling,
> every VM's command buffer will be fetched by GPU in a time-shared
> manner.
> 
> - GPU page table: GMA->GPA
> - IOMMU: GPA->MPA
> 
> In our case, IOMMU is setup with 1:1 identity table for dom0. So
> when GPU may access GPAs from different VMs, we can't count on
> IOMMU which can only serve one mapping for one device (unless
> we have SR-IOV).
> 
> That's why we need shadow GPU page table in dom0, and need a
> p2m query call to translate from GPA -> MPA:
> 
> - shadow GPU page table: GMA->MPA
> - IOMMU: MPA->MPA (for dom0)
> 
> Thanks
> Kevin


* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:37   ` Yu, Zhang
  2014-12-09 10:50     ` Jan Beulich
@ 2014-12-09 10:51     ` Malcolm Crossley
  2014-12-10  1:22       ` Tian, Kevin
  1 sibling, 1 reply; 59+ messages in thread
From: Malcolm Crossley @ 2014-12-09 10:51 UTC (permalink / raw)
  To: xen-devel

On 09/12/14 10:37, Yu, Zhang wrote:
> 
> 
> On 12/9/2014 6:19 PM, Paul Durrant wrote:
>> I think use of an raw mfn value currently works only because dom0 is
>> using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do
>> you really need raw mfn values?
> Thanks for your quick response, Paul.
> Well, not exactly for this case. :)
> In XenGT, our need to translate gfn to mfn is for GPU's page table,
> which contains the translation between graphic address and the memory
> address. This page table is maintained by GPU drivers, and our service
> domain need to have a method to translate the guest physical addresses
> written by the vGPU into host physical ones.
> We do not use IOMMU in XenGT and therefore this translation may not
> necessarily be a 1:1 mapping.

XenGT must use the IOMMU mappings that Xen has set up for the domain
which owns the GPU. Currently Dom0 owns the GPU and so its IOMMU
mappings match the MFN addresses. I suspect XenGT will not work if Xen
is booted with iommu=dom0-strict.

Malcolm

> 
> B.R.
> Yu
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:51     ` Malcolm Crossley
@ 2014-12-10  1:22       ` Tian, Kevin
  0 siblings, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10  1:22 UTC (permalink / raw)
  To: Malcolm Crossley, xen-devel@lists.xen.org

> From: Malcolm Crossley
> Sent: Tuesday, December 09, 2014 6:52 PM
> 
> On 09/12/14 10:37, Yu, Zhang wrote:
> >
> >
> > On 12/9/2014 6:19 PM, Paul Durrant wrote:
> >> I think use of an raw mfn value currently works only because dom0 is
> >> using a 1:1 IOMMU mapping scheme. Is my understanding correct, or do
> >> you really need raw mfn values?
> > Thanks for your quick response, Paul.
> > Well, not exactly for this case. :)
> > In XenGT, our need to translate gfn to mfn is for GPU's page table,
> > which contains the translation between graphic address and the memory
> > address. This page table is maintained by GPU drivers, and our service
> > domain need to have a method to translate the guest physical addresses
> > written by the vGPU into host physical ones.
> > We do not use IOMMU in XenGT and therefore this translation may not
> > necessarily be a 1:1 mapping.
> 
> XenGT must use the IOMMU mappings that Xen has setup for the domain
> which owns the GPU. Currently Dom0 own's the GPU and so it's IOMMU
> mappings match the MFN's addresses. I suspect XenGT will not work if Xen
> is booted with iommu=dom0-strict.
> 

This is a good point. So yes, in this case the IOMMU is still active, with
a 1:1 mapping table, but that is separate from the interface discussed
here, which is about setting up a shadow GPU page table for other VMs'
graphics memory accesses.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:10 One question about the hypercall to translate gfn to mfn Yu, Zhang
  2014-12-09 10:19 ` Paul Durrant
@ 2014-12-09 10:38 ` Jan Beulich
  2014-12-09 10:46 ` Tim Deegan
  2 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-09 10:38 UTC (permalink / raw)
  To: Zhang Yu; +Cc: tim, kevin.tian, Paul.Durrant, keir, Xen-devel

>>> On 09.12.14 at 11:10, <yu.c.zhang@linux.intel.com> wrote:
>    As you can see, we are pushing our XenGT patches to the upstream. One 
> feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 
> device model.
> 
>    Here we may have 2 similar solutions:
>    1> Paul told me(and thank you, Paul :)) that there used to be a 
> hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in 
> commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no 
> usage at that time. So solution 1 is to revert this commit. However, 
> since this hypercall was removed ages ago, the reverting met many 
> conflicts, i.e. the gmfn_to_mfn is no longer used in x86, etc.
> 
>    2> In our project, we defined a new hypercall 
> XENMEM_get_mfn_from_pfn, which has a similar implementation like the 
> previous XENMEM_translate_gpfn_list. One of the major differences is 
> that this newly defined one is only for x86(called in arch_memory_op), 
> so we do not have to worry about the arm side.
> 
>    Does anyone has any suggestions about this?

Out of the two 1 seems preferable. But without background (see also
Paul's reply) it's hard to tell whether that's what you want/need.

Jan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:10 One question about the hypercall to translate gfn to mfn Yu, Zhang
  2014-12-09 10:19 ` Paul Durrant
  2014-12-09 10:38 ` Jan Beulich
@ 2014-12-09 10:46 ` Tim Deegan
  2014-12-09 11:05   ` Paul Durrant
  2014-12-10  1:14   ` Tian, Kevin
  2 siblings, 2 replies; 59+ messages in thread
From: Tim Deegan @ 2014-12-09 10:46 UTC (permalink / raw)
  To: Yu, Zhang; +Cc: kevin.tian, Paul.Durrant, keir, JBeulich, Xen-devel

At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> Hi all,
> 
>    As you can see, we are pushing our XenGT patches to the upstream. One 
> feature we need in xen is to translate guests' gfn to mfn in XenGT dom0 
> device model.
> 
>    Here we may have 2 similar solutions:
>    1> Paul told me(and thank you, Paul :)) that there used to be a 
> hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in 
> commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no 
> usage at that time.

It's been suggested before that we should revive this hypercall, and I
don't think it's a good idea.  Whenever a domain needs to know the
actual MFN of another domain's memory it's usually because the
security model is problematic.  In particular, finding the MFN is
usually followed by a brute-force mapping from a dom0 process, or by
passing the MFN to a device for unprotected DMA.

These days DMA access should be protected by IOMMUs, or else
the device drivers (and associated tools) are effectively inside the
hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
presumably present on anything new enough to run XenGT?).

So I think the interface we need here is a please-map-this-gfn one,
like the existing grant-table ops (which already do what you need by
returning an address suitable for DMA).  If adding a grant entry for
every frame of the framebuffer within the guest is too much, maybe we
can make a new interface for the guest to grant access to larger areas.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:46 ` Tim Deegan
@ 2014-12-09 11:05   ` Paul Durrant
  2014-12-09 11:11     ` Ian Campbell
  2014-12-10  1:14   ` Tian, Kevin
  1 sibling, 1 reply; 59+ messages in thread
From: Paul Durrant @ 2014-12-09 11:05 UTC (permalink / raw)
  To: Tim (Xen.org), Yu, Zhang
  Cc: Kevin Tian, Keir (Xen.org), JBeulich@suse.com,
	Xen-devel@lists.xen.org

> -----Original Message-----
> From: Tim Deegan [mailto:tim@xen.org]
> Sent: 09 December 2014 10:47
> To: Yu, Zhang
> Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen-
> devel@lists.xen.org
> Subject: Re: One question about the hypercall to translate gfn to mfn.
> 
> At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > Hi all,
> >
> >    As you can see, we are pushing our XenGT patches to the upstream. One
> > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > device model.
> >
> >    Here we may have 2 similar solutions:
> >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was
> no
> > usage at that time.
> 
> It's been suggested before that we should revive this hypercall, and I
> don't think it's a good idea.  Whenever a domain needs to know the
> actual MFN of another domain's memory it's usually because the
> security model is problematic.  In particular, finding the MFN is
> usually followed by a brute-force mapping from a dom0 process, or by
> passing the MFN to a device for unprotected DMA.
> 
> These days DMA access should be protected by IOMMUs, or else
> the device drivers (and associated tools) are effectively inside the
> hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
> presumably present on anything new enough to run XenGT?).
> 
> So I think the interface we need here is a please-map-this-gfn one,
> like the existing grant-table ops (which already do what you need by
> returning an address suitable for DMA).  If adding a grant entry for
> every frame of the framebuffer within the guest is too much, maybe we
> can make a new interface for the guest to grant access to larger areas.
> 

IIUC the in-guest driver is Xen-unaware, so any grant entry would have to be put in the guest's table by the tools, which would entail some form of flexibly sized reserved range of grant entries; otherwise any PV drivers that are present in the guest would merrily clobber the new grant entries.
A domain can already priv map a gfn into the MMU, so I think we just need an equivalent for the IOMMU.

  Paul

> Cheers,
> 
> Tim.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 11:05   ` Paul Durrant
@ 2014-12-09 11:11     ` Ian Campbell
  2014-12-09 11:17       ` Paul Durrant
  0 siblings, 1 reply; 59+ messages in thread
From: Ian Campbell @ 2014-12-09 11:11 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Kevin Tian, Keir (Xen.org), Tim (Xen.org),
	Xen-devel@lists.xen.org, Yu, Zhang, JBeulich@suse.com

On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote:
> > -----Original Message-----
> > From: Tim Deegan [mailto:tim@xen.org]
> > Sent: 09 December 2014 10:47
> > To: Yu, Zhang
> > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen-
> > devel@lists.xen.org
> > Subject: Re: One question about the hypercall to translate gfn to mfn.
> > 
> > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > Hi all,
> > >
> > >    As you can see, we are pushing our XenGT patches to the upstream. One
> > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > > device model.
> > >
> > >    Here we may have 2 similar solutions:
> > >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was
> > no
> > > usage at that time.
> > 
> > It's been suggested before that we should revive this hypercall, and I
> > don't think it's a good idea.  Whenever a domain needs to know the
> > actual MFN of another domain's memory it's usually because the
> > security model is problematic.  In particular, finding the MFN is
> > usually followed by a brute-force mapping from a dom0 process, or by
> > passing the MFN to a device for unprotected DMA.
> > 
> > These days DMA access should be protected by IOMMUs, or else
> > the device drivers (and associated tools) are effectively inside the
> > hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
> > presumably present on anything new enough to run XenGT?).
> > 
> > So I think the interface we need here is a please-map-this-gfn one,
> > like the existing grant-table ops (which already do what you need by
> > returning an address suitable for DMA).  If adding a grant entry for
> > every frame of the framebuffer within the guest is too much, maybe we
> > can make a new interface for the guest to grant access to larger areas.
> > 
> 
> IIUC the in-guest driver is Xen-unaware so any grant entry would have
> to be put in the guests table by the tools, which would entail some
> form of flexibly sized reserved range of grant entries otherwise any
> PV driver that are present in the guest would merrily clobber the new
> grant entries.
> A domain can already priv map a gfn into the MMU, so I think we just
>  need an equivalent for the IOMMU.

I'm not sure I'm fully understanding what's going on here, but is a
variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign which also
returns a DMA handle a plausible solution?

Ian.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 11:11     ` Ian Campbell
@ 2014-12-09 11:17       ` Paul Durrant
  2014-12-09 11:23         ` Jan Beulich
  2014-12-09 11:29         ` Ian Campbell
  0 siblings, 2 replies; 59+ messages in thread
From: Paul Durrant @ 2014-12-09 11:17 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Kevin Tian, Keir (Xen.org), Tim (Xen.org),
	Xen-devel@lists.xen.org, Yu, Zhang, JBeulich@suse.com

> -----Original Message-----
> From: Ian Campbell
> Sent: 09 December 2014 11:11
> To: Paul Durrant
> Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); JBeulich@suse.com;
> Xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] One question about the hypercall to translate gfn to
> mfn.
> 
> On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Tim Deegan [mailto:tim@xen.org]
> > > Sent: 09 December 2014 10:47
> > > To: Yu, Zhang
> > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen-
> > > devel@lists.xen.org
> > > Subject: Re: One question about the hypercall to translate gfn to mfn.
> > >
> > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > > Hi all,
> > > >
> > > >    As you can see, we are pushing our XenGT patches to the upstream.
> One
> > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > > > device model.
> > > >
> > > >    Here we may have 2 similar solutions:
> > > >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there
> was
> > > no
> > > > usage at that time.
> > >
> > > It's been suggested before that we should revive this hypercall, and I
> > > don't think it's a good idea.  Whenever a domain needs to know the
> > > actual MFN of another domain's memory it's usually because the
> > > security model is problematic.  In particular, finding the MFN is
> > > usually followed by a brute-force mapping from a dom0 process, or by
> > > passing the MFN to a device for unprotected DMA.
> > >
> > > These days DMA access should be protected by IOMMUs, or else
> > > the device drivers (and associated tools) are effectively inside the
> > > hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
> > > presumably present on anything new enough to run XenGT?).
> > >
> > > So I think the interface we need here is a please-map-this-gfn one,
> > > like the existing grant-table ops (which already do what you need by
> > > returning an address suitable for DMA).  If adding a grant entry for
> > > every frame of the framebuffer within the guest is too much, maybe we
> > > can make a new interface for the guest to grant access to larger areas.
> > >
> >
> > IIUC the in-guest driver is Xen-unaware so any grant entry would have
> > to be put in the guests table by the tools, which would entail some
> > form of flexibly sized reserved range of grant entries otherwise any
> > PV driver that are present in the guest would merrily clobber the new
> > grant entries.
> > A domain can already priv map a gfn into the MMU, so I think we just
> >  need an equivalent for the IOMMU.
> 
> I'm not sure I'm fully understanding what's going on here, but is a
> variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign which
> also
> returns a DMA handle a plausible solution?
> 

I think we want to be able to avoid setting up a PTE in the MMU since it's not needed in most (or perhaps all?) cases.

  Paul

> Ian.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 11:17       ` Paul Durrant
@ 2014-12-09 11:23         ` Jan Beulich
  2014-12-09 11:28           ` Malcolm Crossley
  2014-12-09 11:29         ` Ian Campbell
  1 sibling, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-09 11:23 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Kevin Tian, Keir (Xen.org), Ian Campbell, Tim (Xen.org),
	Xen-devel@lists.xen.org, Zhang Yu

>>> On 09.12.14 at 12:17, <Paul.Durrant@citrix.com> wrote:
> I think we want be able to avoid setting up a PTE in the MMU since it's not 
> needed in most (or perhaps all?) cases.

With shared page tables, there's no way to do one without the other.

Jan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 11:23         ` Jan Beulich
@ 2014-12-09 11:28           ` Malcolm Crossley
  0 siblings, 0 replies; 59+ messages in thread
From: Malcolm Crossley @ 2014-12-09 11:28 UTC (permalink / raw)
  To: xen-devel

On 09/12/14 11:23, Jan Beulich wrote:
>>>> On 09.12.14 at 12:17, <Paul.Durrant@citrix.com> wrote:
>> I think we want be able to avoid setting up a PTE in the MMU since it's not 
>> needed in most (or perhaps all?) cases.
> 
> With shared page tables, there's no way to do one without the other.
> 
Interestingly the IOMMU in front of the Intel GPU is only capable of
handling 4k pages and so we wouldn't end up with shared page tables being
used.

For other PCI devices, shared page tables will be a problem.

Malcolm

> Jan
> 
> 
> 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 11:17       ` Paul Durrant
  2014-12-09 11:23         ` Jan Beulich
@ 2014-12-09 11:29         ` Ian Campbell
  2014-12-09 11:43           ` Paul Durrant
  1 sibling, 1 reply; 59+ messages in thread
From: Ian Campbell @ 2014-12-09 11:29 UTC (permalink / raw)
  To: Paul Durrant
  Cc: Kevin Tian, Keir (Xen.org), Tim (Xen.org),
	Xen-devel@lists.xen.org, Yu, Zhang, JBeulich@suse.com

On Tue, 2014-12-09 at 11:17 +0000, Paul Durrant wrote:
> > -----Original Message-----
> > From: Ian Campbell
> > Sent: 09 December 2014 11:11
> > To: Paul Durrant
> > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); JBeulich@suse.com;
> > Xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] One question about the hypercall to translate gfn to
> > mfn.
> > 
> > On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: Tim Deegan [mailto:tim@xen.org]
> > > > Sent: 09 December 2014 10:47
> > > > To: Yu, Zhang
> > > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen-
> > > > devel@lists.xen.org
> > > > Subject: Re: One question about the hypercall to translate gfn to mfn.
> > > >
> > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > > > Hi all,
> > > > >
> > > > >    As you can see, we are pushing our XenGT patches to the upstream.
> > One
> > > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > > > > device model.
> > > > >
> > > > >    Here we may have 2 similar solutions:
> > > > >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there
> > was
> > > > no
> > > > > usage at that time.
> > > >
> > > > It's been suggested before that we should revive this hypercall, and I
> > > > don't think it's a good idea.  Whenever a domain needs to know the
> > > > actual MFN of another domain's memory it's usually because the
> > > > security model is problematic.  In particular, finding the MFN is
> > > > usually followed by a brute-force mapping from a dom0 process, or by
> > > > passing the MFN to a device for unprotected DMA.
> > > >
> > > > These days DMA access should be protected by IOMMUs, or else
> > > > the device drivers (and associated tools) are effectively inside the
> > > > hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
> > > > presumably present on anything new enough to run XenGT?).
> > > >
> > > > So I think the interface we need here is a please-map-this-gfn one,
> > > > like the existing grant-table ops (which already do what you need by
> > > > returning an address suitable for DMA).  If adding a grant entry for
> > > > every frame of the framebuffer within the guest is too much, maybe we
> > > > can make a new interface for the guest to grant access to larger areas.
> > > >
> > >
> > > IIUC the in-guest driver is Xen-unaware so any grant entry would have
> > > to be put in the guests table by the tools, which would entail some
> > > form of flexibly sized reserved range of grant entries otherwise any
> > > PV driver that are present in the guest would merrily clobber the new
> > > grant entries.
> > > A domain can already priv map a gfn into the MMU, so I think we just
> > >  need an equivalent for the IOMMU.
> > 
> > I'm not sure I'm fully understanding what's going on here, but is a
> > variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign which
> > also
> > returns a DMA handle a plausible solution?
> > 
> 
> I think we want be able to avoid setting up a PTE in the MMU since
> it's not needed in most (or perhaps all?) cases.

Another (wildly under-informed) thought then:

A while back GlobalLogic proposed (for ARM) an infrastructure for
allowing dom0 drivers to maintain a set of IOMMU-like pagetables under
hypervisor supervision (they called these "remoteprocessor iommu").

I didn't fully grok what it was at the time, let alone remember the
details properly now, but AIUI it was essentially a framework for
allowing a simple Xen side driver to provide PV-MMU-like update
operations for a set of PTs which were not the main-processor's PTs,
with validation etc.

See http://thread.gmane.org/gmane.comp.emulators.xen.devel/212945

The introductory email even mentions GPUs...

Ian.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 11:29         ` Ian Campbell
@ 2014-12-09 11:43           ` Paul Durrant
  2014-12-10  1:48             ` Tian, Kevin
  0 siblings, 1 reply; 59+ messages in thread
From: Paul Durrant @ 2014-12-09 11:43 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Kevin Tian, Keir (Xen.org), Tim (Xen.org),
	Xen-devel@lists.xen.org, Yu, Zhang, JBeulich@suse.com

> -----Original Message-----
> From: Ian Campbell
> Sent: 09 December 2014 11:29
> To: Paul Durrant
> Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); JBeulich@suse.com;
> Xen-devel@lists.xen.org
> Subject: Re: [Xen-devel] One question about the hypercall to translate gfn to
> mfn.
> 
> On Tue, 2014-12-09 at 11:17 +0000, Paul Durrant wrote:
> > > -----Original Message-----
> > > From: Ian Campbell
> > > Sent: 09 December 2014 11:11
> > > To: Paul Durrant
> > > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org);
> JBeulich@suse.com;
> > > Xen-devel@lists.xen.org
> > > Subject: Re: [Xen-devel] One question about the hypercall to translate
> gfn to
> > > mfn.
> > >
> > > On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote:
> > > > > -----Original Message-----
> > > > > From: Tim Deegan [mailto:tim@xen.org]
> > > > > Sent: 09 December 2014 10:47
> > > > > To: Yu, Zhang
> > > > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen-
> > > > > devel@lists.xen.org
> > > > > Subject: Re: One question about the hypercall to translate gfn to mfn.
> > > > >
> > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > > > > Hi all,
> > > > > >
> > > > > >    As you can see, we are pushing our XenGT patches to the
> upstream.
> > > One
> > > > > > feature we need in xen is to translate guests' gfn to mfn in XenGT
> dom0
> > > > > > device model.
> > > > > >
> > > > > >    Here we may have 2 similar solutions:
> > > > > >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > > > > > hypercall, XENMEM_translate_gpfn_list, which was removed by
> Keir in
> > > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because
> there
> > > was
> > > > > no
> > > > > > usage at that time.
> > > > >
> > > > > It's been suggested before that we should revive this hypercall, and I
> > > > > don't think it's a good idea.  Whenever a domain needs to know the
> > > > > actual MFN of another domain's memory it's usually because the
> > > > > security model is problematic.  In particular, finding the MFN is
> > > > > usually followed by a brute-force mapping from a dom0 process, or by
> > > > > passing the MFN to a device for unprotected DMA.
> > > > >
> > > > > These days DMA access should be protected by IOMMUs, or else
> > > > > the device drivers (and associated tools) are effectively inside the
> > > > > hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
> > > > > presumably present on anything new enough to run XenGT?).
> > > > >
> > > > > So I think the interface we need here is a please-map-this-gfn one,
> > > > > like the existing grant-table ops (which already do what you need by
> > > > > returning an address suitable for DMA).  If adding a grant entry for
> > > > > every frame of the framebuffer within the guest is too much, maybe
> we
> > > > > can make a new interface for the guest to grant access to larger areas.
> > > > >
> > > >
> > > > IIUC the in-guest driver is Xen-unaware so any grant entry would have
> > > > to be put in the guests table by the tools, which would entail some
> > > > form of flexibly sized reserved range of grant entries otherwise any
> > > > PV driver that are present in the guest would merrily clobber the new
> > > > grant entries.
> > > > A domain can already priv map a gfn into the MMU, so I think we just
> > > >  need an equivalent for the IOMMU.
> > >
> > > I'm not sure I'm fully understanding what's going on here, but is a
> > > variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign
> which
> > > also
> > > returns a DMA handle a plausible solution?
> > >
> >
> > I think we want be able to avoid setting up a PTE in the MMU since
> > it's not needed in most (or perhaps all?) cases.
> 
> Another (wildly under-informed) thought then:
> 
> A while back Global logic proposed (for ARM) an infrastructure for
> allowing dom0 drivers to maintain a set of iommu like pagetables under
> hypervisor supervision (they called these "remoteprocessor iommu").
> 
> I didn't fully grok what it was at the time, let alone remember the
> details properly now, but AIUI it was essentially a framework for
> allowing a simple Xen side driver to provide PV-MMU-like update
> operations for a set of PTs which were not the main-processor's PTs,
> with validation etc.
> 
> See http://thread.gmane.org/gmane.comp.emulators.xen.devel/212945
> 
> The introductory email even mentions GPUs...
> 

That series does indeed seem to be very relevant.

  Paul

> Ian.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 11:43           ` Paul Durrant
@ 2014-12-10  1:48             ` Tian, Kevin
  2014-12-10 10:11               ` Ian Campbell
  0 siblings, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10  1:48 UTC (permalink / raw)
  To: Paul Durrant, Ian Campbell
  Cc: Yu, Zhang, Keir (Xen.org), Tim (Xen.org), JBeulich@suse.com,
	Xen-devel@lists.xen.org

> From: Paul Durrant [mailto:Paul.Durrant@citrix.com]
> Sent: Tuesday, December 09, 2014 7:44 PM
> 
> > -----Original Message-----
> > From: Ian Campbell
> > Sent: 09 December 2014 11:29
> > To: Paul Durrant
> > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org); JBeulich@suse.com;
> > Xen-devel@lists.xen.org
> > Subject: Re: [Xen-devel] One question about the hypercall to translate gfn to
> > mfn.
> >
> > On Tue, 2014-12-09 at 11:17 +0000, Paul Durrant wrote:
> > > > -----Original Message-----
> > > > From: Ian Campbell
> > > > Sent: 09 December 2014 11:11
> > > > To: Paul Durrant
> > > > Cc: Tim (Xen.org); Yu, Zhang; Kevin Tian; Keir (Xen.org);
> > JBeulich@suse.com;
> > > > Xen-devel@lists.xen.org
> > > > Subject: Re: [Xen-devel] One question about the hypercall to translate
> > gfn to
> > > > mfn.
> > > >
> > > > On Tue, 2014-12-09 at 11:05 +0000, Paul Durrant wrote:
> > > > > > -----Original Message-----
> > > > > > From: Tim Deegan [mailto:tim@xen.org]
> > > > > > Sent: 09 December 2014 10:47
> > > > > > To: Yu, Zhang
> > > > > > Cc: Paul Durrant; Keir (Xen.org); JBeulich@suse.com; Kevin Tian; Xen-
> > > > > > devel@lists.xen.org
> > > > > > Subject: Re: One question about the hypercall to translate gfn to mfn.
> > > > > >
> > > > > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > > > > > Hi all,
> > > > > > >
> > > > > > >    As you can see, we are pushing our XenGT patches to the
> > upstream.
> > > > One
> > > > > > > feature we need in xen is to translate guests' gfn to mfn in XenGT
> > dom0
> > > > > > > device model.
> > > > > > >
> > > > > > >    Here we may have 2 similar solutions:
> > > > > > >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > > > > > > hypercall, XENMEM_translate_gpfn_list, which was removed by
> > Keir in
> > > > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because
> > there
> > > > was
> > > > > > no
> > > > > > > usage at that time.
> > > > > >
> > > > > > It's been suggested before that we should revive this hypercall, and I
> > > > > > don't think it's a good idea.  Whenever a domain needs to know the
> > > > > > actual MFN of another domain's memory it's usually because the
> > > > > > security model is problematic.  In particular, finding the MFN is
> > > > > > usually followed by a brute-force mapping from a dom0 process, or
> by
> > > > > > passing the MFN to a device for unprotected DMA.
> > > > > >
> > > > > > These days DMA access should be protected by IOMMUs, or else
> > > > > > the device drivers (and associated tools) are effectively inside the
> > > > > > hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
> > > > > > presumably present on anything new enough to run XenGT?).
> > > > > >
> > > > > > So I think the interface we need here is a please-map-this-gfn one,
> > > > > > like the existing grant-table ops (which already do what you need by
> > > > > > returning an address suitable for DMA).  If adding a grant entry for
> > > > > > every frame of the framebuffer within the guest is too much, maybe
> > we
> > > > > > can make a new interface for the guest to grant access to larger
> areas.
> > > > > >
> > > > >
> > > > > IIUC the in-guest driver is Xen-unaware so any grant entry would have
> > > > > to be put in the guests table by the tools, which would entail some
> > > > > form of flexibly sized reserved range of grant entries otherwise any
> > > > > PV driver that are present in the guest would merrily clobber the new
> > > > > grant entries.
> > > > > A domain can already priv map a gfn into the MMU, so I think we just
> > > > >  need an equivalent for the IOMMU.
> > > >
> > > > I'm not sure I'm fully understanding what's going on here, but is a
> > > > variant of XENMEM_add_to_physmap+XENMAPSPACE_gmfn_foreign
> > which
> > > > also
> > > > returns a DMA handle a plausible solution?
> > > >
> > >
> > > I think we want be able to avoid setting up a PTE in the MMU since
> > > it's not needed in most (or perhaps all?) cases.
> >
> > Another (wildly under-informed) thought then:
> >
> > A while back Global logic proposed (for ARM) an infrastructure for
> > allowing dom0 drivers to maintain a set of iommu like pagetables under
> > hypervisor supervision (they called these "remoteprocessor iommu").
> >
> > I didn't fully grok what it was at the time, let alone remember the
> > details properly now, but AIUI it was essentially a framework for
> > allowing a simple Xen side driver to provide PV-MMU-like update
> > operations for a set of PTs which were not the main-processor's PTs,
> > with validation etc.
> >
> > See http://thread.gmane.org/gmane.comp.emulators.xen.devel/212945
> >
> > The introductory email even mentions GPUs...
> >
> 
> That series does indeed seem to be very relevant.
> 
>   Paul

I'm not familiar with the Arm architecture, but based on a brief reading
it's for the assigned case where the MMU is exclusively owned by a VM, so
some type of MMU virtualization is required and it's straightforward.

However, XenGT shares one GPU among multiple VMs:

- A global GPU page table is partitioned among VMs. A shared shadow
global page table is maintained, containing translations for multiple
VMs simultaneously, based on the partitioning information.
- Each VM creates multiple per-process GPU page tables, and a shadow
per-process GPU page table is created for each of them. The shadow
page table is switched on a GPU context switch, the same as what we
do for CPU shadow page tables.

So you can see that the shared MMU virtualization above is very GPU
specific. That's why we didn't put it in the Xen hypervisor, and thus an
additional interface is required to get the p2m mapping to assist our
shadow GPU page table usage.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  1:48             ` Tian, Kevin
@ 2014-12-10 10:11               ` Ian Campbell
  2014-12-11  1:50                 ` Tian, Kevin
  0 siblings, 1 reply; 59+ messages in thread
From: Ian Campbell @ 2014-12-10 10:11 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: Keir (Xen.org), Tim (Xen.org), Xen-devel@lists.xen.org,
	Paul Durrant, Yu, Zhang, JBeulich@suse.com

On Wed, 2014-12-10 at 01:48 +0000, Tian, Kevin wrote:
> I'm not familiar with Arm architecture, but based on a brief reading it's
> for the assigned case where the MMU is exclusive owned by a VM, so
> some type of MMU virtualization is required and it's straightforward.

> However XenGT is a shared GPU usage:
> 
> - a global GPU page table is partitioned among VMs. a shared shadow
> global page table is maintained, containing translations for multiple
> VMs simultaneously based on partitioning information
> - multiple per-process GPU page tables are created by each VM, and
> multiple shadow per-process GPU page tables are created correspondingly.
> shadow page table is switched when doing GPU context switch, same as
> what we did for CPU shadow page table.

None of that sounds to me to be impossible to do in the remoteproc
model, perhaps it needs some extensions from its initial core feature
set but I see no reason why it couldn't maintain multiple sets of page
tables, each tagged with an owning domain (for validation purposes) and
a mechanism to switch between them, or to be able to manage partitioning
of the GPU address space.

> So you can see above shared MMU virtualization usage is very GPU
> specific,

AIUI remoteproc is specific to a particular h/w device too, i.e. there
is a device specific stub in the hypervisor which essentially knows how
to implement set_pte for that bit of h/w, with appropriate safety and
validation, as well as a write_cr3 type operation.

>  that's why we didn't put in Xen hypervisor, and thus additional
> interface is required to get p2m mapping to assist our shadow GPU
> page table usage.

There is a great reluctance among several maintainers to expose real
hardware MFNs to VMs (including dom0 and backend driver domains).

I think you need to think very carefully about possible ways of avoiding
the need for this. Yes, this might require some changes to your current
mode/design.

Ian.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10 10:11               ` Ian Campbell
@ 2014-12-11  1:50                 ` Tian, Kevin
  0 siblings, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-11  1:50 UTC (permalink / raw)
  To: Ian Campbell
  Cc: Keir (Xen.org), Tim (Xen.org), Xen-devel@lists.xen.org,
	Paul Durrant, Yu, Zhang, JBeulich@suse.com

> From: Ian Campbell [mailto:Ian.Campbell@citrix.com]
> Sent: Wednesday, December 10, 2014 6:11 PM
> 
> On Wed, 2014-12-10 at 01:48 +0000, Tian, Kevin wrote:
> > I'm not familiar with Arm architecture, but based on a brief reading it's
> > for the assigned case where the MMU is exclusive owned by a VM, so
> > some type of MMU virtualization is required and it's straightforward.
> 
> > However XenGT is a shared GPU usage:
> >
> > - a global GPU page table is partitioned among VMs. a shared shadow
> > global page table is maintained, containing translations for multiple
> > VMs simultaneously based on partitioning information
> > - multiple per-process GPU page tables are created by each VM, and
> > multiple shadow per-process GPU page tables are created correspondingly.
> > shadow page table is switched when doing GPU context switch, same as
> > what we did for CPU shadow page table.
> 
> None of that sounds to me to be impossible to do in the remoteproc
> model, perhaps it needs some extensions from its initial core feature
> set but I see no reason why it couldn't maintain multiple sets of page
> tables, each tagged with an owning domain (for validation purposes) and
> a mechanism to switch between them, or to be able to manage partitioning
> of the GPU address space.

Here we're talking about multiple GPU page tables on top of an
IOMMU page table, rather than the single MMU unit that remoteproc
is concerned with.

> 
> > So you can see above shared MMU virtualization usage is very GPU
> > specific,
> 
> AIUI remoteproc is specific to a particular h/w device too, i.e. there
> is a device specific stub in the hypervisor which essentially knows how
> to implement set_pte for that bit of h/w, with appropriate safety and
> validation, as well as a write_cr3 type operation.
> 
> >  that's why we didn't put in Xen hypervisor, and thus additional
> > interface is required to get p2m mapping to assist our shadow GPU
> > page table usage.
> 
> There is a great reluctance among several maintainers to expose real
> hardware MFNs to VMs (including dom0 and backend driver domains).
> 
> I think you need to think very carefully about possible ways of avoiding
> the need for this. Yes, this might require some changes to your current
> mode/design.
> 

We're open to changes if necessary.

Thanks,
Kevin 

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-09 10:46 ` Tim Deegan
  2014-12-09 11:05   ` Paul Durrant
@ 2014-12-10  1:14   ` Tian, Kevin
  2014-12-10 10:36     ` Jan Beulich
  2014-12-10 10:55     ` Tim Deegan
  1 sibling, 2 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-10  1:14 UTC (permalink / raw)
  To: Tim Deegan, Yu, Zhang
  Cc: Paul.Durrant@citrix.com, keir@xen.org, JBeulich@suse.com,
	Xen-devel@lists.xen.org

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Tuesday, December 09, 2014 6:47 PM
> 
> At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > Hi all,
> >
> >    As you can see, we are pushing our XenGT patches to the upstream. One
> > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > device model.
> >
> >    Here we may have 2 similar solutions:
> >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no
> > usage at that time.
> 
> It's been suggested before that we should revive this hypercall, and I
> don't think it's a good idea.  Whenever a domain needs to know the
> actual MFN of another domain's memory it's usually because the
> security model is problematic.  In particular, finding the MFN is
> usually followed by a brute-force mapping from a dom0 process, or by
> passing the MFN to a device for unprotected DMA.

In our case it's not because the security model is problematic. It's
because GPU virtualization is done in Dom0 while memory virtualization
is done in the hypervisor. We need a means to query GPFN->MFN so we can
set up the shadow GPU page table in Dom0 correctly for a VM.

> 
> These days DMA access should be protected by IOMMUs, or else
> the device drivers (and associated tools) are effectively inside the
> hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
> presumably present on anything new enough to run XenGT?).

Yes, an IOMMU protects DMA accesses in a device-agnostic way. But in
our case the IOMMU can't be used, because it only covers the exclusively
assigned case, as I replied in another mail. And to reduce the hypervisor
TCB, we put the device model in Dom0, which is why an interface is
required to expose the p2m information.

> 
> So I think the interface we need here is a please-map-this-gfn one,
> like the existing grant-table ops (which already do what you need by
> returning an address suitable for DMA).  If adding a grant entry for
> every frame of the framebuffer within the guest is too much, maybe we
> can make a new interface for the guest to grant access to larger areas.

A please-map-this-gfn interface assumes that the logic behind it lies in
the Xen hypervisor, e.g. managing the CPU page table or IOMMU entries.
However, here the GPU page table is managed in Dom0, and what we want is
a please-tell-me-mfn-for-a-gpfn interface, so we can translate from the
gpfn in a guest GPU PTE to an mfn in the shadow GPU PTE.
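
As a rough illustration of that translation step, here is a toy sketch.
It is not XenGT code: the PTE layout (frame number above 12 flag bits),
the names `shadow_pte` and `shadow_table`, and the dict standing in for
the p2m are all assumptions made up for the example.

```python
PAGE_SHIFT = 12                      # assumed 4 KiB pages
FLAG_MASK = (1 << PAGE_SHIFT) - 1    # assumed: low bits carry PTE flags


def shadow_pte(guest_pte, p2m):
    """Build a shadow PTE from a guest PTE using a gpfn -> mfn table."""
    gpfn = guest_pte >> PAGE_SHIFT   # frame number the guest wrote
    flags = guest_pte & FLAG_MASK    # flags are carried over unchanged
    mfn = p2m[gpfn]                  # KeyError if the gpfn is not populated
    return (mfn << PAGE_SHIFT) | flags


def shadow_table(guest_ptes, p2m):
    """Translate every entry of a (toy) guest GPU page table."""
    return [shadow_pte(pte, p2m) for pte in guest_ptes]
```

The only input the device model cannot produce by itself is the `p2m`
table, which is the information the proposed interface would supply.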

Hope this makes the requirement clearer.

> 
> Cheers,
> 
> Tim.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  1:14   ` Tian, Kevin
@ 2014-12-10 10:36     ` Jan Beulich
  2014-12-11  1:45       ` Tian, Kevin
  2014-12-10 10:55     ` Tim Deegan
  1 sibling, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-10 10:36 UTC (permalink / raw)
  To: Kevin Tian
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

>>> On 10.12.14 at 02:14, <kevin.tian@intel.com> wrote:
>>  From: Tim Deegan [mailto:tim@xen.org]
>> It's been suggested before that we should revive this hypercall, and I
>> don't think it's a good idea.  Whenever a domain needs to know the
>> actual MFN of another domain's memory it's usually because the
>> security model is problematic.  In particular, finding the MFN is
>> usually followed by a brute-force mapping from a dom0 process, or by
>> passing the MFN to a device for unprotected DMA.
> 
> In our case it's not because the security model is problematic. It's 
> because GPU virtualization is done in Dom0 while the memory virtualization
> is done in hypervisor.

Which by itself is a questionable design decision.

> We need a means to query GPFN->MFN so we can
> setup shadow GPU page table in Dom0 correctly, for a VM.
> 
>> 
>> These days DMA access should be protected by IOMMUs, or else
>> the device drivers (and associated tools) are effectively inside the
>> hypervisor's TCB.  Luckily on x86 IOMMUs are widely available (and
>> presumably present on anything new enough to run XenGT?).
> 
> yes, IOMMU protect DMA accesses in a device-agnostic way. But in
> our case, IOMMU can't be used because it's only for exclusively
> assigned case, as I replied in another mail. And to reduce the hypervisor
> TCB, we put device model in Dom0 which is why a interface is required
> to connect p2m information.
> 
>> 
>> So I think the interface we need here is a please-map-this-gfn one,
>> like the existing grant-table ops (which already do what you need by
>> returning an address suitable for DMA).  If adding a grant entry for
>> every frame of the framebuffer within the guest is too much, maybe we
>> can make a new interface for the guest to grant access to larger areas.
> 
> A please-map-this-gfn interface assumes the logic behind lies in Xen
> hypervisor, e.g. managing CPU page table or IOMMU entry. However
> here the management of GPU page table is in Dom0, and what we
> want is a please-tell-me-mfn-for-a-gpfn interface, so we can translate
> from gpfn in guest GPU PTE to a mfn in shadow GPU PTE. 

As said before, what needs to be put in the GPU PTE depends on
what the subsequent IOMMU translation would do to the address.
It's not a hard requirement for the IOMMU to pass through all
addresses for Dom0, so we have room to isolate things if possible.

Jan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10 10:36     ` Jan Beulich
@ 2014-12-11  1:45       ` Tian, Kevin
  0 siblings, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-11  1:45 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Wednesday, December 10, 2014 6:36 PM
> 
> >>> On 10.12.14 at 02:14, <kevin.tian@intel.com> wrote:
> >>  From: Tim Deegan [mailto:tim@xen.org]
> >> It's been suggested before that we should revive this hypercall, and I
> >> don't think it's a good idea.  Whenever a domain needs to know the
> >> actual MFN of another domain's memory it's usually because the
> >> security model is problematic.  In particular, finding the MFN is
> >> usually followed by a brute-force mapping from a dom0 process, or by
> >> passing the MFN to a device for unprotected DMA.
> >
> > In our case it's not because the security model is problematic. It's
> > because GPU virtualization is done in Dom0 while the memory virtualization
> > is done in hypervisor.
> 
> Which by itself is a questionable design decision.
> 

I don't think we want to put a ~20K LOC device model in the hypervisor.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10  1:14   ` Tian, Kevin
  2014-12-10 10:36     ` Jan Beulich
@ 2014-12-10 10:55     ` Tim Deegan
  2014-12-11  1:41       ` Tian, Kevin
  1 sibling, 1 reply; 59+ messages in thread
From: Tim Deegan @ 2014-12-10 10:55 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang,
	JBeulich@suse.com, Xen-devel@lists.xen.org

At 01:14 +0000 on 10 Dec (1418170461), Tian, Kevin wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > Sent: Tuesday, December 09, 2014 6:47 PM
> > 
> > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > Hi all,
> > >
> > >    As you can see, we are pushing our XenGT patches to the upstream. One
> > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > > device model.
> > >
> > >    Here we may have 2 similar solutions:
> > >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no
> > > > usage at that time.
> > 
> > It's been suggested before that we should revive this hypercall, and I
> > don't think it's a good idea.  Whenever a domain needs to know the
> > actual MFN of another domain's memory it's usually because the
> > security model is problematic.  In particular, finding the MFN is
> > usually followed by a brute-force mapping from a dom0 process, or by
> > passing the MFN to a device for unprotected DMA.
> 
> In our case it's not because the security model is problematic. It's 
> because GPU virtualization is done in Dom0 while the memory virtualization
> is done in hypervisor. We need a means to query GPFN->MFN so we can
> setup shadow GPU page table in Dom0 correctly, for a VM.

I don't think we understand each other.  Let me try to explain what I
mean.  My apologies if this sounds patronising; I'm just trying to be
as clear as I can.

It is Xen's job to isolate VMs from each other.  As part of that, Xen
uses the MMU, nested paging, and IOMMUs to control access to RAM.  Any
software component that can pass a raw MFN to hardware breaks that
isolation, because Xen has no way of controlling what that component
can do (including taking over the hypervisor).  This is why I am
afraid when developers ask for GFN->MFN translation functions.

So if the XenGT model allowed the backend component to (cause the GPU
to) perform arbitrary DMA without IOMMU checks, then that component
would have complete access to the system and (from a security pov)
might as well be running in the hypervisor.  That would be very
problematic, but AFAICT that's not what's going on.  From your reply
on the other thread it seems like the GPU is behind the IOMMU, so
that's OK. :)

When the backend component gets a GFN from the guest, it wants an
address that it can give to the GPU for DMA that will map the right
memory.  That address must be mapped in the IOMMU tables that the GPU
will be using, which means the IOMMU tables of the backend domain,
IIUC[1].  So the hypercall it needs is not "give me the MFN that matches
this GFN" but "please map this GFN into my IOMMU tables".

Asking for the MFN will only work if the backend domain's IOMMU
tables have an existing 1:1 r/w mapping of all guest RAM, which
happens to be the case if the backend component is in dom0 _and_ dom0
is PV _and_ we're not using strict IOMMU tables.  Restricting XenGT to
work in only those circumstances would be short-sighted, not only
because it would mean XenGT could never work as a driver domain, but
also because it seems like PVH dom0 is going to be the default at some
point.
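
A toy illustration of that dependency, under made-up assumptions: the
IOMMU table is a plain dict from bus frame to machine frame, and
`dma`, `identity_table`, `raw_mfn_works` and the frame numbers are all
hypothetical names and values, not anything in Xen.

```python
class IommuFault(Exception):
    """Raised when a device issues DMA to a frame with no IOMMU mapping."""


def dma(iommu_table, bus_frame):
    """Translate the frame a device put on the bus into a machine frame."""
    if bus_frame not in iommu_table:
        raise IommuFault(hex(bus_frame))
    return iommu_table[bus_frame]


def identity_table(nr_frames):
    """1:1 mapping of all RAM, as in the non-strict PV dom0 case."""
    return {frame: frame for frame in range(nr_frames)}


GUEST_MFN = 0x500                       # machine frame backing some guest page
pv_dom0 = identity_table(0x1000)        # covers the guest's frame too
strict_dom0 = {0x0: 0x200, 0x1: 0x201}  # only the backend's own frames, not 1:1


def raw_mfn_works(iommu_table):
    """Does handing the raw MFN to the device reach the intended frame?"""
    try:
        return dma(iommu_table, GUEST_MFN) == GUEST_MFN
    except IommuFault:
        return False
```

With `pv_dom0` the raw MFN happens to be a valid bus address; with
`strict_dom0` the same DMA faults, which is the breakage described above.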

If the existing hypercalls that make IOMMU mappings are not right for
XenGT then we can absolutely consider adding some more.  But we need
to talk about what policy Xen will enforce on the mapping requests.
If the shared backend is allowed to map any page of any VM, then it
can easily take control of any VM on the host (even though the IOMMU
will prevent it from taking over the hypervisor itself).  The
absolute minimum we should allow here is some toolstack-controlled
list of which VMs the XenGT backend is serving, so that it can refuse
to map other VMs' memory (like an extension of IS_PRIV_FOR, which does
this job for Qemu).
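
A minimal sketch of the kind of check I mean, with everything in it
hypothetical (`BackendPolicy`, `map_request` and the tuple standing in
for a real mapping are invented for illustration; this is not how
IS_PRIV_FOR is implemented):

```python
class BackendPolicy:
    """Toolstack-controlled record of which guests each backend serves."""

    def __init__(self):
        self._served = {}   # backend domid -> set of guest domids

    def allow(self, backend, guest):
        self._served.setdefault(backend, set()).add(guest)

    def revoke(self, backend, guest):
        self._served.get(backend, set()).discard(guest)

    def may_map(self, backend, guest):
        return guest in self._served.get(backend, set())


def map_request(policy, backend, guest, gfn):
    """Refuse to map memory of any guest the backend is not serving."""
    if not policy.may_map(backend, guest):
        raise PermissionError(f"dom{backend} is not serving dom{guest}")
    return (guest, gfn)     # stand-in for installing the real mapping
```

The point of the sketch is only that the list is written by the
toolstack and consulted on every mapping request, not once at setup.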

I would also strongly advise using privilege separation in the backend
between the GPUPT shadow code (which needs mapping rights and is
trusted to maintain isolation between the VMs that are sharing the
GPU) and the rest of the XenGT backend (which doesn't/isn't).  But
that's outside my remit as a hypervisor maintainer so it goes no
further than an "I told you so". :)

Cheers,

Tim.

[1] That is, AIUI this GPU doesn't context-switch which set of IOMMU
    tables it's using for DMA, SR-IOV-style, and that's why you need a
    software component in the first place.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-10 10:55     ` Tim Deegan
@ 2014-12-11  1:41       ` Tian, Kevin
  2014-12-11 16:46         ` Tim Deegan
  2014-12-11 21:29         ` Tim Deegan
  0 siblings, 2 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-11  1:41 UTC (permalink / raw)
  To: Tim Deegan
  Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang,
	JBeulich@suse.com, Xen-devel@lists.xen.org

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Wednesday, December 10, 2014 6:55 PM
> 
> At 01:14 +0000 on 10 Dec (1418170461), Tian, Kevin wrote:
> > > From: Tim Deegan [mailto:tim@xen.org]
> > > Sent: Tuesday, December 09, 2014 6:47 PM
> > >
> > > At 18:10 +0800 on 09 Dec (1418145055), Yu, Zhang wrote:
> > > > Hi all,
> > > >
> > > >    As you can see, we are pushing our XenGT patches to the upstream.
> One
> > > > feature we need in xen is to translate guests' gfn to mfn in XenGT dom0
> > > > device model.
> > > >
> > > >    Here we may have 2 similar solutions:
> > > >    1> Paul told me(and thank you, Paul :)) that there used to be a
> > > > hypercall, XENMEM_translate_gpfn_list, which was removed by Keir in
> > > > > commit 2d2f7977a052e655db6748be5dabf5a58f5c5e32, because there was no
> > > > > usage at that time.
> > >
> > > It's been suggested before that we should revive this hypercall, and I
> > > don't think it's a good idea.  Whenever a domain needs to know the
> > > actual MFN of another domain's memory it's usually because the
> > > security model is problematic.  In particular, finding the MFN is
> > > usually followed by a brute-force mapping from a dom0 process, or by
> > > passing the MFN to a device for unprotected DMA.
> >
> > In our case it's not because the security model is problematic. It's
> > because GPU virtualization is done in Dom0 while the memory virtualization
> > is done in hypervisor. We need a means to query GPFN->MFN so we can
> > setup shadow GPU page table in Dom0 correctly, for a VM.
> 
> I don't think we understand each other.  Let me try to explain what I
> mean.  My apologies if this sounds patronising; I'm just trying to be
> as clear as I can.

Thanks for your explanation. This is a very helpful discussion. :-)

> 
> It is Xen's job to isolate VMs from each other.  As part of that, Xen
> uses the MMU, nested paging, and IOMMUs to control access to RAM.  Any
> software component that can pass a raw MFN to hardware breaks that
> isolation, because Xen has no way of controlling what that component
> can do (including taking over the hypervisor).  This is why I am
> afraid when developers ask for GFN->MFN translation functions.

While I absolutely agree that this is Xen's job, isolation is also
required at other layers, depending on who controls the resource and
where the virtualization happens. For example, in I/O virtualization,
Dom0 or a driver domain needs to isolate backend drivers from each other
so that one backend cannot interfere with another. Xen can't detect such
a violation, since all it knows is that Dom0 wants to access a VM's page.

Btw, I'm curious how much worse exposing the GFN->MFN translation is
compared to allowing the mapping of another VM's GFN. If exposing
GFN->MFN is under the same permission control as mapping, would that
address your worry here?

> 
> So if the XenGT model allowed the backend component to (cause the GPU
> to) perform arbitrary DMA without IOMMU checks, then that component
> would have complete access to the system and (from a security pov)
> might as well be running in the hypervisor.  That would be very
> problematic, but AFAICT that's not what's going on.  From your reply
> on the other thread it seems like the GPU is behind the IOMMU, so
> that's OK. :)
> 
> When the backend component gets a GFN from the guest, it wants an
> address that it can give to the GPU for DMA that will map the right
> memory.  That address must be mapped in the IOMMU tables that the GPU
> will be using, which means the IOMMU tables of the backend domain,
> IIUC[1].  So the hypercall it needs is not "give me the MFN that matches
> this GFN" but "please map this GFN into my IOMMU tables".

Here "please map this GFN into my IOMMU tables" actually breaks the
IOMMU isolation. The IOMMU is designed to serve DMA requests issued on
behalf of a single VM, so that the IOMMU page table can strictly
restrict that VM's accesses.

To map multiple VMs' GFNs into one IOMMU table, the first thing is to
avoid GFN conflicts, to make it functional at all. We thought about this
approach previously, e.g. by reserving the highest 3 bits of the GFN as a
VMID, so one IOMMU page table can combine multiple VMs' page tables.
However, doing so has three limitations:

a) it still requires write-protecting the guest GPU page table and
maintaining a shadow GPU page table by translating from the real GFN to
a pseudo GFN (GFN plus VMID), which doesn't save any engineering effort
in the device model

b) it breaks the isolation the IOMMU was designed to provide. In this
case the IOMMU can't isolate multiple VMs by itself, since a DMA request
can target any pseudo GFN that is valid in the page table. We have to
rely on the audit in the backend component in Dom0 to ensure isolation.
So even with the IOMMU in use, the isolation you described earlier is
lost.

c) it introduces tricky logic in the IOMMU driver to handle such a
non-standard, multiplexed page table.

Without an SR-IOV implementation (where each VF has its own IOMMU page
table), I don't see how using the IOMMU can help isolation here.
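
For concreteness, the pseudo-GFN encoding could look roughly like the
sketch below. Only the "highest 3 bits as VMID" part comes from what we
considered; the 36-bit GFN width and the function names are assumptions
picked for the example.

```python
GFN_BITS = 36                        # hypothetical width of a GFN in the table
VMID_BITS = 3                        # highest 3 bits reserved for the VMID
REAL_BITS = GFN_BITS - VMID_BITS     # bits left for the guest's own GFN
REAL_MASK = (1 << REAL_BITS) - 1


def to_pseudo_gfn(vmid, gfn):
    """Tag a guest GFN with its VMID so several VMs can share one table."""
    if not 0 <= vmid < (1 << VMID_BITS):
        raise ValueError("VMID does not fit in the reserved bits")
    if not 0 <= gfn <= REAL_MASK:
        raise ValueError("GFN collides with the VMID bits")
    return (vmid << REAL_BITS) | gfn


def from_pseudo_gfn(pseudo):
    """Split a pseudo GFN back into (vmid, gfn)."""
    return pseudo >> REAL_BITS, pseudo & REAL_MASK
```

Two VMs using the same GFN now get distinct table entries, but nothing
in the table itself stops a DMA issued for one VM from naming another
VM's pseudo GFN, which is limitation b).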

> 
> Asking for the MFN will only work if the backend domain's IOMMU
> tables have an existing 1:1 r/w mapping of all guest RAM, which
> happens to be the case if the backend component is in dom0 _and_ dom0
> is PV _and_ we're not using strict IOMMU tables.  Restricting XenGT to
> work in only those circumstances would be short-sighted, not only
> because it would mean XenGT could never work as a driver domain, but
> also because it seems like PVH dom0 is going to be the default at some
> point.

Yes, this is good feedback; we didn't think about it before. So far
XenGT works because we use the default IOMMU setting, which sets up a
1:1 r/w mapping for all possible RAM, so when the GPU hits an MFN through
the shadow GPU page table, the IOMMU is essentially bypassed. However, as
you said, if the IOMMU page table is restricted to dom0's memory, or is
not a 1:1 identity mapping, XenGT will break.

However, I don't see a good solution for this, except the multiplexed
IOMMU page table mentioned above, which doesn't look like a sane design
to me.

> 
> If the existing hypercalls that make IOMMU mappings are not right for
> XenGT then we can absolutely consider adding some more.  But we need
> to talk about what policy Xen will enforce on the mapping requests.
> If the shared backend is allowed to map any page of any VM, then it
> can easily take control of any VM on the host (even though the IOMMU
> will prevent it from taking over the hypervisor itself).  The
> absolute minimum we should allow here is some toolstack-controlled
> list of which VMs the XenGT backend is serving, so that it can refuse
> to map other VMs' memory (like an extension of IS_PRIV_FOR, which does
> this job for Qemu).

For mapping and accessing another guest's memory, I don't think we
need any new interface on top of the existing ones. Just like other
backend drivers, we can leverage the same permission control.

Please note that the requirement to expose the p2m here is really about
setting up the GPU page table, so that a guest GPU workload can be
executed directly by the GPU.

> 
> I would also strongly advise using privilege separation in the backend
> between the GPUPT shadow code (which needs mapping rights and is
> trusted to maintain isolation between the VMs that are sharing the
> GPU) and the rest of the XenGT backend (which doesn't/isn't).  But
> that's outside my remit as a hypervisor maintainer so it goes no
> further than an "I told you so". :)

We're open to suggestions for making our code better, but could you
elaborate a bit on exactly what kind of privilege separation you mean
here? :-)

> 
> Cheers,
> 
> Tim.
> 
> [1] That is, AIUI this GPU doesn't context-switch which set of IOMMU
>     tables it's using for DMA, SR-IOV-style, and that's why you need a
>     software component in the first place.

Yes, there's only one IOMMU dedicated to the GPU, and it's impractical to
switch the IOMMU page table given concurrent access to graphics memory
from different VCPUs and different render engines within the GPU.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-11  1:41       ` Tian, Kevin
@ 2014-12-11 16:46         ` Tim Deegan
  2014-12-12  7:24           ` Tian, Kevin
  2014-12-11 21:29         ` Tim Deegan
  1 sibling, 1 reply; 59+ messages in thread
From: Tim Deegan @ 2014-12-11 16:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang,
	JBeulich@suse.com, Xen-devel@lists.xen.org

Hi, 

At 01:41 +0000 on 11 Dec (1418258504), Tian, Kevin wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > It is Xen's job to isolate VMs from each other.  As part of that, Xen
> > uses the MMU, nested paging, and IOMMUs to control access to RAM.  Any
> > software component that can pass a raw MFN to hardware breaks that
> > isolation, because Xen has no way of controlling what that component
> > can do (including taking over the hypervisor).  This is why I am
> > afraid when developers ask for GFN->MFN translation functions.
> 
> When I agree Xen's job absolutely, the isolation is also required in different
> layers, regarding to who controls the resource and where the virtualization 
> happens. For example talking about I/O virtualization, Dom0 or driver domain 
> needs to isolate among backend drivers to avoid one backend interfering 
> with another. Xen doesn't know such violation, since it only knows it's Dom0
> wants to access a VM's page.

I'm going to write a second reply to this mail in a bit, to talk about
this kind of system-level design.  In this email I'll just talk about
the practical aspects of interfaces and address spaces and IOMMUs.

> btw curious of how worse exposing GFN->MFN translation compared to
> allowing mapping other VM's GFN? If exposing GFN->MFN is under the
> same permission control as mapping, would it avoid your worry here?

I'm afraid not.  There's nothing worrying per se in a backend knowing
the MFNs of the pages -- the worry is that the backend can pass the
MFNs to hardware.  If the check happens only at lookup time, then XenGT
can (either through a bug or a security breach) just pass _any_ MFN to
the GPU for DMA.

But even without considering the security aspects, this model has bugs
that may be impossible for XenGT itself to even detect.  E.g.:
 1. Guest asks its virtual GPU to DMA to a frame of memory;
 2. XenGT looks up the GFN->MFN mapping;
 3. Guest balloons out the page;
 4. Xen allocates the page to a different guest;
 5. XenGT passes the MFN to the GPU, which DMAs to it.

Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
underlying memory and make sure it doesn't get reallocated until XenGT
is finished with it.
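
To make the difference concrete, here is a toy model of the two cases.
It is a sketch only: `ToyXen` and its methods are invented for the
example and bear no relation to Xen's real p2m or refcounting code.

```python
class ToyXen:
    """Toy hypervisor state: p2m entries, frame owners, mapping refcounts."""

    def __init__(self):
        self.p2m = {}     # (domid, gfn) -> mfn
        self.owner = {}   # mfn -> owning domid, or None when the frame is free
        self.refs = {}    # mfn -> number of outstanding DMA mappings

    def populate(self, domid, gfn, mfn):
        self.p2m[(domid, gfn)] = mfn
        self.owner[mfn] = domid
        self.refs[mfn] = 0

    def lookup(self, domid, gfn):
        # Pure translation: nothing records that the caller will use the MFN.
        return self.p2m[(domid, gfn)]

    def map_for_dma(self, domid, gfn):
        # Mapping: take a reference so the frame cannot be reused yet.
        mfn = self.p2m[(domid, gfn)]
        self.refs[mfn] += 1
        return mfn

    def unmap(self, mfn):
        self.refs[mfn] -= 1
        if self.refs[mfn] == 0 and mfn not in self.p2m.values():
            self.owner[mfn] = None   # last user gone, frame becomes free

    def balloon_out(self, domid, gfn):
        mfn = self.p2m.pop((domid, gfn))
        if self.refs[mfn] == 0:
            self.owner[mfn] = None   # free at once only if nobody mapped it

    def allocate(self, mfn, domid):
        if self.owner[mfn] is not None:
            return False             # frame is still in use
        self.owner[mfn] = domid
        return True


def lookup_race():
    """Steps 2-5 above with a lookup: returns who owns the DMA target."""
    xen = ToyXen()
    xen.populate(1, 0x10, 0x500)
    mfn = xen.lookup(1, 0x10)        # step 2
    xen.balloon_out(1, 0x10)         # step 3
    xen.allocate(mfn, 2)             # step 4
    return xen.owner[mfn]            # step 5: the DMA lands in this domain


def mapping_blocks_reuse():
    """Same sequence with a mapping: reuse is refused until the unmap."""
    xen = ToyXen()
    xen.populate(1, 0x10, 0x500)
    mfn = xen.map_for_dma(1, 0x10)
    xen.balloon_out(1, 0x10)
    reused_early = xen.allocate(mfn, 2)
    xen.unmap(mfn)
    reused_late = xen.allocate(mfn, 2)
    return reused_early, reused_late
```

In the lookup case the DMA target ends up owned by domain 2; in the
mapping case the reallocation is refused until the backend unmaps.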

> > When the backend component gets a GFN from the guest, it wants an
> > address that it can give to the GPU for DMA that will map the right
> > memory.  That address must be mapped in the IOMMU tables that the GPU
> > will be using, which means the IOMMU tables of the backend domain,
> > IIUC[1].  So the hypercall it needs is not "give me the MFN that matches
> > this GFN" but "please map this GFN into my IOMMU tables".
> 
> Here "please map this GFN into my IOMMU tables" actually breaks the
> IOMMU isolation. IOMMU is designed for serving DMA requests issued
> by an exclusive VM, so IOMMU page table can restrict that VM's attempts
> strictly.
> 
> To map multiple VM's GFNs into one IOMMU table, the 1st thing is to
> avoid GFN conflictions to make it functional. We thought about this approach
> previously, e.g. by reserving highest 3 bits of GFN as VMID, so one IOMMU
> page table can be used to combine multi-VM's page table together. However
> doing so have two limitations:
> 
> a) it still requires write-protect guest GPU page table, and maintain a shadow
> GPU page table by translate from real GFN to pseudo GFN (plus VMID), which
> doesn't save any engineering effort in the device model part

Yes -- since there's only one IOMMU context for the whole GPU, the
XenGT backend still has to audit all GPU commands to maintain
isolation between clients.

> b) it breaks the designed isolation intrinsic of IOMMU. In such case, IOMMU
> can't isolate multiple VMs by itself, since a DMA request can target any 
> pseudo GFN if valid in the page table. We have to rely on the audit in the 
> backend component in Dom0 to ensure the isolation.

Yep.

> c) this introduces tricky logic in IOMMU driver to handle such non-standard
> multiplexed page table style. 
> 
> w/o a SR-IOV implementation (so each VF has its own IOMMU page table),
> I don't see using IOMMU can help isolation here.

If I've understood your argument correctly, it basically comes down
to "It would be extra work for no benefit, because XenGT still has to
do all the work of isolating GPU clients from each other".  It's true
that XenGT still has to isolate its clients, but there are other
benefits.

The main one, from my point of view as a Xen maintainer, is that it
allows Xen to constrain XenGT itself, in the case where bugs or
security breaches mean that XenGT tries to access memory it shouldn't.
More about that in my other reply.  I'll talk about the rest below.

> Yes, this is good feedback that we didn't think about before. So far XenGT
> works only because we use the default IOMMU setting, which sets
> up a 1:1 r/w mapping for all possible RAM, so when the GPU hits an MFN through
> the shadow GPU page table, the IOMMU is essentially bypassed. However, as
> you said, if the IOMMU page table is restricted to dom0's memory, or is not
> a 1:1 identity mapping, XenGT will be broken.
> 
> However I don't see a good solution for this, except using multiplexed
> IOMMU page table aforementioned, which however doesn't look like
> a sane design to me.

Right.  AIUI you're talking about having a component, maybe in Xen,
that automatically makes a merged IOMMU table that contains multiple
VMs' p2m tables all at once.  I think that we can do something simpler
than that which will have the same effect and also avoid race
conditions like the one I mentioned at the top of the email.

[First some hopefully-helpful diagrams to explain my thinking.  I'll
 borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
 addresses that devices issue their DMAs in:

 Here's how the translations work for a HVM guest using HAP:

   CPU    <- Code supplied by the guest
    |
  (VA)
    | 
   MMU    <- Pagetables supplied by the guest
    | 
  (GFN)
    | 
   HAP    <- Guest's P2M, supplied by Xen
    |
  (MFN)
    | 
   RAM

 Here's how it looks for a GPU operation using XenGT:

   GPU       <- Code supplied by Guest, audited by XenGT
    | 
  (GPU VA)
    | 
  GPU-MMU    <- GTTs supplied by XenGT (by shadowing guest ones)
    | 
  (GPU BFN)
    | 
  IOMMU      <- XenGT backend dom's P2M (for PVH/HVM) or IOMMU tables (for PV)
    |
  (MFN)
    | 
   RAM

 OK, on we go...]

Somewhere in the existing XenGT code, XenGT has a guest GFN in its
hand and makes a lookup hypercall to find the MFN.  It puts that MFN
into the GTTs that it passes to the GPU.  But an MFN is not actually
what it needs here -- it needs a GPU BFN, which the IOMMU will then
turn into an MFN for it.

If we replace that lookup with a _map_ hypercall, either with Xen
choosing the BFN (as happens in the PV grant map operation) or with
the guest choosing an unused address (as happens in the HVM/PVH
grant map operation), then:
 - the only extra code in XenGT itself is that you need to unmap
   when you change the GTT;
 - Xen can track and control exactly which MFNs XenGT/the GPU can access;
 - running XenGT in a driver domain or PVH dom0 ought to work; and
 - we fix the race condition I described above.
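
[A rough sketch of the _map_ variant where Xen chooses the BFN, again as a
toy model.  ToyHypervisor, map_gfn, unmap_bfn, dma and the priv_for set are
all hypothetical names; no such hypercall exists in Xen at this point.]

```python
class ToyHypervisor:
    """Toy model: the backend only ever sees BFNs that Xen handed out."""

    def __init__(self, first_bfn=0x100000):
        self.p2m = {}            # (client domain, gfn) -> mfn
        self.priv_for = set()    # (backend, client) pairs allowed to map
        self.iommu = {}          # backend -> {bfn: mfn}
        self.first_bfn = first_bfn

    def map_gfn(self, backend, client, gfn):
        # Permission is checked at map time and enforced at DMA time.
        if (backend, client) not in self.priv_for:
            raise PermissionError("backend is not privileged over client")
        mfn = self.p2m[(client, gfn)]
        table = self.iommu.setdefault(backend, {})
        bfn = self.first_bfn
        while bfn in table:      # Xen picks the first unused BFN
            bfn += 1
        table[bfn] = mfn
        return bfn

    def unmap_bfn(self, backend, bfn):
        del self.iommu[backend][bfn]

    def dma(self, backend, bfn):
        # What the IOMMU does with an address issued by the device.
        table = self.iommu.get(backend, {})
        if bfn not in table:
            raise PermissionError("IOMMU fault: BFN not mapped")
        return table[bfn]


xen = ToyHypervisor()
xen.p2m[("guest1", 0x10)] = 0x1000
xen.priv_for.add(("xengt", "guest1"))
bfn = xen.map_gfn("xengt", "guest1", 0x10)
```

Here the value XenGT writes into the GTT is bfn, not 0x1000; a raw MFN, or
a BFN that has already been unmapped, faults in dma() instead of reaching
RAM.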

The default policy I'm suggesting is that the XenGT backend domain
should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
which will need a small extension in Xen since at the moment struct
domain has only one "target" field.

BTW, this is the exact analogue of how all other backend and toolstack
operations work -- they request access from Xen to specific pages and
they relinquish it when they are done.  In particular:

> for mapping and accessing other guest's memory, I don't think we 
> need any new interface atop existing ones. Just similar to other backend
> drivers, we can leverage the same permission control.

I don't think that's right -- other backend drivers use the grant
table mechanism, where the guest explicitly grants access to only the
memory it needs.  AIUI you're not suggesting that you'll use that for
XenGT! :)

Right - I hope that made some sense.  I'll go get another cup of
coffee and start on that other reply...

Cheers,

Tim.

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-11 16:46         ` Tim Deegan
@ 2014-12-12  7:24           ` Tian, Kevin
  2014-12-12 10:54             ` Jan Beulich
  2014-12-18 15:46             ` Tim Deegan
  0 siblings, 2 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-12  7:24 UTC (permalink / raw)
  To: Tim Deegan
  Cc: Yu, Zhang, Paul.Durrant@citrix.com, keir@xen.org,
	JBeulich@suse.com, Xen-devel@lists.xen.org

> From: Tim Deegan
> Sent: Friday, December 12, 2014 12:47 AM
> 
> Hi,
> 
> At 01:41 +0000 on 11 Dec (1418258504), Tian, Kevin wrote:
> > > From: Tim Deegan [mailto:tim@xen.org]
> > > It is Xen's job to isolate VMs from each other.  As part of that, Xen
> > > uses the MMU, nested paging, and IOMMUs to control access to RAM.  Any
> > > software component that can pass a raw MFN to hardware breaks that
> > > isolation, because Xen has no way of controlling what that component
> > > can do (including taking over the hypervisor).  This is why I am
> > > afraid when developers ask for GFN->MFN translation functions.
> >
> > While I absolutely agree that this is Xen's job, isolation is also required
> > at different layers, depending on who controls the resource and where the
> > virtualization happens. For example, for I/O virtualization, Dom0 or a driver
> > domain needs to isolate backend drivers from each other so that one backend
> > cannot interfere with another. Xen can't detect such a violation, since it
> > only knows that Dom0 wants to access a VM's page.
> 
> I'm going to write a second reply to this mail in a bit, to talk about
> this kind of system-level design.  In this email I'll just talk about
> the practical aspects of interfaces and address spaces and IOMMUs.

sure. I replied to the other design mail before seeing this one. A bad Outlook
rule kept this mail out of my sight; fortunately I dug it out after
wondering about the "Hi, again" in your other mail. :-)


> 
> > btw, I'm curious how much worse exposing GFN->MFN translation is compared to
> > allowing mapping of another VM's GFNs. If exposing GFN->MFN were under the
> > same permission control as mapping, would that address your worry here?
> 
> I'm afraid not.  There's nothing worrying per se in a backend knowing
> the MFNs of the pages -- the worry is that the backend can pass the
> MFNs to hardware.  If the check happens only at lookup time, then XenGT
> can (either through a bug or a security breach) just pass _any_ MFN to
> the GPU for DMA.
> 
> But even without considering the security aspects, this model has bugs
> that may be impossible for XenGT itself to even detect.  E.g.:
>  1. Guest asks its virtual GPU to DMA to a frame of memory;
>  2. XenGT looks up the GFN->MFN mapping;
>  3. Guest balloons out the page;
>  4. Xen allocates the page to a different guest;
>  5. XenGT passes the MFN to the GPU, which DMAs to it.
> 
> Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
> underlying memory and make sure it doesn't get reallocated until XenGT
> is finished with it.

yes, I see your point. Right now we can't support ballooning in the VM for
the reason above, and a refcount is required to close that gap.

but just to confirm one point: from my understanding, whether it's a
mapping operation doesn't really matter. We could invent an interface
that gets the p2m mapping and then increases the refcount. The key here is
the refcount: when XenGT constructs a shadow GPU page table, it creates a
reference to a guest memory page, so the refcount must be increased. :-)

> 
> > > When the backend component gets a GFN from the guest, it wants an
> > > address that it can give to the GPU for DMA that will map the right
> > > memory.  That address must be mapped in the IOMMU tables that the GPU
> > > will be using, which means the IOMMU tables of the backend domain,
> > > IIUC[1].  So the hypercall it needs is not "give me the MFN that matches
> > > this GFN" but "please map this GFN into my IOMMU tables".
> >
> > Here "please map this GFN into my IOMMU tables" actually breaks the
> > IOMMU isolation. IOMMU is designed for serving DMA requests issued
> > by an exclusive VM, so IOMMU page table can restrict that VM's attempts
> > strictly.
> >
> > To map multiple VMs' GFNs into one IOMMU table, the first thing is to
> > avoid GFN conflicts so that it works at all. We thought about this approach
> > previously, e.g. by reserving the highest 3 bits of the GFN as a VMID, so one
> > IOMMU page table can combine multiple VMs' page tables. However,
> > doing so has several limitations:
> >
> > a) it still requires write-protect guest GPU page table, and maintain a shadow
> > GPU page table by translate from real GFN to pseudo GFN (plus VMID), which
> > doesn't save any engineering effort in the device model part
> 
> Yes -- since there's only one IOMMU context for the whole GPU, the
> XenGT backend still has to audit all GPU commands to maintain
> isolation between clients.
> 
> > b) it breaks the designed isolation intrinsic of IOMMU. In such case, IOMMU
> > can't isolate multiple VMs by itself, since a DMA request can target any
> > pseudo GFN if valid in the page table. We have to rely on the audit in the
> > backend component in Dom0 to ensure the isolation.
> 
> Yep.
> 
> > c) this introduces tricky logic in IOMMU driver to handle such non-standard
> > multiplexed page table style.
> >
> > w/o a SR-IOV implementation (so each VF has its own IOMMU page table),
> > I don't see using IOMMU can help isolation here.
> 
> If I've understood your argument correctly, it basically comes down
> to "It would be extra work for no benefit, because XenGT still has to
> do all the work of isolating GPU clients from each other".  It's true
> that XenGT still has to isolate its clients, but there are other
> benefits.
> 
> The main one, from my point of view as a Xen maintainer, is that it
> allows Xen to constrain XenGT itself, in the case where bugs or
> security breaches mean that XenGT tries to access memory it shouldn't.
> More about that in my other reply.  I'll talk about the rest below.
> 
> > Yes, this is good feedback that we didn't think about before. So far XenGT
> > works only because we use the default IOMMU setting, which sets
> > up a 1:1 r/w mapping for all possible RAM, so when the GPU hits an MFN through
> > the shadow GPU page table, the IOMMU is essentially bypassed. However, as
> > you said, if the IOMMU page table is restricted to dom0's memory, or is not
> > a 1:1 identity mapping, XenGT will be broken.
> >
> > However I don't see a good solution for this, except using multiplexed
> > IOMMU page table aforementioned, which however doesn't look like
> > a sane design to me.
> 
> Right.  AIUI you're talking about having a component, maybe in Xen,
> that automatically makes a merged IOMMU table that contains multiple
> VMs' p2m tables all at once.  I think that we can do something simpler
> than that which will have the same effect and also avoid race
> conditions like the one I mentioned at the top of the email.
> 
> [First some hopefully-helpful diagrams to explain my thinking.  I'll
>  borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
>  addresses that devices issue their DMAs in:

what's 'BFN' short for? Bus Frame Number?

> 
>  Here's how the translations work for a HVM guest using HAP:
> 
>    CPU    <- Code supplied by the guest
>     |
>   (VA)
>     |
>    MMU    <- Pagetables supplied by the guest
>     |
>   (GFN)
>     |
>    HAP    <- Guest's P2M, supplied by Xen
>     |
>   (MFN)
>     |
>    RAM
> 
>  Here's how it looks for a GPU operation using XenGT:
> 
>    GPU       <- Code supplied by Guest, audited by XenGT
>     |
>   (GPU VA)
>     |
>   GPU-MMU    <- GTTs supplied by XenGT (by shadowing guest ones)
>     |
>   (GPU BFN)
>     |
>   IOMMU      <- XenGT backend dom's P2M (for PVH/HVM) or IOMMU tables (for PV)
>     |
>   (MFN)
>     |
>    RAM
> 
>  OK, on we go...]
> 
> Somewhere in the existing XenGT code, XenGT has a guest GFN in its
> hand and makes a lookup hypercall to find the MFN.  It puts that MFN
> into the GTTs that it passes to the GPU.  But an MFN is not actually
> what it needs here -- it needs a GPU BFN, which the IOMMU will then
> turn into an MFN for it.
> 
> If we replace that lookup with a _map_ hypercall, either with Xen
> choosing the BFN (as happens in the PV grant map operation) or with
> the guest choosing an unused address (as happens in the HVM/PVH
> grant map operation), then:
>  - the only extra code in XenGT itself is that you need to unmap
>    when you change the GTT;
>  - Xen can track and control exactly which MFNs XenGT/the GPU can access;
>  - running XenGT in a driver domain or PVH dom0 ought to work; and
>  - we fix the race condition I described above.

ok, I see your point here. It does sound like a better design to meet
Xen hypervisor's security requirement and can also work with PVH
Dom0 or driver domain. Previously even when we said a MFN is
required, it's actually a BFN due to IOMMU existence, and it works
just because we have a 1:1 identity mapping in-place. And by finding
a BFN

some follow-up thoughts here:

- one extra unmap call will have some performance impact, especially
for media processing workloads where GPU page table modifications
are hot, but I suppose this can be optimized by batching requests

- is there existing _map_ call for this purpose per your knowledge, or
a new one is required? If the latter, what's the additional logic to be
implemented there?

- when you say _map_, do you expect this mapped into dom0's virtual
address space, or just guest physical space?

- how is BFN or unused address (what do you mean by address here?)
allocated? does it need present in guest physical memory at boot time,
or just finding some holes?

- graphics memory size could be large. starting from BDW, there'll
be 64bit page table format. Do you see any limitation here on finding
BFN or address?

> 
> The default policy I'm suggesting is that the XenGT backend domain
> should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
> which will need a small extension in Xen since at the moment struct
> domain has only one "target" field.

Is that connection set up by the toolstack or by the hypervisor today?

> 
> BTW, this is the exact analogue of how all other backend and toolstack
> operations work -- they request access from Xen to specific pages and
> they relinquish it when they are done.  In particular:

agree.

> 
> > for mapping and accessing other guest's memory, I don't think we
> > need any new interface atop existing ones. Just similar to other backend
> > drivers, we can leverage the same permission control.
> 
> I don't think that's right -- other backend drivers use the grant
> table mechanism, where the guest explicitly grants access to only the
> memory it needs.  AIUI you're not suggesting that you'll use that for
> XenGT! :)

yes, we're running native graphics driver in VM, not PV driver

> 
> Right - I hope that made some sense.  I'll go get another cup of
> coffee and start on that other reply...
> 
> Cheers,
> 

Really appreciate your explanation here. It makes lots of sense to me.

Thanks
Kevin

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-12  7:24           ` Tian, Kevin
@ 2014-12-12 10:54             ` Jan Beulich
  2014-12-15  6:25               ` Tian, Kevin
  2014-12-18 15:46             ` Tim Deegan
  1 sibling, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-12 10:54 UTC (permalink / raw)
  To: Kevin Tian
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

>>> On 12.12.14 at 08:24, <kevin.tian@intel.com> wrote:
> - is there existing _map_ call for this purpose per your knowledge, or
> a new one is required? If the latter, what's the additional logic to be
> implemented there?

I think the answer to this depends on whether you want to use
grants. The goal of using the native driver in the guest (mentioned
further down) speaks against this, in which case I don't think we
have an existing interface.

> - when you say _map_, do you expect this mapped into dom0's virtual
> address space, or just guest physical space?

Iiuc you don't care about the memory to be visible to the CPU, all
you need is it being translated by the IOMMU. In which case the
input address space for the IOMMU (which is different between PV
and PVH) is where this needs to be mapped into.

> - how is BFN or unused address (what do you mean by address here?)
> allocated? does it need present in guest physical memory at boot time,
> or just finding some holes?

Fitting this into holes should be fine.

> - graphics memory size could be large. starting from BDW, there'll
> be 64bit page table format. Do you see any limitation here on finding
> BFN or address?

I don't think this concern differs much for the different models: As long
as you don't want the same underlying memory to be accessible by
more than one guest, the address space requirements ought to be the
same.

Jan

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-12 10:54             ` Jan Beulich
@ 2014-12-15  6:25               ` Tian, Kevin
  2014-12-15  8:44                 ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2014-12-15  6:25 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Friday, December 12, 2014 6:54 PM
> 
> >>> On 12.12.14 at 08:24, <kevin.tian@intel.com> wrote:
> > - is there existing _map_ call for this purpose per your knowledge, or
> > a new one is required? If the latter, what's the additional logic to be
> > implemented there?
> 
> I think the answer to this depends on whether you want to use
> grants. The goal of using the native driver in the guest (mentioned
> further down) speaks against this, in which case I don't think we
> have an existing interface.

yes, grants don't apply here. 

> 
> > - when you say _map_, do you expect this mapped into dom0's virtual
> > address space, or just guest physical space?
> 
> Iiuc you don't care about the memory to be visible to the CPU, all
> you need is it being translated by the IOMMU. In which case the
> input address space for the IOMMU (which is different between PV
> and PVH) is where this needs to be mapped into.

it should be at the p2m level, not just in the IOMMU. Otherwise I suspect
there will be tricky issues ahead due to inconsistent mappings between the EPT
and the IOMMU page table (though specific attributes like r/w may be
different, cf. the previous split table discussion).

another reason: if we just talk about the shadow GPU page table, yes,
it's used by the device only, so an IOMMU mapping is enough. However, we do
have several other places where we need to map and access guest memory,
e.g. scanning commands in a buffer mapped through the GPU page table
(currently through remap_domain_mfn_range_in_kernel).

> 
> > - how is BFN or unused address (what do you mean by address here?)
> > allocated? does it need present in guest physical memory at boot time,
> > or just finding some holes?
> 
> Fitting this into holes should be fine.

this is an interesting open question to be discussed further. Here we need to
consider the extreme case, i.e. a 64bit GPU page table can legitimately use up
all the system memory allocated to that VM, and with dozens of VMs,
that means we need to reserve a very large hole.

I remember some similar cases that required grabbing unmapped
pfns (in the grant table?). So I wonder whether there's already a clean interface
for this purpose, or whether we need to add a new one to allocate unmapped pfns
(without conflicting with usages like memory hotplug)...

appreciate any suggestion here.

> 
> > - graphics memory size could be large. starting from BDW, there'll
> > be 64bit page table format. Do you see any limitation here on finding
> > BFN or address?
> 
> I don't think this concern differs much for the different models: As long
> as you don't want the same underlying memory to be accessible by
> more than one guest, the address space requirements ought to be the
> same.

See above.

Thanks
Kevin

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15  6:25               ` Tian, Kevin
@ 2014-12-15  8:44                 ` Jan Beulich
  2014-12-15  9:05                   ` Tian, Kevin
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-15  8:44 UTC (permalink / raw)
  To: Kevin Tian
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

>>> On 15.12.14 at 07:25, <kevin.tian@intel.com> wrote:
>>  From: Jan Beulich [mailto:JBeulich@suse.com]
>> >>> On 12.12.14 at 08:24, <kevin.tian@intel.com> wrote:
>> > - how is BFN or unused address (what do you mean by address here?)
>> > allocated? does it need present in guest physical memory at boot time,
>> > or just finding some holes?
>> 
>> Fitting this into holes should be fine.
> 
> this is an interesting open question to be discussed further. Here we need to
> consider the extreme case, i.e. a 64bit GPU page table can legitimately use up
> all the system memory allocated to that VM, and with dozens of VMs,
> that means we need to reserve a very large hole.

Oh, it's guest RAM you want mapped, not frame buffer space. But still
you're never going to have to map more than the total amount of host
RAM, and (with Linux) we already assume everything can be mapped
through the 1:1 mapping. I.e. the only collision would be with excessive
PFN reservations for ballooning purposes.

Jan

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15  8:44                 ` Jan Beulich
@ 2014-12-15  9:05                   ` Tian, Kevin
  2014-12-15  9:22                     ` Jan Beulich
  0 siblings, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2014-12-15  9:05 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, December 15, 2014 4:45 PM
> 
> >>> On 15.12.14 at 07:25, <kevin.tian@intel.com> wrote:
> >>  From: Jan Beulich [mailto:JBeulich@suse.com]
> >> >>> On 12.12.14 at 08:24, <kevin.tian@intel.com> wrote:
> >> > - how is BFN or unused address (what do you mean by address here?)
> >> > allocated? does it need present in guest physical memory at boot time,
> >> > or just finding some holes?
> >>
> >> Fitting this into holes should be fine.
> >
> > this is an interesting open question to be discussed further. Here we need to
> > consider the extreme case, i.e. a 64bit GPU page table can legitimately use up
> > all the system memory allocated to that VM, and with dozens of VMs,
> > that means we need to reserve a very large hole.
> 
> Oh, it's guest RAM you want mapped, not frame buffer space. But still
> you're never going to have to map more than the total amount of host
> RAM, and (with Linux) we already assume everything can be mapped
> through the 1:1 mapping. I.e. the only collision would be with excessive
> PFN reservations for ballooning purposes.
> 

Intel GPU has graphics memory (or framebuffer) backed by system
memory, and we need to walk the GPU page table and then map the corresponding
guest RAM for handling.

yes, definitely host RAM is the upper limit, and what concerns me here
is how to reserve (at boot time) or allocate (on-demand) such a large PFN
resource, w/o collision with other PFN reservation usages (ballooning
should be fine since it operates on existing RAM ranges in dom0's e820
table). Maybe we can reserve a big-enough reserved region in dom0's
e820 table at boot time, for all PFN reservation usages, and then allocate
from it on-demand for specific usages?

Thanks
Kevin

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15  9:05                   ` Tian, Kevin
@ 2014-12-15  9:22                     ` Jan Beulich
  2014-12-15 11:16                       ` Tian, Kevin
  2014-12-15 15:22                       ` Stefano Stabellini
  0 siblings, 2 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-15  9:22 UTC (permalink / raw)
  To: Kevin Tian
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

>>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote:
> yes, definitely host RAM is the upper limit, and what I'm concerning here
> is how to reserve (at boot time) or allocate (on-demand) such large PFN
> resource, w/o collision with other PFN reservation usage (ballooning
> should be fine since it's operating existing RAM ranges in dom0 e820
> table).

I don't think ballooning is restricted to the regions named RAM in
Dom0's E820 table (at least it shouldn't be, and wasn't in the
classic Xen kernels).

> Maybe we can reserve a big-enough reserved region in dom0's 
> e820 table at boot time, for all PFN reservation usages, and then allocate
> them on-demand for specific usages?

What would "big enough" here mean (i.e. how would one determine
the needed size up front)? Plus any form of allocation would need a
reasonable approach to avoid fragmentation. And anyway I'm not
getting what position you're on: Do you expect to be able to fit
everything that needs mapping into the available mapping space (as
your reply above seems to imply) or do you think there won't be
enough mapping space (as earlier replies of yours appeared to
indicate)?

Jan

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15  9:22                     ` Jan Beulich
@ 2014-12-15 11:16                       ` Tian, Kevin
  2014-12-15 11:27                         ` Jan Beulich
  2014-12-15 15:22                       ` Stefano Stabellini
  1 sibling, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2014-12-15 11:16 UTC (permalink / raw)
  To: Jan Beulich
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

> From: Jan Beulich [mailto:JBeulich@suse.com]
> Sent: Monday, December 15, 2014 5:23 PM
> 
> >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote:
> > yes, definitely host RAM is the upper limit, and what I'm concerning here
> > is how to reserve (at boot time) or allocate (on-demand) such large PFN
> > resource, w/o collision with other PFN reservation usage (ballooning
> > should be fine since it's operating existing RAM ranges in dom0 e820
> > table).
> 
> I don't think ballooning is restricted to the regions named RAM in
> Dom0's E820 table (at least it shouldn't be, and wasn't in the
> classic Xen kernels).

well, nice to know that.

> 
> > Maybe we can reserve a big-enough reserved region in dom0's
> > e820 table at boot time, for all PFN reservation usages, and then allocate
> > them on-demand for specific usages?
> 
> What would "big enough" here mean (i.e. how would one determine
> the needed size up front)? Plus any form of allocation would need a
> reasonable approach to avoid fragmentation. And anyway I'm not
> getting what position you're on: Do you expect to be able to fit
> everything that needs mapping into the available mapping space (as
> your reply above seems to imply) or do you think there won't be
> enough mapping space (as earlier replies of yours appeared to
> indicate)?
> 

I expect to have everything mapped into the available mapping space,
and am asking for suggestions on the best way to find and reserve
available PFNs in a way that doesn't conflict with other usages (either
virtualization features like ballooning that you mentioned, or bare
metal features like PCI hotplug or memory hotplug).

Thanks
Kevin

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15 11:16                       ` Tian, Kevin
@ 2014-12-15 11:27                         ` Jan Beulich
  0 siblings, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-15 11:27 UTC (permalink / raw)
  To: Kevin Tian
  Cc: keir@xen.org, Tim Deegan, Paul.Durrant@citrix.com, Zhang Yu,
	Xen-devel@lists.xen.org

>>> On 15.12.14 at 12:16, <kevin.tian@intel.com> wrote:
> I expect to have everything mapped into the available mapping space,
> and am asking for suggestions on the best way to find and reserve
> available PFNs in a way that doesn't conflict with other usages (either
> virtualization features like ballooning that you mentioned, or bare
> metal features like PCI hotplug or memory hotplug).

Not conflicting with memory hotplug ought to be technically possible
(using SRAT information), but if all physical address space is marked
as possibly being used for hotplug memory this wouldn't help your
case. PCI hotplug (or even just dynamic resource re-assignment)
might be quite a bit more tricky, or would require (as you suggested
earlier) to mark certain regions as reserved in the E820 Dom0
receives. Not conflicting with ballooning is - just like memory hotplug -
simply dependent on enough space not being used for that purpose.
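
[As a rough illustration of "fitting this into holes": a sketch that scans
an E820-like list for unused PFN ranges.  The function name, the tuple
layout and the example map are made up for the example; it knows nothing
about SRAT hotplug ranges, PCI resource assignment or ballooning, so any
hole it reports could still be claimed by one of those.]

```python
def find_holes(e820, limit, min_size):
    """Return [start, end) PFN ranges below `limit` that no entry covers.

    `e820` is a list of (start_pfn, end_pfn, kind) tuples with exclusive
    end; only gaps of at least `min_size` frames are reported.
    """
    holes = []
    cursor = 0
    for start, end, _kind in sorted(e820):
        start = min(start, limit)          # ignore anything above the limit
        if start - cursor >= min_size:
            holes.append((cursor, start))
        cursor = max(cursor, min(end, limit))
    if limit - cursor >= min_size:
        holes.append((cursor, limit))
    return holes


# Hypothetical Dom0 map, in page frames.
example_map = [
    (0x0, 0xA0, "RAM"),
    (0x100, 0x8000, "RAM"),
    (0xF000, 0x10000, "reserved"),
]
holes = find_holes(example_map, limit=0x20000, min_size=0x1000)
```

For this made-up map the candidates are 0x8000-0xF000 and 0x10000-0x20000;
the small gap at 0xA0-0x100 is skipped because it is below min_size.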

Jan

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15  9:22                     ` Jan Beulich
  2014-12-15 11:16                       ` Tian, Kevin
@ 2014-12-15 15:22                       ` Stefano Stabellini
  2014-12-15 16:01                         ` Jan Beulich
  1 sibling, 1 reply; 59+ messages in thread
From: Stefano Stabellini @ 2014-12-15 15:22 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, keir@xen.org, Tim Deegan, Xen-devel@lists.xen.org,
	Paul.Durrant@citrix.com, Zhang Yu

On Mon, 15 Dec 2014, Jan Beulich wrote:
> >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote:
> > yes, definitely host RAM is the upper limit, and what I'm concerning here
> > is how to reserve (at boot time) or allocate (on-demand) such large PFN
> > resource, w/o collision with other PFN reservation usage (ballooning
> > should be fine since it's operating existing RAM ranges in dom0 e820
> > table).
> 
> I don't think ballooning is restricted to the regions named RAM in
> Dom0's E820 table (at least it shouldn't be, and wasn't in the
> classic Xen kernels).

Could you please elaborate more on this? It seems counter-intuitive at best.

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15 15:22                       ` Stefano Stabellini
@ 2014-12-15 16:01                         ` Jan Beulich
  2014-12-15 16:15                           ` Stefano Stabellini
  0 siblings, 1 reply; 59+ messages in thread
From: Jan Beulich @ 2014-12-15 16:01 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Kevin Tian, keir@xen.org, TimDeegan, Xen-devel@lists.xen.org,
	Paul.Durrant@citrix.com, Zhang Yu

>>> On 15.12.14 at 16:22, <stefano.stabellini@eu.citrix.com> wrote:
> On Mon, 15 Dec 2014, Jan Beulich wrote:
>> >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote:
>> > yes, definitely host RAM is the upper limit, and what I'm concerning here
>> > is how to reserve (at boot time) or allocate (on-demand) such large PFN
>> > resource, w/o collision with other PFN reservation usage (ballooning
>> > should be fine since it's operating existing RAM ranges in dom0 e820
>> > table).
>> 
>> I don't think ballooning is restricted to the regions named RAM in
>> Dom0's E820 table (at least it shouldn't be, and wasn't in the
>> classic Xen kernels).
> 
> Could you please elaborate more on this? It seems counter-intuitive at best.

I don't see what's counter-intuitive here. How can the hypervisor
(Dom0) or tool stack (DomU) know what ballooning intentions a
guest kernel may have? It's solely the guest kernel's responsibility
to make sure its ballooning activities don't collide with anything
else address-wise.

Jan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15 16:01                         ` Jan Beulich
@ 2014-12-15 16:15                           ` Stefano Stabellini
  2014-12-15 16:28                             ` David Vrabel
  2014-12-15 16:28                             ` Jan Beulich
  0 siblings, 2 replies; 59+ messages in thread
From: Stefano Stabellini @ 2014-12-15 16:15 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Kevin Tian, keir@xen.org, Stefano Stabellini, TimDeegan,
	Xen-devel@lists.xen.org, Paul.Durrant@citrix.com, Zhang Yu

On Mon, 15 Dec 2014, Jan Beulich wrote:
> >>> On 15.12.14 at 16:22, <stefano.stabellini@eu.citrix.com> wrote:
> > On Mon, 15 Dec 2014, Jan Beulich wrote:
> >> >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote:
> >> > yes, definitely host RAM is the upper limit, and what I'm concerning here
> >> > is how to reserve (at boot time) or allocate (on-demand) such large PFN
> >> > resource, w/o collision with other PFN reservation usage (ballooning
> >> > should be fine since it's operating existing RAM ranges in dom0 e820
> >> > table).
> >> 
> >> I don't think ballooning is restricted to the regions named RAM in
> >> Dom0's E820 table (at least it shouldn't be, and wasn't in the
> >> classic Xen kernels).
> > 
> > Could you please elaborate more on this? It seems counter-intuitive at best.
> 
> I don't see what's counter-intuitive here. How can the hypervisor
> (Dom0) or tool stack (DomU) know what ballooning intentions a
> guest kernel may have?

The hypervisor checks that the memory the guest is giving back is
actually ram, as a consequence the ballooning interface only supports
ram. Do you agree?

Ballooning is restricted to regions named RAM in the e820 table, because
Linux respects e820 in its pfn->mfn mappings. However it is true that
respecting the e820 in dom0 is not part of the interface.


> It's solely the guest kernel's responsibility
> to make sure its ballooning activities don't collide with anything
> else address-wise.

In the sense that it is in the guest kernel's responsibility to use the
interface properly.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15 16:15                           ` Stefano Stabellini
@ 2014-12-15 16:28                             ` David Vrabel
  2014-12-15 16:28                             ` Jan Beulich
  1 sibling, 0 replies; 59+ messages in thread
From: David Vrabel @ 2014-12-15 16:28 UTC (permalink / raw)
  To: Stefano Stabellini, Jan Beulich
  Cc: Kevin Tian, keir@xen.org, TimDeegan, Xen-devel@lists.xen.org,
	Paul.Durrant@citrix.com, Zhang Yu

On 15/12/14 16:15, Stefano Stabellini wrote:
> On Mon, 15 Dec 2014, Jan Beulich wrote:
>>>>> On 15.12.14 at 16:22, <stefano.stabellini@eu.citrix.com> wrote:
>>> On Mon, 15 Dec 2014, Jan Beulich wrote:
>>>>>>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote:
>>>>> yes, definitely host RAM is the upper limit, and what I'm concerning here
>>>>> is how to reserve (at boot time) or allocate (on-demand) such large PFN
>>>>> resource, w/o collision with other PFN reservation usage (ballooning
>>>>> should be fine since it's operating existing RAM ranges in dom0 e820
>>>>> table).
>>>>
>>>> I don't think ballooning is restricted to the regions named RAM in
>>>> Dom0's E820 table (at least it shouldn't be, and wasn't in the
>>>> classic Xen kernels).
>>>
>>> Could you please elaborate more on this? It seems counter-intuitive at best.
>>
>> I don't see what's counter-intuitive here. How can the hypervisor
>> (Dom0) or tool stack (DomU) know what ballooning intentions a
>> guest kernel may have?
> 
> The hypervisor checks that the memory the guest is giving back is
> actually ram, as a consequence the ballooning interface only supports
> ram. Do you agree?
> 
> Ballooning is restricted to regions named RAM in the e820 table, because
> Linux respects e820 in its pfn->mfn mappings. However it is true that
> respecting the e820 in dom0 is not part of the interface.

Linux will quite happily allow you to add memory outside of the initial
e820 RAM regions.  The current balloon driver even supports this using
the kernel's generic memory hotplug infrastructure.

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-15 16:15                           ` Stefano Stabellini
  2014-12-15 16:28                             ` David Vrabel
@ 2014-12-15 16:28                             ` Jan Beulich
  1 sibling, 0 replies; 59+ messages in thread
From: Jan Beulich @ 2014-12-15 16:28 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: Kevin Tian, keir@xen.org, TimDeegan, Xen-devel@lists.xen.org,
	Paul.Durrant@citrix.com, Zhang Yu

>>> On 15.12.14 at 17:15, <stefano.stabellini@eu.citrix.com> wrote:
> On Mon, 15 Dec 2014, Jan Beulich wrote:
>> >>> On 15.12.14 at 16:22, <stefano.stabellini@eu.citrix.com> wrote:
>> > On Mon, 15 Dec 2014, Jan Beulich wrote:
>> >> >>> On 15.12.14 at 10:05, <kevin.tian@intel.com> wrote:
>> >> > yes, definitely host RAM is the upper limit, and what I'm concerning here
>> >> > is how to reserve (at boot time) or allocate (on-demand) such large PFN
>> >> > resource, w/o collision with other PFN reservation usage (ballooning
>> >> > should be fine since it's operating existing RAM ranges in dom0 e820
>> >> > table).
>> >> 
>> >> I don't think ballooning is restricted to the regions named RAM in
>> >> Dom0's E820 table (at least it shouldn't be, and wasn't in the
>> >> classic Xen kernels).
>> > 
>> > Could you please elaborate more on this? It seems counter-intuitive at best.
>> 
>> I don't see what's counter-intuitive here. How can the hypervisor
>> (Dom0) or tool stack (DomU) know what ballooning intentions a
>> guest kernel may have?
> 
> The hypervisor checks that the memory the guest is giving back is
> actually ram, as a consequence the ballooning interface only supports
> ram. Do you agree?

Of course.

> Ballooning is restricted to regions named RAM in the e820 table, because
> Linux respects e820 in its pfn->mfn mappings. However it is true that
> respecting the e820 in dom0 is not part of the interface.

Right. Plus the kernel is free to extend the region(s) perceived as
RAM in the E820 it sees (makes up) at boot time.

>> It's solely the guest kernel's responsibility
>> to make sure its ballooning activities don't collide with anything
>> else address-wise.
> 
> In the sense that it is in the guest kernel's responsibility to use the
> interface properly.

That's a given for this discussion. The important aspect is that neither
tools nor hypervisor have any influence on how a PV kernel
partitions its PFN space - the only thing they control is the boot time
state thereof.

Jan

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-12  7:24           ` Tian, Kevin
  2014-12-12 10:54             ` Jan Beulich
@ 2014-12-18 15:46             ` Tim Deegan
  2015-01-06  8:56               ` Tian, Kevin
  1 sibling, 1 reply; 59+ messages in thread
From: Tim Deegan @ 2014-12-18 15:46 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir, Xen-devel, Paul.Durrant, Yu, Zhang, David Vrabel, JBeulich,
	Malcolm Crossley

Hi, 

At 07:24 +0000 on 12 Dec (1418365491), Tian, Kevin wrote:
> > I'm afraid not.  There's nothing worrying per se in a backend knowing
> > the MFNs of the pages -- the worry is that the backend can pass the
> > MFNs to hardware.  If the check happens only at lookup time, then XenGT
> > can (either through a bug or a security breach) just pass _any_ MFN to
> > the GPU for DMA.
> > 
> > But even without considering the security aspects, this model has bugs
> > that may be impossible for XenGT itself to even detect.  E.g.:
> >  1. Guest asks its virtual GPU to DMA to a frame of memory;
> >  2. XenGT looks up the GFN->MFN mapping;
> >  3. Guest balloons out the page;
> >  4. Xen allocates the page to a different guest;
> >  5. XenGT passes the MFN to the GPU, which DMAs to it.
> > 
> > Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
> > underlying memory and make sure it doesn't get reallocated until XenGT
> > is finished with it.
> 
> yes, I see your point. Now we can't support ballooning in VM given above
> reason, and refcnt is required to close that gap.
> 
> but just to confirm one point. from my understanding whether it's a 
> mapping operation doesn't really matter. We can invent an interface
> to get p2m mapping and then increase refcnt. the key is refcnt here.
> when XenGT constructs a shadow GPU page table, it creates a reference
> to guest memory page so the refcnt must be increased. :-)

True. :)  But Xen does need to remember all the refcounts that were
created (so it can tidy up if the domain crashes).  If Xen is already
doing that it might as well do it in the IOMMU tables since that
solves other problems.
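
To make the bookkeeping point concrete, here is a toy model (a sketch only, not Xen code; the class and method names are made up for illustration). It assumes the hypervisor records one reference per (MFN, backend domain) mapping, refuses to hand a frame to another guest while any reference is live, and can sweep a crashed backend's references because they were all recorded:

```python
from collections import defaultdict


class FrameTracker:
    """Hypothetical model of per-frame reference tracking (not a Xen API)."""

    def __init__(self):
        # refs[mfn][backend_domid] -> number of live mappings
        self.refs = defaultdict(lambda: defaultdict(int))

    def map(self, backend, mfn):
        # Taking a mapping pins the frame for this backend.
        self.refs[mfn][backend] += 1

    def unmap(self, backend, mfn):
        if self.refs[mfn][backend] == 0:
            raise ValueError("unmap without matching map")
        self.refs[mfn][backend] -= 1

    def can_reallocate(self, mfn):
        # A ballooned-out frame may go to another guest only when unmapped.
        return sum(self.refs[mfn].values()) == 0

    def backend_crashed(self, backend):
        # Every reference was recorded, so cleanup is a simple sweep.
        for holders in self.refs.values():
            holders.pop(backend, None)
```

With this in place, step 4 of the race above (reallocating the ballooned page) would be deferred until the backend has unmapped the frame.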

> > [First some hopefully-helpful diagrams to explain my thinking.  I'll
> >  borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
> >  addresses that devices issue their DMAs in:
> 
> what's 'BFN' short for? Bus Frame Number?

Yes, I think so.

> > If we replace that lookup with a _map_ hypercall, either with Xen
> > choosing the BFN (as happens in the PV grant map operation) or with
> > the guest choosing an unused address (as happens in the HVM/PVH
> > grant map operation), then:
> >  - the only extra code in XenGT itself is that you need to unmap
> >    when you change the GTT;
> >  - Xen can track and control exactly which MFNs XenGT/the GPU can access;
> >  - running XenGT in a driver domain or PVH dom0 ought to work; and
> >  - we fix the race condition I described above.
> 
> ok, I see your point here. It does sound like a better design to meet
> Xen hypervisor's security requirement and can also work with PVH
> Dom0 or driver domain. Previously even when we said a MFN is
> required, it's actually a BFN due to IOMMU existence, and it works
> just because we have a 1:1 identity mapping in-place. And by finding
> a BFN
> 
> some follow-up think here:
> 
> - one extra unmap call will have some performance impact, especially
> for media processing workloads where GPU page table modifications
> are hot. but suppose this can be optimized with batch request

Yep.  In general I'd hope that the extra overhead of unmap is small
compared with the trap + emulate + ioreq + schedule that's just
happened.  Though I know that IOTLB shootdowns are potentially rather
expensive right now so it might want some measurement.

> - is there existing _map_ call for this purpose per your knowledge, or
> a new one is required? If the latter, what's the additional logic to be
> implemented there?

For PVH, the XENMEM_add_to_physmap (gmfn_foreign) path ought to do
what you need, I think.  For PV, I think we probably need a new map
operation with sensible semantics.  My inclination would be to have it
follow the grant-map semantics (i.e. caller supplies domid + gfn,
hypervisor supplies BFN and success/failure code). 
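
Roughly the shape I have in mind, as a sketch under stated assumptions: none of these names exist in Xen, the status convention (0 for success, negative for failure) is invented here, and the hypervisor is assumed to pick BFN == MFN as it does for PV grant maps today.

```python
from dataclasses import dataclass


@dataclass
class MapResult:
    status: int  # 0 on success, negative on failure (hypothetical convention)
    bfn: int     # only meaningful when status == 0


class ToyHypervisor:
    """Hypothetical stand-in for the hypervisor side of a PV map call."""

    def __init__(self, p2m):
        self.p2m = p2m    # (domid, gfn) -> mfn
        self.iommu = {}   # bfn -> mfn entries installed for the backend

    def map_for_dma(self, domid, gfn):
        # Caller supplies domid + gfn; hypervisor supplies BFN and status.
        mfn = self.p2m.get((domid, gfn))
        if mfn is None:
            return MapResult(status=-1, bfn=0)
        bfn = mfn  # hypervisor's choice; identity in this sketch
        self.iommu[bfn] = mfn
        return MapResult(status=0, bfn=bfn)

    def unmap_for_dma(self, bfn):
        # The caller hands back the BFN it was given at map time.
        return 0 if self.iommu.pop(bfn, None) is not None else -1
```

The point of the design is that the backend never needs to ask "what is the MFN?"; it only ever sees a BFN that Xen has already installed and can later revoke.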

Malcolm might have opinions about this -- it starts looking like the
sort of PV IOMMU interface he's suggested before. 

> - when you say _map_, do you expect this mapped into dom0's virtual
> address space, or just guest physical space?

For PVH, I mean into guest physical address space (and iommu tables,
since those are the same).  For PV, I mean just the IOMMU tables --
since the guest controls its own PFN space entirely there's nothing
Xen can do to map things into it.

> - how is BFN or unused address (what do you mean by address here?)
> allocated? does it need present in guest physical memory at boot time,
> or just finding some holes?

That's really a question for the xen maintainers in the linux kernel.
I presume that whatever bookkeeping they currently do for grant-mapped
memory would suffice here just as well.

> - graphics memory size could be large. starting from BDW, there'll
> be 64bit page table format. Do you see any limitation here on finding
> BFN or address?

Not really.  The IOMMU tables are also 64-bit so there must be enough
addresses to map all of RAM.  There shouldn't be any need for these
mappings to be _contiguous_, btw.  You just need to have one free
address for each mapping.  Again, following how grant maps work, I'd
imagine that PVH guests will allocate an unused GFN for each mapping
and do enough bookkeeping to make sure they don't clash with other GFN
users (grant mapping, ballooning, &c).  PV guests will probably be
given a BFN by the hypervisor at map time (which will be == MFN in
practice) and just needs to pass the same BFN to the unmap call later
(it can store it in the GTT meanwhile).
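
For the PVH side, the bookkeeping could be as small as the following sketch. It is an illustration only: the class is hypothetical, and it assumes other GFN users (balloon, grant mapping) register the frames they own through a reserve() call. Note that nothing here requires the allocated GFNs to be contiguous.

```python
class GfnAllocator:
    """Hypothetical allocator of unused GFNs, one per mapping."""

    def __init__(self, first_free, limit):
        self.next = first_free  # lowest GFN never handed out so far
        self.limit = limit      # exclusive upper bound of the usable range
        self.reserved = set()   # GFNs claimed by other users
        self.free = []          # GFNs released by earlier unmaps
        self.in_use = set()

    def reserve(self, gfn):
        # Another subsystem (e.g. the balloon driver) claims this GFN.
        if gfn in self.in_use:
            raise ValueError("GFN already used for a mapping")
        if gfn in self.free:
            self.free.remove(gfn)
        self.reserved.add(gfn)

    def alloc(self):
        if self.free:
            gfn = self.free.pop()
        else:
            while self.next in self.reserved:
                self.next += 1
            if self.next >= self.limit:
                raise MemoryError("no unused GFN left")
            gfn = self.next
            self.next += 1
        self.in_use.add(gfn)
        return gfn

    def release(self, gfn):
        # Called after the matching unmap; the GFN can be reused.
        self.in_use.remove(gfn)
        self.free.append(gfn)
```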

> > The default policy I'm suggesting is that the XenGT backend domain
> > should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
> > which will need a small extension in Xen since at the moment struct
> > domain has only one "target" field.
> 
> Is that connection setup by toolstack or by hypervisor today?

It's set up by the toolstack using XEN_DOMCTL_set_target.  Extending
that to something like XEN_DOMCTL_set_target_list would be OK, I
think, along with some sort of lookup call.  Or maybe an
add_target/remove_target pair would be easier?
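
A sketch of what the add/remove variant might look like, purely as a model: the Domain class, the is_control flag and is_priv_for() below are invented for illustration and only mimic the idea behind IS_PRIV_FOR, with the single "target" field replaced by a set.

```python
class Domain:
    """Hypothetical domain record with a set of targets instead of one."""

    def __init__(self, domid, is_control=False):
        self.domid = domid
        self.is_control = is_control  # stands in for a fully privileged domain
        self.targets = set()          # domids this domain is privileged over

    def add_target(self, target):
        self.targets.add(target.domid)

    def remove_target(self, target):
        self.targets.discard(target.domid)


def is_priv_for(d, target):
    # Privileged if d is the control domain or target is one of d's targets.
    return d.is_control or target.domid in d.targets
```

The toolstack would call add_target when a XenGT client VM is created and remove_target when it is destroyed; the lookup call then reduces to a set membership test.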

Tim.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-18 15:46             ` Tim Deegan
@ 2015-01-06  8:56               ` Tian, Kevin
  2015-01-08 12:43                 ` Tim Deegan
  0 siblings, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2015-01-06  8:56 UTC (permalink / raw)
  To: Tim Deegan
  Cc: keir@xen.org, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com,
	Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, December 18, 2014 11:47 PM
> 
> Hi,
> 
> At 07:24 +0000 on 12 Dec (1418365491), Tian, Kevin wrote:
> > > I'm afraid not.  There's nothing worrying per se in a backend knowing
> > > the MFNs of the pages -- the worry is that the backend can pass the
> > > MFNs to hardware.  If the check happens only at lookup time, then XenGT
> > > can (either through a bug or a security breach) just pass _any_ MFN to
> > > the GPU for DMA.
> > >
> > > But even without considering the security aspects, this model has bugs
> > > that may be impossible for XenGT itself to even detect.  E.g.:
> > >  1. Guest asks its virtual GPU to DMA to a frame of memory;
> > >  2. XenGT looks up the GFN->MFN mapping;
> > >  3. Guest balloons out the page;
> > >  4. Xen allocates the page to a different guest;
> > >  5. XenGT passes the MFN to the GPU, which DMAs to it.
> > >
> > > Whereas if stage 2 is a _mapping_ operation, Xen can refcount the
> > > underlying memory and make sure it doesn't get reallocated until XenGT
> > > is finished with it.
> >
> > yes, I see your point. Now we can't support ballooning in VM given above
> > reason, and refcnt is required to close that gap.
> >
> > but just to confirm one point. from my understanding whether it's a
> > mapping operation doesn't really matter. We can invent an interface
> > to get p2m mapping and then increase refcnt. the key is refcnt here.
> > when XenGT constructs a shadow GPU page table, it creates a reference
> > to guest memory page so the refcnt must be increased. :-)
> 
> True. :)  But Xen does need to remember all the refcounts that were
> created (so it can tidy up if the domain crashes).  If Xen is already
> doing that it might as well do it in the IOMMU tables since that
> solves other problems.

would a refcnt in the p2m layer be enough, so that we don't need separate
refcnts in both the EPT and the IOMMU page tables?

> 
> > > [First some hopefully-helpful diagrams to explain my thinking.  I'll
> > >  borrow 'BFN' from Malcolm's discussion of IOMMUs to describe the
> > >  addresses that devices issue their DMAs in:
> >
> > what's 'BFN' short for? Bus Frame Number?
> 
> Yes, I think so.
> 
> > > If we replace that lookup with a _map_ hypercall, either with Xen
> > > choosing the BFN (as happens in the PV grant map operation) or with
> > > the guest choosing an unused address (as happens in the HVM/PVH
> > > grant map operation), then:
> > >  - the only extra code in XenGT itself is that you need to unmap
> > >    when you change the GTT;
> > >  - Xen can track and control exactly which MFNs XenGT/the GPU can
> access;
> > >  - running XenGT in a driver domain or PVH dom0 ought to work; and
> > >  - we fix the race condition I described above.
> >
> > ok, I see your point here. It does sound like a better design to meet
> > Xen hypervisor's security requirement and can also work with PVH
> > Dom0 or driver domain. Previously even when we said a MFN is
> > required, it's actually a BFN due to IOMMU existence, and it works
> > just because we have a 1:1 identity mapping in-place. And by finding
> > a BFN
> >
> > some follow-up think here:
> >
> > - one extra unmap call will have some performance impact, especially
> > for media processing workloads where GPU page table modifications
> > are hot. but suppose this can be optimized with batch request
> 
> Yep.  In general I'd hope that the extra overhead of unmap is small
> compared with the trap + emulate + ioreq + schedule that's just
> happened.  Though I know that IOTLB shootdowns are potentially rather
> expensive right now so it might want some measurement.

yes, that's the hard part, requiring experiments to find a good balance
between complexity and performance. The IOMMU page table is not designed
to be modified as frequently as CPU/GPU page tables, but the approach
above ties them together. Another option might be to reserve enough BFNs
to cover all available guest memory at boot time, so as to eliminate the
run-time modification overhead.

> 
> > - is there existing _map_ call for this purpose per your knowledge, or
> > a new one is required? If the latter, what's the additional logic to be
> > implemented there?
> 
> For PVH, the XENMEM_add_to_physmap (gmfn_foreign) path ought to do
> what you need, I think.  For PV, I think we probably need a new map
> operation with sensible semantics.  My inclination would be to have it
> follow the grant-map semantics (i.e. caller supplies domid + gfn,
> hypervisor supplies BFN and success/failure code).

setting up the mapping is not a big problem. it's more about finding available
BFNs in a way that does not conflict with other usages, e.g. memory hotplug or
ballooning (well, for the latter I'm no longer sure whether it's limited to
existing gfns, given the other thread...)

> 
> Malcolm might have opinions about this -- it starts looking like the
> sort of PV IOMMU interface he's suggested before.

we'd like to hear Malcolm's suggestion here.

> 
> > - when you say _map_, do you expect this mapped into dom0's virtual
> > address space, or just guest physical space?
> 
> For PVH, I mean into guest physical address space (and iommu tables,
> since those are the same).  For PV, I mean just the IOMMU tables --
> since the guest controls its own PFN space entirely there's nothing
> Xen can to map things into it.
> 
> > - how is BFN or unused address (what do you mean by address here?)
> > allocated? does it need present in guest physical memory at boot time,
> > or just finding some holes?
> 
> That's really a question for the xen maintainers in the linux kernel.
> I presume that whatever bookkeeping they currently do for grant-mapped
> memory would suffice here just as well.

will study that part.

> 
> > - graphics memory size could be large. starting from BDW, there'll
> > be 64bit page table format. Do you see any limitation here on finding
> > BFN or address?
> 
> Not really.  The IOMMU tables are also 64-bit so there must be enough
> addresses to map all of RAM.  There shouldn't be any need for these
> mappings to be _contiguous_, btw.  You just need to have one free
> address for each mapping.  Again, following how grant maps work, I'd
> imagine that PVH guests will allocate an unused GFN for each mapping
> and do enough bookkeeping to make sure they don't clash with other GFN
> users (grant mapping, ballooning, &c).  PV guests will probably be
> given a BFN by the hypervisor at map time (which will be == MFN in
> practice) and just needs to pass the same BFN to the unmap call later
> (it can store it in the GTT meanwhile).

if possible, I'd prefer to make both consistent, i.e. always finding an unused GFN?

> 
> > > The default policy I'm suggesting is that the XenGT backend domain
> > > should be marked IS_PRIV_FOR (or similar) over the XenGT client VMs,
> > > which will need a small extension in Xen since at the moment struct
> > > domain has only one "target" field.
> >
> > Is that connection setup by toolstack or by hypervisor today?
> 
> It's set up by the toolstack using XEN_DOMCTL_set_target.  Extending
> that to something like XEN_DOMCTL_set_target_list would be OK, I
> think, along with some sort of lookup call.  Or maybe an
> add_target/remove_target pair would be easier?
> 

Thanks for the suggestions. Yu and I will study this in detail and work out a
proposal. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2015-01-06  8:56               ` Tian, Kevin
@ 2015-01-08 12:43                 ` Tim Deegan
  2015-01-09  8:02                   ` Tian, Kevin
  0 siblings, 1 reply; 59+ messages in thread
From: Tim Deegan @ 2015-01-08 12:43 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir@xen.org, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com,
	Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

Hi,

At 08:56 +0000 on 06 Jan (1420530995), Tian, Kevin wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > At 07:24 +0000 on 12 Dec (1418365491), Tian, Kevin wrote:
> > > but just to confirm one point. from my understanding whether it's a
> > > mapping operation doesn't really matter. We can invent an interface
> > > to get p2m mapping and then increase refcnt. the key is refcnt here.
> > > when XenGT constructs a shadow GPU page table, it creates a reference
> > > to guest memory page so the refcnt must be increased. :-)
> > 
> > True. :)  But Xen does need to remember all the refcounts that were
> > created (so it can tidy up if the domain crashes).  If Xen is already
> > doing that it might as well do it in the IOMMU tables since that
> > solves other problems.
> 
> would a refcnt in p2m layer enough so we don't need separate refcnt in both
> EPT and IOMMU page table?

Yes, that sounds right.  The p2m layer is actually the same as the EPT
table, so that is where the refcount should be attached to (and it
shouldn't matter whether the IOMMU page tables are shared or not).

> yes, that's the hard part requiring experiments to find a good balance
> between complexity and performance. IOMMU page table is not designed 
> with same frequent modifications as CPU/GPU page tables, but following
> above trend make them connected. Another option might be reserve a big
> enough BFNs to cover all available guest memory at boot time, so to
> eliminate run-time modification overhead.

Sure, or you can map them on demand but keep a cache of maps to avoid
unmapping between uses. 
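
Something like the following sketch, assuming a least-recently-used eviction policy (my choice for illustration, other policies would do) and two hypothetical callbacks, do_map and do_unmap, that stand in for whatever map/unmap hypercalls end up existing:

```python
from collections import OrderedDict


class MapCache:
    """Hypothetical on-demand map cache with LRU eviction."""

    def __init__(self, capacity, do_map, do_unmap):
        self.capacity = capacity
        self.do_map = do_map          # (domid, gfn) -> bfn
        self.do_unmap = do_unmap      # bfn -> None
        self.entries = OrderedDict()  # (domid, gfn) -> bfn, oldest first

    def lookup(self, domid, gfn):
        key = (domid, gfn)
        if key in self.entries:
            # Hit: no hypercall, just mark the entry as recently used.
            self.entries.move_to_end(key)
            return self.entries[key]
        if len(self.entries) >= self.capacity:
            # Evict the least recently used mapping before adding a new one.
            _, old_bfn = self.entries.popitem(last=False)
            self.do_unmap(old_bfn)
        bfn = self.do_map(domid, gfn)
        self.entries[key] = bfn
        return bfn
```

One caveat with any such cache: a cached mapping still holds its reference, so the guest's frame stays pinned until the entry is evicted or explicitly dropped.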

> > Not really.  The IOMMU tables are also 64-bit so there must be enough
> > addresses to map all of RAM.  There shouldn't be any need for these
> > mappings to be _contiguous_, btw.  You just need to have one free
> > address for each mapping.  Again, following how grant maps work, I'd
> > imagine that PVH guests will allocate an unused GFN for each mapping
> > and do enough bookkeeping to make sure they don't clash with other GFN
> > users (grant mapping, ballooning, &c).  PV guests will probably be
> > given a BFN by the hypervisor at map time (which will be == MFN in
> > practice) and just needs to pass the same BFN to the unmap call later
> > (it can store it in the GTT meanwhile).
> 
> if possible prefer to make both consistent, i.e. always finding unused GFN?

I don't think it will be possible.  PV domains are already using BFNs
supplied by Xen (in fact == MFN) for backend grant mappings, which
would conflict with supplying their own for these mappings.  But
again, I think the kernel maintainers for Xen may have a better idea
of how these interfaces are used inside the kernel.  For example,
it might be easy enough to wrap the two systems inside a common API
inside linux.   Again, following how grant mapping works seems like
the way forward.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2015-01-08 12:43                 ` Tim Deegan
@ 2015-01-09  8:02                   ` Tian, Kevin
  2015-01-09 20:08                     ` Konrad Rzeszutek Wilk
  2015-01-12 11:14                     ` David Vrabel
  0 siblings, 2 replies; 59+ messages in thread
From: Tian, Kevin @ 2015-01-09  8:02 UTC (permalink / raw)
  To: Tim Deegan
  Cc: keir@xen.org, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com,
	Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Thursday, January 08, 2015 8:43 PM
> 
> Hi,
> 
> > > Not really.  The IOMMU tables are also 64-bit so there must be enough
> > > addresses to map all of RAM.  There shouldn't be any need for these
> > > mappings to be _contiguous_, btw.  You just need to have one free
> > > address for each mapping.  Again, following how grant maps work, I'd
> > > imagine that PVH guests will allocate an unused GFN for each mapping
> > > and do enough bookkeeping to make sure they don't clash with other GFN
> > > users (grant mapping, ballooning, &c).  PV guests will probably be
> > > given a BFN by the hypervisor at map time (which will be == MFN in
> > > practice) and just needs to pass the same BFN to the unmap call later
> > > (it can store it in the GTT meanwhile).
> >
> > if possible prefer to make both consistent, i.e. always finding unused GFN?
> 
> I don't think it will be possible.  PV domains are already using BFNs
> supplied by Xen (in fact == MFN) for backend grant mappings, which
> would conflict with supplying their own for these mappings.  But
> again, I think the kernel maintainers for Xen may have a better idea
> of how these interfaces are used inside the kernel.  For example,
> it might be easy enough to wrap the two systems inside a common API
> inside linux.   Again, following how grant mapping works seems like
> the way forward.
> 

So Konrad, do you have any insight here? :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2015-01-09  8:02                   ` Tian, Kevin
@ 2015-01-09 20:08                     ` Konrad Rzeszutek Wilk
  2015-01-12 11:14                     ` David Vrabel
  1 sibling, 0 replies; 59+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-09 20:08 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir@xen.org, Tim Deegan, Xen-devel@lists.xen.org,
	Paul.Durrant@citrix.com, Yu, Zhang, David Vrabel,
	JBeulich@suse.com, Malcolm Crossley

On Fri, Jan 09, 2015 at 08:02:48AM +0000, Tian, Kevin wrote:
> > From: Tim Deegan [mailto:tim@xen.org]
> > Sent: Thursday, January 08, 2015 8:43 PM
> > 
> > Hi,
> > 
> > > > Not really.  The IOMMU tables are also 64-bit so there must be enough
> > > > addresses to map all of RAM.  There shouldn't be any need for these
> > > > mappings to be _contiguous_, btw.  You just need to have one free
> > > > address for each mapping.  Again, following how grant maps work, I'd
> > > > imagine that PVH guests will allocate an unused GFN for each mapping
> > > > and do enough bookkeeping to make sure they don't clash with other GFN
> > > > users (grant mapping, ballooning, &c).  PV guests will probably be
> > > > given a BFN by the hypervisor at map time (which will be == MFN in
> > > > practice) and just needs to pass the same BFN to the unmap call later
> > > > (it can store it in the GTT meanwhile).
> > >
> > > if possible prefer to make both consistent, i.e. always finding unused GFN?
> > 
> > I don't think it will be possible.  PV domains are already using BFNs
> > supplied by Xen (in fact == MFN) for backend grant mappings, which
> > would conflict with supplying their own for these mappings.  But
> > again, I think the kernel maintainers for Xen may have a better idea
> > of how these interfaces are used inside the kernel.  For example,
> > it might be easy enough to wrap the two systems inside a common API
> > inside linux.   Again, following how grant mapping works seems like
> > the way forward.
> > 
> 
> So Konrad, do you have any insight here? :-)

For grants we end up making the 'struct page' for said grant be visible
in our linear space. We stash the original BFNs(MFN) in the 'struct page'
and replace the P2M in PV guests with the new BFN(MFN). David and Jennifer
are working on making this more lightweight.

How often do we do these updates? We could also do it a simpler way, which is
what backend drivers do: get a swath of vmalloc memory and hook
the BFNs to it.  That can stay for quite some time.

The neat thing about vmalloc is that it is a sliding-window
type mechanism to deal with memory that is not usually accessed via
linear page tables. 

I suppose the complexity behind this is that this 'window' at the GPU
page tables needs to change. As in it moves around as there are different
guests doing things. So the mechanism of swapping this 'window' is going
to be expensive to map/unmap (as you have to flush the TLBs in the 
initial domain for the page-tables - unless you have multiple
'windows' and we flush the older ones lazily? But that sounds complex).

Who is doing the audit/modification? Is it some application in the
initial (backend) domain or some driver in the kernel?

> 
> Thanks
> Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2015-01-09  8:02                   ` Tian, Kevin
  2015-01-09 20:08                     ` Konrad Rzeszutek Wilk
@ 2015-01-12 11:14                     ` David Vrabel
  1 sibling, 0 replies; 59+ messages in thread
From: David Vrabel @ 2015-01-12 11:14 UTC (permalink / raw)
  To: Tian, Kevin, Tim Deegan
  Cc: keir@xen.org, Xen-devel@lists.xen.org, Paul.Durrant@citrix.com,
	Yu, Zhang, David Vrabel, JBeulich@suse.com, Malcolm Crossley

On 09/01/15 08:02, Tian, Kevin wrote:
>> From: Tim Deegan [mailto:tim@xen.org]
>> Sent: Thursday, January 08, 2015 8:43 PM
>>
>> Hi,
>>
>>>> Not really.  The IOMMU tables are also 64-bit so there must be enough
>>>> addresses to map all of RAM.  There shouldn't be any need for these
>>>> mappings to be _contiguous_, btw.  You just need to have one free
>>>> address for each mapping.  Again, following how grant maps work, I'd
>>>> imagine that PVH guests will allocate an unused GFN for each mapping
>>>> and do enough bookkeeping to make sure they don't clash with other GFN
>>>> users (grant mapping, ballooning, &c).  PV guests will probably be
>>>> given a BFN by the hypervisor at map time (which will be == MFN in
>>>> practice) and just needs to pass the same BFN to the unmap call later
>>>> (it can store it in the GTT meanwhile).
>>>
>>> if possible prefer to make both consistent, i.e. always finding unused GFN?
>>
>> I don't think it will be possible.  PV domains are already using BFNs
>> supplied by Xen (in fact == MFN) for backend grant mappings, which
>> would conflict with supplying their own for these mappings.  But
>> again, I think the kernel maintainers for Xen may have a better idea
>> of how these interfaces are used inside the kernel.  For example,
>> it might be easy enough to wrap the two systems inside a common API
>> inside linux.   Again, following how grant mapping works seems like
>> the way forward.
>>
> 
> So Konrad, do you have any insight here? :-)

Malcolm took two pages of this notebook explaining to me how he thought
it should work (in combination with his PV IOMMU work), so I'll let him
explain.

David

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-11  1:41       ` Tian, Kevin
  2014-12-11 16:46         ` Tim Deegan
@ 2014-12-11 21:29         ` Tim Deegan
  2014-12-12  6:29           ` Tian, Kevin
  2014-12-12  7:30           ` Tian, Kevin
  1 sibling, 2 replies; 59+ messages in thread
From: Tim Deegan @ 2014-12-11 21:29 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang,
	JBeulich@suse.com, Xen-devel@lists.xen.org

Hi, again. :)

As promised, I'm going to talk about more abstract design
considerations.  This will be a lot less concrete than in the other
email, and about a larger range of things.  Some of them may not be
really desirable - or even possible.

[ TL;DR: read the other reply with the practical suggestions in it :) ]

I'm talking from the point of view of a hypervisor maintainer, looking
at introducing this new XenGT component and thinking about what
security properties we would like the _system_ to have once XenGT is
introduced.  I'm going to lay out a series of broadly increasing
levels of security goodness and talk about what we'd need to do to get
there.

For the purposes of this discussion, Xen does not _trust_ XenGT.  By
that I mean that Xen can't rely on the correctness/integrity of XenGT
itself to maintain system security.  Now, we can decide that for some
properties we _will_ choose to trust XenGT, but the default is to
assume that XenGT could be compromised or buggy.  (This is not
intended as a slur on XenGT, btw -- this is how we reason about device
driver domains, qemu-dm and other components.  There will be bugs in
any component, and we're designing the system to minimise the effect
of those bugs.)

OK.  Properties we would like to have:

LEVEL 0: Protect Xen itself from XenGT
--------------------------------------

Bugs in XenGT should not be able to crash the host, and a compromised
XenGT should not be able to take over the hypervisor.

We're not there in the current design, purely because XenGT has to be
in dom0 (so it can trivially DoS Xen by rebooting the host).

But it doesn't seem too hard: as soon as we can run XenGT in a driver
domain, and with IOMMU tables that restrict the GPU from writing to Xen's
datastructures, we'll have this property.

[BTW, this whole discussion assumes that the GPU has no 'back door'
 access to issue DMA that is not translated by the IOMMU.  I have heard
 rumours in the past that such things exist. :) If the GPU can issue
 untranslated DMA, then whatever controls it can take over the entire
 system, and so we can't make _any_ security guarantees about it.]

LEVEL 1: Isolate XenGT's clients from other VMs
-----------------------------------------------

In other words we partition the machine into VMs XenGT can touch
(i.e. its clients) and those it can't.  Then a malicious client that
compromises XenGT only gains access to other VMs that share a GPU with
it.  That means we can deploy XenGT for some VMs without increasing
the risk to other tenants.

Again we're not there yet, but I think the design I was talking about
in my other email would do it: if XenGT must map all the memory it
wants to let the GPU DMA to, and Xen's policy is to deny mappings for
non-client-vm memory, then VMs that aren't using XenGT are protected.
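
To make that policy concrete, here is a minimal sketch in C.  It is not
Xen code: struct toy_domain, struct toy_page, the is_xengt_client flag
and xengt_may_map() are all made-up names for illustration, and a real
implementation would presumably sit on top of Xen's existing
page-ownership tracking.  The only point is the default-deny shape of
the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of a domain; not Xen's struct domain. */
struct toy_domain {
    uint16_t domid;
    bool is_xengt_client;   /* assumed per-VM flag set by the toolstack */
};

/* Hypothetical ownership record for one machine frame. */
struct toy_page {
    uint64_t mfn;
    const struct toy_domain *owner;   /* NULL: frame belongs to Xen itself */
};

/*
 * Decide whether the XenGT domain may map this frame for GPU DMA.
 * Default is deny: only frames owned by a XenGT client are allowed.
 */
bool xengt_may_map(const struct toy_page *pg)
{
    assert(pg != NULL);
    if (pg->owner == NULL)          /* Xen's own data structures */
        return false;
    return pg->owner->is_xengt_client;
}
```

With a check of this shape on every map request, a VM that never opted
in to XenGT cannot have its memory exposed to the GPU, whatever state
XenGT itself is in.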

LEVEL 2: Isolate XenGT's clients from each other
------------------------------------------------

This is trickier, as you pointed out.  We could:

a) Decide that we will trust XenGT to provide this property.  After
   all, that's its main purpose!  This is how we treat other shared
   backends: if a NIC device driver domain is compromised, the
   attacker controls the network traffic for all its frontends.
   OTOH, we don't trust qemu in that way -- instead we use stub domains 
   and IS_PRIV_FOR to enforce isolation.

b) Move all of XenGT into Xen.  This is just defining the problem away
   and would probably do more harm than good - after all, keeping it
   separate has other advantages.

c) Use privilege separation: break XenGT into parts, isolated from each
   other, with the principle of least privilege applied to them.  E.g.
   - GPU emulation could be in a per-client component that doesn't
     share state with the other clients' emulators;
   - Shadowing GTTs and auditing GPU commands could move into Xen,
     with a clean interface to the emulation parts.
   That way, even if a client VM can exploit a bug in the emulator,
   it can't affect other clients because it can't see their emulator
   state, and it can't bypass the safety rules because they're
   enforced by Xen.

   When I talked about privilege separation before I was suggesting
   something like this, but without moving anything into Xen -- e.g.
   the device-emulation code for each client could be in a per-client,
   non-root process.  The code that audits and issues commands to the
   GPU would be in a separate process, which is allowed to make
   hypercalls, and which does not trust the emulator processes.
   My apologies if you're already doing this -- I know XenGT has some
   components in a kernel driver and some elsewhere but I haven't
   looked at the details.
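
As a rough illustration of the "auditor does not trust the emulator"
idea, here is a small C sketch.  Everything in it is hypothetical
(struct client_window, struct gpu_request, audit_request()); real GPU
commands are far richer than one address range.  It assumes the auditor
already knows which client a request arrived from and looks up that
client's window itself, so nothing the emulator says is taken on faith.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: the GPU address range one client is allowed to touch. */
struct client_window {
    uint64_t base;
    uint64_t size;
};

/* Hypothetical request handed from a per-client emulator to the auditor. */
struct gpu_request {
    uint64_t addr;
    uint64_t len;
};

/*
 * Auditor side: every field of the request is untrusted input.
 * Returns true only if [addr, addr + len) lies inside the window that
 * the auditor itself associates with the requesting client.
 */
bool audit_request(const struct client_window *w,
                   const struct gpu_request *req)
{
    assert(w != NULL && req != NULL);
    if (req->len == 0 || req->len > w->size)
        return false;
    if (req->addr < w->base)
        return false;
    /* Overflow-safe form of: addr + len <= base + size */
    return req->addr - w->base <= w->size - req->len;
}
```

The comparison is written with subtractions so that a hostile addr/len
pair cannot wrap around 64 bits and slip past the bounds check.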

LEVEL 3: Isolate XenGT's clients from XenGT itself
--------------------------------------------------

XenGT should not be able to access parts of its client VMs that they
have not given it permission to.  E.g. XenGT should not be able to
read a client VM's crypto keys unless it displays them on the
framebuffer or uses the GPU to accelerate crypto.

Unlike level 2, device driver domains _do_ have this property: this is
what the grant tables are used for.  A compromised NIC driver domain
can MITM the frontend guest but it can't read any memory in the guest
other than network buffers.
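
For readers less familiar with grants, the property can be sketched in
a few lines of C.  This is a toy model, not Xen's grant-table ABI:
struct toy_grant and toy_grant_map() are invented for this example.
The frontend fills in entries for exactly the frames it wants to share;
the backend can only reach memory through one of those entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy grant entry; models the idea only, not Xen's real layout. */
struct toy_grant {
    bool in_use;
    bool readonly;
    uint16_t backend_domid;  /* who the frontend granted access to */
    uint64_t gfn;            /* the one guest frame being shared */
};

/*
 * Backend asks to map grant reference 'ref'.  Access is allowed only for
 * frames the frontend listed, only for the named backend, and writes only
 * if the grant is not read-only.  On success *gfn_out is set.
 */
bool toy_grant_map(const struct toy_grant *table, size_t nr, size_t ref,
                   uint16_t caller_domid, bool want_write, uint64_t *gfn_out)
{
    assert(table != NULL && gfn_out != NULL);
    if (ref >= nr || !table[ref].in_use)
        return false;
    if (table[ref].backend_domid != caller_domid)
        return false;
    if (want_write && table[ref].readonly)
        return false;
    *gfn_out = table[ref].gfn;
    return true;
}
```

A backend that is compromised can still misuse the frames it was
granted, but it has no way to name any other frame in the guest.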

Again there are a few approaches, like:

a) Declare that we don't care (i.e. that we will trust XenGT for this
   property too).  In a way it's no worse than trusting the firmware
   on a dedicated pass-through GPU.  But on the other hand the client
   VM is sharing that firmware with some other VMs... :(

b) Make the GPU driver in the client use grant tables for all RAM that
   it gives to the GPU.  Probably not practical!

c) Move just the code that builds the GTTs into Xen.  That way
   Xen would guarantee that the GPU never accessed memory it wasn't
   allowed to.

I'm sure there are other ideas too.

Conclusion
----------

That's enough rambling from me -- time to come back down to earth.
While I think it's useful to think about all these things, we don't
want to get carried away. :)  And as I said, for some things we can
decide to trust XenGT to provide them, as long as we're clear about
what that means.

I think that a reasonable minimum standard to expect is to enforce
levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3.  And I
think we can do that without needing any huge engineering effort;
as I said, I think that's covered in my earlier reply.

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-11 21:29         ` Tim Deegan
@ 2014-12-12  6:29           ` Tian, Kevin
  2014-12-18 16:08             ` Tim Deegan
  2015-01-05 15:49             ` George Dunlap
  2014-12-12  7:30           ` Tian, Kevin
  1 sibling, 2 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-12  6:29 UTC (permalink / raw)
  To: Tim Deegan
  Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang,
	JBeulich@suse.com, Xen-devel@lists.xen.org

> From: Tim Deegan [mailto:tim@xen.org]
> Sent: Friday, December 12, 2014 5:29 AM
> 
> Hi, again. :)
> 
> As promised, I'm going to talk about more abstract design
> considerations.  This will be a lot less concrete than in the other
> email, and about a larger range of things.  Some of them may not be
> really desirable - or even possible.

Thanks for taking the time to share your thoughts on this! I'll give my comments
at the same level and leave the detailed technical discussion to another thread. :-)

> 
> [ TL;DR: read the other reply with the practical suggestions in it :) ]
> 
> I'm talking from the point of view of a hypervisor maintainer, looking
> at introducing this new XenGT component and thinking about what
> security properties we would like the _system_ to have once XenGT is
> introduced.  I'm going to lay out a series of broadly increasing
> levels of security goodness and talk about what we'd need to do to get
> there.

that's a good clarification of the levels.

> 
> For the purposes of this discussion, Xen does not _trust_ XenGT.  By
> that I mean that Xen can't rely on the correctness/integrity of XenGT
> itself to maintain system security.  Now, we can decide that for some
> properties we _will_ choose to trust XenGT, but the default is to
> assume that XenGT could be compromised or buggy.  (This is not
> intended as a slur on XenGT, btw -- this is how we reason about device
> driver domains, qemu-dm and other components.  There will be bugs in
> any component, and we're designing the system to minimise the effect
> of those bugs.)

Yes, it's a fair concern.

> 
> OK.  Properties we would like to have:
> 
> LEVEL 0: Protect Xen itself from XenGT
> --------------------------------------
> 
> Bugs in XenGT should not be able to crash the host, and a compromised
> XenGT should not be able to take over the hypervisor.
> 
> We're not there in the current design, purely because XenGT has to be
> in dom0 (so it can trivially DoS Xen by rebooting the host).

Can we really prevent dom0 from DoS'ing Xen? I know there's ongoing effort
like PVH Dom0, however there is a lot of trickiness in Dom0 which can
put the platform into a bad state. One example is ACPI. All the platform
details are encapsulated in AML, and only dom0 knows how to
handle ACPI events. Unless Xen has another parser to guard all possible
resources which might be touched through ACPI, a tampered dom0 has many
ways to break out. But that'd be very challenging and complex.

If we can't contain Dom0's behavior completely, I would think dom0
and Xen are actually in the same trust zone, so putting XenGT in Dom0 shouldn't
make things worse.

> 
> But it doesn't seem too hard: as soon as we can run XenGT in a driver
> domain, and with IOMMU tables that restrict the GPU from writing to Xen's
> datastructures, we'll have this property.
> 
> [BTW, this whole discussion assumes that the GPU has no 'back door'
>  access to issue DMA that is not translated by the IOMMU.  I have heard
>  rumours in the past that such things exist. :) If the GPU can issue
>  untranslated DMA, then whatever controls it can take over the entire
>  system, and so we can't make _any_ security guarantees about it.]

I definitely agree with this LEVEL 0 requirement in general, e.g. dom0
can't DMA into Xen's data structures (this is ensured even for the default 1:1
identity mapping). However I'm not sure whether XenGT must be put
in a driver domain as a hard requirement. It's nice to have (there are some
open implementation questions; let's discuss them in another thread).

> 
> 
> LEVEL 1: Isolate XenGT's clients from other VMs
> -----------------------------------------------
> 
> In other words we partition the machine into VMs XenGT can touch
> (i.e. its clients) and those it can't.  Then a malicious client that
> compromises XenGT only gains access to other VMs that share a GPU with
> it.  That means we can deploy XenGT for some VMs without increasing
> the risk to other tenants.
> 
> Again we're not there yet, but I think the design I was talking about
> in my other email would do it: if XenGT must map all the memory it
> wants to let the GPU DMA to, and Xen's policy is to deny mappings for
> non-client-vm memory, then VMs that aren't using XenGT are protected.

Fully agree. We have a 'vgt' control option in each VM's config file. That
can be the hint for Xen to decide whether to allow or deny a mapping from XenGT.

> 
> 
> LEVEL 2: Isolate XenGT's clients from each other
> ------------------------------------------------
> 
> This is trickier, as you pointed out.  We could:
> 
> a) Decide that we will trust XenGT to provide this property.  After
>    all, that's its main purpose!  This is how we treat other shared
>    backends: if a NIC device driver domain is compromised, the
>    attacker controls the network traffic for all its frontends.
>    OTOH, we don't trust qemu in that way -- instead we use stub domains
>    and IS_PRIV_FOR to enforce isolation.

Yep. Just curious, I thought stubdomains were not widely used, and the typical
case is to have qemu in dom0. Is this still true? :-)

> 
> b) Move all of XenGT into Xen.  This is just defining the problem away
>    and would probably do more harm than good - after all, keeping it
>    separate has other advantages.

I'll explain below why we don't keep XenGT in Xen.

> 
> c) Use privilege separation: break XenGT into parts, isolated from each
>    other, with the principle of least privilege applied to them.  E.g.
>    - GPU emulation could be in a per-client component that doesn't
>      share state with the other clients' emulators;

Yes, we're doing it that way now. The emulation is a per-VM kernel thread.
A separate main thread manages the physical GPU to do context switches.

>    - Shadowing GTTs and auditing GPU commands could move into Xen,
>      with a clean interface to the emulation parts.

I'm afraid there's no such clean interface given the complexity of
the GPU.

Here let me give some other background which impacts the XenGT design
(some of it exists today, and some is planned). Putting it here
is not to say "we don't want to change due to other reasons", but to
show the list of factors we need to balance:

1. The core device model will be merged as part of the Intel graphics kernel
driver. This avoids duplicating physical GPU management in XenGT
(as today's implementation does), with benefits for simplicity, quality and
maintainability.

2. The same device model will then be shared by both XenGT and KVMGT,
only requiring Xen/KVM to provide a minimal set of emulation services,
like event forwarding, mapping guest memory, etc.

3. GPU emulation is complex, and there are lots of differences from
generation to generation. Our customers need a flexible release model so
we can release new features and bug fixes quickly through a kernel module.

Those are the major reasons we came to the current XenGT architecture. Now
back to your idea of moving shadow GTTs and auditing of GPU commands
into Xen. It would add complexity in several ways:

- It means we have two drivers for one device, each responsible
for some role. We would then likely need to hack the Intel graphics driver's GTT
management code and scheduling code to cooperate with this split.
That's unlikely to be accepted by the driver people.

- Auditing GPU commands requires understanding the vGPU context, which
means sharing and synchronizing a large buffer between
Xen and XenGT.

- GTT/command formats are not compatible from generation to generation,
which means unnecessary maintenance effort in Xen.

- Last but not least, the GPU HW itself is not designed cleanly enough
to separate the GTT from the remaining parts, which means that even if we move
GTT management into the hypervisor, there are many ways to bypass the control,
e.g. changing the root pointer of the GTT (which may be in a register,
or may be in a memory structure). And once we want to move those
parts into Xen too, that will dig out more bits, and finally we have to pull
the whole driver into Xen (though it is less complex than a real graphics driver).

Sorry for writing such long detail in this high-level discussion. I just wrote it down
while thinking about whether this is practical, and I hope it answers the concern
here. :-)

>    That way, even if a client VM can exploit a bug in the emulator,
>    it can't affect other clients because it can't see their emulator
>    state, and it can't bypass the safety rules because they're
>    enforced by Xen.
> 
>    When I talked about privilege separation before I was suggesting
>    something like this, but without moving anything into Xen -- e.g.
>    the device-emulation code for each client could be in a per-client,
>    non-root process.  The code that audits and issues commands to the
>    GPU would be in a separate process, which is allowed to make
>    hypercalls, and which does not trust the emulator processes.
>    My apologies if you're already doing this -- I know XenGT has some
>    components in a kernel driver and some elsewhere but I haven't
>    looked at the details.

That's a good comment. We're implementing it that way, but it might not
be so strictly separated. I'll bring this comment back to our engineering
team to have it well considered.

> 
> 
> LEVEL 3: Isolate XenGT's clients from XenGT itself
> --------------------------------------------------
> 
> XenGT should not be able to access parts of its client VMs that they
> have not given it permission to.  E.g. XenGT should not be able to
> read a client VM's crypto keys unless it displays them on the
> framebuffer or uses the GPU to accelerate crypto.
> 
> Unlike level 2, device driver domains _do_ have this property: this is
> what the grant tables are used for.  A compromised NIC driver domain
> can MITM the frontend guest but it can't read any memory in the guest
> other than network buffers.
> 
> Again there are a few approaches, like:
> 
> a) Declare that we don't care (i.e. that we will trust XenGT for this
>    property too).  In a way it's no worse than trusting the firmware
>    on a dedicated pass-through GPU.  But on the other hand the client
>    VM is sharing that firmware with some other VMs... :(
> 
> b) Make the GPU driver in the client use grant tables for all RAM that
>    it gives to the GPU.  Probably not practical!

yes, and that can be a good research topic. :-)

> 
> c) Move just the code that builds the GTTs into Xen.  That way
>    Xen would guarantee that the GPU never accessed memory it wasn't
>    allowed to.

As explained above, it's impractical to separate out self-contained GTT logic
into Xen. In the GPU, the GTT is in some sense an attribute belonging to a render context,
unlike the CPU's CR3, which is very simple.

> 
> I'm sure there are other ideas too.
> 
> 
> Conclusion
> ----------
> 
> That's enough rambling from me -- time to come back down to earth.
> While I think it's useful to think about all these things, we don't
> want to get carried away. :)  And as I said, for some things we can
> decide to trust XenGT to provide them, as long as we're clear about
> what that means.
> 
> I think that a reasonable minimum standard to expect is to enforce
> levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3.  And I
> think we can do that without needing any huge engineering effort;
> as I said, I think that's covered in my earlier reply.
> 

I agree with the conclusion that "minimum standard to expect is to enforce
levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3", except for the
concern about whether PVH Dom0 is a hard requirement or not. Having
said that, I'm happy to discuss the technical details in another thread on
how to support PVH Dom0.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-12  6:29           ` Tian, Kevin
@ 2014-12-18 16:08             ` Tim Deegan
  2014-12-18 17:01               ` Andrew Cooper
  2015-01-05 15:49             ` George Dunlap
  1 sibling, 1 reply; 59+ messages in thread
From: Tim Deegan @ 2014-12-18 16:08 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang,
	JBeulich@suse.com, Xen-devel@lists.xen.org

Hi,

At 06:29 +0000 on 12 Dec (1418362182), Tian, Kevin wrote:
> If we can't contain Dom0's behavior completely, I would think dom0
> and Xen are actually in the same trust zone, so putting XenGT in Dom0 shouldn't
> make things worse.

Ah, but it does -- it's putting thousands of lines of code into that
trust zone and adding a new attack surface.  So it would be much
better if we could put XenGT in its own domain where it doesn't have
full dom0 privileges.

> However I'm not sure whether XenGT must be put
> in a driver domain as a hard requirement. It's nice to have (there are some
> open implementation questions; let's discuss them in another thread).

Sure -- it would take a lot of toolstack work to actually put XenGT
into a driver domain.  So although I strongly encourage it, I don't
think it's a hard requirement.  But I'd like to make sure that we
end up with a XenGT that _could_ go in a driver domain if that
toolstack plumbing was done.  

> Yep. Just curious, I thought stubdomains were not widely used, and the typical
> case is to have qemu in dom0. Is this still true? :-)

Some do and some don't. :)  High-security distros like Qubes and
XenClient do.  You can enable it in xl config files pretty easily.
IIRC the xapi toolstack doesn't use it, but XenServer uses privilege
separation to isolate the qemu processes in dom0.

> Sorry for writing such long detail in this high-level discussion. I just wrote it down
> while thinking about whether this is practical, and I hope it answers the concern
> here. :-)

Thank you for that, it's helpful to have a clear idea about it. 

Cheers,

Tim.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-18 16:08             ` Tim Deegan
@ 2014-12-18 17:01               ` Andrew Cooper
  0 siblings, 0 replies; 59+ messages in thread
From: Andrew Cooper @ 2014-12-18 17:01 UTC (permalink / raw)
  To: Tim Deegan, Tian, Kevin
  Cc: Yu, Zhang, Paul.Durrant@citrix.com, keir@xen.org,
	JBeulich@suse.com, Xen-devel@lists.xen.org

On 18/12/14 16:08, Tim Deegan wrote:
>> Yep. Just curious, I thought stubdomains were not widely used, and the typical
>> case is to have qemu in dom0. Is this still true? :-)
> Some do and some don't. :)  High-security distros like Qubes and
> XenClient do.  You can enable it in xl config files pretty easily.
> IIRC the xapi toolstack doesn't use it, but XenServer uses privilege
> separation to isolate the qemu processes in dom0.
>

We are looking into stubdomains as part of our future architectural roadmap,
but as identified, there is a lot of toolstack plumbing required before
this is feasible to put into XenServer.

Our privilege separation in qemu is a stopgap measure which we would like
to replace in due course.

~Andrew

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-12  6:29           ` Tian, Kevin
  2014-12-18 16:08             ` Tim Deegan
@ 2015-01-05 15:49             ` George Dunlap
  2015-01-06  8:42               ` Tian, Kevin
  1 sibling, 1 reply; 59+ messages in thread
From: George Dunlap @ 2015-01-05 15:49 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir@xen.org, Tim Deegan, Xen-devel@lists.xen.org,
	Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com

On Fri, Dec 12, 2014 at 6:29 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
>> We're not there in the current design, purely because XenGT has to be
>> in dom0 (so it can trivially DoS Xen by rebooting the host).
>
> Can we really prevent dom0 from DoS'ing Xen? I know there's ongoing effort
> like PVH Dom0, however there is a lot of trickiness in Dom0 which can
> put the platform into a bad state. One example is ACPI. All the platform
> details are encapsulated in AML, and only dom0 knows how to
> handle ACPI events. Unless Xen has another parser to guard all possible
> resources which might be touched through ACPI, a tampered dom0 has many
> ways to break out. But that'd be very challenging and complex.
>
> If we can't contain Dom0's behavior completely, I would think dom0
> and Xen are actually in the same trust zone, so putting XenGT in Dom0 shouldn't
> make things worse.

The question here is, "If a malicious guest can manage to break into
XenGT, what can they do?"

If XenGT is running in dom0, then the answer is, "At the very least, they
can DoS the host because dom0 is allowed to reboot; they can probably
do lots of other nasty things as well."

If XenGT is running in its own domain, and can only add IOMMU entries
for MFNs belonging to XenGT-only VMs, then the answer is, "They can
access other XenGT-enabled VMs, but they cannot shut down the host or
access non-XenGT VMs."

Slides 8-11 of a presentation I gave
(http://www.slideshare.net/xen_com_mgr/a-brief-tutorial-on-xens-advanced-security-features)
can give you a graphical idea of what we're talking about.

 -George

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2015-01-05 15:49             ` George Dunlap
@ 2015-01-06  8:42               ` Tian, Kevin
  2015-01-06 10:35                 ` Ian Campbell
  0 siblings, 1 reply; 59+ messages in thread
From: Tian, Kevin @ 2015-01-06  8:42 UTC (permalink / raw)
  To: George Dunlap
  Cc: keir@xen.org, Tim Deegan, Xen-devel@lists.xen.org,
	Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com

> From: George Dunlap
> Sent: Monday, January 05, 2015 11:50 PM
> 
> On Fri, Dec 12, 2014 at 6:29 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> >> We're not there in the current design, purely because XenGT has to be
> >> in dom0 (so it can trivially DoS Xen by rebooting the host).
> >
> > Can we really prevent dom0 from DoS'ing Xen? I know there's ongoing effort
> > like PVH Dom0, however there is a lot of trickiness in Dom0 which can
> > put the platform into a bad state. One example is ACPI. All the platform
> > details are encapsulated in AML, and only dom0 knows how to
> > handle ACPI events. Unless Xen has another parser to guard all possible
> > resources which might be touched through ACPI, a tampered dom0 has many
> > ways to break out. But that'd be very challenging and complex.
> >
> > If we can't contain Dom0's behavior completely, I would think dom0
> > and Xen are actually in the same trust zone, so putting XenGT in Dom0 shouldn't
> > make things worse.
> 
> The question here is, "If a malicious guest can manage to break into
> XenGT, what can they do?"
> 
> If XenGT is running in dom0, then the answer is, "At the very least, they
> can DoS the host because dom0 is allowed to reboot; they can probably
> do lots of other nasty things as well."
> 
> If XenGT is running in its own domain, and can only add IOMMU entries
> for MFNs belonging to XenGT-only VMs, then the answer is, "They can
> access other XenGT-enabled VMs, but they cannot shut down the host or
> access non-XenGT VMs."
> 
> Slides 8-11 of a presentation I gave
> (http://www.slideshare.net/xen_com_mgr/a-brief-tutorial-on-xens-advanced-security-features)
> can give you a graphical idea of what we're talking about.
> 

I agree we need to make XenGT more isolated, following the ongoing trend from
the previous discussion. But regarding whether Dom0/Xen are in the same security
domain, I don't see that my statement is changed by the above attempts, which just try to
move privileged Xen stuff away from dom0: all the existing Linux vulnerabilities
allow a tampered Dom0 to do many evil things with root permission, or even with a tampered
kernel, to DoS Xen (e.g. w/ ACPI). PVH dom0 can help performance... but by itself
doesn't change the fact that Dom0/Xen are actually in the same security domain. :-)

Thanks
Kevin

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2015-01-06  8:42               ` Tian, Kevin
@ 2015-01-06 10:35                 ` Ian Campbell
  0 siblings, 0 replies; 59+ messages in thread
From: Ian Campbell @ 2015-01-06 10:35 UTC (permalink / raw)
  To: Tian, Kevin
  Cc: keir@xen.org, George Dunlap, Tim Deegan, Xen-devel@lists.xen.org,
	Paul.Durrant@citrix.com, Yu, Zhang, JBeulich@suse.com

On Tue, 2015-01-06 at 08:42 +0000, Tian, Kevin wrote:
> > From: George Dunlap
> > Sent: Monday, January 05, 2015 11:50 PM
> > 
> > On Fri, Dec 12, 2014 at 6:29 AM, Tian, Kevin <kevin.tian@intel.com> wrote:
> > >> We're not there in the current design, purely because XenGT has to be
> > >> in dom0 (so it can trivially DoS Xen by rebooting the host).
> > >
> > > Can we really prevent dom0 from DoS'ing Xen? I know there's ongoing effort
> > > like PVH Dom0, however there is a lot of trickiness in Dom0 which can
> > > put the platform into a bad state. One example is ACPI. All the platform
> > > details are encapsulated in AML, and only dom0 knows how to
> > > handle ACPI events. Unless Xen has another parser to guard all possible
> > > resources which might be touched through ACPI, a tampered dom0 has many
> > > ways to break out. But that'd be very challenging and complex.
> > >
> > > If we can't contain Dom0's behavior completely, I would think dom0
> > > and Xen are actually in the same trust zone, so putting XenGT in Dom0 shouldn't
> > > make things worse.
> > 
> > The question here is, "If a malicious guest can manage to break into
> > XenGT, what can they do?"
> > 
> > If XenGT is running in dom0, then the answer is, "At the very least, they
> > can DoS the host because dom0 is allowed to reboot; they can probably
> > do lots of other nasty things as well."
> > 
> > If XenGT is running in its own domain, and can only add IOMMU entries
> > for MFNs belonging to XenGT-only VMs, then the answer is, "They can
> > access other XenGT-enabled VMs, but they cannot shut down the host or
> > access non-XenGT VMs."
> > 
> > Slides 8-11 of a presentation I gave
> > (http://www.slideshare.net/xen_com_mgr/a-brief-tutorial-on-xens-advanced-security-features)
> > can give you a graphical idea of what we're talking about.
> > 
> 
> I agree we need to make XenGT more isolated, following the ongoing trend from
> the previous discussion. But regarding whether Dom0/Xen are in the same security
> domain, I don't see that my statement is changed by the above attempts, which just try to
> move privileged Xen stuff away from dom0: all the existing Linux vulnerabilities
> allow a tampered Dom0 to do many evil things with root permission, or even with a tampered
> kernel, to DoS Xen (e.g. w/ ACPI). PVH dom0 can help performance... but by itself
> doesn't change the fact that Dom0/Xen are actually in the same security domain. :-)

Which is a good reason why one would want to remove as much potentially
vulnerable code from dom0 as possible, and then deny it the
corresponding permissions via XSM too.

I also find the argument "dom0 can do some bad things so we should let
it be able to do all bad things" rather specious.

Ian.

^ permalink raw reply	[flat|nested] 59+ messages in thread

* Re: One question about the hypercall to translate gfn to mfn.
  2014-12-11 21:29         ` Tim Deegan
  2014-12-12  6:29           ` Tian, Kevin
@ 2014-12-12  7:30           ` Tian, Kevin
  1 sibling, 0 replies; 59+ messages in thread
From: Tian, Kevin @ 2014-12-12  7:30 UTC (permalink / raw)
  To: Tim Deegan
  Cc: keir@xen.org, Paul.Durrant@citrix.com, Yu, Zhang,
	JBeulich@suse.com, Xen-devel@lists.xen.org

> From: Tian, Kevin
> Sent: Friday, December 12, 2014 2:30 PM
> >
> > Conclusion
> > ----------
> >
> > That's enough rambling from me -- time to come back down to earth.
> > While I think it's useful to think about all these things, we don't
> > want to get carried away. :)  And as I said, for some things we can
> > decide to trust XenGT to provide them, as long as we're clear about
> > what that means.
> >
> > I think that a reasonable minimum standard to expect is to enforce
> > levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3.  And I
> > think we can do that without needing any huge engineering effort;
> > as I said, I think that's covered in my earlier reply.
> >
> 
> I agree with the conclusion that the "minimum standard to expect is to enforce
> levels 0 and 1 in Xen, and trust XenGT for levels 2 and 3", except for the
> concern about whether PVH Dom0 is a hard requirement or not. Having
> said that, I'm happy to discuss the technical details in another thread on
> how to support PVH Dom0.
> 

So after going through another mail, now I agree both levels 0 and 1 can
be enforced. :-)

Thanks
Kevin


Thread overview: 59+ messages
2014-12-09 10:10 One question about the hypercall to translate gfn to mfn Yu, Zhang
2014-12-09 10:19 ` Paul Durrant
2014-12-09 10:37   ` Yu, Zhang
2014-12-09 10:50     ` Jan Beulich
2014-12-10  1:07       ` Tian, Kevin
2014-12-10  8:39         ` Jan Beulich
2014-12-10  8:47           ` Tian, Kevin
2014-12-10  9:16             ` Jan Beulich
2014-12-10  9:51               ` Tian, Kevin
2014-12-10 10:07                 ` Jan Beulich
2014-12-10 11:04                 ` Malcolm Crossley
2014-12-10  8:50           ` Tian, Kevin
2014-12-09 10:51     ` Malcolm Crossley
2014-12-10  1:22       ` Tian, Kevin
2014-12-09 10:38 ` Jan Beulich
2014-12-09 10:46 ` Tim Deegan
2014-12-09 11:05   ` Paul Durrant
2014-12-09 11:11     ` Ian Campbell
2014-12-09 11:17       ` Paul Durrant
2014-12-09 11:23         ` Jan Beulich
2014-12-09 11:28           ` Malcolm Crossley
2014-12-09 11:29         ` Ian Campbell
2014-12-09 11:43           ` Paul Durrant
2014-12-10  1:48             ` Tian, Kevin
2014-12-10 10:11               ` Ian Campbell
2014-12-11  1:50                 ` Tian, Kevin
2014-12-10  1:14   ` Tian, Kevin
2014-12-10 10:36     ` Jan Beulich
2014-12-11  1:45       ` Tian, Kevin
2014-12-10 10:55     ` Tim Deegan
2014-12-11  1:41       ` Tian, Kevin
2014-12-11 16:46         ` Tim Deegan
2014-12-12  7:24           ` Tian, Kevin
2014-12-12 10:54             ` Jan Beulich
2014-12-15  6:25               ` Tian, Kevin
2014-12-15  8:44                 ` Jan Beulich
2014-12-15  9:05                   ` Tian, Kevin
2014-12-15  9:22                     ` Jan Beulich
2014-12-15 11:16                       ` Tian, Kevin
2014-12-15 11:27                         ` Jan Beulich
2014-12-15 15:22                       ` Stefano Stabellini
2014-12-15 16:01                         ` Jan Beulich
2014-12-15 16:15                           ` Stefano Stabellini
2014-12-15 16:28                             ` David Vrabel
2014-12-15 16:28                             ` Jan Beulich
2014-12-18 15:46             ` Tim Deegan
2015-01-06  8:56               ` Tian, Kevin
2015-01-08 12:43                 ` Tim Deegan
2015-01-09  8:02                   ` Tian, Kevin
2015-01-09 20:08                     ` Konrad Rzeszutek Wilk
2015-01-12 11:14                     ` David Vrabel
2014-12-11 21:29         ` Tim Deegan
2014-12-12  6:29           ` Tian, Kevin
2014-12-18 16:08             ` Tim Deegan
2014-12-18 17:01               ` Andrew Cooper
2015-01-05 15:49             ` George Dunlap
2015-01-06  8:42               ` Tian, Kevin
2015-01-06 10:35                 ` Ian Campbell
2014-12-12  7:30           ` Tian, Kevin
