Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update

All of lore.kernel.org
 help / color / mirror / Atom feed

From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Paul Mackerras <paulus@samba.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
	kvm-ppc@vger.kernel.org
Subject: Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
Date: Fri, 10 Feb 2017 04:50:54 +0000	[thread overview]
Message-ID: <20170210045054.GC25381@umbus> (raw)
In-Reply-To: <55f38121-2cb1-9641-d837-a2facf6e9448@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 10993 bytes --]

On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
> On 10/02/17 14:07, David Gibson wrote:
> > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> >> On 09/02/17 14:51, David Gibson wrote:
> >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >>>> For the emulated devices it does not matter much if we get a broken TCE
> >>>> half way handling a TCE list but for VFIO it will matter as it has
> >>>> more chances to fail so we try to do our best and check as much as we
> >>>> can before proceeding.
> >>>>
> >>>> This separates a guest view table update from validation. No change in
> >>>> behavior is expected.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>>>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..9a7b7fca5e84 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		if (get_user(tce, tces + i)) {
> >>>> +			ret = H_TOO_HARD;
> >>>> +			goto unlock_exit;
> >>>> +		}
> >>>> +		tce = be64_to_cpu(tce);
> >>>
> >>> This doesn't look safe.  The contents of user memory could change
> >>> between the two get_user()s, meaning that you're no longer guaranteed
> >>> a TCE loaded into kernel has been validated at all.
> >>>
> >>> I think you need to either:
> >>>
> >>>     a) Make sure things safe against a bad TCE being loaded into a TCE
> >>>     table and move all validation to where the TCE is used, rather
> >>>     than loaded
> >>>
> >>> or
> >>>     b) Copy the whole set of indirect entries to a temporary in-kernel
> >>>        buffer, then validate, then load into the actual TCE table.
> >>
> >>
> >> Correct :( The problem is I do not know how far I want to go in reverting
> >> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> >>
> >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> >> address.
> >>
> >>
> >> To do a) I'll need to remember old content of each hardware table entry as
> >> when I reach TCE#100, I'll need to revert to the initial state which means
> >> I need to write back old TCEs to all affected hardware tables and update
> >> reference counters of all affected preregistered areas. Well, the actual
> >> tables must not have different addresses (BUG_ON? is it worth testing while
> >> writing to hardware tables that values I am replacing are the same in all
> >> tables?) so I can have just a single array of old TCEs from hardware tables
> >> in vcpu.
> > 
> > I thought you said shared tables were disabled, so the two tables
> > would have different addresses?
> 
> That would be 2 physically separated tables but the content would be the
> same as long as they belong to the same VFIO container.

Ok.  I thought you were talking about the address of the TCE tables
being the same above.  Did you mean the address of an individual page
mapped in the TCE table?

> > Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> > only if the guest/qemu does something wrong, or can it fail for other
> > reasons? 
> 
> This should always just work.

Ok, given that, just replacing HPAs we can't translate with a clear
entry seems fine to me.

> > What about in real mode vs. virtual mode?
> 
> Real mode is no different in this matter.
> 
> Real mode is different from virtual mode in 3 aspects:
> 
> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
> inhibited writes to invalidate "TCE kill" cache;
> 
> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rct() because of
> lockdep does not work in real mode properly;
> 
> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
> addresses directly. Not expected to fail.
> 
> This is a full list.

Ok.

> > I think the key to this approach will be to think carefully about what
> > semantics you guarantee for mappings shadowed into the hardware
> > tables.  For example, it might work to specify that the host mappings
> > only match the GPA mappings if those GPA mapings are valid in the
> > first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> > to update the view of the table in terms of GPA.  But when you shadow
> > those into the HPA tables, any entries which can't be translated you
> > just replace with a cleared entry.
> 
> Literally with zero? Silently? WARN_ON_ONCE?

Well, with a no-permission TCE, which might as well be zero, yes.

WARN_ON_ONCE() is probably a good idea.

> > That should be enough to protect
> > the host.  Obviously you can expect the device to fail when you
> > actually attempt to DMA there, but that's the guest's (or qemu's) own
> > fault for putting bad addresses in the TCE table.
> > 
> > Obviously that might not be great for debugging, since mappings will
> > appear to succeed, but then not work later on.
> > 
> > This does have the nice property that it's reasonably obvious what to
> > do if you have some GPA mappings for emulated devices, then hotplug a
> > VFIO device and at that point hit a gpa->hpa translation error.
> > There's no hcall in this case, so there's no obvious way to return an
> > error to the guest.
> 
> Right. So if I do this, you would probably even ack this? :)

Assuming I don't spot some other showstopper...

Oh.. one thing to make sure you think about though: what happens if a
guest makes some mappings, then there's a memory hotplug event which
changes the set of valid GPAs?  In particular what if you hot unplug
some memory which is mapped in a guest TCE table?  You might have to
regenerate the HPA tables from the GPA table on hot unplug (unless you
have a way of locking out an unplug event while that piece of guest
ram is TCE mapped).

> >> To do b) I'll need:
> >>
> >> 1. to have a copy of TCEs from the guest in vcpu,
> > 
> > I don't quite understand this.  You need a temporary copy, yes, but I
> > don't see why it needs to be attached to the vcpu.
> 
> It does not need, I just need a safe + static + lock-free place for it as I
> do not want to do malloc() in the TCE handlers and (in theory) multiple
> CPUs can do concurrent TCE requests and I want to avoid locking especially
> in realmode.

Ah, right, it's the inability to malloc() that's the difficulty.  You
could put it in the vcpu, or you could use a per-(host)-cpu area - you
can't switch guests while in a realmode handler.
> 
> 
> >> I populate it via
> >> get_user() to make sure they won't change;
> >> 2. an array of userspace addresses translated from given TCEs; and in order
> >> to make sure these addresses won't go away, I'll need to reference each
> >> preregistered memory area via mm_iommu_mapped_inc().
> >>
> >> When I reach TCE#100, I'll have to revert the change, i.e. call
> >> mm_iommu_mapped_dec().
> > 
> > Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
> > the updated translations into a temp buffer.  Then you'd to make more
> > temp buffers to store the UA and HPA translations (although maybe you
> > could overwrite/reuse the original temp buffer if you're careful).
> > Then only if all of those succeed do you copy them into the real
> > hardware tables.
> > 
> > Which sounds like it might be kinda messy, at least in real mode.
> 
> So is it worth it?

Option (a) is certainly looking better to me based on current
information.

> >> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> >>
> >>
> >> Or I can do simpler version of b) which would store guest TCEs in
> >> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> >> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
> >> request, some preregistered regions will stay referenced till the guest is
> >> killed or rebooted (and this will prevent memory from unregistering) - but
> >> this means no harm to the host;
> > 
> > Hrm.. that's not really true.  It's not the worst thing that can
> > happen, but allowing the guest to permanently lock extra chunks of
> > memory is a form of harm to the host.
> 
> 
> These are the same preregistered chunks which are already locked. And the
> lock is there till QEMU process is dead. What will not be possible is
> memory hotunplug.

Ah, ok, I see your point.  That's probably sufficient, but option (a)
is still looking better.

> >> and with preregistered RAM, there is no
> >> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> >>
> >>
> >>
> >> Which approach to pick?
> >>
> >>
> >> LoPAPR says:
> >> ===
> >> If the TCE parameter represents the logical page address of a page that is
> >> not valid for the calling partition, return
> >> H_Parameter.
> >> ===
> >>
> >>
> >>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 918af76ab2b6..f8a54b7c788e 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret = H_SUCCESS;
> >>>> -	unsigned long tces, entry, ua = 0;
> >>>> +	unsigned long tces, entry, tce, ua = 0;
> >>>>  	unsigned long *rmap = NULL;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>  
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>
> >>> Same problem here.
> >>>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

WARNING: multiple messages have this Message-ID (diff)

From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: linuxppc-dev@lists.ozlabs.org,
	Alex Williamson <alex.williamson@redhat.com>,
	Paul Mackerras <paulus@samba.org>,
	kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
Date: Fri, 10 Feb 2017 15:50:54 +1100	[thread overview]
Message-ID: <20170210045054.GC25381@umbus> (raw)
In-Reply-To: <55f38121-2cb1-9641-d837-a2facf6e9448@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 10993 bytes --]

On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
> On 10/02/17 14:07, David Gibson wrote:
> > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> >> On 09/02/17 14:51, David Gibson wrote:
> >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >>>> For the emulated devices it does not matter much if we get a broken TCE
> >>>> half way handling a TCE list but for VFIO it will matter as it has
> >>>> more chances to fail so we try to do our best and check as much as we
> >>>> can before proceeding.
> >>>>
> >>>> This separates a guest view table update from validation. No change in
> >>>> behavior is expected.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>>>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..9a7b7fca5e84 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		if (get_user(tce, tces + i)) {
> >>>> +			ret = H_TOO_HARD;
> >>>> +			goto unlock_exit;
> >>>> +		}
> >>>> +		tce = be64_to_cpu(tce);
> >>>
> >>> This doesn't look safe.  The contents of user memory could change
> >>> between the two get_user()s, meaning that you're no longer guaranteed
> >>> a TCE loaded into kernel has been validated at all.
> >>>
> >>> I think you need to either:
> >>>
> >>>     a) Make sure things safe against a bad TCE being loaded into a TCE
> >>>     table and move all validation to where the TCE is used, rather
> >>>     than loaded
> >>>
> >>> or
> >>>     b) Copy the whole set of indirect entries to a temporary in-kernel
> >>>        buffer, then validate, then load into the actual TCE table.
> >>
> >>
> >> Correct :( The problem is I do not know how far I want to go in reverting
> >> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> >>
> >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> >> address.
> >>
> >>
> >> To do a) I'll need to remember old content of each hardware table entry as
> >> when I reach TCE#100, I'll need to revert to the initial state which means
> >> I need to write back old TCEs to all affected hardware tables and update
> >> reference counters of all affected preregistered areas. Well, the actual
> >> tables must not have different addresses (BUG_ON? is it worth testing while
> >> writing to hardware tables that values I am replacing are the same in all
> >> tables?) so I can have just a single array of old TCEs from hardware tables
> >> in vcpu.
> > 
> > I thought you said shared tables were disabled, so the two tables
> > would have different addresses?
> 
> That would be 2 physically separated tables but the content would be the
> same as long as they belong to the same VFIO container.

Ok.  I thought you were talking about the address of the TCE tables
being the same above.  Did you mean the address of an individual page
mapped in the TCE table?

> > Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> > only if the guest/qemu does something wrong, or can it fail for other
> > reasons? 
> 
> This should always just work.

Ok, given that, just replacing HPAs we can't translate with a clear
entry seems fine to me.

> > What about in real mode vs. virtual mode?
> 
> Real mode is no different in this matter.
> 
> Real mode is different from virtual mode in 3 aspects:
> 
> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
> inhibited writes to invalidate "TCE kill" cache;
> 
> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rct() because of
> lockdep does not work in real mode properly;
> 
> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
> addresses directly. Not expected to fail.
> 
> This is a full list.

Ok.

> > I think the key to this approach will be to think carefully about what
> > semantics you guarantee for mappings shadowed into the hardware
> > tables.  For example, it might work to specify that the host mappings
> > only match the GPA mappings if those GPA mapings are valid in the
> > first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> > to update the view of the table in terms of GPA.  But when you shadow
> > those into the HPA tables, any entries which can't be translated you
> > just replace with a cleared entry.
> 
> Literally with zero? Silently? WARN_ON_ONCE?

Well, with a no-permission TCE, which might as well be zero, yes.

WARN_ON_ONCE() is probably a good idea.

> > That should be enough to protect
> > the host.  Obviously you can expect the device to fail when you
> > actually attempt to DMA there, but that's the guest's (or qemu's) own
> > fault for putting bad addresses in the TCE table.
> > 
> > Obviously that might not be great for debugging, since mappings will
> > appear to succeed, but then not work later on.
> > 
> > This does have the nice property that it's reasonably obvious what to
> > do if you have some GPA mappings for emulated devices, then hotplug a
> > VFIO device and at that point hit a gpa->hpa translation error.
> > There's no hcall in this case, so there's no obvious way to return an
> > error to the guest.
> 
> Right. So if I do this, you would probably even ack this? :)

Assuming I don't spot some other showstopper...

Oh.. one thing to make sure you think about though: what happens if a
guest makes some mappings, then there's a memory hotplug event which
changes the set of valid GPAs?  In particular what if you hot unplug
some memory which is mapped in a guest TCE table?  You might have to
regenerate the HPA tables from the GPA table on hot unplug (unless you
have a way of locking out an unplug event while that piece of guest
ram is TCE mapped).

> >> To do b) I'll need:
> >>
> >> 1. to have a copy of TCEs from the guest in vcpu,
> > 
> > I don't quite understand this.  You need a temporary copy, yes, but I
> > don't see why it needs to be attached to the vcpu.
> 
> It does not need, I just need a safe + static + lock-free place for it as I
> do not want to do malloc() in the TCE handlers and (in theory) multiple
> CPUs can do concurrent TCE requests and I want to avoid locking especially
> in realmode.

Ah, right, it's the inability to malloc() that's the difficulty.  You
could put it in the vcpu, or you could use a per-(host)-cpu area - you
can't switch guests while in a realmode handler.
> 
> 
> >> I populate it via
> >> get_user() to make sure they won't change;
> >> 2. an array of userspace addresses translated from given TCEs; and in order
> >> to make sure these addresses won't go away, I'll need to reference each
> >> preregistered memory area via mm_iommu_mapped_inc().
> >>
> >> When I reach TCE#100, I'll have to revert the change, i.e. call
> >> mm_iommu_mapped_dec().
> > 
> > Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
> > the updated translations into a temp buffer.  Then you'd to make more
> > temp buffers to store the UA and HPA translations (although maybe you
> > could overwrite/reuse the original temp buffer if you're careful).
> > Then only if all of those succeed do you copy them into the real
> > hardware tables.
> > 
> > Which sounds like it might be kinda messy, at least in real mode.
> 
> So is it worth it?

Option (a) is certainly looking better to me based on current
information.

> >> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> >>
> >>
> >> Or I can do simpler version of b) which would store guest TCEs in
> >> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> >> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
> >> request, some preregistered regions will stay referenced till the guest is
> >> killed or rebooted (and this will prevent memory from unregistering) - but
> >> this means no harm to the host;
> > 
> > Hrm.. that's not really true.  It's not the worst thing that can
> > happen, but allowing the guest to permanently lock extra chunks of
> > memory is a form of harm to the host.
> 
> 
> These are the same preregistered chunks which are already locked. And the
> lock is there till QEMU process is dead. What will not be possible is
> memory hotunplug.

Ah, ok, I see your point.  That's probably sufficient, but option (a)
is still looking better.

> >> and with preregistered RAM, there is no
> >> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> >>
> >>
> >>
> >> Which approach to pick?
> >>
> >>
> >> LoPAPR says:
> >> ===
> >> If the TCE parameter represents the logical page address of a page that is
> >> not valid for the calling partition, return
> >> H_Parameter.
> >> ===
> >>
> >>
> >>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 918af76ab2b6..f8a54b7c788e 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret = H_SUCCESS;
> >>>> -	unsigned long tces, entry, ua = 0;
> >>>> +	unsigned long tces, entry, tce, ua = 0;
> >>>>  	unsigned long *rmap = NULL;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>  
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>
> >>> Same problem here.
> >>>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

WARNING: multiple messages have this Message-ID (diff)

From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: Paul Mackerras <paulus@samba.org>,
	Alex Williamson <alex.williamson@redhat.com>,
	linuxppc-dev@lists.ozlabs.org, kvm@vger.kernel.org,
	kvm-ppc@vger.kernel.org
Subject: Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
Date: Fri, 10 Feb 2017 15:50:54 +1100	[thread overview]
Message-ID: <20170210045054.GC25381@umbus> (raw)
In-Reply-To: <55f38121-2cb1-9641-d837-a2facf6e9448@ozlabs.ru>

[-- Attachment #1: Type: text/plain, Size: 10993 bytes --]

On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
> On 10/02/17 14:07, David Gibson wrote:
> > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> >> On 09/02/17 14:51, David Gibson wrote:
> >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >>>> For the emulated devices it does not matter much if we get a broken TCE
> >>>> half way handling a TCE list but for VFIO it will matter as it has
> >>>> more chances to fail so we try to do our best and check as much as we
> >>>> can before proceeding.
> >>>>
> >>>> This separates a guest view table update from validation. No change in
> >>>> behavior is expected.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>>  arch/powerpc/kvm/book3s_64_vio.c    | 8 ++++++++
> >>>>  arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>>>  2 files changed, 14 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..9a7b7fca5e84 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		if (get_user(tce, tces + i)) {
> >>>> +			ret = H_TOO_HARD;
> >>>> +			goto unlock_exit;
> >>>> +		}
> >>>> +		tce = be64_to_cpu(tce);
> >>>
> >>> This doesn't look safe.  The contents of user memory could change
> >>> between the two get_user()s, meaning that you're no longer guaranteed
> >>> a TCE loaded into kernel has been validated at all.
> >>>
> >>> I think you need to either:
> >>>
> >>>     a) Make sure things safe against a bad TCE being loaded into a TCE
> >>>     table and move all validation to where the TCE is used, rather
> >>>     than loaded
> >>>
> >>> or
> >>>     b) Copy the whole set of indirect entries to a temporary in-kernel
> >>>        buffer, then validate, then load into the actual TCE table.
> >>
> >>
> >> Correct :( The problem is I do not know how far I want to go in reverting
> >> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> >>
> >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> >> address.
> >>
> >>
> >> To do a) I'll need to remember old content of each hardware table entry as
> >> when I reach TCE#100, I'll need to revert to the initial state which means
> >> I need to write back old TCEs to all affected hardware tables and update
> >> reference counters of all affected preregistered areas. Well, the actual
> >> tables must not have different addresses (BUG_ON? is it worth testing while
> >> writing to hardware tables that values I am replacing are the same in all
> >> tables?) so I can have just a single array of old TCEs from hardware tables
> >> in vcpu.
> > 
> > I thought you said shared tables were disabled, so the two tables
> > would have different addresses?
> 
> That would be 2 physically separated tables but the content would be the
> same as long as they belong to the same VFIO container.

Ok.  I thought you were talking about the address of the TCE tables
being the same above.  Did you mean the address of an individual page
mapped in the TCE table?

> > Hmm.  Now I'm trying to remember, will the gpa->hpa translation fail
> > only if the guest/qemu does something wrong, or can it fail for other
> > reasons? 
> 
> This should always just work.

Ok, given that, just replacing HPAs we can't translate with a clear
entry seems fine to me.

> > What about in real mode vs. virtual mode?
> 
> Real mode is no different in this matter.
> 
> Real mode is different from virtual mode in 3 aspects:
> 
> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
> inhibited writes to invalidate "TCE kill" cache;
> 
> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rct() because of
> lockdep does not work in real mode properly;
> 
> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
> addresses directly. Not expected to fail.
> 
> This is a full list.

Ok.

> > I think the key to this approach will be to think carefully about what
> > semantics you guarantee for mappings shadowed into the hardware
> > tables.  For example, it might work to specify that the host mappings
> > only match the GPA mappings if those GPA mapings are valid in the
> > first place.  So, H_PUT_TCE etc. would succeed as long as they're able
> > to update the view of the table in terms of GPA.  But when you shadow
> > those into the HPA tables, any entries which can't be translated you
> > just replace with a cleared entry.
> 
> Literally with zero? Silently? WARN_ON_ONCE?

Well, with a no-permission TCE, which might as well be zero, yes.

WARN_ON_ONCE() is probably a good idea.

> > That should be enough to protect
> > the host.  Obviously you can expect the device to fail when you
> > actually attempt to DMA there, but that's the guest's (or qemu's) own
> > fault for putting bad addresses in the TCE table.
> > 
> > Obviously that might not be great for debugging, since mappings will
> > appear to succeed, but then not work later on.
> > 
> > This does have the nice property that it's reasonably obvious what to
> > do if you have some GPA mappings for emulated devices, then hotplug a
> > VFIO device and at that point hit a gpa->hpa translation error.
> > There's no hcall in this case, so there's no obvious way to return an
> > error to the guest.
> 
> Right. So if I do this, you would probably even ack this? :)

Assuming I don't spot some other showstopper...

Oh.. one thing to make sure you think about though: what happens if a
guest makes some mappings, then there's a memory hotplug event which
changes the set of valid GPAs?  In particular what if you hot unplug
some memory which is mapped in a guest TCE table?  You might have to
regenerate the HPA tables from the GPA table on hot unplug (unless you
have a way of locking out an unplug event while that piece of guest
ram is TCE mapped).

> >> To do b) I'll need:
> >>
> >> 1. to have a copy of TCEs from the guest in vcpu,
> > 
> > I don't quite understand this.  You need a temporary copy, yes, but I
> > don't see why it needs to be attached to the vcpu.
> 
> It does not need, I just need a safe + static + lock-free place for it as I
> do not want to do malloc() in the TCE handlers and (in theory) multiple
> CPUs can do concurrent TCE requests and I want to avoid locking especially
> in realmode.

Ah, right, it's the inability to malloc() that's the difficulty.  You
could put it in the vcpu, or you could use a per-(host)-cpu area - you
can't switch guests while in a realmode handler.
> 
> 
> >> I populate it via
> >> get_user() to make sure they won't change;
> >> 2. an array of userspace addresses translated from given TCEs; and in order
> >> to make sure these addresses won't go away, I'll need to reference each
> >> preregistered memory area via mm_iommu_mapped_inc().
> >>
> >> When I reach TCE#100, I'll have to revert the change, i.e. call
> >> mm_iommu_mapped_dec().
> > 
> > Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
> > the updated translations into a temp buffer.  Then you'd to make more
> > temp buffers to store the UA and HPA translations (although maybe you
> > could overwrite/reuse the original temp buffer if you're careful).
> > Then only if all of those succeed do you copy them into the real
> > hardware tables.
> > 
> > Which sounds like it might be kinda messy, at least in real mode.
> 
> So is it worth it?

Option (a) is certainly looking better to me based on current
information.

> >> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> >>
> >>
> >> Or I can do simpler version of b) which would store guest TCEs in
> >> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> >> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
> >> request, some preregistered regions will stay referenced till the guest is
> >> killed or rebooted (and this will prevent memory from unregistering) - but
> >> this means no harm to the host;
> > 
> > Hrm.. that's not really true.  It's not the worst thing that can
> > happen, but allowing the guest to permanently lock extra chunks of
> > memory is a form of harm to the host.
> 
> 
> These are the same preregistered chunks which are already locked. And the
> lock is there till QEMU process is dead. What will not be possible is
> memory hotunplug.

Ah, ok, I see your point.  That's probably sufficient, but option (a)
is still looking better.

> >> and with preregistered RAM, there is no
> >> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> >>
> >>
> >>
> >> Which approach to pick?
> >>
> >>
> >> LoPAPR says:
> >> ===
> >> If the TCE parameter represents the logical page address of a page that is
> >> not valid for the calling partition, return
> >> H_Parameter.
> >> ===
> >>
> >>
> >>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 918af76ab2b6..f8a54b7c788e 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  {
> >>>>  	struct kvmppc_spapr_tce_table *stt;
> >>>>  	long i, ret = H_SUCCESS;
> >>>> -	unsigned long tces, entry, ua = 0;
> >>>> +	unsigned long tces, entry, tce, ua = 0;
> >>>>  	unsigned long *rmap = NULL;
> >>>>  
> >>>>  	stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>>  	}
> >>>>  
> >>>>  	for (i = 0; i < npages; ++i) {
> >>>> -		unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>  
> >>>>  		ret = kvmppc_tce_validate(stt, tce);
> >>>>  		if (ret != H_SUCCESS)
> >>>>  			goto unlock_exit;
> >>>> +	}
> >>>> +
> >>>> +	for (i = 0; i < npages; ++i) {
> >>>> +		tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>
> >>> Same problem here.
> >>>
> >>>>  
> >>>>  		kvmppc_tce_put(stt, entry + i, tce);
> >>>>  	}
> >>>
> >>
> >>
> > 
> > 
> > 
> > 
> 
> 




-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

next prev parent reply	other threads:[~2017-02-10  4:50 UTC|newest]

Thread overview: 49+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-02-07  7:17 [PATCH kernel v4 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
2017-02-07  7:17 ` Alexey Kardashevskiy
2017-02-07  7:17 ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  3:51   ` David Gibson
2017-02-09  3:51     ` David Gibson
2017-02-09  3:51     ` David Gibson
2017-02-09  8:20     ` Alexey Kardashevskiy
2017-02-09  8:20       ` Alexey Kardashevskiy
2017-02-10  3:07       ` David Gibson
2017-02-10  3:07         ` David Gibson
2017-02-10  4:09         ` Alexey Kardashevskiy
2017-02-10  4:09           ` Alexey Kardashevskiy
2017-02-10  4:50           ` David Gibson [this message]
2017-02-10  4:50             ` David Gibson
2017-02-10  4:50             ` David Gibson
2017-02-10  7:58             ` Alexey Kardashevskiy
2017-02-10  7:58               ` Alexey Kardashevskiy
2017-02-07  7:17 ` [PATCH kernel v4 09/10] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  4:00   ` David Gibson
2017-02-09  4:00     ` David Gibson
2017-02-09  4:00     ` David Gibson
2017-02-07  7:17 ` [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
2017-02-07  7:17   ` Alexey Kardashevskiy
2017-02-09  6:41   ` David Gibson
2017-02-09  6:41     ` David Gibson
2017-02-09  6:41     ` David Gibson
2017-02-10  2:50     ` Alexey Kardashevskiy
2017-02-10  2:50       ` Alexey Kardashevskiy
2017-02-10  2:50       ` Alexey Kardashevskiy
2017-02-10  4:02       ` David Gibson
2017-02-10  4:02         ` David Gibson
2017-02-10  4:02         ` David Gibson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170210045054.GC25381@umbus \
    --to=david@gibson.dropbear.id.au \
    --cc=aik@ozlabs.ru \
    --cc=alex.williamson@redhat.com \
    --cc=kvm-ppc@vger.kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=paulus@samba.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.