From: David Gibson <david@gibson.dropbear.id.au>
To: Alexey Kardashevskiy <aik@ozlabs.ru>
Cc: linuxppc-dev@lists.ozlabs.org,
Alex Williamson <alex.williamson@redhat.com>,
Paul Mackerras <paulus@samba.org>,
kvm-ppc@vger.kernel.org, kvm@vger.kernel.org
Subject: Re: [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update
Date: Fri, 10 Feb 2017 15:50:54 +1100 [thread overview]
Message-ID: <20170210045054.GC25381@umbus> (raw)
In-Reply-To: <55f38121-2cb1-9641-d837-a2facf6e9448@ozlabs.ru>
[-- Attachment #1: Type: text/plain, Size: 10993 bytes --]
On Fri, Feb 10, 2017 at 03:09:30PM +1100, Alexey Kardashevskiy wrote:
> On 10/02/17 14:07, David Gibson wrote:
> > On Thu, Feb 09, 2017 at 07:20:11PM +1100, Alexey Kardashevskiy wrote:
> >> On 09/02/17 14:51, David Gibson wrote:
> >>> On Tue, Feb 07, 2017 at 06:17:09PM +1100, Alexey Kardashevskiy wrote:
> >>>> For the emulated devices it does not matter much if we get a broken TCE
> >>>> half way handling a TCE list but for VFIO it will matter as it has
> >>>> more chances to fail so we try to do our best and check as much as we
> >>>> can before proceeding.
> >>>>
> >>>> This separates a guest view table update from validation. No change in
> >>>> behavior is expected.
> >>>>
> >>>> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> >>>> ---
> >>>> arch/powerpc/kvm/book3s_64_vio.c | 8 ++++++++
> >>>> arch/powerpc/kvm/book3s_64_vio_hv.c | 8 ++++++--
> >>>> 2 files changed, 14 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio.c b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> index 15df8ae627d9..9a7b7fca5e84 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio.c
> >>>> @@ -282,6 +282,14 @@ long kvmppc_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>> ret = kvmppc_tce_validate(stt, tce);
> >>>> if (ret != H_SUCCESS)
> >>>> goto unlock_exit;
> >>>> + }
> >>>> +
> >>>> + for (i = 0; i < npages; ++i) {
> >>>> + if (get_user(tce, tces + i)) {
> >>>> + ret = H_TOO_HARD;
> >>>> + goto unlock_exit;
> >>>> + }
> >>>> + tce = be64_to_cpu(tce);
> >>>
> >>> This doesn't look safe. The contents of user memory could change
> >>> between the two get_user()s, meaning that you're no longer guaranteed
> >>> a TCE loaded into kernel has been validated at all.
> >>>
> >>> I think you need to either:
> >>>
> >>> a) Make sure things safe against a bad TCE being loaded into a TCE
> >>> table and move all validation to where the TCE is used, rather
> >>> than loaded
> >>>
> >>> or
> >>> b) Copy the whole set of indirect entries to a temporary in-kernel
> >>> buffer, then validate, then load into the actual TCE table.
> >>
> >>
> >> Correct :( The problem is I do not know how far I want to go in reverting
> >> the state as it was when I started handling H_PUT_TCE_INDIRECT.
> >>
> >> For example, 1 container, 2 IOMMU groups with disabled shared tables, so -
> >> 2 tables, 512 TCEs request and TCE#100 does not translate to host physical
> >> address.
> >>
> >>
> >> To do a) I'll need to remember old content of each hardware table entry as
> >> when I reach TCE#100, I'll need to revert to the initial state which means
> >> I need to write back old TCEs to all affected hardware tables and update
> >> reference counters of all affected preregistered areas. Well, the actual
> >> tables must not have different addresses (BUG_ON? is it worth testing while
> >> writing to hardware tables that values I am replacing are the same in all
> >> tables?) so I can have just a single array of old TCEs from hardware tables
> >> in vcpu.
> >
> > I thought you said shared tables were disabled, so the two tables
> > would have different addresses?
>
> That would be 2 physically separated tables but the content would be the
> same as long as they belong to the same VFIO container.
Ok. I thought you were talking about the address of the TCE tables
being the same above. Did you mean the address of an individual page
mapped in the TCE table?
> > Hmm. Now I'm trying to remember, will the gpa->hpa translation fail
> > only if the guest/qemu does something wrong, or can it fail for other
> > reasons?
>
> This should always just work.
Ok, given that, just replacing HPAs we can't translate with a clear
entry seems fine to me.
> > What about in real mode vs. virtual mode?
>
> Real mode is no different in this matter.
>
> Real mode is different from virtual mode in 3 aspects:
>
> 1. iommu_table_ops::exchange() vs. exchange_rm() as real mode uses cache
> inhibited writes to invalidate "TCE kill" cache;
>
> 2. list_for_each_entry_lockless() vs. list_for_each_entry_rct() because of
> lockdep does not work in real mode properly;
>
> 3. real mode uses vmalloc_to_phys() while virtual mode can access vmalloc'd
> addresses directly. Not expected to fail.
>
> This is a full list.
Ok.
> > I think the key to this approach will be to think carefully about what
> > semantics you guarantee for mappings shadowed into the hardware
> > tables. For example, it might work to specify that the host mappings
> > only match the GPA mappings if those GPA mapings are valid in the
> > first place. So, H_PUT_TCE etc. would succeed as long as they're able
> > to update the view of the table in terms of GPA. But when you shadow
> > those into the HPA tables, any entries which can't be translated you
> > just replace with a cleared entry.
>
> Literally with zero? Silently? WARN_ON_ONCE?
Well, with a no-permission TCE, which might as well be zero, yes.
WARN_ON_ONCE() is probably a good idea.
> > That should be enough to protect
> > the host. Obviously you can expect the device to fail when you
> > actually attempt to DMA there, but that's the guest's (or qemu's) own
> > fault for putting bad addresses in the TCE table.
> >
> > Obviously that might not be great for debugging, since mappings will
> > appear to succeed, but then not work later on.
> >
> > This does have the nice property that it's reasonably obvious what to
> > do if you have some GPA mappings for emulated devices, then hotplug a
> > VFIO device and at that point hit a gpa->hpa translation error.
> > There's no hcall in this case, so there's no obvious way to return an
> > error to the guest.
>
> Right. So if I do this, you would probably even ack this? :)
Assuming I don't spot some other showstopper...
Oh.. one thing to make sure you think about though: what happens if a
guest makes some mappings, then there's a memory hotplug event which
changes the set of valid GPAs? In particular what if you hot unplug
some memory which is mapped in a guest TCE table? You might have to
regenerate the HPA tables from the GPA table on hot unplug (unless you
have a way of locking out an unplug event while that piece of guest
ram is TCE mapped).
> >> To do b) I'll need:
> >>
> >> 1. to have a copy of TCEs from the guest in vcpu,
> >
> > I don't quite understand this. You need a temporary copy, yes, but I
> > don't see why it needs to be attached to the vcpu.
>
> It does not need, I just need a safe + static + lock-free place for it as I
> do not want to do malloc() in the TCE handlers and (in theory) multiple
> CPUs can do concurrent TCE requests and I want to avoid locking especially
> in realmode.
Ah, right, it's the inability to malloc() that's the difficulty. You
could put it in the vcpu, or you could use a per-(host)-cpu area - you
can't switch guests while in a realmode handler.
>
>
> >> I populate it via
> >> get_user() to make sure they won't change;
> >> 2. an array of userspace addresses translated from given TCEs; and in order
> >> to make sure these addresses won't go away, I'll need to reference each
> >> preregistered memory area via mm_iommu_mapped_inc().
> >>
> >> When I reach TCE#100, I'll have to revert the change, i.e. call
> >> mm_iommu_mapped_dec().
> >
> > Ugh.. yeah, I think to do this sanely, what you'd have to do is copy
> > the updated translations into a temp buffer. Then you'd to make more
> > temp buffers to store the UA and HPA translations (although maybe you
> > could overwrite/reuse the original temp buffer if you're careful).
> > Then only if all of those succeed do you copy them into the real
> > hardware tables.
> >
> > Which sounds like it might be kinda messy, at least in real mode.
>
> So is it worth it?
Option (a) is certainly looking better to me based on current
information.
> >> So I will end up having 2 arrays in a vcpu and simpler reverting code.
> >>
> >>
> >> Or I can do simpler version of b) which would store guest TCEs in
> >> kvm_vcpu_arch::tces[512] and use them after checking. If a malicious guest
> >> does something bad and I return from H_PUT_TCE_INDIRECT in a middle of
> >> request, some preregistered regions will stay referenced till the guest is
> >> killed or rebooted (and this will prevent memory from unregistering) - but
> >> this means no harm to the host;
> >
> > Hrm.. that's not really true. It's not the worst thing that can
> > happen, but allowing the guest to permanently lock extra chunks of
> > memory is a form of harm to the host.
>
>
> These are the same preregistered chunks which are already locked. And the
> lock is there till QEMU process is dead. What will not be possible is
> memory hotunplug.
Ah, ok, I see your point. That's probably sufficient, but option (a)
is still looking better.
> >> and with preregistered RAM, there is no
> >> valid reason for H_PUT_TCE_INDIRECT to fail for a good guest.
> >>
> >>
> >>
> >> Which approach to pick?
> >>
> >>
> >> LoPAPR says:
> >> ===
> >> If the TCE parameter represents the logical page address of a page that is
> >> not valid for the calling partition, return
> >> H_Parameter.
> >> ===
> >>
> >>
> >>
> >>>>
> >>>> kvmppc_tce_put(stt, entry + i, tce);
> >>>> }
> >>>> diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> index 918af76ab2b6..f8a54b7c788e 100644
> >>>> --- a/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> +++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
> >>>> @@ -237,7 +237,7 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>> {
> >>>> struct kvmppc_spapr_tce_table *stt;
> >>>> long i, ret = H_SUCCESS;
> >>>> - unsigned long tces, entry, ua = 0;
> >>>> + unsigned long tces, entry, tce, ua = 0;
> >>>> unsigned long *rmap = NULL;
> >>>>
> >>>> stt = kvmppc_find_table(vcpu->kvm, liobn);
> >>>> @@ -279,11 +279,15 @@ long kvmppc_rm_h_put_tce_indirect(struct kvm_vcpu *vcpu,
> >>>> }
> >>>>
> >>>> for (i = 0; i < npages; ++i) {
> >>>> - unsigned long tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>> + tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>>
> >>>> ret = kvmppc_tce_validate(stt, tce);
> >>>> if (ret != H_SUCCESS)
> >>>> goto unlock_exit;
> >>>> + }
> >>>> +
> >>>> + for (i = 0; i < npages; ++i) {
> >>>> + tce = be64_to_cpu(((u64 *)tces)[i]);
> >>>
> >>> Same problem here.
> >>>
> >>>>
> >>>> kvmppc_tce_put(stt, entry + i, tce);
> >>>> }
> >>>
> >>
> >>
> >
> >
> >
> >
>
>
--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]
next prev parent reply other threads:[~2017-02-10 4:51 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-02-07 7:17 [PATCH kernel v4 00/10] powerpc/kvm/vfio: Enable in-kernel acceleration Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 01/10] powerpc/mmu: Add real mode support for IOMMU preregistered memory Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 02/10] powerpc/powernv/iommu: Add real mode version of iommu_table_ops::exchange() Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 03/10] powerpc/iommu/vfio_spapr_tce: Cleanup iommu_table disposal Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 04/10] powerpc/vfio_spapr_tce: Add reference counting to iommu_table Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 05/10] KVM: PPC: Reserve KVM_CAP_SPAPR_TCE_VFIO capability number Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 06/10] KVM: PPC: Enable IOMMU_API for KVM_BOOK3S_64 permanently Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 07/10] KVM: PPC: Pass kvm* to kvmppc_find_table() Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 08/10] KVM: PPC: Separate TCE validation from update Alexey Kardashevskiy
2017-02-09 3:51 ` David Gibson
2017-02-09 8:20 ` Alexey Kardashevskiy
2017-02-10 3:07 ` David Gibson
2017-02-10 4:09 ` Alexey Kardashevskiy
2017-02-10 4:50 ` David Gibson [this message]
2017-02-10 7:58 ` Alexey Kardashevskiy
2017-02-07 7:17 ` [PATCH kernel v4 09/10] KVM: PPC: Use preregistered memory API to access TCE list Alexey Kardashevskiy
2017-02-09 4:00 ` David Gibson
2017-02-07 7:17 ` [PATCH kernel v4 10/10] KVM: PPC: VFIO: Add in-kernel acceleration for VFIO Alexey Kardashevskiy
2017-02-09 6:41 ` David Gibson
2017-02-10 2:50 ` Alexey Kardashevskiy
2017-02-10 4:02 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20170210045054.GC25381@umbus \
--to=david@gibson.dropbear.id.au \
--cc=aik@ozlabs.ru \
--cc=alex.williamson@redhat.com \
--cc=kvm-ppc@vger.kernel.org \
--cc=kvm@vger.kernel.org \
--cc=linuxppc-dev@lists.ozlabs.org \
--cc=paulus@samba.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).