From mboxrd@z Thu Jan  1 00:00:00 1970
From: Radim =?utf-8?B?S3LEjW3DocWZ?= <rkrcmar@redhat.com>
Subject: Re: [PATCH v4 2/2] KVM: move vcpu id checking to archs
Date: Fri, 22 Apr 2016 15:40:30 +0200
Message-ID: <20160422134029.GE25335@potion>
References: <146124809455.32509.15232948272580716135.stgit@bahia.huguette.org>
 <146124811255.32509.17679765789502091772.stgit@bahia.huguette.org>
 <20160421160018.GA31953@potion>
 <20160421184500.6cb5fd8a@bahia.huguette.org>
 <20160421173611.GB30356@potion>
 <20160422112538.41b23a9d@bahia.huguette.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: QUOTED-PRINTABLE
Cc: Paolo Bonzini <pbonzini@redhat.com>, james.hogan@imgtec.com,
	mingo@redhat.com, linux-mips@linux-mips.org, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	David Hildenbrand <dahi@linux.vnet.ibm.com>,
	qemu-ppc@nongnu.org, Cornelia Huck <cornelia.huck@de.ibm.com>,
	Paul Mackerras <paulus@samba.org>,
	David Gibson <david@gibson.dropbear.id.au>
To: Greg Kurz <gkurz@linux.vnet.ibm.com>
Return-path: <linux-kernel-owner@vger.kernel.org>
Content-Disposition: inline
In-Reply-To: <20160422112538.41b23a9d@bahia.huguette.org>
Sender: linux-kernel-owner@vger.kernel.org
List-Id: kvm.vger.kernel.org

2016-04-22 11:25+0200, Greg Kurz:
> Hi Radim !
>=20
> On Thu, 21 Apr 2016 19:36:11 +0200
> Radim Kr=C4=8Dm=C3=A1=C5=99 <rkrcmar@redhat.com> wrote:
>=20
> > 2016-04-21 18:45+0200, Greg Kurz:
> > > On Thu, 21 Apr 2016 18:00:19 +0200
> > > Radim Kr=C4=8Dm=C3=A1=C5=99 <rkrcmar@redhat.com> wrote: =20
> > >> 2016-04-21 16:20+0200, Greg Kurz: =20
> > >> > Commit 338c7dbadd26 ("KVM: Improve create VCPU parameter (CVE-=
2013-4587)")
> > >> > introduced a check to prevent potential kernel memory corrupti=
on in case
> > >> > the vcpu id is too great.
> > >> >=20
> > >> > Unfortunately this check assumes vcpu ids grow in sequence wit=
h a common
> > >> > difference of 1, which is wrong: archs are free to use vcpu id=
 as they fit.
> > >> > For example, QEMU originated vcpu ids for PowerPC cpus running=
 in boot3s_hv
> > >> > mode, can grow with a common difference of 2, 4 or 8: if KVM_M=
AX_VCPUS is
> > >> > 1024, guests may be limited down to 128 vcpus on POWER8.
> > >> >=20
> > >> > This means the check does not belong here and should be moved =
to some arch
> > >> > specific function: kvm_arch_vcpu_create() looks like a good ca=
ndidate.
> > >> >=20
> > >> > ARM and s390 already have such a check.
> > >> >=20
> > >> > I could not spot any path in the PowerPC or common KVM code wh=
ere a vcpu
> > >> > id is used as described in the above commit: I believe PowerPC=
 can live
> > >> > without this check.
> > >> >=20
> > >> > In the end, this patch simply moves the check to MIPS and x86.=
 While here,
> > >> > we also update the documentation to dissociate vcpu ids from t=
he maximum
> > >> > number of vcpus per virtual machine.
> > >> >=20
> > >> > Acked-by: James Hogan <james.hogan@imgtec.com>
> > >> > Acked-by: Cornelia Huck <cornelia.huck@de.ibm.com>
> > >> > Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
> > >> > ---
> > >> > v4: - updated subject for more clarity on what the patch does
> > >> >     - added James's and Connie's A-b tags
> > >> >     - updated documentation
> > >> >=20
> > >> >  Documentation/virtual/kvm/api.txt |    7 +++----
> > >> >  arch/mips/kvm/mips.c              |    7 ++++++-
> > >> >  arch/x86/kvm/x86.c                |    3 +++
> > >> >  virt/kvm/kvm_main.c               |    3 ---
> > >> >  4 files changed, 12 insertions(+), 8 deletions(-)
> > >> >=20
> > >> > diff --git a/Documentation/virtual/kvm/api.txt b/Documentation=
/virtual/kvm/api.txt
> > >> > index 4d0542c5206b..486a1d783b82 100644
> > >> > --- a/Documentation/virtual/kvm/api.txt
> > >> > +++ b/Documentation/virtual/kvm/api.txt
> > >> > @@ -199,11 +199,10 @@ Type: vm ioctl
> > >> >  Parameters: vcpu id (apic id on x86)
> > >> >  Returns: vcpu fd on success, -1 on error
> > >> > =20
> > >> > -This API adds a vcpu to a virtual machine.  The vcpu id is a =
small integer
> > >> > -in the range [0, max_vcpus).
> > >> > +This API adds a vcpu to a virtual machine.  The vcpu id is a =
positive integer.   =20
> > >>=20
> > >> Userspace won't be able to tell if KVM_CREATE_VCPU failed becaus=
e it
> > >> provided too high vcpu_id to an old KVM or because new KVM faile=
d in
> > >> other areas.  Not a huge problem (because I expect that userspac=
e will
> > >> die on both), but a new KVM_CAP would be able to disambiguate it=
=2E
> > >>=20
> > >> Toggleable capability doesn't seem necessary and only PowerPC ch=
anges,
> > >> so the capability could be arch specific ... I think that a gene=
ric one
> > >> makes more sense, though.
> > >> =20
> > >=20
> > > I'm not sure userspace can disambiguate all the cases where KVM_C=
REATE_VCPU
> > > returns EINVAL already... and, FWIW, QEMU simply exits if it gets=
 an error. =20
> >=20
> > Yes, userspace cannot disambiguate, but would have the option of no=
t
> > doing something that is destined to fail, like with KVM_CAP_MAX_VCP=
U.
> >=20
>=20
> It makes sense indeed.
>=20
>> > So I understand your concern but would we have a user for this ? =20
>>=20
>> I think so, new userspace on pre-patch KVM is the most likely one.
>>=20
>> Userspace cannot tell that KVM doesn't support the extension and
>> behaving like on patched KVM would result in a failure with cryptic
>> error message, because KVM only returns EINVAL.
>>=20
>=20
> This is already the case with or without the patch... which only chan=
ges
> things for PowerPC userspace.

I guess that the error message from QEMU should be improved then ...
The spec is quite clear that KVM is going to fail because of invalid
vcpu_id.

x86 QEMU does this when the amount of CPUs doesn't fit APIC constraints=
:
  # qemu-kvm -smp 160,cores=3D9
  qemu-system-x86_64: max_cpus is too large. APIC ID of last CPU is 278

Not perfectly understandable to the uninitiated, but it gets the point
across.

>                               And in the case of QEMU, we're already
> violating the spec with the way we compute vcpu ids.

The numbering strategy is valid.  KVM spec never said that numbering
cannot be sparse/arbitrary, just that vcpu_id must not exceed MAX_VCPUS=
=2E

>> Btw. PowerPC QEMU tries vcpu_id >=3D KVM_MAX_VCPUS and fails, instea=
d of
>> recognizing that the user wanted too much?
>>=20
>=20
> No. The error is caught in generic code and QEMU exits for all archs.
>=20
> And BTW, how would QEMU guess that vcpu id is too high ? I see at
> least three paths that can return EINVAL...

I agree, -EINVAL is the problem. :)
(Returning -ERANGE would have made more sense.)

The userspace has to guess if it tries fails, but if it was aware of th=
e
actual limit, it could assume that the creation failed because of high
vcpu_id and report it as a user-amendable error ...
Userspace would better do some sanity checks beforehand instead of
trying and failing, though.

>> >> Userspace also doesn't know the vcpu id limit anymore, and it mig=
ht
>> >> care.  What do you think about returning the arch-specific limit =
(or the
>> >> highest positive integer) as int in KVM_CAP_MAX_VCPU_ID?
>> >>  =20
>> >=20
>> > This is partly true: only arch agnostic code would be lost.
>> >=20
>> > Moreover this is a problem for powerpc only at the moment and user=
space code
>> > can compute the vcpu_id limit out of KVM_CAP_MAX_VCPUS and KVM_CAP=
_PPC_SMT. =20
>>=20
>> How would that work on KVM without this patch?
>>=20
>=20
> It doesn't work for PowerPC :)
>=20
> KVM_CAP_MAX_VCPUS indicates we can can start, say, 1024 vcpus and
> KVM_CAP_PPC_SMT indicates the host has 8 threads per core.
>=20
> KVM_CREATE_VCPU returns EINVAL when we start the 128th one because
> it has vcpu_id =3D=3D 128 * 8 =3D=3D 1024.
>=20
> Of course we can patch QEMU to restrict the maximum number of vcpus
> to MAX_VCPUS / PPC_SMT for PowerPC, but it would be infortunate since
> KVM for PowerPC is sized to run MAX_VCPUS... :\

Yeah.  If we introduced the capability, then QEMU would restrict vcpu_i=
d
to MAX_VCPUS or MAX_VCPU_ID and the number of VCPUS always to MAX_VCPUS=
=2E

>> > For other architectures, it is simply KVM_MAX_VCPUS. =20
>>=20
>> (Other architectures would not implement the capability.)
>>=20
>=20
> So this would be KVM_CAP_PPC_MAX_VCPU_ID ?

No, need.  The capability could be generic, because the concept is
arch-neutral (and it looks like x86 will use it too).

Unimplemented capabilities return 0, which is why other architectures
don't need to implement it.  Nothing can break:  userspace on old KVM
will get 0 for the capability, so it's going to behave as if we didn't
have this patch and this patch made sure that nothing actually changed
for other architectures.

>> >                                                                   =
   maybe later
>> > if we have other scenarios where vcpu ids need to cross the limit =
? =20
>>=20
>> x86 is going to have that soon too -- vcpu_id will be able to range =
from
>> 0 to 2^32-1 (or 2^31), but MAX_CPUS related data structures probably
>> won't be improved to actually scale, so MAX_CPUS will remain lower.
>>=20
>=20
> Do you have some pointers to share so that we can see the broader pic=
ture ?

At the moment, the most concrete one is x2APIC spec, sorry.

KVM currently supports only xAPIC, which has 8 bit APIC ID.
(The x2APIC interface that you might see is paravirtualized.)

vcpu_id is APIC ID on x86 and x2APIC still has sparse allocation that
encodes topology.  KVM is going to support standard x2APIC (when QEMU
supports interrupt remapping), which opens a possibility where we'd end
up in exactly the same situation where PowerPC is now -- architectural
code would handle vcpu_id bigger than MAX_CPUS, but userspace couldn't
make us of it, because of API limitations.